Splitwise: Efficient Generative LLM Inference Using Phase Splitting
Extensive Reading

Author Info

Background

Generative LLM inference is characterized by two distinct phases: a compute-intensive prompt computation (prefill) phase and a memory-intensive token generation (decode) phase, each with unique resource demands. Current systems run both phases on the same powerful, expensive GPUs, which is inefficient: the memory-bound token generation phase underutilizes the hardware's compute resources. This inefficiency is worsening as new GPUs (such as the H100) increase compute power much faster than memory bandwidth or capacity, leading to higher-than-necessary cost and power consumption for large-scale deployments.

Challenges

The memory-intensive token generation phase, which accounts for the majority of end-to-end latency, severely underutilizes the expensive compute resources of modern GPUs. Hardware trends exacerbate this: the H100 delivers a large compute gain (3.43x) but a much smaller increase in memory bandwidth (1.6x) and no increase in memory capacity, making it poorly suited to the memory-bound token phase. Running the two distinct phases on the same machine also causes inconsistent latencies and resource contention, forcing providers to over-provision expensive, power-hungry hardware to meet service level objectives (SLOs).

Insights

The prefill phase is compute-intensive, while the decode phase is memory-intensive. Decoding therefore does not need the compute capability of the latest GPUs and can run at lower power and cost (see the sketches at the end of these notes).

Approaches

Before presenting the concrete method, the paper devotes a large part of its length to characterizing LLM inference; these characteristics strongly shape the design of Splitwise ...
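To make the two-phase structure described in Background concrete, here is a minimal, illustrative sketch of prefill versus decode for a single toy attention layer. It is not from the paper; the shapes, the `softmax` helper, and the use of NumPy are all assumptions for illustration. The structural point is what matters: prefill attends over all prompt tokens in one large matmul, while each decode step performs a tiny amount of compute but must stream the entire (growing) KV cache from memory.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_emb: np.ndarray):
    """Prompt computation: attention over all T prompt tokens in one batched
    matmul, so compute scales with T*T*D and keeps the GPU's ALUs busy."""
    k_cache, v_cache = prompt_emb.copy(), prompt_emb.copy()
    scores = prompt_emb @ k_cache.T          # (T, T): large, compute-heavy
    out = softmax(scores) @ v_cache          # (T, D)
    return out[-1:], k_cache, v_cache        # last hidden state seeds decoding

def decode_step(x: np.ndarray, k_cache: np.ndarray, v_cache: np.ndarray):
    """Token generation: one token per step; compute is tiny, but the whole
    (growing) KV cache must be streamed from memory on every step."""
    k_cache = np.vstack([k_cache, x])
    v_cache = np.vstack([v_cache, x])
    scores = x @ k_cache.T                   # (1, T+1): little compute...
    out = softmax(scores) @ v_cache          # ...but reads the entire cache
    return out, k_cache, v_cache

# Toy run: a 16-token "prompt", then 4 generated "tokens".
rng = np.random.default_rng(0)
x, k, v = prefill(rng.standard_normal((16, 64)))
for _ in range(4):
    x, k, v = decode_step(x, k, v)
```

Real models add Q/K/V projections, many layers, and feed-forward blocks, but the asymmetry is the same: batching over prompt tokens is what makes prefill compute-bound, and the one-token-at-a-time loop is what makes decode memory-bound.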
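A back-of-the-envelope roofline check of the memory-bound claim in Insights: at batch size 1, each decode step reads every FP16 weight once while performing roughly two FLOPs per weight, so its arithmetic intensity (about 1 FLOP/byte) sits far below the compute/bandwidth balance point of either GPU. This sketch is mine, not the paper's; the peak FLOP/s and bandwidth figures are approximate public numbers, and the 70B model size is a hypothetical example.

```python
def decode_arithmetic_intensity(n_params: float, bytes_per_param: int = 2) -> float:
    """FLOPs per byte moved for one decode step at batch size 1
    (ignores the KV cache and activations, which only lower the ratio)."""
    flops = 2 * n_params                       # one multiply-accumulate per weight
    bytes_moved = n_params * bytes_per_param   # every FP16 weight streamed from HBM
    return flops / bytes_moved

# Approximate peak specs (assumed, not from the paper): (FLOP/s, HBM bytes/s).
gpus = {
    "A100": (312e12, 2.0e12),
    "H100": (989e12, 3.35e12),
}

ai = decode_arithmetic_intensity(70e9)  # hypothetical 70B-parameter model
print(f"decode arithmetic intensity ~ {ai:.1f} FLOPs/byte")
for name, (flops, bw) in gpus.items():
    balance = flops / bw  # balance point in FLOPs/byte
    regime = "memory-bound" if ai < balance else "compute-bound"
    print(f"{name}: balance point ~ {balance:.0f} FLOPs/byte -> decode is {regime}")
```

Under these assumptions both GPUs are bandwidth-limited during decode by two orders of magnitude, and the H100's extra compute raises its balance point further without a proportional bandwidth gain; this is the arithmetic behind running decode on cheaper, lower-power hardware.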