DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Extensive Reading

Author Info

Background

Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance.

Insights

Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits (a toy sketch of the split follows below): ...
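To make the disaggregation idea concrete, here is a minimal, CPU-only sketch, not DistServe's actual implementation: a prefill thread stands in for the prefill GPU instances, a decode thread for the decode instances, and a queue for the KV-cache transfer between them. All names (`Request`, `PrefillResult`, the sleep-based "compute") are hypothetical stand-ins. The point is only that, once the phases are separated, TTFT is determined solely by the prefill side and TPOT solely by the decode side, so each side can be batched, parallelized, and scaled against its own latency target without interfering with the other.

```python
import queue
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: int
    prompt_len: int
    max_new_tokens: int
    arrival: float = field(default_factory=time.time)


@dataclass
class PrefillResult:
    req_id: int
    max_new_tokens: int
    kv_cache: bytes      # stand-in for the per-layer KV tensors handed to the decode side
    ttft: float          # time to first token, measured when prefill finishes


def prefill_worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Stand-in for a prefill instance: compute-bound, one full prompt pass per request."""
    while True:
        req = in_q.get()
        if req is None:                       # shutdown signal, forwarded downstream
            out_q.put(None)
            return
        time.sleep(0.001 * req.prompt_len)    # pretend compute scales with prompt length
        out_q.put(PrefillResult(
            req_id=req.req_id,
            max_new_tokens=req.max_new_tokens,
            kv_cache=b"\x00" * req.prompt_len,
            ttft=time.time() - req.arrival,
        ))


def decode_worker(in_q: queue.Queue) -> None:
    """Stand-in for a decode instance: memory-bound, one token per step."""
    while True:
        pre = in_q.get()
        if pre is None:
            return
        start = time.time()
        for _ in range(pre.max_new_tokens):
            time.sleep(0.002)                 # pretend each decode step costs a fixed 2 ms
        tpot = (time.time() - start) / pre.max_new_tokens
        print(f"req {pre.req_id}: TTFT={pre.ttft * 1e3:.1f} ms, TPOT={tpot * 1e3:.1f} ms")


if __name__ == "__main__":
    prefill_q: queue.Queue = queue.Queue()
    kv_q: queue.Queue = queue.Queue()         # models the KV-cache transfer link between the pools
    threading.Thread(target=prefill_worker, args=(prefill_q, kv_q), daemon=True).start()
    decoder = threading.Thread(target=decode_worker, args=(kv_q,))
    decoder.start()
    for i, plen in enumerate([128, 512, 64]):
        prefill_q.put(Request(req_id=i, prompt_len=plen, max_new_tokens=8))
    prefill_q.put(None)                       # drain and stop both workers
    decoder.join()
```

The new cost this split introduces is the KV-cache handoff between the two pools (the queue above), which the real system has to keep cheap relative to a decoding step when placing prefill and decode instances.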