DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading · Author Info · Background: Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation creates severe prefill-decoding interference: long, compute-intensive prefill tasks block short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies of the two phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and suboptimal performance. Insights: Disaggregate the prefill and decoding phases of LLM inference and assign them to separate GPUs, which brings two benefits: ...
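
To make the disaggregation idea concrete, here is a minimal, hypothetical sketch (not DistServe's actual implementation) in which prefill and decoding run on disjoint GPU pools and a finished prefill hands its KV cache to a decoding instance; all class and method names are illustrative.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: int
    prompt: str
    generated: list = field(default_factory=list)


class DisaggregatedScheduler:
    """Toy scheduler: prefill and decoding run on disjoint GPU pools,
    so a long prefill batch never stalls in-flight decoding steps."""

    def __init__(self, prefill_gpus, decode_gpus):
        self.prefill_gpus = prefill_gpus    # e.g. ["gpu0", "gpu1"]
        self.decode_gpus = decode_gpus      # e.g. ["gpu2", "gpu3"]
        self.prefill_queue = deque()        # new requests wait here (affects TTFT)
        self.decode_queue = deque()         # ongoing generation lives here (affects TPOT)

    def submit(self, req: Request):
        self.prefill_queue.append(req)

    def step(self):
        # Prefill instance: compute-bound, processes the whole prompt once,
        # then hands the KV cache over to a decoding instance.
        if self.prefill_queue:
            req = self.prefill_queue.popleft()
            kv_cache = {"kv": f"cache-for-{req.req_id}"}   # placeholder for real prefill
            self.decode_queue.append((req, kv_cache))      # KV-cache transfer between pools
        # Decode instance: memory-bound, generates one token per step for its batch.
        if self.decode_queue:
            req, kv_cache = self.decode_queue.popleft()
            req.generated.append("<token>")                # placeholder for real decode step
```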

November 3, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Extensive Reading · Author Info · Background: Current LLM inference schedulers can be broadly classified into two categories. Prefill-prioritizing schedulers put throughput first: they let subsequent decodes operate at large batch sizes, but compromise on latency, since a prefill can take arbitrarily long depending on the lengths of the given prompts. Decode-prioritizing schedulers put latency first: new requests do not affect the execution of ongoing requests in their decode phase, but they compromise on throughput, because even if some requests in a batch finish early, execution continues at a reduced batch size until the last request completes. Analysis: the paper points out that the execution time of a matrix multiplication can be modeled as $T=\max(T_{\text{math}}, T_{\text{mem}})$ ...
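
The $T=\max(T_{\text{math}}, T_{\text{mem}})$ view is essentially a roofline argument: a matmul is compute-bound when arithmetic time dominates and memory-bound when data movement does. Below is a minimal sketch of that estimate; the hardware numbers are illustrative assumptions, not figures from the paper.

```python
def matmul_time_estimate(m, k, n, bytes_per_elem=2,
                         peak_flops=312e12,   # illustrative FP16 tensor-core peak (FLOP/s)
                         peak_bw=2.0e12):     # illustrative HBM bandwidth (bytes/s)
    """Estimate T = max(T_math, T_mem) for an (m, k) x (k, n) matmul."""
    flops = 2 * m * k * n                                   # one multiply-add per output element per k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    t_math = flops / peak_flops
    t_mem = bytes_moved / peak_bw
    return max(t_math, t_mem), "compute-bound" if t_math >= t_mem else "memory-bound"


# Prefill-like matmul: many prompt tokens (large m) -> compute-bound.
print(matmul_time_estimate(m=4096, k=8192, n=8192))
# Decode-like matmul: m is just the batch of single-token steps -> memory-bound.
print(matmul_time_estimate(m=8, k=8192, n=8192))
```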

November 1, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Background: Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of processed data. The key component in a GPU serverless cluster is the Controller, which contains a Request Router, directing incoming requests to nodes already running LLM inference processes or instructing the model loading scheduler, and a Model Loading Scheduler, activating LLM inference processes on unallocated GPUs. The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads, largely due to the high proportion of cold starts in serverless clusters. ...
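
As a rough illustration of the router/scheduler split (hypothetical names, not ServerlessLLM's actual API), the sketch below prefers nodes that already host the model and only falls back to the loading scheduler, which is where cold-start latency is paid.

```python
class ModelLoadingScheduler:
    """Toy scheduler that starts an LLM inference process on a free GPU node."""

    def __init__(self, free_nodes):
        self.free_nodes = list(free_nodes)

    def start_model(self, model_name: str) -> str:
        # Naive placement; a real system would also weigh checkpoint locality.
        return self.free_nodes.pop(0)


class RequestRouter:
    """Toy router: prefer nodes already serving the model (warm path),
    otherwise ask the loading scheduler to start it (cold path)."""

    def __init__(self, loading_scheduler: ModelLoadingScheduler):
        self.loading_scheduler = loading_scheduler
        self.live_models = {}   # model_name -> list of node ids running it

    def route(self, model_name: str, request: str) -> str:
        warm_nodes = self.live_models.get(model_name, [])
        if warm_nodes:
            # Warm path: reuse an existing inference process, no loading latency.
            return f"{request!r} sent to {warm_nodes[0]}"
        # Cold path: start the model on an unallocated GPU; this is where
        # cold-start latency (fetching and loading the checkpoint) is paid.
        node = self.loading_scheduler.start_model(model_name)
        self.live_models.setdefault(model_name, []).append(node)
        return f"{request!r} sent to {node} after cold start"


router = RequestRouter(ModelLoadingScheduler(["node-a", "node-b"]))
print(router.route("llama-7b", "hello"))   # cold start on node-a
print(router.route("llama-7b", "again"))   # warm path, reuses node-a
```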

June 28, 2025 · Last updated on September 1, 2025 · 3 min · KKKZOZ