AIConfigurator Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

AI-Aided Author Info Background The primary challenges in optimizing LLM inference in production are the combinatorial explosion of configuration parameters (parallelism strategies, batch sizes, quantization) and the diversity of inference frameworks (TensorRT-LLM, vLLM, SGLang), which make manual tuning or exhaustive GPU benchmarking prohibitively expensive and slow. Insights Instead of modeling the entire neural network as a black box, the system breaks LLM inference down into fundamental, reusable operations called primitives, profiles these primitives, and then combines the statistics to model the end-to-end inference process. ...
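The primitive-composition idea can be illustrated with a minimal sketch. All latencies and the `tp_degree` scaling model below are hypothetical placeholders, not measurements or formulas from AIConfigurator; the point is only that per-primitive statistics, profiled once, can be recombined to rank configurations without touching a GPU.

```python
# Hypothetical per-primitive latencies (ms), as if profiled once on the target GPU.
primitive_ms = {
    "attention": 0.42,
    "mlp": 0.55,
    "layernorm": 0.03,
    "allreduce": 0.11,  # communication cost under tensor parallelism
}

def estimate_layer_ms(tp_degree: int) -> float:
    """Compose primitive statistics into a per-layer latency estimate."""
    compute = (primitive_ms["attention"] + primitive_ms["mlp"]) / tp_degree
    overhead = primitive_ms["layernorm"]
    if tp_degree > 1:
        overhead += primitive_ms["allreduce"]  # pay communication only when sharded
    return compute + overhead

def estimate_forward_ms(num_layers: int, tp_degree: int) -> float:
    """Scale the per-layer estimate to a full forward pass."""
    return num_layers * estimate_layer_ms(tp_degree)

# Rank candidate tensor-parallel degrees analytically, without benchmarking:
candidates = {tp: estimate_forward_ms(num_layers=32, tp_degree=tp) for tp in (1, 2, 4)}
best = min(candidates, key=candidates.get)
```

Under these made-up numbers the model favors higher tensor parallelism until the fixed all-reduce overhead dominates, which is the kind of trade-off such a simulator can explore across thousands of configurations in seconds.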

February 5, 2026 · Last updated on February 9, 2026 · 3 min · KKKZOZ

Revati Transparent GPU-Free Time-Warp Emulation for LLM Serving

AI-Aided Author Info Background The paper identifies a critical bottleneck in deploying Large Language Models (LLMs): The Optimization Challenge: Efficient deployment requires tuning a vast configuration space (e.g., parallelism strategies, batch sizes, caching policies). Cost vs. Fidelity Trade-off: Real GPU Execution: Testing on physical hardware is prohibitively expensive and slow. Discrete-Event Simulators (DES): While fast and cheap, traditional simulators require manually re-implementing the serving system’s complex control logic. Because frameworks (like vLLM and SGLang) evolve rapidly, simulators suffer from a perpetual “semantic gap” and high maintenance burden. Insights ...

February 5, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading Author Info Background Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance. Insights Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits: ...
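Since DistServe optimizes goodput against TTFT and TPOT targets, it helps to pin down how those two metrics are computed from a request's timeline. This is a generic sketch of the standard definitions, with illustrative timestamps of my own choosing, not DistServe's internal code.

```python
def ttft(arrival_s: float, first_token_s: float) -> float:
    """Time-To-First-Token: latency from request arrival until prefill
    emits the first output token (dominated by the prefill phase)."""
    return first_token_s - arrival_s

def tpot(first_token_s: float, last_token_s: float, num_output_tokens: int) -> float:
    """Time-Per-Output-Token: average gap between successive decode tokens
    (dominated by the decoding phase)."""
    return (last_token_s - first_token_s) / max(num_output_tokens - 1, 1)

# A request arriving at t=0 s, first token at t=0.8 s, 33 tokens total by t=2.4 s:
example_ttft = ttft(0.0, 0.8)          # prefill latency
example_tpot = tpot(0.8, 2.4, 33)      # average decode gap
```

Because TTFT depends on prefill and TPOT on decoding, colocating the two phases lets a long prefill inflate other requests' TPOT; disaggregation lets each phase be provisioned against its own SLO.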

November 3, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Splitwise Efficient Generative LLM Inference Using Phase Splitting

Extensive Reading Author Info Background Generative LLM inference is characterized by two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with unique resource demands. Current systems run both phases on the same powerful, expensive GPUs, which is inefficient because the memory-bound token generation phase underutilizes the advanced compute resources of the hardware. This inefficiency is worsening as new GPUs (like the H100) increase compute power much faster than memory bandwidth or capacity, leading to higher-than-necessary costs and power consumption for large-scale deployments. Challenges The memory-intensive token generation phase, which accounts for the majority of end-to-end latency, severely underutilizes the expensive compute resources of modern GPUs. This inefficiency is exacerbated by hardware trends: new GPUs (like the H100) provide massive compute gains (3.43x) but much smaller increases in memory bandwidth (1.6x) and no increase in capacity, making them poorly suited for the memory-bound token phase. Running both distinct phases on the same machine leads to inconsistent latencies and resource contention, forcing providers to over-provision expensive, power-hungry hardware to meet service level objectives (SLOs). Insights The prefill phase is compute-intensive and the decoding phase is memory-intensive; decoding does not need the compute capability of the latest GPUs and can run at lower power and cost. Approaches Before presenting its concrete method, the paper devotes a large part of its length to explaining several characteristics of LLM inference, and these characteristics heavily shape the design of Splitwise ...
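The compute-bound/memory-bound distinction can be made concrete with a back-of-the-envelope arithmetic-intensity sketch. The fp16 weight-streaming model below is a standard simplification I am assuming here, not an analysis taken from the Splitwise paper.

```python
def arithmetic_intensity(batch_tokens: int) -> float:
    """FLOPs per byte of weight traffic for a GEMM over `batch_tokens` tokens.
    Each fp16 weight (2 bytes) contributes 2 FLOPs (multiply + add) per token,
    while it is read from memory only once for the whole batch."""
    flops_per_param = 2 * batch_tokens
    bytes_per_param = 2  # fp16
    return flops_per_param / bytes_per_param  # simplifies to batch_tokens

# Prompt phase: the whole prompt is processed in one pass over the weights.
prefill_ai = arithmetic_intensity(batch_tokens=2048)
# Token generation: one new token per request per pass over the weights.
decode_ai = arithmetic_intensity(batch_tokens=1)
```

An intensity of ~1 FLOP/byte in the token phase sits far below the ridge point of modern GPUs (hundreds of FLOPs/byte), which is why generation saturates memory bandwidth long before it uses the compute that an H100 provides.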

November 3, 2025 · Last updated on October 4, 2025 · 3 min · KKKZOZ

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Extensive Reading Author Info Background Current LLM inference schedulers can be broadly classified into two categories: Prefill-Prioritizing Throughput first: allows subsequent decodes to operate at high batch sizes Compromise on latency: a prefill can take an arbitrarily long time depending on the lengths of the given prompts Decode-Prioritizing Latency first: new requests do not affect the execution of ongoing requests in their decode phase Compromise on throughput: even if some requests in a batch finish early, execution continues with a reduced batch size until the last request completes Analysis The paper notes that the execution time of a matrix multiplication can be modeled as $T = \max(T_{math}, T_{mem})$ ...
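The $T = \max(T_{math}, T_{mem})$ model above is a roofline-style estimate and is easy to sketch. The GPU peak numbers and operand sizes in the usage example are illustrative assumptions (roughly A100-like), not figures from Sarathi-Serve.

```python
def gemm_time_s(flops: float, bytes_moved: float,
                peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Roofline estimate for a matmul: whichever of the compute-bound
    and memory-bound lower bounds is larger dominates execution time."""
    t_math = flops / peak_flops_per_s       # time if purely compute-bound
    t_mem = bytes_moved / peak_bytes_per_s  # time if purely memory-bound
    return max(t_math, t_mem)

# Hypothetical decode-step GEMM over a 7B-parameter model in fp16:
# ~2 FLOPs per weight for one token, ~2 bytes per weight streamed from HBM.
t = gemm_time_s(flops=2 * 7e9, bytes_moved=2 * 7e9,
                peak_flops_per_s=312e12, peak_bytes_per_s=2.0e12)
```

With these numbers $T_{mem}$ exceeds $T_{math}$ by orders of magnitude, which is exactly the regime Sarathi-Serve exploits: piggybacking chunked prefill compute onto decode batches raises $T_{math}$ toward $T_{mem}$ at little extra cost.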

November 1, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ