LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading Author Info MIT HAN Lab Background Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice. Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...

November 15, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Extensive Reading Author Info MIT HAN Lab Background Long-context LLMs strain attention and KV caches. As sequence length grows, prefill cost scales quadratically and decoding cost linearly, while KV cache memory grows linearly, making naive full-attention inference impractical in real-world long-context applications. Existing architectural and approximate-attention methods trade accuracy or require retraining. Linear-attention and specialized long-context architectures reduce complexity but often underperform standard Transformers on long-range reasoning, while methods like H2O, StreamingLLM, TOVA, and FastGen drop or sparsify tokens uniformly across heads, which can severely damage long-context retrieval accuracy, and they are difficult to apply safely in settings with KV-sharing schemes such as GQA. ...

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

Efficient Streaming Language Models with Attention Sinks

Extensive Reading Author Info MIT HAN Lab Background When applying LLMs to infinite input streams, two main challenges arise: the KV cache grows without bound, which leads to excessive memory usage and decoding latency, and the LLM’s performance degrades once the sequence length exceeds the attention window size set during pre-training. Window Attention: keep only the $L$ most recent tokens in the KV cache; the model degrades dramatically once the sequence length exceeds the cache size (even when only the first token is evicted). Sliding Window with Re-computation: do not reuse the KV cache; at every step, rebuild the window from the last $L$ tokens and run the Transformer on that small segment from scratch. Sliding Window Example (window size 3): t = 1: window [x₁], run the model on this length-1 sequence, use the output of x₁. t = 2: window [x₁, x₂], run the model on [x₁, x₂] (full 2×2 self-attention), use the output of x₂. t = 3: window [x₁, x₂, x₃], run the model on these 3 tokens (3×3 attention), use x₃. t = 4: window slides to [x₂, x₃, x₄], run the model again on this 3-token segment (3×3 attention), use x₄. t = 5: window [x₃, x₄, x₅], full 3×3 attention, use x₅. t = 6: window [x₄, x₅, x₆], full 3×3 attention, use x₆. Observations: a surprisingly large share of the attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. ...
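A minimal sketch of the "sliding window with re-computation" baseline walked through above. `model` here is a hypothetical stand-in for a Transformer forward pass that returns one output per input position; nothing from any KV cache is reused.

```python
# Sliding window with re-computation: at every step, rebuild the window from the
# last L tokens and rerun the model from scratch on that small segment.
def sliding_window_recompute(tokens, model, L=3):
    outputs = []
    for t in range(1, len(tokens) + 1):
        window = tokens[max(0, t - L):t]    # e.g. [x2, x3, x4] at t = 4
        hidden = model(window)              # full L x L self-attention over the window
        outputs.append(hidden[-1])          # keep only the newest token's output
    return outputs

# Toy usage with a dummy "model" that just echoes its inputs:
print(sliding_window_recompute(["x1", "x2", "x3", "x4", "x5", "x6"], model=lambda w: w))
```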

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Extensive Reading Author Info MIT HAN Lab Background In long-context inference: the KV cache grows linearly with context length ($L$), and at each decoding step the model must read the entire KV cache to compute attention. Existing works recognize that a small subset of tokens dominates the accuracy of token generation, and they choose to evict unimportant tokens: StreamingLLM keeps a sliding window plus a few “anchor” tokens; H2O, TOVA, etc., use heuristics or statistics to permanently drop “less important” tokens. Once a token is evicted, it is gone. But the important tokens are query-dependent. ...
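A toy illustration (not Quest's actual page-level estimator) of why permanent eviction is risky: with the same cached keys, different queries rank different tokens as most important, so a token that looks unimportant now may be exactly what a later query needs. Shapes and vectors below are made up for the demo.

```python
import numpy as np

# Which cached tokens receive the highest attention weight depends on the query.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64))              # cached keys: 1000 tokens, head_dim 64

def top_k_tokens(query, k=8):
    scores = keys @ query / np.sqrt(keys.shape[1])  # unnormalized attention logits
    return set(np.argsort(scores)[-k:])             # indices of the k highest-scoring tokens

q1, q2 = rng.standard_normal(64), rng.standard_normal(64)
print(top_k_tokens(q1) & top_k_tokens(q2))          # overlap is typically small
```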

November 13, 2025 · Last updated on November 17, 2025 · 2 min · KKKZOZ

A dynamic parallel method for performance optimization on hybrid CPUs

Extensive Reading Author Info Background Hybrid CPUs (e.g., CPUs with performance P-cores and efficiency E-cores) perform poorly on AI inference workloads because their cores have unequal hardware capabilities; traditional parallelization splits work evenly across cores, so the high-performance cores must wait for the low-performance ones, wasting resources. Insights Abandon the traditional "split evenly" parallel strategy in favor of a dynamic "split by capability" strategy: it ensures that during parallel computation every core, strong or weak, finishes its sub-task in roughly the same time, maximizing overall CPU utilization. The runtime dynamically maintains a performance ratio $pr_i$ for each core and each instruction set, which in effect is a table like:

| Core | Core type | Performance ratio (AVX-VNNI) | Performance ratio (AVX2) |
| --- | --- | --- | --- |
| Core 0 | P-core | 3.5 | 2.0 |
| Core 1 | P-core | 3.5 | 2.0 |
| Core 2 | E-core | 1.0 | 1.0 |
| Core 3 | E-core | 1.0 | 1.0 |

Work is split according to $$\theta_i = \dfrac{pr_i}{\sum_j pr_j}$$ and, after a task finishes, $pr_i$ is adjusted dynamically from the actual execution times: $$pr_i^{\prime} = \frac{pr_i}{\sum_{j} t_i \, pr_j / t_j}$$ Approaches The system consists of two components: a CPU Runtime and a Thread Scheduler. CPU Runtime: manages CPU state and is responsible for tracking and updating each core's relative performance. Core binding: the thread pool it creates pins each thread to a specific physical core. Performance-ratio table: it maintains a performance ratio ($pr_i$) per core, all set to a preset value at initialization. Dynamic updates: after a kernel (e.g., one matrix multiplication) completes, the runtime records each thread's actual execution time ($t_i$) and uses the formula above (Eq. 2 in the paper) to update each core's $pr_i$; a filter is applied to suppress noise. ISA awareness: P-cores and E-cores differ by different amounts across instruction sets (ISAs, e.g., AVX-VNNI), so a separate performance ratio is maintained per ISA. Thread Scheduler: responsible for dispatching the concrete parallel tasks during inference (e.g., matrix multiplication or tensor copies). ...
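A minimal sketch of the capability-proportional split and the ratio update described above, assuming the hypothetical values from the example table; real code would keep one such table per ISA and add the noise filter.

```python
# core_id -> pr_i for one ISA (e.g. AVX-VNNI); values mirror the example table.
perf_ratio = {0: 3.5, 1: 3.5, 2: 1.0, 3: 1.0}

def split_work(total_rows):
    """theta_i = pr_i / sum_j(pr_j): each core gets a slice proportional to its ratio."""
    total = sum(perf_ratio.values())
    return {core: int(total_rows * pr / total) for core, pr in perf_ratio.items()}

def update_ratios(exec_times):
    """pr_i' = pr_i / sum_j(t_i * pr_j / t_j): cores that finished faster than their
    ratio predicted gain weight in the next split (noise filtering omitted)."""
    global perf_ratio
    perf_ratio = {
        i: perf_ratio[i] / sum(exec_times[i] * perf_ratio[j] / exec_times[j]
                               for j in perf_ratio)
        for i in perf_ratio
    }

slices = split_work(1024)                              # P-cores receive ~3.5x the rows of E-cores
update_ratios({0: 10.0, 1: 10.0, 2: 12.0, 3: 12.0})    # measured per-core kernel times t_i
```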

November 11, 2025 · Last updated on November 17, 2025 · 1 min · KKKZOZ

Dynamic Sparse Attention on Mobile SoCs

Extensive Reading Author Info Wangsong Yin - Google Scholar Daliang Xu (徐大亮) - Daliang Xu’s Website Mengwei Xu Background State-of-the-art on-device inference frameworks fall back to the CPU/GPU for the attention operation, which is necessary for accuracy but causes resource contention and degrades user experience. Running the full attention operation directly on the NPU is not a viable alternative, as its high sensitivity to quantization results in significant accuracy degradation (an 18-percentage-point average drop) when using the NPU’s low-precision integer compute. Applying traditional sparse attention on the CPU/GPU to lessen the workload yields minimal performance gain, as the required estimation stage to find important tokens becomes the new computational bottleneck. Insights Compute sparse attention accurately and efficiently in NPU-centric LLM inference ...

November 11, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Extensive Reading Author Info Background The standard method for large language model (LLM) inference, autoregressive decoding, is slow and costly because it generates tokens sequentially, one at a time. Existing acceleration methods like speculative sampling often struggle to find a suitable draft model; using a smaller version of the LLM can have high overhead, while training a new, appropriately-sized draft model is prohibitively expensive. Other approaches like Lookahead and Medusa successfully reduce drafting latency but are ultimately limited by the low accuracy of their drafts, which restricts their maximum achievable speedup. Insights Two key insights: ...

November 10, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading Author Info Background Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance. Insights Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits: ...
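A conceptual toy sketch (not DistServe's actual system or API) of the disaggregated flow: a prefill worker builds a request's KV cache, and only that cache is handed to a separate decode worker, so each phase can be provisioned for its own latency target (TTFT vs. TPOT). The classes and tokens below are placeholders.

```python
# Conceptual toy only: the two phases live in separate workers (standing in for
# separate GPU pools) and communicate solely through the request's KV cache.
class PrefillWorker:
    def run(self, request_id, prompt_tokens):
        # Compute-bound phase: in a real system this is one full forward pass
        # over the prompt, optimized for time-to-first-token.
        kv_cache = [f"kv({tok})" for tok in prompt_tokens]   # stand-in for KV blocks
        return request_id, kv_cache

class DecodeWorker:
    def run(self, request_id, kv_cache, max_new_tokens):
        # Memory-bound phase: extend the transferred cache one token at a time,
        # never contending with prefill work for the same device.
        generated = []
        for step in range(max_new_tokens):
            token = f"<tok{step}>"                           # stand-in for sampling
            kv_cache.append(f"kv({token})")
            generated.append(token)
        return generated

req_id, cache = PrefillWorker().run("req-1", ["A", "long", "prompt"])
print(DecodeWorker().run(req_id, cache, max_new_tokens=3))
```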

November 3, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ