Efficient Streaming Language Models with Attention Sinks

Extensive Reading

Author Info: MIT HAN Lab

Background

When applying LLMs to infinite input streams, two main challenges arise:

- The KV cache grows without bound, leading to excessive memory usage and decode latency.
- The LLM's performance degrades once the sequence length exceeds the attention window size set during pre-training.

Window Attention: keep only the $L$ most recent tokens in the KV cache. The model degrades dramatically once the sequence length exceeds the cache size (even after evicting just the first token).

Sliding Window with Re-computation: do not reuse the KV cache. At every step, rebuild the window of the last $L$ tokens and run the Transformer on that small segment from scratch.

Sliding Window Example

- t = 1: window [x₁]; run the model on this length-1 sequence, use the output of x₁.
- t = 2: window [x₁, x₂]; run the model on [x₁, x₂] (full 2×2 self-attention), use the output of x₂.
- t = 3: window [x₁, x₂, x₃]; run the model on these 3 tokens (3×3 attention), use x₃.
- t = 4: window slides to [x₂, x₃, x₄]; run the model again on this 3-token segment (3×3 attention), use x₄.
- t = 5: window [x₃, x₄, x₅]; full 3×3 attention, use x₅.
- t = 6: window [x₄, x₅, x₆]; full 3×3 attention, use x₆.

Observations

A surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. ...
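A minimal sketch of the eviction policy this observation leads to: keep a few initial "attention sink" tokens plus the most recent window, and drop only the middle of the KV cache. This is written in the spirit of StreamingLLM, not its actual implementation; the class name, tensor layout ([batch, heads, seq, head_dim]), and sizes are illustrative.

```python
import torch

class SinkKVCache:
    """Keep `num_sink` initial tokens plus the `window` most recent tokens.

    Unlike plain window attention (which evicts from the front) or sliding
    window with re-computation (which rebuilds the window every step), this
    rolling policy always retains the first few tokens as attention sinks.
    """

    def __init__(self, num_sink: int = 4, window: int = 1020):
        self.num_sink = num_sink
        self.window = window
        self.k = None  # [batch, heads, seq, head_dim]
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Concatenate the newly computed keys/values for this decode step.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)

        if self.k.shape[2] > self.num_sink + self.window:
            # Evict the middle: keep sinks + the most recent `window` tokens.
            self.k = torch.cat([self.k[:, :, :self.num_sink],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sink],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v


# Usage: feed one new token's keys/values per decode step.
cache = SinkKVCache(num_sink=4, window=8)
for _ in range(20):
    k_step = torch.randn(1, 2, 1, 16)   # [batch, heads, 1, head_dim]
    v_step = torch.randn(1, 2, 1, 16)
    k_all, v_all = cache.append(k_step, v_step)
print(k_all.shape)  # the cache never exceeds 4 + 8 = 12 tokens
```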

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference

Extensive Reading

Author Info: MIT HAN Lab

Background

In long-context inference:

- The KV cache grows linearly with context length ($L$).
- At each decoding step, the model must read the entire KV cache to compute attention.

Existing works observe that a small subset of tokens dominates the accuracy of token generation, and they choose to evict the unimportant ones:

- StreamingLLM keeps a sliding window plus a few "anchor" tokens.
- H2O, TOVA, etc., use heuristics or statistics to permanently drop "less important" tokens.

Once a token is evicted, it's gone. BUT, which tokens are important is query-dependent. ...
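A minimal sketch of query-aware page selection in the spirit of Quest: keep element-wise min/max keys as per-page metadata, score each page by an upper bound on the query-key dot product, and attend only over the top pages. The page size, tensor shapes, and function name are illustrative assumptions, not the paper's code.

```python
import torch

def select_pages(q, k_cache, page_size=16, top_k=4):
    """Pick the KV pages most relevant to the current query.

    q:       [head_dim]            query vector of the current decode step
    k_cache: [seq_len, head_dim]   cached keys for one head
    Returns the indices of the selected pages.
    """
    seq_len, head_dim = k_cache.shape
    num_pages = seq_len // page_size
    pages = k_cache[: num_pages * page_size].view(num_pages, page_size, head_dim)

    # Per-page element-wise min/max of the keys (the page metadata).
    k_min = pages.min(dim=1).values        # [num_pages, head_dim]
    k_max = pages.max(dim=1).values        # [num_pages, head_dim]

    # Upper bound on q·k over any key in the page: for each channel take
    # whichever extreme maximizes q_i * k_i, then sum over channels.
    ub = torch.maximum(q * k_min, q * k_max).sum(dim=-1)   # [num_pages]

    return ub.topk(min(top_k, num_pages)).indices

# Usage: attention is then computed only over the selected pages.
q = torch.randn(64)
k_cache = torch.randn(1024, 64)
print(select_pages(q, k_cache, page_size=16, top_k=4))
```

Because the selection is recomputed for every query, no token is ever permanently lost, which is the key difference from eviction-based methods.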

November 13, 2025 · Last updated on November 17, 2025 · 2 min · KKKZOZ

A dynamic parallel method for performance optimization on hybrid CPUs

Extensive Reading

Author Info

Background

Hybrid CPUs (e.g., CPUs with both performance P-cores and efficiency E-cores) perform poorly on AI inference workloads because their cores have unequal hardware capabilities. Traditional parallel methods split the work evenly, so the high-performance cores must wait for the low-performance ones, wasting resources.

Insights

Abandon the traditional "equal split" parallel strategy and instead distribute work dynamically "by capability": during parallel computation, every core (strong or weak) finishes its sub-task in roughly the same time, maximizing overall CPU utilization.

A performance ratio $pr_i$ is maintained dynamically for each core and each instruction set; in effect, the runtime maintains a table:

| Core   | Core type | Performance ratio (AVX-VNNI) | Performance ratio (AVX2) |
|--------|-----------|------------------------------|--------------------------|
| Core 0 | P-core    | 3.5                          | 2.0                      |
| Core 1 | P-core    | 3.5                          | 2.0                      |
| Core 2 | E-core    | 1.0                          | 1.0                      |
| Core 3 | E-core    | 1.0                          | 1.0                      |

Work is partitioned according to
$$\theta_i = \frac{pr_i}{\sum_j pr_j}$$
and after each task the ratios are adjusted from the measured execution times:
$$pr_i' = \frac{pr_i}{\sum_j t_i \, pr_j / t_j}$$

Approaches

The system consists of two components: a CPU Runtime and a Thread Scheduler.

CPU Runtime: manages CPU state and is responsible for tracking and updating each core's relative performance.

- Core binding: the thread pool it creates pins each thread to a specific physical core.
- Performance-ratio table: it maintains a "performance ratio" ($pr_i$) for each core, all initialized to the same value.
- Dynamic update: after a kernel (e.g., one matrix multiplication) finishes, the runtime records each thread's actual execution time ($t_i$) and uses a formula (Equation 2) to update each core's $pr_i$; a filter is applied to suppress noise.
- ISA awareness: P-cores and E-cores differ by different amounts on different instruction sets (ISAs, e.g., AVX-VNNI), so a separate performance ratio is kept per ISA.

Thread Scheduler: dispatches the actual parallel work (e.g., matrix multiplication or tensor copies) during inference. ...
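A minimal sketch of the capability-proportional split and the ratio update described above, in pure Python. The function names are hypothetical and the noise filter mentioned in the post is omitted.

```python
def split_work(total_rows: int, perf_ratios: list[float]) -> list[int]:
    """Partition `total_rows` of work in proportion to each core's pr_i."""
    total_pr = sum(perf_ratios)
    shares = [int(total_rows * pr / total_pr) for pr in perf_ratios]
    shares[-1] += total_rows - sum(shares)   # hand the rounding remainder to the last core
    return shares

def update_ratios(perf_ratios: list[float], exec_times: list[float]) -> list[float]:
    """Update pr_i from the measured per-core execution times t_i.

    Follows pr_i' = pr_i / (sum_j t_i * pr_j / t_j): a core that ran slower
    than its assumed speed relative to its peers gets its ratio reduced.
    """
    new = []
    for pr_i, t_i in zip(perf_ratios, exec_times):
        denom = sum(t_i * pr_j / t_j for pr_j, t_j in zip(perf_ratios, exec_times))
        new.append(pr_i / denom)
    return new

# Usage: 2 P-cores (ratio 3.5) and 2 E-cores (ratio 1.0) splitting 1024 rows.
ratios = [3.5, 3.5, 1.0, 1.0]
print(split_work(1024, ratios))                     # [398, 398, 113, 115]
print(update_ratios(ratios, [1.0, 1.0, 3.2, 3.6]))  # E-core ratios drift toward their true speed
```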

November 11, 2025 · Last updated on November 17, 2025 · 1 min · KKKZOZ

Dynamic Sparse Attention on Mobile SoCs

Extensive Reading

Author Info: Wangsong Yin (Google Scholar), Daliang Xu (徐大亮, Daliang Xu's Website), Mengwei Xu

Background

- State-of-the-art on-device inference frameworks fall back to the CPU/GPU for the attention operation, which is necessary for accuracy but causes resource contention and degrades user experience.
- Running the full attention operation directly on the NPU is not a viable alternative: its high sensitivity to quantization results in significant accuracy degradation (an 18 pp average drop) when using the NPU's low-precision integer compute.
- Applying traditional sparse attention on the CPU/GPU to lessen the workload yields minimal performance gain, as the estimation stage required to find important tokens becomes the new computational bottleneck.

Insights

Compute sparse attention accurately and efficiently in NPU-centric LLM inference ...

November 11, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

EAGLE Speculative Sampling Requires Rethinking Feature Uncertainty

Extensive Reading

Author Info

Background

The standard method for large language model (LLM) inference, autoregressive decoding, is slow and costly because it generates tokens sequentially, one at a time. Existing acceleration methods like speculative sampling often struggle to find a suitable draft model: using a smaller version of the LLM can have high overhead, while training a new, appropriately sized draft model is prohibitively expensive. Other approaches like Lookahead and Medusa successfully reduce drafting latency but are ultimately limited by the low accuracy of their drafts, which restricts their maximum achievable speedup.

Insights

Two key insights: ...
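For context, a minimal sketch of the draft-then-verify loop that speculative decoding builds on (greedy acceptance only, and token-level drafting rather than EAGLE's feature-level drafting). The model interfaces are hypothetical: both callables map a token sequence [1, T] to logits [1, T, vocab].

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int = 4):
    """One draft-then-verify step of greedy speculative decoding.

    The small draft model proposes `k` tokens autoregressively; the large
    target model scores all of them in a single forward pass, and we keep
    the longest agreeing prefix plus one token from the target model.
    """
    # 1) Draft: the cheap model proposes k tokens one by one.
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) Verify: one target forward pass over the drafted sequence.
    target_logits = target_model(draft)                      # [1, T+k, vocab]
    # Predictions for the k drafted positions plus one bonus token.
    target_pred = target_logits[:, -(k + 1):, :].argmax(dim=-1)

    # 3) Accept the longest agreeing prefix, then append the target's token
    #    (its correction at the first mismatch, or the bonus token).
    proposed = draft[:, -k:]
    n_accept = 0
    while n_accept < k and proposed[0, n_accept] == target_pred[0, n_accept]:
        n_accept += 1
    accepted = draft[:, : prefix.shape[1] + n_accept]
    correction = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([accepted, correction], dim=1)
```

Every accepted token saves one sequential target-model call, which is where the speedup comes from; EAGLE's contribution is making the draft far more accurate so that more tokens are accepted per step.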

November 10, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading

Author Info

Background

- Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases.
- This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT).
- Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance.

Insights

Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits: ...
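A schematic sketch of the disaggregated pipeline, not DistServe's actual architecture: prefill workers handle the prompt and hand the resulting KV cache to decode workers over a queue, so the two phases run on separate resources and can be scaled and parallelized independently. All names and placeholder payloads are made up for illustration.

```python
import queue
import threading
import time

prefill_q: "queue.Queue[dict]" = queue.Queue()
decode_q: "queue.Queue[dict]" = queue.Queue()

def prefill_worker():
    while True:
        req = prefill_q.get()
        # Compute-intensive: run the full prompt through the model once.
        req["kv_cache"] = f"kv({req['prompt']})"   # placeholder for real KV tensors
        req["first_token"] = "<tok0>"              # TTFT is determined on this side
        decode_q.put(req)                          # migrate the KV cache to the decode side

def decode_worker():
    while True:
        req = decode_q.get()
        # Memory-intensive: generate tokens one by one using the received KV cache.
        tokens = [req["first_token"]] + [f"<tok{i}>" for i in range(1, req["max_new_tokens"])]
        print(req["prompt"], "->", " ".join(tokens))

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()

prefill_q.put({"prompt": "Hello", "max_new_tokens": 4})
prefill_q.put({"prompt": "Long document ...", "max_new_tokens": 4})
time.sleep(0.5)   # let the daemon workers drain the queues
```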

November 3, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Splitwise Efficient Generative LLM Inference Using Phase Splitting

Extensive Reading

Author Info

Background

- Generative LLM inference is characterized by two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with unique resource demands.
- Current systems run both phases on the same powerful, expensive GPUs, which is inefficient because the memory-bound token generation phase underutilizes the advanced compute resources of the hardware.
- This inefficiency is worsening as new GPUs (like the H100) increase compute power much faster than memory bandwidth or capacity, leading to higher-than-necessary costs and power consumption for large-scale deployments.

Challenges

- The memory-intensive token generation phase, which accounts for the majority of end-to-end latency, severely underutilizes the expensive compute resources of modern GPUs.
- This inefficiency is exacerbated by hardware trends: new GPUs (like the H100) provide massive compute gains (3.43x) but much smaller increases in memory bandwidth (1.6x) and no increase in capacity, making them poorly suited for the memory-bound token phase.
- Running both distinct phases on the same machine leads to inconsistent latencies and resource contention, forcing providers to over-provision expensive, power-hungry hardware to meet service level objectives (SLOs).

Insights

The prefill phase is compute-intensive and the decoding phase is memory-intensive; decoding does not need the compute capability of the latest GPUs and can run at lower power and cost.

Approaches

Before presenting the concrete method, the paper devotes a large part of its length to characterizing LLM inference; these characteristics heavily shape the design of Splitwise. ...
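A back-of-the-envelope sketch of why token generation is memory-bound while prompt processing is compute-bound. The GPU and model numbers are illustrative (A100-class FP16 peak and bandwidth, a 7B-parameter model), not figures from the paper.

```python
# Roofline-style estimate: a phase is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) falls below the GPU's compute/bandwidth ratio.

peak_flops = 312e12           # illustrative FP16 peak, FLOP/s
mem_bw = 2.0e12               # illustrative HBM bandwidth, bytes/s
ridge = peak_flops / mem_bw   # FLOPs per byte needed to stay compute-bound
print(f"ridge point ≈ {ridge:.0f} FLOPs/byte")           # ≈ 156

params = 7e9                  # 7B model, FP16 weights -> 2 bytes per parameter
weight_bytes = 2 * params

# Decode: each step does ~2 FLOPs per parameter but must stream all weights.
decode_intensity = (2 * params) / weight_bytes            # = 1 FLOP/byte
# Prefill: a 2048-token prompt reuses the same weights across all tokens.
prompt_len = 2048
prefill_intensity = (2 * params * prompt_len) / weight_bytes

print(f"decode  ≈ {decode_intensity:.0f} FLOPs/byte  -> memory-bound")
print(f"prefill ≈ {prefill_intensity:.0f} FLOPs/byte -> compute-bound")
```

The gap only widens on newer GPUs whose compute grows faster than their bandwidth, which is the hardware trend Splitwise exploits by running decode on older or cheaper machines.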

November 3, 2025 · Last updated on October 4, 2025 · 3 min · KKKZOZ

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Extensive Reading

Author Info

Background

Current LLM inference schedulers can be broadly classified into two categories:

- Prefill-Prioritizing
  - Throughput first: allows subsequent decodes to operate at high batch sizes.
  - Compromise on latency: a prefill can take an arbitrarily long time depending on the lengths of the given prompts.
- Decode-Prioritizing
  - Latency first: new requests do not affect the execution of ongoing requests in their decode phase.
  - Compromise on throughput: even if some requests in a batch finish early, execution continues with a reduced batch size until the last request completes.

Analysis

The paper points out that the execution time of a matrix multiplication can be modeled as $T = \max(T_{math}, T_{mem})$ ...
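A small worked example of the $T = \max(T_{math}, T_{mem})$ model, using illustrative GPU numbers rather than figures from the paper: for a weight matrix multiplied by a batch of token vectors, $T_{mem}$ is roughly constant (the weights must be streamed once) while $T_{math}$ grows with batch size, so small decode batches are memory-bound and large or chunked-prefill batches become compute-bound.

```python
# T = max(T_math, T_mem) for Y = X @ W with X: [batch, d], W: [d, d], FP16 weights.
peak_flops = 312e12      # illustrative FP16 peak, FLOP/s
mem_bw = 2.0e12          # illustrative HBM bandwidth, bytes/s
d = 8192                 # hidden size

for batch in (1, 8, 64, 256, 1024):
    flops = 2 * batch * d * d            # multiply-accumulate count
    bytes_moved = 2 * d * d              # weight traffic dominates at small batch
    t_math = flops / peak_flops
    t_mem = bytes_moved / mem_bw
    t = max(t_math, t_mem)
    bound = "compute" if t_math >= t_mem else "memory"
    print(f"batch={batch:5d}  T={t * 1e6:7.1f} µs  ({bound}-bound)")
```

With these numbers the crossover sits around batch 156, which is why piggybacking a chunk of prefill tokens onto a decode batch (as Sarathi-Serve does) raises utilization without stalling ongoing decodes.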

November 1, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ