Posts

A dynamic parallel method for performance optimization on hybrid CPUs

Extensive Reading Author Info Background 混合CPU（如包含性能核P-core和能效核E-core的CPU）由于其核心硬件能力不均衡，导致在运行AI推理任务时性能低下。传统的并行计算方法会平均分配工作，导致高性能核心必须等待低性能核心，造成资源浪费 Insights 放弃“均匀分配”任务的传统并行策略，转而采用“按能力分配”的动态策略它确保在并行计算时，每个核心（无论强弱）都能在大致相同的时间完成各自的子任务，从而最大限度地提高 CPU 的整体利用率动态地为每个核心的特定指令集维护一个性能比率 $pr_i$ 也就是说其实维护的是一张表核心 (Core) 核心类型性能比率 (用于 AVX-VNNI) 性能比率 (用于 AVX2) Core 0 P-core 3.5 2.0 Core 1 P-core 3.5 2.0 Core 2 E-core 1.0 1.0 Core 3 E-core 1.0 1.0 分配任务时按照 $$\theta_i = \dfrac{pr_i}{\sum pr_i}$$ 进行切分执行任务后根据实际的执行时间动态调整 $pr_i$ $$p{r_{i}}^{\prime}=\frac{pr_{i}}{\sum_{j}t_{i}pr_{j}/t_{j}}$$Approaches 由两大组件构成 CPU Runtime Thread Scheduler CPU Runtime 管理CPU状态，负责跟踪和更新每个核心的相对性能绑定核心：它创建的线程池会将每个线程严格绑定到特定的物理核心上性能比率表：它为每个核心维护一个“性能比率”（$pr_i$）,这个比率在初始化时都设为动态更新：内核（kernel，如一次矩阵乘法）执行完成后，运行时会跟踪每个线程的实际执行时间 ($t_i$)。然后，它使用一个公式（公式2）来更新每个核心的性能比率$pr_i$, 为了避免噪声干扰，还采用了一个滤波器考虑指令集： P-core 和 E-core 在执行不同指令集（ISA，如AVX-VNNI）时性能差异不同，因此会为不同的 ISA 维护不同的性能比率。 Thread Scheduler 负责在推理过程中（如矩阵乘法或张量复制）具体分发并行计算任务 ...

Dynamic Sparse Attention on Mobile SoCs

Extensive Reading Author Info ‪Wangsong Yin‬ - ‪Google Scholar‬ Daliang Xu （徐大亮） - Daliang Xu’s Website Mengwei Xu Background State-of-the-art on-device inference frameworks fall back to the CPU/GPU for the attention operation , which is necessary for accuracy but causes resource contention and degrades user experience. Running the full attention operation directly on the NPU is not a viable alternative, as its high sensitivity to quantization results in significant accuracy degradation (an 18 pp average drop) when using the NPU’s low-precision integer compute. Applying traditional sparse attention on the CPU/GPU to lessen the workload yields minimal performance gain, as the required estimation stage to find important tokens becomes the new computational bottleneck. Insights Compute sparse attention accurately and efficiently in NPU-centric LLM inference ...

EAGLE Speculative Sampling Requires Rethinking Feature Uncertainty

Extensive Reading Author Info Background The standard method for large language model (LLM) inference, autoregressive decoding, is slow and costly because it generates tokens sequentially, one at a time. Existing acceleration methods like speculative sampling often struggle to find a suitable draft model; using a smaller version of the LLM can have high overhead, while training a new, appropriately-sized draft model is prohibitively expensive. Other approaches like Lookahead and Medusa successfully reduce drafting latency but are ultimately limited by the low accuracy of their drafts, which restricts their maximum achievable speedup. Insights Two key insights: ...

DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading Author Info Background Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance. Insights Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits: ...

Splitwise Efficient Generative LLM Inference Using Phase Splitting

Extensive Reading Author Info Background Generative LLM inference is characterized by two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with unique resource demands. Current systems run both phases on the same powerful, expensive GPUs, which is inefficient because the memory-bound token generation phase underutilizes the advanced compute resources of the hardware. This inefficiency is worsening as new GPUs (like the H100) increase compute power much faster than memory bandwidth or capacity, leading to higher-than-necessary costs and power consumption for large-scale deployments. Challenges The memory-intensive token generation phase, which accounts for the majority of end-to-end latency, severely underutilizes the expensive compute resources of modern GPUs. This inefficiency is exacerbated by hardware trends, as new GPUs (like the H100) provide massive compute gains (3.43x) but much smaller increases in memory bandwidth (1.6x) and no increase in capacity, making them poorly suited for the memory-bound token phase. Running both distinct phases on the same machine leads to inconsistent latencies and resource contention , forcing providers to over-provision expensive, power-hungry hardware to meet service level objectives (SLOs). Insights Prefill phase is compute-intensive, and decoding phase is memory-intensive, decoding does not need the compute capability of the latest GPUs and can be run with lower power and cost. Approaches 在提出具体方法之前，论文的很大部分篇幅都是在说明 LLM Inference 的一些特性，这些特性对 Splitwise 方法的设计有着非常大的影响 ...

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Extensive Reading Author Info Background Current LLM inference schedulers can be broadly classified into two categories: Prefill-Prioritizing Throughput first: allows subsequent decodes to operate at high batch sizes Compromise on latency: prefill can take arbitrarily long time depending on the lengths of the given prompts Decode-Prioritizing Latency first - new requests do not affect the execution of ongoing requests in their decode phase Compromise on throughput: even if some requests in a batch finish early, the execution continues with reduced batch size until the completion of the last request Analysis 论文指出， matrix multiplication 的执行时间可以看做 $T=max(T_{math}, T_{mem})$ ...

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Intensive Reading Author Info About Ting Cao - Dr. Ting Cao: A Professor at the Institute of AI Industry Research (AIR), Tsinghua University. Background Challenges 挑战一：如何准确地识别出下一步计算到底需要哪些“活跃权重” 。如果识别错误，会降低模型的准确度。挑战二：如何能足够早地预测出需要的活跃权重，从而将缓慢的闪存加载过程与当前的计算过程并行处理，以隐藏延迟。现有的一些方法依赖 ReLU 激活函数来预测稀疏性，但这不适用于 Llama 等为追求高精度而未使用 ReLU 的现代 LLM. Insights 利用了 Top-K 的稀疏性，实现了在非 ReLu 上的权重值预测和预取论文提出了两个核心观察： Similarities in Cross-Layer Activations The input activations of the attention and MLP blocks in LLMs exhibit high cross-layer similarity due to residuals to the input activations. 由于激活值相似度很高，所以用当前层最重要的 K 个激活通道去预测下一层最重要的 K 个激活通道，准确度也很高 Contextual Hot Active Weights During Decoding Contextual active weights exhibit high temporal locality across inference iterations during decoding. 在一个具体的对话或任务中（上下文层面），“热点权重”的重复使用率，远高于在所有通用任务中（任务层面）的平均重复使用率所以根据上下文的激活频率设计缓存会更有效（缓存命中率会更高） Approaches Cross-Layer Active Weight Preloading 当计算第 N 层时，ActiveFlow 利用第 N 层的激活值来预测并提前加载第 N+1 层到第 N+k 层（一个“层组”）所需要的活跃权重 ...

ELMS Elasticized Large Language Models On Mobile Devices

Intensive Reading Author Info ‪Wangsong Yin‬ - ‪Google Scholar‬ ‪Rongjie Yi‬ - ‪Google Scholar‬ Daliang Xu （徐大亮） - Daliang Xu’s Website: An Assistant Professor (Associate Researcher) at BUPT. Mengwei Xu Xuanzhe Liu Background Existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications. Prerequisite In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration. ...