DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Extensive Reading
Author Info: MIT HAN Lab
Background: Long-context LLMs strain attention and KV caches. As sequence length grows, prefill cost scales quadratically and decoding cost linearly, while KV cache memory grows linearly, making naive full-attention inference impractical in real-world long-context applications. Existing architectural and approximate-attention methods trade away accuracy or require retraining. Linear-attention and specialized long-context architectures reduce complexity but often underperform standard Transformers on long-range reasoning. Methods such as H2O, StreamingLLM, TOVA, and FastGen drop or sparsify tokens uniformly across heads, which can severely damage long-context retrieval accuracy, and they are difficult to apply safely with KV-sharing schemes such as GQA. ...

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

A dynamic parallel method for performance optimization on hybrid CPUs

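Extensive Reading
Author Info
Background: Hybrid CPUs (for example, CPUs that combine performance cores, or P-cores, with efficiency cores, or E-cores) perform poorly on AI inference workloads because their cores have uneven hardware capabilities. Traditional parallel methods split the work evenly, so the high-performance cores have to wait for the low-performance cores and resources are wasted.
Insights: Abandon the traditional "split evenly" parallel strategy in favor of a dynamic "split by capability" strategy. This ensures that during parallel computation every core, strong or weak, finishes its sub-task in roughly the same time, which maximizes overall CPU utilization. A performance ratio $pr_i$ is maintained dynamically for each core and for each specific instruction set. In other words, what is actually maintained is a table:

| Core | Core type | Performance ratio (AVX-VNNI) | Performance ratio (AVX2) |
| --- | --- | --- | --- |
| Core 0 | P-core | 3.5 | 2.0 |
| Core 1 | P-core | 3.5 | 2.0 |
| Core 2 | E-core | 1.0 | 1.0 |
| Core 3 | E-core | 1.0 | 1.0 |

When a task is distributed, the work is split according to
$$\theta_i = \dfrac{pr_i}{\sum_j pr_j}$$
After the task runs, $pr_i$ is adjusted dynamically from the measured execution times:
$$pr_i^{\prime} = \frac{pr_i}{\sum_j t_i \, pr_j / t_j}$$
Approaches: The method consists of two main components, the CPU Runtime and the Thread Scheduler.
CPU Runtime: Manages CPU state and is responsible for tracking and updating the relative performance of each core.
Core binding: The thread pool it creates pins each thread strictly to a specific physical core.
Performance ratio table: It maintains a performance ratio ($pr_i$) for each core; at initialization all ratios are set to the same value.
Dynamic update: After a kernel (for example, one matrix multiplication) finishes, the runtime records each thread's actual execution time ($t_i$). It then uses a formula (Equation 2) to update each core's performance ratio $pr_i$; a filter is also applied to suppress noise.
ISA awareness: The performance gap between P-cores and E-cores differs across instruction sets (ISAs, such as AVX-VNNI), so a separate performance ratio is maintained for each ISA.
Thread Scheduler: Responsible for actually dispatching parallel compute tasks during inference (such as matrix multiplication or tensor copy) ...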
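The split and update rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `split_work` and `update_ratios`, the rounding policy, and the exponential-moving-average filter with weight `alpha` are all hypothetical (the summary only says that a filter is used). The update is written as $(pr_i / t_i) / \sum_j (pr_j / t_j)$, which is algebraically the same as the formula above.

```python
from typing import List


def split_work(total: int, ratios: List[float]) -> List[int]:
    """Split `total` work items across cores in proportion to pr_i / sum_j pr_j."""
    ratio_sum = sum(ratios)
    shares = [int(total * pr / ratio_sum) for pr in ratios]
    # Assumption: items lost to rounding down go to the fastest cores first.
    leftover = max(total - sum(shares), 0)
    fastest_first = sorted(range(len(ratios)), key=lambda i: -ratios[i])
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def update_ratios(ratios: List[float], times: List[float], alpha: float = 0.5) -> List[float]:
    """Update ratios from measured per-core times t_i.

    measured_i = (pr_i / t_i) / sum_j (pr_j / t_j)
    The result is blended with the old (normalized) ratios by an exponential
    moving average; this particular filter is an assumption of the sketch.
    """
    speeds = [pr / t for pr, t in zip(ratios, times)]
    speed_sum = sum(speeds)
    measured = [s / speed_sum for s in speeds]
    old_sum = sum(ratios)
    old = [pr / old_sum for pr in ratios]
    return [alpha * m + (1.0 - alpha) * o for m, o in zip(measured, old)]


if __name__ == "__main__":
    # Ratios from the AVX-VNNI column of the table: two P-cores, two E-cores.
    ratios = [3.5, 3.5, 1.0, 1.0]
    print(split_work(900, ratios))
    # Start from equal ratios; the P-cores finish an even split in half the time.
    print(update_ratios([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 2.0, 2.0], alpha=1.0))
```

Note that when every core finishes at the same time, the measured ratios equal the current normalized ratios, so the update leaves the split unchanged. That is the balanced state the method aims for.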

November 11, 2025 · Last updated on November 17, 2025 · 1 min · KKKZOZ

ELMS Elasticized Large Language Models On Mobile Devices

Intensive Reading
Author Info: Wangsong Yin (Google Scholar); Rongjie Yi (Google Scholar); Daliang Xu (徐大亮), an Assistant Professor (Associate Researcher) at BUPT; Mengwei Xu; Xuanzhe Liu
Background: Existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications.
Prerequisite: In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstrations. ...

August 27, 2025 · Last updated on September 2, 2025 · 2 min · KKKZOZ

A Survey of Resource-efficient LLM and Multimodal Foundation Models

Extensive Reading
Goal: The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field. ...

August 21, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

LLM as a System Service on Mobile Devices

Intensive Reading
Author Info: Wangsong Yin (Google Scholar); Mengwei Xu
Background: The paper first introduces LLMaaS: "LLM as a system service on mobile devices (LLMaaS): The mobile OS exposes an LLM and its inference infrastructure as a system feature to mobile apps, akin to the location or notification services." LLMaaS is proposed mainly for the following reasons:
"LLMaaS needs only one copy of LLM weights in memory." Different apps should call the same system-maintained model instead of each loading its own copy.
"A system-level LLM can be better customized for on-device accelerator and enjoy the performance gain over commodity hardware." Managing the model and running inference at the system level is closer to the hardware, so it can make better use of the underlying hardware resources.
The core problem this paper sets out to solve is how to efficiently manage the LLM contexts ...

August 18, 2025 · Last updated on September 1, 2025 · 4 min · KKKZOZ

Striped Attention Faster Ring Attention for Causal Transformers

Skimming
Author Info
Implementation and Benchmark: zhuzilin/ring-flash-attention: Ring attention implementation with flash attention. Corresponding visualization is here.
Background
Challenges
Insights: Ring Attention suffers from workload imbalance. Because of the causal mask, some devices perform meaningless computation in some iterations while other devices stay busy the whole time. Striped Attention proposes another way to distribute the workload across devices to eliminate the imbalance.
Approaches: Striped Attention has each device hold non-contiguous tokens that are evenly distributed across the original sequence.
Example
Important: The most important point for understanding this example is that neither Ring Attention nor Striped Attention uses naive attention computation ...
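To make the imbalance concrete, here is a small sketch that counts, for each ring round and each device, how many (query, key) pairs survive the causal mask. It rests on two assumptions that go beyond the summary above: the striped layout is round-robin (device `d` holds tokens `d`, `d + N`, `d + 2N`, ...), and in round `r` device `d` works on the K/V block that started on device `(d - r) mod N`. All function names are hypothetical.

```python
from typing import List


def contiguous_partition(seq_len: int, n_devices: int) -> List[List[int]]:
    """Ring Attention style: each device holds one contiguous block of tokens."""
    block = seq_len // n_devices
    return [list(range(d * block, (d + 1) * block)) for d in range(n_devices)]


def striped_partition(seq_len: int, n_devices: int) -> List[List[int]]:
    """Striped layout (assumed round-robin): device d holds tokens d, d+N, d+2N, ..."""
    return [list(range(d, seq_len, n_devices)) for d in range(n_devices)]


def unmasked_pairs(query_tokens: List[int], key_tokens: List[int]) -> int:
    """Count (query, key) pairs kept by a causal mask: key position <= query position."""
    return sum(1 for q in query_tokens for k in key_tokens if k <= q)


def per_round_work(partition: List[List[int]]) -> List[List[int]]:
    """work[r][d] = useful pairs on device d in ring round r.

    Assumption: in round r, device d holds the K/V block that originated
    on device (d - r) mod N.
    """
    n = len(partition)
    return [
        [unmasked_pairs(partition[d], partition[(d - r) % n]) for d in range(n)]
        for r in range(n)
    ]


if __name__ == "__main__":
    for name, layout in [("contiguous", contiguous_partition(8, 4)),
                         ("striped", striped_partition(8, 4))]:
        print(name, layout)
        for r, row in enumerate(per_round_work(layout)):
            print(f"  round {r}: useful pairs per device = {row}")
```

With the contiguous layout, some devices have zero useful pairs in every round after the first while others have a full block. With the striped layout, every device has some useful work in every round.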

August 17, 2025 · Last updated on October 4, 2025 · 3 min · KKKZOZ

PowerInfer-2 Fast Large Language Model Inference on a Smartphone

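Intensive Reading
Author Info: Zhenliang Xue: From IPADS. Yixin Song: First author of PowerInfer. Zeyu Mi (糜泽羽): He is an associate professor at School of Software, Shanghai Jiao Tong University (SJTU). Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems.
Background
Sparsity: FFN layers account for a large share of the parameters and show pronounced sparsity (especially with ReLU), so a predictor can be run before the computation to predict which neurons will be activated, reducing both compute and I/O overhead. PowerInfer-2 also explores dynamic sparsity during LLM inference:
When the batch is large, a neuron is not sparse in the current step as long as at least one input in the batch activates it. Because different inputs activate different neurons, the aggregate effect is that many neurons are activated, forming stable, dense "hot spots" and markedly lowering overall sparsity.
Because some sequences terminate earlier than others, the effective batch size also fluctuates dynamically. As a result, over the lifetime of a task the model's compute pattern transitions smoothly from nearly dense to highly sparse.
Mobile Hardware Characteristics: Compared with PCs, smartphone hardware has two distinguishing features:
Heterogeneous computing capabilities with distinct sparse computation characteristics. The CPU is better at sparse computation. The NPU is better at dense computation. The GPU is slower than both the CPU and the NPU, and using it for inference hurts the device's rendering frame rate. A mobile LLM inference framework should use the heterogeneous processors together to make the most of the shared memory bandwidth.
Distinct storage architecture with unique I/O characteristics. Larger read block sizes give higher throughput. Smaller data ranges give higher throughput. Higher-frequency CPU cores achieve higher read throughput. UFS has limited concurrency ...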
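The batch effect on sparsity described above can be illustrated with a back-of-the-envelope model. This sketch assumes, purely for illustration, that each input activates each neuron independently with the same probability `p_active`; real activations are correlated (the summary itself mentions stable hot spots), so the numbers are only indicative. The function name and the chosen probability are hypothetical and not taken from the paper.

```python
def active_fraction(p_active: float, batch_size: int) -> float:
    """Expected fraction of neurons activated by at least one input in the batch.

    Assumption: independent activations with per-input probability p_active,
    so a neuron stays inactive with probability (1 - p_active) ** batch_size.
    """
    return 1.0 - (1.0 - p_active) ** batch_size


if __name__ == "__main__":
    p = 0.1  # hypothetical per-input activation probability
    # As sequences finish, the effective batch shrinks and sparsity returns.
    for batch in [32, 16, 8, 4, 2, 1]:
        frac = active_fraction(p, batch)
        print(f"effective batch {batch:2d}: about {frac:.0%} of neurons active")
```

Under this simplified model the layer is close to dense at large batch sizes and becomes highly sparse again as the effective batch size drops toward one, matching the dense-to-sparse transition described in the summary.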

July 29, 2025 · Last updated on August 19, 2025 · 4 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General
Background
Resources: LLMs typically demand higher computational cost, higher memory access cost, and higher memory cost.
Inference Process of LLMs: auto-regressive generation. In each generation step, the LLM takes as input the whole token sequence, including the input tokens and previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique stores and reuses previous key and value pairs within the Multi-Head Self-Attention block. ...
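As a rough illustration of why the KV cache helps, the sketch below runs a toy single-head attention decoder twice: once recomputing keys and values for the whole prefix at every step, and once appending to a cache. The projections `toy_key` and `toy_value` are made-up stand-ins for learned weight matrices, and the whole example is a simplified assumption rather than how any particular framework implements caching.

```python
import math
from typing import List, Tuple

Vector = List[float]


def toy_key(x: Vector) -> Vector:
    """Hypothetical stand-in for the learned key projection."""
    return [2.0 * a for a in x]


def toy_value(x: Vector) -> Vector:
    """Hypothetical stand-in for the learned value projection."""
    return [a + 1.0 for a in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]


def decode_without_cache(embeddings: List[Vector]) -> Tuple[List[Vector], int]:
    """Recompute K/V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(embeddings)):
        prefix = embeddings[: t + 1]
        keys = [toy_key(x) for x in prefix]
        values = [toy_value(x) for x in prefix]
        projections += len(prefix)
        outputs.append(attend(embeddings[t], keys, values))
    return outputs, projections


def decode_with_cache(embeddings: List[Vector]) -> Tuple[List[Vector], int]:
    """Project only the newest token and reuse cached K/V for earlier ones."""
    keys, values, outputs, projections = [], [], [], 0
    for x in embeddings:
        keys.append(toy_key(x))
        values.append(toy_value(x))
        projections += 1
        outputs.append(attend(x, keys, values))
    return outputs, projections


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(decode_without_cache(tokens)[1], "K/V projections without cache")
    print(decode_with_cache(tokens)[1], "K/V projections with cache")
```

Both variants produce the same attention outputs; the cached version performs one K/V projection per generated token instead of one per prefix position, at the price of keeping the cache in memory.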

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ