LLM Preliminaries

Math Vector-Matrix Multiplication: analyzing the vector-matrix product $xW$ from three different perspectives. Suppose the vector $x$ has shape $(1, 3)$ and the matrix $W$ has shape $(3, 6)$. $$x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$$$$ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & w_{16} \\\\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & w_{26} \\\\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & w_{36} \end{bmatrix} $$By the rules of matrix multiplication, the result $y = xW$ has shape $(1, 6)$. Perspective 1: view $W$ as a 2-D grid of elements. This is the most basic, most fine-grained view: we regard the matrix $W$ as a $3 \times 6$ grid of numbers, and each element $y_j$ of the result vector $y$ is obtained by multiplying each element of $x$ with the corresponding element in column $j$ of $W$ and summing the products, i.e. $y_j = \sum_{i=1}^{3} x_i w_{ij}$. ...
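The element-wise view above can be checked with a minimal NumPy sketch (the concrete values are illustrative):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])            # shape (1, 3)
W = np.arange(1.0, 19.0).reshape(3, 6)     # shape (3, 6)

# Perspective 1: each y_j is the sum over i of x_i * w_ij.
y_manual = np.array([[sum(x[0, i] * W[i, j] for i in range(3))
                      for j in range(6)]])

print(y_manual.shape)                      # (1, 6)
print(np.allclose(y_manual, x @ W))        # True: matches the built-in product
```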

August 4, 2025 · Last updated on August 25, 2025 · 12 min · KKKZOZ

LLM in a flash Efficient Large Language Model Inference with Limited Memory

Intensive Reading Author Info Keivan Alizadeh-Vahid - Google Scholar Iman Mirzadeh: An ML Research Engineer at Apple. Background LLMs are hard to load on personal devices. The standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference. However, this severely limits the maximum model size that can be run. Challenges The primary challenge is that the memory footprint of large language models (LLMs) often exceeds the limited DRAM capacity of personal devices. While storing models on high-capacity flash memory is a potential solution, it introduces two new major challenges: ...

July 30, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

PowerInfer-2 Fast Large Language Model Inference on a Smartphone

Intensive Reading Author Info Zhenliang Xue: From IPADS. Yixin Song: First author of PowerInfer. Zeyu Mi (糜泽羽): He is an associate professor at School of Software, Shanghai Jiao Tong University (SJTU). Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems. Background Sparsity: FFN parameters account for a large share of the model and exhibit pronounced sparsity (especially with ReLU), so a predictor can be run before computation to predict which neurons will be activated, reducing both compute and I/O overhead. PowerInfer-2 also explores dynamic sparsity during LLM inference: When the batch is large, a neuron is not sparse in a given step as long as at least one input in the batch activates it. Since different inputs activate different neurons, the aggregate effect activates a large number of neurons, forming stable, dense "hot spots" that significantly lower overall sparsity. Because some sequences terminate earlier, the effective batch size also fluctuates dynamically; this real-time variation makes the model's computation pattern transition smoothly from nearly dense to highly sparse over the lifetime of a task. Mobile Hardware Characteristics: Compared with PCs, mobile hardware has two characteristics: Heterogeneous computing capabilities with distinct sparse computation characteristics. The CPU is better at sparse computation; the NPU is better at dense computation; the GPU is slower than both the CPU and the NPU, and using it for inference degrades the device's rendering frame rate. A mobile LLM inference framework should therefore exploit heterogeneous processors together to maximize utilization of the shared memory bandwidth. Distinct storage architecture with unique I/O characteristics. Larger read block sizes give higher throughput; smaller data ranges give higher throughput; CPU cores running at higher frequencies achieve higher read throughput; UFS has limited concurrency. ...
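The batch-aggregation effect on sparsity can be sketched with a toy simulation (the activation probabilities below are assumptions for illustration, not measurements from the paper):

```python
import numpy as np

# A neuron is non-sparse for a step if at least one input in the batch
# activates it, so the union of per-input masks densifies as the batch grows.
rng = np.random.default_rng(0)
n_neurons = 10_000
p_active = 0.1  # per-input activation probability (assumed)

fractions = {}
for batch_size in (1, 8, 64):
    masks = rng.random((batch_size, n_neurons)) < p_active  # per-input masks
    active = masks.any(axis=0)                              # union over batch
    fractions[batch_size] = active.mean()
    print(batch_size, fractions[batch_size])
```

With independent inputs, the expected active fraction is $1 - (1 - p)^{B}$, which approaches 1 quickly as the batch size $B$ grows; this is the dense "hot spot" regime the post describes.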

July 29, 2025 · Last updated on August 19, 2025 · 4 min · KKKZOZ

AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration

Extensive Reading Author Info Ji Lin’s Homepage Jiaming Tang Shang Yang | MIT EECS Song Han - Associate Professor, MIT EECS Background Quantization is vital for running LLMs on edge devices. Challenges Quantization-aware training (QAT) is not efficient due to the high training cost. Post-training quantization (PTQ) suffers from large accuracy degradation under a low-bit setting. Insights Not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. Mixed-precision formats are not hardware-efficient; instead, we can employ activation-aware scaling. Approaches Activation-aware Weight Quantization ...
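A toy sketch of the activation-aware scaling idea (hypothetical, not the official AWQ code, which uses group-wise quantization and a searched scale): pick the salient input channel from the activation distribution, scale its weights up before round-to-nearest quantization, and fold the inverse scale back afterwards, so that channel is represented with finer effective precision.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric per-tensor round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.round(w / step) * step

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))             # weight: [in_features, out_features]
act_scale = np.linspace(0.1, 5.0, 64)     # assumed per-channel activation magnitudes
salient = int(np.argmax(act_scale))       # chosen by activations, not by weights

s = np.ones(64)
s[salient] = 2.0                          # protect the salient channel

W_protected = quantize_rtn(W * s[:, None]) / s[:, None]
W_plain = quantize_rtn(W)

err_protected = np.abs(W_protected[salient] - W[salient]).mean()
err_plain = np.abs(W_plain[salient] - W[salient]).mean()
print(f"salient-channel error: protected={err_protected:.4f}, plain={err_plain:.4f}")
```

Scaling the salient channel by $s$ and dividing after quantization halves its effective rounding step, which is why its quantization error shrinks without keeping any weight in higher precision.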

July 28, 2025 · Last updated on August 25, 2025 · 2 min · KKKZOZ

PowerInfer Fast Large Language Model Serving with a Consumer-grade GPU

Intensive Reading Author Info Yixin Song - Google Scholar Zeyu Mi (糜泽羽): He is an associate professor at School of Software, Shanghai Jiao Tong University (SJTU). Haotong Xie (谢昊彤) Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems. Background Local deployments focus on low latency in processing small batches. LLM inference exhibits notable sparsity in neuron activation, a phenomenon observed in both self-attention and MLP blocks. The offloading technique leverages the CPU’s additional computational and memory resources. GPU-centric offloading utilizes CPU memory to store portions of the model parameters that exceed the GPU’s capacity. This leads to substantial per-token latency, mainly due to frequent data transfers between the GPU and CPU: over 99.5% of processing time is consumed by transferring LLM weights from CPU to GPU. Hybrid offloading distributes model parameters between GPU and CPU, splitting them at the Transformer layer level. The CPU processes its layers first, then sends intermediate results to the GPU for token generation. The CPU, with higher memory but lower computational power, ends up handling 98% of the total computational time. ...
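A back-of-the-envelope sketch of why transfers dominate GPU-centric offloading (the model size, bandwidth, and compute time below are assumptions for illustration, not figures from the paper):

```python
# If offloaded weights must cross the PCIe bus every token, transfer time
# dwarfs GPU compute time per token.
weights_gb = 14.0      # e.g., a 7B model in FP16 (assumption)
pcie_gb_per_s = 16.0   # effective PCIe bandwidth (assumption)
compute_ms = 30.0      # GPU compute per token (assumption)

transfer_ms = weights_gb / pcie_gb_per_s * 1000
total_ms = transfer_ms + compute_ms
print(f"transfer share of per-token latency: {transfer_ms / total_ms:.3f}")
```

Even with these conservative numbers the transfer accounts for the overwhelming majority of per-token latency, consistent with the dominance of weight transfer reported above.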

July 28, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU

Extensive Reading Author Info Ying Sheng: She got her Ph.D. in Computer Science at Stanford University (Centaur), where she was advised by Clark Barrett. Before that, she received an M.S. in Computer Science from Columbia University in 2017 and a B.E. in Computer Science and Technology from ACM Honored Class, Shanghai Jiao Tong University in 2016. Lianmin Zheng: He is a member of technical staff at xAI. His research interests include machine learning systems, large language models, compilers, and distributed systems. Previously, he completed his Ph.D. at UC Berkeley, where he was advised by Ion Stoica and Joseph E. Gonzalez. Binhang Yuan (袁彬航) - Assistant Professor @ CSE, HKUST: He is an assistant professor in the Department of Computer Science & Engineering (CSE), also affiliated with World Sustainable Development Institute, at the Hong Kong University of Science and Technology (HKUST). He is leading the Relaxed System Lab. Background Prior efforts to lower resource requirements of LLM inference correspond to three directions: ...

July 25, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

LoRA Low-Rank Adaptation of Large Language Models

Extensive Reading Author Info About | Edward Hu: Edward Hu is a founding partner in a stealth AI company in Woodside, CA. He was a researcher at OpenAI and received his research training as a Ph.D. student advised by Yoshua Bengio, a recipient of the 2018 A.M. Turing Award. Before graduate school, Edward was a researcher at Microsoft, where he invented LoRA and μTransfer. Yelong Shen - Microsoft | AMiner Background The dominant paradigm in modern NLP is ...
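A minimal NumPy sketch of the low-rank adaptation idea (illustrative; real LoRA trains `A` and `B` by gradient descent with the base weight frozen, and the dimensions here are assumptions):

```python
import numpy as np

# The update to a frozen weight W is constrained to a low-rank product B @ A,
# so only r * (d_in + d_out) parameters are trainable instead of d_in * d_out.
rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8

W = rng.normal(size=(d_in, d_out))      # frozen pretrained weight
A = rng.normal(size=(r, d_out)) * 0.01  # trainable, small random init
B = np.zeros((d_in, r))                 # trainable, zero init
alpha = 16.0                            # scaling hyperparameter (assumed)

def forward(x):
    # Adapted layer: base path plus scaled low-rank update.
    return x @ W + (alpha / r) * (x @ B @ A)

x = rng.normal(size=(1, d_in))
# With B = 0 at init, the adapter is a no-op and the layer matches the
# pretrained model exactly.
print(np.allclose(forward(x), x @ W))   # True
```

The trainable parameter count is $r(d_{in} + d_{out}) = 12{,}288$ here versus $589{,}824$ for full fine-tuning of this single layer.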

July 25, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Extensive Reading Author Info Xupeng Miao Gabriele Oliaro Zhihao Zhang Xinhao Cheng Background Existing works only consider a token sequence generated by a single SSM for speculation, which cannot align well with an LLM due to the model capacity gap between them. The probability of a successful alignment between the LLM and the speculated token sequence decays exponentially with the expected alignment length. Challenges How to generate a token tree in an extremely large search space? How to verify the whole token tree in a single verification pass? Insights Simultaneously consider a diversity of speculation candidates (instead of just one as in existing approaches) to maximize speculative performance. ...
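A hypothetical sketch of the multi-candidate idea (this is not SpecInfer's actual tree-attention verification; the stub "LLM" below is a deterministic stand-in for greedy decoding): several proposed branches sharing a prefix are all checked, and the longest fully matching one is accepted.

```python
def llm_next(prefix):
    # Stub standing in for the LLM's greedy next-token choice (assumption).
    return sum(prefix) % 5

def verify_branch(prefix, branch):
    """Accept speculated tokens until the first mismatch with the verifier."""
    accepted, ctx = [], list(prefix)
    for tok in branch:
        if llm_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

prefix = [1, 2, 3]
# Token tree flattened into candidate branches (branches share prefixes).
branches = [[1, 2, 4], [1, 3, 0], [0, 1, 1]]
best = max((verify_branch(prefix, b) for b in branches), key=len)
print(best)  # [1, 2, 4]
```

With multiple candidates, the chance that at least one branch aligns with the verifier grows with tree width, which is the motivation stated in the insight above; a real implementation verifies the whole tree in one pass rather than branch by branch.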

July 25, 2025 · Last updated on August 19, 2025 · 2 min · KKKZOZ