AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Extensive Reading
Author Info: Ji Lin (Ji Lin’s Homepage); Jiaming Tang; Shang Yang | MIT EECS; Song Han - Associate Professor, MIT EECS
Background: Quantization is vital for running LLMs on edge devices.
Challenges: Quantization-aware training (QAT) is inefficient due to its high training cost. Post-training quantization (PTQ) suffers from large accuracy degradation in low-bit settings.
Insights: Not all weights in an LLM are equally important; protecting only 1% of salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. A mixed-precision format is not hardware-efficient, so we can employ activation-aware scaling instead.
Approaches: Activation-aware Weight Quantization ...
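
The core trick can be illustrated with a small PyTorch sketch (my own illustration, not the official AWQ code): per-input-channel scales are derived from calibration activation magnitudes, salient channels are scaled up before low-bit quantization, and the inverse scale is folded back so the layer output stays mathematically equivalent. The function name, the `alpha` knob, and the normalization are assumptions for illustration.

```python
import torch

def awq_scale_and_quantize(w, act_abs_mean, n_bits=4, alpha=0.5):
    """Hypothetical helper sketching activation-aware scaling.

    w            : [out_features, in_features] weight matrix
    act_abs_mean : [in_features] average |activation| per input channel,
                   collected from a small calibration set
    alpha        : assumed knob for how strongly salient channels are scaled up
    """
    # Salient input channels are identified from activations, not weights.
    s = act_abs_mean.clamp(min=1e-5) ** alpha        # per-channel scale
    s = s / (s.max() * s.min()).sqrt()               # normalize around 1

    w_scaled = w * s                                 # scale up salient channels

    # Uniform symmetric quantization to n_bits, per output channel.
    q_max = 2 ** (n_bits - 1) - 1
    step = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / q_max
    w_q = (w_scaled / step).round().clamp(-q_max - 1, q_max) * step

    # Fold 1/s back so y = (W * s) @ (x / s) matches the original layer;
    # in practice the 1/s factor is fused into the preceding operator.
    return w_q / s
```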

July 28, 2025 · Last updated on August 25, 2025 · 2 min · KKKZOZ

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Intensive Reading
Author Info: Yixin Song - Google Scholar; Zeyu Mi (糜泽羽), an associate professor at the School of Software, Shanghai Jiao Tong University (SJTU); Haotong Xie (谢昊彤); Haibo Chen [IPADS], Director of the Institute of Parallel and Distributed Systems.
Background: Local deployments focus on low latency when processing small batches. LLM inference exhibits notable sparsity in neuron activation, a phenomenon observed in both self-attention and MLP blocks. Offloading techniques leverage the CPU’s additional computational and memory resources. GPU-centric offloading uses CPU memory to store the portion of model parameters that exceeds the GPU’s capacity; this leads to substantial per-token latency, mainly due to frequent data transfers between GPU and CPU, with over 99.5% of processing time consumed by transferring LLM weights from CPU to GPU. Hybrid offloading distributes model parameters between GPU and CPU, splitting them at the Transformer-layer level: the CPU processes its layers first, then sends intermediate results to the GPU for token generation. The CPU, with larger memory but lower computational power, ends up accounting for 98% of the total computation time. ...
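
A quick back-of-envelope calculation makes the 99.5% figure plausible; all concrete numbers below (model size, GPU-resident fraction, PCIe bandwidth, GPU throughput) are my own assumptions, not values from the paper.

```python
# Per decoding step, GPU-centric offloading must pull the CPU-resident weights
# over PCIe, which dwarfs the actual GPU compute time for that token.

params       = 13e9        # assumed 13B-parameter model
bytes_per_w  = 2           # fp16 weights
gpu_fraction = 0.25        # assumed share of weights resident in GPU memory
pcie_bw      = 16e9        # ~PCIe 4.0 x16 effective bandwidth, bytes/s (assumed)
gpu_flops    = 30e12       # assumed sustained fp16 throughput, FLOP/s

transfer_time = params * bytes_per_w * (1 - gpu_fraction) / pcie_bw
compute_time  = 2 * params / gpu_flops        # ~2 FLOPs per weight per token

total = transfer_time + compute_time
print(f"transfer share of per-token latency: {transfer_time / total:.1%}")  # ~99.9%
```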

July 28, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Extensive Reading
Author Info: Ying Sheng: She received her Ph.D. in Computer Science at Stanford University (Centaur), where she was advised by Clark Barrett. Before that, she received an M.S. in Computer Science from Columbia University in 2017 and a B.E. in Computer Science and Technology from the ACM Honored Class, Shanghai Jiao Tong University, in 2016. Lianmin Zheng: He is a member of technical staff at xAI. His research interests include machine learning systems, large language models, compilers, and distributed systems. Previously, he completed his Ph.D. at UC Berkeley, where he was advised by Ion Stoica and Joseph E. Gonzalez. Binhang Yuan (袁彬航) - Assistant Professor @ CSE, HKUST: He is an assistant professor in the Department of Computer Science & Engineering (CSE), also affiliated with the World Sustainable Development Institute, at the Hong Kong University of Science and Technology (HKUST). He leads the Relaxed System Lab.
Background: Prior efforts to lower the resource requirements of LLM inference fall into three directions: ...

July 25, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Extensive Reading
Author Info: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng
Background: Existing works consider only a token sequence generated by a single SSM for speculation, which cannot align well with an LLM due to the model-capacity gap between them. The probability of a successful alignment between the LLM and the speculated token sequence decays exponentially with the expected alignment length.
Challenges: How to generate a token tree in an extremely large search space? How to verify the whole token tree in a single verification pass?
Insights: Simultaneously consider a diversity of speculation candidates (instead of just one, as in existing approaches) to maximize speculative performance. ...
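
As a rough mental model, tree verification amounts to walking the speculated token tree and accepting a branch only when it matches what the LLM itself would generate. The sketch below is a greedy, sequential simplification of that idea; SpecInfer actually scores all branches in a single pass with tree attention and also handles stochastic decoding. `TokenTree` and `llm_next_token` are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

# Maps a path of already-accepted tokens (relative to the root) to the
# candidate child tokens speculated by the SSMs at that position.
TokenTree = Dict[Tuple[int, ...], List[int]]

def verify_token_tree(prefix: List[int],
                      tree: TokenTree,
                      llm_next_token: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens accepted this round (greedy-decoding simplification)."""
    accepted: List[int] = []
    while True:
        target = llm_next_token(prefix + accepted)      # LLM's own next token
        candidates = tree.get(tuple(accepted), [])
        if target in candidates:
            accepted.append(target)                     # this branch is verified
        else:
            accepted.append(target)                     # keep one "bonus" LLM token
            return accepted
```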

July 25, 2025 · Last updated on August 19, 2025 · 2 min · KKKZOZ

EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading
Note: on arXiv and in other papers, this work is often cited under a different name: LLMCad.
Author Info: Daliang Xu (徐大亮) - Daliang Xu’s Website; Wangsong Yin - Google Scholar; Xin Jin; Mengwei Xu; Professor Xuanzhe Liu @ Peking University
Background: The Scaling Law vs. the Memory Wall: the machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”: when an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ

Efficient Memory Management for Large Language Model Serving with PagedAttention

Extensive Reading
Author Info: Woosuk Kwon, Zhuohan Li
Background: Existing systems suffer from internal and external memory fragmentation. There are three primary sources of memory waste: reserved slots for future tokens, internal fragmentation (space within an allocated memory block that will never be used), and external fragmentation (unused space between memory blocks). Existing systems also cannot exploit opportunities for memory sharing: parallel sampling, beam search, and shared prefixes could all leverage a shared KV cache to reduce the memory footprint. ...
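
A toy sketch of the paged bookkeeping that removes most of this waste: the KV cache is split into fixed-size blocks, each sequence keeps a block table, and reference counting lets sequences share physical blocks (e.g., a common prompt prefix). The class and method names below are illustrative, not vLLM’s actual API.

```python
BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed)

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of free physical blocks
        self.refcount: dict[int, int] = {}    # physical block -> reference count

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        # Share all blocks of an existing sequence (copy-on-write style),
        # e.g. for parallel sampling from the same prompt.
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def append_token(self, block_table: list[int], seq_len: int) -> None:
        # A new block is allocated only when the previous one is full, so
        # internal fragmentation is bounded by one block per sequence.
        if seq_len % BLOCK_SIZE == 0:
            block_table.append(self.allocate())
```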

July 23, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General Background
Resources: LLMs typically demand higher computational cost, higher memory-access cost, and higher memory cost.
Inference Process of LLMs: auto-regressive generation. In each generation step, the LLM takes the whole token sequence as input, including the input tokens and the previously generated tokens, and produces the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique stores and reuses previous key and value pairs within the Multi-Head Self-Attention block. ...
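
The decoding loop with a KV cache can be sketched as follows; `model` is a hypothetical callable that returns logits plus an updated cache and is not tied to any specific framework.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Greedy auto-regressive decoding with a reused KV cache (sketch)."""
    # Prefill: process the whole prompt once and build the initial cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    token = int(logits[-1].argmax())
    output = [token]

    for _ in range(max_new_tokens - 1):
        if token == eos_id:
            break
        # Decode step: only the newest token is fed in; keys/values of all
        # earlier tokens are reused from the cache instead of being recomputed.
        logits, kv_cache = model([token], kv_cache=kv_cache)
        token = int(logits[-1].argmax())
        output.append(token)
    return output
```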

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ

Orca: A Distributed Serving System for Transformer-Based Generative Models

Background: Current serving systems schedule the execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once, after processing all requests within the batch.
Challenge 1: Early-finished and late-joining requests. Requests cannot finish early: because different client requests may require different numbers of iterations to process, requests that finish earlier than others in the batch cannot return to the client, which increases their latency. ...
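
For contrast, the remedy Orca argues for, iteration-level scheduling, can be sketched as a loop that re-forms the batch on every decoding iteration; the request object with a `finished` flag and the `engine_step` callable are assumed names for this sketch.

```python
from collections import deque

def serve(engine_step, waiting: deque, max_batch: int):
    """Toy iteration-level scheduler: batches are rebuilt every iteration."""
    running = []
    while waiting or running:
        # Late-joining requests are admitted as soon as a batch slot opens.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        engine_step(running)  # run exactly one decoding iteration for the batch

        # Early-finished requests leave the batch (and return to clients)
        # immediately instead of waiting for the whole batch to complete.
        running = [r for r in running if not r.finished]
```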

July 2, 2025 · Last updated on August 19, 2025 · 5 min · KKKZOZ