LLM in a Flash: Efficient Large Language Model Inference with Limited Memory

Intensive Reading

Author Info: Keivan Alizadeh (Google Scholar); Iman Mirzadeh, an ML Research Engineer at Apple.

Background: LLMs are hard for personal devices to load. The standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference, but this severely limits the maximum model size that can be run.

Challenges: The primary challenge is that the memory footprint of large language models (LLMs) often exceeds the limited DRAM capacity of personal devices. While storing models on high-capacity flash memory is a potential solution, it introduces two new major challenges: ...
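To make the DRAM constraint concrete, here is a minimal back-of-envelope sketch; the model sizes and the 8 GB DRAM budget are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope: fp16 weight footprint vs. device DRAM.
# The model sizes and DRAM capacity below are illustrative assumptions,
# not numbers taken from the paper.

BYTES_PER_PARAM_FP16 = 2  # half-precision weights take 2 bytes each

def weights_gb(num_params_billion: float) -> float:
    """Approximate weight footprint in GB for an fp16 model."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM_FP16 / 1e9

device_dram_gb = 8  # e.g., a phone or small laptop (assumption)

for params_b in (1, 7, 13):
    need = weights_gb(params_b)
    # Rule of thumb: leave roughly half of DRAM for the OS and other apps.
    verdict = "fits" if need <= device_dram_gb / 2 else "does not fit"
    print(f"{params_b}B params -> ~{need:.0f} GB of fp16 weights; {verdict} in {device_dram_gb} GB DRAM")
```

Even a 7B model at half precision already needs roughly 14 GB of weights, well beyond the usable DRAM of most personal devices.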

July 30, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Intensive Reading

Author Info: Yixin Song (Google Scholar); Zeyu Mi (糜泽羽), an associate professor at the School of Software, Shanghai Jiao Tong University (SJTU); Haotong Xie (谢昊彤); Haibo Chen (IPADS), Director of the Institute of Parallel and Distributed Systems.

Background: Local deployments focus on low latency when processing small batches. LLM inference exhibits notable sparsity in neuron activation, a phenomenon observed in both self-attention and MLP blocks. Offloading techniques leverage the CPU's additional computational and memory resources. GPU-centric offloading uses CPU memory to store the portion of model parameters that exceeds the GPU's capacity; this leads to substantial per-token latency, mainly due to frequent data transfers between GPU and CPU, with over 99.5% of processing time consumed by transferring LLM weights from CPU to GPU. Hybrid offloading distributes model parameters between GPU and CPU, splitting them at the Transformer layer level: the CPU processes its layers first, then sends intermediate results to the GPU for token generation. The CPU, with higher memory capacity but lower computational power, ends up handling 98% of the total computation time. ...
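The transfer-bound behavior of GPU-centric offloading can be illustrated with a rough estimate. The numbers below (model size, PCIe bandwidth, GPU compute time) are assumptions made for this sketch, not measurements from the paper.

```python
# Rough estimate of why GPU-centric offloading is dominated by weight transfers.
# All numbers here (model size, PCIe bandwidth, compute time) are assumptions
# for illustration, not measurements from the paper.

weights_to_stream_gb = 80       # weights that do not fit on the GPU (assumption)
pcie_bandwidth_gb_s = 25        # effective host-to-device bandwidth (assumption)
gpu_compute_per_token_s = 0.05  # compute time once weights are resident (assumption)

# If the overflow weights must be re-transferred for every generated token,
# the transfer time dwarfs the compute time.
transfer_per_token_s = weights_to_stream_gb / pcie_bandwidth_gb_s
total_per_token_s = transfer_per_token_s + gpu_compute_per_token_s

print(f"transfer per token: {transfer_per_token_s:.2f} s")
print(f"compute per token : {gpu_compute_per_token_s:.2f} s")
print(f"fraction of time spent on transfers: {transfer_per_token_s / total_per_token_s:.1%}")
```

Under these assumptions, transfers alone account for roughly 98% of per-token time, in line with the transfer-bound behavior described above.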

July 28, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Extensive Reading

Author Info: Ying Sheng, who received her Ph.D. in Computer Science at Stanford University (Centaur), advised by Clark Barrett; before that, she received an M.S. in Computer Science from Columbia University in 2017 and a B.E. in Computer Science and Technology from the ACM Honored Class, Shanghai Jiao Tong University, in 2016. Lianmin Zheng, a member of technical staff at xAI; his research interests include machine learning systems, large language models, compilers, and distributed systems; previously, he completed his Ph.D. at UC Berkeley, where he was advised by Ion Stoica and Joseph E. Gonzalez. Binhang Yuan (袁彬航) – Assistant Professor @ CSE, HKUST; he is an assistant professor in the Department of Computer Science & Engineering (CSE), also affiliated with the World Sustainable Development Institute, at the Hong Kong University of Science and Technology (HKUST), where he leads the Relaxed System Lab.

Background: Prior efforts to lower the resource requirements of LLM inference fall into three directions: ...

July 25, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ