PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Intensive Reading

Author Info

- Yixin Song - Google Scholar
- Zeyu Mi (糜泽羽): associate professor at the School of Software, Shanghai Jiao Tong University (SJTU)
- Haotong Xie (谢昊彤)
- Haibo Chen [IPADS]: Director of the Institute of Parallel and Distributed Systems

Background

- Local deployments focus on low latency when processing small batches.
- LLM inference exhibits notable sparsity in neuron activation, a phenomenon observed in both self-attention and MLP blocks (a small measurement sketch follows this list).
- Offloading techniques leverage the CPU’s additional computational and memory resources:
  - GPU-centric offloading uses CPU memory to store the portion of the model parameters that exceeds the GPU’s capacity. This leads to substantial per-token latency, mainly due to frequent data transfers between GPU and CPU: over 99.5% of processing time is consumed by transferring LLM weights from CPU to GPU.
  - Hybrid offloading distributes model parameters between GPU and CPU, splitting them at the Transformer layer level. The CPU processes its layers first, then sends intermediate results to the GPU for token generation. The CPU, with larger memory but lower computational power, ends up accounting for 98% of the total computation time (a layer-split sketch also follows below).
- ...
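To make the activation-sparsity claim concrete, here is a minimal sketch (my own illustration in PyTorch, not code from the paper) of how one could measure per-token neuron inactivity in a ReLU-style FFN block. The layer sizes are hypothetical LLaMA-like values; with randomly initialized weights the measured inactivity is only around 50%, whereas trained ReLU-based LLMs exhibit much higher sparsity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim, ffn_dim = 4096, 11008   # hypothetical LLaMA-7B-like sizes
ffn = nn.Sequential(
    nn.Linear(hidden_dim, ffn_dim),
    nn.ReLU(),                      # assumes a ReLU-style FFN; gated SiLU FFNs would need "ReLUfication" first
    nn.Linear(ffn_dim, hidden_dim),
)

x = torch.randn(8, hidden_dim)      # 8 token hidden states
with torch.no_grad():
    neuron_out = ffn[1](ffn[0](x))  # intermediate neuron activations, shape (8, ffn_dim)

# Fraction of neurons per token that stayed inactive (exactly zero after ReLU).
inactivity = (neuron_out == 0).float().mean(dim=-1)
print(f"mean per-token neuron inactivity: {inactivity.mean():.2%}")
```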
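And a minimal sketch (again a toy illustration assuming PyTorch, not PowerInfer’s or llama.cpp’s implementation) of the layer-level split used by hybrid offloading: the first layers live and run on the CPU, the remaining layers on the GPU, and only the intermediate hidden state crosses the bus. It falls back to CPU-only when no GPU is present so the sketch stays runnable.

```python
import torch
import torch.nn as nn

n_layers, split, d_model = 8, 6, 512          # hypothetical sizes: 6 of 8 layers on the CPU
layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    for _ in range(n_layers)
])

gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for i, layer in enumerate(layers):
    layer.to(torch.device("cpu") if i < split else gpu)

def forward(hidden: torch.Tensor) -> torch.Tensor:
    # The CPU processes its layers first ...
    for layer in layers[:split]:
        hidden = layer(hidden)
    # ... then the intermediate result is shipped across the bus to the GPU,
    # which runs the remaining layers.
    hidden = hidden.to(gpu)
    for layer in layers[split:]:
        hidden = layer(hidden)
    return hidden

with torch.no_grad():
    out = forward(torch.randn(1, 16, d_model))  # batch of 1, sequence of 16 tokens
print(out.shape, out.device)
```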