Fast On-device LLM Inference with NPUs

Intensive Reading

Author Info

Daliang Xu (徐大亮) - Daliang Xu's Website: an incoming Assistant Professor at BUPT.
Hao Zhang - Google Scholar: author of EdgeLLM.
Mengwei Xu: an Associate Professor at BUPT.
Professor Xuanzhe Liu @ Peking University: an Endowed Boya Distinguished Professor at the School of Computer Science, Peking University.

Background

The prefill stage is often the bottleneck in typical mobile applications. (This is a constraint of the setting the paper chooses; in most cases, isn't the decoding stage still the bottleneck?) Modern mobile SoCs ubiquitously include mobile neural processing units (NPUs) that are well suited to integer operations, such as INT8-based matrix multiplication. ...
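To make "INT8-based matrix multiplication" concrete, here is a minimal pure-Python sketch. It assumes symmetric per-tensor quantization, which is a generic textbook scheme and not necessarily what the paper uses (its quantization details are not covered in this excerpt). FP32 inputs are mapped to the int8 range [-127, 127], multiplied with integer accumulation (Python ints stand in for the int32 accumulators an NPU would use), and the result is rescaled back to float. All function names are hypothetical.

```python
# Sketch only: symmetric per-tensor INT8 quantized matmul (assumed scheme,
# not the paper's implementation). Function names are hypothetical.

def quantize_int8(matrix):
    """Map floats to ints in [-127, 127] using one scale for the whole tensor."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def int8_matmul(qa, qb):
    """Integer-only matmul; Python ints emulate int32 accumulation."""
    n, k, m = len(qa), len(qb), len(qb[0])
    assert all(len(row) == k for row in qa), "inner dimensions must match"
    return [[sum(qa[i][p] * qb[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def quantized_matmul(a, b):
    """Quantize both operands, multiply in integers, then dequantize."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = int8_matmul(qa, qb)
    # Each integer product carries the combined scale sa * sb.
    return [[v * sa * sb for v in row] for row in acc]


def float_matmul(a, b):
    """Reference FP matmul for comparison."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


if __name__ == "__main__":
    a = [[0.5, -1.2], [2.0, 0.3]]
    b = [[1.0, 0.25], [-0.7, 1.5]]
    print(quantized_matmul(a, b))  # close to, but not exactly, the FP result
    print(float_matmul(a, b))
```

A single per-tensor scale is the simplest choice, but one large outlier inflates the scale and coarsens every other entry, which is one reason finer-grained (per-channel or per-group) scales are often used in practice.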

August 4, 2025 · Last updated on August 18, 2025 · 3 min · KKKZOZ