PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Intensive Reading

Author Info
- Zhenliang Xue: from IPADS.
- Yixin Song: first author of PowerInfer.
- Zeyu Mi (糜泽羽): associate professor at the School of Software, Shanghai Jiao Tong University (SJTU).
- Haibo Chen [IPADS]: director of the Institute of Parallel and Distributed Systems.

Background

Sparsity

FFN weights account for a large share of an LLM's parameters, and their activations are markedly sparse (especially when ReLU is used), so a predictor can be run before the computation to forecast which neurons will be activated, cutting both compute and I/O overhead (a minimal predictor sketch follows after this excerpt).

PowerInfer-2 also explores the dynamic sparsity of LLM inference (see the batch-density sketch after this excerpt):
- With a large batch, a neuron is non-sparse in a given step as long as at least one input in the batch activates it. Since different inputs activate different neurons, the aggregate effect activates a large number of neurons, forming stable, dense "hotspots" and significantly lowering overall sparsity.
- Because some sequences terminate earlier than others, the effective batch size also fluctuates dynamically. This real-time variation causes the model's computation pattern to transition smoothly from a near-dense pattern to a highly sparse one over the lifetime of a task.

Mobile Hardware Characteristics

Compared with PCs, smartphone hardware has two distinctive characteristics:

Heterogeneous computing capabilities with distinct sparse computation characteristics:
- The CPU is better suited to sparse computation.
- The NPU is better suited to dense computation.
- The GPU is slower than both the CPU and NPU, and using it for inference degrades the device's rendering frame rate.
- A mobile LLM inference framework should therefore exploit the heterogeneous processors together to maximize use of the shared memory bandwidth.

Distinct storage architecture with unique I/O characteristics:
- Larger read block sizes yield higher throughput.
- A smaller accessed data range yields higher throughput.
- Reads issued from higher-frequency CPU cores achieve higher throughput.
- UFS storage has limited concurrency.
...
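A minimal sketch of the predictor-gated sparse FFN idea described above. All shapes, the low-rank predictor (A, B), and the threshold are hypothetical illustrations: PowerInfer-2's actual predictors are small trained networks, and its real kernels operate on weights fetched from flash rather than in-memory NumPy arrays.

```python
# Sketch: predict which FFN neurons fire, then compute only those rows.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, rank = 64, 256, 8          # hypothetical toy sizes

# FFN weights: up projection W1 (d_ffn x d_model), down projection W2.
W1 = rng.standard_normal((d_ffn, d_model))
W2 = rng.standard_normal((d_model, d_ffn))

# Low-rank predictor: a cheap proxy for sign(ReLU(W1 @ x)).
A = rng.standard_normal((rank, d_model))
B = rng.standard_normal((d_ffn, rank))

def ffn_sparse(x, threshold=0.0):
    # 1. Predict active neurons without touching most of W1:
    #    O(rank * (d_model + d_ffn)) instead of O(d_model * d_ffn).
    scores = B @ (A @ x)
    active = np.flatnonzero(scores > threshold)
    # 2. Compute (and, on a phone, load from flash) only the predicted-active
    #    rows of W1 and the matching columns of W2.
    h = np.maximum(W1[active] @ x, 0.0)    # ReLU on the active subset only
    return W2[:, active] @ h

y = ffn_sparse(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```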
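The batch-aggregation effect on sparsity can be seen with a back-of-envelope simulation. The masks here are synthetic and independent (real activation patterns are correlated and form hotspots), but the trend matches the excerpt's claim: union density approaches 1 - (1 - p)^batch.

```python
# Sketch: a neuron is needed in a step if ANY sequence in the batch
# activates it, so the union of per-token masks densifies quickly.
import numpy as np

rng = np.random.default_rng(0)
d_ffn, per_token_density = 4096, 0.10      # each token fires ~10% of neurons

for batch in (1, 4, 16, 64):
    masks = rng.random((batch, d_ffn)) < per_token_density
    union_density = masks.any(axis=0).mean()
    print(f"batch={batch:3d}  union density={union_density:.3f}")

# Expected trend: 0.10 -> ~0.34 -> ~0.81 -> ~0.999, i.e. 1-(1-0.1)^batch.
# This is why prefill / large-batch compute looks dense (NPU-friendly)
# while single-token decoding stays sparse (CPU-friendly).
```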

July 29, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ

EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading

Author Info
- Daliang Xu (徐大亮) - Daliang Xu's Website
- Wangsong Yin - Google Scholar
- Xin Jin
- Mengwei Xu
- Professor Xuanzhe Liu @ Peking University

Background

The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM's parameter count consistently improves its accuracy and can lead to new, emergent abilities. On mobile devices, however, this "scaling law" runs into a "memory wall": when an LLM is too large to fit into a device's memory, inference latency increases dramatically, by as much as 59-224x.
...
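The post's title names speculative decoding; for orientation, here is a minimal greedy-verification sketch of the general technique, not EdgeLLM's specific algorithm (the excerpt does not detail it). `draft_model` and `target_model` are hypothetical next-token callables standing in for a small on-device draft model and the large target model.

```python
# Sketch: draft k tokens cheaply, verify with the target, roll back on the
# first mismatch. Greedy acceptance only; real systems use stochastic
# acceptance rules and a single batched target pass for the verify phase.
from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],   # hypothetical greedy draft
    target_model: Callable[[List[int]], int],  # hypothetical greedy target
    k: int,
    max_new_tokens: int,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # Draft phase: k cheap autoregressive steps on the small model.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # Verify phase: with a real LLM this is ONE forward pass of the
        # large model over all k drafted positions, not k separate calls.
        accepted = 0
        for i in range(k):
            if target_model(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out += draft[:accepted]
        if accepted < k:
            # First disagreement: emit the target's token instead, so every
            # iteration makes progress even when the draft is always wrong.
            out.append(target_model(out))
    return out[: len(prefix) + max_new_tokens]
```

The speedup comes from the verify phase amortizing one large-model pass over several tokens; when the draft model agrees often, most tokens cost only draft-model latency, which is what makes the approach attractive under mobile memory constraints.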

July 23, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ