Dynamic Sparse Attention on Mobile SoCs
Extensive Reading

Author Info
Wangsong Yin - Google Scholar
Daliang Xu (徐大亮) - Daliang Xu's Website
Mengwei Xu

Background
State-of-the-art on-device inference frameworks fall back to the CPU/GPU for the attention operation. This is necessary for accuracy, but it causes resource contention and degrades user experience. Running the full attention operation directly on the NPU is not a viable alternative: attention is highly sensitive to quantization, so the NPU's low-precision integer compute leads to significant accuracy degradation (an 18 pp average drop). Applying traditional sparse attention on the CPU/GPU to lessen the workload also yields minimal performance gain, because the estimation stage required to find the important tokens becomes the new computational bottleneck (see the sketch below).

Insights
Compute sparse attention accurately and efficiently in NPU-centric LLM inference ...
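To make the estimation bottleneck concrete, here is a minimal, hypothetical sketch of generic two-stage ("estimate, then compute") sparse attention for a single decode step. It is not the paper's method; the function name, shapes, and `k_keep` parameter are illustrative assumptions. The point it shows: the estimation stage still scores every cached token, so it costs roughly as much as full attention scoring, which is why sparsifying only the second stage on the CPU/GPU saves little.

```python
# Illustrative sketch only -- NOT the paper's implementation.
import numpy as np

def sparse_attention_decode(q, K, V, k_keep=64):
    """q: (d,) query; K, V: (n, d) cached keys/values; k_keep: tokens kept."""
    # Stage 1: estimation -- score all n cached tokens to find the important ones.
    # This pass is O(n*d), the same asymptotic cost as scoring full attention,
    # so it can become the new bottleneck when run on the CPU/GPU.
    scores = K @ q / np.sqrt(q.shape[0])
    keep = np.argpartition(scores, -k_keep)[-k_keep:]  # indices of top-k tokens

    # Stage 2: sparse attention -- softmax and weighted sum over kept tokens only.
    kept_scores = scores[keep]
    weights = np.exp(kept_scores - kept_scores.max())
    weights /= weights.sum()
    return weights @ V[keep]  # (d,) attention output

# Usage: n = 4096 cached tokens, head dimension d = 128.
q = np.random.randn(128).astype(np.float32)
K = np.random.randn(4096, 128).astype(np.float32)
V = np.random.randn(4096, 128).astype(np.float32)
out = sparse_attention_decode(q, K, V, k_keep=64)
```

Only Stage 2 benefits from sparsity in this pattern; Stage 1 touches the entire KV cache regardless of how few tokens are ultimately kept.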