EdgeLLM Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading Author Info Daliang Xu (徐大亮) - Daliang Xu’s Website ‪Wangsong Yin‬ - ‪Google Scholar‬ Xin Jin Mengwei Xu Professor Xuanzhe Liu @ Peking University Background The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”. When an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ