EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
Extensive Reading

Author Info
- Daliang Xu (徐大亮) - Daliang Xu's Website
- Wangsong Yin - Google Scholar
- Xin Jin
- Mengwei Xu
- Professor Xuanzhe Liu @ Peking University

Background

The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM's parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this "scaling law" is challenged on mobile devices by a "memory wall": when an LLM is too large to fit into a device's memory, inference latency increases dramatically, by as much as 59-224x. ...
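Since the paper's title centers on speculative decoding, a minimal sketch of the general technique may help. This is a toy illustration only, not EdgeLLM's implementation: `draft_model` and `target_model` are hypothetical stand-ins for a small on-device draft model and the large target model, and verification here is simplified to greedy agreement rather than full rejection sampling.

```python
import random

random.seed(0)

def target_model(prefix):
    # Toy stand-in for the large model: deterministic greedy next token.
    return (sum(prefix) * 7 + 3) % 10

def draft_model(prefix):
    # Toy stand-in for the small draft model: agrees with the target
    # most of the time, occasionally diverges.
    t = target_model(prefix)
    return t if random.random() < 0.8 else (t + 1) % 10

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2. Verify: the target scores all k positions in one pass
        #    (here simulated sequentially) and we keep the longest
        #    prefix that matches the target's greedy choices.
        accepted = 0
        for i in range(k):
            if target_model(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        # 3. On a mismatch (or after full acceptance) the target
        #    itself emits one guaranteed-correct token.
        out.append(target_model(out))
    return out[len(prompt):len(prompt) + n_tokens]

print(speculative_decode([1, 2, 3], 8))
```

With greedy verification, the output is identical to decoding with the target model alone; the speedup comes from the target verifying k drafted tokens per pass instead of generating one token per pass.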