EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices

Extensive Reading Author Info Rongjie Yi - Google Scholar Liwei Guo (Homepage): a tenure-track Assistant Professor at UESTC. Mengwei Xu Background Challenges End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation). Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs. Dynamic routing obscures which experts will be needed, making prefetching hard and naive caching ineffective when activations are balanced. Tiny RAM budgets (~1.5–3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing. Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan. Insights Non-expert weights are held in device memory, while expert weights are held on external storage and fetched into memory only when activated. ...
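A minimal sketch of that insight, assuming a size-bounded expert buffer with LRU eviction; the class and the `load_expert_from_storage` helper are illustrative names, not EdgeMoE's actual implementation:

```python
from collections import OrderedDict

# Hypothetical sketch: non-expert weights stay resident in RAM, expert weights
# live on flash and are fetched into a small buffer only when the router
# activates them; least-recently-used experts are evicted to avoid thrashing.

class ExpertBuffer:
    def __init__(self, capacity_bytes, load_expert_from_storage):
        self.capacity = capacity_bytes
        self.load = load_expert_from_storage   # reads one expert's weights from storage
        self.used = 0
        self.cache = OrderedDict()             # expert_id -> (weights, size_bytes)

    def get(self, expert_id):
        if expert_id in self.cache:            # hit: reuse and mark as recently used
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id][0]
        weights, size = self.load(expert_id)   # miss: fetch from slow external storage
        while self.cache and self.used + size > self.capacity:
            _, (_, evicted) = self.cache.popitem(last=False)   # evict LRU expert
            self.used -= evicted
        self.cache[expert_id] = (weights, size)
        self.used += size
        return weights
```

Under a ~1.5–3 GB RAM budget the buffer holds only a handful of experts, so the eviction policy (and any prefetching informed by routing statistics) dominates the I/O cost per token.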

August 24, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading In arXiv and other papers this work is often cited under a different name: LLMCad. Author Info Daliang Xu (徐大亮) - Daliang Xu’s Website Wangsong Yin - Google Scholar Xin Jin Mengwei Xu Professor Xuanzhe Liu @ Peking University Background The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”. When an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...
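A rough sketch of the speculative-decoding loop such systems build on (greedy variant for simplicity; `next_token` and `next_tokens` are placeholder APIs, not the paper's interface): a small draft model that fits in memory proposes a few tokens, and the large target model verifies them together, keeping the longest agreeing prefix.

```python
def speculative_decode(draft_model, target_model, prompt_ids, max_new_tokens=64, k=4):
    """Greedy speculative decoding sketch; model objects are hypothetical."""
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1) The cheap draft model proposes k candidate tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))
        # 2) The large target model scores all k positions in one pass,
        #    returning the token it would itself pick at each position.
        verified = target_model.next_tokens(tokens, draft)
        # 3) Accept the longest prefix where draft and target agree, then
        #    take the target's own token at the first disagreement.
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        tokens += draft[:accepted]
        if accepted < k:
            tokens.append(verified[accepted])
    return tokens
```

The appeal on mobile devices is that the memory-hungry target model is invoked once per batch of drafted tokens instead of once per token, which amortizes the cost of crossing the memory wall.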

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ