Extensive Reading

Author Info

Background

Challenges

  • End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation).
  • Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs.
  • Dynamic routing obscures which experts will be needed, making prefetch hard and naive caching ineffective when activations are balanced.
  • Tiny RAM budgets (~1.5–3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing.
  • Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan.

Insights

Non-expert weights are held in device memory, while expert weights are held on external storage and fetched into memory only when activated.

  • Motivated by the key observation that expert weights are bulky but infrequently used due to sparse activation.

Designs:

  • Expert-wise bitwidth adaptation: experts differ in how much compression they can tolerate, so the more tolerant experts can be quantized more aggressively.
  • In-memory expert management: although expert activation frequencies within a single layer are close to uniform, the sequence of experts a token activates across all layers follows a power-law distribution. That is, given the two experts already activated on the current path, the next activated expert can be predicted with high probability, which naturally enables a predict-then-prefetch pipeline.

Also summarizing the key observations the paper highlights:

  • Expert weights bloat device memory
  • Expert weights are bulky but cold
  • Expert activation paths follow a power-law distribution


Note
  • The experts in the same layer have a similar chance to be activated.

  • The 20% most frequently activated paths contribute to more than 99% of the activation cases of all tokens.

    I am still a bit confused about the difference between these two; presumably the first is about the per-layer marginal distribution (roughly uniform), while the second is about full cross-layer activation paths (highly skewed).

Approaches

EdgeMoE permanently hosts all hot (non-expert) weights in memory, since they are used for every token's inference, while the rest of the memory budget is used as an expert buffer for the cold expert weights.
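
A minimal sketch of that budget split, with made-up tensor names, sizes, a hypothetical `ram_budget_bytes`, and an assumed per-expert footprint; it only illustrates how the residual RAM becomes the expert buffer's capacity.

```python
# Sketch: split a tight RAM budget between always-resident (hot) weights and an
# expert buffer for on-demand (cold) expert weights. All names and sizes are
# illustrative assumptions, not EdgeMoE's actual implementation.

def plan_memory(weights: dict, ram_budget_bytes: int, avg_expert_bytes: int):
    """weights maps tensor name -> size in bytes; expert tensors contain 'expert'."""
    hot = {name: sz for name, sz in weights.items() if "expert" not in name}
    hot_bytes = sum(hot.values())
    if hot_bytes > ram_budget_bytes:
        raise MemoryError("non-expert weights alone exceed the RAM budget")
    buffer_bytes = ram_budget_bytes - hot_bytes
    buffer_slots = buffer_bytes // avg_expert_bytes   # quantized experts held at once
    return hot_bytes, buffer_bytes, buffer_slots

# Toy example: 1.2 GiB of attention/router weights, 2 GiB RAM budget, 40 MiB per expert.
hot, buf, slots = plan_memory(
    {"attn.qkv": 900 << 20, "router": 300 << 20, "expert.0.ffn": 160 << 20},
    ram_budget_bytes=2 << 30,
    avg_expert_bytes=40 << 20,
)
print(f"hot={hot >> 20} MiB, buffer={buf >> 20} MiB, slots={slots}")
```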

Expert-wise Quantization

The paper uses fairly basic channel-wise linear quantization.
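
A minimal sketch of per-output-channel symmetric linear quantization (NumPy, assuming an [out, in] weight layout); EdgeMoE's actual kernel format is not captured in these notes.

```python
import numpy as np

def quantize_channelwise(w: np.ndarray, bits: int):
    """Symmetric per-output-channel linear quantization of an [out, in] weight matrix.
    Returns integer codes plus one scale per output channel (an illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for INT4, 1 for INT2
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Dequantization is one multiply per channel, which keeps it cheap on CPUs/NPUs.
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_channelwise(w, bits=4)
print(np.abs(dequantize(q, s) - w).max())           # reconstruction error
```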

  • First push each expert individually to a low bitwidth (e.g., INT2) and measure the overall accuracy drop; the drop serves as an inverse indicator of that expert's importance, yielding a sensitivity ranking of the experts.
  • Apply a common base quantization level to all experts.
  • Starting from the K least important experts, lower their bitwidths and re-validate accuracy repeatedly, until the accuracy loss reaches the user-tolerable threshold P (see the sketch after the note below).
Note
  • The offline stage iteratively enumerates bitwidth configurations and adjusts the per-expert assignment until the model's accuracy drop satisfies the threshold.
  • The online inference stage simply uses the tuned configuration as-is.
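
A sketch of the offline search as I understand it, assuming a hypothetical `evaluate(config)` accuracy oracle and illustrative bitwidths and batch size K; this is not the paper's exact procedure.

```python
# Illustrative greedy expert-wise bitwidth assignment.

def search_bitwidths(experts, evaluate, base_bits=4, lower_bits=2,
                     tolerable_loss_p=0.05, k=4):
    baseline = evaluate({e: 16 for e in experts})        # unquantized reference accuracy

    # Step 1: per-expert sensitivity = accuracy drop when that expert alone is
    # pushed to the lowest bitwidth; a smaller drop means more compressible.
    def sensitivity(e):
        cfg = {x: 16 for x in experts}
        cfg[e] = lower_bits
        return baseline - evaluate(cfg)

    ranked = sorted(experts, key=sensitivity)            # least sensitive first

    # Step 2: every expert starts at the common base quantization level.
    cfg = {e: base_bits for e in experts}

    # Step 3: lower K experts at a time; keep each change only while the total
    # accuracy loss stays within the user-tolerable threshold P.
    for i in range(0, len(experts), k):
        trial = {**cfg, **{e: lower_bits for e in ranked[i:i + k]}}
        if baseline - evaluate(trial) <= tolerable_loss_p:
            cfg = trial
        else:
            break
    return cfg

# Toy demo with a fake accuracy model: quantizing "important" experts hurts more.
experts = list(range(8))
importance = {0: .84, 1: .76, 2: .42, 3: .26, 4: .51, 5: .40, 6: .78, 7: .30}
fake_eval = lambda cfg: 0.80 - sum(importance[e] * (16 - b) / 1000 for e, b in cfg.items())
print(search_bitwidths(experts, fake_eval, tolerable_loss_p=0.055))
```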

In-memory Expert Management

Given prior knowledge of the expert activations in layers 0..n − 1, we can estimate the activation probability of each expert at the n-th layer with good confidence.

  • Offline profiling builds a lookup table: the key is the pair of experts activated in the preceding two layers, and the value is the predicted expert(s) for the next layer.
  • At inference time, an online predict-then-prefetch pipeline prefetches the 1–3 most likely experts into the buffer in descending probability order, overlapped with the current layer's computation; if the prediction hits, the I/O is hidden, and on a miss the system falls back to on-demand loading.
  • An expert cache is maintained in memory, with eviction decided by a joint score over two signals (sketched after this list):
    • historical activation frequency
    • layer distance (experts close to the currently executing layer are preferentially kept)
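
A compact sketch of the predictor lookup plus the jointly scored expert cache, assuming a hypothetical `load_expert(layer, expert)` loader, (layer, expert) cache keys, and score weights alpha/beta; none of these specifics come from the paper.

```python
import collections, threading

class ExpertCache:
    def __init__(self, capacity, alpha=1.0, beta=0.5):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.weights = {}                       # (layer, expert) -> resident weights
        self.freq = collections.Counter()       # historical activation counts

    def get(self, key):
        self.freq[key] += 1                     # record the activation
        return self.weights.get(key)            # None -> miss, load on demand

    def score(self, key, current_layer, num_layers):
        layer, _ = key
        distance = (layer - current_layer) % num_layers   # how soon this layer runs again
        return self.alpha * self.freq[key] - self.beta * distance

    def put(self, key, tensor, current_layer, num_layers):
        if key in self.weights:
            return
        if len(self.weights) >= self.capacity:  # evict the lowest-scored resident expert
            victim = min(self.weights,
                         key=lambda k: self.score(k, current_layer, num_layers))
            del self.weights[victim]
        self.weights[key] = tensor

def prefetch_next_layer(cache, predictor, path, layer, num_layers, load_expert, top=2):
    """predictor: dict built from offline profiling, mapping the last two activated
    experts on the path to ranked candidate experts for the next layer."""
    candidates = predictor.get(tuple(path[-2:]), [])[:top]

    def worker():                               # overlap expert I/O with current compute
        for expert in candidates:
            key = (layer + 1, expert)
            if key not in cache.weights:
                cache.put(key, load_expert(layer + 1, expert), layer, num_layers)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t                                    # caller may join before layer+1 executes
```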


Evaluation

Thoughts

When Reading

A "hot take" from reading this: all work that requires offline profiling should really be done by the LLM vendors themselves at training time.

This is quite similar to LLM as a System Service on Mobile Devices (LLMS): both exploit the fact that some components tolerate compression better than others, quantizing the high-tolerance parts more aggressively and the low-tolerance parts more conservatively:

  • the different Experts in an MoE
  • the different KV Chunks in the KV Cache

But compared with LLMS, which compresses KV Chunks based on attention scores, EdgeMoE's approach is much more blunt (brute-force).