KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models
Extensive Reading

Author Info
Hongtao Chen | MADSys
Weiyu Xie | MADSys
Boxin Zhang | MADSys

Background

MoE LLMs and hybrid setups
Modern MoE models (DeepSeek, Qwen-MoE, etc.) are huge but activate only a few experts per token. On single-GPU or low-concurrency setups, it is natural to pair a small GPU with a big CPU and large DRAM.

Limitations of current hybrid / offloading systems
Tools like Fiddler or basic offloading keep attention on the GPU and push experts or whole layers to the CPU. The CPU then becomes the bottleneck: generic AMX/AVX-512 kernels run far below peak, and the GPU often sits idle waiting on the CPU. A minimal sketch of this hybrid forward pass is shown at the end of this section.

Hardware inefficiencies on CPU and NUMA
Poor weight layouts and scheduling starve the caches and AMX units. Multi-socket (NUMA) machines additionally suffer from cross-socket memory traffic and weak scaling.

Crude accuracy–latency tradeoffs in MoE
Existing accelerations often reduce or skip experts (smaller top-k, expert pruning). These approaches speed up inference but can noticeably hurt accuracy.

There are two major inefficiencies: ...
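Before detailing them, here is the minimal sketch referenced above. It is a hypothetical PyTorch toy, not KTransformers' actual code: a MoE layer whose router stays on the GPU (next to attention) while the large expert FFNs remain in CPU/DRAM, so only the activations of routed tokens cross devices. All names and dimensions (ToyHybridMoE, d_model, n_experts, etc.) are illustrative.

```python
# Hypothetical sketch of the "attention/router on GPU, experts on CPU" pattern.
# Not KTransformers' implementation; toy sizes, no batching or overlap tricks.
import torch
import torch.nn as nn

class ToyHybridMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # The router is small, so it lives on the GPU alongside attention.
        self.router = nn.Linear(d_model, n_experts).to(self.gpu)
        # Expert weights are large, so they stay in host DRAM and run on the CPU.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x_gpu):                       # x_gpu: [tokens, d_model] on GPU
        scores = self.router(x_gpu)                 # top-k routing on the GPU
        weights, idx = torch.topk(scores.softmax(-1), self.top_k, dim=-1)
        x_cpu = x_gpu.to("cpu")                     # ship activations, not weights
        out_cpu = torch.zeros_like(x_cpu)
        idx_cpu, w_cpu = idx.cpu(), weights.cpu()
        for e, expert in enumerate(self.experts):   # only selected experts compute
            token_ids, slot = (idx_cpu == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out_cpu[token_ids] += w_cpu[token_ids, slot, None] * expert(x_cpu[token_ids])
        return out_cpu.to(self.gpu)                 # back to GPU for the next layer

layer = ToyHybridMoE()
tokens = torch.randn(4, 512, device=layer.gpu)
print(layer(tokens).shape)                          # torch.Size([4, 512])
```

Even in this toy, the expert GEMMs on the CPU dominate per-token latency, which is exactly where the generic AMX/AVX-512 kernels and NUMA effects described above become the bottleneck while the GPU waits.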