KTransformers Unleashing the Full Potential of CPU GPU Hybrid Inference for MoE Models

Extensive Reading Author Info Hongtao Chen | MADSys Weiyu Xie | MADSys Boxin Zhang | MADSys Background MoE LLMs and hybrid setups Modern MoE models (DeepSeek, Qwen-MoE, etc.) are huge but activate few experts per token. On single-GPU or low-concurrency setups, we naturally pair a small GPU with a big CPU + large DRAM. Limitations of current hybrid / offloading systems Tools like Fiddler or basic offloading keep attention on the GPU and push experts or layers to the CPU. The CPU becomes the bottleneck; generic AMX/AVX-512 kernels are far from peak, and the GPU often waits on the CPU. Hardware inefficiencies on CPU and NUMA Poor weight layouts and scheduling starve caches and AMX units. Multi-socket (NUMA) machines suffer from cross-socket memory traffic and weak scaling. Crude accuracy–latency tradeoffs in MoE Existing accelerations often reduce or skip experts (smaller top-k, pruning). These approaches speed up inference but can noticeably hurt accuracy. There are two major inefficiencies: ...
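To make the hybrid placement concrete, here is a minimal PyTorch sketch (not KTransformers' actual code) of an MoE layer whose attention and router run on the GPU while the experts stay in CPU DRAM and only the active top-k experts run per token; module names, sizes, and the omitted gate weighting are simplifications, and a CUDA device is assumed.

```python
# Hypothetical hybrid MoE layer: attention + routing on GPU, experts on CPU.
# Shapes, names, and the missing gate weighting are illustrative only.
import torch
import torch.nn as nn

class HybridMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=16, batch_first=True).cuda()
        self.router = nn.Linear(d_model, n_experts).cuda()
        # Experts live in host DRAM; only a few are touched per token.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )  # stays on CPU
        self.top_k = top_k

    @torch.no_grad()
    def forward(self, x_gpu):
        h, _ = self.attn(x_gpu, x_gpu, x_gpu)               # GPU: attention
        topk = self.router(h).topk(self.top_k, dim=-1).indices   # GPU: routing, (B, T, k)
        h_cpu = h.to("cpu")                                  # ship activations, not weights
        out = torch.zeros_like(h_cpu)
        for e in topk.unique().tolist():                     # CPU: run only active experts
            mask = (topk == e).any(dim=-1).cpu()
            out[mask] += self.experts[e](h_cpu[mask])        # gate weights omitted for brevity
        return x_gpu + out.to(x_gpu.device)
```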

November 17, 2025 · Last updated on November 17, 2025 · 5 min · KKKZOZ

QServe W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Extensive Reading Author Info MIT HAN Lab Background Common quantization formats for LLMs: W8A8: 8-bit weights, 8-bit activations – almost lossless, widely deployed. W4A16: 4-bit weights, 16-bit activations – also near-lossless; good for weight memory. W4A4: 4-bit weights and activations – more aggressive, but accuracy drops and real GPU speedups are disappointing. On data center GPUs (A100, L40S), 4-bit quantization often underperforms because: Dequantization of weights or partial sums runs on slow CUDA cores, not fast tensor cores. For W4A4 systems like Atom and QuaRot, 20–90% of runtime can be eaten by dequantization in the main GEMM loop. To achieve reasonable accuracy, W4A4 must apply per-group quantization, which is finer than per-channel quantization – sharing FP16 scaling factors on a sub-channel basis ...
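As a rough illustration of the per-group scheme mentioned at the end, the sketch below quantizes a weight matrix once with one scale per output channel and once with one FP16 scale per 128-element sub-channel group; the group size and shapes are illustrative and not QServe's exact W4A8KV4 layout.

```python
# Per-channel vs. per-group symmetric 4-bit weight quantization (illustrative).
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax          # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale.half()

def quantize_per_group(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    out_ch, in_ch = w.shape
    wg = w.reshape(out_ch, in_ch // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    scale = wg.abs().amax(dim=-1, keepdim=True) / qmax        # one FP16 scale per group
    q = torch.clamp(torch.round(wg / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale.half()

w = torch.randn(4096, 4096)
q_c, s_c = quantize_per_channel(w)
q_g, s_g = quantize_per_group(w)
err_c = (q_c.float() * s_c.float() - w).abs().mean()
err_g = ((q_g.float() * s_g.float()).reshape(4096, 4096) - w).abs().mean()
print(f"mean abs error  per-channel: {err_c:.4f}  per-group: {err_g:.4f}")
```

Finer groups track local magnitude better, which is why the per-group reconstruction error comes out lower, at the cost of storing and dequantizing many more FP16 scales.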

November 16, 2025 · Last updated on November 17, 2025 · 7 min · KKKZOZ

SmoothQuant Accurate and Efficient Post-Training Quantization for Large Language Models

Extensive Reading Author Info MIT HAN Lab Background Modern large language models (LLMs) are extremely costly to serve in FP16 because of their massive parameter counts and long-context workloads; while low-bit quantization (especially INT8) is an attractive way to cut memory and latency, naïve post-training W8A8 (8-bit weights and activations) breaks down on large models due to severe activation outliers that cause large accuracy drops. Existing INT8 solutions either focus on weights only (e.g., GPTQ-style methods) or handle activation outliers with mixed precision (e.g., LLM.int8(), outlier-aware kernels); these approaches can preserve accuracy but often bring limited end-to-end gains because they leave activations/KV caches in higher precision, rely on complex custom kernels, or end up slower than plain FP16 in practical deployments. ...
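A tiny numeric example (made-up values, not from the paper) of why those activation outliers break naive W8A8: a single large channel dictates the per-tensor INT8 scale, so the remaining activations collapse onto only a few integer levels.

```python
# One outlier channel forces a huge per-tensor INT8 scale; the "normal"
# activations then quantize onto a handful of levels. Toy data.
import torch

x = torch.randn(1, 512) * 0.5       # typical activations, |x| mostly < 2
x[0, 7] = 60.0                      # a single outlier channel

scale = x.abs().max() / 127         # naive per-tensor symmetric INT8 scale
x_q = torch.clamp(torch.round(x / scale), -128, 127)
x_dq = x_q * scale

keep = torch.arange(512) != 7
print("scale:", scale.item())                                   # ~0.47, set by the outlier
print("distinct levels used by non-outliers:", x_q[0, keep].unique().numel())
print("mean abs error (non-outliers):", (x_dq - x)[0, keep].abs().mean().item())
```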

November 16, 2025 · Last updated on November 17, 2025 · 4 min · KKKZOZ

LServe Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading Author Info MIT HAN Lab Background Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice. Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...
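A quick back-of-the-envelope calculation, using a hypothetical Llama-2-7B-like shape rather than any configuration from the paper, shows why decoding becomes memory-bound at these context lengths:

```python
# KV-cache size vs. context length for an assumed 32-layer, 32-KV-head,
# head_dim=128, FP16 model. Every decode step streams this much KV.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem  # 2x for K and V

for ctx in (4_096, 131_072, 524_288):
    print(f"{ctx:>8} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB per sequence")
# 4k tokens -> 2 GiB, 128k -> 64 GiB, 512k -> 256 GiB with this config.
```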

November 15, 2025 · Last updated on February 2, 2026 · 3 min · KKKZOZ

DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Extensive Reading Author Info MIT HAN Lab Background Long-context LLMs strain attention and KV caches. As sequence length grows, prefill cost scales quadratically and decoding linearly, while KV cache memory grows linearly, making naive full-attention inference impractical in real-world long-context applications. Existing architectural and approximate-attention methods trade accuracy or require retraining. Linear-attention and specialized long-context architectures reduce complexity but often underperform standard Transformers on long-range reasoning, while methods like H2O, StreamingLLM, TOVA, and FastGen drop or sparsify tokens uniformly across heads, which can severely damage long-context retrieval accuracy and makes them difficult to apply safely in settings with KV-sharing schemes such as GQA. ...

November 13, 2025 · Last updated on February 2, 2026 · 3 min · KKKZOZ

Efficient Streaming Language Models with Attention Sinks

Extensive Reading Author Info MIT HAN Lab Background When applying LLMs to infinite input streams, two main challenges arise: the KV cache grows infinitely, which leads to excessive memory usage and decode latency, and the LLM’s performance degrades when the sequence length goes beyond the attention window size set during pre-training. Window Attention: only keep the $L$ most recent tokens in the KV cache. The model degrades dramatically once the sequence length exceeds the cache size (even if just the first token is evicted). Sliding Window with Re-computation: do not reuse the KV cache; at every step, rebuild the whole window of the last $L$ tokens and run the Transformer on that small segment from scratch. Sliding Window Example t = 1: Window: [x₁] Run the model on this length-1 sequence, use the output of x₁. t = 2: Window: [x₁, x₂] Run the model on [x₁, x₂] (full 2×2 self-attention), use the output of x₂. t = 3: Window: [x₁, x₂, x₃] Run the model on these 3 tokens (3×3 attention), use x₃. t = 4: Window slides: [x₂, x₃, x₄] Run the model again on this 3-token segment (3×3 attention), use x₄. t = 5: Window: [x₃, x₄, x₅], full 3×3 attention, use x₅. t = 6: Window: [x₄, x₅, x₆], full 3×3 attention, use x₆. Observations A surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. ...
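The baseline walked through above can be summarized in a few lines; `model` here is a hypothetical callable that runs full self-attention over a short segment and returns per-position outputs, not an API from the paper.

```python
# Sliding window with re-computation: no KV reuse, the last L tokens are
# re-encoded from scratch at every step and only the newest output is kept.
def sliding_window_recompute(tokens, model, L=3):
    outputs = []
    for t in range(1, len(tokens) + 1):
        window = tokens[max(0, t - L):t]   # last L tokens (shorter at the start)
        hidden = model(window)             # full (<=L x L) attention, recomputed
        outputs.append(hidden[-1])         # keep only the newest position's output
    return outputs

# t = 4 with L = 3 runs the model on [x2, x3, x4] and keeps x4's output,
# exactly as in the example; cost per step is a full re-encode plus O(L^2) attention.
```

Re-encoding the window at every step keeps quality stable but pays a full forward pass per generated token, which is the overhead the attention-sink approach aims to remove.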

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference

Extensive Reading Author Info MIT HAN Lab Background In long-context inference: The KV cache grows linearly with context length ($L$). At each decoding step, the model must read the entire KV cache to compute attention. Existing works recognize that a small subset of tokens can dominate the accuracy of token generation, and they choose to evict unimportant tokens: StreamingLLM keeps a sliding window plus a few “anchor” tokens. H2O, TOVA, etc., use heuristics or statistics to permanently drop “less important” tokens. Once a token is evicted, it’s gone. BUT the important tokens are query-dependent. ...
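A small toy experiment (random vectors, not the paper's data) makes the last point concrete: the top-scoring cached tokens for two different queries barely overlap, so a fixed eviction policy risks dropping exactly the tokens a later query needs.

```python
# Which cached tokens dominate attention depends on the query.
import torch

torch.manual_seed(0)
K = torch.randn(1024, 64)                      # cached keys for 1024 tokens
q1, q2 = torch.randn(64), torch.randn(64)      # two different decoding queries

top1 = (K @ q1).topk(16).indices               # tokens that dominate attention for q1
top2 = (K @ q2).topk(16).indices               # ... and for q2
overlap = len(set(top1.tolist()) & set(top2.tolist()))
print(f"overlap between the two top-16 sets: {overlap}/16")   # usually close to 0
```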

November 13, 2025 · Last updated on February 2, 2026 · 2 min · KKKZOZ

A dynamic parallel method for performance optimization on hybrid CPUs

Extensive Reading Author Info Background Hybrid CPUs (e.g., CPUs that combine performance P-cores and efficiency E-cores) perform poorly on AI inference workloads because their cores have unequal hardware capability. Traditional parallel methods split the work evenly, so the high-performance cores must wait for the low-performance ones, wasting resources. Insights Abandon the traditional "split evenly" parallel strategy in favor of a dynamic "split by capability" strategy: during parallel computation every core, strong or weak, finishes its sub-task in roughly the same time, maximizing overall CPU utilization. A performance ratio $pr_i$ is maintained dynamically for each core and each instruction set, i.e., effectively a table: Core 0 (P-core) 3.5 for AVX-VNNI / 2.0 for AVX2, Core 1 (P-core) 3.5 / 2.0, Core 2 (E-core) 1.0 / 1.0, Core 3 (E-core) 1.0 / 1.0. Work is partitioned according to $$\theta_i = \dfrac{pr_i}{\sum_j pr_j}$$ After a task finishes, $pr_i$ is adjusted dynamically from the measured execution times: $$pr_i^{\prime}=\dfrac{pr_i/t_i}{\sum_j pr_j/t_j}$$ Approaches The system consists of two major components: a CPU Runtime and a Thread Scheduler. CPU Runtime Manages CPU state and is responsible for tracking and updating each core's relative performance. Core binding: the thread pool it creates pins each thread to a specific physical core. Performance-ratio table: it maintains a performance ratio ($pr_i$) for every core, all initialized to the same value. Dynamic update: after a kernel (e.g., one matrix multiplication) finishes, the runtime records each thread's actual execution time ($t_i$) and uses the update formula (Eq. 2 in the paper) to refresh each core's $pr_i$; a filter is applied to suppress noise. ISA awareness: P-cores and E-cores differ by different amounts across instruction sets (ISAs, e.g., AVX-VNNI), so a separate performance ratio is kept per ISA. Thread Scheduler Responsible for dispatching the actual parallel compute tasks during inference (e.g., matrix multiplication or tensor copies). ...
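A minimal Python sketch of the capability-proportional split and the ratio update described above; the work item, timing, and EMA filter are stand-ins (the real runtime pins threads to physical cores, runs the chunks in parallel, and keeps one ratio table per ISA).

```python
# Capability-proportional work split plus dynamic performance-ratio update.
import time

pr = [3.5, 3.5, 1.0, 1.0]          # performance ratios, e.g. 2 P-cores + 2 E-cores

def split_rows(total_rows, pr):
    """theta_i = pr_i / sum_j(pr_j): give each core a share matching its ratio."""
    s = sum(pr)
    shares = [round(total_rows * r / s) for r in pr]
    shares[-1] = total_rows - sum(shares[:-1])   # absorb rounding in the last chunk
    return shares

def update_ratios(pr, t, alpha=0.5):
    """pr_i' = (pr_i / t_i) / sum_j(pr_j / t_j); an EMA stands in for the paper's
    noise filter. Only the proportions matter, so both terms are normalized."""
    denom = sum(p / ti for p, ti in zip(pr, t))
    new = [(p / ti) / denom for p, ti in zip(pr, t)]
    s = sum(pr)
    old = [p / s for p in pr]
    return [alpha * n + (1 - alpha) * o for n, o in zip(new, old)]

def worker(rows):
    start = time.perf_counter()
    _ = sum(i * i for i in range(rows * 1000))   # stand-in for one core's GEMM slice
    return time.perf_counter() - start

shares = split_rows(4096, pr)
times = [worker(rows) for rows in shares]        # the real runtime measures these
                                                 # on threads pinned to each core
pr = update_ratios(pr, times)
print("chunk sizes:", shares, "updated ratios:", [round(r, 3) for r in pr])
```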

November 11, 2025 · Last updated on November 17, 2025 · 1 min · KKKZOZ