Token Level Routing Inference System for Edge Devices

Extensive Reading Author Info Background Insights 1. Core Idea: Collaborative Decoding Traditional inference either runs entirely in the cloud (high latency, privacy risk) or entirely on-device (poor quality). The paper adopts a hybrid inference mode: Default path: the on-device small language model (SLM) generates most tokens. Routing mechanism: a lightweight router evaluates the SLM's "confidence" for every generated token. On-demand escalation: if the router deems the SLM's prediction for the current token unreliable (confidence below a threshold), the generation of that token is routed to the cloud-side large model (LLM). Result: the LLM corrects the critical tokens, improving overall generation quality while preserving the edge device's low latency and low energy consumption. Challenges Approaches 2. System Architecture and Implementation The authors build a complete client-server system that bridges on-device ONNX inference and a cloud serving stack. A. The Router The paper mainly adopts the CITER (Collaborative Inference with Token-level Routing) framework. Principle: an MLP (multi-layer perceptron) classifier. Input: the hidden states of the SLM's last layer. Output: a confidence score. Decision: given a threshold $\tau$, if the score is $< \tau$ the token is judged "not confident" and routed to the LLM; otherwise the SLM keeps generating (a minimal sketch of this decision appears below). B. Edge Side To run efficiently on mobile devices, the authors use the following stack: Inference engine: ONNX Runtime, which runs cross-platform (laptops, phones). Model modification (key technique): a standard ONNX model usually only outputs logits. Problem: the router needs the hidden states as its decision input. Solution: the authors wrote a script that modifies the ONNX computation graph, automatically locating the last layer's computation node and registering it as an additional graph output. Inference then yields not only the token prediction but also the feature vector used for routing. C. Cloud Side Serving engine: SGLang is used to deploy the large model (e.g., Qwen2.5-32B). Rationale: SGLang provides flexible KV-cache management and operator definitions, which suits these non-contiguous, plug-in style inference requests. D. Communication and State Management Custom API: an API format carrying the context, the current token index, the routing threshold, and SLM internal state (e.g., hidden states) is designed for edge-cloud interaction. KV-Cache challenge: since the edge and the cloud are two independent systems, the cloud has no prior KV-cache history when a request switches over from the edge. Current limitation: in the current implementation, every route to the cloud requires a fresh prefill, which significantly increases TBT (Time Between Tokens) when network latency is high or routing happens frequently. Evaluation Thoughts When Reading In the current implementation, every route to the cloud triggers a full re-prefill. Is this really the best design anyone could come up with? ...
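To make the routing decision concrete, here is a minimal sketch of a CITER-style token router: an MLP over the SLM's last-layer hidden state produces a confidence score that is compared against the threshold $\tau$. All names (`TokenRouter`, `route_token`, `cloud_decode`), layer sizes, and the sigmoid output are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
from typing import Callable

class TokenRouter(nn.Module):
    """MLP classifier: SLM last-layer hidden state -> confidence score in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (1, hidden_dim) exposed as an extra ONNX graph output
        return torch.sigmoid(self.mlp(hidden_state)).squeeze(-1)

def route_token(router: TokenRouter, hidden_state: torch.Tensor, slm_token: int,
                cloud_decode: Callable[[], int], tau: float = 0.5) -> tuple[int, bool]:
    """Keep the SLM's token if the router is confident, otherwise ask the cloud LLM."""
    confidence = router(hidden_state).item()
    if confidence < tau:                # "not confident": escalate this token
        return cloud_decode(), True     # cloud_decode stands in for the edge-cloud RPC
    return slm_token, False
```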

December 1, 2025 · Last updated on February 2, 2026 · 1 min · KKKZOZ

KTransformers Unleashing the Full Potential of CPU GPU Hybrid Inference for MoE Models

Extensive Reading Author Info Hongtao Chen | MADSys Weiyu Xie | MADSys Boxin Zhang | MADSys Background MoE LLMs and hybrid setups Modern MoE models (DeepSeek, Qwen-MoE, etc.) are huge but activate only a few experts per token. On single-GPU or low-concurrency setups, we naturally pair a small GPU with a big CPU + large DRAM. Limitations of current hybrid / offloading systems Tools like Fiddler or basic offloading keep attention on the GPU and push experts or layers to the CPU. The CPU becomes the bottleneck; generic AMX/AVX-512 kernels are far from peak, and the GPU often waits on the CPU. Hardware inefficiencies on CPU and NUMA Poor weight layouts and scheduling starve caches and AMX units. Multi-socket (NUMA) machines suffer from cross-socket memory traffic and weak scaling. Crude accuracy–latency tradeoffs in MoE Existing accelerations often reduce or skip experts (smaller top-k, pruning). These approaches speed up inference but can noticeably hurt accuracy. There are two major inefficiencies: ...
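As a rough illustration of why offloading experts to the CPU is attractive at all, the sketch below runs a top-k MoE feed-forward step in which only a small slice of the expert weights is touched per token while attention would stay on the GPU. The shapes, top-k policy, and plain NumPy implementation are assumptions for illustration, not KTransformers' AMX kernels.

```python
import numpy as np

E, K, D, D_FF = 16, 2, 256, 512                               # experts, active experts, dims
expert_w1 = np.random.randn(E, D, D_FF).astype(np.float32)    # expert weights kept in DRAM (CPU)
expert_w2 = np.random.randn(E, D_FF, D).astype(np.float32)

def moe_ffn_cpu(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    """x: (D,) token hidden state; router_logits: (E,) gate scores."""
    topk = np.argsort(router_logits)[-K:]                     # only K of E experts are active
    gates = np.exp(router_logits[topk])
    gates /= gates.sum()
    out = np.zeros(D, dtype=np.float32)
    for g, e in zip(gates, topk):
        h = np.maximum(x @ expert_w1[e], 0.0)                 # expert FFN runs on the CPU
        out += g * (h @ expert_w2[e])
    return out

x = np.random.randn(D).astype(np.float32)
logits = np.random.randn(E).astype(np.float32)
y = moe_ffn_cpu(x, logits)    # only 2 of 16 experts' weights are read for this token
```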

November 17, 2025 · Last updated on November 17, 2025 · 5 min · KKKZOZ

QServe W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Extensive Reading Author Info MIT HAN Lab Background Common quantization formats for LLMs: W8A8: 8-bit weights, 8-bit activations – almost lossless, widely deployed. W4A16: 4-bit weights, 16-bit activations – also near-lossless; good for weight memory. W4A4: 4-bit weights and activations – more aggressive, but accuracy drops and real GPU speedups are disappointing. On data center GPUs (A100, L40S), 4-bit quantization often underperforms because: Dequantization of weights or partial sums runs on slow CUDA cores, not fast tensor cores. For W4A4 systems like Atom and QuaRot, 20–90% of runtime can be eaten by dequantization in the main GEMM loop. To achieve reasonable accuracy, W4A4 must apply per-group quantization, which is finer than per-channel quantization – FP16 scaling factors are shared on a sub-channel (per-group) basis ...
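For reference, here is a minimal sketch of per-group INT4 weight quantization with FP16 scales, the finer granularity the excerpt contrasts with per-channel quantization. The group size and helper names are illustrative, not QServe's kernels.

```python
import numpy as np

def quantize_w4_per_group(w: np.ndarray, group: int = 128):
    """w: (out_ch, in_ch) weights -> INT4 codes plus one FP16 scale per group.
    Assumes in_ch is divisible by the group size."""
    out_ch, in_ch = w.shape
    w_groups = w.reshape(out_ch, in_ch // group, group)
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0 + 1e-8   # INT4 range [-7, 7]
    q = np.clip(np.round(w_groups / scales), -7, 7).astype(np.int8)      # int8 holding 4-bit codes
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # This multiply is the step that lands on slow CUDA cores inside the
    # GEMM main loop for W4A4 systems like Atom/QuaRot, per the excerpt.
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(q.shape[0], -1)

w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_w4_per_group(w)
print(np.abs(w - dequantize(q, s)).max())   # small per-group quantization error
```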

November 16, 2025 · Last updated on November 17, 2025 · 7 min · KKKZOZ

SmoothQuant Accurate and Efficient Post-Training Quantization for Large Language Models

Extensive Reading Author Info MIT HAN Lab Background Modern large language models (LLMs) are extremely costly to serve in FP16 because of their massive parameter counts and long-context workloads. Low-bit quantization (especially INT8) is an attractive way to cut memory and latency, but naïve post-training W8A8 (8-bit weights and activations) breaks down on large models due to severe activation outliers that cause large accuracy drops. Existing INT8 solutions either focus on weights only (e.g., GPTQ-style methods) or handle activation outliers with mixed precision (e.g., LLM.int8(), outlier-aware kernels). These approaches can preserve accuracy, but they often bring limited end-to-end gains because they leave activations/KV caches in higher precision, rely on complex custom kernels, or end up slower than plain FP16 in practical deployments. ...
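As a rough sketch of the smoothing idea behind the paper, migrating quantization difficulty from activations to weights with a per-channel factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ and then quantizing $X/s$ and $sW$ instead; the function names, shapes, and $\alpha$ value below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """x: (tokens, in_ch) activations, w: (in_ch, out_ch) weights."""
    act_max = np.abs(x).max(axis=0)                 # per-input-channel activation range
    w_max = np.abs(w).max(axis=1)                   # per-input-channel weight range
    s = act_max ** alpha / (w_max ** (1 - alpha) + 1e-8)
    return x / s, w * s[:, None]                    # (X/s)(sW) = XW, so the math is unchanged

def quantize_int8(t: np.ndarray):
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

x = np.random.randn(16, 512).astype(np.float32)
x[:, 3] *= 50.0                                     # one outlier channel wrecks a naive scale
w = np.random.randn(512, 512).astype(np.float32)
x_s, w_s = smooth(x, w)
assert np.allclose(x @ w, x_s @ w_s, rtol=1e-3, atol=1e-2)   # same output, flatter activations
```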

November 16, 2025 · Last updated on November 17, 2025 · 4 min · KKKZOZ

LServe Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading Author Info MIT HAN Lab Background Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice. Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...
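A quick back-of-envelope calculation makes the memory-bound decoding point concrete. The model configuration below (32 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumed Llama-style setup, not a figure from the paper.

```python
# Bytes of KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
print(per_token / 1024, "KiB of KV cache per token")          # 128.0 KiB

for ctx in (128_000, 512_000):
    print(f"{ctx:>7} tokens -> {per_token * ctx / 2**30:.1f} GiB of KV cache")
# 128000 tokens -> 15.6 GiB; 512000 tokens -> 62.5 GiB, all of which
# must be read at every decoding step if attention stays dense.
```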

November 15, 2025 · Last updated on February 2, 2026 · 3 min · KKKZOZ

DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

November 13, 2025 · Last updated on February 9, 2026 · 0 min · KKKZOZ

Efficient Streaming Language Models with Attention Sinks

Extensive Reading Author Info MIT HAN Lab Background When applying LLMs to infinite input streams, two main challenges arise: the KV cache grows without bound, which leads to excessive memory usage and decode latency, and the LLM's performance degrades once the sequence length exceeds the attention window size set during pre-training. Window Attention: only keep the $L$ most recent tokens in the KV cache. The model degrades dramatically once the sequence length exceeds the cache size (even if only the first token is evicted). Sliding Window with Re-computation: do not reuse the KV cache. At every step, rebuild the window from the last $L$ tokens and run the Transformer on that small segment from scratch. Sliding Window Example t = 1: Window: [x₁] Run the model on this length-1 sequence, use the output of x₁. t = 2: Window: [x₁, x₂] Run the model on [x₁, x₂] (full 2×2 self-attention), use the output of x₂. t = 3: Window: [x₁, x₂, x₃] Run the model on these 3 tokens (3×3 attention), use x₃. t = 4: Window slides: [x₂, x₃, x₄] Run the model again on this 3-token segment (3×3 attention), use x₄. t = 5: window [x₃, x₄, x₅], full 3×3 attention, use x₅. t = 6: window [x₄, x₅, x₆], full 3×3 attention, use x₆. Observations A surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. ...
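A minimal sketch of the sink-plus-recent-window cache policy that follows from this observation, with token ids standing in for (K, V) entries; the cache sizes and function name are illustrative.

```python
def evict(cache: list, n_sink: int = 4, window: int = 1020) -> list:
    """Keep the first n_sink 'attention sink' entries plus the most recent window,
    evicting everything in between so memory stays bounded."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = []
for t in range(5000):          # streaming decode
    cache.append(t)            # append this step's KV entry
    cache = evict(cache)

assert cache[:4] == [0, 1, 2, 3]   # sink tokens are never evicted
assert len(cache) == 1024          # cache size stays constant regardless of stream length
```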

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference

Extensive Reading Author Info MIT HAN Lab Background In long-context inference: The KV cache grows linearly with context length ($L$). At each decoding step, the model must read the entire KV cache to compute attention. Existing works recognize that a small subset of tokens dominates the accuracy of token generation, and they choose to evict the unimportant ones: StreamingLLM keeps a sliding window plus a few “anchor” tokens. H2O, TOVA, etc., use heuristics or statistics to permanently drop “less important” tokens. Once a token is evicted, it’s gone. BUT, the important tokens are query-dependent. ...
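A minimal sketch of query-aware page selection in the spirit of Quest: keep per-page min/max key metadata and, for the current query, score each page by an upper bound on its attention contribution, then attend only over the top-scoring pages. The page size, `top_k`, and shapes are illustrative assumptions.

```python
import numpy as np

def select_pages(q: np.ndarray, keys: np.ndarray, page: int = 16, top_k: int = 4):
    """q: (d,) current query; keys: (seq, d) cached keys, split into fixed-size pages."""
    n_pages = keys.shape[0] // page
    pages = keys[: n_pages * page].reshape(n_pages, page, -1)
    k_min, k_max = pages.min(axis=1), pages.max(axis=1)        # per-page metadata, kept alongside the KV cache
    upper = np.maximum(q * k_min, q * k_max).sum(axis=-1)      # upper bound on q·k for any key in the page
    return np.argsort(upper)[-top_k:]                          # only these pages' KV is read this step

q = np.random.randn(64)
keys = np.random.randn(4096, 64)
print(select_pages(q, keys))   # a different query picks different pages, so nothing is ever evicted
```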

November 13, 2025 · Last updated on February 2, 2026 · 2 min · KKKZOZ