LLM-Inference

KTransformers Unleashing the Full Potential of CPU GPU Hybrid Inference for MoE Models

Extensive Reading Author Info Hongtao Chen | MADSys Weiyu Xie | MADSys Boxin Zhang | MADSys Background MoE LLMs and hybrid setups Modern MoE models (DeepSeek, Qwen-MoE, etc.) are huge but activate few experts per token. On single-GPU or low-concurrency setups, we naturally pair a small GPU with a big CPU + large DRAM. Limitations of current hybrid / offloading systems Tools like Fiddler or basic offloading keep attention on GPU and push experts or layers to CPU. CPU becomes the bottleneck; generic AMX/AVX-512 kernels are far from peak, and GPU often waits on CPU. Hardware inefficiencies on CPU and NUMA Poor weight layouts and scheduling starve caches and AMX units. Multi-socket (NUMA) machines suffer from cross-socket memory traffic and weak scaling. Crude accuracy–latency tradeoffs in MoE Existing accelerations often reduce or skip experts (smaller top-k, pruning). These approaches speed up inference but can noticeably hurt accuracy. There are tow major inefficiencies: ...

LServe Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading Author Info MIT HAN Lab Background Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice. Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...

EAGLE Speculative Sampling Requires Rethinking Feature Uncertainty

Extensive Reading Author Info Background The standard method for large language model (LLM) inference, autoregressive decoding, is slow and costly because it generates tokens sequentially, one at a time. Existing acceleration methods like speculative sampling often struggle to find a suitable draft model; using a smaller version of the LLM can have high overhead, while training a new, appropriately-sized draft model is prohibitively expensive. Other approaches like Lookahead and Medusa successfully reduce drafting latency but are ultimately limited by the low accuracy of their drafts, which restricts their maximum achievable speedup. Insights Two key insights: ...

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Intensive Reading Author Info About Ting Cao - Dr. Ting Cao: A Professor at the Institute of AI Industry Research (AIR), Tsinghua University. Background Challenges 挑战一：如何准确地识别出下一步计算到底需要哪些“活跃权重” 。如果识别错误，会降低模型的准确度。挑战二：如何能足够早地预测出需要的活跃权重，从而将缓慢的闪存加载过程与当前的计算过程并行处理，以隐藏延迟。现有的一些方法依赖 ReLU 激活函数来预测稀疏性，但这不适用于 Llama 等为追求高精度而未使用 ReLU 的现代 LLM. Insights 利用了 Top-K 的稀疏性，实现了在非 ReLu 上的权重值预测和预取论文提出了两个核心观察： Similarities in Cross-Layer Activations The input activations of the attention and MLP blocks in LLMs exhibit high cross-layer similarity due to residuals to the input activations. 由于激活值相似度很高，所以用当前层最重要的 K 个激活通道去预测下一层最重要的 K 个激活通道，准确度也很高 Contextual Hot Active Weights During Decoding Contextual active weights exhibit high temporal locality across inference iterations during decoding. 在一个具体的对话或任务中（上下文层面），“热点权重”的重复使用率，远高于在所有通用任务中（任务层面）的平均重复使用率所以根据上下文的激活频率设计缓存会更有效（缓存命中率会更高） Approaches Cross-Layer Active Weight Preloading 当计算第 N 层时，ActiveFlow 利用第 N 层的激活值来预测并提前加载第 N+1 层到第 N+k 层（一个“层组”）所需要的活跃权重 ...

ELMS Elasticized Large Language Models On Mobile Devices

Intensive Reading Author Info ‪Wangsong Yin‬ - ‪Google Scholar‬ ‪Rongjie Yi‬ - ‪Google Scholar‬ Daliang Xu （徐大亮） - Daliang Xu’s Website: An Assistant Professor (Associate Researcher) at BUPT. Mengwei Xu Xuanzhe Liu Background Existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications. Prerequisite In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration. ...

STI Turbocharge NLP Inference at the Edge via Elastic Pipelining

Intensive Reading Author Info Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at University of UESTC. Background Challenges Cold start of NLP models in mobile devices NLP inference stresses mobile devices on two aspects Latency: impromptu user engagements Model Size Existing Paradigms: Hold in memory Too large memory footprint, likely to be victims of mobile memory management Load before execute Slow start, waiting for I/O, computation resources stall Pipeline load/execution Low arithmetic intensity in Transformer’s attention modules The pipeline is filled with bubbles and the computation stalls most of the time at each model layer Insights A model can be re-engineered from a monolithic block into a collection of resource-elastic “shards” by uniquely combining vertical partitioning with fine-grained, per-shard quantization. This transforms the I/O time of each model component into a tunable parameter. ...

EdgeMoE Empowering Sparse Large Language Models on Mobile Devices

Extensive Reading Author Info ‪Rongjie Yi‬ - ‪Google Scholar‬ Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at University of UESTC. Mengwei Xu Background Challenges End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation). Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs. Dynamic routing obscures which experts will be needed, making prefetch hard and naive caching ineffective when activations are balanced. Tiny RAM budgets (~1.5–3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing. Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan. Insights Non-expert weights are held in device memory; while expert weights are held on external storage and fetched to memory only when activated. ...

HeteroLLM Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators

Extensive Reading Author Info ‪Le Chen‬ - ‪Google Scholar‬ Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems Background 现有的LLM推理引擎通常只使用其中一种加速器（例如只用GPU或只用NPU），这导致了两个主要问题：资源浪费：无法充分利用芯片上所有计算单元的算力。性能瓶颈：单一加速器有其固有的性能短板，无法在所有场景下都达到最优性能。 Challenges Insights 设计一个能够同时、高效地利用 GPU 和 NPU 进行协同计算的 LLM 推理引擎，以最大限度地提升移动设备上的 LLM 推理速度 The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to enhance the lower bound of NPU performance. GPU Characteristics Linear Performance: The GPU’s performance scales linearly as the tensor size increases. It transitions from being memory-bound on small tensors to compute-bound on large ones, where its performance plateaus. High-Cost Synchronization: There are two main types of synchronization overheads. Data Copy: API calls to transfer data between CPU and GPU buffers, like clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data’s size. Kernel Submission: Submitting a kernel to an active, non-empty GPU queue has a negligible overhead (10-20 microseconds). However, after a synchronization event empties the queue, submitting the next kernel incurs a much higher “startup” latency of 50-100 microseconds. ...