STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Intensive Reading

Author Info
Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC.

Background

Challenges
- Cold start of NLP models on mobile devices: NLP inference stresses mobile devices on two fronts, latency (impromptu user engagements) and model size.
- Existing paradigms fall short:
  - Hold in memory: too large a memory footprint; the model is likely to become a victim of mobile memory management.
  - Load before execute: slow start; computation resources stall while waiting for I/O.
  - Pipeline load/execution: low arithmetic intensity in the Transformer's attention modules fills the pipeline with bubbles, so computation stalls most of the time at each model layer.

Insights
A model can be re-engineered from a monolithic block into a collection of resource-elastic "shards" by uniquely combining vertical partitioning with fine-grained, per-shard quantization. This transforms the I/O time of each model component into a tunable parameter (a sketch follows below). ...
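
To make the pipelining idea concrete, here is a minimal Python sketch of overlapping shard I/O with shard execution under a per-shard bitwidth plan. The shard plan, load_shard, and compute_shard are hypothetical stand-ins rather than STI's actual interfaces; the point is only that a loader thread keeps the next shard's I/O in flight while the current shard computes, and that a shard's bitwidth controls how long its I/O takes.

```python
import queue
import threading

# Hypothetical per-shard plan: (shard_id, bitwidth); fewer bits -> less I/O, more quantization error.
SHARD_PLAN = [(0, 8), (1, 4), (2, 4), (3, 8)]

def load_shard(shard_id: int, bitwidth: int) -> dict:
    """Stand-in for reading one quantized shard from flash; I/O volume scales with bitwidth."""
    return {"id": shard_id, "bits": bitwidth, "weights": b"\x00" * (bitwidth * 1024)}

def compute_shard(shard: dict, activations: int) -> int:
    """Stand-in for executing one vertical partition of the model on the loaded shard."""
    return activations + len(shard["weights"])  # placeholder arithmetic

def pipelined_inference(plan, activations=0):
    loaded = queue.Queue(maxsize=1)          # at most one shard prefetched ahead

    def loader():
        for shard_id, bits in plan:
            loaded.put(load_shard(shard_id, bits))   # I/O overlaps the compute loop below
        loaded.put(None)                             # sentinel: no more shards

    threading.Thread(target=loader, daemon=True).start()
    while (shard := loaded.get()) is not None:
        activations = compute_shard(shard, activations)
    return activations

print(pipelined_inference(SHARD_PLAN))
```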

August 25, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices

Extensive Reading

Author Info
- Rongjie Yi - Google Scholar
- Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC.
- Mengwei Xu

Background

Challenges
- End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation).
- Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs.
- Dynamic routing obscures which experts will be needed, making prefetch hard and naive caching ineffective when activations are balanced.
- Tiny RAM budgets (~1.5–3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing.
- Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan.

Insights
Non-expert weights are held in device memory, while expert weights are held on external storage and fetched into memory only when activated (see the sketch below). ...
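
A minimal sketch of that insight, assuming a simple LRU policy (the paper's actual eviction and bitwidth decisions are more involved): non-expert weights stay resident, and an expert buffer materializes expert weights from storage only when the router selects them. ExpertBuffer and load_fn are illustrative names, not EdgeMoE's API.

```python
from collections import OrderedDict

class ExpertBuffer:
    """In-memory expert cache with LRU eviction under a fixed RAM budget."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity          # how many experts fit in the RAM budget
        self.load_fn = load_fn            # reads one expert's weights from flash
        self.cache = OrderedDict()        # expert_id -> weights, kept in LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)        # cache hit: refresh recency
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)            # miss: pay the storage I/O
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)           # evict the least-recently-used expert
        return weights

# Usage: the router picks experts per token; only those are materialized in RAM.
buffer = ExpertBuffer(capacity=4, load_fn=lambda eid: f"weights-of-expert-{eid}")
for eid in [2, 7, 2, 5, 9, 1, 2]:
    _ = buffer.get(eid)
print(list(buffer.cache.keys()))   # the experts currently resident in memory
```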

August 24, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators

Extensive Reading

Author Info
- Le Chen - Google Scholar
- Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems

Background
Existing LLM inference engines typically use only one accelerator (e.g., only the GPU or only the NPU), which leads to two main problems:
- Wasted resources: the compute power of all the units on the SoC cannot be fully utilized.
- Performance bottlenecks: a single accelerator has inherent weaknesses and cannot deliver optimal performance in every scenario.

Challenges

Insights
Design an LLM inference engine that uses the GPU and NPU simultaneously and efficiently for cooperative computation, maximizing LLM inference speed on mobile devices. The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to enhance the lower bound of NPU performance.

GPU Characteristics
- Linear performance: the GPU's performance scales linearly as the tensor size increases. It transitions from being memory-bound on small tensors to compute-bound on large ones, where its performance plateaus.
- High-cost synchronization, with two main types of overhead:
  - Data copy: API calls that transfer data between CPU and GPU buffers, such as clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data's size.
  - Kernel submission: submitting a kernel to an active, non-empty GPU queue has negligible overhead (10-20 microseconds). However, after a synchronization event empties the queue, submitting the next kernel incurs a much higher "startup" latency of 50-100 microseconds. ...
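
The characteristics above suggest a simple way to reason about splitting one matmul between the NPU and the GPU. The sketch below is an illustration under assumed inputs, not HeteroLLM's partition algorithm: it picks how many rows to offload to the GPU so that the GPU's slice, including the fixed ~0.4 ms copy/submission overhead noted above, still finishes no later than the NPU's slice. The throughput numbers are hypothetical, measured-offline values.

```python
def split_rows(total_rows: int, npu_rows_per_ms: float, gpu_rows_per_ms: float,
               gpu_sync_overhead_ms: float = 0.4) -> int:
    """Pick a GPU row count that minimizes the concurrent (NPU, GPU) finish time.

    The NPU is the primary unit; the GPU helps only if its slice, plus the fixed
    synchronization overhead, still beats the NPU running everything alone.
    """
    best_gpu_rows, best_latency = 0, total_rows / npu_rows_per_ms
    for gpu_rows in range(0, total_rows + 1):
        npu_time = (total_rows - gpu_rows) / npu_rows_per_ms
        gpu_time = gpu_sync_overhead_ms + gpu_rows / gpu_rows_per_ms if gpu_rows else 0.0
        latency = max(npu_time, gpu_time)        # both units run concurrently
        if latency < best_latency:
            best_gpu_rows, best_latency = gpu_rows, latency
    return best_gpu_rows

# Example: a 4096-row matmul where the NPU is ~3x faster than the GPU on this shape.
print(split_rows(4096, npu_rows_per_ms=600, gpu_rows_per_ms=200))
```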

August 24, 2025 · Last updated on August 26, 2025 · 4 min · KKKZOZ

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices

Extensive Reading
A similar paper is found: arxiv.org/pdf/2504.08791?

Author Info
- Zonghang Li - Google Scholar

Background
LLM serving is shifting from the cloud to edge devices like smartphones and laptops. This trend is driven by growing privacy concerns, as users want to avoid sending their sensitive interaction data to cloud providers. The goal is to process user requests locally on their own devices.

Preliminaries
Maintaining the KV cache under tensor parallelism (TP).

Challenges
- Hardware limitations: mobile devices have very limited memory (typically 4-16 GiB) and computing power, and often lack GPUs. Running a 70B-scale model can require over 40 GiB of memory, far exceeding the capacity of a single device.
- Inefficient parallelism: the standard solution for distributed systems, pipeline parallelism, is inefficient for home scenarios where only one request is processed at a time, leaving most devices idle most of the time and wasting resources.
- Slow memory offloading: existing on-device solutions like llama.cpp and Accelerate offload model data to disk to save RAM, but their blocking disk I/O significantly slows down inference.

Insights
- In a setting where low-resource devices cooperate, tensor parallelism is the right choice: there is only one user request at a time, so the goal of parallelism is to reduce latency rather than to increase throughput.
- Tensor parallelism relies on allreduce operations to synchronize and aggregate partial results. In this setting the communication bottleneck is link latency rather than network bandwidth, so a star-based allreduce is used to reduce the number of network hops and hence the latency (see the sketch below).
- A sliding-window memory scheduler loads and unloads weights asynchronously on a separate background thread, hiding weight loading behind computation and synchronization.

Approaches ...
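
A minimal sketch of the star-based allreduce mentioned in the insights: every worker sends its partial tensor to a hub, which sums the partials and broadcasts the result back, so each worker is two hops away from the reduced value regardless of how many devices participate. In-process lists stand in for real network transfers; this is an illustration, not TPI-LLM's implementation.

```python
def star_allreduce(partials: list[list[float]]) -> list[list[float]]:
    """Sum per-device partial tensors at a hub, then hand every device a copy."""
    # Step 1: "gather" at the hub (one hop per worker).
    reduced = [sum(vals) for vals in zip(*partials)]
    # Step 2: "broadcast" the reduced tensor back (one more hop per worker).
    return [list(reduced) for _ in partials]

# Each of 4 devices holds its slice's partial output for a 3-element tensor.
partial_outputs = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0], [1.5, 1.5, 1.5]]
print(star_allreduce(partial_outputs)[0])   # every device ends with [5.0, 4.0, 6.0]
```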

August 17, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Intensive Reading

Author Info
- Zichang Liu: Research Scientist at Meta.
- Jue Wang, Ph.D: Founder & President of Stylar AI (stylar.ai).
- Tri Dao: Assistant Professor of Computer Science at Princeton University; Chief Scientist at Together AI.

Background
LLM Inference Latency Breakdown

Challenges
Speeding up inference-time sparse LLMs in wall-clock time while maintaining quality and in-context learning abilities remains a challenging problem. While sparsity and pruning have been well studied, they have not seen wide adoption for LLMs due to the poor quality and efficiency trade-offs on modern hardware such as GPUs: ...
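
To show what contextual sparsity means mechanically, here is a small NumPy sketch in which a cheap per-token predictor scores FFN neurons for the current input and only the selected columns/rows are multiplied. The random matrix P is a stand-in for the trained low-rank predictor used in the paper, so the printed error only illustrates the mechanics, not the paper's accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, keep = 64, 256, 32          # keep only 32 of 256 FFN neurons

W1 = rng.standard_normal((d_model, d_ffn))  # up-projection
W2 = rng.standard_normal((d_ffn, d_model))  # down-projection
P  = rng.standard_normal((d_model, d_ffn))  # low-cost predictor (random stand-in)

x = rng.standard_normal(d_model)            # current token's hidden state

# Predict which neurons matter for this input, then compute only those.
active = np.argsort(P.T @ x)[-keep:]
h = np.maximum(x @ W1[:, active], 0.0)      # sparse up-projection + ReLU
y_sparse = h @ W2[active, :]                # sparse down-projection

y_dense = np.maximum(x @ W1, 0.0) @ W2      # dense reference
print(np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```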

August 4, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

Fast On-device LLM Inference with NPUs

Intensive Reading

Author Info
- Daliang Xu (徐大亮) - Daliang Xu's Website: an incoming Assistant Professor at BUPT.
- Hao Zhang - Google Scholar: author of EdgeLLM.
- Mengwei Xu: an associate professor at BUPT.
- Professor Xuanzhe Liu @ Peking University: an Endowed Boya Distinguished Professor at the School of Computer Science, Peking University.

Background
- The prefill stage is often the bottleneck in typical mobile applications. (This is the setting the paper assumes; in most cases, though, wouldn't the decoding stage still be the bottleneck?)
- Modern mobile SoCs ubiquitously include mobile neural processing units (NPUs) that are well suited for integer operations, such as INT8-based matrix multiplication. ...
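
As a concrete reminder of why NPUs favor this workload, here is a generic NumPy sketch of INT8 weight quantization with per-output-channel scales, an integer matmul accumulated in INT32, and a float rescale. It illustrates the arithmetic pattern only, not the paper's NPU offloading scheme; all shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128)).astype(np.float32)       # prefill activations
w = rng.standard_normal((128, 256)).astype(np.float32)     # layer weights

scale = np.abs(w).max(axis=0) / 127.0                       # per-output-channel weight scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

x_scale = np.abs(x).max() / 127.0                           # per-tensor activation scale
x_int8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

# Integer matmul (accumulate in int32, as an NPU would), then dequantize.
y_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
y = y_int32.astype(np.float32) * x_scale * scale

print(np.abs(y - x @ w).max())                              # quantization error vs. float matmul
```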

August 4, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Intensive Reading

Author Info
- Keivan Alizadeh-Vahid - Google Scholar
- Iman Mirzadeh: an ML Research Engineer at Apple.

Background
LLMs are hard for personal devices to load. The standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference, but this severely limits the maximum model size that can be run.

Challenges
The primary challenge is that the memory footprint of large language models (LLMs) often exceeds the limited DRAM capacity of personal devices. While storing models on high-capacity flash memory is a potential solution, it introduces two new major challenges: ...
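
A minimal sketch of the general direction, not the paper's specific techniques: keep the weight matrix flash-resident and read only the rows needed right now, so DRAM holds a small working set instead of the whole model. The file name, sizes, and row selection below are illustrative, with np.memmap standing in for flash-resident weights.

```python
import numpy as np

rows, cols = 4096, 1024
weights = np.memmap("ffn_weights.bin", dtype=np.float16, mode="w+",
                    shape=(rows, cols))            # create a flash-backed matrix
weights[:] = 0.01                                  # placeholder contents
weights.flush()

active_rows = [3, 17, 256, 4000]                   # e.g. rows for predicted-active neurons
loaded = np.array(weights[active_rows])            # DRAM cost: 4 rows, not 4096

print(loaded.shape, loaded.nbytes, "bytes resident instead of", rows * cols * 2)
```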

July 30, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Intensive Reading

Author Info
- Zhenliang Xue: from IPADS.
- Yixin Song: first author of PowerInfer.
- Zeyu Mi (糜泽羽): an associate professor at the School of Software, Shanghai Jiao Tong University (SJTU).
- Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems.

Background

Sparsity
The FFN accounts for a large share of the parameters and exhibits clear sparsity (especially with ReLU), so a predictor can be run before the computation to predict which neurons will be activated, reducing both compute and I/O overhead.

PowerInfer-2 also explores dynamic sparsity during LLM inference (see the sketch below):
- With a large batch, a neuron is not sparse in a given step as long as at least one input in the batch activates it. Because different inputs activate different neurons, this aggregation effect activates a large number of neurons, forming stable, dense "hot spots" and significantly lowering the overall sparsity.
- Because some sequences terminate earlier, the effective batch size also fluctuates dynamically. This real-time variation makes the model's computation pattern transition smoothly from nearly dense to highly sparse over the lifetime of a task.

Mobile Hardware Characteristics
Compared with PCs, mobile hardware has two characteristics:
- Heterogeneous computing capabilities with distinct sparse computation characteristics:
  - The CPU is better at sparse computation.
  - The NPU is better at dense computation.
  - The GPU is slower than both the CPU and the NPU, and using it for inference hurts the device's rendering frame rate.
  - A mobile LLM inference framework should therefore use the heterogeneous processors together to maximize use of the shared memory bandwidth.
- Distinct storage architecture with unique I/O characteristics:
  - Larger read block sizes yield higher throughput.
  - A narrower data access range yields higher throughput.
  - Reads issued from higher-frequency CPU cores achieve higher throughput.
  - UFS storage offers limited concurrency. ...
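
A tiny NumPy sketch of the dynamic-sparsity observation above: a neuron must be computed whenever any sequence in the batch activates it, so the effective density is the union of per-sequence activation sets and grows quickly with batch size. The activation masks are random stand-ins for a predictor's output; the sizes and per-sequence activation rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_neurons, per_seq_active = 8192, 0.10        # each sequence activates ~10% of neurons

def effective_density(batch_size: int) -> float:
    masks = rng.random((batch_size, num_neurons)) < per_seq_active
    union = masks.any(axis=0)                   # a neuron is needed if any input hits it
    return union.mean()

for bs in (1, 4, 16, 64):
    print(f"batch={bs:2d}  fraction of neurons computed: {effective_density(bs):.2f}")
```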

July 29, 2025 · Last updated on August 19, 2025 · 4 min · KKKZOZ