STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Intensive Reading

Author Info
Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC.

Background

Challenges
- Cold start of NLP models on mobile devices: NLP inference stresses mobile devices on two fronts, latency (impromptu user engagements) and model size.
- Existing paradigms fall short:
  - Hold in memory: too large a memory footprint; the model is likely to become a victim of mobile memory management.
  - Load before execute: slow start; computation resources stall while waiting for I/O.
  - Pipeline load/execution: low arithmetic intensity in the Transformer's attention modules fills the pipeline with bubbles, so computation stalls most of the time at each model layer.

Insights
A model can be re-engineered from a monolithic block into a collection of resource-elastic "shards" by uniquely combining vertical partitioning with fine-grained, per-shard quantization. This transforms the I/O time of each model component into a tunable parameter (a sketch follows below). ...
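
To make the pipelining idea concrete, here is a minimal Python sketch of overlapping shard I/O with shard execution under a per-shard bitwidth plan. The shard plan, load_shard, and compute_shard are hypothetical stand-ins rather than STI's actual interfaces; the point is only that a loader thread keeps the next shard's I/O in flight while the current shard computes, and that a shard's bitwidth controls how long its I/O takes.

```python
import queue
import threading

# Hypothetical per-shard plan: (shard_id, bitwidth); fewer bits -> less I/O, more quantization error.
SHARD_PLAN = [(0, 8), (1, 4), (2, 4), (3, 8)]

def load_shard(shard_id: int, bitwidth: int) -> dict:
    """Stand-in for reading one quantized shard from flash; I/O volume scales with bitwidth."""
    return {"id": shard_id, "bits": bitwidth, "weights": b"\x00" * (bitwidth * 1024)}

def compute_shard(shard: dict, activations: int) -> int:
    """Stand-in for executing one vertical partition of the model on the loaded shard."""
    return activations + len(shard["weights"])  # placeholder arithmetic

def pipelined_inference(plan, activations=0):
    loaded = queue.Queue(maxsize=1)          # at most one shard prefetched ahead

    def loader():
        for shard_id, bits in plan:
            loaded.put(load_shard(shard_id, bits))   # I/O overlaps the compute loop below
        loaded.put(None)                             # sentinel: no more shards

    threading.Thread(target=loader, daemon=True).start()
    while (shard := loaded.get()) is not None:
        activations = compute_shard(shard, activations)
    return activations

print(pipelined_inference(SHARD_PLAN))
```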

August 25, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices

Extensive Reading

Author Info
- Rongjie Yi - Google Scholar
- Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC.
- Mengwei Xu

Background

Challenges
- End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation).
- Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs.
- Dynamic routing obscures which experts will be needed, making prefetch hard and naive caching ineffective when activations are balanced.
- Tiny RAM budgets (~1.5–3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing.
- Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan.

Insights
Non-expert weights are held in device memory, while expert weights are held on external storage and fetched into memory only when activated (see the sketch below). ...
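
A minimal sketch of that insight, assuming a simple LRU policy (the paper's actual eviction and bitwidth decisions are more involved): non-expert weights stay resident, and an expert buffer materializes expert weights from storage only when the router selects them. ExpertBuffer and load_fn are illustrative names, not EdgeMoE's API.

```python
from collections import OrderedDict

class ExpertBuffer:
    """In-memory expert cache with LRU eviction under a fixed RAM budget."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity          # how many experts fit in the RAM budget
        self.load_fn = load_fn            # reads one expert's weights from flash
        self.cache = OrderedDict()        # expert_id -> weights, kept in LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)        # cache hit: refresh recency
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)            # miss: pay the storage I/O
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)           # evict the least-recently-used expert
        return weights

# Usage: the router picks experts per token; only those are materialized in RAM.
buffer = ExpertBuffer(capacity=4, load_fn=lambda eid: f"weights-of-expert-{eid}")
for eid in [2, 7, 2, 5, 9, 1, 2]:
    _ = buffer.get(eid)
print(list(buffer.cache.keys()))   # the experts currently resident in memory
```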

August 24, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators

Extensive Reading

Author Info
- Le Chen - Google Scholar
- Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems

Background
Existing LLM inference engines typically use only one accelerator (e.g., only the GPU or only the NPU), which leads to two main problems:
- Wasted resources: the compute power of all the units on the SoC cannot be fully utilized.
- Performance bottlenecks: a single accelerator has inherent weaknesses and cannot deliver optimal performance in every scenario.

Challenges

Insights
Design an LLM inference engine that uses the GPU and NPU simultaneously and efficiently for cooperative computation, maximizing LLM inference speed on mobile devices. The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to enhance the lower bound of NPU performance.

GPU Characteristics
- Linear performance: the GPU's performance scales linearly as the tensor size increases. It transitions from being memory-bound on small tensors to compute-bound on large ones, where its performance plateaus.
- High-cost synchronization, with two main types of overhead:
  - Data copy: API calls that transfer data between CPU and GPU buffers, such as clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data's size.
  - Kernel submission: submitting a kernel to an active, non-empty GPU queue has negligible overhead (10-20 microseconds). However, after a synchronization event empties the queue, submitting the next kernel incurs a much higher "startup" latency of 50-100 microseconds. ...
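
The characteristics above suggest a simple way to reason about splitting one matmul between the NPU and the GPU. The sketch below is an illustration under assumed inputs, not HeteroLLM's partition algorithm: it picks how many rows to offload to the GPU so that the GPU's slice, including the fixed ~0.4 ms copy/submission overhead noted above, still finishes no later than the NPU's slice. The throughput numbers are hypothetical, measured-offline values.

```python
def split_rows(total_rows: int, npu_rows_per_ms: float, gpu_rows_per_ms: float,
               gpu_sync_overhead_ms: float = 0.4) -> int:
    """Pick a GPU row count that minimizes the concurrent (NPU, GPU) finish time.

    The NPU is the primary unit; the GPU helps only if its slice, plus the fixed
    synchronization overhead, still beats the NPU running everything alone.
    """
    best_gpu_rows, best_latency = 0, total_rows / npu_rows_per_ms
    for gpu_rows in range(0, total_rows + 1):
        npu_time = (total_rows - gpu_rows) / npu_rows_per_ms
        gpu_time = gpu_sync_overhead_ms + gpu_rows / gpu_rows_per_ms if gpu_rows else 0.0
        latency = max(npu_time, gpu_time)        # both units run concurrently
        if latency < best_latency:
            best_gpu_rows, best_latency = gpu_rows, latency
    return best_gpu_rows

# Example: a 4096-row matmul where the NPU is ~3x faster than the GPU on this shape.
print(split_rows(4096, npu_rows_per_ms=600, gpu_rows_per_ms=200))
```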

August 24, 2025 · Last updated on August 26, 2025 · 4 min · KKKZOZ

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices

Extensive Reading
A similar paper is found: arxiv.org/pdf/2504.08791?

Author Info
- Zonghang Li - Google Scholar

Background
LLM serving is shifting from the cloud to edge devices like smartphones and laptops. This trend is driven by growing privacy concerns, as users want to avoid sending their sensitive interaction data to cloud providers. The goal is to process user requests locally on their own devices.

Preliminaries
Maintaining the KV cache under tensor parallelism (TP).

Challenges
- Hardware limitations: mobile devices have very limited memory (typically 4-16 GiB) and computing power, and often lack GPUs. Running a 70B-scale model can require over 40 GiB of memory, far exceeding the capacity of a single device.
- Inefficient parallelism: the standard solution for distributed systems, pipeline parallelism, is inefficient for home scenarios where only one request is processed at a time, leaving most devices idle most of the time and wasting resources.
- Slow memory offloading: existing on-device solutions like llama.cpp and Accelerate offload model data to disk to save RAM, but their blocking disk I/O significantly slows down inference.

Insights
- In a setting where low-resource devices cooperate, tensor parallelism is the right choice: there is only one user request at a time, so the goal of parallelism is to reduce latency rather than to increase throughput.
- Tensor parallelism relies on allreduce operations to synchronize and aggregate partial results. In this setting the communication bottleneck is link latency rather than network bandwidth, so a star-based allreduce is used to reduce the number of network hops and hence the latency (see the sketch below).
- A sliding-window memory scheduler loads and unloads weights asynchronously on a separate background thread, hiding weight loading behind computation and synchronization.

Approaches ...
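
A minimal sketch of the star-based allreduce mentioned in the insights: every worker sends its partial tensor to a hub, which sums the partials and broadcasts the result back, so each worker is two hops away from the reduced value regardless of how many devices participate. In-process lists stand in for real network transfers; this is an illustration, not TPI-LLM's implementation.

```python
def star_allreduce(partials: list[list[float]]) -> list[list[float]]:
    """Sum per-device partial tensors at a hub, then hand every device a copy."""
    # Step 1: "gather" at the hub (one hop per worker).
    reduced = [sum(vals) for vals in zip(*partials)]
    # Step 2: "broadcast" the reduced tensor back (one more hop per worker).
    return [list(reduced) for _ in partials]

# Each of 4 devices holds its slice's partial output for a 3-element tensor.
partial_outputs = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0], [1.5, 1.5, 1.5]]
print(star_allreduce(partial_outputs)[0])   # every device ends with [5.0, 4.0, 6.0]
```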

August 17, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Intensive Reading

Author Info
- Zichang Liu: Research Scientist at Meta.
- Jue Wang, Ph.D: Founder & President of Stylar AI (stylar.ai).
- Tri Dao: Assistant Professor of Computer Science at Princeton University; Chief Scientist at Together AI.

Background
LLM Inference Latency Breakdown

Challenges
Speeding up inference-time sparse LLMs in wall-clock time while maintaining quality and in-context learning abilities remains a challenging problem. While sparsity and pruning have been well studied, they have not seen wide adoption for LLMs due to the poor quality and efficiency trade-offs on modern hardware such as GPUs: ...
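
To show what contextual sparsity means mechanically, here is a small NumPy sketch in which a cheap per-token predictor scores FFN neurons for the current input and only the selected columns/rows are multiplied. The random matrix P is a stand-in for the trained low-rank predictor used in the paper, so the printed error only illustrates the mechanics, not the paper's accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, keep = 64, 256, 32          # keep only 32 of 256 FFN neurons

W1 = rng.standard_normal((d_model, d_ffn))  # up-projection
W2 = rng.standard_normal((d_ffn, d_model))  # down-projection
P  = rng.standard_normal((d_model, d_ffn))  # low-cost predictor (random stand-in)

x = rng.standard_normal(d_model)            # current token's hidden state

# Predict which neurons matter for this input, then compute only those.
active = np.argsort(P.T @ x)[-keep:]
h = np.maximum(x @ W1[:, active], 0.0)      # sparse up-projection + ReLU
y_sparse = h @ W2[active, :]                # sparse down-projection

y_dense = np.maximum(x @ W1, 0.0) @ W2      # dense reference
print(np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```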

August 4, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

Fast On-device LLM Inference with NPUs

Intensive Reading

Author Info
- Daliang Xu (徐大亮) - Daliang Xu's Website: an incoming Assistant Professor at BUPT.
- Hao Zhang - Google Scholar: author of EdgeLLM.
- Mengwei Xu: an associate professor at BUPT.
- Professor Xuanzhe Liu @ Peking University: an Endowed Boya Distinguished Professor at the School of Computer Science, Peking University.

Background
- The prefill stage is often the bottleneck in typical mobile applications. (This is the setting the paper assumes; in most cases, though, wouldn't the decoding stage still be the bottleneck?)
- Modern mobile SoCs ubiquitously include mobile neural processing units (NPUs) that are well suited for integer operations, such as INT8-based matrix multiplication. ...
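
As a concrete reminder of why NPUs favor this workload, here is a generic NumPy sketch of INT8 weight quantization with per-output-channel scales, an integer matmul accumulated in INT32, and a float rescale. It illustrates the arithmetic pattern only, not the paper's NPU offloading scheme; all shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128)).astype(np.float32)       # prefill activations
w = rng.standard_normal((128, 256)).astype(np.float32)     # layer weights

scale = np.abs(w).max(axis=0) / 127.0                       # per-output-channel weight scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

x_scale = np.abs(x).max() / 127.0                           # per-tensor activation scale
x_int8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

# Integer matmul (accumulate in int32, as an NPU would), then dequantize.
y_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
y = y_int32.astype(np.float32) * x_scale * scale

print(np.abs(y - x @ w).max())                              # quantization error vs. float matmul
```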

August 4, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Intensive Reading

Author Info
- Keivan Alizadeh-Vahid - Google Scholar
- Iman Mirzadeh: an ML Research Engineer at Apple.

Background
LLMs are hard for personal devices to load. The standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference, but this severely limits the maximum model size that can be run.

Challenges
The primary challenge is that the memory footprint of large language models (LLMs) often exceeds the limited DRAM capacity of personal devices. While storing models on high-capacity flash memory is a potential solution, it introduces two new major challenges: ...
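
A minimal sketch of the general direction, not the paper's specific techniques: keep the weight matrix flash-resident and read only the rows needed right now, so DRAM holds a small working set instead of the whole model. The file name, sizes, and row selection below are illustrative, with np.memmap standing in for flash-resident weights.

```python
import numpy as np

rows, cols = 4096, 1024
weights = np.memmap("ffn_weights.bin", dtype=np.float16, mode="w+",
                    shape=(rows, cols))            # create a flash-backed matrix
weights[:] = 0.01                                  # placeholder contents
weights.flush()

active_rows = [3, 17, 256, 4000]                   # e.g. rows for predicted-active neurons
loaded = np.array(weights[active_rows])            # DRAM cost: 4 rows, not 4096

print(loaded.shape, loaded.nbytes, "bytes resident instead of", rows * cols * 2)
```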

July 30, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Intensive Reading

Author Info
- Zhenliang Xue: from IPADS.
- Yixin Song: first author of PowerInfer.
- Zeyu Mi (糜泽羽): an associate professor at the School of Software, Shanghai Jiao Tong University (SJTU).
- Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems.

Background

Sparsity
The FFN accounts for a large share of the parameters and exhibits clear sparsity (especially with ReLU), so a predictor can be run before the computation to predict which neurons will be activated, reducing both compute and I/O overhead.

PowerInfer-2 also explores dynamic sparsity during LLM inference (see the sketch below):
- With a large batch, a neuron is not sparse in a given step as long as at least one input in the batch activates it. Because different inputs activate different neurons, this aggregation effect activates a large number of neurons, forming stable, dense "hot spots" and significantly lowering the overall sparsity.
- Because some sequences terminate earlier, the effective batch size also fluctuates dynamically. This real-time variation makes the model's computation pattern transition smoothly from nearly dense to highly sparse over the lifetime of a task.

Mobile Hardware Characteristics
Compared with PCs, mobile hardware has two characteristics:
- Heterogeneous computing capabilities with distinct sparse computation characteristics:
  - The CPU is better at sparse computation.
  - The NPU is better at dense computation.
  - The GPU is slower than both the CPU and the NPU, and using it for inference hurts the device's rendering frame rate.
  - A mobile LLM inference framework should therefore use the heterogeneous processors together to maximize use of the shared memory bandwidth.
- Distinct storage architecture with unique I/O characteristics:
  - Larger read block sizes yield higher throughput.
  - A narrower data access range yields higher throughput.
  - Reads issued from higher-frequency CPU cores achieve higher throughput.
  - UFS storage offers limited concurrency. ...
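
A tiny NumPy sketch of the dynamic-sparsity observation above: a neuron must be computed whenever any sequence in the batch activates it, so the effective density is the union of per-sequence activation sets and grows quickly with batch size. The activation masks are random stand-ins for a predictor's output; the sizes and per-sequence activation rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_neurons, per_seq_active = 8192, 0.10        # each sequence activates ~10% of neurons

def effective_density(batch_size: int) -> float:
    masks = rng.random((batch_size, num_neurons)) < per_seq_active
    union = masks.any(axis=0)                   # a neuron is needed if any input hits it
    return union.mean()

for bs in (1, 4, 16, 64):
    print(f"batch={bs:2d}  fraction of neurons computed: {effective_density(bs):.2f}")
```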

July 29, 2025 · Last updated on August 19, 2025 · 4 min · KKKZOZ