EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading In arXiv and other papers this work is often cited under a different name: LLMCad. Author Info Daliang Xu (徐大亮) Wangsong Yin Xin Jin Mengwei Xu Professor Xuanzhe Liu @ Peking University Background The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”: when an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ

Efficient Memory Management for Large Language Model Serving with PagedAttention

Extensive Reading Author Info Woosuk Kwon Zhuohan Li Background The existing systems suffer from internal and external memory fragmentation. Three primary sources of memory waste: Reserved slots: space reserved for future tokens that is not yet used. Internal fragmentation: space within an allocated memory block that will never be used. External fragmentation: unused space between memory blocks. The existing systems also cannot exploit the opportunities for memory sharing: parallel sampling, beam search, and shared prefixes could leverage a shared KV cache to reduce the memory footprint. ...
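The core idea of PagedAttention-style allocation can be sketched in a few lines. This is a toy model with an assumed interface, not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, and each sequence holds a block table instead of one contiguous reservation, so internal fragmentation is bounded by one block per sequence.

```python
# Toy sketch of paged KV-cache allocation: sequences grow block by
# block on demand, so no contiguous region is reserved up front.
# The class and field names are illustrative assumptions.

class BlockAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        used = self.lengths.get(seq_id, 0)
        if used == len(table) * self.block_size:  # current blocks are full
            table.append(self.free.pop())         # allocate one more block
        self.lengths[seq_id] = used + 1
```

Because blocks are allocated only when a sequence actually grows into them, the free pool can also back copy-on-write sharing for parallel sampling and shared prefixes.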

July 23, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General Background Resources LLMs typically demand: Higher Computational Cost Higher Memory Access Cost Higher Memory Cost Inference Process of LLMs auto-regressive generation In each generation step, the LLM takes as input the whole token sequence, including the input tokens and previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique can store and reuse previous key and value pairs within the Multi-Head Self-Attention block. ...
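The effect of the KV cache on the generation loop can be sketched abstractly. This is an illustration under assumed stand-in functions (`embed`, `attend`, `next_token` are not a real API): with the cache, each decode step processes only the newest token instead of re-encoding the whole sequence.

```python
# Sketch of autoregressive decoding with a KV cache: the prompt is
# processed once (prefill), then each decode step reuses the cached
# per-token state and adds only one new entry. All callables here are
# illustrative placeholders for real model components.

def generate(tokens, embed, attend, next_token, steps):
    cache = []                      # cached per-token state (stands in for K/V pairs)
    for t in tokens:                # prefill: process the prompt once
        cache.append(embed(t))
    out = list(tokens)
    for _ in range(steps):
        t = next_token(attend(cache))  # decode: attend over cached state
        cache.append(embed(t))         # only the new token is processed
        out.append(t)
    return out
```

Without the cache, step `n` would recompute keys and values for all `n` previous tokens, which is why generation time grows so quickly with sequence length.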

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ

Sonata: Multi-Database Transactions Made Fast and Serializable

Background Modern applications are often built using a service-oriented architecture, such as microservices, where different functionalities are handled by independent services, each with its own dedicated database. This design leads to workflows that span multiple services and databases, creating the need for multi-database transactions. Without proper coordination, these transactions can suffer from concurrency anomalies, violating business rules and data consistency. Local serializability at all participating databases does not imply global serializability! ...
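The warning in the last sentence can be made concrete with a toy check. This is an illustrative example, not from the paper: two databases each commit their local pieces in a serial order, yet the two local orders disagree, so no single global serialization order exists.

```python
# Toy illustration: local serializability at every database does not
# imply global serializability. We brute-force search for a total order
# consistent with all local commit orders.
from itertools import permutations

def globally_serializable(local_orders):
    """True iff some total order of transactions is consistent with
    every database's local commit order."""
    txns = sorted({t for order in local_orders for t in order})
    def respects(total, local):
        pos = {t: i for i, t in enumerate(total)}
        return all(pos[a] < pos[b] for a, b in zip(local, local[1:]))
    return any(all(respects(p, order) for order in local_orders)
               for p in permutations(txns))
```

With DB A committing `T1` before `T2` and DB B committing `T2` before `T1`, both databases are locally serial, but no global order satisfies both, which is exactly the anomaly multi-database coordination must prevent.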

July 13, 2025 · Last updated on August 1, 2025 · 6 min · KKKZOZ

Concurrency Control as a Service

Background Disaggregated databases typically decouple the system into: an execution layer, which requires substantial computational resources a storage layer, which necessitates significant storage capacity Concurrency Control (CC) is a key function module in databases The resource requirements of CC match neither layer: With SQL execution: execution prefers relatively more compute nodes, but CC prefers fewer nodes With data storage: data storage nodes have substantial storage capacity but limited computing resources Yet most existing cloud-native databases simply couple CC with either the execution layer or the storage layer. ...
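What "CC as a service" means in practice can be sketched minimally. This is a hypothetical lock-table interface, not the paper's actual design: execution nodes call out to a dedicated CC service for lock decisions instead of running concurrency control themselves, so the CC tier can be sized independently of both execution and storage.

```python
# Minimal sketch of concurrency control factored out into its own
# service: a centralized lock table that execution nodes query.
# The API (acquire/release_all) is an illustrative assumption.

class CCService:
    def __init__(self):
        self.locks = {}            # key -> owning transaction id

    def acquire(self, txn, key):
        owner = self.locks.get(key)
        if owner is None or owner == txn:
            self.locks[key] = txn  # grant (re-entrant for the owner)
            return True
        return False               # conflict: caller must wait or abort

    def release_all(self, txn):
        self.locks = {k: o for k, o in self.locks.items() if o != txn}
```

Because lock state lives in the service rather than on compute or storage nodes, the CC tier can use few nodes even when the execution layer scales out.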

July 11, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ

Orca: A Distributed Serving System for Transformer-Based Generative Models

Background Current serving systems schedule the execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once, after processing all requests within the batch. Challenge 1: Early-finished and late-joining requests Requests cannot finish early As different client requests may require different numbers of iterations to process, requests that finish earlier than others in the batch cannot return to the client, resulting in increased latency. ...
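The alternative granularity Orca argues for, iteration-level scheduling, can be sketched as a loop. This is an illustration with a made-up `steps_left` field, not Orca's actual scheduler: the engine runs one iteration for the whole batch, releases finished requests immediately, and admits waiting requests at the next iteration boundary.

```python
# Sketch of iteration-level scheduling: finished requests leave the
# batch as soon as they complete, and late-joining requests enter at
# iteration boundaries instead of waiting for the whole batch.
# Request dicts with a 'steps_left' counter are an illustrative stand-in.

def run(engine_step, waiting, max_batch=4):
    batch, done = [], []
    while batch or waiting:
        while waiting and len(batch) < max_batch:
            batch.append(waiting.pop(0))        # late-joining requests
        engine_step(batch)                      # one iteration, not one request
        finished = [r for r in batch if r["steps_left"] == 0]
        done.extend(finished)                   # early-finished return now
        batch = [r for r in batch if r["steps_left"] > 0]
    return done
```

With request-level scheduling, the one-step request below would sit in the batch until the three-step request finished; here it returns two iterations earlier.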

July 2, 2025 · Last updated on August 19, 2025 · 5 min · KKKZOZ

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Background There are two traditional ways to deploy LLMs: Cloud deployment: the model runs entirely on cloud servers. This offers strong compute, but incurs high network latency and bandwidth costs, and risks leaking users’ private data. Edge deployment: the model runs directly on edge devices close to users. This addresses latency and privacy, but edge devices (e.g., phones, IoT gateways) have very limited compute and memory, making it hard to host LLMs with billions of parameters. Existing solutions such as model quantization (compressing the model) sacrifice accuracy, while simple cloud-edge collaboration (splitting the model into two parts) still depends heavily on a high-quality connection to the cloud. The paper is the first to propose EdgeShard, a general LLM inference framework targeting a Collaborative Edge Computing (CEC) environment, which pools the compute resources of geographically distributed, heterogeneous edge devices and cloud servers into a shared resource pool that serves LLM inference collectively. Core Insights EdgeShard intelligently “shards” a compute-intensive LLM and deploys the shards onto a carefully selected set of heterogeneous devices (both edge devices and cloud servers). In this way it can: Break the memory wall: a model too large for any single device is spread across multiple devices, making it feasible to deploy very large models such as Llama2-70B. Optimize inference performance: by jointly considering each device’s compute capability and memory size as well as the network bandwidth between devices, it decides which devices participate and how the model is partitioned, minimizing inference latency or maximizing system throughput. Preserve data privacy: the placement policy forces the model’s input (first) layer to stay on the source device that holds the user data, so raw data never travels over the network, reducing the risk of privacy leakage. Main Approach The EdgeShard framework works in three main stages: 1. Offline Profiling This is a one-time preparation step. The system measures and records the key information needed to run the LLM, including: the average execution time of each layer on each device (covering both the prefill and autoregressive generation phases); the size and memory footprint of the activations (intermediate results) produced by each layer; each device’s available memory budget and the network bandwidth between devices. 2. Task Scheduling Optimization The scheduler uses the profiled data to solve a joint device-selection and model-partitioning optimization problem. For the two different optimization objectives, the paper designs two corresponding algorithms. Note Both algorithms are simple dynamic programming. ...
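The shape of the layer-partitioning dynamic program can be sketched with made-up costs. This is a toy in the spirit of the scheduling stage described above, not the paper's actual algorithm: for each layer, choose a device so that total compute time plus activation-transfer time (when consecutive layers sit on different devices) is minimized.

```python
# Toy DP for joint layer placement: dp[d] holds the best cost of
# placing the layers seen so far with the current layer on device d.
# layer_cost and transfer are illustrative profiled inputs.

def partition(layer_cost, transfer):
    """layer_cost[l][d]: time of layer l on device d;
    transfer[p][d]: cost of moving activations from device p to d."""
    n_layers, n_dev = len(layer_cost), len(layer_cost[0])
    dp = [layer_cost[0][d] for d in range(n_dev)]       # place layer 0
    for l in range(1, n_layers):
        dp = [min(dp[p] + transfer[p][d] for p in range(n_dev))
              + layer_cost[l][d]                         # extend to layer l
              for d in range(n_dev)]
    return min(dp)
```

In the test below, each device is fast for one layer; the DP pays the transfer cost of 2 to switch devices because 1 + 2 + 1 beats keeping either device throughout (11). Memory limits and the privacy constraint on the first layer would add extra feasibility checks to the same recurrence.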

July 1, 2025 · Last updated on August 19, 2025 · 1 min · KKKZOZ

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Background Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of processed data. Key components in GPU serverless clusters: Controller: Request Router: directs incoming requests to nodes already running LLM inference processes, or instructs the model loading scheduler. Model Loading Scheduler: activates LLM inference processes on unallocated GPUs. The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads, largely due to the substantial proportion of cold starts in serverless clusters. ...
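The locality-aware routing decision described above can be sketched simply. This is an illustrative model with assumed fields (`warm`, `load_cost`), not the system's actual interface: prefer a node where the model is already resident, and only when none exists pick the node with the cheapest checkpoint load, i.e., the smallest cold-start penalty.

```python
# Sketch of locality-enhanced request routing: warm nodes serve
# immediately; otherwise the router picks the node where cold-starting
# the model is cheapest. Node dicts here are illustrative assumptions.

def route(request_model, nodes):
    """nodes: list of dicts with 'warm' (set of resident model names)
    and 'load_cost' (seconds to cold-start the model)."""
    warm = [n for n in nodes if request_model in n["warm"]]
    if warm:
        return warm[0], 0.0                   # locality hit: no cold start
    best = min(nodes, key=lambda n: n["load_cost"])
    return best, best["load_cost"]            # pay the load latency once
```

The gap between the two return paths is exactly the cold-start overhead the snippet identifies as the dominant latency source.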

June 28, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ