Tensor-Parallelism

KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

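Skimming Author Info Background Challenges Insights Approaches
I read this paper several times without fully understanding it. My rough understanding is that it exploits the causal mask to pass the KV cache between devices in a chain, which avoids the heavy recomputation and redundant transfers of conventional TSP.
To balance the pipeline it uses context-level load balancing: earlier devices compute more of the KV cache and later devices compute less, because attention on later devices spans a longer context.
The key point is that each device not only forwards the KV cache but also uses the cache it receives to finish the attention computation for its own tokens.
On D1: compute Q_0, K_0, V_0 for T1-T4. Immediately compute its own attention: Q_0 against K_0 gives a 4x4 attention matrix and the output A_0. Then send K_0, V_0 (a cache of size 4) to D2.
On D2: while waiting for D1's data, compute the local Q_1, K_1, V_1 for T5-T7 in parallel. After receiving K_0, V_0 from D1, append the local K_1, V_1 to form a KV cache of size 7 that covers T1-T7. Immediately compute its own attention: Q_1 (from T5-T7) against this size-7 cache (a 3x7 attention computation) gives the output A_1. Then send the size-7 KV cache to D3.
On D3: compute the local Q_2, K_2, V_2 for T8-T9 in parallel. After receiving the size-7 cache from D2, append its own K_2, V_2 to form the final KV cache covering all 9 tokens. Compute its own attention: Q_2 against this size-9 cache (a 2x9 attention computation) gives the output A_2. As the last device, it generates the first token.
TSP ...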
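The D1/D2/D3 walkthrough above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: one attention head, one layer, random Q/K/V instead of real projections, and plain Python lists instead of tensors. The names `kv_runahead_chain` and `causal_attention` are hypothetical helpers, not code from the paper. It checks that the chained computation yields the same outputs as ordinary causal attention on one device, and it records the attention shape seen by each device.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values, first_pos):
    """Attend query i (global position first_pos + i) to keys[0 .. first_pos + i]."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        visible = first_pos + i + 1  # causal mask: no keys after the query's own position
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[:visible]]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values[:visible])) for c in range(dim)]
        )
    return outputs


def kv_runahead_chain(q, k, v, partition):
    """Simulate the device chain; partition lists how many tokens each device owns."""
    cache_k, cache_v = [], []  # the cache that travels from device to device
    outputs, shapes = [], []
    start = 0
    for size in partition:
        end = start + size
        # Append this device's local K/V to the cache received from the previous device.
        cache_k = cache_k + k[start:end]
        cache_v = cache_v + v[start:end]
        shapes.append((size, len(cache_k)))  # (local queries, cache length)
        outputs.extend(causal_attention(q[start:end], cache_k, cache_v, start))
        start = end  # the grown cache is "sent" to the next device
    return outputs, shapes


random.seed(0)
N_TOKENS, DIM = 9, 4  # assumed toy sizes matching the T1-T9 example
q, k, v = (
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TOKENS)] for _ in range(3)
)

chained, shapes = kv_runahead_chain(q, k, v, [4, 3, 2])
reference = causal_attention(q, k, v, 0)  # single-device causal attention
print(shapes)  # [(4, 4), (3, 7), (2, 9)]
```

The uneven partition `[4, 3, 2]` mirrors the context-level load balancing described above: devices later in the chain own fewer tokens because each of their queries attends to a longer cache.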

Striped Attention Faster Ring Attention for Causal Transformers

Skimming Author Info Implementation and Benchmark
zhuzilin/ring-flash-attention: Ring attention implementation with flash attention
Corresponding visualization is here
Background Challenges Insights
Ring Attention suffers from workload imbalance. Because of the causal mask, some devices perform wasted computations in some iterations while other devices stay busy the whole time. Striped Attention proposes another way to distribute the workload across devices to eliminate the imbalance.
Approaches
Striped Attention gives each device non-contiguous tokens that are evenly distributed across the original sequence.
Example
Important
The most important point for understanding this example: neither Ring Attention nor Striped Attention uses naive attention computation ...
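The imbalance described above can be made concrete by counting, for each ring step, how many (query, key) pairs each device has that survive the causal mask. The sketch below is a toy model under assumptions: 16 tokens, 4 devices, and a ring in which device `d` holds the KV block of device `(d - step) % n` at each step. The function names are hypothetical, and the refinements of the actual Striped Attention kernels are not modeled.

```python
def contiguous_partition(n_tokens, n_devices):
    """Ring Attention layout: device d owns one contiguous block of tokens."""
    block = n_tokens // n_devices
    return [list(range(d * block, (d + 1) * block)) for d in range(n_devices)]


def striped_partition(n_tokens, n_devices):
    """Striped layout: device d owns tokens d, d + n_devices, d + 2 * n_devices, ..."""
    return [list(range(d, n_tokens, n_devices)) for d in range(n_devices)]


def unmasked_pairs(q_tokens, kv_tokens):
    """Count (query, key) pairs allowed by the causal mask (key position <= query position)."""
    return sum(1 for qi in q_tokens for kj in kv_tokens if kj <= qi)


def ring_workload(partition):
    """table[step][dev] = useful work on device dev while it holds KV block (dev - step) % n."""
    n = len(partition)
    return [
        [unmasked_pairs(partition[dev], partition[(dev - step) % n]) for dev in range(n)]
        for step in range(n)
    ]


contiguous = ring_workload(contiguous_partition(16, 4))
striped = ring_workload(striped_partition(16, 4))

for step, (c_row, s_row) in enumerate(zip(contiguous, striped)):
    print(f"step {step}: contiguous={c_row} striped={s_row}")

# If every step lasts as long as its busiest device, the sum of per-step
# maxima is a rough proxy for total time.
contiguous_cost = sum(max(row) for row in contiguous)
striped_cost = sum(max(row) for row in striped)
print(contiguous_cost, striped_cost)  # 58 40
```

In the contiguous layout some devices have zero useful pairs in a step while others have a full 4x4 block. In the striped layout every device has between 6 and 10 useful pairs at every step, so no device sits through a fully masked block.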

TPI-LLM Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices

Extensive Reading
A similar paper is found: arxiv.org/pdf/2504.08791?
Author Info
Zonghang Li - Google Scholar
Background
LLM serving is shifting from the cloud to edge devices such as smartphones and laptops. This trend is driven by growing privacy concerns: users want to avoid sending sensitive interaction data to cloud providers. The goal is to process user requests locally on their own devices.
Preliminaries
Maintaining the KV cache under TP
Challenges
Hardware Limitations: Mobile devices have very limited memory (typically 4-16 GiB) and computing power, and often lack GPUs. Running a 70B-scale model can require over 40 GiB of memory, which far exceeds the capacity of a single device.
Inefficient Parallelism: The standard solution for distributed systems, pipeline parallelism, is inefficient for home scenarios where only one request is processed at a time. Many devices sit idle most of the time, wasting resources.
Slow Memory Offloading: Existing on-device solutions such as llama.cpp and Accelerate offload model data to disk to save RAM. However, their blocking disk I/O operations significantly slow down inference.
Insights
When low-resource devices cooperate, Tensor Parallelism is the right choice. Users issue only one request at a time, so the goal of parallelism should be to reduce latency rather than to increase throughput.
Tensor Parallelism relies on allreduce to synchronize and aggregate computation results. When low-resource devices cooperate, the communication bottleneck is link latency rather than network bandwidth, so a star-based allreduce is used to reduce the number of network hops and therefore the latency.
A sliding-window memory scheduler loads and unloads weights asynchronously. It runs in a separate background thread and hides weight loading behind computation and synchronization.
Approaches ...
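To illustrate why hop count matters when link latency dominates, here is a minimal sketch of a star-based allreduce next to a simple hop-count model. It rests on assumptions rather than on the TPI-LLM implementation: the functions `star_allreduce` and `sequential_hops` are hypothetical, the "network" is just Python lists, bandwidth is ignored, and the ring figure is the standard reduce-scatter plus all-gather schedule of 2(N-1) sequential steps.

```python
def star_allreduce(partials, hub=0):
    """Sum per-device partial results at a hub, then return one copy per device.

    partials[d] is the partial output vector held by device d. In this
    simulation the hub index only documents which device would aggregate.
    """
    # Hop 1: every worker sends its partial vector to the hub (in parallel).
    total = [sum(column) for column in zip(*partials)]
    # Hop 2: the hub broadcasts the reduced vector back to every worker.
    return [list(total) for _ in partials]


def sequential_hops(n_devices, topology):
    """Number of communication steps that must happen one after another."""
    if topology == "star":
        return 2  # gather to the hub, then broadcast
    if topology == "ring":
        return 2 * (n_devices - 1)  # reduce-scatter, then all-gather
    raise ValueError(f"unknown topology: {topology}")


def latency_bound_ms(n_devices, topology, link_latency_ms):
    """Communication time per allreduce if only link latency matters."""
    return sequential_hops(n_devices, topology) * link_latency_ms


partials = [[1, 2], [3, 4], [5, 6]]
print(star_allreduce(partials))        # [[9, 12], [9, 12], [9, 12]]
print(latency_bound_ms(8, "ring", 5))  # 70
print(latency_bound_ms(8, "star", 5))  # 10
```

Under this latency-only model the star cost stays constant as devices are added while the ring cost grows linearly. The trade-off is that the hub carries all the traffic, which is acceptable only when bandwidth is not the bottleneck, as the insight above assumes.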