TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Extensive Reading

A similar paper is found: arxiv.org/pdf/2504.08791?

Author Info
Zonghang Li (Google Scholar)

Background
LLM serving is shifting from the cloud to edge devices such as smartphones and laptops. The trend is driven by growing privacy concerns: users want to avoid sending sensitive interaction data to cloud providers, so the goal is to process requests locally on their own devices.

Preliminaries
How the KV cache is maintained under tensor parallelism (TP).

Challenges
- Hardware limitations: mobile devices have very limited memory (typically 4-16 GiB) and computing power, and often lack GPUs. Serving a 70B-scale model can require over 40 GiB of memory, far beyond the capacity of a single device.
- Inefficient parallelism: the standard distributed solution, pipeline parallelism, is inefficient in home scenarios where only one request is processed at a time; most devices sit idle and resources are wasted.
- Slow memory offloading: existing on-device solutions such as llama.cpp and Accelerate offload model data to disk to save RAM, but their blocking disk I/O significantly slows down inference.

Insights
- In a setting where low-resource devices cooperate, tensor parallelism is the right choice.
- Only one user request is served at a time, so the goal of parallelism should be to lower latency rather than to raise throughput.
- Tensor parallelism relies on allreduce operations to synchronize and aggregate partial results; in this setting the communication bottleneck is link latency rather than network bandwidth (a minimal sketch of why TP needs an allreduce is given at the end of these notes).
- Use a star-based allreduce to reduce the number of network hops and therefore the latency (sketched at the end of these notes).
- Use a sliding-window memory scheduler to load and unload weights asynchronously: a separate background thread handles the loading, hiding it behind computation and communication (sketched at the end of these notes).

Approaches
...
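The sketches below are my own minimal illustrations of the ideas noted under Insights, not TPI-LLM's actual code. The first one uses NumPy to show why tensor parallelism needs an allreduce: each device holds a row slice of a linear layer's weight, computes a partial output, and the partial outputs must be summed to recover the full result. The device count and tensor sizes are illustrative.

```python
# Minimal NumPy sketch (not the paper's code): row-parallel linear layer.
import numpy as np

num_devices = 4
d_in, d_out = 8, 8
rng = np.random.default_rng(0)

x = rng.standard_normal(d_in)            # activation for one token
W = rng.standard_normal((d_in, d_out))   # full weight of a linear layer

# Row-parallel split: device i holds rows [i*shard : (i+1)*shard] of W
# and the matching slice of x, and computes a partial output locally.
shard = d_in // num_devices
partials = [
    x[i * shard:(i + 1) * shard] @ W[i * shard:(i + 1) * shard, :]
    for i in range(num_devices)
]

# The allreduce step: summing the partial outputs reproduces x @ W.
y_parallel = np.sum(partials, axis=0)
assert np.allclose(y_parallel, x @ W)
```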
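The second sketch illustrates the star-based allreduce idea: every device sends its partial tensor to a single hub device, the hub sums them and sends the result back, so each allreduce costs two network hops regardless of the device count. This matters when link latency, not bandwidth, is the bottleneck. The hub choice and data here are made up for illustration.

```python
# Minimal sketch (my illustration, not the paper's implementation) of a
# star-based allreduce over a list of per-device partial tensors.
import numpy as np

def star_allreduce(partials, hub=0):
    """Sum per-device partial tensors via a single hub device."""
    # Hop 1: every non-hub device sends its partial tensor to the hub.
    total = partials[hub].copy()
    for rank, tensor in enumerate(partials):
        if rank != hub:
            total += tensor
    # Hop 2: the hub sends the reduced tensor back to every device.
    return [total.copy() for _ in partials]

partials = [np.full(4, float(rank)) for rank in range(4)]  # ranks 0..3
reduced = star_allreduce(partials)
assert all(np.array_equal(t, np.full(4, 6.0)) for t in reduced)  # 0+1+2+3
```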
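The third sketch illustrates the sliding-window memory scheduler idea under a simplifying assumption: a bounded buffer stands in for the window, a background thread streams layer weights from "disk" while the main thread computes, so disk I/O overlaps with computation and at most WINDOW layers are resident at once. The constants and the helper load_from_disk are hypothetical stand-ins, not TPI-LLM's API.

```python
# Rough sketch (assumed design, not the paper's code) of asynchronous,
# window-bounded weight loading in a background thread.
import queue
import threading
import time

NUM_LAYERS = 8
WINDOW = 3                          # layers kept in memory at once (illustrative)
buffer = queue.Queue(maxsize=WINDOW)

def load_from_disk(layer):
    time.sleep(0.05)                # stand-in for blocking disk I/O
    return f"weights[{layer}]"

def prefetcher():
    """Background thread: load layers ahead of the compute loop.

    put() blocks once WINDOW layers are buffered, which caps peak memory."""
    for layer in range(NUM_LAYERS):
        buffer.put((layer, load_from_disk(layer)))

threading.Thread(target=prefetcher, daemon=True).start()

for _ in range(NUM_LAYERS):
    layer, weights = buffer.get()   # waits only if prefetching fell behind
    time.sleep(0.02)                # stand-in for computing with `weights`
    print(f"computed layer {layer}")
    # Dropping `weights` here lets the layer be freed, sliding the window.
```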