A Survey on Efficient Inference for Large Language Models

General Background

Resources LLMs typically demand:
- Higher computational cost
- Higher memory access cost
- Higher memory cost

Inference Process of LLMs

LLMs generate text auto-regressively: in each generation step, the LLM takes the whole token sequence as input, including the prompt tokens and previously generated tokens, and produces the next token. As the sequence length grows, the time cost of generation rises rapidly. The KV cache technique stores and reuses the previous key and value pairs inside the Multi-Head Self-Attention block, as sketched below. ...
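A minimal sketch of the auto-regressive loop with a KV cache, written against the Hugging Face transformers greedy-decoding pattern (the model name `gpt2` is only a placeholder). During prefill the whole prompt is processed once; in every later step only the newest token is fed, and the cached key/value pairs are reused instead of re-running attention over the full sequence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with use_cache support behaves the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Efficient inference for large language models"
generated = tokenizer(prompt, return_tensors="pt").input_ids

past_key_values = None  # the KV cache
with torch.no_grad():
    for _ in range(32):
        if past_key_values is None:
            # Prefill: process the entire prompt once.
            outputs = model(input_ids=generated, use_cache=True)
        else:
            # Decode: feed only the last token; reuse cached K/V pairs.
            outputs = model(input_ids=generated[:, -1:],
                            past_key_values=past_key_values,
                            use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0]))
```

Without `past_key_values`, every step would have to re-process the full sequence, which is exactly the cost the KV cache avoids.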

July 20, 2025 · Last updated on August 1, 2025 · 21 min · KKKZOZ

Orca: A Distributed Serving System for Transformer-Based Generative Models

Background

Current serving systems schedule execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once, only after all requests in the batch have been processed.

Challenge 1: Early-finished and late-joining requests

Requests cannot finish early: because different client requests may require different numbers of iterations, requests that finish earlier than the others in the batch cannot be returned to the client, which inflates their latency (see the simulation below). ...
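A toy simulation (not Orca's implementation) of request-level scheduling, illustrating why early-finished requests are held back: the engine keeps iterating until every request in the batch is done, so short requests are only returned when the longest one finishes, and late-joining requests must wait for the next batch:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    remaining_iters: int  # decoding steps still needed

def run_batch(batch):
    """Run the whole batch to completion, request-level scheduling style."""
    step = 0
    finished_at = {}
    while any(r.remaining_iters > 0 for r in batch):
        step += 1
        for r in batch:
            if r.remaining_iters > 0:
                r.remaining_iters -= 1
                if r.remaining_iters == 0:
                    finished_at[r.rid] = step
    # Results are only returned once the entire batch is done.
    return step, finished_at

batch = [Request("A", 4), Request("B", 32), Request("C", 8)]
total, finished_at = run_batch(batch)
for rid, step in finished_at.items():
    print(f"{rid}: ready at iteration {step}, returned at iteration {total}")
# A and C are held until iteration 32, and any late-joining request must
# also wait until then before it can enter the next batch.
```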

July 2, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Background

Traditional LLM deployment mainly takes two forms:
- Cloud deployment: the model lives entirely on cloud servers. Compute is plentiful, but this brings high network latency and bandwidth cost, and risks leaking users' private data.
- Edge deployment: the model runs directly on edge devices close to users. This addresses latency and privacy, but edge devices (e.g., phones, IoT gateways) have very limited compute and memory and struggle to host LLMs with billions of parameters.

Existing remedies, such as model quantization (compressing the model), cost accuracy, while simple cloud-edge collaboration (splitting the model into two parts) still depends heavily on a high-quality connection to the cloud.

The paper is the first to propose EdgeShard, a general LLM inference framework for the Collaborative Edge Computing (CEC) environment, which pools the compute of geographically distributed, heterogeneous edge devices and cloud servers into a shared resource pool that jointly executes LLM inference.

Core Insights

EdgeShard intelligently "shards" a compute-intensive LLM and places the shards on a carefully selected set of heterogeneous devices (edge devices and cloud servers). This lets it:
- Break the memory bottleneck: a model too large for any single device is spread across several devices, making very large models (e.g., Llama2-70B) deployable.
- Optimize inference performance: by jointly considering each device's compute capability, memory size, and the network bandwidth between devices, it decides which devices participate and where to cut the model, minimizing inference latency or maximizing system throughput.
- Protect data privacy: a placement constraint forces the model's input (first) layer onto the source device that holds the user data, so raw data never travels over the network.

Main Approach

EdgeShard works in three stages:
1. Offline Profiling: a one-time preparation step. The system measures and records the key information needed to run the LLM, including the average execution time of every layer on every device (covering both the prefill and auto-regressive generation phases), the size and memory footprint of each layer's output activations (intermediate results), each device's available memory budget, and the network bandwidth between devices.
2. Task Scheduling Optimization: using the profiling data, the scheduler solves a joint device-selection and model-partitioning problem. The paper designs two algorithms for its two optimization objectives (a sketch follows below). Hint: both algorithms are plain dynamic programming. ...
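A simplified dynamic-programming sketch of the latency-oriented device-selection and model-partitioning step. All profiling numbers (per-layer execution times, activation sizes, bandwidths) are made up for illustration, per-device memory limits are omitted, and the only constraint kept is pinning the input layer to the source device; the paper's actual algorithms are richer:

```python
import math

L, D = 4, 3                              # layers, devices
# exec_time[l][d]: seconds for layer l on device d (device 0 = slow edge, 2 = fast cloud)
exec_time = [[9, 5, 2],
             [9, 5, 2],
             [9, 5, 2],
             [9, 5, 2]]
act_size = [1.0, 1.0, 1.0]               # MB of activations leaving layers 0..L-2
bandwidth = [[math.inf, 10, 2],          # MB/s between device pairs
             [10, math.inf, 5],
             [2, 5, math.inf]]
source_device = 0                        # input layer pinned here for privacy

def comm(l, d_from, d_to):
    """Transfer time of layer l's activations between two devices."""
    return 0.0 if d_from == d_to else act_size[l] / bandwidth[d_from][d_to]

# dp[l][d]: minimal latency to finish layers 0..l with layer l placed on device d
dp = [[math.inf] * D for _ in range(L)]
dp[0][source_device] = exec_time[0][source_device]
for l in range(1, L):
    for d in range(D):
        dp[l][d] = min(dp[l - 1][p] + comm(l - 1, p, d) + exec_time[l][d]
                       for p in range(D))

print("min end-to-end latency:", min(dp[L - 1]))
```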

July 1, 2025 · Last updated on August 1, 2025 · 1 min · KKKZOZ

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Background

Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of processed data.

Key components in GPU serverless clusters:
- Controller:
  - Request Router: directs incoming requests to nodes already running LLM inference processes, or instructs the Model Loading Scheduler.
  - Model Loading Scheduler: activates LLM inference processes on unallocated GPUs.

The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads, largely because cold starts make up a substantial share of requests in serverless clusters. ...
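A minimal sketch (not ServerlessLLM's code) of the routing decision described above: the Request Router prefers nodes that already host a warm inference process for the requested model, and otherwise hands the request to the Model Loading Scheduler, which starts the model on an unallocated GPU and pays the cold-start cost:

```python
from collections import defaultdict

class Controller:
    def __init__(self):
        self.live = defaultdict(set)        # model -> nodes with a warm inference process
        self.free_gpus = {"node-1", "node-2", "node-3"}  # hypothetical unallocated nodes

    def route(self, model, request):
        warm_nodes = self.live[model]
        if warm_nodes:
            node = next(iter(warm_nodes))   # warm path: no model loading needed
            return f"dispatch {request} to {node}"
        return self.schedule_load(model, request)

    def schedule_load(self, model, request):
        if not self.free_gpus:
            return f"queue {request}: no unallocated GPU"
        node = self.free_gpus.pop()         # cold path: load checkpoint, then serve
        self.live[model].add(node)
        return f"cold-start {model} on {node}, then dispatch {request}"

ctrl = Controller()
print(ctrl.route("llama-2-7b", "req-1"))    # cold start on a free GPU
print(ctrl.route("llama-2-7b", "req-2"))    # routed to the now-warm node
```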

June 28, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ