EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading In arXiv and other papers this work is often cited under a different name: LLMCad. Author Info Daliang Xu (徐大亮) Wangsong Yin Xin Jin Mengwei Xu Professor Xuanzhe Liu @ Peking University Background The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”: when an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ

Efficient Memory Management for Large Language Model Serving with PagedAttention

Extensive Reading Author Info Woosuk Kwon Zhuohan Li Background The existing systems suffer from internal and external memory fragmentation. Three primary sources of memory waste: Reserved slots: space reserved for future tokens that is not yet used. Internal fragmentation: space within an allocated memory block that will never be used. External fragmentation: unused space between memory blocks. The existing systems also cannot exploit the opportunities for memory sharing: parallel sampling, beam search, and shared prefixes could leverage a shared KV cache to reduce the memory footprint. ...
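The core idea of PagedAttention-style allocation can be sketched in a few lines. This is a toy model with an assumed interface, not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, and each sequence holds a block table instead of one contiguous reservation, so internal fragmentation is bounded by one block per sequence.

```python
# Toy sketch of paged KV-cache allocation: sequences grow block by
# block on demand, so no contiguous region is reserved up front.
# The class and field names are illustrative assumptions.

class BlockAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        used = self.lengths.get(seq_id, 0)
        if used == len(table) * self.block_size:  # current blocks are full
            table.append(self.free.pop())         # allocate one more block
        self.lengths[seq_id] = used + 1
```

Because blocks are allocated only when a sequence actually grows into them, the free pool can also back copy-on-write sharing for parallel sampling and shared prefixes.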

July 23, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General Background Resources LLMs typically demand: Higher Computational Cost Higher Memory Access Cost Higher Memory Cost Inference Process of LLMs auto-regressive generation In each generation step, the LLM takes as input the whole token sequence, including the input tokens and previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique can store and reuse previous key and value pairs within the Multi-Head Self-Attention block. ...
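The effect of the KV cache on the generation loop can be sketched abstractly. This is an illustration under assumed stand-in functions (`embed`, `attend`, `next_token` are not a real API): with the cache, each decode step processes only the newest token instead of re-encoding the whole sequence.

```python
# Sketch of autoregressive decoding with a KV cache: the prompt is
# processed once (prefill), then each decode step reuses the cached
# per-token state and adds only one new entry. All callables here are
# illustrative placeholders for real model components.

def generate(tokens, embed, attend, next_token, steps):
    cache = []                      # cached per-token state (stands in for K/V pairs)
    for t in tokens:                # prefill: process the prompt once
        cache.append(embed(t))
    out = list(tokens)
    for _ in range(steps):
        t = next_token(attend(cache))  # decode: attend over cached state
        cache.append(embed(t))         # only the new token is processed
        out.append(t)
    return out
```

Without the cache, step `n` would recompute keys and values for all `n` previous tokens, which is why generation time grows so quickly with sequence length.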

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ

Sonata: Multi-Database Transactions Made Fast and Serializable

Background Modern applications are often built using a service-oriented architecture, such as microservices, where different functionalities are handled by independent services, each with its own dedicated database. This design leads to workflows that span multiple services and databases, creating the need for multi-database transactions. Without proper coordination, these transactions can suffer from concurrency anomalies, violating business rules and data consistency. Local serializability at all participating databases does not imply global serializability! ...
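The warning in the last sentence can be made concrete with a toy check. This is an illustrative example, not from the paper: two databases each commit their local pieces in a serial order, yet the two local orders disagree, so no single global serialization order exists.

```python
# Toy illustration: local serializability at every database does not
# imply global serializability. We brute-force search for a total order
# consistent with all local commit orders.
from itertools import permutations

def globally_serializable(local_orders):
    """True iff some total order of transactions is consistent with
    every database's local commit order."""
    txns = sorted({t for order in local_orders for t in order})
    def respects(total, local):
        pos = {t: i for i, t in enumerate(total)}
        return all(pos[a] < pos[b] for a, b in zip(local, local[1:]))
    return any(all(respects(p, order) for order in local_orders)
               for p in permutations(txns))
```

With DB A committing `T1` before `T2` and DB B committing `T2` before `T1`, both databases are locally serial, but no global order satisfies both, which is exactly the anomaly multi-database coordination must prevent.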

July 13, 2025 · Last updated on August 1, 2025 · 6 min · KKKZOZ

Concurrency Control as a Service

Background Disaggregated databases typically decouple the system into: an execution layer, which requires substantial computational resources a storage layer, which necessitates significant storage capacity Concurrency Control (CC) is a key function module in databases The resource requirements of CC match neither layer: With SQL execution: execution prefers relatively more compute nodes, but CC prefers fewer nodes With data storage: data storage nodes have substantial storage capacity but limited computing resources Yet most existing cloud-native databases simply couple CC with either the execution layer or the storage layer. ...
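What "CC as a service" means in practice can be sketched minimally. This is a hypothetical lock-table interface, not the paper's actual design: execution nodes call out to a dedicated CC service for lock decisions instead of running concurrency control themselves, so the CC tier can be sized independently of both execution and storage.

```python
# Minimal sketch of concurrency control factored out into its own
# service: a centralized lock table that execution nodes query.
# The API (acquire/release_all) is an illustrative assumption.

class CCService:
    def __init__(self):
        self.locks = {}            # key -> owning transaction id

    def acquire(self, txn, key):
        owner = self.locks.get(key)
        if owner is None or owner == txn:
            self.locks[key] = txn  # grant (re-entrant for the owner)
            return True
        return False               # conflict: caller must wait or abort

    def release_all(self, txn):
        self.locks = {k: o for k, o in self.locks.items() if o != txn}
```

Because lock state lives in the service rather than on compute or storage nodes, the CC tier can use few nodes even when the execution layer scales out.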

July 11, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ

Orca: A Distributed Serving System for Transformer-Based Generative Models

Background Current serving systems schedule the execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once, after processing all requests within the batch. Challenge 1: Early-finished and late-joining requests Requests cannot finish early As different client requests may require different numbers of iterations to process, requests that finish earlier than others in the batch cannot return to the client, resulting in increased latency. ...
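The alternative granularity Orca argues for, iteration-level scheduling, can be sketched as a loop. This is an illustration with a made-up `steps_left` field, not Orca's actual scheduler: the engine runs one iteration for the whole batch, releases finished requests immediately, and admits waiting requests at the next iteration boundary.

```python
# Sketch of iteration-level scheduling: finished requests leave the
# batch as soon as they complete, and late-joining requests enter at
# iteration boundaries instead of waiting for the whole batch.
# Request dicts with a 'steps_left' counter are an illustrative stand-in.

def run(engine_step, waiting, max_batch=4):
    batch, done = [], []
    while batch or waiting:
        while waiting and len(batch) < max_batch:
            batch.append(waiting.pop(0))        # late-joining requests
        engine_step(batch)                      # one iteration, not one request
        finished = [r for r in batch if r["steps_left"] == 0]
        done.extend(finished)                   # early-finished return now
        batch = [r for r in batch if r["steps_left"] > 0]
    return done
```

With request-level scheduling, the one-step request below would sit in the batch until the three-step request finished; here it returns two iterations earlier.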

July 2, 2025 · Last updated on August 19, 2025 · 5 min · KKKZOZ

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Background There are two traditional ways to deploy LLMs: Cloud deployment: the model runs entirely on cloud servers. This offers strong compute, but incurs high network latency and bandwidth costs, and risks leaking users’ private data. Edge deployment: the model runs directly on edge devices close to users. This addresses latency and privacy, but edge devices (e.g., phones, IoT gateways) have very limited compute and memory, making it hard to host LLMs with billions of parameters. Existing solutions such as model quantization (compressing the model) sacrifice accuracy, while simple cloud-edge collaboration (splitting the model into two parts) still depends heavily on a high-quality connection to the cloud. The paper is the first to propose EdgeShard, a general LLM inference framework targeting a Collaborative Edge Computing (CEC) environment, which pools the compute resources of geographically distributed, heterogeneous edge devices and cloud servers into a shared resource pool that serves LLM inference collectively. Core Insights EdgeShard intelligently “shards” a compute-intensive LLM and deploys the shards onto a carefully selected set of heterogeneous devices (both edge devices and cloud servers). In this way it can: Break the memory wall: a model too large for any single device is spread across multiple devices, making it feasible to deploy very large models such as Llama2-70B. Optimize inference performance: by jointly considering each device’s compute capability and memory size as well as the network bandwidth between devices, it decides which devices participate and how the model is partitioned, minimizing inference latency or maximizing system throughput. Preserve data privacy: the placement policy forces the model’s input (first) layer to stay on the source device that holds the user data, so raw data never travels over the network, reducing the risk of privacy leakage. Main Approach The EdgeShard framework works in three main stages: 1. Offline Profiling This is a one-time preparation step. The system measures and records the key information needed to run the LLM, including: the average execution time of each layer on each device (covering both the prefill and autoregressive generation phases); the size and memory footprint of the activations (intermediate results) produced by each layer; each device’s available memory budget and the network bandwidth between devices. 2. Task Scheduling Optimization The scheduler uses the profiled data to solve a joint device-selection and model-partitioning optimization problem. For the two different optimization objectives, the paper designs two corresponding algorithms. Note Both algorithms are simple dynamic programming. ...
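The shape of the layer-partitioning dynamic program can be sketched with made-up costs. This is a toy in the spirit of the scheduling stage described above, not the paper's actual algorithm: for each layer, choose a device so that total compute time plus activation-transfer time (when consecutive layers sit on different devices) is minimized.

```python
# Toy DP for joint layer placement: dp[d] holds the best cost of
# placing the layers seen so far with the current layer on device d.
# layer_cost and transfer are illustrative profiled inputs.

def partition(layer_cost, transfer):
    """layer_cost[l][d]: time of layer l on device d;
    transfer[p][d]: cost of moving activations from device p to d."""
    n_layers, n_dev = len(layer_cost), len(layer_cost[0])
    dp = [layer_cost[0][d] for d in range(n_dev)]       # place layer 0
    for l in range(1, n_layers):
        dp = [min(dp[p] + transfer[p][d] for p in range(n_dev))
              + layer_cost[l][d]                         # extend to layer l
              for d in range(n_dev)]
    return min(dp)
```

In the test below, each device is fast for one layer; the DP pays the transfer cost of 2 to switch devices because 1 + 2 + 1 beats keeping either device throughout (11). Memory limits and the privacy constraint on the first layer would add extra feasibility checks to the same recurrence.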

July 1, 2025 · Last updated on August 19, 2025 · 1 min · KKKZOZ

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Background Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of processed data. Key components in GPU serverless clusters: Controller: Request Router: directs incoming requests to nodes already running LLM inference processes, or instructs the model loading scheduler. Model Loading Scheduler: activates LLM inference processes on unallocated GPUs. The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads, largely due to the substantial proportion of cold starts in serverless clusters. ...
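The locality-aware routing decision described above can be sketched simply. This is an illustrative model with assumed fields (`warm`, `load_cost`), not the system's actual interface: prefer a node where the model is already resident, and only when none exists pick the node with the cheapest checkpoint load, i.e., the smallest cold-start penalty.

```python
# Sketch of locality-enhanced request routing: warm nodes serve
# immediately; otherwise the router picks the node where cold-starting
# the model is cheapest. Node dicts here are illustrative assumptions.

def route(request_model, nodes):
    """nodes: list of dicts with 'warm' (set of resident model names)
    and 'load_cost' (seconds to cold-start the model)."""
    warm = [n for n in nodes if request_model in n["warm"]]
    if warm:
        return warm[0], 0.0                   # locality hit: no cold start
    best = min(nodes, key=lambda n: n["load_cost"])
    return best, best["load_cost"]            # pay the load latency once
```

The gap between the two return paths is exactly the cold-start overhead the snippet identifies as the dominant latency source.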

June 28, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ