EdgeShard Efficient LLM Inference via Collaborative Edge Computing

Background 传统的 LLM 部署方式主要有两种: 云端部署:将模型完全部署在云服务器上。这种方式虽然计算能力强,但会带来较高的网络延迟、带宽成本,并可能引发用户数据隐私泄露的风险 。 边缘端部署:将模型直接部署在靠近用户的边缘设备上。这种方式可以有效解决延迟和隐私问题,但边缘设备(如手机、物联网网关)的计算和内存资源非常有限,难以承载动辄数十亿参数的LLM 。 现有的解决方案,如模型量化(压缩模型)会造成精度损失 ,而简单的云-边协同(将模型切分两部分)仍然严重依赖与云端的高质量连接 。 论文首次提出了一种名为 EdgeShard 的通用 LLM 推理框架,旨在利用协同边缘计算(Collaborative Edge Computing, CEC)环境 。该环境整合了地理上分布的、异构的多个边缘设备和云服务器的计算资源,形成一个共享资源池 ,共同执行LLM推理任务。 Core Insights EdgeShard 将一个计算密集的LLM智能地 “分片(Shard)”,并将这些分片部署到一组经过精心挑选的异构计算设备上(包括边缘设备和云服务器)。通过这种方式,它能够: 突破内存瓶颈:将一个大到任何单个设备都无法承载的模型,分散到多个设备上,使得部署超大规模模型(如Llama2-70B)成为可能 。 优化推理性能:综合考虑各个设备的计算能力、内存大小以及它们之间的网络带宽等因素,智能地决定哪些设备参与计算以及如何切分模型,从而实现最小化推理延迟或最大化系统吞吐量 。 保障数据隐私:通过策略强制模型的输入层(第一层)必须部署在用户数据所在的源设备上,避免了原始数据在网络中传输,从而降低了隐私泄露风险 。 主要方法 为了实现上述思路,EdgeShard框架的设计包含三个主要阶段: 1. 离线性能剖析 (Offline Profiling) 这是一个一次性的准备步骤 。系统会全面地测量和记录运行LLM所需的关键信息,包括: 模型每一层在不同设备上的平均执行时间(同时考虑了预填充和自回归生成两个阶段) 。 每一层计算后产生的激活值(即中间结果)的大小和内存消耗 。 各个设备的可用内存上限以及设备之间的网络带宽 。 2. 任务调度优化 (Task Scheduling Optimization) 调度器会利用第一阶段收集到的数据,来解决一个“联合设备选择与模型划分”的优化问题。论文针对两种不同的优化目标,设计了两种对应的算法: Note 提出的两种算法都是简单的动态规划 ...

July 1, 2025 · Last updated on September 2, 2025 · 1 min · KKKZOZ

ServerlessLLM Locality-Enhanced Serverless Inference for Large Language Models

Background Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of processed data. Key components in GPU serverless clusters: Controller: Request Router: direct incoming requests to nodes already running LLM inference processes, or instructs the model loading scheduler. Model Loading Scheduler: activate LLM inference processes on unallocated GPUs. TODO The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads. This is largely due to the substantial proportions of cold-start in serverless clusters. ...

June 28, 2025 · Last updated on September 1, 2025 · 3 min · KKKZOZ

Data Management in Microservices: State of the Practice, Challenges, and Research Directions

FAQ What is polyglot persistence principle? Polyglot Persistence is a term coined by Martin Fowler and Pramod Sadalage to describe the concept of using different data storage technologies to handle varying data storage needs within a given application. The principle behind polyglot persistence is that no single database can serve all needs of a modern application. Different kinds of data are best dealt with different types of databases. For example, you might use a relational database such as MySQL for transaction data, a NoSQL document store like MongoDB for semi-structured data, and a graph database like Neo4j for relationships between entities. Instead of trying to make one data storage technology work for all types of data, you use the type of database that is best suited for each particular need. This approach can result in a system that’s more scalable, performant, and easier to work with. Polyglot Persistence represents a shift away from the traditional, one-size-fits-all database strategy, allowing for a more flexible and optimized approach to data management. ...

April 15, 2024 · Last updated on August 1, 2025 · 5 min · KKKZOZ

Paper Note: Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

Designing a TPS for clouds will meet these challenges: Diverse functionality. The concept of translytics as “a unified and integrated data platform that supports multi-workloads such as transactional, operational, and analytical simultaneously in realtime, … and ensures full transactional integrity and data consistency”. Scale Cloud-first data platforms are designed to scale well beyond the limits of a single application. Latency With the thrust into new interactive domains like social networks, messaging and algorithmic trading,latency becomes essential. Multi-tenancy Maintaining access rights is therefore an important design consideration for TPSs. 论文的工作集中于以下几点: ...

March 25, 2024 · Last updated on August 1, 2025 · 2 min · KKKZOZ

Paper Note: Omid, Reloaded: Scalable and Highly-Available Transaction Processing

An entirely re-designed version of Omid1. Omid’s job is to manage the transaction control plane. The transaction metadata includes: A dedicated table(Commit Tabla, CT) holds a single record per committing transaction. Per-row metadata for items accessed by transactions. Feature: A middleware oriented. Snapshot isolate should provide: Visibility check rules. Write conflicts detection. Visibility Check Rules Intuitively, with SI, a transaction’s reads all appear to occur the time when it begins($ts_r$) while its writes appear to execute when it commits($ts_c$). ...

March 22, 2024 · Last updated on August 1, 2025 · 2 min · KKKZOZ

Paper Note: Towards Transaction as a Service

Background Database systems have evolved to be with a cloud-native architecture,i.e., disaggregation of compute and storage architecture, which decouples the storage from the compute nodes, then connects the compute nodes to shared storage through a high-speed network. The principle of cloud-native architecture is decoupling. (decoupled functions to make good use of disaggregated resources) Most existing cloud-native databases couple TP either with storage layer or with execution layer. Coupling TP with Storage Layer TiDB adopts a distributed transactional key-value store TiKV as the storage layer. ...

January 22, 2024 · Last updated on August 1, 2025 · 6 min · KKKZOZ

Paper Note: Scalable Distributed Transactions across Heterogeneous Stores

FAQ What is the difference between rolling backward and rolling forward in database transactions? “Rolling backward” and “rolling forward” in the context of database transactions refer to two distinct phases of the recovery process that helps maintain the integrity and consistency of the database after a system crash or failure. These concepts are tied to the idea of transaction logs that record the changes made to the database. Below are the key differences between rolling backward and rolling forward: ...

December 9, 2023 · Last updated on August 1, 2025 · 5 min · KKKZOZ

Paper Note: GRIT: Consistent Distributed Transactions across Polyglot Microservices with Multiple Databases

FAQ What is a deterministic database? A deterministic database is a system where the outcomes of any database operations are guaranteed to be the same every time they are executed, provided that the operations are started from the same database state. This concept implies a level of reliability and predictability in the behavior of the database system. Deterministic behavior is essential in many contexts, especially in distributed databases, where operations might need to be coordinated across multiple nodes, or in any system where replication, fault tolerance, and consistency are important. If a database operation is deterministic, it means the following: ...

December 8, 2023 · Last updated on August 1, 2025 · 4 min · KKKZOZ