LoRA: Low-Rank Adaptation of Large Language Models

Extensive Reading · Author Info: About | Edward Hu: Edward Hu is a founding partner in a stealth AI company in Woodside, CA. He was a researcher at OpenAI and received his research training as a Ph.D. student advised by Yoshua Bengio, a recipient of the 2018 A.M. Turing Award. Before graduate school, Edward was a researcher at Microsoft, where he invented LoRA and μTransfer. Yelong Shen - Microsoft | AMiner. Background: The dominant paradigm in modern NLP is ...
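The excerpt cuts off before the method itself; as a reminder of what the paper proposes, here is a minimal PyTorch-style sketch of a low-rank adapter around a frozen linear layer (class name, rank, and scaling below are illustrative, not taken from the post):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = x W^T + (alpha / r) * x (B A)^T, where rank(B A) <= r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # the pretrained weights stay frozen
            p.requires_grad = False
        # A: (r, in_features), B: (out_features, r); B starts at zero so training begins at the base model
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```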

July 25, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Extensive Reading · Author Info: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng. Background: Existing works only consider a token sequence generated by a single SSM for speculation, which cannot align well with an LLM due to the model capacity gap between them; the probability of a successful alignment between the LLM and the speculated token sequence decays exponentially with the expected alignment length. Challenges: How to generate a token tree in an extremely large search space? How to verify the whole token tree in a single verification pass? Insights: Simultaneously consider a diversity of speculation candidates (instead of just one as in existing approaches) to maximize speculative performance. ...
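A toy sketch of how verifying a speculated token tree against the LLM's greedy choices could look; the tree layout and the `llm_greedy_next` callback are illustrative stand-ins (in SpecInfer itself, the predictions for all tree nodes come from a single tree-attention verification pass):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    token: int
    children: list["TreeNode"] = field(default_factory=list)

def verify_token_tree(root: TreeNode, prefix: list[int], llm_greedy_next) -> list[int]:
    """Greedy verification of a speculated token tree.

    `root` is a dummy node whose children are the first speculated tokens after
    `prefix`. At each step we take the LLM's greedy next token for the tokens
    accepted so far and descend into the matching child if one exists; on the
    first mismatch we keep the LLM's own token and stop, so every accepted
    token is exactly what the LLM would have generated."""
    accepted = list(prefix)
    node = root
    while True:
        target = llm_greedy_next(accepted)   # in SpecInfer, read out of one verification pass
        child = next((c for c in node.children if c.token == target), None)
        accepted.append(target)              # a verified speculated token, or the LLM's correction
        if child is None:
            return accepted
        node = child
```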

July 25, 2025 · Last updated on September 1, 2025 · 2 min · KKKZOZ

EdgeLLM: Fast On-Device LLM Inference with Speculative Decoding

Extensive Reading · Note: on arXiv and in other papers this work is often cited under a different name, LLMCad. Author Info: Daliang Xu (徐大亮) - Daliang Xu's Website, Wangsong Yin - Google Scholar, Xin Jin, Mengwei Xu, Professor Xuanzhe Liu @ Peking University. Background: The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM's parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this "scaling law" is challenged on mobile devices by a "memory wall". When an LLM is too large to fit into a device's memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ

Efficient Memory Management for Large Language Model Serving with PagedAttention

Extensive Reading · Author Info: Woosuk Kwon, Zhuohan Li. Background: The existing systems suffer from internal and external memory fragmentation. Two primary sources of memory waste: Internal fragmentation: space that will not be used in the future within an allocated memory block. External fragmentation: unused space between memory blocks. The existing systems also cannot exploit the opportunities for memory sharing: parallel sampling, beam search, and shared prefixes have the potential to leverage a shared KV cache to reduce the memory footprint. ...
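A minimal sketch of the paged KV-cache bookkeeping that avoids reserving a contiguous max-length buffer per request; block size, class names, and the reference-counted fork for sharing are illustrative, not the vLLM implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a free list, so a sequence
    only occupies the whole blocks it actually uses."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}                      # physical block id -> number of sequences sharing it

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block: int) -> int:
        """Share a block between sequences (e.g. parallel sampling); a write would copy-on-write."""
        self.ref_count[block] += 1
        return block

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []                    # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:    # current block is full (or this is the first token)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```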

July 23, 2025 · Last updated on August 19, 2025 · 3 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General · Background Resources: LLMs typically demand higher computational cost, higher memory access cost, and higher memory cost. Inference Process of LLMs: auto-regressive generation. In each generation step, the LLM takes as input the whole token sequence, including the input tokens and previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique can store and reuse previous key and value pairs within the Multi-Head Self-Attention block. ...
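A short sketch of auto-regressive decoding with a KV cache, assuming a model that follows the Hugging Face convention of returning `logits` and `past_key_values` when called with `use_cache=True`; after the one-time prefill, each decode step feeds only the newly generated token and reuses the cached keys/values:

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=32, eos_id=None):
    """Greedy auto-regressive generation with KV-cache reuse (batch size 1 assumed for the EOS check)."""
    out = model(input_ids=input_ids, use_cache=True)                     # prefill: process the whole prompt once
    past, tokens = out.past_key_values, input_ids
    for _ in range(max_new_tokens):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)      # greedy pick of the next token
        tokens = torch.cat([tokens, next_id], dim=-1)
        if eos_id is not None and next_id.item() == eos_id:
            break
        out = model(input_ids=next_id,                                   # decode step: only the new token
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
    return tokens
```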

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ

Sonata: Multi-Database Transactions Made Fast and Serializable

Background: Modern applications are often built using a service-oriented architecture, such as microservices, where different functionalities are handled by independent services, each with its own dedicated database. This design leads to workflows that span multiple services and databases, creating the need for multi-database transactions. Without proper coordination, these transactions can suffer from concurrency anomalies, violating business rules and data consistency. Local serializability at all participating databases does not imply global serializability! ...
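A small illustration of the last point: each database below is serializable on its own, but their local serialization orders disagree, so no global serial order exists (transaction names and orders are made up for the example):

```python
from itertools import permutations

# Two databases, two transactions. Each database's local schedule is serializable,
# but DB_X observes T1 before T2 while DB_Y observes T2 before T1.
local_orders = {"DB_X": [("T1", "T2")], "DB_Y": [("T2", "T1")]}

def globally_serializable(local_orders) -> bool:
    """A global serial order must respect every database's local order;
    if no permutation of the transactions does, the execution is not globally serializable."""
    constraints = [edge for edges in local_orders.values() for edge in edges]
    txns = {t for a, b in constraints for t in (a, b)}
    for order in permutations(txns):
        pos = {t: i for i, t in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in constraints):
            return True
    return False

print(globally_serializable(local_orders))  # False: local serializability != global serializability
```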

July 13, 2025 · Last updated on August 1, 2025 · 6 min · KKKZOZ

Concurrency Control as a Service

Background: Disaggregated databases typically decouple the system into an execution layer, which requires substantial computational resources, and a storage layer, which necessitates significant storage capacity. Concurrency Control (CC) is a key function module in databases. The resource requirements of CC are consistent with neither SQL execution nor data storage: execution prefers relatively more compute nodes, but CC prefers fewer nodes; data storage nodes have substantial storage capacity but limited computing resources. Yet most existing cloud-native databases simply couple CC either with the execution layer or with the storage layer. ...

July 11, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ

Orca: A Distributed Serving System for Transformer-Based Generative Models

Background: Current serving systems schedule the execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns inference results for the entire batch at once, after processing all requests within the batch. Challenge 1: Early-finished and late-joining requests. Requests cannot finish early: since different client requests may require different numbers of iterations to process, requests that finish earlier than others in the batch cannot return to the client, resulting in increased latency. ...
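A rough sketch of the iteration-level scheduling idea that addresses this: the batch is re-formed every iteration, so finished requests return immediately and late-joining requests are admitted without waiting for the whole batch to drain. The `engine.step` interface and request objects here are assumptions for illustration, not Orca's actual API:

```python
from collections import deque

def serve(engine, incoming: deque, max_batch: int = 8):
    """Iteration-level scheduling loop.

    `engine.step(batch)` is assumed to run one decoding iteration for all
    requests in `batch` and return the requests that just produced their
    final token."""
    running = []
    while running or incoming:
        while incoming and len(running) < max_batch:   # admit late-joining requests each iteration
            running.append(incoming.popleft())
        finished = engine.step(running)                # one iteration for the current batch
        for req in finished:
            req.respond()                              # early-finished requests return immediately
            running.remove(req)
```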

July 2, 2025 · Last updated on August 19, 2025 · 5 min · KKKZOZ