DDIA: Chapter 4 Encoding and Evolution

Formats for Encoding Data Two kinds of compatibility are introduced here, and both come up again later when analyzing data encoding formats: In order for the system to continue running smoothly, we need to maintain compatibility in both directions: Backward compatibility Newer code can read data that was written by older code. Forward compatibility Older code can read data that was written by newer code. A literal Chinese translation is problematic: English "forward/backward" is consistent across time and space, while Chinese is the opposite. For example, "forward" means moving ahead in space and the future in time, but the Chinese "前" means ahead in space yet the past in time. Backward compatibility (向后兼容) is easy to grasp: a newer version of software/hardware can use data produced by an older version. Translating forward compatibility as 向前兼容 is very confusing; it helps to read it as "compatible with the future": an older version of software/hardware can use data produced by a newer version. A few examples: Intel's x86 instruction-set CPUs are backward compatible, because a new CPU can still run old software. Intel guarantees that every instruction an older CPU had is kept in newer ones; this add-only, never-remove policy means we do not have to replace much software when we swap CPUs. ...
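A minimal sketch of both directions of compatibility (my own illustration, not the book's code; field names and defaults are made up), assuming a JSON encoding where readers ignore unknown fields and supply defaults for missing ones:

```python
import json

# v1 of the format knows only "name"; v2 adds an optional "nickname".

def decode_user_v2(data: bytes) -> dict:
    """Newer (v2) reader: tolerates v1 records -> backward compatibility."""
    record = json.loads(data)
    return {
        "name": record["name"],
        # Missing in old data: fall back to a default instead of failing.
        "nickname": record.get("nickname", ""),
    }

def decode_user_v1(data: bytes) -> dict:
    """Older (v1) reader: ignores unknown fields -> forward compatibility."""
    record = json.loads(data)
    return {"name": record["name"]}  # silently skips the unknown "nickname"

old_record = json.dumps({"name": "Alice"}).encode()
new_record = json.dumps({"name": "Bob", "nickname": "bo"}).encode()

print(decode_user_v2(old_record))  # new code reads old data
print(decode_user_v1(new_record))  # old code reads new data
```

The add-only discipline is the same as Intel's: a field, once written by some version, is never repurposed or removed, so readers on either side of an upgrade can keep going.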

October 21, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

DDIA: Chapter 2 Data Models and Query Languages

Relational Model Versus Document Model It starts with the birth of NoSQL: There are several driving forces behind the adoption of NoSQL databases, including: A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput A widespread preference for free and open source software over commercial database products Specialized query operations that are not well supported by the relational model Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model The résumé in the figure below is then used to illustrate the one-to-many relationship ...
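A hedged sketch of that one-to-many structure as a single self-contained document (loosely modeled on the book's résumé figure; the exact fields here are illustrative): one person owns many positions and many education entries, all nested in one record.

```python
import json

# One user document containing several one-to-many relationships:
# one person -> many positions, many education entries.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

# The whole tree can be stored and fetched as one document, with no joins.
print(json.dumps(resume, indent=2))
```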

October 20, 2023 · Last updated on August 1, 2025 · 7 min · KKKZOZ

DDIA: Chapter 3 Storage and Retrieval

This chapter is about the lower-level internals of databases. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood. Data Structures That Power Your Database Index Any kind of index usually slows down writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. ...
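A minimal sketch of that trade-off (my own illustration, not the book's code): an append-only log with an in-memory hash index, where every write pays a little extra to keep the index current so that reads can seek straight to the value.

```python
import os

class LogWithHashIndex:
    """Append-only log file plus an in-memory hash index (key -> byte offset).

    Toy format: one "key,value" line per record; keys must not contain
    commas or newlines.
    """

    def __init__(self, path: str):
        self.path = path
        self.index = {}           # extra structure every write must maintain
        open(path, "ab").close()  # ensure the log file exists

    def put(self, key: str, value: str) -> None:
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)        # offset where this record will start
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset          # writes slow down to keep this updated

    def get(self, key: str) -> str:
        offset = self.index[key]          # but reads jump straight to the record
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = LogWithHashIndex("/tmp/toy_log.db")
db.put("42", "San Francisco")
print(db.get("42"))
```

Every additional index would add another update inside `put`, which is exactly why well-chosen indexes help reads but each one taxes writes.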

October 20, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Extensive Reading Author Info Background Existing LLM serving systems typically colocate the prefill and decoding phases on the same set of GPUs, often using scheduling techniques like continuous batching to mix the computation of both phases. This colocation strategy creates severe prefill-decoding interference, where the long, compute-intensive prefill tasks block the short, memory-intensive decoding tasks, significantly degrading both the Time-To-First-Token (TTFT) and the Time-Per-Output-Token (TPOT). Colocation also couples the resource allocation and parallelism strategies for both phases, forcing them to share the same configuration even though their computational characteristics and latency requirements are fundamentally different, which leads to resource over-provisioning and inefficient performance. Insights Disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs, which brings two benefits: ...
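A heavily simplified sketch of the disaggregated flow (my own toy illustration, not DistServe's API; the model calls below are stand-ins): a prefill worker produces the KV cache for a prompt on one device, hands it off, and a decode worker generates tokens on another device, so the two phases no longer interfere and can be scheduled independently.

```python
from dataclasses import dataclass, field
from queue import Queue
from threading import Thread

@dataclass
class Request:
    prompt: str
    kv_cache: list = None                        # produced by the prefill worker
    output: list = field(default_factory=list)

# Stand-ins for the real model; a real system runs these on separate GPUs.
def run_prefill(prompt):               # compute-heavy: process the whole prompt at once
    return [hash(tok) % 97 for tok in prompt.split()]

def run_decode_step(kv_cache, step):   # memory-bound: one token per iteration
    return f"tok{sum(kv_cache) + step}"

def prefill_worker(inbox: Queue, handoff: Queue):
    while True:
        req = inbox.get()
        req.kv_cache = run_prefill(req.prompt)
        handoff.put(req)                         # hand the KV cache to the decode side

def decode_worker(handoff: Queue, done: Queue, max_tokens=4):
    while True:
        req = handoff.get()
        for step in range(max_tokens):
            req.output.append(run_decode_step(req.kv_cache, step))
        done.put(req)

inbox, handoff, done = Queue(), Queue(), Queue()
Thread(target=prefill_worker, args=(inbox, handoff), daemon=True).start()
Thread(target=decode_worker, args=(handoff, done), daemon=True).start()
inbox.put(Request("hello disaggregated serving"))
print(done.get().output)
```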

November 3, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Splitwise Efficient Generative LLM Inference Using Phase Splitting

Extensive Reading Author Info Background Generative LLM inference is characterized by two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with unique resource demands. Current systems run both phases on the same powerful, expensive GPUs, which is inefficient because the memory-bound token generation phase underutilizes the advanced compute resources of the hardware. This inefficiency is worsening as new GPUs (like the H100) increase compute power much faster than memory bandwidth or capacity, leading to higher-than-necessary costs and power consumption for large-scale deployments. Challenges The memory-intensive token generation phase, which accounts for the majority of end-to-end latency, severely underutilizes the expensive compute resources of modern GPUs. This inefficiency is exacerbated by hardware trends, as new GPUs (like the H100) provide massive compute gains (3.43x) but much smaller increases in memory bandwidth (1.6x) and no increase in capacity, making them poorly suited for the memory-bound token phase. Running both distinct phases on the same machine leads to inconsistent latencies and resource contention, forcing providers to over-provision expensive, power-hungry hardware to meet service level objectives (SLOs). Insights The prefill phase is compute-intensive and the decoding phase is memory-intensive, so decoding does not need the compute capability of the latest GPUs and can run at lower power and cost. Approaches Before presenting the concrete approach, the paper spends much of its length characterizing LLM inference; these characteristics strongly shape the design of Splitwise ...

November 3, 2025 · Last updated on October 4, 2025 · 3 min · KKKZOZ

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Extensive Reading Author Info Background Current LLM inference schedulers can be broadly classified into two categories: Prefill-Prioritizing Throughput first: allows subsequent decodes to operate at high batch sizes Compromise on latency: a prefill can take an arbitrarily long time depending on the lengths of the given prompts Decode-Prioritizing Latency first: new requests do not affect the execution of ongoing requests in their decode phase Compromise on throughput: even if some requests in a batch finish early, the execution continues with a reduced batch size until the completion of the last request Analysis The paper points out that the execution time of a matrix multiplication can be modeled as $T = \max(T_{\text{math}}, T_{\text{mem}})$ ...
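A back-of-envelope sketch of this model (my own roofline-style numbers; the peak FLOPs and bandwidth below are assumed, roughly A100 FP16 figures, not measurements from the paper): $T_{\text{math}}$ is FLOPs divided by peak compute, $T_{\text{mem}}$ is bytes moved divided by memory bandwidth, and whichever is larger dominates.

```python
def matmul_time(m, k, n, bytes_per_elem=2,
                peak_flops=312e12,      # assumed: ~312 TFLOPS FP16 peak
                mem_bw=1.55e12):        # assumed: ~1.55 TB/s HBM bandwidth
    """Estimate T = max(T_math, T_mem) for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n                                   # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    t_math = flops / peak_flops
    t_mem = bytes_moved / mem_bw
    return max(t_math, t_mem), ("compute-bound" if t_math > t_mem else "memory-bound")

# Prefill-like: many tokens share the weights -> compute-bound.
print(matmul_time(m=4096, k=8192, n=8192))
# Decode-like: a single token per request -> memory-bound (weight reads dominate).
print(matmul_time(m=1, k=8192, n=8192))
```

This is why prefill batches saturate the math units while decode batches mostly wait on memory, which is the tension Sarathi-Serve's chunked scheduling tries to balance.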

November 1, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Aegaeon Effective GPU Pooling for Concurrent LLM Serving on the Market

Extensive Reading Author Info Background The workloads are heavily skewed, containing a long tail (more than 90%) of infrequently invoked models. Hot models are prone to request bursts that can overload their provisioned resources from time to time. Loading/unloading models on demand from host memory (DRAM) or SSD, i.e., finishing all of model A's requests before loading model B to serve its requests, leads to severe head-of-line blocking, so request-level scheduling is too coarse-grained. A mathematical analysis (Theorem 3.1) shows that even when each model's request arrival rate ($\lambda$) is low, the expected number of active models in the system ($\mathbb{E}[m]$) can still be high as long as the service time $T$ is long. Due to the typically long service time of LLM requests, the expected active model count E[m] can be large even when the aggregate arrival rate Mλ is low. ...
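A hedged numerical illustration of why $\mathbb{E}[m]$ stays large (my own M/G/∞-style approximation assuming Poisson arrivals per model; not necessarily the exact form of the paper's Theorem 3.1): with per-model arrival rate $\lambda$ and mean service time $T$, a model is active with probability $1 - e^{-\lambda T}$, so $\mathbb{E}[m] \approx M(1 - e^{-\lambda T})$.

```python
import math

def expected_active_models(M, lam, T):
    """E[m] ~= M * (1 - exp(-lam * T)): assumes Poisson arrivals per model."""
    return M * (1 - math.exp(-lam * T))

# 100 long-tail models, each at only ~0.02 req/s, but a request occupies
# its model for ~60 s: aggregate load M*lam = 2 req/s is low, yet on
# average ~70 models are active at once and want GPU residency.
print(expected_active_models(M=100, lam=0.02, T=60))   # ~69.9
```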

October 28, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Intensive Reading Author Info About Ting Cao - Dr. Ting Cao: A Professor at the Institute of AI Industry Research (AIR), Tsinghua University. Background Challenges Challenge 1: how to accurately identify which "active weights" the next computation step actually needs; misidentification degrades model accuracy. Challenge 2: how to predict the required active weights early enough that the slow flash loading can run in parallel with the current computation and hide its latency. Some existing approaches rely on the ReLU activation function to predict sparsity, which does not work for modern LLMs such as Llama that forgo ReLU in pursuit of higher accuracy. Insights The work exploits Top-K sparsity to predict and prefetch active weights on non-ReLU models. The paper makes two core observations: Similarities in Cross-Layer Activations The input activations of the attention and MLP blocks in LLMs exhibit high cross-layer similarity due to residuals to the input activations. Because the activations are so similar, using the current layer's top-K activation channels to predict the next layer's top-K channels is also highly accurate. Contextual Hot Active Weights During Decoding Contextual active weights exhibit high temporal locality across inference iterations during decoding. Within a specific conversation or task (the context level), hot weights are reused far more often than their average reuse rate across all general tasks (the task level), so a cache built around contextual activation frequency is more effective (its hit rate is higher). Approaches Cross-Layer Active Weight Preloading While computing layer N, ActiveFlow uses layer N's activations to predict and preload the active weights needed by layers N+1 through N+k (a "layer group") ...
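A minimal sketch of cross-layer preloading (my own toy illustration, not ActiveFlow's implementation; the flash read is simulated with random weights): while layer N computes, the top-K channels of its input activation are used to kick off an asynchronous fetch of the next layer's corresponding weight rows.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

K = 4
executor = ThreadPoolExecutor(max_workers=1)   # stands in for the flash-read thread

def topk_channels(activation: np.ndarray, k: int = K) -> np.ndarray:
    """Indices of the k largest-magnitude activation channels."""
    return np.argsort(-np.abs(activation))[:k]

def load_rows_from_flash(layer_id: int, rows: np.ndarray) -> np.ndarray:
    """Simulated flash read: fetch only the predicted active weight rows."""
    rng = np.random.default_rng(layer_id)
    full_weight = rng.standard_normal((16, 16))  # toy per-layer weight matrix
    return full_weight[rows]

def forward(num_layers: int, x: np.ndarray) -> np.ndarray:
    # Cross-layer activations are similar, so layer N's top-K channels are a
    # good prediction for layer N+1; the fetch overlaps with layer N's compute.
    pending = executor.submit(load_rows_from_flash, 0, topk_channels(x))
    for layer in range(num_layers):
        if layer + 1 < num_layers:                  # prefetch for the next layer
            pending_next = executor.submit(
                load_rows_from_flash, layer + 1, topk_channels(x))
        weights = pending.result()                  # rows prefetched during the previous layer
        x = x + weights.sum(axis=0)                 # toy "compute" using only active weights
        pending = pending_next if layer + 1 < num_layers else pending
    return x

print(forward(num_layers=3, x=np.ones(16)))
```

The same structure generalizes to preloading a whole layer group (N+1 through N+k) at once, at the cost of predicting further ahead from the same activation.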

September 1, 2025 · Last updated on September 2, 2025 · 2 min · KKKZOZ