DDIA: Chapter 2 Data Models and Query Languages

Relational Model Versus Document Model: the chapter first discusses the birth of NoSQL. There are several driving forces behind the adoption of NoSQL databases, including:

- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- A widespread preference for free and open source software over commercial database products
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model

It then uses the résumé shown in the figure below to illustrate the one-to-many relationship ...
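The one-to-many structure can be sketched as a single self-contained JSON document, following the book's Bill Gates résumé figure (the exact field values here are illustrative):

```python
import json

# A résumé as one JSON document: the one-to-many relationships
# (one person, many positions / education entries) become nested
# arrays instead of separate relational tables joined by foreign keys.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

# In the document model, reading the whole résumé is one lookup with good
# locality; in the relational model the same data would be spread across
# users, positions, and education tables joined on user_id.
print(json.dumps(resume, indent=2))
```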

October 20, 2023 · Last updated on August 1, 2025 · 7 min · KKKZOZ

DDIA: Chapter 3 Storage and Retrieval

This chapter covers the lower-level internals of databases. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.

Data Structures That Power Your Database: any kind of index usually slows down writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. ...
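A minimal sketch of that trade-off, in the spirit of the chapter's simplest-database example (the class and file format here are invented for illustration, not DDIA's code): an append-only log with an in-memory hash index, where every write pays the cost of updating the index so that reads become a single seek instead of a scan.

```python
import os
import tempfile

class LogWithHashIndex:
    """Append-only log file plus an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key

    def set(self, key, value):
        with open(self.path, "a", encoding="utf-8") as f:
            offset = f.tell()
            f.write(f"{key},{value}\n")
        self.index[key] = offset  # extra work on every write

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "r", encoding="utf-8") as f:
            f.seek(offset)            # jump straight to the record
            line = f.readline().rstrip("\n")
        return line.split(",", 1)[1]  # drop the key, keep the value

path = os.path.join(tempfile.mkdtemp(), "db.log")
db = LogWithHashIndex(path)
db.set("42", "{'name': 'San Francisco'}")
db.set("42", "{'name': 'SF, CA'}")  # newer record shadows the older one
print(db.get("42"))
```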

October 20, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

Cascade Speculative Drafting for Even Faster LLM Inference

Extensive Reading · Author Info

Background: While speculative decoding improves latency by using a smaller draft model to generate tokens for a larger target model, it suffers from two specific bottlenecks:

- Autoregressive Drafting: the draft model itself generates tokens autoregressively (one by one), which is still computationally expensive and slow.
- Inefficient Time Allocation: standard methods allocate equal time to generating every draft token. However, tokens later in the sequence have a significantly lower probability of acceptance, so spending the same computational resources on these "high-rejection" tokens is inefficient.

Insights:

- The autoregressive process of the draft model is the bottleneck: use draft models to accelerate draft models (Vertical Cascade).
- Tokens later in the sequence have a lower probability of acceptance: use a faster and lighter draft model later in the sequence (Horizontal Cascade).

Challenges · Approaches ...
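The Horizontal Cascade idea can be sketched with toy stand-in models (every function below is invented for illustration; real systems use neural draft and target models and probabilistic acceptance): a stronger draft model handles the early, high-acceptance positions, and a cheaper one takes over later in the sequence.

```python
# Toy stand-ins: "models" are just next-token functions over a fixed sentence.
def target_model(prefix):
    text = "the quick brown fox jumps over the lazy dog".split()
    return text[len(prefix)] if len(prefix) < len(text) else None

def strong_draft(prefix):
    return target_model(prefix)  # expensive but agrees with the target

def weak_draft(prefix):
    guess = target_model(prefix)
    # Cheap model: drifts on some tokens (toy stand-in for lower accuracy).
    return guess if guess != "lazy" else "sleepy"

def draft_and_verify(prefix, k=4, switch=2):
    # Draft k tokens: strong model for the first `switch` positions,
    # weak model afterwards (the horizontal cascade).
    draft = list(prefix)
    proposed = []
    for i in range(k):
        model = strong_draft if i < switch else weak_draft
        tok = model(draft)
        if tok is None:
            break
        proposed.append(tok)
        draft.append(tok)
    # Verification: the target accepts the longest agreeing prefix of the
    # proposal, then contributes one corrected token itself.
    accepted = list(prefix)
    for tok in proposed:
        if tok == target_model(accepted):
            accepted.append(tok)
        else:
            break
    correction = target_model(accepted)
    if correction is not None:
        accepted.append(correction)
    return accepted

out = draft_and_verify(["the", "quick"], k=4, switch=2)
print(" ".join(out))
```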

February 10, 2026 · Last updated on February 10, 2026 · 4 min · KKKZOZ

3-Model Speculative Decoding

Extensive Reading · Author Info

Background:

- The Accuracy-Speed Trade-off: the effectiveness of SD is limited by a fundamental trade-off: very small draft models are fast but often diverge from the target model's distribution, leading to low acceptance rates. Conversely, larger draft models have higher acceptance rates but are too slow to provide significant speedups.
- Limitations of Single-Stage Verification: as the performance gap between the draft and target models widens, the output distributions diverge significantly, diminishing the acceleration gains. Even relaxed verification methods like Fuzzy Speculative Decoding struggle to bridge large distributional gaps between a tiny draft model and a massive target model in a single step.

Insights: The authors propose Pyramid Speculative Decoding, which inserts an intermediate "Qualifier Model" between the small Draft and the large Target. This creates a hierarchical pipeline that bridges the "distributional gap" between the small and large models. ...

February 9, 2026 · Last updated on February 9, 2026 · 3 min · KKKZOZ

LayerSkip Enabling Early Exit Inference and Self-Speculative Decoding

Extensive Reading · Author Info

Background:

- Early Exit (Dynamic Halting): these techniques attempt to stop the forward pass at an intermediate layer if the model is sufficiently confident in the prediction. Problems: in standard LLMs, early layers are "lazy" (not trained to produce final tokens), leading to severe accuracy drops; furthermore, these methods typically require adding and training auxiliary "exit heads," which increases parameter overhead.
- Layer Pruning and Dropout: existing research has explored skipping layers (dropout) during training to make sub-networks robust, or pruning layers post-training for speed. Problems: standard uniform layer dropout does not specifically incentivize early layers to be accurate, and post-training pruning often results in performance degradation that requires complex fine-tuning to recover.

Insights: Accelerate Large Language Model (LLM) inference by enabling the model to generate tokens using fewer layers when possible, while maintaining accuracy. ...
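The early-exit control flow can be sketched as follows (a toy sketch with a made-up confidence curve, not LayerSkip's implementation; real models compute confidence from an exit head's logits):

```python
import math

def layer_confidence(layer_idx):
    # Stand-in for the softmax confidence of the token predicted at this
    # layer; here confidence simply grows with depth for illustration.
    return 1.0 - math.exp(-0.5 * (layer_idx + 1))

def early_exit_forward(num_layers=12, threshold=0.9):
    # Run layers one by one and stop as soon as the intermediate
    # prediction clears the confidence threshold.
    for layer in range(num_layers):
        if layer_confidence(layer) >= threshold:
            return layer + 1   # exited early: number of layers actually run
    return num_layers          # fell through: full forward pass

print(early_exit_forward())    # with this toy curve, exits after 5 layers
```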

February 9, 2026 · Last updated on February 9, 2026 · 3 min · KKKZOZ

Draft & Verify Lossless Large Language Model Acceleration via Self-Speculative Decoding

Extensive Reading · Author Info

Prerequisite: Bayesian optimization is a strategy for global optimization, designed to find the extrema of black-box functions. It is especially suited to complex functions that are computationally expensive to evaluate, non-differentiable, or lack an analytical expression. The core idea: instead of searching blindly, build a probabilistic model from the data observed so far and use it to intelligently decide where to try next, finding the global optimum in as few evaluations as possible.

Bayesian optimization has two key components:

- Surrogate Model: a probabilistic approximation of the objective function, most commonly a Gaussian Process (GP). Unlike an ordinary regression model, the surrogate predicts not only the function value at an input (the mean) but also an uncertainty range (the variance). It tells us "what the objective looks like given the points observed so far" and "where we are confident versus where we know nothing."
- Acquisition Function: a function, built on the surrogate, that guides the next decision; common choices are Expected Improvement (EI) and Upper Confidence Bound (UCB). It resolves the exploration-exploitation trade-off: exploitation goes where the surrogate predicts the best values, seeking the current local optimum; exploration goes where the surrogate is most uncertain (highest variance), seeking unknown potential optima. It scores the "potential value" of every point in the search space, and the highest-scoring point becomes the next experiment.

Optimization flow (an iterative closed loop):

1. Observe: fit the surrogate model (the Gaussian Process) to the current data points.
2. Decide: maximize the acquisition function to find the next most promising candidate $x$.
3. Evaluate: run the real, expensive system (the objective function) at $x$ to obtain the true result $y$.
4. Update: add the new pair $(x, y)$ to the history and update the surrogate's posterior distribution.
5. Repeat: iterate until a preset number of iterations or a convergence criterion is met.

Any problem with a low-dimensional input space (typically fewer than 20 dimensions) where a single evaluation is slow or expensive is a natural fit for Bayesian optimization. ...
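The loop above can be sketched end to end in one dimension (a minimal sketch: the toy objective, RBF kernel length scale, UCB coefficient, and grid search are all illustrative assumptions, not the paper's setup):

```python
import numpy as np

def objective(x):                      # expensive black box (toy stand-in)
    return -(x - 0.7) ** 2 + 0.5       # true optimum at x = 0.7

def rbf(a, b, length=0.2):             # RBF kernel between two point sets
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-5):
    # Gaussian-process posterior mean and variance at candidate points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ Kinv @ Ks)
    return mu, np.maximum(var, 0.0)

grid = np.linspace(0.0, 1.0, 201)      # candidate search space
X = np.array([0.1, 0.9])               # initial observations
y = objective(X)
for _ in range(10):
    mu, var = gp_posterior(X, y, grid)         # observe: fit surrogate
    ucb = mu + 2.0 * np.sqrt(var)              # UCB acquisition function
    x_next = grid[np.argmax(ucb)]              # decide: best candidate
    X = np.append(X, x_next)                   # evaluate + update history
    y = np.append(y, objective(x_next))

best = float(X[np.argmax(y)])
print(round(best, 2))                  # should land near the optimum 0.7
```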

February 8, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

Swift On-the-fly Self-speculative Decoding For LLM Inference Acceleration

Extensive Reading · Author Info

Background: Existing Speculative Decoding (SD) methods accelerate inference by using a small "draft" model to guess tokens and a large "target" model to verify them. However, these methods usually require training auxiliary models or adding extra parameters, which limits their flexibility (they are not "plug-and-play").

Insights: LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. This paper proposes a method that dynamically determines which layers to skip during inference based on the input, according to these two observations: ...

February 8, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

CAS-Spec Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Extensive Reading · Author Info

Background:

- Existing Self-Speculative Decoding (SSD) methods are easy to use (training-free) but often slower than methods that rely on training specialized draft models.
- Cascade Speculative Decoding (using a hierarchy of draft models) offers high speed but is impractical because it requires training and maintaining multiple draft models.

Insights: The paper proposes Cascade Adaptive Self-Speculative Decoding (CAS-Spec). This framework constructs a "virtual" hierarchy of draft models directly from the target model itself, without needing extra training. It effectively combines ...

February 7, 2026 · Last updated on February 9, 2026 · 4 min · KKKZOZ