DDIA: Chapter 4 Encoding and Evolution

Formats for Encoding Data Two kinds of compatibility are introduced here, and both come up again when the individual encoding formats are analyzed later: In order for the system to continue running smoothly, we need to maintain compatibility in both directions: Backward compatibility: newer code can read data that was written by older code. Forward compatibility: older code can read data that was written by newer code. A literal Chinese translation is confusing: in English, "forward" and "backward" are consistent across time and space, but in Chinese they are reversed. "Forward" means ahead in space and the future in time, whereas the Chinese "前" means ahead in space but the past in time. Backward compatibility is easy to grasp: a newer version of software/hardware can use data produced by an older version. Rendering forward compatibility literally as "向前兼容" invites confusion; think of it instead as compatibility toward the future: an older version of software/hardware can use data produced by a newer version. A few examples: Intel's x86 CPUs are backward compatible, because a new CPU can still run old software. Intel guarantees that every instruction present on an old CPU is kept in newer ones; this add-only, never-remove policy means that swapping in a new CPU does not force us to replace much of our software. ...
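The two compatibility directions are easy to demonstrate with a schemaless encoding such as JSON. Below is a minimal sketch (the record fields and the `decode_v1`/`decode_v2` helpers are my own hypothetical names, not from the book): old code stays forward compatible by ignoring fields it does not know, and new code stays backward compatible by falling back to defaults for fields that old writers never produced.

```python
import json

# "Old" schema: only knows name and email.
def decode_v1(data: bytes) -> dict:
    record = json.loads(data)
    # Forward compatibility: ignore any fields the old code doesn't know about.
    return {"name": record["name"], "email": record["email"]}

# "New" schema: adds an optional phone field.
def decode_v2(data: bytes) -> dict:
    record = json.loads(data)
    # Backward compatibility: old data has no phone, so fall back to a default.
    return {
        "name": record["name"],
        "email": record["email"],
        "phone": record.get("phone"),  # None if written by old code
    }

old_data = json.dumps({"name": "Alice", "email": "a@example.com"}).encode()
new_data = json.dumps({"name": "Bob", "email": "b@example.com", "phone": "123"}).encode()

print(decode_v1(new_data))  # old code reading new data -> forward compatible
print(decode_v2(old_data))  # new code reading old data -> backward compatible
```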

October 21, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

DDIA: Chapter 2 Data Models and Query Languages

Relational Model Versus Document Model It starts with the birth of NoSQL: There are several driving forces behind the adoption of NoSQL databases, including: A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput A widespread preference for free and open source software over commercial database products Specialized query operations that are not well supported by the relational model Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model The résumé shown in the figure below is then used to illustrate the one-to-many relationship ...
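A minimal sketch of that one-to-many structure, loosely modeled on the book's résumé example (the field names here are approximations, not the exact figure): the many positions and education entries nest inside one self-contained document, whereas a relational schema would normalize them into separate tables joined by a user id.

```python
import json

# One résumé document: the one-to-many data (positions, education) is nested
# inside the record instead of living in separate tables joined on user_id.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [                      # one-to-many: many positions per user
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [                      # one-to-many: many schools per user
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

print(json.dumps(resume, indent=2))
```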

October 20, 2023 · Last updated on August 1, 2025 · 7 min · KKKZOZ

DDIA: Chapter 3 Storage and Retrieval

This chapter is about the lower-level machinery of databases. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood. Data Structures That Power Your Database Index Any kind of index usually slows down writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. ...
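The trade-off shows up even in the simplest index from this chapter, an in-memory hash map over an append-only log. The toy class below is my own sketch, not the book's code: reads become a single lookup plus one "seek", but every write now pays the extra cost of keeping the index up to date.

```python
class LogWithHashIndex:
    """Append-only log of key/value pairs with an in-memory hash index."""

    def __init__(self):
        self.log = []        # the data file: a list of (key, value) records
        self.index = {}      # hash index: key -> offset of the latest record

    def set(self, key, value):
        self.log.append((key, value))
        # The extra work every write pays: keep the index in sync.
        self.index[key] = len(self.log) - 1

    def get(self, key):
        # Without the index we would scan the whole log (O(n));
        # with it, one hash lookup plus one read at that offset.
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

db = LogWithHashIndex()
db.set("42", "{'name': 'San Francisco'}")
db.set("42", "{'name': 'San Francisco', 'attractions': ['Golden Gate Bridge']}")
print(db.get("42"))  # the latest value for the key wins
```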

October 20, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

KTransformers Unleashing the Full Potential of CPU GPU Hybrid Inference for MoE Models

Extensive Reading Author Info Hongtao Chen | MADSys Weiyu Xie | MADSys Boxin Zhang | MADSys Background MoE LLMs and hybrid setups Modern MoE models (DeepSeek, Qwen-MoE, etc.) are huge but activate few experts per token. On single-GPU or low-concurrency setups, we naturally pair a small GPU with a big CPU + large DRAM. Limitations of current hybrid / offloading systems Tools like Fiddler or basic offloading keep attention on GPU and push experts or layers to CPU. CPU becomes the bottleneck; generic AMX/AVX-512 kernels are far from peak, and GPU often waits on CPU. Hardware inefficiencies on CPU and NUMA Poor weight layouts and scheduling starve caches and AMX units. Multi-socket (NUMA) machines suffer from cross-socket memory traffic and weak scaling. Crude accuracy–latency tradeoffs in MoE Existing accelerations often reduce or skip experts (smaller top-k, pruning). These approaches speed up inference but can noticeably hurt accuracy. There are two major inefficiencies: ...
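To make the hybrid split concrete, here is a toy PyTorch sketch of the offloading pattern described above (an assumption-laden illustration, not KTransformers' actual implementation): the router stays on the GPU, the experts live in CPU DRAM, and each token's top-k experts run on the CPU before results are copied back, which is exactly where the CPU-side GEMMs become the bottleneck.

```python
import torch
import torch.nn as nn

# Toy illustration of the hybrid split (not KTransformers itself): attention/dense
# parts would stay on the GPU, MoE experts sit in CPU DRAM, and each token's
# top-k experts run on the CPU before the results are shipped back.
gpu = "cuda" if torch.cuda.is_available() else "cpu"
d_model, n_experts, top_k = 64, 8, 2

router = nn.Linear(d_model, n_experts).to(gpu)                     # gating stays on GPU
experts = [nn.Linear(d_model, d_model) for _ in range(n_experts)]  # experts in CPU DRAM

def moe_forward(x_gpu: torch.Tensor) -> torch.Tensor:
    scores = router(x_gpu)                                     # (tokens, n_experts)
    weights, idx = torch.topk(scores.softmax(dim=-1), top_k)   # pick top-k experts per token
    x_cpu = x_gpu.to("cpu")                                    # ship activations to CPU
    out = torch.zeros_like(x_cpu)
    for t in range(x_cpu.shape[0]):                            # per-token dispatch
        for w, e in zip(weights[t].tolist(), idx[t].tolist()):
            out[t] += w * experts[e](x_cpu[t])                 # expert GEMM runs on CPU
    return out.to(gpu)                                         # copy results back to GPU

tokens = torch.randn(4, d_model, device=gpu)
print(moe_forward(tokens).shape)
```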

November 17, 2025 · Last updated on November 17, 2025 · 5 min · KKKZOZ

QServe W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Extensive Reading Author Info MIT HAN Lab Background Common quantization formats for LLMs: W8A8: 8-bit weights, 8-bit activations – almost lossless, widely deployed. W4A16: 4-bit weights, 16-bit activations – also near-lossless; good for weight memory. W4A4: 4-bit weights and activations – more aggressive, but accuracy drops and real GPU speedups are disappointing. On data center GPUs (A100, L40S), 4-bit quantization often underperforms because: Dequantization of weights or partial sums runs on slow CUDA cores, not fast tensor cores. For W4A4 systems like Atom and QuaRot, 20–90% of runtime can be eaten by dequantization in the main GEMM loop. To achieve reasonable accuracy, W4A4 must apply per-group quantization, which is finer than per-channel quantization – sharing FP16 scaling factors on a sub-channel basis ...
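As a concrete reference point, this is what per-group quantization looks like in a NumPy sketch (the `quantize_per_group` helper and the group size of 128 are illustrative assumptions, not QServe's kernels): each group of consecutive input channels shares one FP16 scale, which is the finer-than-per-channel granularity W4A4 needs for acceptable accuracy.

```python
import numpy as np

# Toy per-group weight quantization: each group of `group_size` consecutive
# input channels shares one FP16 scale (finer granularity than per-channel).
def quantize_per_group(w: np.ndarray, group_size: int = 128, n_bits: int = 4):
    out_ch, in_ch = w.shape
    qmax = 2 ** (n_bits - 1) - 1                      # symmetric INT4 range: [-7, 7]
    w_groups = w.reshape(out_ch, in_ch // group_size, group_size)
    scales = (np.abs(w_groups).max(axis=-1, keepdims=True) / qmax).astype(np.float16)
    q = np.clip(np.round(w_groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales                                  # INT4 values (stored as int8) + FP16 scales

w = np.random.randn(256, 512).astype(np.float32)
q, scales = quantize_per_group(w)
w_hat = (q * scales).reshape(w.shape)                 # dequantize for an error check
print("per-group scales:", scales.shape, "mean abs error:", np.abs(w - w_hat).mean())
```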

November 16, 2025 · Last updated on November 17, 2025 · 7 min · KKKZOZ

SmoothQuant Accurate and Efficient Post-Training Quantization for Large Language Models

Extensive Reading Author Info MIT HAN Lab Background Modern large language models (LLMs) are extremely costly to serve in FP16 because of their massive parameter counts and long-context workloads; while low-bit quantization (especially INT8) is an attractive way to cut memory and latency, naïve post-training W8A8 (8-bit weights and activations) breaks down on large models due to severe activation outliers that cause large accuracy drops. Existing INT8 solutions either focus on weights only (e.g., GPTQ-style methods) or handle activation outliers with mixed precision (e.g., LLM.int8(), outlier-aware kernels); these approaches can preserve accuracy but often bring limited end-to-end gains because they leave activations/KV caches in higher precision, rely on complex custom kernels, or end up slower than plain FP16 in practical deployments. ...
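A tiny numeric illustration of the activation-outlier problem (my own example, not from the paper): a single large outlier stretches the per-tensor INT8 scale, so all the ordinary activation values are rounded onto far fewer quantization levels and the average error grows sharply.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> np.ndarray:
    scale = np.abs(x).max() / 127.0               # per-tensor symmetric INT8 scale
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale                              # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(0, 1, size=1000)                # "ordinary" activation values
acts_with_outlier = np.append(acts, 60.0)         # one outlier channel, as seen in large LLMs

err_clean = np.abs(acts - quantize_int8(acts)).mean()
err_outlier = np.abs(acts - quantize_int8(acts_with_outlier)[:-1]).mean()
print(f"mean quantization error without outlier: {err_clean:.4f}")
print(f"mean quantization error with outlier:    {err_outlier:.4f}")  # much larger
```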

November 16, 2025 · Last updated on November 17, 2025 · 4 min · KKKZOZ

LServe Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading Author Info MIT HAN Lab Background Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice. Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...
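A quick back-of-the-envelope estimate of that linear KV growth (the model shape below is an assumption, roughly a Llama-2-7B-like configuration, not a figure from the paper): at 128k–512k tokens the FP16 KV cache alone reaches tens to hundreds of GiB per sequence, which is why decoding becomes memory-bound.

```python
# Rough KV-cache size estimate; the model shape is an illustrative assumption.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (4_096, 128_000, 512_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:6.1f} GiB of FP16 KV cache per sequence")
```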

November 15, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ

DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Extensive Reading Author Info MIT HAN Lab Background Long-context LLMs strain attention and KV caches. As sequence length grows, prefill cost scales quadratically and decoding linearly, while KV cache memory grows linearly, making naive full-attention inference impractical in real-world long-context applications. Existing architectural and approximate-attention methods trade accuracy or require retraining. Linear-attention and specialized long-context architectures reduce complexity but often underperform standard Transformers on long-range reasoning, while methods like H2O, StreamingLLM, TOVA, and FastGen drop or sparsify tokens uniformly across heads, which can severely damage long-context retrieval accuracy and are difficult to apply safely in settings with KV-sharing schemes such as GQA. ...
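A minimal sketch of the per-head idea the title refers to (the head split, window size, and `trim_kv` helper are my own illustrative assumptions, not the paper's learned classification): retrieval heads keep the full KV cache, while streaming heads keep only a few attention-sink tokens plus a recent window.

```python
import torch

# Toy per-head KV-cache policy: retrieval heads keep everything, streaming heads
# keep only `n_sink` initial tokens plus the last `window` tokens. The head split
# and sizes here are illustrative assumptions.
def trim_kv(kv: torch.Tensor, is_retrieval, n_sink=4, window=256):
    # kv: (n_heads, seq_len, head_dim) for either keys or values
    trimmed = []
    for h, keep_all in enumerate(is_retrieval):
        if keep_all or kv.shape[1] <= n_sink + window:
            trimmed.append(kv[h])                                         # retrieval head: full cache
        else:
            trimmed.append(torch.cat([kv[h, :n_sink], kv[h, -window:]]))  # streaming head
    return trimmed                                                        # ragged: one tensor per head

keys = torch.randn(8, 4096, 128)
head_types = [True, False, False, True, False, False, False, False]  # 2 retrieval heads
print([t.shape[0] for t in trim_kv(keys, head_types)])               # kept tokens per head
```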

November 13, 2025 · Last updated on November 17, 2025 · 3 min · KKKZOZ