Build a Large Language Model (From Scratch) Reading Note

Chapter 1: GPT is an autoregressive model. Decoder-style models like GPT generate text by predicting one word at a time: the model predicts the most likely next word given the sequence of words generated so far, appends that prediction to the sequence, and then predicts again, repeating the loop. Because each new prediction depends on the model's own previous outputs, such models are called autoregressive models. Chapter 2: Word2Vec trains a neural network to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words appearing in similar contexts tend to have similar meanings. LLMs commonly produce their own embeddings as part of the input layer, and these embeddings are updated during training. The advantage of optimizing the embeddings as part of LLM training, instead of using Word2Vec, is that they are optimized for the specific task and data at hand. ...
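A minimal sketch of the autoregressive loop described above. The `model()` function here is a hypothetical stand-in (it just returns random logits over a toy vocabulary), not the book's actual GPT implementation; the point is the feed-the-output-back-in structure.

```python
# Greedy autoregressive decoding sketch with a stand-in model.
import torch

vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

def model(tokens: list[int]) -> torch.Tensor:
    # Hypothetical stand-in for a decoder-only LLM: random logits over the vocab.
    torch.manual_seed(len(tokens))            # deterministic toy behaviour
    return torch.randn(len(vocab))

def generate(prompt: list[int], max_new_tokens: int = 5) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)                # score every candidate next token
        next_id = int(torch.argmax(logits))   # greedy: pick the most likely one
        tokens.append(next_id)                # feed it back in: "autoregressive"
        if vocab[next_id] == "<eos>":
            break
    return tokens

print([vocab[i] for i in generate([1, 2])])   # start from "the cat"
```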

May 20, 2025 · Last updated on August 18, 2025 · 9 min · KKKZOZ

Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market

Extensive Reading Author Info Background The workloads are heavily skewed, containing a long tail (more than 90%) of infrequently invoked models. Hot models are prone to request bursts that can overload their provisioned resources from time to time. Loading and unloading models on demand from host memory (DRAM) or SSD, i.e., finishing all of model A's requests before loading model B to serve its requests, causes severe head-of-line blocking, so request-level scheduling is too coarse-grained. A mathematical analysis (Theorem 3.1) shows that even when a single model's request arrival rate ($\lambda$) is low, the number of active models in the system ($\mathbb{E}[m]$) can still be high as long as the service time T is long. Due to the typically long service time of LLM requests, the expected active model count $\mathbb{E}[m]$ can be large even when the aggregate arrival rate $M\lambda$ is low. ...
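A toy Monte Carlo sketch of the intuition behind that claim (not the paper's Theorem 3.1 derivation): with an assumed per-model Poisson arrival rate `lam`, fixed service time `T`, and `M` models, many models can have a request in flight at any given instant even though each individual model is rarely invoked.

```python
# Toy simulation: how many models are "active" at a random instant?
# Assumptions (illustrative only): M models, Poisson arrivals at rate lam
# per model, and a fixed service time T per request.
import random

def expected_active_models(M=100, lam=0.01, T=60.0, horizon=2_000.0, samples=200):
    # Draw arrival times for every model over the horizon.
    arrivals = []
    for _ in range(M):
        t, times = 0.0, []
        while True:
            t += random.expovariate(lam)
            if t > horizon:
                break
            times.append(t)
        arrivals.append(times)

    # At random instants, count models with a request in flight
    # (arrival <= t < arrival + T).
    total = 0
    for _ in range(samples):
        t = random.uniform(T, horizon)  # skip the start-up transient
        total += sum(any(a <= t < a + T for a in times) for times in arrivals)
    return total / samples

# Low per-model rate (0.01 req/s) but long service time (60 s):
# roughly M * (1 - exp(-lam*T)) ~ 45 models active on average.
print(expected_active_models())
```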

October 28, 2025 · Last updated on October 4, 2025 · 2 min · KKKZOZ

LLM Preliminaries

Math Vector-Matrix Multiplication We analyze the operation of multiplying a vector by a matrix, $xW$, from three different perspectives. Suppose the vector $x$ has shape $(1, 3)$ and the matrix $W$ has shape $(3, 6)$.

$$x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$$

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & w_{16} \\\\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & w_{26} \\\\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & w_{36} \end{bmatrix}$$

By the rules of matrix multiplication, the result $y = xW$ has shape $(1, 6)$. Perspective 1: treat $W$ as a two-dimensional collection of elements. This is the most basic, most fine-grained view: we regard the matrix $W$ as a $3 \times 6$ grid of numbers. Each element $y_j$ of the result vector $y$ is obtained by multiplying each element of $x$ with the corresponding element in the $j$-th column of $W$ and summing the products, i.e. $y_j = \sum_i x_i w_{ij}$. ...
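A small NumPy check of this element-wise ("column") view, with arbitrary example values for $x$ and $W$: computing each $y_j$ as the dot product of $x$ with the $j$-th column of $W$ gives the same result as the built-in matrix product `x @ W`.

```python
# Verify the column view of y = xW against NumPy's matrix product.
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])           # shape (1, 3)
W = np.arange(1.0, 19.0).reshape(3, 6)    # shape (3, 6), arbitrary values

y_builtin = x @ W                          # shape (1, 6)
y_manual = np.array([[sum(x[0, i] * W[i, j] for i in range(3))
                      for j in range(6)]])

print(y_builtin)
print(np.allclose(y_builtin, y_manual))   # True: both views agree
```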

August 4, 2025 · Last updated on September 1, 2025 · 13 min · KKKZOZ