TOBETAGGED

Chapter 1

GPT is an autoregressive model: decoder-style models such as GPT generate text by predicting one word at a time. That is, the model predicts the most likely next word from the sequence generated so far, appends that word to the sequence, and then repeats the prediction step. Because each prediction depends on the model's own previous outputs, models of this kind are called autoregressive models.

Chapter 2

Word2Vec trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings. LLMs commonly produce their own embeddings, which are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of LLM training, instead of using Word2Vec, is that the embeddings are optimized for the specific task and data at hand. ...
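The autoregressive loop from Chapter 1 can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `TOY_NEXT_WORD_SCORES`, `predict_next`, and `generate` are hypothetical names, and the lookup table stands in for a real model, which would condition on the whole sequence rather than only the last word. The point is the loop itself: predict, append, repeat.

```python
# Hypothetical toy "model": maps the last word to scores for candidate next words.
# A real LLM conditions on the full sequence; a lookup table keeps the sketch small.
TOY_NEXT_WORD_SCORES = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
    "down": {"<eos>": 1.0},
}


def predict_next(sequence):
    """Return the highest-scoring next word given the sequence so far."""
    scores = TOY_NEXT_WORD_SCORES.get(sequence[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt, max_new_words=10):
    """Greedy autoregressive generation: each new word is fed back as input."""
    sequence = list(prompt)
    for _ in range(max_new_words):
        next_word = predict_next(sequence)
        if next_word == "<eos>":  # stop when the toy model signals end of text
            break
        sequence.append(next_word)  # the model's own output extends its input
    return sequence


print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

Picking the single highest-scoring word at every step is greedy decoding, chosen here for simplicity; it is one of several possible decoding strategies.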

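Math Vector-Matrix Multiplication

This note analyzes the vector-matrix product $xW$ from three different perspectives. Assume the vector $x$ has shape $(1, 3)$ and the matrix $W$ has shape $(3, 6)$.

$$x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$$

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & w_{16} \\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & w_{26} \\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & w_{36} \end{bmatrix}$$

By the rules of matrix multiplication, the result $y = xW$ has shape $(1, 6)$.

Perspective 1: Treating $W$ as a two-dimensional collection of elements

This is the most basic, element-level view. We treat the matrix $W$ as a $3 \times 6$ grid of numbers. Each element $y_j$ of the result vector $y$ is obtained by multiplying each element of $x$ by the corresponding element of column $j$ of $W$ and then summing the products. ...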
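The element-level view can be written as $y_j = \sum_{i=1}^{3} x_i w_{ij}$ for $j = 1, \dots, 6$. A minimal pure-Python sketch of that rule follows; the function name `vec_mat_mul` and the sample values of `x` and `W` are made up for illustration and do not come from the original note.

```python
def vec_mat_mul(x, W):
    """Compute y = xW element by element: y[j] = sum over i of x[i] * W[i][j]."""
    n_rows, n_cols = len(W), len(W[0])
    if len(x) != n_rows:
        raise ValueError("length of x must equal the number of rows of W")
    y = []
    for j in range(n_cols):  # one output element per column of W
        y_j = sum(x[i] * W[i][j] for i in range(n_rows))
        y.append(y_j)
    return y


# Illustrative values only: x has shape (1, 3) and W has shape (3, 6).
x = [1, 2, 3]
W = [
    [1, 0, 0, 1, 2, 0],
    [0, 1, 0, 1, 0, 2],
    [0, 0, 1, 1, 0, 0],
]

y = vec_mat_mul(x, W)
print(y)  # [1, 2, 3, 6, 2, 4]
```

The result has six entries, matching the $(1, 6)$ shape stated above.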


Build a Large Language Model (From Scratch) Reading Note

LLM Preliminaries