0-notes-and-paradigm

Kernel Paradigm — Triton kernels come in two styles: the conventional kernel and the persistent kernel.

Conventional kernel: `grid = (num_tiles,)`; each block processes one tile, then exits. Each SM picks up one work item from the grid and retires when done; CTA scheduling is handled automatically by the hardware.

Persistent kernel:

```python
@triton.jit
def kernel(..., n_tiles: tl.constexpr, ...):
    pid = tl.program_id(0)
    n_progs = tl.num_programs(0)  # number of programs launched in grid dim 0
    tile = pid
    while tile < n_tiles:
        # process the chunk of work for this tile
        # ...
        tile += n_progs  # jump to the next tile owned by this program
```

Each CTA stays resident on an SM for the kernel's lifetime and schedules its own work in software. The common motivation for persistent kernels: the total number of tiles is too small (e.g. small batches, small matrices, many small GEMMs / MoE / grouped GEMM), so in the conventional style the number of programs is smaller than the number of SMs and the GPU is underutilized; the persistent style keeps every SM busy and drains the remaining work through the loop. ...
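The tile-striding loop above can be illustrated on the CPU side. This is a plain-Python sketch (not Triton code, and the function name `tiles_for_program` is invented for illustration) of how `n_progs` long-lived programs partition `n_tiles` tiles: program `pid` takes tiles `pid`, `pid + n_progs`, `pid + 2*n_progs`, and so on.

```python
# Plain-Python sketch of the persistent-kernel work distribution:
# each resident program strides through the tile space with step n_progs,
# so every tile is covered exactly once and no program sits idle.

def tiles_for_program(pid: int, n_progs: int, n_tiles: int) -> list[int]:
    """Tiles processed by program `pid` under the grid-stride loop."""
    tiles = []
    tile = pid
    while tile < n_tiles:
        tiles.append(tile)
        tile += n_progs
    return tiles

# Example: 4 resident programs draining 10 tiles.
assignment = {pid: tiles_for_program(pid, 4, 10) for pid in range(4)}
```

Note that when `n_tiles` is not a multiple of `n_progs`, some programs simply do one fewer loop iteration; no tile is dropped or duplicated.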

February 27, 2026 · Last updated on March 20, 2026 · 1 min · KKKZOZ

torch-python

Tensor Operations — clamp: `torch.clamp` (or the Tensor instance method `.clamp`) is a common PyTorch operation for value clipping. Its main purpose is to restrict every element of an input tensor to a given range $[min, max]$.

Example:

```python
import torch

# Initialize a tensor with values ranging from -10 to 10
data = torch.tensor([-10.0, -5.0, 0.5, 5.0, 10.0])
print(f"Original: {data}")

# 1. Clamp between a min and max range [-1, 1]
# Values < -1 become -1; Values > 1 become 1
clamped_both = data.clamp(min=-1.0, max=1.0)
print(f"Range [-1, 1]: {clamped_both}")

# 2. Clamp with only a lower bound (min=-2)
# Values < -2 become -2; No upper limit
clamped_min = data.clamp(min=-2.0)
print(f"Min -2 only: {clamped_min}")

# 3. Clamp with only an upper bound (max=3)
# Values > 3 become 3; No lower limit
clamped_max = data.clamp(max=3.0)
print(f"Max 3 only: {clamped_max}")
```

Advanced Indexing — `x[y]` is a very powerful and flexible **advanced indexing** syntax in PyTorch (and NumPy) ...
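Since the excerpt notes that `x[y]` advanced indexing is shared between PyTorch and NumPy, here is a minimal sketch of the two most common forms using NumPy (the same expressions work on PyTorch tensors):

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

# Integer-array indexing: y picks elements by position; repeats are allowed
# and the result has the same shape as y.
y = np.array([0, 2, 2, 4])
picked = x[y]            # array([10, 30, 30, 50])

# Boolean-mask indexing: y selects the elements where the mask is True.
mask = x > 25
large = x[mask]          # array([30, 40, 50])
```

In both cases the result is a new array (a copy), unlike basic slicing, which returns a view.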

January 15, 2026 · Last updated on March 20, 2026 · 11 min · KKKZOZ

1-builtin-tutorial

Background GPU Memory Model SRAM (Static RAM) Located inside the GPU core, it utilizes Registers, L1 Cache, and L2 Cache: Registers — These are tiny, ultra-fast memory locations within each GPU core. Registers store immediate values that a core is actively processing, making them the fastest type of memory. L1 Cache — This is the first-level cache inside a Streaming Multiprocessor (SM). It stores frequently accessed data to speed up calculations and reduce access to slower memory (like DRAM). L2 Cache — This is a larger, second-level cache that is shared across multiple SMs. It helps store and reuse data that might not fit in L1 cache, reducing reliance on external memory (VRAM). HBM Bigger capacity ...

January 7, 2026 · Last updated on March 20, 2026 · 12 min · KKKZOZ

Build a Large Language Model (From Scratch) Reading Note

Chapter 1 — GPT is an autoregressive model: decoder-style models like GPT generate text by predicting one word at a time. The model predicts the most likely next word given the sequence generated so far, appends that prediction to the sequence, and then predicts again, repeating the cycle. Because each step's prediction depends on the model's own previous outputs, such models are called "autoregressive". Chapter 2 — Word2Vec trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings. LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training, instead of using Word2Vec, is that the embeddings are optimized to the specific task and data at hand. ...
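The autoregressive loop described in Chapter 1 can be sketched without any neural network. In this toy example, `predict_next` (a hypothetical stand-in for the model, with a hard-coded lookup table) plays the role of GPT; the loop structure of predict, append, repeat is the same:

```python
# Toy sketch of autoregressive generation: the "model" is a canned lookup
# table, but the control flow mirrors GPT decoding -- each prediction is
# fed back in as context for the next step.

def predict_next(sequence: list[str]) -> str:
    """Hypothetical 'model': maps a context to the next word."""
    continuations = {
        ("the",): "cat",
        ("the", "cat"): "sat",
        ("the", "cat", "sat"): "down",
    }
    return continuations.get(tuple(sequence), "<eos>")

def generate(prompt: list[str], max_new_tokens: int = 5) -> list[str]:
    seq = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next(seq)   # prediction depends on own prior output
        if token == "<eos>":
            break
        seq.append(token)           # feed the prediction back into the context
    return seq

result = generate(["the"])          # ["the", "cat", "sat", "down"]
```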

May 20, 2025 · Last updated on August 18, 2025 · 9 min · KKKZOZ

DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

November 13, 2025 · Last updated on February 9, 2026 · 0 min · KKKZOZ

Fast On-device LLM Inference with NPUs

August 4, 2025 · Last updated on February 9, 2026 · 0 min · KKKZOZ

LLM Preliminaries

Math — Vector-Matrix Multiplication: analyzing the operation $xW$, a vector multiplied by a matrix, from three different perspectives. Suppose the vector $x$ has shape $(1, 3)$ and the matrix $W$ has shape $(3, 6)$. $$x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$$$$ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & w_{16} \\\\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & w_{26} \\\\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & w_{36} \end{bmatrix} $$By the rules of matrix multiplication, the result $y = xW$ has shape $(1, 6)$. Perspective 1: treating W as a 2-D grid of elements. This is the most basic, most fine-grained view. Treat the matrix $W$ as a $3 \times 6$ grid of numbers. Each element $y_j$ of the result vector $y$ is obtained by multiplying each element of $x$ with the corresponding element in column $j$ of $W$ and summing the products: $y_j = \sum_{i=1}^{3} x_i w_{ij}$. ...
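Perspective one can be checked element by element in a few lines of plain Python. The concrete numbers below are chosen purely for illustration:

```python
# Element-wise check of perspective one: each y_j is the dot product of
# x with column j of W, i.e. y_j = sum_i x_i * w_ij.
x = [1.0, 2.0, 3.0]                  # shape (1, 3), written flat
W = [[1, 0, 0, 1, 2, 0],             # shape (3, 6)
     [0, 1, 0, 1, 0, 2],
     [0, 0, 1, 1, 0, 0]]

# One sum per output column: 6 dot products of length 3.
y = [sum(x[i] * W[i][j] for i in range(3)) for j in range(6)]
# y == [1.0, 2.0, 3.0, 6.0, 2.0, 4.0], shape (1, 6)
```

Columns 0 to 2 simply copy $x_1, x_2, x_3$ (unit columns), column 3 sums all of $x$, which makes the per-column dot-product structure easy to see.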

August 4, 2025 · Last updated on September 1, 2025 · 13 min · KKKZOZ