FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Extensive Reading Author Info About me - Xunhao Lai, who is good at writing Triton; another repo of his: XunhaoLai/native-sparse-attention-triton: Efficient triton implementation of Native Sparse Attention. Background As LLM context windows expand (up to 1M+ tokens), the pre-filling phase (processing the input prompt) becomes prohibitively expensive due to the quadratic complexity of full attention ($O(n^2)$). Why prior sparse attention is insufficient Many approaches use fixed sparse patterns (e.g., sliding window) or offline-discovered patterns/ratios. These often fail because: ...

January 29, 2026 · Last updated on February 2, 2026 · 5 min · KKKZOZ

XAttention: Block Sparse Attention with Antidiagonal Scoring

Extensive Reading Author Info MIT HAN Lab Background Long-Context Transformer Models (LCTMs) are increasingly needed (e.g., long-document QA, long video understanding/generation), but prefill attention is a major bottleneck because standard attention scales quadratically with sequence length. Insights Using antidiagonals within a block captures every part of a Vertical-Slash pattern: assuming the overall pattern is sparse, any block that contains a vertical or slash line accumulates a large score and is therefore more likely to be selected. Why antidiagonals help: Information coverage: with the proposed strided antidiagonal selection, every token contributes to at least one antidiagonal sum (so important regions are unlikely to be missed). Pattern detection: antidiagonals intersect the vertical and slash sparse patterns commonly found within a block, so those patterns are detected without explicitly searching for them. Could we say the premise of this paper is that every head follows the Vertical-Slash pattern? Challenges Overall the idea is simple, but how the scores are actually computed (Algorithm 1) is fairly hard to follow; it helps to simulate it by hand, e.g., with B=4, S=2. The most important step is the stride-based downsampling. Assume: L=16, d=4, B=4, S=2 ...
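A toy sketch of the strided antidiagonal scoring idea, as a simplification of Algorithm 1 rather than the paper's code: it operates on an already-materialized dense score matrix for clarity, and the function name and stride convention are assumptions.

```python
import torch

def antidiagonal_block_scores(attn, block_size=4, stride=2):
    """Toy sketch (assumed simplification, not the paper's Algorithm 1):
    score each (block_size x block_size) block of a dense score matrix by
    summing the entries that lie on every `stride`-th antidiagonal inside
    the block. Blocks crossed by a vertical or slash line accumulate large
    sums and are therefore selected first.
    """
    L = attn.shape[-1]
    n_blocks = L // block_size
    # Keep only strided antidiagonal positions inside a block:
    # position (i, j) is kept when (i + j) % stride == stride - 1.
    i = torch.arange(block_size).unsqueeze(1)
    j = torch.arange(block_size).unsqueeze(0)
    mask = ((i + j) % stride == stride - 1).to(attn.dtype)

    # Reshape into (row_block, col_block, i, j) tiles and apply the mask.
    blocks = attn.reshape(n_blocks, block_size, n_blocks, block_size).permute(0, 2, 1, 3)
    return (blocks * mask).sum(dim=(-1, -2))   # (n_blocks, n_blocks) block scores

# Tiny example matching the suggested hand simulation: L=16, B=4, S=2.
attn = torch.rand(16, 16)
print(antidiagonal_block_scores(attn, block_size=4, stride=2))
```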

January 29, 2026 · Last updated on February 2, 2026 · 2 min · KKKZOZ

torch-python

Tensor Operations clamp torch.clamp (or the Tensor instance method .clamp) is a common PyTorch operation for value clipping. Its main purpose is to restrict every element of the input tensor to a specified range $[min, max]$. Example:

```python
import torch

# Initialize a tensor with values ranging from -10 to 10
data = torch.tensor([-10.0, -5.0, 0.5, 5.0, 10.0])
print(f"Original: {data}")

# 1. Clamp between a min and max range [-1, 1]
#    Values < -1 become -1; Values > 1 become 1
clamped_both = data.clamp(min=-1.0, max=1.0)
print(f"Range [-1, 1]: {clamped_both}")

# 2. Clamp with only a lower bound (min=-2)
#    Values < -2 become -2; No upper limit
clamped_min = data.clamp(min=-2.0)
print(f"Min -2 only: {clamped_min}")

# 3. Clamp with only an upper bound (max=3)
#    Values > 3 become 3; No lower limit
clamped_max = data.clamp(max=3.0)
print(f"Max 3 only: {clamped_max}")
```

Advanced Indexing x[y] is the very powerful and flexible **advanced indexing** syntax in PyTorch (and NumPy) ...
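The excerpt cuts off here; as a small illustration of what x[y] advanced indexing does (my own example, not taken from the original post):

```python
import torch

x = torch.arange(12).reshape(4, 3)      # 4 rows, 3 columns

# Integer-tensor indexing: gather rows 0, 2, 2, 3 (repeats are allowed).
row_idx = torch.tensor([0, 2, 2, 3])
print(x[row_idx])                       # shape (4, 3)

# Boolean-mask indexing: select elements where the mask is True.
mask = x > 6                            # same shape as x
print(x[mask])                          # 1-D tensor of the selected elements
```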

January 15, 2026 · Last updated on January 27, 2026 · 11 min · KKKZOZ

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Extensive Reading Author Info Background Reinforcement learning with verifiable rewards (RLVR, as used to train DeepSeek-R1 or OpenAI o1) has substantially improved LLM reasoning ability. However, existing methods typically train on every generated token and lack a fine-grained understanding of which tokens actually drive the reasoning gains. Insights The paper first performs a qualitative and quantitative analysis of token-entropy patterns in chain-of-thought (CoT): Entropy distribution in CoT: Low-Entropy Majority: most tokens are generated with low entropy. They mainly complete syntactic structure or carry routine narration (e.g., "The answer is", "implies that") and tend to "follow the path". High-Entropy Minority: only a small fraction of tokens have high entropy. They typically appear at key turning points of logical reasoning, hypothesis formation, or step selection (e.g., "However", "Suppose", "Thus") and are called **"Forking Tokens"**; they "fork the path". RLVR training largely preserves the base model's entropy pattern: training mainly adjusts the probability distributions of tokens that were already high-entropy, while low-entropy tokens barely change. Based on these observations, the authors propose an improved RLVR strategy that computes gradients only for high-entropy tokens (see the sketch below). Challenges Approaches Evaluation The authors run extensive experiments on Qwen3-8B, 14B, and 32B models; the main conclusions are: ...
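A minimal sketch of restricting the RL update to high-entropy tokens, assuming per-token logits and a generic per-token policy-gradient loss; the function name, the 20% keep ratio, and the loss stand-in are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def high_entropy_token_mask(logits, keep_ratio=0.2):
    """Toy sketch: keep only the top `keep_ratio` fraction of tokens by
    generation entropy, so the RLVR loss is computed on the high-entropy
    "forking" tokens only.
    logits: (seq_len, vocab_size) pre-softmax scores for each generated token.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (seq_len,)
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = entropy.topk(k).values.min()
    return entropy >= threshold                            # boolean mask

# Usage: zero out the per-token policy-gradient loss on low-entropy tokens.
logits = torch.randn(128, 32000)
per_token_loss = torch.randn(128)          # stand-in for a PPO/GRPO token loss
mask = high_entropy_token_mask(logits, keep_ratio=0.2)
masked_loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```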

January 7, 2026 · Last updated on February 2, 2026 · 1 min · KKKZOZ

Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference

Extensive Reading Author Info Background The Problem: Standard decoding applies the same computational power to every token generated. However, text generation has heterogeneous complexity: a complex logical deduction in a mathematical proof requires significantly more “intelligence” than generating routine connecting phrases (e.g., “therefore,” “it follows that”). The Limitation of Existing Solutions: Current optimization techniques, such as Speculative Decoding, are conservative. They prioritize perfect output fidelity, ensuring the output matches the large model exactly by verifying every token. The authors argue this is unnecessary for many applications. Insights The paper proposes Entropy Adaptive Decoding (EAD), which dynamically switches between a small model ($M_S$) and a large model ($M_L$) during generation. Unlike speculative decoding, EAD accepts controlled output divergence, meaning the output might differ from what the large model would have produced alone, provided the reasoning remains sound. So why not use EAD when divergence occurs in Speculative Decoding? ...
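A minimal sketch of the switching idea, assuming two HF-style causal LMs that share one tokenizer; the greedy decoding, the threshold value, and the function name `ead_generate` are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ead_generate(small, large, tokenizer, prompt, max_new_tokens=64,
                 entropy_threshold=2.0):
    """Sketch: decode with the small model while its predictive entropy stays
    low, and hand single steps to the large model when entropy exceeds
    `entropy_threshold` (i.e., when the small model is uncertain).
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = small(ids).logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        if entropy.item() > entropy_threshold:
            # Uncertain step: let the large model pick this token instead.
            logits = large(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0])
```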

January 7, 2026 · Last updated on February 2, 2026 · 3 min · KKKZOZ

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Extensive Reading Author Info Background Insights The paper starts from two key observations about LLM inference: Prompt encoding is parallel: processing the input prompt can be highly parallelized, so even a large model is relatively efficient at this stage. Autoregressive decoding is serial: generating the response must proceed token by token and is bound by memory bandwidth (the memory wall), so a large model is slow and expensive here. The idea is to decouple the two tasks: use a frozen LLM to process the prompt and extract high-quality deep semantic representations (“Think Big”), then pass those representations to a small language model (SLM) that performs the subsequent autoregressive decoding (“Generate Quick”). Approaches The architecture has three main components (see the sketch below): LLM Encoder ($f_{\xi}$): Role: encode the input prompt and extract high-dimensional, high-quality representations $H$. Status: frozen during both training and inference, so no parameters need updating, which saves training resources. Choice: usually the encoder of an encoder-decoder model (e.g., T5); for a decoder-only model (e.g., GPT), intermediate-layer features are extracted (but the paper finds encoder-decoder works better). Projector ($q_{\phi}$): Role: resolve the dimension mismatch between the LLM and the SLM. Structure: a simple lightweight MLP (Linear $\to$ ReLU $\to$ Linear). Flow: map the LLM's high-dimensional features $H$ into the SLM's embedding-space dimension to obtain $Z$. SLM ($g_{\theta}$): Role: take the projected features plus the original prompt and generate autoregressively. Status: fully fine-tuned (or partially fine-tuned) so it learns to exploit the strong semantic features provided by the LLM. Choice: encoder-decoder or decoder-only (e.g., GPT-2, T5 Small). How the LLM's “thinking” is injected into the SLM is the key question. ...
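A minimal sketch of the projector $q_{\phi}$ described above (Linear $\to$ ReLU $\to$ Linear); the hidden dimensions and the way the SLM consumes $Z$ are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of the lightweight projector q_phi: maps frozen-LLM features H
    (llm_dim) into the SLM embedding space (slm_dim). Dimensions are assumed."""
    def __init__(self, llm_dim=4096, slm_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, slm_dim),
            nn.ReLU(),
            nn.Linear(slm_dim, slm_dim),
        )

    def forward(self, H):        # H: (batch, prompt_len, llm_dim) from the frozen LLM encoder
        return self.net(H)       # Z: (batch, prompt_len, slm_dim), consumed by the SLM

# The frozen LLM encodes the prompt once to produce H; the SLM then conditions
# on Z (e.g., alongside its own prompt embeddings) during autoregressive decoding.
H = torch.randn(1, 32, 4096)
Z = Projector()(H)
print(Z.shape)                   # torch.Size([1, 32, 768])
```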

January 7, 2026 · Last updated on February 2, 2026 · 1 min · KKKZOZ

SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Extensive Reading Author Info SEC: CCF C Background Insights A pure implementation of speculative decoding in edge scenarios: edge devices hold draft models, edge servers hold verifier models. Approaches Route to the server when the confidence score of a token generated by the edge device falls below a given threshold. Two details: while sending tokens to the server, the edge device keeps generating draft tokens, expecting the verifier to accept all the sent tokens; when retrying due to network issues, the edge device can append newly generated tokens to the draft sequence. Evaluation ...
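A toy sketch of the edge-side drafting loop under the threshold rule described above; `draft_model` is assumed to be an HF-style causal LM, and the threshold, greedy drafting, and function name are illustrative assumptions (the pipelined send/retry behavior is omitted).

```python
import torch
import torch.nn.functional as F

def draft_on_edge(draft_model, ids, confidence_threshold=0.5, max_draft=8):
    """Sketch: keep drafting tokens locally until the draft model's confidence
    in a token drops below `confidence_threshold`, then ship the accumulated
    draft sequence to the server-side verifier.
    ids: (1, seq_len) token ids of the context so far.
    """
    draft = []
    for _ in range(max_draft):
        probs = F.softmax(draft_model(ids).logits[:, -1, :], dim=-1)
        conf, next_id = probs.max(dim=-1)          # greedy draft token + its probability
        draft.append(next_id)
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
        if conf.item() < confidence_threshold:
            break                                  # low confidence: route to the verifier
    return draft, ids                              # server verifies `draft`, accepts a prefix
```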

December 7, 2025 · Last updated on February 2, 2026 · 1 min · KKKZOZ

KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Extensive Reading Author Info IPADS Alibaba Group Background Large-scale services go further and maintain a KV cache cache (prefix/prompt cache) that reuses KV blocks across different requests that share prefixes. Most deployed KV eviction strategies reuse general-purpose cache policies: recency-based (LRU, FIFO) and frequency-based (LFU) policies, sometimes combined (e.g., GDSF-style recency–frequency–size heuristics). These methods are workload-agnostic and overlook several KV-specific realities: KV blocks often have short, bursty lifespans; past frequency is a poor predictor of future reuse. Different request categories (API vs chat, first turn vs later turns) have very different reuse patterns that generic policies cannot distinguish. Spatial locality is highly asymmetric: early “head” blocks of prompts are far more valuable than late “tail” blocks, but standard policies treat all blocks similarly. Observations Trace A: To-C workload, a consumer-facing trace including: ...
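For concreteness, a toy prefix KV-block cache with plain LRU eviction, i.e., the kind of workload-agnostic policy the paper critiques; the class name, block size, and hash-by-prefix keying are my assumptions, not the paper's system.

```python
from collections import OrderedDict

class PrefixKVCache:
    """Toy sketch: prompts are split into fixed-size token blocks, each block
    is keyed by its full prompt prefix, and eviction is plain LRU."""
    def __init__(self, capacity_blocks=1024, block_tokens=16):
        self.capacity = capacity_blocks
        self.block_tokens = block_tokens
        self.blocks = OrderedDict()              # prefix key -> KV tensors for one block

    def lookup(self, token_ids):
        """Return how many leading blocks of this prompt are already cached."""
        hit_blocks = 0
        for end in range(self.block_tokens, len(token_ids) + 1, self.block_tokens):
            key = tuple(token_ids[:end])
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)         # LRU touch on a hit
            hit_blocks += 1
        return hit_blocks

    def insert(self, token_ids, kv_per_block):
        """Insert one KV entry per prompt block; evict LRU blocks when full."""
        for i, kv in enumerate(kv_per_block):
            key = tuple(token_ids[:(i + 1) * self.block_tokens])
            self.blocks[key] = kv
            self.blocks.move_to_end(key)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least-recently-used block
```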

December 6, 2025 · Last updated on February 2, 2026 · 4 min · KKKZOZ