Estimating LLM Uncertainty with Evidence

Extensive Reading Author Info Background Hallucinations occur in Large Language Models (LLMs) when models generate unreliable responses due to a lack of knowledge. Existing methods for estimating uncertainty to detect hallucinations are flawed: Failure of Probability-Based Methods: Traditional methods rely on softmax probabilities. The normalization process (softmax) causes a loss of “evidence strength” information. A high probability does not always mean the model is knowledgeable; it might simply mean one token is slightly better than others in a low-knowledge scenario. Conversely, a low probability might not mean ignorance; it could mean the model knows multiple valid answers (e.g., synonyms). Limitations of Sampling-Based Methods: Methods like Semantic Entropy require multiple sampling iterations, which is computationally expensive, and they fail to capture the model’s inherent epistemic uncertainty (e.g., consistently producing the same incorrect answer due to lack of training data). Insights The reason why probability-based methods fail to identify reliability is that probability is normalized. ...
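The normalization problem is easy to demonstrate: softmax is shift-invariant, so the magnitude of the logits (the "evidence strength") is discarded. A tiny plain-Python sketch, with illustrative logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two scenarios with very different evidence strength: adding a constant
# to every logit leaves the normalized probabilities unchanged.
weak_evidence   = [1.0, 0.5, 0.0]   # low-magnitude logits
strong_evidence = [9.0, 8.5, 8.0]   # same gaps, shifted by +8

p_weak = softmax(weak_evidence)
p_strong = softmax(strong_evidence)

# The distributions are identical, so any probability-based uncertainty
# score cannot tell the two cases apart.
assert all(abs(a - b) < 1e-9 for a, b in zip(p_weak, p_strong))
```

This is exactly the information an evidence-based estimator tries to keep before normalization throws it away.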

February 2, 2026 · Last updated on February 2, 2026 · 4 min · KKKZOZ

R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning

Extensive Reading Author Info R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning Background Existing acceleration methods like Speculative Decoding have limitations: Rigid Consistency: They require the Small Language Model (SLM) to match the LLM’s tokens exactly. If the SLM phrases a correct reasoning step differently, speculative decoding rejects it, wasting computation. Low Agreement: In complex reasoning tasks, token-level agreement between SLMs and LLMs is often low, leading to frequent rollbacks and minimal speed gains. ...

February 2, 2026 · Last updated on February 2, 2026 · 3 min · KKKZOZ

FlexPrefill A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Extensive Reading Author Info About me - Xunhao Lai The author is good at writing Triton; here is another of his repos: XunhaoLai/native-sparse-attention-triton: Efficient triton implementation of Native Sparse Attention. Background As LLM context windows expand (up to 1M+ tokens), the pre-filling phase (processing the input prompt) becomes prohibitively expensive due to the quadratic complexity of full attention ($O(n^2)$). Why prior sparse attention is insufficient Many approaches use fixed sparse patterns (e.g., sliding window) or offline-discovered patterns/ratios. These often fail because: ...

January 29, 2026 · Last updated on February 2, 2026 · 5 min · KKKZOZ

XAttention Block Sparse Attention with Antidiagonal Scoring

Extensive Reading Author Info MIT HAN Lab Background Long-Context Transformer Models (LCTMs) are increasingly needed (e.g., long-document QA, long video understanding/generation), but prefill attention is a major bottleneck because standard attention scales quadratically with sequence length. Insights Within a block, antidiagonals can capture every part of a Vertical-Slash pattern: assuming the overall pattern is sparse, any block that contains a vertical or slash segment gets a large antidiagonal score and is therefore more likely to be selected. Why antidiagonals help: Information coverage: with the proposed strided antidiagonal selection, every token contributes to at least one antidiagonal sum, so important regions are unlikely to be missed. Pattern detection: antidiagonals intersect the vertical and slash sparse patterns that are common within a block, so these patterns can be detected without searching for them explicitly. Arguably, the premise of this paper is that every head follows the Vertical-Slash pattern. Challenges Overall the idea is simple, but the actual computation (Algorithm 1) is fairly hard to follow; it is worth simulating by hand, with suggested sizes B=4, S=2. The most important step is the stride-based downsampling. Assume: L=16, d=4, B=4, S=2 ...
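A minimal sketch of the antidiagonal-scoring idea, assuming the block score is simply the sum of the entries lying on every S-th antidiagonal (the paper's actual Algorithm 1 works on strided sums of downsampled attention scores; block values here are illustrative, with the suggested B=4, S=2):

```python
import numpy as np

def antidiagonal_block_score(block: np.ndarray, stride: int) -> float:
    """Score a B x B attention-score block by summing entries on every
    `stride`-th antidiagonal (positions where (i + j) % stride == 0).
    A vertical or slash line inside the block must intersect these
    antidiagonals, so such blocks receive a high score and are kept."""
    B = block.shape[0]
    score = 0.0
    for i in range(B):
        for j in range(B):
            if (i + j) % stride == 0:
                score += float(block[i, j])
    return score

B, S = 4, 2
# A block dominated by one "hot" key column (a vertical pattern) ...
vertical = np.full((B, B), 0.01)
vertical[:, 2] = 1.0
# ... versus a uniform low-magnitude block.
uniform = np.full((B, B), 0.05)

assert antidiagonal_block_score(vertical, S) > antidiagonal_block_score(uniform, S)
```

Note that with S <= B, every row and every column of the block hits at least one selected antidiagonal, which is the "information coverage" property above.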

January 29, 2026 · Last updated on February 2, 2026 · 2 min · KKKZOZ

Beyond the 80 20 Rule High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Extensive Reading Author Info Background Reinforcement Learning with Verifiable Rewards (RLVR, the technique used to train DeepSeek-R1 and OpenAI o1) has significantly improved LLM reasoning. However, existing methods train on every generated token and lack a fine-grained understanding of which tokens actually drive the gains in reasoning ability. Insights The paper first analyzes token-entropy patterns in chain-of-thought (CoT) traces, qualitatively and quantitatively: Entropy distribution in CoT: Low-Entropy Majority: most tokens are generated with low entropy. They mainly complete syntactic structure or carry routine narration (e.g., “The answer is”, “implies that”) and tend to “follow the path”. High-Entropy Minority: only a small fraction of tokens have high entropy. They typically appear at key logical turning points, hypothesis introduction, or step selection (e.g., “However”, “Suppose”, “Thus”) and are called “Forking Tokens”: they “fork the path”. RLVR training largely preserves the base model’s entropy pattern: training mainly adjusts the probability distributions of tokens that were already high-entropy, while low-entropy tokens barely change. Based on these observations, the authors propose an improved RLVR strategy that computes gradients only for high-entropy tokens. Challenges Approaches Evaluation The authors run extensive experiments on Qwen3-8B, 14B, and 32B models; the main conclusions are: ...
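The selective-gradient strategy can be sketched as follows (NumPy; the helper names and the `top_ratio` default are illustrative assumptions, with 20% matching the 80/20 framing of the title):

```python
import numpy as np

def entropy_from_logits(logits: np.ndarray) -> np.ndarray:
    """Per-token entropy H = -sum(p * log p) from a (T, V) logit matrix."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def high_entropy_mask(logits: np.ndarray, top_ratio: float = 0.2) -> np.ndarray:
    """Boolean mask selecting the top `top_ratio` fraction of tokens by
    entropy; only these "forking" tokens would receive a policy gradient,
    while the low-entropy majority is excluded from the RLVR update."""
    H = entropy_from_logits(logits)
    k = max(1, int(round(top_ratio * len(H))))
    threshold = np.sort(H)[-k]
    return H >= threshold
```

In a real RLVR loop this mask would multiply the per-token policy-gradient loss, zeroing out contributions from low-entropy tokens.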

January 7, 2026 · Last updated on February 2, 2026 · 1 min · KKKZOZ

Entropy Adaptive Decoding Dynamic Model Switching for Efficient Inference

Extensive Reading Author Info Background The Problem: Standard decoding applies the same computational power to every token generated. However, text generation has heterogeneous complexity. A complex logical deduction in a mathematical proof requires significantly more “intelligence” than generating routine connecting phrases (e.g., “therefore,” “it follows that”). The Limitation of Existing Solutions: Current optimization techniques, such as Speculative Decoding, are conservative. They prioritize perfect output fidelity, ensuring the output matches the large model exactly by verifying every token. The authors argue this is unnecessary for many applications. Insights The paper proposes Entropy Adaptive Decoding (EAD), which dynamically switches between a small model ($M_S$) and a large model ($M_L$) during generation. Unlike speculative decoding, EAD accepts controlled output divergence—meaning the output might differ from what the large model would have produced alone, provided the reasoning remains sound. So why not use EAD when divergence occurs in Speculative Decoding? ...
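The switching rule can be sketched in a few lines (plain Python; the function names and the threshold value are illustrative, not the paper's API):

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_model(small_probs, threshold):
    """EAD-style routing sketch: keep decoding with the small model M_S
    while its next-token distribution is confident (low entropy); hand
    the step to the large model M_L once entropy crosses the threshold."""
    return "large" if token_entropy(small_probs) > threshold else "small"

# Routine connective tokens are low-entropy -> small model suffices.
assert choose_model([0.97, 0.01, 0.01, 0.01], threshold=0.5) == "small"
# Genuinely uncertain steps are high-entropy -> escalate to the LLM.
assert choose_model([0.25, 0.25, 0.25, 0.25], threshold=0.5) == "large"
```

The accepted divergence is visible here: whenever the small model's step is kept, the token may differ from what $M_L$ alone would have emitted.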

January 7, 2026 · Last updated on February 2, 2026 · 3 min · KKKZOZ

Think Big, Generate Quick LLM-to-SLM for Fast Autoregressive Decoding

Extensive Reading Author Info Background Insights The paper starts from two key observations about LLM inference: Prompt encoding is parallel: processing the input prompt is highly parallelizable, so even a large model handles this phase relatively efficiently. Autoregressive decoding is serial: generating the response proceeds token by token and is bound by memory bandwidth (the “memory wall”), making large models slow and expensive in this phase. The idea is to decouple the two: use a frozen LLM to process the prompt and extract high-quality deep semantic representations (“Think Big”), then pass those representations to a small model (SLM) that performs the subsequent autoregressive decoding (“Generate Quick”). Approaches The architecture has three main components: LLM Encoder ($f_{\xi}$): encodes the input prompt into high-dimensional, high-quality representations $H$. It stays frozen during training and inference, requiring no parameter updates and saving training resources. Typically the encoder of an encoder-decoder model (e.g., T5) is used; for decoder-only models (e.g., GPT), intermediate-layer features are extracted instead (though the paper finds encoder-decoder works better). Projector ($q_{\phi}$): resolves the dimension mismatch between the LLM and the SLM. It is a simple lightweight MLP (Linear $\to$ ReLU $\to$ Linear) that maps the LLM's high-dimensional features $H$ into the SLM's embedding space, yielding $Z$. SLM ($g_{\theta}$): receives the projected features and the original prompt and performs autoregressive generation. It is fully (or partially) fine-tuned so that it learns to exploit the strong semantic features provided by the LLM. It can be encoder-decoder or decoder-only (e.g., GPT-2, T5 Small). The key question is how to inject the LLM's “thinking” into the SLM. ...
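The projector $q_{\phi}$ is small enough to sketch directly (a NumPy stand-in for the Linear $\to$ ReLU $\to$ Linear MLP; all dimensions and weight initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def projector(H, W1, b1, W2, b2):
    """Lightweight MLP (Linear -> ReLU -> Linear) mapping the frozen
    LLM's prompt representation H of shape (T, d_llm) into the SLM
    embedding space, yielding Z of shape (T, d_slm)."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2

# Illustrative sizes: LLM hidden dim 1024, SLM embedding dim 256.
d_llm, d_hidden, d_slm, T = 1024, 512, 256, 7
H = rng.normal(size=(T, d_llm))                       # frozen LLM output
W1 = rng.normal(scale=0.02, size=(d_llm, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.02, size=(d_hidden, d_slm))
b2 = np.zeros(d_slm)

Z = projector(H, W1, b1, W2, b2)                      # fed to the SLM
assert Z.shape == (T, d_slm)
```

In the actual system only the projector and the SLM are trained; $H$ comes from the frozen LLM encoder.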

January 7, 2026 · Last updated on February 2, 2026 · 1 min · KKKZOZ

SLED A Speculative LLM Decoding Framework for Efficient Edge Serving

Extensive Reading Author Info SEC: CCF C Background Insights A pure application of speculative decoding to edge scenarios: edge devices hold the draft models, while edge servers hold the verifier models. Approaches Route to the server when the confidence score associated with a token generated by the edge device falls below a given threshold. Two details: While sending tokens to the server, the edge device keeps generating draft tokens, expecting the verifier to accept all sent tokens. When retrying due to network issues, the edge device can append newly generated tokens to the draft sequence. Evaluation ...
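The routing rule can be sketched as follows (plain Python; the function name and the exact batching behavior around the threshold are illustrative assumptions, not SLED's implementation):

```python
def route_tokens(draft_tokens, confidences, threshold):
    """SLED-style routing sketch: the edge device accumulates draft
    tokens while their confidence scores stay at or above the threshold,
    and flushes the accumulated draft to the server-side verifier as
    soon as a token's confidence falls below it."""
    batch = []
    for tok, conf in zip(draft_tokens, confidences):
        batch.append(tok)
        if conf < threshold:
            return batch, True    # low confidence: send draft to server
    return batch, False           # all confident: no server round-trip

batch, sent = route_tokens(["The", "answer", "is"], [0.9, 0.6, 0.95], 0.7)
# "answer" dips below 0.7, so the draft ["The", "answer"] is shipped.
assert batch == ["The", "answer"] and sent
```

The two details above layer on top of this: drafting continues speculatively while the batch is in flight, and a retry may carry those newly drafted tokens along.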

December 7, 2025 · Last updated on February 2, 2026 · 1 min · KKKZOZ