Hierarchical Speculative Decoding with Dynamic Windows for Efficient Language Model Inference

AI-Aided Author Info

Background
The Bottleneck: LLM inference is slow due to its auto-regressive nature and memory-bandwidth constraints.
Existing Solution (Speculative Decoding): Standard speculative decoding (SD) uses a small "draft model" to predict a fixed number of tokens ($K$), which are then verified by the larger "target model".
The Limitation: SD relies on a fixed window size $K$. If $K$ is too large, the draft model generates bad tokens that waste verification time; if $K$ is too small, it caps the potential speedup. Previous methods that adjust $K$ dynamically often require extra training or complex resource management.

Insights
Use entropy to decide the window size $K$ dynamically.
Hierarchical speculative decoding with three models: M1, M2, MP.
When the confidence score of M2 is high, the draft-verify process happens only between M1 and M2, without involving MP.

Challenges
Can we dynamically adjust the window size $K$ without requiring any additional training?
Can we leverage models of different sizes to enhance speed?

Approaches
Self-verify: the model verifies its own draft tokens. ...
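The entropy-based window rule can be sketched as follows. This is a minimal illustration, not the paper's exact formula: the threshold `tau` and the linear entropy-to-$K$ mapping are hypothetical choices.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dynamic_window(probs, k_min=2, k_max=8, tau=1.0):
    """Pick a speculation window K from the draft model's confidence:
    low entropy (confident draft) -> large K; high entropy -> small K.
    The threshold tau and linear mapping are illustrative, not the
    paper's actual rule."""
    h = entropy(probs)
    if h >= tau:
        return k_min
    # Scale K up as entropy drops below the threshold
    frac = 1.0 - h / tau
    return k_min + round(frac * (k_max - k_min))

confident = [0.97, 0.01, 0.01, 0.01]   # near-deterministic draft
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum entropy over 4 tokens
print(dynamic_window(confident))       # large window
print(dynamic_window(uncertain))       # small window
```

With a confident draft distribution the window grows toward `k_max`; at maximum entropy it collapses to `k_min`, limiting wasted draft tokens.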

February 7, 2026 · Last updated on February 9, 2026 · 8 min · KKKZOZ

AIConfigurator Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

AI-Aided Author Info

Background
The primary challenges in optimizing LLM inference in production are the combinatorial explosion of configuration parameters (parallelism strategies, batch sizes, quantization) and the diversity of inference frameworks (TensorRT-LLM, vLLM, SGLang), which make manual tuning or exhaustive GPU benchmarking prohibitively expensive and slow.

Insights
Instead of modeling the entire neural network as a black box, the system breaks LLM inference down into fundamental, reusable operations called primitives, profiles these primitives, and then combines the statistics to model the end-to-end inference process. ...
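The primitive-based modeling idea can be sketched as follows. All primitive names and timings here are hypothetical placeholders, purely to show how per-primitive profiles compose into an end-to-end estimate:

```python
# Profiled cost (ms) of each primitive at a fixed batch size / shape,
# gathered once per GPU. Names and numbers are illustrative.
primitive_profile = {
    "qkv_gemm": 0.40,
    "attention": 0.90,
    "mlp_gemm": 1.10,
    "allreduce": 0.25,
}

# One decoder layer = a sequence of primitives; a model = n_layers copies.
layer = ["qkv_gemm", "attention", "mlp_gemm", "allreduce"]

def estimate_layer_ms(profile, ops):
    return sum(profile[op] for op in ops)

def estimate_model_ms(profile, ops, n_layers):
    return n_layers * estimate_layer_ms(profile, ops)

# Compare two candidate configs without touching a GPU: e.g., tensor
# parallelism might shrink GEMM time but add communication cost.
tp2_profile = dict(primitive_profile,
                   qkv_gemm=0.22, mlp_gemm=0.60, allreduce=0.45)
print(estimate_model_ms(primitive_profile, layer, n_layers=32))
print(estimate_model_ms(tp2_profile, layer, n_layers=32))
```

Because each configuration is scored by table lookups and sums rather than real runs, a large configuration space can be searched in seconds.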

February 5, 2026 · Last updated on February 9, 2026 · 3 min · KKKZOZ

Revati Transparent GPU-Free Time-Warp Emulation for LLM Serving

AI-Aided Author Info

Background
The paper identifies a critical bottleneck in deploying Large Language Models (LLMs):
The Optimization Challenge: Efficient deployment requires tuning a vast configuration space (e.g., parallelism strategies, batch sizes, caching policies).
Cost vs. Fidelity Trade-off: Testing on real GPU hardware is prohibitively expensive and slow, while discrete-event simulators (DES) are fast and cheap but require manually re-implementing the serving system's complex control logic. Because frameworks like vLLM and SGLang evolve rapidly, simulators suffer from a perpetual "semantic gap" and a high maintenance burden.

Insights ...

February 5, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

Estimating LLM Uncertainty with Evidence

Extensive Reading Author Info

Background
Hallucinations exist in Large Language Models (LLMs): models generate unreliable responses due to a lack of knowledge. Existing methods for estimating uncertainty to detect hallucinations are flawed:
Failure of Probability-Based Methods: Traditional methods rely on softmax probabilities, but the softmax normalization discards "evidence strength" information. A high probability does not always mean the model is knowledgeable; it may simply mean one token is slightly better than the others in a low-knowledge scenario. Conversely, a low probability does not necessarily mean ignorance; the model may know multiple valid answers (e.g., synonyms).
Limitations of Sampling-Based Methods: Methods like Semantic Entropy require multiple sampling passes, which is computationally expensive, and they fail to capture the model's inherent epistemic uncertainty (e.g., consistently producing the same incorrect answer due to a lack of training data).

Insights
Probability-based methods fail to identify reliability because probability is normalized. ...
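The normalization problem can be demonstrated numerically: shifting all logits by a constant changes the unnormalized evidence but leaves the softmax output identical, so softmax alone cannot distinguish a high-evidence prediction from a low-evidence one. The logit values below are toy numbers, and "sum of exponentiated logits" is only an illustrative proxy for evidence strength, not the paper's estimator:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two heads whose logits differ only by a constant shift:
# softmax cannot tell them apart.
weak   = [2.0, 1.0, 0.0]   # small-magnitude logits ("little evidence")
strong = [7.0, 6.0, 5.0]   # same gaps, much larger magnitude

p_weak, p_strong = softmax(weak), softmax(strong)
print(p_weak)
print(p_strong)   # identical to p_weak

# An unnormalized quantity (sum of exp(logit)) does distinguish them.
print(sum(math.exp(x) for x in weak))
print(sum(math.exp(x) for x in strong))
```

Both distributions are identical after normalization, which is exactly the information loss the paper attributes to softmax.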

February 2, 2026 · Last updated on February 2, 2026 · 4 min · KKKZOZ

R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning

Extensive Reading Author Info

Background
Existing acceleration methods such as speculative decoding have limitations:
Rigid Consistency: They require the small language model (SLM) to match the LLM's tokens exactly. If the SLM phrases a correct reasoning step differently, speculative decoding rejects it, wasting computation.
Low Agreement: In complex reasoning tasks, token-level agreement between SLMs and LLMs is often low, leading to frequent rollbacks and minimal speed gains. ...

February 2, 2026 · Last updated on February 2, 2026 · 3 min · KKKZOZ

FlexPrefill A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Extensive Reading Author Info

About me - Xunhao Lai. Good at writing Triton; here is another repo: XunhaoLai/native-sparse-attention-triton: Efficient triton implementation of Native Sparse Attention.

Background
As LLM context windows expand (up to 1M+ tokens), the prefill phase (processing the input prompt) becomes prohibitively expensive due to the quadratic complexity of full attention ($O(n^2)$).
Why prior sparse attention is insufficient: Many approaches use fixed sparse patterns (e.g., sliding window) or offline-discovered patterns/ratios. These often fail because: ...

January 29, 2026 · Last updated on February 2, 2026 · 5 min · KKKZOZ

XAttention Block Sparse Attention with Antidiagonal Scoring

Extensive Reading Author Info MIT HAN Lab

Background
Long-Context Transformer Models (LCTMs) are increasingly needed (e.g., long-document QA, long-video understanding/generation), but prefill attention is a major bottleneck because standard attention scales quadratically with sequence length.

Insights
Within a block, antidiagonals can capture every part of the Vertical-Slash pattern. Assuming the overall pattern is sparse, any block that contains a vertical or slash structure receives a high score and is therefore more likely to be selected.
Why antidiagonals help:
Information coverage: With the proposed strided antidiagonal selection, every token contributes to at least one antidiagonal sum, so important regions are unlikely to be missed.
Pattern detection: Antidiagonals intersect the vertical and slash sparse patterns common within a block, so those patterns are detected without searching for them explicitly.
Arguably, the paper's premise is that every head follows the Vertical-Slash pattern.

Challenges
Overall the idea is simple, but the concrete computation (Algorithm 1) is quite hard to follow; it is worth simulating by hand, e.g., with B=4, S=2. The most important step is the stride-based downsampling. Assume: L=16, d=4, B=4, S=2 ...
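The scoring idea can be sketched with a toy version: score each B×B block by summing the elements on every S-th antidiagonal, so that a block containing a vertical (or slash) stripe scores higher than a diffuse one. This omits the strided reshaping and softmax normalization of the actual Algorithm 1:

```python
def antidiagonal_score(block, stride):
    """Toy block score: sum the elements lying on every stride-th
    antidiagonal (element (i, j) sits on antidiagonal i + j)."""
    B = len(block)
    return sum(block[i][j]
               for i in range(B) for j in range(B)
               if (i + j) % stride == 0)

B, S = 4, 2

# A block with a vertical stripe (every query row attends to column 1):
vertical = [[1.0 if j == 1 else 0.0 for j in range(B)] for _ in range(B)]

# A diffuse block with uniformly small weights:
diffuse = [[0.05 for _ in range(B)] for _ in range(B)]

print(antidiagonal_score(vertical, S))  # the stripe crosses antidiagonals
print(antidiagonal_score(diffuse, S))   # small, spread-out mass
```

Because the stripe intersects the sampled antidiagonals, the structured block outscores the diffuse one and would be kept by a top-score block selection.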

January 29, 2026 · Last updated on February 2, 2026 · 2 min · KKKZOZ

torch-python

Tensor Operations

clamp
torch.clamp (or the Tensor instance method .clamp) is a common PyTorch operation for numerical clipping: it limits every element of the input tensor to a specified range $[min, max]$.

Example:

```python
import torch

# Initialize a tensor with values ranging from -10 to 10
data = torch.tensor([-10.0, -5.0, 0.5, 5.0, 10.0])
print(f"Original: {data}")

# 1. Clamp between a min and max range [-1, 1]
#    Values < -1 become -1; values > 1 become 1
clamped_both = data.clamp(min=-1.0, max=1.0)
print(f"Range [-1, 1]: {clamped_both}")

# 2. Clamp with only a lower bound (min=-2)
#    Values < -2 become -2; no upper limit
clamped_min = data.clamp(min=-2.0)
print(f"Min -2 only: {clamped_min}")

# 3. Clamp with only an upper bound (max=3)
#    Values > 3 become 3; no lower limit
clamped_max = data.clamp(max=3.0)
print(f"Max 3 only: {clamped_max}")
```

Advanced Indexing
x[y] is a very powerful and flexible advanced indexing syntax in PyTorch (and NumPy) ...
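As a brief sketch of the advanced-indexing behavior mentioned above, integer-array and boolean-mask indexing work the same way in NumPy (shown here) and PyTorch:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

# Integer-array indexing: x[y] gathers elements at the positions in y;
# the result takes y's shape, and indices may repeat.
y = np.array([0, 2, 2, 4])
print(x[y])        # [10 30 30 50]

# Boolean-mask indexing: keep the elements where the mask is True.
mask = x > 25
print(x[mask])     # [30 40 50]

# On a 2-D array, an index array selects whole rows:
m = np.arange(12).reshape(3, 4)
rows = np.array([2, 0])
print(m[rows].shape)   # (2, 4): row 2 then row 0
```

The same snippets run unchanged with torch tensors (`torch.tensor`, `torch.arange`), since PyTorch follows NumPy's advanced-indexing semantics here.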

January 15, 2026 · Last updated on January 27, 2026 · 11 min · KKKZOZ