Draft & Verify Lossless Large Language Model Acceleration via Self-Speculative Decoding

Extensive Reading Author Info Prerequisite Bayesian optimization is a strategy for global optimization, designed for extremum problems over black-box functions. It is especially suited to complex functions that are computationally expensive to evaluate, non-differentiable, or lack a closed-form expression. The core idea: rather than searching blindly, build a probabilistic model from the observations gathered so far and use it to decide intelligently where to try next, so that the global optimum is found with as few evaluations as possible. Bayesian optimization has two key components. Surrogate Model: a probabilistic approximation of the objective function, most commonly a Gaussian Process (GP). Unlike an ordinary regression model, the surrogate predicts not only the function value at an input (the mean) but also an uncertainty range (the variance). Role: it tells us what the objective looks like given the points observed so far, and where we are confident versus where we know nothing. Acquisition Function: a function that uses the surrogate to guide the next decision; common choices are Expected Improvement (EI) and Upper Confidence Bound (UCB). It resolves the trade-off between exploration and exploitation. Exploitation: go where the surrogate's predicted value is best, seeking the current local optimum. Exploration: go where the surrogate's uncertainty (variance) is highest, probing for unknown, potentially better optima. Role: it scores the "potential value" of every point in the search space; the highest-scoring point becomes the parameters for the next experiment. Optimization flow (an iterative closed loop): Observe: fit the surrogate model (Gaussian process) to the current data points. Decide: maximize the acquisition function to find the most promising candidate $x$. Evaluate: run the parameter $x$ on the real system (the objective function) to obtain the true result $y$. Update: add the new pair $(x, y)$ to the history and update the surrogate's posterior distribution. Repeat until a preset iteration budget or a convergence criterion is met. Any problem where the input dimensionality is modest (typically <20) and a single evaluation is slow or expensive is a natural fit for Bayesian optimization ...
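The observe/decide/evaluate/update loop above fits in a few dozen lines. The sketch below is a minimal, self-contained illustration with a toy 1-D objective, an RBF Gaussian-process surrogate, and a UCB acquisition; the objective, kernel length scale, UCB coefficient, and iteration budget are all made-up stand-ins, not values from the paper.

```python
import numpy as np

def rbf_kernel(a, b, length=0.3):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Surrogate model: GP posterior mean and variance at the query points.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.maximum(var, 1e-12)

def objective(x):
    # Stand-in for the expensive black-box function (maximum at x = 0.7).
    return -(x - 0.7) ** 2

grid = np.linspace(0.0, 1.0, 201)                      # search space
xs, ys = [0.1, 0.9], [objective(0.1), objective(0.9)]  # initial observations

for _ in range(10):
    # Observe: fit the surrogate.  Decide: maximize the UCB acquisition.
    mean, var = gp_posterior(np.array(xs), np.array(ys), grid)
    ucb = mean + 2.0 * np.sqrt(var)
    ucb[np.isin(grid, xs)] = -np.inf   # don't re-evaluate known points
    x_next = float(grid[int(np.argmax(ucb))])
    # Evaluate on the true objective, then update the history.
    xs.append(x_next)
    ys.append(objective(x_next))

best = xs[int(np.argmax(ys))]
print(best)  # close to the true optimum 0.7
```

The UCB term `mean + 2.0 * sqrt(var)` makes the exploration/exploitation trade-off explicit: the mean rewards exploitation, the standard deviation rewards exploration.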

February 8, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

Swift On-the-fly Self-speculative Decoding For LLM Inference Acceleration

Extensive Reading Author Info Background Existing Speculative Decoding (SD) methods accelerate inference by using a small “draft” model to guess tokens and a large “target” model to verify them. However, these methods usually require training auxiliary models or adding extra parameters, which limits their flexibility (they are not “plug-and-play”). Insights LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. This paper proposes a method that dynamically determines which layers to skip during inference based on the input, according to these two observations: ...
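The draft-from-the-target-itself idea can be sketched in a few lines: the draft pass reuses the target model's own layers but skips a chosen subset. The "layers" below are trivial arithmetic stand-ins and the skip set is hypothetical; this only illustrates the control flow, not SWIFT's actual layer-selection mechanism.

```python
# Toy sketch of self-speculative drafting via layer skipping.

def full_forward(layers, x):
    # Target pass: run every layer.
    for f in layers:
        x = f(x)
    return x

def draft_forward(layers, x, skip):
    # Draft pass: same weights, but sparse layers are skipped for speed.
    for i, f in enumerate(layers):
        if i not in skip:
            x = f(x)
    return x

layers = [lambda x, k=k: x + k for k in range(6)]  # stand-in "layers"
skip = {1, 3}                   # layers judged skippable for this input
print(full_forward(layers, 0))          # 15
print(draft_forward(layers, 0, skip))   # 11
```

Because draft and target share every parameter, the only extra state is the skip set, which is what makes the approach plug-and-play.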

February 8, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

CAS-Spec Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Extensive Reading Author Info Background Existing “Self-Speculative Decoding” (SSD) methods are easy to use (training-free) but often slower than methods that rely on training specialized draft models. “Cascade Speculative Decoding” (using a hierarchy of draft models) offers high speed but is impractical because it requires training and maintaining multiple draft models. Insights The paper proposes Cascade Adaptive Self-Speculative Decoding (CAS-Spec). This framework constructs a “virtual” hierarchy of draft models directly from the target model itself, without needing extra training. It effectively combines ...

February 7, 2026 · Last updated on February 9, 2026 · 4 min · KKKZOZ

Hierarchical Speculative Decoding with Dynamic Windows for Efficient Language Model Inference

AI-Aided Author Info Background The Bottleneck: LLM inference is slow due to its auto-regressive nature and memory bandwidth constraints. Existing Solution (Speculative Decoding): Standard Speculative Decoding (SD) uses a small “draft model” to predict a fixed number of tokens ($K$), which are then verified by the larger “target model”. The Limitation: SD relies on a fixed window size ($K$). If $K$ is too large, the draft model generates bad tokens that waste time; if $K$ is too small, it caps the potential speedup. Previous methods to adjust $K$ dynamically often required extra training or complex resource management. Insights Use entropy to dynamically decide the window size $K$. Hierarchical speculative decoding with three models: M1, M2, MP. When the confidence score of M2 is high, the draft-verify process happens only between M1 and M2, without involving MP. Challenges Can we dynamically adjust the window size $K$ without requiring any additional training? Can we leverage models of different sizes to enhance speed? Approaches Self-verify: verify the draft token by itself ...
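The entropy-to-window idea can be sketched as a simple mapping: a low-entropy (confident) draft distribution earns a larger speculation window $K$. The `k_min`/`k_max`/`h_max` knobs and the linear mapping below are illustrative assumptions, not the paper's actual rule.

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def dynamic_window(probs, k_min=2, k_max=16, h_max=1.5):
    # Map draft-model entropy onto a window size: confident
    # (low-entropy) distributions get a larger speculation window.
    h = min(entropy(probs), h_max)
    frac = 1.0 - h / h_max
    return k_min + round(frac * (k_max - k_min))

print(dynamic_window([0.97, 0.01, 0.01, 0.01]))  # confident  -> large K
print(dynamic_window([0.25, 0.25, 0.25, 0.25]))  # uncertain -> small K
```

Since entropy comes for free from the logits already computed at each draft step, the window size can adapt per step without any training.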

February 7, 2026 · Last updated on February 9, 2026 · 8 min · KKKZOZ

AIConfigurator Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

AI-Aided Author Info Background The primary challenges identified in optimizing LLM inference in production are the combinatorial explosion of configuration parameters (parallelism strategies, batch sizes, quantization) and the diversity of inference frameworks (TensorRT-LLM, vLLM, SGLang), which makes manual tuning or exhaustive GPU benchmarking prohibitively expensive and slow. Insights Instead of modeling an entire neural network as a black box, the system breaks LLM inference down into fundamental, reusable operations called primitives, profiles these primitives, and then combines the statistics to model the end-to-end LLM inference process. ...
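The profile-then-compose idea can be sketched as follows: per-primitive costs are measured once, then combined analytically instead of benchmarking every configuration on a GPU. The primitive names and microsecond numbers below are invented for illustration and do not come from AIConfigurator.

```python
# Hypothetical profiled cost per primitive invocation, in microseconds,
# for one fixed batch/sequence shape.
PROFILED_US = {
    "attention": 120.0,   # one attention block
    "mlp": 80.0,          # one feed-forward block
    "allreduce": 40.0,    # one tensor-parallel all-reduce
}

def decode_step_latency_us(num_layers, tp_degree):
    # Compose per-layer primitive costs into a full decode-step estimate.
    per_layer = PROFILED_US["attention"] + PROFILED_US["mlp"]
    if tp_degree > 1:
        per_layer += PROFILED_US["allreduce"]   # communication only with TP
    return num_layers * per_layer

print(decode_step_latency_us(32, tp_degree=2))  # 32 * (120+80+40) = 7680.0
```

Because the composition is pure arithmetic, sweeping thousands of candidate configurations costs microseconds rather than GPU-hours.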

February 5, 2026 · Last updated on February 9, 2026 · 3 min · KKKZOZ

Revati Transparent GPU-Free Time-Warp Emulation for LLM Serving

AI-Aided Author Info Background The paper identifies a critical bottleneck in deploying Large Language Models (LLMs): The Optimization Challenge: Efficient deployment requires tuning a vast configuration space (e.g., parallelism strategies, batch sizes, caching policies). Cost vs. Fidelity Trade-off: Real GPU Execution: Testing on physical hardware is prohibitively expensive and slow. Discrete-Event Simulators (DES): While fast and cheap, traditional simulators require manually re-implementing the serving system’s complex control logic. Because frameworks (like vLLM and SGLang) evolve rapidly, simulators suffer from a perpetual “semantic gap” and high maintenance burden. Insights ...

February 5, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ

Estimating LLM Uncertainty with Evidence

Extensive Reading Author Info Background Large Language Models (LLMs) hallucinate: they generate unreliable responses when they lack the relevant knowledge. Existing methods for estimating uncertainty to detect hallucinations are flawed: Failure of Probability-Based Methods: Traditional methods rely on softmax probabilities. The normalization step (softmax) causes a loss of “evidence strength” information. A high probability does not always mean the model is knowledgeable; it might simply mean one token is slightly better than the others in a low-knowledge scenario. Conversely, a low probability might not mean ignorance; it could mean the model knows multiple valid answers (e.g., synonyms). Limitations of Sampling-Based Methods: Methods like Semantic Entropy require multiple sampling iterations, which is computationally expensive and fails to capture the model’s inherent epistemic uncertainty (e.g., consistently producing the same incorrect answer due to lack of training data). Insights The reason probability-based methods fail to identify reliability is that probability is normalized. ...
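The normalization argument is easy to demonstrate: softmax is shift-invariant, so logit vectors with very different absolute magnitudes (the "evidence strength") can yield identical probabilities. A minimal illustration with invented logit values:

```python
import math

def softmax(logits):
    # Standard softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical next-token logit vectors: identical pairwise gaps,
# very different absolute magnitudes.
weak   = [1.0, 0.0, 0.0]
strong = [101.0, 100.0, 100.0]

p_weak, p_strong = softmax(weak), softmax(strong)
print(p_weak)
print(p_strong)  # identical to p_weak: normalization discards magnitude
```

Whatever "evidence" the raw logit magnitudes carried is gone after normalization, which is exactly why the paper looks at pre-softmax quantities instead.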

February 2, 2026 · Last updated on February 2, 2026 · 4 min · KKKZOZ

R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning

Extensive Reading Author Info Background Existing acceleration methods like Speculative Decoding have limitations: Rigid Consistency: They require the Small Language Model (SLM) to match the LLM’s tokens exactly. If the SLM phrases a correct reasoning step differently, speculative decoding rejects it, wasting computation. Low Agreement: In complex reasoning tasks, token-level agreement between SLMs and LLMs is often low, leading to frequent rollbacks and minimal speed gains. ...

February 2, 2026 · Last updated on February 2, 2026 · 3 min · KKKZOZ