Extensive Reading

Author Info

Background

  • The standard method for large language model (LLM) inference, autoregressive decoding, is slow and costly because it generates tokens sequentially, one at a time.
  • Existing acceleration methods like speculative sampling often struggle to find a suitable draft model; using a smaller version of the LLM can have high overhead, while training a new, appropriately-sized draft model is prohibitively expensive.
  • Other approaches like Lookahead and Medusa successfully reduce drafting latency but are ultimately limited by the low accuracy of their drafts, which restricts their maximum achievable speedup.
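
As a reminder of why this is slow, here is a minimal sketch of vanilla autoregressive decoding: every new token costs one full sequential forward pass of the target model. The use of gpt2 and greedy argmax here is purely illustrative.

# Minimal autoregressive decoding loop; each iteration is one sequential forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # any causal LM works; gpt2 is a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
for _ in range(16):                                        # 16 tokens -> 16 sequential passes
    logits = model(ids).logits[:, -1, :]                   # next-token logits only
    next_id = logits.argmax(dim=-1, keepdim=True)          # greedy pick (sampling also works)
    ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0]))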

Insights

Two key insights:

  • Autoregression at the feature level is simpler than at the token level
  • The uncertainty inherent in the sampling process significantly constrains the performance of predicting the next feature

Approaches

Preliminaries

Vanilla autoregression at the token level is described by

$$T_{1:j} \rightarrow E_{1:j} \rightarrow f_j \rightarrow p_{j+1} \rightarrow t_{j+1}$$

for any integer $j \ge 1$.

  • $T$: Tokens
  • $E$: Embeddings
  • $f$: Features (the output of the last transformer layer; the paper calls this “the second-to-top-layer feature of a LLM, the hidden state before the LM head”)
  • $p$: Distributions

Walking through the whole process:

  • The LLM input is a text sequence, which the tokenizer turns into $T_{1:j}$
  • The embedding layer maps $T_{1:j}$ to $E_{1:j}$ (the layer-0 features $F_{1:j}^0$)
  • The transformer layers then progressively enrich these representations, yielding $F_{1:j}$, which encodes the semantics, syntax, and contextual understanding of everything seen so far
  • Take the feature of the last token, $f_j$
  • The LM head maps $f_j$ to the next-token distribution $p_{j+1}$ (i.e., $f_j$ is used to predict $p_{j+1}$)
  • Sample $t_{j+1}$ from $p_{j+1}$
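
A minimal PyTorch/transformers sketch of this pipeline with the intermediate quantities made explicit; gpt2 is just a stand-in target model, and reading hidden_states[-1] and calling model.lm_head directly is one assumed way of surfacing $f_j$ and $p_{j+1}$.

# T -> E -> f -> p -> t, with the intermediates exposed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

T = tokenizer("I always", return_tensors="pt").input_ids        # T_{1:j}
E = model.get_input_embeddings()(T)                             # E_{1:j}
out = model(inputs_embeds=E, output_hidden_states=True)
f_j = out.hidden_states[-1][:, -1, :]                           # f_j: last token's feature before the LM head
p_next = torch.softmax(model.lm_head(f_j), dim=-1)              # p_{j+1}
t_next = torch.multinomial(p_next, num_samples=1)               # t_{j+1}
print(tokenizer.decode(t_next[0]))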

Feature-level prediction

  • Autoregressive prediction on the model's second-to-top-layer features is simpler and more regular than predicting the next token directly
    • Predicting tokens directly ❌
    • Predicting features directly ✅

pasted-image-20251110215215

This is very reasonable:

  • $f_j$ carries a large amount of “knowledge” (semantics, syntax, context), whereas the downstream distribution $p_{j+1}$, and the token sampled from it, has already lost this information
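
Concretely, feature-level autoregression means training a small head to regress the next feature. The sketch below shows one plausible training step; draft_head is the (yet to be defined) autoregression head, lm_head is the frozen target LM head, and the Smooth-L1 regression plus 0.1-weighted cross-entropy combination is an assumption, not something stated in this note.

import torch.nn.functional as F

def draft_training_step(draft_head, lm_head, feats, tokens):
    # feats:  [B, L, H] true features f_{1:L} recorded from the frozen target model
    # tokens: [B, L]    the corresponding ground-truth token ids t_{1:L}
    pred_feats = draft_head(feats[:, :-1, :], tokens[:, 1:])     # \hat f_{2:L} from f_{1:L-1} and shifted tokens t_{2:L}
    reg_loss = F.smooth_l1_loss(pred_feats, feats[:, 1:, :])     # regression: match the true next features
    logits = lm_head(pred_feats[:, :-1, :])                      # predicted features should also yield the right next token
    cls_loss = F.cross_entropy(logits.transpose(1, 2), tokens[:, 2:])
    return reg_loss + 0.1 * cls_loss                             # loss weighting is an assumption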

Feature uncertainty

If we predict using only $f_j$, a problem arises:

pasted-image-20251110224705

For example, in the figure above, $f_{i}^{(I)}$ actually admits two possible sampling outcomes, “am” and “always”; choosing a different token at sampling time leads to a different next feature, $f_{i+1}^{(\text{am})}$ or $f_{i+1}^{(\text{always})}$.

When $f_{i+1}$ is then used to predict $f_{i+2}$, we want $f_{i+1}$ to encode a single, determined $t_{i+2}$, so the draft must know which branch was actually sampled.

Idea: feature & shifted-token -> next feature

pasted-image-20251110225421

Pseudocode:

# Given: the true feature f_i from the previous step (a hidden state of the target model)
# Steps:
# 1) Use the LM head on f_i to sample t_{i+1}
logits_i = lm_head(f_i)                     # reuse the target model's LM head
t_next = sample_from_logits(logits_i)       # sampling (greedy, top-k, or temperature sampling)

# 2) Append t_{i+1} to the right-shifted token sequence, then feed the AR head to predict \hat f_{i+1}
#    feats_hist holds the feature sequence up to step i; tokens_shifted holds t_2..t_{i+1}
tokens_shifted = concat(tokens_shifted, t_next)      # pseudocode: append t_{i+1} to t_2..t_i
f_next_hat = ar_head(feats_hist, tokens_shifted, return_all=False)  # [B, H]

# 3) Run \hat f_{i+1} through the LM head to get the next-step distribution, then sample t_{i+2}
logits_next = lm_head(f_next_hat)
t_next2 = sample_from_logits(logits_next)

# (Extendable to a tree: take several branches for t_{i+1} and predict each \hat f_{i+1}^{(branch)} in parallel)
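
The note does not define ar_head, so here is a rough, self-contained sketch of what it could look like, assuming the design implied by the figure: embed the shifted token, concatenate it with the feature, fuse them with a linear layer, and run one causal self-attention layer. ARHead, fuse, and the use of nn.TransformerEncoderLayer as a stand-in for a proper decoder layer are all illustrative assumptions.

import torch
import torch.nn as nn

class ARHead(nn.Module):
    # Illustrative draft head: fuse (shifted token, feature) pairs, then one causal attention layer.
    def __init__(self, vocab_size, hidden_size, nhead=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)       # could be shared with the target model
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)      # concat(token embedding, feature) -> hidden
        self.layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=nhead, batch_first=True)

    def forward(self, feats_hist, tokens_shifted, return_all=False):
        # feats_hist:     [B, L, H] features f_{1:i}
        # tokens_shifted: [B, L]    shifted tokens t_2..t_{i+1}
        x = self.fuse(torch.cat([self.embed(tokens_shifted), feats_hist], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))   # causal mask
        out = self.layer(x, src_mask=mask)                       # [B, L, H]
        return out if return_all else out[:, -1, :]              # [B, H]: \hat f_{i+1}

With this sketch, return_all=False in the pseudocode simply means taking the last position of the predicted feature sequence.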

Overall Comparison

pasted-image-20251110225646

By the way, Medusa overview:

pasted-image-20251110225756
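
For contrast, a rough sketch of a single Medusa-style head: it guesses a token several steps ahead directly from the current feature, with no feature autoregression and no shifted token. The residual SiLU block plus per-head vocabulary projection shown here is one reading of the Medusa design and may differ in detail.

import torch.nn as nn

class MedusaHead(nn.Module):
    # One Medusa-style head; head k guesses the token k+1 steps ahead from f_j.
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.vocab = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, f_j):                                      # f_j: [B, H], the same feature the LM head sees
        return self.vocab(f_j + nn.functional.silu(self.proj(f_j)))   # [B, vocab_size]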

Evaluation

Thoughts

When Reading