Cascade Speculative Drafting for Even Faster LLM Inference
Extensive Reading

Author Info

Background

While speculative decoding improves latency by using a smaller draft model to generate tokens for a larger target model, it suffers from two specific bottlenecks:

- Autoregressive drafting: the draft model itself generates tokens autoregressively (one by one), which is still computationally expensive and slow.
- Inefficient time allocation: standard methods spend equal compute on every draft token, but tokens later in the sequence have a significantly lower probability of acceptance, so spending the same resources on these "high-rejection" tokens is inefficient.

Insights

- The autoregressive drafting process is the bottleneck: use draft models to accelerate the draft model itself (Vertical Cascade); see the sketch at the end of this section.
- Tokens later in the sequence have a lower probability of acceptance: use faster, lighter draft models for later positions in the sequence (Horizontal Cascade).

Challenges

Approaches ...
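
To make the two cascades concrete, here is a minimal Python sketch. It is not the paper's exact algorithm: the `CascadeModel` interface, the method names `sample_next` and `accept_prefix`, and the fixed per-model position budget are all illustrative assumptions.

```python
from typing import List

Token = int  # illustrative token type


class CascadeModel:
    """Hypothetical interface assumed for every model in the cascade;
    these method names are made up for the sketch."""

    def sample_next(self, tokens: List[Token]) -> Token:
        """One autoregressive decoding step."""
        raise NotImplementedError

    def accept_prefix(self, context: List[Token], draft: List[Token]) -> List[Token]:
        """Score all draft tokens in one parallel pass and return the longest
        accepted prefix (standard speculative verification)."""
        raise NotImplementedError


def horizontal_cascade_draft(draft_models: List[CascadeModel],
                             context: List[Token],
                             positions: List[int]) -> List[Token]:
    """Horizontal cascade: later draft positions have a lower chance of being
    accepted, so they are delegated to progressively smaller, cheaper models."""
    draft: List[Token] = []
    for model, n in zip(draft_models, positions):
        for _ in range(n):
            draft.append(model.sample_next(context + draft))
    return draft


def vertical_cascade_step(models: List[CascadeModel],
                          context: List[Token],
                          k: int) -> List[Token]:
    """Vertical cascade: only the smallest model drafts autoregressively;
    every larger model (including the target at models[0]) only verifies the
    draft from the level below in a single parallel pass."""
    if len(models) == 1:
        # Base case: the cheapest drafter generates k tokens one by one.
        out: List[Token] = []
        for _ in range(k):
            out.append(models[0].sample_next(context + out))
        return out
    # Recursive case: smaller models draft, this level verifies and extends.
    draft = vertical_cascade_step(models[1:], context, k)
    accepted = models[0].accept_prefix(context, draft)
    accepted.append(models[0].sample_next(context + accepted))
    return accepted
```

In the paper the two cascades are used together (later positions go to lighter drafters, and drafting inside each drafter is itself sped up by an even smaller one); the sketch keeps them as separate functions only for readability.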