Cascade Speculative Drafting for Even Faster LLM Inference

Extensive Reading

Author Info

Background

While speculative decoding improves latency by using a smaller draft model to generate tokens for a larger target model, it suffers from two specific bottlenecks:

- Autoregressive Drafting: The draft model itself generates tokens autoregressively (one by one), which is still computationally expensive and slow.
- Inefficient Time Allocation: Standard methods allocate equal time to generating every draft token. However, tokens later in the sequence have a significantly lower probability of acceptance, so spending the same computational resources on these "high-rejection" tokens is inefficient.

Insights

- The autoregressive drafting process of the draft model is the bottleneck: use draft models to accelerate draft models (Vertical Cascade).
- Tokens later in the sequence have a lower probability of acceptance: use a faster, lighter draft model later in the sequence (Horizontal Cascade).

Challenges Approaches ...
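The Horizontal Cascade insight above can be illustrated with a small expected-value model. A minimal sketch: all numbers, the model names, and the acceptance-decay curve below are illustrative assumptions, not values from the paper.

```python
# Hypothetical acceptance probability: tokens drafted later in the
# speculation window are accepted less often by the target model.
def acceptance_prob(position: int) -> float:
    return 0.9 * (0.8 ** position)  # illustrative decay, not from the paper

# Mock draft models, from heavier/slower to lighter/faster.
# "cost" is relative compute per drafted token (made-up numbers).
DRAFTERS = [
    {"name": "draft-large", "cost": 4.0},
    {"name": "draft-small", "cost": 1.0},
]

def expected_stats(num_draft_tokens: int, switch_at: int):
    """Expected accepted tokens and total drafting cost when switching
    to the lighter drafter at position `switch_at` (Horizontal Cascade).
    Acceptance of token i requires all earlier tokens to be accepted."""
    survive = 1.0
    exp_accepted, cost = 0.0, 0.0
    for pos in range(num_draft_tokens):
        drafter = DRAFTERS[0] if pos < switch_at else DRAFTERS[1]
        cost += drafter["cost"]
        survive *= acceptance_prob(pos)
        exp_accepted += survive
    return exp_accepted, cost

if __name__ == "__main__":
    # All-heavy drafting vs. switching to the light drafter at position 2:
    # same expected accepted tokens, lower drafting cost.
    print(expected_stats(5, 5))
    print(expected_stats(5, 2))
```

Under these toy numbers, the cascade leaves the expected number of accepted tokens unchanged (acceptance depends only on the positions' probabilities here) while cutting drafting cost, which is the intuition behind assigning cheap drafters to high-rejection positions.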

February 10, 2026 · Last updated on February 10, 2026 · 4 min · KKKZOZ

SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Extensive Reading

Author Info: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng

Background

Existing works only consider a token sequence generated by a single SSM for speculation, which cannot align well with an LLM due to the model capacity gap between them. The probability of a successful alignment between the LLM and the speculated token sequence decays exponentially with the expected alignment length.

Challenges

- How to generate a token tree in an extremely large search space?
- How to verify the whole token tree in a single verification pass?

Insights

Simultaneously consider a diversity of speculation candidates (instead of just one as in existing approaches) to maximize speculative performance. ...
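The token-tree idea can be sketched as follows. This is a minimal illustration, not SpecInfer's actual API: the `Node`/`build_token_tree`/`verify` names and the prefix-lookup stand-in for the target model are hypothetical, and the real system verifies the whole tree in one parallel pass (with a tree attention mask) rather than the sequential walk shown here.

```python
class Node:
    """One node in a token tree; children are keyed by token."""
    def __init__(self, token):
        self.token = token
        self.children = {}

def build_token_tree(candidates):
    """Merge candidate token sequences from multiple SSMs into one tree,
    so shared prefixes are speculated (and verified) only once."""
    root = Node(None)
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, Node(tok))
    return root

def verify(root, target_next_token, prefix=()):
    """Accept the deepest tree path on which every step matches the
    target model. `target_next_token(prefix)` stands in for the target
    model's next-token choice given the context."""
    tok = target_next_token(prefix)
    child = root.children.get(tok)
    if child is None:
        return []  # first mismatch: stop accepting along this path
    return [tok] + verify(child, target_next_token, prefix + (tok,))

if __name__ == "__main__":
    # Candidates from three (mock) SSMs, merged into one token tree.
    candidates = [["the", "cat", "sat"], ["the", "dog"], ["a", "cat"]]
    tree = build_token_tree(candidates)
    # Deterministic mock target model as a prefix -> next-token table.
    TABLE = {(): "the", ("the",): "cat", ("the", "cat"): "sat"}
    print(verify(tree, TABLE.get))
```

Because the tree covers several candidates at once, the target model accepts the full path "the cat sat" here, whereas a single-sequence drafter that had proposed "the dog" would have stopped after one token.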

July 25, 2025 · Last updated on September 1, 2025 · 2 min · KKKZOZ

EdgeLLM Fast On-Device LLM Inference With Speculative Decoding

Extensive Reading

Note: in arXiv and other papers, this work is frequently cited under another name: LLMCad.

Author Info: Daliang Xu (徐大亮) - Daliang Xu's Website, Wangsong Yin - Google Scholar, Xin Jin, Mengwei Xu, Professor Xuanzhe Liu @ Peking University

Background

The Scaling Law vs. The Memory Wall: The machine learning community has shown that increasing an LLM's parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this "scaling law" is challenged on mobile devices by a "memory wall": when an LLM is too large to fit into a device's memory, inference latency increases dramatically, by as much as 59-224x. ...

July 23, 2025 · Last updated on August 25, 2025 · 3 min · KKKZOZ