SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Extensive Reading
Author Info: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng
Background: Existing works consider only a token sequence generated by a single SSM for speculation, which cannot align well with the LLM due to the model capacity gap between them. The probability of a successful alignment between the LLM and the speculated token sequence decays exponentially with the expected alignment length (if each speculated token matches with probability p, a length-k sequence is accepted in full with probability p^k).
Challenges: How to generate a token tree in an extremely large search space? How to verify the whole token tree in a single verification pass?
Insights: Simultaneously consider a diversity of speculation candidates (instead of just one, as in existing approaches) to maximize speculative performance. ...
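A minimal Python sketch of the tree-verification idea, assuming greedy decoding. `llm_argmax` is a hypothetical stand-in for the LLM's next-token prediction, and the sequential calls below correspond to what SpecInfer fuses into a single verification pass using a tree attention mask; this is a sketch of the concept, not the paper's implementation.

```python
# Sketch: verify a speculated token tree against the LLM, keeping the
# longest root-to-leaf path the LLM agrees with under greedy decoding.
# `llm_argmax` is a hypothetical next-token function; in SpecInfer all
# tree nodes are scored in ONE batched pass via a tree attention mask.

from dataclasses import dataclass, field

@dataclass
class Node:
    token: int                              # speculated token at this node
    children: list = field(default_factory=list)

def verify_tree(prefix, root_children, llm_argmax):
    accepted = []
    candidates = root_children
    context = list(prefix)
    while True:
        target = llm_argmax(context)        # LLM's own choice for this context
        match = next((n for n in candidates if n.token == target), None)
        if match is None:
            accepted.append(target)         # no branch matches: take the
            return accepted                 # LLM's token and stop
        accepted.append(match.token)
        context.append(match.token)
        candidates = match.children
        if not candidates:                  # ran out of tree: the same pass
            accepted.append(llm_argmax(context))  # yields one bonus token
            return accepted

# Toy usage: the "LLM" deterministically continues 1, 2, 3, ...
if __name__ == "__main__":
    mock_llm = lambda ctx: ctx[-1] + 1
    tree = [Node(2, [Node(3), Node(9)]), Node(7)]  # two speculative branches
    print(verify_tree([1], tree, mock_llm))        # -> [2, 3, 4]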
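```

Because all sibling branches are checked against the same LLM prediction, widening the tree raises the chance that at least one speculated branch survives each step, which is exactly the diversity insight above.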

July 25, 2025 · Last updated on August 1, 2025 · 2 min · KKKZOZ

EdgeLLM: Fast On-Device LLM Inference with Speculative Decoding

Extensive Reading
Author Info: Daliang Xu (徐大亮) (Daliang Xu’s Website), Wangsong Yin (Google Scholar), Xin Jin, Mengwei Xu, Professor Xuanzhe Liu @ Peking University
Background: The Scaling Law vs. The Memory Wall: the machine learning community has shown that increasing an LLM’s parameter size consistently improves its accuracy and can lead to new, emergent abilities. However, this “scaling law” is challenged on mobile devices by a “memory wall”: when an LLM is too large to fit into a device’s memory, inference latency increases dramatically, by as much as 59-224x. ...
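For contrast with SpecInfer’s tree, a minimal sketch of the vanilla sequence-based speculative decoding that EdgeLLM builds on, again assuming greedy decoding. `draft_next` and `target_next` are hypothetical next-token functions for the small on-device draft model and the large target LLM; in a real system the per-token verification calls below are one batched forward pass.

```python
# Sketch: draft k tokens cheaply with a small model, then keep the
# prefix the large target model agrees with, plus one corrected or
# bonus token from the verification pass. Hypothetical helpers:
# `draft_next(ctx)` and `target_next(ctx)` return the next token id.

def speculative_step(context, draft_next, target_next, k=4):
    drafted, ctx = [], list(context)
    for _ in range(k):                       # cheap autoregressive drafting
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in drafted:                      # verify draft against the target
        target_tok = target_next(ctx)
        if target_tok != tok:
            accepted.append(target_tok)      # first mismatch: take the
            return accepted                  # target's token and stop
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))        # all matched: one bonus token
    return accepted

# Toy usage: draft drifts off every third step; target counts 1, 2, 3, ...
if __name__ == "__main__":
    target = lambda c: c[-1] + 1
    draft = lambda c: c[-1] + (2 if len(c) % 3 == 0 else 1)
    print(speculative_step([0], draft, target))  # -> [1, 2, 3]
```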

July 23, 2025 · Last updated on August 1, 2025 · 3 min · KKKZOZ