CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Extensive Reading

Author Info

Background

Existing “Self-Speculative Decoding” (SSD) methods are training-free and therefore easy to use, but they are often slower than methods that rely on specially trained draft models. “Cascade Speculative Decoding”, which uses a hierarchy of draft models, offers high speed but is impractical because it requires training and maintaining multiple draft models.

Insights

The paper proposes Cascade Adaptive Self-Speculative Decoding (CAS-Spec). This framework constructs a “virtual” hierarchy of draft models directly from the target model itself, with no extra training. It effectively combines ...