SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Extensive Reading

Author Info

Background

Existing speculative decoding (SD) methods accelerate inference by using a small “draft” model to propose tokens and a large “target” model to verify them in parallel. However, these methods usually require training auxiliary draft models or adding extra parameters, which limits their flexibility: they are not plug-and-play.

Insights

LLMs show strong potential for self-acceleration through layer sparsity, and this sparsity is task-specific. Building on these two observations, the paper proposes a method that dynamically determines which layers to skip during inference based on the input: ...
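To make the draft-then-verify mechanism concrete, here is a minimal, self-contained Python sketch in which the “draft model” is just the full model with some layers skipped, so no auxiliary model is needed. The layer stack is a toy stand-in for a transformer, and the names `speculative_step`, `gamma`, and the particular skip set are illustrative assumptions, not the paper's implementation.

```python
VOCAB = 50

def layer(h: int, idx: int) -> int:
    """Toy stand-in for one transformer layer."""
    return (h * 31 + idx * 7 + 3) % 1009

def forward(token: int, skip: frozenset = frozenset()) -> int:
    """Run the 12-layer toy stack, optionally skipping layers,
    and return the greedily decoded next token (toy argmax)."""
    h = token
    for i in range(12):
        if i in skip:
            continue          # skipped layer acts as identity
        h = layer(h, i)
    return h % VOCAB

def speculative_step(prefix: list, skip: frozenset, gamma: int = 4) -> list:
    """Draft `gamma` tokens with the cheap skipped-layer pass, then
    verify each with the full model; keep the longest agreeing prefix
    and substitute the full model's token at the first mismatch."""
    drafts, t = [], prefix[-1]
    for _ in range(gamma):
        t = forward(t, skip)   # cheap draft pass
        drafts.append(t)
    accepted, t = [], prefix[-1]
    for d in drafts:
        target = forward(t)    # full-model verification pass
        if target != d:
            accepted.append(target)  # correct and stop
            break
        accepted.append(d)
        t = d
    return accepted

if __name__ == "__main__":
    seq = [1]
    skip = frozenset({3, 7, 11})  # hypothetical input-chosen skip set
    for _ in range(5):
        seq += speculative_step(seq, skip)
    print(seq)
```

In a real LLM many layers are nearly redundant for a given input, so skipped-layer drafts frequently match the full model and several tokens are accepted per verification pass; in this toy the layers are not redundant, so most drafts are rejected, which still exercises the verify-and-correct path.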

February 8, 2026 · Last updated on February 9, 2026 · 2 min · KKKZOZ