LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Extensive Reading

Author Info

Background

Early Exit (Dynamic Halting)
These techniques attempt to stop the forward pass at an intermediate layer if the model is sufficiently confident in its prediction (see the sketch at the end of this section).
Problems: In standard LLMs, early layers are "lazy" (not trained to produce final tokens), so exiting early causes severe accuracy drops; furthermore, these methods typically require adding and training auxiliary "exit heads," which increases parameter overhead.

Layer Pruning and Dropout
Existing research has explored skipping layers (dropout) during training to make sub-networks robust, or pruning layers post-training for speed.
Problems: Standard uniform layer dropout does not specifically incentivize early layers to be accurate, and post-training pruning often degrades performance in ways that require complex fine-tuning to recover.

Insights
Accelerate Large Language Model (LLM) inference by enabling the model to generate tokens using fewer layers when possible, while maintaining accuracy. ...
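To make the early-exit idea concrete, here is a minimal PyTorch sketch, not the paper's implementation: it runs decoder layers one at a time, reuses the model's single shared LM head after each layer (so no auxiliary exit heads are added), and stops as soon as the softmax confidence for the next token crosses a threshold. The function name, the toy stand-in model, and the fixed threshold are illustrative assumptions; the full LayerSkip method additionally verifies early-exit drafts with the remaining layers via self-speculative decoding, which this sketch omits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_next_token(hidden, layers, final_norm, lm_head, threshold=0.9):
    """Run decoder layers one by one; exit once the shared LM head is
    confident enough about the next token (illustrative sketch).

    hidden:     [batch=1, seq, d_model] hidden states of the current context
    layers:     list of decoder blocks (causal masks / KV cache omitted here)
    final_norm: the model's final normalization layer
    lm_head:    shared output projection to vocabulary logits
    threshold:  exit when the max softmax probability exceeds this value
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        # Reuse the model's own head at every layer (no extra exit heads).
        logits = lm_head(final_norm(hidden[:, -1, :]))
        probs = F.softmax(logits, dim=-1)
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:
            return token, i + 1           # exited after i + 1 layers
    return token, len(layers)             # fell through: used all layers

# Toy usage with stand-in components (illustrative only):
d_model, vocab = 64, 100
layers = [torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
          for _ in range(8)]
final_norm = torch.nn.LayerNorm(d_model)
lm_head = torch.nn.Linear(d_model, vocab)
hidden = torch.randn(1, 5, d_model)
token, n_layers = early_exit_next_token(hidden, layers, final_norm, lm_head)
print(f"predicted token {token.item()} after {n_layers} layers")
```

Reusing the final norm and LM head at every layer is what avoids the parameter overhead of trained auxiliary exit heads noted above; the untrained toy model here will of course exit arbitrarily, since the point is only to show the control flow.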