Hierarchical Speculative Decoding with Dynamic Windows for Efficient Language Model Inference

Background

- The bottleneck: LLM inference is slow because of its auto-regressive nature and memory-bandwidth constraints.
- Existing solution (speculative decoding): standard speculative decoding (SD) uses a small "draft model" to propose a fixed number of tokens $K$, which are then verified in parallel by the larger "target model".
- The limitation: SD relies on a fixed window size $K$. If $K$ is too large, the draft model wastes time generating tokens that will be rejected; if $K$ is too small, the potential speedup is capped. Previous methods for adjusting $K$ dynamically often require extra training or complex resource management.

Insights

- Use the draft model's entropy to dynamically decide the window size $K$ (see the first sketch below).
- Hierarchical speculative decoding with three models: M1, M2, MP. When M2's confidence score is high, the draft-verify process happens only between M1 and M2, without invoking MP (see the second sketch below).

Challenges

- Can we dynamically adjust the window size $K$ without requiring any additional training?
- Can we leverage models of different sizes to enhance speed?

Approaches

- Self-verify: verify the draft token by itself ...
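A minimal sketch of the entropy-gated drafting loop plus a standard verification pass, assuming Hugging Face-style causal LMs that return `.logits`. The greedy drafting/verification and the specific `entropy_threshold` / `max_k` values are illustrative placeholders, not the paper's exact rules; the point is only the insight above: stop drafting once the draft distribution's entropy gets high, so $K$ adapts per step.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def draft_with_dynamic_window(draft_model, ctx, max_k=8, entropy_threshold=2.0):
    """Draft tokens until the draft distribution's entropy exceeds a threshold.

    The threshold and max_k are illustrative hyperparameters; the stopping rule
    (high entropy => low confidence => stop drafting) realizes the dynamic K idea.
    """
    drafted, tokens = [], ctx
    for _ in range(max_k):
        logits = draft_model(tokens).logits[:, -1, :]                 # next-token logits
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)  # H(p)
        if entropy.item() > entropy_threshold:                        # draft model unsure: stop early
            break
        tok = probs.argmax(-1, keepdim=True)                          # greedy draft token
        drafted.append(tok)
        tokens = torch.cat([tokens, tok], dim=-1)
    return drafted, tokens

@torch.no_grad()
def verify_with_target(target_model, ctx, drafted):
    """Greedy verification: accept drafted tokens while they match the target's argmax."""
    tokens = torch.cat([ctx] + drafted, dim=-1)
    logits = target_model(tokens).logits                              # one forward pass over the whole draft
    n_ctx, accepted = ctx.shape[-1], 0
    for i, tok in enumerate(drafted):
        target_tok = logits[:, n_ctx - 1 + i, :].argmax(-1, keepdim=True)
        if torch.equal(target_tok, tok):
            accepted += 1
        else:
            break
    return accepted
```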

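And a sketch of the hierarchical gate between the three models, reusing the two helpers above. The "mean/min max-probability" confidence score, `conf_threshold`, and the fallback behavior are assumptions for illustration; the structure matches the note: M1 drafts, M2 verifies, and MP is only consulted when M2's confidence is low.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def hierarchical_step(m1, m2, mp, ctx, conf_threshold=0.9, max_k=8, entropy_threshold=2.0):
    """One hierarchical draft-verify round over models M1 (small), M2 (medium), MP (large)."""
    drafted, _ = draft_with_dynamic_window(m1, ctx, max_k, entropy_threshold)
    if not drafted:
        # M1 was unsure immediately: fall back to a single token from M2.
        tok = m2(ctx).logits[:, -1, :].argmax(-1, keepdim=True)
        return torch.cat([ctx, tok], dim=-1)

    tokens = torch.cat([ctx] + drafted, dim=-1)
    m2_logits = m2(tokens).logits
    n_ctx = ctx.shape[-1]

    # Verify M1's draft with M2, tracking M2's confidence (max prob) at each position.
    accepted, confidences = [], []
    for i, tok in enumerate(drafted):
        probs = F.softmax(m2_logits[:, n_ctx - 1 + i, :], dim=-1)
        m2_tok = probs.argmax(-1, keepdim=True)
        confidences.append(probs.max().item())
        if torch.equal(m2_tok, tok):
            accepted.append(tok)
        else:
            accepted.append(m2_tok)          # replace the first mismatch with M2's token
            break

    if min(confidences) >= conf_threshold:
        # M2 is confident everywhere: skip the large model MP for this round.
        return torch.cat([ctx] + accepted, dim=-1)

    # Otherwise escalate: MP verifies the M2-approved span in one forward pass.
    n_accept = verify_with_target(mp, ctx, accepted)
    if n_accept == 0:
        # MP rejects even the first token: take MP's own prediction instead.
        tok = mp(ctx).logits[:, -1, :].argmax(-1, keepdim=True)
        return torch.cat([ctx, tok], dim=-1)
    return torch.cat([ctx] + accepted[:n_accept], dim=-1)
```

In this shape, most rounds cost only M1 and M2 forward passes; the memory-bandwidth-bound MP pass is paid only when M2's confidence drops below the threshold.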
February 7, 2026 · Last updated on February 9, 2026 · 8 min · KKKZOZ