# Efficient Streaming Language Models with Attention Sinks
Extensive Reading

**Author Info:** MIT HAN Lab

## Background

When applying LLMs to infinite input streams, two main challenges arise:

1. The KV cache grows without bound, which leads to excessive memory usage and decoding latency.
2. The LLM's performance degrades once the sequence length goes beyond the attention window size set during pre-training.

**Window Attention:** keep only the $L$ most recent tokens in the KV cache. The model degrades dramatically once the sequence length exceeds the cache size (even if only the first token is evicted). A minimal sketch of this cache policy appears at the end of this section.

**Sliding Window with Re-computation:** do not reuse the KV cache. At every step, rebuild the window from the last $L$ tokens and run the Transformer on that small segment from scratch (see the second sketch at the end of this section).

## Sliding Window Example (window size $L = 3$)

- t = 1: window [x₁]. Run the model on this length-1 sequence, use the output of x₁.
- t = 2: window [x₁, x₂]. Run the model on [x₁, x₂] (full 2×2 self-attention), use the output of x₂.
- t = 3: window [x₁, x₂, x₃]. Run the model on these 3 tokens (3×3 attention), use x₃.
- t = 4: window slides to [x₂, x₃, x₄]. Run the model again on this 3-token segment (3×3 attention), use x₄.
- t = 5: window [x₃, x₄, x₅], full 3×3 attention, use x₅.
- t = 6: window [x₄, x₅, x₆], full 3×3 attention, use x₆.

## Observations

A surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. ...
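The following is a minimal sketch of the window-attention cache policy described above (not the paper's code). The class name `WindowKVCache` and the toy token loop are illustrative assumptions; a real KV cache holds per-layer key/value tensors rather than token ids, but the eviction rule is the same: drop the oldest entry once the cache exceeds $L$.

```python
# Sketch only: illustrates "keep the L most recent tokens" eviction,
# not an actual inference implementation.
from collections import deque


class WindowKVCache:
    """Keep only the KV entries of the L most recent tokens."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.entries = deque()  # one (key, value) pair per cached token

    def append(self, key, value):
        self.entries.append((key, value))
        # Evict the oldest entry once the cache exceeds the window.
        if len(self.entries) > self.window_size:
            self.entries.popleft()


if __name__ == "__main__":
    cache = WindowKVCache(window_size=3)
    for t in range(1, 7):
        cache.append(f"k{t}", f"v{t}")
        print(t, [k for k, _ in cache.entries])
    # Memory stays bounded at L entries, but once x1's KV is evicted
    # (t = 4 here), the paper observes a sharp degradation in perplexity.
```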
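And here is a sketch of sliding-window decoding with re-computation, assuming a generic `model(tokens)` callable that returns one output per input position (a hypothetical stand-in for a full Transformer forward pass). No KV cache is reused: each step re-encodes the last $L$ tokens, which keeps quality but costs $O(T \cdot L^2)$ attention work overall.

```python
# Sketch only: re-run full attention over the last L tokens at every step.
from typing import Callable, List, Sequence


def sliding_window_decode(
    model: Callable[[Sequence[int]], List[int]],
    tokens: List[int],
    window_size: int,
) -> List[int]:
    outputs = []
    for t in range(1, len(tokens) + 1):
        window = tokens[max(0, t - window_size):t]  # last L tokens only
        # Full self-attention over the window: O(L^2) work per step,
        # with no cache reuse across steps.
        window_outputs = model(window)
        outputs.append(window_outputs[-1])  # keep only the output for x_t
    return outputs


if __name__ == "__main__":
    # Toy "model" that echoes its inputs, just to make the sketch runnable.
    echo_model = lambda window: list(window)
    print(sliding_window_decode(echo_model, [1, 2, 3, 4, 5, 6], window_size=3))
    # At t >= 4 the window has slid past x1, matching the example above.
```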