Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference
Extensive Reading Author Info MIT HAN Lab Background In long-context inference: The KV cache grows linearly with context length ($L$). At each decoding step, the model must read the entire KV cache to compute attention. Existing works recognize there is small part of tokens that can domoinate the accuracy of token generation, and they choose to evict unimportant tokens: StreamingLLM keeps a sliding window plus a few “anchor” tokens. H2O, TOVA, etc., use heuristics or statistics to permanently drop “less important” tokens. Once a token is evicted, it’s gone. BUT, the important tokens are Query-dependent. ...