FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Extensive Reading

Author Info

About me - Xunhao Lai. He is good at writing Triton; here is another repo of his: XunhaoLai/native-sparse-attention-triton: Efficient triton implementation of Native Sparse Attention.

Background

As LLM context windows expand (up to 1M+ tokens), the pre-filling phase (processing the input prompt) becomes prohibitively expensive due to the quadratic complexity of full attention ($O(n^2)$); a toy illustration of this cost appears after the next list.

Why prior sparse attention is insufficient

Many approaches use fixed sparse patterns (e.g., sliding window) or patterns/sparsity ratios discovered offline. These often fail because: ...
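The quadratic prefill cost is easy to see in code. Below is a minimal sketch of my own (not from the paper or the author's repo) of naive single-head full attention: the $(n, n)$ score matrix is exactly the term that makes prefill time and memory grow as $O(n^2)$ in the prompt length $n$.

```python
# Minimal sketch (illustrative only): naive full attention materializes an
# (n, n) score matrix, so prefill cost grows quadratically with prompt length n.
import torch

def naive_full_attention(q, k, v):
    # q, k, v: (n, d) for a single head; real kernels add batch/head dims
    # and avoid materializing the full score matrix (e.g., FlashAttention).
    d = q.shape[-1]
    scores = (q @ k.T) / d**0.5   # (n, n) -- the quadratic term
    probs = torch.softmax(scores, dim=-1)
    return probs @ v              # (n, d)

n, d = 4096, 128
q, k, v = (torch.randn(n, d) for _ in range(3))
out = naive_full_attention(q, k, v)
print(out.shape)  # torch.Size([4096, 128]); scores alone held n*n ~= 16.7M floats
```

Doubling $n$ quadruples both the score-matrix size and the matmul FLOPs, which is why sparse attention targets the prefill phase specifically.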