Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Intensive Reading

Author Info

- Zichang Liu: Research Scientist at Meta.
- Jue Wang, Ph.D.: Founder & President of Stylar AI (stylar.ai).
- Tri Dao: Assistant Professor of Computer Science at Princeton University; Chief Scientist at Together AI.

Background

LLM Inference Latency Breakdown

Challenges

Exploiting sparsity to speed up LLM inference in wall-clock time, while maintaining model quality and in-context learning abilities, remains a challenging problem. Although sparsity and pruning have been well studied, they have not seen wide adoption for LLMs because of their poor quality and efficiency trade-offs on modern hardware such as GPUs: ...
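To make the contextual-sparsity idea behind the paper concrete, here is a minimal sketch of a per-input sparse MLP forward pass. This is not the authors' implementation: the dimensions are illustrative, and the neuron-selection step below simply "cheats" by scoring the true pre-activations, standing in for Deja Vu's learned low-rank predictor.

```python
# Minimal sketch of contextual sparsity in a transformer MLP block:
# a per-input subset of neurons is selected, and only the matching
# rows/columns of the weight matrices are computed.
import torch

hidden_dim, ffn_dim, top_k = 1024, 4096, 512   # keep ~12.5% of neurons

W1 = torch.randn(ffn_dim, hidden_dim) / hidden_dim**0.5  # up projection
W2 = torch.randn(hidden_dim, ffn_dim) / ffn_dim**0.5     # down projection
x = torch.randn(hidden_dim)                              # one token's hidden state

# Stand-in for the learned predictor: score each neuron by its true
# pre-activation magnitude and keep the top-k for this input.
scores = (W1 @ x).abs()
idx = scores.topk(top_k).indices   # the per-input ("contextual") subset

# Sparse forward pass: gather only the selected rows/columns, so both
# matmuls shrink from ffn_dim to top_k while staying dense GEMMs.
h = torch.relu(W1[idx] @ x)        # (top_k,)
y_sparse = W2[:, idx] @ h          # (hidden_dim,)

# Dense reference for comparison.
y_dense = W2 @ torch.relu(W1 @ x)
print(torch.norm(y_sparse - y_dense) / torch.norm(y_dense))
```

The point of gathering whole rows and columns is that the reduced computation is still dense and regular, so it maps well onto GPU hardware; this is what lets contextual sparsity translate into wall-clock speedup where unstructured weight pruning typically does not.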