ArXiv-23

Ring Attention with Blockwise Transformers for Near-Infinite Context

Extensive Reading Author Info Hao Liu: A research scientist at Google DeepMind. Matei Zaharia: An associate professor at UC Berkeley (previously Stanford), where he works on computer systems and AI in the Sky Lab. Related Blogs Ring Attention Explained | Coconut Mode Background Transformer 的核心组件“自注意力机制”的内存消耗会随着输入序列长度的增加而呈二次方增长。这导致即便是最先进的 GPU/TPU，其有限的显存（通常小于 100GB）也无法处理超长序列，例如处理百万甚至千万级别的 token. 注意力模块的显存占用分析 $B$: Batch size ...

Deja Vu Contextual Sparsity for Efficient LLMs at Inference Time

Intensive Reading Author Info Zichang Liu：Research Scientist at Meta. Jue Wang, Ph.D: Founder & President of Stylar AI (stylar.ai). Tri Dao: Assistant Professor of Computer Science at Princeton University. Chief Scientist at Together AI. Background LLM Inference Latency Breakdown Challenges Speeding up inference-time sparse LLMs in wall-clock time while maintaining quality and in-context learning abilities remains a challenging problem. While sparsity and pruning have been well-studied, they have not seen wide adoption on LLMs due to the poor quality and efficiency trade-offs on modern hardware such as GPUs: ...

AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration

Extensive Reading Author Info Ji Lin’s Homepage Jiaming Tang Shang Yang | MIT EECS Song Han - Associate Professor, MIT EECS Background Quantization is vital for running LLM on edge devices. Challenges Quantization-aware training (QAT) is not efficient due to the high training cost. Post-training quantization (PTQ) suffers from large accuracy degradation under a low-bit setting. Insights Not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. Mixed-precision format is not hardware-efficient, we can employ activation-aware scaling. Approaches Activation-aware Weight Quantization ...

Efficient Memory Management for Large Language Model Serving with PagedAttention

Extensive Reading Author Info Woosuk Kwon Zhuohan Li Background The existing systems suffer from internal and external memory fragmentation. Three primary sources of memory wastes: 7+ Internal fragmentation: Space that will not be used in the future within an allocated memory block. External fragmentation: Unused space between memory blocks. The existing systems cannot exploit the opportunities for memory sharing. Parallel sampling, beam search, and shared prefix have the potential to leverage the shared KV cache to reduce memory footprint. ...