[Pinned] LLM Inference Papers Index
My reading notes.

2025

1111-1117
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- Dynamic Sparse Attention on Mobile SoCs
- A dynamic parallel method for performance optimization on hybrid CPUs
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
- Efficient Streaming Language Models with Attention Sinks
- KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models

1104-1110
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

1028-1103
- Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

0826-0901
- ELMS: Elasticized Large Language Models On Mobile Devices
- Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

0819-0825
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
- EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices
- LLM as a System Service on Mobile Devices
- SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
- HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- A Survey of Resource-efficient LLM and Multimodal Foundation Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

0812-0818
- KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
- Striped Attention: Faster Ring Attention for Causal Transformers
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

0729-0804
- Fast On-device LLM Inference with NPUs
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
- A Survey on Efficient Inference for Large Language Models

-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized
- WIP 🚧 ...