[Pinned] LLM Inference Papers Index
My reading notes.

2025

0819-0825
STI Turbocharge NLP Inference at the Edge via Elastic Pipelining
EdgeMoE Empowering Sparse Large Language Models on Mobile Devices
LLM as a System Service on Mobile Devices
SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
HeteroLLM Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
A Survey of Resource-efficient LLM and Multimodal Foundation Models
H2O Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

0812-0818
KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Striped Attention Faster Ring Attention for Causal Transformers
Ring Attention with Blockwise Transformers for Near-Infinite Context
TPI-LLM Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
LLM.int8() 8-bit Matrix Multiplication for Transformers at Scale

0729-0804
Fast On-device LLM Inference with NPUs
Deja Vu Contextual Sparsity for Efficient LLMs at Inference Time
PowerInfer-2 Fast Large Language Model Inference on a Smartphone
LLM in a flash Efficient Large Language Model Inference with Limited Memory
PowerInfer Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration
FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU
LoRA Low-Rank Adaptation of Large Language Models
SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
EdgeLLM Fast On-Device LLM Inference With Speculative Decoding
Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
A Survey on Efficient Inference for Large Language Models

-0714
Orca A Distributed Serving System for Transformer-Based Generative Models
EdgeShard Efficient LLM Inference via Collaborative Edge Computing
ServerlessLLM Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized
WIP 🚧 ...