LLM Inference Papers Index
My reading notes.

2025

0729-0804
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
- A Survey on Efficient Inference for Large Language Models

-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized
- WIP 🚧 ...