My reading notes.
2025
0729-0804
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference with Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention
0715-0721
0708-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
Uncategorized
WIP 🚧