My reading notes.
2025
0819-0825
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
- EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices
- LLM as a System Service on Mobile Devices
- SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
- HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- A Survey of Resource-efficient LLM and Multimodal Foundation Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
0812-0818
- KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
- Striped Attention: Faster Ring Attention for Causal Transformers
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
0729-0804
- Fast On-device LLM Inference with NPUs
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention
0715-0721
-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
Uncategorized
WIP 🚧