QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Extensive Reading

Author Info
MIT HAN Lab

Background
Common quantization formats for LLMs:
- W8A8: 8-bit weights, 8-bit activations – almost lossless, widely deployed.
- W4A16: 4-bit weights, 16-bit activations – also near-lossless; good for reducing weight memory.
- W4A4: 4-bit weights and activations – more aggressive, but accuracy drops and real GPU speedups are disappointing.

On data center GPUs (A100, L40S), 4-bit quantization often underperforms because:
- Dequantization of weights or partial sums runs on slow CUDA cores, not fast tensor cores. For W4A4 systems like Atom and QuaRot, 20–90% of runtime can be eaten by dequantization in the main GEMM loop.
- To achieve reasonable accuracy, W4A4 must apply per-group quantization, which is finer than per-channel quantization – sharing FP16 scaling factors on a sub-channel basis (see the sketch after this list) ...
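To make the granularity difference concrete, here is a minimal PyTorch sketch (not QServe's actual kernel code) of symmetric INT4 weight quantization, comparing per-channel scales with per-group scales; the group size of 128 and the function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: per-channel vs. per-group symmetric INT4 quantization.
# Group size 128 is an assumed, commonly used value; names are illustrative.
import torch

def quantize_int4_per_channel(w: torch.Tensor):
    """w: [out_channels, in_channels]; one FP16 scale per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0            # INT4 range: [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale.half()

def quantize_int4_per_group(w: torch.Tensor, group_size: int = 128):
    """One FP16 scale per group of `group_size` consecutive input-channel elements."""
    oc, ic = w.shape
    wg = w.reshape(oc, ic // group_size, group_size)
    scale = wg.abs().amax(dim=2, keepdim=True) / 7.0
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)
    return q.reshape(oc, ic), scale.half()                      # many more scales to store and apply

w = torch.randn(4096, 4096)
_, s_pc = quantize_int4_per_channel(w)
_, s_pg = quantize_int4_per_group(w)
print(s_pc.numel(), s_pg.numel())   # per-group keeps 32x more scales at group_size=128
```

The finer granularity is what hurts on GPUs: with per-channel scales, rescaling can be applied once after the INT GEMM finishes, whereas per-group scales force partial sums to be dequantized with FP16 factors inside the main GEMM loop, and that work lands on CUDA cores rather than tensor cores.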