QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Extensive Reading Author Info MIT HAN Lab Background Common quantization formats for LLMs: W8A8: 8-bit weights, 8-bit activations – almost lossless, widely deployed. W4A16: 4-bit weights, 16-bit activations – also near-lossless; good for reducing weight memory. W4A4: 4-bit weights and activations – more aggressive, but accuracy drops and real GPU speedups are disappointing. On data center GPUs (A100, L40S), 4-bit quantization often underperforms because: Dequantization of weights or partial sums runs on slow CUDA cores, not fast tensor cores. For W4A4 systems like Atom and QuaRot, 20–90% of runtime can be eaten by dequantization in the main GEMM loop. To achieve reasonable accuracy, W4A4 must apply per-group quantization, which is finer than per-channel quantization – sharing FP16 scaling factors on a per-group (sub-channel) basis ...
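To make the per-group idea concrete, here is a minimal NumPy sketch of symmetric INT4 per-group weight quantization (the function name and the group size of 128 are illustrative, not QServe's actual kernel): every group of 128 input-channel weights shares one FP16 scale, instead of one scale per whole channel.

```python
import numpy as np

def quantize_int4_per_group(w, group_size=128):
    """Symmetric INT4 per-group quantization of a weight matrix.

    Each contiguous `group_size` slice of an input channel shares one
    FP16 scaling factor, which is finer-grained than a single scale
    per output channel (per-channel quantization).
    """
    out_ch, in_ch = w.shape
    w_groups = w.reshape(out_ch, in_ch // group_size, group_size)
    # One scale per (output channel, group): absmax mapped to the INT4 range [-7, 7].
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w_groups / scales), -7, 7).astype(np.int8)
    return q.reshape(out_ch, in_ch), scales.astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float32)
q, scales = quantize_int4_per_group(w)
```

The extra FP16 scales are what forces the dequantization work into the GEMM main loop on CUDA cores, which is exactly the overhead the post describes.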

November 16, 2025 · Last updated on November 17, 2025 · 7 min · KKKZOZ

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Extensive Reading Author Info MIT HAN Lab Background Modern large language models (LLMs) are extremely costly to serve in FP16 because of their massive parameter counts and long-context workloads; while low-bit quantization (especially INT8) is an attractive way to cut memory and latency, naïve post-training W8A8 (8-bit weights and activations) breaks down on large models due to severe activation outliers that cause large accuracy drops. Existing INT8 solutions either focus on weights only (e.g., GPTQ-style methods) or handle activation outliers with mixed precision (e.g., LLM.int8(), outlier-aware kernels); these approaches can preserve accuracy but often bring limited end-to-end gains because they leave activations/KV caches in higher precision, rely on complex custom kernels, or end up slower than plain FP16 in practical deployments. ...
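The core trick the post goes on to describe is SmoothQuant's offline per-channel smoothing: divide activations by a factor s and multiply the corresponding weight rows by s, so the product is mathematically unchanged while outlier activation channels become easier to quantize. A rough NumPy sketch under the paper's formula s_j = max|X_j|^α / max|W_j|^(1−α) with the default α = 0.5 (variable names are illustrative):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate quantization difficulty from activations to weights.

    X: calibration activations of shape (tokens, in_features)
    W: weights of shape (in_features, out_features)
    Returns smoothed (X_hat, W_hat) with X_hat @ W_hat == X @ W.
    """
    act_max = np.abs(X).max(axis=0)            # per-input-channel activation absmax
    w_max = np.abs(W).max(axis=1)              # per-input-channel weight absmax
    s = act_max**alpha / np.maximum(w_max, 1e-5)**(1 - alpha)
    s = np.maximum(s, 1e-5)                    # guard against division by zero
    return X / s, W * s[:, None]

X = np.random.randn(32, 512) * np.array([1.0] * 511 + [50.0])  # one outlier channel
W = np.random.randn(512, 512)
X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)           # output is preserved exactly
```

After smoothing, both X_s and W_s have flatter per-channel ranges, so plain per-tensor W8A8 quantization loses much less accuracy.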

November 16, 2025 · Last updated on November 17, 2025 · 4 min · KKKZOZ

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Extensive Reading Author Info About Me — Tim Dettmers: A research scientist at the Allen Institute for Artificial Intelligence (Ai2) and an incoming Assistant Professor at Carnegie Mellon University (CMU). Mike Lewis - Google Scholar Related Blogs LLM.int8() and Emergent Features — Tim Dettmers A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes Background There are two common 8-bit quantization schemes: Absmax quantization: find the maximum absolute value in the data (abs_max), compute a single global scaling factor from it, multiply every value by that scaling factor, and round to the nearest integer. Zeropoint quantization: find the minimum and maximum of the data, compute a scaling factor, and introduce an offset (the zeropoint) so the full mapped integer range is used – higher precision, but higher overhead. Challenges How to preserve high quantization precision at scales beyond 1B parameters? How to deal with the systematic outliers that emerge in all transformer layers starting at scales of 6.7B parameters? Insights Regular quantization methods introduce larger quantization errors for outliers. The number of outliers can be small, yet they contribute the majority of the LLM's quality. Isolate the outlier feature dimensions into a 16-bit matrix multiplication while the other values are multiplied in 8-bit. Approaches The method consists of two main parts: ...
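A minimal NumPy sketch of the mixed-precision decomposition described above (the threshold of 6.0 and the helper names are illustrative, not the bitsandbytes API): feature dimensions of X whose magnitude exceeds the threshold stay in 16-bit, the rest go through absmax INT8, and the two partial products are summed.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric absmax quantization to INT8 along the last axis."""
    scale = 127.0 / np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8)
    return np.round(x * scale).astype(np.int8), scale

def mixed_precision_matmul(X, W, threshold=6.0):
    """Outlier-aware matmul: outlier feature dims in 16-bit, the rest in INT8."""
    outlier_cols = np.abs(X).max(axis=0) > threshold      # feature dims holding outliers
    regular_cols = ~outlier_cols

    # High-precision path for the few outlier dimensions.
    out_hi = X[:, outlier_cols] @ W[outlier_cols, :]

    # 8-bit path for everything else: row-wise scales for X, column-wise for W.
    Xq, sx = absmax_quantize(X[:, regular_cols])
    Wq, sw = absmax_quantize(W[regular_cols, :].T)        # one scale per output column
    out_int8 = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) / (sx * sw.T)

    return out_hi + out_int8
```

Because the outlier dimensions are typically well under 1% of the hidden size, almost all of the multiply work still runs in INT8.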

August 12, 2025 · Last updated on August 25, 2025 · 2 min · KKKZOZ

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Extensive Reading Author Info Ji Lin’s Homepage Jiaming Tang Shang Yang | MIT EECS Song Han - Associate Professor, MIT EECS Background Quantization is vital for running LLMs on edge devices. Challenges Quantization-aware training (QAT) is not efficient due to its high training cost. Post-training quantization (PTQ) suffers from large accuracy degradation under low-bit settings. Insights Not all weights in an LLM are equally important. Protecting only 1% of the salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. A mixed-precision format is not hardware-efficient; instead, we can employ activation-aware scaling. Approaches Activation-aware Weight Quantization ...
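A toy NumPy sketch of the activation-aware scaling idea (the single-exponent grid search and helper names are illustrative simplifications of AWQ's actual search): input channels are scaled by s = act_absmax^α before INT4 quantization, the activations are divided by s to keep the layer output unchanged, and the α with the lowest reconstruction error is kept.

```python
import numpy as np

def pseudo_quant_int4(w):
    """Round-trip symmetric INT4 quantization, one scale per output channel."""
    scale = np.maximum(np.abs(w).max(axis=0, keepdims=True), 1e-8) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

def awq_style_search(X, W, alphas=np.linspace(0, 1, 11)):
    """Search a per-input-channel scale s = act_max**alpha that minimizes the
    quantized layer's output error; salient (high-activation) channels get a
    larger s, so they lose less precision under INT4."""
    act_max = np.abs(X).max(axis=0)                 # per-input-channel activation absmax
    ref = X @ W
    best_err, best_s = np.inf, None
    for a in alphas:
        s = np.maximum(act_max**a, 1e-5)
        Wq = pseudo_quant_int4(W * s[:, None])      # scale salient rows up, then quantize
        err = np.mean((ref - (X / s) @ Wq) ** 2)    # fold 1/s into the activations
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err
```

All weights still end up in the same low-bit format, which is why this scaling trick stays hardware-friendly where a mixed-precision layout would not.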

July 28, 2025 · Last updated on September 1, 2025 · 2 min · KKKZOZ