LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Extensive Reading

Author Info
- Tim Dettmers: research scientist at the Allen Institute for Artificial Intelligence (Ai2) and incoming Assistant Professor at Carnegie Mellon University (CMU)
- Mike Lewis (Google Scholar)

Related Blogs
- LLM.int8() and Emergent Features — Tim Dettmers
- A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes

Background
There are two common 8-bit quantization schemes:
- Absmax quantization: find the maximum absolute value across the data (abs_max), derive a global scaling factor from it, multiply every value by this factor, and round to the nearest integer to quantize.
- Zeropoint quantization: find the minimum and maximum of the data, compute a scaling factor, and introduce an offset (the zeropoint) so that the full mapped integer range is used; higher precision, but higher overhead.

Challenges
- How to preserve high quantization precision at scales beyond 1B parameters?
- How to deal with the systematic outliers that emerge in all transformer layers starting at scales of 6.7B parameters?

Insights
- Regular quantization methods introduce large quantization errors for outliers.
- The number of outliers can be small, yet they contribute the majority of the LLM's quality.
- Isolate the outlier feature dimensions into a 16-bit matrix multiplication while all other values are multiplied in 8-bit (see the sketch below).

Approaches
The approach mainly consists of two parts: ...
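The two quantization schemes from the Background and the mixed-precision idea from the Insights can be illustrated with a minimal NumPy sketch. The function names, the per-tensor (global absmax) scaling, and the outlier threshold of 6.0 are illustrative assumptions for this note, not the bitsandbytes implementation.

```python
import numpy as np

def absmax_quantize(x):
    # Global scaling factor derived from the largest absolute value.
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale  # dequantize with q / scale

def zeropoint_quantize(x):
    # Use min and max so the whole int8 range [-128, 127] is covered.
    x_min, x_max = x.min(), x.max()
    scale = 255.0 / (x_max - x_min)
    zeropoint = np.round(-x_min * scale) - 128
    q = np.clip(np.round(x * scale) + zeropoint, -128, 127).astype(np.int8)
    return q, scale, zeropoint  # dequantize with (q - zeropoint) / scale

def mixed_precision_matmul(X, W, outlier_threshold=6.0):
    # Columns of X whose magnitude exceeds the threshold are treated as
    # outlier feature dimensions and multiplied in 16-bit; everything
    # else goes through an int8 matmul (threshold chosen for illustration).
    outliers = np.any(np.abs(X) > outlier_threshold, axis=0)
    Y_out = X[:, outliers].astype(np.float16) @ W[outliers, :].astype(np.float16)
    Xq, sx = absmax_quantize(X[:, ~outliers])
    Wq, sw = absmax_quantize(W[~outliers, :])
    Y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) / (sx * sw)
    return Y_out + Y_int8

# Example: inject an outlier feature dimension and check the result
X = np.random.randn(8, 64).astype(np.float32)
W = np.random.randn(64, 32).astype(np.float32)
X[:, 3] *= 20.0
Y = mixed_precision_matmul(X, W)  # close to X @ W
```

The per-tensor absmax scaling only keeps the sketch short; the paper applies finer-grained (per-row / per-column) scaling before the int8 multiplication.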

August 12, 2025 · Last updated on August 25, 2025 · 2 min · KKKZOZ

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Extensive Reading

Author Info
- Ji Lin's Homepage
- Jiaming Tang
- Shang Yang | MIT EECS
- Song Han - Associate Professor, MIT EECS

Background
Quantization is vital for running LLMs on edge devices.

Challenges
- Quantization-aware training (QAT) is not efficient due to its high training cost.
- Post-training quantization (PTQ) suffers from large accuracy degradation under low-bit settings.

Insights
- Not all weights in an LLM are equally important: protecting only 1% of the salient weights can greatly reduce quantization error.
- To identify salient weight channels, we should refer to the activation distribution, not the weights.
- A mixed-precision format is not hardware-efficient; instead, we can employ activation-aware scaling (see the sketch below).

Approaches
Activation-aware Weight Quantization ...
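A rough NumPy sketch of the activation-aware scaling idea: salient input channels are identified from activation magnitudes and scaled up before weight quantization, and the inverse scale is folded into the activations so the product is unchanged. The fixed exponent alpha, the per-tensor pseudo-quantizer, and the function names are simplifying assumptions; AWQ searches for the scaling exponent and quantizes weights group-wise.

```python
import numpy as np

def pseudo_quantize(w, n_bits=4):
    # Per-tensor absmax quantize-dequantize, standing in for AWQ's
    # group-wise quantizer (simplification).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def awq_scale_and_quantize(W, X, alpha=0.5, n_bits=4):
    # W: (out_features, in_features), X: (n_tokens, in_features).
    # Average activation magnitude per input channel marks salient channels.
    act_scale = np.mean(np.abs(X), axis=0)           # (in_features,)
    s = np.clip(act_scale, 1e-5, None) ** alpha      # fixed alpha; AWQ searches it
    W_q = pseudo_quantize(W * s, n_bits)             # salient channels see less error
    return W_q, s

# Usage: fold the inverse scale into the activations so that
# (X / s) @ (W * s).T == X @ W.T up to quantization error.
X = np.random.randn(64, 128).astype(np.float32)
W = np.random.randn(256, 128).astype(np.float32)
W_q, s = awq_scale_and_quantize(W, X)
Y = (X / s) @ W_q.T
```

Scaling up the salient channels shrinks their relative quantization error, while the compensating division on the activation side keeps the layer's output mathematically equivalent before rounding.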

July 28, 2025 · Last updated on August 25, 2025 · 2 min · KKKZOZ