# LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

## Extensive Reading

### Author Info

- About Me — Tim Dettmers: a research scientist at the Allen Institute for Artificial Intelligence (Ai2) and an incoming Assistant Professor at Carnegie Mellon University (CMU).
- Mike Lewis — Google Scholar

### Related Blogs

- LLM.int8() and Emergent Features — Tim Dettmers
- A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes

## Background

There are two common 8-bit quantization schemes (minimal sketches of both follow the Approaches section):

- **Absmax quantization**
  - Find the maximum absolute value across the data, call it abs_max, and compute a global scaling factor from it.
  - Multiply every value by this scaling factor, then round to the nearest integer to complete the quantization.
- **Zeropoint quantization**
  - Find the minimum and maximum of the data and compute a scaling factor from their range, while also introducing a zeropoint offset so that the entire mapped integer range is used.
  - Higher precision, but more expensive.

## Challenges

- How to preserve high quantization precision at scales beyond 1B parameters?
- How to deal with the systematic outliers that emerge in all transformer layers starting at scales of 6.7B parameters?

## Insights

- Regular quantization methods introduce larger quantization errors for outliers.
- The number of outliers may be small, but they account for the majority of the LLM's quality.
- Isolate the outlier feature dimensions into a 16-bit matrix multiplication while the remaining values are multiplied in 8-bit (a sketch of this decomposition appears below).

## Approaches

The approach consists of two main parts: ...
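To make the two quantization schemes concrete, here is a minimal NumPy sketch of absmax quantization. The helper names and the example vector are my own for illustration (the vector is the one used in the "Gentle Introduction" blog linked above), not code from the paper:

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    # Global scaling factor: map the largest magnitude onto 127.
    scale = 127.0 / np.max(np.abs(x))
    # Scale every value, then round to the nearest integer.
    x_q = np.round(x * scale).astype(np.int8)
    return x_q, scale

def absmax_dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    return x_q.astype(np.float32) / scale

x = np.array([1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4], dtype=np.float32)
x_q, scale = absmax_quantize(x)
print(x_q)                            # [  28  -12 -101   28  -73   19   56  127]
print(absmax_dequantize(x_q, scale))  # close to x, up to rounding error
```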
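And a matching sketch of zeropoint quantization, again with assumed helper names. The scale and zeropoint formulas follow the standard asymmetric int8 scheme: the range max − min is mapped onto 255 integer steps, and the offset shifts the result so that the full [-128, 127] range is used:

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray):
    # Scale maps the full dynamic range (max - min) onto 255 integer steps.
    scale = 255.0 / (np.max(x) - np.min(x))
    # Zeropoint shifts the scaled values so that min(x) maps to -128.
    zeropoint = int(np.round(-scale * np.min(x) - 128))
    x_q = np.clip(np.round(x * scale) + zeropoint, -128, 127).astype(np.int8)
    return x_q, scale, zeropoint

def zeropoint_dequantize(x_q: np.ndarray, scale: float, zeropoint: int) -> np.ndarray:
    return (x_q.astype(np.float32) - zeropoint) / scale
```

Compared with absmax, the extra offset lets asymmetric data (e.g. post-ReLU activations) use all 256 integer levels instead of wasting half the range, at the cost of tracking one more parameter per tensor.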
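Finally, a simplified sketch of the mixed-precision decomposition idea from the Insights section. This is my own simplification: it uses per-tensor absmax scales instead of the paper's vector-wise (row/column-wise) scales, and keeps the outlier path in float32 rather than fp16; the 6.0 magnitude threshold is the paper's default:

```python
import numpy as np

def mixed_precision_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0):
    # Outlier feature dimensions: columns of X (rows of W) containing
    # at least one activation whose magnitude exceeds the threshold.
    outlier = np.max(np.abs(X), axis=0) > threshold
    regular = ~outlier

    # 16-bit path: the few outlier dimensions stay in floating point.
    out_fp = X[:, outlier] @ W[outlier, :]

    if not np.any(regular):
        return out_fp

    # 8-bit path: absmax-quantize the remaining dimensions.
    Xr, Wr = X[:, regular], W[regular, :]
    sx = 127.0 / np.max(np.abs(Xr))
    sw = 127.0 / np.max(np.abs(Wr))
    Xq = np.round(Xr * sx).astype(np.int8)
    Wq = np.round(Wr * sw).astype(np.int8)
    # Multiply in int32 accumulators, then dequantize the result.
    out_i8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) / (sx * sw)

    return out_fp + out_i8
```

Since the outlier dimensions are few (a handful of columns even at 175B scale, per the paper), almost all of the FLOPs still run through the cheap 8-bit path.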
