Quantization

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

Fundamentals

What is Quantization?

Quantization is the process of reducing the precision of numerical representations to achieve smaller memory footprints, lower memory bandwidth, and cheaper, more energy-efficient compute.

Key Idea

Convert high-precision floating-point numbers (FP32) to lower-precision representations (INT4, INT8) while maintaining acceptable accuracy.

Energy Motivation

Why Quantization Works


Floating Point Representations

FP32 (Full Precision)

FP16 (Half Precision)

Dynamic Range vs Precision Trade-offs
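
As a quick illustration of this trade-off, the minimal NumPy sketch below prints the dynamic range and machine epsilon of FP32 and FP16.

```python
import numpy as np

# Compare dynamic range (largest representable value) and precision (machine epsilon)
# for full- and half-precision floats.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{info.dtype}: max={info.max:.3e}, min normal={info.tiny:.3e}, eps={info.eps:.3e}")

# FP16 trades dynamic range (max ~6.55e4 vs ~3.40e38) and precision
# (eps ~9.77e-4 vs ~1.19e-7) for half the memory of FP32.
```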


Linear Quantization

Mathematical Formulation

Quantization: $q = \text{clip}(\text{round}(r/S + Z), -2^{b-1}, 2^{b-1} - 1)$

Dequantization: $r = S(q - Z)$

Where:

  - $S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$ (scaling factor)
  - $Z$ = zero point
  - $b$ = bit width
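
A minimal NumPy sketch of these formulas (asymmetric INT8 quantization of a single tensor; function and variable names are illustrative):

```python
import numpy as np

def linear_quantize(r, num_bits=8):
    """Asymmetric linear quantization of a tensor: r -> (q, S, Z)."""
    qmin, qmax = -2**(num_bits - 1), 2**(num_bits - 1) - 1
    rmin, rmax = float(r.min()), float(r.max())
    S = (rmax - rmin) / (qmax - qmin)            # scaling factor
    Z = round(qmin - rmin / S)                   # zero point maps rmin to qmin
    q = np.clip(np.round(r / S + Z), qmin, qmax).astype(np.int8)
    return q, S, Z

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

r = np.random.randn(1000).astype(np.float32)
q, S, Z = linear_quantize(r)
err = np.abs(dequantize(q, S, Z) - r).max()
# Rounding error is bounded by S/2; rounding the zero point can add slightly more at the range edges.
print(f"max abs error = {err:.4f}, S/2 = {S/2:.4f}")
```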

Sources of Error

  1. Rounding Error: Bounded by [-S/2, S/2]
  2. Clipping Error: Values outside representable range

Symmetric vs Asymmetric Quantization

Matrix Multiplication with Quantization

$$Y = S_W S_X [q_W q_X - q_W Z_X - q_X Z_W + Z_W Z_X]$$

Symmetric quantization sets $Z_W = Z_X = 0$, which eliminates the cross-term overhead and leaves only $S_W S_X \, q_W q_X$.
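
To make this concrete, here is a hedged NumPy sketch of an integer matmul with symmetric per-tensor quantization, so only the $q_W q_X$ term survives:

```python
import numpy as np

def symmetric_quantize(t, num_bits=8):
    """Symmetric quantization: Z = 0, scale chosen from max |t|."""
    qmax = 2**(num_bits - 1) - 1
    S = np.abs(t).max() / qmax
    q = np.clip(np.round(t / S), -qmax, qmax).astype(np.int8)
    return q, S

W = np.random.randn(64, 128).astype(np.float32)
X = np.random.randn(128, 32).astype(np.float32)

qW, SW = symmetric_quantize(W)
qX, SX = symmetric_quantize(X)

# Y = S_W * S_X * (q_W @ q_X); the Z_W / Z_X cross terms vanish because Z = 0.
Y_int = qW.astype(np.int32) @ qX.astype(np.int32)   # integer accumulation
Y_approx = SW * SX * Y_int
print("relative error:", np.linalg.norm(Y_approx - W @ X) / np.linalg.norm(W @ X))
```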


Non-Linear Quantization

Clustering-Based Approach
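
A minimal sketch of codebook (k-means style) quantization, where weights are replaced by indices into a small set of learned centroids (plain NumPy; illustrative only):

```python
import numpy as np

def kmeans_quantize(w, n_centroids=16, iters=20, seed=0):
    """Cluster weights into a codebook; store a per-weight centroid index."""
    rng = np.random.default_rng(seed)
    codebook = rng.choice(w, size=n_centroids, replace=False)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_centroids):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()
    return idx.astype(np.uint8), codebook   # 4-bit indices (stored as uint8) + 16 FP centroids

w = np.random.randn(10_000).astype(np.float32)
idx, codebook = kmeans_quantize(w)
w_hat = codebook[idx]                        # dequantize by table lookup
print("MSE:", np.mean((w - w_hat) ** 2))
```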

Granularity Options

  1. Per-Tensor: Single scale/zero-point for entire tensor
  2. Per-Channel/Vector: Scale per channel (more precise)
  3. Per-Group: Intermediate granularity

Trade-off: Finer granularity → higher precision but increased overhead
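
A small sketch of how the granularity choice changes the scale computation (assuming symmetric quantization of a weight matrix; shapes are illustrative):

```python
import numpy as np

W = np.random.randn(4096, 4096).astype(np.float32)
qmax = 127  # INT8 symmetric

# Per-tensor: one scale for the whole matrix.
scale_tensor = np.abs(W).max() / qmax                        # shape: ()

# Per-channel: one scale per output row (channel).
scale_channel = np.abs(W).max(axis=1, keepdims=True) / qmax  # shape: (4096, 1)

# Per-group: one scale per group of 128 consecutive weights in a row.
group = 128
scale_group = np.abs(W.reshape(4096, -1, group)).max(axis=2) / qmax  # shape: (4096, 32)

# Finer granularity -> scales track local ranges better (less rounding error),
# but more scale values must be stored and applied at runtime.
```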


Training Workflows

Post-Training Quantization (PTQ)

Quantization-Aware Training (QAT)

Straight-Through Estimator (STE)
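
A minimal PyTorch sketch of the STE idea: quantize in the forward pass, but treat the rounding as the identity in the backward pass so gradients can flow. The class name and fixed scale here are illustrative.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round in the forward pass; straight-through (identity) gradient in backward."""

    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale  # dequantized ("fake quantized") value used downstream

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend round() is the identity, so gradients pass to x unchanged.
        return grad_output, None

x = torch.randn(8, requires_grad=True)
y = FakeQuant.apply(x, torch.tensor(0.1))
y.sum().backward()
print(x.grad)  # all ones: the gradient passed straight through the rounding
```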


LLM-Specific Quantization Challenges

Outlier Problem

Modern LLM Context

Example - DeepSeek V3:


Advanced LLM Quantization Methods

LLM.int8()

Key Ideas:

  1. Vector-wise quantization for better outlier handling
  2. Mixed precision: Keep outliers in FP16, quantize regular values to INT8
  3. Decomposition: Separate outlier and regular computations
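
A simplified NumPy sketch of the mixed-precision decomposition (per-tensor scales and a fixed outlier threshold are used here for brevity; the real LLM.int8() kernel is more involved):

```python
import numpy as np

def llm_int8_matmul(X, W, threshold=6.0):
    """X @ W with outlier columns of X kept in FP16 and the rest in INT8."""
    col_max = np.abs(X).max(axis=0)
    outlier = col_max >= threshold            # columns of X (rows of W) containing outliers

    # Regular part: symmetric INT8 quantization.
    Xr, Wr = X[:, ~outlier], W[~outlier, :]
    sx = np.abs(Xr).max() / 127 if Xr.size else 1.0
    sw = np.abs(Wr).max() / 127 if Wr.size else 1.0
    qx = np.round(Xr / sx).astype(np.int8)
    qw = np.round(Wr / sw).astype(np.int8)
    Y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32)) * sx * sw

    # Outlier part: kept in higher precision (FP16 in the paper).
    Y_fp = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)

    return Y_int8 + Y_fp.astype(np.float32)
```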

SmoothQuant (W8A8)

Motivation: Outliers typically in activations, not weights

Core Concept: Migrate quantization difficulty

Benefits: Balances quantization difficulty between weights and activations
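
A hedged sketch of the migration step, assuming the per-channel smoothing factor from the SmoothQuant paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$: dividing activations by $s$ and folding $s$ into the weights leaves the product unchanged but flattens activation outliers.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style migration: divide activations by s, multiply weights by s.

    (X / s) @ (s * W) == X @ W, but the activation outliers are scaled down.
    """
    act_max = np.abs(X).max(axis=0)        # per input-channel activation range
    w_max = np.abs(W).max(axis=1)          # per input-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)
    X_smooth = X / s                        # easier to quantize
    W_smooth = W * s[:, None]               # scale folded into the weights offline
    return X_smooth, W_smooth

X = np.random.randn(16, 512).astype(np.float32)
X[:, 3] *= 50.0                             # inject an activation outlier channel
W = np.random.randn(512, 512).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.allclose(Xs @ Ws, X @ W, rtol=1e-3, atol=1e-3))  # same product, smoother ranges
```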

AWQ (Activation-Aware Weight-Only Quantization)

Target: Low-batch scenarios where activation quantization is prohibitive

Key Insights:

  1. Weight-only quantization (W4) for memory efficiency
  2. Salient weight identification using activation magnitudes
  3. Per-channel scaling to protect important weights

Method: $WX \rightarrow Q(W \cdot s)(s^{-1} \cdot X)$
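
A simplified sketch of that scaling trick: scale up the salient input channels of $W$ before quantization and fold $s^{-1}$ into the activations. The fixed scale and top-1% selection here are illustrative; the actual AWQ searches the per-channel scales on calibration data.

```python
import numpy as np

def awq_style_scale(W, X_calib, ratio=0.01, s_val=2.0):
    """Protect salient weight channels: W X -> Q(W * s) (s^-1 * X).

    Channels are ranked by mean |activation| on calibration data; the top
    `ratio` fraction get scale s_val > 1 so they lose less to 4-bit rounding.
    """
    importance = np.abs(X_calib).mean(axis=1)             # per input-channel activation magnitude
    k = max(1, int(ratio * len(importance)))
    salient = np.argsort(importance)[-k:]                  # most important input channels

    s = np.ones(W.shape[1], dtype=W.dtype)
    s[salient] = s_val

    W_scaled = W * s[None, :]    # pass to a 4-bit weight quantizer as Q(W * s)
    X_scale = 1.0 / s            # fold s^-1 into the activations (or the previous layer)
    return W_scaled, X_scale, salient
```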


Quantization Performance Impact

Accuracy vs Bit-width

Rounding Schemes Impact

| Scheme            | Accuracy       |
|-------------------|----------------|
| Nearest           | 52.29%         |
| Stochastic        | 52.06 ± 5.52%  |
| Stochastic (best) | 63.06%         |
| Ceil/Floor        | 0.10%          |
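
The table suggests why the rounding choice matters: round-to-nearest is a strong deterministic baseline, stochastic rounding is unbiased but noisy (so the best of many sampled roundings can beat nearest), and always rounding up or down destroys accuracy. A minimal sketch of stochastic rounding, which rounds up with probability equal to the fractional part:

```python
import numpy as np

def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part (unbiased in expectation)."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)
x = np.full(100_000, 2.3)
print(stochastic_round(x, rng).mean())   # ~2.3 in expectation; round-to-nearest would give 2.0
```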

Summary

Quantization is essential for efficient LLM serving: it shrinks model memory footprint and bandwidth requirements and reduces compute and energy cost, while keeping accuracy acceptable.

Key techniques for LLMs include linear (symmetric/asymmetric) and clustering-based quantization, granularity choices from per-tensor to per-group, PTQ and QAT workflows, and explicit handling of activation outliers.

Modern approaches (LLM.int8(), SmoothQuant, AWQ) address LLM-specific challenges while maintaining practical deployability.