Quantization in LLM Serving Systems
Quantization
Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu
Fundamentals
What is Quantization?
Quantization is the process of reducing the precision of numerical representations to achieve:
- Reduced memory footprint
- Lower latency
- Decreased energy consumption
- Compact representation
Key Idea
Convert high-precision floating-point numbers (FP32) to lower-precision representations (INT4, INT8) while maintaining acceptable accuracy.
Energy Motivation
- Energy scales quadratically with bit-width
- 4x bit width → ~16x energy consumption
- Memory access dominates compute cost (DRAM read: ~640 pJ vs. 8-bit add: 0.03 pJ)
Why Quantization Works
- Activation ranges are well-defined in neural networks
- Even distributions lead to better training outcomes
- Proper initialization prevents activation divergence and gradient instability
Floating Point Representations
FP32 (Full Precision)
- Sign: 1 bit
- Exponent: 8 bits
- Significand/Mantissa: 23 bits
- Total: 32 bits
FP16 (Half Precision)
- Sign: 1 bit
- Exponent: 5 bits
- Significand/Mantissa: 10 bits
- Total: 16 bits
Dynamic Range vs Precision Trade-offs
- Dynamic Range: Range of representable numbers (matters more for training)
- Precision: Distance between neighboring values (matters more for inference)
- INT8: Limited dynamic range (-128 to 127) but uniform precision (constant spacing between values)
Linear Quantization
Mathematical Formulation
Quantization: $q = \text{clip}(\text{round}(r/S + Z), -2^{b-1}, 2^{b-1}-1)$
Dequantization: $r = S(q - Z)$
Where:
- $S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$ (scaling factor)
- $Z$ = zero point
- $b$ = bit width
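A minimal NumPy sketch of these two formulas (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def linear_quantize(r, b=8):
    """Asymmetric linear quantization of a float array to b-bit integers."""
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)      # scaling factor
    Z = np.round(qmin - r.min() / S)             # zero point maps r_min to qmin
    q = np.clip(np.round(r / S + Z), qmin, qmax).astype(np.int8)
    return q, S, Z

def linear_dequantize(q, S, Z):
    """Recover an approximation of the original values: r ~= S * (q - Z)."""
    return S * (q.astype(np.float32) - Z)

x = np.random.randn(4, 4).astype(np.float32)
q, S, Z = linear_quantize(x)
x_hat = linear_dequantize(q, S, Z)
print(np.abs(x - x_hat).max())   # rounding error, bounded by S/2
```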
Sources of Error
- Rounding Error: Bounded by [-S/2, S/2]
- Clipping Error: Values outside representable range
Symmetric vs Asymmetric Quantization
- Symmetric: Zero point Z = 0 (simpler computation)
- Asymmetric: Non-zero Z (more flexible, better for skewed distributions like ReLU outputs)
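As a sketch, the difference shows up in how the scale and zero point are chosen (function names here are mine, not from the lecture):

```python
import numpy as np

def symmetric_params(r, b=8):
    # Z = 0; the largest magnitude maps to the edge of the integer range
    qmax = 2 ** (b - 1) - 1
    return np.abs(r).max() / qmax, 0

def asymmetric_params(r, b=8):
    # Non-zero Z; the whole [r_min, r_max] interval maps onto [qmin, qmax]
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(np.round(qmin - r.min() / S))
    return S, Z

relu_out = np.abs(np.random.randn(1000))   # skewed, non-negative like ReLU outputs
print(symmetric_params(relu_out))    # half the integer range is wasted on negatives
print(asymmetric_params(relu_out))   # zero point shifts the range to cover [0, max]
```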
Matrix Multiplication with Quantization
$$Y = S_W S_X [q_W q_X - q_W Z_X - q_X Z_W + Z_W Z_X]$$
Symmetric quantization eliminates the overhead terms when $Z_W = Z_X = 0$.
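A sketch of the symmetric case: the heavy work becomes an integer matmul, followed by a single rescale by $S_W S_X$ (per-tensor scales assumed for simplicity):

```python
import numpy as np

def sym_quantize(r, b=8):
    S = np.abs(r).max() / (2 ** (b - 1) - 1)
    return np.round(r / S).astype(np.int8), S

W = np.random.randn(16, 32).astype(np.float32)
X = np.random.randn(32, 8).astype(np.float32)
qW, SW = sym_quantize(W)
qX, SX = sym_quantize(X)

# The heavy work is an integer matmul (accumulated in int32 to avoid overflow);
# the scales are applied once at the end because Z_W = Z_X = 0.
Y_int8 = SW * SX * (qW.astype(np.int32) @ qX.astype(np.int32))
print(np.abs(Y_int8 - W @ X).max())   # small quantization error vs. FP32 matmul
```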
Non-Linear Quantization
Clustering-Based Approach
- Use case: Skewed weight distributions
- Method: K-means clustering to determine quantization levels
- Storage:
- Indices: $\log_2(N)$ bits per weight
- Codebook: N centroids in original precision
- Example: 3.2x compression for 4-bit quantization
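A sketch of the clustering approach, using scikit-learn's KMeans for brevity (the library choice and matrix size are assumptions, not from the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans

def codebook_quantize(w, n_bits=4):
    """Cluster the weights into 2**n_bits centroids; store indices plus a codebook."""
    n_clusters = 2 ** n_bits
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w.reshape(-1, 1))
    indices = km.labels_.astype(np.uint8)      # log2(N) bits per weight (stored as uint8 here)
    codebook = km.cluster_centers_.flatten()   # N centroids kept in original precision
    return indices, codebook

def codebook_dequantize(indices, codebook, shape):
    return codebook[indices].reshape(shape)

w = np.random.randn(64, 64).astype(np.float32)
idx, cb = codebook_quantize(w, n_bits=4)
w_hat = codebook_dequantize(idx, cb, w.shape)
print(np.abs(w - w_hat).mean())   # mean quantization error of the 16-level codebook
```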
Granularity Options
- Per-Tensor: Single scale/zero-point for entire tensor
- Per-Channel/Vector: Scale per channel (more precise)
- Per-Group: Intermediate granularity
Trade-off: Finer granularity → higher precision but increased overhead
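A sketch of how many scales each granularity has to store for a single weight matrix (the 1024-dim matrix and group size of 128 are assumed for illustration):

```python
import numpy as np

W = np.random.randn(1024, 1024).astype(np.float32)
qmax = 127          # symmetric INT8
group = 128         # assumed group size

# Per-tensor: one scale for the entire matrix
s_tensor = np.abs(W).max() / qmax                                # scalar

# Per-channel: one scale per output channel (row)
s_channel = np.abs(W).max(axis=1, keepdims=True) / qmax          # shape (1024, 1)

# Per-group: one scale per contiguous group of 128 weights in each row
s_group = np.abs(W.reshape(1024, -1, group)).max(axis=2) / qmax  # shape (1024, 8)

print(np.size(s_tensor), s_channel.size, s_group.size)  # 1 vs 1024 vs 8192 scales to store
```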
Training Workflows
Post-Training Quantization (PTQ)
- Training: Uses full precision (FP32/BF16)
- Inference: Applies quantization to weights and/or activations
- Simplest approach, but may suffer accuracy degradation
Quantization-Aware Training (QAT)
- Training: Simulates quantization effects during training
- Method: Quantize → dequantize in the forward pass
- Backpropagation: Uses Straight-Through Estimator (STE) for differentiability
- Results: Generally outperforms PTQ
Straight-Through Estimator (STE)
- Problem: The step function's derivative is 0 almost everywhere (and $\infty$ at the jumps), so gradients cannot flow through it
- Solution: Use identity function gradient: $\frac{\partial}{\partial x}\text{quantize}(x) \approx 1$
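A minimal PyTorch sketch of fake quantization with the STE, assuming symmetric 8-bit per-tensor quantization (illustrative only, not the course's implementation):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize -> dequantize in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.abs().max() / qmax                        # symmetric per-tensor scale
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        return q * scale                                    # dequantized ("fake quantized") values

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round() as the identity, so d(quantize)/dx ~= 1
        return grad_output, None                            # no gradient for n_bits

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, 8).sum()
y.backward()
print(x.grad)   # all ones: the gradient passed straight through the quantizer
```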
LLM-Specific Quantization Challenges
Outlier Problem
- Observation: Quantization accuracy drops significantly for large models (>6.7B parameters)
- Cause: Emergence of outlier features in activations
- Impact: Sharp accuracy degradation with standard quantization
Modern LLM Context
Example - DeepSeek V3:
- FP8 quantization: 671B parameters × 1 byte = 671 GB (fits on ~5 H200s)
- BF16 weights: 671B parameters × 2 bytes ≈ 1.3 TB
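The arithmetic behind these numbers, assuming 141 GB of HBM per H200:

```python
params = 671e9                     # DeepSeek V3 parameter count
h200_hbm_gb = 141                  # assumed HBM capacity of one H200

fp8_gb = params * 1 / 1e9          # 671 GB at 1 byte/parameter
bf16_gb = params * 2 / 1e9         # 1342 GB ~= 1.3 TB at 2 bytes/parameter

print(fp8_gb, bf16_gb, fp8_gb / h200_hbm_gb)   # ~4.8 -> about 5 H200s for weights alone
```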
Advanced LLM Quantization Methods
LLM.int8()
Key Ideas:
1. Vector-wise quantization for better outlier handling
2. Mixed precision: Keep outliers in FP16, quantize regular values to INT8
3. Decomposition: Separate outlier and regular computations
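A toy sketch of the decomposition idea; the per-tensor scales are a simplification (LLM.int8() itself uses vector-wise scales and a threshold of 6.0 on activation magnitudes):

```python
import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    """Keep outlier feature dimensions in FP16, quantize the rest to INT8."""
    outlier_cols = np.abs(X).max(axis=0) > threshold        # feature dims with outliers
    regular_cols = ~outlier_cols

    # Regular part: symmetric INT8 quantization (per-tensor scales for brevity)
    sX = np.abs(X[:, regular_cols]).max() / 127
    sW = np.abs(W[regular_cols, :]).max() / 127
    qX = np.round(X[:, regular_cols] / sX).astype(np.int8)
    qW = np.round(W[regular_cols, :] / sW).astype(np.int8)
    Y_regular = sX * sW * (qX.astype(np.int32) @ qW.astype(np.int32))

    # Outlier part: a thin FP16 matmul over the few outlier dimensions
    Y_outlier = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)
    return Y_regular + Y_outlier.astype(np.float32)

X = np.random.randn(4, 64).astype(np.float32)
X[:, 3] *= 20                                    # inject an outlier feature dimension
W = np.random.randn(64, 16).astype(np.float32)
print(np.abs(mixed_precision_matmul(X, W) - X @ W).max())
```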
SmoothQuant (W8A8)
Motivation: Outliers typically in activations, not weights
Core Concept: Migrate quantization difficulty
- Formula: $s_j = \frac{\max(|X_j|)^\alpha}{\max(|W_j|)^{1-\alpha}}$
- Process: $WX \rightarrow Q(W \cdot s)(s^{-1} \cdot X)$
- Optimal $\alpha$: Empirically found to be 0.5
Benefits: Balances quantization difficulty between weights and activations
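A sketch of computing the smoothing factors from calibration statistics and folding them into the weights (shapes and the calibration data are assumptions):

```python
import numpy as np

def smooth(W, X_calib, alpha=0.5, eps=1e-5):
    """Fold per-channel smoothing factors into the weights.

    W: (in_features, out_features); X_calib: (tokens, in_features).
    """
    act_max = np.abs(X_calib).max(axis=0)        # per-input-channel activation max
    w_max = np.abs(W).max(axis=1)                # per-input-channel weight max
    s = act_max ** alpha / (w_max ** (1 - alpha) + eps)

    W_smoothed = W * s[:, None]                  # quantized offline as Q(W * s)
    # At runtime the activations are divided by s, which can be fused into
    # the preceding LayerNorm so it costs nothing extra.
    return W_smoothed, s

X = np.random.randn(512, 64).astype(np.float32)
X[:, 7] *= 50                                    # an outlier activation channel
W = np.random.randn(64, 128).astype(np.float32)
W_s, s = smooth(W, X)
# The transform is mathematically a no-op: (X / s) @ W_s == X @ W up to FP error,
# but the smoothed activations are now much easier to quantize to INT8.
print(np.abs((X / s) @ W_s - X @ W).max())
```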
AWQ (Activation-Aware Weight-Only Quantization)
Target: Low-batch scenarios where activation quantization is prohibitive
Key Insights:
1. Weight-only quantization (W4) for memory efficiency
2. Salient weight identification using activation magnitudes
3. Per-channel scaling to protect important weights
Method: $WX \rightarrow Q(W \cdot s)(s^{-1} \cdot X)$
- Scale important channels up before quantization
- Fuse inverse scaling with previous operations (e.g., LayerNorm)
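A simplified sketch of the idea: pick salient input channels by activation magnitude, scale them up before 4-bit weight quantization, and fold the inverse scale back into the activations (the scale value, saliency fraction, and per-output-channel grouping are illustrative assumptions, not the paper's search procedure):

```python
import numpy as np

def awq_quantize(W, X_calib, scale=2.0, top_frac=0.05, n_bits=4):
    """Weight-only quantization that protects salient input channels.

    W: (in_features, out_features); X_calib: (tokens, in_features).
    """
    act_mag = np.abs(X_calib).mean(axis=0)             # saliency score per input channel
    k = max(1, int(top_frac * W.shape[0]))
    salient = np.argsort(act_mag)[-k:]                 # most salient input channels

    s = np.ones(W.shape[0], dtype=np.float32)
    s[salient] = scale                                 # scale salient channels up

    W_scaled = W * s[:, None]
    qmax = 2 ** (n_bits - 1) - 1
    # Quantization step shared per output channel (column); the scaled-up salient
    # rows therefore get relatively finer resolution once divided back by s.
    S = np.abs(W_scaled).max(axis=0, keepdims=True) / qmax
    qW = np.clip(np.round(W_scaled / S), -qmax - 1, qmax)

    # Return the effective dequantized weights; at runtime the 1/s factor is
    # applied to the activations instead (fused into the preceding LayerNorm).
    return (qW * S) / s[:, None]

X = np.random.randn(256, 64).astype(np.float32)
W = np.random.randn(64, 128).astype(np.float32)
W_hat = awq_quantize(W, X)
print(np.abs(X @ W_hat - X @ W).max())   # reconstruction error of the quantized layer
```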
Quantization Performance Impact
Accuracy vs Bit-width
- High variance across models and quantization levels
- INT8: Generally acceptable accuracy loss
- INT4: Requires careful techniques (AWQ, etc.)
- INT3/INT2: Significant accuracy challenges
Rounding Schemes Impact
| Scheme | Accuracy |
|---|---|
| Nearest | 52.29% |
| Stochastic | 52.06% ± 5.52% |
| Stochastic (best) | 63.06% |
| Ceil/Floor | 0.10% |
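A sketch of the mechanics behind nearest vs. stochastic rounding (the accuracy numbers above come from the lecture; this only illustrates why stochastic rounding is unbiased):

```python
import numpy as np

def round_nearest(x):
    return np.round(x)

def round_stochastic(x, rng):
    # Round up with probability equal to the fractional part, so E[rounded] = x
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
x = np.full(100_000, 2.3)
print(round_nearest(x).mean())          # 2.0: always snaps to the nearest level (biased)
print(round_stochastic(x, rng).mean())  # ~2.3: unbiased in expectation
```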
Summary
Quantization is essential for efficient LLM serving, providing:
- Memory reduction (2-8x compression)
- Energy savings (quadratic with bit-width)
- Inference speedup through specialized hardware
Key techniques for LLMs:
- Handle outliers through mixed precision or smoothing
- Choose appropriate granularity (per-tensor vs per-channel)
- Balance accuracy vs efficiency based on deployment requirements
Modern approaches (LLM.int8(), SmoothQuant, AWQ) address LLM-specific challenges while maintaining practical deployability.