Sparsity and Pruning in LLM Serving Systems
Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, instructed by Prof. Baris Kasikci with TA Kan Zhu.
Introduction
Types of Sparsity
- Weights: Model parameter sparsity
- Activations: Runtime computation sparsity
- KV-Cache: Attention cache sparsity
Sparsity Enables Pruning
Core Concepts
- Pruning: Removing synapses, neurons, weights, or reducing hidden dimension size
- Granularity matters: Different levels of pruning have different accuracy impacts
- Pruning + fine-tuning: Common approach to recover accuracy
- Can sometimes achieve higher accuracy than the original model
Accuracy Impact
- Pruning alone causes gradual accuracy degradation
- Pruning + fine-tuning maintains accuracy better across higher sparsity levels
- Sharp accuracy drop occurs around 80-90% sparsity even with fine-tuning
Weight Sparsity Approaches
Magnitude-Based Weight Pruning
- Method: Remove weights based on their magnitude
- Optimal Brain Damage: Theoretical foundation
- Shows that removing the weights with the least effect on the loss is optimal
- Requires computing (an approximation of) the Hessian of the loss with respect to the weights (expensive)
- Weight magnitude serves as a simple proxy (a minimal sketch follows below)
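A minimal sketch of global magnitude pruning along these lines (the function name and threshold-selection details are illustrative, not from a specific paper):

import torch

def magnitude_prune(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the `sparsity` fraction of weights with the smallest magnitude,
    # using |w| as a cheap proxy for the Hessian-based saliency of Optimal Brain Damage.
    k = int(W.numel() * sparsity)
    if k == 0:
        return W
    threshold = W.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return W * (W.abs() > threshold)                   # keep only larger-magnitude weights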
The Lottery Ticket Hypothesis
Core Hypothesis: "In a large, randomly initialized neural network, there exist small sparse subnetworks - the 'winning tickets' - that, when trained from scratch (with their original initial weights), can match the full model's performance."
Key Insights:
- Over-parametrization is useful - gives many chances for good initializations
- Explains why pruning works: winning tickets exist from the start
- Can prune after training because the winning subnetworks were already present at initialization (a prune-and-rewind sketch follows below)
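A hedged sketch of the iterative magnitude-pruning-and-rewinding loop the hypothesis suggests (the `train_fn` callback, names, and schedule are placeholders, not the paper's exact procedure):

import torch

def find_winning_ticket(model, train_fn, prune_frac=0.2, rounds=5):
    # Keep the original initialization so surviving weights can be rewound after each round.
    init = {n: p.detach().clone() for n, p in model.named_parameters()}
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                            # train with the current mask applied
        for n, p in model.named_parameters():
            alive = p.detach().abs()[masks[n] > 0]        # magnitudes of surviving weights
            k = int(alive.numel() * prune_frac)
            if k > 0:
                thresh = alive.kthvalue(k).values
                masks[n][p.detach().abs() <= thresh] = 0  # prune the lowest-magnitude survivors
            with torch.no_grad():
                p.copy_(init[n] * masks[n])               # rewind survivors to their initial values
    return masks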
Weight Sparsity Types
Structured Sparsity
- Removes entire pre-defined blocks of weights
- Examples: complete rows, columns, channels, attention heads, layers
- Easy to implement efficiently
Semi-structured Sparsity
- 2:4 pattern: Two non-zero weights out of every 4 weights
- Hardware-friendly (NVIDIA Ampere support)
- Balances flexibility and efficiency (a 2:4 pruning sketch follows below)
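A minimal sketch of enforcing the 2:4 pattern offline (assumes the input dimension is a multiple of 4; `prune_2_4` is an illustrative name, and a real deployment would hand the result to NVIDIA's sparse tensor-core format):

import torch

def prune_2_4(W: torch.Tensor) -> torch.Tensor:
    # In every contiguous group of 4 weights along the input dimension,
    # keep only the 2 with the largest magnitude (the Ampere 2:4 pattern).
    C_out, C_in = W.shape
    groups = W.reshape(C_out, C_in // 4, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices             # 2 largest per group of 4
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
    return (groups * mask).reshape(C_out, C_in)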
Unstructured Sparsity
- Maximum flexibility in weight removal
- Hard to implement efficiently due to irregular patterns
Advanced Pruning Techniques
Wanda (Pruning by Weights and Activations)
Key Features:
- Prunes weights on a per-output basis
- Uses product of weight magnitudes and input activation norms
- No retraining required
Algorithm:
import torch

def prune(W, X, s):
    # W: (C_out, C_in) weight matrix; X: (N, C_in) calibration inputs; s: target sparsity
    metric = W.abs() * X.norm(p=2, dim=0)             # Wanda pruning metric
    _, sorted_idx = torch.sort(metric, dim=1)         # sort per output row (ascending)
    pruned_idx = sorted_idx[:, :int(W.shape[1] * s)]  # lowest-scoring indices to prune
    W.scatter_(dim=1, index=pruned_idx, value=0)      # zero out pruned weights in place
    return W
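For example, pruning a single projection with a small calibration batch (the shapes here are illustrative):

W = torch.randn(4096, 11008)      # (C_out, C_in) weight of an MLP projection
X = torch.randn(512, 11008)       # calibration activations, (tokens, C_in)
W = prune(W, X, s=0.5)            # 50% unstructured sparsity per output row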
Performance:
- Comparable to SparseGPT but simpler (no intensive weight updates)
- Maintains good accuracy across different sparsity levels (50%, 4:8, 2:4)
DEJAVU: Contextual Sparsity
Core Concept: Input-dependent sparsity patterns
- Dynamically selects which attention heads and FFN parameters to activate
- Query-aware: Sparsity patterns adapt to current input context
Key Innovation:
- Uses lookahead predictors:
- Attention layer at block k → predicts MLP sparsity at block k
- MLP at block k → predicts attention sparsity at block k+1
- Implemented via hardware-aware sparse matrix multiplication (a toy predictor sketch follows below)
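A toy sketch of what a lookahead predictor can look like: a small low-rank MLP that scores attention heads from the hidden state and keeps only the top-scoring ones (class name, dimensions, and the top-k policy are illustrative, not DEJAVU's exact design):

import torch
import torch.nn as nn

class HeadPredictor(nn.Module):
    # Small low-rank MLP that predicts which attention heads to activate for the current token.
    def __init__(self, d_model=4096, rank=256, n_heads=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, rank), nn.ReLU(), nn.Linear(rank, n_heads)
        )

    def forward(self, hidden, top_k=8):
        scores = self.proj(hidden)                   # (batch, n_heads) head relevance scores
        return scores.topk(top_k, dim=-1).indices    # indices of heads to actually compute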
Results:
- No accuracy drop until 75% sparsity
- 1.8-6x speedup compared to state-of-the-art
- Preserves long-dependency task performance
KV Cache Sparsity
Sparsity in Attention
- Attention matrices are >95% sparse at inference time
- Only <1% of attention weights are relatively large
- Motivates KV cache compression techniques
Quest: Query-Aware Sparsity
Problem with Previous Methods:
- Evict "unimportant" KV-cache entries based on historical attention information
- Challenge: evicted tokens may turn out to be important for future queries
Quest Solution:
- Preserve all KV cache in storage
- Reduce memory movement by selecting only top-K relevant pages
- Two-stage process:
- Stage 1: Identify most relevant KV cache pages for current query
- Stage 2: Compute sparse attention only on the selected pages (a simplified page-selection sketch follows below)
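A simplified sketch of the page-selection idea: bound each page's possible attention score using per-page element-wise min/max of its keys, then keep only the top-K pages for the sparse attention pass (names and shapes are illustrative):

import torch

def select_pages(query, key_pages, top_k):
    # query: (d,); key_pages: (num_pages, page_size, d)
    page_min = key_pages.min(dim=1).values                  # element-wise min of keys per page
    page_max = key_pages.max(dim=1).values                  # element-wise max of keys per page
    # Upper bound on q.k for any key in a page: per channel, take whichever
    # extreme maximizes the product with the query, then sum over channels.
    upper = torch.maximum(query * page_min, query * page_max).sum(dim=-1)
    return upper.topk(min(top_k, upper.numel())).indices    # pages to load for sparse attention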
Performance:
- 7.03x speedup at 32k sequence length with 2k token budget
- Maintains high accuracy on long-dependency tasks
- Quest estimation policy aligns closely with oracle selection
DeepSeek Multi-Head Latent Attention (MLA)
Key Innovation: Matrix-level KV cache compression
- 15x smaller KV cache in DeepSeek-V2 vs V1
- 5.7x throughput improvement
Approach:
- Store compressed latent vector instead of full keys & values
- 7% of original size through learned matrix projections
- Exploits redundancy in high-dimensional key/value vectors (a compression sketch follows below)
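A minimal sketch of the compression idea: cache one small latent vector per token and reconstruct keys and values with learned up-projections (dimensions are illustrative, and the decoupled RoPE path discussed below is omitted):

import torch
import torch.nn as nn

class LatentKV(nn.Module):
    # Cache a compressed latent c_kv per token instead of full per-head keys and values.
    def __init__(self, d_model=5120, d_latent=512, d_head=128, n_heads=40):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)            # compress hidden state -> latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head)   # reconstruct keys from the latent
        self.up_v = nn.Linear(d_latent, n_heads * d_head)   # reconstruct values from the latent

    def forward(self, hidden):
        c_kv = self.down(hidden)     # (..., d_latent): this is all that gets cached
        return c_kv, self.up_k(c_kv), self.up_v(c_kv)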
Integration with RoPE:
- Standard RoPE makes compression harder (position-dependent rotations)
- MLA trick: keep the compressed Q/K components unrotated and concatenate a small decoupled RoPE component that carries the position information
KV Cache Comparison
| Attention Mechanism | KV Cache per Token | Capability |
|---|---|---|
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA | $(d_c + d_h^R) l \approx \frac{9}{2} d_h l$ | Stronger |

($n_h$: attention heads, $n_g$: GQA groups, $d_h$: head dimension, $d_c$: MLA latent dimension, $d_h^R$: decoupled RoPE dimension, $l$: number of layers)
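To make the table concrete, a small arithmetic sketch of per-token KV-cache size in FP16 (the dimensions are hypothetical, with $d_c = 4 d_h$ and $d_h^R = d_h / 2$ as in the MLA row):

n_h, n_g, d_h, l = 32, 8, 128, 32        # heads, GQA groups, head dim, layers (hypothetical)
d_c, d_h_rope = 4 * d_h, d_h // 2        # MLA latent dim and decoupled-RoPE dim
bytes_fp16 = 2

mha = 2 * n_h * d_h * l * bytes_fp16     # 524,288 bytes per token
gqa = 2 * n_g * d_h * l * bytes_fp16     # 131,072 bytes per token
mqa = 2 * d_h * l * bytes_fp16           #  16,384 bytes per token
mla = (d_c + d_h_rope) * l * bytes_fp16  #  36,864 bytes per token (9/2 * d_h * l elements)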
Activation Sparsity
Core Concept
- Skip computation dynamically based on activation values
- Focus on near-zero values at activation-function outputs
- When an activation is zero, the corresponding weight columns are never needed (see the sketch below)
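A toy sketch of why zero activations let work be skipped: in a matrix-vector product, any weight column paired with a zero activation never has to be read (dense PyTorch indexing shown only to illustrate the idea; real kernels skip the memory loads):

import torch

def sparse_matvec(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Compute W @ x while reading only the weight columns whose activation is non-zero.
    nz = x.nonzero(as_tuple=True)[0]   # indices of non-zero activations
    return W[:, nz] @ x[nz]            # skipped columns are never touched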
Motivation
Activation distributions in LLMs are centered around zero, making magnitude-based pruning effective across:
- MLP up/down projections
- Attention Q, K, V weights
- Attention output weights
TEAL (Threshold-based Activation Pruning)
Method:
- During decoding, threshold low-magnitude activations to 0
- Skip loading the associated weight channels from memory into registers
- Provides speedup by reducing memory movement (a thresholding sketch follows below)
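A hedged sketch of the thresholding step: calibrate a magnitude cutoff for a target sparsity level, then zero activations below it at decode time (function names are illustrative, not TEAL's API):

import torch

def calibrate_threshold(acts: torch.Tensor, sparsity: float) -> float:
    # Pick the magnitude below which `sparsity` of the calibration activations fall.
    return acts.abs().flatten().quantile(sparsity).item()

def apply_threshold(x: torch.Tensor, threshold: float) -> torch.Tensor:
    # Zero low-magnitude activations; downstream kernels can then skip the matching weight channels.
    return torch.where(x.abs() > threshold, x, torch.zeros_like(x))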
Performance:
- 25% sparsity: Minimal accuracy loss, 1.2-1.3x speedup
- 50% sparsity: Small accuracy degradation, 1.6-1.8x speedup
- Works across different model sizes (8B to 70B parameters)
Implementation Considerations
Hardware Support
- NVIDIA Ampere 2:4 sparsity: Hardware acceleration for semi-structured patterns
- Structured formats: Enable efficient compressed storage and computation
- Memory bandwidth: Key bottleneck addressed by activation sparsity
Practical Deployment
- Zero-shot methods (Wanda, TEAL) require no retraining
- Fine-tuning approaches offer better accuracy recovery
- Hardware-aware implementation crucial for realizing speedups
- Granularity trade-offs between flexibility and efficiency
Key Takeaways
- Sparsity is pervasive in LLMs across weights, activations, and attention
- Multiple sparsity types serve different optimization goals
- Contextual/dynamic sparsity often outperforms static approaches
- Hardware support is crucial for practical speedups
- Accuracy-efficiency trade-offs can be managed through careful technique selection
- Combination approaches (weight + activation sparsity) show promise for maximum efficiency