Sparsity and Pruning in LLM Serving Systems
Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, instructed by Prof. Baris Kasikci with TA Kan Zhu.
Introduction
Types of Sparsity
- Weights: Model parameter sparsity
- Activations: Runtime computation sparsity
- KV-Cache: Attention cache sparsity
Sparsity Enables Pruning
Core Concepts
- Pruning: Removing synapses, neurons, weights, or reducing hidden dimension size
- Granularity matters: Different levels of pruning have different accuracy impacts
- Pruning + fine-tuning: Common approach to recover accuracy
- Can sometimes achieve higher accuracy than the original model
Accuracy Impact
- Pruning alone causes gradual accuracy degradation
- Pruning + fine-tuning maintains accuracy better across higher sparsity levels
- Sharp accuracy drop occurs around 80-90% sparsity even with fine-tuning
Weight Sparsity Approaches
Magnitude-Based Weight Pruning
- Method: Remove weights based on their magnitude
- Optimal Brain Damage: Theoretical foundation
- Shows that removing the weights with the least effect on the loss is optimal
- Requires computing (an approximation of) the Hessian of the loss with respect to the weights (expensive)
- Weight magnitude serves as a simple proxy (a minimal sketch follows below)
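A minimal sketch of global magnitude pruning along these lines (the function name and threshold-selection details are illustrative, not from a specific paper):

import torch

def magnitude_prune(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the `sparsity` fraction of weights with the smallest magnitude,
    # using |w| as a cheap proxy for the Hessian-based saliency of Optimal Brain Damage.
    k = int(W.numel() * sparsity)
    if k == 0:
        return W
    threshold = W.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return W * (W.abs() > threshold)                   # keep only larger-magnitude weights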
The Lottery Ticket Hypothesis
Core Hypothesis: "In a large, randomly initialized neural network, there exist small sparse subnetworks - the 'winning tickets' - that, when trained from scratch (with their original initial weights), can match the full model's performance."
Key Insights:
- Over-parametrization is useful - gives many chances for good initializations
- Explains why pruning works: winning tickets exist from the start
- Can prune after training because the winning subnetworks were already present at initialization (a prune-and-rewind sketch follows below)
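A hedged sketch of the iterative magnitude-pruning-and-rewinding loop the hypothesis suggests (the `train_fn` callback, names, and schedule are placeholders, not the paper's exact procedure):

import torch

def find_winning_ticket(model, train_fn, prune_frac=0.2, rounds=5):
    # Keep the original initialization so surviving weights can be rewound after each round.
    init = {n: p.detach().clone() for n, p in model.named_parameters()}
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                            # train with the current mask applied
        for n, p in model.named_parameters():
            alive = p.detach().abs()[masks[n] > 0]        # magnitudes of surviving weights
            k = int(alive.numel() * prune_frac)
            if k > 0:
                thresh = alive.kthvalue(k).values
                masks[n][p.detach().abs() <= thresh] = 0  # prune the lowest-magnitude survivors
            with torch.no_grad():
                p.copy_(init[n] * masks[n])               # rewind survivors to their initial values
    return masks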
Weight Sparsity Types
Structured Sparsity
- Removes entire pre-defined blocks of weights
- Examples: complete rows, columns, channels, attention heads, layers
- Easy to implement efficiently
Semi-structured Sparsity
- 2:4 pattern: Two non-zero weights out of every 4 weights
- Hardware-friendly (NVIDIA Ampere support)
- Balances flexibility and efficiency (a 2:4 pruning sketch follows below)
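A minimal sketch of enforcing the 2:4 pattern offline (assumes the input dimension is a multiple of 4; `prune_2_4` is an illustrative name, and a real deployment would hand the result to NVIDIA's sparse tensor-core format):

import torch

def prune_2_4(W: torch.Tensor) -> torch.Tensor:
    # In every contiguous group of 4 weights along the input dimension,
    # keep only the 2 with the largest magnitude (the Ampere 2:4 pattern).
    C_out, C_in = W.shape
    groups = W.reshape(C_out, C_in // 4, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices             # 2 largest per group of 4
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
    return (groups * mask).reshape(C_out, C_in)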
Unstructured Sparsity
- Maximum flexibility in weight removal
- Hard to implement efficiently due to irregular patterns
Advanced Pruning Techniques
Wanda (Pruning by Weights and Activations)
Key Features:
- Prunes weights on a per-output basis
- Uses product of weight magnitudes and input activation norms
- No retraining required
Algorithm:
import torch

def prune(W, X, s):
    # W: (C_out, C_in) weight matrix; X: (N, C_in) calibration inputs; s: target sparsity
    metric = W.abs() * X.norm(p=2, dim=0)             # Wanda pruning metric
    _, sorted_idx = torch.sort(metric, dim=1)         # sort per output row (ascending)
    pruned_idx = sorted_idx[:, :int(W.shape[1] * s)]  # lowest-scoring indices to prune
    W.scatter_(dim=1, index=pruned_idx, value=0)      # zero out pruned weights in place
    return W
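For example, pruning a single projection with a small calibration batch (the shapes here are illustrative):

W = torch.randn(4096, 11008)      # (C_out, C_in) weight of an MLP projection
X = torch.randn(512, 11008)       # calibration activations, (tokens, C_in)
W = prune(W, X, s=0.5)            # 50% unstructured sparsity per output row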
Performance:
- Comparable to SparseGPT but simpler (no intensive weight updates)
- Maintains good accuracy across different sparsity levels (50%, 4:8, 2:4)
DEJAVU: Contextual Sparsity
Core Concept: Input-dependent sparsity patterns
- Dynamically selects which attention heads and FFN parameters to activate
- Query-aware: Sparsity patterns adapt to current input context
Key Innovation:
- Uses lookahead predictors:
- Attention layer at block k → predicts MLP sparsity at block k
- MLP at block k → predicts attention sparsity at block k+1
- Implemented via hardware-aware sparse matrix multiplication (a toy predictor sketch follows below)
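A toy sketch of what a lookahead predictor can look like: a small low-rank MLP that scores attention heads from the hidden state and keeps only the top-scoring ones (class name, dimensions, and the top-k policy are illustrative, not DEJAVU's exact design):

import torch
import torch.nn as nn

class HeadPredictor(nn.Module):
    # Small low-rank MLP that predicts which attention heads to activate for the current token.
    def __init__(self, d_model=4096, rank=256, n_heads=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, rank), nn.ReLU(), nn.Linear(rank, n_heads)
        )

    def forward(self, hidden, top_k=8):
        scores = self.proj(hidden)                   # (batch, n_heads) head relevance scores
        return scores.topk(top_k, dim=-1).indices    # indices of heads to actually compute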
Results:
- No accuracy drop until 75% sparsity
- 1.8-6x speedup compared to state-of-the-art
- Preserves long-dependency task performance
KV Cache Sparsity
Sparsity in Attention
- Attention matrices are >95% sparse at inference time
- Only <1% of attention weights are relatively large
- Motivates KV cache compression techniques
Quest: Query-Aware Sparsity
Problem with Previous Methods:
- Evict "unimportant" KV-cache entries based on historical attention information
- Challenge: evicted tokens may turn out to be important for future queries
Quest Solution:
- Preserve all KV cache in storage
- Reduce memory movement by selecting only top-K relevant pages
- Two-stage process:
- Stage 1: Identify most relevant KV cache pages for current query
- Stage 2: Compute sparse attention only on the selected pages (a simplified page-selection sketch follows below)
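A simplified sketch of the page-selection idea: bound each page's possible attention score using per-page element-wise min/max of its keys, then keep only the top-K pages for the sparse attention pass (names and shapes are illustrative):

import torch

def select_pages(query, key_pages, top_k):
    # query: (d,); key_pages: (num_pages, page_size, d)
    page_min = key_pages.min(dim=1).values                  # element-wise min of keys per page
    page_max = key_pages.max(dim=1).values                  # element-wise max of keys per page
    # Upper bound on q.k for any key in a page: per channel, take whichever
    # extreme maximizes the product with the query, then sum over channels.
    upper = torch.maximum(query * page_min, query * page_max).sum(dim=-1)
    return upper.topk(min(top_k, upper.numel())).indices    # pages to load for sparse attention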
Performance:
- 7.03x speedup at 32k sequence length with 2k token budget
- Maintains high accuracy on long-dependency tasks
- Quest estimation policy aligns closely with oracle selection
DeepSeek Multi-Head Latent Attention (MLA)
Key Innovation: Matrix-level KV cache compression
- 15x smaller KV cache in DeepSeek-V2 vs V1
- 5.7x throughput improvement
Approach:
- Store compressed latent vector instead of full keys & values
- 7% of original size through learned matrix projections
- Exploits redundancy in high-dimensional key/value vectors (a compression sketch follows below)
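A minimal sketch of the compression idea: cache one small latent vector per token and reconstruct keys and values with learned up-projections (dimensions are illustrative, and the decoupled RoPE path discussed below is omitted):

import torch
import torch.nn as nn

class LatentKV(nn.Module):
    # Cache a compressed latent c_kv per token instead of full per-head keys and values.
    def __init__(self, d_model=5120, d_latent=512, d_head=128, n_heads=40):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)            # compress hidden state -> latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head)   # reconstruct keys from the latent
        self.up_v = nn.Linear(d_latent, n_heads * d_head)   # reconstruct values from the latent

    def forward(self, hidden):
        c_kv = self.down(hidden)     # (..., d_latent): this is all that gets cached
        return c_kv, self.up_k(c_kv), self.up_v(c_kv)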
Integration with RoPE:
- Standard RoPE makes compression harder (position-dependent rotations)
- MLA trick: keep the compressed Q/K components unrotated and concatenate a small decoupled RoPE component that carries the position information
KV Cache Comparison
| Attention Mechanism | KV Cache per Token | Capability |
|---|---|---|
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA | $(d_c + d_h^R) l \approx \frac{9}{2} d_h l$ | Stronger |

($n_h$: attention heads, $n_g$: GQA groups, $d_h$: head dimension, $d_c$: MLA latent dimension, $d_h^R$: decoupled RoPE dimension, $l$: number of layers)
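To make the table concrete, a small arithmetic sketch of per-token KV-cache size in FP16 (the dimensions are hypothetical, with $d_c = 4 d_h$ and $d_h^R = d_h / 2$ as in the MLA row):

n_h, n_g, d_h, l = 32, 8, 128, 32        # heads, GQA groups, head dim, layers (hypothetical)
d_c, d_h_rope = 4 * d_h, d_h // 2        # MLA latent dim and decoupled-RoPE dim
bytes_fp16 = 2

mha = 2 * n_h * d_h * l * bytes_fp16     # 524,288 bytes per token
gqa = 2 * n_g * d_h * l * bytes_fp16     # 131,072 bytes per token
mqa = 2 * d_h * l * bytes_fp16           #  16,384 bytes per token
mla = (d_c + d_h_rope) * l * bytes_fp16  #  36,864 bytes per token (9/2 * d_h * l elements)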
Activation Sparsity
Core Concept
- Skip computation dynamically based on activation values
- Focus on near-zero values at activation-function outputs
- When an activation is zero, the corresponding weight columns are never needed (see the sketch below)
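A toy sketch of why zero activations let work be skipped: in a matrix-vector product, any weight column paired with a zero activation never has to be read (dense PyTorch indexing shown only to illustrate the idea; real kernels skip the memory loads):

import torch

def sparse_matvec(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Compute W @ x while reading only the weight columns whose activation is non-zero.
    nz = x.nonzero(as_tuple=True)[0]   # indices of non-zero activations
    return W[:, nz] @ x[nz]            # skipped columns are never touched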
Motivation
Activation distributions in LLMs are centered around zero, making magnitude-based pruning effective across:
- MLP up/down projections
- Attention Q, K, V weights
- Attention output weights
TEAL (Threshold-based Activation Pruning)
Method:
- During decoding, threshold low-magnitude activations to 0
- Skip loading the associated weight channels from memory into registers
- Provides speedup by reducing memory movement (a thresholding sketch follows below)
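A hedged sketch of the thresholding step: calibrate a magnitude cutoff for a target sparsity level, then zero activations below it at decode time (function names are illustrative, not TEAL's API):

import torch

def calibrate_threshold(acts: torch.Tensor, sparsity: float) -> float:
    # Pick the magnitude below which `sparsity` of the calibration activations fall.
    return acts.abs().flatten().quantile(sparsity).item()

def apply_threshold(x: torch.Tensor, threshold: float) -> torch.Tensor:
    # Zero low-magnitude activations; downstream kernels can then skip the matching weight channels.
    return torch.where(x.abs() > threshold, x, torch.zeros_like(x))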
Performance:
- 25% sparsity: Minimal accuracy loss, 1.2-1.3x speedup
- 50% sparsity: Small accuracy degradation, 1.6-1.8x speedup
- Works across different model sizes (8B to 70B parameters)
Implementation Considerations
Hardware Support
- NVIDIA Ampere 2:4 sparsity: Hardware acceleration for semi-structured patterns
- Structured formats: Enable efficient compressed storage and computation
- Memory bandwidth: Key bottleneck addressed by activation sparsity
Practical Deployment
- Zero-shot methods (Wanda, TEAL) require no retraining
- Fine-tuning approaches offer better accuracy recovery
- Hardware-aware implementation crucial for realizing speedups
- Granularity trade-offs between flexibility and efficiency
Key Takeaways
- Sparsity is pervasive in LLMs across weights, activations, and attention
- Multiple sparsity types serve different optimization goals
- Contextual/dynamic sparsity often outperforms static approaches
- Hardware support is crucial for practical speedups
- Accuracy-efficiency trade-offs can be managed through careful technique selection
- Combination approaches (weight + activation sparsity) show promise for maximum efficiency