Sparsity and Pruning in LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu.

Introduction

Types of Sparsity

Sparsity Enables Pruning

Core Concepts

Accuracy Impact

Weight Sparsity Approaches

Magnitude-Based Weight Pruning

The Lottery Ticket Hypothesis

Core Hypothesis: "In a large, randomly initialized neural network, there exist small sparse subnetworks - the 'winning tickets' - that, when trained from scratch (with their original initial weights), can match the full model's performance."
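Below is a minimal sketch of the iterative magnitude pruning loop typically used to find winning tickets: train, prune the lowest-magnitude surviving weights, rewind the remaining weights to their initial values, and repeat. The `train` callable, the 20% per-round prune fraction, and the three rounds are illustrative assumptions, not values from the original paper.

import copy
import torch

def find_winning_ticket(model, train, rounds=3, prune_frac=0.2):
    # Remember the original random initialization so we can rewind to it later
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)  # hypothetical trainer that keeps masked weights at zero
        for name, p in model.named_parameters():
            # Score surviving weights by magnitude; already-pruned positions are excluded
            scores = p.abs().masked_fill(masks[name] == 0, float("inf"))
            k = int(prune_frac * int(masks[name].sum()))
            drop = torch.topk(scores.view(-1), k, largest=False).indices
            masks[name].view(-1)[drop] = 0
        model.load_state_dict(init_state)  # rewind surviving weights to their initial values
    return masks  # sparse "winning ticket": retrain from init_state with these masks applied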

Key Insights:

Weight Sparsity Types

Structured Sparsity

Semi-structured Sparsity

Unstructured Sparsity

Advanced Pruning Techniques

Wanda (Weight and Activation Pruning)

Key Features:

Algorithm:

import torch

def prune(W, X, s):
    # Wanda pruning metric: weight magnitude scaled by the L2 norm of the
    # corresponding input activation channel (X has shape [tokens, C_in])
    metric = W.abs() * X.norm(p=2, dim=0)
    _, sorted_idx = torch.sort(metric, dim=1)  # sort per output row, ascending
    pruned_idx = sorted_idx[:, :int(W.shape[1] * s)]  # lowest-scoring fraction s per row
    W.scatter_(dim=1, index=pruned_idx, value=0)  # zero out selected weights in place
    return W
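A quick usage sketch with random tensors (sizes chosen arbitrarily), pruning 50% of the weights in every output row:

W = torch.randn(4096, 4096)            # weight matrix [C_out, C_in]
X = torch.randn(2048, 4096)            # calibration activations [tokens, C_in]
W_sparse = prune(W, X, s=0.5)          # prune the lowest-scoring 50% of weights per row
print((W_sparse == 0).float().mean())  # ~0.5 sparsity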

Performance:

DEJAVU: Contextual Sparsity

Core Concept: Input-dependent sparsity patterns
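As a rough illustration of contextual sparsity (a simplified sketch, not DEJAVU's actual implementation), a small hypothetical predictor scores the MLP neurons for each input, and only the top-k rows/columns of the up/down projections are computed:

import torch

def contextual_sparse_mlp(x, W_up, W_down, predictor, k):
    # x: [d_model] token hidden state; W_up: [d_ff, d_model]; W_down: [d_model, d_ff]
    scores = predictor(x)                   # predicted importance of each of the d_ff neurons
    active = torch.topk(scores, k).indices  # keep only the top-k neurons for this input
    h = torch.relu(W_up[active] @ x)        # compute just the selected up-projection rows
    return W_down[:, active] @ h            # and the matching down-projection columns

# Illustrative usage with a small low-rank predictor (sizes are arbitrary)
d_model, d_ff = 1024, 4096
predictor = torch.nn.Sequential(
    torch.nn.Linear(d_model, 256), torch.nn.ReLU(), torch.nn.Linear(256, d_ff))
x, W_up, W_down = torch.randn(d_model), torch.randn(d_ff, d_model), torch.randn(d_model, d_ff)
out = contextual_sparse_mlp(x, W_up, W_down, predictor, k=d_ff // 8)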

Key Innovation:

Results:

KV Cache Sparsity

Sparsity in Attention

Quest: Query-Aware Sparsity

Problem with Previous Methods:

Quest Solution:
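A simplified sketch of query-aware page selection in the spirit of Quest (the tensor layout and page budget are illustrative assumptions): per-page channel-wise min/max key metadata gives an upper bound on each page's attention score for the current query, and only the highest-scoring pages are used for exact attention.

import torch

def select_pages(q, key_pages, budget):
    # q: [d_h] current query; key_pages: list of [page_len, d_h] cached key tensors
    bounds = []
    for K in key_pages:
        k_min, k_max = K.min(dim=0).values, K.max(dim=0).values  # per-channel page metadata
        # Upper bound on q @ k for any key in this page: pick the extreme that maximizes each term
        bounds.append(torch.maximum(q * k_min, q * k_max).sum())
    scores = torch.stack(bounds)
    return torch.topk(scores, min(budget, len(key_pages))).indices  # pages for exact attention

# Illustrative usage: 64 pages of 16 tokens each, head dim 128, keep 8 pages
pages = [torch.randn(16, 128) for _ in range(64)]
selected = select_pages(torch.randn(128), pages, budget=8)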

Performance:

DeepSeek Multi-Head Latent Attention (MLA)

Key Innovation: Matrix-level KV cache compression

Approach:

Integration with RoPE:
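A minimal sketch of what MLA caches per token, following the $d_c$ / $d_h^R$ notation used below; the projection matrices here are random stand-ins, not DeepSeek weights:

import torch

d_model, d_c, d_hR = 5120, 512, 64        # illustrative sizes for the d_c / d_h^R notation below
W_dkv = torch.randn(d_c, d_model) * 0.02  # down-projection to the shared compressed KV latent
W_kR = torch.randn(d_hR, d_model) * 0.02  # decoupled projection for the RoPE-carrying key

def mla_cache_entry(h, rope):
    # h: [d_model] hidden state of one token; rope: function applying rotary position embedding
    c_kv = W_dkv @ h         # compressed latent cached in place of full per-head K and V;
                             # per-head K/V are recovered later via up-projections (which can be
                             # absorbed into the query and output projections)
    k_rope = rope(W_kR @ h)  # small decoupled RoPE key, also cached and shared across heads
    return c_kv, k_rope      # (d_c + d_h^R) cached values per token per layer

c_kv, k_rope = mla_cache_entry(torch.randn(d_model), rope=lambda t: t)  # identity RoPE stand-in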

KV Cache Comparison

| Attention Mechanism | KV Cache per Token | Capability |
| --- | --- | --- |
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA | $(d_c + d_h^R)\, l \approx \frac{9}{2} d_h l$ | Stronger |
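A back-of-the-envelope comparison using the formulas above, for a hypothetical fp16 model with $n_h = 32$, $n_g = 8$, $d_h = 128$, $l = 32$, and DeepSeek-V2-style MLA dimensions $d_c = 4 d_h$, $d_h^R = d_h / 2$ (the configuration is an illustrative assumption, not a specific released model):

n_h, n_g, d_h, l, B = 32, 8, 128, 32, 2  # assumed config: heads, KV groups, head dim, layers, fp16 bytes
d_c, d_hR = 4 * d_h, d_h // 2            # DeepSeek-V2-style MLA dimensions (d_c = 512, d_h^R = 64)
mha = 2 * n_h * d_h * l * B              # 524,288 bytes = 512 KiB per token
gqa = 2 * n_g * d_h * l * B              # 131,072 bytes = 128 KiB per token
mqa = 2 * d_h * l * B                    #  16,384 bytes =  16 KiB per token
mla = (d_c + d_hR) * l * B               #  36,864 bytes =  36 KiB per token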

Activation Sparsity

Core Concept

Motivation

Activation distributions in LLMs are concentrated around zero, so magnitude-based pruning of low-magnitude activations is effective at the inputs to:
  - MLP up/down projections
  - Attention Q, K, V projections
  - Attention output projections

TEAL (Threshold-based Activation Pruning)

Method:
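A simplified sketch of threshold-based activation pruning in the spirit of TEAL (the calibration routine and the 50% target sparsity are illustrative, not TEAL's exact procedure): a per-tensor threshold is calibrated offline from activation statistics, then low-magnitude activations are zeroed at inference time.

import torch

def calibrate_threshold(calib_acts, target_sparsity):
    # Choose the magnitude below which `target_sparsity` of calibration activations fall
    return torch.quantile(calib_acts.abs().flatten(), target_sparsity)

def sparsify(x, threshold):
    # Zero out low-magnitude activations; downstream matmuls can skip the zeroed channels
    return torch.where(x.abs() < threshold, torch.zeros_like(x), x)

calib = torch.randn(1024, 4096)   # hypothetical calibration activations for one projection input
t = calibrate_threshold(calib, target_sparsity=0.5)
x_sparse = sparsify(torch.randn(4096), t)  # roughly half of the entries become zero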

Performance:

Implementation Considerations

Hardware Support

Practical Deployment

Key Takeaways

  1. Sparsity is pervasive in LLMs across weights, activations, and attention
  2. Multiple sparsity types serve different optimization goals
  3. Contextual/dynamic sparsity often outperforms static approaches
  4. Hardware support is crucial for practical speedups
  5. Accuracy-efficiency trade-offs can be managed through careful technique selection
  6. Combination approaches (weight + activation sparsity) show promise for maximum efficiency