Parallelism
title: Parallelism in LLM Serving Systems
category: Machine Learning Systems
tags: parallelism, performance, throughput, latency, llm, serving systems, machine learning
description: Overview of parallelism techniques in LLM serving systems, focusing on theoretical foundations and practical applications.
Parallelism in LLM Serving Systems
Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.
Introduction & Motivation
Limits to GPU-based Scaling
Compute Limitations
- GPU improvements have included:
- Number formats (FP32 → FP16 → Int8)
- Specialized instructions (DP4A, HMMA, IMMA)
- Process nodes (28 nm → 5 nm)
- Sparsity support
- Still, single GPU performance remains fundamentally limited.
- Supercomputers can reach exaflop scales, but LLMs continue to push hardware constraints.
Memory Limitations
- Model sizes are growing exponentially: ELMo (94M) → GPT-2 (1.5B) → GPT-3 (175B) → MT-NLG (530B)
- A single GPU cannot hold full model weights or intermediate activations.
Solution: Multi-GPU, Multi-Machine Parallelism
Network Infrastructure
- Intra-node (within machine):
- NVLink 3.0: 600 GB/s
- PCIe 4.0: 32 GB/s
- Inter-node (between machines):
- InfiniBand HDR: 25 GB/s
Goal: Distribute compute and memory across devices efficiently.
Collective Communication Primitives
Key Operations
- AllReduce: Reduces (e.g., sums) data across devices; every device ends up with the full reduced result.
- Broadcast: Sends data from one device to all others.
- AllGather: Each device collects the data shards held by all others.
- ReduceScatter: Reduces data across devices, leaving each device with one shard of the reduced result.
AllReduce can be implemented as ReduceScatter + AllGather, which is bandwidth-optimal.
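To make the decomposition concrete, here is a minimal NumPy sketch that simulates devices as plain Python lists (the `reduce_scatter` and `all_gather` helpers are local stand-ins for illustration, not calls into a real communication library):

```python
import numpy as np

def reduce_scatter(per_device_chunks):
    # After this phase, shard i holds the sum of chunk i across all devices.
    n_chunks = len(per_device_chunks[0])
    return [sum(dev[i] for dev in per_device_chunks) for i in range(n_chunks)]

def all_gather(shards):
    # Every device receives every reduced shard.
    return [list(shards) for _ in shards]

n_dev = 4
# Each "device" holds a gradient split into n_dev equal chunks of length 8.
grads = [np.random.rand(n_dev, 8) for _ in range(n_dev)]

reduced_shards = reduce_scatter([list(g) for g in grads])   # phase 1
per_device_result = all_gather(reduced_shards)              # phase 2

reference = sum(grads)                                      # what a direct AllReduce computes
for result in per_device_result:
    assert np.allclose(np.vstack(result), reference)
```

After phase 1 each device owns the fully reduced version of one chunk; after phase 2 every device holds all reduced chunks, which is exactly the AllReduce result.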
Key Concepts in ML Training/Serving
State Classifications
- Model Parameters: Learned weights (used in both training & serving).
- Gradients: For updating parameters during training.
- Activations: Intermediate results from forward pass (used in both).
- Optimizer State: Momentum, variance, etc. for training.
- KV Cache: Used in serving for autoregressive models to avoid recomputing past tokens.
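As a rough illustration of why the KV cache dominates serving memory at long contexts, here is a back-of-the-envelope size estimate (the 7B-class dimensions below are illustrative assumptions, not taken from the lecture):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values (factor 2), stored per layer, per head, per token, per sequence.
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class model: 32 layers, 32 heads of dim 128, FP16 cache.
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096, batch=8)
print(f"{size / 2**30:.0f} GiB")   # 16 GiB for this configuration
```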
Parallelism Strategies
Goals
- Scale with batch size (data)
- Scale with model size (parameters)
Data Parallelism
Concept
- Each GPU has a full model copy.
- Batches split across GPUs.
- Gradients are aggregated post-backward.
Implementations
Parameter Server (Centralized)
- Gradients sent to central server; updated params broadcast.
- Scalability bottleneck: Central point of failure and bandwidth.
AllReduce-based (Decentralized)
- Peer-to-peer gradient aggregation:
- Ring, Tree, Butterfly, or ReduceScatter + AllGather.
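A minimal sketch of one data-parallel step, with the AllReduce simulated as an in-memory average over replicas (a real system would issue a collective call instead, and the toy gradient stands in for a full backward pass):

```python
import numpy as np

np.random.seed(0)
n_gpus, dim = 4, 16
weights = np.random.rand(dim)                      # identical full copy on every replica
local_batches = [np.random.rand(8, dim) for _ in range(n_gpus)]

def local_gradient(w, batch):
    # Toy quadratic loss: mean of (batch @ w)^2; a real replica would backprop its model copy.
    return 2 * batch.T @ (batch @ w) / len(batch)

local_grads = [local_gradient(weights, b) for b in local_batches]

# "AllReduce": every replica ends up with the gradient averaged over the global batch.
avg_grad = sum(local_grads) / n_gpus

# Each replica applies the identical update, so the model copies stay in sync.
weights = weights - 0.01 * avg_grad
```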
Limitations
- Full model + gradients + optimizer state on each GPU.
- Does not scale to models larger than a single GPU's memory.
Pipeline Parallelism
Concept
- Split model vertically across layers into pipeline stages.
- Each stage runs on a separate GPU.
Execution
- Forward pass: left to right
- Backward pass: right to left
- GPUs exchange activations, not parameters.
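A toy PyTorch sketch of the stage split (kept on a single CPU here; in a real deployment each stage would be placed on its own GPU, e.g. via `stage0.to("cuda:0")`, and microbatches would be interleaved according to the schedules below):

```python
import torch
import torch.nn as nn

# A toy 8-layer stack split into 2 pipeline stages.
layers = [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]
stage0 = nn.Sequential(*layers[:4])   # GPU 0's share of the layers
stage1 = nn.Sequential(*layers[4:])   # GPU 1's share

x = torch.randn(32, 256)              # one microbatch
acts = stage0(x)                      # forward on stage 0
# ...only the activations cross the interconnect, never the parameters...
out = stage1(acts)                    # forward continues on stage 1
```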
Scheduling Strategies
GPipe
- Microbatching to improve utilization.
- Trade-off: More memory needed to store microbatches.
1F1B (One Forward, One Backward)
- Keeps pipeline full during steady state.
- Phases: warm-up → steady state (alternating one forward, one backward) → drain.
Zero Bubble Pipeline (ZBP)
- Splits the backward pass into:
- Activation gradients
- Weight gradients (can be delayed)
- Eliminates pipeline idle time.
Analysis
Bubble Ratio
- $(p - 1)/m$, where $p$ = number of pipeline stages and $m$ = number of microbatches
- Larger $m$ reduces the bubble size.
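For example, with $p = 8$ stages and $m = 32$ microbatches the bubble fraction is roughly $7/32 \approx 22\%$; doubling $m$ to 64 halves it to about 11% (illustrative numbers, not from the lecture).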
Characteristics
Advantages:
- Shards model (less memory per GPU)
- Point-to-point activation communication
Disadvantages:
- Batch-size sensitive
- Pipeline bubbles without careful scheduling
Tensor Parallelism
Concept
- Split model horizontally: partition within layers.
- Each GPU holds part of a layer.
Matrix Ops Decomposition
MLP Example:
$Z = \text{Dropout}(\text{GeLU}(XA)B)$
- Column-split $A$: $Y_i = \text{GeLU}(XA_i)$ → no communication needed
- Row-split $B$: $Z_i = Y_i B_i$ → AllReduce to combine $Z = \sum_i Z_i$
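A NumPy sanity check of this decomposition (dropout omitted, tanh-approximate GeLU, and small illustrative shapes):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (elementwise, so it commutes with a column split)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

np.random.seed(0)
tokens, h, ffn = 4, 8, 32                    # (batch*seq, hidden, ffn hidden)
X = np.random.randn(tokens, h)
A = np.random.randn(h, ffn)                  # first MLP weight
B = np.random.randn(ffn, h)                  # second MLP weight

# Unsharded reference
Z_ref = gelu(X @ A) @ B

# 2-way tensor parallel: A split by columns, B split by rows
A1, A2 = np.split(A, 2, axis=1)
B1, B2 = np.split(B, 2, axis=0)
Z1 = gelu(X @ A1) @ B1                       # GPU 0, no communication needed
Z2 = gelu(X @ A2) @ B2                       # GPU 1, no communication needed
Z = Z1 + Z2                                  # the AllReduce step

assert np.allclose(Z, Z_ref)
```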
Self-Attention Example
- Split heads: $Q = [Q_1, Q_2]$, etc.
- Each GPU processes subset of heads.
- AllReduce after attention for combined output.
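The same identity checked for head-split attention (no masking, scaling only by $\sqrt{d}$, and the helper names are ad hoc):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
seq, n_heads, d = 6, 4, 8
Q = np.random.randn(n_heads, seq, d)
K = np.random.randn(n_heads, seq, d)
V = np.random.randn(n_heads, seq, d)
W_O = np.random.randn(n_heads * d, n_heads * d)   # output projection

def heads_out(hs):
    # per-head attention, heads concatenated along the feature dimension
    outs = [softmax(Q[h] @ K[h].T / np.sqrt(d)) @ V[h] for h in hs]
    return np.concatenate(outs, axis=-1)

# Reference: all heads on one device
ref = heads_out(range(n_heads)) @ W_O

# 2-way split: each "GPU" owns half the heads and the matching rows of W_O
out0 = heads_out([0, 1]) @ W_O[: 2 * d]
out1 = heads_out([2, 3]) @ W_O[2 * d :]
assert np.allclose(out0 + out1, ref)               # the AllReduce step
```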
Communication Patterns
- Forward: Identity → AllReduce
- Backward: AllReduce → Identity
Characteristics
Advantages:
- No pipeline bubbles
- Doesn't require large batches
- High utilization with fast interconnects
Disadvantages:
- High communication volume:
- Pipeline: $bsh$ (point-to-point)
- Tensor: ~$8bsh$ (AllReduce-heavy)
- Needs fast intra-node links (e.g., NVLink)
Memory Optimization: Activations
Formula
$\text{Memory per layer (bytes)} = sbh\left(34 + 5\frac{as}{h}\right)$
Where:
- $s$ = sequence length
- $b$ = batch size
- $h$ = hidden size
- $a$ = number of attention heads
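Plugging in illustrative GPT-3-scale dimensions gives a feel for the magnitude (the numbers below are assumptions for the sketch, not from the notes):

```python
def act_bytes_per_layer(s, b, h, a):
    # sbh(34 + 5as/h), interpreted in bytes (16-bit activations)
    return s * b * h * (34 + 5 * a * s / h)

# Illustrative GPT-3-scale settings: seq 2048, batch 1, hidden 12288, 96 heads.
print(f"{act_bytes_per_layer(s=2048, b=1, h=12288, a=96) / 2**30:.2f} GiB per layer")  # ~2.67 GiB
```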
Optimization Techniques
Checkpointing vs Stashing
- Stashing: Stores all activations (high memory, faster)
- Checkpointing: Stores minimal activations and recomputes the rest during the backward pass → saves memory at the cost of a ~33% throughput hit, but enables larger batch sizes
Tensor Parallelism Impact
With $t$ tensor-parallel units: $\text{Memory per layer} = sbh\left(10 + \frac{24}{t} + 5\frac{as}{ht}\right)$
- LayerNorm (4sbh) + Dropout (2sbh) + Inputs (4sbh) = 10sbh
- The remaining terms shrink with larger $t$
Sequence Parallelism
Motivation
- The $10sbh$ term comes from pointwise ops (LayerNorm, dropout, inputs) → it can be split along the sequence dimension
Implementation
- Split LayerNorm, Dropout, and Input activations along sequence
- AllGather before the tensor-parallel (attention/MLP) blocks to reassemble the full sequence
- Results in true linear memory scaling with device count
Memory Scaling Comparison
| Configuration | Activations per Layer |
|---|---|
| No parallelism | $sbh(34 + 5\frac{as}{h})$ |
| Tensor only | $sbh(10 + \frac{24}{t} + 5\frac{as}{ht})$ |
| Tensor + Sequence | $sbh(\frac{34}{t} + 5\frac{as}{ht})$ |
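Extending the same sketch to compare the three rows, with an assumed $t = 8$:

```python
def no_parallel(s, b, h, a):
    return s * b * h * (34 + 5 * a * s / h)

def tensor_only(s, b, h, a, t):
    return s * b * h * (10 + 24 / t + 5 * a * s / (h * t))

def tensor_plus_sequence(s, b, h, a, t):
    return s * b * h * (34 / t + 5 * a * s / (h * t))

cfg = dict(s=2048, b=1, h=12288, a=96)   # same illustrative dimensions as above
gib = 2**30
print(f"no parallelism:    {no_parallel(**cfg) / gib:.2f} GiB")                # ~2.67
print(f"tensor only (t=8): {tensor_only(**cfg, t=8) / gib:.2f} GiB")           # ~0.54
print(f"tensor+seq (t=8):  {tensor_plus_sequence(**cfg, t=8) / gib:.2f} GiB")  # ~0.33
```

Note that only the tensor + sequence row is exactly the no-parallelism figure divided by $t$.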
ZeRO Optimization / FSDP
Memory Breakdown (for $\Psi$ parameters)
- FP16 parameters: $2\Psi$
- Gradients: $2\Psi$
- FP32 optimizer state: $16\Psi$
- Total per GPU (naive): $20\Psi$
ZeRO Stages
- Stage 1: Shard optimizer state
- Stage 2: Also shard gradients
- Stage 3: Also shard model parameters
- With all three stages, per-GPU memory is $\frac{(2 + 2 + K)\Psi}{N_d}$
- Reduces memory linearly with the number of devices $N_d$
- Used in Fully Sharded Data Parallel (FSDP)
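A sketch of the per-GPU arithmetic, taking $K = 16$ from the breakdown above and assuming $\Psi = 7.5$B parameters across $N_d = 64$ GPUs (illustrative values):

```python
def per_gpu_bytes(psi, n_d, stage):
    # psi: parameter count; n_d: data-parallel degree; K = 16 per the breakdown above
    K = 16
    params, grads, opt = 2 * psi, 2 * psi, K * psi
    if stage >= 1:
        opt /= n_d          # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= n_d        # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_d       # ZeRO-3 / FSDP: also shard parameters
    return params + grads + opt

psi, n_d = 7.5e9, 64        # illustrative: 7.5B parameters across 64 GPUs
for stage in range(4):
    print(f"stage {stage}: {per_gpu_bytes(psi, n_d, stage) / 2**30:.1f} GiB")
# prints roughly 139.7, 29.7, 15.9, 2.2 GiB
```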
3D Parallelism Strategy
Deployment Phases
- Fit the model in memory:
- Use tensor parallelism within a node
- Use pipeline parallelism across nodes
- Scale compute:
- Add data parallelism
- Use gradient accumulation to improve communication efficiency
Example: 8 × 8 GPUs (8 nodes, 8 GPUs per node)
- Tensor: 8-way intra-node
- Pipeline: 8-way across nodes
- Data: Across groups of nodes
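A conceptual enumeration of how the three degrees tile a cluster, extending the example above with an assumed 2-way data parallelism (the layout dictionary is purely illustrative, not a framework API):

```python
from itertools import product

# 16 nodes x 8 GPUs: tensor 8-way within a node, pipeline 8-way across nodes,
# data parallelism 2-way across the two groups of 8 nodes.
tp, pp, dp = 8, 8, 2
assert tp * pp * dp == 16 * 8            # 128 GPUs total

layout = {}
for rank, (d, p, t) in enumerate(product(range(dp), range(pp), range(tp))):
    node = d * pp + p                    # pipeline stage selects the node within a group
    layout[rank] = {"node": node, "local_gpu": t, "dp": d, "pp": p, "tp": t}

# Tensor-parallel peers share a node (fast NVLink); pipeline peers span nodes.
print(layout[0])   # node 0, local GPU 0
print(layout[1])   # same node as rank 0 -> tensor-parallel peer
print(layout[8])   # next node           -> pipeline-parallel peer
```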
Considerations
- Batch size must be large enough for efficient pipeline use
- Tensor-parallel degree should match interconnect bandwidth (avoid over-splitting)
- Best setup depends on model, hardware, and latency/bandwidth topology
Summary
Takeaways
- Three main parallelism forms:
- Data: scale batch size
- Pipeline: scale model depth
- Tensor: scale model width
- Communication varies:
- Data: gradient AllReduce
- Pipeline: point-to-point activations
- Tensor: AllReduce per layer
- Memory optimization is essential:
- Activations dominate for large models
- Checkpointing and sequence parallelism reduce their cost
- Hardware-aware deployment:
- Use fast interconnects for tensor parallelism
- Use pipeline parallelism across slower links
- Match the parallel strategy to the topology
- Combine all three (3D parallelism) for optimal scale and efficiency.