Parallelism


title: Parallelism in LLM Serving Systems
category: Machine Learning Systems
tags: parallelism, performance, throughput, latency, llm, serving systems, machine learning
description: Overview of parallelism techniques in LLM serving systems, focusing on theoretical foundations and practical applications.


Parallelism in LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu

Introduction & Motivation

Limits to GPU-based Scaling

Compute Limitations

  • GPU improvements have come from several directions:

    • Number formats (FP32 → FP16 → Int8)
    • Specialized instructions (DP4A, HMMA, IMMA)
    • Process nodes (28nm → 5nm)
    • Sparsity support
  • Still, single-GPU performance remains fundamentally limited.
  • Supercomputers can reach exaflop scales, but LLMs continue to push hardware constraints.

Memory Limitations

  • Model sizes are growing exponentially: ELMo (94M) → GPT-2 (1.5B) → GPT-3 (175B) → MT-NLG (530B)
  • A single GPU cannot hold full model weights or intermediate activations.

Solution: Multi-GPU, Multi-Machine Parallelism

Network Infrastructure

  • Intra-node (within a machine):

    • NVLink 3.0: 600 GB/s
    • PCIe 4.0: 32 GB/s
  • Inter-node (between machines):

    • InfiniBand HDR: 25 GB/s

Goal: Distribute compute and memory across devices efficiently.
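
To make the bandwidth hierarchy concrete, here is a rough back-of-the-envelope sketch, assuming GPT-3-scale FP16 weights (175B parameters × 2 bytes ≈ 350 GB) and the peak link speeds listed above; real throughput will be lower:

```python
# Rough transfer-time estimate for moving GPT-3-scale FP16 weights (~350 GB)
# over each link class listed above. Peak bandwidths; real throughput is lower.
weights_gb = 175e9 * 2 / 1e9           # 175B params x 2 bytes (FP16) = 350 GB

links_gb_per_s = {
    "NVLink 3.0 (intra-node)": 600,
    "PCIe 4.0 (intra-node)": 32,
    "InfiniBand HDR (inter-node)": 25,
}

for name, bw in links_gb_per_s.items():
    print(f"{name}: {weights_gb / bw:.1f} s to move {weights_gb:.0f} GB")
```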


Collective Communication Primitives

Key Operations

  • AllReduce: Reduces (e.g., sums) data across devices; every device ends up with the full result.
  • Broadcast: Sends data from one device to all others.
  • AllGather: Each device collects the shards held by all others, so every device ends up with the concatenated whole.
  • ReduceScatter: Reduces data across devices and scatters the result, so each device holds one shard of the reduced output.

AllReduce can be implemented as ReduceScatter + AllGather, which is bandwidth-optimal.
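
A minimal single-process NumPy simulation of this decomposition (no real GPUs or NCCL involved; "devices" are just list entries), showing that ReduceScatter followed by AllGather reproduces AllReduce:

```python
import numpy as np

# Simulate 4 "devices", each holding an 8-element gradient vector.
num_devices, n = 4, 8
shard = n // num_devices
rng = np.random.default_rng(0)
grads = [rng.standard_normal(n) for _ in range(num_devices)]

# ReduceScatter: device i ends up with the sum of shard i across all devices.
reduced_shards = [sum(g[i * shard:(i + 1) * shard] for g in grads)
                  for i in range(num_devices)]

# AllGather: every device collects all reduced shards -> full reduced vector.
allreduce_via_rs_ag = np.concatenate(reduced_shards)

# Reference AllReduce: elementwise sum of every device's gradient.
assert np.allclose(allreduce_via_rs_ag, sum(grads))
```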


Key Concepts in ML Training/Serving

State Classifications

  • Model Parameters: Learned weights (used in both training & serving).
  • Gradients: For updating parameters during training.
  • Activations: Intermediate results from forward pass (used in both).
  • Optimizer State: Momentum, variance, etc. for training.
  • KV Cache: Used in serving for autoregressive models to avoid recomputing past tokens.

Parallelism Strategies

Goals

  • Scale with batch size (data)
  • Scale with model size (parameters)

Data Parallelism

Concept

  • Each GPU has a full model copy.
  • Batches split across GPUs.
  • Gradients are aggregated post-backward.

Implementations

Parameter Server (Centralized)

  • Workers send gradients to a central server; updated parameters are broadcast back.
  • Scalability bottleneck: the central server is a single point of failure and a bandwidth choke point.

AllReduce-based (Decentralized)

  • Peer-to-peer gradient aggregation (a minimal sketch follows this list):

    • Ring, Tree, Butterfly, or ReduceScatter + AllGather
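
A minimal sketch of the AllReduce-based approach using PyTorch's DistributedDataParallel; the two-layer MLP, the random batch, and the `torchrun --nproc_per_node=N` launch are placeholder assumptions, not part of the original notes:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a full copy of the model (placeholder two-layer MLP).
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank processes a different slice of the global batch (placeholder data).
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()          # DDP overlaps the gradient AllReduce with backward
    opt.step()
    opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```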

Limitations

  • Full model + gradients + optimizer state on each GPU.
  • Does not scale to models larger than a single GPU's memory.

Pipeline Parallelism

Concept

  • Split model vertically across layers into pipeline stages.
  • Each stage runs on a separate GPU.

Execution

  • Forward pass: left to right
  • Backward pass: right to left
  • GPUs exchange activations, not parameters (see the sketch below).
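
A minimal two-stage sketch of this idea, assuming two visible GPUs; the two `nn.Sequential` halves stand in for the front and back halves of a real model:

```python
import torch
import torch.nn as nn

# Two pipeline stages: the first half of the layers lives on cuda:0,
# the second half on cuda:1. Only activations cross the device boundary.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")

x = torch.randn(16, 1024, device="cuda:0")
act = stage0(x)                 # forward on GPU 0
act = act.to("cuda:1")          # point-to-point activation transfer
out = stage1(act)               # forward on GPU 1

loss = out.square().mean()
loss.backward()                 # backward flows right to left (GPU 1 -> GPU 0)
```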

Scheduling Strategies

GPipe

  • Microbatching to improve utilization.
  • Trade-off: more memory is needed to hold activations for all in-flight microbatches.

1F1B (One Forward, One Backward)

  • Keeps pipeline full during steady state.
  • Phases: warm-up → steady state (alternate one forward and one backward) → drain.

Zero Bubble Pipeline (ZBP)

  • Splits the backward pass into:

    1. Activation gradients
    2. Weight gradients (can be delayed)
  • Scheduling the delayed weight-gradient work into otherwise idle slots eliminates pipeline bubbles.

Analysis

Bubble Ratio

  • Bubble ratio: $(p - 1)/m$, where $p$ = number of pipeline stages and $m$ = number of microbatches (worked example below).
  • A larger $m$ reduces the relative bubble size.
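
For example, with hypothetical values of p = 8 stages and m = 32 microbatches:

```python
p, m = 8, 32                      # pipeline stages, microbatches (example values)
bubble_ratio = (p - 1) / m        # bubble time relative to compute time, per the formula above
print(f"bubble ratio = {bubble_ratio:.2%}")   # ~21.88%
```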

Characteristics

Advantages:

  • Shards model (less memory per GPU)
  • Point-to-point activation communication

Disadvantages:

  • Batch-size sensitive
  • Pipeline bubbles without careful scheduling

Tensor Parallelism

Concept

  • Split model horizontally: partition within layers.
  • Each GPU holds part of a layer.

Matrix Ops Decomposition

MLP Example:

$Z = \text{Dropout}(\text{GeLU}(XA)B)$

  • Column-split $A = [A_1, A_2]$: $Y_i = \text{GeLU}(XA_i)$ → no communication needed
  • Row-split $B$: $Z_i = Y_i B_i$ → AllReduce to combine $Z = \sum_i Z_i$ (checked numerically in the sketch below)
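
A small NumPy check of this decomposition (dropout is omitted for determinism, GeLU uses the tanh approximation, and the 2-way split is an assumption):

```python
import numpy as np

def gelu(x):  # tanh approximation of GeLU (elementwise)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))      # (batch*seq, hidden)
A = rng.standard_normal((16, 64))     # first MLP weight
B = rng.standard_normal((64, 16))     # second MLP weight

# Unsharded reference: Z = GeLU(X A) B   (dropout omitted)
Z_ref = gelu(X @ A) @ B

# 2-way tensor parallel: split A by columns, B by rows.
A1, A2 = np.split(A, 2, axis=1)       # each GPU holds one A_i
B1, B2 = np.split(B, 2, axis=0)       # each GPU holds one B_i

Y1, Y2 = gelu(X @ A1), gelu(X @ A2)   # no communication needed here
Z1, Z2 = Y1 @ B1, Y2 @ B2             # partial results on each GPU
Z = Z1 + Z2                           # AllReduce (sum) combines them

assert np.allclose(Z, Z_ref)
```

The same split-then-AllReduce pattern underlies the attention example below, with heads playing the role of the column blocks of $A$.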

Self-Attention Example

  • Split the attention heads across GPUs: $Q = [Q_1, Q_2]$, and likewise for $K$ and $V$.
  • Each GPU processes its subset of heads independently.
  • AllReduce after the (row-split) output projection combines the partial results.

Communication Patterns

  • Forward: Identity → AllReduce
  • Backward: AllReduce → Identity

Characteristics

Advantages:

  • No pipeline bubbles
  • Doesn't require large batches
  • High utilization with fast interconnects

Disadvantages:

  • High communication volume (compared numerically below):

    • Pipeline: $bsh$ (point-to-point activations between stages)
    • Tensor: ~$8bsh$ per layer (AllReduce-heavy)
  • Needs fast intra-node links (e.g., NVLink)
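
A rough comparison under example values (b = 8, s = 2048, h = 12288, FP16; the $bsh$ and $8bsh$ figures above are treated as element counts):

```python
b, s, h = 8, 2048, 12288          # example batch, sequence, hidden sizes
bytes_per_elem = 2                # FP16

pipeline_bytes = b * s * h * bytes_per_elem          # point-to-point, per stage boundary
tensor_bytes   = 8 * b * s * h * bytes_per_elem      # AllReduce traffic, per layer

print(f"pipeline: {pipeline_bytes / 1e9:.2f} GB, tensor: {tensor_bytes / 1e9:.2f} GB")
```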

Memory Optimization: Activations

Formula

$\text{Memory per layer} = sbh\left(34 + 5\frac{as}{h}\right)$

Where:

  • $s$ = sequence length
  • $b$ = batch size
  • $h$ = hidden size
  • $a$ = number of attention heads
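
Plugging in GPT-3-scale example values (s = 2048, b = 1, h = 12288, a = 96; the formula yields bytes assuming FP16 activations):

```python
s, b, h, a = 2048, 1, 12288, 96   # example GPT-3-scale values
mem_bytes = s * b * h * (34 + 5 * a * s / h)
print(f"{mem_bytes / 2**30:.1f} GiB of activations per layer")   # ~2.7 GiB
```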

Optimization Techniques

Checkpointing vs Stashing

  • Stashing: Store all activations (high memory use, no recomputation, faster).
  • Checkpointing: Store only a minimal set of activations and recompute the rest during the backward pass (sketch below).

    • Trades roughly a 33% throughput hit for large memory savings.
    • Enables larger batch sizes.
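
A minimal sketch of checkpointing with PyTorch's `torch.utils.checkpoint`; the MLP block and tensor shapes are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Checkpointing: the block's intermediate activations are NOT stored during the
# forward pass; they are recomputed when backward reaches this block.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```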

Tensor Parallelism Impact

With $t$ tensor-parallel units: $\text{Memory per layer} = sbh\left(10 + \frac{24}{t} + 5\frac{as}{ht}\right)$

  • LayerNorm ($4sbh$) + Dropout ($2sbh$) + block inputs ($4sbh$) = $10sbh$ is not reduced by tensor parallelism.
  • The remaining terms shrink as $t$ grows.

Sequence Parallelism

Motivation

  • The $10sbh$ term comes from pointwise ops and block inputs, which are independent along the sequence dimension → they can be split along the sequence instead.

Implementation

  • Split the LayerNorm, Dropout, and input activations along the sequence dimension.
  • AllGather the sequence shards before the attention/MLP blocks to reassemble the full sequence (a ReduceScatter re-splits it afterwards).
  • Results in activation memory that scales linearly with the device count.

Memory Scaling Comparison

  Configuration        Activations per layer
  -------------------  ---------------------------------------------------
  No parallelism       $sbh\left(34 + 5\frac{as}{h}\right)$
  Tensor only          $sbh\left(10 + \frac{24}{t} + 5\frac{as}{ht}\right)$
  Tensor + Sequence    $sbh\left(\frac{34}{t} + 5\frac{as}{ht}\right)$
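
Evaluating the three rows with the same example values as above and a hypothetical tensor-parallel degree of t = 8:

```python
s, b, h, a, t = 2048, 1, 12288, 96, 8     # example values; t = tensor-parallel degree

none      = s * b * h * (34 + 5 * a * s / h)
tensor    = s * b * h * (10 + 24 / t + 5 * a * s / (h * t))
tensor_sp = s * b * h * (34 / t + 5 * a * s / (h * t))

for name, mem in [("no parallelism", none), ("tensor only", tensor),
                  ("tensor + sequence", tensor_sp)]:
    print(f"{name:18s}: {mem / 2**30:.2f} GiB per layer")
```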

ZeRO Optimization / FSDP

Memory Breakdown (for $\Psi$ parameters)

  • FP16 parameters: $2\Psi$
  • FP16 gradients: $2\Psi$
  • FP32 optimizer state: $16\Psi$
  • Total per GPU (naive): $20\Psi$

ZeRO Stages

  • Stage 1: Shard the optimizer state
  • Stage 2: Also shard the gradients
  • Stage 3: Also shard the model parameters
  • Total memory per GPU with Stage 3: $\frac{(2 + 2 + K)\Psi}{N_d}$
  • Memory falls linearly with the number of devices $N_d$ (worked example below)
  • Used in Fully Sharded Data Parallel (FSDP)
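
A back-of-the-envelope example for a hypothetical 7.5B-parameter model sharded over N_d = 64 data-parallel GPUs, using the K = 16 optimizer-state factor above:

```python
psi = 7.5e9                 # hypothetical parameter count
K = 16                      # optimizer-state bytes per parameter (as above)
n_d = 64                    # hypothetical data-parallel degree

naive  = (2 + 2 + K) * psi             # everything replicated on every GPU
stage3 = (2 + 2 + K) * psi / n_d       # ZeRO-3 / FSDP: fully sharded

print(f"naive:  {naive / 1e9:.0f} GB per GPU")    # 150 GB
print(f"ZeRO-3: {stage3 / 1e9:.2f} GB per GPU")   # ~2.34 GB
```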

3D Parallelism Strategy

Deployment Phases

  1. Fit the model in memory

    • Use tensor parallel within node
    • Use pipeline parallel across nodes
  2. Scale compute

    • Add data parallelism
    • Use gradient accumulation to improve communication efficiency

Example: 8 × 8 GPUs (8 nodes with 8 GPUs each)

  • Tensor: 8-way intra-node
  • Pipeline: 8-way across nodes
  • Data: Across groups of nodes
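
A toy enumeration of the rank layout this implies (64 ranks for one model replica; the grouping logic is an illustrative assumption, not a specific framework's API):

```python
# 8 nodes x 8 GPUs = 64 GPUs for ONE model replica:
# tensor parallelism (degree 8) inside each node, pipeline parallelism (8 stages)
# across nodes. Data parallelism would replicate this whole 64-GPU group.
tp, pp = 8, 8
ranks = range(tp * pp)

tensor_groups   = [[r for r in ranks if r // tp == node] for node in range(pp)]
pipeline_groups = [[r for r in ranks if r %  tp == lane] for lane in range(tp)]

print("tensor group for node 0:   ", tensor_groups[0])    # ranks 0..7 (NVLink)
print("pipeline group for lane 0: ", pipeline_groups[0])  # ranks 0,8,...,56 (InfiniBand)
```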

Considerations

  • Batch size must be large enough for efficient pipeline use
  • Tensor size should align with bandwidth (avoid over-splitting)
  • Best setup depends on model, hardware, and latency/bandwidth topology

Summary

Takeaways

  1. Three main parallelism forms:

    • Data: scale batch size
    • Pipeline: scale model depth
    • Tensor: scale width
  2. Communication varies:

    • Data: gradient AllReduce
    • Pipeline: point-to-point activations
    • Tensor: AllReduce per layer
  3. Memory optimization is essential:

    • Activation memory dominates for large models
    • Checkpointing and sequence parallelism reduce its cost
  4. Hardware-aware deployment:

    • Use fast interconnects for tensor parallel
    • Use pipeline across slower links
    • Match parallel strategy to topology
  5. Combine all three (3D parallelism) for optimal scale and efficiency.