Parallelism in LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

Introduction & Motivation

Limits to GPU-based Scaling

Compute Limitations

Memory Limitations

Solution: Multi-GPU, Multi-Machine Parallelism

Network Infrastructure

Goal: Distribute compute and memory across devices efficiently.


Collective Communication Primitives

Key Operations

AllReduce can be implemented as ReduceScatter + AllGather, which is bandwidth-optimal.
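A minimal single-process sketch of this decomposition (NumPy, simulating $n$ ranks; the function name is illustrative, not a library API). Each phase moves $(n-1)/n$ of the buffer per rank, so the total traffic is $2(n-1)/n$ times the buffer size per rank, which matches the bandwidth lower bound for AllReduce.

```python
import numpy as np

def allreduce_via_rs_ag(rank_buffers):
    """Simulate AllReduce as ReduceScatter followed by AllGather."""
    n = len(rank_buffers)
    # Each rank's buffer is split into n chunks.
    chunks = [np.array_split(buf, n) for buf in rank_buffers]

    # ReduceScatter: rank i ends up owning the sum of chunk i from every rank.
    reduced = [sum(chunks[r][i] for r in range(n)) for i in range(n)]

    # AllGather: every rank collects all the reduced chunks and concatenates them.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(n)]

# Example: 4 simulated ranks, each holding its own local gradient vector.
grads = [np.full(8, float(r + 1)) for r in range(4)]
out = allreduce_via_rs_ag(grads)
assert np.allclose(out[0], np.full(8, 10.0))  # 1 + 2 + 3 + 4 on every rank
```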


Key Concepts in ML Training/Serving

State Classifications


Parallelism Strategies

Goals


Data Parallelism

Concept

Implementations

Parameter Server (Centralized)

AllReduce-based (Decentralized)

Limitations


Pipeline Parallelism

Concept

Execution

Scheduling Strategies

GPipe

1F1B (One Forward, One Backward)

Zero Bubble Pipeline (ZBP)

Analysis

Bubble Ratio
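For reference, the standard figure for a GPipe-style schedule with $p$ pipeline stages and $m$ microbatches is:

$\text{Bubble ratio} = \frac{p-1}{m+p-1}$

so the bubble shrinks as the number of microbatches grows and worsens as the pipeline gets deeper.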

Characteristics

Advantages:

Disadvantages:


Tensor Parallelism

Concept

Matrix Ops Decomposition

MLP Example:

$Z = \text{Dropout}(\text{GeLU}(XA)B)$
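A minimal NumPy sketch of the Megatron-style split of this block, simulating $t = 2$ tensor-parallel ranks in one process (dropout omitted; the final summation stands in for the AllReduce): $A$ is split column-wise so the element-wise GeLU needs no communication, and $B$ is split row-wise so each rank produces a partial output.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # (tokens, hidden)
A = rng.standard_normal((8, 32))    # first MLP weight
B = rng.standard_normal((32, 8))    # second MLP weight
t = 2                               # tensor-parallel degree

# Column-parallel A: each rank holds a slice of A's columns, so the
# element-wise GeLU can be applied locally with no communication.
A_shards = np.split(A, t, axis=1)
# Row-parallel B: each rank holds the matching slice of B's rows.
B_shards = np.split(B, t, axis=0)

# Each rank produces a partial result; summing the partials plays the
# role of the AllReduce across tensor-parallel ranks.
partials = [gelu(X @ A_shards[r]) @ B_shards[r] for r in range(t)]
Z = sum(partials)

assert np.allclose(Z, gelu(X @ A) @ B)
```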

Self-Attention Example

Communication Patterns

Characteristics

Advantages:

Disadvantages:


Memory Optimization: Activations

Formula

$\text{Memory per layer} = sbh\left(34 + 5\frac{as}{h}\right)$

Where $s$ = sequence length, $b$ = microbatch size, $h$ = hidden dimension size, and $a$ = number of attention heads.

Optimization Techniques

Checkpointing vs Stashing

Tensor Parallelism Impact

With $t$ tensor-parallel units: $\text{Memory per layer} = sbh\left(10 + \frac{24}{t} + 5\frac{as}{ht}\right)$


Sequence Parallelism

Motivation

Implementation

Memory Scaling Comparison

| Configuration | Activations per Layer |
| --- | --- |
| No parallelism | $sbh\left(34 + 5\frac{as}{h}\right)$ |
| Tensor only | $sbh\left(10 + \frac{24}{t} + 5\frac{as}{ht}\right)$ |
| Tensor + Sequence | $sbh\left(\frac{34}{t} + 5\frac{as}{ht}\right)$ |
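A quick back-of-the-envelope comparison of the three rows, with assumed GPT-3-scale values ($s = 2048$, $b = 1$, $h = 12288$, $a = 96$, $t = 8$) and treating the formulas as byte counts, as in the Megatron activation-memory analysis:

```python
# Assumed illustrative values: sequence length s, microbatch b,
# hidden size h, attention heads a, tensor-parallel degree t.
s, b, h, a, t = 2048, 1, 12288, 96, 8
GiB = 1024**3

no_parallel = s * b * h * (34 + 5 * a * s / h)
tensor_only = s * b * h * (10 + 24 / t + 5 * a * s / (h * t))
tensor_seq  = s * b * h * (34 / t + 5 * a * s / (h * t))

print(f"No parallelism:    {no_parallel / GiB:.2f} GiB/layer")  # ~2.67
print(f"Tensor only:       {tensor_only / GiB:.2f} GiB/layer")  # ~0.54
print(f"Tensor + sequence: {tensor_seq / GiB:.2f} GiB/layer")   # ~0.33
```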

ZeRO Optimization / FSDP

Memory Breakdown (for $\Psi$ params)
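For reference, the standard accounting (assuming Adam with mixed-precision training, as in the ZeRO paper): fp16 parameters take $2\Psi$ bytes, fp16 gradients $2\Psi$ bytes, and fp32 optimizer state (master weights, momentum, variance) $K\Psi = 12\Psi$ bytes, so each replica holds about $(2 + 2 + K)\Psi = 16\Psi$ bytes before any partitioning; a 7.5B-parameter model thus needs roughly 120 GB per GPU.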

ZeRO Stages


3D Parallelism Strategy

Deployment Phases

  1. Fit the model in memory

  2. Use tensor parallel within node

  3. Use pipeline parallel across nodes

  4. Scale compute

  5. Add data parallelism

  6. Use gradient accumulation to improve communication efficiency

Example: $8 \times 8$ GPU nodes
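Reading this as 8 nodes with 8 GPUs each (64 GPUs total), a minimal sketch of one way the parallel degrees could multiply out, following the phases above; the specific split is an illustrative assumption, not the only valid one:

```python
# 8 nodes x 8 GPUs per node (illustrative mapping of the deployment phases)
num_nodes, gpus_per_node = 8, 8
world_size = num_nodes * gpus_per_node                                   # 64 GPUs

tensor_parallel   = 8    # within a node, over the fast NVLink interconnect
pipeline_parallel = 4    # across nodes, over the slower network
data_parallel     = world_size // (tensor_parallel * pipeline_parallel)  # 2 replicas

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")
```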

Considerations


Summary

Takeaways

  1. Three main parallelism forms:
     - Data: scale batch size
     - Pipeline: scale model depth
     - Tensor: scale model width
  2. Communication varies:
     - Data: gradient AllReduce
     - Pipeline: point-to-point activations
     - Tensor: AllReduce per layer
  3. Memory optimization is essential:
     - Activations dominate for large models
     - Checkpointing and sequence parallelism reduce cost
  4. Hardware-aware deployment:
     - Use fast interconnects for tensor parallelism
     - Use pipeline parallelism across slower links
     - Match the parallelism strategy to the network topology
  5. Combine all three (3D parallelism) for optimal scale and efficiency.