Performance Modeling for LLM Serving Systems

Category: Machine Learning Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

Performance Analysis

  • Roofline model - Based on the Roofline paper and Google's scaling book
  • Detailed model of Transformer performance

The Roofline Model

Core Concept

Operational Intensity = $\frac{\text{# of Operations}}{\text{# of Bytes Moved}}$

  • Operations: e.g., FLOPs performed by the kernel
  • Bytes Moved: e.g., bytes transferred to or from memory (or over the network)

Key Components

Computational Ceilings

  • Memory Bound: Low operational intensity region
  • Compute Bound: High operational intensity region
  • Roofline: The boundary between memory and compute limitations
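
Putting the two ceilings together: at a given operational intensity, attainable performance is the minimum of the compute peak and bandwidth times intensity. A minimal sketch in Python; the peak numbers are placeholders, not tied to a specific GPU:

```python
def attainable_flops_per_s(op_intensity, peak_flops_per_s, mem_bw_bytes_per_s):
    """Roofline ceiling: performance is capped either by peak compute or by
    memory bandwidth times operational intensity (FLOPs per byte moved)."""
    return min(peak_flops_per_s, mem_bw_bytes_per_s * op_intensity)

# Illustrative peaks (placeholders, not a specific GPU):
peak   = 1000e12   # 1000 TFLOP/s
mem_bw = 3000e9    # 3000 GB/s

for intensity in (1, 10, 100, 333, 1000):
    tflops = attainable_flops_per_s(intensity, peak, mem_bw) / 1e12
    print(f"{intensity:5d} FLOPs/Byte -> {tflops:7.1f} TFLOP/s attainable")
```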

Performance Optimization Strategies

Compute optimizations:

  • Multithreading
  • ILP (Instruction-Level Parallelism)
  • SIMD (Single Instruction, Multiple Data)

Memory optimizations:

  • Stride accesses (HW prefetching)
  • Memory affinity (NUMA effect)
  • Software prefetching

Critical Operational Intensity

$$\text{Intensity(Computation)} = \text{Intensity(Accelerator)}$$

$$\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} = \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$$

Compute-Bound: $\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} > \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$

Memory-Bound: $\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} < \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$


Example: NVIDIA H100 Analysis

Hardware Specifications

  • Peak FP16 Tensor compute: 1000 / 2000 TFLOP/s (dense / with sparsity)
  • Memory bandwidth: 3000 GB/s
  • Critical intensity: $(1000 \times 10^{12}) / (3000 \times 10^9) = 333$ FLOPs/Byte

Rule: If a kernel has higher operational intensity than 333 FLOPs/Byte → compute-bound; otherwise → memory-bound
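
The ridge point and this classification rule fit in a small helper; the figures below are the rounded H100 values quoted above, and the function names are mine:

```python
def critical_intensity(peak_flops_per_s, mem_bw_bytes_per_s):
    """Ridge point of the roofline: the operational intensity (FLOPs/Byte)
    at which the memory ceiling meets the compute ceiling."""
    return peak_flops_per_s / mem_bw_bytes_per_s

def classify(kernel_intensity, peak_flops_per_s, mem_bw_bytes_per_s):
    ridge = critical_intensity(peak_flops_per_s, mem_bw_bytes_per_s)
    return "compute-bound" if kernel_intensity > ridge else "memory-bound"

# Rounded H100 figures from above: ~1000 TFLOP/s dense FP16, ~3000 GB/s HBM.
PEAK, MEM_BW = 1000e12, 3000e9
print(critical_intensity(PEAK, MEM_BW))   # ~333 FLOPs/Byte
print(classify(500, PEAK, MEM_BW))        # compute-bound
print(classify(0.25, PEAK, MEM_BW))       # memory-bound
```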

Example Operations

FP32 Dot Product

Compute: $N$ multiplications, $N$ additions = $2N$ FLOPs

Memory: $2 \times 4N = 8N$ bytes read + $4$ bytes written back

Operational Intensity: $\frac{2N}{8N + 4} \approx \frac{1}{4}$

Result: FP32 dot product on H100 is memory-bound

Matrix Multiplication with FP16

For $[M,N] \times [N,K] \rightarrow [M,K]$:

Memory: $2MN + 2NK$ bytes read, $2MK$ bytes written back

Compute: $2MNK$ FLOPs

Operational Intensity: $\frac{2MNK}{2MN + 2NK + 2MK} = \frac{MNK}{MN + NK + MK} \approx M$ when $M \ll N, K$ (the weight term $NK$ dominates the denominator)

Result: Matrix multiplication on H100 is compute-bound if $M > 333$
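
The two operational intensities derived above can be checked numerically. A brief sketch; the helper names are mine, and the 333 FLOPs/Byte ridge point is the rounded H100 figure from earlier:

```python
def dot_product_intensity(n):
    """FP32 dot product of two length-n vectors:
    2n FLOPs; 2 * 4n bytes read, 4 bytes written."""
    return (2 * n) / (8 * n + 4)

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """[m, n] x [n, k] FP16 matmul: 2*m*n*k FLOPs;
    reads m*n + n*k elements and writes m*k elements."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * n + n * k + m * k)
    return flops / bytes_moved

RIDGE = 333  # rounded H100 ridge point from above, FLOPs/Byte

print(dot_product_intensity(1_000_000) > RIDGE)         # False -> memory-bound (~0.25)
# With weight matrices much larger than the token batch, intensity ~ m:
print(matmul_intensity(m=128, n=8192, k=8192) > RIDGE)  # False -> memory-bound (~124)
print(matmul_intensity(m=512, n=8192, k=8192) > RIDGE)  # True  -> compute-bound (~455)
```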


NUMA Effect with GPUs

Modern GPU clusters show significant bandwidth variation across the memory hierarchy:

  • GPU memory bandwidth: 900 GB/s
  • Network bandwidth: 200 Gb/s = 25 GB/s

This creates hierarchical memory access patterns affecting performance modeling.


Performance Modeling Framework

Key Hardware Factors

  • $N_{GPU}$: number of GPUs
  • MemBW (GB/s): aggregate GPU memory bandwidth
  • $GPU_{mem}$ (GB): aggregate GPU memory capacity
  • Compute (FLOPs/s): aggregate GPU compute capacity
  • NetBW (GB/s): aggregate GPU interconnect bandwidth

Key Model Configuration Factors

  • $D_{model}$: hidden dimension size (hidden_dim)
  • $L$: number of layers
  • $P_{model}$: number of parameters
  • $R_{GQA}$: group size of GQA (group_size)
  • $S_{type}$: model parameters' data size in bytes (e.g., 2 for FP16)

Key User Statistics

  • $p$: average number of tokens in prompts to be prefilled
  • $d$: average number of tokens in output to be decoded
  • $p + d$: average number of tokens in user request (sequence length)
  • Per-request throughput (requests/s): $\frac{\text{Throughput}_{total}}{p + d}$
  • Decoding throughput (output tokens/s): $d \cdot \frac{\text{Throughput}_{total}}{p + d}$
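
A tiny worked example of these per-request quantities, using made-up workload numbers purely for illustration:

```python
# Hypothetical workload: 512-token prompts, 128 output tokens per request,
# and an aggregate system throughput of 10,000 tokens/s (prefill + decode).
p, d = 512, 128
throughput_total = 10_000  # tokens/s

requests_per_s      = throughput_total / (p + d)      # per-request throughput
decode_tokens_per_s = d * throughput_total / (p + d)  # decoding throughput

print(requests_per_s)       # ~15.6 requests/s
print(decode_tokens_per_s)  # 2000 output tokens/s
```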

Execution Time Models

Memory-Centric Execution Time

$$T_{memory} = \frac{GPU_{mem}}{MemBW}$$

Assumption: The entire contents of GPU memory are read once during each iteration

Compute-Centric Execution Time

$$T_{compute} = \frac{2B_{dense}P_{model}}{Compute}$$

Logic: All dense operations require $2B_{dense}P_{model}$ FLOPs in total, where $B_{dense}$ is the number of tokens (across all requests) processed in one iteration

Network-Centric Execution Time

$$T_{net} = \frac{4(N_{GPU} - 1)D_{model}B_{dense}S_{type}L}{NetBW}$$

Components:

  • $(N_{GPU} - 1)$: number of hops
  • $4$: All-Gather-equivalents per layer
  • One All-Reduce $\approx 2 \times$ All-Gather; with two All-Reduces per layer (after attention and after the MLP), each layer costs about four All-Gathers
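
The three execution-time models transcribe directly into a small cost model. The sketch below assumes the three resources overlap perfectly, so one iteration is bounded by the slowest of them; that overlap assumption is mine, and all inputs are placeholders:

```python
def t_memory(gpu_mem_bytes, mem_bw_bytes_per_s):
    """Memory-centric: all GPU memory contents are read once per iteration."""
    return gpu_mem_bytes / mem_bw_bytes_per_s

def t_compute(b_dense, p_model, compute_flops_per_s):
    """Compute-centric: 2 * B_dense * P_model dense FLOPs per iteration."""
    return 2 * b_dense * p_model / compute_flops_per_s

def t_net(n_gpu, d_model, b_dense, s_type, n_layers, net_bw_bytes_per_s):
    """Network-centric: ~4 All-Gather-equivalents per layer over (N_GPU - 1) hops."""
    return 4 * (n_gpu - 1) * d_model * b_dense * s_type * n_layers / net_bw_bytes_per_s

def iteration_time(gpu_mem, mem_bw, b_dense, p_model, compute,
                   n_gpu, d_model, s_type, n_layers, net_bw):
    """Assuming memory, compute, and network fully overlap, one iteration is
    bounded by whichever resource takes the longest."""
    return max(t_memory(gpu_mem, mem_bw),
               t_compute(b_dense, p_model, compute),
               t_net(n_gpu, d_model, b_dense, s_type, n_layers, net_bw))
```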

Performance Analysis Results

Compute vs Network

$$\frac{T_{net}}{T_{compute}} = 2(N_{GPU} - 1)\frac{D_{model}L}{P_{model}} \frac{S_{type} \cdot Compute}{NetBW}$$

Key Finding: LLM Serving is more compute-bound than network-bound

Compute vs Memory

$$\frac{T_{memory}}{T_{compute}} = \frac{Compute \cdot GPU_{mem}}{MemBW \cdot 2B_{dense}P_{model}}$$

Factors affecting the balance:

  • GQA allows for larger batch sizes → favors compute
  • Model sizes tend to increase → favors compute
  • Batches with long decode requests increase memory accesses → favors memory

Key Finding: LLM serving is more compute-bound than memory-bound
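
To see why the compute term tends to win, both ratios can be evaluated from the closed forms above. The configuration below is purely illustrative (roughly 70B-scale dimensions on 8 GPUs with rounded, assumed hardware figures), not a measured result:

```python
# Illustrative configuration (rounded, assumed values for the example only).
n_gpu    = 8
d_model  = 8192          # hidden dimension
n_layers = 80
p_model  = 70e9          # parameters
s_type   = 2             # bytes per FP16 value
compute  = 8 * 312e12    # aggregate dense FP16 FLOPs/s
mem_bw   = 8 * 2e12      # aggregate HBM bandwidth, bytes/s
net_bw   = 8 * 300e9     # aggregate interconnect bandwidth, bytes/s
gpu_mem  = 8 * 80e9      # aggregate GPU memory, bytes
b_dense  = 1024          # tokens per iteration

net_over_compute = (2 * (n_gpu - 1) * d_model * n_layers / p_model
                    * s_type * compute / net_bw)
mem_over_compute = compute * gpu_mem / (mem_bw * 2 * b_dense * p_model)

print(net_over_compute)  # ~0.27 with these numbers -> compute time exceeds network time
print(mem_over_compute)  # ~0.70 with these numbers -> compute time exceeds memory time
```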


Grouped Query Attention (GQA)

Concept

  • Traditional: Each attention head has its own Key-Value cache
  • GQA: Multiple attention heads share Key-Value pairs
  • Group size: Number of heads sharing the same K-V pairs

Impact on Performance

  • Reduces KV cache memory requirements by factor of group size
  • Allows increasing batch size by factor of group size
  • Shifts workload toward compute-bound rather than memory-bound
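
A rough sketch of how the group size enters the KV-cache budget, and hence the feasible batch size. It assumes the KV heads together span the full hidden dimension, so GQA shrinks the per-token cache by the group size; the function names and numbers are illustrative:

```python
def kv_cache_bytes_per_token(d_model, n_layers, s_type, gqa_group_size=1):
    """One K and one V vector per layer; GQA shares each K/V across
    `gqa_group_size` query heads, shrinking the cache by that factor."""
    return 2 * n_layers * d_model * s_type / gqa_group_size

def max_batch_size(free_mem_bytes, seq_len, d_model, n_layers, s_type,
                   gqa_group_size=1):
    """How many sequences of length seq_len fit into the memory left for KV cache."""
    per_seq = seq_len * kv_cache_bytes_per_token(d_model, n_layers, s_type,
                                                 gqa_group_size)
    return int(free_mem_bytes // per_seq)

# Illustrative: 8192 hidden dim, 80 layers, FP16, 4096-token sequences,
# 100 GB of GPU memory left over for KV cache.
print(max_batch_size(100e9, 4096, 8192, 80, 2, gqa_group_size=1))  # 9  (MHA)
print(max_batch_size(100e9, 4096, 8192, 80, 2, gqa_group_size=8))  # 74 (GQA, ~8x larger)
```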

Optimal Throughput Analysis

Theoretical Maximum

$$\text{Throughput} = \frac{B_{dense}}{T_{compute}} = \frac{B_{dense}}{\frac{2B_{dense}P_{model}}{Compute}} = \frac{Compute}{2P_{model}}$$

Example: LLaMA 70B on A100 → 1857 tokens/s/GPU
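
A minimal check of the formula. The exact figure depends on which peak-FLOPS value and parameter count are assumed, so the output below (using an assumed 312 TFLOP/s dense FP16 and 70e9 parameters) is illustrative and will not exactly match the 1857 tokens/s/GPU quoted above:

```python
def optimal_throughput_tokens_per_s(compute_flops_per_s, p_model):
    """Compute-centric upper bound: every token costs ~2 * P_model dense FLOPs."""
    return compute_flops_per_s / (2 * p_model)

# Assumed inputs: 312 TFLOP/s dense FP16 per GPU, 70e9-parameter model.
print(optimal_throughput_tokens_per_s(312e12, 70e9))  # ~2229 tokens/s/GPU
```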

Performance Gap

Current serving frameworks show significant gaps to optimal throughput:

  • vLLM: ~494-552 tokens/s
  • DeepSpeed-FastGen: ~372-513 tokens/s
  • TensorRT-LLM: ~636-817 tokens/s

Key Insight: There is a large gap to optimal throughput - high GPU compute utilization is critical for LLM serving performance.


Key Takeaways

  1. Roofline model provides framework for understanding compute vs memory bounds
  2. Critical operational intensity determines performance bottlenecks
  3. LLM serving is primarily compute-bound rather than memory or network bound
  4. GQA enables larger batch sizes and improves compute utilization
  5. Significant optimization opportunities exist - current frameworks achieve only ~25-45% of theoretical peak throughput
  6. High GPU compute utilization is the key for effective LLM serving