Performance Modeling for LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu.

Performance Analysis


The Roofline Model

Core Concept

Operational Intensity = $\frac{\text{# of Operations}}{\text{# of Bytes Moved}}$
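As a quick sketch of how this definition is used, the toy example below computes the intensity of an assumed element-wise FP16 vector add (an illustrative workload, not one from the lecture), which moves roughly $6N$ bytes for $N$ FLOPs:

```python
def operational_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to/from memory."""
    return flops / bytes_moved

# Toy example (assumed): element-wise FP16 add of two length-N vectors.
# Reads 2 * 2N bytes, writes 2N bytes, performs N additions.
N = 1_000_000
oi = operational_intensity(flops=N, bytes_moved=3 * 2 * N)
print(f"Operational intensity: {oi:.3f} FLOPs/Byte")  # ~0.17
```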

Key Components

Computational Ceilings

Performance Optimization Strategies

Compute optimizations:

Memory optimizations:

Critical Operational Intensity

$$\text{Intensity(Computation)} = \text{Intensity(Accelerator)}$$

$$\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} = \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$$

Compute-Bound: $\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} > \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$

Memory-Bound: $\frac{\text{Computation FLOPs}}{\text{Communication Bytes}} < \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}}$
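A minimal sketch of this comparison; the peak FLOP/s and bandwidth values below are placeholders chosen so the ridge point is roughly the 333 FLOPs/Byte figure used in the H100 example that follows, not official hardware specs:

```python
def roofline(op_intensity: float, peak_flops: float, mem_bw: float):
    """Return attainable FLOP/s and whether the kernel is compute-bound."""
    ridge_point = peak_flops / mem_bw                  # critical operational intensity
    attainable = min(peak_flops, op_intensity * mem_bw)
    return attainable, op_intensity > ridge_point

# Placeholder accelerator: ~1000 TFLOP/s peak, ~3 TB/s memory bandwidth
# (assumed values), giving a ridge point of ~333 FLOPs/Byte.
attainable, compute_bound = roofline(op_intensity=0.25,
                                     peak_flops=1000e12, mem_bw=3e12)
print(attainable, compute_bound)  # 7.5e11 FLOP/s, False (memory-bound)
```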


Example: NVIDIA H100 Analysis

Hardware Specifications

Rule: If a kernel has an operational intensity higher than 333 FLOPs/Byte, it is compute-bound; otherwise it is memory-bound.

Example Operations

FP32 Dot Product

Compute: $N$ multiplications, $N$ additions = $2N$ FLOPs

Memory: $2 \times 4N = 8N$ bytes read + $4$ bytes written back

Operational Intensity: $\frac{2N}{8N + 4} \approx \frac{1}{4}$

Result: FP32 dot product on H100 is memory-bound
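A small sketch of the arithmetic above; the 333 FLOPs/Byte threshold is the H100 ridge point from the rule stated earlier:

```python
def dot_product_intensity(n: int) -> float:
    """Operational intensity of an FP32 dot product of two length-n vectors."""
    flops = 2 * n                 # n multiplies + n adds
    bytes_moved = 2 * 4 * n + 4   # read both FP32 vectors, write one FP32 scalar
    return flops / bytes_moved

print(dot_product_intensity(1_000_000))  # ~0.25, far below 333 -> memory-bound
```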

Matrix Multiplication with FP16

For $[M,N] \times [N,K] \rightarrow [M,K]$:

Memory: $2MN + 2NK$ bytes read, $2MK$ bytes written back (2 bytes per FP16 element)

Compute: $2MNK$ FLOPs

Operational Intensity: $\frac{2MNK}{2MN + 2NK + 2MK} \approx M$ (when $N, K \gg M$, i.e., the weight matrices are much larger than the batch of tokens)

Result: Matrix multiplication on H100 is compute-bound if $M > 333$
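The same calculation for an FP16 GEMM, as a sketch; the specific shapes in the example are illustrative assumptions:

```python
def fp16_matmul_intensity(m: int, n: int, k: int) -> float:
    """Operational intensity of [m, n] x [n, k] -> [m, k] in FP16 (2 bytes/element)."""
    flops = 2 * m * n * k                      # one multiply-add per output element per n
    bytes_moved = 2 * (m * n + n * k + m * k)  # read A and B, write C
    return flops / bytes_moved

# When n and k are much larger than m (weights much larger than the token batch),
# the intensity is roughly m, so m > 333 pushes the GEMM into the compute-bound regime.
print(fp16_matmul_intensity(m=256, n=8192, k=8192))   # ~241 -> memory-bound on H100
print(fp16_matmul_intensity(m=1024, n=8192, k=8192))  # ~819 -> compute-bound
```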


NUMA Effect with GPUs

Modern GPU clusters show significant bandwidth variations:

- GPU memory bandwidth: 900 GB/s
- Network bandwidth: 200 Gb/s = 25 GB/s (about 36x lower than GPU memory bandwidth)

This creates hierarchical memory access patterns affecting performance modeling.


Performance Modeling Framework

Key Hardware Factors

Key Model Configuration Factors

Key User Statistics


Execution Time Models

Memory-Centric Execution Time

$$T_{memory} = \frac{GPU_{mem}}{MemBW}$$

Assumption: the entire contents of GPU memory are loaded once per iteration

Compute-Centric Execution Time

$$T_{compute} = \frac{2B_{dense}P_{model}}{Compute}$$

Logic: the dense operations require $2B_{dense}P_{model}$ FLOPs in total, since each of the $B_{dense}$ tokens touches every one of the $P_{model}$ parameters with one multiply and one add

Network-Centric Execution Time

$$T_{net} = \frac{4(N_{GPU} - 1)D_{model}B_{dense}S_{type}L}{NetBw}$$

Components:

- $N_{GPU}$: number of GPUs the model is partitioned across
- $D_{model}$: model hidden dimension
- $B_{dense}$: tokens processed per iteration
- $S_{type}$: bytes per element of the data type (e.g., 2 for FP16)
- $L$: number of layers
- $NetBw$: network bandwidth between GPUs
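To make the three models concrete, here is a minimal sketch that plugs assumed values into the formulas above. The LLaMA-70B-like shape ($D_{model}=8192$, 80 layers, 70B parameters), the A100-like 312 TFLOP/s and 80 GB figures, the 256-token batch, and the 600 GB/s NVLink-class interconnect are all assumptions; the 900 GB/s memory bandwidth is the figure quoted in the NUMA section above.

```python
def t_memory(gpu_mem, mem_bw):
    """Memory-centric model: stream the whole GPU memory once per iteration."""
    return gpu_mem / mem_bw

def t_compute(b_dense, p_model, compute):
    """Compute-centric model: 2 * B_dense * P_model dense FLOPs per iteration."""
    return 2 * b_dense * p_model / compute

def t_net(n_gpu, d_model, b_dense, s_type, layers, net_bw):
    """Network-centric model: per-iteration tensor-parallel communication."""
    return 4 * (n_gpu - 1) * d_model * b_dense * s_type * layers / net_bw

# Assumed configuration: LLaMA-70B-like model, FP16, 256 tokens per iteration,
# 4 GPUs, 80 GB of GPU memory, 312 TFLOP/s, 900 GB/s HBM, 600 GB/s interconnect.
print(f"T_memory  = {t_memory(80e9, 900e9):.3f} s")              # ~0.089 s
print(f"T_compute = {t_compute(256, 70e9, 312e12):.3f} s")       # ~0.115 s
print(f"T_net     = {t_net(4, 8192, 256, 2, 80, 600e9):.3f} s")  # ~0.007 s
```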


Performance Analysis Results

Compute vs Network

$$\frac{T_{net}}{T_{compute}} = 2(N_{GPU} - 1)\frac{D_{model}L}{P_{model}} \frac{S_{type} \cdot Compute}{NetBw}$$

Key Finding: LLM Serving is more compute-bound than network-bound

Compute vs Memory

$$\frac{T_{memory}}{T_{compute}} = \frac{Compute \cdot GPU_{mem}}{MemBW \cdot 2B_{dense}P_{model}}$$

Factors affecting the balance:

Key Finding: LLM serving is more compute-bound than memory-bound
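A quick numeric check of both ratios, using the same assumed configuration as the execution-time sketch above (all numbers not taken from these notes are assumptions):

```python
def net_over_compute(n_gpu, d_model, layers, p_model, s_type, compute, net_bw):
    """T_net / T_compute; > 1 means communication dominates dense compute."""
    return 2 * (n_gpu - 1) * d_model * layers / p_model * s_type * compute / net_bw

def memory_over_compute(compute, gpu_mem, mem_bw, b_dense, p_model):
    """T_memory / T_compute; > 1 means streaming GPU memory dominates."""
    return compute * gpu_mem / (mem_bw * 2 * b_dense * p_model)

print(net_over_compute(4, 8192, 80, 70e9, 2, 312e12, 600e9))  # ~0.06
print(memory_over_compute(312e12, 80e9, 900e9, 256, 70e9))    # ~0.77
```

Both ratios come out below 1 under these assumptions, consistent with the findings above that compute is the dominant cost.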


Grouped Query Attention (GQA)

Concept

Impact on Performance


Optimal Throughput Analysis

Theoretical Maximum

$$\text{Throughput} = \frac{B_{dense}}{T_{compute}} = \frac{B_{dense}}{\frac{2B_{dense}P_{model}}{Compute}} = \frac{Compute}{2P_{model}}$$

Example: LLaMA 70B on A100 → 1857 tokens/s/GPU
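As a sketch, the bound simply divides the accelerator's FLOP/s by the roughly 2 FLOPs each parameter costs per token. The 312 TFLOP/s A100 FP16 figure below is an assumed spec, so the result is close to but not exactly the 1857 tokens/s/GPU quoted above, which corresponds to assuming a lower effective FLOP/s rate.

```python
def optimal_throughput(compute_flops_per_s: float, p_model: float) -> float:
    """Upper bound on tokens/s: each token costs ~2 FLOPs per parameter."""
    return compute_flops_per_s / (2 * p_model)

# Assumed A100 FP16 dense peak of 312 TFLOP/s with a 70B-parameter model.
print(optimal_throughput(312e12, 70e9))  # ~2229 tokens/s at the raw peak
```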

Performance Gap

Current serving frameworks show significant gaps to optimal throughput:

- vLLM: ~494-552 tokens/s
- DeepSpeed-FastGen: ~372-513 tokens/s
- TensorRT-LLM: ~636-817 tokens/s

Key Insight: There is a large gap to optimal throughput; high GPU compute utilization is critical for LLM serving performance.


Key Takeaways

  1. Roofline model provides framework for understanding compute vs memory bounds
  2. Critical operational intensity determines performance bottlenecks
  3. LLM serving is primarily compute-bound rather than memory or network bound
  4. GQA enables larger batch sizes and improves compute utilization
  5. Significant optimization opportunities exist; current frameworks achieve only ~20-45% of theoretical peak throughput
  6. High GPU compute utilization is the key for effective LLM serving