Memory Management in LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

KV Cache Size Calculation

Key Components

The KV cache size depends on:

  1. Number of tokens cached per request (prompt plus tokens generated so far)
  2. A factor of 2 for the K and V tensors
  3. Bytes per element (2 for FP16/BF16)
  4. Number of KV heads (8 for Llama3-8B with GQA)
  5. Head dimension (128 for Llama3-8B)
  6. Number of layers (32 for Llama3-8B)

Example Calculation - Llama3-8B on H100

KV cache per request:

$$(1024 + 1024/2) \times 2 \times 2 \times 8 \times 128 \times 32 / 1024 / 1024 = 192\ \text{MB}$$

Maximum batch size:

$$(80\ \text{GB} - 16\ \text{GB}) / (192\ \text{MB} / 1024) \approx 341 \text{ requests}$$

(80 GB of H100 HBM minus ~16 GB of FP16 model weights, divided by 0.1875 GB of KV cache per request.)

Note: A batch size of 333 was needed to reach the compute-bound regime on the H100, so ~341 requests is just enough to get there.
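
A minimal sketch of the arithmetic above, assuming the Llama3-8B configuration used in the calculation (32 layers, 8 KV heads, head dimension 128, FP16); the function and variable names are purely illustrative:

# KV cache size per request and max batch size for Llama3-8B on an 80 GB H100
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per token per layer
    return num_tokens * 2 * dtype_bytes * num_kv_heads * head_dim * num_layers

per_request = kv_cache_bytes(1024 + 1024 // 2)                  # 201,326,592 bytes = 192 MB
hbm_gb, weights_gb = 80, 16                                     # H100 HBM, FP16 weights of an 8B model
max_batch = int((hbm_gb - weights_gb) * 1024**3 / per_request)
print(per_request // 2**20, max_batch)                          # 192 (MB), 341 (requests)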

Memory Allocation Challenges

Variable-Length Requests

When serving requests with different output lengths, the final KV cache size of each request is not known when the request arrives, since the number of tokens it will generate is unknown.

Key Question: How to efficiently allocate KV cache for requests with different lengths?

Allocation Methods

Method 1: Fixed Max Sequence Length

Approach: Allocate each request's KV cache up front for the maximum sequence length the model supports

Problems:

  1. Most requests are far shorter than the maximum, so most of the reserved memory is never used (internal fragmentation)
  2. The wasted reservations sharply limit how many requests fit in GPU memory, and therefore the batch size

Method 2: std::vector-style Allocation

Approach: Start with a small allocation and grow it as tokens are generated, copying the cache into a larger contiguous buffer when capacity is exceeded (amortized doubling, like std::vector)

Characteristics:

  1. Avoids reserving the maximum length up front
  2. Growing requires copying existing KV cache data, which is expensive
  3. Buffers of varying sizes scattered across memory cause external fragmentation

Method 3: PagedAttention

Core Innovation: Chunk global memory space into small pages

Key Features:

  1. Pages are allocated on demand as a request's KV cache grows
  2. A request's pages need not be contiguous; a per-request page table maps logical token positions to physical pages
  3. Internal waste is limited to the last, partially filled page, and external fragmentation is eliminated
  4. Physical pages can be shared across requests (the basis for prefix sharing below)
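
A toy sketch of page-based allocation under these assumptions (the page size, class, and method names are made up for illustration; real systems such as vLLM additionally track reference counts so pages can be shared):

# Toy paged KV allocator: a global pool of fixed-size pages plus a per-request page table
PAGE_SIZE = 16                                     # tokens per page (illustrative)

class PagedAllocator:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # physical page ids
        self.page_table = {}                       # request id -> list of physical page ids

    def append_token(self, req_id, pos):
        # A new physical page is needed only when the request crosses a page boundary
        if pos % PAGE_SIZE == 0:
            page = self.free_pages.pop()
            self.page_table.setdefault(req_id, []).append(page)

    def free(self, req_id):
        # Return the request's pages to the pool when it finishes
        self.free_pages.extend(self.page_table.pop(req_id, []))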

PagedAttention Implementation

Multi-Level Page Table Structure

kv_indptr:  [0, 2, 3, 6, 10]                   # NumReq + 1 elements
kv_indices: [1, 4, 8, 2, 5, 0, 6, 10, 15, 17]  # NumPage elements
kv_data:    [actual KV cache data...]          # Max Page elements

How it works:

  1. Request i owns the pages kv_indices[kv_indptr[i] : kv_indptr[i+1]] (CSR-style indexing)
  2. Each entry of kv_indices is a physical page id indexing into kv_data
  3. Example: request 0 uses pages 1 and 4; request 3 uses pages 6, 10, 15, and 17
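
A minimal sketch of how this structure is walked to gather one request's pages; the array names follow the block above, everything else is illustrative:

# Gather the physical KV pages belonging to request i (host-side illustration)
def pages_for_request(kv_indptr, kv_indices, i):
    start, end = kv_indptr[i], kv_indptr[i + 1]
    return kv_indices[start:end]                          # physical page ids for request i

kv_indptr  = [0, 2, 3, 6, 10]
kv_indices = [1, 4, 8, 2, 5, 0, 6, 10, 15, 17]
print(pages_for_request(kv_indptr, kv_indices, 0))        # [1, 4]
print(pages_for_request(kv_indptr, kv_indices, 3))        # [6, 10, 15, 17]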

Prefix Sharing

Concept: Share common prefixes across multiple requests

Benefits:

  1. The shared prefix's KV cache is stored only once, saving GPU memory
  2. Prefill for the shared prefix is computed once and reused, saving compute and reducing time-to-first-token

Drawbacks:

  1. Requires bookkeeping (e.g., a radix tree with reference counts) to detect and manage shared prefixes
  2. Shared pages cannot be freed until every request using them finishes, so an eviction policy is needed
  3. Provides no benefit when requests do not actually share prefixes

Use Cases for Prefix Sharing

Multi-round Conversations

1. "You are a helpful assistant. User:Hello, Assistant:Hi!"
2. "You are a helpful assistant. User:Hello, Assistant:Hi!, User: Solve this problem..."
3. "You are a helpful assistant. User:What can you do?"

Multi-request Scenarios

Examples: many independent requests that share the same system prompt or few-shot examples, and parallel sampling / beam search branching from a single prompt.

Performance Benefits

Performance gains vary with:

  1. The length of the shared prefix relative to each request's unique suffix
  2. The number of requests sharing the prefix

Typical improvements: 2-32x speedup depending on configuration.

Memory Management Strategies

Eviction Policies: Recomputing vs Load/Offload

When GPU memory is insufficient to hold the full radix tree of cached prefixes, some entries must be evicted. There are two ways to restore an evicted entry when it is needed again:

Recomputation Approach

Time cost: $$T_{recompute} = \frac{2pP_{model}}{Compute}$$

where $p$ is the number of evicted tokens, $P_{model}$ the model parameter count, and $Compute$ the GPU compute throughput; a forward pass costs roughly $2P_{model}$ FLOPs per token.

Load/Offload Approach

Time cost: $$T_{load} = \frac{2 \times dtype \times \frac{D_{model}}{GQA} \times L \times p}{\text{PCIe Bandwidth}}$$

where $dtype$ is bytes per element, $D_{model}/GQA$ the per-layer KV width (KV heads $\times$ head dim, 1024 for Llama3-8B), $L$ the number of layers, and the leading 2 accounts for K and V.

Comparison Ratio

$$\frac{T_{recompute}}{T_{load}} = \frac{P_{model} \times \text{PCIe Bandwidth}}{dtype \times \frac{D_{model}}{GQA} \times L \times Compute}$$

Example calculation for A100 (PCIe bandwidth ≈ 30 GB/s, $P_{model} = 8 \times 10^9$, $dtype = 2$, $D_{model}/GQA = 1024$, $L = 32$, Compute ≈ 300 TFLOPS):

$$\frac{30\ \text{GB/s} \times 8 \times 10^9}{2 \times 1024 \times 32 \times 300\ \text{TFLOPS}} \approx 12$$

Conclusion: Loading from CPU memory is ~12x faster than recomputation.
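
A quick sanity check of the ratio, plugging in the A100 numbers from the example above (variable names are illustrative):

# Recompute-vs-load ratio for evicted KV cache, using the A100-class numbers above
pcie_bw = 30e9            # bytes/s over PCIe
p_model = 8e9             # parameters (Llama3-8B)
dtype   = 2               # bytes per element (FP16)
d_kv    = 1024            # per-layer KV width: KV heads (8) * head dim (128)
layers  = 32
compute = 300e12          # FLOPS (A100-class FP16)

ratio = (p_model * pcie_bw) / (dtype * d_kv * layers * compute)
print(round(ratio, 1))    # ~12.2: recomputing takes ~12x longer than loading over PCIe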

FlashAttention: Memory-Efficient Attention

Problem with Standard Attention

Standard attention materializes the full $N \times N$ attention score matrix in HBM, so intermediate memory and HBM traffic grow quadratically with sequence length and the kernel is bottlenecked on HBM bandwidth rather than compute.

Softmax Numerical Stability

Standard softmax: $$sm(x_i) = \frac{e^{x_i}}{\sum_{j=1}^d e^{x_j}}$$

Numerically stable version: $$sm(x_i) = \frac{e^{x_i-c}}{\sum_{j=1}^d e^{x_j-c}}$$

where $c = \max_i x_i$; subtracting the maximum prevents overflow of the exponentials without changing the result.
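
A minimal NumPy sketch of both versions (illustrative only); the naive form overflows for large logits while the stable form gives the same result:

import numpy as np

def softmax_naive(x):
    e = np.exp(x)                 # overflows for large x (exp(1000) -> inf)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())       # subtract c = max(x); mathematically identical
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))           # [nan nan nan] -- inf / inf
print(softmax_stable(x))          # [0.090 0.245 0.665]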

FlashAttention Algorithm

Key Innovation: Tile-based computation with online softmax

Steps:

  1. Tiling: Divide Q, K, V into blocks that fit in SRAM
  2. Block-wise computation: Compute attention for each block pair
  3. Online aggregation: Maintain running statistics (max, sum) across blocks
  4. Rescaling: Properly combine results from different blocks

Memory hierarchy utilization: Q, K, V tiles are loaded from slow HBM into fast on-chip SRAM, the tile's attention is computed entirely on chip, and only the output block and the small running statistics are written back to HBM; the full $N \times N$ score matrix is never materialized.
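
A minimal single-query sketch of the online-softmax aggregation described in steps 2-4 (pure Python/NumPy for intuition; the real kernel does this on Q/K/V tiles held in SRAM, and the 1/sqrt(d) scaling is omitted):

import numpy as np

def attention_online(q, K, V, block=4):
    # One query vector; K is (N, d_k), V is (N, d_v). Process K/V in blocks,
    # keeping only a running max m, running normalizer l, and unnormalized output acc.
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s+block] @ q                  # attention scores for this tile
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                  # rescale previously accumulated results
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s:s+block]
        m = m_new
    return acc / l                                 # equals softmax(K @ q) @ V

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(8), rng.standard_normal((16, 8)), rng.standard_normal((16, 4))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(attention_online(q, K, V), ref))   # True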

Causal Masking in FlashAttention

With a causal mask, tiles that lie entirely above the diagonal (every key position comes after the query positions) can be skipped outright, and only tiles that straddle the diagonal need element-wise masking, so tiling also avoids wasted work.

Grouped Query Attention (GQA)

In GQA, groups of query heads share one KV head (Llama3-8B: 32 query heads share 8 KV heads), shrinking the KV cache by the group factor (4x here); this is the $D_{model}/GQA$ term in the load-time formula above.
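
A tiny sketch of how KV heads are shared across query-head groups, using the Llama3-8B shapes assumed throughout these notes (purely illustrative):

import numpy as np

num_q_heads, num_kv_heads, head_dim, seq = 32, 8, 128, 1024
group = num_q_heads // num_kv_heads                 # 4 query heads share each KV head

K_cache = np.zeros((num_kv_heads, seq, head_dim))   # only 8 K heads are stored in the cache
K_for_q = np.repeat(K_cache, group, axis=0)         # broadcast to 32 heads at attention time
print(K_for_q.shape)                                # (32, 1024, 128)
print(f"KV cache is {group}x smaller than with one KV head per query head")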

Key Takeaways

  1. KV cache dominates memory usage in LLM serving, limiting batch sizes
  2. PagedAttention eliminates fragmentation through page-based allocation
  3. Prefix sharing provides significant speedups for common use cases
  4. Loading is much faster than recomputation for evicted cache entries
  5. FlashAttention enables long sequences through memory-efficient tiled computation
  6. Proper memory management is crucial for maximizing GPU utilization in LLM serving