Batching in LLM Serving Systems

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

Overview

Batching is critical for LLM serving performance: an H100 needs a batch size of 333 to reach peak performance. Batching involves two key considerations:

  1. User Experience - maintaining response quality and latency
  2. Throughput - maximizing system efficiency

User Experience Metrics

Key Latency Metrics

Service Level Objectives (SLO)

Common SLO types in order of difficulty:

  1. End-to-end time (easiest)
  2. TTFT + Average TPOT
  3. TTFT + Maximum TPOT (hardest)

SLO Difficulty: TTFT + TPOT_max > TTFT + TPOT_avg > E2E
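
To make the ordering concrete, here is a minimal sketch that checks one request against the three SLO types. It assumes the usual definitions (TTFT = time to first token, TPOT = gap between consecutive output tokens); the `RequestTrace` class, the function names, and the limits are hypothetical and only for illustration.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float            # time the request arrived (seconds)
    token_times: list[float]  # completion time of each output token (seconds)

def latency_metrics(trace):
    """Return (e2e, ttft, tpot_avg, tpot_max) for one request."""
    ttft = trace.token_times[0] - trace.arrival
    e2e = trace.token_times[-1] - trace.arrival
    # Inter-token gaps after the first token.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    tpot_avg = sum(gaps) / len(gaps) if gaps else 0.0
    tpot_max = max(gaps, default=0.0)
    return e2e, ttft, tpot_avg, tpot_max

def meets_slos(trace, e2e_limit, ttft_limit, tpot_limit):
    """Evaluate the three SLO types against the same limits."""
    e2e, ttft, tpot_avg, tpot_max = latency_metrics(trace)
    return {
        "e2e": e2e <= e2e_limit,
        "ttft+tpot_avg": ttft <= ttft_limit and tpot_avg <= tpot_limit,
        "ttft+tpot_max": ttft <= ttft_limit and tpot_max <= tpot_limit,
    }

# A request with one generation stall between the third and fourth token.
trace = RequestTrace(arrival=0.0, token_times=[0.2, 0.25, 0.30, 0.60, 0.65])
result = meets_slos(trace, e2e_limit=1.0, ttft_limit=0.5, tpot_limit=0.15)
```

In this trace a single 0.3 s stall breaks the maximum-TPOT SLO while the average-TPOT and end-to-end SLOs still pass, which is one way to see why the maximum-TPOT variant is the hardest to meet.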

Deadline-Based SLO Management

Batching Strategies

1. Simple Batching

2. Continuous Batching (Orca)

Key Insight: Admit new requests when decode requests finish
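
The key insight can be sketched as a small simulation. This is a simplified illustration, not Orca's implementation: it assumes every request needs at least one decode token, treats admission (prefill) as free, and uses a hypothetical `continuous_batching` function.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, decode_len) in arrival order.
    Returns the iteration at which each request finished.
    """
    waiting = deque(requests)
    running = {}      # request_id -> decode tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new requests into any free slots before the next iteration.
        while waiting and len(running) < max_batch:
            rid, decode_len = waiting.popleft()
            running[rid] = decode_len
        step += 1
        # One model iteration: every running request produces one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_at[rid] = step
    return finished_at

done = continuous_batching([("a", 2), ("b", 5), ("c", 3)], max_batch=2)
```

Request "c" takes the slot freed by "a" after iteration 2 and finishes at iteration 5. If the batch had to drain completely before admitting new work, "c" could not start until "b" finished at iteration 5.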

Benefits:

Drawbacks:

Batch Size Calculation

For continuous batching with:

GEMM batch size = $$\frac{p+d}{d+1}B = B + \frac{p-1}{d+1}B$$

Key relationships:

Example: Batch size = 512, p/d = 2 $\to$ GEMM batch size = 512 $\times$ 3 = 1536
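
A quick numeric check of the formula, assuming p is the prefill length, d the decode length, and B the number of requests in the batch (these definitions are an assumption, since they are not spelled out here). One reading of the formula is that each request contributes p + d tokens over the d + 1 iterations it spends in the batch. The values p = 2000 and d = 1000 are illustrative.

```python
def gemm_batch_size(B, p, d):
    """Average tokens per iteration fed to the GEMMs: (p + d) / (d + 1) * B."""
    return (p + d) / (d + 1) * B

def gemm_batch_size_alt(B, p, d):
    """Equivalent form: B + (p - 1) / (d + 1) * B."""
    return B + (p - 1) / (d + 1) * B

# p/d = 2 with a long decode (illustrative values).
tokens = gemm_batch_size(B=512, p=2000, d=1000)
tokens_alt = gemm_batch_size_alt(B=512, p=2000, d=1000)
```

For large d the factor (p + d) / (d + 1) approaches p/d + 1, which is where the factor of 3 in the example comes from.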

3. Chunked Prefill

Problem: Simple continuous batching creates generation stalls due to variable prefill sizes

Solution: Break prefill into fixed-size chunks
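
As a minimal sketch of the chunking step, the hypothetical helper below splits a prompt into fixed-size token ranges; the last chunk may be shorter. The function name and the chunk size of 512 are assumptions for illustration.

```python
def split_prefill(prompt_len, chunk_size):
    """Return (start, end) token ranges covering a prompt in fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]

chunks = split_prefill(prompt_len=1300, chunk_size=512)
```

Each range is processed in its own iteration, so no single iteration has to absorb the full 1300-token prefill.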

Benefits:

Drawbacks:

Fixed Token Budget Approach
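
One common way to realize a fixed token budget is sketched below; it is an assumption about the approach, not a transcription of the lecture. Each iteration first schedules one token per decoding request, then fills the remaining budget with prefill tokens in arrival order. The function name and data layout are hypothetical.

```python
def build_iteration(decode_ids, prefill_remaining, token_budget):
    """Compose one iteration's batch under a fixed token budget.

    decode_ids: requests in the decode phase (one token each).
    prefill_remaining: dict request_id -> prompt tokens not yet prefilled
                       (insertion order = arrival order); updated in place.
    Returns (decode_ids_scheduled, {request_id: prefill_tokens_scheduled}).
    """
    scheduled_decodes = decode_ids[:token_budget]
    budget = token_budget - len(scheduled_decodes)
    prefill_chunks = {}
    for rid in list(prefill_remaining):
        if budget == 0:
            break
        # Take as much of this prompt as the remaining budget allows.
        take = min(prefill_remaining[rid], budget)
        prefill_chunks[rid] = take
        budget -= take
        prefill_remaining[rid] -= take
        if prefill_remaining[rid] == 0:
            del prefill_remaining[rid]
    return scheduled_decodes, prefill_chunks

pending = {"r1": 700, "r2": 400}
decodes, chunks = build_iteration(["d1", "d2", "d3"], pending, token_budget=512)
decodes2, chunks2 = build_iteration(["d1", "d2", "d3"], pending, token_budget=512)
```

Every iteration processes at most `token_budget` tokens, so iteration time, and with it TPOT, stays roughly bounded regardless of prompt length.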

4. Prefill-Decode Disaggregation

Architecture: Separate clusters for prefill and decode operations

Process:

  1. Prefill server processes input tokens
  2. KV cache transferred to decode server
  3. Decode server handles token generation
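
Step 2 is the extra cost this architecture introduces, so it helps to estimate how much data moves. The sketch below uses the standard KV cache size (keys and values for every layer and KV head); the model shape and the 50 GB/s link bandwidth are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Size of one request's KV cache: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s

# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, fp16.
size = kv_cache_bytes(num_tokens=1024, num_layers=32, num_kv_heads=8, head_dim=128)
seconds = transfer_seconds(size, bandwidth_bytes_per_s=50e9)
```

Under these assumptions a 1K-token prompt produces 128 MiB of KV cache, which takes a few milliseconds to move over the assumed link.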

Benefits:

Drawbacks:

KV Transfer Optimization

Batching Limitations

1. SLO Constraints

For PD Disaggregation:

For Chunked Prefill:

2. GPU Memory Capacity

KV Cache Limitations:

Batch Size Formulas

For constant lengths:

$$B = \frac{C}{p + \frac{1}{2}d}$$

For variable lengths: $$B = \frac{d_{avg}C}{(pd)_{avg} + \frac{1}{2}(d^2)_{avg}}$$

Where:

Example: 1K input, uniform 0-4K output $\to$ effective batch size $\approx$ 220
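
The two formulas can be checked numerically. The sketch below assumes C is the KV cache capacity in tokens, p the input length, and d the output length. A request's cache grows from p to p + d, so it holds about p + d/2 tokens on average, and longer outputs stay in the batch longer, which is why the averages are weighted by d. The capacity C = 512K tokens is a hypothetical value, not one given in these notes.

```python
def batch_size_constant(C, p, d):
    """B = C / (p + d/2) when every request has the same p and d."""
    return C / (p + d / 2)

def batch_size_variable(C, lengths):
    """B = d_avg * C / ((p*d)_avg + (d^2)_avg / 2) over (p, d) samples."""
    n = len(lengths)
    d_avg = sum(d for _, d in lengths) / n
    pd_avg = sum(p * d for p, d in lengths) / n
    d2_avg = sum(d * d for _, d in lengths) / n
    return d_avg * C / (pd_avg + d2_avg / 2)

C = 512 * 1024  # hypothetical capacity in tokens; the notes do not state C
workload = [(1024, d) for d in range(0, 4097)]  # 1K input, uniform 0-4K output
B = batch_size_variable(C, workload)

# With a single fixed length the variable formula reduces to the constant one.
B_fixed = batch_size_variable(C, [(1024, 2048)])
```

With this assumed capacity the variable-length formula gives a batch size of roughly 219, in line with the example, while fixing every output at the mean length of 2K gives 256. Variable output lengths therefore reduce the achievable batch size at the same mean.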

3. Memory Management Strategies

Prediction-Based Control:

Out-of-Memory Handling:

Performance Comparison

| Method | Throughput | TTFT | TPOT | Infra Complexity |
|---|---|---|---|---|
| Simple | Lowest | Short | Short | Low |
| Continuous Batching | High | Longer | Long, Unstable | Low |
| Chunked Prefill | Highest | Longest | Long, Controlled | Medium |
| PD Disaggregation | Low | Short | Short | High |

Advanced Considerations

SLO Attainment Strategies

Fairness Constraints

Output Length Impact