Batching in LLM Serving Systems

Category: Machine Learning Systems


Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu.

Overview

Batching is critical for LLM serving performance: an H100 needs a batch size of roughly 333 to reach peak throughput. Batching involves two key considerations:

  1. User Experience - maintaining response quality and latency
  2. Throughput - maximizing system efficiency

User Experience Metrics

Key Latency Metrics

  • Time to First Token (TTFT): Time between user request submission and first token generation
    • TTFT = Queuing time + Prefill time
  • Time Per Output Token (TPOT/ITL/IBT): Time between consecutive output tokens
  • Average TPOT: Average time to produce one output token
  • End-to-end time: Total time to queue, prefill, and decode entire request
  • Normalized Latency: End-to-end time / number of output tokens
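
These metrics can be computed directly from per-token timestamps. Below is a minimal Python sketch (the function and argument names are illustrative, not from the lecture) showing how TTFT, TPOT, end-to-end time, and normalized latency relate:

```python
def latency_metrics(arrival_time, token_times):
    """Derive latency metrics for one request.

    arrival_time: wall-clock time the request was submitted (assumed given)
    token_times:  wall-clock time each output token was emitted
    """
    ttft = token_times[0] - arrival_time                    # queuing + prefill
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    avg_tpot = sum(gaps) / len(gaps) if gaps else 0.0       # mean inter-token gap
    max_tpot = max(gaps) if gaps else 0.0                   # worst inter-token gap
    e2e = token_times[-1] - arrival_time                    # queue + prefill + decode
    normalized = e2e / len(token_times)                     # e2e per output token
    return {"ttft": ttft, "avg_tpot": avg_tpot, "max_tpot": max_tpot,
            "e2e": e2e, "normalized": normalized}
```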

Service Level Objectives (SLO)

Common SLO types in order of difficulty:

  1. End-to-end time (easiest)
  2. TTFT + Average TPOT
  3. TTFT + Maximum TPOT (hardest)

SLO Difficulty: TTFT + TPOT_max > TTFT + TPOT_avg > E2E

Deadline-Based SLO Management

  • Can use delay strategies to smooth token generation when deadlines allow
  • Difficulty with a deadline (DDL): TTFT + TPOT_max ≥ DDL > TTFT + TPOT_avg > E2E
  • Delaying token output can help meet consistent TPOT requirements
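
A minimal sketch of the delay idea, assuming the serving layer buffers tokens generated ahead of schedule and releases them on a fixed cadence (names and arguments are illustrative):

```python
def release_times(arrival, ttft_budget, tpot_budget, num_tokens):
    """Latest acceptable emission time for each output token under a
    TTFT + max-TPOT SLO.  Tokens produced earlier than their slot can be
    buffered and released at these times to keep TPOT consistent."""
    first_token = arrival + ttft_budget
    return [first_token + i * tpot_budget for i in range(num_tokens)]

# Example: request arrives at t=0s, TTFT budget 0.5s, TPOT budget 50ms.
print(release_times(0.0, 0.5, 0.05, 4))  # ≈ [0.5, 0.55, 0.6, 0.65]
```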

Batching Strategies

1. Simple Batching

  • Characteristics:
    • Lowest throughput
    • Short TTFT and TPOT
    • Low infrastructure complexity
  • Limitation: Bottlenecked by longest decode request

2. Continuous Batching (Orca)

Key Insight: Admit new requests when decode requests finish

Benefits:

  • Higher throughput: No waiting for longest decode request
  • Stabilized GEMM batch size: Better GPU utilization
  • Reduced queuing time: Requests enter at token granularity

Drawbacks:

  • Higher prefill latency: Prefill batched with decode requests
  • Higher decode latency: Prefill slows concurrent decode
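
A minimal sketch of the iteration-level scheduling loop (the `engine` and request objects here are hypothetical, not Orca's actual API): finished requests leave the batch after every step, and queued requests are admitted into the freed slots immediately.

```python
from collections import deque

def continuous_batching_loop(engine, waiting: deque, max_batch_size: int):
    """Iteration-level scheduling: retire finished requests after every model
    step and immediately admit new ones, instead of draining the whole batch."""
    running = []
    while running or waiting:
        # Admit new requests into free slots (token-granularity admission).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One model iteration: prefill for newly admitted requests,
        # one decode token for everything else.
        engine.step(running)
        # Requests that emitted EOS this iteration free their slots.
        running = [r for r in running if not r.finished]
```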

Batch Size Calculation

For continuous batching with:

  • Decode length d
  • Prefill length p
  • Request batch size B

GEMM batch size = $$\frac{p+d}{d+1}B = B + \frac{p-1}{d+1}B$$

Key relationships:

  • Increasing prefill length p increases GEMM batch size
  • Increasing decode length d decreases GEMM batch size
  • For large d: GEMM batch size ≈ $(1 + \frac{p}{d})B$

Example: Batch size = 512, p/d = 2 → GEMM batch size = 512 × 3 = 1536
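
A quick numeric check of the formula and the large-d approximation (a small helper, not part of the lecture material):

```python
def gemm_batch_size(p, d, B):
    """Average tokens per iteration: each request contributes p prefill tokens
    over roughly one iteration and d decode tokens over d iterations."""
    return (p + d) / (d + 1) * B

print(gemm_batch_size(p=2048, d=1024, B=512))  # ~1534.5 (exact formula)
print((1 + 2048 / 1024) * 512)                 # 1536.0  (large-d approximation)
```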

3. Chunked Prefill

Problem: Simple continuous batching creates generation stalls due to variable prefill sizes

Solution: Break prefill into fixed-size chunks

Benefits:

  • Highest throughput: Further stabilized batch size
  • Controlled decode latency: Eliminates generation stalls
  • Consistent GEMM batch size: Better performance predictability

Drawbacks:

  • Longest TTFT: More cycles needed for prefill
  • Higher infrastructure complexity: Chunking management overhead

Fixed Token Budget Approach

  • Allocate fixed token budget per iteration
  • Mix decode requests with prefill chunks
  • Maintains constant GEMM batch size
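
A sketch of how one iteration can be packed under a fixed token budget (the request fields used here are assumptions): all running decode requests are scheduled first, and whatever budget remains goes to the next prefill chunk.

```python
def build_iteration(decode_reqs, prefill_queue, token_budget):
    """Pack one iteration's batch.  Decode requests cost one token each;
    leftover budget advances the prefill at the head of the queue by one chunk."""
    batch = [(req, 1) for req in decode_reqs]        # one new token per decode request
    remaining = token_budget - len(decode_reqs)
    if remaining > 0 and prefill_queue:
        req = prefill_queue[0]
        chunk = min(remaining, req.prefill_tokens_left)
        batch.append((req, chunk))                   # a (partial) prefill chunk
        req.prefill_tokens_left -= chunk
        if req.prefill_tokens_left == 0:
            prefill_queue.pop(0)                     # prefill finished; request decodes next
    return batch
```

Because every batch sums to roughly `token_budget` tokens, the GEMM batch size stays constant and decode requests never stall behind a long prefill.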

4. Prefill-Decode Disaggregation

Architecture: Separate clusters for prefill and decode operations

Process:

  1. Prefill server processes input tokens
  2. KV cache transferred to decode server
  3. Decode server handles token generation

Benefits:

  • Short TTFT and TPOT: Decoupled operations
  • Optimal latency: Can scale to one request per machine

Drawbacks:

  • Low throughput: Prefill cluster fully utilized, decode underutilized
  • High infrastructure complexity: KV cache transfer overhead
  • Network overhead: KV transfer can be significant (160ms for 16K tokens)

KV Transfer Optimization

  • Use chunked prefill + layer-wise transfer to overlap KV movement
  • Transfer last layer of last chunk when prefill completes
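
A sketch of the overlap, assuming chunked prefill and a hypothetical asynchronous `send_kv` copy to the decode server: each chunk's KV for layer i is shipped while layer i+1 is still computing, so when prefill finishes only the last layer of the last chunk remains in flight.

```python
def prefill_with_layerwise_transfer(chunks, layers, send_kv):
    """Overlap KV-cache movement with prefill compute (send_kv is assumed to
    be an async transfer to the decode server)."""
    for chunk in chunks:
        hidden = chunk.embeddings
        for i, layer in enumerate(layers):
            hidden, kv = layer(hidden)                 # compute this layer's KV
            send_kv(layer=i, chunk=chunk, kv=kv)       # ship it while the next layer runs
    # At this point only the final layer of the final chunk is still being sent.
```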

Batching Limitations

1. SLO Constraints

For PD Disaggregation:

  • Prefill batch limit → TTFT constraint
  • Decode batch limit → TPOT constraint: B × attn + GEMM(B) + C < TPOT

For Chunked Prefill:

  • Cycle time = $\text{GEMM}(B_{dense}) + \frac{d}{p+d}B_{dense} \times \text{attn}$
  • Constraints: Cycle time < TPOT, and $\frac{p+d}{B_{dense}} \times$ Cycle time < TTFT
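
A small feasibility check for these constraints (the cost models `gemm_time` and `attn_time` are assumed inputs, e.g. fitted from profiling):

```python
def chunked_prefill_feasible(B_dense, p, d, gemm_time, attn_time, ttft_slo, tpot_slo):
    """Check whether a dense batch size B_dense meets both SLOs.

    gemm_time(B): per-iteration GEMM latency for B tokens (assumed cost model)
    attn_time:    per-request decode attention cost (assumed constant)
    """
    cycle = gemm_time(B_dense) + (d / (p + d)) * B_dense * attn_time
    prefill_cycles = (p + d) / B_dense     # iterations to fully prefill one request
    return cycle < tpot_slo and prefill_cycles * cycle < ttft_slo
```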

2. GPU Memory Capacity

KV Cache Limitations:

  • For 8B model on H100: ~512K tokens max
  • Challenge: Output length unknown, KV cache grows over time

Batch Size Formulas

For constant lengths:

$$B = \frac{C}{p + \frac{1}{2}d}$$

For variable lengths: $$B = \frac{d_{avg}\,C}{(pd)_{avg} + \frac{1}{2}(d^2)_{avg}}$$

Where:

  • C = KV cache capacity
  • p = prefill length
  • d = decode length
  • Longer requests occupy cache longer, reducing effective batch size

Example: 1K input, uniform 0–4K output → effective batch size ≈ 220
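
The example can be reproduced from the variable-length formula, taking C = 512K tokens from the 8B-on-H100 figure above (a quick Monte Carlo check, not lecture code):

```python
import random

def effective_batch_size(C, p, d_samples):
    """B = d_avg * C / ((p*d)_avg + 0.5 * (d^2)_avg)."""
    n = len(d_samples)
    d_avg = sum(d_samples) / n
    pd_avg = sum(p * d for d in d_samples) / n
    d2_avg = sum(d * d for d in d_samples) / n
    return d_avg * C / (pd_avg + 0.5 * d2_avg)

random.seed(0)
d_samples = [random.uniform(0, 4 * 1024) for _ in range(100_000)]  # uniform 0-4K outputs
print(effective_batch_size(C=512 * 1024, p=1024, d_samples=d_samples))  # ≈ 220
```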

3. Memory Management Strategies

Prediction-Based Control:

  • Use small encoder models to predict output length
  • Stop prefill when KV cache predicted to exceed capacity
  • Add decode pending queues for PD disaggregation
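
A sketch of prediction-based admission (the predictor and request fields are assumptions): a new prefill is admitted only if the projected peak KV footprint of all running requests still fits in the cache.

```python
def should_admit(new_req, running, kv_capacity, predict_output_len):
    """Admit a prefill only if predicted peak KV usage stays under capacity.
    predict_output_len is a small predictor model (assumed given)."""
    projected = sum(r.prompt_len + predict_output_len(r) for r in running)
    projected += new_req.prompt_len + predict_output_len(new_req)
    return projected <= kv_capacity
```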

Out-of-Memory Handling:

  • Similar to prefix sharing eviction
  • Offload KV cache to CPU memory (faster than recomputation)
  • Evict least recently used requests
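
A sketch of the eviction path when the cache still overflows (the cache and request interfaces are hypothetical): the least recently scheduled request's KV blocks are offloaded to host memory rather than discarded, since copying them back is cheaper than recomputing the prefill.

```python
def handle_oom(running, kv_cache, cpu_pool):
    """Offload the LRU request's KV blocks to CPU memory to relieve pressure."""
    victim = min(running, key=lambda r: r.last_scheduled_step)   # least recently scheduled
    cpu_pool.store(victim.id, kv_cache.pop(victim.id))           # move blocks to host RAM
    victim.state = "swapped"                                     # resume later by copying back
    return victim
```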

Performance Comparison

| Method | Throughput | TTFT | TPOT | Infra Complexity |
|---|---|---|---|---|
| Simple | Lowest | Short | Short | Low |
| Continuous Batching | High | Longer | Long, unstable | Low |
| Chunked Prefill | Highest | Longest | Long, controlled | Medium |
| PD Disaggregation | Low | Short | Short | High |

Advanced Considerations

SLO Attainment Strategies

  • 95% SLO targets may require:
    • Dropping long input/output requests
    • Infinite delay for predicted SLO violations
    • Prioritizing high-SLO requests

Fairness Constraints

  • All requests should make progress
  • Equal throughput share per user
  • SLO violation "badness" metrics vs binary attain/violate

Output Length Impact

  • Longer requests stay in KV cache longer
  • Creates memory pressure and batch size oscillations
  • Prediction accuracy critical for optimal performance