Overview

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu.

Speculative Decoding is a technique to accelerate large language model inference by using a smaller, faster model to generate candidate tokens that are then verified by the larger target model in parallel.

Key References

Core Concept

Speculative Decoding in a Nutshell

Key Enabling Observations

  1. Compute vs. Memory Bound:
     - LLM serving is compute-bound at large batch sizes
     - At smaller batch sizes, LLM serving becomes memory-bound
     - A batch of quickly generated tokens can be verified in parallel at once

  2. Draft Model Accuracy:
     - Small (draft) LLMs are quite accurate for most "easy" tokens
     - Most of the time, the large (target) LLM is not needed
     - Example: "Geoffrey Hinton did his PhD at the University of..." → "Edinburgh" (easy) vs. more complex completions (difficult)

Algorithm Details

Two-Step Process

  1. Draft Generation: Run the draft model for N iterations (e.g., N = 5):

     p1(x) = Mp(prefix) → x1
     p2(x) = Mp(prefix, x1) → x2
     ...
     p5(x) = Mp(prefix, x1, x2, x3, x4) → x5

  2. Parallel Verification: Run the target model once to verify all tokens:

     q1(x), q2(x), q3(x), q4(x), q5(x), q6(x) = Mq(prefix, x1, x2, x3, x4, x5)

Important: The target model pass only produces distributions; the candidate tokens themselves are sampled from the draft model. A minimal sketch of this two-step loop is given below.
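
Below is a minimal sketch of the draft-then-verify loop, using simple greedy (exact-match) acceptance rather than the full rejection-sampling rule described in the next section. It assumes Hugging Face-style causal LM objects draft_model and target_model whose outputs expose logits of shape (batch, seq_len, vocab_size); all names are illustrative, not taken from the lecture.

import torch

def speculative_step(draft_model, target_model, input_ids, gamma=5):
    # Step 1: draft generation -- run the small model gamma times autoregressively
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft_model(draft_ids).logits                      # (1, cur_len, vocab)
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # Step 2: parallel verification -- one target pass scores every drafted position
    target_logits = target_model(draft_ids).logits                  # (1, prefix+gamma, vocab)
    target_preds = torch.argmax(target_logits, dim=-1)

    # Accept drafted tokens while they match what the target itself would have chosen
    n_accepted = 0
    start = input_ids.shape[1]
    for i in range(gamma):
        if draft_ids[0, start + i] == target_preds[0, start + i - 1]:
            n_accepted += 1
        else:
            break
    return n_accepted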

Rejection Sampling Process

For each drafted token x, compare the draft probability p(x) with the target probability q(x): if q(x) ≥ p(x), the token is accepted; otherwise it is accepted only with probability q(x)/p(x), and rejected the rest of the time.

Handling Rejections

When a token is rejected:
  - Sample the replacement token from the corrected distribution norm((q(x) - p(x))+)
  - The + notation means negative values are clipped to zero before normalization, so we never sample from negative probabilities
  - This guarantees the overall output distribution matches exactly what the target model alone would produce

A small sketch of this accept/reject step is shown below.
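
A sketch of the accept/reject step, assuming p_draft and q_target are the two models' probability vectors over the vocabulary at the position being checked (the function name and signature are illustrative):

import torch

def accept_or_resample(p_draft, q_target, token):
    # token was sampled from p_draft, so p_draft[token] > 0
    p_x, q_x = p_draft[token], q_target[token]
    # Accept with probability min(1, q(x) / p(x))
    if torch.rand(()) < torch.clamp(q_x / p_x, max=1.0):
        return token, True
    # Rejected: resample from the corrected distribution norm((q(x) - p(x))+)
    corrected = torch.clamp(q_target - p_draft, min=0.0)
    corrected = corrected / corrected.sum()
    new_token = torch.multinomial(corrected, num_samples=1).item()
    return new_token, False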

Token Generation Outcomes

Performance Analysis

Speedup Factors

Speedup Results

The effectiveness shows:
  - Higher draft accuracy (alpha) leads to better speedup
  - Optimal gamma values exist (diminishing returns from too many draft predictions)
  - Typical speedups: 1.4x to 3.4x depending on model size and task

The small calculation below illustrates the diminishing returns from larger gamma.
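
One way to see this, following the analysis in the original speculative decoding paper (the formula is quoted here for illustration, not from the lecture slides): with per-token acceptance rate alpha, the expected number of tokens produced per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha).

# Expected tokens generated per target-model pass for acceptance rate alpha
def expected_tokens(alpha, gamma):
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for gamma in (1, 3, 5, 7, 9):
    print(gamma, round(expected_tokens(0.8, gamma), 2))
# At alpha = 0.8 the gains flatten quickly: 1.8, 2.95, 3.69, 4.16, 4.46 tokens per pass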

Advanced Techniques

Medusa

Key Innovation: Add multiple prediction heads to a single model instead of using separate draft/target models.
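
As a rough illustration, each extra head can be pictured as a small residual MLP on top of the backbone's final hidden states with its own output projection, so head k predicts the token k+1 positions ahead. The sketch below assumes that structure and Llama-3-like shapes; it is not Medusa's actual implementation.

import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    # One extra head; head k predicts the token k+1 steps ahead from the same hidden state
    def __init__(self, hidden_dim=4096, vocab_size=128256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden):                     # hidden: (seq_len, hidden_dim)
        hidden = hidden + torch.nn.functional.silu(self.proj(hidden))   # residual block
        return self.lm_head(hidden)                # (seq_len, vocab_size)

# The backbone's own lm_head predicts position t+1; K extra heads predict t+2 .. t+K+1
heads = nn.ModuleList([MedusaHead() for _ in range(4)])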

Architecture:

Tree Attention:

Variants:

SpecInfer

Problem: Single draft model may not provide enough "coverage"

Solution: Use multiple draft models simultaneously

Token Tree Verification:
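
The mechanical core is a tree-shaped attention mask: all drafted branches are flattened into one sequence, and each token may attend only to its ancestors, so the target model can score every branch in a single pass. Below is a minimal sketch of building such a mask; the parent-pointer layout and function name are illustrative assumptions, not SpecInfer's code.

import torch

def tree_attention_mask(parents):
    # parents[i] is the index of node i's parent in the flattened token tree (-1 for a root).
    # mask[i][j] = True means node i may attend to node j (itself and its ancestors).
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Two draft models propose different continuations after the same first token (node 0):
#   node 0 -> node 1 -> node 2   (draft model A's branch)
#   node 0 -> node 3 -> node 4   (draft model B's branch)
print(tree_attention_mask([-1, 0, 1, 0, 3]))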

Implementation Details

Parallel Token Probability Computation

import torch

def sample_next_token(model_output, lm_head_weight):
    # Project hidden states to the vocabulary
    # Input:  (seq_len, hidden_dim),  e.g. (seq_len, 4096)
    # Output: (seq_len, vocab_size),  e.g. (seq_len, 128256)
    logits = model_output.matmul(lm_head_weight.t())
    # Greedy next token for every position in the sequence
    sample_output = torch.argmax(logits, dim=1)
    # Return the token that follows the last token of the input sequence
    return sample_output[-1].item()

This yields the target model's next-token prediction (and full distribution) for every position in the sequence in a single pass, which is exactly what parallel verification needs.
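
For example, with toy tensors (random weights stand in for the real lm_head; only the shapes matter):

import torch

# A 10-token sequence with hidden size 4096 and vocab size 128256
hidden_states = torch.randn(10, 4096)
lm_head_weight = torch.randn(128256, 4096)

# Greedy prediction at every position -- exactly what parallel verification compares against
all_predictions = torch.argmax(hidden_states.matmul(lm_head_weight.t()), dim=1)   # (10,)
# The prediction following only the last input token (ordinary decoding)
assert sample_next_token(hidden_states, lm_head_weight) == all_predictions[-1].item()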

Benefits Timeline Comparison

Results Summary

Performance Gains

Key Insight

Diminishing returns from increased gamma (number of draft predictions) - there's an optimal balance between speculation depth and verification overhead.

Practical Applications

Speculative decoding is particularly effective for:
  - Memory-bound serving scenarios (smaller batch sizes)
  - Tasks with predictable patterns, where draft models can be reasonably accurate
  - Scenarios requiring maintained output quality (lossless acceleration)
  - Real-time applications where latency reduction is critical