Transformer Architecture and Implementation
Transformer Architecture Overview
Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu
Prefill vs. Decode Phases
Prefill Phase:
- Processes entire input prompt at once
- All tokens processed in parallel
- Compute-bound operation
Decode Phase:
- Generates one token at a time
- Sequential generation process
- Memory-bound operation (utilizes KV cache)
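A rough sketch (toy shapes, made up for illustration) of why the two phases stress different resources: prefill multiplies queries and keys for the whole prompt at once, while each decode step multiplies a single new query against the ever-growing KV cache.
import torch

prompt_len, n_heads, head_dim = 8, 32, 128

# Prefill: every prompt token attends to every token in one large batched matmul
q_prefill = torch.randn(n_heads, prompt_len, head_dim)
k_prefill = torch.randn(n_heads, prompt_len, head_dim)
scores_prefill = q_prefill @ k_prefill.transpose(-1, -2)   # (32, 8, 8): compute-heavy

# Decode: one new query attends to all cached keys; runtime is dominated by reading the cache
cache_len = 512
q_decode = torch.randn(n_heads, 1, head_dim)
k_cache = torch.randn(n_heads, cache_len, head_dim)        # read from the KV cache
scores_decode = q_decode @ k_cache.transpose(-1, -2)       # (32, 1, 512): memory-bound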
Transformer Layers and Iterations
- Inference Iteration: Complete forward pass through all layers to generate one output token
- Inference Layer: Single transformer layer containing attention and FFN components
- Activations: Intermediate representations passed between layers
Core Transformer Components
1. Embedding Layer
Purpose: Convert token IDs to dense vector representations
input_ids: [The, University, ...]           # token IDs (integers) produced by the tokenizer
embeddings: {
    The:        [0.12, -0.83, 0.07, ...],   # dense float vector, 4096 elements in Llama
    University: [0.55, 0.31, -0.94, ...]
}
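A minimal sketch of the lookup using a PyTorch nn.Embedding with assumed Llama-7B-like sizes; the token IDs below are placeholders, not real tokenizer output.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 32_000, 4096          # assumed Llama-7B-like sizes
embed = nn.Embedding(vocab_size, hidden_dim)   # learned lookup table of shape (32000, 4096)

input_ids = torch.tensor([[450, 3014]])        # placeholder IDs standing in for "The", "University"
hidden_states = embed(input_ids)               # (batch=1, seq_len=2, hidden_dim=4096)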
2. Attention Mechanism
Self-Attention Formula
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- Q (Query): What the transformer is looking for
- K (Key): What's available in the sequence
- V (Value): The content that gets aggregated, weighted by the attention scores
- $d_k$: Head dimension for scaling
Causal Self-Attention
- Causal Mask: Prevents tokens from attending to future tokens
- Uses $-\infty$ values in attention matrix for future positions
- When $x_i \to -\infty$, $\text{softmax}(x_i) \to 0$
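A minimal single-head sketch of the attention formula with the causal mask applied as $-\infty$ entries (toy shapes, no batching or KV cache):
import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) for a single head
    d_k = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_k)      # (seq_len, seq_len)
    seq_len = scores.shape[-1]
    # Upper-triangular entries are future positions: set them to -inf so softmax sends them to 0
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                # each row sums to 1 over past tokens
    return weights @ v                                     # (seq_len, head_dim)

out = causal_attention(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64))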
Grouped Query Attention (GQA)
- Purpose: Reduce KV cache memory usage
- Mechanism: Multiple query heads share the same key and value heads
- Group Size: Number of query heads per key/value head (e.g., group size = 4)
- Benefit: Shrinks the KV cache by the group size, allowing the batch size to grow by roughly that factor (see the sketch below)
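A sketch of the sharing pattern with assumed sizes (32 query heads, 8 KV heads, so group size 4): only the 8 KV heads are ever cached, and each one is reused by 4 query heads.
import torch

num_q_heads, num_kv_heads, head_dim, seq_len = 32, 8, 128, 16

q = torch.randn(num_q_heads, seq_len, head_dim)
k = torch.randn(num_kv_heads, seq_len, head_dim)             # only 8 KV heads are stored/cached
group_size = num_q_heads // num_kv_heads                     # 4 query heads per KV head

# Expand each KV head so every query head has a matching key tensor for the matmul
k_expanded = k.repeat_interleave(group_size, dim=0)          # (32, 16, 128)
scores = q @ k_expanded.transpose(-1, -2) / head_dim ** 0.5  # (32, 16, 16)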
3. Multi-Head Attention
Head Separation Process
# Original query tensor
q = [[1, 2, 3, 4, 5, 6],        # Token 1
     [7, 8, 9, 10, 11, 12]]     # Token 2
# Shape: (seq_len, hidden_dim) = (2, 6)
# Separated into heads
sub_q = [[[1, 2, 3],            # Head 1 for Token 1
          [4, 5, 6]],           # Head 2 for Token 1
         [[7, 8, 9],            # Head 1 for Token 2
          [10, 11, 12]]]        # Head 2 for Token 2
# Shape: (seq_len, num_heads, head_dim) = (2, 2, 3)
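In tensor code this separation is just a reshape (plus a transpose to put heads first); a minimal PyTorch sketch of the same toy example:
import torch

q = torch.arange(1, 13).reshape(2, 6)    # (seq_len=2, hidden_dim=6), same values as above
sub_q = q.reshape(2, 2, 3)               # (seq_len, num_heads=2, head_dim=3)
heads_first = sub_q.transpose(0, 1)      # (num_heads, seq_len, head_dim), ready for per-head attention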
4. Feed Forward Network (FFN)
Architecture: Two parallel linear projections combined through a gated activation, followed by a down projection
- Up Projection: Expands the hidden dimension
- Gate Projection: Controls information flow
- Activation Function: SiLU (Swish) applied to the gate projection; the gated combination is known as SwiGLU (Swish-Gated Linear Unit)
- Down Projection: Returns to the original hidden dimension
Mathematical representation: $$\text{FFN}(x) = \text{Down}\big(\text{SiLU}(\text{Gate}(x)) \odot \text{Up}(x)\big)$$
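A minimal sketch of a Llama-style SwiGLU FFN; the layer names and sizes (4096 → 11008) are assumed, loosely following Llama-7B.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden_dim=4096, intermediate_dim=11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x):
        # Down(SiLU(Gate(x)) * Up(x))
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

y = SwiGLUFFN()(torch.randn(2, 4096))    # (seq_len=2, hidden_dim=4096)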
5. Normalization
RMSNorm (Root Mean Square Normalization)
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \odot g$$
Where:
- x: Input vector of size n
- $\epsilon$: Small constant for numerical stability (e.g., $10^{-8}$)
- g: Learned scaling parameter (element-wise multiplication)
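A minimal sketch that follows the formula above directly (the hidden size and $\epsilon$ are assumed values):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_dim=4096, eps=1e-8):
        super().__init__()
        self.g = nn.Parameter(torch.ones(hidden_dim))    # learned scale g, applied element-wise
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.g

y = RMSNorm()(torch.randn(2, 4096))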
6. Residual Connections
- Purpose: Enable gradient flow in deep networks
- Implementation: Add input to output of each major component
- Formula: $\text{output} = \text{input} + \text{component}(\text{input})$
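A minimal sketch of how residuals wrap each sub-block in a pre-norm decoder layer; the attention and FFN below are stand-in linear layers, and LayerNorm stands in for RMSNorm.
import torch
import torch.nn as nn

hidden = 4096
norm1, norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)   # stand-ins; Llama uses RMSNorm
attn = nn.Linear(hidden, hidden)                            # placeholder for self-attention
ffn = nn.Linear(hidden, hidden)                             # placeholder for the SwiGLU FFN

x = torch.randn(2, hidden)
x = x + attn(norm1(x))    # residual connection around the attention block
x = x + ffn(norm2(x))     # residual connection around the FFN block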
Multi-GPU Implementation
Tensor Parallelism
- Weight Distribution: Split weight matrices across GPUs
- Query/Key/Value Projections: Split by attention head, so each GPU computes a subset of heads
- Computation: Parallel matrix multiplications
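A single-process sketch of the sharding arithmetic (no real GPUs or communication; the final concatenation stands in for an AllGather):
import torch

hidden, num_gpus = 4096, 2
W_q = torch.randn(hidden, hidden)            # full query projection weight
shards = W_q.chunk(num_gpus, dim=1)          # column-parallel split: each "GPU" holds (4096, 2048)

x = torch.randn(2, hidden)                   # activations are replicated on every GPU
partial_q = [x @ w for w in shards]          # each GPU computes the queries for its own heads
q = torch.cat(partial_q, dim=-1)             # gathering the partial results (AllGather in practice)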
Communication Operations
AllGather
- Purpose: Collect partial results from all GPUs
- Usage: After attention computation to gather all head outputs
- Operation Type: Network-bound
AllReduce
- Purpose: Sum partial results across GPUs
- Composition: ReduceScatter + AllGather
- Usage: After FFN down projection
- Operation Type: Network-bound
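A minimal sketch of both collectives using torch.distributed; it assumes a launch such as `torchrun --nproc_per_node=2 demo.py` and uses the gloo backend for CPU tensors.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")          # "nccl" on GPUs; torchrun supplies the rendezvous env vars
rank, world = dist.get_rank(), dist.get_world_size()

# AllReduce: every rank contributes a partial sum (e.g., partial FFN down-projection outputs)
partial = torch.full((4,), float(rank))
dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # every rank now holds the summed tensor

# AllGather: every rank contributes its shard (e.g., the outputs of its attention heads)
shard = torch.full((4,), float(rank))
gathered = [torch.empty(4) for _ in range(world)]
dist.all_gather(gathered, shard)                 # every rank now holds all shards

dist.destroy_process_group()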
Resource Utilization Patterns
Compute-Bound Operations
- Query, Key, Value projections
- Up and Gate projections in FFN
- Output projections
- Prefill attention computation
Memory-Bound Operations
- Decode attention (KV cache access)
- Reading cached key-value pairs
Network-Bound Operations
- AllGather communications
- AllReduce communications
Key Implementation Details
KV Cache Management
- Storage: One K and V cache per request and per layer, reused across decode iterations
- Purpose: Avoid recomputing key-value pairs during decode
- Memory Impact: Major contributor to GPU memory usage
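A minimal sketch of a preallocated per-layer KV cache for one sequence (sizes assumed, loosely Llama-like); each decode step writes one new K/V entry per layer and then attends over positions 0..pos.
import torch

num_layers, num_kv_heads, head_dim, max_len = 32, 8, 128, 2048

# One K and one V buffer per layer for this sequence, preallocated up to max_len
k_cache = torch.zeros(num_layers, num_kv_heads, max_len, head_dim)
v_cache = torch.zeros(num_layers, num_kv_heads, max_len, head_dim)

def append_kv(layer, pos, k_new, v_new):
    # Called once per layer per decode step; k_new, v_new: (num_kv_heads, head_dim)
    k_cache[layer, :, pos] = k_new
    v_cache[layer, :, pos] = v_new

append_kv(layer=0, pos=0,
          k_new=torch.randn(num_kv_heads, head_dim),
          v_new=torch.randn(num_kv_heads, head_dim))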
Rotary Positional Encoding
- Application: Applied to query and key vectors
- Purpose: Encode relative position information
- Advantage: Better handling of variable sequence lengths
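A minimal sketch of the rotation for one head vector, using the interleaved-pair formulation of RoPE (the exact pairing convention differs between implementations):
import torch

def rope(x, pos, base=10_000):
    # x: (head_dim,) query or key vector for the token at position `pos`
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,) rotation frequencies
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]                                           # pair up even/odd dimensions
    rotated = torch.empty_like(x)
    rotated[0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    rotated[1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return rotated

q_rot = rope(torch.randn(128), pos=5)    # applied to Q and K before the attention matmul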
Softmax Computation
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
- Numerical Stability: Implemented by subtracting the row maximum before exponentiating, which leaves the result unchanged
- Attention Weights: The output is a probability distribution over positions (each row sums to 1)
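A minimal sketch of the numerically stable version (subtracting the row maximum leaves the output unchanged because the constant factor cancels):
import torch

def stable_softmax(x, dim=-1):
    # Subtracting the max keeps exp() from overflowing without changing the result
    x = x - x.max(dim=dim, keepdim=True).values
    e = torch.exp(x)
    return e / e.sum(dim=dim, keepdim=True)

w = stable_softmax(torch.tensor([[1000.0, 1001.0, 1002.0]]))   # naive exp() would overflow here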
Architecture Comparison
Original Transformer vs. Llama Architecture
- Original: Encoder-decoder with cross-attention
- Llama: Decoder-only architecture
- Normalization: LayerNorm → RMSNorm
- Activation: ReLU → SwiGLU
- Position Encoding: Absolute → Rotary (RoPE)
- Attention: Multi-Head → Grouped Query Attention
This architecture forms the foundation of modern large language models, and understanding these components is essential for optimizing LLM serving systems.