Transformer Architecture and Implementation

Category: Machine Learning Systems

Transformer Architecture Overview

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci and TA Kan Zhu.

Prefill vs. Decode Phases

Prefill Phase:

  • Processes entire input prompt at once
  • All tokens processed in parallel
  • Compute-bound operation

Decode Phase:

  • Generates one token at a time
  • Sequential generation process
  • Memory-bound operation (utilizes KV cache)
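
A minimal sketch of how the two phases fit together in an autoregressive generation loop. The `model.forward` interface below (token IDs in, logits and an updated KV cache out) is hypothetical and only meant to illustrate the control flow:

def generate(model, prompt_ids, max_new_tokens):
    # Prefill: process the whole prompt in one parallel pass, filling the
    # KV cache for every prompt token (compute-bound).
    logits, kv_cache = model.forward(prompt_ids, kv_cache=None)
    next_token = int(logits[-1].argmax())        # greedy pick from the last position

    output = [next_token]
    for _ in range(max_new_tokens - 1):
        # Decode: feed only the newest token; attention reads the cached
        # keys/values of all earlier tokens (memory-bound).
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output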

Transformer Layers and Iterations

  • Inference Iteration: Complete forward pass through all layers to generate one output token
  • Inference Layer: Single transformer layer containing attention and FFN components
  • Activations: Intermediate representations passed between layers

Core Transformer Components

1. Embedding Layer

Purpose: Convert token IDs to dense vector representations

tokens:    [The, University, ...]
input_ids: [450, 3014, ...]          # integer token IDs from the tokenizer (illustrative)
embeddings: {
    The: [0.12, -0.83, 0.45, ...],   # dense float vector; 4096 elements in Llama
    University: [0.07, 0.31, ...]
}
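
At a lower level, the embedding layer is just a row lookup into a learned (vocab_size, hidden_dim) weight matrix. The sketch below uses a small random table and made-up token IDs; Llama-7B's real sizes are a 32,000-token vocabulary and a hidden dimension of 4096:

import numpy as np

vocab_size, hidden_dim = 1000, 8                 # toy sizes; Llama-7B: 32000 and 4096
embedding_table = np.random.randn(vocab_size, hidden_dim).astype(np.float32)

input_ids = np.array([450, 301])                 # two illustrative token IDs
embeddings = embedding_table[input_ids]          # row gather
print(embeddings.shape)                          # (2, 8) -- (seq_len, hidden_dim)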

2. Attention Mechanism

Self-Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • Q (Query): What the transformer is looking for
  • K (Key): What's available in the sequence
  • V (Value): What needs to be updated to assimilate context
  • $d_k$: Head dimension for scaling
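
The formula translates directly into a few lines of NumPy. This is a single-head sketch with random inputs; real implementations batch the same computation over heads and sequences:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k); scores: (seq_len, seq_len)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted sum of value vectors

Q, K, V = (np.random.randn(4, 8) for _ in range(3))
print(attention(Q, K, V).shape)                  # (4, 8)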

Causal Self-Attention

  • Causal Mask: Prevents tokens from attending to future tokens
  • Uses $-\infty$ values in attention matrix for future positions
  • When $x_i \to -\infty$, $\text{softmax}(x_i) \to 0$
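
A small NumPy illustration of the mask: entries above the diagonal (future positions) are set to $-\infty$ before the softmax, so their attention weights come out exactly zero. The scores are random stand-ins for $QK^T/\sqrt{d_k}$:

import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)       # stand-in for QK^T / sqrt(d_k)

# Causal mask: position i may not attend to any position j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                      # upper triangle is exactly 0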

Grouped Query Attention (GQA)

  • Purpose: Reduce KV cache memory usage
  • Mechanism: Multiple query heads share the same key and value heads
  • Group Size: Number of query heads per key/value head (e.g., group size = 4)
  • Benefit: Allows increasing batch size by factor of group size
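
A shape-level NumPy sketch of GQA: each key/value head is repeated group_size times so that every query head has a matching K/V head to attend against. The head counts and dimensions below are illustrative, not tied to a specific model:

import numpy as np

num_q_heads, num_kv_heads, head_dim, seq_len = 8, 2, 16, 4
group_size = num_q_heads // num_kv_heads         # 4 query heads share one KV head

q = np.random.randn(num_q_heads, seq_len, head_dim)
k = np.random.randn(num_kv_heads, seq_len, head_dim)   # only 2 heads are cached
v = np.random.randn(num_kv_heads, seq_len, head_dim)

# Broadcast the shared KV heads so they line up with the query heads.
k_shared = np.repeat(k, group_size, axis=0)      # (8, seq_len, head_dim)
v_shared = np.repeat(v, group_size, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
print(scores.shape)                              # (8, 4, 4): one score matrix per query head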

3. Multi-Head Attention

Head Separation Process

# Original query tensor
q = [[1, 2, 3, 4, 5, 6],    # Token 1
     [7, 8, 9, 10, 11, 12]] # Token 2
# Shape: (seq_len, hidden_dim) = (2, 6)

# Separated into heads
sub_q = [[[1, 2, 3],   # Head 1 for Token 1
          [4, 5, 6]],  # Head 2 for Token 1
         [[7, 8, 9],   # Head 1 for Token 2
          [10, 11, 12]]] # Head 2 for Token 2
# Shape: (seq_len, num_heads, head_dim) = (2, 2, 3)

4. Feed Forward Network (FFN)

Architecture: Two linear transformations with activation function

  • Up Projection: Expands hidden dimension
  • Gate Projection: Controls information flow
  • Activation Function: SiLU (Swish), applied to the gate projection; the gated combination is known as SwiGLU
  • Down Projection: Returns to original dimension

Mathematical representation: $$\text{FFN}(x) = \text{Down}\left(\text{SiLU}(\text{Gate}(x)) \odot \text{Up}(x)\right)$$
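
A minimal NumPy sketch of this gated FFN. The dimensions are kept tiny here (Llama-7B uses hidden 4096 and intermediate 11008) and the weights are random placeholders:

import numpy as np

def silu(x):                                     # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

hidden_dim, intermediate_dim = 8, 22             # toy sizes; Llama-7B: 4096 and 11008
W_gate = np.random.randn(hidden_dim, intermediate_dim)
W_up   = np.random.randn(hidden_dim, intermediate_dim)
W_down = np.random.randn(intermediate_dim, hidden_dim)

def ffn(x):
    # Down(SiLU(Gate(x)) * Up(x)) -- the gated (SwiGLU-style) feed-forward
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = np.random.randn(2, hidden_dim)               # activations for two tokens
print(ffn(x).shape)                              # (2, 8)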

5. Normalization

RMSNorm (Root Mean Square Normalization)

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \odot g$$

Where:

  • $x$: Input vector of size $n$
  • $\epsilon$: Small constant for numerical stability (e.g., $10^{-8}$)
  • $g$: Learned scaling parameter (element-wise multiplication)
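
RMSNorm is a few lines of NumPy; the scale vector g is a learned parameter in practice but is set to ones in this sketch:

import numpy as np

def rmsnorm(x, g, eps=1e-8):
    # x: (..., n) activations, g: (n,) learned element-wise scale
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * g

x = np.random.randn(2, 4096)                     # two token activations, Llama hidden size
g = np.ones(4096)                                # placeholder for the learned scale
print(rmsnorm(x, g).shape)                       # (2, 4096)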

6. Residual Connections

  • Purpose: Enable gradient flow in deep networks
  • Implementation: Add input to output of each major component
  • Formula: $\text{output} = \text{input} + \text{component}(\text{input})$
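
A sketch of how the residual adds wrap the two sub-components in a pre-norm block, the structure Llama uses; attention, ffn, and the two norms are assumed to be callables supplied by the caller:

def transformer_block(x, attention, ffn, norm1, norm2):
    # Pre-norm residual structure: normalize, apply the component,
    # then add the original input back.
    x = x + attention(norm1(x))                  # residual around attention
    x = x + ffn(norm2(x))                        # residual around the FFN
    return x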

Multi-GPU Implementation

Tensor Parallelism

  • Weight Distribution: Split weight matrices across GPUs
  • Query/Key/Value: Distributed across different GPUs
  • Computation: Parallel matrix multiplications

Communication Operations

AllGather

  • Purpose: Collect partial results from all GPUs
  • Usage: After attention computation to gather all head outputs
  • Operation Type: Network-bound

AllReduce

  • Purpose: Sum partial results across GPUs
  • Composition: ReduceScatter + AllGather
  • Usage: After FFN down projection
  • Operation Type: Network-bound
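
A single-process NumPy simulation of why an AllReduce follows the down projection under tensor parallelism: when the intermediate dimension is sharded across GPUs, each GPU holds a slice of the activation and the matching rows of the down-projection weight, so each produces only a partial sum of the output. Summing the partials element-wise is exactly what AllReduce computes. The shard counts and sizes here are illustrative:

import numpy as np

num_gpus, intermediate_dim, hidden_dim = 2, 8, 4
x = np.random.randn(1, intermediate_dim)         # post-activation FFN output (unsharded)
W_down = np.random.randn(intermediate_dim, hidden_dim)

x_shards = np.split(x, num_gpus, axis=1)         # each "GPU" gets a slice of the activation
W_shards = np.split(W_down, num_gpus, axis=0)    # ...and the matching weight rows

partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]   # per-GPU partial results
all_reduced = sum(partials)                      # AllReduce = element-wise sum across GPUs

assert np.allclose(all_reduced, x @ W_down)      # matches the unsharded computation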

Resource Utilization Patterns

Compute-Bound Operations

  • Query, Key, Value projections
  • Up and Gate projections in FFN
  • Output projections
  • Prefill attention computation

Memory-Bound Operations

  • Decode attention (KV cache access)
  • Reading cached key-value pairs

Network-Bound Operations

  • AllGather communications
  • AllReduce communications

Key Implementation Details

KV Cache Management

  • Storage: Maintained per sequence in the batch, with separate key/value tensors for every layer
  • Purpose: Avoid recomputing key-value pairs during decode
  • Memory Impact: Major contributor to GPU memory usage
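
A minimal sketch of a per-layer KV cache that grows by one position per decode step. The layout (a Python list of per-layer key/value buffers) and the dimensions are illustrative, not any particular serving system's design:

import numpy as np

num_layers, num_kv_heads, head_dim = 2, 2, 8

# One (keys, values) buffer per layer, initially empty along the sequence axis.
kv_cache = [{"k": np.zeros((0, num_kv_heads, head_dim)),
             "v": np.zeros((0, num_kv_heads, head_dim))} for _ in range(num_layers)]

def append_kv(layer, k_new, v_new):
    # k_new, v_new: (1, num_kv_heads, head_dim) for the newly processed token
    kv_cache[layer]["k"] = np.concatenate([kv_cache[layer]["k"], k_new], axis=0)
    kv_cache[layer]["v"] = np.concatenate([kv_cache[layer]["v"], v_new], axis=0)

append_kv(0, np.random.randn(1, num_kv_heads, head_dim),
             np.random.randn(1, num_kv_heads, head_dim))
print(kv_cache[0]["k"].shape)                    # (1, 2, 8): one cached position so far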

Rotary Positional Encoding

  • Application: Applied to query and key vectors
  • Purpose: Encode relative position information
  • Advantage: Better handling of variable sequence lengths
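
A NumPy sketch of rotary encoding applied to a query (the same function is applied to keys): consecutive dimension pairs are rotated by position-dependent angles, so the dot product between a rotated query and key depends only on their relative offset. The base of 10000 follows the common RoPE convention:

import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, head_dim) query or key vectors; head_dim must be even.
    # Pair (2i, 2i+1) is rotated by angle position * base**(-2i / head_dim).
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # even / odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(4, 8)                                 # 4 positions, head_dim 8
print(rope(q, np.arange(4)).shape)                        # (4, 8)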

Softmax Computation

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

  • Numerical Stability: Implemented by subtracting the maximum input value before exponentiation, which leaves the result unchanged
  • Attention Weights: The output is a probability distribution over positions (each row sums to 1)


Architecture Comparison

Original Transformer vs. Llama Architecture

  • Original: Encoder-decoder with cross-attention
  • Llama: Decoder-only architecture
  • Normalization: LayerNorm → RMSNorm
  • Activation: ReLU → SwiGLU
  • Position Encoding: Absolute → Rotary (RoPE)
  • Attention: Multi-head → Grouped Query Attention

This architecture forms the foundation of modern large language models, and understanding these components is crucial for optimizing LLM serving systems.