Transformer Architecture Overview

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, instructed by Prof. Baris Kasikci and TA Kan Zhu.

Prefill vs. Decode Phases

Prefill Phase: The entire prompt is processed in a single forward pass. All prompt tokens flow through the model in parallel, the KV cache is populated, and the first output token is produced. This phase is dominated by large matrix multiplications and is typically compute-bound.

Decode Phase: Output tokens are generated one at a time, autoregressively. Each iteration processes only the newest token and reuses the cached keys and values of all previous tokens. This phase is dominated by reading weights and the KV cache and is typically memory-bandwidth-bound. A shape-level sketch of the two phases follows.
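
A shape-level sketch of the two phases; the hidden size follows Llama, while the toy prompt length and the random tensors are only for illustration:

import torch

hidden_size, prompt_len = 4096, 16

# Prefill: one forward pass over every prompt token at once.
prefill_hidden = torch.randn(1, prompt_len, hidden_size)   # (batch, seq_len, hidden)
# Attention here sees all 16 tokens and writes their K/V into the cache.

# Decode: each later step feeds only the newest token through the model.
decode_hidden = torch.randn(1, 1, hidden_size)             # (batch, 1, hidden)
# Attention here computes Q/K/V for one token and reads the cached K/V for the rest.

print(prefill_hidden.shape, decode_hidden.shape)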

Transformer Layers and Iterations

The model is a stack of identical transformer layers (for example, 32 decoder layers in Llama 2 7B). Every forward pass, whether a prefill over the whole prompt or a single decode step, is one iteration through the entire stack, and each decode iteration produces one new token.

Core Transformer Components

1. Embedding Layer

Purpose: Convert token IDs to dense vector representations

input_ids: [The, University, ...]  # token IDs, shown here as their corresponding tokens
embeddings: {
    The: [0, 1, 0, 1, ...],  # 4096 elements in Llama
    University: [0, 0, 0, ...]
}
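
The lookup itself is just an index into a weight matrix. A minimal sketch using PyTorch's nn.Embedding; the sizes are Llama-like and the token IDs are made up:

import torch
import torch.nn as nn

vocab_size, hidden_size = 32000, 4096      # Llama-like sizes (~0.5 GB of fp32 weights)
embed = nn.Embedding(vocab_size, hidden_size)

input_ids = torch.tensor([[791, 3907]])    # hypothetical IDs for "The", "University"
hidden_states = embed(input_ids)           # (batch=1, seq_len=2, hidden=4096)
print(hidden_states.shape)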

2. Attention Mechanism

Self-Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

- $Q$, $K$, $V$ are the query, key, and value matrices obtained by projecting the hidden states
- $d_k$ is the dimension of each query/key head, used to scale the dot products
- The softmax turns the scaled scores into attention weights over the value vectors

Causal Self-Attention

In a decoder-only model, each token may attend only to itself and to earlier tokens. This is enforced with a lower-triangular (causal) mask that sets the scores of future positions to negative infinity before the softmax; a small sketch follows.
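
A minimal single-head sketch of causal scaled dot-product attention; shapes and values are illustrative, not from the notes:

import math
import torch

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

scores = q @ k.T / math.sqrt(d_k)                         # (seq_len, seq_len)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future tokens
weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
output = weights @ v                                      # (seq_len, d_k)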

Grouped Query Attention (GQA)

Instead of giving every query head its own key/value head, several query heads share one key/value head (for example, 32 query heads sharing 8 KV heads in Llama 3 8B). This shrinks the KV cache and the memory traffic during decode at little quality cost; the KV-head sharing is sketched below.
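
A minimal sketch of expanding shared KV heads to match the query heads before a standard attention computation; the head counts below are made up for brevity:

import torch

seq_len, head_dim = 4, 8
num_q_heads, num_kv_heads = 8, 2                       # 4 query heads per KV head
group_size = num_q_heads // num_kv_heads

q = torch.randn(num_q_heads, seq_len, head_dim)
k = torch.randn(num_kv_heads, seq_len, head_dim)
v = torch.randn(num_kv_heads, seq_len, head_dim)

# Each KV head is reused by group_size query heads.
k_expanded = k.repeat_interleave(group_size, dim=0)    # (num_q_heads, seq_len, head_dim)
v_expanded = v.repeat_interleave(group_size, dim=0)
scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)
output = weights @ v_expanded                          # (num_q_heads, seq_len, head_dim)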

3. Multi-Head Attention

Head Separation Process

# Original query tensor
q = [[1, 2, 3, 4, 5, 6],    # Token 1
     [7, 8, 9, 10, 11, 12]] # Token 2
# Shape: (seq_len, hidden_dim) = (2, 6)

# Separated into heads
sub_q = [[[1, 2, 3],   # Head 1 for Token 1
          [4, 5, 6]],  # Head 2 for Token 1
         [[7, 8, 9],   # Head 1 for Token 2
          [10, 11, 12]]] # Head 2 for Token 2
# Shape: (seq_len, num_heads, head_dim) = (2, 2, 3)
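
The same separation expressed as a tensor reshape, a sketch of how it is typically done; in a real model the reshape is followed by a transpose so the head dimension comes first:

import torch

hidden_dim, num_heads = 6, 2
head_dim = hidden_dim // num_heads

q = torch.tensor([[1, 2, 3, 4, 5, 6],
                  [7, 8, 9, 10, 11, 12]])      # (seq_len, hidden_dim) = (2, 6)
sub_q = q.view(-1, num_heads, head_dim)        # (seq_len, num_heads, head_dim) = (2, 2, 3)
per_head = sub_q.transpose(0, 1)               # (num_heads, seq_len, head_dim) for attention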

4. Feed Forward Network (FFN)

Architecture: In the original transformer, two linear transformations with an activation in between. Llama instead uses a gated (SwiGLU) variant with three projections: Gate and Up expand the hidden size, and Down projects back, with the SiLU activation applied to the gate.

Mathematical representation: $$\text{FFN}(x) = \text{Down}\big(\text{SiLU}(\text{Gate}(x)) \odot \text{Up}(x)\big)$$
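
A minimal sketch of this gated FFN in PyTorch; the sizes follow Llama 2 7B, and the class and layer names are illustrative rather than taken from any particular codebase:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden_size=4096, intermediate_size=11008):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # Down(SiLU(Gate(x)) * Up(x))
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFFN()
out = ffn(torch.randn(2, 4096))   # (tokens, hidden) -> (tokens, hidden)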

5. Normalization

RMSNorm (Root Mean Square Normalization)

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \odot g$$

Where:

- $x$ is the input vector of length $n$ (the hidden size)
- $\epsilon$ is a small constant added for numerical stability
- $g$ is a learned per-dimension scale (gain) parameter
- Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term
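
A minimal sketch of RMSNorm matching the formula above; the class name and default epsilon are illustrative:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size=4096, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(hidden_size))   # learned gain
        self.eps = eps

    def forward(self, x):
        # x / sqrt(mean(x^2) + eps) * g, computed over the hidden dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.g

norm = RMSNorm()
out = norm(torch.randn(2, 4096))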

6. Residual Connections

Each sub-layer's output is added back to its input so that information and gradients can flow around the attention and FFN blocks. Llama uses the pre-norm arrangement: $x \leftarrow x + \text{Attention}(\text{RMSNorm}(x))$ followed by $x \leftarrow x + \text{FFN}(\text{RMSNorm}(x))$.


Multi-GPU Implementation

Tensor Parallelism

Each layer's weight matrices are split across GPUs so that every GPU holds and computes only a shard of each matrix multiplication. Typically the QKV/Gate/Up projections are split by columns and the output/Down projections by rows, and communication collectives combine the partial results.

Communication Operations

AllGather

Every GPU contributes its local shard, and every GPU ends up with the concatenation of all shards (for example, reassembling a hidden state that was split across GPUs).

AllReduce

Partial results from all GPUs are summed element-wise, and the sum is delivered to every GPU (for example, combining the partial outputs of a row-parallel matrix multiplication). The semantics of both collectives are sketched below.
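
A minimal single-process sketch of what the two collectives compute, using plain tensors to stand in for the per-GPU shards; no actual distributed communication is performed:

import torch

# Pretend each entry of `shards` lives on a different GPU.
shards = [torch.tensor([1., 2.]), torch.tensor([3., 4.]), torch.tensor([5., 6.])]

# AllGather: every rank ends up with the concatenation of all shards.
allgather_result = torch.cat(shards)                 # tensor([1., 2., 3., 4., 5., 6.])

# AllReduce (sum): every rank ends up with the element-wise sum of all shards.
allreduce_result = torch.stack(shards).sum(dim=0)    # tensor([ 9., 12.])

# With torch.distributed these correspond to dist.all_gather(...) and dist.all_reduce(...).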


Resource Utilization Patterns

Compute-Bound Operations

Prefill-phase matrix multiplications (attention projections and FFN GEMMs over many tokens at once) keep the GPU's arithmetic units busy; performance scales with available FLOPS.

Memory-Bound Operations

Decode-phase operations process a single token per step, so reading the weights and the KV cache dominates; performance scales with memory bandwidth rather than FLOPS.

Network-Bound Operations

Cross-GPU collectives (AllReduce, AllGather) introduced by tensor parallelism; performance scales with interconnect bandwidth and latency. A back-of-the-envelope arithmetic-intensity comparison follows.
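
A rough arithmetic-intensity sketch for a single 4096x4096 linear layer, assuming fp16 weights and activations and ignoring caches; the 512-token prompt length is an arbitrary choice for illustration:

# FLOPs vs. bytes moved for one 4096 x 4096 linear layer in fp16.
hidden = 4096
weight_bytes = hidden * hidden * 2            # fp16 weights: ~33.5 MB

def arithmetic_intensity(num_tokens):
    flops = 2 * num_tokens * hidden * hidden               # multiply-accumulate per output element
    bytes_moved = weight_bytes + num_tokens * hidden * 2 * 2  # weights + activations in and out
    return flops / bytes_moved

print(arithmetic_intensity(512))  # prefill: ~410 FLOPs per byte -> compute-bound
print(arithmetic_intensity(1))    # decode:  ~1 FLOP per byte    -> memory-bound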


Key Implementation Details

KV Cache Management

The keys and values computed for every token are stored per layer so that a decode step only has to compute Q/K/V for the newest token and can reuse everything else. The cache grows linearly with sequence length and batch size and is a major consumer of GPU memory during serving; a minimal append-and-attend sketch follows.
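
A minimal sketch of one decode step appending its key/value to a per-layer cache; single head, no batching, shapes only:

import torch

head_dim = 8
k_cache = torch.zeros(0, head_dim)               # grows by one row per generated token
v_cache = torch.zeros(0, head_dim)

# One decode step: only the newest token's Q/K/V are computed.
q_new, k_new, v_new = (torch.randn(1, head_dim) for _ in range(3))

k_cache = torch.cat([k_cache, k_new])            # (cache_len + 1, head_dim)
v_cache = torch.cat([v_cache, v_new])
scores = q_new @ k_cache.T / head_dim ** 0.5     # attend over all cached positions
weights = torch.softmax(scores, dim=-1)
output = weights @ v_cache                       # (1, head_dim)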

Rotary Positional Encoding

Rotary Positional Encoding (RoPE) injects position information by rotating each pair of dimensions of the query and key vectors by an angle proportional to the token's position, with a different base frequency per pair. Because the rotation is applied to Q and K before the dot product, the resulting attention score depends only on the relative distance between tokens. A minimal sketch is below.
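
A minimal sketch of applying rotary embeddings to one head's vectors, using the interleaved pair layout; the function name and layout choice are illustrative, not taken from a specific library:

import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, head_dim) with head_dim even; positions: (seq_len,) token indices.
    seq_len, head_dim = x.shape
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)  # (head_dim/2,)
    angles = torch.outer(positions.float(), inv_freq)                      # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]               # split into consecutive dimension pairs
    rotated = torch.stack([x1 * cos - x2 * sin,   # 2-D rotation of each pair
                           x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(start_dim=-2)          # interleave pairs back: (seq_len, head_dim)

q = torch.randn(4, 8)
q_rot = apply_rope(q, torch.arange(4))            # same shape, position-dependent rotation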

Softmax Computation

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

- Numerical Stability: implemented by subtracting the maximum value from every logit before exponentiation, which leaves the result unchanged but avoids overflow
- Temperature Scaling: at the sampling stage, logits are often divided by a temperature before the softmax to control randomness
- Attention Weights: inside attention, the softmax output is the distribution of attention weights over positions
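
A minimal sketch of the max-subtraction trick; the logits are chosen so that a naive exp() would overflow in fp32:

import torch

def stable_softmax(logits, dim=-1):
    # Subtracting the row max does not change the result (the common factor cancels)
    # but keeps exp() from overflowing for large logits.
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)

scores = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(stable_softmax(scores))   # tensor([[0.0900, 0.2447, 0.6652]])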


Architecture Comparison

Original Transformer vs. Llama Architecture

- Structure: encoder-decoder in the original transformer vs. decoder-only in Llama
- Normalization: LayerNorm applied after each sub-layer (post-norm) vs. RMSNorm applied before each sub-layer (pre-norm)
- Positional information: fixed sinusoidal absolute encodings vs. rotary positional encoding (RoPE)
- FFN: two linear layers with ReLU vs. the gated SwiGLU FFN with three projections
- Attention: standard multi-head attention vs. grouped query attention in newer Llama models
- Biases: linear layers with biases vs. bias-free linear layers in Llama

This architecture forms the foundation for modern large language models, and understanding these components is crucial for optimizing LLM serving systems.