GPU Architecture and Introduction to GPU Programming

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, taught by Prof. Baris Kasikci with TA Kan Zhu


Training vs Inference

Serving

ML model serving is about building a system that performs inference efficiently and at scale.

LLM Applications and Market Context

Applications Enabled by LLMs

Key principle: batch user requests and aim for optimal throughput.

Market Demand

Infrastructure Costs

Large-scale H100 investments in 2024:

NVIDIA H200 HGX Server specs:

GPU Fundamentals

What is a GPU?

CPU vs GPU Architecture

| Aspect | CPU | GPU |
| --- | --- | --- |
| Design Focus | Control logic (good with branching) | Computation / data loading |
| Performance | Single-thread performance | Parallel processing |
| Cores | Few powerful cores | Many simpler cores |
| Memory | Large cache hierarchy | High-bandwidth memory |

Example Specifications

| Specification | AMD EPYC 9555 (CPU) | NVIDIA H200 (GPU) |
| --- | --- | --- |
| Cores / Threads | 64 cores / 128 threads | 16,896 CUDA cores |
| Frequency | 4.4 GHz | 1.98 GHz |
| TFLOPS | ~10-20 TFLOPS | 989 TFLOPS |
| Memory Size | Up to 6 TB | 141 GB HBM3e |
| Memory Bandwidth | 576 GB/s | 4,800 GB/s |
| Memory Latency | ~70 ns | ~110 ns |

Key difference: the CPU spends its transistors on a few powerful cores, deep caches, and control logic to make a single thread fast, while the GPU spends them on thousands of simpler cores and high-bandwidth memory to maximize parallel throughput.


GPU Hardware Architecture

Data Center Context

GPU Memory Hierarchy

Streaming Multiprocessors (SMs)

Components:


GPU Programming Model

Hierarchy of Execution Units

| Concept | Definition | Architecture | Communication | Limits |
| --- | --- | --- | --- | --- |
| Thread | Smallest unit that executes instructions | Function units | Local (registers) | Up to 255 registers per thread |
| Warp | Group of threads | "SM tiles" | Register file | 32 threads |
| Thread Block | Group of warps | SM | Shared memory | Up to 32 warps (1,024 threads) |
| Kernel | Function on the GPU | GPU | L2 / global memory | Up to 2^31 - 1 blocks in x (65,535 in y and z) |
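
As a concrete illustration of this hierarchy, here is a minimal CUDA kernel sketch (the kernel name and output array are illustrative) in which each thread derives its global index, its warp within the block, and its lane within the warp from the built-in blockIdx, blockDim, and threadIdx variables:

// Illustrative sketch: each thread computes where it sits in the hierarchy.
__global__ void where_am_i(int *warp_ids, size_t num) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the whole grid
    int warp_id = threadIdx.x / 32;                         // which warp inside this block
    int lane_id = threadIdx.x % 32;                         // position within the warp (0-31)
    if (global_id < num && lane_id == 0) {
        warp_ids[global_id] = warp_id;                      // one write per warp, done by lane 0
    }
}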

Key Concepts


GPU Programming Approaches

1. PyTorch (Easiest)

import torch

def add_tensors(a, b):
    return a + b

if __name__ == "__main__":
    num_elements = 10**9

    # Create tensors on CPU
    tensor1 = torch.rand(num_elements, device='cpu')
    tensor2 = torch.rand(num_elements, device='cpu')

    # Move to GPU
    tensor1 = tensor1.to('cuda')
    tensor2 = tensor2.to('cuda')

    # Compute addition
    for i in range(10):
        result = add_tensors(tensor1, tensor2)

    # Move back to CPU
    result = result.cpu()
    print("Result of addition:", result)

2. Triton (Intermediate)

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous chunk of BLOCK_SIZE elements
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses in the last chunk

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y

    tl.store(output_ptr + offsets, output, mask=mask)

3. CUDA (Most Control)

CUDA Memory Management

// Memory allocation
cudaMalloc          // device memory allocation
cudaMallocHost      // pinned host memory allocation
cudaFree            // free memory

// Memory operations
cudaMemcpy          // synchronous copy
cudaMemcpyAsync     // asynchronous copy
cudaMemset          // synchronous set
cudaMemsetAsync     // asynchronous set
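
A minimal sketch of how these calls fit together (buffer names and the 1 MiB size are illustrative); note that pinned memory from cudaMallocHost is released with cudaFreeHost rather than cudaFree:

#include <cuda_runtime.h>

int main() {
    size_t bytes = 1 << 20;                                   // 1 MiB, arbitrary example size
    float *h_buf, *d_buf;

    cudaMallocHost((void**)&h_buf, bytes);                    // pinned host memory (fast, async-capable copies)
    cudaMalloc((void**)&d_buf, bytes);                        // device (GPU) memory

    cudaMemset(d_buf, 0, bytes);                              // synchronous set on the device buffer
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // synchronous device-to-host copy

    cudaFree(d_buf);                                          // device memory is freed with cudaFree
    cudaFreeHost(h_buf);                                      // pinned host memory is freed with cudaFreeHost
    return 0;
}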

CUDA Kernel Structure

// Kernel declaration
__global__ void kernel_name(args...)

// Device helper function
__device__ T helper_name(args...)

// Example addition kernel
__global__ void add(int *a, int *b, int *c, size_t num) {
    int block_start = blockIdx.x * blockDim.x;
    int thread_id = threadIdx.x;
    int index = block_start + thread_id;
    if (index < num) {
        c[index] = a[index] + b[index];
    }
}

CUDA Kernel Launch

// Define block and thread dimensions
dim3 block(x, y, z);     // number of thread blocks in the grid
dim3 thread(x, y, z);    // number of threads per block

// Launch kernel: <<<blocks per grid, threads per block>>>
kernel_name<<<block, thread>>>(args);
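
Putting the launch syntax together with the memory-management calls and the add kernel above, a minimal host program might look like the sketch below (assumed to live in the same .cu file as the kernel; the element count and 256-thread block size are arbitrary, and the grid size uses ceiling division so every element gets a thread):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t num = 1 << 20;                  // number of elements (arbitrary)
    size_t bytes = num * sizeof(int);

    int *a, *b, *c;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);
    cudaMemset(a, 1, bytes);               // fill with arbitrary byte patterns, just for illustration
    cudaMemset(b, 2, bytes);

    // One thread per element: ceil(num / threads_per_block) blocks.
    dim3 thread(256);
    dim3 block((unsigned)((num + thread.x - 1) / thread.x));
    add<<<block, thread>>>(a, b, c, num);

    cudaDeviceSynchronize();               // wait for the kernel to finish
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}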

CUDA Synchronization

__syncthreads()           // Thread synchronization (device function)
cudaDeviceSynchronize()   // Device synchronization (host function)

// Error checking
cudaGetLastError()        // Get last error
cudaGetErrorString()      // Get error description
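
As a sketch of why __syncthreads() matters, the illustrative kernel below stages data in shared memory: every thread must reach the barrier before any thread reads its neighbor's element. The trailing comments show the usual host-side error-checking pattern after a launch.

// Illustrative kernel: shift each element left by one within its block.
// Assumes blockDim.x <= 256 so the shared-memory tile is large enough.
__global__ void shift_left(const int *in, int *out, size_t num) {
    __shared__ int tile[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (idx < num) ? in[idx] : 0;   // stage this thread's element
    __syncthreads();                                 // barrier: all writes above are now visible

    if (idx < num) {
        // Safe to read a neighbor's slot only because of the barrier above.
        out[idx] = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1] : in[idx];
    }
}

// Host side, after a kernel launch:
// cudaError_t err = cudaGetLastError();
// if (err != cudaSuccess) {
//     printf("kernel launch failed: %s\n", cudaGetErrorString(err));
// }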

Performance Analysis

Timing Considerations
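
GPU kernels execute asynchronously with respect to the host, so host-side wall-clock timing around a launch mostly measures the cost of enqueuing the work. A common pattern is to bracket the work with CUDA events, sketched below assuming the add kernel and the block/thread dimensions from the launch example:

#include <cuda_runtime.h>
#include <cstdio>

// Times one launch of the add kernel; d_a/d_b/d_c are device buffers set up as before.
void time_add(int *d_a, int *d_b, int *d_c, size_t num, dim3 block, dim3 thread) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // enqueue the start marker
    add<<<block, thread>>>(d_a, d_b, d_c, num);  // the work being timed
    cudaEventRecord(stop);                       // enqueue the stop marker after the kernel

    cudaEventSynchronize(stop);                  // wait until the stop event has actually occurred
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}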

Profiling Tools

CUDA Streams
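
A stream is an ordered queue of GPU work; operations in different streams may overlap, for example a copy in one stream with a kernel in another. A minimal sketch, assuming the add kernel from above, pinned host buffers h_a/h_b/h_c (from cudaMallocHost), and device buffers d_a/d_b/d_c:

#include <cuda_runtime.h>

// Splits the addition across two streams so copies and compute can overlap.
void add_with_streams(int *h_a, int *h_b, int *h_c,
                      int *d_a, int *d_b, int *d_c, size_t num) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    size_t half = num / 2;                 // assume num is even for simplicity
    size_t half_bytes = half * sizeof(int);
    dim3 thread(256);
    dim3 block((unsigned)((half + thread.x - 1) / thread.x));

    // Stream 0 processes the first half, stream 1 the second half.
    cudaMemcpyAsync(d_a, h_a, half_bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, half_bytes, cudaMemcpyHostToDevice, s0);
    add<<<block, thread, 0, s0>>>(d_a, d_b, d_c, half);
    cudaMemcpyAsync(h_c, d_c, half_bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d_a + half, h_a + half, half_bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(d_b + half, h_b + half, half_bytes, cudaMemcpyHostToDevice, s1);
    add<<<block, thread, 0, s1>>>(d_a + half, d_b + half, d_c + half, half);
    cudaMemcpyAsync(h_c + half, d_c + half, half_bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);             // wait for all work queued on each stream
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}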


Modern GPU Features

Advanced CUDA Features


Summary

Key Takeaways

  1. GPU Architecture Understanding
     - Parallel processing focus
     - SMs, blocks, threads hierarchy
  2. Programming Approaches
     - PyTorch: easiest, high-level
     - Triton: balance of control and ease
     - CUDA: maximum control and performance
  3. Performance Analysis
     - Proper timing with CUDA events
     - Profiling tools for optimization
     - Understanding memory hierarchy impact

Core Principle: Batching user requests and aiming for optimal throughput are key to effective LLM serving systems.