TLB and Page Walks

The Problem

Virtual addresses must be translated to physical addresses. The CPU caches these translations in the Translation Lookaside Buffer (TLB). When the TLB misses, the CPU must walk the page table - a series of dependent memory accesses.

On x86-64 with 4-level paging, a TLB miss requires up to 4 memory accesses to resolve.

The Benchmark

Access array elements at different strides: - Small strides: many accesses per page, TLB hits - Page-sized strides (4KB): one access per page, TLB misses

for (size_t i = 0; i < n; i += stride) {
    sum += array[i];
}

Results (256 MB array)

Stride Accesses/page ns/access vs sequential
8B 512 1.36 ns 1.0×
64B 64 2.49 ns 1.8×
512B 8 5.38 ns 4.0×
4KB 1 12.51 ns 9.2×
8KB 0.5 13.31 ns 9.8×

Observations

1. TLB Miss Penalty: ~10 ns

The jump from 512B stride (5.38 ns) to 4KB stride (12.51 ns) shows the TLB miss cost. At 4KB stride, every access misses the TLB.

2. Page Walk is a Pointer Chase

A TLB miss triggers a page table walk - 4 dependent loads through PML4 → PDPT → PD → PT. This is why the penalty is fixed (~10 ns) rather than scaling with stride.

3. Hardware Page Walk Cache

Modern CPUs cache intermediate page table entries (page walk cache / paging structure cache). This is why 4KB and 8KB strides show similar latency - upper-level entries are cached.

4. TLB Capacity

Typical L1 dTLB: 64-128 entries (covers 256KB-512KB with 4KB pages). L2 TLB: 1024-2048 entries. With 256MB of data at 4KB stride, we access 65K pages - far exceeding TLB capacity.

Cache vs TLB

Stride Cache behavior TLB behavior
8B Hit (prefetcher) Hit (same page)
64B Miss (1 line/access) Hit (64 accesses/page)
512B Miss Hit (8 accesses/page)
4KB Miss Miss (1 access/page)

The 64B→512B slowdown is cache misses. The 512B→4KB slowdown is TLB misses.

Huge Pages

With 2MB huge pages, the TLB covers 512× more memory per entry. Page-strided access would hit TLB until the working set exceeds TLB capacity × 2MB.

Running

./bench tlb_seq 256   # Sequential (TLB hits)
./bench tlb_64 256    # Cache line stride
./bench tlb_512 256   # Within-page stride
./bench tlb_4k 256    # Page stride (TLB misses)
./bench tlb_8k 256    # Skip pages