Software Prefetching

Hardware vs Software

Modern CPUs have hardware prefetchers that detect access patterns and speculatively fetch cache lines. They work well for sequential and strided access but fail on random patterns.

Software prefetching explicitly tells the CPU to fetch data before you need it:

__builtin_prefetch(&array[i + 64], 0, 0);  // Prefetch 64 elements ahead
sum += array[i];                            // Use current element
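
A fuller sketch of this pattern in a summation loop follows; the uint64_t element type, the bounds check, and the +64 distance are assumptions for illustration, not necessarily what the benchmark binary does:

#include <stddef.h>
#include <stdint.h>

// Sketch: sum an array while prefetching 64 elements ahead of the current
// position (GCC/Clang builtin). The bounds check keeps the prefetch address
// inside the array.
uint64_t sum_with_prefetch(const uint64_t *array, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&array[i + 64], 0, 0);  // read, no temporal locality
        sum += array[i];
    }
    return sum;
}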

Results (256 MB array)

Random Access

Variant   Distance   ns/access   vs baseline
pf_none   0          7.61 ns     1.0×
pf_8      +8         6.23 ns     1.2×
pf_32     +32        5.91 ns     1.3×
pf_128    +128       5.99 ns     1.3×

Sequential Access

Variant    Distance      ns/access   vs baseline
pf_seq     0 (hw only)   1.37 ns     1.0×
pf_seq64   +64           0.54 ns     2.5×

Observations

1. Random Prefetch Helps Modestly (~25%)

For random access, we know the sequence of indices but not their values. We can prefetch array[indices[i+N]] while processing array[indices[i]].

Limited improvement because:
- The prefetch distance must be large enough to hide DRAM latency (~300 cycles)
- But not so large that prefetched data gets evicted before use
- Memory bandwidth, not latency, becomes the bottleneck
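
A sketch of the random-access variant described above; the indices[] array of precomputed positions, the element types, and the +32 distance are illustrative assumptions:

#include <stddef.h>
#include <stdint.h>

// Sketch: the index stream is read sequentially (hardware prefetch covers it),
// while the data array is hit at random positions, so we prefetch the data
// element 32 iterations ahead of where it will be used.
uint64_t sum_random_prefetch(const uint64_t *array,
                             const uint32_t *indices, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 32 < n)
            __builtin_prefetch(&array[indices[i + 32]], 0, 0);
        sum += array[indices[i]];
    }
    return sum;
}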

2. Sequential SW Prefetch Beats Hardware

Surprisingly, software prefetching beats the hardware prefetcher for sequential access. The hardware prefetcher is conservative: it limits how far ahead it fetches to avoid cache pollution. Our explicit prefetch at distance +64 (512 bytes, 8 cache lines ahead) is more aggressive and wins when the access pattern is known perfectly.
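
One way to write the sequential +64 variant is to issue a single prefetch per cache line rather than per element; the unrolled loop below sketches that choice (the unroll factor assumes 8-byte elements and 64-byte lines), and is not necessarily how the benchmark implements it:

#include <stddef.h>
#include <stdint.h>

// Sketch: one prefetch per 64-byte line, aimed 8 lines (64 elements) ahead.
uint64_t sum_sequential_prefetch(const uint64_t *array, size_t n) {
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        if (i + 64 < n)
            __builtin_prefetch(&array[i + 64], 0, 0);  // 512 bytes ahead
        for (size_t j = 0; j < 8; j++)
            sum += array[i + j];
    }
    for (; i < n; i++)  // tail elements without prefetch
        sum += array[i];
    return sum;
}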

3. Optimal Prefetch Distance

The optimal distance depends on memory latency (~75 ns), loop iteration time, and cache capacity. For random access, +32 to +128 works well; for sequential access, +64 (512 bytes = 8 cache lines) is effective.
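
A back-of-the-envelope way to pick the distance, using the numbers above; the helper and the per-iteration times are assumptions for illustration, not measured constants of the benchmark:

#include <stddef.h>

// Rough rule: the prefetch must be issued at least latency / iteration_time
// iterations before the data is used.
static inline size_t prefetch_distance(double latency_ns, double ns_per_iter) {
    return (size_t)(latency_ns / ns_per_iter) + 1;
}
// prefetch_distance(75.0, 6.0)  ~= 13  -> random access: +32 gives comfortable margin
// prefetch_distance(75.0, 0.54) ~= 139 -> sequential: the strict formula suggests more,
//                                         but many prefetches overlap in flight,
//                                         so +64 already wins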

When Prefetching Doesn't Help

  1. Pointer chasing: Can't prefetch array[array[i]] because you don't know array[i] yet (see the sketch after this list)
  2. Unpredictable patterns: If you don't know future addresses, you can't prefetch them
  3. Cache-resident data: Prefetching hurts if data is already in cache (extra instructions)
  4. Bandwidth-bound: If memory bandwidth is saturated, prefetching adds pressure
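
For contrast, a sketch of case 1: a linked-list walk has the same dependent-load structure as array[array[i]], since the next address only becomes known when the current load completes, so there is no valid target to prefetch (the node layout is illustrative):

#include <stdint.h>

struct node {
    struct node *next;
    uint64_t     value;
};

// Each step depends on the previous load; no address is available early
// enough for __builtin_prefetch to hide the miss.
uint64_t sum_list(const struct node *head) {
    uint64_t sum = 0;
    for (const struct node *n = head; n; n = n->next)
        sum += n->value;
    return sum;
}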

The Instruction

__builtin_prefetch(addr, rw, locality);
// addr:     address of the data to fetch
// rw:       0 = read, 1 = write
// locality: 0 = no temporal locality ... 3 = high temporal locality (keep in all cache levels)

Compiles to x86 prefetchnta (non-temporal) or prefetcht0/t1/t2 (temporal).
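
A hedged illustration of how the hints might be chosen; the histogram kernel and the +16 distance are hypothetical, and exactly which prefetch instruction each hint maps to depends on the compiler and target:

#include <stddef.h>
#include <stdint.h>

// keys[] is streamed once; counts[] is written and revisited, so its prefetch
// gets the write hint (rw = 1) and high temporal locality (3).
void histogram(const uint32_t *keys, size_t n, uint64_t *counts) {
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&counts[keys[i + 16]], 1, 3);
        counts[keys[i]]++;
    }
}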

Running

./bench pf_none 256   # Random baseline
./bench pf_32 256     # Random prefetch +32
./bench pf_seq 256    # Sequential hw prefetch only
./bench pf_seq64 256  # Sequential sw prefetch +64