# Software Prefetching

## Hardware vs Software
Modern CPUs include hardware prefetchers that detect access patterns and speculatively fetch cache lines. They work well for sequential and strided access but fail on random patterns.
Software prefetching explicitly tells the CPU to fetch data before you need it:
```c
__builtin_prefetch(&array[i + 64], 0, 0); // prefetch 64 elements ahead (read, no temporal locality)
sum += array[i];                          // use the current element
```
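As a fuller sketch, the two lines above sit inside a loop like the one below. The function name and the `uint64_t` element type are assumptions for illustration (the 8-byte element is consistent with the "+64 elements = 512 bytes" figure later in this section), not the benchmark's actual code:

```c
#include <stddef.h>
#include <stdint.h>

// Sequential sum with an explicit software prefetch 64 elements ahead
// of the element currently being read.
uint64_t sum_with_prefetch(const uint64_t *array, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&array[i + 64], 0, 0); // read, no temporal locality
        sum += array[i];
    }
    return sum;
}
```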
## Results (256 MB array)

### Random Access
| Variant | Distance | ns/access | vs baseline |
|---|---|---|---|
| pf_none | 0 | 7.61 ns | 1.0× |
| pf_8 | +8 | 6.23 ns | 1.2× |
| pf_32 | +32 | 5.91 ns | 1.3× |
| pf_128 | +128 | 5.99 ns | 1.3× |
### Sequential Access
| Variant | Distance | ns/access | vs baseline |
|---|---|---|---|
| pf_seq | 0 (hw only) | 1.37 ns | 1.0× |
| pf_seq64 | +64 | 0.54 ns | 2.5× |
## Observations

### 1. Random Prefetch Helps Modestly (~25%)
For random access, the sequence of indices is known in advance even though the data is not, so we can prefetch `array[indices[i+N]]` while processing `array[indices[i]]` (see the sketch after the list below).
Limited improvement because:

- Prefetch distance must be large enough to hide DRAM latency (~300 cycles)
- But not so large that prefetched data gets evicted before use
- Memory bandwidth becomes the bottleneck, not latency
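A minimal sketch of the pattern, assuming the index sequence is precomputed (the function name and `PF_DIST` macro are hypothetical, not the benchmark's own code):

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 32  /* how many iterations ahead to prefetch */

// Random-access sum: the indices are known ahead of time, so we can prefetch
// the element PF_DIST iterations in the future while using the current one.
uint64_t sum_random_with_prefetch(const uint64_t *array,
                                  const size_t *indices, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&array[indices[i + PF_DIST]], 0, 0);
        sum += array[indices[i]];
    }
    return sum;
}
```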
### 2. Sequential SW Prefetch Beats Hardware
Surprisingly, software prefetching beats the hardware prefetcher even for sequential access. The hardware prefetcher is conservative: it limits how far ahead it fetches to avoid cache pollution. Our explicit prefetch at distance +64 (512 bytes ahead) is more aggressive and wins because we know the access pattern perfectly.
### 3. Optimal Prefetch Distance
The optimal distance depends on memory latency (~75 ns), loop iteration time, and cache capacity. For random access, +32 to +128 works well; for sequential access, +64 (512 bytes = 8 cache lines) is effective.
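A rough rule of thumb for picking a starting distance is sketched below; the specific numbers in the comments are illustrative assumptions, not benchmark measurements:

```c
#include <stddef.h>

// Rule of thumb: issue the prefetch far enough ahead that the loop performs
// at least `mem_latency_ns` worth of work before the data is needed.
// E.g. ~75 ns of latency and ~1 ns of work per element (illustrative) suggests
// prefetching many tens of elements ahead; rounding to a power of two lands
// in the +64..+128 range seen in the tables above.
static inline size_t prefetch_distance(double mem_latency_ns, double ns_per_iter) {
    return (size_t)(mem_latency_ns / ns_per_iter);
}
```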
## When Prefetching Doesn't Help
- Pointer chasing: can't prefetch `array[array[i]]` because you don't know `array[i]` yet (see the sketch after this list)
- Unpredictable patterns: if you don't know future addresses, you can't prefetch them
- Cache-resident data: prefetching hurts if the data is already in cache (extra instructions for no benefit)
- Bandwidth-bound: if memory bandwidth is saturated, prefetching adds pressure
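To make the pointer-chasing case concrete, here is a minimal sketch (hypothetical names): each load's address depends on the previous load's result, so there is no future address available to prefetch.

```c
#include <stddef.h>

// Dependent loads: the next index is only known once the current load
// completes, so software prefetching has nothing useful to fetch ahead of time.
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    while (steps--)
        i = next[i];
    return i;
}
```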
## The Instruction
```c
__builtin_prefetch(addr, rw, locality);
// rw: 0 = read, 1 = write
// locality: 0 = no temporal locality, 3 = high temporal locality
```
On x86 this compiles to `prefetchnta` (non-temporal) or `prefetcht0`/`prefetcht1`/`prefetcht2` (temporal), depending on the locality hint.
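As a hedged illustration of how the locality hint typically lowers on x86 under GCC/Clang (the exact instruction choice is compiler- and target-dependent; read prefetches only shown):

```c
// Typical mapping of the locality argument to x86 prefetch instructions.
void prefetch_hint_examples(const void *p) {
    __builtin_prefetch(p, 0, 0); // no reuse expected      -> usually prefetchnta
    __builtin_prefetch(p, 0, 1); // low temporal locality  -> usually prefetcht2
    __builtin_prefetch(p, 0, 2); // moderate locality      -> usually prefetcht1
    __builtin_prefetch(p, 0, 3); // high temporal locality -> usually prefetcht0
}
```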
## Running
```sh
./bench pf_none 256    # Random baseline
./bench pf_32 256      # Random prefetch +32
./bench pf_seq 256     # Sequential hw prefetch only
./bench pf_seq64 256   # Sequential sw prefetch +64
```