# Memory-Level Parallelism (MLP)
## The Problem
A single pointer chase serializes memory accesses:
```c
size_t idx = 0;
for (size_t i = 0; i < n; i++) {
    idx = array[idx];  // Can't start next load until this completes
}
```
Each load depends on the previous result. The CPU issues one memory request, waits ~75 ns for DRAM, then issues the next. Memory bandwidth sits mostly idle: one 64-byte cache line every ~75 ns is under 1 GB/s, on a memory system capable of tens of GB/s.
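For this kind of benchmark, `array` is typically initialized as a single random cycle, so following `array[idx]` visits every element in a cache-hostile order before returning to the start. A minimal sketch using Sattolo's algorithm (`make_cycle` is an illustrative name, not necessarily how this benchmark does it):

```c
#include <stdlib.h>

// Build a single random cycle: starting anywhere and repeatedly
// following array[i] visits every element once before looping back.
void make_cycle(size_t *array, size_t n) {
    if (n < 2) return;
    for (size_t i = 0; i < n; i++)
        array[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   // j in [0, i): Sattolo's algorithm
        size_t tmp = array[i];           // (rand() % i is slightly biased;
        array[i] = array[j];             //  fine for a benchmark sketch)
        array[j] = tmp;
    }
}
```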
## Multiple Chains
Run N independent pointer chains simultaneously:
```c
// 4 independent chains - CPU can issue 4 loads in parallel
idx0 = array[idx0];
idx1 = array[idx1];
idx2 = array[idx2];
idx3 = array[idx3];
```
There are no dependencies between chains, so the CPU's load/store unit can keep multiple outstanding memory requests in its Miss Status Holding Registers (MSHRs). Modern CPUs have anywhere from roughly a dozen to 64 MSHRs.
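A minimal sketch of the multi-chain kernel (here N = 4; the function name and starting offsets are illustrative), assuming `array` holds a single cycle as above. Each index lives in its own local variable, which is what lets the compiler emit four independent loads per iteration:

```c
#include <stddef.h>

// Chase 4 independent chains through the same cycle. The loads in one
// iteration have no dependencies on each other, so the CPU can keep
// four misses in flight at once.
size_t chase4(const size_t *array, size_t n, size_t iters) {
    size_t i0 = 0, i1 = n / 4, i2 = n / 2, i3 = 3 * n / 4;
    for (size_t it = 0; it < iters; it++) {
        i0 = array[i0];
        i1 = array[i1];
        i2 = array[i2];
        i3 = array[i3];
    }
    return i0 ^ i1 ^ i2 ^ i3;  // use the results so the loads aren't dead code
}
```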
## Results (256 MB array)
| Chains | ns/access | Speedup | Effective MLP |
|---|---|---|---|
| 1 | 92.6 ns | 1.0× | 1.0 |
| 2 | 53.6 ns | 1.7× | 1.7 |
| 4 | 31.0 ns | 3.0× | 3.0 |
| 8 | 15.6 ns | 5.9× | 5.9 |
| 16 | 8.9 ns | 10.4× | 10.4 |
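The "Effective MLP" column is, presumably, just the measured speedup read as concurrency: by Little's Law the average number of requests in flight equals latency × throughput, so with a fixed miss latency,

$$\text{Effective MLP} = \frac{t_{1\ \text{chain}}}{t_{N\ \text{chains}}}, \qquad \text{e.g.}\ \frac{92.6\ \text{ns}}{8.9\ \text{ns}} \approx 10.4$$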
## Observations
### 1. Near-Linear Scaling to ~8 Chains
Each doubling of the chain count nearly halves the per-access latency until we hit hardware limits. The CPU successfully overlaps memory requests when there are no dependencies between them.
### 2. Diminishing Returns Past 8-10 Chains
At 16 chains, we see ~10× speedup rather than 16×. This reflects:
- MSHR capacity limits (~10-12 on this CPU)
- Memory controller queue depth
- DRAM bank conflicts when many requests are in flight
### 3. Convergence with Random Access
Compare to the `ran` benchmark (random access with independent indices): ~7.6 ns. That's our ceiling: the maximum MLP the hardware can sustain.
| Benchmark | ns/access | MLP |
|---|---|---|
| chase (1 chain) | 92.6 ns | 1 |
| mlp16 (16 chains) | 8.9 ns | ~10 |
| ran (independent) | 7.6 ns | ~12 |
## Why It Matters
Most real code has memory access patterns somewhere between fully serial (pointer chase) and fully parallel (independent random). Understanding MLP helps you:
- Restructure data to enable parallel accesses
- Batch operations instead of one-at-a-time processing
- Unroll loops to expose multiple independent memory operations
- Use prefetching when you know future access patterns (see the sketch below)
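As a concrete illustration of the batching and prefetching bullets, here is a minimal sketch (the function name, `PF_DIST`, and the data layout are made up for this example; assumes GCC/Clang for `__builtin_prefetch`). When the indices are known up front, prefetching a fixed distance ahead keeps several misses in flight instead of paying them one at a time:

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 8  // how far ahead to prefetch; tune per machine

// Gather-sum over known indices. The prefetch for element i+PF_DIST
// overlaps with the demand load for element i, exposing MLP even
// though the loop itself runs sequentially.
uint64_t sum_indirect(const uint64_t *table, const uint32_t *idx, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&table[idx[i + PF_DIST]], 0, 0);
        sum += table[idx[i]];
    }
    return sum;
}
```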
## Assembly
Ensuring the compiler doesn't serialize the loads:
```asm
# 4-chain inner loop
.L_loop:
    movq (%rax,%r8,8),  %r8     # chain 0
    movq (%rax,%r9,8),  %r9     # chain 1 (independent)
    movq (%rax,%r10,8), %r10    # chain 2 (independent)
    movq (%rax,%r11,8), %r11    # chain 3 (independent)
    ...
```
All four movq instructions can be issued in the same cycle. The CPU's out-of-order engine tracks that they're independent and keeps all four memory requests in flight.
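One way to check what your compiler actually emitted (assuming a GNU toolchain; `./bench` is the binary from the Running section below) is to disassemble and look for the back-to-back independent loads:

```sh
# Disassemble and eyeball the inner loop for four independent movq loads
objdump -d ./bench | less
```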
## Running
```sh
./bench mlp1 256    # Baseline (1 chain)
./bench mlp4 256    # 4 chains
./bench mlp16 256   # 16 chains

# Compare to random access (maximum MLP)
./bench ran 256
```