# Memory Bandwidth
## The Problem

A single core cannot saturate memory bandwidth. Modern DDR4/DDR5 systems provide 50-100+ GB/s, but one core typically achieves only 20-30 GB/s due to:

- Limited outstanding memory requests per core
- Memory-controller parallelism across channels/ranks that a single core cannot fully exploit
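A rough bound comes from Little's law: sustained bandwidth per core is at most the number of outstanding misses times the cache-line size, divided by memory latency. The buffer count and effective latency below are assumed values for illustration, not measurements from this machine:

$$\text{BW}_{\text{core}} \le \frac{\text{outstanding misses} \times \text{line size}}{\text{latency}} \approx \frac{12 \times 64\,\text{B}}{30\,\text{ns}} \approx 25.6\,\text{GB/s}$$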
## The Benchmark

A simple sequential read with varying thread counts:

```c
double sum = 0.0;                 /* accumulator for the reduction */
for (size_t i = 0; i < n; i++) {
    sum += array[i];              /* sequential, memory-bound read */
}
```
Each thread processes a disjoint chunk of the array.
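A minimal sketch of the threaded variant using POSIX threads; `NTHREADS` and all names here are illustrative, not taken from the benchmark source. Compile with `cc -O2 -pthread`:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4                    /* assumed; the benchmark varies this */

typedef struct {
    const double *base;               /* start of this thread's chunk */
    size_t        len;                /* elements in the chunk */
    double        sum;                /* per-thread partial sum */
} chunk_t;

static void *reader(void *arg) {
    chunk_t *c = arg;
    double sum = 0.0;
    for (size_t i = 0; i < c->len; i++)
        sum += c->base[i];            /* sequential, memory-bound read */
    c->sum = sum;
    return NULL;
}

int main(void) {
    size_t n = (size_t)1 << 27;       /* 2^27 doubles = 1 GiB */
    double *array = malloc(n * sizeof *array);
    if (!array) return 1;
    for (size_t i = 0; i < n; i++)
        array[i] = 1.0;               /* touch every page before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t tid[NTHREADS];
    chunk_t   c[NTHREADS];
    size_t per = n / NTHREADS;        /* n is divisible by NTHREADS here */
    for (int t = 0; t < NTHREADS; t++) {
        c[t].base = array + (size_t)t * per;   /* disjoint chunks */
        c[t].len  = per;
        pthread_create(&tid[t], NULL, reader, &c[t]);
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].sum;            /* keep the reads observable */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f GB/s (sum=%g)\n", n * sizeof(double) / secs / 1e9, total);
    free(array);
    return 0;
}
```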
## Results (1 GB array)
| Threads | GB/s | vs 1 thread |
|---|---|---|
| 1 | 24.9 | 1.0× |
| 2 | 44.1 | 1.8× |
| 4 | 47.8 | 1.9× |
| 8 | 63.7 | 2.6× |
## Observations

### 1. Single-Core Bottleneck

One core achieves ~25 GB/s, limited by:

- The number of line fill buffers (LFBs) / miss status holding registers (MSHRs)
- Memory controller queue depth per core
- Prefetcher effectiveness
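One lever on the per-core side is software prefetching, which keeps more misses in flight. A minimal sketch using GCC/Clang's `__builtin_prefetch`; the distance `PF_DIST` is an assumed tuning value:

```c
/* Sketch: software prefetch to raise memory-level parallelism.
   PF_DIST is an assumed tuning value; profile to pick a real one. */
#define PF_DIST 64                    /* 64 doubles ahead = 8 cache lines */

double sum = 0.0;
for (size_t i = 0; i < n; i++) {
    if (i + PF_DIST < n)              /* stay in bounds near the end */
        __builtin_prefetch(&array[i + PF_DIST], /*rw=*/0, /*locality=*/0);
    sum += array[i];
}
```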
### 2. Diminishing Returns

Bandwidth doesn't scale linearly with threads:

- 1→2 threads: 1.8× (good scaling)
- 2→4 threads: 1.1× (memory controller saturating)
- 4→8 threads: 1.3× (some additional parallelism)
### 3. Memory Controller Saturation

The memory controller has finite bandwidth. Once saturated, adding threads doesn't help. The exact saturation point depends on:

- The number of memory channels
- DDR speed and timings
- Memory controller design
### 4. NUMA Effects

On multi-socket systems, threads should access local memory. Remote memory access adds latency and reduces bandwidth.
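Pinning both the threads and their allocations to one node avoids remote traffic. For example, with `numactl` (assuming it is installed; node 0 is arbitrary):

```
numactl --cpunodebind=0 --membind=0 ./bench bw_8 1024
```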
## Theoretical vs Achieved

- DDR4-3200 dual-channel: ~51 GB/s theoretical
- DDR5-4800 dual-channel: ~77 GB/s theoretical
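Both figures follow from transfer rate × bus width × channel count (a 64-bit channel moves 8 bytes per transfer):

$$3200\,\text{MT/s} \times 8\,\text{B} \times 2 \approx 51.2\,\text{GB/s}, \qquad 4800\,\text{MT/s} \times 8\,\text{B} \times 2 \approx 76.8\,\text{GB/s}$$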
Achieved bandwidth is typically 60-80% of theoretical due to:

- Row buffer misses
- Refresh cycles
- Command/address overhead
## Running

```
./bench bw_1 1024   # 1 thread
./bench bw_2 1024   # 2 threads
./bench bw_4 1024   # 4 threads
./bench bw_8 1024   # 8 threads
```
Use large arrays (1 GB+) so the working set far exceeds the last-level cache and the benchmark stays memory-bound.