# Parallel Reductions Benchmark

## The Problem with Naive Reduction

A naive reduction creates a dependency chain:

```c
uint64_t sum = 0;
for (size_t i = 0; i < n; i++) {
    sum += array[i];  // Each add depends on the previous!
}
```

Each addition must wait for the previous one to complete. On a modern CPU that can execute 4+ adds per cycle, this leaves most execution units idle.

## Benchmark Variants

| Variant | Description |
|---|---|
| red_naive | Single accumulator; creates a dependency chain |
| red_ilp | 8 independent accumulators; breaks the dependency chain |
| red_simd | AVX2 vector adds (4 × 64-bit per instruction) |
| red_thread | 8 pthreads, each summing a portion |
| red_ilp_simd | 4 independent AVX2 accumulators |
| red_all | Threads + ILP + SIMD combined |
| red_opt | Simple loop; compiler free to auto-vectorize |

## Results

### Performance by Array Size (ns per element)

| Size | naive | ilp | simd | thread | ilp+simd | all |
|---|---|---|---|---|---|---|
| 1 MB | 0.41 | 0.21 | 0.17 | 1.70 | 0.15 | 1.40 |
| 4 MB | 0.21 | 0.19 | 0.18 | 0.37 | 0.21 | 0.35 |
| 16 MB | 0.35 | 0.27 | 0.24 | 0.18 | 0.25 | 0.15 |
| 64 MB | 0.41 | 0.31 | 0.30 | 0.19 | 0.30 | 0.17 |
| 256 MB | 0.40 | 0.31 | 0.33 | 0.17 | 0.32 | 0.14 |
| 1024 MB | 0.39 | 0.31 | 0.32 | 0.15 | 0.32 | 0.13 |

### Speedup vs Naive (1024 MB)

| Variant | Time (ms) | ns/elem | Speedup |
|---|---|---|---|
| red_naive | 52.8 | 0.39 | 1.0× |
| red_ilp | 41.8 | 0.31 | 1.3× |
| red_simd | 43.3 | 0.32 | 1.2× |
| red_thread | 19.5 | 0.15 | 2.7× |
| red_ilp_simd | 42.4 | 0.32 | 1.2× |
| red_all | 16.9 | 0.13 | 3.1× |

## Key Observations

### 1. ILP Benefit (~1.3× speedup)

The 8-accumulator version breaks the dependency chain into eight independent ones:

```c
// 8 independent chains the CPU can execute in parallel
sum0 += array[i + 0];
sum1 += array[i + 1];
sum2 += array[i + 2];
// ... up to sum7 += array[i + 7]
```

The ~1.3× speedup (not 8×) indicates the workload is memory-bandwidth bound, not compute-bound.
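
For reference, a sketch of what the full 8-accumulator loop might look like (the function name is illustrative, and n is assumed to be a multiple of 8 for brevity):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 8-accumulator reduction. Each sumK forms an independent
 * dependency chain, so the CPU can keep several adds in flight at once. */
static uint64_t reduce_ilp8(const uint64_t *array, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    uint64_t s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (size_t i = 0; i < n; i += 8) {
        s0 += array[i + 0];  s1 += array[i + 1];
        s2 += array[i + 2];  s3 += array[i + 3];
        s4 += array[i + 4];  s5 += array[i + 5];
        s6 += array[i + 6];  s7 += array[i + 7];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```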

### 2. Small Arrays and Threading

At 1-4 MB (fits in cache), threading overhead dominates:

- red_naive: 0.21-0.41 ns
- red_thread: 0.37-1.70 ns (slower!)

### 3. Large Arrays: Threading Wins

At 64+ MB (exceeds cache), threading provides the biggest benefit:

- red_naive: 0.39-0.41 ns
- red_thread: 0.15-0.19 ns (2.5-2.7× faster)

Multiple threads can saturate memory bandwidth from different memory controllers/channels.
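
A minimal sketch of the per-thread partitioning (struct and function names are illustrative; error handling and remainder elements are omitted):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 8

struct chunk { const uint64_t *base; size_t n; uint64_t sum; };

/* Each worker sums its own contiguous slice; the compiler is free to
 * auto-vectorize this inner loop, as the assembly below shows. */
static void *worker(void *arg) {
    struct chunk *c = arg;
    uint64_t s = 0;
    for (size_t i = 0; i < c->n; i++)
        s += c->base[i];
    c->sum = s;
    return NULL;
}

static uint64_t reduce_threads(const uint64_t *array, size_t n) {
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    size_t per = n / NTHREADS;           /* assumes n divisible by NTHREADS */
    for (int t = 0; t < NTHREADS; t++) {
        c[t].base = array + (size_t)t * per;
        c[t].n = per;
        pthread_create(&tid[t], NULL, worker, &c[t]);
    }
    uint64_t total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].sum;               /* combine partial sums */
    }
    return total;
}
```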

### 4. SIMD ≈ ILP for this Workload

SIMD and scalar ILP show similar performance because:

- Both break the dependency chain
- Memory bandwidth is the bottleneck, not ALU throughput
- AVX2 loads 32 bytes/instruction, but memory can't keep up

### 5. Cache Effects

Performance varies with array size due to the cache hierarchy:

| Size | Likely Location | Observed Behavior |
|---|---|---|
| 1 MB | L2/L3 cache | Fastest single-thread; threading hurts |
| 4 MB | L3 cache | ILP/SIMD help; threading overhead visible |
| 16+ MB | Main memory | Threading helps; memory-bound |

## Assembly

Compiled with `gcc -O3 -march=native`. The compiler auto-vectorizes all variants.

### red_naive (compiler auto-vectorizes)

```asm
.L25:
    vpaddq  (%rax), %ymm0, %ymm0      ; AVX2: add 4 x 64-bit from memory to ymm0
    addq    $32, %rax                 ; advance pointer by 32 bytes (4 elements)
    cmpq    %rdx, %rax
    jne     .L25
; horizontal reduction:
    vextracti128  $0x1, %ymm0, %xmm1  ; extract high 128 bits
    vpaddq  %xmm0, %xmm1, %xmm0       ; add high + low halves
    vpsrldq $8, %xmm0, %xmm1          ; shift right 8 bytes
    vpaddq  %xmm1, %xmm0, %xmm0       ; final sum
```

Even though the source has a single accumulator, the compiler vectorizes it into one AVX2 accumulator (4 parallel adds per iteration).

### red_ilp (8 scalar → 2 AVX2 accumulators)

```asm
.L42:
    addq    $1, %rdx
    vpaddq  (%rax), %ymm1, %ymm1      ; accumulator 1: 4 x 64-bit
    vpaddq  32(%rax), %ymm0, %ymm0    ; accumulator 2: 4 x 64-bit
    addq    $64, %rax                 ; 64 bytes = 8 elements per iteration
    cmpq    %rsi, %rdx
    jb      .L42
```

The compiler recognizes the 8 independent accumulators and merges them into 2 AVX2 accumulators (8 parallel adds per iteration).

### red_simd (hand-written AVX2)

```asm
.L54:
    vpaddq  (%rax), %ymm0, %ymm0      ; 1 AVX2 accumulator
    addq    $32, %rax
    cmpq    %rax, %rcx
    jne     .L54
```

Same as naive: hand-written SIMD with one accumulator matches the compiler's auto-vectorization.
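
For comparison with the assembly, here is a minimal C intrinsics sketch of a single-accumulator AVX2 reduction (the function name is illustrative, and n is assumed to be a multiple of 4; the horizontal reduction mirrors the vextracti128/vpsrldq sequence shown for red_naive):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t reduce_simd(const uint64_t *array, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 4)                 /* one vpaddq per iteration */
        acc = _mm256_add_epi64(acc,
              _mm256_loadu_si256((const __m256i *)(array + i)));
    /* horizontal reduction: high half + low half, then lane + lane */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (uint64_t)_mm_cvtsi128_si64(s);
}
```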

### red_ilp_simd (4 AVX2 accumulators = 16 parallel adds)

```asm
.L79:
    vpaddq  (%rax), %ymm1, %ymm1      ; accumulator 1
    vpaddq  32(%rax), %ymm2, %ymm2    ; accumulator 2
    subq    $-128, %rax               ; advance 128 bytes (16 elements); -128 fits in an 8-bit immediate
    vpaddq  -64(%rax), %ymm0, %ymm0   ; accumulator 3
    vpaddq  -32(%rax), %ymm3, %ymm3   ; accumulator 4
    cmpq    %rcx, %rax
    jne     .L79
```

4 independent AVX2 accumulators = 16 parallel 64-bit adds per iteration.
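
A hypothetical intrinsics version of the same pattern, with 4 independent accumulators (assumes n is a multiple of 16; the horizontal reduction repeats the sequence from the reduce_simd sketch above):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t reduce_ilp_simd(const uint64_t *array, size_t n) {
    __m256i a0 = _mm256_setzero_si256(), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n; i += 16) {              /* 4 independent vpaddq chains */
        a0 = _mm256_add_epi64(a0, _mm256_loadu_si256((const __m256i *)(array + i)));
        a1 = _mm256_add_epi64(a1, _mm256_loadu_si256((const __m256i *)(array + i + 4)));
        a2 = _mm256_add_epi64(a2, _mm256_loadu_si256((const __m256i *)(array + i + 8)));
        a3 = _mm256_add_epi64(a3, _mm256_loadu_si256((const __m256i *)(array + i + 12)));
    }
    __m256i acc = _mm256_add_epi64(_mm256_add_epi64(a0, a1),
                                   _mm256_add_epi64(a2, a3));
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (uint64_t)_mm_cvtsi128_si64(s);
}
```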

### red_opt (compiler completely free)

```asm
.L116:
    vpaddq  (%rax), %ymm0, %ymm0      ; identical to red_naive!
    addq    $32, %rax
    cmpq    %rdx, %rax
    jne     .L116
```

Identical to red_naive: the compiler auto-vectorizes the simple loop.

### Thread workers

ThreadSum (naive per-thread):

```asm
    vpaddq  ...  ; compiler auto-vectorizes each thread's work
```

ThreadSumILPSimd (optimized per-thread):

```asm
.L17:
    vpaddq  -64(%rcx,%rax,8), %ymm0, %ymm0  ; 2 AVX2 accumulators
    vpaddq  -32(%rcx,%rax,8), %ymm1, %ymm1
    ...
```

### Key Instructions Summary

| Variant | Main Loop Instruction | Accumulators | Elements/Iteration |
|---|---|---|---|
| red_naive | `vpaddq (%rax), %ymm0, %ymm0` | 1 × ymm | 4 |
| red_ilp | 2 × `vpaddq` | 2 × ymm | 8 |
| red_simd | `vpaddq (%rax), %ymm0, %ymm0` | 1 × ymm | 4 |
| red_ilp_simd | 4 × `vpaddq` | 4 × ymm | 16 |
| red_opt | `vpaddq (%rax), %ymm0, %ymm0` | 1 × ymm | 4 |

## Why Similar Performance?

All variants use AVX2 `vpaddq` (4 × 64-bit adds per instruction). More accumulators mean more ILP and fewer stalls waiting on the previous add, but at large sizes memory bandwidth is the real bottleneck. That is why red_naive ≈ red_ilp_simd for 256+ MB arrays; the only major win is threading, which saturates multiple memory channels.
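
A quick sanity check from the 1024 MB (2^30 bytes) row above: red_naive sums the array in 52.8 ms, roughly 20 GB/s of effective bandwidth, while red_all does it in 16.9 ms, roughly 64 GB/s. Threading roughly triples the achieved bandwidth, which is what you would expect if a single core cannot keep the memory controllers fully busy.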

## Conclusion

For memory-bound workloads like large array reduction, the bottleneck is memory bandwidth, not CPU compute. Multi-threading helps by utilizing multiple memory channels, while ILP/SIMD provide modest benefits.

## Running the Benchmarks

```sh
# Build
make

# Run individual benchmark
./bench red_naive 256    # 256 MB array
./bench red_ilp 256
./bench red_all 256

# Compare all variants
for v in red_naive red_ilp red_simd red_thread red_ilp_simd red_all; do
    ./bench $v 256 2>&1 | grep "Time per"
done
```