Branch Prediction

The Problem

Modern CPUs speculatively execute past branches before knowing the outcome. When the prediction is wrong, the pipeline flushes - typically 15-20 cycles wasted.

The Benchmark

Conditionally sum array elements:

for (size_t i = 0; i < n; i++) {
    if (array[i] < threshold)
        sum += array[i];
}

Three variants: - Sorted: First half below threshold, second half above. Predictor learns the pattern. - Random: 50% probability per element. Predictor wrong ~50% of the time. - Branchless: Use bit manipulation instead of a branch.

Results (64 MB array)

Variant ns/element vs sorted
br_sort (sorted) 0.85 ns 1.0×
br_less (branchless) 1.41 ns 1.7×
br_rand (random) 2.75 ns 3.2×

Observations

1. Misprediction Penalty: ~2 ns / ~6 cycles

Random data causes ~50% misprediction. The 1.9 ns difference (2.75 - 0.85) amortized over 50% mispredictions suggests ~4 ns per misprediction, or ~12 cycles at 3 GHz.

2. Branchless Trades ALU for Predictability

The branchless version:

uint64_t mask = -(array[i] < threshold);  // 0 or 0xFFFFFFFFFFFFFFFF
sum += array[i] & mask;

More instructions, but no branch. Faster than mispredicted branches, slower than predicted ones.

3. When to Use Branchless

4. Modern Compilers

GCC/Clang may auto-convert branches to conditional moves (cmov) with -O3. Check the assembly if results seem off.

Branch Predictor Basics

Modern predictors use: - Local history: Pattern of recent outcomes for this branch - Global history: Pattern of recent branches (any branch) - Hybrid: Combines both with neural-network-like structures (TAGE)

Predictors can learn patterns like "taken 3×, not taken 1×, repeat". They fail on truly random data.

Running

./bench br_sort 64   # Predictable (sorted)
./bench br_rand 64   # Unpredictable (random)
./bench br_less 64   # Branchless (mask)