Branch Prediction

The Problem

Modern CPUs speculatively execute past branches before knowing the outcome. When the prediction is wrong, the pipeline flushes - typically 15-20 cycles wasted.

The Benchmark

Conditionally sum array elements:

for (size_t i = 0; i < n; i++) {
    if (array[i] < threshold)
        sum += array[i];
}

Three variants: - Sorted: First half below threshold, second half above. Predictor learns the pattern. - Random: 50% probability per element. Predictor wrong ~50% of the time. - Branchless: Use bit manipulation instead of a branch.

Results (64 MB array)

Variant	ns/element	vs sorted
`br_sort` (sorted)	0.85 ns	1.0×
`br_less` (branchless)	1.41 ns	1.7×
`br_rand` (random)	2.75 ns	3.2×

Observations

1. Misprediction Penalty: ~2 ns / ~6 cycles

Random data causes ~50% misprediction. The 1.9 ns difference (2.75 - 0.85) amortized over 50% mispredictions suggests ~4 ns per misprediction, or ~12 cycles at 3 GHz.

2. Branchless Trades ALU for Predictability

The branchless version:

uint64_t mask = -(array[i] < threshold);  // 0 or 0xFFFFFFFFFFFFFFFF
sum += array[i] & mask;

More instructions, but no branch. Faster than mispredicted branches, slower than predicted ones.

3. When to Use Branchless

Predictable data (sorted, patterns): Use branches
Unpredictable data (random, hashed): Use branchless
Mixed: Profile and measure

./bench br_sort 64   # Predictable (sorted)
./bench br_rand 64   # Unpredictable (random)
./bench br_less 64   # Branchless (mask)