Store-to-Load Forwarding

The Problem

When a load reads from an address that was just written, the CPU can forward the data directly from the store buffer instead of waiting for the store to commit to cache. This is store-to-load forwarding.

However, forwarding only works when the load exactly matches the store. Partial overlaps cause a store forwarding stall - the CPU must wait for the store to commit before loading.

The Benchmark

Three patterns:

// Aligned: store 8 bytes, load same 8 bytes (forwarding works)
*(uint64_t *)(bytes) = i;
sum += *(uint64_t *)(bytes);

// Overlap: store at offset 1, load at offset 0 (forwarding fails)
*(uint64_t *)(bytes + 1) = i;
sum += *(uint64_t *)(bytes);

// Independent: store and load different addresses (no dependency)
array[0] = i;
sum += array[64];

Results (64 MB array)

Pattern ns/op vs forwarding
sf_fwd (aligned) 0.52 ns 1.0×
sf_indep (independent) 0.70 ns 1.3×
sf_stall (overlap) 3.64 ns 7.0×

Observations

1. Forwarding is Faster Than Independent

The aligned case (0.52 ns) beats the independent case (0.70 ns). Store-forwarding provides data faster than even an L1 cache hit because: - No cache lookup needed - Data comes directly from store buffer - Zero latency dependency

2. Overlap Stall: ~3 ns Penalty

When the load partially overlaps the store, the CPU cannot forward. It must: 1. Wait for the store to commit to L1 cache 2. Then perform the load from cache

This adds ~3 ns (roughly 10 cycles at 3 GHz).

3. When Forwarding Fails

Store-forwarding fails when: - Load is smaller than store and not aligned to store start - Load is larger than store - Load spans multiple stores - Store and load have different sizes at same address (sometimes)

4. Compiler Implications

The compiler doesn't know about store-forwarding stalls. Code like:

struct { char a; int64_t b; } __attribute__((packed)) s;
s.b = value;
use(s.b);  // May stall if 'a' was recently written

Can cause unexpected slowdowns.

Running

./bench sf_fwd 64     # Aligned (forwarding)
./bench sf_stall 64   # Overlapping (stall)
./bench sf_indep 64   # Independent (no dependency)