Cache invalidation is normally seen as a bad thing, but this doesn’t always have to be the case. There are ways in which you can (ab)use the cache coherency protocol to get data transfer over the memory fabric itself.
As part of my yak shaving for my key-value database, kvell, I built a stupidly fast SPSC queue over a ring buffer.
The implementation focussed on a few things: keeping each slot to exactly one cache line, keeping cross-core coherence traffic to a minimum, and never touching DRAM after warmup.
The performance of this queue is mostly due to aspects outside your direct control, and that should make you uncomfortable as a programmer. Instead of being given the ability to control the underlying hardware, we are handed a leaky abstraction of it. I will stop the rant here, or this post will run on with no end in sight about hardware, the C memory model, and Linux.
When most people think of inter-core communication, shared memory is what pops into mind. Core A copies data into a buffer, signals that it’s done processing, and Core B reads from that buffer.
But unfortunately, this is too shallow a picture of what really goes on under the hood of our tremendously complex hardware.
Modern x86 CPUs implement the MESI protocol to keep caches coherent across cores. Every cache line in this system is in one of four states:

- **Modified**: this core holds the only copy, and it is dirty (newer than DRAM).
- **Exclusive**: this core holds the only copy, and it matches DRAM.
- **Shared**: the line may live in several caches at once; every copy matches DRAM.
- **Invalid**: the line holds no usable data and must be fetched again before use.
Did you notice that we didn’t actually touch DRAM itself much? Crazy, right? But that is how modern hardware mitigates the fact that DRAM is hugely expensive to reach: a single DRAM access can cost as much as 500 instructions’ worth of time, so stalling on DRAM is a nice way to absolutely destroy performance.
What if we could use this knowledge to build a better data transfer mechanism, one that hardly touches DRAM and uses the interconnect as a stupidly fast inter-core data bus?
If we make a 512-slot ring buffer where each slot is exactly 64 bytes (one cache line), the entire working set is 32 KB. That fits in L1. When I ran perf c2c on the benchmark, only 18 out of 78,561 sampled memory operations hit DRAM. Those 18 were cold start page faults. After warmup, the working set never leaves cache. The lines just bounce between cores, carried by the coherence protocol, never touching main memory.
So the coherence protocol isn’t something we have to work around. It is the transport layer. We’re using the interconnect as a core-to-core message bus.
Most SPSC queues treat slot size as whatever sizeof(T) happens to be. Nobody really thinks about it.
Consider a queue of u32 values. Each element is 4 bytes. A single 64-byte cache line holds 16 of them. When the producer writes slot N, it modifies a cache line that also contains slots N-1 through N-15. If the consumer is reading any of those adjacent slots, the producer’s write invalidates the consumer’s copy of the entire line. The consumer has to re-fetch it via a coherence snoop. This is false sharing. The two cores aren’t touching the same data, but they’re fighting over the same coherence unit.
If each slot is exactly one cache line (64 bytes), every produce and consume is one clean cache line transfer. One write, one invalidation, one snoop. That’s the minimum cost the protocol allows.
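As a sketch of what that looks like in code (the C11 struct below and its name are mine, not necessarily what kvell uses), a slot can be pinned to exactly one cache line with `_Alignas`:

```c
#include <stdint.h>

#define CACHE_LINE 64

/* One payload slot pinned to exactly one cache line: the producer and the
   consumer never fight over a line that also holds neighbouring slots. */
typedef struct {
    _Alignas(CACHE_LINE) uint8_t data[CACHE_LINE];
} slot_t;

_Static_assert(sizeof(slot_t) == CACHE_LINE,
               "a slot must be exactly one cache line");

/* For contrast: with plain uint32_t elements, 16 of them share one line,
   so writing slot N dirties the line holding its fifteen neighbours too. */
_Static_assert(CACHE_LINE / sizeof(uint32_t) == 16,
               "16 u32 slots share a single cache line");
```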
The ring buffer is backed by mmap’d memory, which gives us page-aligned allocation without the allocator’s metadata getting in the way. Capacity is always a power of two, so index wrapping is a bitmask (index & (capacity - 1)) instead of a modulo. One instruction instead of a division.
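Below is a minimal sketch of that setup, assuming Linux and C11; the names (`ring_slot`, `ring_alloc`, `ring_index`) are mine, not kvell’s:

```c
#define _DEFAULT_SOURCE              /* for MAP_ANONYMOUS */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CACHE_LINE 64
#define CAPACITY   512               /* must be a power of two */

typedef struct {
    _Alignas(CACHE_LINE) uint8_t data[CACHE_LINE];
} ring_slot;

/* mmap hands back page-aligned, zeroed memory with no allocator metadata
   sitting on the same cache lines as the slots. */
static ring_slot *ring_alloc(void) {
    void *p = mmap(NULL, CAPACITY * sizeof(ring_slot),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (ring_slot *)p;
}

/* Power-of-two capacity: wrapping an ever-growing counter into a slot
   index is a single AND instead of a division. */
static inline size_t ring_index(uint64_t counter) {
    return (size_t)(counter & (CAPACITY - 1));
}
```

Mapping with `MAP_POPULATE`, or simply touching every page once at startup, would also get the cold-start page faults mentioned earlier out of the way before any measurement.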
Remember that each slot in the ring buffer is one cache line. Every write and read in the queue is the same MESI dance from earlier. The producer writes a slot and that line goes Modified in its L1. The consumer reads the slot and the coherence protocol snoops it straight out of the producer’s L1. Core-to-core, DRAM not involved.
512 slots is 32 KB. L1 is typically 32-48 KB. The whole ring buffer lives in cache. After the initial page faults on first touch, every data access for the lifetime of the queue is either an L1 hit or an L1-to-L1 snoop. The data never falls out to main memory.
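The arithmetic is small enough to encode as a compile-time check; a tiny sketch, taking 32 KB as the conservative end of that range:

```c
#define CACHE_LINE  64
#define CAPACITY    512
#define L1D_BYTES   (32 * 1024)   /* assumed lower bound for L1d size */

/* 512 slots x 64 bytes = 32 KB: the whole ring fits in L1d. */
_Static_assert(CAPACITY * CACHE_LINE <= L1D_BYTES,
               "ring buffer working set must fit in L1d");
```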
Your CPU doesn’t execute one instruction at a time. It has a pipeline (fetch, decode, execute, memory, writeback) and overlaps the stages so multiple instructions are in flight at once. While one instruction is executing, the next is being decoded, and the one after that is being fetched. When things flow smoothly you get close to one instruction retired per cycle.
A pipeline stall is when something blocks this flow. In our case the usual culprit is a load instruction that misses L1. The core issues the load, but the data has to come from somewhere else: another core’s cache, L3, or DRAM. Until it arrives, anything that depends on it just sits there waiting.
Out-of-order execution can hide some of that latency by doing unrelated work while waiting, but in a tight loop like our queue there’s barely any independent work to schedule. The load of the head/tail counter feeds directly into the comparison that decides the next iteration, so the core just waits.
The naive ring buffer has exactly this problem. Every queue operation does two atomic loads and one atomic store. One of those loads is cross-core: it has to snoop a cache line out of the other core’s L1 over the interconnect. That’s ~40-70 cycles of the core sitting idle, waiting for the coherence protocol to hand over the data.
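To make those three operations concrete, here is a minimal sketch of a naive producer path using C11 atomics; the names are illustrative, and the consumer side mirrors it with the roles of `head` and `tail` swapped:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CAPACITY 512

typedef struct {
    _Alignas(64) uint8_t data[64];
} slot_t;

typedef struct {
    _Alignas(64) _Atomic uint64_t head;   /* written by the producer */
    _Alignas(64) _Atomic uint64_t tail;   /* written by the consumer */
    _Alignas(64) slot_t *slots;           /* read-only after init */
} naive_queue_t;

/* Naive push: two atomic loads and one atomic store per operation.
   The load of `tail` is the cross-core one: that line lives Modified in
   the consumer's L1, so the producer stalls while the coherence protocol
   snoops it over the interconnect. */
static bool naive_push(naive_queue_t *q, const slot_t *item) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed); /* own line */
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire); /* cross-core snoop */

    if (head - tail == CAPACITY)                      /* queue full */
        return false;

    memcpy(&q->slots[head & (CAPACITY - 1)], item, sizeof *item);
    atomic_store_explicit(&q->head, head + 1, memory_order_release);     /* publish */
    return true;
}
```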
The fix: cache the other side’s counter locally. The writer keeps a tail_cache, a plain integer, not atomic. The reader keeps a head_cache. You check the local copy first. If it says the queue isn’t full (or empty), you skip the cross-core load entirely. You only pay for the snoop when the cached value is stale enough to give a wrong answer.
When is the cache wrong? Only when the queue is nearly full or nearly empty. With a 512-slot buffer that rarely happens in steady state. Cross-core loads drop from one per operation to roughly one per 512 operations.
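A sketch of the cached-counter variant on the producer side, following the same illustrative conventions as the naive version above (names and field layout are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CAPACITY 512

typedef struct {
    _Alignas(64) uint8_t data[64];
} slot_t;

typedef struct {
    /* Producer-owned cache line: its own counter plus a plain, non-atomic
       copy of the consumer's counter. */
    _Alignas(64) _Atomic uint64_t head;
    uint64_t tail_cache;

    /* Consumer-owned cache line: the mirror image. */
    _Alignas(64) _Atomic uint64_t tail;
    uint64_t head_cache;

    /* Read-only after init; kept off the hot counters' lines. */
    _Alignas(64) slot_t *slots;
} cached_queue_t;

static bool cached_push(cached_queue_t *q, const slot_t *item) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);

    /* Fast path: trust the local copy, zero cross-core traffic. */
    if (head - q->tail_cache == CAPACITY) {
        /* Looks full: pay for one real cross-core load to refresh the cache. */
        q->tail_cache = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - q->tail_cache == CAPACITY)
            return false;                             /* genuinely full */
    }

    memcpy(&q->slots[head & (CAPACITY - 1)], item, sizeof *item);
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

The consumer’s pop is symmetric: it trusts `head_cache` and refreshes it with one acquire load of `head` only when the queue looks empty.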
Play around with the calculator below. Crank the buffer capacity up to 512 and watch the cross-core loads disappear. Drag it back to 1 and they equalise, because with a single slot the cache is always stale. For most producer-consumer patterns a few hundred slots is enough to make cross-core traffic negligible.
| | Uncached | Cached |
|---|---|---|
| Own loads (L1 hit) | 20.0K | 20.0K |
| Cross-core loads (snoop) | 20.0K | 2.5K |
| Stores | 20.0K | 20.0K |
| Total atomic ops | 60.0K | 42.5K |
Key numbers from the output header:
| Metric | Samples | Share |
|---|---|---|
| Load L1D hit | 16134 | 69% |
| Load Local HITM | 46 | 0.2% |
| Load Local DRAM | 18 | 0.08% |
| Store L1D Hit | 53579 | 99.7% |
What these mean:

- **Load L1D hit**: the load was served from the core’s own L1, the common case for reads of counters the core itself owns.
- **Load Local HITM**: the load hit a line that was Modified in another core’s cache, i.e. a genuine cross-core coherence transfer. These are exactly the snoops the cached counters are meant to minimise.
- **Load Local DRAM**: the load went all the way to main memory. Only the cold-start page faults land here.
- **Store L1D Hit**: the store landed in a line the producer’s L1 already owned, so no coherence traffic was needed at write time.
Note: perf c2c uses sampling, not counting. The absolute numbers are a fraction of reality. The ratios are what matter.