Cache invalidation is normally seen as a bad thing, but this doesn’t always have to be the case. There are ways in which you can (ab)use the cache coherency protocol to get data transfer over the memory fabric itself.
As part of my yak shaving for my key-value database, kvell, I built a stupidly fast SPSC queue over a ring buffer.
The implementation focussed on a few things: keeping each slot to exactly one cache line, keeping cross-core coherence traffic to a minimum, and never touching DRAM after warmup.
The performance of this queue is mostly due to aspects outside your direct control, and that should make you uncomfortable as a programmer. Instead of being given the ability to control the underlying hardware, we are handed a leaky abstraction of it. I will stop the rant here, or this post will run on with no end in sight about hardware, the C memory model, and Linux.
When most people think of inter-core communication, shared memory is what pops into mind. Core A copies data into a buffer, signals that it’s done processing, and Core B reads from that buffer.
But unfortunately, this is too shallow a picture of what really goes on under the hood of our tremendously complex hardware.
Modern x86 CPUs implement the MESI protocol to keep caches coherent across cores. Every cache line in this system is in one of four states:

- **Modified**: this core holds the only copy, and it is dirty (newer than DRAM).
- **Exclusive**: this core holds the only copy, and it matches DRAM.
- **Shared**: the line may live in several caches at once; every copy matches DRAM.
- **Invalid**: the line holds no usable data and must be fetched again before use.
Did you notice that we didn’t actually touch DRAM itself much? Crazy, right? But that is how modern hardware mitigates the fact that DRAM is hugely expensive to reach: a single DRAM access can cost as much as 500 instructions’ worth of time, so stalling on DRAM is a nice way to absolutely destroy performance.
What if we could use this knowledge to build a better data transfer mechanism, one that hardly touches DRAM and uses the interconnect as a stupidly fast inter-core data bus?
If we make a 512-slot ring buffer where each slot is exactly 64 bytes (one cache line), the entire working set is 32 KB. That fits in L1. When I ran perf c2c on the benchmark, only 18 out of 78,561 sampled memory operations hit DRAM. Those 18 were cold start page faults. After warmup, the working set never leaves cache. The lines just bounce between cores, carried by the coherence protocol, never touching main memory.
So the coherence protocol isn’t something we have to work around. It is the transport layer. We’re using the interconnect as a core-to-core message bus.
Most SPSC queues treat slot size as whatever sizeof(T) happens to be. Nobody really thinks about it.
Consider a queue of u32 values. Each element is 4 bytes. A single 64-byte cache line holds 16 of them. When the producer writes slot N, it modifies a cache line that also contains slots N-1 through N-15. If the consumer is reading any of those adjacent slots, the producer’s write invalidates the consumer’s copy of the entire line. The consumer has to re-fetch it via a coherence snoop. This is false sharing. The two cores aren’t touching the same data, but they’re fighting over the same coherence unit.
If each slot is exactly one cache line (64 bytes), every produce and consume is one clean cache line transfer. One write, one invalidation, one snoop. That’s the minimum cost the protocol allows.
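As a sketch of what that looks like in code (the C11 struct below and its name are mine, not necessarily what kvell uses), a slot can be pinned to exactly one cache line with `_Alignas`:

```c
#include <stdint.h>

#define CACHE_LINE 64

/* One payload slot pinned to exactly one cache line: the producer and the
   consumer never fight over a line that also holds neighbouring slots. */
typedef struct {
    _Alignas(CACHE_LINE) uint8_t data[CACHE_LINE];
} slot_t;

_Static_assert(sizeof(slot_t) == CACHE_LINE,
               "a slot must be exactly one cache line");

/* For contrast: with plain uint32_t elements, 16 of them share one line,
   so writing slot N dirties the line holding its fifteen neighbours too. */
_Static_assert(CACHE_LINE / sizeof(uint32_t) == 16,
               "16 u32 slots share a single cache line");
```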
The ring buffer is backed by mmap’d memory, which gives us page-aligned allocation without the allocator’s metadata getting in the way. Capacity is always a power of two, so index wrapping is a bitmask (index & (capacity - 1)) instead of a modulo. One instruction instead of a division.
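Below is a minimal sketch of that setup, assuming Linux and C11; the names (`ring_slot`, `ring_alloc`, `ring_index`) are mine, not kvell’s:

```c
#define _DEFAULT_SOURCE              /* for MAP_ANONYMOUS */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CACHE_LINE 64
#define CAPACITY   512               /* must be a power of two */

typedef struct {
    _Alignas(CACHE_LINE) uint8_t data[CACHE_LINE];
} ring_slot;

/* mmap hands back page-aligned, zeroed memory with no allocator metadata
   sitting on the same cache lines as the slots. */
static ring_slot *ring_alloc(void) {
    void *p = mmap(NULL, CAPACITY * sizeof(ring_slot),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (ring_slot *)p;
}

/* Power-of-two capacity: wrapping an ever-growing counter into a slot
   index is a single AND instead of a division. */
static inline size_t ring_index(uint64_t counter) {
    return (size_t)(counter & (CAPACITY - 1));
}
```

Mapping with `MAP_POPULATE`, or simply touching every page once at startup, would also get the cold-start page faults mentioned earlier out of the way before any measurement.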
Remember that each slot in the ring buffer is one cache line. Every write and read in the queue is the same MESI dance from earlier. The producer writes a slot and that line goes Modified in its L1. The consumer reads the slot and the coherence protocol snoops it straight out of the producer’s L1. Core-to-core, DRAM not involved.
512 slots is 32 KB. L1 is typically 32-48 KB. The whole ring buffer lives in cache. After the initial page faults on first touch, every data access for the lifetime of the queue is either an L1 hit or an L1-to-L1 snoop. The data never falls out to main memory.
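The arithmetic is small enough to encode as a compile-time check; a tiny sketch, taking 32 KB as the conservative end of that range:

```c
#define CACHE_LINE  64
#define CAPACITY    512
#define L1D_BYTES   (32 * 1024)   /* assumed lower bound for L1d size */

/* 512 slots x 64 bytes = 32 KB: the whole ring fits in L1d. */
_Static_assert(CAPACITY * CACHE_LINE <= L1D_BYTES,
               "ring buffer working set must fit in L1d");
```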
Your CPU doesn’t execute one instruction at a time. It has a pipeline (fetch, decode, execute, memory, writeback) and overlaps the stages so multiple instructions are in flight at once. While one instruction is executing, the next is being decoded, and the one after that is being fetched. When things flow smoothly you get close to one instruction retired per cycle.
A pipeline stall is when something blocks this flow. In our case the usual culprit is a load instruction that misses L1. The core issues the load, but the data has to come from somewhere else: another core’s cache, L3, or DRAM. Until it arrives, anything that depends on it just sits there waiting.
Out-of-order execution can hide some of that latency by doing unrelated work while waiting, but in a tight loop like our queue there’s barely any independent work to schedule. The load of the head/tail counter feeds directly into the comparison that decides the next iteration, so the core just waits.
The naive ring buffer has exactly this problem. Every queue operation does two atomic loads and one atomic store. One of those loads is cross-core: it has to snoop a cache line out of the other core’s L1 over the interconnect. That’s ~40-70 cycles of the core sitting idle, waiting for the coherence protocol to hand over the data.
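To make those three operations concrete, here is a minimal sketch of a naive producer path using C11 atomics; the names are illustrative, and the consumer side mirrors it with the roles of `head` and `tail` swapped:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CAPACITY 512

typedef struct {
    _Alignas(64) uint8_t data[64];
} slot_t;

typedef struct {
    _Alignas(64) _Atomic uint64_t head;   /* written by the producer */
    _Alignas(64) _Atomic uint64_t tail;   /* written by the consumer */
    _Alignas(64) slot_t *slots;           /* read-only after init */
} naive_queue_t;

/* Naive push: two atomic loads and one atomic store per operation.
   The load of `tail` is the cross-core one: that line lives Modified in
   the consumer's L1, so the producer stalls while the coherence protocol
   snoops it over the interconnect. */
static bool naive_push(naive_queue_t *q, const slot_t *item) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed); /* own line */
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire); /* cross-core snoop */

    if (head - tail == CAPACITY)                      /* queue full */
        return false;

    memcpy(&q->slots[head & (CAPACITY - 1)], item, sizeof *item);
    atomic_store_explicit(&q->head, head + 1, memory_order_release);     /* publish */
    return true;
}
```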
The fix: cache the other side’s counter locally. The writer keeps a tail_cache, a plain integer, not atomic. The reader keeps a head_cache. You check the local copy first. If it says the queue isn’t full (or empty), you skip the cross-core load entirely. You only pay for the snoop when the cached value is stale enough to give a wrong answer.
When is the cache wrong? Only when the queue is nearly full or nearly empty. With a 512-slot buffer that rarely happens in steady state. Cross-core loads drop from one per operation to roughly one per 512 operations.
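A sketch of the cached-counter variant on the producer side, following the same illustrative conventions as the naive version above (names and field layout are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CAPACITY 512

typedef struct {
    _Alignas(64) uint8_t data[64];
} slot_t;

typedef struct {
    /* Producer-owned cache line: its own counter plus a plain, non-atomic
       copy of the consumer's counter. */
    _Alignas(64) _Atomic uint64_t head;
    uint64_t tail_cache;

    /* Consumer-owned cache line: the mirror image. */
    _Alignas(64) _Atomic uint64_t tail;
    uint64_t head_cache;

    /* Read-only after init; kept off the hot counters' lines. */
    _Alignas(64) slot_t *slots;
} cached_queue_t;

static bool cached_push(cached_queue_t *q, const slot_t *item) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);

    /* Fast path: trust the local copy, zero cross-core traffic. */
    if (head - q->tail_cache == CAPACITY) {
        /* Looks full: pay for one real cross-core load to refresh the cache. */
        q->tail_cache = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - q->tail_cache == CAPACITY)
            return false;                             /* genuinely full */
    }

    memcpy(&q->slots[head & (CAPACITY - 1)], item, sizeof *item);
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

The consumer’s pop is symmetric: it trusts `head_cache` and refreshes it with one acquire load of `head` only when the queue looks empty.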
Play around with the calculator below. Crank the buffer capacity up to 512 and watch the cross-core loads disappear. Drag it back to 1 and they equalise, because with a single slot the cache is always stale. For most producer-consumer patterns a few hundred slots is enough to make cross-core traffic negligible.
| | Uncached | Cached |
|---|---|---|
| Own loads (L1 hit) | 20.0K | 20.0K |
| Cross-core loads (snoop) | 20.0K | 2.5K |
| Stores | 20.0K | 20.0K |
| Total atomic ops | 60.0K | 42.5K |
Key numbers from the output header:
| Metric | Samples | Share |
|---|---|---|
| Load L1D hit | 16134 | 69% |
| Load Local HITM | 46 | 0.2% |
| Load Local DRAM | 18 | 0.08% |
| Store L1D Hit | 53579 | 99.7% |
What these mean:

- **Load L1D hit**: the load was served from the core’s own L1, the common case for reads of counters the core itself owns.
- **Load Local HITM**: the load hit a line that was Modified in another core’s cache, i.e. a genuine cross-core coherence transfer. These are exactly the snoops the cached counters are meant to minimise.
- **Load Local DRAM**: the load went all the way to main memory. Only the cold-start page faults land here.
- **Store L1D Hit**: the store landed in a line the producer’s L1 already owned, so no coherence traffic was needed at write time.
Note: perf c2c uses sampling, not counting. The absolute numbers are a fraction of reality. The ratios are what matter.