Memory barriers are often confusing to people, newbies and consummate professionals alike. I reckon it is because most people learn that they need to use memory barriers but seldom go the extra kilometer of finding out why they're needed and what they're abstracting. Sure, they know that it is to prevent memory reordering but not why the reordering happens in the first place. As such, questions about memory barriers are brought up often in programming social circles.
This would be fine if it were possible to explain the nuances succinctly in a chat room inhabited by people with varying degrees of expertise on the subject, all generating noise in the communication channel. It isn't. And more often than not, I get annoyed by people butchering it. My goal here is to take people on a tour and hopefully have a linkable resource by the end of it. But to get to the heart of this story, we've got to go back to the beginni— we've got to make sure your understanding of CPU caches is not merely "the stuff that hides main memory latency through black magic".
For whom it may concern
I know it's a little bit late, but I am the worst combination of busy and lazy. As an apology, I went a little above and beyond. If you need further clarification, @ me on chat.
Let's make sure we map to the same page
A cache is any construct which stores often needed resources in a quickly accessible manner, resources which would otherwise be temporally prohibitive to continually obtain from a backing store on an individual basis. While most of us are used to thinking about them in the context of software and hardware, the notion is based on real world caches which expire. Everyone has a cache. I have one on my desk, a miniature trash can filled with sour Skittles obtained from a local store called Plodine. As I often reach for them, buying them in spades from the store enables me quick access to them as long as they're not depleted, after which another roundtrip is needed.

Computer caches, on the other hand, are quick access staging areas for data and instructions. They do not expire unless instructed to do so, and the system has a responsibility to propagate changes both ways, from the caches to the comparatively slow main memory and vice versa, in order for the state not to become corrupted or stale. Every load and store needs to be sanitized and checked for consistency, which can be guaranteed by following carefully designed protocols to the letter (which computers are extremely good at).

In order to avoid diminishing returns, caches are never made too large. Multiple levels of cache are used instead, progressively increasing in size (and access times). This requires additional bookkeeping, but nets the best returns. Each respective level's size seldom changes. In fact, Intel hasn't changed the size of the first two levels of cache in over 10 years. They're actually on the brink of doing so with the next-gen Ice Lake architecture, increasing combined L1 cache to 80 KiB (data-biased, 3:2) and L2 to 512 KiB.
The architecture of a CPU cache reflects its purpose. Given its function is to be a buffer between the CPU and main memory, accessing it must be fast and a direct function of memory address. Thus, a CPU cache needs to be designed like a hardware hash table. And for the most part, that is exactly what it is. Furthermore, given that accesses in memory are more often than not localized, access granularity for a cache shouldn't be too small. Every time a fresh bit of memory is touched, it is desirable to take out an aligned swath, known as a cache line. The choice of size is not an easy one, but Intel has settled on using 64 bytes for their desktop products ever since Pentium IV. A single cache line is augmented with necessary metadata such as flags and tags. Thus, physically, caches are larger than their logical description for the purposes of internal bookkeeping.
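If you'd like to check the line size on your own machine, here is a minimal sketch; it assumes a Linux system with glibc, whose sysconf() exposes the L1 data cache line size as a non-standard extension.

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; other libcs may not
     * provide it, in which case sysconf returns -1 or 0. On Intel desktop
     * parts from the Pentium IV onward this prints 64. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d cache line size: %ld bytes\n", line);
    return 0;
}
```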
As the cache gets populated (not even necessarily filled up to capacity), due to its design as a fixed-size hardware hash table, conflicts on loads are inevitable. This means that a cache line occupying one of the contested set's ways needs to be sanitized and evicted. Thus, eviction protocols are of paramount importance, especially as data may need to be propagated back to main memory. Other cores may also need to be notified of the eviction.

With all this in mind, a CPU cache is composed of hash buckets called "sets", each of which can be inhabited in n "ways", every way sufficiently sized to harbor a single cache line. To put this in practical terms, one of my favorite Intel processors of all time, the i7-2600K (which is still alive and kicking in many machines almost a decade later), has an evenly split 8-way set associative 64 KiB L1 cache. 32 KiB (2^15 B) are dedicated to data and the size of a cache line, as discussed, is 64 B (2^6 B). In total, it can hold 32768/64 = 512 cache lines. Each set can hold 8 of these cache lines, thus the number of sets is 64.
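The same arithmetic in code form, using the figures just mentioned (the constants are the 2600K's as quoted above, not anything queried from hardware):

```c
#include <stdio.h>

int main(void)
{
    /* i7-2600K L1 data cache, as described above. */
    unsigned cache_size = 32 * 1024; /* 32 KiB of data          */
    unsigned line_size  = 64;        /* 64 B per cache line     */
    unsigned ways       = 8;         /* 8-way set associative   */

    unsigned lines = cache_size / line_size; /* 512 cache lines */
    unsigned sets  = lines / ways;           /* 64 sets         */

    printf("%u lines spread across %u sets of %u ways\n", lines, sets, ways);
    return 0;
}
```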
| Set \ Way | 0          | 1          | 2          | 3          |
|-----------|------------|------------|------------|------------|
| 0x0       | 0x00000000 |            |            |            |
| 0x1       | 0x00000040 | 0x10000040 | 0x20000040 |            |
| 0x2       | 0x00000080 | 0x10000080 |            |            |
| 0x3       | 0x000000c0 | 0x130000c0 | 0x200000c0 | 0x300000c0 |
| 0x4       | 0x00000100 |            |            |            |
| 0x5       | 0x00000140 |            |            |            |
| 0x6       | 0x00000180 |            |            |            |
| 0x7       | 0x000001c0 |            |            |            |
As you might imagine, this would be difficult to showcase and reason about here, thus we will be using a fictional, crappy two-core multiprocessor for the duration of this article. It conveniently has only one level of four-way set associative cache per core (eight sets each), a preview of which you can see above. The complexity of adding more levels of cache is staggering; just consider that a decade-old processor like the 2600K also has a similarly configured 256 KiB L2 cache per core and a third level of cache, up to 16-way set associative, which is shared between four cores (lending up to 2 MiB to each) and the integrated GPU. Intel refers to it as the Last Level Cache or LLC. Not to mention that cache functionality, interop and even mere addressing rise in complexity with the level of cache discussed.
Indexing an L1 cache
Sticking to the lowest level, how do we go from a memory address to a slot in the L1 cache? We utilize the low bits of the underlying virtual address (prior to translation, as the relevant bits are left unchanged by it given that the smallest page size is 4096 bytes). Since we decided on a 64-byte-wide cache line (2^6 B), the six least significant bits select a byte within the cache line that gets shoveled into a specific way of a specific set. Given that we only have 8 sets in our toy cache, we only need bits 6-8 to determine in which set our cache line lands. In parallel with this, we also translate the virtual address into the physical one and use the tags derived from the upper physical address bits, associated with each "way", to determine which way gets it. Specifically, the tag bit count is the full address width minus the offset width of the cache line itself (6 in our case) and the index width (3 in our case).
Assuming our processor is 32-bit (because nobody has time to write out 48-52 bits), say we have a memory address of 0x130000c0; bitwise that's 0001 0011 0000 0000 0000 0000 1100 0000. Bits 6-8 are 011, or 0x3. The "ways" divide the indexed set into slots as a function of the tag (which is computed as the remaining 32−6−3=23 upper bits of the translated physical address). In our table, it lands in way 1 of set 0x3.
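Here is that decomposition spelled out as a small program, a sketch written for the toy cache's parameters (64 B lines, 8 sets); the helper names are mine.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy cache parameters from the article: 64 B lines, 8 sets. */
#define OFFSET_BITS 6   /* log2(64) */
#define INDEX_BITS  3   /* log2(8)  */

int main(void)
{
    uint32_t addr = 0x130000c0;  /* example address from the text */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte within the line  */
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* which set it lands in */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* remaining 23 bits     */

    /* Prints: offset=0 set=0x3 tag=0x98000 */
    printf("offset=%u set=0x%x tag=0x%x\n",
           (unsigned)offset, (unsigned)set, (unsigned)tag);
    return 0;
}
```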
Some lingo to keep in mind
Failure to obtain data from the cache is known as a cache miss, and it instigates a load of the cache line containing the desired data. A cache miss caused by a different cache line occupying the desired slot of a set is sometimes called a capacity miss (even though the cache is not necessarily at capacity). A cache miss on a recently evicted line is referred to as an associativity miss.
Duplicating cores means duplicating caches
The nuanced dance between all the cores and their local low-level caches is where the complexity arises, especially once we consider writing data as opposed to simply reading it. Making a fast cache for each core is trivial compared to ensuring timely data consistency across multiple cores in a multithreaded environment. We will show how modern cache coherency design introduces the need for memory barriers in order to remain as fast as possible while producing results we deem correct (because the computer doesn't care).
Take, for example, a simple write to the cache by CPU 0. The target cache line might be shared/duplicated by CPU 1. Obviously, you don't write the value to both caches; you just tell CPU 1 (we'll see exactly how shortly) to consider its cache line invalid and mark CPU 0's cache line as modifiable. Then we proceed to write our new data. Should CPU 1 need it later on, it will have to request it from CPU 0's cache after suffering a cache miss of its own. You're beginning to see the importance of keeping data coherent across multiple cores and caches. But how exactly is this achieved, especially as multiple threads start manipulating the same data? To cover all the possibilities, the protocols involved are very involved. However, we're in luck, as there's only one specific and comparatively simple protocol that directly concerns us for the purposes of this article.
Comic relief
This reminds me of the joke where a man claimed to be quick at arithmetic. When asked what's 18 times 19, he blurted out 26 to everyone's bamboozlement. Alas, he said he was quick at arithmetic, not necessarily correct.
The Illinois Protocol
The MESI (Modified / Exclusive / Shared / Invalid) protocol, also known as the Illinois protocol, is an invalidation-based cache-coherence protocol compatible with the more performant write-back caches, as opposed to write-through ones. Its name is an acronym of the four states on which it is based.
Each cache line takes one of these states, encoded in an additional two-bit tag associated with it. Following is a short description of the possible states:
- Modified state is set on any cache line in the cache of a CPU that has modified it recently, usually after moving from the Exclusive state. This means it exists only in that CPU's cache, as the others received an invalidation order beforehand. Furthermore, as we're talking about a write-back cache, the CPU is responsible for this data: either to write it back to main memory or to hand it off to some other CPU's cache (share it, then receive an invalidation order of its own).
- Exclusive state is set prior to modification of a cache line. At this point, the data is still up-to-date and the CPU itself has no responsibilities towards it yet. In this state, it is accessible only by the CPU in whose cache it is marked as such.
- Shared state is set on any cache line which is shared across multiple CPUs; in other words, there is at least one copy of it in other CPUs' caches. In this state, it is up-to-date and cannot be modified without first moving the acting CPU's copy to Exclusive and ordering the remaining ones to Invalid.
- Invalid state is set on any cache line which is currently "open for business", unused. It is preferable to place any new data inside invalid cache lines, as evicting occupied ones incurs overhead in the way of MESI state changes communicated between the participating CPUs and their caches. Not that there is much you can do about the choice of set and way, though.
How are these states transitioned? Messages. MESI is a message-passing protocol. The messages themselves are simple, the most important ones being the read and invalidate requests alongside their responses, including the combined read-invalidate message commonly known as RFO (Read For Ownership). And, of course, the writeback message, which facilitates the timely completion of a CPU's responsibilities towards a cache line in the Modified state before it gets evicted.
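To make the transitions a bit more concrete, here is a toy model of what a cache might do upon receiving an RFO for one of its lines. It is an illustration-only sketch under the simplified rules above; real hardware does this in dedicated logic, and the types and names are mine.

```c
#include <stdio.h>

/* The two MESI tag bits, spelled out. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    unsigned tag;
    mesi_state state;
    unsigned char data[64];     /* one cache line worth of payload */
} cache_line;

/* Another CPU sent a read-invalidate (RFO) for this line: supply the data if
 * we have a copy, and if we were responsible for it (Modified), make sure it
 * is handed off before we drop it. Either way, our copy becomes Invalid.
 * Returns 1 if this cache supplied the data. */
int on_rfo(cache_line *line)
{
    int supplied = 0;
    switch (line->state) {
    case MODIFIED:
        supplied = 1;   /* only up-to-date copy: hand it off / write it back */
        break;
    case EXCLUSIVE:
    case SHARED:
        supplied = 1;   /* clean copy, nothing owed to main memory */
        break;
    case INVALID:
        break;          /* we never had it, nothing to do */
    }
    line->state = INVALID;  /* this is the acknowledgement the writer waits for */
    return supplied;
}

int main(void)
{
    cache_line line = { .tag = 0x98000, .state = MODIFIED };
    printf("supplied=%d, state=%d (INVALID)\n", on_rfo(&line), line.state);
    return 0;
}
```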
Suppose now that CPU 0 wants to modify a cache line which sits in CPU 1's cache while its own copy is marked as effectively invalid. By agreement, to ensure cache coherency, it cannot write to the cache line until all other CPUs invalidate their copies and send a confirmation, having wrapped up any actions on it still in progress. This is wasteful, as CPU 0 will stall while waiting for the invalidation confirmation after sending an RFO, especially as it intends to just overwrite the data immediately.
Write policies
A write-back cache doesn't rush to reflect its changes in main memory; data propagation is deferred until a block is evicted/replaced. A write-through cache synchronously writes both to the cache and the backing memory. The former requires more bookkeeping than the latter.
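As a quick illustration, here is a sketch of the two policies acting on a single cached block; it is a toy model with made-up names, not how any real cache controller is written.

```c
#include <stdio.h>
#include <string.h>

/* "backing" stands in for the block's home in main memory. */
typedef struct {
    unsigned char cached[64];
    unsigned char backing[64];
    int dirty;
} block;

/* Write-through: every store hits the cache and main memory synchronously. */
static void store_write_through(block *b, int off, unsigned char v)
{
    b->cached[off]  = v;
    b->backing[off] = v;
}

/* Write-back: stores touch only the cache; main memory is paid back later. */
static void store_write_back(block *b, int off, unsigned char v)
{
    b->cached[off] = v;
    b->dirty = 1;               /* the extra bookkeeping mentioned above */
}

static void evict(block *b)
{
    if (b->dirty)               /* deferred propagation happens on eviction */
        memcpy(b->backing, b->cached, sizeof b->cached);
    b->dirty = 0;
}

int main(void)
{
    block b = {0};
    store_write_back(&b, 0, 42);
    printf("before eviction: %d, ", b.backing[0]);  /* 0, memory is stale   */
    evict(&b);
    printf("after eviction: %d\n", b.backing[0]);   /* 42, now written back */
    return 0;
}
```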
Enter the store buffer
To this end, store buffers were introduced. Their name is apt: the CPU performs intermediate stores into them while awaiting invalidation confirmations from other CPUs. This allows the CPU in question to process further instructions without stalling. To avoid reading stale information when acting on its own cached data, the CPU owning the store buffer also checks it for newer values waiting to be written through and acts on those instead. Other CPUs cannot see the contents of their twins' store buffers.
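The lookup described above (often called store forwarding) can be modeled in a few lines of C. This is purely an illustration of the bookkeeping, with invented names and a stubbed-out cache.

```c
#include <stdio.h>

#define SB_SIZE 4

struct pending_store { unsigned addr; unsigned value; };

/* One CPU's private store buffer: writes park here while the RFO /
 * invalidation round-trip for their cache line is still in flight. */
static struct store_buffer {
    struct pending_store entries[SB_SIZE];
    int count;
} sb;

static unsigned cache_read(unsigned addr) { (void)addr; return 0; } /* stub */

static void buffered_store(unsigned addr, unsigned value)
{
    sb.entries[sb.count++] = (struct pending_store){ addr, value };
}

static unsigned load(unsigned addr)
{
    /* The owning CPU checks its own pending stores first, newest wins;
     * every other CPU goes straight to its cache and sees nothing. */
    for (int i = sb.count - 1; i >= 0; i--)
        if (sb.entries[i].addr == addr)
            return sb.entries[i].value;
    return cache_read(addr);
}

int main(void)
{
    buffered_store(0x130000c0, 7);
    printf("%u\n", load(0x130000c0));   /* 7: forwarded from the store buffer */
    printf("%u\n", load(0x13000000));   /* 0: nothing pending, read the cache */
    return 0;
}
```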
This is what instigates the need for write memory barriers. Some writes, which require access to an unowned cache line, will be stuck in the store buffer, while others, whose target cache line is already in the cache (Modified / Exclusive), will hit the cache first, even though their instructions came in second, at least from the perspective of another CPU (which cannot snoop its twins' store buffers). The writes are out of order for the sake of speed. While the pursuit of speed is great, sometimes you need to slow down to get things right. Most of the time, this doesn't matter. But when you depend on some variables being modified in a specific memory order, you need a write memory barrier instruction in place.
In reality, it's actually an attribute on another instruction rather than an instruction in its own right (but it can be). The CPU executing the memory barrier will do two things. First, mark the store buffer entries that precede the barrier and are holding up the flush (awaiting RFO/invalidation confirmations). Second, prevent any subsequent instruction from writing to the cache by buffering those writes as well (unmarked). This prevents cache writes from going out of order with respect to what the programmer expects since, once again, other CPUs cannot see the contents of their twins' store buffers. Once the invalidation confirmations are received, the store buffer is flushed to the cache and the changes become visible to other threads, in the proper effective order.
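In portable code you rarely emit the barrier yourself; you ask the compiler for the ordering and it emits whatever the target needs. A minimal sketch of the writer's side using C11 atomics (the variable names are mine, and the matching reader's half follows in the next section):

```c
#include <stdatomic.h>

int payload;          /* plain data another CPU will want to observe */
atomic_int ready;     /* flag saying that the payload has been set   */

void producer(void)
{
    payload = 42;

    /* Write memory barrier: every store above must become visible to other
     * CPUs before any store below it. The compiler emits a real barrier
     * instruction only on architectures that need one. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}
```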
"Let's go into invalidation debt!"
On the flip side, hardware designers found another performance loophole to exploit. Instead of actually invalidating the cache line immediately, the CPU just goes into debt by sending the confirmation right away. This forms a queue of pending invalidations which needs to be cleared before sending any further messages regarding the affected cache lines. This is useful because the CPU might be busy with something more important than invalidating an unused cache line at that moment. Furthermore, due to physical constraints, the CPU cannot check its invalidation queue the way it can its store buffer.
What does it mean for the CPU to be in "invalidation debt"? Unperformed invalidations leave its cache in a stale state, as it will keep reading data from invalidated cache lines under the impression that they're valid. Thus, we need read memory barriers. A read memory barrier ensures that the invalidation queue gets flushed, which in this case means the CPU has to stall until that is done, as further processing depends on data written by other CPUs becoming visible.
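And here is the matching reader's half of the earlier sketch, again with C11 atomics; the acquire fence plays the role of the read memory barrier described above.

```c
#include <stdatomic.h>

extern int payload;          /* the same illustrative variables as before */
extern atomic_int ready;

int consumer(void)
{
    /* Spin until the producer's flag becomes visible. */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;

    /* Read memory barrier: outstanding invalidations must be processed
     * before any load below, so a stale copy of `payload` cannot be used. */
    atomic_thread_fence(memory_order_acquire);

    return payload;          /* guaranteed to be 42, never a stale 0 */
}
```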
Write, read and general memory barriers
This will have undoubtedly placed some moderate pressure on your brain, so let's recap together. A write memory barrier ensures that every write preceding the barrier is effectively completed before any write succeeding it, from the perspective of some other thread (!). This is done by keeping the out-of-order writes in the store buffer, unmarked, while waiting for the invalidation confirmations of the ones we want to appear to have been written first. Once all the marked entries have their matching invalidation confirmations, the entire store buffer is flushed to the cache, ready for other threads to request access to it.
A read memory barrier ensures that writes done by other threads/CPUs are all visible to the invoking CPU/thread by making sure the invalidation queue is drained and the offending cache lines are actually invalidated, forcing subsequent loads to request up-to-date information after a cache miss.
General memory barriers combine both read and write memory barriers, which is a more complicated operation and not always required. Hence, on most architectures, these instructions are offered individually.
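To tie the recap together, here is the producer/consumer pair from the earlier sketches assembled into one runnable program, using POSIX threads and C11 atomics (compile with something like `cc -std=c11 -pthread barriers.c`). The names and structure are mine; it demonstrates the write/read barrier pairing, not any particular hardware's instructions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                  /* plain, non-atomic data */
static atomic_int ready;             /* flag guarding it       */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    atomic_thread_fence(memory_order_release);   /* write barrier, paired...  */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                        /* wait for the flag         */
    atomic_thread_fence(memory_order_acquire);   /* ...with this read barrier */
    printf("payload = %d\n", payload);           /* always prints 42          */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Without the two fences, a weakly ordered CPU (or the compiler) would be free to let the consumer observe ready == 1 while still reading a stale payload.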