CPU, Memory consistency in multicore processors

Memory barriers

x86 has 3 types of memory barriers : sfence (store), lfence (load), mfence (full barrier)
Memory barriers ensure loads and stores use the latest data from the cache. They enforce some sort of sequential consistency.

Cache coherence

L1 caches always have the most up-to-date data. In case of memory write-back the memory (and L3 cache) may have old data.

Cache coherence protocols ensure all caches sharing the same line have the same data.
2 types of coherence architectures : bus snooping (between the same processor cores) and directory-based (between NUMA sockets)
Bus snooping is faster but not scalable. It implies invalidation broadcasts => bus contention
Directory based coherence has more latency but scales better. It uses point-to-point messages between sockets.
Many types of messages flows are possible for directory based coherence. The optimal flow will reduce the contention on the home socket and have the shortest critical path (longest sequence of messages)
WARNING : the diagram illustrating directory based coherence is not optimal nor realistic. I just made it up to have an idea on the message flows

MESI protocol

Each cache line load/store follows a finite state machine. There are 4 main states M, E, S, I. Other states like O(owner) and F(forward) are optimisations to reduce bus contention and for NUMA sockets interconnect.

Invalid : cache line is not present
Exclusive : cache line is present only in 1 core cache and is clean (same contents in memory/L3 cache)
Modified : cache line is exclusive and dirty (write-back needed)
Shared : cache line is clean and present in at least 2 core caches

Memory barriers

Cache coherence

L1 caches always have the most up-to-date data. In case of memory write-back the memory (and L3 cache) may have old data.

MESI protocol

Links