# Notes on concurrency and memory consistency on x86
### Thread safety definition
Thread safe: an implementation is guaranteed to be free of race conditions when accessed by multiple threads simultaneously. A race condition is a situation where the system's behavior depends on the sequence or timing of events that it does not control.
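
A minimal C++ sketch of a race condition (illustrative names, not from any particular codebase): two threads increment a shared non-atomic counter, so the final value depends on how the increments interleave.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared, unsynchronized

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // read-modify-write; increments from the two threads can interleave and be lost
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Rarely 200000: this is a data race, so the program's behavior is undefined.
    std::cout << counter << '\n';
}
```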
### Memory ordering, barriers, acquire/release
#### Sequential consistency
Sequential consistency: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
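
In C++ this is the guarantee provided by `std::memory_order_seq_cst` (the default for `std::atomic` operations). A sketch using the classic store-buffering pattern (illustrative names): under sequential consistency at least one thread must observe the other's store, so `r1 == 0 && r2 == 0` is impossible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() { x.store(1); r1 = y.load(); }  // seq_cst by default
void t2() { y.store(1); r2 = x.load(); }

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // Some single interleaving of the four operations must explain the result,
    // so at least one load observes the other thread's store.
    assert(!(r1 == 0 && r2 == 0));
}
```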
#### Memory ordering
The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores) through the system bus to system memory.
#### Acquire/release
Acquire semantics prevent memory reordering of the read-acquire with any read or write operation that follows it in program order. Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.
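
A minimal message-passing sketch in C++ (identifiers are illustrative): the release store to `ready` keeps the write to `payload` above it, and the acquire load of `ready` keeps the read of `payload` below it, so the consumer is guaranteed to observe the published value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data, published via `ready`
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // must not sink below the release store
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // later reads must not hoist above this
        ;
    assert(payload == 42);  // the acquire load synchronizes with the release store
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```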

Acquire/release barrier sandwich: a critical section is sandwiched between an acquire operation at the top and a release operation at the bottom, so memory accesses inside the section cannot leak out past either boundary (accesses outside the section may still move in).
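
A sketch of the sandwich using a deliberately minimal (not production-grade) spinlock: the acquire in `lock()` and the release in `unlock()` bracket the critical section.

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: nothing from the critical section may float above this.
        while (locked.exchange(true, std::memory_order_acquire))
            ;
    }
    void unlock() {
        // Release: nothing from the critical section may sink below this.
        locked.store(false, std::memory_order_release);
    }
};

SpinLock lk;
int shared_data = 0;

void update() {
    lk.lock();      // ---- top slice of the sandwich (acquire)
    ++shared_data;  //      the access is trapped between the two barriers
    lk.unlock();    // ---- bottom slice of the sandwich (release)
}
```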

#### Release operations vs fences
A release **operation** only guarantees that preceding accesses are not reordered past the release operation itself. This means that stores _below_ the release operation can be reordered _above_ it (as long as they target a different memory location), and as a result can then be reordered with other accesses above the release operation.
A release _fence_, on the other hand, prevents preceding accesses from being reordered past _itself and all stores below the fence_, because it does not allow stores to cross it in either direction. This distinction is important to prevent unexpected stores from being reordered into a critical section.
An example from C++:
* `std::atomic_thread_fence(std::memory_order_release)` is a _release fence_
* `std::atomic<>::store(std::memory_order_release)` is a _release operation_
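
A hedged sketch of the difference (`data`, `flag`, and `other` are illustrative names): with the release operation, a relaxed store placed after it may still be hoisted above it; with the release fence, every later store is ordered after the accesses that precede the fence.

```cpp
#include <atomic>

int data = 0;
std::atomic<bool> flag{false};
std::atomic<int> other{0};

void with_release_operation() {
    data = 1;
    flag.store(true, std::memory_order_release);  // release operation: keeps `data = 1` above it
    other.store(2, std::memory_order_relaxed);    // may be hoisted above the release store,
                                                  // and from there reordered with `data = 1`
}

void with_release_fence() {
    data = 1;
    std::atomic_thread_fence(std::memory_order_release);  // release fence: `data = 1` stays above
    flag.store(true, std::memory_order_relaxed);           // every store below the fence
    other.store(2, std::memory_order_relaxed);             // cannot be reordered with `data = 1`
}
```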
### X86 memory model, bus locking, serializing instructions
#### Bus locking
The processor uses three interdependent mechanisms for carrying out locked atomic operations:
* Guaranteed atomic operations — e.g. reads and writes of a byte and of aligned 16-, 32-, and 64-bit values, plus (on P6 and later processors) unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
* Bus locking, using the LOCK# signal and the LOCK instruction prefix. Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. The XCHG instruction automatically assumes LOCK semantics.
* Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking”.
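
As a sketch of how this surfaces in ordinary code: a C++ `fetch_add` on an aligned `std::atomic<int>` typically compiles to a LOCK-prefixed instruction (e.g. `lock xadd`) on x86-64, so the read-modify-write is performed atomically, normally via cache locking rather than asserting LOCK# on the bus when the value is in write-back cacheable memory and fits within a single cache line.

```cpp
#include <atomic>

std::atomic<int> counter{0};

int next_ticket() {
    // Typically emitted as `lock xadd` on x86-64: an atomic read-modify-write.
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```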
#### Serializing instructions
Serializing instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed.
When the processor serializes instruction execution, it ensures that all pending memory transactions are completed (including writes stored in its store buffer) before it executes the next instruction.
Nothing can pass a serializing instruction and a serializing instruction cannot pass any other instruction (read, write, instruction fetch, or I/O).
It is important to note that executing serializing instructions on P6 and more recent processor families constrains speculative execution, because the results of speculatively executed instructions are discarded (a usage sketch follows the list of serializing instructions below).
* Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR (except to non-serializing MSRs).
* Non-privileged serializing instructions — CPUID, IRET, and RSM.
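
A common use of a non-privileged serializing instruction is fencing off a timing measurement: executing CPUID before RDTSC ensures that all earlier instructions have completed (and speculative results are discarded) before the time-stamp counter is read. A minimal sketch assuming GCC/Clang inline assembly on x86-64:

```cpp
#include <cstdint>

// Read the time-stamp counter after serializing the instruction stream with CPUID.
static inline uint64_t serialized_rdtsc() {
    unsigned int eax = 0, ebx, ecx, edx;
    // CPUID (non-privileged, serializing): waits for all prior instructions to
    // complete and for buffered writes to drain before the next instruction executes.
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     :
                     : "memory");
    unsigned int lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```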
The following instructions are memory-ordering instructions, not serializing instructions. These drain the data memory subsystem. They do not serialize the instruction execution stream:
* Non-privileged memory-ordering instructions — SFENCE, LFENCE, and MFENCE. LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
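
These fences are exposed to C/C++ code through the SSE/SSE2 intrinsics; a short sketch:

```cpp
#include <emmintrin.h>  // _mm_lfence, _mm_mfence (SSE2)
#include <xmmintrin.h>  // _mm_sfence (SSE)

void fences_demo() {
    _mm_sfence();  // SFENCE: earlier stores become globally visible before later stores
    _mm_lfence();  // LFENCE: later instructions do not begin until prior instructions complete locally
    _mm_mfence();  // MFENCE: orders all earlier loads and stores before all later loads and stores
}
```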
The following additional information is worth noting regarding serializing instructions:
* The processor does not write back the contents of modified data in its data cache to external memory when it serializes instruction execution. Software can force modified data to be written back by executing the WBINVD instruction, which is a serializing instruction.
* When an instruction is executed that enables or disables paging (that is, changes the PG flag in control register CR0), the instruction should be followed by a jump instruction. The target instruction of the jump instruction is fetched with the new setting of the PG flag (that is, paging is enabled or disabled), but the jump instruction itself is fetched with the previous setting. The Pentium 4, Intel Xeon, and P6 family processors do not require the jump operation following the move to register CR0 (because any use of the MOV instruction in a Pentium 4, Intel Xeon, or P6 family processor to write to CR0 is completely serializing). However, to maintain backwards and forward compatibility with code written to run on other IA-32 processors, it is recommended that the jump operation be performed.
* Whenever an instruction is executed to change the contents of CR3 while paging is enabled, the next instruction is fetched using the translation tables that correspond to the new value of CR3. Therefore the next instruction and the sequentially following instructions should have a mapping based upon the new value of CR3.
* The Pentium processor and more recent processor families use branch-prediction techniques to improve performance by prefetching the destination of a branch instruction before the branch instruction is executed. Consequently, instruction execution is not deterministically serialized when a branch instruction is executed.
#### Memory ordering
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, and P6 family processors use a processor-ordered memory-ordering model that can be further defined as “write ordered with store-buffer forwarding”.
In a single-processor system, the following ordering principles apply:
* Reads are not reordered with other reads.
* Writes are not reordered with older reads.
* Writes to memory are not reordered with other writes, with the following exceptions:
  * streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD);
  * string operations.
* No write to memory may be reordered with an execution of the CLFLUSH instruction; a write may be reordered with an execution of the CLFLUSHOPT instruction that flushes a cache line other than the one being written.
* **Reads may be reordered with older writes to different locations but not with older writes to the same location.** This rule is illustrated by the store-buffering litmus test after the multiprocessor list below.
* Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
* Reads cannot pass earlier LFENCE and MFENCE instructions.
* Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
* LFENCE instructions cannot pass earlier reads.
* SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.
* MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.
In a multiple-processor system, the following ordering principles apply:
* Individual processors use the same ordering principles as in a single-processor system.
* Writes by a single processor are observed in the same order by all processors.
* Writes from an individual processor are NOT ordered with respect to the writes from other processors.
* Memory ordering obeys causality (memory ordering respects transitive visibility).
* Any two stores are seen in a consistent order by processors other than those performing the stores.
* Locked instructions have a total order.
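
The bolded single-processor rule above is what makes the classic store-buffering litmus test interesting on x86: each processor's read of the other's flag may pass its own older store, so without a fence both loads can return 0. A sketch with illustrative names, using a seq_cst fence (emitted as MFENCE or an equivalent locked instruction on x86) to forbid that outcome:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void p0() {
    X.store(1, std::memory_order_relaxed);                 // plain MOV store on x86
    std::atomic_thread_fence(std::memory_order_seq_cst);   // full fence between store and load
    r1 = Y.load(std::memory_order_relaxed);
}

void p1() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(p0), b(p1);
    a.join(); b.join();
    // With the fences, r1 == 0 && r2 == 0 is forbidden. Remove them and x86 may
    // reorder each load with the older store to the other location (the store
    // sits in the store buffer), making the 0/0 outcome observable.
}
```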
Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetches and page table walks can pass I/O instructions. Execution of subsequent instructions does not begin until the processor determines that the I/O instruction has completed.
Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory.
As with I/O and locking instructions, the processor waits until all previous instructions have completed and all buffered writes have drained to memory before executing a serializing instruction.