---
tags: Linux Kernel, CPU, UMA, NUMA, cache coherence, MESI, memory barrier, store buffer, invalidate queue
---

# Introduction: Memory Access, Cache Coherence Protocols and Memory Barriers

contributed by < `kevinshieh0225` >

> [3/31 one-on-one discussion](https://docs.google.com/document/d/1LF8-ZEC9O7QGyexqWcEMNDW23ciJfdhHMziFg1xMMh0/edit#)

These reading notes survey memory-access issues on multi-core systems: what `UMA` and `NUMA` are, the cache coherence protocols that keep multi-core caches consistent, a brief look at `MESI`, and finally two hardware structures, the store buffer and the invalidate queue. These structures speed up the instruction stream, but as a side effect they break the ordering guarantees of cache coherence, so software must issue memory barrier instructions to enforce the intended happens-before relations.

## "Why can't we raise the CPU clock rate forever?"

The [clock rate](https://zh.wikipedia.org/wiki/%E6%97%B6%E9%92%9F%E9%A2%91%E7%8E%87) is the base frequency at which a synchronous circuit runs its cycles, measured in cycles per second; it gauges how much work a CPU performs per unit of time.

According to [Moore's law](https://zh.wikipedia.org/wiki/%E6%91%A9%E5%B0%94%E5%AE%9A%E5%BE%8B), "the number of transistors on an integrated circuit doubles about every two years," and CPU clock rates long rode that hardware curve. Looking back at PC CPUs: the 1974 Altair 8800 used an Intel 8080 clocked at 2 MHz (2 million cycles per second); the first IBM PC ran at 4.77 MHz; in 1995 Intel's Pentium reached 100 MHz; and by 2002 the fastest CPU, the Intel Pentium 4, hit 3 GHz.

After about 2005, however, CPU clock rates stopped climbing.

![](https://www.researchgate.net/publication/321233071/figure/fig1/AS:563692805001216@1511406245686/CPU-transistor-densities-clock-speeds-power-and-performance-from-1970-2015-Courtesy-of.png)

Has CPU performance hit a hard technological wall? Not exactly. Clock rate alone is a poor measure of CPU performance (the [Megahertz myth](https://en.wikipedia.org/wiki/Megahertz_myth)); the underlying bottlenecks mean performance must now be improved on several fronts:

1. **Power and heat**: packing in more transistors raises power draw and temperature, which in turn degrades transistor speed and reliability. Raising the clock rate also demands higher voltage, which increases power consumption sharply. Getting the heat out of the chip is a major problem in its own right.
2. **Transistor limits**: after each clock pulse the CPU needs time for the new state (0→1, 1→0) to settle; if the next pulse arrives too soon, the result may be wrong. Switching transistor state also burns extra power, which brings us back to issue 1.
3. **Memory access limits**: even with an ever-faster clock, memory operations become the bottleneck. As system architectures grow more complex, keeping memory accesses correct and consistent across many cores, attacking the multiprocessor bottleneck from the system-architecture and computer-organization side, has become the main lever for performance, so raising system performance is no longer just a chase for clock speed.

## Memory Access and cache coherence

### UMA vs NUMA

`UMA` stands for Uniform Memory Access; its counterpart `NUMA` stands for Non-Uniform Memory Access. Both describe memory architectures for multiprocessor systems.

First, uniform memory access:

![](https://www.researchgate.net/profile/Youssef-Essa/publication/270586633/figure/fig1/AS:668942175584271@1536499651952/Multiprocessor-Architecture-14.ppm)

As systems grow more complex, it becomes hard to keep memory access latency uniform across cores. In the classic Symmetric Multi-Processor (`SMP`) design, all processors reach memory over a single shared `system bus`, so as the core count grows, the cores contend for the same memory bandwidth. Some high-throughput systems tried introducing a high-speed data bus, but that fix is expensive and scales poorly.

`NUMA` simplifies the `system bus` by splitting the system into nodes (`node`): each processor and each region of memory belongs to a node. A processor accesses memory in its own node quickly, while accesses to another node's memory must travel across the inter-node interconnect and take longer.

![](https://i.imgur.com/V10fhPm.png)

Under multi-core workloads, both `UMA` and `NUMA` architectures face the cache coherence problem.

### cache coherence

![](https://upload.wikimedia.org/wikipedia/commons/a/a1/Cache_Coherency_Generic.png)

In a multiprocessor system, the same shared-memory location may be loaded into the local caches of several nodes at once. When the value changes, any node whose local cache is not updated in time may keep feeding its CPU the stale value and compute the wrong result. The system must therefore ensure that all caches currently holding copies of a location see the same value.

> The following are the requirements for [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence):
>> **Write Propagation**
>> Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
>
>> **Transaction Serialization**
>> Reads/Writes to a single memory location must be seen by all processors in the same order.
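To make the two requirements concrete, here is a minimal sketch in C, a hypothetical illustration written for these notes rather than code from the cited sources. The consumer's loop relies on write propagation (the store to `ready` must eventually reach the other CPU's cache), and the assertion relies on all CPUs observing the stores in a consistent order, exactly the property that the store-buffer and invalidate-queue sections below revisit:

```c
/* Minimal sketch (hypothetical): build with `gcc demo.c -O2 -pthread`.
 * volatile only stops the compiler from caching the values in
 * registers; it gives no ordering or propagation guarantee at the
 * hardware level. */
#include <assert.h>
#include <pthread.h>

static volatile int data = 0;
static volatile int ready = 0;

static void *producer(void *arg)
{
    data = 42;  /* must propagate to the consumer's cache ...           */
    ready = 1;  /* ... and must not be observed before the store above  */
    return NULL;
}

static void *consumer(void *arg)
{
    while (ready == 0)
        continue;           /* relies on write propagation              */
    assert(data == 42);     /* relies on the stores being seen in order */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

In C terms this is a data race; the rest of these notes explain what the hardware does (and fails to do) for code of this shape.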
This raises a problem: if, to keep data consistent across cores, every relevant CPU had to stop executing after each write, waiting for the cache to be flushed to memory and then for the other nodes' caches to re-read the updated data, the cost would be enormous.

Many different protocols and techniques address the cache coherence problem.

### cache coherence protocols

> The schemes can be classified based on:
>
> * Snoopy scheme vs Directory scheme and vs Shared caches
> * [Write through vs Write-back](https://www.geeksforgeeks.org/write-through-and-write-back-in-cache/) (ownership-based) protocol
> * Update vs Invalidation protocol
> * Intervention vs not Intervention
> * Dirty-sharing vs not-dirty-sharing protocol (MOESI vs MESI)

Here we look at snoopy vs directory-based schemes:

#### bus snooping based

When a shared-memory location is recorded in cache lines of several caches at once, the bus carries the state of the data, so every cache can snoop it and update its own state and data, ensuring that all caches read and write a single authoritative value for that location.

Snooping protocols come in two flavors:

1. **Write-invalidate**:
   When one processor modifies the shared data, the copies in all other caches are marked invalid, signalling that those cache lines are now stale (touching them again causes a coherence miss). Depending on the cache states and the protocol, the data is then written through or written back to maintain cache coherence.
   The lecture [並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency-atomics#背後作祟的-cache) discusses the difficulties cache coherence creates at the level of atomic operations and details how the [`MESI` protocol](https://en.wikipedia.org/wiki/MESI_protocol) is implemented in the hardware architecture. `MESI` and `MOESI` are among the most widely used protocols; the same family also includes the `MSI`, `MOSI`, and `MESIF` protocols, among others.
2. **Write-update**:
   When one processor modifies the shared data, the new value is broadcast to all other caches so they can update their copies. This generates more bus traffic than write-invalidate, so it is not mainstream; examples include the `Dragon` and `Firefly` protocols.

Snooping-based schemes are simpler and faster than directory-based ones, but bus traffic limits their scalability, so they are mainly used in `UMA` systems. For the `ccNUMA` case, Intel proposed the [`MESIF`](https://www.cs.auckland.ac.nz/~goodman/TechnicalReports/MESIF-2009.pdf) protocol; `ccNUMA` systems that need to scale further rely on directory-based schemes.

#### Directory-based

[Directory-based](https://en.wikipedia.org/wiki/Directory-based_coherence) coherence does not keep nodes in sync over the system bus. Instead, a directory tracks each cache block: its current state, and which nodes are currently sharing it. Rather than broadcasting signals to every node over the system bus, messages go only to the nodes that care about the block, reducing the memory-bandwidth cost.

### MESI protocol

Finally, let's look at the MESI protocol to see cache coherence in practice.

MESI is an invalidate-based protocol that uses write-back: rather than writing through to memory on every modification, the cache writes data back only when snoopers request it, which saves the cost of frequent memory writes.

#### state

Each cache line is tagged with one of four states:

:::info
**Modified (M)**
> The cache line is present only in the current cache, and is dirty - it has been modified (M state) from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Shared state(S).

**Exclusive (E)**
> The cache line is present only in the current cache, but is clean - it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.

**Shared (S)**
> Indicates that this cache line may be stored in other caches of the machine and is clean - it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.

**Invalid (I)**
> Indicates that this cache line is invalid (unused).
:::

![](https://miro.medium.com/max/752/1*kInaLCUTD_0fzLdV4T-r8A.png)

Note that when one copy of a cache line is marked `M` or `E`, every other in-use copy of that line must be marked `I`.

#### state transition

A cache line can be described by a finite state machine (see the [FSM model](https://en.wikipedia.org/wiki/MESI_protocol#Operation)); briefly:

Two kinds of stimuli drive a cache line's actions and state changes: read and write requests from the cache's own processor, and requests relayed by the bus snooper when another processor wants to operate on the same cache line.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Diagrama_MESI.GIF/330px-Diagrama_MESI.GIF)

:::info
**Processor Requests**
> 1. **PrRd**: The processor requests to read a Cache block.
> 2. **PrWr**: The processor requests to write a Cache block

**Bus side requests**
> 1. **BusRd**: Snooped request that indicates there is a read request to a Cache block requested by another processor
> 2. **BusRdX**: Snooped request that indicates there is a write request to a Cache block requested by another processor that doesn't already have the block.
> 3. **BusUpgr**: Snooped request that indicates that there is a write request to a Cache block requested by another processor but that processor already has that Cache block residing in its own Cache.
> 4. **Flush**: Snooped request that indicates that an entire cache block is written back to the main memory by another processor.
> 5. **FlushOpt**: Snooped request that indicates that an entire cache block is posted on the bus in order to supply it to another processor(Cache to Cache transfers).
:::

## Memory Barriers: a Hardware View for Software Hackers

> [Memory Barriers: a Hardware View for Software Hackers (McKenney, P.E., 2009)](http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf)
> [從硬體觀點了解 memory barrier 的實作和效果](https://medium.com/fcamels-notes/%E5%BE%9E%E7%A1%AC%E9%AB%94%E8%A7%80%E9%BB%9E%E4%BA%86%E8%A7%A3-memry-barrier-%E7%9A%84%E5%AF%A6%E4%BD%9C%E5%92%8C%E6%95%88%E6%9E%9C-416ff0a64fc1)

This chapter summarizes the paper: it explains the store buffer and the invalidate queue, how these hardware structures speed up the instruction stream at the cost of breaking cache-coherence ordering, and why software must issue memory barrier instructions to enforce the intended happens-before relations.

### Cache Coherence Write Stall

A cache coherence protocol keeps the caches of multiple processors consistent, but the price is that (first-time) writes become slow and leave the CPU idle.

Consider `MESI` handling a `PrWr`:

<p align="center">
<img src="https://miro.medium.com/max/1400/1*95gDq3Q2OgTf6Mc43lxw1g.png" width="600">
</p>

Before CPU 0 can write a cache line for the first time, it must send an Invalidate so that the other processor (CPU 1) marks its copy of the line `I`. Only after CPU 1 receives the message and replies with an acknowledgement, confirming the line is marked, can CPU 0 proceed with the write.

During that round trip CPU 0 is forced to [stall](https://en.wikipedia.org/wiki/Pipeline_stall), and the reply can be further delayed if CPU 1's cache is busy (say, streaming a lot of reads and writes, or absorbing a burst of invalidate acks in a short time). Yet the stall is pointless: in the end CPU 0 is going to overwrite the cache line's value anyway.

### Store Buffer and Invalidate Queue

Two additions shorten CPU 0's idle time:

- **CPU 0 does not wait for the invalidate ack**: it writes into its store buffer and moves on; when the invalidate ack arrives later, it updates the cache state. Since the newest data may live in the store buffer, the CPU's read order becomes `store buffer → cache → memory`.
- **CPU 1 acks invalidates immediately**: on receiving an invalidate, it records the request in its invalidate queue, replies with the invalidate ack right away, and processes the invalidation later.

<p align="center">
<img src="https://miro.medium.com/max/1400/1*Nf9c79LUP5rpa72kqNvHdw.png" width="700">
</p>

The store buffer and the invalidate queue eliminate the unnecessary stalls, but they introduce a side effect: they change [memory ordering](https://en.wikipedia.org/wiki/Memory_ordering), so the caches no longer appear coherent to software.

So that other CPUs can still read the latest data when they need it, write/read memory barriers were designed against the side effects of the store buffer and the invalidate queue. Software can then use a [memory barrier](https://en.wikipedia.org/wiki/Memory_barrier) where it matters to guarantee that critical data is updated in the right order (with no guarantee about when), while the CPU still avoids idling most of the time.
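The read order `store buffer → cache → memory` can be pictured with a toy model in C. This is purely illustrative, not real hardware or kernel code; `cache_lookup` and `memory_read` are hypothetical stand-ins stubbed out here so the sketch compiles:

```c
/* Toy model of the read path once a store buffer exists: a load must
 * see this CPU's own pending stores, so it checks the store buffer
 * first, then the cache, then main memory. */
#include <stdbool.h>
#include <stdint.h>

#define SB_SLOTS 8

struct sb_entry {
    bool valid;
    uintptr_t addr;
    uint64_t val;
};

static struct sb_entry store_buffer[SB_SLOTS];

/* Hypothetical stand-ins for the cache hierarchy (stubs). */
static uint64_t fake_memory[256];

static bool cache_lookup(uintptr_t addr, uint64_t *val)
{
    (void)addr; (void)val;
    return false;                         /* pretend every lookup misses */
}

static uint64_t memory_read(uintptr_t addr)
{
    return fake_memory[addr % 256];
}

uint64_t cpu_load(uintptr_t addr)
{
    for (int i = 0; i < SB_SLOTS; i++)    /* 1. own store buffer */
        if (store_buffer[i].valid && store_buffer[i].addr == addr)
            return store_buffer[i].val;

    uint64_t val;
    if (cache_lookup(addr, &val))         /* 2. local cache */
        return val;

    return memory_read(addr);             /* 3. main memory */
}
```

Note that only the CPU that owns the store buffer gets this "see my own latest write" guarantee; other CPUs do not, which is exactly the reordering problem explored next.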
## Dive deeper into write/read memory barrier

### Memory Barrier

A [memory barrier](https://en.wikipedia.org/wiki/Memory_barrier) (membar, memory fence, fence instruction) is a kind of barrier instruction. To improve performance, modern processors and compilers perform [out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution); out-of-order execution preserves the results of a single thread of execution, but it cannot guarantee consistent results across multiple threads.

To establish the intended [happens-before](https://en.wikipedia.org/wiki/Happened-before) relations, software inserts memory barrier instructions so that the CPU or the compiler performs the memory operations on either side of the barrier in a defined order.

Different use cases call for different kinds of memory barrier:

[Compiler Barrier](https://hackmd.io/@Masamaloka/linux-kernel-memory-barrier-document/%2FF1NsdIC9Sh6koXzYYNGY5Q#Compiler-Barrier): constrains the compiler so that memory accesses before and after the barrier are not reordered across it.

> [linux/tools/include/linux/compiler.h](https://github.com/torvalds/linux/blob/7b58b82b86c8b65a2b57a4c6cb96a460654f9e09/tools/include/linux/compiler.h)

```c
#define barrier() __asm__ __volatile__("" : : : "memory")

#define READ_ONCE(x)						\
({								\
	union { typeof(x) __val; char __c[1]; } __u =		\
		{ .__c = { 0 } };				\
	__read_once_size(&(x), __u.__c, sizeof(x));		\
	__u.__val;						\
})

#define WRITE_ONCE(x, val)					\
({								\
	union { typeof(x) __val; char __c[1]; } __u =		\
		{ .__val = (val) };				\
	__write_once_size(&(x), __u.__c, sizeof(x));		\
	__u.__val;						\
})
```

[CPU Memory Barrier](https://hackmd.io/@Masamaloka/linux-kernel-memory-barrier-document/%2FF1NsdIC9Sh6koXzYYNGY5Q#CPU-Memory-Barriers): a compiler barrier preserves data dependencies, but it cannot ensure that the values the CPU reads and writes are up to date, for instance when the store buffer or the invalidate queue has perturbed cache coherence. A CPU memory barrier is needed to restore correctness there.

> [linux/include/asm-generic/barrier.h](https://github.com/torvalds/linux/blob/0809edbae347a224ca1b59fb8be1c2d54389c2c6/include/asm-generic/barrier.h)

```c
#ifdef __mb
#define mb()	do { kcsan_mb(); __mb(); } while (0)
#endif

#ifdef __rmb
#define rmb()	do { kcsan_rmb(); __rmb(); } while (0)
#endif

#ifdef __wmb
#define wmb()	do { kcsan_wmb(); __wmb(); } while (0)
#endif
```

See [從 Linux Kernel 閱讀 Memory Barrier](https://hackmd.io/@Masamaloka/linux-kernel-memory-barrier) for a detailed discussion.

### Store Buffers Reordering

First consider the problem the store buffer creates:

```c
int a = 0, b = 0;

void foo(void)
{
	a = 1;
	b = 1;
}

void bar(void)
{
	while (b == 0)
		continue;
	assert(a == 1);
}
```

Suppose CPU 0 runs `foo()` while CPU 1 runs `bar()`, starting from the following cache-line state:

```
      | a     b
------|------------
CPU 0 |     | 0   |
stbff |     |     |
cache |     |   E |
CPU 1 | 0   |     |
stbff |     |     |
cache |   E |     |
```

1. CPU 0 executes `a = 1`. The cache line is not in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.
2. CPU 1 executes `while (b == 0) continue;`, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.
3. CPU 0 executes `b = 1`. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

```
CPU 0 foo() a = 1
CPU 1 bar() while(b == 0)
CPU 0 foo() b = 1
      | a     b
------|------------
CPU 0 | 1   | 1   |     -> a "read invalidate"
stbff | 1   |     |
cache |     | 1 M |
CPU 1 | 0   |     |     -> b "read"
stbff |     |     |
cache | 0 E |     |
```

4. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.
5. CPU 1 receives the cache line containing "b" and installs it in its cache.

```
CPU 0 receives "read" message
CPU 1 receives b "updated value"
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff | 1   |     |
cache |     | 1 S |     -> b "updated value"
CPU 1 | 0   | 1   |
stbff |     |     |
cache | 0 E | 1 S |
```

6. CPU 1 can now finish executing `while (b == 0) continue;`, and since it finds that the value of "b" is 1, it proceeds to the next statement.
7. CPU 1 executes the `assert(a == 1)`, and, since CPU 1 is working with the old value of "a", this assertion fails.
8. CPU 1 receives the "read invalidate" message, and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its own cache. But it is too late.
9. CPU 0 receives the cache line containing "a" and applies the buffered store just in time to fall victim to CPU 1's failed assertion.
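The same `foo()`/`bar()` pattern can be written in portable C11 atomics for comparison (a sketch, not code from the paper). With `memory_order_relaxed` nothing orders the two stores, so on weakly ordered hardware (e.g. ARM or POWER) the assertion may fail exactly as traced above:

```c
/* C11 sketch of the store-buffer failure mode: relaxed atomics permit
 * the stores in foo() to be observed out of program order. */
#include <assert.h>
#include <stdatomic.h>

atomic_int a = 0, b = 0;

void foo(void)      /* runs on CPU 0 */
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void bar(void)      /* runs on CPU 1 */
{
    while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
        continue;
    /* May fail on weakly ordered hardware: nothing orders the two
     * stores in foo(), which is what the store buffer exploits. */
    assert(atomic_load_explicit(&a, memory_order_relaxed) == 1);
}
```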
### Store Buffers with Write Memory Barrier

Here we run into a hardware limitation: CPUs cannot know how variables relate to one another, so software must declare the relationship with `smp_wmb()`:

```c
int a = 0, b = 0;

void foo(void)
{
	a = 1;
	smp_wmb();
	b = 1;
}

void bar(void)
{
	while (b == 0)
		continue;
	assert(a == 1);
}
```

1. CPU 0 executes `a = 1`. The cache line is not in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.
2. CPU 1 executes `while (b == 0) continue;`, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

```
CPU 0 foo() a = 1
CPU 1 bar() while(b == 0)
      | a     b
------|------------
CPU 0 | 1   | 0   |     -> a "read invalidate"
stbff | 1   |     |
cache |     | 0 E |
CPU 1 | 0   |     |     -> b "read"
stbff |     |     |
cache | 0 E |     |
```

3. CPU 0 executes `smp_wmb()`, and marks all current store-buffer entries (namely, the `a = 1`).

```
CPU 0 foo() smp_wmb()
      | a     b
------|------------
CPU 0 | 1   | 0   |
stbff | 1m  |     |
cache |     | 0 E |
CPU 1 | 0   |     |
stbff |     |     |
cache | 0 E |     |
```

4. CPU 0 executes `b = 1`. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), but there is a marked entry in the store buffer. Therefore, rather than store the new value of "b" in the cache line, it instead places it in the store buffer (but in an unmarked entry).
5. CPU 0 receives the "read" message, and transmits the cache line containing the original value of "b" to CPU 1. It also marks its own copy of this cache line as "shared".
6. CPU 1 receives the cache line containing "b" and installs it in its cache.
7. CPU 1 can now finish executing `while (b == 0) continue;`, but since it finds that the value of "b" is still 0, it repeats the while statement. The new value of "b" is safely hidden in CPU 0's store buffer.

```
CPU 0 foo() b = 1
CPU 0 receives "read" message
CPU 1 receives b "original value 0"
CPU 1 bar() while(b == 0)
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff | 1m  | 1u  |
cache |     | 0 S |     -> b "original value 0"
CPU 1 | 0   | 0   |
stbff |     |     |
cache | 0 E | 0 S |
```

8. CPU 1 receives the "read invalidate" message, and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its own cache.
9. CPU 0 receives the cache line containing "a" and applies the buffered store.
10. Since the store to "a" was the only entry in the store buffer that was marked by the `smp_wmb()`, CPU 0 can also store the new value of "b" — except for the fact that the cache line containing "b" is now in "shared" state.
11. CPU 0 therefore sends an "invalidate" message to CPU 1.
12. CPU 1 receives the "invalidate" message, invalidates the cache line containing "b" from its cache, and sends an "acknowledgement" message to CPU 0.

```
CPU 1 receives a "read invalidate"
CPU 0 receives "invalidate ack" and cache line, stores "a"
CPU 0 sends b "invalidate"
CPU 1 receives b "invalidate"
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff |     | 1u  |
cache | 1 M | 0 S |
CPU 1 | 0   | 0   |
stbff |     |     |
cache | I   | I   |
```

13. CPU 1 executes `while (b == 0) continue;`, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message to CPU 0.
14. CPU 0 receives the "acknowledgement" message, and puts the cache line containing "b" into the "exclusive" state. CPU 0 now stores the new value of "b" into the cache line.

```
CPU 1 sends b "read"
CPU 0 receives "acknowledgement"
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff |     |     |
cache | 1 M | 1 E |
CPU 1 | 0   | 0   |
stbff |     |     |
cache | I   | I   |
```

15. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1. It also marks its own copy of this cache line as "shared".
16. CPU 1 receives the cache line containing "b" and installs it in its cache.
17. CPU 1 can now finish executing `while (b == 0) continue;`, and since it finds that the value of "b" is 1, it proceeds to the next statement.
18. CPU 1 executes the `assert(a == 1)`, but the cache line containing "a" is no longer in its cache. Once it gets this cache line from CPU 0, it will be working with the up-to-date value of "a", and the assertion therefore passes.

```
CPU 0 receives b "read"
CPU 1 receives b "updated value 1"
CPU 1 sends a "read"
...
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff |     |     |
cache | 1 S | 1 S |
CPU 1 | 1   | 1   |
stbff |     |     |
cache | 1 S | 1 S |
```

This intricate sequence shows that inserting a single barrier instruction is simple for the programmer, but what the hardware does underneath is anything but.
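`smp_wmb()` is a Linux-kernel primitive; in portable C11 the closest counterpart for the write side is, roughly, a release fence (a sketch under that assumption, not the kernel's definition):

```c
/* C11 sketch: a release fence plays roughly the role of smp_wmb(),
 * keeping the store to `a` ordered before the store to `b`. */
#include <stdatomic.h>

atomic_int a, b;    /* shared with bar() on another CPU */

void foo(void)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* ~ smp_wmb() */
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}
```

On its own this only fixes the write side; as the next section shows, the read side has a symmetric problem.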
### Invalidate Queues Reordering

Now consider the invalidate queue, and the problem it causes in turn:

```c
int a = 0, b = 0;

void foo(void)
{
	a = 1;
	smp_wmb();
	b = 1;
}

void bar(void)
{
	while (b == 0)
		continue;
	assert(a == 1);
}
```

1. CPU 0 executes `a = 1`. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.
2. CPU 1 executes `while (b == 0) continue;`, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.
3. CPU 0 executes `b = 1`. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.
4. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

```
CPU 0 foo() a = 1
CPU 1 bar() while(b == 0)
CPU 0 foo() b = 1
CPU 0 receives b "read"
      | a     b
------|------------
CPU 0 | 1   | 1   |     -> a "invalidate"
stbff | 1   |     |
cache |     | 1 S |     -> b "updated value 1"
Iqueue|     |     |
CPU 1 | 0   |     |     -> b "read"
stbff |     |     |
cache | 0 E |     |
Iqueue|     |     |
```

5. CPU 1 receives the "invalidate" message for "a", places it into its invalidate queue, and transmits an "invalidate acknowledge" message to CPU 0. Note that the old value of "a" still remains in CPU 1's cache.
6. CPU 1 receives the cache line containing "b" and installs it in its cache.
7. CPU 1 can now finish executing `while (b == 0) continue;`, and since it finds that the value of "b" is 1, it proceeds to the next statement.

```
CPU 1 receives a "invalidate"
CPU 1 receives b "updated value 1"
CPU 1 bar() while(b == 0)
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff | 1   |     |
cache |     | 1 S |
Iqueue|     |     |
CPU 1 | 0   | 1   |
stbff |     |     |
cache | 0 E | 1 S |
Iqueue| I   |     |
```

8. CPU 1 executes the `assert(a == 1)`, and, since the old value of "a" is still in CPU 1's cache, this assertion fails.
9. CPU 1 processes the queued "invalidate" message, and invalidates the cache line containing "a" from its own cache. But it is too late.
10. CPU 0 receives the "invalidate acknowledge" message for "a" from CPU 1, and therefore applies the buffered store just in time to fall victim to CPU 1's failed assertion.
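The shortcut behind this failure is easy to picture with a toy model in C (illustrative only; `cache_drop_line` and `bus_send_invalidate_ack` are hypothetical stubs, and overflow handling is omitted): the CPU acknowledges at once while the stale line stays readable until the queue is drained, which is precisely steps 5 through 9 above.

```c
/* Toy model of the invalidate-queue shortcut. */
#include <stdint.h>

#define IQ_SLOTS 8

static uintptr_t invalidate_queue[IQ_SLOTS];
static unsigned iq_head, iq_tail;

/* Hypothetical stand-ins for the cache and the bus (stubs). */
static void cache_drop_line(uintptr_t addr) { (void)addr; }
static void bus_send_invalidate_ack(uintptr_t addr) { (void)addr; }

/* On receiving "invalidate": record it and ack immediately.  The
 * stale cache line remains readable until the queue is drained,
 * which is why CPU 1 keeps seeing the old value of "a" above. */
void on_bus_invalidate(uintptr_t addr)
{
    invalidate_queue[iq_tail++ % IQ_SLOTS] = addr;
    bus_send_invalidate_ack(addr);
}

/* What smp_rmb() must force before any later load can proceed:
 * apply every queued invalidation. */
void drain_invalidate_queue(void)
{
    while (iq_head != iq_tail)
        cache_drop_line(invalidate_queue[iq_head++ % IQ_SLOTS]);
}
```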
### Invalidate Queues with Read Memory Barrier

This, too, is something the hardware cannot resolve on its own, so software adds `smp_rmb()` to declare the relationship between the variables:

```c
int a = 0, b = 0;

void foo(void)
{
	a = 1;
	smp_wmb();
	b = 1;
}

void bar(void)
{
	while (b == 0)
		continue;
	smp_rmb();
	assert(a == 1);
}
```

1. CPU 0 executes `a = 1`. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.
2. CPU 1 executes `while (b == 0) continue;`, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.
3. CPU 0 executes `b = 1`. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.
4. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

```
CPU 0 foo() a = 1
CPU 1 bar() while(b == 0)
CPU 0 foo() b = 1
CPU 0 receives b "read"
      | a     b
------|------------
CPU 0 | 1   | 1   |     -> a "invalidate"
stbff | 1   |     |
cache |     | 1 S |     -> b "updated value 1"
Iqueue|     |     |
CPU 1 | 0   |     |     -> b "read"
stbff |     |     |
cache | 0 E |     |
Iqueue|     |     |
```

5. CPU 1 receives the "invalidate" message for "a", places it into its invalidate queue, and transmits an "invalidate acknowledge" message to CPU 0. Note that the old value of "a" still remains in CPU 1's cache.
6. CPU 1 receives the cache line containing "b" and installs it in its cache.
7. CPU 1 can now finish executing `while (b == 0) continue;`, and since it finds that the value of "b" is 1, it proceeds to the next statement.

```
CPU 1 receives a "invalidate"
CPU 1 receives b "updated value 1"
CPU 1 bar() while(b == 0)
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff | 1   |     |
cache |     | 1 S |
Iqueue|     |     |
CPU 1 | 0   | 1   |
stbff |     |     |
cache | 0 E | 1 S |
Iqueue| I   |     |
```

8. CPU 1 executes the `smp_rmb()`, marking the entry in its invalidate queue.
9. CPU 1 executes the `assert(a == 1)`, but since there is a marked entry for the cache line containing "a" in the invalidate queue, CPU 1 must stall this load until that entry in the invalidate queue has been applied.
10. CPU 1 processes the "invalidate" message, removing the cache line containing "a" from its cache.

```
CPU 1 bar() smp_rmb()
CPU 1 bar() assert(a == 1)
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff | 1   |     |
cache |     | 1 S |
Iqueue|     |     |
CPU 1 | 0   | 1   |
stbff |     |     |
cache | I   | 1 S |
Iqueue| Im  |     |
```

11. CPU 1 is now free to load the value of "a", but since this results in a cache miss, it must send a "read" message to fetch the corresponding cache line.
12. CPU 0 receives the "invalidate acknowledge" message for "a" from CPU 1, and therefore applies the buffered store, changing the MESI state of the corresponding cache line to "modified".
13. CPU 0 receives the "read" message for "a" from CPU 1, and therefore changes the state of the corresponding cache line to "shared", and transmits the cache line to CPU 1.
14. CPU 1 receives the cache line containing "a", and can therefore do the load. Since this load returns the updated value of "a", the assertion passes.

```
CPU 1 sends a "read"
CPU 0 receives a "invalidate acknowledge"
CPU 0 receives a "read"
CPU 1 receives a "updated value 1"
      | a     b
------|------------
CPU 0 | 1   | 1   |
stbff |     |     |
cache | 1 S | 1 S |     -> a "updated value 1"
Iqueue|     |     |
CPU 1 | 0   | 1   |     -> a "read"
stbff |     |     |
cache | 1 S | 1 S |
Iqueue|     |     |
```
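Putting the two barriers together, a rough portable-C11 equivalent of the final `foo()`/`bar()` pair maps `smp_wmb()` to a release fence and `smp_rmb()` to an acquire fence (a sketch under that assumption, not the kernel's definition of those primitives):

```c
/* C11 sketch of the fully fenced pair: the release/acquire fences
 * synchronize once bar() reads the value of `b` stored by foo(). */
#include <assert.h>
#include <stdatomic.h>

atomic_int a = 0, b = 0;

void foo(void)      /* runs on CPU 0 */
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* ~ smp_wmb() */
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void bar(void)      /* runs on CPU 1 */
{
    while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
        continue;
    atomic_thread_fence(memory_order_acquire);   /* ~ smp_rmb() */
    /* Once bar() has seen b == 1, everything before the release
     * fence in foo() is visible here, so the assertion passes. */
    assert(atomic_load_explicit(&a, memory_order_relaxed) == 1);
}
```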
### Conclusion

Hardware adds caches to cut down on memory reads and writes. Once there are caches, the data in them must be kept consistent (one value per location at any given time). But enforcing full cache consistency leaves CPUs idle too often, so the store buffer and the invalidate queue were added to reduce idling. The price: a CPU is only guaranteed to read its own latest writes; other CPUs are not.

So that other CPUs can still read the latest data when they need it, write/read memory barriers were designed against the side effects of the store buffer and the invalidate queue. Software can then use a memory barrier where it matters to guarantee that critical data is updated in the right order (with no guarantee about when), while the CPU still avoids idling most of the time.

At this point it is clear why these two operations pair naturally with the CPU architecture:

- one thread "writes X, then executes a write memory barrier"
- another thread "executes a read memory barrier, then reads X"

Together they form a happens-before relation. It also becomes clear why sequential consistency is too far removed from how the hardware works: it would leave multiple CPUs idle far too often, squandering the performance of a multi-CPU system.

## References

- [Why CPU Clock Speed Isn't Increasing](https://www.maketecheasier.com/why-cpu-clock-speed-isnt-increasing/)
- [What is NUMA? - VMware Docs](https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.resmgmt.doc/GUID-1DAB8F35-BA86-4063-8459-55D2979B593E.html)
- [What is NUMA? - The Linux Kernel documentation](https://www.kernel.org/doc/html/v4.18/vm/numa.html#what-is-numa)
- [簡介 NUMA 架構](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/Hkd1rsonP)
- [多核心計算環境—NUMA與CPUSET簡介](https://www.cc.ntu.edu.tw/chinese/epaper/0015/20101220_1508.htm)
- [並行程式設計:atomic](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Fconcurrency-atomics)
- [現代處理器設計: Cache 原理和實際影響](https://hackmd.io/@sysprog/HkW3Dr1Rb?type=view)
- [Cache coherence - Wikipedia](https://en.wikipedia.org/wiki/Cache_coherence)
- [Cache coherency protocols (examples) - Wikipedia](https://en.wikipedia.org/wiki/Cache_coherency_protocols_(examples))
- [Cache Coherence Problem & Cache Coherency Protocols](https://www.youtube.com/watch?v=r_ZE1XVT8Ao)
- [Snooping-based Cache Coherency Protocol](https://www.youtube.com/watch?v=YNpaELJZm2c)
- [Directory-based Cache Coherency Protocol](https://www.youtube.com/watch?v=KEc4NQZjkMI)
- [MESI protocol - Wikipedia](https://en.wikipedia.org/wiki/MESI_protocol)
- [從硬體觀點了解 memory barrier 的實作和效果](https://medium.com/fcamels-notes/從硬體觀點了解-memory-barrier-的實作和效果-416ff0a64fc1)