Appreciating Boost's support for atomic and memory order, from the perspective of the C++11 specification and Template Meta-Programming

---

## Agenda

- Introduction
- Why do we need template metaprogramming?
- What is memory order?
- How does [C++11](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) support memory order?
- Boost.Atomic

---

### Introduction

Why read the C++11 specification?

```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```

<small>Reference [Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>

----

- <small>The snippet above is excerpted from [Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation). Reading the code cold is not an efficient way to learn — <font color='red'>it can even leave you with a wrong understanding</font> — so we start from the C++11 specification instead.</small>
- <small>Once the C++11 specification is understood, this code no longer has to be guessed at or approached by trial and error, which makes it the more efficient way to learn.</small>

---

## Why do we need template metaprogramming?

----

First, let's clear something up:<br>
simply using templates does not make a program faster.

----

Templates are expanded and evaluated at compile time, but if you <font color='red'>merely</font> turn a function into a template — <font color='red'>just like swapping it for an inline function or a macro</font> — the running time does not change.

----

An experiment to illustrate this:

----

```cpp
// templates.cpp
template <class T>
T add(T a, T b){
    return a + b;
}

int main(){
    int ans = 0;
    int a = 1, b = 2;
    for(int i = 0; i < 1000000; ++i){
        ans += add(a, b);
    }
}
```

```cpp
// no-templates.cpp
int add(int a, int b){
    return a + b;
}

int main(){
    int ans = 0;
    int a = 1, b = 2;
    for(int i = 0; i < 1000000; ++i){
        ans += add(a, b);
    }
}
```

----

```
Performance counter stats for './templates' (10000 runs):

        8,911,140      cycles                    ( +- 0.09% )
      0.003526686 seconds time elapsed           ( +- 0.11% )

Performance counter stats for './no-templates' (10000 runs):

        9,010,668      cycles                    ( +- 0.11% )
      0.003570547 seconds time elapsed           ( +- 0.13% )
```

- So the difference with templates is only about 0.05 ms?
- That 0.05 ms is the gap between times averaged over 10,000 runs.
- <font color='red'>Simply switching to a template does not make the code faster.</font>

----

<font color='red'>This is not to say templates can never shorten running time</font> — it's that the test above merely swapped a plain function for a template.

----

Using templates to <font color='red'>move computation to compile time and thereby shorten running time</font> — while gaining advantages that inline functions and macros lack — <font color='red'>is not the topic of this talk</font>.

----

So what is template metaprogramming for here?

----

Writing generic code.

----

Let's use a comparison to get a feel for the difference generic code makes.

----

```cpp
unsigned int calc(unsigned int a, unsigned int b, unsigned int c){
    return a ^ 2 + b * 3 + c * 2;
}

int calc(int a, int b, int c){
    return a * 2 + b * 2 + c ^ 2;
}

long calc(long a, long b, long c){
    return a * 3 + b ^ 2 + c ^ 3;
}
```

<small>Looking at this, we might <font color='red'>guess</font> one of the following:</small>

1. This is overloading, and <font color='red'>some of the overloads are simply written wrong</font>
    - <small>Then which one is wrong?</small>
    - <small>Or are two of them wrong?</small>
    - <small>If so, which two?</small>
2. Different types are <font color='red'>simply meant to be computed differently</font>

----

We just showed code that forces the reader to rely on <font color='red'>guessing</font>, which is <font color='red'>poor readability</font>.

----

Comments might patch up the readability,
but this example has a better solution.

----

The better solution is to write generic code, so with template metaprogramming the example becomes:

```cpp
template<class T>
T calc(T a, T b, T c){
    return a ^ 2 + b ^ 2 + c ^ 2;
}
```

<small>Of course, this assumes that every calc in the example above is meant to compute the same thing.</small>
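----

As a quick check (an addition, not in the original slides), here is a minimal sketch of what the generic version buys us: the single `calc` template below serves `int`, `long`, `unsigned int`, and `double` without any overloads. The formula inside it is only a placeholder — as noted above, we assume every `calc` was meant to compute the same thing.

```cpp
#include <iostream>

// One definition covers every arithmetic type;
// the formula is purely illustrative.
template <class T>
T calc(T a, T b, T c)
{
    return a * 2 + b * 2 + c * 2;
}

int main()
{
    std::cout << calc(1, 2, 3) << '\n';        // int
    std::cout << calc(1L, 2L, 3L) << '\n';     // long
    std::cout << calc(1u, 2u, 3u) << '\n';     // unsigned int
    std::cout << calc(1.5, 2.5, 3.5) << '\n';  // double
}
```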
---

## What is memory order?

- reorder
- out-of-order

----

### reorder

```
shared memory: x = 0; y = 0

P0              | P1              ;
a: MOV [x],$1   | c: MOV [y],$1   ;
b: MOV EAX,[y]  | d: MOV EAX,[x]  ;
```

- Could the execution order of a, b, c, d turn out to be <font color='red'>b, d</font>, c, a?
- Because the CPU reorders instructions, they <font color='red'>do not necessarily execute in program order</font>. In this example, processor 0's EAX and processor 1's EAX can therefore both end up 0 (in testing, though, this showed up only a handful of times in a million runs).

----

Reordering behaves differently on different CPUs.

----

After reordering, execution <font color="red"> **may** </font> follow this order:

```
shared memory: x = 0; y = 0

P0              | P1              ;
b: MOV EAX,[y]  | d: MOV EAX,[x]  ;
                | c: MOV [y],$1   ;
a: MOV [x],$1   |                 ;
```

----

- The compiler reorders relative to program order, and the CPU also reorders the order in which instructions execute.
- Forbidding all reordering is one option, but it leads to <font color='red'>poor performance</font>. That is why memory order is worth studying: it gives the engineer some control over reordering.

----

## out-of-order

----

Let's use [instruction pipelining](https://en.wikipedia.org/wiki/Instruction_pipelining) to understand out-of-order execution.

### in-order

```
           | 1      | 2       | 3        | 4      | 5      | 6       | 7        | 8      |
fetch      | fetch1 |         |          |        | fetch2 |         |          |        |
decode     |        | decode1 |          |        |        | decode2 |          |        |
execute    |        |         | execute1 |        |        |         | execute2 |        |
write-back |        |         |          | write1 |        |         |          | write2 |
```

Executing two instructions strictly in this order takes 8 cycles.

----

### out-of-order

With OoOE, however, the order **may** be the following, so the two instructions finish in only 5 cycles:

```
           | 1      | 2       | 3        | 4        | 5      |
fetch      | fetch1 | fetch2  |          |          |        |
decode     |        | decode1 | decode2  |          |        |
execute    |        |         | execute1 | execute2 |        |
write-back |        |         |          | write1   | write2 |
```

But out-of-order execution raises a follow-up problem: <font color='red'>what if</font> `execute2` reads a register whose value is only updated after `write1` runs... <font color='red'>what does `execute2` produce?</font>

----

So now we know: program order is the order we see in the source code, but the actual execution order is not necessarily the same.

----

The CPU reorders instructions, so the program order we see in the assembly is not necessarily the execution order.

If what we write is C++11 code, could its program order differ from the execution order as well?

----

The actual execution order is allowed to differ from the sequenced-before order, but C++11 supports different memory orders that let us control the actual execution order.

---

#### How does [C++11](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) support memory order?

- [data races](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)
- synchronization operations / atomic operations
- memory_order_seq_cst
- memory_order_relaxed
- how the memory model is used to establish a happens-before relationship

----

#### Data races as defined by the [C++11 specification](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)

<small>C++11 1.10/21</small>

<small>The execution of a program contains a **data race** if it contains two **conflicting** actions in different threads, at least one of which is **not atomic**, and neither **happens before** the other.</small>

<small>Reference http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf</small>
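----

To make that definition concrete, here is a minimal sketch (an addition, not taken from the C++11 text or the Boost documentation): the two threads below perform conflicting, non-atomic writes to `plain_counter` with no happens-before between them — exactly a data race — while the `std::atomic` counter beside it is race-free.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                // not atomic
std::atomic<int> atomic_counter{0};   // atomic

void work()
{
    for (int i = 0; i < 100000; ++i) {
        // Two threads, conflicting actions, at least one not atomic,
        // no happens-before: a data race, i.e. undefined behavior.
        ++plain_counter;

        // Atomic, so no data race (relaxed ordering is enough for a counter).
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::cout << "atomic: " << atomic_counter << '\n';  // always 200000
    std::cout << "plain:  " << plain_counter  << '\n';  // anything may happen
}
```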
----

In other words, one of the threads' operations is not atomic, and there is no <font color='red'>happens-before relationship</font> between them either.

- synchronization operations
- atomic operations
- how do we establish a happens-before relationship?

----

A **synchronization** operation on one or more memory locations is either a **consume** operation, an acquire operation, a **release** operation, or both an acquire and release operation.

<small>Reference http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10</small>

<small>A synchronization operation synchronizes between different threads; we will see how this works in the example shortly.</small>

<small>The wait-free multi-producer queue uses memory_order_relaxed, memory_order_release, and memory_order_consume.</small>

<small>The wait-free multi-producer queue is the Boost example shown at the beginning.</small>

----

One thing to watch out for: relaxed operations are atomic, but they are not synchronization operations.

<small>In addition, there are relaxed atomic operations, which are <font color='red'>not synchronization operations</font>, and atomic read-modify-write operations</small>

<small>Reference [C++11 1.10 Multi-threaded executions and data races](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)</small>

----

[C++11, page 1115](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) explains that memory_order_seq_cst imposes a single total order

<small>**consistent** with the “happens before” order and modification orders for **all** affected locations</small>

----

Because memory_order_seq_cst keeps all threads synchronized, performance pushes us to understand the other memory orders as well, such as memory_order_relaxed, memory_order_release, and memory_order_consume, which we explore next.

----

```cpp
// Thread 1:
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);

// Thread 2:
r2 = x.load(memory_order_relaxed);
y.store(42, memory_order_relaxed);
```

<small>Reference: [C++11 specification, page 1116](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)</small>

Can the final result be r1 = r2 = 42?

----

Because memory_order_relaxed gives only an atomic operation and **not a synchronization operation**, the execution order may turn out to be:

```cpp
y.store(42, memory_order_relaxed);
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
r2 = x.load(memory_order_relaxed);
```

<small>Reference: [C++11 specification, page 1116](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)</small>

----

## [Boost.Atomic](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html)

----

### Example about atomic

----

```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```

<small>The load on line 5 and the compare_exchange_weak on line 8 are the atomic operations; both are discussed below, together with the three memory orders involved: memory_order_relaxed, memory_order_release, and memory_order_consume.</small>

<small>Reference [Boost.Atomic, Usage examples](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>

----

[C++11, page 1124](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) says this about compare_exchange_weak:

<small>Implementations should ensure that weak compare-and-exchange operations **do not** consistently return false unless either the atomic object has value **different** from expected or there are concurrent modifications to the atomic object.</small>

----

- compare_exchange_weak compares before exchanging: only if the atomic's current value equals the expected value does it store the new value. It returns true if the exchange happened, false otherwise — and, being the *weak* form, it may also fail spuriously. On failure it writes the current value back into the expected argument (here stale_head), so the next attempt works with fresh data.
- That is why the example contains

```cpp
while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
```

- In this example memory_order_release is the ordering we attach to the atomic operation; below we look at memory_order_release and its counterpart, memory_order_consume.
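----

To see the compare-and-swap pattern on its own, here is a minimal sketch (an addition, not from the slides or the Boost documentation) of the usual compare_exchange_weak retry loop. It uses std::atomic, whose interface for this call matches boost::atomic; the function name atomic_double and the doubling formula are made up for illustration.

```cpp
#include <atomic>

std::atomic<int> value{1};

// Atomically replace `value` with `value * 2`, even if other threads
// are modifying it concurrently.
void atomic_double()
{
    int expected = value.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = expected * 2;
        // If `value` still equals `expected`, store `desired` and return true.
        // Otherwise copy the current value back into `expected` and return
        // false, so the next iteration recomputes `desired` from fresh data.
        // The weak form may also fail spuriously; the loop simply retries.
    } while (!value.compare_exchange_weak(expected, desired,
                                          std::memory_order_release,
                                          std::memory_order_relaxed));
}
```

The loop shape is the same as in the queue's push: recompute the new value from the freshly observed old one, then retry until the compare-and-exchange succeeds.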
----

memory_order_release vs. memory_order_consume

----

Boost puts it this way:

The use of **memory_order_release** after creating and initializing the object and **memory_order_consume** before dereferencing the object provides this guarantee.

----

Now let's see what [C++11, page 1114](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) says:

- **memory_order_release**, memory_order_acq_rel, and memory_order_seq_cst: a store operation performs a release operation **on the affected memory location**.
- memory_order_consume: a load operation performs a consume operation **on the affected memory location**.

----

It is easier to get a feel for memory_order_consume by way of memory_order_acquire, but <font color='red'>memory_order_consume is not the same as memory_order_acquire</font>.

----

You may replace memory_order_consume with memory_order_acquire, <font color='red'>but you must not replace memory_order_acquire with memory_order_consume</font>.

----

memory_order_release and memory_order_consume together establish the happens-before relationship:

```cpp
head_.compare_exchange_weak(stale_head, n, boost::memory_order_release);
```

```cpp
head_.exchange(0, boost::memory_order_consume);
```

---

### Boost.Atomic

- Boost implements a [Wait-free multi-producer queue](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)
- With the understanding of C++11, memory order, and atomic operations built up above, let's appreciate how Boost uses memory order and atomic operations to implement a wait-free multi-producer queue.

----

- producer–consumer problem
- multi-producer and single consumer queue

----

#### multi-producer and single consumer queue

The [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem) basically comes in these flavors:

- **multi-producer and single consumer**
- single-producer and multi-consumer
- multi-producer and multi-consumer
- single producer and single consumer

<small>Since the [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem) is not the focus here,<br>we only cover **multi-producer and single consumer**.</small>

----

### multi-producer and single consumer queue

<small>Producers push messages into the queue, and the consumer takes the messages out of it.</small>

<small>A queue is a FIFO (First-In-First-Out) data structure, but once <font color='red'>multiple producers on multiple threads</font> push to the queue concurrently, things get a bit more complicated.</small>

```graphviz
digraph G{
    rankdir = "LR";
    producer1 -> queue;
    producer2 -> queue;
    queue -> consumer;
    queue[shape="square"];
}
```

----

Excerpted from [Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)

```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```

<small>Reference [Boost.Atomic, Usage example, multi-producer queue](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>
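----

As a rough usage sketch — an addition, not part of the Boost documentation excerpt — the class below wraps the push and pop_all_reverse shown above so the example compiles on its own. The waitfree_queue name, the node struct, and the head_ initialization are assumptions filled in here because the excerpt omits them; two producer threads then push concurrently while a single consumer drains the queue.

```cpp
#include <boost/atomic.hpp>
#include <iostream>
#include <thread>

template <typename T>
class waitfree_queue {
public:
    struct node {
        T data;
        node * next;
    };

    waitfree_queue() : head_(nullptr) {}

    void push(const T &data)
    {
        node * n = new node;
        n->data = data;
        node * stale_head = head_.load(boost::memory_order_relaxed);
        do {
            n->next = stale_head;
        } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
    }

    node * pop_all_reverse(void)
    {
        return head_.exchange(nullptr, boost::memory_order_consume);
    }

private:
    boost::atomic<node *> head_;
};

int main()
{
    waitfree_queue<int> q;

    // Two producers push concurrently; push publishes each node with
    // memory_order_release.
    std::thread p1([&q] { for (int i = 0;    i < 1000; ++i) q.push(i); });
    std::thread p2([&q] { for (int i = 1000; i < 2000; ++i) q.push(i); });
    p1.join();
    p2.join();

    // The single consumer grabs the whole list at once with
    // memory_order_consume, then walks and frees it; the release/consume
    // pair guarantees n->data is fully visible when dereferenced.
    int count = 0;
    for (auto * n = q.pop_all_reverse(); n != nullptr; ) {
        ++count;
        auto * next = n->next;
        delete n;
        n = next;
    }

    std::cout << count << " messages consumed\n";   // 2000
}
```

Note that pop_all_reverse hands back the most recently pushed node first (head_ always points at the newest node) — hence the *reverse* in its name.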
<small>Looking at this example again — no more need to rely on trial and error or guesswork, right?</small>

---

## Reference

- [ ] [Thread coordination using Boost.Atomic, Boost.Atomic, version 1.66](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html)
- [ ] [Wait-free multi-producer queue, implementation](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)
- [ ] [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)
- [ ] [Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer](https://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer)
- [ ] [The Purpose of memory_order_consume in C++11](https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/)
- [ ] [CppMem: Interactive C/C++ memory model](http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/)

---

END

----

###### tags: `COSCUP 2019` `C++11` `memory order` `TMP` `Template Meta-Programming` `Boost` `atomic` `Boost.Atomic` `1.66`
{"metaMigratedAt":"2023-06-14T23:25:38.809Z","metaMigratedFrom":"Content","title":"Untitled","breaks":true,"contributors":"[{\"id\":\"ff22442a-240f-4d3f-a3c1-2de9a238e8ae\",\"add\":21815,\"del\":6927}]"}
    718 views