Appreciating Boost's support for atomics and memory order from the perspective of the C++11 specification and Template Meta-Programming
---
## Agenda
- Introduction
- Why do we need template metaprogramming?
- What is memory order?
- How does [C++11](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) support memory order?
- Boost.Atomic
---
### Introduction
Why read the C++11 specification?
```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```
<small>Reference
[Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>
----
- <small>The snippet above is taken from the [Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation). Reading the code directly is not an efficient way to learn, and <font color='red'>may even leave you with the wrong understanding</font>, so we start from the C++11 specification instead.</small>
- <small>Once we understand the C++11 specification, we no longer have to guess or rely on trial and error when reading this code — a far more efficient way to learn.</small>
---
## Why do we need template metaprogramming?
----
Let's clarify one thing first:<br>
using templates does not automatically make a program faster.
----
Although templates are expanded and evaluated at compile time, if you <font color='red'>merely</font> turn a function into a template, it is just like <font color='red'>replacing it with inline or a macro</font>: the execution time does not change.
----
Let's run an experiment to illustrate this.
----
```cpp
// templates.cpp
template <class T> T add(T a, T b){
return a + b;
}
int main(){
int ans = 0;
int a = 1, b = 2;
for(int i = 0; i < 1000000; ++i){
ans += add(a, b);
}
}
```
```cpp
// no-templates.cpp
int add(int a, int b){
return a + b;
}
int main(){
int ans = 0;
int a = 1, b = 2;
for(int i = 0; i < 1000000; ++i){
ans += add(a, b);
}
}
```
----
```
Performance counter stats for './templates' (10000 runs):
8,911,140 cycles ( +- 0.09% )
0.003526686 seconds time elapsed ( +- 0.11% )
Performance counter stats for './no-templates' (10000 runs):
9,010,668 cycles ( +- 0.11% )
0.003570547 seconds time elapsed ( +- 0.13% )
```
- The difference between the templates and non-templates versions is only about 0.05 ms
- and that 0.05 ms is the difference between averages over 10,000 runs
- <font color='red'>simply switching to a template does not make the code faster</font>
----
<font color='red'>This does not mean templates can never shorten execution time</font>; it is just that the test above merely replaced a plain function with a template.
----
Using templates to <font color='red'>move computation to compile time and shorten execution time</font>, with advantages that inline and macros do not have, is <font color='red'>not the topic of this talk</font>.
----
So what is template metaprogramming about here?
----
It does have something to offer:
writing generic code.
----
Let's use a counterexample to see what a difference generic code makes.
----
```cpp
unsigned int calc(unsigned int a, unsigned int b, unsigned int c){
return a ^ 2 + b * 3 + c * 2;
}
int calc(int a, int b, int c){
return a * 2 + b * 2 + c ^ 2;
}
long calc(long a, long b, long c){
return a * 3 + b ^ 2 + c ^ 3;
}
```
<small>Reading this, we might <font color='red'>guess</font>:</small>
1. This is overloading, and <font color='red'>a few of the overloads are just written wrong</font>
- <small>Then which one is wrong?</small>
- <small>Or are two of them wrong?</small>
- <small>If so, which two?</small>
2. Different types are <font color='red'>simply meant to be computed differently</font>
----
We just saw code that forces the reader to rely on <font color='red'>gut feeling</font>; in terms of readability, that is <font color='red'>really not acceptable</font>.
----
Perhaps comments could patch up the readability,
but this example has a better solution.
----
The better solution is to write generic code, so we use template metaprogramming; after the rewrite it looks like this:
```cpp
template<class T> T calc(T a, T b, T c){
    return a ^ 2 + b ^ 2 + c ^ 2;
}
```
<small>This of course assumes that, in the example above, every calc is meant to perform the same computation.</small>
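A short usage sketch (not in the original slides; the literal arguments and the `main` wrapper are just for illustration) showing that one template now serves all three types:
```cpp
// Hypothetical usage of the generic calc above: each call instantiates
// calc<T> for the caller's argument type, with one clearly stated formula.
template<class T> T calc(T a, T b, T c){
    return a ^ 2 + b ^ 2 + c ^ 2;
}

int main(){
    unsigned int u = calc(1u, 2u, 3u);   // instantiates calc<unsigned int>
    int          i = calc(1, 2, 3);      // instantiates calc<int>
    long         l = calc(1L, 2L, 3L);   // instantiates calc<long>
    return static_cast<int>(u + i + l);  // use the results so nothing is optimized away
}
```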
---
## What is memory order?
- reorder
- out-of-order
----
### reorder
```
shared memory: x = 0; y = 0
P0 | P1 ;
a: MOV [x],$1 | c: MOV [y],$1 ;
b: MOV EAX,[y] | d: MOV EAX,[x] ;
```
- Could a, b, c, d actually execute in the order <font color='red'>b, d</font>, c, a?
- Because the CPU reorders, execution <font color='red'>does not necessarily follow program order</font>, so in this example processor 0's EAX and processor 1's EAX can both end up as 0 (although in testing it showed up only a handful of times per million runs). A runnable sketch of this test follows.
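A C++11 version of the same store-buffering test, as a hedged sketch (my own illustration, not from the slides; the variable names, iteration count, and the relaxed ordering are assumptions chosen so the compiler and CPU remain free to reorder):
```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0 = 0, r1 = 0;

int main(){
    int both_zero = 0;
    for (int i = 0; i < 1000000; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        std::thread p0([]{                              // P0
            x.store(1, std::memory_order_relaxed);      // a: MOV [x],$1
            r0 = y.load(std::memory_order_relaxed);     // b: MOV EAX,[y]
        });
        std::thread p1([]{                              // P1
            y.store(1, std::memory_order_relaxed);      // c: MOV [y],$1
            r1 = x.load(std::memory_order_relaxed);     // d: MOV EAX,[x]
        });
        p0.join();
        p1.join();
        if (r0 == 0 && r1 == 0) ++both_zero;            // both threads read the stale 0
    }
    std::printf("r0 == r1 == 0 happened %d times\n", both_zero);
}
```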
----
Reordering behaves differently on different CPUs.
----
After reordering, execution<font color="red"> **might** </font>proceed in this order:
```
shared memory: x = 0; y = 0
P0 | P1 ;
b: MOV EAX,[y] | d: MOV EAX,[x] ;
| c: MOV [y],$1 ;
a: MOV [x],$1 | ;
```
----
- The compiler may reorder the program order, and the CPU may also reorder the order in which instructions are executed.
- Forbidding all reordering is one option, but it leads to <font color='red'>poor performance</font>. That is why we study memory order: it lets engineers keep some control over reordering. A sketch of the heavy-handed "forbid everything" approach follows.
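As a hedged illustration (my own sketch, not from the slides; the names are made up): a seq_cst fence on both sides keeps anything from being reordered across it, which is exactly the blanket cost we would rather not pay everywhere.
```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

int main(){
    std::thread writer([]{
        data = 42;
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier: nothing moves across it
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread reader([]{
        while (!ready.load(std::memory_order_relaxed)) {}     // spin until the flag is published
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier on the reading side too
        assert(data == 42);                                    // the two fences order the plain access
    });
    writer.join();
    reader.join();
}
```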
----
### out-of-order
----
Let's understand out-of-order execution by way of [instruction pipelining](https://en.wikipedia.org/wiki/Instruction_pipelining).
### in-order
```
cycle      | 1        2        3        4        5        6        7        8
fetch      | fetch1                              fetch2
decode     |          decode1                             decode2
execute    |                   execute1                            execute2
write-back |                            write1                              write2
```
Executing 2 instructions strictly in this order takes 8 cycles.
----
### out-of-order
But with OoOE the order **may** look like this instead, so the 2 instructions take only 5 cycles:
```
cycle      | 1        2        3        4        5
fetch      | fetch1   fetch2
decode     |          decode1  decode2
execute    |                   execute1 execute2
write-back |                            write1   write2
```
Out-of-order execution raises a follow-up problem, though: <font color='red'>what if</font> `execute2` reads a register whose value is only updated once `write1` has executed...
<font color='red'>then what does `execute2` produce?</font>
----
We now know that program order is the order we see in the source code, but the actual execution order does not necessarily match it.
----
The CPU reorders instructions, so the program order we read off the assembly is not necessarily the execution order.
If we are writing C++11 code, could its program order likewise differ from the actual execution order?
----
The actual execution order is allowed to differ from the sequenced-before order, but C++11 provides several memory orders (listed below) that let us control it.
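For reference, these are the six orderings C++11 declares in `<atomic>` (29.3 in N3376); the rest of the talk focuses on seq_cst, relaxed, release, and consume:
```cpp
// The memory_order enumeration as specified by C++11 (N3376, 29.3).
typedef enum memory_order {
    memory_order_relaxed, memory_order_consume, memory_order_acquire,
    memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;
```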
---
#### How does [C++11](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) support memory order?
- [data races](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)
- synchronization operations / atomic operations
- memory_order_seq_cst
- memory_order_relaxed
- how to establish a happens-before relationship through the memory model
----
#### [data races](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10) as defined by the C++11 specification
<small>C++11 1.10/21</small>
<small>The execution of a program contains a **data race** if it contains two **conflicting** actions in different threads, at least one of which is **not atomic**, and neither **happens before** the other.</small>
<small>Reference
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf</small>
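A minimal illustration of this definition (my own sketch, not from the slides; the variable and thread names are made up): the writes to `plain` are conflicting, neither is atomic, and neither happens before the other, so the program contains a data race; the writes to `guarded` are atomic and therefore do not.
```cpp
#include <atomic>
#include <thread>

int plain = 0;                 // not atomic
std::atomic<int> guarded{0};   // atomic: concurrent access is not a data race

int main(){
    std::thread t1([]{ plain = 1; guarded.store(1); });   // conflicting write to plain
    std::thread t2([]{ plain = 2; guarded.store(2); });   // conflicting write, no happens-before
    t1.join();
    t2.join();                 // the writes to plain constitute a data race (undefined behavior)
}
```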
----
At least one of the operations in those threads is not an atomic operation, and there is also no <font color='red'>happens-before relationship</font> between them.
- synchronization operations
- atomic operations
- how do we establish a happens-before relationship?
----
A **synchronization** operation on one or more memory locations is either a **consume** operation, an acquire operation, a **release** operation, or both an acquire and release operation.
<small>Reference
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10</small>
<small>A synchronization operation synchronizes between threads; we will see how this works in the example shortly.</small>
<small>The wait-free multi-producer queue uses memory_order_relaxed, memory_order_release, and memory_order_consume.</small>
<small>The wait-free multi-producer queue is the Boost example shown at the beginning.</small>
----
One thing deserves special attention:
relaxed operations are atomic, but they are not synchronization operations.
<small>In addition, there are relaxed atomic operations, which are <font color='red'>not synchronization operations</font>, and atomic read-modify-write operations</small>
<small>Reference
[C++11 1.10 Multi-threaded executions and data races](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)</small>
----
[C++11, page 1115](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) states that all memory_order_seq_cst operations fall into a single total order
<small>**consistent** with the “happens before” order and modification orders for **all** affected locations</small>
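A hedged sketch (my own, not from the slides) of what that single total order forbids, reusing the earlier store-buffering example: with memory_order_seq_cst everywhere, `r0 == r1 == 0` can no longer be observed.
```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0 = 0, r1 = 0;

int main(){
    std::thread p0([]{
        x.store(1, std::memory_order_seq_cst);
        r0 = y.load(std::memory_order_seq_cst);
    });
    std::thread p1([]{
        y.store(1, std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_seq_cst);
    });
    p0.join();
    p1.join();
    assert(!(r0 == 0 && r1 == 0));   // forbidden by the single total order of seq_cst operations
}
```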
----
Because memory_order_seq_cst keeps all threads in one total order, it has a performance cost, so we also need to understand the other memory orders, such as memory_order_relaxed, memory_order_release, and memory_order_consume, which we look at next.
----
```cpp
// Thread 1:
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
// Thread 2:
r2 = x.load(memory_order_relaxed);
y.store(42, memory_order_relaxed);
```
<small>Reference:
[C++11 規格, page 1116](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)
</small>
Could the final result be r1 = r2 = 42?
----
Because memory_order_relaxed operations are only atomic operations and are **not synchronization operations**,
the execution order could turn out like this:
```cpp
y.store(42, memory_order_relaxed);
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
r2 = x.load(memory_order_relaxed);
```
<small>Reference:
[C++11 規格, page 1116](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf#section.1.10)
</small>
----
## [Boost.Atomic](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html)
----
### Example: atomic operations
----
```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```
<small>The load on line 5 and the compare_exchange_weak on line 8 are atomic operations. We will come back to these two atomic operations below, together with the three memory orders involved: memory_order_relaxed, memory_order_release, and memory_order_consume.</small>
<small>Reference
[Boost.Atomic, Usage examples](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>
----
[C++11, page 1124](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) says this about compare_exchange_weak:
<small>Implementations should ensure that weak compare-and-exchange operations **do not** consistently return false unless either the atomic object has value **different** from expected or there are concurrent modifications to the atomic object.</small>
----
- compare_exchange_weak first compares the atomic's current value with the expected value (here `stale_head`); only if they are equal does it perform the exchange (storing `n`) and return true. Otherwise it writes the current value back into `stale_head` and returns false, and it may also fail spuriously.
- That is why the example wraps it in a loop:
```cpp
while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
```
- In this example the atomic operation is tagged with memory_order_release; below we look at memory_order_release and its counterpart memory_order_consume. A standalone sketch of the retry pattern follows.
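A standalone sketch of the same weak-CAS retry pattern (my own illustration, not from the Boost example; `counter`, `add`, and the chosen orderings are assumptions):
```cpp
#include <atomic>

std::atomic<int> counter{0};

// Atomically add `delta` using the retry pattern: compare_exchange_weak may
// fail spuriously, so it sits in a loop; on failure it reloads the current
// value of `counter` into `expected` and we try again.
void add(int delta){
    int expected = counter.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = expected + delta;
    } while (!counter.compare_exchange_weak(expected, desired,
                                            std::memory_order_release,
                                            std::memory_order_relaxed));
}

int main(){
    add(5);
}
```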
----
memory_order_release vs. memory_order_consume
----
Boost puts it this way:
The use of **memory_order_release** after creating and initializing the object and **memory_order_consume** before dereferencing the object provides this guarantee.
----
Now let's see what [C++11, page 1114](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf) says:
- **memory_order_release**, memory_order_acq_rel, and memory_order_seq_cst:
a store operation performs a release operation **on the affected memory location**.
- memory_order_consume:
a load operation performs a consume operation **on the affected memory location**
----
It is easier to make sense of memory_order_consume by thinking of memory_order_acquire, but <font color='red'>memory_order_consume is not the same as memory_order_acquire</font>.
----
You can replace memory_order_consume with memory_order_acquire, <font color='red'>but you cannot replace memory_order_acquire with memory_order_consume</font>.
----
memory_order_release and memory_order_consume together establish the happens-before relationship:
```cpp
head_.compare_exchange_weak(stale_head, n, boost::memory_order_release)
```
```cpp
head_.exchange(0, boost::memory_order_consume)
```
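A hedged sketch of this release/consume pairing on its own (my own illustration using std::atomic rather than the Boost class; the names `publish`, `read_one`, and `head` are made up): the writer fully initializes the node and publishes it with a release store, and the reader's consume load guarantees that dereferences carrying a dependency on the loaded pointer see that initialization.
```cpp
#include <atomic>
#include <thread>

struct node { int data; node *next; };

std::atomic<node *> head{nullptr};

void publish(int value){
    node *n = new node{value, nullptr};          // initialize the object first
    head.store(n, std::memory_order_release);    // then publish it with a release store
}

int read_one(){
    node *n = head.load(std::memory_order_consume);  // consume load of the pointer
    return n ? n->data : -1;                          // n->data carries a dependency on n
}

int main(){
    std::thread writer([]{ publish(42); });
    std::thread reader([]{ (void)read_one(); });
    writer.join();
    reader.join();
}
```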
---
### Boost.Atomic
- Boost implements a [Wait-free multi-producer queue](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)
- With the understanding of C++11, memory order, and atomic operations built up above, let's appreciate how Boost uses memory order and atomic operations to implement a wait-free multi-producer queue.
----
- producer–consumer problem
- multi-producer and single consumer queue
----
#### multi-producer and single consumer queue
The [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem) basically comes in these variants:
- **multi-producer and single consumer**
- single-producer and multi-consumer
- multi-producer and multi-consumer
- single producer and single consumer
<small>Since the [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem) is not the focus here,<br>we only cover **multi-producer and single consumer**</small>
----
### multi-producer and single consumer queue
<small>Producers push messages into the queue, and the consumer takes the messages out of it.</small>
<small>A queue is a FIFO (First-In, First-Out) data structure, but once <font color='red'>multiple producers on multiple threads</font> push to the queue at the same time, things get a bit more complicated.</small>
```graphviz
digraph G{
rankdir = "LR";
producer1 -> queue;
producer2 -> queue;
queue -> consumer
queue[shape="square"];
}
```
----
Excerpted from the [Boost.Atomic, Usage example](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)
```cpp=
void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    do {
        n->next = stale_head;
    } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
}
...
node * pop_all_reverse(void)
{
    return head_.exchange(0, boost::memory_order_consume);
}
private:
    boost::atomic<node *> head_;
```
<small>Looking at this example again now, we no longer need to rely on trial and error or guesswork, do we?</small>
<small>Reference
[Boost.Atomic, Usage example, multi-producer queue](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)</small>
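To close, a hedged end-to-end usage sketch: the class below is rebuilt from the excerpt above into a minimal compilable form (the class name follows the Boost example; the constructor initializer, thread count, and element counts are my assumptions), with two producer threads pushing concurrently and a single consumer draining the list.
```cpp
#include <boost/atomic.hpp>
#include <thread>
#include <vector>

template<typename T>
class waitfree_queue {
public:
    struct node { T data; node * next; };

    void push(const T &data)
    {
        node * n = new node;
        n->data = data;
        node * stale_head = head_.load(boost::memory_order_relaxed);
        do {
            n->next = stale_head;
        } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
    }

    node * pop_all_reverse(void)
    {
        return head_.exchange(0, boost::memory_order_consume);
    }

private:
    boost::atomic<node *> head_{nullptr};   // assumption: the Boost example initializes head_ in its constructor
};

int main(){
    waitfree_queue<int> q;
    std::vector<std::thread> producers;
    for (int t = 0; t < 2; ++t)                          // two producers push concurrently
        producers.emplace_back([&q, t]{
            for (int i = 0; i < 100; ++i)
                q.push(t * 100 + i);
        });
    for (auto &p : producers) p.join();

    int popped = 0;                                      // the single consumer drains the queue
    for (auto *n = q.pop_all_reverse(); n != 0; ++popped) {
        auto *next = n->next;
        delete n;
        n = next;
    }
    return popped == 200 ? 0 : 1;
}
```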
---
## Reference
- [ ] [Thread coordination using Boost.Atomic, Boost.Atomic, version 1.66](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html)
- [ ] [Wait-free multi-producer queue, implementation](https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue.implementation)
- [ ] [producer–consumer problem](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)
- [ ] [Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer](https://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer)
- [ ] [The Purpose of memory_order_consume in C++11](https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/)
- [ ] [CppMem: Interactive C/C++ memory model](http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/)
---
END
----
###### tags: `COSCUP 2019` `C++11` `memory order` `TMP` `Template Meta-Programming` `Boost` `atomic` `Boost.Atomic` `1.66`
{"metaMigratedAt":"2023-06-14T23:25:38.809Z","metaMigratedFrom":"Content","title":"Untitled","breaks":true,"contributors":"[{\"id\":\"ff22442a-240f-4d3f-a3c1-2de9a238e8ae\",\"add\":21815,\"del\":6927}]"}