Toward Concurrency 2.0

# Toward Concurrency 2.0 :::success **更動內容** 1. 重新整理文章架構 2. 新增文章摘要 1. Mutexes and Semaphores Demystified 2. Mutex vs. Semaphores – Part 1: Semaphores 3. Mutex vs. Semaphores – Part 2: The Mutex 4. Mutex vs. Semaphores – Part 3: Mutual Exclusion Problems 5. 浅谈Memory Reordering 6. Weak vs. Strong Memory Models 3. 新增簡介 1. Functional Programming 4. 新增案例探討 1. 研究 <cloudwu / coroutine> 5. 整理範例程式碼 1. Pthread 2. OpenMP ::: ## What is Concurrency ? ### [Concurrency is not Parallelism](https://blog.golang.org/concurrency-is-not-parallelism) by Rob Pike * [投影片](https://talks.golang.org/2012/waza.slide#1) * [錄影](https://www.youtube.com/watch?v=cN_DpYBzKso); [後續 stackoverflow 討論](http://stackoverflow.com/questions/11700953/concurrency-is-not-parallelism) * **Concurrency** 是指程式架構，將程式拆開成多個可獨立運作的工作。e.g. drivers，都可以獨立運作，但不需要平行化。 * 拆開多個的工作不一定要同時運行 * 多個工作在單核心 CPU 上運行 * **Parallelism** 是指程式執行，同時執行多個程式。Concurrency 可能會用到 parallelism，但不一定要用 parallelism 才能實現 concurrency。 e.g. Vector dot product * 程式會同時執行 (例如：分支後，同時執行，再收集結果) * 一個工作在多核心 CPU 上運行 * Concurrency 對軟體設計的影響 * 想要充分使用到 CPU 的資源 * 程式越來越有機會造成 CPU-bound。雖然主要還是 IO-bound 等，但如果 CPU 時脈無法增加，而其他存取方式速度變快，最後會發生 CPU-bound * 軟體效能優化將會越來越重要 * 程式語言必須好好處理 concurrency ### 用地鼠燒書做例子 ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460613009716_task.jpg) * 如果今天增加多一只地鼠，一個推車或多一個焚燒盧，這樣有機會作到更好的資源使用率，但我們不能保證兩只或更多地鼠會同時進行，因為可能只有有限的火爐。 (在單核系統中只能允許一次進行一次的燒書工作，那樣就沒有效率了) ![](https://i.imgur.com/XTGAK98.jpg) * 以 Concurrency 的方式去作業，能夠以不同的解構方式去進行,可以是三個地鼠分別負責一部分的工作 (decomposition) ![](https://i.imgur.com/w5Eugs7.jpg) * 其中也可以 Parallelism: ![](https://i.imgur.com/3IHX5BT.jpg) 或 ![](https://i.imgur.com/duEVvNK.jpg) ### 換個觀點來理解 [source](http://stackoverflow.com/questions/1050222/concurrency-vs-parallelism-what-is-the-difference) * **Concurrency:** If two or more problems are solved by a single processor. * multiple execution flows with the potential to share resources * Example: two threads competing for a I/O port. * the separation of tasks to provide **interleaved execution** * Concurrency solves the problem of having scarce CPU resources and many tasks. So, you create threads or independent paths of execution through code in order to share time on the scarce resource. Up until recently, concurrency has dominated the discussion because of CPU availability. ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460613316743_p1.png) * **Parallelism:** If one problem is solved by multiple processors. * splitting a problem in multiple similar chunks. * Example: parsing a big file by running two processes on every half of the file. * the **simultaneous execution of multiple pieces of work** in order to increase speed * Parallelism solves the problem of finding enough tasks and appropriate tasks (ones that can be split apart correctly) and distributing them over plentiful CPU resources. Parallelism has always been around of course, but it’s coming to the forefront because multi-core processors are so cheap. ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460613329719_p2.png) ### 小結 [Introduction to OpenMP](https://www.youtube.com/watch?v=6jFkNjhJ-Z4) * Concurrent (並行) * 工作可拆分成「獨立執行」的部份，這樣「可以」讓很多事情一起做，但是「不一定」要真的同時做。下方情境： ![](https://i.imgur.com/rweOyiD.png) * 展示具有並行性，但不去同時執行。 * 並行性是種「架構程式」的概念。寫下一段程式之前，思考問題架構時就決定好的。 * Parallel (平行) * 把規劃好，能夠並行的程式，分配給不同執行緒，並讓他們同時執行。 ![](https://i.imgur.com/Oom3wM5.png) * 「平行」是一種選擇。 * 假設有天我在程式發現需要做內積，我可用迴圈遍歷 3 個元素做(沒有平行化)，也可用 3 個執行緒同時執行每個相應位置元素的乘法 (有平行化)。這裡「要不要做平行化」只是種「選擇」 :::info 只是舉例, 為內積故意創造執行緒, 有可能光是花在建立新執行緒上的時間與資源就比內積本身還要多 ::: ## Why Do We Need Concurrency ? ### 軟體開發現況 * [The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software](http://www.gotw.ca/publications/concurrency-ddj.htm) * Free (performance) lunch 指的是程式設計的效能可以透過 CPU 時脈的進步而得到改善。會說 over 是因為 CPU 的時脈無法得到更進一步的增加，由於耗電和散熱的問題，所以程式設計師必須要修改程式才能改善效能 * 文章中 `Figure 1`，可以看到 CPU 時脈並沒有隨著電晶體數量而增加，反而是趨緩了 ![](https://i.imgur.com/hr4gwXs.png) * 以前 CPU 增進效能的手段 * Clock speed * Execution Optimization * Pipelining、Branch prediction * Out-of-order execution 還要注意不能讓原本程式崩潰 (read/write reorder) * Cache：盡量減少存取 Main memory 的機會 * 現在 CPU 增進效能的手段 * Hyperthreading: 在 single CPU 上同時執行多個 thread，但是共用 ALU、FPU * Multicore：多個 CPU * 迷思：`2 x 3GHz < 6GHz` * Cache * 軟體開發革命 (Software development revolution) 是源自於存在一陣子的技術 (擁有成熟的銷售商和工具支援)，而非新技術一出現即發生。 ### Functional Programming * [Why are functional languages considered a boon for multi threaded environments?](http://stackoverflow.com/questions/2909282/why-are-functional-languages-considered-a-boon-for-multi-threaded-environments) 這年頭要充分展現 multicore architecture 的效能，不能不會寫 FP * [It's Time to Get Good at Functional Programming](http://www.drdobbs.com/tools/its-time-to-get-good-at-functional-progr/212201710) 解釋 FP 為何物介紹常見的 4 個 FP language: Scala, F#, Erlang, Haskell * [Don’t Be Scared Of Functional Programming](https://www.smashingmagazine.com/2014/07/dont-be-scared-of-functional-programming/) 專為新手寫的 FP 教學，讓大家能從傳統 imperative 轉到 FP 思維逐步解釋程式碼 + 真實範例 * [Functional programming in C](https://lucabolognese.wordpress.com/2013/01/04/functional-programming-in-c/) 從大家熟悉的 C 語言實做 FP 可以發現 FP 並不是一種語言分類，而是一種程式的思考模式，就像 OOP 是一種態度，C 語言也可以實做 OOP 一樣 ## How to Achieve Concurrency ? By [冼鏡光老師的並行計算投影片 01-Into](http://blog.dcview.com/article.php?a=BTgBYw1qUWIAaQ%3D%3D) 1. Split a program * Splitting a program improperly would just make the program more inefficent. 2. Synchronization * Programmers may be excited about concurrency. But, the picture is not that rosy because splitting a program into multiple processes or threads is easily said than done. * Processes and threads must communicate with each other to get the job done. **Once there are communications there are troubles** * 以 Rob Pike 的地鼠燒書為例 ![](https://i.imgur.com/w5Eugs7.jpg) * Split a program: 將燒書過程中，切割好各地鼠負責的工作。 * Synchronization: 確保地鼠在交接工作時，能夠有**效率**且**正確**的溝通。 ## 背景知識 ### Thread vs. Coroutines * With **threads**, the operating system switches running tasks **preemptively** according to its scheduling algorithm. * [SMT (Simultaneous Multithreading), VMT(Vertical Multithreading)](http://www.fujitsu.com/global/products/computing/servers/unix/sparc-enterprise/technology/performance/processor.html) ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460616808318_proc.png) * With **coroutines**, the programmer chooses, meaning tasks are cooperatively multitasked by pausing and resuming functions at set points. * coroutine switches are cooperative, meaning the programmer controls when a switch will happen. * **The kernel is not involved in coroutine switches**. * 一圖勝千語 ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460615185290_undefined) * 將函式呼叫從原本的面貌: [ [source](http://www.csl.mtu.edu/cs4411.ck/www/NOTES/non-local-goto/coroutine.html) ] ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460615014454_undefined) 轉換成以下 ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460615044111_undefined) * C 程式實做 coroutines 的手法相當多，像是: * [switch-case](http://blog.linux.org.tw/~jserv/archives/001848.html) * [setjmp / longjmp](http://descent-incoming.blogspot.tw/2014/02/coroutine.html) * [ 研究 <```cloudwu / coroutine```> ](https://hackmd.io/s/rkxoqN8Al) ### Fork Join Model * 典型的平行程式，由以下 3 個區段所組成: 1. fork: 準備進入要平行化的區域時, master thread (下圖紅色) 產生出 team of threads (紅色加黃色) * 注意: 這裡的 "fork" 並非 `fork(2)` system call，請回家複習**OS 作業系統** ![](https://i.imgur.com/3cHajBu.png) 2. parallel region: 這些執行緒合力執行給定的工作 ![](https://i.imgur.com/FcOmLLn.png) 3. join: 工作完成之後, 又回到 fork 之前, 單一執行緒的狀態 ![](https://i.imgur.com/M3I3AFO.png) * fork-join model 就是不斷重複上述過程 ![](https://i.imgur.com/UtpTNxC.png) * 由記憶體的角度切入: 每個 thread 都有自己的 stack, 但 text, heap & data section 是共用的 [source](https://www.tutorialspoint.com/operating_system/os_multi_threading.htm) ![](https://www.tutorialspoint.com/operating_system/images/thread_processes.jpg) ## 先來探討 Synchronization 為了達到 concurrency 要有 Split a program 與 Synchronization 兩個步驟但本篇只會負責到 Synchronization 而已，所以這裡說是要先來探討 Synchronization，其實之後也不會解釋怎麼做 Program Split XD * Race Condition * 設 count = 0 並為 Process A 和 B 的共享變數。 * 正確執行 Process A 和 B 的結果 count 應仍為 0 * 但若 Scheduler 照下列時序 (T1 -> T2 -> ...)，則最後 count 為 -1 ```clike= /* Process A: count++ */ T1: lw $t1, &count T3: addi $t1, $t1, 1 T4: sw $tw, &count /* Process B: count-- */ T2: lw $t1, &count T5: addi $t1, $t1, -1 T6: sw $tw, &count ``` * [Mutex vs. Semaphores – Part 1: Semaphores](https://blog.feabhas.com/2009/09/mutex-vs-semaphores-%E2%80%93-part-1-semaphores/) * [Edsger Dijkstra](https://en.wikipedia.org/wiki/Edsger_W._Dijkstra) 在 1965 年提出 binary semaphore 的構想，嘗試去解決在 concurrent programs 中可能發生的 race conditions。想法便是透過兩個 function calls 存取系統資源 Semaphore，來標明要進入或離開 critical region。 * 使用 Semaphore 擁有的風險 * Accidental release * This problem arises mainly due to a bug fix, product enhancement or cut-and-paste mistake. * Recursive deadlock * 某一個 task 重複地想要 lock 它自己已經鎖住的 Semaphore，通常發生在 libraries 或 recursive functions。 > for example, the simple locking of malloc being called twice within the framework of a library. An example of this appeared in the MySQL database bug reporting system: Bug #24745 InnoDB semaphore wait timeout/crash – deadlock waiting for itself >> 不太懂這個例子。 * Task-Death deadlock * 某一個 task 其持有的 Semaphore 已經被終止或發生錯誤。 * Priority inversion * Semaphore as a signal * **Synchronization** between tasks is where, typically, one task waits to be notified by another task before it can continue execution (unilateral rendezvous). A variant of this is either task may wait, called the bidirectional rendezvous. **Quite different to mutual exclusion, which is a protection mechanism.** * 使用 Semaphore 作為處理 Synchronization 的方法容易造成 Debug 困難及增加 "accidental release" 類性的問題。 * [Mutex vs. Semaphores – Part 2: The Mutex](https://blog.feabhas.com/2009/09/mutex-vs-semaphores-%E2%80%93-part-2-the-mutex/) * 作業系統實作中，需要引入的機制。"The major use of the term mutex appears to have been driven through the development of the common programming specification for UNIX based systems." * 儘管 Mutex 及 Binary Semaphore 的概念極為相似，但有一個最重要的差異：**the principle of ownership**。 * 而 Ownership 的概念可以解決上一章所討論的五個風險 * Accidental release * Mutex 若遭遇 Accidental release 將會直接發出 Error Message，因為它在當下並不是 owner。 * Recursive Deadlock * 由於 Ownership，Mutex 可以支援 relocking 相同的 Mutex 多次，只要它被 released 相同的次數。 * Priority Inversion * Priority Inheritance Protocol * [Priority Ceiling Protocol](https://en.wikipedia.org/wiki/Priority_ceiling_protocol) * [投影片](https://tams.informatik.uni-hamburg.de/lehre/2016ss/vorlesung/es/doc/cs.lth.se-RTP-F6b.pdf): 介紹 PCP，並實際推導兩個例子。 * Death Detection * 如果某 task 因為某些原因而中止，則 RTOS 會去檢查此 task 是否擁有 Mutex ;若有, RTOS 則會負責通知所有 waiting 此 Mutex 的 tasks 們。最後再依據這些 tasks 們的情況，有不同的處理方式。 * All tasks readied with error condition * 所有 tasks 假設 critical region 為未定義狀態，必需重新初始化。 * Only one task readied * Ownership 交給此 readied tesk，並確保 the integrity of critical region，之後便正常執行。 * Semaphore as a signal * Mutex 不會用來處理 Synchronization。 * [Mutex vs. Semaphores – Part 3: Mutual Exclusion Problems](https://blog.feabhas.com/2009/10/mutex-vs-semaphores-%E2%80%93-part-3-final-part-mutual-exclusion-problems/) * 但還是有一些問題是無法使用 mutex 解決的。 * Circular deadlock * One task is blocked waiting on a mutex owned by another task. That other task is also block waiting on a mutex held by the first task. * Necessary Conditions * Mutual Exclusion * Hold and Wait * Circular Waiting * No preemption * Solution: Priority Ceiling Protocol * PCP 的設計可以讓低優先權的 task 提升至 ceiling value，讓其中一個必要條件 "Hold and Wait" 不會成立。 * Non-cooperation * Most important aspect of mutual exclusion covered in these ramblings relies on one founding principle: **we have to rely on all tasks to access critical regions using mutual exclusion primitives**. * Unfortunately this is dependent on the design of the software and cannot be detected by the run-time system. * Solution: Monitor * Not typically supplied by the RTOS * A monitor simply encapsulates the shared resource and the locking mechanism into a single construct. * Access to the shared resource is through a controlled interface which cannot be bypassed. * [Mutexes and Semaphores Demystified](http://www.barrgroup.com/Embedded-Systems/How-To/RTOS-Mutex-Semaphore) * Myth: Mutexes and Semaphores are Interchangeable * Reality: 即使 Mutexes 跟 Semaphores 在實作上有極為相似的地方，但他們通常被用在不同的情境。 * The correct use of a semaphore is for **signaling** from one task to another. * A mutex is meant to be taken and released, always in that order, by each task that uses the shared resource it protects. * 還有一個重要的差別在於，即便是正確地使用 mutex 去存取共用資源，仍有可能造成危險的副作用。 * 任兩個擁有不同優先權的 RTOS task 存取同一個 mutex，可能會創造 **Priority Inversion**。 * Definition: The real trouble arises at run-time, when a medium-priority task preempts a lower-priority task using a shared resource on which the higher-priority task is pending. If the higher-priority task is otherwise ready to run, but a medium-priority task is currently running instead, a priority inversion is said to occur. [[Source]](https://barrgroup.com/Embedded-Systems/How-To/RTOS-Priority-Inversion) * 但幸運的是，Priority Inversion 的風險可以透過修改作業系統裡 mutex 的實作而減少，其中會增加 acquiring and releasing muetexs 時的 overhead。 * 當 Semaphores 被用作 signaling 時，是不會造成 Priority Inversion 的，也因此是沒有必要去跟 Mutexes 一樣去更改 Semaphores 的實作。 * This is a second important reason for having distinct APIs for these two very different RTOS primitives. ## Deeper Look: Lock-Free Programming >導讀 [An Introduction to Lock-Free Programming](http://preshing.com/20120612/an-introduction-to-lock-free-programming/) 與 [Lock-Free 编程](http://www.cnblogs.com/gaochundong/p/lock_free_programming.html) 為了解決上述 semaphore, mutex 會產生的問題 (e.g. deadlock, livelock, priority invertion, ...)，可以使用 lock-free(lockless) programming 內的一些技巧以提昇效能與正確性 lock-free 不是說完全不用 mutex lock，而是指整個程式的執行不會被其單個執行緒 lock 鎖住 ``` while (x == 0) { x = 1 - x; } ``` * 由兩個執行緒同時執行這段程式碼，有可能出現兩者都無法跳出迴圈的情況嘛? |x|thread 1|thread 2| |-|-|-| |0|while(x == 0)|| |0||while(x == 0)| |1|x = 1 - x|| |0||x = 1 - x| |0|while(x == 0)| |0||while(x == 0)| ||...|...| ---- ### A. lock-free vs wait-free 參考: [Non-blocking algorithm](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) * 想像一個情況: 在使用 mutex 時，若其中一個執行緒在釋放 lock 之前被系統 preempt，則需要同一個資源的其他執行緒會無法繼續，直到原本的執行緒恢復執行完畢若系統 kill 執行緒，則整個程式就完全無法再繼續前進 * **wait-free** 是 non-blocking 之中強度最高的，它保證所有的指令可以 bound 在某個時間內完成，也就是所有的執行緒經過一段足夠的時間後，一定會有所進展 * **lock-free** 稍微弱一點，它允許部份執行緒被暫停，但仍然保證整體程式的其他部份會有進展最差的情況下，就算只有一個執行緒在執行，其他所有執行緒都在等待，仍然屬於 lock-free * obstruction-free，non-blocking 中最弱的，但不在今天的討論範圍 * 根據定義，所有 wait-free 都屬於 lock-free 故就算不使用 mutex lock (e.g. busy waiting, 上面的 spin lock)，不代表就是 lock-free ---- ### B. lock-free 的意義要求整個程式都是 lock-free 是極為困難的，通常 lock-free 會應用在一些特定的資料結構上，例如為 lock-free queue 實做 lock-free 版本的 `push`, `pop`, `isEmpty` 等 ---- ### C. lock-free 使用的技術 #### 1. Atomic Operation: Read-Modify-Write 此類的指令雖然包含複數個動作，但由硬體的設計保證就像是 load, store, add, ... 中間無法再插入其他指令延伸: [Read-modify-write](https://en.wikipedia.org/wiki/Read-modify-write) 常見的實做如: test-and-set, fetch-and-add, compare-and-swap 延伸: [Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap) ``` bool test_and_set(bool *target) //atomic { //non-interruptable operations bool b = *target; *target = TRUE; return b; } ``` ``` loop //lock init to FALSE { while(test_and_set(lock)) ; //block //critical section lock = FALSE; } ``` 各平台提供的 API * Win32: `_InterlockedIncrement` * iOS: `OSAtomicAdd32` * C++11: `std::atomic<int>::fetch_add` 值得一提的是，C++11 的 atomic operation 無法保證是 lock-free 的，必須透過 `std::atomic<>::is_lock_free` 來確認 #### 2. Compare-And-Swap Loops 在 RMW 中最常見的應用就是將 CAS 搭配 loop 使用 Win32 支援的 CSS: `_InterlockedCompareExchange` ```clike void LockFreeQueue::push(Node* newHead) { for (;;) { // Copy a shared variable (m_Head) to a local. Node* oldHead = m_Head; // Do some speculative work, not yet visible to other threads. newHead->next = oldHead; // Next, attempt to publish our changes to the shared variable. // If the shared variable hasn't changed, the CAS succeeds and we return. // Otherwise, repeat. if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead) return; } } ``` * 上例為何可被稱為 lock-free? 因為如果其中一個執行緒的迴圈無法通過 `_InterlockedCompareExchange`，代表有其他的執行緒成功了，程式整體是有進展的 * 在處理 CAS Loop 時要特別小心 [ABA Problem](https://en.wikipedia.org/wiki/ABA_problem): 若兩個執行緒同時對一個記憶體內容進行操作，但優先執行完畢的執行緒把原值 A 改爲 B 又改回 A，這樣後執行的執行緒將會讀取錯誤的資料 ![](https://i.imgur.com/SparxyO.jpg) #### 3. Sequential Consistancy 所有的執行緒都依循順序執行程式，簡單來說，就是前面的指令一定比後面的指令先執行 ``` $ gcc -s main.c $ cat main.s ... store x, 1 load r1, y ... ``` * sequential consistancy 保證不會出現 ``` //in CPU load r1, y store x, 1 ``` 的情況 #### 4. Memory Ordering 參考: [浅谈Memory Reordering](http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/) * 當一個 **lock-free** 程式在**多核**系統執行，且執行環境**不保證 sequential consistancy** 時，就要小心 memory reordering ![](http://preshing.com/images/marked-example2.png) ![](http://preshing.com/images/reordered.png) memory reordering 代表程式的執行順序不一定，可能會造成錯誤結果實例: [Memory Reordering Caught in the Act](http://preshing.com/20120515/memory-reordering-caught-in-the-act/) 注意發生的條件 * 如果保證 sequential consistancy，那就不用擔心會出錯 * 如果在**單核**系統執行，那只要關閉 compiler optimization 就可以保證 sequential consistancy * 如果程式**並非 lock-free**，那 critical section 的指令會因為有 lock 而被 block 住，也保證了執行的正確性 reordering 可以分成兩種: * compiler reordering 實例: [Memory Ordering at Compile Time](http://preshing.com/20120625/memory-ordering-at-compile-time/) * processor reordering 其實就是在 compile time 跟 run time 兩個時間點，因為最佳化產生指令重新排序的關係 ---- ### D. lock-free 衍生的問題上文提到兩個，一個是 CAS loop 裡面的 ABA 問題，另一個是在討論 sequential consistancy 時提到的 memory ordering，這裡著重於第二點 #### 1. 根據最佳化時間點不同，有不同的解法 * Compiler Barrier [浅谈Memory Reordering](http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/) 和 [Memory Ordering at Compile Time](http://preshing.com/20120625/memory-ordering-at-compile-time/) 的後半部都有實際演練，沒有乖乖讀的人現在可以點進去老實讀完 * Memory Barrier 圖解: [Memory Barriers Are Like Source Control Operations](http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/) 定義: [Acquire and Release Semantics](http://preshing.com/20120913/acquire-and-release-semantics/) 總共有 4 種 barrier，而透過這 4 種 barrier 可組合成 aquire & release semantics ![](http://preshing.com/images/acq-rel-barriers.png) 引述: "**Acquire semantics** prevent memory reordering of the read-acquire with any read or write operation which follows it in program order. ... **Release semantics** prevent memory reordering of the write-release with any read or write operation which precedes it in program order." ![](http://preshing.com/images/read-acquire.png) ![](http://preshing.com/images/write-release.png) 透過 Acquire and Release Semantics 來限制優化，讓記憶體相關操作必須在特定語句前後完成 #### 2. Memory Models 延伸: [Weak vs. Strong Memory Models](http://preshing.com/20120930/weak-vs-strong-memory-models/) memory reordering 發生的機率，跟兩者有關，一個是硬體(processor)，另一個是軟體(toolchain) 根據不同的執行環境可以列出下表: ![](https://i.imgur.com/Z0LWi07.png) 所謂的 weak 就是指有很大的機率會將 load/store 指令進行重排，而 strong 則相反 1. **Weak memory model**: 可能存在所有的 memory reordering 2. **Weak with data dependency ordering**: 可能存在 storeload 和 storestore 的reordering (load 的順序必能被保證) 3. **Usually Strong**: 保證 acquire and release semantic 的執行,所以只存在 storeload reordering 4. **Sequentially consisten**: 全部被保證順序 ## 程式實做 ### OpenMP OpenMP是開發在shared memory架構的平行程式API，它使用特殊形式的前置敘述或註解，來指導編譯器將特定的程式區塊變成平行執行，使用者幾乎不用修改程式本身架構即能完成程式的平行化，目前大部分的C/C++編譯器都能支援OpenMP，只要在編譯程式時加入OpenMP的選項，即會在編譯過程中將特定程式區塊進行平行化。 #### Simple Example ```clike= #include <stdio.h> #include <stdlib.h> #include <omp.h> void Hello(void); int main (int argc, char *argv[]){ int thread_count = strtol(argv[1], NULL, 10); # pragma omp parallel num_threads(thread_count) Hello(); return 0; } void Hello(void){ int my_rank = omp_get_thread_num(); int thread_count = omp_get_num_threads(); printf("Hello from thread %d of %d\n", my_rank, thread_count); } ``` #### 編譯&結果 - Openmp編譯需要在command最後面加上 -fopenmp * example: ``` gcc -o OpenMp_hello OpenMp_hello.c -fopenmp ```` 記得要加上參數進去 ``` ./pth_hello 5 ``` - 編譯結果 ``` Hello from thread 0 of 5 Hello from thread 3 of 5 Hello from the main thread Hello from thread 1 of 5 Hello from thread 2 of 5 Hello from thread 4 of 5 ``` --- ### POSIX Thread POSIX標準中支援的執行緒函式庫稱為 pthread，我們可以透過 pthread 結構與 pthread_create() 函數執行某個函數指標，以建立新的執行緒。 #### Simple Example ```clike= #include <stdio.h> #include <stdlib.h> #include <pthread.h> int thread_count; void *Hello(void *rank); int main(int argc, char *argv[]){ long thread; pthread_t *thread_handles; thread_count = strtol(argv[1], NULL, 10); thread_handles = malloc(thread_count*sizeof(pthread_t)); for(thread = 0; thread < thread_count; thread++){ pthread_create(&thread_handles[thread], NULL, Hello, (void *) thread); } printf("Hello from the main thread\n"); for(thread = 0; thread < thread_count; thread++){ pthread_join(thread_handles[thread], NULL); } free(thread_handles); return 0; } void *Hello(void *rank){ long my_rank = (long)rank;//typecasting printf("Hello from thread %ld of %d\n", my_rank, thread_count); return NULL; } ``` #### 編譯&結果 - P_thread編譯需要在command最後面加上 -pthraed * example: ``` gcc -o pth_hello pth_hello.c -pthread ```` 記得要加上參數進去 ``` ./pth_hello 5 ``` - 編譯結果 ``` Hello from thread 0 of 5 Hello from thread 3 of 5 Hello from the main thread Hello from thread 1 of 5 Hello from thread 2 of 5 Hello from thread 4 of 5 ``` #### Pthread function 怎麼寫 ```clike= void *Hello(void *rank){ long my_rank = (long)rank;//typecasting printf("Hello from thread %ld of %d\n", my_rank, thread_count); return NULL; } ``` #### Pthread Function注意的事項 * pthread function argument 和 retuen value 的type 均需要 (void *) * [How to return a value from thread in C](http://stackoverflow.com/questions/2251452/how-to-return-a-value-from-thread-in-c) #### Pthread 常用 function 1. pthread_create 這個Function的作用是用來產生一個Thread並執行附帶的Function，附帶有4個參數原始的定義 ```clike= int pthread_create(pthread_t *tid, const pthread_attr_t *attr, void *(*function)(void *), void *argument) ``` * 參數1. pthread_t *tid為pthread的指標，在使用Thread之前必須要先宣告一個pthread_t的變數。 * 參數2. const pthread_attr_t *attr為該Thread的屬性，預設是NULL，如果沒有其他特殊的需求直接填入NULL即可。 * 參數3. void *(*function)(void *)為Function pointer，這邊要放入你要執行的Function。 * 參數4. void *argument為Function pointer所要帶的參數。回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。 example: pthread_create( &thread1, NULL , showmessage , message); 2. pthread_exit 原始的定義 ```clike= void pthread_exit (void *value_ptr) ``` 這個Function的作用是用來關閉一個Thread，附帶有1個參數。 * 參數1: void *value_ptr用來設定執行成功時該Thread會回傳的值，這個值可由pthread_join()這個Function來取得。 * 回傳值: 不會回傳任何值。 example: ```cliek= pthread_exit(NULL); ``` 3. pthread_cancel 原始的定義這個Function的作用是用來關閉一個指定的Thread，附帶有一個參數。 ```cliek= int pthread_cancel (pthread_t thread) ``` * 參數1: pthread_t thread為要關閉的Thread。 * 回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。 example: ```clike= pthread_cancel(thread1); ``` 4. pthread_join 原始的定義 ```clike= int pthread_join (pthread_t thread, void **value_ptr) ``` 這個Function的作用會暫停目前執行pthread_join的Thread，等到目標Thread執行完畢之後目前執行pthread_join的Thread才會繼續執行，附帶有2個參數。 * 參數1: pthread_t thread為要等待的目標Thread。 * 參數2: void **value_ptr用來取得目標Thread的回傳值。 * 回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。 #### 案例一 * main 函式建立 3 個執行緒，其中 2 個執行任務並更新名為 "count" 的變數 * 第 3 個執行緒等待 "count" 變數的內含值成為某個指定數值. ```C= #include <pthread.h> #include <stdio.h> #include <stdlib.h> #define NUM_THREADS 3 #define TCOUNT 10 #define COUNT_LIMIT 12 int count = 0; int thread_ids[3] = {0,1,2}; pthread_mutex_t count_mutex; pthread_cond_t count_threshold_cv; void *inc_count(void *t) { long my_id = (long)t; for (int i = 0; i < TCOUNT; i++) { pthread_mutex_lock(&count_mutex); count++; // Check the value of count and signal waiting thread when condition is // reached. Note that this occurs while mutex is locked. if (count == COUNT_LIMIT) { pthread_cond_signal(&count_threshold_cv); printf("inc_count(): thread %ld, count = %d Threshold reached.\n", my_id, count); } printf("inc_count(): thread %ld, count = %d, unlocking mutex\n", my_id, count); pthread_mutex_unlock(&count_mutex); /* Do some "work" so threads can alternate on mutex lock */ sleep(1); } pthread_exit(NULL); } void *watch_count(void *t) { long my_id = (long) t; printf("Starting watch_count(): thread %ld\n", my_id); /* Lock mutex and wait for signal. Note that the pthread_cond_wait routine will automatically and atomically unlock mutex while it waits. Also, note that if COUNT_LIMIT is reached before this routine is run by the waiting thread, the loop will be skipped to prevent pthread_cond_wait from never returning. */ pthread_mutex_lock(&count_mutex); while (count < COUNT_LIMIT) { pthread_cond_wait(&count_threshold_cv, &count_mutex); printf("watch_count(): thread %ld Condition signal received.\n", my_id); count += 125; printf("watch_count(): thread %ld count now = %d.\n", my_id, count); } pthread_mutex_unlock(&count_mutex); pthread_exit(NULL); } int main (int argc, char *argv[]) { int i, rc; long t1 = 1, t2 = 2, t3 = 3; pthread_t threads[3]; pthread_attr_t attr; /* Initialize mutex and condition variable objects */ pthread_mutex_init(&count_mutex, NULL); pthread_cond_init (&count_threshold_cv, NULL); /* For portability, explicitly create threads in a joinable state */ pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); pthread_create(&threads[0], &attr, watch_count, (void *)t1); pthread_create(&threads[1], &attr, inc_count, (void *)t2); pthread_create(&threads[2], &attr, inc_count, (void *)t3); /* Wait for all threads to complete */ for (i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL); printf ("Main(): Waited on %d threads. Done.\n", NUM_THREADS); /* Clean up and exit */ pthread_attr_destroy(&attr); pthread_mutex_destroy(&count_mutex); pthread_cond_destroy(&count_threshold_cv); pthread_exit(NULL); } ``` #### 案例二: raytracing [程式碼](https://github.com/paul5566/RayTracing_PThread) * 執行結果(CPU 4 core) * thraed越多並不一定跑得比較快,超出反而會造成負擔 | Technique used | Execution time | | ----------------- | --------------------- | | *original* | **6.245250** sec | | *pthread (2 threads)*| **1.145604** sec | | *pthread (4 threads)*| **0.878356** sec | | *pthread (8 threads)*| **0.889436** sec | #### reference [Pthread 程式撰寫](http://blog.xuite.net/nikoung/wretch/154071878-Pthread+%E7%A8%8B%E5%BC%8F%E6%92%B0%E5%AF%AB) [Getting Started With POSIX Threads](http://www.csie.ntu.edu.tw/~r92094/c++/pthread.txt) --- ### Thread Pool [source](http://stackoverflow.com/questions/6297428/existing-threadpool-c-implementation) 裡面有提到幾個 C Thread Pool 的實作範例與可參考的文件 * [threadpool-mbrossard](https://github.com/mbrossard/threadpool) * [threadpool-jmatthew](http://people.clarkson.edu/~jmatthew/cs644.archive/cs644.fa2001/proj/locksmith/code/ExampleTest/) * [cthreadpool](http://sourceforge.net/projects/cthpool/) * [C-Thread-Pool](https://github.com/Pithikos/C-Thread-Pool) 以 threadpool-mbrossard 作為第一個研究的版本，因為仍在維護，而且作者是 [Existing threadpool C implementation](http://stackoverflow.com/questions/6297428/existing-threadpool-c-implementation) 的發文者。 **A simple C thread pool implementation** Currently, the implementation: * Works with pthreads only, but API is intentionally opaque to allow other implementations (Windows for instance). * Starts all threads on creation of the thread pool. * Reserves one task for signaling the queue is full. * Stops and joins all worker threads on destroy. **Thread Pool 的資料結構** 首先 Thread Pool 要有的東西就是 job 或者是 task 讓 Thread 知道他們要做什麼事情。 ```C= typedef struct { void (*function)(void *); void *argument; } threadpool_task_t; ``` 所以只要有一個資料結構紀錄要執行的 function pointer 與要傳遞的參數即可。接下來就是 Thread Pool 本身，他必須存放所有的 Thread 與 Job Queue: ```C= struct threadpool_t { pthread_mutex_t lock; pthread_cond_t notify; pthread_t *threads; threadpool_task_t *queue; int thread_count; int queue_size; int head; int tail; int count; int shutdown; int started; }; ``` 這邊使用一個 pthread_t 的 pointer 來紀錄所有的 Thread，簡單來說，就是一個 pthread_t 的 array，而 head, tail 就是紀錄 array 的 offset。 threadpool_task_t 也是一樣的原理，真是出乎意料的簡單。 **ThreadPool 的建立與工作的執行** 再來就是 Thread Pool 的建立，由於剛剛提到的他其實是使用一個 pthread array 與一個 job array 來存放所有的 thread 與 jobs。因此需要在一開始的時候就決定 Thread Pool 與 Jobs 的最大數量。 ```C= /* Allocate thread and task queue */ pool->threads = (pthread_t *) malloc(sizeof(pthread_t) * thread_count); pool->queue = (threadpool_task_t *) malloc(sizeof(threadpool_task_t) * queue_size); ``` 而每個 Thread 要排入執行的 callback function 透過以下： ```C= static void *threadpool_thread(void *threadpool) { threadpool_t *pool = (threadpool_t *) threadpool; threadpool_task_t task; for (;;) { /* Lock must be taken to wait on conditional variable */ pthread_mutex_lock(&(pool->lock)); /* Wait on condition variable, check for spurious wakeups. When returning from `pthread_cond_wait`(), we own the lock. */ while((pool->count == 0) && (!pool->shutdown)) { pthread_cond_wait(&(pool->notify), &(pool->lock)); } if((pool->shutdown == immediate_shutdown) || ((pool->shutdown == graceful_shutdown) && (pool->count == 0))) { break; } /* Grab our task */ task.function = pool->queue[pool->head].function; task.argument = pool->queue[pool->head].argument; pool->head += 1; pool->head = (pool->head == pool->queue_size) ? 0 : pool->head; pool->count -= 1; /* Unlock */ pthread_mutex_unlock(&(pool->lock)); /* Get to work */ (*(task.function))(task.argument); } pool->started--; pthread_mutex_unlock(&(pool->lock)); pthread_exit(NULL); return(NULL); } ``` 在 **for(;;)** 裡面，Thread 第一件要做的事情就是去搶奪 pool 的 lock，當搶到 lock 的 Thread 發現沒有工作可以做的時候，就會執行 pthread_cond_wait 來等待通知。這時候 pool->lock 會被 Unlock，因此這時候其他 Thread 也可以進來這個區域。所以在完全沒有工作的情況下，所有的 Thread 都會在這邊 waiting。當 Thread 被透過 pthread_cond_signal 喚醒的時候，該 Thread 就會重新取得 pool->lock。這時他就可以安心的取出 queue 中的 task，等待取出完畢之後，再 unlock 讓其他被喚醒的 Thread 也可以去取得 Task。之後就是執行 task 中的 function pointer 做該做的工作。 **ThreadPool 的 destory** destory 就更簡單了，只要使用 pthread_cond_broadcast 通知所有的 Thread 起來，由於 shoutdown 的確認會在執行工作之前。所以該 thread 就會離開執行工作的迴圈，並且結束。 [mbrossard 完整的 ThreadPool 原始碼](https://github.com/mbrossard/threadpool/blob/master/src/threadpool.c) --- ### Lock-free Thread Pool [ [source](http://blog.csdn.net/xhjcehust/article/details/45844901) ] 大多數 thread pool 實做都離不開 lock 的使用，如 pthread_mutex 結合 (condition variable) pthread_cond。一般來說，lock 的使用對於程式效能影響較大，雖然現有的 pthread_mutex 在 lock 的取得 (acquire) 與釋放，已在 Linux 核心和對應函式庫進行效能提昇，但我們仍會希望有不仰賴 lock 的 thread pool 的實做。 **常見 thread pool 實做原理** ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_VJmq0R0ILi6_p.537916_1459234875775_001.PNG) 如上圖所示，workqueue (工作佇列) 由 main thread (主執行緒) 和 worker thread 共享，main thread 將任務放進 workqueue，worker thread 從 workqueue 中取出任務執行。要注意到，共享 workqueue 的操作必須在 mutex 的保護下安全進行，main thread 將任務放進 workqueue 時，若偵測到目前待執行的工作數目小於 worker thread 總數，則要透過 condition variable 喚醒可能處於等待狀態的 worker thread。 **lock-free 化 thread pool 實做原理** ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_VJmq0R0ILi6_p.537916_1459234893887_002.PNG) 為解決 lock-free 所面臨的議題，我們一定要避免共享資源的競爭 (contention)，因此將共享 workqueue 加以拆分成每 worker thread 一個 workqueue 的方式，如上圖。對於 main thread 放入工作和 worker thread 取出任務的競爭問題，可以採取 ring-buffer 的方式避免。在解決了lock 機制後，就只剩下 condition variable 的問題，condition variable 本質上就是提出來解決執行緒之間的通訊議題，不然為了追蹤特定變數，如之前提到的 "count" 變數，還得額外建立執行緒來追蹤。而 semaphore 作為一種通信方式，可以代替之，其大體開發模式為： ```C= sigemptyset(&zeromask); sigemptyset(&newmask); sigaddset(&newmask, SIGXX); sigprocmask(SIG_BLOCK, &newmask, &oldmask); while (!CONDITION) sigsuspend(&zeromask); sigprocmask(SIG_SETMASK, &oldmask, NULL) ``` **lock-free thread pool 實做說明** 程式碼請見: [LFTPool](https://github.com/xhjcehust/LFTPool) 在 lock-free thread pool 的實做裡頭，有別於常見 thread pool 之處，在於 semaphore 與 condition variable 、排程演算法、增加或減少執行緒數目後的 task migration，另一點則是引入 [ring-buffer](https://en.wikipedia.org/wiki/Circular_buffer)，後者參考了 Linux 核心中的 kfifo 實做。 (1) **semaphore 與 condition variable** semaphore 與 condition variable 的區別，主要在於 condition variable 的喚醒 (signal)，對於接收端執行緒而言可以忽略，而在未設定 semaphore 處理函式的情況下，semaphore 的接收會導致接收執行緒，甚至是整個程式的終止。因此，需要在 thread pool 產生執行緒前，事先指定 semaphore 處理函式，如此一來，新建立的執行緒會繼承這個 semaphore 處理函式。在 LFTPool 中，用 SIGUSR1 這個 POSIX signal 來等待「有工作分發給我了」這個事件。「等待事件發生」是 semaphore 的一種功能，但不是 semaphore 唯一的用法。例：把 semaphore 初始化成 0，則可用來等待事件；把 semaphore 初始化成 1，則可用來 mutex。 :::warning 另外因為 POSIX 有另外 semaphore 機制 sem_{wait,post} 等，故在觀念上把此處的用法稱為 semaphore 也很容易混淆。 ::: :::info 原作者刻意用 POSIX signal，是為了展示他的程式是 "lockfree"：「在解决了锁机制之后，就只剩下条件变量的问题了，条件变量本身即解决条件满足时的线程通信问题，而信号作为一种通信方式，可以代替之」。但 POSIX signal 傳送在 kernel 中當然不是 lock-free 的。觀念上，只要程式在等待事件時要把 CPU 釋放出來，需要進入 scheduler，那個部份的 lock-free 與否就不太有意義了，因為 context switch 比 lock acquire/release 還要慢。 ::: (2) **任務排程演算法** 常見 thread pool 實做的任務排程，主要透過作業系統的 thread scheduler 來達成。考慮到 load-balancing，main thread 放入任務時應採取合適的任務排程演算法，將任務放入對應的 worker thread 隊列，本程式目前已實做 Round-Robin 和 Least-Load 演算法。Round-Robin 即輪詢式地分配工作，Least-Load 即選擇目前具有最少工作的 worker thread 放入。 (3) **Task migration** 在執行緒的動態增加和減少的過程中，同樣基於 load-balancing 的考量，涉及到現有 task 的遷移 (migration) 問題。load-balancing 演算法主要基於平均工作量的想法，即統計目前時刻的總任務數目，均分至每一個執行緒，求出每個工作者執行緒應該增加或減少的工作數目，然後從頭至尾遍歷，需要移出工作的執行緒與需要移入工作的執行緒執行 task migration，相互抵消。最後若還有多出來的工作，再依次分配。遷入 (migreate IN) 工作不存在競態，因為加入工作始終由 main thread 完成，而遷出 (migrate OUT) 工作則存在競態，因為在遷出工作的同時，worker thread 可能在同時執行任務。所以需要採用 atomic operation 加以修正，其主要思想就是預先擷取 (prefetch) 的技巧，大致實做程式碼如下： ```C= do { work = NULL; if (thread_queue_len(thread) <= 0) // thread_queue_len() 必須是 atomic break; tmp = thread ->out; // prefetch work work = &thread->work_queue[queue_offset(tmp)]; } while (!__sync_bool_compare_and_swap(&thread->out, tmp, tmp + 1)); if (work) do_something(); ``` * Test-and-Set instruction: [](https://en.wikipedia.org/wiki/Test-and-set)[https://en.wikipedia.org/wiki/Test-and-set](https://en.wikipedia.org/wiki/Test-and-set) * `__sync_bool_compare_and_swap` 是 GCC atomic builtin function 之一: [](https://gcc.gnu.org/onlinedocs/gcc-5.3.0/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins)[https://gcc.gnu.org/onlinedocs/gcc-5.3.0/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins](https://gcc.gnu.org/onlinedocs/gcc-5.3.0/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins) * C11 標準正式將 atomic operation 納入，請見: [](http://en.cppreference.com/w/c/atomic)[http://en.cppreference.com/w/c/atomic](http://en.cppreference.com/w/c/atomic) Atomic 是種同步機制，不須透過 explicit lock (如 mutex)，也能確保變數之間的同步。Linux 核心提供了對應的型態 `atomic_t`。而 gcc 也提供了內建的函式 (built-in functions)，專門處理 atomics，如下： ``` type __sync_fetch_and_add (type *ptr, type value, ...) type __sync_fetch_and_sub (type *ptr, type value, ...) ``` 上面的 "type" 可以是 int8 .. int64。最常用到的就是「加」和「減」的操作，使用如下： ```C= int val = 0; __syn_fetch_and_add(&val, 1); ``` 這是種輕量級的同步機制，若只是對一個變數去做同步，實在不需要透過 mutex。在真實硬體環境中，atomics 的效能會比 `pthread_mutex_t` 好許多。 atomics 通常會伴隨著探討 memory barrier，這是因為編譯器進行最佳化時，往往會為了最佳化去變更指令的順序，但在特定的狀況下，往往會帶來後遺症，為了避免衝擊，gcc 也提供以下內建函式來處理 memory barrier: __sync_synchronize() 在執行緒的動態減少後，原先執行緒上未能執行完的任務，只需要由 main thread 再次根據任務排程演算法重新分配至其他存活的 worker thread 隊列中即可，不存在上述問題，當然，此時可以同時執行 load-balancing 演算法加以最佳化。 (4) ring-buffer ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_VJmq0R0ILi6_p.537916_1459270052852_undefined) ring-buffer 實做主要參考了 Linux 核心中 kfifo 的實做，如下圖所示： ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_VJmq0R0ILi6_p.537916_1459234911980_003.PNG) 環狀佇列長度為 2 的整數次方，out 和 in 下標一直遞增至越界後迴轉，其類型為 `unsigned int`，即 out 指標一直追趕 in 指標，out 和 in 映射至 FiFo 的對應下標處，其間的元素即為隊列元素。 ## 案例分析 ### 對 Linked-List 排序從 [Bubble sort](https://en.wikipedia.org/wiki/Bubble_sort) 談起，如何有效操作不連續記憶體 * [Linked List Bubble Sort](http://faculty.salina.k-state.edu/tim/CMST302/study_guide/topic7/bubble.html) (*) * [Linked List Algorithms](http://c.learncodethehardway.org/book/ex33.html) * [A Comparative Study of Linked List Sorting Algorithms](https://www.cs.mtu.edu/~shene/PUBLICATIONS/1996/3Conline.pdf), Ching-Kuang Shene [[冼鏡光](https://zh.wikipedia.org/zh-tw/%E5%86%BC%E9%8F%A1%E5%85%89)] (1996) * 「冼」發音為ㄒㄧㄢˇ * [Linked Lists: Locking, LockFree, and Beyond](http://www.cs.nyu.edu/courses/fall05/G22.2631-001/lists.slides2.pdf) * **[concurrent-ll](https://github.com/jserv/concurrent-ll)**: concurrent linked-list 實作 * [ size = **128** ] ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204485254_ll.i128.u0.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204485261_ll.i128.u10.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204485266_ll.i128.u50.png) * [ size = **1024** ] ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204512845_ll.i1024.u0.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204512854_ll.i1024.u10.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204512859_ll.i1024.u50.png) * [ size = **8192** ] ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204535687_ll.i8192.u0.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204535694_ll.i8192.u10.png)![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_XNhmLBQPNHU_p.537916_1469204535701_ll.i8192.u50.png) ### Hungry Birds * 展示典型 Producer-Consumer Problem 在多執行緒、多核心環境的運作 * [hungry-birds](https://github.com/jserv/hungry-birds) implements unbounded lockless single consumer multiple producer FIFO queue. * 延伸閱讀: [C11 atomic variables and the kernel](https://lwn.net/Articles/586838/) ### MapReduce with POSIX Thread * 詳見共筆: [MapReduce with POSIX Thread](https://hackmd.io/s/Hkb-lXkyg) ### 通用伺服器架構 * 詳見共筆: [server-framework](https://hackmd.io/s/B1s8hX1yg) ### Concurrent B+tree * 提出一個並行化的 B+ tree 實作，並探討其中資料結構設計、read-write lock，以及該如何延展為一個高效能的資料庫系統 * 詳見共筆: [concurrent B+tree](https://hackmd.io/s/SyjKs-mxg) ## 延伸閱讀 [若渴計畫 Studying Concurrency](https://www.slideshare.net/aj0612/studying-concurrency) [Concurrency Kit](http://concurrencykit.org/) [Memory Barriers in the Linux Kernel: Semantics and Practices](https://www.youtube.com/watch?time_continue=1559&v=Ykk_U7LX_jA) [POSIX Threads Programming](https://computing.llnl.gov/tutorials/pthreads/) [Priority Inversion on Mars](http://wiki.csie.ncku.edu.tw/embedded/priority-inversion-on-Mars.pdf) [Getting Started With POSIX Threads 宋振華](http://www.csie.ntu.edu.tw/~r92094/c++/pthread.txt) ###### tags: `sysprog2017`,`Sean`