# **Mergesort Concurrent**
## **1. PThreads**
[**[Source]**](http://shop.oreilly.com/product/9781565921153.do)
### **Mode**
* Boss/Worker mode: when the boss receives a new request, it dynamically creates a worker thread to handle that request; well suited to servers. A fixed number of threads can be created when the program starts (a thread pool) to reduce runtime overhead. (A minimal sketch follows this list.)
* Peer mode: in contrast to Boss/Worker, every thread owns its own input; well suited to matrix computation, database searches, and the like.
* Pipeline mode: each thread takes its input from the previous stage and passes its result to the thread in the next stage. Overall throughput is limited by the slowest stage, so the workload of each stage should be kept as even as possible.
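A minimal sketch of the Boss/Worker mode (illustrative, not from the source; `handle_request()` is a hypothetical handler): the boss spawns one detached worker per incoming request:
```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request handler; stands in for real per-request work. */
static void *handle_request(void *arg)
{
    int id = (int)(intptr_t)arg;
    printf("worker handling request %d\n", id);
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Detached workers clean up after themselves; no join needed. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    /* Boss loop: one worker per incoming "request" (here, just a counter). */
    for (int id = 0; id < 4; id++) {
        pthread_t worker;
        pthread_create(&worker, &attr, handle_request, (void *)(intptr_t)id);
    }
    pthread_attr_destroy(&attr);
    pthread_exit(NULL);  /* exit the boss without killing the workers */
}
```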
### **Synchronization**
* **pthread_join:** a thread suspends execution until another thread terminates
* **Mutex:** only one thread at a time may hold the lock and access the data it protects (see the sketch after this list)
* Initialize
* Static: ```PTHREAD_MUTEX_INITIALIZER```
* Dynamic: ```pthread_mutex_init()```
* Lock
* ```pthread_mutex_lock()```
* Release
* ```pthread_mutex_unlock()```
* **Condition variable:** turns an event into a concrete object, with functions to wake up the threads waiting for that event
* Initialize
* Static: ```PTHREAD_COND_INITIALIZER```
* Dynamic: ```pthread_cond_init()```
* Waiting on a condition variable
* Wait until woken: ```pthread_cond_wait()```
* Wait with a timeout: ```pthread_cond_timedwait()```
* Waking waiting threads
* Wake one of the waiters: ```pthread_cond_signal()```
* Wake all waiting threads: ```pthread_cond_broadcast()```
* To guard against [**Spurious Wake Ups**](https://en.wikipedia.org/wiki/Spurious_wakeup), and against another woken thread running first and making the condition false again, the condition is normally tested in a loop, going back to waiting whenever it does not hold
* **pthread_once:** guarantees that an initialization function runs only once, even when called from many threads
* Declare a pthread_once_t and statically initialize it to ```PTHREAD_ONCE_INIT```
* Call the target function through pthread_once() with that once-control variable
* No arguments can be passed to the function protected by once (see the sketch below)
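A minimal sketch tying these primitives together: a single-slot producer/consumer handoff (the variable names and `init_logging()` are illustrative, not from the source):
```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;  /* static init */
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static pthread_once_t  once  = PTHREAD_ONCE_INIT;
static int value, has_value;          /* shared data protected by the mutex */

static void init_logging(void)       /* runs exactly once across all threads */
{
    puts("initialized");
}

static void *producer(void *arg)
{
    (void)arg;
    pthread_once(&once, init_logging);
    pthread_mutex_lock(&lock);
    value = 42;
    has_value = 1;
    pthread_cond_signal(&ready);      /* wake one waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_once(&once, init_logging);
    pthread_mutex_lock(&lock);
    while (!has_value)                /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready, &lock);
    printf("got %d\n", value);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);            /* block until each thread terminates */
    pthread_join(c, NULL);
    return 0;
}
```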
### **Pthreads Management**
#### Key
A key is a handle that associates data with individual threads: every thread stores and retrieves its own private value under the same key (thread-specific data). A minimal sketch follows.
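A minimal sketch of a key in use, assuming each thread simply stashes its own ID under a shared key (the names are illustrative):
```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_key_t key;   /* one key; each thread sees its own value */

static void *worker(void *arg)
{
    pthread_setspecific(key, arg);     /* store this thread's private value */
    intptr_t id = (intptr_t)pthread_getspecific(key);
    printf("thread sees id %ld\n", (long)id);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_key_create(&key, NULL);    /* NULL: no destructor for the values */
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_create(&t2, NULL, worker, (void *)(intptr_t)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_key_delete(key);
    return 0;
}
```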
#### Thread cancellation
* **States:**
* ```PTHREAD_CANCEL_DISABLE```
* **Type:** ignored while cancellation is disabled
* ```PTHREAD_CANCEL_ENABLE```
* **Types:**
* ```PTHREAD_CANCEL_ASYNCHRONOUS```: cancel immediately
* ```PTHREAD_CANCEL_DEFERRED```: cancellation takes effect when a cancellation point is reached
* **Cancellation-point**
* ```pthread_cond_wait(), pthread_cond_timedwait(), pthread_join()```
* User-defined: ```pthread_testcancel()```
* **Asynchronous Cancellation:** while a thread is in this state, it can be cancelled even in the middle of a library call or system call; unless the call is defined as cancellation-safe, this situation should be prevented (a sketch follows)
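A minimal sketch of deferred cancellation, assuming a worker that polls for cancellation inside a busy loop with `pthread_testcancel()`:
```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    /* Deferred mode: cancellation only takes effect at cancellation points. */
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
    for (;;) {
        /* ... do a slice of work ... */
        pthread_testcancel();  /* user-defined cancellation point */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);               /* let the worker run for a while */
    pthread_cancel(t);      /* request cancellation */
    pthread_join(t, NULL);  /* reaps the cancelled thread */
    puts("worker cancelled");
    return 0;
}
```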
## **2. Modern Microprocessors**
[**[Source]**](http://www.lighterra.com/papers/modernmicroprocessors/)
### **2-1. Pipelining**
* One instruction is executing, the next instruction is being decoded, and the one after that is being fetched...

* At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a **pipeline latch**; it then flows through that stage's logic during the cycle, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle.

* Since the result is available after the execute stage, the next instruction ought to be able to use that value immediately. To allow this, forwarding lines called **bypasses** are added

### **2-2. Deeper Pipelines – Superpipelining**
The logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages.
* The processor can be run at a **higher clock speed** (because the workload per cycle decreases)
* The processor still completes 1 instruction per cycle -> more instructions per second
### **2-3. Superscalar**
* The execute stage of the pipeline consists of **different functional units** (each doing its own task)
:::info
We can execute multiple instructions in parallel with the fetch and decode/dispatch stages enhanced to ***decode multiple instructions*** in parallel and send them out to the "execution resources."
:::

* There are independent pipelines for each functional unit
* Simpler instructions complete more quickly
* It's normal to refer to the depth of a processor's pipeline when executing integer instructions (usually the shortest)
* A bunch of bypasses within and between the various pipelines
* Even more functional units could be added

* The issue width is typically less than the number of functional units.
:::info
The number of instructions able to be issued, executed or completed per cycle is called a **processor's width**.
:::
* Superpipeline + Superscalar (just called superscalar for short)

### **2-4. Very Long Instruction Word (VLIW)**
The **instructions** are groups of little sub-instructions
* Each instruction contains information for multiple parallel operations
* Much like a superscalar, except the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions
* Most VLIW designs are not **interlocked**
* not check for dependencies
* often have no way of stalling instructions
:::danger
Compiler needs to insert the appropriate number of cycles between dependent instructions if necessary.
:::
### **2-5. Instruction Dependencies & Latencies**
* The number of cycles between when an instruction reaches the execute stage and when its result is available to be used is called the **instruction's latency**
* The deeper pipeline could easily get filled up with bubbles due to instructions depending on each other
* Latencies for memory loads are particularly troublesome
* difficult to fill their delays with useful instructions
* unpredictable
### **2-6. Branches & Branch Prediction**
* When the processor encounters a **conditional branch**, it must guess which way the branch will go to avoid losing the performance gained from pipelining.
* Those instructions will not be committed until the outcome of the branch is known.
* guess wrong: the speculatively executed instructions are cancelled (wasted cycles)
* guess right: the processor continues at full speed
* How the processor makes the guess
* Static branch prediction: the compiler marks which way to go (a compiler-hint sketch follows this section)
* a bit in the instruction format to **encode** the prediction
* by **convention** (e.g., backward branches assumed taken, forward branches assumed not taken)
* Dynamic prediction at runtime: an **on-chip** branch prediction table
* two-level adaptive predictor
* gshare or gselect predictor
* implementing several branch predictors and selecting between them based on which one is working best for each individual branch
* Deep pipelines suffer from [diminishing returns](https://en.wikipedia.org/wiki/Diminishing_returns)
:::info
The deeper the pipeline, the further into the future you must predict, the more likely you'll be wrong, and the greater the mispredict penalty.
:::
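As a concrete instance of compiler-marked prediction, GCC and Clang provide `__builtin_expect`, which lets the programmer hint which way a branch usually goes (a sketch; the `unlikely` macro and `process()` function are illustrative):
```c
#include <stdio.h>

/* Hint that the error path is rare, so the compiler lays out (and some ISAs
 * encode) the branch as "usually not taken". */
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (unlikely(fd < 0)) {   /* predicted not taken */
        fprintf(stderr, "bad descriptor\n");
        return -1;
    }
    /* common fast path */
    return 0;
}

int main(void)
{
    printf("%d\n", process(3));   /* fast path */
    printf("%d\n", process(-1));  /* rare path */
    return 0;
}
```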
### **2-7. Eliminating Branches with Predication**
```
cmp a, 7 ; a > 7 ?
ble L1
mov c, b ; b = c
br L2
L1: mov d, b ; b = d
L2: ...
```
Simplified with a predicated instruction:
:::info
**Predicated instruction:** works by executing as normal, but only commits if condition is true
:::
```
cmp a, 7 ; a > 7 ?
mov c, b ; b = c
cmovle d, b ; if le, then b = d
```
The first mov always executes, and the cmovle then overwrites its result if necessary (a C equivalent appears after the note below).
:::warning
If the blocks of code in the if and else cases were longer, using predication would mean executing more instructions than using a branch
:::
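The same idea in C: a short ternary like the one below is a candidate for a conditional move rather than a branch (a sketch; whether a cmov is actually emitted depends on the compiler and target):
```c
#include <stdio.h>

int select_value(int a, int b, int c, int d)
{
    /* Branch-free candidate: both sides are cheap, so the compiler may
     * compute c, then conditionally overwrite it with d (cmov-style). */
    b = (a > 7) ? c : d;
    return b;
}

int main(void)
{
    printf("%d\n", select_value(9, 0, 1, 2));  /* a > 7  -> prints 1 */
    printf("%d\n", select_value(3, 0, 1, 2));  /* a <= 7 -> prints 2 */
    return 0;
}
```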
### **2-8. Instruction Scheduling, Register Renaming & OOO**
* Find a couple of other instructions from further down in the program to fill the bubbles caused by branches and long-latency instructions in the pipeline(s)
* Two ways to do that:
**1.** Reorder in hardware at **runtime**: the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order.
**Register renaming:**
* Not dealing with the raw architecturally-defined registers, but a set of **renamed registers**
* By being mapped to different physical registers, the instructions can be executed in parallel
* The processor must keep a mapping of the in-flight instructions and the physical registers they use
:::info
A larger set of real registers extracts even more parallelism out of the code
:::
**2.** The compiler optimizes the code by rearranging the instructions, called static, or compile-time, instruction scheduling.
* avoiding complex OOO logic
* can see further down the program than the hardware can
:::warning
Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a **cache miss**
:::
### **2-9. The Brainiac Debate**
**brainiac vs speed-demon** debate:
Whether the costly out-of-order logic is really warranted, or whether compilers can schedule code well enough without it
* Brainiac designs: with lots of OOO hardware trying to squeeze every last drop of instruction-level parallelism
* Speed-demon designs: relying on a smart compiler
### **2-10. The Power Wall & The ILP Wall**
Increasing the clock speed of a processor will typically increase its power usage even more than proportionally
* The transistors switch more often
* The voltage also needs to be increased to drive the signals through the circuits faster to meet the shorter timing requirements
* Leakage current also goes up as the voltage is increased
Power increases linearly with clock frequency and as the square of voltage:
$$P \propto fV^{2}$$
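A worked example with illustrative numbers (not from the article): raising the clock by 20% while also raising the voltage by 20% to meet the tighter timing gives
$$\frac{P_{2}}{P_{1}} = \frac{f_{2}}{f_{1}}\left(\frac{V_{2}}{V_{1}}\right)^{2} = 1.2 \times 1.2^{2} \approx 1.73$$
so a 20% clock speedup costs roughly 73% more power.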
**Power wall:** it is simply not possible to keep supplying more power to, and removing more heat from, a silicon chip in any practical fashion
**ILP wall:** normal programs don't have a lot of fine-grained parallelism
### **2-11. What About x86?**
**Problem:** the complex and messy x86 instruction set.
**Solution:** Dynamically decode the x86 instructions into RISC-like **micro-instructions (μops)**, then executed by a RISC-style **register-renaming** OOO superscalar core.

**Improvement:**
* A buffer or μop instruction cache avoids repeatedly translating the same instructions
:::info
Pipeline depth
* 14 stages when the processor is running from its L0 μop cache (which is the common case)
* 19 stages when running from the L1 instruction cache and having to translate the x86 instructions into μops
:::
### **Threads – SMT, Hyper-Threading & Multi-Core**
* If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions: other running programs, or other threads within the same program
**->** used to fill those empty bubbles in the pipelines
* **Simultaneous multi-threading (SMT):** one physical processor core to present two or more logical processors to the system
* duplicating all of the parts of the processor which store the "execution state" of each thread
* other resources, such as the decoders and dispatch logic, the functional units, and the caches, are shared between the threads

:::warning
Should not be confused with multi-processor or multi-core processor, but there's nothing preventing a multi-core implementation where each core is an SMT design.
:::
* Downsides:
* If one thread saturates one functional unit which the other threads need, it effectively stalls all of the other threads
* Competition between the threads for cache space
* Applications which are limited primarily by memory latency benefit dramatically from SMT, since it offers a way of using the otherwise idle time
:::info
The Pentium 4 was the first processor to use SMT, which Intel calls **hyper-threading**.
:::
### **2-12. More Cores or Wider Cores?**
* The complex multiple-issue dispatch logic scales up as roughly the square of the issue width (each of the n candidate instructions must be compared against every other candidate)
* For applications with lots of active but memory-latency-limited threads, more simple cores would be better
* For most applications, there simply are not enough threads active, and the performance of just a single thread is much more important
### **2-13. Data Parallelism – SIMD Vector Instructions**
* Rather than looking for ways to execute groups of instructions in parallel, SIMD makes one instruction apply to a group of data values in parallel.
:::info
More often, it's called vector processing.
:::
* Adding packed 8-bit values this way is the same operation as a 32-bit addition, except that every 8th carry is not propagated, so each byte lane stays independent.

* It is possible to define entirely new, wider registers **->** even more data processed in parallel (a sketch using SSE2 intrinsics follows)
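A minimal sketch of this in practice, assuming an x86 machine with SSE2 (the intrinsics are standard; the scenario is illustrative): a single instruction adds 16 byte lanes at once, and no carry propagates between lanes:
```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    unsigned char a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (unsigned char)i; b[i] = 250; }

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* One instruction, 16 independent 8-bit additions; each lane wraps
     * on its own instead of carrying into its neighbour. */
    __m128i vsum = _mm_add_epi8(va, vb);
    _mm_storeu_si128((__m128i *)out, vsum);

    for (int i = 0; i < 16; i++)
        printf("%u ", out[i]);
    printf("\n");
    return 0;
}
```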
### **2-14. Memory & The Memory Wall**
* Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded **->** hard to achieve ILP
* The laws of physics hinder building a fast memory system
* speed of light: delays as signals travel out to RAM and back
* the slow speed of charging and draining the tiny capacitors that store the bits
* **Memory wall:** the gap between the processor and main memory
### **2-15. Caches & The Memory Hierarchy**
* A cache is a **small but fast** type of memory located on or near the processor chip, used to solve the problem of the memory wall
* **Memory hierarchy:** The combination of the on-chip caches, off-chip external cache and main memory... (lower level **->** larger but slower)
* Caches achieve amazing hit rates because most programs exhibit locality
* **Temporal Locality:** there's a good chance a program will need to re-access the same piece of memory in the near future
**->** exploited by keeping recently accessed data in the cache
* **Spatial Locality:** there's a good chance a program will need to access other nearby memory in the future
**->** data is transferred from main memory into the cache in blocks of a few dozen bytes **(cache line)** at a time (a loop-order sketch follows this section)
* A cache works like a two-column table
* The upper part of the address (the **tag**) is stored and compared during lookup
* The lower part of the address is used to index the cache

* Cache Lookup
* virtual address: the cache might need to be flushed on every context switch
* physical address: the virtual-to-physical mapping must be performed as part of the cache lookup
* **virtually-indexed physically-tagged:** the virtual-to-physical mapping (TLB lookup) can be performed in parallel with the cache indexing
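A minimal sketch of why spatial locality matters, assuming a row-major C array: the first traversal walks memory sequentially, consuming each cache line fully, while the second strides across lines and misses far more often:
```c
#include <stdio.h>

#define N 1024
static int m[N][N];  /* row-major: m[i][0..N-1] are contiguous in memory */

long sum_row_major(void)      /* cache-friendly: sequential accesses */
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

long sum_column_major(void)   /* cache-hostile: strides of N * sizeof(int) */
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%ld %ld\n", sum_row_major(), sum_column_major());
    return 0;
}
```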
### **2-16. Cache Conflicts & Associativity**
* A cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache
* **Cache Conflict:** two memory locations which map to the same cache location are wanted at the same time
* **Thrashing:** a program repeatedly accesses two memory locations which happen to map to the same cache line, so the cache must keep storing to and loading from main memory
* **Associativity:** The number of places a piece of data can be stored in a cache
* Mapping schemes
* **Direct-mapped:** each piece of data is simply mapped to the single cache slot given by address % size (see the sketch below)
* **Set-associative:** there are several tables, all indexed in parallel, and the tags from each table are compared
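A minimal sketch of the direct-mapped address split, assuming a hypothetical 32 KB cache with 64-byte lines (512 slots); a set-associative cache indexes several such tables with the same bits:
```c
#include <stdio.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumed) */
#define NUM_LINES 512u   /* a 32 KB direct-mapped cache (assumed) */

int main(void)
{
    unsigned addr = 0x12345678u;

    unsigned offset = addr % LINE_SIZE;                /* byte within the line */
    unsigned index  = (addr / LINE_SIZE) % NUM_LINES;  /* which cache slot */
    unsigned tag    = addr / (LINE_SIZE * NUM_LINES);  /* compared on lookup */
    printf("tag=%#x index=%u offset=%u\n", tag, index, offset);

    /* Two addresses exactly LINE_SIZE * NUM_LINES bytes apart share an index
     * but differ in tag -> they conflict; alternating between them thrashes. */
    unsigned rival = addr + LINE_SIZE * NUM_LINES;
    printf("rival index=%u (same slot)\n", (rival / LINE_SIZE) % NUM_LINES);
    return 0;
}
```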
[**Further reading**](https://embedded2015.hackpad.com/ep/pad/static/QXYBh2n9wLD)
### **2-17. Memory Bandwidth vs Latency**
* The transfer rate of a memory system is called its **bandwidth**
* Improving bandwidth: adding more memory banks and making the busses wider
* Improving latency: much harder; synchronously clocked DRAM ([**SDRAM**](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory)) is one of the few improvements
[**Source of This Note**](https://hackmd.io/s/H161bNj0)