# **Mergesort Concurrent**
## **1. PThreads**
[**[Source]**](http://shop.oreilly.com/product/9781565921153.do)
### **Mode**
* Boss/Worker mode: when the boss receives a new request, it dynamically creates a worker thread to handle that request; well suited to servers. A fixed number of threads can be created when the program starts (a thread pool) to reduce runtime overhead. (A minimal sketch follows this list.)
* Peer mode: in contrast to Boss/Worker, every thread owns its own input; well suited to matrix computation, database searches, and the like.
* Pipeline mode: each thread takes its input from the previous stage and passes its result to the thread in the next stage. Overall throughput is limited by the slowest stage, so the workload of each stage should be kept as even as possible.
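A minimal sketch of the Boss/Worker mode (illustrative, not from the source; `handle_request()` is a hypothetical handler): the boss spawns one detached worker per incoming request:
```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request handler; stands in for real per-request work. */
static void *handle_request(void *arg)
{
    int id = (int)(intptr_t)arg;
    printf("worker handling request %d\n", id);
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Detached workers clean up after themselves; no join needed. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    /* Boss loop: one worker per incoming "request" (here, just a counter). */
    for (int id = 0; id < 4; id++) {
        pthread_t worker;
        pthread_create(&worker, &attr, handle_request, (void *)(intptr_t)id);
    }
    pthread_attr_destroy(&attr);
    pthread_exit(NULL);  /* exit the boss without killing the workers */
}
```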
### **Synchronization**
* **pthread_join:** a thread suspends execution until another thread terminates
* **Mutex:** only one thread at a time may hold the lock and access the data it protects (see the sketch after this list)
* Initialize
* Static: ```PTHREAD_MUTEX_INITIALIZER```
* Dynamic: ```pthread_mutex_init()```
* Lock
* ```pthread_mutex_lock()```
* Release
* ```pthread_mutex_unlock()```
* **Condition variable:** turns an event into a concrete object, with functions to wake up the threads waiting for that event
* Initialize
* Static: ```PTHREAD_COND_INITIALIZER```
* Dynamic: ```pthread_cond_init()```
* Waiting on a condition variable
* Wait until woken: ```pthread_cond_wait()```
* Wait with a timeout: ```pthread_cond_timedwait()```
* Waking waiting threads
* Wake one of the waiters: ```pthread_cond_signal()```
* Wake all waiting threads: ```pthread_cond_broadcast()```
* To guard against [**Spurious Wake Ups**](https://en.wikipedia.org/wiki/Spurious_wakeup), and against another woken thread running first and making the condition false again, the condition is normally tested in a loop, going back to waiting whenever it does not hold
* **pthread_once:** guarantees that an initialization function runs only once, even when called from many threads
* Declare a pthread_once_t and statically initialize it to ```PTHREAD_ONCE_INIT```
* Call the target function through pthread_once() with that once-control variable
* No arguments can be passed to the function protected by once (see the sketch below)
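A minimal sketch tying these primitives together: a single-slot producer/consumer handoff (the variable names and `init_logging()` are illustrative, not from the source):
```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;  /* static init */
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static pthread_once_t  once  = PTHREAD_ONCE_INIT;
static int value, has_value;          /* shared data protected by the mutex */

static void init_logging(void)       /* runs exactly once across all threads */
{
    puts("initialized");
}

static void *producer(void *arg)
{
    (void)arg;
    pthread_once(&once, init_logging);
    pthread_mutex_lock(&lock);
    value = 42;
    has_value = 1;
    pthread_cond_signal(&ready);      /* wake one waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_once(&once, init_logging);
    pthread_mutex_lock(&lock);
    while (!has_value)                /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready, &lock);
    printf("got %d\n", value);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);            /* block until each thread terminates */
    pthread_join(c, NULL);
    return 0;
}
```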
### **Pthreads Management**
#### Key
A key is a handle that associates data with individual threads: every thread stores and retrieves its own private value under the same key (thread-specific data). A minimal sketch follows.
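A minimal sketch of a key in use, assuming each thread simply stashes its own ID under a shared key (the names are illustrative):
```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_key_t key;   /* one key; each thread sees its own value */

static void *worker(void *arg)
{
    pthread_setspecific(key, arg);     /* store this thread's private value */
    intptr_t id = (intptr_t)pthread_getspecific(key);
    printf("thread sees id %ld\n", (long)id);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_key_create(&key, NULL);    /* NULL: no destructor for the values */
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_create(&t2, NULL, worker, (void *)(intptr_t)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_key_delete(key);
    return 0;
}
```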
#### Thread cancellation
* **States:**
* ```PTHREAD_CANCEL_DISABLE```
* **Type:** ignored while cancellation is disabled
* ```PTHREAD_CANCEL_ENABLE```
* **Types:**
* ```PTHREAD_CANCEL_ASYNCHRONOUS```: cancel immediately
* ```PTHREAD_CANCEL_DEFERRED```: cancellation takes effect when a cancellation point is reached
* **Cancellation-point**
* ```pthread_cond_wait(), pthread_cond_timedwait(), pthread_join()```
* User-defined: ```pthread_testcancel()```
* **Asynchronous Cancellation:** while a thread is in this state, it can be cancelled even in the middle of a library call or system call; unless the call is defined as cancellation-safe, this situation should be prevented (a sketch follows)
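A minimal sketch of deferred cancellation, assuming a worker that polls for cancellation inside a busy loop with `pthread_testcancel()`:
```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    /* Deferred mode: cancellation only takes effect at cancellation points. */
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
    for (;;) {
        /* ... do a slice of work ... */
        pthread_testcancel();  /* user-defined cancellation point */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);               /* let the worker run for a while */
    pthread_cancel(t);      /* request cancellation */
    pthread_join(t, NULL);  /* reaps the cancelled thread */
    puts("worker cancelled");
    return 0;
}
```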
## **2. Modern Microprocessors**
[**[Source]**](http://www.lighterra.com/papers/modernmicroprocessors/)
### **2-1. Pipelining**
* One instruction is executing, the next instruction is being decoded, and the one after that is being fetched...

* At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a **pipeline latch**; it then flows through that stage's logic during the cycle, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle.

* Since the result is available after the execute stage, the next instruction ought to be able to use that value immediately. To allow this, forwarding lines called **bypasses** are added

### **2-2. Deeper Pipelines – Superpipelining**
The logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages.
* The processor can be run at a **higher clock speed** (because the workload per cycle decreases)
* The processor still completes 1 instruction per cycle -> more instructions per second
### **2-3. Superscalar**
* The execute stage of the pipeline consists of **different functional units** (each doing its own task)
:::info
We can execute multiple instructions in parallel with the fetch and decode/dispatch stages enhanced to ***decode multiple instructions*** in parallel and send them out to the "execution resources."
:::

* There are independent pipelines for each functional unit
* Simpler instructions complete more quickly
* It's normal to refer to the depth of a processor's pipeline when executing integer instructions (usually the shortest)
* A bunch of bypasses within and between the various pipelines
* Even more functional units could be added

* The issue width is typically less than the number of functional units.
:::info
The number of instructions able to be issued, executed or completed per cycle is called a **processor's width**.
:::
* Superpipeline + Superscalar (just called superscalar for short)

### **2-4. Very Long Instruction Word (VLIW)**
The **instructions** are groups of little sub-instructions
* Each instruction contains information for multiple parallel operations
* Much like a superscalar, except the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions
* Most VLIW designs are not **interlocked**
* not check for dependencies
* often have no way of stalling instructions
:::danger
Compiler needs to insert the appropriate number of cycles between dependent instructions if necessary.
:::
### **2-5. Instruction Dependencies & Latencies**
* The number of cycles between when an instruction reaches the execute stage and when its result is available to be used is called the **instruction's latency**
* The deeper pipeline could easily get filled up with bubbles due to instructions depending on each other
* Latencies for memory loads are particularly troublesome
* difficult to fill their delays with useful instructions
* unpredictable
### **2-6. Branches & Branch Prediction**
* When the processor encounters a **conditional branch**, it must guess which way the branch will go to avoid losing the performance gained from pipelining.
* Those instructions will not be committed until the outcome of the branch is known.
* guess wrong: the speculatively executed instructions are cancelled (wasted cycles)
* guess right: the processor continues at full speed
* How the processor makes the guess
* Static branch prediction: the compiler marks which way to go (a compiler-hint sketch follows this section)
* a bit in the instruction format to **encode** the prediction
* by **convention** (e.g., backward branches assumed taken, forward branches assumed not taken)
* Dynamic prediction at runtime: an **on-chip** branch prediction table
* two-level adaptive predictor
* gshare or gselect predictor
* implementing several branch predictors and selecting between them based on which one is working best for each individual branch
* Deep pipelines suffer from [diminishing returns](https://en.wikipedia.org/wiki/Diminishing_returns)
:::info
The deeper the pipeline, the further into the future you must predict, the more likely you'll be wrong, and the greater the mispredict penalty.
:::
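As a concrete instance of compiler-marked prediction, GCC and Clang provide `__builtin_expect`, which lets the programmer hint which way a branch usually goes (a sketch; the `unlikely` macro and `process()` function are illustrative):
```c
#include <stdio.h>

/* Hint that the error path is rare, so the compiler lays out (and some ISAs
 * encode) the branch as "usually not taken". */
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (unlikely(fd < 0)) {   /* predicted not taken */
        fprintf(stderr, "bad descriptor\n");
        return -1;
    }
    /* common fast path */
    return 0;
}

int main(void)
{
    printf("%d\n", process(3));   /* fast path */
    printf("%d\n", process(-1));  /* rare path */
    return 0;
}
```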
### **2-7. Eliminating Branches with Predication**
```
cmp a, 7 ; a > 7 ?
ble L1
mov c, b ; b = c
br L2
L1: mov d, b ; b = d
L2: ...
```
Simplified with a predicated instruction:
:::info
**Predicated instruction:** works by executing as normal, but only commits if condition is true
:::
```
cmp a, 7 ; a > 7 ?
mov c, b ; b = c
cmovle d, b ; if le, then b = d
```
The first mov always executes, and the cmovle then overwrites its result if necessary (a C equivalent appears after the note below).
:::warning
If the blocks of code in the if and else cases were longer, using predication would mean executing more instructions than using a branch
:::
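The same idea in C: a short ternary like the one below is a candidate for a conditional move rather than a branch (a sketch; whether a cmov is actually emitted depends on the compiler and target):
```c
#include <stdio.h>

int select_value(int a, int b, int c, int d)
{
    /* Branch-free candidate: both sides are cheap, so the compiler may
     * compute c, then conditionally overwrite it with d (cmov-style). */
    b = (a > 7) ? c : d;
    return b;
}

int main(void)
{
    printf("%d\n", select_value(9, 0, 1, 2));  /* a > 7  -> prints 1 */
    printf("%d\n", select_value(3, 0, 1, 2));  /* a <= 7 -> prints 2 */
    return 0;
}
```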
### **2-8. Instruction Scheduling, Register Renaming & OOO**
* Find a couple of other instructions from further down in the program to fill the bubbles caused by branches and long-latency instructions in the pipeline(s)
* Two ways to do that:
**1.** Reorder in hardware at **runtime**: the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order.
**Register renaming:**
* Not dealing with the raw architecturally-defined registers, but a set of **renamed registers**
* By being mapped to different physical registers, the instructions can be executed in parallel
* The processor must keep a mapping of the in-flight instructions and the physical registers they use
:::info
A larger set of real registers extracts even more parallelism out of the code
:::
**2.** The compiler optimizes the code by rearranging the instructions, called static, or compile-time, instruction scheduling.
* avoiding complex OOO logic
* can see further down the program than the hardware can
:::warning
Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a **cache miss**
:::
### **2-9. The Brainiac Debate**
**brainiac vs speed-demon** debate:
Whether the costly out-of-order logic is really warranted, or whether compilers can schedule code well enough without it
* Brainiac designs: with lots of OOO hardware trying to squeeze every last drop of instruction-level parallelism
* Speed-demon designs: relying on a smart compiler
### **2-10. The Power Wall & The ILP Wall**
Increasing the clock speed of a processor will typically increase its power usage even more than proportionally
* The transistors switch more often
* The voltage also needs to be increased to drive the signals through the circuits faster to meet the shorter timing requirements
* Leakage current also goes up as the voltage is increased
Power increases linearly with clock frequency and as the square of voltage:
$$P \propto fV^{2}$$
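A worked example with illustrative numbers (not from the article): raising the clock by 20% while also raising the voltage by 20% to meet the tighter timing gives
$$\frac{P_{2}}{P_{1}} = \frac{f_{2}}{f_{1}}\left(\frac{V_{2}}{V_{1}}\right)^{2} = 1.2 \times 1.2^{2} \approx 1.73$$
so a 20% clock speedup costs roughly 73% more power.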
**Power wall:** it is simply not possible to keep supplying more power to, and removing more heat from, a silicon chip in any practical fashion
**ILP wall:** normal programs don't have a lot of fine-grained parallelism
### **2-11. What About x86?**
**Problem:** the complex and messy x86 instruction set.
**Solution:** Dynamically decode the x86 instructions into RISC-like **micro-instructions (μops)**, then executed by a RISC-style **register-renaming** OOO superscalar core.

**Improvement:**
* A buffer or μop instruction cache avoids repeatedly translating the same instructions
:::info
Pipeline depth
* 14 stages when the processor is running from its L0 μop cache (which is the common case)
* 19 stages when running from the L1 instruction cache and having to translate the x86 instructions into μops
:::
### **Threads – SMT, Hyper-Threading & Multi-Core**
* If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions: other running programs, or other threads within the same program
**->** used to fill those empty bubbles in the pipelines
* **Simultaneous multi-threading (SMT):** one physical processor core to present two or more logical processors to the system
* duplicating all of the parts of the processor which store the "execution state" of each thread
* other resources, such as the decoders and dispatch logic, the functional units, and the caches, are shared between the threads

:::warning
Should not be confused with multi-processor or multi-core processor, but there's nothing preventing a multi-core implementation where each core is an SMT design.
:::
* Downsides:
* If one thread saturates one functional unit which the other threads need, it effectively stalls all of the other threads
* Competition between the threads for cache space
* Applications which are limited primarily by memory latency benefit dramatically from SMT, since it offers a way of using the otherwise idle time
:::info
The Pentium 4 was the first processor to use SMT, which Intel calls **hyper-threading**.
:::
### **2-12. More Cores or Wider Cores?**
* The complex multiple-issue dispatch logic scales up as roughly the square of the issue width (each of the n candidate instructions must be compared against every other candidate)
* For applications with lots of active but memory-latency-limited threads, more simple cores would be better
* For most applications, there simply are not enough threads active, and the performance of just a single thread is much more important
### **2-13. Data Parallelism – SIMD Vector Instructions**
* Rather than looking for ways to execute groups of instructions in parallel, SIMD makes one instruction apply to a group of data values in parallel.
:::info
More often, it's called vector processing.
:::
* Adding packed 8-bit values this way is the same operation as a 32-bit addition, except that every 8th carry is not propagated, so each byte lane stays independent.

* It is possible to define entirely new, wider registers **->** even more data processed in parallel (a sketch using SSE2 intrinsics follows)
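A minimal sketch of this in practice, assuming an x86 machine with SSE2 (the intrinsics are standard; the scenario is illustrative): a single instruction adds 16 byte lanes at once, and no carry propagates between lanes:
```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    unsigned char a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (unsigned char)i; b[i] = 250; }

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* One instruction, 16 independent 8-bit additions; each lane wraps
     * on its own instead of carrying into its neighbour. */
    __m128i vsum = _mm_add_epi8(va, vb);
    _mm_storeu_si128((__m128i *)out, vsum);

    for (int i = 0; i < 16; i++)
        printf("%u ", out[i]);
    printf("\n");
    return 0;
}
```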
### **2-14. Memory & The Memory Wall**
* Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded **->** hard to achieve ILP
* The laws of physics hinder building a fast memory system
* speed of light: delays as signals travel out to RAM and back
* the slow speed of charging and draining the tiny capacitors that store the bits
* **Memory wall:** the gap between the processor and main memory
### **2-15. Caches & The Memory Hierarchy**
* A cache is a **small but fast** type of memory located on or near the processor chip, used to solve the problem of the memory wall
* **Memory hierarchy:** The combination of the on-chip caches, off-chip external cache and main memory... (lower level **->** larger but slower)
* Caches achieve amazing hit rates because most programs exhibit locality
* **Temporal Locality:** there's a good chance a program will need to re-access the same piece of memory in the near future
**->** exploited by keeping recently accessed data in the cache
* **Spatial Locality:** there's a good chance a program will need to access other nearby memory in the future
**->** data is transferred from main memory into the cache in blocks of a few dozen bytes **(cache line)** at a time (a loop-order sketch follows this section)
* A cache works like a two-column table
* The upper part of the address (the **tag**) is stored and compared during lookup
* The lower part of the address is used to index the cache

* Cache Lookup
* virtual address: the cache might need to be flushed on every context switch
* physical address: the virtual-to-physical mapping must be performed as part of the cache lookup
* **virtually-indexed physically-tagged:** the virtual-to-physical mapping (TLB lookup) can be performed in parallel with the cache indexing
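A minimal sketch of why spatial locality matters, assuming a row-major C array: the first traversal walks memory sequentially, consuming each cache line fully, while the second strides across lines and misses far more often:
```c
#include <stdio.h>

#define N 1024
static int m[N][N];  /* row-major: m[i][0..N-1] are contiguous in memory */

long sum_row_major(void)      /* cache-friendly: sequential accesses */
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

long sum_column_major(void)   /* cache-hostile: strides of N * sizeof(int) */
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%ld %ld\n", sum_row_major(), sum_column_major());
    return 0;
}
```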
### **2-16. Cache Conflicts & Associativity**
* A cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache
* **Cache Conflict:** two memory locations which map to the same cache location are wanted at the same time
* **Thrashing:** a program repeatedly accesses two memory locations which happen to map to the same cache line, so the cache must keep storing to and loading from main memory
* **Associativity:** The number of places a piece of data can be stored in a cache
* Mapping schemes
* **Direct-mapped:** each piece of data is simply mapped to the single cache slot given by address % size (see the sketch below)
* **Set-associative:** there are several tables, all indexed in parallel, and the tags from each table are compared
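A minimal sketch of the direct-mapped address split, assuming a hypothetical 32 KB cache with 64-byte lines (512 slots); a set-associative cache indexes several such tables with the same bits:
```c
#include <stdio.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumed) */
#define NUM_LINES 512u   /* a 32 KB direct-mapped cache (assumed) */

int main(void)
{
    unsigned addr = 0x12345678u;

    unsigned offset = addr % LINE_SIZE;                /* byte within the line */
    unsigned index  = (addr / LINE_SIZE) % NUM_LINES;  /* which cache slot */
    unsigned tag    = addr / (LINE_SIZE * NUM_LINES);  /* compared on lookup */
    printf("tag=%#x index=%u offset=%u\n", tag, index, offset);

    /* Two addresses exactly LINE_SIZE * NUM_LINES bytes apart share an index
     * but differ in tag -> they conflict; alternating between them thrashes. */
    unsigned rival = addr + LINE_SIZE * NUM_LINES;
    printf("rival index=%u (same slot)\n", (rival / LINE_SIZE) % NUM_LINES);
    return 0;
}
```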
[**Further reading**](https://embedded2015.hackpad.com/ep/pad/static/QXYBh2n9wLD)
### **2-17. Memory Bandwidth vs Latency**
* The transfer rate of a memory system is called its **bandwidth**
* Improving bandwidth: adding more memory banks and making the busses wider
* Improving latency: much harder; synchronously clocked DRAM ([**SDRAM**](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory)) is one of the few improvements
[**Source of This Note**](https://hackmd.io/s/H161bNj0)