
Mergesort Concurrent

1. PThreads

[Source]

Mode

  • Boss/Worker Mode: when the boss receives a new request, it dynamically creates a worker thread to handle it; this model suits servers. A fixed number of threads can be created up front (a thread pool) to reduce the runtime overhead of thread creation.
  • Peer Mode: unlike the Boss/Worker model, every thread has its own input; this model suits matrix computation, database searches, and similar workloads (see the sketch after this list).
  • Pipeline Mode: each thread takes its input from the previous stage and passes its result to the thread of the next stage. Overall throughput is limited by the slowest stage, so the work should be balanced across stages as evenly as possible.
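
A minimal sketch of the peer model in C with pthreads, using a hypothetical `work` array split evenly among the threads (names and sizes here are illustrative); each peer handles its own slice independently, which is the same structure a concurrent mergesort uses to sort sub-ranges in parallel.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1024

    /* Hypothetical shared input: each peer owns one contiguous slice. */
    static int work[N];
    static long partial[NUM_THREADS];

    typedef struct { int id, lo, hi; } slice_t;

    static void *peer(void *arg)
    {
        slice_t *s = arg;
        long sum = 0;
        for (int i = s->lo; i < s->hi; i++)   /* work only on this thread's slice */
            sum += work[i];
        partial[s->id] = sum;                 /* no locking needed: disjoint writes */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        slice_t s[NUM_THREADS];

        for (int i = 0; i < N; i++)
            work[i] = i;

        for (int i = 0; i < NUM_THREADS; i++) {
            s[i] = (slice_t){ i, i * N / NUM_THREADS, (i + 1) * N / NUM_THREADS };
            pthread_create(&tid[i], NULL, peer, &s[i]);
        }

        long total = 0;
        for (int i = 0; i < NUM_THREADS; i++) {
            pthread_join(tid[i], NULL);       /* wait for each peer to finish */
            total += partial[i];
        }
        printf("total = %ld\n", total);
        return 0;
    }

Build with the pthread library, e.g. `gcc peers.c -pthread`.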

Synchronization

  • pthread_join: one thread suspends execution until another thread terminates

  • Mutex: at any given time, only one thread can hold the lock and access the protected data (see the combined example after this list)

    • Initialize
      • Static:PTHREAD_MUTEX_INITIALIZER
      • Dynamic:pthread_mutex_init()
    • Lock
      • pthread_mutex_lock()
    • Release
      • pthread_mutex_unlock()
  • Condition variable: materializes an event and provides functions to wake up the threads waiting for that event

    • Initialize
      • Static:PTHREAD_COND_INITIALIZER
      • Dynamic:pthread_cond_init()
    • Wait on a condition variable
      • Wait until woken: pthread_cond_wait()
      • Wait with a timeout: pthread_cond_timedwait()
    • Wake up waiting threads
      • Wake up one of them: pthread_cond_signal()
      • Wake up all waiting threads: pthread_cond_broadcast()
    • To guard against spurious wakeups, or against another woken thread running first and making the condition false again, the condition is normally re-checked in a loop; if it does not hold, the thread waits again (see the combined example after this list)
  • pthread_once: guarantees that an initialization function is executed only once, even when it is called from many threads

    • Declare a pthread_once_t variable and statically initialize it to PTHREAD_ONCE_INIT
    • Call the target function through pthread_once() together with that once control variable
    • No arguments can be passed to the function protected by pthread_once
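
A minimal sketch in C combining the calls above, assuming a hypothetical `ready` flag as the shared condition: the mutex protects the flag, the waiter re-checks it in a loop to cope with spurious wakeups, and pthread_once runs one-time setup.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* static init */
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static pthread_once_t  once = PTHREAD_ONCE_INIT;
    static int ready = 0;                     /* the condition being waited on */

    static void init_once(void)               /* runs exactly once, takes no arguments */
    {
        printf("one-time initialization\n");
    }

    static void *waiter(void *arg)
    {
        (void)arg;
        pthread_once(&once, init_once);
        pthread_mutex_lock(&lock);
        while (!ready)                        /* loop guards against spurious wakeups */
            pthread_cond_wait(&cond, &lock);  /* atomically releases the lock while waiting */
        printf("condition became true\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);

        pthread_mutex_lock(&lock);            /* only one thread may hold the lock */
        ready = 1;                            /* modify shared state under the lock */
        pthread_cond_signal(&cond);           /* wake one waiting thread */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }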

Pthreads Management

Key

A kind of pointer that associates data with an individual thread (thread-specific data); see the sketch below.
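
A minimal sketch of thread-specific data using a key; the per-thread storage and the values stored are illustrative, not from the original note.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_key_t key;                 /* one key, a distinct value per thread */

    static void destructor(void *value)       /* called when a thread with data exits */
    {
        free(value);
    }

    static void *worker(void *arg)
    {
        int *slot = malloc(sizeof *slot);     /* per-thread storage */
        *slot = (int)(long)arg;
        pthread_setspecific(key, slot);       /* associate the data with this thread */

        int *mine = pthread_getspecific(key); /* later: retrieve this thread's copy */
        printf("thread sees %d\n", *mine);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_key_create(&key, destructor);
        pthread_create(&a, NULL, worker, (void *)1L);
        pthread_create(&b, NULL, worker, (void *)2L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_key_delete(key);
        return 0;
    }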

Thread cancellation

  • States:
    • PTHREAD_CANCEL_DISABLE
      • Type: ignored
    • PTHREAD_CANCEL_ENABLE
      • Types:
        • PTHREAD_CANCEL_ASYNCHRONOUS: cancel immediately
        • PTHREAD_CANCEL_DEFERRED: cancellation happens when a cancellation point is reached
  • Cancellation-point
    • pthread_cond_wait(), pthread_cond_timedwait(), pthread_join()
    • User-defined: pthread_testcancel()
  • Asynchronous Cancellation: while a thread is in this state it can be cancelled even in the middle of a library call or system call; this should be avoided unless the call is defined as cancellation-safe (see the sketch after this list)
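
A minimal sketch of deferred cancellation; the busy loop and the sleep length are illustrative only.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        /* Deferred cancellation (the default): a cancel request only
           takes effect at cancellation points. */
        pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);

        for (;;) {
            /* ... long-running computation with no library calls ... */
            pthread_testcancel();   /* user-defined cancellation point */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);                   /* let the worker run for a while */
        pthread_cancel(t);          /* request cancellation */
        pthread_join(t, NULL);      /* reap the cancelled thread */
        printf("worker cancelled\n");
        return 0;
    }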

2. Modern Microprocessors

[Source]

2-1. Pipelining

  • One instruction is executing, the next instruction is being decoded, and the one after that is being fetched

  • At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch; each stage produces its output just in time for it to be captured by the next pipeline latch at the end of the clock cycle.

  • Since the result is available after the execute stage, the next instruction ought to be able to use that value immediately. To allow this, forwarding lines called bypasses are added

2-2. Deeper Pipelines – Superpipelining

The logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages.

  • Processor can be run at a higher clock speed (because the workload per cycle decreases)
  • Processor still completes 1 instruction per cycle -> more instructions per second

2-3. Superscalar

  • The execute stage of the pipeline consists of different functional units (each doing its own task)

    We can execute multiple instructions in parallel with the fetch and decode/dispatch stages enhanced to decode multiple instructions in parallel and send them out to the "execution resources."

  • There are independent pipelines for each functional unit

    • Simpler instructions complete more quickly
    • It's normal to refer to the depth of a processor's pipeline when executing integer instructions (usually the shortest)
    • A bunch of bypasses within and between the various pipelines
    • Even more functional units could be added

  • The number of instructions able to be issued, executed or completed per cycle is called a processor's width.

    Note that the issue width is less than the number of functional units.

  • Superpipeline + Superscalar (just called superscalar for short)

2-4. Very Long Instruction Word (VLIW)

The instructions are groups of little sub-instructions

  • Each instruction contains information for multiple parallel operations
  • Much like a superscalar, except the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions
  • Most VLIW designs are not interlocked
    • not check for dependencies
    • often have no way of stalling instructions

    The compiler needs to insert the appropriate number of cycles between dependent instructions where necessary.

2-5. Instruction Dependencies & Latencies

  • The number of cycles between when an instruction reaches the execute stage and when its result is available to be used is called the instruction's latency

  • The deeper pipeline could easily get filled up with bubbles due to instructions depending on each other

  • Latencies for memory loads are particularly troublesome

    • difficult to fill their delays with useful instructions
    • unpredictable

2-6. Branches & Branch Prediction

  • When the processor encounters a conditional branch, it must make a guess, to avoid losing the performance gained from pipelining.

  • The instructions fetched along the guessed path will not be committed until the outcome of the branch is known.

    • Guess wrong: those instructions are cancelled (the cycles are wasted)
    • Guess right: the processor continues on at full speed
  • How the processor makes the guess

    • Static branch prediction: the compiler marks which way to go
      • a bit in the instruction format to encode the prediction
      • convention
    • Guess at runtime: on-chip branch prediction table
      • two-level adaptive predictor
      • gshare or gselect predictor (see the sketch after this list)
      • implement several branch predictors and select between them based on which one is working best for each individual branch
  • Deep pipelines suffer from diminishing returns

    The deeper the pipeline, the further into the future you must predict, the more likely you'll be wrong, and the greater the mispredict penalty.
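
As an illustration of the dynamic predictors mentioned above, here is a minimal gshare-style sketch in C, assuming a small table of 2-bit saturating counters indexed by the branch address XORed with a global history register (the table size and addresses are arbitrary).

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define TABLE_BITS 12                       /* 4096-entry pattern history table */
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t  counters[TABLE_SIZE];       /* 2-bit saturating counters (0..3) */
    static uint32_t history;                    /* global branch history register */

    /* Predict taken if the counter is in one of the two "taken" states. */
    static bool predict(uint32_t branch_pc)
    {
        uint32_t index = (branch_pc ^ history) & (TABLE_SIZE - 1);
        return counters[index] >= 2;
    }

    /* Update with the actual outcome once the branch resolves. */
    static void update(uint32_t branch_pc, bool taken)
    {
        uint32_t index = (branch_pc ^ history) & (TABLE_SIZE - 1);
        if (taken  && counters[index] < 3) counters[index]++;
        if (!taken && counters[index] > 0) counters[index]--;
        history = (history << 1) | (taken ? 1 : 0);  /* shift in the new outcome */
    }

    int main(void)
    {
        /* Toy usage: a branch at an illustrative PC that alternates taken/not-taken. */
        uint32_t pc = 0x400a10;
        for (int i = 0; i < 8; i++) {
            bool taken = (i % 2 == 0);
            bool guess = predict(pc);
            update(pc, taken);
            printf("guess=%d actual=%d\n", guess, taken);
        }
        return 0;
    }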

2-7. Eliminating Branches with Predication

    cmp a, 7    ; a > 7 ?
    ble L1
    mov c, b    ; b = c
    br L2
L1: mov d, b    ; b = d
L2: ...

Simplified with a predicated instruction:

Predicated instruction: works by executing as normal, but only commits if condition is true

    cmp a, 7       ; a > 7 ?
    mov c, b       ; b = c
    cmovle d, b    ; if le, then b = d

The first mov is always executed, and then overwritten by the conditional move if necessary.

If the blocks of code in the if and else cases were longer, using predication would mean executing more instructions than using a branch
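
The same idea at the C level, as a hedged sketch: the branchy form is compiled to a conditional jump, while the branchless form below is commonly lowered to a cmov-style predicated instruction on x86 or ARM (whether it actually is depends on the compiler and optimization level).

    /* Branchy version: a mispredicted jump costs pipeline flushes. */
    int select_branch(int a, int b, int c, int d)
    {
        if (a > 7)
            b = c;
        else
            b = d;
        return b;
    }

    /* Branchless version: typically compiled to a conditional move,
       so both values are available and one result is kept. */
    int select_cmov(int a, int b, int c, int d)
    {
        b = (a > 7) ? c : d;
        return b;
    }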

2-8. Instruction Scheduling, Register Renaming & OOO

  • Find a couple of other instructions from further down in the program to fill the bubbles caused by branches and long-latency instructions in the pipeline(s)

  • Two ways to do that:

    1. Reorder in hardware at runtime: the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order.
    Register renaming:

    • Not dealing with the raw architecturally-defined registers, but a set of renamed registers
    • By being mapped to different physical registers, the instructions can be executed in parallel
    • The processor must keep a mapping of the instructions and the physical registers in flight

    A larger set of real registers extracts even more parallelism out of the code

    2. The compiler optimizes the code by rearranging the instructions, called static, or compile-time, instruction scheduling.

    • avoiding complex OOO logic
    • see further down the program than the hardware

    Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss

2-9. The Brainiac Debate

brainiac vs speed-demon debate:
Whether the costly out-of-order logic is really warranted, or compilers can do well enough without it

  • Brainiac designs:with lots of OOO hardware trying to squeeze every last drop of instruction-level parallelism
  • Speed-demon designs:relying on a smart compiler

2-10. The Power Wall & The ILP Wall

Increasing the clock speed of a processor will typically increase its power usage even more

  • The transistors switch more often
  • The voltage also needs to be increased to drive the signals through the circuits faster to meet the shorter timing requirements
  • Leakage current also goes up as the voltage is increased

Power increases linearly with clock frequency, and increases as the square of voltage

$P \propto f \cdot V^2$

Power wall: it is simply not possible to provide that much power and cooling to a silicon chip in any practical fashion

ILP wall: normal programs don't have a lot of fine-grained parallelism

2-11. What About x86?

Problem: the complex and messy x86 instruction set.

Solution: Dynamically decode the x86 instructions into RISC-like micro-instructions (μops), then executed by a RISC-style register-renaming OOO superscalar core.

Improvement:

  • A buffer or μop instruction cache avoids translating the same instructions repeatedly

    Pipeline depth

    • 14 stages when the processor is running from its L0 μop cache (which is the common case)
    • 19 stages when running from the L1 instruction cache and having to translate the x86 instructions into μops

Threads – SMT, Hyper-Threading & Multi-Core

  • If additional independent instructions aren't available, there is another potential source of independent instructions – other running programs, or other threads within the same program
    -> To fill those empty bubbles in the pipelines

  • Simultaneous multi-threading (SMT): one physical processor core to present two or more logical processors to the system

    • duplicating all of the parts of the processor which store the "execution state" of each thread
    • other resources, such as the decoders and dispatch logic, the functional units, and the caches, are shared between the threads

    Should not be confused with multi-processor or multi-core processor, but there's nothing preventing a multi-core implementation where each core is an SMT design.

    • Downsides:
      • If one thread saturates one functional unit which the other threads need, it effectively stalls all of the other threads
      • Competition between the threads for cache space
    • Applications which are limited primarily by memory latency benefit dramatically from SMT, since it offers a way of using the otherwise idle time

    The Pentium 4 was the first processor to use SMT, which Intel calls hyper-threading.

2-12. More Cores or Wider Cores?

  • The complex multiple-issue dispatch logic scales up as roughly the square of the issue width (n candidate instructions compared against every other candidate)

  • For applications with lots of active but memory-latency-limited threads, more simple cores would be better

  • For most applications, there simply are not enough threads active, and the performance of just a single thread is much more important

2-13. Data Parallelism – SIMD Vector Instructions

  • Rather than looking for ways to execute groups of instructions in parallel, SIMD makes one instruction apply to a group of data values in parallel.

    More often, it's called vector processing.

  • A packed 8-bit addition is the same operation as a 32-bit addition, except that every 8th carry is not propagated (see the sketch after this list).

  • It is possible to define entirely new registers -> more data to be processed in parallel
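
A hedged sketch of the idea using x86 SSE2 intrinsics: one _mm_add_epi8 instruction adds 16 bytes in parallel, with carries confined to each byte lane (the array contents are illustrative).

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        unsigned char a[16], b[16], out[16];
        for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 10; }

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);

        /* One instruction performs 16 independent 8-bit additions;
           a carry out of one byte never propagates into the next. */
        __m128i vsum = _mm_add_epi8(va, vb);

        _mm_storeu_si128((__m128i *)out, vsum);
        for (int i = 0; i < 16; i++)
            printf("%d ", out[i]);
        printf("\n");
        return 0;
    }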

2-14. Memory & The Memory Wall

  • Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded -> hard to achieve ILP

  • The facts of nature hinder a fast memory system

    • speed of light: delays as signals travel out to RAM and back
    • slow speed of charging and draining the tiny capacitors
  • Memory wall: the gap between the processor and main memory

2-15. Caches & The Memory Hierarchy

  • A cache is a small but fast type of memory located on or near the processor chip, used to solve the problem of the memory wall

  • Memory hierarchy: The combination of the on-chip caches, off-chip external cache and main memory (lower level -> larger but slower)

  • Caches achieve amazing hit rates because most programs exhibit locality

    • Temporal Locality: there's a good chance a program will need to reaccess the same piece of memory in the near future
      -> exploited by keeping recently accessed data in the cache
    • Spatial Locality: there's a good chance a program will need to access other nearby memory in the future
      -> data is transferred from main memory into the cache in blocks of a few dozen bytes (cache line) at a time
  • A cache works like a two-column table

    • Higher-end part of the address (tag) used for search
    • Lower part of the address used to index the cache

  • Cache Lookup

    • virtual address: the cache might need to be flushed on every context switch
    • physical address: virtual-to-physical mapping must be performed as part of the cache lookup
    • virtually-indexed physically-tagged: virtual-to-physical mapping (TLB lookup) can be performed in parallel with the cache indexing

2-16. Cache Conflicts & Associativity

  • A cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache

  • Cache Conflict: memory locations mapped to the same location are wanted at the same time

  • Thrashing: repeatedly accesses two memory locations which happen to map to the same cache line, and the cache must keep storing and loading from main memory

  • Associativity: The number of places a piece of data can be stored in a cache

  • Map

    • Direct-mapped: each piece of data is simply mapped to address % size within the cache (see the sketch after this list)
    • Set-associative: there are several tables, all indexed in parallel, and the tags from each table are compared

    Further reading
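
A small sketch of how a direct-mapped lookup splits an address, assuming a hypothetical 32 KB cache with 64-byte lines; the tag is the high part of the address and the index selects the cache line.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE  64                       /* bytes per cache line */
    #define CACHE_SIZE (32 * 1024)              /* 32 KB direct-mapped cache */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE) /* 512 lines */

    int main(void)
    {
        uint64_t addr = 0x7ffe12345678ull;      /* illustrative address */

        uint64_t offset = addr % LINE_SIZE;               /* byte within the line */
        uint64_t index  = (addr / LINE_SIZE) % NUM_LINES; /* which line (low bits) */
        uint64_t tag    = addr / LINE_SIZE / NUM_LINES;   /* stored and compared on lookup */

        printf("tag=%#llx index=%llu offset=%llu\n",
               (unsigned long long)tag,
               (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }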

2-17. Memory Bandwidth vs Latency

  • The transfer rate of a memory system is called its bandwidth

    • Increases bandwidth: adding more memory banks and making the busses wider
    • Increases latency: synchronously clocked DRAM (SDRAM)

Source of This Note