
WEEK2 Course Notes: Concurrency

The State of Software Development

  • How CPUs used to gain performance
    • Clock speed
    • Execution optimization
      • Pipelining, branch prediction
      • Out-of-order execution: reads/writes may be reordered, but never in a way that breaks the original program
    • Cache: minimize the number of accesses to main memory
  • How CPUs gain performance now
    • Hyperthreading: run multiple threads on a single CPU, but they share the ALU and FPU
    • Multicore: multiple CPUs
      • Myth busted: 2 x 3GHz < 6GHz (two 3 GHz cores are not one 6 GHz core)
    • Cache

Functional Programming

It's Time to Get Good at Functional Programming

  • Functional programming languages are ideally suited to solving the multicore problem

  • FP features:

    • all computations are treated as the evaluation of a function
    • functional language programs avoid state and mutable data
      ==> You never have to worry about some other thread modifying a memory location where you've stored some value.
  • FP languages store state in function parameters—that is, on the stack.

  • In any functional programming language, you are likely to encounter these features:

    • First-class functions, or higher-order functions: Functions can serve as arguments and results of functions.
    • Recursion as the primary tool for iteration.
    • Heavy use of pattern matching, although technically it is not a defining feature of FP.
    • Lazy evaluation, which makes possible the creation of infinite sequences and other data structures.

Don’t Be Scared Of Functional Programming

  • Stricter functional programming languages are typically used when a system’s performance and integrity are both critical

    • i.e. your program needs to do exactly what you expect every time and needs to operate in an environment where its tasks can be shared across hundreds or thousands of networked computers.
  • Advantages: our programs can be broken down into smaller, simpler pieces that are both more reliable and easier to understand.

  • 2 concepts:

    • data in functional programs should be immutable: create new data structures instead of modifying ones that already exist
      • For example, if you need to manipulate some data in an array, then you’d make a new array with the updated values, rather than revise the original array.
    • functional programs should be stateless: they should perform every task as if for the first time
    • Combined with immutability, this helps us think of each function as if it were operating in a vacuum, blissfully ignorant of anything else in the application besides other functions.
      • this means that your functions will operate only on data passed in as arguments and will never rely on outside values to perform their calculations.
  • programming steps:

    • simple, repeatable actions that can be abstracted out into a function.
    • then build more complex features by calling these functions in sequence (also known as “composing” functions)
    • At a basic level, we’ll perform the following actions on our data:
      • add every number in a list
      • calculate an average
      • retrieve a single property from a list of objects
  • ground rules to ensure that you’re following best practices:

    • All of your functions must accept at least one argument.
    • All of your functions must return data or another function.
      • reduction: Returning a single value from an array
    • No loops! ==> replace them with recursion

Functional programming in C

  • Discriminated unions in C
  • Nested functions and lambda variables
  • Automatic cleanup of local variables
  • Data structures
  • Wrapping up

Not Just About Synchronization

Mutexes and Semaphores Demystified

Mutex vs. Semaphores – Part 1: Semaphores

  • Scholten’s semaphore is referred to as the General or Counting Semaphore (S >= 1), Dijkstra’s being known as the Binary Semaphore (S = 1).
  • inherent dangers associated with using the semaphore:
    • Accidental release: a counting semaphore is used as if it were binary; every task that finishes increments S, so more than one task can end up inside the critical section
    • Deadlock: three possible deadlock situations associated with the semaphore:
      • Recursive Deadlock: if a task tries to lock a semaphore it has already locked.
      • Deadlock through Death
      • Cyclic Deadlock (Deadly Embrace)
    • Task-Death deadlock: a task that is holding a semaphore dies or is terminated
      • it is common for the function call that acquires the semaphore to specify an optional timeout value
    • Priority inversion: high priority task becomes blocked for an indefinite period by a low priority task.
    • Semaphore as a signal

Mutex vs. Semaphores – Part 2: The Mutex

  • Mutex (MUTual EXclusion) appeared in the late 1980s to address the problems of the semaphore

  • The mutex is similar to the principles of the binary semaphore with one significant difference: the principle of ownership.

  • Ownership is the simple concept that when a task locks (acquires) a mutex only it can unlock (release) it.

    • If a task tries to unlock a mutex it hasn’t locked (thus doesn’t own) then an error condition is encountered and, most importantly, the mutex is not unlocked.
    • If the mutual exclusion object doesn’t have ownership then, irrelevant of what it is called, it is not a mutex.
  • The concept of ownership enables mutex implementations to address the problems:

    • Accidental release: a check is made on the release and an error is raised if current task is not owner
    • Recursive deadlock: mutex can support relocking of the same mutex by the owning task as long as it is released the same number of times
    • Task-Death deadlock: Death Detection-RTOS can detect if that task current owns a mutex and signal waiting tasks of this condition.
      • for the waiting tasks, the two dominant models are:
        • All tasks readied with error condition: When all tasks are readied, these tasks must then assume critical region is in an undefined state. In this model no task currently has ownership of the mutex. The mutex is in an undefined state (and cannot be locked) and must be reinitialized.
        • Only one task readied: ownership of the mutex is passed from the terminated task to the readied task. This task is now responsible for ensuring integrity of critical region, and can unlock the mutex as normal.
    • Priority inversion: using one of the following priority inheritance protocols:
      • [Basic] Priority Inheritance Protocol: enables a low-priority task to inherit a higher-priority task’s priority if that higher-priority task becomes blocked waiting on a mutex currently owned by the low-priority task. The low-priority task can now run and unlock the mutex; at that point it returns to its original priority.
      • Priority Ceiling Protocol(PCP): each mutex has a defined priority ceiling, set to that of the highest priority task which uses the mutex. Any task using a mutex executes at its own priority – until a second task attempts to acquire the mutex. At this point it has its priority raised to the ceiling value, preventing suspension and thus eliminating the “hold and wait” condition.
    • Semaphore as a signal
  • because of ownership, a mutex cannot be used for signaling between tasks ==> synchronization is not the same as mutual exclusion

A specific Operating Systems mutex implementation may or may not support the following:

  • Recursion
  • Priority Inheritance
  • Death Detection
  • reviews of some API:
    • VxWorks Version 5.4
    • POSIX Threads (pThreads) – IEEE Std 1003.1, 2004 Edition
      • Portable Operating System Interface (the X has no meaning)

      • The default POSIX mutex is non-recursive , has no priority inheritance support or death detection.

      • the Pthreads standard allows for non-portable extensions (as long as they are tagged with “_np”).

      • Linux supports four different mutex types through non-portable extensions:

        • Fast mutex – non-recursive and will deadlock [default]
        • Error checking mutex – non-recursive but will report error
        • Recursive mutex – as the name implies
        • Adaptive mutex – extra fast for multi-processor systems
    • Microsoft Windows Win32 – Not .NET

Mutex vs. Semaphores – Part 3: Mutual Exclusion Problems

  • problems that use of the mutex (in preference to the semaphore) will not solve:

    • Circular deadlock: one task is blocked waiting on a mutex owned by another task. That other task is also blocked waiting on a mutex held by the first task.

      • For circular deadlock to occur the following conditions must all be true:

        • A thread has exclusive use of resources (Mutual exclusion)
        • A thread can hold on to a resource(s) whilst waiting for another resource (Hold and wait)
        • A circular dependency of thread and resources is set up (Circular waiting)
        • A thread never releases a resource until it is completely finished with it (No resource preemption)
      • one design policy: if a task needs to lock more than one mutex it must either lock all or none

    • Non-cooperation:

      • one founding principle: we have to rely on all tasks to access critical regions using mutual exclusion primitives.
        ==> dependent on the design of the software and cannot be detected by the run-time system.
      • This final problem was addressed by Tony Hoare, called the Monitor.
  • Monitor: The monitor is a mechanism not typically supplied by the RTOS, but something the programmer tends to build (a notable exception is Ada95’s protected object mechanism). A monitor simply encapsulates the shared resource and the locking mechanism into a single construct (e.g. a C++ Object that encapsulates the mutex mechanism). Access to the shared resource, then, is through a controlled interface which cannot be bypassed (i.e. the application never explicitly calls the mutex, but calls upon access functions).

Summary
  • The underlying difference between the semaphore and the mutex is the Principle of Ownership. Given the principle of ownership, a particular implementation of a mutex may support Recursion, Priority inheritance and Death Detection.
  • the concept of condition variables has not been covered yet at this point

Concurrency (並行) vs. Parallelism (平行)

  • Concurrency: composition of independently executing processes; dealing with lots of things at once
  • Parallelism: simultaneous execution of (possibly related) computations; doing lots of things at once

Concurrency article series

  • Sequenced-before: within a single thread, one evaluation is ordered before another (the single-thread form of happens-before)
  • Happens-before: the effects of one operation are visible before another; can hold within one thread or across threads.
    • In C++, happens-before relationships between threads must be established explicitly

C++ defines five situations, any of which establishes an inter-thread happens-before relationship:

  1. A synchronizes-with B (A and B are properly synchronized)
  2. A is dependency-ordered before B (A and B have a dependency ordering)
  3. A synchronizes-with some evaluation X, and X is sequenced-before B
  4. A is sequenced-before some evaluation X, and X inter-thread happens-before B
  5. A inter-thread happens-before some evaluation X, and X inter-thread happens-before B
  • synchronizes-with: a synchronization action between two different threads. When A synchronizes-with B, the effects of A's memory operations are visible to B.
    • A and B are operations on two different threads
    • this is the cross-thread form of happens-before
    • C++ defines a set of atomic operations and mutexes to handle synchronization across threads

Memory Consistency Models: Sequential Consistency

  • Each individual processing unit maintains program order during execution

  • The program as a whole executes in some single order across all processors

  • Implementation considerations:

    • preserve the execution order
    • make that order consistent for all processors
    • Write Buffers with Bypassing Capability
    • Overlapping Write Operations
    • Non-Blocking Read Operations
  • Cache Coherence

    • cache coherence protocol: keeps the caches on different processors consistent across the system
    • "all writes are eventually made visible to all processors"
    • "writes to the same location are seen in the same order by all processors"
    • to guarantee sequential consistency, we must additionally ensure that "writes to different locations are seen in the same order by all processors"
    • Implementation considerations:
      • Detecting the Completion of Write Operations
      • Maintaining the Illusion of Atomicity for Writes

Concurrency

An Introduction to Lock-Free Programming

Lock-free: if one thread gets locked up, it never prevents the other threads from running (non-blocking). Lock-free does not mean mutex locks are never used at all; it means "a thread's execution is never blocked by a lock held by another thread".

  • Mechanisms that lock-free programming relies on:

CAS loop
In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value.

Process vs. Thread vs. Coroutines

  • Coroutine implementations
    • switch-case
    • setjmp/longjmp

POSIX Thread

  • Condition variables provide yet another way for threads to synchronize.

    • mutexes implement synchronization by controlling thread access to data
    • condition variables allow threads to synchronize based upon the actual value of data.
  • Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met. This can be very resource consuming since the thread would be continuously busy in this activity. A condition variable is a way to achieve the same goal without polling.

1018 Lecture

mutex vs. semaphore

  • priority inversion
  • in implementation, a mutex provides a higher-level interface

Gophers

  • concurrency
    • decomposition ==> splitting up the work
  • parallelism
    • split into multiple routines that execute at the same time
    • fork then join

CPU bound / IO bound

  • NUMA -> related to the memory model; the topology between CPUs

kernel object: a mechanism provided by the operating system to deal with some of these problems

  • atomic operation: Dekker's algorithm

    • a software-only algorithm: about a thousand times slower than the hardware approach
    • so its execution cost is too high
    • memory model, memory ordering
    • instruction scheduling
      • avoid pipeline stalls
      • Avoid illegal or semantically ambiguous operations (typically involving subtle instruction pipeline timing issues or non-interlocked resources.) ==> e.g. testing whether a variable is > 0 even though it is unsigned
  • java concurrency "banking"

  • formal verification: derive the execution-time complexity

    • sel4
  • Compare-And-Swap Loops

  • TS (test and set)

  • spinlock

  • Garbage collection

  • lazy evaluation

  • lock free stack

    typedef struct {
        struct lstack_node *node_buffer;
        _Atomic struct lstack_head head, free;
        _Atomic size_t size;
    } lstack_t;
    • the _Atomic qualifier in front is mandatory
    • C11


coroutine


  • task migration

    • work queue contention
  • Hungry Bird: single-consumer, multiple-producer

    • pc-test
  • lock granularity

    • depends on the requirements