111/8/17 meeting - Synchronization in concurrent and distributed system 精簡版

## Synchronization in concurrent and distributed system 並行與分散式系統中的同步機制 - 精簡版 - ([link to writeup](https://hackmd.io/@idoleat/S14SsL6a9)) --- ## Outline * The essence of concurrency * Sync. in shared memory environment * Correctness * Sync. in distributed memory environment * High level overview of sync. * ToDo and some tricks --- ## The essence of concurrency 把一件事分給很多人做 * 理想狀況是 Data-Race-Free program * 100% 平行化，0% 共享資源而且 scalable ---- * 如果有共享資源呢？例如一些共用的 variable, data structure，像是 `wc`, `grep` 等工具 * 那就避免共享！沒有共享資源就沒有 Data Race * 從資料規劃上著手 * 如果真的無法避免共享，可以先分析共享形式，以適當的方法共享，使得對 scalability 的影響降到最低 --- ## Synchronization in shared memory environment * 對於共享資源，目標是要盡可能將 contention rate 降到最低 * 以減少 contention 造成的 busy waiting 或 starvation ---- 常見降低 contention 的方法 * 縮小 CS * 不要霸佔：多 reader 不會有 data race * **immutability** * replica --- ### Concurrency primitives * Lock * coarse-grained * fine-grained * Semaphore * atomic operations * RCU * Transactional Memory * software * hardware * CSP --- ### Lock * 使用上相當直覺，但很容易忽略帶來的風險 * 過多的 contention * 不易除錯與追蹤 * Lock is blocking * Deadlock, Non-composability, Priority inversion, Fault intolerance, Preemption intolerance... ---- ![](https://hackmd.io/_uploads/BJOqqXYR9.png) ---- ![eff-perf pic](https://hackmd.io/_uploads/BytfkQ065.png) ---- * 縮小 critical section $\to$ fine-grained lock * 說起來容易但是寫起來很容易出錯 * 不好追蹤哪裡寫錯 * 一下對一下錯 * 像打地鼠一樣一下這裡錯一下那裡錯 * 不如 coarse-grained lock 直覺 ---- ### read-write lock * 並非所有有用到共享資源的時候都需要上鎖 * multi-reader 的時候並不需要保護 * 只要在 writer 要寫的時候上鎖即可 --- ### COW in RCU * 不要共享就不會有 data race * 資料寫入了就不要再改動 (immutable) * 寫入在其他地方，並引導 reader 過去讀取 * Copy on Write * 不會阻擋原本正在讀取的 reader * no data race * no locks needed ---- 如果有多個 Writer? * 還是會有競爭關係 * 常見方式是透過 atomic pointer 確保 writer 在變更指標內容時不會被其他 writer 打斷，使得多 writer 的寫入呈線性化的關係 (Linearize) * atomic operation is.... --- ### atomic operation * 此類的操作如同原子一般不可被分割，直接由 CPU 指令提供實作 ,e.g. `CMPXCHG` * non-blocking synchronization 的基石 * 在變更記憶體內容前後檢查是否一樣，一樣則寫入，不一樣則不寫入 ---- * 以此建構出 lock-free，甚至 wait-free 的演算法 * 高難度操作 * 牽涉議題繁多: ABA problem, cache coherence, memory order, barrier...etc. ![](https://i.imgur.com/9mVFaz7.png) --- * Communicating Sequential Process (CSP) * *high potential!* * Transactional memory (STM/HTM) --- ### Correctness * unit test * model checking * SPIN * TLA+ * formal verification (mathematical proof) * [Isabelle, the proof assistant](https://www.youtube.com/watch?v=7w4KC6i9Yac) * In SMM concurrency, we aiming for **linearizability** ---- ![](https://hackmd.io/_uploads/H1bNoVG0c.png) --- 最近綜觀各語言的 concurrency 發展，皆朝著更有策略性、結構性的運用 concurrency primitives 發展更多深入解說請參考 :link: [並行和多執行緒程式設計系列講座](/HWnVCDWXRMOWASCAksYvNg) 千萬不要看恐龍書 --- ### Synchronization in distributed memory environment * All above with RMA (Remote Memory Access) * e.g. Distributed lock service * (Not fully replicated ones) * Consensus * Single-Leader * Multi-leader * Leaderless * CRDT/OT/MDT ---- * (possibly others) * CSP ---- ### Why distributed memory? * 單一機器的計算能力很容易碰到硬體擴充的上限 * 相比單一機器共享記憶體，分散式環境帶來的優勢為 Fault Tolerance 與 ***更多~?~*** 的 Scalability --- ### Shared memory illusion with RMA * 最直接的同步方式，移植 SMM 的過來 * 使用遠端記憶體存取 (RMA) 營造仍然共享記憶體的假象 * 但是頻繁的記憶體存取成本高且 blocking * 還是有人嘗試: [Nonblocking Data Structures for Distributed Memory: Stack as an example](https://www.semanticscholar.org/paper/Nonblocking-Data-Structures-for-Distributed-Memory-Diep-F%C3%BCrlinger/d87d282f20d46ecfa31ff8b0a8aef979039a38ca) * Global hazard pointer + caching ---- * Distributed locking? * [Martin has tried...](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html) * lock 光是在 SMM 就是阻礙式的同步機制，在分散式的環境上更是 --- ### shared nothing * 都不共享記憶體，僅透過溝通交換資訊 (message passing) 同步 * 不管是 shared memory 還是 shared nothing 都會有額外的溝通成本。也可以依情況採取混和式的方式 * e.g. 使用多台機器聯合組成一個 shared memory，各自負責一個區段 * 常見的 Server-client 即屬於一種 single leader shared memory ---- * 共享資源在 shared nothing 中是一種很尷尬的存在，要共享又不共享 * 因此共享資源在 shared nothing 中是以副本 (replica) 的形式存在 * 把共享資源複製一份放到每一台機器上，用「某些方法」同步每個 replica ---- 因為增加了溝通成本，不管是 shared memory 還是 shared nothing 在同步上的延遲都大幅提昇，進而降低了 replica 的一致性 (consistency) ---- * Strongest consistency: Linearizability * Weaker consistency: (strong) eventual consistency * Still many others.... --- ### Consensus algorithm for replicas * 維持最強的一致性: Linearizability * 經典的 Consensus algorithm 會選出一個 leader 作為同步的中心，所有更動都由 leader 發起，其他人則作為 follower * 代價就是不允許 concurrent modification，不然就是要競選為 leader，且競選過程繁瑣 * Raft * (Multi-)Paxos ---- 🎞 [Raft vs Paxos: Have we reached consensus in distribute system yet?](https://www.youtube.com/watch?v=0K6kt39wyH0) ![](https://hackmd.io/_uploads/H1ZK0St09.png) ---- #### Leader-based consensus doesn't scale well... * Notorious for high election cost * 但是有作到 Fault tolerance * [:star:](https://hackmd.io/fWq3PRrWQ2e2WYbTXj_Z3g?both#Leaderless-State-Machine-Replication-Specification-Properties-Limits) [Leaderless State-Machine Replication: Specification, Properties, Limits](https://www.semanticscholar.org/paper/Leaderless-State-Machine-Replication%3A-Properties%2C-Rezende-Sutra/40eab96c504458521e93892b2c64e9fe68c24e25) * [Leaderless consensus](https://www.semanticscholar.org/paper/Leaderless-Consensus-Antoniadis-Desjardins/bd6322fd907f1a1c83dd2df723e02e17bd187cc5) * Scalable Consensus 開始被研究 * 但是是在作業系統的頂級會議上 --- ### Optimistic synchronization for replicas * 為了可以提高本地讀寫的自由程度，我們可以犧牲一點一致性，用比較 optimistic 的方法同步 * 資料庫中的 transaction 即屬於一種 * 近年最流行的是 [:star:](https://hackmd.io/fWq3PRrWQ2e2WYbTXj_Z3g?view=#CRDT-conflict-free-replicated-data-type) ++[CRDT](https://www.semanticscholar.org/paper/Conflict-free-Replicated-Data-Types-(CRDTs)-Pregui%C3%A7a-Baquero/af6e869c1bf89d4bff1d56e2a5c06bc71e8d9f81)++ 類的同步方法，在一些條件下，所有的操作都不會有 conflict，更新訊息可以以任意次序與當下狀態合併，達到 (strong) eventual consistency * Replication 的作法同時也是 high availability 的一種表現 ---- CAP Theorem: 對 Consistency, Availability, Partition Tolerance 三者 trade off 只能取兩者 * 🎞 [CAP Theorem: two decades and few clouds later](https://www.youtube.com/watch?v=K3Ahfbn7tPk) * 以現在的觀點來看可以不必這麼嚴格 * [:star:](https://hackmd.io/fWq3PRrWQ2e2WYbTXj_Z3g?both#CALM) [Keeping CALM: When Distributed Consistency is Easy](https://www.semanticscholar.org/paper/Keeping-CALM%3A-When-Distributed-Consistency-is-Easy-Hellerstein-Alvaro/383816a37b3508d547190816b581bab1c879fae2) --- ### Assumption * 忽略溝通的方法，假設所有機器全連接 * 事實上 broadcast 是很高成本的行為，特別是 total order broadcast * 在非全連接的狀況下會變成更複雜的生成樹之類的問題 * 限定樹高和 max degree 的 (動態) 隨機生成樹，樹高和 max degree 呈反向關係 ---- * partial order broadcast，對次序比較沒有那麼講究 * 再更鬆散一點：透過 gossip algorithm 散布資訊 * Failure mode 目前皆假設為 Fail-stop，(我目前) 不處理 BFT 的問題 --- ### Other ways of synchronization? * 其實 Cache Coherence 問題也是 Distributed Memory 的同步問題 * Store buffer + invalidate queue * Buffering * Branching with coroutine * Nesting priority with spanning tree (pre-cofigured coordination) * CSP --- ### Correctness * unit test * model checking * SPIN * TLA+ * formal verification (mathematical proof) * [Isabelle, the proof assistant](https://www.youtube.com/watch?v=7w4KC6i9Yac) ---- ![](https://hackmd.io/_uploads/H1bNoVG0c.png) --- ## High level view of sync multi-reader multi-writer 問題都可以化簡為 multi-reader single-writer 問題 :::info * 所有不是最新的 writer 都會變成 reader，常見作法是使用 CAS (小心 ABA) 或是透過 acquire lock 確認自己是不是最新 writer * 也可以轉變成最新 writer 要去通知所有舊 writer 變成 reader * 總之就是必須協調 (coordinate) writer ::: ---- :::info * 綜觀而言，以上作法是使用 latest-win 的規則來協調 writer，帶來最強的 consistency: linearizability * 使用其他規則例如 add-win, latest-write-win 甚至是設定 writer 的優先度，有機會可以不這麼緊密的協調，writer 可以比較自由的寫入，但不保證當下結果會與別人同步，weaker consistency. ::: 已知：搭配使用 COW/replica 的 MRSW 為 wait-free sync ---- * concurrent modification 是最難解決的問題 * 用另一個角度來看，其實所有的分散是系統上 replica 的操作都可以當作是 git 操作，你可以想到的 branch, merge, conflict, resolve conflict, commit ahead origin, abort 都會發生 --- ## 可以派上用場的技巧 * **_++immutability++_** * [immutability changes everything](https://www.semanticscholar.org/paper/Immutability-changes-everything-Helland/c673aa1224db95280fd57b388a065fb6a625f0de) * 以 PRNG 預先取得共識 (static consensus) * Ring buffer for eventual consistency. Eliminating the need of memory reclaimation as well. * overlapping message ring passing for faulty check * Nesting * only K/N is needed, others for faulty check. --- ## ToDo * 一個可以實際拿來測試的範例程式 (WIP) * 應該要再更仔細的思考和 **分析** 問題，像 [這篇分析](https://www.semanticscholar.org/paper/State-Machine-Replication-Is-More-Expensive-Than-Antoniadis-Guerraoui/85b0b6aa70824f3c13049ce27ee9f96a1be51459) 和 [這篇分析](https://www.semanticscholar.org/paper/Leaderless-State-Machine-Replication%3A-Properties%2C-Rezende-Sutra/40eab96c504458521e93892b2c64e9fe68c24e25) * 可以從 Shared Memory 的 concurrency 參考一些方法 * My MPC vs Others' MPC (wCQ wait-free queue)? * 可以參考 cache coherence policy --- ## 目前方向 * 分析？ * 量化分析？ * Determinism? upper/lower bound? * 現有同步機制共通的 property ，如 CALM 和 ROLL * ( MWMR$\to$SWMR 是一個? Too $trivial$ ?) ---- * 新的？ * 同步機制？有點遙遙無期 * 若是同步機制，要至少具有 scalability 或 fault tolerance 其中一者，否則與 RMA 移植無異