---
tags: CS Master
---
# 111/8/17 meeting write-up - Synchronization in concurrent and distributed systems
:link: [slide](https://hackmd.io/lIwn8XilSPGzdermvEVN7Q)
## Splitting one job among many workers
The ideal situation is a data-race-free program: the work can be handed directly to many workers, reaching 100% parallelism and 0% shared resources while staying scalable. Assume data-transfer cost is negligible and we are not bounded by I/O. Examples: image and audio processing, scientific computing, etc.
What if there are shared resources, e.g. shared variables or data structures (`wc`, `grep`)? Then avoid sharing! No shared resources, no data race $\to$ start from the data layout <small>(I originally wanted to design a memory model that defaults to not sharing, but on second thought never mind, I'm not doing a PhD)</small> (MapReduce)
If sharing truly cannot be avoided, first analyze the form of sharing and share in an appropriate way, so that the impact on scalability is minimized.
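To make the ideal case concrete, a minimal Go sketch (the worker count and the "double every sample" work are stand-ins for illustration): each goroutine owns a disjoint slice segment, so all of the work runs in parallel with zero shared writes.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	data := make([]float64, 1_000_000)
	for i := range data {
		data[i] = float64(i)
	}

	const workers = 8
	chunk := (len(data) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		hi := lo + chunk
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(seg []float64) { // each goroutine owns a disjoint segment
			defer wg.Done()
			for i := range seg {
				seg[i] *= 2 // e.g. per-sample audio/image processing
			}
		}(data[lo:hi])
	}
	wg.Wait()
	fmt.Println(data[123]) // 246
}
```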
## Synchronization of shared resources
For shared resources, the goal is to push the contention rate as low as possible, reducing the busy waiting or starvation that contention causes.
* Don't hog the resource: multiple readers alone have no data race
* immutability
* shrink the critical section (CS)
* replica
### Shared Memory
* Lock
* coarse-grained
* fine-grained
* Semaphore
* atomic operations
* RCU
* Transactional Memory
* software
* hardware
* CSP
The appeal of locks is that they are very intuitive to use: just lock the region you want to protect. The price is that it is easy to overlook the risks they bring, e.g. excessive contention, deadlock, and being hard to debug and trace. Moreover, lock-based synchronization is a blocking method, …


The corresponding remedy is to shrink the critical section, locking only where necessary; this is called a fine-grained lock. But where exactly is "necessary"? It is easy to say, very easy to get wrong, hard to trace when wrong, and sometimes right one moment and wrong the next; with several mistakes it turns into whack-a-mole, an error here, then another there. At that point the lock is far less intuitive than the naive coarse-grained one.
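A sketch of the contrast (the sharded-map design and shard count are my own illustration, not from the slides): the coarse version serializes every access on one mutex, while sharding shrinks each critical section to the one shard actually touched, so unrelated keys no longer contend.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Coarse-grained: one lock guards the whole structure. Intuitive,
// but every access serializes on the same mutex.
type CoarseCounter struct {
	mu sync.Mutex
	m  map[string]int
}

func (c *CoarseCounter) Inc(k string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k]++
}

// Fine-grained: shard the map so the critical section only covers
// the shard actually touched.
const nShards = 16

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

type ShardedCounter struct {
	shards [nShards]shard
}

func NewShardedCounter() *ShardedCounter {
	s := &ShardedCounter{}
	for i := range s.shards {
		s.shards[i].m = make(map[string]int)
	}
	return s
}

func (s *ShardedCounter) shardFor(k string) *shard {
	h := fnv.New32a()
	h.Write([]byte(k))
	return &s.shards[h.Sum32()%nShards]
}

func (s *ShardedCounter) Inc(k string) {
	sh := s.shardFor(k)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	sh.m[k]++
}

func (s *ShardedCounter) Get(k string) int {
	sh := s.shardFor(k)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	return sh.m[k]
}

func main() {
	s := NewShardedCounter()
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Inc(fmt.Sprintf("key-%d", i%8))
		}(i)
	}
	wg.Wait()
	fmt.Println(s.Get("key-0")) // 125
}
```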
Not every use of a shared resource needs a lock, though. In fact, multiple readers need no protection: as long as the data never changes, one reader and many readers are the same. We only need to lock when a writer is about to write. Such a lock is called a read-write lock.
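A minimal read-write lock sketch with Go's `sync.RWMutex` (the config-map scenario is hypothetical): readers share the lock and run in parallel; only the writer takes it exclusively.

```go
package main

import (
	"fmt"
	"sync"
)

// Read-mostly config guarded by a read-write lock.
type Config struct {
	mu   sync.RWMutex
	vals map[string]string
}

func (c *Config) Get(k string) string {
	c.mu.RLock() // shared: many concurrent readers, no data race
	defer c.mu.RUnlock()
	return c.vals[k]
}

func (c *Config) Set(k, v string) {
	c.mu.Lock() // exclusive: blocks readers only while writing
	defer c.mu.Unlock()
	c.vals[k] = v
}

func main() {
	c := &Config{vals: map[string]string{"mode": "fast"}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = c.Get("mode") }()
	}
	c.Set("mode", "safe")
	wg.Wait()
	fmt.Println(c.Get("mode"))
}
```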
Extending the idea that no sharing means no data race: once data is written, never modify it again (immutable); write any change somewhere else and redirect readers there. This is the concept of COW (Copy-on-Write), and RCU (Read-Copy-Update) is a synchronization mechanism designed on top of it. RCU itself actually devotes most of its attention to safe memory reclamation, which we skip here. Since a writer writes elsewhere, it never blocks readers already in progress, so there is no data race. Multiple writers still race with one another, though; the common approach is an atomic pointer, ensuring a writer is not interrupted by other writers while updating the pointer.
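A COW-flavored sketch using Go's `atomic.Pointer` (Go 1.19+). Memory reclamation is left to the GC, which is exactly the part real RCU has to work hard for: readers load a snapshot and are never blocked; a writer copies, edits the copy, and publishes with CAS so concurrent writers cannot interrupt each other.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var snapshot atomic.Pointer[map[string]int]

func read(k string) int {
	m := *snapshot.Load() // never blocked by writers
	return m[k]
}

func update(k string, v int) {
	for {
		old := snapshot.Load()
		fresh := make(map[string]int, len(*old)+1)
		for key, val := range *old { // copy...
			fresh[key] = val
		}
		fresh[k] = v // ...on write
		if snapshot.CompareAndSwap(old, &fresh) {
			return // published; in-flight readers keep their old snapshot
		}
		// another writer won the race: retry against the newer snapshot
	}
}

func main() {
	initial := map[string]int{"a": 1}
	snapshot.Store(&initial)
	update("b", 2)
	fmt.Println(read("a"), read("b")) // 1 2
}
```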
Atomic operations are another important mechanism: like an atom, such an operation cannot be split, and they are implemented directly by CPU instructions, e.g. `CMPXCHG`. … Hard to get right.
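A sketch of the canonical CAS retry loop (the saturating-add semantics are just an example); on x86-64, the `CompareAndSwap` below compiles down to a `LOCK CMPXCHG`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// CAS loop: read the old value, compute the new one, and install it
// only if nobody changed it in between; otherwise retry.
func addSaturating(addr *int64, delta, max int64) {
	for {
		old := atomic.LoadInt64(addr)
		next := old + delta
		if next > max {
			next = max
		}
		if atomic.CompareAndSwapInt64(addr, old, next) {
			return
		}
		// lost the race: someone else updated *addr, retry
	}
}

func main() {
	var n int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); addSaturating(&n, 3, 250) }()
	}
	wg.Wait()
	fmt.Println(n) // 250: saturated, with no lock anywhere
}
```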
* cache …
* memory barrier
* lock-free
* wait-free
[Lecture series on concurrency and multithreaded programming](/HWnVCDWXRMOWASCAksYvNg)
Dekker's algorithm, WTF
Using a lock for SPSC, WTF
The dinosaur book, WTF
And yet we still use it as the standard textbook for OS courses and grad-school entrance exams, WTF
Surveying recent concurrency developments across languages, all of them are moving toward more strategic, more structured use of concurrency primitives
### Distributed Memory
* All above with RMA (Remote Memory Access)
* Distributed lock service
* (Not fully replicated ones)
* Consensus
* Single Leader
* Multi-leader
* CRDT/OT/MDT
* CSP
The above are ways to handle shared resources on a single machine, but a single machine's computing power quickly hits the ceiling of hardware expansion, at which point more machines can be added. Compared to shared memory on a single machine, a distributed environment brings fault tolerance and **more** scalability, so these two aspects receive more attention.
The first problem encountered in a distributed environment is that memory is no longer contiguous. The solutions are to use remote memory access (RMA) to maintain the illusion of shared memory, or to share nothing and exchange information purely through communication (message passing). Either way, shared memory or shared nothing, there is extra communication cost. Hybrid schemes are also possible (e.g. several machines jointly form one shared memory, each responsible for a segment). The common server-client architecture is a kind of single-leader shared memory.
Shared resources are an awkward existence in shared nothing: shared, yet not shared. Every machine is an island, so how does it learn that someone else changed a value? Shared resources therefore exist in shared nothing as replicas: copy the shared resource onto every machine and synchronize the replicas by "some method"
Because of the added communication cost, synchronization latency rises sharply under both shared memory and shared nothing, which in turn lowers the consistency of replicas. To maintain stronger consistency, some algorithms, e.g. consensus algorithms, elect a leader as the center of synchronization: every change originates from the leader and everyone else acts as a follower. The price is that you cannot write freely; otherwise you must run for leader, and elections are cumbersome. A distributed lock, which is already a blocking mechanism on a shared-memory machine, is even more so in a distributed environment. To regain more freedom for local reads and writes, we can sacrifice a bit of consistency and synchronize with a more optimistic method. Database transactions are one example: when updating, just write regardless of the current state; when the transaction is about to commit and a conflict is found, decide then whether to commit or abort. The most popular approach in recent years is the CRDT family of synchronization methods: under certain conditions, no operation ever conflicts, and update messages can be merged with the current state in any order, achieving (strong) eventual consistency
Replication is, at the same time, one form of high availability
CAP Theorem: 2 decades and a few clouds later
\KALM/
For now, ignore the communication method and assume all machines are fully connected. In reality, broadcast is a costly operation, especially total order broadcast, and on a non-fully-connected topology it turns into more complicated spanning-tree-style problems. Its counterpart is partial order broadcast, which is less strict about ordering; looser still, information can be spread with a gossip algorithm
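A toy push-gossip simulation (node count and the push-one-peer-per-round policy are arbitrary choices): no ordering guarantees at all, just repeated random merging, and a fact still reaches everyone in roughly O(log n) rounds with high probability.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Each round, every node pushes its known facts to one random peer,
// which merges them (set union).
func main() {
	const n = 32
	nodes := make([]map[string]bool, n)
	for i := range nodes {
		nodes[i] = map[string]bool{}
	}
	nodes[0]["x=1"] = true // one node learns a fact

	round := 0
	for !everyoneKnows(nodes, "x=1") {
		round++
		for i := range nodes {
			peer := nodes[rand.Intn(n)]
			for fact := range nodes[i] { // push-style gossip
				peer[fact] = true
			}
		}
	}
	fmt.Printf("fully propagated after %d rounds\n", round)
}

func everyoneKnows(nodes []map[string]bool, fact string) bool {
	for _, m := range nodes {
		if !m[fact] {
			return false
		}
	}
	return true
}
```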
Failure modes are assumed to be fail-stop throughout; BFT problems are not handled
A brief introduction to consensus: Multi-Paxos and Raft (Have we reached consensus yet?)

#### Other ways of synchronization?
* The cache coherence problem is itself a distributed-memory synchronization problem
* Store buffer + invalidate queue
* Buffering
* Branching with coroutine
* Nesting priority with spanning tree (pre-configured coordination)
* CSP
## High level view of sync
:::warning
Every multi-reader multi-writer problem can be reduced to a multi-reader single-writer problem
* Every writer that is not the latest becomes a reader; common approaches are CAS (beware of ABA) or acquiring a lock to confirm whether you are the latest writer (see the sketch after this box)
* Alternatively, the latest writer notifies all stale writers to become readers
* Either way, writers must be coordinated
* Taken as a whole, the approaches above coordinate writers with a latest-win rule, yielding the strongest consistency: linearizability
* With other rules, e.g. add-win, longest-win, or even assigning writers priorities, coordination can be less tight: writers write more freely, but the result at hand is not guaranteed to be in sync with everyone else's, i.e. weaker consistency. e.g. CRDT (strong eventual consistency)
:::
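A sketch of the latest-win reduction described in the box (the scenario is invented for illustration): writers race with a single CAS, a successful CAS makes you the latest writer, and every loser demotes itself to a reader instead of clobbering the winner. Freshly allocated nodes plus Go's GC sidestep the ABA problem here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type node struct{ val string }

var reg atomic.Pointer[node]

func main() {
	reg.Store(&node{val: "v0"})
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			old := reg.Load()
			if reg.CompareAndSwap(old, &node{val: fmt.Sprintf("v%d", i)}) {
				return // we were the latest writer at that instant
			}
			// CAS failed: someone wrote after our read. Instead of
			// forcing our value in, we become a reader.
			fmt.Println("demoted to reader, observed", reg.Load().val)
		}(i)
	}
	wg.Wait()
	fmt.Println("final:", reg.Load().val)
}
```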
Concurrent modification is the hardest problem to solve
From another angle, every operation on replicas in a distributed system can be seen as a git operation: everything you can think of, branch, merge, conflict, resolving conflicts, commits ahead of origin, abort, all happen here
## Correctness
* unit test
* model checking
* SPIN
* TLA+
* formal verification (mathematical proof)
* [Isabelle](https://www.youtube.com/watch?v=7w4KC6i9Yac)

## Classic papers and materials
## New Papers
* Leaderless State-Machine Replication: Specification, Properties, Limits
* Leaderless consensus
* CRDT + Historical Modeling
* ~~DMM Stack~~
* CALM + proof
## Not papers but also important
* [CAP Theorem: two decades and few clouds later](https://www.youtube.com/watch?v=K3Ahfbn7tPk)
* Immutability changes everything
* (Historical Modeling) The Art of Immutable Architecture
### idk
* The Power of Two Random Choices: A Survey of Techniques and Results
## Techniques that may come in handy
* immutability
* [immutability changes everything](https://www.semanticscholar.org/paper
* Reach consensus in advance with a PRNG (static consensus); see the sketch after this list
* Ring buffer for eventual consistency, which also eliminates the need for memory reclamation
* Overlapping message-ring passing for fault checking
* Nesting
* Only $K/N$ is needed (like reservoir sampling); the remaining $(N-K)/N$ is for fault checking
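A sketch of the PRNG idea (the seed, member list, and leader-per-round scheme are all hypothetical): once every node shares a seed agreed out of band, they compute identical "random" choices forever after, with zero messages exchanged.

```go
package main

import (
	"fmt"
	"math/rand"
)

// "Static consensus": decisions are derived deterministically from a
// pre-shared seed, so no runtime coordination is needed.
func leaderForRound(seed int64, round int, members []string) string {
	r := rand.New(rand.NewSource(seed + int64(round)))
	return members[r.Intn(len(members))]
}

func main() {
	members := []string{"a", "b", "c", "d"}
	// Two independent "nodes" with the same pre-shared seed:
	for round := 0; round < 3; round++ {
		n1 := leaderForRound(42, round, members)
		n2 := leaderForRound(42, round, members)
		fmt.Println(round, n1, n2, n1 == n2) // always true
	}
}
```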
## ToDo
* an actual running program
* The problem should be thought through and **analyzed** more carefully, like [this analysis](https://www.semanticscholar.org/paper/State-Machine-Replication-Is-More-Expensive-Than-Antoniadis-Guerraoui/85b0b6aa70824f3c13049ce27ee9f96a1be51459) and [this analysis](https://www.semanticscholar.org/paper/Leaderless-State-Machine-Replication%3A-Properties%2C-Rezende-Sutra/40eab96c504458521e93892b2c64e9fe68c24e25)
* Some methods can be borrowed from shared-memory concurrency
    * e.g. cache coherence policies
* My MPC vs Others' MPC (wCQ wait-free queue)?
## Current directions
* Analysis?
* Quantitative analysis?
* Determinism? Upper/lower bounds?
* Properties common to existing synchronization mechanisms, such as CALM and ROLL
* (Is MWMR $\to$ SWMR one? Too *trivial*?)
* Something new?
* A new synchronization mechanism? Feels far off
* If it is a synchronization mechanism, it must have at least one of scalability or fault tolerance; otherwise it is no different from porting via RMA
---
# CRDT: conflict-free replicated data type
## Properties
1. Any replica can be modified without coordinating with other replicas
2. When any two replicas have received the same set of updates, they reach the same state deterministically, by adopting mathematically sound rules that guarantee state **convergence**.
## Concurrency semantics
### Happens-before
Event $e_1$ $happens$-$before$ event $e_2$, written $e_1\prec e_2$, iff one of the following holds:
* $e_1$ occurred before $e_2$ in the same process
* $e_1$ is the event of sending message $m$, and $e_2$ is the event of receiving that message
* There exists an event $e$ such that $e_1\prec e$ and $e\prec e_2$
Happens-before induces a partial order:
* Two events with no order relationship between them (i.e., concurrent events) must be commutative.
### Kinds of concurrency semantics
#### Add-wins set
If two events do not commute, the add operation wins over the remove operation when they happen concurrently.

$\left\{ e \space| \space add(e)\in O \space\wedge \space\nexists\, rmv(e)\in O\cdot add(e) \prec rmv(e) \right\}$
#### Remove-wins set
With a concurrent add and remove of the same element, both replicas converge to the empty set (the remove wins).
$\left\{ e \space| \space add(e)\in O \space\wedge \space\forall rmv(e)\in O\cdot rmv(e) \prec add(e) \right\}$
#### Last-writer-wins set
Impose a total order using timestamps from a physical clock.
:::info
Vector clock: records the causal order between events using per-replica version numbers $\to$ partial order.
A Lamport clock is similar.
:::
$\left\{ e \space| \space add(e)\in O \space\wedge \space\forall rmv(e)\in O\cdot rmv(e) \lt add(e) \right\}$
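A minimal LWW-element-set matching the formula above (a logical counter stands in for the physical clock; breaking ties in favor of remove is an arbitrary choice of mine):

```go
package main

import "fmt"

// LWW-element-set: an element is present iff its latest add timestamp
// is greater than its latest remove timestamp.
type LWWSet struct {
	adds, rmvs map[string]int64 // element -> latest timestamp
}

func NewLWWSet() *LWWSet {
	return &LWWSet{adds: map[string]int64{}, rmvs: map[string]int64{}}
}

func (s *LWWSet) Add(e string, ts int64) {
	if ts > s.adds[e] {
		s.adds[e] = ts
	}
}

func (s *LWWSet) Remove(e string, ts int64) {
	if ts > s.rmvs[e] {
		s.rmvs[e] = ts
	}
}

func (s *LWWSet) Contains(e string) bool {
	return s.adds[e] > s.rmvs[e] // ties resolved in favor of remove
}

// Merge makes it a state-based CRDT: element-wise max is commutative,
// associative, and idempotent.
func (s *LWWSet) Merge(o *LWWSet) {
	for e, ts := range o.adds {
		s.Add(e, ts)
	}
	for e, ts := range o.rmvs {
		s.Remove(e, ts)
	}
}

func main() {
	a, b := NewLWWSet(), NewLWWSet()
	a.Add("x", 1)
	b.Remove("x", 2) // concurrent on another replica, later timestamp
	a.Merge(b)
	fmt.Println(a.Contains("x")) // false: the last writer (remove) wins
}
```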
#### Register
* multi-value register
* Preserves the results of concurrent modifications
* last-writer-wins
## State-based CRDT and Operation-based CRDT
### State-based
* It must support a $happens$-$before$ (causality) relationship that defines a partial order.
* All updates must increase the state in that partial order (the previous version $happens$-$before$ the updated one).
* It must support a merge operation that takes two states and produces a new one that is greater than both of them (both previous versions $happens$-$before$ the merged version).
* e.g. for states $s$ and $u$, the merge yields $s \cup u$
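The smallest concrete instance of these requirements is a grow-only set (my sketch, not from the slides): states are ordered by $\subseteq$, `Add` only inflates the state, and `Merge` computes the least upper bound, i.e. set union.

```go
package main

import "fmt"

// G-Set (grow-only set), the minimal state-based CRDT.
type GSet map[string]bool

func (s GSet) Add(e string) { s[e] = true } // monotone: only grows

func (s GSet) Merge(o GSet) { // union = join in the subset lattice
	for e := range o {
		s[e] = true
	}
}

func main() {
	a := GSet{"a": true, "b": true}
	b := GSet{"b": true, "c": true}
	a.Merge(b)
	b.Merge(a) // merge in any order, any number of times:
	fmt.Println(len(a), len(b)) // both converge to {a, b, c}
}
```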
### Operation-based
In operation-based replication, replicas converge by propagating operations to every other replica.
* A generator function: generates an effector operation; the generator itself has no side effects
* Restrictions
* Message delivery must follow causal order and be exactly-once: effectors are generally not idempotent, so duplicates are not acceptable; concurrent effectors must commute
* The generator function needs to run for every operation $\to$ delay. Pure operation-based CRDTs solve this, at the cost of more complex metadata
* Can be emulated by state-based CRDTs (and vice versa)
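A toy op-based counter to make the generator/effector split concrete (all names are mine): the generator is side-effect-free, increments commute, but applying a duplicated effector would double-count, hence the exactly-once delivery requirement above.

```go
package main

import "fmt"

type op struct{ delta int } // the effector

type replica struct{ count int }

// generator: side-effect-free, may read local state
func (r *replica) incr() op { return op{delta: 1} }

// effector application; delivery must be exactly-once, because
// applying the same op twice double-counts
func (r *replica) apply(o op) { r.count += o.delta }

func main() {
	var a, b replica
	o1 := a.incr()
	o2 := b.incr()
	for _, r := range []*replica{&a, &b} {
		r.apply(o1)
		r.apply(o2) // replicas may apply ops in different orders
	}
	fmt.Println(a.count, b.count) // 2 2: converged
}
```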
## Historical Modeling as an example
$Set$ is born to fit all of these requirements for a state-based CRDT:
* It contains no duplicates
* It is unordered
From this we get:
* (Insertions commute)
* (Insertion is idempotent)
* (so set insertion behaves well in the face of duplicated or out-of-order messages)
* Sets are partially ordered under the subset relationship
* \{a, b, c\} $\prec$ \{a, b, c, d\}
* Merge: compute the least upper bound
* merge(\{a, b, c\}, \{b, c, d\}) = \{a, b, c, d\}

### Projection
Walk through the DAG to gather the information we need.
:::info
Historical Modeling also leverages **immutability** to achieve an all-time partial order, since historical facts are append-only. Historical facts establish the causal relationships.
Find out more on:
https://www.immutablearchitecture.com/ (The book)
https://jinaga.com (The database based on Historical Modeling)
:::
## The extension of CRDT
### Preservation of sequential semantics
Some concurrency semantics, such as add-wins and remove-wins, preserve the properties of sequential execution.
### Extended behavior under concurrency
A multi-value register preserves the sequential semantics while also keeping the results of concurrent operations; a decided single value can overwrite them later. (This is what Historical Modeling does.)
## Future direction
### Scalability
The bottleneck is the density and bandwidth of communication; compact causality representations are needed.
### Reversible computation
So far, undo is either another operation appended to the current state, or stored as a separate CRDT to be evaluated together with the current state. It could be made more universal.
### Security
While access to a CRDT based interface can be restricted by adding authentication, any accessing replica has the potential to issue operations that can interfere with the other replicas.
### Verification
How do you know the CRDT you designed works correctly?
* unit test
* model checking
* formal verification
This could be generalized further.
:::info
## Pros and cons?
The bad:
* The bottleneck of CRDTs lies in the way information is communicated and disseminated
* There is no general-purpose CRDT; every use case needs a fresh design
* Collaborative software runs into sequence/ordering problems; see [CRDTs: The hard parts](https://www.youtube.com/watch?v=x7drE24geUw) for details
The good:
* Local-first software was not mentioned, but I think it is one of CRDT's biggest selling points
:::
---
# CALM
The cost of coordinating concurrent operations is high, but coordination is not always necessary.
## From coordination free to racing
* Ideal situation: Stay in Your Lane, the Perfect Freeway
* But there will always be someone changing lanes.
* Coordination needed.
* Distributed Deadlock Detection
* Distributed Garbage Collection
:::info
Distributed Deadlock Detection and Distributed Garbage Collection are horrible examples. They don't really show how data race can affect the performance.
:::
## Problem domain
> **Question:** What is the family of problems that can be consistently computed in a distributed fashion without coordination, and what problems lie outside that family?
The question can be translated into one of computability, like P vs. NP or decidability: it asks what is (im)possible for a clever programmer to achieve.
The set of satisfying paths that exist is monotonic in the information received:
### $Definition\space 1:$
A program $P$ is monotonic if for any input sets $S$, $T$ where $S\subseteq T$, $P(S) \subseteq P(T)$
Monotonicity is the key property underlying the need for coordination to establish consistency, as captured in the CALM Theorem:
### $Theorem\space 1:$
Consistency As Logical Monotonicity (CALM). A program has a consistent, coordination-free distributed implementation if and only if it is monotonic.
## CALM: A PROOF SKETCH
### Confluent
Confluence is a permissive correctness criterion.
* An operation on a single machine is confluent if it produces the same set of outputs for any non-deterministic ordering and batching of a set of inputs.
* A confluent single-machine operation can be viewed as a deterministic function from sets to sets
* Confluent operations compose: if the outputs of one confluent operation are consumed by another, the resulting composite operation is confluent.
* If an application is confluent, we know that any such anomalies at the memory or storage level do not affect the application outcomes.
:::info
I may just take it as weaker consistency.
:::
### Confluent example: shopping cart
Insert-only! No race conditions arise.
* "Add to cart" enlarges the set of items in the cart
* "Remove from cart" enlarges the set of items to be removed at checkout
Both operations grow their sets monotonically, and insertions commute. Thus, coordination-free.
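A sketch of that cart (field names invented): both operations are insertions into grow-only sets, and the non-monotonic part, the set difference, is deferred to a single checkout point.

```go
package main

import "fmt"

// The confluent cart: "add" and "remove" both only insert, so all
// operations commute and need no coordination.
type Cart struct {
	added, removed map[string]bool
}

func NewCart() *Cart {
	return &Cart{added: map[string]bool{}, removed: map[string]bool{}}
}

func (c *Cart) Add(item string)    { c.added[item] = true }   // insert-only
func (c *Cart) Remove(item string) { c.removed[item] = true } // also insert-only!

func (c *Cart) Checkout() []string {
	var items []string
	for it := range c.added {
		if !c.removed[it] {
			items = append(items, it)
		}
	}
	return items
}

func main() {
	c := NewCart()
	c.Add("book")
	c.Add("pen")
	c.Remove("pen") // order vs. Add("pen") doesn't matter: they commute
	fmt.Println(c.Checkout()) // [book]
}
```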
:::info
Again, immutability avoids race conditions. And this is basically the same idea as CRDTs.
:::
:::info
The formal proof by Ameloot (in another paper) is unreadable to people outside of database.
:::
Some concepts in Ameloot's proof:
* **Ingest and apply:** an unordered batch of requests to insert and delete records in local relations
* **Query:** the (now-updated) local relations to compute batches of records. (It's the same as projection in Historical Modeling)
* **Send:** The results of the query phase to relevant machines in the network as requests to be handled
:::info
Seems familiar? Same idea as state-based CRDTs.
:::
## CALM perspective on the state of the art
Brewer's CAP Theorem:
> a system can exhibit only two out of the three following properties: Consistency, Availability, and Partition-tolerance.
CAP is a negative result: it captures consistency properties that cannot be achieved in general.
:::info
I suggest watching [CAP Theorem: two decades and few clouds later](https://www.youtube.com/watch?v=K3Ahfbn7tPk) for a modernized perspective.
:::
CALM is a positive result in this arena: it circumscribes the class of programs for which all three of the CAP properties can indeed be achieved simultaneously.
### $Observation\space 1:$
Coordination-freeness is equivalent to availability under partition.
Trivial, same as CRDTs. In the reverse direction, a program that employs coordination will stall (become unavailable) during coordination protocols if the machines involved in the coordination span the partition.
In that frame, CALM asks and answers the underlying question of CAP:
> Which programs can be consistently computed while remaining available under partition?
CALM does not contradict CAP. Instead, CALM approaches distributed consistency from a wider frame of reference.
* Monotone programs can in fact satisfy all three of the CAP properties at once; non-monotone programs are the ones that cannot
* It allows us to ask questions about what computations are possible $\to$ outcome-oriented
## What enables monotonicity?
* Functional Programming
* Bare assignment is a non-monotonic programming construct. (Same applies to CRDTs)
* The use of **immutable** variables and function chaining builds immutable trees and graphs
* CRDTs
* The Bloom programming language, invented by the authors
## Coordination In Its Place
* Use CALM to analyze your problem. Use coordination when it's needed.
* CALM falls short of being a constructive result: it does not actually tell us how to write consistent, coordination-free distributed systems.
* True.
---
# Leaderless State-Machine Replication: Specification, Properties, Limits
* Classical SMR protocols offer limited scalability and availability in this setting
* We propose a framework to characterize existing Leaderless SMR protocols
* we introduce a set of desirable properties for these protocols: ROLL
* (R)eliability, (O)ptimal (L)atency, and (L)oad Balancing
* We show that protocols matching all of the ROLL properties are subject to a trade-off between performance and reliability.
* We also establish a lower bound on the message delay to execute a command in protocols optimal for the ROLL properties.
* This lower bound explains the persistent chaining effect observed in experimental results.
---
# Leaderless consensus (WIP)
## Problem
* Leader-based consensus is not scalable.
* Performance takes a severe hit when a faulty leader is detected.
* Multi-leader designs still require leadership coordination
## Contribution
* Defines a leaderless algorithm as one that decides in an eventually *synchronous-1* system
The assumption of (eventual) synchrony is a precondition; why isn't it in the title, or at least stated earlier?
## Definition
$synchronous-k$