# Week 03 ###### tags: `WS2020-IN2259-DS` ## 3-1 Distributed system model * Model 用來獲得 (captures) 與系統相關的所有假設。 * 如:網路、clock、處理器等。 * Algorithms always assume a certain model * 一些演算法只能在一些較為強大的 (stronger) 模型中使用,但限制也較多。 * Model 是與理論相關的 (theoretical relevance),因此 Model 的假設是否可以在現實中應用是一個不同的問題。 ### Synchronous vs. asynchronous model | Property | Synchronous system model | Asynchronous system model | | -------- | -------- | -------- | | Clocks | Bound on drift | Text | | Processor | Bound on execution time | No bound on execution time | | Channel | Bound on latency | No bound on latency | ### Synchronous model * 每個行程 local clock 距離真實時間的 drift rate,都在一個已知的限制 (known bound) 內。 * 每個 computation 的每個步驟,都將在 known bound 內完成。 * 每個透過一 channel 傳送出去的 message,會在 known bound 內被接收。 ### Asynchronous model * Clock 的 drift rates 是隨意的 (arbitrary)。 * 每個 computation 的每個步驟,隨意時間完成 (但一定會完成)。 * message 將需要不確定的時間完成傳送 (但一定會傳送)。 ### Two General’s Problem (Agreement) ![](https://i.imgur.com/Tdy8Cet.png =500x) 1. Who leads the attack? * Solution * Largest army first * If tied, Army 1 (如果都一樣兵力) * 因此,此解法於 Synchronous 與 Asynchronous 模型都適用 2. When to attack? * Asynchronous agreement * No bound on delivery! * 無法解決! * Synchronous agreement * Message takes at least **min** time and at most **max** time to arrive * Strategy: Army 1 **waits for min time** then attacks * Guarantee: Army 2 attacks no later than **max – min** time after Army 1 ### Summarizing takeaways * 有些問題無法在 asynchronous world 中解決,如 when vs. who leads attack。 * 在 asynchronous distributed systems 可以使用的 solution,在 synchronous distributed systems 也可以使用 (synchronous model is stronger)。 * 網際網路與很多實際應用的分散式系統,都是 asynchronous。 * 加入 timeouts 與 timing assumptions 可以減少 uncertainty,並使得 synchronous model 中的元素了解整體情況 (be brought into the picture)。 ### Self-study questions * Think of a few design problems that cannot be solved in an asynchronous world … * … and show how they can be solved in a synchronous world. * What are some other useful assumptions for distributed system design and how practical are they? * Can you think of other coordination and agreement problems? ## 3-2 RECAP: MUTUAL EXCLUSION - IN A NON-DISTRIBUTED SYSTEMS CONTEXT ### Problem: Access to shared variables * 想像一個要被多個 threads 或是 processes 存取的 globally shared variable counter 。 * 如:a key-value record managed by a storage server (or more complex data structure) ### Shared data & synchronization ![](https://i.imgur.com/TgFNmO7.png) * What may happen if multiple threads concurrently access shared state (e.g., a shared variable)? ### Concurrently manipulating shared data * Two threads execute concurrently as part of the same process * Shared variable (e.g., global variable) * counter = 5 * Thread 1 executes * counter++ * Thread 2 executes * counter-- * What are all the possible values of counter after Thread 1 and Thread 2 executed? ### Machine-level implementation * Implementation of “counter++” ```c register1 = counter register1 = register1 + 1 counter = register1` ``` * Implementation of “counter--” ```c register2 = counter register2 = register2 – 1 counter = register2 ``` ### Possible execution sequences ![](https://i.imgur.com/2SlFY0r.png) ### Interleaved execution ![](https://i.imgur.com/XJnSMIU.png =500x) * Assume counter is 5 and interleaved execution of counter++ (in T1) and counter-- (in T2) * The value of counter may be 4 or 6, whereas the correct result should be 5! ### Race condition * **Several** threads **manipulate shared data** **concurrently**. The **final value** of the data **depends** upon **which thread finishes last**. * In our example (interleaved execution) of * `counter++` * `counter--` * To prevent race conditions, concurrent processes must be **synchronized** ### The moral (寓意) of this story * The statements `counter++` and `counter--` must each be executed **atomically** * Atomic operation:指在沒有 interruption 的狀態下完成的 operation。 * 我們可以透過 synchronization primitives 獲得 Atomicity (原子性)。 * 在 critical section (臨界區段) 裡存取的 Shared variable,必須被 synchronization primitives 所保護。 * 這就是所謂的 critical section problem 或 mutual exclusion。 ### Self-study questions * Do we have a critical section problem in distributed systems? – There is no shared memory! * Asked differently, do we need to worry about (distributed) mutual exclusion in a distributed system? * Identify a few mutual exclusion scenarios in distributed systems. ## 3-3 DME - OVERVIEW ### Distributed mutual exclusion * In distributed systems, mutual exclusion is at least equally **complex** due to : * Lack of shared memory * Lack of a global clock * Event ordering (事件排序) * Examples * Accessing a shared resource in distributed systems * Acquiring a lock * One active master to coordinate (協調) activities ### Critical section problem - No shared memory * System with n nodes * Nodes access shared resources in CS * **Coordinate access to CS via message passing** * Application-level protocol for accessing CS * Enter_CS() – enter CS, block if necessary * ResourceAccess() – access shared resource in CS * Exit_CS() – leave CS ### Assumptions (假設) - No practical rather theoretical considerations * System is asynchronous (非同步) * No bound on delays, no bound on clock drift, etc. * Nodes do not fail * Message delivery is reliable * Any message sent, is eventually delivered intact (完整) and exactly once – i.e., not lost, not duplicated * Nodes are well-behaved and spent finite time accessing resources in CS ### Mutual exclusion requirements (重要!) * **Safety** – correctness * 一次只有一個 node 存取 CS。 * **Liveness** – progress (something good happens) * Requests to enter/exit CS eventually succeed:可以正常進出 CS。 * 沒有 deadlock。 * **Fairness** * order:如果一個要進入 CS 的 request 比另一個早 (happened-before),那必須依照這個順序。 * no starvation:在其他 nodes 尚未進入 CS 前,不會有一個 node已經進入 CS 兩次。 ### Deadlock & starvation * Deadlock:兩個或更多 nodes 無限期卡在 (stuck indefinitely) CS 入口。 * 如:互相依賴 (mutual dependency) ![](https://i.imgur.com/zFVCTrA.png =250x) * Starvation:某個要進入 CS 的節點因故無限期延遲 (indefinite postponement),一直無法進入 CS。 ### Possible performance metrics * Bandwidth:傳送/接收訊息的數量 * Synchronization delay:一個 process 離開 CS 到下一個 process 進來的時間。 * Client delay:一個 node 從進 CS 到出 CS 的時間,也就是回應時間,Delay at **entry** and **exit** (response time) * 不評估 access time:We do not measure client access to resources protected by the critical section (**assume finite**) ### Solution strategy overview * Centralized strategy * Divide nodes into leader and follower, leader dictates actions of followers * Leader 負責告訴 follower 要做什麼。 * Distributed strategy:每個 node 藉由本地端對於其他 node 狀態的認知,獨立地決策行動。(Each node independently decides actions, based on local knowledge of others' state) * Token-based:有 token 就可以進入 CS,而 token 由 nodes 間自行依優先順序 (priority order) 傳遞。 * Non-token-based:當「斷言為真 (assertion becomes true)」時,該 node 將允許進入 CS。這將透過 nodes 之間溝通,互相獲得對方狀態,然後才能決定該 node 的 assertion 是否為真。 ### Self-study questions * What are some examples for mutual exclusion scenarios in distributed systems? * What are some scenarios where we need locks in distributed systems, after all, there is no shared memory (in our notions of DS) ## 3-4 DME - CENTRALIZED STRATEGY ### Centralized strategy * Elect leader * $\mathrm{P_5}$ 的腳色變成:server / coordinator / **leader** / master ![](https://i.imgur.com/Kbj55YQ.png =350x) * Empty CS 1. $\mathrm{P_2}$ 向 $\mathrm{P_5}$ 發出 **Enter_CS()** (Request entry to CS)。 2. $\mathrm{P_5}$ 向 $\mathrm{P_2}$ 回傳 **token** (Grant access to CS)。 3. $\mathrm{P_2}$ 執行 **ResourceAccess()**,成功進入 CS。 4. $\mathrm{P_2}$ 向 $\mathrm{P_5}$ 發出 **Exit_CS()**,離開 CS。 示意圖如下: ![](https://i.imgur.com/ajPKI5I.png =350x) ![](https://i.imgur.com/BH9nMcx.png =350x) ![](https://i.imgur.com/seYJhcA.png =350x) ![](https://i.imgur.com/BoD13Gw.png =350x) * Non-empty CS (此時 $\mathrm{P_2}$ 在正在存取 CS) * Queue: Pending requests of nodes waiting for entry to CS 1. $\mathrm{P_n}$ 向 $\mathrm{P_5}$ 發出 **Enter_CS()** (Request entry to CS),$\mathrm{P_n}$ 的 request 進入 queue 中等待。 2. $\mathrm{P_1}$ 向 $\mathrm{P_5}$ 發出 **Enter_CS()** (Request entry to CS),$\mathrm{P_1}$ 的 request 進入 queue 中等待。 3. $\mathrm{P_2}$ 向 $\mathrm{P_5}$ 發出 **Exit_CS()**,離開 CS。 4. $\mathrm{P_5}$ 向 $\mathrm{P_n}$ 回傳 **token** (Grant access to CS),並把 $\mathrm{P_n}$ 的 request 從 queue 裡移除,然後 $\mathrm{P_n}$ 執行 **ResourceAccess()**,成功進入 CS。 示意圖如下: ![](https://i.imgur.com/OBotewE.png =350x) ![](https://i.imgur.com/W581ZwQ.png =350x) ![](https://i.imgur.com/rONXsDm.png =350x) ![](https://i.imgur.com/hm12Son.png =350x) ### Summarizing observations * 達成:safety、liveness、no starvation * Does solution meet the ordering requirement? * Meet! * Advantages * 實作很簡單。 * Disadvantages * 單點故障 (**Single point of failure**)。 * Bottleneck, network congestion, timeout * Deadlock potential (死結是可能的) for multiple resources with separate servers * Enter_CS() * **Two messages**: Request & Grant * One round of communication (**RTT delay**) * Exit _CS() * **One message**: Release message * **No delay** for the node in CS ### Self-study questions * Does solution meet the ordering requirement? Why or why not? * How could a single leader manage multiple resources? Analyse pros and cons of alternative designs? * Provide a differentiated discussion regarding failure of leader, follower, follower in CS, follower requested access, etc.) ## 3-5 DME - RING-BASED ALGORITHM * 所有 nodes 間,形成一個邏輯意義上的環 (Logical ring)。 * 每個 $\mathrm{P_i}$ 都知道他的下一個 node: $\mathrm{P_{(i+1)\mod n}}$ 是誰。 * 無關於 physical topology,logical topology 是先驗的 (priori)。 * Token 傳遞示意圖如下: ![](https://i.imgur.com/LVguf8v.png =350x) ![](https://i.imgur.com/hPQkX0q.png =350x) ![](https://i.imgur.com/1s92v3d.png =350x) ![](https://i.imgur.com/vlEC86d.png =350x) ![](https://i.imgur.com/y60V1Nu.png =350x) ### Ring-based algorithm analysis * **Safe** * 只有拿到 token 的 node 才能進入 CS。 * **Live** * 因為各個 nodes 會完成各自有限的工作 (finite work),token 最終一定會走遍各個 node。 * **Fair** * 順序是根據 ring topology 來執行的,因此沒有 starvation。 * **Performance** * 除非某個 node 在 CS 裡還沒出來,不然即使有些 nodes 沒有要進入 CS,但這個傳遞 token 的操作仍會持續地消耗網路頻寬 (bandwidth)。 * Synchronization delay (同步延遲):Between 1 and N messages (一個 process 離開 CS 到下一個 process 進來的時間)。 * Client delay (客戶端延遲):0 to N messages for entry (進入 CS); 0 for exit (離開 CS)。 ### Potential problems with ring-based algorithm (Due to our assumption, not all apply here.) * Node crash * Lost token * Duplicate token * Timeouts on token passing ### Self-study questions * Does solution meet the ordering requirement? Why or why not? * How could access to multiple resources be realized with the ring-based ME algorithm? * How can node failures be mitigated? Develop strategies that could tolerate 1, 2, ... node failures? * How could lost or duplicated tokens be mitigated? ## 3-6 DME - LAMPORT’S ALGORITHM, 1978 * 一個 n nodes 系統。 * $\mathrm{(ts_i, i)}$:Node $\mathrm{P_i}$ 的 logical clock timestamp (total order 版本)。 * Logical timestamps 用於排序 (order) requests for CS。 * 小的 timestamps 優先於大的 timestamps。 * 每個 node 都維護一個 **request queue** (此即 priority queue)。 * 假設 message 都以 FIFO 順序傳遞,也就是說,messages 不會爭搶 (race over) communication channel。 ### $\mathrm{P_i}$ requesting CS * 廣播 (Broadcast) **REQUEST$\mathrm{(ts_i, i)}$** message to **all** nodes * 把這個 **REQUEST$\mathrm{(ts_i, i)}$** 放在各自的 $request\_queue_i$ * N.B.: $\mathrm{(ts_i, i)}$ denotes timestamp of request * 示意圖: ![](https://i.imgur.com/G7WDMxG.png =500x) ### $\mathrm{P_k}$ receiving a request to enter CS * 當 $\mathrm{P_k}$ 收到 $\mathrm{P_i}$ 的 **REQUEST$\mathrm{(ts_i, i)}$** 時, * 將 $\mathrm{P_i}$ 的 **REQUEST$\mathrm{(ts_i, i)}$** 放到 自己的 $request\_queue_k$ 裡。 * 傳送一個 timestamped REPLY message 給 $\mathrm{P_i}$:**REPLY$\mathrm{(ts_k, k)}$**。 * 示意圖: ![](https://i.imgur.com/QdJpOpv.png =500x) ### $\mathrm{P_i}$ enters CS * $\mathrm{P_i}$ 自己的 timestamp 比來自其他 nodes 的 timestamp $\mathrm{(ts_i, i)}$ 小。 * $\mathrm{P_i}$ 的 request 在 $request\_queue_i$ 的頂端。 * 示意圖: ![](https://i.imgur.com/xwRhBNA.png =500x) ### $\mathrm{P_i}$ releasing CS * 將自己的 request 從 queue 頂端移除。 * 廣播自己的 timestanped RELEASE message 給所有其他 nodes:**RELEASE$\mathrm{(ts_i, i)}$** * 示意圖: ![](https://i.imgur.com/fTFiycp.png =500x) ### $\mathrm{P_k}$ receiving a release message * $\mathrm{P_k}$ 移除自己 queue 中 $\mathrm{P_i}$ 的 request。 * 示意圖: ![](https://i.imgur.com/nFq4xP5.png =500x) ### Summarizing observations * 3(N-1) messages per CS request invocation (調用,invocate=invoke)。 * (N - 1) REQUEST * (N - 1) REPLY * (N - 1) RELEASE messages * Not fault tolerant:無法容忍錯誤。 ### Self-study questions * Would node IDs as timestamps suffice, why or why not? * Would Lamport clock values as timestamps suffice, why or why not? * Is the algorithm order-preserving, why or why not? ::: warning * 思考:Synchronization delay 是多少? 猜測解答:RELEASE messages,1 messages。 ::: ## 3-7 DME - RICART & AGRAWALA, 1981 * Guarantees mutual exclusion among n node * Basic idea * 想要進入 CS 的 nodes 都**廣播** **request** 給**其他所有 nodes**。 * 當**所有 nodes** 都有 **granted request** 時,准許進入。 * Use **Lamport timestamps** to order requests:$\mathrm{(ts_i, i)}$ * $\mathrm{ts_i}$:timestamp * $\mathrm{i}$:node identifier ### Ricart & Agrawala: Distributed strategy * 每個 node 都有三種狀態: * **Released**:執行完 Exit_CS(),目前在 CS 外。 * **Wanted**:呼叫 Enter_CS(),等待進入 CS。 * **Held**:正在 RessourceAccess(),目前在 CS 裡。 * 當某一 node request 進入 CS 時,若**其他所有 nodes** 都處於 **Released State**,那這個**進入 (entry)** 的要求就會被其他 nodes **承認 (granted)**。 * 如果 $\mathrm{P_i}$ 想要進入 CS,但是 $\mathrm{P_k}$ 目前處於 Held State,那麼 $\mathrm{P_k}$ 在離開 CS 之前便不會回覆 (reply) $\mathrm{P_i}$。 ### Initialization ![](https://i.imgur.com/PASIUPs.png =500x) ### Requesting entry to CS * Request while allReleased ![](https://i.imgur.com/BEbtJyj.png =500x) ![](https://i.imgur.com/90K19qH.png =500x) ![](https://i.imgur.com/07BGqqM.png =500x) ![](https://i.imgur.com/jnX9h88.png =500x) * Request while Held ![](https://i.imgur.com/4Y03jcm.png =500x) ![](https://i.imgur.com/lbRtl3q.png =500x) ![](https://i.imgur.com/F8pdvJi.png =500x) ![](https://i.imgur.com/3avX0DP.png =500x) ### Concurrent entry requests * $\mathrm{P_2}$ and $\mathrm{P_3}$ request entry to CS concurrently ![](https://i.imgur.com/YAjeqmQ.png =500x) ![](https://i.imgur.com/ryfl9fG.png =500x) ![](https://i.imgur.com/fC4KfTZ.png =500x) ![](https://i.imgur.com/SWgq0pF.png =500x) ![](https://i.imgur.com/CdqACt5.png =500x) ![](https://i.imgur.com/7Lm3iC8.png =500x) ### Pseudo code * On initialization ```c state = RELEASED ``` * Enter_CS() ```c state = WANTED Broadcast timestamped request to all nodes wait until ((n-1) acks received) state = HELD ``` * On receiving a request with $\mathrm{(ts_i, i)}$, at $\mathrm{P_k(i \ne k)}$ ```c if (state==HELD or (state==WANTED and <tsk, k> < <tsi, i>)) // <tsk, k> 是 Pk 自己的 timestamp) queue request from Pi without replying else send a reply to P ``` * Exit_CS() ```c state = RELEASED Reply to all queued requests ``` ### Reminder: Subtlety about timestamps * 使用 **Lamport timestamps** 排序請求 (order requests):$\mathrm{(ts_i, i)}$ * $\mathrm{ts_i}$:timestamp * $\mathrm{i}$:node identifier * 如果有兩個 **timestamps**:$\mathrm{(ts_i, i)}$ 與 $\mathrm{(ts_k, k)}$ * 如果 $\mathrm{ts_i}=\mathrm{ts_k}$,使用 node identifiers $\mathrm{i}$、$\mathrm{k}$ 來決定順序。 * 如果便可以基於 timestamps 產生一**隨意的 total order** (Gives rise to an **arbitrary total order** over timestamps)。 ### Summarizing observations * Safe:因為 request 是 total order,因此不會產生兩個 nodes 同時進入 CS 的情況。 * Live:每個 request 都是排隊 (enqueued) 的,所以每一個都會輪到。 * Fairness - ordered:Based on timestamps * Each entry request requires 2(N-1) messages * N-1 requests * N-1 replies * 比起 Lamport 少了 release。 * Synchronization delay (一個 process 離開 CS 到下一個 process 進來的時間) is one message * Not fault tolerant:無法容忍錯誤。 ### Self-study questions * Would node IDs as timestamps suffice, why or why not? * Would Lamport clock values as timestamps suffice, why or why not? * Is the algorithm order-preserving, why or why not? * Compare all our distributed mutual exclusion algorithms according to bandwidth use, synchronization delay, and delay at entry and exit. * Compare all our distributed mutual exclusion algorithms according to their failure resilience. ## 3-8 LE - OVERVIEW ### Leader (a.k.a., coordinator, master, etc.) election * **Problem**:一群 nodes $\mathrm{P_1,\cdots,P_n}$,都**必須同意**其中一獨立 (**unique**) $\mathrm{P_k}$ 擔任 **leader**。 * 通常,leader 負責協調其他 nodes 的活動 (**coordinates another activity**)。 * 當偵測到 (detected) 或懷疑 (suspected) 有 leader failure 發生,就會啟動 LE (leader election)。 * 在預先定義的時間區間 (predefined time interval) 內,如果沒有聽到 leader 的消息,任何節點都可能要求啟動 LE (**call for an election**)。 * **False alarm** is a possibility:當新的 LE 啟動,但目前的 leader 仍然存活 (still alive)。 * 可能會有**好幾個 nodes** **同時 (concurrently)** **發起 LE**。 * 演算法必須允許正在 LE 時突然有 node crash。 ### Leader election use cases (LE 使用案例) * Berkeley clock synchronization algorithm * Centralized mutual exclusion algorithm * Leader election for choosing ***master*** in Hbase, Bigtable * Choosing ***master*** among ***n*** nodes in Chubby or ZooKeeper coordination service * ***Primary-backup replication algorithms*** * ***Two-phase commit protocol*** ### Leader election vs. mutual exclusion * LE 的 losers 返回原本狀態 * ME 的 losers **持續等待** * 速度 (**Fast election**) 很重要 * ME 必須避免 starvation,但 LE 可以由同一個 node 擔任 leader。 * **所有 nodes** 都必須知道結果 * ME 只有 winner 知道 :::info ME can be reduced to LE! (e.g., HBase wants LE, ZooKeeper provides ME) ::: ### Uniqueness requirement - Unique identifier * Leader 必須是 unique。 * largest identifier 贏。 * Unique identifier (UID) 可以是任何「有用的值 (useful value)」,意思就是 **unique** 且 **totally orderable** 的 value。 * 如:process identifiers、IP-port-number * 如:least computational load (負載) * <1/load, i>:load > 0,i 是 UID (用來 break ties) * 各個 node $\mathrm{P_i}$ 都有一個變數 $\mathrm{elected_i}$,儲存著 **leader** 的值或是 **$\perp$ (undefined)**。 ### Election algorithm requirement * **Safety** - correctness * 參與 LE 的 $\mathrm{P_i}$,其 $\mathrm{elected_i}$ 是 $\perp$ 或 $P$,$P$ 是沒有 crash 的 node,而且它有著最大的 ID (**the largest identifier**)。 * **Only one leader at a time!** * **Liveness** – progress is made * 參與 LE 的 nodes 最終除非 crash,否則 $\mathrm{elected_i} \ne \perp$。 ### Summary * LE 是構建分散式系統中基礎的一塊。 * LE algorithms overview (a popular shortlist) * Chang & Roberts, 1979 * HS algorithm, 1980 * Bully, 1982 * Leader Election in Raft, 2014 ### Self-study questions * Why settle on the largest unique identifier as determining characteristic for leader? * Develop your own leader election algorithm by reducing the problem to mutual exclusion. * Find a few more leader election use case scenarios. ## 3-9 LE - CHANG & ROBERTS – RING-BASED ALGORITHM, 1979 ### Assumptions and setup * 建構一個 ring。 * 假設每個 $P$ 都有 **unique identifier (UID)**。 * 假設沒有 failures 與 asynchronous system,**但 failures 會在 election 之前發生**。 * 每個 $\mathrm{P_i}$ 靠傳送 **election message** 至 successor 以啟動 LE (懷疑有 leader failure發生時)。 * Election message 裡有 $\mathrm{P_i}$ 的 UID。 ### Ring-based election algorithm * 目標:找最大的 UID。 * $\mathrm{P_i}$ 發起 election message。 * 所有 nodes 各自管理一個 flag "PARTICIPANT",以記錄自己是否參與 LE (初始值為 false)。 * 在收到 election message 時,$\mathrm{P_k}$ 將會比較 $\mathrm{UID_i}$ 與自己的 $\mathrm{UID_k}$。 * $\mathrm{UID_i} > \mathrm{UID_k}$:$\mathrm{P_k}$ 轉送 (forward) 此 election message。 * $\mathrm{UID_i} < \mathrm{UID_k}$:$\mathrm{P_k}$ 使用自己的 $\mathrm{UID_k}$ 來轉送 election message。 * 除非 $\mathrm{P_k}$ 已經轉送過 election message,也就是說 $\mathrm{P_k}$ 已經參加了現在的 LE,因此 $\mathrm{P_k}$ 不用再轉送訊息。 * $\mathrm{UID_i} = \mathrm{UID_k}$:**$\mathrm{P_k}$ 現在是 leader**,轉送 **victory message** 給所有其他 nodes (resets PARTICIPANT flag)。 ### Ring-based election algorithm 圖解 * Calling an election (determine winner) ![](https://i.imgur.com/OCMQBH9.png =500x) ![](https://i.imgur.com/6wBjcMK.png =500x) ![](https://i.imgur.com/D6gw3W2.png =500x) * Calling an election (origin & victory) ![](https://i.imgur.com/0JyZN2q.png =500x) ![](https://i.imgur.com/YFUi4qE.png =500x) ![](https://i.imgur.com/1c7HfDy.png =500x) ![](https://i.imgur.com/uy0QMfq.png =500x) * Different cases $\mathrm{S_{ID}}$:ID in message from sender $\mathrm{R_{ID}}$:ID at receiver ![](https://i.imgur.com/kFpaolf.png =500x) * Concurrent election start ![](https://i.imgur.com/UsO3227.png =500x) ![](https://i.imgur.com/kyCDLaH.png =500x) ![](https://i.imgur.com/9bJmOzP.png =500x) ### Summarizing observations * **Worst case**:3N -1 messages * **N-1** messages to **reach highest ID** from lowest ID:例如從 $\mathrm{P_1}$ 到 $\mathrm{P_n}$。 * **N** messages to **reach point of origin**:例如從 $\mathrm{P_n}$ 繞一圈回 $\mathrm{P_n}$。 * **Leader announcement** takes another **N** messages:一樣又是 $\mathrm{P_n}$ 繞一圈「廣播」回 $\mathrm{P_n}$。 * **Safety**:即使多個 nodes 同時啟動,LE 仍能成功完成。 * **Liveness**:guaranteed progress **if no failures during the election occur** ### Self-study questions * Develop a mutual exclusion algorithm by reducing the problem to leader election. * How can the algorithm be made to tolerate 1, 2, ... crash faults? * How can the algorithm be made to tolerate message loss? ## 3-10 LE - BULLY ALGORITHM, 1982 * 假設每個 node 都有 unique ID (UID)、reliable message delivery 以及 synchronous system。 * 假設所有 nodes 知道彼此的 UID,且可以直接跟彼此溝通 (communicate)。 * Higher UIDs have priority * Can “bully” nodes with lower UIDs (可以「霸凌」) * 可由任一偵測到(detects) leader failure 的 node 發起。 * 可容忍 nodes crash,即使是在 LE 中 crash。 * 有 crash recovery model。 ### Bully algorithm messages * **Election** message:宣告一個 election。 * **Answer** message:回應 election message。 * **Coordination** message:宣告 victory,並確認 leader 身分。 ### $\mathrm{P_i}$ detects failure of leader For any $j < i$ and any $i < k$ ![](https://i.imgur.com/ehAAQr1.png =200x) 1. $\mathrm{P_i}$ 廣播 **election message** 給所有 $\mathrm{P_k}$ with $i<k$。 1. 任一收到 election message 的 $\mathrm{P_k}$ **回覆 answer message** 給 $\mathrm{P_i}$,然後自己再啟動另一個 election。 1. 任一收到 election message 的 $\mathrm{P_j}$ **不做任何回覆**。 1. 如果 $\mathrm{P_i}$ **沒有收到任一 answer message (timeout)**,**自己廣播 coordination message**。 * 因此,如果 $\mathrm{P_i}$ 知道自己有最高的 UID (例如剛經過 crash recovery),它可以直接宣布自己是 leader (廣播 coordination message)。 1. 如果 $\mathrm{P_i}$ **有收到 answer message(s)** (timeout),**等待接收 coordination message**。 1. 如果 $\mathrm{P_i}$ 在 timeout 前沒有收到 coordination message,**重新發起 election**。 ### Election and answer message flow ![](https://i.imgur.com/Hj33XXz.png =300x) ### Why Bully? (Upon node (leader) crash) * 假設 node 終究會復原 (如果沒復原也沒關係,為什麼?) * Node 可以確定它自己有 the highest UID,因此它可以宣告自己是 leader。 * 就算是系統現階段已經有個運作中的 leader (因為 crash 情節產生的新 LE)。 * 新的 node 會「霸凌」現任 leader。 ### Summary: Safety and liveness * **Safety** – argue by contradiction * 假設兩個 leader $\mathrm{P_i}$ 與 $\mathrm{P_k}$ with $i \ne k$ * 因此,$\mathrm{P_i}$ 與 $\mathrm{P_k}$ 交換 victory messages。 * 因此,$\mathrm{P_i}$ 與 $\mathrm{P_k}$ 交換 election messages。 * 但有著 lower UID 的 node 不會回覆,矛盾! * **Liveness** * Would-be leader (WbL) 在傳送 answer message 後 fail (還沒傳 coordination message)。 * Lower UIDs 的 nodes 根據 timeout period 決策: * 如果 WbL 在 timeout period 內 recover,WbL 會發出 coordination message。 * 如果 WbL 在 timeout period 內沒有 recover,新的 LE 將被啟動。 ### Self-study question * What are the pros and cons of broadcasting election messages to all nodes vs. to only broadcasting them to nodes with higher UIDs? * Is the algorithms safe during an election, explain why or why not? * What is Bully’s worst and best case message complexity, assuming n nodes in total? ## Review ![](https://i.imgur.com/2ezJ3nH.png)