# Week 03
###### tags: `WS2020-IN2259-DS`
## 3-1 Distributed system model
* A model captures all assumptions relevant to the system.
* e.g., network, clocks, processors.
* Algorithms always assume a certain model
* Some algorithms work only in stronger models, which in turn impose more restrictive assumptions.
* A model is of theoretical relevance; whether its assumptions hold in practice is a separate question.
### Synchronous vs. asynchronous model
| Property | Synchronous system model | Asynchronous system model |
| -------- | -------- | -------- |
| Clocks | Bound on drift | No bound on drift |
| Processor | Bound on execution time | No bound on execution time |
| Channel | Bound on latency | No bound on latency |
### Synchronous model
* Each process's local clock drifts from real time at a rate within a known bound.
* Each step of a computation completes within a known bound.
* Each message sent over a channel is received within a known bound.
### Asynchronous model
* Clock drift rates are arbitrary.
* Each step of a computation takes an arbitrary (but finite) amount of time.
* Message delivery takes an arbitrary (but finite) amount of time.
### Two Generals' Problem (Agreement)

1. Who leads the attack?
* Solution
* Largest army first
* If tied, Army 1 (i.e., if both armies have the same size)
* Hence this solution works in both the synchronous and the asynchronous model
2. When to attack?
* Asynchronous agreement
* No bound on delivery!
* Unsolvable!
* Synchronous agreement
* Message takes at least **min** time and at most **max** time to arrive
* Strategy: Army 1 **waits for min time** then attacks
* Guarantee: Army 2 attacks no later than **max – min** time after Army 1
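The guarantee above can be checked with a small sketch (Python; the concrete values for min and max are illustrative assumptions, not from the lecture):

```python
# Sketch of the synchronous two-generals timing strategy, assuming a
# message delay d with t_min <= d <= t_max (values here are illustrative).

def attack_times(send_time, delay, t_min, t_max):
    """Return (army1_attack, army2_attack) under the 'wait for min' strategy."""
    assert t_min <= delay <= t_max
    army1_attack = send_time + t_min    # Army 1 waits min time, then attacks
    army2_attack = send_time + delay    # Army 2 attacks when the message arrives
    return army1_attack, army2_attack

# Whatever the actual delay, Army 2 lags Army 1 by at most (max - min).
t_min, t_max = 2, 5
for delay in range(t_min, t_max + 1):
    a1, a2 = attack_times(0, delay, t_min, t_max)
    assert 0 <= a2 - a1 <= t_max - t_min
```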
### Summarizing takeaways
* Some problems cannot be solved in an asynchronous world, e.g., "when to attack" vs. "who leads the attack".
* Any solution that works in asynchronous distributed systems also works in synchronous distributed systems (the synchronous model is stronger).
* The Internet and many deployed distributed systems are asynchronous.
* Adding timeouts and timing assumptions reduces uncertainty and lets elements of the synchronous model be brought into the picture.
### Self-study questions
* Think of a few design problems that cannot be solved in an asynchronous world …
* … and show how they can be solved in a synchronous world.
* What are some other useful assumptions for distributed system design and how practical are they?
* Can you think of other coordination and agreement problems?
## 3-2 RECAP: MUTUAL EXCLUSION - IN A NON-DISTRIBUTED SYSTEMS CONTEXT
### Problem: Access to shared variables
* Imagine a globally shared variable counter accessed by multiple threads or processes.
* e.g., a key-value record managed by a storage server (or a more complex data structure)
### Shared data & synchronization

* What may happen if multiple threads concurrently access shared state (e.g., a shared variable)?
### Concurrently manipulating shared data
* Two threads execute concurrently as part of the same process
* Shared variable (e.g., global variable)
* counter = 5
* Thread 1 executes
* counter++
* Thread 2 executes
* counter--
* What are all the possible values of counter after Thread 1 and Thread 2 executed?
### Machine-level implementation
* Implementation of “counter++”
```c
register1 = counter
register1 = register1 + 1
counter = register1
```
* Implementation of “counter--”
```c
register2 = counter
register2 = register2 - 1
counter = register2
```
### Possible execution sequences

### Interleaved execution

* Assume counter is 5 and interleaved execution of counter++ (in T1) and counter-- (in T2)
* The value of counter may be 4 or 6, whereas the correct result should be 5!
### Race condition
* **Several** threads **manipulate shared data** **concurrently**. The **final value** of the data **depends** upon **which thread finishes last**.
* In our example (interleaved execution) of
* `counter++`
* `counter--`
* To prevent race conditions, concurrent processes must be **synchronized**
### The moral of this story
* The statements `counter++` and `counter--` must each be executed **atomically**
* Atomic operation: an operation that completes without interruption.
* We obtain atomicity via synchronization primitives.
* Shared variables accessed inside a critical section must be protected by synchronization primitives.
* This is known as the critical section problem, or mutual exclusion.
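A minimal runnable illustration of this moral (Python threading with a lock; variable and function names are illustrative):

```python
# Protecting the shared counter with a synchronization primitive (a lock)
# makes counter++ / counter-- atomic, so the race condition disappears.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: counter++ is now atomic
            counter += 1

def decrement(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: counter-- is now atomic
            counter -= 1

t1 = threading.Thread(target=increment, args=(100_000,))
t2 = threading.Thread(target=decrement, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
assert counter == 0   # with the lock, the result is always the initial value
```

Without the `with lock:` lines, the interleaving of the two threads' load/modify/store steps could leave `counter` at a value other than 0, exactly as in the machine-level example above.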
### Self-study questions
* Do we have a critical section problem in distributed systems? – There is no shared memory!
* Asked differently, do we need to worry about (distributed) mutual exclusion in a distributed system?
* Identify a few mutual exclusion scenarios in distributed systems.
## 3-3 DME - OVERVIEW
### Distributed mutual exclusion
* In distributed systems, mutual exclusion is at least equally **complex**, due to:
* Lack of shared memory
* Lack of a global clock
* Event ordering
* Examples
* Accessing a shared resource in distributed systems
* Acquiring a lock
* One active master to coordinate activities
### Critical section problem - No shared memory
* System with n nodes
* Nodes access shared resources in CS
* **Coordinate access to CS via message passing**
* Application-level protocol for accessing CS
* Enter_CS() – enter CS, block if necessary
* ResourceAccess() – access shared resource in CS
* Exit_CS() – leave CS
### Assumptions - Theoretical rather than practical considerations
* System is asynchronous
* No bound on delays, no bound on clock drift, etc.
* Nodes do not fail
* Message delivery is reliable
* Any message sent is eventually delivered intact and exactly once – i.e., not lost, not duplicated
* Nodes are well-behaved and spend finite time accessing resources in CS
### Mutual exclusion requirements (important!)
* **Safety** – correctness
* At most one node accesses the CS at a time.
* **Liveness** – progress (something good happens)
* Requests to enter/exit the CS eventually succeed: nodes can always eventually enter and leave the CS.
* No deadlock.
* **Fairness**
* Order: if one request to enter the CS happened-before another, entry must respect that order.
* No starvation: no node enters the CS twice while other waiting nodes have not yet entered at all.
### Deadlock & starvation
* Deadlock: two or more nodes are stuck indefinitely at the CS entry.
* e.g., mutual dependency

* Starvation: a node that wants to enter the CS suffers indefinite postponement and never gets in.
### Possible performance metrics
* Bandwidth: number of messages sent/received
* Synchronization delay: the time between one process leaving the CS and the next process entering.
* Client delay: delay at **entry** and **exit** of the CS (response time).
* Access time is not measured: we do not measure client access to resources protected by the critical section (**assumed finite**)
### Solution strategy overview
* Centralized strategy
* Divide nodes into leader and followers; the leader dictates the actions of the followers.
* Distributed strategy: each node independently decides its actions, based on local knowledge of the other nodes' state.
* Token-based: a node may enter the CS while holding the token; the token is passed among the nodes in priority order.
* Non-token-based: a node may enter the CS once an "assertion becomes true"; nodes communicate to learn each other's state and thereby decide whether the node's assertion holds.
### Self-study questions
* What are some examples for mutual exclusion scenarios in distributed systems?
* What are some scenarios where we need locks in distributed systems, after all, there is no shared memory (in our notions of DS)
## 3-4 DME - CENTRALIZED STRATEGY
### Centralized strategy
* Elect leader
* $\mathrm{P_5}$ takes on the role of server / coordinator / **leader** / master

* Empty CS
1. $\mathrm{P_2}$ sends **Enter_CS()** to $\mathrm{P_5}$ (request entry to CS).
2. $\mathrm{P_5}$ returns the **token** to $\mathrm{P_2}$ (grant access to CS).
3. $\mathrm{P_2}$ executes **ResourceAccess()** and enters the CS.
4. $\mathrm{P_2}$ sends **Exit_CS()** to $\mathrm{P_5}$ and leaves the CS.
Diagram:




* Non-empty CS ($\mathrm{P_2}$ is currently accessing the CS)
* Queue: pending requests of nodes waiting for entry to the CS
1. $\mathrm{P_n}$ sends **Enter_CS()** to $\mathrm{P_5}$ (request entry to CS); $\mathrm{P_n}$'s request waits in the queue.
2. $\mathrm{P_1}$ sends **Enter_CS()** to $\mathrm{P_5}$ (request entry to CS); $\mathrm{P_1}$'s request waits in the queue.
3. $\mathrm{P_2}$ sends **Exit_CS()** to $\mathrm{P_5}$ and leaves the CS.
4. $\mathrm{P_5}$ returns the **token** to $\mathrm{P_n}$ (grant access to CS) and removes $\mathrm{P_n}$'s request from the queue; $\mathrm{P_n}$ then executes **ResourceAccess()** and enters the CS.
Diagram:




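Both scenarios above can be sketched as a single-process simulation (Python; the `Leader` class and its method names are illustrative, not part of the lecture material):

```python
# Minimal sketch of the centralized strategy: the leader holds one token
# and a FIFO queue of pending Enter_CS requests.
from collections import deque

class Leader:
    def __init__(self):
        self.token_free = True
        self.queue = deque()        # pending requests of waiting nodes

    def enter_cs(self, node):
        if self.token_free:
            self.token_free = False
            return "granted"        # empty CS: leader sends the token at once
        self.queue.append(node)     # otherwise the request waits in the queue
        return "queued"

    def exit_cs(self):
        if self.queue:
            return self.queue.popleft()   # grant the token to the next waiter
        self.token_free = True
        return None

leader = Leader()
assert leader.enter_cs("P2") == "granted"   # empty CS: P2 enters
assert leader.enter_cs("Pn") == "queued"    # Pn waits
assert leader.enter_cs("P1") == "queued"    # P1 waits behind Pn
assert leader.exit_cs() == "Pn"             # P2 leaves, token goes to Pn
```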
### Summarizing observations
* Achieves: safety, liveness, no starvation
* Does the solution meet the ordering requirement?
* It does!
* Advantages
* Very simple to implement.
* Disadvantages
* **Single point of failure**.
* Bottleneck, network congestion, timeouts
* Deadlock potential for multiple resources with separate servers
* Enter_CS()
* **Two messages**: Request & Grant
* One round of communication (**RTT delay**)
* Exit_CS()
* **One message**: Release message
* **No delay** for the node in CS
### Self-study questions
* Does solution meet the ordering requirement? Why or why not?
* How could a single leader manage multiple resources? Analyse pros and cons of alternative designs?
* Provide a differentiated discussion regarding failure of leader, follower, follower in CS, follower that requested access, etc.
## 3-5 DME - RING-BASED ALGORITHM
* All nodes form a logical ring.
* Each $\mathrm{P_i}$ knows its successor: $\mathrm{P_{(i+1)\mod n}}$.
* The logical topology is given a priori and is unrelated to the physical topology.
* Token passing, illustrated below:





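Token circulation can be sketched in a few lines (Python, single-process; in the real algorithm the token is of course passed over the network):

```python
# Minimal sketch of ring-based token passing: the token circulates
# P0 -> P1 -> ... -> P(n-1) -> P0; a node may enter the CS only while
# it holds the token. wants_cs is the (illustrative) set of node indices
# that currently want the CS.
def circulate(n, wants_cs, rounds=1):
    """Return the nodes that enter the CS, in token order."""
    entered = []
    for step in range(n * rounds):
        holder = step % n                 # token moves to P_((i+1) mod n)
        if holder in wants_cs:
            entered.append(holder)        # holder enters CS, then passes token
    return entered

# Only nodes 1 and 3 want the CS, yet the token still visits every node,
# which is exactly the bandwidth cost noted below.
assert circulate(5, {1, 3}) == [1, 3]
```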
### Ring-based algorithm analysis
* **Safe**
* Only the node holding the token may enter the CS.
* **Live**
* Since each node performs only finite work, the token eventually visits every node.
* **Fair**
* Entry follows the ring order, so there is no starvation.
* **Performance**
* Passing the token continuously consumes network bandwidth, even when no node wants to enter the CS (unless some node is currently inside the CS).
* Synchronization delay: between 1 and N messages (the time between one process leaving the CS and the next entering).
* Client delay: 0 to N messages for entry; 0 for exit.
### Potential problems with ring-based algorithm (Due to our assumption, not all apply here.)
* Node crash
* Lost token
* Duplicate token
* Timeouts on token passing
### Self-study questions
* Does solution meet the ordering requirement? Why or why not?
* How could access to multiple resources be realized with the ring-based ME algorithm?
* How can node failures be mitigated? Develop strategies that could tolerate 1, 2, ... node failures?
* How could lost or duplicated tokens be mitigated?
## 3-6 DME - LAMPORT’S ALGORITHM, 1978
* A system of n nodes.
* $\mathrm{(ts_i, i)}$: logical clock timestamp of node $\mathrm{P_i}$ (total-order version).
* Logical timestamps are used to order requests for the CS.
* Smaller timestamps take priority over larger ones.
* Each node maintains a **request queue** (a priority queue).
* Messages are assumed to be delivered in FIFO order, i.e., messages do not race over a communication channel.
### $\mathrm{P_i}$ requesting CS
* Broadcast a **REQUEST$\mathrm{(ts_i, i)}$** message to **all** nodes
* $\mathrm{P_i}$ also places this **REQUEST$\mathrm{(ts_i, i)}$** in its own $request\_queue_i$
* N.B.: $\mathrm{(ts_i, i)}$ denotes the timestamp of the request
* Diagram:

### $\mathrm{P_k}$ receiving a request to enter CS
* When $\mathrm{P_k}$ receives **REQUEST$\mathrm{(ts_i, i)}$** from $\mathrm{P_i}$:
* $\mathrm{P_k}$ places $\mathrm{P_i}$'s **REQUEST$\mathrm{(ts_i, i)}$** in its own $request\_queue_k$.
* $\mathrm{P_k}$ sends a timestamped REPLY message to $\mathrm{P_i}$: **REPLY$\mathrm{(ts_k, k)}$**.
* Diagram:

### $\mathrm{P_i}$ enters CS
* $\mathrm{P_i}$'s own timestamp $\mathrm{(ts_i, i)}$ is smaller than the timestamps received from all other nodes.
* $\mathrm{P_i}$'s request is at the head of $request\_queue_i$.
* Diagram:

### $\mathrm{P_i}$ releasing CS
* 將自己的 request 從 queue 頂端移除。
* 廣播自己的 timestanped RELEASE message 給所有其他 nodes:**RELEASE$\mathrm{(ts_i, i)}$**
* 示意圖:

### $\mathrm{P_k}$ receiving a release message
* $\mathrm{P_k}$ removes $\mathrm{P_i}$'s request from its own queue.
* Diagram:

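Lamport's entry condition (own request at the head of the local queue, plus a later-timestamped message seen from every other node) can be sketched as a single decision function (Python; the data shapes are illustrative):

```python
# Sketch of Lamport's entry condition for P_i: P_i may enter the CS when
# (1) its request is the smallest in its own request queue, and (2) it has
# seen a message with a larger timestamp from every other node.
# Timestamps are (ts, node_id) pairs, compared lexicographically.

def can_enter(my_request, request_queue, last_seen):
    head = min(request_queue)                        # smallest (ts, id) wins
    later_from_all = all(ts > my_request for ts in last_seen)
    return head == my_request and later_from_all

queue = [(2, 1), (3, 2)]                 # P1 requested at ts=2, P2 at ts=3
# P1 has seen REPLY(4, 2) and REPLY(5, 3), both later than its own request:
assert can_enter((2, 1), queue, [(4, 2), (5, 3)]) is True
# P2's request is not at the head of the queue, so P2 must wait:
assert can_enter((3, 2), queue, [(4, 1), (5, 3)]) is False
```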
### Summarizing observations
* 3(N-1) messages per CS request invocation:
* (N-1) REQUEST messages
* (N-1) REPLY messages
* (N-1) RELEASE messages
* Not fault tolerant.
### Self-study questions
* Would node IDs as timestamps suffice, why or why not?
* Would Lamport clock values as timestamps suffice, why or why not?
* Is the algorithm order-preserving, why or why not?
::: warning
* To think about: what is the synchronization delay?
Conjectured answer: one message (the RELEASE message).
:::
## 3-7 DME - RICART & AGRAWALA, 1981
* Guarantees mutual exclusion among n nodes
* Basic idea
* A node that wants to enter the CS **broadcasts** a **request** to **all other nodes**.
* Entry is permitted once **all nodes** have **granted the request**.
* Use **Lamport timestamps** to order requests: $\mathrm{(ts_i, i)}$
* $\mathrm{ts_i}$: timestamp
* $\mathrm{i}$: node identifier
### Ricart & Agrawala: Distributed strategy
* Each node is in one of three states:
* **Released**: has executed Exit_CS(); currently outside the CS.
* **Wanted**: has called Enter_CS(); waiting to enter the CS.
* **Held**: executing ResourceAccess(); currently inside the CS.
* When a node requests entry to the CS and **all other nodes** are in the **Released state**, the **entry** request is **granted** by every other node.
* If $\mathrm{P_i}$ wants to enter the CS while $\mathrm{P_k}$ is in the Held state, $\mathrm{P_k}$ does not reply to $\mathrm{P_i}$ until it has left the CS.
### Initialization

### Requesting entry to CS
* Request while all nodes are Released




* Request while Held




### Concurrent entry requests
* $\mathrm{P_2}$ and $\mathrm{P_3}$ request entry to CS concurrently






### Pseudo code
* On initialization
```c
state = RELEASED
```
* Enter_CS()
```c
state = WANTED
Broadcast timestamped request to all nodes
wait until ((n-1) acks received)
state = HELD
```
* On receiving a request with $\mathrm{(ts_i, i)}$, at $\mathrm{P_k(i \ne k)}$
```c
if (state == HELD or (state == WANTED and (ts_k, k) < (ts_i, i)))
    // (ts_k, k) is Pk's own request timestamp
    queue request from Pi without replying
else
    send a reply to Pi
```
* Exit_CS()
```c
state = RELEASED
Reply to all queued requests
```
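The receive-request rule from the pseudocode above can be written as a small runnable function (Python; the full message-passing machinery is omitted, and the state names mirror the pseudocode):

```python
# Sketch of the Ricart & Agrawala receive-request rule: defer (queue
# without replying) when we hold the CS, or want it with an earlier
# timestamp; otherwise reply immediately. Timestamps are (ts, id) pairs.
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def on_request(state, my_ts, incoming_ts):
    """Return 'defer' or 'reply' for an incoming request."""
    if state == HELD or (state == WANTED and my_ts < incoming_ts):
        return "defer"      # queue the request without replying
    return "reply"          # grant immediately

# Pk holds the CS: every request is deferred until Exit_CS().
assert on_request(HELD, (1, 2), (3, 1)) == "defer"
# Pk also wants the CS and its own request (2, 2) is earlier: defer.
assert on_request(WANTED, (2, 2), (5, 1)) == "defer"
# Pk wants the CS but Pi's request is earlier: reply.
assert on_request(WANTED, (5, 2), (2, 1)) == "reply"
# Pk is Released: reply immediately.
assert on_request(RELEASED, (0, 2), (7, 1)) == "reply"
```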
### Reminder: Subtlety about timestamps
* Use **Lamport timestamps** to order requests: $\mathrm{(ts_i, i)}$
* $\mathrm{ts_i}$: timestamp
* $\mathrm{i}$: node identifier
* Given two **timestamps** $\mathrm{(ts_i, i)}$ and $\mathrm{(ts_k, k)}$:
* If $\mathrm{ts_i}=\mathrm{ts_k}$, the node identifiers $\mathrm{i}$ and $\mathrm{k}$ break the tie.
* This gives rise to an **arbitrary total order** over timestamps.
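In a language with lexicographic tuple comparison this tie-breaking comes for free; a quick check (Python, illustrative values):

```python
# Lamport (ts, id) pairs compare lexicographically, which is exactly the
# rule above: compare timestamps first, break ties by node identifier.
requests = [(3, 2), (3, 1), (1, 5)]
assert sorted(requests) == [(1, 5), (3, 1), (3, 2)]
assert (3, 1) < (3, 2)      # equal ts: node id 1 precedes node id 2
```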
### Summarizing observations
* Safe: since requests form a total order, two nodes can never be in the CS at the same time.
* Live: every request is enqueued, so every request eventually gets its turn.
* Fairness - ordered: based on timestamps
* Each entry request requires 2(N-1) messages
* N-1 requests
* N-1 replies
* One message round fewer than Lamport's algorithm: no RELEASE messages.
* Synchronization delay (the time between one process leaving the CS and the next entering) is one message
* Not fault tolerant.
### Self-study questions
* Would node IDs as timestamps suffice, why or why not?
* Would Lamport clock values as timestamps suffice, why or why not?
* Is the algorithm order-preserving, why or why not?
* Compare all our distributed mutual exclusion algorithms according to bandwidth use, synchronization delay, and delay at entry and exit.
* Compare all our distributed mutual exclusion algorithms according to their failure resilience.
## 3-8 LE - OVERVIEW
### Leader (a.k.a., coordinator, master, etc.) election
* **Problem**: a group of nodes $\mathrm{P_1,\cdots,P_n}$ **must all agree** on a **unique** $\mathrm{P_k}$ to act as **leader**.
* Typically, the leader **coordinates another activity** on behalf of the other nodes.
* LE (leader election) is started when a leader failure is detected or suspected.
* Any node may **call for an election** if it has not heard from the leader within a predefined time interval.
* A **false alarm** is a possibility: a new election starts while the current leader is still alive.
* **Several nodes** may **start an election concurrently**.
* The algorithm must tolerate nodes crashing during an election.
### Leader election use cases
* Berkeley clock synchronization algorithm
* Centralized mutual exclusion algorithm
* Leader election for choosing ***master*** in HBase, Bigtable
* Choosing ***master*** among ***n*** nodes in Chubby or ZooKeeper coordination service
* ***Primary-backup replication algorithms***
* ***Two-phase commit protocol***
### Leader election vs. mutual exclusion
* Losers of an LE return to their previous state
* Losers of an ME **keep waiting**
* Speed (**fast election**) matters
* ME must avoid starvation, whereas in LE the same node may serve as leader again.
* **All nodes** must learn the outcome
* In ME, only the winner knows
:::info
LE can be reduced to ME!
(e.g., HBase wants LE, ZooKeeper provides ME)
:::
### Uniqueness requirement - Unique identifier
* The leader must be unique.
* The largest identifier wins.
* The unique identifier (UID) can be any "useful value", i.e., any value that is **unique** and **totally orderable**.
* e.g., process identifiers, IP address + port number
* e.g., least computational load
* <1/load, i>: load > 0, with i as the UID (used to break ties)
* Each node $\mathrm{P_i}$ has a variable $\mathrm{elected_i}$ holding either the **leader**'s value or **$\perp$ (undefined)**.
### Election algorithm requirement
* **Safety** - correctness
* For a participating node $\mathrm{P_i}$, $\mathrm{elected_i}$ is either $\perp$ or $P$, where $P$ is a non-crashed node with **the largest identifier**.
* **Only one leader at a time!**
* **Liveness** – progress is made
* Every participating node eventually either crashes or has $\mathrm{elected_i} \ne \perp$.
### Summary
* LE is a fundamental building block of distributed systems.
* LE algorithms overview (a popular shortlist)
* Chang & Roberts, 1979
* HS algorithm, 1980
* Bully, 1982
* Leader Election in Raft, 2014
### Self-study questions
* Why settle on the largest unique identifier as determining characteristic for leader?
* Develop your own leader election algorithm by reducing the problem to mutual exclusion.
* Find a few more leader election use case scenarios.
## 3-9 LE - CHANG & ROBERTS – RING-BASED ALGORITHM, 1979
### Assumptions and setup
* Nodes are arranged in a ring.
* Each $P$ is assumed to have a **unique identifier (UID)**.
* Assume an asynchronous system with no failures during the election, **although failures may occur before the election starts**.
* Any $\mathrm{P_i}$ starts an election (when it suspects a leader failure) by sending an **election message** to its successor.
* The election message carries $\mathrm{P_i}$'s UID.
### Ring-based election algorithm
* Goal: find the largest UID.
* $\mathrm{P_i}$ initiates the election message.
* Each node keeps a flag "PARTICIPANT" recording whether it is taking part in the election (initially false).
* On receiving an election message, $\mathrm{P_k}$ compares the $\mathrm{UID_i}$ in the message with its own $\mathrm{UID_k}$:
* $\mathrm{UID_i} > \mathrm{UID_k}$: $\mathrm{P_k}$ forwards the election message unchanged.
* $\mathrm{UID_i} < \mathrm{UID_k}$: $\mathrm{P_k}$ forwards the election message with its own $\mathrm{UID_k}$ substituted.
* Unless $\mathrm{P_k}$ has already forwarded an election message, i.e., it is already participating in the current election; in that case it does not forward the message again.
* $\mathrm{UID_i} = \mathrm{UID_k}$: **$\mathrm{P_k}$ is now the leader** and forwards a **victory message** to all other nodes (resetting the PARTICIPANT flag).
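The forwarding rule can be simulated in a single process (Python; the PARTICIPANT bookkeeping and the victory round are omitted for brevity, and the UIDs are illustrative):

```python
# Minimal simulation of the Chang & Roberts forwarding rule on a logical
# ring given as a list of UIDs (each node's successor is the next entry).
def ring_election(uids, starter):
    """Return the elected UID: the message circulates until UID_i == UID_k."""
    n = len(uids)
    idx = uids.index(starter)
    msg = starter
    while True:
        idx = (idx + 1) % n          # forward to successor
        me = uids[idx]
        if msg > me:
            pass                     # forward the message unchanged
        elif msg < me:
            msg = me                 # substitute own, larger UID
        else:
            return me                # message carries my own UID: I am leader

# The largest UID wins, regardless of which node starts the election.
assert ring_election([3, 17, 5, 9], starter=3) == 17
assert ring_election([3, 17, 5, 9], starter=9) == 17
```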
### Ring-based election algorithm, illustrated
* Calling an election (determine winner)



* Calling an election (origin & victory)




* Different cases
$\mathrm{S_{ID}}$: ID in message from sender
$\mathrm{R_{ID}}$: ID at receiver

* Concurrent election start



### Summarizing observations
* **Worst case**: 3N - 1 messages
* **N-1** messages to **reach the highest ID** from the lowest ID: e.g., from $\mathrm{P_1}$ to $\mathrm{P_n}$.
* **N** messages to **reach the point of origin**: e.g., from $\mathrm{P_n}$ all the way around back to $\mathrm{P_n}$.
* The **leader announcement** takes another **N** messages: again a full circle from $\mathrm{P_n}$ back to $\mathrm{P_n}$.
* **Safety**: the election completes correctly even if several nodes start it concurrently.
* **Liveness**: guaranteed progress **if no failures occur during the election**
### Self-study questions
* Develop a mutual exclusion algorithm by reducing the problem to leader election.
* How can the algorithm be made to tolerate 1, 2, ... crash faults?
* How can the algorithm be made to tolerate message loss?
## 3-10 LE - BULLY ALGORITHM, 1982
* Assume each node has a unique ID (UID), reliable message delivery, and a synchronous system.
* Assume all nodes know each other's UIDs and can communicate with one another directly.
* Higher UIDs have priority
* They can "bully" nodes with lower UIDs
* An election can be initiated by any node that detects a leader failure.
* Tolerates node crashes, even crashes during an election.
* Includes a crash-recovery model.
### Bully algorithm messages
* **Election** message: announces an election.
* **Answer** message: responds to an election message.
* **Coordination** message: announces victory and establishes the leader.
### $\mathrm{P_i}$ detects failure of leader
For any $j < i$ and any $i < k$

1. $\mathrm{P_i}$ broadcasts an **election message** to all $\mathrm{P_k}$ with $i<k$.
1. Any $\mathrm{P_k}$ that receives the election message **replies with an answer message** to $\mathrm{P_i}$, and then starts another election itself.
1. Any $\mathrm{P_j}$ that receives the election message **does not reply**.
1. If $\mathrm{P_i}$ **receives no answer message (timeout)**, it **broadcasts a coordination message itself**.
* Hence, if $\mathrm{P_i}$ knows it has the highest UID (e.g., right after crash recovery), it can immediately declare itself leader (broadcast a coordination message).
1. If $\mathrm{P_i}$ **does receive answer message(s)**, it **waits for a coordination message** (with a timeout).
1. If $\mathrm{P_i}$ receives no coordination message before the timeout, it **restarts the election**.
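The outcome of these steps can be sketched recursively from the initiator's point of view (Python; this assumes a fixed alive-set and ignores timeouts and mid-election crashes, so it is illustrative only):

```python
# Sketch of one Bully election round: election messages go only to higher
# UIDs; whoever hears no answer before its timeout declares victory.
def bully_round(initiator, alive):
    """Return the UID that ends up broadcasting the coordination message."""
    higher = [p for p in alive if p > initiator]   # election msgs go upward
    if not higher:
        return initiator        # no answer before timeout: initiator wins
    # Each answering node starts its own election; this bottoms out at the
    # highest alive UID, which hears no answer and declares itself leader.
    return bully_round(max(higher), alive)

alive = {1, 2, 3, 5}            # assume UID 6 (the old leader) has crashed
assert bully_round(2, alive) == 5
assert bully_round(5, alive) == 5
```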
### Election and answer message flow

### Why Bully? (Upon node (leader) crash)
* Assume the crashed node eventually recovers (it is also fine if it never does; why?)
* The node can determine that it has the highest UID, so it declares itself leader.
* Even if the system already has a working leader at this point (newly elected because of the crash).
* The returning node "bullies" the incumbent leader.
### Summary: Safety and liveness
* **Safety** – argue by contradiction
* Suppose there are two leaders $\mathrm{P_i}$ and $\mathrm{P_k}$ with $i \ne k$.
* Then $\mathrm{P_i}$ and $\mathrm{P_k}$ exchanged victory messages.
* Then $\mathrm{P_i}$ and $\mathrm{P_k}$ exchanged election messages.
* But the node with the lower UID would not have replied: contradiction!
* **Liveness**
* Suppose the would-be leader (WbL) fails after sending an answer message but before sending the coordination message.
* The nodes with lower UIDs decide based on a timeout period:
* If the WbL recovers within the timeout period, it sends the coordination message.
* If the WbL does not recover within the timeout period, a new election is started.
### Self-study questions
* What are the pros and cons of broadcasting election messages to all nodes vs. to only broadcasting them to nodes with higher UIDs?
* Is the algorithms safe during an election, explain why or why not?
* What is Bully’s worst and best case message complexity, assuming n nodes in total?
## Review
