6 - 4 Chord & Consistent Hashing and Random Trees

# Consistent Hashing 在網路的世界，伺服器的新增或刪除（斷線、失效）都是很常見的，我們將伺服器想成一個節點無論新增或刪除節點都只想更新最小程度的資料。而且，client跟server之間引入cache減緩伺服器是很正常的事，我們想將data平均分散在各個伺服器，client想要快速定位data要去哪個cache找特定的資料，一個很明顯的解法就是hashing。但是傳統定義的hashing 如 (k) mod m 這種方法若buckets數（node數），因為新增或刪除節點而改變，mapping的結果可能需要全面修改，跟我們剛提及的目標完全不同，因此我們想要尋找一種不會因新增或刪除節點的hashing方法，稱為consistent hashing，或是distributed hash table(DHT). View的定義： We define a ***view*** to be 對一個client來說他能得知的cache 集合 ## Definiations - Balance - items mapped 平均的分散到 each bucket - Monotonicity - 此property是說若一個bucket集合最初為V1，之後新增了一些buckets形成另一集合V2，然而有可能會導致一些資料從v1的舊bucket搬到v2才新增的bucket，但不可能移到v1就已經有的舊bucket。考慮到consistency,這也是比較直覺的想法，當可用的bucket集合改變了，只要移動少數相關bucket的data來維持balance，無關的bucket就不用移動。 - Spread - 因為是分散式的環境，每個client的view不同，用consistent hashing將item指派到bucket可能會有相同的資料，但因為view不同，所以分配到不同的bucket，那麼在所有資料中，一個資料最多會被分配到i個不同的bucket，就稱為spread的程度，好的hashing function應讓此程度越小越好 - Load - load property 跟 spread很像.但這次我們從bucket的角度思考，而不是 item的角度. 此特性定義為在所有client的view中至少一個人覺得此bucket有最多 b 個不同的item。稱之此bucket的load，依舊是能越低越好。 ## System model Chord simplifies the design of peer-to-peer systems and applications based on it by addressing these difficult problems ### Features: 1. Load Balance: Chord uses a distributed hash function to evenly distribute keys across nodes, ensuring a balanced load without hotspots. 2. Decentralization: Chord is fully distributed with no single point of failure or hierarchy, enhancing robustness and making it suitable for loosely organized peer-to-peer applications. 3. Scalability: The cost of a Chord lookup grows as the log of the number of nodes, so even very large systems are feasible. No parameter tuning is required to achieve this scaling 4. Availability: Chord dynamically updates its internal tables to reflect node joins and departures, ensuring continuous operation and the ability to locate keys despite ongoing changes in the network. 5. Flexible Naming: The Chord keyspace is flat and does not impose any structure on the keys, allowing applications to map their own names to Chord keys as needed. ### Interaction with Applications - The Chord software is provided as a library that applications can link to. Applications interact with Chord mainly in two ways: 1. Requesting the IP address of the node responsible for a given key. 2. Being notified of changes in the set of keys a node is responsible for, allowing the application to manage data accordingly. ### Example Applications The paper lists several example applications that can be built on top of Chord, demonstrating its flexibility and utility: - Cooperative Mirroring: Multiple content providers can share the load of storing and serving each other’s data, lowering overall system cost and balancing load among participants. - Time-Shared Storage: Nodes with intermittent connectivity can store each other's data, ensuring data availability even when some nodes are offline. - Distributed Indexes for Keyword Search: Chord can support applications like Gnutella or Napster by mapping keywords to lists of documents or machines that contain those keywords. - Large-Scale Combinatorial Search: Applications such as code-breaking can distribute the task of testing candidate solutions across the network, with Chord efficiently managing the distribution of these tasks. ### Implementation and Maintenance The Chord system maintains routing information in the form of a finger table, which helps in efficiently locating the node responsible for any given key. Nodes use a stabilization protocol to periodically update their finger tables and successor pointers, ensuring that the system remains consistent and performs well even as nodes join and leave. ## CHORD PROTOCOL In this paper, we assume that communication in the underlying network is both symmetric (if A can route to B, then B can route to A), and transitive (if A can route to B and B can route to C, then A can route to C). - Chord 提供hash分散運算，可以快速的mapping keys到對應的節點，consistent hashing有幾個很好的特性，像是能夠有很高的機率能夠load balance，也就是說能將keys大致平均的分給節點們，也因此在新增或刪除節點時，能夠用最小的effort完成維持load balance。. chord還改善了consistent hashing的scalability，不需要讓每個節點知道每個節點的資訊，我們可以透過routing的方式，node只需要知道一些的routing 到其他節點的其他資訊，因為這個資訊是分散的，那節點就能夠透過跟其他節點溝通的方式來完成hashing。在n個節點的網路中，我們需要maintain 關於其他O(logN)個節點的資訊，lookup需要最多O(logN)個message - consistent hashing function 這篇論文採用 SHA-1 作為Hash Function，將每個key & node 用SHA-1 hashing 成 m-bit的id，node的id是ip address，SHA-1(IP address) 得到 NodeID。SHA-1(Key) 得到KeyID。這邊有個前提是id長度m是大到能夠讓兩個節點或是key hashing到相同的id的機率可以小到忽略不計的。 Hash Space為一個m-bit的ID-Space，串聯成一個環，key會被assigned到順時鐘碰到的第一個節點，稱為successor，下圖為m = 6的示意圖 ![Screenshot from 2024-05-15 21-36-45](https://hackmd.io/_uploads/S1dK9Efm0.png) 當node n離開，所有在node會被移動到node n 的successor，當node 新增，會移動到該存到該節點的key到node n，以上圖為例 node 26 were to join with id 26，因此key 24會被移動到node 24 - Basic Lookup Mechanism 每個node都只maintain自己的successor pointer和predessor pointer 就像doubly circular Linked List那樣，要查詢key所在的節點就需要一個一個節點查詢，若不在這個節點就傳給successor，直到找到為止，圖(a)為這個簡單的演算法的pseudo code，這個演算法會在node id 大於等於要查詢的key id 時停下。圖(b)為一範例，client連上node 8要查詢key id 54會一直走到node 56，這樣最差的情況需要O(N)的時間，因此後面提出一個更有效率的方法。 ![Screenshot from 2024-05-15 21-57-18](https://hackmd.io/_uploads/SJW8JHzQA.png) - Scalable Key Location 為了有效率的Lookup，為了幫助我們快速lookup，那我們先改為每個node紀錄最多m個其他節點，形成一個稱為finger table的表格，node n的第i個entry為 n + 2 ^ i後的successor i.e. 以這張圖(a)為例，第0個entry為 8 + 2 ^ 0 = 9, 最靠近9的node為n14，因此第一個enrty對應的node為n14，第1個entry為 8 + 2 ^ 1 = 10，最靠近的因此也為n14 ![Screenshot from 2024-05-15 22-21-31](https://hackmd.io/_uploads/SkgbHHMmR.png) 因此改良的演算法為若我們連上node n就會透過finger table知道我能跳到哪個node，如果有我想要的節點，就是key最靠近的node，若finger table上沒有我們想找的key對應的節點，就跳到finger table最後一個entry來看他的finger table有沒有我要的節點，如此循環。一樣是剛剛那個例子n8 找k54，n8的finger table可以跳到n42，n42的finger table知道要到n56取得資訊，回傳給client要連上n56取得想要的資料這樣的方法可以將lookup降到O(logN)，因為每次都刪去了將近一半的資料，就像是Binary search，因此就算系統scale up，lookup時間也是以對數成長，得到更好的scability 一個節點的finger table雖然有m個entry但只有logN個不同的Node，因此storage overhead不會太大。 - Dynamic Operations and Failures - create()新增一個chord網路 - join是新增一個節點到chord網路並設定predecessor及尋找設定successor - 由於P2P網路更動頻繁，若是一有節點更動，就需要更新的話，確保finger table隨時為正確的overhead太大了，可適度搭配剛剛提到的basic chord，finger table未必要是最新的，因此下列操作都是一段時間才執行一次。 - stablilize是尋找插入後的successor的predecessor x,看是否要設定x的successor為此新插入的節點，判斷只有是x是新插入的節點才需要更新。 - 並通知原先的successor更新他的predecessor為此新插入的節點 - fix fingers 就是更新finger table為正確的狀態 - check_predecessor檢查predecessor是否失效。 ![Screenshot from 2024-05-16 01-14-39](https://hackmd.io/_uploads/ryGi6PzXA.png) ![Screenshot from 2024-05-16 01-29-29](https://hackmd.io/_uploads/S1iHW_MmA.png) - 新增Node: 主要分成了三個步驟 1. 建立新加入的Node的所有Finger Table Entries 2. 更新現有的Node的Finger Table 3. 將受影響到Key，重新分配他們到新加入的Node > 等於連上一個Node n 做m次Lookup，尋找j+2^0, j+2^1, ..., j+2^m-1 這幾個ID 等於連上一個Node n 做m次Lookup 尋找j+2^0, j+2^1, ..., j+2^m-1 這幾個ID 步驟二: 更新現有的Node的Finger Table Node j 的加入迫使有些舊有的 Node 之Finger Table 一部分的entries的ID必須重新應對指到 Node j 上有哪些Node的Finger Table會受到影響呢？答案就是 [pred(j)-2^i+1, j-2^i] i = 0~m-1 這些區間的Nodes都需要調整 ![image](https://hackmd.io/_uploads/SkcoquGQ0.png) 舉例來說: m = 6 j = 28，j的predecessor為Node 21 則區間為 [21-2^i+1, 28-2^i] i = 0~5 E.g. i = 4: [21-2^4+1, 28-2^4] = [21-15, 28-16] = [6,12] Node 8在這個區間內，他的Finger Table裡面一定有entries的會落在N21-N32之間，在N28加入後必須重新指向N28，就需要調整。而總共要調整的Node有 O(log N) 步驟三: 將受影響到Key，重新分配他們到新加入的Node 延續上圖N21先聯絡原本的Successor也就是N32 將N32裡面現在必須指向N28的Key全部copy到N28來然後N21將自己的Successor改為N28 如下圖: ![Screenshot from 2024-05-16 01-46-21](https://hackmd.io/_uploads/BJcZHuzm0.png) 以上就是Node新加入後的處理，而移除也是幾乎一樣的步驟，只是反過來做。 - Handling Failure: 以這張圖為例，若n14, n21,n32同時失效，node 8不會知道他successor應為node 38，因為沒有finger 指向node 38，不正確的succesoor會導致不正確的lookup，假設query key30，node will return node 42 instead of the correct successor, node 38，因此我們需要maintain一個長度為r的successor list，內有該節點的前r個successor，若node的下個successor沒有回應，我們就利用successor list尋找下個正確的successor，會不會整個list的successor都沒回應，會！，所以我們要找一個合適的長度r讓這件事情的發生機率非常小。用此方法可以增加 robustness。 ![Screenshot from 2024-05-16 01-57-09](https://hackmd.io/_uploads/r1RFwdMXC.png) - Voluntary Node Departures: 自願離開的節點可視為失效的節點，但額外兩個操作可以讓效能改善，在離開前自己擁有的key轉移給successor，並通知 predecessor and successor更新。 - Realistic: In practical scenarios where Chord networks experience continuous node joins and departures, the stabilization process ensures the network remains in an "almost stable" state. As long as the stabilization process runs frequently enough relative to the rate of changes, lookups remain efficient and correct. This more realistic analysis confirms that Chord performs well under constant change, maintaining its robustness and efficiency. 实际应用中并不充分。上述定理假定Chord ring开始时处于稳定状态然后出现连接或故障。实际上，Chord ring永远不会处于稳定状态；相反的，加入和退出会不断发生，与stablize交錯執行。在新的变化发生時，ring會沒有時間執行stablize。和弦算法可以在这种更普遍的情况下进行分析。一般情况下进行分析。其他研究表明，如果稳定协议以一定的速率运行（取决于节点加入和失败的速率节点加入和失效的速度），那么和弦环会持续处于几乎稳定 "的状态，在这种状态下，查找既快速又正确。 ## Simulation Result 1. Protocol Simulator - Chord可以採用迭代或遞歸實現；模擬使用迭代方法。 - 穩定化步驟確保節點定期更新其後繼者列表和指紋表條目。 - 因此，如果節點的後繼者列表和指紋表總共包含 ***k*** 個獨特條目，則每個條目在每 ***k*** 次穩定化round中更新一次。 - 模擬器包括節點離開和失敗檢測的優化。 - 數據包傳輸延遲呈指數分布，平均值為50毫秒。 - 如果查找到達當前後繼者，則認為查找成功。 2. Load Balance In a network with nodes and keys, we would like the distribution of keys to nodes to be tight around N / K. 考慮一個有10 ^ 4個節點的網路並且最初有10^5個key，一次增加10^5個key直到有10^6個key為止，我們每個節點數都跑20次實驗取得平均值及第一個和第99個百分點，如圖(a)，可以看到變化率和key的數量有線性增長的關係，可看到機率分佈函數圖(b) ![Screenshot from 2024-05-16 10-59-41](https://hackmd.io/_uploads/H1H-Pe7m0.png) ![Screenshot from 2024-05-16 10-59-56](https://hackmd.io/_uploads/S1rbDlmQC.png) 之所以會這樣狀況是因為，node在id space中分佈不平均，導致某些節點很多key，某些節點卻完全沒有key，解決方式是引入virtual node，virtual node平均的分佈在整個id space，key都會mapping到virtual node virtual node再mapping到真正的node，要考量的成本是這樣儲存overhead會更大，可參考圖(b) ![Screenshot from 2024-05-16 11-16-17](https://hackmd.io/_uploads/B1oi5lm7C.png) 3. Path Length Chord的效能跟一次query需要探訪幾個節點有很大的關係，作者模擬的網路有N = 2 ^ k個節點，儲存 100 * 2 ^ k個key, k from 3 to 14，如下圖 ![Screenshot from 2024-05-16 11-24-14](https://hackmd.io/_uploads/HJVu3x7XR.png) 圖B是k = 12時的機率分佈函數 ![Screenshot from 2024-05-16 11-24-36](https://hackmd.io/_uploads/SJdYneXQR.png) 從圖a可以推導出平均一次query需要探訪1/2logN個節點，隨著節點數量成對數成長 4. Simultaneous Node Failures 實驗設置 - 一個最初包含1000個節點的網絡。 - 每個節點維護一個大小為5的successor list。 - 穩定後，每個節點都有一定的機率失效。 - 實驗在節點失敗後進行10,000次隨機查詢，記錄超時、聯繫的節點數量以及找到關鍵字當前後繼者的成功率。 - 在失敗發生前，穩定化過程停止，以評估Chord在壓力下未進行恢復調整的即時表現。測量指標： - 每次查詢的超時數量。 - 每次查詢期間聯繫的節點數量。 - 找到正確後繼者的成功率。結果： - 表格II展示了不同失敗節點比例下的路徑長度、超時數量的平均值、第一個百分位和第九十九百分位數據。 - 隨著失敗節點比例的增加，路徑長度和超時數量均有所增加。 ![Screenshot from 2024-05-16 11-41-17](https://hackmd.io/_uploads/rJksl-7QR.png) 作者用數學模型推導證明system robustness跟successor list 長度有關，支持前面提到的定理，且實驗數據也和數學推導的結果相去不遠。 5. Lookups During Stabilization - 實驗設置： - 節點根據泊松過程加入和離開網絡，平均到達率為 λ。 - 每個節點在15至45秒之間均勻分布的間隔執行穩定化程序。 - 關鍵字查找根據每秒一次的泊松過程生成。 - 網絡起始時有1,000個節點，每個節點維護一個大小為5的後繼者列表。 - 離開程序包括優化措施，離開的節點將其鍵轉移給其後繼者並通知其predecessor 和successor。 - 結果： - 表格III展示了作為節點加入和離開率函數的路徑長度和超時次數的平均值、第一百分位和第九十百分位數據。 - 平均路徑長度保持接近穩定狀態的值約3.82，並且在加入和離開率增加時不會發生劇烈變化。 - 相當接近表格II節點失效機率為0時的平均路徑長度。 - 隨著 λ 的增加，超時次數增加，表明節點離開更頻繁導致更多查找失敗。 - 分析： - 每個節點的指針因穩定化、離開和查找事件的交錯而經歷超時。 - 查找失敗是由於狀態不一致造成的，如後繼者指針錯誤或鍵尚未遷移到新加入的節點。 - 儘管節點不斷加入和離開，Chord仍能維持有效的查找。 - 系統在節點持續更迭的動態環境中顯示出彈性，保持穩定的平均路徑長度，並有效處理查找失敗。 ![Screenshot from 2024-05-16 12-04-32](https://hackmd.io/_uploads/rkNy8Zm7R.png) 6. Improving Routing Latency - Challenge: 雖然Chord保證平均路徑長度為O(logN)，但由於id 空間中靠近的節點在底層網絡中可能相隔很遠，實際的lookup latency可能會很高。 - 新協議擴展： - 新方法為每個finger維護一組替代node，根據實際網絡距離選擇最近的節點。 - 每個節點將其finger關聯一個直接後繼者列表，並修改find_successor函數以返回網絡距離最近的節點。 - 修改後的函數返回指紋及其後繼者中最近的節點。 - 模擬和評估：考慮兩種網絡模型：三維歐幾里得空間和轉運-支幹拓撲。 - 3D歐幾里得空間：將網絡延遲建模為3D空間中的幾何距離。 - 轉運-支幹拓撲：反映互聯網的層次結構，對內部轉運域、轉運-支幹和內部支幹連接有不同的延遲。 - 評估使用的主要指標是查找延伸，定義為Chord查找延遲與最佳網絡延遲的比率。 - 表IV展示了迭代和遞歸風格的lookup stretch的中位數、第十百分位和第99百分位數據。 - 遞歸查找比迭代查找更快，因為它們每跳只產生單向延遲而不是往返延遲。 While Chord ensures the average path length is O(logN), the actual lookup latency can be high because nodes close in the identifier space can be far apart in the underlying network. - New protocol extension: - The new approach maintains a set of alternate nodes for each finger, selecting the closest node based on a network proximity metric. - Each node associates with its fingers a list of immediate successors, and the find_successor function is modified to return the closest node in terms of network distance. - The modified function returns the closest node among a finger and its successors. - Simulation and Evaluation: Two network models are considered: a three-dimensional Euclidean space and a transit-stub topology. - 3D Euclidean Space: Modeled network latency as geometric distance in a 3D space. - Transit-Stub Topology: Reflects the hierarchical structure of the Internet with different latencies for intra-transit domain, transit-stub, and intra-stub links. - The main metric used for evaluation is the lookup stretch, defined as the ratio of the Chord lookup latency to the optimal network latency. - Table IV presents the median, tenth, and 99th percentiles of lookup stretch for both iterative and recursive styles. - Recursive lookups are faster than iterative ones because they incur a one-way latency per hop instead of a roundtrip latency. ![Screenshot from 2024-05-16 12-21-15](https://hackmd.io/_uploads/rJRTK-7XR.png)