Week 09 - Peer-to-Peer Systems

# Week 09 - Peer-to-Peer Systems ###### tags: `WS2020-IN2259-DS` ## P2P System - Introduction ### Client-Server Architecture * Well known, powerful, reliable server as data source. * Clients request data from server. * Very successful model: * WWW (HTTP), FTP, Web services, etc. * N-tier architecture. * 圖示： ![](https://i.imgur.com/eDJHaNW.png =300x) * **限制**： * 昂貴的 scalability：vertical vs. horizontal。 * 單點故障 (single point of failure)。 * 需要系統管理 (administration)。 * 於網路邊緣 (network edge) 的未使用資源 (Unused resources)： * 如頻寬 (bandwidth)、儲存空間 (storage) 等。 * 而 P2P System 嘗試解決 (address) 這些限制並利用 (leverage) 未使用的資源。 ### Peer-to-Peer * P2P 即是透過於 peers (nodes) 之間**直接的交換**，來分享 **compute resources** 與 **services**。 * 這些資源與服務的交換包含： * **data** * **processing cycles** * **cache storage** * **disk storage** * **bandwidth** * P2P 利用現有的資源 (如：computing power、computer storage、networking connectivity)，允許使用者利用**集中力量 (collective power)** 以帶給**所有人利益 (benefit)**。 ### What is a P2P system? * 一個有著下述特性的分散式系統架構： * 非集中式架構：No centralized control。 * 各個節點有著相同 (對稱) 的功能：Nodes are symmetric in function。 * 很多的 unreliable nodes。 * 受惠於科技的進步得以實現。 ### P2P Architecture ![](https://i.imgur.com/Q43fXKV.png =350x) * 所有的 nodes 同時是 clients 與 servers。 * 沒有集中式的資料來源 (centralized data source)。 > “The ultimate form of democracy on the Internet” > “The ultimate threat to copyright protection on the Internet” * 實作上，**hybrid models** 十分流行： * client-server 與 peer-to-peer 的結合。 * 如：Skype (early days, now unknown) * 如：Spotify (peer-assisted) * 如：BitTorrent (trackers) ### P2P Benefits * **資源的有效利用 (efficiency)**： * Unused bandwidth * Storage * Processing power at the edge of network * **擴充性 (scalability)**： * 對於資源的使用同樣貢獻了資源。 * 總體資源 (Aggregate resources) 隨著使用而成長。 * Organic scaling：愈多使用者、愈多資源。 * 「Infrastructure-less」 scaling。 * **可靠性 (reliability)**： * 很多 replicas (redundancy)。 * 地理上的分布 (Geographic distribution)。 * 沒有單點故障的問題：no single point of failure and control。 * **易於管理 (administration)**： * 自成體系 (self-organize) 的 nodes。 * 不需要佈署 servers。 * 自帶的 fault-tolerance、replication 與 load balancing :::warning 但 P2P 並非全然適用 (one size fits all)，很多大型企業並沒有轉向 P2P 的懷抱。 ::: ### Use Cases: Large-Scale Systems * 需要巨量資源 (immense resources) 的應用： * 吃 CPU：Scientific data analysis * 如：*@home * 吃 bandwidth：Streaming、file sharing * 吃 storage：Decentralized data、file sharing * Thousands or even millions of nodes * 但要如何有效的管理如此龐大的網路呢？ #### SETI@home – started in 1999 ![](https://i.imgur.com/GQhOEUG.png =500x) * 5.2 million participants worldwide * On September 26, 2001, had performed a total of 1021 flops * 35 GB/data per day, 140K work units * 30 hours/WU * 4.2M hours of computation/day * Centralized database #### 2015 NA Traffic ![](https://i.imgur.com/eKX7Y8s.png) ### P2P File Sharing Systems * 大規模的共享檔案 (large-scale sharing of files)： * 使用者 A 令自己電腦上的檔案 (如音樂、影片等) 可以為他人所存取。 * 使用者 B 透過連接「P2P 網路」，搜尋並直接從使用者 A 的電腦下載檔案。 * P2P 網路： * Peers 互相連結，以形成「**覆蓋網路 (overlay network)**」。 * Peers 使用於「**覆蓋網路 (overlay network)**」中建立的連結 (links) 互相通訊。 * 沒落：fallen out of favor (RIP 1999-2015) * 版權侵犯 (copyright infringement) 問題層出不窮。 * 雲端基礎架構 (cloud infrastructures) 取而代之 (controlled resources)。 * 難以透過 mobile devices 與 connected devices 使用。 * **串流 (streaming)** 崛起，檔案分享 (file sharing) 變成陳舊 (obsolete) 的技術 * cf. P2P Streaming ### Types of P2P Systems * Unstructured networks * 沒有明顯的 overlay topology 架構。 * Peers 簡單地透過現有網路進行連接。 * First generation * Centralized P2P (集中式)：**Napster** * Pure P2P：**Gnutella**、Freenet * Second generation * Dynamic "supernodes" * 混合式 (hybrid)：**Skype**、**Spotify**、FastTrack、eDonkey、**BitTorrent** * Structured networks * 有控制的 (controlled) overlay topology：peers 之間的連結是固定的。 * 基於 **distributed hash table abstraction (DHT)**。 ## P2P System - Unstructured peer-to-peer systems ### Centralized P2P: Napster “June 1999 – July 2001” * 集中式的 `server` 負責提供音樂檔案索引搜索 (centralized search indexes music files) * 擁有最全面的資訊 (perfect knowledge)。 * 瓶頸 (bottleneck) 所在。 * `Client` 向 `server` 發送 query。 * `Napster server` 回復擁有該檔案的目標 `user` 之 IP address。 * `Client` 像擁有檔案之 `user` **直接**連線並下載檔案。 * 美國唱片協會 (RIAA，Recording Industry Association of America) 最終輕鬆地關閉了 Napster，這是因為 Napster 有集中式管理的 server。 * 圖示： ![](https://i.imgur.com/9Zfj0zy.png =300x) ### Pure P2P: Gnutella 0.4 (2000 – 2001) * 共享各種格式的檔案。 * 非集中式的搜索： * Imperfect content availability：不保證一定可以搜索到存在於網路的內容。 * 每個 client 連接 3 個 peers (平均值)。 * **Flood a `QUERY`** to connected peers：一個 node 傳送 `QUERY` 給所有與其連接的 peers，其他 nodes 依序傳送下去，如此稱之為 flood。 * TTL (Time to live)：最多 flood 幾個節點，例如通常設定為 7 個。 * Flood 非常浪費 bandwidth，而後來的版本使用了更為精細的搜索方法。 * 若有 matching 的搜索，這些 nodes (users) 將回傳 **`QUERYHIT`** * 圖示： ![](https://i.imgur.com/wBvQFIr.png =300x) ### Hybrid P2P * Centralized vs. Pure： * Centralized： * 優點：perfect content availability，資料的可用性較高。 * 缺點：single point of failure，單點故障。 * 缺點：安全，easy to control，只要掌握了 server，就等於掌握了整個架構。 * Pure： * 優點：decentralized (resistant)：非集中式管理較能抵抗 nodes 的故障。 * 缺點：costly，很消耗頻寬。 * 缺點：unreliable search，不保證一定可以搜索到網路中存在的資料。 * Hybrid P2P (混合式的 P2P)： * Peers 區分階層與職責：hierarchy of peers。 * **`Superpeers`**：永有更多容量 (capacity)，並可以被**動態**地搜尋與發現 (discovered **dynamically**)。 * 一般使用者的角色為 **`normal peers`** (**`leaf nodes`**)。 * **`Superpeer`** 的職責： * 參與搜尋： `protocol`、`indexes`、`caches data`。 * 改善 content availability：類似集中式管理的功能。 * 減少 message load：也就是減少頻寬消耗。 ### Skype Network (around 2004) * **Super Nodes**： * 任何擁有足夠 **`CPU`**、**`memory`** 與 **`network bandwidth`** 的節點，均為 **`superpeer`** 的候選者。 * **Ordinary Host**： * 一般像 **`Skype login server`** 註冊的使用者。 * 透過向 `superpeer` 連接，以連接到其他使用者 (如位於通訊錄內的好友)。 * 其中 **`Login server`** 與 **`PSTN gateway`** (public switched telephone network，圖片中未繪製) 在整個系統中，設計為**集中式的元件 (`centralized components`)**。 * 圖示： ![](https://i.imgur.com/9XHVFpu.png =300x) #### Responsibilities of Superpeers (around 2004) * 負責使用者目錄 (user directory) 的索引： * 分散於各個 `superpeers`。 * 為了完成查找任務，`superpeers` 間會互相進行通訊。 * Communication relay (中繼)： * NAT traversal：為了穿過防火牆的設計。 * 於 2011 年，`superpeers` 被 Microsoft 逐漸淘汰，以下為猜測原因： * 使用 `private server` 替代 `public nodes`。 * 轉換為可以被公司控制的 server-based 「**`dedicated supernodes`**」。 * 一個 `dedicated supernode` 可以服務更多的 client。 * `dedicated supernode` 位於受保護的資料中心 (**`protected data centers`**)：更加安全。 * 程式的運行與維護上較不複雜 (**running code that is less complex**)：比在 client code base 容易。 #### Skype Impact * Skype 展現了： * **Signaling**： * 傳統電話系統的特色，現在已經可以簡單地透過 self-organizing P2P networks 來達成。 * P2P overlay networks can **scale**： * 除了小型簡單的 signaling 以外，P2P 可以擴增規模以處理更進一步的 connection-oriented real-time services，如影像與聲音。 * AT&T：傳統電話的末日。 > The end of landlines … ## P2P System - DHT ### DHT structured P2P Systems * 第三代的 P2P overlay networks。 * Self-organizing、load balanced、fault-tolerant。 * **快速**而且**有效率**的 **lookup guarantees**： * 傳統的搜尋複雜度：$O(\log(n))$。 * Centralized P2P：$O(1)$。 * Pure P2P：$O(n)$。 * Hybrid P2P：$O(\#S)$。 * 基於 hash table interface 的技術 (cf. KV-Store)： * `Put(Key, Data)` 與 `Get(Key)` * 簡稱為 (coined term) **distributed hash table** (DHT)。 * 使用系統：Chord、CAN、Pastry...。 ### DHT Introduction * 分散式版本的 hash table data structure。 * 可以儲存與取回 `(key, value)-pairs`： * **`Key`**：如 filename、hash of name、hash of content (當 filename 可能改變的情況下)。 * **`Value`**：即是該檔案內容 (file content)。 * 通常是到儲存有該內容節點的參照 (reference)。 * 將 keys 進行 hash，再 map 至分散式節點中 (a set of distributed nodes)： * 透過 consistent hashing 或其他類似的 hasing 來實現。 * 系統中的改變必須不影響其他大部分 nodes 的運作。 ### DHT Abstraction * 應用程式分散於各個 nodes。 * DHT 將資料的儲存分散至許多 nodes。 * 圖示： ![](https://i.imgur.com/gCZRtwx.png) ### DHT Interface * 透過簡單的 interface 來取得 value： * `Put(key, value)` * `Get(key)` * DHT API 支援廣泛的應用： > DHT imposes neither structure nor meaning on keys/values * DHT 可以單純使用 key 值做為基底，建立 linked list、graph、tree 等等結構。 * Key-value pairs 保持一致不變 (persisted)，且全域可用 (globally available)： * Good availability：資料儲存於網路邊緣 (edge，意即客戶端)。 * 可以將 keys 儲存於其他 DHT 的 values，以此建立更複雜的資料結構。 ### DHT as Infrastructure or Service * 很多應用程式可以共享一個 DHT 服務。 * 以 DHT 作為基礎設施，可以使新應用的佈署變得簡單。 * 可以匯集 (pool) 許多參與者的資源：如 P2P。 * 本質上，DHT 為中間層服務 (middleware service)，是整個分散式系統中基礎設施的一部分。 ### DHT-based Projects * 所有 projects 的共同之處 (common denominator)： * 資料儲存與位置無關，可能會分散各處。 > Data is location-independent. * 均利用 DHT 的抽象化來實作各自的功能。 > All leverage DHT abstraction. * 各種 projects： * File sharing [CFS, OceanStore, PAST, Ivy, …] * Web cache [Squirrel, ..] * Archival/Backup store [HiveNet, Mojo, Pastiche] * Censor-resistant stores [Eternity,..] * DB query and indexing [PIER, …] * Event notification (Publish/Subscribe) [Scribe, ToPSS] * Naming systems [ChordDNS, Twine, ..] * Communication primitives [I3, …] * Key-value stores [**Cassandra**, **Dynamo**, ...]： * 其各個 nodes 之間使用「**一致性雜湊 (consistent hashing)**」。 ## P2P System - Chord DHT ### DHT Requirements * **負載平衡 (load balancing)**： * `keys` 均勻地映射至所有網路中的 `nodes`。 * **低維護性 (low maintenance)**： * `node` 的出現與消失只會影響一部分的其他 `nodes`。 * 每個 `node` 僅持有並維護一些其他 `nodes` 的資訊。 * **快速查找 (fast lookup)**： * `Messages` 可以有效率地路由至一個 `node`。 ### Chord Identifier Circle * 所有 `nodes` 以 `node identifiers` 為參照，盡可能地均勻分布於 **`identifier circle`** 上。 * 每個 `keys` 被指定給 `identifier circle` 上的 **`successor node`**。 * `Hash function` 保證 `node` 與 `keys` 位於 `identifier circle` 上的**均勻分布** (**`even distribution`**)。 * 假設有 $N$ 個 `nodes`，$K$ 個 `keys`，每個 `node` 約負責 $\dfrac{K}{N}$ 個 `keys`。 * 可以參考 [Consistent Hashing](https://hackmd.io/MO8monVKSwCCIb9AkBwAuw?both#Consistent-Hashing)。 * 圖示： ![](https://i.imgur.com/jbzupeP.png =400x) ### Node Joins & Leaves * `Nodes` 可能會從網路上消失。 * 如：failure、departure。 * 每一個 `node` 紀錄一筆**鄰近圓弧段** (**segment of the circle**) 的完整資訊。 * 如：$r$ 個之前 (**preceding**) 與之後 (**following**) 的 `nodes`。 * 有很高的機率，一個 `node` 可以正確地定位 (locate) 其 `successor` 與 `predecessor`。 * 因此，當一個新 `node` 加入或離去時，我們必須管理 $O(K/N)$ 個 `keys` 的變更。 :::info 上述機制於 Consistent Hashing 中並沒有。 ::: ### Searching in Chord #### Naïve Search * 因為每個 `nodes` 均擁有 `successor` 的資訊，因此 linear search 便可以搜尋到目標 `key`。 * 一個 message 可能會在整個網路中不斷地被轉送：複雜度 $O(n)$。 * 因為 nodes 非常地多，$O(n)$ 的複雜度可以說是相當地可觀。 #### Chord Finger Table * 較快的搜索方式。 * 每個 nodes 均需要維護一個 **`finger table`** (**`FT`**)，裡面儲存最多至 $m$ 個 `entry`。 * 第 $n$ 個 `node` 的 `finger table` 的第 $i$ 個 `entry` 裡儲存著 $\mbox{successor}((n + 2^{i-1}) \mod 2^m)$ 之位址。 * $\mbox{successor}(\cdots)$：指的是位於 circle 上的 `node`，這是根據輸入的參數來決定的 (若找不到相對應的 `node` 則往後遞延找下一個)，結果可以是 `key` 或是 `node identifier`。 * $n\mbox{-node}$ 網路之複雜度為：$O\log(n)$。 * 詳細機制圖解： ![](https://i.imgur.com/ar7n331.png) #### Chord Key Location * 查找 `finger table`，最遠的 `node` 即是最近於目標 `key` 的 `predecessor`。 * 到達目標最多需要 $O\log(n)$ 個 `hops`。 * 每個 `hop` 至少減去一半到達目標的距離。 * 圖示： ![](https://i.imgur.com/Nwi3lX7.png =400x) #### Lookup Example ![](https://i.imgur.com/IrQtdrM.png =500x) ![](https://i.imgur.com/P1lXJX3.png =500x) ### Lookup Latencies * 雖然 $O(\log(n)) < O(n)$，但仍然會需要相當多的時間去取得 target。 * 假設 $\log(1,000,000)$ 個 hops 地理上分散於世界各處。 * 在 circle 上鄰近的 nodes 可能會相聚非常遙遠： ![](https://i.imgur.com/VcWmWpz.png =500x) * 這便可能會造成相當高的 response latencies。 ## P2P System - CAN DHT ### [Content Addressable Network](https://en.wikipedia.org/wiki/Content_addressable_network) * 基於**虛擬的多維 Cartesian coordinate space ([笛卡兒座標系](https://zh.wikipedia.org/wiki/%E7%AC%9B%E5%8D%A1%E5%B0%94%E5%9D%90%E6%A0%87%E7%B3%BB))** 來組織覆蓋網路 (overlay network)。 * 各個節點映射 (map) 到此笛卡兒空間： * 邊緣的座標環繞成一個封閉的環：**coordinates at edges wrap around**。 * 只有 x 軸 wrap around 的範例： ![](https://i.imgur.com/2ORKWz0.png =200x) * **位址空間 (address space)** 獨立於「**物理位置 (physical location)**」以及「各個 nodes 間的**物理連結 (physical connectivity)**」。 * 透過**座標**來**辨識**空間中的所有點。 * 一般的模型是 $n$ 維的「**環面 (torus)**」(所有維度均 wrap around)：利用多維度來 routing。 ![](https://i.imgur.com/m39b5ky.png =200x) ### Example 2-d space with 5 nodes * 整體座標空間是「**動態分割 (dynamically partitioned)**」的。 * Among all `nodes` in system * 每個 `node` 都有自己在空間中特定 (distinct) 的 zone。 * 每個 `key` 都雜湊成空間上的一點。 * 圖示： ![](https://i.imgur.com/ZV3ya00.png =350x) ### CAN Routing * **介面**： * `Put(key, data)` * `Get(key)` * **貪婪地傳送 (Greedily forward)** 訊息至**最接近目的地的鄰居 (neighbor closest to destination)**。 * 各個 `node` 維護一個有著所有鄰居 IP 與 zones 的 **`routing table`**。 * $1$ 號座標的「鄰居集 (neighbor set)」：$\{2,3,4,5\}$。 * 兩點間存在著多條可能的 routing paths。 * 如果有一條 path 其中的 `node` 故障了，可以簡單的換成下一條最佳 path。 * 圖示： ![](https://i.imgur.com/CenplRi.png =300x) ### Average Path Length * $d$ 維的空間，分割成 $n$ 個大小相同的 zones。 * **Average routing path length**：$\dfrac{d}{4} \cdot n^{1/d}$ * $2$ 維空間：$\dfrac{2}{4} \cdot n^{1/2}$ * $3$ 維空間：$\dfrac{3}{4} \cdot n^{1/3}$ * 每個 `node` 維護 $2d$ neighbors。 * 隨著 `node` 的數量增加， * 不影響每個 `node` 的狀態。 * Path length 提升：$O(n^{1/d})$。 ### Node Joining a CAN * 首先，我們擁有一個已經存在於 overlay network 之中的 `node`。 * 接下來要**辨認可以進行 split 作業的 zone**： * 新加入的 `node` 於於座標空間內隨機選擇一個點。 * 傳送 `join request`，並由其他空間的點進行 route。 * 初始化 zone 的分割。 * **更新**鄰近新分割區域之 `nodes` 的 `routing tables`。 * 如果被拒絕 (refused)，再重新挑選新的隨機點。 ### Zone Splitting Upon Node Joining * $1$ 號座標的 neighbor set 為 $\{2,3,4,5\}$。 * 挑選隨機點 $(A,B)$，並路由 `join request` $(7 \to 1)$： ![](https://i.imgur.com/hzCrB1q.png =300x) * 初始化 zone split，並更新鄰近 neighbor set： ![](https://i.imgur.com/7PQDPKH.png =300x) * $1$ 號座標的 neighbor set 變成 $\{2,3,4,7\}$。 * $7$ 號座標的 neighbor set 則是 $\{1,2,4,5\}$。 * 其他 neighbor set 比照辦理。 ### Node Join Properties * 只會有 $O(d)$ 的 `nodes` 會受到新增/減少 `nodes` 的影響。 * 與整個 CAN 中的總 `nodes` 數量 $n$ 無關 (獨立)。 ## P2P System - Pastry (2001) ### Introduction * 如同 Chord，為 **ring-based partitioning**。 * 每個 `peer` 發現並交換**狀態訊息 (state information)**： * **`List of leaf nodes`** * 以自身為中心輻射出去的圓內，記錄其中 $\dfrac{L}{2}$ 個 `closest peers` 的 `node ID`。 * 第一次搜尋會先找這個清單。 * **`routing table`** * $6$ digits，base $4$：表示 $6 \times 4$ 的表格。 * 第 $i$ 個 row 包含的 nodes 與自身共享長度為 $i-1$ 的 prefix。 * column 的數字為第 $i$ 個 digit。 * 盡可能讓 neighbors 進入 cells。 * 於 routing table 中，搜尋較長 prefix 的 nodes。 * 圖示： ![](https://i.imgur.com/vj5BtJ9.png) * **`neighborhood list`** * 根據 routing metric 計算 (如：ping delay)，取 $M$ 個最接近的 `peers`。 * `routing table` 的備用。 * 為 proximity-based Routing ![](https://i.imgur.com/2kZkIP7.png) ### DHT Routing Summary * Chord * Finger Table Routing：每個 hop 至少省下一半到達目標的距離。 * Pastry * Proximity-based Routing * CAN * Neighbour routing：於笛卡兒向量空間內，轉傳訊息至最近的 neighbor，直到目標為止。 ### Conclusions on P2P * 2000 至 2010 為相當流行的研究領域。 * 大規模的公司都喜歡自己管理的雲端基礎設施 (cloud infrastructures)。 * 如：Amazon、Google 等。 * 大規模系統會使用 P2P principles、techniques、abstractions。 * 如：**DHTs**。 * 如今仍然活躍的應用：BitTorrent、Bitcoin 等。 * 受歡迎的 peer-assisted、hybrid systems 如 **Skype** 與 **Spotify**。 ## P2P System - SPOTIFY ### Spotify.com, 2004 * Commercially deployed system, KTH start-up (Sweden) * Peer-assisted on-demand music streaming * Legal and licensed content, only * Large catalogue of music (over 15 million tracks) * Available in U.S. & 7 European countries, over 10 million users, 1.6 million subscribers (in 2004) * Fast (median playback latency of 265 ms) * Proprietary client software for desktop & phone (not p2p) * Business model: Ad-funded and free & monthly subscription, no ads, premium content, higher quality streaming ### Overview of Spotify Protocol * Proprietary protocol * Designed for on-demand music streaming * Only Spotify can add tracks * 96–320 kbps audio streams (most are Ogg Vorbis q5, 160 kbps) * Relatively simple and straightforward design * Phased out in 2014: > “We’re now at a stage where we can power music delivery through our growing number of servers and ensure our users continue to receive a best-in-class service.” * Conclusion: > Commercially, P2P technology is good for startupswho demand more resources than their servers offer. Avoid “death by success”. ### Spotify architecture: Peer-assisted * 伺服器一個在英國，一個在瑞士。 * 圖示： ![](https://i.imgur.com/ER3lMv1.png =500x) ### Why a Peer-to-peer Protocol? * 增加服務的 **scalability**。 * 減少 **load on servers** 以及 **network resources**。 :::warning 設計目標： * 使用 P2P 不應對整體效能產生影響：如 playback latency 與 stutter (口吃)。 ::: ### Peer-to-peer Overlay Structure * **Unstructured overlay** * **不使用 DHT**： * 較快的 lookup (Hybrid P2P)。 * 粗略估計的 ping time： * Latency UK – Netherlands ~ 10 ms and up * Latency across EU more like ~ 80 ms and up * Latency US – Europe ~ 100 ms and up * Playback latency ~265 ms * <1% Playbacks have stuttering * 簡化 protocol 的設計與實作。 * Nodes **最高 dgree 為 60**： * 1 個 node 最多連結 60 個 neighbors。 * 以針對功能的[啟發式評估](https://zh.wikipedia.org/wiki/%E5%90%AF%E5%8F%91%E5%BC%8F%E8%AF%84%E4%BC%B0) (**heuristic evaluation of utility**) 來進行 neighbour eviction。 * 找「[易用性](https://zh.wikipedia.org/wiki/%E6%98%93%E7%94%A8%E6%80%A7) (Usability)」高的。 * 當 streaming new track (下載音樂) 時，尋找並連接新的 peers。 * 完整的「覆蓋」因為不同的興趣喜好而變得薄弱，或可以說是變得一群一群的集中式分類覆蓋。 * Client 僅下載使用者所需的資料。 ### Finding Peers * **Server-side tracker** (伺服器端的曲目追蹤器，cf. [BitTorrent](#P2P-System---BITTORRENT))： * 僅為每個 `track`記憶 20 個 `peers`。 * 當有 client 進行 query 時，回傳 10 個 `peers`。 * Client 會再覆蓋網路中進行小小的廣播。 * TTL 為 2 hops (cf. [Gnutella](#Pure-P2P-Gnutella-04-2000-%E2%80%93-2001))。 * 針對每一個 `track`，client 使用上述兩種機制。 * Peers： ![](https://i.imgur.com/YakhmS5.png =400x) ### Protocol * 幾乎所有東西都加密。 * 幾乎所有東西都走 TCP。 * 當 logged in 時，與 `server` 有著 persistent connection。 * 單獨的 TCP 連線支援多工的訊息傳輸 (multiplex messages)。 ### Caches * `Client` 快取曾經播放過的曲目 (track)。 * Default policy 為使用 10% 的空閒空間 (總空間約 10 GB)。 * 但 `cache` 通常會用到 56% (超過 5GB)。 * **Cache eviction** 使用 **LRU** (Least Recently Use) policy。 * 超過 50% 的資料來自於 `local chache`。 * `Cached files` 於 P2P 覆蓋網路中提供服務如下載。 * 前提是此 track 已經完全下載。 ### Streaming a Track * 每個 track 分成 **16 kB** 的 **`chunks`**。 * 流程： * `Client` 向 Spotify `servers` 丟 request 要第一個 chunk。 * 同時搜索有快取此 track 的 `peers`。 * 依序下載 `chunks` (透過 TCP)。 * 直到此 track 結束下載後，再開始下載新的 track。 * 如果遠端 `peer` 很慢，向新的 `peers` 重丟 request。 * 如果 **`local buffer`** 已經足夠滿了 (**sufficiently filled**)，此時僅從 P2P 覆蓋網路下載資料。 * 可以播放的音樂很夠，不用擔心沒有音樂播。 * 如果 **`local buffer`** 很空 (**low**)，也向 central server 下載資料。 * 僅有很少的 `chunks` 可以播放。 * 估計回復正常 P2P download 的時間。 * 果 **`local buffer`** 非常空 (**very low**)，停止上傳資料。 * 可能只剩最後一個 `chunk` 可以播放了。 * 將所有的連線能量用在下載。 ### Security Through Obscurity * 音樂加密存在 cache 裡。 * Client 可以存取這些加密資料。 * 程式碼有進行混淆處理 (obfuscated)。 * 然而，這些被駭客逆向破解只是早晚的事情...。 > ***Security through obscurity*** is a bad idea! ### 圖表：Data sources 8.8% from servers, 35.8% from p2p network, 55.4 % from caches ![](https://i.imgur.com/MZ0Me7a.png =500x) ### 圖表：Tracks played ![](https://i.imgur.com/9VmQPjo.png =500x) ### 圖表：Users connected ![](https://i.imgur.com/CHjqZnD.png =500x) ### Key Points * 簡單的架構、協定與設計。 * **Peer-assisted**：意即依賴於 centralized server。 * 使用 P2P 技術來： * 增加 scalability * 避免 heavy and over-provisioned infrastructure * 使用 [centralized tracker](#Finding-Peers)。 ## P2P System - BITTORRENT ### Introduction * Written by Bram Cohen (in Python) in 2001 * **Pull-based**, swarming approach (segmented) (群策群力、團隊合作) * Each file is split into **smaller pieces** (& sub-pieces) * Peers request desired pieces from neighboring peers * Pieces are **not** downloaded **in sequential order** * **Encourages contribution** by all peers * Based on a **tit-for-tat model** (你消費，也要提供) * **Use cases** * File-sharing * Downloading movies, music, etc. (licensed only ☺) * Others such as, * **Update distribution** among servers at Facebook et al. * Distribution of updates and releases (e.g., World of Warcraft -> Blizzard Downloader) * ... * Client-server 如下會塞爆： ![](https://i.imgur.com/X27glsx.png =300x) ### BitTorrent Terminology * **`Seed`**： * 擁有完整檔案的 `peer`。 * 第一個 `seed` 又稱為 `Original Seed`。 * **`Leech`**： * 下載檔案的 `peer`。 * 如果下載了完整檔案，維持在線狀態，持續透過協定提供此檔案，`leech` 便成為了 `seed`。 * **`Sub-piece`**： * 一個 `piece` 更進一步的分割 (further subdivision)。 * 「Unit for requests」是一種 `sub-piece` (16 kB)。 * Peer 在組合完 `sub-piece` 後才會上傳 `piece`。 ### BitTorrent Swarm * **`Swarm` (群)** * 下載相同檔案的 `peers`。 * 一個 `swarm` 組織成 `peers` 隨機相連的網格 (randomly connected mesh)。 * 每一個 `peer` 都知道 `neighboring peers` 下載的 `pieces` 清單。 * `Peer` 會向 `neighbors` 要自己所沒有的 `pieces`。 * 圖示： ![](https://i.imgur.com/CEFguWz.png =500x) ### Entering a Swarm for file “popeye.mp4” * 檔案：**popeye.mp4.torrent** * **URL of tracker**：目標檔案之 `tracker` 的位址。 * **Piece length**：通常是 256 KB。 * **SHA-1 hashes** of each `piece` in file：用來較驗下載檔案的完整性。 * `Tracker` 運行在 web server 上，對於擁有該檔案的 `peers` 保持追蹤。 * 圖解： * Find the **Torrent** for Desired Content ![](https://i.imgur.com/NtcZATA.png =300x) * Contact the **Tracker** of Retrieved Torrent ![](https://i.imgur.com/dXORBC3.png =300x) * Connect to **Available Peers** ![](https://i.imgur.com/Q2gah39.png =300x) ### Download Phases * **Beginning**：First piece * 使用 **random first policy**： * 開始下載時，從一 `peer` 隨機挑選 `piece`。 * 第一個 `piece` 下載好時，進入第二階段，也就是切換成 rarest-first strategy。 * 得到第一個 `piece` 後，就可以**加入 `swarm` 提供上傳**。 * 稀少的 `pieces` 只在儲存於少數 `peers`，這會減慢下載速度。 * Strict priority policy * 總是先完成一個 `piece` 的完整下載，再開始下載新的 `piece`。 > Getting a complete piece as quickly as possible * **Middle**：Piece 2 to n-x * 使用 **rarest-first strategy**： * 檢視所有 peers 所擁有目標檔案的 `pieces`，哪種 `piece` 最少就先下載它。 * 如此可以增加此檔案 `pieces` 之 **diversity**： * 避免 `node` 與其 `peers` 都擁有相同的 `pieces`。 * 增加整體系統的 throughput。 * 增加所有 `pieces` 可存取之可能性 (**likelihood**)： * 如果 `original seed` 離線了，有機會從其他的 `nodes` 完成下載。 * **End-game**：Pieces n-x to n * **[Coupon’s collector problem](https://zh.wikipedia.org/wiki/%E8%B4%88%E5%88%B8%E6%94%B6%E9%9B%86%E5%95%8F%E9%A1%8C)** * 傳送針對最後一個 `piece` 的請求給所有 `peers`。 * 一旦開始下載此 `entire piece`，立刻取消對於其他 `peers` 的請求。 * 此模式可以加速下載的完成，否則最後一個 `piece` 將會拖遲整個下載。 * 只能在最後關頭使用，不然會造成網路壅塞 (congestion)。 > Why isn’t this done all along? * 圖示： ![](https://i.imgur.com/LoxPnt3.png =500x) ### Why BitTorrent took off * Working implementation (Bram Cohen) with **simple well-defined interfaces** for publishing content * **Open specification** * Many competitors got sued & shut down (Napster, KaZaA) * Simple approach * Doesn’t do “search” per ==se== * Users use **well-known**, **trusted sources** to locate content ### Pros & Cons of BitTorrent #### Pros * 充分利用部分下載的文件。 * 不鼓勵「免費載 (freeloading)」 * 獎勵速度最快的上傳者。 * 透過 rarest-first 策略促進檔案的「分散性 (diversity)」 * 延長 `swarm` 的壽命。 * 對於「火熱的內容 (hot content)」而言，效能會非常的好。 * 因為 swarm 多且大。 #### Cons * 需要假設擁有該檔案的 peers 在線。 * 如果 swarm 「冷卻 (cools off)」了，效能會急遽惡化。 * 檔案不再流行，swarm 變小。 * 分送 (disseminate) 小檔案的開銷 (overhead) 太高。 #### Others * Dependence on centralized trackers * 單點故障： * 如果 tracker 故障，新的 node 將無法加入 swarm。 * 不過 tracker 的設計與實作簡單，也很容易部屬。 * 如今 BitTorrent 支援 trackerless downloads。 * Lack of a search feature * 使用者必須使用其他的搜索管道 (out-of-band search)： * 例如：有名的 torrent-hosting sites，或是直接使用網路搜尋引擎。 ### Docker Image Distribution via BitTorrent * Docker images are containers that encapsulate applications and systems with full run-time stacks * Image comprised of minimal file system composed of layers * When launching a system based on containers, its images must be present on every node. * Layers are downloaded from **central registry** * Layers can be downloaded **in parallel** #### Traditional Approach * Every node downloads image layers from **registry** ![](https://i.imgur.com/CFWk0iL.png) * 問題： * High contention (爭奪) at **registry** * Images are pulled entirely from **registry** to every node * Massive network bandwidth overhead and performance bottleneck at the **central registry**! #### New Approach * **Nodes** exchange available layers instead of inundating (氾濫) **registry** * **Idea**: Chunk layers into small-sized blocks and distribute via BitTorrent (layerXYZ.torrent) * Widely adopted in industry: * E.g., Alibaba Dragenfly, Uber Kraken, etc. * Drastically improves performance (almost 100x speedup!) ## Spotify vs. BitTorrent ![](https://i.imgur.com/I8WYxzY.png)