# Week 01 - Introduction
###### tags: `WS2020-IN2259-DS`
## 1-1 Distributed systems definitions
* A distributed system is a system that is comprised of several physically **disjoint** compute resources interconnected by a **network**.
* MapReduce (Hadoop)
* Peer-to-peer (Bitcoin, BitTorrent)
* Google infrastructure (BigTable)
* World Wide Web (Akamai CDN)
> A distributed system is one in which hardware or software components located at **networked** computers communicate and **coordinate** their actions only by passing messages.
– By Coulouris et al.
> A distributed system is a collection of **independent** computers that appears to its users as a **single coherent** system.
– By Tanenbaum & van Steen.
> A distributed system is an application that executes a collection of protocols to **coordinate** the actions of multiple processes on a **network**, such that all components cooperate together to perform a **single** or small set of related tasks.
– By Google Code University
### Terminology
* Node: A physically separable computing node in our systems
* Message: The unit of communication among nodes
### Why Build a Distributed System?
* Saving money: vertical scaling costs more than horizontal scaling
* Assurance: availability and redundancy
* Avoiding a single point of failure
* Inherent distribution: many resources are inherently distributed
* Shared use: many resources are used in a shared fashion
### Related Disciplines

### Self-study questions
* Find more formal definitions of distributed systems and contrast them to our points of view.
* Compare today’s pricing of a vertically scaled machine to a horizontally scaled one with equal resources.
* Find other terms used for node and message by going through online popular press articles on systems.
* What are other related disciplines you have come across in your studies, if any?
* Is the client-server paradigm a distributed computing paradigm, argue for or against?
## 1-2 Distributed systems characteristics
### Reliability
* Probability of a system to **perform** its **required functions** under **stated conditions** for a **specified period of time**.
* Expressed as Mean Time Between Failures (MTBF) or failure rate.
* Intuition: the length of continuous failure-free operation; the longer the system runs without failing, the more reliable it is.
### Availability and high availability
* Proportion of time a system is in a **functioning state**, i.e., can be used (**1 − unavailability**).
* **Ratio** of time usable over entire time
* Informally, uptime / (uptime + downtime)
* System that **can be used 100 hrs** out of **168 hrs** has **availability of 100/168**
* Intuition: over a given period, the fraction of that time during which the system can actually be used (a quick numeric check follows this list).
* Specified as decimal or percentage
* Five nines is 0.99999 or 99.999% available
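A quick numeric check of the examples above (plain Python; the yearly downtime budgets follow directly from the definition):

```python
# Availability = uptime / (uptime + downtime).
print(100 / 168)                            # usable 100 of 168 hours -> ~0.595

# Yearly downtime budget implied by an availability class.
def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60

print(downtime_minutes_per_year(0.999))     # three nines -> ~525.6 min (~8.8 h)
print(downtime_minutes_per_year(0.99999))   # five nines  -> ~5.3 min
```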

### Availability ≠ Reliability
* A system that goes down for one second every hour:
    * Availability ≈ 99.97% (highly available)
    * Highly unreliable (it fails every single hour)
* A system that never fails, but is taken offline for two weeks of maintenance every year:
    * Availability: roughly 96% (50 of 52 weeks, i.e., low availability)
    * Highly reliable (see the arithmetic check below)
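Checking both scenarios with the same uptime ratio:

```python
# Down one second every hour: available, yet it fails every single hour.
print(3599 / 3600)     # ~0.9997 availability, highly unreliable

# Never fails, but two weeks of planned maintenance per year.
print(50 / 52)         # ~0.96 availability, highly reliable
```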
### Distributed Systems Design Fallacies
1. The network is reliable
1. The network is secure
1. The network is homogeneous
1. The topology does not change
1. Latency is zero
1. Bandwidth is infinite
1. Transport cost is zero
1. There is one administrator
### Self-study questions
* Look at popular cloud providers and seek to identify the class of nines they offer their customers – who promises the most availability and at what cost?
## 1-3 Massively scalable key-value stores
### What are key-value stores?
* Container for key-value pairs (databases)
* Distributed, multi-component, systems
* NoSQL semantics (non-relational)
* KV-stores offer **simpler query semantics** in exchange for **increased scalability**, **speed**, **availability**, and **flexibility**
* Data model not new
| DBMS (SQL) | Key-value store |
| -------- | -------- |
| Relational data schema | No data schema |
| Data types | Raw byte access |
| Foreign keys | No relations |
| Full SQL support | Single-row operations |
### Why are key-value stores needed?
* Driven by the nature of today's web applications:
    * Huge volumes of stored data
    * Very large numbers of users
    * High update rates
    * Fast data retrieval
    * Rapidly changing data definitions
* Horizontal scalability
    * Copes with growing user bases and shifting traffic patterns
    * Adapts to the number and size of requests
* Better performance
    * High speed through single-record read and write operations
* Better flexibility
    * Adapts to changes in the data definition
* Reliability
    * Thousands of components operating concurrently
    * Runs on commodity hardware
    * Provides failure recovery
* Availability and geo-distribution
    * Users come from all over the world
    * Guarantees fast access
### Key-value store client interface
* Main operations:
    * Write/update: put(key, value)
    * Read: get(key)
    * Delete: delete(key)
* No aggregation, table joins, or transactions
* Example: see the sketch below
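A minimal single-node sketch of this interface (the `KVStore` class and the key names are made up for illustration; a real store spreads keys over many nodes and replicas behind the same three calls):

```python
# Toy in-memory key-value store exposing only put/get/delete.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):        # write/update
        self._data[key] = value

    def get(self, key):               # read (None if the key is absent)
        return self._data.get(key)

    def delete(self, key):            # delete (ignored if the key is absent)
        self._data.pop(key, None)

kv = KVStore()
kv.put("user:1:name", "Ada")
print(kv.get("user:1:name"))          # "Ada"; no joins, aggregates, or transactions
kv.delete("user:1:name")
```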

### Common elements of key-value stores
* Failure detection and failure recovery
* Replication: store and manage multiple copies of the data.
* Memory store: keep data in main memory to speed up access.
* Write-ahead log (WAL): the commit log serves as the ground truth.
* Versioning (a small sketch follows this list)
    * Stores multiple versions of the data
    * Versions are distinguished by timestamps
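A toy sketch of timestamp-based versioning (class and method names are invented; real stores bound how many versions they keep and persist them to SSTables/HFiles):

```python
import time
from collections import defaultdict

class VersionedStore:
    """Keeps every version of a value, tagged with its write timestamp."""
    def __init__(self):
        self.versions = defaultdict(list)          # key -> [(timestamp, value), ...]

    def put(self, key, value):
        self.versions[key].append((time.time(), value))

    def get(self, key, at=None):
        """Return the newest version written at or before `at` (default: now)."""
        at = time.time() if at is None else at
        candidates = [(ts, v) for ts, v in self.versions[key] if ts <= at]
        return max(candidates)[1] if candidates else None

store = VersionedStore()
store.put("row", "v1"); store.put("row", "v2")
print(store.get("row"))                            # newest version: "v2"
```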
### Self-study questions
* How would you select, project and join tables with the key-value store API vs. via SQL?
* What are the main differences in realizing the above operations in the two models?
* Elicit the main differences between traditional key-value stores (e.g., Berkeley DB) and the massive-scale key-value stores we introduced.
* Can SQL be layered on top of a key-value store API, argue for or against?
## 1-4 Google’s BigTable
* Engineered at Google, 2004
* Designed for petabyte scale
* Based on Google File System (GFS)

### BigTable: Tables & Tablets

### BigTable Components
* Client library
    * Communicates directly with the tablet servers.
* Master
    * Metadata operations: control commands for the tablet servers.
    * Load balancing: balances the load across tablet servers.
* Tablet server
    * Data operations: handles read() and write() requests.

### Master
* Assigns tablets to tablet servers.
* Balances the load across tablet servers.
* Detects the addition and expiration of tablet servers.
### Tablet Server
* Manages tablets (up to thousands of them).
* Handles client read and write requests for the tablets it manages.
* Splits tablets that have grown too large and merges tablets that are too small.
* A tablet is roughly 100-200 MB.
### BigTable Building Blocks
* Chubby
* Lock service (mutual exclusion)
* Metadata storage
* GFS
* Data, log storage
* Replication
* Uses Sorted Strings Table files (SSTables)
* Scheduler
* Monitoring
* Failover

### Tablet location hierarchy
* 3 levels: the root tablet addresses 2^17 METADATA tablets, each of which addresses 2^17 user tablets, so 2^17 × 2^17 = 2^34 user tablets in total (the arithmetic is spelled out below).
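The arithmetic behind the 2^34 figure, assuming 128 MB tablets and roughly 1 KB of location metadata per tablet (the figures used in the BigTable paper):

```python
# ~1 KB of location metadata per tablet, 128 MB tablets.
entries_per_metadata_tablet = (128 * 2**20) // 2**10   # = 2**17 rows per tablet
user_tablets = entries_per_metadata_tablet ** 2        # root -> METADATA -> user: 2**34
addressable_bytes = user_tablets * 128 * 2**20         # = 2**61 bytes (about 2 exabytes)
print(user_tablets, addressable_bytes)
```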

### Apache HBase
* Open-source reimplementation of the BigTable design.
* Facebook Messenger uses HBase.
* Different names, similar components:
* GFS → HDFS
* Chubby → ZooKeeper
* BigTable → HBase
* MapReduce → Hadoop
### HBase architecture overview
* Client library
    * Issues put, get, and delete commands
* ZooKeeper (Chubby)
    * Distributed lock service
    * Based on ZAB, the ZooKeeper Atomic Broadcast (a Paxos-like consensus algorithm)
* HRegion (tablet)
    * A table is split into many key regions
* HRegionServer (tablet server)
    * Executes the data operations on the key regions (tablets) it serves
* HMaster (master)
    * Coordinates the components
    * Handles startup, shutdown, and failure of region servers (tablet servers).
    * Opens, closes, assigns, and moves regions (tablets).
    * Not on the read or write path: reads and writes never involve the master.
* Write Ahead Log (WAL)
    * Persists a complete log of operations before anything can go wrong, enabling failure recovery.
* MemStore
    * Keeps "hot" data in main memory and synchronizes it with disk periodically.
* HDFS (GFS)
    * The underlying distributed file system.
    * Table data is stored in the HFile format (GFS's SSTable equivalent).
    * Data is replicated across multiple data nodes for redundancy.
### HBase read-path
1. The client asks ZooKeeper for the address of the root tablet.
2. The client asks the root tablet for the addresses of the METADATA tablets.
3. The client asks the METADATA tablets which servers hold the relevant key ranges.
4. The client requests the data from the target tablet's region server (a toy model of this lookup, with caching, follows).
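A toy model of the lookup with client-side location caching (the flat dictionaries and all names are invented; the real path walks actual root/METADATA tablets):

```python
# ZooKeeper points at the root entry, which maps key ranges to region servers;
# the client caches resolved locations so later requests skip steps 1-3.
zookeeper = {"root": "meta-1"}
metadata = {"meta-1": {("a", "m"): "regionserver-1", ("m", "z"): "regionserver-2"}}
cache = {}                                           # key range -> region server

def locate(row_key: str) -> str:
    for (lo, hi), server in cache.items():           # cache hit: no lookups needed
        if lo <= row_key < hi:
            return server
    root = zookeeper["root"]                         # step 1: ask ZooKeeper
    for (lo, hi), server in metadata[root].items():  # steps 2-3: METADATA lookup
        if lo <= row_key < hi:
            cache[(lo, hi)] = server
            return server
    raise KeyError(row_key)

print(locate("bob"))    # full lookup the first time
print(locate("bee"))    # served from the cache
```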

### HBase write-path
1. Follow the read-path steps to locate the region; if the region server address is already cached, the client contacts the target region server directly.
2. The client sends the put command to the target region server.
3. The region server appends the key-value pair to the WAL.
4. The region server writes the key-value pair into the MemStore.
5. The region server flushes the MemStore to HDFS (asynchronously, once the MemStore fills up).
6. The region server returns an acknowledgement to the client (a sketch of steps 3-6 follows).
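A sketch of the region-server side of steps 3-6 (class name and flush threshold are made up; real HBase flushes once the MemStore exceeds a size limit and acknowledges after the WAL and MemStore writes succeed):

```python
class RegionServer:
    """Toy write path: WAL first, then MemStore, flush to 'HDFS' when full."""
    FLUSH_THRESHOLD = 3                      # toy value; real limits are in MB

    def __init__(self):
        self.wal = []                        # stand-in for the write-ahead log
        self.memstore = {}                   # in-memory store of recent writes
        self.hfiles = []                     # stand-in for HFiles persisted to HDFS

    def put(self, key, value):
        self.wal.append((key, value))        # 3. durable log entry first
        self.memstore[key] = value           # 4. then the in-memory store
        if len(self.memstore) >= self.FLUSH_THRESHOLD:
            self.flush()                     # 5. flush (asynchronous in reality)
        return "ack"                         # 6. client sees the ack

    def flush(self):
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()

rs = RegionServer()
print(rs.put("row-1", "v1"))                 # -> "ack"
```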

### HBase scalability and fault tolerance
* Components can be added while the system is running.
* Has multiple backup master servers
    * Avoids a single point of failure.
    * If the primary master fails, a backup master takes over.
    * Leader election is performed via ZooKeeper.
* Has multiple region servers
    * **Horizontal** scalability
    * The master performs the **load balancing**
* Diagram: scalability

* Diagram: storage unit failure

### Summary on BigTable and HBase
* Horizontal scalability: data is **partitioned**.
    * Tables → Regions (Tablets)
    * **Load balancing** across the region servers (tablet servers).
* Write-ahead log: enables **failure recovery**.
* **Decoupled writes**: I/O hits the **MemStore** first and is flushed to disk later.
* Centralized management:
    * (H)Master – single point of failure
    * **Backup masters** for failover, **leader election** needed
    * The master is not on the read/write path, so it does not become a bottleneck.
* Coordination:
* ZooKeeper (Chubby) lock service
* **Leader election**, server status, region directory, ...
* Sessions (leases) for timeout (**failure detection**)
* Mechanisms for **high availability** and reliability
* **Paxos**, atomic broadcast to replicate coordination state
* Cache meta-data replies to avoid frequent communication
* Distributed file system:
* HDFS (GFS)
* Stores data as HFiles (SSTables)
* **Data is replicated** for **availability**
### Summary Big Picture - BigTable vs. MapReduce
* BigTable
* Layered on top of GFS
* Data storage and access
* read/write web data
* MapReduce
* Layered on top of GFS
* Batch analytics
* offline batch processing
### Self-study questions
* Would the BigTable architecture make sense without relying on a distributed file system layer for storage, argue for or against?
* Contrast the life of a read request vs. write request issued from a client to BigTable, what processing stages can you identify?
* How many bytes could BigTable/HBase address, assuming tablets of fixed sizes (e.g., 1 MB, 100 MB, 200 MB, etc.)?
* Why are read and write requests not channelled through the Master, argue for or against?
* What happens if a client request sent to a tablet server is not serviced because that tablet server crashed?
* Why is the Master a single-point of failure?
## 1-5 Chubby (Zookeeper)
### Chubby Lock Service
* Highly available, persistent, distributed lock and coordination service
* Examples of how BigTable uses the lock service:
    * Ensure there is at most one active BigTable master at any time.
    * Store the bootstrap location of the data (the root tablet).
    * Manage the tablet-server life cycle: discover tablet servers.
    * Store schema information.

* See the following lectures:
* Coordination and Agreement Lecture
* The Paxos Consensus Algorithm Lecture
* Coordination with Zookeeper Lecture
### Lock Service Operational Model
* Knows about directories and files and manages them with locks.
* Reads and writes are atomic.
* Timeout mechanism: a client obtains and maintains a session; if the session expires, its locks are released (sketched below).
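A toy sketch of the session/lease mechanism (class and field names are invented; real Chubby and ZooKeeper leases are maintained with KeepAlive/heartbeat messages):

```python
import time

class Session:
    """A client's locks are valid only while its lease is alive."""
    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.last_keepalive = time.time()
        self.locks = set()

    def keepalive(self):                      # client pings to extend the lease
        self.last_keepalive = time.time()

    def expired(self) -> bool:
        return time.time() - self.last_keepalive > self.lease_seconds

    def held_locks(self) -> set:
        return set() if self.expired() else self.locks

s = Session(lease_seconds=0.05)
s.locks.add("/ls/cell/bigtable-master")       # illustrative lock path
time.sleep(0.1)                               # no keepalive arrives in time
print(s.held_locks())                         # set(): lease expired, lock released
```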

### Lock Service Availability
* Consists of 5 replicas.
* Writes are consistently [**replicated**](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#Data-replication).
* One of the replicas is the master.
    * Requires **leader election**.
    * Note: the Chubby master is not the same as the BigTable master!
* The service is only up when:
    * A majority of the replicas are running.
    * A **quorum** of replicas is established.
    * The replicas can communicate with one another.
### Core Mechanisms
* Ensure one active BigTable master at any time
* [Leader election](https://hackmd.io/4gqb3pZ2SbGPaKJtcQpj_g?view#Leader-aka-coordinator-master-etc-election)
* Keep replicas consistent in face of failures
* [Paxos algorithm](https://hackmd.io/LszrOdiYQNiva1tt508fzg?view) based on replicated state machines (RSM)
* Atomic broadcast
### Chubby Example: Leader election
* All clients try to open (create) the same file.
* The client that succeeds writes its own name into the file and becomes the leader.
* The clients that fail read the file to learn who the leader is.
* Code sketch (see below):
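A minimal sketch of the pattern, using exclusive file creation on the local filesystem as a stand-in for Chubby (file name and node name are made up; in Chubby the file lives in the lock service and the lock is tied to the client's session):

```python
import os

LOCK_FILE = "leader.lock"   # stand-in for a Chubby file such as /ls/cell/master

def try_become_leader(my_name: str) -> str:
    try:
        # O_CREAT | O_EXCL makes the creation exclusive: only one client succeeds.
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, my_name.encode())
        os.close(fd)
        return my_name                       # we won the election
    except FileExistsError:
        with open(LOCK_FILE) as f:           # we lost: read the file to learn the leader
            return f.read()

print("leader:", try_become_leader("node-42"))
```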

### Self-study questions
* Why do we need another database (Chubby) in addition to BigTable?
* How could the sketched Chubby API be used for locking and for mutual exclusion?
## 1-6 Dynamo / Cassandra
### Cassandra
* Developed by Facebook
* Based on Amazon Dynamo (but open-source)
* **Structured storage nodes** (**no GFS** used)
* **Decentralized** architecture (**no master** assignment)
* **Consistent hashing** for load balancing (see the sketch after this list)
* Eventual consistency
* **Gossiping** to exchange information
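A minimal consistent-hashing sketch (hypothetical `HashRing` class; no virtual nodes or replication, which Cassandra adds on top). Each key is hashed onto a ring and stored on the first node clockwise from it, so adding or removing a node only moves the keys adjacent to that node:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Each node owns the arc of the ring that ends at its hash position."""
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._positions = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around the ring).
        i = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```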
#### Cassandra architecture overview

#### Cassandra global read-path

#### Cassandra global write-path

#### Incremental scaling in Cassandra

#### Storage unit failure


### Core mechanisms
* Decentralized load balancing and scalability
* [Consistent Hashing Lecture](https://hackmd.io/MO8monVKSwCCIb9AkBwAuw?both#Consistent-Hashing)
* Read/write reliability
* [Replication](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#Data-replication)
* Membership management
* [Gossip in Replication](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#Data-replication)
* Eventual consistency model
* [Consistency](https://hackmd.io/aqhctdHlSvOs1c1qWQuTgw?view#Week-05---Consistency-Model)
### Self-study questions
* Would the Dynamo/Cassandra architecture make sense given a distributed file system layer for storage, argue for or against?
* How does the life of a read and write request issued from a client to Dynamo/Cassandra differ from requests issued in BigTable?