Week 08 - Distributed File Systems & Erasure Codes & CAP

# Week 08 - Distributed File Systems & Erasure Codes & CAP ###### tags: `WS2020-IN2259-DS` ## DFS - Basics ### Interaction with file systems - POSIX * POSIX (Portable OS Interface) * ***The Single UNIX Specification*** * Aligns with the **ISO C 1999 standard** (stdio.h) * Family of standards * Specified by **IEEE Computer Society** * 今日擁有約 20 份文件。 * **Abstractions**：令 programmer 可以完成跨平台的編寫 (platform independence，portability)。 * 所謂的 **file system interface**。 * Basic concepts * Files * Operations：open、read、write、close。 * Directories * Operations：create file、mkdir、rename file、rename dir、delete file、delete dir。 * Links * Metadata * Locks #### POSIX Files <stdio.h> * **fopen** ```c FILE *fopen(const char * filename, const char * mode) ``` * **fflush**：Any buffered data is physically persisted，意即清空 buffer，將 buffer 裡的內容寫入 file。 ```c int fflush(FILE *stream); ``` * **fclose**：File flushed and closed。 ```c int fclose(FILE *stream); ``` #### POSIX Files <stat.h> * **mkdir** ```c int mkdir(const char* path, mode_t mode) // example mkdir("/home/aj/distributed_systems", S_IRUSR | S_IWUSR | S_IXUSR | S_IRWXG ); // S_IRUSR read permission, owner // S_IWUSR write permission, owner // S_IXUSR execute/search permission owner // S_IRWXG read, write, execute/search by group... ``` #### POSIX File Locking <fcntl.h> * **fcntl** ```c int fcntl(int fildes, int cmd, ...); // example int fd; struct flock fl; fd = open("/home/adsuser/test.txt“); fl.l_type = F_WRLCK; //write lock fl.l_whence = SEEK_SET fl.l_start = 500; //start at byte 500, multiple threads can lock different parts of the file fl.l_len = 100; //next 100 bytes fcntl(fd, F_SETLK, &fl); //acquire lock ``` #### POSIX File Metadata <stat.h>, <unistd.h> * **chmod** ```c int chmod(const char * file, mode_t mode) // example chmod(“file.txt”, S_IRUSR | S_IWUSR) ``` * **chown** ```c int chown(const char *, uid_t, gid_t) // example chown(“file.txt”, getpwnam(“arno”), -1) ``` ### HDD & SSD * **Hard-disk Drive** * Magnetic discs * Cache (8MB – 128MB) * Cost * ~38€/TB (1.1.2014) * ~30€/TB (4.12.2014) * ~29€/TB (7.12.2015) * ~22€/TB (29.11.2017) * 5400 rpm – 15000 rpm * Seek 4-9ms * Connected via SATA, SCSI/SAS ... * Failures： * ~5% fail per year * ~10% failed during first 3 years * Bitrot：silent corruption of data * HDD specifications predict an **Uncorrectable bit Error Rate (UER)**：$10^{15}$ (1,000,000,000,000,000 ~ 125 TB) * Evaluation：8x100GB 的 HDD 在 2 PB reads 後會觀察到 4 read errors。 * 如何預防與保護：Erasure codes (糾刪碼)。 * **Solid-state Drive** * DRAM (NAND-based flash memory) * No moving mechanical components * Cache (16MB – 512MB) * Cost * Can also be connected via **PCI Express** * Low-level operations differ a lot compared to HDD * On SSD’s overwriting costs more → TRIM Command * Deleting is delegated to internal firmware which has a garbage collector ### Disk file systems * Linux * ext, ext2, ext3, ext4 * JFS, XFS, ... * BTRFS, ZFS * Pooling, snapshots, checksums .... * Windows * NTFS * FAT, FAT32, exFAT, ReFS #### Linux ext2 * **Superblock** (block 0, file system metadata)： * 定義 file system type、size、status 等。 * 也儲存關於其他 metadata structures 的資訊 (metadata of metadata)。 * Index-nodes (**inodes**)：一個 file 或 directory 各擁有一個 inode。 * inode 的 metadata 包含：Owner and group identifiers、File length、File type and access rights、Number of data blocks、Array of pointers to data blocks、Timestamp。 * 三種 type：File、Directory、Symbolic link。 * 最多 3-level 的 indirections： ![](https://i.imgur.com/62GMCKG.png =400x) * 參考資料：[linux/ext2.h at master · torvalds/linux · GitHub](https://github.com/torvalds/linux/blob/master/fs/ext2/ext2.h) * **Data blocks** * 圖示： ![](https://i.imgur.com/Q4S9cG0.png =400x) ## DFS - NFS ### Motivation (動機) * **Collaboration**：可分享式檔案目錄，俾利各種專案之合作。 * **Resource sharing**：以 pooling 的方式跨設備 (accross devices) 分享各種資源。 * 概念圖： ![](https://i.imgur.com/NpZ0wMO.png =300x) :::warning NFS 面對的挑戰：Performance、Scalability、Consistency ::: ### The Network File System (NFS) Initially, 1984, by Sun Microsystems * 目標： * 跨節點 (`nodes`) 的 **consistent namespace** (針對 `files`) * **Authorized users** 可以於各個 `node` 存取各自的 `files`。 * NFS protocol designed for LANs (較小的區域網路) * NFS 在原本的 file system 上建立了一層所謂的 **remote access layer**： * `file` 各自儲存於各個 `server` 上，而可以被多個 `clients` 存取。 * Namespace 分散於 (**distributed**) 各個 `servers`，意即每個 `server` 都有相同的 namespace。 * **Virtual files** 的概念：對於每個 client 而言，remote files 感覺像是 local files。 * **User-centric** design * 大部分的 `files` 均是私屬於某個 **single user**。 * **很少**來自不同 `client` 的 **concurrent access**。 * **Read** 比 write 常見 (more common)。 * Open protocol (開源) * 廣為採用 (wide adoption)。 * 很多商業實作 (commercial implementation)。 * Basic NFS architecture ![](https://i.imgur.com/yJfna9M.png =500x) #### Sending Commands * 基本上，NFS 如同 replicated system 般運作。其 `client` 透過 **remote procedure calls (RPCs)** 向 `server` 傳播 operations。 * Naïve solution：forward **every RPC** to `server`。 * Server 需要為所有的 incoming operations 進行排序、執行，以及最後的回傳結果。 * Downside * High access latency * Overloaded server * **Caching solution**： * Virtual files：clients 運送 cache 儲存 remote files 的備份。 * Clients 週期地與 server 進行同步作業。 * 基本上是一種 **[multi-primary replication](https://en.wikipedia.org/wiki/Multi-master_replication)** * **Synchronization**：使用 eager 或 lazy？ * **Consistency level**：使用 linear、sequential、causal、FIFO、weak？ ### Original version: Sun NFS * NFSv2, ..., NFSv4, ... * Developed in 1984 * Uses **in-memory caching** * File blocks, directory metadata * 儲存於 clients 與 servers。 * Advantage：對於 open、read、write 指令而言，不需要 network traffic。 * Problems：**failures** and **cache consistency** #### Failures I - server crash * 沒有保存於 disk 的資料都會遺失。 * 範例： ```c client.seek(); server.crash(); client.read(); ``` * `client.seek()` 會為 server 在 opened file 中設定一個 position offset。 * 然後，因為 crash，server 遺失了 position offset，`client.read()` 因此獲得了錯誤的回傳資料。 #### Failures II - communication omission failures * 範例： ```c clientA.delete(foo); // A issues delete(foo) server.process_delete(foo); // Acknowledgement to A is lost clientB.create(foo); // B issues create(foo) clientA.delete(foo); // A timeout, re-issue delete(foo) ``` * B 所創建的檔案 foo 最終被 A 的 re-issued 指令刪除。 #### Failures III - client crash * 因為所有的 `caching` 都位於 `memory`，因此會失去所有**尚未與 `server` 同步**的 `client` 資料。 #### Solution I - stateless RPC * RPC commands are **stateless**：意思是 server 不用在一個 session 內的多個 commands 間維護 state。 * 範例： ```c // Failures I - server crash client.seek(); server.crash(); client.read(); // stateful read // Solution I - stateless RPC client.seek(); server.crash(); client.read(positions); // stateless read ``` * 有了 `stateless read`，`server` 便可在 recover 後直接讀取到正確的 position 並回傳正確的值。 #### Solution II - idempotent RPC * NFS 的 RPCs 都被設計成 **idempotent**：重複的指令不會有副作用 (side effect)。 * 範例： ```c // Failures II - communication omission failures clientA.delete(foo); // A issues delete(foo) server.process_delete(foo); // Acknowledgement to A is lost clientB.create(foo); // B issues create(foo) clientA.delete(foo); // A timeout, re-issue delete(foo) // Solution II - idempotent RPC clientA.delete(getinode(foo)); // getinode(foo) = 10001 server.process_delete(10001); // Acknowledgement to A is lost clientB.create(foo); // B issues create(foo), the inode number of new foo is 10002 clientA.delete(10001); // A timeout, re-issue delete(foo) ``` * 使用 **someid** 而非可重複的名稱來執行 RPCs。 * 如此設計之 `read`、`lookup` 均為 idempotent。 * 參數設計如後的 `write(data, file ID, position offset)` 同樣為 idempotent。 #### 可救回的 common loss scenarios * 透過 client 的 timeout 與 retry，再加上 idempotent 的 server RPCs，便可以輕易解決下列 common loss scenarios。 ![](https://i.imgur.com/qAYwk0D.png =400x) #### Is mkdir idempotent? * 範例： ```c client.mkdir(foo); server.process_mkdir(foo); // Acknowledgement to client is lost client.mkdir(foo); // Server respond with error code stating that the directory "foo" has already been created. ``` * NFS 的設計者沒有進一步地設計完善的 idempotent mkdir，僅僅是保留目前這種簡單的機制。 * Exercise 有實作 idempotent mkdir 的作業。 ### Cache Consistency * 因為多個 `clients` 可以保有 file blocks、directory metadata 等資料的 `cache`。 * 而當 `client` C1 與 `client` C2 都想對 `cache block` in F 進行寫入時，該如何處理 **Cache Consistency**？ #### Solution: Time-bounded consistency * **Flush-on-close**：當 `file` 關閉時，將 modified `blocks` 傳送至 `server` 進行同步作業 (close() does not return until update is finished)。 * Periodically checks：每個 `client` 週期性地檢查 `server` 是否更新。 * 若 `server` 並無更新，`client` **於 bounded time 後**與 `server` 進行同步作業。 * 尚未同步時 client 可能會讀到 stale data。 #### No Concurrent Writes in NFS * `Server` 會用一個或另一個 `clinet` 之 write 來更新，甚至是多個 `client` 的 writes。 * **User-centric** 的設計，使得 concurrent writes 不會是 NFS 的罩門，也因此不是 NFS 所關心的問題。 * 當然，如果系統規格需要 concurrent writes，那 NFS 就會有問題。 ### NFS Summary * **Transparent** remote file access * **Client-side caching** for improved performance * **Stateless** and **idempotent** RPCs for fault-tolerance * **Periodical synchronization** with the server, with **flush-on-close semantics** * **No** guarantees for **concurrent writes** ## DFS - GFS ### 設計理念 * 架構輪廓： * Master * 1：儲存 metadata、監視 chunk servers。 * Chunk (data) servers * N：儲存並負責 data chunks。 * WAN links：負責機器間通訊與構連。 * 為 **Big Data** workloads 而設計。 * Huge files (100MB+) * 雖說也支援 small files，但並非主要目標。 * 具備運行於不昂貴 commodity hardware 上的 **fault tolerance**。 * 假設有 1000 台機器，那麼 failure 會是家常便飯 (norm)。 * 引用設計上具**可擴充性 (scalability)** 的 API (non-POSIX)。 * Read workload * 為支援 large streaming reads，不使用 client-side data caches (data caching 在此情況下並不有利)。 * 預測僅會有很少的 random reads。 * Write workload * 以 **producer-consumer pattern** 進行 `file` 的 append。 * 預測會有上百筆來自各個 `client` 的 **concurrent appending**。 * 支援 modification，但此非首要目的。 * **Bandwidth** 比 low latency 重要。 ### Interface * 支援的 operations： * Create、delete、open、close、read、write。 * Record append：允許多個 clients 針對同一個 file 進行 append 操作，而且保證 atomicity，。 * Snapshot：以低成本 (low cost) 建立 file tree 或 directory tree 的備份。 * 不支援完全的 (full) POSIX interface。 * POSIX 需要太多於 distributed applications 中難以滿足的保證與前提。 ### Architecture * Files * 每個 `file` 都分割成固定大小的 `chunks` (64MB)：如果儲存的檔案太小，就會造成浪費。 * 每個 `chunk` 均透過 immutable 且 unique 之 `id` 識別 (此 `id` 又稱為 `chunk handle`) * Single master * 維護 GFS 之 `metadata`：Namespace、access control information、mapping from files to chunks、location of chunks。 * 透過與 chunkservers 之間的 `hearbeats` 來管理及偵測 chunkservers 的狀態。 * Multiple chunkservers * 每個 `chunk` 以 `linux file` 的形式儲存於 `disk`。 * 每個 `chunk` 都於多個 chunkservers 之中保有副本：副本數量取決於設定值「replication factor」(預設為 3)。 * 圖示： ![](https://i.imgur.com/isEcmvr.png =600x) ### Metadata at Master * 複製至 `shadow master` 的資料，並於 `operation log` 中留存複製紀錄： * `File` and `chunk` namespaces * Mapping from `files` to `chunks` :::info `Shadow master` 是資料與狀態更新較 `master` 慢，但與 `master` 進行同步的備援 node，可於 `master` 發生 failover 時接替 `master` 之任務。 ::: * 留存於 `memory` 的資料 (為快速存取)： * Location of each chunk’s replicas * 於系統啟動的狀態下，為應對 failover 的發生，`master` 會定期地詢問 `chunkservers` 是否需要重建某 `chunk` 之 location-to-chunk mapping。 * 使用 periodic scanning 實作： * Garbage collection (when files are deleted) * Re-replication (chunkserver failure) * Chunk migration (to balance load and disk space) :::info 於系統啟動的狀態下，為應對 failover 的發生，`master` 會定期地詢問 `chunkservers` 是否需要重建某 `chunk` 之 location-to-chunk mapping。而以 periodic scanning 實作之功能有： * Garbage collection (when files are deleted) * Re-replication (chunkserver failure) * Chunk migration (to balance load and disk space) ::: * Metadata 必須與記憶體大小對齊 (64 bytes/chunk) * `Operation log` * 紀錄 file creating、renaming、deletion operations。 * 紀錄已發生的 `metadata` 歷史紀錄 (persistent historical record of `metadata` changes)。 * 除了於 `local disk` 永久保存 (persisted) 外，也備份至 `shadow master`。 * `Metadata` changes 只有在永久保存後才可見 (visible)。 * 透過重播 (replay) `operation log` 以達成 Master Recovery 之目標。 * 此外，透過週期性地 checkpointing of `master` state 以最小化 replay 之工作負荷。 ### How is **fault-tolerance** achieved? * Master * `Operation log` * `Shadow master`：replication of `log` and `metadata` * Chunkserver * 所有 `chunks` 均做版本控制 (versioned)：只要 `chunk` 一發生 modification 即更新 version number 。 * 將舊 version number 的 `chunk` 刪除。 * Chunks * 參與由 `master` 發起的 **re-replication** 作業 (由 replication factor 決定數量)。 * **Rebalancing**：將 `chunks` 負荷平均分散至各個 `chunkserver`。 * **Data integrity checks**。 ### How is **high-availability** achieved? * Fast recovery of master * 使用 checkpointing and `operation log`。 * Shadow master(s) * 雖然無法提供 write traffic。 * 但可以即時提供 read traffic，以此減少因 failover 所導致的 downtime。 * Heartbeat messages (often include piggy-backed (稍帶的、順便帶上的) status updates) * Discover **chunkserver failure** * Trigger **re-replication** * Share current load：進行負載平衡作業。 * Trigger **garbage collection** * Diagnostic tools ### GFS - Summary * Highly concurrent `reads` (streaming reads) and `appends` * Highly scalable * On cheap commodity hardware * Built for **map-reduce** kind of workloads * `Reads` * `Appends` * 開發者必須了解這些限制，並以其他機制 (work around) 去達成自己的需求。 * 注意：不支援 POSIX API (於 DS 中實現 POSIX API 相關保證過於困難)。 ## EC - Introduction * **Erasure**：指已知 dropped or corrupted bytes 之位置的 byte errors。 * 與之相對的是未知 dropped or corrupted bytes 之位置的 byte errors。 * Types of Erasure Codes * **Linear block code** * Reed Solomon Code * 可以支援已知位置的 lost bytes 問題。 * 使用於 DFS (Distributed file systems)。 * Fountain code * LT Codes (Luby Transform) * 可以支援未知位置的 lost bytes 問題。 * 使用於 P2P systems、torrents、video streaming... ### Reed Solomon Code * 一種 error-correcting code。 * 使用於 QR codes。 * 以 block encoding 的方式實作 error correcting： * **`Data blocks`**：原資料。 * `Error-correcting blocks` (**`parity blocks`**)：獨立於` data blocks` 之 additional `blocks`。 * 實作概念： * 首先讀取 `data blocks`。 * 若無法成功讀取，decode with `parity blocks` 以進行資料糾錯。 * 範例 - **(n, m)-code**： * 目標：有 **n 個 `data files`**，有能力復原其中 **m 個 `data files` 之遺失**。 * 做法：再生成 **m 個 `parity files`**。 * 特性：此 n+m 個 `file` (`data files` + `parity files`)，其中任意丟失 m 個 `file` 以下 (**最少要有 n 個 `files`**) 均可完成復原操作。 #### 比較 - common backup strategy * n 個 data files、n 個 copies。 * Total storage requirement：2n。 * 若失去一對 data file 與 copy，便無法復原。 * 圖示： ![](https://i.imgur.com/N42AI4K.png =500x) #### 比較 - (n, m)-code * n 個 data files、n 個 parity files。 * Total storage requirement：2n。 * 但需要**額外的計算**工作。 * 任意失去總共 n 個 data files 或 parity files 後，仍能完成資料復原。 * 圖示： ![](https://i.imgur.com/0RaQ50t.png =500x) ### Optimal Erasure Code * 上述概述的 **(n, m)-code** 便是一種 **erasure code** (因保護了 byte erasures)。 * 但 (n, m)-code 並沒有辦法防範更 general 的錯誤，例如一般情況下，我們不知道哪些 data bytes 有問題。 * 計算步驟： * 首先，我們有 n 個 `data bytes`，並要實作 (n, m)-code。 * 以此 n `data bytes` 計算 `parity data`。 * 重建資料：使用至少 n 個 `data` 與 `parity` 的 partial list 進行 byte level 的計算。 * 範例 - **(n, 1)-code**： * `Data bytes`：$d = \begin{Bmatrix} d_0,\ d_1, \cdots,\ d_{n-1} \end{Bmatrix}$ * `Parity data`：$p = d_0 + d_1 + \cdots + d_{n-1}$ * 重建資料 ($d_i$ 遺失)：$d_i = p - (d_0 + d_1 + d_{i-1} + d_{i+1} + \cdots + d_{n-1})$ * 注意 (Nota Bene)： * 基於 bytes 的操作。 * 基於 256 的魔數運算 (modulo 256 arithmetic)。 * 基於加法與減法。 * 不需要乘法與除法參與計算。 ### General EC Computations * 同樣基於 bytes 運算。 * 基於**有限體** (**伽羅瓦體**，**Galois field**) 的運算： * 定義了 addition、subtraction、multiplication、division。 * 存在加法反元素 (additive inverse) 與乘法反元素 (multiplicative inverse)。 * 為了方便講解，使用有理數 **Rationales Q** 來進行範例的運算。 * Q 是一個 field，因此所有需要的運算都已經定義好。 * 範例 - **(3, 2)-code**： * `Data bytes`：$d = \begin{Bmatrix} d_0,\ d_1,\ d_2 \end{Bmatrix}$ * `Parity data` (兩個 `parity data` 必須為 **sufficiently different**，也就是不可為 **linear combinations**)： $p_0 = d_0 + d_1 + d_2$ $p_1 = 1 * d_0 + 2 * d_1 + 3 * d_2$ :::info 何謂 **sufficiently different**？即相互**線性獨立** (linear independence)。若 $p_1 = 2 * d_0 + 2 * d_1 + 2 * d_2$，則 $p_1 = 2 * p_2$，此即**違反線性獨立**。 ::: * 重建資料 ($d_i$ 與 $d_j$ 遺失，$d_k$ 為已知資料)： * 第一步： $X =d_i + d_j = p_0 - d_k$ $Y = (i + 1) * d_i + (j + 1) * d_j = p_1 - (k+1) * d_k$ * 第二步 (解聯立方程式)： $d_j = (Y - (i - 1) * X) / (j - 1)$ $d_i = X - d_j = ((j + 1) * X - Y) / (j - i)$ ## EC - Basic Linear Algebra ### 基礎表示法 * 某 `parity data` 可表示如下： $p_0 = d_0 + d_1 + d_2 = \begin{pmatrix} a_0 & a_1 & a_2 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}$ * 因此，所有 `parity data` 可表示如下： $\begin{bmatrix} p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}$ ### Matrix Multiplication * 圖示： ![](https://i.imgur.com/rbxbDUG.png =500x) * code： ```c MATRIX-MULTIPLY (A,B): if A.columns ≠ B.rows error “incomplete dimensions” else let C be a new (A.rows × B.columns) matrix for i=1 to A.rows for j=1 to B.columns C[i][j] = 0 for k=1 to A.columns C[i][j] = C[i][j] + A[i][k] ∗ B[k][j] return C ``` ### Identity Matrix ![](https://i.imgur.com/TWoHgBt.png =300x) ### The Inverse ![](https://i.imgur.com/dOWu6Fm.png =300x) ## EC - (3, 2)-Code Example Continued ### Erasure Codes * 原公式： $p_0 = d_0 + d_1 + d_2$ $p_1 = 1 * d_0 + 2 * d_1 + 3 * d_2$ * 矩陣公式： $\begin{bmatrix} p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}$ ### Data Reconstruction Matrix * 又稱為 **Encoding Matrix**： $\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}$ ### Data Loss Example * 假設 $d_0$ 與 $d_2$ 遺失，我們可以重建 matrix 如下： $\begin{bmatrix} d_1 \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}$ * 以 Inverse Matrix 解 $d_0$ 與 $d_2$： $\begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix}^{-1} \cdot \begin{bmatrix} d_1 \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} -1/2 & 3/2 & -1/2 \\ 1 & 0 & 0 \\ -1/2 & -1/2 & 1/2 \end{pmatrix} \cdot \begin{bmatrix} d_1 \\ p_0 \\ p_1 \end{bmatrix}$ ## EC - Generalizing to (n, m)-codes ### Generalizing * 首先建立 "**sufficiently different**" **Parity Matrix** $P = \begin{pmatrix} \rho_0 \\ \vdots \\ \rho_{m-1} \end{pmatrix}$，其中 $\rho_i$ 為 $m \times n$ 之 `parity matrix`： $\begin{bmatrix} p_0 \\ \vdots \\ p_{m-1} \end{bmatrix} = \begin{pmatrix} \rho_0 \\ \vdots \\ \rho_{m-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}$ * 接下來加入 Identity Matrix，構建 **Data Reconstruction Matrix**： $\begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \\ p_0 \\ \vdots \\ p_{m-1} \end{bmatrix} = \begin{pmatrix} e_0 \\ \vdots \\ e_{n-1} \\ \rho_0 \\ \vdots \\ \rho_{m-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}$ * 此時，傳送過程中遺失了 $k$ 份資料 ($k \le m$)，**Data Loss with Resulting Matrix** 構建如下： $\begin{bmatrix} d_{j_{0}} \\ \vdots \\ d_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{bmatrix} = \begin{pmatrix} e_{j_{0}} \\ \vdots \\ e_{j_{n-k-1}} \\ \rho_0 \\ \vdots \\ \rho_{k-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}$ * 若 $P$ 正確假設，$M^{-1}$ 存在，如此便可以完成 revcovery： $\begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix} = M^{-1} \cdot \begin{bmatrix} d_{j_{0}} \\ \vdots \\ d_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{bmatrix}$ ### 簡單 Generalizing 後的一些疑問 * 如何計算 $M^{-1}$？ * 如何製作 **"optimal" 的 Parity Matrix** $P$ 以確保 $M^{-1}$ 存在？ * [可逆矩陣 (Invertible Matrix)](https://ccjou.wordpress.com/2010/03/12/%E5%8F%AF%E9%80%86%E7%9F%A9%E9%99%A3%E5%AE%9A%E7%90%86/) 的條件之一：線性獨立，因此只要確保 $P$ 的行向量線性獨立，那麼 $M$ 就會是可逆矩陣。 * 不同於十進位的 "parity numbers"，**"parity bytes"** 又該如何進行計算？ ## CAP ### Introduction * Initially, conjectured by Eric Brewer in 1998, later proven by Lynch et al. * Describes **tradeoffs** involved in distributed system design * CAP Theorem：對於一非同步系統 (asynchronous system) 進行的 **replicated read-write store (使用 atomic register)**，是不可能**同時保證 (guarantees simultaneously)** 下列三項要素的： * **Consistency**：所有 read requests 必須讀取到最新的 value，否則應回傳 error。 * **Availability**：所有的 request 都應成功被處理並回傳資料。 * **Partition-tolerance**：系統必須可以容忍任意數量的通聯故障 (communication failures)。 ### Definition of Atomic Register * Replicated read-write store 的模型： * set(X)：設定此 register 的值為 X。 * get()：獲取此 register 的值。 * 必須在所有 replicas 之間保持 **replicates** and **distributes** register 的一致性 (consistency) 與可用性 (availability)。 * 多種 DS 使用此模型： * Key-value store, distributed shared memory * Replicated system, distributed file system, etc. ### Definition of C-A-P * Definition of **Consistency** * 指 **replication consistency**，與 transactions 的 [ACID](https://en.wikipedia.org/wiki/ACID) 無關。 * 因為 strict consistency 是不可能的，因此假設 **linearizability**。 * 通常使用 **eager replication**。 * Definition of **Availability** * Non-failed node 接收的**每個 request** 都必須完成處理，並回傳 **non-error response**。 * **Non-triviality requirement**：指回傳 errors 不能算是 available。 * 設計有 **crash failure** model。 * 就算系統中有 failed nodes，functioning nodes 必須繼續運作。 * **不要求 no latency**：只要 request **最終**有被處理並回傳結果，不要求 response 的時間。 * **Weak** and **strong** definition： * Weak：不保證 no latency。 * Strong：保證 100% 回傳成功的 response。 * Definition of **Partition-Tolerance** * 實作於非同步系統模型 (Asynchronous system model)。 * 因為 **partition** (網路構連斷開)，因此必定會有 message loss (這也是 failure model 中的一部分)。 * 斷開的 subsystems 會繼續處理來自 clients 的 request。 * 如果系統要求 **stronger system model** 或 **weaker failure model**，那本質上與 partition-tolerant 衝突。 * 雖說不保證 partitions recover，但這也不代表不會 recover。 * 圖示： ![](https://i.imgur.com/NyqiHLR.png =500x) ### Illustrating Example * Hotel Booking：是否下了重複的訂單？ * 第一種情況：不保證 Availability ![](https://i.imgur.com/5ikeOPb.png =500x) * 第二種情況：不保證 Consistency ![](https://i.imgur.com/neEDoNF.png =500x) ### 取捨 1 - CA Systems * 最完美的選擇。 * 需要相當強的假設 (strong assumptions)：**網路可靠**或**不用網路 (根本不是 DS 了)**。 * 說是三選二，但其實根本沒得選擇： > Of the CAP theorem’s Consistency, Availability, and Partition-Tolerance, **Partition Tolerance is mandatory in distributed systems**. You cannot not choose it. > – Coda Hale, Yammer software engineer ### Misconception #1: Always choose two * 由上述可知，CA 是無法實際使用於分散式系統的。 * 除非在**沒有 network partitions** 的情況下，這時所有系統便可以表現得像是 CA。 * 也就是說，真正意義上的「三選二」只有在**有 network partitions** 時才成立。 * 三選二：AP 或 CP ### Misconception #2: C, A are binary * CAP theorem 使用非常狹隘的定義去描述 C 與 A： * **Consistency**: Linearizability * **Availability**: Infinite latency budget, 100% successful * 因為過於簡化這些特性間的「張力 (互相的影響)」，如此造成了對於「三選二」這件事相當程度的誤解。 * 其實，CAP 僅針對 Distributed System 設計 (design space) 中的一小部分進行了如此解釋：在有 partitions 的情況下，**perfect** availability and consistency 是不太可能的 (rare)。 ### Reality of the CAP Theorem * 太多設計者根據 CAP 理論，在他們的 DDBS (Distributed Database System) 正常的操作中 (normal operation ) 設計了一些 restrictions，因此時做出了很多不必要的限制 (unnecessarily limited system)。 * 實際上 CAP 可以做微調與取捨： * CP → CaP * AP → cAP * 自由 (freedom) 地設計以符合 (suit) 程式需求。 * 如選擇適合的 consistency level。 ### 取捨 2 - AP Systems * Best Effort Availability * 特性： * Optimistic replication (i.e., [lazy replication](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#%E5%AE%9A%E7%BE%A91)) * Expiration/time-to-live * Conflict resolution (e.g., [CRDTs](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#CRDT---FROM-STATE-BASED-OBJECTS-TO-REPLICATED-STATE-BASED-OBJECTS)) * 範例： * Web caching (cf., [cache consistency](https://hackmd.io/B1se9bbjQVepE7ojuSd4IA?view#Cache-Consistency)) * Network File System (cf., [concurrent writes](https://hackmd.io/B1se9bbjQVepE7ojuSd4IA?view#No-Concurrent-Writes-in-NFS)) * 實例： * Cassandra, Dynamo are AP systems * Eventual consistency：可能會讀到舊資料 (stale data)。 * Tunable towards CP：可以微調！ ### 取捨 3 - CP Systems * Best Effort Consistency * 特性： * [Eager replication](https://hackmd.io/LgURxUqoRXSiZOh8z9i3bw?view#%E5%AE%9A%E7%BE%A91) * [Pessimistic locking](https://mgleon08.github.io/blog/2017/11/01/optimistic-locking-and-pessimistic-locking/)：認為別人非常有可能進入 Critical Section，因此嚴格地上鎖。 * Minority partition becomes unavailable：發生 partition 時，次網域沒有辦法使用服務。 * 範例： * Distributed lock services (Chubby, ZooKeeper) * Paxos (safe, not live) * 實例： * BigTable / HBase are CP systems * linearizability。 * Partitions 時，無法讀取 TabletServer 或 RegionServer ### Misconception #3: Static systems * 系統於運行時不可能總是表現得一致。 * 在 partitioning 之時，系統可能會設計有不同的表現。 * 如不需要再進行兩個 partitions 之間的同步作業。 * 範例 - airline reservation system： * 大多數座位可預訂時，因為比較不在意定位數量上限，因此系統可以進入 AP 模式。 * 當坐位快滿了，系統要保證不會出現重複預定，因此可以轉換成 CP 模式。 * 或是也可以繼續保持 AP 模式，以最大化定位效率 (盡可能的塞滿)，後續再處理 compensations (這樣稱作 out-of-band 的處理)。 ### Extended Model: PACELC * 對於分散式系統潛在的 tradeoffs 有更完整的描述。 * 如果有 partition **(P\)**，系統要如何取捨 availability **(A)** 與 consistency **(C\)**？ * 如果沒有 partition **(Else, E)**，系統要如何取捨 latency **(L)** 與 consistency **(C\)**？ * 原本的 PAC theorem 忽略了 latency，但在實作中 latency 是相當重要的。 * 圖示： ![](https://i.imgur.com/BjneG4G.png =400x) * **範例**： * PA/EL：放棄 consistency，追求 availability 與 lower latency。 * Dynamo, Cassandra (tuneable), Riak, web caching * PC/EC：不放棄 consistency，因此必須在 availability 與 latency 上付出代價。 * BigTable, HBase, VoltDB/H-Store * PA/EC：有 partions 時放棄 consistency (主備伺服器都可以使用)，但於 normal operation 保持 consistency。 * MongoDB、**`ADS`** * PC/EL：於 partions 時需要 consistency (備援伺服器無法使用)，但於 normal operation 追求 lower latency 而放棄 consistency。 * Yahoo! PNUTS ### Summary CAP Theorem * 經典的描述就是「三選二」：consistency、availability、partition-tolerance。 * 但其實只**選擇 consistency 或 availability** (因為分散式系統怎麼可能不需要 partition-tolerance)。 * 在**沒有 partions** 的狀態下 (此時的操作稱作 **normal operations**)，系統可以是 **CA** (**perfect**)。 * 但實際上 (有 partions)，可以對 consistency 與 availability 進行**微調 (tune)**：cAP 與 CaP。 * 更進一步，系統可以動態地 (dynamically) 於不同 operational situation **採用 (adapt)** AP 或 CP。 * PACELC：更精確地考慮有無 partitions 狀態下的選擇，特別是無 partitions 的 latency 考量，是實作中不可忽略的要素。 ### Self-study Questions * Go through the distributed systems discussed throughout the course and determine where in the CAP space they lie? * At a high-level, specify systems that are CA, CP and AP – what systems are CA? * Is partition-tolerance a binary property? Discuss. * See if you can specify a spectrum of C vs. A, given P