7-2 Spanner: Google’s Globally-Distributed Database

# 7-2 Spanner: Google’s Globally-Distributed Database ## Intro * 2012 發表 * Scalable: million+ * Externally consistent read & write * Synchronous Replication * Semi-relational * 可以想成是 table * 但其實是用 (key,timestamp) -> value * SQL-based query * Timestamp-versioned data ## Implementation Overview ![image](https://hackmd.io/_uploads/r1dEdBjX0.png) > https://youtu.be/LaLT6EC7Trc?t=547 ### 伺服器配置 ![image](https://hackmd.io/_uploads/H1uOOroXR.png) * 一整個 Spanner 叫做 universe * 這張圖就是一個 * paper 當時只有三個 universe: test/playground, developement/production, production-only * 分成多個 zone * 通常一個 datacenter 就是一個 zone * 一個 zone 有一百到幾千個 spanserver * 負責服務 client * 一個 zone 有數個 location proxy * 用來查找 client 的資料放在哪個 spanserver * 一個 zone 有一個 zonemaster * 負責分配資料給 spanserver 管理 * 一個 universe 有一個 universemaster * 用來查看系統狀態、以及除錯用途 * 一個 universe 有一個 placement driver * 在 zone 之間搬運資料，搬運以分鐘為時間尺度 ### 軟體架構 * 論文的軟體架構圖 ![image](https://hackmd.io/_uploads/HyVs_BsQ0.png) * [網路文章的架構圖](https://salemal.medium.com/summary-of-spanner-googles-globally-distributed-database-f66a6ea25a81) ![image](https://hackmd.io/_uploads/S15L2Bs7R.png) * 名稱定義 - **mapping**： `(key, timestamp) -> value` * 一個 spanserver 負責 100~1000 個 **tablet** * tablet 概念 ![image](https://hackmd.io/_uploads/SkxF2HjQA.png) * Spanner 使用 Colossus 這個分散式檔案系統 * 可以把 spanner 看作 compute 與 storage 兩個部份 * storage 裡面寫的 split 就是 tablet * 同一個 tablet 會存在於多個 zone 作為 replica * 同一個 zone 裡面也很可能會有 tablet 的 replica * spanserver 在每個 tablet 掛一個 **Paxos** * tablet replica們的 Paxos 就叫做 **Paxos Group** * 其中一個 replica 的 Paxos 會被選為 **leader** * 預設每10秒換一次 * 這些 Paxos 的用途是維護 "consistantly replicated bag of mappings" * Read/Write 簡介 * Write 需要向 Paxos leader 發起表決 * Read 可以在任何夠新的 replica 詢問 * 每個 tablet replica 的 leader 維護 lock table * 用於 **2-phase locking** * 論文表示，leader任期較長是個關鍵，否則效率會低 * "(range of keys) -> lock states" * 需要同步的操作 (如transaction read)，會向這個 lock table 取得 lock；不需要同步的話就不用拿 lock * 論文特別說有針對需時較長的 transaction 做設計 * 每個 tablet replica 的 leader 也有做一個 **transaction manager** * 需要跨 tablet 操作會用到這東西 * 跨 tablet <=> 跨 paxos group * 在進行跨 tablet 操作時，一個 Paxos group 的 leader 叫做 **participant leader**，其餘叫做 **participant slave** :::spoiler ... 因為 leader 可能在操作途中掛掉重選，刻意不把 leader 視為 participant leader (一個抽象化的概念) ::: * 參與跨tablet操作的 participant leader 會進行 **2-phase commit** * 參與跨tablet操作的其中一個 participant leader 被選為協調者，稱為 **coordinator leader**，這一個 Paxos group 的其他 participant slave 稱為 **coordinator slave** * 2-phase locking、2-phase commit 的狀態會存在 tablet 裡面 * 因此在 Paxos group 存在多個 replica * 換句話說，達成共識 <=> 2PL, 2PC 成功 ### Directories and Placement * Spanner 用一個 directory 的概念讓使用者控制資料 * 論文描述："a set of contiguous keys that share a common prefix" * 使用者(application) 可以自訂 directory 怎麼調配 (如 key 的 prefix)，藉此控制 locality * :::spoiler 我的理解 * 以下面的相簿+相片 tablet 為例 ``` /1/1/Maui /1/1/2/Beach /1/1/5/Snorkeling /1/2/St.Louis /1/2/3/Gateway Arch ``` ::: * 範例一：https://youtu.be/LaLT6EC7Trc * 創兩個 table: Albums, Photos * ![Spanner Logical](https://hackmd.io/_uploads/BJRpnBjQ0.png) * 創建時告知 Spanner 相簿跟相片的關係：同樣 (uid,aid) 的相片跟相簿應該放在一起 * ![Spanner Physical](https://hackmd.io/_uploads/S1PJTBim0.png) * 範例二：論文 * ![image](https://hackmd.io/_uploads/Hyb-pHsXC.png) ``` CREATE TABLE Users { uid INT64 NOT NULL, email STRING } PRIMARY KEY (uid), DIRECTORY; CREATE TABLE Albums { uid INT64 NOT NULL, aid INT64 NOT NULL, name STRING } PRIMARY KEY (uid, aid), INTERLEAVE IN PARENT Users ON DELETE CASCADE; ``` * 實際上，可能因為 Directory 太大，會切成 *fragment*，來放在不同 spanserver * 怎麼切的不知道 * 論文說 "Fragments may be served from different Paxos groups (and therefore different servers)" * 以一個包含很多 tablet 的 directory 來看 * 代表不同 fragment 可能沒有交集的 tablet ## TrueTime * TrueTime 提供三個 API | Method | Returns | | --- | --- | | TT.now() | TTinterval: [earliest, latest] | | TT.after(t) | true if t has definitely passed | | TT.before(t) | true if t has definitely not arrived | * 概念是，TT.now() 保證「呼叫的瞬間」會落在他的回傳值之間 * 名稱定義： $t_{abs}(e)$ 代表事件 e 發生的時間 * 假設 tt = TrueTime.now()，呼叫的這個動作叫做 e 那麼 $tt.earliest \leq t_{abs}(e) \leq tt.latest$ * 一個資料中心有數個 *Time Master* 機器 * 一部份是使用 GPS * 一部分使用原子鐘 * 所有機器都執行 *timeslave daemon* * 向 Time Master 取得時間的服務 * Time Master 們會定期互相對時 * 對時之後會得知自己的時間與其他人差多少 * 差距的時間除以兩個對時的 interval，可得知 drift rate * 沒有在對時的時候， Time Master 會 advertise 一個漸增的 time uncertainty * 原子鐘會根據 drift rate 來 advertise 一個保守的 time uncertainty * GPS的 time uncertainty 接近 0 (how? 沒說) * timeslave daemon 會跟數個 Time Master 對時 * 使用 Marzullo's algorithm 來偵測並挑掉差太多的 * :::spoiler Wiki 說明 Marzullo's algorithm, invented by Keith Marzullo for his Ph.D. dissertation in 1984, is an agreement algorithm used to select sources for estimating accurate time from a number of noisy time sources. A refined version of it, renamed the "intersection algorithm", forms part of the modern Network Time Protocol. ::: * 為了避免 timeslave 的本地時鐘壞掉發生問題，跟 Time Master 差太多的會被踢掉（叫工程師來修） * 沒有在對時的時候，timeslave daemon 也會 advertise 一個漸增的 time uncertainty * Master 跟 slave 都會有 time uncertainty，全部一起考慮，可以得到 local time uncertainty (對 slave 來說) * ![TrueTime time uncertainty](https://hackmd.io/_uploads/HJJS6Ho7R.png) > https://youtu.be/oeycOVX70aE?t=916 * 論文數據： * slave 的保守 drift rate: 0.2ms / s * slave 每 30 秒對時一次 * master 本身的 uncertainty + 與 master 的溝通延遲 = 1ms * => 週期 30 秒，從 1~7 ms 的 local time uncertainty * 論文裡面稱這個 time uncertainty 為 $\epsilon$ ## Concurrency Control TrueTime 怎麼用來達成 externally consistent transactions, lock- free read-only transactions, and non-blocking reads ### Timestamp Managment * Spanner 支援以下操作 * Read-Write transaction * Write only 也算 * Read-only transaction * 非 snapshot read 也算 * 不拿 lock 的讀取 * 可以從任何 **夠新的** replica 讀取 * Snapshot read * 指定過去某個時間點，或是提供年紀上限(?) #### Paxos Leader Leases > 回憶：一個 tablet 的 replica們形成一個 Paxos Group，並有一個 replica 被選為 leader * Paxos Leader 任期預設為 10 秒 * 成功 write 就延長 * 我的理解是，從 write 的 timestamp 再算 10 秒 * 任期開始時間是 leader 發現自己有多數票的時候 * 結束時間是 leader 發現不再有多數票 (可能因為別人的票過期了) * Spanner 假設這件事永遠成立： * **同一個 Paxos Group 裡的 Paxos Leader 任期不重疊** * 有可能空缺一段時間沒有 leader #### Assigning Timestamps to RW Transactions * 使用 2-phase locking * transaction commit 的 timestamp 用的是 Paxos write 的時間 * :::spoiler 原文 For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit ::: * 透過下面兩個操作以及一個規定，Spanner 可以保證 Paxos write 的 timestamp 是單調遞增 * 註：只用到單一 Paxos Group 的，Leader 可以自己決定；跨 Paxos Group 由 Coordinator leader 來主導 * 規定：timestamp 會落在 Paxos leader 的任期內 * 操作一: **Start** * Coordinator Leader 給 Transaction $T_i$ 一個 timestamp $s_i$，$s_i \geq TT.now().latest$ * TT.now() 的呼叫時機是 coordinator leader 收到 transaction 之後 * 操作二: **Commit Wait** * Coordinator Leader 會確保 $T_i$ 的改動在 $TT.after(s_i)$ 才會被 client 看到 * 畢竟 $s_i$ 是 write 的 timestamp * Commit Wait 會確保 $T_i$ 的 commit 時間（不只是 write，而是整個 transaction) 會在 $s_i$ 之後 * 也就是 $s_i \lt t_{abs}(e_i^{commit})$ * 因此可得單調遞增的 Paxos write： * ![image](https://hackmd.io/_uploads/ryUv6SimR.png) #### Serving Reads at a Timestamp * 只要指定的 timestamp 是在一個 $t_{safe}$ 之前，client 可以從任意 replica 讀取 * $t_{safe} = min(t_{safe}^{Paxos}, t_{safe}^{TM})$ * $t_{safe}^{Paxos}$ 代表的是一個 replica 有紀錄到最新的 Paxos write * 由於 Paxos write 時間是單調遞增，可以確定不會有這個時間之前的其他 write * $t_{safe}^{TM}$ 有兩種情況。首先回憶 transaction 有用到 2 phase commit * 如果沒有正在進行的 commit (也就是沒有在 prepare 階段的)，$t_{safe}^{TM} = \infty$ * 如果有，commit 的過程會得到一個 transaction 最早的完成時間，也就是 prepare 的 timestamp，因此： * 假設 transaction 發生在 Paxos Group $g$，且有可能同時有多個 prepare * $t_{safe}^{TM} = min(s_g^{prepare}) - 1$ * 也就是在 $g$ 的 prepare timestamp 之中最小的 #### Assigning Timestamps to RO Transactions * Read-only Transaction 的實做方式是先要到一個 timestamp $s_{read}$，再用這個 $s_{read}$ 做 timestamp read * 如果令 $s_{read} = TT.now().latest$，可能要 block 到 $s_{read} \leq t_{safe}$ * 所以可以挑別的 $s_{read}$