分散式系統
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Write
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
      • Invitee
    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Sharing URL Help
Menu
Options
Versions and GitHub Sync Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Write
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
Invitee
Publish Note

Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

Your note will be visible on your profile and discoverable by anyone.
Your note is now live.
This note is visible on your profile and discoverable online.
Everyone on the web can find and read all notes of this public team.
See published notes
Unpublish note
Please check the box to agree to the Community Guidelines.
View profile
Engagement control
Commenting
Permission
Disabled Forbidden Owners Signed-in users Everyone
Enable
Permission
  • Forbidden
  • Owners
  • Signed-in users
  • Everyone
Suggest edit
Permission
Disabled Forbidden Owners Signed-in users Everyone
Enable
Permission
  • Forbidden
  • Owners
  • Signed-in users
Emoji Reply
Enable
Import from Dropbox Google Drive Gist Clipboard
   owned this note    owned this note      
Published Linked with GitHub
Subscribed
  • Any changes
    Be notified of any changes
  • Mention me
    Be notified of mention me
  • Unsubscribe
Subscribe
# 7-2 Spanner: Google’s Globally-Distributed Database ## Intro * 2012 發表 * Scalable: million+ * Externally consistent read & write * Synchronous Replication * Semi-relational * 可以想成是 table * 但其實是用 (key,timestamp) -> value * SQL-based query * Timestamp-versioned data ## Implementation Overview ![image](https://hackmd.io/_uploads/r1dEdBjX0.png) > https://youtu.be/LaLT6EC7Trc?t=547 ### 伺服器配置 ![image](https://hackmd.io/_uploads/H1uOOroXR.png) * 一整個 Spanner 叫做 universe * 這張圖就是一個 * paper 當時只有三個 universe: test/playground, developement/production, production-only * 分成多個 zone * 通常一個 datacenter 就是一個 zone * 一個 zone 有 一百到幾千個 spanserver * 負責服務 client * 一個 zone 有數個 location proxy * 用來查找 client 的資料放在哪個 spanserver * 一個 zone 有一個 zonemaster * 負責分配資料給 spanserver 管理 * 一個 universe 有一個 universemaster * 用來查看系統狀態、以及除錯用途 * 一個 universe 有一個 placement driver * 在 zone 之間搬運資料,搬運以分鐘為時間尺度 ### 軟體架構 * 論文的軟體架構圖 ![image](https://hackmd.io/_uploads/HyVs_BsQ0.png) * [網路文章的架構圖](https://salemal.medium.com/summary-of-spanner-googles-globally-distributed-database-f66a6ea25a81) ![image](https://hackmd.io/_uploads/S15L2Bs7R.png) * 名稱定義 - **mapping**: `(key, timestamp) -> value` * 一個 spanserver 負責 100~1000 個 **tablet** * tablet 概念 ![image](https://hackmd.io/_uploads/SkxF2HjQA.png) * Spanner 使用 Colossus 這個分散式檔案系統 * 可以把 spanner 看作 compute 與 storage 兩個部份 * storage 裡面寫的 split 就是 tablet * 同一個 tablet 會存在於多個 zone 作為 replica * 同一個 zone 裡面也很可能會有 tablet 的 replica * spanserver 在每個 tablet 掛一個 **Paxos** * tablet replica們 的 Paxos 就叫做 **Paxos Group** * 其中一個 replica 的 Paxos 會被選為 **leader** * 預設每10秒換一次 * 這些 Paxos 的用途是維護 "consistantly replicated bag of mappings" * Read/Write 簡介 * Write 需要向 Paxos leader 發起表決 * Read 可以在任何夠新的 replica 詢問 * 每個 tablet replica 的 leader 維護 lock table * 用於 **2-phase locking** * 論文表示,leader任期較長 是個關鍵,否則效率會低 * "(range of keys) -> lock states" * 需要同步的操作 (如transaction read),會向這個 lock table 取得 lock;不需要同步的話就不用拿 lock * 論文特別說有針對需時較長的 transaction 做設計 * 每個 tablet replica 的 leader 也有做一個 **transaction manager** * 需要跨 tablet 操作會用到這東西 * 跨 tablet <=> 跨 paxos group * 在進行跨 tablet 操作時,一個 Paxos group 的 leader 叫做 **participant leader**,其餘叫做 **participant slave** :::spoiler ... 因為 leader 可能在操作途中掛掉重選,刻意不把 leader 視為 participant leader (一個抽象化的概念) ::: * 參與 跨tablet操作 的 participant leader 會進行 **2-phase commit** * 參與 跨tablet操作 的其中一個 participant leader 被選為協調者,稱為 **coordinator leader**,這一個 Paxos group 的其他 participant slave 稱為 **coordinator slave** * 2-phase locking、2-phase commit 的狀態會存在 tablet 裡面 * 因此在 Paxos group 存在多個 replica * 換句話說,達成共識 <=> 2PL, 2PC 成功 ### Directories and Placement * Spanner 用一個 directory 的概念讓使用者控制資料 * 論文描述:"a set of contiguous keys that share a common prefix" * 使用者(application) 可以自訂 directory 怎麼調配 (如 key 的 prefix),藉此控制 locality * :::spoiler 我的理解 * 以下面的 相簿+相片 tablet 為例 ``` /1/1/Maui /1/1/2/Beach /1/1/5/Snorkeling /1/2/St.Louis /1/2/3/Gateway Arch ``` ::: * 範例一:https://youtu.be/LaLT6EC7Trc * 創兩個 table: Albums, Photos * ![Spanner Logical](https://hackmd.io/_uploads/BJRpnBjQ0.png) * 創建時告知 Spanner 相簿跟相片的關係:同樣 (uid,aid) 的相片跟相簿應該放在一起 * ![Spanner Physical](https://hackmd.io/_uploads/S1PJTBim0.png) * 範例二:論文 * ![image](https://hackmd.io/_uploads/Hyb-pHsXC.png) ``` CREATE TABLE Users { uid INT64 NOT NULL, email STRING } PRIMARY KEY (uid), DIRECTORY; CREATE TABLE Albums { uid INT64 NOT NULL, aid INT64 NOT NULL, name STRING } PRIMARY KEY (uid, aid), INTERLEAVE IN PARENT Users ON DELETE CASCADE; ``` * 實際上,可能因為 Directory 太大,會切成 *fragment*,來放在不同 spanserver * 怎麼切的不知道 * 論文說 "Fragments may be served from different Paxos groups (and therefore different servers)" * 以一個包含很多 tablet 的 directory 來看 * 代表不同 fragment 可能沒有交集的 tablet ## TrueTime * TrueTime 提供三個 API | Method | Returns | | --- | --- | | TT.now() | TTinterval: [earliest, latest] | | TT.after(t) | true if t has definitely passed | | TT.before(t) | true if t has definitely not arrived | * 概念是,TT.now() 保證「呼叫的瞬間」會落在他的回傳值之間 * 名稱定義: $t_{abs}(e)$ 代表 事件 e 發生的時間 * 假設 tt = TrueTime.now(),呼叫的這個動作叫做 e 那麼 $tt.earliest \leq t_{abs}(e) \leq tt.latest$ * 一個資料中心有數個 *Time Master* 機器 * 一部份是使用 GPS * 一部分使用原子鐘 * 所有機器都執行 *timeslave daemon* * 向 Time Master 取得時間的服務 * Time Master 們會定期互相對時 * 對時之後會得知自己的時間與其他人差多少 * 差距的時間 除以 兩個對時的 interval,可得知 drift rate * 沒有在對時的時候, Time Master 會 advertise 一個漸增的 time uncertainty * 原子鐘會根據 drift rate 來 advertise 一個保守的 time uncertainty * GPS的 time uncertainty 接近 0 (how? 沒說) * timeslave daemon 會跟數個 Time Master 對時 * 使用 Marzullo's algorithm 來偵測並挑掉差太多的 * :::spoiler Wiki 說明 Marzullo's algorithm, invented by Keith Marzullo for his Ph.D. dissertation in 1984, is an agreement algorithm used to select sources for estimating accurate time from a number of noisy time sources. A refined version of it, renamed the "intersection algorithm", forms part of the modern Network Time Protocol. ::: * 為了避免 timeslave 的本地時鐘壞掉發生問題,跟 Time Master 差太多的會被踢掉(叫工程師來修) * 沒有在對時的時候,timeslave daemon 也會 advertise 一個漸增的 time uncertainty * Master 跟 slave 都會有 time uncertainty,全部一起考慮,可以得到 local time uncertainty (對 slave 來說) * ![TrueTime time uncertainty](https://hackmd.io/_uploads/HJJS6Ho7R.png) > https://youtu.be/oeycOVX70aE?t=916 * 論文數據: * slave 的保守 drift rate: 0.2ms / s * slave 每 30 秒對時一次 * master 本身的 uncertainty + 與 master 的溝通延遲 = 1ms * => 週期 30 秒,從 1~7 ms 的 local time uncertainty * 論文裡面稱這個 time uncertainty 為 $\epsilon$ ## Concurrency Control TrueTime 怎麼用來達成 externally consistent transactions, lock- free read-only transactions, and non-blocking reads ### Timestamp Managment * Spanner 支援以下操作 * Read-Write transaction * Write only 也算 * Read-only transaction * 非 snapshot read 也算 * 不拿 lock 的讀取 * 可以從任何 **夠新的** replica 讀取 * Snapshot read * 指定過去某個時間點,或是提供年紀上限(?) #### Paxos Leader Leases > 回憶:一個 tablet 的 replica們 形成一個 Paxos Group,並有一個 replica 被選為 leader * Paxos Leader 任期預設為 10 秒 * 成功 write 就延長 * 我的理解是,從 write 的 timestamp 再算 10 秒 * 任期開始時間是 leader 發現自己有多數票的時候 * 結束時間是 leader 發現不再有多數票 (可能因為別人的票過期了) * Spanner 假設這件事永遠成立: * **同一個 Paxos Group 裡的 Paxos Leader 任期不重疊** * 有可能空缺一段時間沒有 leader #### Assigning Timestamps to RW Transactions * 使用 2-phase locking * transaction commit 的 timestamp 用的是 Paxos write 的時間 * :::spoiler 原文 For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit ::: * 透過下面兩個操作以及一個規定,Spanner 可以保證 Paxos write 的 timestamp 是單調遞增 * 註:只用到單一 Paxos Group 的,Leader 可以自己決定;跨 Paxos Group 由 Coordinator leader 來主導 * 規定:timestamp 會落在 Paxos leader 的任期內 * 操作一: **Start** * Coordinator Leader 給 Transaction $T_i$ 一個 timestamp $s_i$,$s_i \geq TT.now().latest$ * TT.now() 的呼叫時機是 coordinator leader 收到 transaction 之後 * 操作二: **Commit Wait** * Coordinator Leader 會確保 $T_i$ 的改動在 $TT.after(s_i)$ 才會被 client 看到 * 畢竟 $s_i$ 是 write 的 timestamp * Commit Wait 會確保 $T_i$ 的 commit 時間(不只是 write,而是整個 transaction) 會在 $s_i$ 之後 * 也就是 $s_i \lt t_{abs}(e_i^{commit})$ * 因此可得單調遞增的 Paxos write: * ![image](https://hackmd.io/_uploads/ryUv6SimR.png) #### Serving Reads at a Timestamp * 只要指定的 timestamp 是在一個 $t_{safe}$ 之前,client 可以從任意 replica 讀取 * $t_{safe} = min(t_{safe}^{Paxos}, t_{safe}^{TM})$ * $t_{safe}^{Paxos}$ 代表的是一個 replica 有紀錄到 最新的 Paxos write * 由於 Paxos write 時間是單調遞增,可以確定不會有這個時間之前的其他 write * $t_{safe}^{TM}$ 有兩種情況。首先回憶 transaction 有用到 2 phase commit * 如果沒有正在進行的 commit (也就是沒有在 prepare 階段的),$t_{safe}^{TM} = \infty$ * 如果有,commit 的過程會得到一個 transaction 最早的完成時間,也就是 prepare 的 timestamp,因此: * 假設 transaction 發生在 Paxos Group $g$,且有可能同時有多個 prepare * $t_{safe}^{TM} = min(s_g^{prepare}) - 1$ * 也就是在 $g$ 的 prepare timestamp 之中最小的 #### Assigning Timestamps to RO Transactions * Read-only Transaction 的實做方式是先要到一個 timestamp $s_{read}$,再用這個 $s_{read}$ 做 timestamp read * 如果令 $s_{read} = TT.now().latest$,可能要 block 到 $s_{read} \leq t_{safe}$ * 所以可以挑別的 $s_{read}$

Import from clipboard

Paste your markdown or webpage here...

Advanced permission required

Your current role can only read. Ask the system administrator to acquire write and comment permission.

This team is disabled

Sorry, this team is disabled. You can't edit this note.

This note is locked

Sorry, only owner can edit this note.

Reach the limit

Sorry, you've reached the max length this note can be.
Please reduce the content or divide it to more notes, thank you!

Import from Gist

Import from Snippet

or

Export to Snippet

Are you sure?

Do you really want to delete this note?
All users will lose their connection.

Create a note from template

Create a note from template

Oops...
This template has been removed or transferred.
Upgrade
All
  • All
  • Team
No template.

Create a template

Upgrade

Delete template

Do you really want to delete this template?
Turn this template into a regular note and keep its content, versions, and comments.

This page need refresh

You have an incompatible client version.
Refresh to update.
New version available!
See releases notes here
Refresh to enjoy new features.
Your user state has changed.
Refresh to load new user state.

Sign in

Forgot password

or

By clicking below, you agree to our terms of service.

Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
Wallet ( )
Connect another wallet

New to HackMD? Sign up

Help

  • English
  • 中文
  • Français
  • Deutsch
  • 日本語
  • Español
  • Català
  • Ελληνικά
  • Português
  • italiano
  • Türkçe
  • Русский
  • Nederlands
  • hrvatski jezik
  • język polski
  • Українська
  • हिन्दी
  • svenska
  • Esperanto
  • dansk

Documents

Help & Tutorial

How to use Book mode

Slide Example

API Docs

Edit in VSCode

Install browser extension

Contacts

Feedback

Discord

Send us email

Resources

Releases

Pricing

Blog

Policy

Terms

Privacy

Cheatsheet

Syntax Example Reference
# Header Header 基本排版
- Unordered List
  • Unordered List
1. Ordered List
  1. Ordered List
- [ ] Todo List
  • Todo List
> Blockquote
Blockquote
**Bold font** Bold font
*Italics font* Italics font
~~Strikethrough~~ Strikethrough
19^th^ 19th
H~2~O H2O
++Inserted text++ Inserted text
==Marked text== Marked text
[link text](https:// "title") Link
![image alt](https:// "title") Image
`Code` Code 在筆記中貼入程式碼
```javascript
var i = 0;
```
var i = 0;
:smile: :smile: Emoji list
{%youtube youtube_id %} Externals
$L^aT_eX$ LaTeX
:::info
This is a alert area.
:::

This is a alert area.

Versions and GitHub Sync
Get Full History Access

  • Edit version name
  • Delete

revision author avatar     named on  

More Less

Note content is identical to the latest version.
Compare
    Choose a version
    No search result
    Version not found
Sign in to link this note to GitHub
Learn more
This note is not linked with GitHub
 

Feedback

Submission failed, please try again

Thanks for your support.

On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

Please give us some advice and help us improve HackMD.

 

Thanks for your feedback

Remove version name

Do you want to remove this version name and description?

Transfer ownership

Transfer to
    Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

      Link with GitHub

      Please authorize HackMD on GitHub
      • Please sign in to GitHub and install the HackMD app on your GitHub repo.
      • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
      Learn more  Sign in to GitHub

      Push the note to GitHub Push to GitHub Pull a file from GitHub

        Authorize again
       

      Choose which file to push to

      Select repo
      Refresh Authorize more repos
      Select branch
      Select file
      Select branch
      Choose version(s) to push
      • Save a new version and push
      • Choose from existing versions
      Include title and tags
      Available push count

      Pull from GitHub

       
      File from GitHub
      File from HackMD

      GitHub Link Settings

      File linked

      Linked by
      File path
      Last synced branch
      Available push count

      Danger Zone

      Unlink
      You will no longer receive notification when GitHub file changes after unlink.

      Syncing

      Push failed

      Push successfully