Calvin - HackMD

# Calvin ## Abstract 以前的 DBMS 為了達到 high scalability，避免讓單一個 tx 分散在不同機器上執行。而 Calvin 為了確保 scalalibility 並且讓 tx 可以分散在不同機器，使用了 deterministic ordering 以及 synchronous replication 的技術，最終超越了現有的 DBMS。 ## Background and Introduction **1. The cost of distributed transactions** Distributed tx 的問題主要在於需要所有執行同一個 tx 的機器 agree 彼此的操作。因此在完成時需要做所謂的 two phase commit。而這樣是非常沒有效率並且破壞 scalability 的主要原因。 **2. Consistent replication** 分散式的 DBMS 常常需要在各地建很多 replicas，而這會導致一件事：因為網路延遲的關係造成 synchronous updates 有很大的 cost。 **3. Achieving agreement without increasing contention** Calvin 是如何達到 distributed tx 和 sync replication呢？在還沒拿到locks 和執行 tx 前，預先排好各個 tx 的執行順序，並且在執行時一定要執行到完成，不能abort。如果node fail，可以從其他的 replica 還原或者是 replay 在那個 node 上的 history。 ## Deterministic DBMS 傳統的 distributed DBMS 用 two-phase commit 的原因可以分為兩個： 1. nondeterministic events (node failures) 2. deterministic events (transaction logic failures) 然而，nondeterministic events 其實是非必要的，因為當某個 tx fail 時，可以用另一個跟他一樣的 replica 來 serve 需要它的 txs。不過這裡有個問題，如果每個 state 都要跟另一個 replica sync 的話，那麼其實是很不實際的，所以這裡採用的是 sychronously replicate batches of requests。而對於 deterministic events（不合邏輯事件：例如貨物數量小於零）說，並不需要用到 two-phase commit，只要讓每個參與的 txs 等待一個 one way message，只有收到了 message 他們才能 commit。 ## System Architecture Calvin 主要可以分為三個 layer： 1. The sequencing layer：負責排出一個 global order，同時處理這些 inputs 的replication 和 logging。 2. The scheduling layer：用 deterministci lock scheme 的方式來執行 txs。 3. The storage layer： Access data using a simple CRUD interface ### Sequencer and replication： Calvin 每 10ms 會做一個 epoch，每一個 epoch 最後會把所收到的 requests compile 成一個 batch，然後複製這些batch 並且分送給各個不同的 partition。這些 batch 裡面包含了：sequencer ID、epoch number、所有會參與的 tx requests。 ### Scheduler and concurrency control：這階段可以分為拿 locks 跟 concurrently 執行。 1. Lock Calvin 裡面的每個 node 都只需要負責自己那部份的 lock，也就是有哪些 record 要執行，就拿哪些 locks。並且是 Strict 2PL 的方式。在實作時，透過一個 thread 去 scan 一個 serial transactions order，在這個 order 裡面的每一個 entry，會去跟lock manager 要這個 tx 這一生會用到的所有 locks。 2. Execution 丟給stored procedure，不然無解。