# key-value store
two types of backend servers:
- master: remembers which row ranges are stored where (the tablet -> replica storage node mapping), keeps track of which storage nodes are currently alive, triggers re-replication of failed tablet copies, and polls the storage nodes for health checks.
- storage node: manages one tablet file.
  - primary storage node: coordinates updates via 2PC, handles write requests/responses.
  - secondary storage node: a pure replica, serves reads.
(remark: in the Bigtable implementation each storage node holds ~100 tablets for better failure recovery and load balancing, but here each node holds only one tablet for simplicity??)
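Below is a minimal sketch (assumed type and field names) of the in-memory state the master would keep for this: the partition -> replica-set mapping plus per-node liveness info refreshed by the health-check poll.

```cpp
// Sketch of the master's in-memory bookkeeping (names are assumptions).
#include <cstdint>
#include <chrono>
#include <map>
#include <string>
#include <vector>

struct NodeAddr {
    std::string ip;
    uint16_t    port = 0;
};

struct ReplicaSet {
    NodeAddr              primary;      // coordinates updates / 2PC
    std::vector<NodeAddr> secondaries;  // pure replicas, serve reads
};

struct NodeStatus {
    bool alive = false;
    std::chrono::steady_clock::time_point last_heartbeat;
};

struct MasterState {
    // partition index (Hash(row) % number of partitions) -> replica set for that tablet
    std::map<uint32_t, ReplicaSet>    tablets;
    // "ip:port" -> liveness info, refreshed by the periodic health-check poll
    std::map<std::string, NodeStatus> nodes;
};
```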
## 3.4 Partitioning the storage
the master maintains a partition-to-node mapping:
row hash range -> nodes' ip:port (1 primary node + 1 secondary node)
the mapping (row -> node index) is defined by a hash-modulo operation, i.e., Hash(row) % (number of primary nodes).
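A minimal sketch of that hash-modulo lookup; `std::hash` stands in for whatever row hash function the project settles on.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Maps a row key to a partition index in [0, num_partitions).
std::size_t PartitionFor(const std::string& row, std::size_t num_partitions) {
    return std::hash<std::string>{}(row) % num_partitions;
}

int main() {
    // With 2 partitions (the minimal setup), every row lands on partition 0 or 1.
    for (const std::string row : {"alice", "bob", "carol"}) {
        std::cout << row << " -> partition " << PartitionFor(row, 2) << "\n";
    }
}
```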
~~like Bigtable: within a tablet file, rows are ordered lexicographically for easy searching.~~
Tablet file logical format: a mapping (row, column, ~~version~~) -> value.
Physical format on disk: a tablet is a directory; each row is a file in that directory. The file name is the row key (or a hash of it), and the file content is the column -> value mapping.
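A minimal sketch of that on-disk layout, using the raw row key as the file name. The `column<TAB>value` line encoding is an assumption for illustration (it breaks if values contain tabs or newlines; a length-prefixed format would be safer).

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

namespace fs = std::filesystem;

// A tablet is a directory; one row = one file; the file body is a column -> value map.
void WriteRow(const fs::path& tablet_dir, const std::string& row,
              const std::map<std::string, std::string>& columns) {
    fs::create_directories(tablet_dir);
    std::ofstream out(tablet_dir / row, std::ios::trunc);
    for (const auto& [col, val] : columns) out << col << '\t' << val << '\n';
}

std::map<std::string, std::string> ReadRow(const fs::path& tablet_dir,
                                           const std::string& row) {
    std::map<std::string, std::string> columns;
    std::ifstream in(tablet_dir / row);
    std::string line;
    while (std::getline(in, line)) {
        auto tab = line.find('\t');
        if (tab != std::string::npos)
            columns[line.substr(0, tab)] = line.substr(tab + 1);
    }
    return columns;
}

int main() {
    WriteRow("tablet0", "alice", {{"email", "alice@example.com"}, {"age", "30"}});
    for (const auto& [col, val] : ReadRow("tablet0", "alice"))
        std::cout << col << " = " << val << "\n";
}
```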
Extra credit - dynamic ranges: dynamically add/remove tablet servers. The data movement is tricky; omit for now.
## 3.5 Consistency and fault tolerance
update atomicity: a per-row lock is fine. Need a per-row lock array per tablet??
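One possible shape for that lock array, sketched below: a fixed stripe of mutexes per tablet, indexed by the row hash, so two writes to the same row serialize while writes to different rows usually don't contend. The stripe count is an arbitrary assumption.

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <string>

class RowLocks {
public:
    // Returns the lock guarding a given row. Rows that hash to the same stripe
    // share a mutex, which is safe, just slightly coarser than a true per-row lock.
    std::mutex& ForRow(const std::string& row) {
        return locks_[std::hash<std::string>{}(row) % locks_.size()];
    }

private:
    std::array<std::mutex, 64> locks_;  // 64 stripes is an arbitrary choice
};

// Usage inside a put/cput/delete handler:
//   RowLocks locks;                      // one instance per tablet
//   std::lock_guard<std::mutex> g(locks.ForRow(row));
//   ... apply the update to the write cache / WAL ...
```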
One write-ahead log (WAL) per tablet; the whole tablet is kept in memory.
Periodic checkpointing, then truncate the log.
Write cache: copy-on-write (COW); if the row is already cached, overwrite the cache entry, otherwise create a new one (by `mmap()`-ing the row file with the `MAP_PRIVATE` flag??).
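A minimal sketch of that `MAP_PRIVATE` idea, assuming each row file is mapped copy-on-write: in-memory edits touch only the process's private pages, and the file on disk stays unchanged until a checkpoint writes the cached copy back out explicitly. POSIX-only; error handling trimmed for brevity.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("tablet0/alice", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    // MAP_PRIVATE: writes below go to private copy-on-write pages, not to the file.
    char* cache = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0));
    close(fd);
    if (cache == MAP_FAILED) { perror("mmap"); return 1; }

    // Overwrite the cached copy in place (the "if cached, overwrite cache" path).
    if (st.st_size >= 1) cache[0] = 'X';

    // At checkpoint time the dirty cache would be written back with write()/rename();
    // munmap alone simply discards the private changes.
    munmap(cache, st.st_size);
    return 0;
}
```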
checkpointing: new checkpoint = old checkpoint + write (dirty) cache.
Checkpoint when the number of updates since the last checkpoint exceeds a threshold.
[IOWOW - Write Ahead Logging](https://iowow.io/wal)
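A minimal sketch of the per-tablet WAL plus threshold-triggered checkpoint described above. The `PUT row col value` record format and the threshold value are assumptions for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <string>

class Tablet {
public:
    explicit Tablet(std::string dir) : dir_(std::move(dir)) {
        std::filesystem::create_directories(dir_);
        wal_.open(dir_ + "/wal.log", std::ios::app);
    }

    void Put(const std::string& row, const std::string& col, const std::string& val) {
        // Write-ahead: log the update first, then apply it to the in-memory tablet.
        wal_ << "PUT " << row << ' ' << col << ' ' << val << '\n';
        wal_.flush();                                   // a real WAL would also fsync()
        data_[row][col] = val;
        if (++updates_since_checkpoint_ >= kCheckpointEvery) Checkpoint();
    }

private:
    void Checkpoint() {
        // New checkpoint = old checkpoint + dirty write cache; this sketch simply
        // dumps the whole in-memory tablet, then truncates the log.
        std::ofstream ckpt(dir_ + "/checkpoint", std::ios::trunc);
        for (const auto& [row, cols] : data_)
            for (const auto& [col, val] : cols)
                ckpt << row << ' ' << col << ' ' << val << '\n';
        wal_.close();
        wal_.open(dir_ + "/wal.log", std::ios::trunc);  // truncate the WAL
        updates_since_checkpoint_ = 0;
    }

    static constexpr int kCheckpointEvery = 128;        // assumed threshold
    std::string dir_;
    std::ofstream wal_;
    std::map<std::string, std::map<std::string, std::string>> data_;
    int updates_since_checkpoint_ = 0;
};
```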
Tablet replication (sequential consistency: is it enough to keep the same copy of the tablet file across its storage nodes??); the master maintains the list of all storage nodes holding a replica of each tablet.
### crash recovery
Normally, the master monitors which storage nodes are alive via heartbeats and maintains the row range -> node mapping.
Once a node stops responding (crashes): remove it from the row range -> node mapping.
Once a node restarts and registers itself again: 1. the master adds it back to its previous mapping and replica set; 2. the master asks the restarted node to copy the storage state from one of the replicas.
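A minimal sketch of that monitoring/re-registration logic on the master side; the 5-second timeout, the type names, and the plain method calls standing in for gRPC are all assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Partition { std::vector<std::string> nodes; };  // "ip:port" of primary + secondaries

struct Master {
    std::map<int, Partition> mapping;                  // partition index -> nodes
    std::map<std::string, Clock::time_point> last_hb;  // node -> last heartbeat time

    void OnHeartbeat(const std::string& node) { last_hb[node] = Clock::now(); }

    // Called periodically; drops nodes that have been silent for more than 5s
    // (assumed timeout) from every replica set they appear in.
    void CheckLiveness() {
        const auto now = Clock::now();
        for (const auto& [node, t] : last_hb) {
            if (now - t > std::chrono::seconds(5)) {
                for (auto& entry : mapping) {
                    auto& nodes = entry.second.nodes;
                    nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end());
                }
                std::cout << node << " presumed dead, removed from the mapping\n";
            }
        }
    }

    // A restarted node registers again: add it back and tell it which surviving
    // replica of its partition to copy the tablet state from.
    void OnRegister(const std::string& node, int partition) {
        auto& part = mapping[partition];
        const std::string source = part.nodes.empty() ? "<none>" : part.nodes.front();
        part.nodes.push_back(node);
        OnHeartbeat(node);
        std::cout << node << " rejoined partition " << partition
                  << ", copy tablet state from " << source << "\n";
    }
};
```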
Storage node recovery after a crash: pull the tablet checkpoint from another storage node, then pull the log while all writes/deletes are suspended (by blocking write operations at the master server??).
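A minimal sketch of that recovery path, assuming the checkpoint/WAL file formats from the sketch above: rebuild the in-memory tablet from the checkpoint fetched from a replica, then replay the (short) log on top of it.

```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>

using TabletData = std::map<std::string, std::map<std::string, std::string>>;

TabletData Recover(const std::string& dir) {
    TabletData data;
    std::string line;

    // 1. Load the last checkpoint (copied from a live replica of this tablet).
    std::ifstream ckpt(dir + "/checkpoint");
    while (std::getline(ckpt, line)) {
        std::istringstream rec(line);
        std::string row, col, val;
        rec >> row >> col;
        rec.ignore(1);                 // skip the single space before the value
        std::getline(rec, val);
        data[row][col] = val;
    }

    // 2. Replay log records written after that checkpoint.
    std::ifstream wal(dir + "/wal.log");
    while (std::getline(wal, line)) {
        std::istringstream rec(line);
        std::string op, row, col, val;
        rec >> op >> row >> col;
        rec.ignore(1);
        std::getline(rec, val);
        if (op == "PUT") data[row][col] = val;
        else if (op == "DELETE") data[row].erase(col);
    }
    // Writes/deletes to this partition stay suspended until Recover() returns,
    // so no update is missed while the node catches up.
    return data;
}
```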
Extra credit - master node failure (low priority): move the in-memory master state into a replicated tablet, i.e., run the master as a tablet server itself.
## get/put/cput/delete operations
get:
1. query the master -> fetch from one storage node (check the write cache first; on a miss, read the row file from disk)
put/cput/delete:
1. query the master -> route the request to the primary storage node
2. the primary storage node initiates a 2-phase commit among all replicas (including itself) and then replies to the client
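A minimal sketch of the primary's 2PC coordination for step 2. `Replica::Prepare/Commit/Abort` are stand-ins for the gRPC calls the project would actually define, and the single-operation "transaction" is an assumption.

```cpp
#include <memory>
#include <string>
#include <vector>

struct Put { std::string row, col, val; };

// Stand-in for the RPC surface each replica (including the primary itself) exposes.
class Replica {
public:
    virtual ~Replica() = default;
    virtual bool Prepare(const Put& op) = 0;   // vote yes/no and log the intent
    virtual void Commit(const Put& op)  = 0;   // apply the logged write
    virtual void Abort(const Put& op)   = 0;   // discard the logged intent
};

// Run by the primary; returns true iff the write is applied on all replicas.
bool TwoPhaseCommit(const std::vector<std::shared_ptr<Replica>>& replicas, const Put& op) {
    // Phase 1: everyone (primary included) must vote yes.
    for (const auto& r : replicas) {
        if (!r->Prepare(op)) {
            for (const auto& rr : replicas) rr->Abort(op);
            return false;                       // reply "failed" to the client
        }
    }
    // Phase 2: all voted yes, so tell everyone to commit, then ack the client.
    for (const auto& r : replicas) r->Commit(op);
    return true;
}
```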
## concurrency issues
- read-write conflicts
- recovery(read)-write conflicts
## questions:
How is the system tested/deployed, in a single VM or multiple VMs? A: in a single VM
Minimal setup: 1 master, 2 partitions, each with 1 primary and 1 secondary.
suggestions:
- enumerate the request types in gRPC
- frontend <-> storage node
- when checkpointing, flush the log
- use gRPC for the heartbeat
I’d delay 2PC until you have the rest of the end-to-end project working first. There are many core components—e.g., frontend and backend integration, checkpointing and logging, recovery, etc.
Questions:
2PC not needed, because the primary is the sequencer?
cache not needed? read everything into memory?
fault tolerance:
1. how to operate when a primary/secondary node fails? (see the failover sketch after this list)
- secondary fails -> the master informs the primary of the latest node group; the storage nodes keep working with one less replica
- primary fails -> the master picks a secondary as the new primary and informs it of the latest node group; the storage nodes keep working with one less replica
2. how to rejoin a group? which node is assumed to rejoin (the rebooted node? the same address?)
3. node recovery/logging:
checkpointing and logging, recovery
http://www.cs.cmu.edu/~dga/15-440/F10/lectures/Write-ahead%20Logging-private.pdf
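A minimal sketch of the failover rules from item 1 above (assumed type and function names): a failed secondary just shrinks the group, while a failed primary causes the master to promote a secondary and broadcast the new group.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct ReplicaGroup {
    std::string primary;                 // "ip:port"
    std::vector<std::string> secondaries;
};

// Called by the master when `dead` stops answering heartbeats.
// Returns false if no replica of the partition is left alive.
bool HandleFailure(ReplicaGroup& g, const std::string& dead) {
    if (dead == g.primary) {
        if (g.secondaries.empty()) return false;   // partition is unavailable
        g.primary = g.secondaries.back();          // promote any secondary
        g.secondaries.pop_back();
    } else {
        g.secondaries.erase(
            std::remove(g.secondaries.begin(), g.secondaries.end(), dead),
            g.secondaries.end());
    }
    // The master would now push the updated group to the remaining storage nodes,
    // which keep serving with one fewer replica until the dead node rejoins.
    return true;
}
```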
implementation based on a gRPC architecture:
who are the clients/servers?
what kinds of RPCs to call?
## failure model
admin webpage: suspend a node, modify its data, resume the node.
ctrl-c to kill a node process, then rerun it.