# key-value store
two types of backend servers:
- master: remembers which row ranges are stored where (the tablet -> replica storage node mapping), keeps track of which storage nodes are currently alive, triggers re-replication of failed tablet copies, and polls the storage nodes for health checks.
- storage node: manages one tablet file.
  - primary storage node: coordinates updates via 2PC, handles write requests/responses.
  - secondary storage node: a pure replica, serves reads.
(remark: in the Bigtable implementation each storage node holds ~100 tablets for better failure recovery and load balancing, but here each node holds only one tablet for simplicity??)
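Below is a minimal sketch (assumed type and field names) of the in-memory state the master would keep for this: the partition -> replica-set mapping plus per-node liveness info refreshed by the health-check poll.

```cpp
// Sketch of the master's in-memory bookkeeping (names are assumptions).
#include <cstdint>
#include <chrono>
#include <map>
#include <string>
#include <vector>

struct NodeAddr {
    std::string ip;
    uint16_t    port = 0;
};

struct ReplicaSet {
    NodeAddr              primary;      // coordinates updates / 2PC
    std::vector<NodeAddr> secondaries;  // pure replicas, serve reads
};

struct NodeStatus {
    bool alive = false;
    std::chrono::steady_clock::time_point last_heartbeat;
};

struct MasterState {
    // partition index (Hash(row) % number of partitions) -> replica set for that tablet
    std::map<uint32_t, ReplicaSet>    tablets;
    // "ip:port" -> liveness info, refreshed by the periodic health-check poll
    std::map<std::string, NodeStatus> nodes;
};
```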
## 3.4 Partitioning the storage
the master maintains a partition-to-node mapping:
row hash range -> nodes' ip:port (1 primary node + 1 secondary node)
the mapping (row -> node index) is defined by a hash-modulo operation, i.e., Hash(row) % (number of primary nodes).
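A minimal sketch of that hash-modulo lookup; `std::hash` stands in for whatever row hash function the project settles on.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Maps a row key to a partition index in [0, num_partitions).
std::size_t PartitionFor(const std::string& row, std::size_t num_partitions) {
    return std::hash<std::string>{}(row) % num_partitions;
}

int main() {
    // With 2 partitions (the minimal setup), every row lands on partition 0 or 1.
    for (const std::string row : {"alice", "bob", "carol"}) {
        std::cout << row << " -> partition " << PartitionFor(row, 2) << "\n";
    }
}
```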
~~like Bigtable: within a tablet file, rows are ordered lexicographically for easy searching.~~
Tablet file logical format: a mapping (row, column, ~~version~~) -> value.
Physical format on disk: a tablet is a directory; each row is a file in that directory. The file name is the row key (or a hash of it), and the file content is the column -> value mapping.
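A minimal sketch of that on-disk layout, using the raw row key as the file name. The `column<TAB>value` line encoding is an assumption for illustration (it breaks if values contain tabs or newlines; a length-prefixed format would be safer).

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

namespace fs = std::filesystem;

// A tablet is a directory; one row = one file; the file body is a column -> value map.
void WriteRow(const fs::path& tablet_dir, const std::string& row,
              const std::map<std::string, std::string>& columns) {
    fs::create_directories(tablet_dir);
    std::ofstream out(tablet_dir / row, std::ios::trunc);
    for (const auto& [col, val] : columns) out << col << '\t' << val << '\n';
}

std::map<std::string, std::string> ReadRow(const fs::path& tablet_dir,
                                           const std::string& row) {
    std::map<std::string, std::string> columns;
    std::ifstream in(tablet_dir / row);
    std::string line;
    while (std::getline(in, line)) {
        auto tab = line.find('\t');
        if (tab != std::string::npos)
            columns[line.substr(0, tab)] = line.substr(tab + 1);
    }
    return columns;
}

int main() {
    WriteRow("tablet0", "alice", {{"email", "alice@example.com"}, {"age", "30"}});
    for (const auto& [col, val] : ReadRow("tablet0", "alice"))
        std::cout << col << " = " << val << "\n";
}
```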
Extra credit - dynamic ranges: dynamically add/remove tablet servers. The data movement is tricky; omit for now.
## 3.5 Consistency and fault tolerance
update atomicity: a per-row lock is fine. Need a per-row lock array per tablet??
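One possible shape for that lock array, sketched below: a fixed stripe of mutexes per tablet, indexed by the row hash, so two writes to the same row serialize while writes to different rows usually don't contend. The stripe count is an arbitrary assumption.

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <string>

class RowLocks {
public:
    // Returns the lock guarding a given row. Rows that hash to the same stripe
    // share a mutex, which is safe, just slightly coarser than a true per-row lock.
    std::mutex& ForRow(const std::string& row) {
        return locks_[std::hash<std::string>{}(row) % locks_.size()];
    }

private:
    std::array<std::mutex, 64> locks_;  // 64 stripes is an arbitrary choice
};

// Usage inside a put/cput/delete handler:
//   RowLocks locks;                      // one instance per tablet
//   std::lock_guard<std::mutex> g(locks.ForRow(row));
//   ... apply the update to the write cache / WAL ...
```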
One write-ahead log (WAL) per tablet; the whole tablet is kept in memory.
Periodic checkpointing, then truncate the log.
Write cache: copy-on-write (COW); if the row is already cached, overwrite the cache entry, otherwise create a new one (by `mmap()`-ing the row file with the `MAP_PRIVATE` flag??).
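A minimal sketch of that `MAP_PRIVATE` idea, assuming each row file is mapped copy-on-write: in-memory edits touch only the process's private pages, and the file on disk stays unchanged until a checkpoint writes the cached copy back out explicitly. POSIX-only; error handling trimmed for brevity.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("tablet0/alice", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    // MAP_PRIVATE: writes below go to private copy-on-write pages, not to the file.
    char* cache = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0));
    close(fd);
    if (cache == MAP_FAILED) { perror("mmap"); return 1; }

    // Overwrite the cached copy in place (the "if cached, overwrite cache" path).
    if (st.st_size >= 1) cache[0] = 'X';

    // At checkpoint time the dirty cache would be written back with write()/rename();
    // munmap alone simply discards the private changes.
    munmap(cache, st.st_size);
    return 0;
}
```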
checkpointing: new checkpoint = old checkpoint + write (dirty) cache.
Checkpoint when the number of updates since the last checkpoint exceeds a threshold.
[IOWOW - Write Ahead Logging](https://iowow.io/wal)
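A minimal sketch of the per-tablet WAL plus threshold-triggered checkpoint described above. The `PUT row col value` record format and the threshold value are assumptions for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <string>

class Tablet {
public:
    explicit Tablet(std::string dir) : dir_(std::move(dir)) {
        std::filesystem::create_directories(dir_);
        wal_.open(dir_ + "/wal.log", std::ios::app);
    }

    void Put(const std::string& row, const std::string& col, const std::string& val) {
        // Write-ahead: log the update first, then apply it to the in-memory tablet.
        wal_ << "PUT " << row << ' ' << col << ' ' << val << '\n';
        wal_.flush();                                   // a real WAL would also fsync()
        data_[row][col] = val;
        if (++updates_since_checkpoint_ >= kCheckpointEvery) Checkpoint();
    }

private:
    void Checkpoint() {
        // New checkpoint = old checkpoint + dirty write cache; this sketch simply
        // dumps the whole in-memory tablet, then truncates the log.
        std::ofstream ckpt(dir_ + "/checkpoint", std::ios::trunc);
        for (const auto& [row, cols] : data_)
            for (const auto& [col, val] : cols)
                ckpt << row << ' ' << col << ' ' << val << '\n';
        wal_.close();
        wal_.open(dir_ + "/wal.log", std::ios::trunc);  // truncate the WAL
        updates_since_checkpoint_ = 0;
    }

    static constexpr int kCheckpointEvery = 128;        // assumed threshold
    std::string dir_;
    std::ofstream wal_;
    std::map<std::string, std::map<std::string, std::string>> data_;
    int updates_since_checkpoint_ = 0;
};
```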
Tablet replication (sequential consistency: is it enough to keep the same copy of the tablet file across its storage nodes??); the master maintains the list of all storage nodes holding a replica of each tablet.
### crash recovery
Normally, the master monitors which storage nodes are alive via heartbeats and maintains the row range -> node mapping.
Once a node stops responding (crashes): remove it from the row range -> node mapping.
Once a node restarts and registers itself again: 1. the master adds it back to its previous mapping and replica set; 2. the master asks the restarted node to copy the storage state from one of the replicas.
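A minimal sketch of that monitoring/re-registration logic on the master side; the 5-second timeout, the type names, and the plain method calls standing in for gRPC are all assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Partition { std::vector<std::string> nodes; };  // "ip:port" of primary + secondaries

struct Master {
    std::map<int, Partition> mapping;                  // partition index -> nodes
    std::map<std::string, Clock::time_point> last_hb;  // node -> last heartbeat time

    void OnHeartbeat(const std::string& node) { last_hb[node] = Clock::now(); }

    // Called periodically; drops nodes that have been silent for more than 5s
    // (assumed timeout) from every replica set they appear in.
    void CheckLiveness() {
        const auto now = Clock::now();
        for (const auto& [node, t] : last_hb) {
            if (now - t > std::chrono::seconds(5)) {
                for (auto& entry : mapping) {
                    auto& nodes = entry.second.nodes;
                    nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end());
                }
                std::cout << node << " presumed dead, removed from the mapping\n";
            }
        }
    }

    // A restarted node registers again: add it back and tell it which surviving
    // replica of its partition to copy the tablet state from.
    void OnRegister(const std::string& node, int partition) {
        auto& part = mapping[partition];
        const std::string source = part.nodes.empty() ? "<none>" : part.nodes.front();
        part.nodes.push_back(node);
        OnHeartbeat(node);
        std::cout << node << " rejoined partition " << partition
                  << ", copy tablet state from " << source << "\n";
    }
};
```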
Storage node recovery after a crash: pull the tablet checkpoint from another storage node, then pull the log while all writes/deletes are suspended (by blocking write operations at the master server??).
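A minimal sketch of that recovery path, assuming the checkpoint/WAL file formats from the sketch above: rebuild the in-memory tablet from the checkpoint fetched from a replica, then replay the (short) log on top of it.

```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>

using TabletData = std::map<std::string, std::map<std::string, std::string>>;

TabletData Recover(const std::string& dir) {
    TabletData data;
    std::string line;

    // 1. Load the last checkpoint (copied from a live replica of this tablet).
    std::ifstream ckpt(dir + "/checkpoint");
    while (std::getline(ckpt, line)) {
        std::istringstream rec(line);
        std::string row, col, val;
        rec >> row >> col;
        rec.ignore(1);                 // skip the single space before the value
        std::getline(rec, val);
        data[row][col] = val;
    }

    // 2. Replay log records written after that checkpoint.
    std::ifstream wal(dir + "/wal.log");
    while (std::getline(wal, line)) {
        std::istringstream rec(line);
        std::string op, row, col, val;
        rec >> op >> row >> col;
        rec.ignore(1);
        std::getline(rec, val);
        if (op == "PUT") data[row][col] = val;
        else if (op == "DELETE") data[row].erase(col);
    }
    // Writes/deletes to this partition stay suspended until Recover() returns,
    // so no update is missed while the node catches up.
    return data;
}
```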
Extra credit - master node failure (low priority): move the in-memory master state into a replicated tablet, i.e., run the master as a tablet server itself.
## get/put/cput/delete operations
get:
1. query the master -> fetch from one storage node (check the write cache first; on a miss, read the row file from disk)
put/cput/delete:
1. query the master -> route the request to the primary storage node
2. the primary storage node initiates a 2-phase commit among all replicas (including itself) and then replies to the client
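A minimal sketch of the primary's 2PC coordination for step 2. `Replica::Prepare/Commit/Abort` are stand-ins for the gRPC calls the project would actually define, and the single-operation "transaction" is an assumption.

```cpp
#include <memory>
#include <string>
#include <vector>

struct Put { std::string row, col, val; };

// Stand-in for the RPC surface each replica (including the primary itself) exposes.
class Replica {
public:
    virtual ~Replica() = default;
    virtual bool Prepare(const Put& op) = 0;   // vote yes/no and log the intent
    virtual void Commit(const Put& op)  = 0;   // apply the logged write
    virtual void Abort(const Put& op)   = 0;   // discard the logged intent
};

// Run by the primary; returns true iff the write is applied on all replicas.
bool TwoPhaseCommit(const std::vector<std::shared_ptr<Replica>>& replicas, const Put& op) {
    // Phase 1: everyone (primary included) must vote yes.
    for (const auto& r : replicas) {
        if (!r->Prepare(op)) {
            for (const auto& rr : replicas) rr->Abort(op);
            return false;                       // reply "failed" to the client
        }
    }
    // Phase 2: all voted yes, so tell everyone to commit, then ack the client.
    for (const auto& r : replicas) r->Commit(op);
    return true;
}
```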
## concurrency issues
- read-write conflicts
- recovery(read)-write conflicts
## questions:
How is the system tested/deployed, in a single VM or multiple VMs? A: in a single VM
Minimal setup: 1 master, 2 partitions, each with 1 primary and 1 secondary.
suggestions:
- enumerate the request types in gRPC
- frontend <-> storage node
- when checkpointing, flush the log
- use gRPC for the heartbeat
I’d delay 2PC until you have the rest of the end-to-end project working first. There are many core components—e.g., frontend and backend integration, checkpointing and logging, recovery, etc.
Questions:
2PC not needed, because the primary is the sequencer?
cache not needed? read everything into memory?
fault tolerance:
1. how to operate when a primary/secondary node fails? (see the failover sketch after this list)
- secondary fails -> the master informs the primary of the latest node group; the storage nodes keep working with one less replica
- primary fails -> the master picks a secondary as the new primary and informs it of the latest node group; the storage nodes keep working with one less replica
2. how to rejoin a group? which node is assumed to rejoin (the rebooted node? the same address?)
3. node recovery/logging:
checkpointing and logging, recovery
http://www.cs.cmu.edu/~dga/15-440/F10/lectures/Write-ahead%20Logging-private.pdf
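A minimal sketch of the failover rules from item 1 above (assumed type and function names): a failed secondary just shrinks the group, while a failed primary causes the master to promote a secondary and broadcast the new group.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct ReplicaGroup {
    std::string primary;                 // "ip:port"
    std::vector<std::string> secondaries;
};

// Called by the master when `dead` stops answering heartbeats.
// Returns false if no replica of the partition is left alive.
bool HandleFailure(ReplicaGroup& g, const std::string& dead) {
    if (dead == g.primary) {
        if (g.secondaries.empty()) return false;   // partition is unavailable
        g.primary = g.secondaries.back();          // promote any secondary
        g.secondaries.pop_back();
    } else {
        g.secondaries.erase(
            std::remove(g.secondaries.begin(), g.secondaries.end(), dead),
            g.secondaries.end());
    }
    // The master would now push the updated group to the remaining storage nodes,
    // which keep serving with one fewer replica until the dead node rejoins.
    return true;
}
```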
implementation based on a gRPC architecture:
who are the clients/servers?
what kinds of RPCs to call?
## failure model
admin webpage: suspend a node, modify its data, resume the node.
ctrl-c to kill a node process, then rerun it.