### Sinfonia: a new paradigm for building scalable distributed systems

---

## Intention

* replace *message passing* with *shared memory* for collaboration between processes
* a memory-sharing service that facilitates the development of distributed applications
* targeting infrastructure applications
  * cluster file systems
  * group communication services
  * lock managers

----

## Inspiration

* Database systems - BerkeleyDB
  * lack performance for infrastructure applications
* DSM (Distributed Shared Memory) systems - Plurix
  * lack scalability or fault tolerance

----

## Ideas

### Scalability

* decouple operations executed by different hosts
* apply fine-grained address spaces

### Performance

* distribute data over multiple memory nodes
* modified two-phase commit

----

![](https://i.imgur.com/FQJojhT.png)

* application nodes (memory used by the application)
* coordinators
* memory nodes (memory managed by Sinfonia)

---

## Minitransaction Design

----

* Each memory node keeps a sequence of contiguous memory space
  * simply a linear address space
* Accessed by (mem-id, addr, len, data)

----

### Minitransaction

```graphviz
digraph {
  compound=true
  rankdir=RL
  graph [ fontname="Source Sans Pro", fontsize=20 ];
  node [ fontname="Source Sans Pro", fontsize=14 ];
  edge [ fontname="Source Sans Pro", fontsize=12 ];
  subgraph core {
    handler [label="minitransaction \n(mem-id, addr, len, data) "] [shape=box]
  }
  handler -> b
  subgraph cluster1 {
    a [label="bytes"] [shape=box]
    c [label="bytes"] [shape=box]
    b [label="bytes"] [shape=box]
    c -> b [dir="none"] [style="dotted"]
    b -> a [dir="none"] [style="dotted"]
    label="memory node"
  }
}
```

* compare(s) *(mem-id, addr, len, data)*
* read(s) *(mem-id, addr, len, ~~data~~)*
* write(s) *(mem-id, addr, len, data)*

---

## Minitransaction Protocol

----

* Blocks on participant crashes instead of coordinator crashes
* Considers a transaction committed
  * if all participants vote yes
* based on two-phase commit

----

![](https://i.imgur.com/DyFJPV9.png)

----

### Phase 1
* Prepare and execute the minitransaction
* Acquire locks
  * if locking fails, release all locks and vote abort
* If all locations could be locked and all compare items succeeded, vote to commit

----

### Phase 1 (Cont.)

* Participants log minitransactions in the redo-log
  * log only if the participant votes to commit
* ***write-ahead redo-log*** for all transactions
* *in-doubt list* for the undecided
* ***forced-abort list*** for recovery (covered later)
* ***decided list*** for those either committed or aborted

----

### Phase 2

* If all vote to commit, tell all participants to commit
* If at least one participant aborts because some locks are busy, the coordinator retries the minitransaction with a new *tid*

----

### Optimized two-phase commit

* If a minitransaction has just one participant, it can be executed in one phase.
* Try to keep a file's content (cached data blocks) and its inode (metadata) on the same memory node.
* Memory nodes need not store read-only minitransactions in the redo-log.

----

### Recovery Coordinator

* Recovery is executed by a dedicated management node for Sinfonia
* Periodically probes minitransactions in the *in-doubt list* of each node
  * on timeout, starts recovery

----

### Recovery from participant crash

* Synchronize its disk image by replaying its redo-log
  * only those that committed, based on the *decided list*
* If undecided (in the *redo-log* but not in the *decided list*), the recovery coordinator requests all other participants to vote abort
  * if a participant already voted, it keeps its vote
  * if it has not voted yet, it votes abort and places the transaction in the *forced-abort list*

----

### Recovery from system crash

* When rebooted, a memory node sends a reboot notification to the management node
* Waits if there are other pending reboot notifications
* Memory nodes send each other their votes on recent minitransactions

----

### Log garbage collection

* A committed transaction can be garbage collected only when it has been applied to the disk image of every memory node
  * *in-doubt list* & *decided list* & *redo-log*
* The *forced-abort list* is garbage collected according to epoch numbers or coordinator termination
  * the current epoch is kept by each memory node

----

### Conclusion on recovery

* Recovery on a node is triggered in three cases
  * timeout on an undecided minitransaction
  * a memory node reboots
  * a gap is discovered on receiving COMMIT

---

## Applications

* SinfoniaFS
* Group Communication Service

----

### Cluster File System

* Sinfonia simplified the design:
  * cluster nodes need not coordinate or negotiate with each other
  * cluster nodes need not keep journals or track state cached at remote nodes
* Leverages Sinfonia's write-ahead redo-log for performance
* Data block numbers are pairs of a memory node id and an offset within that node

----

### Caching

* Cluster nodes can cache arbitrary amounts of data & metadata
  * superblock -- static information about the entire volume
  * free-block bitmap -- which data blocks are in use
  * chaining-list blocks -- where the data blocks comprising a file are located
* Validated before being used
  * via *compare operations* in a minitransaction

----

### Contention & load balancing

* Sinfonia does not balance load across memory nodes, but provides load information for each memory node
* Each cluster node keeps a preferred memory node for newly written data
  * chooses a new inode & block at random if another cluster node allocates it first

---

### Group Communication Service

* The current set of members is called the latest view
* View change messages indicate members joining or leaving

----

### Design

* Each member keeps a tail pointer indicating the latest message received, as if there were a global tail
  * indicates the latest message received
  * each member keeps its own tail pointer
* For example
  * to join, a member acquires a global lease on the metadata and uses a minitransaction that combines *compare & write operations*
* Combine many small messages into one for optimization

---

## Evaluation

* Base performance
* Scalability
* Performance under contention

----

### Base Performance

<img
src="https://i.imgur.com/QD8we6l.png" width="450">

* Modes LOG and LOG-REPL log minitransactions synchronously to disk, so they are comparable to Berkeley DB (versus the modes that assume non-volatile RAM)
* REPL: primary-copy replication across memory nodes
* Peaks at 6,500-7,500 minitransactions per second due to a CPU bottleneck

----

### Base Performance (Cont.)

<img src="https://i.imgur.com/YEUt53J.png" width="450">

----

### Scalability

* Up to 246 machines & 32 threads issuing minitransactions as fast as possible

<img src="https://i.imgur.com/ixQhIXe.png" width="450">

----

### With contention

* Varied the probability that two minitransactions overlap

<img src="https://i.imgur.com/HvVHWjR.png" width="450">

---

### Conclusion

* Sinfonia uses fine-grained address spaces to increase scalability and performance
* Sinfonia does not cache data for applications
  * left to applications for more flexibility in caching policy
  * applications know their data better
* Sinfonia does not balance load across memory nodes
  * also left to applications
  * provides load information
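---

### Appendix: minitransaction sketch

To make the minitransaction items (compare/read/write) and the modified two-phase commit concrete, here is a minimal in-memory Python sketch. The names (`MemoryNode`, `run_minitransaction`) and the item encodings are illustrative assumptions, not Sinfonia's API; locking is simplified to per-address granularity and logging/recovery are omitted.

```python
# Illustrative model of Sinfonia minitransactions (not the paper's code).
# Items per node: compares [(addr, expected_bytes)], reads [(addr, len)],
# writes [(addr, data_bytes)].

class MemoryNode:
    """A participant holding a flat, linearly addressed byte space."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.locks = set()       # addresses currently locked (simplified)
        self._touched = set()

    def prepare(self, compares, reads, writes):
        """Phase 1: lock touched addresses, evaluate compare items, vote."""
        touched = ({a for a, _ in compares} | {a for a, _ in reads}
                   | {a for a, _ in writes})
        if touched & self.locks:
            return "abort", []   # lock busy: coordinator retries with new tid
        self.locks |= touched
        for addr, expected in compares:
            if bytes(self.mem[addr:addr + len(expected)]) != expected:
                self.locks -= touched
                return "abort", []   # any failed compare aborts the whole txn
        self._touched = touched
        data = [bytes(self.mem[a:a + n]) for a, n in reads]
        return "commit", data

    def commit(self, writes):
        """Phase 2: apply write items, then release locks."""
        for addr, data in writes:
            self.mem[addr:addr + len(data)] = data
        self.locks -= self._touched
        self._touched = set()

    def abort(self):
        self.locks -= self._touched
        self._touched = set()

def run_minitransaction(nodes, items):
    """Coordinator: commit iff every participant votes yes (two-phase commit).
    items: {node_id: (compares, reads, writes)}. Returns (committed, reads)."""
    results = {nid: nodes[nid].prepare(*items[nid]) for nid in items}
    if all(vote == "commit" for vote, _ in results.values()):
        for nid in items:
            nodes[nid].commit(items[nid][2])
        return True, {nid: data for nid, (_, data) in results.items()}
    for nid in items:            # roll back participants that voted yes
        if results[nid][0] == "commit":
            nodes[nid].abort()
    return False, {}

# Usage: atomically compare-and-swap on node 0 while writing to node 1.
nodes = {0: MemoryNode(16), 1: MemoryNode(16)}
run_minitransaction(nodes, {0: ([], [], [(0, b"A")])})        # seed node 0
ok, _ = run_minitransaction(
    nodes,
    {0: ([(0, b"A")], [], [(0, b"B")]),  # commit only if node 0 still holds "A"
     1: ([], [], [(0, b"B")])})
```

Note how a single-participant minitransaction needs no second round-trip in principle, which is the one-phase optimization mentioned above.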
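---

### Appendix: redo-log replay sketch

The participant-crash recovery described above (replay only entries the *decided list* marks committed; leave undecided entries to the recovery coordinator) can be sketched as follows. This is an illustrative model under assumed data shapes, not Sinfonia's log format; `replay_redo_log` is a hypothetical name.

```python
# Illustrative sketch: rebuilding a memory node's disk image from its
# write-ahead redo-log after a crash.
#   mem:      bytearray disk image being reconstructed
#   redo_log: [(tid, [(addr, data_bytes), ...])] in log order
#   decided:  {tid: "commit" | "abort"}  (the decided list)
# Returns tids still in doubt, which the recovery coordinator must resolve.

def replay_redo_log(mem, redo_log, decided):
    in_doubt = []
    for tid, writes in redo_log:
        outcome = decided.get(tid)
        if outcome == "commit":
            for addr, data in writes:   # replay is idempotent: safe to repeat
                mem[addr:addr + len(data)] = data
        elif outcome is None:
            in_doubt.append(tid)        # recovery coordinator decides later
        # aborted entries are simply skipped
    return in_doubt
```

Idempotent replay is what allows recovery itself to crash and restart safely, and entries can only be garbage collected once applied to the disk image, as the log garbage collection slide notes.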