### Sinfonia: a new paradigm for building scalable distributed systems
---
## Intention
* replace *message passing* with *shared memory* for coordination between processes
* a data-sharing service that facilitates the development of distributed applications
* targeting infrastructure applications
* cluster file system
* group communication services
* lock managers
----
## Inspiration
* Database systems, e.g., Berkeley DB
* lack performance for infrastructure applications
* Distributed Shared Memory (DSM) systems, e.g., Plurix
* lack scalability or fault tolerance
----
## Ideas
### Scalability
* decouple operations executed by different hosts
* apply fine-grained address spaces
### Performance
* data is distributed over multiple memory nodes
* commits use a modified two-phase commit
----

* application nodes (run the applications that use Sinfonia)
* coordinators (run the minitransaction commit protocol; part of the Sinfonia library on application nodes)
* memory nodes (store the data managed by Sinfonia)
---
## Minitransaction Design
----
* Each memory node keeps a contiguous sequence of raw memory
* simply a linear address space
* accessed by *(mem-id, addr, len, data)*
----
### Minitransaction
```graphviz
digraph {
compound=true
rankdir=RL
graph [ fontname="Source Sans Pro", fontsize=20 ];
node [ fontname="Source Sans Pro", fontsize=14 ];
edge [ fontname="Source Sans Pro", fontsize=12 ];
subgraph core {
handler [label="minitransaction \n(mem-id, addr, len, data) "] [shape=box]
}
handler -> b
subgraph cluster1 {
a [label="bytes"] [shape=box]
c [label="bytes"] [shape=box]
b [label="bytes"] [shape=box]
c -> b [dir="none"] [style="dotted"]
b -> a [dir="none"] [style="dotted"]
label="memory node"
}
}
```
* compare item(s) *(mem-id, addr, len, data)*
* read item(s) *(mem-id, addr, len, ~~data~~)*
* write item(s) *(mem-id, addr, len, data)* -- composed as in the sketch below
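A minimal sketch of composing a minitransaction from the item types listed above. The class and method names are illustrative assumptions, not Sinfonia's actual API; the semantics shown match the paper (writes are applied and reads returned only if every compare item matches).

```python
# Hypothetical sketch of composing a minitransaction; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Item:
    mem_id: int        # which memory node
    addr: int          # offset within that node's linear address space
    length: int        # number of bytes
    data: bytes = b""  # payload for compare/write items; unused for read items

@dataclass
class Minitransaction:
    compare_items: list = field(default_factory=list)
    read_items: list = field(default_factory=list)
    write_items: list = field(default_factory=list)

    def cmp(self, mem_id, addr, data):
        self.compare_items.append(Item(mem_id, addr, len(data), data))

    def read(self, mem_id, addr, length):
        self.read_items.append(Item(mem_id, addr, length))

    def write(self, mem_id, addr, data):
        self.write_items.append(Item(mem_id, addr, len(data), data))

# Usage: atomic compare-and-swap of an 8-byte counter on memory node 0.
old, new = (5).to_bytes(8, "little"), (6).to_bytes(8, "little")
t = Minitransaction()
t.cmp(0, 128, old)     # commit only if the location still holds the old value
t.write(0, 128, new)   # then install the new value
# The coordinator would now execute-and-commit t via the protocol on the next slides.
```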
---
## Minitransaction Protocol
----
* Blocks on a participant crash, but not on a coordinator crash
* A minitransaction is considered committed
* if and only if all participants vote yes
* modified two-phase commit
----

### Phase 1
* Prepare and execute the minitransaction
* Acquire locks on the addressed locations (without blocking)
* if any lock cannot be acquired, release all acquired locks and vote abort
* If all locations could be locked and all compare items succeeded, vote to commit
----
### Phase 1 (Cont.)
* Participants log minitransactions in a redo-log (see the participant sketch below)
* logged only if the participant votes to commit
* ***write-ahead redo-log*** shared by all minitransactions
* *in-doubt list* for minitransactions whose outcome is not yet known
* ***forced-abort list*** for recovery (covered later)
* ***decided list*** recording the outcome, committed or aborted
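A heavily simplified, self-contained sketch of the participant (memory node) side, with items as plain per-node tuples and exact-range locking only; it also includes the phase-2 commit/abort handlers used on the next slide. All names are assumptions, not Sinfonia's code, and disk forcing of the log is elided.

```python
# Illustrative participant logic for the modified two-phase commit.
class MemoryNode:
    def __init__(self, size):
        self.mem = bytearray(size)
        self.locks = set()          # currently locked (addr, len) ranges
        self.held = {}              # tid -> ranges locked on its behalf
        self.redo_log = []          # write-ahead redo log: (tid, write_items)
        self.in_doubt = set()       # tids this node voted commit on, outcome unknown
        self.decided = {}           # tid -> "commit" or "abort"
        self.forced_abort = set()   # tids force-aborted during recovery

    def prepare(self, tid, compares, reads, writes):
        """Phase 1: lock locations, evaluate compares, log redo data, vote."""
        for rng in {(a, l) for (a, l, *_d) in compares + reads + writes}:
            if rng in self.locks:               # busy lock: release and report it
                self._release(tid)
                return "busy"
            self.locks.add(rng)
            self.held.setdefault(tid, []).append(rng)
        for (addr, length, expected) in compares:
            if bytes(self.mem[addr:addr + length]) != expected:
                self._release(tid)
                return "abort"                  # failed compare: vote abort
        self.redo_log.append((tid, list(writes)))   # logged only on a commit vote
        self.in_doubt.add(tid)
        return "commit"

    def commit(self, tid):
        """Phase 2 (commit): apply the logged writes, then clean up."""
        for logged_tid, writes in self.redo_log:
            if logged_tid == tid:
                for (addr, length, data) in writes:
                    self.mem[addr:addr + length] = data
        self._finish(tid, "commit")

    def abort(self, tid):
        """Phase 2 (abort): discard the attempt and release locks."""
        self._finish(tid, "abort")

    def _finish(self, tid, outcome):
        self.in_doubt.discard(tid)
        self.decided[tid] = outcome
        self._release(tid)

    def _release(self, tid):
        for rng in self.held.pop(tid, []):
            self.locks.discard(rng)
```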
----
### Phase 2
* If all participants vote to commit, the coordinator tells them all to commit
* If at least one participant aborts because some locks were busy, the coordinator retries the minitransaction with a new *tid* (see the coordinator sketch below)
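A sketch of the coordinator loop, continuing the hypothetical `MemoryNode` sketch above: commit if everyone votes commit, give up on a real abort (failed compare), and retry with a fresh tid if the only problem was busy locks.

```python
# Illustrative coordinator for the modified two-phase commit; not Sinfonia's code.
import uuid

def exec_and_commit(participants, build_items):
    """build_items(p) returns (compares, reads, writes) for participant p."""
    while True:
        tid = uuid.uuid4().hex                               # fresh tid per attempt
        votes = [p.prepare(tid, *build_items(p)) for p in participants]
        if all(v == "commit" for v in votes):
            for p in participants:
                p.commit(tid)                                # phase 2: commit
            return True
        for p in participants:
            p.abort(tid)                                     # phase 2: abort
        if "abort" in votes:
            return False          # a compare item failed: real abort, do not retry
        # otherwise only busy locks were reported: retry with a new tid

# Usage with the MemoryNode sketch: compare-and-swap 8 bytes at address 128.
node = MemoryNode(1024)
items = ([(128, 8, bytes(8))], [], [(128, 8, b"new-val!")])
print(exec_and_commit([node], lambda p: items))   # True: memory started all zeros
```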
----
### Optimized two-phase commit
* If a minitransaction has just one participant, it can be executed in a single phase.
* Applications try to keep a file's content (cached data blocks) and its inode (metadata) on the same memory node.
* Memory nodes need not store read-only minitransactions in the redo-log.
----
### Recovery Coordinator
* Recovery is executed by a dedicated Sinfonia management node
* It periodically probes the *in-doubt list* of each memory node
* on a timeout, it starts recovery for the stalled minitransaction
----
### Recovery from a participant crash
* Synchronize the disk image by replaying the redo-log (sketch below)
* only minitransactions that committed, according to the *decided list*
* If a minitransaction is undecided (in the *redo-log* but not in the *decided list*), the recovery coordinator asks all participants for their votes
* a participant that has already voted keeps its vote
* one that has not voted yet votes abort and places the *tid* in its *forced-abort list*
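A sketch of both steps, continuing the hypothetical `MemoryNode` sketch; the real protocol has more machinery (e.g. to avoid races with a coordinator that is merely slow).

```python
# Illustrative recovery logic; hypothetical names, not Sinfonia's code.
def replay_redo_log(node, disk_image):
    """After a memory-node crash: re-apply the writes of committed minitransactions."""
    for tid, writes in node.redo_log:
        if node.decided.get(tid) == "commit":
            for (addr, length, data) in writes:
                disk_image[addr:addr + length] = data

def recover_undecided(tid, participants):
    """Settle an in-doubt minitransaction after its coordinator crashed."""
    votes = []
    for p in participants:
        if tid in p.decided:
            votes.append(p.decided[tid])   # outcome already known locally
        elif tid in p.in_doubt:
            votes.append("commit")         # it voted commit and keeps that vote
        else:
            p.forced_abort.add(tid)        # never voted: forced to vote abort, so it
            votes.append("abort")          # can never vote commit for tid later
    outcome = "commit" if all(v == "commit" for v in votes) else "abort"
    for p in participants:
        (p.commit if outcome == "commit" else p.abort)(tid)
    return outcome
```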
----
### Recovery from a whole-system crash
* When it reboots, each memory node sends a reboot notification to the management node
* The management node waits until it has heard from the other memory nodes
* Memory nodes then exchange their votes on recent minitransactions to settle their outcomes (sketch below)
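A sketch of that settling step, continuing the hypothetical `MemoryNode` sketch: a recent minitransaction commits only if every participating memory node had voted commit (i.e. has it in its redo-log). The data structures here are assumptions, not the paper's exact scheme.

```python
# Illustrative outcome settling after a whole-system crash.
def settle_after_reboot(recent, memory_nodes):
    """recent: {tid: [indices of participating memory nodes]}."""
    outcomes = {}
    for tid, participant_ids in recent.items():
        votes = []
        for i in participant_ids:
            node = memory_nodes[i]
            voted_commit = any(t == tid for t, _ in node.redo_log)
            votes.append("commit" if voted_commit else "abort")
        outcomes[tid] = "commit" if all(v == "commit" for v in votes) else "abort"
    return outcomes
```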
----
### Log garbage collection
* A committed minitransaction can be garbage collected only when it has been applied to the disk image of every memory node involved
* its entries in the *in-doubt list*, *decided list*, and *redo-log* are then removed
* The *forced-abort list* is garbage collected based on epoch numbers or coordinator termination
* the current epoch is kept by each memory node
----
### Conclusion on recovery
* Recovery on a node is triggered in three cases
* a timeout on an undecided minitransaction
* a memory node reboots
* a gap is discovered when receiving a COMMIT
---
## Applications
* SinfoniaFS
* Group Communication Service
----
### Cluster File System
* Sinfonia simplifies the design:
* cluster nodes need not coordinate or negotiate with each other
* cluster nodes need not keep journals or track what is cached at remote nodes
* it leverages Sinfonia's write-ahead redo-log for performance
* Data block numbers are pairs of a memory node id and an offset within that node (see the sketch below)
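A small sketch of that block-number scheme; the field names and block size are assumptions for illustration.

```python
# Illustrative SinfoniaFS-style block addressing: a block number is the pair
# (memory node id, offset within that node). Not the paper's actual code.
from typing import NamedTuple

BLOCK_SIZE = 4096   # assumed block size for the sketch

class BlockNumber(NamedTuple):
    mem_id: int     # which memory node holds the block
    offset: int     # byte offset of the block within that node's address space

def block_read_item(bn: BlockNumber):
    """Translate a block number into a minitransaction read item (mem-id, addr, len)."""
    return (bn.mem_id, bn.offset, BLOCK_SIZE)

# Usage: reading block 7 of memory node 2 becomes a single read item.
print(block_read_item(BlockNumber(mem_id=2, offset=7 * BLOCK_SIZE)))
```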
----
### Caching
* Cluster nodes can cache arbitrary amounts of data and metadata
* superblock -- static information about the entire volume
* free-block bitmap -- which data blocks are in use
* chaining-list blocks -- where the data blocks comprising a file are located
* Cached items are validated before being used
* via *compare items* in the minitransaction (see the sketch below)
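A sketch of that validation pattern, reusing the earlier hypothetical `Minitransaction` sketch: the cached bytes become a compare item, so the minitransaction commits only if the cache is still fresh. The field layout is made up.

```python
# Illustrative cache validation in a SinfoniaFS-style client.
def write_with_validation(mem_id, inode_addr, cached_inode, new_inode, block_addr, block):
    t = Minitransaction()
    t.cmp(mem_id, inode_addr, cached_inode)   # validate cached metadata against memory
    t.write(mem_id, inode_addr, new_inode)    # e.g. update file length / block pointers
    t.write(mem_id, block_addr, block)        # write the new data block
    return t                                  # a failed compare means the cache was stale:
                                              # refetch the inode and retry
```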
----
### Contention & load balancing
* Sinfonia does not balance load across memory nodes, but it exposes the load of each memory node
* Each cluster node keeps a preferred memory node for the new data it writes
* if another cluster node allocates the chosen inode or block first, it picks a new one at random (sketch below)
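A sketch of that retry-at-random allocation; the helper names are assumptions, not SinfoniaFS's actual code.

```python
# Illustrative allocation retry under contention: if another cluster node grabbed a
# candidate first (the allocating minitransaction's compare item fails), try another
# candidate chosen at random.
import random

def allocate(candidates, try_allocate):
    """candidates: blocks believed free from the cached free-block bitmap.
    try_allocate(block): runs a minitransaction whose compare item checks the
    bitmap bit; returns True on commit, False if someone else allocated it."""
    pool = list(candidates)
    random.shuffle(pool)
    for block in pool:                 # retry with random candidates on contention
        if try_allocate(block):
            return block
    return None                        # all candidates taken: refresh the cached bitmap
```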
---
### Group Communication Service
* The current set of members is called the latest view
* View-change messages indicate members joining or leaving
----
### Design
* Each member keeps a tail pointer indicating the latest message it has received
* conceptually, the members behave as if there were a single global queue with a global tail
* For example
* to join, a member acquires a global lease on the metadata and uses a minitransaction that combines *compare and write items* (sketch below)
* Many small messages are combined (batched) as an optimization
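A sketch of such a join, reusing the earlier hypothetical `Minitransaction` sketch; it omits the global lease mentioned above, and the metadata layout is made up.

```python
# Illustrative group-join combining compare and write items in one minitransaction.
def build_join(mem_id, view_addr, cached_view, new_view):
    t = Minitransaction()
    t.cmp(mem_id, view_addr, cached_view)   # view metadata unchanged since it was read
    t.write(mem_id, view_addr, new_view)    # install the new view including the joiner
    return t                                # if a concurrent membership change won the
                                            # race, the compare fails and the join retries
```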
---
## Evaluation
* Base performance
* Scalability
* Performance under contention
----
### Base Performance
<img src="https://i.imgur.com/QD8we6l.png" width="450">
* Modes LOG and LOG-REPL log minitransactions synchronously to disk, so they are comparable to Berkeley DB; the other modes use non-volatile RAM instead
* REPL means each memory node is replicated (not a read-eval-print loop)
* Throughput peaks at 6,500-7,500 minitransactions per second, limited by the CPU
----
### Base Performance (Cont.)
<img src="https://i.imgur.com/YEUt53J.png" width="450">
----
### Scalability
* Up to 246 machines and 32 threads issuing minitransactions as fast as possible
<img src="https://i.imgur.com/ixQhIXe.png" width="450">
----
### With contention
* Varied the probability that two minitransactions overlap
<img src="https://i.imgur.com/HvVHWjR.png" width="450">
---
### Conclusion
* Sinfonia uses fine-grained address spaces to increase scalability and performance
* Sinfonia does not cache data for applications
* caching is left to applications, giving them flexibility in caching policy
* applications know their data better
* Sinfonia does not balance load across memory nodes
* this is also left to applications
* but Sinfonia provides load information
{"metaMigratedAt":"2023-06-16T12:53:28.929Z","metaMigratedFrom":"YAML","title":"Sinfonia - a new paradigm for distributed systems","breaks":true,"slideOptions":"{\"theme\":\"solarized\",\"transition\":\"fade\"}","contributors":"[{\"id\":\"bd34bb29-6393-4de1-a879-0655a63bb8b4\",\"add\":13479,\"del\":5999}]"}