filesystem
fuse
I received a copy of this dissertation and the author was kind enough to send code as well.
note: I have agreed not to share either publicly, but can share the paper with the development team. I am also not disclosing the author's name publicly without permission.
Team members can find the paper here.
The author has implemented a distributed filesystem (in Rust!) that implements the crdt-tree algorithm and mounts via FUSE.
I have read the paper in great detail and have also built, run, and briefly reviewed the code.
Of most interest/relevance to me are 1, 2, and 9:
Let's say we wanted to use strategy #1 above in the SAFE filesystem, i.e. encode a 16-bit (or other size) replica ID in each u64 ino. This approach could simplify our design because there would be no need to map local inos to network/tree identifiers and perform extra lookups.
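For illustration, here is a minimal sketch of what that encoding might look like. The 16/48-bit split, the function names, and the PHP representation are my own assumptions, not anything taken from the paper or our codebase:

    // Hypothetical sketch: pack a 16-bit replica id into the top 16 bits of a u64 ino.
    // Assumes the per-replica counter never exceeds 48 bits. PHP ints are signed
    // 64-bit, so replica ids >= 0x8000 yield negative-looking inos, but the bit
    // pattern (and the decode below) is still correct.
    function encode_ino(int $replica_id, int $counter): int {
        return ($replica_id << 48) | ($counter & 0xFFFFFFFFFFFF);
    }

    function decode_ino(int $ino): array {
        return [
            'replica_id' => ($ino >> 48) & 0xFFFF,
            'counter'    => $ino & 0xFFFFFFFFFFFF,
        ];
    }

    // example usage
    $ino = encode_ino(0x00A7, 12345);
    print_r(decode_ino($ino)); // replica_id => 167, counter => 12345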
Still, there are some questions/issues around that:
Some benefits of our current design:
Even so, this method is certainly worth thinking about and may even be worth prototyping for comparison.
The crdt-tree algo requires the ability to undo and re-apply each operation. In a crdt-tree, there is only one type of operation: Move.
The paper identifies the following filesystem write operations and a corresponding inverse op for each:
Operation | Inverse Op |
---|---|
mkdir | rmdir |
rmdir | mkdir |
create | unlink |
unlink | create |
rename | rename (with source and target swapped) |
write | N/A |
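To make the table concrete, here is a rough sketch of how each filesystem-level op could be paired with its inverse. The array-based op representation and the inverse_op() name are illustrative assumptions on my part, not something taken from the paper:

    // Hypothetical sketch: compute the inverse of a filesystem-level op,
    // mirroring the table above. Op shape and field names are assumed.
    function inverse_op(array $op): ?array {
        switch ($op['type']) {
            case 'mkdir':  return ['type' => 'rmdir',  'path' => $op['path']];
            case 'rmdir':  return ['type' => 'mkdir',  'path' => $op['path']];
            case 'create': return ['type' => 'unlink', 'path' => $op['path']];
            case 'unlink': return ['type' => 'create', 'path' => $op['path']];
            case 'rename': return ['type' => 'rename', 'from' => $op['to'], 'to' => $op['from']];
            case 'write':  return null;   // no inverse, per the table
        }
        return null;
    }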
In the paper, these Ops are sent across the wire. When a new op is received, any later ops have their inverse applied in timestamp-decreasing order, then the new op is applied, and then the later ops are re-applied.
Doing things this way enables each replica to know what the actual filesystem-level operation is, rather than the lower-level crdt-tree operations.
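As a rough sketch of that undo/apply/redo step, assume each op carries a Lamport-style timestamp and can produce its own inverse; the apply_remote_op()/inverse() names and the $log structure below are my own, not from the paper or filesystem.php:

    // Hypothetical sketch of applying a remote op that arrived out of order.
    function apply_remote_op(array &$log, object $new_op, callable $apply): void {
        // 1. undo ops with a later timestamp than the incoming op, newest first
        $later = array_filter($log, fn($op) => $op->timestamp > $new_op->timestamp);
        usort($later, fn($a, $b) => $b->timestamp <=> $a->timestamp);
        foreach ($later as $op) {
            $apply($op->inverse());
        }
        // 2. apply the newly received op in its causal position
        $apply($new_op);
        $log[] = $new_op;
        // 3. re-apply the undone ops, oldest first
        foreach (array_reverse($later) as $op) {
            $apply($op);
        }
    }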
I'm not yet certain what advantages this might bring in practice over our current design. My filesystem.php demonstrates that identical filesystem state can be achieved across replicas using only the single OpMove primitive (but see "realization" below).
If a single high-level op corresponded to many low-level ops, then I can see there being a network savings in transmitting only the high-level ops. To quantify, here's a count of low-level ops per high-level op in the filesystem.php prototype:
high-level op | low-level ops | notes |
---|---|---|
init | 3 | init is no-op in paper |
mkdir | 1 | |
rmdir | 1 | |
rename | 1 | |
mknod (create) | 2 | 2 needed for hard-link support |
write | 1 | possibly 0 in final impl |
link | 1 | |
unlink | 1 or 2 | 2 needed for final unlink (common case) |
symlink | 1 | |
Realization! (found design flaw)
My filesystem.php prototype does NOT demonstrate identical filesystem state across replicas. It only demonstrates identical tree state. For example, here is the mkdir() function:
    public function mkdir(int $parent_ino, string $name) {
        $r = $this->replica;
        $ops = [];
        // 1. find parent_id from parent_ino.
        $parent_id = $this->ino_to_tree_id($parent_ino);
        // 2. find parent dir (under /root/)
        $inode_entry = $this->tree_find($parent_id);
        // 3. create tree node under /root/../parent_id
        $fim = new fs_inode_meta($name, inode_kind::directory);
        $ops[] = new op_move($r->tick(), $parent_id, $fim, $new_inode_id = new_id());
        // 4. create/add fs_inode_local
        $ino = $this->add_fs_inode_local($new_inode_id, $fim->kind);
        $r->apply_ops($ops);
        return $ino;
    }
The call to $this->add_fs_inode_local() modifies local state.
The call to $r->apply_ops() modifies tree state, that will be replicated across the network.
But the modified local state will not be replicated. The remote replica, receiving only a generic OpMove, has no way to understand that it was a mkdir() filesystem Op, and thus doesn't know to call $this->add_fs_inode_local(). So the local inode entry for the directory never gets created.
Like so many things, this is obvious in retrospect. A complete replication test case of the filesystem API would have caught this, but that was still on my todo list. This is why I write quick/dirty prototypes: to prove out design ideas, find flaws as early as possible, and iterate on solutions.
As I see it, there are two solutions we could consider:
Likely we will end up going with (1), but it's worth thinking more about whether (2) is possible.
Update
I added a test case demonstrating that the 2nd replica does not build the needed local state. code
Here is the output:
$ php filesystem.php test_fs_replicas
== Fail! replica1 and replica2 filesystems do not match. ==
------- fs state after: created /home/bob. (replica1 state) -------
- null => forest
- 1000 => {"name":"root","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1003 => {"name":"home","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1004 => {"name":"bob","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1001 => {"name":"fileinodes","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1002 => {"name":"trash","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
ino --> fs_inode_local:
3 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
4 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
5 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
uuid --> fs_inode_local:
1000 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
1003 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
1004 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
------- end state -------
------- fs state after: created /home/bob. (replica2 state) -------
- null => forest
- 1000 => {"name":"root","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1003 => {"name":"home","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1004 => {"name":"bob","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1001 => {"name":"fileinodes","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
- 1002 => {"name":"trash","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
ino --> fs_inode_local:
uuid --> fs_inode_local:
------- end state -------
It was neat to see this working in practice and I think we can leverage it for caching files locally if we wish.
Storing all files in a flat directory list (as in the paper) should be given consideration. We could use a hashed tree structure instead, though that does potentially increase lookup time.
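As an illustration of the hashed-tree idea, here is a small sketch that buckets stored files into subdirectories by hash prefix instead of one flat directory. The two-level, two-hex-character layout and the shard_path() name are assumptions of mine, not anything from the paper:

    // Hypothetical sketch: shard file storage by hash prefix so no single
    // directory grows unbounded (at most 256 entries per bucket level).
    function shard_path(string $base_dir, string $file_id): string {
        $h = sha1($file_id);
        return sprintf('%s/%s/%s/%s', $base_dir, substr($h, 0, 2), substr($h, 2, 2), $h);
    }

    // example usage
    $path = shard_path('/var/safe/files', '1004');
    @mkdir(dirname($path), 0755, true);   // create the bucket dirs on demand
    file_put_contents($path, 'file content...');

The trade-off is an extra directory traversal or two per lookup, which is the lookup-time cost mentioned above.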
Regarding storing lots of files in a single dir, see:
Basically: