Notes on PhD Dissertation: A Highly Available Distributed Filesystem.

tags: filesystem fuse

I received a copy of this dissertation and the author was kind enough to send code as well.

note: I have agreed not to share either publicly, but can share the paper with the development team. I am also not disclosing the author's name publicly without permission.

Team members can find the paper here.

The author has implemented a distributed filesystem (in Rust!) that uses the crdt-tree algorithm and mounts via FUSE.

I have read the paper in great detail and have also built, run, and briefly reviewed the code.

My notes, items of interest

Of most interest/relevance to me are items 1, 2, and 9:

  1. Replica node IDs are 16 bits and are included in the u64 inode ID, making inodes globally unique, provided each replica ID is unique. That leaves (only?) 48 bits per replica for allocating inode IDs.
  2. Operations sent across the network are filesystem (FUSE) centric, i.e. enum SyncOp{ Mkdir, Rmdir, Create, Unlink, Rename, Write }. A consequence of this is that to undo operations, every op must have an equivalent undo (except write, apparently). Also, ino IDs are the same on all replicas. In our design, I was envisioning the Tree as a generic network type that supports only a single Move operation, so all local ops would be reduced to that before hitting the wire, and ino IDs would be local-only.
  3. Does NOT yet implement symlinks, hardlinks, locking, permissions, or file attributes, but all are (somewhat) contemplated in the design.
  4. Uses the ZeroMQ Radio/Dish pattern for broadcast/gossip. Previously offline nodes request/sync changes from peers. The SAFE stack should have us covered here.
  5. Stores all files in a flat list beneath the mount point of the underlying filesystem, each file named with its ino ID. I wonder about the performance of that with large filesystems; I was always taught to break a large set of files into sub-dirs / a tree.
  6. Runs in 2 threads: 1 for processing local FUSE ops, 1 for receiving network ops from ZeroMQ.
  7. Shares an SQLite DB between the 2 threads, with 2 locks. The op log is stored in the DB, along with inode/tree metadata.
  8. Files are stored outside the tree crdt, so file writes do not share crdt properties. Atomicity of writes is (somehow) offered for 512-byte blocks. I don't fully understand yet how that works.
  9. Briefly discusses a method of reducing log size by merging timestamp-adjacent writes by the same replica. We should be able to use that idea (see the sketch after this list).
  10. Discusses a possible improvement to only undo/redo operations that will affect the outcome. Gives some examples that I haven't fully grokked yet.
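
Regarding note 9, here is a minimal sketch of how that write-merging might look, assuming "timestamp-adjacent" means consecutive entries in the op log and that only byte-contiguous writes get merged; the array shape and field names are mine, not from the paper:

    <?php
    // Hypothetical sketch: collapse runs of consecutive write ops from the
    // same replica on the same ino into one write covering the combined range.
    // Each op here is an array like:
    //   ['kind' => 'write', 'replica' => 7, 'ino' => 42, 'offset' => 0, 'data' => '...']
    function merge_adjacent_writes(array $log): array {
        $out = [];
        foreach ($log as $op) {
            $i = count($out) - 1;
            if ($i >= 0
                && $op['kind'] === 'write' && $out[$i]['kind'] === 'write'
                && $op['replica'] === $out[$i]['replica']
                && $op['ino'] === $out[$i]['ino']
                && $op['offset'] === $out[$i]['offset'] + strlen($out[$i]['data'])) {
                // extend the previous write instead of keeping a second log entry
                $out[$i]['data'] .= $op['data'];
            } else {
                $out[] = $op;
            }
        }
        return $out;
    }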

Some analysis

Use of u64 ino identifier

Let's say we wanted to use strategy #1 above in the SAFE filesystem, i.e. encode a 16-bit (or other size) replica ID in each u64 ino. This approach could simplify our design because there would be no need to map local inos to network/tree identifiers and perform extra lookups.
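
To make that concrete, here is a minimal sketch of the 16/48 packing, assuming the replica ID goes in the high bits; the constant and helper names are mine, not from the paper or our code:

    <?php
    // Hypothetical 16/48 split of a u64 ino: high 16 bits = replica ID,
    // low 48 bits = a per-replica counter. Note PHP ints are signed 64-bit,
    // so inos with a replica ID >= 0x8000 come out negative, but remain unique.
    const REPLICA_BITS = 16;
    const COUNTER_BITS = 48;
    const COUNTER_MASK = (1 << COUNTER_BITS) - 1;

    function ino_pack(int $replica_id, int $counter): int {
        return ($replica_id << COUNTER_BITS) | ($counter & COUNTER_MASK);
    }

    function ino_replica(int $ino): int {
        return ($ino >> COUNTER_BITS) & ((1 << REPLICA_BITS) - 1);
    }

    function ino_counter(int $ino): int {
        return $ino & COUNTER_MASK;
    }

    // e.g. replica 7 allocating its 3rd inode:
    $ino = ino_pack(7, 3);
    printf("ino=%d replica=%d counter=%d\n", $ino, ino_replica($ino), ino_counter($ino));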

Still, there are some questions/issues around that:

  1. How can we generate unique replica (actor) IDs that do not collide, without coordination? My first thought is: obtain the current list of replicas, then pick a random number from the unused set (sketched after this list). This can fail, especially as the unused set gets smaller, but maybe that's not a big problem while the total number of replicas is small.
  2. Is 2^16 (65,536) replicas enough? It's plenty for Elders/vaults in a section, but when we consider that every Client is also a replica when it mounts the filesystem read/write, then maybe it's not crazy to imagine some popular content being mounted 100,000's of times.
  3. Is 2^48 inode handles enough? Well, 281,474,976,710,656 is still a lot of directories + files, so probably OK for now.
  4. Maybe there is a more optimal (for our use-case) split than 16/48?
  5. Point (2) above is interesting because it forced me to think about mounting read/write vs read-only. For GET type operations, there is no need to write, so the actor won't contribute to potential collisions, etc. The more general takeaway is that the code should support both read-only and read/write modes, both for the SAFE API and the FUSE mount.
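
On question 1, a minimal sketch of picking a random unused replica ID, assuming we can fetch the set of IDs already in use (the function name is mine). Two new replicas picking concurrently could still collide, so some conflict-detection/retry step would be needed on top:

    <?php
    // Hypothetical: choose a 16-bit replica (actor) ID at random from the
    // IDs not currently in use.
    function pick_replica_id(array $used_ids): int {
        $used = array_flip($used_ids);   // O(1) membership checks
        if (count($used) >= (1 << 16)) {
            throw new RuntimeException("replica ID space exhausted");
        }
        do {
            $candidate = random_int(0, (1 << 16) - 1);
        } while (isset($used[$candidate]));
        return $candidate;
    }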

Some benefits of our current design:

  • Easy to create node IDs as UUIDs without collision.
  • The ino/uuid mapping provides a place to store local inode data that doesn't need to be shared, such as ino, ref_count, and link_count.

Even so, this method is certainly worth thinking about and may even be worth prototyping for comparison.

Inverse Operations

The crdt-tree algo requires the ability to undo and re-apply each operation. In a crdt-tree, there is only one type of operation: Move.
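
Roughly, undo works because each log entry for a Move remembers where the child was before the move (or that it didn't exist), so the op can be reversed. A simplified sketch, with class and method names loosely following filesystem.php rather than the paper's code:

    <?php
    // Simplified sketch of undoing one Move op in the crdt-tree algorithm.
    // The log entry records the child's previous location; undo restores it.
    // The tree method names here are illustrative.
    class log_op_move {
        public $timestamp;
        public $parent_id;   // parent the child was moved *to*
        public $metadata;
        public $child_id;
        public $old_state;   // [old_parent_id, old_metadata], or null if the
                             // child did not exist before this op
    }

    function undo_move($tree, log_op_move $entry): void {
        if ($entry->old_state === null) {
            // the op attached a brand-new node, so undo removes it
            $tree->remove($entry->child_id);
        } else {
            // otherwise move the child back under its previous parent
            [$old_parent_id, $old_meta] = $entry->old_state;
            $tree->add_node($entry->child_id, $old_parent_id, $old_meta);
        }
    }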

The paper identifies the following filesystem write operations and a corresponding inverse op for each:

Operation   Inverse Op
---------   ----------
mkdir       rmdir
rmdir       mkdir
create      unlink
unlink      create
rename      rename (with source and target swapped)
write       N/A

In the paper, these Ops are sent across the wire. When a new op is received, any later ops have their inverse applied in timestamp-decreasing order, then the new op is applied, and then the later ops are re-applied.

Doing things this way enables each replica to know what the actual filesystem-level operation was, rather than only the lower-level crdt-tree operations.
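
The undo/apply/redo shuffle itself is the same whether the ops are fs-level (as in the paper) or tree-level Moves (as in filesystem.php). A rough sketch, where op_timestamp(), undo_op(), do_op() and redo_op() are stand-ins for whatever the real replica provides:

    <?php
    // Rough sketch of applying a remote op: undo everything newer, apply the
    // new op, then redo the undone ops on top. $log holds already-applied
    // ops, oldest first; the helper functions are placeholders.
    function apply_remote_op(array &$log, $new_op): void {
        // 1. undo ops with a later timestamp, newest first
        $undone = [];
        while (!empty($log) && op_timestamp(end($log)) > op_timestamp($new_op)) {
            $op = array_pop($log);
            undo_op($op);
            $undone[] = $op;
        }

        // 2. apply the new op in its correct timestamp position
        do_op($new_op);
        $log[] = $new_op;

        // 3. redo the undone ops, oldest first
        foreach (array_reverse($undone) as $op) {
            redo_op($op);
            $log[] = $op;
        }
    }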

I'm not yet certain what advantages this might bring in practice over our current design. My filesystem.php demonstrates that identical filesystem state can be achieved across replicas using only the single OpMove primitive (but see "realization" below).

If a single high-level op corresponded to many low-level ops, then I can see where there would be network savings in transmitting only the high-level ops. To quantify, here's a count of low-level ops per high-level op in the filesystem.php prototype:

high-level op    low-level ops   notes
-------------    -------------   -----
init             3               init is a no-op in the paper
mkdir            1
rmdir            1
rename           1
mknod (create)   2               2 needed for hard-link support
write            1               possibly 0 in final impl
link             1
unlink           1 or 2          2 needed for final unlink (common case)
symlink          1

realization! (found design flaw)

My filesystem.php prototype does NOT demonstrate identical filesystem state across replicas. It only demonstrates identical tree state. For example, here is the mkdir() function:

    public function mkdir(int $parent_ino, string $name) {

        $r = $this->replica;
        $ops = [];

        // 1. find parent_id from parent_ino.
        $parent_id = $this->ino_to_tree_id($parent_ino);

        // 2. find parent dir (under /root/)
        $inode_entry = $this->tree_find($parent_id);

        // 3. create tree node under /root/../parent_id
        $fim = new fs_inode_meta($name, inode_kind::directory);
        $ops[] = new op_move($r->tick(), $parent_id, $fim, $new_inode_id = new_id() );

        // 4. create/add fs_inode_local
        $ino = $this->add_fs_inode_local($new_inode_id, $fim->kind);

        $r->apply_ops($ops);

        return $ino;
    }

The call to $this->add_fs_inode_local() modifies local state.

The call to $r->apply_ops() modifies tree state, that will be replicated across the network.

But the modified local state will not be replicated. The remote replica, receiving only a generic OpMove, has no way to understand that it was a mkdir() filesystem Op, and thus doesn't know to call $this->add_fs_inode_local(). So the local inode entry for the directory never gets created.

Like so many things, this is obvious in retrospect. A complete replication test case of the filesystem API would have caught this, but that was still on my todo list. This is why I write quick/dirty prototypes: to prove out design ideas, find flaws as early as possible, and iterate on solutions.

Anyway, as I see it, there are two solutions we could think about:

  1. Transmit higher-level ops, as in the paper.
  2. Eliminate the concept of local-only state, so that all state is shared state within the tree.

Likely we will end up going with (1), but it's worth thinking more about whether (2) is possible.
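
As a thought experiment for (1), one possible shape is to wrap the locally generated op_move list in a filesystem-level op, so the receiving replica knows which local state to rebuild. This is purely a sketch; the fs_op class and its fields are hypothetical, while op_move, apply_ops() and add_fs_inode_local() are the names from filesystem.php:

    <?php
    // Hypothetical wrapper: an fs-level op that carries both the tree ops and
    // enough fs-level context for the remote replica to rebuild local state.
    class fs_op {
        public function __construct(
            public string $kind,      // 'mkdir', 'rmdir', 'create', 'rename', ...
            public array  $tree_ops,  // the op_move list produced locally
            public array  $args = []  // extra fs-level info, e.g. new tree id + kind
        ) {}
    }

    // Receiving side, roughly:
    function apply_remote_fs_op($fs, fs_op $op): void {
        // apply the tree ops exactly as the prototype already does...
        $fs->replica->apply_ops($op->tree_ops);

        // ...then rebuild the local-only state a bare OpMove cannot convey.
        switch ($op->kind) {
            case 'mkdir':
            case 'create':
                $fs->add_fs_inode_local($op->args['new_inode_id'], $op->args['kind']);
                break;
            // other kinds handled similarly
        }
    }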

Update

I added a test case that demonstrates that the 2nd replica does not build the needed local state. code

Here is output:

$ php filesystem.php test_fs_replicas

== Fail!  replica1 and replica2 filesystems do not match. ==


------- fs state after: created /home/bob.  (replica1 state) -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
    - 1003 => {"name":"home","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
      - 1004 => {"name":"bob","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
  - 1002 => {"name":"trash","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}

ino --> fs_inode_local:
3 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
4 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
5 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}

uuid --> fs_inode_local:
1000 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
1003 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
1004 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
------- end state -------


------- fs state after: created /home/bob.  (replica2 state) -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
    - 1003 => {"name":"home","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
      - 1004 => {"name":"bob","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}
  - 1002 => {"name":"trash","size":0,"ctime":1597885743,"mtime":1597885743,"kind":"dir"}

ino --> fs_inode_local:

uuid --> fs_inode_local:
------- end state -------

File storage in the underlying filesystem beneath the mount point.

It was neat to see this working in practice and I think we can leverage it for caching files locally if we wish.

Storing all files in a flat directory list (as in the paper) should be given consideration. We could use a hashed tree structure instead, though that does potentially increase lookup time.
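
For example, a minimal sketch of a hashed layout (the path scheme and helper name are mine): shard by the first two hex pairs of a hash of the ino, so each directory stays small.

    <?php
    // Hypothetical: store file content at <data_dir>/ab/cd/<ino> instead of
    // one flat directory, where "ab" and "cd" are the first two hex pairs of
    // a hash of the ino. 256 * 256 = 65,536 buckets keeps each directory
    // small even with hundreds of millions of files.
    function ino_storage_path(string $data_dir, int $ino): string {
        $h = hash('sha256', (string)$ino);
        return sprintf('%s/%s/%s/%d', $data_dir, substr($h, 0, 2), substr($h, 2, 2), $ino);
    }

    $path = ino_storage_path('/var/lib/safe/fileinodes', 123456789);
    @mkdir(dirname($path), 0755, true);   // create the shard dirs if missing

The extra lookup cost is just computing the hash and walking two extra path components, which seems a reasonable trade against one enormous directory.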

Regarding storing lots of files in a single dir, see:

https://serverfault.com/questions/98235/how-many-files-in-a-directory-is-too-many-downloading-data-from-net

Basically:

  • Performance differs by filesystem.
  • ext3/ext4/NTFS should be OK.
  • Even if the filesystem can handle a large number of files, common commands like 'ls' become slow to unusable. (Though that doesn't affect our use-case, except if a user tries to ls the mount-point dir after unmounting.)