Contributor: RealBigMickey
GitHub repository
rota1001
This project is so cool!
My original thinking was:

The cache section earlier mentioned that uploading a relatively large file like an .mp4 and opening it immediately causes an error. Ideally, a file should never be unopenable just because it hasn't finished uploading. For users, I think paying a few GB of disk space for instant file reads and writes is a good trade, so the cache can be larger, and write-back can be asynchronous: the user's latest changes are applied locally right away, and the local copy need not be deleted the moment the upload finishes.

After discussing with weiso131, I think it's unnecessary to insist that every file write be copied locally right away, but the temporal locality of freshly written files (they tend to be used again soon) is worth considering. Also, the cache need not use whole files as its unit; sometimes temporarily copying only part of a file is better, because we often read or write only part of a file. So I think cache considerations fit better at the disk read/write layer than at the file system layer: one could write a "diskord" project that presents the file system with an address space where specific content can be written at specific addresses.

Another issue is concurrency among multiple users, which the server side has to solve. For each file (or rather, the file at a given path), the server must track which users are currently using it, and even before updating a local file, a request must determine whether writing that file is allowed.

Lastly, it was mentioned that the Discord API limits requests per unit time, both per bot and per IP. Is the per-IP quota higher than a single bot's? If so, one could still create more bots to reach that quota.
Thanks for the thoughtful review and all the practical points; replies below, one by one.

- Moving the local file into the cache at upload time is indeed a much better user experience; this feature will be updated.
- The smallest unit of operation doesn't have to be a file, and modern filesystems indeed make good use of partial reads & writes. Honestly, it wasn't implemented this time because of:
  - lack of time
  - lack of ability

  I wanted to build a prototype first, validate the system's theory, and then extend its features. The next feature to add should indeed be partial reads and writes. But partial reads and writes still hit the download problem: even if we predict the data about to be used and cache it in advance, suddenly jumping to another part of the file incurs a huge delay. Avoiding that means serving the partial file first while fetching the complete file in the background, which still burns many API calls.
- On multi-user concurrency: this is indeed one of the server-side goals. The design anticipates concurrency being added later; the Discord API and the server are kept separate, and the PostgreSQL database won't break under multithreading.
- The per-IP limit is quite a bit higher than a single bot's; for details see https://discord.com/developers/docs/topics/rate-limits
RealBigMickey
EricccTaiwan
This project is so cool!
Andrea Righi mentioned in a talk, as a future idea, that the kernel might someday introduce a mechanism called fs_ext, following the design philosophy of sched_ext, as an extension layer for file systems that would let FUSE stop depending on the traditional VFS path. Developers may already be experimenting with such an architecture; worth keeping an eye on!

Thanks for sharing this; it looks very interesting, and I'll dig into it over the summer break!
RealBigMickey
Andrushika
This project is so cool!
May I ask what the considerations were for implementing this with a closure table rather than another data structure? (Besides the name being cool, as mentioned in the video)

A closure table moves a node or lists a directory in $O(1)$ queries, better than Ltree's $O(\log n)$, while an adjacency list querying a complete subtree takes $O(n)$ and is slightly worse.
RealBigMickey
MikazukiHikari
This project is so cool!
May I ask whether this project avoids the Linux kernel's built-in virtio-fs functionality entirely?
If not, does it require kernel build configuration (make menuconfig) or loading a specific kernel module? You can run grep -i virtio /boot/config-$(uname -r) to check whether virtio-fs is supported.

DISFS doesn't touch virtio-fs at all; it's a pure user-space FUSE 3 file system. As long as your kernel has CONFIG_FUSE_FS (which most environments enable by default) it runs: no make menuconfig rebuild is needed, and no virtio-related modules have to be loaded.
RealBigMickey
weiso131
This project is so cool!
The table on slide 14 lists the look-up time complexity as $O(1)$ for all four data structures, but look-up complexity should depend on the depth of the target, not the total node count. Doesn't writing $O(1)$ fail the definition of big-$O$?

You're right. In practice, though, file system hierarchies rarely exceed a depth of about 30 levels, so $d$ is treated as a constant here and the $d$ in $O(d)$ is folded into the constant term of $O(1)$.
To be mathematically rigorous it should be written as $O(d)$, but the two are equivalent in engineering terms.
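As a small worked step (assuming the depth bound above), the collapse from $O(d)$ to $O(1)$ is just:

$$T_{\text{lookup}}(d) \le c \cdot d \le 30c = O(1) \quad \text{when } d \le 30.$$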
RealBigMickey
Studying s3fs: exploring how it works
Amazon S3 (Simple Storage Service) is an object storage service. It stores files (called "objects") in containers called "buckets". It's NOT a filesystem in the traditional sense; it's more like a massive hash map of key-value pairs.
e.g.

| Path | S3 Key |
|---|---|
| /mydir/file.txt | mydir/file.txt |
Objects are immutable, that is to say partial reads and writes AREN'T available; reuploading the whole object is necessary, unless chunked uploads are done manually. Each bucket IS the top-level container in S3; there are no such things as "directories" or file hierarchies.
s3fs lets you mount an S3 bucket as a local file system using FUSE. That is to say, s3fs fakes a filesystem interface over S3 by translating file operations into S3 API calls, through smart use of FUSE (Filesystem in Userspace).
e.g.:

- open(), read(), write() become GET/PUT/HEAD HTTP requests
- mkdir() creates an empty object to fake a directory
- rename() would copy + delete (since S3 doesn't support renames)

The focus here isn't the setup procedure, so I'll quickly gloss over the process.
Step 1: Credentials
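Presumably something like the standard s3fs-fuse setup (the key values are placeholders):

```shell
# Store "ACCESS_KEY_ID:SECRET_ACCESS_KEY" where only the owner can read it
echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
```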
Step 2: Mounting the bucket
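Again following s3fs-fuse's documented usage (bucket name and mountpoint are placeholders):

```shell
mkdir -p /mnt/s3
s3fs mybucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
```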
That's it, use it like any other filesystem:
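For instance (ordinary file operations, nothing s3fs-specific):

```shell
cp report.pdf /mnt/s3/
ls -l /mnt/s3
```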
Unmount with:
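```shell
fusermount -u /mnt/s3   # or: umount /mnt/s3
```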
Being a network-based file system/hack, it's slow, clunky and not really ideal for storage. Here are some reasons I've organized:
s3fs-fuse turns an S3 bucket into a mountable POSIX-like file system. VERY similar to what DISFS aims to do, but on a different cloud platform, with a LOT more polish and professionalism. It's the high bar this project aims to reach one day. (Though less cool, in my opinion)
Reading "To FUSE or Not to FUSE: Performance of User-Space File Systems"
Reading "Extension Framework for File Systems in User space", exploring how eBPF can improve FUSE file system performance by caching inodes and dentries to reduce metadata access latency
Analyzing the performance improvements revealed in "FUSE Hooks Up With IO_uring For Greater Performance Potential In Linux 6.14"
This paper discusses the tool Stackfs, a passthrough file system that allows for simpler benchmarking under various configurations.
- Metadata operations (LOOKUP, FORGET) are noticeably more expensive with FUSE
- fusectl exposes real-time stats

This paper introduces EXTFUSE, a kernel extension framework that allows user-space file systems to register "thin" request handlers in kernel space for faster path operations.
Basically, context switches are costly for FUSE, so the solution is to move some of the simpler logic from user space into kernel space while keeping complex logic in user space, trying to get the best of both worlds.
It provides a simple pathway for developers to decide which operations should live in user space and which in kernel space. For example, a thin in-kernel handler could serve cached attributes while the complex logic stays in the daemon:
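A hypothetical sketch of the idea (the names below are made up; EXTFUSE's real interface is eBPF-based and differs in detail):

```c
#include <stdint.h>
#include <errno.h>
#include <sys/stat.h>

/* In-kernel attribute cache entry; populated by the user-space daemon. */
struct attr_cache_entry {
    uint64_t    nodeid;
    struct stat st;
};

/* Hypothetical map lookup, standing in for an eBPF map access. */
extern struct attr_cache_entry *cache_lookup(uint64_t nodeid);

/* "Thin" handler: answer GETATTR from the cache when possible,
 * otherwise fall back to the user-space daemon. */
int fast_getattr(uint64_t nodeid, struct stat *out)
{
    struct attr_cache_entry *e = cache_lookup(nodeid);
    if (!e)
        return -EAGAIN;  /* miss: forward to user space as usual */
    *out = e->st;        /* hit: reply without any context switch */
    return 0;
}
```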
Merged in Linux 6.14, this patch allows FUSE to use IORING_OP_URING_CMD, letting multiple requests be processed with less syscall overhead: committing the old request and fetching the new one are combined into a single syscall, and the queues gain CPU/NUMA affinity.
Benchmarks show reduced CPU usage and improved throughput, especially on parallel jobs.
DISFS is a FUSE-based file system backed by a Python HTTP server and the Discord API. Properly implementing these optimizations is difficult due to:
For example, for a shell command that ends up calling read():

There are multiple context switches, memory copies, and user-kernel round trips just for a single read.
Most importantly, the biggest bottleneck is caching efficiency and server-side download/upload times. It doesn't matter how fast the context switches are if a download is necessary for that operation.
To benchmark properly would require big modifications to the current code; for now I'll call the new hypothetical version DISFS-lite.
Use strace, perf, bpftrace, or manual hooks via log timestamps for analysis.
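For instance (standard tools, nothing DISFS-specific):

```shell
# Count syscalls and where time goes for a single read through the mount
strace -c -f cat mnt/file.txt

# Coarse CPU/time statistics for the same operation
perf stat -- cat mnt/file.txt
```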
This is the most dangerous part to implement, since realistically every look-up has to send a request to the server to get the most up-to-date version of files and queries.
Thus caching or handling lookups locally isn't really possible or viable unless it's guaranteed that files on the server are read-only, or that only one system is logged in to one account at any time.
DISFS would benefit from:

- a stripped-down DISFS-lite version
- measuring server round-trip latency (ping)

But metadata correctness and cloud-sync consistency remain the biggest constraint, and no kernel-side optimization will fix that without cache-coherence strategies.
Refer to implementations like rhizofs. Please note: to keep the project flexible for wider application, developers should avoid code and libraries released under the GNU GPL at the user level (GNU LGPL-licensed code is acceptable).
For an MVP, I plan to implement the following commands:
FUSE maintains five queues for (1) interrupts, (2) forgets, (3) pending, (4) processing, and (5) background. The kernel module manages the complexity of request queues, prioritization, and fairness.
So when using FUSE for user-space filesystems, we don't need to manage these queues manually, and can focus on implementing our custom operations (e.g., read, write, mkdir, getattr) as separate callback functions in the FUSE daemon, as sketched below.
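A minimal sketch of that callback style, assuming libfuse 3 (only getattr is filled in; DISFS's real table registers more operations):

```c
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* The kernel module handles queuing/dispatch; we only supply callbacks. */
static int disfs_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        return 0;
    }
    return -ENOENT;
}

static const struct fuse_operations disfs_ops = {
    .getattr = disfs_getattr,
    /* .read, .write, .mkdir, ... are registered the same way */
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &disfs_ops, NULL);
}
```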
Note: This project is actually a revision of another project from 2024 -> Discord Clouding v0.1 review
The name is actually an acronym, it stands for:
I was inspired by the recursive nature of GNU's name; it left a lasting impression on me the first time I heard it:
Fuse flow:
DISFS is a user-space file system built using FUSE, designed to store file data not on a disk, but in the cloud—via the Discord platform. It mimics a traditional POSIX file system interface, but under the hood, every file operation is translated into HTTP requests, and files are split into chunks and uploaded as Discord attachments.
This project explores the boundaries of what can be considered a "file system." Instead of using conventional storage like ext4 or S3, DISFS uses Discord + HTTP + PostgreSQL to simulate a real, persistent file system. Despite Discord being unsuited for this task, DISFS presents a working prototype that supports file creation, reading/writing, directory structure, login, and caching.
The goal isn’t performance or practicality, but design realism in an absurd context. DISFS is a sandbox to practice real system architecture: FUSE integration, stateless protocol design, caching strategies, user login, and metadata handling through closure tables. It's a full-stack engineering experiment from the kernel interface down to cloud object storage—just not the kind you'd expect.
The Linux kernel is split into 2 different spaces, user space and kernel space:
Why the separation? With great power comes great responsibility, and giving all programs full access to all resources is irresponsible, just like how you wouldn't give your home key to anyone on the street. With the split we can keep order in our systems, reduce cybersecurity threats, and isolate bad code so it can't crash the whole system.
Applications run in user space; if one wants to modify or access any files, it must do so through syscalls (system calls).
Syscalls communicate with the VFS (Virtual File System). The VFS then talks to the actual kernel-implemented file systems (ext4, xfs, etc.).
The VFS provides a uniform interface to different backends.
Kernel modules (like ext4) are complex and dangerous to debug (a bug may crash the system).
Commonly, core file system logic consists mostly of data structures + pathname resolution. These don't require detailed hardware manipulation, so implementing such functions at the kernel level is like working with your hands tied behind your back.
Programmer Miklos Szeredi created FUSE to let developers safely implement file systems in user space without touching the kernel.
FUSE, an acronym for Filesystem in USErspace, allows user-space daemons to implement file system logic.
The kernel's FUSE module handles bridging VFS ↔ your daemon via /dev/fuse.
Yes. Still a real FS.
It doesn't matter where the core logic runs. FUSE bridges the VFS, while the daemon talks to the network and handles commands. If it correctly interprets file system commands and acts like a file system, it's still considered a file system: one implemented in user space via FUSE.
For creating a web storage device, I chose the platform Discord. It's a social platform that most would be all too familiar with. The reason it can be used as backend cloud storage is the following characteristics.
Messages can have attachments (< 10 MiB each) attached to them.
Attachments (for now) don’t expire.
Attachment downloads occur via HTTPS.
No storage cap per channel.
User-friendly API.
Completely free.
However, that is not to say that there aren't any drawbacks:
URLs have hidden tokens, which expire every 24h.
There are per-bot and per-IP API limits; as the project scales up there may be a hard cap on what CAN be done.
Note: Discord Bot data limits
Discord URLs are refreshed every 24 hours; one must obtain 3 keys and construct the correct URL before being able to fetch files (ex = expire time, is = valid time, hm = key). See the schematic URL after this list.
Discord API protocols may be changed.
Speed is directly tied to Discord's servers.
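For reference, a CDN attachment link carries the three parameters above; schematically (all values are placeholders):

```
https://cdn.discordapp.com/attachments/<channel_id>/<message_id>/<filename>?ex=...&is=...&hm=...
```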
But for the purposes of this project, Discord fits the needs quite well.
It's important to note that there are 2 sides to DISFS:
When a user opens a file with $ cat, for example, multiple commands are called behind the scenes, but to cut the fat I'll just focus on the command open().
This is a very simplified diagram:
The user calls open; the call is intercepted by FUSE, which hands it to the user-space daemon DISFS (basically running its code), and DISFS sends an HTTP request to the server side.
The server side checks whether the request is valid (checking user_id and parameters), then consults its PostgreSQL database to see whether the file exists or not.
Then the message_ids are used to fetch the attachments, which are sent back to DISFS one by one. (Originally it would send complete files all at once after downloading and fetching every file, instead of working one by one.)
DISFS merges and caches the chunks in a hidden local directory; once done, it shows the result to the user, completing a cycle. A sketch of the client half of this cycle follows.
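(All helpers here are hypothetical stand-ins: cache_path() maps a DISFS path into the hidden cache directory, fetch_and_merge() performs the HTTP round trip and chunk merging.)

```c
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

extern void cache_path(const char *path, char *out, size_t len); /* hypothetical */
extern int  cached(const char *local);                           /* hypothetical */
extern int  fetch_and_merge(const char *path, const char *local);/* hypothetical */

static int disfs_open(const char *path, struct fuse_file_info *fi)
{
    char local[4096];

    cache_path(path, local, sizeof(local));
    if (!cached(local) && fetch_and_merge(path, local) != 0)
        return -EIO;              /* server rejected us, or the network died */

    int fd = open(local, O_RDONLY);
    if (fd < 0)
        return -errno;
    fi->fh = fd;                  /* later reads are served from the cache */
    return 0;
}
```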
Compared to other options PostgreSQL is a bit heavier and slower, but it natively supports concurrency and async: features that will prove useful once more workers or threads are added.
Files have LOTS of metadata, but to keep things simple while still preserving some, I kept the following:
i_mode, i_uid, i_gid, i_size, i_atime, i_mtime, i_ctime, i_crtime.
These describe a file relatively well, keeping the data I'd guess the general user would want.
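As an illustration only (the struct and helper below are hypothetical, but the field names mirror the list above), this is roughly how those columns would map onto struct stat in a getattr reply:

```c
#include <sys/stat.h>
#include <time.h>

struct disfs_meta {
    mode_t i_mode;
    uid_t  i_uid;
    gid_t  i_gid;
    off_t  i_size;
    time_t i_atime, i_mtime, i_ctime, i_crtime;
};

static void meta_to_stat(const struct disfs_meta *m, struct stat *st)
{
    st->st_mode  = m->i_mode;
    st->st_uid   = m->i_uid;
    st->st_gid   = m->i_gid;
    st->st_size  = m->i_size;
    st->st_atime = m->i_atime;
    st->st_mtime = m->i_mtime;
    st->st_ctime = m->i_ctime;
    /* i_crtime (creation time) has no slot in the POSIX struct stat */
}
```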
The original schema stored the full file path as a string in a column. Simple to implement, but:

- hierarchical operations like ls, rm -r, etc. degraded into string matching over whole paths

It relied on the B-tree data structure built into PostgreSQL for queries, a design that wouldn't hold as the project expanded.
It's a read-optimized data structure built on a relational table, used mainly for representing hierarchical data. Being read-optimized, inserts/moves/deletes are more costly, but hierarchical queries are blazing fast in return.
A closure table is a table that stores all the paths between all elements in a hierarchical data structure. The table includes two columns for the IDs of the related elements and a third column that represents the distance between them.
— Yusoof Ali
They say a picture is worth a thousand words. A visual example helps clarify how closure tables work.
For a simple hierarchy like this:
The structure would look like:
Each node in the closure table has connections to all of its descendants. Each row has 3 values: (Ancestor, Descendant, Depth)
For node 1 (the root, /):

node_closure:

| Ancestor | Descendant | Depth | Meaning |
|---|---|---|---|
| 1 | 1 | 0 | root is itself |
| 1 | 2 | 1 | root → docs |
| 1 | 3 | 2 | root → docs → report.txt |
| 1 | 4 | 1 | root → pics |

node:

| id | user_id | name | parent_id | type |
|---|---|---|---|---|
| 1 | foo | "/" | NULL | 2 (directory) |
This setup allows for fast descendant queries and keeps the hierarchy intact even as the number of files and directories grows.
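For example, fetching a whole subtree or just a directory's direct children each become a single indexed join. The real server issues these from Python; the strings below are only a sketch of the SQL, with $1 standing for the node id:

```c
/* Everything under node $1 (the whole subtree), any depth: */
static const char *SQL_SUBTREE =
    "SELECT d.* FROM node d "
    "JOIN node_closure c ON d.id = c.descendant "
    "WHERE c.ancestor = $1;";

/* Direct children only (depth = 1), e.g. to answer ls: */
static const char *SQL_CHILDREN =
    "SELECT d.* FROM node d "
    "JOIN node_closure c ON d.id = c.descendant "
    "WHERE c.ancestor = $1 AND c.depth = 1;";
```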
To insert a new node under a parent: add the self-row (new_id, new_id, 0), then copy each of the parent's ancestor rows with depth + 1.
To move a node (and its subtree): unlink the subtree's nodes from their old outside ancestors, then link every node of the subtree to every ancestor of the new parent (see the sketch after this section).

For deletion there are two approaches: cascade delete, or detach/orphan. I chose to use cascade delete, though I might change to detach/orphan to implement a temporary rubbish bin 🗑️ or redo commands down the line…
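A sketch of the maintenance queries (again bare SQL as string constants; $1 is the node being inserted or moved, $2 its parent or new parent):

```c
/* Insert: the self-row plus one row per ancestor of the parent. */
static const char *SQL_INSERT_NODE =
    "INSERT INTO node_closure (ancestor, descendant, depth) "
    "SELECT ancestor, $1, depth + 1 "
    "  FROM node_closure WHERE descendant = $2 "
    "UNION ALL SELECT $1, $1, 0;";

/* Move, step 1: unlink the subtree from its old outside ancestors. */
static const char *SQL_UNLINK_SUBTREE =
    "DELETE FROM node_closure "
    "WHERE descendant IN (SELECT descendant FROM node_closure WHERE ancestor = $1) "
    "  AND ancestor NOT IN (SELECT descendant FROM node_closure WHERE ancestor = $1);";

/* Move, step 2: cross-link the new ancestors with the subtree's nodes. */
static const char *SQL_LINK_SUBTREE =
    "INSERT INTO node_closure (ancestor, descendant, depth) "
    "SELECT p.ancestor, s.descendant, p.depth + s.depth + 1 "
    "  FROM node_closure p, node_closure s "
    " WHERE p.descendant = $2 AND s.ancestor = $1;";
```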
Normally a filesystem wouldn't have a "login" mechanism. But anyone, anywhere should be able to access their files with DISFS, and it shouldn't require CLI args or separate login paths.
Solution: Login as File
A comparison runs on every path: if the entry sits at the root and starts with a period (.), treat it like a command.
do_read detects this and sends a different HTTP request to the server to check whether the user exists.
The server checks the database and returns the result/session. With the session, DISFS is able to access that user's directories. A sketch of the path check follows.
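(The helper name is hypothetical; the rule itself is the one described above.)

```c
#include <string.h>

/* A root-level entry whose name starts with '.' is treated as a
 * command (e.g. a login request) instead of a regular file. */
static int is_command_path(const char *path)
{
    return path[0] == '/' && path[1] == '.' &&
           strchr(path + 1, '/') == NULL;   /* nothing deeper than root */
}
```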
All HTTP requests for this project come in 2 flavors, GET and POST, so a flag is needed to tell them apart. Instead of a variable, I took a page out of rbtree.h from the Linux kernel and used pointer tagging.
Pointers on x86-64 systems don't use the lowest bits for addressing; on 64-bit systems, only 48–52 bits are used. Since allocations are 8-byte aligned, the low 3 bits of a pointer are always 000 in binary, leaving them free to carry a tag.
& 1
masks everything except the LSB, allowing for checks.
Afterwards it's necessary to change LSB back to 0 to fufill 8 byte-alignment. (or it would write into memory it shouldn't)
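(The macro and function names here are made up; the trick is the same one rbtree.h uses for node colors.)

```c
#include <stdint.h>

#define TAG_POST 0x1UL  /* LSB marks a POST request; 0 means GET */

static inline void *tag_post(void *p)
{
    return (void *)((uintptr_t)p | TAG_POST);
}

static inline int is_post(const void *p)
{
    return (uintptr_t)p & TAG_POST;   /* & 1: look at the LSB only */
}

static inline void *untag(void *p)
{
    /* Clear the LSB before dereferencing to restore alignment. */
    return (void *)((uintptr_t)p & ~TAG_POST);
}
```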
Q: Is this practical for this project?
No. Speeds are tied directly to Discord API limits, download and upload speeds, chunking, concurrency and threaded operations (or lack thereof) on the server side, etc.; pointer tagging itself isn't really the bottleneck.
As Discord has a hard limit of 10 MiB per attachment, uploading bigger files requires splitting them into smaller chunks.
Strategy:
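A minimal sketch of fixed-size chunking (upload_chunk() is a hypothetical stand-in for the HTTP upload to the server):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (10 * 1024 * 1024)  /* Discord's 10 MiB attachment cap */

extern int upload_chunk(const void *buf, size_t len, int index); /* hypothetical */

int upload_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) {
        fclose(f);
        return -1;
    }

    size_t n;
    int idx = 0, err = 0;
    while ((n = fread(buf, 1, CHUNK_SIZE, f)) > 0)
        if ((err = upload_chunk(buf, n, idx++)) != 0)
            break;                     /* stop on the first failed chunk */

    free(buf);
    fclose(f);
    return err;
}
```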
In an effort to cut down on API calls, caching was implemented. WHEN to cache is the elephant in the room.
Navigating around or viewing stats with ls, ls -l should be almost instant. But due to Discord being the backbone, files must go through the process of uploading and downloading for writing and reading.
| Operation | release() to finalize | User experience |
|---|---|---|
| echo foo > a | 1~2 seconds | Almost instant |
| mv large.mp4 | tens of seconds | Error on open, works after a while |
Instead of this occurring every time, read requests cache the file, and subsequent reads don't fetch it from the server.
On Wi-Fi (52.4 Mb/s ↓, 14.2 Mb/s ↑, per Google's speed test)
Testing speeds on a 1.3 GB video file:
There are trade-offs to be made; I'm still trying to figure out which way of doing things IS the most advantageous.
Working on the project really gave me insight into how filesystems truly work. Many problems faced along the way were unexpected but quite interesting.
After coding up mkdir on the FUSE end & server end, running the command would result in this error:

On the server's end, the output was:

indicating that the signal received was /stat, yet the command run was mkdir.
REASON FOR THIS:
The kernel call sequence for $ mkdir mnt/dog is as follows:

Before even running mkdir, the kernel checks whether the parent exists (same as other commands), BUT ALSO checks whether the entry itself exists.
In other instances, if the entry didn't exist that meant the action was illegal, but in this case it's the opposite: getattr MUST return the correct error message, -ENOENT, as sketched below.
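(server_stat() is a hypothetical wrapper around the /stat HTTP request mentioned earlier.)

```c
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <sys/stat.h>

extern int server_stat(const char *path, struct stat *st); /* hypothetical: 0 on hit */

static int disfs_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi)
{
    (void)fi;
    if (server_stat(path, st) != 0)
        return -ENOENT;  /* "no such entry": the green light mkdir needs */
    return 0;
}
```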
It's true that in coding we often take too much for granted, but here it's painfully so. As an example, take a command like $ echo "hi" > hello.txt. It writes to hello.txt and seems simple on the surface, but it's not.
The following commands are executed in sequence:
1. GETATTR
2. CREATE (with the O_CREAT flag)
3. OPEN
4. WRITE
5. RELEASE
Each command along the way has functions on FUSE's end + the server's end. Like the famous quote:
All problems in computer science can be solved by another level of indirection
– Butler Lampson
This is true, but it took 5 whole days of work before $ echo was working properly.
Designing a file system that lives in a completely unintended environment like Discord reveals a lot about real-world system constraints. For instance:
- Tools like mv, touch, cat rely on multiple syscalls that expect real-time I/O feedback and consistent metadata.
- The server must validate every request's parameters (user_id, path, chunk) every time. No room for assumptions.

To make performance comparisons more reproducible and rigorous:
- Benchmark ls, cat, mv, mkdir, rm -r with tree sizes from 10–10,000