
Linux Kernel Project: Design and Implementation of a Cloud File System

Contributor: RealBigMickey
GitHub repository

Click here for the project walkthrough video (in Chinese)
Click here for the slides (in Chinese)

Reviewed by rota1001

This project is so cool!

My original thinking was:
The cache section earlier mentions that when a relatively large file such as an .mp4 is uploaded, opening it immediately fails. Ideally there shouldn't be files that can't be opened just because they haven't finished uploading. From the user's perspective, spending a few GB of disk space to get instant file reads and writes is a good trade, so the cache could be larger and write-back could be asynchronous: the user's latest changes are reflected locally right away, and the local copy doesn't have to be deleted the moment the upload finishes.
After discussing with weiso131, I don't think it's necessary to decide outright that every file write must be mirrored locally right away, but the temporal locality of recently written files being likely to be used again is worth taking into account. Also, the cache doesn't have to work at file granularity; sometimes temporarily copying only part of a file is better, because often we only read or write part of it. So I think caching belongs at the disk read/write layer rather than in the file system itself; one could write a "diskord" project so that the file system just sees an address space it can write specific content to.

Then there's the concurrency issue caused by multiple users. That is something the server side has to solve: for each file (or the file at a given path) it should track which users are currently using it, and even before updating the local file it should go through a request to decide whether that file may be written.

Also, it was mentioned that the Discord API has rate limits per unit of time, both per bot and per IP. Is the per-IP quota higher than a single bot's? If so, one could still create more bots to reach that quota.

Thanks for the thoughtful review and the many practical points raised; replies below, one by one.

  • Moving the local file into the cache first when uploading does indeed give a much better user experience; this feature will be added.
  • The smallest unit of operation doesn't have to be a whole file, and modern filesystems do make good use of partial reads & writes. Honestly, the reasons it isn't in this version are:
  1. Not enough time
  2. Not enough skill
    I wanted to build a prototype first to validate the system's design, then extend its features. The next feature to add should indeed be partial reads and writes. But partial reads and writes also run into the download problem: even if the data about to be used is prefetched into the cache, suddenly jumping to another part of the file incurs a huge delay. To avoid that, the partial file could be served first while the complete file is fetched in the background, which still burns a lot of API calls.
  • Regarding multi-user concurrency, that is indeed one of the goals for the server side. The design already anticipates it: the Discord API layer and the server are separated, and the PostgreSQL database won't break under multiple threads.
  • The per-IP limit is considerably higher than a single bot's; see https://discord.com/developers/docs/topics/rate-limits for details.
    RealBigMickey

Reviewed by EricccTaiwan

This project is so cool!

In an earlier talk, Andrea Righi mentioned as a future idea that the kernel might gain a mechanism called fs_ext, following the design philosophy of sched_ext, as an extension layer for file systems so that FUSE no longer depends on the traditional VFS path. There may already be developers experimenting with such an architecture; worth keeping an eye on!

Thanks for the pointer, it looks very interesting; I'll dig into it over the summer break!
RealBigMickey

Reviewed by Andrushika

This project is so cool!

May I ask what the considerations were for using a closure table rather than some other data structure (besides the cool name mentioned in the video)?

A closure table takes O(1) time to move a node or list a directory, which beats Ltree's O(k), while an adjacency list takes O(k·log n) to query a complete subtree, which is slightly worse.
RealBigMickey

Reviewed by MikazukiHikari

This project is so cool!

May I ask whether this project avoids the Linux kernel's built-in virtio-fs functionality entirely?
If not, does it require kernel configuration (make menuconfig) or loading specific kernel modules? You can check whether virtio-fs is supported with grep -i virtio /boot/config-$(uname -r).

DISFS doesn't touch virtio-fs at all; it is a pure user-space FUSE 3 file system. As long as your kernel has CONFIG_FUSE_FS (enabled by default in most environments) it will run; there's no need to re-run make menuconfig or load any virtio-related modules.
RealBigMickey

Reviewed by weiso131

This project is so cool!

The table on slide 14 lists the lookup time complexity as O(log n) for all four data structures, but lookup time should depend on the depth of the target rather than on the total number of nodes. Doesn't writing O(log n) fail to satisfy the definition of big-O?

You're right. However, in practical file systems the depth d is usually at most around 30 levels, so it is treated as a constant here: the d in d·log n is folded into the constant factor, giving O(log n).

To be mathematically rigorous it should be written as O(d·log n), but the two are equivalent from an engineering standpoint.
RealBigMickey

TODO: FUSE fundamentals and s3fs

Study s3fs as a reference and examine how it works

What is s3fs & S3

Amazon S3 (Simple Storage Service) is an object storage service. It stores files (called "objects") in containers called buckets. It's NOT a filesystem in the traditional sense; it's more like a massive hash map of key-value pairs.
e.g.

Path              S3 Key
/mydir/file.txt   mydir/file.txt

Objects are immutable, that is to say partial reads and writes AREN'T available; re-uploading the whole object is necessary unless chunked uploads are done manually. Each bucket IS the top-level container in S3; there are no such things as "directories" or file hierarchies.

e.g. :

s3fs foo-name /mnt/directory -o passwd_file=~/.passwd-s3fs
# This mounts the S3 bucket `foo-name` at `/mnt/directory`

s3fs lets you mount an S3 bucket as a local file system using FUSE. That is to say, s3fs fakes a filesystem interface over S3 by translating file operations into S3 API calls, through smart use of FUSE (Filesystem in Userspace).

e.g. :

  • open(), read(), write() become GET/PUT/HEAD HTTP requests
  • mkdir() creates an empty object to fake a directory
  • rename() becomes copy + delete (since S3 doesn't support renames)

Setup

The focus here isn't the setup procedure, so I'll quickly gloss over the process.

Step 1: Credentials

echo ACCESS_KEY:SECRET_KEY > ~/.passwords
chmod 600 ~/.passwords

Step 2: Mounting the bucket

s3fs foo-name mnt/ -o passwd_file=~/.passwords

That's it, use it like any other filesystem:

cd mnt/
touch file.txt
echo "hello" > file.txt
ls -la

Unmount with:

fusermount -u mnt/

Limitations

Being a network-based file system/hack, it's:

  • Not fast – every file op is a network call
  • Not safe for DBs – no file locking, no atomicity
  • Slow Renames - requires copy + delete
  • No partial writes and uploads - requires reupload (though multipart helps)

Why S3 anyway?

It's slow, clunky, and not really ideal for storage, so why use it?
Here are some reasons I've organized:

  • Massive existing ecosystem
    • Lots of cloud apps already support s3 natively
  • Pay per byte, relatively cheap
  • Data is secure, hard to lose
    • Able to survive entire data center failures (inherent redundancy and replication strategies)
  • Very simple to use, though features are limited
  • Global coverage, servers across the world

s3fs-fuse turns an S3 bucket into a mountable POSIX-like file system. VERY similar to what DISFS aims to do, but on a different cloud platform, with a LOT more polish and professionalism. It's the high bar this project aims to reach one day. (Though less cool, in my opinion)

TODO: Recent kernel performance improvements

Read "To FUSE or Not to FUSE: Performance of User-Space File Systems"
Read "Extension Framework for File Systems in User space" and explore how eBPF can be used to improve FUSE file system performance by caching inodes and dentries to reduce metadata access latency
Analyze the performance improvements described in "FUSE Hooks Up With IO_uring For Greater Performance Potential In Linux 6.14"

To FUSE or Not to FUSE

This paper talks about the tool Stackfs, a passthrough file system that allows for simpler benchmarking under various configurations.

  • A lot of FUSE overhead comes from syscall round-trips and context switches: kernel and user space must communicate for even minor operations
  • Adding things like writeback cache + multithreading + splice can close the performance gap to Ext4 to ~5%
  • Metadata-heavy workloads (e.g., LOOKUP, FORGET) are noticeably more expensive with FUSE
  • Used queue-level latency histograms, instrumentation hooks, and fusectl for real-time stats

GitHub link

Extension Framework for File Systems in User space

This paper introduces EXTFUSE, a kernel extension framework that allows user-space file systems to register “thin” request handlers in kernel space for faster path operations.

Basically, context switches are costly for FUSE, so one solution is to move some of the simpler logic from user space into kernel space while keeping the complex logic in user space, trying to get the best of both worlds.

It provides a simple pathway for developers to decide which operations should run in user space and which in kernel space.

e.g.

extfuse_register_handler(FUSE_OP_LOOKUP, &my_lookup_handler)

GitHub link

FUSE Hooks Up With IO_uring For Greater Performance Potential In Linux 6.14

Merged in Linux 6.14, this work lets FUSE use IORING_OP_URING_CMD, allowing multiple requests to be processed with less syscall overhead: committing the previous result and fetching the next request are combined into a single operation, and the queues get CPU/NUMA affinity.


Benchmarks show reduced CPU usage and improved throughput, especially on parallel jobs.

How would I apply These to DISFS

DISFS is a FUSE-based file system backed by a Python HTTP server and Discord API. To properly implement these is difficult due to:

  • A network-bound back end
  • Chunked and stateless data interface
  • FUSE syscall overhead for both data and metadata

For example, for a read issued from a shell command:

→ syscall into kernel
→ kernel asks FUSE daemon via /dev/fuse
→ FUSE daemon reads request, does something (e.g., HTTP, Discord API)
→ FUSE daemon replies
→ kernel returns data to user

There are multiple context switches, memory copies, and user-kernel round-trips just for a single read.

Most importantly, the biggest bottlenecks are caching efficiency and server-side download/upload times. It doesn't matter how fast the context switches are if the operation has to wait on a download.

Benchmarking with stackfs

Benchmarking properly would require big modifications to the current code; for now I'll call this hypothetical new version DISFS-lite.

  • Add blocking for background operations until complete
  • Communication between server and FUSE to know when uploading/downloading officially completes
  • Improve caching logic
  • If wishing to go REALLY further and beyond, tracing of:
    • Kernel FUSE queue delay
    • HTTP request duration
    • Discord API round-trip
    • Local cache lookup/miss

Use strace, perf, bpftrace, or manual hooks via log timestamps for analysis.

Integrate EXTFUSE Kernel Handlers & io_uring

This is the most delicate part to implement, since realistically every lookup has to send a request to the server to get the most up-to-date view of files and metadata.

Thus caching or handling lookups locally isn't really viable unless it's guaranteed that the files on the server are read-only, or that only one system is logged in to an account at any time.

DISFS would benefit from:

  • Actual performance metrics through a DISFS-lite version
  • MAYBE carefully scope EXTFUSE fast-paths (e.g. local .ping)
  • io_uring integration in the FUSE backend to batch requests

But metadata correctness and cloud sync consistency remain the biggest constraint, and no kernel-side optimization will fix that without cache coherence strategies.

TODO: Designing the cloud file system

Refer to implementations such as rhizofs. Note: to keep the project flexible, avoid code and libraries released under the GNU GPL at the user level (the GNU LGPL is acceptable).

Minimum Viable Product (MVP)

For an MVP, I plan to implement the following commands:

  • getattr
    • Used for retrieving metadata
  • readdir
    • Needed for listing directory contents
  • open
    • Required for reading and writing
  • read
    • Essential for viewing and copying
  • mkdir
    • Allowing creation of new directories
  • rmdir
    • Allowing removal of empty directories
  • create
    • Needed to create new files
  • unlink
    • Essential for deleting files
  • write
    • Needed to actually write data to files
  • rename
    • Needed for mv

FUSE maintains five queues for (1) interrupts, (2) forgets, (3) pending, (4) processing, and (5) background requests. The kernel module manages the complexity of request queuing, prioritization, and fairness.

So when using FUSE for a user-space filesystem we don't need to manage these queues manually, and can focus on implementing our custom operations (e.g., read, write, mkdir, getattr) as separate callback functions in the FUSE daemon; a minimal sketch of what that wiring looks like follows below.
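
To make the callback idea concrete, here is a minimal FUSE 3 sketch of how such an operations table could be wired up. The do_* handler names are placeholders for illustration and their implementations are omitted; they are not the exact symbols used in DISFS.

#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>

/* Each MVP operation becomes one callback; the kernel's FUSE module
 * handles queuing and dispatch, we only fill in this table. */
static const struct fuse_operations disfs_ops = {
    .getattr = do_getattr,   /* metadata (stat) */
    .readdir = do_readdir,   /* directory listings */
    .open    = do_open,      /* open for reading/writing */
    .read    = do_read,      /* read file contents */
    .mkdir   = do_mkdir,     /* create directories */
    .rmdir   = do_rmdir,     /* remove empty directories */
    .create  = do_create,    /* create new files */
    .unlink  = do_unlink,    /* delete files */
    .write   = do_write,     /* write data to files */
    .rename  = do_rename,    /* needed for mv */
};

int main(int argc, char *argv[])
{
    /* fuse_main() mounts the filesystem and runs the request loop. */
    return fuse_main(argc, argv, &disfs_ops, NULL);
}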

DISFS Overview

Note: This project is actually a revision of another project from 2024 -> Discord Clouding v0.1 review

What does DISFS mean?

The name is actually an acronym, it stands for:

  • “DISFS Is a Service File System”

I was inspired by the recursive nature of GNU's name; it left a lasting impression on me the first time I heard it:

  • Like “GNU's Not Unix!”, DISFS is recursive and humorous.

FUSE flow:


DISFS is a user-space file system built using FUSE, designed to store file data not on a disk, but in the cloud—via the Discord platform. It mimics a traditional POSIX file system interface, but under the hood, every file operation is translated into HTTP requests, and files are split into chunks and uploaded as Discord attachments.

This project explores the boundaries of what can be considered a "file system." Instead of using conventional storage like ext4 or S3, DISFS uses Discord + HTTP + PostgreSQL to simulate a real, persistent file system. Despite Discord being unsuited for this task, DISFS presents a working prototype that supports file creation, reading/writing, directory structure, login, and caching.

The goal isn’t performance or practicality, but design realism in an absurd context. DISFS is a sandbox to practice real system architecture: FUSE integration, stateless protocol design, caching strategies, user login, and metadata handling through closure tables. It's a full-stack engineering experiment from the kernel interface down to cloud object storage—just not the kind you'd expect.

Background Knowledge

Linux User Space vs. Kernel Space


A Linux system is split into two different spaces, user space and kernel space:

  • Kernel-space: Has full access to memory/hardware, runs with highest privilege.
  • User-space: Limited access; must call syscalls to interact with kernel-space.

Why the separation? With great power comes great responsibility, and giving every program full access to all resources would be irresponsible, just like you wouldn't hand your house key to anyone on the street. With the split we can keep order in our systems, reduce security threats, and keep bad code from crashing the whole system.

What Is a File System on Linux?


Applications run in user space; to modify or access any files they must do so through syscalls (system calls).

Syscalls go through the VFS (Virtual File System), and the VFS then talks to the actual kernel-implemented file systems (ext4, xfs, etc.).

The VFS provides a uniform interface to these different backends.

Why Writing File Systems Is Hard

Kernel modules (like ext4) are complex and dangerous to debug (a bug may crash the whole system).

Core file system logic consists mostly of data structures plus pathname resolution. None of this requires detailed manipulation of hardware, so implementing it at the kernel level feels like working with your hands tied behind your back.

Miklos Szeredi created FUSE to let developers safely implement file systems in user space without touching the kernel.

FUSE

FUSE is an acronym for:

  • “Filesystem in USErspace”

Allows user-space daemons to implement file system logic.

The kernel's FUSE module handles bridging the VFS ↔ your daemon via /dev/fuse.

Is a file system in user space even a file system?

Yes, it's still a real FS.
It doesn't matter where the core logic runs: FUSE bridges the VFS, while the daemon talks to the network and handles the requests. If it correctly interprets file system operations and behaves like a file system, it's still considered a file system, even when implemented in user space via FUSE.

Platform of choice

For the cloud-storage backend I chose Discord, a social platform most people are all too familiar with. The reason it can be used as backend cloud storage comes down to the following characteristics.

  1. Messages can have attachments (< 10 MB each) attached to them.

  2. Attachments (for now) don’t expire.

  3. Attachment download occur via HTTPS.

  4. No storage cap per channel.

  5. User friendly API

  6. Completely free.

However, that is not to say that there aren't any drawbacks:

  • URLs have hidden tokens, which expire every 24h.

  • There are per-bot and per-IP API limits; as the project scales up there may be a hard cap on what CAN be done

    Note: Discord Bot data limits

  • Discord URLs are refreshed every 24 hours; three keys must be obtained and the correct URL assembled before files can be fetched (ex = expiry time, is = valid time, hm = key)


  • Discord API protocols may change

  • Speed is directly tied to Discord's servers

But for the purposes of this project, Discord fits the needs quite well.

DISFS flow:

It's important to note that there are 2 sides to DISFS:

  • FUSE side
    • acts as the file system
    • sends requests to the server side
  • Server side
    • holds a database of every user and file
    • communicates with the Discord API
    • handles requests from the FUSE side and responds accordingly

When a user opens a file with $ cat, for example, multiple operations are invoked behind the scenes, but to cut the fat I'll just focus on open().

This is a very simplified diagram:

(diagram: user → FUSE kernel module → DISFS daemon → HTTP → server → Discord)

The user calls open, the call is intercepted by FUSE, FUSE hands it to the user-space daemon DISFS (basically running its code), which sends an HTTP request to the server side.

The server checks whether the request is valid (user_id and parameters), then checks its PostgreSQL database to see whether the file exists:

  • exists -> return the message_ids
  • doesn't -> throw an error, return the error to DISFS

The message_ids are then used to fetch the attachments, which are sent back to DISFS one by one. (Originally the server would download and fetch every chunk first and send the complete file all at once instead of working one by one.)

DISFS merges and caches the chunks in a hidden local directory, and once done shows the result to the user, completing a cycle. A rough sketch of the FUSE-side half of this path is shown below.
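
As an illustration only, here is a sketch of what a read handler along these lines might look like. http_get(), string_buf_t, SERVER_IP, URL_MAX, current_user_id, and the /download endpoint mirror pieces that appear elsewhere in this write-up; cache_lookup(), cache_store(), and the resp.len field are assumptions made purely for this sketch, not the actual DISFS code.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical cache helpers: return/produce an fd for the hidden local copy. */
int cache_lookup(int user_id, const char *path);
int cache_store(int user_id, const char *path, const char *data, size_t len);

static int do_read(const char *path, char *buf, size_t size,
                   off_t offset, struct fuse_file_info *fi)
{
    (void) fi;

    /* 1. Serve from the hidden local cache if the file is already there. */
    int fd = cache_lookup(current_user_id, path);
    if (fd < 0) {
        /* 2. Otherwise ask the server, which resolves the chunks via
         *    PostgreSQL and streams them back from Discord. */
        char url[URL_MAX];
        snprintf(url, sizeof(url),
                 "http://" SERVER_IP "/download?user_id=%d&path=%s",
                 current_user_id, path);

        string_buf_t resp = {0};
        if (http_get(url, &resp, NULL) != 0)
            return -EIO;                     /* network or server error */

        fd = cache_store(current_user_id, path, resp.ptr, resp.len);
        free(resp.ptr);
        if (fd < 0)
            return -EIO;
    }

    /* 3. Answer the kernel's read request from the cached copy. */
    ssize_t n = pread(fd, buf, size, offset);
    return (n < 0) ? -errno : (int) n;
}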

Why PostgreSQL?

Compared to other options PostgreSQL is a bit heavier and slower, but it natively supports concurrency and async operation, features that will prove useful once more workers or threads are added.

File metadata

Files carry LOTS of metadata, but to keep things simple while still preserving the essentials I kept the following:
i_mode, i_uid, i_gid, i_size, i_atime, i_mtime, i_ctime, i_crtime.
These describe a file reasonably well, keeping the data a typical user would care about.

Closure table

Problem:

The original schema stored the full file path as a string in a column. Simple to implement, but:

  • Performance dropped as the directory tree grew.
  • It can't efficiently handle ls, rm -r, etc.

It relied on the B-tree data structure built into PostgreSQL for queries, a design that wouldn't hold as the project expanded.

What is a closure table?

It’s a read-optimized structure for representing hierarchical data in a relational table. Being read-optimized, inserts/moves/deletes are more costly, but hierarchical queries are blazing-fast in return.

A closure table is a table that stores all the paths between all elements in a hierarchical data structure. The table includes two columns for the IDs of the related elements and a third column that represents the distance between them.
Yusoof Ali

They say a picture is worth a thousand words. A visual example helps clarify how closure tables work.


Visual example

For a simple hierarchy like this:

A ("/", directory, id=1)
└── B ("docs", directory, id=2)
    └── C ("report.txt", file, id=3)
└── D ("pics", directory, id=4)

In the closure table, each node has connections to all of its descendants; each row has 3 values: (Ancestor, Descendant, Depth).

For node 1: / (root)

node_closure:

Ancestor  Descendant  Depth  Meaning
1         1           0      root is itself
1         2           1      root → docs
1         3           2      root → docs → report.txt
1         4           1      root → pics

node:

id  user_id  name  parent_id  type
1   foo      "/"   NULL       2    ← directory

This setup allows for fast descendant queries and keeps the hierarchy intact even as the number of files and directories grows.

-- Nodes for files & folders
CREATE TABLE IF NOT EXISTS nodes (
  id         SERIAL PRIMARY KEY,
  user_id    INT    NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  name       TEXT   NOT NULL,
  parent_id  INT    REFERENCES nodes(id) ON DELETE CASCADE,  
  type       SMALLINT NOT NULL,  -- 1=file, 2=folder
  i_atime    BIGINT,
  i_mtime    BIGINT,
  i_ctime    BIGINT,
  i_crtime   BIGINT
);

-- Closure table
CREATE TABLE IF NOT EXISTS node_closure (
  ancestor    INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
  descendant  INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
  depth       INT NOT NULL,
  PRIMARY KEY (ancestor, descendant)
);

-- File chunks
CREATE TABLE IF NOT EXISTS file_chunks (
  node_id     INT    NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,  -- must be type=1
  chunk_index INT    NOT NULL,
  chunk_size  INT,
  message_id  BIGINT, 
  PRIMARY KEY (node_id, chunk_index)
);


Inserts, Moves, & Deletes

Inserting ↓

To insert a new node under a parent:

  • Copy all ancestor paths from the parent and extend them to the new node.
  • Add a self-reference (new_id, new_id, 0).

Moving ↓

To move a node (and its subtree):

  • Delete all closure entries where the node or its descendants are involved.
  • Recalculate and insert new paths from the new parent’s ancestors to the moved node and its subtree.

Deleting ↓

Two approaches:

  • Cascade delete: Remove the node and all its descendants.
    • ACTUALLY removing the data
  • Detach/orphan: Remove the node’s links but keep children, possibly reattaching them elsewhere.
    • Can be used to put data in trash.

I chose cascade delete, though I might change to detach/orphan down the line to implement a temporary rubbish bin 🗑️ or redo commands.

Time complexity and comparison between different SQL implementations:

Implementing Closure table

Login System

Normally a filesystem wouldn't have a "login" mechanism, but anyone, anywhere should be able to access their files with DISFS, without requiring CLI args or separate login paths.
Solution: Login as File

Every path is checked: if the directory is at the root and starts with a period (.), it is treated like a command.

do_read detects this and sends a different HTTP request to the server to check whether the user exists.

The server checks the database and returns the result/session. With the session, DISFS is able to access that user's directories.

# Mount at mnt
$ cat mnt/.ping/William
/* This is a login request. */
else if (!logged_in &&
         strncmp(path, CSTR_LEN("/.ping/")) == 0) {
    const char *username = path + sizeof("/.ping/") - 1;

    char url[URL_MAX];
    snprintf(url, sizeof(url),
             "http://" SERVER_IP "/login?user=%s", username);

    string_buf_t resp = {0};
    if (http_get(url, &resp, NULL) == 0) {
        int id;
        char name[32];
        /* %31s leaves room for the terminating '\0' in name[32] */
        if (sscanf(resp.ptr, "%d:%31s", &id, name) == 2) {
            current_user_id = id;
            // making sure '\0' doesn't get overwritten
            memcpy(current_username, name, sizeof(name) - 1);
            logged_in = 1;
            free(resp.ptr);
            LOGMSG("logged in now! :D");
            return snprintf(buf, size, "Logged in as \"%s\"\n", name);
        }
        free(resp.ptr);   /* response didn't match the expected "id:name" form */
    } else {
        return snprintf(buf, size, "Failed to login.\n");
    }
}

A page out of Linux kernel: Pointer Tagging

All HTTP requests for this project come in 2 flavors, GET and POST.

  • GET -> read data, no modification
  • POST -> modify data

Thus a flag is necessary. Instead of a separate variable, I took a page out of rbtree.h from the Linux kernel and used pointer tagging.

Pointers to aligned data on x86-64 don't use their lowest bits, and on 64-bit systems only 48–52 bits are used for addressing anyway.

  • Addresses divisible by 8 end with 000 in binary, e.g.:
    • Decimal 8 → Binary 1000
    • Decimal 16 → Binary 10000
    • Decimal 24 → Binary 11000
  • Thus 3 bits of metadata can be stored in these always-zero bits.
if ((uintptr_t)status & 1) {
    curl_easy_setopt(c, CURLOPT_POST, 1L);  // make it POST
    status = (uint32_t*)((uintptr_t)status & ~1);
}

& 1 masks everything except the LSB, allowing for checks.

Afterwards it's necessary to change LSB back to 0 to fufill 8 byte-alignment. (or it would write into memory it shouldn't)
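
For completeness, a sketch of the other side of the trick: the caller sets the LSB before handing the pointer to the shared request helper, which then strips it as shown above. The send_post()/http_request() names below are hypothetical, used only for illustration.

#include <stdint.h>

/* Assumed shared GET/POST helper; it checks and clears the tag bit as above. */
void http_request(const char *url, const char *body, uint32_t *status);

/* Hypothetical wrapper: mark the request as a POST by tagging the status
 * pointer. `status` must point to data with at least 2-byte alignment,
 * so bit 0 is guaranteed to be unused. */
static void send_post(const char *url, const char *body, uint32_t *status)
{
    uint32_t *tagged = (uint32_t *)((uintptr_t)status | 1);  /* set the LSB */
    http_request(url, body, tagged);
}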

Q: Is this micro-optimization meaningful for this project?

No. Speed is dominated by Discord API limits, download and upload speeds, chunking, and the concurrency (or lack thereof) of threaded operations on the server side, not by how this flag is stored.

Chunking and Caching

As Discord has a hard limit of 10 MiB per attachment, uploading bigger files requires splitting them into smaller chunks (a sketch follows after the strategy list below).

Strategy:

  • Use (user_id, path, chunk) to locate files
  • No open connections
  • Stateless by design → less overhead
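
As an illustration of the chunking step, here is a minimal sketch that splits a local file into 10 MiB pieces keyed by (user_id, path, chunk index). upload_chunk() is a hypothetical helper standing in for the PUT /upload request shown further below; the real DISFS code may differ.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (10 * 1024 * 1024)   /* Discord's 10 MiB attachment cap */

/* Hypothetical helper: uploads one chunk, e.g. via
 * PUT /upload?user_id=...&path=...&chunk=<index>. */
int upload_chunk(int user_id, const char *path,
                 int index, const char *data, size_t len);

/* Split `local` into CHUNK_SIZE pieces keyed by (user_id, path, chunk). */
static int upload_file(int user_id, const char *path, const char *local)
{
    FILE *fp = fopen(local, "rb");
    if (!fp)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) { fclose(fp); return -1; }

    size_t n;
    int index = 0, rc = 0;
    while ((n = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        if (upload_chunk(user_id, path, index++, buf, n) != 0) {
            rc = -1;              /* chunk failed: caller may retry it */
            break;
        }
    }

    free(buf);
    fclose(fp);
    return rc;
}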

In an effort to cut down on API calls, caching was implemented; WHEN to cache is the elephant in the room.

Navigating around or viewing stats with ls or ls -l should be almost instant, but because Discord is the backbone, files must go through uploading and downloading for writes and reads.

Operation      release() to finalize   User experience
echo foo > a   1~2 seconds             Almost instant
mv large.mp4   tens of seconds         Error on open, works after a while

Instead of this happening every time, read requests cache the file, so subsequent reads don't fetch it from the server again.

HTTP Upload/Download Examples

  • Upload:
PUT /upload?user_id=3&path=/foo.txt&chunk=0
→ Discord bot uploads + DB logs it
  • Download:
GET /download?user_id=3&path=/foo.txt
→ Server queries DB → streams N chunks on the fly
  • Tolerant to chunk failures → retryable
  • Keeps POSIX I/O illusion

On Wi-Fi (52.4 Mbps down, 14.2 Mbps up, per Google's speed test),
I tested upload/download speeds with a 1.3 GB video file.


Catch-22

  • Caching gives better user experience, BUT
  • The more caching:
    • More issues with version control
    • More user storage used

There are trade-offs to be made; I'm still figuring out which approach is the most advantageous.

Problems you wouldn't expect

Working on the project gave real insight into how filesystems truly work. Many problems faced along the way were unexpected but quite interesting.

Implementing mkdir

After coding up mkdir on both the FUSE and server ends, running the command resulted in this error:

foo: $ mkdir mnt/dog
mkdir: cannot create directory ‘mnt/dog’: Input/output error

On the server's end, the output was:

[2025-06-24 16:26:39 +0800] [28131] [INFO] 127.0.0.1:58558 GET /stat 1.1 404 0 7109

indicating that the request received was /stat, yet the command run was mkdir.

REASON FOR THIS:

Kernel call sequence for $ mkdir mnt/dog is as follows:

getattr("/")          ⟹ make sure parent exists
getattr("/dog")       ⟹ check that the new entry does *not* exist
mkdir("/dog")         ⟹ actually create it

Before even running mkdir, the kernel checks whether the parent exists (as with other commands) BUT ALSO checks whether the new entry already exists.

In other cases, an entry not existing means the action is illegal, but here it's the opposite: getattr MUST return the correct error code, -ENOENT. A minimal sketch of that case follows.
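
Here is what that getattr behaviour could look like, assuming a hypothetical stat_on_server() helper that queries the server's /stat endpoint and reports its HTTP status; this is an illustrative sketch, not the exact DISFS handler.

#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: asks the server's /stat endpoint, fills `st` on
 * success, and returns the HTTP status code (200, 404, ...). */
int stat_on_server(const char *path, struct stat *st);

static int do_getattr(const char *path, struct stat *st,
                      struct fuse_file_info *fi)
{
    (void) fi;
    memset(st, 0, sizeof(*st));

    int http_status = stat_on_server(path, st);
    if (http_status == 404)
        return -ENOENT;   /* "does not exist": exactly what mkdir wants to see */
    if (http_status != 200)
        return -EIO;      /* anything else is a real error */

    return 0;
}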

More than meets the eye

It's true that in programming we often take too much for granted, and that is painfully so here. As an example:

echo "Hello Discord!" > mnt/hello.txt

Look at this command: it writes to hello.txt and seems simple on the surface, but it's not.
The following operations are executed in sequence:

  1. GETATTR

    • Called first to see if the file exists
    • Sends a request to the server
    • The server runs a QUERY to validate existence
    • It doesn't exist, so -ENOENT is returned
  2. CREATE

    • Called to create the file
    • Calls OPEN with O_CREAT flag
  3. OPEN

    • Once again checks with server to see if file exists
    • It doesn't, creates a file on local disk
    • Calls WRITE to actually write the file
  4. WRITE

    • finds local file and writes to it
    • calls RELEASE to release the file
  5. RELEASE

    • File is closed
    • Clean up file handlers etc.
    • Split file into chunks
    • Wait for chunks to upload through HTTP
    • return to user

Each command along the way has handlers on both the FUSE end and the server end. As the famous quote goes:

All problems in computer science can be solved by another level of indirection

Butler Lampson

This is true, but it took about 5 whole days of work before $ echo was working properly.

Conclusion

Realistic Trade-offs & Lessons Learned

Designing a file system that lives in a completely unintended environment like Discord reveals a lot about real-world system constraints. For instance:

  • POSIX compliance is hard – Even simple commands like mv, touch, cat rely on multiple syscalls that expect real-time I/O feedback and consistent metadata.
  • Rate limits are not just numbers – When your "disk" punishes you for being too fast (via 429s), caching, batching, and retries become design priorities.
  • Stateless is not simple – Stateless HTTP forces full parameterization (e.g., user_id, path, chunk) every time. No room for assumptions.
  • Performance ≠ usability – A system can be "slow but usable" if the feedback is designed well (write-back cache, progress reporting).
  • Toolchain complexity matters – C (FUSE) + Python (server) + PostgreSQL (DB) + Discord (API) = constant context switching. Many bugs were from glue code, not logic.

Future work:

Benchmark Suite (Future Addition)

To make performance comparisons more reproducible and rigorous:

  • ls, cat, mv, mkdir, rm -r with tree sizes from 10–10,000
  • Cache-hit v.s. cache-miss timing
  • Cold vs warm startup latency
  • Comparison against memory FS or tmpfs baseline
  • Rate limit tolerance test (many small writes over time)

Improvements

  • Perhaps instead of the server downloading each chunk of a file and THEN sending everything, it could hand the URLs of every chunk directly to DISFS so it can fetch the chunks itself, effectively doubling the speed.
    • Adding on to this, after fetching the 3 keys and building the correct URL for each chunk, these URLs could be logged in the database. Since they stay the same until expiry, there's no need to spend API requests recreating the same URL repeatedly.
  • Concurrency with more worker threads or bots would definitely help
  • Encryption during logins, perhaps a session-based login system where a key is generated and can be used to log in for a set time period
  • Smarter caching is important: when to cache, when to delete the cache, etc.
    • Though this also raises the problem of version control: cached files and files in the cloud may not be the same version