Linux 核心設計: 檔案系統概念及實作手法

--- tags: LINUX KERNEL, LKI --- # [Linux 核心設計](https://beta.hackfoldr.org/linux/): 檔案系統概念及實作手法 Copyright (**慣C**) 2019, 2020 [宅色夫](http://wiki.csie.ncku.edu.tw/User/jserv) ==[直播錄影 (上)](https://youtu.be/d8ZN5-XTIJM)== ==[直播錄影 (下)](https://youtu.be/BCPe0EICfwI)== ![](https://i.imgur.com/BGOhQmT.png) ## 目標設定 Ken Thompson 在 1969 年進行 UNIX 作業系統的早期實驗時 (UNIX 這名稱甚至還沒出現)，就發展了檔案系統，而且某些特性輾轉由今日我們所見的 Linux 核心所繼承，這樣長達 50 年的關鍵概念，著實值得我們探討。檔案系統是種邏輯概念，抽象了各式實體儲存裝置，後者像是硬碟、網路儲存，或者主記憶體等等，而種邏輯概念包含檔案、目錄、路徑、檔案屬性等面向，並且於 Linux 核心出現前的 UNIX 就以各種物件導向的途徑予以設計並實作。 Linux 核心為了統合各個檔案系統實作，引入一層虛擬檔案系統 ((Virtual File System, VFS)，後者是一組檔案操作的抽象介面，於是依循 VFS 界面開發的任何的檔案系統，可在執行時期掛載到 Linux 核心，在 FUSE (Filesystem in Userspace) 出現後，更給檔案系統開發者極大的彈性，能夠快速且多樣地延展作業系統的特徵，像是你可以透過 FUSE 來掛載 Google Drive 的內容，好似這些遠端的物件就對應於本地端檔案系統中。除了界面和設計的巧妙，著眼於工業強度的 Linux 更是在多款檔案系統提出高效的實作，於是乎，我們可見 scalability 和豐富的裝置驅動程式的支援，成為 Linux 核心勝出的重要因素。本議程也著眼於上述概念及實作手法。 ## "file" 的翻譯引述 "[Cambridge Dictionary](https://dictionary.cambridge.org/dictionary/english/file)": > any of several different types of container used to store papers, letters, and other documents in an ordered way, especially in an office ![image](https://hackmd.io/_uploads/BJod5bkhp.png) 依據 [國家教育研究院雙語詞彙資料庫](http://terms.naer.edu.tw/detail/1278441/?index=9): > 檔案；歸檔 > 1. 由若干個邏輯記錄構成的資訊集合，這些記錄可以是同一種類型的，也可以是不同種類型的。 > 2. 表示在目的、形式和內容上彼此相似的資訊項的集合。 > 3. 儲存文件的磁帶、磁碟、紙帶或一組卡片。若將 "file" 翻譯為「文件」，則偏離其「保存」和「資訊集合」的本意。對於作業系統來說，"file" 著重於 "container" (容器) 的特質。 ## UNIX 第 1 版裡頭的檔案系統 [unix-v1](https://github.com/jserv/unix-v1) 嘗試從 Bell Labs 已歸檔的文件中，透過 OCR (光學字元識別) 將 [Research UNIX](https://en.wikipedia.org/wiki/Research_Unix) 的原始程式碼、技術文件，還有公司內部的技術討論材料，重新予以數位化，並透過 [SimH](http://simh.trailing-edge.com/) 模擬器，讓第一版的 UNIX 作業系統在今日的電腦上運作。 > 精彩的對談錄影: [Brian Kernighan interviews Ken Thompson](https://www.youtube.com/watch?v=EY6q5dv_B-o) (2019 年 5 月 4 日) 摘自 [notes/machine.txt](https://github.com/jserv/unix-v1/blob/master/notes/machine.txt): * Disks: - `rf0` - 1024 blocks, always mounted, has root and swap. - last 64 blocks (32kbyte) allocated to swap by `u0.s` - `u0.s` can setup rudimentary fs on this if built with "cold=1" - rk0 - 4871 blocks * `u0.s` sets up rf0 as: - superblock layout - free storage map contains 128 bytes - inode map contains 64 bytes - allocate top 64 blocks "to unix" (960..1023) - allocate 17 blocks per process (nproc) (688..960 - 16 procs) - free blocks 687..34 - zero out blocks 33..1 for use as inodes (and 2nd half of superblock which is unused?) - for each special inode i-1 to i-40 initialize it - mode 100017 (allocated, read, write, other read, othe write) - num links = 1 - uid = 1 (yup, not root) - fill in timestamp info - write out premade i-nodes on i-41 .. i-47 with premade directory data. * i-41 root (with dev, bin, etc, usr, tmp) * i-42 dev (with tty, ppt, mem, rf0, rk0, tap0 .. tap7, tty0..tty7, lpr, tty8) * i-43 bin (empty) * i-44 etc (with init) * i-45 usr (empty) * i-46 tmp (empty) * i-47 init (binary, included in `u0.s`) special version that opens `/dev/tap0` and reads files from it, creates them, chmods them and chowns them 對照觀察 [cmd/ls.s](https://github.com/jserv/unix-v1/blob/master/src/cmd/ls.s) 原始程式碼 ```c=343 bit $20,(r4) /executable beq 2f jsr r5,mode; 'x ``` 100000~2~ ```c=350 bit $10,(r4) /read owner beq 2f jsr r5,mode; 'r ``` 010000~2~ ```c=357 bit $4,(r4) /write owner beq 2f jsr r5,mode; 'w ``` 000100~2~ [The PDP11-40 Processor Handbook](https://pdos.csail.mit.edu/6.828/2005/pdp11/) [cmd/init.s](https://github.com/jserv/unix-v1/blob/master/src/cmd/init.s) * 系統呼叫: close, open, exec, chmod, creat, unlink, fork, chdir, wait, seek, write, chown, time chmod / chown 系統呼叫的實作: [pages/e02-07](https://github.com/jserv/unix-v1/blob/master/pages/e02-07) ```c=6 syschmod: / name; mode jsr r0,isown / get the i-node and check user status bit $40000,i.flgs / directory? beq 2f / no bic $60,r2 / su & ex / yes, clear set user id and / executable modes 2: movb r2,i.flgs / move remaining mode to i.flgs br 1f syschown: / name; owner jsr r0,isown / get the i-node and check user status tstb u.uid / super user beq 2f / yes, 2f bit $40,i.flgs / no, set userid on execution? bne 3f / yes error, could create Trojan Horses 2: movb r2,i.uid / no, put the new owners id in the i-node 1: jmp sysret4 3: jmp error ``` ## Universal I/O Model Linux 秉持 UNIX 哲學 "Everything is a file"，後者被 [The Linux Programming Interface](http://man7.org/tlpi/) 一書 (該書可簡稱為 TLPI) 稱作 "Universality of I/O"： > One of the distinguishing features of the UNIX I/O model is the concept of universality of I/O. This means that the same four system calls—open(), read(), write(), and close()—are used to perform I/O on all types of files, including devices such as terminals. 因此，看似不同種類的 I/O (如輸出到終端機 vs. 寫資料到檔案)，即可在這個一致的抽象層下運作(都可以用檔案操作視之)。裝置驅動程式的目的就是要給出這個抽象層的實作，使得作業系統能夠透過一致的介面去存取。該書第 14 章提到： > A device special file corresponds to a device on the system. Within the kernel, each device type has a corresponding device driver, which handles all I/O requests for the device. A device driver is a unit of kernel code that implements a set of operations that (normally) correspond to input and output actions on an associated piece of hardware. The API provided by device drivers is fixed, and includes operations corresponding to the system calls `open()`, `close()`, `read()`, `write()`, `mmap()`, and `ioctl()`. 這些操作大致上是那些基於 file descriptor 的操作，比如 `open()`, `read()`, `write()`, `lseek()` 等等，換言之，檔案在 UNIX 就是位元組的序列，對檔案進行 I/O 操作，就用前述一致的檔案處理介面來達成。延伸閱讀: [In UNIX Everything is a File](https://hackmd.io/@jkyang/unix-everything-is-a-file) 考慮到 UNIX 的實際行為，應該把 "Everything is a file" 改寫為 "Everything is a file descriptor" 會更恰當。Linux 核心貫徹這理念，實作出若干以 `fd` 結尾的核心機制，如: * [eventfd](http://man7.org/linux/man-pages/man2/eventfd.2.html) * [timerfd](http://man7.org/linux/man-pages/man2/timerfd_settime.2.html) * [signalfd](http://man7.org/linux/man-pages/man2/signalfd.2.html) UNIX 和後續的 BSD (可視為具備血緣關係) 及 (半途殺出的程咬金) Linux 並非真的落實 "Everything is a file"，部分存在例外。 * 在 Linux 提出 signalfd 之前，signal 並非以檔案形式存在，因此我們無法使用 [poll](https://man7.org/linux/man-pages/man2/poll.2.html) 來等待 signal，只能依賴於風險較高的訊 signal handler。 * 在 FreeBSD 引入 pdfork 之前，子行程亦非檔案形式，無法使用 poll 來等待子行程結束，只能透過 wait 系統呼叫。這樣也就無法安全地（無競爭狀態）終止行程。 * Linux 並不支援 FreeBSD 的 pdfork 機制。Linux v5.4 引入 pidfd，這讓我們也能把行程當作檔案來進行等待（poll），但不支援執行緒，參見 [Adding the pidfd abstraction to the kernel](https://lwn.net/Articles/801319/) * 執行緒仍非以檔案形式存在，因此我們無法使用 poll 來等待執行緒結束，只能進行 join。 * Linux 的 `CLONE_FD` 能以檔案的方式管理子行程和執行緒，但該並未全面在 GNU/Linux 的個別元件採納。只有 pipe 和 sockets 真正適合作為可被 poll。[select](https://man7.org/linux/man-pages/man2/select.2.html) 系統本身也是和 Sockets API 同時被 1983 年 4.2BSD 提出。 Bell Labs 在 UNIX 之後，發展出 [Plan 9](https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs) 作業系統，才算是真正的 "Everything is a file"，但沒有太多開發者重視。在 [Plan 9](https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs) 作業系統中，所有的裝置和服務都視作檔案為基礎的操作，例如 telnet, ftp, nfs 等等。 > ==The client process's input/output on virtual files, that appear in other processes' namespace, becomes inter-process communication between the two processes.== This way, Plan 9 generalizes the Unix notion of the filesystem as the central point of access to computing resources. It carries over Unix's idea of device files to provide access to peripheral devices (mice, removable media, etc.) and the possibility to mount filesystems residing on physically distinct filesystems into a hierarchical namespace, but adds the possibility to mount a connection to a server program that speaks a standardized protocol and treat its services as part of the namespace. 行程間的通訊可藉由對虛擬檔案進行 I/O 操作來達成，因此從檔案系統可輕易的存取週邊裝置，也能掛載特定通訊協定 (9P) 的伺服器程式 > All programs that wish to provide services-as-files to other programs speak a unified protocol, called ==[9P](https://en.wikipedia.org/wiki/9P_(protocol))==. Plan 9 儘管在商業上不成功，開發者社群規模也有限，但其精神卻在 Linux 和虛擬化技術發揚光大。節錄 [Virtio: An I/O virtualization framework for Linux](https://developer.ibm.com/articles/l-virtio/) 如下 > Linux is the [hypervisor](https://en.wikipedia.org/wiki/Hypervisor) playground. As my article on Linux as a hypervisor showed, Linux offers a variety of hypervisor solutions with different attributes and advantages. Examples include the Kernel-based Virtual Machine (KVM), lguest, and User-mode Linux. Having these different hypervisor solutions on Linux can tax the operating system based on their independent needs. One of the taxes is virtualization of devices. **Rather than have a variety of device emulation mechanisms (for network, block, and other drivers), virtio provides a common front end for these device emulations to standardize the interface and increase the reuse of code across the platforms.** * virtio 是一個通用的介面，給不同的 device 使用，允許這些不同 device 只要使用特定共用 API 就能在 linux 平台上運作 * 因此省去了每個 device 都需要各自實作 host (如 Linux 核心) / guest (各式 device) 對應的介面的麻煩 * [v9fs](https://www.kernel.org/doc/Documentation/filesystems/9p.txt): Plan 9 Resource Sharing for Linux * [Plan 9: Not (Only) A Better UNIX](https://www.slideshare.net/jserv/plan-9-not-only-a-better-unix) * [9psetup](https://wiki.qemu.org/Documentation/9psetup) * [An Updated Overview of the QEMU Storage Stack](https://events.static.linuxfound.org/slides/2011/linuxcon-japan/lcj2011_hajnoczi.pdf) * [溫柔的教你用 QEMU 開始寫系統軟體](https://hackmd.io/@VIRqdo35SvekIiH4p76B7g/rkVzoomyx) * [測試 Linux 核心的虛擬化環境](https://hackmd.io/@sysprog/linux-virtme) ## File Descriptor 及開啟的檔案透過 `open()`, `read()`, `write()` 等函式進行各種 I/O 操作時，都是以 file descriptor 為對象。而實際上這件事牽扯到 3 個面向： 1. 每個行程自己看到的 file descriptor 2. open file table 3. inode：那個「檔案」真正的 inode ![image](https://hackmd.io/_uploads/rk_olD7AT.png) ## I/O 事件模型 ![](https://i.imgur.com/uXPFiNs.png) > blocking I/O ![](https://i.imgur.com/c77Sx0q.png) > I/O Multiplexing > source: [I/O Multiplexing](http://www.cs.toronto.edu/~krueger/csc209h/f05/lectures/Week11-Select.pdf) - Blocking / Non-blocking I/O：在等待 I/O 資料**準備就緒** (e.g. **有資料可以讀取**) 的過程，使用者層級的行程是否會被 **blocked**。 - Blocking I/O：在等待 I/O 資料準備就緒的過程，使用者層級的行程會被 blocked。 - e.g. 未設定 **O_NONBLOCK** 的 [read](http://man7.org/linux/man-pages/man2/read.2.html), [select](http://man7.org/linux/man-pages/man2/select.2.html), [epoll](http://man7.org/linux/man-pages/man7/epoll.7.html) - Non-blocking I/O：在向核心詢問是否可進行 I/O 操作的過程，不論 I/O 資料是否準備就緒，使用者層級的行程都不會被 blocked。若 I/O 資料尚未準備就緒 (e.g. 尚無資料可讀取)，則會直接返回，並設定 `errno` 指明錯誤 (e.g. `EAGAIN`、`EWOULDBLOCK`)。使用者層級的行程必須重複詢問 (e.g. 在 `while` 迴圈中重複呼叫 [read](http://man7.org/linux/man-pages/man2/read.2.html)) 直到 I/O 資料準備就緒才可進行 I/O operation。 - e.g. 設定 **O_NONBLOCK** 的 [read](http://man7.org/linux/man-pages/man2/read.2.html), `aio_read()`, `aio_write()` - Synchronous / Asynchronous I/O：在從/向核心空間讀取/寫入資料 (i.e. **實際進行 I/O 操作**) 的過程，使用者層級的行程是否會被 **blocked**。 - Synchronous I/O：從/向核心空間讀取/寫入資料的過程，使用者層級的行程會被 blocked - e.g. [read](http://man7.org/linux/man-pages/man2/read.2.html), [recvfrom](http://man7.org/linux/man-pages/man2/recv.2.html) - Asynchronous I/O：從/向核心空間讀取/寫入的過程交由核心內部處理。核心複製完資料後，再通知使用者層級的行程。這中間使用者層級的行程不會被 blocked。 - e.g. `aio_read()`, `aio_write()` 延伸閱讀: [I/O Models 中文解說](https://rickhw.github.io/2019/02/27/ComputerScience/IO-Models/) ### I/O Multiplexing ```c int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout); ``` [select](http://man7.org/linux/man-pages/man2/select.2.html) 搭配 [read](http://man7.org/linux/man-pages/man2/read.2.html) 可實作出 I/O multiplexing 機制，而 I/O multiplexing 其實是 Blocking I/O 的延伸，但 `select()` + `read()` 相較單一 [read](http://man7.org/linux/man-pages/man2/read.2.html) 的優勢是 `select()` 可同時等待多個 `file descriptors` (`fd`)。若 I/O multiplexing 要處理的連線數量過少，可能會因內部處理較複雜，致使更高的延遲。在 4.2BSD 提出 select 系統呼叫時，I/O multiplexing 並非著眼於提高單一連線的速度，而是可在只使用單一行程或執行緒狀況下，監視或處理更多的連線。呼叫 `select()` 時所傳入的是多個 `fd set`，而非只是一個 `fd`。只要 `fd set` 中任一個以上的 `fd` 的 I/O 資料準備就緒，`select()` 就會返回。呼叫 `select()` 的行程必須走訪 `fd set` 來檢查是哪幾個 `fd` 可以進行 I/O 操作，爾後再呼叫 `read()`、`recvfrom()` 讀取資料。此時 I/O 資料應準備就緒，可直接讀取。但也有例外，e.g. `select()` man page 所述： > Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block. `select()` 後的 `read()`、`recvfrom()` 是 Blocking I/O 或是 Non-blocking I/O，取決於設定 **O_NONBLOCK** 與否。將 **read()** 或 **recvfrom()** 設為 **NON_BLOCKING** 才可避免使用者層級的行程再次被 blocked。由於 `select()` 最多只可同時監控 (watch) `1024` 個 `fd`，且當返回時，使用者層級仍要走訪整個 `fd set`，才可知道是哪些 `fd` 的資料已準備好，存在效率疑慮，因此才有後續的 `epoll()`。延伸閱讀: * [高效 Web 伺服器開發](https://hackmd.io/@sysprog/fast-web-server) * [透過 timerfd 處理週期性任務](https://hackmd.io/@sysprog/linux-timerfd) * [signalfd is useless](https://ldpreload.com/blog/signalfd-is-useless) ### Linux Virtual File System 介面 VFS 定義了在實體檔案系統上更高一層的介面，讓應用程式得以透過 VFS 定義好的介面存取底層資料，不用考慮底層是如何實作。有了 VFS，增加擴充新的檔案系統非常容易，只要實作出 VFS 規範的介面。 ![](https://i.imgur.com/UJ9oSGe.png) 聚焦在 VFS 和系統呼叫之間的關聯: ![](https://i.imgur.com/m6eIZJi.png) VFS 有主要幾個物件 `superblock`, `inode`, `dentry`, `file` 等等。這裡先解釋 `inode` 和 `file`: * `inode` 和 `file` 在檔案系統中都代表某個檔案; * `file` 只是在核心執行 `open` 時，為行程建立的資料結構，因此 `open` 執行幾次就會有多少個 `file` (也有可能增加 `file.f_count` 的值)，但是這些 `file` 都會指向同一個實體的 file，也就是 `inode` 取自 Donald E. Porter 教授的 [CSE 506: Operating Systems](http://www.cs.unc.edu/~porter/courses/cse506/) 教材 - [ ] [Virtual File System](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/vfs.pdf) ![](https://i.imgur.com/WUbzLaZ.png) [ Page 4 ] * 不要將 [pseudo](https://en.oxforddictionaries.com/definition/pseudo) 和 [virtual](https://en.oxforddictionaries.com/definition/virtual) 兩個詞彙的意思搞混 * [virtual](https://en.oxforddictionaries.com/definition/virtual) 不是指物質上的實質，是說可透過模仿、模擬而達到某種程度，比如說 "virtual reality" 雖然和現實世界不一樣，但構成現實世界的一些要素。另一個例子是，人與人的交往可以達成，因此是對現實的一種虛擬 (virtual) * [pseudo](https://en.oxforddictionaries.com/definition/pseudo) 則只是像真的，但並沒有成為真的條件。參照 [2011海峽兩岸數學名詞對照研討會](https://epaper.naer.edu.tw/print.php?edm_no=22&content_no=535) * 在 Linux 核心，我們談到 "virtual" file system，是著重於「構成現實的要素」，也就是藉由規範一系列介面操作，「要求」檔案系統該具備若干特性 * 反過來說，"pseudo" file system 顯然「不是真的檔案系統」，其存在是為了特定目標，但不構成「真的條件」 * Wikipedia 的 [pseudo file system 列表](https://en.wikipedia.org/wiki/Category:Pseudo_file_systems_supported_by_the_Linux_kernel) * [Pseudo File systems In Linux](https://medium.com/@jain.sm/pseudo-file-systems-in-linux-5bf67eb6e450) `/tmp` 目錄裡頭的檔案在重開機後就會消失，考量到實體儲存設備比主記憶體慢多了，因此核心發展 tmpfs 這種特製檔案系統。 ![](https://i.imgur.com/uEwvScr.png) [ Page 8 ] 延伸閱讀: [Block Device Scheduling](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/block-scheduling.pdf) Linux 5.1.x 曾被發現 FSTRIM/Discard (一種用來告知 SSD 底層管理回收 FS layer 不要的 block 的機制，避免 block 已全部配置作 Filesystem 對應造成效能大幅降低) 會造成大量資料流失的問題，後來在 Linux 5.1.5 修正: * [dm: make sure to obey max_io_len_target_boundary](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.1.y&id=871e122d55e8d1cd7c0d5dec9bdba1fe45406196) 延伸閱讀: [Linux 5.1.5 Kernel Fixes The Latest Data Corruption Bug](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.1.5-Released) (記得看相關評論) ![](https://i.imgur.com/ARWZ5MA.png) [ Page 11 ] super block, inode, dentry, file 等辭彙在 UNIX 第 1 版原始程式碼就已出現。 ![](https://i.imgur.com/HmmJ6wA.png) > Relationships of major objects in the VFS 參照 [Anatomy of the Linux file system](https://developer.ibm.com/tutorials/l-linux-filesystem/) 得知 Linux 核心程式碼的關聯 ![](https://i.imgur.com/f2HrsbN.png) [ Page 16 ] There is no `delete` system call, only `unlink`. [Why are rmdir and unlink two separate system calls?](https://unix.stackexchange.com/questions/150960/why-are-rmdir-and-unlink-two-separate-system-calls) ![](https://i.imgur.com/VpoPW2Z.png) [ Page 17 ] * "Hard" link (link system call/ln utility): creates a second name for the same file; modifications to either name changes contents. - This is ==not a copy== * Open files create an in-memory reference to a file - If an open file is unlinked, the directory entry is deleted immediately, but the inode and data are retained until all in-memory references are deleted ![](https://i.imgur.com/JVVBSc6.png) [ Page 18 ] * 3 bits for each of User, Group, Other + 3 special bits GE-635 是 1960 年代奇異電氣 (General Electric; GE) 發展的 [GE-600 系列](https://en.wikipedia.org/wiki/GE-600_series)大型主機的一項產品，BASIC 就在達特茅斯學院發展的 Dartmouth 分時作業系統 (DTSS) 上開發出來，當時使用 GE-600 系列的 GE-235 大型主機，而在 1965 年，DTSS 移植到 GE-635 大型主機 (當然也包含 BASIC 語言執行環境)，並長期運作，直到 1999 年才關機。1960 年代，在美蘇冷戰的時空背景，作為 [Project MAC](https://multicians.org/project-mac.html) 的執行者，AT&T, GE 及 MIT 合作開發 Multics 作業系統。在 1960 年代初期，MIT 設計的 CTSS (Compatible Time-Sharing System) 已相當成功，因此 1964 年啟動的 Multics 作業系統專案大幅延伸 CTSS 的成果，Multics 也採用 GE-635 大型主機。值得一書的是，GE-635 採用 36 位元的電腦架構，等等，36 顯然不是 2 的 N 次方，這到底是什麼魔法呢？電腦最初的用途就是為了運算，並可追溯到第二次世界大戰時期，納粹德國、英國，和美國都有電子化計算機的發展。第二次世界大戰期間，美國賓州大學和阿伯丁彈道研究實驗室共同負責為陸軍每日提供 6 張火力表，每張表都要計算幾百條彈道，儘管改進了微分分析儀、聘用 200 多名計算員 (以前 "computer" 一詞是指「人」，後來才改指「電腦」)，一張火力表仍要算兩、三個月。電子計算機發展的迫切需求催生了 ENIAC。彈道研究實驗室發起 ENIAC 電腦，專案進行一年後，當時研究氫彈的 John von Neumann 注意到這台電腦，隨後他所屬的洛斯阿拉莫斯國家實驗室深入參與 ENIAC 專案，以至於首度測試執行是計算氫彈相關資料，而非火力表。輸出資料是 100 萬張卡片。在 1946 年 2 月公布的 ENIAC，作為第一台通用電子和機械的計算機，採用 10 進位的設計，大異於之前存在的德國 Z3 計算機 (1941 年 5 月發表) 和美國 ABC 計算機 (1941 年夏季發表) 兩者所採用的二進位。另外，在大一計算機概論提及的浮點數運算表達標準 IEEE 754 則一直到 1985 年才初步定案，那麼從第二次世界大戰到 1985 年浮點運算標準定案之間的四十餘年，到底人們如何表示數值系統呢？ 36 位元在很長一段時間中被廣泛採納，直到 1970 年代才逐漸式微，為何有這麼特別的數值，是因為過去所謂的「通用」電腦主要還是做科學運算為主、人口普查和商務處理為輔，在相關的運算模型還未充分發展的狀況下，電腦處理器的製造商儘量追求最高的數值處理精準度。考慮到將 36 位元其中 1 個位元挪來表示正負號，其他的 35 位元若全部拿來表示整數的話，那麼可涵蓋正負 343 億的範圍。另外，由於 36 = 6 * 6，當時有個名為 [six-bit character code](https://en.wikipedia.org/wiki/Six-bit_character_code) 的表示法，也就是用 6 個位元表示字元，其中 2 的 6 次方是 64，以 DEC SIXBIT 來說，就把大寫的 A-Z, 0-9, 四則運算和常見符號 (含空白) 列入編碼，這也是為何你在電影見到早期電腦主機的顯示畫面充斥大寫字母，因為 6 位元字元編碼只允許上述組合。再者，由於處理器採用 36 位元的架構，也就是一次可存取 6 個以 six-bit character code 編碼的字元，早期的作業系統甚至限制檔案名稱只能有 6 個字元，以確保檔名可存放在一個 word 中 (中央處理器的資料匯流排寬度，這裡就是說 36 位元)。也就是說，今天我們熟知的 1 byte 等於 8 bits 這樣的「事實」，是逐步演化，就像許多工程規格是種妥協和取捨 (trade-off)，在上述 1960 年代，1 個 byte 則是 6 bits，其中 byte 就是表達單一字元的最小空間。相較於 IEEE 754 浮點數表達法在 1985 年定案，「用 8 位元表達 1 個位元組 (byte)」對應的工業標準則在 [ISO/IEC 2382-1:1993](https://www.iso.org/standard/7229.html) 進行正式文件表述，對，你沒看錯，在 1993 年。要表達涵蓋 0, 1, 2, ..., 7 等 8 個數字，我們需要至少 3 個位元，再對照上述的 six-bit character code，不難發現，6 = 3 + 3，也就是說，像是 DEC SIXBIT 這樣的編碼可用 2 個 8 進位數值表示，就像 16 進位系統和 8 位元表示 byte 的關係。考慮到 UNIX 作業系統的設計受到 Multics 作業系統的啟發，不少原本在 1964 年就在 Multics 存在的特徵，陸續以全新 (或簡化的) 姿態存在於 UNIX 作業系統，其中就包含存取權限。 Multics 作業系統在 GE-645 大型主機上發展，在 GE 後續的機種還引入現在我們所知的「保護模式」，也就是透過處理器切換不同的特權等級 (privilege levels，如 Intel x86 的 protection ring)，實現作業系統核心和使用者層級的執行隔離，至於檔案系統，Multics 的特徵還比後來的 UNIX 作業系統第 1 版複雜多了，是首個實作檔案系統層級 Access Control Lists (ACL) 的作業系統，自然就包含「可讀」、「可寫」，和「可執行」等基本權限。 UNIX 作業系統汲取 Multics 的養分，並且拋開 Multics 眾多笨重且難以在當時中低階大型主機上實作的特徵，諸如動態連結和資料庫 (史上首個商業關聯性資料庫就在 Multics 作業系統上開發)。上述的八進位不僅能表示 rwx (讀-寫-執行) 等排列組合，還能用以表示「使用者」、「同一個群組」，還有「其他人」等身份，組合起來就是典型 UNIX 檔案權限的設定，而作為開發 UNIX 作業系統而生的 C 語言，也就順勢提供八進位的支援，終身只在 Bell Labs 工作的 Dennis Ritchie 在 1972 年到 1973 年間發展了 C 語言及其編譯器，並在 1974 年發表經典論文 "[The UNIX time-sharing system](https://dl.acm.org/citation.cfm?id=361061)"，彼時 UNIX 作業系統已用 C 語言重寫，並在 DEC PDP-11 硬體上驗證，而 PDP-11 硬體採用 16 位元的處理器架構。1970 年代早期設計的 PDP-8 機種定址空間是 12-bit address space，對應的指標當然也是 12-bit 的寬度，恰好可對應 4 個 3 位元表示的八進位數值。於是，UNIX (及其相容的) 作業系統和 C 語言兩者都支援八進位，像是透過 chmod 命令指定檔案存取權限時，用到 `0777` 這樣八進位表示法，但 C 語言沒有內建的二進位表示法。 ![](https://i.imgur.com/gZTXnhV.png) [ Page 19 ] * Setuid bit (一樣在 UNIX 第 1 版就出現) - Mostly relevant for executables: Allows anyone who runs this program to execute with owner’s uid * 延伸閱讀: [setuid](http://man7.org/linux/man-pages/man2/setuid.2.html) 和 [UNIX / Linux: Explains setuid File Permission](https://www.cyberciti.biz/faq/unix-bsd-linux-setuid-file/) ![](https://i.imgur.com/KjMWeQx.png) [ Page 21 ] * Files have a reference count. Why? - Fork also copies the file handles - If your child reads from the handle, it advances your (shared) cursor ![](https://i.imgur.com/Evqx11h.png) [ Page 22 ] * `CLOSE_ON_EXEC` – a bit that prevents file inheritance if a new binary is exec’ed (set by open or fcntl) [Why isn't close_on_exec the default configuration?](https://stackoverflow.com/questions/9583845/why-isnt-close-on-exec-the-default-configuration) > File descriptors open in the calling process image shall remain open in the new process image, except for those whose close-on- exec flag `FD_CLOEXEC` is set. 延伸閱讀: * [fs: add FD_CLOFORK and O_CLOFORK](https://lwn.net/Articles/441931/) ![](https://i.imgur.com/u5OC50c.png) [ Page 36 ] [Mazu-Editor](https://github.com/jserv/mazu-editor): 媽祖程式碼編輯器 - [ ] [VFS, Continued](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/vfs2.pdf) ![](https://i.imgur.com/bXZllEB.png) [ Page 6 ] * 3 Cases for a dentry: - In memory (exists) - Not in memory (doesn’t exist) - Not in memory (on disk/evicted for space or never used) ![](https://i.imgur.com/ROZ0eH2.png) [ Page 7 ] * A hash table (for quick lookup) * A LRU list (for freeing cache space wisely) * A child list of subdirectories (mainly for freeing) * An alias list (to do reverse mapping of inode -> dentries) **[include/linux/fs.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/fs.h#L572)** ```c struct inode { // ... dev_t i_rdev; // ... union { struct pipe_inode_info *i_pipe; struct block_device *i_bdev; struct cdev *i_cdev; char *i_link; unsigned i_dir_seq; }; // ... } __randomize_layout; ``` 在 `inode` 中 `i_rdev` 會存有 device 的 major number 和 minor number。（如果這個 node 是個 device 的話啦）。且根據 device 的類型，存有關於 device 相對應的資料，以 [fibdrv](https://hackmd.io/@sysprog/linux2020-fibdrv) 來說，屬於 char device，所以資料結構是 `struct cdev`。 **[include/linux/cdev.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/cdev.h#L14)** ```c struct cdev { struct kobject kobj; struct module *owner; const struct file_operations *ops; struct list_head list; dev_t dev; unsigned int count; } __randomize_layout; ``` 在 `struct cdev` 中，可見 `struct file_operations` 這個成員，該結構存有關於此 char device 的相關操作，如 `open`, `read`, `write` 等等。這是 Linux 核心常見的物件導向程式設計手法。 > 延伸閱讀: [你所不知道的 C 語言：物件導向程式設計篇](https://hackmd.io/@sysprog/c-oop) 在 `fibdrv.c` 中我們定義了一系列對於此 char device 的操作 ```c const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` 定義完操作後便是將這些操作關聯到相對應的 char device 上，並註冊 char device 到核心中。 ```c static int __init init_fib_dev(void) { // ... fib_cdev = cdev_alloc(); if (fib_cdev == NULL) { printk(KERN_ALERT "Failed to alloc cdev"); rc = -1; goto failed_cdev; } cdev_init(fib_cdev, &fib_fops); rc = cdev_add(fib_cdev, fib_dev, 1); // ... } ``` 最後就是建立 device node，讓使用者可以透過對這個 node 操作，來和 device 互動。 ```c static int __init init_fib_dev(void) { // ... fib_class = class_create(THIS_MODULE, DEV_FIBONACCI_NAME); if (!fib_class) { printk(KERN_ALERT "Failed to create device class"); rc = -3; goto failed_class_create; } if (!device_create(fib_class, NULL, fib_dev, NULL, DEV_FIBONACCI_NAME)) { printk(KERN_ALERT "Failed to create device"); rc = -4; goto failed_device_create; } // ... } ``` ## 依循 VFS 的檔案系統實作 - [ ] [Ext3/4 file systems](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/ext4.pdf) - [ ] [Encrypted File Systems](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/cryptfs.pdf) - [ ] [MezzFS — Mounting object storage in Netflix’s media processing platform](https://medium.com/netflix-techblog/mezzfs-mounting-object-storage-in-netflixs-media-processing-platform-cda01c446ba) - [ ] [Linux 5.10 FUSE To Allow Faster Performance With VirtIO-FS](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.10-FUSE-Improvements) * FUSE (Filesystem in Userspace) 出現後，更給檔案系統開發者極大的彈性，能夠快速且多樣地延展作業系統的特徵，像是你可以透過 FUSE 來掛載 Google Drive 的內容，好似這些遠端的物件就對應於本地端檔案系統中。 * 在過去 VirtIO 中 Host 與 Guest 之間分享檔案是透過 [VirtIO-9P](https://wiki.qemu.org/Documentation/9psetup) 的實作，然而效能不理想。Red Hat 發展 [VirtIO-FS](https://virtio-fs.gitlab.io/)，利用 FUSE 與 VirtIO shared memory 方式, 以 file system as service 形式達到 file sharing 的功能，效能也相較 9p 大幅提升，詳見 [virtio-fs](https://archive.fosdem.org/2020/schedule/event/vai_virtio_fs/attachments/slides/3666/export/events/attachments/vai_virtio_fs/slides/3666/virtio_fs_A_Shared_File_System_for_Virtual_Machines_FOSDEM.pdf) * 過往 FUSE 的效能受限於頻繁的系統呼叫和資料複製，但在 VirtIO-FS 整合後，搭配 DAX (direct access) 模式，不管是 FUSE 抑或是虛擬機器中的作業系統 (guest OS) 即可直接存取 host 端的 page cache，從而縮減資料複製的必要，這項變更預期是 Linux v5.10 的一部分。 - [ ] [vramfs](https://github.com/Overv/vramfs) 是個運用 FUSE 開發的檔案系統，能夠將 video RAM 轉化為儲存裝置，可作為 swap 使用。當然這樣的手段會遇到效能衝擊，若記憶體真的不敷使用，往往需要 swap 或者記憶體壓縮 (如 zram) * [Swap on video RAM](https://wiki.archlinux.org/index.php/Swap_on_video_RAM) - [ ] [simplefs](https://github.com/jserv/simplefs) [Linux 核心設計: Scalability 議題](https://hackmd.io/@sysprog/linux-scalability) 提及一個 Linux scalability 問題：讀取 mount table 時所造成。 ## 待整理 * [Virtual filesystems: why we need them and how they work](http://she-devel.com/ChaikenSCALE2019.pdf) * [article](https://opensource.com/article/19/3/virtual-filesystems-linux) * [Btrfs vs ZFS 實現 snapshot 的差異](https://farseerfc.me/btrfs-vs-zfs-difference-in-implementing-snapshots.html) * [Page cache in Linux kernel](https://www.slideshare.net/AdrianHuang/page-cache-in-linux-kernel) * [Virtual File System](https://www.slideshare.net/AdrianHuang/linux-kernel-virtual-file-system)

Read more

事件驅動伺服器：原理和實例

從 CPU cache coherence 談 Linux spinlock 可擴展能力議題

Linux 核心設計: RCU 同步機制

從登月計劃看電腦軟體發展