CSAPP: Chapter 10

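---
tags: CSAPP, Operating Systems
---

# CSAPP: Chapter 10

:::info
These notes cover only the basic concepts. For detailed examples, refer to the book or the course slides:
* [System-Level I/O](http://www.cs.cmu.edu/afs/cs/academic/class/15213-m19/www/lectures/16-io.pdf)
:::

## UNIX I/O

A Linux file is a sequence of m bytes. In Linux, every I/O device is modeled as a file, so all input and output is performed by reading and writing files. This lets the kernel expose a simple, uniform interface called UNIX I/O:

* [open](https://man7.org/linux/man-pages/man2/open.2.html): asks the kernel to open the file and returns a non-negative integer called a file descriptor (fd); a return value of `-1` means the open failed. The kernel keeps track of all information about the open file, while the application refers to the file only through its fd
    * Each process created by a shell starts with three open files: 0 (STDIN), 1 (STDOUT), and 2 (STDERR)
* [lseek](https://man7.org/linux/man-pages/man2/lseek.2.html): for each open file the kernel maintains a file position k, initially 0, which is a byte offset from the beginning of the file. An application can change the position explicitly with lseek
* [read](https://linux.die.net/man/2/read) / [write](https://man7.org/linux/man-pages/man2/write.2.html): a read copies n bytes from the file into memory, starting at the current position k, and then advances the position to k + n. A write copies n bytes from memory into the file, starting at position k, and then updates the position to k + n
    * Given a file of size m bytes, a read issued when k >= m triggers the end-of-file (EOF) condition
* [close](https://man7.org/linux/man-pages/man2/close.2.html): when the application no longer needs the file, it asks the kernel to close it, and the kernel frees the data structures it allocated when the file was opened

## File

Every Linux file has a type that indicates its role in the system:

* Regular file: either a text file, which contains only human-readable ASCII or Unicode characters, or a binary file (object file, JPEG, and so on). The kernel makes no distinction between the two
* Directory: a directory is an array of links, each of which maps a file name to a file. Every directory contains at least two links:
    * `.`: a link to the directory itself
    * `..`: a link to the parent directory
* Socket: used to communicate with other processes
* [Named pipe](https://en.wikipedia.org/wiki/Named_pipe)
* [Symbolic links](https://en.wikipedia.org/wiki/Symbolic_link)
* [Character and block devices](https://en.wikipedia.org/wiki/Device_file)

> Reference [Character device drivers](https://linux-kernel-labs.github.io/refs/heads/master/labs/device_drivers.html): In the UNIX world there are two categories of device files and thus device drivers: character and block. This division is done by the speed, volume and way of organizing the data to be transferred from the device to the system and vice versa.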
> In the first category, there are slow devices, which manage a small amount of data, and access to data does not require frequent seek queries. Examples are devices such as keyboard, mouse, serial ports, sound card, joystick. In general, operations with these devices (read, write) are performed sequentially byte by byte. The second category includes devices where data volume is large, data is organized on blocks, and search is common. Examples of devices that fall into this category are hard drives, cdroms, ram disks, magnetic tape drives. For these devices, reading and writing is done at the data block level.

## RIO package

When using UNIX I/O, always check the return values. In some situations `read` and `write` transfer fewer bytes than the application requested. Such short counts do not indicate an error. They occur in these cases:

* `read` encounters EOF: if a file holds only 20 bytes and the application asks to read 50 bytes, the first `read` returns 20 and every later `read` returns 0 to signal EOF
* Reading text lines from a terminal: if the open file is associated with a terminal (keyboard/display), each `read` transfers one text line at a time
* Reading and writing sockets: kernel buffering constraints and network delays can cause short counts on socket `read` / `write`

Reading from and writing to disk files, by contrast, does not produce short counts, except for a `read` that hits EOF.

To deal with short counts in all of these situations, CSAPP describes the RIO package, which handles them robustly and efficiently. The code is available here (`src/csapp.c`, `include/csapp.h`): http://csapp.cs.cmu.edu/3e/code.html

RIO provides two kinds of functions:

* Unbuffered input and output: `rio_readn` / `rio_writen`
* Buffered input: `rio_readlineb` / `rio_readnb`
    * thread-safe

### Unbuffered RIO Input and Output

#### `rio_readn`

```cpp=
ssize_t rio_readn(int fd, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nread;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nread = read(fd, bufp, nleft)) < 0) {
            if (errno == EINTR) /* Interrupted by sig handler return */
                nread = 0;      /* and call read() again */
            else
                return -1;      /* errno set by read() */
        }
        else if (nread == 0)
            break;              /* EOF */
        nleft -= nread;
        bufp += nread;
    }
    return (n - nleft);         /* Return >= 0 */
}
```

`rio_readn` has the same interface as `read`, but it returns a short count only when it encounters EOF. As the code shows, it loops over `read` until the requested $n$ bytes have been transferred.

* Line 9 (the `errno == EINTR` check) handles a `read` that was interrupted by a signal handler return: the call is simply restarted

#### `rio_writen`

```cpp
ssize_t rio_writen(int fd, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nwritten;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nwritten = write(fd, bufp, nleft)) <= 0) {
            if (errno == EINTR)  /* Interrupted by sig handler return */
                nwritten = 0;    /* and call write() again */
            else
                return -1;       /* errno set by write() */
        }
        nleft -= nwritten;
        bufp += nwritten;
    }
    return n;
}
```

`rio_writen` has the same interface as `write`, but it never returns a short count: it either writes all n bytes or returns -1 on error. The overall idea is the same as in `rio_readn`.

### Buffered RIO Input Functions

#### `rio_readinitb`

To read a file more efficiently, it helps to cache its contents in a memory buffer ahead of time. RIO implements this with a `rio_t` structure that must be initialized first.

```cpp
typedef struct {
    int rio_fd;                /* descriptor for this internal buf */
    int rio_cnt;               /* unread bytes in internal buf */
    char *rio_bufptr;          /* next unread byte in internal buf */
    char rio_buf[RIO_BUFSIZE]; /* internal buffer */
} rio_t;

void rio_readinitb(rio_t *rp, int fd)
{
    rp->rio_fd = fd;
    rp->rio_cnt = 0;
    rp->rio_bufptr = rp->rio_buf;
}
```

* `rio_fd` is the file descriptor of the associated file
* `rio_cnt` is the number of unread bytes currently held in the buffer
* `rio_bufptr` points to the next byte to be read from the buffer
* `rio_buf` is the storage for the buffered bytes

#### `rio_read`

`rio_read` is the core of the two input functions `rio_readlineb` and `rio_readnb`.

```cpp
static ssize_t rio_read(rio_t *rp, char *usrbuf, size_t n)
{
    int cnt;

    while (rp->rio_cnt <= 0) {  /* Refill if buf is empty */
        rp->rio_cnt = read(rp->rio_fd, rp->rio_buf, sizeof(rp->rio_buf));
        if (rp->rio_cnt < 0) {
            if (errno != EINTR) /* Interrupted by sig handler return */
                return -1;
        }
        else if (rp->rio_cnt == 0)  /* EOF */
            return 0;
        else
            rp->rio_bufptr = rp->rio_buf; /* Reset buffer ptr */
    }

    /* Copy min(n, rp->rio_cnt) bytes from internal buf to user buf */
    cnt = n;
    if (rp->rio_cnt < n)
        cnt = rp->rio_cnt;
    memcpy(usrbuf, rp->rio_bufptr, cnt);
    rp->rio_bufptr += cnt;
    rp->rio_cnt -= cnt;
    return cnt;
}
```

* In the `while` loop, if the buffer is empty (`rp->rio_cnt <= 0`), it is refilled with a single `read` of up to `sizeof(rp->rio_buf)` bytes, and `rio_bufptr` is reset to the start of the buffer
* Then min(n, `rio_cnt`) bytes are copied from the buffer into the caller's `usrbuf`, and both the remaining count `rio_cnt` and the read position `rio_bufptr` are updated

#### `rio_readlineb`

```cpp
ssize_t rio_readlineb(rio_t *rp, void *usrbuf, size_t maxlen)
{
    int n, rc;
    char c, *bufp = usrbuf;

    for (n = 1; n < maxlen; n++) {
        if ((rc = rio_read(rp, &c, 1)) == 1) {
            *bufp++ = c;
            if (c == '\n') {
                n++;
                break;
            }
        } else if (rc == 0) {
            if (n == 1)
                return 0; /* EOF, no data read */
            else
                break;    /* EOF, some data was read */
        } else
            return -1;    /* Error */
    }
    *bufp = 0;
    return n-1;
}
```

`rio_readlineb` reads one text line (a string ending in `\n`) into `usrbuf`. It reads at most `maxlen - 1` bytes so that there is room for the terminating `\0`, and it returns the number of bytes read.

#### `rio_readnb`

```cpp
ssize_t rio_readnb(rio_t *rp, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nread;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nread = rio_read(rp, bufp, nleft)) < 0)
            return -1;          /* errno set by read() */
        else if (nread == 0)
            break;              /* EOF */
        nleft -= nread;
        bufp += nread;
    }
    return (n - nleft);         /* return >= 0 */
}
```

`rio_readnb` reads up to n bytes from the file, going through the internal buffer.

The figure below shows the basic idea behind the whole implementation:

![](https://i.imgur.com/FMi2Zp3.png)

## Standard I/O

The C library defines a set of higher-level standard I/O functions. Standard I/O models an open file as a stream; to the application programmer, a stream is a pointer to a structure of type `FILE`. Like RIO, its implementation uses an in-memory buffer to reduce the cost of calling the UNIX I/O system calls directly. It includes:

* Opening and closing files (`fopen` and `fclose`)
* Reading and writing bytes (`fread` and `fwrite`)
* Reading and writing text lines (`fgets` and `fputs`)
* Formatted reading and writing (`fscanf` and `fprintf`)

## Which I/O?

So which interface should be used in which situation? The pros and cons of the high-level and low-level I/O interfaces are as follows.

### Unix I/O

Pros:

* The most general way to do I/O; standard I/O and RIO are both built on top of it
* Provides access to file metadata
* async-signal-safe, so it can be called directly from signal handlers

Cons:

* Short counts and error values must be handled by the caller
* No buffering, so reading text lines efficiently requires extra work

### Standard I/O

Pros:

* Buffering reduces the number of `read` / `write` system calls and makes file access more efficient
* Short counts are handled inside the library

Cons:

* No interface for accessing file metadata
* Not async-signal-safe
* Not suitable for sockets (the restrictions on streams conflict in several ways with the restrictions on sockets; see CSAPP Chapter 11)

### Rules

With the pros and cons in mind, the selection rules can be summarized as follows:

1. Use the highest-level I/O functions that the situation allows
2. When working with disk or terminal files -> use standard I/O
3. Inside signal handlers, or in the rare cases that need hand-tuned, high-performance file access -> use UNIX I/O
4. When working with sockets -> use RIO and **avoid standard I/O**
5. When working with binary files ->
    1. Avoid text-oriented functions (`fgets`, `scanf`, `rio_readlineb`); use `rio_readn` or `rio_readnb` instead
        * These functions treat the EOL (end of line) byte as special
    2. Avoid string functions (`strlen`, `strcpy`, `strcat`)
        * These functions treat the `\0` byte as a string terminator

## Files in kernel

### File Metadata

Metadata is data about data. The kernel maintains it for every file, and an application can retrieve it with [fstat](https://linux.die.net/man/2/fstat) or stat.

### How the Unix Kernel Represents Open Files

The figure below shows how an open file is represented inside the kernel on a UNIX OS:

![](https://i.imgur.com/HlVvENl.png)

The kernel uses three data structures to represent open files:

* **Descriptor table**: each process has its own descriptor table. Each entry, indexed by fd, points to an entry in the open file table
* **Open file table**: the kernel maintains one open file table, shared by all processes. Each entry holds the current file position, a reference count, and other state, and points to an entry in the v-node table. Closing a file descriptor decrements the reference count of the corresponding open file table entry; the entry is actually released only when its reference count reaches 0
* **v-node table**: the kernel maintains one v-node table. Each entry holds most of the metadata that `stat` returns

![](https://i.imgur.com/FRHqtFN.png)

The figure above shows a file-sharing scenario: a single process calls `open` twice with the same file name. The open file table then has two distinct entries, each with its own file position, and both point to the same file in the v-node table.

![](https://i.imgur.com/bZoMscQ.png)

Another scenario is file sharing after fork. Suppose that before the fork, the structures look like the figure above.

![](https://i.imgur.com/XP2yjcp.png)

After the fork, the situation is as shown above. The child inherits a copy of the parent's file descriptor table, whose entries point to the same open file table entries. Because one more process now refers to each of those entries, their reference counts are incremented by one.

### I/O Redirection

The shell can redirect I/O with operators such as `>`, which associate a disk file with stdin / stdout. Internally, redirection is implemented with calls to [`dup2`](https://man7.org/linux/man-pages/man2/dup2.2.html).

![](https://i.imgur.com/i9YP1Sj.png)

![](https://i.imgur.com/smQjoKN.png)

The figures above show a call to `dup2(4, 1)`: fd 1 in the file descriptor table is changed to point to the open file table entry that fd 4 points to, and the reference counts of the affected entries are updated.
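
The `dup2`-based redirection described above can be sketched in a few lines of C. This is a minimal illustration and not code from CSAPP: the helper name `redirect_stdout_demo` and its parameters are hypothetical, and error handling is kept deliberately simple. It saves the original stdout with `dup`, points fd 1 at a file with `dup2`, writes through fd 1, restores stdout, and reads the file back to confirm where the bytes went.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CSAPP): temporarily redirects stdout to
 * `path`, writes `msg` through fd 1, restores stdout, then reads the file
 * back into `out`. Returns the number of bytes read back, or -1 on error.
 */
ssize_t redirect_stdout_demo(const char *path, const char *msg,
                             char *out, size_t outlen)
{
    assert(path != NULL && msg != NULL && out != NULL && outlen > 0);

    fflush(stdout);                     /* drain stdio's buffer before swapping fd 1 */

    int saved = dup(STDOUT_FILENO);     /* extra reference keeps the old entry alive */
    if (saved < 0)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        close(saved);
        return -1;
    }

    /* After dup2, fd 1 refers to the same open file table entry as fd. */
    if (dup2(fd, STDOUT_FILENO) < 0) {
        close(fd);
        close(saved);
        return -1;
    }
    close(fd);                          /* entry survives: fd 1 still references it */

    /* A plain write to fd 1 now lands in the file, as with `prog > path`. */
    size_t len = strlen(msg);
    ssize_t written = write(STDOUT_FILENO, msg, len);

    /* Point fd 1 back at the original stdout and drop the saved copy. */
    int restored = dup2(saved, STDOUT_FILENO);
    close(saved);
    if (written != (ssize_t)len || restored < 0)
        return -1;

    /* Read the file back to check that the message was redirected. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, out, outlen - 1);
    close(fd);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return n;
}
```

Calling `redirect_stdout_demo("out.txt", "hello\n", buf, sizeof buf)` from `main` should leave the message in both `out.txt` and `buf`, with nothing printed on the terminal. The `fflush` call matters when standard I/O and UNIX I/O are mixed on the same descriptor, since bytes still sitting in the `stdout` buffer would otherwise be flushed to whichever file fd 1 refers to later.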