# 13 File I/O Buffering
###### *The Linux Programming Interface, by M. Kerrisk, 2010*

**Overview**
- Describes the two types of buffering: in the I/O system calls (kernel) and in the standard C library (stdio)
- How both types of buffering affect application performance
- Introduces various techniques for influencing and disabling both types of buffering
- Direct I/O: a technique for bypassing kernel buffering

# 13.1 Kernel Buffering of File I/O: The buffer cache

```mermaid
sequenceDiagram
    participant UB as User Buffer
    participant P1 as Process 1
    participant K as Kernel
    participant C as Buffer Cache
    participant D as Disk
    participant P2 as Process 2
    P1->>K: write(fd, buf, 3)
    K->>UB: read data from user buffer
    UB-->>K: "abc"
    K->>C: copy to kernel buffer cache
    K-->>P1: return
    Note over D: Disk still has old data
    P2->>K: read(fd, ...)
    K->>C: read latest data
    C-->>K: "abc"
    K-->>P2: return "abc"
    K->>D: flush to disk (later)
```

1. `read()` and `write()` don't access the disk directly; they copy data between the **user buffer** and a kernel buffer (the **buffer cache**)
2. If, in the interim, another process tries to read this data, the kernel returns it from the **buffer cache**

**Vice Versa**
:::info
`read()` also takes its data from the **buffer cache**; the kernel goes to the disk only when the requested data is not already cached.

==read-ahead==
For sequential access, the kernel usually performs read-ahead so that the next blocks of the file are already in the cache before they are requested.
:::

This design is good because `read()` and `write()` don't need to wait on a slow disk operation.

There is no fixed limit on the size of the buffer cache. It is limited only by the amount of available physical memory.

:::danger
**Betrayal**
### Page Cache vs Buffer Cache
Before Linux kernel 2.4, the kernel maintained two separate caches:
- Buffer cache (for read/write system calls)
- Page cache (for memory-mapped files)

This caused duplication and inefficiency.

From kernel 2.4 onward, Linux unified these into a single page cache. All file I/O operations now use the page cache, including `read()`, `write()`, and `mmap()`.
:::

## Effect of buffer size on I/O system call performance

:::warning
**Important:** The same number of disk accesses is performed regardless of whether we do **1000 writes of a single byte** or a **single write of 1000 bytes**. The latter is preferable because it requires only a **single system call**.
:::

`BUF_SIZE` controls the buffer size used by the program.

![13-1](https://hackmd.io/_uploads/HyGkt0iqWe.png)

:::info
The test was performed using a vanilla 2.6.30 kernel on an ext2 file system with a block size of 4096 bytes.

==vanilla kernel== means an unpatched mainline kernel.
:::

Elapsed ~ total time from start to finish; it includes time spent waiting for disk I/O and scheduling delays, as well as Total CPU.

Total CPU ~ User CPU + System CPU
- User CPU: the time spent executing code in user mode
- System CPU: the time spent executing kernel code (i.e., system calls)

![image](https://hackmd.io/_uploads/BkuxX1n9-l.png)

What we can observe is that the elapsed time is mostly the same as Total CPU, since `write()` returns immediately after transferring data from user space into the buffer cache. (RAM is 4 GB and 100 million bytes = 100 MB, so the whole file fits in the cache.)

==For BUF_SIZE = 65536, the 2.06-second elapsed time is mainly due to disk **reads**==

> File systems can be measured by various other criteria, such as performance under heavy multiuser load, speed of file creation and deletion, time required to search for a file in a large directory, space required to store small files, or maintenance of file integrity in the event of a system crash. Where the performance of I/O or other file-system operations is critical, there is no substitute for application-specific benchmarks on the target platform.

**Meaning:** You can’t rely on general benchmarks; you need to measure performance using your own application on the real system you care about.
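
The effect of buffer size on the number of system calls can be sketched in a few lines of C. This is a minimal illustration, not the benchmark program behind the tables above: `write_in_chunks()` is a hypothetical helper, and its `chunk` parameter plays the role of `BUF_SIZE`. It writes the same total amount of data and reports how many `write()` calls were needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write 'total' bytes of filler to 'fd' using write() calls of at most
   'chunk' bytes each. Returns the number of write() calls made, or -1
   on error. (Hypothetical helper, for illustration only.) */
long write_in_chunks(int fd, size_t total, size_t chunk)
{
    assert(chunk > 0);

    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', chunk);

    long calls = 0;
    size_t done = 0;
    while (done < total) {
        size_t want = (total - done < chunk) ? total - done : chunk;
        ssize_t n = write(fd, buf, want);   /* one system call per iteration */
        if (n == -1) {
            free(buf);
            return -1;
        }
        done += (size_t) n;                 /* copes with partial writes */
        calls++;
    }

    free(buf);
    return calls;
}
```

With `total = 1000`, a `chunk` of 1 costs 1000 system calls while a `chunk` of 1000 normally costs one, even though the kernel ends up with the same bytes in its cache either way.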
# Buffering in the `stdio` library

```mermaid
sequenceDiagram
    participant App as Your Program
    participant Stdio as stdio Buffer (C library)
    participant Kernel as Kernel
    participant Cache as Page Cache
    participant Disk
    App->>Stdio: fprintf()/fwrite()
    Stdio->>Stdio: buffer data in user space
    Stdio->>Kernel: write()
    Kernel->>Cache: copy data to page cache
    Kernel-->>Stdio: return
    Note over Disk: not written immediately
    Cache->>Disk: flush later
```

Sequence of buffered output:
program → stdio buffer → write() → kernel page cache → disk later

**User buffer vs stdio buffer**
- User buffer = any buffer in your program (you control it)
- stdio buffer = a specific buffer managed by the C library

:::info
A stdio buffer is a type of user-space buffer, but not all user buffers are stdio buffers.
:::

![image](https://hackmd.io/_uploads/SJ-u3xn9bx.png)

## Setting the buffering mode of a `stdio` stream

setvbuf() -> controls the form of buffering

```C=
#include <stdio.h>

int setvbuf(FILE *stream, char *buf, int mode, size_t size);
// Returns 0 on success, or nonzero on error
```

## Why we use setvbuf()

Default stdio buffering is a general compromise and may not suit all use cases. `setvbuf()` allows control over buffering behavior to optimize for performance (fewer system calls) or immediacy (timely output), depending on program requirements. **We will see this in the `mode` argument.**

`setvbuf()` changes the buffering behavior of a stdio stream, such as its mode and buffer size.

> The setvbuf() call affects the behavior of all **subsequent** **stdio** operations on the specified stream

### `buf` argument

`buf` is non-NULL ~ points to a block of memory of `size` bytes that you provide yourself. The memory must stay valid for as long as the stream uses it, so it should be statically allocated or dynamically allocated on the heap, not a local variable of a function that returns while the stream is still open.

`buf` is NULL ~ the `stdio` library allocates the buffer automatically (unless unbuffered I/O is selected)

### `mode` argument

`_IONBF` ~ unbuffered I/O. Each stdio call immediately performs a `read()` or `write()` system call.
This is the **default** for `stderr`, so error messages appear immediately.

`_IOLBF` ~ line-buffered I/O. Output is buffered until a newline is written (or the buffer becomes full). For input, data is read one line at a time. This is the **default** for streams connected to terminal devices.

`_IOFBF` ~ fully buffered I/O. Data is transferred in blocks whose size is determined by the buffer size. This is the **default** for streams connected to disk files.

### What if we remove `v` from `setvbuf()` ?

Then it becomes `setbuf()`, which is a simplified wrapper around `setvbuf()`.

```C=
#include <stdio.h>

void setbuf(FILE *stream, char *buf);

// which is the same as
setvbuf(stream, buf, (buf != NULL) ? _IOFBF : _IONBF, BUFSIZ);
```

If `buf` is NULL, there is no buffering.
If `buf` is non-NULL, there is full buffering, and `buf` must point to at least `BUFSIZ` bytes.

#### Another kind of `setvbuf()`

```C=
#include <stdio.h>

void setbuffer(FILE *stream, char *buf, size_t size);

// which is the same as
setvbuf(stream, buf, (buf != NULL) ? _IOFBF : _IONBF, size);
```

## Flushing a `stdio` buffer

We can force the data in a `stdio` output stream to be written (i.e., passed to the kernel via `write()`).

```C=
#include <stdio.h>

int fflush(FILE *stream);
// Returns 0 on success, EOF on error
```

fflush(NULL) -> flushes all stdio output streams
fflush(stream) -> flushes only the buffer of the given stream

# Controlling Kernel Buffering of File I/O

> Sometimes it's necessary to force flushing of the kernel buffers, for example if an application (e.g., a database journaling process) must ensure that output really has been written to the disk (or at least to the disk's hardware cache) before continuing

## Synchronized I/O completion

> SUSv3 defines the term `synchronized I/O completion` to mean "an I/O operation that has either been **successfully** transferred [to the disk] or diagnosed as **unsuccessful**"

There are 2 types of it:
1. Synchronized I/O data integrity
2. Synchronized I/O file integrity

They differ in how they treat the metadata ("data about data") describing the file, which the kernel stores along with the data for a file.
Metadata includes information such as the file owner and group; file permissions; file size; number of (hard) links to the file; timestamps indicating the time of the last file access, last file modification, and last metadata change; and file data block pointers.

Commands for viewing metadata: `stat` (more detailed) and `ls -l` (less detailed).

### Synchronized I/O data integrity

For `read()` ~ the requested data has been transferred from the disk to the process. If there are pending writes affecting the requested data (e.g., still only in the page cache), they are **flushed to disk first**, before the read is performed.

For `write()` ~ the write is complete when the data is on the **disk** and all file metadata required to retrieve that data has also been transferred (e.g., file size). Other metadata (e.g., the last-modification timestamp) may remain only in the kernel's cache.

If a ==crash== happens, the essential data is safe; however, the timestamp may be outdated.

### Synchronized I/O file integrity

A **superset** (a set that contains another set) of synchronized I/O data integrity, which means file integrity completion includes everything in data integrity, plus more.

> The difference with this mode of I/O completion is that during a file update, ==all updated file metadata is transferred to disk==, even if it is not necessary for the operation of a subsequent read of the file data

## System calls for controlling kernel buffering of file I/O

fsync() ~ forces the file into the synchronized I/O **file integrity** completion state

```
#include <unistd.h>

int fsync(int fd);
// Returns 0 on success, or -1 on error
```

fdatasync() ~ forces the file into the synchronized I/O **data integrity** completion state

```
#include <unistd.h>

int fdatasync(int fd);
// Returns 0 on success, or -1 on error
```

> Using fdatasync() potentially reduces the number of disk operations from the **two** required by fsync() to **one**.
> Performance improves when unnecessary metadata updates are avoided, since file data and metadata may reside in **different parts of the disk**

> In Linux 2.2 and earlier, fdatasync() is implemented as a call to fsync(), and thus carries no performance gain.

### `sync()`

While `fflush()` flushes from user space into the page cache, **sync()** flushes everything from the **page cache/kernel buffers** to **disk**.

> sync() flushes all buffered file data and metadata in the system to disk, while fsync(fd) flushes data for a specific file. sync() provides system-wide flushing, whereas fsync() offers fine-grained control for ensuring data durability of a single file.

#### Real-world usage

sync() ~
- before shutdown
- system-wide consistency

fsync() ~
- saving a file (e.g., text editor)
- databases
- anything requiring data durability

## Making all writes synchronous: O_SYNC

### `O_SYNC`

Specifying `O_SYNC` in `open()` makes all subsequent `write()` calls synchronous.

```c
fd = open(pathname, O_WRONLY | O_SYNC);
```

After opening the file with `O_SYNC`, each `write()` automatically flushes both file data and metadata to disk. In other words, writes follow **synchronized I/O file integrity completion**.

`O_FSYNC` ~ older BSD name for `O_SYNC`; in `glibc`, `O_FSYNC` is defined as a synonym for `O_SYNC`.

#### Performance impact of `O_SYNC`

`O_SYNC` can severely reduce performance because every `write()` must wait until the data and metadata are flushed to disk.

![image](https://hackmd.io/_uploads/SkNctN6cbg.png)

- elapsed time can become much larger than CPU time, because the process blocks while waiting for disk I/O
- the slowdown is especially extreme for small buffer sizes
- larger write buffers reduce the overhead, but `O_SYNC` is still costly

:::warning
Modern disk drives have large internal caches, and by default, **O_SYNC** **merely** causes data to be transferred to that cache. If we disable caching on the disk (command `hdparm -W0`), the results become even worse.
In the 1-byte case, the elapsed time rises from 1030 to about 16000 seconds.
:::

### Takeaway

If forced flushing is required, it is often better to:
- use larger `write()` buffer sizes
- call `fsync()` or `fdatasync()` occasionally instead of opening the file with `O_SYNC`.

### `O_DSYNC` and `O_RSYNC`

`O_DSYNC` ~ makes `write()` follow **synchronized I/O data integrity completion**, similar to `fdatasync()`. Only file data and the metadata required for later data retrieval are flushed.

`O_SYNC` ~ makes `write()` follow **synchronized I/O file integrity completion**, similar to `fsync()`. Both file data and all relevant metadata are flushed.

#### Difference between `O_DSYNC` and `O_SYNC`

`O_DSYNC` ~ flushes file data and only the metadata needed to retrieve that data

`O_SYNC` ~ flushes file data plus all file metadata required for full file integrity

`O_RSYNC` ~ used together with either `O_DSYNC` or `O_SYNC`, and extends their synchronized I/O behavior to `read()` operations

- `O_RSYNC | O_DSYNC` ~ before each `read()`, pending writes affecting the requested data are completed according to **data integrity** requirements
- `O_RSYNC | O_SYNC` ~ before each `read()`, pending writes affecting the requested data are completed according to **file integrity** requirements

# Summary of I/O Buffering

![image](https://hackmd.io/_uploads/SJoMnVa9bx.png)

## Overview of output buffering

Output file I/O involves two buffering layers:
1. **stdio buffer** in user space
2. **kernel buffer cache** in kernel space

Data written by stdio functions is first stored in the **stdio buffer**. When that buffer becomes full, the stdio library calls `write()`, which copies the data into the **kernel buffer cache**. Later, the kernel flushes the cached data to disk.
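
The two layers can be walked through explicitly in code. The following is a minimal sketch, assuming a regular file on a file system where `fsync()` is supported; `write_durably()` is a hypothetical helper name, not something from the book. It pushes the data through the stdio buffer with `fflush()` and then through the kernel buffer cache with `fsync()`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write 'text' to 'path' and push it through both buffering layers.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
int write_durably(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    size_t len = strlen(text);

    /* Layer 1: the data lands in the stdio buffer (user space) */
    if (fwrite(text, 1, len, fp) != len)
        goto fail;

    /* stdio buffer -> kernel buffer cache, via write() */
    if (fflush(fp) == EOF)
        goto fail;

    /* Layer 2: kernel buffer cache -> disk (file integrity completion) */
    if (fsync(fileno(fp)) == -1)
        goto fail;

    return (fclose(fp) == EOF) ? -1 : 0;

fail:
    fclose(fp);
    return -1;
}
```

The order matters: calling `fsync()` without a preceding `fflush()` would flush only what has already reached the kernel, leaving anything still in the stdio buffer behind.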
## Controlling buffering

### Explicit flushing

These can be used at any time to force buffered data to be written out:
- `fflush()` ~ flushes the stdio buffer
- `fsync()`, `fdatasync()`, `sync()` ~ flush kernel buffers to disk

### Automatic flushing

These make flushing happen automatically:
- `setbuf()`, `setvbuf()` ~ control or disable stdio buffering
- `O_SYNC`, `O_DSYNC` ~ make `write()` operations synchronous, so data is flushed to disk immediately

# Advising the Kernel About I/O Patterns

posix_fadvise() ~ system call that allows a process to give the kernel a hint about its likely pattern for accessing file data.

The kernel **may** use the information from `posix_fadvise()`, but it is only a **hint**, not a requirement. This gives the kernel the flexibility to ignore the advice if it is not useful or practical.

```C=
#define _XOPEN_SOURCE 600
#include <fcntl.h>

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
// Returns 0 on success, or a positive error number on error
```

> Calling posix_fadvise() has no effect on the semantics (meaning, what a program does) of a program

The kernel decides based on whether the hint would likely improve **cache behavior**.

**Some factors:**
- the advice type you gave (SEQUENTIAL, RANDOM, WILLNEED, DONTNEED)
- the file’s current cached pages
- available memory
- current I/O and cache pressure
- what the kernel/filesystem implementation actually supports

### `advice` argument

The `advice` argument tells the kernel the expected file access pattern.

`POSIX_FADV_NORMAL` ~ no special access pattern; default behavior. On Linux, read-ahead uses the default size.

`POSIX_FADV_SEQUENTIAL` ~ data will likely be read sequentially, from lower offsets to higher offsets. On Linux, this increases the read-ahead window.

`POSIX_FADV_RANDOM` ~ data will likely be accessed in random order. On Linux, this disables read-ahead.

`POSIX_FADV_WILLNEED` ~ the specified file region will likely be accessed soon.
The kernel may preload that region into the buffer cache, so later `read()` calls can get data from memory instead of waiting for disk I/O.

`POSIX_FADV_DONTNEED` ~ the specified file region will likely not be accessed soon. The kernel may free the corresponding cache pages. If pages are modified, they may need to be flushed first.

`POSIX_FADV_NOREUSE` ~ the specified file region will likely be accessed only once and not reused. On Linux, this currently has no effect.

# Bypassing the Buffer Cache: Direct I/O

**Direct I/O** bypasses the kernel buffer cache and transfers data directly between **user space** and **a file or block device**.

Direct I/O is **not** generally faster; for most applications it can significantly **degrade performance** because it loses buffer-cache optimizations such as **sequential read-ahead**, **clustered I/O**, and **shared cached buffers** across processes.

Direct I/O is mainly useful for applications with specialized I/O requirements, such as **database systems** that already perform their own caching and I/O optimizations.

### How to perform direct I/O

Direct I/O is enabled by specifying the `O_DIRECT` flag in `open()` when opening a file or block device.

```c
fd = open(pathname, O_RDONLY | O_DIRECT);
```

### `O_DIRECT` support

`O_DIRECT` is effective on Linux since kernel `2.4.10`. Support depends on the kernel version and file system. Most native Linux file systems support it, but many non-UNIX file systems (e.g., `VFAT`) do not. If unsupported, `open()` fails with `EINVAL`.

### Cache coherency warning

Mixing `O_DIRECT` access in one process with normal buffered access in another process for the same file should be avoided, because there is **no coherency** between direct I/O and the buffer cache.

#### Example: mixing buffered I/O and direct I/O

If Process A writes using normal buffered I/O, the new data may remain only in the page cache.
If Process B then reads the same file using `O_DIRECT`, it bypasses the page cache and may read old data from disk. Likewise, if Process B writes using `O_DIRECT`, Process A may still read stale data from the page cache.

:::info
**Takeaway**
Buffered I/O uses the page cache, while direct I/O bypasses it. Since they are not automatically synchronized, mixing them on the same file can lead to stale or inconsistent data.
:::

## Alignment restrictions for direct I/O

When using direct I/O, the following must be multiples of the block size:
- the **memory address** of the data buffer
- the **file or device offset**
- the **length** of the transfer

If any of these restrictions is violated, `read()` or `write()` fails with `EINVAL`.

Here, block size usually means the **physical block size** of the device, typically `512` bytes.

On Linux `2.4`, the rules were stricter: alignment, offset, and length had to be multiples of the **logical block size** of the underlying file system (commonly `1024`, `2048`, or `4096` bytes).

## Example program

The example program `direct_read.c` demonstrates direct I/O input. It:
- opens a file with `O_DIRECT`
- optionally seeks to a specified offset
- allocates an aligned buffer using `memalign()`
- reads from the file using `read()`

The program takes up to four arguments:
1. file path
2. number of bytes to read
3. optional offset
4. optional buffer alignment

The default offset is `0`, and the default alignment is `4096` bytes.

### Example outcomes

![13-1](https://hackmd.io/_uploads/r1tMWvpc-l.png)

### Example code

```c
#define _GNU_SOURCE     /* Obtain O_DIRECT definition from <fcntl.h> */
#include <fcntl.h>
#include <malloc.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    ssize_t numRead;
    size_t length, alignment;
    off_t offset;
    void *buf;

    if (argc < 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file length [offset [alignment]]\n", argv[0]);

    length = getLong(argv[2], GN_ANY_BASE, "length");
    offset = (argc > 3) ?
                getLong(argv[3], GN_ANY_BASE, "offset") : 0;
    alignment = (argc > 4) ?
                getLong(argv[4], GN_ANY_BASE, "alignment") : 4096;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1)
        errExit("open");

    /* Allocate on a 2 * 'alignment' boundary, then step forward by
       'alignment', so that 'buf' is aligned to 'alignment' but not to
       the next larger power of two. Check for NULL before adding the
       offset, otherwise an allocation failure would go unnoticed. */
    char *base = memalign(alignment * 2, length + alignment);
    if (base == NULL)
        errExit("memalign");
    buf = base + alignment;

    if (lseek(fd, offset, SEEK_SET) == -1)
        errExit("lseek");

    numRead = read(fd, buf, length);
    if (numRead == -1)
        errExit("read");
    printf("Read %ld bytes\n", (long) numRead);

    exit(EXIT_SUCCESS);
}
```

# Mixing Library Functions and System Calls for File I/O

## File Descriptor and Stream

A file descriptor (fd) is the low-level UNIX/Linux **integer** handle for an open file.
A stream (`FILE *`) is the higher-level stdio **object** built on top of that handle.

A stream is used with stdio functions like fprintf(), fgets(), fclose().
A file descriptor is used with system calls like read(), write(), close().

:::info
It is possible to mix **stdio library functions** and **I/O system calls** on the same file.
:::

### `fileno()` and `fdopen()`

`fileno(stream)` ~ returns the **file descriptor** associated with a stdio stream

`fdopen(fd, mode)` ~ creates a **stdio stream** from an existing file descriptor

The `mode` argument of `fdopen()` is the same as in `fopen()`. It must be consistent with the access mode of the file descriptor, otherwise `fdopen()` fails.

### Use of `fdopen()`

`fdopen()` is useful for file descriptors returned by interfaces such as **pipes** and **sockets**, since these are created as file descriptors and must be converted to streams before using stdio functions on them.

### Buffering issue when mixing

System calls such as `read()` and `write()` transfer data directly between user space and the **kernel buffer cache**.

stdio functions use a **user-space buffer** first, and call `write()` only when that **buffer is flushed or becomes full**.

Because of this difference, mixing stdio and system calls on the same file can produce unexpected results.
### Example

```c
printf("To man the world is twofold, ");
write(STDOUT_FILENO, "in accordance with his twofold attitude.\n", 41);
```

The `printf()` output may remain in the stdio buffer, while `write()` goes directly to the kernel. As a result, the `write()` output may appear first.

### Avoiding the problem

- use `fflush()` before mixing stdio functions and system calls
- disabling stdio buffering with `setvbuf()` or `setbuf()` can also help, but may reduce performance because each output operation then causes a `write()` system call

## Summary

Both the **kernel** and the **stdio library** perform buffering for file I/O. Buffering improves efficiency, so disabling it may reduce performance.

Various system calls and library functions can be used to:
- control kernel and stdio buffering
- force one-time buffer flushes

`posix_fadvise()` ~ allows a process to give the kernel hints about expected file access patterns, which may help the kernel optimize use of the buffer cache and improve I/O performance

`O_DIRECT` ~ Linux-specific flag for `open()` that allows specialized applications to bypass the kernel buffer cache

`fileno()` ~ returns the file descriptor associated with a stdio stream

`fdopen()` ~ creates a stdio stream from an existing file descriptor, helping mix stdio functions and system calls on the same file

. . .

# English Notes

In the interim ~ during the time between two specific events; in the meantime (在此期间)

i.e. ~ (Latin: id est; English: that is, namely, in other words, that is to say): used for clarifying something

# Word Choice

- make sure -> ensure

# Vocab

naive ~ 天真的 (Chinese), polos (Indonesian); describes someone lacking experience, wisdom, or judgment. Implies simplistic.

semantic ~ about meaning
example:
- “Shut the door” and “Close the door” → different words, similar semantics

coherency ~ things matching each other and staying consistent