# 13 File I/O Buffering
###### *The Linux Programming Interface, by M. Kerrisk, 2010*
**Overview**
- Describes the two types of buffering: kernel buffering of I/O system calls and standard C library buffering
- How both types of buffering affect application performance
- Introduces various techniques for influencing and disabling both types of buffering
- Direct I/O: a technique for bypassing kernel buffering
# 13.1 Kernel Buffering of File I/O: The buffer cache
```mermaid
sequenceDiagram
participant UB as User Buffer
participant P1 as Process 1
participant K as Kernel
participant C as Buffer Cache
participant D as Disk
participant P2 as Process 2
P1->>K: write(fd, buf, 3)
K->>UB: read data from user buffer
UB-->>K: "abc"
K->>C: copy to kernel buffer cache
K-->>P1: return
Note over D: Disk still has old data
P2->>K: read(fd, ...)
K->>C: read latest data
C-->>K: "abc"
K-->>P2: return "abc"
K->>D: flush to disk (later)
```
1. read() and write() don't access the disk directly; instead, the kernel copies data between the **user buffer** and a kernel buffer (the **buffer cache**)
2. If, in the interim, another process tries to read that data, it is served from the **buffer cache** — and vice versa for writes
:::info
A read() is likewise served from the **buffer cache**; the kernel fetches from disk only when the requested data is not already cached.
==read-ahead== The kernel usually applies read-ahead to ensure that the next blocks of a sequentially read file are already in the cache before they are requested.
:::
This design is good because processes don't need to wait on slow disk operations.
There is no fixed limit on the size of the buffer cache; it is limited only by the amount of available physical memory.
:::danger
**Betrayal**
### Page Cache vs Buffer Cache
Before Linux kernel 2.4, the kernel maintained two separate caches:
- Buffer cache (for read/write system calls)
- Page cache (for memory-mapped files)
This caused duplication and inefficiency.
From kernel 2.4 onward, Linux unified these into a single page cache. All file I/O operations now use the page cache, including read(), write(), and mmap().
:::
## Effect of buffer size on I/O system call performance
:::warning
**Important:** The same number of disk accesses is performed whether we do **1000 writes of a single byte** or a **single write of 1000 bytes**. But the latter is preferable because it requires only a **single system call.**
:::
`BUF_SIZE` controls the buffer size used when running the program.
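The kind of loop being timed can be sketched as follows. This is a minimal illustration, not the book's actual test program; `copy_fd` and the treatment of partial writes are my own choices.

```c
#include <unistd.h>

#ifndef BUF_SIZE
#define BUF_SIZE 4096      /* vary this (1, 64, 4096, 65536) to reproduce the effect */
#endif

/* Copy all data from infd to outfd, BUF_SIZE bytes per read()/write() pair.
   Returns the total number of bytes copied, or -1 on error. */
long copy_fd(int infd, int outfd)
{
    char buf[BUF_SIZE];
    ssize_t numRead;
    long total = 0;

    while ((numRead = read(infd, buf, BUF_SIZE)) > 0) {
        if (write(outfd, buf, (size_t) numRead) != numRead)
            return -1;                 /* treat a partial write as an error */
        total += numRead;
    }
    return (numRead == -1) ? -1 : total;
}
```

With `BUF_SIZE` set to 1, copying 100 MB needs 100 million `write()` calls; with 65536, only about 1526 — the number of disk accesses is the same, but the system-call overhead differs enormously.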

:::info
The tests were performed using a vanilla 2.6.30 kernel, on an ext2 file system with a block size of 4096 bytes
==vanilla kernel== means unpatched mainline kernel
:::
Elapsed
~ Total time from start to finish; includes waiting on disk I/O, total CPU time, and scheduling delays.
Total CPU
~ Total CPU = User CPU + System CPU
- User CPU: the time spent executing code in user mode
- System CPU: the time spent executing kernel code (i.e., system calls)

What we can observe is that the elapsed time is mostly the same as total CPU time, since `write()` returns immediately after transferring data from user space into the buffer cache. (RAM is 4 GB; 100 million bytes = 100 MB, so all the written data fits in the cache.)
==For BUF_SIZE = 65536, the 2.06-second elapsed time is mainly due to disk **reads**==
>File systems can be measured by various other criteria, such as performance under heavy multiuser load, speed of file creation and deletion, time required to search for a file in a large directory, space required to store small files, or maintenance of file integrity in the event of a system crash. Where the performance of I/O or other file-system operations is critical, there is no substitute for application-specific benchmarks on the target platform
**Meaning:**
You can't rely on general benchmarks; you need to measure performance using your own application on the real system you care about.
# Buffering in the `stdio` library
```mermaid
sequenceDiagram
participant App as Your Program
participant Stdio as stdio Buffer (C library)
participant Kernel as Kernel
participant Cache as Page Cache
participant Disk
App->>Stdio: fprintf()/fwrite()
Stdio->>Stdio: buffer data in user space
Stdio->>Kernel: write()
Kernel->>Cache: copy data to page cache
Kernel-->>Stdio: return
Note over Disk: not written immediately
Cache->>Disk: flush later
```
Sequence of buffered output:
program → stdio buffer → write() → kernel page cache → disk later
**User buffer vs stdio buffer**
- User buffer = any buffer in your program (you control it)
- stdio buffer = a specific buffer managed by the C library
:::info
stdio buffer is a type of user-space buffer, but not all user buffers are stdio buffers
:::

## Setting the buffering mode of a `stdio` stream
setvbuf() -> controls the form of buffering
```C=
#include <stdio.h>
int setvbuf(FILE *stream, char *buf, int mode, size_t size);
// Returns 0 on success, or nonzero on error
```
## Why we use setvbuf()
Default stdio buffering is a general compromise and may not suit all use cases. `setvbuf()` allows control over buffering behavior to optimize for performance (fewer system calls) or immediacy (timely output), depending on program requirements. **We will see it in the `mode` argument**
`setvbuf()` changes the buffering behavior of a stdio stream, such as its mode and buffer size.
> The setvbuf() call affects the behavior of all **subsequent** stdio operations on the specified stream.
### `buf` argument
`buf` is non-NULL
~ points to a block of memory of `size` bytes that you provide. The memory must be long-lived, e.g., statically allocated or dynamically allocated on the heap (not a stack-allocated local that goes out of scope before the stream is closed).
`buf` is NULL
~ a buffer is allocated automatically by the `stdio` library
### `mode` argument
`_IONBF`
~ unbuffered I/O. Each stdio call immediately performs a `read()` or `write()` system call. This is the **default** for `stderr`, so error messages appear immediately.
`_IOLBF`
~ line-buffered I/O. Output is buffered until a newline is written (or the buffer becomes full). For input, data is read one line at a time. This is the **default** for streams connected to terminal devices.
`_IOFBF`
~ fully buffered I/O. Data is transferred in blocks whose size is determined by the buffer size. This is the **default** for streams connected to disk files.
### What if we remove `v` from `setvbuf()`?
Then it becomes `setbuf()`, which is a simplified wrapper around `setvbuf()`
```C=
#include <stdio.h>
void setbuf(FILE *stream, char *buf);
// which is same as
setvbuf(fp, buf, (buf != NULL) ? _IOFBF: _IONBF, BUFSIZ);
```
If buf is NULL, I/O is unbuffered.
If buf is non-NULL, I/O is fully buffered, using buf as the buffer.
#### Another variant: `setbuffer()`
```C=
void setbuffer(FILE *stream, char *buf, size_t size);
// is the same as
setvbuf(fp, buf, (buf != NULL) ? _IOFBF : _IONBF, size);
```
## Flushing a `stdio` buffer
We can force the data in a `stdio` output stream to be written
```C=
#include <stdio.h>
int fflush(FILE *stream);
// Returns 0 on success, EOF on error
```
fflush(NULL) -> flushes all stdio output streams
fflush(fp) -> flushes only the stdio buffer of the specified stream
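A minimal sketch of forcing stdio output out of user space; `log_now` is an illustrative helper name, not a library function.

```c
#include <stdio.h>

/* Write a message and immediately push the stdio buffer into the kernel.
   Returns 0 on success, EOF on error. */
int log_now(FILE *fp, const char *msg)
{
    if (fputs(msg, fp) == EOF)
        return EOF;
    return fflush(fp);      /* flushes only this stream's buffer */
}
```

Note that this only reaches the kernel's page cache; forcing the data onto the disk itself is the job of the calls in the next section.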
# Controlling Kernel Buffering of File I/O
> Sometimes it's necessary to force flushing of the kernel buffers if an application (e.g., a database journaling process) must ensure that output really has been written to the disk (or at least to the disk's hardware cache) before continuing
## Synchronized I/O completion
> SUSv3 defines the term `synchronized I/O completion` to mean "an I/O operation that has either been **successfully** transferred [to the disk] or diagnosed as **unsuccessful**"
There are two types of synchronized I/O completion:
1. Synchronized I/O data integrity
2. Synchronized I/O file integrity
They differ in their handling of metadata ("data about data") describing the file, which the kernel stores along with the data for a file.
Metadata includes information such as the file owner and group; file permissions; file size; number of (hard) links to the file; timestamps indicating the time of the last file access, last file modification, and last metadata change; and file data block pointers.
To view metadata, use `stat` for full details, or `ls -l` for a less detailed view.
### Synchronized I/O data integrity
In the read()
~ If some of the requested data has been written but not yet saved to disk (e.g., it is still in the page cache), the kernel **forces a flush** of that data to disk before the read() completes
In the write()
~ A write is complete when the data is on the **disk** and all file metadata required to retrieve that data (e.g., the file size) has also been transferred. Other metadata (e.g., the last-modification timestamp) may remain only in the page cache.
If a ==crash== happens, the essential data is safe; however, such timestamps may be outdated.
### Synchronized I/O file integrity
A **superset** (a set that contains another set) of synchronized I/O data integrity, which means file integrity completion includes everything in data integrity, plus more.
> The difference with this mode of I/O completion is that during a file
update, ==all updated file metadata is transferred to disk==, even if it is not necessary for the operation of a subsequent read of the file data
## System calls for controlling kernel buffering of file I/O
fsync()
~ forces the file into the synchronized I/O **file integrity** completion state
```
#include <unistd.h>
int fsync(int fd);
// Returns 0 on success, or -1 on error
```
fdatasync()
~ forces the file into the synchronized I/O **data integrity** completion state
```
#include <unistd.h>
int fdatasync(int fd);
// Returns 0 on success, or -1 on error
```
> Using fdatasync() potentially reduces the number of disk operations from the **two** required by fsync() to **one**.
> Performance improves when unnecessary metadata updates are avoided, since file data and metadata may reside in **different parts of the disk**
> In Linux 2.2 and earlier, fdatasync() is implemented as a call to fsync(), and thus carries no performance gain.
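The fsync()/fdatasync() distinction can be sketched as a journaling-style append. `durable_append` is an illustrative name with simplified error handling; swap in `fsync()` if all metadata (e.g., timestamps) must also reach the disk.

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a record and wait until the data, plus any metadata needed to
   retrieve it (such as the file size), has reached the disk.
   Returns 0 on success, -1 on error. */
int durable_append(int fd, const void *rec, size_t len)
{
    ssize_t n = write(fd, rec, len);

    if (n == -1 || (size_t) n != len)
        return -1;
    return fdatasync(fd);   /* data integrity only; fsync() for file integrity */
}
```

Batching several records per `fdatasync()` call, rather than syncing every write, is a common way to reduce the cost of this pattern.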
### sync()
While fflush() flushes data from user space into the page cache, **sync()** flushes everything from the **page cache/kernel buffers** to **disk**.
> sync() flushes all buffered file data and metadata in the system to disk, while fsync(fd) flushes data for a specific file. sync() provides system-wide flushing, whereas fsync() offers fine-grained control for ensuring data durability of a single file.
#### Real-world usage
sync()
~ before shutdown
system-wide consistency
fsync()
~ saving a file (e.g., text editor)
databases
anything requiring data durability
## Making all writes synchronous: O_SYNC
### `O_SYNC`
Specifying `O_SYNC` in `open()` makes all subsequent `write()` calls synchronous.
```c
fd = open(pathname, O_WRONLY | O_SYNC);
```
After opening the file with `O_SYNC`, each `write()` automatically flushes both file data and metadata to disk. In other words, writes follow **synchronized I/O file integrity completion**.
`O_FSYNC`
~ older BSD name for `O_SYNC`; in `glibc`, `O_FSYNC` is defined as a synonym for `O_SYNC`.
#### Performance impact of `O_SYNC`
`O_SYNC` can severely reduce performance because every `write()` must wait until the data and metadata are flushed to disk.

- elapsed time can become much larger than CPU time, because the process blocks while waiting for disk I/O
- the slowdown is especially extreme for small buffer sizes
- larger write buffers reduce the overhead, but `O_SYNC` is still costly
:::warning
Modern disk drives have large internal caches, and by default **O_SYNC** **merely** causes data to be transferred to that cache.
If we disable caching on the disk (command `hdparm -W0`), things get much worse: in the 1-byte case, the elapsed time rises from 1030 to around 16,000 seconds.
:::
### Takeaway
If forced flushing is required, it is often better to:
- use larger `write()` buffer sizes
- call `fsync()` or `fdatasync()` occasionally
instead of opening the file with `O_SYNC`.
### `O_DSYNC` and `O_RSYNC`
`O_DSYNC`
~ makes `write()` follow **synchronized I/O data integrity completion**, similar to `fdatasync()`. Only file data and metadata required for later data retrieval are flushed.
`O_SYNC`
~ makes `write()` follow **synchronized I/O file integrity completion**, similar to `fsync()`. Both file data and all relevant metadata are flushed.
#### Difference between `O_DSYNC` and `O_SYNC`
`O_DSYNC`
~ flushes file data and only the metadata needed to retrieve that data
`O_SYNC`
~ flushes file data plus all file metadata required for full file integrity
`O_RSYNC`
~ used together with either `O_DSYNC` or `O_SYNC`, and extends their synchronized I/O behavior to `read()` operations
-- `O_RSYNC | O_DSYNC`
~ before each `read()`, pending writes affecting the requested data are completed according to **data integrity** requirements
-- `O_RSYNC | O_SYNC`
~ before each `read()`, pending writes affecting the requested data are completed according to **file integrity** requirements
# Summary of I/O Buffering

## Overview of output buffering
Output file I/O involves two buffering layers:
1. **stdio buffer** in user space
2. **kernel buffer cache** in kernel space
Data written by stdio functions is first stored in the **stdio buffer**.
When that buffer becomes full, the stdio library calls `write()`, which copies the data into the **kernel buffer cache**.
Later, the kernel flushes the cached data to disk.
## Controlling buffering
### Explicit flushing
These can be used at any time to force buffered data to be written out:
- `fflush()`
~ flushes the stdio buffer
- `fsync()`, `fdatasync()`, `sync()`
~ flush kernel buffers to disk
### Automatic flushing
These make flushing happen automatically:
- `setbuf()`, `setvbuf()`
~ control or disable stdio buffering
- `O_SYNC`, `O_DSYNC`
~ make `write()` operations synchronous, so data is flushed to disk immediately
# Advising the Kernel About I/O Patterns
posix_fadvise()
~ system call that allows a process to give the kernel a hint about its likely pattern for accessing file data. The kernel **may** use the information from `posix_fadvise()`, but it is only a **hint**, not a requirement. This gives the kernel flexibility to ignore the advice if it is not useful or practical.
```C=
#define _XOPEN_SOURCE 600
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
// Returns 0 on success, or a positive error number on error
```
> Calling posix_fadvise() has no effect on the semantics (meaning: what a program does) of a program
The kernel decides based on whether the hint would likely improve **cache behavior**.
**Some factors:**
- the advice type you gave (SEQUENTIAL, RANDOM, WILLNEED, DONTNEED)
- the file’s current cached pages
- available memory
- current I/O and cache pressure
- what the kernel/filesystem implementation actually supports
### `advice` argument
The `advice` argument tells the kernel the expected file access pattern.
`POSIX_FADV_NORMAL`
~ no special access pattern; default behavior. On Linux, read-ahead uses the default size.
`POSIX_FADV_SEQUENTIAL`
~ data will likely be read sequentially, from lower offsets to higher offsets. On Linux, this increases the read-ahead window.
`POSIX_FADV_RANDOM`
~ data will likely be accessed in random order. On Linux, this disables read-ahead.
`POSIX_FADV_WILLNEED`
~ the specified file region will likely be accessed soon. The kernel may preload that region into the buffer cache, so later `read()` calls can get data from memory instead of waiting for disk I/O.
`POSIX_FADV_DONTNEED`
~ the specified file region will likely not be accessed soon. The kernel may free the corresponding cache pages. If pages are modified, they may need to be flushed first.
`POSIX_FADV_NOREUSE`
~ the specified file region will likely be accessed only once and not reused. On Linux, this currently has no effect.
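A sketch of issuing one of these hints before scanning a large file. The wrapper name is illustrative; the feature-test macro follows the prototype above, and the kernel may still ignore the advice entirely.

```c
#define _XOPEN_SOURCE 600   /* expose the posix_fadvise() declaration */
#include <fcntl.h>
#include <unistd.h>

/* Hint that the whole file (offset 0, len 0 means "through end of file")
   will be read sequentially; on Linux this may enlarge the read-ahead window.
   Returns 0 on success, or a positive error number on error. */
int hint_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

The same shape works for the other hints, e.g., `POSIX_FADV_RANDOM` before index lookups, or `POSIX_FADV_DONTNEED` after a one-pass scan to release cache pages.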
# Bypassing the Buffer Cache: Direct I/O
**Direct I/O** bypasses the kernel buffer cache and transfers data directly between **user space** and **a file or block device**.
Direct I/O is **not** generally faster; for most applications it can significantly **degrade performance** because it loses buffer-cache optimizations such as **sequential read-ahead**, **clustered I/O**, and **shared cached buffers** across processes.
Direct I/O is mainly useful for applications with specialized I/O requirements, such as **database systems** that already perform their own caching and I/O optimizations.
### How to perform direct I/O
Direct I/O is enabled by specifying the `O_DIRECT` flag in `open()` when opening a file or block device.
```c
fd = open(pathname, O_RDONLY | O_DIRECT);
```
### `O_DIRECT` support
`O_DIRECT` is effective on Linux since kernel `2.4.10`. Support depends on the kernel version and file system. Most native Linux file systems support it, but many non-UNIX file systems (e.g., `VFAT`) do not. If unsupported, `open()` fails with `EINVAL`.
### Cache coherency warning
Mixing `O_DIRECT` access in one process with normal buffered access in another process for the same file should be avoided, because there is **no coherency** between direct I/O and the buffer cache.
#### Example: mixing buffered I/O and direct I/O
If Process A writes using normal buffered I/O, the new data may remain only in the page cache.
If Process B then reads the same file using `O_DIRECT`, it bypasses the page cache and may read old data from disk.
Likewise, if Process B writes using `O_DIRECT`, Process A may still read stale data from the page cache.
:::info
**Takeaway**
Buffered I/O uses the page cache, while direct I/O bypasses it. Since they are not automatically synchronized, mixing them on the same file can lead to stale or inconsistent data.
:::
## Alignment restrictions for direct I/O
When using direct I/O, the following must be multiples of the block size:
- the **memory address** of the data buffer
- the **file or device offset**
- the **length** of the transfer
If any of these restrictions is violated, `read()` or `write()` fails with `EINVAL`.
Here, block size usually means the **physical block size** of the device, typically `512` bytes. On Linux `2.4`, the rules were stricter: alignment, offset, and length had to be multiples of the **logical block size** of the underlying file system (commonly `1024`, `2048`, or `4096` bytes).
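One way to satisfy these restrictions is `posix_memalign()`. This is a sketch under the assumption of a 4096-byte block size; the correct value depends on the device and file system, and `alloc_dio_buffer` is an illustrative name.

```c
#include <stdlib.h>

#define DIO_ALIGN 4096   /* assumed block size; the real value is device-dependent */

/* Allocate a buffer whose address and padded length are both multiples
   of DIO_ALIGN, satisfying the O_DIRECT alignment restrictions.
   Returns NULL on failure. */
void *alloc_dio_buffer(size_t len)
{
    void *buf;
    size_t padded = ((len + DIO_ALIGN - 1) / DIO_ALIGN) * DIO_ALIGN;

    if (posix_memalign(&buf, DIO_ALIGN, padded) != 0)
        return NULL;
    return buf;
}
```

The caller must still ensure the file offset passed to `lseek()` and the transfer length passed to `read()`/`write()` are also multiples of the block size.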
## Example program
The example program `direct_read.c` demonstrates direct I/O input. It:
- opens a file with `O_DIRECT`
- optionally seeks to a specified offset
- allocates an aligned buffer using `memalign()`
- reads from the file using `read()`
The program takes up to four arguments:
1. file path
2. number of bytes to read
3. optional offset
4. optional buffer alignment
The default offset is `0`, and the default alignment is `4096` bytes.
### Example outcomes

### Example code
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <malloc.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
{
    int fd;
    ssize_t numRead;
    size_t length, alignment;
    off_t offset;
    void *buf;

    if (argc < 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file length [offset [alignment]]\n", argv[0]);

    length = getLong(argv[2], GN_ANY_BASE, "length");
    offset = (argc > 3) ? getLong(argv[3], GN_ANY_BASE, "offset") : 0;
    alignment = (argc > 4) ? getLong(argv[4], GN_ANY_BASE, "alignment") : 4096;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1)
        errExit("open");

    /* Allocate with alignment*2 and step 'alignment' bytes in, so that
       'buf' is aligned on 'alignment' but not necessarily on a larger
       power of two; the NULL check must come before the adjustment. */
    buf = memalign(alignment * 2, length + alignment);
    if (buf == NULL)
        errExit("memalign");
    buf = (char *) buf + alignment;

    if (lseek(fd, offset, SEEK_SET) == -1)
        errExit("lseek");

    numRead = read(fd, buf, length);
    if (numRead == -1)
        errExit("read");

    printf("Read %ld bytes\n", (long) numRead);

    exit(EXIT_SUCCESS);
}
```
# Mixing Library Functions and System Calls for File I/O
## File Descriptor and Stream
A file descriptor (fd) is the low-level Unix/Linux **integer** handle for an open file
A stream (`FILE *`) is the higher-level stdio **object** built on top of that handle
A stream is used with stdio functions like fprintf(), fgets(), fclose()
An fd is used with system calls like read(), write(), close()
:::info
It is possible to mix **stdio library functions** and **I/O system calls** on the same file.
:::
### `fileno()` and `fdopen()`
`fileno(stream)`
~ returns the **file descriptor** associated with a stdio stream
`fdopen(fd, mode)`
~ creates a **stdio stream** from an existing file descriptor
The `mode` argument of `fdopen()` is the same as in `fopen()`. It must be consistent with the access mode of the file descriptor, otherwise `fdopen()` fails.
### Use of `fdopen()`
`fdopen()` is useful for file descriptors returned by interfaces such as **pipes** and **sockets**, since these are created as file descriptors and must be converted to streams before using stdio functions on them.
### Buffering issue when mixing
System calls such as `read()` and `write()` transfer data directly between user space and the **kernel buffer cache**.
stdio functions use a **user-space buffer** first, and call `write()` only when that **buffer is flushed or becomes full**.
Because of this difference, mixing stdio and system calls on the same file can produce unexpected results.
### Example
```c
printf("To man the world is twofold, ");
write(STDOUT_FILENO, "in accordance with his twofold attitude.\n", 41);
```
The `printf()` output may remain in the stdio buffer, while `write()` goes directly to the kernel. As a result, the `write()` output may appear first.
### Avoiding the problem
- use `fflush()` before mixing stdio functions and system calls
- disabling stdio buffering with `setvbuf()` or `setbuf()` can also help, but may reduce performance because each output operation then causes a `write()` system call
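The first fix can be sketched as follows; `ordered_output` is an illustrative name, and the stream and descriptor are assumed to refer to the same open file.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Mix stdio and write() on the same open file while keeping output in
   order: flush the stdio buffer before the raw system call.
   fp must be a stream on fd (e.g., fd == fileno(fp)). Returns 0 or -1. */
int ordered_output(FILE *fp, int fd)
{
    const char *rest = "in accordance with his twofold attitude.\n";

    if (fprintf(fp, "To man the world is twofold, ") < 0)
        return -1;
    if (fflush(fp) == EOF)      /* empty the stdio buffer into the kernel first */
        return -1;
    if (write(fd, rest, strlen(rest)) == -1)
        return -1;
    return 0;
}
```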
## Summary
Both the **kernel** and the **stdio library** perform buffering for file I/O. Buffering improves efficiency, but disabling it may reduce performance.
Various system calls and library functions can be used to:
- control kernel and stdio buffering
- force one-time buffer flushes
`posix_fadvise()`
~ allows a process to give the kernel hints about expected file access patterns, which may help the kernel optimize use of the buffer cache and improve I/O performance
`O_DIRECT`
~ Linux-specific flag for `open()` that allows specialized applications to bypass the kernel buffer cache
`fileno()`
~ returns the file descriptor associated with a stdio stream
`fdopen()`
~ creates a stdio stream from an existing file descriptor, helping mix stdio functions and system calls on the same file
# English Notes
In the interim
~ during the time between two specific events; in the meantime
i.e.
~ (Latin: id est; English: that is, namely, in other words, that is to say): used for clarifying something
# Word Choice
- make sure -> ensure
# Vocab
naive
~ describes someone lacking experience, wisdom, or judgment; implies being simplistic
semantic
~ about meaning
example :
- “Shut the door” and “Close the door”
→ different words, similar semantics
coherency
~ things matching each other and staying consistent