---
# System prepended metadata

title: Chapter 13 File I/O Buffering
tags: [Network Application Programming (Linux), English Writing Notes]

---


# 13 File I/O Buffering
######  *The Linux Programming Interface, by M. Kerrisk, 2010*
**Overview**
- Describes the two types of buffering: by the kernel (I/O system calls) and by the standard C library (stdio)
- How both types of buffering affect application performance
- Introduces various techniques for influencing and disabling both types of buffering
- Direct I/O: a technique for bypassing kernel buffering


# 13.1 Kernel Buffering of File I/O: The buffer cache

```mermaid
sequenceDiagram
    participant UB as User Buffer
    participant P1 as Process 1
    participant K as Kernel
    participant C as Buffer Cache
    participant D as Disk
    participant P2 as Process 2

    P1->>K: write(fd, buf, 3)
    K->>UB: read data from user buffer
    UB-->>K: "abc"
    K->>C: copy to kernel buffer cache
    K-->>P1: return

    Note over D: Disk still has old data

    P2->>K: read(fd, ...)
    K->>C: read latest data
    C-->>K: "abc"
    K-->>P2: return "abc"

    K->>D: flush to disk (later)
```

1. `read()` and `write()` don't access the disk directly. `write()` simply copies the data from the **user buffer** to a kernel buffer (the **buffer cache**) and returns.
2. If, in the interim, another process tries to read this data, the kernel returns it from the **buffer cache**.


**Vice Versa**
:::info
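`read()` also takes its data from the **buffer cache**: the kernel reads data from the disk into the cache, and `read()` calls fetch from the cache until it is exhausted, at which point the kernel reads the next segment of the file from disk.

==read-ahead== For sequential access, the kernel usually performs read-ahead so that the next blocks of the file are already in the cache before they are requested.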
:::

This design makes `read()` and `write()` fast, since they don't need to wait for a slow disk operation.

There is no fixed limit on the size of the buffer cache. It is limited only by the amount of available physical memory.

:::danger
**Betrayal**

### Page Cache vs Buffer Cache
Before Linux kernel 2.4, the kernel maintained two separate caches:
- Buffer cache (for read/write system calls)
- Page cache (for memory-mapped files)

This caused duplication and inefficiency.
From kernel 2.4 onward, Linux unified these into a single page cache. All file I/O operations now use the page cache, including read(), write(), and mmap().
:::

## Effect of buffer size on I/O system call performance

:::warning
**Important:** The same number of disk accesses is performed, regardless of whether we do **1000 writes of a single byte** or a **single write of 1000 bytes**. The latter is preferable because it requires only a **single system call**.
:::

`BUF_SIZE` controls the size of the buffer the test program uses for each `read()` and `write()` call.
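
The kind of copy loop behind these measurements can be sketched as follows. This is a minimal sketch, not the book's listing: `copy_with_buf_size()` is a hypothetical helper name, and the buffer size is passed at run time here instead of being a compile-time `BUF_SIZE` macro.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy src to dst using read()/write() with a buffer of buf_size bytes.
   Returns the number of bytes copied, or -1 on error.
   (Hypothetical helper for illustration.) */
long copy_with_buf_size(const char *src, const char *dst, size_t buf_size)
{
    assert(src != NULL && dst != NULL && buf_size > 0);

    int in = open(src, O_RDONLY);
    if (in == -1)
        return -1;

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1) {
        close(in);
        return -1;
    }

    char *buf = malloc(buf_size);
    if (buf == NULL) {
        close(in);
        close(out);
        return -1;
    }

    long total = 0;
    ssize_t n;
    /* A smaller buf_size means more read()/write() system calls
       for the same amount of data. */
    while ((n = read(in, buf, buf_size)) > 0) {
        ssize_t done = 0;
        while (done < n) {                  /* handle partial writes */
            ssize_t w = write(out, buf + done, (size_t) (n - done));
            if (w == -1) {
                total = -1;
                goto cleanup;
            }
            done += w;
        }
        total += n;
    }
    if (n == -1)
        total = -1;

cleanup:
    free(buf);
    close(in);
    close(out);
    return total;
}
```

Calling it with a `buf_size` of 1 and then 65536 on the same large file, under `time`, is one way to reproduce the trend, although the absolute numbers depend on the machine and file system.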

![13-1](https://hackmd.io/_uploads/HyGkt0iqWe.png)

:::info
The test was performed using a vanilla 2.6.30 kernel on an ext2 file system with a block size of 4096 bytes.

==vanilla kernel== means an unpatched mainline kernel
:::

Elapsed
~ Total time from start to finish; includes the total CPU time as well as time spent waiting for disk I/O and scheduling delays.

Total CPU
~ Total CPU = User CPU + System CPU
- User CPU: the time spent executing code in user mode
- System CPU: the time spent executing kernel code (i.e., system calls)

![image](https://hackmd.io/_uploads/BkuxX1n9-l.png)

What we can observe is that the elapsed time is almost the same as the total CPU time, since `write()` returns immediately after transferring data from user space into the buffer cache. (RAM is 4 GB, and 100 million bytes = 100 MB.)

==For BUF_SIZE = 65536, the 2.06-second elapsed time is mainly due to disk **reads**==


>File systems can be measured by various other criteria, such as performance under heavy multiuser load, speed of file creation and deletion, time required to search for a file in a large directory, space required to store small files, or maintenance of file integrity in the event of a system crash.

> Where the performance of I/O or other file-system operations is critical, there is no substitute for application-specific benchmarks on the target platform.

**Meaning:**
You can't rely on general benchmarks; you need to measure performance using your own application on the real system you care about.


# Buffering in the `stdio` library

```mermaid
sequenceDiagram
    participant App as Your Program
    participant Stdio as stdio Buffer (C library)
    participant Kernel as Kernel
    participant Cache as Page Cache
    participant Disk

    App->>Stdio: fprintf()/fwrite()
    Stdio->>Stdio: buffer data in user space
    Stdio->>Kernel: write()
    Kernel->>Cache: copy data to page cache
    Kernel-->>Stdio: return
    Note over Disk: not written immediately
    Cache->>Disk: flush later

```
Sequence of buffered output: 
program → stdio buffer → write() → kernel page cache → disk later

**User buffer vs stdio buffer**
- User buffer = any buffer in your program (you control it)
- stdio buffer = a specific buffer managed by the C library

:::info
stdio buffer is a type of user-space buffer, but not all user buffers are stdio buffers
:::
![image](https://hackmd.io/_uploads/SJ-u3xn9bx.png)

## Setting the buffering mode of a `stdio` stream

`setvbuf()` controls the form of buffering used by the stdio library:
```C=
#include <stdio.h>

int setvbuf(FILE *stream, char *buf, int mode, size_t size);

// Returns 0 on success, or nonzero on error
```

## Why we use setvbuf()

Default stdio buffering is a general compromise and may not suit all use cases. `setvbuf()` allows control over buffering behavior to optimize for performance (fewer system calls) or immediacy (timely output), depending on program requirements. **We will see it in the `mode` argument**

`setvbuf()` changes the buffering behavior of a stdio stream, such as its mode and buffer size.

> The setvbuf() call affects the behavior of all
**subsequent** **stdio** operations on the specified stream


### `buf` argument

`buf` is non-NULL
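~ points to a block of memory of `size` bytes that you provide, which is used as the buffer. The memory must outlive the stream, so it should be statically allocated or dynamically allocated on the heap, not a local (stack) variable of a function that returns while the stream is still open.

`buf` is NULL
~ the `stdio` library allocates the buffer automatically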

### `mode` argument

`_IONBF`
~ unbuffered I/O. Each stdio call immediately performs a `read()` or `write()` system call. This is the **default** for `stderr`, so error messages appear immediately.

`_IOLBF`
~ line-buffered I/O. Output is buffered until a newline is written (or the buffer becomes full). For input, data is read one line at a time. This is the **default** for streams connected to terminal devices.

`_IOFBF`
~ fully buffered I/O. Data is transferred in blocks whose size is determined by the buffer size. This is the **default** for streams connected to disk files.
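
A minimal sketch of `setvbuf()` with a caller-supplied buffer and `_IOFBF` follows. The names `LOG_BUF_SIZE`, `log_buf`, and `open_fully_buffered()` are hypothetical, and the buffer size is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define LOG_BUF_SIZE 8192             /* arbitrary size chosen for this sketch */

/* Static storage, so the buffer outlives the stream that uses it.
   Only one open stream may use it at a time. */
static char log_buf[LOG_BUF_SIZE];

/* Open path for writing and make the stream fully buffered using log_buf.
   Returns the stream, or NULL on error. (Hypothetical helper.) */
FILE *open_fully_buffered(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return NULL;

    /* setvbuf() must be called before any other I/O on the stream. */
    if (setvbuf(fp, log_buf, _IOFBF, sizeof(log_buf)) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

With this setup, small writes such as `fputs("abc", fp)` stay in `log_buf` and reach the file only when the buffer fills, the stream is flushed, or it is closed.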


### What if we remove `v` from `setvbuf()` ?

Then it becomes `setbuf()`, which is a simplified wrapper around `setvbuf()`.

```C=
#include <stdio.h>

void setbuf(FILE *stream, char *buf);

// Other than not returning a result, setbuf(fp, buf) is the same as:

setvbuf(fp, buf, (buf != NULL) ? _IOFBF : _IONBF, BUFSIZ);

```

- If `buf` is NULL, the stream is unbuffered.
- If `buf` is non-NULL, the stream is fully buffered, and `buf` must point to a buffer of `BUFSIZ` bytes.

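#### Another kind of `setvbuf()`: `setbuffer()`

```C=
#include <stdio.h>

// Nonstandard (BSD) function; like setbuf(), but the caller gives the buffer size
void setbuffer(FILE *stream, char *buf, size_t size);

// Other than not returning a result, setbuffer(fp, buf, size) is the same as:

setvbuf(fp, buf, (buf != NULL) ? _IOFBF : _IONBF, size);

```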


## Flushing a `stdio` buffer

We can use `fflush()` to force the data in a `stdio` output stream to be written (i.e., passed to the kernel via `write()`).

```C=
#include <stdio.h>

int fflush(FILE *stream);

// Returns 0 on success, EOF on error
```

- `fflush(NULL)` flushes all stdio output buffers
- `fflush(stream)` flushes only the output buffer of the given stream


# Controlling Kernel Buffering of File I/O

> Sometimes it's necessary to force flushing of the kernel buffers, for example when an application (e.g., a database journaling process) must ensure that output really has been written to the disk (or at least to the disk's hardware cache) before continuing.

## Synchronized I/O completion

> SUSv3 defines the term `synchronized I/O completion` to mean "an I/O operation that has either been **successfully** transferred [to the disk] or diagnosed as **unsuccessful**"

There are two types of synchronized I/O completion:
1. Synchronized I/O data integrity
2. Synchronized I/O file integrity

They differ in how they treat the file's metadata ("data about data"), which the kernel stores along with the data for a file.
Metadata includes information such as the file owner and group; file permissions; file size; number of (hard) links to the file; timestamps indicating the time of the last file access, last file modification, and last metadata change; and file data block pointers.

To inspect this metadata from the shell, use `stat` for full details or `ls -l` for a shorter summary.
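
The same metadata can also be read from a program with the `stat()` system call. The sketch below prints the fields listed above; `print_file_metadata()` is a hypothetical helper name, and timestamps are printed as raw seconds since the Epoch to keep it short.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print some of the metadata the kernel keeps for path.
   Returns 0 on success, -1 if stat() fails. (Hypothetical helper.) */
int print_file_metadata(const char *path)
{
    assert(path != NULL);

    struct stat sb;
    if (stat(path, &sb) == -1)
        return -1;

    printf("owner uid:   %ld\n", (long) sb.st_uid);
    printf("group gid:   %ld\n", (long) sb.st_gid);
    printf("permissions: %lo\n", (unsigned long) (sb.st_mode & 07777));
    printf("size:        %lld bytes\n", (long long) sb.st_size);
    printf("hard links:  %ld\n", (long) sb.st_nlink);
    printf("last access: %ld\n", (long) sb.st_atime);
    printf("last modify: %ld\n", (long) sb.st_mtime);
    printf("last change: %ld\n", (long) sb.st_ctime);
    return 0;
}
```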

###  Synchronized I/O data integrity 

For `read()`
~ The requested file data has been transferred (from the disk) to the process. If there were pending writes affecting the requested data (e.g., still sitting in the page cache), **they are flushed to disk** before the read is performed.

For `write()`
~ The write is complete when the data has been transferred to the **disk**, along with all file metadata required to retrieve that data (e.g., the file size). Other metadata may remain in the page cache (e.g., the last modification timestamp).

If a ==crash== happens, the file data can still be retrieved, although metadata such as timestamps may be out of date.

### Synchronized I/O file integrity

A **superset** (a set that contains another set) of synchronized I/O data integrity, which means file integrity completion includes everything in data integrity, plus more.

> The difference with this mode of I/O completion is that during a file
update, ==all updated file metadata is transferred to disk==, even if it is not necessary for the operation of a subsequent read of the file data

## System calls for controlling kernel buffering of file I/O

fsync()
~ flushes the buffered data and all metadata of the open file `fd` to disk, forcing the file to the synchronized I/O **file integrity** completion state

```
#include <unistd.h>
int fsync(int fd);

// Returns 0 on success, or -1 on error


```

fdatasync()
~ flushes the buffered data of the open file `fd` to disk, forcing the file to the synchronized I/O **data integrity** completion state

```
#include <unistd.h>

int fdatasync(int fd);

// Returns 0 on success, or -1 on error

```

> Using fdatasync() potentially reduces the number of disk operations from the **two** required by fsync() to **one**.

> Performance improves when unnecessary metadata updates are avoided, since file data and metadata may reside in **different parts of the disk**

> In Linux 2.2 and earlier, fdatasync() is implemented as a call to fsync(), and thus carries no performance gain.
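
As a rough sketch of how the two calls are used, the helper below writes a buffer to a file and then forces it to disk with either `fsync()` or `fdatasync()`. The name `save_durably()` and its `file_integrity` flag are hypothetical, chosen only for this illustration.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path (truncating it), then force them to disk.
   file_integrity != 0: fsync(), i.e. data plus all metadata.
   file_integrity == 0: fdatasync(), i.e. data plus only the metadata
   needed to read it back.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
int save_durably(const char *path, const void *data, size_t len,
                 int file_integrity)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {                      /* handle partial writes */
        ssize_t n = write(fd, p, left);
        if (n == -1) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    int rc = file_integrity ? fsync(fd) : fdatasync(fd);
    if (rc == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller that does not care about timestamps being up to date after a crash could pass `0` for `file_integrity`; one that does would pass a nonzero value.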

### `sync()`
While `fflush()` flushes data from the user-space stdio buffer into the page cache, **`sync()`** flushes everything in the **page cache/kernel buffers** to **disk**.

> sync() flushes all buffered file data and metadata in the system to disk, while fsync(fd) flushes data for a specific file. sync() provides system-wide flushing, whereas fsync() offers fine-grained control for ensuring data durability of a single file.

#### Real-world usage

sync()
~ before shutdown; system-wide consistency

fsync()
~ saving a file (e.g., in a text editor); databases; anything requiring data durability


## Making all writes synchronous: O_SYNC


### `O_SYNC`

Specifying `O_SYNC` in `open()` makes all subsequent `write()` calls synchronous.

```c
fd = open(pathname, O_WRONLY | O_SYNC);
```

After opening the file with `O_SYNC`, each `write()` automatically flushes both file data and metadata to disk. In other words, writes follow **synchronized I/O file integrity completion**.

`O_FSYNC`  
~ older BSD name for `O_SYNC`; in `glibc`, `O_FSYNC` is defined as a synonym for `O_SYNC`.

#### Performance impact of `O_SYNC`

`O_SYNC` can severely reduce performance because every `write()` must wait until the data and metadata are flushed to disk.

![image](https://hackmd.io/_uploads/SkNctN6cbg.png)


- elapsed time can become much larger than CPU time, because the process blocks while waiting for disk I/O
- the slowdown is especially extreme for small buffer sizes
- larger write buffers reduce the overhead, but `O_SYNC` is still costly

:::warning
Modern disk drives have large internal caches, and by default **O_SYNC** **merely** causes data to be transferred to that cache.

If we disable caching on the disk (using the command `hdparm -W0`), the performance impact of `O_SYNC` becomes even worse. In the 1-byte case, the elapsed time rises from 1030 seconds to 16000 seconds.
:::


### Takeaway

If forced flushing is required, it is often better to:

- use larger `write()` buffer sizes
- call `fsync()` or `fdatasync()` occasionally

instead of opening the file with `O_SYNC`.
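
One way to picture this advice is a writer that appends records normally and calls `fdatasync()` only once per batch. The sketch below is an assumption-laden illustration: `append_with_periodic_sync()` and its `sync_every` parameter are hypothetical, and a short `write()` is simply treated as an error to keep the code small.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append count records to path, calling fdatasync() once after every
   sync_every records (and once more for a final partial batch),
   instead of opening the file with O_SYNC.
   Returns the number of fdatasync() calls made, or -1 on error.
   (Hypothetical helper.) */
int append_with_periodic_sync(const char *path, const char *const records[],
                              int count, int sync_every)
{
    assert(path != NULL && records != NULL && sync_every > 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    int syncs = 0;
    int unsynced = 0;
    for (int i = 0; i < count; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t) len)
            goto fail;                      /* error or short write */
        if (++unsynced == sync_every) {
            if (fdatasync(fd) == -1)
                goto fail;
            syncs++;
            unsynced = 0;
        }
    }
    if (unsynced > 0) {                     /* flush the final partial batch */
        if (fdatasync(fd) == -1)
            goto fail;
        syncs++;
    }
    return close(fd) == -1 ? -1 : syncs;

fail:
    close(fd);
    return -1;
}
```

The trade-off is that up to `sync_every` records can be lost in a crash, in exchange for far fewer waits on the disk than with `O_SYNC`.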


### `O_DSYNC` and `O_RSYNC`

`O_DSYNC`
~ makes `write()` follow **synchronized I/O data integrity completion**, similar to `fdatasync()`. Only file data and metadata required for later data retrieval are flushed.

`O_SYNC`
~ makes `write()` follow **synchronized I/O file integrity completion**, similar to `fsync()`. Both file data and all relevant metadata are flushed.

#### Difference between `O_DSYNC` and `O_SYNC`

`O_DSYNC`
~ flushes file data and only the metadata needed to retrieve that data

`O_SYNC`
~ flushes file data plus all file metadata required for full file integrity

`O_RSYNC`
~ used together with either `O_DSYNC` or `O_SYNC`, and extends their synchronized I/O behavior to `read()` operations

- `O_RSYNC | O_DSYNC`: before each `read()`, pending writes affecting the requested data are completed according to **data integrity** requirements
- `O_RSYNC | O_SYNC`: before each `read()`, pending writes affecting the requested data are completed according to **file integrity** requirements


# Summary of I/O Buffering

![image](https://hackmd.io/_uploads/SJoMnVa9bx.png)


## Overview of output buffering

Output file I/O involves two buffering layers:

1. **stdio buffer** in user space  
2. **kernel buffer cache** in kernel space  

Data written by stdio functions is first stored in the **stdio buffer**.  
When that buffer becomes full, the stdio library calls `write()`, which copies the data into the **kernel buffer cache**.  
Later, the kernel flushes the cached data to disk.
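
Pushing data through both layers by hand therefore takes two steps, which can be sketched as below. `flush_to_disk()` is a hypothetical helper name; it relies on `fileno()` to obtain the file descriptor underlying the stream.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Force everything written to fp so far all the way to disk.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
int flush_to_disk(FILE *fp)
{
    assert(fp != NULL);

    if (fflush(fp) == EOF)              /* stdio buffer -> kernel buffer cache */
        return -1;
    if (fsync(fileno(fp)) == -1)        /* kernel buffer cache -> disk */
        return -1;
    return 0;
}
```

Calling only `fsync()` would miss data still held in the stdio buffer, and calling only `fflush()` would leave the data in the kernel buffer cache.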

## Controlling buffering

### Explicit flushing
These can be used at any time to force buffered data to be written out:

- `fflush()`  
  ~ flushes the stdio buffer

- `fsync()`, `fdatasync()`, `sync()`  
  ~ flush kernel buffers to disk

### Automatic flushing
These make flushing happen automatically:

- `setbuf()`, `setvbuf()`  
  ~ control or disable stdio buffering

- `O_SYNC`, `O_DSYNC`  
  ~ make `write()` operations synchronous, so data is flushed to disk immediately


# Advising the Kernel About I/O Patterns

`posix_fadvise()`
~ system call that allows a process to give the kernel a hint about its likely pattern for accessing file data. The kernel **may** use the information from `posix_fadvise()`, but it is only a **hint**, not a requirement, so the kernel is free to ignore the advice if it is not useful or practical.

```C=
#define _XOPEN_SOURCE 600
#include <fcntl.h>

int posix_fadvise(int fd, off_t offset, off_t len, int advice);

// Returns 0 on success, or a positive error number on error
```
> Calling posix_fadvise() has no effect on the semantics (the meaning, i.e., what a program does) of a program


The kernel decides based on whether the hint would likely improve **cache behavior**. 

**Some factors:**
- the advice type you gave (SEQUENTIAL, RANDOM, WILLNEED, DONTNEED)
- the file’s current cached pages
- available memory
- current I/O and cache pressure
- what the kernel/filesystem implementation actually supports

### `advice` argument
The `advice` argument tells the kernel the expected file access pattern.

`POSIX_FADV_NORMAL`
~ no special access pattern; default behavior. On Linux, read-ahead uses the default size.

`POSIX_FADV_SEQUENTIAL`
~ data will likely be read sequentially, from lower offsets to higher offsets. On Linux, this increases the read-ahead window.

`POSIX_FADV_RANDOM`
~ data will likely be accessed in random order. On Linux, this disables read-ahead.

`POSIX_FADV_WILLNEED`
~ the specified file region will likely be accessed soon. The kernel may preload that region into the buffer cache, so later `read()` calls can get data from memory instead of waiting for disk I/O.

`POSIX_FADV_DONTNEED`
~ the specified file region will likely not be accessed soon. The kernel may free the corresponding cache pages. If pages are modified, they may need to be flushed first.

`POSIX_FADV_NOREUSE`
~ the specified file region will likely be accessed only once and not reused. On Linux kernels of the book's era, this has no effect.
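
A minimal sketch of giving such a hint is shown below, assuming a program that is about to read a whole file from start to end. `open_for_sequential_read()` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path read-only and advise the kernel that it will be read
   sequentially. Returns the file descriptor, or -1 if open() fails.
   (Hypothetical helper.) */
int open_for_sequential_read(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* offset 0 with len 0 covers the whole file. posix_fadvise() returns
       an error number rather than setting errno; because the advice is
       only a hint, a failure is deliberately ignored here. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return fd;
}
```

The returned descriptor is then used with ordinary `read()` calls; the program behaves the same whether or not the kernel acts on the hint.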

# Bypassing the Buffer Cache: Direct I/O

**Direct I/O** bypasses the kernel buffer cache and transfers data directly between **user space** and **a file or block device**. 

Direct I/O is **not** generally faster; for most applications it can significantly **degrade performance** because it loses buffer-cache optimizations such as **sequential read-ahead**, **clustered I/O**, and **shared cached buffers** across processes.

Direct I/O is mainly useful for applications with specialized I/O requirements, such as **database systems** that already perform their own caching and I/O optimizations. 

### How to perform direct I/O

Direct I/O is enabled by specifying the `O_DIRECT` flag in `open()` when opening a file or block device. 

```c
fd = open(pathname, O_RDONLY | O_DIRECT);
```

### `O_DIRECT` support

`O_DIRECT` is effective on Linux since kernel `2.4.10`. Support depends on the kernel version and file system. Most native Linux file systems support it, but many non-UNIX file systems (e.g., `VFAT`) do not. If unsupported, `open()` fails with `EINVAL`. 

### Cache coherency warning

Mixing `O_DIRECT` access in one process with normal buffered access in another process for the same file should be avoided, because there is **no coherency** between direct I/O and the buffer cache. 

#### Example: mixing buffered I/O and direct I/O

If Process A writes using normal buffered I/O, the new data may remain only in the page cache.
If Process B then reads the same file using `O_DIRECT`, it bypasses the page cache and may read old data from disk.
Likewise, if Process B writes using `O_DIRECT`, Process A may still read stale data from the page cache.

:::info
**Takeaway**
Buffered I/O uses the page cache, while direct I/O bypasses it. Since they are not automatically synchronized, mixing them on the same file can lead to stale or inconsistent data.
:::

## Alignment restrictions for direct I/O


When using direct I/O, the following must be multiples of the block size:

- the **memory address** of the data buffer
- the **file or device offset**
- the **length** of the transfer

If any of these restrictions is violated, `read()` or `write()` fails with `EINVAL`. 

Here, block size usually means the **physical block size** of the device, typically `512` bytes. On Linux `2.4`, the rules were stricter: alignment, offset, and length had to be multiples of the **logical block size** of the underlying file system (commonly `1024`, `2048`, or `4096` bytes). 
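
The three alignment rules can be checked before issuing a direct I/O call. The sketch below uses `posix_memalign()` as one possible way to obtain an aligned buffer; `direct_io_aligned()` and `alloc_aligned()` are hypothetical helper names, and the block size is supplied by the caller, not queried from the device.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Return 1 if the buffer address, file offset, and transfer length are
   all multiples of block_size, 0 otherwise. (Hypothetical helper.) */
int direct_io_aligned(const void *buf, off_t offset, size_t length,
                      size_t block_size)
{
    assert(block_size > 0);

    return (uintptr_t) buf % block_size == 0
        && offset % (off_t) block_size == 0
        && length % block_size == 0;
}

/* Allocate length bytes whose address is a multiple of block_size.
   block_size must be a power of two and a multiple of sizeof(void *).
   Returns NULL on failure; release the memory with free().
   (Hypothetical helper.) */
void *alloc_aligned(size_t block_size, size_t length)
{
    void *buf = NULL;

    if (posix_memalign(&buf, block_size, length) != 0)
        return NULL;
    return buf;
}
```

For example, a buffer from `alloc_aligned(512, 4096)` passes the check with offset `0` and length `4096`, while the same buffer with offset `100` does not.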

## Example program

The example program `direct_read.c` demonstrates direct I/O input. It:

- opens a file with `O_DIRECT`
- optionally seeks to a specified offset
- allocates an aligned buffer using `memalign()`
- reads from the file using `read()`

The program takes up to four arguments:
1. file path
2. number of bytes to read
3. optional offset
4. optional buffer alignment

The default offset is `0`, and the default alignment is `4096` bytes. 

### Example outcomes

![13-1](https://hackmd.io/_uploads/r1tMWvpc-l.png)

### Example code

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <malloc.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    ssize_t numRead;
    size_t length, alignment;
    off_t offset;
    void *buf;

    if (argc < 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file length [offset [alignment]]\n", argv[0]);

    length = getLong(argv[2], GN_ANY_BASE, "length");
    offset = (argc > 3) ? getLong(argv[3], GN_ANY_BASE, "offset") : 0;
    alignment = (argc > 4) ? getLong(argv[4], GN_ANY_BASE, "alignment") : 4096;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1)
        errExit("open");

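    /* memalign() returns memory aligned to a multiple of its first argument.
       Aligning to 2 * alignment and then adding alignment yields an address
       that is a multiple of alignment but not of 2 * alignment, so we don't
       accidentally test with a more strictly aligned buffer. */
    buf = memalign(alignment * 2, length + alignment);
    if (buf == NULL)            /* check before doing pointer arithmetic */
        errExit("memalign");
    buf = (char *) buf + alignment;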

    if (lseek(fd, offset, SEEK_SET) == -1)
        errExit("lseek");

    numRead = read(fd, buf, length);
    if (numRead == -1)
        errExit("read");

    printf("Read %ld bytes\n", (long) numRead);
    exit(EXIT_SUCCESS);
}
```
# Mixing Library Functions and System Calls for File I/O

## File Descriptor and Stream

A file descriptor (fd) is the low-level UNIX/Linux **integer** handle for an open file.

A stream (`FILE *`) is the higher-level stdio **object** built on top of that handle.

A stream is used with stdio functions such as `fprintf()`, `fgets()`, and `fclose()`.

A file descriptor is used with system calls such as `read()`, `write()`, and `close()`.

:::info
It is possible to mix **stdio library functions** and **I/O system calls** on the same file.
:::
### `fileno()` and `fdopen()`

`fileno(stream)`
~ returns the **file descriptor** associated with a stdio stream

`fdopen(fd, mode)`
~ creates a **stdio stream** from an existing file descriptor

The `mode` argument of `fdopen()` is the same as in `fopen()`. It must be consistent with the access mode of the file descriptor, otherwise `fdopen()` fails.

### Use of `fdopen()`

`fdopen()` is useful for file descriptors returned by interfaces such as **pipes** and **sockets**, since these are created as file descriptors and must be converted to streams before using stdio functions on them.
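
As an illustration with a pipe, the sketch below wraps both ends in stdio streams. `open_pipe_streams()` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Create a pipe and wrap both ends in stdio streams.
   On success, stores the streams in *rd and *wr and returns 0.
   On error, returns -1. (Hypothetical helper.) */
int open_pipe_streams(FILE **rd, FILE **wr)
{
    assert(rd != NULL && wr != NULL);

    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    /* The mode must match the descriptor: fds[0] is the read end,
       fds[1] is the write end. */
    *rd = fdopen(fds[0], "r");
    *wr = fdopen(fds[1], "w");
    if (*rd == NULL || *wr == NULL) {
        /* fclose() also closes the underlying descriptor; a descriptor
           that never got a stream is closed directly. */
        if (*rd != NULL) fclose(*rd); else close(fds[0]);
        if (*wr != NULL) fclose(*wr); else close(fds[1]);
        return -1;
    }
    return 0;
}
```

Because a stream on a pipe is not connected to a terminal, output written with `fprintf()` on `*wr` normally needs an `fflush()` before it can be read from `*rd`.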

### Buffering issue when mixing

System calls such as `read()` and `write()` transfer data directly between user space and the **kernel buffer cache**.

stdio functions use a **user-space buffer** first, and call `write()` only when that **buffer is flushed or becomes full**.

Because of this difference, mixing stdio and system calls on the same file can produce unexpected results.

### Example

```c
printf("To man the world is twofold, ");
write(STDOUT_FILENO, "in accordance with his twofold attitude.\n", 41);
```

The `printf()` output may remain in the stdio buffer, while `write()` goes directly to the kernel. As a result, the `write()` output may appear first.

### Avoiding the problem

- use `fflush()` before mixing stdio functions and system calls
- disabling stdio buffering with `setvbuf()` or `setbuf()` can also help, but may reduce performance because each output operation then causes a `write()` system call
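
The first option can be sketched as a small helper that flushes the stream before bypassing it. `print_then_write()` is a hypothetical name; it treats a short `write()` as an error to keep the example brief.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write prefix through stdio and rest through write() on the same file,
   keeping the output in program order.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
int print_then_write(FILE *fp, const char *prefix, const char *rest)
{
    assert(fp != NULL && prefix != NULL && rest != NULL);

    if (fputs(prefix, fp) == EOF)
        return -1;

    /* Push the stdio buffer into the kernel before bypassing stdio. */
    if (fflush(fp) == EOF)
        return -1;

    size_t len = strlen(rest);
    if (write(fileno(fp), rest, len) != (ssize_t) len)
        return -1;
    return 0;
}
```

Without the `fflush()` call, `prefix` could still be sitting in the stdio buffer when `write()` delivers `rest` to the kernel.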


## Summary

Both the **kernel** and the **stdio library** perform buffering for file I/O. Buffering improves efficiency, but disabling it may reduce performance.

Various system calls and library functions can be used to:
- control kernel and stdio buffering
- force one-time buffer flushes

`posix_fadvise()`
~ allows a process to give the kernel hints about expected file access patterns, which may help the kernel optimize use of the buffer cache and improve I/O performance

`O_DIRECT`
~ Linux-specific flag for `open()` that allows specialized applications to bypass the kernel buffer cache

`fileno()`
~ returns the file descriptor associated with a stdio stream

`fdopen()`
~ creates a stdio stream from an existing file descriptor, helping mix stdio functions and system calls on the same file


.
.
.


# English Notes

In the interim
~ during the time between two specific events; in the meantime (在此期间)


i.e. 
~ (Latin: *id est*; English: that is, namely, in other words, that is to say): used to clarify something

# Word Choice
- make sure -> ensure

# Vocab

naive 
~ 天真的, polos, describes someone lacking experience, wisdom, or judgment. Implies simplistic. 

semantic
~ about meaning

example : 
- “Shut the door” and “Close the door”
→ different words, similar semantics

coherency
~ things matching each other and staying consistent