# Week <8> — Team <27>
**Members:** @112062124, @111000116, @111006239 (student IDs)
:::info
You can write in Chinese.
:::
<style>
.blue   { background-color:#66FFFF; color:darkblue;  padding:2px 4px; border-radius:3px; }
.yellow { background-color:#F9F900; color:darkgreen; padding:2px 4px; border-radius:3px; }
.s124   { background-color:#66FFFF; color:darkblue;  padding:2px 4px; border-radius:3px; }
</style>
## Lecture & Reading Notes
### Pages and Blocks
* **Storage Unit:** Storage devices (HDD/SSD) operate on fixed-size units called ==Pages== (or Sectors), typically **4KB**.
* **CPU Unit:** The CPU accesses memory in units of a ==Cacheline== (e.g., 64 Bytes).
* **Granularity Mismatch:** **Disk I/O unit (4KB) \>\> CPU access unit (64B)**.
* **Buffering:** Many small writes (e.g., from `fprintf`) are accumulated into a 4KB buffer (first in libc, then in the kernel's page cache) and written to disk in one go. This ==buffering strategy== drastically reduces I/O operations because **disk I/O is very slow**.
### The Cache Hierarchy
**Read Path (e.g., `fread()`):**
1. Kernel checks the ==buffer cache== (page cache).
* This cache is ==coherent== (shared by all processes).
2. **Cache Hit:** Kernel copies data from the cache directly to user-space memory.
3. **Cache Miss:**
* Kernel issues a disk I/O (slow).
* Data is read from disk into the buffer cache.
* Data is **copied again** to user-space memory (due to **memory protection**).
**User-space Buffering (e.g., libc):**
* C library functions like `printf` / `fread` buffer in *user-space* first.
<span class="s124"> `printf` writes to libc's buffer. A `write()` system call is only triggered when the buffer is full or `fflush` is called.</span>
* The first `fread` call might trigger a `read()` system call that ==prefetches== 4KB into libc's buffer. Subsequent `fread` calls read from this user-space buffer, avoiding system calls.
* **Trade-off:** Buffering adds copy overhead but **avoids expensive system call overheads**.
* **Key Point:** The **user-space buffer is *not* coherent**. Different processes have their own copies. The Kernel's ==buffer cache== is the ==single point of coherence==.
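A minimal sketch to see this in action (assuming a Linux machine with `strace` installed; the file name is arbitrary): the loop below issues 1000 tiny `fwrite()` calls, but because output to a regular file is fully buffered in libc, only a handful of `write()` system calls actually reach the kernel.
```c
// buffered_writes.c: many small fwrite() calls, few write() system calls
#include <stdio.h>

int main(void) {
    FILE *f = fopen("out.txt", "w");    // regular file: libc uses full buffering
    if (!f) { perror("fopen"); return 1; }

    for (int i = 0; i < 1000; i++)
        fwrite("x", 1, 1, f);           // each byte lands in libc's user-space buffer

    fclose(f);                          // flushes the buffer with one final write()
    return 0;
}
```
Running it under `strace -e trace=write ./a.out` should show the 1000 one-byte writes collapsed into one or two `write()` calls (the exact chunk size depends on libc's buffer size).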
### Direct I/O
* **Problem:** Reading/writing huge files (e.g., 50GB video) fills the buffer cache, evicting other useful cached pages. This is called ==cache pollution==.
* **Solution:** Use ==Direct I/O== (Linux flag: `O_DIRECT`).
* **How:** **Data is transferred directly between user-space memory and the disk**, completely bypassing the kernel's buffer cache.
* `dd if=ubuntu.iso of=/dev/my-usb oflag=direct`
### Memory Mapping
* **Mechanism:** Use the `mmap()` system call.
* **How:** The kernel **maps the file's pages (in the buffer cache) directly into the process's virtual address space**.
* **Benefit 1 (Performance):** ==Avoids the double-copy==. When the app reads/writes those memory addresses, it's operating directly on the kernel's buffer cache.
* **Benefit 2 (Coherency):** Multiple processes mapping the same file automatically share the same kernel buffer pages, **guaranteeing coherency**.
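A minimal `mmap()` read sketch of these two benefits (error handling trimmed; `example.txt` is a placeholder file that must exist and be non-empty):
```c
// mmap_read.c: read a file through the kernel's page cache without read() copies
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // The mapping aliases the file's page-cache pages into our address space.
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Touching p[] reads the cached pages directly; no copy into a separate user buffer.
    fwrite(p, 1, st.st_size, stdout);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```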
### Write Caching & Write Coalescing
* **Write Coalescing:**
* When the kernel receives multiple, small `write()` calls for the same 4KB page, it doesn't write to disk immediately.
* It **merges (coalesces) these writes into a single page buffer** in memory.
* Eventually, it performs **only one 4KB page write** to the disk.
* **Benefit:** **Maximizes write bandwidth** (SSDs are far more efficient with large, sequential I/O than small, random I/O).
* **Risk:** If power is lost before the kernel flushes the buffer to disk, **data in memory is lost**.
* **Durability:** You can call `fsync()` to **force the kernel to flush dirty buffers to disk immediately**.
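A small sketch of this durability pattern (assuming a writable path `data.log`): without the `fsync()` call, the record only reaches the kernel's buffer cache and could be lost on a power failure.
```c
// durable_append.c: write() fills a dirty page in RAM; fsync() forces it to disk
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "important record\n";
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }  // now a dirty page

    if (fsync(fd) < 0) { perror("fsync"); return 1; }  // blocks until the kernel flushes it
    close(fd);
    return 0;
}
```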
### Write-Back and Write-Through Caching
* **==Write-Back Caching== (Default Policy):**
* The `write()` call **returns immediately** after placing data in the buffer cache.
* The kernel writes "dirty" pages to disk later (asynchronously).
* **Pro:** Application perceives writes as very fast.
* **Con:** **Risk of data loss** (on power failure).
* **==Write-Through Caching==:**
* The `write()` call **does not return until the data is confirmed to be on disk**.
* **Pro:** Data is safe (durable).
* **Con:** Application perceives writes as very slow.
* **Direct I/O (`O_DIRECT`)** behaves like a ==write-through== cache.
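As a concrete (hedged) addition to the lecture material: on Linux, opening a file with the `O_SYNC` flag gives per-file write-through behavior, where each `write()` returns only after the data has reached stable storage.
```c
// sync_write.c: O_SYNC makes every write() behave write-through
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {0};
    // This call does not return until the device has the data (slow, but durable).
    if (write(fd, buf, sizeof buf) < 0) perror("write");

    close(fd);
    return 0;
}
```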
### Latency vs. Throughput
* **Latency:** Time to complete a *single* I/O request (e.g., 100μs).
* **Throughput (or Bandwidth):** Total data transferred per second (e.g., 500 MB/s).
* **Key Insight:** **Low latency and high throughput do not always go together**.
* **==Queue Depth (QD)==:** The number of I/O requests sent to a device simultaneously.
* **QD 1:** Send one request, wait for it to complete, send the next. (Performance is limited by latency).
* **QD 32:** Send 32 requests to the SSD at once.
* **Result:** At high QD, the SSD can optimize I/O, achieving **extremely high throughput**, even if the latency for any single request increases.
* **Connection:** Kernel's Write Buffering (coalescing) is a way to **accumulate dirty pages and flush them at a high QD** to achieve high throughput.
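One way to observe this trade-off yourself (a sketch, assuming the `fio` tool is installed; `testfile` is a placeholder path on the drive you want to measure): run the same 4KB random-read workload at queue depth 1 and 32 and compare the reported bandwidth and per-request latency.
```bash
# Queue depth 1: throughput is bounded by single-request latency
fio --name=qd1  --filename=testfile --size=1G --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=1  --runtime=30 --time_based

# Queue depth 32: the SSD can service many requests in parallel
fio --name=qd32 --filename=testfile --size=1G --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=30 --time_based
```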
### Page Alignment
* **Rule:** When using `O_DIRECT` or `mmap`, your **memory buffer** ==must be 4KB page-aligned== (memory address must be a multiple of 4096).
* **Why:** The kernel needs to perform **direct transfer (DMA)** between a memory page (4KB) and a disk page (4KB). They must be perfectly aligned. If not, the kernel would have to perform an internal copy to align them, defeating the purpose of Direct I/O.
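A minimal aligned-buffer sketch (error handling trimmed; `bigfile.bin` is a placeholder): `posix_memalign()` provides the 4096-byte alignment that `O_DIRECT` expects, and the request length and offset are kept at multiples of 4KB as well.
```c
// direct_read.c: O_DIRECT wants a page-aligned buffer and page-multiple sizes/offsets
#define _GNU_SOURCE             // needed for O_DIRECT
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("bigfile.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    // 4096-byte aligned allocation, so the kernel can DMA straight into it.
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n = read(fd, buf, 4096);   // buffer, length, and offset are all 4KB-aligned
    if (n < 0) perror("read");

    free(buf);
    close(fd);
    return 0;
}
```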
### Cache inside SSD
* **SLC Cache:** Modern SSDs have their own internal cache (e.g., ==SLC cache==).
* **How:** Writes first go to the fast SLC cache.
* **Performance Cliff:** When the SLC cache is full, the SSD must slow down to flush this cache to the slower TLC/QLC NAND. This causes a **sudden, severe performance drop**.
* **Thermal Throttling:** Continuous high-pressure writes cause the disk to overheat. The controller will **reduce performance (throttle)** to protect the hardware.
### Cache is Deceptive
* **Benchmarks:** Manufacturer performance numbers are usually **best-case scenarios** (empty drive, high QD, peak performance writing to SLC cache).
* **Real World:** Actual performance (sustained write, full drive) can be much lower.
* **Testing Cache:**
* **Warm Cache (best-case):** `hyperfine --warmup 3 'cmd'`
* **Cold Cache (worst-case):** Clear the OS page cache before each run:
```bash
echo 3 | sudo tee /proc/sys/vm/drop_caches
```
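The two can be combined (assuming `hyperfine` is installed and `sudo` is available): `--prepare` runs the cache drop before every timed run, so each measurement starts cold.
```bash
# Cold-cache benchmark: drop the page cache before every run of 'cmd'
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'cmd'
```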
### Shio (Explorations & Discoveries)
- Genuine questions that arose from the videos/reading (e.g., “Why does X hold under Y?”, “Does this assumption break on Z?”).
- Show how you sought answers: website, short experiments, citations, logs, or your own reasoning.
<span class="yellow">fsync() 在 SSD 上究竟保證了什麼?</span> :
The lecture states fsync() forces a flush to disk, but it also mentions SSDs have their own internal SLC caches. This raises a critical question:
> Does fsync() also force the SSD's internal cache (e.g., its volatile DRAM cache) to flush to persistent NAND flash? Or does it just flush the OS's buffer cache to the SSD's volatile cache? If it's the latter, then a power failure could still lose data that fsync() claimed was "safe."
<span class="yellow">Investigation & Analysis </span> :
My investigation found that this is a critical and dangerous distinction. The fsync() contract is supposed to be a durability guarantee, but this is often broken by hardware.
**The OS's Contract:**
- Flush all "dirty" data for that file from the kernel's page cache to the storage device.
- Issue a special hardware command (e.g., FLUSH CACHE for SATA/ATA, or a FLUSH command for NVMe) to the device itself.
**The Hardware's Role:**
- This hardware-level FLUSH command is the key. It instructs the drive:
"Take any data you have in your own volatile write-back cache (e.g., DRAM or SLC cache) and commit it to non-volatile media (NAND flash)."
- The fsync() system call should not return to the application until the drive reports that this hardware flush is complete.
**The "Consumer-Grade Lie":**
- Many consumer-grade SSDs lie to the OS to get better benchmark scores. They receive the FLUSH command, but instead of actually waiting for the slow NAND write, they just report "Flush complete\!" immediately while the data is still in their volatile DRAM cache.
- Enterprise-grade drives (with "Power Loss Protection" - PLP) solve this. They have capacitors. They honor the FLUSH command, and if power is cut mid-flush, the capacitors provide enough energy to finish writing their cache to NAND.
**Conclusion:**
- fsync() is designed to guarantee full durability, but this guarantee is broken by many consumer-grade drives that prioritize speed over data integrity.
- For any application I write that truly needs durability (e.g., a database, a payment system), I cannot trust fsync() alone. I must also ensure the underlying hardware is enterprise-grade and respects the flush command.
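A practical follow-up check on your own machine (device paths are placeholders; the `hdparm` and `nvme-cli` tools must be installed): query whether the drive's volatile write cache is enabled in the first place.
```bash
# SATA/ATA drives: show the drive's write-caching setting (hdparm -W 0 would disable it)
sudo hdparm -W /dev/sda

# NVMe drives: feature 6 is "Volatile Write Cache" (not every drive implements it)
sudo nvme get-feature /dev/nvme0 -f 6
```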
Given that <span class="s124">`printf` writes to libc's buffer and a `write()` system call is only triggered when the buffer is full or `fflush` is called</span>, why does printing to the terminal feel instantaneous in everyday use? Is it because `printf` calls `fflush` before returning, or is something else going on?
> References:
> [Reddit thread: Why do I need to fflush(stdout) when printf'ing in signal handling](https://www.reddit.com/r/C_Programming/comments/mhejfu/why_do_i_need_to_fflushstdout_when_printfing_in/)
> [Reddit thread: Slow output and pipes: how do you control buffering?](https://www.reddit.com/r/C_Programming/comments/mhejfu/why_do_i_need_to_fflushstdout_when_printfing_in/)
---
The real reason is **libc's buffering modes**. A user-space stream can be in one of three buffering modes:

| Mode | Behavior | Typical case |
|------|----------|--------------|
| **Fully buffered** | Written out only when the buffer is full | `stdout` redirected to a regular file |
| **Line buffered** | Written out whenever a `\n` is seen | `stdout` attached to a terminal |
| **Unbuffered** | Written out immediately | `stderr` |

When a program's `stdout` is a terminal (an interactive device), libc automatically switches to line-buffered mode, so `printf("hello\n");` is flushed the moment the `\n` is emitted. But what if there is no newline? Does the output simply not appear? For example:
```c
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Hello");     // no '\n': stays in libc's line buffer, nothing on screen yet
    sleep(2);            // for these two seconds the terminal shows nothing
    printf(" World\n");  // the '\n' flushes the buffer: "Hello World" appears at once
    return 0;
}
```
In this case you can see that "Hello" is not printed immediately; it only appears once the newline is written. `printf` itself does **not** call `fflush()` at the end. What makes terminal output feel "instant" is **line buffering**. Only `stderr` is written out immediately, because it is unbuffered by default. The mode can be changed with `setvbuf(stdout, NULL, _IONBF, 0);`.
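A tiny variant of the program above (an illustrative sketch) showing the mode switch just mentioned: with `setvbuf` set to `_IONBF`, "Hello" appears immediately even without a newline.
```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    setvbuf(stdout, NULL, _IONBF, 0);  // switch stdout to unbuffered, like stderr
    printf("Hello");                   // now shows up immediately, no '\n' needed
    sleep(2);
    printf(" World\n");
    return 0;
}
```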
Reference: [Why Consumer SSD Reviews are Useless for Database Performance](https://www.percona.com/blog/why-consumer-ssd-reviews-are-useless-for-database-performance-use-case/)
-----
<span class="yellow">What exactly happens if I ignore the 4KB alignment rule for O\_DIRECT?<span>
The text states that O\_DIRECT requires 4KB alignment. This makes sense for a "no-copy" DMA transfer. But what happens if I violate this?
**Hypothesis 1:** The call fails with an error (e.g., EINVAL).
**Hypothesis 2:** The kernel "silently fixes it" by creating a temporary, aligned "bounce buffer," copying my user-space data into it, and then performing the O\_DIRECT write from that buffer. This would kill performance and defeat the purpose.
**Investigation & Deeper Reasoning:**
- The man 2 open page (Linux programmer's manual) confirms Hypothesis 1: The system call (e.g., read() or write()) will fail with the error EINVAL (Invalid Argument).
- This isn't user-friendliness; it's contract-based programming. The O\_DIRECT flag is a contract where the developer says: "I am taking control. I promise my buffers are aligned for DMA. In exchange, I want you (the kernel) to promise no buffering and no copies."
- If the kernel "silently fixed" the alignment (Hypothesis 2), it would be breaking its side of the contract (by performing a copy to a bounce buffer). This would be deceptive and disastrous for performance. The developer would think they have zero-copy DMA but would actually have an extra copy, potentially running slower than standard buffered I/O.
**Conclusion:**
- The EINVAL failure is a feature, not a bug. It's the kernel's way of saying, "Your contract is void. Fix your code." It enforces predictable performance rather than silently breaking the performance promise.
Reference: [open(2), Linux manual page](https://man7.org/linux/man-pages/man2/open.2.html)
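A quick sketch to confirm Hypothesis 1 yourself (behavior can vary slightly by filesystem and kernel; `testfile` is a placeholder): a deliberately mis-aligned buffer should make `read()` fail with `EINVAL`.
```c
// einval_demo.c: mis-aligned buffer with O_DIRECT, expect read() to fail with EINVAL
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    char *base = malloc(4096 + 1);
    char *buf  = base + 1;              // +1 deliberately breaks the 4096-byte alignment
    if (read(fd, buf, 4096) < 0)
        printf("read failed: %s (EINVAL? %s)\n",
               strerror(errno), errno == EINVAL ? "yes" : "no");

    free(base);
    close(fd);
    return 0;
}
```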
-----
<span class="yellow">Exploration: What triggers the kernel's write-back flush (if not fsync)?<span> :
The lecture explains write-back caching: the kernel collects "dirty" pages and writes them "later." This is vague. If fsync is manual, what stops the kernel from never writing data until the system is out of memory, risking catastrophic data loss on a crash?
**Investigation & Analysis:**
- Controlled by background kernel threads (e.g., flusher) and is tunable via /proc/sys/vm/.
- Kernel uses two primary, independent triggers. This reveals a classic Throughput vs. Data-Loss-Risk trade-off.
**Time-Based Trigger (vm.dirty\_expire\_centisecs):**
- What it is: The maximum time a page can be "dirty" (in memory) before it becomes "eligible" to be written out. Default is often 3000 centiseconds (30 seconds).
- The "Shio" Insight: Write-coalescing mechanism. By waiting 30 seconds, the kernel gives applications time to make more writes to the same 4KB page. Kernel can perform one disk I/O for multiple writes.
- Trade-off: High throughput vs 30-second window of data-loss risk if system crashes.
**Pressure-Based Trigger (vm.dirty\_background\_ratio / vm.dirty\_ratio):**
- What it is: "Soft" and "hard" limit on % of total memory that can be dirty (e.g., 10% / 20%).
- The "Shio" Insight: Prevents I/O "stalls" due to excessive dirty pages.
- Trade-off:
- background\_ratio (10%): background threads start flushing asynchronously.
- dirty\_ratio (20%): application performing write() is blocked until memory pressure is relieved.
**Conclusion:**
- Default settings are a compromise: 30s of data-loss risk in exchange for 30s of write-coalescing opportunities, with a safety valve at 10% memory pressure.
- Reference: Linux kernel documentation, `Documentation/sysctl/vm.txt`, sections on the `dirty_*` variables: https://docs.kernel.org/admin-guide/sysctl/vm.html
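These knobs can be inspected directly; the names below are the actual sysctl keys documented at the link above (values will differ per system).
```bash
# How long a page may stay dirty before the flusher threads consider it expired
sysctl vm.dirty_expire_centisecs

# Soft (background flush starts) and hard (write() blocks) dirty-memory ratios
sysctl vm.dirty_background_ratio vm.dirty_ratio

# How often the flusher threads wake up to look for expired dirty pages
sysctl vm.dirty_writeback_centisecs
```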
---
## Hands-on Lab
### Answers to Reflection Questions
Q1. <answer>
Q2. <answer>
...
### Shio (Explorations & Discoveries)
- Extra question(s) we asked:
- <span class="yellow">**Q: If engineers successfully developed 3D cache (stacking an extra L3 cache layer), why not also do it for L1 and L2? Or why not just keep expanding L3 horizontally?**</span>
  A: **Why not also do it for L1 and L2** => L1 and L2 sit right next to the cores, and making them bigger would slow them down. You would end up with an "L1.5" cache that is slower than L1 yet still not big enough to be useful. L3, on the other hand, is in a sweet spot: it is already the slowest level, so the small extra delay from stacking is acceptable.
  **Why not just keep expanding L3 horizontally** => If you expanded the base die laterally (making everything larger), you would reduce yield (fewer chips per wafer, higher defect risk), so it is not cost-effective for CPU vendors.
- Our Discovery(ies):
### What I learned
(each member writes 2–4 sentences, in their own words)
### @112062124
Buffering turns out to be a deep topic. Without the OS's buffering a program could probably still run, but buffering is what lets data handling be organized and prioritized. If nothing were buffered, there would be no notion of urgency: everything would effectively be DIRECT, even though not every message is urgent and some barely matter at all. So we only pay the efficiency cost of going straight to the device when something genuinely important needs to be read or written quickly and safely; most of the time we would rather keep the user experience smooth. Cache miss/hit also came up in computer architecture, but there the focus was on what the hardware should do when they happen. Here the discussion is more about why buffering exists in the first place and when to go direct, i.e., software-level scheduling decisions.
### @111000116
I used to think of "caching" as just a technique for making systems faster, but this week I realized how many layers of trade-offs are involved. For example, as data travels from disk into memory it passes through several layers of cache, and each layer balances speed against coherence. A cache miss adds extra latency, which helps me understand why program performance can vary so much.
Another part that stuck with me is write buffering and the write-back policy. When we call a write function, the OS does not necessarily put the data on disk right away; it stages the data in memory and handles it in batches. This improves efficiency but also introduces the risk of data loss, which taught me that system design is not only about chasing speed but also about stability and safety. Overall, this topic gave me a much deeper picture of how the OS quietly accelerates I/O behind the scenes.
### @111006239
I learned that caching happens at multiple levels, from the CPU's tiny but ultra-fast L1 cache to the larger but slower L2 and L3 caches, and all the way up to the OS buffer cache that keeps disk data in RAM. Recently I have seen new CPUs with 3D V-Cache (for example, the Ryzen 7 9800X3D) and was always curious about how it really works. After doing some research, I found that it is an extra layer of L3 cache stacked vertically on the CPU die, which increases cache capacity without needing a larger die area. Latency stays almost the same, so more data can stay "close" to the cores. That is why it is popular for gaming, since most games reuse a lot of data repeatedly (AI logic, large maps).
## Content Sharing Agreement
**Please uncheck if you disagree!**
- [X] We agree that excerpts of this note may be shared by TA to Discord if our team's note is selected as *shio* highlight.
- [X] We agree that our team ID may be shown on Discord if our team's note is selected as *shio* highlight.
- [X] We agree that the instructor can show excerpts of this note if our team's note is selected as *shio* highlight.
:::warning
* Please don't share this HackMD note directly with non-teammates. But you are encouraged to help others (including non-teammates) on Discord. [Follow our collaboration policy](https://sys-nthu.github.io/os25-fall/admin/policies.html#collaboration-citation)
* Please don't peek at or copy another team's HackMD, or use an LLM to generate your reflection.
:::