# Week 10 — Team 27
<style>
.blue { background-color:#66FFFF; color:darkblue; padding:2px 4px; border-radius:3px; }
</style>
<style>
.yellow { background-color:#F9F900; color:darkgreen; padding:2px 4px; border-radius:3px; }
</style>
**Members:** @112062124, @111000116, @112XXXXXX (student IDs)
:::info
You can write in Chinese.
:::
## Lecture & Reading Notes
### Core Concepts of Virtual Memory (VM)
Virtual memory is a powerful abstraction managed by the **Operating System (OS)**.
Its primary goal is to give every single process the illusion that it has **exclusive access to the entire memory (DRAM)**.
---
### Core Properties
- ### Isolation
<span class="blue">Each process gets its **own private address space**.</span>
Process A **cannot see or access** Process B's data without explicit permission from the OS.
- ### Illusion of Size
<span class="blue">Every process believes it has access to the **full memory range** (e.g., all 4GB or 8GB),</span>
regardless of how much physical memory is actually installed or how many other processes (even 10,000) are running.
- ### Management
Behind the scenes, the OS is responsible for **mapping virtual addresses → physical addresses** in DRAM,
or even **swapping them out to disk** when physical memory is full.
The process is completely unaware of this mapping.
---
### The Problem of Fragmentation
- ### External Fragmentation
Occurs when you have enough total free memory to satisfy a request,
but the free space is **not continuous** — there are small, unusable "holes" between allocated blocks.
**Solution: Paging**
To solve this, virtual memory divides both virtual and physical memory into **fixed-size blocks**.
| Type | Description |
|------|--------------|
| Virtual Pages (identified by VPNs) | Fixed-size blocks of the process’s virtual address space |
| Physical Frames (identified by PFNs) | Fixed-size blocks of physical DRAM |
<span class="blue">With paging, a process’s memory can **appear continuous** in its virtual address space,
but be stored in **non-continuous physical frames**, making allocation much more efficient.</span>
---
- ### Internal Fragmentation
Paging introduces its own problem.
If a process requests only half a page of memory, the OS must still allocate a **full page** for it.
The unused space inside that page is **internal fragmentation**.
This creates a **page-size trade-off**:
- **Large Pages:** Fewer pages to manage, smaller page tables (faster lookup), but more internal fragmentation.
- **Small Pages:** Less internal fragmentation, but more pages to manage, leading to massive page tables (slower lookup).
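To make the trade-off concrete, here is a tiny back-of-the-envelope sketch (the 10,000-byte request, the 32-bit address space, and the 4-byte PTE size are all assumptions chosen for illustration):
```clike
#include <stdio.h>

int main(void) {
    long long sizes[] = {4096, 16384};  // 4 KB vs 16 KB pages
    long long request = 10000;          // bytes a process actually asks for
    for (int i = 0; i < 2; i++) {
        long long ps     = sizes[i];
        long long pages  = (request + ps - 1) / ps;  // round up to whole pages
        long long wasted = pages * ps - request;     // internal fragmentation
        long long n_ptes = (1LL << 32) / ps;         // linear-table entries (32-bit VA)
        printf("page=%5lld B: %lld page(s), %lld B wasted, %lld PTEs (%lld KB table)\n",
               ps, pages, wasted, n_ptes, n_ptes * 4 / 1024);  // 4-byte PTEs
    }
    return 0;
}
```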
---
### The Address Translation Process
The CPU's **Memory Management Unit (MMU)** is the hardware responsible for translating
**virtual addresses (VA)** into **physical addresses (PA)** for every load and store instruction.
A virtual address is split into two parts:
| Part | Description |
|------|--------------|
| Virtual Page Number (VPN) | The upper bits, which identify the page |
| Page Offset | The lower bits, which identify the byte within that page |
The MMU’s job is to translate the **VPN → PFN (Physical Frame Number)**.
The offset remains unchanged.
The final **physical address** is formed by combining **PFN + Offset**.
Each process has its own **page table**, stored in physical memory.
A CPU register called the **Page Table Base Register (PTBR)** points to the start of the current process’s page table.
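As a sketch of the mechanics, the toy program below splits a VA into VPN and offset and recombines the PFN with the offset, assuming 4 KB pages (12 offset bits); `fake_page_table_lookup` is a made-up stand-in for the real PTE walk:
```clike
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)

// Placeholder mapping; stands in for a real page-table lookup.
uint64_t fake_page_table_lookup(uint64_t vpn) {
    return vpn + 0x100;
}

int main(void) {
    uint64_t va     = 0x12345678;
    uint64_t vpn    = va >> PAGE_SHIFT;      // upper bits: which page
    uint64_t offset = va & (PAGE_SIZE - 1);  // lower bits: byte within the page
    uint64_t pfn    = fake_page_table_lookup(vpn);
    uint64_t pa     = (pfn << PAGE_SHIFT) | offset;  // PA = PFN + offset
    printf("VA=0x%llx -> VPN=0x%llx, offset=0x%llx -> PA=0x%llx\n",
           (unsigned long long)va, (unsigned long long)vpn,
           (unsigned long long)offset, (unsigned long long)pa);
    return 0;
}
```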
---
### The Page Table Entry (PTE)
The page table is an array of **Page Table Entries (PTEs)**.
The VPN is used as an index to find the correct PTE.
Each PTE contains not just the PFN, but also several critical control bits:
| Bit | Purpose |
|-----|----------|
| Valid Bit | Indicates if this translation is valid. Invalid access triggers a segmentation fault. |
| Protection Bit | Specifies access rights (Read, Write, Execute). Violations trigger protection exceptions. |
| Present Bit | Indicates whether the page is currently in DRAM or on disk. A missing page causes a page fault. |
| Dirty Bit | Marks whether the page has been modified since loaded from disk. |
| User/Supervisor Bit | Distinguishes between kernel-only and user-accessible pages. |
| Reference Bit | Records whether the page has been accessed recently (used for replacement decisions). |
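The following is a hypothetical PTE layout for illustration only (real architectures such as x86-64 fix these bits at architecture-defined positions); it sketches the kind of checks the MMU performs using the bits in the table above:
```clike
#include <stdio.h>
#include <stdint.h>

// Hypothetical bit positions, not a real architecture's layout.
#define PTE_VALID      (1u << 0)
#define PTE_WRITE      (1u << 1)   // protection: writable
#define PTE_EXEC       (1u << 2)   // protection: executable
#define PTE_PRESENT    (1u << 3)   // currently in DRAM (vs. swapped to disk)
#define PTE_DIRTY      (1u << 4)   // modified since loaded from disk
#define PTE_USER       (1u << 5)   // user-accessible (vs. kernel-only)
#define PTE_REFERENCED (1u << 6)   // accessed recently (for replacement)

typedef struct {
    uint64_t pfn;    // physical frame number
    uint32_t flags;  // control bits above
} pte_t;

// The kind of checks the MMU performs on every access.
const char *check_access(const pte_t *pte, int is_write) {
    if (!(pte->flags & PTE_VALID))             return "segmentation fault";
    if (is_write && !(pte->flags & PTE_WRITE)) return "protection exception";
    if (!(pte->flags & PTE_PRESENT))           return "page fault (swap in)";
    return "OK";
}

int main(void) {
    pte_t pte = { .pfn = 0x1d94a1, .flags = PTE_VALID | PTE_PRESENT };
    printf("read:  %s\n", check_access(&pte, 0));  // OK
    printf("write: %s\n", check_access(&pte, 1));  // protection exception
    return 0;
}
```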
---
### Challenge 1: Page Tables are Too Big (Space)
A simple **linear page table** (one giant array) is extremely wasteful.
A 32-bit system with 4KB pages needs **2²⁰ entries**, which at 4 bytes per PTE is **4MB per process**.
Most of these entries are unused because programs have **sparse address spaces**.
**Solution: Multi-Level Page Tables**
This creates a **tree-like structure** where the VPN is split into multiple parts.
Example: in a two-level system,
- The first part (Page Directory Index) selects an entry in the page directory, which points to a Level 2 page table.
- The second part (Page Table Index) selects the final PTE within that table.
If a huge region of memory is unused, its corresponding Level 1 entry is marked invalid,
and the entire Level 2 page table for that region is not allocated — saving enormous space.
Modern Intel CPUs use **4-level page tables**.
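A minimal sketch of a two-level walk, assuming the classic 32-bit split of 10 directory bits, 10 table bits, and a 12-bit offset (the `page_dir_t` type and `walk` function are made up for illustration):
```clike
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

// A NULL directory slot means the whole 4 MB region it covers is unmapped,
// so its Level 2 table is simply never allocated.
typedef struct {
    uint32_t *tables[1024];  // level 1: pointers to level-2 tables (or NULL)
} page_dir_t;

int walk(page_dir_t *dir, uint32_t va, uint32_t *pa_out) {
    uint32_t dir_idx = (va >> 22) & 0x3FF;  // top 10 bits
    uint32_t tbl_idx = (va >> 12) & 0x3FF;  // next 10 bits
    uint32_t offset  =  va        & 0xFFF;  // low 12 bits

    uint32_t *table = dir->tables[dir_idx];
    if (table == NULL) return -1;           // unmapped region: no L2 table exists
    uint32_t pfn = table[tbl_idx];          // (valid-bit check omitted)
    *pa_out = (pfn << 12) | offset;
    return 0;
}

int main(void) {
    static page_dir_t dir;                      // all directory slots start NULL
    uint32_t *l2 = calloc(1024, sizeof(uint32_t));
    l2[0x345] = 0x777;                          // map one page: VPN 0x345 -> PFN 0x777
    dir.tables[0] = l2;

    uint32_t pa;
    if (walk(&dir, 0x00345678, &pa) == 0)
        printf("PA = 0x%x\n", pa);              // prints 0x777678
    return 0;
}
```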
---
### Challenge 2: Page Tables are Too Slow (Speed)
Multi-level page tables solve the **space problem** but create a **speed problem**.
A 4-level table requires 4 separate memory accesses for a single translation — very slow.
**Solution: The TLB (Translation Lookaside Buffer)**
The **TLB** is a small, fast hardware cache inside the MMU that stores **recently used VPN → PFN translations**.
| Case | Description |
|------|--------------|
| TLB Hit | The VPN is found in the TLB → translation is retrieved instantly (1 cycle). |
| TLB Miss | The VPN is not found → the MMU performs a slow page table walk (tens or hundreds of cycles), then stores the result in the TLB. |
Because programs have good **locality** (e.g., accessing array elements a[0], a[1], a[2] on the same page),
the TLB hit rate is typically very high (around 99%), keeping the system fast.
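A toy software model of this lookup (a real TLB is a small set-associative hardware structure, not a loop; `page_table_walk` here is a stub standing in for the multi-level walk above):
```clike
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct { uint64_t vpn, pfn; int valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

// Stub for the slow path: a real implementation would walk the page tables.
static uint64_t page_table_walk(uint64_t vpn) { return vpn + 0x100; }

uint64_t translate(uint64_t vpn) {
    for (int i = 0; i < TLB_ENTRIES; i++)      // TLB hit: ~1 cycle in hardware
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].pfn;

    uint64_t pfn = page_table_walk(vpn);       // TLB miss: full table walk
    tlb[vpn % TLB_ENTRIES] = (tlb_entry_t){vpn, pfn, 1};  // cache for next time
    return pfn;
}
```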
---
### Challenge 3: The TLB and Context Switching
When the OS performs a **context switch** (switching from Process A to Process B),
the TLB is full of translations for Process A — invalid for Process B.
- ### Old Solution
**Flush the TLB** on every context switch.
This is slow because the new process starts with an empty TLB and suffers many misses initially.
- ### Modern Solution: PCID / ASID (Process Context Identifier)
A small **tag (PCID)** is added to each TLB entry to indicate which process it belongs to.
When the OS switches contexts, it just tells the MMU the new process’s PCID.
The TLB is not flushed — translations for multiple processes can coexist,
and the MMU uses the PCID to ensure only the correct entries are used.
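A sketch of what the tag buys us, using a hypothetical tagged-entry type: lookup now matches both the VPN and the current process's tag, so stale entries from other processes are simply ignored rather than flushed:
```clike
#include <stdint.h>

typedef struct { uint64_t vpn, pfn; uint16_t asid; int valid; } tagged_tlb_entry_t;

// An entry hits only if the VPN matches AND it belongs to the running process.
int tlb_match(const tagged_tlb_entry_t *e, uint64_t vpn, uint16_t current_asid) {
    return e->valid && e->vpn == vpn && e->asid == current_asid;
}
```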
### Shio (Explorations & Discoveries)
- Genuine questions that arose from the videos/reading (e.g.,
“Why does X hold under Y?”, “Does this assumption break on Z?”).
- Show how you sought answers: website, short experiments, citations, logs, or your own reasoning.
**Q_1:** <span class="blue">Each process gets its **own private address space**.</span> Why can’t we just use one big global page table with PCID to separate processes?
**A_1:**
1. Because the page table itself defines what memory a process can see.
If all processes shared one big table, they could access each other’s memory — no real isolation.
Note: in other words, with one global table, the OS would need extra checks to ensure each access is allowed.
The table itself would also have to be hidden or restricted, or else every process could read others’ mappings.
This would make the design more complex and reduce the flexibility developers have to manage memory freely.
2. <span class="yellow">PCID only helps the CPU know *which TLB entries* belong to which process. </span>
It improves speed during context switches but doesn’t provide memory protection.
Each process needs its own page table to keep its address space private and safe.
PCID only avoids flushing the TLB; it doesn’t replace per-process page tables.
**Q_2:** <span class="blue">Every process believes it has access to the **full memory range** (e.g., all 4GB or 8GB).</span> What happens if it really tries to use more than the available memory?
**A_2:**
There are several possible outcomes depending on the situation:
1. **Page Fault**
- The page is valid but not currently in physical memory.
- The OS will fetch the data from disk (swap in). (Here `swap` means moving rarely-used pages back to `disk`; a special area of the disk, the `swap space`, is reserved for this.)
- ✅ No error — this is a normal part of virtual memory.
2. **Segmentation Fault**
- The process tries to access an invalid or unmapped address.
- ❌ The OS rejects access and sends a `Segmentation Fault (SIGSEGV)` error.
3. **Out-of-Memory (OOM / MLE)**
- All physical memory and swap space are used up,
so the OS cannot allocate new pages.
- ❌ The OS kills the process and reports an *Out of Memory* error.
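A minimal way to observe case 2 yourself (a deliberately invalid pointer; this sketch is ours, not from the worksheet):
```clike
#include <stdio.h>

int main(void) {
    int *p = (int *)0x1;  // an address almost certainly not mapped for us
    printf("about to write through an invalid pointer...\n");
    *p = 42;              // the MMU finds no valid PTE -> OS sends SIGSEGV
    printf("never reached\n");
    return 0;
}
```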
Q_3:<span class="blue">With paging, a process’s memory can **appear continuous** in its virtual address space,
but be stored in **non-continuous physical frames**, making allocation much more efficient.</span>所以實際上我們在程式中看到的連續陣列,可能在實體記憶體中是不連續的嗎?
A_3:
是的,實際上一個看起來連續的陣列,可能被儲存在不同的 physical pages 中。
下面是一個小實驗,用來驗證這個現象。
```clike
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <inttypes.h>
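// Note: on recent Linux kernels, reading PFNs from /proc/self/pagemap
// requires root (CAP_SYS_ADMIN); without it the PFN field reads as 0.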
uint64_t virt2pfn(void *vaddr) {
    uint64_t value;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); exit(1); }

    uint64_t vaddr_u = (uint64_t)vaddr;
    off_t offset = (vaddr_u / getpagesize()) * sizeof(value);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) { perror("lseek"); exit(1); }
    if (read(fd, &value, sizeof(value)) != sizeof(value)) { perror("read"); exit(1); }
    close(fd);

    if (!(value & (1ULL << 63))) {          // bit 63 = page present
        printf("Page not present!\n");
        return 0;
    }
    return value & ((1ULL << 55) - 1);      // bits 0-54 hold the PFN
}

int main() {
    const int N = 20000;
    int *a = malloc(sizeof(int) * N);
    if (!a) { perror("malloc"); return 1; }
    for (int i = 0; i < N; i++) a[i] = i;   // touch every page so it is mapped

    long page_size = sysconf(_SC_PAGESIZE);
    int ints_per_page = page_size / sizeof(int);
    printf("Page size: %ld bytes (%d ints per page)\n", page_size, ints_per_page);
    printf("------------------------------------------------------------\n");
    printf(" Index | Virtual Address | PFN (Physical Frame)\n");
    printf("------------------------------------------------------------\n");

    uint64_t prev_pfn = virt2pfn(&a[0]);
    for (int i = 1; i < N; i++) {
        uint64_t pfn = virt2pfn(&a[i]);
        // Detect a jump in the physical frame number (non-contiguous PFNs)
        if (pfn != prev_pfn) {
            printf("\n🔸 Found non-contiguous physical pages at index %d!\n", i);
            printf("  Showing nearby entries (i-5 ... i+5):\n");
            printf("------------------------------------------------------------\n");
            for (int j = i - 5; j <= i + 5; j++) {
                if (j < 0 || j >= N) continue;
                uint64_t pf = virt2pfn(&a[j]);
                printf("%6d | %p | 0x%llx\n", j, (void *)&a[j], (unsigned long long)pf);
            }
            printf("------------------------------------------------------------\n\n");
            break;  // only report the first discontinuity; remove to show all
        }
        prev_pfn = pfn;
    }
    free(a);
    return 0;
}
```
Running this under WSL on Windows gives the following output:
```
Page size: 4096 bytes (1024 ints per page)
------------------------------------------------------------
Index | Virtual Address | PFN (Physical Frame)
------------------------------------------------------------
🔸 Found non-contiguous physical pages at index 856!
Showing nearby entries (i-5 ... i+5):
------------------------------------------------------------
851 | 0x5c8e1d961fec | 0x1d94a1
852 | 0x5c8e1d961ff0 | 0x1d94a1
853 | 0x5c8e1d961ff4 | 0x1d94a1
854 | 0x5c8e1d961ff8 | 0x1d94a1
855 | 0x5c8e1d961ffc | 0x1d94a1
856 | 0x5c8e1d962000 | 0x11415e
857 | 0x5c8e1d962004 | 0x11415e
858 | 0x5c8e1d962008 | 0x11415e
859 | 0x5c8e1d96200c | 0x11415e
860 | 0x5c8e1d962010 | 0x11415e
861 | 0x5c8e1d962014 | 0x11415e
------------------------------------------------------------
```
Observations:
In the virtual address space, `0x5c8e1d961xxx → 0x5c8e1d962xxx` looks like one contiguous range of addresses.
But the corresponding `PFN` jumps from `0x1d94a1` to `0x11415e`, which means the underlying physical pages are completely different.
Since the physical address is computed as
`Physical Address = PFN × PageSize + PageOffset`
different PFNs correspond to entirely different locations in physical memory.
<span class="yellow">In summary, this confirms that the data is physically non-contiguous. A small bonus observation: the PFN change happens exactly where the VA crosses into `0x...962000`, so even though the layouts differ, VA and PA are aligned the same way, which indirectly confirms that both are divided into 4 KB pages.</span>
Q_4:<span class="blue">How does virtual memory inside virtual machines works?</span>
Normally, each process has its own virtual address space, but inside a VM, there's another layer of virtualization. Guest OS thinks it can manages virtual memory like a normal OS does, but the "physical memory" it sees is virtualized by the hypervisor, which the "real physical memory" is actually managed by the host like it's supposed to be.
So the translation would work like
Guest Virtual Address (GVA) -> Guest Physical Address (GPA) -> Host Physical Address (HPA)
Also, with hardware support like nested paging, and the TLB can cache these combined translations for efficiency. After CPU translates GPA to HPA, CPU'll directly use the HPA and save those in TLB. So repeated access are fast without walking both tables. It’s like having a map of a village (guest page table) inside a city. The city map (hypervisor table) shows where each village is located. Nested paging is like having a GPS that can combine both maps instantly, so instead of looking at each map separately, you can just ask the GPS (TLB) to guide you directly.
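A conceptual sketch of the chain (both walks are stubs standing in for real page-table walks; with nested paging, hardware performs the combined walk and the TLB caches the end-to-end result):
```clike
#include <stdio.h>
#include <stdint.h>

static uint64_t guest_walk(uint64_t gva) { return gva + 0x1000; }  // GVA -> GPA (stub)
static uint64_t host_walk(uint64_t gpa)  { return gpa + 0x2000; }  // GPA -> HPA (stub)

int main(void) {
    uint64_t gva = 0x4000;
    uint64_t gpa = guest_walk(gva);  // what the guest OS believes is "physical"
    uint64_t hpa = host_walk(gpa);   // where the data actually lives on the host
    printf("GVA 0x%llx -> GPA 0x%llx -> HPA 0x%llx\n",
           (unsigned long long)gva, (unsigned long long)gpa,
           (unsigned long long)hpa);
    return 0;
}
```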
---
## Hands-on Lab
### Answers to Reflection Questions
### 1. Why does increasing the page size reduce TLB misses?
Increasing the page size reduces TLB misses because it increases the effective "coverage" of each entry in the TLB (Translation Lookaside Buffer).
The TLB is a small, fast cache that stores recent virtual-to-physical address translations. Each entry in this cache maps one page. By increasing the page size from 4KB to 16KB, a single TLB entry now covers four times as much memory.
Because programs often access memory in the same general area (locality), this larger coverage means the system is far less likely to need a translation for a new page. This results in fewer TLB misses, which avoids the slow, multi-level page table walks that are required when a translation isn't in the cache.
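As a rough worked example (the 64-entry TLB is an assumed size, not a number from the lab): with 4KB pages, 64 entries cover 64 × 4KB = 256KB of memory, while with 16KB pages the same 64 entries cover 64 × 16KB = 1MB. A working set of, say, 500KB therefore fits entirely within TLB reach only with the larger pages.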
### 2. Why is internal fragmentation less of a problem in modern allocators?
Internal fragmentation (wasted space inside a page) is less of a problem because modern memory allocators, like Android's ART runtime or the system malloc, do not allocate a full 16KB page for every tiny object.
Instead, these allocators work more efficiently:
- They request large memory regions (called "arenas" or "slabs") from the operating system.
- They then sub-divide these large regions internally to manage many small objects.
This allows many tiny objects to be packed together and share the same page. As a result, the amount of wasted space per page is minimal. The significant performance gains from reduced TLB misses are considered a much more important factor than this small, well-managed increase in internal fragmentation.
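A toy sketch of the arena idea (ours, not how ART or the system malloc actually implement it): one large `mmap` region is carved into many small objects, so tiny allocations end up sharing pages:
```clike
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ARENA_SIZE (1 << 20)  // one 1 MB request to the OS

static uint8_t *arena_base, *arena_next;

void *arena_alloc(size_t n) {
    if (arena_base == NULL) {  // first call: reserve the whole arena at once
        arena_base = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena_base == MAP_FAILED) { arena_base = NULL; return NULL; }
        arena_next = arena_base;
    }
    n = (n + 15) & ~(size_t)15;  // keep 16-byte alignment
    if (arena_next + n > arena_base + ARENA_SIZE) return NULL;  // arena full
    void *p = arena_next;
    arena_next += n;             // tiny objects end up packed on the same pages
    return p;
}

int main(void) {
    char *a = arena_alloc(24), *b = arena_alloc(24);
    printf("a=%p b=%p (same page: %d)\n", (void *)a, (void *)b,
           ((uintptr_t)a >> 12) == ((uintptr_t)b >> 12));
    return 0;
}
```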
### 3. What changed between 2008 and 2025 that made 16KB pages practical for Android?
The primary change is the shift from "memory poverty" to "memory abundance" and a corresponding shift in engineering priorities.
- **In 2008 (memory poverty):** The first Android phones (like the HTC Dream) had extremely limited RAM (e.g., 192MB). The main challenge was simply fitting the OS and apps into this tiny space. A 4KB page size was a good balance, as minimizing memory waste (like internal fragmentation) was critical.
- **By 2025 (memory abundance):** Modern phones have 8GB, 12GB, or even 16GB of RAM. Memory scarcity is no longer the main constraint.
- **New priority:** The new challenge is efficiency: making the system faster and more power-efficient. Modern graphics-intensive games and complex apps are limited by the speed of address translation.
In this new environment, the small amount of memory "wasted" by 16KB pages is an acceptable trade-off for the significant performance gains (faster app starts, lower power draw) that come from having fewer TLB misses.
### Shio (Explorations & Discoveries)
- Extra question(s) we asked:
- Q: <...> A: <what we found / how we tested>
- Our Discovery(ies):
### What I learned
(each member writes 2–4 sentences, in their own words)
### @112062124
Through Worksheet 10, I discovered that many debates in operating system design actually come down to historical legacy.
Take page size as an example: on the question of whether to adopt `16 KB` or `4 KB`,
the material the professor provided makes it clear that,
in terms of performance and TLB hit rate, the real-world benefit of `16 KB` far outweighs the internal fragmentation it causes compared with `4 KB`. (And of course, Large Pages / Huge Pages can further improve the TLB side.)
Even so, Windows still sticks with its 4 KB design.
I believe the main reasons behind this are "inertia of use" and "compatibility risk":
`4 KB` has been the standard for so long that any change would ripple through everything.
Forcing a switch to `16 KB` could break a large number of applications, drivers, and even hardware logic. (Just like how a new update can sometimes introduce more bugs 😎)
So unless Windows 12 decides to abandon the `4 KB` architecture entirely,
and hardware vendors such as Intel and AMD change their `MMU` designs in lockstep,
letting `4 KB` and `16 KB` coexist in the same system
could bring enormous confusion and unpredictable errors. (This is the `Hybrid pages` idea the professor mentioned.)
### @111000116
This week's virtual memory material really opened my eyes. I used to know only that VM is an abstraction layer that lets every program believe it has memory all to itself, but I never imagined how many interlocking trade-offs are hidden underneath.
What impressed me most was the evolution of the page table. When I first heard that a 32-bit linear page table occupies 4MB of memory, I immediately saw what a serious waste of space that is; then I learned how multi-level page tables exploit the sparseness of address spaces to save space, which struck me as a very clever design.
What I did not anticipate is that this solution immediately incurs a serious cost: looking up a single address may now require multiple memory accesses (four of them).
This made me realize that the TLB is not just an "extra cache" or an "optimization"; it is the very thing that makes multi-level page tables practical. Without the TLB's high hit rate hiding most of the lookup latency, the performance overhead of multi-level page tables would be too large to be usable.
This class taught me that system design is always a chain of trade-offs: you design B (multi-level tables) to solve problem A (space), but B creates problem C (speed), so you must introduce D (the TLB) to solve C. Understanding the cause-and-effect and dependency chain behind the whole design is, to me, the most elegant part.
### @111006239
My previous understanding of how a computer works was that data is fetched from disk into RAM, and then RAM and the CPU talk to each other directly. I never knew there was such a thing as "virtual memory". After this week, I understand why the naive model is dangerous: every process would have to share the same memory addresses, so one could read or write another program's data. That is why each process needs its own "private" memory space.
It works by having the CPU translate each virtual address in a process's private address space into a physical address, so the process can access its data. To speed up this translation, the CPU has a structure called the "TLB" that stores the Virtual Memory Address -> Physical Memory Address mappings, so we don't have to redo the translation every time.
But which process does a cached "mapping" belong to? Since the CPU performs "context switches" all the time, we have to tag each mapping with the process it belongs to; that tag is the "PCID/ASID".
## Content Sharing Agreement
**Please uncheck if you disagree!**
- [X] We agree that excerpts of this note may be shared by TA to Discord if our team's note is selected as *shio* highlight.
- [X] We agree that our team ID may be shown on Discord if our team's note is selected as *shio* highlight.
- [X] We agree that the instructor can show excerpts of this note if our team's note is selected as *shio* highlight.
:::warning
* Please don't share this HackMD note directly with non-teammates. But you are encouraged to help others (including non-teammates) on Discord. [Follow our collaboration policy](https://sys-nthu.github.io/os25-fall/admin/policies.html#collaboration-citation)
* Please don't peek at or copy another team's HackMD, or use an LLM to generate your reflection.
:::