OS Memory Management

# OS Memory Management ## Ch. 8 Memory Management ### Memory Hierarchy - Main memory: the only large storage media that the CPU can access directly. [More info](/ABZGtc8UT8e3rvG1o9x-2g#Solid-State-Disks-SSD) - ![image](https://hackmd.io/_uploads/HkVNiMyWyx.png) ### Contigous Memory Allocation - Also known as dynamic memory/space allocation - Heap management in Linux - `mmap(-1)`、`malloc()` 會從 program 的 heap 拿記憶體 - `free()` or `munmap()` 釋出記憶體 - 如果 heap 不夠大了則會用 `brk()` 或 `sbrk()` 增大 heap - 長久操作下來，heap 會產生許多洞，該如何解決這些洞？ - First-fit：使用第一個看到且夠大的洞 - Best-fit：使用最剛好的（夠大且最小），所以必須歷遍整個 heap - Worst-fit: 使用最大的 - Buddy System - Linux 對實體記憶體的管理方法 - 將記憶體切成不同大小的 block，每個 block 大小為 $2^n$，$n$ 又稱為 order - 對於每次配置，尋找最接近其大小的 block，查詢有無空間，若無則往上一層尋找，並把上層的 block 切半，一半配置給他，一半加進下一層的 list（等於這層少了一個 block，下一層多了一個 block） - ![image](https://hackmd.io/_uploads/HJGoemybke.png) - Fragmentation - external：所有可用空間總和大於某個 process 所需要，但因為這些空間不連續所以無法配給該 process 使用 - internal：配置給 process 的 memory 空間大於 process 真正所需的空間，這些多出來的空間該 process 用不到，而且也沒辦法供其他 process 使用 - Slab Allocation - 為了解決 external fragmentation - 根本原因為每次 allocation 的 size 都不同 - 解法：每次都 allocate 同樣的 size - part of Linux memory allocation - 因為 kernel object（e.g. inode）很常需要 allocate / deallocate - ![image](https://hackmd.io/_uploads/SkMLuXkWJe.png) - 在實體記憶體中每個 block 都是相同大小 - Linux kernel 用了 slab allocation + buddy system - 對於比較常 allocate / deallocate 的 object，並且其 size 都相同，則用 slab allocation - 若是 size 都不同的，則用 buddy system ### Address Binding - address binding: assign memory addresses to instructions and data - can happen at - compile time - load time - execution time - Compile-time binding - 如果 inst. 和 data 的位置在 compile 的時候就被確定，那就可以先配置 - 如果起始位址變更或被其他 process block 著，則需要重新 compile - Load-time binding - compile 產生 relocatable code，給出相對位址，在 load 的時候才據此決定真正的位址 - Execution-time binding - ![image](https://hackmd.io/_uploads/S1_S3QyZkg.png) - 需要硬體支援 e.g. relocation reg in MMU (Memory Management Unit) ### Paging - Compaction - 一個解決 external fragmentation 的方法 - ![image](https://hackmd.io/_uploads/Sk6HamyZ1e.png) - Slow - 另一個方法：paging - Paging: A modern approach that separates logical (virtual) address space from physical address space - 將 physical memory 分成固定大小的 block（稱為 frame） e.g. 4 KB - 將 logical memory 也拆成固定大小的 block（稱為 page） - page table: translate logical to physical addresses（OS 負責） - ![image](https://hackmd.io/_uploads/S1Of3hPbJe.png) - page number 經過 page table 找到對應的 frame number 後即轉為 physical address - page table 放在 memory 裡 - page table 由 OS 填，查詢跟轉換由硬體（MMU, Memory Menagement Unit）負責 - e.g. - ![image](https://hackmd.io/_uploads/B1z-CnDbJg.png) - logical memory has 16 bytes and physical memory has 32 bytes - page size and frame size are 4 bytes - Q: how many bits are required to represent - A logical address: 16 bytes -> 4 bits - The page number: 16 bytes / 4 bytes = 4 -> 2 bits - A physical address: 32 bytes -> 5 bits - The frame number: 32 bytes / 4 bytes = 8 -> 3 bits - The displacement (p): 2 bits - Each entry of page table (f): 3 bits (to store frame index) - OS 會有個 free-frame list，當 process 需要配置記憶體時，OS 會去這個 list 找 memory 中可用的洞 #### Implementation of Page Table - 每個 process 有自己的 page table - `PTBR` (Page Table Base Register) 指向 page table - `PTLR` (Page Table Length Register) 標示大小 - Data access 會需要兩次的 memory access，一次取用 page table 一次取用真的要的資料 → sol: translation look-aside buffers (TLBs) - Translation Look-aside Buffers (TLBs) - 用 associative 的概念實作 - 像一個 cache，在 CPU 裡，存最近查詢的結果 - ![image](https://hackmd.io/_uploads/BydTbTv-1l.png) - TLB 查詢過程利用 fully associative，通常只有 64~1024 entries，所以需要 replacement policy - EAT (Effective Access Time) - Associative Lookup need $\varepsilon$ time unit ($\varepsilon<<1$) - Memory access need 1 time unit - TLB hit ratio $\alpha$ - $\text{EAT}=(1+\varepsilon)\alpha+(2+\varepsilon)(1-\alpha)=2+\varepsilon-\alpha$ #### Structure of the Page Table - Two-level Page Table (Hierarchical Paging) - Page table are large and sparse → break up into multi-level page table, cna allocated on demand - e.g. Intel 80386 - logical addr. ![image](https://hackmd.io/_uploads/SJxaZP_W1l.png) - ![image](https://hackmd.io/_uploads/BylR-PuWJe.png) - 右邊的 addr. 都是 physical addr. - 如果還是用 logical addr. 就需要再 paging 一次，沒必要 - pros：page table 間可以不連續，不用一開始就配置很大的空間 - cons：TLB miss 增加 - Hashed Page Tables - 在大於 32-bit 的處理器上常用，其 page number 的長度可以很長 - 若使用 multi-level page table，addr. bit 越多就需要越多層，一旦遇到 TLB miss 就需要查很多層 - ![image](https://hackmd.io/_uploads/BkrJHPObkl.png) - TLB miss 的花費跟 hash function 和 hash table 的大小有關 - Inverted Page Table - 若用先前提到的 forward page table，每個 process 都會自己有一個，當 process 一堆 page table 加起來占的空間會變很大 - Goal：所有 process 共用同個 table - 反過來，存 frame num. 對應的 page num.，這樣 page table 的大小僅跟 physical memory 的大小有關 - 在 memory reference 的時候，要給 PID 和需要的 page num. - 需要給 PID 是因為可能有多個 process 映射到同一個 frame - ![image](https://hackmd.io/_uploads/SJ6N8POb1x.png) - search 常用 hash 做，不然 linear search 太慢了 - pages 若發生 hash conflict 可以直接用 linked list 串在一起 - 可能比較難實現 shared memory → 因為要讓不同 pid 及不同 p 對應到同一個 i 需要特別處理，可能得用 linked list 取得所有指向同一個 i 的 (pid, p) pair ### Segmentation - 為了保護記憶體（權限問題） - A program is a collection of segments. A segment is a logical unit such as: main program, procedure, function, local variables, global variables, ... - Common segments and permission - stack segment `rw-`: local var., function arg., return addr., CPU flags, ... - 為什麼 stack 不能執行：stack overrun attack - ![image](https://hackmd.io/_uploads/Hyfc2b-MJg.png) - 把惡意程式碼當作 function arg. 存進 stack，他會覆蓋掉原本的資料 - 當運行到 return addr. 的時候，反而會跳到惡意代碼的位址，進而運行 - 禁止手段：把 stack 設為不可被執行，一旦被執行則回報 segmentation fault - data segment - bss, data（無 / 有初始值的全域變數） `rw-` - data RO (Read Only) `r--`: const. - heap `rw-` :::success ```cpp= #include<stdio.h> #include<stdlib.h> int main() { char *p = "hello world"; p[0] = 'H'; } ``` 會觸發 segmentation fault，因為 C 語言內會將第四行看作是 constant literal ::: - code segment `r-x` （`r` 是為了 copy-on-write 或 debug） - hardware - logical addr. 可以被拆成 `<segment-number, offset>` - Segment table：可以將 logical addr. 轉成 physical addr.（跟 page table 很像），每個 entry 存 - base：segment 的 starting physical addr. - limit：segment 的長度 - Segment-table base register (STBR) - Segment-table length register (STLR) - ![image](https://hackmd.io/_uploads/ryXaR--Myx.png) - segment table 放在記憶體（DRAM）內 - 查詢過程由 MMU 執行 - 填表由 OS 負責 - 在查詢的過程中也會順便檢查要求的權限是否正確 - ![image](https://hackmd.io/_uploads/ryCSlz-Gyl.png) - e.g. `exe S2 offset=20` - 去 segment table 找到第二個 segment，檢查 limit 是否 < offset - 檢查權限（此處應為 `r-x` 而要求權限為 `x` -> 通過） - 去拿 phsical addr. = 4300+20 - segmentation 能保護記憶體 - segment limit reg. 避免 out of range - 控管每類 segment 的權限（rwx） - 看 user / kernel mode，user mode 只能存取自己的 segment - paging vs. segmentation - paging 和 segmentation 的功能大部分是重複的（paging 也能管理每個 page table 的 rwx 權限） - segmentation 的實作會根據不同 CPU 而有不同方法 - 每種 CPU 對 segmentation 的支援程度不同 - 如 RISC 的支援較少 - 在現今為了 portability，OS 很少使用 segmentation - 在 Linux segmentation 的概念已經用 paging 實現 - 只有在 x86 架構上 context switch 的時候會用到 ### Example - Linux on Intel 80386 - 僅使用 6 個 segment（儘管這顆 CPU 定義了更多） - Shared by all processes - Kernel code - Kernel data - User code - User data - The default LDT (usually not used) - Per-core - Task-state（TSS, 用於切換 user / kernel mode） - segmented paging - 先做 segmentation 再 paging - ![image](https://hackmd.io/_uploads/rk5iHzbf1e.png) - selector 實際上有 3 bits 用來 control - descriptor table (segmentation table) 共有 $2^{13}=8192$ 個 entry - Intel x86-64 - 64 bits 太大了，不用用那麼多 logical address，只會用 48 bit addressing - ![image](https://hackmd.io/_uploads/S1UjJE2NJe.png) - 可以 paging 到一半就出去，此時那個 page 叫 super page - page size 有 4 KB、2 MB、1 GB，最基本是 2-level page table - 如此可以增加 TLB cover 到的範圍 - 如果需要的記憶體比較大且連續，就可以使用比較大的的 page table - arm - 支援不同的 page size，可以選如何切 logical address（12-8-12、8-10-14） - ![image](https://hackmd.io/_uploads/BkEDb4nVJe.png) ## Ch. 9 Virtual-Memory Management ### Demand Paging - Requiring a (large) secondary storage as the backup of memory pages - Bringing a page to memory only when it is needed - 不用讀太多東西進 memory - logical address space 可以比 physical address space 還大 - 提升 degree of multiprogramming - 支援共享記憶體空間 - Valid-invalid bit in page table entry - 硬體支援，page table 多一個 bit - 1 -> valid -> page 對 frame 的對應關係已建立 - 0 -> valid -> page 還沒對應到 frame -> logical addr. 還沒被對應到 physical addr. - 在 addr. translation 時如果這個 bit 是 0 會觸發 page fault（一個 trap），並交由 OS 處理後續 - ![image](https://hackmd.io/_uploads/ryp5JcZz1e.png) - ![image](https://hackmd.io/_uploads/S1Tiyq-Mke.png) - I/O 很慢（在硬碟找 / 拿 page 的過程） - Fetch instruction 可能也會導致 page fault（如果那段 code 還沒被執行過，而且那行在新的 page） - Performance - Page fault rate $0\le p\le 1$ - Effective Access Time $$\begin{gather*}\text{EAT} = (1-p)\times \text{memory access} + p\times(\text{page fault overhead} \\ + \text{ swap page out} + \text{swap page in} + \text{restart overhead})\end{gather*}$$ - 但 page fault 發生時代價很高，所以通常 page fault rate 都很小 - 將 TLB 加進考慮 - TLB hit: ETA = TLB access - TLB miss & no page fault: ETA = TLB access + page table access + TLB update - TLB miss & page fault: ETA = TLB access + page table access + page fault handling + TLB update ### Page Replacement - Memory 滿了之後要把哪些 frame 刪掉（called "victim frame"），page replacement 就是一個選擇要刪掉誰的 policy - mechanism - ![image](https://hackmd.io/_uploads/H1tX69Wfye.png) - 寫回 victim frame - 不一定每次都要做（不然花費兩倍的時間，因為寫入硬碟也很慢），若 victim frame 之前在 memory 中有被修改過，才需要寫回硬碟 - 使用 dirty bit 判斷該 frame 是否有被修改過（MMU 查詢） ### Page Replacement Algo. - FIFO #### Optimal Page Replacement (OPT) - 查看未來會用到的 frame，看哪個會最晚被用到就刪掉那個 - 最佳解 - 實際上不可能，因為系統很難知道之後哪些 frame 會被用到 #### Least-Recently Used (LRU) - 利用 temporal locality - Implementation - ![image](https://hackmd.io/_uploads/rJYsZo-fJx.png) - 每次新 reference 一個 page，就把他拉到最前面（MRU, Most-Recently Used） - replace 時直接拿最後一個 - 通常會有個 look up table 直接拿到那個 page 在 list 的 index，不然 linear search 太慢 - Stack property - ![image](https://hackmd.io/_uploads/HyWRQiWMJe.png) - 如果 stack 只有 3 個 ``` a: 2 1 0 b: 7 2 1 ``` -> 是原先的 subset - stack 存的 page 越多，能包含小 stack 能做到的事 -> 效果越好 - LRU 是 stable algo.（給越多 frame 表現越好，但非線性） - ![image](https://hackmd.io/_uploads/SyntEiWf1g.png) - FIFO 無 stack property，frame 越多 page miss 不一定越少（Belady’s Anomaly） - ![image](https://hackmd.io/_uploads/HyyAsd7mye.png) - 如何解決：cache simulation or ghost cache #### LRU approximation algo. - Also called second chance algo. or clock algo. - 實務上 store 和 load 都是一個 instr.，如果每次呼叫都要動用一堆指令把來找到 LRU 並搬到最前面，太慢 - 用 approximation 加硬體支援來改善 - 但 LRU 還是適合用在 IO-based 的系統（e.g. file system or database） - Reference bit - 每個 page 多存一個 bit，如果該 page 被 reference 過就把他設成 1 - Implementation - 把 cache 當做一個圓，輪流去看每個 page - 如果該 page 的 ref bit 為 1，改成 0 - 如果是 0，代表上一次把他改成 0 後都沒有被 ref 過，所以就可以把他丟掉 - ![image](https://hackmd.io/_uploads/BJGvJYX71x.png) - 指針在看到兩個 1 的時候就會把他們設成 0 - 看到 0 就代表可以把那個 page 換掉 - enhanced ver. - 加入 dirty 一起考量 (ref, dirty) 1. (0, 0) 最適合被丟掉 2. (0, 1) 雖然需要寫回，但因為最近沒用到，所以還算可以被丟掉 3. (1, 0) 雖然不用寫回，但最近會被用到，根據 Spatial locality 之後可能還會用到，如果先把他丟掉之後還需要讀入，如此花的時間就跟上述情形差不多了，於是不太適合 4. (1, 1) 最不適合 #### Counting Algorithms - 存每個 page 目前被 reference 的次數 - 每次換掉最少被使用的 page（LFU, Least Frequently Used） - Issue：曾經很常用的 page 會一直待在 cache 很難被取代（因為 cnt 很大） - Sol: 週期性 `cnt bit >>= 1` (aging) - LRU vs. LFU - LRU (Least recently used) - (+) fast response to locality change - (-) sequential reference（一次 reference 大量的東西 e.g. 複製檔案）會把 cache 原先的內容都洗掉，但通常這種操作只會 reference 一次，cache 其實不需要存他們 - 現況還是較常用 - LFU (Least frequently used) - (+) 可以處理 sequential reference，因為他們的 cnt 不會很大 - (-) slow response to locality change - 累積 cnt（warm up）或把很久沒用但以前很常用的 page 刪掉（cool down）需要一些時間 ### Swap-Space Management - 從 disk 中拿一些空間來做虛擬記憶體，讓記憶體看起來很大，並用上面的 replecement algo. 選擇要放在真的記憶體的資料 - swap space 可以真的是一個檔案（e.g. `C:\pagefile.sys` in Windows、有些 Linux 也有），或者是直接從 disk 分一個 partition（Linux 常用） - ![image](https://hackmd.io/_uploads/rkssvKQm1x.png) - swap map 在 RAM 裡面，其他東西在 disk - 還有 page table 和 page cache 在 memory 內 - 對 page table 內某些已經不在 page cache 的 entry，會有個 bit 標示已被 swap out，他的 frame number 會改為表示 swap partition 的 slot number，標示該 page 放在哪裡 - 如果有 page 要被換掉，則會把他放在 swap partition 空的地方，並更新 page table（利用 page table 的 invalid bit 標示他已不再被 cache 住，該 logical addr. 沒有對應到任何 physical addr.） - swap map 的數字代表當前有幾個 process 正在用，在 shared memory 的時候一個位址可能有多個 process 在存取，每當 detach 的時候就會把這裡減一，減到 0 代表沒有人要用了，就可以把他搬離 swap partition ### OS features based on virtual memory #### Page sharing - shared code - dynamic library (e.g. libc) - kernel code - shared memory - 創建一段記憶體 `shmget()` - 把這段記憶體 attach 到 process 上（在這個 process 的 page table 標示這段記憶體的位址）`shmat()` #### Copy-on-write - A fast implementation of memory duplication between processes, e.g., fast fork() - 在 fork 一個 child 的時候我們需要把 parent 的資料都複製給 child，但直接在 memory 複製一份太不實際 - 因此實際上我們會讓 child 的 page table 也先同樣指到一樣的位址 - 只有在 child 寫入的時候，才會真的從記憶體複製一段給 child 用 (copy on write) - ![image](https://hackmd.io/_uploads/BkaLQ8VQ1l.png) - RO (Read Only) / RW (Read Write) bit：偵測該位址有沒有被寫入 - 由 MMU 查找 → 因此 COW 需要 MMU 支援 - 在 child 的 page table 中，每個 page 的預設值都為 RO - 當 child 想被標記 RO 的 page 的時候，MMU 就會觸發一個 trap 給 OS，讓 OS 去複製記憶體，達成 copy on write - `vfork()` 可用在沒有 MMU 的情況 - parent 先停下來讓 child 用他的 memory - 假定 child 等等會運行`exec()`，等到 child 跑 `exec()` 的時候就會有一個新的記憶體位址給他 - 原先的記憶體就可以還給 parent，child 在上面的修改會被刪掉 #### Memory-mapped file - page table 直接對應到 disk 裡的檔案 - 當我們讀到 page table 裡的那個 entry 後會觸發 page fault，於是就會去找 file system 裡面的那個檔案並讀回來（一般的 page fault 會去找 swap partition） - 若需要修改檔案，可以直接改 memory 內容而不需使用 `fwrite()`，並把 page table 標為 dirty，後續就會由 flush threads 寫回 disk - e.g. `mmap(fd)` - 使得讀寫檔案不用進 kernel mode - `fd` 為一般的檔案，就直接做上面的操作 - `fd` 也可以是 temporary object （在 /dev/shm 下），這些 object 由 `shm_open()` 創建，可以做到 IPC（讓多個 process 可以拿到同一塊記憶體位置） :::warning 這裡的 temporary object 沒有指到任何檔案，只是作為資源共享用 ::: - `fd = -1`：向 kernel 要一塊記憶體 - 於是 page 有分兩種 - Anonymous page - 會指到 process 的記憶體片段，如 data、stack，或 `mmap(-1)` 得到的 memory - 寫回時寫進 swap - File page - 指到一個檔案 - 寫回時直接寫進原本的檔案 - e.g. - Memory of executable binary: file - Process stacks: anonymous - Process heaps: anonymous - Memory created by mmap(fd), fd is regular file: file - Memory created by mmap(fd=-1): anonymous ### Performance Issues #### Thrashing - Thrashing: a state that the kernel is busy swapping pages in and out - ![image](https://hackmd.io/_uploads/ry2_7WNm1l.png) - Thrashing 造成的 performance 問題 - processes 因 page fault 在等待 I/O，使 CPU 使用率下降 - OS 發現 CPU 使用率下降，增加 degree of multiprogramming - 新產生的 processes 跟舊的 processes 搶 memory 中能放的 page 數，使 page fault 更多 - 結果 CPU 的使用率繼續下降 - Working-set model - $\Delta$: a fixed number of page references - WSSi（Working Set if the process $P_i$）: a set of page referenced in the most recent $\Delta$ for process $P_i$ - $D=\sum\text{WSS_i}$. If $D>m\Rightarrow$ Thrashing: de-allocate pages from processes - Thrashing Management - swapper: aka midterm scheduler，可以先釋放 process，之後再把該 process 和做到一半的資料取回來 - swapped process 用的 memory 於是就可以放出來給其他人用 → useful to thrashing management - 但現代 UNIX OS 會更傾向使用 finer-granularity page swapping，因為當系統在 thrashing 的時候把 free page 讀回來會比較花時間 - 而當還是無法解決時，kill process #### Background Dirty Page Flushing - 在背景 flush dirty page（把 dirty page 寫回） - Related Linux kernel threads - kswapd (background) & DirectReclaim (blocking): scan and reclaim memory - 當 scan 的速度太慢，就會用 DirectReclaim 先 block 住直到掃完 - flush threads (pdflush vs. bdi_writeback): flush dirty page to storage #### Pre-Paging (Pre-Fetching) - 在 page fault 的時候多讀一些 requested page 後面的資料 - For Linux，會多讀 128 KB = 32 pages - (+) Less disk head movement - Sequential disk access 比較有效率 - (+) Fewer page fault - Spatial locality - (-) 如果後續沒有用到，IO 和 memory 就被浪費 #### TLB Reach & Page Size tradeoff - TLB Reach = (TLB Size) X (Page Size) - 不用去查 page table 的情況有多少 - 理想上，每個 process 的 working set 都在 TLB - Increase TLB reach - 增加 TLB size - 然而 TLB 都是用 fullt-associative，電路面積很大，不太可能 - Increase page size - 一個 page 可以涵蓋的範圍更大了 - Issue - Internal fragmentation：作業系統配置給 process 的 memory 空間大於 process 真正所需的空間，其他 process 也無法使用那些地方，造成浪費 - more I/O traffic - sol: 提供多種 page size e.g. IA-64 supports 4 KB and 4 MB pages - page size tradeoff - large page - (+) small page table - (+) large TLB reach - (-) 多帶用不到的資料、浪費記憶體 - small page - （上面反過來） ### OS Examples - Status Bits Associated with a Page - x86 - ![image](https://hackmd.io/_uploads/r1bNkMEQJe.png) - present = valid bit: 該 page 在不在 memory - RW bit: read-only / read-write - reference bit - dirty bit - user / supervisor: page mode，supervisor 只能被 kernel mode 讀到 - Linux page reclaiming - ![image](https://hackmd.io/_uploads/H1lmZfNmkl.png) - file page：來源為實體檔案 - anonymous page：來源如 heap、shared memory - scan window：會把 window 內的 page 拿出來檢查看能不能回收 - writeback：若該 page 正在被寫回，還不能被丟掉，於是放回去 inactive - swappiness - anonymous page 和 file page 發生 page fault 時的代價不同 - Swap IO: 因為 anonymous page 比較小且隨機，處理起來較慢 - File IO: 檔案通常都是連續的，在 file system 中處理起來較快 - swappiness: a tunable kernel knob - 控制從兩個 page 類型回收的比例 - 0 (file) ~ 200 (anon) - 通常設為 60，亦即傾向 swap file page - 然而當今大多用 SSD 儲存，IO 更快，使得兩個類型的 page fault 花費差異縮小 - 在最近的一些 Linux distro. 會把 swappiness 設為大於 100（e.g. Android on smartphone） - Mobile system - Swap - 手機的記憶體多使用快閃記憶體（flash memory, e.g. eMMC / UFS / NVMe） - 這類的記憶體有讀寫壽命的問題，且此類裝置的寫出量很大，因此若一直把資料 swap 到硬碟不太好 - 如 Android 會在記憶體內部創一個 swap 區（zRAM），少用的資料會被壓縮並移到這個區域 - 因此 anonymous page fault 的 cost 反而降低了，swappiness 會傾向 200 - Other method for free memory - Android：直接結束 app - iOS：要求 app 釋放記憶體