Linux 核心設計: Memory

--- tags: NCKU Linux Kernel Internals, 作業系統 --- # Linux 核心設計: Memory [Linux 核心設計: 記憶體管理](https://hackmd.io/@sysprog/linux-memory?type=view) ## Buddy Allocator ### 簡介 Buddy allocator 是 linux 對 physical memory 的管理方法。對於連續的記憶體，會被分成許多種不同大小的 "block"，而每個 "block" 是由 2^n (n 為整數)個固定大小的 "page" 組成。我們把這個 "n" 稱為 **order**。則可以使用一個陣列去維護這些不同 order 的 block 形成的 linked list，如下圖。 ![](https://i.imgur.com/r3SjZ6y.png) > 圖源: https://www.twblogs.net/a/5e553337bd9eee2116848bbf 當有記憶體的配置需求時，首先從足夠空間、且最接近大小 order 的 block 開始找起，如果該 order 中沒有可用的 block，那麼就把更大 order 的 block 對半分，切分後的兩個子塊相互爲 buddies。其中一個 block 用於分配，另一個則根據其大小置放回對應 order 的 linked list 下。這些 block 在必要時連續減半，直到達到所需大小的塊可用爲止。而對於釋放的處理，當一個 block 被釋放後，檢查它的 buddy 是否位在可分配的 linked list 中，如果是，將這對 buddies 將會合併，並放進所屬 order 的 linked list 中。這有甚麼好處呢?這讓我們在應對一個記憶體配置的要求發生時，可以根據需要的大小找出合適的連續記憶體空間，減少在需要小空間時反而去一段大的連續記憶體空間中切出的狀況，降低記憶體的 fragmentation 的發生。舉例來說，假設每個 page 是 4K 大小，則當要求配置 12K 的記憶體時，首先從最接近的 order 4，也就是大小為 2^4 = 16K 的 block 開始查詢，如果 order 4 上的 linked list 存在可用的 block，就將其配置出去;如果不存在呢?則將 order 5 中的 block 對半切，則其中一半可以分配出去，另一半則掛回 order 4 的 linked list 上。如果 order 5 上的 linked list 也沒有可用的 block，就再往更高的 order 去找，並且重複對半切的步驟，讓合適的大小分配出去，切剩的部分則根據 order 掛回對應的 linked list 上。當這個配置出去的 16K block 被釋放掉，此時回到 order 4 的 linked list 中查找其 buddy 是否存在，如果存在，這兩者可以合併成一個 order 5 的 block，並且可以重複此步驟合併出更大 order 的連續記憶體空間。可以在 [include/linux/mmzone.h](https://elixir.bootlin.com/linux/latest/source/include/linux/mmzone.h#L409) 中找到相關的資料結構： ```cpp struct zone { ... /* free areas of different sizes */ struct free_area free_area[MAX_ORDER]; ... ``` 而 `struct free_area` 的定義為: ```cpp struct free_area { struct list_head free_list[MIGRATE_TYPES]; unsigned long nr_free; }; ``` * `free_list` 為該 order 下的可用連續空間，其中還區分了不同類型的 free block * `nr_free` 則為這些可用 block 的數量 ### TODO - [ ] 探討 buddy allocator 在 linux kernel 中的實作細節，並修正前面的筆記中不夠詳盡的敘述 ## Bump allocator > Reference to : [The Art and Science of Memory Allocation](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/malloc.pdf) malloc 是撰寫程式時，經常呼叫來從 heap 配置記憶體的函式。如果要粗淺的涵蓋其背後的原理的話，malloc 會透過 [`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html) 從作業系統中取得足夠 page，再從這些 page 中分出 malloc 要求的數量。當然，實際的細節則根據記憶體配置器的不同會有不同的實作。記憶體的配置可以有許多策略，而 Bump allocator 是相對簡單的一種，參考 [Writing an OS in Rust: Allocator Designs](https://os.phil-opp.com/allocator-designs/#bump-allocator) 的說明: Bump allocator 會線性的分配記憶體，如下圖，配置發生時， `next` 會往 heap 的底部移動需要的大小，並且只往此方向移動，且通常實作中會搭配一個 counter，配置時把這個 counter 加一，free 時則將 counter 減一(但 `next` 維持其位置!)。直到碰到底部為止，此時若再要求配置記憶體，如果 counter 的值歸零，表示可以把 `next` 設回開頭，記憶體可以再次被配置，否則就產生 [OOM](https://en.wikipedia.org/wiki/Out_of_memory)。這個實作的缺點顯而易見，如果沒有所有配置出去的記憶體都被釋放，則此記憶體配置器不能將這些部分的釋放回收，造成許多資源的浪費。 ![](https://i.imgur.com/uVVjiY0.png) 顯然，一些更有效率、或者針對特定情境的記憶體配置器策略需要被設計。 ## Hoard allocator 談到記憶體配置的「策略」，有幾點是我們需要考慮的: * Fragmentation * Allocation and free latency * Implementation complexity * Cache behavior ![](https://i.imgur.com/815uekH.png) 相對於簡單但可能不夠有效的 Bump allocator，Hoard allocator 針對這些考慮點做出了相應的策略。 ### Superblocks 在 Hoard allocator 中，它透過名為 "superblocks" 的結構嘗試解決減少 fragmentation 的問題。在此策略中，若干個連續的 page 形成一個 "superblocks"，**每個 superblocks 的大小都相同**。superblocks 被切分成相同大小的 object(此大小需為 b 的若干次方，而大多時候 b = 2)。以 b = 2 為例，因此可能會有 object 大小為 $2^1$、$2^2$、$2^3$... bytes 的 superblocks，superblocks 將這些 object 以一個 LIFO 的 list 去維護。記憶體配置器的任務是根據需要的大小找到合適的 superblocks(或者依此大小建立新的 superblocks)，並從 superblocks 的 list 中取出 object 分配。舉例來說，如果程式需要的物件大小有 4, 5, 7, 34, 和 40 bytes 幾種，可以計算得知需要有 object 大小為 $2^2$、$2^3$、$2^6$ bytes 的 3 種 superblocks。你應可發現此策略並沒有「排除」 fragmentation 的問題，例如 5 bytes 的配置請求會給予 8 bytes 大小的記憶體，但至少這個做法可以讓 fragmentation 的比例抑制在 < 50%。 ### Allocation 基於 superblocks 的概念，在 Hoard allocator 的配置策略中，每個 CPU 都會維護一個獨立的 heap，另外還有一個 global heap 需要被維護。則當配置要求發生: 1. 首先嘗試 per-CPU heap 2. 如果這個大小在 per-CPU heap 目前的 superblocks 中沒有 free block，則改嘗試 global heap 3. 若仍沒有 free block，最後才在 per-CPU heap 建立新的 superblocks。 :::danger Why? ::: 如果需要的 object 大小超過一半的 superblock 大小該如何是好呢? 此時可以直接 `mmap()` 配置即可(還記得 superblock 是由若干個 page 組成嗎?)，不過此情況下， fragmentation 的問題就顯得有些嚴重了。例如一個 4097 byte 的配置需求，就必須配置兩個 page(8K=8192 bytes)，代表有 8192 - 4097 = 4095 bytes 的記憶體浪費。 ### Free 解釋完了配置的策略，Hoard allocator 的釋放又是如何呢? Hoard allocator 的釋放十分簡單，只要將配置的空間放回對應 superblocks 的 free list 中。不過，我們該如何得知這個空間是屬於哪個 superblock?舉個例子就很好理解了: 假設 superblocks 的大小是兩個 page(8K byte=$2^{13}$ bytes)，如果 object 的地址為 0x431a01c，則只要把低位的 13 bits 遮蔽掉，也就是取得位址 0x431a000，則我們知道這個 object 是來自這個位址為基底的 superblocks。 ### Why LIFO? 另一個可能會有疑惑的問題是，為了甚麼 list 是 LIFO 的結構呢?這是來自對 cache 的考量，比較慢回到 free list 的可用記憶體很可能已經在 cache 中，因此直接分配之可以有機會避免從 memory 重新載回 cache 的 cycle overhead。 ### Locking Heap 的分配涉及在多個 CPU 架構下使用的考量，因此 allocator 會需要 lock 來達成同步。不過值得注意的是，在 Hoard allocator 中， per-CPU heap 的配置和釋放仍需要上鎖，這是因為 per-CPU 雖然只有所屬的 CPU 可以 access，但從所屬 heap 分配出來的空間，不一定要由自己呼叫釋放。承上所述，所以呼叫 allocate 和 free 的 CPU 除了要知道這個空間是哪個 superblocks 之外，也要知道這個 superblocks 屬於哪個 per-CPU heap。該如何知道是哪個 superblocks 的方法我們之前已經說明過了，而此時我們可以透過維護 superblocks 的 metadata，得知 superblocks 所屬的 heap，以及對應的 lock 為何。 ### Cache and False Sharing 在多 CPU 的架構下，allocator 的效能會涉及 cache coherence 的考量。我們可以思考一個簡單的 coherence 模型: 對於每個 cacheline 大小的 memory 區域都存在一個 read-write lock，多個 CPU 可以共享 read lock，但 write lock 只能由一個 CPU 持有。因此，當一個記憶體區域被多個 CPU cached 住時，如果其中一個 CPU 對其進行了寫操作，為了保證所有 CPU 上的 cache 中資料的正確性，cache invalidate 會發生，把每個 CPU 中有同對該區域作 cached 的 cacheline 也更新成寫完後的結果。 ![](https://i.imgur.com/1zN2SX3.png) 然而這個作法可能存在效能的隱憂。思考當一個 cacheline 實際存在兩個各自獨立的物件空間，如圖的 `foo` 和 `bar`。則當 `foo` 被更動時，即使對 `bar` 理應並沒有影響，但因為它們存在同一個 cacheline 中，所以那些 cached 住 `bar` 而對 `foo` 沒興趣的 CPU 仍需要做 cache invalidate，這造成了效能的衝擊。如果想簡單避免這個問題的話，讓每個 object 的配置空間必須至少是一個 cacheline 大小會是可行的作法，不過這造成記憶體的浪費。在 Hoard allocator 這個問題是怎麼應對的呢？因為每個 superblocks 同時只能被一個 CPU 持有。因此以配置而言，不同 CPU 的物件不會共用同一個 superblocks，而在釋放上，物件只會回到原本的 superblocks 中，這些作法降低了 false sharing 的發生。不過 false sharing 仍有發生的機會。因為整個 superblocks 是允許在 heap 間移動的，幸運的是，在 Hoard allocator 中這並非頻繁發生的情境。 ### TODO - [ ] 閱讀 [Hoard: A Scalable Memory Allocator for Multithreaded Applications](https://people.cs.umass.edu/~emery/pubs/berger-asplos2000.pdf) 釐清更多細節，並修正前面的筆記中不夠詳盡的敘述 ## Slab allocator :::warning :warning: 因為這個章節牽涉許多細節，短時間內不能很清楚的釐清，因此暫時僅凌亂的紀錄基本的概念，更詳細的內容則需要閱讀更多文件並實際鑽研程式碼，敬請見諒QQ ::: Slab allocator 在 buddy system 之上，其背後的基本思想是 cache 常用的 object 並用於 kernel 中的 allocation。因為 Linux kernel 中經常有連續的分配及釋放記憶體(例如與 process 有關的 `struct task_struct`，與檔案系統有關的 `inodes` / `dentries` 等 object)。則如果我們將釋放的空間 cache 住，當有同樣的結構配置請求時，就可以將相同結構的 object 立即分配出去。我們通常以 slab allocator 作為此層級的記憶體管理方法之名稱，不過隨著 Linux kernel 的演進，slab allocator 存在許多不同的實作，根據策略的不同其實可以分成下列三種: * SLOB * SLAB * SLUB 後面會探討三者的實作區別。 ### Cache 在 slab allocator 中，需要知道兩個重要的名詞: * slab: slab 為一段連續的記憶體，通常由許多物理連續的 page 組成 * cache: cache 為特定 object 的儲存空間，可以很快被配置而 cache 在 slab allocator 中又可以分為兩種: * dedicated: dedicated cache 用來配置常用的 kernel object (如 `mm_struct`, `vm_area_struct`)，cache 中的 object 在釋放時會重新初始化，因此之後再配置時就會快上許多 * generic: 只要大小符合，generic cache 可以給分配不同類型的 object 對於 generic cache，呼叫 `kmalloc` / `kfree` 系列的 kernel API 就可以透過 slab allocator 使用。如果想要為自定義的物件建立 dedicated cache，則需要建立對應的 `kmem_cache` 結構，並使用 `kmem_cache_alloc` / `kmem_cache_free` 進行建立與釋放。兩者背後的實作都與 `kmem_cache` 結構有關。利用 `sudo cat /proc/slabinfo` 命令可以看到不同 slab 的資訊，包含 `task_struct` 這樣的 dedicated cache，或者 `dma-kmalloc-1k` 這樣的 generic cache。 ### SLAB 在 SLAB 的最上層由 doubly linked circular list 串接 `kmem_cache` 結構，稱為 cache chain。cache chain 的每一個節點都是一個 `kmem_cache`。`kmem_cache` 下的 slab 又依據狀態被被置於不同的 list: * slabs_full: slab 中所有的 object 都被配置 * slabs_partial: slab 中的 object 部份被配置，部份為 free * slabs_empty: slab 中所有的 object 都為 free 狀態根據 object 的配置狀況，slab 在三個 list 上移動。當 slab allocator 收到配置請求時，首先查找 `kmem_cache` 的 slabs_partial，如果沒有可用的就查找 slabs_empty，如果皆沒可用的 slab，就透過 buddy allocator 分配新的 page frames 作為 slab 的使用。透過這樣的 slabs 區分，系統可以更好的管理記憶體: slabs_partial 是配置 object 的主要來源，而 slabs_empty 則是回收記憶體的最佳候選。 ### SLUB 相較於 SLAB 針對不同類型的 object / cpu 都要維護各自的 list(queue)，SLUB 中唯一要維護的 queue 是每個 slab page 下的每個 object，因此減少了 metadata 需要的 overhead。此外，SLUB 讓 slab 對應到 CPU 而非 queue，減少 TLB thrashing 對記憶體配置效率的影響。 SLUB 也是目前 linux kernel 預設的 slab allocator，下面嘗試探討 linux 中的實作細節 :::info 參考之程式碼版本為 [linux 5.11.6 版](https://elixir.bootlin.com/linux/v5.11.6/source) ::: :::warning :warning: 下面有關 SLUB 的段落，由於自己還有很多不足以清晰的理解程式碼邏輯，所以解釋的很含糊，只能大致敘述每個階段的工作，真的很抱歉QQ ::: #### `kmem_cache` 當 buddy system 初始化之後，slab allocator 就可以取得 page 並將 page 分割成大小相等的 object，這些 object 的管理顯然需要 metadata 來維護，其中的關鍵結構為 `kmem_cache`。 ```cpp struct kmem_cache { struct kmem_cache_cpu __percpu *cpu_slab; ... unsigned int size; /* The size of an object including metadata */ unsigned int object_size;/* The size of an object without metadata */ ... unsigned int offset; /* Free pointer offset */ ... void (*ctor)(void *); ... struct kmem_cache_node *node[MAX_NUMNODES]; }; ``` 這個結構包含許多變數，這裡整理幾個相對重要的變數。 * `cpu_slab`: 每個 cpu 各對應一個結構體，各自追蹤 slab，則在分配時以 cpu 本地的 slab 優先分配可以提升 cache 的命中率，細節的定義可參考[這裡](https://elixir.bootlin.com/linux/v5.11.6/source/include/linux/slub_def.h#L42)，可以看到結構中有直接指向 slab pool 的 `freelist`，或者可以從 `*partial` 間接找到 slab pool ```cpp struct kmem_cache_cpu { void **freelist; /* Pointer to next available object */ unsigned long tid; /* Globally unique transaction id */ struct page *page; /* The slab from which we are allocating */ #ifdef CONFIG_SLUB_CPU_PARTIAL struct page *partial; /* Partially allocated frozen slabs */ #endif #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; #endif }; ``` * `size`: 含 metadata 的 object 大小 * `object_size`: 實際要求分配的大小 * `offset`: `offset` 可以用來取得下一個 free object * `*ctor`: function pointer，可以定義 object 的 constructor * `node`: [`struct kmem_cache_node`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab.h#L525) 為單個 slab 的節點，透過 `struct list_head` 來串聯 `struct page`，後者會維護 `*freelist` 指向 slab pool ```cpp struct kmem_cache_node { spinlock_t list_lock; unsigned long nr_partial; struct list_head partial; }; ``` 下面是一個簡化的示意圖: > ![](https://i.imgur.com/u0aUqVg.png) > Reference: [Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB](https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf) #### `kmem_cache_init` 首先要初始化 `kmem_cache` 結構體。 ```cpp=4374 void __init kmem_cache_init(void) { static __initdata struct kmem_cache boot_kmem_cache, boot_kmem_cache_node; if (debug_guardpage_minorder()) slub_max_order = 0; kmem_cache_node = &boot_kmem_cache_node; kmem_cache = &boot_kmem_cache; create_boot_cache(kmem_cache_node, "kmem_cache_node", sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0); register_hotmemory_notifier(&slab_memory_callback_nb); /* Able to allocate the per node structures */ slab_state = PARTIAL; create_boot_cache(kmem_cache, "kmem_cache", offsetof(struct kmem_cache, node) + nr_node_ids * sizeof(struct kmem_cache_node *), SLAB_HWCACHE_ALIGN, 0, 0); kmem_cache = bootstrap(&boot_kmem_cache); kmem_cache_node = bootstrap(&boot_kmem_cache_node); ``` 首先注意到，函式的 `__init` 屬性提示編譯器把 function 放到 `.init.text` section，這些 section 中的函式與初始化有關，因此在執行之後的記憶體可以直接被釋放 ```cpp #define __init __section(".init.text") __cold __latent_entropy __noinitretpoline ``` 在初始化時會發現一件有趣的事，要建立不同大小的 object 所需的 `kmem_cache`，得基於 `struct kmem_cache_node` 和 `struct kmem_cache_node` 結構，而這兩個結構的配置需從 `kmem_cache` 申請，這就像是雞生蛋還是蛋生雞的問題。由此可知，kernel 需要相應的 bootstraping 來建立 `kmem_cache`。做法是透過 `boot_kmem_cache` 和 `boot_kmem_cache_node` 兩個 static variable 作為臨時的 `kmem_cache`。呼叫 [`create_boot_cache`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L542) 先建立 `struct kmem_cache` 和 `struct kmem_cache_node` 大小的 `kmem_cache`。再藉由 [`bootstrap`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slub.c#L4345) 從目前的 `kmem_cache` 中申請之。將 `boot_kmem_cache` 和 `boot_kmem_cache_node` 的內容分別複製至新申請到的 `kmem_cache` 後，調整新得到的 `kmem_cache` 底下的 page 以指回自己(最開始是指向 `boot_kmem_cache` 或 `boot_kmem_cache_node`)，最後加入 `slab_caches` 這個維護所有 slab caches 的 global linked-list 中。最後，bootstrap 完成，我們會得到 global 的 `kmem_cache` 和 `kmem_cache_node`，分別可以分配 `struct kmem_cache` 和 `struct kmem_cache_node` 大小的 `kmem_cache`。 :::info 注意到 `kmem_cache_node` 的變數是 `struct kmem_cache` 型態而非 `struct kmem_cache_node` ::: ```cpp=4401 /* Now we can use the kmem_cache to allocate kmalloc slabs */ setup_kmalloc_cache_index_table(); create_kmalloc_caches(0); ``` [`setup_kmalloc_cache_index_table`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L715) 的作用是修改 [`size_index`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L598)，後者是用來根據要求的大小對應到適合的 general cache 的編號。根據 `setup_kmalloc_cache_index_table` 註解可知道是為了某些 alignment 規範比較奇特的硬體而作，一般常見的硬體很可能不會對 `size_index` 做更多調整。所謂的 general cache 指的是 [`kmalloc_caches`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L588) 結構。可從宣告知道共會有 `KMALLOC_SHIFT_HIGH + 1` 種，以 x86 為例，`PAGE_SHIFT` 為 12(表示一個 page 有 2^12 = 4KB 大小)，則可知道 general cache 共有 14 種。 ```cpp #define PAGE_SHIFT 12 #define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1) #define KMALLOC_SHIFT_LOW 3 ``` 於是 `kmalloc_caches` 就可以由 [`create_kmalloc_caches`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L768) 初始化，可以見到從 `KMALLOC_SHIFT_LOW == 3` 到 `KMALLOC_SHIFT_HIGH == 13` 編號範圍的 `kmalloc_cache` 會被初始化。而 for 迴圈中還會額外設置 $2^6$ 到 $2^7$ 間的 96 和 $2^7$ 到 $2^8$ 間的 192 大小的 cache(為了增加顆粒度(granularity)，減少 fragmentation)，對應編號 1 和 2。每個編號所對應的空間大小可以在 [`kmalloc_info`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L674) 中找到，整體如下圖所示: > ![](https://i.imgur.com/65XfqQg.png) > Reference: [【Linux内存源码分析】SLUB分配算法（2）](https://www.jeanleo.com/2018/09/07/%E3%80%90linux%E5%86%85%E5%AD%98%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90%E3%80%91slub%E5%88%86%E9%85%8D%E7%AE%97%E6%B3%95%EF%BC%882%EF%BC%89/) ```cpp=4405 /* Setup random freelists for each cache */ init_freelist_randomization(); cpuhp_setup_state_nocalls(CPUHP_SLUB_DEAD, "slub:dead", NULL, slub_cpu_dead); ... } ``` 最後，`init_freelist_randomization` 應該與避免 `freelist` 有固定的 pattern 有關? `cpuhp_setup_state_nocalls` 則應與 slub 失敗時需觸發的 function pointer 註冊有關，細節則暫時省略。 #### `kmem_cache_create` ```cpp=406 struct kmem_cache * kmem_cache_create(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *)) { return kmem_cache_create_usercopy(name, size, align, flags, 0, 0, ctor); } ``` [`kmem_cache_create`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L407) 是建立 cache 的 interface。內部會呼叫 `kmem_cache_create_usercopy`，主要執行到的到關鍵函式有 [`__kmem_cache_alias`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slub.c#L4422)。先通過 [`find_mergeable`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L186) 查找是否有屬性及大小相近的 cache 可以複用，若沒有，則再通過 [`create_cache`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L233) 根據要求建立新的 cache。 #### `kmem_cache_destroy` ```cpp=482 void kmem_cache_destroy(struct kmem_cache *s) { int err; if (unlikely(!s)) return; get_online_cpus(); get_online_mems(); mutex_lock(&slab_mutex); s->refcount--; if (s->refcount) goto out_unlock; err = shutdown_cache(s); if (err) { pr_err("kmem_cache_destroy %s: Slab cache still has objects\n", s->name); dump_stack(); } out_unlock: mutex_unlock(&slab_mutex); put_online_mems(); put_online_cpus(); } ``` [`kmem_cache_destroy`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L482) 每次被呼叫就減去 reference count，若 reference count 歸零則呼叫 [`shutdown_cache`](https://elixir.bootlin.com/linux/v5.11.6/source/mm/slab_common.c#L447) 做後續的釋放。 :::warning 嘗試伸入探討 `kmem_cache_alloc` 和 `kmem_cache_free` 時發現沒辦法很好的釐清細節，而省略過多的細節則說了等於白說，為避免錯誤的 "意淫" 程式碼，這裡決定不多加討論，可以先參考相關的文章。 ::: ### SLOB 與 SLAB 和 SLUB 的架構比較不同，[SLOB](https://en.wikipedia.org/wiki/SLOB) 並不維護 cache，不過 `kmalloc()`、`kmem_cache_alloc()` 這類的 API 仍可以使用，只是實際的運作則會被 config 成 `slob_alloc()` 系列對 SLOB 之 list 的操作。也因此 SLOB 程式碼相對其他兩者精簡，使用較少的 metadata 維護可用的記憶體，因此適用於記憶體資源稀缺的嵌入式系統上。其 free list 簡單的維護可用的 chunk 和其大小，並基於 first-fit 策略分配。不過歸咎於其相對簡單的策略，SLOB 有嚴重的 fragmentation 問題。 :::danger P.S. wikipedia 說的 internal fragmentation 應該是從 page 的角度，但 [The Art and Science of Memory Allocation](https://www.cs.unc.edu/~porter/courses/cse506/s16/slides/malloc.pdf) 提到的 internal fragmentation 是以整個記憶體的角度，所以 wikipedia 上說: > SLOB allocator is that it suffers greatly from internal fragmentation. 而投影片中說: > SLOB: No internal fragmentation, obviously ::: SLOB 需要維護的只有 3 個 free lists: * size < 256 - free_slob_small * [256, 1024) - free_slob_medium * [1024, PAGE_SIZE) - free_slob_large 並根據要求的配置大小從合適的 list 去查找。這個簡單的設計也讓 SLOB 有一下的缺點: * SLOB 必須線性的去查找遍歷 free list，直到找到 chunk >= 所需大小 * Fragmentation: 當 SLOB cache 需要擴充時，某個 free list 中會新增一個 page 以切出 chunk 使用，且這個 page 就只能為此 free list 所用，這造成了幾個問題: * page internal fragmentation: 假設 page 大小為 4K page，則對於一個大小為 3096 bytes 的物件，其配置需要的 page 將為 free_slob_large 所用，因此 page 剩餘的 1000 bytes 將無法被妥善分配 * 如果在 free_slob_small 已經沒有足夠的空間，即使 free_slob_large page 剩餘的空間可以分配出需要的物件大小，但因為此 page 不屬於 free_slob_small，free_slob_small 需要額外在多配置一個 page 以切出 chunk ### TODO - [ ] 實際閱讀 linux kernel 中的 SLAB / SLOB / SLUB 原始碼以確認更多實作細節，並且修正前面不夠詳盡的筆記內容 - [ ] [A Heap of Trouble: Breaking the Linux Kernel SLOB Allocator](https://www.vsecurity.com/download/papers/slob-exploitation.pdf) ### Reference * [The Slab Allocator in the Linux kernel](https://hammertux.github.io/slab-allocator) * [Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB](https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf) * [Slab allocation](https://en.wikipedia.org/wiki/Slab_allocation) * [SLOB memory allocator](https://nitingupta.dev/post/slob-memory-allocator/) * [The SLUB allocator](https://lwn.net/Articles/229984/) * [图解slub](http://www.wowotech.net/memory_management/426.html) * [slub分配器](http://www.wowotech.net/memory_management/247.html) ## The Page Cache > [The Page Cache](https://www.cs.unc.edu/~porter/courses/cse506/s16/slides/page-cache.pdf) ### What is page cache? page cache 用來暫存從硬碟中讀取出來的檔案，因為對硬碟的 I/O 速度遠低於對記憶體的存取，如果有空閒的記憶體，則用來暫存檔案可以提升操作檔案的效率。因此，在 linux kernel 中，對檔案的讀取會先從 page cache 開始找起，如果不在 cache 中，才從硬碟將讀取出來; 寫入方面，同樣先查找 page cache，如果在 page cache 中則改寫 page cache 並標註為 dirty，直到 kernel thread `pdflush` 去處理或者主動呼叫 [`sync()`](https://man7.org/linux/man-pages/man1/sync.1.html) 才寫回磁碟中。 ### Mapping files 在這個章節中，我們要思考的議題是: 給定一段 VMA 或者檔案的 inode，該如何對應到其儲存資料的 physical page? 對於匿名 (anonymous) 的記憶體而言(例如一個 process 的 stack / heap 空間，沒有對應的檔案)，這個問題比較容易，只要透過 page tables 將 virtual page 對應到 physical page 即可。但是 VMA 也允許檔案的對應([memory mapped file](https://en.wikipedia.org/wiki/Memory-mapped_file))。memory mapped file 可使得檔案的讀寫不需要在 kernel 和 user space 間來回的複製，且不同的 process 可以透過映射相同的實體記憶體來共享此檔案，增加了操作檔案的效率。 > 推荐閱讀 [Linux中的mmap映射 [一]](https://zhuanlan.zhihu.com/p/67894878) 以釐清 mmap 與常規檔案操作、page cache 之間的關係該如何從 VMA 找到對應的檔案呢？要找到檔案對應 physical，首先需要找到描述檔案的 inode。因為 VMA 中會具有檔案的 file descriptor 及 offset，則首先可以先用 file descriptor 查找 file descriptor table，找到檔案對應的 inode。 :::info 實際上，使用 file descriptor 找到的 entry 會用來查找 open file table，再從 open file table 的 entry 搜尋 inode table 找到檔案的 inode，可參考以下連結: * [Unix system file tables - stack overflows](https://stackoverflow.com/questions/14189944/unix-system-file-tables) * [Linux系統篇-文件系統&虛擬文件系統](https://kknews.cc/zh-tw/code/n3g58v2.html) ::: 有了 inode 之後，接下來需要透過 inode 找到真正的檔案。在 inode 底下的 `struct address_space` 管理著前面所述的 page cache，其中就包含對應檔案的 page。為了提升查找的效率，這些 page cache 通過 radix tree 結構管理，則透過查找 radix tree，最後就能得到檔案所對應的 physical page 了。 ### Radix Tree Radix tree 是一個 space-optimized trie: * [trie](https://en.wikipedia.org/wiki/Trie): node 中不會儲存整個 key，遍歷的 parent 會建立 key 的 prefix，而 node 僅表示 suffix 因為 radix tree 的 branching factor $k$ > 2，適當的設計下，可使得對大檔案的搜尋更有效率。具體該如何搜尋 radix tree 呢?先用一個簡單的案例說明: 假設檔案最大可達 256k，branching factor $k$ = 6，則 256k / 4k = 64，可以建立出高度 = 1 的 radix tree，因此 root 下存在 64 個 slots，可以指向 NULL 或者一個 page。然後將 page offset 則做為查找的 index: 先將低位的 12 bits shift 掉(因為 1 個 page 是 4K)，然後將 LSB 的 6 bits 當成 index，如果 slots 不為 NULL，該 page 即為目標 page，否則表示不存在。如果 radix tree 允許的檔案更大，樹高更高呢?其實也是類似的道理: 一樣先將低位的 12 bits shift 掉，然後取出 6 * height 個 bits 做為 index(例如高度為 2 就取 12 個 bits)，然後從 MSB 開始前 6 個 bits 是第一層的 index，再往後推 6 個 bits 是第二層的 index...以此類推，找到對應的 page。另一個問題是，如果檔案大小增加，影響了 radix tree 的高度呢?其實很簡單，增加一個新的 root，然後把原本 radix tree 的 root 當成是新的 radix tree 的第一個 child 即可。 ### Dirty page 大部分的作業系統中，當 page cache 被標註 dirty 後，考量到寫 disk 的時間耗費，因此並不會馬上寫回 disk 中。如果有需要馬上寫回 disk 的需求的話，我們可以透過 sync system call 來要求: * [sync](https://man7.org/linux/man-pages/man2/sync.2.html): 將標註為 dirty 的內容寫回 disk * [fsync(fd)](https://man7.org/linux/man-pages/man2/fsync.2.html): 將指定的整個檔案為 dirty 的內容寫回 disk(包含檔案的 metadata(inode)) * fdatasync(fd): 類似 fsync，但僅寫回資料 dirty 的部分， metadata 則不會被寫回 > [Linux IO同步函数:sync、fsync、fdatasync](http://byteliu.com/2019/03/09/Linux-IO%E5%90%8C%E6%AD%A5%E5%87%BD%E6%95%B0-sync%E3%80%81fsync%E3%80%81fdatasync/) 那麼，sync 實際是怎麼被實作的呢? 對於 sync 而言，其關注的目標應該是減少找出 dirty block 的 overhead，因為遍歷所有 page 雖然可行，但效率會不高。為此，我們必須追蹤 dirty page 的所在，在檔案系統中存在的 superblock(注意這跟前面 Hoard allocator 中所說的 superblock 不同!) 可以幫助我們達到此目的。每個 superblock 會維護一個 dirty 的 inode，因此我們就可以追蹤哪些檔案的狀態為 dirty。 > [What is a Superblock, Inode, Dentry and a File?](https://unix.stackexchange.com/questions/4402/what-is-a-superblock-inode-dentry-and-a-file) ![](https://i.imgur.com/ExOG158.png) 如果不使用 sync system call，dirty page 需要透過非同步的機制更新回硬碟中。 `pdflush` 是負責此任務的 kernel thread(事實上，同時可能會有多個 `pdflush` thread 存在，可參考 [The pdflush Mechanism](https://www.halolinux.us/kernel-architecture/the-pdflush-mechanism.html) 對此的說明)。當 dirty page 存在的時間太長，或者 dirty page 的總數超過一定比例時，`pdflush` 會去將 dirty page 寫回 disk，直到總 dirty page 低於此預設比例，或者沒有可以寫回的 dirty page 為止。 > * [The pdflush Daemon](http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch15lev1sec4.html) > 這篇文章中則指出了作為改進 pdflush 的 flusher > * [Flushing out pdflush](https://lwn.net/Articles/326552/) ### Reference * [Linux中的Page Cache](https://zhuanlan.zhihu.com/p/68071761) ## Page Frame Reclaiming > [Page Frame Reclaiming](https://www.cs.unc.edu/~porter/courses/cse506/s16/slides/pfra.pdf) ### Swapping 作業系統中大多允許 virtual memory 被 [overcommitted](https://en.wikipedia.org/wiki/Memory_overcommitment)。換句話說，分配出的總 virtual memory 數量可以大於 physical memory 的大小。然而實際上該怎麼做到這件事呢? 系統會透過 swapping 達成此目的。當 physical memory 不足時，有些 page 會被寫回硬碟上。系統會將原本該 page 對應之 page table entry 之 `PTE_P` 清除，表示此對應無效。因此下次嘗試操作該 entry 時就會產生 page fault，系統需要找到另一個 physical page 並做 mapping。 > 這裡有一篇還不錯的翻譯文章: [【譯】替 swap 辯護：常見的誤解](https://farseerfc.me/in-defence-of-swap.html) 為了做到 swapping，在 Linux 中會掃描 page descriptor table，以決定要把哪個 page 置換到 swapping space 中。此前我們提到，為了將 `PTE_P` 清除，在回收這個 physical page 的同時，勢必會需要找到它的 page table entry 去做 unmapping。那麼，如何從 physical page descriptor 對應到其 mapping 的對象呢? * 注意同一個 physical page 可以對應多個 virtaul page! 這就是 reverse mapping 存在的目的了。 ### Reverse mapping :::info 開始之前推荐閱讀: [逆向映射的演进](http://www.wowotech.net/memory_management/reverse_mapping.html)，相當完整而且精彩的解說了 reverse mapping。 ::: 在最早 Linux 沒有 reverse mapping 的時候，要從 page 找到其對應的 page table entry，就只能真的掃描系統中每個 process 的每個 vma。不過顯然這個作法是相當沒有效率的(當然可以設計某些機制來提升效率，不過沒有 reverse mapping 的概念下，效率的提升是有限的)，正因如此，reverse mapping 出現在了作業系統的視野中。 object-based reverse mapping(可參考 [Virtual Memory II: the return of objrmap](https://lwn.net/Articles/75198/)) 是一個很好的進入點。讓我們先從 anonymous page 的 reverse mapping 開始討論起。當一個匿名頁的 mapping 產生時，一個 `vma` 會被建立，而如果對應的 physical page 是第一次被映射，也會建立一個 `anon_vma`。`vma` 和 physical page descriptor 皆會指向 `anon_vma`。而 `anon_vma` 會將該 page 的所有 `vma` 以 circular linked list 的形式串在一起。換句話說，當 copy-on-write 的 `fork` 發生而出現共享的 page 時，一個新的 `vma` 會被建立且加入 `anon_vma` 中。如下圖所展示: ![](https://i.imgur.com/FF8WYEy.png) 不過這個作法會有個問題，假設 fork 出來的 child process 對原本共享的 page 寫入時，因為 COW 機制，此時會分配出新的 physical page 來對應，然而 `anova_vma` 卻沒辦法隨之更新。換句話說，遍歷的 `anova_vma` 中包含了事實上和該 physical page 無關的 `vma`，導致了不必要的搜尋。有改進的方法嗎? 當然有: [The case of the overly anonymous anon_vma](https://lwn.net/Articles/383162/) :::warning 我覺得對我來說有些複雜，避免錯誤，所以暫時就不整理了QQ ::: 在 page descriptors 下，會有兩個欄位紀錄 physical page 的屬性: * `_mapcount`: 紀錄有多少共享的 vma。-1 表示未被映射，0 表示只有一個映射，1 以上則表示分享的數量 * `mapping` 指標則可能指向 `address_space` 結構(file/device)或者 `anon_vma`(anonymous)，least significant bit 會紀錄這個屬性(1 == 指向 `anon_vma`) 如果 page 映射到檔案的話又如何呢? `mapping` 指標會指向 `address_space` 結構，後者在某個 inode 底下，包含了文件 page cache 的相關信息。於是和 anonymous page 類似，`vma` 可以串成一個 linked list 並放在 `address_space` 中被維護...對嗎? > [linux内核中的address_space 结构解析](https://blog.csdn.net/jinking01/article/details/106490467) 事實上，檔案的映射存在其他不同複雜的特性。首先，記憶體不一定總是需要映射整個檔案。因此，假設我們在尋找文件的第 4 個 page 的所有映射，每個線性的映射掃描可能需要額外過濾不包含第 4 頁的vma。另一方面，相較於 anonymous page，檔案的 page 更經常被大量的共享。換句話說，`vma` 的數量很多，顯然線性的 scan 就會顯得不夠有效率。因此事實上檔案的 `vma` 很多，Linux 中會透過 PST([Priority Search Tree](https://en.wikipedia.org/wiki/Priority_search_tree)) 來維護(在 `i_mmap` 之下)。PST 以重疊的範圍作為 node keys，涵蓋範圍大者會是 parent，比較小的則是 child，則對於某個 page `N`，透過這個結構可以有效的走訪這個 page 的 `vma`。搜尋時，如果 page `N` 在 node 所表示的範圍之中，則兩邊的 child 都要走訪;如果超出範圍，則只走訪右節點。在 PST 的節點上則存在在該範圍中存在的所有 `vma`。 ![](https://i.imgur.com/IuVnjHL.png) :::warning 很可惜的，PST 好像已經被新的方法取代了? 參考 [rbtree based interval tree as a prio_tree replacement](https://lwn.net/Articles/509994/) ::: > * [Linux中匿名页的反向映射](http://liujunming.top/2017/09/03/Linux%E4%B8%AD%E5%8C%BF%E5%90%8D%E9%A1%B5%E7%9A%84%E5%8F%8D%E5%90%91%E6%98%A0%E5%B0%84/) > * [匿名反向映射的前世今生](https://richardweiyang-2.gitbook.io/kernel-exploring/00-index/01-anon_rmap_history) ### Reclaim 下面我們想探討的事情是: 當記憶體空間不敷使用，該怎麼有效率的回收 page 呢。先讓我們將 page 的類型做分類: ![](https://i.imgur.com/w1gclXP.png) 對於 page 回收的基本原則為: * 首先釋放掉釋放後不會產生問題的 page * 從 user program 的 page 開始回收，尤其是那些尚未被使用到的 page * Temporal locality: 回收近期都沒有使用到的 page * 回收那些回收成本小的 page * 例如可以不用額外 write back 的 page 對於回收的方法，我們需要考慮一個情境: 當記憶體陷入不足的狀況時，最終都需要回歸到某個 process 拿到足夠的記憶體然後完成任務，接著才會有真正可用的空間被釋放出來! 因此在進行回收時，系統需要儘量避免回收掉 process 完成任務所需要的 page。 ### Detect the Page Use 要知道一個 page 是否最近正在被使用，Linux 系統中會使用兩個 LRU list 維護之，分成 active 和 inactive 兩種。當 page 最近曾被存取時，就移動到 active 類型(hot)，反之在一段時間內未被存取的 page 就會移動到 inactive(cold)。具體要怎麼知道一個 page 是否被使用呢? 像是 mmap、mprotect 被呼叫時可以明確的得知被使用的 page。另一方面，硬體可能也支援 page table entry 的 access bit 來得到資訊: 當一個 page 被存取時，acceess bit 會被設置。因此作業系統可以定期去清掉 access bit，然後設定一個計時器，當 timeout 時，如果檢查到這個 bit 已設回來了，表示最近被存取過，反之，表示該 page 相對在近期中沒被存取到，因此可以優先回收。 > 更多細節請參考: > * [Chapter 10 Page Frame Reclamation](https://www.kernel.org/doc/gorman/html/understand/understand013.html) > * [Linux中的内存回收[一]](https://zhuanlan.zhihu.com/p/70964195) ## 延伸閱讀 * [Introduction to Memory Management in Linux](https://youtu.be/7aONIVSXiJ8)