sysprog
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Write
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Help
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Write
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    1
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # slab 記憶體配置器 > 資料整理: [jserv](https://wiki.csie.ncku.edu.tw/User/jserv) ## 簡介 slab 一詞源於 SunOS 原始程式碼,開發者選擇這詞彙,是相較於 "object" 或 "cache",slab 更具辨識度,該詞彙的[原意](https://dictionary.cambridge.org/dictionary/english/slab)是針對石、木、金屬、食物等的板或片段。根據 Jeff Bonwick 發表在 1994 年的論文〈[The slab Allocator: An Object-Caching Kernel](https://www.usenix.org/conference/usenix-summer-1994-technical-conference/slab-allocator-object-caching-kernel)〉,記憶體配置器管理記憶體區域 (arenas) 的策略可區分為 sequential-fit, buddy 和 segregated-storage 這三個手法。slab 配置器屬於 segregated-storage。本文為了讓讀者易於對照論文,像是 cache, object, slab 和 buffer 一類的術語,保留原文。 所有的 slab 具有相同的空間,通常設定為為記憶體管理單元 (MMU) 的頁面 (page) 的大小,在 POSIX 相容的系統上,可藉由 `sysconf(_SC_PAGESIZE)` 查詢。slab 的記憶體從後端配置器中設定,這意味著從更大的記憶體區塊 (block) 中請求記憶體,通常至少是一個頁面的大小。一個 slab 包含多個大小相同的 buffer,slab 的資料部分(slab Data) 置於 slab 的結尾,以減少非法記憶體存取導致的損壞風險。一個 cache 包含多個 slab,cache 中的 slab 透過雙向鏈結串列連接,freelist 指標指向至少有個空閒 buffer 的 slab。 ![image](https://hackmd.io/_uploads/r18P0hBEA.png) 當 buffer 被釋放時,配置器會檢查 slab 中剩餘的空閒 buffer 的數量。若一個 slab 中所有的 buffer 都被釋放,則將該 slab 移到鏈結串列的末尾,確保 freelist 指標指向至少有一個空閒 buffer 的 slab。若 cache 中的 slab 數量低於一定的閾值(工作集 [work set] 大小),該 slab 不會立即返回後端配置器,以降低重新設定和初始化 slab 的開銷。 Buffer 大小會事先定義,例如 8B, 16B, 32B, 48B 等,每個大小對應一個包含雙向鏈結串列的 cache。設定記憶體時,配置器會將請求的大小匹配到不小於該請求的最小 buffer 大小,然後從對應的 cache 中找到有空閒 buffer 的 slab 進行設定。超過半頁大小的請求,直接從後端配置器設定。 釋放記憶體時,配置器藉由檢查是否對齊頁面大小來區分 buffer 和大型 object 的地址。buffer 不用對齊頁面大小,而大型 object 則要對齊頁面大小。這樣即可避免使用雜湊表進行映射的需求,確保記憶體管理的高效。 [userland-slab-allocator](https://github.com/bbu/userland-slab-allocator) 提供文字模式的 slab 配置動態展示。 ## 實作示範 儘管 Jeff Bonwick 描述的 slab 起初用於核心物件的管理,但同樣適用於使用者空間。記憶體配置器請求一個頁面大小的記憶體區塊,並將其分割成固定大小的片段。每個片段可以容納一個指標或整數,並鏈接成一個串列,串列的頭部指向第一個空閒節點。 以下是 Marek Vavrusa 提供的[實作程式碼](https://marek.vavrusa.com/memory/): ```c struct slab { void **head; }; /* Create page-aligned slab */ struct slab *slab = NULL; posix_memalign(&slab, page_size, page_size); slab->head = (void **) ((char *) slab + sizeof(struct slab)); /* Create a NULL-terminated slab freelist */ char* item = (char *) slab->head; for (unsigned i = 0; i < item_count; ++i) { *((void **) item) = item + item_size; item += item_size; } *((void **) item) = NULL; ``` 記憶體的配置操作只需取出鏈結串列的開頭,而記憶體釋放操作則相當於將一個新的開頭節點置入鏈結串列。這裡可留意到一個技巧:若 slab 對齊到頁面大小的邊界,可藉由向下取整數到頁面大小,以較低的成本來獲取 slab 指標。 ```c /* Free an element */ struct slab *slab = (void *) ((size_t) ptr & PAGESIZE_BITS); *((void **) ptr) = (void *) slab->head; slab->head = (void **) ptr; /* Allocate an element */ if ((item = slab->head)) { slab->head = (void **) *item; } else { /* No elements left. */ } ``` ## Linux 核心的 slab ![image](https://hackmd.io/_uploads/S1OvgRI4A.png) 以下是 Linux 核心的用語: * slab: 一組包含一個或多個連續頁面的記憶體,這些頁面包含特定大小的核心物件。 * slab cache: 多個相同類型 slab 的容器。有兩種類型的 cache: - 專用 cache: 由 Linux 核心建立,每個 cache 僅用於保存一種(常用)物件類型,例如 `mm_struct` 或 `cred_jar` - 通用 cache: 可容納任何特定大小 (加上 padding) 的物件。通常,這些 cache 中的物件大小為 2 的冪。當呼叫 `kmalloc()` 時,返回的已配置記憶體將在這些 cache 之一(例如 `kmalloc-32`, `kmalloc-64` 等)中,實際則取決於請求的大小。 * SLOB / SLAB / SLUB: 都是 Linux 核心中可用的 slab 配置器,較新的 Linux 核心版本已棄置 slob 和 slab,因此新的 Linux 僅提供 SLUB 這個 slab 的改進版本。 要顯示不同 slab cache 的資訊,我們可執行以下命令: ```shell $ sudo cat /proc/slabinfo ``` 參考輸出: ``` slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ... kmalloc-8k 1700 1700 8192 4 8 kmalloc-4k 6329 6552 4096 8 8 kmalloc-2k 20791 20896 2048 16 8 kmalloc-1k 20906 20960 1024 32 8 kmalloc-512 337585 506176 512 64 8 kmalloc-256 85546 89472 256 64 4 kmalloc-128 297466 302272 128 64 2 kmem_cache_node 1070 1088 128 64 2 kmem_cache 896 896 256 64 4 ``` ### SLUB 配置器 我們專注於 SLUB 配置器,不僅因它是 Linux 核心預設的記憶體配置器,而且針對多核處理器進行調整。 Linux 核心中的記憶體配置會比使用者模式中略為單純,因為 metadata 通常不與資料主體存在放同一個空間 (例如使用者模式的 heap),而是儲存在不同的結構中。 每個 CPU 都有一個有效的 slab 用於每種類型的物件。這意味著,當在某個 CPU 上配置某種類型/大小的物件時,僅從該 CPU 的有效 slab 中配置空間,直到沒有更多的空閒空間。當沒有更多空間時,另一個 slab 將被標記為該 CPU 的有效 slab。需要注意的是,不同的處理器有不同的有效的 slab。 在 SLUB 配置器中,`kmem_cache` 結構體負責分類管理。說到分類,我們總是按照某些屬性進行分類,對應到真實世界,就是說賣場中各項商品的用途。 SLUB 的分類依據是什麼呢?首先是按照名字分類。呼叫 `kmem_cache_create()` 時,首個參數是 name,而這個名字保存在 kmem_cache 結構體的 name 成員中。我們可藉由以下命令查看系統中 SLUB 的分類: ``` $ cat /proc/slabinfo ``` 隨即可見 `task_struct` 和 `inode_cache` 等資訊。 另一種分類方式是按照大小。我們可在上述命令的輸出中看到 `kmalloc-512` 和 `kmalloc-256` 等字樣,這些是按照大小配置的記憶體快取。 按照名字分類隱含按大小分類的意義,這裡單獨列出是為了強調 SLUB 中對大小的計算,也就是進行預先存取 (prefetch) 的依據,後者的目的是提高系統性能。在解釋其作法前,我們回顧現實生活的案例:超級市場中,貨架上琳瑯滿目的商品都可算是預先存取,超市按照估計預先將商品放在貨架上,等待顧客購買。若每種商品貨架上只有一個,於是眾多顧客購買後,就要等待商家補貨,方可滿足購物需求,預先存取則可節省顧客的平均購買時間。 再來看商場服務人員的工作:在顧客購買商品的過程,服務人員也會持續補充貨架。倘若每當一個商品被拿走,服務人員就去倉庫取一個商品補充,那未免太繁瑣。因此,當貨架上的商品賣完時,服務人員才會去倉庫取出一批商品補充貨架。換言之,預先存取可降低服務人員的工作負擔。 不過前述的「一批」商品的數量該是多少?若拿太少,就不能提升每次取貨的效率,而若拿多,則貨架上放不下。生活中如此,Linux 核心也有對應的考量。 記憶體管理模組具備多個層次,SLUB 配置器建立在頁面配置器之上,因此可理解為頁面是 SLUB 的貨架。超市貨物上架前,應當做好以下估算: * 貨架的大小 * 貨架上能放多少貨物 SLUB 同樣需要計算以下: * 用多大的頁面來作為貨架 * 每個頁面中可以放多少對象 這些資料都保存在 `kmem_cache` 結構體中,整個計算過程在 `calculate_sizes()` 函式。注意:這裡說的頁面不是單個的實體頁面,而是核心中 `struct page` 對應的頁面。 用更細緻的案例來解說: 1. 假如申請的結構體 (貨物) 大小是 512 位元組,那麼頁面 (貨架) 可選擇 4K 位元組大小,每個頁面 (貨架) 上可存放 8 個結構體 (貨物)。 2. 假如申請的結構體 (貨物) 大小是 2050 位元組,那麼頁面 (貨架) 可選擇 8K 位元組大小,每個頁面 (貨架) 上可以存放 3 個結構體 (貨物)。 實際計算時,第二種情況的值可能不是如此,因為這樣選擇會導致較大的記憶體空間浪費,核心可能選擇更大的頁面,以減少記憶體浪費的發生。 - [ ] [`struct kmem_cache`](https://elixir.bootlin.com/linux/v6.9.2/source/mm/slab.h#L251) 該結構體是 slab cache 的存取介面。每種類型的物件都有一個對應的 `kmem_cache` 實例,並且每個大小的通用 cache 也有一個實例。它用來管理給定物件類型/大小的所有 slab。 ``` +------------------------------+ |name | | (char *) | +------------------------------+ |object_size | = original object size |inuse | = ALIGN(object_size, sizeof(void *)) |size | = ALIGN(inuse + padding + debug space, s->align) |align | |offset | | (int) | |reserved | +------------------------------+ |oo | |min | |max | | (kmem_cache_order_objects) | | +--------------------------+ | |order | | |order_objects | +---+--------------------------+ ``` 該結構體的重要成員: * `const char *name` : slab cache 的名稱,如 `kmalloc-32` 等 * `unsigned int object_size` : cache 中物件的大小 * `unsigned int size` : 包括 metadata 的物件大小,如果存在 metadata * `struct kmem_cache_cpu __percpu *cpu_slab` : 指向 `struct kmem_cache_cpu` 的指標,並非尋常的指標,而是個 `__per_cpu` 指標,意味著地址空間不同,即需要額外的運算來取得 `kmem_cache_cpu` 物件的實際地址 * `struct kmem_cache_node *node[MAX_NUMNODES]` : 指向 `kmem_cache_node` 結構體陣列的指標 * `unsigned int offset` : 在物件中的偏移量,指向下個空閒物件的指標儲存位置(當物件在 freelist 中時)。 * `unsigned long random`:僅在設定 `CONFIG_SLAB_FREELIST_HARDENED` 時存在,用於變更 freelist 中指標的地址 * `unsigned int *random_seq`:僅在設定 `CONFIG_SLAB_FREELIST_RANDOM` 時存在。這是一個隨機序列,以確保初始 freelist (建立 slab 時包含所有物件) 並非循序,而是隨機順序。換言之,意味著物件不會在開始時按順序配置 名為 `slab_caches` 的變數是個包含所有 slab cache 的雙向鏈結串列。名為 kmalloc_caches 的變數則包含所有的 kmalloc (通用) cache,其第一個索引表示 kmalloc 的 cache 類型,如下: * `KMALLOC_NORMAL` = 0: 對應於名為 `kmalloc-8`, `kmalloc-16` 等的 cache * `KMALLOC_CGROUP`:對應於名為 `kmalloc-cg-8`, `kmalloc-cg-16` 等的 cache * `KMALLOC_RECLAIM`:對應於名為 `kmalloc-rcl-8`, `kmalloc-rcl-16` 等的 cache * `KMALLOC_DMA`:對應於名為 `dma-kmalloc-8`, `dma-kmalloc-16` 等的 cache 第二個索引表示該 cache 中的物件大小。索引 n 對應於 kmalloc-2^n^,其中包含大小從 $2^{n-1}+1$ 到 $2^{n}$ 的物件。 - [ ] [`struct kmem_cache_cpu`](https://elixir.bootlin.com/linux/v6.9.2/source/mm/slub.c#L384) 該結構管理目前 CPU 的有效 slab。該結構體的重要成員如下: * `void **freelist` : 指向 slab 中下個可用物件的指標,亦即這個 slab 中的下次配置將返回此指標指向的地址 * `struct slab *slab` : 指向有效 slab 的指標 - [ ] [`struct kmem_cache_node`](https://elixir.bootlin.com/linux/v6.9.2/source/mm/slub.c#L425) 該結構體跟蹤部分 slab (亦即有空間的 slab),而非有效的 slab。根據配置,它還跟蹤已滿的 slab。該結構體的重要成員如下: * `unsigned long nr_partial`:部分 slab 的數量 * `struct list_head partial`:部分 slab 的串列 * `struct list_head full`:僅在設定 `CONFIG_SLUB_DEBUG` 時定義,是已滿 slab 的串列 - [ ] [`struct slab`](https://elixir.bootlin.com/linux/v6.9.2/source/mm/slab.h#L52) 該結構體包含一個 slab 的 metadata,包含以下: * `struct kmem_cache *slab_cache` : 該 slab 所屬的 slab cache,由 kmem_cache 表示 * `void *freelist` : slab 中第一個空閒物件 ### 建立 `kmem_cache` 結構體 kmem_cache 結構體是 SLUB 進行配置時的中心節點。在該節點上,我們可獲得所有關於這個 SLUB 的資訊。前述 `kmem_cache` 結構體內部的成員在建立後就不會改變,主要用來儲存 SLUB 的大小資訊,這裡我們關注以下: * cpu_slab * NUMA node 可見 SLUB 引入二個層級來管理記憶體:per-CPU 層級和 NUMA 節點 層級。下圖展示 `kmem_cache` 剛建立時的樣貌,可見一切都剛開始,內容都是空的。 ``` kmem_cache +------------------------------+ |name | | (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | |tid | #cpu | | (unsigned long) | | |freelist (void **) | NULL | |page (struct page *) | NULL | |partial (struct page *) | NULL +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |list_lock | | | (spinlock_t) | | |nr_partial | 0 | | (unsigned long) | | |partial | empty | | (struct list_head) | +---+--------------------------+ ``` ### 配置記憶體 由於採用分層設計,SLUB 配置時根據每個層次是否存在記憶體區分為幾種情況。在 Linux 核心運作過程,SLUB 按照以下順序檢查是否有可用的記憶體: * cpu_slab $\to$ freelist * cpu_slab $\to$ partial * node $\to$ partial 原理不難懂,就是先檢查最容易取得的空間是否有可用記憶體。以下展現 SLUB 在不同狀況下的樣貌。 - [ ] 什麼地方都沒有記憶體 按照上述順序,這個過程應該是以上三個空間中都沒有記憶體的情況。不過這也是 SLUB 剛建立後第一次配置時的情況,就是從 buddy 系統中取得一個頁面,裝到 cpu_slab $\to$ freelist 中。這步執行於 `allocate_slab()`。 這種情況的初始狀態是所有空間都是空的,但在下圖中展示二個狀態,目的是為了突出 page $\to$ freelist 指標的變化。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist | | | (void **) | | | | | |page +--|--->+--------+--------+--------+--------+ | | (struct page *) | | | | | | | | | +-------------------|--+ +--------+--------+--------+--------+ | | |freelist ---+ | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | |partial | = NULL | | (struct page *) | +---+--------------------------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | |partial | = NULL | | (struct page *) | +---+--------------------------+ ``` - [ ] freelist 上有記憶體 該情況也稱為 fast path,因為在該情況下,只需移動 freelist 指標即可,這步執行於 `slab_alloc_node()`。 下圖展示 freelist 從 a 移動到 A。注意 freelist 保存的是一個鏈結串列,而非線性儲存,因此指標的變化不一定是後移。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | a A | |freelist ------|-------------+.................. | | (void **) | | . | | | v v | |page | +--------+--------+--------+--------+ | | (struct page *) | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial | = NULL | | (struct page *) | +---+--------------------------+ ``` - [ ] partial 上有記憶體 這時需要從 partial 中找到可用的頁面,裝到 freelist 上。這步執行於 `___slab_alloc()` 和 `get_freelist()`。 下面二圖分別展現裝之前和之後的變化,可見原來在 partial 中的頁面 A,安裝後移到 cpu_slab $\to$ page。注意在 partial 上的頁面,frozen 也是 1,但 inuse 則反映真實的配置情況而不是等於 objects。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist | = NULL | | (void **) | | |page | = NULL | | (struct page *) | | | | | |partial ------|--->+-----------+ +-----------+ | | (struct page *) | |A next|---->|B next| | | | +-----------+ +-----------+ | | | |pobjectsA | = |pobjectsB | + objectsA | | | | | | | | | | |inuse < | |inuse < | | | | | objectsA | | objectsB | | | | |frozen = 1 | |frozen = 1 | | | | +-----------+ +-----------+ +---+--------------------------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjectsB | = objectsB | | | | | | | | |inuse < | | | | | objectsB | | | | |frozen = 1 | | | | +-----------+ +---+--------------------------+ ``` - [ ] node 上有記憶體 看完 freelist 和 partial,最後我們來看 cpu_slab $\to$ node 上有記憶體的情況。在該情況,我們需要從 node 上取出頁面,並將其裝入 freelist 中。該過程執行於 `new_slab_objects()` $\to$ `get_partial()` $\to$ `get_partial_node()`。 注意在某些情況下,不僅會將頁面裝入 freelist 中,還會取出多個頁面裝入 partial 中,避免下次配置時找不到可用的記憶體。下圖展示的頁面 B 就是從 node 轉移到 partial 上。此外,在 node 上的頁面中,frozen 為 0。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist | = NULL | | (void **) | | |page | = NULL | | (struct page *) | | | | | |partial | = NULL | | (struct page *) | | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 3 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ +-----------+ | | (struct list_head) | |A lru|---->|B lru|---->| lru| +---+--------------------------+ +-----------+ +-----------+ +-----------+ |inuse < | |inuse < | | objects | | objects | |frozen = 0 | |frozen = 0 | +-----------+ +-----------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjects | = objects | | | | | | | | |inuse < | | | | | objects | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ | | (struct list_head) | |C lru| +---+--------------------------+ +-----------+ ``` ### 釋放記憶體 釋放雖然是配置的另一面,但在細節與 SLUB 配置略有不同。由於 SLUB 的設計是個鏈結串列,所以無論需要釋放記憶體對應的頁面在哪個地方,都是在 freelist 鏈結串列的開頭插入一個物件。因此,單個物件的釋放沒有太大區別。區別在於釋放物件後,需要將對應的頁面移動到適當的位置。 SLUB 釋放記憶體階段需要在以下三個位置之間移動頁面: * cpu_slab $\to$ freelist * cpu_slab $\to$ partial * node $\to$ partial - [ ] 移動 freelist 該情況相對簡單,不涉及頁面的移動,只要移動 cpu_slab $\to$ freelist 即可。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | obj | |freelist ------|-------------+.................. | | (void **) | | . | | | v v | |page | +--------+--------+--------+--------+ | | (struct page *) | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial | = NULL | | (struct page *) | +---+--------------------------+ ``` - [ ] 替換 cpu_slab->page 該情況發生於配置記憶體時,當 cpu_slab $\to$ page 無法滿足配置需求時,而將這個頁面退回到 node $\to$ partial。過程由 `deactivate_slab()` 實作。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjects | = objects | | | | | | | | |inuse < | | | | | objects | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ | | (struct list_head) | |C lru| +---+--------------------------+ +-----------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist | = NULL | | (void **) | | |page | = NULL | | (struct page *) | | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjects | = objects | | | | | | | | |inuse < | | | | | objects | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ | | (struct list_head) | |A lru|---->|C lru| +---+--------------------------+ +-----------+ +-----------+ |inuse < | | objects | |frozen = 0 | +-----------+ ``` - [ ] 解凍 cpu_slab $\to$ partial 當偵測到 cpu_slab $\to$ partial 上的物件數量足夠多時,系統會將鏈結串列上的所有頁面退回到 node $\to$ partial。注意該過程會將 cpu_slab $\to$ partial 清空。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ +-----------+ | | (struct page *) | |B next|---->|C next| | | | +-----------+ +-----------+ | | | |pobjectsA | = |pobjectsB | + objectsA | | | | | | | | | | |inuse < | |inuse < | | | | | objectsA | | objectsB | | | | |frozen = 1 | |frozen = 1 | | | | +-----------+ +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ | | (struct list_head) | |D lru| +---+--------------------------+ +-----------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial | = NULL | | (struct page *) | | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ +-----------+ | | (struct list_head) | |B lru|---->|C lru|---->|D lru| +---+--------------------------+ +-----------+ +-----------+ +-----------+ |inuse < | |inuse < | | objects | | objects | |frozen = 0 | |frozen = 0 | +-----------+ +-----------+ ``` - [ ] 從 node $\to$ partial 取出一個頁面加入 cpu_slab $\to$ partial 特定狀況下,我們會將 node $\to$ partial 鏈結串列中的一個頁面放到 cpu_slab $\to$ partial 上。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjectsB | = objectsB | | | | | | | | |inuse < | | | | | objectsB | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 3 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ | | (struct list_head) | |C lru|---->|D lru| +---+--------------------------+ +-----------+ +-----------+ |inuse < | | objects | |frozen = 0 | +-----------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ +-----------+ | | (struct page *) | |C next|---->|B next| | | | +-----------+ +-----------+ | | | |pobjectsC | = |pobjectsB | + objectsC | | | | | | | | | | |inuse < | |inuse < | | | | | objectsC | | objectsB | | | | |frozen = 1 | |frozen = 1 | | | | +-----------+ +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 1 | | | (unsigned long) | | |partial ----|--->+-----------+ | | (struct list_head) | |C lru| +---+--------------------------+ +-----------+ ``` - [ ] 將頁面加入 node $\to$ partial 當一個頁面的所有空間都被配置後,該頁面就不會出現在其他位置 (除非是 Linux 啟用 `CONFIG_SLUB_DEBUG` 選項)。當釋放頁面上的一個物件時,該頁面會重新出現在 node $\to$ partial 上。 為了清晰起見,下圖展示在 Linux 核心在啟用 `CONFIG_SLUB_DEBUG` 時,node->full 鏈結串列的開頭,以表示頁面的移動情況。 ``` kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjectsB | = objectsB | | | | | | | | |inuse < | | | | | objectsB | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 3 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ | | (struct list_head) | |C lru|---->|D lru| | +--------------------------+ +-----------+ +-----------+ | |full ----|--->+-----------+ | | (struct list_head) | |E lru| +---+--------------------------+ +-----------+ kmem_cache +------------------------------+ |name (char *) | +------------------------------+ |cpu_slab | * per cpu variable pointer | (struct kmem_cache_cpu*) | | +--------------------------+ | |stat[NR_SLUB_STAT_ITEMS] | | | (unsigned) | | +--------------------------+ | |tid (unsigned long) | | | | | |freelist ------|----+ = page->freelist | | (void **) | | | | | v | |page | +--------+--------+--------+--------+ | | (struct page *) A | | | | | | | | +----------------------+ +--------+--------+--------+--------+ | | |freelist = NULL | | | |inuse = objects | | | |frozen = 1 | | | +----------------------+ | | | | |partial ------|--->+-----------+ | | (struct page *) | |B next| | | | +-----------+ | | | |pobjectsB | = objectsB | | | | | | | | |inuse < | | | | | objectsB | | | | |frozen = 1 | | | | +-----------+ | | | +---+--------------------------+ |node[MAX_NUMNODES] | | (struct kmem_cache_node*) | | +--------------------------+ | |nr_partial = 3 | | | (unsigned long) | | |partial ----|--->+-----------+ +-----------+ +-----------+ | | (struct list_head) | |E lru|---->|C lru|---->|D lru| | +--------------------------+ +-----------+ +-----------+ +-----------+ | |full | = NULL | | (struct list_head) | +---+--------------------------+ ``` ## SLUB 的安全強化機制 一些與 freelist 相關的核心選項如下: - [ ] `CONFIG_SLAB_FREELIST_HARDENED` 亦即安全強化的的 freelist,啟用此選項後,freelist 中的指標將通過與每個 cache 的隨機數 (`s->random`) 和指標儲存地址 (`ptr_addr`) 進行 XOR 運算,以增加攻擊者猜測的難度: ```c /* * Returns freelist pointer (ptr). With hardening, this is obfuscated * with an XOR of the address where the pointer is held and a per-cache * random number. */ static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr, unsigned long ptr_addr) { #ifdef CONFIG_SLAB_FREELIST_HARDENED /* ... */ return (void *)((unsigned long)ptr ^ s->random ^ swab((unsigned long)kasan_reset_tag((void *)ptr_addr))); #else return ptr; #endif } ``` 若存在一個漏洞允許攻擊者覆蓋這些指標,為了成功攻擊,他們需要知道 `s->random` 和 `ptr_addr`(這有可能發生,但難度更大)。 - [ ] `CONFIG_SLAB_FREELIST_RANDOMIZATION` 對 freelist 進行隨機化,在 Linux v4.8 中引入,啟用後,freelist 中區塊的初始順序不再是連續的。亦即原本當 slab 建立時,所有區塊都是空閒的,且是 freelist 的一部分。但若若啟用 `CONFIG_SLAB_FREELIST_RANDOMIZATION` 選項,配置將不再按它們在 slab 中的順序 (第一個配置返回 slab 中的第一個區塊,依此類推),而是按隨機順序。這由每個 cache 隨機序列(`s->random_seq`)決定(在 cache 初始化時定義),並且 cache 中的每個 slab 都從這個序列中隨機選擇一個起點 (這樣不同 slab 中的配置不會從相同的偏移量開始)。 此舉減輕核心 heap overflow 的風險,在此之前,若易受攻擊的物件和目標物件一個接一個地配置,它們總是相鄰的,因此很容易被攻擊者所利用。然而,這可透過配置許多目標物件,然後釋放其中一個,接著配置一個易受攻擊的物件來輕易繞過。事實上,易受攻擊的物件將被目標物件包圍,然後我們可觸發 heap overflow,並覆蓋其中一個目標物件。Michael S 和 Vitaly Nikolenko 撰寫的文章〈[Linux kernel heap feng shui in 2022](https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022)〉清楚解釋這點,feng shui 即「風水」的拼音。 ## 待整理 * [hardened_malloc](https://github.com/GrapheneOS/hardened_malloc) * [SlabDbg](https://github.com/NeatMonster/slabdbg): GDB plug-in that makes it easier to develop Linux kernel exploits targeting the SLUB allocator.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully