
Linux Kernel - Virtual Memory management

Author: 堇姬Naup

Introduction

When a program is executed it becomes a process, and every process has its own virtual memory; whenever the process touches memory, it does so through virtual addresses.
Whether running in user space or kernel space, a process only ever sees virtual memory, which is mapped onto a region of actual physical memory.

Early computers

Early computers used physical memory directly, which caused several problems:

  • No isolation between processes' memory: everything sits in one big block of physical memory, so corruption and out-of-bounds accesses are easy.
  • With no proper memory-management mechanism, launching a new process while another is already running may require writing the old one out to disk first.
    Suppose the machine has 128 MB of memory and three programs have these requirements:

Program  Requirement
A        20 MB
B        100 MB
C        60 MB

A and B are currently running. To run C we must first swap out B (swapping out A alone would not free enough space) and then load C, which hurts efficiency quite a bit.

  • Unstable addresses: as the scenario above shows, programs operate directly on physical addresses, so reloading a program to run it again can land it somewhere different, which makes jumps and relocation a headache.

Virtual memory

Virtual memory is mapped onto physical memory (the exact mapping mechanism is covered in the four-level page table write-up),
making every process appear to have contiguous memory of its own.
This neatly solves the isolation problem: a check of whether an address lies within your virtual address space is enough, and at worst a process can only trash the physical memory it owns.
It also solves the jump problem: a program only ever jumps to virtual addresses, and the hardware maps them to the corresponding physical addresses.
On top of that, it solves the fragmentation that physical memory accumulates over long use.
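
You can see the isolation for yourself: after fork(), parent and child share the same virtual addresses, yet a write lands in different physical pages. A minimal userspace sketch (the variable names are mine):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 1;   /* lives at the same virtual address in parent and child */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {              /* child */
        value = 2;               /* the write triggers copy-on-write */
        printf("child:  &value=%p value=%d\n", (void *)&value, value);
        exit(0);
    }
    wait(NULL);                  /* parent */
    printf("parent: &value=%p value=%d\n", (void *)&value, value);
    return 0;
}

Both lines print the same &value but different values: identical virtual addresses, different physical memory.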

What the whole virtual memory space looks like

32 bit virtual memory

1 KB (KiloByte) = 1024 Bytes
1 MB (MegaByte) = 1024 KB
1 GB (GigaByte) = 1024 MB

2^32 B = 4294967296 B = 4 GB (0x0000 0000 ~ 0xFFFF FFFF)

Virtual memory is split in two: user space (3 GB, 0x0000 0000 ~ 0xC000 0000) and kernel space (1 GB, 0xC000 0000 ~ 0xFFFF FFFF).

However, the lowest part of user space is a reserved region;
user space actually starts at 0x0804 8000.

Virtual memory is further divided into segments.
PS: between the stack and the heap I like to draw in the anonymous & file mapping area; this is the region mmap uses, and it grows from high addresses toward low.

[figure: 32-bit virtual memory layout]

64 bit virtual memory

1 Byte = 8 Bits
1 Kilobyte (KB) = 1024 Bytes
1 Megabyte (MB) = 1024 KB
1 Gigabyte (GB) = 1024 MB
1 Terabyte (TB) = 1024 GB
1 Petabyte (PB) = 1024 TB
1 Exabyte (EB) = 1024 PB
1 Zettabyte (ZB) = 1024 EB
1 Yottabyte (YB) = 1024 ZB

2^64 B = 18446744073709551616 B = 16 EB
That is absurdly large; nothing close to it is actually needed, so 64-bit implementations only use 48 bits of the virtual address.
2^48 B = 281474976710656 B = 256 TB

Virtual memory is split in two: user space (128 TB, 0x0000 0000 0000 0000 ~ 0x0000 7FFF FFFF F000) and kernel space (128 TB, 0xFFFF 8000 0000 0000 ~ 0xFFFF FFFF FFFF FFFF).

Between 0x0000 7FFF FFFF F000 and 0xFFFF 8000 0000 0000 lies a hole of non-canonical addresses: an address is canonical only when bits 48-63 are copies of bit 47.
https://stackoverflow.com/questions/25852367/x86-64-canonical-address
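
The rule is easy to check in code: an x86-64 address is canonical when its top bits are a sign extension of bit 47. A small sketch (is_canonical is my own helper, not a kernel API):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* canonical <=> bits 63..48 are all copies of bit 47 */
static bool is_canonical(uint64_t addr)
{
    return (int64_t)(addr << 16) >> 16 == (int64_t)addr;
}

int main(void)
{
    printf("%d\n", is_canonical(0x00007fffffffffffULL)); /* 1: top of user space */
    printf("%d\n", is_canonical(0x0000800000000000ULL)); /* 0: inside the hole */
    printf("%d\n", is_canonical(0xffff800000000000ULL)); /* 1: bottom of kernel space */
    return 0;
}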

Here's a full picture (I couldn't find one, so I drew it myself).

[figure: full 64-bit virtual memory layout]

process

Virtual memory cannot be discussed without processes.
Every process has a task_struct.

To learn about processes & threads, see
奔跑吧 CH 3.1 進程的誕生 (Run! Linux Kernel, Ch. 3.1, "The Birth of a Process")

Let's look at task_struct:
source code on /linux/v6.12.6/source/include/linux/sched.h#L778

The member
struct mm_struct *mm;
describes the process's own virtual memory.

mm_struct source code
source code on /linux/v6.12.6/source/include/linux/mm_types.h#L790
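
To get a feel for how these pieces connect, here is a hedged kernel-module sketch (mm_demo is a made-up name) that goes from current, the running task's task_struct, through ->mm to layout fields we will meet again below:

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/mm_types.h>

static int __init mm_demo_init(void)
{
    struct mm_struct *mm = current->mm;  /* NULL for pure kernel threads */

    if (mm)
        pr_info("code: %lx-%lx  heap: %lx-%lx  stack top: %lx\n",
                mm->start_code, mm->end_code,
                mm->start_brk, mm->brk, mm->start_stack);
    return 0;
}

static void __exit mm_demo_exit(void) { }

module_init(mm_demo_init);
module_exit(mm_demo_exit);
MODULE_LICENSE("GPL");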

PS: how these objects actually get created is a hole I'll dig into in another post, given the chance.

Distinguish between user space and kernel space

First, we know virtual memory is divided between user and kernel,
and mm_struct records this too.

https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/mm_types.h#L816
unsigned long task_size; /* size of task vm space */

This variable records that boundary.
32 bits
source code about 32bits task size

This is where the 3 GB figure and the 0xC0000000 boundary come from:

/*
 * This handles the memory map.
 *
 * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
 * a virtual address space of one gigabyte, which limits the
 * amount of physical memory you can use to about 950MB.
 *
 * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
 * and CONFIG_HIGHMEM64G options in the kernel configuration.
 */
#define __PAGE_OFFSET_BASE	_AC(CONFIG_PAGE_OFFSET, UL)
#define __PAGE_OFFSET		__PAGE_OFFSET_BASE

/*
 * User space process size: 3GB (default).
 */
#define IA32_PAGE_OFFSET	__PAGE_OFFSET
#define TASK_SIZE		__PAGE_OFFSET

64 bits
It is (1 << 47) minus one page (4 KB):

hex((1 << 47)- 1024 * 4) = 0x7ffffffff000

source code about 64bits task size

#define __VIRTUAL_MASK_SHIFT	47
#define task_size_max()		((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

describe the memory segment layout

Next, the members that describe the virtual memory segments:

struct mm_struct {
    unsigned long task_size;    /* size of task vm space */
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long mmap_base;  /* base of mmap area */
    unsigned long total_vm;    /* Total pages mapped */
    unsigned long locked_vm;  /* Pages that have PG_mlocked set */
    unsigned long pinned_vm;  /* Refcount permanently increased */
    unsigned long data_vm;    /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
    unsigned long exec_vm;    /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
    unsigned long stack_vm;    /* VM_STACK */
    ...	
}	
  • arg_start, arg_end: the argument list (at the very top of the stack)
  • env_start, env_end: the environment variable list (at the very top of the stack)
  • total_vm: total number of pages mapped into the address space
  • locked_vm: pages that must not be swapped out under memory pressure (see the sketch after this list)
  • pinned_vm: pages that must be neither swapped out nor moved
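
For example, locked_vm grows when a process calls mlock(); the count shows up as VmLck in /proc/self/status. A minimal userspace sketch (error handling trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 4096;              /* 16 pages */
    char *buf = malloc(len);
    char line[128];
    FILE *f;

    memset(buf, 0, len);                 /* fault the pages in */
    if (mlock(buf, len) != 0)            /* pin them: mm->locked_vm grows */
        perror("mlock");

    f = fopen("/proc/self/status", "r");
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmLck", 5) == 0)
            fputs(line, stdout);         /* expect: VmLck:  64 kB */
    fclose(f);

    munlock(buf, len);
    free(buf);
    return 0;
}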

The rest is easier to grasp from a picture.
PS: I had to draw this one myself too; a picture really does make it easier to understand.

[figure: mm_struct fields laid over the process virtual memory layout]

That is how virtual memory is laid out when a process is created.
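
The layout is easy to observe from user space: /proc/self/maps prints one line per mapping, and you can match the [heap] and [stack] lines against start_brk/brk and start_stack above. A minimal sketch:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    int c;

    while ((c = fgetc(f)) != EOF)   /* lines: range perms offset dev inode path */
        putchar(c);
    fclose(f);
    return 0;
}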

Virtual Memory Management

Next, let's look at how the kernel manages virtual memory.
The struct that manages a VMA (virtual memory area) is vm_area_struct:
* This struct describes a virtual memory area.

source code on /linux/v6.12.6/source/include/linux/mm_types.h#L667

This differs between older and newer kernels. You used to find, inside task_struct's mm_struct,
struct vm_area_struct *mmap; /* list of VMAs */
but newer kernels removed it (the VMA linked list and rb-tree were replaced by a maple tree).
The discussion below is based on an older kernel (5.19.17):
https://elixir.bootlin.com/linux/v5.19.17/source/include/linux/mm_types.h#L481

Here's a good reference resource:
https://richardweiyang-2.gitbook.io/kernel-exploring/nei-cun-guan-li/00-index/05-vma

Now let's dig in (starting from vm_area_struct itself, then how it ties into mm_struct, and finally the overall data structure).

vm_area_struct

source code on /linux/v5.19.17/source/include/linux/mm_types.h#L398

Every memory segment is managed through a vm_area_struct.
PS: a quick glance already shows the tell-tale rb-tree fields; don't overthink them for now.

/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	/* The first cache line has the info for VMA tree walking. */

	unsigned long vm_start;		/* Our start address within vm_mm. */
	unsigned long vm_end;		/* The first byte after our end address
					   within vm_mm. */

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next, *vm_prev;

	struct rb_node vm_rb;

	/*
	 * Largest free memory gap in bytes to the left of this VMA.
	 * Either between this VMA and vma->vm_prev, or between one of the
	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
	 * get_unmapped_area find a free area of the right size.
	 */
	unsigned long rb_subtree_gap;

	/* Second cache line starts here. */

	struct mm_struct *vm_mm;	/* The address space we belong to. */

	/*
	 * Access permissions of this VMA.
	 * See vmf_insert_mixed_prot() for discussion.
	 */
	pgprot_t vm_page_prot;
	unsigned long vm_flags;		/* Flags, see mm.h. */

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap interval tree.
	 *
	 * For private anonymous mappings, a pointer to a null terminated string
	 * containing the name given to the vma, or NULL if unnamed.
	 */

	union {
		struct {
			struct rb_node rb;
			unsigned long rb_subtree_last;
		} shared;
		/*
		 * Serialized by mmap_sem. Never use directly because it is
		 * valid only when vm_file is NULL. Use anon_vma_name instead.
		 */
		struct anon_vma_name *anon_name;
	};

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_lock &
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */
	void * vm_private_data;		/* was vm_pte (shared mem) */

#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
  • First, vm_start and vm_end mark the address range this struct describes.
  • *vm_next, *vm_prev, vm_rb and friends are covered shortly.
  • vm_page_prot and vm_flags record the segment's permissions and behavior.
  • A segment consists of many pages: vm_page_prot leans toward governing individual pages (it is what ends up in the page-table entries), while vm_flags governs the region as a whole.

source code about vm_flags

Here are some common ones:

vm_flags       meaning
VM_READ        readable
VM_WRITE       writable
VM_EXEC        executable
VM_SHARED      shareable between processes
VM_IO          maps device I/O space
VM_RESERVED    region must not be swapped out
VM_SEQ_READ    region is likely to be read sequentially
VM_RAND_READ   region is likely to be read randomly
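
These flags are largely set from the arguments user space passes to mmap(): PROT_READ/PROT_WRITE become VM_READ/VM_WRITE, MAP_SHARED becomes VM_SHARED, and the perms column of /proc/self/maps (e.g. rw-s) is rendered from them. A small sketch:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* expect VM_READ | VM_WRITE | VM_SHARED on the resulting VMA */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    printf("mapped at %p; look for the matching 'rw-s' line in /proc/self/maps\n", p);
    getchar();                 /* pause: inspect /proc/<pid>/maps from another shell */
    munmap(p, 4096);
    return 0;
}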

With what we know so far, here's a picture (bss and data omitted for lack of space).

[figure: vm_area_structs covering the process address space]

overall

Now for how all the structures relate.
This member of mm_struct strings its vm_area_structs into a doubly linked list:

struct vm_area_struct *mmap; /* list of VMAs */

and vm_area_struct carries the matching next and prev pointers:

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next, *vm_prev;

Each vm_area_struct's vm_mm also points back to the mm_struct, marking which task the VMA belongs to:

struct mm_struct *vm_mm;	/* The address space we belong to. */

So the picture so far looks like this:

[figure: mm_struct and its doubly linked list of VMAs]

Finally, the red-black tree.
If a process has many VM areas, a red-black tree lets lookups run in O(log N).
So every vm_area_struct lives in both a doubly linked list and an rb-tree.

mm_rb in mm_struct is the root node:

struct rb_root mm_rb;

and vm_rb in vm_area_struct is a tree node:

struct rb_node vm_rb;
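
This rb-tree is what find_vma() walks: given an address, it descends the tree to the covering VMA in O(log N). A hedged sketch of how a caller (e.g. the page-fault path) uses it; check_addr is my own illustration:

#include <linux/errno.h>
#include <linux/mm.h>

/* caller must hold mm's mmap_lock */
static int check_addr(struct mm_struct *mm, unsigned long addr)
{
    /* find_vma returns the first VMA with vm_end > addr, or NULL */
    struct vm_area_struct *vma = find_vma(mm, addr);

    if (!vma || addr < vma->vm_start)
        return -EFAULT;          /* addr falls in a gap: nothing maps it */
    return 0;                    /* vm_start <= addr < vm_end: valid mapping */
}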

[figure: VMAs threaded through both the list and the rb-tree]

load ELF segment to virtual memory

The function load_elf_binary implements initializing mm_struct and vm_area_struct, mmap-ing the ELF segments, and so on:

https://elixir.bootlin.com/linux/v6.12.5/source/fs/binfmt_elf.c#L819

kernel space virtual memory

With user-space virtual memory covered, let's open up the kernel-space layout.

Two ideas are important here:

  • Every process has its own virtual memory space mapped onto physical memory, so when processes A, B and C access the same user-mode virtual address they reach different things. When they access kernel-mode virtual memory, however, A, B and C all reach the same thing. This is the key point: kernel-mode virtual memory space is shared by every process (more detail in the referenced article).
  • Second, although the kernel manages physical memory, that does not mean the kernel operates on physical addresses; the kernel still uses virtual addresses.

Now let's see what kernel virtual memory looks like (32-bit and 64-bit).

32bits

0xC000 0000 ~ 0xFFFF FFFF is the kernel space range.

The full picture first:

[figure: 32-bit kernel space layout]

Direct mapping area (linear mapping area)

The first 896 MB of kernel space is the direct mapping area, mapped straight onto physical memory 0~896 MB (subtract 0xC000 0000 from the virtual address and you get the physical address). Even though the mapping is direct, the region is still used through virtual addresses, and page tables are still built for it.

The first 1 MB of this region is occupied at system boot; after that come the kernel's code, bss, data and so on, which live in the kernel's ELF image and are loaded into memory when it runs.
task_struct, vm_area_struct and mm_struct also live here (the structures created for each process),
as do the kernel stacks the kernel creates for every process (a user-space stack can grow dynamically, but a kernel stack is small and fixed in size).

In 32-bit this region is split in two: the first 16 MB is called DMA (what DMA is and why the split exists is another hole to fill later; it has to do with hardware limitations).
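
The arithmetic really is just one subtraction; a sketch of what the kernel's __pa()/__va() boil down to for 32-bit lowmem (simplified, ignoring config variations):

#define PAGE_OFFSET 0xC0000000UL

/* virt -> phys for the direct mapping area (first 896 MB) */
#define __pa(vaddr) ((unsigned long)(vaddr) - PAGE_OFFSET)
/* phys -> virt */
#define __va(paddr) ((void *)((unsigned long)(paddr) + PAGE_OFFSET))

/* e.g. __pa(0xC0100000) == 0x00100000: kernel text loaded at 1 MB physical */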

VMalloc

The top of the normal mapping area is high_memory:
https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/pgtable_32_areas.h

This defines where vmalloc starts: the top of the normal mapping area plus 8 MB (hence an 8 MB hole):

#define VMALLOC_OFFSET	(8 * 1024 * 1024)
#define VMALLOC_START	((unsigned long)high_memory + VMALLOC_OFFSET)

VMALLOC_END marks where the vmalloc area ends:

#ifdef CONFIG_HIGHMEM
# define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
#else
# define VMALLOC_END	(LDT_BASE_ADDR - 2 * PAGE_SIZE)
#endif

This whole region maps to physical memory dynamically (the same idea as malloc, except here it is vmalloc).
The physical memory backing a vmalloc range is not necessarily contiguous; it is mapped through the page tables (physical page -> virtual page).
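
Typical use from kernel code, as a hedged sketch (vmalloc_demo is a made-up module name): vmalloc() gives virtually contiguous memory even when the backing physical pages are scattered, at the cost of page-table setup:

#include <linux/module.h>
#include <linux/vmalloc.h>

static void *buf;

static int __init vmalloc_demo_init(void)
{
    buf = vmalloc(4 * 1024 * 1024);  /* 4 MB, virtually contiguous */
    if (!buf)
        return -ENOMEM;
    return 0;
}

static void __exit vmalloc_demo_exit(void)
{
    vfree(buf);                      /* unmap and free the scattered pages */
}

module_init(vmalloc_demo_init);
module_exit(vmalloc_demo_exit);
MODULE_LICENSE("GPL");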

永久映射區

繼續往上,這塊區域用來建立長期映射關係
alloc_pages()、

https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/pgtable_32_areas.h

The same place also defines where the region starts
and the maximum number of pages it can map, normally 1024:

#define PKMAP_BASE		\
	((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)

#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512
#else
#define LAST_PKMAP 1024
#endif
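
Putting alloc_pages() and kmap() together, a hedged sketch of how a 32-bit highmem page gets a long-lived mapping in this area (pkmap_demo is my own name):

#include <linux/gfp.h>
#include <linux/highmem.h>

static void pkmap_demo(void)
{
    struct page *page = alloc_pages(GFP_HIGHUSER, 0); /* one highmem page */
    char *vaddr;

    if (!page)
        return;
    vaddr = kmap(page);        /* may sleep until a pkmap slot is free */
    vaddr[0] = 42;             /* the page is now addressable by the kernel */
    kunmap(page);
    __free_pages(page, 0);
}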

Fixed mapping area

FIXADDR_START to FIXADDR_TOP (conventionally 0xFFFF F000) delimits the fixed mapping area.
In the source code, this part:
https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/fixmap.h

unsigned long __FIXADDR_TOP = 0xfffff000;

#define FIXADDR_TOP	((unsigned long)__FIXADDR_TOP)
#else
#define FIXADDR_TOP	(round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<<PMD_SHIFT) - \
			 PAGE_SIZE)
#endif

#define FIXADDR_SIZE		(__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START		(FIXADDR_TOP - FIXADDR_SIZE)

Temporary mapping area

Buffered I/O reads and writes (among other situations) need temporary file mappings, and this is the region they use,
via kmap_atomic.
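
In contrast to kmap(), kmap_atomic() gives a short-lived per-CPU mapping that must not sleep; a hedged sketch (copy_to_page is my own helper):

#include <linux/highmem.h>
#include <linux/string.h>

static void copy_to_page(struct page *page, const void *src, size_t len)
{
    void *dst = kmap_atomic(page);   /* temporary per-CPU mapping */

    memcpy(dst, src, len);
    kunmap_atomic(dst);              /* release before anything that sleeps */
}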

That covers 32-bit.

64 bits

Compared with the mere 1 GB of 32-bit kernel space, 64-bit gets 128 TB, so such fine-grained management is unnecessary (the details of that fine-grained management are another hole to fill later). And because virtual memory is so plentiful, direct mapping can be used throughout.

The regions below serve the same purposes as above, so I'll just give the ranges the source code defines (why 64-bit needs only a direct mapping area comes down to the abundance of space just mentioned; knowing that much is enough for now).

[figure: 64-bit kernel space layout]

https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/page_64_types.h#L61
First, task_size_max() here is (1 << 47) - PAGE_SIZE, which works out to 0x7FFF FFFF F000:

#define __VIRTUAL_MASK_SHIFT	47
#define task_size_max()		((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
#endif

#define TASK_SIZE_MAX		task_size_max()

This carves the 64-bit virtual memory into kernel space and user space.

This part:

#define __PAGE_OFFSET_BASE      _AC(0xffff880000000000, UL)
#define __PAGE_OFFSET           __PAGE_OFFSET_BASE

The direct mapping area starts at __PAGE_OFFSET. (Note: the 0xffff880000000000 base shown here is from an older kernel; in newer kernels with four-level paging it is __PAGE_OFFSET_BASE_L4 = 0xffff888000000000.)

https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/pgtable_64_types.h#L138
VMALLOC_START and VMALLOC_END delimit the vmalloc area:
>>> hex(0xffffc90000000000+((32 << 40))) = 0xffffe90000000000

#define __VMALLOC_BASE_L4	0xffffc90000000000UL
# define VMALLOC_START		__VMALLOC_BASE_L4

#define VMALLOC_SIZE_TB_L4	32UL
# define VMALLOC_SIZE_TB	VMALLOC_SIZE_TB_L4
#define VMEMORY_END		(VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
#define VMALLOC_END		VMEMORY_END

https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/pgtable_64_types.h#L130
Next comes VMEMMAP_START:

#define __VMEMMAP_BASE_L4	0xffffea0000000000UL
# define VMEMMAP_START		__VMEMMAP_BASE_L4

https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/include/asm/page_64_types.h#L50
Finally, the code segment:

#define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

Reference resources:
https://draveness.me/whys-the-design-linux-default-page/
https://hackmd.io/@r34796/HJCjT8Krq

After ALL

That wraps up this introduction to the mechanisms and source code behind virtual memory management.