# Memory Management
Memory management mainly deals with three issues:
1. External fragmentation
2. Internal fragmentation (Linux's buddy system mainly suffers from internal fragmentation; see the buddy system section for details. The remedy is slab allocation.)
> The largest structure in Linux is task_struct, and even that is only about 1KB. Most structures in Linux are small, so most of the time internal fragmentation is the problem being faced.
3. Search for empty space (the time complexity of finding a region of memory that satisfies the requester's needs)
> Internal fragmentation is tied to the page size. The larger the default page, the more often you "ask for fifty cents and get handed a dollar", so internal fragmentation grows; the upside is the bookkeeping shrinks, because there are fewer pages to index. The smaller the default page, the less internal fragmentation there is (imagine a convenience-store clerk giving you change entirely in one-dollar coins), but the page table must be huge to record information for every page (managing memory itself consumes memory).
Tendencies of kernel memory management:
* Managing physical memory itself costs memory (page tables)
* Balance between management overhead and waste
* The kernel itself prefers regions that are both physically and virtually contiguous
* better performance
* This differs greatly from user-space processes, which usually do not need physically contiguous memory; virtually contiguous is enough, and the kernel's memory management handles the rest
* Avoid external fragmentation
* Separate large allocations from small ones
* Avoid internal fragmentation
* Multiplexing more memory block requests into one page
* This requires an extra per-page management mechanism

For large allocations, go to the page allocator (buddy system).
For small allocations, go to the slab allocator.
(The slab layer exists to solve the internal fragmentation problem.)

# Pages (describes physical pages, not virtual pages)
The kernel treats physical pages as the basic unit of memory management, even though the CPU's smallest addressable unit is a byte or a word. The MMU deals only with pages; therefore, the MMU maintains the system's page tables with page-sized granularity.
* In terms of virtual memory, pages are the smallest unit that matters.
Each architecture defines its own page size:
* most 32-bit architectures have 4KB pages,
* most 64-bit architectures have 8KB pages.
* A machine with 4KB pages and 1GB of memory has its physical memory divided into 262144 distinct pages.
The structure representing a page:

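The code listing is missing here; a trimmed sketch of struct page as defined in `<linux/mm_types.h>` in 2.6-era kernels (fields abridged, and the exact layout varies by version):

```c
struct page {
	unsigned long		 flags;		/* status bits: dirty, locked in memory, ... */
	atomic_t		 _count;	/* usage count; read it via page_count() */
	atomic_t		 _mapcount;
	unsigned long		 private;
	struct address_space	*mapping;
	pgoff_t			 index;
	struct list_head	 lru;
	void			*virtual;	/* page's virtual address; NULL for unmapped high memory */
};
```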
* flags:
* The page's status. Such flags include whether the page is dirty or whether it is locked in memory. Bit flags represent the various values, so at least 32 different flags are simultaneously available.
* _count: (multiple processes' address spaces can map to the same page)
* How many references there are to the page. -1 means nobody is using this page, so the kernel may hand it out in a new allocation. The kernel checks this value via page_count().
* page_count()
* returns 0 if the page is free
* returns the number of users if the page is in use
* virtual: the page's virtual address.
* Some memory (high memory) is not permanently mapped into the kernel's address space. In that case, this field is NULL.
* An important point: the page structure is associated with physical pages, not virtual pages, so what a given instance describes is fleeting and changes quickly. Even if the data contained in the page persists, it might not always be associated with the same page structure, because of swapping and so on. The data structure's goal is to describe physical memory, not the data contained therein. (What the kernel cares about is the overall state of physical memory, not which particular data is mapped to which physical addresses.)
* The kernel uses this structure to keep track of all the pages in the system, because the kernel needs to know whether a page is free (that is, not yet allocated). If a page is not free, the kernel needs to know who owns it. Possible owners include:
* user-space processes
* dynamically allocated kernel data
* static kernel code
* the page cache...
* Developers are often surprised that an instance of this structure is allocated for every physical page in the system. They think, "*what a lot of memory wasted!!*" Let's look at the worst case:
* assume struct page consumes 40 bytes of memory,
* the system has 8KB physical pages,
* the system has 4GB of physical memory.
* Then the system has 524288 pages
* and therefore 524288 page structures.
* page structures --> 20MB
* That sounds like a lot, but 20MB is not bad relative to 4GB.
* The author's take: 20MB is a fair price for managing all of the system's physical pages.
# Zones


To cope with different needs, the kernel groups pages into different zones, each addressing a different requirement. In particular, Linux has to deal with two shortcomings of hardware with respect to memory addressing:
* Some hardware can perform DMA to only certain memory addresses.
* Some architectures can physically address larger amounts of memory than they can virtually address. Consequently, some memory is not permanently mapped into the kernel address space.
Because of these constraints, Linux provides four primary zones:
1. ZONE_DMA:
* a zone dedicated to DMA-capable memory
2. ZONE_DMA32:
* like ZONE_DMA, but accessible only by 32-bit devices. On some architectures, this zone is a larger subset of memory.
3. ZONE_NORMAL: the kernel looks here first for memory
* contains normal, regularly mapped pages.
4. ZONE_HIGHMEM: user space is served from here first
* high memory, which is pages not permanently mapped into the kernel's address space.
The actual layout is architecture-dependent, not fixed;
check the specific architecture to see how its zones are laid out.





A simple way to think about it: Linux carves these zones out as resource pools.
ZONE_DMA is dedicated to serving those who need DMA.
In some situations, allocations draw pages from more than one zone.
(A bit like how MStar likes to pull people from the FAE side over to suffer on the RD side.)
A normal allocation can be satisfied from ZONE_NORMAL or from ZONE_DMA, but not both: allocations cannot cross zone boundaries.
The kernel prefers to satisfy normal allocations from the normal zone, to save ZONE_DMA's space for those who truly need it. (When overall memory runs short, the kernel relaxes this preference and allocates from whichever zone has memory available.)
Not all architectures define all zones.
For example, a 64-bit system such as Intel's x86-64 can fully map and handle 64 bits of memory, so it has no ZONE_HIGHMEM, and all physical memory is contained within ZONE_DMA and ZONE_NORMAL.
The details differ between kernel versions, but the overall picture stays the same.


This structure (struct zone) is large, but since a system has only a few zones, the cost is acceptable. Looking inside:
* lock: a spinlock that protects the structure from concurrent access.
* watermark: an array holding the min, low, and high watermarks for this zone.
* The kernel uses watermarks to set benchmarks for suitable per-zone memory consumption.
* name: stores "DMA", "Normal", or "HighMem".
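For reference, a heavily trimmed sketch of struct zone from `<linux/mmzone.h>` (2.6-era; most fields omitted):

```c
struct zone {
	unsigned long	 watermark[NR_WMARK];	/* min, low, and high watermarks */
	/* ... */
	spinlock_t	 lock;			/* protects the zone from concurrent access */
	/* ... */
	const char	*name;			/* "DMA", "Normal", or "HighMem" */
};
```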
# Getting Pages(Buddy System)
At the lowest level, the kernel provides exactly one mechanism for allocating memory pages, with several interfaces layered on top of it for developers.

The core function allocates 2^order contiguous pages
(1, 2, 4, 8, 16, 32, 64, ... pages)
and returns a pointer to the first page's page structure; on error it returns NULL.
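The interface being described, as declared in `<linux/gfp.h>` in 2.6-era kernels:

```c
/* allocate 2^order contiguous physical pages */
struct page * alloc_pages(gfp_t gfp_mask, unsigned int order);
```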
You can convert a given page to its logical address with the following function (it looks up which virtual address the page actually maps to):

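The missing listing (2.6-era signature):

```c
/* returns the logical (virtual) address of the given page */
void * page_address(struct page *page);
```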
if you have no need for the actual struct page, you can call

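The missing listing (2.6-era signature):

```c
/* like alloc_pages(), but returns the logical address of the first page */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
```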
This also calls alloc_pages(), but it directly returns the virtual address of the first page of the whole region; because the pages are contiguous, the other pages simply follow the first.
If you need only one page,
the following two macros save some typing:

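The two single-page macros (2.6-era):

```c
struct page * alloc_page(gfp_t gfp_mask);	/* = alloc_pages(gfp_mask, 0) */
unsigned long __get_free_page(gfp_t gfp_mask);	/* = __get_free_pages(gfp_mask, 0) */
```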

# Getting Zeroed Pages
If you need the returned page filled with zeros, use the function

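The missing listing (2.6-era signature):

```c
/* like __get_free_page(), but the returned page is zero-filled */
unsigned long get_zeroed_page(gfp_t gfp_mask);
```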
This function works the same as __get_free_page(), except that the allocated page is then zero-filled: every bit of every byte is unset.
> This is useful for pages given to userspace, because the random garbage in an allocated page is not so random; it might contain sensitive data. All data must be zeroed or otherwise cleaned before it is returned to userspace to ensure system security is not compromised.


# Freeing Pages
A family of functions enables you to free allocated pages when you no longer need them

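The freeing functions, plus a usage sketch that allocates and releases eight pages (2.6-era signatures; the fragment assumes it runs inside a kernel function that can return -ENOMEM):

```c
void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
void free_page(unsigned long addr);

/* example: allocate 2^3 = 8 contiguous pages, then release them */
unsigned long page;

page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
	/* insufficient memory: you must handle this error! */
	return -ENOMEM;
}
/* 'page' is now the logical address of the first of eight contiguous pages */
free_pages(page, 3);
```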
This requires great care!!! You must make sure the pages you free are pages you allocated. Passing the wrong struct page or address, or an incorrect **order**, will bring the kernel down. Remember, the kernel trusts itself, unlike with user-space.
> the kernel will happily hang itself if you ask it!
>

* GFP_KERNEL is one of the gfp_mask flags.
* Always do error handling after a call to __get_free_pages(), because kernel allocations can fail; skipping the check makes for painful debugging.
These low-level page functions are useful when you need page-sized chunks of physically contiguous pages, especially when you need exactly one or two pages.
For more general byte-sized allocations, the kernel provides kmalloc().
# kmalloc()- 跟Slab去要memory
kmalloc()'s operation is similar to user-space malloc(), except for the flags parameter. It is used to allocate byte-sized memory chunks. If you need whole pages, the previously discussed interfaces are a better choice. Note also that physically contiguous memory matters for DMA.
> For most kernel allocations, however, kmalloc() is the preferred interface. (Most allocations do not need anywhere near a whole page, so kmalloc() is what gets used.)

The function returns a pointer to a region of memory that is at least size bytes in length
* the region of memory allocated is physically contiguous.
* On error, it returns NULL.
* Kernel allocations always succeed, unless an insufficient amount of memory is available.
* So always do error handling after kmalloc().

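A usage sketch of kmalloc() (struct dog is just a made-up example type):

```c
struct dog *p;

p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p) {
	/* handle error: the allocation failed */
}
```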

# gfp_mask Flags
The flags fall into three parts:
* Action Modifiers
* specify how the kernel is supposed to allocate the requested memory
* in certain situations, only certain methods can be employed to allocate memory
* Zone Modifiers
* specify from which zone to allocate memory
* Types
* specify a combination of action and zone modifiers for a particular type of memory allocation
* simplify the specification of multiple modifiers
* e.g. GFP_KERNEL (used in process context inside the kernel)
# Action Modifiers
All the flags, the action modifiers included, are declared in <linux/gfp.h>
The file <linux/slab.h> includes this header, however, so you often need not include it directly

These allocations can be specified together. For example

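The example referred to here, spelling out the action modifiers explicitly (in 2.6-era kernels this is exactly what GFP_KERNEL expands to):

```c
ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
```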
The above can be simplified to
`ptr = kmalloc(size, GFP_KERNEL)`
This call instructs the page allocator (ultimately alloc_pages()) that the allocation can block, perform I/O, and perform filesystem operations, if needed.This enables the kernel great freedom in how it can find the free memory to satisfy the allocation.
# Zone Modifiers
Zone modifiers decide from which zone the allocation is fulfilled. Normally, allocations can be fulfilled from any zone;
the kernel prefers ZONE_NORMAL.

* __GFP_DMA
* forces the kernel to satisfy the request from ZONE_DMA
* This flag says, *I absolutely must have memory into which I can perform DMA*
* __GFP_HIGHMEM
* instructs the allocator to satisfy the request from either ZONE_NORMAL or (preferentially) ZONE_HIGHMEM
* This flag says, *I can use high memory, so I can be a doll and hand you back some of that, but normal memory works, too*
> __GFP_HIGHMEM cannot be used with __get_free_pages() or kmalloc(), because these both return a logical address rather than a page structure; they could end up allocating memory not currently mapped into the kernel's virtual address space, which therefore has no logical address.
>
Only alloc_pages() can allocate high memory.
The majority of your allocations, however, will not specify a zone modifier because ZONE_NORMAL is sufficient.
# Type Flags (flags with filling packed inside)
The type flags specify the required action and zone modifiers to fulfill a particular type of transaction, so kernel code can use just the type flag instead of specifying the myriad of other flags it might need. (A type flag bundles several modifiers together, making it easier to use.)



Let's look at the most commonly used flag -- GFP_KERNEL:
* a normal-priority allocation that might sleep,
* so it can be used only from process context (where it is safe to reschedule; that is, no locks are held and so on).
* Because this flag does not make any stipulations as to how the kernel may obtain the requested memory, the memory allocation has a high probability of succeeding.
* the GFP_KERNEL allocation can put the caller to sleep
* to swap inactive pages to disk,
* flush dirty pages to disk, and so on
GFP_ATOMIC
* the allocation will not sleep
* If no sufficiently sized contiguous chunk of memory is available, the kernel is not likely to free memory, because it cannot put the caller to sleep
* Because GFP_ATOMIC cannot perform any of these actions, it has less of a chance of succeeding (at least when memory is low) compared to GFP_KERNEL allocations
* the only choice when you cannot sleep, e.g. interrupt handlers, softirqs and tasklets
GFP_NOIO
* might sleep
* does not initiate any disk I/O whatsoever to fulfill the request
GFP_NOFS
* might initiate disk I/O, but does not initiate filesystem I/O.
GFP_DMA
* specify that the allocator must satisfy the request from ZONE_DMA
* This flag is used by device drivers, which need DMA-able memory for their devices
* Normally, you combine this flag with the GFP_ATOMIC or GFP_KERNEL flag.
Rough guide to the flags for common development situations:
* if you cannot sleep, use GFP_ATOMIC
* if you can sleep, use GFP_KERNEL

# kfree()
After kmalloc(), remember to kfree() what you allocated.
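The signature, plus a usage sketch (BUF_SIZE is a hypothetical constant):

```c
void kfree(const void *ptr);

/* usage: free only memory obtained from kmalloc(), and only once */
char *buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (buf) {
	/* ... use buf ... */
	kfree(buf);
}
```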
# vmalloc()
vmalloc() allocates memory that is only virtually contiguous and not necessarily physically contiguous. (This is how user-space allocation behaves: the memory malloc() returns is likewise only virtually contiguous.)
kmalloc(), by contrast, guarantees that the memory it hands out is both virtually and physically contiguous.
vmalloc() deliberately hands you noncontiguous physical chunks, then fixes up the page table so that the allocated region appears virtually contiguous.
Only hardware devices require physically contiguous memory: on many architectures, hardware devices sit on the other side of the MMU and know nothing about virtual addresses. Memory used only by software, like process-related buffers, is fine being merely virtually contiguous.
Still, kernel code tends to use kmalloc() rather than vmalloc(), usually for performance reasons:
(given vmalloc()'s drawbacks, its remaining use case is when you need a large region of memory but do not care much about performance)
> The vmalloc() function, to make nonphysically contiguous pages contiguous in the virtual address space, must specifically set up the page table entries. Worse, pages obtained via vmalloc() must be mapped by their individual pages (because they are not physically contiguous), which results in much greater TLB thrashing than you see when directly mapped memory is used. Because of these concerns, vmalloc() is used only when absolutely necessary: typically, to obtain large regions of memory.
For example, when modules are dynamically inserted into the kernel, they are loaded into memory created via vmalloc()
* it may sleep, so it cannot be used in interrupt context
* use vfree() to free vmalloc() memory
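The signatures plus the classic usage sketch (16 virtually contiguous, not necessarily physically contiguous, pages):

```c
void * vmalloc(unsigned long size);
void vfree(const void *addr);

/* example */
char *buf;

buf = vmalloc(16 * PAGE_SIZE);	/* get 16 pages */
if (!buf) {
	/* error: failed to allocate memory */
}
/* buf now points to a virtually contiguous block of at least 16*PAGE_SIZE bytes */
vfree(buf);
```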

# A quick aside on malloc()
Q: when the user calls malloc(4) to allocate 4 bytes of memory how does the operating system (Linux) respond? Which subsystem responds to this system call?
https://stackoverflow.com/questions/5716100/what-happens-in-the-kernel-during-malloc
How malloc works:
https://www.youtube.com/watch?v=XV5sRaSVtXQ
What if I try to malloc WAY too much memory?
https://www.youtube.com/watch?v=Fq9chEBQMFE
# Slab Layer
Think of this as the kernel's recycling truck for memory: when memory is freed, the kernel first reclaims it into this cache rather than handing it straight back to the buddy system. (The structure is kept intact, so the next requester of the same structure type can be given it directly, skipping a pile of initialization work.) If someone needs it, it gets handed out again, so the memory you receive may be recycled rather than brand new, which is more eco-friendly. This construct is called a free list; the free list acts as an object cache, caching a frequently used type of object.
The kernel manages this through the slab layer (slab allocator), so users need not face the page allocator directly (which, as noted earlier, is prone to internal fragmentation). The slab layer's goals are as follows:
* Frequently used data structures tend to be allocated and freed often, so cache them
* e.g. the frequently used task_struct, mm_struct, and friends
* Frequent allocation and deallocation can result in memory fragmentation (the inability to find large contiguous chunks of available memory).To prevent this, the cached free lists are arranged contiguously. Because freed data structures return to the free list, there is no resulting fragmentation
* frequent allocate/free/allocate/free cycles fragment the contiguous physical address space more and more
* The free list provides improved performance during frequent allocation and deallocation because a freed object can be immediately returned to the next allocation
* If the allocator is aware of concepts such as
* object size
* page size
* and total cache size
* it can make more intelligent decisions
* If part of the cache is made per-processor (separate and unique to each processor on the system), allocations and frees can be performed without an SMP lock.
* If the allocator is NUMA-aware, it can fulfill allocations from the same memory node as the requestor
* Stored objects can be colored to prevent multiple objects from mapping to the same cache lines.
# Design of the Slab Layer
The slab layer divides different objects into groups called caches; each cache stores one specific kind of object.

* a cache for task_struct
* a cache for the inode struct
* plus a set of general-purpose caches created specifically for kmalloc() to draw from
* These caches are in turn divided into slabs. A slab is composed of one or more physically contiguous pages; typically, a slab is just a single page.
* Each slab contains some number of objects.
* A slab is in one of three states (this strategy is there to reduce fragmentation):
* full (no free objects)
* partial (some free, some allocated)
* empty (no allocated objects)
* The kernel satisfies requests from a partial slab first, then from an empty one.


The cache structure (struct kmem_cache):
The slab allocator creates new slabs by interfacing with the low-level kernel page allocator via __get_free_pages():

Ignoring the NUMA case, this can be simplified to:

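The simplified (non-NUMA) form being referred to, roughly as it appears in 2.6-era mm/slab.c (field names like gfpflags and gfporder are from that era):

```c
static inline void * kmem_getpages(struct kmem_cache *cachep, gfp_t flags)
{
	void *addr;

	flags |= cachep->gfpflags;
	addr = (void *) __get_free_pages(flags, cachep->gfporder);

	return addr;
}
```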
Freeing: memory is then given back through kmem_freepages(), which ultimately calls free_pages() on the cache's pages.

The slab layer exists to avoid allocating and freeing pages, so:
* the page allocation function is invoked only when no free objects remain in the cache's slabs (it touches the buddy system only as a last resort),
* and the freeing function is called only when system memory grows low.
* The interface lets you
* create a cache
* destroy a cache
* allocate objects
* free objects
# Slab Allocator interface
# Create
不要在interrupt context去call他
會睡覺,成功回傳pointer to the created cache.
a new cache is created via

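The missing signature (2.6-era; in older kernels the ctor took extra arguments):

```c
struct kmem_cache * kmem_cache_create(const char *name,
				      size_t size,
				      size_t align,
				      unsigned long flags,
				      void (*ctor)(void *));
```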
* name: the name of the cache
* size: the size of each element in the cache
* flags: control the cache's behavior
* SLAB_HWCACHE_ALIGN (see the book)
* SLAB_POISON
* fills the slab with a known magic value, "a5a5a5a5"
(useful for catching accesses to uninitialized memory)
* SLAB_RED_ZONE
* inserts "red zones" in the slab layer to detect buffer overruns
* SLAB_PANIC
* if the allocation fails, panic
* SLAB_CACHE_DMA
* dedicated to DMA
* ctor
* the constructor for the cache, called whenever new pages are added to the cache (NULL is usually passed)
# Destroy
If a driver creates its own caches,
it must call this to release them on exit. Before calling it, two things must be ensured:
* all slabs in the cache are empty, and
* no one accesses the cache during (and obviously after) a call to kmem_cache_destroy(); this synchronization is the caller's responsibility.
Also not callable from interrupt context: it can sleep.
To destroy a cache, call

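The missing signature (2.6-era; in newer kernels the return type is void):

```c
/* returns 0 on success */
int kmem_cache_destroy(struct kmem_cache *cachep);
```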
# Allocating from the Cache
After a cache is created, an object is obtained from the cache via

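The missing listings (2.6-era signatures):

```c
/* returns a pointer to an object from the cache; a fresh slab is built
 * (ultimately via __get_free_pages()) only if no free objects remain */
void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);

/* marks the object as free and returns it to the cache */
void kmem_cache_free(struct kmem_cache *cachep, void *objp);
```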
# Example of Using the Slab Allocator
**task_struct的例子** /kernel/fork.c
First, the kernel has a global variable that stores a pointer to the task_struct cache:

During kernel initialization, in fork_init(), defined in kernel/fork.c, the cache is created:

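The two listings referenced above, roughly as in 2.6-era kernel/fork.c (SLAB_NOTRACK exists only in some versions; SLAB_PANIC means the return value need not be checked):

```c
/* the global cache pointer */
static struct kmem_cache *task_struct_cachep;

/* in fork_init(): create the cache */
task_struct_cachep = kmem_cache_create("task_struct",
				       sizeof(struct task_struct),
				       ARCH_MIN_TASKALIGN,
				       SLAB_PANIC | SLAB_NOTRACK,
				       NULL);
```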
This creates a cache named "task_struct", dedicated to storing objects of type struct task_struct.
(The alignment requirement comes from the L1 cache line size in bytes, which is ISA-dependent.)
The objects are created with an offset of ARCH_MIN_TASKALIGN bytes within the slab. This preprocessor definition is an architecture-specific value, usually defined as L1_CACHE_BYTES, the size in bytes of the L1 cache.
Each time a user-space process calls fork(), a new task_struct is created; see do_fork().
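In do_fork() (via dup_task_struct()), the new descriptor comes out of this cache; a sketch:

```c
struct task_struct *tsk;

tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
	return NULL;
```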
After a task dies, if it has no children waiting on it, its process descriptor is freed and returned to the task_struct_cachep slab cache. This is done in free_task_struct():
> kmem_cache_free(task_struct_cachep, tsk);
>
Because the task structure is used so heavily (always needed),
the task_struct_cachep cache is never destroyed. But if it were, you would destroy it like this:

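The hypothetical destroy call (2.6-era, where kmem_cache_destroy() returns int):

```c
int err;

err = kmem_cache_destroy(task_struct_cachep);
if (err) {
	/* error destroying cache */
}
```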
Easy enough? The slab layer handles all the low-level alignment, coloring, allocations, freeing, and reaping during low-memory conditions. If you frequently create many objects of the same type, consider using the slab cache. Definitely do not implement your own free list!
# Statically Allocating on the Stack
Unlike a user-space process's stack, which is large and grows dynamically, the kernel stack is small and fixed in size. The size depends on
* the architecture
* a compile-time option.
Historically, the kernel stack has been two pages per process:
* 8KB on 32-bit architectures (4KB*2)
* 16KB on 64-bit architectures (8KB*2)
# Single-Page Kernel Stacks
Early in the 2.6 series, kernel developers wanted to shrink the stack further, to a single page: 4KB on 32-bit and 8KB on 64-bit. Two reasons:
* less memory consumed per process
* most importantly, as uptime grows, two unallocated, contiguous physical pages become increasingly hard to find; external fragmentation gets severe
There is a further complication:
* each process's entire call chain has to fit in its kernel stack
* historically, interrupt handlers also used the stack of the process they interrupted, squeezing it further
* once the kernel stack shrinks to a single page, interrupt handlers no longer fit
* to solve this, the kernel developers served up interrupt stacks:
* a per-processor stack used by interrupt handlers,
* consuming only a single page per CPU.
# Playing Fair on the Stack
* In any given kernel function, you should keep the sum of all local variables to a maximum of a couple hundred bytes.
* Declaring a large static array or struct on the stack here is asking for trouble.
* The kernel has no other protection mechanism for its stack.
* If the stack overflows, whatever sits at the end of the stack gets clobbered.
* That happens to be the thread_info structure discussed in chapter 3.
* At best, a kernel panic.
* At worst, the overflow will silently corrupt data.
Therefore, if you need a lot of memory, use the dynamic allocation schemes covered earlier in this chapter instead of improvising large on-stack arrays.
# High Memory Mappings
Pages in high memory are not permanently mapped into the kernel's address space, so pages obtained via alloc_pages() with __GFP_HIGHMEM might have no logical address; they must be mapped into the kernel's address space before use.

# Permanent Mappings

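This section was left blank in the notes; the permanent-mapping API from `<linux/highmem.h>` (2.6-era):

```c
/* maps the given page into the kernel's address space; works on both
 * high and low memory. May sleep, so it is usable only from process context. */
void *kmap(struct page *page);

/* the number of permanent mappings is limited: unmap when you are done */
void kunmap(struct page *page);
```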
# Temporary Mappings (atomic mappings)
For contexts that cannot sleep (interrupt context) but need a mapping.
These are a set of reserved mappings that can hold a temporary mapping. The kernel can atomically map a
* **high memory** page into one of these reserved mappings


The km_type parameter indicates the purpose of the temporary mapping.

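The temporary-mapping API (2.6-era signatures; later kernels drop the km_type argument):

```c
/* sets up a temporary mapping; disables kernel preemption and never
 * sleeps, so it is usable in interrupt context */
void *kmap_atomic(struct page *page, enum km_type type);

/* undoes the mapping and re-enables kernel preemption */
void kunmap_atomic(void *kvaddr, enum km_type type);
```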
# Per-CPU Allocations
Modern SMP-capable operating systems use per-CPU data—data that is unique to a given processor—extensively
* Per-CPU data is usually stored in an array,
* where each element of the array belongs to one specific CPU.

Note that no lock is needed here: because the data is per-CPU, no concurrency concerns exist.
Kernel preemption is the only thing per-CPU data must worry about:
* if your code is preempted and rescheduled on another CPU, the cpu variable is no longer valid, because it points at the wrong CPU (in general, you must not sleep after obtaining the current CPU number);
* if another task preempts your code, it can concurrently access my_percpu on the same processor, which is a race condition.
Neither of these is actually a worry, because:

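The classic array-based sketch shows why: get_cpu() disables kernel preemption before returning the current CPU number, and put_cpu() re-enables it:

```c
unsigned long my_percpu[NR_CPUS];
int cpu;

cpu = get_cpu();	/* get current processor and disable kernel preemption */
my_percpu[cpu]++;	/* safe: we cannot migrate to another CPU, and no other
			 * task can preempt us on this one */
put_cpu();		/* enable kernel preemption again */
```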
# The New percpu Interface
The above is the older per-CPU approach; what follows is the newer one (the old one still works).
The header <linux/percpu.h> declares all the routines. You can find the actual definitions there, in mm/slab.c, and in <asm/percpu.h>.
# Per-CPU Data at Compile-Time


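The compile-time interface, roughly as the 2.6-era book presents it (type and name are placeholders):

```c
/* create one instance of 'type' per CPU, named 'name' */
DEFINE_PER_CPU(type, name);
/* in other files, use: DECLARE_PER_CPU(type, name); */

get_cpu_var(name)++;	/* disables preemption, then touches this CPU's copy */
put_cpu_var(name);	/* done: re-enable kernel preemption */

per_cpu(name, cpu);	/* another CPU's copy: no preemption protection here! */
```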
注意: compile-time per-CPU examples do not work for modules because the linker actually creates them in a unique executable section.
(They live in .data.percpu, if you are curious.) If you want to access per-CPU data from modules, you must use the dynamic interface below.
# Per-CPU Data at Runtime

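The runtime interface, sketched per the 2.6-era API (alloc_percpu() is a macro over __alloc_percpu(size, align)):

```c
void *percpu_ptr;
unsigned long *foo;

/* allocate one instance of 'unsigned long' per CPU */
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr) {
	/* error allocating memory */
}

foo = get_cpu_ptr(percpu_ptr);	/* this CPU's copy; preemption disabled */
/* ... manipulate *foo ... */
put_cpu_ptr(percpu_ptr);	/* re-enable kernel preemption */

free_percpu(percpu_ptr);
```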
__alignof__ is a gcc construct similar to sizeof(): it returns the required (or recommended) alignment, in bytes, of a given type or lvalue.

A quick supplement on alignment (discussed in detail in chapter 19):
http://mark-shih.blogspot.com/2012/10/compiler-data-alignment.html
https://www.kernel.org/doc/html/latest/arm/mem_alignment.html
And by Jserv:
https://hackmd.io/@sysprog/c-memory?type=view


# Reasons for Using Per-CPU Data
* reduction in locking requirements
* greatly reduces cache invalidation.
* this occurs as cpu try to keep their caches in sync.
* If one processor manipulates data held in another processor’s cache, that processor must flush or otherwise update its cache(重要觀念)
* Constant cache invalidation is called thrashing the cache and wreaks havoc on system performance
* The use of per-CPU data keeps cache effects to a minimum because processors ideally access only their own data
* The percpu interface cache-aligns all data to ensure that accessing one processor’s data does not bring in another processor’s data on the same cache line
* The obvious benefit is a big reduction in locking requirements. The only safety requirement is disabling kernel preemption, which costs far less than locking, and the interface takes care of that entirely.
* Per-CPU data can be used safely from both interrupt context and process context (but you must not sleep while accessing per-CPU data, or you may wake up to find yourself in a different room, i.e. on a different CPU).
* Nothing currently forces you to use the new interface, but it is strongly recommended.
# Summary: Picking an Allocation Method
Which allocation method should you use in which situation?
* If you need contiguous physical pages
* kmalloc()
* low-level page allocators
* which flag should you use?
* GFP_ATOMIC
* high priority allocation that will not sleep
* interrupt handlers
* other can't sleep code
* GFP_KERNEL
* code that can sleep
* process context code
* does not hold a spin lock
* if you want to allocate from high memory
* alloc_pages()
* return a struct page
* not a pointer to a logical address
* because high memory might not be mapped
* the only way to access it is via its page structure
* kmap()
* to obtain an actual pointer
* because it maps high memory into the kernel's logical address space
* if you do not need physically contiguous pages-- only virtually contiguous
* vmalloc()
* worse performance
* but gives you contiguous virtual addresses over noncontiguous physical ones
* if you are creating and destroying many large data structures
* set up a slab cache.
* slab layer
* maintains per-processor object cache
* greatly improves the performance of object allocation and deallocation
* the slab layer hands you a ready-made object from its recycling cache rather than allocating brand-new memory