C Language: Memory Management, Alignment, and Hardware Characteristics

---
tags: NCKU Linux Kernel Internals, C語言
---

# C Language: Memory Management, Alignment, and Hardware Characteristics

[你所不知道的 C 語言：記憶體管理、對齊及硬體特性](https://hackmd.io/@sysprog/c-memory?type=view)

## [What a C programmer should know about memory](https://marek.vavrusa.com/memory/)

### Virtual memory

Virtual memory lets every process behave as if it owned one contiguous address space, so a process is not confined to the physical memory that happens to be free. The kernel is responsible for mapping virtual memory onto physical memory (RAM, for example) or even onto an external disk. When we request memory with malloc, the virtual memory allocator (VMA) may return a usable address (the return value of malloc is non-NULL) even though physical memory cannot actually supply the requested amount (**overcommitting**). The shortage only surfaces once the address is actually accessed: at that point the kernel has to deal with the lack of physical memory, and it may resort to the OOM killer to kill a process.

### Process memory layout

[In-Memory Layout of a Program (Process)](https://gabrieletolomei.wordpress.com/miscellanea/operating-systems/in-memory-layout/)

![](https://i.imgur.com/h5LD8T6.png)

### Stack allocation

**Allocating memory does not have to involve the heap!**

[alloca](https://linux.die.net/man/3/alloca)

> The alloca() function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca() returns to its caller.

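
The behaviour quoted above can be sketched roughly as follows. This is a minimal illustration, assuming a glibc-style `<alloca.h>` is available; the helper name `shout` and its parameters are made up for this note and do not come from the original article. The point is that the temporary buffer lives in the caller's stack frame, so its contents have to be copied out before the function returns.

```c
#include <alloca.h>
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes `word` followed by '!' into `out`.
 * Returns the length of the result, or 0 if `out` (capacity `cap`) is too small. */
size_t shout(const char *word, char *out, size_t cap)
{
    assert(word != NULL && out != NULL);

    size_t n = strlen(word);
    if (n + 2 > cap)            /* need room for '!' and the terminating NUL */
        return 0;

    char *tmp = alloca(n + 2);  /* carved out of this function's stack frame */
    memcpy(tmp, word, n);
    tmp[n] = '!';
    tmp[n + 1] = '\0';

    memcpy(out, tmp, n + 2);    /* copy out: tmp is gone once we return */
    return n + 1;
}
```

Calling `shout("ha", buf, sizeof buf)` with an 8-byte `buf` fills it with `"ha!"` and returns 3. There is no `free` anywhere, and returning `tmp` itself would hand the caller a dangling pointer.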
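* alloca places the allocation inside the stack frame.
* A stack frame is not the same thing as a memory frame (also called a physical page); it holds things such as local variables, function arguments, and the function's return address.
* A stack frame only lives as long as its function call, so data on the stack is created and reclaimed automatically. alloca uses the remaining stack space for its allocation, and thanks to this property the memory never has to be freed by hand.
* The flip side is that memory obtained this way is released automatically as soon as the function returns.

[variable-length arrays (VLA)](https://en.wikipedia.org/wiki/Variable-length_array) are another typical way of allocating space on the stack. The difference from alloca is lifetime: a VLA lives until the end of its scope, whereas memory from alloca lives until the function returns.

```c=
#include <alloca.h>
#include <string.h>

static unsigned megatron = 3; /* illustrative loop bound; left undefined in the original */

void laugh(void) {
    size_t n = 2;
    for (unsigned i = 0; i < megatron; ++i) {
        char *res = alloca(n);  /* released only when laugh() returns */
        memcpy(res, "ha", n);
        char vla[n];            /* a real VLA: released at the end of this block */
        memcpy(vla, "ha", n);
    } /* vla dies, res lives */
} /* all allocas die */
```

### Heap allocation

The simplest form of heap allocation is to move the program break and treat the memory between the old and the new program break as the allocated space. This can be done with [sbrk](https://linux.die.net/man/2/sbrk). Seen this way, heap allocation should be about as fast as stack allocation.

```c=
#include <unistd.h>
char *block = sbrk(1024 * sizeof(char));
```

:::info
:bell: The program break is the end of the process's data segment.
:::

Allocating from the heap directly like this has a few problems:

* Memory that is no longer needed cannot easily be reclaimed.
* It is not thread-safe.
* Library code cannot safely move the break on its own, which makes this approach hard to port.

This is why `malloc` exists: it offers a thread-safe memory allocation interface that makes allocating memory much easier for the programmer. The price is execution time, because an allocation now needs locking and extra data structures to keep track of used / free blocks.

### Slab allocator

Beyond merely handing out usable space, an allocator has to consider:

* the [fragmentation](https://en.wikipedia.org/wiki/Fragmentation_(computing)) caused by allocating a large number of small chunks
* the cache locality problem that arises when objects that are often accessed together end up on different pages

A [slab allocator](https://en.wikipedia.org/wiki/Slab_allocation) asks the [buddy system](https://en.wikipedia.org/wiki/Buddy_memory_allocation) for a page and cuts it into many chunks of one fixed size, which avoids the internal fragmentation of the buddy system.

Some references

* [Allocating kernel memory (buddy system and slab system)](https://www.geeksforgeeks.org/operating-system-allocating-kernel-memory-buddy-system-slab-system/)
* [The Slab Allocator in the Linux kernel](https://hammertux.github.io/slab-allocator)

### Memory pools

[Obstacks](https://www.gnu.org/software/libc/manual/html_node/Obstacks.html#Obstacks)

> obstack is a pool of memory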
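> containing a stack of objects

### Demand paging

As mentioned under overcommitting, an allocation only hands back a virtual address with no physical backing behind it yet. When the address is first accessed, a page fault occurs; only then does the kernel find a free physical page and update the page table to map it. This technique is called demand paging.

### Fixed memory mappings

The caller specifies the virtual address at which the mapping should be placed.

### File-backed memory maps

Mapping a file into memory speeds up reads and writes. This approach has to deal with synchronization (when should data written to the in-memory copy of the file be written back to disk?), and a mapped file cannot easily be extended or shortened.

### Copy-on-write

The classic example of [Copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) is fork. When process B needs a copy of a large memory region (many pages) that belongs to process A, A and B can simply share the same pages at first. Only when one of those pages is written to, which triggers a page fault, is that page actually copied. This avoids both the wasted space and the time cost of copying pages that are never modified.

### Zero-copy

[以 sendfile 和 splice 系統呼叫達到 Zero-Copy](https://hackmd.io/@sysprog/linux2020-zerocopy)

### mmap is not flawless!

Mapping a file straight into memory does not give the best access time in every scenario. Note that handling a page fault takes longer than reading a block (handling the fault includes reading the block, plus the rest of the fault-handling path). The advantage of mmap is that when a file is operated on many times, only the first access has to go to disk; later accesses work directly on memory. But in a scenario such as "a small number of random reads over a file larger than memory", the pages cached in memory are likely to be evicted almost immediately, page faults keep occurring, and the total time goes up instead of down.

### PSS (Proportional Set Size)

Shared memory changes how memory usage should be counted. [PSS](https://en.wikipedia.org/wiki/Proportional_set_size) is the private memory used by the program itself plus its proportional share of shared memory (size of the shared memory / number of processes sharing it). A common misconception is that mapping a file consumes memory while using the file API does not. In fact, both approaches use memory to cache the file. The difference is that mmap, by establishing a mapping of the file, avoids repeatedly moving file data back and forth between kernel and user space, and allows the pages to be shared.

:::warning
:warning: As you can see, the article touches on many memory-related topics, and I only half understand most of them. Where I was unsure I could only use vague wording. Please read this alongside the original article, and do correct me if you spot a misunderstanding! If I get the chance, I will dig deeper and fill in the details.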
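:::

## Data alignment

A CPU usually fetches data in units of 4 bytes (32-bit) or 8 bytes (64-bit), so aligning data saves the CPU an extra access.

![](https://i.imgur.com/KrIC39F.png)

As the figure shows, both sides hold the same 4 bytes of data. The left one can be read in a single access, while the right one forces the CPU to access memory twice.

```c=
#include <stdio.h>

struct s1 {
    char c;
    int a;
};

int main(void) {
    /* sizeof yields a size_t, so print it with %zu */
    printf("struct s1 size: %zu byte\n", sizeof(struct s1));
    return 0;
}
```

On a typical platform with a 4-byte int, running this program prints `struct s1 size: 8 byte`. A char takes 1 byte and an int takes 4 bytes, 5 bytes in total, but placing the int immediately after the char would leave it misaligned and slower to access, so the compiler pads the struct to 8 bytes for the sake of alignment.

```c=
#include <stdio.h>

struct s1 {
    char a[5];
};

int main(void) {
    printf("struct s1 size: %zu byte\n", sizeof(struct s1));
    return 0;
}
```

Running this program prints `struct s1 size: 5 byte`, because char itself only needs 1-byte alignment. This shows that the compiler automatically aligns data according to the alignment requirement of each member's type. Sometimes we are willing to give up execution time for correctness (padding can make the layout differ from what we expect), and in that case [#pragma pack](https://blog.gtwang.org/programming/c-language-pragma-pack-tutorial-and-examples/) can be used.

```c=
#include <stdio.h>

#pragma pack(push)
#pragma pack(1)
struct s1 {
    char c;
    int a;
};
#pragma pack(pop)

struct s2 {
    char c;
    int a;
};

int main(void) {
    printf("struct s1 size: %zu byte\n", sizeof(struct s1));
    printf("struct s2 size: %zu byte\n", sizeof(struct s2));
    return 0;
}
```

The output is:

```
struct s1 size: 5 byte
struct s2 size: 8 byte
```

## TODO

- [ ] Read [What Every Programmer Should Know About Memory](https://akkadia.org/drepper/cpumemory.pdf) in depth
- [ ] Study the implementation of malloc to clarify the details of heap allocation.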
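
As a supplement to the data alignment section, the padding can be inspected directly instead of being inferred from `sizeof` alone. The sketch below is only an illustration: it reuses the `struct s1` layout from that section, relies on the standard `offsetof` macro and the C11 `_Alignof` operator, and the function names `s1_padding` and `print_s1_layout` are made up for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct s1 {
    char c;
    int a;
};

/* A struct's size is always a multiple of its alignment (checked at compile time). */
static_assert(sizeof(struct s1) % _Alignof(struct s1) == 0,
              "size must be a multiple of alignment");

/* Bytes of padding the compiler inserted between c and a. */
size_t s1_padding(void)
{
    return offsetof(struct s1, a) - sizeof(char);
}

/* Prints where each member sits, so the padding becomes visible. */
void print_s1_layout(void)
{
    printf("offset of c: %zu\n", offsetof(struct s1, c));
    printf("offset of a: %zu\n", offsetof(struct s1, a));
    printf("padding after c: %zu\n", s1_padding());
    printf("sizeof: %zu, alignment: %zu\n",
           sizeof(struct s1), (size_t)_Alignof(struct s1));
}
```

On a typical x86-64 Linux build, `print_s1_layout()` should report `a` at offset 4 with 3 bytes of padding after `c`, which accounts for the size of 8 seen earlier. The exact numbers depend on the platform's int size and alignment.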