# The Process Address Space

Typical layout on a 32-bit system:

![](https://i.imgur.com/lb4xv6m.png)

![](https://i.imgur.com/vQ3utYg.png)

Every process has an mm_struct that records all of its memory-management information. The VMAs are recorded in the mm_struct, which covers the description of the address space and the relationships among pages, VMAs, and files.

![](https://i.imgur.com/qDQIJNR.png)

Besides describing a range of the address space, the figure for a VMA also shows which physical page each of its pages maps to (the mapping itself lives in the page tables). Here pages 1, 2, and 3 are mapped to physical memory. Pages 4 and 5 have valid addresses but are not yet backed by physical memory: a page may not be allocated until it is used, so the first access triggers a page fault. The pages of a VMA look logically contiguous to the user, but the physical pages behind them may not be contiguous.

![](https://i.imgur.com/wn8F8sB.png)

# Address Spaces

A VMA describes how a process is laid out in memory, but where does this information come from? The linker script decides how the sections of the executable are laid out; when the executable is loaded, that layout is read out and handed to the mm_struct, where it becomes the in-memory layout. For linker scripts, see http://wen00072.github.io/blog/2014/03/14/study-on-the-linker-script/

![](https://i.imgur.com/GA62l5X.png)

The difference between an OS process and a thread (user/kernel): https://medium.com/@yovan/os-process-thread-user-kernel-%E7%AD%86%E8%A8%98-aa6e04d35002

What happens during a context switch: https://stackoverflow.com/questions/12630214/context-switch-internals

![](https://i.imgur.com/Fcqvvea.png)

Segmentation Fault

* If a process accesses a memory address not in a valid memory area
* or if it accesses a valid area in an invalid manner
* the kernel kills the process with the dreaded “Segmentation Fault” message

Memory areas can contain many kinds of things:

* text section
    * where the program code lives
* data section
    * global variables that have been initialized
* bss section
    * global variables that have not been initialized
* User-space stack
    * A memory map of the zero page used for the process’s user-space stack
* An additional text, data, and bss section for each shared library, such as the C library and dynamic linker, loaded into the process’s address space
* Any memory mapped files
* Any shared memory segments
* Any anonymous memory mappings, such as those associated with malloc()

# The Memory Descriptor

The data structure that describes a process address space is called the memory descriptor.

> This structure contains all the information related to the process address space

![](https://i.imgur.com/hzaVQ4y.png)

![](https://i.imgur.com/fYOY86e.png)

* mm_users:
    * the number of processes using this address space
    * ex: if 2 threads share it, mm_users == 2
* mm_count:
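    * the primary reference count of the mm_struct: it tells whether anyone is still using this address space
    * all users counted by mm_users together contribute a single reference, so as long as at least one user remains, mm_count is at least 1
    * only after mm_users drops to 0 can mm_count drop to 0, at which point the mm_struct is freed
* mmap: (linked list)
    * contains all the memory areas in this address space
* mm_rb: (red-black tree)
    * stores the same objects as mmap
    * searching time is O(log n)

> In general the kernel avoids using two different data structures for the same data. Both are needed here because mmap, being a linked list, makes traversing all elements easy, while mm_rb makes finding a particular element fast.
>
> All mm_struct structures are strung together in a doubly linked list through the mmlist field. The first element is init_mm, the memory descriptor that holds the address space of the init process. The lock that protects the list is mmlist_lock, which is defined in kernel/fork.c

# Allocating a Memory Descriptor

The memory descriptor of a task is reached through the mm field of its process descriptor.

![](https://i.imgur.com/9ET2rcW.png)

So to access the current process's memory descriptor:

```
current->mm
```

* copy_mm() is used during fork() to copy the parent's mm_struct to the child
* The mm_struct is allocated from the mm_cachep slab cache via the allocate_mm() macro

> In general, every process has its own mm_struct, and thus a unique process address space.

* A process can share its address space with its children by setting the CLONE_VM flag when calling clone()
    * A process created this way is what we call a thread
    * In Linux this is essentially the only difference between a process and a thread: a thread is a regular process that merely shares certain resources
    * When CLONE_VM is set, allocate_mm() is not called. The child's mm field simply points to the parent's memory descriptor (the child moves into the parent's house)

Because one more tenant has moved in, mm_users is incremented and the child's mm points to the parent's mm.

![](https://i.imgur.com/tPGOQGE.png)

![](https://i.imgur.com/93gaImi.png)

# Destroying a Memory Descriptor

When a process exits, exit_mm(), defined in kernel/exit.c, is invoked to release the address space.

![](https://i.imgur.com/9HTdkQq.png)

At the end it calls mmput().

![](https://i.imgur.com/rL9WgB2.png)

mmput() decrements mm_users. When mm_users reaches 0, mmdrop() is called to decrement mm_count. If mm_count drops to 0, the free_mm() macro is invoked, which hands the mm_struct back to the mm_cachep slab cache through kmem_cache_free().

# The mm_struct and Kernel Threads

Kernel threads have no process address space and therefore no associated memory descriptor. So:

> The mm field of a kernel thread's process descriptor is NULL!

This is the definition of a kernel thread: **a process without a user context!**

That is fine, because a kernel thread never accesses user-space memory. Since kernel threads have no user-space pages, they do not need a memory descriptor or page tables of their own.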
However, a kernel thread still needs something to stand in for them in order to run, for the following reasons:

* kernel threads need page tables, even just to access kernel memory
* giving each one its own mm_struct and page tables would waste memory
* switching to a new address space every time a kernel thread runs would waste CPU cycles

So a kernel thread simply uses the mm_struct of whichever task ran before it.

> When a process is scheduled onto the CPU, the address space referenced by its mm field is loaded, and the active_mm field is updated to point to this new address space. A kernel thread has no address space of its own and its mm is NULL. So when a kernel thread is scheduled, the kernel notices that mm is NULL and keeps the previous process's address space loaded. The kernel then updates the kernel thread's active_mm field to point to the previous process's mm_struct, so the kernel thread can use the previous process's page tables.
>
> Note: kernel threads do not access user-space memory. They only use the part of the address space reserved for kernel memory, which is the same for all processes.

# Virtual Memory Areas

![](https://i.imgur.com/Fklwo8S.png)

A VMA describes a single contiguous interval within an address space. The kernel treats each VMA as a distinct memory object. Each memory area has certain properties, such as permissions and a set of associated operations, so each VMA structure can represent a different type of memory area. ex:

* memory-mapped files
* process's user-space stack

![](https://i.imgur.com/aKrL2v8.png)

![](https://i.imgur.com/H7qMOym.png)

Two threads that share an address space also share all the vm_area_struct structures therein.

# VMA Flags

The vm_flags field contains bit flags, defined in <linux/mm.h>, that specify the behavior of and provide information about the pages contained in the memory area.

Unlike the access permissions attached to a specific physical page, VMA flags specify behavior that the kernel is responsible for, not the hardware.

* vm_flags holds information about the pages in the memory area

![](https://i.imgur.com/c90DM0G.png)

![](https://i.imgur.com/CpFPQM5.png)

* permissions for the pages
    * VM_READ
    * VM_WRITE
    * VM_EXEC
    * ex: object code for a process
        * VM_READ and VM_EXEC
    * data section from an executable object
        * VM_READ and VM_WRITE
    * read-only memory mapped data file
        * VM_READ
* VM_SHARED (this VMA is shared with other processes)
    * specifies whether the memory area contains a mapping that is shared among multiple processes
* VM_IO
    * this VMA is a mapping of a device's I/O space
    * typically set by a device driver when mmap() is called on its I/O space
* VM_RESERVED
    * this VMA must not be swapped out
* VM_SEQ_READ
    * hints to the kernel that the application reads this mapping sequentially, so the kernel can optimize for it
* VM_RAND_READ
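    * the opposite of VM_SEQ_READ: hints that the application accesses this mapping in random order

# VMA Operations

![](https://i.imgur.com/x0g4jmL.png)

![](https://i.imgur.com/b6lURjk.png)

# Memory Areas in Real Life (an example of the actual mappings inside a process)

![](https://i.imgur.com/Ffpyr2z.png)

First, what does this process's address space contain?

* text section
* data section
* bss section
* process's stack

Assume the process is dynamically linked with the C library. Then the text, data, and bss sections also exist for libc.so and ld.so.

> This command shows the mappings inside a process:
> cat /proc/<pid>/maps

![](https://i.imgur.com/TVOpQGP.png)

![](https://i.imgur.com/sKhNb68.png)

![](https://i.imgur.com/R14kyr7.png)

![](https://i.imgur.com/G1UrAoX.png)

The first three lines are the text, data, and bss sections of libc.so (the C library). The next two lines are the text and data sections of the executable. The next three lines are the text, data, and bss sections of ld.so (the dynamic linker). The last line is the process's stack.

* text (program code)
    * readable and executable
* data (contains global variables)
    * readable and writable
    * not executable
* bss (contains global variables)
    * readable and writable
    * not executable
* stack
    * readable, writable, executable

The size of the whole address space:

> If a memory region is shared or nonwritable, the kernel keeps only one copy of it in memory. The C library, for example, can only be read and not modified, so it needs to occupy only 1212KB of physical memory in total, instead of every process holding its own copy. The process can access about 1340KB of address space, yet only about 40KB of that is writable and private, which is very economical. (A 32-bit address space spans 4GB, but that does not mean a process uses 4GB. How much is used depends on how much memory the process actually requests.)

![](https://i.imgur.com/tuRubAJ.png)

Each memory area shown above is represented by a VMA, that is, a vm_area_struct. Because this is a process and not a thread,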
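it has its own mm_struct, referenced from its task_struct.

# Manipulating Memory Areas

The kernel often has to perform operations on a memory area. ex:

* Given an address, check whether it falls inside an existing VMA
    * This operation is very frequent and forms the basis of the mmap() routine. It is done with find_vma() (searching is naturally faster with the red-black tree)

![](https://i.imgur.com/iKXFRuO.png)

find_vma() returns the first VMA whose vm_end is greater than the given address, so the address is not guaranteed to lie inside the returned VMA. If no such VMA exists, it returns NULL. To speed things up, the result of a successful lookup is stored in the mmap_cache field of the mm_struct. The cache is checked first, and only on a miss is the red-black tree searched.

![](https://i.imgur.com/qFrUZRp.png)

# find_vma_intersection()

The find_vma_intersection() function returns the first VMA that overlaps a given address interval.

![](https://i.imgur.com/Dbgsn2w.png)

# mmap() and do_mmap(): Creating an Address Interval

![](https://i.imgur.com/flS13aA.png)

do_mmap() is used to create a new linear address interval. Note that this does not necessarily create a new VMA: if the new interval is adjacent to an existing one and they share the same permissions, the kernel merges them into one VMA. Only otherwise is a new VMA created.

![](https://i.imgur.com/mfmmbRC.png)

![](https://i.imgur.com/9FcnkWh.png)

![](https://i.imgur.com/Y4xhbxL.png)

Right to left: an access to an illegal region. Left to right: an access to a region that does not exist.

* anonymous mapping (backed by memory)
    * file = NULL
    * offset = 0
* file-backed mapping (backed by disk)
    * otherwise
* addr
    * specifies the initial address from which to start the search for a free interval
* prot (protection)
    * specifies the access permissions for pages in the memory area

![](https://i.imgur.com/sP5I93M.png)

* flags
    * specifies flags that correspond to the remaining VMA flags
    * defines how the interval you are mapping is shared with others

![](https://i.imgur.com/4CmE2ik.png)

If any parameter is invalid, do_mmap() returns a negative value. Otherwise a suitable interval in virtual memory is located.

* The VMA is allocated from the vm_area_cachep slab cache.
* The VMA is added to the linked list and the red-black tree

# mmap() system call (Page cache)

There are two kinds of memory mapping:

1. VA mapped to physical memory
2.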
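   VA mapped to a file
    * a big win for I/O access
    * a big win for reducing memory copies
    * once the mapping is established, user space can access the region as if it were accessing the I/O directly

Video on memory-mapped I/O: https://www.youtube.com/watch?v=m7E9piHcfr4

The functionality of do_mmap() is exported to user space through the mmap() system call, which ends up calling do_mmap(). The implementation builds on the page cache mechanism, which is covered in detail in a separate chapter.

![](https://i.imgur.com/mRI6LTD.png)

![](https://i.imgur.com/C8fPf8e.png)

![](https://i.imgur.com/MJMpGol.png)

Shared file mapping: two use cases

* memory-mapped I/O: read and write I/O directly through the VA, so that file access becomes ordinary memory access
    * Normally, when a user process reads something from disk, the data is moved from the disk into kernel-space memory and then copied from kernel-space memory into the user process's address space, two transfers in total. If the file is mapped directly into user space, the data that DMA brings in is directly visible to the process, which saves one copy and can be roughly twice as fast

![](https://i.imgur.com/xKodWST.png)

> Ordinarily, when an application wants to access a device or physical memory, the kernel's protection mechanisms only allow it to go through ioctl() or the read/write system calls. For cases that move large amounts of data, such as video or streaming, that performance is unacceptable, so mmap is a great help. The device driver only has to implement the mmap method, and the two sides can exchange data freely.
> A simple diagram:
> AP -> open /dev/mem -> mmap to the physical memory address -> AP accesses it directly
> DRIVER -> ioremap in module_init -> obtain the memory pointer -> DRIVER accesses it directly

* IPC
    * data-transfer (not byte stream)
    * with filesystem persistence
    * among unrelated processes

How two processes share mappings:

![](https://i.imgur.com/vFRgzGh.png)

* stack and heap are not shared
* lib.so and abc.dat are shared
* text is shared (different processes forked from the same code)

# Disadvantages of Memory-mapped I/O

* memory garbage
    * significant waste of memory: a mapping always covers a whole number of pages, so a small file wastes the rest of its last page
* memory mapping must fit in the process address space
    * on 32-bit systems this fragments the address space, making it harder and harder to find a large contiguous free range. 64-bit systems do not have this problem in practice
* there is kernel overhead in maintaining mappings

# Removing an Address Interval

![](https://i.imgur.com/fmd5oEm.png)

# Page Tables

Page tables split the VA into chunks, and each chunk is used as an index into a table. The table entry either points to a further table or yields the physical page directly.

In the kernel version these notes follow, Linux page tables have three levels (newer kernels add more levels):

* top level
    * page global directory (PGD)
    * which is an array of pgd_t types
* second level
    * page middle directory (PMD)
    * which is an array of pmd_t types
* final level
    * page table entries (PTE)
    * simply the page table and consists of page table entries of type pte_t

Page table lookups are usually done by the hardware.

![](https://i.imgur.com/VBK6yCt.png)

# TLB

> looking up all these addresses in memory can be done only so quickly.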
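> To facilitate this, most processors implement a translation lookaside buffer, or simply TLB, which acts as a hardware cache of virtual-to-physical mappings. When accessing a virtual address, the processor first checks whether the mapping is cached in the TLB. If there is a hit, the physical address is immediately returned. Otherwise, if there is a miss, the page tables are consulted for the corresponding physical address

The TLB must be flushed on a context switch, because the translations cached for process A are not valid for process B.

![](https://i.imgur.com/703X9WQ.png)

There are also TLBs that tag each entry with a pid. With this kind of TLB no flush is needed.

![](https://i.imgur.com/5zbeswQ.png)

* Some designs look up the cache first and the MMU afterwards (logical cache)
    * the cache must store the pid, or be flushed on a context switch
* Others go through the MMU first and the cache afterwards (physical cache)
    * slow but share without flush

![](https://i.imgur.com/Luen8Vb.png)

Q: What happens when physical memory is used up?

A: Every application tries to grab as much memory as it can. Whether it is the buffer cache (mmap I/O), the slab allocator, or a user-level malloc, they all expect their memory to be backed by physical memory. But owning a VMA does not mean the memory is actually in use. So before physical memory truly runs out, some mechanism has to kick in and start reclaiming memory from those who are not using it. In Linux this is the kernel swap daemon, which checks periodically.

* Who can be swapped?
    * kernel code cannot be swapped
    * resources are divided into active and inactive
    * some watermarks are set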