# Linux lab 1 Report
###### tags: `linux`, `linux-2022-autumn`

</br>

:::spoiler Hierarchy
## Hierarchy
- Lab requirement
- Env
- Compile kernel source code
    - using kvm/qemu (env1)
        - kernel config settings
        - errors & solutions
        - give up
    - install in host (env2)
        - errors & solutions
- Add syscall
    - hello_world.c
    - errors & solutions
- Obtaining the address
    - thread program
    - task_struct and vm_area_struct
    - /proc/$pid/maps
- Result discussion
- All errors & solutions
:::

## Requirements
### Basic
- [X] add some new system calls
- [X] Write a multi-thread program with three threads
- [X] check which segments of a thread are shared by which other thread(s)
- [X] report
    - [X] Content specified by the instructor
    - [X] Kernel and OS versions
    - [X] Kernel build process
    - [X] Process of adding the syscall
    - [X] kernel space & user space code
    - [X] Problems encountered and their solutions (references and source code)
    - [X] Images

### Bonus
- [X] calculate the size, start address, and end address of each thread segment.

### [補充](#補充1)
- [x] Physical address
- [x] paging
- [X] Things related to `current->mm`
- [x] [How pthread_create is implemented](#Thread-Program-amp-address)
- [X] `copy_to_user` and `copy_from_user`

---

:::info
**Links** :link:
- [Linux-lab1-jornal](https://hackmd.io/@wasabi-neko/Byg_IAtri)
:::

## Environment
Host environment:
```
Ubuntu 20.04.5 LTS
linux-5.15.0-52-generic
```

The kernel to be built:
~~`linux-5.17.5`~~ _(This version has many compatibility problems with my machine, so I gave up on it.)_
`linux-5.15.77`

### ~~Virtual Machine Environment~~
> After encountering a lot of errors and very long compile times,
> I gave up on using a virtual machine.
Instead, I decided to build the Linux kernel on my own computer.

:::spoiler virtual machine environment
#### Install KVM/QEMU, virt-manager
#### fedora setup
#### Errors & Solutions
:::

### Build Environment
Original:
```
Ubuntu 20.04.5 LTS
linux-5.15.0-52-generic
```

Kernel to be built:
```
linux-5.15.77
```

</br>

### Building kernel
After downloading and unpacking the Linux kernel source code, run the following commands:
```bash
$ make menuconfig
$ make -j12
$ sudo make modules_install -j12
$ sudo make install -j12
$ sudo reboot now
```
In `make menuconfig`, you can first remove modules you do not need, such as `google chrome book hardware support`, to speed up compilation.

After compiling and installing the new kernel, check the running version with `uname -r`:
```bash
$ uname -r
5.15.77
```
The new kernel was installed successfully.
> The nvidia driver broke after the kernel update and had to be reinstalled.

</br>

#### Errors & Solutions
1. **No rule to make target 'debian/canonical-certs.pem'**
This error appears during `make -j12` and stops the whole make process.
```
make[1]: *** No rule to make target 'debian/canonical-certs.pem', needed by 'certs/x509_certificate_list'. Stop.
make: *** [Makefile:1809: certs] Error 2
```
It is only a certificate-related issue; running the following commands resolves it:
```bash
$ scripts/config --disable SYSTEM_TRUSTED_KEYS
# originally CONFIG_SYSTEM_REVOCATION_KEYS="debian/canonical-revoked-certs.pem"
$ scripts/config --disable SYSTEM_REVOCATION_KEYS
```
[solution reference](https://askubuntu.com/questions/1329538/compiling-the-kernel-5-11-11)

</br>

2. **nvidia driver**
This error message shows up during the build and install steps but does not stop the make process.
It does not affect the rest of `make install` either, so it can be ignored.
```bash
$ sudo make install -j12
...
Error! Bad return status for module build on kernel: 5.15.77 (x86_64)
Consult /var/lib/dkms/nvidia/520.56.06/build/make.log for more information.
```
Checking make.log:
```bash
$ view /var/lib/dkms/nvidia/520.56.06/build/make.log
DKMS make.log for nvidia-520.56.06 for kernel 5.15.77 (x86_64)
Sun 06 Nov 2022 04:37:27 AM CST
make[1]: warning: -j12 forced in submake: resetting jobserver mode.
Makefile:18: /Kbuild: No such file or directory
make[1]: *** No rule to make target '/Kbuild'. Stop.
```
It does not look like a serious problem, just the nvidia driver behaving as usual.

</br>

3. **The initrd is too big**
After `sudo reboot now`...
![photo of the kernel panic screen](https://media.discordapp.net/attachments/887215164208857138/1038557345334165514/20221106_045545.jpg?width=810&height=608 =x500)
```
The initrd is too big
...
[ end kernel panic not syncing:VFS: Unable to mount root fs on unknown-block]
```

First, reboot. While the system is booting, hold the `esc` key to open the GRUB menu,
boot back into the old `5.15.0-52-generic` kernel, and fix the problem from there.

Edit the file `/etc/initramfs-tools/initramfs.conf` and change "MODULES=most" to "MODULES=dep":
```
# in /etc/initramfs-tools/initramfs.conf,
...
# MODULES: [ most | netboot | dep | list ]
#
# most - Add most filesystem and all harddrive drivers.
#
# dep - Try and guess which modules to load.
#
# netboot - Add the base modules, network modules, but skip block devices.
#
# list - Only include modules from the 'additional modules' list
# MODULES=most
MODULES=dep
```
After changing `/etc/initramfs-tools/initramfs.conf`, the `initramfs` must be regenerated:
```bash
$ sudo update-initramfs -u -k all
```
With this setting the system no longer loads every module at boot, which avoids the `initrd too big` error.
It does not feel like a perfect fix, but it is good enough for now.

</br>
</br>

## Add new system call
The Linux kernel version we use provides a convenient way to add a new `system call`.
The steps are:
1. Create a new directory
2. Put the source code of the `system call` in the new directory
3. Add the new `system call` to the `system call table`
4. Add the new `system call` to `syscalls.h`
5. Recompile!

### hello_world.c
> Everything starts with Hello World

```c
// hello/hello_world.c
#include <linux/kernel.h>
#include <linux/syscalls.h>

// asmlinkage long sys_hello_world(pid_t pid)   // old way of defining a syscall
SYSCALL_DEFINE1(hello_world, pid_t, pid)        // newer, safer way of defining a syscall
{
    printk("Hello World!\n");
    return 0;
}
```
```
# hello/Makefile
obj-y = hello_world.o
```
```
# Makefile
...
ifeq ($(KBUILD_EXTMOD),)
core-y += kernel/ certs/ mm/ fs/ ipc/ security/ crypto/ block/ hello/
...
```
```c
// include/linux/syscalls.h
...
asmlinkage long sys_hello_world(pid_t pid);   // must match SYSCALL_DEFINE1(hello_world, pid_t, pid)
#endif
```
```
# arch/x86/entry/syscalls/syscall_64.tbl
# Numbers in the 500s are used by x32,
# so add the new entry after the 400s.
# The comment in the middle, `add new calls after the last 'common' entry`, does not mean
# position 335 but 449; adding the entry at 335 causes unexplained errors.
...
334 common rseq sys_rseq
# don't use numbers 387 through 423, add new calls after the last
# 'common' entry
424 common pidfd_send_signal sys_pidfd_send_signal
...
449 common hello_world sys_hello_world
```

### Errors
```
ld: arch/x86/entry/syscall_64.o:(.rodata+0xa78): undefined reference to `__x64_sys_hello_world'
ld: arch/x86/entry/syscall_x32.o:(.rodata+0xa78): undefined reference to `__x64_sys_hello_world'
BTF .btf.vmlinux.bin.o
pahole: .tmp_vmlinux.btf: No such file or directory
LD .tmp_vmlinux.kallsyms1
.btf.vmlinux.bin.o: file not recognized: file format not recognized
make: *** [Makefile:1216: vmlinux] Error 1
```
This error occurs when `hello_world.c` uses `asmlinkage long sys_hello_world(pid_t pid)`.

According to what we found, since linux-4.19 the symbol needs the `__x64_sys_` prefix, but the error persisted even after we renamed the function with that prefix.
Switching to the `SYSCALL_DEFINEx` macro finally made it work.

## Thread Program & address
Before writing the multithreaded program, there are a few things about `pthread` worth knowing.

### About `pthread`
The full name is `POSIX Thread`. It is the standard that Linux follows to implement threads, which is why the library is called `pthread`.
To use the `pthread` API, include `pthread.h` and pass the `-lpthread` or `-pthread` flag when compiling. The difference is that `-lpthread` only links the library, while `-pthread` also defines the preprocessor macros that threaded code expects in addition to linking.

A few key functions in the `pthread` library need an introduction:
- `pthread_create()`
- `pthread_join()`
- `pthread_exit()`

#### `pthread_create()`
`pthread_create` is the API for creating a new thread. Its full prototype is:
```c=
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine) (void *), void *arg);
```
It takes four parameters:
- `pthread_t *thread`: stores the id of the new thread
- `const pthread_attr_t *attr`: a pointer used to adjust some advanced attributes of the thread
- `void *(*start_routine) (void *)`: a function pointer that specifies what the thread does once it is created
- `void *arg`: the argument passed to `start_routine`

:::info
**How are threads implemented on Linux**
On Linux, both `process` and `thread` use `task_struct` to maintain and record their runtime information. A thread is also called a
*Light Weight Process*. Its biggest difference from a process is that it has no memory space entirely of its own; it shares the memory space with other threads. The system call used to create it also differs: creating a new process uses the `fork()` system call, whereas creating a new thread uses `clone()`.

`clone()` lets the caller decide which parts the parent and child share, so the thread concept can be implemented with this system call. Next, let's look at the parameters `clone()` takes.
```c=
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
          /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );
```
The code above shows that `clone()` can specify the function where the child starts executing, unlike `fork()`, where the child simply continues running the same code as the parent. It can also set the start address of the child's stack.

Also worth noting is how the Linux manual describes it:
`clone, __clone2, clone3 - create a child process`
So on Linux the line between thread and process is rather blurry. This shows both in the kernel source structures that maintain threads and in the system calls that are used.
:::

#### `pthread_join()`
Used for synchronization. When the function that called `pthread_create()` calls `pthread_join()`, it waits until the thread has finished before continuing, and it can also receive the thread's return value.
Its full prototype is:
```c=
int pthread_join(pthread_t thread, void **retval);
```
**Parameters**
- `pthread_t thread`: thread id
- `void **retval`: stores the return value of the finished thread

#### `pthread_exit()`
Its effect depends on where it is called.
Called in `main`, it ends only the main thread; the process as a whole ends once all other threads have finished, so no thread is cut off before it can release its resources.
Called in a thread's function, it terminates the calling thread at that point and makes `retval` available to whoever joins it.
Its full prototype is:
```c=
void pthread_exit(void *retval);
```
**Parameters**:
- `void *retval`: pointer that holds the return value

### Writing the thread program
```c=
// link with -pthread and -ldl (dlsym lives in libdl on glibc 2.31)
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

const int global_constant = 0xc8763;    // .rodata
char global_str[] = "Global Str.";      // Data segment
char global_var;                        // Bss segment
void (*func_ptr)();

void foo() {    // Text segment
    printf("foo func here!\n");
}

void* func4thread1(void *arg) {
    int local = 0xdeadbeef;
    int *heap_ptr = (int*) malloc(5);
    heap_ptr[0] = 0xc8763;
    // func_ptr = foo;

    // printf("This is thread1.\n");
    printf("thread1_stack %p\n", &local);
    printf("thread1_heap %p\n", heap_ptr);
    printf("thread1_text %p\n", func4thread1);
    // printf("Address of global_str \"%s\": %p\n", global_str, global_str);
    // printf("Address of global_var: %p\n", &global_var);
    // printf("Address of foo: %p\n", func_ptr);
    // printf("Address of str_in_heap \"%s\": %p\n", (char *)arg, arg);
    fflush(stdout);

    sleep(100000);  // keep the thread alive so its mappings can be inspected
    pthread_exit(NULL);
}

void* func4thread2(void *arg) {
    int local = 0xdeadbeef;
    int *heap_ptr = (int*) malloc(5);
    // func_ptr = foo;

    // printf("This is thread2.\n");
    printf("thread2_stack %p\n" , &local);
    printf("thread2_heap %p\n", heap_ptr);
    printf("thread2_text %p\n", func4thread1);
    // printf("Address of global_str \"%s\": %p\n", global_str, global_str);
    // printf("Address of global_var: %p\n", &global_var);
    // printf("Address of foo: %p\n", func_ptr);
    // printf("Address of str_in_heap \"%s\": %p\n", (char *)arg, arg);
    fflush(stdout);

    sleep(10000);   // keep the thread alive so its mappings can be inspected
    pthread_exit(NULL);
}

int main(int argc, char **argv) {
    pid_t pid = getpid();
    pthread_t t1, t2;
    char str[] = "str in heap.";
    // sizeof(str) covers the 12 characters plus the terminating NUL
    char *str_in_heap = (char *)malloc(sizeof(str));
    strncpy(str_in_heap, str, sizeof(str));

    void (*exit_addr)(int) = dlsym(RTLD_NEXT, "exit");
    char *env_home = getenv("HOME");

    printf("libc %p\n", exit_addr);
    printf("libc-var %p\n", stdout);
    printf("main_stack %p\n", &pid);
    printf("main_heap %p\n", str_in_heap);
    printf("main_text %p\n", foo);
    printf("bss %p\n", &global_var);
    printf("data %p\n", global_str);
    printf(".rodata %p\n", &global_constant);
    printf("argv %p\n", argv);
    printf("env %p\n", env_home);
    fflush(stdout);

    // call the syscall before creating the threads
    long retval = syscall(449, pid);

    pthread_create(&t1, NULL, func4thread1, (void *)str_in_heap);
    pthread_create(&t2, NULL, func4thread2, (void *)str_in_heap);

    // call the syscall after creating the threads
    retval = syscall(449, pid);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    return 0;
}
```
**Result**
```
libc 0x7f3c20830a40
libc-var 0x7f3c209d76a0
main_stack 0x7ffedccc91b0
main_heap 0x55a98714b2a0
main_text 0x55a98516b309
bss 0x55a98516e030
data 0x55a98516e010
.rodata 0x55a98516c004
argv 0x7ffedccc92f8
env 0x7ffedcccb6e5
thread1_stack 0x7f3c207e5edc
thread1_heap 0x7f3c18000b60
thread1_text 0x55a98516b320
thread2_stack 0x7f3c1ffe4edc
thread2_heap 0x7f3c10000b60
thread2_text 0x55a98516b320
```

### vm_area_struct
Next we rewrite the syscall to obtain the memory addresses.

Searching around, we learned that `struct task_struct` holds the data related to a process.
Following the source code leads to
`task_struct.mm` -> `mm_struct.mmap` -> `vm_area_struct`

![](https://i.imgur.com/PwkAo9J.png)
> img_src: https://hackmd.io/@PIFOPlfSS3W_CehLxS3hBQ/S14tx4MqP

```c
struct vm_area_struct {
    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;
    ...
}
```
From this we can see that `vm_area_struct` forms a linked list and that each node carries the two attributes we need: `vm_start` and `vm_end`.

```c=
// hello/hello_world.c
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/types.h>

// asmlinkage long sys_hello_world(pid_t pid)   // old way of defining a syscall
SYSCALL_DEFINE1(hello_world, pid_t, pid)        // newer, safer way of defining a syscall
{
    struct task_struct *task_ptr;
    struct vm_area_struct *vmptr;
    int retval = 0;

    printk("Hello World!\n");

    rcu_read_lock();
    task_ptr = find_task_by_vpid(pid);
    // kernel threads have no user address space (mm == NULL)
    if (!task_ptr || !task_ptr->mm) {
        rcu_read_unlock();
        return retval;
    }

    // walk the per-process VMA list, which is sorted by address
    vmptr = task_ptr->mm->mmap;
    while (vmptr != NULL) {
        unsigned long size = vmptr->vm_end - vmptr->vm_start;
        // vir2phys(): the page-table walk described in the virt_to_phys section
        unsigned long phy_addr = vir2phys(task_ptr->mm, vmptr->vm_start);
        // printk("size: %08lx, start: %lx, end: %lx, flag: %07lx, prot: %07lx\n", size, vmptr->vm_start, vmptr->vm_end, vmptr->vm_flags, vmptr->vm_page_prot.pgprot);
        printk("size: %08lx, start: %lx, end: %lx, phy: %07lx, flag: %07lx, prot: %07lx\n", size, vmptr->vm_start, vmptr->vm_end, phy_addr, vmptr->vm_flags, vmptr->vm_page_prot.pgprot);
        vmptr = vmptr->vm_next;
    }
    rcu_read_unlock();

    return retval;
}
```

The `vm_flags` values are defined in `include/linux/mm.h`:
```c
#define VM_NONE         0x00000000

#define VM_READ         0x00000001  /* currently active flags */
#define VM_WRITE        0x00000002
#define VM_EXEC         0x00000004
#define VM_SHARED       0x00000008

#define VM_GROWSDOWN    0x00000100  /* general info on the segment */
#define VM_UFFD_MISSING 0x00000200  /* missing pages tracking */
#define VM_PFNMAP       0x00000400  /* Page-ranges managed without "struct page", just pure PFN */
#define VM_UFFD_WP      0x00001000  /* wrprotect pages tracking */

#define VM_LOCKED       0x00002000
#define VM_IO           0x00004000  /* Memory mapped I/O or similar */

                                    /* Used by sys_madvise() */
#define VM_SEQ_READ     0x00008000  /* App will access data sequentially */
#define VM_RAND_READ    0x00010000  /* App will not benefit from clustered reads */

#define VM_DONTCOPY     0x00020000  /* Do not copy this vma on fork */
#define VM_DONTEXPAND   0x00040000  /* Cannot expand with mremap() */
#define VM_LOCKONFAULT  0x00080000  /* Lock the pages covered when they are faulted in */
#define VM_ACCOUNT      0x00100000  /* Is a VM accounted object */
#define VM_NORESERVE    0x00200000  /* should the VM suppress accounting */
#define VM_HUGETLB      0x00400000  /* Huge TLB Page VM */
#define VM_SYNC         0x00800000  /* Synchronous page faults */
#define VM_ARCH_1       0x01000000  /* Architecture-specific flag */
#define VM_WIPEONFORK   0x02000000  /* Wipe VMA contents in child. */
#define VM_DONTDUMP     0x04000000  /* Do not include in the core dump */
```

</br>

**Write a script to find information about each segment**
The system call written earlier reports each segment's start address, end address, size, flag, and prot, but it does not say which part of the process a segment actually corresponds to.
The output of the multithreaded program, on the other hand, shows the memory address of each variable, so matching the two reveals which segments are shared. The flag value also shows each segment's read/write permissions.

```
Hello World!
size: 00001000, start: 55a8ad927000, end: 55a8ad928000, flag: 8000071, prot: 8000000000000025 size: 00001000, start: 55a8ad928000, end: 55a8ad929000, flag: 8000075, prot: 0000000000000025 <-main_text, thread1_text, thread2_text size: 00001000, start: 55a8ad929000, end: 55a8ad92a000, flag: 8000071, prot: 8000000000000025 size: 00001000, start: 55a8ad92a000, end: 55a8ad92b000, flag: 8100071, prot: 8000000000000025 size: 00001000, start: 55a8ad92b000, end: 55a8ad92c000, flag: 8100073, prot: 8000000000000025 <-bss, data size: 00021000, start: 55a8ae20b000, end: 55a8ae22c000, flag: 8100073, prot: 8000000000000025 <-main_heap size: 00021000, start: 7f81f8000000, end: 7f81f8021000, flag: 8200073, prot: 8000000000000025 <-thread1_heap, thread2_heap size: 03fdf000, start: 7f81f8021000, end: 7f81fc000000, flag: 8200070, prot: 0000000000000120 size: 00001000, start: 7f81fca1b000, end: 7f81fca1c000, flag: 8000070, prot: 0000000000000120 size: 00800000, start: 7f81fca1c000, end: 7f81fd21c000, flag: 8100073, prot: 8000000000000025 <-thread2_stack size: 00003000, start: 7f81fd21c000, end: 7f81fd21f000, flag: 8000071, prot: 8000000000000025 size: 00012000, start: 7f81fd21f000, end: 7f81fd231000, flag: 8000075, prot: 0000000000000025 size: 00004000, start: 7f81fd231000, end: 7f81fd235000, flag: 8000071, prot: 8000000000000025 size: 00001000, start: 7f81fd235000, end: 7f81fd236000, flag: 8100071, prot: 8000000000000025 size: 00001000, start: 7f81fd236000, end: 7f81fd237000, flag: 8100073, prot: 8000000000000025 size: 00001000, start: 7f81fd237000, end: 7f81fd238000, flag: 8000070, prot: 0000000000000120 size: 00803000, start: 7f81fd238000, end: 7f81fda3b000, flag: 8100073, prot: 8000000000000025 <-thread1_stack size: 00022000, start: 7f81fda3b000, end: 7f81fda5d000, flag: 8000071, prot: 8000000000000025 size: 00178000, start: 7f81fda5d000, end: 7f81fdbd5000, flag: 8000075, prot: 0000000000000025 size: 0004e000, start: 7f81fdbd5000, end: 7f81fdc23000, flag: 8000071, prot: 
8000000000000025
size: 00004000, start: 7f81fdc23000, end: 7f81fdc27000, flag: 8100071, prot: 8000000000000025
size: 00002000, start: 7f81fdc27000, end: 7f81fdc29000, flag: 8100073, prot: 8000000000000025
size: 00004000, start: 7f81fdc29000, end: 7f81fdc2d000, flag: 8100073, prot: 8000000000000025
size: 00006000, start: 7f81fdc2d000, end: 7f81fdc33000, flag: 8000071, prot: 8000000000000025
size: 00011000, start: 7f81fdc33000, end: 7f81fdc44000, flag: 8000075, prot: 0000000000000025
size: 00006000, start: 7f81fdc44000, end: 7f81fdc4a000, flag: 8000071, prot: 8000000000000025
size: 00001000, start: 7f81fdc4a000, end: 7f81fdc4b000, flag: 8100071, prot: 8000000000000025
size: 00001000, start: 7f81fdc4b000, end: 7f81fdc4c000, flag: 8100073, prot: 8000000000000025
size: 00006000, start: 7f81fdc4c000, end: 7f81fdc52000, flag: 8100073, prot: 8000000000000025
size: 00001000, start: 7f81fdc6e000, end: 7f81fdc6f000, flag: 8000071, prot: 8000000000000025
size: 00023000, start: 7f81fdc6f000, end: 7f81fdc92000, flag: 8000075, prot: 0000000000000025
size: 00008000, start: 7f81fdc92000, end: 7f81fdc9a000, flag: 8000071, prot: 8000000000000025
size: 00001000, start: 7f81fdc9b000, end: 7f81fdc9c000, flag: 8100071, prot: 8000000000000025
size: 00001000, start: 7f81fdc9c000, end: 7f81fdc9d000, flag: 8100073, prot: 8000000000000025
size: 00001000, start: 7f81fdc9d000, end: 7f81fdc9e000, flag: 8100073, prot: 8000000000000025
size: 00021000, start: 7ffe943af000, end: 7ffe943d0000, flag: 0100173, prot: 8000000000000025 <-main_stack
size: 00004000, start: 7ffe943f3000, end: 7ffe943f7000, flag: c044411, prot: 8000000000000025
size: 00002000, start: 7ffe943f7000, end: 7ffe943f9000, flag: 8040075, prot: 0000000000000025
```

Use `/proc/$pid/maps` to list the VMAs:
```bash
$ cat /proc/$pid/maps
```
```=
#address perms offset device inode pathname
5595acc05000-5595acc06000 r--p 00000000 103:0a 530447 /home/wasabi-neko/Documents/NCU/2022-autumn/linux/project/test-thread/hello2.out
5595acc06000-5595acc07000 r-xp 00001000 103:0a 530447 /home/wasabi-neko/Documents/NCU/2022-autumn/linux/project/test-thread/hello2.out 5595acc07000-5595acc08000 r--p 00002000 103:0a 530447 /home/wasabi-neko/Documents/NCU/2022-autumn/linux/project/test-thread/hello2.out 5595acc08000-5595acc09000 r--p 00002000 103:0a 530447 /home/wasabi-neko/Documents/NCU/2022-autumn/linux/project/test-thread/hello2.out 5595acc09000-5595acc0a000 rw-p 00003000 103:0a 530447 /home/wasabi-neko/Documents/NCU/2022-autumn/linux/project/test-thread/hello2.out 5595ad1aa000-5595ad1cb000 rw-p 00000000 00:00 0 [heap] 7f6e18000000-7f6e18021000 rw-p 00000000 00:00 0 7f6e18021000-7f6e1c000000 ---p 00000000 00:00 0 7f6e20000000-7f6e20021000 rw-p 00000000 00:00 0 7f6e20021000-7f6e24000000 ---p 00000000 00:00 0 7f6e26b2d000-7f6e26b2e000 ---p 00000000 00:00 0 7f6e26b2e000-7f6e2732e000 rw-p 00000000 00:00 0 7f6e2732e000-7f6e2732f000 ---p 00000000 00:00 0 7f6e2732f000-7f6e27b32000 rw-p 00000000 00:00 0 7f6e27b32000-7f6e27b54000 r--p 00000000 103:0a 6948009 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f6e27b54000-7f6e27ccc000 r-xp 00022000 103:0a 6948009 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f6e27ccc000-7f6e27d1a000 r--p 0019a000 103:0a 6948009 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f6e27d1a000-7f6e27d1e000 r--p 001e7000 103:0a 6948009 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f6e27d1e000-7f6e27d20000 rw-p 001eb000 103:0a 6948009 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f6e27d20000-7f6e27d24000 rw-p 00000000 00:00 0 7f6e27d24000-7f6e27d25000 r--p 00000000 103:0a 6953248 /usr/lib/x86_64-linux-gnu/libdl-2.31.so 7f6e27d25000-7f6e27d27000 r-xp 00001000 103:0a 6953248 /usr/lib/x86_64-linux-gnu/libdl-2.31.so 7f6e27d27000-7f6e27d28000 r--p 00003000 103:0a 6953248 /usr/lib/x86_64-linux-gnu/libdl-2.31.so 7f6e27d28000-7f6e27d29000 r--p 00003000 103:0a 6953248 /usr/lib/x86_64-linux-gnu/libdl-2.31.so 7f6e27d29000-7f6e27d2a000 rw-p 00004000 103:0a 6953248 /usr/lib/x86_64-linux-gnu/libdl-2.31.so 7f6e27d2a000-7f6e27d30000 
r--p 00000000 103:0a 6960663 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
7f6e27d30000-7f6e27d41000 r-xp 00006000 103:0a 6960663 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
7f6e27d41000-7f6e27d47000 r--p 00017000 103:0a 6960663 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
7f6e27d47000-7f6e27d48000 r--p 0001c000 103:0a 6960663 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
7f6e27d48000-7f6e27d49000 rw-p 0001d000 103:0a 6960663 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
7f6e27d49000-7f6e27d4f000 rw-p 00000000 00:00 0
7f6e27d6b000-7f6e27d6c000 r--p 00000000 103:0a 6947394 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f6e27d6c000-7f6e27d8f000 r-xp 00001000 103:0a 6947394 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f6e27d8f000-7f6e27d97000 r--p 00024000 103:0a 6947394 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f6e27d98000-7f6e27d99000 r--p 0002c000 103:0a 6947394 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f6e27d99000-7f6e27d9a000 rw-p 0002d000 103:0a 6947394 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f6e27d9a000-7f6e27d9b000 rw-p 00000000 00:00 0
7fff30d81000-7fff30da2000 rw-p 00000000 00:00 0 [stack]
7fff30de5000-7fff30de9000 r--p 00000000 00:00 0 [vvar]
7fff30de9000-7fff30deb000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
```
The listing above shows how the stack, heap, bss, libraries, and so on are laid out in memory.

![](https://i.imgur.com/wBOG1fP.jpg =250x)

### Others
We then added more variable addresses to observe, and something strange happened.
```c
// check whether the size of the malloc makes a difference
int *smol_heap_ptr = (int*) malloc(5);
int *big_heap_ptr = (int*) malloc(10000);

const int global_constant = 0xc8763;    // .rodata

// main
void (*exit_addr)(int) = dlsym(RTLD_NEXT, "exit");
char *env_home = getenv("HOME");
printf(".rodata %p\n", &global_constant);
printf("argv %p\n", argv);
```
```
Hello World!
size: 00001000, start: 55a37da32000, end: 55a37da33000, flag: 8000071, prot: 8000000000000025 size: 00001000, start: 55a37da33000, end: 55a37da34000, flag: 8000075, prot: 0000025 <-main_text, thread1_text, thread2_text size: 00001000, start: 55a37da34000, end: 55a37da35000, flag: 8000071, prot: 8000000000000025 <-.rodata size: 00001000, start: 55a37da35000, end: 55a37da36000, flag: 8100071, prot: 8000000000000025 size: 00001000, start: 55a37da36000, end: 55a37da37000, flag: 8100073, prot: 8000000000000025 <-bss, data size: 00021000, start: 55a37f214000, end: 55a37f235000, flag: 8100073, prot: 8000000000000025 <-main_heap, main_big_heap size: 00021000, start: 7f8c64000000, end: 7f8c64021000, flag: 8200073, prot: 8000000000000025 <-thread2_smol_heap, thread2_big_heap size: 03fdf000, start: 7f8c64021000, end: 7f8c68000000, flag: 8200070, prot: 0000120 size: 00001000, start: 7f8c6b7ff000, end: 7f8c6b800000, flag: 8000070, prot: 0000120 size: 00800000, start: 7f8c6b800000, end: 7f8c6c000000, flag: 8100073, prot: 8000000000000025 <-thread2_stack size: 00021000, start: 7f8c6c000000, end: 7f8c6c021000, flag: 8200073, prot: 8000000000000025 <-thread1_smol_heap, thread1_big_heap size: 03fdf000, start: 7f8c6c021000, end: 7f8c70000000, flag: 8200070, prot: 0000120 size: 00001000, start: 7f8c70329000, end: 7f8c7032a000, flag: 8000070, prot: 0000120 size: 00803000, start: 7f8c7032a000, end: 7f8c70b2d000, flag: 8100073, prot: 8000000000000025 <-thread1_stack size: 00022000, start: 7f8c70b2d000, end: 7f8c70b4f000, flag: 8000071, prot: 8000000000000025 size: 00178000, start: 7f8c70b4f000, end: 7f8c70cc7000, flag: 8000075, prot: 0000025 <-libc size: 0004e000, start: 7f8c70cc7000, end: 7f8c70d15000, flag: 8000071, prot: 8000000000000025 size: 00004000, start: 7f8c70d15000, end: 7f8c70d19000, flag: 8100071, prot: 8000000000000025 size: 00002000, start: 7f8c70d19000, end: 7f8c70d1b000, flag: 8100073, prot: 8000000000000025 <-libc-var size: 00004000, start: 7f8c70d1b000, end: 
7f8c70d1f000, flag: 8100073, prot: 8000000000000025
size: 00001000, start: 7f8c70d1f000, end: 7f8c70d20000, flag: 8000071, prot: 8000000000000025
size: 00002000, start: 7f8c70d20000, end: 7f8c70d22000, flag: 8000075, prot: 0000025
size: 00001000, start: 7f8c70d22000, end: 7f8c70d23000, flag: 8000071, prot: 8000000000000025
size: 00001000, start: 7f8c70d23000, end: 7f8c70d24000, flag: 8100071, prot: 8000000000000025
size: 00001000, start: 7f8c70d24000, end: 7f8c70d25000, flag: 8100073, prot: 8000000000000025
size: 00006000, start: 7f8c70d25000, end: 7f8c70d2b000, flag: 8000071, prot: 8000000000000025
size: 00011000, start: 7f8c70d2b000, end: 7f8c70d3c000, flag: 8000075, prot: 0000025
size: 00006000, start: 7f8c70d3c000, end: 7f8c70d42000, flag: 8000071, prot: 8000000000000025
size: 00001000, start: 7f8c70d42000, end: 7f8c70d43000, flag: 8100071, prot: 8000000000000025
size: 00001000, start: 7f8c70d43000, end: 7f8c70d44000, flag: 8100073, prot: 8000000000000025
size: 00006000, start: 7f8c70d44000, end: 7f8c70d4a000, flag: 8100073, prot: 8000000000000025
size: 00001000, start: 7f8c70d66000, end: 7f8c70d67000, flag: 8000071, prot: 8000000000000025
size: 00023000, start: 7f8c70d67000, end: 7f8c70d8a000, flag: 8000075, prot: 0000025
size: 00008000, start: 7f8c70d8a000, end: 7f8c70d92000, flag: 8000071, prot: 8000000000000025
size: 00001000, start: 7f8c70d93000, end: 7f8c70d94000, flag: 8100071, prot: 8000000000000025
size: 00001000, start: 7f8c70d94000, end: 7f8c70d95000, flag: 8100073, prot: 8000000000000025
size: 00001000, start: 7f8c70d95000, end: 7f8c70d96000, flag: 8100073, prot: 8000000000000025
size: 00021000, start: 7fffbb3b5000, end: 7fffbb3d6000, flag: 0100173, prot: 8000000000000025 <-main_stack, argv, env
size: 00004000, start: 7fffbb3d7000, end: 7fffbb3db000, flag: c044411, prot: 8000000000000025
size: 00002000, start: 7fffbb3db000, end: 7fffbb3dd000, flag: 8040075, prot: 0000025
```
We found that thread1_heap, thread2_heap, and main_heap are in different VMAs. This matches glibc's malloc, which gives each thread its own arena: each thread heap above sits at the start of a 64 MiB reservation (0x21000 + 0x3fdf000 = 0x4000000).
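The `flag` column in the dumps can be decoded against the `VM_*` table from `include/linux/mm.h` shown earlier. The following is a minimal user-space sketch, not part of the lab code: the helper name `decode_vm_flags` and its table are made up for this example, and the table covers only some bits. The `MAYREAD`/`MAYWRITE`/`MAYEXEC`/`MAYSHARE` values (0x10 to 0x80) come from the same header but are not in the excerpt above. Bits without a table entry are returned rather than guessed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: a subset of the VM_* bits from include/linux/mm.h. */
struct vm_flag_name {
    unsigned long bit;
    const char *name;
};

static const struct vm_flag_name vm_flag_names[] = {
    { 0x00000001UL, "READ" },      { 0x00000002UL, "WRITE" },
    { 0x00000004UL, "EXEC" },      { 0x00000008UL, "SHARED" },
    { 0x00000010UL, "MAYREAD" },   { 0x00000020UL, "MAYWRITE" },
    { 0x00000040UL, "MAYEXEC" },   { 0x00000080UL, "MAYSHARE" },
    { 0x00000100UL, "GROWSDOWN" }, { 0x00100000UL, "ACCOUNT" },
    { 0x00200000UL, "NORESERVE" },
};

/*
 * Writes the names of the known set bits into buf (len must be at least 1),
 * separated by '|', and returns the bits that have no entry in the table.
 */
static unsigned long decode_vm_flags(unsigned long flags, char *buf, size_t len)
{
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(vm_flag_names) / sizeof(vm_flag_names[0]); i++) {
        if (flags & vm_flag_names[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, "|", len - strlen(buf) - 1);
            strncat(buf, vm_flag_names[i].name, len - strlen(buf) - 1);
            flags &= ~vm_flag_names[i].bit;
        }
    }
    return flags;
}
```

Decoding the values from the dump, `0x8100073` (main heap) gives `READ|WRITE|MAYREAD|MAYWRITE|MAYEXEC|ACCOUNT`, while `0x8200073` (thread heaps) ends in `NORESERVE` instead of `ACCOUNT`, and `0x0100173` (main stack) additionally carries `GROWSDOWN`.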
https://www.kernel.org/doc/gorman/html/understand/understand006.html

PGD, P4D, PUD, PMD, PTE
cr3, cr4

## 補充

</br>

### current
```c
// arch/x86/include/asm/current.h
static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()
```
```c
// arch/x86/include/asm/percpu.h
/*
 * this_cpu_read() makes gcc load the percpu variable every time it is
 * accessed while this_cpu_read_stable() allows the value to be cached.
 * this_cpu_read_stable() is more efficient and can be used if its value
 * is guaranteed to be valid across cpus. The current users include
 * get_current() and get_thread_info() both of which are actually
 * per-thread variables implemented as per-cpu variables and thus
 * stable for the duration of the respective task.
 */
#define this_cpu_read_stable_1(pcp) percpu_stable_op(1, "mov", pcp)
#define this_cpu_read_stable_2(pcp) percpu_stable_op(2, "mov", pcp)
#define this_cpu_read_stable_4(pcp) percpu_stable_op(4, "mov", pcp)
#define this_cpu_read_stable_8(pcp) percpu_stable_op(8, "mov", pcp)
#define this_cpu_read_stable(pcp) __pcpu_size_call_return(this_cpu_read_stable_, pcp)
```

</br>
</br>

### virt_to_phys
Use the following command to check whether 5-level paging is configured:
```bash
cat /boot/config-`uname -r` | grep 5LEVEL
```
If 5-level paging is enabled, the walk goes through one more level, `p4d`:

pgd -> p4d -> pud -> pmd -> pte

```c
// Body of the vir2phys() helper called from the syscall.
// Returns -1 (all ones as unsigned long) when a level is not mapped.
static unsigned long vir2phys(struct mm_struct *mm, unsigned long vaddr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long page_addr, page_offset, paddr;

    pgd = pgd_offset(mm, vaddr);
    if (pgd_none(*pgd)) {
        printk("not mapped in pgd\n");
        return -1;
    }

    p4d = p4d_offset(pgd, vaddr);
    if (p4d_none(*p4d)) {
        printk("not mapped in p4d\n");
        return -1;
    }

    pud = pud_offset(p4d, vaddr);
    if (pud_none(*pud)) {
        printk("not mapped in pud\n");
        return -1;
    }

    pmd = pmd_offset(pud, vaddr);
    if (pmd_none(*pmd)) {
        printk("not mapped in pmd\n");
        return -1;
    }

    pte = pte_offset_kernel(pmd, vaddr);
    if (pte_none(*pte)) {
        printk("not mapped in pte\n");
        return -1;
    }

    // pte_val() is the raw entry: flag bits sit in the low 12 bits and NX in bit 63.
    // pte_pfn(*pte) << PAGE_SHIFT would give the frame address without them.
    page_addr = pte_val(*pte);
    page_offset = vaddr & ~PAGE_MASK;   // VMA boundaries all end in 000 (12 bits), so this has no effect here
    paddr = (page_addr << PAGE_SHIFT) | page_offset;

    return paddr;
}
```
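As a rough illustration of what each `*_offset()` call in the walk above selects, the sketch below splits a virtual address into its table indices in plain user-space C. It assumes x86-64 with 4 KiB pages and 9 index bits per level. With 4-level paging the top level (level 3) is the PGD; with 5-level paging level 3 would be the P4D and level 4 the PGD. The helper names `va_index`, `va_offset`, and `pte_frame` are made up for this example and are not kernel APIs.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12  /* 4 KiB pages: 12 offset bits (assumption) */
#define SKETCH_LEVEL_BITS    9  /* 512 entries per table (assumption) */

/* Index into the table at the given level: 0 = PTE, 1 = PMD, 2 = PUD, 3 = PGD (4-level). */
static unsigned int va_index(uint64_t vaddr, int level)
{
    return (unsigned int)((vaddr >> (SKETCH_PAGE_SHIFT + SKETCH_LEVEL_BITS * level)) & 0x1ff);
}

/* Byte offset inside the 4 KiB page. */
static uint64_t va_offset(uint64_t vaddr)
{
    return vaddr & ((1ULL << SKETCH_PAGE_SHIFT) - 1);
}

/*
 * Physical frame address held in a raw x86-64 page-table entry:
 * bits 12..51 are the frame, the low 12 bits and bit 63 (NX) are flags.
 */
static uint64_t pte_frame(uint64_t raw_pte)
{
    return raw_pte & 0x000ffffffffff000ULL;
}
```

For the start address `0x560cf5b3a000` these helpers yield PGD index 172, PUD index 51, PMD index 429, PTE index 314, and offset 0. With an illustrative raw entry of `0x80000003e1827025`, `pte_frame` returns `0x3e1827000`, which shows why the flag bits have to be masked off before a raw entry is read as an address.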
:::spoiler Result
```
Hello World!: this is new
size: 00001000, start: 560cf5b3a000, end: 560cf5b3b000, phy_s: 00003e1827025000, phy_e: 00003d9da0025000, flag: 08000071
size: 00001000, start: 560cf5b3b000, end: 560cf5b3c000, phy_s: 00003d9da0025000, phy_e: 00003d9da1025000, flag: 08000075 <-main_text, thread1_text, thread2_text
size: 00001000, start: 560cf5b3c000, end: 560cf5b3d000, phy_s: 00003d9da1025000, phy_e: 00001dcc7f865000, flag: 08000071 <-.rodata
size: 00001000, start: 560cf5b3d000, end: 560cf5b3e000, phy_s: 00001dcc7f865000, phy_e: 00001dcced867000, flag: 08100071
not mapped in pte
size: 00001000, start: 560cf5b3e000, end: 560cf5b3f000, phy_s: 00001dcced867000, phy_e: ffffffffffffffff, flag: 08100073 <-bss, data
not mapped in pte
size: 00021000, start: 560cf72da000, end: 560cf72fb000, phy_s: 00002e34aa867000, phy_e: ffffffffffffffff, flag: 08100073 <-main_heap, main_test_heap, main_big_heap
not mapped in pte
size: 00021000, start: 7fc450000000, end: 7fc450021000, phy_s: 00003742a0867000, phy_e: ffffffffffffffff, flag: 08200073 <-thread1_smol_heap, thread1_big_heap, thread1_test_heap, thread2_smol_heap, thread2_big_heap, thread2_test_heap
not mapped in pte
not mapped in pmd
size: 03fdf000, start: 7fc450021000, end: 7fc454000000, phy_s: ffffffffffffffff, phy_e: ffffffffffffffff, flag: 08200070
not mapped in pmd
not mapped in pmd
size: 00001000, start: 7fc456cb8000, end: 7fc456cb9000, phy_s: ffffffffffffffff, phy_e: ffffffffffffffff, flag: 08000070 <-thread2_stack
not mapped in pmd
size: 00803000, start: 7fc456cb9000, end: 7fc4574bc000, phy_s: ffffffffffffffff, phy_e: 00004b9a7b025000, flag: 08100073 <-thread1_stack
size: 00022000, start: 7fc4574bc000, end: 7fc4574de000, phy_s: 00004b9a7b025000, phy_e: 00004b8961025000, flag: 08000071
not mapped in pte
size: 00178000, start: 7fc4574de000, end: 7fc457656000, phy_s: 00004b8961025000, phy_e: ffffffffffffffff, flag: 08000075 <-libc
not mapped in pte
size: 0004e000, start: 7fc457656000, end: 7fc4576a4000, phy_s: ffffffffffffffff, phy_e: 00003488f1865000, flag: 08000071
size: 00004000, start: 7fc4576a4000, end: 7fc4576a8000, phy_s: 00003488f1865000, phy_e: 00003e2f7a867000, flag: 08100071
size: 00002000, start: 7fc4576a8000, end: 7fc4576aa000, phy_s: 00003e2f7a867000, phy_e: 000035a286867000, flag: 08100073 <-libc-var
size: 00004000, start: 7fc4576aa000, end: 7fc4576ae000, phy_s: 000035a286867000, phy_e: 0000c2f24d025000, flag: 08100073
size: 00001000, start: 7fc4576ae000, end: 7fc4576af000, phy_s: 0000c2f24d025000, phy_e: 0000c2f24e025000, flag: 08000071
not mapped in pte
size: 00002000, start: 7fc4576af000, end: 7fc4576b1000, phy_s: 0000c2f24e025000, phy_e: ffffffffffffffff, flag: 08000075
not mapped in pte
size: 00001000, start: 7fc4576b1000, end: 7fc4576b2000, phy_s: ffffffffffffffff, phy_e: 0000321069865000, flag: 08000071
size: 00001000, start: 7fc4576b2000, end: 7fc4576b3000, phy_s: 0000321069865000, phy_e: 00001dccf8867000, flag: 08100071
size: 00001000, start: 7fc4576b3000, end: 7fc4576b4000, phy_s: 00001dccf8867000, phy_e: 000010b873025000, flag: 08100073
size: 00006000, start: 7fc4576b4000, end: 7fc4576ba000, phy_s: 000010b873025000, phy_e: 000010b879025000, flag: 08000071
not mapped in pte
size: 00011000, start: 7fc4576ba000, end: 7fc4576cb000, phy_s: 000010b879025000, phy_e: ffffffffffffffff, flag: 08000075
not mapped in pte
size: 00006000, start: 7fc4576cb000, end: 7fc4576d1000, phy_s: ffffffffffffffff, phy_e: 00003fcee6865000, flag: 08000071
size: 00001000, start: 7fc4576d1000, end: 7fc4576d2000, phy_s: 00003fcee6865000, phy_e: 00003e5fbc867000, flag: 08100071
not mapped in pte
size: 00001000, start: 7fc4576d2000, end: 7fc4576d3000, phy_s: 00003e5fbc867000, phy_e: ffffffffffffffff, flag: 08100073
not mapped in pte
not mapped in pte
size: 00006000, start: 7fc4576d3000, end: 7fc4576d9000, phy_s: ffffffffffffffff, phy_e: ffffffffffffffff, flag: 08100073
size: 00001000, start: 7fc4576f5000, end: 7fc4576f6000, phy_s: 00004b8852025000, phy_e: 000010bb09025000, flag: 08000071
size: 00023000, start: 7fc4576f6000, end: 7fc457719000, phy_s: 000010bb09025000, phy_e: 00004b9835025000, flag: 08000075
not mapped in pte
size: 00008000, start: 7fc457719000, end: 7fc457721000, phy_s: 00004b9835025000, phy_e: ffffffffffffffff, flag: 08000071
size: 00001000, start: 7fc457722000, end: 7fc457723000, phy_s: 0000199b5d865000, phy_e: 00001dccf3867000, flag: 08100071
size: 00001000, start: 7fc457723000, end: 7fc457724000, phy_s: 00001dccf3867000, phy_e: 000041ed55867000, flag: 08100073
not mapped in pte
size: 00001000, start: 7fc457724000, end: 7fc457725000, phy_s: 000041ed55867000, phy_e: ffffffffffffffff, flag: 08100073
not mapped in pte
not mapped in pte
size: 00021000, start: 7ffea2eda000, end: 7ffea2efb000, phy_s: ffffffffffffffff, phy_e: ffffffffffffffff, flag: 00100173 <-main_stack, argv, env
not mapped in pte
size: 00004000, start: 7ffea2fe0000, end: 7ffea2fe4000, phy_s: ffffffffffffffff, phy_e: 00004b82b6025000, flag: 0c044411
not mapped in pte
size: 00002000, start: 7ffea2fe4000, end: 7ffea2fe6000, phy_s: 00004b82b6025000, phy_e: ffffffffffffffff, flag: 08040075
```
:::

The low 12 bits of every `start`/`end` are zero because a page is 4 KiB (2^12 bytes), so `vma->vm_start` is always page aligned.

The top 12 bits of the physical addresses are zero because the test machine uses only 52 bits for memory addresses.

> Many pmd/pte entries in the result above are not mapped. My guess is that this is cache related; another plausible cause is demand paging, where a page only gets a PTE after it is first touched.

</br>
</br>

### for_each_thread

```c
#define __for_each_thread(signal, t)	\
	list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)

#define for_each_thread(p, t)		\
	__for_each_thread((p)->signal, t)

/* Careful: this is a double loop, 'break' won't work as expected. */
#define for_each_process_thread(p, t)	\
	for_each_process(p) for_each_thread(p, t)
```

```c
#define list_for_each_entry_rcu(pos, head, member, cond...)		\
	for (__list_check_rcu(dummy, ## cond, 0),			\
	     pos = list_entry_rcu((head)->next, typeof(*pos), member);	\
		&pos->member != (head);					\
		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
```

</br>
</br>

### copy kernel to user space

The syscall parameter that points into user space must be declared like this: `void __user *buffer`

```c
// include/linux/compiler_types.h
# define __user __attribute__((noderef, address_space(__user)))
```

`__user` is a macro that expands to an `__attribute__`. It exists for static error checking: the annotation is consumed by the Sparse checker (it is only defined this way under `__CHECKER__`) and marks the pointer as belonging to the user address space so that it is not dereferenced directly.

Call chain:

`copy_to_user` -> `_copy_to_user` -> `raw_copy_to_user` -> `copy_user_generic` -> alternative chosen by CPU feature (erms, rep_good, other) -> (rep_good) `copy_user_generic_string`

```c
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
```

Both return the number of bytes that could not be copied, so 0 means success.

```c
// include/linux/uaccess.h
static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	if (likely(check_copy_size(from, n, true)))
		n = _copy_to_user(to, from, n);
	return n;
}

static inline __must_check unsigned long
_copy_to_user(void __user *to, const void *from, unsigned long n)
{
	might_fault();
	if (should_fail_usercopy())
		return n;
	if (access_ok(to, n)) {
		instrument_copy_to_user(to, from, n);
		n = raw_copy_to_user(to, from, n);
	}
	return n;
}
```

```c
// arch/x86/include/asm/uaccess_64.h
static __always_inline __must_check unsigned long
raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
{
	return copy_user_generic((__force void *)dst, src, size);
}

static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned len)
{
	unsigned ret;

	/*
	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
	 * Otherwise, use copy_user_generic_unrolled.
	 */
	alternative_call_2(copy_user_generic_unrolled,
			 copy_user_generic_string,
			 X86_FEATURE_REP_GOOD,
			 copy_user_enhanced_fast_string,
			 X86_FEATURE_ERMS,
			 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
				     "=d" (len)),
			 "1" (to), "2" (from), "3" (len)
			 : "memory", "rcx", "r8", "r9", "r10", "r11");
	return ret;
}
```

```c
/* Some CPUs run faster using the string copy instructions.
 * This is also a lot simpler. Use them when possible.
 *
 * Only 4GB of copy is supported. This shouldn't be a problem
 * because the kernel normally only writes from/to page sized chunks
 * even if user space passed a longer buffer.
 * And more would be dangerous because both Intel and AMD have
 * errata with rep movsq > 4GB. If someone feels the need to fix
 * this please consider this.
 *
 * Input:
 * rdi destination
 * rsi source
 * rdx count
 *
 * Output:
 * eax uncopied bytes or 0 if successful.
 */
ENTRY(copy_user_generic_string)
	ASM_STAC
	cmpl $8,%edx
	jb 2f		/* less than 8 bytes, go to byte copy loop */
	ALIGN_DESTINATION
	movl %edx,%ecx
	shrl $3,%ecx
	andl $7,%edx
1:	rep movsq
2:	movl %edx,%ecx
3:	rep movsb
	xorl %eax,%eax
	ASM_CLAC
	ret

	.section .fixup,"ax"
11:	leal (%rdx,%rcx,8),%ecx
12:	movl %ecx,%edx		/* ecx is zerorest also */
	jmp copy_user_handle_tail
	.previous

	_ASM_EXTABLE(1b,11b)
	_ASM_EXTABLE(3b,12b)
ENDPROC(copy_user_generic_string)
EXPORT_SYMBOL(copy_user_generic_string)
```

</br>
</br>

## Reference

- [pthread introduction](https://github.com/angrave/SystemProgramming/wiki/Pthreads%2C-Part-1%3A-Introduction)
- https://linux-kernel-labs.github.io/refs/heads/master/lectures/address-space.html
- https://zhuanlan.zhihu.com/p/436879901
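
The page-alignment remark in the Result section can be checked with a few lines of user-space C. This is only a sketch under stated assumptions: 4 KiB pages and a 52-bit physical address limit (the x86-64 values). Every `DEMO_*` macro, the helper names, and the sample values are hypothetical and are not part of the lab's syscall code. It shows how an address splits into a page base and an in-page offset, and how the frame bits (12 to 51) of a raw page-table entry would be masked before the offset is added back.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on the x86-64 test machine. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Assumption: at most 52 physical address bits. */
#define DEMO_PHYS_BITS  52
#define DEMO_PFN_MASK   (((UINT64_C(1) << DEMO_PHYS_BITS) - 1) & DEMO_PAGE_MASK)

/* Page-aligned base of an address: the low 12 bits are cleared. */
static uint64_t page_base(uint64_t vaddr)
{
	return vaddr & DEMO_PAGE_MASK;
}

/* Byte offset of the address inside its page: only the low 12 bits. */
static uint64_t page_offset(uint64_t vaddr)
{
	return vaddr & ~DEMO_PAGE_MASK;
}

/*
 * Combine a raw page-table entry with a virtual address:
 * keep only the frame bits (12..51) of the entry, which drops the
 * low flag bits and the high bits, then add the in-page offset.
 */
static uint64_t phys_from_pte(uint64_t pte, uint64_t vaddr)
{
	uint64_t frame = pte & DEMO_PFN_MASK;

	assert((frame & ~DEMO_PAGE_MASK) == 0); /* frame is page aligned */
	return frame | page_offset(vaddr);
}
```

For example, `page_base(0x560cf5b3a123)` gives `0x560cf5b3a000`, which has the same shape as the aligned `start` values in the Result section.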