# /dev/mem research

contributed by < `jhan1998` >

## Environment Settings

Operating environment information:
```
OS: Ubuntu 20.04.2 LTS
Kernel Version: 5.4.0-72-generic
Memory: 15 G
CPU: Intel® Core™ i7-4770HQ CPU @ 2.20GHz × 8
```

First, we can keep part of the memory out of the kernel's memory management by modifying the boot parameters in `/etc/default/grub`. The method described in the [/dev/mem](https://hackmd.io/@sysprog/linux-mem-device#Linux-%E6%A0%B8%E5%BF%83%E7%9A%84-devmem-%E8%A3%9D%E7%BD%AE) article is to add `mem=14G` to `GRUB_CMDLINE_LINUX_DEFAULT=""` and then run `sudo update-grub`. After rebooting, the expectation is that 15G - 14G = 1G of memory is reserved.

**Before Setting**
```bash
$ free
              total        used        free      shared  buff/cache   available
Mem:       16270944     2263540    11458908     1256396     2548496    12436764
Swap:       1999868           0     1999868
$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998:
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-47f5fffff : System RAM
47f600000-47fffffff : RAM buffer
```

**After Setting**
```bash
$ free
              total        used        free      shared  buff/cache   available
Mem:       12152416     2019588     8262100      683884     1870728     9161272
Swap:       1999868           0     1999868
$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998:
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-37fffffff : System RAM
```

Comparing the two, the reported memory has indeed shrunk, because reserved memory is not recorded in any
statistics of the kernel, but the space lost is more than 1 GB. The address-space map gives the clue. Before `mem=14G` is set, memory is mapped up to `0x47fffffff`; after `mem=14G` is set, the upper bound becomes `0x37fffffff`. A simple conversion gives:

$$
0x47fffffff = 18G - 1 \\
0x37fffffff = 14G - 1
$$

So `mem=` sets the highest physical address the kernel may use, and on this machine RAM is mapped up to `18G`. With `mem=14G`, the kernel only uses addresses below 14G, and everything between 14G and 18G, which is backed by real memory, is cut off. The usable memory is therefore `15G - (18G - 14G) = 11G`, which matches what `free` reports. To reserve exactly the last 1G, we set `mem=17G` instead.

**After Setting**
```bash
$ free
              total        used        free      shared  buff/cache   available
Mem:       15248992      607292    13567272      405288     1074428    13958960
Swap:       1999868           0     1999868
$ free -g
              total        used        free      shared  buff/cache   available
Mem:             14           0          12           0           1          13
Swap:             1           0           1
$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998:
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-43fffffff : System RAM
```

Here `0x43fffffff = 17G - 1`, so `0x440000000` to `0x47fffffff` is our reserved region.

## Use crash to map the memory reserved by the system

Getting crash to run took me a lot of time. To run crash, you need the kernel debug symbols.
On Ubuntu, follow the `Getting -dbgsym.ddeb packages` instructions in [Debug symbol package](https://wiki.ubuntu.com/Debug%20Symbol%20Packages) to set up `/etc/apt/sources.list.d/ddebs.list`, then follow [Getting Kernel Symbols/Sources on Ubuntu Linux](https://sysprogs.com/VisualKernel/tutorials/setup/ubuntu/) to download the debug symbols for the matching kernel version. After downloading, we can try running crash:

```bash
sudo crash /home/jhan1998/modules/boot/vmlinux-5.4.0-73-generic /dev/mem

crash 7.2.8
...
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
```

Now we can compare reading unreserved memory with reading our reserved memory.

```bash
crash> rd -p 0x43ffffff1
43ffffff1:  5000000030000000                    ...0...P
crash> rd -p 0x440000000
rd: seek error: physical address: 440000000  type: "64-bit PHYSADDR"
```

The memory from 0x440000000 onward is the region we reserved, and the kernel has no mapping for it, so it cannot be read, whereas the content at 0x43ffffff1 can be read.
Next we can use mmap to access the reserved space:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned char *addr;
    int fd;

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        printf("device file open error !\n");
        return 1;
    }
    /* Map one page starting at the first reserved physical address */
    addr = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x440000000);
    if (addr == MAP_FAILED) {
        printf("mmap error !\n");
        close(fd);
        return 1;
    }
    printf("addr = %p \n", addr);
    *(volatile unsigned int *)(addr + 0x00) = 0x1; // 0x440000000, set its value to 1
    *(volatile unsigned int *)(addr + 0x04) = 0x9; // 0x440000004, set its value to 9
    printf("the address is %p, and the value is %d\n", addr + 0x00, *(addr + 0x00));
    printf("the address is %p, and the value is %d\n", addr + 0x04, *(addr + 0x04));
    /* Keep the process alive so it can be inspected with crash */
    system("read -p 'Press Enter to continue...' var");
    munmap(addr, 4096);
    close(fd);
    return 0;
}
```

The output shows the virtual address of the mapping.

```bash
$ sudo ./test
addr = 0x7f9828a97000
the address is 0x7f9828a97000, and the value is 1
the address is 0x7f9828a97004, and the value is 9
Press Enter to continue...
```

Next, we use crash to look up the corresponding physical address.

```
crash> vtop 0x7f9828a97000
VIRTUAL     PHYSICAL
7f9828a97000  (not accessible)
```

At first the physical address cannot be looked up, because crash's current context is not the `test` process and so the wrong page tables are consulted. We can use the `set` command to switch to the process's context, after which the corresponding physical address can be viewed.
```
crash> ps | grep test
  21073  21072   2  ffff96885eec8000  IN   0.0    2492   1436  test
crash> set 21073
    PID: 21073
COMMAND: "test"
   TASK: ffff96885eec8000  [THREAD_INFO: ffff96885eec8000]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x7f9828a97000
VIRTUAL     PHYSICAL
7f9828a97000  440000000

   PGD: 1e2da07f8 => 80000003cb2a4067
   PUD: 3cb2a4300 => 27b99e067
   PMD: 27b99ea28 => 3bcd98067
   PTE: 3bcd984b8 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff9687bb915450 7f9828a97000 7f9828a98000 d0444bb /dev/mem
```

It can be seen that `0x440000000` is the physical address we reserved.

:::info
:bell: Running `rd` again on the virtual address to print the mapped content still fails:
```
crash> rd 0x7f9828a97000
rd: seek error: user virtual address: 7f9828a97000  type: "64-bit UVADDR"
```
Judging from the error message, I speculate that this is because crash is a kernel analysis tool, while the reserved memory is mapped into user space and belongs to the process's address space, so crash cannot read it.
:::

## Use crash to observe Five-level page tables

According to [Five-level page tables](https://lwn.net/Articles/717293/), the Linux kernel on x86_64 describes virtual-to-physical address translation with up to five levels of page tables (PGD, P4D, PUD, PMD, PTE). On this machine only four of them are actually walked, as the `vtop` output above shows: PGD, PUD, PMD and PTE.

![](https://i.imgur.com/CMz0h48.png)

Note that only 48 bits of the virtual address take part in translation rather than all 64. The top 16 bits are not used as table indexes, because 48 bits already cover a large enough address space of 256 TB.
Translation starts from the address of the `page global directory (PGD)`, which is found through the `mm_struct` of the `task_struct`. The top 9 bits of the translated address (bits 39-47) select the PGD entry, which gives the location of the `page upper directory (PUD)`. Bits 30-38 select the PUD entry, which gives the location of the `page middle directory (PMD)`. In the same way, bits 21-29 select the PMD entry and bits 12-20 select the `page table entry (PTE)`. Finally, the lowest 12 bits are the offset within the page, which completes the translation.

We can run the following program and use crash to observe the mechanism described in [Five-level page tables](https://lwn.net/Articles/717293/).

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned char *addr;
    int fd;

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        printf("device file open error !\n");
        return 1;
    }
    addr = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x440000000);
    if (addr == MAP_FAILED) {
        printf("mmap error !\n");
        close(fd);
        return 1;
    }
    printf("addr = %p \n", addr);
    *(volatile unsigned int *)(addr + 0x00) = 0x1; // 0x440000000, set its value to 1
    *(volatile unsigned int *)(addr + 0x04) = 0x9; // 0x440000004, set its value to 9
    printf("the address is %p, and the value is %d\n", addr + 0x00, *(addr + 0x00));
    printf("the address is %p, and the value is %d\n", addr + 0x04, *(addr + 0x04));
    /* Each level is indexed by 9 bits of the virtual address */
    printf("PGD index = 0x%llx\n", ((unsigned long long int)addr >> 39) & 0x1ff);
    printf("PUD index = 0x%llx\n", ((unsigned long long int)addr >> 30) & 0x1ff);
    printf("PMD index = 0x%llx\n", ((unsigned long long int)addr >> 21) & 0x1ff);
    printf("PTE index = 0x%llx\n", ((unsigned long long int)addr >> 12) & 0x1ff);
    /* Sleep until a signal arrives so the process can be inspected with crash */
    pause();
    munmap(addr, 4096);
    close(fd);
    return 0;
}
```

The resulting output is:

```bash
$ sudo ./test
addr = 0x7eff0939a000
the address is 0x7eff0939a000, and the value is 1
the address is 0x7eff0939a004, and the value is 9
PGD index = 0xfd
PUD index = 0x1fc
PMD index = 0x49
PTE index = 0x19a
```

Then use crash to observe:
First find out where the PGD is:

```bash
crash> ps | grep test
   9298   9296   6  ffff889ffe01dd00  IN   0.0    2492   1540  test
crash> set 9298
    PID: 9298
COMMAND: "test"
   TASK: ffff889ffe01dd00  [THREAD_INFO: ffff889ffe01dd00]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE
crash> px ((struct task_struct*)0xffff889ffe01dd00)->mm->pgd
$1 = (pgd_t *) 0xffff889e7894a000
```

Since `$1 = (pgd_t *) 0xffff889e7894a000` is a virtual address, we also need `vtop` to convert it to a physical address.

```bash
crash> vtop 0xffff889e7894a000
VIRTUAL           PHYSICAL
ffff889e7894a000  27894a000

PGD DIRECTORY: ffffffffb3c0a000
PAGE DIRECTORY: 38b801067
   PUD: 38b8013c8 => 278919063
   PMD: 278919e20 => 2789ee063
   PTE: 2789eea50 => 800000027894a063
  PAGE: 27894a000

      PTE         PHYSICAL   FLAGS
800000027894a063  27894a000  (PRESENT|RW|ACCESSED|DIRTY|NX)

      PAGE        PHYSICAL   MAPPING       INDEX        CNT FLAGS
ffffce5b49e25280  27894a000        0  ffff88a02a767b40    1 17ffffc0000000
```

So the PGD starts at physical address 0x27894a000. Next, run `vtop` on the original virtual address to check whether the indexes computed above are correct.

```bash
crash> vtop 0x7eff0939a000
VIRTUAL     PHYSICAL
7eff0939a000  440000000

   PGD: 27894a7e8 => 8000000321be3067
   PUD: 321be3fe0 => 321be5067
   PMD: 321be5248 => 3c8c0a067
   PTE: 3c8c0acd0 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889f1919d2b0 7eff0939a000 7eff0939b000 d0444bb /dev/mem
```

Here the offset into the PGD is not the index I calculated: it is my index shifted left by three bits, `0x7e8 = 0xfd << 3`. The definitions of pgd_index and pgd_offset are in [linux/include/linux/pgtable.h](https://github.com/torvalds/linux/blob/master/include/linux/pgtable.h).
```c
#ifndef pgd_index
/* Must be a compile-time constant, so implement it as a macro */
#define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
#endif

static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
{
	return (pgd + pgd_index(address));
};

/*
 * a shortcut to get a pgd_t in a given mm
 */
#ifndef pgd_offset
#define pgd_offset(mm, address)		pgd_offset_pgd((mm)->pgd, (address))
#endif
```

[linux/arch/x86/include/asm/pgtable_64.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_64.h)

```c
#define PGDIR_SHIFT		39
#define PTRS_PER_PGD		512
```

These definitions do explain the shift. `pgd_offset_pgd` adds the index to a `pgd_t *`, and C pointer arithmetic scales the index by the size of the element. Each page-table entry on x86_64 is 8 bytes, so the byte offset into the table is `index * 8`, which is `index << 3`.

Back in crash, we can read the PGD entry at the address obtained above to get the location of the PUD.

```bash
crash> rd -p 27894a7e8
27894a7e8:  8000000321be3067                    g0.!....
```

Ignoring the top bit (the NX flag), the entry is `0x321be3067`. Its low 12 bits are flag bits, so the PUD starts at `0x321be3000`. Adding the shifted PUD index and reading that entry gives the start of the PMD.

**`0x1fc << 3 = 0xfe0`**

```bash
crash> rd -p 0x321be3fe0
321be3fe0:  0000000321be5067                    gP.!....
```

In the same way, the PMD entry gives the start of the page table that holds the PTE.

**`0x49 << 3 = 0x248`**

```bash
crash> rd -p 0x321be5248
321be5248:  00000003c8c0a067                    g.......
```

Reading the PTE gives the page we want.

**`0x19a << 3 = 0xcd0`**

```bash
crash> rd -p 0x3c8c0acd0
3c8c0acd0:  8000000440000267                    g..@....
```

Clearing the flag bits and adding the low 12 bits of the virtual address (the page offset, 0 here) gives our physical address `0x440000000`.
**Verify with vtop:**

```bash
crash> vtop 0x7eff0939a000
VIRTUAL     PHYSICAL
7eff0939a000  440000000

   PGD: 27894a7e8 => 8000000321be3067
   PUD: 321be3fe0 => 321be5067
   PMD: 321be5248 => 3c8c0a067
   PTE: 3c8c0acd0 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889f1919d2b0 7eff0939a000 7eff0939b000 d0444bb /dev/mem
```

:::warning
:question: Why does the index have to be shifted left by three bits before being added to the start of the table? My initial guess was that it was related to big-endian versus little-endian byte order, but it is not: each entry in these tables is 8 bytes wide, so entry `i` lies `i * 8 = i << 3` bytes past the start of the table.
:::

## Page exchange between Processes

> Requirement: we do not want process_A and process_B to share any pages, which means they cannot operate on the same data at the same time. Occasionally, though, we still want process_A and process_B to exchange information, without using the inefficient traditional inter-process communication mechanisms.

After understanding the mechanism of [Five-level page tables](https://lwn.net/Articles/717293/), we can use crash to change how the reserved memory is mapped and exchange pages between processes. Unlike shared memory, which gives the processes one region to share, here we manually modify page-table entries through `/dev/mem` so that the two processes exchange information.
**master.c**

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);
    // Create a page P1 mapped to the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x440000000);
    // Modify the contents of P1
    *addr = 0x1122334455667788;
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    // Wait for the pages to be exchanged
    getchar();
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    close(fd);
    munmap(addr, 4096);
    return 1;
}
```

**slave.c**

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);
    // Create a page P2 mapped to the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x440004000);
    // Modify the contents of P2
    *addr = 0x8877665544332211;
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    // Wait for the pages to be exchanged
    getchar();
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    close(fd);
    munmap(addr, 4096);
    return 1;
}
```

After starting both programs, we can see the address and value in each process:

**master**

```bash
$ sudo ./master
address at: 0x7f9822c40000   content is: 0x1122334455667788
```

**slave**

```bash
$ sudo ./slave
address at: 0x7f9002cd1000   content is: 0x8877665544332211
```

To use crash to modify `/dev/mem`, the environment has to be set up first.

> When using crash to modify /dev/mem, you need a systemtap hook on `devmem_is_allowed` so that its return value is always 1; after that, memory can be modified directly.
> Steps:
> * Install systemtap
> * Run `stap -g -e 'probe kernel.function("devmem_is_allowed").return { $return = 1 }'`
> * Then start crash again

**Use crash to modify the page of master**

```bash
crash> ps | grep master
  27867  27866   4  ffff889e253945c0  IN   0.0    2492   1332  master
crash> set 27867
    PID: 27867
COMMAND: "master"
   TASK: ffff889e253945c0  [THREAD_INFO: ffff889e253945c0]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x7f9822c40000
VIRTUAL     PHYSICAL
7f9822c40000  440000000

   PGD: 1ee8e27f8 => 800000020514a067
   PUD: 20514a300 => 1f0d80067
   PMD: 1f0d808b0 => 391d34067
   PTE: 391d34200 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889dfa6cc0d0 7f9822c40000 7f9822c41000 d0444bb /dev/mem
crash> wr -64 -p 0x391d34200 0x8000000440004267
```

The `wr` command overwrites master's PTE so that it points to the slave's physical page `0x440004000`.

**Use crash to modify slave's page**

```bash
crash> ps | grep slave
  27869  27868   4  ffff889e2ab21740  IN   0.0    2492   1384  slave
crash> set 27869
    PID: 27869
COMMAND: "slave"
   TASK: ffff889e2ab21740  [THREAD_INFO: ffff889e2ab21740]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x7f9002cd1000
VIRTUAL     PHYSICAL
7f9002cd1000  440004000

   PGD: 1f0aea7f8 => 80000002539fc067
   PUD: 2539fc200 => 20a3a8067
   PMD: 20a3a80b0 => 205689067
   PTE: 205689688 => 8000000440004267
  PAGE: 440004000

      PTE         PHYSICAL   FLAGS
8000000440004267  440004000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889ecc833d40 7f9002cd1000 7f9002cd2000 d0444bb /dev/mem
crash> wr -64 -p 0x205689688 0x8000000440000267
```

Likewise, the slave's PTE now points to master's original page `0x440000000`. Then let both programs continue, and you can see that the two processes have exchanged information!
**master**

```bash
$ sudo ./master
address at: 0x7f9822c40000   content is: 0x1122334455667788

address at: 0x7f9822c40000   content is: 0x8877665544332211
```

**slave**

```bash
$ sudo ./slave
address at: 0x7f9002cd1000   content is: 0x8877665544332211

address at: 0x7f9002cd1000   content is: 0x1122334455667788
```

> This example is very suitable for designing micro-kernel inter-process communication. With the cache consistency protocol, it can achieve very high efficiency.

## Securely tamper with the memory of the process

This time, instead of using crash to modify the page through `/dev/mem`, we use another process to safely tamper with the memory of a target process.

First, map a piece of memory anonymously:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

int main(int argc, char **argv)
{
    unsigned char *addr;

    // Anonymously map a region of memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    // Modify the contents
    strcpy((char *)addr, "浙江溫州皮鞋濕");
    // This is only an example, so the address is printed directly;
    // in practice the memory location has to be found by hand
    printf("address at: %p   content is: %s\n", addr, addr);
    getchar();
    printf("address at: %p   content is: %s\n", addr, addr);
    munmap(addr, 4096);
    return 1;
}
```

The output at this point is:

```bash
$ ./change
address at: 0x7f4a035f1000   content is: 浙江溫州皮鞋濕
```

We can then use crash to look up the physical memory location and modify it from another program. First use crash to find the process and the physical address.
```bash
crash> ps | grep change
crash: current context no longer exists -- restoring "crash" context:

  36261   9027   4  ffff889e344d0000  IN   0.0    2496   1392  change
crash> set 36261
    PID: 36261
COMMAND: "change"
   TASK: ffff889e344d0000  [THREAD_INFO: ffff889e344d0000]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x7f4a035f1000
VIRTUAL     PHYSICAL
7f4a035f1000  18a6c4000

   PGD: 3b084a7f0 => 80000003b1171067
   PUD: 3b1171940 => 1ee97f067
   PMD: 1ee97f0d0 => 1f0b21067
   PTE: 1f0b21f88 => 800000018a6c4867
  PAGE: 18a6c4000

      PTE         PHYSICAL   FLAGS
800000018a6c4867  18a6c4000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889feeb425b0 7f4a035f1000 7f4a035f2000 80000fb dev/zero

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffce5b4629b100 18a6c4000 ffff889e2aa0c4b8        0  2 17ffffc008001c uptodate,dirty,lru,swapbacked
```

The translated physical address is `0x18a6c4000`. So we write a program that maps memory at this offset and changes its contents to `下雨進水不會胖`.

**hack.c**

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned char *addr;
    unsigned long long off;

    // Physical address to map, given in hexadecimal on the command line
    off = strtoll(argv[1], NULL, 16);
    fd = open("/dev/mem", O_RDWR);
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
    // Overwrite the string stored in the target page
    strcpy((char *)addr, "下雨進水不會胖");
    close(fd);
    munmap(addr, 4096);
    return 1;
}
```

When running it, remember to hook `devmem_is_allowed` first, otherwise there will be a segmentation fault.

```bash
sudo ./hack 0x18a6c4000
```

Going back to the first program, we can see that the value stored at `0x18a6c4000` has been modified successfully.

```bash
$ ./change
address at: 0x7f4a035f1000   content is: 浙江溫州皮鞋濕

address at: 0x7f4a035f1000   content is: 下雨進水不會胖
```

## Change the name of the process by changing /dev/mem

This time we will not use crash; we rely only on hacking /dev/mem to modify a process name.

> This matters in practice when operating an Internet product.
> Especially on managed machines, tools such as `crash` and `gdb` are generally not allowed for debugging, in order to prevent information leakage. The `systemtap` API, on the other hand, is restricted and therefore relatively safe, and kernel modules are generally not banned. But having `systemtap` and `/dev/mem` is enough!

Let's do a simple experiment with the following program:

- [ ] **Modify the name of the process being executed**

```c
#include <stdio.h>
int main(int argc, char **argv)
{
    getchar();
}
```

```bash
gcc -o pixie pixie.c && ./pixie
```

Now we have to find a way to change the name of the process from pixie to skinshoe. There is no `crash` and no `gdb`, only a `/dev/mem` that can be read and written (assuming we have hooked `devmem_is_allowed`). How can it be done?

All kernel data structures can be found in `/dev/mem`, so we need to find the location of the `task_struct` of the pixie process and then change its `comm` field. With the `crash` tool this is very easy: once the process is located, the position of `comm` inside its `task_struct` can be looked up directly.

```bash
crash> set 63972
    PID: 63972
COMMAND: "pixie"
   TASK: ffff9c3832572e80  [THREAD_INFO: ffff9c3832572e80]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE
crash> px ((struct task_struct*)0xffff9c3832572e80)->comm
$1 = "pixie\000PoolSingl"
crash> px &(((struct task_struct*)0xffff9c3832572e80)->comm)
$2 = (char (*)[16]) 0xffff9c38325738f8
```

But what if `crash` and `gdb` cannot be used? `/dev/mem` exposes the physical address space, while everything the operating system operates on is addressed virtually. Establishing the relationship between the two is the key. We notice three facts:
* x86_64 can directly map 64TiB of physical memory, which is enough to map any common amount of physical memory one to one.
* The Linux kernel creates a one-to-one mapping of all physical memory, with a fixed offset between physical address and virtual address.
* The data structures of the Linux kernel form a network of interlinked structures, so we can follow the links from one structure to the next.

This means that as long as we are given the virtual address of one kernel data structure, we can find its physical address and then follow the links to the `task_struct` of our pixie process. On a Linux system, the addresses of kernel data structures can be found in many places:
* `/proc/kallsyms`
* `/boot/System.map`
* the output of `lsof`

The article walks through one example, finding `init_task` in `/proc/kallsyms`:

```bash
$ sudo cat /proc/kallsyms | grep init_task
ffffffff953a90c0 T ftrace_graph_init_task
ffffffff95406390 T perf_event_init_task
ffffffff96685cec r __ksymtab_init_task
ffffffff966aaa19 r __kstrtab_init_task
ffffffff96800000 D __start_init_task
ffffffff96804000 D __end_init_task
ffffffff96813780 D init_task
ffffffff96f48c98 b ext4_lazyinit_task
```

The idea is then to work out the mapping from `init_task` to physical memory, walk the task linked list of the whole system starting from `init_task`, find the target `pixie` process, and make the change. But this does not perform the modification through `/dev/mem`, so the article provides another method.
First start a tcpdump process that does not need to capture any packets. It is just a cover that gives us a clue to start from:

```bash
$ sudo tcpdump -i lo -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
```

The reason for starting tcpdump is that it creates a packet socket, and the virtual address of that socket can be read from procfs:

**Before starting tcpdump**

```
$ sudo cat /proc/net/packet
sk               RefCnt Type Proto  Iface R Rmem   User   Inode
ffff9c39c0c87000 2      2    0000   0     0 0      0      384885
ffff9c397b431000 3      2    888e   2     1 0      0      384886
ffff9c386ec14000 3      2    890d   2     1 0      0      382437
ffff9c397b4ab800 2      2    0000   0     0 0      0      383853
ffff9c3796d78000 3      2    0800   2     1 0      0      380767
```

**After starting tcpdump**

```bash
$ sudo cat /proc/net/packet
sk               RefCnt Type Proto  Iface R Rmem   User   Inode
ffff9c39c0c87000 2      2    0000   0     0 0      0      384885
ffff9c397b431000 3      2    888e   2     1 0      0      384886
ffff9c386ec14000 3      2    890d   2     1 0      0      382437
ffff9c397b4ab800 2      2    0000   0     0 0      0      383853
ffff9c3796d78000 3      2    0800   2     1 0      0      380767
ffff9c3888878000 3      3    0003   1     1 0      0      382711
```

We can see that a `packet socket` has indeed been added. However, the way the article converts between virtual and physical addresses does not work on my computer, so here we still need the help of crash for the conversion. The article starts from the virtual address of the `packet socket`, works back to the position of the `wait_queue_head_t` inside the structure, then finds a `task_struct` from it and walks the whole `task_struct` list.

> This requires being very familiar with the data structures of the Linux kernel. If you are not, go to the corresponding source code and calculate the offsets.
> [Or use `struct X.y -o` in `crash` to calculate them]

But because the article is old and my system is not the same, we still have to use the `struct X.y -o` feature of `crash` to locate the structures we are looking for. First we start from `struct sock` and look for `sk_wq`:

```bash
crash> struct sock
struct sock {
    struct sock_common __sk_common;
    socket_lock_t sk_lock;
    atomic_t sk_drops;
    int sk_rcvlowat;
    struct sk_buff_head sk_error_queue;
    struct sk_buff *sk_rx_skb_cache;
    struct sk_buff_head sk_receive_queue;
    struct {
        atomic_t rmem_alloc;
        int len;
        struct sk_buff *head;
        struct sk_buff *tail;
    } sk_backlog;
    int sk_forward_alloc;
    unsigned int sk_ll_usec;
    unsigned int sk_napi_id;
    int sk_rcvbuf;
    struct sk_filter *sk_filter;
    union {
        struct socket_wq *sk_wq;
        struct socket_wq *sk_wq_raw;
    };
    struct xfrm_policy *sk_policy[2];
    struct dst_entry *sk_rx_dst;
    struct dst_entry *sk_dst_cache;
    atomic_t sk_omem_alloc;
    int sk_sndbuf;
    int sk_wmem_queued;
    refcount_t sk_wmem_alloc;
    unsigned long sk_tsq_flags;
    union {
        struct sk_buff *sk_send_head;
        struct rb_root tcp_rtx_queue;
    };
    struct sk_buff *sk_tx_skb_cache;
    struct sk_buff_head sk_write_queue;
    __s32 sk_peek_off;
    int sk_write_pending;
    __u32 sk_dst_pending_confirm;
    u32 sk_pacing_status;
    long sk_sndtimeo;
    struct timer_list sk_timer;
    __u32 sk_priority;
    __u32 sk_mark;
    unsigned long sk_pacing_rate;
    unsigned long sk_max_pacing_rate;
    struct page_frag sk_frag;
    netdev_features_t sk_route_caps;
    netdev_features_t sk_route_nocaps;
    netdev_features_t sk_route_forced_caps;
    int sk_gso_type;
    unsigned int sk_gso_max_size;
    gfp_t sk_allocation;
    __u32 sk_txhash;
    unsigned int __sk_flags_offset[0];
    unsigned int sk_padding : 1;
    unsigned int sk_kern_sock : 1;
    unsigned int sk_no_check_tx : 1;
    unsigned int sk_no_check_rx : 1;
    unsigned int sk_userlocks : 4;
    unsigned int sk_protocol : 8;
    unsigned int sk_type : 16;
    u16 sk_gso_max_segs;
    u8 sk_pacing_shift;
    unsigned long sk_lingertime;
    struct proto *sk_prot_creator;
    rwlock_t sk_callback_lock;
    int sk_err;
    int sk_err_soft;
    u32 sk_ack_backlog;
    u32 sk_max_ack_backlog;
    kuid_t sk_uid;
    struct pid *sk_peer_pid;
    const struct cred *sk_peer_cred;
    long sk_rcvtimeo;
    ktime_t sk_stamp;
    u16 sk_tsflags;
    u8 sk_shutdown;
    u32 sk_tskey;
    atomic_t sk_zckey;
    u8 sk_clockid;
    u8 sk_txtime_deadline_mode : 1;
    u8 sk_txtime_report_errors : 1;
    u8 sk_txtime_unused : 6;
    struct socket *sk_socket;
    void *sk_user_data;
    void *sk_security;
    struct sock_cgroup_data sk_cgrp_data;
    struct mem_cgroup *sk_memcg;
    void (*sk_state_change)(struct sock *);
    void (*sk_data_ready)(struct sock *);
    void (*sk_write_space)(struct sock *);
    void (*sk_error_report)(struct sock *);
    int (*sk_backlog_rcv)(struct sock *, struct sk_buff *);
    struct sk_buff *(*sk_validate_xmit_skb)(struct sock *, struct net_device *, struct sk_buff *);
    void (*sk_destruct)(struct sock *);
    struct sock_reuseport *sk_reuseport_cb;
    struct bpf_sk_storage *sk_bpf_storage;
    struct callback_head sk_rcu;
}
SIZE: 760
```

It can be seen that the whole `sock` structure is 760 bytes. Knowing that `sk_wq` is a member of the structure, we can use `struct X.y` to query its offset.

```bash
crash> struct sock.sk_wq
struct sock {
  [280] struct socket_wq *sk_wq;
}
```

Next, query the layout of `struct socket_wq`:

```bash
crash> struct socket_wq
struct socket_wq {
    wait_queue_head_t wait;
    struct fasync_struct *fasync_list;
    unsigned long flags;
    struct callback_head rcu;
}
SIZE: 64
```

This structure is much smaller than the previous one. Its first member is a `wait_queue_head_t`, so the offset is 0, and we can look at `wait_queue_head_t` directly.
```bash
crash> struct wait_queue_head_t
typedef struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;
} wait_queue_head_t;
SIZE: 24
```

This should be the socket's wait queue. According to the article, the next step is to go through `wait_queue_t` to find `poll_wqueues` and then find the `task_struct` from it, but the `wait_queue_head_t` found here is already the end of the chain and does not lead to `poll_wqueues`, so we have to find another way.

:::warning
I have not yet found a way to get from `struct sock` to `struct task_struct`, so I still use `crash` to change the name of the process.
:::

Next, we use the crash tool to help us modify the process name. First, run pixie again:

```
$ ./pixie
```

We can use `crash` to look up the location of this process's `task_struct`, and then the location of `comm` inside the `task_struct`.

```bash
crash> ps | grep pixie
  24127  10259   1  ffff8d7aa975c5c0  IN   0.0    2492   1248  pixie
crash> set 24127
    PID: 24127
COMMAND: "pixie"
   TASK: ffff8d7aa975c5c0  [THREAD_INFO: ffff8d7aa975c5c0]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE
crash> px ((struct task_struct *)0xffff8d7aa975c5c0)->comm
$3 = "pixie\000PoolSingl"
crash> px &((struct task_struct *)0xffff8d7aa975c5c0)->comm
$4 = (char (*)[16]) 0xffff8d7aa975d038
```

Then we can convert this virtual address into a physical address for mapping:

```bash
crash> vtop 0xffff8d7aa975d038
VIRTUAL           PHYSICAL
ffff8d7aa975d038  22975d038

PGD DIRECTORY: ffffffffa280a000
PAGE DIRECTORY: 3b8801067
   PUD: 3b8801f50 => 2191d0063
   PMD: 2191d0a58 => 219038063
   PTE: 219038ae8 => 800000022975d063
  PAGE: 22975d000

      PTE         PHYSICAL   FLAGS
800000022975d063  22975d000  (PRESENT|RW|ACCESSED|DIRTY|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe56c88a5d740 22975d000 dead000000000400        0  0 17ffffc0000000
```

We can see that `0x22975d038` is the physical address of `task_struct->comm`.
```bash
crash> rd -p 22975d038
22975d038:  656f68736e696b73                    pixie
```

However, we cannot map this address directly: `mmap` requires the file offset to be a multiple of the page size, and the low 12 bits here are the offset inside the page. So we clear the low 12 bits, map the page, and then add the in-page offset back.

**hack.c**

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned char *base, *addr;
    unsigned long long off, diff;

    off = strtoll(argv[1], NULL, 16);
    diff = off & 0x000000fff;   // offset inside the page
    off &= 0xffffff000;         // page-aligned offset for mmap
    fd = open("/dev/mem", O_RDWR);
    base = mmap(NULL, 0xffffffff, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
    addr = base + diff;
    printf("program name is: %s\n", addr);
    // Overwrite task_struct->comm with the new name
    strcpy((char *)addr, "skinshoe");
    close(fd);
    munmap(base, 0xffffffff);   // unmap with the page-aligned base address
    return 1;
}
```

After execution:

```bash
$ sudo ./skinshoe 22975d038
program name is: pixie
```

The program prints the old name before overwriting it. Checking again in crash, we can see that the name has changed.

```bash
crash> px ((struct task_struct *)0xffff8d7aa975c5c0)->comm
$8 = "skinshoe\000lSingl"
crash> rd -p 22975d038
22975d038:  656f68736e696b73                    skinshoe
```

We can also query by pid:

```bash
$ cat /proc/24127/comm
skinshoe
```

So far we have successfully modified the name of the process through `/dev/mem`.

## Implement vtop

The [/dev/mem](https://hackmd.io/@sysprog/linux-mem-device#Linux-%E6%A0%B8%E5%BF%83%E7%9A%84-devmem-%E8%A3%9D%E7%BD%AE) article mentions:
> Non-reserved physical memory will be mapped one by one to the virtual address starting from `0xffff880000000000`

So to convert a directly mapped kernel virtual address into the offset to use with `/dev/mem`, we only need to subtract `0xffff880000000000`. The author of that article uses this to accomplish many things, but because the kernel version differs, the mapping on my computer does not start from `0xffff880000000000`. After many comparisons and corrections, I found that the base value is different every time the computer is booted.
This time it starts from `0xffff8d7880000000`, so subtracting this value translates the virtual address without going through `crash`. I speculate that the base is chosen randomly when the mapping is set up at boot. So we can use the virtual address to view the name of the process. Execute the pixie program from the previous section again:

```bash
./pixie
```

This time we can query pixie's `task_struct`:

```bash
crash> ps | grep pixie
crash: current context no longer exists -- restoring "crash" context:

  27857  10259   0  ffff8d7a99105d00  IN   0.0    2492   1252  pixie
crash> set 27857
    PID: 27857
COMMAND: "pixie"
   TASK: ffff8d7a99105d00  [THREAD_INFO: ffff8d7a99105d00]
    CPU: 0
  STATE: TASK_INTERRUPTIBLE
```

So we know the virtual address is `0xffff8d7a99105d00`. We can also query the offset of `task_struct.comm`:

```bash
crash> struct task_struct.comm
struct task_struct {
  [2680] char comm[16];
}
```

So once we have the physical address of the `task_struct`, adding 2680 gives the address of `comm`. The implementation is as follows:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define OFFSET      0xffff8d7880000000ULL /* direct-map base observed on this boot */
#define COMM_OFFSET 2680                  /* offset of comm in task_struct, from crash */
#define PAGE_SZ     4096ULL

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <task_struct virtual address in hex>\n", argv[0]);
        return 1;
    }

    /* Virtual address -> physical address of comm, then split at the page boundary. */
    unsigned long long off = strtoull(argv[1], NULL, 16) - OFFSET + COMM_OFFSET;
    unsigned long long diff = off & (PAGE_SZ - 1);
    off &= ~(PAGE_SZ - 1);

    int fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Two pages cover the 16-byte comm field even if it crosses a page boundary. */
    size_t len = 2 * PAGE_SZ;
    unsigned char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (base == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("program name is: %s\n", (char *)base + diff);

    munmap(base, len);
    close(fd);
    return 0;
}
```

Final execution result:

```bash
$ sudo ./vtop ffff8d7a99105d00
program name is: pixie
```

## Visit the process of the system

Continuing the previous example, we can try to visit the processes after pixie. We first query the offset of `tasks` (a `struct list_head`) in `task_struct`; this is the linked list that connects every `task_struct`.
```bash
crash> struct task_struct.tasks
struct task_struct {
  [1984] struct list_head tasks;
}
```

Knowing this offset, we can walk the processes and print their names. As before, we use pixie as the entry point.

```bash
./pixie
```

Then find the virtual address of this process's `task_struct` as before; I won't demonstrate each step again. Note that in this example the direct mapping starts from `0xffff9fa540000000`. After that, we add and subtract the offsets of `task_struct->tasks` and `task_struct->comm` to visit the subsequent processes and their names. Here I list 12 in total.

**traverse.c**

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define OFFSET       0xffff9fa540000000ULL /* direct-map base observed on this boot */
#define TASKS_OFFSET 1984                  /* offset of tasks in task_struct */
#define COMM_OFFSET  2680                  /* offset of comm in task_struct */
#define PAGE_SZ      4096ULL

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <task_struct virtual address in hex>\n", argv[0]);
        return 1;
    }

    /* Virtual address of the task_struct currently being visited. */
    unsigned long long task = strtoull(argv[1], NULL, 16);

    int fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (int i = 0; i < 12; i++) {
        unsigned long long phys = task - OFFSET;
        unsigned long long diff = phys & (PAGE_SZ - 1);

        /* diff < 4096 and COMM_OFFSET + 16 = 2696, so two pages cover both fields. */
        size_t len = 2 * PAGE_SZ;
        unsigned char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                                   fd, phys - diff);
        if (base == MAP_FAILED) {
            perror("mmap");
            break;
        }

        unsigned char *t = base + diff;
        printf("program name is: %s\n", (char *)(t + COMM_OFFSET));

        /* tasks.next points at the tasks member of the next task_struct. */
        unsigned long long next = *(unsigned long long *)(t + TASKS_OFFSET);
        task = next - TASKS_OFFSET;

        munmap(base, len);
    }

    close(fd);
    return 0;
}
```

Output result:

```bash
$ sudo ./traverse ffff9fa791151740
program name is: pixie
program name is: bash
program name is: kworker/u16:1
program name is: kworker/u16:2
program name is: kworker/0:0
program name is: kworker/1:1
program name is: kworker/u16:3
program name is: kworker/3:2
program name is: cpptools-srv
program name is: kworker/u16:0
program name is: sudo
program name is: traverse
```

It can be seen that the processes after pixie are listed successfully, and the `traverse` at the end is our current process, so we have successfully extended the previous example to visit each process in
the system.

## Access NULL address legally

Next, we will see whether we can legally access the NULL address. In fact, the NULL address is fully accessible as long as a page table entry maps it to a physical memory page. First, look at the description in [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html):

> on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by `/proc/sys/vm/mmap_min_addr`) and attempt to create the mapping there.

It can be seen that addresses below the value in `/proc/sys/vm/mmap_min_addr` are protected and cannot be mapped, so we need to set that value to 0 before we can use the NULL page. So first we change `/proc/sys/vm/mmap_min_addr`:

```bash
$ cat /proc/sys/vm/mmap_min_addr
65536
$ sudo sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
$ cat /proc/sys/vm/mmap_min_addr
0
```

Next we can try to map and use the NULL address:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    size_t i;
    unsigned char *niladdr = NULL;
    unsigned char str[] = "Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!";

    /* MAP_FIXED with addr == NULL asks for the mapping at address 0 exactly. */
    mmap(NULL, 4096, PROT_READ|PROT_WRITE,
         MAP_FIXED|MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    perror("a"); /* prints "a: Success" when mmap did not set errno */

    /* Copy the string, including its terminator, to address 0. */
    for (i = 0; i < sizeof(str); i++) {
        niladdr[i] = str[i];
    }

    /* glibc's printf special-cases a NULL "%s" argument and prints "(null)". */
    printf("using assignment at NULL: %s\n", niladdr);

    /* Reading byte by byte goes through the real mapping at address 0. */
    for (i = 0; i < sizeof(str); i++) {
        printf("%c", *((char *)NULL + i));
    }
    printf("\n");

    getchar(); /* keep the process alive so it can be inspected with crash */
    munmap(0, 4096);
    return 0;
}
```

Output result:

```bash
$ sudo ./access0
a: Success
using assignment at NULL: (null)
Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!
```

Observe through `crash`:

```bash
crash> ps | grep access0
crash: current context no longer exists -- restoring "crash" context:

   8447   8446   1  ffff9ddb907a2e80  IN   0.0    2492   1452  access0
crash> set 8447
    PID: 8447
COMMAND: "access0"
   TASK: ffff9ddb907a2e80  [THREAD_INFO: ffff9ddb907a2e80]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0
VIRTUAL     PHYSICAL
0           3d8725000

   PGD: 2dbf7e000 => 800000038bfd9067
   PUD: 38bfd9000 => 426544067
   PMD: 426544000 => 32f357067
   PTE: 32f357000 => 80000003d8725867
  PAGE: 3d8725000

      PTE         PHYSICAL   FLAGS
80000003d8725867  3d8725000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff9ddc3a17a8f0          0       1000 80000fb dev/zero

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffefce4f61c940 3d8725000 ffff9ddcbea4d598        0  2 17ffffc008001c uptodate,dirty,lru,swapbacked
```

It can be seen that we have successfully mapped the NULL address to physical memory, and we can also observe the values stored at the NULL address:

```bash
crash> rd 0 8
               0:  676e61696a65685a 756f687a6e655720   Zhejiang Wenzhou
              10:  7320656978697020 75796169782c6968    pixie shi,xiayu
              20:  697568736e696a20 7020697568756220    jinshui buhui p
              30:  0000000021676e61 0000000000000000   ang!..........
```

Exactly the same as what we put in! The reason NULL is normally inaccessible is to make invalid addresses easy to distinguish: NULL is a special address that is deliberately left unmapped, but at the MMU (Memory Management Unit) level, NULL is no different from any other address.

## Kernel protect mechanism

**KPTI**

Since sharing one address space between user mode and the kernel may leak kernel data, the Linux kernel introduces KPTI (Kernel Page-Table Isolation) to effectively hide the kernel from user space.
According to the description in [KAISER: hiding the kernel from user space](https://lwn.net/Articles/738975/), KPTI gives user mode a separate shadow page table that maps all user-space data but only the small part of kernel data needed for system calls and interrupts to execute correctly, thereby hiding the kernel. KPTI does not randomize the kernel's position itself; rather, it helps keep attackers in user space from learning where the kernel is located. However, it is still possible that the base address of the kernel is leaked during mode switches.

**ASLR**

ASLR is another memory protection mechanism that places process data at unpredictable random addresses. This makes it harder for attackers to use a stack overflow to jump to a specific location.

Because of the above two mechanisms, some unexpected situations appear when we use crash to observe memory contents, so I turn off KPTI and ASLR and then observe the memory contents again to see if there is any difference.

First we turn off KPTI. [HOW TO DISABLE PAGE-TABLE ISOLATION ON UBUNTU FOR BENCHMARKING](https://www.stevenrombauts.be/2018/02/how-to-disable-page-table-isolation-on-ubuntu-for-benchmarking/) provides examples. Let's first take a look at the status of KPTI in the system:

```bash
$ cat /sys/devices/system/cpu/vulnerabilities/*
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Mitigation: PTI
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Microcode
Not affected
```

Here the fourth line (the `meltdown` entry) reports `Mitigation: PTI`, so KPTI is enabled.
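To check only the KPTI entry instead of reading the output of every file, the `meltdown` file in the same directory can be read on its own. The following is a minimal sketch and not part of the original experiment; the helper names `status_reports_pti` and `read_meltdown_status` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Status file of the Meltdown entry; KPTI is the mitigation it reports. */
#define MELTDOWN_STATUS "/sys/devices/system/cpu/vulnerabilities/meltdown"

/* Returns 1 if a status line such as "Mitigation: PTI" reports that
 * page-table isolation is active, 0 otherwise (for example "Vulnerable"). */
int status_reports_pti(const char *line)
{
    return strstr(line, "Mitigation: PTI") != NULL;
}

/* Reads the first line of the status file into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int read_meltdown_status(char *buf, size_t len)
{
    FILE *fp = fopen(MELTDOWN_STATUS, "r");
    if (!fp)
        return -1;
    char *ok = fgets(buf, (int)len, fp);
    fclose(fp);
    return ok ? 0 : -1;
}
```

A caller would pass a small buffer to `read_meltdown_status` and then hand it to `status_reports_pti`; on the machine above the result is expected to be 1 before KPTI is disabled.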
To turn it off, edit the `GRUB_CMDLINE_LINUX_DEFAULT` parameter in `/etc/default/grub`, like this:

```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pti=off"
```

KPTI is turned off after executing `update-grub` and rebooting:

```bash
$ cat /sys/devices/system/cpu/vulnerabilities/*
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Vulnerable
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Microcode
Not affected
```

Next we turn off ASLR. From [How ASLR protects Linux systems from buffer overflow attacks](https://www.networkworld.com/article/3331199/what-does-aslr-do-for-linux.html) we learn that the state of ASLR can be read from `/proc/sys/kernel/randomize_va_space`.

```bash
$ cat /proc/sys/kernel/randomize_va_space
2
$ sysctl -a --pattern randomize
kernel.randomize_va_space = 2
```

Here 2 means full randomization. The article mentions an interesting way to test ASLR: run `ldd` several times and check whether the listed addresses differ each time.
```bash
$ ldd /bin/bash
        linux-vdso.so.1 (0x00007ffcf11fe000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f888ba57000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f888ba51000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f888b85f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f888bbd1000)
$ ldd /bin/bash
        linux-vdso.so.1 (0x00007ffcbc7c6000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fa4001c6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa4001c0000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa3fffce000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa400340000)
```

Next, we turn off ASLR and observe with `ldd` again:

```bash
$ sudo sysctl -w kernel.randomize_va_space=0
kernel.randomize_va_space = 0
$ ldd /bin/bash
        linux-vdso.so.1 (0x00007ffff7fce000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007ffff7e51000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff7e4b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7c59000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fcf000)
$ ldd /bin/bash
        linux-vdso.so.1 (0x00007ffff7fce000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007ffff7e51000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff7e4b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7c59000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fcf000)
```

The addresses are now identical between runs, so ASLR has been disabled successfully. Then we repeat the previous experiments to see if there is any difference. Starting with the page table offsets, re-execute `test` to get the index into the page table of each level.

```bash
$ sudo ./test
addr = 0x7fe93c5e6000
the address is 0x7fe93c5e6000, and the value is 1
the address is 0x7fe93c5e6004, and the value is 9
PGD index = 0xff
PUD index = 0x1a4
PMD index = 0x1e2
PTE index = 0x1e6
```

We can test whether the offset is as we expected, or whether it is still the index shifted left by three bits (that is, multiplied by the 8-byte entry size).
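The expected relationship between an index and the byte offset inside each page table can be written down explicitly. The sketch below assumes x86-64 4-level paging with 4 KiB pages, 9 index bits per level, and 8-byte entries; the function names are illustrative and this is not the source of the `test` program.

```c
#include <stdint.h>
#include <stdio.h>

/* Level numbering used in this sketch: 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE. */

/* Index into the page table of the given level: 9 bits above the
 * 12-bit page offset, one 9-bit group per level. */
unsigned pt_index(uint64_t vaddr, int level)
{
    return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1ff);
}

/* Byte offset of the entry inside its table: each entry is 8 bytes,
 * so the offset is the index shifted left by three bits. */
unsigned pt_entry_offset(uint64_t vaddr, int level)
{
    return pt_index(vaddr, level) << 3;
}

/* Prints index and entry offset for all four levels, PGD first. */
void print_walk(uint64_t vaddr)
{
    static const char *name[] = { "PTE", "PMD", "PUD", "PGD" };
    for (int level = 3; level >= 0; level--)
        printf("%s index = 0x%x, entry offset = 0x%x\n",
               name[level], pt_index(vaddr, level), pt_entry_offset(vaddr, level));
}
```

For the address `0x7fe93c5e6000` above, this gives the indices `0xff`, `0x1a4`, `0x1e2` and `0x1e6`, with entry offsets `0x7f8`, `0xd20`, `0xf10` and `0xf30`. The offsets are what should appear in the low 12 bits of the PGD, PUD, PMD and PTE entry addresses that `vtop` reports.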
```bash
crash> ps | grep test
crash: current context no longer exists -- restoring "crash" context:

   4310   4309   2  ffff8e83da5edd00  IN   0.0    2492   1424  test
crash> set 4310
    PID: 4310
COMMAND: "test"
   TASK: ffff8e83da5edd00  [THREAD_INFO: ffff8e83da5edd00]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x7ffff7ffb000
VIRTUAL     PHYSICAL
7ffff7ffb000  440000000

   PGD: 2e98b67f8 => 2f89cb067
   PUD: 2f89cbff8 => 42498a067
   PMD: 42498adf8 => 403e32067
   PTE: 403e32fd8 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff8e82f2e57930 7ffff7ffb000 7ffff7ffc000 d0444bb /dev/mem
```

According to the result displayed by crash, the offsets behave the same as before: each one is the index shifted left by three bits. One thing worth noting is that since ASLR is turned off, the virtual address does not change no matter how many times the program is executed:

```bash
$ sudo ./test
addr = 0x7ffff7ffb000
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb
$ sudo ./test
addr = 0x7ffff7ffb000
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb
$ sudo ./test
addr = 0x7ffff7ffb000
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb
$ sudo ./test
addr = 0x7ffff7ffb000
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb
```

Next we can observe whether the non-reserved memory has a fixed mapping base address. This time we execute the pixie program repeatedly to see how its address changes.
```bash
crash> ps | grep pixie
   2598   2357   3  ffff979b7e2e8000  IN   0.0    2492   1232  pixie
crash> set 2598
    PID: 2598
COMMAND: "pixie"
   TASK: ffff979b7e2e8000  [THREAD_INFO: ffff979b7e2e8000]
    CPU: 3
  STATE: TASK_INTERRUPTIBLE
crash> px 0xffff979b7e2e8000
$1 = 0xffff979b7e2e8000
crash> vtop 0xffff979b7e2e8000
VIRTUAL           PHYSICAL
ffff979b7e2e8000  3fe2e8000

PGD DIRECTORY: ffffffffa040a000
PAGE DIRECTORY: 244401067
   PUD: 244401368 => 3ffa28063
   PMD: 3ffa28f88 => 3fe38b063
   PTE: 3fe38b740 => 80000003fe2e8163
  PAGE: 3fe2e8000

      PTE         PHYSICAL   FLAGS
80000003fe2e8163  3fe2e8000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe5d44ff8ba00 3fe2e8000 ffff979ba281a840 ffff979b7e2e9740  1 17ffffc0010200 slab,head
crash> px 0xffff979b7e2e8000-0x3fe2e8000
$2 = 0xffff979780000000
```

We can see that the base address is `0xffff979780000000` on the first run, so let's reboot to see if it changes. Reboot and execute pixie:

```bash
crash> ps | grep pixie
   2666   2437   2  ffff9633e7492e80  IN   0.0    2492   1232  pixie
crash> set 2666
    PID: 2666
COMMAND: "pixie"
   TASK: ffff9633e7492e80  [THREAD_INFO: ffff9633e7492e80]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0xffff9633e7492e80
VIRTUAL           PHYSICAL
ffff9633e7492e80  427492e80

PGD DIRECTORY: ffffffff9ce0a000
PAGE DIRECTORY: 103401067
   PUD: 103401678 => 42e3aa063
   PMD: 42e3aa9d0 => 4273be063
   PTE: 4273be490 => 8000000427492163
  PAGE: 427492000

      PTE         PHYSICAL   FLAGS
8000000427492163  427492000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe865d09d2480 427492000 dead000000000400        0  0 17ffffc0000000
crash> px 0xffff9633e7492e80-0x427492e80
$1 = 0xffff962fc0000000
```

:::warning
:bell: After rebooting, the base address is still different, so the previous assumption (that KPTI or ASLR causes the change) was wrong.
:::

Next, let's look at `Spectre`, which the teacher mentioned. `Spectre` basically exploits branch prediction and speculative execution on modern CPUs to bypass access control and obtain privileged data; it does not modify memory.
This time, let's try to turn off Linux's defense mechanisms against `Spectre`. According to [Spectre Side Channels](https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html), we can turn off the `Spectre` mitigations through the kernel parameters in the grub configuration file.

```bash
GRUB_CMDLINE_LINUX_DEFAULT="nospectre_v1 nospectre_v2 nopti quiet splash"
```

Then restart the machine and use crash to test it.

```bash
crash> ps | grep pixie
   3234   2949   6  ffff948c0457c5c0  IN   0.0    2492   1232  pixie
crash> set 3234
    PID: 3234
COMMAND: "pixie"
   TASK: ffff948c0457c5c0  [THREAD_INFO: ffff948c0457c5c0]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0xffff948c0457c5c0
VIRTUAL           PHYSICAL
ffff948c0457c5c0  34457c5c0

PGD DIRECTORY: ffffffffa6e0a000
PAGE DIRECTORY: 268201067
   PUD: 268201180 => 80000003400001e3
   PMD: 340000110 => 0

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff03a0d115f00 34457c000 dead000000000400        0  0 17ffffc0000000
crash> px 0xffff948c0457c5c0-0x34457c5c0
$1 = 0xffff9488c0000000
```

The base address is `0xffff9488c0000000`. Then we reboot again.
```bash
crash> ps | grep pixie
   3249   3022   6  ffff998ac5bd0000  IN   0.0    2492   1232  pixie
crash> set 3249
    PID: 3249
COMMAND: "pixie"
   TASK: ffff998ac5bd0000  [THREAD_INFO: ffff998ac5bd0000]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0x ffff998ac5bd0000
VIRTUAL     PHYSICAL
0           (not accessible)

VIRTUAL           PHYSICAL
ffff998ac5bd0000  305bd0000

PGD DIRECTORY: ffffffffb620a000
PAGE DIRECTORY: 344e01067
   PUD: 344e01158 => 80000003000001e3
   PMD: 300000168 => 280000000a

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff1ed0c16f400 305bd0000 ffff998bb153e840 ffff998ac5bd1740  1 17ffffc0010200 slab,head
crash> vtop 0xffff998ac5bd0000
VIRTUAL           PHYSICAL
ffff998ac5bd0000  305bd0000

PGD DIRECTORY: ffffffffb620a000
PAGE DIRECTORY: 344e01067
   PUD: 344e01158 => 80000003000001e3
   PMD: 300000168 => 280000000a

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff1ed0c16f400 305bd0000 ffff998bb153e840 ffff998ac5bd1740  1 17ffffc0010200 slab,head
crash> px 0xffff998ac5bd0000-0x305bd0000
$1 = 0xffff9987c0000000
```

:::warning
:bell: The base address still differs between the two boots, so the change must be caused by some other defense mechanism.
:::

Following the teacher's hint, I turn off KASLR to see whether we get the expected result. To turn it off, add `nokaslr` to `GRUB_CMDLINE_LINUX_DEFAULT` in the grub configuration.
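Before drawing conclusions from the next boot, it can help to confirm that the parameter actually reached the kernel, since the command line of the running kernel is exposed in `/proc/cmdline`. The following is a minimal sketch of such a check and was not part of the original experiment; `cmdline_has_param` and `booted_with_nokaslr` are hypothetical helpers, and only whole space-separated tokens are matched.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `param` appears in `cmdline` as a whole token
 * delimited by spaces, a newline, or the ends of the string. */
int cmdline_has_param(const char *cmdline, const char *param)
{
    size_t n = strlen(param);
    const char *p = cmdline;

    if (n == 0)
        return 0;
    while ((p = strstr(p, param)) != NULL) {
        int starts = (p == cmdline) || p[-1] == ' ';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p++; /* keep searching after a partial match */
    }
    return 0;
}

/* Reads /proc/cmdline and reports whether `nokaslr` was passed at boot.
 * Returns 1 or 0, or -1 if the file cannot be read. */
int booted_with_nokaslr(void)
{
    char buf[4096];
    FILE *fp = fopen("/proc/cmdline", "r");
    if (!fp)
        return -1;
    char *ok = fgets(buf, sizeof(buf), fp);
    fclose(fp);
    return ok ? cmdline_has_param(buf, "nokaslr") : -1;
}
```

Matching whole tokens avoids a false positive when another parameter merely contains the same substring.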
After restarting, let's observe the results:

```bash
crash> ps | grep pixie
   3890   3235   1  ffff8883cd758000  IN   0.0    2492   1224  pixie
crash> set 3890\
set: invalid task or pid value: 3890\
crash> set 3890
    PID: 3890
COMMAND: "pixie"
   TASK: ffff8883cd758000  [THREAD_INFO: ffff8883cd758000]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE
crash> vtop 0xffff8883cd758000
VIRTUAL           PHYSICAL
ffff8883cd758000  3cd758000

PGD DIRECTORY: ffffffff8260a000
PAGE DIRECTORY: 3001067
   PUD: 3001078 => 3ffdd1063
   PMD: 3ffdd1358 => 3cd6f9063
   PTE: 3cd6f9ac0 => 80000003cd758163
  PAGE: 3cd758000

      PTE         PHYSICAL   FLAGS
80000003cd758163  3cd758000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000f35d600 3cd758000 ffff888404f38540 ffff8883cd759740  1 17ffffc0010200 slab,head
crash> px 0xffff8883cd758000-0x3cd758000
$1 = 0xffff888000000000
```

This time the result is the base address we expected, `0xffff888000000000`!

:::success
Therefore, we can infer that KASLR is the main mechanism that changes the base address of the kernel's direct mapping at boot time.
:::
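The arithmetic used throughout these experiments reduces to one subtraction plus a page split for `mmap`. The sketch below restates it in one place; the function names are made up for illustration, a 4 KiB page size is assumed, and the translation only applies to addresses inside the kernel's direct mapping.

```c
#include <stdint.h>

#define PAGE_MASK_4K 0xfffULL /* low 12 bits: offset within a 4 KiB page */

/* Direct-map base recovered from one known virtual/physical pair,
 * which is what `px vaddr - paddr` computes in crash. */
uint64_t direct_map_base(uint64_t vaddr, uint64_t paddr)
{
    return vaddr - paddr;
}

/* Physical address of a direct-mapped kernel virtual address. */
uint64_t direct_map_to_phys(uint64_t vaddr, uint64_t base)
{
    return vaddr - base;
}

/* Page-aligned part of a physical address, usable as an mmap offset. */
uint64_t page_base(uint64_t paddr)
{
    return paddr & ~PAGE_MASK_4K;
}

/* Offset inside the page, to be added to the pointer mmap returns. */
uint64_t page_offset(uint64_t paddr)
{
    return paddr & PAGE_MASK_4K;
}
```

With `nokaslr`, `direct_map_to_phys(0xffff8883cd758000, 0xffff888000000000)` yields `0x3cd758000`, matching the `vtop` output above. Without `nokaslr`, the base has to be recovered on every boot, for example with `direct_map_base` from a single `vtop` result.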