/dev/mem research

contributed by < jhan1998 >

Environment Settings

Operating Environment Information

OS: Ubuntu 20.04.2 LTS
Kernel Version: 5.4.0-72-generic
Memory: 15 G
CPU: Intel® Core™ i7-4770HQ CPU @ 2.20GHz × 8

First, we can keep part of physical memory out of the kernel's memory management by modifying the boot parameters in /etc/default/grub.

The method described in the /dev/mem article is to add mem=14G to GRUB_CMDLINE_LINUX_DEFAULT="" and then run sudo update-grub. After a reboot, 15G - 14G = 1G of memory should be reserved.

Before Setting

$ free
              total        used        free      shared  buff/cache   available
Mem:       16270944     2263540    11458908     1256396     2548496    12436764
Swap:       1999868           0     1999868

$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998: 
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-47f5fffff : System RAM
47f600000-47fffffff : RAM buffer

After Setting

$ free
              total        used        free      shared  buff/cache   available
Mem:       12152416     2019588     8262100      683884     1870728     9161272
Swap:       1999868           0     1999868

$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998: 
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-37fffffff : System RAM

Comparing the two outputs, the memory visible to the kernel has indeed shrunk, because reserved memory is not counted in any kernel statistics. However, far more than 1 GB is missing.

The physical address map gives a clue. Before mem=14G is set, RAM (System RAM plus the trailing RAM buffer) ends at 0x47fffffff; after mem=14G is set, it ends at 0x37fffffff.
A simple conversion gives:
$0x47fffffff = 18G - 1, \quad 0x37fffffff = 14G - 1$
So mem= does not set the amount of RAM; it sets the highest physical address the kernel will use. Without the parameter the map extends up to 18G, because the physical address space also contains ranges that are not System RAM. With mem=14G, everything above the 14G address is cut off, so the usable memory is about 15G - (18G - 14G) = 11G, which is what free reports. To reserve exactly the last 1G of memory, we should set mem=17G instead.

After Setting

$ free
              total        used        free      shared  buff/cache   available
Mem:       15248992      607292    13567272      405288     1074428    13958960
Swap:       1999868           0     1999868

$ free -g
              total        used        free      shared  buff/cache   available
Mem:             14           0          12           0           1          13
Swap:             1           0           1

$ sudo cat /proc/iomem | grep RAM
[sudo] password for jhan1998: 
00001000-00057fff : System RAM
00059000-0009ffff : System RAM
00100000-6650b80f : System RAM
6650b810-6650bcd2 : System RAM
6650bcd3-78d00fff : System RAM
78d49000-78d5cfff : System RAM
78d8f000-78e39fff : System RAM
78e8f000-78ed3fff : System RAM
78eff000-78f84fff : System RAM
78fdf000-78ffffff : System RAM
100000000-43fffffff : System RAM

Here 0x43fffffff = 17G - 1, so the range from 0x440000000 to 0x47fffffff is our reserved region.

Use crash to map the memory reserved by the system

Getting crash to run took me quite some time. crash needs the kernel's debug symbols. On Ubuntu, follow the "Getting -dbgsym.ddeb packages" instructions on the Debug symbol packages page to set up /etc/apt/sources.list.d/ddebs.list, and then follow [Getting Kernel Symbols/Sources on Ubuntu Linux](https://sysprogs.com/VisualKernel/tutorials/setup/ubuntu/) to download the debug symbols for the matching kernel version.

After downloading, we can try running crash:

 sudo crash /home/jhan1998/modules/boot/vmlinux-5.4.0-73-generic /dev/mem

crash 7.2.8
...
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

We can now compare reading from non-reserved memory with reading from the reserved memory.

crash> rd -p 0x43ffffff1
        43ffffff1: 5000000030000000...0...P
crash> rd -p 0x440000000
rd: seek error: physical address: 440000000 type: "64-bit PHYSADDR"

The memory from 0x440000000 onward is reserved and the kernel has no page-table mapping for it, so crash cannot read it, whereas the content at 0x43ffffff1 can be read.

Next we can use mmap to use the reserved space:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned char *addr;
    int fd;
    fd = open("/dev/mem",O_RDWR);

    if (fd < 0){
        printf("device file open error !\n");
        return 0;
    }

    addr = mmap(0,4096,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0x440000000);
    if (addr == MAP_FAILED){
        perror("mmap");
        close(fd);
        return 1;
    }
    printf("addr = %p \n", addr);

    *(volatile unsigned int *)(addr + 0x00) = 0x1; // physical 0x440000000: set its value to 1
    *(volatile unsigned int *)(addr + 0x04) = 0x9; // physical 0x440000004: set its value to 9
    printf("the address is %p, and the value is %d\n", addr + 0x00, *(addr + 0x00));
    printf("the address is %p, and the value is %d\n", addr + 0x04, *(addr + 0x04));
    
    system("read -p 'Press Enter to continue...' var");

    munmap(addr,4096);
    close(fd);
    return 0;
}

The output shows the virtual address of the mapping.

$ sudo ./test
addr = 0x7f9828a97000
the address is 0x7f9828a97000, and the value is 1
the address is 0x7f9828a97004, and the value is 9
Press Enter to continue...

Next, we use crash to look up the corresponding physical address.

crash> vtop 0x7f9828a97000
VIRTUAL PHYSICAL
7f9828a97000 (not accessible)

At first the physical address cannot be looked up, because crash is not yet in the context of our process and therefore has no page table for this address.

We can use the set command to switch to the context of the process, after which the corresponding physical address can be looked up.

crash> ps | grep test
  21073  21072   2  ffff96885eec8000  IN   0.0    2492   1436  test
crash> set 21073
    PID: 21073
COMMAND: "test"
   TASK: ffff96885eec8000  [THREAD_INFO: ffff96885eec8000]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x7f9828a97000
VIRTUAL     PHYSICAL        
7f9828a97000  440000000       

   PGD: 1e2da07f8 => 80000003cb2a4067
   PUD: 3cb2a4300 => 27b99e067
   PMD: 27b99ea28 => 3bcd98067
   PTE: 3bcd984b8 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff9687bb915450 7f9828a97000 7f9828a98000 d0444bb /dev/mem

It can be seen that 0x440000000 is the physical address we reserved.

Then run rd again, this time on the user virtual address, to try to print its content. It still fails:

crash> rd 0x7f9828a97000
rd: seek error: user virtual address: 7f9828a97000 type: "64-bit UVADDR"

Based on the error message, my guess is that crash is a kernel analysis tool, while the reserved memory is mapped into user space and belongs to the address space of the process, so crash cannot read it.

Use crash to observe Five-level page tables

According to the article Five-level page tables, the Linux memory-management code on x86_64 describes the translation from a virtual address to a physical address with up to five levels of page tables (PGD, P4D, PUD, PMD, PTE). On this machine only 48 bits of the virtual address are used, so the P4D level is folded into the PGD and the walk observed below effectively has four levels.

Note that only 48 of the 64 bits of a virtual address take part in the translation; the top 16 bits are not used for indexing, because 48 bits already cover 256 TB. The translation starts at the page global directory (PGD), whose location is stored in the mm_struct referenced by task_struct. The top 9 bits (bits 39-47) of the virtual address select the PGD entry, which gives the location of the page upper directory (PUD). Bits 30-38 select the PUD entry, which gives the location of the page middle directory (PMD). Bits 21-29 select the PMD entry, which gives the location of the page table, and bits 12-20 select the page table entry (PTE). Finally, the lowest 12 bits are the offset within the page, which completes the translation.

Then we can execute the following program and use crash to observe the mechanism of Five-level page tables.

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned char *addr;
    int fd;
    fd = open("/dev/mem",O_RDWR);

    if (fd < 0){
        printf("device file open error !\n");
        return 0;
    }

    addr = mmap(0,4096,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0x440000000);
    if (addr == MAP_FAILED){
        perror("mmap");
        close(fd);
        return 1;
    }
    printf("addr = %p \n", addr);

    *(volatile unsigned int *)(addr + 0x00) = 0x1; // physical 0x440000000: set its value to 1
    *(volatile unsigned int *)(addr + 0x04) = 0x9; // physical 0x440000004: set its value to 9
    printf("the address is %p, and the value is %d\n", addr + 0x00, *(addr + 0x00));
    printf("the address is %p, and the value is %d\n", addr + 0x04, *(addr + 0x04));
    printf("PGD index = 0x%llx\n", ((unsigned long long int)addr >> 39) & 0x1ff);
    printf("PUD index = 0x%llx\n", ((unsigned long long int)addr >> 30) & 0x1ff);
    printf("PMD index = 0x%llx\n", ((unsigned long long int)addr >> 21) & 0x1ff);
    printf("PTE index = 0x%llx\n", ((unsigned long long int)addr >> 12) & 0x1ff);
    pause();

    munmap(addr,4096);
    close(fd);
    return 0;
}

The resulting output is:

$ sudo ./test
addr = 0x7eff0939a000
the address is 0x7eff0939a000, and the value is 1
the address is 0x7eff0939a004, and the value is 9
PGD index = 0xfd
PUD index = 0x1fc
PMD index = 0x49
PTE index = 0x19a

Then use crash to observe:

Find out where the PGD is first

crash> ps | grep test
   9298   9296   6  ffff889ffe01dd00  IN   0.0    2492   1540  test
crash> set 9298
    PID: 9298
COMMAND: "test"
   TASK: ffff889ffe01dd00  [THREAD_INFO: ffff889ffe01dd00]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE 
crash> px ((struct task_struct*)0xffff889ffe01dd00)->mm->pgd
$1 = (pgd_t *) 0xffff889e7894a000

The value returned, $1 = (pgd_t *) 0xffff889e7894a000, is a virtual address, so we also need vtop to convert it to a physical address.

crash> vtop 0xffff889e7894a000
VIRTUAL           PHYSICAL        
ffff889e7894a000  27894a000       

PGD DIRECTORY: ffffffffb3c0a000
PAGE DIRECTORY: 38b801067
   PUD: 38b8013c8 => 278919063
   PMD: 278919e20 => 2789ee063
   PTE: 2789eea50 => 800000027894a063
  PAGE: 27894a000

      PTE         PHYSICAL   FLAGS
800000027894a063  27894a000  (PRESENT|RW|ACCESSED|DIRTY|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffce5b49e25280 27894a000                0 ffff88a02a767b40  1 17ffffc0000000

So the PGD starts at physical address 0x27894a000. Next, run vtop on the original virtual address to check whether the indexes computed above are correct.

crash> vtop 0x7eff0939a000
VIRTUAL     PHYSICAL        
7eff0939a000  440000000       

   PGD: 27894a7e8 => 8000000321be3067
   PUD: 321be3fe0 => 321be5067
   PMD: 321be5248 => 3c8c0a067
   PTE: 3c8c0acd0 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889f1919d2b0 7eff0939a000 7eff0939b000 d0444bb /dev/mem

Here the PGD offset does not match the index I calculated: the entry address ends in 0x7e8, which is my index shifted left by three bits (0x7e8 = 0xfd << 3).

There are definitions of pgd_index and pgd_offset in linux/include/linux/pgtable.h.

#ifndef pgd_index
/* Must be a compile-time constant, so implement it as a macro */
#define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
#endif

static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
{
	return (pgd + pgd_index(address));
};

/*
 * a shortcut to get a pgd_t in a given mm
 */
#ifndef pgd_offset
#define pgd_offset(mm, address)		pgd_offset_pgd((mm)->pgd, (address))
#endif

linux/arch/x86/include/asm/pgtable_64.h

#define PGDIR_SHIFT    39
#define PTRS_PER_PGD   512

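These definitions explain the shift: pgd_offset_pgd() adds the index to a pgd_t pointer, and C pointer arithmetic scales the index by the size of the element. A pgd_t is 8 bytes on x86_64, so the byte offset of the entry is index * 8, that is, index << 3.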

Back in crash, we can read the PGD entry at the address obtained above to get the location of the PUD.

crash> rd -p 27894a7e8
        27894a7e8: 8000000321be3067 g0.!....

Ignoring the high bits (bit 63 is the NX flag), the entry is 0x321be3067. The lowest 12 bits are flag bits, so the PUD starts at 0x321be3000. Adding the shifted index to it and reading that address gives the location of the PMD.

0x1fc << 3 = 0xfe0

crash> rd -p 0x321be3fe0
        321be3fe0: 0000000321be5067 gP.!....

Repeating the same step with the PMD index gives the location of the page table.

0x49 << 3 = 0x248

crash> rd -p 0x321be5248
       321be5248:  00000003c8c0a067                    g.......

One more step with the PTE index gives the page we want.

0x19a << 3 = 0xcd0

crash> rd -p 0x3c8c0acd0
        3c8c0acd0: 8000000440000267 g..@....

Clearing the flag bits of the PTE gives the page frame 0x440000000. Adding the lowest 12 bits of the virtual address (0 here) gives our physical address 0x440000000.

Verify with vtop:

crash> vtop 0x7eff0939a000
VIRTUAL     PHYSICAL        
7eff0939a000  440000000       

   PGD: 27894a7e8 => 8000000321be3067
   PUD: 321be3fe0 => 321be5067
   PMD: 321be5248 => 3c8c0a067
   PTE: 3c8c0acd0 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889f1919d2b0 7eff0939a000 7eff0939b000 d0444bb /dev/mem

Why does the index have to be shifted left by three bits before it is added to the start of the table? Each table entry is 64 bits (8 bytes) wide, so entry number i lies i * 8 = i << 3 bytes from the start of the table. It is not related to big-endian or little-endian byte order.

Page exchange between Processes

The requirement is as follows:
We don't want process_A and process_B to share any pages, which means they can never operate on the same data at the same time.
But occasionally we do want process_A and process_B to exchange information, without using the inefficient traditional inter-process communication mechanisms.

Now that we understand how the page tables work, we can use crash to modify the page tables that map the reserved memory and swap pages between two processes. Unlike shared memory, where one piece of memory is visible to both processes, here the page-table entries are modified by hand through /dev/mem so that the two processes exchange their pages.

master.c

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);

    // Create a page P1 mapped to the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd,0x440000000);

    // Modify the contents of P1
    *addr = 0x1122334455667788;
    
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);

    // Wait for the page swap
    getchar();
    
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);

    close(fd);
    munmap(addr, 4096);
    return 1;
}

slave.c

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);

    // Create a page P2 mapped to the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x440004000);
    
    // Modify the contents of P2
    *addr = 0x8877665544332211;

    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);  
    // Wait for the page swap
    getchar();
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);

    close(fd);
    munmap(addr, 4096);
    return 1;
}

After execution, you can see the addresses and values of the two processes:

master

$ sudo ./master
address at: 0x7f9822c40000   content is: 0x1122334455667788

slave

$ sudo ./slave
address at: 0x7f9002cd1000   content is: 0x8877665544332211

To modify /dev/mem with crash, the environment has to be prepared first.

Before crash can write to /dev/mem, we need to use systemtap to hook devmem_is_allowed so that its return value is always 1. After that, memory can be modified directly.
Steps:
Install systemtap
Run stap -g -e 'probe kernel.function("devmem_is_allowed").return { $return = 1 }'
Then start crash again

Use crash to modify the page of master

crash> ps | grep master
  27867  27866   4  ffff889e253945c0  IN   0.0    2492   1332  master
crash> set 27867
    PID: 27867
COMMAND: "master"
   TASK: ffff889e253945c0  [THREAD_INFO: ffff889e253945c0]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x7f9822c40000
VIRTUAL     PHYSICAL        
7f9822c40000  440000000       

   PGD: 1ee8e27f8 => 800000020514a067
   PUD: 20514a300 => 1f0d80067
   PMD: 1f0d808b0 => 391d34067
   PTE: 391d34200 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889dfa6cc0d0 7f9822c40000 7f9822c41000 d0444bb /dev/mem

crash> wr -64 -p 0x391d34200 0x8000000440004267

Use crash to modify slave's page

crash> ps | grep slave
  27869  27868   4  ffff889e2ab21740  IN   0.0    2492   1384  slave
crash> set 27869
    PID: 27869
COMMAND: "slave"
   TASK: ffff889e2ab21740  [THREAD_INFO: ffff889e2ab21740]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x7f9002cd1000
VIRTUAL     PHYSICAL        
7f9002cd1000  440004000       

   PGD: 1f0aea7f8 => 80000002539fc067
   PUD: 2539fc200 => 20a3a8067
   PMD: 20a3a80b0 => 205689067
   PTE: 205689688 => 8000000440004267
  PAGE: 440004000

      PTE         PHYSICAL   FLAGS
8000000440004267  440004000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889ecc833d40 7f9002cd1000 7f9002cd2000 d0444bb /dev/mem

crash> wr -64 -p 0x205689688 0x8000000440000267

Then let the program continue to execute and you can see the information exchange between the two processes!

master

$ sudo ./master
address at: 0x7f9822c40000   content is: 0x1122334455667788

address at: 0x7f9822c40000   content is: 0x8877665544332211

slave

$ sudo ./slave
address at: 0x7f9002cd1000   content is: 0x8877665544332211

address at: 0x7f9002cd1000   content is: 0x1122334455667788

This technique is well suited to designing inter-process communication for a microkernel. Combined with the cache coherence protocol, it can be very efficient.

Securely tamper with the memory of the process

This time we modify the page contents through /dev/mem without using crash to do the writing. Instead, one process safely tampers with the memory of another process.

First, let's map an arbitrary piece of memory:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

int main(int argc, char **argv)
{
    unsigned char *addr;

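    // Anonymously map a region of memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);

    // Modify the contents
    strcpy((char *)addr, "浙江溫州皮鞋濕");

    // This is only an example, so the address is printed directly; in practice the memory location has to be found by hand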
    printf("address at: %p   content is: %s\n", addr, addr);
    getchar();
    printf("address at: %p   content is: %s\n", addr, addr);

    munmap(addr, 4096);
    return 1;
}

The output at this time is

$ ./change
address at: 0x7f4a035f1000 content is: 浙江溫州皮鞋濕

Then we can use crash to query the physical memory location, and then modify it through other programs.

First use crash to find the process and physical address.

crash> ps | grep change
crash: current context no longer exists -- restoring "crash" context:

  36261   9027   4  ffff889e344d0000  IN   0.0    2496   1392  change
crash> set 36261
    PID: 36261
COMMAND: "change"
   TASK: ffff889e344d0000  [THREAD_INFO: ffff889e344d0000]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x7f4a035f1000
VIRTUAL     PHYSICAL        
7f4a035f1000  18a6c4000       

   PGD: 3b084a7f0 => 80000003b1171067
   PUD: 3b1171940 => 1ee97f067
   PMD: 1ee97f0d0 => 1f0b21067
   PTE: 1f0b21f88 => 800000018a6c4867
  PAGE: 18a6c4000

      PTE         PHYSICAL   FLAGS
800000018a6c4867  18a6c4000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff889feeb425b0 7f4a035f1000 7f4a035f2000 80000fb dev/zero

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffce5b4629b100 18a6c4000 ffff889e2aa0c4b8        0  2 17ffffc008001c uptodate,dirty,lru,swapbacked

The physical address is 0x18a6c4000.

So we write a program that maps the memory at this offset and changes its content to 下雨進水不會胖.

hack.c

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned char *addr;
    unsigned long long off;

    off = strtoll(argv[1], NULL, 16);
    fd = open("/dev/mem", O_RDWR);

    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);

    strcpy(addr, "下雨進水不會胖");
    close(fd);

    munmap(addr, 4096);

    return 1;
}

Before running it, remember to hook devmem_is_allowed first; otherwise there will be a segmentation fault.

sudo ./hack 0x18a6c4000

Go back and look at the results, and you can find that we have successfully modified the value stored at 0x18a6c4000.

$ ./change
address at: 0x7f4a035f1000   content is: 浙江溫州皮鞋濕

address at: 0x7f4a035f1000   content is: 下雨進水不會胖

Change the name of the process by changing /dev/mem

This time we try not to use crash at all and rely only on hacking /dev/mem to change a process name.

This is meaningful for production Internet services.
On managed machines in particular, debugging tools such as crash and gdb are generally not allowed, in order to prevent information leaks. The systemtap API is restricted, so it is considered relatively safe, and kernel modules are generally not banned either.
But having systemtap and /dev/mem is enough!

Take a look at the following program, we will do a simple experiment:

Modify the name of the process being executed

#include <stdio.h>

int main(int argc, char **argv)
{
	getchar();
}

gcc -o pixie pixie.c && ./pixie

Now we have to find a way to change the name of the process from pixie to skinshoe.
With no crash and no gdb, only a readable and writable /dev/mem (assuming we have hooked devmem_is_allowed), how can this be done?

All kernel data structures can be found in /dev/mem, so we need to locate the task_struct of the pixie process and then change its comm field.

With crash this is very easy: once the task_struct of the process has been found, the location of comm within it follows directly.

crash> set 63972
    PID: 63972
COMMAND: "pixie"
   TASK: ffff9c3832572e80  [THREAD_INFO: ffff9c3832572e80]
    CPU: 4
  STATE: TASK_INTERRUPTIBLE 
crash> px ((struct task_struct*)0xffff9c3832572e80)->comm
$1 = "pixie\000PoolSingl"
crash> px &(((struct task_struct*)0xffff9c3832572e80)->comm)
$2 = (char (*)[16]) 0xffff9c38325738f8

But what if you can't use crash or gdb now?

We know that /dev/mem exposes the physical address space, while everything the operating system operates on is addressed by virtual addresses. Establishing the relationship between the two is the key.

We note three facts:

x86_64 can directly map 64TiB of physical memory, which is enough to map any common amount of physical memory one to one.
The Linux kernel creates a one-to-one mapping of all physical memory, with a fixed offset between physical address and virtual address.
The kernel's data structures form a network of interlinked structures, so one structure can be followed to the next.

This means that given the virtual address of any one kernel data structure, we can find its physical address, and then follow the links until we reach the task_struct of our pixie process.

On a Linux system, addresses of kernel data structures can be found in many places:

/proc/kallsyms
/boot/System.map
the result of lsof

The article gives one example, finding init_task in /proc/kallsyms:

$ sudo cat /proc/kallsyms | grep init_task
ffffffff953a90c0 T ftrace_graph_init_task
ffffffff95406390 T perf_event_init_task
ffffffff96685cec r __ksymtab_init_task
ffffffff966aaa19 r __kstrtab_init_task
ffffffff96800000 D __start_init_task
ffffffff96804000 D __end_init_task
ffffffff96813780 D init_task
ffffffff96f48c98 b ext4_lazyinit_task

Then work out how the address of init_task maps to physical memory, start from init_task and walk the task list of the whole system until we find our target pixie process, and then make the change.

But this is not a modification made through /dev/mem, so the article provides another method.

First start a tcpdump process that does not need to capture any packets. It is only a cover that provides a clue, so let's start with it:

$ sudo tcpdump -i lo -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

tcpdump is used because it creates a packet socket, and the virtual address of that socket can be found in procfs. Before tcpdump is started:

$ sudo cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
ffff9c39c0c87000 2      2    0000   0     0 0      0      384885
ffff9c397b431000 3      2    888e   2     1 0      0      384886
ffff9c386ec14000 3      2    890d   2     1 0      0      382437
ffff9c397b4ab800 2      2    0000   0     0 0      0      383853
ffff9c3796d78000 3      2    0800   2     1 0      0      380767

After starting tcpdump

$ sudo cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
ffff9c39c0c87000 2      2    0000   0     0 0      0      384885
ffff9c397b431000 3      2    888e   2     1 0      0      384886
ffff9c386ec14000 3      2    890d   2     1 0      0      382437
ffff9c397b4ab800 2      2    0000   0     0 0      0      383853
ffff9c3796d78000 3      2    0800   2     1 0      0      380767
ffff9c3888878000 3      3    0003   1     1 0      0      382711

We can see that a packet socket has indeed been added.

However, the way the article converts between virtual and physical addresses does not match my machine, so here we still need crash to do the conversion.

The article uses the virtual address of the packet socket to work out the position of wait_queue_head_t in the structure, then finds the next task_struct from it and searches the whole task_struct list.

This requires being very familiar with the kernel's data structures. If you are not, go to the corresponding source code and calculate the offsets, or use crash's struct X.y -o to calculate them.

Because the article is old and the system is not the same, we still have to use crash's struct X.y -o to see how to reach the structure we are looking for.

First we start with struct sock to find sk_wq:

crash> struct sock
struct sock {
    struct sock_common __sk_common;
    socket_lock_t sk_lock;
    atomic_t sk_drops;
    int sk_rcvlowat;
    struct sk_buff_head sk_error_queue;
    struct sk_buff *sk_rx_skb_cache;
    struct sk_buff_head sk_receive_queue;
    struct {
        atomic_t rmem_alloc;
        int len;
        struct sk_buff *head;
        struct sk_buff *tail;
    } sk_backlog;
    int sk_forward_alloc;
    unsigned int sk_ll_usec;
    unsigned int sk_napi_id;
    int sk_rcvbuf;
    struct sk_filter *sk_filter;
    union {
        struct socket_wq *sk_wq;
        struct socket_wq *sk_wq_raw;
    };
    struct xfrm_policy *sk_policy[2];
    struct dst_entry *sk_rx_dst;
    struct dst_entry *sk_dst_cache;
    atomic_t sk_omem_alloc;
    int sk_sndbuf;
    int sk_wmem_queued;
    refcount_t sk_wmem_alloc;
    unsigned long sk_tsq_flags;
    union {
        struct sk_buff *sk_send_head;
        struct rb_root tcp_rtx_queue;
    };
    struct sk_buff *sk_tx_skb_cache;
    struct sk_buff_head sk_write_queue;
    __s32 sk_peek_off;
    int sk_write_pending;
    __u32 sk_dst_pending_confirm;
    u32 sk_pacing_status;
    long sk_sndtimeo;
    struct timer_list sk_timer;
    __u32 sk_priority;
    __u32 sk_mark;
    unsigned long sk_pacing_rate;
    unsigned long sk_max_pacing_rate;
    struct page_frag sk_frag;
    netdev_features_t sk_route_caps;
    netdev_features_t sk_route_nocaps;
    netdev_features_t sk_route_forced_caps;
    int sk_gso_type;
    unsigned int sk_gso_max_size;
    gfp_t sk_allocation;
    __u32 sk_txhash;
    unsigned int __sk_flags_offset[0];
    unsigned int sk_padding : 1;
    unsigned int sk_kern_sock : 1;
    unsigned int sk_no_check_tx : 1;
    unsigned int sk_no_check_rx : 1;
    unsigned int sk_userlocks : 4;
    unsigned int sk_protocol : 8;
    unsigned int sk_type : 16;
    u16 sk_gso_max_segs;
    u8 sk_pacing_shift;
    unsigned long sk_lingertime;
    struct proto *sk_prot_creator;
    rwlock_t sk_callback_lock;
    int sk_err;
    int sk_err_soft;
    u32 sk_ack_backlog;
    u32 sk_max_ack_backlog;
    kuid_t sk_uid;
    struct pid *sk_peer_pid;
    const struct cred *sk_peer_cred;
    long sk_rcvtimeo;
    ktime_t sk_stamp;
    u16 sk_tsflags;
    u8 sk_shutdown;
    u32 sk_tskey;
    atomic_t sk_zckey;
    u8 sk_clockid;
    u8 sk_txtime_deadline_mode : 1;
    u8 sk_txtime_report_errors : 1;
    u8 sk_txtime_unused : 6;
    struct socket *sk_socket;
    void *sk_user_data;
    void *sk_security;
    struct sock_cgroup_data sk_cgrp_data;
    struct mem_cgroup *sk_memcg;
    void (*sk_state_change)(struct sock *);
    void (*sk_data_ready)(struct sock *);
    void (*sk_write_space)(struct sock *);
    void (*sk_error_report)(struct sock *);
    int (*sk_backlog_rcv)(struct sock *, struct sk_buff *);
    struct sk_buff *(*sk_validate_xmit_skb)(struct sock *, struct net_device *, struct sk_buff *);
    void (*sk_destruct)(struct sock *);
    struct sock_reuseport *sk_reuseport_cb;
    struct bpf_sk_storage *sk_bpf_storage;
    struct callback_head sk_rcu;
}
SIZE: 760

It can be seen that the structure size of the entire sock is 760.

Knowing that sk_wq exists in the structure, we can use struct X.y to query its offset.

crash> struct sock.sk_wq
struct sock {
   [280] struct socket_wq *sk_wq;
}

Next query the structure of struct socket_wq:

crash> struct socket_wq
struct socket_wq {
     wait_queue_head_t wait;
     struct fasync_struct *fasync_list;
     unsigned long flags;
     struct callback_head rcu;
}
SIZE: 64

This structure is much smaller than the previous one. We can see that the first item is wait_queue_head_t, so the offset is 0, and we can directly observe wait_queue_head_t.

crash> struct wait_queue_head_t
typedef struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;
} wait_queue_head_t;
SIZE: 24

This should be the socket's wait queue. According to the article, the next step is to go from wait_queue_t to poll_wqueues and from there to task_struct, but the wait_queue_head_t found here is the end of the chain and does not lead to poll_wqueues, so another way has to be found.

I have not yet found a way to get from struct sock to struct task_struct, so I still use crash to change the name of the process.

Next, we use crash to help us modify the process name.

First, run pixie again:

$ ./pixie

We can use crash to query the location of task_struct of this process, and then search for the location of comm in task_struct.

crash> ps | grep pixie

  24127  10259   1  ffff8d7aa975c5c0  IN   0.0    2492   1248  pixie
crash> set 24127
    PID: 24127
COMMAND: "pixie"
   TASK: ffff8d7aa975c5c0  [THREAD_INFO: ffff8d7aa975c5c0]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE 
crash> px ((struct task_struct *)0xffff8d7aa975c5c0)->comm
$3 = "pixie\000PoolSingl"
crash> px &((struct task_struct *)0xffff8d7aa975c5c0)->comm
$4 = (char (*)[16]) 0xffff8d7aa975d038

Then we can convert this virtual address into a physical address for mapping:

crash> vtop 0xffff8d7aa975d038
VIRTUAL           PHYSICAL        
ffff8d7aa975d038  22975d038       

PGD DIRECTORY: ffffffffa280a000
PAGE DIRECTORY: 3b8801067
   PUD: 3b8801f50 => 2191d0063
   PMD: 2191d0a58 => 219038063
   PTE: 219038ae8 => 800000022975d063
  PAGE: 22975d000

      PTE         PHYSICAL   FLAGS
800000022975d063  22975d000  (PRESENT|RW|ACCESSED|DIRTY|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe56c88a5d740 22975d000 dead000000000400        0  0 17ffffc0000000

We can see that 0x22975d038 is the physical address of task_struct->comm.

crash> rd -p 22975d038
        22975d038: 656f68736e696b73 pixie

However, we cannot map this address directly: the offset passed to mmap must be a multiple of the page size, and the low 12 bits here are the position inside the page. So we clear the low 12 bits, map the page, and then add that remainder (diff) back to the returned pointer.

hack.c

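#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

#define PAGE_SZ 4096ULL
#define MAP_LEN (2 * PAGE_SZ)   /* two pages, in case comm straddles a page boundary */

int main(int argc, char **argv)
{
    int fd;
    unsigned char *map, *addr;
    unsigned long long off, diff;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <physical address in hex>\n", argv[0]);
        return 1;
    }

    off = strtoull(argv[1], NULL, 16);
    diff = off & (PAGE_SZ - 1);     /* position inside the page */
    off &= ~(PAGE_SZ - 1);          /* page-aligned offset for mmap */

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    map = mmap(NULL, MAP_LEN, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    addr = map + diff;
    printf("program name is: %s\n", (char *)addr);
    strcpy((char *)addr, "skinshoe");   /* overwrite task_struct->comm */

    munmap(map, MAP_LEN);               /* unmap the address mmap returned */
    close(fd);

    return 0;
}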

After execution:

$ sudo ./skinshoe 22975d038
program name is: pixie

The program prints the old name before overwriting it; crash confirms that the name has changed:

crash> px ((struct task_struct *)0xffff8d7aa975c5c0)->comm
$8 = "skinshoe\000lSingl"
crash> rd -p 22975d038
        22975d038: 656f68736e696b73 skinshoe

We can also use pid query:

$ cat /proc/24127/comm
skinshoe

So far we have successfully modified the name of the process through /dev/mem.

Implement vtop

The /dev/mem article mentions:

Non-reserved physical memory will be mapped one by one to the virtual address starting from 0xffff880000000000

So to convert a virtual address in this mapping to the physical address used with /dev/mem, we only need to subtract 0xffff880000000000. The author relies on this for many things, but because the kernel version is different, the mapping on my computer does not start at 0xffff880000000000.

After many comparisons and corrections, I found that the base value is different every time the computer boots. This time it starts at 0xffff8d7880000000, so subtracting this value converts the virtual address without help from crash. I speculate that the base is chosen randomly when the mapping is set up at boot.

So we can use the virtual address to view the name of the process.

Run pixie from the previous section again:

./pixie

This time we can query pixie's task_struct:

crash> ps | grep pixie
crash: current context no longer exists -- restoring "crash" context:

  27857  10259   0  ffff8d7a99105d00  IN   0.0    2492   1252  pixie
crash> set 27857
    PID: 27857
COMMAND: "pixie"
   TASK: ffff8d7a99105d00  [THREAD_INFO: ffff8d7a99105d00]
    CPU: 0
  STATE: TASK_INTERRUPTIBLE

So the virtual address of the task_struct is 0xffff8d7a99105d00.

We can also query the offset of task_struct.comm:

crash> struct task_struct.comm
struct task_struct {
   [2680] char comm[16];
}

So once we have the physical address of the task_struct, adding 2680 gives the address of comm.

The implementation code is as follows:

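#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

#define OFFSET   0xffff8d7880000000ULL  /* base of the mapping for this boot */
#define COMM_OFF 2680                   /* offset of comm in task_struct */
#define PAGE_SZ  4096ULL
#define MAP_LEN  (2 * PAGE_SZ)          /* comm may straddle a page boundary */

int main(int argc, char **argv)
{
    int fd;
    unsigned char *map;
    unsigned long long off, diff;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <task_struct virtual address in hex>\n", argv[0]);
        return 1;
    }
    off = strtoull(argv[1], NULL, 16);
    off -= OFFSET;                      /* virtual -> physical */
    off += COMM_OFF;                    /* physical address of comm */
    diff = off & (PAGE_SZ - 1);
    off &= ~(PAGE_SZ - 1);

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    map = mmap(NULL, MAP_LEN, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    printf("program name is: %s\n", (char *)(map + diff));

    munmap(map, MAP_LEN);
    close(fd);

    return 0;
}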

Final execution result:

$ sudo ./vtop ffff8d7a99105d00
program name is: pixie

Visit the process of the system

Continuing the previous example, we can try to visit the processes that follow pixie.

First we query the offset of tasks (a struct list_head) in task_struct; this is the linked list that connects all task_struct instances.

crash> struct task_struct.tasks
struct task_struct {
   [1984] struct list_head tasks;
}

Knowing this offset, we can try to visit the processes and print their names.

Again we use pixie as the entry point.

./pixie

Then find the virtual address of this process's task_struct as before; I won't repeat the steps here. Note that in this example the mapping starts at 0xffff9fa540000000.

tasks.next holds the virtual address of the tasks member inside the next task_struct, so subtracting the offset of tasks gives the start of that task_struct, and adding the offset of comm gives its name. In this way we can visit the following processes and their names. Here I list a total of 12.

traverse.c

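#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

#define OFFSET    0xffff9fa540000000ULL  /* base of the mapping for this boot */
#define TASKS_OFF 1984                   /* offset of tasks in task_struct */
#define COMM_OFF  2680                   /* offset of comm in task_struct */
#define PAGE_SZ   4096ULL
#define MAP_LEN   (2 * PAGE_SZ)          /* diff + COMM_OFF + 16 fits in two pages */

int main(int argc, char **argv)
{
    int fd;
    unsigned char *map, *task;
    unsigned long long vaddr, next, off, diff;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <task_struct virtual address in hex>\n", argv[0]);
        return 1;
    }
    vaddr = strtoull(argv[1], NULL, 16);  /* first task_struct */
    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (int i = 0; i < 12; i++) {
        off = vaddr - OFFSET;             /* virtual -> physical */
        diff = off & (PAGE_SZ - 1);
        off &= ~(PAGE_SZ - 1);

        map = mmap(NULL, MAP_LEN, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
        if (map == MAP_FAILED) {
            perror("mmap");
            break;
        }
        task = map + diff;
        printf("program name is: %s\n", (char *)(task + COMM_OFF));
        /* tasks.next points at the tasks member of the next task_struct */
        memcpy(&next, task + TASKS_OFF, sizeof(next));
        vaddr = next - TASKS_OFF;
        munmap(map, MAP_LEN);             /* unmap the address mmap returned */
    }

    close(fd);
    return 0;
}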

Output result:

$ sudo ./traverse ffff9fa791151740
program name is: pixie
program name is: bash
program name is: kworker/u16:1
program name is: kworker/u16:2
program name is: kworker/0:0
program name is: kworker/1:1
program name is: kworker/u16:3
program name is: kworker/3:2
program name is: cpptools-srv
program name is: kworker/u16:0
program name is: sudo
program name is: traverse

The names of the processes after pixie are listed successfully, and the traverse at the end is our own process, so we have extended the previous example to visit the processes in the system.

Access NULL address legally

Next, we will see whether we can legally access the NULL address.

In fact, the NULL address can be accessed like any other, as long as a page table entry maps it to a physical memory page. First look at the description in [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html):

on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there.

So addresses below the value in /proc/sys/vm/mmap_min_addr are protected and cannot be mapped; we need to change that value to 0 before we can use the page at NULL.

So first we change /proc/sys/vm/mmap_min_addr:

$ cat /proc/sys/vm/mmap_min_addr
65536
$ sudo sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
$ cat /proc/sys/vm/mmap_min_addr
0

Next we map a page at address 0 and use the NULL address:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    int i;
    unsigned char *niladdr = NULL;
    unsigned char str[] = "Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!";

    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    perror("a");

    for (i = 0 ; i < sizeof(str); i++) {
        niladdr[i] = str[i];
    }
    printf("using assignment at NULL: %s\n", niladdr);
    for (i = 0 ; i < sizeof(str); i++) {
        printf ("%c", *((char*)NULL+i));
    }
    printf ("\n");

    getchar();

    munmap(0, 4096);

    return 0;
}

Output result:

$ sudo ./access0
a: Success
using assignment at NULL: (null)
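Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!

The second line shows (null) because glibc's printf prints that text for a NULL %s argument instead of reading from it; the character-by-character loop does read from address 0. Observe through crash: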

crash> ps | grep access0
crash: current context no longer exists -- restoring "crash" context:

   8447   8446   1  ffff9ddb907a2e80  IN   0.0    2492   1452  access0
crash> set 8447
    PID: 8447
COMMAND: "access0"
   TASK: ffff9ddb907a2e80  [THREAD_INFO: ffff9ddb907a2e80]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0
VIRTUAL     PHYSICAL        
0           3d8725000       

   PGD: 2dbf7e000 => 800000038bfd9067
   PUD: 38bfd9000 => 426544067
   PMD: 426544000 => 32f357067
   PTE: 32f357000 => 80000003d8725867
  PAGE: 3d8725000

      PTE         PHYSICAL   FLAGS
80000003d8725867  3d8725000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff9ddc3a17a8f0          0       1000 80000fb dev/zero

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffefce4f61c940 3d8725000 ffff9ddcbea4d598        0  2 17ffffc008001c uptodate,dirty,lru,swapbacked

We have successfully mapped the NULL address to physical memory, and we can also read the contents at the NULL address:

crash> rd 0 8
                0: 676e61696a65685a 756f687a6e655720 Zhejiang Wenzhou
               10: 7320656978697020 75796169782c6968    pixie shi,xiayu
               20: 697568736e696a20 7020697568756220    jinshui buhui p
               30: 0000000021676e61 0000000000000000   ang!............

Exactly the same as what we put in!

So NULL is inaccessible only by convention: to make it easier to tell valid addresses from invalid ones, a special address called NULL is deliberately left unmapped. At the MMU (Memory Management Unit) level, NULL is no different from any other address.

Kernel protect mechanism

KPTI
Because sharing one address space between user mode and kernel mode can leak kernel data, the Linux kernel introduced KPTI (Kernel Page-Table Isolation), which hides the kernel's location from user space.

According to KAISER: hiding the kernel from user space, the position of the kernel in the virtual address space is randomized at boot time (KASLR) so that attackers cannot know where the kernel is, and KPTI protects that randomization. KPTI provides a shadow page table that maps all of user space but only the small part of the kernel needed for system calls and interrupts to work, thereby hiding the kernel.

However, it is still possible that the base address of the kernel is leaked during transitions between user mode and kernel mode.

ASLR
ASLR (Address Space Layout Randomization) is another memory protection mechanism: it places a process's stack, heap, and libraries at unpredictable addresses, so an attacker who exploits a stack overflow cannot simply jump to a known location.

These two mechanisms may explain some of the unexpected results seen when observing memory with crash, so I turn off KPTI and ASLR and observe the memory again to see whether anything changes.

First we turn off KPTI:

The steps follow HOW TO DISABLE PAGE-TABLE ISOLATION ON UBUNTU FOR BENCHMARKING, which includes examples.

Let's first take a look at the status of KPTI in the system:

$ cat /sys/devices/system/cpu/vulnerabilities/*
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Mitigation: PTI
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Microcode
Not affected

The fourth line, which comes from the meltdown file, reads Mitigation: PTI, so PTI is currently enabled.

To turn it off, edit the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub and add pti=off, like this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pti=off"

After running update-grub and rebooting, PTI is turned off:

$ cat /sys/devices/system/cpu/vulnerabilities/*
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Vulnerable
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Microcode
Not affected

Next we turn off ASLR:

According to How ASLR protects Linux systems from buffer overflow attacks, the state of ASLR can be read from /proc/sys/kernel/randomize_va_space.

$ cat /proc/sys/kernel/randomize_va_space
2
$ sysctl -a --pattern randomize
kernel.randomize_va_space = 2

Here 2 means Full Randomization.

The article mentions an interesting way to test ASLR: run ldd several times and check whether the listed addresses differ each time.

$ ldd /bin/bash
	linux-vdso.so.1 (0x00007ffcf11fe000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f888ba57000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f888ba51000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f888b85f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f888bbd1000)
$ ldd /bin/bash
	linux-vdso.so.1 (0x00007ffcbc7c6000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fa4001c6000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa4001c0000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa3fffce000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa400340000)

Next, we turn off ASLR and observe with ldd again:

$ sudo sysctl -w kernel.randomize_va_space=0
kernel.randomize_va_space = 0
$ ldd /bin/bash
	linux-vdso.so.1 (0x00007ffff7fce000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007ffff7e51000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff7e4b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7c59000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffff7fcf000)
$ ldd /bin/bash
	linux-vdso.so.1 (0x00007ffff7fce000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007ffff7e51000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff7e4b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7c59000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffff7fcf000)

The addresses are now identical on every run, so ASLR has been disabled.

Then repeat the previous experiment to see if there is any difference.

Starting with the page-table offsets, we run test again to get the index at each page-table level.

$ sudo ./test
addr = 0x7fe93c5e6000
the address is 0x7fe93c5e6000, and the value is 1
the address is 0x7fe93c5e6004, and the value is 9
PGD index = 0xff
PUD index = 0x1a4
PMD index = 0x1e2
PTE index = 0x1e6

We can check in crash whether the offsets match these indices directly, or whether they are still the indices shifted left by three bits (that is, multiplied by the 8-byte size of a page-table entry).

crash> ps | grep test
crash: current context no longer exists -- restoring "crash" context:

   4310   4309   2  ffff8e83da5edd00  IN   0.0    2492   1424  test
crash> set 4310
    PID: 4310
COMMAND: "test"
   TASK: ffff8e83da5edd00  [THREAD_INFO: ffff8e83da5edd00]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x7ffff7ffb000
VIRTUAL     PHYSICAL        
7ffff7ffb000  440000000       

   PGD: 2e98b67f8 => 2f89cb067
   PUD: 2f89cbff8 => 42498a067
   PMD: 42498adf8 => 403e32067
   PTE: 403e32fd8 => 8000000440000267
  PAGE: 440000000

      PTE         PHYSICAL   FLAGS
8000000440000267  440000000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff8e82f2e57930 7ffff7ffb000 7ffff7ffc000 d0444bb /dev/mem

According to the result displayed by crash, each offset is still the index shifted left by three bits, as before (for example, PGD index 0xff gives offset 0x7f8). One thing worth noting: with ASLR turned off, the virtual address no longer changes, no matter how many times the program is run:

$ sudo ./test
addr = 0x7ffff7ffb000 
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb

$ sudo ./test
addr = 0x7ffff7ffb000 
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb

$ sudo ./test
addr = 0x7ffff7ffb000 
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb

$ sudo ./test
addr = 0x7ffff7ffb000 
the address is 0x7ffff7ffb000, and the value is 1
the address is 0x7ffff7ffb004, and the value is 9
PGD index = 0xff
PUD index = 0x1ff
PMD index = 0x1bf
PTE index = 0x1fb

Next we check whether the mapping of non-reserved memory now has a fixed base address.

This time we run the pixie program repeatedly and watch how its address changes.

crash> ps | grep pixie
   2598   2357   3  ffff979b7e2e8000  IN   0.0    2492   1232  pixie
crash> set 2598
    PID: 2598
COMMAND: "pixie"
   TASK: ffff979b7e2e8000  [THREAD_INFO: ffff979b7e2e8000]
    CPU: 3
  STATE: TASK_INTERRUPTIBLE 
crash> px 0xffff979b7e2e8000
$1 = 0xffff979b7e2e8000
crash> vtop 0xffff979b7e2e8000
VIRTUAL           PHYSICAL        
ffff979b7e2e8000  3fe2e8000       

PGD DIRECTORY: ffffffffa040a000
PAGE DIRECTORY: 244401067
   PUD: 244401368 => 3ffa28063
   PMD: 3ffa28f88 => 3fe38b063
   PTE: 3fe38b740 => 80000003fe2e8163
  PAGE: 3fe2e8000

      PTE         PHYSICAL   FLAGS
80000003fe2e8163  3fe2e8000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe5d44ff8ba00 3fe2e8000 ffff979ba281a840 ffff979b7e2e9740  1 17ffffc0010200 slab,head
crash> px 0xffff979b7e2e8000-0x3fe2e8000
$2 = 0xffff979780000000

The base address is 0xffff979780000000 on the first boot, so let's reboot to see whether it changes.

Reboot and run pixie:

crash> ps | grep pixie
   2666   2437   2  ffff9633e7492e80  IN   0.0    2492   1232  pixie
crash> set 2666
    PID: 2666
COMMAND: "pixie"
   TASK: ffff9633e7492e80  [THREAD_INFO: ffff9633e7492e80]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0xffff9633e7492e80
VIRTUAL           PHYSICAL        
ffff9633e7492e80  427492e80       

PGD DIRECTORY: ffffffff9ce0a000
PAGE DIRECTORY: 103401067
   PUD: 103401678 => 42e3aa063
   PMD: 42e3aa9d0 => 4273be063
   PTE: 4273be490 => 8000000427492163
  PAGE: 427492000

      PTE         PHYSICAL   FLAGS
8000000427492163  427492000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe865d09d2480 427492000 dead000000000400        0  0 17ffffc0000000
crash> px 0xffff9633e7492e80-0x427492e80
$1 = 0xffff962fc0000000

After rebooting, the base address is still different, so the assumption that KPTI or ASLR changes it was wrong.

Next, let's try Spectre, which the teacher mentioned. Spectre uses branch prediction and speculative execution on modern CPUs to bypass access control and read privileged data without modifying memory.

This time, we turn off Linux's defense mechanisms against Spectre.

According to Spectre Side Channels, the Spectre mitigations can be turned off through the grub configuration file.

GRUB_CMDLINE_LINUX_DEFAULT="nospectre_v1 nospectre_v2 nopti quiet splash"

Then restart the machine and use crash to test it.

crash> ps | grep pixie
   3234   2949   6  ffff948c0457c5c0  IN   0.0    2492   1232  pixie
crash> set 3234
    PID: 3234
COMMAND: "pixie"
   TASK: ffff948c0457c5c0  [THREAD_INFO: ffff948c0457c5c0]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0xffff948c0457c5c0
VIRTUAL           PHYSICAL        
ffff948c0457c5c0  34457c5c0       

PGD DIRECTORY: ffffffffa6e0a000
PAGE DIRECTORY: 268201067
   PUD: 268201180 => 80000003400001e3
   PMD: 340000110 => 0

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff03a0d115f00 34457c000 dead000000000400        0  0 17ffffc0000000
crash> px 0xffff948c0457c5c0-0x34457c5c0
$1 = 0xffff9488c0000000

The base address is 0xffff9488c0000000. Then reboot again:

crash> ps | grep pixie
   3249   3022   6  ffff998ac5bd0000  IN   0.0    2492   1232  pixie
crash> set 3249
    PID: 3249
COMMAND: "pixie"
   TASK: ffff998ac5bd0000  [THREAD_INFO: ffff998ac5bd0000]
    CPU: 6
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0x ffff998ac5bd0000
VIRTUAL     PHYSICAL        
0           (not accessible)


VIRTUAL           PHYSICAL        
ffff998ac5bd0000  305bd0000       

PGD DIRECTORY: ffffffffb620a000
PAGE DIRECTORY: 344e01067
   PUD: 344e01158 => 80000003000001e3
   PMD: 300000168 => 280000000a

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff1ed0c16f400 305bd0000 ffff998bb153e840 ffff998ac5bd1740  1 17ffffc0010200 slab,head
crash> vtop 0xffff998ac5bd0000
VIRTUAL           PHYSICAL        
ffff998ac5bd0000  305bd0000       

PGD DIRECTORY: ffffffffb620a000
PAGE DIRECTORY: 344e01067
   PUD: 344e01158 => 80000003000001e3
   PMD: 300000168 => 280000000a

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff1ed0c16f400 305bd0000 ffff998bb153e840 ffff998ac5bd1740  1 17ffffc0010200 slab,head
crash> px 0xffff998ac5bd0000-0x305bd0000
$1 = 0xffff9987c0000000

The base address still differs between the two boots, so some other defense mechanism must be responsible.

Following the teacher's hint, I turn off KASLR to see whether that gives the expected result. To do so, add nokaslr to GRUB_CMDLINE_LINUX_DEFAULT in the grub configuration.

After restarting, let's observe the result:

crash> ps | grep pixie
   3890   3235   1  ffff8883cd758000  IN   0.0    2492   1224  pixie
crash> set 3890\
set: invalid task or pid value: 3890\
crash> set 3890
    PID: 3890
COMMAND: "pixie"
   TASK: ffff8883cd758000  [THREAD_INFO: ffff8883cd758000]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE 
crash> vtop 0xffff8883cd758000
VIRTUAL           PHYSICAL        
ffff8883cd758000  3cd758000       

PGD DIRECTORY: ffffffff8260a000
PAGE DIRECTORY: 3001067
   PUD: 3001078 => 3ffdd1063
   PMD: 3ffdd1358 => 3cd6f9063
   PTE: 3cd6f9ac0 => 80000003cd758163
  PAGE: 3cd758000

      PTE         PHYSICAL   FLAGS
80000003cd758163  3cd758000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL|NX)

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000f35d600 3cd758000 ffff888404f38540 ffff8883cd759740  1 17ffffc0010200 slab,head
crash> px 0xffff8883cd758000-0x3cd758000
$1 = 0xffff888000000000

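This time the result is the fixed base address 0xffff888000000000, which is the default start of this mapping on recent x86-64 kernels (the article's 0xffff880000000000 comes from an older kernel).

Therefore, we can infer that KASLR is the main mechanism that changes the base address of the kernel's mapping of physical memory at each boot.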
