# KVM Host Initialization and Configuration Process
These notes describe the initialization and configuration process of kvm-host.
## Overall Architecture
In [`src/main.c`](https://github.com/sysprog21/kvm-host/blob/master/src/main.c):
- [`vm_init(&vm)`](#VM-Initialization): Initialize the virtual machine.
- [`vm_load_image(&vm, kernel_file)`](#VM-Load-Image-/-Initial-RAM-File-System): Load the kernel image file.
- [`vm_load_initrd(&vm, initrd_file)`](#VM-Load-Image-/-Initial-RAM-File-System): Load the initial RAM filesystem.
- [`vm_load_diskimg(&vm, diskimg_file)`](#VM-Load-Disk-Image): Load the disk image.
- `diskimg_init(&v->diskimg, diskimg_file)`: Initialize the disk image.
- `virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus, &v->mmio_bus)`: Initialize the virtio block.
- [`vm_enable_net(&vm)`](#Network-Device): Enable the network device.
- [`vm_late_init(&vm)`](#VM-Late-Init): Architecture-specific final initialization.
- [`vm_run(&vm)`](#VM-Run): Start the virtual machine.
- `vm_exit(&vm)`: Exit the virtual machine.
## VM Overall Process
### VM Initialization
- Open the KVM device file.
```c
v->kvm_fd = open("/dev/kvm", O_RDWR);
```
- Create a new virtual machine. In this step, KVM in Linux creates a virtual machine and returns the corresponding file descriptor.
```c
v->vm_fd = ioctl(v->kvm_fd, KVM_CREATE_VM, 0);
```
- `vm_arch_init()`
- [x86](#VM-Initialization-in-x86): [The Linux/x86 Boot Protocol](https://www.kernel.org/doc/html/latest/arch/x86/boot.html)
- [arm64](#VM-Initialization-in-arm64): [Booting AArch64 Linux](https://docs.kernel.org/arch/arm64/booting.html)
- Allocate the guest memory space.
```c
v->mem = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
```
- Register the memory region created by `mmap()` in the previous step as the guest's physical memory.
```c
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = 0,
.guest_phys_addr = RAM_BASE,
.memory_size = RAM_SIZE,
.userspace_addr = (__u64) v->mem,
};
ioctl(v->vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
```
- `RAM_BASE` is 0 on x86 and 2 GB on arm64.
- Create virtual CPU.
```c
v->vcpu_fd = ioctl(v->vm_fd, KVM_CREATE_VCPU, 0);
```
- `vm_arch_cpu_init()`
- x86:
- [arm64](#VM-CPU-Initialization-in-arm64):
- Initialize the IO and MMIO bus. [`bus_init()`](#Bus-Initialization)
```c
bus_init(&v->io_bus);
bus_init(&v->mmio_bus);
```
- `vm_arch_init_platform_device(v)`
- x86:
- [arm64](#VM-Platform-Devices-Initialization-in-arm64):
[Go Back to Overall Architecture](#Overall-Architecture)
### VM Load Image / Initial RAM File System
- Open the image and initrd files via their file paths.
- Retrieve the size of each file.
- Employ `mmap()` to map each file into memory.
- Call `vm_arch_load_image()` with the base memory address and size parameters.
- x86:
- [arm64](#VM-Load-Image-/-Initial-RAM-File-System-in-arm64):
[Go Back to Overall Architecture](#Overall-Architecture)
### VM Load Disk Image
- `diskimg_init(&v->diskimg, diskimg_file);`
- `virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus, &v->mmio_bus);`
[Go Back to Overall Architecture](#Overall-Architecture)
### VM Late Init
- x86: Nothing to do.
- [arm64](#VM-Late-Init-in-arm64):
- Create device tree: `generate_fdt(v)`
- Initialize CPU registers: `init_reg(v)`
[Go Back to Overall Architecture](#Overall-Architecture)
### VM Run
> When a virtual CPU is executing guest code, it runs in non-root mode, allowing most instructions to execute directly on the hardware.
> This enables high performance with minimal overhead. However, certain operations—such as accessing I/O ports, modifying control registers, or executing privileged system instructions—are not permitted in this mode. When the CPU encounters such an instruction or a predefined condition, it performs a VM exit (VMEXIT), which transitions control from non-root to root mode, handing execution back to the kernel.
>
> Page 70 in [The Conceptual Guide to the Linux Kernel v1.1.2025](https://www.linkedin.com/feed/update/urn:li:activity:7334420081403146240/) by [Moon Hee Lee](https://www.linkedin.com/in/moon-hee-lee/)
[Go Back to Overall Architecture](#Overall-Architecture)
## x86-Specific Process
### VM Initialization in x86
[The Linux/x86 Boot Protocol](https://www.kernel.org/doc/html/latest/arch/x86/boot.html)
1. `KVM_SET_TSS_ADDR`: Defines the physical address range spanning three pages to configure the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment).
2. `KVM_SET_IDENTITY_MAP_ADDR`: Defines the physical address range spanning one page to configure the identity map (page table).
3. `KVM_CREATE_IRQCHIP`: Creates a virtual [Programmable Interrupt Controller (PIC)](https://en.wikipedia.org/wiki/Programmable_interrupt_controller).
4. `KVM_CREATE_PIT2`: Creates a virtual [Programmable Interval Timer (PIT)](https://en.wikipedia.org/wiki/Programmable_interval_timer).
[Go Back to VM Initialization](#VM-Initialization)
## arm64-Specific Process
### VM Initialization in arm64
> References:
> [sysprog: Linux KVM](https://hackmd.io/@sysprog/linux-kvm#%E5%BB%BA%E7%AB%8B%E8%88%87%E5%88%9D%E5%A7%8B%E5%8C%96%E4%B8%AD%E6%96%B7%E6%8E%A7%E5%88%B6%E5%99%A8)
> [Booting AArch64 Linux](https://docs.kernel.org/arch/arm64/booting.html)
> [ARM GICv3 and GICv4 Software Overview Release B](https://developer.arm.com/documentation/dai0492/latest/)
> [Linux KVM API](https://docs.kernel.org/virt/kvm/api.html)
> [ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)](https://docs.kernel.org/virt/kvm/devices/arm-vgic-v3.html)
> [kvmtool](https://github.com/kvmtool/kvmtool/blob/master/arm/gic.c)
#### Create and initialize the interrupt controller
KVM provides a virtualized interrupt controller supporting both GICv2 and GICv3. Which one can be instantiated depends on the host hardware: if the host employs GICv3 hardware that does not offer GICv2 emulation, only a virtual GICv3 interrupt controller can be created; conversely, if the host's interrupt controller is GICv2, only a virtual GICv2 may be created.
By leveraging the virtualized interrupt controller supplied by KVM, there is no need to implement interrupt-controller emulation in user space; one can instead rely directly on the implementation provided within the Linux kernel.
The [eMAG 8180](https://en.wikichip.org/wiki/ampere_computing/emag/8180) host only supports the creation of a virtual GICv3 interrupt controller; accordingly, the following discussion will focus on GICv3.
Below is the architecture of the GICv3 interrupt controller:

> Reference: [GICv3 and v4 Software Overview](https://developer.arm.com/documentation/dai0492/latest/) Chapter 3.5, Page 16
In GICv3, the CPU Interface is accessed through system registers, employing the MSR and MRS instructions for reading from and writing to those registers. Each CPU core has its own Redistributor, whereas a single Distributor serves the entire system. Both the Redistributor and the Distributor are accessed via MMIO.
Once a virtual machine has been created using the `KVM_CREATE_VM` `ioctl`, the interrupt controller can be instantiated by issuing the `KVM_CREATE_DEVICE` `ioctl` for that VM.
In [`src/arch/arm64/vm.c`](https://github.com/sysprog21/kvm-host/blob/master/src/arch/arm64/vm.c):
The `KVM_CREATE_DEVICE` `ioctl` must be provided with a `struct kvm_create_device`, as shown below:
```c
struct kvm_create_device gic_device = {
.type = KVM_DEV_TYPE_ARM_VGIC_V3,
};
ioctl(v->vm_fd, KVM_CREATE_DEVICE, &gic_device);
```
Upon creation, the `fd` member of the `struct kvm_create_device` holds the file descriptor for this GICv3 interrupt controller, which is then stored in the arm64-specific private data structure `vm_arch_priv_t`.
```c
priv->gic_fd = gic_device.fd;
```
Next, the MMIO addresses for the Redistributor and Distributor must be configured. To do this, the `KVM_SET_DEVICE_ATTR` `ioctl` is used, with the file descriptor set to the GICv3 fd obtained earlier. The argument passed to this ioctl is a `struct kvm_device_attr`. The procedure is as follows:
```c
uint64_t dist_addr = ARM_GIC_DIST_BASE;
uint64_t redist_addr = ARM_GIC_REDIST_BASE;
struct kvm_device_attr dist_attr = {
.group = KVM_DEV_ARM_VGIC_GRP_ADDR,
.attr = KVM_VGIC_V3_ADDR_TYPE_DIST,
.addr = (uint64_t) &dist_addr,
};
struct kvm_device_attr redist_attr = {
.group = KVM_DEV_ARM_VGIC_GRP_ADDR,
.attr = KVM_VGIC_V3_ADDR_TYPE_REDIST,
.addr = (uint64_t) &redist_addr,
};
ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &redist_attr);
ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &dist_attr);
```
Since the `.addr` field must hold a pointer to a `uint64_t`, we first declare a local variable and then obtain its address using the `&` operator.
After GICv3 has been created and all required [vCPUs](#VM-CPU-Initialization-in-arm64) have been instantiated, additional initialization is necessary to allow the VM to function correctly. The GICv3 initialization takes place within the `finalize_irqchip()` function in [`src/arch/arm64/vm.c`](https://github.com/sysprog21/kvm-host/blob/master/src/arch/arm64/vm.c), which is invoked by [vm_arch_init_platform_device()](#VM-Platform-Devices-Initialization-in-arm64).
[Go Back to VM Initialization](#VM-Initialization)
### VM CPU Initialization in arm64
On ARM64, the vCPU can be initialized by invoking the `KVM_ARM_VCPU_INIT` ioctl on the `vcpu_fd`, where a pointer to a `struct kvm_vcpu_init` must be supplied as the argument. The `struct kvm_vcpu_init` itself can be obtained directly by issuing the `KVM_ARM_PREFERRED_TARGET` ioctl on the `vm_fd`.
In [`src/arch/arm64/vm.c`](https://github.com/sysprog21/kvm-host/blob/master/src/arch/arm64/vm.c):
```c
struct kvm_vcpu_init vcpu_init;
ioctl(v->vm_fd, KVM_ARM_PREFERRED_TARGET, &vcpu_init);
ioctl(v->vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init);
```
[Go Back to VM Initialization](#VM-Initialization)
### VM Platform Devices Initialization in arm64
At this stage, we initialize the [system bus](#Device-Initialization), [PCI bus](#PCI-Initialization), and serial device.
In [`src/arch/arm64/vm.c`](https://github.com/sysprog21/kvm-host/blob/master/src/arch/arm64/vm.c):
```c
static void pio_handler(void *owner,
void *data,
uint8_t is_write,
uint64_t offset,
uint8_t size)
{
vm_t *v = (vm_t *) owner;
bus_handle_io(&v->io_bus, data, is_write, offset, size);
}
/* Initialize system bus */
dev_init(&priv->iodev, ARM_IOPORT_BASE, ARM_IOPORT_SIZE, v, pio_handler);
bus_register_dev(&v->mmio_bus, &priv->iodev);
/* Initialize PCI bus */
pci_init(&v->pci);
v->pci.pci_mmio_dev.base = ARM_PCI_CFG_BASE;
bus_register_dev(&v->mmio_bus, &v->pci.pci_mmio_dev);
/* Initialize serial device */
serial_init(&v->serial, &v->io_bus);
```
Finally, we must invoke the `KVM_SET_DEVICE_ATTR` ioctl on the file descriptor of the GICv3 that was just created, supplying a `struct kvm_device_attr` as the argument, as follows:
```c
struct kvm_device_attr vgic_init_attr = {
.group = KVM_DEV_ARM_VGIC_GRP_CTRL,
.attr = KVM_DEV_ARM_VGIC_CTRL_INIT,
};
ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &vgic_init_attr);
```
:::info
Once GICv3 has been initialized, **no further vCPUs may be created**.
:::
[Go Back to VM Initialization](#VM-Initialization)
### VM Load Image / Initial RAM File System in arm64
According to the information provided by the [Linux kernel documentation](https://docs.kernel.org/arch/arm64/booting.html#call-the-kernel-image), the decompressed kernel image contains a 64-byte header as follows:
In [`src/arch/arm64/vm.c`](https://github.com/sysprog21/kvm-host/blob/master/src/arch/arm64/vm.c):
```c
typedef struct {
uint32_t code0; /* Executable code */
uint32_t code1; /* Executable code */
uint64_t text_offset; /* Image load offset, little endian */
uint64_t image_size; /* Effective Image size, little endian */
uint64_t flags; /* kernel flags, little endian */
uint64_t res2; /* reserved */
uint64_t res3; /* reserved */
uint64_t res4; /* reserved */
uint32_t magic; /* Magic number, little endian, "ARM\x64" */
uint32_t res5; /* reserved (used for PE COFF offset) */
} arm64_kernel_header_t;
```
First, we check whether the magic number is valid.
```c
if (header->magic != 0x644d5241U)
return throw_err("Invalid kernel image\n");
```
The document states:
> Prior to v3.17, the endianness of text_offset was not specified. In these cases image_size is zero and text_offset is 0x80000 in the endianness of the kernel. Where image_size is non-zero image_size is little-endian and must be respected. Where image_size is zero, text_offset can be assumed to be 0x80000.
```c
uint64_t offset;
if (header->image_size == 0)
offset = 0x80000;
else
offset = header->text_offset;
if (offset + datasz >= ARM_KERNEL_SIZE ||
offset + header->image_size >= ARM_KERNEL_SIZE) {
return throw_err("Image size too large\n");
}
void *dest = vm_guest_to_host(v, ARM_KERNEL_BASE + offset);
memmove(dest, data, datasz);
```
After loading the kernel, execution begins at the address where the image was loaded, i.e., the address of `code0`; this address is therefore recorded in the `entry` member of `vm_arch_priv_t`.
```c
priv->entry = ARM_KERNEL_BASE + offset;
```
The document also states:
> The Image must be placed text_offset bytes from a 2MB aligned base address anywhere in usable system RAM and called there.
Therefore, a 2 MB-aligned base address must be selected, and `text_offset` from the header is added to it; the resulting address is where the kernel image is placed. `image_size` denotes the amount of memory, starting from that position, to be reserved as usable space for the kernel.
In addition to loading the kernel, the initramfs must also be loaded as the first filesystem during boot. The document states:
> If an initrd/initramfs is passed to the kernel at boot, it must reside entirely within a 1 GB aligned physical memory window of up to 32 GB in size that fully covers the kernel Image as well.
This indicates that the initramfs and the Linux kernel must both reside within the same 32 GB window aligned on a 1 GB boundary; however, it does not specify whether the initramfs itself must be aligned.
```c
void *dest = vm_guest_to_host(v, ARM_INITRD_BASE);
memmove(dest, data, datasz);
priv->initrdsz = datasz;
```
[Go Back to VM Load Image / Initial RAM File System](#VM-Load-Image-/-Initial-RAM-File-System)
### VM Late Init in arm64
#### Device Tree
> [Device Tree Specification](https://github.com/devicetree-org/devicetree-specification/releases/download/v0.3/devicetree-specification-v0.3.pdf)
> [Implementation in kvmtool](https://github.com/kvmtool/kvmtool/blob/master/arm/fdt.c)
Prior to booting the kernel, the physical memory address at which the device tree resides must be passed to the kernel via the `x0` register.
[kvmtool](https://github.com/kvmtool/kvmtool) generates the device tree using [libfdt](https://git.kernel.org/pub/scm/utils/dtc/dtc.git), and we can adopt the same approach. [libfdt](https://git.kernel.org/pub/scm/utils/dtc/dtc.git) is a library included within the dtc package.
For the device-tree implementation, refer to kvmtool's; this is the [DTB dump produced by kvmtool](https://gist.github.com/yanjiew1/53be2d03430d187e61fff48eca3b6591).
The Device Tree primarily defines the following:
- Machine Type
- CPU
- Memory
- Initramfs Address
- Boot Arguments
- Interrupt Controller
- 16550 UART Addresses
- PCI Addresses
Procedure for generating the device tree using libfdt:
1. Use the `fdt_create()` function to specify the buffer and its size for placing the device tree—this creates an empty device tree.
2. Call `fdt_begin_node()` to add a node. Since the root node must be added first, invoke `fdt_begin_node(fdt, "")`.
3. Within each node, you can add properties using `fdt_property()`, `fdt_property_cell()`, `fdt_property_u64()`, and so on. When using `fdt_property()`, libfdt will copy the data verbatim; however, because values in the device tree must be represented in big-endian format, you must pair `fdt_property()` with `cpu_to_fdt32()` or `cpu_to_fdt64()` to convert endianness.
4. After all properties for a node have been added, call `fdt_end_node()` to close that node.
5. Finally, call `fdt_finish()` to complete the device tree. Once `fdt_finish()` returns—provided that every `fdt_begin_node()` call has a matching `fdt_end_node()`—the contents of the buffer constitute a valid device tree.
#### Initialize vCPU Register
> References:
> [Booting AArch64 Linux](https://docs.kernel.org/arch/arm64/booting.html)
> [kvmtool](https://github.com/kvmtool/kvmtool/blob/master/arm/aarch64/kvm-cpu.c)
Before jumping into the kernel, the following conditions must be met:
- Primary CPU general-purpose register settings:
- x0 = physical address of device tree blob (dtb) in system RAM.
- x1 = 0 (reserved for future use)
- x2 = 0 (reserved for future use)
- x3 = 0 (reserved for future use)
The `__REG` macro can be used to generate the register ID within the `struct kvm_one_reg` passed to the `KVM_SET_ONE_REG` ioctl.
```c
/* Initialize the vCPU registers according to Linux arm64 boot protocol
* Reference: https://www.kernel.org/doc/Documentation/arm64/booting.txt
*/
static int init_reg(vm_t *v)
{
vm_arch_priv_t *priv = (vm_arch_priv_t *) v->priv;
struct kvm_one_reg reg;
uint64_t data;
reg.addr = (uint64_t) &data;
#define __REG(r) \
(KVM_REG_ARM_CORE_REG(r) | KVM_REG_ARM_CORE | KVM_REG_ARM64 | \
KVM_REG_SIZE_U64)
/* Clear x1 ~ x3 */
for (int i = 0; i < 3; i++) {
data = 0;
reg.id = __REG(regs.regs[i]);
if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
return throw_err("Failed to set x%d\n", i);
}
/* Set x0 to the address of the device tree */
data = ARM_FDT_BASE;
reg.id = __REG(regs.regs[0]);
if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
return throw_err("Failed to set x0\n");
/* Set program counter to the beginning of kernel image */
data = priv->entry;
reg.id = __REG(regs.pc);
if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
return throw_err("Failed to set program counter\n");
#undef __REG
return 0;
}
```
[Go Back to VM Late Init](#VM-Late-Init)
## Bus
The bus handles the mapping between addresses and devices, using a *singly linked list* to manage the registered devices.
The `owner` parameter points to the structure that owns this device and is passed to the callback.
In [`src/bus.h`](https://github.com/sysprog21/kvm-host/blob/master/src/bus.h):
```c
typedef void (*dev_io_fn)(void *owner,
void *data,
uint8_t is_write,
uint64_t offset,
uint8_t size);
struct dev {
uint64_t base;
uint64_t len;
void *owner;
dev_io_fn do_io;
struct dev *next;
};
struct bus {
uint64_t dev_num;
struct dev *head;
};
```
### Bus Initialization
Initialize the singly linked list of the bus.
In [`src/bus.c`](https://github.com/sysprog21/kvm-host/blob/master/src/bus.c):
```c
void bus_init(struct bus *bus)
{
bus->dev_num = 0;
bus->head = NULL;
}
```
### Device Initialization
Initialize the device structure.
```c
void dev_init(struct dev *dev,
uint64_t base,
uint64_t len,
void *owner,
dev_io_fn do_io)
{
dev->base = base;
dev->len = len;
dev->owner = owner;
dev->do_io = do_io;
dev->next = NULL;
}
```
### Register the device to the bus
Insert the device into the linked list of the bus.
```c
void bus_register_dev(struct bus *bus, struct dev *dev)
{
dev->next = bus->head;
bus->head = dev;
bus->dev_num++;
}
```
### Deregister the device from the bus
Remove the device from the linked list of the bus.
```c
void bus_deregister_dev(struct bus *bus, struct dev *dev)
{
struct dev **p = &bus->head;
while (*p != dev && *p) {
p = &(*p)->next;
}
if (*p)
*p = (*p)->next;
}
```
### Handle IO in the Bus
Use the following function to issue an I/O request to the bus. It locates the target device based on the device’s `base` and `len`, then invokes the `do_io` callback.
```c
void bus_handle_io(struct bus *bus,
void *data,
uint8_t is_write,
uint64_t addr,
uint8_t size)
{
struct dev *dev = bus_find_dev(bus, addr); // Traverse the linked list
if (dev && addr + size - 1 <= dev->base + dev->len - 1) {
dev->do_io(dev->owner, data, is_write, addr - dev->base, size);
}
}
```
In the implementation, the `io_bus` and `mmio_bus` handle `KVM_EXIT_IO` and `KVM_EXIT_MMIO` events respectively, and a `pci_bus` manages the PCI devices' configuration space.
[Go back to VM Platform Devices Initialization in arm64](#VM-Platform-Devices-Initialization-in-arm64)
## PCI
### PCI Definition
> Reference: [OS Dev: PCI](https://wiki.osdev.org/PCI)
The PCI architecture is as follows: the Host Bridge is responsible for connecting the CPU and managing all PCI devices and buses. Devices are classified as either endpoint devices or bridges; a bridge serves to interconnect two separate buses.

A PCI logical device provides **256 bytes of Configuration Space** used to perform the device’s configuration and initialization.
In `/usr/include/linux/pci_regs.h`:
```c
/*
* Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
* configuration space. PCI-X Mode 2 and PCIe devices have 4096 bytes of
* configuration space.
*/
#define PCI_CFG_SPACE_SIZE 256
#define PCI_CFG_SPACE_EXP_SIZE 4096
```
The CPU cannot directly access this space and must instead rely on a special mechanism provided by the **PCI Host Bridge** to facilitate access to the configuration registers.
Under Intel’s architecture, this mechanism employs two I/O ports: *CONFIG_ADDRESS* (`0xCF8`) and *CONFIG_DATA* (`0xCFC`). The CPU first writes the target configuration register’s address to `0xCF8`; subsequently, reading from or writing to `0xCFC` completes the operation on that register.

In [`src/pci.h`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.h):
```c
union pci_config_address {
struct {
unsigned reg_offset : 2;
unsigned reg_num : 6;
unsigned func_num : 3;
unsigned dev_num : 5;
unsigned bus_num : 8;
unsigned reserved : 7;
unsigned enable_bit : 1;
}; // Little endian
uint32_t value;
};
```

The Bus Number, in conjunction with the Device Number, is used to identify a physical PCI device. Each device may offer multiple functions, and each function is treated as a separate logical device; the combination of
> Bus Number : Device Number : Function Number
distinguishes each logical device.
The least significant byte selects the offset into the 256-byte configuration space available through this method. Since all reads and writes must be 32 bits wide and aligned to work on all implementations, the two lowest bits of CONFIG_ADDRESS (0xCF8) must always be zero, with the remaining six bits allowing you to choose each of the 64 32-bit words.
In [`src/pci.h`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.h):
```c
struct pci {
union pci_config_address pci_addr;
struct bus pci_bus;
struct dev pci_bus_dev;
struct dev pci_addr_dev;
struct dev pci_mmio_dev;
};
```
### PCI Initialization
In [`src/pci.c`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.c):
```c
#define PCI_CONFIG_ADDR 0xCF8
#define PCI_CONFIG_DATA 0xCFC
#define PCI_MMIO_SIZE (1UL << 16)
void pci_init(struct pci *pci)
{
dev_init(&pci->pci_addr_dev, PCI_CONFIG_ADDR, sizeof(uint32_t), pci,
pci_address_io);
dev_init(&pci->pci_bus_dev, PCI_CONFIG_DATA, sizeof(uint32_t), pci,
pci_data_io);
dev_init(&pci->pci_mmio_dev, 0, PCI_MMIO_SIZE, pci, pci_mmio_io);
bus_init(&pci->pci_bus);
}
```
The callback associated with `pci_bus_dev` dispatches through the `pci_bus` to locate the registered `virtio-blk` device, thereby performing reads and writes within the PCI device’s configuration space.
```c
static void pci_address_io(void *owner,
void *data,
uint8_t is_write,
uint64_t offset,
uint8_t size)
{
struct pci *pci = (struct pci *) owner;
void *p = (void *) ((uintptr_t) &pci->pci_addr + offset);
/* The data in port 0xCF8 is treated as an address when Guest Linux accesses
 * the configuration space.
 */
if (is_write)
memcpy(p, data, size);
else
memcpy(data, p, size);
pci->pci_addr.reg_offset = 0;
}
#define PCI_ADDR_ENABLE_BIT (1UL << 31)
static void pci_data_io(void *owner,
void *data,
uint8_t is_write,
uint64_t offset,
uint8_t size)
{
struct pci *pci = (struct pci *) owner;
if (pci->pci_addr.enable_bit) {
uint64_t addr = (pci->pci_addr.value | offset) & ~(PCI_ADDR_ENABLE_BIT);
bus_handle_io(&pci->pci_bus, data, is_write, addr, size);
}
}
static void pci_mmio_io(void *owner,
void *data,
uint8_t is_write,
uint64_t offset,
uint8_t size)
{
struct pci *pci = (struct pci *) owner;
bus_handle_io(&pci->pci_bus, data, is_write, offset, size);
}
```
In [`src/pci.h`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.h):
```c
#define PCI_HDR_READ(hdr, offset, width) \
(*((uint##width##_t *) ((uintptr_t) hdr + offset)))
#define PCI_HDR_WRITE(hdr, offset, value, width) \
((uint##width##_t *) ((uintptr_t) hdr + offset))[0] = value
struct pci_dev {
uint8_t cfg_space[PCI_CFG_SPACE_SIZE]; // Configuration space
void *hdr; // Pointer to the cfg_space above, init in `pci_dev_init`
uint32_t bar_size[6];
bool bar_active[6];
bool bar_is_io_space[6];
struct dev space_dev[6];
struct dev config_dev;
struct bus *io_bus;
struct bus *mmio_bus;
struct bus *pci_bus;
};
```
In [`src/pci.c`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.c):
```c
void pci_dev_init(struct pci_dev *dev,
struct pci *pci,
struct bus *io_bus,
struct bus *mmio_bus)
{
memset(dev, 0x00, sizeof(struct pci_dev));
dev->hdr = dev->cfg_space;
dev->pci_bus = &pci->pci_bus;
dev->io_bus = io_bus;
dev->mmio_bus = mmio_bus;
}
```
[Go back to VM Platform Devices Initialization in arm64](#VM-Platform-Devices-Initialization-in-arm64)
### VirtIO PCI Header
The first 64 bytes of the PCI Header *Type 0* configuration space are common to all devices, whereas the remaining 192 bytes are defined individually by each device. The common portion is illustrated in the table below.
- Vendor ID: Identifies the device manufacturer; for VirtIO, this value is 0x1AF4.
- Device ID: The identifier assigned to the device; for Virtio-blk, this value is 0x1042.
- Command: Used to configure the device’s operational settings; this register is writable.
- Status: Indicates the current status of the device.
- Class Code: Specifies the device’s category or class.
- Base Address Register (BAR): Denotes the address to which the device’s internal memory space is mapped; this register is also writable.

[Go back to VM Platform Devices Initialization in arm64](#VM-Platform-Devices-Initialization-in-arm64)
### Base Address Register (BAR)
Base Address Registers (BARs) can be used to hold memory addresses used by the device, or offsets for port addresses. Typically, memory-space BARs must be located in physical RAM, while I/O-space BARs can reside at any memory address (even beyond physical memory). To distinguish between them, check the value of the lowest bit. The following tables describe the two types of BARs:
1. Memory Space BAR Layout
| Bits 31-4 | Bit 3 | Bits 2-1 | Bit 0 |
| ---------------------------- | ------------ | -------- | -------- |
| 16-Byte Aligned Base Address | Prefetchable | Type | Always 0 |
2. I/O Space BAR Layout
| Bits 31-2 | Bit 1 | Bits 0 |
| --------------------------- | -------- | -------- |
| 4-Byte Aligned Base Address | Reserved | Always 1 |
[Go back to VM Platform Devices Initialization in arm64](#VM-Platform-Devices-Initialization-in-arm64)
### Capability List
The <font color=00FFFF>device-specific region</font> is composed of a <font color=00FFFF>Capability List</font>. <font color=00FFFF>**Bit 4**</font> of the <font color=00FFFF>Status register</font> indicates whether the device implements a Capability List, and the starting offset of the Capability List is fixed at 0x34.
| Bit 15 | Bit 14 | Bit 13 | Bit 12 | Bit 11 | Bits 9-10 | Bit 8 | Bit 7 | Bit 6 | Bit 5 | <font color=FF0000>Bit 4</font> | Bit 3 | Bits 0-2 |
| --------------------- | --------------------- | --------------------- | --------------------- | --------------------- | ------------- | ------------------------ | ------------------------- | -------- | -------------- | ------------------------------------------- | ---------------- | -------- |
| Detected Parity Error | Signaled System Error | Received Master Abort | Received Target Abort | Signaled Target Abort | DEVSEL Timing | Master Data Parity Error | Fast Back-to-Back Capable | Reserved | 66 MHz Capable | <font color=FF0000>Capabilities List</font> | Interrupt Status | Reserved |
| RW1C | RW1C | RW1C | RW1C | RW1C | RO | RW1C | RO | RO | RO | <font color=FF0000>RO</font> | RO | |
To traverse the capability list: the low 8 bits of a capability register are the capability ID (0x05 for MSI among the <font color=FF0000>standard</font> PCI capability IDs), and the next 8 bits are the offset (in PCI Configuration Space) of the next capability.

The <font color=FF0000>standard</font> capability ID can be found in [PCI Code ID, Page 22-23](https://pcisig.com/sites/default/files/files/PCI_Code-ID_r_1_11__v24_Jan_2019.pdf).
### Capability List for VirtIO
For <font color=00FFFF>VirtIO devices</font>, the capability ID deviates from the standard specification; the definitions are provided below.
```c
/* Common configuration */
#define VIRTIO_PCI_CAP_COMMON_CFG 1
/* Notifications */
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
/* ISR Status */
#define VIRTIO_PCI_CAP_ISR_CFG 3
/* Device specific configuration */
#define VIRTIO_PCI_CAP_DEVICE_CFG 4
/* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG 5
```
This reference is taken from [Chapter 4.1.4, “Virtio Structure PCI Capabilities,”](https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-1090004) in the [Virtio version 1.1](https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html) specification published by OASIS Open.
In [`src/virtio-pci.h`](https://github.com/sysprog21/kvm-host/blob/master/src/virtio-pci.h):
```c
struct virtio_pci_dev {
struct pci_dev pci_dev;
struct virtio_pci_config config;
uint64_t device_feature;
uint64_t guest_feature;
struct virtio_pci_notify_cap *notify_cap;
struct virtio_pci_cap *dev_cfg_cap;
struct virtq *vq;
};
```
In [`src/virtio-pci.c`](https://github.com/sysprog21/kvm-host/blob/master/src/virtio-pci.c):
```c
void virtio_pci_init(struct virtio_pci_dev *dev,
struct pci *pci,
struct bus *io_bus,
struct bus *mmio_bus)
{
/* The capability list begins at offset 0x40 of pci config space */
uint8_t cap_list = 0x40;
memset(dev, 0x00, sizeof(struct virtio_pci_dev));
pci_dev_init(&dev->pci_dev, pci, io_bus, mmio_bus);
// Set Vendor ID to 0x1AF4 for VirtIO
PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_VENDOR_ID, VIRTIO_PCI_VENDOR_ID, 16);
// Set Capability list pointer to 0x40
PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_CAPABILITY_LIST, cap_list, 8);
// Set Header type to type 0 (normal)
PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_HEADER_TYPE, PCI_HEADER_TYPE_NORMAL, 8);
// Set the interrupt pin to INTA#
PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_INTERRUPT_PIN, 1, 8);
// Enable capability list (bit 4) and status interrupt (bit 3)
pci_set_status(&dev->pci_dev, PCI_STATUS_CAP_LIST | PCI_STATUS_INTERRUPT);
pci_set_bar(&dev->pci_dev, 0, 0x100, PCI_BASE_ADDRESS_SPACE_MEMORY,
virtio_pci_space_io);
virtio_pci_set_cap(dev, cap_list);
dev->device_feature |=
(1ULL << VIRTIO_F_RING_PACKED) | (1ULL << VIRTIO_F_VERSION_1);
}
```
In `/usr/include/linux/pci_regs.h`:
```c
#define PCI_BASE_ADDRESS_0 0x10 /* 32 bits */
#define PCI_BASE_ADDRESS_SPACE 0x01 /* 0 = memory, 1 = I/O */
#define PCI_BASE_ADDRESS_SPACE_IO 0x01
#define PCI_BASE_ADDRESS_SPACE_MEMORY 0x00
```
In [`src/pci.h`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.h):
```c
#define PCI_BAR_OFFSET(bar) (PCI_BASE_ADDRESS_0 + ((bar) << 2))
```
In [`src/pci.c`](https://github.com/sysprog21/kvm-host/blob/master/src/pci.c):
```c
void pci_set_bar(struct pci_dev *dev,
uint8_t bar,
uint32_t bar_size,
bool is_io_space,
dev_io_fn do_io)
{
/* TODO: mem type, prefetch */
/* FIXME: bar_size must be power of 2 */
PCI_HDR_WRITE(dev->hdr, PCI_BAR_OFFSET(bar), is_io_space, 32);
dev->bar_size[bar] = bar_size;
dev->bar_is_io_space[bar] = is_io_space;
dev_init(&dev->space_dev[bar], 0, bar_size, dev, do_io);
}
```
## Network Device
> Reference:
> - [Red Hat: Deep dive into Virtio-networking and vhost-net](https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net)
The initialization process has two main steps:
1. Initialize the TUN/TAP device, which is implemented in [`virtio_net_init()`](#Setup-TUNTAP-Device) in `src/virtio-net.c`.
2. Initialize the virtio network device for the guest's PCI, which is implemented in [`virtio_net_init_pci()`](#Setup-VirtIO-Network-PCI-Device) in `src/virtio-net.c`.
### Setup TUN/TAP Device
> Reference:
> - [Universal TUN/TAP device driver](https://docs.kernel.org/networking/tuntap.html)
1. Open the TUN/TAP device `/dev/net/tun`:
```c
virtio_net_dev->tapfd = open("/dev/net/tun", O_RDWR);
```
2. Set up the interface request structure with:
a. TAP mode `IFF_TAP`
b. No packet information `IFF_NO_PI`
```c
#define TAP_INTERFACE "tap%d"
struct ifreq ifreq = {.ifr_flags = IFF_TAP | IFF_NO_PI};
strncpy(ifreq.ifr_name, TAP_INTERFACE, sizeof(ifreq.ifr_name));
```
3. Set up the TUN/TAP device. Upon completion, the interface name (for example, `tap0`) is assigned and can be verified by running `ip a`.
```c
ioctl(virtio_net_dev->tapfd, TUNSETIFF, &ifreq);
```
### Setup VirtIO Network PCI Device
1. Initialize the VirtIO-Net device.
```c
virtio_net_setup(virtio_net_dev);
```
It will:
A. Set up the following in `virtio_net_dev`:
- IRQ number
- Tx/Rx, IRQ event file descriptor
- Register the IRQ file descriptor and the IRQ number to KVM.
```c
struct kvm_irqfd irqfd = {
.fd = fd,
.gsi = gsi,
.flags = flags,
};
ioctl(v->vm_fd, KVM_IRQFD, &irqfd);
```
- Initialize the virtqueues and attach their operations.
```c
#define VIRTQ_RX 0
#define VIRTQ_TX 1
#define VIRTIO_NET_VIRTQ_NUM 2
static struct virtq_ops virtio_net_ops[VIRTIO_NET_VIRTQ_NUM] = {
[VIRTQ_RX] = {.enable_vq = virtio_net_enable_vq_rx,
.complete_request = virtio_net_complete_request_rx,
.notify_used = virtio_net_notify_used_rx},
[VIRTQ_TX] = {.enable_vq = virtio_net_enable_vq_tx,
.complete_request = virtio_net_complete_request_tx,
.notify_used = virtio_net_notify_used_tx},
};
for (int i = 0; i < VIRTIO_NET_VIRTQ_NUM; i++) {
struct virtq_ops *ops = &virtio_net_ops[i];
dev->vq[i].info.notify_off = i;
virtq_init(&dev->vq[i], dev, ops);
}
```
2. Initialize the VirtIO-PCI device and register it on the PCI bus, IO bus, and MMIO bus.
```c
struct virtio_pci_dev *dev = &virtio_net_dev->virtio_pci_dev;
virtio_pci_init(dev, pci, io_bus, mmio_bus);
```
3. Set up the VirtIO-PCI configuration space.
```c
virtio_pci_set_dev_cfg(dev, &virtio_net_dev->config,
sizeof(virtio_net_dev->config));
```
4. Set up the [VirtIO-PCI CFG header](#VirtIO-PCI-Header).
- According to [VirtIO 1.1](https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-1020002), the network card device ID is `0x1041`.
- The PCI class `0x020000` denotes an Ethernet controller.
- The class code `0x02` means a network controller.
- The subclass code is `0x00` (Ethernet).
- The programming interface is `0x00`.
- Reference: [PCI Code and ID Assignment Specification](https://pcisig.com/sites/default/files/files/PCI_Code-ID_r_1_11__v24_Jan_2019.pdf)
- Set the Interrupt Request Line
```c
#define VIRTIO_PCI_DEVICE_ID_NET 0x1041
#define VIRTIO_NET_PCI_CLASS 0x020000
virtio_pci_set_pci_hdr(dev,
VIRTIO_PCI_DEVICE_ID_NET,
VIRTIO_NET_PCI_CLASS,
virtio_net_dev->irq_num);
```
5. Set up the VirtIO-PCI notify capability.
```c
#define NOTIFY_OFFSET 2
dev->notify_cap->notify_off_multiplier = NOTIFY_OFFSET;
```
6. Set up the virtqueue.
```c
virtio_pci_set_virtq(dev, virtio_net_dev->vq, VIRTIO_NET_VIRTQ_NUM);
```
7. Add a feature bit.
```c
#define VIRTIO_NET_F_MQ 22 // Device supports multiqueue with automatic receive steering
virtio_pci_add_feature(dev, VIRTIO_NET_F_MQ);
```
8. Enable the VirtIO-PCI device.
```c
virtio_pci_enable(dev);
```