These notes cover the initialization and configuration process of the Linux KVM.
In `src/main.c`:

- `vm_init(&vm)`: Initialize the virtual machine.
- `vm_load_image(&vm, kernel_file)`: Load the kernel image file.
- `vm_load_initrd(&vm, initrd_file)`: Load the initial RAM filesystem.
- `vm_load_diskimg(&vm, diskimg_file)`:
  - `diskimg_init(&v->diskimg, diskimg_file)`: Initialize the disk image.
  - `virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus, &v->mmio_bus)`: Initialize the virtio block.
- `vm_enable_net(&vm)`
- `vm_late_init(&vm)`: Final initialization for specific architecture.
- `vm_run(&vm)`: Start the virtual machine.
- `vm_exit(&vm)`: Exit the virtual machine.

`vm_arch_init()`
`mmap()` in the previous step to the guest OS. `RAM_BASE` is 0 on x86 systems and 2 GB on arm64 systems.
`vm_arch_cpu_init()`
bus_init()
vm_arch_init_platform_device(v)
- Use `mmap()` to map each file into memory.
- Call `vm_arch_load_image()` with the base memory address and size parameters.
diskimg_init(&v->diskimg, diskimg_file);
virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus, &v->mmio_bus);
generate_fdt(v)
init_reg(v)
When a virtual CPU is executing guest code, it runs in non-root mode, allowing most instructions to execute directly on the hardware. This enables high performance with minimal overhead. However, certain operations, such as accessing I/O ports, modifying control registers, or executing privileged system instructions, are not permitted in this mode. When the CPU encounters such an instruction or a predefined condition, it performs a VM exit (VMEXIT), which transitions control from non-root to root mode, handing execution back to the kernel.

Page 70 in The Conceptual Guide to the Linux Kernel v1.1.2025 by Moon Hee Lee
- `KVM_SET_TSS_ADDR`: Defines the physical address range spanning three pages to configure the Task State Segment.
- `KVM_SET_IDENTITY_MAP_ADDR`: Defines the physical address range spanning one page to configure the identity map (page table).
- `KVM_CREATE_IRQCHIP`: Creates a virtual Programmable Interrupt Controller (PIC).
- `KVM_CREATE_PIT2`: Creates a virtual Programmable Interval Timer (PIT).

References:

- sysprog: Linux KVM
- Booting AArch64 Linux
- ARM GICv3 and GICv4 Software Overview Release B
- Linux KVM API
- ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
- kvmtool
KVM provides a virtualized interrupt controller with support for both GICv2 and GICv3. Which version can be created depends on the host hardware: if the host uses GICv3 hardware that does not offer GICv2 emulation, only a virtual GICv3 interrupt controller can be instantiated. Conversely, if the host’s interrupt controller is GICv2, only a virtual GICv2 controller can be created.
By leveraging the virtualized interrupt controller supplied by KVM, there is no need to implement interrupt-controller emulation in user space; one can instead rely directly on the implementation provided within the Linux kernel.
The eMAG 8180 host only supports the creation of a virtual GICv3 interrupt controller; accordingly, the following discussion will focus on GICv3.
Below is the architecture of the GICv3 interrupt controller:
Reference: GICv3 and v4 Software Overview Chapter 3.5, Page 16
In GICv3, the CPU Interface is accessed through system registers, which are read with the MRS instruction and written with the MSR instruction. Each CPU core has its own Redistributor, whereas a single Distributor serves the entire system. Both the Redistributor and the Distributor are accessed via MMIO.
Once a virtual machine has been created using the `KVM_CREATE_VM` ioctl, the interrupt controller can be instantiated by issuing the `KVM_CREATE_DEVICE` ioctl for that VM.
The `KVM_CREATE_DEVICE` ioctl must be provided with a `struct kvm_create_device`, as shown below:
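A minimal sketch of this call, assuming a Linux host that ships `<linux/kvm.h>`. The helper name `create_vgic_v3()` is hypothetical and is not taken from the project source; only the struct, the device type constant, and the ioctl come from the KVM API.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical helper: ask KVM to create an in-kernel GICv3 for this VM.
 * Returns the new device file descriptor, or -1 if the ioctl fails. */
static int create_vgic_v3(int vm_fd)
{
    struct kvm_create_device dev = {
        .type = KVM_DEV_TYPE_ARM_VGIC_V3, /* which device model to create */
        .flags = 0,
    };

    if (ioctl(vm_fd, KVM_CREATE_DEVICE, &dev) < 0)
        return -1;

    return (int)dev.fd; /* filled in by the kernel on success */
}
```

On a host without GICv3 support the ioctl is expected to fail, so the caller should check the return value before storing the descriptor.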
After creation, the `fd` member of the `struct kvm_create_device` holds the file descriptor for this GICv3 interrupt controller, which is then stored in the arm64-specific private data structure `vm_arch_priv_t`.
Next, the MMIO addresses for the Redistributor and Distributor must be configured. To do this, the `KVM_SET_DEVICE_ATTR` ioctl is used, with the file descriptor set to the GICv3 fd obtained earlier. The argument passed to this ioctl is a `struct kvm_device_attr`. The procedure is as follows:
Since the `.addr` field must hold a pointer to a `uint64_t`, we first declare a local variable and then obtain its address using the `&` operator.
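The pointer-in-`.addr` convention can be sketched as follows. This is an illustration under assumptions, not the project's code: `set_device_attr_u64()` is a hypothetical helper, and the group and attribute values are left as parameters so the snippet builds on any architecture.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical helper: write one 64-bit attribute to a KVM device fd. */
static int set_device_attr_u64(int dev_fd, uint32_t group, uint64_t attr,
                               uint64_t value)
{
    struct kvm_device_attr da = {
        .group = group,
        .attr = attr,
        /* .addr carries a user-space pointer to the value,
         * not the value itself. */
        .addr = (uint64_t)(uintptr_t)&value,
    };

    return ioctl(dev_fd, KVM_SET_DEVICE_ATTR, &da);
}
```

On arm64, the Distributor base would typically be set with group `KVM_DEV_ARM_VGIC_GRP_ADDR` and attribute `KVM_VGIC_V3_ADDR_TYPE_DIST`, and the Redistributor base with `KVM_VGIC_V3_ADDR_TYPE_REDIST`; the guest physical addresses themselves are a platform choice.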
After GICv3 has been created and all required vCPUs have been instantiated, additional initialization is necessary to allow the VM to function correctly. The GICv3 initialization takes place within the `finalize_irqchip()` function in `src/arch/arm64/vm.c`, which is invoked by `vm_arch_init_platform_device()`.
On ARM64, the vCPU can be initialized by invoking the `KVM_ARM_VCPU_INIT` ioctl on the `vcpu_fd`, where a pointer to a `struct kvm_vcpu_init` must be supplied as the argument. The `struct kvm_vcpu_init` itself can be obtained directly by issuing the `KVM_ARM_PREFERRED_TARGET` ioctl on the `vm_fd`.
At this stage, we initialize the system bus, PCI bus, and serial device.
Finally, we must invoke the `KVM_SET_DEVICE_ATTR` ioctl on the file descriptor of the GICv3 that was just created, supplying a `struct kvm_device_attr` as the argument, as follows:
Once GICv3 has been initialized, no further vCPUs may be created.
According to the Linux kernel documentation, the decompressed kernel image contains a 64-byte header as follows:
First, we check whether the magic number is valid.
The document states:
Prior to v3.17, the endianness of text_offset was not specified. In these cases image_size is zero and text_offset is 0x80000 in the endianness of the kernel. Where image_size is non-zero image_size is little-endian and must be respected. Where image_size is zero, text_offset can be assumed to be 0x80000.
After loading the kernel, the address at which execution begins (i.e., the address of the first instruction) is the location where the image was loaded, namely the address of `code0`; therefore, this address is recorded and stored in the `entry` member variable of `vm_arch_priv_t`.
The document also states:
The Image must be placed text_offset bytes from a 2MB aligned base address anywhere in usable system RAM and called there.
Therefore, a 2 MB-aligned memory address must be selected and the `text_offset` value from the header added to it; the resulting address is where the kernel image is placed. The `image_size` field gives the amount of memory, measured from the start of the kernel image, that must be reserved as usable space for the kernel.
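The header checks and the placement rule can be sketched together. The snippet below is an assumption-laden illustration rather than the project's loader: the function and macro names are hypothetical, and the field offsets follow the 64-byte header layout in "Booting AArch64 Linux" (`text_offset` at byte 8, `image_size` at byte 16, `magic` at byte 56).

```c
#include <stddef.h>
#include <stdint.h>

#define ARM64_IMAGE_MAGIC   0x644d5241u  /* "ARM\x64" read as little-endian */
#define ARM64_HDR_SIZE      64
#define DEFAULT_TEXT_OFFSET 0x80000ull   /* used when image_size is zero */
#define SZ_2M               (2ull << 20)

/* Read an n-byte little-endian value regardless of host endianness. */
static uint64_t read_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Hypothetical helper: validate the header and compute where the image
 * should be placed. Returns 0 and stores the address in *load_addr, or
 * -1 if the header is invalid or base is not 2 MB aligned. */
static int arm64_image_load_addr(const uint8_t *img, size_t len,
                                 uint64_t base, uint64_t *load_addr)
{
    if (len < ARM64_HDR_SIZE || base % SZ_2M != 0)
        return -1;
    if (read_le(img + 56, 4) != ARM64_IMAGE_MAGIC)
        return -1;

    uint64_t image_size = read_le(img + 16, 8);
    /* Pre-v3.17 images report image_size == 0; assume 0x80000 then. */
    uint64_t text_offset = image_size ? read_le(img + 8, 8)
                                      : DEFAULT_TEXT_OFFSET;

    *load_addr = base + text_offset;
    return 0;
}
```

For example, with a base of `0x80000000` and a header whose `image_size` is zero, the computed load address is `0x80080000`.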
In addition to loading the kernel, the initramfs must also be loaded as the first filesystem during boot. The document states:
If an initrd/initramfs is passed to the kernel at boot, it must reside entirely within a 1 GB aligned physical memory window of up to 32 GB in size that fully covers the kernel Image as well.
This indicates that the initramfs and the Linux kernel must both reside within the same 32 GB window aligned on a 1 GB boundary; however, it does not specify whether the initramfs itself must be aligned.
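One way to check this constraint before booting is sketched below. The function name and parameters are hypothetical; the logic takes the lowest address used by either blob, rounds it down to a 1 GB boundary, and verifies that both blobs end within 32 GB of that boundary.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G (1ull << 30)

/* Hypothetical check: do the kernel Image and the initramfs both fit in
 * one 1 GB aligned window of at most 32 GB? */
static bool fits_boot_window(uint64_t kernel_start, uint64_t kernel_size,
                             uint64_t initrd_start, uint64_t initrd_size)
{
    uint64_t kernel_end = kernel_start + kernel_size;
    uint64_t initrd_end = initrd_start + initrd_size;
    uint64_t lowest = kernel_start < initrd_start ? kernel_start : initrd_start;
    uint64_t highest = kernel_end > initrd_end ? kernel_end : initrd_end;

    /* The best candidate window starts at the highest 1 GB boundary
     * that is still at or below the lowest address in use. */
    uint64_t window_base = lowest & ~(SZ_1G - 1);

    return highest - window_base <= 32 * SZ_1G;
}
```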
Prior to booting the kernel, the physical memory address at which the device tree resides must be passed to the kernel via the `x0` register.
kvmtool generates the device tree using libfdt, and we can adopt the same approach. libfdt is a library included within the dtc package.
For implementation of the device tree, refer to kvmtool’s implementation. This is the DTB dump produced by kvmtool.
The Device Tree primarily defines the following:
Procedure for generating the device tree using libfdt:
1. Call the `fdt_create()` function to specify the buffer and its size for placing the device tree; this creates an empty device tree.
2. Call `fdt_begin_node()` to add a node. Since the root node must be added first, invoke `fdt_begin_node(fdt, "")`.
3. Add properties to the node with `fdt_property()`, `fdt_property_cell()`, `fdt_property_u64()`, and so on. When using `fdt_property()`, libfdt copies the data verbatim; however, because values in the device tree must be represented in big-endian format, you must pair `fdt_property()` with `cpu_to_fdt32()` or `cpu_to_fdt64()` to convert endianness.
4. Call `fdt_end_node()` to close that node.
5. Call `fdt_finish()` to complete the device tree. Once `fdt_finish()` returns, provided that every `fdt_begin_node()` call has a matching `fdt_end_node()`, the contents of the buffer constitute a valid device tree.

References:

- Booting AArch64 Linux
- kvmtool
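To make the endianness requirement in the libfdt procedure concrete, the sketch below shows the byte layout a property cell must have in the blob. It does not use libfdt: `put_fdt32()` and `put_fdt64()` are hypothetical stand-ins that only illustrate the effect of `cpu_to_fdt32()` and `cpu_to_fdt64()`.

```c
#include <stdint.h>

/* Hypothetical stand-in for the effect of cpu_to_fdt32(): store a 32-bit
 * cell in big-endian byte order, independent of host endianness. */
static void put_fdt32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* A 64-bit value occupies two consecutive cells, high cell first. */
static void put_fdt64(uint8_t out[8], uint64_t v)
{
    put_fdt32(out, (uint32_t)(v >> 32));
    put_fdt32(out + 4, (uint32_t)v);
}
```

On a little-endian host, passing a raw `uint32_t` to `fdt_property()` without this conversion would store the bytes in reverse order, and the kernel would read a different value.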
Before jumping into the kernel, the following conditions must be met:
The `__REG` macro can be used to generate the register ID within the `struct kvm_one_reg` passed to the `KVM_SET_ONE_REG` ioctl.
The bus handles the mapping between addresses and devices, using a singly linked list to manage the devices.
The `owner` parameter points to the structure that owns this device and is passed to the callback.
In `src/bus.h`:
Initialize the singly linked list of the bus.
In `src/bus.c`:
Initialize the device structure.
Insert the device into the linked list of the bus.
Remove the device from the linked list of the bus.
Use the following function to issue an I/O request to the bus. It locates the target device based on the device’s `base` and `len`, then invokes the `do_io` callback.
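The operations described above can be sketched as a small self-contained bus. Apart from `bus_init`, `base`, `len`, `owner`, and `do_io`, which appear in these notes, every name and signature here is an assumption made for illustration and may differ from `src/bus.h`.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*dev_io_fn)(void *owner, void *data, uint8_t is_write,
                          uint64_t offset, uint8_t size);

struct dev {
    uint64_t base, len;  /* address range claimed by the device */
    void *owner;         /* handed back to the callback */
    dev_io_fn do_io;
    struct dev *next;    /* singly linked list link */
};

struct bus { struct dev *head; };

static void bus_init(struct bus *bus) { bus->head = NULL; }

static void dev_init(struct dev *dev, uint64_t base, uint64_t len,
                     void *owner, dev_io_fn do_io)
{
    dev->base = base;
    dev->len = len;
    dev->owner = owner;
    dev->do_io = do_io;
    dev->next = NULL;
}

static void bus_register_dev(struct bus *bus, struct dev *dev)
{
    dev->next = bus->head;  /* insert at the head of the list */
    bus->head = dev;
}

static void bus_deregister_dev(struct bus *bus, struct dev *dev)
{
    for (struct dev **p = &bus->head; *p; p = &(*p)->next) {
        if (*p == dev) {
            *p = dev->next;
            dev->next = NULL;
            return;
        }
    }
}

/* Returns 0 if a device handled the access, -1 if no device claims addr. */
static int bus_handle_io(struct bus *bus, void *data, uint8_t is_write,
                         uint64_t addr, uint8_t size)
{
    for (struct dev *d = bus->head; d; d = d->next) {
        if (addr >= d->base && addr - d->base < d->len) {
            d->do_io(d->owner, data, is_write, addr - d->base, size);
            return 0;
        }
    }
    return -1;
}

/* Example device for illustration: a one-byte scratch register. */
static void scratch_io(void *owner, void *data, uint8_t is_write,
                       uint64_t offset, uint8_t size)
{
    uint8_t *reg = owner;
    (void)offset;
    (void)size;
    if (is_write)
        *reg = *(uint8_t *)data;
    else
        *(uint8_t *)data = *reg;
}
```

A linear walk is adequate here because a small VM registers only a handful of devices per bus; the callback receives the offset relative to `base`, so a device does not need to know where it is mapped.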
In the implementation, there are `io_bus` and `mmio_bus` components to handle `KVM_EXIT` events, as well as a `pci_bus` to manage the PCI device’s configuration space.
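How an exit might be routed to one of these buses can be sketched as below. The enum and `classify_exit()` are hypothetical names; the `exit_reason` field and the `KVM_EXIT_*` constants come from `<linux/kvm.h>`. The actual run loop in the project may handle more exit reasons.

```c
#include <linux/kvm.h>

/* Hypothetical routing decision for one VM exit. */
typedef enum {
    ROUTE_IO_BUS,    /* port I/O: forward to io_bus */
    ROUTE_MMIO_BUS,  /* memory-mapped I/O: forward to mmio_bus */
    ROUTE_STOP,      /* guest halted or shut down */
    ROUTE_UNHANDLED,
} exit_route_t;

static exit_route_t classify_exit(const struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        return ROUTE_IO_BUS;
    case KVM_EXIT_MMIO:
        return ROUTE_MMIO_BUS;
    case KVM_EXIT_HLT:
    case KVM_EXIT_SHUTDOWN:
        return ROUTE_STOP;
    default:
        return ROUTE_UNHANDLED;
    }
}
```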
Reference: OS Dev: PCI
The PCI architecture is as follows: the Host Bridge is responsible for connecting the CPU and managing all PCI devices and buses. Devices are classified as either endpoint devices or bridges; a bridge serves to interconnect two separate buses.
A PCI logical device provides 256 bytes of Configuration Space used to perform the device’s configuration and initialization.
In `/usr/include/linux/pci_regs.h`:
The CPU cannot directly access this space and must instead rely on a special mechanism provided by the PCI Host Bridge to facilitate access to the configuration registers.
Under Intel’s architecture, this mechanism employs two I/O ports: 0xCF8 and 0xCFC. The CPU first writes the target configuration register’s address to 0xCF8; subsequently, reading from or writing to 0xCFC completes the operation on that register.
In `src/pci.h`:
The Bus Number, in conjunction with the Device Number, is used to identify a physical PCI device. Each device may offer multiple functions, and each function is treated as a separate logical device; the combination of `Bus Number : Device Number : Function Number` distinguishes each logical device.
The least significant byte selects the offset into the 256-byte configuration space available through this method. Since all reads and writes must be both 32-bits and aligned to work on all implementations, the two lowest bits of CONFIG_ADDRESS (0xCF8) must always be zero, with the remaining six bits allowing you to choose each of the 64 32-bit words.
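The value written to 0xCF8 can be packed and unpacked as sketched below. The struct and function names are hypothetical; the bit positions follow the standard PCI Configuration Mechanism #1 layout (enable in bit 31, bus in bits 23-16, device in bits 15-11, function in bits 10-8, register offset in bits 7-0).

```c
#include <stdint.h>

/* Hypothetical decoded view of a CONFIG_ADDRESS (0xCF8) value. */
struct pci_cfg_addr {
    uint8_t enable;  /* bit 31 */
    uint8_t bus;     /* bits 23-16 */
    uint8_t dev;     /* bits 15-11 */
    uint8_t func;    /* bits 10-8 */
    uint8_t reg;     /* bits 7-0, with the two lowest bits cleared */
};

static struct pci_cfg_addr pci_decode_cfg_addr(uint32_t v)
{
    struct pci_cfg_addr a = {
        .enable = (v >> 31) & 0x1,
        .bus = (v >> 16) & 0xff,
        .dev = (v >> 11) & 0x1f,
        .func = (v >> 8) & 0x7,
        .reg = v & 0xfc,  /* force 32-bit alignment */
    };
    return a;
}

static uint32_t pci_encode_cfg_addr(uint8_t bus, uint8_t dev, uint8_t func,
                                    uint8_t reg)
{
    return (1u << 31)
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1f) << 11)
         | ((uint32_t)(func & 0x7) << 8)
         | (reg & 0xfcu);
}
```

For instance, bus 0, device 3, function 1, register 0x10 encodes to `0x80001910`.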
In `src/pci.h`:
In `src/pci.c`:
The callback associated with `pci_bus_dev` invokes the `pci_bus` to locate the registered `virtio-blk` device, thereby performing read and write operations within the PCI device’s configuration space.
In `src/pci.h`:
In `src/pci.c`:
The first 64 bytes of the PCI Header Type 0 configuration space are common to all devices, whereas the remaining 192 bytes are defined individually by each device. The common portion is illustrated in the table below.
Base Address Registers (or BARs) can be used to hold memory addresses used by the device, or offsets for port addresses. Typically, memory address BARs need to be located in physical ram while I/O space BARs can reside at any memory address (even beyond physical memory). To distinguish between them, you can check the value of the lowest bit. The following tables describe the two types of BARs:
Memory Space BAR Layout
Bits 31-4 | Bit 3 | Bits 2-1 | Bit 0 |
---|---|---|---|
16-Byte Aligned Base Address | Prefetchable | Type | Always 0 |
I/O Space BAR Layout
Bits 31-2 | Bit 1 | Bit 0 |
---|---|---|
4-Byte Aligned Base Address | Reserved | Always 1 |
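The two layouts can be told apart in code as sketched below. `struct bar_info` and `decode_bar()` are hypothetical names used only for illustration; the masks follow the two tables above and cover 32-bit BARs only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one 32-bit BAR value. */
struct bar_info {
    bool is_io;         /* bit 0 set: I/O space BAR */
    bool prefetchable;  /* memory BARs only: bit 3 */
    uint8_t type;       /* memory BARs only: bits 2-1 */
    uint32_t base;      /* base address with the flag bits masked off */
};

static struct bar_info decode_bar(uint32_t bar)
{
    struct bar_info info = {0};

    if (bar & 0x1) {
        info.is_io = true;
        info.base = bar & ~0x3u;  /* 4-byte aligned */
    } else {
        info.prefetchable = (bar >> 3) & 0x1;
        info.type = (bar >> 1) & 0x3;
        info.base = bar & ~0xfu;  /* 16-byte aligned */
    }
    return info;
}
```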
The device-specific region is composed of a Capability List. Bit 4 of the Status register indicates whether the device implements a Capability List, and the offset of the first capability is stored in the Capabilities Pointer register at the fixed offset 0x34.
Bit 15 | Bit 14 | Bit 13 | Bit 12 | Bit 11 | Bits 9-10 | Bit 8 | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bits 0-2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Detected Parity Error | Signaled System Error | Received Master Abort | Received Target Abort | Signaled Target Abort | DEVSEL Timing | Master Data Parity Error | Fast Back-to-Back Capable | Reserved | 66 MHz Capable | Capabilities List | Interrupt Status | Reserved |
RW1C | RW1C | RW1C | RW1C | RW1C | RO | RW1C | RO | RO | RO | RO | RO | |
To traverse the capabilities list, read the header of each capability: the low 8 bits are the capability ID (for example, 0x05 for MSI among the standard PCI capability IDs), and the next 8 bits are the offset (in PCI Configuration Space) of the next capability. A next offset of 0 marks the end of the list.
The standard capability ID can be found in PCI Code ID, Page 22-23.
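The traversal can be sketched over a plain 256-byte copy of the configuration space. The `CFG_*` macros and `find_capability()` are hypothetical names; their values mirror the Status register offset (0x06), the Capabilities List bit (bit 4), and the Capabilities Pointer offset (0x34) defined in `pci_regs.h`.

```c
#include <stdint.h>

#define CFG_STATUS          0x06  /* Status register offset */
#define CFG_STATUS_CAP_LIST 0x10  /* bit 4: capability list present */
#define CFG_CAP_POINTER     0x34  /* offset of the first capability */

/* Hypothetical helper: return the configuration-space offset of the
 * first capability with the given ID, or 0 if it is not present. */
static uint8_t find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
    uint16_t status = (uint16_t)(cfg[CFG_STATUS] | (cfg[CFG_STATUS + 1] << 8));

    if (!(status & CFG_STATUS_CAP_LIST))
        return 0;

    uint8_t pos = cfg[CFG_CAP_POINTER] & 0xfc;

    /* The guard bounds the walk in case of a malformed, looping list. */
    for (int guard = 0; pos != 0 && guard < 48; guard++) {
        if (cfg[pos] == cap_id)       /* byte 0: capability ID */
            return pos;
        pos = cfg[pos + 1] & 0xfc;    /* byte 1: next capability offset */
    }
    return 0;
}
```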
For VirtIO devices, each capability uses the vendor-specific capability ID (0x09), and a separate `cfg_type` field identifies which VirtIO structure it describes; the definitions are provided below.
This reference is taken from Chapter 4.1.4, “Virtio Structure PCI Capabilities,” in the Virtio version 1.1 specification published by OASIS Open.
In `src/virtio-pci.h`:
In `src/virtio-pci.c`:
In `/usr/include/linux/pci_regs.h`:
In `src/pci.h`:
In `src/pci.c`: