
Linux KVM Initialization and Configuration Process

This note describes the initialization and configuration process of a virtual machine built on Linux KVM.

Overall Architecture

In src/main.c:

VM Overall Process

VM Initialization

  • Open the KVM device file.
    v->kvm_fd = open("/dev/kvm", O_RDWR);

  • Create a new virtual machine. In this step, KVM creates a virtual machine and returns the corresponding file descriptor.
    v->vm_fd = ioctl(v->kvm_fd, KVM_CREATE_VM, 0);

  • vm_arch_init()
  • Allocate the guest memory space.
    v->mem = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  • Register the memory obtained from mmap() in the previous step as guest physical memory.
    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .flags = 0,
        .guest_phys_addr = RAM_BASE,
        .memory_size = RAM_SIZE,
        .userspace_addr = (__u64) v->mem,
    };
    ioctl(v->vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

    • RAM_BASE is 0 on x86 and 2 GB on arm64.
  • Create a virtual CPU.
    v->vcpu_fd = ioctl(v->vm_fd, KVM_CREATE_VCPU, 0);

  • vm_arch_cpu_init()
  • Initialize the I/O and MMIO buses with bus_init().
    bus_init(&v->io_bus);
    bus_init(&v->mmio_bus);

  • vm_arch_init_platform_device(v)
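
Because the memory region registered above maps guest physical addresses starting at RAM_BASE onto the host buffer returned by mmap(), every later step that copies data into the guest needs a guest-to-host address translation. The following is a minimal sketch of such a translation, not the project's actual vm_guest_to_host(); the struct guest_ram type and the guest_to_host() name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one guest RAM slot; the field names are
 * illustrative and not taken from the project. */
struct guest_ram {
    uint64_t base; /* guest physical base address (RAM_BASE) */
    uint64_t size; /* size in bytes (RAM_SIZE) */
    uint8_t *host; /* host pointer returned by mmap() */
};

/* Translate a guest physical address to a host pointer. Returns NULL when
 * [gpa, gpa + len) does not lie entirely inside the slot. */
void *guest_to_host(const struct guest_ram *ram, uint64_t gpa, uint64_t len)
{
    assert(ram != NULL);

    if (gpa < ram->base)
        return NULL;

    uint64_t off = gpa - ram->base;
    /* Written this way so that off + len cannot overflow. */
    if (off > ram->size || len > ram->size - off)
        return NULL;

    return ram->host + off;
}
```

With RAM_BASE at 2 GB, as on arm64, guest address 0x80000010 resolves to byte 0x10 of the host buffer, while any address below 0x80000000 is rejected.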


VM Load Image / Initial RAM File System

  • Open the image and initrd files via their file paths.
  • Retrieve the size of each file.
  • Use mmap() to map each file into memory.
  • Call vm_arch_load_image() with the base memory address and size of each mapping.
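
The open, size, and mmap() steps above can be sketched as a single helper. This is an illustrative sketch using only POSIX calls; the name map_file_readonly() is hypothetical and the project's own loader may be structured differently.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and report its size through *size.
 * Returns NULL on any failure (including an empty file). */
void *map_file_readonly(const char *path, size_t *size)
{
    assert(path != NULL && size != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *data =
        mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* The mapping stays valid after the descriptor is closed. */
    if (data == MAP_FAILED)
        return NULL;

    *size = (size_t) st.st_size;
    return data;
}
```

The returned pointer and size are what an architecture-specific loader would then receive; the caller releases the mapping with munmap() once the contents have been copied into guest memory.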


VM Load Disk Image

  • diskimg_init(&v->diskimg, diskimg_file);
  • virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus, &v->mmio_bus);


VM Late Init

  • x86: Nothing to do.
  • arm64:
    • Create device tree: generate_fdt(v)
    • Initialize CPU registers: init_reg(v)


VM Run

When a virtual CPU is executing guest code, it runs in non-root mode, allowing most instructions to execute directly on the hardware. This enables high performance with minimal overhead. However, certain operations (such as accessing I/O ports, modifying control registers, or executing privileged system instructions) are not permitted in this mode. When the CPU encounters such an instruction or a predefined condition, it performs a VM exit (VMEXIT), which transitions control from non-root to root mode, handing execution back to the kernel.

Page 70 in The Conceptual Guide to the Linux Kernel v1.1.2025 by Moon Hee Lee


x86-Specific Process

VM Initialization in x86

The Linux/x86 Boot Protocol

  1. KVM_SET_TSS_ADDR: Defines the physical address range spanning three pages to configure the Task State Segment.
  2. KVM_SET_IDENTITY_MAP_ADDR: Defines the physical address range spanning one page to configure the identity map (page table).
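  3. KVM_CREATE_IRQCHIP: Creates the in-kernel interrupt controllers: a virtual I/O APIC, a virtual Programmable Interrupt Controller (PIC, two cascaded controllers), and a local APIC for every vCPU created afterwards.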
  4. KVM_CREATE_PIT2: Creates a virtual Programmable Interval Timer (PIT).


arm64-Specific Process

VM Initialization in arm64

References:
sysprog: Linux KVM
Booting AArch64 Linux
ARM GICv3 and GICv4 Software Overview Release B
Linux KVM API
ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
kvmtool

Create and initialize the interrupt controller

KVM provides a virtualized interrupt controller with support for both GICv2 and GICv3. Depending on the capabilities of the host hardware, if the host employs GICv3 hardware that does not offer GICv2 emulation, then only a virtual GICv3 interrupt controller can be instantiated. Conversely, if the host’s interrupt controller is GICv2, only a GICv2 virtual controller may be created.

By leveraging the virtualized interrupt controller supplied by KVM, there is no need to implement interrupt-controller emulation in user space; one can instead rely directly on the implementation provided within the Linux kernel.

The eMAG 8180 host only supports the creation of a virtual GICv3 interrupt controller; accordingly, the following discussion will focus on GICv3.

Below is the architecture of the GICv3 interrupt controller:

[Figure: GICv3 interrupt controller architecture (image not available)]

Reference: GICv3 and v4 Software Overview Chapter 3.5, Page 16

In GICv3, the CPU Interface is accessed through system registers, employing the MSR and MRS instructions for reading from and writing to those registers. Each CPU core has its own Redistributor, whereas a single Distributor serves the entire system. Both the Redistributor and the Distributor are accessed via MMIO.

Once a virtual machine has been created using the KVM_CREATE_VM ioctl, the interrupt controller can be instantiated by issuing the KVM_CREATE_DEVICE ioctl for that VM.

In src/arch/arm64/vm.c:

The KVM_CREATE_DEVICE ioctl must be provided with a struct kvm_create_device, as shown below:

struct kvm_create_device gic_device = {
    .type = KVM_DEV_TYPE_ARM_VGIC_V3,
};

ioctl(v->vm_fd, KVM_CREATE_DEVICE, &gic_device);

Upon creation, the fd member within the struct kvm_create_device can be used to retrieve the file descriptor for this GICv3 interrupt controller, which is then stored in the arm64-specific private data structure vm_arch_priv_t.

priv->gic_fd = gic_device.fd;

Next, the MMIO addresses for the Redistributor and Distributor must be configured. To do this, the KVM_SET_DEVICE_ATTR ioctl is used, with the file descriptor set to the GICv3 fd obtained earlier. The argument passed to this ioctl is a struct kvm_device_attr. The procedure is as follows:

uint64_t dist_addr = ARM_GIC_DIST_BASE;
uint64_t redist_addr = ARM_GIC_REDIST_BASE;

struct kvm_device_attr dist_attr = {
    .group	= KVM_DEV_ARM_VGIC_GRP_ADDR,
    .attr	= KVM_VGIC_V3_ADDR_TYPE_DIST,
    .addr	= (uint64_t) &dist_addr,
};

struct kvm_device_attr redist_attr = {
    .group	= KVM_DEV_ARM_VGIC_GRP_ADDR,
    .attr	= KVM_VGIC_V3_ADDR_TYPE_REDIST,
    .addr	= (uint64_t) &redist_addr,
};

ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &redist_attr);
ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &dist_attr);

Since the .addr field must hold a pointer to a uint64_t, we first declare a local variable and then obtain its address using the & operator.

After GICv3 has been created and all required vCPUs have been instantiated, additional initialization is necessary to allow the VM to function correctly. The GICv3 initialization takes place within the finalize_irqchip() function in src/arch/arm64/vm.c, which is invoked by vm_arch_init_platform_device().


VM CPU Initialization in arm64

On ARM64, the vCPU can be initialized by invoking the KVM_ARM_VCPU_INIT ioctl on the vcpu_fd, where a pointer to a struct kvm_vcpu_init must be supplied as the argument. The struct kvm_vcpu_init itself can be obtained directly by issuing the KVM_ARM_PREFERRED_TARGET ioctl on the vm_fd.

In src/arch/arm64/vm.c:

struct kvm_vcpu_init vcpu_init;
ioctl(v->vm_fd, KVM_ARM_PREFERRED_TARGET, &vcpu_init);
ioctl(v->vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init);


VM Platform Devices Initialization in arm64

At this stage, we initialize the system bus, PCI bus, and serial device.

In src/arch/arm64/vm.c:

static void pio_handler(void *owner,
                        void *data,
                        uint8_t is_write,
                        uint64_t offset,
                        uint8_t size)
{
    vm_t *v = (vm_t *) owner;
    bus_handle_io(&v->io_bus, data, is_write, offset, size);
}

/* Initialize system bus */
dev_init(&priv->iodev, ARM_IOPORT_BASE, ARM_IOPORT_SIZE, v, pio_handler);
bus_register_dev(&v->mmio_bus, &priv->iodev);

/* Initialize PCI bus */
pci_init(&v->pci);
v->pci.pci_mmio_dev.base = ARM_PCI_CFG_BASE;
bus_register_dev(&v->mmio_bus, &v->pci.pci_mmio_dev);

/* Initialize serial device */
serial_init(&v->serial, &v->io_bus);

Finally, we must invoke the KVM_SET_DEVICE_ATTR ioctl on the file descriptor of the GICv3 that was just created, supplying a struct kvm_device_attr as the argument, as follows:

struct kvm_device_attr vgic_init_attr = {
    .group = KVM_DEV_ARM_VGIC_GRP_CTRL,
    .attr = KVM_DEV_ARM_VGIC_CTRL_INIT,
};

ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &vgic_init_attr);

Once GICv3 has been initialized, no further vCPUs may be created.


VM Load Image / Initial RAM File System in arm64

According to the information provided by the Linux kernel documentation, the decompressed kernel image contains a 64-byte header as follows:

In src/arch/arm64/vm.c:

typedef struct {
    uint32_t code0;       /* Executable code */
    uint32_t code1;       /* Executable code */
    uint64_t text_offset; /* Image load offset, little endian */
    uint64_t image_size;  /* Effective Image size, little endian */
    uint64_t flags;       /* kernel flags, little endian */
    uint64_t res2;        /* reserved */
    uint64_t res3;        /* reserved */
    uint64_t res4;        /* reserved */
    uint32_t magic;       /* Magic number, little endian, "ARM\x64" */
    uint32_t res5;        /* reserved (used for PE COFF offset) */
} arm64_kernel_header_t;

First, we check whether the magic number is valid.

if (header->magic != 0x644d5241U)
    return throw_err("Invalid kernel image\n");

The document states:

Prior to v3.17, the endianness of text_offset was not specified. In these cases image_size is zero and text_offset is 0x80000 in the endianness of the kernel. Where image_size is non-zero image_size is little-endian and must be respected. Where image_size is zero, text_offset can be assumed to be 0x80000.

uint64_t offset;
if (header->image_size == 0)
    offset = 0x80000;
else
    offset = header->text_offset;

if (offset + datasz >= ARM_KERNEL_SIZE ||
    offset + header->image_size >= ARM_KERNEL_SIZE) {
    return throw_err("Image size too large\n");
}

void *dest = vm_guest_to_host(v, ARM_KERNEL_BASE + offset);
memmove(dest, data, datasz);

After loading the kernel, the address at which execution begins (i.e., the address of the first instruction) is the location where the image was loaded, namely the address of code0; therefore, this address is recorded and stored in the entry member variable of vm_arch_priv_t.

priv->entry = ARM_KERNEL_BASE + offset;

The document also states:

The Image must be placed text_offset bytes from a 2MB aligned base address anywhere in usable system RAM and called there.

Therefore, a 2 MB–aligned memory address must be selected, and the value of text_offset from the header is then added; this resulting address designates where the kernel image is placed. The image_size denotes, from the starting position of the kernel image placement, the amount of memory to be reserved as usable space for the kernel.
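
To make the placement rule concrete, here is a small sketch that validates the header and computes a load address from a candidate RAM base. It is an illustration rather than the project's loader: the function name arm64_image_load_addr() is hypothetical, the header struct is repeated so the block stands alone, and the code assumes a little-endian host so that the little-endian header fields can be read directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARM64_IMAGE_MAGIC 0x644d5241U /* "ARM\x64" read as little endian */
#define SZ_2M (2ULL << 20)

typedef struct {
    uint32_t code0, code1;
    uint64_t text_offset;
    uint64_t image_size;
    uint64_t flags;
    uint64_t res2, res3, res4;
    uint32_t magic;
    uint32_t res5;
} arm64_kernel_header_t;

_Static_assert(sizeof(arm64_kernel_header_t) == 64, "header must be 64 bytes");

/* Compute where the Image should be placed: the first 2 MB aligned address
 * at or above ram_base, plus text_offset. Returns 0 on success and -1 when
 * the buffer is too short or the magic number does not match. */
int arm64_image_load_addr(const void *image,
                          size_t len,
                          uint64_t ram_base,
                          uint64_t *load_addr)
{
    assert(image != NULL && load_addr != NULL);

    arm64_kernel_header_t hdr;
    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, image, sizeof(hdr)); /* avoids unaligned access */

    if (hdr.magic != ARM64_IMAGE_MAGIC)
        return -1;

    /* Kernels with image_size == 0 are assumed to use 0x80000. */
    uint64_t text_offset = hdr.image_size ? hdr.text_offset : 0x80000;
    uint64_t aligned_base = (ram_base + SZ_2M - 1) & ~(SZ_2M - 1);

    *load_addr = aligned_base + text_offset;
    return 0;
}
```

For example, a RAM base of 0x80100000 is rounded up to 0x80200000, and a header with image_size equal to zero then yields a load address of 0x80280000.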

In addition to loading the kernel, the initramfs must also be loaded as the first filesystem during boot. The document states:

If an initrd/initramfs is passed to the kernel at boot, it must reside entirely within a 1 GB aligned physical memory window of up to 32 GB in size that fully covers the kernel Image as well.

This indicates that the initramfs and the Linux kernel must both reside within the same 32 GB window aligned on a 1 GB boundary; however, it does not specify whether the initramfs itself must be aligned.
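
The window constraint can be checked mechanically. The sketch below is one way to express it, under the reading that a valid window starts on a 1 GB boundary at or below the lowest address used and spans at most 32 GB; the function name initrd_window_ok() is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)
#define WINDOW_MAX (32 * SZ_1G)

/* Return true when the kernel Image and the initrd both fit inside one
 * 1 GB aligned window of at most 32 GB. */
bool initrd_window_ok(uint64_t kernel_start,
                      uint64_t kernel_size,
                      uint64_t initrd_start,
                      uint64_t initrd_size)
{
    assert(kernel_size > 0 && initrd_size > 0);

    uint64_t kernel_end = kernel_start + kernel_size;
    uint64_t initrd_end = initrd_start + initrd_size;
    uint64_t lo = kernel_start < initrd_start ? kernel_start : initrd_start;
    uint64_t hi = kernel_end > initrd_end ? kernel_end : initrd_end;

    /* The window that reaches furthest starts at the highest 1 GB boundary
     * that is still at or below the lowest address in use. */
    uint64_t window_base = lo & ~(SZ_1G - 1);

    return hi - window_base <= WINDOW_MAX;
}
```

Note that the alignment matters: two regions less than 32 GB apart can still fail the check when the lower one does not start on a 1 GB boundary, because the window has to begin at the boundary below it.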

void *dest = vm_guest_to_host(v, ARM_INITRD_BASE);
memmove(dest, data, datasz);
priv->initrdsz = datasz;


VM Late Init in arm64

Device Tree

Device Tree Specification
Implementation in kvmtool

Prior to booting the kernel, the physical memory address at which the device tree resides must be passed to the kernel via the x0 register.

kvmtool generates the device tree using libfdt, and we can adopt the same approach. libfdt is a library included within the dtc package.

For implementation of the device tree, refer to kvmtool’s implementation. This is the DTB dump produced by kvmtool.

The Device Tree primarily defines the following:

  • Machine Type
  • CPU
  • Memory
  • Initramfs Address
  • Boot Arguments
  • Interrupt Controller
  • 16550 UART Addresses
  • PCI Addresses

Procedure for generating the device tree using libfdt:

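  1. Use the fdt_create() function to specify the buffer and its size for placing the device tree; this creates an empty device tree. Then call fdt_finish_reservemap() to close the (empty) memory reservation map before adding any node, as kvmtool does.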
  2. Call fdt_begin_node() to add a node. Since the root node must be added first, invoke fdt_begin_node(fdt, "").
  3. Within each node, you can add properties using fdt_property(), fdt_property_cell(), fdt_property_u64(), and so on. When using fdt_property(), libfdt will copy the data verbatim; however, because values in the device tree must be represented in big-endian format, you must pair fdt_property() with cpu_to_fdt32() or cpu_to_fdt64() to convert endianness.
  4. After all properties for a node have been added, call fdt_end_node() to close that node.
  5. Finally, call fdt_finish() to complete the device tree. Once fdt_finish() returns—provided that every fdt_begin_node() call has a matching fdt_end_node()—the contents of the buffer constitute a valid device tree.
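
The endianness point in step 3 is easy to get wrong, so the following sketch shows what the conversion amounts to. It deliberately does not use libfdt: put_be32(), put_be64(), and encode_reg() are hand-written, hypothetical stand-ins that illustrate the byte order a property value must have, here for a reg value with two address cells and two size cells (an assumed cell layout).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value most significant byte first, regardless of the
 * host byte order. */
void put_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t) (v >> 24);
    dst[1] = (uint8_t) (v >> 16);
    dst[2] = (uint8_t) (v >> 8);
    dst[3] = (uint8_t) v;
}

/* Store a 64-bit value as two big-endian 32-bit cells, high cell first. */
void put_be64(uint8_t *dst, uint64_t v)
{
    put_be32(dst, (uint32_t) (v >> 32));
    put_be32(dst + 4, (uint32_t) v);
}

/* Encode the value of a `reg = <addr size>` property, assuming
 * #address-cells = 2 and #size-cells = 2. Returns the encoded length. */
size_t encode_reg(uint8_t *dst, uint64_t addr, uint64_t size)
{
    assert(dst != NULL);
    put_be64(dst, addr);
    put_be64(dst + 8, size);
    return 16;
}
```

A buffer produced this way is the kind of pre-converted data that can be handed to fdt_property() unchanged, since that function copies the bytes verbatim.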

Initialize vCPU Register

References:
Booting AArch64 Linux
kvmtool

Before jumping into the kernel, the following conditions must be met:

  • Primary CPU general-purpose register settings:
    • x0 = physical address of device tree blob (dtb) in system RAM.
    • x1 = 0 (reserved for future use)
    • x2 = 0 (reserved for future use)
    • x3 = 0 (reserved for future use)

The __REG macro can be used to generate the register ID within the struct kvm_one_reg passed to the KVM_SET_ONE_REG ioctl.

/* Initialize the vCPU registers according to Linux arm64 boot protocol
 * Reference: https://www.kernel.org/doc/Documentation/arm64/booting.txt
 */
static int init_reg(vm_t *v)
{
    vm_arch_priv_t *priv = (vm_arch_priv_t *) v->priv;
    struct kvm_one_reg reg;
    uint64_t data;

    reg.addr = (uint64_t) &data;
#define __REG(r)                                                  \
    (KVM_REG_ARM_CORE_REG(r) | KVM_REG_ARM_CORE | KVM_REG_ARM64 | \
     KVM_REG_SIZE_U64)

    /* Clear x1 ~ x3 */
    for (int i = 1; i <= 3; i++) {
        data = 0;
        reg.id = __REG(regs.regs[i]);
        if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
            return throw_err("Failed to set x%d\n", i);
    }

    /* Set x0 to the address of the device tree */
    data = ARM_FDT_BASE;
    reg.id = __REG(regs.regs[0]);
    if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
        return throw_err("Failed to set x0\n");

    /* Set program counter to the beginning of kernel image */
    data = priv->entry;
    reg.id = __REG(regs.pc);
    if (ioctl(v->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
        return throw_err("Failed to set program counter\n");

#undef __REG
    return 0;
}
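
To see what the __REG macro evaluates to, the register IDs can be computed by hand. The sketch below mirrors the relevant constants from the arm64 KVM uapi headers under local names so that it builds on any host; the names REG_ARM64, REG_SIZE_U64, REG_ARM_CORE, struct core_regs, xreg_id(), and pc_reg_id() are local stand-ins, and struct core_regs reproduces only the leading fields of the register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from the arm64 KVM uapi headers. */
#define REG_ARM64 0x6000000000000000ULL    /* KVM_REG_ARM64 */
#define REG_SIZE_U64 0x0030000000000000ULL /* KVM_REG_SIZE_U64 */
#define REG_ARM_CORE (0x0010ULL << 16)     /* KVM_REG_ARM_CORE */

/* Simplified mirror of the leading part of the arm64 core register block. */
struct core_regs {
    uint64_t regs[31]; /* x0 .. x30 */
    uint64_t sp;
    uint64_t pc;
    uint64_t pstate;
};

/* A core register is indexed by its offset in 32-bit units. */
#define CORE_REG_INDEX(field) \
    (offsetof(struct core_regs, field) / sizeof(uint32_t))

/* Register ID of general-purpose register xN. */
uint64_t xreg_id(unsigned n)
{
    assert(n < 31);
    return REG_ARM64 | REG_SIZE_U64 | REG_ARM_CORE |
           (n * sizeof(uint64_t) / sizeof(uint32_t));
}

/* Register ID of the program counter. */
uint64_t pc_reg_id(void)
{
    return REG_ARM64 | REG_SIZE_U64 | REG_ARM_CORE | CORE_REG_INDEX(pc);
}
```

Under these definitions x0 maps to 0x6030000000100000 and the program counter to 0x6030000000100040, which matches the 0x6030 0000 0010 prefix that the KVM API documentation gives for arm64 core registers.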


Bus

The bus maps address ranges to devices and keeps the registered devices in a singly linked list.

The owner parameter points to the structure that owns this device and is passed to the callback.

In src/bus.h:

typedef void (*dev_io_fn)(void *owner,
                          void *data,
                          uint8_t is_write,
                          uint64_t offset,
                          uint8_t size);

struct dev {
    uint64_t base;
    uint64_t len;
    void *owner;
    dev_io_fn do_io;
    struct dev *next;
};

struct bus {
    uint64_t dev_num;
    struct dev *head;
};

Bus Initialization

Initialize the singly linked list of the bus.

In src/bus.c:

void bus_init(struct bus *bus)
{
    bus->dev_num = 0;
    bus->head = NULL;
}

Device Initialization

Initialize the device structure.

void dev_init(struct dev *dev,
              uint64_t base,
              uint64_t len,
              void *owner,
              dev_io_fn do_io)
{
    dev->base = base;
    dev->len = len;
    dev->owner = owner;
    dev->do_io = do_io;
    dev->next = NULL;
}

Register the device to the bus

Insert the device into the linked list of the bus.

void bus_register_dev(struct bus *bus, struct dev *dev)
{
    dev->next = bus->head;
    bus->head = dev;
    bus->dev_num++;
}

Deregister the device from the bus

Remove the device from the linked list of the bus.

void bus_deregister_dev(struct bus *bus, struct dev *dev)
{
    struct dev **p = &bus->head;

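    /* Stop at the link that points to dev, or at the end of the list */
    while (*p && *p != dev) {
        p = &(*p)->next;
    }

    if (*p) {
        *p = (*p)->next;
        bus->dev_num--; /* keep the counter in sync with the list */
    }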
}

Handle IO in the Bus

Use the following function to issue an I/O request to the bus. It locates the target device based on the device’s base and len, then invokes the do_io callback.

void bus_handle_io(struct bus *bus,
                   void *data,
                   uint8_t is_write,
                   uint64_t addr,
                   uint8_t size)
{
    struct dev *dev = bus_find_dev(bus, addr);  // Traverse the linked list

    if (dev && addr + size - 1 <= dev->base + dev->len - 1) {
        dev->do_io(dev->owner, data, is_write, addr - dev->base, size);
    }
}
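
bus_handle_io() relies on bus_find_dev(), which is not shown above. A lookup of that kind might look like the following sketch; it is an assumption about the behaviour (first device whose range contains the address), not a copy of the project's implementation, and the type definitions are repeated only so that the block stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*dev_io_fn)(void *owner,
                          void *data,
                          uint8_t is_write,
                          uint64_t offset,
                          uint8_t size);

struct dev {
    uint64_t base;
    uint64_t len;
    void *owner;
    dev_io_fn do_io;
    struct dev *next;
};

struct bus {
    uint64_t dev_num;
    struct dev *head;
};

/* Return the first registered device whose [base, base + len) range
 * contains addr, or NULL when no device claims the address. */
struct dev *bus_find_dev(struct bus *bus, uint64_t addr)
{
    assert(bus != NULL);

    for (struct dev *p = bus->head; p; p = p->next) {
        /* addr - base < len avoids overflow in base + len. */
        if (addr >= p->base && addr - p->base < p->len)
            return p;
    }
    return NULL;
}
```

Since registration inserts at the head of the list, the most recently registered device wins if two ranges overlap; a linear scan is adequate for the handful of devices on each bus.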

In the implementation, io_bus and mmio_bus dispatch the port I/O and MMIO accesses reported by KVM exits, while pci_bus manages the PCI devices' configuration space.


PCI

PCI Definition

Reference: OS Dev: PCI

The PCI architecture is as follows: the Host Bridge is responsible for connecting the CPU and managing all PCI devices and buses. Devices are classified as either endpoint devices or bridges; a bridge serves to interconnect two separate buses.


A PCI logical device provides 256 bytes of Configuration Space used to perform the device’s configuration and initialization.

In /usr/include/linux/pci_regs.h:

/*
 * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
 * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
 * configuration space.
 */
#define PCI_CFG_SPACE_SIZE	256
#define PCI_CFG_SPACE_EXP_SIZE	4096

The CPU cannot directly access this space and must instead rely on a special mechanism provided by the PCI Host Bridge to facilitate access to the configuration registers.

Under Intel's architecture, this mechanism employs two I/O ports: 0xCF8 (CONFIG_ADDRESS) and 0xCFC (CONFIG_DATA). The CPU first writes the target configuration register's address to 0xCF8; subsequently, reading from or writing to 0xCFC completes the operation on that register.


In src/pci.h:

union pci_config_address {
    struct {
        unsigned reg_offset : 2;
        unsigned reg_num : 6;
        unsigned func_num : 3;
        unsigned dev_num : 5;
        unsigned bus_num : 8;
        unsigned reserved : 7;
        unsigned enable_bit : 1;
    };  // Little endian
    uint32_t value;
};


The Bus Number, in conjunction with the Device Number, is used to identify a physical PCI device. Each device may offer multiple functions, and each function is treated as a separate logical device; the combination of

Bus Number : Device Number : Function Number

distinguishes each logical device.

The least significant byte selects the offset into the 256-byte configuration space available through this method. Since all reads and writes must be both 32-bits and aligned to work on all implementations, the two lowest bits of CONFIG_ADDRESS (0xCF8) must always be zero, with the remaining six bits allowing you to choose each of the 64 32-bit words.
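
The same field layout can be expressed with shifts and masks, which avoids depending on compiler-specific bit-field ordering. The sketch below is illustrative; struct pci_cfg_target and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of a CONFIG_ADDRESS (0xCF8) value. */
struct pci_cfg_target {
    uint8_t bus;  /* bits 23-16 */
    uint8_t dev;  /* bits 15-11 */
    uint8_t func; /* bits 10-8 */
    uint8_t reg;  /* bits 7-0, with the two lowest bits forced to zero */
    int enabled;  /* bit 31 */
};

struct pci_cfg_target pci_decode_config_address(uint32_t value)
{
    struct pci_cfg_target t;
    t.enabled = (value >> 31) & 0x1;
    t.bus = (value >> 16) & 0xff;
    t.dev = (value >> 11) & 0x1f;
    t.func = (value >> 8) & 0x07;
    t.reg = value & 0xfc;
    return t;
}

/* Build the value a guest would write to 0xCF8 to select a register. */
uint32_t pci_encode_config_address(uint8_t bus,
                                   uint8_t dev,
                                   uint8_t func,
                                   uint8_t reg)
{
    assert(dev < 32 && func < 8);
    return (1U << 31) | ((uint32_t) bus << 16) | ((uint32_t) dev << 11) |
           ((uint32_t) func << 8) | (reg & 0xfcU);
}
```

For instance, selecting register 0x10 (BAR0) of bus 0, device 3, function 1 gives 0x80001910.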

In src/pci.h:

struct pci {
    union pci_config_address pci_addr;
    struct bus pci_bus;
    struct dev pci_bus_dev;
    struct dev pci_addr_dev;
    struct dev pci_mmio_dev;
};

PCI Initialization

In src/pci.c:

#define PCI_CONFIG_ADDR 0xCF8
#define PCI_CONFIG_DATA 0xCFC
#define PCI_MMIO_SIZE (1UL << 16)

void pci_init(struct pci *pci)
{
    dev_init(&pci->pci_addr_dev, PCI_CONFIG_ADDR, sizeof(uint32_t), pci,
             pci_address_io);
    dev_init(&pci->pci_bus_dev, PCI_CONFIG_DATA, sizeof(uint32_t), pci,
             pci_data_io);
    dev_init(&pci->pci_mmio_dev, 0, PCI_MMIO_SIZE, pci, pci_mmio_io);
    bus_init(&pci->pci_bus);
}

The callback associated with pci_bus_dev invokes the pci_bus to locate the registered virtio-blk device, thereby performing read and write operations within the PCI device’s configuration space.

static void pci_address_io(void *owner,
                           void *data,
                           uint8_t is_write,
                           uint64_t offset,
                           uint8_t size)
{
    struct pci *pci = (struct pci *) owner;
    void *p = (void *) ((uintptr_t) &pci->pci_addr + offset);
    /* The data in port 0xCF8 is as an address when Guest Linux accesses the
     * configuration space.
     */
    if (is_write)
        memcpy(p, data, size);
    else
        memcpy(data, p, size);
    pci->pci_addr.reg_offset = 0;
}

#define PCI_ADDR_ENABLE_BIT (1UL << 31)

static void pci_data_io(void *owner,
                        void *data,
                        uint8_t is_write,
                        uint64_t offset,
                        uint8_t size)
{
    struct pci *pci = (struct pci *) owner;
    if (pci->pci_addr.enable_bit) {
        uint64_t addr = (pci->pci_addr.value | offset) & ~(PCI_ADDR_ENABLE_BIT);
        bus_handle_io(&pci->pci_bus, data, is_write, addr, size);
    }
}

static void pci_mmio_io(void *owner,
                        void *data,
                        uint8_t is_write,
                        uint64_t offset,
                        uint8_t size)
{
    struct pci *pci = (struct pci *) owner;
    bus_handle_io(&pci->pci_bus, data, is_write, offset, size);
}

In src/pci.h:

#define PCI_HDR_READ(hdr, offset, width) \
    (*((uint##width##_t *) ((uintptr_t) hdr + offset)))
#define PCI_HDR_WRITE(hdr, offset, value, width) \
    ((uint##width##_t *) ((uintptr_t) hdr + offset))[0] = value

struct pci_dev {
    uint8_t cfg_space[PCI_CFG_SPACE_SIZE];  // Configuration space
    void *hdr;  // Pointer to the cfg_space above, init in `pci_dev_init`
    uint32_t bar_size[6];
    bool bar_active[6];
    bool bar_is_io_space[6];
    struct dev space_dev[6];
    struct dev config_dev;
    struct bus *io_bus;
    struct bus *mmio_bus;
    struct bus *pci_bus;
};

In src/pci.c:

void pci_dev_init(struct pci_dev *dev,
                  struct pci *pci,
                  struct bus *io_bus,
                  struct bus *mmio_bus)
{
    memset(dev, 0x00, sizeof(struct pci_dev));
    dev->hdr = dev->cfg_space;
    dev->pci_bus = &pci->pci_bus;
    dev->io_bus = io_bus;
    dev->mmio_bus = mmio_bus;
}


VirtIO PCI Header

The first 64 bytes of the PCI Header Type 0 configuration space are common to all devices, whereas the remaining 192 bytes are defined individually by each device. The common portion is illustrated in the table below.

  • Vendor ID: Identifies the device manufacturer; for Virtio, this value is 0x1AF4.
  • Device ID: The identifier assigned to the device; for Virtio-blk, this value is 0x1042.
  • Command: Used to configure the device’s operational settings; this register is writable.
  • Status: Indicates the current status of the device.
  • Class Code: Specifies the device’s category or class.
  • Base Address Register (BAR): Denotes the address to which the device’s internal memory space is mapped; this register is also writable.

[Figure: common header fields of the PCI Type 0 configuration space (screenshot not available)]


Base Address Register (BAR)

Base Address Registers (or BARs) can be used to hold memory addresses used by the device, or offsets for port addresses. Typically, memory address BARs need to be located in physical RAM while I/O space BARs can reside at any memory address (even beyond physical memory). To distinguish between them, you can check the value of the lowest bit. The following tables describe the two types of BARs:

  1. Memory Space BAR Layout

    • Bits 31-4: 16-byte aligned base address
    • Bit 3: Prefetchable
    • Bits 2-1: Type
    • Bit 0: Always 0
  2. I/O Space BAR Layout

    • Bits 31-2: 4-byte aligned base address
    • Bit 1: Reserved
    • Bit 0: Always 1
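
The two layouts can be decoded with a few masks. The sketch below is illustrative only; struct bar_info, decode_bar(), and bar_size_mask() are hypothetical names. bar_size_mask() additionally shows why a BAR size has to be a power of two: during sizing the guest writes all ones to the BAR and reads back the address bits the device actually implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool is_io;        /* bit 0 */
    bool prefetchable; /* bit 3, memory BARs only */
    uint8_t type;      /* bits 2-1, memory BARs only */
    uint32_t base;     /* base address with the flag bits cleared */
};

struct bar_info decode_bar(uint32_t bar)
{
    struct bar_info info = {0};

    if (bar & 0x1) {
        info.is_io = true;
        info.base = bar & ~0x3U; /* 4-byte aligned */
    } else {
        info.type = (bar >> 1) & 0x3;
        info.prefetchable = (bar >> 3) & 0x1;
        info.base = bar & ~0xfU; /* 16-byte aligned */
    }
    return info;
}

/* Address bits a memory BAR of the given size exposes during sizing. */
uint32_t bar_size_mask(uint32_t size)
{
    assert(size >= 16 && (size & (size - 1)) == 0);
    return ~(size - 1);
}
```

A BAR value of 0xFEBF000C therefore decodes as a prefetchable memory BAR of type 2 at base 0xFEBF0000, and a 0x100-byte region answers a sizing probe with 0xFFFFFF00 in its address bits.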


Capability List

The device-specific region is organized as a Capability List. Bit 4 of the Status register indicates whether the device implements a Capability List, and the Capabilities Pointer register at the fixed offset 0x34 holds the configuration-space offset of the first capability.

Status register layout:

  • Bit 15: Detected Parity Error (RW1C)
  • Bit 14: Signaled System Error (RW1C)
  • Bit 13: Received Master Abort (RW1C)
  • Bit 12: Received Target Abort (RW1C)
  • Bit 11: Signaled Target Abort (RW1C)
  • Bits 9-10: DEVSEL Timing (RO)
  • Bit 8: Master Data Parity Error (RW1C)
  • Bit 7: Fast Back-to-Back Capable (RO)
  • Bit 6: Reserved (RO)
  • Bit 5: 66 MHz Capable (RO)
  • Bit 4: Capabilities List (RO)
  • Bit 3: Interrupt Status (RO)
  • Bits 0-2: Reserved

To traverse the capability list, read the header of each capability in turn. The low 8 bits of a capability register are the ID (for example, 0x05 for MSI among the standard PCI capability IDs). The next 8 bits are the offset (in PCI Configuration Space) of the next capability; an offset of 0 marks the end of the list.

[Figure: PCI capability list layout (screenshot not available)]

The standard capability ID can be found in PCI Code ID, Page 22-23.

Capability List for VirtIO

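VirtIO devices describe their configuration structures with vendor-specific capabilities (standard capability ID 0x09). Within each such capability, the cfg_type field identifies which VirtIO structure it points to; the defined values are listed below.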

/* Common configuration */
#define VIRTIO_PCI_CAP_COMMON_CFG        1
/* Notifications */
#define VIRTIO_PCI_CAP_NOTIFY_CFG        2
/* ISR Status */
#define VIRTIO_PCI_CAP_ISR_CFG           3
/* Device specific configuration */
#define VIRTIO_PCI_CAP_DEVICE_CFG        4
/* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG           5

This reference is taken from Chapter 4.1.4, “Virtio Structure PCI Capabilities,” in the Virtio version 1.1 specification published by OASIS Open.
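
As a rough illustration of how one of these capabilities is laid out in configuration space, the sketch below writes a single entry following the struct virtio_pci_cap layout given in that chapter of the specification. The helper names put_le32() and virtio_write_cap() are hypothetical, and the multi-byte fields are written byte by byte so that the result is little endian on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PCI_CAP_ID_VENDOR 0x09 /* vendor-specific capability ID */
#define VIRTIO_PCI_CAP_COMMON_CFG 1

/* Layout as given in Virtio 1.1, section 4.1.4. */
struct virtio_pci_cap {
    uint8_t cap_vndr; /* PCI_CAP_ID_VENDOR */
    uint8_t cap_next; /* offset of the next capability */
    uint8_t cap_len;  /* length of this capability */
    uint8_t cfg_type; /* which VirtIO structure this describes */
    uint8_t bar;      /* BAR that holds the structure */
    uint8_t padding[3];
    uint32_t offset; /* little endian: offset within the BAR */
    uint32_t length; /* little endian: length of the structure */
};

_Static_assert(sizeof(struct virtio_pci_cap) == 16, "unexpected layout");

void put_le32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t) v;
    dst[1] = (uint8_t) (v >> 8);
    dst[2] = (uint8_t) (v >> 16);
    dst[3] = (uint8_t) (v >> 24);
}

/* Write one VirtIO capability at offset pos of a 256-byte config space. */
void virtio_write_cap(uint8_t *cfg, uint8_t pos, uint8_t next,
                      uint8_t cfg_type, uint8_t bar,
                      uint32_t offset, uint32_t length)
{
    assert(cfg != NULL && pos >= 0x40 &&
           pos <= 256 - sizeof(struct virtio_pci_cap));

    uint8_t *cap = cfg + pos;
    memset(cap, 0, sizeof(struct virtio_pci_cap));
    cap[offsetof(struct virtio_pci_cap, cap_vndr)] = PCI_CAP_ID_VENDOR;
    cap[offsetof(struct virtio_pci_cap, cap_next)] = next;
    cap[offsetof(struct virtio_pci_cap, cap_len)] =
        (uint8_t) sizeof(struct virtio_pci_cap);
    cap[offsetof(struct virtio_pci_cap, cfg_type)] = cfg_type;
    cap[offsetof(struct virtio_pci_cap, bar)] = bar;
    put_le32(cap + offsetof(struct virtio_pci_cap, offset), offset);
    put_le32(cap + offsetof(struct virtio_pci_cap, length), length);
}
```

A driver that walks the capability list sees ID 0x09 in the first byte of each entry and then uses cfg_type, bar, offset, and length to locate the corresponding structure inside the BAR.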

In src/virtio-pci.h:

struct virtio_pci_dev {
    struct pci_dev pci_dev;
    struct virtio_pci_config config;
    uint64_t device_feature;
    uint64_t guest_feature;
    struct virtio_pci_notify_cap *notify_cap;
    struct virtio_pci_cap *dev_cfg_cap;
    struct virtq *vq;
};

In src/virtio-pci.c:

void virtio_pci_init(struct virtio_pci_dev *dev,
                     struct pci *pci,
                     struct bus *io_bus,
                     struct bus *mmio_bus)
{
    /* The capability list begins at offset 0x40 of pci config space */
    uint8_t cap_list = 0x40;

    memset(dev, 0x00, sizeof(struct virtio_pci_dev));
    pci_dev_init(&dev->pci_dev, pci, io_bus, mmio_bus);
    // Set Vendor ID to 0x1AF4 for VirtIO
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_VENDOR_ID, VIRTIO_PCI_VENDOR_ID, 16);
    // Set Capability list pointer to 0x40
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_CAPABILITY_LIST, cap_list, 8);
    // Set Header type to type 0 (normal)
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_HEADER_TYPE, PCI_HEADER_TYPE_NORMAL, 8);
    // Set the interrupt pin to INTA#
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_INTERRUPT_PIN, 1, 8);
    // Enable capability list (bit 4) and status interrupt (bit 3)
    pci_set_status(&dev->pci_dev, PCI_STATUS_CAP_LIST | PCI_STATUS_INTERRUPT);
    pci_set_bar(&dev->pci_dev, 0, 0x100, PCI_BASE_ADDRESS_SPACE_MEMORY,
                virtio_pci_space_io);
    virtio_pci_set_cap(dev, cap_list);
    dev->device_feature |=
        (1ULL << VIRTIO_F_RING_PACKED) | (1ULL << VIRTIO_F_VERSION_1);
}

In /usr/include/linux/pci_regs.h:

#define PCI_BASE_ADDRESS_0            0x10	/* 32 bits */
#define PCI_BASE_ADDRESS_SPACE        0x01	/* 0 = memory, 1 = I/O */
#define PCI_BASE_ADDRESS_SPACE_IO     0x01
#define PCI_BASE_ADDRESS_SPACE_MEMORY 0x00

In src/pci.h:

#define PCI_BAR_OFFSET(bar) (PCI_BASE_ADDRESS_0 + ((bar) << 2))

In src/pci.c:

void pci_set_bar(struct pci_dev *dev,
                 uint8_t bar,
                 uint32_t bar_size,
                 bool is_io_space,
                 dev_io_fn do_io)
{
    /* TODO: mem type, prefetch */
    /* FIXME: bar_size must be power of 2 */
    PCI_HDR_WRITE(dev->hdr, PCI_BAR_OFFSET(bar), is_io_space, 32);
    dev->bar_size[bar] = bar_size;
    dev->bar_is_io_space[bar] = is_io_space;
    dev_init(&dev->space_dev[bar], 0, bar_size, dev, do_io);
}