References:

kvm-host
Improvements to kvm-host
Linux kernel course project: system virtual machine development and improvement
linux-riscv-dev/exercises2/kvm/kvm.md
Virtio-networking series
KVM: Linux virtualization infrastructure
Virtio: An I/O virtualization framework for Linux
Introduction to VirtIO
virtio-v1.2-cs01.pdf
Universal TUN/TAP device driver
Linux kernel course project: RISC-V system emulator
semu

TODO: Try to implement virtio-net

Virtio

Linux supports several different virtualization systems, for example:

  • Xen
  • KVM
  • VMware

Each of these systems has its own device drivers, e.g. block, console, and network. Virtio is a virtualized I/O framework whose job is to carry the communication between the front end and the back end: the front end here means the driver inside the virtualized system, and the back end is the device emulated by the virtualization system (KVM).


Front-end drivers in Linux

In Linux, the virtio front-end drivers are each written as kernel modules, and they communicate directly over virtio.


For example, virtio-net is brought up by being registered as a kernel module:

...
static __init int virtio_net_driver_init(void)
{
	int ret;

	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "virtio/net:online",
				      virtnet_cpu_online,
				      virtnet_cpu_down_prep);
	if (ret < 0)
		goto out;
	virtionet_online = ret;
	ret = cpuhp_setup_state_multi(CPUHP_VIRT_NET_DEAD, "virtio/net:dead",
				      NULL, virtnet_cpu_dead);
	if (ret)
		goto err_dead;

	ret = register_virtio_driver(&virtio_net_driver);
	if (ret)
		goto err_virtio;
	return 0;
err_virtio:
	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
err_dead:
	cpuhp_remove_multi_state(virtionet_online);
out:
	return ret;
}
module_init(virtio_net_driver_init);

static __exit void virtio_net_driver_exit(void)
{
	unregister_virtio_driver(&virtio_net_driver);
	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
	cpuhp_remove_multi_state(virtionet_online);
}
module_exit(virtio_net_driver_exit);

MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_DESCRIPTION("Virtio network driver");
MODULE_LICENSE("GPL");

The tasks of these front ends are to:

  • receive requests from the user
  • forward those requests to the corresponding back end
  • receive completion notifications for the requests from the back end

From the above we can see that virtio-net is a front-end driver that transfers data over virtio. In kvm-host, however, the drivers are all built on virtio-pci (introduced below) rather than being brought up by registering a kernel module.

Virtio data transport

virtio transfers data through the virtqueue structure. Virtqueues come in two layouts: the Split virtqueue and the Packed virtqueue. kvm-host implements the Packed virtqueue, so that is the layout mainly discussed here.

Virtio devices

A virtio device provides a virtual interface for exchanging information. Such an interface consists of the following parts:

  • Device status field
  • Feature bits
  • Notifications
  • one or more virtqueues

Device status field

The device status field indicates the state of the device, for example:

  • ACKNOWLEDGE (0x1): the guest system has noticed the device
  • DRIVER (0x2): the device has been initialized (the guest knows how to drive it)
  • DRIVER_OK (0x4) and FEATURES_OK (0x8): the driver and device are ready to communicate
  • DRIVER_NEEDS_RESET (0x40) and FAILED (0x80): the device has encountered an error


The main difference between the Packed virtqueue and the Split virtqueue is that the Split virtqueue implements the descriptor ring, available ring, and used ring as three separate structures, whereas the Packed virtqueue merges them into a single ring. The advantage of merging is that a descriptor's state is kept in one place, which reduces the number of memory accesses per descriptor and improves cache behavior.


To understand the Packed virtqueue layout, the first things to grasp are the descriptor ring, the available ring, and the used ring.

descriptor ring

Each entry in the descriptor ring consists of four fields: addr, len, id, and flags. Its purpose is to describe the buffers handed over by the guest; once the guest's request completes, the completion is reported back through a used buffer notification.

  • addr is the guest address of the buffer
  • len is the size of the buffer
  • id is the buffer ID that the device echoes back on completion
  • flags describes how the buffer is used, e.g. write-only or read-only
struct vring_packed_desc {
	/* Buffer Address. */
	__le64 addr;
	/* Buffer Length. */
	__le32 len;
	/* Buffer ID. */
	__le16 id;
	/* The flags depending on descriptor type. */
	__le16 flags;
};

available ring & used ring

The available ring is where the driver publishes available descriptors into the descriptor ring so the device can carry out the I/O request; the used ring is where, after completing a request, the device raises a notification back to the driver.


Virtio over the PCI bus

Since the drivers in kvm-host are all built on virtio-pci, it is worth understanding how virtio-pci is implemented and how it works.

PCI


In a typical PCI topology, devices sit behind a PCI host bridge. In this project, every PCI device is set up to be accessed through the PCI host bridge, which in turn communicates with the CPU and memory.

In this project, every virtio device is implemented as a PCI device and is also brought up as one. Take virtio-blk for example:

...
struct virtio_blk_req {
    uint32_t type;
    uint32_t reserved;
    uint64_t sector;
    uint8_t *data;
    uint16_t data_size;
    uint8_t *status;
};

struct virtio_blk_dev {
    struct virtio_pci_dev virtio_pci_dev;
    struct virtio_blk_config config;
    struct virtq vq[VIRTIO_BLK_VIRTQ_NUM];
    int irqfd;
    int ioeventfd;
    int irq_num;
    pthread_t vq_avail_thread;
    pthread_t worker_thread;
    struct diskimg *diskimg;
    bool enable;
};

void virtio_blk_init(struct virtio_blk_dev *virtio_blk_dev);
void virtio_blk_exit(struct virtio_blk_dev *dev);
void virtio_blk_init_pci(struct virtio_blk_dev *dev,
                         struct diskimg *diskimg,
                         struct pci *pci,
                         struct bus *io_bus,
                         struct bus *mmio_bus);

A PCI device is allocated 256 bytes of configuration space. The first 64 bytes are the common header, and the remaining 192 bytes are device-specific; the virtio-specific values to put in the header are specified in virtio-v1.2-cs01.pdf:


  • Every virtio PCI device has Vendor ID 0x1AF4
  • Modern Device IDs start at 0x1040, and each device type has its own ID: for example, virtio-net is 0x1041 and virtio-blk is 0x1042
Transitional PCI Device ID   Virtio Device
0x1000                       network card
0x1001                       block device
0x1002                       memory ballooning
0x1003                       console
0x1004                       SCSI host
0x1005                       entropy source
0x1009                       9P transport

The PCI device setup above includes a class code field, but where does the definition below come from? Searching virtio-v1.2-cs01.pdf turns up no such setting.

#define VIRTIO_BLK_PCI_CLASS 0x018000

PCI device initialization

First, observe how kvm-host defines a PCI device:

struct pci_dev {
    uint8_t cfg_space[PCI_CFG_SPACE_SIZE];
    void *hdr;
    uint32_t bar_size[6];
    bool bar_active[6];
    bool bar_is_io_space[6];
    struct dev space_dev[6];
    struct dev config_dev;
    struct bus *io_bus;
    struct bus *mmio_bus;
    struct bus *pci_bus;
};

The pci_dev structure provides a region (cfg_space) for configuring the PCI device. But how is that configuration actually accessed? Through the structure's second member, hdr:

#define PCI_HDR_READ(hdr, offset, width) (*((uint##width##_t *) (hdr + offset)))
#define PCI_HDR_WRITE(hdr, offset, value, width) \
    ((uint##width##_t *) (hdr + offset))[0] = value

Two things can be read from the definitions above: PCI_HDR_READ reads the value at hdr + offset, and PCI_HDR_WRITE writes a value to hdr + offset. In a PCI device, hdr points at the start of the device's configuration space, i.e. the beginning of cfg_space.

void pci_dev_init(struct pci_dev *dev,
                  struct pci *pci,
                  struct bus *io_bus,
                  struct bus *mmio_bus)
{
    memset(dev, 0x00, sizeof(struct pci_dev));
    dev->hdr = dev->cfg_space;
    dev->pci_bus = &pci->pci_bus;
    dev->io_bus = io_bus;
    dev->mmio_bus = mmio_bus;
}

After the PCI device is initialized, the virtio-pci interface must be initialized on top of it. The 0x40 in the first line is the offset in PCI configuration space where the capability list begins; the per-device settings are then written into the PCI header through hdr:

void virtio_pci_init(struct virtio_pci_dev *dev,
                     struct pci *pci,
                     struct bus *io_bus,
                     struct bus *mmio_bus)
{
    /* The capability list begins at offset 0x40 of pci config space */
    uint8_t cap_list = 0x40;

    memset(dev, 0x00, sizeof(struct virtio_pci_dev));
    pci_dev_init(&dev->pci_dev, pci, io_bus, mmio_bus);
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_VENDOR_ID, VIRTIO_PCI_VENDOR_ID, 16);
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_CAPABILITY_LIST, cap_list, 8);
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_HEADER_TYPE, PCI_HEADER_TYPE_NORMAL, 8);
    PCI_HDR_WRITE(dev->pci_dev.hdr, PCI_INTERRUPT_PIN, 1, 8);
    pci_set_status(&dev->pci_dev, PCI_STATUS_CAP_LIST | PCI_STATUS_INTERRUPT);
    pci_set_bar(&dev->pci_dev, 0, 0x100, PCI_BASE_ADDRESS_SPACE_MEMORY,
                virtio_pci_space_io);
    virtio_pci_set_cap(dev, cap_list);
    dev->device_feature |=
        (1ULL << VIRTIO_F_RING_PACKED) | (1ULL << VIRTIO_F_VERSION_1);
}

Virtio device initialization

A virtio device is initialized by attaching it to the PCI bus through virtio-pci.

TUN/TAP

TUN/TAP are virtual network devices emulated by the Linux kernel that let userspace programs receive and transmit packets. Both are implemented entirely in software; the difference is that TUN operates at the network layer and handles IP packets, while TAP operates at the data link layer and handles Ethernet frames.

How TUN/TAP communicates with the outside world can be seen from the figure below.

In the figure, APP is a networked application such as Firefox: a Read passes packet data through the character device emulated in the kernel into the process, while a Write sends packets out.

Opening TUN/TAP as the virtio-net data transport device

Universal TUN/TAP device driver explains how to open TUN/TAP as a virtual network device. The members of the structure below, briefly:

  • virtio_pci_dev brings up the virtio-net driver in virtio-pci form.
  • In the virtual machine, the front end and back end communicate through the virtq structure. Since a packet path has both a receive side (RX) and a transmit side (TX), the length of the virtq array is multiplied by 2.
struct virtio_net_dev {
    struct virtio_pci_dev virtio_pci_dev;
    struct virtq vq[VIRTIO_NET_VIRTQ_NUM * 2];
    int tap_fd;
    int ioeventfd;
    bool enable;
    int irq_num;
};

bool virtio_net_init(struct virtio_net_dev *virtio_net_dev);
void virtio_net_exit(struct virtio_net_dev *dev);
void virtio_net_init_pci(struct virtio_net_dev *dev,
                         struct pci *pci,
                         struct bus *io_bus,
                         struct bus *mmio_bus);

The code below opens the NIC; it is invoked from vm_arch_init_platform_device, and packet data is subsequently read from and written to this file descriptor.

virtio-net.c
...
#define DEVICE_NAME "tap%d"

bool virtio_net_init(struct virtio_net_dev *virtio_net_dev)
{
    virtio_net_dev->tap_fd = open("/dev/net/tun", O_RDWR);
    if (virtio_net_dev->tap_fd < 0) {
        fprintf(stderr, "failed to open TAP device: %s\n", strerror(errno));
        return false;
    }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    strcpy(ifr.ifr_name, DEVICE_NAME);
    if (ioctl(virtio_net_dev->tap_fd, TUNSETIFF, &ifr) < 0) {
        fprintf(stderr, "failed to allocate TAP device: %s\n", strerror(errno));
        return false;
    }

    fprintf(stderr, "allocated TAP interface: %s\n", ifr.ifr_name);

    return true;
}

x86/vm.c
...
int vm_arch_init_platform_device(vm_t *v)
{
    ...
    virtio_net_init(&v->virtio_net_dev);
    return 0;
}

In main.c and src/vm.c, the function vm_load_diskimg is where a disk image is opened as virtio-blk on the PCI bus. What about virtio-net, then? Where should virtio-net be brought up? In memory?

int vm_load_diskimg(vm_t *v, const char *diskimg_file)
{
    if (diskimg_init(&v->diskimg, diskimg_file) < 0)
        return -1;
    virtio_blk_init_pci(&v->virtio_blk_dev, &v->diskimg, &v->pci, &v->io_bus,
                        &v->mmio_bus);
    return 0;
}

Bringing up virtio-net through PCI

Bringing up the virtual NIC through PCI requires setting a few PCI-related parameters. Below are the values I set, starting with the number of virtqueues and the class code of virtio-net on the PCI device:

#define VIRTIO_NET_VIRTQ_NUM 2
#define VIRTIO_NET_PCI_CLASS 0x019000

Next, define the device ID on the PCI device; from virtio-v1.2-cs01.pdf this is 0x1041:

#define VIRTIO_PCI_DEVICE_ID_NET 0x1041

Then register these settings with the PCI layer:

void virtio_net_init_pci(struct virtio_net_dev *virtio_net_dev,
                         struct pci *pci,
                         struct bus *io_bus,
                         struct bus *mmio_bus)
{
    struct virtio_pci_dev *dev = &virtio_net_dev->virtio_pci_dev;
    virtio_net_setup(virtio_net_dev);
    virtio_pci_init(dev, pci, io_bus, mmio_bus);
    virtio_pci_set_pci_hdr(dev, VIRTIO_PCI_DEVICE_ID_NET, VIRTIO_NET_PCI_CLASS,
                           virtio_net_dev->irq_num);
    virtio_pci_set_virtq(dev, virtio_net_dev->vq, VIRTIO_NET_VIRTQ_NUM);
    virtio_pci_add_feature(dev, 0);
    virtio_pci_enable(dev);
    pthread_create(&virtio_net_dev->worker_thread, NULL,
                   (void *) virtio_net_thread, (void *) virtio_net_dev);
    fprintf(stderr, "Initialize net device through pci \n");
}

Starting the virtual machine directly from main.c with virtio-net registered on the PCI bus should produce output like the following:

Initialize net device through pci
...
PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
pci 0000:00:00.0: [1af4:1042] type 00 class 0x018000
pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000000ff]
pci 0000:00:01.0: [1af4:1041] type 00 class 0x019000
pci 0000:00:01.0: reg 0x10: [mem 0x00000000-0x000000ff]
pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
clocksource: Switched to clocksource tsc-early
pci 0000:00:00.0: BAR 0: assigned [mem 0x40000000-0x400000ff]
pci 0000:00:01.0: BAR 0: assigned [mem 0x40000100-0x400001ff]
...
virtio-pci 0000:00:00.0: enabling device (0000 -> 0002)
virtio-pci 0000:00:01.0: enabling device (0000 -> 0002)
virtio-pci 0000:00:01.0: virtio_pci: bad capability len 0 (>0 expected)
...

The output above shows that PCI assigned slot 0000:00:01.0 to virtio-net (0000:00:00.0 goes to virtio-blk), and 1041 is the device ID defined above:

pci 0000:00:01.0: [1af4:1041] type 00 class 0x019000

The bad capability message below appears because I have not set up virtio-net's capability. The virtio_net_dev structure has a virtio_net_config member, yet linux/virtio_net.h has no field for filling in a capability?

virtio-pci 0000:00:01.0: virtio_pci: bad capability len 0 (>0 expected)

Transmitting data through the virtual NIC

To transmit data through the virtual NIC, commands that can send packets, such as ip and ping, must first be added to the Linux or Busybox configuration, so that the virtual machine and the outside world can exchange network packets.