###### This note is based on the [NTHU virtualization & VM class](http://lms.nthu.edu.tw/course/20998) and the virtio documentation.

# IO Virtualization

###### tags: `virtualization` `VM`

In a virtual machine, a regular IO operation goes through the following steps:

1. The application issues a system call to the guest OS.
2. The guest OS takes over and runs the corresponding IO driver.
3. The IO driver forwards the request to the Virtual Machine Monitor (VMM) as an IO operation.
4. The VMM performs the IO on the physical hardware.

```
+----------------+
|  Application   |
+--+-------------+
   | System Call
+--v-------------+
|    Guest OS    |
+----------------+
|   IO Driver    |
+--+-------------+
   | IO operation
+--v-------------+
|      VMM       |
+----------------+
|     Phy HW     |
+----------------+
```

### System Call Level Virtualization

1. The VMM intercepts the system call directly.
2. It emulates the corresponding IO.
3. It returns directly to the application.

```
   +----------------+
+-->  Application   |
|  +--+-------------+
|     | System Call +--+
|  +--v-------------+  |
|  |    Guest OS    |  |  Return directly
|  +----------------+  |  to App
|  |   IO Driver    |  |
|  +--+-------------+  |  Trap of system call
|     | IO operation   |
|  +--v-------------+  |
+--+ Shadow SysCall <--+
   +----------------+
   |     Phy HW     |
   +----------------+
```

### Device Driver Level (para-virtualization)

1. The device driver inside the guest OS must be modified.
2.
IO operations are issued as hyper-calls, served by the VMM's IO component.

```
+--------------------+
|    Application     |
+--+-----------------+
   | System Call
+--v-----------------+
|      Guest OS      |
+--------------------+
|     IO Driver      |
+---^----------------+
    | Hyper Call
+---v----------------+
| VMM IO Component   |
+--------------------+
|       Phy HW       |
+--------------------+
```

### IO Operation Level

This approach is mainly PCI passthrough IO. It comes with many problems, so how to operate it is not covered here.

```
+----------------------+
|     Application      |
+--+-------------------+
   | System Call
+--v-------------------+
|       Guest OS       |
+----------------------+
+--->    IO Driver     |
|   +---+--------------+
|       | IO Operation ----+
|   +---v--------------+   |
+---+MMIO/IO instructions<-+
    +----------------------+
    |        Phy HW        |
    +----------------------+
```

# VirtIO

Virtio takes the second approach described above: it solves IO virtualization in software, at the device driver layer.

The figure below gives an overview of para-virtualization. Note that the driver in the guest OS has been modified to talk to the hypervisor.

```
+----------------------------+   +-----------------------------+
|                            |   |                             |
|          Guest OS          |   |  Guest OS   +-------------+ |
|                            |   |             | Para-driver | |
+-------------+--------------+   +-------------+------+------+-+
              | Traps                                 | Interfaces
+-------------v--------------+   +--------------------v--------+
| Hypervisor  +------------+ |   | Hypervisor   +-----------+  |
|             |   Device   | |   |              |  Device   |  |
|             |  Emulation | |   |              | Emulation |  |
|             +------------+ |   |              +-----------+  |
+----------------------------+   +-----------------------------+
|             HW             |   |             HW              |
+----------------------------+   +-----------------------------+
     Full Virtualization              Para-virtualization
```

* Virtio is the Linux standard for IO virtualization; it mainly provides network and disk drivers that work together with the hypervisor.
* The host-side driver is implemented inside QEMU (virtio), so no driver is needed at the host layer.
* Only the guest OS's device driver knows it is running in a virtual environment (see the figure below).

```
+------------------------+
|                        |
|         Guest          |
+------------------------+
|      Front-end Drivers |
+----------------^-------+
                 | VirtIO
+--------------+-v----------------+
|              | Back-end Drivers |
|              +------------------+
|  Hypervisor  | Device Emulation |
|              +------------------+
|                                 |
+---------------------------------+
```

## Virtio can be further divided into three parts

1. Front-end driver
2. Transport
3.
Back-end driver

```
+--------------------------------------+
| Front-end Driver                     |
|                                      |
+--------------------------------------+
|            Virtio Driver             |
+----------+------------------------+--+
           | Virtio PCI Controller  |
           +------------------------+
+--------------------------------------+
+--------------------------------------+
| Virtqueue       +------------+       |
|  Virtio-buffer  |   Vring    |       |
|                 +------------+       |
|                            Transport |
+--------------------------------------+
+--------------------------------------+
|           +------------------------+ |
|           | Virtio PCI Controller  | |
+-----------+------------------------+-+
|            Virtio Driver             |
+--------------------------------------+
|                                      |
| Back-end Driver                      |
+--------------------------------------+
```

### Driver
---
#### Front-end

The front-end driver replaces the driver inside the guest OS, letting the guest OS communicate through the transport layer. It has two jobs:

* Accept I/O requests coming from user space
* Forward those I/O requests to the back-end driver

#### Back-end

The back-end driver lives in QEMU and talks to the emulated device. Its jobs:

* Accept I/O requests coming from the front-end driver
* Execute those I/O requests through the physical device

### Transport
---
The transport is made up of the following two virtual devices. The virtqueue is the channel through which guest and host exchange data, while the virtio-buffer serves to notify the other side:

* Virtual queue (virtqueue)
    * Is part of guest memory
    * Carries the I/O between front end and back end (guest-to-hypervisor)
    * The implementation is called a vring, because it is built as a ring
    * In Linux, the virtqueue vdev contains the vring, shared with the host OS, which holds the `vring descriptor`s
    * The Linux `virtqueue` structure is what the front end or back end uses to record operations on the `vring`
* Virtio-buffer
    * Puts send/receive requests into the buffer to notify the other side (driver)
    * Can be viewed as a scatter-gather list ([with each entry in the list representing an address and a length](https://www.ibm.com/developerworks/library/l-virtio/)).
    * Recorded using the vring's descriptors
    * A vring kick sends a virtual interrupt to notify the OS on the other side

```
Initialization for Virtqueue

+--------------------------------------+
| Front-end Driver                     |
|                                      |
+--------------------------------------+
|            Virtio Driver             |
+--+-----------------------------------+
   | Find virtqueue
+--v-----------------------+
|  Virtio PCI Controller   |
+--+-----------------------+
   | Vring Alloc
+--v-----------------------------------+
| Virtqueue        +------------+      |
|   Virtio-buffer  |   Vring    |      |
|                  +------------+      |
|                            Transport |
+--+-----------------------------------+
   | Vring GPA
   | (Guest Physical Address)
+--v-----------------------+
|  Virtio PCI Controller   |
+--+-----------------------+
   |
+--v-----------------------------------+
|            Virtio Driver             |
+--------------------------------------+
|                                      |
| Back-end Driver                      |
+--------------------------------------+
```

## API Flow

Usage API:

* Front-end on Linux
    * virtqueue_add_buf()
        * Add a buffer to the vring
    * virtqueue_get_buf()
        * Get a vring buffer back from the virtqueue
    * virtqueue_kick()
        * Sends a virtual interrupt
        * Updates the buffers into the virtqueue
        * Notifies the other side of the virtqueue that something was sent
    * virtqueue_disable_cb()
        * Disables the callback; asynchronous
        * Used as an optimization
    * virtqueue_enable_cb()
        * Re-enables the callback after disable_cb
* Back-end on QEMU
    * virtqueue_pop()
        * Take data out of the virtqueue
    * virtqueue_push()
        * Put data into the virtqueue

Flow (the step-by-step flow is tedious, so the figures from the class slides are linked directly):

1. [virtqueue_add_buf](https://i.imgur.com/wgvOaLz.png)
2. [virtqueue_kick](https://i.imgur.com/HQ0qmEI.png)
3. [virtqueue_pop](https://i.imgur.com/MPr5ssO.png)
4. [virtqueue_push](https://i.imgur.com/PQMF0zr.png)
5. [virtqueue_get_buf](https://i.imgur.com/L3c4tED.png)

In the guest VM, the callback APIs:

* virtqueue_disable_cb(struct virtqueue *_vq)

```
/**
 * virtqueue_disable_cb - disable callbacks
 * @vq: the struct virtqueue we're talking about.
 *
 * Note that this is not necessarily synchronous, hence unreliable and only
 * useful as an optimization.
 *
 * Unlike other operations, this need not be serialized.
 */
```

* virtqueue_enable_cb(struct virtqueue *_vq)

```
/**
 * virtqueue_enable_cb - restart callbacks after disable_cb.
 * @vq: the struct virtqueue we're talking about.
 *
 * This re-enables callbacks; it returns "false" if there are pending
 * buffers in the queue, to detect a possible race between the driver
 * checking for more work, and enabling callbacks.
 *
 * Caller must ensure we don't call this with other virtqueue
 * operations at the same time (except where noted).
 */
```

# Device Model

* Achieves full virtualization
* As the figure below shows, when the guest OS calls the IO driver, the IO operation is forwarded to the VMM, which hands it to the device model.
* Device model
    * Emulates the corresponding IO operation

```
+------------------+
|     Guest vm     |
|                  |
+------------------+
|    IO driver     |
+-------^----------+
- ----  |  --- --- --
        |         VMM
+-------v--------+
|  Device model  |
+----------------+
```

### Implementation
---

1. In a [Type 1](https://en.wikipedia.org/wiki/Hypervisor#Classification) hypervisor, the device model is implemented inside the VMM.

```
+------------------+
|     Guest vm     |
|                  |
+------------------+
|    IO driver     |
+-------^----------+   User mode
- ----  |  --- --- --
        |   VMM / Kernel mode
+-------v--------+
|  Device model  |
+-------^--------+
        |
+-------v--------+
|   IO driver    |
+----------------+
```

2. In a [Type 2](https://en.wikipedia.org/wiki/Hypervisor#Classification) hypervisor, the device model runs in user mode:
    * It is a standalone user process
    * It is responsible for emulating the device

```
+------------------+           +----------------+
|     Guest vm     |           |  Device model  |
+------------------+           +--^----------^--+
|    IO driver     |              |          |
+---------^--------+  User mode   |          |
-- -- -- -|- -- -- -- -- -- -- -- | -- -- -- | --
          |           Kernel mode |          |
+---------v--------+              |          |      Host OS
|    IO driver     +--------------+----------+
+------------------+
```

## IO Virtualization Flow
---

1. Initialization
    * Device discovery
2.
Operation
    * Access interception

### Discovery
---
#### Physical devices

* Non-enumerable devices
    * Each has its own hard-coded way of being discovered
    * The VMM sets the status information on the virtual device port
* Enumerable devices
    * Have a well-defined way to discover their existence
    * The VMM emulates not only the devices but also the bus behavior
        * e.g. the match and the bus controller, etc.

#### Non-existent devices

* The VMM defines and emulates all of the device's behavior
* The VMM itself defines these devices as non-enumerable or enumerable
* The guest OS has to load the drivers for the virtual devices itself

### Interception
---
#### [Port mapped IO](https://en.wikipedia.org/wiki/Memory-mapped_I/O)

* Direct assignment
    * For this type, the VMM must enable the physical IO bitmap
    * All IO instructions act directly on the hardware; the VMM does not intervene
* Indirect assignment
    * The VMM turns the physical IO bitmap off
    * IO instructions are intercepted and processed by the VMM before being passed on to the hardware

#### [Memory Mapped IO](https://en.wikipedia.org/wiki/Memory-mapped_I/O)

* Direct assignment
    * The VMM uses a shadow page table to map the I/O device
    * _see Memory virtualization_
    * IO operations are not interfered with; only the memory address is translated through the shadow page table
* Indirect assignment
    * The VMM marks all IO devices invalid in the shadow page table, so any IO operation raises a page fault (see the figure below)
    * The page fault traps into the VMM, where the device can then be emulated

```
+---------------+
|               |
|   IO driver   |
|               |
+-------+-------+
        |
        | Page fault
+-------v---------+           +------------+
|     Shadow      +---------->+   Device   |
|   Page table    |           |   model    |
+-----------------+           +------------+
```

### DMA Mechanism

The guest device driver does not know the host's actual memory addresses, so when an IO operation occurs the VMM remaps the DMA target.

* For synchronous DMA, the guest OS obtains the correct host memory address through the [EPT technique](https://en.wikipedia.org/wiki/Second_Level_Address_Translation#Extended_Page_Tables)
    * Synchronous DMA is triggered by software
* For asynchronous DMA, the hardware needs the host OS's intervention to access memory
    * Triggered by a hardware device

## Hardware Solution
---

* Intel VT-d
    * Supports DMA remapping
* SR-IOV
    * A virtualizable device with a PCIe interface

### IOMMU

- Input/Output Memory Management Unit
- Connects a DMA-capable IO bus to main memory
- Maps device virtual addresses to physical addresses (PA)
- e.g.
    - System Memory Management Unit (SMMU) for ARM
    - VT-d for Intel
- Bypasses the VMM

#### ARM

* GIC - Generic Interrupt Controller
    * The only interrupt controller
    * Called the interrupt distributor
    * Set up at boot time
    * Virtual CPU interface
        * Provides virtual interrupts; through the virtual CPU interface the guest can handle interrupts without involving the hypervisor
    * Routes an interrupt either to the hypervisor's vector table
    * or to the guest OS's vector table
* SMMU
    * Translates memory in two stages
    * https://www.arm.com/files/pdf/System-MMU-Whitepaper-v8.0.pdf
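The two-stage translation performed by the SMMU (and by Intel's EPT, which the DMA section above relies on) can be sketched as a toy model. Everything below is invented for illustration: stage 1 maps a guest virtual page to a guest physical page, stage 2 maps that guest physical page to a host physical page, and a missing entry at either stage plays the role of a fault that traps to the hypervisor.

```python
PAGE_SIZE = 4096

def translate(addr, page_table):
    """One translation stage: page-granular lookup, offset carried through."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        # In hardware this would be a stage-1/stage-2 fault handled
        # by the guest OS or the hypervisor respectively.
        raise RuntimeError("fault at 0x%x" % addr)
    return page_table[page] * PAGE_SIZE + offset

# Stage 1: guest virtual page -> guest physical page (guest's page table)
stage1 = {0x10: 0x2}
# Stage 2: guest physical page -> host physical page (EPT / SMMU stage 2)
stage2 = {0x2: 0x7}

gva = 0x10 * PAGE_SIZE + 0x123   # guest virtual address
gpa = translate(gva, stage1)     # stage 1
hpa = translate(gpa, stage2)     # stage 2
print(hex(gpa), hex(hpa))        # 0x2123 0x7123
```

The point of the second stage is that the guest only ever produces guest physical addresses; the hypervisor (or the IOMMU, for device DMA) owns the stage-2 table and can relocate guest memory without the guest noticing.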
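The virtqueue API flow listed earlier (add_buf → kick → pop → push → get_buf) can be made concrete with a small user-space toy model. This is not the kernel or QEMU implementation — the real vring is a descriptor ring in guest memory shared with the host, and the kick is a real virtual interrupt — but the same five-step handshake is visible; the kernel API names are reused here only for orientation.

```python
from collections import deque

class ToyVirtqueue:
    """Toy stand-in for a virtqueue: an 'avail' side filled by the
    front end and a 'used' side filled by the back end."""
    def __init__(self):
        self.avail = deque()   # buffers the front end has exposed
        self.used = deque()    # buffers the back end has completed

    # --- front-end (guest driver) side ---
    def add_buf(self, buf):          # ~ virtqueue_add_buf()
        self.avail.append(buf)

    def kick(self, backend):         # ~ virtqueue_kick()
        backend(self)                # stand-in for the virtual interrupt

    def get_buf(self):               # ~ virtqueue_get_buf()
        return self.used.popleft() if self.used else None

    # --- back-end (QEMU) side ---
    def pop(self):                   # ~ virtqueue_pop()
        return self.avail.popleft() if self.avail else None

    def push(self, buf):             # ~ virtqueue_push()
        self.used.append(buf)

def echo_backend(vq):
    """Back end: pop every available buffer, 'process' it, push it back."""
    while (buf := vq.pop()) is not None:
        vq.push(buf.upper())

vq = ToyVirtqueue()
vq.add_buf("ping")       # 1. virtqueue_add_buf
vq.kick(echo_backend)    # 2. virtqueue_kick -> 3. pop, 4. push
print(vq.get_buf())      # 5. virtqueue_get_buf -> PING
```

Note how the two sides never call each other directly: they only touch the shared queue, and the kick is the single synchronization point — which is exactly why disabling the completion callback (virtqueue_disable_cb) can work as an optimization when the driver plans to poll anyway.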