# VirtIO and TC
###### tags: `CHTTL`
[2020-07-31 progress report](https://drive.google.com/file/d/1KLRdpGMQvKk-Z0f2hyyGZ_gacHHci7iM/view?usp=sharing)
---
[toc]
---
## Device I/O under QEMU full virtualization
1. The guest's I/O access is caught and handled by the I/O trap in the KVM module
2. The trapped I/O request is placed into the I/O sharing page
3. The QEMU process is notified to pick up the I/O information, and the QEMU I/O emulation code emulates the I/O request
4. When emulation finishes, the result is written back to the I/O sharing page
5. The I/O trap in the KVM module is notified to read the result back and return it to the virtual machine
* QEMU can emulate all kinds of I/O devices, even very old ones, but the convoluted steps above make it clear why emulating device I/O with QEMU performs poorly: besides the lengthy handling of every single I/O request, the excessive VMEntries, VMExits, and context switches also drag QEMU's performance down (a command-line illustration follows this list)
* For this reason virtio was proposed: a set of API interfaces running on top of the hypervisor that lets the virtual machine know it is running in a virtualized environment and interact with the hypervisor according to the virtio standard, achieving much better efficiency (the gain is most visible in I/O performance)
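For comparison with the virtio sections below, this is roughly what a fully emulated NIC looks like on the QEMU command line; the tap0 device, its bridge setup, and the disk image are assumptions:

```bash
# Fully emulated e1000 NIC: every guest register access traps into KVM and is
# bounced to QEMU's device-emulation code, following the five steps above.
qemu-system-x86_64 -enable-kvm -m 2G \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
  -device e1000,netdev=net0 \
  disk.img
```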
## virtio
Main references:
[1. The evolution of virtio networking (Virtio网络的演化之路)](https://cloud.tencent.com/developer/article/1540284),
[2. A brief look at full virtualization, paravirtualization, and passthrough for network I/O (浅谈网络I/O全虚拟化、半虚拟化和I/O透传)](https://ictyangye.github.io/virtualized-network-io/2019/03/31/virtualized-IO.html),
[3. Deep dive into Virtio-networking and vhost-net](https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net)
[4. Eugenio Pérez Martín @ Red Hat, an author with very detailed write-ups](https://www.redhat.com/en/authors/eugenio-p%C3%A9rez-mart%C3%ADn)
[5. A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net @ Red Hat](https://access.redhat.com/solutions/3394851)
[6. devconf 2019: virtio hardware acceleration](http://tech.mytrix.me/2019/05/devconf-19-virtio-%E7%A1%AC%E4%BB%B6%E5%8A%A0%E9%80%9F/)
* Virtualization rests on three key technologies: CPU virtualization, memory virtualization, and I/O virtualization. CPU and memory virtualization are mostly implemented with hardware support, whereas I/O virtualization is much less uniform: because it is largely implemented in software, many schemes with different performance and flexibility trade-offs have evolved over time.
* Paravirtualized network I/O
    * In this model the guest OS knows it is a virtual machine; I/O virtualization is implemented jointly by a front-end driver and a back-end driver. The driver running in the guest is called the front end, and the driver on the host that communicates with it is called the back end. The front end sends guest requests to the back end, and the back end returns the results after processing them.
    The front end is generically referred to as VirtIO; the sections below are all back ends, of which vhost is the most commonly used.
### virtio-net
Additional reference: [Virtio basic concepts and device operation (Virtio 基本概念和设备操作)](https://www.ibm.com/developerworks/cn/linux/1402_caobb_virtio/)
* The original form of virtio networking
* As above, I/O is split between a front-end driver in the guest and a back-end driver on the host: the front end sends guest requests to the back end, which processes them and returns the results
* Changes in the back end are what mark each stage of virtio networking's evolution
* The virtio standard calls its queue abstraction a Virtqueue; a Vring is the concrete implementation of a Virtqueue
    * A Virtqueue is mainly made up of a descriptor table, an available ring, and a used ring
* VirtIO passes packets between the guest and the host (the data path) through the Vring, which is implemented on top of **shared memory**
* When the virtio-net driver sends a packet, it places the data in the Available Ring and then triggers a notification (to KVM). QEMU then takes over and forwards the packet to the TAP device. Next QEMU places the completion in the Used Ring and issues another notification, which causes a virtual interrupt to be injected. On receiving that interrupt, the guest goes to the Used Ring to pick up what the back end has left there
    * In other words, once the guest has put a packet on the available ring it notifies KVM, and QEMU then delivers the packet to the TAP device (see the QEMU command sketch after the diagrams below)
* The two packet copies become the performance bottleneck: one from the TAP TX queue into a QEMU buffer, and one from the QEMU buffer into the vring on the RX side [[source](https://ictyangye.github.io/virtualized-network-io/2019/03/31/virtualized-IO.html)]
* The notification path is also long: when a packet reaches the TAP device the kernel notifies QEMU, QEMU issues an ioctl to ask KVM for an interrupt, and KVM injects the interrupt into the guest
    * These notifications are still part of the data plane
    * A notification issued by the guest first enters the kernel, then goes back up to user space, then into the kernel again: two switches in total
* block diagram
![](https://i.imgur.com/oHrliwx.png)
* flow diagram
![](https://i.imgur.com/uJFBcew.png)
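A minimal QEMU invocation matching the picture above: the back end runs inside QEMU itself and relays packets to a TAP device, with vhost explicitly disabled. The tap0 device, its bridge, and the disk image are assumptions:

```bash
qemu-system-x86_64 -enable-kvm -m 2G \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=off \
  -device virtio-net-pci,netdev=net0 \
  disk.img
```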
### vhost-net
* The back end moves into kernel space
* The virtio network back end implemented in QEMU never performed as well as hoped, mainly because of frequent context switches, inefficient data copies, and inter-thread synchronization. So a new virtio network back end was implemented in the kernel, named vhost-net
* The optimization over plain virtio-net is to relieve QEMU of the message-queue processing: a vhost-net kernel module on the host acts as the virtio back end, cutting down context switches and packet copies (a redesigned architecture with fewer VM exits, interrupts, and so on; see the command sketch after the diagrams below)
* A notification issued by the guest now goes straight into the kernel without detouring through user space, which improves network performance
* The virtio-net device in QEMU is left with only the control-plane role
* block diagram
![](https://i.imgur.com/cVjdGm0.png)
* flow diagram
![](https://i.imgur.com/ZWhDvC9.png)
* with OVS
![](https://i.imgur.com/ayQB630.png)
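A minimal sketch of switching to the vhost-net back end: the only change from the plain virtio-net command is loading the vhost_net module and passing vhost=on, after which QEMU hands virtqueue processing to a kernel vhost worker through /dev/vhost-net (tap0 and disk.img are still assumptions):

```bash
modprobe vhost_net            # provides /dev/vhost-net and the vhost worker threads
qemu-system-x86_64 -enable-kvm -m 2G \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on \
  -device virtio-net-pci,netdev=net0 \
  disk.img
```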
### vhost-user
* A back end accelerated with DPDK
* Works with both the Linux virtio-net front end and the DPDK virtio PMD front end
* vhost-user is a high-performance paravirtualized network I/O back end implemented in user space with DPDK. Its mechanism is similar to vhost-net, but the whole back end, including the OVS (Open vSwitch) datapath, lives in user space, where it can take full advantage of DPDK acceleration (see the sketch after the diagrams below)
* However, the OVS process is an ordinary user-space process with no permission to access guest memory, so shared memory has to be used: when the guest starts, it tells OVS about its memory layout and virtio virtqueue information over a socket
    * Although the plain virtio-net back end also runs in user space, the guest there lives inside the QEMU process, so there is no such permission problem (my guess)
* With that, OVS sets up shared memory with each VM and can implement in user space the same functionality as the vhost-net kernel module
    * vhost-net does not use a socket for this communication
* The biggest difference between the vhost-user protocol and the vhost(-net) protocol is the communication channel: the vhost protocol is driven by ioctl calls on the vhost-net character device, whereas the vhost-user protocol runs over a unix socket
* ...The result is that the DPDK application can read and write packets directly to and from guest memory and use the irqfd and the ioeventfd mechanisms to notify with the guest directly.
* Although vhost-user, like virtio-net, still involves two kernel transitions for notifications, it additionally supports hugepages, zero copy (probably not full zero copy), CPU pinning, and NUMA locality; see [[this](http://www.jeepxie.net/article/200363.html)]
* Depending on whether the host enables OVS-DPDK and whether the guest enables DPDK, the performance of the combinations compares as follows
![](https://i.imgur.com/NOHGVYR.png)
* DPDK also has its own virtio PMD as a high-performance front end (see [this post](https://www.redhat.com/en/blog/journey-vhost-users-realm) for details and diagrams)
* block diagram (virtio-net driver as frontend)
![](https://i.imgur.com/MkHzMcr.png)
* through OVS+DPDK
![](https://i.imgur.com/krvJ3CK.png)
* OVS+DPDK (more detailed version)
    * notifications are not relayed through KVM here
![](https://i.imgur.com/S4eQ8zU.png)
* flow diagram
![](https://i.imgur.com/C4kQaQo.png)
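A rough sketch of wiring a guest to an OVS-DPDK bridge over vhost-user, assuming a DPDK-enabled OVS with a netdev bridge br0 already set up (see the OVS-DPDK section further down); the socket path under the OVS run directory and the sizes are assumptions. The key detail is that guest RAM must be file-backed and shared so the user-space switch can map it:

```bash
# OVS side: a vhost-user port; in this mode OVS creates and listens on the socket
ovs-vsctl add-port br0 vhu0 -- set Interface vhu0 type=dpdkvhostuser

# QEMU side: back the guest RAM with shared hugepages, then attach a virtio-net
# device whose backend is the vhost-user socket created by OVS above
qemu-system-x86_64 -enable-kvm -m 2G -cpu host \
  -object memory-backend-file,id=mem0,size=2G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhu0 \
  -netdev type=vhost-user,id=net0,chardev=char0 \
  -device virtio-net-pci,netdev=net0 \
  disk.img
```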
### vDPA
* In the DPDK-accelerated vhost-user scheme there is still one memory copy left, and that copy is the last remaining performance bottleneck of paravirtualization
* Intel introduced a hardware solution that lets the NIC talk to the virtio virtqueues inside the guest directly and DMA packets straight into guest buffers, achieving true zero copy while still conforming to the virtio standard
* As the figure shows, packets travel from the NIC into the physical memory backing guest memory (the dashed line in the middle), so they are handed to the guest directly with no copy in between
![](https://i.imgur.com/RxSy9Xg.png)
* Virtual data path acceleration (vDPA) in essence is an approach to standardize the NIC SRIOV data plane using the virtio ring layout and placing a single standard virtio driver in the guest decoupled from any vendor implementation, while adding a generic control plane and SW infrastructure to support it.
* The virtio control plane still has to be relayed by the vDPA driver: QEMU and the guest keep using the original control-plane protocol as their interface, and that control information is passed down to the hardware, which uses it to configure the data plane
* Since the back-end data processing now happens entirely in hardware, the original front-end/back-end notifications can also almost completely avoid host involvement. Take interrupts as an example: previously an interrupt had to be handled by the host, which learned its destination from the software switch and then injected a virtual interrupt into the guest; with vDPA, the NIC can deliver the interrupt into the guest directly
* Overall, the vDPA data plane is very close to that of SR-IOV device passthrough and can reach the same level of performance. More importantly, the vDPA framework keeps the standard virtio interface, so cloud providers gain the extra performance without changing the virtio interface
* block diagram
![](https://i.imgur.com/2pUPyel.png)
* block diagram
![](https://i.imgur.com/sHlHGKe.png)
* vDPA+OVS
![](https://i.imgur.com/w5JPC4Z.png)
* how it works in DPDK (the PMD apparently serves only as a fallback path)
    * You can see that vDPA takes over the data plane, while the control plane is still implemented by virtio-net (vhost-user). That made vDPA easy to get started with, because it removed a lot of work; but for other device types that want virtio hardware acceleration, such as storage or auxiliary accelerators, vDPA only saves the data-plane code and the control plane still requires a lot of work. That is what the vdpa driver in DPDK does: it bridges the step from vhost to vDPA (a rough usage sketch follows the diagrams below)
![](https://i.imgur.com/vXjEc9i.png)
* block diagram
* Compared with the DPDK figure, this one adds the vDPA Framework & vDPA Driver
![](https://i.imgur.com/z2JTqHC.png)
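The diagrams above show the DPDK-based vDPA path. As a rough sketch of the same idea through the newer in-kernel vDPA framework (the device names, PCI address, and module/binding steps are assumptions that depend on the vendor driver), the guest still just sees a standard virtio-net device:

```bash
vdpa mgmtdev show                                  # list vDPA-capable management devices
vdpa dev add name vdpa0 mgmtdev pci/0000:65:00.2   # create a vDPA device instance
# with the vhost_vdpa module loaded and bound, the device shows up as /dev/vhost-vdpa-0

qemu-system-x86_64 -enable-kvm -m 2G \
  -netdev type=vhost-vdpa,id=net0,vhostdev=/dev/vhost-vdpa-0 \
  -device virtio-net-pci,netdev=net0 \
  disk.img
```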
### Summary
* Looking back over the evolution of virtio networking: on the control plane, the path went from the original virtio to the vhost-net protocol and then the vhost-user protocol, steadily becoming more complete and more capable. On the data plane, it moved from being embedded in QEMU or in a kernel module, to vhost-user with DPDK's data-plane optimizations, and finally to a hardware-accelerated data plane. All of this reaches SR-IOV passthrough-level network performance while keeping virtio as the standard interface
* Packet-forwarding fundamentals: [an analysis of DPDK (DPDK解析)](https://www.jianshu.com/p/9b669f7c97ce)
### others
* vhost-user has 2 sides:
    * Master - qemu
    * Slave - Open vSwitch or any other software switch
* vhost-user can run in 2 modes (see the sketch below):
    * vhostuser-client - qemu is the server, the software switch is the client
    * vhostuser - the software switch is the server, qemu is the client
* vhost-user is based on the vhost architecture and implements all of its features in user space
* the vhost-user socket carries the control plane
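The two modes map directly onto OVS port types; a sketch (port names, paths, and the bridge are placeholders):

```bash
# vhostuser: the software switch is the server, QEMU's chardev connects as a client
ovs-vsctl add-port br0 vhu0 -- set Interface vhu0 type=dpdkvhostuser
#   QEMU: -chardev socket,id=char0,path=<ovs-run-dir>/vhu0

# vhostuser-client: QEMU is the server and OVS (re)connects,
# so the guest can survive a vswitch restart
ovs-vsctl add-port br0 vhu1 -- set Interface vhu1 type=dpdkvhostuserclient \
    options:vhost-server-path=/tmp/vhu1
#   QEMU: -chardev socket,id=char1,path=/tmp/vhu1,server=on,wait=off
```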
## Achieving network wirespeed in an open standard manner: introducing vDPA [[source](https://www.redhat.com/en/blog/achieving-network-wirespeed-open-standard-manner-introducing-vdpa)]
### SR-IOV
* SR-IOV is used in two main ways
1. Using the guest kernel driver: In this approach we use the NIC (vendor specific) driver in the kernel of the guest, while directly mapping the IO memory, so that the HW device can directly access the memory on the guest kernel.
2. Using the DPDK pmd driver in the guest: In this approach we use the NIC (vendor specific) DPDK pmd driver in the guest userspace, while directly mapping the IO memory, so that the HW device can directly access the memory on the specific userspace process in the guest.
* SR-IOV is a specification by PCI-SIG that allows a single physical device to expose multiple virtual devices. Those virtual devices can be safely assigned to guest virtual machine giving them direct access to the hardware. Using hardware directly reduces the CPU load on the hypervisor and usually results in better performance and lower latency.
* The problem is that we have a single physical NIC on the server exposed through PCI thus the question is how can we create "virtual ports" on the physical NIC as well?
Single root I/O virtualization (SR-IOV) is a standard for a type of PCI device assignment that can share a single device to multiple virtual machines. In other words, it allows different VMs in a virtual environment to share a single NIC. This means we can have a single root function such as an Ethernet port appear as multiple separated physical devices which address our problem of creating "virtual ports" in the NIC.
* ... This implies for example that if the NIC firmware is upgraded, the guest application driver may need to be upgraded as well. If the NIC is replaced with a NIC from another vendor, the guest must use another PMD to drive the NIC. Moreover, migration of a VM can only be done to a host with the exact same configuration. This implies the same NIC with the same version, in the same physical place and some vendor specific tailored solution for migration.
So the question we want to address is how to provide the SRIOV wirespeed to the VM while using a standard interface and most importantly, using generic driver in the guest to decouple it from specific host configurations or NICs.
![](https://i.imgur.com/OEof9r8.png)
* Mellanox OVS+SR-IOV
![](https://i.imgur.com/EE4joXc.png)
### Virtio full HW offloading
* In this approach the guest can communicate directly with the NIC via PCI so there is no need for any additional drivers in the host kernel. The approach however requires the NIC vendor to implement the virtio spec fully inside its NIC (each vendor with its proprietary implementation) including the control plane implementation (which is usually done in SW on the host OS, but in this case needs to be implemented inside the NIC)
* In other words, the drawback is that the control plane has to be implemented by each NIC vendor inside the NIC itself, which is a lot of work (and is why the DPDK-based vDPA approach exists)
![](https://i.imgur.com/xIJEwan.png)
### vDPA - standard data plane
* using the virtio ring layout
* vDPA is a much more flexible approach than virtio full HW offloading, enabling NIC vendors to support the virtio ring layout with significantly smaller effort while still achieving wire-speed performance on the data plane.
![](https://i.imgur.com/xfOnOW9.png)
* vDPA has the potential of being a powerful solution for providing wirespeed Ethernet interfaces to VMs:
1. Open public specification—anyone can see, consume and be part of enhancing the specifications (the Virtio specification) without being locked to a specific vendor.
2. Wire speed performance—similar to SRIOV, no mediators or translator between.
3. Future proof for additional HW platform technologies—ready to also support Scalable IOV and similar platform level enhancements.
4. Single VM certification—Single standard generic guest driver independent of a specific NIC vendor means that you can now certify your VM consuming an acceleration interface only once regardless of the NICs and versions used (both for your Guest OS and or for your Container/userspace image).
5. Transparent protection—the guest uses a single interface which is protected on the host side by 2 interfaces (AKA backend-protection). If for example the vDPA NIC is disconnected then the host kernel is able to identify this quickly and automagically switch the guest to use another virtio interface such as a simple vhost-net backend.
6. Live migration—Providing live migration between different vendor NICs and versions given the ring layout is now standard in the guest.
7. Providing a standard accelerated interface for containers—will be discussed in future posts.
8. The bare-metal vision—a single generic NIC driver—Forward looking, the virtio-net driver can be enabled as a bare-metal driver, while using the vDPA SW infrastructure in the kernel to enable a single generic NIC driver to drive different HW NICs (similar, e.g. to the NVMe driver for storage devices).
* With vDPA, no vendor specific software is required to operate the device by the VM guest, however, some vendor specific software is used on the host side.
## TC
1. Another building block towards being able to actually offload flow rules to HW was the introduction of a framework for offloading TC classifier rules/actions to NIC HW drivers.
2. TC (Traffic Control) is the Linux tool for traffic control; it can control the rate at which a network interface sends data. Every network interface has a queue that manages and schedules outgoing data. TC works by configuring different kinds of queues on the interface, changing the rate and priority with which packets are sent and thereby achieving traffic control (see the sketch below)
> Source: https://kknews.cc/code/mkal23p.html
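For example, the classic rate-limiting use of TC replaces an interface's default queue with a token bucket filter; a minimal sketch (the interface name and the rates are arbitrary):

```bash
# cap egress traffic on eth0 at 100 Mbit/s with a small burst allowance
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms
tc qdisc show dev eth0     # inspect the queueing discipline in effect
tc qdisc del dev eth0 root # restore the default qdisc
```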
## Kernel-OVS, OVS-DPDK, OVS-TC
Reference: [this Netronome white paper](https://www.netronome.com/m/documents/WP_OVS-TC_.pdf), highly recommended!
The Open vSwitch kernel module (Kernel-OVS) is the most commonly used OVS datapath.
Kernel-OVS is implemented as a match/action forwarding engine based on flows that are
inserted, modified or removed by user space. In 2012 OVS was further enhanced with another user space datapath based on the data plane development kit (DPDK). The addition
of OVS-DPDK improved performance but created some challenges. OVS-DPDK bypasses
the Linux kernel networking stack, requires third party modules and defines its own security
model for user space access to networking hardware. DPDK applications are more difficult to
configure optimally and while OVS-DPDK management solutions do exist, debugging can become a challenge without access to the tools generally available for the Linux kernel networking stack. It has become clear that a better solution is needed.
OVS using traffic control (TC) is the newest kernel-based approach and improves upon
Kernel-OVS and OVS-DPDK by providing a standard upstream interface for hardware acceleration. This paper will discuss how an offloaded OVS-TC solution performs against software-based OVS-DPDK.
Service providers need a scalable vSwitch, and now there is an open source, upstreamed and
kernel-compliant solution with OVS-TC which maintains all the benefits of Kernel-OVS and
OVS-DPDK. In addition, hardware-accelerated OVS-TC provides better CPU efficiency, lower
complexity, enhanced scalability and increased network performance.
![](https://i.imgur.com/lSm2ukq.png)
![](https://i.imgur.com/zMn7lFg.png)
Netronome Agilio CX SmartNICs enable transparent offload of the TC datapath. While OVS
software still runs on the server, the OVS-TC datapath match/action modules are synchronized down to the Agilio SmartNIC via hooks provided in the Linux kernel.
OVS contains a user space-based agent and a kernel-based datapath. The user space agent
is responsible for switch configuration and flow table population.
![](https://i.imgur.com/6WJxqf1.png)
This translates directly to improved server efficiency and a dramatic reduction in TCO, as fewer servers and less data center infrastructure (such as switches, racks, and cabling) are needed to perform a given application workload.
* DPDK drawbacks: it monopolizes CPU cores (they are dedicated to polling), and because the kernel is bypassed, many of the usual Linux tools are unavailable for operating and debugging it
* image [source](https://www.netronome.com/m/documents/WP_OVS-TC_.pdf)![](https://i.imgur.com/UObNEeJ.png)
* DPDK is a form of software acceleration
### Benefits of hardware offload
* more VM instances, higher throughput, lower latency
## [A brief look at OVS hardware acceleration (OVS硬體加速淺談)](https://www.sdnlab.com/23003.html)
#### OpenVSwitch kernel datapath
* OpenVSwitch is a virtual switch that implements OpenFlow. It is made up of several modules: mainly the ovsdb-server and ovs-vswitchd processes in user space, plus the OVS datapath in kernel space
* ![](https://i.imgur.com/wFJEu2K.png)
* All packet forwarding is done by the OVS datapath in kernel space. For a given network flow, when the first packet reaches the OVS datapath there is no forwarding information yet and the datapath does not know how to forward it, so it asks the ovs-vswitchd process in user space
* That is how the first packet gets forwarded. At the same time, ovs-vswitchd writes one (or possibly several) forwarding rules into the OVS datapath via netlink. These rules are not OpenFlow rules; they can be inspected with ovs-dpctl dump-flows
* The path that resolves forwarding through ovs-vswitchd and its OpenFlow tables is called the slow path; the path forwarded directly by the OVS datapath is called the fast path (both can be observed with the commands after this list)
* The whole OVS design is to combine the slow path and the fast path to forward traffic efficiently, a design philosophy much like that of traditional hardware network devices
* The OVS kernel datapath essentially implements the functionality of a dedicated network device on top of general-purpose hardware and software. The advantages were covered at the start, but this implementation also has its drawbacks:
    * A hardware switch has dedicated forwarding hardware, so some resources are guaranteed to be available for forwarding at all times
    * The operating system treats the virtual switch like any other process: it gets time slices of a CPU rather than a whole CPU, and its memory is managed by the OS, so resource contention is possible
    * Also, because of how the OS itself is designed, network data must pass through hard interrupts, soft interrupts, and kernel/user space switches, so packets forwarded through the kernel travel a long path inside the OS
* The OVS kernel datapath is therefore mostly used where network performance requirements are not demanding
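The two paths can be inspected directly; a quick sketch assuming a bridge named br0:

```bash
ovs-ofctl dump-flows br0   # OpenFlow rules held by ovs-vswitchd (slow path)
ovs-dpctl dump-flows       # flow cache installed in the kernel datapath (fast path)
```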
#### OpenVSwitch DPDK
* DPDK (Data Plane Development Kit) is an independent technology in its own right; OpenVSwitch added support for DPDK in 2012
* The DPDK approach is to bypass the OS kernel and drive the NIC's receive and transmit queues directly from user space through a PMD (Poll Mode Driver). On the receive side, the PMD polls the NIC continuously; once packets arrive they are moved by DMA (which reads and writes system memory directly, with no CPU involvement) into pre-allocated memory and the receive-queue pointers are updated, so the application notices new packets very quickly. By skipping most interrupt handling and the kernel, DPDK greatly shortens the path packets take through the OS
    * "the application" here means something like OVS
* On the other hand, because the PMD polls the NIC, DPDK has to take exclusive ownership of some CPU and memory, which means a fixed amount of resources is always available for forwarding and the contention caused by OS scheduling is avoided
* ![](https://i.imgur.com/TWVpbJw.png)
* Because DPDK bypasses kernel space, the OVS-DPDK datapath also lives in user space, as shown below (a configuration sketch follows this list). To an SDN controller nothing looks different, since it connects to ovs-vswitchd and ovsdb-server
* ![](https://i.imgur.com/IQ9d3bd.png)
* Although DPDK solves the performance problem to a large extent, and the DPDK community keeps optimizing it, it has problems of its own. First, DPDK is not part of the operating system, so extra software has to be installed, which increases maintenance cost. Second, bypassing kernel space also means bypassing the network stack, which makes DPDK-based applications harder to debug and tune, because many of the Linux-kernel network debugging and monitoring tools no longer apply. Third, DPDK takes exclusive ownership of part of the CPU and memory, resources that would otherwise run applications; the most tangible effect is that a host can run fewer VMs once DPDK is in use
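A minimal sketch of switching OVS over to the DPDK user-space datapath, assuming an OVS build with DPDK support and a NIC already bound to a DPDK-compatible driver (the PCI address and port name are assumptions):

```bash
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev   # user-space datapath
ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
    options:dpdk-devargs=0000:03:00.0
```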
#### OpenVSwitch Hardware offload
* Linux TC (Traffic Control) Flower
    * Linux TC was originally built to implement QoS
    * It adds hook points on the ingress and egress sides of a netdev, through which it can control traffic rate, latency, priority, and so on
    * Later, TC gained the Classifier-Action subsystem, which identifies packets by their headers and applies the corresponding actions. Unlike other classifier-action systems such as OpenFlow, TC's CA subsystem does not provide just one classifier: it is a plugin system that can accept any classifier, even user-defined ones
    * In 2015 the TC Classifier-Action subsystem gained support for OpenFlow, so every OpenFlow rule can be mapped to a TC rule. Not long afterwards the OpenFlow classifier was renamed the Flower classifier, which is where TC Flower comes from
* Linux TC Flower hardware offload
    * In 2011 the Linux kernel added support for hardware-based QoS. Since TC is the module that implements QoS in Linux, this effectively gave TC a hardware-offload capability
    * In kernels 4.9~4.14, Linux added support for offloading TC Flower to hardware. In other words, OpenFlow rules can potentially be forwarded in hardware (mainly NICs) through TC Flower's offload capability
    * The way TC Flower hardware offload works is fairly simple. When a TC Flower rule is added, Linux TC checks whether the NIC the rule is attached to supports and has enabled the NETIF_F_HW_TC feature flag, and whether it implements ndo_setup_tc (the TC hardware-offload hook). If both conditions hold, the rule is handed to the NIC's ndo_setup_tc function and programmed into the NIC (a hands-on sketch follows this list)
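A sketch of exercising that path by hand, assuming eth0 is a NIC whose driver implements ndo_setup_tc; the skip_sw flag insists that the rule go to hardware instead of the software fallback:

```bash
ethtool -K eth0 hw-tc-offload on      # enable the NETIF_F_HW_TC feature flag
tc qdisc add dev eth0 ingress         # attach the ingress hook
tc filter add dev eth0 ingress protocol ip flower skip_sw \
    ip_proto tcp dst_port 80 action drop
tc -s filter show dev eth0 ingress    # offloaded rules are reported as in_hw
```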
#### OVS-TC
* As described above, TC Flower rules can now be pushed down to the NIC, and the NIC carries its own embedded virtual switch, which Mellanox calls the eSwitch. When OVS initializes, it installs a default rule into the eSwitch; any packet that matches no other rule matches this default rule, whose action is to send the packet to the eSwitch's managing host, that is, up to the datapath in the Linux kernel
* If that packet is the first of its flow then, as before, the kernel OVS datapath passes it further up to ovs-vswitchd in user space. Since ovs-vswitchd holds the OpenFlow rules, it can still complete the forwarding. What differs is that ovs-vswitchd now also decides whether the rule for this flow can be offloaded to the NIC; if it can, ovs-vswitchd pushes the flow rule down to the hardware through the TC interface. Subsequent packets of the same flow are then forwarded entirely inside the NIC's eSwitch and never have to reach the host OS at all. Aging of datapath rules is likewise handled by ovs-vswitchd polling and eventually deleting them through the TC interface. The datapath changes as shown below (a configuration sketch follows this list)
* ![](https://i.imgur.com/Cv9T5ZA.png)
* Strictly speaking, in OVS-TC there are now three datapaths: the original OVS kernel datapath, a TC datapath in the kernel, and a TC datapath in the NIC. The in-kernel TC datapath is normally empty; it is just the attachment point through which ovs-vswitchd pushes hardware TC Flower rules
* ![](https://i.imgur.com/bm7Swd4.png)
* The OVS-TC approach can deliver higher network performance than DPDK. First, the forwarding path no longer enters the host operating system at all, so it is even shorter. Second, the NIC is a dedicated network device whose forwarding generally outperforms DPDK emulating one on general-purpose hardware. Moreover, the NIC's TC Flower offload comes with the NIC driver, so its operational cost is far lower than DPDK's
* But OVS-TC has problems of its own. First, it requires specific NICs, and unsurprisingly NICs that support this feature are more expensive, which raises cost, although the CPU and memory freed by dropping DPDK offset part of it. Second, OVS-TC is not yet feature-complete; connection tracking, for example, is not well supported yet. Third, much like DPDK, since traffic no longer passes through the Linux kernel, some of the usual tools become unavailable, which makes monitoring harder
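A sketch of enabling this on a Mellanox-style SmartNIC: the PF is switched to switchdev mode so that VF representors and the eSwitch become visible, and OVS is told to try TC offload for its datapath flows (the PCI address, interface name, and service name are assumptions):

```bash
echo 4 > /sys/class/net/ens1f0/device/sriov_numvfs        # create the VFs
devlink dev eswitch set pci/0000:03:00.0 mode switchdev   # expose VF representors / eSwitch
ethtool -K ens1f0 hw-tc-offload on
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch                             # service name varies by distro
ovs-appctl dpctl/dump-flows type=offloaded                # flows that made it into the NIC
```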
#### Closing thoughts
* Of the three implementations, OVS kernel is the most stable and most feature-complete, with the richest tooling, but its performance is the weakest; where there is no special performance requirement it is still the first choice. OVS-DPDK raises network performance to a degree and has no hardware dependency, but it consumes host resources and carries some maintenance cost. OVS-TC has the best network performance, but it is not yet mature and depends on specific NICs; still, in my view it is the more promising direction
## others
### TC summary
* HW offload in OVS
* Kernel offload using TC
* DPDK offload using rte_flow
* Whether vDPA or SR-IOV uses TC or rte_flow, the difference is mainly in how rules are pushed down; it should make little practical difference to the implementation [[reference](https://www.openvswitch.org/support/ovscon2019/day2/0951-hw_offload_ovs_con_19-Oz-Mellanox.pdf)]
### tc flower [reference](https://www.slideshare.net/Netronome/tc-flower-offload)
* match-action ![](https://i.imgur.com/9AS1rZH.png)
* linux command example (a textual version follows below) ![](https://i.imgur.com/EcMxXY7.png)
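In text form, a flower rule of the kind shown in the slides, redirecting matched traffic from a VF representor to the uplink; interface names and the MAC address are placeholders:

```bash
tc qdisc add dev ens1f0_0 ingress
tc filter add dev ens1f0_0 ingress protocol ip flower \
    src_mac e4:11:22:33:44:50 dst_ip 192.168.1.0/24 \
    action mirred egress redirect dev ens1f0
```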
### Miscellaneous notes
* Mellanox ConnectX-5 uses mlx5_core as its PF and VF driver. This driver generally takes care of table creation in hardware, flow handling, switchdev configuration, block device creation, etc.
* [Reference](https://hackmd.io/@sysprog/linux-ebpf#eBPF-%E5%88%B0%E5%BA%95%E5%92%8C%E8%A7%80%E5%AF%9F%E4%BD%9C%E6%A5%AD%E7%B3%BB%E7%B5%B1%E5%85%A7%E9%83%A8%E6%9C%89%E4%BD%95%E9%97%9C%E8%81%AF%EF%BC%9F) The original motivation for the Berkeley Packet Filter (BPF) was indeed packet filtering, but once extended into eBPF (Extended BPF) it became a built-in toolkit for observing the Linux kernel's internal behavior, covering:
    * dynamic tracing
    * static tracing
    * profiling events
### Further reading
[Lets understand the openvswitch hardware offload!](https://hareshkhandelwal.blog/2020/03/11/lets-understand-the-openvswitch-hardware-offload/)