## Assessing Container Network Interface Plugins: Functionality, Performance, and Scalability

This paper covers the performance of the various CNIs (latency, throughput, and fine-grained CPU measurements of the various components of the networking stack), in particular:
1. behavior as the number of Pods grows (larger deployments)
2. Pod startup latency
#### Claim: ==Overlay tunnel offload support in the network interface card plays a significant role in achieving the good performance of CNIs==
Question: how do the various design considerations affect performance?
Main aspects examined:
1. How different CNI plugins differ in handling IPv6 and encryption support
2. How CNIs process iptables chains, packet forwarding, overlay tunneling, and the extended Berkeley Packet Filter (eBPF) in the host network
3. Packet transmission rates when the CNI plugin's datapath includes the network protocol stack
4. Which CNI should be used in different real-world scenarios
5. The paper does not cover installing multiple CNI components on the same cluster
### I. introduction
#### A. Container Network Interface
CNM: the Container Network Model, supported only by the Docker runtime
CNI: the general-purpose interface adopted across container runtimes and orchestrators
* What the CNI mainly does:
> The CNI plugin is responsible for IP Address Management (IPAM): connecting the Pod network namespace with the host network and allocating IP addresses to the container network interface
* A CNI plugin is an executable file
* CNI components:
1. CNI daemon:
> Handles network routing across hosts (nodes), applies network policies, renews subnet leases, performs border gateway (BGP) updates, and defines custom resources such as Calico's IPPool
2. CNI binary files:
> Allocate IP addresses to Pods and create Linux network devices (e.g., a bridge)
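A minimal sketch (not any specific plugin's code) of how the runtime drives a CNI binary: the operation arrives in the `CNI_COMMAND` environment variable, the network config arrives on stdin, and the plugin prints an IPAM-style result on stdout. The hard-coded address is a placeholder.

```python
#!/usr/bin/env python3
"""Toy CNI binary skeleton, for illustration only: a real plugin must also
create the veth pair, move it into CNI_NETNS, and call its IPAM backend."""
import json
import os
import sys

def main():
    command = os.environ.get("CNI_COMMAND", "")   # ADD / DEL / CHECK / VERSION
    netconf = json.load(sys.stdin)                # network config passed by the runtime

    if command == "ADD":
        result = {
            "cniVersion": netconf.get("cniVersion", "0.4.0"),
            "interfaces": [{"name": os.environ.get("CNI_IFNAME", "eth0")}],
            # Placeholder address; a real plugin asks its IPAM module for this.
            "ips": [{"version": "4", "address": "10.244.1.10/24", "gateway": "10.244.1.1"}],
        }
        json.dump(result, sys.stdout)
    elif command == "DEL":
        pass  # release the IP and tear down the interface here

if __name__ == "__main__":
    main()
```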

#### kubernetes network model
1. Intra-Pod or container-to-container communication within a Pod
2. Inter-Pod or Pod-to-Pod communication
3. Service-to-Pod communication
4. External-to-Service communication
Requirements:
1. Pods within the cluster can communicate with each other without NAT
2. Pods on the same host (node) must be able to communicate with each other
#### network policy:
> Specifies whether the network flows passing to/from a Pod should be accepted or denied
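A minimal sketch of such a policy, written as a Python dict mirroring the Kubernetes NetworkPolicy manifest structure. The labels `app: web` and `role: frontend` and the policy name are hypothetical; the policy allows ingress to the selected Pods only from frontend Pods on TCP/80 and implicitly denies other ingress.

```python
import json

# Hypothetical example policy: allow only "role: frontend" Pods to reach
# "app: web" Pods on TCP port 80; other inbound traffic is denied.
network_policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-frontend-to-web"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "web"}},
        "policyTypes": ["Ingress"],
        "ingress": [{
            "from": [{"podSelector": {"matchLabels": {"role": "frontend"}}}],
            "ports": [{"protocol": "TCP", "port": 80}],
        }],
    },
}

# Print as JSON (kubectl accepts JSON manifests as well as YAML).
print(json.dumps(network_policy, indent=2))
```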
Intra-/inter-host Pod communication (layer of operation):
> Layer 2: MAC learning (macvlan, bridge); as the number of Pods on a host grows, MAC learning becomes slower and slower (at the overlay network layer)
> Layer 3: BGP supports crossing autonomous system (AS) boundaries; its advantage is better flexibility, e.g., Layer-2 for intra-host and Layer-3 for inter-host
>> Layer-3 security concerns: BGP hijacks and route leaks
(i) Overlay network: when packets go through the overlay tunnel, they are encapsulated with an outer header based on the adopted overlay protocol, e.g., Virtual Extensible LAN (VXLAN) or Generic Routing Encapsulation (GRE). These encapsulations reduce performance, and as packets get longer the checksum has to be recomputed (so the data can still be verified).
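A quick back-of-the-envelope sketch of the per-packet cost of those encapsulations, assuming the usual minimum header sizes (no options, IPv4 only); it also shows where the 1450-byte inner MTU commonly seen with VXLAN backends comes from.

```python
# Encapsulation overhead per packet, relative to a 1500-byte physical MTU.
# Header sizes are the standard minimums; real deployments may differ.
ENCAP_OVERHEAD = {
    "vxlan":    20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN + inner Ethernet = 50 B
    "gre":      20 + 4,           # outer IPv4 + basic GRE header = 24 B
    "ip-in-ip": 20,               # one extra IPv4 header
}

PHYS_MTU = 1500
for name, overhead in ENCAP_OVERHEAD.items():
    inner_mtu = PHYS_MTU - overhead
    share = overhead / PHYS_MTU * 100
    print(f"{name:9s} inner MTU = {inner_mtu} B, header overhead ≈ {share:.1f}% of a full frame")
```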
#### packet forwarding and routing
1. Layer3 + Overlay:
* inter-host: encapsulation via the overlay tunnel endpoint (e.g., VXLAN)
* intra-host: Pod traffic is forwarded through the host's Layer-3 network stack (IP-in-IP)
2. Layer3 + underlay:
> Similar to attaching virtual interfaces under a regular physical NIC
3. hybrid + overlay:
> Each Pod gets a corresponding virtual interface (veth), which connects to a Linux bridge; an overlay tunnel endpoint then connects the bridge to the external eth0
4. hybrid + underlay:
> Each Pod gets a virtual interface, connected through a Linux bridge directly to eth0
Descriptions of the individual CNI plugins:
1. Flannel:
> The VXLAN backend outperforms the plain UDP overlay mode. VXLAN adds an extra outer UDP header; under the UDP mode, after the packet is processed in kernel space it is handed back to user space for encapsulation and then sent on to the destination
>> in total three context switches happen under the UDP mode, resulting in more CPU overhead and poor performance
* Drawback: Flannel does not implement network policies
2. Weave:
> intra-host communication: performed at Layer 2
> inter-host communication: performed at Layer 3
* Packet Forwarding across Hosts
> The VxLAN mode is running based on the kernel’s native Open vSwitch datapath module, while the UDP mode relies on the Weave CNI daemon to implement encapsulation
>> Weave likewise incurs three context switches (in its UDP mode), the same as Flannel
* network policy support
> Weave implements network policy with iptables
> Weave uses the state extension in iptables, which is a subset of the connection tracking extension (conntrack)
>> Weave uses the state extension to speed up iptables processing
#### state extension: ==For an established connection, only the first packet needs to be matched against the iptables rules; the remaining packets are allowed to pass directly==
#### ipset: speeds up iptables processing
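A hedged sketch of both shortcuts, driven from Python via subprocess; the chain choice (FORWARD), the set name `allowed-pods`, and the sample Pod IP are illustrative, and the commands need root on a Linux host.

```python
import subprocess

def run(cmd):
    """Print and execute one command (requires root)."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. state/conntrack extension: only the first packet of a connection is
#    evaluated against the full rule set; later packets match this rule and pass.
run(["iptables", "-A", "FORWARD",
     "-m", "state", "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT"])

# 2. ipset: one set-match rule replaces a long list of per-IP rules.
run(["ipset", "create", "allowed-pods", "hash:ip"])
run(["ipset", "add", "allowed-pods", "10.244.1.10"])   # example Pod IP
run(["iptables", "-A", "FORWARD",
     "-m", "set", "--match-set", "allowed-pods", "src", "-j", "ACCEPT"])
```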
Other features: Weave supports multicasting, which can be used for streaming video
3. Calico:
#### Layer of Operation:
> Only Layer-3 intra-/inter-host communication
#### Packet Forwarding across Hosts:
Overlay network: IP-in-IP or VxLAN encapsulation
Underlay mode: native routing using BGP
#### Network Policy Support:
Calico integrates effectively with Istio; it configures iptables to decide whether a Pod's packets are dropped or accepted. ==Calico implements network policy based on iptables==
> It inserts user-defined chains on top of the system default chain
>> Once a connection has been established, subsequent packets are allowed to pass directly instead of being examined by iptables (like ==Weave==)
##### Calico also supports IPv6, but only in the underlay network; the overlay network still supports only IPv4
4. Cilium:
> layer 3 only
#### Layer of operation
>For intra-host communication, Cilium relies on eBPF programs attached at the veth-pairs to redirect packets to their destination.
Cilium's features:
> Cilium builds its datapath based on a set of eBPF hooks that run eBPF programs. The eBPF hooks used in Cilium include XDP (eXpress Data Path), Traffic Control ingress/egress (TC), Socket operations, and Socket send/recv.
#### Packet forwarding across hosts
> Like Calico, Cilium’s underlay mode uses native routing based on BGP
> The underlay network uses BGP; in the overlay network, once "bpf_overlay" is set up, packets go through the OTEP to eth0, ==the same as Calico==
#### network policy support:
> Cilium can use eBPF hooks (e.g., XDP, TC) to define packet filters. Since this filtering occurs earlier than the network protocol stack, it can achieve better performance than iptables.
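A hedged sketch (not Cilium's actual datapath code) of the idea: a tiny XDP program attached with the bcc Python toolkit filters packets before they ever reach Netfilter. The interface name `eth0` and the drop-ICMP rule are illustrative; it needs root and bcc installed.

```python
from bcc import BPF

prog = r"""
#define KBUILD_MODNAME "xdp_sketch"
#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>

int xdp_drop_icmp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                  // truncated frame: let it through
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;                  // only look at IPv4

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;                  // drop ICMP before the kernel stack sees it
    return XDP_PASS;
}
"""

b = BPF(text=prog)
fn = b.load_func("xdp_drop_icmp", BPF.XDP)
b.attach_xdp("eth0", fn, 0)               # hypothetical interface name
print("XDP filter attached to eth0; Ctrl-C to detach")
try:
    b.trace_print()                       # blocks; program emits no trace output
except KeyboardInterrupt:
    b.remove_xdp("eth0", 0)
```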
5. kube-router:
* layer of operation:
> Layer 2 and Layer 3: Kube-router uses a Linux bridge to forward packets intra-host (Layer-2) and Layer-3 operation to forward packets inter-host.
>> Overlay network: IP-in-IP; underlay network: BGP
>> Kube-router likewise uses ==conntrack== for speedup
>> and ipset to get around the overhead of large iptables chains
* Distinctive feature:
> A Direct Server Return (DSR) feature to implement a highly efficient ingress for load balancing, which is a unique feature of Kube-router

#### B. Iptables Comparison
5 hook points, 4 kinds of tables:
1. raw:
> used to split the traffic without a need for the connection to be tracked.
2. mangle:
> change the QoS settings of packets
3. nat:
> used for network address translation
4. filter:
> used for packet filtering
* What kube-proxy does:
1. Creates the default host network iptables rules for the host
2. Creates user-defined iptables chains on the Kubernetes worker nodes
> kube-proxy installs NAT rules to support 'External-to-Service' communication
#### Iptables’ Routing Management:
> PREROUTING → FORWARD → POSTROUTING
1. PREROUTING first checks the raw, mangle, and nat tables; if any rule is not satisfied, the packet is dropped
2. Once the rules are satisfied, the packet is passed to the FORWARD chain, where NAT or other services are applied
3. FORWARD and POSTROUTING are both completed before the packet is delivered to the Pod's veth
==eBPF approach: none of the above steps are traversed; packets are handed directly to the XDP or TC hooks==
* overlay network
#### Path for traffic entering the node from outside (ingress):
1. Through eth0: PREROUTING → INPUT
2. Through the OTEP: PREROUTING → FORWARD → POSTROUTING
> then delivered to the target Pod
#### Path for traffic leaving the node (egress):
1. Pod to OTEP: PREROUTING → FORWARD → POSTROUTING
2. OTEP to eth0: OUTPUT → POSTROUTING
then out through eth0 to outside the node
* underlay network:
The underlay solution's configuration follows the PREROUTING → FORWARD → POSTROUTING path
#### CNIs that support multiple CNIs:
* Multus:
> With Multus, multiple single-interface CNIs can coexist on the same host, and each single-interface CNI has its own subnet, separated from the others.
We find that it is necessary for the individual container framework to leverage the NIC's offload capabilities and match it with the appropriate CNI to leverage that hardware acceleration and maximize performance.
* intra-host performance:
1. overall performance:
> Layer-3 routing based solutions (Calico- wp and Calico-np) perform worse than the Layer-2 based solutions
>> We also observe that Cilium achieves the lowest latency and Calico-wp has the worst round-trip latency
Breakdown of the time a packet spends in each component:
> Forwarding Information Base (FIB), eBPF, Netfilter, Veth, and IP forwarding.
>> Linux perf is used to count the total CPU cycles over each 60-second interval
Formula (cycles per packet, CPP):
$\mathrm{CPP} = \dfrac{\mathrm{Cycle}_{\mathrm{total}}}{N_{\mathrm{packet}}} \times \mathrm{Cycle}_{\mathrm{percentage}}$
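A small sketch of how the metric could be computed from a perf sample; all numeric values below are made up for illustration.

```python
def cycles_per_packet(total_cycles, n_packets, component_share):
    """CPP = (total cycles / packets) * fraction of cycles spent in the component."""
    return total_cycles / n_packets * component_share

# Example (made-up) numbers: a 60 s `perf stat` cycle count, the packet count
# over the same window, and the share `perf report` attributes to Netfilter.
total_cycles = 3.0e11
n_packets = 5.0e8
netfilter_share = 0.04

print(f"Netfilter CPP ≈ {cycles_per_packet(total_cycles, n_packets, netfilter_share):.0f} cycles/packet")
```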
Comparing the time each component takes under different CNIs:
1. Bridge 2. FIB & IP forwarding 3. eBPF 4. Veth 5. Netfilter → Cilium does not have this (Netfilter) bottleneck
* inter-host performance:
NIC model: Mellanox ConnectX-4 25 Gbps
The authors note that their test NIC does not support IP-in-IP tunnel offload; they therefore add another configuration, Flannel-off mode, in which the VXLAN offload at the OTEP is disabled
Results:
* Flannel-off and Calico in IP-in-IP overlays perform poorly, with much lower throughput than Cilium and Calico with native routing (xsub) option
* Tunnel offload in the NIC helps performance for the tunnel-based solutions (UDP, VxLAN, IP-in-IP), as seen by disabling the offload at the OTEP
> i.e., modes that avoid extra packet processing in software improve performance and lower the CPU cycles (per 60 s), measured with the CPP metric described above
* underlay network performs better than overlay network
CPU cycles:
1. overlay > underlay
2. tunneled > non-tunneled
* Layer-2 bridging:
Flannel and Kube-router use the host network stack to forward packets from the veth to the overlay tunnel, whereas Weave has the bridge do it via the br_forward() function call, which is slower
> Thus, the total bridge overhead of Flannel and Kube-router is half of that of Weave
* IP forwarding
> This in-kernel tunnel processing (e.g., the ip_send_check() function call) expends more CPU cycles. Calico-*-ipip consumes ∼290 CPP in the IP protocol stack
> Although both Calico and Flannel involve two packet transmissions, Calico takes longer because adding the IP header in the OTEP-to-eth0 stage is done in kernel space
* Netfilter
==What determines the Netfilter cost: large and complex iptables chains incur higher processing overheads==
> Cali-wp-ipip has the highest, while Kube-router and Cali-np-xsub have the lowest
>> The eBPF-based approach has the shortest iptables chains → Cilium has the least Netfilter overhead compared to the other overlay-based solutions, as it bypasses PREROUTING → FORWARD → POSTROUTING
* overlay
Flannel, Weave, and Cilium use a VxLAN overlay, and Calico-*-ipip uses an IP-in-IP overlay
> In terms of performance, IP-in-IP > VxLAN
* Tunnel Offload
> Calico-wp-ipip and Calico-np-ipip only achieve 11.1 Gbps and 12.5 Gbps TCP throughput for inter-host Pod communication respectively, which is much slower than the VxLAN overlay
>> Calico-wp-ipip has a similar RTT latency compared to some of the VxLAN overlay CNIs
>> Moreover, the ==Flannel and Flannel-off== have similar RTT latency, indicating that the effect of hardware tunnel offload support for small packets is not significant.
#### Performance for larger-scale configurations
We set up an increasing number of TCP connections between Pods as background traffic.
inter-host: 10 Mbps; intra-host: 50 Mbps
> We use iperf3 to generate the background TCP connection traffic
>> we deploy 99 iperf3 Pods as servers on one host and deploy 99 iperf3 Pods as clients on another host
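A hedged sketch of how such background flows could be driven from outside the cluster; the Pod names, the server DNS names, and the duration are hypothetical, not the paper's exact setup.

```python
import subprocess

NUM_PAIRS = 99
RATE = "10M"       # per-connection cap for the inter-host case described above
DURATION = "600"   # seconds of background traffic (made-up value)

procs = []
for i in range(NUM_PAIRS):
    server = f"iperf-server-{i}.bench.svc.cluster.local"   # hypothetical Service DNS name
    cmd = ["kubectl", "exec", f"iperf-client-{i}", "--",
           "iperf3", "-c", server, "-b", RATE, "-t", DURATION]
    procs.append(subprocess.Popen(cmd))    # launch all client flows in parallel

for p in procs:
    p.wait()
```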
**Calico-wp has the worst performance throughout, due to its large overhead from Netfilter rules. The TCP round-trip time is also better for Cilium**
#### impact of CNI on typical HTTP workload
Setup: create two Pods on two different hosts, one as the HTTP client and the other as the HTTP server
> Calico-*-xsub (underlay mode) works the best, while Calico-*-ipip (overlay mode) performs the worst


* pod creation time
> This benchmark is needed because, after the CRI creates the Pod, the CNI has to take over the subsequent Pod network configuration (connecting the veth to the host, etc.)

>Flannel and Kube-router have a smaller network startup latency (∼ 60ms) compared to the other alternatives. Weave consumes about 165ms in the Pod-host Link Up step, due to the work of appending multicast rule in iptables. Calico spends ∼ 80ms in the IP Allocation step, which is primarily due to the interaction with the etcd store. The time spent by Cilium in the Endpoint Creation step accounts for ∼ 90ms. During this step, Cilium generates the eBPF code and links it into the kernel, which contributes to this high latency.
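A rough client-side sketch of measuring Pod startup latency end to end (scheduling + image pull + CNI setup), which is coarser than the per-step breakdown above; the pod name and image are placeholders.

```python
import subprocess
import time

pod = "startup-probe"          # placeholder Pod name
start = time.monotonic()

subprocess.run(["kubectl", "run", pod, "--image=busybox", "--restart=Never",
                "--", "sleep", "3600"], check=True)
subprocess.run(["kubectl", "wait", f"pod/{pod}", "--for=condition=Ready",
                "--timeout=120s"], check=True)

print(f"Pod Ready after {time.monotonic() - start:.2f} s")
```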
#### V. CHARACTERISTICS FOR AN IDEAL CNI
* utilize the eBPF approach for intra-host communication. This is primarily because it generates the least amount of CPU overhead compared to the other solutions
* We propose that the ideal CNI use native (IP) routing (based on the results in IV-C). This helps avoid packet encapsulation/decapsulation
> This can achieve the highest packet forwarding performance when crossing the host boundary
* The authors want users to be able to choose the CNI/tunnel type they need: the ideal CNI should be able to offer sufficient overlay tunneling options to users (e.g., IP-in-IP, VXLAN, GRE, etc.)
##### Multicast support can be built by leveraging eBPF’s TC hooks, which can be a good match with the ideal CNI’s intra-/inter-host datapath
#### VII. conclusion
>While there is no single universally ‘best’ CNI plugin, there is a clear choice depending on the need for intra-host or inter-host Pod-to-Pod communication. For the intra-host case, Cilium appears best, with eBPF optimized for routing within a host. For the inter-host case, Kube-router and Calico are better due to the lighter-weight IP routing mode compared to their overlay counterparts
#### Although Netfilter rules incur overhead, their rich, fine-grained network policy and customization can enhance cluster security
> It comes down to whether you want performance or security (i.e., whether to sacrifice the Netfilter path or not)