# Installing Cilium on a Kubeadm 3m1w Cluster (v1.34) and Exploring Its Networking

## Environment Preparation

* The Kubernetes cluster has already been initialized

```
$ kubectl get no -owide
NAME   STATUS     ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
m1     NotReady   control-plane   6m6s    v1.34.1   10.10.7.32    <none>        Ubuntu 24.04.2 LTS   6.8.0-86-generic   cri-o://1.34.1
m2     NotReady   control-plane   3m42s   v1.34.1   10.10.7.33    <none>        Ubuntu 24.04.2 LTS   6.8.0-86-generic   cri-o://1.34.1
m3     NotReady   control-plane   2m23s   v1.34.1   10.10.7.34    <none>        Ubuntu 24.04.2 LTS   6.8.0-86-generic   cri-o://1.34.1
w1     NotReady   <none>          2m17s   v1.34.1   10.10.7.35    <none>        Ubuntu 24.04.2 LTS   6.8.0-86-generic   cri-o://1.34.1
```

```
$ kubectl get po -A
NAMESPACE     NAME                         READY   STATUS    RESTARTS   AGE
kube-system   coredns-66bc5c9577-n7mbs     0/1     Pending   0          5m59s
kube-system   coredns-66bc5c9577-qgkfq     0/1     Pending   0          5m58s
kube-system   etcd-m1                      1/1     Running   0          6m9s
kube-system   etcd-m2                      1/1     Running   0          3m37s
kube-system   etcd-m3                      1/1     Running   0          2m29s
kube-system   kube-apiserver-m1            1/1     Running   0          6m9s
kube-system   kube-apiserver-m2            1/1     Running   0          3m37s
kube-system   kube-apiserver-m3            1/1     Running   0          2m29s
kube-system   kube-controller-manager-m1   1/1     Running   0          6m13s
kube-system   kube-controller-manager-m2   1/1     Running   0          3m37s
kube-system   kube-controller-manager-m3   1/1     Running   0          2m29s
kube-system   kube-haproxy-m1              1/1     Running   0          6m1s
kube-system   kube-haproxy-m2              1/1     Running   0          3m37s
kube-system   kube-haproxy-m3              1/1     Running   0          2m29s
kube-system   kube-keepalived-m1           1/1     Running   0          6m10s
kube-system   kube-keepalived-m2           1/1     Running   0          3m37s
kube-system   kube-keepalived-m3           1/1     Running   0          2m29s
kube-system   kube-proxy-62zcw             1/1     Running   0          3m50s
kube-system   kube-proxy-cwtqp             1/1     Running   0          2m25s
kube-system   kube-proxy-kdmn7             1/1     Running   0          2m31s
kube-system   kube-proxy-t86mj             1/1     Running   0          5m58s
kube-system   kube-scheduler-m1            1/1     Running   0          6m6s
kube-system   kube-scheduler-m2            1/1     Running   0          3m37s
kube-system   kube-scheduler-m3            1/1     Running   0          2m29s
```

* To use the Cilium CNI, the kernel on every Kubernetes node must be newer than 5.10

```
$ uname -r
6.8.0-86-generic
```

## Download the Cilium CLI

```
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
```

* Check the versions: the cilium CLI itself is `v0.18.8`, and the default Cilium image it would install is `v1.18.2`

```
$ cilium version
cilium-cli: v0.18.8 compiled with go1.25.3 on linux/amd64
cilium image (default): v1.18.2
cilium image (stable): v1.18.2
cilium image (running): unknown. Unable to obtain cilium version. Reason: release: not found
```
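The `cilium image (running): unknown` line is expected at this point, because nothing has been installed into the cluster yet. Once the install in the next section completes, the same CLI can confirm that an image is actually running; a minimal sketch:

```
# Wait until the Cilium components report ready, then re-check the version;
# "cilium image (running)" should then show the installed release.
$ cilium status --wait
$ cilium version
```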
## Install Cilium

* Install the Cilium CNI, version `1.18.3`

```
$ cilium install --version 1.18.3
```

* The following variant of the install command is for setups where the Kubernetes nodes themselves are containers

```
$ cilium install --version 1.18.3 --set cgroup.autoMount.enabled=false --set cgroup.hostRoot=/sys/fs/cgroup --set securityContext.privileged=true
```

> `--set cgroup.autoMount.enabled=false`: disables the chart's mechanism that automatically mounts the host's cgroup filesystem into the Cilium container.
> `--set cgroup.hostRoot=/sys/fs/cgroup`: tells Cilium to treat the host's `/sys/fs/cgroup` as the cgroup hostRoot.
> `--set securityContext.privileged=true`: runs the Cilium Pods as privileged (the containers effectively get near-complete privileges on the host).

```
$ kubectl get po -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE
kube-system   cilium-2bzbk                       1/1     Running   0          2m13s
kube-system   cilium-c6bps                       1/1     Running   0          2m14s
kube-system   cilium-envoy-72np6                 1/1     Running   0          2m14s
kube-system   cilium-envoy-8tsxn                 1/1     Running   0          2m13s
kube-system   cilium-envoy-fcw4c                 1/1     Running   0          2m14s
kube-system   cilium-envoy-j89dx                 1/1     Running   0          2m13s
kube-system   cilium-operator-68bd8cc456-bxmll   1/1     Running   0          2m13s
kube-system   cilium-prdt6                       1/1     Running   0          2m13s
kube-system   cilium-wsfmf                       1/1     Running   0          2m13s
kube-system   coredns-66bc5c9577-n7mbs           1/1     Running   0          12m
kube-system   coredns-66bc5c9577-qgkfq           1/1     Running   0          12m
kube-system   etcd-m1                            1/1     Running   0          12m
kube-system   etcd-m2                            1/1     Running   0          9m54s
kube-system   etcd-m3                            1/1     Running   0          8m46s
kube-system   kube-apiserver-m1                  1/1     Running   0          12m
kube-system   kube-apiserver-m2                  1/1     Running   0          9m54s
kube-system   kube-apiserver-m3                  1/1     Running   0          8m46s
kube-system   kube-controller-manager-m1         1/1     Running   0          12m
kube-system   kube-controller-manager-m2         1/1     Running   0          9m54s
kube-system   kube-controller-manager-m3         1/1     Running   0          8m46s
kube-system   kube-haproxy-m1                    1/1     Running   0          12m
kube-system   kube-haproxy-m2                    1/1     Running   0          9m54s
kube-system   kube-haproxy-m3                    1/1     Running   0          8m46s
kube-system   kube-keepalived-m1                 1/1     Running   0          12m
kube-system   kube-keepalived-m2                 1/1     Running   0          9m54s
kube-system   kube-keepalived-m3                 1/1     Running   0          8m46s
kube-system   kube-proxy-62zcw                   1/1     Running   0          10m
kube-system   kube-proxy-cwtqp                   1/1     Running   0          8m42s
kube-system   kube-proxy-kdmn7                   1/1     Running   0          8m48s
kube-system   kube-proxy-t86mj                   1/1     Running   0          12m
kube-system   kube-scheduler-m1                  1/1     Running   0          12m
kube-system   kube-scheduler-m2                  1/1     Running   0          9m54s
kube-system   kube-scheduler-m3                  1/1     Running   0          8m46s
```

```
$ kubectl get no
NAME   STATUS   ROLES           AGE     VERSION
m1     Ready    control-plane   12m     v1.34.1
m2     Ready    control-plane   10m     v1.34.1
m3     Ready    control-plane   8m52s   v1.34.1
w1     Ready    <none>          8m46s   v1.34.1
```

* The cilium agent processes are created by the `cilium` DaemonSet

```
$ kubectl -n kube-system get po -l app.kubernetes.io/part-of=cilium
NAME                               READY   STATUS    RESTARTS   AGE
cilium-2bzbk                       1/1     Running   0          5d20h
cilium-c6bps                       1/1     Running   0          5d20h
cilium-envoy-72np6                 1/1     Running   0          5d20h
cilium-envoy-8tsxn                 1/1     Running   0          5d20h
cilium-envoy-fcw4c                 1/1     Running   0          5d20h
cilium-envoy-j89dx                 1/1     Running   0          5d20h
cilium-operator-68bd8cc456-bxmll   1/1     Running   0          5d20h
cilium-prdt6                       1/1     Running   0          5d20h
cilium-wsfmf                       1/1     Running   0          5d20h
```

## Examining Cilium's Network Architecture

* With the default installation, Cilium uses VXLAN for cross-node communication

```
$ kubectl -n kube-system get cm cilium-config -oyaml | grep tunnel-protocol
  tunnel-protocol: vxlan
```
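The same setting can be cross-checked from a running agent, and the kernel's view of the VXLAN device shows the UDP destination port the overlay uses. A hedged sketch; the exact wording of the status output may differ slightly between releases:

```
# Ask an agent how it is routing traffic (it should report a vxlan tunnel here).
$ kubectl -n kube-system exec -it ds/cilium -c cilium-agent -- cilium status | grep -i routing

# On any node, the VXLAN device details include the destination UDP port Cilium uses.
$ ip -d link show cilium_vxlan
```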
* In a VXLAN-based overlay, Cilium nodes talk to each other over `UDP/8472` by default. On each host, the cilium agent creates the following virtual network interfaces:
  - `cilium_vxlan`: performs VXLAN encapsulation and decapsulation.
  - `cilium_net` and `cilium_host`: a veth pair (two directly connected virtual Ethernet interfaces), both living in the host network namespace. They connect the IP space managed by Cilium (Pod IPs / Cluster IPs) to the node's host networking and give eBPF a place to attach programs that process packets.
    - `cilium_net`: the end where the Cilium datapath (eBPF programs, conntrack, load-balancing logic) actually forwards, rewrites, or encapsulates packets. When the kernel routes a Pod's packet to Cilium for processing, the packet enters here, and the eBPF hooks (ingress/egress) on this end make policy decisions and handle the packet.
    - `cilium_host`: the host-side endpoint. It is used when the node itself accesses a Pod directly, when a Pod is reached through a Service, through the Gateway API, and so on.
    - For Pod-to-Pod traffic, `cilium_vxlan` only comes into play when the Pods are on different nodes (and, as shown later, even then the traffic does not actually pass through `cilium_host`/`cilium_net`).
  - `lxc_health`: used for health checks between nodes.
  - `lxcxxxxxxx`: forms a veth pair with the Pod's `eth0` and connects the Pod's network namespace to the host namespace. That is exactly what a veth pair is for: linking different network namespaces (one end in the Pod netns, one end in the host namespace).

```
$ ip link show | grep -E "cilium|lxc_health"
3: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
5: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
7: lxc_health@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```

* Inspect Cilium's network status

```
$ kubectl -n kube-system exec -it ds/cilium -c cilium-agent -- cilium status --verbose
```

* Create test Deployments whose Pods land on different nodes

```
$ kubectl create deploy nginx --image=nginx --replicas=2

# the client pods additionally need privileged=true
$ echo 'apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: client
  name: client
spec:
  replicas: 2
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
      - image: quay.io/cooloo9871/debug.alp
        name: debug-alp
        securityContext:
          privileged: true' | kubectl apply -f -

$ kubectl get po -owide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE   NOMINATED NODE   READINESS GATES
client-7b76fc6dc9-lc9qr   1/1     Running   0          7m1s    10.0.3.81    w1     <none>           <none>
client-7b76fc6dc9-v94js   1/1     Running   0          112s    10.0.0.93    m1     <none>           <none>
nginx-66686b6766-hlrkx    1/1     Running   0          2d17h   10.0.0.105   m1     <none>           <none>
nginx-66686b6766-k5xft    1/1     Running   0          2d17h   10.0.3.187   w1     <none>           <none>
```

### Network architecture diagram

![image](https://hackmd.io/_uploads/HkxV6B4x-g.png)

```
$ kubectl exec client-7b76fc6dc9-v94js -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
30: eth0@if31: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether a2:fa:c8:99:7a:d2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.93/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a0fa:c8ff:fe99:7ad2/64 scope link
       valid_lft forever preferred_lft forever

$ kubectl exec -it nginx-66686b6766-hlrkx -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
16: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether e6:a5:89:71:b7:dd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.105/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::e4a5:89ff:fe71:b7dd/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
```
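Besides the crictl/netns pipeline shown in the next step, the same Pod-to-host interface mapping can be derived from sysfs: for a veth device, the `iflink` attribute holds the peer's ifindex. A minimal sketch using the client Pod from this example:

```
# Read the peer ifindex of the Pod's eth0 (should print 31 for this Pod),
# then resolve that index to an interface name on the node the Pod runs on (m1 here).
$ kubectl exec client-7b76fc6dc9-v94js -- cat /sys/class/net/eth0/iflink
$ ip -o link show | awk -F': ' '$1 == 31 {print $2}'
```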
* To find the client Pod's host-side interface, run the following on the node where the Pod is scheduled

```
$ pod_name="client-7b76fc6dc9-v94js"
$ ip a s | grep -A 3 ^$(sudo ip netns exec $(sudo ip netns identify $(sudo crictl inspect $(sudo crictl ps -a | grep ${pod_name} | cut -d " " -f 1) | jq -r '.info.pid // .status.pid')) ip a s eth0 | head -n 1 | cut -d ":" -f 2 | tail -c 3)
```

Result:

```
31: lxc1fa781ad487a@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 32:3e:b3:20:00:e1 brd ff:ff:ff:ff:ff:ff link-netns f0c571eb-8a3e-42b4-88ed-036fc25c7ed7
    inet6 fe80::303e:b3ff:fe20:e1/64 scope link
       valid_lft forever preferred_lft forever
```

* In other words, the interface inside the Pod is `eth0@if31`, so on the host we simply look for the interface with index 31, which turns out to be `lxc1fa781ad487a`

```
$ ip link show | grep 31
31: lxc1fa781ad487a@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

$ ip link show | grep 17
17: lxc584de0207ba0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```

### Pod-to-Pod communication on the same node

* When a Pod starts, Cilium assigns it an address and sets its default gateway, which points at the host's `cilium_host` interface. Initially the Pod's ARP table is empty.

```
$ kubectl exec client-7b76fc6dc9-v94js -- route -n
Kernel IP routing table
Destination     Gateway         Genmask          Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.249      0.0.0.0          UG    0      0        0 eth0
10.0.0.249      0.0.0.0         255.255.255.255  UH    0      0        0 eth0

$ ip a s cilium_host
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ba:3f:a5:5c:50:c9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.249/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::b83f:a5ff:fe5c:50c9/64 scope link
       valid_lft forever preferred_lft forever

$ kubectl exec client-7b76fc6dc9-v94js -- arp -n
```

When two Pods on the same node communicate, the path looks like this:

![image](https://hackmd.io/_uploads/SJ4r0rVgZg.png)

```
$ kubectl exec -it client-7b76fc6dc9-v94js -- ping -c 3 10.0.0.105
PING 10.0.0.105 (10.0.0.105): 56 data bytes
64 bytes from 10.0.0.105: seq=0 ttl=63 time=1.645 ms
64 bytes from 10.0.0.105: seq=1 ttl=63 time=0.178 ms
64 bytes from 10.0.0.105: seq=2 ttl=63 time=0.262 ms

--- 10.0.0.105 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.178/0.695/1.645 ms
```

> While the ping is running, you can also capture the actual ARP and ICMP traffic:

```
$ sudo tcpdump -n -i lxc1fa781ad487a arp or icmp -vvv -w client-pod-host.pcap
$ sudo tcpdump -n -i lxc584de0207ba0 arp or icmp -vvv -w nginx-pod-host.pcap
```
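Besides tcpdump on the veth interfaces, the Cilium agent's built-in monitor can show how the datapath classifies the same packets. A sketch; run it against the agent Pod on the node hosting the client Pod (`cilium-prdt6` is the agent on m1 in this environment):

```
# Stream datapath trace events while the ping above is running.
$ kubectl -n kube-system exec -it cilium-prdt6 -c cilium-agent -- cilium monitor --type trace

# Dropped packets (e.g. policy denials) can be watched with --type drop instead.
```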
`client-7b76fc6dc9-v94js` has IP `10.0.0.93` and `nginx-66686b6766-hlrkx` has IP `10.0.0.105`. When `client-7b76fc6dc9-v94js` sends an ICMP request to `nginx-66686b6766-hlrkx`, the Pod's own address is a `/32` (`10.0.0.93/32`), so every packet not destined to itself goes to the default gateway. That gateway is `10.0.0.249`, i.e. the host's `cilium_host` interface, so the Pod broadcasts an ARP request to learn that IP's MAC address. However, `cilium_host` is flagged NOARP and does not answer ARP.

Instead, Cilium replies with the MAC address of the Pod's own veth-pair device: `client-7b76fc6dc9-v94js` ends up learning the MAC address of `lxc1fa781ad487a`, caches it, and then sends the ICMP packet. At this point the Pod's ARP table shows the IP `10.0.0.249` (the node's `cilium_host` interface), but the MAC actually belongs to `lxc1fa781ad487a`:

```
$ kubectl exec client-7b76fc6dc9-v94js -- arp -n
? (10.0.0.249) at 32:3e:b3:20:00:e1 [ether]  on eth0

$ ip a s lxc1fa781ad487a
31: lxc1fa781ad487a@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 32:3e:b3:20:00:e1 brd ff:ff:ff:ff:ff:ff link-netns f0c571eb-8a3e-42b4-88ed-036fc25c7ed7
    inet6 fe80::303e:b3ff:fe20:e1/64 scope link
       valid_lft forever preferred_lft forever

$ ip a s cilium_host
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ba:3f:a5:5c:50:c9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.249/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::b83f:a5ff:fe5c:50c9/64 scope link
       valid_lft forever preferred_lft forever
```

On the host, the route towards the destination `10.0.0.105` points at the `cilium_host` interface, yet when `lxc1fa781ad487a` receives the traffic from the Pod, the destination MAC is its own. This is because forwarding on the host side is implemented in eBPF by the `cil_from_container` program, which spoofs the MAC address.

```
# check the routing table on host m1
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask          Flags Metric Ref    Use Iface
0.0.0.0         10.10.7.254     0.0.0.0          UG    0      0        0 ens18
10.0.0.0        10.0.0.249      255.255.255.0    UG    0      0        0 cilium_host
10.0.0.249      0.0.0.0         255.255.255.255  UH    0      0        0 cilium_host
10.0.1.0        10.0.0.249      255.255.255.0    UG    0      0        0 cilium_host
10.0.2.0        10.0.0.249      255.255.255.0    UG    0      0        0 cilium_host
10.0.3.0        10.0.0.249      255.255.255.0    UG    0      0        0 cilium_host
10.10.7.0       0.0.0.0         255.255.255.0    U     0      0        0 ens18
```

When the `cil_from_container` BPF program receives the ARP request, after a series of checks it answers with its own MAC address (that of `lxc1fa781ad487a`), pretending to be the gateway.

You can see that every Pod's host-side interface on this node has `cil_from_container` attached:

```
$ sudo bpftool net
xdp:

tc:
ens18(2) tcx/ingress cil_from_netdev prog_id 877 link_id 8
cilium_net(3) tcx/ingress cil_to_host prog_id 864 link_id 7
cilium_host(4) tcx/ingress cil_to_host prog_id 856 link_id 5
cilium_host(4) tcx/egress cil_from_host prog_id 855 link_id 6
cilium_vxlan(5) tcx/ingress cil_from_overlay prog_id 844 link_id 3
cilium_vxlan(5) tcx/egress cil_to_overlay prog_id 845 link_id 4
lxc_health(7) tcx/ingress cil_from_container prog_id 847 link_id 11
lxcb37c3e803a9a(9) tcx/ingress cil_from_container prog_id 926 link_id 10
lxcc0147322a505(11) tcx/ingress cil_from_container prog_id 916 link_id 9
lxc584de0207ba0(17) tcx/ingress cil_from_container prog_id 937 link_id 14   # here
lxc0d92690e376d(29) tcx/ingress cil_from_container prog_id 946 link_id 20
lxc1fa781ad487a(31) tcx/ingress cil_from_container prog_id 943 link_id 21   # here
lxc1e397b5ebe86(37) tcx/ingress cil_from_container prog_id 886 link_id 24

flow_dissector:

netfilter:
```
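The `prog_id` values printed by `bpftool net` can be inspected further if you want to dig into the datapath programs themselves; a small sketch (the IDs are specific to this environment and will differ elsewhere):

```
# Metadata (name, load time, associated maps) of the program attached to the client Pod's lxc interface.
$ sudo bpftool prog show id 943

# Dump the translated BPF instructions of that program.
$ sudo bpftool prog dump xlated id 943 | head
```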
The `cil_from_container` program then uses the Node MAC `02:29:D4:4F:B3:9B` from the endpoint rule below to locate the host-side interface `lxc584de0207ba0`, i.e. the veth-pair device of the `nginx-66686b6766-hlrkx` Pod:

```
$ kubectl -n kube-system exec -it cilium-prdt6 -c cilium-agent -- cilium bpf endpoint list
IP ADDRESS       LOCAL ENDPOINT INFO
......
# the rule corresponding to nginx-66686b6766-hlrkx
10.0.0.105:0     id=404  sec_id=10980 flags=0x0000 ifindex=17  mac=E6:A5:89:71:B7:DD nodemac=02:29:D4:4F:B3:9B parent_ifindex=0
```

* After the client Pod receives the ARP reply with the spoofed MAC address, it sends the ICMP echo request. That packet again arrives at `lxc1fa781ad487a`, is again picked up by `cil_from_container`, and the program decides:
  - If the target Pod is on the same node → it finds that the target Pod's host-side interface is `lxc584de0207ba0` (the veth-pair device of `nginx-66686b6766-hlrkx` on this host) and forwards the packet there, which delivers it to the Pod's `eth0` and completes the exchange.
  - If the target Pod is on another node → only then is the packet handed over to the `cilium_vxlan` / `ens18` interfaces.

```
# run on node m1
bigred@m1:~$ ip link show | grep -i "02:29:D4:4F:B3:9B" -B 1
17: lxc584de0207ba0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 02:29:d4:4f:b3:9b brd ff:ff:ff:ff:ff:ff link-netns 06414ab3-8571-47e7-bfa7-75f256749aa1
```

#### Summary

For same-node Pod-to-Pod communication, even though the Pod's default gateway points at the `cilium_host` interface, the packets are actually handled by the `cil_from_container` program, which spoofs the MAC address and forwards them directly; the traffic never passes through `cilium_host` at all.

### Pod-to-Pod communication across nodes

```
$ kubectl exec client-7b76fc6dc9-v94js -- ping -c 2 10.0.3.187
PING 10.0.3.187 (10.0.3.187): 56 data bytes
64 bytes from 10.0.3.187: seq=0 ttl=63 time=1.081 ms
64 bytes from 10.0.3.187: seq=1 ttl=63 time=0.629 ms

--- 10.0.3.187 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.629/0.855/1.081 ms
```

When `client-7b76fc6dc9-v94js` talks to `nginx-66686b6766-k5xft`, the flow looks like this:

![image](https://hackmd.io/_uploads/HyiOevVxWg.png)

`client-7b76fc6dc9-v94js` has IP `10.0.0.93` and `nginx-66686b6766-k5xft` has IP `10.0.3.187`. As before, the Pod's traffic first reaches its veth-pair device and is picked up by the `cil_from_container` program, which spoofs the MAC address. The Pod's ARP table therefore again shows the IP `10.0.0.249` (the node's `cilium_host` interface) but the MAC address of its own `lxc1fa781ad487a` device:

```
$ kubectl exec client-7b76fc6dc9-v94js -- arp -n
? (10.0.0.249) at 32:3e:b3:20:00:e1 [ether]  on eth0
```

Because the destination is on another node, `cil_from_container` hands the packet to the `cilium_vxlan` / `ens18` interfaces. It looks up the VXLAN tunnel forwarding rules, the packet is encapsulated by `cilium_vxlan`, and it is then sent from the local `ens18` to the destination host's `ens18`.

The tunnel forwarding rules can be looked up with the following command:

```
$ kubectl get cn
NAME   CILIUMINTERNALIP   INTERNALIP   AGE
m1     10.0.0.249         10.10.7.32   11d
m2     10.0.1.46          10.10.7.33   11d
m3     10.0.2.249         10.10.7.34   11d
w1     10.0.3.197         10.10.7.35   11d
```
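The same node-to-tunnel-endpoint mapping is also held in the agent's BPF tunnel map and can be dumped directly; a small sketch, using the agent Pod on w1 from this environment:

```
# Remote PodCIDR prefixes and the tunnel endpoints (node IPs) they map to.
$ kubectl -n kube-system exec -it cilium-2bzbk -c cilium-agent -- cilium bpf tunnel list
```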
When the destination host receives the packet and recognizes it as VXLAN, it hands it to `cilium_vxlan` for decapsulation and then routes it according to the real destination address (`10.0.3.187`):

```
# check the local routing table on node w1
bigred@w1:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask          Flags Metric Ref    Use Iface
0.0.0.0         10.10.7.254     0.0.0.0          UG    0      0        0 ens18
10.0.0.0        10.0.3.197      255.255.255.0    UG    0      0        0 cilium_host
10.0.1.0        10.0.3.197      255.255.255.0    UG    0      0        0 cilium_host
10.0.2.0        10.0.3.197      255.255.255.0    UG    0      0        0 cilium_host
10.0.3.0        10.0.3.197      255.255.255.0    UG    0      0        0 cilium_host
10.0.3.197      0.0.0.0         255.255.255.255  UH    0      0        0 cilium_host
10.10.7.0       0.0.0.0         255.255.255.0    U     0      0        0 ens18
```

After the VXLAN packet enters through the `cilium_vxlan` interface, it is handed to the `cil_from_overlay` BPF program, which makes the routing decision and delivers the packet to the target Pod:

```
# check the cilium-2bzbk pod on node w1
$ kubectl -n kube-system exec -it cilium-2bzbk -c cilium-agent -- cilium bpf endpoint list
IP ADDRESS       LOCAL ENDPOINT INFO
......
# the relevant rule
10.0.3.187:0     id=394  sec_id=10980 flags=0x0000 ifindex=15  mac=CE:AE:F2:48:65:2A nodemac=2E:85:D4:AF:AF:92 parent_ifindex=0
```

Based on the Node MAC, the corresponding interface is `lxc644f612d5318`, i.e. the veth-pair device of `nginx-66686b6766-k5xft` on the destination host. The packet is forwarded to that device and finally into the Pod's `eth0`, completing the exchange.

```
# run on node w1
bigred@w1:~$ ip link show | grep -i "2E:85:D4:AF:AF:92" -B 1
15: lxc644f612d5318@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 2e:85:d4:af:af:92 brd ff:ff:ff:ff:ff:ff link-netns 950eaa81-0918-4b70-a5af-ea4b26178bb2
```

#### Summary

So cross-node Pod-to-Pod communication does not pass through the `cilium_host` interface either.

### So when are `cilium_host` and `cilium_net` actually used?

`cilium_host` mainly handles:

1. Host-to-Pod traffic: traffic originating from the node itself
2. L7 proxy traffic: traffic that has to go through the Envoy proxy
3. `NodePort`/`LoadBalancer` traffic: Kubernetes Service-related flows
4. Cilium health-check traffic: health checks between nodes
5. Other specific flows that need to traverse the host network stack

#### Summary

* In short, the `cilium_host` and `cilium_net` interfaces come into play when:
  - the node itself accesses a Pod directly,
  - a Pod is reached through a Service,
  - a Pod is reached through the Gateway API, and so on.
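A quick way to see `cilium_host` actually carrying traffic is to generate node-to-Pod traffic while capturing on that interface; a small sketch using the nginx Pod IP on m1 from this environment:

```
# terminal 1 on m1: capture ICMP on cilium_host
$ sudo tcpdump -ni cilium_host icmp

# terminal 2 on m1: node-to-Pod traffic, which should show up in the capture above
$ ping -c 2 10.0.0.105
```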
## Testing Cilium Network Performance

* Use the cilium CLI to measure the cluster's network performance; `--node-selector-client` and `--node-selector-server` pin the test Pods to the two nodes being compared.

```
$ cilium connectivity perf \
  --crr \
  --host-to-pod \
  --pod-to-host \
  --performance-image quay.io/cilium/network-perf:1751527436-c2462ae \
  --node-selector-client kubernetes.io/hostname=m1 \
  --node-selector-server kubernetes.io/hostname=w1
```

Output:

```
................................
🔥 Network Performance Test Summary [cilium-test-1]:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 Scenario      | Node       | Test    | Duration | Min   | Mean      | Max      | P50   | P90     | P99     | Transaction rate OP/s
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 pod-to-pod    | same-node  | TCP_CRR | 10s      | 131µs | 331.01µs  | 3.485ms  | 314µs | 460µs   | 691µs   | 3012.06
📋 pod-to-pod    | same-node  | TCP_RR  | 10s      | 37µs  | 140.75µs  | 3.871ms  | 124µs | 218µs   | 380µs   | 7061.51
📋 pod-to-host   | same-node  | TCP_CRR | 10s      | 162µs | 410.55µs  | 23.076ms | 384µs | 576µs   | 860µs   | 2429.36
📋 pod-to-host   | same-node  | TCP_RR  | 10s      | 39µs  | 140.87µs  | 7.25ms   | 126µs | 211µs   | 356µs   | 7057.62
📋 host-to-pod   | same-node  | TCP_CRR | 10s      | 162µs | 449.48µs  | 7.217ms  | 427µs | 627µs   | 914µs   | 2218.58
📋 host-to-pod   | same-node  | TCP_RR  | 10s      | 40µs  | 153.75µs  | 5.969ms  | 138µs | 236µs   | 395µs   | 6461.87
📋 host-to-host  | same-node  | TCP_CRR | 10s      | 106µs | 331.95µs  | 4.146ms  | 309µs | 485µs   | 713µs   | 3002.43
📋 host-to-host  | same-node  | TCP_RR  | 10s      | 34µs  | 153.27µs  | 15.963ms | 136µs | 234µs   | 393µs   | 6481.81
📋 pod-to-pod    | other-node | TCP_CRR | 10s      | 469µs | 1.0597ms  | 18.036ms | 989µs | 1.429ms | 2.323ms | 941.87
📋 pod-to-pod    | other-node | TCP_RR  | 10s      | 174µs | 509.19µs  | 10.345ms | 478µs | 705µs   | 1.167ms | 1957.50
📋 pod-to-host   | other-node | TCP_CRR | 10s      | 413µs | 1.01767ms | 17.581ms | 930µs | 1.365ms | 2.581ms | 980.78
📋 pod-to-host   | other-node | TCP_RR  | 10s      | 140µs | 456.32µs  | 5.459ms  | 431µs | 641µs   | 1.066ms | 2183.68
📋 host-to-pod   | other-node | TCP_CRR | 10s      | 410µs | 1.03293ms | 7.961ms  | 962µs | 1.422ms | 2.341ms | 966.58
📋 host-to-pod   | other-node | TCP_RR  | 10s      | 206µs | 575.88µs  | 12.171ms | 537µs | 767µs   | 1.321ms | 1730.92
📋 host-to-host  | other-node | TCP_CRR | 10s      | 293µs | 787.14µs  | 6.087ms  | 731µs | 1.089ms | 1.788ms | 1267.36
📋 host-to-host  | other-node | TCP_RR  | 10s      | 116µs | 425.38µs  | 5.442ms  | 403µs | 587µs   | 982µs   | 2342.92
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
📋 Scenario      | Node       | Test             | Duration | Throughput Mb/s
----------------------------------------------------------------------------------------
📋 pod-to-pod    | same-node  | TCP_STREAM       | 10s      | 7708.49
📋 pod-to-pod    | same-node  | TCP_STREAM_MULTI | 10s      | 21133.93
📋 pod-to-host   | same-node  | TCP_STREAM       | 10s      | 7027.26
📋 pod-to-host   | same-node  | TCP_STREAM_MULTI | 10s      | 18826.72
📋 host-to-pod   | same-node  | TCP_STREAM       | 10s      | 5461.00
📋 host-to-pod   | same-node  | TCP_STREAM_MULTI | 10s      | 16860.81
📋 host-to-host  | same-node  | TCP_STREAM       | 10s      | 12831.99
📋 host-to-host  | same-node  | TCP_STREAM_MULTI | 10s      | 32901.72
📋 pod-to-pod    | other-node | TCP_STREAM       | 10s      | 2283.82
📋 pod-to-pod    | other-node | TCP_STREAM_MULTI | 10s      | 2450.87
📋 pod-to-host   | other-node | TCP_STREAM       | 10s      | 3471.98
📋 pod-to-host   | other-node | TCP_STREAM_MULTI | 10s      | 13457.33
📋 host-to-pod   | other-node | TCP_STREAM       | 10s      | 2437.80
📋 host-to-pod   | other-node | TCP_STREAM_MULTI | 10s      | 2437.48
📋 host-to-host  | other-node | TCP_STREAM       | 10s      | 9197.10
📋 host-to-host  | other-node | TCP_STREAM_MULTI | 10s      | 14677.08
----------------------------------------------------------------------------------------
✅ [cilium-test-1] All 1 tests (32 actions) successful, 0 tests skipped, 0 scenarios skipped.
```

> In this environment, single-stream TCP throughput is roughly:
> pod-to-pod, same-node: 7.7 Gb/s
> pod-to-host, same-node: 7.0 Gb/s
> host-to-pod, same-node: 5.4 Gb/s
> pod-to-pod, other-node: 2.2 Gb/s
> host-to-host, other-node: 9.1 Gb/s

* Clean up the test environment

```
$ kubectl -n cilium-test-1 delete deploy --all
```

## References

https://ithelp.ithome.com.tw/articles/10387997
https://ithelp.ithome.com.tw/articles/10390038