# Kata container deploy record

## Issue Log

### Add-relation issue

Try to execute:

```bash
juju add-relation kata kubernetes-master
```

An error occurs, and `juju status` shows that the charm download failed. Check the unit log at `/var/log/juju/unit_kata.log` to see what went wrong:

```bash
DEBUG install Traceback (most recent call last):
DEBUG install   File "/var/lib/juju/agents/unit-kata-0/charm/hooks/install", line 8, in <module>
DEBUG install     basic.bootstrap_charm_deps()
DEBUG install   File "lib/charms/layer/basic.py", line 130, in bootstrap_charm_deps
DEBUG install     install_or_update_charm_env()
DEBUG install   File "lib/charms/layer/basic.py", line 171, in install_or_update_charm_env
DEBUG install     '--version']).decode('utf8'))
DEBUG install   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
DEBUG install     **kwargs).stdout
DEBUG install   File "/usr/lib/python3.6/subprocess.py", line 423, in run
DEBUG install     with Popen(*popenargs, **kwargs) as process:
DEBUG install   File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
DEBUG install     restore_signals, start_new_session)
DEBUG install   File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
DEBUG install     raise child_exception_type(errno_num, err_msg, err_filename)
DEBUG install PermissionError: [Errno 13] Permission denied: 'bin/charm-env'
```

The permission-denied error comes from `lib/charms/layer/basic.py`. The problematic code is:

```python
try:
    bundled_version = parse_version(
        check_output(['bin/charm-env',
                      '--version']).decode('utf8'))
```

The error is caused by insufficient permissions when `check_output` executes the command. Since I don't know how to grant root permissions during add-relation, I decided to modify the code directly:

```python
try:
    bundled_version = parse_version(
        check_output(['sudo', 'bin/charm-env',
                      '--version']).decode('utf8'))
```

After this modification, the subordinate kata charm installs normally. Executing

```bash
juju add-relation kata kubernetes-worker
```

sometimes hits the same error, and it can be solved in the same way.
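For reference, the patch-and-retry loop can also be driven from the Juju client. A minimal sketch, assuming the failed unit is `kata/0` and its agent directory is `unit-kata-0` (both are assumptions; adjust to match your `juju status` output):

```bash
# Patch basic.py on the unit (unit name and agent path are assumptions,
# not taken from the log above), then retry the failed hook.
juju ssh kata/0 -- sudo sed -i \
  "s|check_output(\['bin/charm-env'|check_output(['sudo', 'bin/charm-env'|" \
  /var/lib/juju/agents/unit-kata-0/charm/lib/charms/layer/basic.py
juju resolved kata/0
```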
This is the screen after the deployment completes:

```bash
$ watch juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20

Every 2.0s: juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20          maas-vm: Thu Sep 9 07:15:16 2021

Model    Controller             Cloud/Region        Version  SLA          Timestamp
default  maas-cloud-controller  maas-cloud/default  2.7.6.1  unsupported  07:15:17Z

App                    Version  Status  Scale  Charm                  Store  Rev  OS
containerd                      active      0  containerd             local    0  ubuntu
easyrsa                3.0.1    active      1  easyrsa                local    1  ubuntu
etcd                   3.2.10   active      1  etcd                   local    2  ubuntu
flannel                0.11.0   active      0  flannel                local    0  ubuntu
kata                            active      0  kata                   local    1  ubuntu
kubeapi-load-balancer  1.14.0   active      1  kubeapi-load-balancer  local    1  ubuntu
kubernetes-master      1.16.15  active      1  kubernetes-master      local    3  ubuntu
kubernetes-worker      1.16.15  active      1  kubernetes-worker      local    2  ubuntu

Unit                       Workload  Agent  Machine   Public address  Ports     Message
easyrsa/1*                 active    idle   0/lxd/17  10.207.212.119            Certificate Authority connected.
etcd/4*                    active    idle   0/lxd/18  10.207.212.123  2379/tcp  Healthy with 1 known peer
kubeapi-load-balancer/4*   active    idle   0/lxd/19  10.207.212.122  443/tcp   Loadbalancer ready.
kubernetes-master/4*       active    idle   0/lxd/20  10.207.212.121  6443/tcp  Kubernetes master running.
  containerd/0             active    idle             10.207.212.121            Container runtime available
  flannel/0                active    idle             10.207.212.121            Flannel subnet 10.1.45.1/24
  kata/1                   active    idle             10.207.212.121            Kata runtime available
kubernetes-worker/1*       active    idle   13        10.207.190.2              Kubernetes worker running.
  containerd/1*            active    idle             10.207.190.2              Container runtime available
  flannel/1*               active    idle             10.207.190.2              Flannel subnet 10.1.39.1/24
  kata/2*                  active    idle             10.207.190.2              Kata runtime available

Machine   State    DNS             Inst id               Series  AZ       Message
0         started  10.207.190.16   wisr7-1605            bionic  default  Deployed
0/lxd/17  started  10.207.212.119  juju-9efc09-0-lxd-17  bionic  default  Container started
0/lxd/18  started  10.207.212.123  juju-9efc09-0-lxd-18  bionic  default  Container started
0/lxd/19  started  10.207.212.122  juju-9efc09-0-lxd-19  bionic  default  Container started
0/lxd/20  started  10.207.212.121  juju-9efc09-0-lxd-20  bionic  default  Container started
13        started  10.207.190.2    wisr7-0205            bionic  default  Deployed
```

### Create pod problem

The tutorial [Ensuring security and isolation in Charmed Kubernetes with Kata Containers](https://juju.is/tutorials/charmed-kubernetes-kata-containers) creates the pod with the Kata runtime through a runtime class, but following it as written produces this error:

```bash
Failed create pod sandbox: rpc error: code = NotFound desc = failed to create containerd task: failed to create shim: not found
```

However, [How to use Kata Containers and Containerd](https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md) in the official kata-containers GitHub repository states that pods can also be created through the untrusted-workload annotation:

```yaml
annotations:
  io.kubernetes.cri.untrusted-workload: "true"
```

The pod can be created successfully with the following yaml file:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  #runtimeClassName: kata
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```

After it is running, use the containerd CLI to confirm that it really runs inside a Kata container:

```bash
$ sudo ctr -n=k8s.io c ls
CONTAINER                                                           IMAGE                                                      RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a    docker.io/library/nginx:latest                             io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
```

The entries whose RUNTIME column shows `io.containerd.kata.v2` belong to the pod we created with the Kata runtime. The next question is why the runtime class path cannot start normally.
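When a sandbox fails like this, two generic places usually surface the underlying error (hedged, not specific to this cluster):

```bash
# Pod events carry the CRI error shown above
kubectl describe pod kata-container | tail -n 20
# containerd's own log shows which shim binary it tried to launch
sudo journalctl -u containerd --since "10 minutes ago"
```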
First, look at the containerd configuration file `/etc/containerd/config.toml`:

```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      [plugins.cri.containerd.runtimes.kata.options]
        Runtime = "kata-runtime"
        RuntimeRoot = "/usr/bin/kata-runtime"
```

Here you can see containerd's runtime class settings: there are two runtime classes in total, `runc` and `kata`. Nothing looks obviously wrong, and running Kata directly with the following ctr command also works, so the Kata configuration itself can be ruled out for now:

```bash
sudo ctr run --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/busybox:latest hello sh
```

Finally, I decided to comment out the kata runtime's options to see whether they caused the error:

```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
#     [plugins.cri.containerd.runtimes.kata.options]
#       Runtime = "kata-runtime"
#       RuntimeRoot = "/usr/bin/kata-runtime"
#       ConfigPath = "/etc/kata-containers/configuration.toml"
```

The toml file ends up as above. Then restart containerd:

```bash
sudo systemctl restart containerd
```

and try to create the pod again:

```bash
$ kubectl create -f kata.yaml
runtimeclass.node.k8s.io/kata created
$ kubectl create -f nginx-kata.yaml
pod/nginx-kata created
$ kubectl get pod
NAME             READY   STATUS    RESTARTS   AGE
kata-container   1/1     Running   0          24h
nginx-kata       1/1     Running   0          12s
```

This time it worked. Verify with the ctr command:

```bash
$ sudo ctr -n=k8s.io c ls
CONTAINER                                                           IMAGE                                                      RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
06498b4762fc5f04bed9f30d4708c053615a9d2a719250539986f59078ef6c03    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
085981127cbff581c0727f610da34ec92fd97f9eee298e3025f6fcac302a3b86    docker.io/library/nginx:latest                             io.containerd.kata.v2
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a    docker.io/library/nginx:latest                             io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
```

This confirms that the Kata container was successfully deployed through the runtime class.
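On the Kubernetes side it is also worth confirming that the RuntimeClass object exists and that its handler matches the runtime name in the containerd config (a generic check, not part of the original debugging session):

```bash
# "handler: kata" must match [plugins.cri.containerd.runtimes.kata]
kubectl get runtimeclass kata -o yaml
```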
It is also possible to execute:

```bash
ps -ef | grep qemu
```

to check whether a qemu VM is running, which verifies that the Kata container really wraps the workload in a qemu VM.

```bash
$ ps -ef | grep qemu
ubuntu    8361  7082  0 10:31 pts/0    00:00:00 grep --color=auto qemu
root     20247     1  0 Sep01 ?        00:17:07 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963 -uuid 224a187a-92e6-4b35-bfb8-f641806312d1 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-89367521,guest-cid=89367521,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=86:1b:71:42:09:96,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16
```

But according to the GitHub documentation:

> From Containerd v1.2.4 and Kata v1.6.0, there is a new runtime option supported, which allows you to specify a specific Kata configuration file as follows:
> ```toml
> [plugins.cri.containerd.runtimes.kata]
>   runtime_type = "io.containerd.kata.v2"
>   [plugins.cri.containerd.runtimes.kata.options]
>     ConfigPath = "/etc/kata-containers/config.toml"
> ```

so the runtime options should be supported here. I suspected the config path was wrong, but supplying the path used by the kata-runtime configuration and restarting did not help, and I have not yet found the cause. Finally, I tried again with the options table declared but no settings inside it.
```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      [plugins.cri.containerd.runtimes.kata.options]
#       Runtime = "kata-runtime"
#       RuntimeRoot = "/usr/bin/kata-runtime"
#       ConfigPath = "/etc/kata-containers/configuration.toml"
```

According to the GitHub description, if `ConfigPath` is not set the default path is used, but with the options table present the pod still cannot be created. It appears that the runtime options simply cannot be set here; the exact reason has not been found yet.

## How to build a pod with kata container

Currently there are two ways to create a pod with a Kata container:

1. Annotation
2. Runtime class

### Annotation

To create a pod using the annotation method, add the `untrusted_workload_runtime` table under `[plugins.cri.containerd]` in `/etc/containerd/config.toml`, so containerd knows which runtime to use when it encounters an `untrusted-workload` annotation:

```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
```

In the Kubernetes yaml file, just add the annotation under `metadata`:

```yaml
annotations:
  io.kubernetes.cri.untrusted-workload: "true"
```

As shown below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations:                                      ###
    io.kubernetes.cri.untrusted-workload: "true"    ###
spec:
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```

### Runtime class

To create a pod with a runtime class, first configure it in `/etc/containerd/config.toml` under `[plugins.cri.containerd.runtimes]`, so containerd knows which runtime each runtime class maps to:

```toml
[plugins.cri.containerd.runtimes]
  [plugins.cri.containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v1"
  [plugins.cri.containerd.runtimes.kata]
    runtime_type = "io.containerd.kata.v2"
```

On the Kubernetes side, you must first create a RuntimeClass and then reference it from the pod's yaml file. The RuntimeClass yaml is:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
```

The important field is `handler`: fill in the runtime class configured in containerd, e.g. `plugins.cri.containerd.runtimes.runc` is `runc` and `plugins.cri.containerd.runtimes.kata` is `kata`. After that, just add the `runtimeClassName` field to the pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx-kata
  name: nginx-kata
spec:
  runtimeClassName: kata    ###
  containers:
  - image: nginx
    name: nginx-kata
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```

## Confirm network architecture

First, we can check that vsock is actually running and how it is configured:
```bash
$ ps aux | grep vsock
root     12796  0.3  0.4 2862424 223724 ?      Sl   16:23   0:10 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf -uuid 23908cd2-5310-4c3f-aaf3-0f8ba449a0dc -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-751925852,guest-cid=751925852,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=f6:45:6e:18:f8:36,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16
```

Here you can observe the `agent.use_vsock` flag and the guest CID (on the `vhost-vsock-pci` device, `guest-cid=751925852`). Then use the brctl command to check whether the bridge on the host side is connected to the pod's veth device:

```bash
$ brctl show
bridge name     bridge id               STP enabled     interfaces
cni0            8000.cad82eecfdf8       no              veth3aa3ebe4
                                                        veth546c1a80
                                                        vethfaabff91
                                                        vethfcbd921f
```

Comparing the output before and after, building a pod with a Kata container attaches the veth device `vethfaabff91` to the `cni0` bridge, which matches the description on GitHub.
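A simple way to reproduce that before/after comparison (a generic sketch; the manifest name is the one used earlier in this record):

```bash
# Snapshot bridge membership, create the pod, then diff
brctl show cni0 > /tmp/cni0-before
kubectl create -f nginx-kata.yaml
sleep 10
brctl show cni0 | diff /tmp/cni0-before -
```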
Next, we can look up the network namespaces and then inspect the network architecture and tc rules.

```bash
$ sudo ip netns list
cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd (id: 0)
cni-20199cbf-eb8f-063f-686b-6f0d8717754a (id: 1)
cni-eba349cb-932e-804b-81c3-cdf3ee91dbb0 (id: 4)
cni-1114f2bb-8969-6fcc-58a7-d2666bcc904a (id: 3)
```

This command lists the current network namespaces; as more pods are created, the higher ids correspond to the namespaces of newer pods. Next, use `ip addr` to look at the interfaces and their addresses:

```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if1668: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether f6:45:6e:18:f8:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.39.154/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f445:6eff:fe18:f836/64 scope link
       valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
    link/ether ae:af:03:cd:2f:35 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::acaf:3ff:fecd:2f35/64 scope link
       valid_lft forever preferred_lft forever
```

Here you can see the tap device `tap0_kata` and the veth device `eth0`. Finally, use `tc filter show dev <name> [ingress|egress]` to view the tc rules:

```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev tap0_kata ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
  match 00000000/00000000 at 0
        action order 1: mirred (Egress Redirect to device eth0) stolen
        index 2 ref 1 bind 1
```

Every packet that matches this rule (the all-zero match means match everything) is redirected from `tap0_kata` to `eth0`, and the mirror rule on `eth0` does the same in the other direction:

```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev eth0 ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
  match 00000000/00000000 at 0
        action order 1: mirred (Egress Redirect to device tap0_kata) stolen
        index 1 ref 1 bind 1
```

So traffic entering `eth0` is filtered and forwarded to `tap0_kata` by the same kind of rule.
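To make the rule semantics concrete, this is roughly how such a redirect pair is created (illustrative only; on a live Kata pod these rules already exist, so do not re-add them there):

```bash
# Match-all u32 filter on ingress, with a mirred redirect to the peer device
NS=cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd   # example netns from above
sudo ip netns exec "$NS" tc qdisc add dev eth0 ingress
sudo ip netns exec "$NS" tc filter add dev eth0 parent ffff: \
  protocol all u32 match u8 0 0 \
  action mirred egress redirect dev tap0_kata
```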
### Matching namespace, veth, and container

To precisely match a container to its network namespace and veth device, use the ctr command to look up the corresponding values. First describe the pod to find its containerd id:

```bash
$ kubectl describe pod nginx-kata
Name:         nginx-kata
Namespace:    default
Priority:     0
Node:         wisr7-0205/10.207.190.2
Start Time:   Fri, 10 Sep 2021 15:11:36 +0800
Labels:       run=nginx-kata
Annotations:  <none>
Status:       Running
IP:           10.1.39.40
IPs:
  IP:  10.1.39.40
Containers:
  nginx-kata:
    Container ID:   containerd://fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:853b221d3341add7aaadf5f81dd088ea943ab9c918766e295321294b035f3f3e
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 10 Sep 2021 15:13:26 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7png (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-l7png:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l7png
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
```

The containerd id is `fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05`; use it to find the corresponding VM (sandbox) id:

```bash
$ sudo ctr -n=k8s.io c info fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05 | grep sandbox
            "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/hostname",
            "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/resolv.conf",
            "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/shm",
            "io.kubernetes.cri.sandbox-id": "27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01",
            "io.kubernetes.cri.sandbox-name": "nginx-kata",
            "io.kubernetes.cri.sandbox-namespace": "default"
```

From the container info we determine that the VM id is `27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01`.
Then use the ctr command again to find the namespace data:

```bash
$ sudo ctr -n=k8s.io c info 27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01 | grep cni
                "path": "/var/run/netns/cni-42d0400b-c2c9-f365-0497-3f170429a55f"
```

This tells us that the container's network namespace is `cni-42d0400b-c2c9-f365-0497-3f170429a55f`. Enter the namespace to see the devices inside:

```bash
$ sudo ip netns exec cni-42d0400b-c2c9-f365-0497-3f170429a55f ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if1806: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether c2:0e:0b:8d:ff:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.39.40/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c00e:bff:fe8d:ff36/64 scope link
       valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
    link/ether b6:a5:16:3c:0d:bf brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b4a5:16ff:fe3c:dbf/64 scope link
       valid_lft forever preferred_lft forever
```

The number after `eth0@if` is the interface index `1806` of the peer device. Searching for it in the default namespace gives the matching veth device:

```bash
$ ip addr | grep 1806
1806: vethf95422f9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
```

So `vethf95422f9@if3` is the corresponding veth device, and `brctl show` confirms that it is attached to the bridge:

```bash
$ brctl show
bridge name     bridge id               STP enabled     interfaces
cni0            8000.cad82eecfdf8       no              veth3aa3ebe4
                                                        veth546c1a80
                                                        vethf95422f9
                                                        vethfcbd921f
```

With this we have successfully matched the container id to its network settings.
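The whole lookup chain (pod name to container id to sandbox id to netns path) can also be scripted. A minimal sketch, under the assumption that the grep/awk patterns above hold for your containerd version:

```bash
POD=nginx-kata
# Container id, stripping the containerd:// prefix
CID=$(kubectl get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|containerd://||')
# Sandbox (VM) id from the container annotations
SID=$(sudo ctr -n=k8s.io c info "$CID" \
  | grep '"io.kubernetes.cri.sandbox-id"' | awk -F'"' '{print $4}')
# Network namespace path of the sandbox
sudo ctr -n=k8s.io c info "$SID" | grep /var/run/netns
```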
### Confirm virtual machine status

We can connect to the VM in debug mode to check its status; the setup is described in [enable debug console](https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#enabling-debug-console-for-qemu). Once it is configured, connect to the VM as follows:

```bash
$ console="/var/run/vc/vm/${id}/console.sock"
$ sudo socat "stdin,raw,echo=0,escape=0x11" "unix-connect:${console}"
```

Inside, basic commands such as `ls` and `cat` are not available, so you can only use `tab` completion to explore the process status under `/proc`. First, confirm that `kata-agent` really runs inside the VM: its name can be found in `/proc/66/comm`. Since `cat` cannot be used, print files line by line with the shell's `read` builtin in a loop:

```shell
# cd /proc
# cd 66
# while read line; do echo $line; done <comm
kata-agent
```

To confirm that `nginx` is there as well, search under `/proc` in the same way:

```shell
# cd 122
# while read line; do echo $line; done <comm
nginx
```

You can also see the settings the VM was booted with in `/proc/cmdline`:

```shell
# while read line; do echo $line; done <cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 debug systemd.show_status=true systemd.log_level=debug panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none agent.debug_console
```

Comparing this with the `ps aux | grep qemu` output on the host side shows that many settings are carried into the VM, such as the `use_vsock` flag and the agent settings.

Finally, to check the network settings: every NIC is listed in `/proc/net/dev`, which we can use to verify that `eth0` exists in the VM.

```shell
# cd /proc/net
# while read line; do echo $line; done <dev
Inter-| Receive | Transmit
face|bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dummy0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eth0: 6708 92 0 0 0 0 0 0 1180 16 0 0 0 0 0 0
```

There is indeed an `eth0` NIC, which matches the official description on GitHub. For the ip information, read `/proc/net/fib_trie`:

```shell
# while read line; do echo $line; done <fib_trie
Main:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST
Local:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST
```

Now use `kubectl` to get the pod's ip and look for the matching address above:

```bash
$ kubectl get pod -owide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
nginx-kata   1/1     Running   0          5h26m   10.1.39.160   wisr7-0205   <none>           <none>
```

The address 10.1.39.160 does appear in the trie, and `/proc/net/dev` shows that only `eth0` is moving packets, so we can confirm that, as described on GitHub, traffic reaches the VM via `eth0`. Route information is in `/proc/net/route`:

```bash
# while read line; do echo $line; done <route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0127010A 0003 0 0 0 00000000 0 0 0
eth0 0000010A 0127010A 0003 0 0 0 0000FFFF 0 0 0
eth0 0027010A 00000000 0001 0 0 0 00FFFFFF 0 0 0
```

(The addresses are little-endian hex: the gateway `0127010A` is 10.1.39.1 and the destination `0027010A` is 10.1.39.0.) Finally, the MAC address can be read from `/sys/class/net/eth0/address`:
```shell
# while read line; do echo $line; done <address
8e:9d:f5:8c:8d:cf
```

## Use Nvidia GPU in kata container

According to [Using Nvidia GPU device with Kata Containers](https://github.com/kata-containers/kata-containers/blob/main/docs/use-cases/Nvidia-GPU-passthrough-and-Kata.md), we can use an Nvidia GPU in a Kata container in two modes:

* Nvidia GPU pass-through mode
* Nvidia vGPU mode

The comparison between the two is as follows:

| Technology | Description | Behavior | Detail |
| -------- | -------- | -------- | -------- |
| Nvidia GPU pass-through mode | GPU passthrough | Physical GPU assigned to a single VM | Direct GPU assignment to VM without limitation |
| Nvidia vGPU mode | GPU sharing | Physical GPU shared by multiple VMs | Mediated passthrough |

### Hardware requirements

Nvidia GPUs recommended for virtualization:

* Nvidia Tesla (T4, M10, P6, V100 or newer)
* Nvidia Quadro RTX 6000/8000

### Host BIOS Requirements

Some hardware, such as the Nvidia Tesla P100 and K40m, needs a larger PCI BARs window:

```bash
$ lspci -s 04:00.0 -vv | grep Region
Region 0: Memory at c6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 383800000000 (64-bit, prefetchable) [size=16G]   # above 4G
Region 3: Memory at 383c00000000 (64-bit, prefetchable) [size=32M]
```

For such large BARs devices, MMIO mapping above the 4G address space must be enabled in the BIOS's PCI configuration. Different vendors label this setting differently:

* Above 4G Decoding
* Memory Hole for PCI MMIO
* Memory Mapped I/O above 4GB

### Host Kernel Requirements

The following flags must be set on the host kernel:

* `CONFIG_VFIO`
* `CONFIG_VFIO_IOMMU_TYPE1`
* `CONFIG_VFIO_MDEV`
* `CONFIG_VFIO_MDEV_DEVICE`
* `CONFIG_VFIO_PCI`

Also add `intel_iommu=on` to the kernel command line at boot time. A quick way to check both is shown below.
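Both prerequisites can be verified from a shell (a generic sketch; the config file path assumes an Ubuntu-style kernel package):

```bash
# VFIO-related kernel options
grep -E 'CONFIG_VFIO(_IOMMU_TYPE1|_MDEV|_MDEV_DEVICE|_PCI)?=' /boot/config-$(uname -r)
# IOMMU actually enabled on the running kernel
grep -o 'intel_iommu=on' /proc/cmdline
```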
### Configure kata container

We also need to set the related items in `/etc/kata-containers/configuration.toml`. For non-large BARs devices, Kata version 1.3.0 or above is recommended; for large BARs devices, Kata version 1.11.0 or above is required.

Hotplug for PCI devices by shpchp (Linux's SHPC PCI hotplug driver):

```toml
machine_type = "q35"

hotplug_vfio_on_root_bus = false
```

Hotplug for PCIe devices by pciehp (Linux's PCIe hotplug driver):

```toml
machine_type = "q35"

hotplug_vfio_on_root_bus = true
pcie_root_port = 1
```

### Build Kata Containers kernel with GPU support

Next, we need to build a guest kernel with GPU support and switch the default kernel to it. Building a guest kernel, including the required environment and packages, is explained in detail in [Build Kata Containers Kernel](https://github.com/kata-containers/kata-containers/tree/main/tools/packaging/kernel).

First, fetch the kernel build scripts with go:

```bash
$ go get -d -u github.com/kata-containers/kata-containers
$ cd $GOPATH/src/github.com/kata-containers/kata-containers/tools/packaging/kernel
$ ./build-kernel.sh setup
```

Then set the following kernel config options:

```bash
# Support PCI/PCIe device hotplug (Required for large BARs device)
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_HOTPLUG_PCI_SHPC=y

# Support for loading modules (Required for load Nvidia drivers)
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y

# Enable the MMIO access method for PCIe devices (Required for large BARs device)
CONFIG_PCI_MMCONFIG=y
```

You also need to disable `CONFIG_DRM_NOUVEAU`:

```bash
# Disable Open Source Nvidia driver nouveau
# It conflicts with the official Nvidia driver
CONFIG_DRM_NOUVEAU=n
```

Then you can build the kernel:

```bash
## Build guest kernel with ../../tools/packaging/kernel

# Prepare (download guest kernel source, generate .config)
$ ./build-kernel.sh -v 4.19.86 -g nvidia -f setup

# Build guest kernel
$ ./build-kernel.sh -v 4.19.86 -g nvidia build

# Install guest kernel
$ sudo -E ./build-kernel.sh -v 4.19.86 -g nvidia install
/usr/share/kata-containers/vmlinux-nvidia-gpu.container -> vmlinux-4.19.86-70-nvidia-gpu
/usr/share/kata-containers/vmlinuz-nvidia-gpu.container -> vmlinuz-4.19.86-70-nvidia-gpu
```

Then generate the `kernel-devel` rpm package:

```bash
$ cd kata-linux-4.19.86-68
$ make rpm-pkg
Output RPMs:
~/rpmbuild/RPMS/x86_64/kernel-devel-4.19.86_nvidia_gpu-1.x86_64.rpm
```

Finally, point the kernel path in `/etc/kata-containers/configuration.toml` at the new kernel:

```toml
kernel = "/usr/share/kata-containers/vmlinuz-nvidia-gpu.container"
```

### Nvidia GPU pass-through mode with Kata Containers

First find the Bus-Device-Function (BDF) of the GPU on the host:

```bash
$ sudo lspci -nn -D | grep -i nvidia
0000:04:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
0000:84:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
```

Then find the IOMMU group of the GPU:

```bash
$ BDF="0000:04:00.0"
$ readlink -e /sys/bus/pci/devices/$BDF/iommu_group
/sys/kernel/iommu_groups/45
```

Check the IOMMU group number under `/dev/vfio`:

```bash
$ ls -l /dev/vfio
total 0
crw------- 1 root root 248,   0 Feb 28 09:57 45
crw------- 1 root root 248,   1 Feb 28 09:57 54
crw-rw-rw- 1 root root  10, 196 Feb 28 09:57 vfio
```

Then create a container that uses the GPU:

```bash
$ sudo docker run -it --runtime=kata-runtime --cap-add=ALL --device /dev/vfio/45 centos /bin/bash
```

The example here starts the container directly with docker; with kubernetes or containerd it should be similar. Inside the container, confirm that the GPU shows up in the PCI device list:

```bash
$ lspci -nn -D | grep '10de:15f8'
0000:01:01.0 3D controller [0302]: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] [10de:15f8] (rev a1)
```

You can also confirm the size of the PCI BARs in the container:

```bash
$ lspci -s 01:01.0 -vv | grep Region
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Region 1: Memory at 4400000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 3: Memory at 4800000000 (64-bit, prefetchable) [disabled] [size=32M]
```
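One step this record glosses over: the GPU must be bound to the `vfio-pci` driver before `/dev/vfio/<group>` appears and can be handed to the container. A hedged sketch following the upstream passthrough guide's approach, using the vendor:device ids from the lspci output above:

```bash
# Load vfio-pci, detach the GPU from its current driver,
# then register its vendor/device id with vfio-pci
BDF="0000:04:00.0"
sudo modprobe vfio-pci
echo "$BDF" | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
echo 10de 15f8 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
```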
### Nvidia vGPU mode with Kata Containers

> Nvidia vGPU is a licensed product on all supported GPU boards. A software license is required to enable all vGPU features within the guest VM.

The point of vGPU mode is to let several VMs share the host's GPU device at the same time, so the driver installation is done on the host, and a license is required to use these features inside the guest VMs.

First download the official Nvidia driver from [Nvidia](https://www.nvidia.com/Download/index.aspx), for example `NVIDIA-Linux-x86_64-418.87.01.run`. Then install the `kernel-devel` rpm built earlier:

```bash
$ sudo rpm -ivh kernel-devel-4.19.86_gpu-1.x86_64.rpm
```

Then follow the official steps to extract, compile and install the Nvidia driver:

```bash
## Extract
$ sh ./NVIDIA-Linux-x86_64-418.87.01.run -x

## Compile and install (It will take some time)
$ cd NVIDIA-Linux-x86_64-418.87.01
$ sudo ./nvidia-installer -a -q --ui=none \
    --no-cc-version-check \
    --no-opengl-files --no-install-libglvnd \
    --kernel-source-path=/usr/src/kernels/`uname -r`
```

or

```bash
$ sudo sh ./NVIDIA-Linux-x86_64-418.87.01.run -a -q --ui=none \
    --no-cc-version-check \
    --no-opengl-files --no-install-libglvnd \
    --kernel-source-path=/usr/src/kernels/`uname -r`
```

View the installer logs:

```bash
$ tail -f /var/log/nvidia-installer.log
```

Then load the Nvidia driver module:

```bash
# Optional (generate modules.dep and map files for Nvidia driver)
$ sudo depmod

# Load module
$ sudo modprobe nvidia-drm

# Check module
$ lsmod | grep nvidia
nvidia_drm             45056  0
nvidia_modeset       1093632  1 nvidia_drm
nvidia              18202624  1 nvidia_modeset
drm_kms_helper        159744  1 nvidia_drm
drm                   364544  3 nvidia_drm,drm_kms_helper
i2c_core               65536  3 nvidia,drm_kms_helper,drm
ipmi_msghandler        49152  1 nvidia
```

Finally, check the status:

```bash
$ nvidia-smi
Tue Mar  3 00:03:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:01:01.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Once the driver is installed, the GPU can be shared by different VMs.
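If the installed driver is the vGPU (mdev) variant, the available vGPU types should be visible through the mediated-device sysfs tree (a hedged check; the BDF is the example one from above, and the path follows the standard mdev layout):

```bash
# Each subdirectory is one vGPU type that can be instantiated
ls /sys/class/mdev_bus/0000:04:00.0/mdev_supported_types
```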