# Kata Containers deployment record
## Issue Log
### Add-relation problem
When trying to execute
```bash
juju add-relation kata kubernetes-master
```
an error occurs, and `juju status` reports that the charm download failed.
Check the unit log at `/var/log/juju/unit_kata.log` to see what went wrong:
```bash
DEBUG install Traceback (most recent call last):
DEBUG install File "/var/lib/juju/agents/unit-kata-0/charm/hooks/install", line 8, in <module>
DEBUG install basic.bootstrap_charm_deps()
DEBUG install File "lib/charms/layer/basic.py", line 130, in bootstrap_charm_deps
DEBUG install install_or_update_charm_env()
DEBUG install File "lib/charms/layer/basic.py", line 171, in install_or_update_charm_env
DEBUG install '--version']).decode('utf8'))
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
DEBUG install **kwargs).stdout
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 423, in run
DEBUG install with Popen(*popenargs, **kwargs) as process:
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
DEBUG install restore_signals, start_new_session)
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
DEBUG install raise child_exception_type(errno_num, err_msg, err_filename)
DEBUG install PermissionError: [Errno 13] Permission denied: 'bin/charm-env'
```
The permission-denied error turns out to come from `lib/charms/layer/basic.py`.
The problematic code is as follows:
```python
try:
    bundled_version = parse_version(
        check_output(['bin/charm-env',
                      '--version']).decode('utf8'))
```
The error appears to be caused by insufficient permissions when `check_output` runs the command. Since I don't know how to grant root permissions during `add-relation`, I decided to modify the code directly:
```python
try:
    bundled_version = parse_version(
        check_output(['sudo', 'bin/charm-env',
                      '--version']).decode('utf8'))
```
After this modification, the subordinate kata charm installs normally.
Executing
```bash
juju add-relation kata kubernetes-worker
```
sometimes produces the same error, which can be solved with the same fix.
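For reference, this is roughly how the fix can be applied to an affected unit before retrying the hook (a sketch; the unit name, unit number, and file path may differ on your deployment):
```bash
# Hypothetical unit name; check `juju status` for the failing unit
juju ssh kata/0
# Inside the unit: patch the bootstrap helper shown in the traceback
sudo sed -i "s|\['bin/charm-env'|['sudo', 'bin/charm-env'|" \
  /var/lib/juju/agents/unit-kata-0/charm/lib/charms/layer/basic.py
exit
# Ask juju to retry the failed hook
juju resolved kata/0
```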
This is the output after the deployment completes:
```bash
$ watch juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20
Every 2.0s: juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20 maas-vm: Thu Sep 9 07:15:16 2021
Model Controller Cloud/Region Version SLA Timestamp
default maas-cloud-controller maas-cloud/default 2.7.6.1 unsupported 07:15:17Z
App Version Status Scale Charm Store Rev OS Notes
containerd active 0 containerd local 0 ubuntu
easyrsa 3.0.1 active 1 easyrsa local 1 ubuntu
etcd 3.2.10 active 1 etcd local 2 ubuntu
flannel 0.11.0 active 0 flannel local 0 ubuntu
kata active 0 kata local 1 ubuntu
kubeapi-load-balancer 1.14.0 active 1 kubeapi-load-balancer local 1 ubuntu
kubernetes-master 1.16.15 active 1 kubernetes-master local 3 ubuntu
kubernetes-worker 1.16.15 active 1 kubernetes-worker local 2 ubuntu
Unit Workload Agent Machine Public address Ports Message
easyrsa/1* active idle 0/lxd/17 10.207.212.119 Certificate Authority connected.
etcd/4* active idle 0/lxd/18 10.207.212.123 2379/tcp Healthy with 1 known peer
kubeapi-load-balancer/4* active idle 0/lxd/19 10.207.212.122 443/tcp Loadbalancer ready.
kubernetes-master/4* active idle 0/lxd/20 10.207.212.121 6443/tcp Kubernetes master running.
containerd/0 active idle 10.207.212.121 Container runtime available
flannel/0 active idle 10.207.212.121 Flannel subnet 10.1.45.1/24
kata/1 active idle 10.207.212.121 Kata runtime available
kubernetes-worker/1* active idle 13 10.207.190.2 Kubernetes worker running.
containerd/1* active idle 10.207.190.2 Container runtime available
flannel/1* active idle 10.207.190.2 Flannel subnet 10.1.39.1/24
kata/2* active idle 10.207.190.2 Kata runtime available
Machine State DNS Inst id Series AZ Message
0 started 10.207.190.16 wisr7-1605 bionic default Deployed
0/lxd/17 started 10.207.212.119 juju-9efc09-0-lxd-17 bionic default Container started
0/lxd/18 started 10.207.212.123 juju-9efc09-0-lxd-18 bionic default Container started
0/lxd/19 started 10.207.212.122 juju-9efc09-0-lxd-19 bionic default Container started
0/lxd/20 started 10.207.212.121 juju-9efc09-0-lxd-20 bionic default Container started
13 started 10.207.190.2 wisr7-0205 bionic default Deployed
```
### Pod creation problem
In the tutorial [Ensuring security and isolation in Charmed Kubernetes with Kata Containers](https://juju.is/tutorials/charmed-kubernetes-kata-containers), the pod is created with the Kata runtime via a RuntimeClass, but actually following those steps produces an error:
```bash
Failed create pod sandbox: rpc error: code = NotFound desc = failed to create containerd task: failed to create shim: not found
```
However, [How to use Kata Containers and Containerd](https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md) in the official Kata Containers GitHub documentation states that pods can also be created using the untrusted-workload annotation:
```yaml
annotations:
  io.kubernetes.cri.untrusted-workload: "true"
```
The pod can be successfully created using the following yaml file.
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  #runtimeClassName: kata
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```
After running it, you can use the `ctr` command to confirm whether the pod is running with the Kata runtime.
```bash
$ sudo ctr -n=k8s.io c ls
CONTAINER IMAGE RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197 k8s.gcr.io/pause-amd64:3.1 io.containerd.runtime.v1.linux
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2 io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963 k8s.gcr.io/pause-amd64:3.1 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a docker.io/library/nginx:latest io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb k8s.gcr.io/pause-amd64:3.1 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2 io.containerd.runtime.v1.linux
```
You can see from the RUNTIME column that `io.containerd.kata.v2` is used for the pod we created with Kata Containers.
The next question is why the RuntimeClass approach does not start normally.
First, take a look at the containerd configuration file `/etc/containerd/config.toml`.
```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      [plugins.cri.containerd.runtimes.kata.options]
        Runtime = "kata-runtime"
        RuntimeRoot = "/usr/bin/kata-runtime"
```
Here you can see containerd's runtime class settings; there are two runtime classes in total, `runc` and `kata`. Nothing looks obviously wrong, and running the following `ctr` command directly works fine, so the kata configuration itself can be ruled out for now.
```bash
sudo ctr run --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/busybox:latest hello sh
```
Finally, I decided to comment out the kata runtime options to see whether they caused the error.
```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      # [plugins.cri.containerd.runtimes.kata.options]
      #   Runtime = "kata-runtime"
      #   RuntimeRoot = "/usr/bin/kata-runtime"
      #   ConfigPath = "/etc/kata-containers/configuration.toml"
```
This is what the toml file ends up looking like.
Then restart containerd:
```bash
sudo systemctl restart containerd
```
Then try to create the pod again:
```bash
$ kubectl create -f kata.yaml
runtimeclass.node.k8s.io/kata created
$ kubectl create -f nginx-kata.yaml
pod/nginx-kata created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kata-container 1/1 Running 0 24h
nginx-kata 1/1 Running 0 12s
```
In the end it worked.
Verify with the `ctr` command:
```bash
$ sudo ctr -n=k8s.io c ls
CONTAINER IMAGE RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197 k8s.gcr.io/pause-amd64:3.1 io.containerd.runtime.v1.linux
06498b4762fc5f04bed9f30d4708c053615a9d2a719250539986f59078ef6c03 k8s.gcr.io/pause-amd64:3.1 io.containerd.kata.v2
085981127cbff581c0727f610da34ec92fd97f9eee298e3025f6fcac302a3b86 docker.io/library/nginx:latest io.containerd.kata.v2
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2 io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963 k8s.gcr.io/pause-amd64:3.1 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a docker.io/library/nginx:latest io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb k8s.gcr.io/pause-amd64:3.1 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2 io.containerd.runtime.v1.linux
```
This confirms that the Kata container was successfully created using the RuntimeClass.
You can also execute:
```bash
ps -ef | grep qemu
```
to see whether a qemu VM is running, verifying that the Kata container does indeed wrap the workload in a qemu VM:
```bash
$ ps -ef | grep qemu
ubuntu 8361 7082 0 10:31 pts/0 00:00:00 grep --color=auto qemu
root 20247 1 0 Sep01 ? 00:17:07 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963 -uuid 224a187a-92e6-4b35-bfb8-f641806312d1 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-89367521,guest-cid=89367521,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=86:1b:71:42:09:96,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16
```
However, according to the GitHub documentation:
> From Containerd v1.2.4 and Kata v1.6.0, there is a new runtime option supported, which allows you to specify a specific Kata configuration file as follows:
> ```toml
> [plugins.cri.containerd.runtimes.kata]
>   runtime_type = "io.containerd.kata.v2"
>   [plugins.cri.containerd.runtimes.kata.options]
>     ConfigPath = "/etc/kata-containers/config.toml"
> ```
So the runtime options should also be supported here. I thought the config path was wrong, but it still starts normally after passing that path in via the kata-runtime configuration and restarting, so I don't yet know what caused the problem.
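One way to sanity-check which configuration file the runtime actually resolves is `kata-runtime kata-env` (a sketch; the section layout of the output varies by Kata version):
```bash
# Print the config path kata-runtime resolves on this host
$ kata-runtime kata-env | grep -A2 '\[Runtime.Config\]'
```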
Finally, I tried again, this time keeping the runtime options table but commenting out all of its settings.
```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
  [plugins.cri.containerd.runtimes]
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v1"
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      [plugins.cri.containerd.runtimes.kata.options]
        # Runtime = "kata-runtime"
        # RuntimeRoot = "/usr/bin/kata-runtime"
        # ConfigPath = "/etc/kata-containers/configuration.toml"
```
According to the description on GitHub, if ConfigPath is not set, the default path is used. Yet the pod still could not be created normally, so it seems the runtime options simply cannot be set here; the specific reason has not been found yet.
## How to build a pod with Kata Containers
Currently, there are two ways to create a pod using Kata Containers:
1. Annotation
2. Runtime class
### Annotation
To create a pod via annotation, you need to add the `untrusted_workload_runtime` entry under `[plugins.cri.containerd]` in `/etc/containerd/config.toml`, so that containerd knows which runtime to use when it encounters an `untrusted_workload` annotation.
```toml
[plugins.cri.containerd]
  no_pivot = false
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.kata.v2"
```
Then, in the Kubernetes yaml file, just add the annotation under `metadata`:
```yaml
annotations:
  io.kubernetes.cri.untrusted-workload: "true"
```
As shown below:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations:                                     ###
    io.kubernetes.cri.untrusted-workload: "true"   ###
spec:
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```
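For completeness, a minimal way to create and verify such a pod (assuming the manifest above is saved as `kata-annotation.yaml`):
```bash
# Create the pod from the annotated manifest
kubectl apply -f kata-annotation.yaml
# On the worker node, confirm the container runs with the Kata runtime
sudo ctr -n=k8s.io c ls | grep io.containerd.kata.v2
```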
### Runtime class
To create a pod with a RuntimeClass, you must first configure it in `/etc/containerd/config.toml` under `[plugins.cri.containerd.runtimes]`, so that containerd knows which runtime each runtime class maps to.
```toml
[plugins.cri.containerd.runtimes]
  [plugins.cri.containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v1"
  [plugins.cri.containerd.runtimes.kata]
    runtime_type = "io.containerd.kata.v2"
```
In addition, on the Kubernetes side you must first create a RuntimeClass and then reference it in the pod's yaml file. The RuntimeClass yaml is as follows:
```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
```
The important field here is `handler`: fill in the runtime class configured in containerd, e.g. `plugins.cri.containerd.runtimes.runc` is `runc` and `plugins.cri.containerd.runtimes.kata` is `kata`.
After that, just add the `runtimeClassName` field to the pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx-kata
  name: nginx-kata
spec:
  runtimeClassName: kata   ###
  containers:
  - image: nginx
    name: nginx-kata
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
```
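Putting it together (assuming the two manifests above are saved as `kata-runtimeclass.yaml` and `nginx-kata.yaml`):
```bash
# Register the RuntimeClass, then create the pod that references it
kubectl apply -f kata-runtimeclass.yaml
kubectl apply -f nginx-kata.yaml
kubectl get pod nginx-kata -owide
```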
## Confirm network architecture
First, we can check that vsock is running and how it is configured by executing:
```bash
$ ps aux | grep vsock
root 12796 0.3 0.4 2862424 223724 ? Sl 16:23 0:10 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf -uuid 23908cd2-5310-4c3f-aaf3-0f8ba449a0dc -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-751925852,guest-cid=751925852,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=f6:45:6e:18:f8:36,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16
```
Note the `use_vsock` flag and the `guest-cid` value.
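A quick one-liner sketch to pull out just the guest CIDs of the running Kata VMs:
```bash
# List the vsock guest CIDs assigned to running qemu processes
ps -ef | grep -oE 'guest-cid=[0-9]+' | sort -u
```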
Then you can use the `brctl` command to check whether the bridge on the host side is connected to the veth devices.
```bash
$ brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.cad82eecfdf8 no veth3aa3ebe4
veth546c1a80
vethfaabff91
vethfcbd921f
```
Comparing before and after, when we create a pod with Kata Containers, the `cni0` bridge gains a connection to the veth device `vethfaabff91`, which matches the description on GitHub.
Next, we can look up the network namespaces and then inspect the network architecture and tc rules.
```bash
$ sudo ip netns list
cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd (id: 0)
cni-20199cbf-eb8f-063f-686b-6f0d8717754a (id: 1)
cni-eba349cb-932e-804b-81c3-cdf3ee91dbb0 (id: 4)
cni-1114f2bb-8969-6fcc-58a7-d2666bcc904a (id: 3)
```
This command lists the current network namespaces; as more new pods are opened, the entries nearer the top correspond to the namespaces of newer pods.
Next, you can use `ip addr` to inspect the interface settings and addresses.
```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if1668: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether f6:45:6e:18:f8:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.39.154/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::f445:6eff:fe18:f836/64 scope link
valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
link/ether ae:af:03:cd:2f:35 brd ff:ff:ff:ff:ff:ff
inet6 fe80::acaf:3ff:fecd:2f35/64 scope link
valid_lft forever preferred_lft forever
```
You can see the tap device `tap0_kata` and the veth endpoint `eth0`.
Finally, we can use `tc filter show dev <name> [ingress|egress]` to view the tc rules:
```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev tap0_kata ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
match 00000000/00000000 at 0
action order 1: mirred (Egress Redirect to device eth0) stolen
index 2 ref 1 bind 1
```
This shows that traffic matching the rule on `tap0_kata` ingress is redirected (mirred) to `eth0`.
```bash
$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev eth0 ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
match 00000000/00000000 at 0
action order 1: mirred (Egress Redirect to device tap0_kata) stolen
index 1 ref 1 bind 1
```
The reverse direction uses the same kind of rule: traffic arriving on `eth0` is redirected back to `tap0_kata`.
### Matching namespace, veth, and container
If you want to precisely match a container to its network namespace and veth device, you can use the `ctr` command to look up the corresponding values.
First, describe the pod to find the container ID:
```bash
$ kubectl describe pod nginx-kata
Name: nginx-kata
Namespace: default
Priority: 0
Node: wisr7-0205/10.207.190.2
Start Time: Fri, 10 Sep 2021 15:11:36 +0800
Labels: run=nginx-kata
Annotations: <none>
Status: Running
IP: 10.1.39.40
IPs:
IP: 10.1.39.40
Containers:
nginx-kata:
Container ID: containerd://fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05
Image: nginx
Image ID: docker.io/library/nginx@sha256:853b221d3341add7aaadf5f81dd088ea943ab9c918766e295321294b035f3f3e
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 10 Sep 2021 15:13:26 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-l7png (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-l7png:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-l7png
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
```
From this we determine that the container ID is `fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05`, and we can use it to find the corresponding sandbox (VM) ID.
```bash
$ sudo ctr -n=k8s.io c info fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05 | grep sandbox
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/hostname",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/resolv.conf",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/shm",
"io.kubernetes.cri.sandbox-id": "27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01",
"io.kubernetes.cri.sandbox-name": "nginx-kata",
"io.kubernetes.cri.sandbox-namespace": "default"
```
From the container info, we determine that the sandbox (VM) ID is `27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01`.
Then we can use the `ctr` command to look up its network namespace:
```bash
$ sudo ctr -n=k8s.io c info 27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01 | grep cni
"path": "/var/run/netns/cni-42d0400b-c2c9-f365-0497-3f170429a55f"
```
From this we know that the container's network namespace is `cni-42d0400b-c2c9-f365-0497-3f170429a55f`; we can enter it to see the devices inside:
```bash
$ sudo ip netns exec cni-42d0400b-c2c9-f365-0497-3f170429a55f ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if1806: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether c2:0e:0b:8d:ff:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.39.40/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::c00e:bff:fe8d:ff36/64 scope link
valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
link/ether b6:a5:16:3c:0d:bf brd ff:ff:ff:ff:ff:ff
inet6 fe80::b4a5:16ff:fe3c:dbf/64 scope link
valid_lft forever preferred_lft forever
```
The number after `eth0@if` is the peer interface index, `1806`. We can find the matching veth device by searching for that index in the default namespace.
```bash
$ ip addr | grep 1806
1806: vethf95422f9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
```
From this we know that `vethf95422f9@if3` is the corresponding veth device, and we can also confirm its existence with `brctl show`:
```bash
$ brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.cad82eecfdf8 no veth3aa3ebe4
veth546c1a80
vethf95422f9
vethfcbd921f
```
In this way, we have successfully matched the container ID to its network settings.
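The lookup above can also be chained into one pass (a sketch; the pod name `nginx-kata` and the field names from the outputs above are assumed):
```bash
# Container ID from kubernetes (strip the containerd:// prefix)
CID=$(kubectl get pod nginx-kata \
  -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|containerd://||')
# Sandbox (VM) ID from the container info
SANDBOX=$(sudo ctr -n=k8s.io c info "$CID" | grep -m1 sandbox-id | cut -d'"' -f4)
# Network namespace path of the sandbox
sudo ctr -n=k8s.io c info "$SANDBOX" | grep cni
```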
### Confirm virtual machine status
We can connect to the VM in debug mode to check its status; the procedure is described in [enable debug console](https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#enabling-debug-console-for-qemu).
Once the setup is done, we can connect to the VM with the following steps:
```bash
$ console="/var/run/vc/vm/${id}/console.sock"
$ sudo socat "stdin,raw,echo=0,escape=0x11" "unix-connect:${console}"
```
Once inside, basic commands such as `ls` and `cat` are unavailable, so you can only use tab completion to explore the relevant process status under `/proc`.
First, you can confirm that `kata-agent` really does run inside the VM: its process name can be found in `/proc/66/comm`.
Since `cat` is unavailable, files can only be printed line by line using the shell's `read` builtin in a loop.
```shell
# cd /proc
# cd 66
# while read line; do echo $line; done <comm
kata-agent
```
If you want to confirm whether there is `nginx`, you can also search under `/proc`.
```shell
# cd 122
# while read line; do echo $line; done <comm
nginx
```
You can also see the settings the VM booted with in `/proc/cmdline`.
```shell
# while read line; do echo $line; done <cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 debug systemd.show_status=true systemd.log_level=debug panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none agent.debug_console
```
Comparing this with the settings shown by `ps aux | grep qemu` on the host, you can see that many of them are passed through into the VM, such as the `use_vsock` flag and some agent settings.
Finally, to check the network settings: every network interface is listed in `/proc/net/dev`, which we can use to verify that `eth0` exists inside the VM.
```shell
# cd /proc/net
# while read line; do echo $line; done <dev
Inter-| Receive | Transmit
face|bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dummy0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eth0: 6708 92 0 0 0 0 0 0 1180 16 0 0 0 0 0 0
```
You can see that there is indeed an `eth0` interface, which also matches the official description on GitHub.
If you want the IP information, you can get it from `/proc/net/fib_trie`.
```shell
# while read line; do echo $line; done <fib_trie
Main:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST
Local:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST
```
Here we use `kubectl` to check the pod's IP, and then look for the corresponding address above.
```bash
$ kubectl get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-kata 1/1 Running 0 5h26m 10.1.39.160 wisr7-0205 <none> <none>
```
As shown above, the address 10.1.39.160 does exist, and in `/proc/net/dev` only the `eth0` interface is handling packet transmission, so we can confirm that, as described on GitHub, the data is indeed delivered into the VM via `eth0`.
In addition, route information can be found in `/proc/net/route`.
```bash
# while read line; do echo $line; done <route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0127010A 0003 0 0 0 00000000 0 0 0
eth0 0000010A 0127010A 0003 0 0 0 0000FFFF 0 0 0
eth0 0027010A 00000000 0001 0 0 0 00FFFFFF 0 0 0
```
Finally, if you want to know the MAC address, you can check it in `/sys/class/net/eth0/address`.
```shell
# while read line; do echo $line; done <address
8e:9d:f5:8c:8d:cf
```
## Use Nvidia GPU in kata container
According to the description in [Using Nvidia GPU device with Kata Containers](https://github.com/kata-containers/kata-containers/blob/main/docs/use-cases/Nvidia-GPU-passthrough-and-Kata.md), we can use an Nvidia GPU in a Kata container in two modes:
* Nvidia GPU pass-through mode
* Nvidia vGPU mode
The comparison between the two is as follows:
| Technology | Description | Behavior | Detail |
| -------- | -------- | -------- | -------- |
|Nvidia GPU pass-through mode|GPU passthrough|Physical GPU assigned to a single VM|Direct GPU assignment to VM without limitation|
|Nvidia vGPU mode|GPU sharing|Physical GPU shared by multiple VMs|Mediated passthrough|
### Hardware requirements
Nvidia GPUs Recommended for Virtualization:
* Nvidia Tesla (T4, M10, P6, V100 or newer)
* Nvidia Quadro RTX 6000/8000
### Host BIOS Requirements
Some hardware, such as the Nvidia Tesla P100 and K40m, needs a larger PCI BAR window:
```bash
$ lspci -s 04:00.0 -vv | grep Region
Region 0: Memory at c6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 383800000000 (64-bit, prefetchable) [size=16G] #above 4G
Region 3: Memory at 383c00000000 (64-bit, prefetchable) [size=32M]
```
If a larger BAR MMIO mapping is required, decoding above 4G needs to be enabled in the PCI configuration of the BIOS.
Different BIOS vendors may label this option differently:
* Above 4G Decoding
* Memory Hole for PCI MMIO
* Memory Mapped I/O above 4GB
### Host Kernel Requirements
The following flags must be set on the host kernel:
* `CONFIG_VFIO`
* `CONFIG_VFIO_IOMMU_TYPE1`
* `CONFIG_VFIO_MDEV`
* `CONFIG_VFIO_MDEV_DEVICE`
* `CONFIG_VFIO_PCI`
Also set `intel_iommu=on` in the kernel cmdline at boot time.
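These requirements can be checked on the host with something like the following (a sketch; the config file location may vary by distribution):
```bash
# Check the VFIO-related options in the running kernel's config
grep -E 'CONFIG_VFIO(_IOMMU_TYPE1|_MDEV|_MDEV_DEVICE|_PCI)?=' /boot/config-$(uname -r)
# Confirm the IOMMU was enabled at boot
grep -o 'intel_iommu=[a-z]*' /proc/cmdline
```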
### Configure kata container
We also need to set the related options in `/etc/kata-containers/configuration.toml`. For non-large-BAR devices, Kata 1.3.0 or later is recommended; for large-BAR devices, Kata 1.11.0 or later is required.
Hotplug for PCI devices by shpchp (Linux's SHPC PCI Hotplug driver):
```toml
machine_type = "q35"
hotplug_vfio_on_root_bus = false
```
Hotplug for PCIe devices by pciehp (Linux's PCIe Hotplug driver):
```toml
machine_type = "q35"
hotplug_vfio_on_root_bus = true
pcie_root_port = 1
```
### Build Kata Containers kernel with GPU support
Next, we need to build a guest kernel with GPU support and replace the default kernel with it. How to build a guest kernel is explained in detail in [Build Kata Containers Kernel](https://github.com/kata-containers/kata-containers/tree/main/tools/packaging/kernel), which also lists the environment and packages required for the build.
First, use go to fetch the kernel build scripts:
```bash
$ go get -d -u github.com/kata-containers/kata-containers
$ cd $GOPATH/src/github.com/kata-containers/kata-containers/tools/packaging/kernel
$ ./build-kernel.sh setup
```
Then set the following kernel config options:
```bash
# Support PCI/PCIe device hotplug (Required for large BARs device)
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_HOTPLUG_PCI_SHPC=y
# Support for loading modules (Required for load Nvidia drivers)
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# Enable the MMIO access method for PCIe devices (Required for large BARs device)
CONFIG_PCI_MMCONFIG=y
```
You also need to disable `CONFIG_DRM_NOUVEAU`:
```bash
# Disable Open Source Nvidia driver nouveau
# It conflicts with Nvidia official driver
CONFIG_DRM_NOUVEAU=n
```
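One way to toggle these options in the generated `.config` without editing it by hand is the kernel's own `scripts/config` helper (a sketch; the source directory name created by `build-kernel.sh` may differ):
```bash
cd kata-linux-4.19.86-68
# Enable/disable the options listed above (the CONFIG_ prefix is implied)
./scripts/config --enable HOTPLUG_PCI_PCIE --enable HOTPLUG_PCI_SHPC \
                 --enable MODULES --enable MODULE_UNLOAD \
                 --enable PCI_MMCONFIG --disable DRM_NOUVEAU
```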
Then you can build the kernel:
```bash
## Build guest kernel with ../../tools/packaging/kernel
# Prepare (download guest kernel source, generate .config)
$ ./build-kernel.sh -v 4.19.86 -g nvidia -f setup
# Build guest kernel
$ ./build-kernel.sh -v 4.19.86 -g nvidia build
# Install guest kernel
$ sudo -E ./build-kernel.sh -v 4.19.86 -g nvidia install
/usr/share/kata-containers/vmlinux-nvidia-gpu.container -> vmlinux-4.19.86-70-nvidia-gpu
/usr/share/kata-containers/vmlinuz-nvidia-gpu.container -> vmlinuz-4.19.86-70-nvidia-gpu
```
Then generate the `kernel-devel` rpm package:
```bash
$ cd kata-linux-4.19.86-68
$ make rpm-pkg
Output RPMs:
~/rpmbuild/RPMS/x86_64/kernel-devel-4.19.86_nvidia_gpu-1.x86_64.rpm
```
Finally, set the kernel path in `/etc/kata-containers/configuration.toml` and you are done.
```toml
kernel = "/usr/share/kata-containers/vmlinuz-nvidia-gpu.container"
```
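You can then confirm which guest kernel kata-runtime will boot (a sketch; the section layout of `kata-env` varies by Kata version):
```bash
# Show the kernel path and parameters kata-runtime resolves
$ kata-runtime kata-env | grep -A3 '\[Kernel\]'
```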
### Nvidia GPU pass-through mode with Kata Containers
First find the Bus-Device-Function (BDF) of the GPU on the host:
```bash
$ sudo lspci -nn -D | grep -i nvidia
0000:04:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
0000:84:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
```
Then find the IOMMU group of the GPU:
```bash
$ BDF="0000:04:00.0"
$ readlink -e /sys/bus/pci/devices/$BDF/iommu_group
/sys/kernel/iommu_groups/45
```
Check the IOMMU group number under `/dev/vfio`:
```bash
$ ls -l /dev/vfio
total 0
crw------- 1 root root 248, 0 Feb 28 09:57 45
crw------- 1 root root 248, 1 Feb 28 09:57 54
crw-rw-rw- 1 root root 10, 196 Feb 28 09:57 vfio
```
Then you can create a container that uses the GPU:
```bash
$ sudo docker run -it --runtime=kata-runtime --cap-add=ALL --device /dev/vfio/45 centos /bin/bash
```
This example starts the container directly with docker; doing it with kubernetes or containerd should be similar.
Inside the container, we can confirm that the GPU device appears in the PCI device list:
```bash
$ lspci -nn -D | grep '10de:15f8'
0000:01:01.0 3D controller [0302]: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] [10de:15f8] (rev a1)
```
You can also confirm the size of the PCI BARS in the container:
```bash
$ lspci -s 01:01.0 -vv | grep Region
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Region 1: Memory at 4400000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 3: Memory at 4800000000 (64-bit, prefetchable) [disabled] [size=32M]
```
### Nvidia vGPU mode with Kata Containers
> Nvidia vGPU is a licensed product on all supported GPU boards. A software license is required to enable all vGPU features within the guest VM.
The point of vGPU mode is that multiple VMs can use the host's GPU device at the same time; the driver installation is performed on the host, and a license is required to use these features in the guest VM.
First download the official Nvidia driver from [Nvidia](https://www.nvidia.com/Download/index.aspx), such as `NVIDIA-Linux-x86_64-418.87.01.run`.
Then install the `kernel-devel` rpm package that was built earlier:
```bash
$ sudo rpm -ivh kernel-devel-4.19.86_gpu-1.x86_64.rpm
```
Then follow the official steps to extract, compile, and install the Nvidia driver:
```bash
## Extract
$ sh ./NVIDIA-Linux-x86_64-418.87.01.run -x
## Compile and install (It will take some time)
$ cd NVIDIA-Linux-x86_64-418.87.01
$ sudo ./nvidia-installer -a -q --ui=none \
--no-cc-version-check \
--no-opengl-files --no-install-libglvnd \
--kernel-source-path=/usr/src/kernels/`uname -r`
```
or
```bash
$ sudo sh ./NVIDIA-Linux-x86_64-418.87.01.run -a -q --ui=none \
--no-cc-version-check \
--no-opengl-files --no-install-libglvnd \
--kernel-source-path=/usr/src/kernels/`uname -r`
```
View installer logs:
```bash
$ tail -f /var/log/nvidia-installer.log
```
Then load the Nvidia driver module:
```bash
# Optional (generate modules.dep and map files for Nvidia driver)
$ sudo depmod
# Load module
$ sudo modprobe nvidia-drm
# Check module
$ lsmod | grep nvidia
nvidia_drm 45056 0
nvidia_modeset 1093632 1 nvidia_drm
nvidia 18202624 1 nvidia_modeset
drm_kms_helper 159744 1 nvidia_drm
drm 364544 3 nvidia_drm,drm_kms_helper
i2c_core 65536 3 nvidia,drm_kms_helper,drm
ipmi_msghandler 49152 1 nvidia
```
Finally, you can view the status:
```bash
$ nvidia-smi
Tue Mar 3 00:03:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:01:01.0 Off | 0 |
| N/A 27C P0 25W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
Once installed, this GPU can be shared by different VMs.