
Kata container deploy record

Issue Log

Add-relation issue

Try to execute:

juju add-relation kata kubernetes-master

An error occurs, and juju status shows that an error happened while downloading the charm.

Check the unit log at /var/log/juju/unit_kata.log to see what went wrong:

DEBUG install Traceback (most recent call last):
DEBUG install File "/var/lib/juju/agents/unit-kata-0/charm/hooks/install", line 8, in <module>
DEBUG install basic.bootstrap_charm_deps()
DEBUG install File "lib/charms/layer/basic.py", line 130, in bootstrap_charm_deps
DEBUG install install_or_update_charm_env()
DEBUG install File "lib/charms/layer/basic.py", line 171, in install_or_update_charm_env
DEBUG install '--version']).decode('utf8'))
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
DEBUG install **kwargs).stdout
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 423, in run
DEBUG install with Popen(*popenargs, **kwargs) as process:
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
DEBUG install restore_signals, start_new_session)
DEBUG install File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
DEBUG install raise child_exception_type(errno_num, err_msg, err_filename)
DEBUG install PermissionError: [Errno 13] Permission denied: 'bin/charm-env'

It turns out the permission-denied error comes from lib/charms/layer/basic.py.

The problematic code is as follows:

try:
    bundled_version = parse_version(
        check_output(['bin/charm-env',
                      '--version']).decode('utf8'))

The error appears to be caused by insufficient permissions when check_output runs the command. Since I don't know how to grant root permissions during add-relation, I decided to modify the code directly:

try:
    bundled_version = parse_version(
        check_output(['sudo', 'bin/charm-env',
                      '--version']).decode('utf8'))

After this modification, the subordinate kata charm installs normally.

Likewise, executing

juju add-relation kata kubernetes-worker

sometimes fails with the same error, which can be fixed in the same way.
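After editing basic.py on a failed unit, the failed hook can be retried so the charm continues installing. A minimal sketch, assuming the failed unit is named kata/0 (check the actual name with juju status):

$ juju resolved kata/0   # retry the failed install hook
$ juju status kata       # wait for the unit to return to active/idle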

This is the output after the deployment completes:

$ watch juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20
Every 2.0s: juju status 13 0/lxd/17 0/lxd/18 0/lxd/19 0/lxd/20                                                           maas-vm: Thu Sep  9 07:15:16 2021

Model    Controller             Cloud/Region        Version  SLA          Timestamp
default  maas-cloud-controller  maas-cloud/default  2.7.6.1  unsupported  07:15:17Z

App                    Version  Status  Scale  Charm                  Store  Rev  OS      Notes
containerd                      active      0  containerd             local    0  ubuntu
easyrsa                3.0.1    active      1  easyrsa                local    1  ubuntu
etcd                   3.2.10   active      1  etcd                   local    2  ubuntu
flannel                0.11.0   active      0  flannel                local    0  ubuntu
kata                            active      0  kata                   local    1  ubuntu
kubeapi-load-balancer  1.14.0   active      1  kubeapi-load-balancer  local    1  ubuntu
kubernetes-master      1.16.15  active      1  kubernetes-master      local    3  ubuntu
kubernetes-worker      1.16.15  active      1  kubernetes-worker      local    2  ubuntu

Unit                      Workload  Agent  Machine   Public address  Ports     Message
easyrsa/1*                active    idle   0/lxd/17  10.207.212.119            Certificate Authority connected.
etcd/4*                   active    idle   0/lxd/18  10.207.212.123  2379/tcp  Healthy with 1 known peer
kubeapi-load-balancer/4*  active    idle   0/lxd/19  10.207.212.122  443/tcp   Loadbalancer ready.
kubernetes-master/4*      active    idle   0/lxd/20  10.207.212.121  6443/tcp  Kubernetes master running.
  containerd/0            active    idle             10.207.212.121            Container runtime available
  flannel/0               active    idle             10.207.212.121            Flannel subnet 10.1.45.1/24
  kata/1                  active    idle             10.207.212.121            Kata runtime available
kubernetes-worker/1*      active    idle   13        10.207.190.2              Kubernetes worker running.
  containerd/1*           active    idle             10.207.190.2              Container runtime available
  flannel/1*              active    idle             10.207.190.2              Flannel subnet 10.1.39.1/24
  kata/2*                 active    idle             10.207.190.2              Kata runtime available

Machine   State    DNS             Inst id               Series  AZ       Message
0         started  10.207.190.16   wisr7-1605            bionic  default  Deployed
0/lxd/17  started  10.207.212.119  juju-9efc09-0-lxd-17  bionic  default  Container started
0/lxd/18  started  10.207.212.123  juju-9efc09-0-lxd-18  bionic  default  Container started
0/lxd/19  started  10.207.212.122  juju-9efc09-0-lxd-19  bionic  default  Container started
0/lxd/20  started  10.207.212.121  juju-9efc09-0-lxd-20  bionic  default  Container started
13        started  10.207.190.2    wisr7-0205            bionic  default  Deployed

Pod creation problem

The tutorial Ensuring security and isolation in Charmed Kubernetes with Kata Containers creates the pod with Kata Containers via a RuntimeClass, but actually following those steps produces an error with the following message:

Failed create pod sandbox: rpc error: code = NotFound desc = failed to create containerd task: failed to create shim: not found

However, the article How to use Kata Containers and Containerd in the official kata-containers GitHub repository states that pods can also be created with the untrusted-workload annotation:

annotations:
    io.kubernetes.cri.untrusted-workload: "true"

The pod can be successfully created using the following yaml file.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  #runtimeClassName: kata
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}

Once it is running, you can use the containerd CLI (ctr) to confirm that it runs with the Kata runtime.

$ sudo ctr -n=k8s.io c ls
CONTAINER                                                           IMAGE                                                      RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a    docker.io/library/nginx:latest                             io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux

The entries whose RUNTIME column shows io.containerd.kata.v2 belong to the pod we created with Kata Containers.

The next question is why the RuntimeClass approach fails.
First, take a look at the containerd configuration file /etc/containerd/config.toml.

    [plugins.cri.containerd]
      no_pivot = false
      [plugins.cri.containerd.default_runtime]
        runtime_type = "io.containerd.runtime.v1.linux"

      [plugins.cri.containerd.untrusted_workload_runtime]
        runtime_type = "io.containerd.kata.v2"

      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"

        [plugins.cri.containerd.runtimes.kata]
          runtime_type = "io.containerd.kata.v2"
          [plugins.cri.containerd.runtimes.kata.options]
            Runtime = "kata-runtime"
            RuntimeRoot = "/usr/bin/kata-runtime"

Here you can see containerd's runtime settings; there are two runtimes in total, runc and kata. Nothing looks obviously wrong, and running the following ctr command directly works fine, so the Kata configuration itself can tentatively be ruled out.

sudo ctr run --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/busybox:latest hello sh

Finally, I decided to comment out the kata runtime options to see whether they were causing the error.

    [plugins.cri.containerd]
      no_pivot = false
      [plugins.cri.containerd.default_runtime]
        runtime_type = "io.containerd.runtime.v1.linux"

      [plugins.cri.containerd.untrusted_workload_runtime]
        runtime_type = "io.containerd.kata.v2"

      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"

        [plugins.cri.containerd.runtimes.kata]
          runtime_type = "io.containerd.kata.v2"
#          [plugins.cri.containerd.runtimes.kata.options]
#            Runtime = "kata-runtime"
#            RuntimeRoot = "/usr/bin/kata-runtime"
#            ConfigPath = "/etc/kata-containers/configuration.toml"

The config.toml ends up looking like the above.

Then restart containerd:

sudo systemctl restart containerd
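Before retrying, it can be worth confirming that containerd came back up cleanly with the edited configuration; a quick sanity check (plain systemd commands, nothing Kata-specific):

$ sudo systemctl status containerd --no-pager      # should report active (running)
$ sudo journalctl -u containerd -n 20 --no-pager   # look for TOML parse errors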

Then try to create the pod again:

$ kubectl create -f kata.yaml
runtimeclass.node.k8s.io/kata created
$ kubectl create -f nginx-kata.yaml
pod/nginx-kata created
$ kubectl get pod
NAME             READY   STATUS    RESTARTS   AGE
kata-container   1/1     Running   0          24h
nginx-kata       1/1     Running   0          12s

In the end it worked.

Verify with the ctr command:

$ sudo ctr -n=k8s.io c ls
CONTAINER                                                           IMAGE                                                      RUNTIME
013f94e641c37fa74aed36d7f3725f97428a9482e9f405ea5ce22f1d7d879197    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
06498b4762fc5f04bed9f30d4708c053615a9d2a719250539986f59078ef6c03    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
085981127cbff581c0727f610da34ec92fd97f9eee298e3025f6fcac302a3b86    docker.io/library/nginx:latest                             io.containerd.kata.v2
342615365a6bc40e0b43293bbc634d6f35d0c27536d02e06739ec5c9e795612c    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux
7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.kata.v2
a0a47479b5a12a47b5e0d6c6e5726703e39140a8e91aa554e76cad6a67b9d80a    docker.io/library/nginx:latest                             io.containerd.kata.v2
d4093b9f70a25a33d3be74d21722b245ece543dabb0edcee85caa96f0da3f4eb    k8s.gcr.io/pause-amd64:3.1                                 io.containerd.runtime.v1.linux
f4687e9f23541a54d8b6044ac902518a1b7f67d85fe5e124eb1217790f1d078d    rocks.canonical.com:443/cdk/coredns/coredns-amd64:1.6.2    io.containerd.runtime.v1.linux

This confirms that the Kata container is now successfully deployed using the RuntimeClass.

It is also possible to execute:

ps -ef | grep qemu

to check whether a QEMU VM is running, which verifies that Kata Containers really does wrap the pod in a QEMU VM.

$ ps -ef | grep qemu
ubuntu    8361  7082  0 10:31 pts/0    00:00:00 grep --color=auto qemu
root     20247     1  0 Sep01 ?        00:17:07 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963 -uuid 224a187a-92e6-4b35-bfb8-f641806312d1 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-89367521,guest-cid=89367521,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=86:1b:71:42:09:96,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/7ad160ed6b2eb991ddde4d220fda4335e24a3ff6103a7fab6a67bb08a91d5963/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16

However, according to the GitHub documentation:

From Containerd v1.2.4 and Kata v1.6.0, there is a new runtime option supported, which allows you to specify a specific Kata configuration file as follows:

[plugins.cri.containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins.cri.containerd.runtimes.kata.options]
ConfigPath = "/etc/kata-containers/config.toml"

So the runtime options should be supported here as well. I suspected the config path was wrong, but it still starts normally after pointing ConfigPath at the kata-runtime configuration and restarting, so I don't yet know what caused the problem.

Finally, I tried again, keeping the runtime options section header but commenting out all of its settings.

    [plugins.cri.containerd]
      no_pivot = false
      [plugins.cri.containerd.default_runtime]
        runtime_type = "io.containerd.runtime.v1.linux"

      [plugins.cri.containerd.untrusted_workload_runtime]
        runtime_type = "io.containerd.kata.v2"

      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"

        [plugins.cri.containerd.runtimes.kata]
          runtime_type = "io.containerd.kata.v2"
          [plugins.cri.containerd.runtimes.kata.options]
#            Runtime = "kata-runtime"
#            RuntimeRoot = "/usr/bin/kata-runtime"
#            ConfigPath = "/etc/kata-containers/configuration.toml"

According to the GitHub description, if ConfigPath is not set the default path is used, but in the end the pod still cannot be created. It seems that the runtime options cannot be set here at all, though the exact reason has not been found yet.

How to create a pod with Kata Containers

Currently, there are two ways to create a pod with Kata Containers:

  1. Annotation
  2. Runtime class

Annotation

To create a pod with the annotation method, add an untrusted_workload_runtime section under [plugins.cri.containerd] in /etc/containerd/config.toml so that containerd knows which runtime to use when it encounters the untrusted-workload annotation.

    [plugins.cri.containerd]
      no_pivot = false
      [plugins.cri.containerd.default_runtime]
        runtime_type = "io.containerd.runtime.v1.linux"

      [plugins.cri.containerd.untrusted_workload_runtime]
        runtime_type = "io.containerd.kata.v2"

In the Kubernetes YAML file, just add the annotation under metadata:

annotations:
    io.kubernetes.cri.untrusted-workload: "true"

As shown below:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: kata-container
  name: kata-container
  annotations: ###
    io.kubernetes.cri.untrusted-workload: "true" ###
spec:
  containers:
  - image: nginx
    name: kata-container
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
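As a usage note, the annotated manifest is applied like any other pod; the file name below is just an assumption:

$ kubectl apply -f kata-annotation.yaml
$ kubectl get pod kata-container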

Runtime class

To create a pod with a RuntimeClass, you must first configure it in /etc/containerd/config.toml, under [plugins.cri.containerd.runtimes], so that containerd knows which runtime each RuntimeClass maps to.

      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"

        [plugins.cri.containerd.runtimes.kata]
          runtime_type = "io.containerd.kata.v2"

In addition, on the Kubernetes side you must first create a RuntimeClass and then reference it in the pod's YAML file. The RuntimeClass YAML file is as follows:

apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata

The important field here is handler: it must match the runtime name configured in containerd, e.g. plugins.cri.containerd.runtimes.runc maps to runc and plugins.cri.containerd.runtimes.kata maps to kata.

After that, just add the runtimeClassName field to the pod spec.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx-kata
  name: nginx-kata
spec:
  runtimeClassName: kata ###
  containers:
  - image: nginx
    name: nginx-kata
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
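Putting the two manifests together, a minimal usage sketch looks like this (the file names are assumptions, and the RuntimeClass must exist before the pod that references it):

$ kubectl apply -f kata-runtimeclass.yaml
$ kubectl apply -f nginx-kata.yaml
$ kubectl get runtimeclass            # the kata handler should be listed
$ kubectl get pod nginx-kata -o wide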

Confirm network architecture

First of all, we can check how vsock is running and configured. Execute:

$ ps aux | grep vsock
root     12796  0.3  0.4 2862424 223724 ?      Sl   16:23   0:10 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf -uuid 23908cd2-5310-4c3f-aaf3-0f8ba449a0dc -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host,pmu=off -qmp unix:/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=49308M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile=,max_ports=2 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.13.0-alpha0_agent_27b90c2690.img,size=268435456 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock-751925852,guest-cid=751925852,romfile= -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=4,fds=5 -device driver=virtio-net-pci,netdev=network-0,mac=f6:45:6e:18:f8:36,disable-modern=false,mq=on,vectors=4,romfile= -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.60.91-52.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /run/vc/vm/604b217ec06a75823965475124491685b2a5a721ef6da13904188d2204b607bf/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16

Here you can observe the agent.use_vsock flag and the guest-cid value.
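For example, the vsock-related fields can be pulled out of that long command line directly; the grep patterns below simply match the QEMU arguments shown above:

$ ps -ef | grep '[q]emu' | grep -o 'guest-cid=[0-9]*'
$ ps -ef | grep '[q]emu' | grep -o 'agent.use_vsock=[a-z]*'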

Then you can use the brctl command to check whether the bridge on the host side is connected to the pod's veth device.

$ brctl show
bridge name     bridge id               STP enabled     interfaces
cni0            8000.cad82eecfdf8       no              veth3aa3ebe4
                                                        veth546c1a80
                                                        vethfaabff91
                                                        vethfcbd921f

Comparing the output before and after, when we create a pod with Kata Containers the cni0 bridge gains a connection to the veth device vethfaabff91, which matches the description on GitHub.

Next, we can look up the network namespaces and then inspect the network layout and tc rules.

$ sudo ip netns list
cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd (id: 0)
cni-20199cbf-eb8f-063f-686b-6f0d8717754a (id: 1)
cni-eba349cb-932e-804b-81c3-cdf3ee91dbb0 (id: 4)
cni-1114f2bb-8969-6fcc-58a7-d2666bcc904a (id: 3)

This command lists the current network namespaces; as more pods are created, the newer entries correspond to the namespaces of the newer pods.

Next, you can use ip addr inside the namespace to inspect the interfaces and their addresses.

$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if1668: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether f6:45:6e:18:f8:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.39.154/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f445:6eff:fe18:f836/64 scope link
       valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
    link/ether ae:af:03:cd:2f:35 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::acaf:3ff:fecd:2f35/64 scope link
       valid_lft forever preferred_lft forever

You can see the tap device tap0_kata and the veth-side device eth0.

Finally, we can use tc filter show dev <name> ingress|egress to view the tc rules.

$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev tap0_kata ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
  match 00000000/00000000 at 0
        action order 1: mirred (Egress Redirect to device eth0) stolen
        index 2 ref 1 bind 1

This shows that traffic matching the rule on tap0_kata ingress is redirected (mirred) to eth0, and vice versa.

$ sudo ip netns exec cni-78af01c6-ba8e-9795-c89c-7f23439ba7dd tc filter show dev eth0 ingress
filter protocol all pref 49152 u32 chain 0
filter protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
  match 00000000/00000000 at 0
        action order 1: mirred (Egress Redirect to device tap0_kata) stolen
        index 1 ref 1 bind 1

In the opposite direction, traffic is filtered and redirected according to the matching rule in the same way.

Matching namespace, veth and container

If you want to precisely match a container to its network namespace and veth name, you can use the ctr command to look up the relevant values.

First, describe the pod to find the corresponding containerd container ID:

$ kubectl describe pod nginx-kata
Name:         nginx-kata
Namespace:    default
Priority:     0
Node:         wisr7-0205/10.207.190.2
Start Time:   Fri, 10 Sep 2021 15:11:36 +0800
Labels:       run=nginx-kata
Annotations:  <none>
Status:       Running
IP:           10.1.39.40
IPs:
  IP:  10.1.39.40
Containers:
  nginx-kata:
    Container ID:   containerd://fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:853b221d3341add7aaadf5f81dd088ea943ab9c918766e295321294b035f3f3e
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 10 Sep 2021 15:13:26 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7png (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-l7png:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l7png
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

From this we know the containerd container ID is fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05, and we can use it to find the corresponding VM (sandbox) ID.

$ sudo ctr -n=k8s.io c info fd1cf62c3f55d0d3d004c8c44e7916b92b6afe3f84e06ff37efb9ea7638e5d05 | grep sandbox
                "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/hostname",
                "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/resolv.conf",
                "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01/shm",
            "io.kubernetes.cri.sandbox-id": "27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01",
            "io.kubernetes.cri.sandbox-name": "nginx-kata",
            "io.kubernetes.cri.sandbox-namespace": "default"

From the container info we determined that the VM (sandbox) ID is 27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01.

Then we can use the ctr command to look up the network namespace:

$ sudo ctr -n=k8s.io c info 27fc6c87feadff5e0a11b618f5d6a67032568c543fea9c70705028faea275e01 | grep cni
                     "path": "/var/run/netns/cni-42d0400b-c2c9-f365-0497-3f170429a55f"

From this we know that the container's network namespace is cni-42d0400b-c2c9-f365-0497-3f170429a55f, and we can enter this namespace to see the devices inside:

$ sudo ip netns exec cni-42d0400b-c2c9-f365-0497-3f170429a55f ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if1806: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether c2:0e:0b:8d:ff:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.39.40/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c00e:bff:fe8d:ff36/64 scope link
       valid_lft forever preferred_lft forever
4: tap0_kata: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UNKNOWN group default qlen 1000
    link/ether b6:a5:16:3c:0d:bf brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b4a5:16ff:fe3c:dbf/64 scope link
       valid_lft forever preferred_lft forever

The number after eth0@if is the peer interface index of this device, 1806. We can find the matching veth device by searching for that index in the default namespace.

$ ip addr | grep 1806
1806: vethf95422f9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default

From this we know that vethf95422f9@if3 is the corresponding veth device, and we can also confirm its presence with brctl show:

$ brctl show
bridge name     bridge id               STP enabled     interfaces
cni0            8000.cad82eecfdf8       no              veth3aa3ebe4
                                                        veth546c1a80
                                                        vethf95422f9
                                                        vethfcbd921f

In this way, we have successfully matched the container ID to its network settings.
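The whole lookup can also be chained into one short script. This is only a sketch: it assumes the pod is named nginx-kata and that the ctr info output contains the sandbox-id and /var/run/netns/cni-... fields in the form shown above.

POD=nginx-kata
# container ID without the containerd:// prefix
CID=$(kubectl get pod "$POD" -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|containerd://||')
# sandbox (VM) ID from the container info
SANDBOX=$(sudo ctr -n=k8s.io c info "$CID" | grep -m1 'sandbox-id' | grep -o '[0-9a-f]\{64\}')
# network namespace of the sandbox
NETNS=$(sudo ctr -n=k8s.io c info "$SANDBOX" | grep -o 'cni-[0-9a-f-]*' | head -1)
# peer index of eth0 inside the namespace, then the matching veth on the host
IDX=$(sudo ip netns exec "$NETNS" ip addr | grep -o 'eth0@if[0-9]*' | grep -o '[0-9]*$')
ip addr | grep "^$IDX:"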

Confirm virtual machine status

We can use debug mode to connect to the VM and check its status; the setup steps can be found in [enable debug console](https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#enabling-debug-console-for-qemu).

Once the setup is done, we can connect to the VM with the following steps.

$ console="/var/run/vc/vm/${id}/console.sock"
$ sudo socat "stdin,raw,echo=0,escape=0x11" "unix-connect:${console}"

After connecting, basic commands such as ls and cat are not available, so you can only use tab completion to explore process information under /proc.

First, you can confirm that kata-agent really does run inside the VM; its process can be found at /proc/66/comm.

In addition, since cat is unavailable, files can only be printed line by line with the shell's read builtin in a loop.

# cd /proc
# cd 66
# while read line; do echo $line; done <comm
kata-agent

If you want to confirm whether there is nginx, you can also search under /proc.

# cd 122
# while read line; do echo $line; done <comm
nginx

You can also see the VM's kernel boot parameters in /proc/cmdline.

# while read line; do echo $line; done <cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 debug systemd.show_status=true systemd.log_level=debug panic=1 nr_cpus=16 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none agent.debug_console

Comparing this with the settings shown by ps aux | grep qemu on the host side, you can see that many of them are passed into the VM, such as the use_vsock flag and some agent settings.

Finally, if you want to check the network settings, every network interface is listed in /proc/net/dev; we can use this to verify that eth0 exists in the VM.

# cd /proc/net
# while read line; do echo $line; done <dev
Inter-| Receive | Transmit
face|bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dummy0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eth0: 6708 92 0 0 0 0 0 0 1180 16 0 0 0 0 0 0

There is indeed an eth0 interface, which also matches the official description on GitHub.

If you want to know the IP information, you can get it from /proc/net/fib_trie.

# while read line; do echo $line; done <fib_trie
Main:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST
Local:
+-- 0.0.0.0/1 2 0 2
+-- 0.0.0.0/4 2 0 2
|-- 0.0.0.0
/0 universe UNICAST
+-- 10.1.0.0/18 2 0 2
|-- 10.1.0.0
/16 universe UNICAST
+-- 10.1.39.0/24 2 0 1
|-- 10.1.39.0
/32 link BROADCAST
/24 link UNICAST
|-- 10.1.39.160
/32 host LOCAL
|-- 10.1.39.255
/32 link BROADCAST
+-- 127.0.0.0/8 2 0 2
+-- 127.0.0.0/31 1 0 0
|-- 127.0.0.0
/32 link BROADCAST
/8 host LOCAL
|-- 127.0.0.1
/32 host LOCAL
|-- 127.255.255.255
/32 link BROADCAST

Here we use kubectl to view the pod's IP, and then look for the corresponding address above.

$ kubectl get pod -owide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
nginx-kata   1/1     Running   0          5h26m   10.1.39.160   wisr7-0205   <none>           <none>

As shown above, the address 10.1.39.160 is indeed present, and /proc/net/dev shows that only the eth0 interface is handling packet transmission, so we can confirm that, as described on GitHub, traffic reaches the VM through eth0.

In addition, route information can be found in /proc/net/route.

# while read line; do echo $line; done <route
Iface   Destination     Gateway         Flags   RefCnt  Use     Metric  Mask            MTU     Window  IRTT
eth0    00000000        0127010A        0003    0       0       0       00000000        0       0       0
eth0    0000010A        0127010A        0003    0       0       0       0000FFFF        0       0       0
eth0    0027010A        00000000        0001    0       0       0       00FFFFFF        0       0       0

Finally, if you want to know the MAC address, you can check it in /sys/class/net/eth0/address.

# while read line; do echo $line; done <address
8e:9d:f5:8c:8d:cf

Use Nvidia GPU in Kata Containers

According to Using Nvidia GPU device with Kata Containers, we can use an Nvidia GPU in Kata Containers in two modes:

  • Nvidia GPU pass-through mode
  • Nvidia vGPU mode

The comparison between the two is as follows:

| Technology | Description | Behavior | Detail |
| --- | --- | --- | --- |
| Nvidia GPU pass-through mode | GPU passthrough | Physical GPU assigned to a single VM | Direct GPU assignment to VM without limitation |
| Nvidia vGPU mode | GPU sharing | Physical GPU shared by multiple VMs | Mediated passthrough |

Hardware requirements

Nvidia GPUs Recommended for Virtualization:

  • Nvidia Tesla (T4, M10, P6, V100 or newer)
  • Nvidia Quadro RTX 6000/8000

Host BIOS Requirements

Some hardware, such as the Nvidia Tesla P100 and K40m, needs a larger PCI BARs window:

$ lspci -s 04:00.0 -vv | grep Region
      Region 0: Memory at c6000000 (32-bit, non-prefetchable) [size=16M]
      Region 1: Memory at 383800000000 (64-bit, prefetchable) [size=16G] #above 4G
      Region 3: Memory at 383c00000000 (64-bit, prefetchable) [size=32M]

If a larger BARs MMIO mapping is required, MMIO above 4G needs to be enabled in the BIOS PCI configuration.

Different vendors may name this option differently:

  • Above 4G Decoding
  • Memory Hole for PCI MMIO
  • Memory Mapped I/O above 4GB

Host Kernel Requirements

The following flags must be set on the host kernel:

  • CONFIG_VFIO
  • CONFIG_VFIO_IOMMU_TYPE1
  • CONFIG_VFIO_MDEV
  • CONFIG_VFIO_MDEV_DEVICE
  • CONFIG_VFIO_PCI

Also set intel_iommu=on in the kernel cmdline at boot time.
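For example, on an Ubuntu host booted with GRUB this can be done roughly as follows; treat it as a sketch, since the config file location and update command vary by distribution:

# check that the VFIO options are enabled in the running kernel
$ grep 'CONFIG_VFIO' /boot/config-$(uname -r)

# add intel_iommu=on to the kernel command line, then reboot
$ sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="intel_iommu=on /' /etc/default/grub
$ sudo update-grub
$ sudo reboot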

Configure Kata Containers

We also need to set the related options in /etc/kata-containers/configuration.toml. In addition, for non-large-BARs devices Kata 1.3.0 or later is recommended, while for large-BARs devices Kata 1.11.0 or later is required.

Hotplug for PCI devices by shpchp (Linux's SHPC PCI Hotplug driver):

machine_type = "q35"

hotplug_vfio_on_root_bus = false

Hotplug for PCIe devices by pciehp (Linux's PCIe Hotplug driver):

machine_type = "q35"

hotplug_vfio_on_root_bus = true
pcie_root_port = 1

Build Kata Containers kernel with GPU support

Next, we need to build a guest kernel with GPU support and switch the default kernel to it. How to build a guest kernel is explained in detail in [Build Kata Containers Kernel](https://github.com/kata-containers/kata-containers/tree/main/tools/packaging/kernel).

Building the kernel also requires the corresponding environment and tooling, which is likewise covered in the Build Kata Containers Kernel article.

Here, first use go to fetch the repository that contains the kernel build script:

$ go get -d -u github.com/kata-containers/kata-containers
$ cd $GOPATH/src/github.com/kata-containers/kata-containers/tools/packaging/kernel
$ ./build-kernel.sh setup

Then set the following kernel config options:

# Support PCI/PCIe device hotplug (Required for large BARs device)
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_HOTPLUG_PCI_SHPC=y

# Support for loading modules (Required for load Nvidia drivers)
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y

# Enable the MMIO access method for PCIe devices (Required for large BARs device)
CONFIG_PCI_MMCONFIG=y

You also need to disable CONFIG_DRM_NOUVEAU:

# Disable Open Source Nvidia driver nouveau
# It conflicts with Nvidia official driver
CONFIG_DRM_NOUVEAU=n

Then you can build the kernel:

## Build guest kernel with ../../tools/packaging/kernel

# Prepare (download guest kernel source, generate .config)
$ ./build-kernel.sh -v 4.19.86 -g nvidia -f setup

# Build guest kernel
$ ./build-kernel.sh -v 4.19.86 -g nvidia build

# Install guest kernel
$ sudo -E ./build-kernel.sh -v 4.19.86 -g nvidia install
/usr/share/kata-containers/vmlinux-nvidia-gpu.container -> vmlinux-4.19.86-70-nvidia-gpu
/usr/share/kata-containers/vmlinuz-nvidia-gpu.container -> vmlinuz-4.19.86-70-nvidia-gpu

Then generate the kernel-devel rpm package:

$ cd kata-linux-4.19.86-68
$ make rpm-pkg
Output RPMs:
~/rpmbuild/RPMS/x86_64/kernel-devel-4.19.86_nvidia_gpu-1.x86_64.rpm

Finally, set the kernel path in /etc/kata-containers/configuration.toml and you are done.

kernel = "/usr/share/kata-containers/vmlinuz-nvidia-gpu.container"
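You can then confirm that the runtime picks up the new kernel; assuming Kata 1.x, kata-env prints the configured kernel path:

$ kata-runtime kata-env | grep -A1 '\[Kernel\]'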

Nvidia GPU pass-through mode with Kata Containers

First find the Bus-Device-Function (BDF) of the GPU on the host:

$ sudo lspci -nn -D | grep -i nvidia
0000:04:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
0000:84:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)

Then find the IOMMU group of the GPU:

$ BDF="0000:04:00.0"
$ readlink -e /sys/bus/pci/devices/$BDF/iommu_group
/sys/kernel/iommu_groups/45
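All devices in the same IOMMU group have to be passed through together, so it is worth listing the group members first (group 45 comes from the readlink output above):

$ ls /sys/kernel/iommu_groups/45/devices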

Check the IOMMU number under /dev/vfio:

$ ls -l /dev/vfio
total 0
crw------- 1 root root 248, 0 Feb 28 09:57 45
crw------- 1 root root 248, 1 Feb 28 09:57 54
crw-rw-rw- 1 root root  10, 196 Feb 28 09:57 vfio

Then you can create a container to use the GPU:

$ sudo docker run -it --runtime=kata-runtime --cap-add=ALL --device /dev/vfio/45 centos /bin/bash

The example here uses docker to start the container directly; using kubernetes or containerd should be similar.

We can check inside the container that the GPU appears in the PCI device list:

$ lspci -nn -D | grep '10de:15f8'
0000:01:01.0 3D controller [0302]: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] [10de:15f8] (rev a1)

You can also confirm the size of the PCI BARs inside the container:

$ lspci -s 01:01.0 -vv | grep Region
 Region 0: Memory at c0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
 Region 1: Memory at 4400000000 (64-bit, prefetchable) [disabled] [size=16G]
 Region 3: Memory at 4800000000 (64-bit, prefetchable) [disabled] [size=32M]

Nvidia vGPU mode with Kata Containers

Nvidia vGPU is a licensed product on all supported GPU boards. A software license is required to enable all vGPU features within the guest VM.

The point of vGPU mode is that multiple VMs can use the host's GPU at the same time; the driver installation is done on the host, and a license is required to use these features inside the guest VM.

First download the official Nvidia driver from Nvidia, such as NVIDIA-Linux-x86_64-418.87.01.run.

Then install the kernel-devel rpm that was built earlier.

$ sudo rpm -ivh kernel-devel-4.19.86_gpu-1.x86_64.rpm

Then you can follow the official steps to extract, compile and install the Nvidia driver:

## Extract
$ sh ./NVIDIA-Linux-x86_64-418.87.01.run -x

## Compile and install (It will take some time)
$ cd NVIDIA-Linux-x86_64-418.87.01
$ sudo ./nvidia-installer -a -q --ui=none \
 --no-cc-version-check \
 --no-opengl-files --no-install-libglvnd \
 --kernel-source-path=/usr/src/kernels/`uname -r`

or

$ sudo sh ./NVIDIA-Linux-x86_64-418.87.01.run -a -q --ui=none \
 --no-cc-version-check \
 --no-opengl-files --no-install-libglvnd \
 --kernel-source-path=/usr/src/kernels/`uname -r`

View installer logs:

$ tail -f /var/log/nvidia-installer.log

Then load the Nvidia driver module:

# Optional (generate modules.dep and map files for Nvidia driver)
$ sudo depmod

# Load module
$ sudo modprobe nvidia-drm

# Check module
$ lsmod | grep nvidia
nvidia_drm             45056  0
nvidia_modeset       1093632  1 nvidia_drm
nvidia              18202624  1 nvidia_modeset
drm_kms_helper        159744  1 nvidia_drm
drm                   364544  3 nvidia_drm,drm_kms_helper
i2c_core               65536  3 nvidia,drm_kms_helper,drm
ipmi_msghandler        49152  1 nvidia

Finally, you can view the status:

$ nvidia-smi
Tue Mar  3 00:03:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:01:01.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Once installed, this GPU can be used by different VMs.