# conmon 3.0 proposal: conmon-rs

## Problem

It turns out the Go `os/exec` implementation uses a decent amount of memory, which matters especially for single node OpenShift. Right now, CRI-O is responsible for a lot of exec'ing: it execs a process for essentially every runtime operation. This is charged against CRI-O in memory usage, and is the second leading cause of high CRI-O RSS (the first being image pulling).

## Possible solutions

- move to containerd shimv2
    - still uses golang
- update conmon to be able to spawn exec processes
    - writing such an application in C could get clunky, as conmon would essentially become a server.
    - the conmon code is pretty fragile, and changing it for this could disturb podman's needs
- write a whole new conmon
    - :)

### Introducing podmon

Note: the name is a work in progress. It would be a nightmare next to the comparatively more popular podman.

podmon (pod monitor) would be a single process that watches over the behavior of a whole pod. It would be created as part of a RunPodSandbox request, and would listen on a unix socket for protobuf requests. The first requests it would listen for are described below:

#### CreateContainer

This would fulfill the previous behavior of a conmon call (without `--exec` or `--restore` arguments). It would receive a request like so:

```proto
message CreateContainerRequest {
    string id = 1;
    string name = 2;
    string bundle_path = 3;
    string pid_file = 4;
    bool terminal = 5;
    bool stdin = 6;
    bool close_stdin = 7;
    repeated string log_paths = 8;
    repeated string exit_paths = 9; // where to write the exit files (persist-dir)
    repeated string runtime_args = 10; // this is where one would pass --systemd-cgroup
}
```

and similarly configure the output buffers/console socket/logging drivers/OOM watchers.
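
To make the role of `runtime_args` concrete, here is a minimal Rust sketch of how podmon might turn a `CreateContainerRequest` into a runtime command line. It assumes an OCI runtime with a runc-compatible CLI (`create --bundle ... --pid-file ... <id>`, plus `--console-socket` when a terminal is requested). The struct is a hand-written mirror of a subset of the message, not generated code, and `create_argv` and its `console_socket` parameter are hypothetical names used only for illustration.

```rust
/// Hand-written mirror of the CreateContainerRequest fields that affect the
/// runtime command line (assumption: the other fields only configure podmon).
#[derive(Debug, Default, Clone)]
pub struct CreateContainerRequest {
    pub id: String,
    pub bundle_path: String,
    pub pid_file: String,
    pub terminal: bool,
    pub runtime_args: Vec<String>,
}

/// Build the arguments for `<runtime> [runtime_args] create ... <id>`.
/// `console_socket` is a hypothetical path that podmon would listen on
/// when `terminal` is set; it is ignored otherwise.
pub fn create_argv(req: &CreateContainerRequest, console_socket: Option<&str>) -> Vec<String> {
    // Global runtime options (e.g. --systemd-cgroup) go before the subcommand.
    let mut argv: Vec<String> = req.runtime_args.clone();
    argv.push("create".to_string());
    argv.extend(["--bundle".to_string(), req.bundle_path.clone()]);
    argv.extend(["--pid-file".to_string(), req.pid_file.clone()]);
    if req.terminal {
        if let Some(sock) = console_socket {
            argv.extend(["--console-socket".to_string(), sock.to_string()]);
        }
    }
    // The container ID is the final positional argument.
    argv.push(req.id.clone());
    argv
}
```

Keeping the argument construction in a pure function like this would make the "runtime args processing" step easy to unit test without spawning a runtime.
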
Instead of communicating through the sync pipe as conmon did, podmon would return the following structure:

```proto
message CreateContainerResponse {
    uint32 container_pid = 1;
    string output = 2;
    string error = 3;
}
```

Similarly, it wouldn't need the start pipe either, as podmon would be moved to the pod's cgroup (or a dedicated slice) on pod creation.

#### ExecSyncRequest

ExecSync requests replace the *legacy* exec behavior, `conmon -e`. The motivation here is that exec sync is the kind of exec CRI-O runs most often (to satisfy the comparatively rapid exec probes). A normal Exec (one where one can open a terminal/pass stdin in) is out of scope for the MVP. The request will look as follows:

```proto
message ExecContainerRequest {
    string id = 1;
    string bundle_path = 2;
    string pid_file = 3;
    bool terminal = 4;
    repeated string runtime_args = 5;
}
```

And the stdout/stderr will be sent back directly, rather than through the sync pipe or a log file:

```proto
message ExecContainerResponse {
    string stdout = 1;
    string stderr = 2;
    string error = 3;
    // is there an error proto type?
    // yes, we can use the usual grpc return errors
}
```

### Advantages of this approach

The most obvious advantage is that only a single process is now exec'd, and the vast majority of the work it does happens after it double forks, similar to how conmon worked before. However, this approach doesn't suffer from the problem conmon execs had before: there's only one process for systemd to watch per pod, as opposed to an unbounded number of exec probe conmons to clean up.

Another advantage is that a clean restart allows us to rethink what this runtime shim process is. conmon has existed for a long time, but has been notoriously difficult to maintain/extend. It has served well, but perhaps it is time for something new.

## Podmon holding the network namespace

Today, the kubelet invokes the http probes and they aren't scoped to the pod's network.
One proposal to solve this is to have the container runtime invoke the probes inside the network namespace of the pod. This can be simplified if podmon already joins the network namespace.

## Podmon as pause container

For the pod level pid namespace case, we could use the new conmon as the holder of the pid namespace.

## podmon as pinns?

If we're already having podmon hold the net and pid namespaces, then we could also have podmon do the namespace unsharing for the pod, removing another exec.

## libraries

grpc: https://github.com/hyperium/tonic

We should consider communication protocols which are smaller and faster than grpc, like cap'n proto.

https://github.com/containers/conmon/commit/184682ba89759a91f0bc90c0ed1fdc279a6a0afb

https://docs.rs/tokio/1.4.0/tokio/process/struct.Command.html#method.pre_exec

# client work:

- move conmon-server to correct cgroup

# implement container creation

- container logging
    - tty
    - no tty
- runtime args processing
- runtime command spawn
- catch container exit
- catch container oom
- container timeout
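
As a rough illustration of the "runtime command spawn", "catch container exit", and "container timeout" items, and of returning stdout/stderr directly as `ExecContainerResponse` proposes, here is a minimal Rust sketch that uses only the standard library. The names `ExecOutcome`, `drain`, and `exec_sync` are hypothetical, and the 10 ms poll interval is an arbitrary choice. A real implementation would more likely build on the tokio `Command` API linked above, and would invoke the OCI runtime rather than an arbitrary program.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical result type, loosely mirroring ExecContainerResponse.
#[derive(Debug)]
pub struct ExecOutcome {
    pub stdout: String,
    pub stderr: String,
    /// None when the process was terminated by a signal (including our kill).
    pub exit_code: Option<i32>,
    pub timed_out: bool,
}

/// Read a pipe to the end on its own thread, so a chatty child cannot
/// block on a full pipe while we are waiting for it to exit.
fn drain<R: Read + Send + 'static>(mut pipe: R) -> thread::JoinHandle<String> {
    thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = pipe.read_to_end(&mut buf);
        String::from_utf8_lossy(&buf).into_owned()
    })
}

/// Spawn `program`, collect its output, and kill it if it outlives `timeout`.
pub fn exec_sync(program: &str, args: &[&str], timeout: Duration) -> io::Result<ExecOutcome> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let out = drain(child.stdout.take().expect("stdout was piped"));
    let err = drain(child.stderr.take().expect("stderr was piped"));

    let deadline = Instant::now() + timeout;
    let mut timed_out = false;
    let status = loop {
        // try_wait returns Ok(None) while the child is still running.
        if let Some(status) = child.try_wait()? {
            break status;
        }
        if Instant::now() >= deadline {
            timed_out = true;
            child.kill()?;
            break child.wait()?; // reap the child so it does not linger as a zombie
        }
        thread::sleep(Duration::from_millis(10));
    };

    Ok(ExecOutcome {
        stdout: out.join().unwrap_or_default(),
        stderr: err.join().unwrap_or_default(),
        exit_code: status.code(),
        timed_out,
    })
}
```

One caveat of this sketch: killing the direct child does not kill any grandchildren, which can keep the pipes open and delay the return. Handling that (for example via process groups or the runtime's own kill command) is left out here.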