# Building Talos Linux using Controller Resource Pattern

- [Introduction](#introduction)
- [What are controllers?](#what-are-controllers)
- [COSI Deep Dive](#cosi-deep-dive)
- [COSI in Talos Linux](#cosi-in-talos-linux)
- [Building Solutions on top of COSI](#building-solutions-on-top-of-cosi)
- [What's next?](#whats-next)

---

## Introduction

The traditional operating system design is based on the concept of sequential phases: mount a filesystem, start a service, wait for the service to be ready, and move on to the next task. This is how Talos worked as well before the controller-resource pattern was introduced two years ago.

Talos Linux machine state is fully defined by the machine configuration applied to the node. If a change to the machine configuration requires a reboot, making any change becomes extremely painful. Even though the machine configuration is not expected to be updated frequently, there are cases when immediate feedback on changes is important:

- debugging a problem
- rolling out an update (upgrades, applying cluster-wide configuration changes)

### Reactive Operating System

The operating system has to be [reactive](https://en.wikipedia.org/wiki/Reactive_programming), as it needs to act on changes in the environment both outside and inside the machine. The operating system environment can be described as a stream of events.

Some events need a simple action: e.g. if the power button is pressed, an ACPI shutdown event is sent, and the operating system needs to react by shutting down the machine.

Other events might trigger a more complex cascading sequence of actions. For example, we can imagine the sequence of events triggered by a hostname change:

- DHCP server assigns a new hostname to the node
- The operating system records the new hostname set by the DHCP server
- The operating system merges the new hostname with the existing configuration
- The operating system applies the change down to the Linux kernel
- The operating system re-issues the certificates for the services running on the node with the new SANs
- The operating system restarts the services which do not handle certificate renewal
- The operating system updates the Kubernetes nodename based on the new hostname
- If the nodename changes, the operating system forces `kubelet` to re-bootstrap itself with the new nodename

```mermaid
graph TD
    A[DHCP server] -->|new hostname|B[OS]
    B --> C[Merge hostname configuration]
    C --> CC{Is there a change?}
    CC -->|Yes|D[OS applies the change down to the Linux kernel]
    CC -->|No|DD[Stop]
    D --> DDD[Rebuild a list of cert SANs]
    DDD --> DDDD{Is there a change?}
    DDDD -->|Yes|E[OS re-issues the certificates]
    DDDD -->|No|EE[Stop]
    E --> F[Restart the services]
    DDD --> G[Re-calculate Kubernetes nodename]
    G --> GG{Is there a change?}
    GG -->|Yes|H[OS updates Kubernetes nodename]
    GG -->|No|HH[Stop]
    H --> I[Re-bootstrap the kubelet]
```

The sequence might abort at any step if there is no actual change: e.g. if the hostname is overridden in the machine configuration, it takes precedence over the DHCP-assigned one, and change propagation stops.

On the other hand, many different events might trigger the same action: if a new IP address is assigned to the node, the operating system needs to re-issue the certificates for the services, the same way as in the previous example. The same applies when certificates are rotated based on their expiration date.
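The "stop if nothing changed" rule above boils down to an equality check before the new value is published downstream. Here is a minimal, hypothetical Go sketch of that idea (the `HostnameSpec` type and `publish` callback are illustrative, not Talos code):

```go
package main

import "fmt"

// HostnameSpec is a hypothetical value describing the effective hostname.
type HostnameSpec struct {
	Hostname string
	FQDN     string
}

// propagate publishes the updated spec downstream only if it differs from the
// current one; a no-op update stops the cascade right here.
func propagate(current, updated HostnameSpec, publish func(HostnameSpec)) {
	if current == updated {
		return // nothing changed: certificates, nodename and kubelet are left alone
	}

	publish(updated)
}

func main() {
	current := HostnameSpec{Hostname: "talos-default-worker-1"}

	// A DHCP lease renewal carrying the same hostname triggers no further work.
	propagate(current, current, func(HostnameSpec) { fmt.Println("never printed") })

	// An actual hostname change kicks off the downstream sequence.
	propagate(current, HostnameSpec{Hostname: "new-hostname"}, func(s HostnameSpec) {
		fmt.Println("re-issue certificates, update nodename for", s.Hostname)
	})
}
```

The resource-based version of this check is what the rest of the article builds on: a no-op update to a resource stops the cascade.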
### Implementing Change Propagation

It is natural to use some form of events to communicate changes from one component to another. As there are multiple components in the operating system, the design might include an event bus to communicate events between them. Each event should carry the updated state so that downstream components can pick up the change.

This idea of an event bus sounds very similar to the controller-resource pattern, so why not use that instead?

---

## What are controllers?

- [Kubernetes Controllers](https://kubernetes.io/docs/concepts/architecture/controller/)

A controller is a control loop that watches some input resources, performs some actions, and updates the output resources.

```mermaid
graph TD
    I1[Input1] --> C[/Controller/]
    I2[Input2] --> C
    C --> O1[Output1]
    C --> O2[Output2]
```

A resource has metadata and a spec. The metadata is fully structured, while the spec is a free-form blob of data that supports equality checks. A resource change triggers the controller loop to run, while a no-op change to a resource stops the change propagation.

The meaningful work of the controller is usually a side effect of the control loop, e.g.:

- creating files on the disk
- performing Linux syscalls to set up network interfaces
- interacting with other components via APIs, e.g. CRI
- others...

A controller might not have any resources as inputs; instead, it might watch the state of other components (e.g. the Linux kernel) and reflect that state in its output resources.

### Why COSI?

- [COSI](https://github.com/cosi-project/runtime)

Talos Linux is a minimal operating system, and it needs to run controllers with minimal resource consumption right from the moment of the initial boot, so embedding the Kubernetes API server was not an option. The KCP project was supposed to offer a minimal Kubernetes API server, but it was not ready at the time of development and seems to have changed its direction since then.

- [KCP project](https://github.com/kcp-dev/kcp)

As we are not using the Kubernetes API directly, we could also fix some issues of the Kubernetes model as we built our own implementation:

- each resource has a _single owner_ (controller), so the flow of changes is clear
- controllers declare their _inputs_ and _outputs_ (a controller contract), and every controller access to the resources is verified against that list
- lightweight resource implementation (efficient _in-memory_ storage)

Anyone who has tried to debug [Cluster API](https://cluster-api.sigs.k8s.io/) controllers knows how hard it is to understand the flow of changes, as multiple controllers modify the same resource or wait for some specific change in it, which makes debugging extremely hard. This problem is completely avoided in Talos, as controller contracts make resource ownership strictly defined.

### Resources as Introspection Method

As the operating system becomes more complex, it is important to have a way to introspect the state of the system. One can observe the final output of the flow, but if there is a problem in the middle of the flow, it is hard to understand what exactly went wrong. As resources are the glue points between the controllers in the flow, it is possible to narrow down the problem by inspecting the resource state.
```mermaid
graph LR
    R1 --> C1[/Controller1/]
    C1 --> R2
    R2 --> C2[/Controller2/]
    C2 --> R3
```

The resource API provides a unified interface to query the state of the system, to wait for a resource to reach the desired state, etc.

```text
$ talosctl get members
VERSION   HOSTNAME                       MACHINE TYPE   OS               ADDRESSES
2         talos-default-controlplane-1   controlplane   Talos (v1.4.0)   ["172.20.0.2"]
1         talos-default-worker-1         worker         Talos (v1.4.0)   ["172.20.0.3"]
```

And the same with full details:

```yaml
# talosctl -n 172.20.0.2 get members talos-default-worker-1 -o yaml
node: 172.20.0.2
metadata:
    namespace: cluster
    type: Members.cluster.talos.dev
    id: talos-default-worker-1
    version: 1
    owner: cluster.MemberController
    phase: running
    created: 2023-03-16T13:50:08Z
    updated: 2023-03-16T13:50:08Z
spec:
    nodeId: o5lPFDGG7eluDADvH0u0BPfR6MvHIBzuBqlU22gvLXp
    addresses:
        - 172.20.0.3
    hostname: talos-default-worker-1
    machineType: worker
    operatingSystem: Talos (v1.4.0)
```

### Controller Loops

Each controller loop runs in its own thread of execution (a goroutine, as Talos is implemented in Go), so it runs independently of other controllers. This allows controllers to run in parallel if there are no dependencies between them, which in turn allows faster boots and faster reaction to changes.

At the same time, as controller loops are only triggered on changes, they do not poll the state of the system; controller goroutines are blocked most of the time, so resource consumption is minimal.

Each controller loop is independent, so in case of a crash or fatal error, only a single controller is affected, and the controller is automatically restarted by the runtime with exponential backoff. The automatic restart on failure is a powerful mechanism to implement retries on transient errors.

### Testability

As the controller loops are independent, it is possible to test them in isolation, without the need to run the whole system. Controllers have well-defined inputs and outputs, so by creating a set of inputs, we can build assertions on the set of outputs.

Nothing beats integration tests, but having a way to test each controller in isolation provides a way to cover edge cases or inputs which are hard to reproduce in integration tests.

---

## COSI Deep Dive

GitHub links:

- [COSI Runtime](https://github.com/cosi-project/runtime)
- [Protobuf API specs](https://github.com/cosi-project/specification)

COSI (Common Operating System Interface) is a separate project which can be used outside of Talos Linux. For example, Sidero Labs Omni is completely based on COSI.

COSI consists of two layers:

- Resource storage and API
- Controller runtime

### Resource Storage and API

The resource API is very similar to the Kubernetes API, and it provides the equivalent basic resource operations:

- getting, listing and watching resources
- creating, updating and destroying resources

There are two implementations of resource storage:

- in-memory storage, with optional persistence to disk via BoltDB
- `etcd`-backed storage

At the moment, Talos Linux uses only in-memory storage without any persistence, which means all resources are re-created on every boot. Each resource namespace can be backed by its own resource storage, so there could be a mix of ephemeral in-memory resources and cluster-wide persistent `etcd`-backed resources.

The resource API is available both in-process (providing the best performance) and over the network via gRPC.
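Regardless of whether the API is consumed in-process or over gRPC, the shape of the operations is the same. Below is a rough conceptual model of that operation set in Go; this is not the actual COSI Go API (see the [COSI Runtime](https://github.com/cosi-project/runtime) repository for that), and all names and signatures are purely illustrative:

```go
package cosisketch

import "context"

// Metadata is the structured part of a resource: namespace, type, ID and version.
type Metadata struct {
	Namespace string
	Type      string
	ID        string
	Version   int
}

// Resource pairs metadata with a free-form spec that supports equality checks.
type Resource interface {
	Metadata() Metadata
	Spec() any
}

// Event describes a single change delivered to a watcher.
type Event struct {
	Resource Resource
	Deleted  bool
}

// State models the basic operations the resource API offers; the real COSI API
// differs in details but provides the equivalent set of operations.
type State interface {
	// Reading.
	Get(ctx context.Context, md Metadata) (Resource, error)
	List(ctx context.Context, namespace, resourceType string) ([]Resource, error)
	Watch(ctx context.Context, md Metadata, events chan<- Event) error

	// Writing.
	Create(ctx context.Context, r Resource) error
	Update(ctx context.Context, r Resource) error
	Destroy(ctx context.Context, md Metadata) error
}
```

The real API layers versioning, ownership tracking and richer watch semantics on top of this basic set, as can be seen in the resource metadata output above.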
Talos Linux uses the gRPC API to expose resources to the user, and even internally to provide resource APIs to the daemons running in the system.

### Controller Runtime

The controller runtime is built on top of the resource API to run controllers. It takes care of running controller loops, validating access to the resources, handling restarts on failures, etc.

The controller runtime is also available either in-process or over gRPC APIs, so it can be used to run controllers in different processes, or even on multiple machines.

### Controller Examples

- [COSI Mock Controller](https://github.com/cosi-project/runtime/blob/2731ce3fd28dce4e45a83d5fab41e42946b44127/pkg/controller/conformance/controllers.go#L230-L305)
- [Talos Network Status Controller](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/network/status.go#L67-L143)

---

## COSI in Talos Linux

The very first set of COSI controllers was introduced in [Talos Linux v0.9.0](https://github.com/siderolabs/talos/releases/tag/v0.9.0) two years ago. Talos v0.9.0 migrated away from self-hosted Kubernetes (`bootkube`-based) to a Talos-managed control plane running as static pods.

Since that time, more and more components have been reimplemented on top of COSI, and Talos now features **88** different resource types:

```text
$ talosctl -n 172.20.0.2 get resourcedefinition | wc -l
89
```

> One line in the output is the table header.

### Controller Dependencies

For Talos v0.9.0, the tree of dependencies between controllers and resources looked like this:

![controller dependencies for Talos v0.9.0](https://www.talos.dev/images/controller-dependencies-v2.png)

> As COSI requires controller inputs and outputs to be explicitly defined, it's very easy to render this dependency tree.

At the time of writing, the dependency diagram for the development version of Talos looks much more intertwined:

![controller dependencies for the development version](https://i.imgur.com/OqNGeEu.jpg)

- [SVG version](https://gist.githubusercontent.com/smira/ff7b193ae44a72bd4241868f5832704e/raw/c995459fca102f04072a7894e027b54346296465/deps.svg)

### Dependency Tree

#### Machine Configuration Dissection

Talos Linux is fully defined by [the machine configuration](https://www.talos.dev/v1.3/reference/configuration/), so it's not a big surprise that the machine configuration resource appears at the top of the dependency tree. Many smaller controllers translate parts of the machine configuration into small resources which contain only the information relevant to that subsystem.
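Conceptually, such a "dissection" controller watches the machine configuration and rewrites its output only when the fields it cares about change. The sketch below is deliberately simplified and does not use the real COSI controller interfaces (see the controller examples linked above for actual code); all names in it are hypothetical:

```go
package sketch

import "context"

// MachineConfig stands in for the full Talos machine configuration; only the
// discovery-related fields matter for this controller.
type MachineConfig struct {
	ClusterDiscoveryEnabled bool
	DiscoveryEndpoint       string
	// ... many unrelated fields omitted ...
}

// DiscoveryConfig is a hypothetical small resource derived from the machine
// configuration: just the fields the cluster discovery subsystem cares about.
type DiscoveryConfig struct {
	Enabled  bool
	Endpoint string
}

// MachineConfigWatcher and DiscoveryConfigWriter stand in for the controller
// runtime: the first delivers machine configuration updates, the second
// persists the derived resource (and only notifies watchers when it actually
// changed).
type MachineConfigWatcher interface {
	Changes(ctx context.Context) <-chan MachineConfig
}

type DiscoveryConfigWriter interface {
	Modify(ctx context.Context, cfg DiscoveryConfig) error
}

// RunDiscoveryConfigController extracts the discovery-related part of the
// machine configuration into DiscoveryConfig. A change to an unrelated part of
// the machine configuration produces an identical DiscoveryConfig, so the
// writer sees a no-op update and change propagation stops there.
func RunDiscoveryConfigController(ctx context.Context, in MachineConfigWatcher, out DiscoveryConfigWriter) error {
	updates := in.Changes(ctx)

	for {
		select {
		case <-ctx.Done():
			return nil
		case cfg := <-updates:
			derived := DiscoveryConfig{
				Enabled:  cfg.ClusterDiscoveryEnabled,
				Endpoint: cfg.DiscoveryEndpoint,
			}

			if err := out.Modify(ctx, derived); err != nil {
				return err
			}
		}
	}
}
```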
Examples from the Talos source:

- [cluster discovery configuration](https://github.com/siderolabs/talos/blob/64e3d24c6bfe60b5556c41822c8e81f63d0a06d2/internal/app/machined/pkg/controllers/cluster/config.go#L57-L151)
- [`seccomp` profiles](https://github.com/siderolabs/talos/blob/64e3d24c6bfe60b5556c41822c8e81f63d0a06d2/internal/app/machined/pkg/controllers/cri/seccomp_profile.go#L55-L105)

There are two ideas behind these types of controllers:

- decouple the subsystems from the Talos Linux machine configuration, so that they can be extracted and used as independent modules
- stop the change propagation, so that a change in the machine configuration which doesn't affect the subsystem doesn't trigger updates in other controllers

#### Watching External Sources

Several controllers don't have resource inputs; instead, they watch the state of the Linux kernel, interact with Kubernetes APIs, etc.:

- [`AddressStatus` controller](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/network/address_status.go#L49-L149)
- [`StaticPodStatus` controller](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/k8s/kubelet_static_pod.go#L73-L166)

#### Resource Transformation

Some controllers in the middle of the dependency chain transform input resources into output resources without any side effects:

- [Converting `Affiliates` to `Members`](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/cluster/member.go#L48-L111)
- [Generating final `kubelet` configuration](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/k8s/kubelet_spec.go#L76-L206)

#### Controllers with Side Effects

Controllers closer to the leaves of the dependency tree interact with other components to do the "real" work:

- [Managing `kubelet` service lifecycle](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/k8s/kubelet_service.go#L70-L212)
- [Assigning `AddressSpecs` to the network links](https://github.com/siderolabs/talos/blob/cf101e56fbf18bb401bebb95e9fe005f65765d3d/internal/app/machined/pkg/controllers/network/address_spec.go#L52-L119)

---

## Building Solutions on top of COSI

COSI provides an API to interact with resources, add new resources, build extension controllers, etc. Most of these extension points are not available in Talos Linux yet, but it's already possible to get the state of any resource.

### Interacting with Talos Resource API

Talos exposes the COSI API directly, protected by the common authentication and authorization layer, so with a `talosconfig` one can query any resource:

- [Talos API Resource Example](https://gist.github.com/smira/eac1b6d2e45ac4c49504ce06f88cb873)

### Talos Cloud Controller Manager

Talos CCM builds on top of two Talos features:

- Talos Linux provides cloud metadata via [the `PlatformMetadata` resource](https://gist.github.com/smira/9684ffef0967c17742772421317d348b)
- [Talos API access from Kubernetes](https://www.talos.dev/v1.3/advanced/talos-api-access-from-k8s/)

Talos CCM combines these two features to provide a Kubernetes deployment which delivers CCM functionality in hybrid clusters:

- [Talos CCM source code](https://github.com/siderolabs/talos-cloud-controller-manager/blob/86818165f5ad6eb26d1abda22a914672e526e4bf/pkg/talos/client.go#L23-L78)
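To make the "query any resource" workflow above concrete, here is a hedged sketch of reading the `Members` resource (shown earlier) through the Talos API with the Go machinery client. The package paths and client options are quoted from memory and may differ between Talos versions; treat this as a sketch rather than reference code:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/cosi-project/runtime/pkg/resource"
	"github.com/siderolabs/talos/pkg/machinery/client"
)

func main() {
	ctx := context.Background()

	// Authentication and authorization come from the local talosconfig.
	c, err := client.New(ctx,
		client.WithDefaultConfig(),
		client.WithEndpoints("172.20.0.2"),
	)
	if err != nil {
		log.Fatal(err)
	}

	defer c.Close() //nolint:errcheck

	// c.COSI speaks the same state API used inside Talos, so Get/List/Watch
	// behave the same way as they do in-process.
	md := resource.NewMetadata("cluster", "Members.cluster.talos.dev",
		"talos-default-worker-1", resource.VersionUndefined)

	member, err := c.COSI.Get(ctx, md)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(member.Metadata().ID(), member.Spec())
}
```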
---

## What's Next?

### COSI Controller Library

Many COSI controllers and resources are not specific to Talos Linux, so they could be extracted as libraries to be used and developed independently of the project:

- network subsystem
- time synchronization (SNTP)
- running Kubernetes
- ...

The resources form an API layer for controller interaction, so other controllers can hook into the resources created or managed by the library, providing further integration. E.g. a BGP controller could connect to the network and Kubernetes subsystems to provide BGP-based control plane Virtual IPs.

### Distributed Resources

At the moment, all resources in Talos Linux are local to the node and ephemeral. But some resources need to be synchronized at least across the cluster control plane; such resources could be stored in `etcd`. These cluster-wide resources will still be available using the same API, but they will always be in sync across cluster nodes.

### Talos COSI Extensions

What if a COSI controller could run in-cluster in `Kubernetes`? What if a controller could run outside of the cluster, in a shared management cluster?

As COSI is API-based, there are endless possibilities for inspecting each machine's state, injecting additional resources, and affecting machine operations.

One could think of a security controller running outside of the cluster, or the `Kubernetes` subsystem in Talos replaced with a `Nomad` provisioner.