---
title: Kubeadm operator
authors:
- "@fabriziopandini"
owning-sig: sig-cluster-lifecycle
participating-sigs:
- sig-cluster-lifecycle
reviewers:
- "@neolit123"
- "@rosti"
- "@ereslibre"
- "@detiber"
- "@vincepri"
- "@yastij"
- "@chuckha"
approvers:
- "@timothysc"
- "@luxas"
editor: "@fabriziopandini"
creation-date: 2019-09-16
last-updated: 2019-09-16
status: implementable
---
# Kubeadm operator
> NB. This document was moved to https://github.com/kubernetes/enhancements/pull/1239
## Table of Contents
<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Story 3](#story-3)
- [Story 4](#story-4)
- [Story 5](#story-5)
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Action](#action)
- [Action Controller and TaskGroup](#action-controller-and-taskgroup)
- [TaskGroupController and Task](#taskgroupcontroller-and-task)
- [TaskController](#taskcontroller)
- [Observability and Operability](#observability-and-operability)
- [Execution order](#execution-order)
- [Execution modes](#execution-modes)
- [Log management](#log-management)
- [Error management](#error-management)
- [Extensibility](#extensibility)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed](#infrastructure-needed)
<!-- /toc -->
## Release Signoff Checklist
- [x] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [x] KEP approvers have set the KEP status to `implementable`
- [x] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
## Summary
The kubeadm operator aims to enable declarative control of kubeadm workflows, automating the execution and orchestration of such tasks across the existing nodes in a cluster.
## Motivation
The kubeadm binary can execute actions only on the machine where it is running; for example, it is not possible to execute actions on other nodes or to copy files across nodes.
As a consequence, most kubeadm workflows, like [kubeadm upgrade](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-upgrade/), consist of a complex sequence of tasks that must be manually executed and orchestrated across all the existing nodes in the cluster.
Such a user experience is not ideal, due to the error-prone nature of humans running commands. The manual approach can also be considered a blocker for implementing more complex workflows, such as rotating certificate authorities, modifying the settings of an existing cluster, or any task that requires coordination of more than one Kubernetes node.
This KEP aims to address such problems by applying the operator pattern to kubeadm workflows.
### Goals
- To allow declarative control of kubeadm actions that lead to "in place" mutations[1] of kubeadm-generated artifacts. More specifically, kubeadm artifacts are static Pod manifests, certificates, kubeconfig files, bootstrap tokens, and kubeadm-generated ConfigMaps and Secrets. This proposal initially includes the following kubeadm workflows:
- kubeadm upgrade
- certificate renewal
- certificate authority rotation (NEW)
- change configuration in an existing cluster (NEW)
> [1] Please note that we refer to "in place" mutations of kubeadm-generated artifacts in order to highlight the difference between the kubeadm operator and other SIG Cluster Lifecycle projects like [Cluster API](https://cluster-api.sigs.k8s.io/), which instead assume nodes and underlying machines are immutable.
Considering the complexity of this topic, this document is expected to go through several iterations. The goals of the current iteration are to:
- Get initial approval of the Summary and Motivation paragraphs.
- Define a semantic for the "Actions" to be performed by the kubeadm-operator.
- Define how the kubeadm-operator should manage kubeadm workflows.
- Define how users should interact with the kubeadm-operator, including observability and error handling.
- Define how the kubeadm-operator should be deployed in a Kubernetes cluster.
### Non-Goals
- To provide or manage any infrastructure elements such as underlying machines, load balancers, storage, etc.
- To manage and automate the kubeadm init and join workflows.
- To manage any setting, configuration file, or artifact *not* generated by kubeadm. The only exceptions are the kubelet, kubeadm, and kubectl binaries, which are considered in scope as required by the upgrade workflow.
- To replace kubeadm "raw" workflows. The user will always be able to run kubeadm workflows in isolation in a manual fashion.
## Proposal
### User Stories
#### Story 1
As a Kubernetes operator, I would like to be able to declaratively control upgrades in a systematic fashion.
#### Story 2
As a Kubernetes operator, I would like to be able to declaratively control certificate renewal in a systematic fashion.
#### Story 3
As a Kubernetes operator, I would like to be able to declaratively control changes of the current cluster's settings in a systematic fashion.
#### Story 4
As a Kubernetes operator, I would like to be able to declaratively rotate my certificate authorities.
#### Story 5
As a Kubernetes operator, I would like to control whether nodes are cordoned and drained for tasks or if they are performed without disruption to the workloads on the node, referred to as a "hot" update.
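Such a choice could eventually surface as an option on the object the user submits to the operator; a purely hypothetical sketch (the `nodeUpdateStrategy` field is illustrative and not part of the proposed API):

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
metadata:
  name: upgrade-hot
spec:
  upgrade:
    targetVersion: v1.16.0
  # Hypothetical field: whether to cordon/drain each node before
  # executing tasks on it, or perform a "hot" update in place.
  nodeUpdateStrategy: Hot   # alternative: CordonAndDrain
```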
### Implementation Details/Notes/Constraints
The initial proposal for implementing the kubeadm operator can be summarized by the following sequence diagram; the goal of the current iteration is to validate and improve the proposed approach.

#### Action
The first step of the sequence above is the user asking the kubeadm operator to perform an `Action`. `Action` is a new CRD that defines the actions performed by the kubeadm operator in a high-level, end-user-oriented format, e.g.:
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
metadata:
  name: action-sample
spec:
  upgrade:
    targetVersion: v1.16.0
```
or
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
metadata:
  name: action-sample
spec:
  renewCertificates: {}
```
#### Action Controller and TaskGroup
The `Action controller` is the component of the kubeadm operator that manages the `Action` CRD and implements all the domain-specific knowledge about kubeadm workflows; e.g. it knows that the certificate-renewal workflow requires running `kubeadm alpha certs renew` on all the control plane nodes in the cluster.
As a consequence, the `Action controller` generates one or more `TaskGroup` objects, each describing a `TaskTemplate` object (e.g. run `kubeadm alpha certs renew`) targeting the set of nodes matching a given `NodeSelector` (e.g. all nodes with the `node-role.kubernetes.io/master` label).
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: TaskGroup
metadata:
  name: taskgroup-sample2
  ownerReferences:
  ...
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  selector:
    matchLabels:
      app: a
  template:
    metadata:
      labels:
        app: a
    spec:
      alphaCertsRenew: {}
```
It is important to note that the `Action controller` will always generate `TaskGroup` objects in sequential order: a new `TaskGroup` is generated only after the current one completes successfully.
> During the implementation of the first proof of concept we might consider extending the `nodeSelector` semantic in order to express criteria like e.g. "the first control plane node".
> During the implementation of the first proof of concept we might consider extending the `Task` template semantic in order to allow the execution of more than one command.
> During the implementation of the first proof of concept we might consider implementing a lock mechanism for preventing the concurrent execution of more than one `Action` at any time.
> An alternative name for `TaskGroup` might be `TaskBatch` (batch = a quantity or consignment of goods produced at one time).
> Usage of `TaskSet` or `TaskDeployment` was instead discarded because those names generally imply something that creates Pods, which is not the case here.
#### TaskGroupController and Task
The `TaskGroup controller` is responsible for managing the `TaskGroup` CRD and implements all the logic for generating `Task` objects from a `TaskGroup` according to the current topology of the Kubernetes cluster.
Each generated `Task` is expected to target a specific `Node` in the cluster.
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Task
metadata:
  name: task-sample-control-plane-1
  labels:
    app: a
spec:
  nodeName: control-plane-1
  alphaCertsRenew: {}
```
or
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Task
metadata:
  name: task-sample-control-plane-2
  labels:
    app: a
spec:
  nodeName: control-plane-2
  alphaCertsRenew: {}
```
The `TaskGroup controller` generates `Task` objects according to a policy that defaults to sequential order: a new `Task` is generated only after the current one completes successfully.
In future releases of the kubeadm operator, additional policies could be added to support e.g. parallel execution of `Task`s or a disruption budget (at most N `Task`s executed at the same time).
In any case, the `TaskGroup controller` will ensure a predictable execution order based on the alphabetical order of the `Node`s in the scope of a `TaskGroup`.
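As a purely illustrative sketch, a future non-sequential policy could be expressed through a dedicated field; note that the `policy` field below is hypothetical and not part of the API proposed in this iteration:

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: TaskGroup
metadata:
  name: taskgroup-parallel-sample
spec:
  # Hypothetical field: a possible shape for future execution policies.
  policy:
    type: Parallel      # the default would remain Sequential
    maxUnavailable: 2   # disruption budget: at most 2 Tasks in flight
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  template:
    spec:
      alphaCertsRenew: {}
```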
#### TaskController
The `Task controller` is the last component of the kubeadm operator.
Its main characteristic is that it is deployed on all the nodes in the cluster using a `DaemonSet`, and that each instance is responsible for managing only the `Task`s targeting its specific node.
`Task` reconciliation triggers the execution of kubeadm commands on such a node; in order to do so, the `Task controller` requires to be executed in privileged mode.
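A minimal sketch of how such a `DaemonSet` could look; the image, namespace, and label names are illustrative assumptions, not part of this proposal:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubeadm-operator-task-controller   # illustrative name
  namespace: kubeadm-operator-system       # illustrative namespace
spec:
  selector:
    matchLabels:
      app: task-controller
  template:
    metadata:
      labels:
        app: task-controller
    spec:
      hostPID: true
      containers:
      - name: task-controller
        image: example.com/task-controller:latest  # illustrative image
        securityContext:
          privileged: true  # required to run kubeadm commands on the host
        env:
        # Each instance reconciles only Tasks whose spec.nodeName
        # matches the node it is running on.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
```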
#### Observability and Operability
The kubeadm operator is designed to take responsibility for actions that today are in the charge of users, and we want to ensure that users can feel comfortable handing over such responsibility.
This goal influences the kubeadm operator design as described in the following paragraphs.
##### Execution order
The kubeadm operator is going to execute an `Action`'s `Task`s in a predictable and consistent order:
- `TaskGroup`s will be generated in the order encoded in each action and always executed sequentially.
- If a `TaskGroup` uses the `sequential` policy, `Task`s will be executed according to the alphabetical order of `Node`s.
- Additional `TaskGroup` policies introduced in the future are expected to provide similar guarantees.
Additionally, it will be possible to execute actions in `dry-run` mode, thus allowing users to have a preview of the resulting task order.
##### Execution modes
The kubeadm operator executes `Task`s according to the order described above, and by default a new `Task` is automatically started as soon as the previous task completes.
At any time during the execution, the user will be allowed to pause the sequence of tasks using the new `kubeadm operator pause` command.
Similarly, it will be possible to restart the execution with `kubeadm operator restart`.
It will also be possible to execute an action in `Controlled` mode; with this option, the kubeadm operator automatically pauses before actually executing each task; this allows the user to check the task details and decide whether to proceed or not.
> NB. Once a `Task` is handed over to kubeadm, it is not possible to pause its execution; that means that `kubeadm operator pause` does not affect running `Task`s, but it prevents new `Task`s from being started.
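A purely illustrative sketch of how such execution modes could surface on the `Action` object; the `mode` field and its values are hypothetical, not part of the API proposed so far:

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
metadata:
  name: upgrade-controlled
spec:
  # Hypothetical field: the operator pauses before each task in
  # Controlled mode; DryRun would only preview the task order.
  mode: Controlled
  upgrade:
    targetVersion: v1.16.0
```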
##### Log management
The kubeadm operator will never delete `Action`, `TaskGroup` and `Task` objects.
Such objects will be preserved, and the user can use them to observe what is happening during `Action` execution,
or to inspect what happened after an `Action` completes successfully or fails.
The `Task` object in particular will host the output of each command, providing the same level of detail
accessible today to users executing the kubeadm commands manually.
> During the proof of concept we will also consider whether to store the output of each command directly on the `Task` or in a separate object.
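Assuming the output is stored directly on the `Task` (one of the options under consideration), its status could look like the following sketch; all status field names are illustrative:

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Task
metadata:
  name: task-sample-control-plane-1
spec:
  nodeName: control-plane-1
  alphaCertsRenew: {}
status:
  # Hypothetical fields: one possible shape for surfacing results.
  phase: Completed
  startTime: "2019-09-16T10:00:00Z"
  completionTime: "2019-09-16T10:00:12Z"
  commandOutput: |
    # captured stdout/stderr of the kubeadm command would go here
```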
##### Error management
In case of errors, the kubeadm operator immediately pauses `Action` execution, so the user can access the available
logs to get insight into the problem that originated the error.
Additionally, during the proof of concept we will explore options for allowing the retry of failed tasks and/or restarting while skipping failed tasks.
#### Extensibility
The kubeadm operator provides high-level `Action`s encoding the knowledge of how kubeadm workflows should be performed in the form of `TaskGroup`s and `Task`s.
However, such an out-of-the-box experience might not fit advanced use cases, custom clusters, etc.
The kubeadm operator will provide a toolbox allowing users to address such use cases as well:
- A `CustomAction` type will be introduced, allowing the user to define custom lists of `TaskGroup`s and `Task`s.
- It will be possible to manually create "detached" `TaskGroup`s and/or `Task`s (without a parent `Action`), thus providing the user with further extension points.
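A purely illustrative sketch of what a `CustomAction` could look like; the exact schema (here an ordered `taskGroups` list reusing the `TaskGroup` spec) is an assumption, not a decided design:

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: CustomAction
metadata:
  name: custom-action-sample
spec:
  # Hypothetical field: an ordered list of TaskGroups to execute.
  taskGroups:
  - nodeSelector:
      matchLabels:
        node-role.kubernetes.io/master: ""
    template:
      spec:
        alphaCertsRenew: {}
```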
> NB. During implementation we will consider whether to include extensibility in the first POC or in a following release of the kubeadm operator.
### Risks and Mitigations
[1] A hierarchy of CRD objects (`Action`, `TaskGroup`, `Task`) provides clean semantics and helps in having a good separation of concerns between controllers, but as a downside it introduces new problems like orphan management.
Those conditions can arise only if users accidentally or intentionally create or delete `TaskGroup`/`Task` objects while an `Action` is being executed.
Such behaviour should be avoided, or performed only by very advanced users, but there is no way to ensure it can't happen.
During the proof of concept we will explore options for detecting such conditions and immediately pausing execution in case of inconsistencies.
Additionally, we are going to ensure that completed `Action`s will not reconcile anymore, in order to prevent the accidental execution of tasks.
[2] The `Task controller` requires to be executed in privileged mode, and this might raise concerns in terms of security.
However, considering that the kubeadm operator does not require more privileges than an admin executing the same commands manually, we consider that this proposal does not introduce new security risks.
On the other side, given that the kubeadm operator inherits Kubernetes Authentication & Authorization, Kubernetes Auditing, Pod security policies, etc., the proposed solution enables new options for ensuring security and traceability around kubeadm `Action`s.
Nevertheless, during the implementation of the first proof of concept, or in a follow-up release of this KEP, we might consider making the operator automatically deploy the `Task controller` when an action is started and delete it immediately after the action completes. This will further reduce the surface for this risk.
## Design Details
### Test Plan
TBD
### Graduation Criteria
TBD
### Upgrade / Downgrade Strategy
TBD
### Version Skew Strategy
TBD
## Implementation History
- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
- the `Proposal` section being merged signaling agreement on a proposed design
## Drawbacks
TBD
## Alternatives
[1] To NOT implement the kubeadm operator, and let users automate workflows/orchestration of kubeadm actions across nodes with other tools.
[2] To implement the kubeadm operator using `Jobs` instead of the `Task` abstraction.
This was considered impractical, because it would force the operator to have a complex interaction with the `Jobs` abstraction;
e.g. in order to use `Jobs`, it would be necessary to implement a `Kubeadm operator Pod`, but interacting with this pod using plain `command` and `args` pod options can be difficult.
Instead, having a custom `Task` abstraction and a `TaskController` running on nodes allows full flexibility and control over the actions executed by the kubeadm operator.
[3] To implement the kubeadm operator without using the `TaskGroup` abstraction.
This was considered impractical because it would collapse the management of the domain-specific logic about kubeadm workflows and the management of cluster topology/task generation policies into a single controller.
A better separation of the above concerns, achieved by implementing the `TaskGroup` abstraction and the `TaskGroup controller`, can simplify the implementation and maintenance of the kubeadm-operator.
## Infrastructure Needed
TBD