# Introducing: Control plane machine management and 1-click scaling
Since the inception of OpenShift 4, installer-provisioned-infrastructure clusters have utilised the Machine API to manage worker and infrastructure machines via `MachineSet` resources.
However, while OpenShift 4 clusters have had control-plane machines, these have been standalone and unmanaged.
Users have not had the ability to scale or modify their control-plane machines without manual work and checks to ensure the health of the control-plane.
As adoption of OpenShift grows within an organisation and its clusters grow in size,
the extra pressure that additional worker machines, and the workloads running on them, put on the control-plane
leads users to vertically scale their control-plane instances to cope with the load.
Since scaling the control-plane requires management of the OpenShift cluster's etcd cluster, the process of scaling a control-plane manually is very involved and requires specific knowledge and careful execution of steps in the correct order.
As managed services have grown, this process has not evolved and has become a significant bottleneck for the SREs managing these services.
An automated solution is required.
## Introducing the ControlPlaneMachineSet
The `ControlPlaneMachineSet` is a new resource within the OpenShift Machine API ecosystem, introduced in 4.12.
It manages the cluster's control-plane machines and adds new automation on top of the existing Machine API concepts.
The OpenShift team often refer to `Machine` and `MachineSet` resources as being analogous to `Pod` and `ReplicaSet` resources.
The `MachineSet` is to a `Machine` as the `ReplicaSet` is to a `Pod`.
If we extend this analogy, a `ControlPlaneMachineSet` is similar to a `StatefulSet`.
Rather than managing an arbitrary number of identical `Machine` resources, like a `MachineSet` would, the `ControlPlaneMachineSet`
manages a small number of identical `Machines` and adds special logic on top of the `Machines` to provide functionality such as
rolling-update replacement of the `Machines` as well as spreading the `Machines` across multiple failure domains.
With a `ControlPlaneMachineSet` installed and active within a cluster, users can now modify parameters of their control-plane specification and observe as the `ControlPlaneMachineSet` automatically and safely replaces the control-plane machines with new machines that match the updated spec.
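To picture what spreading `Machines` across failure domains means, here is a minimal Python sketch. It is a simplified illustration and not the operator's actual placement logic: the `spread` function and the zone names are hypothetical, and it assumes a plain round-robin assignment of machine indexes to zones.

```python
from collections import Counter


def spread(indexes, failure_domains):
    """Assign each machine index to a failure domain in round-robin order.

    Hypothetical helper: it only illustrates the idea of balanced placement.
    """
    return {
        index: failure_domains[position % len(failure_domains)]
        for position, index in enumerate(indexes)
    }


# Three control-plane machines spread over three illustrative zones.
placement = spread([0, 1, 2], ["zone-a", "zone-b", "zone-c"])
per_zone = Counter(placement.values())
print(placement)
```

With three machines and three zones, each zone ends up hosting exactly one control-plane machine, so losing a single zone removes at most one of them.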
## What can I use a ControlPlaneMachineSet for?
The `ControlPlaneMachineSet` can be used to perform rolling update replacements of control-plane `Machines` within OpenShift.
For example, if you need to move the control-plane `Machines` to a larger instance type, editing the provider specification on the
`ControlPlaneMachineSet` spec triggers a complete rolling replacement of the control-plane `Machines` within the cluster, allowing you
to make automated changes to the infrastructure within the control-plane in a safe and controlled manner.
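As a rough illustration of such an edit, the Python sketch below builds a JSON merge patch that changes only the instance type. The field path reflects my understanding of the AWS layout of the resource and should be checked against the operator documentation before use. The `instance_type_patch` helper and the `m6i.2xlarge` value are purely illustrative.

```python
import json


def instance_type_patch(instance_type):
    """Build a merge patch that changes only the control-plane instance type.

    Assumption: the provider specification sits under
    spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value
    and the AWS field is named instanceType. Verify this for your platform.
    """
    return {
        "spec": {
            "template": {
                "machines_v1beta1_machine_openshift_io": {
                    "spec": {
                        "providerSpec": {
                            "value": {"instanceType": instance_type}
                        }
                    }
                }
            }
        }
    }


patch = instance_type_patch("m6i.2xlarge")
print(json.dumps(patch, indent=2))
```

Because a merge patch carries only the fields that change, the rest of the provider specification is left untouched.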
## How does the ControlPlaneMachineSet work?
The `ControlPlaneMachineSet` constantly monitors the control-plane `Machines` within the cluster.
It compares the desired specification (from within the resource spec) to the existing configuration of the control-plane `Machines`.
When it detects a difference, it iterates through the control-plane `Machines` and, one by one, replaces each with an up-to-date `Machine`. This is an example of the immutable infrastructure concept.
That is, it creates a new `Machine`, waits for that `Machine` to join the cluster, and then marks the old `Machine` for deletion.
Once the old `Machine` is removed (i.e. there should never be more than one additional `Machine` in the cluster), it moves on to the next control-plane `Machine`
and repeats the process until all of the `Machines` have been updated.
If, at any point, a control-plane `Machine` is manually marked for deletion, the `ControlPlaneMachineSet` will attempt to maintain the cluster by creating
a replacement for that `Machine`.
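The one-at-a-time replacement described in this section can be sketched as a small Python simulation. This is a toy model under stated assumptions, not the operator's code: the `Machine` class, the `rolling_replace` function and the machine names are hypothetical, a single string stands in for the full provider specification, and every new machine is assumed to join the cluster successfully.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    spec: str  # stand-in for the full provider specification


def rolling_replace(machines, desired_spec):
    """Replace outdated machines one at a time.

    Returns the final machine list and the peak number of machines
    that existed at any moment during the rollout.
    """
    current = list(machines)
    peak = len(current)
    for old in list(current):
        if old.spec == desired_spec:
            continue  # already up to date, nothing to replace
        # Create the replacement first; this model assumes it joins the cluster.
        current.append(Machine(name=f"{old.name}-new", spec=desired_spec))
        peak = max(peak, len(current))
        # Only then is the old machine marked for deletion and removed.
        current.remove(old)
    return current, peak


machines = [Machine(f"master-{i}", "m6i.xlarge") for i in range(3)]
updated, peak = rolling_replace(machines, "m6i.2xlarge")
print([m.name for m in updated], peak)
```

In this model the machine count never exceeds four, which mirrors the rule that there is at most one additional `Machine` in the cluster during an update.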
## What happens to etcd when scaling my control plane?
Starting in OpenShift 4.11, the etcd operator leverages machine lifecycle hooks to implement a quorum protection mechanism when the Machine API is configured within the cluster.
The lifecycle hooks allow the etcd operator to control when the Machine API drains and removes pods on a control-plane machine.
Using this hook, the etcd operator prevents removal of an etcd member until it has had an opportunity to migrate that member
onto a new node within the cluster.
While performing a rolling update, the cluster will, for a short period, have 4 control-plane machines.
When the 4th control-plane node joins the cluster, the etcd operator starts a new etcd member on the new node.
Once it observes that the old control-plane machine has been marked for deletion, it stops the etcd member on the old node and promotes
the new etcd member to join the quorum of the cluster.
This mechanism allows the etcd operator precise control over the members within the quorum and allows the Machine API to safely create and
remove control-plane machines without specific operational knowledge of the etcd cluster.
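The hook-based protection described above can be modelled with a short Python sketch. It is an illustration under assumptions rather than the etcd operator's implementation: the `ControlPlaneMachine` class, the `reconcile` function and the hook name `etcd-quorum-guard` are all hypothetical, and the set of voting members is reduced to a set of machine names.

```python
HOOK = "etcd-quorum-guard"  # hypothetical hook name, for illustration only


class ControlPlaneMachine:
    def __init__(self, name):
        self.name = name
        self.pre_drain_hooks = set()
        self.deletion_requested = False

    def can_drain(self):
        # Draining may only start once deletion is requested and no hook remains.
        return self.deletion_requested and not self.pre_drain_hooks


def reconcile(old, replacement, voters):
    """Toy version of the operator's decision for one machine being replaced."""
    if not old.deletion_requested or replacement is None:
        return voters  # nothing to migrate yet, keep the hook in place
    # Swap the member on the old machine for the one on the replacement...
    voters = (voters - {old.name}) | {replacement.name}
    # ...and only then release the hook so the old machine can be drained.
    old.pre_drain_hooks.discard(HOOK)
    return voters


old = ControlPlaneMachine("master-0")
old.pre_drain_hooks.add(HOOK)
voters = {"master-0", "master-1", "master-2"}

old.deletion_requested = True
blocked = old.can_drain()  # hook still present, so draining must wait

new = ControlPlaneMachine("master-3")
voters = reconcile(old, new, voters)
print(blocked, old.can_drain(), sorted(voters))
```

The point of the design is that the hook blocks machine removal until the membership change has happened, so the number of voting members stays at three throughout.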
## When is the ControlPlaneMachineSet available?
Now!
The `ControlPlaneMachineSet` will be configured and active on all freshly installed OpenShift 4.12 (and onwards) clusters, for the AWS platform.
Support for Azure and GCP is being targeted for a future release.
For clusters upgrading to 4.12 on AWS and Azure, an inactive `ControlPlaneMachineSet` will be created and maintained on the cluster by the
operator. The inactive `ControlPlaneMachineSet` can then be activated by the user should they wish to enable the functionality of the `ControlPlaneMachineSet` on their cluster.
This functionality will be available for GCP in a future release.
## Where can I learn more about the ControlPlaneMachineSet?
The `ControlPlaneMachineSet` operator project contains [some documentation](https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/docs/user) aimed at users of the project.
If you are interested in the background of the design of the new project, the original design proposal, on which this project was implemented, is available to read on [GitHub](https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/control-plane-machine-set.md).
This design proposal includes detailed descriptions of the various features of the `ControlPlaneMachineSet` as well as the motivation
for decisions taken during the design process.