Refer to the Cluster API Book Glossary.
In-place Update: any change to a Machine spec, including the Kubernetes version, that is performed without deleting the machine and creating a new one.
External Update Lifecycle Hook: CAPI Lifecycle Runtime Hook to invoke external update extensions.
External Update Extension: a Runtime Extension (implementation) responsible for performing in-place updates when the External Update Lifecycle Hook is invoked.
The proposal introduces update extensions allowing users to execute custom strategies when performing Cluster API rollouts.
An External Update Extension implementing custom update strategies will report the subset of changes it knows how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
If the totality of the required changes cannot be covered by the defined extensions, Cluster API will fall back to the current behavior (rolling update).
Cluster API by default performs rollouts by creating a new machine and deleting the old one.
This approach, inspired by the principle of immutable infrastructure (the very same used by Kubernetes to manage Pods), has a set of considerable advantages:
Over time, several improvements were made to Cluster API's immutable rollouts:
Even as the project continues to improve immutable rollouts, there are, and will likely always be, some remaining use cases where it is complex for users to perform immutable rollouts, or where users perceive immutable rollouts as too disruptive to how they are used to managing machines in their organization:
TODO: looking for more real-life use cases here
With this proposal, Cluster API provides a new extensibility point for users who want to implement their own solution to these problems, allowing them to implement a custom rollout strategy triggered via a new external update extension point built on the existing runtime extension framework.
By implementing a custom rollout strategy, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits of immutable infrastructure.
As this proposal is an output of the In-place updates Feature Group, ensuring that the external update extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.
Please note that the practical consequence of focusing on in-place rollout strategies is that the possibility of implementing other types of custom rollout strategies, even if technically feasible, won't be validated in this first iteration (future goal).
Another important point to surface before digging into the implementation details is that this proposal does not tackle the problem of improving CAPI to embrace all the possibilities that external update extensions introduce. For example, if an external update extension introduces support for in-place updates, using "BootstrapConfig" (emphasis on bootstrap) as the place where most of the machine configuration is defined seems less than ideal.
However, at the same time we would like to make it possible for Cluster API users to start exploring this field, gain experience, and report back so we can have concrete use cases and real-world feedback to evolve our API.
Cluster API user experience MUST be the same when using default, immutable updates or when using external update extensions: e.g. in order to trigger a MachineDeployment rollout, you have to rotate a template, etc.
At the current alpha stage of the in-place updates feature, the default rolling update is always enabled and cannot be turned off. If external update extensions cannot cover the totality of the desired changes, CAPI WILL fall back to its default, immutable rollouts. This is important for a couple of reasons:
The external update extension will be responsible for performing the update of a single machine.
The responsibility for determining which machine should be rolled out, as well as for handling rollout options like MaxSurge/MaxUnavailable, will remain with the controllers owning the machine (e.g. KCP, MD controller).
We propose a pluggable update strategy architecture that allows External Update Extensions to handle the update process.
Initially, this feature will be implemented without modifying any of the current Cluster API APIs. It will follow Kubernetes' feature gate mechanism and be contained within the experimental package. This means that any changes in behavior are controlled by the InPlaceUpdates feature gate, which must be enabled by users for the new in-place updates workflow to be available; it is disabled unless explicitly configured.
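As a minimal sketch of what this gating could look like inside the controllers (assuming the gate is wired into CAPI's existing feature package; the InPlaceUpdates constant below is introduced by this proposal and does not exist today):

package controllers

import "sigs.k8s.io/cluster-api/feature"

// useExternalUpdateStrategy reports whether the experimental in-place updates
// workflow should be considered at all; with the gate off, controllers keep
// today's rolling-update-only behavior.
func useExternalUpdateStrategy() bool {
	// feature.InPlaceUpdates would be registered alongside the other
	// experimental CAPI feature gates (hypothetical constant).
	return feature.Gates.Enabled(feature.InPlaceUpdates)
}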
This proposal introduces a Lifecycle Hook named ExternalUpdate for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them covering only a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
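To make the contract concrete, here is a rough sketch of what an updater's CanUpdateMachine handler boils down to (Go, with stand-in payload types; the real request/response types would be defined by the runtime hooks API this proposal adds):

package updater

import "context"

// Stand-ins for the hook payloads; hypothetical, defined only for this sketch.
type CanUpdateMachineRequest struct{ Changes []string }
type CanUpdateMachineResponse struct{ AcceptedChanges []string }

// canUpdateMachine reports the subset of the desired changes this extension
// knows how to apply in place; an empty result means it cannot help here.
func canUpdateMachine(_ context.Context, req *CanUpdateMachineRequest, resp *CanUpdateMachineResponse) {
	supported := map[string]bool{"infraMachine.spec.memoryMiB": true}
	for _, change := range req.Changes {
		if supported[change] {
			resp.AcceptedChanges = append(resp.AcceptedChanges, change)
		}
	}
}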
With the introduction of this experimental feature, users may want to apply the in-place updates workflow to only a subset of CAPI clusters. By leveraging CAPI's RuntimeExtension, we can provide a namespace selector via ExtensionConfig. This allows us to support cluster selection at the namespace level (only clusters/machines in namespaces that match the selector) without API changes.
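For illustration, registering and scoping an updater could look like the following ExtensionConfig (the extension name, service, and labels are hypothetical; namespaceSelector is the existing ExtensionConfig field this relies on):

apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  name: vsphere-vm-memory-update
spec:
  clientConfig:
    service:
      name: vsphere-vm-memory-update
      namespace: capi-extensions
  namespaceSelector:
    matchLabels:
      in-place-updates: enabled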
As a cluster operator, I want to perform in-place updates on my Kubernetes clusters without replacing the underlying machines. I expect the update process to be flexible, allowing me to customize the strategy based on my specific requirements, such as air-gapped environments or special node configurations.
As a cluster operator, I want to seamlessly transition between rolling and in-place updates while maintaining a consistent user interface. I appreciate the option to choose or implement my own update strategy, ensuring that the update process aligns with my organization's unique needs.
As a cluster operator of resource-constrained environments, I want to utilize CAPI's pluggable external update mechanism to implement in-place updates without requiring additional compute capacity in a single-node cluster.
As a cluster operator of highly specialized/customized environments, I want to utilize CAPI's pluggable external update mechanism to implement in-place updates without losing the existing VM/OS customizations.
As a cluster operator, I want to update machine attributes supported by my infrastructure provider without the need to recreate the machine.
As a cluster service provider, I want guidance/documentation on how to write an external update extension for my own use case.
As a bootstrap/controlplane provider developer, I want guidance/documentation on how to reuse some parts of this pluggable external update mechanism.
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as CP/MD controller
participant mach as Machine Controller
participant hook as External updater
end
Operator->>+apiserver: Make changes to KCP
apiserver->>+capi: Notify changes
apiserver->>-Operator: OK
loop For all External Updaters
capi->>+hook: Can update?
hook->>capi: Supported changes
end
capi->>capi: Decide Update Strategy
loop For all machines
capi->>apiserver: Mark Machine as pending and set UpToDate condition
apiserver->>mach: Notify changes
loop For all External Updaters
mach->>hook: Run updater
end
mach->>apiserver: Mark Hooks in Machine as Done
capi->>apiserver: Set UpToDate condition
end
When configured, external updates will, roughly, follow these steps:
- The CP/MD controller determines that the desired changes can be covered by the registered external updaters and uses the sigs.k8s.io/cluster-api/internal/hooks.MarkAsPending() function to track that updaters should be called. It also sets the UpToDate condition on machines to False.
- The Machine controller runs the external updaters and, once they finish, uses sigs.k8s.io/cluster-api/internal/hooks.MarkAsDone() to mark the machine as done updating.
- The CP/MD controller sets the UpToDate condition on machines to True.
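As a sketch of how an owner controller would use these helpers (assuming the internal/hooks functions keep their current shape; runtimehooksv1.ExternalUpdate is the hook this proposal adds and does not exist in the runtime hooks API today):

package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
	"sigs.k8s.io/cluster-api/internal/hooks"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markMachinePendingExternalUpdate records the pending ExternalUpdate hook on
// the Machine (first step above); the Machine controller later clears it with
// hooks.MarkAsDone once all updaters report completion.
func markMachinePendingExternalUpdate(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	return hooks.MarkAsPending(ctx, c, m, runtimehooksv1.ExternalUpdate)
}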
The following sections dive deep into these steps, zooming in on the different component interactions and defining how the main error cases are handled.
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as CP/MD Controller
participant hook as External updater 1
participant hook2 as External updater 2
end
Operator->>+apiserver: make changes to CP/MD
apiserver->>+capi: Notify changes
apiserver->>-Operator: OK
capi->>+hook: Can update?
hook->>capi: Set of changes
capi->>+hook2: Can update?
hook2->>capi: Set of changes
alt all changes covered?
capi->>apiserver: Decide Update Strategy
else
alt fallback strategy?
capi->>apiserver: Re-create machines
else
capi->>apiserver: Mark update as failed
end
end
Both the KCP and MachineDeployment controllers follow a similar pattern around updates: they first detect if an update is required and then, based on the configured strategy, follow the appropriate update logic (note that today there is only one valid strategy, RollingUpdate).
With the InPlaceUpdates feature gate enabled, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them, or an empty set, signaling that it cannot handle any of the desired changes.
If the union of the changes accepted by the updaters covers all desired changes, CAPI will determine that the update can be performed using the external strategy.
If any of the desired changes cannot be covered by the updaters' capabilities, CAPI will determine that the desired state cannot be reached through external updaters. In this case, it will fall back to the rolling update strategy, replacing machines as needed.
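The aggregation logic is simple set coverage; a sketch (with illustrative field-path strings, not a real CAPI function):

package controllers

// canUpdateInPlace returns true only when every desired change is accepted by
// at least one registered updater; otherwise the controller falls back to the
// rolling update strategy.
func canUpdateInPlace(desiredChanges []string, acceptedByUpdater map[string][]string) bool {
	covered := map[string]bool{}
	for _, accepted := range acceptedByUpdater {
		for _, path := range accepted {
			covered[path] = true
		}
	}
	for _, path := range desiredChanges {
		if !covered[path] {
			// At least one change no updater can handle: fall back.
			return false
		}
	}
	return true
}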
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as MD controller
participant msc as MachineSet Controller
participant mach as Machine Controller
participant hook as External updater
end
Operator->>apiserver: make changes to MD
apiserver->>capi: Notify changes
apiserver->>Operator: OK
capi->>capi: Decide Update Strategy
capi->>apiserver: Create new MachineSet
loop For all machines
capi->>apiserver: Mark as pending and move to new Machine Set
apiserver->>msc: Notify changes
msc->>apiserver: Update Machine Spec and set UpToDate condition
apiserver->>mach: Notify changes
loop For all updaters in plan
mach->>hook: Run updater
end
mach->>apiserver: Mark Machine as done
msc->>apiserver: Set UpToDate condition
end
The MachineDeployment controller updates machines in place in a very similar way to rolling updates: by creating a new MachineSet and moving the machines from the old MS to the new one. We want to stress that the Machine objects won't be deleted and recreated as in the current rolling strategy. The MachineDeployment will just update the OwnerRefs and labels, effectively moving the existing Machine object from one MS to another. The number of machines moved at once might be made configurable on the MachineDeployment, in the same way maxSurge and maxUnavailable control this for rolling updates.
When the MachineSet controller sees a Machine in the new MachineSet with an outdated spec, it updates the spec to match the one in the MS. This update, together with marking the machine as pending and setting a condition, is what triggers the Machine controller to start executing the external updaters.
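A sketch of the re-parenting step (hypothetical helper; a real implementation would live in the MachineDeployment controller and also handle conflicts and pending-hook bookkeeping):

package controllers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// moveMachineToMachineSet moves an existing Machine to the new MachineSet by
// rewriting its controller owner reference and selector labels; the Machine
// object itself is never deleted.
func moveMachineToMachineSet(ctx context.Context, c client.Client, m *clusterv1.Machine, newMS *clusterv1.MachineSet) error {
	patchBase := client.MergeFrom(m.DeepCopy())
	m.OwnerReferences = []metav1.OwnerReference{
		*metav1.NewControllerRef(newMS, clusterv1.GroupVersion.WithKind("MachineSet")),
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	// Align the labels used by the new MachineSet's selector.
	for k, v := range newMS.Spec.Selector.MatchLabels {
		m.Labels[k] = v
	}
	return c.Patch(ctx, m, patchBase)
}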
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as KCP controller
participant mach as Machine Controller
participant hook as External updater
end
Operator->>apiserver: make changes to KCP
apiserver->>capi: Notify changes
apiserver->>Operator: OK
capi->>capi: Decide Update Strategy
loop For all machines
capi->>apiserver: Mark Machine as pending and set UpToDate condition
apiserver->>mach: Notify changes
loop For all External Updaters
mach->>hook: Run updater
end
mach->>apiserver: Mark Machine as Done
capi->>capi: Set UpToDate condition
end
The KCP external updates will work in a very similar way to MachineDeployments, but without the MachineSet level of indirection. In this case, it's the KCP controller that is in charge of marking the machine as pending, setting a condition, and also updating the Machine spec. This follows the same pattern as rolling updates, where the KCP controller directly creates and deletes Machines. Machines will be updated one by one, sequentially.
sequenceDiagram
box Management Cluster
participant apiserver as kube-api server
participant capi as CAPI
participant mach as Machine Controller
participant hook as External updater 1
participant hook2 as External updater 2
end
box Workload Cluster
participant infra as Infrastructure
end
capi->>apiserver: Decide Update Strategy
capi->>apiserver: Mark Machine as pending and set UpToDate condition
apiserver->>mach: Notify changes
mach->>hook: Start update
hook->>infra: Update components
loop For all External Updaters
mach->>hook: finished?
hook->>mach: try in X secs
end
mach->>apiserver: Mark Machine as Done
capi->>apiserver: Set UpToDate condition
Once a Machine is marked as pending, its UpToDate condition is set to False, and its spec has been updated with the desired changes, the Machine controller takes over. This controller is responsible for calling the updaters, tracking their progress, and exposing that progress in the Machine conditions.
The Machine controller will not follow any order when calling the updaters. This might change in future iterations.
The controller will trigger updaters by hitting another RuntimeHook endpoint (e.g. /UpdateMachine). The updater can respond with "update completed", "update failed", or "update in progress", optionally accompanied by a "retry after X seconds" hint. The CAPI controller will continuously poll the status of the update by hitting the same endpoint until it reaches a terminal state.
CAPI expects the /UpdateMachine endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an /UpdateMachine endpoint has been called, it won't change the Machine spec until the update reaches a terminal state.
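On the CAPI side, this polling maps naturally onto controller-runtime requeues; a sketch (the response type mirrors the JSON payloads shown later in this document and is not an existing CAPI type):

package controllers

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// updateMachineResponse is a stand-in for the /UpdateMachine answer.
type updateMachineResponse struct {
	Status   string        // "Done", "Failed" or "InProgress"
	TryAgain time.Duration // optional polling hint, e.g. 5 minutes
}

// handleUpdateMachineResponse maps the updater's answer to requeue behavior:
// keep polling the same endpoint while in progress, stop on terminal states.
func handleUpdateMachineResponse(resp updateMachineResponse) (ctrl.Result, error) {
	switch resp.Status {
	case "InProgress":
		return ctrl.Result{RequeueAfter: resp.TryAgain}, nil
	case "Failed":
		return ctrl.Result{}, errors.New("external update failed; reflected in Machine status")
	default: // "Done"
		return ctrl.Result{}, nil
	}
}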
Once the update completes, the Machine controller will mark the machine as done. If the update fails, this will be reflected in the Machine status.
From this point on, the KCP or MachineDeployment controller will take over and set the UpToDate condition to True.
As mentioned before, the user experience for in-place updates should be exactly the same as for rolling updates. This includes the need to rotate the infra machine template. For providers that bundle the Kubernetes components in some kind of image, this means that a new image will be required when upgrading Kubernetes versions.
This might seem counter-intuitive, given that the update is made in place and hence no new image is strictly needed. However, not only does this ensure the experience is the same as with rolling updates, it also allows new Machines to be created for that MachineDeployment/CP in case a scale-up or a fallback to rolling update is required.
We leave it up to the external updater implementers to decide how to deal with these changes. Some infra providers might have the ability to swap the image in an existing running machine, in which case they can offer a true in-place update for this field. Those that can't do this but want to allow changes that require a new image (like Kubernetes updates) should "ignore" the image field when processing the update, leaving the machine in a dirty state.
We might explore the ability to represent this "dirty" state at the API level. We leave this for a future iteration of this feature.
Remediation can be used to recover a machine when an in-place update fails on it. The remediation process stays the same as today: the MachineHealthCheck controller monitors machine health and marks machines to be remediated based on pre-configured rules, then the ControlPlane/MachineDeployment controller replaces the machine or calls external remediation.
However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, we require:
All functionality related to in-place updates will be available only if the InPlaceUpdates feature flag is set to true.
This section aims to showcase our vision for the in-place updates end state. It shows a high-level picture of a few common use cases, especially around how the different components interact through the API.
Note that these examples don't show all the low-level details. Some of those details might not yet be defined in this doc and will be added later; the examples here are just to help communicate the vision.
Let's imagine a vSphere cluster with a KCP control plane that has two fictional in-place update extensions already deployed and registered through their respective ExtensionConfigs:
- vsphere-vm-memory-update: uses vSphere APIs to hot-add memory to VMs if "Memory Hot Add" is enabled, or through a power cycle otherwise.
- kcp-version-upgrade: updates the Kubernetes version of KCP machines by using an agent that first updates the Kubernetes related packages (kubeadm, kubectl, etc.) and then runs the kubeadm upgrade command. The in-place update extension communicates with this agent, sending instructions with the Kubernetes version a machine needs to be updated to.
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as KCP Controller
participant mach as Machine Controller
participant hook as KCP version <br>update extension
participant hook2 as vSphere memory <br>update extension
end
box Workload Cluster
participant machines as Agent in machine
end
Operator->>+apiserver: Update version field in KCP
apiserver->>+capi: Notify changes
apiserver->>-Operator: OK
loop For 3 KCP Machines
capi->>+hook2: Can update [spec.version,<br>clusterConfiguration.kubernetesVersion]?
hook2->>capi: I can update []
capi->>+hook: Can update [spec.version,<br>clusterConfiguration.kubernetesVersion]?
hook->>capi: I can update [spec.version,<br>clusterConfiguration.kubernetesVersion]
capi->>capi: Decide Update Strategy
capi->>apiserver: Mark Machine as pending and set UpToDate condition
apiserver->>mach: Notify changes
mach->>hook: Run update<br> in Machine
hook->>mach: In progress
hook->>machines: Update packages and<br>run kubeadm upgrade 1.31
machines->>hook: Done
mach->>hook: Run update<br> in Machine
hook->>mach: Done
mach->>apiserver: Mark Machine as done
capi->>apiserver: Set UpToDate condition
end
The user starts the process by updating the version field in the KCP object:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
name: kcp-1
spec:
replicas: 3
rolloutStrategy:
type: InPlace
- version: v1.30.0
+ version: v1.31.0
The KCP controller computes the difference between the current CP machines (plus their bootstrap configs and infra machines) and the desired state, and detects a difference in the Machine spec.version and in the KubeadmConfig spec.clusterConfiguration.kubernetesVersion. It then starts calling the external update extensions to see if they can handle these changes.
First, it makes a request to the vsphere-vm-memory-update/CanUpdateMachine endpoint of the first registered update extension, vsphere-vm-memory-update:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
"changes": ["machine.spec.version", "bootstrap.spec.clusterConfiguration.kubernetesVersion"],
}
The vsphere-vm-memory-update extension does not support any of the required changes, so it responds with the following message, declaring that it does not accept any of the requested changes:
{
"error": null,
"acceptedChanges": [],
}
Given that there are still changes not covered, KCP continues with the next update extension, making the same request to the kcp-version-upgrade/CanUpdateMachine endpoint of the kcp-version-upgrade extension:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
"changes": ["machine.spec.version", "bootstrap.spec.clusterConfiguration.kubernetesVersion"],
}
The kcp-version-upgrade extension detects that this is a KCP machine, verifies that the changes only require a Kubernetes version upgrade, and responds:
{
"error": null,
"acceptedChanges": ["machine.spec.version", "bootstrap.spec.clusterConfiguration.kubernetesVersion"],
}
Now that KCP knows the desired changes can all be covered, it proceeds to mark the first selected KCP machine for update. It sets the pending-hook annotation and the condition on the machine, which is the signal for the Machine controller to treat these changes differently (as an external update).
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
+ annotations:
+ runtime.cluster.x-k8s.io/pending-hooks: ExternalUpdate
name: kcp-1-hfg374h
spec:
- version: v1.30.0
+ version: v1.31.0
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
- name: kcp-1-hfg374h-9wc29
- uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
+ name: kcp-1-hfg374h-flkf3
+ uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
status:
conditions:
+ - lastTransitionTime: "2024-12-31T23:50:00Z"
+ status: "False"
+ type: UpToDate
These changes are observed by the Machine controller, which then calls all registered updaters. To trigger the first updater, it calls the kcp-version-upgrade/UpdateMachine endpoint:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
}
When the kcp-version-upgrade extension receives the request, it verifies it can read the Machine object, verifies it's a CP machine, and triggers the upgrade process by sending the order to the agent. It then responds to the Machine controller:
{
"error": null,
"status": "InProgress",
"tryAgain": "5m0s"
}
The Machine controller then sends a similar request to the vsphere-vm-memory-update/UpdateMachine endpoint. Since this extension has not accepted any of the changes, it responds with Done (the Machine controller doesn't need to know whether the update was accepted or rejected):
{
"error": null,
"status": "Done"
}
The Machine controller then requeues the reconcile request for this Machine for 5 minutes later. On the next reconciliation, it repeats the request to the kcp-version-upgrade/UpdateMachine endpoint:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
}
The kcp-version-upgrade extension, which has been tracking the upgrade progress reported by the agent, responds:
{
"error": null,
"status": "Done"
}
The Machine controller then removes the annotation:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
- annotations:
- runtime.cluster.x-k8s.io/pending-hooks: ExternalUpdate
name: kcp-1-hfg374h
spec:
version: v1.31.0
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
name: kcp-1-hfg374h-flkf3
uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
status:
conditions:
- lastTransitionTime: "2024-12-31T23:50:00Z"
status: "False"
type: UpToDate
On the next KCP reconciliation, it detects that this machine doesn't have any pending hooks and sets the UpToDate condition to True.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
name: kcp-1-hfg374h
spec:
version: v1.31.0
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
name: kcp-1-hfg374h-flkf3
uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
status:
conditions:
- - lastTransitionTime: "2024-12-31T23:50:00Z"
- status: "False"
+ - lastTransitionTime: "2024-12-31T23:59:59Z"
+ status: "True"
type: UpToDate
This process is repeated a third time with the last KCP machine, finally marking the KCP object as up to date.
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as MD controller
participant msc as MachineSet Controller
participant mach as Machine Controller
participant hook as vSphere memory <br>update extension
end
participant api as vSphere API
Operator->>+apiserver: Update memory field in<br> VsphereMachineTemplate
apiserver->>+capi: Notify changes
apiserver->>-Operator: OK
capi->>+hook: Can update [spec.memoryMiB]?
hook->>capi: I can update [spec.memoryMiB]
capi->>capi: Decide Update Strategy
capi->>apiserver: Create new MachineSet
loop For all Machines
capi->>apiserver: Mark as pending and move to new Machine Set
apiserver->>msc: Notify changes
msc->>apiserver: Update Machine's<br> spec.memoryMiB
apiserver->>mach: Notify changes
mach->>hook: Run update<br> in Machine
hook->>mach: In progress
hook->>api: Update VM's memory
api->>hook: Done
mach->>hook: Run update<br> in Machine
hook->>mach: Done
mach->>apiserver: Mark Machine as updated
mach->>apiserver: Mark Machine as done
msc->>apiserver: Set UpToDate condition
end
The user starts the process by creating a new VSphereMachineTemplate with the updated memoryMiB value and updating the infrastructure template ref in the MachineDeployment:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
- name: md-1-1
+ name: md-1-2
spec:
template:
spec:
- memoryMiB: 4096
+ memoryMiB: 8192
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: m-cluster-vsphere-gaslor-md-0
spec:
strategy:
type: InPlace
template:
spec:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
- name: md-1-1
+ name: md-1-2
The vsphere-vm-memory-update extension informs that it can cover the requested changes:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
"changes": ["infraMachine.spec.memoryMiB"],
}
{
"error": null,
"acceptedChanges": ["infraMachine.spec.memoryMiB"],
}
The request is also made to the kcp-version-upgrade extension, but it responds with an empty array, indicating it cannot handle any of the changes:
{
"error": null,
"acceptedChanges": [],
}
The MachineDeployment controller then creates a new MachineSet with the new spec and moves the first Machine to it by updating its OwnerRefs; it also marks the Machine as pending:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
+ labels:
+ cluster.x-k8s.io/cluster-name: cluster1
+ cluster.x-k8s.io/deployment-name: md-1
+ annotations:
+ runtime.cluster.x-k8s.io/pending-hooks: ExternalUpdate
name: md-1-6bp6g
ownerReferences:
- apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
- name: md-1-gfsnp
+ name: md-1-hndio
The MachineSet controller detects that this machine is out of date and is pending external update execution, so it proceeds to update the Machine's spec and set the UpToDate condition to False:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
name: md-1-6bp6g
annotations:
runtime.cluster.x-k8s.io/pending-hooks: ExternalUpdate
spec:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachine
- name: md-1-1-whtwq
+ name: md-1-2-nfdol
status:
+ conditions:
+ - lastTransitionTime: "2024-12-31T23:50:00Z"
+ status: "False"
+ type: UpToDate
From that point, the Machine controller follows the same process as in the first example.
The process is repeated for all replicas in the MachineDeployment.
flowchart TD
Update[MachineDeployment<br> spec change] --> UpdatePlane{Can all changes be<br> covered with the registered<br> update extensions?}
UpdatePlane -->|No| Roll[Fallback to machine<br> replacement rollout]
UpdatePlane -->|Yes| InPlace[Run extensions to<br> update in place]
sequenceDiagram
participant Operator
box Management Cluster
participant apiserver as kube-api server
participant capi as MD controller
participant msc as MachineSet Controller
participant mach as Machine Controller
participant hook as vSphere memory <br>update extension
participant hook2 as KCP version <br>update extension
end
Operator->>+apiserver: Update template field in<br> VsphereMachineTemplate
apiserver->>+capi: Notify changes
apiserver->>-Operator: OK
capi->>+hook: Can update [spec.template]?
hook->>capi: I can update []
capi->>+hook2: Can update [spec.template]?
hook2->>capi: I can update []
capi->>apiserver: Create new MachineSet
loop For all Machines
capi->>apiserver: Replace machines
end
The user starts the process by creating a new VSphereMachineTemplate with the updated template value and updating the infrastructure template ref in the MachineDeployment:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
- name: md-1-2
+ name: md-1-3
spec:
template:
spec:
- template: /Datacenter/vm/Templates/kubernetes-1-32-ubuntu
+ template: /Datacenter/vm/Templates/kubernetes-1-32-windows
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: m-cluster-vsphere-gaslor-md-0
spec:
strategy:
type: InPlace
fallbackRollingUpdate:
maxUnavailable: 1
template:
spec:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
- name: md-1-2
+ name: md-1-3
Both the kcp-version-upgrade and the vsphere-vm-memory-update extensions inform that they cannot handle any of the changes:
{
"desiredMachine": {...},
"desiredBootstrapConfig": {...},
"desiredInfraMachine": {...},
"changes": ["infraMachine.spec.template"],
}
{
"error": null,
"acceptedChanges": [],
}
Since the fallback to machine replacement is a default strategy and always enabled, the MachineDeployment controller proceeds with the rollout process as it does today, replacing the old machines with new ones.
All functionality related to in-place updates will be available only if the InPlaceUpdates feature flag is set to true.
TODO: we will add this later, after we get feedback on the first draft
CanUpdateMachine endpoint
Request:
- Current Machine spec
- Desired Machine spec, including bootstrap config and infra machine templates
Response:
- Set of supported changes, probably an array of strings (the path of each field in the object)
- Error
UpdateMachine endpoint
Request:
- Desired Machine spec, including bootstrap config and infra machine templates
Response:
- Result: [Success/Error/InProgress]
- Retry in X seconds
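Expressed as Go types, this contract could look roughly like the following (names and shapes are illustrative only; the actual API is still to be defined):

package externalupdate

import (
	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// CanUpdateMachineRequest carries the current and desired state; bootstrap
// config and infra machine are provider-specific, hence RawExtension.
type CanUpdateMachineRequest struct {
	CurrentMachine         clusterv1.Machine
	DesiredMachine         clusterv1.Machine
	DesiredBootstrapConfig runtime.RawExtension
	DesiredInfraMachine    runtime.RawExtension
	// Field paths of the desired changes, e.g. "machine.spec.version".
	Changes []string
}

type CanUpdateMachineResponse struct {
	// Subset of Changes this updater can perform in place.
	AcceptedChanges []string
	Error           string
}

type UpdateMachineRequest struct {
	DesiredMachine         clusterv1.Machine
	DesiredBootstrapConfig runtime.RawExtension
	DesiredInfraMachine    runtime.RawExtension
}

type UpdateMachineResponse struct {
	// One of "Done", "Failed", "InProgress".
	Status string
	// Optional polling hint while InProgress, e.g. "5m0s".
	RetryAfter string
	Error      string
}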
On the core CAPI side, the security model for this feature is very straightforward: CAPI controllers only need to read/create/update CAPI resources, and those controllers are the only ones that need to modify them. Moreover, the controllers that need to perform these actions already have the necessary permissions over the resources they need to modify.
However, each external updater should define their own security model. Depending on the mechanism used to update machines in-place, different privileges might be needed, from scheduling privileged pods to SSH access to the hosts. Moreover, external updaters might need RBAC to read CAPI resources.
In the future, depending on the scenario, users might amend the desired state to something the registered updaters can cover, or register additional updaters capable of handling the desired changes. However, at this stage of the proposal, these options are not supported, and the rolling update strategy remains the only default fallback.
To test the external update strategy, we will implement a "CAPD Kubeadm Updater". This will serve as a reference implementation and will be integrated into CAPI CI. In-place updates will be performed by executing a set of commands in the container, similar to how cloud config is currently handled when a machine is bootstrapped.
The initial plan is to provide support for the external update strategy in the KCP and MD controllers under a feature flag (unset by default) and to keep the webhook API in an alpha stage (which will allow us to iterate faster).
The main criteria for graduating this feature will be community adoption and API stability. Once the feature is stable, known bugs have been fixed, and the webhook API has remained stable without backward-incompatible changes for some time, we will propose to the community moving the API out of the alpha stage. It will then be promoted out of experimental, and the feature flag for enabling/disabling the functionality will be deprecated.