# A general design for an in-place upgrade extension
This doc proposes a design for an in-place upgrade extension. It focuses on the interaction between CAPI and the extension, and on orchestrating the upgrade operation. To keep the extension general purpose, it has no dependency on any infrastructure; instead it provides a contract that delegates the node-level in-place upgrade to an external entity. This doc aims to:
- Verify the CAPI in-place upgrade contract and show that the contract design is comprehensive enough to implement in-place upgrade.
- Provide a design of an in-place upgrade extension as a showcase for the community, guiding developers to implement their own extensions correctly.
- Provide a general implementation of an in-place upgrade extension. If it meets their needs, developers can build on it and only implement the infrastructure-specific in-place upgrade logic.
## Proposal
### Components
```plantuml
@startuml
node "Host(s)" {
[Agent process]
}
cloud "management cluster" {
[CAPI] --> WebHook
[Cluster In-place upgrader] - WebHook
[Cluster In-place upgrader] --> NodeUpgradeCR
[Node upgrader] - NodeUpgradeCR
[Node upgrader] --> [Agent process]
}
@enduml
```
### Sequence diagram
1. A new CRD, UpgradeTask, is introduced to track the in-place upgrade spec and status.
2. Implement the ExternalUpgradeRequest webhook. When the webhook is called, the extension aborts any old upgrade on the same CP/MachineDeployment, determines whether it can handle the upgrade request, and if so creates an UpgradeTask CR.
3. In the UpgradeTask reconcile loop:
   - If the UpgradeTask is in the abort stage, abort any ongoing upgrade and remove ownership.
   - If the UpgradeTask is in the ongoing stage, set machine ownership, then select a node based on the preflight policy and upgrade it.
4. Provide a contract, NodeUpdateTemplate, to delegate the node upgrade operation to an external entity.
```plantuml
@startuml
actor User as user
participant "CAPI" as capi
participant "**cluster In-place upgrader**" as cup
participant "node In-place upgrader" as nup
participant host
user -> capi: upgrade cluster os/k8s version
capi -> cup: call Lifecycle Runtime Hook \n**ExternalUpgradeRequest**
cup --> capi: accept
cup -> cup: create UpgradeTask CR
activate cup
loop while nodesToUpdate > 0
cup -> capi: preflight check \nand select 1 node to update
cup -> nup: create node update CR
activate nup
nup --> cup
nup -> host: in-place upgrade node
host --> nup
nup --> cup: completed
deactivate nup
end
cup -> capi: upgrade completed
deactivate cup
@enduml
```
### UpgradeTask CR
```golang
// UpgradeTaskPhase is the lifecycle stage of an UpgradeTask.
type UpgradeTaskPhase string

const (
	// OngoingPhase marks a task whose machines are being upgraded.
	OngoingPhase UpgradeTaskPhase = "Ongoing"
	// AbortPhase marks a task whose upgrade is being aborted.
	AbortPhase UpgradeTaskPhase = "Abort"
	// CompletePhase marks a task whose upgrade has finished.
	CompletePhase UpgradeTaskPhase = "Complete"
)

// UpgradeTaskSpec defines the desired state of UpgradeTask
type UpgradeTaskSpec struct {
	Cluster                *clusterv1.Cluster      `json:"cluster,omitempty"`
	ControlPlaneRef        *corev1.ObjectReference `json:"controlPlaneRef,omitempty"`
	MachineDeploymentRef   *corev1.ObjectReference `json:"machineDeploymentRef,omitempty"`
	MachinesRequireUpgrade []*clusterv1.Machine    `json:"machinesRequireUpgrade,omitempty"`
	NewMachineSpec         *MachineSpec            `json:"newMachineSpec,omitempty"`
	// Template defines the template of Machine In-place upgrader
	Template *MachineUpgraderTemplate `json:"template,omitempty"`
	Phase    UpgradeTaskPhase         `json:"phase,omitempty"`
}

// MachineUpgraderTemplate references the object that performs the node-level upgrade.
type MachineUpgraderTemplate struct {
	UpgraderRef corev1.ObjectReference `json:"upgraderRef"`
}

// MachineSpec holds the target bootstrap config and infrastructure machine.
type MachineSpec struct {
	BootstrapConfig *unstructured.Unstructured `json:"bootstrapConfig"`
	InfraMachine    *unstructured.Unstructured `json:"infraMachine"`
}

// UpgradeTaskStatus defines the observed state of UpgradeTask
type UpgradeTaskStatus struct {
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// UpgradeTask is the Schema for the upgradetasks API
type UpgradeTask struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              UpgradeTaskSpec   `json:"spec,omitempty"`
	Status            UpgradeTaskStatus `json:"status,omitempty"`
}
```
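The three phases suggest a small state machine. The sketch below shows one way phase changes could be guarded. The transition table is an assumption (the design does not spell out which transitions are legal), and `SetPhase` is a hypothetical helper, not part of the proposed API.

```go
package main

import "fmt"

// UpgradeTaskPhase mirrors the phase type declared above.
type UpgradeTaskPhase string

const (
	OngoingPhase  UpgradeTaskPhase = "Ongoing"
	AbortPhase    UpgradeTaskPhase = "Abort"
	CompletePhase UpgradeTaskPhase = "Complete"
)

// allowedTransitions is an ASSUMED policy: an ongoing task may be aborted
// or completed, while Abort and Complete are treated as terminal.
var allowedTransitions = map[UpgradeTaskPhase][]UpgradeTaskPhase{
	OngoingPhase: {AbortPhase, CompletePhase},
}

// SetPhase (hypothetical helper) returns the new phase when the transition
// is allowed, or the unchanged phase and an error when it is not.
func SetPhase(current, next UpgradeTaskPhase) (UpgradeTaskPhase, error) {
	if current == next {
		return current, nil
	}
	for _, p := range allowedTransitions[current] {
		if p == next {
			return next, nil
		}
	}
	return current, fmt.Errorf("invalid phase transition %q -> %q", current, next)
}
```

Treating Abort and Complete as terminal would mean a new upgrade request always gets a fresh UpgradeTask rather than reviving an old one, which matches the webhook flow where an old task is aborted before a new one is created.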
### ExternalUpgradeRequest Webhook
```plantuml
@startuml
start
:Inplace upgrade extension webhook;
:ExternalUpgradeRequest gets called;
if (Is there an ongoing UpgradeTask for the
same MachineDeployment/CP?) then (yes)
:Abort ongoing UpgradeTask;
else (no)
endif
if (Is the request valid and is the
extension able to handle the changes?) then (yes)
:Create UpgradeTask CR;
:Respond **Accept**;
else (no)
:Respond **Decline**;
endif
stop
@enduml
```
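The webhook flow above can be sketched in plain Go. This is a minimal illustration under stated assumptions: `Task` and `Request` are simplified stand-ins for the UpgradeTask CR and the ExternalUpgradeRequest payload, and the `supportedFields` allow-list is a hypothetical way to decide whether the extension can handle a change.

```go
package main

// Phase is a simplified stand-in for UpgradeTaskPhase.
type Phase string

const (
	Ongoing Phase = "Ongoing"
	Abort   Phase = "Abort"
)

// Task is a simplified stand-in for the UpgradeTask CR.
type Task struct {
	Name   string
	Target string // name of the MachineDeployment or control plane
	Phase  Phase
}

// Request is a hypothetical, reduced view of an ExternalUpgradeRequest.
type Request struct {
	Target        string
	ChangedFields []string
}

// supportedFields is an ASSUMED allow-list of changes the extension handles.
var supportedFields = map[string]bool{"version": true, "osImage": true}

// canHandle reports whether every requested change is supported.
func canHandle(req Request) bool {
	if len(req.ChangedFields) == 0 {
		return false
	}
	for _, f := range req.ChangedFields {
		if !supportedFields[f] {
			return false
		}
	}
	return true
}

// HandleUpgradeRequest follows the diagram: first abort any ongoing task for
// the same target (in place, on the given slice), then either create a new
// task and accept, or decline. It returns the task list and the decision.
func HandleUpgradeRequest(tasks []Task, req Request) ([]Task, bool) {
	for i := range tasks {
		if tasks[i].Target == req.Target && tasks[i].Phase == Ongoing {
			tasks[i].Phase = Abort
		}
	}
	if !canHandle(req) {
		return tasks, false // Decline
	}
	tasks = append(tasks, Task{
		Name:   req.Target + "-upgrade",
		Target: req.Target,
		Phase:  Ongoing,
	})
	return tasks, true // Accept
}
```

Note that, as in the diagram, the old task is aborted even when the new request ends up being declined.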
### Reconcile loop
```plantuml
@startuml reconcile
start
:UpgradeTask controller;
:UpgradeTask controller enqueues a Reconcile call;
:Get Machines for given MachineDeployment/CP;
if (UpgradeTask stage ongoing) then (yes)
repeat :iterate Machines
if (Is Machine owned by any UpgradeTask?) then (no)
:Add MachineUpToDate condition with reason 'waiting';
:Add UpgradeTask owner;
else (yes)
endif
repeat while (more Machines?)
if (Among owned Machines, is there one whose \nMachineUpToDate condition reason is \n'ExternalUpgradeInProgress'?) then (found)
#lightGreen:**Doing** inplace upgrade;
if (Upgrade succeeded?) then (yes)
:Update Machine Spec;
:Mark MachineUpToDate condition 'Ready';
else if (Upgrade failed or timed out?) then (yes)
:Update MachineUpToDate condition \nreason to 'Failed';
else (no)
endif
else (no)
if (Select 1 owned Machine to be upgraded \nbased on preflight check) then (selected)
:Update MachineUpToDate condition \nreason to 'ExternalUpgradeInProgress';
else (no)
if (All Machines upgraded?) then (yes)
:Mark UpgradeTask status to ready;
stop
else (no)
endif
endif
endif
:ReEnqueue;
else if (UpgradeTask stage aborted) then (yes)
repeat :iterate Machines
if (Check if machine is owned by current UpgradeTask?) then (yes)
if (Check if machine is doing inplace upgrade?) then (yes)
#lightGreen:**Abort** inplace upgrade;
else (no)
endif
:Remove MachineUpToDate condition;
:Remove UpgradeTask owner;
else (no)
endif
repeat while (more machines?)
else (no)
endif
stop
@enduml
```
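The ongoing branch of the reconcile loop can be sketched as a single pass over the Machines. This is a simplified model, not the controller itself: `Machine` is a stand-in with only an owner and a MachineUpToDate reason, the `preflight` callback is hypothetical, and the actual upgrade and its result handling are left out.

```go
package main

// Reason models the MachineUpToDate condition reason used in the diagram.
type Reason string

const (
	Waiting    Reason = "waiting"
	InProgress Reason = "ExternalUpgradeInProgress"
	Failed     Reason = "Failed"
	Ready      Reason = "Ready"
)

// Machine is a simplified stand-in for a CAPI Machine.
type Machine struct {
	Name   string
	Owner  string // owning UpgradeTask name; empty when unowned
	Reason Reason
}

// ReconcileOngoing runs one pass for an UpgradeTask in the ongoing stage.
// It returns true when every owned Machine is Ready; otherwise the caller
// is expected to re-enqueue.
func ReconcileOngoing(task string, machines []Machine, preflight func(Machine) bool) bool {
	// Claim Machines that no UpgradeTask owns yet.
	for i := range machines {
		if machines[i].Owner == "" {
			machines[i].Owner = task
			machines[i].Reason = Waiting
		}
	}
	// At most one Machine is upgraded at a time: if one is already in
	// progress, wait for it instead of selecting another.
	allReady := true
	for _, m := range machines {
		if m.Owner != task {
			continue
		}
		if m.Reason == InProgress {
			return false
		}
		if m.Reason != Ready {
			allReady = false
		}
	}
	if allReady {
		return true
	}
	// Select one waiting Machine that passes the preflight check.
	for i := range machines {
		if machines[i].Owner == task && machines[i].Reason == Waiting && preflight(machines[i]) {
			machines[i].Reason = InProgress
			break
		}
	}
	return false
}
```

In this sketch a Machine whose reason is 'Failed' is never selected again and keeps the task from completing, which leaves remediation to a separate mechanism.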
Preflight checks:
- In-place upgrade shall not start if the node is not in the Ready condition.
- In-place upgrade shall not start if the node is being deleted.
- In-place upgrade shall not start unless totalAvailableNodeNum > nodeNum - maxUnavailable, so that taking one node out of service still leaves at least nodeNum - maxUnavailable nodes available.
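The preflight checks can be sketched as one predicate. This sketch reads the availability rule as a disruption budget (an assumption about the intent): an upgrade may start only while the number of available nodes exceeds nodeNum - maxUnavailable. `Node` and `PreflightOK` are hypothetical names, and "available" is taken to mean Ready and not being deleted.

```go
package main

// Node is a simplified stand-in carrying only what the checks need.
type Node struct {
	Name     string
	Ready    bool
	Deleting bool
}

// PreflightOK reports whether the candidate may start an in-place upgrade,
// along with a reason when it may not.
func PreflightOK(candidate Node, nodes []Node, maxUnavailable int) (bool, string) {
	if !candidate.Ready {
		return false, "node is not Ready"
	}
	if candidate.Deleting {
		return false, "node is being deleted"
	}
	// Count nodes that are currently available.
	available := 0
	for _, n := range nodes {
		if n.Ready && !n.Deleting {
			available++
		}
	}
	// Upgrading takes one node out of service, so there must be headroom
	// above the minimum of len(nodes) - maxUnavailable.
	if available <= len(nodes)-maxUnavailable {
		return false, "unavailability budget exhausted"
	}
	return true, ""
}
```

For example, with three nodes and maxUnavailable set to 1, an upgrade can start when all three are available, but not while one of them is already unavailable.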
### NodeUpgrader
```plantuml
@startuml
start
if (NodeUpdate stage?) then (ongoing)
if (Has UpdatePlan spec?) then (yes)
else (no)
:Retrieve machine and infraMachine CR;
:Set UpdatePlan;
endif
if (Has ongoing Trident operation?) then (yes)
:Retrieve operation result and update status;
stop
else (no)
endif
if (Need update?) then (yes)
:Prepare Trident HostConfiguration to update OS Image;
:Prepare kubeadm update script for node \nand embed it in Trident HostConfiguration;
:Submit HostConfiguration;
else (no)
:Set status as Ready;
endif
else (abort)
if (Has ongoing Trident operation?) then (yes)
:Retrieve operation result and update status;
stop
else (no)
:Set status as Ready;
endif
endif
stop
@enduml
```
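The NodeUpdate flow above reduces to a small decision function. The sketch below only decides which step comes next; it does not talk to Trident, and every type and name in it (`NodeUpdate`, `NextAction`, the action constants) is hypothetical.

```go
package main

// Stage is the NodeUpdate stage from the diagram.
type Stage string

const (
	StageOngoing Stage = "ongoing"
	StageAbort   Stage = "abort"
)

// Action names the next step the node upgrader should take.
type Action string

const (
	PollOperation           Action = "PollOperation"           // retrieve operation result and update status
	SubmitHostConfiguration Action = "SubmitHostConfiguration" // prepare and submit the HostConfiguration
	MarkReady               Action = "MarkReady"               // set status as Ready
)

// NodeUpdate is a simplified stand-in for the NodeUpdate CR.
type NodeUpdate struct {
	Stage            Stage
	HasUpdatePlan    bool
	OperationRunning bool // an operation is still in flight on the host
	NeedsUpdate      bool
}

// NextAction follows the diagram for both stages.
func NextAction(nu *NodeUpdate) Action {
	if nu.Stage == StageOngoing && !nu.HasUpdatePlan {
		// Stands in for retrieving the Machine and InfraMachine CRs
		// and deriving the UpdatePlan from them.
		nu.HasUpdatePlan = true
	}
	if nu.OperationRunning {
		return PollOperation
	}
	if nu.Stage == StageOngoing && nu.NeedsUpdate {
		return SubmitHostConfiguration
	}
	return MarkReady
}
```

In the abort stage the function never submits new work: it either keeps polling an operation that is already running or reports Ready.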
```plantuml
@startuml
start
:Upgrade kubeadm;
if (ControlPlane?) then (yes)
:Get ClusterConfiguration;
if (K8s version == Target version?) then (yes)
:kubeadm upgrade node;
else (no)
:kubeadm upgrade apply $target_version;
endif
else (no)
:kubeadm upgrade node;
endif
:Drain node;
:Upgrade kubectl;
:Upgrade & Restart kubelet;
:Uncordon node;
stop
@enduml
```
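The kubeadm flow above can be expressed as an ordered step list. The sketch below is illustrative only: `UpgradeSteps` is a hypothetical helper, package upgrades are written as descriptive labels because the actual commands depend on the distribution, and `clusterVersion` is assumed to be the Kubernetes version read from the ClusterConfiguration.

```go
package main

// UpgradeSteps returns the ordered steps for upgrading one node to target.
// For a control plane node whose cluster is not yet at the target version,
// it uses "kubeadm upgrade apply"; in every other case it uses
// "kubeadm upgrade node".
func UpgradeSteps(node string, controlPlane bool, clusterVersion, target string) []string {
	steps := []string{"upgrade kubeadm package to " + target}
	if controlPlane && clusterVersion != target {
		steps = append(steps, "kubeadm upgrade apply "+target)
	} else {
		steps = append(steps, "kubeadm upgrade node")
	}
	return append(steps,
		"kubectl drain "+node+" --ignore-daemonsets",
		"upgrade kubectl package to "+target,
		"upgrade kubelet package to "+target,
		"systemctl daemon-reload",
		"systemctl restart kubelet",
		"kubectl uncordon "+node,
	)
}
```

Only the first control plane node to be upgraded sees a cluster version that differs from the target, so it is the only one that runs `kubeadm upgrade apply`; later control plane nodes and all workers run `kubeadm upgrade node`.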
### Error handling
```plantuml
@startuml
actor User as user
participant "CAPI" as capi
participant "**cluster In-place upgrader**" as cup
participant "node In-place upgrader" as nup
participant host
user -> capi: upgrade cluster os/k8s version
capi -> cup: call Lifecycle Runtime Hook \n**ExternalUpgradeRequest**
cup --> capi: accept
cup -> cup: create UpgradeTask CR
activate cup
loop while nodesToUpdate > 0
cup -> capi: preflight check \nand select 1 node to update
cup -> nup: create NodeUpdate CR
activate nup
nup --> cup
nup -> host: in-place upgrade node
host --> nup
alt #Pink failed & retryable?
nup -> host: retry in-place upgrade node
host --> nup
else failed & not retryable?
nup -> nup: set NodeUpdate \nfailureReason & failureMessage
nup --> cup: watch NodeUpdate changes
cup -> cup: aggregate conditions
cup -> capi: set Machine MachineUpToDate \ncondition with failure info
capi -> capi: MachineHealthCheck detects \nunhealthy state and marks \nmachine for remediation
capi -> capi: Remediate machine
end
deactivate nup
end
cup -> capi: upgrade completed
deactivate cup
@enduml
```
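The retry branch of the error-handling sequence could look like the sketch below. It assumes a retryable failure is signalled by wrapping a sentinel error and that retries are bounded by an attempt count; neither is specified by the design. `RunWithRetry`, the sentinel errors, and the reason strings are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRetryable marks failures worth retrying (assumed convention).
var ErrRetryable = errors.New("retryable upgrade failure")

// ErrPermanent is an example of a failure that is not retryable.
var ErrPermanent = errors.New("permanent upgrade failure")

// NodeUpdateStatus is a simplified stand-in for the NodeUpdate status.
type NodeUpdateStatus struct {
	Ready          bool
	FailureReason  string
	FailureMessage string
}

// RunWithRetry calls upgrade until it succeeds, fails with a non-retryable
// error, or maxAttempts is reached. On failure it fills failureReason and
// failureMessage so the cluster upgrader can surface them on the Machine.
func RunWithRetry(upgrade func() error, maxAttempts int) NodeUpdateStatus {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = upgrade()
		if err == nil {
			return NodeUpdateStatus{Ready: true}
		}
		if !errors.Is(err, ErrRetryable) {
			return NodeUpdateStatus{
				FailureReason:  "UpgradeFailed",
				FailureMessage: err.Error(),
			}
		}
	}
	return NodeUpdateStatus{
		FailureReason:  "RetriesExhausted",
		FailureMessage: fmt.Sprintf("gave up after %d attempts: %v", maxAttempts, err),
	}
}
```

Bounding the retries matters here because the failure only reaches MachineHealthCheck once the status carries a failure reason; an unbounded retry loop would hide a broken node from remediation.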