# A general design of In-place upgrade extension

This doc proposes a design for an in-place upgrade extension. It mainly focuses on the interaction between CAPI and the extension, and on the orchestration of upgrade operations. To keep the extension general purpose, it has no dependency on any infrastructure; instead it provides a contract that delegates the node-level in-place upgrade to an external entity.

Through this doc, we aim to:
- Verify the CAPI in-place upgrade contract and prove that the contract design is comprehensive enough to implement in-place upgrade.
- Provide a design of an in-place upgrade extension as a showcase for the community, guiding developers to implement their own extensions correctly.
- Provide a general implementation of the in-place upgrade extension. If it meets their needs, developers can leverage it and reduce their work to implementing only the infrastructure-related in-place upgrade logic.

## Proposal

### Components

```plantuml
@startuml
node "Host(s)" {
[Agent process]
}

cloud "management cluster" {
[CAPI] --> WebHook
[Cluster In-place upgrader] - WebHook
[Cluster In-place upgrader] --> NodeUpgradeCR
[Node upgrader] - NodeUpgradeCR
[Node upgrader] --> [Agent process]
}
@enduml
```

### Sequence diagram

1. A new CRD UpgradeTask is introduced to track the in-place upgrade spec and status.
2. Implement the ExternalUpgradeRequest webhook. When the webhook is called, the extension aborts any old upgrade on the same CP/MachineDeployment, then determines whether it is able to handle the upgrade request and, if so, creates an UpgradeTask CR.
3. In the UpgradeTask reconcile loop:
   - If the UpgradeTask is in the abort stage, abort any ongoing upgrade and remove ownership.
   - If the UpgradeTask is in the ongoing stage, set machine ownership, select a node based on the preflight policy, and upgrade it.
4. Provide the contract NodeUpdateTemplate to delegate the node upgrade operation to an external entity.
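The node selection in step 3 can be sketched as below. This is a minimal illustration, not the extension's actual code: the `Node` type and `selectNode` function are hypothetical, and the availability rule assumes that an upgrade may start only while the number of available nodes is strictly greater than `nodeNum - maxUnavailable`, so that taking one node out of service never drops availability below that floor.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a Machine and its Node.
type Node struct {
	Name     string
	Ready    bool // Node Ready condition is true
	Deleting bool // deletion has been requested
	Upgraded bool // already at the target version
}

// selectNode returns the first node that passes the preflight checks,
// or false when no node may start an in-place upgrade right now.
func selectNode(nodes []Node, maxUnavailable int) (Node, bool) {
	available := 0
	for _, n := range nodes {
		if n.Ready && !n.Deleting {
			available++
		}
	}
	// Assumed budget rule: upgrading one more node must not drop the
	// available count below len(nodes) - maxUnavailable.
	if available <= len(nodes)-maxUnavailable {
		return Node{}, false
	}
	for _, n := range nodes {
		// Skip nodes that are done, not ready, or being deleted.
		if n.Upgraded || !n.Ready || n.Deleting {
			continue
		}
		return n, true
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{Name: "cp-0", Ready: true, Upgraded: true},
		{Name: "cp-1", Ready: true},
		{Name: "cp-2", Ready: true},
	}
	if n, ok := selectNode(nodes, 1); ok {
		fmt.Println("upgrade", n.Name)
	}

	// With one node already unavailable the budget is used up.
	nodes[2].Ready = false
	if _, ok := selectNode(nodes, 1); !ok {
		fmt.Println("wait: availability budget exhausted")
	}
}
```

Returning a boolean instead of an error keeps "nothing to select right now" a normal outcome, which matches a reconcile loop that simply re-enqueues and tries again later.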
```plantuml
@startuml
actor User as user
participant "CAPI" as capi
participant "**cluster In-place upgrader**" as cup
participant "node In-place upgrader" as nup
participant host

user -> capi: upgrade cluster os/k8s version
capi -> cup: call Lifecycle Runtime Hook \n**ExternalUpgradeRequest**
cup --> capi: accept
cup -> cup: create UpgradeTask CR
activate cup
loop while nodesToUpdate > 0
cup -> capi: preflight check \nand select 1 node to update
cup -> nup: create node update CR
activate nup
nup --> cup
nup -> host: in-place upgrade node
host --> nup
nup --> cup: completed
deactivate nup
end
cup -> capi: upgrade completed
deactivate cup
@enduml
```

### UpgradeTask CR

```golang
// UpgradeTaskPhase is the lifecycle stage of an UpgradeTask.
type UpgradeTaskPhase string

const (
	OngoingPhase  UpgradeTaskPhase = "Ongoing"
	AbortPhase    UpgradeTaskPhase = "Abort"
	CompletePhase UpgradeTaskPhase = "Complete"
)

// UpgradeTaskSpec defines the desired state of UpgradeTask
type UpgradeTaskSpec struct {
	Cluster                *clusterv1.Cluster      `json:"cluster,omitempty"`
	ControlPlaneRef        *corev1.ObjectReference `json:"controlPlaneRef,omitempty"`
	MachineDeploymentRef   *corev1.ObjectReference `json:"machineDeploymentRef,omitempty"`
	MachinesRequireUpgrade []*clusterv1.Machine    `json:"machinesRequireUpgrade,omitempty"`
	NewMachineSpec         *MachineSpec            `json:"newMachineSpec,omitempty"`

	// Template defines the template of Machine In-place upgrader
	Template *MachineUpgraderTemplate `json:"template,omitempty"`

	Phase UpgradeTaskPhase `json:"phase,omitempty"`
}

// MachineUpgraderTemplate references the external node upgrader.
type MachineUpgraderTemplate struct {
	UpgraderRef corev1.ObjectReference `json:"upgraderRef"`
}

// MachineSpec holds the target bootstrap config and infra machine.
type MachineSpec struct {
	BootstrapConfig *unstructured.Unstructured `json:"bootstrapConfig"`
	InfraMachine    *unstructured.Unstructured `json:"infraMachine"`
}

// UpgradeTaskStatus defines the observed state of UpgradeTask
type UpgradeTaskStatus struct {
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// UpgradeTask is the Schema for the upgradetasks API
type UpgradeTask struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   UpgradeTaskSpec   `json:"spec,omitempty"`
	Status UpgradeTaskStatus `json:"status,omitempty"`
}
```

### ExternalUpgradeRequest Webhook

```plantuml
@startuml
start
:Inplace upgrade extension webhook;
:ExternalUpgradeRequest gets called;
if (Does an ongoing UpgradeTask exist for the same MachineDeployment/CP?) then (yes)
  :Abort ongoing UpgradeTask;
else (no)
endif
if (Verify request and determine if extension is able to handle changes?) then (yes)
  :Create UpgradeTask CR;
  :Respond **Accept**;
else (no)
  :Respond **Decline**;
endif
stop
@enduml
```

### Reconcile loop

```plantuml
@startuml reconcile
start
:UpgradeTask controller;
:UpgradeTask controller enqueues a Reconcile call;
:Get Machines for given MachineDeployment/CP;
if (UpgradeTask stage ongoing) then (yes)
  repeat :iterate Machines;
    if (Is Machine owned by any UpgradeTask?) then (no)
      :Add MachineUpToDate condition with reason 'waiting';
      :Add UpgradeTask owner;
    else (yes)
    endif
  repeat while (more Machines?)
  if (Query owned Machines, find a machine whose \nMachineUpToDate condition reason is \n'ExternalUpgradeInProgress') then (found)
    #lightGreen:**Doing** inplace upgrade;
    if (Upgrade succeeded?) then (yes)
      :Update Machine Spec;
      :Mark MachineUpToDate condition 'Ready';
    else if (Upgrade failed or timed out?) then (yes)
      :Update MachineUpToDate condition \nreason to 'Failed';
    else (no)
    endif
  else (no)
    if (Select 1 owned Machine to be upgraded \nbased on preflight check) then (selected)
      :Update MachineUpToDate condition \nreason to 'ExternalUpgradeInProgress';
    else (no)
      if (All Machines upgraded?) then (yes)
        :Mark UpgradeTask status to ready;
        stop
      else (no)
      endif
    endif
  endif
  :ReEnqueue;
else if (UpgradeTask stage aborted) then (yes)
  repeat :iterate Machines;
    if (Check if machine is owned by current UpgradeTask?) then (yes)
      if (Check if machine is doing inplace upgrade?) then (yes)
        #lightGreen:**Abort** inplace upgrade;
      else (no)
      endif
      :Remove MachineUpToDate condition;
      :Remove UpgradeTask owner;
    else (no)
    endif
  repeat while (more machines?)
else (no)
endif
stop
@enduml
```

Preflight checks:
- In-place upgrade shall not start if the node condition is not in the ready state.
- In-place upgrade shall not start if the node is in the deleting state.
- In-place upgrade shall not start unless totalAvailableNodeNum > nodeNum - maxUnavailable, so that taking one more node out of service stays within the maxUnavailable budget.

### NodeUpgrader

```plantuml
@startuml
start
if (NodeUpdate stage?) then (ongoing)
  if (Has UpdatePlan spec?) then (yes)
  else (no)
    :Retrieve machine and infraMachine CR;
    :Set UpdatePlan;
  endif
  if (Has ongoing trident operation) then (yes)
    :Retrieve operation result and update status;
    stop
  else (no)
  endif
  if (Need update?) then (yes)
    :Prepare Trident HostConfiguration to update OS Image;
    :Prepare Kubeadm update script for node \nand embed to Trident HostConfiguration;
    :Submit HostConfiguration;
  else (no)
    :Set status as Ready;
  endif
else (abort)
  if (Has ongoing trident operation) then (yes)
    :Retrieve operation result and update status;
    stop
  else (no)
    :Set status as Ready;
  endif
endif
stop
@enduml
```

```plantuml
@startuml
start
:Upgrade kubeadm;
if (ControlPlane?) then (yes)
  :Get ClusterConfiguration;
  if (K8s version == Target version?) then (yes)
    :kubeadm upgrade node;
  else (no)
    :kubeadm upgrade apply $target_version;
  endif
else (no)
  :kubeadm upgrade node;
endif
:Drain node;
:Upgrade kubectl;
:Upgrade & Restart kubelet;
:Uncordon node;
stop
@enduml
```

### Error handling

```plantuml
@startuml
actor User as user
participant "CAPI" as capi
participant "**cluster In-place upgrader**" as cup
participant "node In-place upgrader" as nup
participant host

user -> capi: upgrade cluster os/k8s version
capi -> cup: call Lifecycle Runtime Hook \n**ExternalUpgradeRequest**
cup --> capi: accept
cup -> cup: create UpgradeTask CR
activate cup
loop while nodesToUpdate > 0
cup -> capi: preflight check \nand select 1 node to update
cup -> nup: create NodeUpdate CR
activate nup
nup --> cup
nup -> host: in-place upgrade node
host --> nup
alt #Pink failed & retryable?
nup -> host: retry in-place upgrade node
host --> nup
else failed & not retryable?
nup -> nup: set NodeUpdate \nfailureReason & failureMessage
nup --> cup: watch NodeUpdate changes
cup -> cup: aggregate conditions
cup -> capi: set Machine MachineUpToDate \ncondition with failure info
capi -> capi: MachineHealthCheck detects \nunhealthy state and marks \nmachine for remediation
capi -> capi: Remediate machine
end
deactivate nup
end
cup -> capi: upgrade completed
deactivate cup
@enduml
```