
How to replace all control plane nodes.

Introduction:
If for some reason you need to completely replace all of the control plane nodes in an HA cluster built with kubeadm, there is a specific order of operations that preserves the health of your production cluster. The goal is to make these changes with zero downtime. The order of operations matters because of how etcd works: the cluster must keep a quorum of healthy etcd members at all times, so nodes are removed and added one at a time. Read this link to understand why we are doing this in this specific order.

Infrastructure Assumptions:
Host OS: Ubuntu 20.04
Kubernetes Version: 1.26.0
Internal (stacked) etcd cluster
3 control plane nodes (old)
3 control plane nodes (new)
3 worker nodes

Dev, control plane, and worker hosts are on the same network and can communicate with each other.

A single host where you can issue kubectl commands against the cluster in question. We will call this the "dev" host.

Setup:
We need to back up the etcd data store. If we crash the cluster during this procedure, we then have a way to recover. Please review the tutorial here on how to back up your etcd cluster.
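As a minimal sketch, a snapshot can be taken with etcdctl from any control plane node that has the tool installed (the certificate paths assume the default kubeadm locations, and the output file name is only an example):

sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
sudo ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table

Store the snapshot somewhere off the control plane nodes, since those hosts are exactly what we are about to remove.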

Ideally, you would create three brand-new nodes configured and ready to accept the kubeadm join command, and have them standing by.

{{< note >}}
If you use DHCP to assign IP addresses, make sure each host has a static mapping. The IP address of any host in the cluster must never change after bootstrapping.
{{< /note >}}
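Preparing a node to accept a kubeadm join is outside the scope of this guide, but as a rough sketch, assuming a container runtime such as containerd is already installed and the Kubernetes apt repository is already configured, the package setup on each new Ubuntu host might look like this (the exact package version string depends on the repository you use):

sudo apt-get update
sudo apt-get install -y kubelet=1.26.0-00 kubeadm=1.26.0-00 kubectl=1.26.0-00
sudo apt-mark hold kubelet kubeadm kubectl

The versions shown match the cluster version used in this guide; match them to your own cluster.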

Next, we need to install the etcdctl tool on the dev host. This tool lets you interact directly with the etcd servers on the control plane nodes. Normally we would run this tool on one of the control plane nodes, but since we will be removing all of them, the tool needs to live on a host that persists through the removal process. There are three things we need to do.

  1. On the dev host, sudo apt install etcd-client -y
  2. Copy these three files from any existing control plane node and place them in the home directory of the user logged into the dev host (see the copy sketch below). You will need these files to authenticate with the etcd servers.
    • /etc/kubernetes/pki/etcd/ca.crt
    • /etc/kubernetes/pki/etcd/peer.crt
    • /etc/kubernetes/pki/etcd/peer.key
  3. To make sure we have the right endpoints, we are going to use environment variables to build the endpoint string. Here are the initial values we need to set:
export ETCDCTL_API=3
export HOST_1=<ip of old cp1 node>
export HOST_2=<ip of old cp2 node>
export HOST_3=<ip of old cp3 node>
export ENDPOINTS=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379

This is needed because we will be swapping hosts in and out, and the endpoints need to change accordingly.
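As a minimal sketch of the file copy in step 2, assuming you have SSH access to one of the old control plane nodes and passwordless sudo on it (the user and IP are placeholders):

ssh <user>@<ip of old cp1 node> 'sudo cat /etc/kubernetes/pki/etcd/ca.crt' > ~/ca.crt
ssh <user>@<ip of old cp1 node> 'sudo cat /etc/kubernetes/pki/etcd/peer.crt' > ~/peer.crt
ssh <user>@<ip of old cp1 node> 'sudo cat /etc/kubernetes/pki/etcd/peer.key' > ~/peer.key

The peer key is sensitive, so restrict its permissions on the dev host (for example, chmod 600 ~/peer.key) and delete it when the replacement is finished.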

Gather Facts:

Let's query etcd and gather our first required piece of information: which node is the current leader.
etcdctl endpoint status --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS} --write-out=table

Output:

+----------------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------+------------------+---------+---------+-----------+-----------+------------+
|  <IP of HOST_1>:2379 | cf6b4f78d74de8cf |   3.5.6 |  144 MB |     false |        17 |   38133364 |
|  <IP of HOST_2>:2379 | dfe0ecd90aef9c99 |   3.5.4 |  149 MB |     false |        17 |   38133364 |
|  <IP of HOST_3>:2379 | aad7afaa2e5cc464 |   3.5.4 |  151 MB |      true |        17 |   38133364 |
+----------------------+------------------+---------+---------+-----------+-----------+------------+

We see that HOST_3 is the current leader. This is important because we will replace this host last. It is not critical, but it helps ensure the health of the cluster. So the order of operations is this:

  1. Remove old control plane node 1
  2. Add new control plane node 1
  3. Remove old control plane node 2
  4. Add new control plane node 2
  5. Transfer etcd leadership to new control plane node 1
  6. Remove old control plane node 3
  7. Add new control plane node 3

Replacement Process:

These steps will be the same for each old node we are replacing. The only exception will be the last node, the leader, where we will transfer etcd leadership to a new node before we remove the remaining old node.

Before we get started, we need the kubeadm join string for a control plane node. On one of the existing control plane nodes, upload the control plane certificates and note the certificate key it prints, then generate a join command:
sudo kubeadm init phase upload-certs --upload-certs
sudo kubeadm token create --print-join-command
Append --control-plane --certificate-key <key from the first command> to the printed join command. Keep this string available, as you will need to use it on each new node. You should see something similar to:
kubeadm join <IP:PORT of load balanced API endpoint> --token <redact> --discovery-token-ca-cert-hash sha256:<redact> --control-plane --certificate-key <redact>

{{< note >}}
Depending on how your cluster was initially set up, your values will differ, but each element of the join command has to be present. Also note that the uploaded certificates and certificate key are only valid for a limited time (two hours by default), so rerun the commands above if the replacement process takes longer.
{{< /note >}}
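If you want to confirm the bootstrap token is still valid before using the join command, you can list the tokens on a control plane node:

sudo kubeadm token list

The token from the join string should appear in the output with a TTL that has not expired.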

  1. First we need to remove the old node from the etcd cluster. Get the ID of the old member being removed from the information we gathered previously. From the dev host, use the command:
    etcdctl member remove <ID of etcd member being removed> --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=<IP of first remaining node>:2379,<IP of second remaining node>:2379
    {{< note >}}
    Do not include the IP:port of the node being removed in the --endpoints list for this command. We cannot use the ENDPOINTS environment variable here.
    {{< /note >}}

  2. Now cordon, drain, and remove the node from the cluster. To do this, run the following commands on the dev host:
    kubectl cordon <name of node being removed>
    kubectl drain <name of node being removed> --ignore-daemonsets   (wait until all pods are drained)
    kubectl delete node <name of node being removed>

  3. Shut down the host. On the host being removed, run sudo shutdown -h now. This ensures the old node cannot somehow creep back into the node pool.

  4. Log in to the new host, paste the kubeadm join command, and wait for it to complete. To confirm that the new node is part of the cluster and healthy, issue kubectl get nodes on the dev host. You should see the list of nodes in the Ready state; the names will be unique to your cluster. If your new node is not Ready, wait until it is before proceeding.

NAME  STATUS ROLES           AGE     VERSION
cp1   Ready  control-plane   91d     v1.26.0
cp2   Ready  control-plane   91d     v1.26.0
cp3   Ready  control-plane   91d     v1.26.0
wk1   Ready  worker          2d23h   v1.26.0
wk2   Ready  worker          2d3h    v1.26.0
wk3   Ready  worker          2d2h    v1.26.0
  5. We also need to confirm that the etcd cluster is healthy. First we need to re-export the IP address of the new node. Each time you cycle through this process you will need to re-export accordingly, and also rebuild ENDPOINTS, because it was expanded when it was first set. So back on the dev host, issue the appropriate export, rebuild ENDPOINTS, and then run the command:
export HOST_1=<ip of first new cp node>
or
export HOST_2=<ip of second new cp node>
or
export HOST_3=<ip of third new cp node>

export ENDPOINTS=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379

etcdctl member list --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS} --write-out=table

Output:
+------------------+---------+---------------+---------------------------+---------------------------+
|        ID        | STATUS  |          NAME |       PEER ADDRS          |        CLIENT ADDRS       |
+------------------+---------+---------------+---------------------------+---------------------------+
| aad7afaa2e5cc464 | started | <name of node>| https://<IP of node>:2380 | https://<IP of node>:2379 |
| cf6b4f78d74de8cf | started | <name of node>| https://<IP of node>:2380 | https://<IP of node>:2379 |
| dfe0ecd90aef9c99 | started | <name of node>| https://<IP of node>:2380 | https://<IP of node>:2379 |
+------------------+---------+---------------+---------------------------+---------------------------+

{{< note >}}
Look carefully at the IP addresses in the output. These will change as you replace nodes.
{{< /note >}}

{{< caution >}}
If your etcd cluster is not healthy, or one of the members is not started/ready, you have to stop. If you remove an additional node, you will crash your Kubernetes cluster hard, and recovery is a difficult process with no guarantee of success. You have to take steps to remedy the new node before you proceed. You MUST have 3 healthy etcd members.
{{< /caution >}}
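In addition to member list, a quick way to confirm health from the dev host is etcdctl endpoint health, using the same certificates and ENDPOINTS variable as above:

etcdctl endpoint health --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS} --write-out=table

All three endpoints should report healthy before you move on.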

Once healthy, we can proceed to the next node. For each node to be replaced, follow the above procedure, paying particular attention to the IPs and hostnames so they match the node you are working on. When you get to the final node (the original leader), you will need to transfer etcd leadership to one of the new nodes first.
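Transferring leadership requires the member ID (the hex value in the first column of the member list output) of one of the new control plane nodes. One way to pull it out, assuming the node name is the one shown by member list (the grep pattern is a placeholder):

etcdctl member list --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS} | grep <name of new cp node>

The first field of the matching line is the member ID used in the command below.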

Before step 1 for the final node, transfer etcd leadership to one of the new members. From the dev host, issue this command (etcdctl move-leader hands leadership to the member ID you specify):

etcdctl move-leader <member ID of any new cp node> --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS}

Let's check to make sure the leadership transfer was successful.

etcdctl endpoint status --cacert=ca.crt --cert=peer.crt --key=peer.key --endpoints=${ENDPOINTS} --write-out=table
Output
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| <IP of HOST_1>:2379 | cf6b4f78d74de8cf |   3.5.6 |  144 MB |     true  |        17 |   38133364 |
| <IP of HOST_2>:2379 | dfe0ecd90aef9c99 |   3.5.4 |  149 MB |     false |        17 |   38133364 |
| <IP of HOST_3>:2379 | aad7afaa2e5cc464 |   3.5.4 |  151 MB |     false |        17 |   38133364 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+

We are now safe to replace the final control plane node. Proceed to step 1.

If you are careful and methodical throughout this process, you will end up with a fully functional Kubernetes cluster with completely new control plane nodes and zero downtime.