# Runbook for Rotating the CA of the cluster `jetstack-build-infra-workers-trusted`
This runbook was originally created by Becky Pauley. These steps are based on [Google's Rotate your cluster credentials](https://cloud.google.com/kubernetes-engine/docs/how-to/credential-rotation).
**Why?**
On 28 August 2023, Google sent a notification to owners of the project "jetstack-build-infra-internal":
> Hello GKE Customer,
> We’re writing to remind you that the Kubernetes Certificate Authority (CA) will expire soon for some Google Kubernetes Engine (GKE) clusters in your project(s) listed below. You must perform a credential rotation before your CA expires, or else workloads and cluster operations will be interrupted.
>
> **What do you need to know?**
> GKE uses a CA to establish a root of trust between cluster components and for securing user access to the cluster. The current CA for a cluster, generated as part of cluster creation, is valid for five years, but you can manually rotate it at any time by performing a credential rotation.
> We have changed the length of CA from five to thirty years so a credential rotation will issue a new CA for the cluster that will be valid for thirty years. Credential rotation will also:
> - Change the IP address of the Kubernetes API server.
> - Invalidate all Kubernetes service account tokens.
>
> If a cluster’s CA expires, components within the cluster will no longer be able to communicate, which will prevent all access to the Kubernetes API Server from users and workloads running on nodes.
>
> Please note: As of **9/01/2023, within 30 days of your CA expiring, Google will automatically initiate a credential rotation to ensure that your cluster does not suffer a complete outage**. This automatic rotation may cause API clients outside the cluster that rely on the cluster’s credentials, such as kubectl, to stop working after the rotation is completed until you are able to update the API clients to the new credentials.
>
> **What do you need to do?**
> Our records show that the CA for the following in your project(s) will expire soon:
> **`jetstack-build-infra-internal`**
> To prevent normal cluster operations from being interrupted, please take these steps before the CA’s expiration:
> Perform a credential rotation on the above clusters.
> Update all API clients outside of the clusters (such as kubectl on developer machines) to use the new credentials once credential rotation has been initiated.
**Impact:** The [Prow UI](https://prow.build-infra.jetstack.net) won't be affected, and neither will the Prow jobs triggered by PRs. For a period of 30 minutes to 2 hours, the "trusted" jobs will remain "pending" in the Prow UI. These are jobs such as the ones handling rotten PRs and automatic branch protection.
This minor inconvenience occurs because the cluster `github-build-infra` accesses the cluster `jetstack-build-infra-workers-trusted` through two Secrets, each containing a kubeconfig file:
```text
Project "jetstack-build-infra"          Project "jetstack-build-infra-internal"
+------------------------------+        +------------------------------------------+
| Cluster "github-build-infra" |        | Cluster                                  |
| +--------------------------+ |        | "jetstack-build-infra-workers-trusted"   |
| |                          | |        | +--------------------------------------+ |
| | +--------------------+   | | access | |                                      | |
| | | Secret             |--------------->|                                      | |
| | | "crier-kubeconfig" |   | |        | |                                      | |
| | +--------------------+   | |        | |                                      | |
| |                          | |        | |                                      | |
| | +--------------------+   | | access | |                                      | |
| | | Secret             |--------------->|                                      | |
| | | "kubeconfig"       |   | |        | |                                      | |
| | +--------------------+   | |        | +--------------------------------------+ |
| +--------------------------+ |        +------------------------------------------+
+------------------------------+
```
**Steps:**
- [x] **Start the rotation**: the control plane starts serving on a new IP address in addition to the original IP address. New credentials are issued to workloads and the control plane.
⚠️ Testing indicates **approx 5 mins downtime for cluster API server** after running this command.
To start the rotation:
```bash
gcloud container clusters update jetstack-build-infra-workers-trusted \
--project jetstack-build-infra-internal \
--zone europe-west1-b --start-credential-rotation
```
After this step, re-authenticate and check that your kubeconfig now
contains two root CA certificates (the old one and the new one):
```bash
gcloud container clusters get-credentials jetstack-build-infra-workers-trusted \
--zone europe-west1-b \
--project jetstack-build-infra-internal
kubectl config view --minify --flatten -ojson | jq '.clusters[0].cluster."certificate-authority-data"' -r | base64 -d | certigo dump
```
The output looks something like:
```text
** CERTIFICATE 1 **
Input Format: PEM
Valid: 2018-11-20 15:05 UTC to 2023-11-19 16:05 UTC
Subject:
CN=27cbe66a-8b03-4942-85cb-0e2522a345ca
Issuer:
CN=27cbe66a-8b03-4942-85cb-0e2522a345ca
** CERTIFICATE 2 **
Input Format: PEM
Valid: 2023-08-29 12:07 UTC to 2053-08-21 13:07 UTC
Subject:
CN=3afb4c59-7557-437d-aae1-86bde4ec58fe
Issuer:
CN=3afb4c59-7557-437d-aae1-86bde4ec58fe
```
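If `certigo` is not available, a rougher check is to count the certificates in the CA bundle: while the rotation is in progress, the count should be 2, matching the output above. The following is only a sketch; the helper name `count_pem_certs` and the file name `ca-bundle.pem` are made up for illustration and are not part of the original procedure.

```bash
# count_pem_certs FILE: print how many PEM certificates FILE contains.
# Hypothetical helper, not part of the original runbook.
# Note: grep exits with status 1 when the count is 0.
count_pem_certs() {
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Example usage (assumes kubectl and jq are set up as in the step above):
#   kubectl config view --minify --flatten -ojson \
#     | jq -r '.clusters[0].cluster."certificate-authority-data"' \
#     | base64 -d >ca-bundle.pem
#   count_pem_certs ca-bundle.pem
```

This only counts PEM blocks; it does not show validity dates, so `certigo dump` remains the more informative check.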
- [x] **Update the `crierclient` and `client` certificates present on the `github-build-infra` cluster**:
First, prepare the `create_user.sh` script:
```bash
curl -O https://raw.githubusercontent.com/jetstack/testing/22aa73cf/prow/_hack/create_user.sh
sed -i.bak 's|^USERNAME=.*||' create_user.sh
chmod +x create_user.sh
./create_user.sh
```
Then, create the two client certificates:
```bash
gcloud container clusters get-credentials jetstack-build-infra-workers-trusted \
--project jetstack-build-infra-internal \
--zone europe-west1-b
kubectl delete certificatesigningrequests crierclient 2>/dev/null || true
kubectl delete certificatesigningrequests client 2>/dev/null || true
env USERNAME=crierclient ./create_user.sh
env USERNAME=client ./create_user.sh
```
Then, get the existing `crier-kubeconfig` and `kubeconfig` secrets:
```bash
gcloud container clusters get-credentials github-build-infra \
--project jetstack-build-infra \
--zone europe-west1-b
kubectl get secret -n default crier-kubeconfig -ojson | jq '.data."config"' -r | base64 -d >./crier-kubeconfig.yaml
kubectl get secret -n default kubeconfig -ojson | jq '.data."config"' -r | base64 -d >./kubeconfig.yaml
```
Get the new API server address (stored in `IP` as a full `https://` URL) as well as the new root certificates:
```bash
gcloud container clusters get-credentials jetstack-build-infra-workers-trusted \
--project jetstack-build-infra-internal \
--zone europe-west1-b
kubectl config view --minify --flatten -ojson | jq -r '.clusters[].cluster."certificate-authority-data"' | base64 -d >certificate-authority.pem
IP=$(kubectl config view --minify --flatten -ojson | jq -r '.clusters[].cluster.server')
```
Update the two kubeconfigs with the new IP and new root certificates:
```bash
KUBECONFIG=./crier-kubeconfig.yaml kubectl config set-credentials trusted \
--client-key crierclient.key --client-certificate crierclient.crt --embed-certs
KUBECONFIG=./crier-kubeconfig.yaml kubectl config set-cluster trusted \
--server $IP \
--certificate-authority=certificate-authority.pem --embed-certs
KUBECONFIG=./kubeconfig.yaml kubectl config set-credentials trusted \
--client-key client.key --client-certificate client.crt --embed-certs
KUBECONFIG=./kubeconfig.yaml kubectl config set-cluster trusted \
--server $IP \
--certificate-authority=certificate-authority.pem --embed-certs
```
Finally, update the kubeconfigs in the `github-build-infra` cluster:
```bash
gcloud container clusters get-credentials github-build-infra \
--zone europe-west1-b \
--project jetstack-build-infra
kubectl create secret generic crier-kubeconfig --from-file=config=crier-kubeconfig.yaml --dry-run=client -oyaml | kubectl apply -f -
kubectl create secret generic kubeconfig --from-file=config=kubeconfig.yaml --dry-run=client -oyaml | kubectl apply -f -
```
Check that it works and that no errors are shown in logs:
```bash
kubectl rollout restart deployment crier
kubectl logs -l app=crier --follow
```
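To double-check what actually ended up in the Secrets, the stored kubeconfig can be fetched again and inspected. The sketch below is an illustration rather than part of the original procedure: the helper names `kubeconfig_server` and `kubeconfig_is_embedded` and the file name `crier-check.yaml` are hypothetical, and the helpers rely on simple text matching, assuming the usual layout that `kubectl config` writes.

```bash
# kubeconfig_server FILE: print the first "server:" URL found in a kubeconfig.
# Hypothetical helper; assumes one cluster entry per file, as in this runbook.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}

# kubeconfig_is_embedded FILE: succeed only if the CA, the client certificate
# and the client key are all embedded as *-data fields (--embed-certs).
kubeconfig_is_embedded() {
  for kc_field in certificate-authority-data client-certificate-data client-key-data; do
    if ! grep -q "^ *${kc_field}:" "$1"; then
      echo "missing ${kc_field} in $1" >&2
      return 1
    fi
  done
}

# Example usage, with $IP set as in the step above:
#   kubectl get secret -n default crier-kubeconfig -ojson \
#     | jq -r '.data."config"' | base64 -d >crier-check.yaml
#   [ "$(kubeconfig_server crier-check.yaml)" = "$IP" ] \
#     && kubeconfig_is_embedded crier-check.yaml
```

The same check applies to the `kubeconfig` Secret. Remove the temporary file afterwards, since it contains a client key.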
- [x] **Node pool re-creation.** After the control plane has been reconfigured,
GKE automatically updates your cluster's nodes to use the new IP and
credentials. Each node pool is marked for recreation. GKE does not finish the
credential rotation until the automatic recreation is complete. We will force
the recreation because the maintenance window is scheduled later, and we would
rather be there when that happens.
- [x] **approx 25-30 mins for the `containerd-green-pool` node pool**
Two options for node pool recreation:
1. The node pool is re-created during a maintenance window:
- If the credential rotation starts during a maintenance window, it is
likely that GCP will trigger these node pool changes without the need
for intervention.
2. **If the node pool re-creation does not start automatically during the maintenance window, force it (if there are multiple nodes) by running a `gcloud container clusters upgrade` command whose `--cluster-version` is the version the node pool is already running.**
- [x] `containerd-green-pool`:
```bash
gcloud container clusters upgrade jetstack-build-infra-workers-trusted \
--project jetstack-build-infra-internal \
--zone europe-west1-b \
--node-pool=containerd-green-pool --cluster-version=1.27.3-gke.100
```
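Since the re-creation takes a while, it can be convenient to poll until GKE reports no unfinished operation before moving on. The sketch below is an assumption-laden illustration and not part of the original procedure: the helper names `wait_for` and `no_pending_operations` are hypothetical, and it assumes that unfinished operations listed by `gcloud container operations list` have a status other than `DONE`.

```bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds; give up once TIMEOUT_SECONDS have elapsed.
wait_for() {
  wf_timeout=$1
  wf_interval=$2
  shift 2
  wf_deadline=$(( $(date +%s) + wf_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wf_deadline" ]; then
      echo "timed out after ${wf_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$wf_interval"
  done
}

# no_pending_operations: succeed when GKE lists no unfinished operation.
# Assumption: unfinished operations have a status other than DONE.
no_pending_operations() {
  np_ops=$(gcloud container operations list \
    --project jetstack-build-infra-internal \
    --zone europe-west1-b \
    --filter='status!=DONE' --format='value(name)') || return 1
  [ -z "$np_ops" ]
}

# Example: poll every 60 seconds for up to 45 minutes.
#   wait_for 2700 60 no_pending_operations
```

The timeout of 45 minutes is an arbitrary margin above the 25-30 minutes observed for this pool.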
- [x] **Check that the new `kubeconfig` and `crier-kubeconfig` work**:
Using the following command, check that the deployments are "Ready".
The command finds all the deployments that are using the `kubeconfig`
secret:
```bash
gcloud container clusters get-credentials github-build-infra \
--project jetstack-build-infra \
--zone europe-west1-b
kubectl get deploy -n default -ojson \
| jq -c '.items[]' \
| grep 'secretName":"kubeconfig"' \
| jq -r '.metadata.name' \
| xargs -L1 kubectl get deploy
```
It should show:
```
NAME READY UP-TO-DATE AVAILABLE AGE
deck 3/3 3 3 5y311d
NAME READY UP-TO-DATE AVAILABLE AGE
hook 4/4 4 4 5y311d
NAME READY UP-TO-DATE AVAILABLE AGE
prow-controller-manager 1/1 1 1 722d
NAME READY UP-TO-DATE AVAILABLE AGE
sinker 1/1 1 1 5y311d
```
Now, check the `crier-kubeconfig` secret:
```bash
kubectl get deploy -n default -ojson \
| jq -c '.items[]' \
| grep 'secretName":"crier-kubeconfig"' \
| jq -r '.metadata.name' \
| xargs -L1 kubectl get deploy
```
It should show:
```
NAME READY UP-TO-DATE AVAILABLE AGE
crier 1/1 1 1 4y145d
```
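Rather than reading the READY column by eye, the output of the two commands above can be piped through a small filter that fails when any deployment is not fully ready. This is a sketch only; the helper name `not_ready_deployments` is hypothetical and it assumes the default `kubectl get deploy` column layout shown above.

```bash
# not_ready_deployments: read `kubectl get deploy` output on stdin, print each
# deployment whose READY column is not "n/n", and fail if there is any.
# Repeated header lines (as produced by xargs -L1) are skipped.
not_ready_deployments() {
  awk '
    BEGIN { bad = 0 }
    $1 != "NAME" && NF >= 2 {
      split($2, ready, "/")
      if (ready[1] != ready[2]) { print $1 " is not ready: " $2; bad = 1 }
    }
    END { exit bad }
  '
}

# Example usage: append the filter to either pipeline above.
#   ... | xargs -L1 kubectl get deploy | not_ready_deployments
```

The filter prints nothing and exits with status 0 when every listed deployment is ready.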
- [x] **Complete the rotation**: the control plane stops serving traffic over the original IP address. Old credentials are revoked.
⚠️ Testing indicates **approx 5 mins downtime for cluster API server** after running this command.
```bash
gcloud container clusters update jetstack-build-infra-workers-trusted \
--zone europe-west1-b \
--project jetstack-build-infra-internal \
--complete-credential-rotation
```
- [ ] Remind people to update their local kubeconfig by re-running:
```bash
gcloud container clusters get-credentials jetstack-build-infra-workers-trusted \
--zone europe-west1-b \
--project jetstack-build-infra-internal
```