# Approval Deployment 2020-04-14
## 0. Goal
- CIDR Change
- `10.x` -> `100.x`
- EC2 Instance Type Change
- `r5d.2xlarge` -> `m5d.4xlarge`
- Flux Workaround
- Autoscaler Update to handle mutliple AZ properly
- Metrics-Server to bring kubectl top po back
- AMI Image Update
## 0.1 Document Current State
Note current state of Pods in `default` and `ingress` NS, so we know what has been running and what not :)
```bash
kubectl -n ingress get po > get_po_ingress
kubectl -n default get po > get_po_default
```
## 1. Stop cluster-autoscaler (scale to 0)
**Note:** Lets scale down both Cluster-Autoscalers and fix them after Node/ASG-Operations
```bash
kubectl -n kube-system delete deploy cluster-autoscaler
kubectl -n kube-system scale deploy/cluster-autoscaler-aws-cluster-autoscaler --replicas=0
kubectl -n kube-system get deploy | grep autoscaler
```
Stop alert-manager to stop sending alerts for the time of deployment
```bash
kubectl -n monitoring scale <tbd>
```
## 2. Rollout Weavenet Patch
Rollout patch and wait until it is distributed accross all nodes
```bash
# Patch weave-net DaemonSet
cat <<EOF | kubectl patch ds weave-net -n kube-system --patch "$(cat -)"
spec:
template:
spec:
containers:
- name: weave
image: weaveworks/weave-kube:2.6.0
env:
- name: IPALLOC_RANGE
value: 100.96.0.0/11
- name: WEAVE_MTU
value: "8912"
- name: weave-npc
image: weaveworks/weave-npc:2.6.0
EOF
```
Check if DS is distributed in whole Cluster
```bash
kubectl -n kube-system get ds weave-net
```
## 3. Cordon all nodes to distinguish old and new ones
This should mark all nodes (except masters) as "SchedulingDisabled".
For easier later reference, store the "old_nodes" :)
```bash
kubectl get nodes --no-headers -owide > old_nodes
```
Cordon all nodes except "master":
```bash
kubectl get nodes --no-headers -owide | grep -v "master" | awk '{print$1}' | while read name ;
do
echo "Cordoning $name";
kubectl cordon "$name"
done 2>/dev/null
```
## 4. `Kops update` to update Instance Groups and Launch-Configurations
- this can be done by triggering "[Deploy-K8S](https://ci-plt.pre.eu.dp.vwg-connect.com/teams/dp-corebe-approval/pipelines/dp-corebe-kubernetes-approval/jobs/deploy-k8s/builds/16)" Job
- **IMPORTANT** the Job should be interrupted after all three Master Nodes are exchanged
- For this purpose a tag [`1.5.27-20200408-double-asgs`](https://devstack.vwgroup.com/bitbucket/projects/MLSS/repos/w-025-execution-environment/commits?until=refs%2Ftags%2F1.5.27-20200408-double-asgs) has been created that can/should be used!
- This tag will double ASG Capacities
* This will Update the Kops Cluster Configuration (and therefore the Launch Configs of the ASGs) AND update our Master Nodes as desired.
* All amount of Nodes in remaining ASGs should be doubled (new Nodes will start to boot)
### The result of this Step should be:
- All Launch Configs are Up-To-Date
- Master Nodes with new Launch Configs
- All ASGs have the double amount of Nodes
- this means in our ASGs we have now "old" and "new" nodes
## 5. Drain all cordoned nodes
```bash
kubectl get nodes --no-headers -owide | grep -v "master" | grep "SchedulingDisabled" | awk '{print$1}' | while read name ;
do
echo "Draining $name";
kubectl drain --ignore-daemonsets --grace-period=300 --force $name
done 2>/dev/null
```
To be considered to make sure all workloads are actually rescheduled:
```bash
kubectl drain --ignore-daemonsets --grace-period=300 --delete-local-data --force $name
```
> This filter checks if `emptyDir` exists for a pod or not. If the pod uses `emptyDir` to store local data, it may not be safe to delete because if a pod is removed from a node the data in the `emptyDir` is deleted with it. Just like with the unreplicated filter, it is up for the implementation to decide what to do with these pods. `drain` provides a switch for this as well; if `--delete-local-data` is set, `drain` will proceed even if there are pods using the `emptyDir` and will delete the pods and therefore delete the local data as well.
>
> *Source: https://banzaicloud.com/blog/drain/*
> Interesting Issue about it: https://github.com/kubernetes/kubernetes/issues/80228
## 6. Shutdown all drained and cordoned nodes
After shutting them down, they will reboot with the new Launch Configuration
**Note:** as we still have our `systemd-resolved` issue, it is wise to check the current node count and compare it afterwards!
```bash
kubectl get nodes --no-headers | wc -l
```
Shutdown all Nodes
```bash
kubectl get nodes --no-headers -owide | grep -v "master" | grep "SchedulingDisabled" | awk '{print$6}' | while read ip ;
do
echo "Shutting down $ip";
# Cpmmented out for accidentally Copy/Pasting
# sudo -i ssh -o StrictHostKeyChecking=no -n ubuntu@$ip sudo -i shutdown now
done 2>/dev/null
```
## 7. Let the whole "Deploy-K8S" Job run
- **configure the actual tag ` 1.5.27-20200408`!**
- should be way faster, as it doesn't mess with ASGs anymore
- this will set the min/max values back; NOT the desired once!
- necessary to deploy Flux
- necessary to deploy new Autoscaler Version
## 8. Set cluster-autoscaler to 1
The Cluster-Autoscaler will probably be set to 1 by the step 7.
Ensure, that the cluster-autoscaler is actually taking down remaining nodes! This will take ~10min until the first instances will be removed
– reduce capacity of all ASGs back to normal
– let cluster-autoscaler do the work!
```bash
kubectl -n kube-system scale deploy/cluster-autoscaler-aws-cluster-autoscaler --replicas=1
```