# Skunkworks 2020
## Chaos Engineering in Kubernetes
Let's bring some chaos to the k8s environment. The goal is to study and extend the platform provided by [ChaosMesh](https://chaos-mesh.org/), develop some more chaos experiments, and break the k8s environment in which our operator(s) run.
## Update
I realized I never linked the [repo](https://github.com/bznein/chaos-mesh)!
## Day 1
Most of the morning was spent trying to figure out why my changes were not applied. Finally gave up on finding a solution myself and opened an issue [here](https://github.com/chaos-mesh/chaos-mesh/issues/1096). Turns out the `ImagePullPolicy: Always` was preventing me from re-pulling the image with the new changes.
A more productive afternoon saw me actually writing some real code. I managed to create a chaos (that doesn't really have any _chaos_ in it) that lists all the persistent volumes that match a selector (for now working only on labels). This of course also required writing an additional `ClusterRole`, as Persistent Volumes are cluster-wide resources.
### Lessons learned
* Don't rely too much on documentation with young projects
* Start simple, and add complexity only later
### Result example
This is the example definition of my `helloworld` chaos (which will have to be renamed):
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HelloWorldChaos
metadata:
  name: hello-world
  namespace: chaos-testing
spec:
  scheduler:
    cron: "@every 5s"
  selector:
    labelSelectors:
      "app": "test"
```
Every 5 seconds, this selects all the persistent volumes that have been labeled with `app=test` and prints them to stdout.
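The label matching itself can be sketched in a few lines of plain Go. This is only an illustration of the idea, not the code in the repo: the `pv` struct, `matchesLabels`, and `selectPVs` are hypothetical names, and the real controller would work on Kubernetes `PersistentVolume` objects rather than this stand-in type. The sketch assumes that a volume matches only when every key/value pair of the selector is present in its labels.

```go
package main

import "fmt"

// pv is a stand-in for a PersistentVolume: just a name and its labels.
// (Hypothetical type; the real controller uses the Kubernetes API types.)
type pv struct {
	Name   string
	Labels map[string]string
}

// matchesLabels reports whether every key/value pair in selector
// is present in labels. An empty selector matches everything.
func matchesLabels(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectPVs returns the names of the volumes whose labels satisfy the selector.
func selectPVs(pvs []pv, selector map[string]string) []string {
	var names []string
	for _, p := range pvs {
		if matchesLabels(p.Labels, selector) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pvs := []pv{
		{Name: "pv-a", Labels: map[string]string{"app": "test"}},
		{Name: "pv-b", Labels: map[string]string{"app": "prod"}},
		{Name: "pv-c", Labels: map[string]string{"app": "test", "tier": "db"}},
	}
	// Mirrors the `"app": "test"` selector from the YAML above.
	fmt.Println(selectPVs(pvs, map[string]string{"app": "test"}))
}
```

With this "all pairs must match" rule, a selector with several labels simply narrows the result further.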
## Day 2
These were the goals for day 2:
- [x] Make the chaos kill persistent volumes
- [x] Modify finalizers/grace period to actually kill it right away
- [x] Make it configurable in the yaml definition of chaos
- [ ] Improve the selector on the chaos
- [x] Apply randomness (kill 50% of PV, for example)
- [x] Rename the chaos to a name that actually means something
- [ ] Study the chaos that already exists and come up with something more to add (There are other chaos engineering tools I can take inspiration from)
- [x] Improve launch script (move to a real programming language?)
As you can see, a pretty productive day with a lot of improvements:
#### Make the chaos kill persistent volumes
##### Modify finalizers/grace period to actually kill it right away
This has been "fun". Took a lot of time to understand how to write `jsonPatch` types to modify the PV, and especially the importance of order of operations!
In fact, it turns out that removing the finalizers works only if the persistent volume has already been put into the `Terminating` state; otherwise the finalizer `kubernetes.io/pv-protection` will be re-added every time.
The final solution is actually pretty easy:
```go
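// JSON Patch (RFC 6902) that removes all finalizers from the PV.
payload := []patch{{
	Op:   "remove",
	Path: "/metadata/finalizers",
}}
payloadBytes, err := json.Marshal(payload)
...
// Delete first, so that the PV enters the Terminating state.
err = e.Delete(ctx, pv, &client.DeleteOptions{})
...
// Only then drop the finalizers; patching before the delete would see them re-added.
err = e.Client.Patch(context.TODO(), pv, client.ConstantPatch(types.JSONPatchType, payloadBytes))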
```
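The `patch` type is not shown in the snippet above. As a minimal sketch, assuming it simply mirrors a JSON Patch operation, the payload can be built and inspected with the standard library alone. The struct layout and the `removeFinalizersPatch` helper are my assumptions, not necessarily what the repo contains.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patch models a single JSON Patch (RFC 6902) operation.
// Assumption: the original `patch` type looks roughly like this.
type patch struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"` // unused by "remove"
}

// removeFinalizersPatch builds the payload that drops all finalizers
// from an object's metadata. (Hypothetical helper name.)
func removeFinalizersPatch() ([]byte, error) {
	return json.Marshal([]patch{{
		Op:   "remove",
		Path: "/metadata/finalizers",
	}})
}

func main() {
	payload, err := removeFinalizersPatch()
	if err != nil {
		panic(err)
	}
	// Prints: [{"op":"remove","path":"/metadata/finalizers"}]
	fmt.Println(string(payload))
}
```

Printing the payload like this is a cheap way to check the patch document before sending it to the API server.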
#### Make it configurable in the yaml definition of chaos
This was as easy as adding a simple field to my chaos `spec` and then branching on it with an `if`.
#### Apply randomness (kill 50% of PV, for example)
Obtaining this required a lot of duplicated code from the pre-existing pod logic (if only we could use templates/generics...), but it turned out to be pretty easy. Now we can kill one, all, or a percentage of the matching PVs.
#### Rename the chaos to a name that actually means something
It is finally called `PersistentVolumeChaos` :)
#### Improve launch script (move to a real programming language?)
And yes, I did move to `Python` in the end. Fairly simple program with some basic parallelism and optional arguments!
### Not Done
#### Improve the selector on the chaos
I still need to support selection on annotations and namespaces, and I'm not sure whether the current implementation works when selecting on multiple labels (but it should!).
I also want the ability to target specific PVs.
#### Study the chaos that already exists and come up with something more to add
This will probably be tomorrow morning's work!
## Day 3
Goals for today:
- [x] Improve chaos selector (see day 2)
  - [ ] ~~Namespace selector~~
  - [x] Direct selector by name
  - [ ] ~~Field selector~~
  - [ ] ~~Node selector~~
  - [x] Annotation selector
- [ ] Investigate other chaos solutions and possibly develop new ones
### Results:
At 9:26am I already have an improved selector, which is nice :smile:
Of course, while writing it, I realized that namespace selector doesn't mean anything, as Persistent Volumes are not namespaced!
(Side note: I copied the list from the Pod Selector that is already part of Chaos Mesh, which is why several entries don't make sense for PVs.)
I will also not be implementing field selectors, as the only field selectors valid for persistent volumes are `metadata.name` and `metadata.namespace` (??).
Node selector doesn't make sense for PVs either.
Annotation Selector turned out to be super easy (9:40am)!
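To give an idea of how the three remaining criteria fit together, here is a minimal sketch. It assumes that every non-empty criterion must be satisfied (a logical AND), which may differ from the actual implementation; the `volume` and `selector` types and their field names are illustrative.

```go
package main

import "fmt"

// volume is a stand-in for the metadata of a PersistentVolume.
type volume struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// selector holds the criteria that make sense for PVs.
// Field names are illustrative, not the ones used in the repo.
type selector struct {
	Names       []string
	Labels      map[string]string
	Annotations map[string]string
}

// containsAll reports whether every pair in want is present in have.
func containsAll(have, want map[string]string) bool {
	for key, value := range want {
		if got, ok := have[key]; !ok || got != value {
			return false
		}
	}
	return true
}

// matches reports whether v satisfies every non-empty criterion of s.
func (s selector) matches(v volume) bool {
	if len(s.Names) > 0 {
		found := false
		for _, name := range s.Names {
			if name == v.Name {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return containsAll(v.Labels, s.Labels) && containsAll(v.Annotations, s.Annotations)
}

func main() {
	v := volume{
		Name:        "pv-a",
		Labels:      map[string]string{"app": "test"},
		Annotations: map[string]string{"owner": "team-x"},
	}
	byName := selector{Names: []string{"pv-a"}}
	byAnnotation := selector{Annotations: map[string]string{"owner": "team-y"}}
	fmt.Println(byName.matches(v), byAnnotation.matches(v))
}
```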
The afternoon was mostly spent developing a new, similar chaos. This time, instead of running on `PersistentVolume`, I developed one that acts on `PersistentVolumeClaim`.
## Day 4
With two more days remaining, day 4 will be pivotal in deciding what the end goal for the week is. I could either go for more experiments (disk fill, CPU burn, and such) or for some combination of the existing ones with our operator running.
I will likely go with the latter though, as other chaos might be really hard to develop and would leave me without a real end result by Friday!
### Actual day 4:
Not a very productive day due to a few health issues. However, I managed to run the chaos in an AWS deployment with the enterprise operator! Maybe I can have some evg tests up and running by the end of Friday?
## Day 5
Mysterious things happen when playing with a lot of things:
* Why can I still write after deleting the PV?
* Why does YCSB fail when killing a pod? Is this a bug? WHERE?
### Long term goals:
- [ ] Add tests!
- [ ] Combine multiple chaos experiments when testing the operator