## Algorithm properties
### Actions
The KafkaRoller only ever performs one of three kinds of "active" operation:
* Restart a broker by deleting its pod.
* Reconfigure a broker using the Admin client.
* Wait for a broker to become healthy.
All other interactions should be side-effect-free.
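The three active operations could be modelled as a simple enum. This is an illustrative sketch only, not the actual KafkaRoller API:

```java
// Hypothetical model of the roller's three "active" operations;
// the enum and its name are illustrative, not Strimzi code.
enum Action {
    RESTART,      // restart a broker by deleting its pod
    RECONFIGURE,  // reconfigure a broker via the Admin client
    WAIT          // wait for a broker to become healthy
}
```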
### Invariants
1. We only try to process each broker at most once in each reconciliation.
2. We never process a broker that's `HEALTHY`, `NEEDS_RESTART`, or `NEEDS_RECONFIG` while there are brokers that are `UNHEALTHY`
3. We never process a broker that's `HEALTHY` while there are brokers in `NEEDS_RESTART` or `NEEDS_RECONFIG`
4. We never restart a broker if it would impact `acks=all` clients (i.e. with `NOT_ENOUGH_BROKERS_IN_ISR` error code)
Notes:
* Invariant 1 implies the algorithm should terminate in finite time.
* Invariants 2 and 3 together imply `HEALTHY` brokers are processed last.
* Invariants 2, 3 and 4 are an expression of "prioritise existing stability over convergence to desired state".
### Post conditions
If the Roller algorithm terminates normally:
1. Every existing pod has been restarted, has been reconfigured, or required neither operation.
2. The number of non-`STABLE` pods has not increased.
If the Roller algorithm terminates abnormally:
1. The Reconciliation is considered failed and the CR status reflects this.
## Algorithm outline
1. For each existing broker pod
    1. Categorize it: `RESTARTING` < `UNHEALTHY` < `NEEDS_RESTART` < `NEEDS_RECONFIG` < `STABLE`
2. Sort the pods by their category (and by broker id as tie breaker)
3. Take the first pod from the list:
1. Switch:
* `RESTARTING`: continue
* `UNHEALTHY`, `NEEDS_RESTART`: restart it
* `NEEDS_RECONFIG`: reconfigure it.
2. Wait for it to become `STABLE`
3. If it does not become `STABLE` within `t` ms then abort the reconciliation.
4. Otherwise it becomes `STABLE` within `t` ms. Then recategorize all the pods.
If any pod needs to transition out of `HEALTHY` then abort the reconciliation,
except that transitions from `HEALTHY` to `NEEDS_*` are ignored.
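The categorize-and-sort steps above can be sketched as follows. The `State` enum, the `Broker` record, and the use of enum ordinals to encode the priority ordering are illustrative assumptions, not the real KafkaRoller types:

```java
import java.util.*;

// Sketch of the outline's sorting step: pods are ordered by category
// (RESTARTING first, STABLE last) with broker id as the tie breaker.
class RollerSketch {
    // Declaration order encodes the priority
    // RESTARTING < UNHEALTHY < NEEDS_RESTART < NEEDS_RECONFIG < STABLE.
    enum State { RESTARTING, UNHEALTHY, NEEDS_RESTART, NEEDS_RECONFIG, STABLE }

    record Broker(int id, State state) {}

    static List<Broker> sorted(Collection<Broker> brokers) {
        return brokers.stream()
            .sorted(Comparator.comparing((Broker b) -> b.state())
                              .thenComparingInt(Broker::id))
            .toList();
    }
}
```

The roller would then take the head of this list, act on it according to its category, and wait for it to become `STABLE` before recategorizing.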
**When it is OK to abort/fail reconciliation: Scale down reconciliation (Jakub)**
**How does KRaft change this? Specifically, do we need to be able to roll the controllers separately from the non-controller brokers? Should we seek to roll passive controllers first or last?**
## State machine
**TODO insert image**
## State descriptions
### `UNKNOWN`
Every pod starts in this state, which represents the CO having no deeper knowledge of the broker's true state.
It transitions from this state based on observations of, and interactions with, the Pod and/or the broker.
### `HEALTHY`
A broker is in this state when all of the following are true:
* Its pod condition is `Ready`
* Its broker state is `Running`
* It is network reachable on internal listeners for KRPC
* Its pod is up-to-date wrt the CR `spec`
* It is a member of the cluster (it's present and unfenced in `Metadata`)
If the broker is the leader of all the partitions for which it is the preferred leader, it is `STABLE`;
otherwise it is `SYNCING`.
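The `HEALTHY` conditions and the `STABLE`/`SYNCING` split might be expressed as a predicate like this sketch; `BrokerObservation` and its fields are hypothetical stand-ins for the real checks:

```java
// Hypothetical observation of a single broker; each field stands in
// for one of the HEALTHY conditions listed above.
record BrokerObservation(boolean podReady,
                         boolean brokerRunning,
                         boolean reachable,
                         boolean podUpToDate,
                         boolean inMetadataUnfenced,
                         boolean preferredLeaderEverywhere) {

    // All five conditions must hold for the broker to be HEALTHY.
    boolean healthy() {
        return podReady && brokerRunning && reachable
            && podUpToDate && inMetadataUnfenced;
    }

    // A HEALTHY broker is STABLE when it leads every partition for
    // which it is the preferred leader, otherwise SYNCING.
    String substate() {
        if (!healthy()) throw new IllegalStateException("not HEALTHY");
        return preferredLeaderEverywhere ? "STABLE" : "SYNCING";
    }
}
```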
### `NEEDS_RESTART`
As `STABLE`, but the pod is not up-to-date wrt the CR `spec` and those changes can't be done via dynamic reconfig.
From here the broker can only transition to `RESTARTED` by the deletion of the broker's Pod.
### `RESTARTING`
Pod condition < `Ready` or Broker state < `Running`
* `RESTARTED`: Transitions to this state when the pod gets deleted
* `RECOVERY`: Transitions to this state when the broker state metric is < `Running`
* `SYNCING`: Transitions to this state when the broker is not in the cluster or is not the preferred leader for one or more partitions
### `NEEDS_RECONFIG`
As `STABLE`, but the pod is not up-to-date wrt the CR `spec` and those changes can be done via dynamic reconfig.
From here the broker can transition back to `HEALTHY` once the dynamic reconfiguration has been applied via the Admin client.
### `UNHEALTHY`
A broker that is none of `HEALTHY`, `RESTARTING`, `NEEDS_*` or `UNKNOWN`
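Since `UNHEALTHY` is defined residually, classification naturally falls through the other states first. A minimal sketch, with hypothetical boolean inputs standing in for the real observations:

```java
// Sketch of the residual definition above: UNHEALTHY applies only
// when no other classification does. The boolean inputs are
// hypothetical stand-ins for the real observations.
class Classifier {
    static String classify(boolean unknown, boolean restarting,
                           boolean needsRestart, boolean needsReconfig,
                           boolean healthy) {
        if (unknown)       return "UNKNOWN";
        if (restarting)    return "RESTARTING";
        if (needsRestart)  return "NEEDS_RESTART";
        if (needsReconfig) return "NEEDS_RECONFIG";
        if (healthy)       return "HEALTHY";
        return "UNHEALTHY"; // none of the above: the fall-through case
    }
}
```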
# Progress
## 2022/09/02
* Is the problem with the `never_touch_a_broker_more_than_once` property due to the random transitions not reflecting how the system would really behave, or does the state machine need to take into account that we've already reconfigured and follow a different path the second time around?
* Decision: Let's implement another property based test to see if that sheds light on which would be the right course of action.
## 2022-09-08
* Distinction between the per-broker state machine (and its termination states) and the ways in which a rolling operation can terminate
https://jamboard.google.com/d/1lr-v9F9-4gk_G8-ASS84HPLi8doqpwc2crhGiZvB4LI/viewer?f=0
* Split UNHEALTHY into two states, pre- and post-.
* `POST_UNHEALTHY` and `STABLE` are final states in the state machine
* Exceptions arise in the rolling code:
  * If a pod is in the `POST_UNHEALTHY` state
  * If a pod doesn't get to `STABLE` within a timeout
  * If we make an unexpected transition