## Algorithm properties

### Actions

The KafkaRoller only ever performs one of three kinds of "active" operation:

* Restart a broker by deleting its pod.
* Reconfigure a broker using the Admin client.
* Wait for a broker to become healthy.

All other interactions should be side-effect-free.

### Invariants

1. We process each broker at most once in each reconciliation.
2. We never process a broker that's `HEALTHY`, `NEEDS_RESTART` or `NEEDS_RECONFIG` while there are brokers in `UNHEALTHY`.
3. We never process a broker that's `HEALTHY` while there are brokers in `NEEDS_RESTART` or `NEEDS_RECONFIG`.
4. We never restart a broker if it would impact `acks=all` clients (i.e. with the `NOT_ENOUGH_BROKERS_IN_ISR` error code).

Notes:

* Invariant 1 implies the algorithm should terminate in finite time.
* Invariants 2 and 3 together imply `HEALTHY` brokers are processed last.
* Invariants 2, 3 and 4 are an expression of "prioritise existing stability over convergence to the desired state".

### Post conditions

If the Roller algorithm terminates normally:

1. Every existing pod has either been restarted, reconfigured, or required neither operation.
2. The number of non-`STABLE` pods has not increased.

If the Roller algorithm terminates abnormally:

1. The Reconciliation is considered failed and the CR status reflects this.

## Algorithm outline

1. For each existing broker pod:
   1. Categorize it: `RESTARTING` < `UNHEALTHY` < `NEEDS_RESTART` < `NEEDS_RECONFIG` < `STABLE`
2. Sort the pods by their category (and by broker id as a tie breaker).
3. Take the first pod from the list:
   1. Switch on its category:
      * `RESTARTING`: continue
      * `UNHEALTHY`, `NEEDS_RESTART`: restart it
      * `NEEDS_RECONFIG`: reconfigure it
   2. Wait for it to become `STABLE`.
   3. If it does not become `STABLE` within `t` ms then abort the reconciliation.
   4. Otherwise it becomes `STABLE` within `t` ms. Then recategorize all the pods. If any need to transition from `HEALTHY` then abort the reconciliation.
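The prioritisation in steps 1–2 of the outline can be sketched as a sort key. This is a minimal illustration, not Strimzi's actual code: the `Category` enum values and the `(broker_id, category)` pair representation are assumptions made for the example.

```python
from enum import IntEnum


class Category(IntEnum):
    # Hypothetical numeric ordering: lower values sort first,
    # so those brokers are processed first.
    RESTARTING = 0
    UNHEALTHY = 1
    NEEDS_RESTART = 2
    NEEDS_RECONFIG = 3
    STABLE = 4


def processing_order(pods):
    """Sort (broker_id, category) pairs by category, broker id as tie breaker."""
    return sorted(pods, key=lambda p: (p[1], p[0]))


pods = [(0, Category.NEEDS_RESTART), (1, Category.STABLE), (2, Category.UNHEALTHY)]
# UNHEALTHY broker 2 is processed before NEEDS_RESTART broker 0, and the
# STABLE broker 1 comes last, matching invariants 2 and 3.
print([bid for bid, _ in processing_order(pods)])  # -> [2, 0, 1]
```

Because `STABLE` sorts last, a reconciliation where the first pod in the list is already `STABLE` implies every pod is `STABLE` and the roller can terminate normally.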
Ignore any transitions from `HEALTHY` to `NEEDS_*`.

**When is it OK to abort/fail the reconciliation? Scale-down reconciliation (Jakub)**

**How does KRaft change this? Specifically, do we need to be able to roll the controllers separately from the non-controller brokers? Should we seek to roll passive controllers first or last?**

## State machine

**TODO insert image**

## State descriptions

### `UNKNOWN`

Every pod starts in this state, which represents the CO having no deeper knowledge of the broker's true state. Transitions from this state are based on observations of/interactions with the Pod and/or broker.

### `HEALTHY`

A broker is in this state when all of the following are true:

* Its pod condition is `Ready`
* Its broker state is `Running`
* It is network reachable on internal listeners for KRPC
* Its pod is up-to-date wrt the CR `spec`
* It is a member of the cluster (it's present and unfenced in `Metadata`)

If the broker is the leader of all the partitions for which it is the preferred leader, it is `STABLE`. Otherwise it is `SYNCING`.

### `NEEDS_RESTART`

As `STABLE`, but not up-to-date wrt the CR `spec`, and those changes can't be done via dynamic reconfiguration.

From here the broker can only transition to `RESTARTED` by the deletion of the broker's Pod.

### `RESTARTING`

Pod condition < `Ready` or broker state < `Running`.

* `RESTARTED`: Transitions to this state when the pod gets deleted
* `RECOVERY`: Transitions to this state when the broker state metric is < `Running`
* `SYNCING`: When the broker is not in the cluster, or is not the preferred leader for 1 or more partitions

### `NEEDS_RECONFIG`

As `STABLE`, but not up-to-date wrt the CR `spec`, and those changes can be done via dynamic reconfiguration.
From here the broker can transition to

### `UNHEALTHY`

A broker that is in none of `HEALTHY`, `RESTARTING`, `NEEDS_*` or `UNKNOWN`.

# Progress

## 2022-09-02

* Is the problem with the `never_touch_a_broker_more_than_once` property due to the random transitions not reflecting how the system would really behave, or does the state machine need to take into account that we've already reconfigured and follow a different path the second time around?
* Decision: Let's implement another property-based test to see if that sheds light on which would be the right course of action.

## 2022-09-08

* Distinction between the per-broker state machine (and its termination states) and the ways in which a rolling operation can terminate: https://jamboard.google.com/d/1lr-v9F9-4gk_G8-ASS84HPLi8doqpwc2crhGiZvB4LI/viewer?f=0
* Split `UNHEALTHY` into two states, pre- and post-.
* `POST_UNHEALTHY` and `STABLE` are final states in the state machine.
* Exceptions arise in the rolling code:
  * If a pod is in the `POST_UNHEALTHY` state
  * If a pod doesn't get to `STABLE` within a timeout
  * If we make an unexpected transition
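The three exception conditions listed in the 2022-09-08 notes could be expressed as a single guard in the rolling loop. This is a sketch under assumed names: the `check_pod` helper, its parameters, and the state-name strings are illustrative, not the real implementation.

```python
def check_pod(state: str, elapsed_ms: int, timeout_ms: int,
              transition_expected: bool) -> None:
    """Raise if any of the three abnormal-termination conditions holds.

    Hypothetical guard: `state` is the pod's current state-machine state,
    `elapsed_ms` is how long we have waited for it to reach STABLE, and
    `transition_expected` says whether the last observed transition is one
    the state machine allows.
    """
    if state == "POST_UNHEALTHY":
        raise RuntimeError("pod is in the POST_UNHEALTHY final state")
    if state != "STABLE" and elapsed_ms >= timeout_ms:
        raise RuntimeError("pod did not reach STABLE within the timeout")
    if not transition_expected:
        raise RuntimeError("unexpected state transition")
```

Raising out of the loop here matches the abnormal-termination post condition above: the reconciliation is considered failed and the CR status reflects this.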