**Background**
Currently FTL handles production deployments by uploading them to runners that are part of a single Kube deployment. This means that from the cluster's point of view all the runners are identical, when in practice they can each be hosting different deployments. This has a number of problematic implications:
* When upgrading the FTL images Kube will create new runner pods, then start killing old ones. As Kube considers all runner pods the same, there is a chance it kills every runner hosting a particular deployment, resulting in downtime. Note that upgrades are not the only reason Kube might kill pods, so this situation is not limited to upgrades.
* When Kube creates new pods the controller has no way of knowing that the old pods are going to be killed, so it will not move deployments onto the new pods until the old ones are gone. For example, if we have a deployment with a single runner, when Kube upgrades FTL it will create a new `ReplicaSet` and start a new runner with the upgraded runner image. The controller, however, will be happy with its existing single-runner deployment, and won't do anything with the new runner until the old one goes away.
* There is no easy way to apply different Istio rules to different FTL deployments, as they are all part of the same Kube deployment. The only option currently would be to create multiple Kube deployments and apply labels to the deployments.
**Proposed Solution**
FTL will move to a deployment approach where individual runners are created and assigned to deployments on demand. This will be done using a 'pull' based approach, where either a local or remote consumer of the gRPC schema change stream will create runners for deployments as needed.
FTL will implement a generic streaming gRPC based runner management layer that will allow FTL to stream information about its runner requirements to remote endpoints. This will allow different runner management solutions to integrate via this API. Initially we will support a Kube based scaler and the existing local scaling approach.
It is expected that this design will evolve further as FTL builds out its provisioning infrastructure, with compute provisioning eventually being rolled into the general provisioning mechanism.
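To make the shape of that API more concrete, the sketch below shows one possible Go interface a runner management backend could implement. All of the names here are illustrative assumptions, not a committed design.

```go
package scaling

import "context"

// DeploymentState is one entry from the controller's streamed desired state.
// The field names here are placeholders, not the final schema.
type DeploymentState struct {
	DeploymentKey string // unique key for this version of the FTL deployment
	Replicas      int    // number of runners the controller wants for it
}

// Scaler is implemented by each runner management backend (local, Kubernetes, ...).
// The controller (or a thin client in front of it) pushes desired state changes
// to the active Scaler, which is responsible for creating and destroying runners.
type Scaler interface {
	// Start begins streaming desired state from the controller at the given endpoint.
	Start(ctx context.Context, controllerEndpoint string) error
	// SetDesiredState is called whenever the desired set of deployments changes.
	SetDesiredState(ctx context.Context, deployments []DeploymentState) error
}
```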
**Design Considerations**
Alternative approaches were considered, such as having FTL manage the Kube deployments directly. The proposed pull based approach has a number of advantages over direct management:
* This model is more secure, as it means the controller does not need permission to modify Kube objects. Those permissions are limited to the agent that is solely responsible for scaling Deployments.
* This model is pluggable, and aligns better with our long term architectural goals. Even though it requires an extra Kubernetes deployment it will not complicate local development, and it means that Kube management code is not present in the main FTL binary.
**Protocol Level Changes**
The existing runner handshake protocol will be modified to include a deployment name in the handshake. When a runner connects to the controller, the controller will assign that runner the deployment it has requested, no matter how many replicas the controller considers ideal.
This changes the scaling approach so that instead of the controller directly assigning N deployment replicas to runners, the controller advertises to a remote agent that it wants N replicas of a particular deployment, and the agent decides how it wants to assign the runners. When doing a Kube rolling deployment it is common to have more replicas than the requested number, as there is overlap between the old and new pods, so this approach will allow rolling upgrades to work correctly.
It is an open question whether this deployment field should be optional or required. If it is optional then we can continue to support the existing deployment model, where we create a big block of unmanaged compute. The downside of continuing to support it is extra code to maintain that will likely not be used in the future. I think it is likely that we will want some overlap between the approaches to allow for a staged rollout, after which the old approach may be phased out.
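As a rough sketch of this protocol-level change, the snippet below models the extended registration payload and the unconditional assignment it implies on the controller side. The type and field names are placeholders rather than the actual protobuf messages.

```go
package controller

import "fmt"

// RunnerRegistration sketches the extended handshake payload. Field names are
// illustrative only; the real change would land in the existing protobuf messages.
type RunnerRegistration struct {
	RunnerKey  string // unique identity of this runner
	Endpoint   string // address the controller can reach the runner on
	Deployment string // NEW: the deployment this runner was created to serve
}

// assignments maps deployment keys to the runners serving them.
type assignments map[string][]string

// assign records the runner against the deployment it asked for, unconditionally.
// The controller no longer enforces its own replica target here; surplus runners
// during a rolling upgrade are simply accepted.
func (a assignments) assign(knownDeployments map[string]bool, reg RunnerRegistration) error {
	if !knownDeployments[reg.Deployment] {
		return fmt.Errorf("runner %s requested unknown deployment %q", reg.RunnerKey, reg.Deployment)
	}
	a[reg.Deployment] = append(a[reg.Deployment], reg.RunnerKey)
	return nil
}
```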
**Wire Protocol**
The wire protocol that will be used is the existing `ControllerService.PullSchema` streaming gRPC endpoint. Both local and remote clients can connect to this endpoint to get all deployments and the desired number of replicas, as well as any updates when this changes.
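A consumer of that stream only needs to maintain a map of deployment key to desired replica count. The sketch below illustrates that folding logic with simplified, placeholder event fields; the real `PullSchema` messages carry full deployment schemas and differently named fields.

```go
package scaling

// SchemaEvent is a simplified view of one PullSchema notification. Only the
// fields relevant to scaling are sketched here, with placeholder names.
type SchemaEvent struct {
	DeploymentKey string
	MinReplicas   int
	Removed       bool // true when the deployment has been deleted
}

// DesiredState folds a stream of events into the scaler's view of the world:
// deployment key -> desired replica count.
type DesiredState map[string]int

// Apply updates the desired state from a single event.
func (d DesiredState) Apply(ev SchemaEvent) {
	if ev.Removed || ev.MinReplicas == 0 {
		delete(d, ev.DeploymentKey)
		return
	}
	d[ev.DeploymentKey] = ev.MinReplicas
}
```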
**Local Development**
For FTL dev a new management goroutine will be started that connects to the runner management gRPC endpoint. This will then manage runners in a similar manner to how local scaling is managed at the moment.
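A minimal sketch of what that management loop could look like is below. It spawns runner processes purely for illustration: the binary name and `--deployment` flag are assumptions, and the real local scaling may well keep running runners in-process as it does today.

```go
package localscaling

import (
	"context"
	"os/exec"
)

// localScaler keeps one locally spawned runner per desired replica.
type localScaler struct {
	running map[string][]*exec.Cmd // deployment key -> runner processes
}

func newLocalScaler() *localScaler {
	return &localScaler{running: map[string][]*exec.Cmd{}}
}

// reconcile starts or stops local runners so each deployment has the desired
// count. Cleaning up deployments that are no longer desired at all is elided.
func (s *localScaler) reconcile(ctx context.Context, desired map[string]int) error {
	for key, want := range desired {
		for len(s.running[key]) < want {
			// Hypothetical runner invocation; the flag tells the runner which
			// deployment to request in its handshake.
			cmd := exec.CommandContext(ctx, "ftl-runner", "--deployment", key)
			if err := cmd.Start(); err != nil {
				return err
			}
			s.running[key] = append(s.running[key], cmd)
		}
		for len(s.running[key]) > want {
			last := len(s.running[key]) - 1
			_ = s.running[key][last].Process.Kill() // stop a surplus runner
			s.running[key] = s.running[key][:last]
		}
	}
	return nil
}
```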
**Kube Deployment**
FTL will implement a new Kubernetes operator that is a separate binary / container image.
This operator will run under a ServiceAccount that has permission to create, modify and delete Deployment objects, and will create them on demand as required by the controller.
On startup this operator will connect to the controller, and will receive the desired deployment state. It will then use the Kube API to list all FTL deployments it is managing (identified by a label), and compare the current state with the desired state.
It will then use the Kube API to adjust the state of the cluster to the state requested by the controller, and then continue to monitor both the cluster and the streamed desired state from the controller, updating the cluster state as necessary.
In general an FTL deployment with a desired number of replicas will be directly mapped to a Kube Deployment with the same number of replicas. Note that each version of an FTL deployment will get its own Kube Deployment, so multiple versions of the same deployment can co-exist while FTL is doing an upgrade.
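The reconcile step could look roughly like the sketch below, using client-go. The label names, image reference, and the `FTL_DEPLOYMENT` environment variable are assumptions for illustration only.

```go
package operator

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Labels used to identify Deployments this operator owns. These names are
// assumptions for the sketch, not a settled convention.
const (
	managedByLabel  = "app.kubernetes.io/managed-by"
	managedByValue  = "ftl-runner-operator"
	deploymentLabel = "ftl.dev/deployment"
)

// reconcile drives the cluster towards the desired state streamed from the
// controller: one Kube Deployment per FTL deployment, with matching replicas.
func reconcile(ctx context.Context, client kubernetes.Interface, namespace string, desired map[string]int32) error {
	existing, err := client.AppsV1().Deployments(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("%s=%s", managedByLabel, managedByValue),
	})
	if err != nil {
		return err
	}
	seen := map[string]bool{}
	for i := range existing.Items {
		dep := &existing.Items[i]
		key := dep.Labels[deploymentLabel]
		seen[key] = true
		want, ok := desired[key]
		switch {
		case !ok:
			// The controller no longer wants this FTL deployment: remove it.
			if err := client.AppsV1().Deployments(namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		case dep.Spec.Replicas == nil || *dep.Spec.Replicas != want:
			// Scale the existing Kube Deployment to the requested count.
			dep.Spec.Replicas = &want
			if _, err := client.AppsV1().Deployments(namespace).Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
				return err
			}
		}
	}
	// Create Kube Deployments for FTL deployments that don't exist yet.
	for key, want := range desired {
		if seen[key] {
			continue
		}
		replicas := want
		dep := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{
				Name:   "ftl-runner-" + key, // name derived from the deployment key (sanitisation elided)
				Labels: map[string]string{managedByLabel: managedByValue, deploymentLabel: key},
			},
			Spec: appsv1.DeploymentSpec{
				Replicas: &replicas,
				Selector: &metav1.LabelSelector{MatchLabels: map[string]string{deploymentLabel: key}},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{deploymentLabel: key}},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{
							Name:  "runner",
							Image: "ftl0/ftl-runner:latest", // assumed image reference
							// Assumed mechanism for telling the runner which
							// deployment to request in its handshake.
							Env: []corev1.EnvVar{{Name: "FTL_DEPLOYMENT", Value: key}},
						}},
					},
				},
			},
		}
		if _, err := client.AppsV1().Deployments(namespace).Create(ctx, dep, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```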
The runners will have their readiness check configured to only be 'ready' once the deployment is uploaded and started, rather than when they are ready to receive a deployment.
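A minimal sketch of that readiness behaviour on the runner side, assuming an HTTP probe endpoint (the path and wiring into the runner's existing server are not settled):

```go
package runner

import (
	"net/http"
	"sync/atomic"
)

// deploymentReady flips to true once the runner has downloaded and started its
// assigned deployment. Until then the pod reports not-ready, so a Kube rolling
// upgrade will not tear down old runners prematurely.
var deploymentReady atomic.Bool

// readinessHandler is a sketch of the probe endpoint.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if deploymentReady.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}
```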
Open question: Istio and the runner start delay
The RBAC issue with Istio was solved by adding a delay, but it is unknown whether that delay was just a general 'Istio being slow' delay, or because Istio needed the pod to be ready before it would allow the connection. If it is the latter we will probably have an issue with proper readiness checks, as the pod won't be ready until the controller has uploaded the deployment, and the controller can't do that until Istio allows it to connect.
Open question: Configuration
How will the operator be configured, in terms of all the different parameters on the deployments it will create? We can probably keep this simple to start with (e.g. just mount a ConfigMap with some basic params), and enhance it later.
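For example, a first pass could be a small config file mounted from a ConfigMap. The keys and mount path in the sketch below are assumptions:

```go
package operator

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Config holds the basic parameters the operator applies to the Kube
// Deployments it creates. Field names and the mount path are assumptions.
type Config struct {
	RunnerImage string            `yaml:"runnerImage"` // image used for runner pods
	Namespace   string            `yaml:"namespace"`   // namespace to create Deployments in
	PodLabels   map[string]string `yaml:"podLabels"`   // extra labels, e.g. for Istio policy
}

// loadConfig reads the operator configuration from a mounted ConfigMap file,
// e.g. /etc/ftl-operator/config.yaml.
func loadConfig(path string) (Config, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	return cfg, yaml.Unmarshal(data, &cfg)
}
```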
Open question: Services and Network
How do we handle Kube Services? We don't really use the headless service directly AFAIK, it is just used for Istio configuration. If we make the Kube deployment labels customisable then the cluster administrator could create services and policies based on these labels. This might be an OK short term solution, but it would move things like egress control outside of the realm of FTL. We may want the ability to have FTL directly create Istio egress policies so we can manage all of this ourselves.