# Observable and Configurable Resolution Pipelines
## Introduction
An important lesson learned from the legacy OLM implementation is that a resolver deeply integrated into the operator creates several challenges for users and support. To name a few:
- Modifying resolver behaviour, e.g. for new types of constraints, is an adventure that takes us deep into the OLM operator code, which increases both complexity and risk
- Resolver behavior is opaque and hard for users and support to debug (at present it is hard to determine the input that led to a particular resolution outcome)
- It is impossible to query the resolver, or even to tune it to a user's or cluster's particularities
This document describes a way to structure the resolver so that it is configurable, extensible, observable, and auditable. By thinking of the input to the solver as a pipeline through which variables are produced, transformed, and handed to the solver, we can surface the pipeline structure to the user, making the process of translating bundles into variables/constraints transparent and configurable. Each node in the pipeline has a single, specific responsibility, so this architecture makes it easy to modify resolver behavior in a way that is additive and independent of the other nodes/components. Furthermore, by leveraging message passing between these nodes/components, the entire pipeline can be made observable (and replayable) by collecting the events exchanged between them.
## Definitions
* **Variable**: A variable is the unit of input into the solver. It has a unique ID and a collection of constraints relating it to other variables. For instance, a variable representing a bundle might have an ID (e.g. the bundle's package and version), and dependency constraints against other bundle variables.
* **VariableProducer**: A pipeline component that generates variables
* **VariableProcessor**: A pipeline component that takes variables as input and outputs variables
* **VariableConsumer**: A pipeline component that only consumes variables and produces nothing
* **Pipeline**: A directed acyclic graph (DAG) of pipeline components whose roots are producers and leaves are consumers
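As a rough illustration only, a minimal sketch of these component types in Go might look as follows (the `Variable`, `Constraint`, and component interfaces here are hypothetical and not the actual OLM API):

```go
package pipeline

import "context"

// Variable is the unit of input into the solver: a unique ID plus
// constraints relating it to other variables.
type Variable interface {
	ID() string
	Constraints() []Constraint
}

// Constraint relates a variable to other variables (e.g. mandatory,
// dependency, conflict). Its exact shape is left open in this sketch.
type Constraint interface {
	String() string
}

// VariableProducer generates variables and sends them downstream.
type VariableProducer interface {
	Produce(ctx context.Context, out chan<- Variable) error
}

// VariableProcessor consumes variables and emits (possibly different) variables.
type VariableProcessor interface {
	Process(ctx context.Context, in <-chan Variable, out chan<- Variable) error
}

// VariableConsumer only consumes variables and produces nothing.
type VariableConsumer interface {
	Consume(ctx context.Context, in <-chan Variable) error
}
```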
## Resolver as a Pipeline
The current OLM resolver works roughly in the following way:
1. Collects `Subscriptions` and translates them into `required package` variables that are mandatory and have a dependency on the bundles that can fulfill the required package (e.g. fit the given channel and version range).
2. Collects `ClusterServiceVersions` and translates those into `installed package` variables that are mandatory and depend on the variable representing the specific bundle that is installed
3. Collects `bundle` variables that represent each of the bundles in the repository. These bundles can have dependencies on other bundle variables (e.g. to fulfill their package or gvk dependencies)
4. Adds `global uniqueness` constraints to make sure that at most one bundle per package is selected and at most one bundle per gvk is selected
These steps are implemented in code and hidden behind the resolver interface, and modifying them is not a trivial exercise.
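To make the translation concrete, here is a hypothetical `required package` variable expressed with the sketched types from the Definitions section (the `mandatory` and `dependency` constraint kinds are illustrative, not the real OLM constraint API, and the snippet assumes `fmt` is imported alongside the earlier sketch):

```go
// Hypothetical constraint kinds, continuing the sketch above.
type mandatory struct{}

func (mandatory) String() string { return "mandatory" }

type dependency struct{ bundleIDs []string }

func (d dependency) String() string { return fmt.Sprintf("dependsOn(%v)", d.bundleIDs) }

// requiredPackageVariable corresponds to step 1 above: it is mandatory and
// depends on the bundle variables that fit the requested channel/version range.
type requiredPackageVariable struct {
	packageName        string
	candidateBundleIDs []string
}

func (v requiredPackageVariable) ID() string { return "required-package/" + v.packageName }

func (v requiredPackageVariable) Constraints() []Constraint {
	return []Constraint{
		mandatory{},
		dependency{bundleIDs: v.candidateBundleIDs},
	}
}
```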
The pipeline representation of this process can be visualized as follows:

1. The `Installed Packages` and `Required Packages` producers create and send their variables to the `Solver` and to the `Bundles and Dependencies` processor
2. The `Bundles and Dependencies` processor examines the incoming variables, deduces the bundles attached to those variables, and produces bundle variables for those bundles and for their dependencies. It sends these variables to the `Global Constraints` and `Solver` processors (see the wiring sketch after this list)
3. The `Global Constraints` processor keeps track of the bundles for a particular package and gvk by examining the bundle variables being given to it. Once it has examined all bundle variables, it produces the global constraint variables and sends them to the `Solver`
4. The `Solver` processor keeps track of all the variables it is given. Once it has them all, it gives them to the solver for resolution and outputs the selection to the `Output Collector` consumer
5. The `Output Collector` keeps track of all variables given to it. Once it is finished collecting all variables, the pipeline is complete and the output can be examined
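A minimal sketch of how such a pipeline could be wired together with Go channels is shown below; the node names and channel layout are illustrative only and not an existing OLM implementation. Each node concludes by closing its output edges once its inputs are exhausted.

```go
package main

import (
	"fmt"
	"sync"
)

// variable stands in for the solver's input unit in this sketch.
type variable struct{ id string }

// fanOut copies every variable from in to each out channel, mirroring a node
// that feeds both the Solver and another downstream processor. It closes its
// output edges once its input has concluded.
func fanOut(in <-chan variable, outs ...chan<- variable) {
	defer func() {
		for _, out := range outs {
			close(out)
		}
	}()
	for v := range in {
		for _, out := range outs {
			out <- v
		}
	}
}

func main() {
	produced := make(chan variable) // Installed/Required Packages producers
	toSolver := make(chan variable) // edge into the Solver
	toGlobal := make(chan variable) // edge into the Global Constraints processor

	var wg sync.WaitGroup
	wg.Add(3)

	// Producer: emits a couple of package variables, then concludes by closing its output.
	go func() {
		defer wg.Done()
		defer close(produced)
		for _, id := range []string{"required-package/foo", "installed-package/bar"} {
			produced <- variable{id: id}
		}
	}()

	// Bundles and Dependencies processor (greatly simplified): forwards incoming
	// variables to both the Global Constraints processor and the Solver.
	go func() {
		defer wg.Done()
		fanOut(produced, toGlobal, toSolver)
	}()

	// Global Constraints processor (stub): just drains its input here.
	go func() {
		defer wg.Done()
		for v := range toGlobal {
			fmt.Println("global constraints saw:", v.id)
		}
	}()

	// Solver (stub): in the real pipeline it would collect all variables and
	// run resolution once every input edge has concluded.
	for v := range toSolver {
		fmt.Println("solver saw:", v.id)
	}
	wg.Wait()
}
```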
### Pipeline Nodes
As previously mentioned, the pipeline is a DAG rooted on `Producers`, with `Consumers` at the leaves. There are three types of nodes:
- **Producers**: Producers generate data events for either `Consumers` or `Processors`. A producer concludes once it has produced its last item
- **Processors**: Processors consume data events and produce different data events. A processor concludes once all of its data sources have concluded
- **Consumers**: Consumers consume data events but don't create anything new. They have no output edges. A consumer concludes once all of its input sources have concluded
Nodes can be in one of (at least) the following states:
- **Inactive**: it is ready to start its process but it hasn't started yet
- **Active**: its process is on-going
- **Successful**: it has completed its process without errors
- **Failed**: it has completed its process with an error
- **Aborted**: it has aborted its process due to either up- or downstream errors, or due to context expiry
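As a small sketch, these states could be captured in a simple enumerated type (hypothetical, mirroring the list above):

```go
// NodeState captures the node lifecycle described above (hypothetical type).
type NodeState int

const (
	NodeStateInactive   NodeState = iota // ready to start, not yet started
	NodeStateActive                      // process is on-going
	NodeStateSuccessful                  // completed without errors
	NodeStateFailed                      // completed with an error
	NodeStateAborted                     // aborted due to up-/downstream errors or context expiry
)
```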
### Events
The nodes in the pipeline communicate through events. Each event contains a header that includes information that would allow us to reconstruct the execution of the pipeline, such as:
- The time the event was created
- Who created the event
- Who was the sender and who was the receiver
- Custom metadata (string->string map): could include things like a pipelineID and executionID
- A unique event ID
Events can be of different types, which describe what kind of data the event carries. We'd need at least two event types: data and error. Data events carry the variables, while error events surface processing errors.
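Continuing the earlier Go sketch (and assuming the `time` package is imported), a hypothetical event envelope covering the header fields and event types above might look like this; the field names are illustrative:

```go
// EventType describes what kind of data an event carries.
type EventType string

const (
	EventTypeData  EventType = "data"  // carries variables
	EventTypeError EventType = "error" // surfaces a processing error
)

// EventHeader carries the information needed to reconstruct a pipeline execution.
type EventHeader struct {
	EventID   string            // unique event ID
	CreatedAt time.Time         // when the event was created
	Creator   string            // node that created the event
	Sender    string            // node that sent the event
	Receiver  string            // node that received the event
	Metadata  map[string]string // custom metadata, e.g. pipelineID and executionID
}

// Event is the unit of communication between pipeline nodes.
type Event struct {
	Header    EventHeader
	Type      EventType
	Variables []Variable // populated for data events
	Err       error      // populated for error events
}
```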
### Error Handling
Any of the steps can fail at any time during processing. A pipeline can have different postures towards failures. For a start, it would be simple and sufficient to just abort execution in case of an error. If a node encounters an error, it can broadcast an error event to all pipeline nodes, which in turn can abort their execution. If a pipeline fails, the state of each node can be examined to find out the culprit(s) and their reasons.
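One illustrative way to realize this posture, continuing the earlier sketch, is to pair the error broadcast with a cancellable context so that every node can transition to `Aborted` promptly (the `runNode` helper below is hypothetical):

```go
// runNode runs a node's work and, on failure, emits an error event; the
// pipeline runner is assumed to cancel ctx in response, moving the other
// nodes to Aborted.
func runNode(ctx context.Context, name string, errs chan<- Event, work func(context.Context) error) {
	if err := work(ctx); err != nil && ctx.Err() == nil {
		errs <- Event{
			Header: EventHeader{Creator: name, CreatedAt: time.Now()},
			Type:   EventTypeError,
			Err:    err,
		}
	}
}
```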
### Debugging
A debug channel can be given to a pipeline such that every event generated by the pipeline is also sent down the debug channel. This should give a complete overview of the pipeline execution. It could even be possible to replay source events (in the order they were created) through the pipeline to reproduce executions.
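A sketch of what this could look like, continuing the earlier event types: every event flowing along an edge is also copied to the debug channel (the `tee` helper is hypothetical):

```go
// tee forwards every event along its edge and also copies it onto the debug
// channel, yielding a complete trace of the execution that could later be
// replayed in creation order.
func tee(in <-chan Event, out chan<- Event, debug chan<- Event) {
	defer close(out)
	for e := range in {
		debug <- e
		out <- e
	}
}
```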
### Pipeline Modelling and Extensibility
With a library of node types (e.g. `required-package-producer`, `global-variable-processor`, etc.) it is easy to imagine a declarative pipeline configuration that can be defined in yaml, reconciled to ensure it meets certain conditions (e.g. it's a DAG), and used to fire off executions against that model. Pipeline configuration could be immutable, in the sense that any change made to a pipeline results in a new pipeline ID that can be included in the header of the events as part of metadata. Furthermore, every execution of the same pipeline can have its own unique ID, also included in the event header metadata. This could further facilitate support and debugging efforts.
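To illustrate, a declarative pipeline definition built from such a node library might look roughly like the following YAML; the schema, API group, field names, and node type names are purely hypothetical:

```yaml
apiVersion: resolution.example.io/v1alpha1   # hypothetical API group/version
kind: ResolutionPipeline
metadata:
  name: default-resolution
spec:
  nodes:
    - name: required-packages
      type: required-package-producer
    - name: installed-packages
      type: installed-package-producer
    - name: bundles-and-dependencies
      type: bundle-dependency-processor
    - name: global-constraints
      type: global-constraint-processor
    - name: solver
      type: solver-processor
    - name: output
      type: output-collector-consumer
  edges:
    - from: required-packages
      to: [bundles-and-dependencies, solver]
    - from: installed-packages
      to: [bundles-and-dependencies, solver]
    - from: bundles-and-dependencies
      to: [global-constraints, solver]
    - from: global-constraints
      to: [solver]
    - from: solver
      to: [output]
```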
By adding different types of nodes, the pipeline can be further configured and extended. For instance:
- **static-variable-producer**: produces static variables (that can come from declarative sources). This becomes an easy way to gently nudge the resolver in one direction or another.
- **online-variable-processor/producer/consumer**: a named on-cluster process that provides a standard API for event handling
- **declarative-variable-processor**: a declaratively configurable processor that can provide a certain range of possibilities, e.g. filtering, mutating specific variables, etc.
## Downsides
This approach completely exposes the way the solver gets its variables. Therefore, there's a significant blast radius here, which would need to be contained: a small mistake could lead to big effects on the cluster. Because this approach exposes how resolution works, it also demands more of the user's understanding of the system (at least in cases where they might want to change it).