# Unavailable global catalog source prevents resolution from succeeding
## Summary
Background:
https://bugzilla.redhat.com/show_bug.cgi?id=2076323
Currently, the OLM resolver is configured to perform resolution on a particular namespace based on the set of subscriptions in that namespace. To build a cache of potential solutions to resolution, the resolver queries for any catalogsource present in the namespace and in the global catalog namespace, a special namespace that is accessible for resolution across any namespace in the cluster.
Resolution should be deterministic, in the sense that a particular set of inputs should always lead to the same content being applied to the cluster. In the case of potential ambiguity, such as when a particular catalog is in an error state and not responding, the resolver currently throws an error and does not provide a solution. This is logical, but causes all resolution to fail across the cluster whenever there is a problem with any catalog in the global catalog namespace.
There needs to be a way to provide deterministic resolution while still tolerating transient catalog errors. This is essentially a tradeoff between correctness and UX: the most deterministic and correct behavior from the resolver's point of view can leave resolution broken across the entire cluster, which makes for a poor UX.
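To make the failure mode concrete, here is a minimal sketch of the all-or-nothing behavior described above. It is not the OLM implementation: `Catalog`, `query`, and `buildCacheStrict` are hypothetical names introduced for illustration, and the namespaces are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnavailable marks a catalog that could not be queried.
var ErrUnavailable = errors.New("catalog source unavailable")

// Catalog is a hypothetical, simplified stand-in for a CatalogSource.
type Catalog struct {
	Namespace string
	Name      string
	Healthy   bool
	Bundles   []string
}

// query is a hypothetical stand-in for fetching a catalog's contents.
func query(c Catalog) ([]string, error) {
	if !c.Healthy {
		return nil, fmt.Errorf("%s/%s: %w", c.Namespace, c.Name, ErrUnavailable)
	}
	return c.Bundles, nil
}

// buildCacheStrict gathers candidates from every visible catalog and aborts
// on the first failure, so one broken global catalog blocks resolution in
// every namespace that can see it.
func buildCacheStrict(visible []Catalog) ([]string, error) {
	var candidates []string
	for _, c := range visible {
		bundles, err := query(c)
		if err != nil {
			return nil, err
		}
		candidates = append(candidates, bundles...)
	}
	return candidates, nil
}

func main() {
	visible := []Catalog{
		{Namespace: "team-a", Name: "local", Healthy: true, Bundles: []string{"etcd.v1"}},
		{Namespace: "global-catalogs", Name: "community", Healthy: false},
	}
	if _, err := buildCacheStrict(visible); err != nil {
		fmt.Println("resolution failed:", err)
	}
}
```

In this sketch the subscription in `team-a` never needed the unhealthy catalog, yet resolution still fails, which is the UX problem this note is about.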
## Potential solutions
* Only consider catalogs that are explicitly defined in the subscription spec for resolution when such a catalog is provided. This is explored in https://github.com/operator-framework/operator-lifecycle-manager/pull/2749 but is not a viable solution, because dependencies of the originating subscription can come from the global catalog. Changing the behavior of resolution this way would break existing users who expect their dependency resolution to work as before.
* Add an opt-in setting on the operator group in a particular namespace that enables transient catalog errors to be ignored during resolution. In that case, with the setting enabled, if a catalog is unreachable it's ignored for the purposes of resolution.
* Lean on the SAT-solver for resolution determinism; e.g. operators that require deterministic resolution have constraints on unique label properties only available in the "correct catalog".
* Perform a preemptive analysis: when resolution is performed, if none of the subscriptions in the namespace have external dependencies, and all of them reference a catalog in that namespace, do not reach out to the global catalog namespace.
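The preemptive analysis in the last bullet could look roughly like the following. This is a sketch under stated assumptions: `Subscription` is a simplified, hypothetical type, and `HasExternalDeps` assumes the resolver can learn ahead of time whether a package declares dependencies, which this note does not establish.

```go
package main

import "fmt"

// Subscription is a hypothetical, simplified view of an OLM Subscription.
type Subscription struct {
	Package          string
	CatalogNamespace string // namespace of the catalog source the subscription references
	HasExternalDeps  bool   // assumption: known before resolution starts
}

// needsGlobalCatalogs reports whether resolution in ns has to consult the
// global catalog namespace. It returns false only when every subscription
// references a catalog in ns and none declares external dependencies.
func needsGlobalCatalogs(ns string, subs []Subscription) bool {
	for _, s := range subs {
		if s.HasExternalDeps || s.CatalogNamespace != ns {
			return true
		}
	}
	return false
}

func main() {
	subs := []Subscription{
		{Package: "etcd", CatalogNamespace: "team-a"},
		{Package: "prometheus", CatalogNamespace: "team-a"},
	}
	fmt.Println("query global catalogs:", needsGlobalCatalogs("team-a", subs))
}
```

The check is deliberately conservative: any doubt about a subscription sends resolution back to the existing path that includes the global catalogs.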
### Option 1 - namespace-scoped resolution toggle
Add a toggle on an OLM resource present in the namespace where resolution occurs that indicates that dependencies should be resolved only within that namespace. Since an OperatorGroup must already be present in the namespace to successfully resolve, this API would be a natural place to add the toggle. This approach is similar to the [Fail Forward feature](https://github.com/operator-framework/enhancements/pull/110) which added a toggle on the OperatorGroup to change certain OLM behavior.
This approach has some downsides, namely that the toggle would need to be enabled beforehand in every namespace in which resolution occurs to ensure the new resolution behavior. It also further dilutes the OperatorGroup API, which was not intended to influence resolution decisions at all. It may be viable as a short-term solution if there are serious service degradation issues that cannot be addressed by the existing workaround (turning off default catalogs from the global catalog namespace and removing them).
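As a rough illustration of Option 1, the toggle could gate which catalogs the resolver queries. The field name `NamespaceScopedResolution` and the helper `catalogsToQuery` are hypothetical; no such field exists on the OperatorGroup API today, and the struct below omits all of the real spec's fields.

```go
package main

import "fmt"

// OperatorGroupSpec is a trimmed-down sketch, not the real API type.
type OperatorGroupSpec struct {
	// NamespaceScopedResolution is a hypothetical opt-in toggle.
	NamespaceScopedResolution bool
}

// CatalogKey identifies a catalog source by namespace and name.
type CatalogKey struct {
	Namespace string
	Name      string
}

// catalogsToQuery returns the catalogs the resolver would consult for ns.
// Catalogs in ns are always included; catalogs in globalNS are included
// only when the toggle is off.
func catalogsToQuery(spec OperatorGroupSpec, ns, globalNS string, all []CatalogKey) []CatalogKey {
	var out []CatalogKey
	for _, c := range all {
		switch {
		case c.Namespace == ns:
			out = append(out, c)
		case c.Namespace == globalNS && !spec.NamespaceScopedResolution:
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []CatalogKey{
		{Namespace: "team-a", Name: "local"},
		{Namespace: "global-catalogs", Name: "community"},
	}
	spec := OperatorGroupSpec{NamespaceScopedResolution: true}
	fmt.Println(catalogsToQuery(spec, "team-a", "global-catalogs", all))
}
```

Because the zero value of the field keeps today's behavior, existing OperatorGroups would be unaffected, which is also why every namespace would have to opt in explicitly.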
### Option 2 - add first-class catalog dependency constraints
Solving this issue in a more robust way would involve enabling support for constraints on a particular catalog that would be considered by the resolver during resolution. The resolver would only consider a particular catalog, or set of catalogs, if this constraint were provided. This doesn't exist in the OLM resolver today, but would be a feature candidate for the new resolver work as part of deppy.
This would be a longer-term solution, assuming the workaround (turning off default catalogs from the global catalog namespace and removing them) is sufficient for the time being.
See https://issues.redhat.com/browse/RFE-2801 for an RFE that outlines this potential solution.
### Option 3 - treat existing behavior as a bug
Based on meetings with PM and the SD team, it was determined that the existing behavior (where an unresponsive catalog prevents resolution globally) was problematic enough to be considered a bug, even though it was the product of a previous bugfix. The solution in this option is to treat the behavior as a bug and allow resolution to proceed in the face of catalog connection problems. This is a deviation from existing behavior: after the change, resolution may go through with an incomplete picture of the catalogs on the cluster. The bugfix would basically be an undo of https://github.com/operator-framework/operator-lifecycle-manager/pull/2290, the fix for the original issue.
Additionally, rather than being purely "best effort", we could respect the subscription spec and fail resolution if any of the catalog sources referenced by the subscriptions in the resolver context are unavailable. This would:
- Keep with the intent of the user to source a package from a specific catalog source
- Keep the mental model of resolving the whole namespace in unison
- Solve the issue of blocking resolution if an unreferenced global catalog source is down
- Continue with a best effort basis to resolve dependencies
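The refined policy in the list above can be sketched as a single check run before resolution. `CatalogKey` and `checkCatalogs` are hypothetical names for illustration, not existing OLM code.

```go
package main

import "fmt"

// CatalogKey identifies a catalog source by namespace and name.
type CatalogKey struct {
	Namespace string
	Name      string
}

// checkCatalogs fails if any unavailable catalog is referenced by a
// subscription in the namespace being resolved. Unavailable catalogs that
// no subscription references are skipped and returned so the caller can
// surface them via an error condition or event.
func checkCatalogs(referenced map[CatalogKey]bool, unavailable []CatalogKey) (skipped []CatalogKey, err error) {
	for _, c := range unavailable {
		if referenced[c] {
			return nil, fmt.Errorf("catalog source %s/%s referenced by a subscription is unavailable", c.Namespace, c.Name)
		}
		skipped = append(skipped, c)
	}
	return skipped, nil
}

func main() {
	referenced := map[CatalogKey]bool{
		{Namespace: "team-a", Name: "local"}: true,
	}
	down := []CatalogKey{{Namespace: "global-catalogs", Name: "community"}}

	skipped, err := checkCatalogs(referenced, down)
	fmt.Println("skipped:", skipped, "err:", err)
}
```

Returning the skipped catalogs keeps the behavior observable: resolution proceeds on a best-effort basis, but the caller still knows exactly which catalogs were left out.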
## Notes
* Putting the knob on the operator-group may overload the operator group API
* Putting the knob on the operator-group API would require it being enabled for every namespace in the cluster for resolution to be successful across the cluster
* Release valve: turn off the global operators catalogs if those catalogs are failing
* In the upstream world, there should not be any dependency on Red Hat registries
* Resolution with incomplete information is potentially acceptable (should only cause rare problems, most operators do not use dependencies)
* Using constraints would enable users to specify dependencies on specific catalogs
* Must disable global catalog sources before removing them
* Ensure that the particular catalog that isn't responding is surfaced via error/event