# Operator.X
* **Status**: Final
* **JIRA**: [KEYCLOAK-????](https://issues.jboss.org/browse/KEYCLOAK-????)
Needs investigation:
* [Descoped](https://docs.google.com/presentation/d/1j1J575SxS8LtL_YvKqrexUhso7j4SgrLfyNrDUroJcc/edit?usp=drive_web)[Jon]
* Supported K8s/OCP versions. Sync with PMs.[Jon]
* K8s dist? Make sure we support only OCP.
* Really no native image? Sync with PMs/Stian.[Jon]
* Marketing?
* IBM Z/P. [VM]
* No multiarch support.
* Admin credentials. [VM]
* Bootstrapping initial admin user.
* Metrics [Jon]
* Gather requirements
  * Align with Dist.X metrics (operand metrics)
* Operator metrics. Operator SDK helps with that?
* Prometheus
* Sync with RHMI and MAS-SSO teams
* Upgrades [Jon]
  * What happens with custom images?
  * What's the flow for the Operator when a new version appears and the user is using custom images?
* Configuring KC from K8s[Andrea]
* Alternative to git static store
* CRDs? Generated from Java representations
* Config map?
* Secret values (creds etc.)?
* Quarkus version in SDK
* Should be latest and greatest
* Integration with a DB operator[Jon]
Investigated:
* How will Ingresses/Routes work?[Andrea]
* Custom config?
* Should even the operator create Ingress?
* Best practices
* Canary releases.[Andrea]
* specific version of operator+keycloak
* operator supporting latest + previous version
  * several CRD versions in the same cluster?
* best practices?
* StatefulSet for KC deployment? Sync with Store.X team and Infinispan team.[Jon]
* Temporary solution before we have Store.X (static storage)[Jon]
* Basically only import from K8s objects.
## How do Operators align with the Red Hat strategy?
* Why is Red Hat relying on Operators?
* What do Operators bring to the Red Hat strategy?
* What is the benefit of doing it in Quarkus + JOSDK?
* What problems or challenges will this Operator solve?
[Presentation](https://docs.google.com/presentation/d/1NdEy2QImLmWVZQMjph9AmnYq1KBLVInWMTsytngxDmo)
### Phase 1: Basic Install - 2022Q1
Based on the CR, the Operator will deploy Keycloak and its operands (with a default Realm). We still need to agree on the approach for the admin credentials.
The CR will expose the properties that can be modified to scale replicas, adjust resources, etc.
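As an illustration only (field names here are hypothetical, not an agreed CRD design), the Phase 1 spec could be a plain Java POJO the CRD schema is generated from, exposing exactly those knobs:

```java
import io.fabric8.kubernetes.api.model.ResourceRequirements;

// Illustrative sketch of the Phase 1 CR spec; not the final CRD design.
public class KeycloakSpec {

    // Number of Keycloak replicas to run.
    private int instances = 1;

    // Optional custom server image; the Operator default is used when unset.
    private String image;

    // Standard Kubernetes resource requests/limits for the Keycloak pods.
    private ResourceRequirements resources;

    public int getInstances() { return instances; }
    public void setInstances(int instances) { this.instances = instances; }

    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }

    public ResourceRequirements getResources() { return resources; }
    public void setResources(ResourceRequirements resources) { this.resources = resources; }
}
```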
### Phase 2: Seamless Upgrades - 2022Q1
The Keycloak.X Operator version will be fully aligned with the Keycloak.X version, so for every new version of Keycloak.X a new Operator needs to be released, even when no code change is involved.
The upgrade of Keycloak.X will be triggered by the Operator upgrade.
Backup? The Operator will no longer support DB backup, as it won't manage the DB installation, but once Store.X is included, a Store.X backup process should probably be triggered before upgrading.
### Phase 3: Full Lifecycle
* Storage: backup, failure recovery?
* Application lifecycle?
### Phase 4: Deep Insights
Regarding metrics, tracing, and logging, we should definitely try to embrace [OpenTelemetry](https://opentelemetry.io/). The hard part is defining exactly what we want to have for metrics, tracing, and logging.
The Operator should provide a way to export both Keycloak and Operator telemetry.
At the moment, OpenTelemetry for Quarkus only covers tracing; to cover metrics, MicroProfile Metrics or Micrometer is needed.
`(?)` Should the Operator depend on metrics backends or simply allow the user to configure the URLs for those services?
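If Micrometer is chosen, exposing an Operator-level metric from the Quarkus-based Operator could look roughly like the sketch below; the metric name and the quarkus-micrometer wiring are assumptions, not a decided design:

```java
import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;

import io.micrometer.core.instrument.MeterRegistry;

@ApplicationScoped
public class OperatorMetrics {

    // The quarkus-micrometer extension exposes a MeterRegistry as a CDI bean.
    @Inject
    MeterRegistry registry;

    // Hypothetical metric: count reconciliation outcomes so Prometheus can scrape them.
    public void recordReconciliation(boolean success) {
        registry.counter("keycloak_operator_reconciliations_total",
                "outcome", success ? "success" : "error")
                .increment();
    }
}
```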
### Phase 5: Auto Pilot
To achieve horizontal autoscaling we can use the Kubernetes [Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/). The main Keycloak CRD would need the properties that allow the user to specify the metrics formula and the limits (sketched below).
Regarding vertical autoscaling, it apparently requires installing an additional resource (e.g. the Vertical Pod Autoscaler; see [this deep dive](https://medium.com/infrastructure-adventures/vertical-pod-autoscaler-deep-dive-limitations-and-real-world-examples-9195f8422724)), **BUT** we [can not have both HPA and VPA](https://betterprogramming.pub/understanding-vertical-pod-autoscaling-in-kubernetes-6d53e6d96ef3) when both are based on CPU/memory resources.
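A minimal sketch of the kind of autoscaling properties the CRD could expose (field names are hypothetical); the Operator would translate them into a standard `autoscaling/v2` HorizontalPodAutoscaler targeting the Keycloak workload:

```java
// Hypothetical autoscaling section of the Keycloak CR spec; illustrative only.
public class AutoscalingSpec {

    // Lower and upper bounds for the number of Keycloak pods.
    public int minReplicas = 1;
    public int maxReplicas = 3;

    // Target average CPU utilization (percentage), as used by the HPA.
    public int targetCpuUtilizationPercentage = 80;
}
```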
## Motivation for Operator.X
Keycloak Operator.X addresses the situation where the current Operator, written in Go, is:
* hard to maintain
* written in a language with less expertise inside the team
* in need of an upgrade, since it uses an old Go version and an old Operator SDK version, in order to benefit from the latest features (webhooks) and the latest fixes and patches
* expensive to upgrade: it would require a huge effort to create a new project and to move and refactor some parts
* built around a CRD design whose business case no longer fits and needs a redesign
## Use-cases
Deployment of Keycloak.X in Kubernetes and Openshift. Other clusters are not considered in the first MVP.
**TBD**: the minimum version of Openshift and, therefore, of the Kubernetes API.
At the moment [OCP 4.7](https://docs.openshift.com/container-platform/4.7/release_notes/ocp-4-7-release-notes.html) is still supported on [maintenance](https://access.redhat.com/support/policy/updates/openshift) and runs on K8s API 1.20.
## How the feature will be used (configuration, APIs, UIs, etc.)
Operator deployment will follow OLM, both via the OperatorHub marketplace embedded in Openshift and via the OperatorHub.io command-line approach.
## Technical definition
This operator is a Keycloak module and will follow its BOM to retrieve library versions and JDK restrictions. This mainly affects the Quarkus version.
It will use the Java Operator SDK (JOSDK) and its Quarkus extension, which implies the usage of the Fabric8 Kubernetes client.
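A minimal sketch of what the main reconciler could look like with JOSDK and Fabric8; class names, group, and version are illustrative, `KeycloakSpec` is the POJO sketched under Phase 1, and exact SDK signatures vary slightly between versions:

```java
import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;
import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.api.reconciler.ControllerConfiguration;
import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

// Placeholder status POJO; status fields/conditions are still to be designed.
class KeycloakStatus { }

// The single CRD representing a Keycloak installation (group/version are illustrative).
@Group("keycloak.org")
@Version("v1alpha1")
class Keycloak extends CustomResource<KeycloakSpec, KeycloakStatus> implements Namespaced { }

@ControllerConfiguration
public class KeycloakReconciler implements Reconciler<Keycloak> {

    @Override
    public UpdateControl<Keycloak> reconcile(Keycloak keycloak, Context<Keycloak> context) {
        // Derive the desired Deployment/StatefulSet, Service and (optionally) Ingress
        // from keycloak.getSpec() and apply them through the Fabric8 client here.
        return UpdateControl.noUpdate();
    }
}
```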
## Implementation details
This operator will use **ONLY** one CRD, representing the Keycloak installation; Realms, Roles, and Clients will no longer be managed by the operator and will need to be maintained using the Keycloak dashboard.
Initially, there should be a way to provide static configuration files to create Realms, Roles, and Clients. One approach would be to keep the current CRDs for those resources and add an `importer` that reads them and translates the content into REST requests to Keycloak.
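A rough sketch of what such an importer could do using the existing Keycloak admin client (`Keycloak` below is the admin client class from `org.keycloak.admin.client`, not the CR); mapping the CR content to a `RealmRepresentation` is left out, and the class itself is hypothetical:

```java
import org.keycloak.admin.client.Keycloak;
import org.keycloak.admin.client.KeycloakBuilder;
import org.keycloak.representations.idm.RealmRepresentation;

public class RealmImporter {

    // Pushes a realm definition (already translated from the static CR/config file)
    // to a running Keycloak instance through the admin REST API.
    public void importRealm(String serverUrl, String adminUser, String adminPassword,
                            RealmRepresentation realm) {
        Keycloak adminClient = KeycloakBuilder.builder()
                .serverUrl(serverUrl)
                .realm("master")
                .clientId("admin-cli")
                .username(adminUser)
                .password(adminPassword)
                .build();
        try {
            adminClient.realms().create(realm);
        } finally {
            adminClient.close();
        }
    }
}
```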
### Kubernetes particularities
**StatefulSet vs Deployment**: the old operator used a StatefulSet because it basically needed stateful, isolated storage to hold the deployed Wildfly application, and also because the distributed cache in use (Infinispan) does not tolerate fast Pod termination: the master/backup pod could be terminated before Infinispan has time to re-balance. With the Keycloak.X approach the server is self-contained, so we could ping the Infinispan team and check whether we can move to a stateless approach (a Deployment, which creates a ReplicaSet).
**Ingress**: the old operator expected to find nginx-controller as the Ingress class (on vanilla Kubernetes) and so it was [specifying](https://github.com/keycloak/keycloak-operator/pull/381/files#diff-bc65b12c98f9ad70343fc4c796178c77d5797f83ce9acee1c7f655141f98295dR26) the ingress-class annotation. We could stop relying on this and instead document that the cluster needs to have a [default ingress class](https://kubernetes.io/docs/concepts/services-networking/ingress/#default-ingress-class).
**CRD fields**: we could start with the current fields and re-evaluate them. TBD.
**Kubernetes vs Openshift objects**: the first approach is to **NOT** have a double K8s/OCP deployment. The deployment will be a single one, as for standard vanilla Kubernetes, but in order to get more powerful features from Openshift, some annotations can be introduced when an OCP cluster is detected.
### Dependencies
The Keycloak Operator should not be responsible for deploying and managing its dependencies (e.g. Prometheus, Grafana, PostgreSQL), so it should connect to external services providing those features.
However, the Operator can also improve the UX for users who don't already have those services deployed by declaring [OLM dependencies](https://docs.openshift.com/container-platform/4.6/operators/understanding/olm/olm-understanding-dependency-resolution.html) on external Operators, which can eventually install and manage those services.
## Productization
* It depends on the productization of Quarkus JOSDK
* It needs to be buildable with OpenJ9 in order to run on IBM Z/P machines
* Lately, IBM is moving to [Semeru](https://developer.ibm.com/languages/java/semeru-runtimes/downloads) Runtimes, which is basically the OpenJDK class libraries with the OpenJ9 JVM.
* Productization Gates document [here](https://docs.engineering.redhat.com/x/hkEZAw)
* Instructions on how to run the Gates pipeline can be found [here](https://docs.engineering.redhat.com/x/6kUqAw)
## Migration from the old operator
Just document it.
## Ingresses
There is no clear guidance or de-facto standard around operator-managed Ingresses for Kubernetes.
There are operators specifically intended to deploy and manage Ingress resources (e.g. the Kong operator).
A pattern used by some operators (such as Grafana and ArgoCD) is to provide an optional, simple, and opinionated Ingress resource for a seamless getting-started experience; a viable alternative is to only mention Ingress creation in the docs.
The important aspect we should take into consideration is the associated Kubernetes Service, which should be fully customizable to fit the user's needs.
To help initial, non-advanced users, the Operator **can create Ingresses**; in that case there's a decision to make on Openshift. For every Ingress, Openshift will create a Route, so initially no manual creation of Routes should be involved.
An important aspect is the IngressClass (similar to the StorageClass for storage). In the latest versions of the Nginx Ingress Controller, the IngressClass needs to be defaulted by the user in the cluster or specified in the Ingress resource. This applies to Kubernetes; with Openshift this is more automated.
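A sketch of the opinionated Ingress the Operator could create with the Fabric8 builders, deliberately leaving `ingressClassName` unset so the cluster's default IngressClass applies; the resource name, host handling, and backend port are illustrative:

```java
import io.fabric8.kubernetes.api.model.networking.v1.Ingress;
import io.fabric8.kubernetes.api.model.networking.v1.IngressBuilder;

public class KeycloakIngress {

    public Ingress create(String namespace, String host) {
        return new IngressBuilder()
                .withNewMetadata()
                    .withName("keycloak")
                    .withNamespace(namespace)
                .endMetadata()
                .withNewSpec()
                    // No ingressClassName: rely on the cluster's default IngressClass.
                    .addNewRule()
                        .withHost(host)
                        .withNewHttp()
                            .addNewPath()
                                .withPath("/")
                                .withPathType("Prefix")
                                .withNewBackend()
                                    .withNewService()
                                        .withName("keycloak")
                                        .withNewPort().withNumber(8080).endPort()
                                    .endService()
                                .endBackend()
                            .endPath()
                        .endHttp()
                    .endRule()
                .endSpec()
                .build();
    }
}
```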
~~Another important point is **TLS**. The Operator needs to provide a way to the user in order to provide the certs used when connecting to the Ingress.~~
> [name=Andrea Peruffo] the example ingress shouldn't handle production scenarios, so that people will not use it for production. In this case the operator should not deal with TLS and ingress customizations.
> [name=Jon] at the moment the current operator is creating that ingress, for production, and providing a mechanism to set the TLS. If we want to change this approach that's fine, but we need to justify it, especially in terms of UX.
## Upgrades
Approach to follow (version-blocking upgrade); see the sketch after the list:
* manual upgrade approval
* the Operator will check the Keycloak version on the first running pod, and if it is not the expected one:
  * Operator:
    * mark the CR as "Incompatible"
    * add a log entry & a Prometheus metric
    * exit
  * Customer:
    * update the CR, setting the new image to use
* if the Keycloak version is the expected one:
  * Operator:
    * execute the reconciliation process
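A sketch of that version gate in reconciler terms, reusing the `Keycloak` CR type from the Technical definition section; the helper methods are hypothetical and left abstract on purpose:

```java
import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

// Sketch of the version gate executed at the start of every reconciliation.
public abstract class UpgradeGatedReconciler implements Reconciler<Keycloak> {

    @Override
    public UpdateControl<Keycloak> reconcile(Keycloak keycloak, Context<Keycloak> context) {
        String expected = expectedKeycloakVersion();                // version this Operator release targets
        String running = runningKeycloakVersion(keycloak, context); // version found on the first running pod

        if (running != null && !running.equals(expected)) {
            // Block: mark the CR as "Incompatible", log, expose a Prometheus metric, and stop here.
            markIncompatible(keycloak, running, expected);
            return UpdateControl.updateStatus(keycloak);
        }

        // Versions match (or nothing is running yet): execute the normal reconciliation process.
        reconcileOperands(keycloak, context);
        return UpdateControl.noUpdate();
    }

    protected abstract String expectedKeycloakVersion();

    protected abstract String runningKeycloakVersion(Keycloak keycloak, Context<Keycloak> context);

    protected abstract void markIncompatible(Keycloak keycloak, String running, String expected);

    protected abstract void reconcileOperands(Keycloak keycloak, Context<Keycloak> context);
}
```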
## Reaugmentation process in K8s
We are exploring the idea of leveraging K8s volumes to act as "caches" for the "augmented"/configured version of Keycloak.
An initial POC to show the concept has been drafted here:
https://github.com/andreaTP/poc-mutable-jar
We can use a Kubernetes "emptyDir" or a "PVC" to store the augmented version of the binaries.
The K8s volume will be filled by an init container, and the operation should result in a no-op in case the volume has already been populated by a compatible augmentation.
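A sketch of the pod template the Operator could generate with the Fabric8 builders; the mount paths, the marker file, and the augmentation command are assumptions for illustration only:

```java
import io.fabric8.kubernetes.api.model.PodSpec;
import io.fabric8.kubernetes.api.model.PodSpecBuilder;

public class AugmentedPodSpec {

    public PodSpec create(String keycloakImage) {
        return new PodSpecBuilder()
                // Init container: run the augmentation and copy the result into the shared
                // volume, skipping the work if the volume is already populated (no-op).
                .addNewInitContainer()
                    .withName("keycloak-augment")
                    .withImage(keycloakImage)
                    .withCommand("/bin/sh", "-c",
                        "[ -e /augmented/.done ] || (/opt/keycloak/bin/kc.sh build"
                        + " && cp -a /opt/keycloak/. /augmented/ && touch /augmented/.done)")
                    .addNewVolumeMount().withName("augmented").withMountPath("/augmented").endVolumeMount()
                .endInitContainer()
                // Main container: start Keycloak from the pre-augmented binaries.
                .addNewContainer()
                    .withName("keycloak")
                    .withImage(keycloakImage)
                    .addNewVolumeMount().withName("augmented").withMountPath("/opt/keycloak").endVolumeMount()
                .endContainer()
                // emptyDir shown here; a PVC would let the cache survive pod restarts.
                .addNewVolume().withName("augmented").withNewEmptyDir().endEmptyDir().endVolume()
                .build();
    }
}
```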
## Canary releases
Canary rollout releases of the operator are going to be covered by upcoming OLM improvements (ref: https://issues.redhat.com/browse/OLM-2311).
To benefit from those, there are a few items to keep in mind:
- Operators are installed cluster-wide (i.e. Descoped)
- The operator selects operands based on additional selectors (such as labels)
- The upgrade process of the operator should always be backwards compatible
- The upgrade of the operands is completely independent
Upgrades of the operand need to be discussed and designed from scratch.
We decided to postpone the design and implementation to a future point in time, when more details are settled (such as Store.X, etc.).
## Resources