---
title: enabling-cluster-api-based-installations-via-openshift-install
authors:
- "@patrickdillon"
- "@vincepri"
- "@JoelSpeed"
reviewers:
- "@sdodson"
- "@zaneb"
approvers:
- "@sdodson"
- "@zaneb"
api-approvers:
- "None"
creation-date: 2023-10-16
last-updated: 2023-10-16
tracking-link:
  - TBD
---
# Enabling Cluster-API-based Installations via openshift-install
## Summary
This enhancement discusses how `openshift-install` can use
cluster-api (CAPI) infrastructure providers to provision infrastructure for clusters,
without requiring access to an external management cluster or a local container runtime.
By running a Kubernetes control plane and CAPI-provider controllers as
subprocesses on the installer host, `openshift-install` can use CAPI and its
providers in a similar manner to how Terraform and its providers are currently
being used.
## Motivation
There are two primary motivations:
1. OpenShift Alignment with CAPI: CAPI offers numerous potential benefits,
such as day-2 infrastructure management, an API for users to edit cluster
infrastructure, and upstream collaboration. Installer support for CAPI would
be foundational for adopting these benefits.
2. Terraform BSL License Change: due to Terraform's move to the restrictive
Business Source License (BSL), `openshift-install` needs a framework to replace
the primary tool it has used to provision cluster infrastructure. In addition to the benefits
listed above, CAPI fills the biggest gaps left by Terraform:
a common API for specifying infrastructure across cloud providers and robust
infrastructure error handling.
### User Stories
- As an existing user/client of the installer, I want backwards compatibility so that I can continue to use the installer (e.g. `create cluster`) in the same manner and with existing automation.
- As a security analyst, I want the installer image to be free of Terraform and related dependencies to decrease surface area for vulnerabilities.
- As an advanced user or cluster administrator, I want to be able to edit the CAPI infrastructure manifests so that I can customize control-plane infrastructure.
### Goals
- To provide a common user and developer experience when installing and developing across cloud platforms
- To be backwards compatible and fully satisfy the requirements of install-config type APIs.
- To keep the user experience for day-zero operations unchanged or improved.
- To not require any new runtime dependencies.
- To provide an extensible framework to plug in new infrastructure cloud providers.
### Non-Goals / Future work
- To retain full and strict backward compatibility with the infrastructure previously created with Terraform
- To optimize build processes or binary size
- To use an existing management cluster to install OpenShift
- To pivot the CAPI manifests to the newly-installed cluster to enable day-2 infrastructure management within the cluster.
## Proposal
The Installer will create CAPI infrastructure manifests based on user
input from the install config; then, to provision cluster infrastructure,
it will apply the manifests to CAPI controllers running against a local Kubernetes control plane
set up by [envtest](https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest).
### Workflow Description
**cluster creator** is a human user responsible for deploying a
cluster. Note that the workflow does not change for this user.
**openshift-install** is the Installer binary.
1. The cluster creator provides an install-config and credentials
2. (optional) The cluster creator runs `openshift-install create manifests`
3. (optional) The cluster creator edits the newly created CAPI manifests.
4. The cluster creator runs `openshift-install create cluster`
5. `openshift-install` extracts `kube-apiserver`, `etcd`, the core CAPI controller, and the cloud-specific infrastructure provider to the install dir
6. `openshift-install` uses `envtest` to initialize a local control plane on the Installer host
7. `openshift-install` runs the CAPI core controller and the cloud-specific infrastructure provider as subprocesses, pointing them at the local control plane
8. `openshift-install` applies the CAPI manifests to the control plane
9. The CAPI controllers provision cluster infrastructure based on the manifests
10. `openshift-install` monitors the status of the local manifests as they are applied
11. If the statuses are as expected, infrastructure has been provisioned and installation continues with the normal flow.
In the case of an error in the final step, the Installer will bubble up resources with unexpected statuses (a sketch of these final steps follows).
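The following Go sketch illustrates roughly how steps 8–10 could be implemented with the controller-runtime client, assuming the local control plane exposes a standard `*rest.Config`. The `applyAndWait` helper, namespace, cluster name, and timeouts are hypothetical, not the Installer's actual code.
```go
// Sketch of steps 8-10: apply the generated CAPI manifests to the local
// control plane and wait for the Cluster infrastructure to become ready.
package main

import (
	"context"
	"os"
	"path/filepath"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/rest"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/yaml"
)

func applyAndWait(ctx context.Context, cfg *rest.Config, manifestDir string) error {
	scheme := runtime.NewScheme()
	if err := clusterv1.AddToScheme(scheme); err != nil {
		return err
	}
	cl, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return err
	}

	// Step 8: apply every manifest in install-dir/cluster-api.
	// Glob returns sorted paths, so the 00_ namespace manifest is created first.
	paths, err := filepath.Glob(filepath.Join(manifestDir, "*.yaml"))
	if err != nil {
		return err
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return err
		}
		u := &unstructured.Unstructured{}
		if err := yaml.Unmarshal(data, &u.Object); err != nil {
			return err
		}
		if err := cl.Create(ctx, u); err != nil {
			return err
		}
	}

	// Steps 9-10: the providers reconcile the objects; poll the Cluster until
	// its infrastructure is reported ready or the (illustrative) timeout hits.
	key := client.ObjectKey{Namespace: "openshift-cluster-api-guests", Name: "mycluster-6lxqp"} // hypothetical
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			cluster := &clusterv1.Cluster{}
			if err := cl.Get(ctx, key, cluster); err != nil {
				return false, nil // keep waiting while the object settles
			}
			return cluster.Status.InfrastructureReady, nil
		})
}
```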
### API Extensions
None. This enhancement does not add or modify any API extensions (CRDs, webhooks, aggregated API servers, or finalizers)
in the installed cluster. The Cluster API CRDs and controllers run only on a temporary local control plane on the
Installer host, and the generated `cluster-api` manifests are not applied to the cluster or included in bootstrap ignition.
### Implementation Details/Notes/Constraints [optional]
#### Overview
In a typical CAPI installation, manifests describing the desired cluster configuration are applied to a
management cluster. `openshift-install` instead runs the control plane and controllers locally; to keep
the binary free of any new external runtime dependencies, the required binaries will be [embedded][embed]
into `openshift-install`, extracted at runtime, and cleaned up afterward. This approach is similar to what we have been using
for Terraform.
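As a rough illustration of this embed-and-extract pattern, the sketch below embeds provider binaries with `go:embed` and unpacks them at runtime; the package name, embedded paths, and `Unpack` helper are assumptions, not the Installer's actual layout.
```go
// Sketch of the embed-and-extract pattern; paths and names are illustrative.
package capibinaries

import (
	"embed"
	"io/fs"
	"os"
	"path/filepath"
)

//go:embed bin/*
var embedded embed.FS

// Unpack writes every embedded binary under dir and returns the paths;
// the caller is expected to clean dir up when installation finishes.
func Unpack(dir string) ([]string, error) {
	var out []string
	err := fs.WalkDir(embedded, "bin", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		data, err := embedded.ReadFile(path)
		if err != nil {
			return err
		}
		dest := filepath.Join(dir, filepath.Base(path))
		if err := os.WriteFile(dest, data, 0o755); err != nil {
			return err
		}
		out = append(out, dest)
		return nil
	})
	return out, err
}
```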
With Terraform, the Installer has been embedding the Terraform and cloud-specific provider binaries
within the Installer binary and extracting them at runtime. The Installer produces the Terraform
configuration files and invokes Terraform using the `tf-exec` library.
![terraform diagram(2)](https://hackmd.io/_uploads/HkFrqGCS6.jpg)
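For reference, the current Terraform flow is driven roughly as in the following sketch via the `tf-exec` library; the helper function and paths are illustrative only.
```go
// Sketch of driving the extracted Terraform binary and configuration via
// the terraform-exec library; workingDir and terraformBin are illustrative.
package main

import (
	"context"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func applyTerraform(ctx context.Context, workingDir, terraformBin string) error {
	tf, err := tfexec.NewTerraform(workingDir, terraformBin)
	if err != nil {
		return err
	}
	// Initialize with the providers extracted alongside the binary,
	// then apply the generated configuration files.
	if err := tf.Init(ctx); err != nil {
		return err
	}
	return tf.Apply(ctx)
}
```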
We can follow a similar pattern to run CAPI controllers locally on the Installer host. In addition
to the CAPI controller binaries, `kube-apiserver` and `etcd` are embedded in order to run a local
control plane, orchestrated with `envtest`.
![capi diagram(3)](https://hackmd.io/_uploads/r1YU9zRSa.jpg)
#### Local control plane
The local control plane is set up using existing work from Controller Runtime: [envtest][envtest].
Envtest was created out of the need to run integration tests for controllers against a real API server, register webhooks
(conversion, admission, validation), and manage the lifecycle of Custom Resource Definitions.
Over time, `envtest` has matured to the point that it can be used to run controllers in a local environment,
reducing or eliminating the need for a full Kubernetes cluster to run controllers.
At a high level, the local control plane is responsible for:
- Setting up certificates for the apiserver and etcd.
- Running (and cleaning up, on shutdown) the local control plane components.
- Installing any required components, such as Custom Resource Definitions (CRDs):
  - For Cluster API core, the CRDs are stored in `data/data/cluster-api/core-components.yaml`.
  - Infrastructure providers are expected to store their components in `data/data/cluster-api/<name>-infrastructure-components.yaml`.
- Upon install, the local control plane takes care of modifying any webhook (conversion, admission, validation) to point to the `host:port` combination assigned.
  - Each controller manager is assigned its own `host:port` combination.
  - Certificates are generated and injected into the server, and the client certificates are injected into the api-server webhook configuration.
- For each process that the local control plane manages, a health check (a ping to `/healthz`) is required to pass, similar to how a health probe is configured when running in a Deployment.
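A minimal sketch of standing up the local control plane with `envtest` might look like the following; the component paths follow the layout described above, while the function name, binary directory, and example provider file are assumptions.
```go
// Sketch: start a local kube-apiserver and etcd with envtest and install the
// Cluster API CRDs from the embedded component manifests.
package main

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func startLocalControlPlane(binDir string) (*rest.Config, *envtest.Environment, error) {
	env := &envtest.Environment{
		// kube-apiserver and etcd extracted from the installer binary.
		BinaryAssetsDirectory: binDir,
		// CRDs for Cluster API core and the infrastructure provider(s).
		CRDDirectoryPaths: []string{
			"data/data/cluster-api/core-components.yaml",
			"data/data/cluster-api/aws-infrastructure-components.yaml", // example provider
		},
		ErrorIfCRDPathMissing: true,
	}
	cfg, err := env.Start()
	if err != nil {
		return nil, nil, err
	}
	// The provider controllers are then exec'd as subprocesses pointing at
	// cfg.Host, and env.Stop() tears everything down when the installer exits.
	return cfg, env, nil
}
```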
#### Manifests
The Installer will produce the CAPI manifests as part of the `manifests` target, writing them to a new
`cluster-api` directory alongside the existing `manifests` and `openshift` directories:
```shell
$ ./openshift-install create manifests --dir install-dir
INFO Credentials loaded from the "default" profile in file "~/.aws/credentials"
INFO Consuming Install Config from target directory
INFO Manifests created in: install-dir/cluster-api, install-dir/manifests and install-dir/openshift
$ tree install-dir/cluster-api/
install-dir/cluster-api/
├── 00_capi-namespace.yaml
├── 01_aws-cluster-controller-identity-default.yaml
├── 01_capi-cluster.yaml
├── 02_infra-cluster.yaml
├── 10_inframachine_mycluster-6lxqp-master-0.yaml
├── 10_inframachine_mycluster-6lxqp-master-1.yaml
├── 10_inframachine_mycluster-6lxqp-master-2.yaml
├── 10_inframachine_mycluster-6lxqp-master-bootstrap.yaml
├── 10_machine_mycluster-6lxqp-master-0.yaml
├── 10_machine_mycluster-6lxqp-master-1.yaml
├── 10_machine_mycluster-6lxqp-master-2.yaml
└── 10_machine_mycluster-6lxqp-master-bootstrap.yaml
1 directory, 12 files
```
The manifests within this `cluster-api` directory will not be written to the cluster or included in bootstrap ignition.
In future work, we expect these manifests to be pivoted to the cluster to enable the target cluster to take over managing
its own infrastructure.
### Risks and Mitigations
While we do not expect these changes to introduce a significant security risk, we are working with product security teams
to ensure they are aware of the changes and are able to review.
### Drawbacks
By depending on CAPI providers whose codebases live in a repository external to the Installer,
the process for developing features and delivering fixes is complex. While we had the
same situation for Terraform, the CAPI providers will be more actively developed than their
Terraform counterparts. Furthermore, it will be necessary to ensure that the CAPI providers
used by the Installer match the version of those in the payload.
While this external dependency is a significant drawback, it is not unique to this design
and is common throughout OpenShift (e.g. any time the API or library-go must be updated
before being vendored into a component). To minimize the devex friction, we will focus
on documenting a workflow for developing providers while working with the Installer. If
the problem becomes significant, we could consider automation to bump Installer providers
when merges happen upstream or in our forks.
## Design Details
### Open Questions [optional]
1. UX design during the install process as well as during failures (log collection). The Installer will dump
(potentially prettified) controller logs. Once we reach a certain level of stability, it may be worthwhile
to implement a UI.
### Test Plan
As this is replacing existing functionality in the Installer, we can rely on existing
testing infrastructure.
### Graduation Criteria
#### Dev Preview -> Tech Preview
- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
#### Tech Preview -> GA
- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)
#### Removing a deprecated feature
- Announce deprecation and support policy of the existing feature
- Deprecate the feature
### Upgrade / Downgrade Strategy
As this enhancement only concerns the installation process and affects only the underlying cluster
infrastructure, it should not affect upgrades of existing clusters.
### Version Skew Strategy
N/A
### Operational Aspects of API Extensions
N/A
#### Failure Modes
During a failed install, the controller logs (printed to stdout and collected in `.openshift_install.log`)
will contain useful information. The statuses of the CAPI manifests may also contain useful information,
in which case it would be important to display them to users and collect them for bugs and support cases. There
is an open question about the best way to handle this UX, and we expect the answer to become clearer during
development.
As the infrastructure will be reconciled by a controller, it will be possible to resolve issues during an ongoing
installation, although this would not necessarily be a feature we would call attention to for documented use cases.
Finally, the Installer will need to be able to identify when infrastructure provisioning has failed during an installation.
Initially this will be achieved through a timeout.
#### Support Procedures
In a support situation, the primary sources of information are the Installer output and
`.openshift_install.log`, which contain the CAPI controller logs, along with the statuses of the
CAPI manifests applied to the local control plane (see Failure Modes above). Because no API
extensions are added to the installed cluster, there are no cluster-side components to disable or
remove; failures during provisioning are contained to the installation process on the Installer host.
## Implementation History
- 2023-10-16: Enhancement proposed.
## Alternatives
Other infrastructure-as-code tools, such as Pulumi, Ansible, or OpenTofu, all have their own
drawbacks. We prefer the CAPI solution over these alternatives because it:
* streamlines Installer development (we do not need to re-implement features for the control plane)
* lays the foundation for OpenShift to implement future CAPI features
* requires less development effort, as CAPI providers are already set up to provision infrastructure for a cluster
It would also be possible to implement the installation using direct SDK calls for each cloud provider. In addition
to the reasons stated above, individual SDK implementations would not provide a common framework across
cloud platforms.
[embed]: https://pkg.go.dev/embed
[envtest]: https://github.com/kubernetes-sigs/controller-runtime/tree/main/tools/setup-envtest