This enhancement describes how `openshift-install` can use cluster-api (CAPI) infrastructure providers
to provision infrastructure for clusters, without requiring access to an external management cluster
or a local container runtime. By running a Kubernetes control plane and CAPI-provider controllers as
subprocesses on the installer host, `openshift-install` can use CAPI and its providers in much the
same way that Terraform and its providers are used today.
There are two primary motivations:

- **OpenShift Alignment with CAPI**: CAPI offers numerous potential benefits, such as
  day-2 infrastructure management, an API for users to edit cluster infrastructure, and
  upstream collaboration. Installer support for CAPI would be foundational for adopting
  these benefits.
- **Terraform BSL License Change**: due to the restrictive license change of Terraform,
  `openshift-install` needs a framework to replace the primary tool it has used to provision
  cluster infrastructure. In addition to the benefits listed above, CAPI provides solutions
  for the biggest gaps left by Terraform: a common API for specifying infrastructure across
  cloud providers and robust infrastructure error handling.
Users can continue to create clusters (`openshift-install create cluster`) in the same manner and with existing automation.

The Installer will create CAPI infrastructure manifests based on user input from the install config;
then, in order to provision cluster infrastructure, it will apply the manifests to CAPI controllers
running on a local Kubernetes control plane set up by envtest.
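As a rough illustration of that translation, the sketch below renders a minimal `Cluster`/`AWSCluster` pair from values that would come from the install config. It uses the upstream CAPI and CAPA Go types, but the helper name, the namespace, and the handful of fields populated are assumptions for illustration, not the Installer's actual implementation.

```go
package capimanifests

// Minimal sketch (AWS shown as an example) of turning install-config values
// into CAPI manifests. A real implementation would populate networking,
// tags, machine specs, and much more.

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	capav1 "sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2"
	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/yaml"
)

// clusterManifests is a hypothetical helper returning YAML for the CAPI
// Cluster object and its AWSCluster infrastructure counterpart.
func clusterManifests(infraID, region string) ([]byte, error) {
	const ns = "openshift-cluster-api-guests" // assumed namespace, for illustration

	infra := &capav1.AWSCluster{
		TypeMeta:   metav1.TypeMeta{APIVersion: capav1.GroupVersion.String(), Kind: "AWSCluster"},
		ObjectMeta: metav1.ObjectMeta{Name: infraID, Namespace: ns},
		Spec:       capav1.AWSClusterSpec{Region: region},
	}
	cluster := &capiv1.Cluster{
		TypeMeta:   metav1.TypeMeta{APIVersion: capiv1.GroupVersion.String(), Kind: "Cluster"},
		ObjectMeta: metav1.ObjectMeta{Name: infraID, Namespace: ns},
		Spec: capiv1.ClusterSpec{
			// Link the CAPI Cluster to its provider-specific infrastructure object.
			InfrastructureRef: &corev1.ObjectReference{
				APIVersion: capav1.GroupVersion.String(),
				Kind:       "AWSCluster",
				Name:       infraID,
				Namespace:  ns,
			},
		},
	}

	var out []byte
	for _, obj := range []interface{}{cluster, infra} {
		b, err := yaml.Marshal(obj)
		if err != nil {
			return nil, fmt.Errorf("marshaling CAPI manifest: %w", err)
		}
		out = append(out, b...)
		out = append(out, []byte("---\n")...)
	}
	return out, nil
}
```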
**cluster creator** is a human user responsible for deploying a
cluster. Note that the workflow does not change for this user.

**openshift-install** is the Installer binary.

1. The cluster creator runs `openshift-install create manifests`.
2. The cluster creator runs `openshift-install create cluster`.
3. `openshift-install` extracts the kube-apiserver, etcd, the CAPI infrastructure provider, and the cloud CAPI provider binaries to the install dir.
4. `openshift-install` uses envtest to initialize a control plane locally on the Installer host.
5. `openshift-install` execs the CAPI infrastructure and cloud providers as subprocesses, pointing them at the local control plane.
6. `openshift-install` applies the CAPI manifests to the control plane.
7. `openshift-install` monitors the status of the local manifests as they are applied.

In the case of an error in the final step, the Installer will bubble up resources with unexpected statuses.
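For steps 3-5, a hedged sketch of exec'ing an extracted provider binary against the local control plane might look like the following. The flag name, the binary layout under the install dir, and the log destination are placeholders of this sketch; each provider defines its own flags, and the actual wiring in the Installer may differ.

```go
package providers

import (
	"context"
	"os"
	"os/exec"
	"path/filepath"
)

// runProvider starts an extracted CAPI provider controller manager as a
// subprocess and points it at the kubeconfig of the local control plane.
// "--kubeconfig" is a commonly supported controller flag, but treat it as
// an assumption of this sketch.
func runProvider(ctx context.Context, installDir, providerBinary, kubeconfigPath string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, filepath.Join(installDir, "cluster-api", providerBinary),
		"--kubeconfig", kubeconfigPath,
	)
	// Surface the controller logs so they end up on stdout and, by extension,
	// in .openshift_install.log.
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}
```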
In a typical CAPI installation, manifests indicating the desired cluster configuration are applied to a
management cluster. In order to keep `openshift-install` free of any new external runtime dependencies,
the dependencies will be embedded into the `openshift-install` binary,
extracted at runtime, and cleaned up afterward. This approach is similar to what we have been using
for Terraform.
With Terraform, the Installer has been embedding the Terraform and cloud-specific provider binaries
within the Installer binary and extracting them at runtime. The Installer produces the Terraform
configuration files and invokes Terraform using the `tf-exec` library.

We can follow a similar pattern to run CAPI controllers locally on the Installer host. In addition
to the CAPI controller binaries, `kube-apiserver` and `etcd` are embedded in order to run a local
control plane, orchestrated with `envtest`.
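A sketch of that embed-and-extract pattern, using Go's `embed` package, is shown below. The `bin/` directory layout and the extraction destination are assumptions for illustration, not the Installer's actual paths.

```go
package embedded

import (
	"embed"
	"io/fs"
	"os"
	"path/filepath"
)

// Binaries such as kube-apiserver, etcd, and the provider controller managers
// would be placed under bin/ at build time (illustrative layout).
//
//go:embed bin/*
var binaries embed.FS

// Extract writes the embedded binaries into a directory under the install dir
// so they can be executed as subprocesses; callers remove the directory once
// the install finishes.
func Extract(installDir string) (string, error) {
	dest := filepath.Join(installDir, "cluster-api")
	if err := os.MkdirAll(dest, 0o755); err != nil {
		return "", err
	}
	err := fs.WalkDir(binaries, "bin", func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		data, err := binaries.ReadFile(path)
		if err != nil {
			return err
		}
		// The extracted files must be executable.
		return os.WriteFile(filepath.Join(dest, filepath.Base(path)), data, 0o755)
	})
	return dest, err
}
```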
The local control plane is set up using the prior work done in Controller Runtime through `envtest`.
Envtest was born out of the need to run integration tests for controllers against a real API server, register webhooks
(conversion, admission, validation), and manage the lifecycle of Custom Resource Definitions.
Over time, `envtest` has matured to the point that it can now be used to run controllers in a local environment,
reducing or eliminating the need for a full Kubernetes cluster to run controllers.
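A minimal sketch of starting such a local control plane with `envtest` is shown below, assuming the kube-apiserver/etcd binaries and the CRD manifests have already been extracted to known paths (those paths are placeholders).

```go
package localcontrolplane

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// start brings up a local kube-apiserver and etcd using envtest and installs
// the CRDs for cluster-api core and the infrastructure provider. The returned
// rest.Config is what the provider subprocesses are pointed at.
func start(binDir string, crdPaths []string) (*rest.Config, *envtest.Environment, error) {
	env := &envtest.Environment{
		// Use the binaries extracted from the installer rather than letting
		// envtest locate (or download) its own.
		BinaryAssetsDirectory: binDir,
		// e.g. the core-components and <name>-infrastructure-components CRDs.
		CRDDirectoryPaths:     crdPaths,
		ErrorIfCRDPathMissing: true,
	}
	cfg, err := env.Start()
	if err != nil {
		return nil, nil, err
	}
	// env.Stop() must be called by the caller to tear the control plane down.
	return cfg, env, nil
}
```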
At a high level, the local control plane is responsible for:

- Installing the CAPI core components from `data/data/cluster-api/core-components.yaml`.
- Installing the infrastructure provider components from `data/data/cluster-api/<name>-infrastructure-components.yaml`.
- Assigning each controller's webhook server its own `host:port` combination.
- Assigning each controller's health endpoint its own `host:port` combination; the health check (`/healthz`) is required to pass, similarly to how a health probe is configured when running in a Deployment.

The Installer will produce the CAPI manifests as part of the `manifests` target, writing them to a new
`cluster-api` directory alongside the existing `manifests` and `openshift` directories:
$ ./openshift-install create manifests --dir install-dir
INFO Credentials loaded from the "default" profile in file "~/.aws/credentials"
INFO Consuming Install Config from target directory
INFO Manifests created in: install-dir/cluster-api, install-dir/manifests and install-dir/openshift
$ tree install-dir/cluster-api/
install-dir/cluster-api/
├── 00_capi-namespace.yaml
├── 01_aws-cluster-controller-identity-default.yaml
├── 01_capi-cluster.yaml
├── 02_infra-cluster.yaml
├── 10_inframachine_mycluster-6lxqp-master-0.yaml
├── 10_inframachine_mycluster-6lxqp-master-1.yaml
├── 10_inframachine_mycluster-6lxqp-master-2.yaml
├── 10_inframachine_mycluster-6lxqp-master-bootstrap.yaml
├── 10_machine_mycluster-6lxqp-master-0.yaml
├── 10_machine_mycluster-6lxqp-master-1.yaml
├── 10_machine_mycluster-6lxqp-master-2.yaml
└── 10_machine_mycluster-6lxqp-master-bootstrap.yaml
1 directory, 12 files
The manifests within this `cluster-api` directory will not be written to the cluster or included in bootstrap ignition.
In future work, we expect these manifests to be pivoted to the cluster to enable the target cluster to take over managing
its own infrastructure.
While we do not expect these changes to introduce a significant security risk, we are working with product security teams
to ensure they are aware of the changes and are able to review.
By depending on CAPI providers whose codebases live in a repository external to the Installer,
the process for developing features and delivering fixes is complex. While we had the
same situation for Terraform, the CAPI providers will be more actively developed than their
Terraform counterparts. Furthermore, it will be necessary to ensure that the CAPI providers
used by the Installer match the version of those in the payload.
While this external dependency is a significant drawback, it is not unique to this design
and is common throughout OpenShift (e.g. any time the API or library-go must be updated
before being vendored into a component). To minimize the devex friction, we will focus
on documenting a workflow for developing providers while working with the Installer. If
the problem becomes significant, we could consider automation to bump Installer providers
when merges happen upstream or in our forks.
As this is replacing existing functionality in the Installer, we can rely on existing
testing infrastructure.
As this enhancement only concerns the Installation process and affects only the underlying cluster
infrastructure, this change should not affect existing cluster upgrades.
N/A
N/A
During a failed install, the controller logs (displayed in stdout and collected in `.openshift_install.log`)
will contain useful information. The status of the CAPI manifests may also contain useful information,
in which case it would be important to display that to users and to collect it for bugs and support cases. There
is an open question about the best way to handle this UX, and we expect the answer to become clearer during
development.
As the infrastructure will be reconciled by a controller, it will be possible to resolve issues during an ongoing
installation, although this would not necessarily be a feature we would call attention to for documented use cases.
Finally, the Installer will need to be able to identify when infrastructure provisioning has failed during an installation.
Initially this will be achieved through a timeout.
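As a hedged sketch, that timeout could be implemented as a polling wait on the CAPI `Cluster` object's infrastructure-ready status, along the lines below; the poll interval, timeout, and namespace are placeholders rather than decided values.

```go
package capiwait

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForInfrastructureReady polls the Cluster object on the local control
// plane until its infrastructure is reported ready or the timeout expires.
// On timeout, the Installer would bubble up the resources whose statuses are
// unexpected.
func waitForInfrastructureReady(ctx context.Context, c client.Client, name, namespace string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			cluster := &capiv1.Cluster{}
			if err := c.Get(ctx, client.ObjectKey{Name: name, Namespace: namespace}, cluster); err != nil {
				if apierrors.IsNotFound(err) {
					return false, nil // not applied yet; keep polling
				}
				return false, err
			}
			return cluster.Status.InfrastructureReady, nil
		})
}
```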
Other infrastructure-as-code alternatives, such as Pulumi, Ansible, or OpenTofu,
each have their own drawbacks. We prefer the CAPI solution over these alternatives
because it provides a common API for specifying infrastructure across cloud providers,
robust infrastructure error handling, and a foundation for the day-2 infrastructure
management and upstream alignment described in the motivation.

It would also be possible to implement the installation using direct SDK calls for each cloud provider. In addition
to the reasons stated above, individual SDK implementations would not provide a common framework across the various
cloud platforms.