---
title: Cluster API Provider AWS v1alpha4 / v0.7
authors:
  - "@joelspeed"
  - "@randomvariable"
  - "@sedefsavas"
reviewers:
  - "@janedoe"
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional|experimental|implementable|implemented|deferred|rejected|withdrawn|replaced
see-also:
  - "/docs/proposals/20190101-we-heard-you-like-proposals.md"
  - "/docs/proposals/20190102-everyone-gets-a-proposal.md"
replaces:
  - "/docs/proposals/20181231-replaced-proposal.md"
superseded-by:
  - "/docs/proposals/20190104-superceding-proposal.md"
---

# Cluster API Provider AWS v1alpha4 / v0.7

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a proposal and for highlighting any additional information provided beyond the standard proposal template. [Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available.
- [Title](#title)
- [Table of Contents](#table-of-contents)
- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Requirements (Optional)](#requirements-optional)
    - [Functional Requirements](#functional-requirements)
      - [FR1](#fr1)
      - [FR2](#fr2)
    - [Non-Functional Requirements](#non-functional-requirements)
      - [NFR1](#nfr1)
      - [NFR2](#nfr2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Security Model](#security-model)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan [optional]](#test-plan-optional)
  - [Graduation Criteria [optional]](#graduation-criteria-optional)
  - [Version Skew Strategy [optional]](#version-skew-strategy-optional)
- [Implementation History](#implementation-history)

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

If this proposal adds new terms, or defines some, make the changes to the book's glossary when in the PR stage.

## Summary

The `Summary` section is incredibly important for producing high-quality, user-focused documentation such as release notes or a development roadmap. It should be possible to collect this information before implementation begins in order to avoid requiring implementors to split their attention between writing release notes and implementing the feature itself. A good summary is probably at least a paragraph in length.

## Motivation

This section is for explicitly listing the motivation, goals and non-goals of this proposal.

- Describe why the change is important and the benefits to users.
- The motivation section can optionally provide links to [experience reports](https://github.com/golang/go/wiki/ExperienceReports) to demonstrate the interest in a proposal within the wider Kubernetes community.

### Goals

- List the specific high-level goals of the proposal.
- How will we know that this has succeeded?

### Non-Goals/Future Work

- What high-level goals are out of scope for this proposal?
- Listing non-goals helps to focus discussion and make progress.

## Proposal

This is where we get down to the nitty-gritty of what the proposal actually is.

- What is the plan for implementing this feature?
- What data model changes, additions, or removals are required?
- Provide a scenario, or example.
- Use diagrams to communicate concepts, flows of execution, and states. [PlantUML](http://plantuml.com) is the preferred tool to generate diagrams; place your `.plantuml` files under `images/` and run `make diagrams` from the docs folder.

### User Stories

User stories that are related only to the experimental EKS support (i.e. provisioning/managing an Amazon EKS cluster with CAPA) start with `E`; other stories start with `U`.

<table>
  <thead>
    <tr>
      <th>ID</th><th>Title</th><th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>U1</td><td>Control planes with external NLB</td>
      <td>
        As an AWS administrator, I want Cluster API Provider AWS to support externally managed infrastructure so that I can provision my cluster infrastructure using an external tool such as Terraform.
      </td>
    </tr>
    <tr>
      <td>U2</td><td>VMC on AWS Support</td>
      <td>
        As a VMware Cloud on AWS administrator, I want Cluster API Provider AWS to attach VMC vSphere control plane machines to my provisioned Network Load Balancer so that I have an HA load balancing solution for my control plane.
      </td>
    </tr>
    <tr>
      <td>U3</td><td>API Rate Optimisations</td>
      <td>
        As an AWS administrator, I would like Cluster API Provider AWS to reduce the number of API calls it makes so that I do not hit service rate limits in AWS.
      </td>
    </tr>
    <tr>
      <td>U4</td><td>Amazon EC2 Fleet</td>
      <td>
        As an AWS administrator, I would like Cluster API Provider AWS to provision Amazon EC2 Fleet so that I can optimise spend.
      </td>
    </tr>
    <tr>
      <td>U5</td><td>Spot Instance Graceful Termination</td>
      <td>
        As a cluster operator, I would like Cluster API Provider AWS to gracefully drain worker nodes that are on Amazon EC2 Spot Instances when the instances are scheduled for termination, to avoid any disruption to the workloads. [issue 2023]
      </td>
    </tr>
    <tr>
      <td>U6</td><td>External Autoscaler for ASG</td>
      <td>
        As a cluster operator, I would like Cluster API Provider AWS to support using an external auto-scaler (e.g. cluster-autoscaler) to set the desired capacity of ASGs so that a Kubernetes-aware scaler can handle automatic scaling. [issue 2022]
      </td>
    </tr>
    <tr>
      <td>U7</td><td>Multi-AZ Machines</td>
      <td>
        As a cluster operator, I would like Cluster API Provider AWS to spread worker instances across different availability zones and subnets. [issue 1829]
      </td>
    </tr>
    <tr>
      <td>E1</td><td>Amazon EC2 Spot Instances</td>
      <td>
        As a cluster operator, I would like Cluster API Provider AWS to support enabling EC2 Spot Instances in managed node groups (AWSManagedMachinePool) to optimise cost.
      </td>
    </tr>
  </tbody>
</table>

### Requirements

<table>
  <thead>
    <tr>
      <th>ID</th><th>Requirement</th><th>Related Stories</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>R1</td>
      <td>
        Cluster API Provider AWS must support externally provisioned Network Load Balancers.
      </td>
      <td>U1</td>
    </tr>
    <tr>
      <td>R2</td>
      <td>
        Cluster API Provider AWS must support the provisioning of Network Load Balancers.
      </td>
      <td>U2</td>
    </tr>
    <tr>
      <td>R3</td>
      <td>
        Cluster API Provider AWS must support attachment of control plane instances to Network Load Balancers.
      </td>
      <td>U1</td>
    </tr>
    <tr>
      <td>R4</td>
      <td>
        Cluster API Provider AWS MUST assume that the load balancer attachment for control planes is optional (and carried out by a separate controller/component).
      </td>
      <td>U1</td>
    </tr>
    <tr>
      <td>R5</td>
      <td>
        Cluster API Provider AWS SHOULD be able to attach multiple load balancers to a single machine.
      </td>
      <td>U1</td>
    </tr>
    <tr>
      <td>R6</td>
      <td>
        Cluster API Provider AWS SHOULD assume that multiple load balancers attached to a single control plane instance are of different types (Network Load Balancer vs. Elastic Load Balancer).
      </td>
      <td>U1</td>
    </tr>
    <tr>
      <td>R7</td>
      <td>
        Cluster API Provider AWS SHOULD coalesce describe operations instead of performing separate queries for every resource [issue 1764].
      </td>
      <td>U3</td>
    </tr>
    <tr>
      <td>R8</td>
      <td>
        Cluster API Provider AWS SHOULD detect Spot Instances that are marked for termination by polling the termination endpoint, and mark the node for termination [issue 1764].
      </td>
      <td>U5</td>
    </tr>
    <tr>
      <td>R9</td>
      <td>
        Cluster API Provider AWS SHOULD drain nodes that are on Spot Instances when the Spot Instance termination signal is received.
      </td>
      <td>U5</td>
    </tr>
    <tr>
      <td>R10</td>
      <td>
        Cluster API Provider AWS SHOULD NOT modify an ASG's desired capacity based on the MachinePool's replica count if an external autoscaler is in use.
      </td>
      <td>U6</td>
    </tr>
    <tr>
      <td>R11</td>
      <td>
        Cluster API Provider AWS SHOULD provide a way to spread worker instances across multiple availability zones and subnets.
      </td>
      <td>U7</td>
    </tr>
    <tr>
      <td>R12</td>
      <td>
        Cluster API Provider AWS SHOULD support launching EC2 Spot Instances in a managed node group.
      </td>
      <td>E1</td>
    </tr>
  </tbody>
</table>

### Implementation Details/Notes/Constraints

#### API Changes

##### AWSManagedCluster Removal

The AWSManagedCluster type will be removed, as it is a "passthrough" type that exists only to satisfy the InfraCluster contract.
However, AWSManagedControlPlane already has the required Spec/Status fields to satisfy this contract, so it can be used for both `infrastructureRef` and `controlPlaneRef`, e.g.:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: Cluster
metadata:
  name: "capi-managed-test"
spec:
  infrastructureRef:
    kind: AWSManagedControlPlane
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha4
    name: capi-managed-test-control-plane
  controlPlaneRef:
    kind: AWSManagedControlPlane
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha4
    name: capi-managed-test-control-plane
```

###### Conversion of Cluster resources

*Option 1*

Cluster API Provider AWS will install a mutating webhook for Cluster resources to rewrite references to `AWSManagedCluster` to `AWSManagedControlPlane`. However, this may not fix controller reconciliation.

*Option 2*

Cluster API Provider AWS will add a migration controller for `Cluster` resources to rewrite `infrastructureRef` fields.

##### AWSManagedControlPlane cleanup

The current organic additions to AWSManagedControlPlane will be cleaned up in v1alpha4:

```yaml
kind: AWSManagedControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
metadata:
  name: "capi-managed-test-control-plane"
spec:
  region: "eu-west-2"
  version: "v1.18.0"
  additionalTags:
    key1: value1
  network:
    spec: # infrav1.NetworkSpec
      vpc:
      subnets:
      cni:
      secondaryCidrBlock: "10.0.0.0/24"
    bastion:
      enabled: true
  controllerPermissions:
    roleName: "myrole"
    roleAdditionalPolicies:
      - "rolearn1"
      - "rolearn2"
  associateOIDCProvider: true
  eksCluster: # or this could be called eksControlPlane
    name: "default_mycluster" # used to be eksClusterName
    endpointAccess:
    logging:
      apiServer: false
      audit: true
    encryptionConfig:
      provider: "abcdf"
    addons:
      - name: "vpc-cni"
        version: "v1.6.3-eksbuild.1"
        conflictResolution: "overwrite"
  machines:
    sshKeyName: "capi-management-2"
    image:
      lookupFormat:
      lookupOrg:
      lookupBaseOS:
  userAuthentication: # this could potentially sit under the `eksCluster`
    iam:
      mapRoles:
        - username: "kubernetes-admin"
          rolearn: "arn:aws:iam::123456789:role/AdministratorAccess"
          groups:
            - "system:masters"
      mapUsers:
    oidc:
      provider: ""
      userprefix: ""
  controlPlaneEndpoint: # this has to stay at this level to satisfy the contract with Cluster
```

### Security Model

No RBAC implications are envisioned for v1alpha4.

### Risks and Mitigations

- What are the risks of this proposal, and how do we mitigate them? Think broadly.
- How will UX be reviewed, and by whom?
- How will security be reviewed, and by whom?
- Consider including folks that also work outside the SIG or subproject.

## Alternatives

An older proposal to break the AWSCluster object into separate components is deferred pending possible UX improvements in Cluster API. We believe that support for externally provisioned infrastructure addresses the immediate use cases in the meantime.

## Upgrade Strategy

Cluster API Provider AWS will follow standard version skew practice: webhooks will perform conversions from v1alpha3 to v1alpha4. Conversion from v1alpha2 to v1alpha4 will not be supported.

## Implementation History

- [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]
- [ ] MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
- [ ] MM/DD/YYYY: First round of feedback from community
- [ ] MM/DD/YYYY: Present proposal at a [community meeting]
- [ ] MM/DD/YYYY: Open proposal PR

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
[issue 1764]: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1764
[issue 1829]: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1829
[issue 2022]: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2022
[issue 2023]: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2023
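For illustration, both options sketched under "Conversion of Cluster resources" reduce to the same core operation: rewriting a `Cluster`'s `infrastructureRef` from the removed `AWSManagedCluster` type to `AWSManagedControlPlane`. The following is a minimal sketch of that rewrite, not an implementation from the CAPA codebase; `ObjectReference` here is a hypothetical stand-in for the `corev1.ObjectReference` type used in the real API, and `rewriteInfraRef` is an illustrative name.

```go
package main

import "fmt"

// ObjectReference is a hypothetical stand-in for the subset of
// corev1.ObjectReference fields that Cluster.spec.infrastructureRef uses.
type ObjectReference struct {
	Kind       string
	APIVersion string
	Name       string
}

// rewriteInfraRef rewrites a reference to the removed AWSManagedCluster type
// so that it points at the equivalent AWSManagedControlPlane resource.
// References to any other infrastructure type are returned unchanged.
func rewriteInfraRef(ref ObjectReference) ObjectReference {
	if ref.Kind != "AWSManagedCluster" {
		return ref
	}
	ref.Kind = "AWSManagedControlPlane"
	ref.APIVersion = "controlplane.cluster.x-k8s.io/v1alpha4"
	return ref
}

func main() {
	newRef := rewriteInfraRef(ObjectReference{
		Kind:       "AWSManagedCluster",
		APIVersion: "infrastructure.cluster.x-k8s.io/v1alpha3",
		Name:       "capi-managed-test",
	})
	fmt.Printf("%s %s %s\n", newRef.Kind, newRef.APIVersion, newRef.Name)
}
```

In the webhook variant (Option 1) this logic would run in a mutating admission handler; in the controller variant (Option 2) it would run in a reconcile loop that patches existing `Cluster` resources.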