
Add a pluggable etcdadm-based provider for creating and managing external etcd clusters


Glossary

Refer to the Cluster API Book Glossary.

If this proposal adds new terms, or defines some, make the changes to the book's glossary when in PR stage.

Summary

This is a proposal for adding support for provisioning external etcd clusters in Cluster API. CAPI's KubeadmControlPlane supports using an external etcd cluster, but it currently does not support provisioning and managing that external etcd cluster.
As per this KEP, there are ongoing
efforts to rebase etcd-manager (used in kOps) onto etcdadm, to make etcdadm a consistent etcd solution to be used across all projects, including cluster-api.
This can be achieved for CAPI by adding a new pluggable provider that has two main components:

  • A bootstrap provider that uses etcdadm to convert a machine into an etcd member.
  • A new controller that manages this etcd cluster's lifecycle.

Motivation

  • Motivation behind having cluster-api provision and manage the external etcd cluster

    • Cluster API supports the use of an external etcd cluster by allowing users to provide their external etcd cluster's endpoints,
      so it would be a natural extension to also support provisioning that etcd cluster.

    • External etcd topology decouples the control plane and etcd member. So if a controlplane-only node fails, or if there's a memory leak in a component like kube-apiserver, it won't directly impact an etcd member.

    • Etcd is resource intensive, so it is safer to have dedicated nodes for etcd; those nodes can be provisioned with more disk space and higher bandwidth.
      Having a separate etcd cluster for these reasons helps ensure a more resilient HA setup.

  • Motivation behind using etcdadm as a pluggable etcd provider

    • Leveraging existing projects such as etcdadm is one way of bringing up an etcd cluster. This KEP and the etcdadm roadmap doc indicate that integration into cluster-api is one of the goals for etcdadm.

    • Once the rebase of etcd-manager on etcdadm is completed, etcdadm can also provide cluster administration features that etcd-manager does, such as backups and restores.

    • Etcd providers can be made pluggable and we can start with adding one that uses etcdadm.

Goals

  • Introduce pluggable etcd providers in Cluster API, starting with an etcdadm based provider. This etcdadm provider will create and manage an etcd cluster.
  • User should be able to create a Kubernetes cluster that uses external etcd topology in a single step.
  • User should only have to create the CAPI Cluster resources, along with the required etcd provider specific resources. That should trigger creation of an etcd cluster, followed by creation of the target workload cluster that uses this etcd cluster.
  • There will be a 1:1 mapping between the external etcd cluster and the workload cluster.
  • Define a contract for pluggable etcd providers and define steps that controlplane providers should take to use a managed external etcd cluster.
  • Support the following etcd cluster management actions: scale up and scale down, etcd member replacement, and etcd version upgrades and rollbacks.
  • The etcd providers can utilize the existing Machine objects to represent etcd members.
  • Etcd cluster members will undergo upgrades using the rollingUpdate strategy.

Non-Goals

  • The first iteration will use IP addresses/hostnames as etcd cluster endpoints. It cannot configure static endpoints until the Load Balancer provider is available.

Future work

There are some downsides to using non-static IP addresses/hostnames as etcd cluster endpoints, such as the controlplane having to undergo an upgrade when the etcd cluster is reconfigured. We have discussed using a controller that can configure DNS record sets to use as static endpoints for the etcd members. This will be achieved through the Load Balancer provider as discussed in a CAPI office hours meeting.

Proposal

User Stories

  • As an end user, I want to be able to create a Kubernetes cluster that uses external etcd topology using CAPI with a single step.
  • On creating the required CRs in CAPI along with the new etcd provider specific CRs, CAPI should provision an etcd cluster, followed by a workload cluster that uses this external etcd cluster.
  • As an end user, I should be able to use the etcd provider CRs to specify/modify etcd cluster size and etcd version. CAPI controllers should modify and manage the etcd clusters accordingly.

Implementation Details/Notes/Constraints

These are some of the key differences between the etcd cluster provisioning flow when using the etcdadm CLI vs the new etcdadm-based provider:

  • Etcdadm cluster provisioning using CLI commands works as follows:

    • Run etcdadm init on any one of the nodes to create a single-node etcd cluster. This generates the CA cert & key on that node, along with the server, peer and client certs for that node.
    • The init command also outputs an etcdadm join command with the client URL of the first node.
    • To add a new member to the cluster, copy the CA cert-key pair from the first node to the right location (/etc/etcd/pki) on the new node, then run the etcdadm join <client URL> command.
  • This flow can't be used within cluster-api as is, since it requires copying certs from the first etcd node to the others. Instead, it will follow the design that kubeadm uses and generate the certs outside the etcd cluster.
    The etcd controller running on the management cluster will generate the certs and save them as a Secret, which each member can then look up and populate in the write_files section of its cloud-init script.

  • The first node outputs the etcdadm join <client URL> command, but cluster-api can't directly use this output; the only way to get this command would be to save the cloud-init output and parse it. Instead, the etcd controller can form this command once the first etcd node is provisioned, by combining that node's address with port 2379, as shown in the sketch below.
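A minimal sketch of how the controller could derive the join command once the infra provider reports the init member's address. The helper name is illustrative and the fixed client port 2379 follows the etcd default used elsewhere in this proposal.

package main

import (
    "fmt"
    "net"
)

// joinCommand builds the etcdadm join command for subsequent members
// once the init member's address is known from its Machine status.
func joinCommand(initMemberAddress string) string {
    clientURL := fmt.Sprintf("https://%s", net.JoinHostPort(initMemberAddress, "2379"))
    return fmt.Sprintf("etcdadm join %s", clientURL)
}

func main() {
    // Example: the infra provider reported 10.0.0.4 for the init member.
    fmt.Println(joinCommand("10.0.0.4"))
    // Output: etcdadm join https://10.0.0.4:2379
}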

Data Model Changes

The following types/CRDs will be added for the etcdadm-based provider:

type EtcdCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   EtcdClusterSpec   `json:"spec,omitempty"`
    Status EtcdClusterStatus `json:"status,omitempty"`
}

type EtcdClusterSpec struct {
    Replicas *int32 `json:"replicas,omitempty"`

    // InfrastructureTemplate is a required reference to a custom resource offered by an infrastructure provider.
    InfrastructureTemplate corev1.ObjectReference `json:"infrastructureTemplate"`

    EtcdadmConfigSpec EtcdadmConfigSpec `json:"etcdadmConfigSpec"`
    
}

type EtcdClusterStatus struct {
    // Total number of ready etcd members targeted by this etcd cluster
    // (their labels match the selector).
    // +optional
    ReadyReplicas int32 `json:"readyReplicas,omitempty"`

    // +optional
    InitMachineAddress string `json:"initMachineAddress"`

    // +optional
    Initialized bool `json:"initialized"`

    // +optional
    Ready bool `json:"ready"`

    // +optional
    Endpoint string `json:"endpoint"`

    // +optional
    Selector string `json:"selector,omitempty"`
}


type EtcdadmConfig struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    
    Spec   EtcdadmConfigSpec   `json:"spec,omitempty"`
    Status EtcdadmConfigStatus `json:"status,omitempty"`
}

type EtcdadmConfigSpec struct {
    // Users specifies extra users to add
    // +optional
    Users []capbk.User `json:"users,omitempty"`

    // PreEtcdadmCommands specifies extra commands to run before etcdadm runs
    // +optional
    PreEtcdadmCommands []string `json:"preEtcdadmCommands,omitempty"`

    // PostEtcdadmCommands specifies extra commands to run after etcdadm runs
    // +optional
    PostEtcdadmCommands []string `json:"postEtcdadmCommands,omitempty"`

    // EtcdadmArgs specifies additional arguments to pass to etcdadm
    // +optional
    EtcdadmArgs map[string]interface{} `json:"etcdadmArgs,omitempty"`
}


type EtcdadmConfigStatus struct {
    Conditions clusterv1.Conditions `json:"conditions,omitempty"`
    
    DataSecretName *string `json:"dataSecretName,omitempty"`

    Ready bool `json:"ready,omitempty"`
}
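To illustrate how the types above fit together, here is a hedged sketch of an EtcdCluster object as a user (or a test) might construct it. The etcdv1 import path, the DockerMachineTemplate reference, and the names used are assumptions for illustration only.

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/utils/pointer"

    // Hypothetical import path for the API group proposed in this document.
    etcdv1 "sigs.k8s.io/etcdadm-provider/api/v1alpha4"
)

// exampleEtcdCluster builds a three-member etcd cluster that reuses an
// infrastructure provider's machine template, exactly as KubeadmControlPlane
// does for control plane machines.
func exampleEtcdCluster() *etcdv1.EtcdCluster {
    return &etcdv1.EtcdCluster{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "etcd-cluster",
            Namespace: "default",
        },
        Spec: etcdv1.EtcdClusterSpec{
            // Three members is the typical HA size for etcd.
            Replicas: pointer.Int32Ptr(3),
            InfrastructureTemplate: corev1.ObjectReference{
                APIVersion: "infrastructure.cluster.x-k8s.io/v1alpha4",
                Kind:       "DockerMachineTemplate",
                Name:       "etcd-machine-template",
            },
            EtcdadmConfigSpec: etcdv1.EtcdadmConfigSpec{
                PreEtcdadmCommands: []string{"swapoff -a"},
            },
        },
    }
}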

Implementation details

The etcd provider will have two main components:

Etcdadm bootstrap provider

  • The etcdadm bootstrap provider will convert a Machine into an etcd member by generating a cloud-init script that runs the required etcdadm commands. It will do so through the EtcdadmConfig resource.
  • The etcdadm bootstrap provider controller will follow the same flow as the kubeadm bootstrap provider:
    • This controller will also use the InitLock mechanism to determine which member runs the init command.
    • Once the init lock is acquired, the controller will look up an existing CA cert-key pair, or generate one and save it in a Secret in the management cluster (see the sketch after this list). For subsequent etcd members, the controller will look up the existing CA cert-key pair.
    • The controller will generate cloud-init
      data for each etcd member, save it as a Secret, and record this Secret's name in the EtcdadmConfig.Status.DataSecretName field. This cloud-init data will contain:
      • The CA certs to be written at /etc/etcd/pki
      • Userdata provided through the EtcdadmConfig spec
      • Etcdadm commands
  • The infrastructure provider controller will use this Secret the same way it would a kubeadm bootstrap Secret, to execute the cloud-init script.
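A minimal sketch (not the actual controller code) of generating the etcd CA and wrapping it in a Secret for the management cluster. The "<cluster>-managed-etcd" Secret name follows the convention described later in this proposal; the function name and key names are illustrative assumptions.

package certs

import (
    "crypto/rand"
    "crypto/rsa"
    "crypto/x509"
    "crypto/x509/pkix"
    "encoding/pem"
    "fmt"
    "math/big"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newEtcdCACertSecret creates a self-signed CA key pair and returns it as a
// Secret that subsequent EtcdadmConfig reconciliations can look up instead of
// regenerating.
func newEtcdCACertSecret(clusterName, namespace string) (*corev1.Secret, error) {
    key, err := rsa.GenerateKey(rand.Reader, 2048)
    if err != nil {
        return nil, err
    }
    tmpl := &x509.Certificate{
        SerialNumber:          big.NewInt(1),
        Subject:               pkix.Name{CommonName: "etcd-ca"},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().AddDate(10, 0, 0),
        KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
        BasicConstraintsValid: true,
        IsCA:                  true,
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        return nil, err
    }
    certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
    keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})

    return &corev1.Secret{
        ObjectMeta: metav1.ObjectMeta{
            Name:      fmt.Sprintf("%s-managed-etcd", clusterName),
            Namespace: namespace,
        },
        Data: map[string][]byte{"tls.crt": certPEM, "tls.key": keyPEM},
    }, nil
}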

Etcd cluster controller

  • Cluster API has a controlplane controller for kubeadm. A similar controller needs to be added for etcd clusters. Users will create an object of type EtcdCluster, which contains replicas and an etcdadmConfigSpec. This controller will create EtcdadmConfig CRs to match the number of replicas.
  • This controller is responsible for provisioning the etcd cluster, and signaling the controlplane cluster once etcd is ready.
  • The end to end flow will be as follows and requires these changes in CAPI:
    • Cluster.Spec will have a new field ManagedExternalEtcdRef of type ObjectReference (same as ControlPlaneRef). To use a CAPI-managed etcd cluster, the user will set this field in the Cluster manifest, for instance:

        managedExternalEtcdRef:
            kind: EtcdCluster
            apiVersion: etcdcluster.cluster.x-k8s.io/v1alpha4
            name: "etcd-cluster"
            namespace: "default"
    
    • Cluster controller's reconcileControlPlane will check if managedExternalEtcdRef is set on the cluster spec; if the external etcd endpoints are not yet populated, it will pause KubeadmControlPlane provisioning by setting the clusterv1.Paused annotation (cluster.x-k8s.io/paused) on it. This allows etcd cluster provisioning to occur first. (In the April 28th CAPI meeting, Jason/Fabrizio suggested moving this external etcd ref field to KCP instead of Cluster.)
    • Etcd cluster controller generates CA cert-key pair to be used by all machines that will run the etcdadm commands. This external etcd CA is saved in a Secret with name cluster.Name-managed-etcd. We can use the existing util/secret pkg in CAPI for this and add a new Purpose for managed etcd.
    • For a CAPI cluster using external etcd, the apiserver-etcd-client certs must exist in a Secret with the name cluster.Name-apiserver-etcd-client. So the etcdadm controller will also generate client certs and create a Secret with the expected name, again using the existing CAPI util/secret pkg.
    • The controller will first initialize the etcd cluster by creating one EtcdadmConfig and a corresponding Machine resource. Following that, it will add as many members to the cluster as required to match the EtcdCluster.Spec.Replicas field.
    • At the end of every reconciliation loop, the etcd cluster controller checks whether the desired number of members has been created by:
      • Getting the Machines owned by the EtcdCluster
      • Running a healthcheck for each Machine against the etcd /health endpoint at https://<client-address>:2379/health (see the sketch after this list)
    • Once all members are ready, the controller will set the endpoints on the EtcdCluster.Status.Endpoint field and set EtcdCluster.Status.Ready to true.
    • The CAPI cluster controller will have a new reconcileEtcdCluster function that checks if the EtcdCluster is ready; when it is, it updates the KubeadmControlPlane's external etcd cluster endpoints and unpauses it by deleting the paused annotation. This then starts the provisioning of the controlplane using this etcd cluster.
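A minimal sketch of the per-member healthcheck, assuming the etcd CA and a client certificate are already loaded from the Secrets described above. The function name and wiring are illustrative; etcd's /health endpoint returns {"health": "true"} for a healthy member.

package health

import (
    "crypto/tls"
    "crypto/x509"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

type healthResponse struct {
    Health string `json:"health"`
}

// memberHealthy returns true if the etcd member at the given address reports
// healthy on its client port.
func memberHealthy(address string, caPEM []byte, clientCert tls.Certificate) (bool, error) {
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        return false, fmt.Errorf("invalid etcd CA certificate")
    }
    client := &http.Client{
        Timeout: 5 * time.Second,
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                RootCAs:      pool,
                Certificates: []tls.Certificate{clientCert},
            },
        },
    }
    resp, err := client.Get(fmt.Sprintf("https://%s:2379/health", address))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var h healthResponse
    if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
        return false, err
    }
    return h.Health == "true", nil
}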

Static etcd endpoints

  • VM replacement will require etcd cluster reconfiguration. Using static endpoints for the external etcd cluster can avoid this. There are two ways of configuring static endpoints, using a load balancer or configuring DNS records.
  • Load balancer can add latency because of the hops associated with routing requests, whereas DNS records will directly resolve to the etcd members.
  • The DNS records can be configured such that each etcd member gets a separate sub-domain, under the domain associated with the etcd cluster. An example of this when using ec2 would be:
    • User creates a hosted zone called external.etcd.cluster and gives that as input to EtcdCluster.Spec.HostedZoneName.
    • The EtcdCluster controller creates a separate A record name within that hosted zone for each EtcdadmConfig it creates (see the sketch after this list). The DNS controller will create the route53 A record once the corresponding Machine's IP address is set.
  • A suggestion from a CAPI meeting is to add a DNS configuration implementation in the load balancer proposal.
  • In the April 28th CAPI meeting we decided to implement phase 1 with IP addresses in the absence of the load balancer provider. But we need to discuss what happens to the certs & kube-apiserver of a cluster that is upgraded to use static endpoints instead of IP addresses once the load balancer provider is available.
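An illustrative sketch of the per-member record layout described above: each EtcdadmConfig gets its own sub-domain under the user-provided hosted zone. The naming scheme shown here is an assumption, not a finalized contract.

package main

import "fmt"

// memberFQDN returns the A record name that would be created for one etcd
// member, e.g. "etcd-cluster-abc12.external.etcd.cluster".
func memberFQDN(etcdadmConfigName, hostedZoneName string) string {
    return fmt.Sprintf("%s.%s", etcdadmConfigName, hostedZoneName)
}

func main() {
    fmt.Println(memberFQDN("etcd-cluster-abc12", "external.etcd.cluster"))
}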

Changes in machine controller

  • Currently each machine needs to have a providerID set for it to be picked up by the infrastructure provider.
  • Machine controller needs to undergo changes for an etcd member since it won't have providerID set.
    • This was brought up in the initial meeting. CAPD calls kubectl patch to add the provider ID on the node, and only then sets the provider ID value on the DockerMachine resource. The problem is that in the case of an etcd-only cluster, the machine is not a Kubernetes node, and cluster provisioning does not progress until this provider ID is set on the Machine. I couldn't provision an external etcd cluster using the POC etcdadm bootstrap provider for CAPA+CAPD without making these changes in the machine controller. With this change/workaround, CAPD checks if the bootstrap provider is etcdadm, skips the kubectl patch step, and sets the provider ID on the DockerMachine directly (see the sketch after this list).
    • CAPA, unlike CAPD, seems to create the provider ID using the ec2 instance metadata and sets that directly on the AWSMachine resource. I may be wrong, but I didn't see it patching the k8s node. Even if it does, when I tried provisioning an external etcd cluster using CAPI+CAPA, I didn't have to make any changes in CAPA to get the provider ID set on the AWSMachine and Machine resources, and the etcd cluster got provisioned using the POC etcdadm bootstrap provider.
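A hedged sketch of the CAPD workaround described above, not the actual DockerMachine controller code: when the owning Machine's bootstrap config kind is EtcdadmConfig there is no Kubernetes node to patch, so the provider ID is written straight to the DockerMachine. The helper signature is illustrative; the real controller works on the full CAPI/CAPD types, and the docker:////<container-name> format shown is assumed from CAPD's existing behavior.

package main

import "fmt"

// setProviderID decides whether the Docker provider should patch the workload
// cluster node (the normal flow) or just return the provider ID for the
// DockerMachine (etcd-only machines, which never register as nodes).
func setProviderID(bootstrapConfigKind, containerName string, patchNode func(providerID string) error) (string, error) {
    providerID := fmt.Sprintf("docker:////%s", containerName)
    if bootstrapConfigKind == "EtcdadmConfig" {
        // No node exists for an etcd-only machine; skip the kubectl patch.
        return providerID, nil
    }
    if err := patchNode(providerID); err != nil {
        return "", err
    }
    return providerID, nil
}

func main() {
    id, _ := setProviderID("EtcdadmConfig", "etcd-cluster-abc12", nil)
    fmt.Println(id) // docker:////etcd-cluster-abc12
}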

Questions/concerns about implementation

  • [Resolved] TL;DR: How to get the IP address of the etcd init node (the first member, on which etcdadm init is run), how to make the rest of the etcd members wait for the first node's IP address to be set, and how to propagate this address to the remaining nodes

    • I have been working on a POC bootstrap provider; it can provision a multi-node etcd cluster using etcdadm (3 nodes tested using the Docker provider): https://github.com/mrajashree/etcdadm-bootstrap-provider. This is just to understand better what problems we might face while using etcdadm. One was figuring out how to generate the etcdadm join command.
    • The kubeadm bootstrap provider uses the endpoint of the load balancer created for the api server as the endpoint in init/join commands, but that doesn't seem possible with etcdadm. etcdadm join needs the first machine's client URL, which means the first machine needs to be provisioned and have its address fields populated by the infra provider before the etcdadm join commands can be run on any other node. As a temporary solution, I modified the machine controller to create a secret for the cluster containing the IP address of the first node once it's provisioned. The remaining nodes wait on this secret, so the etcdadmconfig controller keeps returning nil/error until it exists. Is this the right approach?
  • Should the machines making the external etcd be registered as a Kubernetes node or not?

    • So far the assumption is that the external etcd cluster will only run etcd and no other kubernetes components, but are there use cases that will need these machines to also register as a kubernetes node?
    • To clarify, if we have use cases where we need certain helper processes to run on these etcd nodes, we can do it with kubelet + static pod manifests.
  • TL;DR: Are the changes made to CAPD to set the node provider ID acceptable, and are those changes not needed in other infra providers?

    • From what I've seen, the docker infra provider runs a kubectl command to patch the kubernetes node to set the provider ID, and then also sets the same on the DockerMachine resource. For the POC changes, I added a check to see if the bootstrap provider is etcdadm; in that case CAPD will skip patching the k8s node (since there isn't one) and directly set the provider ID on the DockerMachine. That works to the point where it provisions an etcd cluster; I tested it by running etcdctl commands. For cloud providers like aws, the provider ID is obtained from the instance metadata and set directly on the InfraMachine, so in the case of CAPA I didn't have to make any changes in the CAPA machine controller for the node provider ID. I was using an ubuntu ami that doesn't install kubectl/kubelet or anything related to k8s on the ec2 instances, and etcd cluster provisioning still worked. So does this mean we need a solution only for CAPD, and the rest of the providers should work as they are for setting the node provider ID? And is the solution for CAPD acceptable? (Here's the solution for CAPD)
  • [Resolved] TL;DR: VM replacement will require etcd cluster reconfiguration OR can we use fixed internal IP addresses in all infrastructure providers?

    • The CAPI Load Balancer provider will be developed with an implementation of configuring DNS records in the infra providers. This can be used to create the static endpoints for the etcd cluster.
    • The Load Balancer provider is not developed yet, so phase 1 of external etcd provisioning will use IP addresses as etcd endpoints. We need to figure out upgrades in case the etcd endpoints change, such as when going from IP addresses to static endpoints once the Load Balancer provider is developed.
  • TL;DR: Providing etcd endpoints to the controlplane

    • The KubeadmControlPlane's ClusterConfiguration accepts either Local or External options for etcd. We will need a new option to indicate that the external etcd cluster should be provisioned by CAPI first; but then can we update the ClusterConfiguration spec with the etcd cluster's endpoints and let the kubeadm controller pick it up? (See the sketch at the end of this list.)
  • Infrastructure provider specific changes:

    • Security Groups/Ports: For CAPA, I had to add an etcd-specific security group with rules for ports 2379-2380. Will other infra providers need these changes too?
    • Fixed IPs: Infrastructure provider specific concepts to keep etcd nodes IP address constant.
  • Images: Will CAPI build a new image for external etcd nodes, or embed the etcdadm release in the current image and use it for all nodes?
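For the ClusterConfiguration question above, here is a hedged sketch of how the controller could populate the KubeadmControlPlane's external etcd configuration once the managed etcd cluster is ready. The import path and the surrounding reconcile wiring are assumptions; the ExternalEtcd fields and cert paths mirror kubeadm's standard external etcd configuration.

package controlplane

import (
    // Assumed path for CAPI's v1alpha4 kubeadm bootstrap types.
    bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1alpha4"
)

// setExternalEtcdEndpoints fills in the External etcd section that kubeadm
// needs, pointing at the CAPI-managed etcd members and the cert paths where
// the generated Secrets are placed on control plane nodes.
func setExternalEtcdEndpoints(cfg *bootstrapv1.ClusterConfiguration, endpoints []string) {
    if cfg.Etcd.External == nil {
        cfg.Etcd.External = &bootstrapv1.ExternalEtcd{
            CAFile:   "/etc/kubernetes/pki/etcd/ca.crt",
            CertFile: "/etc/kubernetes/pki/apiserver-etcd-client.crt",
            KeyFile:  "/etc/kubernetes/pki/apiserver-etcd-client.key",
        }
    }
    cfg.Etcd.External.Endpoints = endpoints
}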

Design questions

  • External etcd endpoints field is currently immutable. To use this field for the managed external etcd cluster it should be mutable.

  • Controlplane providers should be responsible for picking up etcd endpoints

  • Should etcd controller be pluggable like infra provider, load balancer provider

  • Contract between the external etcd provider and the controlplane provider; if the etcd provider is pluggable, a contract between it and CAPI

    • This might affect BYO-external etcd clusters
    • But we do need to allow this field to be changed, in case we want to update external etcd endpoints (IP addresses); or switch from IP address to static endpoints once the Load Balancer provider is available.
  • Updating external etcd endpoints (fine for now)

    • An update to KubeadmConfigSpec.ClusterConfiguration.Etcd.Endpoints will create new KubeadmConfig and Machine resources. These new nodes will run kubeadm commands with the updated endpoints.
    • The controlplane cluster requires two certs to use the external etcd cluster:
      • Etcd CA cert
      • Etcd client cert-key pair
    • Both of these exist as Secrets in the management cluster. The CA cert won't change in case of an upgrade. Any changes required to client cert will be done by the etcdadm controller
  • API Changes

    • New field to indicate its a managed external etcd cluster
      • If it's on Cluster spec instead of KubeadmControlPlane spec, it's not specific to Kubeadm
  • Goals/non-goals

    • using with clusterctl
  • contract

Goals

  • 1:1 mapping between the external etcd cluster and the workload cluster
  • Pluggable etcd providers

Non-goal (examples)

  • Not starting with clusterctl support

Security Model

  • Does this proposal implement security controls or require the need to do so?

    • Kubernetes RBAC

      • The etcd CA cert-key pair is stored in Secrets on the bootstrap cluster; roles are needed to grant the bootstrap provider and CAPI controllers access to them, the same as the current roles for CA cert-key pair Secrets
    • Security Groups

      • Etcd nodes need to have ports 2379 and 2380 open for incoming client and peer requests
  • Is any sensitive data being stored in a secret, and only exists for as long as necessary?

    • The CA cert key pair will be stored in a Secret

Risks and Mitigations

  • What are the risks of this proposal and how do we mitigate? Think broadly.
  • How will UX be reviewed and by whom?
  • How will security be reviewed and by whom?
  • Consider including folks that also work outside the SIG or subproject.

Alternatives

The Alternatives section is used to highlight and record other possible approaches to delivering the value proposed by a proposal.

Upgrade Strategy

If applicable, how will the component be upgraded? Make sure this is in the test plan.

Consider the following in developing an upgrade strategy for this enhancement:

  • What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior?
  • What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement?

Additional Details

Test Plan [optional]

Note: Section not required until targeted at a release.

Consider the following in developing a test plan for this enhancement:

  • Will there be e2e and integration tests, in addition to unit tests?
  • How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy.
Anything that would count as tricky in the implementation and anything particularly challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage expectations).
Please adhere to the Kubernetes testing guidelines when drafting this test plan.

Graduation Criteria [optional]

Note: Section not required until targeted at a release.

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal should keep
this high-level with a focus on what signals will be looked at to determine graduation.

Consider the following in developing the graduation criteria for this enhancement:

Clearly define what graduation means by either linking to the API doc definition,
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed.

Version Skew Strategy [optional]

If applicable, how will the component handle version skew with other components? What are the guarantees? Make sure
this is in the test plan.

Consider the following in developing a version skew strategy for this enhancement:

  • Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?
  • Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.

Implementation History

  • MM/DD/YYYY: Proposed idea in an issue or community meeting
  • MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
  • MM/DD/YYYY: First round of feedback from community
  • MM/DD/YYYY: Present proposal at a community meeting
  • MM/DD/YYYY: Open proposal PR