# Karpenter Provider Cluster API Open Questions
After discussion during the [1 May 2024 feature group meeting](https://hackmd.io/jm8zONkZSuK16-aUvCUw6w?view#2024-05-01--1900-UTC), we are reevaluating what types of custom resources the capi provider will utilize. Originally, we had the notion of talking directly to the InfrastructureMachineTemplates and creating orphan Machines to manage resources. But this approach loses all the contextual information that a user might encode in their MachineDeployments, and it is a degradation of the core capi experience. To quote from the [Cluster API manifesto](https://cluster-api.sigs.k8s.io/user/manifesto#simplicity) on the topic of simplicity:
> Kubernetes Cluster lifecycle management is a complex problem space, especially if you consider doing this across so many different types of infrastructures.
>
> Hiding this complexity behind a simple declarative API is “why” the Cluster API project ultimately exists.
>
> The project is strongly committed to continue its quest in defining a set of common API primitives working consistently across all infrastructures (one API to rule them all).
>
> Working towards graduating our API to v1 will be the next step in this journey.
>
> While doing so, the project should be inspired by Tim Hockin’s talk, and continue to move forward without increasing operational and conceptual complexity for Cluster API’s users.
Taking this as inspiration, this document records some of the areas where Cluster API functionality could be improved to help with the integration of karpenter.
## Questions about MachineDeployments
### ProviderID for New Machines
When requesting capacity, karpenter will create a NodeClaim object. It is the provider's responsibility to update the status of the NodeClaim with information about the instance's progress and metadata.
When increasing the replica count of a MachineDeployment, what is the best method for finding the Machine resource that was created in response to the scaling change?
We will need to be able to receive a NodeClaim and then accurately update it with information about the Machine that was created.
It is possible that we could use a caching mechanism inside the karpenter provider to help solve this issue, but there might be a more idiomatic cluster-api methodology.
Karpenter wants to be able to address individual nodes when making scaling requests; this is in contrast to the cluster autoscaler, which scales node groups. It is important to keep the NodeClaims updated so that the karpenter core can make informed decisions about which nodes to keep and which to remove.
The interactions described in this section will necessitate some mediator to translate between a MachineDeployment and the Machines it owns. Karpenter core will want to deal with provider IDs and we will need to translate those into the corresponding MachineDeployments to understand which groups need scaling.
[From asking on slack](https://kubernetes.slack.com/archives/C8TSNPY4T/p1714744451122169), it appears that the idiomatic way to do this would be to list the Machines for a MachineDeployment and check the creation timestamps.
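A minimal sketch of that approach, assuming the provider uses the controller-runtime client and the Cluster API `v1beta1` types, and that the `cluster.x-k8s.io/deployment-name` label is propagated to the Machines (the function name is illustrative):
```go
package provider

import (
	"context"
	"sort"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newestMachineFor lists the Machines owned by a MachineDeployment and returns
// the most recently created one, which is the best guess for the Machine that
// was created in response to a replica increase.
func newestMachineFor(ctx context.Context, c client.Client, md *clusterv1.MachineDeployment) (*clusterv1.Machine, error) {
	machineList := &clusterv1.MachineList{}
	// Machines created from a MachineDeployment carry the deployment name label.
	if err := c.List(ctx, machineList,
		client.InNamespace(md.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/deployment-name": md.Name},
	); err != nil {
		return nil, err
	}
	if len(machineList.Items) == 0 {
		return nil, nil
	}
	// Sort newest first by creation timestamp.
	sort.Slice(machineList.Items, func(i, j int) bool {
		return machineList.Items[j].CreationTimestamp.Before(&machineList.Items[i].CreationTimestamp)
	})
	return &machineList.Items[0], nil
}
```
Note that creation timestamps alone may be ambiguous when several replicas are added at once, which is part of why a caching mechanism in the provider, or an asynchronous provider ID flow, may still be needed.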
_update:_ I've opened [this issue](https://github.com/kubernetes-sigs/karpenter/issues/1273) in the karpenter repository to discuss asynchronous provider ID propagation.
### Deleting a Specific Machine
This appears to be a non-issue as we can utilize the same `cluster.x-k8s.io/delete-machine` annotation as the cluster autoscaler to signal which machine should be removed. We also need to update the replicas when we apply the annotation.
When karpenter chooses to delete a Node it will pass the associated NodeClaim to identify what should be removed.
When using MachineDeployments, will we need to remove the Machine that karpenter is requesting and also update the MachineDeployment replicas to ensure a proper Node removal?
Our cluster autoscaler implementation currently does this math when removing nodes: the core autoscaler requests that some nodes be removed (by referencing the Node objects), but in that case the cluster autoscaler identifies the node group (MachineDeployment) in question before requesting the reduction.
For karpenter, we might need to get the Machine object from the provider ID given in the NodeClaim, then get the MachineDeployment from the labels on the Machine, and then reduce the replica count on the MachineDeployment while deleting the Machine.
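A rough sketch of what that sequence could look like, assuming controller-runtime and the Cluster API `v1beta1` types; the lookup of the Machine and MachineDeployment from the provider ID is omitted, and the function name is illustrative:
```go
package provider

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteMachineFromDeployment marks a specific Machine for deletion and then
// decrements the replica count on its owning MachineDeployment so that the
// scale-down removes the annotated Machine rather than an arbitrary one.
func deleteMachineFromDeployment(ctx context.Context, c client.Client, machine *clusterv1.Machine, md *clusterv1.MachineDeployment) error {
	// Annotate the Machine so it is prioritized during scale-down.
	machinePatch := client.MergeFrom(machine.DeepCopy())
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations["cluster.x-k8s.io/delete-machine"] = "true"
	if err := c.Patch(ctx, machine, machinePatch); err != nil {
		return err
	}

	// Reduce the replica count so the scale-down actually happens.
	mdPatch := client.MergeFrom(md.DeepCopy())
	if md.Spec.Replicas != nil && *md.Spec.Replicas > 0 {
		replicas := *md.Spec.Replicas - 1
		md.Spec.Replicas = &replicas
	}
	return c.Patch(ctx, md, mdPatch)
}
```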
### Resource Capacity Information
Karpenter needs to know the shape of Nodes that are available so that it can make decisions about what to request.
In the context of Nodes, "shape" means the resource capacity of the Node, e.g. cpu cores, memory bytes, gpu devices, etc.
The [Opt-in Autoscaling from Zero CAEP](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md) defines a mechanism for adding annotations to a MachineDeployment
to describe its shape. For example:
```yaml
kind: <MachineSet or MachineDeployment>
metadata:
annotations:
capacity.cluster-autoscaler.kubernetes.io/gpu-count: "1"
capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
capacity.cluster-autoscaler.kubernetes.io/memory: "500mb"
capacity.cluster-autoscaler.kubernetes.io/cpu: "1"
capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi"
```
These annotations can be used, but a more concrete API would be preferable. As defined in the CAEP, the InfrastructureMachineTemplate can, at the provider's option, carry information in its `status` field about the Machine's capacity. For example:
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: DockerMachineTemplate
metadata:
name: workload-md-0
namespace: default
spec:
template:
spec: {}
status:
capacity:
memory: 500mb
cpu: "1"
nvidia.com/gpu: "1"
```
Having this information on the `status` field of a MachineDeployment might also be worthwhile since it would reduce the number of API calls necessary to determine the instance shape.
Another idea would be to propagate this information to Machine records by putting the annotations into `MachineDeployment.spec.template.metadata.annotations`. This would ensure that each Machine had the capacity information as well, making it much easier to consume in Karpenter.
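However the capacity ends up being surfaced, the provider will need to turn it into resource quantities that karpenter can reason about. Below is a minimal sketch that reads the scale-from-zero annotations into a `ResourceList`; the annotation keys come from the CAEP above, while the helper name is illustrative:
```go
package provider

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Scale-from-zero annotation keys defined by the opt-in autoscaling-from-zero CAEP.
const (
	cpuAnnotation    = "capacity.cluster-autoscaler.kubernetes.io/cpu"
	memoryAnnotation = "capacity.cluster-autoscaler.kubernetes.io/memory"
)

// capacityFromAnnotations builds a ResourceList describing the node shape from
// the scale-from-zero annotations on a MachineDeployment (or MachineSet).
func capacityFromAnnotations(annotations map[string]string) (corev1.ResourceList, error) {
	capacity := corev1.ResourceList{}
	if cpu, ok := annotations[cpuAnnotation]; ok {
		q, err := resource.ParseQuantity(cpu)
		if err != nil {
			return nil, err
		}
		capacity[corev1.ResourceCPU] = q
	}
	if mem, ok := annotations[memoryAnnotation]; ok {
		q, err := resource.ParseQuantity(mem)
		if err != nil {
			return nil, err
		}
		capacity[corev1.ResourceMemory] = q
	}
	return capacity, nil
}
```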
## General Questions
### Node Zone Labels
Karpenter does some matching of labels when attempting to find, or create, nodes that can satisfy a pod. The zone label (`topology.kubernetes.io/zone`) is especially important for workloads that need geographical awareness or affinity. As each Cluster API provider will have a different methodology for implementing zone data internally, how will we know the zone before a node is created?
For the zone, and other node labels, we can initially use the scale-from-zero annotations for the PoC.
```
capacity.cluster-autoscaler.kubernetes.io/labels: "key1=value1,key2=value2"
```
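A small sketch of how the provider could parse that annotation into node labels (the helper name is illustrative):
```go
package provider

import "strings"

// labelsFromAnnotation parses the scale-from-zero labels annotation, e.g.
// "key1=value1,key2=value2", into a map that can inform NodeClaim labels.
func labelsFromAnnotation(value string) map[string]string {
	labels := map[string]string{}
	for _, pair := range strings.Split(value, ",") {
		if pair == "" {
			continue
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			continue
		}
		labels[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return labels
}
```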
### Instance Type Naming
The karpenter internal APIs are configured in a way that assumes an instance type has a name associated with it. Translating this into Cluster API parlance becomes more complex when instances are viewed through the lens of MachineDeployments.
### Offerings
Cluster API does not currently have a concept of price offerings associated with specific machine types. One of karpenter's most popular features is its ability to include price information when planning and requesting infrastructure resources.
Adding an interface for offerings data will require input and design from the Cluster API community to ensure that it is widely accepted and implemented.
In addition to karpenter, the cluster autoscaler also has price-based expansion features available for use. If Cluster API had a way to expose this information it would be useful to multiple applications.
How should we accommodate special negotiated pricing and offerings?
### Taints
Karpenter wants to know what taints will be on a node created from any given NodeClaim; more precisely, it uses the taints as constraints when creating a NodeClaim. There is a mechanism for utilizing the scale-from-zero annotations to indicate what taints will be on nodes made from a MachineDeployment; this is probably the methodology we will need to use to start.
How do we know what taints will be on a node that is created from a given MachineDeployment?
As a starting point, we can use the scale from zero annotations for the PoC.
```
capacity.cluster-autoscaler.kubernetes.io/taints: "key1=value1:NoSchedule,key2=value2:NoExecute"
```
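A sketch of how the provider could parse that annotation into taints, assuming the `key=value:Effect` format shown above (the helper name is illustrative):
```go
package provider

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// taintsFromAnnotation parses the scale-from-zero taints annotation, e.g.
// "key1=value1:NoSchedule,key2=value2:NoExecute", into Kubernetes taints.
func taintsFromAnnotation(value string) ([]corev1.Taint, error) {
	var taints []corev1.Taint
	for _, entry := range strings.Split(value, ",") {
		if entry == "" {
			continue
		}
		keyValue, effect, found := strings.Cut(entry, ":")
		if !found {
			return nil, fmt.Errorf("taint %q is missing an effect", entry)
		}
		key, val, _ := strings.Cut(keyValue, "=")
		taints = append(taints, corev1.Taint{
			Key:    strings.TrimSpace(key),
			Value:  strings.TrimSpace(val),
			Effect: corev1.TaintEffect(strings.TrimSpace(effect)),
		})
	}
	return taints, nil
}
```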
### Asynchronous Action
Some of the karpenter behaviors appear to expect synchronous behavior, e.g. `Create` wants us to return status about creation and the provider ID. Testing is under way to learn more about the limits of the asynchronous behavior, with outreach to the karpenter community for guidance.
There is not much we can change on the cluster api side to address this; we will need to learn about karpenter's requirements and work with that community to address our needs.
### Instance Type Labels
When informing karpenter about instance types through the `GetInstanceTypes` interface method, karpenter expects that the name of the instance type will correspond to the `node.kubernetes.io/instance-type` label for the nodes. The instance type label is usually applied by a cloud controller manager.
It is not clear how deeply karpenter depends on this information, but we might need a way to understand what the instance type will be when browsing the scalable resources that wrap the instances.
---
# Notes
* we might want to include other scalable types later, eg MachinePool
* where does the NodePool originate?
* can we make a default for users?
* upgrades, how will we handle drift
* which is in control, karpenter or capi
* providerIDs, how to know when a new one comes in
* possible to have a 1:1 MD:M relationship
* [scott] was dealing with a capi cluster with 1200 nodes, if it had been 1:1 with MDs i think i would have had serious problems operating kubernetes. would be careful about doing this as opposed to enumerating the instances available in another manner.
* autogenerating MDs would be part of this process
* thoughts from alberto
* the value we are adding here: node autoprovisioning
* we currently have aws and azure providers, if we look at aws provider it takes input about what infrastructure you like and the provider goes and chooses the best instances for you and deploys it. currently, the aws provider uses the fleet interface to do the creation. fleet interface has a lot of magic built into it. think the discussion we need to be having is about how do we include interfaces like fleet so that we can match the native provider interfaces. my vision would be for the CAPA provider to be calling fleet on the backend.
* [joel] +1 on CAPA fleet stuff
* [scott] fleet api is amazing, do we start to get too close to provider-specific behavior when we push into these types of apis. maybe we want to have a generic crd/api in the capi world that would help to coordinate these types of interactions with the lower level provider apis.
* [alberto] fleet would be an implementation detail of CAPA, imo. we would expose options through the existing apis in capi. if we look at the karpenter-aws api today, they provide many options that we don't in capi and would need to include these somehow.
* [joel] did some experimentation with fleet and it seems like a superset of RunInstance interface. think this is a tangential topic, but a good general improvement to capi.
* agree about benefits of having a basic PoC to start investigating behavior and patterns. plus we would be able to have an organized way to proceed on development.
* think before we ship anything for karpenter that we have the discussions about fleet-like things
* also, having something concrete gives us data to compare between karpenter versions
* was thinking about the general problem of providers today, possible we could have a mechanism in the future where the user could specify the controllers they want to run with a specific implementation of capi.
* rationale for scalable resources
* capi users have certain expectations
* if we put karpenter on top of those expectations, it will be easier for them
* if we make a completely different pattern for karpenter, we will make it more complex
* 2 categories of questions
* technical stuff, "where does X come from?"
* api design, this is more complex, previously our needs for autoscaler with respect to scale-from-zero were constrained. if we pursue this deeper in the PoC we will need to examine what is needed for the API instead of using annotations.
* [fabrizio] i would like to revive the api review group for capi. we have several open PRs/issues that could benefit from a strong api review.
## Alternate style implementation with provider details
While chatting with Alberto about alternate styles of implementation, he shared with me a design for wrapping the native provider (e.g. aws, azure) with a capi provider. This style could use the deep provider integrations for inventory while also creating capi objects to observe the cluster topology.
https://github.com/enxebre/karpenter-provider-cluster-api/tree/machinepool-dev
This would require us to have provider-specific implementations inside the capi provider, and would also limit us to providers who have created code for karpenter (either in their own provider or the capi provider).
A topic for discussion around this style of implementation is the creation of capi resources. In this implementation, as instances are created, scalable resources (e.g. MachinePool) are created to own the machines as they join. A couple of questions I'd like to get into:
* are users ok with the karpenter provider creating scalable resources?
* do we need some way for users to supply configuration for those resources?
* would we want to have reusable scalable resources so that we aren't creating a new resource for each instance?
## Instance type offerings
If there were a way to designate MachineDeployments, or other scalable resources, as having the same type of instances as another, e.g. through the use of the `*/instance-type` label, then we might be able to group them as offerings. By their nature, offerings allow multiple zones and reservation styles (on-demand, spot, etc.) to be represented as part of a single instance type.
When listing MachineDeployments to inform karpenter about instance types, scalable resources that refer to the same instance type, where the only difference in resources is the zone of deployment, should be listed as a single instance type with multiple offerings.
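A sketch of that grouping, assuming the instance type can be read from a `node.kubernetes.io/instance-type` label on the MachineDeployment's machine template metadata (both the label placement and the helper name are assumptions):
```go
package provider

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// groupByInstanceType buckets MachineDeployments by the instance type their
// nodes will advertise, so that deployments differing only by zone can be
// presented to karpenter as one instance type with multiple offerings.
func groupByInstanceType(deployments []clusterv1.MachineDeployment) map[string][]clusterv1.MachineDeployment {
	groups := map[string][]clusterv1.MachineDeployment{}
	for _, md := range deployments {
		// Assumption: the instance type label is present on the machine template metadata.
		instanceType, ok := md.Spec.Template.ObjectMeta.Labels["node.kubernetes.io/instance-type"]
		if !ok {
			continue
		}
		groups[instanceType] = append(groups[instanceType], md)
	}
	return groups
}
```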