# Cluster API Karpenter Feature Group Notes

[meeting zoom](https://zoom.us/j/861487554?pwd=dTVGVVFCblFJc0VBbkFqQlU0dHpiUT09) (passcode: 77777)

Meetings are scheduled for 19:00 UTC, immediately following the Cluster API office hours.

Please add agenda items for future meetings here:

* <add name and agenda topic above here ex. _[name] topic_>

## 2024-05-15 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=Ol4MwzibYOw)

### Attendees

* elmiko - Red Hat
* Jeremy - Adobe

### Agenda

* [elmiko] prep for deep dive on 16 May
* [jeremy] adding more abstractions on top of karpenter and cluster-api could lead to worse performance. don't want the layer cake to get so thick that it adds to the time it takes to make nodes.
    * in general, scaling speed is an important optimization; having nodes come up quickly reduces the pressure to overprovision.

## 2024-05-01 @ 19:00 (UTC)

[**recording**](https://youtu.be/R57c-0hXn_Q)

### Attendees

* elmiko - Red Hat
* Fabrizio - VMware
* Tony G, Jeremy L - Adobe

### Agenda

* [elmiko] using scalable resource types to back NodeClaims
    * giving some background
* [fabrizio] want to avoid us going down the path of re-engineering the scalable resources to account for the collections of machines that are created
* [fabrizio] what if we could invert the layers a little and have a MachineDeployment as the owner, perhaps with some options to differentiate the machines
* [fabrizio] happy to help with some pairing
* AI: elmiko to gather notes about technical difficulties with scalable resources, will schedule a deep dive meeting with Fabrizio, and the community, to investigate further

## 2024-04-17 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=mzxFTZh_YFM)

### Attendees

* elmiko - Red Hat
* Mike T, Jeremy L - Adobe
* Pau - Giant Swarm

### Agenda

* [elmiko] making my repo public
    * it's not ready for use, but i would feel better to have it open
    * [mike t] +1 if it's at that point and advertises its status
* [elmiko] review some architecture decisions for the PoC
    * joined cluster to make things simple
    * labeling machines for ownership
        * [mike t] +1 short term, longer term want to be able to delete specific machine objects, how will this work in karpenter?
        * this is a good question and we might use ownership with the NodeClaim in some way
        * [jeremy] from my experience working with karpenter, it seems like there are too many options for delete style operations. eg delete the node or delete the nodeclaim.
    * specifying infra templates by name in ClusterAPINodeClass (sketched after the 2024-04-03 notes below)
        * [mike t] i'm a fan of label selectors

## 2024-04-03 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=6jNw9txrAsQ)

### Attendees

* elmiko, Marco - Red Hat

### Agenda

* [elmiko] - first technical challenge, working through multiplexing client issues.
* [scott] - have we looked at how karpenter will look at the NodeClaims and NodePools, can it be configured to run multiple copies in a specific namespace, how will we handle multiple clusters in the same management cluster?
    * yes, somewhat, we will have to instruct users that running karpenter is best suited for individual namespaces
    * AI: elmiko to look into namespacing karpenter and whether it can differentiate NodePools, NodeClaims.
* [scott] how will we indicate which infrastructure templates are ok for karpenter to use?
    * elmiko: initially just going to use annotations, but we'll need to talk with the wider community (see the sketch below)
    * scott: maybe a list on the Cluster object to indicate which can be used?
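To make the two ideas above concrete — annotating infrastructure templates that karpenter may consume, and referencing a template by name from a `ClusterAPINodeClass` — here is a rough YAML sketch. The annotation key, the `ClusterAPINodeClass` group/version, and all of its fields are placeholders for discussion rather than an agreed API; the kubemark template kind and versions are only used because the PoC targets the kubemark provider.

```yaml
# Hypothetical sketch only: the annotation key and the ClusterAPINodeClass
# schema illustrate the ideas discussed above, not a settled API.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4   # version illustrative
kind: KubemarkMachineTemplate
metadata:
  name: karpenter-worker-template
  namespace: my-cluster-ns
  annotations:
    # placeholder key marking this template as usable by the karpenter provider
    cluster.x-k8s.io/karpenter-enabled: "true"
spec:
  template:
    spec: {}
---
apiVersion: karpenter.cluster.x-k8s.io/v1alpha1   # placeholder group/version
kind: ClusterAPINodeClass
metadata:
  name: default
  namespace: my-cluster-ns
spec:
  # option 1: reference the infra template directly by name
  machineTemplateRef:
    kind: KubemarkMachineTemplate
    name: karpenter-worker-template
  # option 2 (mike t's preference): select templates by label instead
  # machineTemplateSelector:
  #   matchLabels:
  #     karpenter.cluster.x-k8s.io/enabled: "true"
```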
## 2024-03-20 @ 19:00 (UTC)

### Cancelled due to KubeCon

next scheduled meeting 2024-04-17

## 2024-03-06 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=aLJeoO2oyEE)

### Attendees

* Tony Gosselin, Mike Tougeron - Adobe
* elmiko - Red Hat
* Scott Rosenberg
* Jack Francis - Microsoft

### Agenda

* [elmiko] where should the karpenter resources live?
    * as i'm working through the deployment options it gives me pause for thought about how we will consume the karpenter and cluster api resources. would love to talk through the various patterns.
    * [jack] do the karpenter objects have a lifecycle beyond a specific reconciliation, does NodeClaim have owner-like properties?
        * think so
    * [jack] in CAS, we namespace the app in the management cluster, but it only has the application, no specific operands. many-to-many relation between karpenters and workload clusters.
    * [scott] have tested a large number of clusters with capi and cas, seemed to work well, needed a little more resource for the cas instances
    * [jack] if karp is the same, would want to have namespace separation for karpenters, with similar separation for the crds. having these in the management cluster makes sense in this scenario.
    * seems like we are talking about having a similar pattern for karpenter, recommend running in the management cluster in the same namespace as the capi artifacts.
    * [scott] open question here, how will multiple karpenters in the same namespace handle living next to each other, is this even possible? we use a similar pattern in CAS now.
    * [jack] for MVP we can focus on karpenter in the management cluster, and then later we can address some of the wider questions, or even adjust our assumptions.
    * [mike] my team has hesitancy to run everything through a management cluster, not for anything specific beyond "all the eggs in one basket", i do think running in the management cluster is the better approach.
    * [scott] running karpenter in the workload cluster starts to get into that self-managing pattern that we've seen be problematic in the past. having CAS for the management cluster and karpenter for workloads is another pattern that would be beneficial.
    * [jack] we need to make sure that we build it in such a way that it is idempotent and can restart easily
* [elmiko] (describing architecture and progress)
    * [scott] infra templates in capi are much more specific than those that karpenter normally deals with
    * [elmiko] would like to see this get better over time, maybe we need more from capi
    * [jack] do you think we'll need capi changes for the poc?
        * don't think so, we should be able to use the existing mechanisms for scale from zero with the kubemark provider for the poc

## 2024-02-21 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=VC7A0681Jzw)

### Attendees

* elmiko - Red Hat
* Jack Francis - Microsoft

### Agenda

* [elmiko] repo progress
    * hope to have the repo set up in the next couple of weeks
    * [jack] are we talking about a `kubebuilder init` type repo?
    * [elmiko] not quite sure yet, want it to be easy to hack on and build
* [elmiko] orphan machine progress
    * making decent progress, doesn't seem blocked by controllers
    * [jack] might want to stop using the "orphan" term as it's not quite accurate
        * "karpenter owned"
        * "singleton machine"
        * "ownerless machine"
    * [jack] concept of "externally" owned machine
        * elmiko ++
* [elmiko] price information / dynamic inventory / inventory discovery (see the sketch below)
    * [jack] this will be cloud provider specific, not sure if aws/azure have enough to use as a common implementation or api.
        * potential for other providers to emulate the behavior if we can agree on a contract (aws/azure already have this written).
    * what about some sort of discovery information about instances on the provider (eg pricing)
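For reference in the inventory discussion above: cluster-api's existing scale-from-zero support already lets capacity hints be published as annotations on the scalable resource, and a price hint could hypothetically follow the same pattern. In the sketch below the capacity annotations follow that existing convention, while the pricing annotation is invented purely to illustrate the idea; resource names, kinds, and versions are illustrative.

```yaml
# Rough sketch: the capacity annotations follow cluster-api's documented
# scale-from-zero convention; the pricing annotation is a hypothetical
# extension used only to illustrate the inventory/pricing idea above.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers
  namespace: my-cluster-ns
  annotations:
    # existing scale-from-zero capacity hints (cluster-api convention)
    capacity.cluster-autoscaler.kubernetes.io/cpu: "4"
    capacity.cluster-autoscaler.kubernetes.io/memory: "16G"
    # hypothetical: a provider-published price hint a karpenter provider could read
    pricing.cluster.x-k8s.io/on-demand-hourly: "0.192"
spec:
  clusterName: my-cluster
  replicas: 0
  selector:
    matchLabels: null
  template:
    spec:
      clusterName: my-cluster
      version: v1.29.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workers-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4   # illustrative
        kind: KubemarkMachineTemplate
        name: workers-template
```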
## 2024-02-07 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=3Uer9FKEbRQ)

### Attendees

* elmiko - Red Hat
* Cameron McAvoy - Indeed
* Jeremy Lieb - Adobe

### Agenda

* [elmiko] orphan machine discussion
    * concerned about owner references after reading the docs
    * https://cluster-api.sigs.k8s.io/developer/architecture/controllers/machine
    * still working on some kubemark experiments with orphans
    * [elmiko] not sure if the owner reference is causing me issues
    * [cameron] looked into something similar with machinepools, this might be in the pr conversation or enhancement
        * look at machinepool machines, the provider can make machines that might not have an owner
* [elmiko] repo progress
    * talked with sig autoscaling, the cluster api community, and the karpenter working group, no objections to creating the repo
    * would like to get a bootstrap in place before requesting
    * [cameron] would be nice if we could consume the early versions to be able to try things out

## 2024-01-24 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=NKxzMK8wfho)

### Attendees

* elmiko - Red Hat
* jeremy - Adobe
* jonathan - AWS

### Agenda

* [jeremy] looking at how to run karpenter on cluster
    * the karpenter docs recommend not running it in the same cluster
    * thinking about some sort of dual-mode management where CAS could manage some nodes
        * don't like having a bifurcated approach
        * like being able to specify instance types for use, and lose some of that when using ASGs
    * set up a cascade style topology with karpenter nodes managing other karpenter nodes
        * karp admin node pool, small maybe 2 nodes, managing scaling for the rest of the cluster (in a separate node pool)
        * a separate node pool, also running karpenter, manages the admin node pool
        * feel this is better than managing the admin pool with CAS
    * same cluster, multiple node pools, multiple karpenters monitoring
        * is this like a ring or more like primary/secondary?
            * ring
            * node group a - running karp a - managing node pool b
            * node group b - running karp b - managing node pool a
        * scaling to zero is an issue because it would starve one of the karpenters
    * the question that prompted this investigation was about how to size the nodes for running karpenter; thinking about it in relation to ASGs and CAS, we weren't quite sure what we wanted, karpenter seemed better aligned
    * want the admin node group to have the ability to vertically scale when karpenter needs more resources

## 2024-01-10 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=KQrW-wp3WWw)

### Attendees

* elmiko, Marco Braga - Red Hat
* jackfrancis - Microsoft
* Cameron McAvoy - Indeed
* Jeremy Lieb - Adobe

### Agenda

* [elmiko] should we increase the frequency of this meeting?
    * +1 from jack
    * elmiko, i'm neutral, kinda prefer if we change when needed
    * Cameron, maybe once we have a repo and some code to hack on
    * let's revisit in a month
* [elmiko] orphan machine pattern and infrastructure templates
    * [elmiko] i am investigating this direction, hoping to build traction here
    * [jack] no objection to this direction, would love to see a prototype of what the integration would look like. what does it look like to create the machine with an infra template and metadata. active capi cluster running a provider, skip the owner object and just create machines on their own. would like to confirm this.
    * [elmiko] i am investigating this with the kubemark provider, can create a demo for next time
    * [cameron] no objection, would like to better understand the primary purpose for the orphan machines.
        * [elmiko] orphan machines are like a read-only way for users to understand what is happening
    * [jack] on deletion, why wouldn't we just delete the machine and allow capa (for example) to do the cleanup?
        * we will rely on the machine being deleted by provider id from the core interface that karpenter exposes for node claims
* [elmiko] do we have enough consensus to start thinking about creating a code repo?
    * should we make a `karpenter-provider-cluster-api` in the kubernetes-sigs org?
        * alternative: `cluster-api-provider-karpenter`
    * we want to target a generic cluster-api implementation
        * if any contracts are needed from providers (eg infra machine capacity), we will put that material into the CAEP and socialize it through the community. if/when needed, providers will be responsible for operating with karpenter.
    * [jack] let's bring this up at monday's sig autoscaling meeting

## 2023-12-13 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=KQrW-wp3WWw)

### Attendees

* elmiko - Red Hat
* Marco Braga - Red Hat
* jonathan-innis - AWS
* njtran - AWS
* Mike Tougeron, Nathan Romriell, Jonathan Raymond, Andrew Rafferty - Adobe

### Agenda

* [elmiko] MachineDeployments, what do folks think about the idea of different modes for them?
    * [jack] might expect a `mode` property at the top level, different modes could then signal heterogeneous or homogeneous. is there anything about the machine template that says it has to be homogeneous, or represent a single type? (a purely hypothetical sketch follows these notes)
        * would we need a new infrastructure type?
        * karpenter machine template?
        * would this make duplicated fields or objects?
        * could something create these automatically?
        * how to reconcile this with a single karpenter provider for capi?
    * [jack] what is the "lift" for creating a pan-infra provider that plugs into karpenter in the same manner as the other providers?
        * do we even need the MachineDeployment?
        * a generic capi provider will be different than the specific providers
    * [jack] not convinced that there is an architectural advantage either way
    * is there some way to bring the concepts of NodeClaim and Machine together?
        * karpenter replaces the capi controller manager altogether, assuming we don't need a MachineDeployment
    * [mike] from my perspective, interacting with the Machine is a large part of the value of this integration
* maybe we should meet weekly in the new year?
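A purely hypothetical sketch of the ideas above: a top-level `mode` switch on a MachineDeployment and a karpenter-specific machine template. Neither the field nor the kind exists in cluster-api today; this only makes the brainstorm concrete for discussion.

```yaml
# Purely illustrative: neither the `mode` field nor a karpenter-specific
# machine template exists in cluster-api; this sketches the idea raised above.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: karpenter-managed
spec:
  clusterName: my-cluster
  # hypothetical top-level switch: "homogeneous" (today's behavior) vs
  # "heterogeneous" (machines of mixed types chosen by karpenter)
  mode: heterogeneous
  selector:
    matchLabels: null
  template:
    spec:
      clusterName: my-cluster
      version: v1.29.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: karpenter-bootstrap
      infrastructureRef:
        # hypothetical "karpenter machine template" that would let a single
        # template represent more than one instance type
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
        kind: KarpenterMachineTemplate
        name: karpenter-managed-template
```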
## 2023-11-29 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=sfaEwOWtpbU)

### Attendees

* elmiko - Red Hat
* Mike Tougeron, Jeremy Lieb - Adobe
* Cameron McAvoy - Indeed

### Agenda

* [elmiko] should we cancel the meeting on 27 December?
    * we will cancel
* [elmiko] NodeClass, still learning about this but wondering if it would be appropriate to wrap an InfrastructureMachine in this type of object?
    * the cloud provider gives some nuts and bolts about launching
    * NodeClaim is the "i want a machine"
    * [elmiko] does NodeClass get applied to k8s?
        * core karp generates NodeClaims based on its matching
        * NodeClass contains provider specific details
    * karpenter wants to own the provisioning lifecycle
        * maybe possible to just have karp create Machine objects on its own, without a parent object
        * would the community be ok with orphan machines?
* [jonathan] where do we want to integrate?
    * karp doing the provisioning or capi doing the provisioning?
    * is it inefficient to have capi doing the provisioning?
    * are we creating extra CRs by having both together?
* [jack] what about scenarios where karpenter is running on a previously created CAPI cluster?
    * [cameron] this can be done today with eks out of the box, additional work required for kubeadm clusters
* what is our goal?
    * we want to integrate with karpenter in a manner that allows cluster api users to continue using cluster api interfaces and CRs. basically so that cluster api users can leverage their experience and also gain access to karpenter's benefits. (if possible)

## 2023-11-15 @ 19:00 (UTC)

[**recording**](https://www.youtube.com/watch?v=LsgGrrfMxY0)

### Attendees

* elmiko - Red Hat
* Cameron McAvoy - Indeed
* Mikhail Fedosin - New Relic
* Chris Negus - AWS

### Agenda

* [elmiko] is it feasible to build a pure CAPI provider for karpenter?
    * what issues are we bound to face?
* [ellis] what about making NodeClaims something that CAPI could understand, perhaps a different way to look at the layer cake approach (eg capi on the bottom).
* [pau] sharing some experiences from giant swarm

## 2023-09-14 @ 9:00 (GMT-4)

[**recording**](https://www.youtube.com/watch?v=t1Uo18v8g48)

### Attendees

* elmiko - Red Hat
* Mike Tougeron, dan mcweeney - Adobe
* Lorenzo Soligo, Andreas Sommer, Pau Rosello - Giant Swarm
* Jack Francis - Microsoft

### Agenda

* intro, what are we doing here?
    * what are people looking to get from cluster api and karpenter integration?
* is there a cost savings that can come from using karpenter?
    * would be interesting to see some A/B testing results here with cluster autoscaler and karpenter
    * seems there is enough evidence to warrant pushing forward with doing _something_ with karpenter and capi
* Scenario 1: replace cluster-autoscaler w/ karpenter on my existing capi cluster
    * easier to manage multiple instance types
    * some folks are experimenting with karpenter and capa
        * how can provisioners be created from the launch templates?
            * karpenter folks might be deprecating this
        * want to have parity between instances in the cloud and Machine CRs
            * might be some scenarios where users don't care about seeing the CAPI resources
            * this is non-ideal, but workable in some cases
    * "bring your own nodes" approach, how do nodes join the cluster?
    * spot instances are more usable in karpenter than ASGs, EKS-specific
    * how does karpenter handle node drift? e.g. when updating the AMI on a running cluster
        * a capi mgmt cluster central approach would be nice here
* would the community want a cluster api provider for karpenter?
    * how would this be done? perhaps we would need some sort of provider-provider for karpenter, e.g. the capi api informs karpenter about which provider is deployed.
    * karpenter running on the workload cluster is able to use a cloud-specific provider on that cluster, and then makes Machine objects as it creates instances
        * this approach might require more involvement from cloud-specific contributors to ensure that problems which arise on that cloud could have the best attention.
    * capi machinepool may be another approach to fit into the karpenter logic
        * machinepools are different between providers, this may require a lot of work on the capi side
    * karpenter hybrid-cloud topology might be difficult to make work, there are many other questions beyond provisioning here.
* would the community want to use karpenter with capa in some sort of managed mode?
    * this might look a little like the karpenter in workload option from above
* do we have enough interest for a feature group and followup meetings?
    * pau +1 to continuing this effort
    * dan +1
    * jack +1
* [jack] I would like to see us come up with a CAEP that describes how a karpenter provider would look, and then choose one cloud (probably AWS) to implement first in this “general karpenter provider” implementation (assuming that that’s what the CAEP describes)
* [dan] let's sync with the karpenter folks as well
    * elmiko +1
    * [jack] on the wg with sig autoscaling for karpenter inclusion, happy to share knowledge
        * tl;dr karpenter is in the process of donating itself to the cncf, many things to work through
        * https://github.com/kubernetes/org/issues/4258
        * https://docs.google.com/document/d/1_KCCr5CzxmurFX_6TGLav6iMwCxPOj0P/edit
        * [karpenter working group calendar item](https://calendar.google.com/calendar/u/0?cid=N3FmZGVvZjVoZWJkZjZpMnJrMmplZzVqYmtAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)