# Hack the Garden 2025-11 — Topics 📜
See also:
https://github.com/gardener-community/hackathon
## Participants (27) 🤓⌨️
* Stefan M.
* Gerrit S.
* Ebubekir A.
* Tim E.
* Maximilian G.
* Aniruddha B.
* Lukas H.
* Marcel B.
* Viet D. M.
* Matthias H.
* Benedikt H.
* Tobias G.
* Marc V.
* Niklas K.
* Johannes S.
* Tobias S.
* Axel S.
* Shafeeque E S
* Sonu K. S.
* Konstantinos A.
* Rickards J.
* Oliver G.
* Vedran L.
* Luca B.
* Daniel G. N.
* Andreas F.
* Rafael F.
## Proposals 💡
## Core ⚙️
### Allow Shoot Migration on metal-stack.io
**Authors:** Stefan Majer / Gerrit Schwerthelm
Currently, metal-stack.io clusters need manual intervention after a shoot has been migrated to another seed. The root cause is that the firewall, which is part of the infrastructure, does not establish the connection to the new API server. A fix could be to register the firewall with a bootstrap token, similar to how the node-agent is registered.
### Use Autonomous Shoot Cluster for Single-Node End-to-End Tests
**Author:** Johannes Scheerer
It should be possible by now to create autonomous shoot clusters using `gardenadm`. To eat our own dog food we could run our end-to-end tests in a single-node autonomous shoot cluster. This could increase confidence in autonomous shoot clusters and put them to a real test.
### `gardenlet` Meltdown Protection for `ManagedSeed`s
**Author:** Sonu Kumar Singh
Currently, when a `gardenlet` is updated in the seed (not managed seed) cluster, the `gardenlet`s of all `ManagedSeed` clusters are updated simultaneously. If there is an issue with the new `gardenlet` version, there is no rollback mechanism: all seeds hosted by that soil/seed cluster enter a bad state, which in turn causes all shoots hosted on these `ManagedSeed`s to fail.
**Proposed Solution:**
* Stop rolling out a new `gardenlet` version to all `ManagedSeed`s at once.
* Utilize `ManagedSeedSet` to achieve this.
* Add a way to adopt existing seeds (maybe).
* Ensure rolling updates on new seeds only start once the current seed is confirmed healthy.
* Extend the mechanism to also manage extension versions (in addition to the `gardenlet` version) to prevent breaking updates caused by broken extension versions.
**Comment from @rfranzke:** We could also think of introducing `seedmanagement.gardener.cloud/v1alpha1.Gardenlet` resources for `ManagedSeed`s in order to harmonize the update behaviour (of unmanaged and managed seeds).
### Gardener API Types as Standalone Go Module [gardener/gardener#2871](https://github.com/gardener/gardener/issues/2871)
**Author:** Tim Ebert
We could introduce a dedicated Go module for `pkg/apis` in gardener/gardener.
The dependencies of this package should be very limited, e.g., only `k8s.io/{api,apimachinery,utils}`, etc. (should be enforced using a `.import-restrictions` file).
The API module should be released together with the main module (using the proper Go submodule tag, see [gardener/cc-utils#1382](https://github.com/gardener/cc-utils/pull/1382)).
We can use the Go Workspaces feature in gardener/gardener to conveniently develop the main and the API module together.
For API-only consumers of the gardener/gardener repository, this ensures a minimal set of dependencies when importing the API types.
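To illustrate the development setup, a minimal `go.work` sketch (the module path for the API module is an assumption, not the final layout):

```
go 1.24

use (
	.          // main module: github.com/gardener/gardener
	./pkg/apis // new API module (hypothetical path)
)
```

With this workspace, `go build`/`go test` in the repository root resolves the API module from the local directory instead of a released version.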
### Gardener Scale-Out Tests
**Author:** Tim Ebert
We don't have a good estimate of how many seeds and shoots a Gardener environment can support, and we don't know which scalability limitations we might face in the future. Also, there is no way to prevent regressions in Gardener's scalability.
We could implement "hollow" gardenlets similar to kubemark’s hollow nodes and run many of them to generate load on the Garden cluster. This could be a good basis for running automatic performance/scalability tests.
### Skip Validation of Resource References during `--dry-run=server`
**Author:** Marcel Boehm
As already discussed in issues [#12582](https://github.com/gardener/gardener/issues/12582#issuecomment-3311966606) and [#12950](https://github.com/gardener/gardener/issues/12950), the strict validation of referenced resources makes it impossible for tools like `flux` to create a Shoot together with a referenced resource (e.g., an audit ConfigMap) that does not yet exist, because flux always performs a server-side dry run first. I propose consistently making these validations optional on dry runs and only emitting a warning if the referenced resources do not exist.
---
## Core – Backup & Restore 🛟
### Allow Relocating Backup Buckets for Shoot Clusters
**Authors:** Stefan Majer / Gerrit Schwerthelm
When you are required to move backups into another project at a cloud provider (like GCP), you currently have to create new seed clusters and migrate all shoots to them. As the functionality for copying backup buckets during shoot migration is already in place, we were wondering whether it would be possible to simply alter the backup bucket as well in order to relocate it.
### Add a force-restore operation annotation for Shoots
**Author:** Matthias Hoffmann
https://github.com/gardener/gardener/issues/12952
This would facilitate recovery from a disaster using the available backups.
---
## Extensions 🧩
### Generic Extension for User Workloads
**Author:** Sonu Kumar Singh
Currently, if a user has, for example, 100 clusters in a project and wants to deploy some basic workloads in some of them, they must manually target and deploy resources in each individual cluster.
**Proposed Solution:**
* Allow namespaced `ControllerDeployment`s that include a field holding user-provided charts.
* Introduce a generic extension running in the seed cluster. This extension would check whether a namespaced `ControllerDeployment` is used by one of the shoots in that project. If so, it would create a `ManagedResource` of class `shoot` in the shoot's namespace in the seed.
* The GRM in the shoot would then ensure that the user-specific charts are deployed into the target shoots.
### Support updating underlying infrastructure resources during in-place node updates in MCM ([gardener/mcm#1023](https://github.com/gardener/machine-controller-manager/issues/1023))
**Author:** Andreas Fritzler
Support for updating the underlying infrastructure resources (e.g., OS image) during in-place node updates in Gardener Machine Controller Manager (MCM). This includes extending the MCM provider driver interface with an UpdateMachine method, enabling providers like `ironcore-metal` to handle infrastructure-level updates without full node recreation.
### Augment extension library with support for controllers watching resources in Shoots
**Author:** Rafael Franzke
If we established a mechanism that allows dynamically watching resources in `Shoot`s, we could move some things in-tree and reduce maintenance effort (e.g., `aws-custom-route-controller` could be moved into `gardener-extension-provider-aws`).
### Rework extension `ControlPlane` controller
**Author:** Rafael Franzke
We could rework the `ControlPlane` extensions controller and move generic things (like the CSI deployments) into `gardener/gardener` such that this is not duplicated in all provider extensions.
Furthermore, we could move to `ManagedResource`s and get rid of the Helm charts.
### GEP-28: Restore broken self-hosted cluster
**Author:** Rafael Franzke
We could evaluate what it takes to restore a broken self-hosted (f.k.a. "autonomous") shoot cluster.
### Evaluate Talos as node operating system
**Author:** Johannes Scheerer
Talos takes a different approach to what a Kubernetes node operating system provides. It might be interesting to evaluate whether we could support Talos as an operating system in Gardener Shoot clusters.
---
## CI/CD 🏗️
### Add SBOMs to all created artefacts
**Author:** Stefan Majer
In order to get a complete view of possible CVEs, artifacts should contain their list of dependencies in SBOM (SPDX) format.
At metal-stack.io we have already added this to all our artifacts, which can be used as a blueprint.
### Persist Logs of e2e Tests
**Author:** Tim Ebert
Gardener e2e tests export the logs of running pods/machines before exiting – both on success and failure – so that they can be viewed/downloaded in the artifacts browser (gcsweb).
However, logs of terminated pods/machines will not be exported if they are not running at the end of the test execution. I.e., we don't collect logs of pods/machines of successful test cases, because the shoots will be deleted as part of the test execution.
Debugging e2e test failures based on this information is very tedious. The ability to search e2e test logs or to compare the logs of successful and failed tests would improve this experience.
For this, we could add a logging stack to the prow cluster (similar to the performance prometheus) where e2e test logs are stored and which can be queried in the cluster's Plutono instance.
### Go Build Cache in Prow
**Author:** Tim Ebert
Our Prow jobs spend a significant part of the execution time on building Go binaries/tests/tools. We could significantly reduce build times by keeping/reusing the Go build cache and thereby get faster CI feedback on PRs.
[GOCACHEPROG](https://pkg.go.dev/cmd/go/internal/cacheprog) allows storing the build cache externally, e.g., in S3 using https://github.com/tailscale/go-cache-plugin.
We should take care of preventing cache poisoning, though.
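As a rough sketch of the mechanism (the plugin binary path below is a placeholder, not the actual tailscale/go-cache-plugin invocation):

```shell
# GOCACHEPROG is supported natively since Go 1.24 (earlier versions require
# GOEXPERIMENT=cacheprog). It names a child process that serves the Go build
# cache over a JSON protocol on stdin/stdout, e.g. backed by S3.
export GOCACHEPROG=/usr/local/bin/s3-cache-plugin   # placeholder path
go build ./...
```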
### [GEP-28] Expose API server of Autonomous Shoots ([gardener/gardener#2906](https://github.com/gardener/gardener/issues/2906))
**Author:** Tim Ebert
The API server of an autonomous shoot cluster with managed infrastructure (medium-touch scenario) needs to be exposed for external access. Ideas for this include creating a LoadBalancer for the `default/kubernetes` service so that we can reuse the cloud-controller-manager for this. As soon as cloud-controller-manager publishes the LoadBalancer's IP, the `DNSRecord` can be updated to point to the LoadBalancer instead of the machine's internal IP.
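A minimal sketch of the LoadBalancer idea, assuming the cloud-controller-manager in the autonomous cluster reconciles `LoadBalancer` Services (switching the type of `default/kubernetes` like this is part of the proposal, not current behaviour):

```yaml
# Hypothetical: give the default/kubernetes Service a LoadBalancer so that
# cloud-controller-manager provisions an externally reachable endpoint.
apiVersion: v1
kind: Service
metadata:
  name: kubernetes
  namespace: default
spec:
  type: LoadBalancer  # instead of the default ClusterIP
  ports:
  - name: https
    port: 443
    targetPort: 6443  # kube-apiserver secure port on the machine
```

Once the LoadBalancer IP is published in `.status.loadBalancer.ingress`, the `DNSRecord` could be pointed at it instead of the machine's internal IP.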
---
## Registry Cache 🪞
### Harmonize Registry Mirror Extension in gardener-extension-registry-cache with harbor registry cache
**Author:** Benedikt Haug
Currently, the mirroring function doesn't allow credentials to be used when configuring mirrors. The idea is to add such functionality so that an internal Harbor can be enforced as the cache. Corresponding issue: https://github.com/gardener/gardener-extension-registry-cache/issues/462
Additional features that would be relevant:
- Add support for the `server` field and the `override_path` option, and allow URL paths to be part of the `server` and `host` fields of the `mirrorConfig`, to (a) support non-conformant registries (like the widely used Harbor registry) and (b) be able to control whether a fallback to upstream is allowed
- Extend the Gardener `OperatingSystemConfig` to support custom headers in the containerd registry host config
Work already started here: https://github.com/networkhell/gardener-extension-registry-cache/tree/feature_mirror_server_and_options
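For illustration, a containerd `hosts.toml` sketch combining the requested `server` field, `override_path` option, and custom headers (hostnames and credentials are placeholders):

```toml
# /etc/containerd/certs.d/registry.example.com/hosts.toml (illustrative)
# "server" controls the upstream fallback target; pointing it at the internal
# Harbor effectively prevents direct fallback to the public registry.
server = "https://harbor.internal.example.com/v2/proxy-cache"

[host."https://harbor.internal.example.com/v2/proxy-cache"]
  capabilities = ["pull", "resolve"]
  override_path = true  # registry path does not follow the standard /v2/<name> layout
  [host."https://harbor.internal.example.com/v2/proxy-cache".header]
    Authorization = ["Basic <base64-credentials>"]
```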
### Allow configuring registry-mirror for Helm OCI charts pulled by gardenlet
**Author:** Marcel Boehm
When the `gardenlet` pulls a Helm Chart from an OCI Repository, there is no option to configure any mirrors, like we can do for containerd running on all nodes. We would like to extend the `GardenletConfig` with options similar to the `RegistryConfig` options in the OSC.
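A purely hypothetical sketch of such an extension (the mirror field below does not exist in today's `GardenletConfiguration`; it is modeled after the `RegistryConfig` options of the OSC):

```yaml
# Hypothetical GardenletConfiguration extension - none of the mirror fields
# below exist yet; they mirror the OSC's RegistryConfig options.
apiVersion: gardenlet.config.gardener.cloud/v1alpha1
kind: GardenletConfiguration
helmOCIRegistryMirrors:   # assumed new field
- upstream: europe-docker.pkg.dev
  hosts:
  - host: https://mirror.internal.example.com
    capabilities: ["pull", "resolve"]
```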
### Gardener Node Agent should be pullable from a registry mirror
**Author:** Lukas Hoehl
Currently it is not possible to pull the gardener-node-agent through a registry mirror configured by the OSC, since the gardener-node-agent itself is responsible for configuring containerd with the mirrors.
We work around this by adding the registry mirror as systemd files into the user data via a webhook. I would, however, like to have this inside the registry-cache extension itself.
---
## Networking 🔌
### Implement Firewall Distance and HA for metal-stack.io
**Authors:** Stefan Majer / Gerrit Schwerthelm
In order to get highly available firewalls in metal-stack.io, we would like to add dead detection for firewalls and the ability to deploy two or more firewalls in front of a cluster, using path prolongation to allow traffic to flow through in an HA manner.
### Evaluation of NFT mode of `kube-proxy` (Fast Track 🏎️)
**Author:** Johannes Scheerer
The new NFT mode finally seems to be the real successor of the `iptables` mode. Therefore, it may make sense to evaluate it and the interaction with related components, e.g. CNIs.
### Add Support for Calico Whisker
**Author:** Johannes Scheerer
Calico added Whisker in the 3.30 open source release. It has some capabilities similar to what you can do with Hubble in a Cilium cluster, i.e. you can monitor/trace ongoing traffic in the cluster. Currently, Whisker is only directly supported in a setup managed by `tigera-operator`. However, it is quite possible to run it in a Gardener-managed cluster. As Whisker seems to require mTLS, Calico needs to be deployed slightly differently from how the Calico extension currently manages it, though.
### Pod Overlay to Native Routing without Downtime
**Author:** Johannes Scheerer
Currently, Gardener supports both pod overlay networking and native routing. It is possible to switch between both modes via `.spec.networking.providerConfig.overlay.enabled`. However, the current implementation incurs a networking downtime while the cluster is reconfigured, i.e. while the daemon set is rolled out. Some productive clusters cannot tolerate such a downtime. Therefore, it would be helpful if the switch could be implemented in a seamless manner, i.e. old nodes use whatever they used before and new nodes use the new mode, but both groups can also communicate with each other.
### (D)DoS Protection for kube-apiservers
**Author:** Oliver Götz
Support counter-(D)DoS measures, such as rate limiting, for the kube-apiserver endpoints of the Garden cluster and Shoots.
### Cluster Mesh for cilium extension
**Author:** Lukas Hoehl
Allow connecting shoots to other Kubernetes clusters running Cilium via Cluster Mesh.
https://docs.cilium.io/en/stable/network/clustermesh/clustermesh/
---
## Networking – Istio ⛵️
### Reduce number of Istio Ingress Gateways
**Author:** Johannes Scheerer
In a standard multi-zonal seed cluster, there is one multi-zonal istio ingress gateway plus one per availability zone. The multi-zonal istio ingress gateway could be replaced by using all single-zone istio ingress gateways together. This could lead to lower resource usage, reduced costs, and a less complicated setup.
### Always use the same istio-gateway for shoot kube-apiserver endpoint and observability components ([gardener/gardener#11860](https://github.com/gardener/gardener/issues/11860))
**Author:** Oliver Götz
There are kube-apiserver endpoints (internal/external/wildcard) and observability endpoints for each shoot. Depending on the shoot and seed configuration there might be different istio-gateways used.
If exposure classes are used, this could lead to a situation where the endpoints are exposed to different networks.
If a zonal shoot is scheduled on a regional seed, the impact might "only" be the cost of cross-zonal traffic.
### Replace Istio native resources with Gateway API
**Author:** Lukas Hoehl
Istio has supported the Gateway API for some time: https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/
We should evaluate how mature the implementation actually is, so that we could replace some of the native Istio resources in g/g, like Gateway, VirtualService, and DestinationRule.
While we cannot replace everything (probably most EnvoyFilters are not replaceable), we should try to adopt the Gateway API to drive its maturity.
---
## Networking – IPv6 🧬
### IPv6 or Dual-Stack Support for another Infrastructure
**Author:** Johannes Scheerer
Gardener currently supports IPv6/Dual-Stack on AWS and GCP. During the second to last hackathon a proof-of-concept for IronCore was created. Other infrastructures, e.g. metal-stack, OpenStack or Azure, also support IPv6 and could be enabled for Dual-Stack.
### Dual-Stack Seed API
**Author:** Johannes Scheerer
As one of the last steps missing for full Gardener Dual-Stack support, the `Seed` API needs to be extended.
---
## LLMs 🤖
### LLM-based Agents
**Author:** Vedran Lerenc
We have been using LLMs for 2.5 years for various simple tasks (coding, operations, and other side tasks) and would like to discuss whether you do too and, if so, collaborate and possibly build new agents together. We have a small "platform" that sits on top of LiteLLM, which in turn sits on top of models deployed on Azure OpenAI, AWS Bedrock, and GCP Vertex AI, so we can prototype ideas immediately, and we would be happy to do so together.
**Proposed Approach:**
* Discuss your LLM applications (only coding or more?)
* Discuss pain points where LLMs could help
* Discuss areas of interest where LLMs would improve Gardener (e.g. Dashboard, `gardenctl`, operations, etc.)
* Prototype together
### One commit message
**Author:** Niklas Klocke
Having consistent and insightful commit messages is a major benefit.
I propose creating a small AI-based tool to generate consistent commit messages. After piloting it in one or two projects, we could roll it out to the whole Gardener project and finally speak with one voice.
### Bring the Gardener Answering Machine to the Gardener Documentation
**Author:** Niklas Klocke
Let's introduce the Answering Machine as a self-service offering for our users directly within the documentation.
In addition, we could explore ways to trace the sources used by the Answering Machine to answer specific questions.
This would help us:
- Identify gaps in our documentation, and
- Potentially automate the creation of pull requests to address those gaps.
---
## Observability 🔭
### Resolve the Istio Metrics Leak
**Author:** Johannes Scheerer
Currently, istio metrics are disabled because metrics for no longer existing `kube-apiserver` instances are served until istio finally restarts. This leads to a huge increase in metrics size, which can lead to congestion, cost explosion and metrics retention reduction. We should figure out how to report only the relevant istio metrics.
### Enrich Shoot Logs with Istio Access Logs
**Author:** Johannes Scheerer
Istio ingress gateway is configured to log accesses. In conjunction with L7 load balancing this becomes very useful as it shows all requests passing through istio. However, the logs are currently only accessible to seed operators. It would be nice if the access logs could be moved to the corresponding shoot log. This would also help in cases where access control is restricted, e.g. with the ACL extension.
The topic also applies to other components in the seed, but the istio access logs could be taken as a first step.
---
## Enablement 📖
### Declarative GitHub Membership Administration ([gardener/org#2](https://github.com/gardener/org/issues/2))
**Author:** Tim Ebert
From [gardener/org#2](https://github.com/gardener/org/issues/2):
Adding individuals to different GitHub teams should be done automatically, based on a declarative approach.
The implementation can follow the approach used by Kubernetes, which utilizes the https://github.com/kubernetes/org repository along with Prow/[Peribolos](https://docs.prow.k8s.io/docs/components/cli-tools/peribolos/) (see https://github.com/gardener/documentation/pull/715#discussion_r2321015493).
### Ease Shoot API Server Connectivity from external clients
**Author:** Tobias Gabriel
A lot of external clients connect to the Shoot API Server, from local CLIs to automation like ArgoCD.
In the most basic setup, a service account is created and shared with the external party. With OIDC and the Gardener Discovery Service, a lot of improvements are already possible. However, this is not always easy to set up and use, and this is something I want to tackle.
Figuring out what is possible, properly documenting it and identifying what can be implemented (and implementing it).
E.g., some of the questions I want to investigate, document, and maybe even improve are:
- Publicly trusted certificates for shoot API server endpoints (is the seed-bound certificate reliably usable? What are the caveats there?)
- End-to-end integration of GitOps controllers running outside of the cluster (auth to the shoot and CA management)
- Maybe finally open source the Gardener-specific "set up kubecontext with ID token and download CA certificate" tooling
### The Illustrated Children’s Guide to Gardener
**Author:** Niklas Klocke
Gardener is deeply rooted in the Kubernetes ecosystem and tries to follow the proven path wherever it is reasonable.
But one part was always ignored! Addressing the most pressing question that parents working on Gardener face from their toddlers at home:
"What actually is Gardener?"
**The Illustrated Children’s Guide to Kubernetes** answers this question for Kubernetes. We should do the same for Gardener.
Btw: I also think this would make for amazing merch at the next KubeCon ;)
https://www.cncf.io/phippy/the-childrens-illustrated-guide-to-kubernetes/