# Hack the Garden 2025-11 — Topics 📜
See also:
https://github.com/gardener-community/hackathon
## Participants (27) 🤓⌨️
* Stefan M.
* Gerrit S.
* Ebubekir A.
* Tim E.
* Maximilian G.
* Aniruddha B.
* Lukas H.
* Marcel B.
* Viet D. M.
* Matthias H.
* Benedikt H.
* Tobias G.
* Marc V.
* Niklas K.
* Johannes S.
* Tobias S.
* Axel S.
* Shafeeque E S
* Sonu K. S.
* Konstantinos A.
* Rickards J.
* Oliver G.
* Vedran L.
* Luca B.
* Daniel G. N.
* Andreas F.
* Rafael F.
*Currently unassigned:* Gerrit S., Rafael F.
## Topics 🎬
* [**Evaluate Talos as node operating system**](#Evaluate-Talos-as-node-operating-system) (12 votes) :construction:
* **Konstantinos A.**, **Oliver G.**
* [**Gardener API Types as Standalone Go Module**](#Gardener-API-Types-as-Standalone-Go-Module-gardenergardener2871) (11 votes) ✅
* **Marcel B.**, *Luca B.*
* [**Use Self-Hosted Shoot Cluster for Single-Node End-to-End Tests**](#Use-Self-Hosted-Shoot-Cluster-for-Single-Node-End-to-End-Tests) (9 votes)
* *Tim E.*, *Marc V.*, *Ebubekir A.*, **rfranzke**
* [**Gardener Scale-Out Tests**](#Gardener-Scale-Out-Tests) (7 votes) 💥:construction:💥
* *Viet D. M.*, *Sonu K. S.*, **Tobias S.**
* [**Add a force-restore operation annotation for Shoots**](#Add-a-force-restore-operation-annotation-for-Shoots) (7 votes) :construction:
* *Maximilian G.*, **Matthias H.**
* [**Enrich Shoot Logs with Istio Access Logs**](#Enrich-Shoot-Logs-with-Istio-Access-Logs) (7 votes) ✅
* **Axel S.**, *Benedikt H.*, *Stefan M.*
* [**Allow Relocating Backup Buckets for Shoot Clusters**](#Allow-Relocating-Backup-Buckets-for-Shoot-Clusters) (7 votes) ✅
* **Gerrit S.**, *Rafael F.*
* [**Go Build Cache in Prow**](#Go-Build-Cache-in-Prow) (6 votes) ✅
* *Shafeeque E S*, **Tobias G.**
* [**Pod Overlay to Native Routing without Downtime**](#Pod-Overlay-to-Native-Routing-without-Downtime) (6 votes) ✅
* *Lukas H.*, *Johannes S.*
* [**Support updating underlying infrastructure resources during in-place node updates in MCM**](#Support-updating-underlying-infrastructure-resources-during-in-place-node-updates-in-MCM-gardenermcm1023) (5 votes) :construction:
* **Andreas F.**, *Aniruddha B.*, *Daniel G. N.*
* [**Bring the Gardener Answering Machine to the Gardener Documentation**](#Bring-the-Gardener-Answering-Machine-to-the-Gardener-Documentation) (4 votes) :construction:
* *Vedran L.*, *Niklas K.*
* [**Gardener Node Agent should be pullable from a registry mirror**](#Gardener-Node-Agent-should-be-pullable-from-a-registry-mirror) (6 votes) ✅
* *Maximilian G.*, *Matthias H.*, **Lukas H.**
* [**Add Support for Calico Whisker**](#Add-Support-for-Calico-Whisker) (5 votes) ✅
* **Lukas H.**, *Johannes S.*
* [**MCM sets `ToBeDeletedByClusterAutoscaler` Taint to respect terminating nodes in load balancing**](#MCM-sets-ToBeDeletedByClusterAutoscaler-Taint-to-respect-terminating-nodes-in-load-balancing) ✅
* *Maximilian G.*, **Konstantinos A.**
* [**Replace Ingress NGINX controller with Gateway API**](#Replace-Ingress-NGINX-controller-with-Gateway-API) (5 votes) :construction:
* *Lukas H.*, *Johannes S.*, **Gerrit S.**, *Ebubekir A.*, *Benedikt H.*, *Stefan M.*
* [**Use go tools in g/g & remove vgopath**](#Use-go-tools-in-gg-amp-remove-vgopath) ✅
* *Marcel B*, **Luca B.**
* [**`gardenlet` Meltdown Protection for `ManagedSeed`s**](#gardenlet-Meltdown-Protection-for-ManagedSeeds) (2 votes)
* Ebubekir A., Marc V. (?), Aniruddha B.
* [**Evaluation of NFT mode of `kube-proxy`**](#Evaluation-of-NFT-mode-of-kube-proxy) (1 vote)
* **Axel S.**, Benedikt H., Stefan M.
* [**[GEP-28] Expose API server of Self-Hosted Shoots**](#GEP-28-Expose-API-server-of-Self-Hosted-Shoots-gardenergardener2906) (9 votes) 🚧
* Rafael F., Tim E.
* gardenadm/flow: handle SIGINFO (^T) and print current task ✅
* Marcel B.
* https://github.com/gardener/gardener/pull/13565
* [provider-local: build a real "LoadBalancer" controller](#provider-local-build-a-real-LoadBalancer-controller) 🚧
* **Rafael F.**, Tim E., Stefan M.
## Pick Up Next 🪝
* [**Rework extension `ControlPlane` controller**](#Rework-extension-ControlPlane-controller) (6 votes)
* [**Persist Logs of e2e Tests**](#Persist-Logs-of-e2e-Tests) (5 votes)
* [**GEP-28: Restore broken self-hosted cluster**](#GEP-28-Restore-broken-self-hosted-cluster) (5 votes)
* [**IPv6 or Dual-Stack Support for another Infrastructure**](#IPv6-or-Dual-Stack-Support-for-another-Infrastructure) (5 votes)
* [**Reduce number of Istio Ingress Gateways**](#Reduce-number-of-Istio-Ingress-Gateways) (5 votes)
* [**Ease Shoot API Server Connectivity from external clients**](#Ease-Shoot-API-Server-Connectivity-from-external-clients) (5 votes)
* [**Add SBOMs to all created artefacts**](#Add-SBOMs-to-all-created-artefacts) (5 votes)
* [**Cluster Mesh for cilium extension**](#Cluster-Mesh-for-cilium-extension) (4 votes)
* [**Allow configuring registry-mirror for Helm OCI charts pulled by gardenlet**](#Allow-configuring-registry-mirror-for-Helm-OCI-charts-pulled-by-gardenlet) (4 votes)
* [**Dual-Stack Seed API**](#Dual-Stack-Seed-API) (4 votes)
* [**Resolve the Istio Metrics Leak**](#Resolve-the-Istio-Metrics-Leak) (4 votes)
* [**Harmonize Registry Mirror Extension in gardener-extension-registry-cache with harbor registry cache**](#Harmonize-Registry-Mirror-Extension-in-gardener-extension-registry-cache-with-harbor-registry-cache) (3 votes)
* [**Always use the same istio-gateway for shoot kube-apiserver endpoint and observability components**](#Always-use-the-same-istio-gateway-for-shoot-kube-apiserver-endpoint-and-observability-components-gardenergardener11860) (2 votes)
* [**Implement Firewall Distance and HA for metal-stack.io**](#Implement-Firewall-Distance-and-HA-for-metal-stackio) (2 votes)
* [**(D)DOS protection for kube-apiservers**](#DDOS-protection-for-kube-apiservers) (2 votes)
* [**LLM-based Agents**](#LLM-based-Agents) (2 votes)
* [**Generic Extension for User Workloads**](#Generic-Extension-for-User-Workloads) (1 vote)
* [**Allow Shoot Migration on metal-stack.io**](#Allow-Shoot-Migration-on-metal-stackio) (0 votes)
* [**One commit message**](#One-commit-message) (0 votes)
## Fast- & Side-Track 🏎️
* [**The Illustrated Children’s Guide to Gardener**](#The-Illustrated-Children’s-Guide-to-Gardener) (10 votes)
* [**Declarative GitHub Membership Administration**](#Declarative-GitHub-Membership-Administration-gardenerorg2) (4 votes)
* [**Skip Validation of Resource References during `--dry-run=server`**](#Skip-Validation-of-Resource-References-during---dry-runserver) (3 votes)
* tobschli: fix annotation key ordering in static pod translator
* [**Use go tools in g/g & remove vgopath**](#Use-go-tools-in-gg-amp-remove-vgopath)
* Add TTL config to local registry and caches
## Proposals 💡
## Core ⚙️
### Allow Shoot Migration on metal-stack.io
**Authors:** Stefan Majer / Gerrit Schwerthelm
Currently, metal-stack.io clusters need manual intervention after a shoot has been migrated to another seed. The root cause is that the firewall, which is part of the infrastructure, does not establish a connection to the new API server. A fix could be to register the firewall with a bootstrap token, similar to how the node-agent registers itself.
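A minimal sketch of what such a registration could build on, assuming the firewall authenticates with a standard Kubernetes bootstrap token just like `gardener-node-agent` does (all values are illustrative):
```yaml
# Illustrative only: a standard Kubernetes bootstrap token secret that a firewall
# controller could consume to authenticate against the new API server endpoint.
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-07401b            # name must be "bootstrap-token-<token-id>"
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: "07401b"                      # example value
  token-secret: "f395accd246ae52d"        # example value
  usage-bootstrap-authentication: "true"  # allow the token for API authentication
  usage-bootstrap-signing: "true"
```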
### Use Self-Hosted Shoot Cluster for Single-Node End-to-End Tests
**Author:** Johannes Scheerer
It should be possible by now to create self-hosted shoot clusters using `gardenadm`. To eat our own dog food we could run our end-to-end tests in a single-node self-hosted shoot cluster. This could increase confidence in self-hosted shoot clusters and put them to a real test.
Tracks:
- [x] run `gardenadm init` in docker container (manual/script): https://github.com/timebertt/gardener/tree/gind
- [x] make druid-managed etcd optional: continue with bootstrap etcd if no backup is configured in `Shoot` (https://github.com/gardener/gardener/pull/13542)
- [x] NetworkPolicies for coredns
- [x] ensure multiple controllers don't conflict (resource-manager, etcd-druid, vpa, etc.)
- [x] keep resource-manager and extensions in host network
- [x] fix registry hostnames
- [x] always run registries as container via docker compose and expose on dedicated hostnames per registry: https://github.com/gardener/gardener/pull/13551
- [ ] Extension management (don't deploy extensions twice if the self-hosted shoot is a seed; see the label sketch after this list)
- `gardener-controller-manager` checks if seed is a self-hosted shoot
- When the seed gardenlet registers the `Seed` object, it can label it with `seed.gardener.cloud/self-hosted-shoot-cluster=true` (it must detect if it runs in a self-hosted shoot by checking if `kube-system` is labeled with `gardener.cloud/role=shoot`).
- GAPI must deny removing this label
- If self-hosted shoot is NO seed, GCM just creates `ControllerInstallation`s with `.spec.shootRef`
- Otherwise, GCM still creates `ControllerInstallation`s with `.spec.shootRef` (we need to make sure that an extension still exists in the self-hosted shoot even if it is no longer a Seed at some point in time)
- Also, if the shoot later uses an extension which is already required by the seed, we want the shoot gardenlet to manage it.
- We have to make sure that the seed gardenlet fails in case it is deployed in a self-hosted shoot which is not yet connected (otherwise, we might confuse GCM according to the logic above)
- [x] Cleanup `provider-local` `Service` controller in traditional `kind`-based setup: https://github.com/gardener/gardener/pull/13549
- [x] fix DNS records in kind coredns `Corefile` (see kind-up.sh) to use hard-coded IPs
- Future work / lookout
- [ ] adapt e2e tests to create self-hosted cluster and run e2e tests in it
- [ ] implement gind (**G**ardener/gardenadm in Docker) – basically kind, but using gardenadm instead of kubeadm
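A small sketch of the labels involved in the extension-management item above (the `kube-system` label is what gardenlet would check to detect a self-hosted shoot; the `Seed` label is the one proposed here); the resource name is made up:
```yaml
# Detection: the kube-system namespace of a shoot cluster carries this label.
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    gardener.cloud/role: shoot
---
# Proposed: the gardenlet labels the Seed it registers accordingly;
# GAPI would deny removing this label again.
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-self-hosted-seed   # made-up name; spec omitted
  labels:
    seed.gardener.cloud/self-hosted-shoot-cluster: "true"
```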
Considerations:
- We might need to run multiple `gardener-resource-manager`s, `etcd-druid`s, etc. in a self-hosted shoot which is the garden runtime or a seed cluster at the same time.
- When introducing new features in `gardener-resource-manager`, `gardener-operator` needs to update `gardener-resource-manager` in `kube-system` first, before using it during the `Garden` reconciliation.
- Hence, `gardener-operator` could update the `gardener-resource-manager` in `kube-system` (take over management).
- While new features in controllers are always introduced in a backward-compatible manner (e.g., new fields in CRDs are optional), it's not so obvious to follow the same rules for webhook handlers. I.e., updating the central/shared `gardener-resource-manager` in `kube-system` might interfere with shoot system components, even though the shoot is not yet reconciled with the new gardenlet version.
- The safest option is to split responsibilities and deploy multiple `gardener-resource-manager`s, `etcd-druid`s
- Approaches for splitting responsibilities of components:
- While webhook configurations can simply exclude the `kube-system` namespace using a selector, such a mechanism would need to be implemented in **all** controllers in `gardener-resource-manager`, `etcd-druid` and even in third-party components like `vpa-*`, `{prometheus,fluent}-operator`, etc.
Pull requests:
- [ ] [Introduce `--use-bootstrap-etcd` flag for `gardenadm init` (gardener/gardener#13542)](https://github.com/gardener/gardener/pull/13542)
- [ ] always run registries as container via docker compose and expose on dedicated hostnames per registry: https://github.com/gardener/gardener/pull/13551
- [ ] Cleanup `provider-local` `Service` controller in traditional `kind`-based setup: https://github.com/gardener/gardener/pull/13549
Next steps:
- [ ] rfranzke: PR for moving gardener-resource-manager and etcd-druid of self-hosted shoot to garden namespace
### provider-local: build a real "LoadBalancer" controller
WIP branch: https://github.com/timebertt/cloud-provider-kind/tree/allocate-ips
Demo steps:
```bash
cd cloud-provider-kind
kind create cluster
for i in $(seq 64 67) ; do ip a add 172.18.255.$i dev lo0; done
docker build -t cloud-provider-kind .
docker run --rm --network kind -v /var/run/docker.sock:/var/run/docker.sock cloud-provider-kind --enable-lb-port-mapping --load-balancer-ip-range 172.18.255.64/26
# in a new terminal
watch docker ps
# in a new terminal
watch kubectl get deploy,svc
# in a new terminal
k create deployment --image nginx nginx
k expose deployment nginx --type LoadBalancer --port 80 --name nginx
k expose deployment nginx --type LoadBalancer --port 80 --name nginx2
# ...
k get svc
curl 172.18.255.64
curl 172.18.255.65
# ...
```
- [ ] correctly handle IPv6
- [ ] implement LoadBalancer controller for shoots (to make Managed Seed tests succeed)
- [ ] deploy in local setup
- [ ] drop provider-local service controller
### `gardenlet` Meltdown Protection for `ManagedSeed`s
**Author:** Sonu Kumar Singh
Currently, when a `gardenlet` is updated in the seed (not managed seed) cluster, all `ManagedSeed` cluster `gardenlet`s are updated simultaneously. If there is an issue with the `gardenlet` version, there is no rollback mechanism; all seeds hosted by that soil/seed cluster enter a bad state, which in turn causes all shoots hosted by these `ManagedSeed`s to fail.
**Proposed Solution:**
* Stop rolling out a new `gardenlet` version to all `ManagedSeed`s at once.
* Utilize `ManagedSeedSet` to achieve this (see the sketch at the end of this section).
* Add a way to adopt existing seeds (maybe).
* Ensure rolling updates on new seeds only start once the current seed is confirmed healthy.
* Extend the mechanism to also manage extension versions (in addition to the `gardenlet` version) to prevent breaking updates caused by broken extension versions.
**Comment from @rfranzke:** We could also think of introducing `seedmanagement.gardener.cloud/v1alpha1.Gardenlet` resources for `ManagedSeed`s in order to harmonize the update behaviour (of unmanaged and managed seeds).
Tracks:
* [ ] Update the managedseedset to support updates
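A rough sketch of how a `ManagedSeedSet` with a partitioned rolling update could gate such a rollout (field names follow the existing `seedmanagement.gardener.cloud/v1alpha1` API to the best of our knowledge; the templates are elided):
```yaml
apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeedSet
metadata:
  name: managed-seeds
  namespace: garden
spec:
  replicas: 5
  selector:
    matchLabels:
      set: managed-seeds
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 5     # lower the partition step by step, only after the previously updated seed is healthy
  template: {}         # ManagedSeed template (gardenlet config etc.), elided
  shootTemplate: {}    # Shoot template for the seeds, elided
```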
### Gardener API Types as Standalone Go Module [gardener/gardener#2871](https://github.com/gardener/gardener/issues/2871)
**Author:** Tim Ebert
We could introduce a dedicated Go module for `pkg/apis` in gardener/gardener.
The dependencies of this package should be very limited, e.g., only `k8s.io/{api,apimachinery,utils}`, etc. (should be enforced using a `.import-restrictions` file).
The API module should be released together with the main module (using the proper Go submodule tag, see [gardener/cc-utils#1382](https://github.com/gardener/cc-utils/pull/1382)).
We can use the Go Workspaces feature in gardener/gardener for developing both the main and the API module together conveniently.
For API-only consumers of the gardener/gardener repository, this ensures a minimal set of dependencies when importing the API types.
* [x] Opened PR: https://github.com/gardener/gardener/pull/13536
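A possible `.import-restrictions` file for the API module, assuming the import-boss format already used for such checks in gardener/gardener (regexes and prefixes are illustrative):
```yaml
# pkg/apis/.import-restrictions (sketch)
rules:
  - selectorRegexp: github[.]com/gardener/gardener
    allowedPrefixes:
      - github.com/gardener/gardener/pkg/apis   # only other API packages may be imported
  - selectorRegexp: k8s[.]io
    allowedPrefixes:
      - k8s.io/api
      - k8s.io/apimachinery
      - k8s.io/utils
```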
### Use go tools in g/g & remove vgopath
- [x] Opened PR: https://github.com/gardener/gardener/pull/13545
- [x] Opened PR: https://github.com/gardener/gardener/pull/13556
### Gardener Scale-Out Tests
**Author:** Tim Ebert
We don't have a good estimate of how many seeds and shoots a Gardener environment can support, nor do we know which scalability limitations we might face in the future. Also, there is no way to prevent regressions in Gardener's scalability.
We could implement "hollow" gardenlets similar to kubemark’s hollow nodes and run many of them to generate load on the Garden cluster. This could be a good basis for running automatic performance/scalability tests.
- Our Progress is documented in our [hackmd](https://hackmd.io/bZPazwFfRn-QVryA49dMWw) :100:
- [Hollow-gardenlet WIP implementation](https://github.com/acumino/gardener/tree/hollow-gardenlet)
- We transferred the concept of hollow nodes to seeds, meaning that we have a gardenlet that registers itself with the Garden and reports itself as ready
- Shoots that are scheduled to these Seeds will also be reported as healthy, but no real control planes will be spawned
- To simulate a more realistic scenario, we analyzed how many requests per second the gardenlets issue against the API server, and registered a runnable in the hollow gardenlet that simulates these requests based on the number of Shoots on the seed
- In the local operator setup (on an M4 Mac with 48 GiB of memory), we were able to schedule ~200 hollow gardenlets in the runtime cluster before we ran out of resources
- We found that creating too many Shoots and Seeds in a short amount of time leads to problems, e.g., lease requests timing out
- The settings of the Garden were just taken as-is, so tuning for this scenario would probably be possible
- A longer-running test (aiming for 50 Shoots per minute and 1 Seed every 3 minutes) was done in order to circumvent the problem
- Outcome: 800 Seeds and 21,600 Shoots over 10 hours; around 4 a.m. we ran into a meltdown
- We observed unequal request balancing on the kube-apiservers (no L7 loadbalancing was active)
*Next steps*:
- Try to run the tests in a more controlled environment
- Realistically, we will not continue on this after the hackathon, but this can be used as a starting point for the next round ;)

### Skip Validation of Resource References during `--dry-run=server`
**Author:** Marcel Boehm
As already discussed in the issues [#12582](https://github.com/gardener/gardener/issues/12582#issuecomment-3311966606) and [#12950](https://github.com/gardener/gardener/issues/12950), the strict validation of referenced resources makes it impossible for tools like `flux` to create a Shoot and, e.g., a referenced audit ConfigMap that is not yet present at the same time, because flux always performs a server-side dry run first. I would propose to consistently make these validations optional on dry runs and only emit a warning if the resources do not exist.
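For illustration, this is the kind of pair flux would apply in a single reconciliation; the Shoot references an audit policy `ConfigMap` created in the same apply, which the server-side dry run currently rejects (names are made up):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-policy            # made-up name
  namespace: garden-my-project
data:
  policy: |
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
    - level: Metadata
---
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-my-project
spec:                           # remaining spec omitted
  kubernetes:
    kubeAPIServer:
      auditConfig:
        auditPolicy:
          configMapRef:
            name: audit-policy  # reference that the dry run currently insists must already exist
```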
---
## Core – Backup & Restore 🛟
### Allow Relocating Backup Buckets for Shoot Clusters
**Authors:** Stefan Majer / Gerrit Schwerthelm
When you are required to move backups into another project at a cloud provider (like GCP), you currently have to create new seed clusters and migrate all shoots to them. As the functionality for copying backup buckets during shoot migration is already in place, we were wondering if it would be possible to simply alter the backup bucket in order to relocate it.
**Flow:**
- For `Garden`s/virtual cluster:
- ✅ Make `spec.virtualCluster.etcd.main.backup` mutable (protect via confirmation annotation (subresource is not possible due to CRD limitations)).
- ✅ In `Garden` reconciliation
- a new `extensions.gardener.cloud/v1alpha1.BackupBucket` will be deployed.
- ETCD will be redeployed (pointing to the new `BackupBucket`).
- ✅ A full snapshot will automatically be taken by ETCD (it detects the bucket is empty).
- ✅ Once `Garden` was reconciled, they can delete the old `BackupBucket` resource.
- For `Seed`s/`Shoot`s:
- ✅ Add field in `SeedSpec` that references a `core.gardener.cloud/v1beta1.BackupBucket` resource that should be used for backups (today, the `Seed`'s `.metadata.uid` is used implicitly); see the sketch after this list.
- ✅ Protect changes to this new field via a confirmation annotation.
- Human operator changes `.spec.backup.bucketName` in `Seed`.
- ✅ ~~We should always prohibit that multiple `Seed`s point to the same `BackupBucket`.~~ *we cannot check this*
- ✅ ~~Add `.status.bucketName` field in `core.gardener.cloud/v1beta1.BackupEntry` resource.~~
- Now in the `Shoot` reconciliations, `gardenlet`
- ✅ redeploys the `BackupEntry` for the `Shoot` (changing `.spec.bucketName` to the new `BackupBucket` — at this point, `.status.bucketName` still has the old one)
- ✅ redeploys ETCD (pointing to the new `BackupBucket`)
- ✅ A full snapshot will automatically be taken by ETCD (it detects the bucket is empty).
- ~~sets `.status.bucketName=.spec.bucketName` at the end of this (after ETCD deployed and snapshot taken).~~
- ✅ The human operator can now check all `BackupEntry` resources for the `Shoot`s on this `Seed` whether `.spec.bucketName` points to the new `BackupBucket`.
- ✅ Afterward, manually delete the old `BackupBucket` from the garden cluster.
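A sketch of the new `Seed` backup field discussed in the flow above (only `bucketName` is the new field; the confirmation annotation name is hypothetical):
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed
  annotations:
    confirmation.gardener.cloud/backup-bucket-change: "true"  # hypothetical confirmation annotation
spec:
  backup:
    provider: gcp
    bucketName: new-backup-bucket  # new field: references a core.gardener.cloud/v1beta1 BackupBucket
    # region, credentials, etc. unchanged
```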
### Add a force-restore operation annotation for Shoots
**Author:** Matthias Hoffmann
https://github.com/gardener/gardener/issues/12952
This would facilitate recovery from a disaster using the available backups.
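Once implemented, triggering it could look like this (the annotation value follows the issue's proposal and does not exist today):
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-my-project
  annotations:
    gardener.cloud/operation: force-restore  # proposed operation, see gardener/gardener#12952
```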
---
## Extensions 🧩
### Generic Extension for User Workloads
**Author:** Sonu Kumar Singh
Currently, if a user has, for example, 100 clusters in a project and wants to deploy some basic workloads in some of them, they must manually target and deploy resources in each individual cluster.
**Proposed Solution:**
* Allow namespaced `ControllerDeployment`s that include a field holding user-provided charts.
* Introduce a generic extension running in the seed cluster. This extension would check if a namespaced `ControllerDeployment` is used by one of the shoots in that project. If so, it would create a `ManagedResource` of class shoot in the shoot’s namespace in the seed.
* The GRM in the shoot would then ensure that the user-specific charts are deployed into the target shoots.
### Support updating underlying infrastructure resources during in-place node updates in MCM ([gardener/mcm#1023](https://github.com/gardener/machine-controller-manager/issues/1023))
**Authors:** Andreas Fritzler, Daniel Gonzalez
Support for updating the underlying infrastructure resources (e.g., OS image) during in-place node updates in Gardener Machine Controller Manager (MCM). This includes extending the MCM provider driver interface with an UpdateMachine method, enabling providers like `ironcore-metal` to handle infrastructure-level updates without full node recreation.
***Idea 💡***
Extend the MCM interface with a new method `UpdateMachine` which is called during an in-place update:
```go
// Driver is the common interface for creation/deletion of the VMs over different cloud-providers.
type Driver interface {
	...
	// UpdateMachine call is responsible for VM update on the provider
	UpdateMachine(context.Context, *UpdateMachineRequest) (*UpdateMachineResponse, error)
	...
}
```
The core idea is to memory-boot bare metal machines via the network and update the image by rebooting the server. This results in a clean state of the OS, in contrast to an in-place update on a disk-installed system.
***Tasks and Challenges 💥***
- [x] MCM update machine - https://github.com/gardener/machine-controller-manager/compare/master...afritzler:machine-controller-manager:enh/machine-update
- [x] Testing on Provider local - https://github.com/gardener/gardener/compare/master...aniruddha2000:gardener:ani/machine-update
- [x] Release forked container images to test the scenario in a real world bare metal environment.
- [x] Turns out updating the `MachineImage` from the `MachineClass` during an in-place update is not that trivial, as in the `UpdateMachine` call we only get the old machine image. A "hacky" fix has been built to get to the new image: https://github.com/afritzler/machine-controller-manager/commit/7005bd20bea234895bdf9a9fc98e5dc43ceca1e9
- [x] GardenLinux OS extension is using a `gardenlinux-update` command which is invoked by the `gardener-node-agent`. This is only possible via an on-disk installed system using a UKI GardenLinux image. We mitigated that problem by providing a fix to a forked GardenLinux OS extension: https://github.com/gardener/gardener-extension-os-gardenlinux/commit/6903635885b2a58d4c876612cd91be134f2aa307
***Hitting the wall 🤯***
Rebooting a server with a new OS image gave us a host with no prior `gardener-node-agent` and `kubelet` configuration, as this information was lost due to the nature of the in-memory booted machine.
***Open points***
- [ ] Bind mount `/var/lib/gardener-node-agent` and `/var/lib/kubelet` to preserve the previous state of the `Node`. However, this approach leaves us with a machine without the necessary systemd units for the `kubelet` service etc.
***Conclusion***
Revisit the idea of using in-place update for memory booted servers and rather re-evaluate the rolling update approach. The hacky changes to the MCM are not worth contributing upstream as they might break the core in-place update contract.
### Rework extension `ControlPlane` controller
**Author:** Rafael Franzke
We could rework the `ControlPlane` extensions controller and move generic things (like the CSI deployments) into `gardener/gardener` such that this is not duplicated in all provider extensions.
Furthermore, we could move to `ManagedResource`s and get rid of the Helm charts.
### GEP-28: Restore broken self-hosted cluster
**Author:** Rafael Franzke
We could evaluate what it takes to restore a broken self-hosted (f.k.a. "autonomous") shoot cluster.
### Evaluate Talos as node operating system
**Author:** Johannes Scheerer
Talos has a different approach to what a Kubernetes node operating system provides. It might be interesting to evaluate if we could support Talos as an operating system in Gardener Shoot clusters.
Follow-up: this will not be pursued towards production.
#### Overview
This track explored the technical feasibility of integrating **Talos OS** as a worker node operating system within Gardener. Talos is a modern, minimal, and immutable OS designed specifically for Kubernetes, forgoing traditional components like SSH and SystemD in favor of an API-driven model.
Value Proposition:
- **Security:** Immutable filesystem, minimal package set, and zero traditional access surfaces (no SSH/Shell).
- **Management:** Fully declarative, API-based configuration via GRPC.
- **Architecture:** A lightweight solution dedicated solely to running Kubernetes.
Implementation Environment:
- **Scope:** The PoC focused exclusively on the worker node layer.
- **Setup:**
- Provider-Local (`make operator-up`)
- Talos nodes running as Pods (similar to the `local` provider machines).
#### Implementation Status
The following milestones were achieved to validate the integration:
✅ Phase 1: Machine Bootstrapping
- [x] **Config Generation:** Automated generation of initial Talos configuration.
- [x] **Pod Creation:** Successfully deployed Talos machine as a Pod.
- [x] **Initial Push:** Pushed configuration to the node while in "insecure" mode.
- On boot, the Talos machine is running in insecure mode. Clients can push configuration with `talosctl` but after that the API locks down.
✅ Phase 2: Cluster Joining
- [x] **Kubelet Containerization:** Identified compatible Kubelet images (Hyperkube was incompatible).
- [x] **Trust Establishment:** Configured Kubelet with the Cluster CA.
- [x] **Token Exchange:** Successfully created and utilised bootstrap tokens.
✅ Phase 3: CNI & VPN
- [x] **CNI Management:** Disabled Talos CNI management to allow Gardener-managed CNI injection (see the config fragment after this phase).
- [x] **Connectivity:** Enabled Kubelet Server to support VPN/Tunnel requirements.
- [x] **DNS/SNI Fix:** Disabled `KubePrism` to resolve SNI breakage; enabled communication with Kube-API via proper DNS names.
- [x] **Typha Workaround:** Mitigated `FileOrCreate` volume bug (issue #12283) to ensure correct file mounting.
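A fragment of the Talos machine configuration corresponding to the Phase 3 items, disabling Talos-managed CNI and KubePrism so that the Gardener-managed CNI and the proper kube-apiserver DNS names are used (illustrative; follows the upstream Talos config schema as we understand it):
```yaml
machine:
  features:
    kubePrism:
      enabled: false  # disabled to resolve the SNI breakage mentioned above
cluster:
  network:
    cni:
      name: none      # disable Talos CNI management; Gardener injects its own CNI
```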
✅ Phase 4: Talos Control Plane Trust
- [x] **Trust Daemon:** Deployed `trustd` within the Control Plane.
- [x] **Istio Routing:** Configured Istio to allow Talos nodes to reach `trustd` securely.
#### Architectural Findings & Constraints
**⚠️ The `trustd` Challenge**
The `trustd` component (normally hosted on Talos Control Plane nodes) presented significant networking hurdles when moved to the Gardener Control Plane:
1. **Port Constraints:** It must be reachable on the **same address** as the Kubernetes API Service at port `50001`.
2. **SNI Conflicts:** The gRPC client used by `trustd` performs DNS name resolution and connects to the resolved IPs directly in order to do a sort of client-side DNS load balancing. This caused SNI resolution failures when routing through Istio.
- _Resolution:_ Custom Istio configuration was required to bypass standard SNI checks for this service.
**⚠️ Operational Access (`apid`)**
Since Gardener Control Planes are decoupled from the Data Plane, the Talos API Daemon (`apid`) is not naturally exposed.
- Operators need `talosctl` access for debugging.
- `apid` must be exposed publicly (NodePort/LoadBalancer) on the DataPlane side, or accessed via a jump-host.
**⚠️ Boot Security**
Talos nodes boot in **insecure mode** and remain so until the first configuration is pushed.
- **Risk:** On platforms with slow user-data initialization or delayed config injection, there is a theoretical window of vulnerability.
#### Future Work
**OSConfig & Node Agent Evolution**
- **Refactor OSConfig:** The current architecture relies heavily on SystemD. It must be abstracted to also support OSs that are configured via a declarative API.
- **New Node Agent:** The Gardener Node Agent (GNA) is incompatible with Talos. A new agent leveraging Talos's declarative API is required.
**Extension Compatibility**
- **Mutations:** `Gardenlet` and extensions that inject SystemD units via OSC mutations must be rewritten to support containerized sidecars or DaemonSets.
**Service Hardening**
- **Integrated `trustd`:** A fully managed, integrated `trustd` deployment strategy within the Control Plane is needed.
- **Access Strategy:** Define a standard pattern for exposing `apid` securely to operators.
#### Conclusion
**Status:** **FEASIBLE**. Talos OS can successfully function as a Gardener worker node. However, adopting it requires a paradigm shift away from SystemD-based management.
### MCM sets `ToBeDeletedByClusterAutoscaler` Taint to respect terminating nodes in load balancing
**Author:** Maximilian Geberl
If a machine is deleted, it is possible that the load balancer does not yet know that the node is no longer available. This eventually results in unsuccessful new connections, as the node the load balancer forwards traffic to no longer exists.
There are two ways a load balancer gets informed: via a reconciliation of the load balancer in the `cloud-provider` package, or via the `kube-proxy` health check. Unfortunately, the "drain" or the conditions set by the MCM neither trigger a reconciliation nor make the health check fail.
The `ToBeDeletedByClusterAutoscaler` Taint is used in both [`cloud-provider`](https://github.com/kubernetes/cloud-provider/blob/080e91c4b910bf92dfd43ca79eb74f5d39dcba75/controllers/service/controller.go#L1027) and [`kube-proxy`](https://github.com/kubernetes/kubernetes/blob/5bcb7599736327cd8c6d23e398002354a6e40f68/pkg/proxy/healthcheck/proxy_health.go#L187), therefore we want to add this Taint to improve the load balancing during machine terminations.
https://github.com/gardener/machine-controller-manager/pull/1054
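For reference, this is the taint both consumers linked above react to (the key and effect follow the cluster-autoscaler convention; node name and value are illustrative):
```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1               # illustrative
spec:
  taints:
  - key: ToBeDeletedByClusterAutoscaler
    value: "1700000000"             # cluster-autoscaler puts the deletion timestamp here (illustrative)
    effect: NoSchedule
```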
---
## CI/CD 🏗️
### Add SBOMs to all created artefacts
**Author:** Stefan Majer
In order to be able to get a complete view of possible CVEs, artefacts can contain the list of dependencies in the SBOM/SPDX format.
We at metal-stack.io already added that to all our artifacts, which can be used as a blueprint.
### Persist Logs of e2e Tests
**Author:** Tim Ebert
Gardener e2e tests export the logs of running pods/machines before exiting – both on success and failure – so that they can be viewed/downloaded in the artifacts browser (gcsweb).
However, logs of terminated pods/machines will not be exported if they are not running at the end of the test execution. I.e., we don't collect logs of pods/machines of successful test cases, because the shoots will be deleted as part of the test execution.
Debugging e2e test failures based on this information is very tedious. The ability to search e2e test logs or to compare logs of successful/failed tests would improve this experience.
For this, we could add a logging stack to the prow cluster (similar to the performance prometheus) where e2e test logs are stored and which can be queried in the cluster's Plutono instance.
### Go Build Cache in Prow
**Author:** Tim Ebert
Our Prow jobs spend a significant part of the execution time on building Go binaries/tests/tools. We could significantly reduce build times by keeping/reusing the Go build cache and thereby get faster CI feedback on PRs.
[GOCACHEPROG](https://pkg.go.dev/cmd/go/internal/cacheprog) allows storing the build cache externally, e.g., in S3 using https://github.com/tailscale/go-cache-plugin.
We should take care of preventing cache poisoning, though.
- Other implementations: https://github.com/saracen/gobuildcache (supports GCS/azure/s3)
- Istio does local caching using hostDirectory for gocache & gomod: https://github.com/istio/test-infra/blob/master/prow/config/jobs/istio.io-1.28.yaml#L33-L42
- Kubermatic uses a custom upload script to upload the cache: https://github.com/kubermatic/machine-controller/blob/b362a3a0fa305092e0142f638aa3c817c1c31c75/hack/ci/upload-gocache.sh
- Baseline for unit tests on our test cluster: 30min
- We tested a ReadWriteMany volume (NFS share / GCP Filestore) with great speed improvements --> 3min
- Filestore requires copying the files around for pull requests from the read-only device (or doing magic with overlay networks)
- NFS share requires 2.5TB+ for ~900$/month
- We used gobuildcache with GCS as well to increase performance
- Interregion traffic is free
- readonly is possible easily
- only storage needs to be paid (2TB = 20$)
- Workload identity federation can be used for access
- https://github.com/gardener/ci-infra/pull/4903
- https://github.com/gardener/ci-infra/pull/4926
- Results: https://excalidraw.com/#json=SxaESsbBjROuL9xzFUzZb,FgVgBG6tigZ8TPseBY4p1w
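A rough sketch of how this could plug into a Prow job: `GOCACHEPROG` is the standard Go toolchain hook for an external cache program, while the job name, image, and the `go-cache-plugin` invocation are placeholders rather than the verified CLI:
```yaml
# sketch of a presubmit fragment; the plugin binary and its flags are placeholders,
# not the verified CLI of tailscale/go-cache-plugin
presubmits:
  gardener/gardener:
    - name: pull-gardener-unit          # illustrative job name
      decorate: true
      spec:
        containers:
          - image: golang:1.24
            command: ["make", "test"]
            env:
              - name: GOCACHEPROG        # Go toolchain hook for an external build cache
                value: "go-cache-plugin <flags pointing at the GCS/S3 cache bucket>"  # placeholder
```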
### [GEP-28] Expose API server of Self-Hosted Shoots ([gardener/gardener#2906](https://github.com/gardener/gardener/issues/2906))
**Author:** Tim Ebert
The API server of a self-hosted shoot cluster with managed infrastructure (medium-touch scenario) needs to be exposed for external access. Ideas for this include creating a LoadBalancer for the `default/kubernetes` service so that we can reuse the cloud-controller-manager for this. As soon as cloud-controller-manager publishes the LoadBalancer's IP, the `DNSRecord` can be updated to point to the LoadBalancer instead of the machine's internal IP.
**Approach:**
- We introduce a new resource in the `extensions.gardener.cloud/v1alpha1` API called `SelfHostedShootExposure` (we could call it `ControlPlaneExposure`, but since it'll only be relevant for self-hosted shoots, this might be too general and confusing).
- In the `Shoot` API, the control plane worker pool will be configured like this (this can only be specified for the "managed infrastructure" (@timebertt's) case):
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    type: local
    workers:
    - name: control-plane
      controlPlane:
        exposure: # either `extension` or `dns` or nothing
          extension:
            type: local # defaults to `.spec.provider.type`, but could also be different
            # providerConfig: ...
          dns: {}
```
- If `.spec.provider.workers[].controlPlane.exposure.extension` is set, `gardenadm`/`gardenlet` will create a `SelfHostedShootExposure` object in `kube-system` like this:
```yaml
apiVersion: extensions.gardener.cloud/v1alpha1
kind: SelfHostedShootExposure
metadata:
  name: <shoot-name>
  namespace: kube-system
spec:
  type: # string
  providerConfig: # *runtime.RawExtension
  endpoints:
  - nodeName: gardener-local-control-plane
    addresses:
    - address: 172.18.0.2
      type: InternalIP
    - address: gardener-local-control-plane
      type: Hostname
    port: 443
status:
  ingress: # []corev1.LoadBalancerIngress
  - ip: 1.2.3.4
  - hostname: external.load-balancer.example.com
```
As usual, it will wait for the object to be reconciled successfully and update the (already existing) `extensions.gardener.cloud/v1alpha1.DNSRecord`'s `.spec.values[]` with the preferred addresses out of the reported `.status.ingress[]` (TODO: explain what is "preferred" [x-ref](https://github.com/gardener/gardener/blob/7c0127f653d4f63417513bcbaa2f88f1713b8ef6/pkg/gardenadm/botanist/machines.go#L39-L63))
- Extension controllers implementing the new `SelfHostedShootExposure` API are expected to:
- Reconcile the resources for exposing the self-hosted shoot control plane when `gardenadm`/`gardenlet` adds the `gardener.cloud/operation=reconcile` annotation.
- Delete the resources for exposing the self-hosted shoot control plane when `gardenlet` deletes the object.
- A new controller will be added to the extensions library for handling reconciliation and deletion of the `SelfHostedShootExposure`. The actuator interface will look like this:
```go
// Actuator is the minimal interface implemented by `SelfHostedShootExposure` extensions.
type Actuator interface {
	// Reconcile creates/reconciles all resources for the exposure of the self-hosted shoot control plane.
	Reconcile(context.Context, *extensionsv1alpha1.SelfHostedShootExposure, *extensionscontroller.Cluster) ([]corev1.LoadBalancerIngress, error)
	// Delete removes all resources that were created for the exposure of the self-hosted shoot control plane.
	Delete(context.Context, *extensionsv1alpha1.SelfHostedShootExposure, *extensionscontroller.Cluster) error
}
```
- `gardenlet` runs a new controller that watches the `Node`s of the worker control plane.
- If `.spec.provider.workers[].controlPlane.exposure.extension` is set, it updates the `.spec.endpoints[]` in the `SelfHostedShootExposure` resource with all `.status.addresses[]` of the control plane nodes.
- If `.spec.provider.workers[].controlPlane.exposure.dns` is set, `gardenlet` updates the existing `extensions.gardener.cloud/v1alpha1.DNSRecord`'s `.spec.values[]` with the preferred address of the control plane nodes. In this case, no additional resources for exposing the control plane are created, i.e., no `SelfHostedShootExposure` object is created and no corresponding extension controller needs to be registered.
- TODO: Examples
- cloud provider with LoadBalancer
- kube-vip
- provider-local: service in kind cluster
- DNS only -> cache issues accepted
- Future optimization: Introduce a new field in `ControllerRegistration` API that allows extensions implementing the `SelfHostedShootExposure` kind to specify whether they need continuously updated `.spec.endpoints` (some implementations like `kube-vip` (see above) might not need it).
**Next steps:** timebertt: Write a GEP and defend it in front of the TSC.
---
## Registry Cache 🪞
### Harmonize Registry Mirror Extension in gardener-extension-registry-cache with harbor registry cache
**Author:** Benedikt Haug
Currently, the mirroring function doesn't allow credentials to be used when configuring mirrors. The idea is to add such functionality so that an internal Harbor can be enforced as the cache. Corresponding issue: https://github.com/gardener/gardener-extension-registry-cache/issues/462
Additional features that would be relevant:
- Add support for the `server` field and the `override_path` option, and allow URL paths to be part of the `server` and `host` fields of the mirror config, to a) support non-conformant registries (like the widely used Harbor registry) and b) be able to control whether a fallback to upstream is allowed (see the sketch below)
- Extend the Gardener `OperatingSystemConfig` to support custom headers in the containerd `RegistryHost` config
Work already started here: https://github.com/networkhell/gardener-extension-registry-cache/tree/feature_mirror_server_and_options
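A sketch of the mirror configuration with the proposed additions; the `mirrors`/`hosts` structure reflects the extension's current `MirrorConfig` API as we understand it, while the commented fields are the proposed, purely illustrative extensions:
```yaml
apiVersion: mirror.extensions.gardener.cloud/v1alpha1
kind: MirrorConfig
mirrors:
  - upstream: docker.io
    # server: https://registry-1.docker.io  # proposed: control whether/where a fallback to upstream goes
    hosts:
      - host: https://harbor.internal.example.com/v2/library-proxy  # proposed: allow URL paths for non-conformant registries like Harbor
        capabilities: ["pull", "resolve"]
        # proposed additions (illustrative field names):
        # credentialsSecretRef: {name: harbor-pull-credentials}
        # overridePath: true
        # headers: {X-Custom-Header: value}
```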
### Allow configuring registry-mirror for Helm OCI charts pulled by gardenlet
**Author:** Marcel Boehm
When the `gardenlet` pulls a Helm Chart from an OCI Repository, there is no option to configure any mirrors, like we can do for containerd running on all nodes. We would like to extend the `GardenletConfig` with options similar to the `RegistryConfig` options in the OSC.
### Gardener Node Agent should be pullable from a registry mirror
**Author:** Lukas Hoehl
Currently it is not possible to pull the gardener node agent through a registry mirror configured by the OSC, since the gardener node agent is responsible for configuring containerd with the mirrors.
We work around this by adding the registry mirror as systemd files into the userdata via a webhook. I would, however, like to have this inside the registry-cache extension itself.
PR: https://github.com/gardener/gardener-extension-registry-cache/pull/495
---
## Networking 🔌
### Implement Firewall Distance and HA for metal-stack.io
**Authors:** Stefan Majer / Gerrit Schwerthelm
In order to get highly available firewalls in metal-stack.io, we would like to add detection of dead firewalls and the ability to deploy two or more firewalls in front of a cluster, using path prolongation to allow traffic to flow through in an HA manner.
### Evaluation of NFT mode of `kube-proxy`
**Author:** Johannes Scheerer
Currently, kube-proxy supports `ipvs` and `iptables` as proxy modes. Since Kubernetes 1.31, `nftables` is considered stable according to this blog post: [NFTables mode for kube-proxy](https://kubernetes.io/blog/2025/02/28/nftables-kube-proxy/).
We added support for this with the following PR:
- https://github.com/gardener/gardener/pull/13558
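With the PR above, selecting the mode in the `Shoot` spec could look roughly like this; the exact enum value is defined by the PR and is assumed here:
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeProxy:
      enabled: true
      mode: NFTables  # assumed spelling, analogous to the existing IPTables/IPVS values
```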
### Add Support for Calico Whisker
**Author:** Johannes Scheerer
Calico added Whisker in the 3.30 open source release. It has some capabilities similar to what you can do with Hubble in a Cilium cluster, i.e. you can monitor/trace ongoing traffic in the cluster. Currently, Whisker is only directly supported in a setup managed by `tigera-operator`. However, it is quite possible to run it in a Gardener-managed cluster. As Whisker seems to require mTLS, Calico needs to be deployed slightly differently from how the Calico extension currently manages it, though.
**Track**:
- Calico Whisker requires additional components to be deployed in the cluster:
1. Calico Whisker
2. Calico Goldmane
- These components are not included in the helm charts currently being used for the networking extension. Both are deployed/managed with the `tigera-operator`.
- We could take the kubernetes resources from a cluster managed with `tigera-operator` or we could reuse the code from the `tigera-operator`.
- A working prototype using the second approach, i.e. reusing the `tigera-operator` code, was implemented in https://github.com/ScheererJ/gardener-extension-networking-calico/tree/calico/whisker
- `tigera-operator` has an API module, but it is unfortunately not properly tagged, which made it necessary to replace the automatically resolved version with a different one.
- It required changing the communication between `calico-node` and `calico-typha` to mTLS. We used the `security-manager` from Gardener to manage the necessary certificates.
- Image handling was only partially checked. The approach allows using both container image tags and digests. We used the default, but overriding should be possible.
- Network policies were necessary to allow additional communication paths, e.g. extension to shoot api servers, `calico-node` to `goldmane`.
- Productization of the change is left to be done after the hackathon.
- Adding support for the Calico API server might be easier after this, as we could use a similar approach, i.e. reuse code from `tigera-operator` to manage it.
### Pod Overlay to Native Routing without Downtime
**Author:** Johannes Scheerer
Currently, Gardener supports both pod overlay networking and native routing. It is possible to switch between both modes via `.spec.networking.providerConfig.overlay.enabled`. However, the current implementation incurs a networking downtime while the cluster is reconfigured, i.e. while the daemon set is rolled out. Some productive clusters cannot tolerate such a downtime. Therefore, it would be helpful if the switch could be implemented in a seamless manner, i.e. old nodes use whatever they used before and new nodes use the new mode, but both groups can also communicate with each other.
**Track**:
- Tested the flow with calico and cilium in both directions in a small 2-node cluster, i.e. from pod overlay network to native routing and back on AWS with a long running connection (netcat) and continuous traffic. It looks like the connection always survives with calico. With cilium, the switch from pod overlay to native routing worked, but the reverse always resulted in a reset.
- The topic will be addressed by Sebastian Stauch during ordinary operations after the hackathon as a follow up to https://github.com/gardener/gardener/pull/13332.
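For reference, the switch in question is the `overlay` setting in the networking `providerConfig` (shown for the calico extension; the cilium extension has an analogous field):
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  networking:
    type: calico
    providerConfig:
      apiVersion: calico.networking.extensions.gardener.cloud/v1alpha1
      kind: NetworkConfig
      overlay:
        enabled: false  # switch from pod overlay networking to native routing
```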
### (D)DOS protection for kube-apiservers
**Author:** Oliver Götz
Support counter-(D)DoS measures like rate limiting for the kube-apiserver endpoints of the Garden and of Shoots.
### Cluster Mesh for cilium extension
**Author:** Lukas Hoehl
Allow connecting shoots to other kubernetes clusters running cilium via cluster mesh.
https://docs.cilium.io/en/stable/network/clustermesh/clustermesh/
---
## Networking – Istio ⛵️
### Reduce number of Istio Ingress Gateways
**Author:** Johannes Scheerer
In a standard multi-zonal seed cluster, there is one multi-zonal istio ingress gateway plus one per availability zone. The multi-zonal istio ingress gateway could be replaced by using all single-zone istio ingress gateways. This could lead to higher resource usage on the zonal gateways, but to reduced costs and a less complicated setup overall.
### Always use the same istio-gateway for shoot kube-apiserver endpoint and observability components ([gardener/gardener#11860](https://github.com/gardener/gardener/issues/11860))
**Author:** Oliver Götz
There are kube-apiserver endpoints (internal/external/wildcard) and observability endpoints for each shoot. Depending on the shoot and seed configuration there might be different istio-gateways used.
If exposure classes are used, this could lead to a situation where the endpoints are exposed to different networks.
If a zonal shoot is scheduled on a regional seed, the impact might be "only" the cost of cross-zonal traffic.
### Replace Ingress NGINX controller with Gateway API
**Author:** Lukas Hoehl
Istio has supported the Gateway API for some time: https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/
We should evaluate how mature the implementation actually is, so that we could replace some of the native Istio resources in g/g like `Gateway`, `VirtualService`, and `DestinationRule`.
While we cannot replace everything (probably most `EnvoyFilter`s are not replaceable), we should try to adopt the Gateway API to drive its maturity.
**Track**:
- Due to the recent [deprecation announcement concerning ingress nginx](https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/) we looked into how we could replace ingress nginx in the garden runtime/seed clusters with Gateway API implemented by istio.
- Using Gateway API directly with the istio ingress gateway as it is deployed by Gardener does not work out-of-the-box. The default tutorial works in a `kind` cluster, though.
- It does not work because in Gardener we set the mesh config in the Istio ConfigMap in the `istio-system` namespace with `defaultServiceExportTo: '~'`, `defaultVirtualServiceExportTo: '~'`, and `defaultDestinationRuleExportTo: '~'`. This prevents services from being exported by default. Since Istio translates the `HTTPRoute` into a `VirtualService`, that `VirtualService` needs to be exported to the Istio ingress gateway. We need to implicitly allow the `VirtualService` created internally by Istio by using `defaultVirtualServiceExportTo: '.'`.
- A branch showcasing Gateway API for Plutono is here: https://github.com/metal-stack/gardener/tree/gateway-api (see the route sketch after this list)
- Basic Authentication was possible but required an envoy filter and an external authorization server. The envoy filter can possibly be replaced as soon as the experimental support for the `HTTPExternalAuthFilter` graduates
- A special label was introduced for http routes that can be used to reference basic auth secrets and mount them into the ext-authz server within the same namespace
- As an alternative, it could be possible to use Istio-native external authorization: https://istio.io/latest/docs/tasks/security/authorization/authz-custom/
- Direct response is currently not possible with Gateway API (https://github.com/kubernetes-sigs/gateway-api/issues/2826), so we used a redirect to "/" in order to prevent access to admin routes
- It was necessary to extend the gardener-resource-manager network policy controller to create network policies based on gateway http routes (similar to ingress resources)
- Due to limitations of istio the `Gateway` resource needs to be created in the istio-ingress namespace and not directly in the namespace where it belongs
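A minimal `HTTPRoute` along the lines of the Plutono showcase above (Gateway name, hostname, and port are illustrative; note the `parentRef` pointing into the `istio-ingress` namespace as described above):
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: plutono
  namespace: garden
spec:
  parentRefs:
    - name: seed-gateway            # illustrative Gateway name; the Gateway itself lives in istio-ingress
      namespace: istio-ingress
  hostnames:
    - plutono.ingress.example.com   # illustrative
  rules:
    - backendRefs:
        - name: plutono
          port: 3000                # illustrative backend port
```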
**Open Topics**:
- [x] Basic Authentication
- Istio does not have support yet for experimental GatewayAPI feature [HTTPExternalAuthFilter](https://gateway-api.sigs.k8s.io/reference/spec/#httpexternalauthfilter)
- There is a default [envoy ext_authz server](https://github.com/gardener-attic/ext-authz-server) that could also be configured via an `EnvoyFilter`
- It could also be possible to use a [Basic Auth EnvoyFilter](https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/basic_auth/v3/basic_auth.proto#envoy-v3-api-msg-extensions-filters-http-basic-auth-v3-basicauth)
- e.g. mounting a secret that contains the basic auth information to be reloaded by Envoy
- [x] serve GatewayAPI resources as well as Istio resources under the same port
- [x] translate redirect snippets (direct-return) of nginx annotations to GatewayAPI resources
- [x] rest of nginx annotations to Gateway API
- [x] Translate 1 ingress resource to Gateway API
- [ ] external auth server traffic encryption
- [ ] Dashboard service
- [ ] Prometheus services (alertmanager, ...)
- [ ] Add to operator reconciliation flow
- [x] Add to shoot reconciliation flow
- [x] Envoy filter snippet is not suitable now for being deployed to shoot namespaces
- [ ] Replace DestinationRules with GatewayAPI resources
```yaml
kind: XBackendTrafficPolicy
apiVersion: gateway.networking.x-k8s.io/v1alpha1
metadata:
  labels:
    app: ext-authz-server
  name: ext-authz-server
  namespace: garden
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: ext-authz-server
---
kind: XBackendTrafficPolicy
apiVersion: gateway.networking.x-k8s.io/v1alpha1
metadata:
  labels:
    app: plutono
  name: plutono
  namespace: garden
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: plutono
```
---
## Networking – IPv6 🧬
### IPv6 or Dual-Stack Support for another Infrastructure
**Author:** Johannes Scheerer
Gardener currently supports IPv6/Dual-Stack on AWS and GCP. During the second to last hackathon a proof-of-concept for IronCore was created. Other infrastructures, e.g. metal-stack, OpenStack or Azure, also support IPv6 and could be enabled for Dual-Stack.
### Dual-Stack Seed API
**Author:** Johannes Scheerer
As one of the last steps missing for full Gardener Dual-Stack support, the `Seed` API needs to be extended.
---
## LLMs 🤖
### LLM-based Agents
**Author:** Vedran Lerenc
We have been using LLMs for 2.5 years for various simple tasks (coding, operations, and other side tasks) and would like to discuss whether you do too and, if so, collaborate and possibly build new agents together. We have a small "platform" that sits on top of LiteLLM, which in turn sits on top of models deployed on Azure OpenAI, AWS Bedrock, and GCP Vertex AI, so we can prototype ideas immediately, and we would be happy to do so together.
**Proposed Approach:**
* Discuss your LLM applications (only coding or more?)
* Discuss pain points where LLMs could help
* Discuss areas of interest where LLMs would improve Gardener (e.g. Dashboard, `gardenctl`, operations, etc.)
* Prototype together
### One commit message
**Author:** Niklas Klocke
Having consistent and insightful commit messages is a major benefit.
I would propose creating a small AI-based tool to generate consistent commit messages. After piloting it in one or two projects, we could roll it out to the whole Gardener project and finally speak with one voice.
### Bring the Gardener Answering Machine to the Gardener Documentation
**Author:** Niklas Klocke
Let's introduce the Answering Machine as a self-service offering for our users directly within the documentation.
In addition, we could explore ways to trace the sources used by the Answering Machine to answer specific questions.
This would help us:
- Identify gaps in our documentation, and
- Potentially automate the creation of pull requests to address those gaps.
---
## Observability 🔭
### Resolve the Istio Metrics Leak
**Author:** Johannes Scheerer
Currently, istio metrics are disabled because metrics for no longer existing `kube-apiserver` instances are served until istio finally restarts. This leads to a huge increase in metrics size, which can lead to congestion, cost explosion and metrics retention reduction. We should figure out how to report only the relevant istio metrics.
### Enrich Shoot Logs with Istio Access Logs
**Author:** Johannes Scheerer
Istio ingress gateway is configured to log accesses. In conjunction with L7 load balancing this becomes very useful as it shows all requests passing through istio. However, the logs are currently only accessible to seed operators. It would be nice if the access logs could be moved to the corresponding shoot log. This would also help in cases where access control is restricted, e.g. with the ACL extension.
The topic also applies to other components in the seed, but the istio access logs could be taken as a first step.
* First attempt PR: https://github.com/gardener/logging/pull/398, was closed shortly after raising.
* Real PR: https://github.com/gardener/gardener/pull/13548
---
## Enablement 📖
### Declarative GitHub Membership Administration ([gardener/org#2](https://github.com/gardener/org/issues/2))
**Author:** Tim Ebert
From [gardener/org#2](https://github.com/gardener/org/issues/2):
Adding individuals to different GitHub teams should be done automatically, based on a declarative approach.
The implementation can follow the approach used by Kubernetes, which utilizes the https://github.com/kubernetes/org repository along with
[Peribolos](https://docs.prow.k8s.io/docs/components/cli-tools/peribolos/) (see https://github.com/gardener/documentation/pull/715#discussion_r2321015493).
### Ease Shoot API Server Connectivity from external clients
**Author:** Tobias Gabriel
A lot of external clients connect to the Shoot API Server, from local CLIs to automation like ArgoCD.
In the most basic setup, a service account is created and shared with the external party. With OIDC and the Gardener Discovery Service, a lot of improvements are already possible. However, this is not always easy to set up and use, and this is something I want to tackle.
Figuring out what is possible, properly documenting it and identifying what can be implemented (and implementing it).
E.g., some of the questions I want to investigate, document and maybe even improve are:
- publicly trusted certificates for shoot API server endpoints (is the seed-bound certificate reliably usable? What are the caveats there?)
- End-to-end integration of GitOps controllers running outside of the cluster (auth to the shoot and CA management)
- Maybe finally open source the Gardener-specific "set up kubecontext with ID token and download CA certificate"
### The Illustrated Children’s Guide to Gardener
**Author:** Niklas Klocke
Gardener is deeply rooted in the Kubernetes ecosystem and tries to follow the proven path wherever it is reasonable.
But one part was always ignored! Addressing the most pressing question that parents working on Gardener face from their toddlers at home:
"What actually is Gardener?"
**The Illustrated Children’s Guide to Kubernetes** answers this question for Kubernetes. We should do the same for Gardener.
Btw: I also think that this would make for amazing merch at the next KubeCon ;)
https://www.cncf.io/phippy/the-childrens-illustrated-guide-to-kubernetes/