# From Chaotic Lists to Standardized Inventories: The Magical Evolution of Kubernetes Multi-Cluster Management
In the world of cloud-native computing, managing a single Kubernetes cluster is a well-understood task. But as organizations scale, they don't just scale workloads; they scale clusters. Soon, they find themselves grappling with a dozen, or even hundreds, of them. This sprawl introduces a formidable challenge: how do you manage a fleet of clusters as seamlessly as Kubernetes manages pods? This is the quest for "Kubernetes for Kubernetes".
In a detailed presentation at KubeCon Japan 2025, Kaslin Fields, a Developer Relations Engineering Manager at Google Cloud, CNCF Ambassador, and Kubernetes SIG Contributor Experience Co-Chair, demystified this complex domain. This article synthesizes the insights from her talk, "Multi Cluster Magics with Argo CD and Cluster Inventory," including deep dives from the Q&A session, to provide a comprehensive guide to the past, present, and future of multi-cluster management.
### The Primordial Challenge: Clusters as Isolated Islands
The foundational problem of multi-cluster management is that Kubernetes clusters are inherently unaware of the outside world. A cluster has no native concept of its own name or ID, let alone the identity of other clusters. This fundamental isolation makes centralized management difficult.
The community's first attempt to address multi-tenancy and organization was namespaces. Namespaces are a powerful feature for partitioning a single cluster's resources, much like virtual machines partition physical hardware. They allow you to logically organize Kubernetes objects like deployments and services and are considered a best practice.
However, namespaces fall short of solving the core reasons users adopt multiple clusters in the first place. These reasons include:
* Geographic Distribution: Reducing latency or addressing data gravity by placing clusters in different regions.
* Hard Isolation Boundaries: Using the cluster itself as the security boundary for workloads. This can also be driven by organizational structure, cost allocation, or differing budget constraints.
Namespaces provide logical separation, not physically distinct security domains. Thus, the need to manage actual, separate clusters remained.
### The Age of Lists: A Patchwork of Chaos
To solve the challenge of deploying an application across multiple, independent clusters, a seemingly straightforward solution emerged: create a list of the target clusters. A continuous delivery tool like Argo CD could then iterate through this list and trigger a deployment on each one.
Argo CD, in its multi-cluster mode, implements this by storing cluster connection details in Kubernetes Secrets. This approach is logical; you need a list, and you need it to be secure.
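For illustration, here is the shape of one such entry in Argo CD's documented declarative Secret format; the cluster name, server address, and credential values are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-us-west           # placeholder entry name
  namespace: argocd
  labels:
    # This label is how Argo CD recognizes the Secret as a cluster entry
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prod-us-west           # the cluster's name inside Argo CD
  server: https://203.0.113.10 # placeholder API server endpoint
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {
        "caData": "<base64-encoded CA certificate>"
      }
    }
```

Every target cluster is one more Secret like this, so the "list" is only as reliable as the process that creates and rotates these Secrets.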
This, however, introduced new layers of complexity:
1. **Namespace Sameness:** To maintain organizational consistency, it's best practice to deploy a workload into the same namespace on every target cluster. This ensures you know not just which cluster to deploy to, but where within that cluster the application lives.
2. **A List of Lists:** What happens when you have multiple multi-cluster applications? A second application might need to deploy to a different subset of clusters. Now, you have multiple, application-specific cluster lists. To manage these, you need another tool: a manager for your lists of clusters. Argo CD effectively becomes a cluster manager, consuming various lists, and the complexity snowballs, as the sketch after this list illustrates.
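This pattern is easy to see in Argo CD's ApplicationSet controller: its cluster generator enumerates the registered cluster Secrets and stamps out one Application per match. The repository URL, label selector, and namespace below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-frontend
  namespace: argocd
spec:
  generators:
    # One entry per registered cluster Secret; the selector carves out
    # this application's own subset of the fleet
    - clusters:
        selector:
          matchLabels:
            env: prod   # hypothetical label on the cluster Secrets
  template:
    metadata:
      name: 'web-frontend-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/org/web-frontend.git  # placeholder
        targetRevision: HEAD
        path: manifests
      destination:
        server: '{{server}}'
        namespace: web-frontend  # "namespace sameness" on every cluster
      syncPolicy:
        automated: {}
```

Because each ApplicationSet carries its own selector, every application effectively maintains a private cluster list, which is exactly the list-of-lists problem described above.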
This problem wasn't unique to Argo CD. The entire ecosystem was reinventing the same solution, leading to massive fragmentation. An army of tools emerged, each with its own way of managing cluster lists:
* **Dedicated Cluster Managers:** KubeFleet, Open Cluster Management (OCM), Karmada, Clusternet, Kubestellar, and KubeAdmiral.
* **Tools with Multi-Cluster Modes:** Argo CD, Kueue, KubeVela, and Istio.
* **Cloud Provider Solutions:** GKE Fleet and Azure Kubernetes Fleet Manager.
Every manager kept reinventing the list. If you wanted to use more than one of these tools, you'd be stuck writing glue code to reconcile their disparate list formats. This situation perfectly mirrors the classic XKCD comic on competing standards: trying to create a universal solution often results in just one more competing standard.
### The Standardization Spell: SIG-Multicluster's Grand Design
To break this cycle, a standardized approach was needed from a reputable source: the Kubernetes project itself. The "magicians" creating this standard are the contributors to the Special Interest Group (SIG) Multicluster. They have developed two key API standards to form a unified solution.
#### 1. The Multi-Cluster Services (MCS) API
Introduced in 2020 via KEP-1645, the MCS API's goal is to make accessing a service across multiple clusters as easy as accessing it within a single cluster. It defines three primary constructs (a minimal example follows the list):
* **ServiceExport:** Makes a service in one cluster available to others.
* **ServiceImport:** Creates a local proxy in a cluster to represent an exported service from another cluster.
* **ClusterSet:** A concept for grouping clusters that can share services.
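As a minimal sketch with placeholder names: the exporting cluster creates a ServiceExport that matches an existing Service, and the MCS implementation materializes a corresponding ServiceImport in the other clusters of the ClusterSet, reachable at a `clusterset.local` DNS name.

```yaml
# In the exporting cluster: name/namespace must match an existing Service
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: checkout    # placeholder Service name
  namespace: shop   # placeholder namespace
---
# Materialized by the MCS implementation in the importing clusters;
# consumers reach it via DNS at checkout.shop.svc.clusterset.local
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: checkout
  namespace: shop
spec:
  type: ClusterSetIP
  ports:
    - port: 80
      protocol: TCP
```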
While powerful for service discovery, MCS alone doesn't solve the core problem. Clusters still don't know about each other, making the ClusterSet concept difficult to manage robustly.
#### 2. The ClusterProfiles API
This is the second, more recent piece of the puzzle, proposed in 2023 via KEP-4322. The ClusterProfiles API offers a standardized way to describe a Kubernetes cluster in a multi-cluster context. It defines a namespaced custom resource, **ClusterProfile**, via a CRD. This Kubernetes object acts as a cluster's identity card, containing details like its properties and status.
```yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ClusterProfile
metadata:
  name: some-cluster-name
  namespace: fleet-system
  labels:
    x-k8s.io/cluster-manager: some-cluster-manager
spec:
  displayName: ...
  clusterManager:
    name: some-cluster-manager
status:
  version:
    kubernetes: 1.31.0
  properties:
    - name: clusterset.k8s.io
      value: prod-fleet
    - name: location
      value: us-west3
    # You can define unique characteristics, e.g., if the cluster has GPUs
    - name: has-gpu
      value: "true"
```
Combined, these two APIs create a powerful, standardized framework. You use ClusterProfile objects to give each cluster a standard identity and then group them into ClusterSets. The resulting collection of cluster identities is called a **Cluster Inventory**: the single, standardized list of clusters that the ecosystem so desperately needed.
### Making Magic Real: The Hub Cluster and Vendor Implementations
With standards defined, the next step is implementation. Google's open-source **Multi-Cluster Orchestrator (MCO)**, released in April 2025, serves as a prime example of a vendor implementing these SIG-Multicluster APIs.
MCO introduces the **Hub Cluster** concept: a central Kubernetes cluster that acts as the administrative heart of the multi-cluster environment. This cluster is where the multi-cluster controllers run and where the ClusterProfile objects live.
This creates a producer-consumer architecture:
* **Producers:** Cluster providers like GKE and AKS can automatically create and update ClusterProfile objects within the Hub Cluster, representing the clusters they manage.
* **Consumers:** Tools like Argo CD, MCO itself, and the upcoming MultiKueue project can watch these ClusterProfile objects on the Hub Cluster and act upon them.
The Hub Cluster becomes the single source of truth for the state of the entire fleet. By building on this open standard, vendors can add their own smarts. For example, Google's MCO implementation can perform intelligent bin-packing based on user policies or identify available capacity for scheduling, functionality specific to its underlying infrastructure.
The complete architecture looks like this: An internal client makes a request, which hits a **Multi-Cluster Gateway (MCG)**. On the Fleet Management Control Plane (the Hub Cluster), the MCO Operator sees a **Multi-Cluster Workload** object. It uses the Cluster Inventory to decide which clusters have the capacity and required attributes, schedules the workload accordingly (often via a consumer like Argo CD), and the MCG routes traffic to the correct clusters. This is "Kubernetes for Clusters" in action.
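MCO's real CRDs are its own, so purely as a hypothetical sketch of what such a Multi-Cluster Workload object could look like when it selects clusters from the inventory (the API group and every field name below are invented for illustration, not MCO's actual API):

```yaml
# Hypothetical sketch only -- not MCO's actual API surface
apiVersion: example.dev/v1alpha1
kind: MultiClusterWorkload
metadata:
  name: checkout-rollout
  namespace: fleet-system
spec:
  # Filter the Cluster Inventory by ClusterProfile properties
  clusterSelector:
    properties:
      - name: location
        values: ["us-west3", "us-east1"]
      - name: has-gpu
        value: "true"
  # Policy the orchestrator applies when picking clusters with capacity
  placement:
    maxClusters: 2
  # Hand-off to a consumer (e.g., an Argo CD ApplicationSet) for delivery
  workloadRef:
    kind: ApplicationSet
    name: checkout
```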
### Insights from the Field: A Deeper Q\&A Analysis
The Q\&A session following the talk revealed critical nuances about this approach.
* **On Cluster Autoscaling:** The ClusterProfiles API and MCO are focused on scheduling workloads onto existing clusters. They do not handle the creation or destruction of clusters themselves. Autoscaling the number of clusters in your fleet still requires external, infrastructure-aware tooling.
* **On Vendor-Specificity:** The decision to have vendor-specific implementations is deliberate. Managing networking, storage, and cluster lifecycle is deeply tied to the underlying environment (GCP, Azure, etc.). SIG-Multicluster's strategy is to provide a standard API so that users have a unified way to interact with these different vendor implementations, rather than creating a monolithic tool that can't possibly account for every environment. The challenge now is ensuring vendors implement the standards consistently.
* **On Real-World Use Cases:** Beyond geographic distribution, concrete use cases include:
* Hardware Emulation: A testing team using a full node to represent a piece of hardware. These clusters grow large quickly, necessitating separate clusters for different teams.
* Security and Cost Boundaries: Organizations using separate clusters per group to enforce strict security boundaries or manage budgets.
* Centralized App Deployment: A central DevOps team deploying a common application (e.g., a monitoring agent or security tool) to every team's cluster across the organization.
* **On Day 2 Operations (Upgrades):** The dream of performing seamless rolling upgrades across a fleet of clusters, similar to how Kubernetes handles pod upgrades, is not yet a reality. These APIs are not yet mature enough for a smooth, zero-touch rollout; many manual considerations, like draining clusters, are still required.
* **On the Data Plane:** The standards focus primarily on the control plane. The data plane, which handles things like cross-cluster networking, DNS, and global load balancing, is where vendor implementations shine. For instance, Google's MCO leverages the Multi-Cluster Gateway for this purpose, a concept specific to Google Cloud's networking capabilities (sketched after this list).
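To ground that last point: in GKE's documented multi-cluster Gateway pattern, an HTTPRoute targets a ServiceImport rather than a Service, so traffic can land on any cluster in the fleet that runs the workload. The names below follow that pattern but should be treated as illustrative; verify the Gateway class and backend group against current GKE documentation:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-http
  namespace: shop
spec:
  gatewayClassName: gke-l7-global-external-managed-mc  # multi-cluster class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: shop
spec:
  parentRefs:
    - name: external-http
  rules:
    - backendRefs:
        # Backend is a ServiceImport, not a Service, so the route spans
        # every cluster exporting this service
        - group: net.gke.io
          kind: ServiceImport
          name: checkout
          port: 8080
```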
### Conclusion: The Magic is Real, and It's Standardized
The journey from isolated clusters to a managed fleet has been a chaotic one, marked by a proliferation of disparate, list-based solutions. The work of SIG-Multicluster is finally casting a standardization spell on this chaos.
By combining the **Multi-Cluster Services API** and the new **ClusterProfiles API**, the community has created a robust foundation for a Cluster Inventory. This allows tools and platforms to interact with a fleet of heterogeneous clusters in a unified way, while still enabling vendors to provide powerful, infrastructure-aware implementations. The Hub Cluster concept brings this all together, creating a central point of truth and administration.
The magic of multi-cluster management is no longer a dark art. It's becoming an open, standardized, and extensible engineering discipline.
**Get Involved:**
* Learn more at SIG-Multicluster: multicluster.sigs.k8s.io
* Join the conversation on Slack: Find #sig-multicluster on the Kubernetes Slack and #argo-cd on the CNCF Slack.
* Explore the implementation: Check out the Multi-Cluster Orchestrator project on GitHub.