# `ComputeDomain` State Synchronization with SSA and Sharding

## Overview

`ComputeDomains` are the Kubernetes abstraction NVIDIA built to enable secure GPU-to-GPU memory sharing across nodes for multi-node NVLink systems like GB200 NVL72 and beyond. For background, see the [NVIDIA blog post on enabling Multi-Node NVLink on Kubernetes](https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/).

Multi-node NVLink requires each node to run a so-called IMEX daemon that brokers GPU-to-GPU memory operations across node boundaries. `ComputeDomains` automate the deployment and lifecycle of these daemons on a per-workload basis.

When a `ComputeDomain` forms around a multi-node workload, the IMEX daemons must first discover each other and synchronize state before they can broker cross-node communication. Only once this process is complete can the actual multi-node workload begin. At scale, this time-to-convergence becomes a bottleneck for job startup. Reducing it requires rethinking how daemons share and synchronize state.

The key insight that enables optimization is that IMEX daemons within the same NVLink partition (also called a *clique*) only need to coordinate with each other, even though the full job may span multiple partitions. This natural boundary makes sharding by clique an effective strategy.

This document describes two improvements that reduce convergence time:

1. **Server-Side Apply (SSA)**: Eliminates conflict retries by moving from read-modify-write to SSA semantics, where each daemon owns its own field
2. **Sharded Cliques**: Introduces `ComputeDomainClique` CRD objects that partition state by clique, so daemons only watch and write to their own partition's object

Together, these changes dramatically improve time-to-convergence, particularly at scale. Performance results comparing the original design, SSA only, and SSA plus sharding are included at the end of this document.

## The Problem: `ComputeDomain` Convergence Time

`ComputeDomains` dynamically create, manage, and tear down IMEX domains as multi-node workloads are scheduled to nodes and run to completion. When a distributed GPU workload starts, each node runs an IMEX daemon that must register itself and discover its peers before it can broker GPU-to-GPU memory operations across node boundaries. The time from workload submission to all nodes being ready (the *convergence time*) directly impacts job startup latency.

To coordinate, all IMEX daemons must share state about which nodes have joined and their readiness status. Originally, this shared state was stored in a single object: the `Status` field of the `ComputeDomain` CRD. Without SSA and sharding, convergence times degraded at scale due to:

- Conflict errors from concurrent read-modify-write updates
- Retry storms as daemons competed to update the same object
- Watch amplification as every daemon received updates for every node change

In practice, small deployments (tens of nodes) converged within seconds, but large deployments (100+ nodes) could take minutes to reach a ready state.

## Design Evolution

### Phase 1: The Original Design and Its Limitations

In v25.8.1, the `ComputeDomainManager` in each IMEX daemon would:

1. Watch the `ComputeDomain` object
2. Read the current status, find or create its node entry, and write back
3. Retry on conflict errors (HTTP 409); see the sketch after this list
4. Receive updates for all node changes via informer callbacks
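For illustration, this read-modify-write loop looked roughly like the sketch below. It is a simplified approximation rather than the actual driver code: `nvclient` and `nvapi` stand in for the driver's generated clientset and API packages, and the `upsertNode` helper, the `UpdateStatus` call, and the backoff policy are assumptions.

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// updateNodeEntryRMW sketches the pre-SSA update path (illustrative only).
func updateNodeEntryRMW(ctx context.Context, cs nvclient.Interface, namespace, cdName string, me *nvapi.ComputeDomainNode) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		// 1. Read the current ComputeDomain.
		cd, err := cs.ResourceV1beta1().ComputeDomains(namespace).Get(ctx, cdName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// 2. Find or create this node's entry in status.nodes
		//    (upsertNode is a hypothetical helper, not shown).
		updated := cd.DeepCopy()
		updated.Status.Nodes = upsertNode(updated.Status.Nodes, me)
		// 3. Write back; if another daemon wrote in the meantime, the API server
		//    returns HTTP 409 (Conflict) and RetryOnConflict runs the closure again.
		_, err = cs.ResourceV1beta1().ComputeDomains(namespace).UpdateStatus(ctx, updated, metav1.UpdateOptions{})
		return err
	})
}
```

The problem is inherent to step 3: whenever two daemons race on the same object, one of them receives HTTP 409 and must re-read and re-apply its change, and under load these retries compound.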
```
┌──────────────┐  GET/UPDATE    ┌───────────────────────┐
│ IMEX Daemon A│ ◄────────────► │                       │
├──────────────┤  (conflicts!)  │   ComputeDomain CRD   │
│ IMEX Daemon B│ ◄────────────► │   status.nodes[]      │
├──────────────┤                │   (all node info)     │
│ IMEX Daemon C│ ◄────────────► │                       │
└──────────────┘                └───────────────────────┘
```

**Problems observed:**

- **Conflict retries**: Concurrent writes triggered HTTP 409 errors, requiring exponential backoff retries
- **Retry amplification**: Under load, conflict rates increased, compounding delays as more daemons entered backoff simultaneously
- **Redundant status updates**: Multiple components updated global CD status, creating additional conflicts

### Phase 2: Server-Side Apply (SSA)

The first improvement migrated to Server-Side Apply for updating the nodes list. SSA tracks field ownership at the field manager level, allowing each daemon to own exactly one list entry without conflicts.

**Key changes:**

- **Use SSA for conflict-free nodes list updates** (`581087a28`): Each daemon patches with a unique field manager (`cd-writer-<nodeName>`)
- **Fix DNS index conflict resolution** (`3ec1915c0`): Proper gap-filling in index assignment, validated by stress testing
- **Move global CD status logic to ComputeDomainManager** (`8df37eb7a`): Consolidated status updates to eliminate redundant writes
- **Remove redundant status updates from DaemonSetManager** (`682344c1b`): Single source of truth for status

```go
func (m *ComputeDomainStatusManager) patchCD(ctx context.Context, patch []byte) (*nvapi.ComputeDomain, error) {
	return m.config.clientsets.Nvidia.ResourceV1beta1().ComputeDomains(...).Patch(
		ctx,
		m.config.computeDomainName,
		types.ApplyPatchType,
		patch,
		metav1.PatchOptions{
			FieldManager: fmt.Sprintf("cd-writer-%s", m.config.nodeName),
		},
		"status",
	)
}
```

```
┌──────────────┐   SSA Patch    ┌───────────────────────┐
│ IMEX Daemon A│ ─────────────► │                       │
├──────────────┤ (per-node FM)  │   ComputeDomain CRD   │
│ IMEX Daemon B│ ─────────────► │   status.nodes[]      │
├──────────────┤                │   (all node info)     │
│ IMEX Daemon C│ ─────────────► │                       │
└──────────────┘                └───────────────────────┘
        ▲                                   │
        └────────── Watch (all nodes) ──────┘
```

**Benefits:**

- No more conflict retries or retry storms
- Each daemon's update is atomic and independent
- Simpler code without complex retry logic
- DNS index assignment works correctly under load

**Remaining limitations:**

- All daemons still write to the same object (API server contention)
- All daemons watch the same object (informer traffic for every node change)
- Single object size grows with node count

### Phase 3: Sharding with `ComputeDomainClique` Objects

The second improvement introduced `ComputeDomainClique` CRD objects to shard daemon state by NVLink clique. Each daemon only writes to and watches the clique object for its own partition.
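Before walking through the key changes, the sketch below shows what scoping a daemon's watch to a single clique object can look like. It is illustrative only: the `GroupVersionResource`, the plural resource name, and the use of a dynamic informer are assumptions, not the driver's actual wiring.

```go
import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// newCliqueInformer returns an informer that receives events only for the
// single ComputeDomainClique object named <computeDomainUID>.<cliqueID>.
func newCliqueInformer(client dynamic.Interface, namespace, computeDomainUID, cliqueID string) cache.SharedIndexInformer {
	// Assumed GVR; the plural resource name is a guess based on the CRD kind.
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com",
		Version:  "v1beta1",
		Resource: "computedomaincliques",
	}
	cliqueName := fmt.Sprintf("%s.%s", computeDomainUID, cliqueID)

	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client,
		10*time.Minute, // resync period, arbitrary for this sketch
		namespace,
		func(opts *metav1.ListOptions) {
			// Restrict both the initial LIST and the WATCH to one object.
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", cliqueName).String()
		},
	)
	return factory.ForResource(gvr).Informer()
}
```

A field selector on `metadata.name` keeps both the initial LIST and the subsequent WATCH limited to the one object the daemon cares about, which is what eliminates cross-clique informer traffic. The key changes below describe how this was implemented in the driver.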
**Key changes:**

- **New CRD for `ComputeDomainClique`** (`91e3f323f`): Stores daemon info per-clique, named `<computeDomainUID>.<cliqueID>`
- **Feature gate for sharding** (`c685cab39`): Alpha feature gate `ComputeDomainCliques` controls which path is used
- **`ComputeDomainCliqueManager`** (`6daaa1c36`): New manager implementation that writes to `ComputeDomainClique` objects instead of `ComputeDomain` status
- **Controller cleanup logic** (`f6b4ef4cb`): Handles crash recovery and stale entry removal when daemon pods terminate unexpectedly
- **Clique status in `ComputeDomain`** (`419eb7e69`): Controller maintains summary clique info in `ComputeDomain.Status.Cliques`

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomainClique
metadata:
  name: <computeDomainUID>.<cliqueID>
  namespace: <driver-namespace>
  labels:
    resource.nvidia.com/computeDomain: <computeDomainUID>
    resource.nvidia.com/computeDomain.cliqueID: <cliqueID>
daemons:
- nodeName: node-a
  ipAddress: 10.0.0.1
  cliqueID: clique-1
  index: 0
  status: Ready
```

```
┌──────────────┐      SSA Patch     ┌─────────────────────────┐
│ IMEX Daemon A│ ─────────────────► │   ComputeDomainClique   │
├──────────────┤                    │   <cdUID>.clique-1      │
│ IMEX Daemon B│ ─────────────────► │   daemons[]             │
└──────────────┘                    └─────────────────────────┘
        ▲                                        │
        └────────── Watch (clique-1 only) ───────┘

┌──────────────┐      SSA Patch     ┌─────────────────────────┐
│ IMEX Daemon C│ ─────────────────► │   ComputeDomainClique   │
├──────────────┤                    │   <cdUID>.clique-2      │
│ IMEX Daemon D│ ─────────────────► │   daemons[]             │
└──────────────┘                    └─────────────────────────┘
        ▲                                        │
        └────────── Watch (clique-2 only) ───────┘

                                    ┌───────────────────────┐
 CD Controller    ────────────────► │   ComputeDomain CRD   │
(watches cliques)                   │   status.cliques[]    │
                                    │   (summary only)      │
                                    └───────────────────────┘
```

**Benefits:**

- Daemons in different cliques never contend on the same object
- Each daemon only receives watch events for its own clique
- Object sizes remain bounded by clique size, not total node count
- Natural partitioning aligns with NVLink partition topology

## Implementation Details

### DaemonInfoManager Interface

A common interface abstracts both implementations:

```go
type DaemonInfoManager interface {
	Start(ctx context.Context) error
	Stop() error
	GetDaemonInfoUpdateChan() chan []*nvapi.ComputeDomainDaemonInfo
}
```

The controller selects based on the feature gate:

```go
var daemonInfoManager DaemonInfoManager
if featuregates.Enabled(featuregates.ComputeDomainCliques) {
	daemonInfoManager = NewComputeDomainCliqueManager(mc)
} else {
	daemonInfoManager = NewComputeDomainStatusManager(mc)
}
```

### Cleanup Semantics

**Graceful shutdown**: Both managers remove their entry via an SSA patch with an empty list. SSA automatically removes entries owned by that field manager.

**Crash recovery**: The controller's `ComputeDomainCliqueManager` runs periodic cleanup, comparing `ComputeDomainClique` daemon entries against running pods and removing stale entries for pods that no longer exist.

**Owner references**: Each daemon pod adds itself as an owner reference to its `ComputeDomainClique`, enabling Kubernetes garbage collection when all daemons in a clique are removed.
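To make the graceful-shutdown case above concrete, the sketch below shows one way the empty-list SSA patch can be issued. It is illustrative only: the field-manager naming for the clique path, the `GroupVersionResource`, and the assumption that `daemons` is a map-type (associative) list are not taken from the driver source.

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// removeOwnDaemonEntry sketches the graceful-shutdown path: applying a
// configuration whose daemons list is empty relinquishes this field manager's
// entries, and the API server prunes them (assuming daemons is a map-type
// list, so entries owned by other managers are left untouched).
func removeOwnDaemonEntry(ctx context.Context, client dynamic.Interface, namespace, cliqueName, nodeName string) error {
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com",
		Version:  "v1beta1",
		Resource: "computedomaincliques", // assumed plural
	}

	patch := []byte(fmt.Sprintf(`{
  "apiVersion": "resource.nvidia.com/v1beta1",
  "kind": "ComputeDomainClique",
  "metadata": {"name": %q},
  "daemons": []
}`, cliqueName))

	_, err := client.Resource(gvr).Namespace(namespace).Patch(
		ctx, cliqueName, types.ApplyPatchType, patch,
		metav1.PatchOptions{
			// Field-manager naming assumed to mirror the per-node convention
			// used for the ComputeDomain status path.
			FieldManager: fmt.Sprintf("cd-writer-%s", nodeName),
		},
	)
	return err
}
```

Because ownership is tracked per field manager, a patch like this only affects the entries that this daemon previously applied.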
### Feature Gate

```go
ComputeDomainCliques featuregate.Feature = "ComputeDomainCliques"
// Default: false, Alpha, Version: 25.12+
```

## API Changes

### New CRD: `ComputeDomainClique`

```go
type ComputeDomainClique struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Daemons []*ComputeDomainDaemonInfo `json:"daemons,omitempty"`
}

type ComputeDomainDaemonInfo struct {
	NodeName  string `json:"nodeName"`
	IPAddress string `json:"ipAddress"`
	CliqueID  string `json:"cliqueID"`
	Index     int    `json:"index"`
	Status    string `json:"status,omitempty"`
}
```

### `ComputeDomain` Status Extension

```go
type ComputeDomainStatus struct {
	Status  string                     `json:"status"`
	Nodes   []*ComputeDomainNode       `json:"nodes,omitempty"`   // existing
	Cliques []*ComputeDomainCliqueInfo `json:"cliques,omitempty"` // new
}

type ComputeDomainCliqueInfo struct {
	ID     string `json:"id"`
	Status string `json:"status,omitempty"`
}
```

## Migration Path

1. **v25.8.1 → main**: The SSA migration is transparent; no user action is required
2. **main → sharded**: Enable the `ComputeDomainCliques=true` feature gate

Neither transition requires data migration. `ComputeDomainClique` objects are ephemeral and rebuilt automatically on daemon startup.

## Performance Results

The following plots show `ComputeDomain` convergence time across three implementations, measured on [TODO: describe test environment, node count].

### v25.8.1: Original Design (Read-Modify-Write)

<!-- TODO: Insert convergence time plot for v25.8.1 (commit e6361d5) -->
![Convergence Time - v25.8.1](./plots/convergence-v25.8.1.png)

In the original design, IMEX daemons used standard Kubernetes read-modify-write semantics to update `ComputeDomain.Status.Nodes`. Under concurrent updates, daemons frequently encountered conflict errors and had to retry, leading to unpredictable convergence times that grew worse with node count.

### With SSA: Conflict-Free Updates

<!-- TODO: Insert convergence time plot for main (with SSA) -->
![Convergence Time - With SSA](./plots/convergence-ssa.png)

After migrating to Server-Side Apply, each daemon owns its own entry in the nodes list. Concurrent updates no longer conflict, eliminating retry storms. Convergence time becomes more predictable, though all daemons still write to and watch a single `ComputeDomain` object.

### With SSA + Sharding: Partitioned by Clique

<!-- TODO: Insert convergence time plot for current branch (SSA + sharding) -->
![Convergence Time - SSA + Sharding](./plots/convergence-sharded.png)

With sharded `ComputeDomainClique` objects, daemons in different cliques write to and watch separate objects. This reduces both API server write contention and informer traffic, further improving convergence time, especially for `ComputeDomains` that span multiple NVLink partitions.
## References

### External Documentation

- [Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond](https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/)
- [NVIDIA DRA Driver for GPUs - ComputeDomain Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html)
- [NVIDIA IMEX Guide](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html)

### Source Code

- Feature gate: `pkg/featuregates/featuregates.go`
- CDClique CRD: `api/nvidia.com/resource/v1beta1/computedomainclique.go`
- Daemon StatusManager: `cmd/compute-domain-daemon/cdstatus.go`
- Daemon CliqueManager: `cmd/compute-domain-daemon/cdclique.go`
- Controller CliqueManager: `cmd/compute-domain-controller/cdclique.go`
- DaemonInfoManager interface: `cmd/compute-domain-daemon/controller.go`