# `ComputeDomain` State Synchronization with SSA and Sharding
## Overview
`ComputeDomains` are the Kubernetes abstraction NVIDIA built to enable secure
GPU-to-GPU memory sharing across nodes for multi-node NVLink systems like GB200
NVL72 and beyond. For background, see the [NVIDIA blog post on enabling
Multi-Node NVLink on Kubernetes](https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/).
Multi-node NVLink requires each node to run a so-called IMEX daemon
that brokers GPU-to-GPU memory operations across node boundaries.
`ComputeDomains` automate the deployment and lifecycle of these daemons on a
per-workload basis. When a `ComputeDomain` forms around a multi-node
workload, the IMEX daemons must first discover each other and synchronize state
before they can broker cross-node communication. Only once this process is
complete can the actual multi-node workload begin.
At scale, this time-to-convergence becomes a bottleneck for job startup.
Reducing it requires rethinking how daemons share and synchronize state. The key
insight that enables optimization is that IMEX daemons within the same NVLink
partition (also called a *clique*) only need to coordinate with each other, even
when the overall job spans multiple partitions. This natural boundary makes
sharding by clique an effective strategy.
This document describes two improvements that reduce convergence time:
1. **Server-Side Apply (SSA)**: Eliminates conflict retries by moving from
read-modify-write to SSA semantics, where each daemon owns its own field
2. **Sharded Cliques**: Introduces `ComputeDomainClique` CRD objects that
partition state by clique, so daemons only watch and write to their own
partition's object
Together, these changes dramatically improve time-to-convergence, particularly
at scale. Performance results comparing the original design, SSA only, and SSA
plus sharding are included at the end of this document.
## The Problem: `ComputeDomain` Convergence Time
`ComputeDomains` dynamically create, manage, and tear down IMEX domains as
multi-node workloads are scheduled to nodes and run to completion. When a
distributed GPU workload starts, each node runs an IMEX daemon that must
register itself and discover its peers before it can broker
GPU-to-GPU memory operations across node boundaries. The time from workload
submission to all nodes being ready—the *convergence time*—directly impacts
job startup latency.
To coordinate, all IMEX daemons must share state about which nodes have joined
and their readiness status. Originally, this shared state was stored in a
single object: the `Status` field of the `ComputeDomain` CRD. Without SSA and
sharding, convergence times degraded at scale due to:
- Conflict errors from concurrent read-modify-write updates
- Retry storms as daemons competed to update the same object
- Watch amplification as every daemon received updates for every node change
In practice, small deployments (tens of nodes) converged within seconds, but
large deployments (100+ nodes) could take minutes to reach a ready state.
## Design Evolution
### Phase 1: The Original Design and Its Limitations
In v25.8.1, the `ComputeDomainManager` in each IMEX daemon would:
1. Watch the `ComputeDomain` object
2. Read the current status, find or create its node entry, and write back
3. Retry on conflict errors (HTTP 409)
4. Receive updates for all node changes via informer callbacks
```
┌──────────────┐ GET/UPDATE ┌───────────────────────┐
│ IMEX Daemon A│ ◄────────────► │ │
├──────────────┤ (conflicts!) │ ComputeDomain CRD │
│ IMEX Daemon B│ ◄────────────► │ status.nodes[] │
├──────────────┤ │ (all node info) │
│ IMEX Daemon C│ ◄────────────► │ │
└──────────────┘ └───────────────────────┘
```
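A minimal sketch of that loop, using client-go's conflict-retry helper
(`k8s.io/client-go/util/retry`); the method name, client accessor, config
fields, and `upsertNode` helper are illustrative, not the actual v25.8.1 code:
```go
// Sketch (assumption): the original read-modify-write update, retried on
// HTTP 409 conflicts.
func (m *ComputeDomainManager) upsertNodeEntry(ctx context.Context) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // GET the latest object; another daemon may have just updated it.
        cd, err := m.clientset.ResourceV1beta1().ComputeDomains(m.namespace).Get(
            ctx, m.computeDomainName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        // MODIFY: find or create this node's entry in status.nodes.
        upsertNode(&cd.Status, m.nodeName) // hypothetical helper
        // UPDATE the status subresource; a stale resourceVersion yields a
        // conflict, which triggers a backoff and another attempt.
        _, err = m.clientset.ResourceV1beta1().ComputeDomains(m.namespace).UpdateStatus(
            ctx, cd, metav1.UpdateOptions{})
        return err
    })
}
```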
**Problems observed:**
- **Conflict retries**: Concurrent writes triggered HTTP 409 errors, requiring
exponential backoff retries
- **Retry amplification**: Under load, conflict rates increased, compounding
delays as more daemons entered backoff simultaneously
- **Redundant status updates**: Multiple components updated global CD status,
creating additional conflicts
### Phase 2: Server-Side Apply (SSA)
The first improvement migrated to Server-Side Apply for updating the nodes
list. SSA tracks field ownership at the field manager level, allowing each
daemon to own exactly one list entry without conflicts.
**Key changes:**
- **Use SSA for conflict-free nodes list updates** (`581087a28`): Each daemon
patches with a unique field manager (`cd-writer-<nodeName>`)
- **Fix DNS index conflict resolution** (`3ec1915c0`): Proper gap-filling in
index assignment, validated by stress testing
- **Move global CD status logic to ComputeDomainManager** (`8df37eb7a`):
Consolidated status updates to eliminate redundant writes
- **Remove redundant status updates from DaemonSetManager** (`682344c1b`):
Single source of truth for status
```go
func (m *ComputeDomainStatusManager) patchCD(ctx context.Context, patch []byte) (*nvapi.ComputeDomain, error) {
return m.config.clientsets.Nvidia.ResourceV1beta1().ComputeDomains(...).Patch(
ctx,
m.config.computeDomainName,
types.ApplyPatchType,
patch,
metav1.PatchOptions{
FieldManager: fmt.Sprintf("cd-writer-%s", m.config.nodeName),
},
"status",
)
}
```
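The patch body itself declares only this daemon's entry in `status.nodes`. A
sketch of how it might be built (an assumption, not the actual source; it uses
`encoding/json`, and the JSON keys mirror the `ComputeDomainDaemonInfo` shape
shown later, while the real `ComputeDomainNode` fields may differ):
```go
// buildNodeApplyPatch sketches the apply-configuration submitted via patchCD.
// Because each daemon applies only its own list entry under a unique field
// manager, the API server merges entries without read-modify-write conflicts.
func buildNodeApplyPatch(cdName, nodeName, nodeIP, cliqueID string) ([]byte, error) {
    return json.Marshal(map[string]any{
        "apiVersion": "resource.nvidia.com/v1beta1",
        "kind":       "ComputeDomain",
        "metadata":   map[string]any{"name": cdName},
        "status": map[string]any{
            "nodes": []any{
                map[string]any{
                    "nodeName":  nodeName,
                    "ipAddress": nodeIP,
                    "cliqueID":  cliqueID,
                    "status":    "Ready",
                },
            },
        },
    })
}
```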
```
┌──────────────┐ SSA Patch ┌───────────────────────┐
│ IMEX Daemon A│ ─────────────► │ │
├──────────────┤ (per-node FM) │ ComputeDomain CRD │
│ IMEX Daemon B│ ─────────────► │ status.nodes[] │
├──────────────┤ │ (all node info) │
│ IMEX Daemon C│ ─────────────► │ │
└──────────────┘ └───────────────────────┘
▲ │
└────────── Watch (all nodes) ─────┘
```
**Benefits:**
- No more conflict retries or retry storms
- Each daemon's update is atomic and independent
- Simpler code without complex retry logic
- DNS index assignment works correctly under load
**Remaining limitations:**
- All daemons still write to the same object (API server contention)
- All daemons watch the same object (informer traffic for every node change)
- Single object size grows with node count
### Phase 3: Sharding with `ComputeDomainClique` Objects
The second improvement introduced `ComputeDomainClique` CRD objects to shard
daemon state by NVLink clique. Each daemon only writes to and watches the
clique object for its own partition.
**Key changes:**
- **New CRD for `ComputeDomainClique`** (`91e3f323f`): Stores daemon info
per-clique, named `<computeDomainUID>.<cliqueID>`
- **Feature gate for sharding** (`c685cab39`): Alpha feature gate
`ComputeDomainCliques` controls which path is used
- **`ComputeDomainCliqueManager`** (`6daaa1c36`): New manager implementation
that writes to `ComputeDomainClique` objects instead of `ComputeDomain` status
- **Controller cleanup logic** (`f6b4ef4cb`): Handles crash recovery and stale
entry removal when daemon pods terminate unexpectedly
- **Clique status in `ComputeDomain`** (`419eb7e69`): Controller maintains
summary clique info in `ComputeDomain.Status.Cliques`
```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomainClique
metadata:
name: <computeDomainUID>.<cliqueID>
namespace: <driver-namespace>
labels:
resource.nvidia.com/computeDomain: <computeDomainUID>
resource.nvidia.com/computeDomain.cliqueID: <cliqueID>
daemons:
- nodeName: node-a
ipAddress: 10.0.0.1
cliqueID: clique-1
index: 0
status: Ready
```
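Each daemon applies only its own entry to the clique object for its partition.
A sketch of that per-clique apply body (an assumption, not the actual source;
field names follow the YAML above, and `encoding/json` and `fmt` are assumed):
```go
// buildCliqueApplyPatch sketches the apply body a daemon submits for the
// ComputeDomainClique named <computeDomainUID>.<cliqueID>. Daemons in other
// cliques never read or write this object.
func buildCliqueApplyPatch(cdUID, cliqueID, namespace, nodeName, nodeIP string, index int) ([]byte, error) {
    return json.Marshal(map[string]any{
        "apiVersion": "resource.nvidia.com/v1beta1",
        "kind":       "ComputeDomainClique",
        "metadata": map[string]any{
            "name":      fmt.Sprintf("%s.%s", cdUID, cliqueID),
            "namespace": namespace,
        },
        "daemons": []any{
            map[string]any{
                "nodeName":  nodeName,
                "ipAddress": nodeIP,
                "cliqueID":  cliqueID,
                "index":     index,
                "status":    "Ready",
            },
        },
    })
}
```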
```
┌──────────────┐ SSA Patch ┌─────────────────────────┐
│ IMEX Daemon A│ ─────────────────► │ ComputeDomainClique │
├──────────────┤ │ <cdUID>.clique-1 │
│ IMEX Daemon B│ ─────────────────► │ daemons[] │
└──────────────┘ └─────────────────────────┘
▲ │
└─────────── Watch (clique-1 only) ────┘
┌──────────────┐ SSA Patch ┌─────────────────────────┐
│ IMEX Daemon C│ ─────────────────► │ ComputeDomainClique │
├──────────────┤ │ <cdUID>.clique-2 │
│ IMEX Daemon D│ ─────────────────► │ daemons[] │
└──────────────┘ └─────────────────────────┘
▲ │
└─────────── Watch (clique-2 only) ────┘
┌───────────────────────┐
CD Controller ────────► │ ComputeDomain CRD │
(watches cliques) │ status.cliques[] │
│ (summary only) │
└───────────────────────┘
```
**Benefits:**
- Daemons in different cliques never contend on the same object
- Each daemon only receives watch events for its own clique
- Object sizes remain bounded by clique size, not total node count
- Natural partitioning aligns with NVLink partition topology
## Implementation Details
### DaemonInfoManager Interface
A common interface abstracts both implementations:
```go
type DaemonInfoManager interface {
Start(ctx context.Context) error
Stop() error
GetDaemonInfoUpdateChan() chan []*nvapi.ComputeDomainDaemonInfo
}
```
The daemon's controller selects the implementation based on the feature gate:
```go
var daemonInfoManager DaemonInfoManager
if featuregates.Enabled(featuregates.ComputeDomainCliques) {
daemonInfoManager = NewComputeDomainCliqueManager(mc)
} else {
daemonInfoManager = NewComputeDomainStatusManager(mc)
}
```
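Either way, the daemon consumes the same update stream. A sketch of that
consumption loop, where `run` and `updateIMEXConfig` are hypothetical names for
illustration only:
```go
// Sketch (assumption): the daemon reacts to daemon-info updates from whichever
// manager is active, regenerating its IMEX configuration whenever the set of
// peers changes.
func run(ctx context.Context, daemonInfoManager DaemonInfoManager) error {
    if err := daemonInfoManager.Start(ctx); err != nil {
        return err
    }
    defer daemonInfoManager.Stop()

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case daemons := <-daemonInfoManager.GetDaemonInfoUpdateChan():
            if err := updateIMEXConfig(daemons); err != nil {
                return err
            }
        }
    }
}
```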
### Cleanup Semantics
**Graceful shutdown**: Both managers remove their entry via an SSA patch with an
empty list. SSA automatically removes entries owned by that field manager.
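A minimal sketch of that removal for the status-based manager, reusing the
`patchCD` helper shown earlier (the method name is illustrative; the
clique-based manager does the equivalent against its `ComputeDomainClique`
object):
```go
// Sketch (assumption): graceful removal of this daemon's entry. Applying a
// configuration whose nodes list is empty, under the same field manager, makes
// server-side apply drop the entries this daemon previously owned.
func (m *ComputeDomainStatusManager) removeNodeEntry(ctx context.Context) error {
    removal, err := json.Marshal(map[string]any{
        "apiVersion": "resource.nvidia.com/v1beta1",
        "kind":       "ComputeDomain",
        "metadata":   map[string]any{"name": m.config.computeDomainName},
        "status":     map[string]any{"nodes": []any{}},
    })
    if err != nil {
        return err
    }
    _, err = m.patchCD(ctx, removal)
    return err
}
```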
**Crash recovery**: The controller's `ComputeDomainCliqueManager` runs periodic
cleanup, comparing `ComputeDomainClique` daemon entries against running pods
and removing stale entries for pods that no longer exist.
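The core of that comparison is a filter over the recorded entries; a sketch,
assuming the controller has already resolved which nodes currently run a
daemon pod:
```go
// Sketch (assumption): keep only entries whose node still runs a daemon pod.
// The controller then writes the filtered list back to the ComputeDomainClique.
func filterLiveDaemons(daemons []*nvapi.ComputeDomainDaemonInfo, liveNodes map[string]bool) []*nvapi.ComputeDomainDaemonInfo {
    var kept []*nvapi.ComputeDomainDaemonInfo
    for _, d := range daemons {
        if liveNodes[d.NodeName] {
            kept = append(kept, d)
        }
    }
    return kept
}
```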
**Owner references**: Each daemon pod adds itself as an owner reference to its
`ComputeDomainClique`, enabling Kubernetes garbage collection when all daemons
in a clique are removed.
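A sketch of such an owner reference, with the pod name and UID as illustrative
inputs:
```go
// Sketch (assumption): each daemon records its own pod as an owner of the
// clique object, so Kubernetes garbage collection deletes the object once all
// owning pods are gone.
clique.OwnerReferences = append(clique.OwnerReferences, metav1.OwnerReference{
    APIVersion: "v1",
    Kind:       "Pod",
    Name:       podName, // this daemon's pod (illustrative)
    UID:        podUID,
})
```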
### Feature Gate
```go
ComputeDomainCliques featuregate.Feature = "ComputeDomainCliques"
// Default: false, Alpha, Version: 25.12+
```
## API Changes
### New CRD: `ComputeDomainClique`
```go
type ComputeDomainClique struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Daemons []*ComputeDomainDaemonInfo `json:"daemons,omitempty"`
}
type ComputeDomainDaemonInfo struct {
NodeName string `json:"nodeName"`
IPAddress string `json:"ipAddress"`
CliqueID string `json:"cliqueID"`
Index int `json:"index"`
Status string `json:"status,omitempty"`
}
```
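For each daemon to own exactly one entry under server-side apply, the `daemons`
field is presumably declared as an associative list keyed by `nodeName`. A
sketch of the kubebuilder markers that would express this (an assumption; the
actual source may differ):
```go
// Assumption / sketch: associative-list markers let SSA track ownership of
// individual entries (keyed by nodeName) rather than of the list as a whole.
type ComputeDomainClique struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    // +listType=map
    // +listMapKey=nodeName
    Daemons []*ComputeDomainDaemonInfo `json:"daemons,omitempty"`
}
```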
### `ComputeDomain` Status Extension
```go
type ComputeDomainStatus struct {
Status string `json:"status"`
Nodes []*ComputeDomainNode `json:"nodes,omitempty"` // existing
Cliques []*ComputeDomainCliqueInfo `json:"cliques,omitempty"` // new
}
type ComputeDomainCliqueInfo struct {
ID string `json:"id"`
Status string `json:"status,omitempty"`
}
```
## Migration Path
1. **v25.8.1 → main**: The SSA migration is transparent; no user action required
2. **main → sharded**: Enable the `ComputeDomainCliques=true` feature gate
Neither transition requires data migration. `ComputeDomainClique` objects
are ephemeral and rebuilt automatically on daemon startup.
## Performance Results
The following plots show `ComputeDomain` convergence time across three
implementations, measured on [TODO: describe test environment, node count].
### v25.8.1: Original Design (Read-Modify-Write)
<!-- TODO: Insert convergence time plot for v25.8.1 (commit e6361d5) -->

In the original design, IMEX daemons used standard Kubernetes read-modify-write
semantics to update `ComputeDomain.Status.Nodes`. Under concurrent updates,
daemons frequently encountered conflict errors and had to retry, leading to
unpredictable convergence times that grew worse with node count.
### With SSA: Conflict-Free Updates
<!-- TODO: Insert convergence time plot for main (with SSA) -->

After migrating to Server-Side Apply, each daemon owns its own field in the
nodes list. Concurrent updates no longer conflict, eliminating retry storms.
Convergence time becomes more predictable, though all daemons still write to
and watch a single `ComputeDomain` object.
### With SSA + Sharding: Partitioned by Clique
<!-- TODO: Insert convergence time plot for current branch (SSA + sharding) -->

With sharded `ComputeDomainClique` objects, daemons in different cliques write
to and watch separate objects. This reduces both API server write contention
and informer traffic, further improving convergence time—especially for
`ComputeDomains` that span multiple NVLink partitions.
## References
### External Documentation
- [Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond](https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/)
- [NVIDIA DRA Driver for GPUs - ComputeDomain Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html)
- [NVIDIA IMEX Guide](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html)
### Source Code
- Feature gate: `pkg/featuregates/featuregates.go`
- CDClique CRD: `api/nvidia.com/resource/v1beta1/computedomainclique.go`
- Daemon StatusManager: `cmd/compute-domain-daemon/cdstatus.go`
- Daemon CliqueManager: `cmd/compute-domain-daemon/cdclique.go`
- Controller CliqueManager: `cmd/compute-domain-controller/cdclique.go`
- DaemonInfoManager interface: `cmd/compute-domain-daemon/controller.go`