<h1>Best Practices for Combining Kubernetes with Custom GPU Controllers</h1>

![best-practices-kubernetes-custom-gpu-controllers](https://hackmd.io/_uploads/rypCJ_RzZg.jpg)

<p>Running GPUs on Kubernetes is easy until you need more than what standard device plugins and schedulers provide. As soon as you introduce custom constraints like multi-tenant isolation, topology-aware placement, MIG slicing rules, latency targets, custom accounting, or per-job GPU warmup, you will likely reach for a custom GPU controller. Done well, this can turn Kubernetes into a reliable GPU platform. Done poorly, it becomes a fragile mix of schedulers, webhooks, and node scripts.</p>

<h2>1) Start with a clear ownership model</h2>

<p>Before writing code, decide which component owns which responsibility.</p>

<ul>
<li><p>Device plugin (node level): advertises GPU resources and integrates with the kubelet.</p></li>
<li><p>Custom controller (cluster level): enforces policies, manages CRDs, orchestrates lifecycle, and reconciles desired state.</p></li>
<li><p>Scheduler extensions (optional): influence placement decisions using node and pod metadata plus GPU constraints.</p></li>
</ul>

<p>Best practice: avoid putting scheduling logic into a controller unless it is truly necessary. Encode constraints using node labels and taints, pod affinity rules, and scheduler plugins when placement must be dynamic and topology-aware.</p>

<h2>2) Prefer declarative CRDs and reconciliation over imperative scripts</h2>

<p>Custom GPU controllers should act like <a href="https://acecloud.ai/blog/kubernetes-architecture-and-core-components/#:~:text=Kubernetes%27%20architecture%20is%20built%20around,components%20and%20Worker%20Node%20components.">Kubernetes-native components</a>.</p>

<ul>
<li><p>Define a CRD such as <code>GpuSlice</code>, <code>GpuProfile</code>, or <code>GpuAllocationPolicy</code> (see the sketch after section 3).</p></li>
<li><p>Maintain status fields like <code>Allocated</code>, <code>Available</code>, <code>Health</code>, and <code>LastScrubbed</code>.</p></li>
<li><p>Reconcile from current state to desired state continuously.</p></li>
</ul>

<p>Avoid controllers that run a one-time script on nodes and assume it stays correct. GPU environments drift due to driver updates, reboots, MIG changes, and library shifts. Reconciliation is your safety net.</p>

<h2>3) Treat node work as node work: use DaemonSets</h2>

<p>Anything that touches the node should generally live in a DaemonSet, not inside the controller pod. This includes driver validation, NVML checks, MIG partitioning, device resets, and hardware health probing.</p>

<p>A good pattern:</p>

<ul>
<li><p>The controller manages policy and the desired layout.</p></li>
<li><p>A DaemonSet agent performs node-local actions and reports results through CRD status, node annotations, or metrics.</p></li>
<li><p>Agents are idempotent and safe to retry.</p></li>
</ul>

<p>This separation improves security, reduces blast radius, and simplifies debugging.</p>
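<p>To make the CRD-plus-agent pattern from sections 2 and 3 concrete, here is a minimal sketch of what the Go types for a <code>GpuSlice</code> resource might look like, assuming kubebuilder/controller-runtime conventions. The status fields mirror the ones named above; the spec fields and API group are illustrative, not a fixed schema.</p>

```go
// Hypothetical API package for the GpuSlice CRD, following kubebuilder
// conventions. Spec field names are illustrative.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// GpuSliceSpec declares the desired layout for one unit of GPU capacity.
type GpuSliceSpec struct {
	// NodeName pins the slice to a specific node.
	NodeName string `json:"nodeName"`
	// Profile is the MIG or full-GPU profile, e.g. "1g.10gb".
	Profile string `json:"profile"`
	// Tenant identifies the namespace or team the slice is reserved for.
	Tenant string `json:"tenant,omitempty"`
}

// GpuSliceStatus is written by the controller and the node agent, never by users.
type GpuSliceStatus struct {
	Allocated    bool         `json:"allocated"`
	Available    bool         `json:"available"`
	Health       string       `json:"health,omitempty"` // e.g. "ok", "degraded"
	LastScrubbed *metav1.Time `json:"lastScrubbed,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// GpuSlice is the schema for a single allocatable unit of GPU capacity.
type GpuSlice struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   GpuSliceSpec   `json:"spec,omitempty"`
	Status GpuSliceStatus `json:"status,omitempty"`
}
```

<p>The property that matters is the split: users and the controller write spec, while the controller and the node agent report observations back through the status subresource, which is what makes continuous reconciliation possible.</p>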
<h2>4) Make scheduling predictable using extended resources plus constraints</h2>

<p>Use extended resources such as <code>nvidia.com/gpu</code> or your own resource name as the baseline. Then layer constraints on top:</p>

<ul>
<li><p>Topology constraints: label nodes with PCIe, NVLink, or NUMA groupings when relevant.</p></li>
<li><p>Isolation constraints: taints and tolerations for dedicated GPU pools.</p></li>
<li><p>Multi-tenant constraints: per-namespace quotas enforced with ResourceQuota objects.</p></li>
</ul>

<p>If you use MIG, define an explicit mapping between profiles like <code>1g.10gb</code> and <code>2g.20gb</code> and the resources you advertise. Do not rely on best-effort matching.</p>

<h2>5) Use admission control to prevent invalid GPU requests</h2>

<p>Many operational problems come from pods requesting GPUs incorrectly. Add a ValidatingAdmissionWebhook or policy tooling to enforce:</p>

<ul>
<li><p>Only approved GPU resource names</p></li>
<li><p>Allowed GPU counts and MIG profiles</p></li>
<li><p>Required tolerations or node selectors for GPU pools</p></li>
<li><p>Required runtime class or security context settings</p></li>
</ul>

<p>A little validation up front prevents hours of chasing confusing scheduling failures.</p>

<h2>6) Design for failures: GPUs disappear and nodes lie</h2>

<p><a href="https://acecloud.ai/blog/gpu-vs-cpu-for-image-processing/#:~:text=While%20CPUs%20are%20more%20versatile,graphics%2Dintensive%20and%20parallel%20processing%20tasks.">GPUs are not like CPUs</a>. Plan for transient NVML failures, ECC and Xid errors, device resets, and nodes restarting mid-job.</p>

<p>Best practices:</p>

<ul>
<li><p>Health checks should use multiple signals such as NVML, device files, and workload probes.</p></li>
<li><p>Quarantine unhealthy GPUs using labels or taints like <code>gpu.health=degraded</code>, as in the sketch after this list.</p></li>
<li><p>Provide a remediation path: drain GPU workloads, reset the device, revalidate, then re-enable.</p></li>
</ul>
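<p>Here is a minimal sketch of the quarantine step, assuming client-go and the <code>gpu.health=degraded</code> taint convention above. The function name and surrounding wiring are hypothetical; the point is that the operation is idempotent, so the node agent or controller can retry it safely.</p>

```go
package agent

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// quarantineNode taints a node so new GPU workloads stop scheduling onto it.
// Calling it repeatedly is safe: an existing matching taint is left alone.
func quarantineNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	taint := corev1.Taint{
		Key:    "gpu.health",
		Value:  "degraded",
		Effect: corev1.TaintEffectNoSchedule,
	}

	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Value == taint.Value && t.Effect == taint.Effect {
			return nil // already quarantined; nothing to do
		}
	}

	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

<p>Remediation then reverses the same steps: drain the GPU workloads, reset the device, revalidate, and remove the taint once the node agent reports the GPU healthy again.</p>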
<h2>7) Keep controller logic minimal and push policy into config</h2>

<p>Hardcoding policy like fair-share rules, tenant limits, and placement heuristics makes upgrades risky. Prefer:</p>

<ul>
<li><p>ConfigMaps for tunables</p></li>
<li><p>CRDs for per-team and per-namespace policies</p></li>
<li><p>Feature flags for experimental behavior</p></li>
</ul>

<p>Controllers should be stable plumbing, not a policy playground.</p>

<h2>8) Observability is mandatory</h2>

<p>You need to be able to answer questions like: why did this pod not get a GPU, and who owns GPU 2 on node X?</p>

<p>Best practices:</p>

<ul>
<li><p>Emit Kubernetes Events on allocation and rejection.</p></li>
<li><p>Export Prometheus metrics for GPU health, allocations, MIG layout, reconciliation loops, and webhook rejections.</p></li>
<li><p>Correlate pod to allocation to node to device ID.</p></li>
<li><p>Store allocation decisions in CRD status for auditability.</p></li>
</ul>

<p>If your system cannot explain itself, it will not be trusted.</p>

<h2>9) Security and least privilege</h2>

<p>GPU node operations often require elevated privileges, but controllers usually do not.</p>

<p>Best practices:</p>

<ul>
<li><p>Run the controller with minimal RBAC and no host access.</p></li>
<li><p>Make the node agent the only privileged component.</p></li>
<li><p>Use Pod Security Admission or equivalent guardrails.</p></li>
<li><p>Avoid broad hostPath mounts; scope them to what is required, such as <code>/dev/nvidia*</code> and relevant sockets.</p></li>
</ul>

<h2>10) Test upgrades, mixed clusters, and rollback</h2>

<p>GPU stacks change frequently, so make upgrades boring:</p>

<ul>
<li><p>Test mixed node pools with different GPU SKUs and driver versions.</p></li>
<li><p>Validate behavior during rolling upgrades of nodes and DaemonSets.</p></li>
<li><p>Keep CRDs backward compatible or provide migrations.</p></li>
<li><p>Ensure rollback works and does not strand allocations.</p></li>
</ul>

<h2>A solid reference architecture</h2>

<p>A proven layout looks like this:</p>

<ul>
<li><p>Device plugin (DaemonSet): advertises GPU or MIG resources to the kubelet.</p></li>
<li><p>GPU node agent (DaemonSet): health checks, configuration enforcement, and node reporting.</p></li>
<li><p>Custom GPU controller (Deployment): CRDs, policy, reconciliation, events, and metrics.</p></li>
<li><p>Admission webhook (Deployment): validation and mutation for GPU pods; optional but valuable.</p></li>
<li><p>Scheduler plugin (optional): topology-aware placement when needed.</p></li>
</ul>

<h2>Closing thought</h2>

<p>The best custom GPU controllers feel like native Kubernetes components: declarative, reconciled, observable, and safe to retry. Build around CRDs, a clean node-agent separation, strong validation, and failure-first design, and you get a GPU platform teams can rely on without constant firefighting.</p>
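<p>To tie these pieces together, here is a minimal reconcile skeleton in the controller-runtime style, reusing the hypothetical <code>GpuSlice</code> type sketched earlier. Treat it as a starting point under those assumptions, not a reference implementation.</p>

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical API package holding the GpuSlice types sketched above.
	v1alpha1 "example.com/gpu-controller/api/v1alpha1"
)

// GpuSliceReconciler drives observed GpuSlice state toward the declared spec.
type GpuSliceReconciler struct {
	client.Client
}

// Reconcile is idempotent and safe to retry: it reads the current object,
// compares it with the desired layout, and records the outcome in status.
// Node-local work stays with the DaemonSet agent, which consumes the status.
func (r *GpuSliceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var slice v1alpha1.GpuSlice
	if err := r.Get(ctx, req.NamespacedName, &slice); err != nil {
		// The object may have been deleted; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Policy decision only: mark the slice available when the node agent
	// reports it healthy and nothing has claimed it.
	slice.Status.Available = slice.Status.Health == "ok" && !slice.Status.Allocated

	if err := r.Status().Update(ctx, &slice); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler with a controller-runtime manager.
func (r *GpuSliceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.GpuSlice{}).
		Complete(r)
}
```

<p>The events, metrics, and allocation records described in section 8 hang naturally off this loop, which keeps the controller itself small and leaves the policy in CRDs and config.</p>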