<h1>Best Practices for Combining Kubernetes with Custom GPU Controllers</h1>

<p>Running GPUs on Kubernetes is easy until you need more than what standard device plugins and schedulers provide. As soon as you introduce custom constraints like multi-tenant isolation, topology-aware placement, MIG slicing rules, latency targets, custom accounting, or per-job GPU warmup, you will likely reach for a custom GPU controller. Done well, this can turn Kubernetes into a reliable GPU platform. Done poorly, it becomes a fragile mix of schedulers, webhooks, and node scripts.</p>
<h2>1) Start with a clear ownership model</h2>
<p>Before writing code, decide which component owns which responsibility.</p>
<ul>
<li>
<p>Device plugin (node level): advertises GPU resources and integrates with the kubelet.</p>
</li>
<li>
<p>Custom controller (cluster level): enforces policies, manages CRDs, orchestrates lifecycle, and reconciles desired state.</p>
</li>
<li>
<p>Scheduler extensions (optional): influence placement decisions using node and pod metadata plus GPU constraints.</p>
</li>
</ul>
<p>Best practice: avoid putting scheduling logic into a controller unless it is truly necessary. Encode constraints using node labels and taints, pod affinity rules, and scheduler plugins when placement must be dynamic and topology-aware.</p>
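<p>As a sketch of that approach, a dedicated GPU pool can be expressed with a node label and taint that GPU pods explicitly opt into. The pool label <code>gpu-pool=a100-dedicated</code> and the image name below are placeholders, not required conventions:</p>
<pre><code class="language-yaml"># Assumes the pool was labeled and tainted out of band, for example:
#   kubectl label node NODE_NAME gpu-pool=a100-dedicated
#   kubectl taint node NODE_NAME gpu-pool=a100-dedicated:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    gpu-pool: a100-dedicated          # keep placement on the dedicated pool
  tolerations:
    - key: gpu-pool
      operator: Equal
      value: a100-dedicated
      effect: NoSchedule              # only pods that opt in can land here
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
</code></pre>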
<h2>2) Prefer declarative CRDs and reconciliation over imperative scripts</h2>
<p>Custom GPU controllers should act like <a href="https://acecloud.ai/blog/kubernetes-architecture-and-core-components/#:~:text=Kubernetes%27%20architecture%20is%20built%20around,components%20and%20Worker%20Node%20components.">Kubernetes-native components</a>.</p>
<ul>
<li>
<p>Define a CRD such as <code>GpuSlice</code>, <code>GpuProfile</code>, or <code>GpuAllocationPolicy</code>.</p>
</li>
<li>
<p>Maintain status fields like <code>Allocated</code>, <code>Available</code>, <code>Health</code>, and <code>LastScrubbed</code>.</p>
</li>
<li>
<p>Reconcile from current state to desired state continuously.</p>
</li>
</ul>
<p>Avoid controllers that run a one-time script on nodes and assume it stays correct. GPU environments drift due to driver updates, reboots, MIG reconfiguration, and library version changes. Reconciliation is your safety net.</p>
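<p>For illustration, a <code>GpuSlice</code> object with such status fields might look like the following. The API group, version, and field layout are assumptions for the sketch, not a published schema:</p>
<pre><code class="language-yaml">apiVersion: gpu.example.com/v1alpha1   # hypothetical group and version
kind: GpuSlice
metadata:
  name: node-a-gpu0-slice1
spec:
  nodeName: node-a
  migProfile: 1g.10gb                  # desired profile for this slice
status:
  allocated: true                      # reconciled from observed node state
  available: false
  health: Healthy
  lastScrubbed: "2025-01-01T00:00:00Z"
  boundPod: team-a/training-job-1234   # who currently owns the slice
</code></pre>
<p>The controller reconciles spec against what the node agent reports and updates status, so drift after a reboot or driver change is detected and corrected rather than silently accumulating.</p>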
<h2>3) Treat node work as node work: use DaemonSets</h2>
<p>Anything that touches the node should generally live in a DaemonSet, not inside the controller pod. This includes driver validation, NVML checks, MIG partitioning, device resets, and hardware health probing.</p>
<p>A good pattern:</p>
<ul>
<li>
<p>Controller manages policy and desired layout.</p>
</li>
<li>
<p>DaemonSet agent performs node-local actions and reports results through CRD status, node annotations, or metrics.</p>
</li>
<li>
<p>Agents are idempotent and safe to retry.</p>
</li>
</ul>
<p>This separation improves security, reduces blast radius, and simplifies debugging.</p>
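<p>For concreteness, a skeleton of such a node agent might look like the DaemonSet below. The image, labels, and host mounts are placeholders; scope them to what your agent actually needs:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-node-agent
  namespace: gpu-system
spec:
  selector:
    matchLabels:
      app: gpu-node-agent
  template:
    metadata:
      labels:
        app: gpu-node-agent
    spec:
      nodeSelector:
        gpu.example.com/present: "true"     # hypothetical label: run only on GPU nodes
      tolerations:
        - operator: Exists                  # tolerate GPU-pool and quarantine taints
      containers:
        - name: agent
          image: registry.example.com/gpu-node-agent:latest   # placeholder image
          securityContext:
            privileged: true                # node-local device work stays here, not in the controller
          volumeMounts:
            - name: dev
              mountPath: /dev
      volumes:
        - name: dev
          hostPath:
            path: /dev                      # narrow this if your agent allows it
</code></pre>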
<h2>4) Make scheduling predictable using extended resources plus constraints</h2>
<p>Use extended resources such as <code>nvidia.com/gpu</code> or your own resource name as the baseline. Then layer constraints:</p>
<ul>
<li>
<p>Topology constraints: label nodes with PCIe, NVLink, or NUMA groupings when relevant.</p>
</li>
<li>
<p>Isolation constraints: taints and tolerations for dedicated GPU pools.</p>
</li>
<li>
<p>Multi-tenant constraints: per-namespace ResourceQuotas covering GPU resources.</p>
</li>
</ul>
<p>If using MIG, define an explicit mapping from profiles like <code>1g.10gb</code> and <code>2g.20gb</code> to the resources you advertise. Do not rely on best-effort matching.</p>
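<p>A minimal sketch of the explicit side of that mapping, assuming the NVIDIA device plugin's mixed MIG strategy, where each profile is typically advertised as its own extended resource (verify the exact names your plugin exposes):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
    - name: server
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # request the exact profile, no best-effort matching
</code></pre>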
<h2>5) Use admission control to prevent invalid GPU requests</h2>
<p>Many operational problems come from pods requesting GPUs incorrectly. Add a validating admission webhook, a CEL-based ValidatingAdmissionPolicy, or other policy tooling to enforce:</p>
<ul>
<li>
<p>Only approved GPU resource names</p>
</li>
<li>
<p>Allowed GPU counts and MIG profiles</p>
</li>
<li>
<p>Required tolerations or node selectors for GPU pools</p>
</li>
<li>
<p>Required runtime class or security context settings</p>
</li>
</ul>
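<p>One way to express the first rule is a ValidatingAdmissionPolicy (GA in Kubernetes 1.30). The approved resource names below are examples, and a separate ValidatingAdmissionPolicyBinding is still required to enforce it:</p>
<pre><code class="language-yaml">apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: approved-gpu-resources
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        object.spec.containers.all(c,
        !has(c.resources) || !has(c.resources.limits) ||
        c.resources.limits.all(k,
        !k.startsWith('nvidia.com/') ||
        k in ['nvidia.com/gpu', 'nvidia.com/mig-1g.10gb', 'nvidia.com/mig-2g.20gb']))
      message: "Pod requests a GPU resource name that is not on the approved list."
</code></pre>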
<p>A little validation up front prevents hours of chasing confusing scheduling failures.</p>
<h2>6) Design for failures: GPUs disappear and nodes lie</h2>
<p><a href="https://acecloud.ai/blog/gpu-vs-cpu-for-image-processing/#:~:text=While%20CPUs%20are%20more%20versatile,graphics%2Dintensive%20and%20parallel%20processing%20tasks.">GPUs are not like CPUs</a>. Plan for transient NVML failures, ECC and Xid errors, device resets, and nodes restarting mid-job.</p>
<p>Best practices:</p>
<ul>
<li>
<p>Health checks should use multiple signals such as NVML, device files, and workload probes.</p>
</li>
<li>
<p>Quarantine unhealthy GPUs using labels or taints like <code>gpu.health=degraded</code>.</p>
</li>
<li>
<p>Provide a remediation path: drain GPU workloads, reset the device, revalidate, then re-enable.</p>
</li>
</ul>
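<p>The quarantine step can be as simple as the node agent or controller adding a taint and label to the affected Node. The node name here is hypothetical, and in practice this is applied as a patch rather than a full manifest:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Node
metadata:
  name: gpu-node-7             # hypothetical node name
  labels:
    gpu.health: degraded       # visible to dashboards and selectors
spec:
  taints:
    - key: gpu.health
      value: degraded
      effect: NoSchedule       # keep new GPU work off until revalidation clears it
</code></pre>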
<h2>7) Keep controller logic minimal and push policy into config</h2>
<p>Hardcoding policies such as fair-share rules, tenant limits, and placement heuristics makes upgrades risky. Prefer:</p>
<ul>
<li>
<p>ConfigMaps for tunables</p>
</li>
<li>
<p>CRDs for per-team and per-namespace policies</p>
</li>
<li>
<p>Feature flags for experimental behavior</p>
</li>
</ul>
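<p>A sketch of what those tunables might look like; the keys and defaults are assumptions about one possible controller, not a fixed schema:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-controller-config
  namespace: gpu-system
data:
  config.yaml: |
    reconcileInterval: 30s
    defaultMigProfile: 1g.10gb
    maxGpusPerPod: 8
    featureFlags:
      topologyAwarePacking: false   # experimental behavior stays behind a flag
</code></pre>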
<p>Controllers should be stable plumbing, not a policy playground.</p>
<h2>8) Observability is mandatory</h2>
<p>You need to be able to answer questions like: why did this pod not get a GPU, and who owns GPU 2 on node X?</p>
<p>Best practices:</p>
<ul>
<li>
<p>Emit Kubernetes Events on allocation and rejection.</p>
</li>
<li>
<p>Export Prometheus metrics for GPU health, allocations, MIG layout, reconciliation loops, and webhook rejects.</p>
</li>
<li>
<p>Correlate pod to allocation to node to device ID.</p>
</li>
<li>
<p>Store allocation decisions in CRD status for auditability.</p>
</li>
</ul>
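<p>A status fragment on the <code>GpuSlice</code> object sketched earlier can carry that correlation; the field names are illustrative:</p>
<pre><code class="language-yaml">status:
  health: Healthy
  allocated: true
  nodeName: node-a
  deviceUUID: GPU-0000aaaa-1111-2222-3333-444455556666   # placeholder device identifier
  boundPod: team-a/training-job-1234                     # pod -> allocation -> node -> device
  lastTransition: "2025-01-01T00:00:00Z"
  reason: AllocatedByPolicy                              # mirrored into a Kubernetes Event
</code></pre>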
<p>If your system cannot explain itself, it will not be trusted.</p>
<h2>9) Security and least privilege</h2>
<p>GPU node operations often require elevated privileges, but controllers usually do not.</p>
<p>Best practices:</p>
<ul>
<li>
<p>Run the controller with minimal RBAC and no host access.</p>
</li>
<li>
<p>Make the node agent the only privileged component.</p>
</li>
<li>
<p>Use Pod Security Admission or equivalent guardrails.</p>
</li>
<li>
<p>Avoid broad hostPath mounts and scope to what is required, such as <code>/dev/nvidia*</code> and relevant sockets.</p>
</li>
</ul>
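<p>As a sketch of the controller side, its RBAC can stay narrow; adjust the API group to match your CRDs (<code>gpu.example.com</code> is a placeholder):</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-controller
rules:
  - apiGroups: ["gpu.example.com"]
    resources: ["gpuslices", "gpuslices/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]      # read-only node access
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]           # emit allocation and rejection events
</code></pre>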
<h2>10) Test upgrades, mixed clusters, and rollback</h2>
<p>GPU stacks change frequently, so make upgrades boring:</p>
<ul>
<li>
<p>Test mixed node pools with different GPU SKUs and driver versions.</p>
</li>
<li>
<p>Validate behavior during rolling upgrades of nodes and DaemonSets.</p>
</li>
<li>
<p>Keep CRDs backward compatible or provide migrations.</p>
</li>
<li>
<p>Ensure rollback works and does not strand allocations.</p>
</li>
</ul>
<h2>A solid reference architecture</h2>
<p>A proven layout looks like this:</p>
<ul>
<li>
<p>Device plugin (DaemonSet): advertises GPU or MIG resources to the kubelet.</p>
</li>
<li>
<p>GPU node agent (DaemonSet): health, configuration enforcement, and node reporting.</p>
</li>
<li>
<p>Custom GPU controller (Deployment): CRDs, policy, reconciliation, events, and metrics.</p>
</li>
<li>
<p>Admission webhook (Deployment): validation and mutation for GPU pods, optional but valuable.</p>
</li>
<li>
<p>Scheduler plugin (optional): topology-aware placement when needed.</p>
</li>
</ul>
<h2>Closing thought</h2>
<p>The best custom GPU controllers feel like native Kubernetes components: declarative, reconciled, observable, and safe to retry. Build around CRDs, a clean node-agent separation, strong validation, and failure-first design, and you get a GPU platform teams can rely on without constant firefighting.</p>