---
author: "Tom Avital"
date-created: 20250907
tags:
- MIG
- H100
- OCP-AI
- configuration
description: |
  MIG slicing guide for an H100 GPU node within OpenShift AI using the nvidia-gpu-operator
---
# 🚀 MIG Setup Guide (H100 + OpenShift + NVIDIA GPU Operator)
## 1. Decide MIG Strategy (`ClusterPolicy`)
The MIG strategy is set in the **`ClusterPolicy`** CRD (`gpu-cluster-policy`), which is managed by the NVIDIA GPU Operator. It controls **how GPUs are exposed to Kubernetes**.
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: mixed   # options: single, mixed, none
```
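You can read or change the strategy directly on the existing `gpu-cluster-policy` object; a merge patch is one way to flip it without editing the full manifest:
```bash
# Show the strategy currently set on the ClusterPolicy
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}{"\n"}'

# Switch strategy in place (merge patch against the field shown above)
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"mig": {"strategy": "mixed"}}}'
```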
### Options
* **`single`**
  * MIG mode enabled; every GPU on the node is carved into the same slice type.
  * Slices are still advertised as plain `nvidia.com/gpu`, so existing manifests keep working.
  * Best for: homogeneous clusters where every workload gets an identical slice.
* **`mixed`**
  * MIG mode enabled; GPUs can expose different slice types.
  * MIG Manager watches node labels (`nvidia.com/mig.config`) and applies layouts.
  * Kubernetes advertises MIG resources by profile name (`nvidia.com/mig-1g.10gb`, etc.).
  * Best for: inference clusters, multi-tenant setups, running several smaller jobs concurrently.
* **`none`**
  * The device plugin ignores MIG; GPUs are exposed as whole devices (`nvidia.com/gpu`).
  * Best for: full-GPU training/fine-tuning of large models, or managing MIG manually with `nvidia-smi`.
💡 **Rule of thumb**:
* Big training on whole GPUs → `none`
* One slice size everywhere → `single`
* Multi-tenant inference with mixed slice sizes → `mixed`
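A quick sanity check of what the chosen strategy exposes is to dump the node's NVIDIA allocatable resources: with `mixed` you should see `nvidia.com/mig-*` entries, while `single` and `none` advertise plain `nvidia.com/gpu`:
```bash
# Print only the NVIDIA-related allocatable resources on the node
oc get node <GPU_NODE> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep nvidia.com
```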
## 2. Discover Available MIG Configs
The GPU Operator includes a ConfigMap `default-mig-parted-config` in the `nvidia-gpu-operator` namespace. It defines valid MIG layouts.
```bash
oc -n nvidia-gpu-operator get cm default-mig-parted-config -o yaml | less
```
The layout names under `mig-configs:` (inside the `config.yaml` entry in `data:`) are the valid values for `nvidia.com/mig.config`.
Examples:
* `all-1g.10gb` (H100 80 GB → 7 × 10 GB slices)
* `all-1g.24gb` (H100 94 GB → 4 × 24 GB slices)
* `all-2g.20gb` (80 GB → 3 × 20 GB slices)
* `all-3g.40gb` (80 GB → 2 × 40 GB slices)
* `all-balanced` (a mix of 1g/2g/3g slices)
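If you only want the layout names, one option is to extract them from the ConfigMap directly; this sketch assumes the layouts sit under a `config.yaml` data key, indented two spaces below `mig-configs:`, which is how the default ConfigMap is laid out:
```bash
# List the layout names (keys under mig-configs) from the default config
oc -n nvidia-gpu-operator get cm default-mig-parted-config -o jsonpath='{.data.config\.yaml}' \
  | grep -E '^  [A-Za-z0-9.-]+:' | tr -d ' :'
```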
## 3. Apply a MIG Layout via Node Label
Label your GPU node with one of the valid configs. MIG Manager will reconfigure the GPU accordingly:
```bash
oc label node <GPU_NODE> nvidia.com/mig.config=all-2g.20gb --overwrite
```
Operator workflow:
1. MIG Manager detects the label.
2. It may drain workloads.
3. GPU is carved into slices.
4. Node resource list updates (e.g., `nvidia.com/mig-2g.20gb: 3`).
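While this happens you can watch the requested layout and MIG Manager's progress side by side; MIG Manager reports its state in the `nvidia.com/mig.config.state` label (typically `pending`, then `success` or `failed`):
```bash
# Watch the requested layout and the reported state until it settles
oc get node <GPU_NODE> -L nvidia.com/mig.config -L nvidia.com/mig.config.state -w
```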
## 4. Monitor Progress
Watch operator pods as MIG Manager applies the layout:
```bash
oc -n nvidia-gpu-operator get pods | egrep "mig|driver|plugin"
oc -n nvidia-gpu-operator logs -l app=nvidia-mig-manager -f
```
Check node resources:
```bash
oc describe node <GPU_NODE> | egrep "nvidia.com/mig-|nvidia.com/mig.config"
```
When complete, allocatable MIG resources appear on the node.
---
## 5. Verify with `crictl` + `nvidia-smi`
Use `oc debug` to get into the node and exec into the driver container:
```bash
oc debug node/<GPU_NODE>
chroot /host
# Find the driver container
crictl ps | grep nvidia
# Exec into it
crictl exec -it <container-id> nvidia-smi -L
# List GPU instances and compute instances
crictl exec -it <container-id> nvidia-smi mig -lgi
crictl exec -it <container-id> nvidia-smi mig -lci
```
This shows which MIG slices exist and their UUIDs.
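An alternative that avoids node access is a throwaway pod that requests one slice and lists what it sees. This is only a sketch: the pod name is arbitrary, the image can be any CUDA base image your cluster can pull, and the resource name must match your layout (here the `all-2g.20gb` example from above):
```bash
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # swap in any CUDA base image available to you
    command: ["nvidia-smi", "-L"]                 # lists the single MIG device the pod was given
    resources:
      limits:
        nvidia.com/mig-2g.20gb: '1'               # must match a slice type from your layout
EOF

# Once the image has pulled and the pod has completed:
oc logs pod/mig-smoke-test     # expect one MIG 2g.20gb device in the output
oc delete pod mig-smoke-test
```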
## 6. Re-labeling an Already Labeled Node
If the node already has a MIG config (e.g., `all-disabled`):
```bash
# Safest approach: drain first
oc adm cordon <GPU_NODE>
oc adm drain <GPU_NODE> --ignore-daemonsets --delete-emptydir-data
# Change layout
oc label node <GPU_NODE> nvidia.com/mig.config=all-3g.40gb --overwrite
# Bring it back
oc adm uncordon <GPU_NODE>
```
Alternative: remove the label, then re-add it:
```bash
oc label node <GPU_NODE> nvidia.com/mig.config-
oc label node <GPU_NODE> nvidia.com/mig.config=all-3g.40gb
```
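Either way, it is worth waiting until MIG Manager reports success before putting GPU workloads back on the node; one way is to poll the state label mentioned earlier:
```bash
# Block until the new layout has been applied, then confirm the slice counts
until [ "$(oc get node <GPU_NODE> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')" = "success" ]; do
  sleep 10
done
oc describe node <GPU_NODE> | grep nvidia.com/mig-
```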
## 7. Deployment Redirection (Resources)
A workload is pinned to a specific slice type through the resource requests and limits in its Deployment (or Pod) spec:
```yaml
spec:
  containers:
  - resources:
      limits:
        cpu: '6'
        memory: 18Gi
        nvidia.com/mig-1g.24gb: '1'
      requests:
        cpu: '6'
        memory: 18Gi
        nvidia.com/mig-1g.24gb: '1'
```
Requesting `nvidia.com/mig-1g.24gb` makes the scheduler place each replica on a GPU node with a free `1g.24gb` slice, and the device plugin attaches exactly that slice to the container.
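After the rollout you can confirm from inside a running replica that it was handed exactly one slice (`<YOUR_DEPLOYMENT>` is a placeholder; this assumes the container image ships `nvidia-smi`, as CUDA base images do):
```bash
# Exec into one pod of the deployment and list the GPU devices it can see
oc exec deploy/<YOUR_DEPLOYMENT> -- nvidia-smi -L   # expect a single MIG 1g.24gb device
```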
# ✅ Summary
* **MIG strategy** (`ClusterPolicy`): `none` (whole GPUs), `single` (uniform MIG slices exposed as `nvidia.com/gpu`), `mixed` (MIG slices exposed by profile name).
* **Layouts**: defined in `default-mig-parted-config`.
* **Node labels**: `nvidia.com/mig.config=<layout>` tells MIG Manager how to carve GPUs.
* **Progress**: watch Operator pods, check node allocatable resources.
* **Verification**: `nvidia-smi` via driver container (`crictl exec`).
* **Relabeling**: use `--overwrite` or remove+add; drain workloads first for safety.
* **Deployment setup**: request the slice type (e.g., `nvidia.com/mig-1g.24gb`) in the workload's resource limits and requests.