# Install NVIDIA GPU Operator on SL Micro

<style>
.indent-title-1{ margin-left: 1em; }
.indent-title-2{ margin-left: 2em; }
.indent-title-3{ margin-left: 3em; }
</style>

## 1. Introduction to the NVIDIA GPU Operator

<div class="indent-title-1">

This article explains the NVIDIA GPU Operator, outlines the NVIDIA GPU components it manages, and summarizes the benefits of using it.

</div>

### 1.1. What is the NVIDIA GPU Operator?

<div class="indent-title-1">

The NVIDIA GPU Operator is a Kubernetes operator that simplifies the management and deployment of NVIDIA GPU resources in a Kubernetes cluster. It automates the configuration and monitoring of NVIDIA GPU drivers and related components such as CUDA, container runtimes, and other GPU-related software.

</div>

### 1.2. How does the NVIDIA GPU Operator work?

<div class="indent-title-1">

The NVIDIA GPU Operator follows this workflow:

1. **Operator deployment**: The NVIDIA GPU Operator is deployed as a Helm chart or with Kubernetes manifests.
2. **Node labeling & GPU discovery**: After installation, the operator deploys the GPU Feature Discovery (GFD) daemon, which scans the hardware of each node for NVIDIA GPUs. It labels the nodes with GPU-specific information, making it easier for Kubernetes to schedule GPU workloads based on the available hardware.
3. **NVIDIA Container Toolkit configuration**: The GPU Operator installs and configures the NVIDIA Container Toolkit so that GPU-accelerated containers can run in Kubernetes.
4. **CUDA runtime and libraries**: The operator ensures that the CUDA toolkit is installed correctly, allowing applications that require CUDA to run seamlessly without manual intervention.
5. **Validation and health monitoring**: Once the environment is set up, the operator continuously monitors the health of the GPU resources. It also exposes health metrics that administrators can inspect and use for decision-making.
6. **Scheduling GPU workloads**: With the environment configured, you can deploy workloads that require GPU acceleration. Kubernetes automatically schedules them onto GPU-enabled nodes using the node labels and the available GPU resources.

</div>

### 1.3. Benefits of using the NVIDIA GPU Operator

<div class="indent-title-1">

The main benefits of using the NVIDIA GPU Operator are:

* **Automated setup**: No manual configuration is required.
* **Cluster-wide management**: Works across the entire Kubernetes cluster and scales as nodes are added or removed.
* **Simplified updates**: GPU-related components are updated automatically.
* **Optimized GPU usage**: Ensures GPU resources are allocated and used efficiently.

</div>

---

## 2. Hands-on

### 2.1. Environment

- GPU: NVIDIA GeForce RTX 3070
- OS: SUSE Linux Micro 6.0
- K8s: RKE2 v1.31.3+rke2r1

### 2.2. Manually installing the NVIDIA GPU driver

Because the NVIDIA GPU Operator does not yet support automated installation of the NVIDIA GPU driver on SUSE Linux Micro 6.0, the driver has to be installed manually.

1. 
Start a transactional-update shell

<div class="indent-title-2">

```
sudo transactional-update shell
```

</div>

2. Register the system with the SUSE Customer Center (SCC)

<div class="indent-title-2">

```
SUSEConnect -r yourcode
```

</div>

3. Enable the SL-Micro-Extras repository

<div class="indent-title-2">

```
SUSEConnect -p SL-Micro-Extras/6.0/x86_64
```

</div>

4. Install the required packages, including gcc (the C compiler), make (a utility that automates the build process), development packages, and the kernel headers (the kernel source files needed to build kernel modules)

<div class="indent-title-2">

```
zypper -n install gcc \
  binutils \
  make \
  kernel-default-6.4.0-25.1 \
  kernel-macros-6.4.0-25.1 \
  kernel-devel-6.4.0-25.1 \
  kernel-default-devel-6.4.0-25.1
```

</div>

5. Go to the [NVIDIA driver download page](https://www.nvidia.com/zh-tw/drivers/) and search for the driver versions available for your GPU

<div class="indent-title-2">

Using the RTX 3070 as an example:

* Product Type: GeForce
* Product Series: GeForce RTX 30 Series
* Product: GeForce RTX 3070
* Operating System: Linux 64-bit
* Language: Chinese (Traditional)

Then click the green "**Find**" button at the bottom.

![image](https://hackmd.io/_uploads/SyqmIf3L1e.png =70%x)

</div>

6. Click "**View**" on the version you want to install

<div class="indent-title-2">

![image](https://hackmd.io/_uploads/r1NydG38yg.png)

</div>

7. Right-click the green "**Download**" button and choose "**Copy link address**"

<div class="indent-title-2">

![image](https://hackmd.io/_uploads/SJA8df3Uye.png)

</div>

8. Back in the terminal, download the driver installer

<div class="indent-title-2">

```
cd /root; curl -LO https://tw.download.nvidia.com/XFree86/Linux-x86_64/570.133.07/NVIDIA-Linux-x86_64-570.133.07.run
```

> Note: at this point you are still inside the transactional-update shell.

</div>

9. Install the driver

<div class="indent-title-2">

```
sh /root/NVIDIA-Linux-x86_64-570.133.07.run --silent
```

> `--silent` disables the interactive user interface and answers all of the questions that would normally be asked with their default answers. Nothing is printed except error messages.

</div>

10. Exit the transactional-update shell

<div class="indent-title-2">

```
exit
```

</div>

11. Reboot

<div class="indent-title-2">

```
sudo reboot
```

</div>

12. SSH back into the host
13. 
Verify that the driver was installed successfully

<div class="indent-title-2">

```
nvidia-smi
```

Sample output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:00:10.0 Off |                  N/A |
|  0%   47C    P0             48W /  270W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

</div>

### 2.3. Installing the NVIDIA GPU Operator

1. Install the Helm CLI

<div class="indent-title-2">

```
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```

</div>

2. Add the NVIDIA Helm chart repository

<div class="indent-title-2">

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
helm repo update
```

</div>

3. 
Install the NVIDIA GPU Operator

<div class="indent-title-2">

```
helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    --set driver.enabled=false
```

> `driver.enabled=false` tells the operator not to deploy its own driver containers, because the driver was installed manually in section 2.2. The `CONTAINERD_*` settings point the NVIDIA Container Toolkit at RKE2's containerd configuration and socket.

</div>

4. Check that the NVIDIA GPU Operator is running properly

<div class="indent-title-2">

```
kubectl -n gpu-operator get pods
```

Sample output:

```
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4ml7q                                   1/1     Running     0          61m
gpu-operator-55566cdcc9-vwcvg                                 1/1     Running     0          62m
gpu-operator-node-feature-discovery-gc-7f546fd4bc-prjjk       1/1     Running     0          62m
gpu-operator-node-feature-discovery-master-8448c8896c-pwwbf   1/1     Running     0          62m
gpu-operator-node-feature-discovery-worker-wbq94              1/1     Running     0          62m
nvidia-container-toolkit-daemonset-4qmzq                      1/1     Running     0          61m
nvidia-cuda-validator-9bg2x                                   0/1     Completed   0          60m
nvidia-dcgm-exporter-jbbj6                                    1/1     Running     0          61m
nvidia-device-plugin-daemonset-zsz2d                          1/1     Running     0          61m
nvidia-operator-validator-4jnm8                               1/1     Running     0          61m
```

</div>

5. 
Test GPU pod sample 1

<div class="indent-title-2">

```
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

Check the pod status:

```
kubectl get pods
```

Sample output:

```
NAME             READY   STATUS      RESTARTS   AGE
cuda-vectoradd   0/1     Completed   0          59m
```

View the logs to confirm the test passed:

```
kubectl logs cuda-vectoradd
```

Sample output:

```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

</div>

6. Test GPU pod sample 2

<div class="indent-title-2">

```
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF
```

Check the pod status:

```
kubectl get pods
```

Sample output:

```
NAME                  READY   STATUS      RESTARTS   AGE
cuda-vectoradd        0/1     Completed   0          66m
nbody-gpu-benchmark   0/1     Completed   0          65m
```

View the logs to confirm the test passed:

```
kubectl logs nbody-gpu-benchmark
```

Sample output:

```
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance)
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation)
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3070]
47104 bodies, total time for 10 iterations: 37.437 ms
= 592.679 billion interactions per second
= 11853.577 single-precision GFLOP/s at 20 flops per interaction
```

</div>

## References

- [Installing the NVIDIA GPU Operator - SUSE Docs](https://documentation.suse.com/suse-ai/1.0/html/NVIDIA-Operator-installation/index.html)
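As a closing note, the pod health check from section 2.3 can be scripted. The sketch below is a hypothetical helper (the function name `check_gpu_pods` is my own, not part of any NVIDIA or SUSE tooling); it parses `kubectl get pods` output from stdin, so running it against a live cluster assumes working `kubectl` access:

```shell
#!/bin/sh
# check_gpu_pods: read `kubectl -n gpu-operator get pods` output on stdin
# and flag any pod whose STATUS is neither Running nor Completed.
# Exits 0 when everything looks healthy, 1 otherwise.
check_gpu_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" {
         bad++; print "NOT READY: " $1 " (" $3 ")"
       }
       END { exit (bad ? 1 : 0) }'
}

# Typical usage on the cluster:
#   kubectl -n gpu-operator get pods | check_gpu_pods && echo "gpu-operator healthy"
```

Because the helper only reads text, it is easy to dry-run by piping in saved `kubectl` output before wiring it into a cron job or monitoring hook.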