Configs
===

###### tags: `SlinkyProject`
###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `gres.conf`

<br>

[TOC]

<br>

## Configs

### Commands

```
scontrol reconfigure
```

<br>

---

## slurm.conf

- [slurm / doc](https://slurm.schedmd.com/slurm.conf.html)
    - NODESET CONFIGURATION
        - NodeSet
            - Unique name for a set of nodes. Must not overlap with any NodeName definitions.
            - i.e. `NodeSet=xxx` conflicts with `NodeName=xxx`; the two must not use the same name.

### Slinky (slurm-operator 0.2.x)

- [Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads](https://www.nicktailor.com/?p=2024#:~:text=MaxTime%3DINFINITE%20State%3DUP)

```yaml=
controller:
  config:
    slurm.conf: |
      # Basic cluster configuration
      ClusterName=slinky-cluster
      ControlMachine=slurm-controller-0

      # Enable container support
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup,task/affinity
      PluginDir=/usr/lib64/slurm

      # Authentication
      AuthType=auth/munge

      # Node configuration
      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP

      # Accounting
      AccountingStorageType=accounting_storage/slurmdbd
      AccountingStorageHost=slurm-accounting-0
```

### Slinky (slurm-operator 0.3.x)

- [slurm-operator/helm/slurm/templates/controller/slurm-configmap.yaml](https://github.com/SlinkyProject/slurm-operator/blob/main/helm/slurm/templates/controller/slurm-configmap.yaml#L18-L121)

```yaml=
slurm:
  #
  # -- (map[string]string | map[string][]string)
  # Extra slurm configuration lines to append to `slurm.conf`, represented as a string or a map.
  # WARNING: Values can override existing ones.
  # Ref: https://slurm.schedmd.com/slurm.conf.html
  extraSlurmConf:
    # MinJobAge: 2
    # MaxNodeCount: 1024
    # error
    #ContainerPlugin: container/pyxis

    ### LOGGING ###
    SlurmctldDebug: debug2
    SlurmSchedLogLevel: 1
    SlurmdDebug: debug2
    # DebugFlags: []

    ### PLUGINS & PARAMETERS ###
    # SchedulerParameters:
    #   - defer_batch
```

<br>

---

## gres.conf

> GRES (Generic Resource)

- [slurm / doc](https://slurm.schedmd.com/gres.html)
- [twcc / doc](https://man.twcc.ai/@twccdocs/guide-twnia2-gpu-allocation-zh)
- [Configuration example](https://gist.github.com/alcides/615d4b99bb641ae4b77feea5d43ef131)

```
NodeName=localhost Name=gpu Type=a30 File=/dev/nvidia[0-1]
NodeName=localhost Name=mps Count=200 File=/dev/nvidia[0-1]
```
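For reference, jobs would request the resources defined in the example `gres.conf` above roughly as follows. This is only a sketch based on the Slurm GRES documentation; `./my_app` is a placeholder command:

```bash
# Request one GPU of type a30, or a share of the configured MPS units.
srun --gres=gpu:a30:1 nvidia-smi -L
srun --gres=mps:50 ./my_app   # placeholder command; 50 of the Count=200 MPS units
```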
### Setup procedure

```bash=
$ ssh root@127.0.0.1 -p 32222

# Check the current nodes
$ scontrol show nodes | grep NodeName
NodeName=esc8000b-0 Arch=x86_64 CoresPerSocket=20
NodeName=esc8000a-0 Arch=x86_64 CoresPerSocket=32
NodeName=esc8000a-1 Arch=x86_64 CoresPerSocket=32
$ exit

# Add the hardware details specific to the compute pods
$ kubectl -n slurm edit cm/slurm-config
  gres.conf: |
    AutoDetect=nvidia
    NodeName=e900-0 Name=gpu Type=rtx3090
    NodeName=esc8000a-[0-15] Name=gpu Type=a40
    NodeName=esc8000b-[0-15] Name=gpu Type=a40

# Add the Features and Gres parameters
$ kubectl -n slurm edit nss/slurm-compute-esc8000a
spec:
  containers:
  - args:
    ...
    - --conf
    - '''Features=esc8000a,a40 Gres=gpu:a40:1 RealMemory=245760'''

# Shut the compute pods down, restart the controller, then scale back up
$ kubectl -n slurm scale nss/slurm-compute-esc8000a --replicas=0
$ kubectl -n slurm scale statefulset.apps/slurm-controller --replicas=0
$ kubectl -n slurm scale statefulset.apps/slurm-controller --replicas=1
$ kubectl -n slurm scale nss/slurm-compute-esc8000a --replicas=2

$ srun -p tn --gres=gpu:1 -N2 --constraint=a40 ls -ls
```

<br>

### Troubleshooting

- ### ``gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `nvidia_geforce_gtx_1080_ti`. Setting system GRES type to NULL``

```
$ kubectl -n slurm logs pod/slurm-worker-gpu1080-0 | grep -i 1080
+ exec slurmd --systemd -Z --conf-server slurm-controller-0.slurm-controller.slurm:6817 --conf 'Features=gpu1080,1080ti Gres=gpu:2 CoreSpecCount=36 MemSpecLimit=154725'
[2025-09-24T07:13:32] debug2: gpu/nvidia: _set_name_and_file: Name: nvidia_geforce_gtx_1080_ti
[2025-09-24T07:13:32] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `nvidia_geforce_gtx_1080_ti`. Setting system GRES type to NULL
```

- ### gpt-5 (2025/09/24) (to be verified)

Meaning: NVML correctly detected the model **nvidia\_geforce\_gtx\_1080\_ti**, but **no GRES type string in your configuration matches it as a substring** (for example you may have used `1080ti`, which is missing the underscore `_`), so the type is set to **NULL** and the controller ends up seeing only `gpu:2` with no type.

---

### Two options you can take right now

#### Option A: pure NVML auto-detection (least effort)

Change the **gres.conf** entry for that node to pure auto-detection:

```ini
# /etc/slurm/gres.conf (on that worker node)
NodeName=gpu1080-0 Name=gpu AutoDetect=nvml
# do not also specify Type= or File=/dev/nvidiaX
```

> In addition, the container startup arguments currently carry:
> `--conf '... Gres=gpu:2 ...'`
> which registers the node's GRES with a count but **no type**. **If you can change it, remove this part** (or add the type; see Option B).

After restarting slurmd, check from the controller/login side:

```bash
sinfo -N -o "%N %G"
scontrol show node -o gpu1080-0 | grep -o 'Gres=[^ ]*'
```

You should see `gpu:nvidia_geforce_gtx_1080_ti:2` (or at least a type).

---

#### Option B: keep the override, but make the type string matchable

If you want to keep `--conf 'Gres=...'`, or want to specify the Type explicitly in **gres.conf**, **the Type must be a substring of the NVML name** (case-insensitive, but the underscores must be right):

**Pick one of these**:

* Specify a matchable Type in **gres.conf** (recommended):

```ini
NodeName=gpu1080-0 Name=gpu Type=gtx_1080_ti AutoDetect=nvml
# 'gtx_1080_ti' is a substring of 'nvidia_geforce_gtx_1080_ti' → it matches
```

* Or change the container startup arguments to carry the type:

```bash
--conf "Gres=gpu:gtx_1080_ti:2"
# a longer form also works: Gres=gpu:nvidia_geforce_gtx_1080_ti:2
```

> Never use `1080ti` (the underscore is missing); it is not a substring of `nvidia_geforce_gtx_1080_ti` and the type will become NULL again.

After the change, restart slurmd and check from the controller/login side:

```bash
sinfo -N -o "%N %G"                                  # node + type + count
scontrol show node -o gpu1080-0 | grep Gres=
```

---

### Can the detected model be seen inside the pod?

Yes. Just read the container stdout logs, as you did above (no need to stop slurmd):

```bash
kubectl -n slurm logs pod/slurm-worker-gpu1080-0 -c slurmd --since=24h \
  | grep -iE '_set_name_and_file|normalize_sys_gres_types|adding device|nvml|type='
```

With a high enough log level (`SlurmdDebug=debug2`) it shows `Name: nvidia_geforce_gtx_1080_ti` and whether it matched a configured Type.

---

### Key points

* The model string produced by NVML is **lowercase with underscores** (e.g. `nvidia_geforce_gtx_1080_ti`).
* **Type matching works by substring**, so use `gtx_1080_ti`, `geforce_gtx_1080_ti`, or the full string; **do not use `1080ti`**.
* Pick one of the two approaches:
  * pure `AutoDetect=nvml` in `gres.conf` (simplest);
  * or an explicit, matchable `Type=`/`Gres=` in `gres.conf`/`--conf`.
* To verify: `sinfo -N -o "%N %G"` and `scontrol show node -o <node> | grep Gres=`.

After this adjustment, the Type shows up correctly on the controller side instead of NULL.
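As a quick aid for Option B, the normalized device name that slurmd tries to match can be previewed directly from `nvidia-smi`. This is a sketch assuming the product name is normalized to lowercase with underscores, as the `_set_name_and_file` log line above shows:

```bash
# Print the GPU product name and normalize it the way slurmd logs it
# ("NVIDIA GeForce GTX 1080 Ti" -> "nvidia_geforce_gtx_1080_ti").
nvidia-smi --query-gpu=name --format=csv,noheader \
  | tr '[:upper:]' '[:lower:]' \
  | tr ' ' '_'
# Any Type= in gres.conf (or Gres=gpu:<type>:N in --conf) must be a substring of this output.
```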
<br>

---

## nss/slurm-compute-e900

### Changing the configuration

```
$ kubectl -n slurm edit nss/slurm-compute-e900
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    ...
    - --conf
    - '''Features=rtx3090 Gres=gpu:rtx3090:1 GPUs=8 RealMemory=245760'''

$ kubectl -n slurm logs pod/slurm-compute-e900-0 | grep -i error
error: NodeNames=e900-0 CoreSpecCount=30 is invalid, reset to 1
error: Node configuration differs from hardware: CPUs=8:72(hw) Boards=1:1(hw) SocketsPerBoard=8:2(hw) CoresPerSocket=1:18(hw) ThreadsPerCore=1:2(hw)
```

### Observed results

```
# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=7 CPUTot=8 CPULoad=8.45
   AvailableFeatures=rtx3090
   ActiveFeatures=rtx3090
   Gres=gpu:rtx3090:1
   NodeAddr=10.244.4.88 NodeHostName=e900-0 Version=25.05.0
   OS=Linux 5.15.0-83-lowlatency #92-Ubuntu SMP PREEMPT Fri Aug 18 13:09:41 UTC 2023
   RealMemory=245760 AllocMem=0 FreeMem=12198 Sockets=8 Boards=1
   CoreSpecCount=1 CPUSpecList=7
   State=IDLE+DYNAMIC_NORM ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2025-07-08T05:42:23 SlurmdStartTime=2025-07-22T04:08:58
   LastBusyTime=2025-07-22T04:08:58 ResumeAfterTime=None
   CfgTRES=cpu=7,mem=240G,billing=7,gres/gpu=1
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   Comment={"namespace":"slurm","podName":"slurm-compute-e900-0"}

root@slurm-login-6fb5d948d6-gzdvf:~# srun -p debug -c 8 -n 1 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
root@slurm-login-6fb5d948d6-gzdvf:~# srun -p debug -c 7 -n 1 hostname
e900-0
```

Note: `CoreSpecCount=1` reserves one CPU for system use (`CPUEfctv=7`, `CfgTRES=cpu=7`), which is why `-c 8` cannot be allocated while `-c 7` succeeds.

<br>

---

<br>

## Experiments

### Experiment 1: can the NodeName conflict be avoided?

If the `args --conf 'Features=e900,rtx3090 Gres=gpu:rtx3090:1 RealMemory=245760'` argument is removed from nss/compute-xxx entirely, and the node is instead defined in /etc/slurm.conf via [node-config](https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION), does that avoid the NodeName conflict?

```
$ kubectl -n slurm logs pod/slurm-compute-e900-0
error: xsystemd_change_mainpid: connect() failed for /dev/null: Connection refused

root@slurm-login-6fb5d948d6-mt4ss:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
tn*          up   infinite      1   idle esc8000b-0
debug        up   infinite      0    n/a

root@slurm-login-6fb5d948d6-mt4ss:~# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUEfctv=12 CPUTot=72 CPULoad=7.27
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
```

<br>

### Experiment 2: can the device information in `gres.conf` be generated automatically?

- ### Case1

```
AutoDetect=nvidia
# AutoDetect=nvml
# NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```

```
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
tn*          up   infinite      1   idle esc8000b-0
debug        up   infinite      1  inval e900-0

# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=20.50
   AvailableFeatures=e900,rtx3090
   ActiveFeatures=e900,rtx3090
   Gres=gpu:rtx3090:1
   ...
   Reason=gres/gpu count reported lower than configured (0 < 1)
```

- ### Case2

```
# AutoDetect=nvidia
AutoDetect=nvml
# NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```

```
# scontrol show node e900-0
Node e900-0 not found

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up   infinite      0    n/a
```

- ### Case3

```
# AutoDetect=nvidia
# AutoDetect=nvml
NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```
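To compare these cases, it helps to check from inside the compute pod which device files are visible and what slurmd itself would report. This is a sketch and assumes the pod image ships the `slurmd` binary and the NVIDIA device files are mounted:

```bash
# Exec into the worker pod (pod name taken from the examples above)
kubectl -n slurm exec -it pod/slurm-compute-e900-0 -- bash

ls -l /dev/nvidia*   # GPU device files visible to the container
slurmd -C            # print the node hardware slurmd detects (CPUs, sockets, memory)
slurmd -G            # print the GRES configuration slurmd would report to the controller
```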