Configs
===
###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `gres.conf`
<br>
[TOC]
<br>
## Configs
### Commands
```
scontrol reconfigure
```
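In this Slinky deployment the same command can be issued from inside the controller pod. A minimal sketch, assuming the namespace and controller pod name used elsewhere in this note:
```bash
# Re-read slurm.conf and push the new settings to all Slurm daemons.
# Add -c <container> if the controller pod runs more than one container.
kubectl -n slurm exec slurm-controller-0 -- scontrol reconfigure
```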
<br>
---
## slurm.conf
- [slurm / doc](https://slurm.schedmd.com/slurm.conf.html)
- NODESET CONFIGURATION
    - NodeSet
        - Unique name for a set of nodes. Must not overlap with any NodeName definitions.
        - i.e. `NodeSet=xxx` conflicts with `NodeName=xxx`; a NodeSet and a NodeName cannot share the same name (see the sketch below)
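A minimal sketch of a non-conflicting NodeSet definition; `gpunodes` is a hypothetical set name chosen so it matches no NodeName, and the feature `a40` is only an example. In Slinky these lines would normally be fed into the rendered `slurm.conf` via the chart values / `cm/slurm-config` rather than appended on disk:
```bash
# Define a NodeSet by feature and point a partition at it; the set name must
# not be identical to any NodeName already defined in slurm.conf.
cat <<'EOF' >> /etc/slurm/slurm.conf
NodeSet=gpunodes Feature=a40
PartitionName=gpu Nodes=gpunodes MaxTime=INFINITE State=UP
EOF
scontrol reconfigure
```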
### Slinky (slurm-operator 0.2.x)
- [Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads](https://www.nicktailor.com/?p=2024#:~:text=MaxTime%3DINFINITE%20State%3DUP)
```yaml=
controller:
  config:
    slurm.conf: |
      # Basic cluster configuration
      ClusterName=slinky-cluster
      ControlMachine=slurm-controller-0
      # Enable container support
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup,task/affinity
      PluginDir=/usr/lib64/slurm
      # Authentication
      AuthType=auth/munge
      # Node configuration
      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP
      # Accounting
      AccountingStorageType=accounting_storage/slurmdbd
      AccountingStorageHost=slurm-accounting-0
```
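A sketch of applying such values with Helm. The OCI chart reference and release name below are assumptions; check the SlinkyProject release notes for the exact chart path of the version you deploy:
```bash
# Install or upgrade the Slinky slurm chart with the values shown above.
# The chart location is an assumption; verify it for your slurm-operator version.
helm upgrade --install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace slurm --create-namespace -f values.yaml
```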
### Slinky (slurm-operator 0.3.x)
- [slurm-operator/helm/slurm/templates/controller/slurm-configmap.yaml](https://github.com/SlinkyProject/slurm-operator/blob/main/helm/slurm/templates/controller/slurm-configmap.yaml#L18-L121)
```yaml=
slurm:
  #
  # -- (map[string]string | map[string][]string)
  # Extra slurm configuration lines to append to `slurm.conf`, represented as a string or a map.
  # WARNING: Values can override existing ones.
  # Ref: https://slurm.schedmd.com/slurm.conf.html
  extraSlurmConf:
    # MinJobAge: 2
    # MaxNodeCount: 1024
    # ContainerPlugin: container/pyxis  # caused an error in this setup, kept disabled
    ### LOGGING ###
    SlurmctldDebug: debug2
    SlurmSchedLogLevel: 1
    SlurmdDebug: debug2
    # DebugFlags: []
    ### PLUGINS & PARAMETERS ###
    # SchedulerParameters:
    #   - defer_batch
```
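To confirm the `extraSlurmConf` lines actually reach the rendered `slurm.conf`, check the ConfigMap and the running controller (names follow the `cm/slurm-config` and `slurm-controller-0` used elsewhere in this note):
```bash
# Look for the extra lines in the rendered ConfigMap and in the live config.
kubectl -n slurm get cm slurm-config -o yaml | grep -i slurmctlddebug
kubectl -n slurm exec slurm-controller-0 -- scontrol show config | grep -iE 'slurmctlddebug|slurmddebug'
```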
<br>
---
## gres.conf
> GRES (Generic Resource)
- [slurm / doc](https://slurm.schedmd.com/gres.html)
- [twcc / doc](https://man.twcc.ai/@twccdocs/guide-twnia2-gpu-allocation-zh)
- [Configuration example](https://gist.github.com/alcides/615d4b99bb641ae4b77feea5d43ef131)
```
NodeName=localhost Name=gpu type=a30 File=/dev/nvidia[0-1]
NodeName=localhost Name=mps count=200 File=/dev/nvidia[0-1]
```
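The `File=/dev/nvidia[0-1]` entries must match the devices that are actually visible where slurmd runs; a quick cross-check (run on the node or inside the worker pod, assuming `nvidia-smi` is available there):
```bash
# List the GPU device files and the models NVML reports, to verify the
# File= paths and pick a Type= string that matches the detected name.
ls -l /dev/nvidia*
nvidia-smi -L
```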
### Setup procedure
```bash=
$ ssh root@127.0.0.1 -p 32222
# list the current nodes
$ scontrol show nodes | grep NodeName
NodeName=esc8000b-0 Arch=x86_64 CoresPerSocket=20
NodeName=esc8000a-0 Arch=x86_64 CoresPerSocket=32
NodeName=esc8000a-1 Arch=x86_64 CoresPerSocket=32
$ exit
# add the compute pods' hardware details
$ kubectl -n slurm edit cm/slurm-config
gres.conf: |
  AutoDetect=nvidia
  NodeName=e900-0 Name=gpu Type=rtx3090
  NodeName=esc8000a-[0-15] Name=gpu Type=a40
  NodeName=esc8000b-[0-15] Name=gpu Type=a40
# set the Features and Gres parameters
$ kubectl -n slurm edit nss/slurm-compute-esc8000a
spec:
  containers:
  - args:
    ...
    - --conf
    - '''Features=esc8000a,a40 Gres=gpu:a40:1 RealMemory=245760'''
# scale the compute pods down, restart the controller, then scale back up
$ kubectl -n slurm scale nss/slurm-compute-esc8000a --replicas=0
$ kubectl -n slurm scale statefulset.apps/slurm-controller --replicas=0
$ kubectl -n slurm scale statefulset.apps/slurm-controller --replicas=1
$ kubectl -n slurm scale nss/slurm-compute-esc8000a --replicas=2
$ srun -p tn --gres=gpu:1 -N2 --constraint=a40 ls -ls
```
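Once the pods are back, the registered GRES can be checked from the login pod (the same commands are reused in the troubleshooting section below):
```bash
# Show each node's GRES as seen by the controller, then one node in detail.
sinfo -N -o "%N %G"
scontrol show node esc8000a-0 | grep -o 'Gres=[^ ]*'
```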
<br>
### Troubleshooting
- ### ``gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `nvidia_geforce_gtx_1080_ti`. Setting system GRES type to NULL``
```
$ kubectl -n slurm logs pod/slurm-worker-gpu1080-0 | grep -i 1080
+ exec slurmd --systemd -Z --conf-server slurm-controller-0.slurm-controller.slurm:6817 --conf 'Features=gpu1080,1080ti Gres=gpu:2 CoreSpecCount=36 MemSpecLimit=154725'
[2025-09-24T07:13:32] debug2: gpu/nvidia: _set_name_and_file: Name: nvidia_geforce_gtx_1080_ti
[2025-09-24T07:13:32] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `nvidia_geforce_gtx_1080_ti`. Setting system GRES type to NULL
```
- ### gpt-5 (2025/09/24) (to be verified)
In other words: NVML correctly detected the model **nvidia\_geforce\_gtx\_1080\_ti**, but **none of the GRES type strings in your configuration matches it as a substring** (for example you may have used `1080ti`, missing the underscore `_`), so the type was set to **NULL**, and the controller side only sees `gpu:2` with no type.
---
### Two options you can take right now
#### Option A: pure NVML auto-detection (least effort)
Change **gres.conf** for that node to pure auto-detection:
```ini
# /etc/slurm/gres.conf (on that worker node)
NodeName=gpu1080-0 Name=gpu AutoDetect=nvml
# do not also specify Type= or File=/dev/nvidiaX
```
> Also, the container launch args currently include:
> `--conf '... Gres=gpu:2 ...'`
> This registers the node's GRES with a count only and no type. **If you can change it, it is best to remove that part** (or add the type to it, see option B).

After restarting slurmd, check from the controller/login side:
```bash
sinfo -N -o "%N %G"
scontrol show node -o gpu1080-0 | grep -o 'Gres=[^ ]*'
```
You should then see `gpu:nvidia_geforce_gtx_1080_ti:2` (or at least a type attached).
---
#### Option B: keep the override, but make the type string match
If you want to keep `--conf 'Gres=...'` or explicitly set Type in **gres.conf**, **the Type must be a substring of the NVML name** (case-insensitive, but the underscores have to match):
**Pick one of the following:**
* Specify a matching Type in **gres.conf** (recommended):
```ini
NodeName=gpu1080-0 Name=gpu Type=gtx_1080_ti AutoDetect=nvml
# 'gtx_1080_ti' is a substring of 'nvidia_geforce_gtx_1080_ti' → it matches
```
* Or change the container launch args to carry the type (a Slinky-specific sketch follows after the note below):
```bash
--conf "Gres=gpu:gtx_1080_ti:2"
# the longer form also works: Gres=gpu:nvidia_geforce_gtx_1080_ti:2
```
> Never use `1080ti` (the underscore is missing); it is not a substring of `nvidia_geforce_gtx_1080_ti`, so the type would become NULL again.
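
In this Slinky setup the `--conf` string lives on the worker NodeSet, so option B would be applied the same way as earlier in this note. The NodeSet name below is an assumption derived from the `slurm-worker-gpu1080-0` pod name, and the replica count should match whatever was in use:
```bash
# Edit the NodeSet and change the Gres token so its type is a substring of the
# NVML name, e.g. Gres=gpu:gtx_1080_ti:2 instead of Gres=gpu:2.
kubectl -n slurm edit nss/slurm-worker-gpu1080
# Recreate the worker pods so slurmd restarts with the new --conf.
kubectl -n slurm scale nss/slurm-worker-gpu1080 --replicas=0
kubectl -n slurm scale nss/slurm-worker-gpu1080 --replicas=1
```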
After the change, restart slurmd and check again from the controller/login side:
```bash
sinfo -N -o "%N %G"   # node + type + count
scontrol show node -o gpu1080-0 | grep Gres=
```
---
### Can the detected type be seen from inside the pod?
Yes. Just as you did above, reading the container's stdout logs is enough (no need to stop slurmd):
```bash
kubectl -n slurm logs pod/slurm-worker-gpu1080-0 -c slurmd --since=24h \
| grep -iE '_set_name_and_file|normalize_sys_gres_types|adding device|nvml|type='
```
With a high enough log level (`SlurmdDebug=debug2`) it prints
`Name: nvidia_geforce_gtx_1080_ti` and whether it successfully matched a Type in the config.
---
### Key points
* The model string NVML produces is **lowercase with underscores** (e.g. `nvidia_geforce_gtx_1080_ti`).
* **Type matching is by substring**, so use `gtx_1080_ti`, `geforce_gtx_1080_ti`, or the full string; **do not write `1080ti`**.
* Pick one of the two approaches:
    * plain `AutoDetect=nvml` in `gres.conf` (simplest); or
    * explicitly set a matching `Type=`/`Gres=` in `gres.conf`/`--conf`.
* To verify: `sinfo -N -o "%N %G"` and `scontrol show node -o <node> | grep Gres=`.

With these changes the Type shows up correctly on the controller side instead of being NULL.
<br>
---
## nss/slurm-compute-e900
### Changing the configuration
```
$ kubectl -n slurm edit nss/slurm-compute-e900
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    ...
    - --conf
    - '''Features=rtx3090 Gres=gpu:rtx3090:1 GPUs=8 RealMemory=245760'''
$ kubectl -n slurm logs pod/slurm-compute-e900-0 | grep -i error
error: NodeNames=e900-0 CoreSpecCount=30 is invalid, reset to 1
error: Node configuration differs from hardware: CPUs=8:72(hw) Boards=1:1(hw) SocketsPerBoard=8:2(hw) CoresPerSocket=1:18(hw) ThreadsPerCore=1:2(hw)
```
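The `Node configuration differs from hardware` error means the values passed via `--conf` do not match what slurmd detects inside the pod. `slurmd -C` prints the detected topology line, which can be copied into the override; pod and container names follow the ones used elsewhere in this note:
```bash
# Print the node configuration slurmd detects inside the compute pod
# (CPUs, Boards, Sockets, CoresPerSocket, ThreadsPerCore, RealMemory).
kubectl -n slurm exec slurm-compute-e900-0 -c slurmd -- slurmd -C
```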
### Observed result
```
# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=7 CPUTot=8 CPULoad=8.45
AvailableFeatures=rtx3090
ActiveFeatures=rtx3090
Gres=gpu:rtx3090:1
NodeAddr=10.244.4.88 NodeHostName=e900-0 Version=25.05.0
OS=Linux 5.15.0-83-lowlatency #92-Ubuntu SMP PREEMPT Fri Aug 18 13:09:41 UTC 2023
RealMemory=245760 AllocMem=0 FreeMem=12198 Sockets=8 Boards=1
CoreSpecCount=1 CPUSpecList=7
State=IDLE+DYNAMIC_NORM ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2025-07-08T05:42:23 SlurmdStartTime=2025-07-22T04:08:58
LastBusyTime=2025-07-22T04:08:58 ResumeAfterTime=None
CfgTRES=cpu=7,mem=240G,billing=7,gres/gpu=1
AllocTRES=
CurrentWatts=0 AveWatts=0
Comment={"namespace":"slurm","podName":"slurm-compute-e900-0"}
root@slurm-login-6fb5d948d6-gzdvf:~# srun -p debug -c 8 -n 1 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
root@slurm-login-6fb5d948d6-gzdvf:~# srun -p debug -c 7 -n 1 hostname
e900-0
```
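`srun -c 8` fails because `CoreSpecCount=1` reserves one of the eight CPUs for the system (`CPUTot=8`, `CPUEfctv=7`, `CPUSpecList=7`), so at most 7 CPUs are allocatable to jobs. A quick way to see this from the login pod:
```bash
# CPUTot vs CPUEfctv plus the reserved CPU in CPUSpecList explain why -c 8 fails.
scontrol show node e900-0 | grep -E 'CPUTot|CoreSpecCount'
```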
<br>
---
<br>
## Experiments
### Experiment 1: can the NodeName conflict be avoided?
If the `args --conf 'Features=e900,rtx3090 Gres=gpu:rtx3090:1 RealMemory=245760'` parameter is removed from nss/compute-xxx entirely,
and the node is instead defined in /etc/slurm.conf via [node-config](https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION), does that avoid the NodeName conflict?
```
$ kubectl -n slurm logs pod/slurm-compute-e900-0
error: xsystemd_change_mainpid: connect() failed for /dev/null: Connection refused
root@slurm-login-6fb5d948d6-mt4ss:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tn* up infinite 1 idle esc8000b-0
debug up infinite 0 n/a
root@slurm-login-6fb5d948d6-mt4ss:~# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=18
CPUAlloc=0 CPUEfctv=12 CPUTot=72 CPULoad=7.27
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
```
<br>
### Experiment 2: can the device info in `gres.conf` be generated automatically?
- ### Case1
```
AutoDetect=nvidia
# AutoDetect=nvml
# NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```
```
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tn* up infinite 1 idle esc8000b-0
debug up infinite 1 inval e900-0
# scontrol show node e900-0
NodeName=e900-0 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=20.50
AvailableFeatures=e900,rtx3090
ActiveFeatures=e900,rtx3090
Gres=gpu:rtx3090:1
...
Reason=gres/gpu count reported lower than configured (0 < 1)
```
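Case 1 leaves the node `inval` because auto-detection found no GPU while the `--conf` override still claims `Gres=gpu:rtx3090:1`. A first check is whether the GPU device and driver are visible inside the compute pod at all (pod/container names as used earlier in this note; assumes `nvidia-smi` exists in the image):
```bash
# If the device file or NVML is not visible inside the pod, auto-detection
# reports 0 GPUs and the node is marked invalid.
kubectl -n slurm exec slurm-compute-e900-0 -c slurmd -- ls -l /dev/nvidia0
kubectl -n slurm exec slurm-compute-e900-0 -c slurmd -- nvidia-smi -L
```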
- ### Case2
```
# AutoDetect=nvidia
AutoDetect=nvml
# NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```
```
# scontrol show node e900-0
Node e900-0 not found
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 0 n/a
```
- ### Case3
```
# AutoDetect=nvidia
# AutoDetect=nvml
NodeName=e900-0 Name=gpu Type=rtx3090 File=/dev/nvidia0
```