Installation Guide
===
###### tags: `SlinkyProject`, `Kubernetes`, `k8s`, `app`, `slurm`
<br>
[TOC]
<br>
## Installation
### Known-working versions
```
$ helm list -A
NAME NAMESPACE REVISION STATUS CHART APP VERSION
cert-manager cert-manager 1 deployed cert-manager-v1.17.1 v1.17.1
monitor monitor 1 deployed kube-prometheus-stack-70.7.0 v0.81.0
slinky slinky 1 deployed slurm-operator-0.3.0 25.05
slurm slurm 6 failed slurm-0.3.0 25.05
```
### Workflow overview
- [Slinky - QuickStart Guide](https://github.com/SlinkyProject/slurm-operator/blob/release-0.3/docs/quickstart.md)
- ### STEP 1: Install dependencies
- ### STEP 2: Install the Slurm Operator
- Only needs to be installed once
- Reinstall only if slurm-operator has crashed and stopped emitting any log output

- ### STEP 3: Deploy the Slurm Cluster
- Adjust `storageClass` to match the storage infrastructure of the current K8s environment
- Adjust `rootSshAuthorizedKeys` so the current control plane host can log in
- Supports dynamic apply (i.e. no need to uninstall and then reinstall)
- ### How do I preview the rendered templates before installing/deploying?
> Replace `helm install` with `helm template`
```
helm template slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml --version=0.3.0 --namespace=slurm --create-namespace
```
- ### How do I update the Slurm Cluster's settings?
> i.e. how do I dynamically apply changes to the Slurm Cluster?
Replace `helm install` with `helm upgrade -i`
- Each apply counts as an upgrade (REVISION: 1 -> 2 -> ...)
- `-i`: runs `helm install` if the chart has not been installed yet
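The two values called out in STEP 3 are overridden in `values-slurm.yaml`. A minimal sketch of what the overrides look like — the key paths shown here are assumptions, so verify them against the chart's own values.yaml for version 0.3.0 before applying:

```yaml
# Sketch only — confirm key paths against helm/slurm/values.yaml (v0.3.0).
login:
  # Public keys appended to root's authorized_keys in the login pod,
  # so the control plane host can ssh in (STEP 3, second bullet).
  rootSshAuthorizedKeys:
    - "ssh-ed25519 AAAA... root@control-plane"   # placeholder key

# Any persistence section must name a StorageClass that actually
# exists in the cluster (STEP 3, first bullet).
mariadb:
  primary:
    persistence:
      storageClass: "local-path"   # placeholder StorageClass name
```

Apply the edited file with `helm upgrade -i slurm oci://ghcr.io/slinkyproject/charts/slurm --values=values-slurm.yaml --version=0.3.0 --namespace=slurm`, as described above.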
<br>
### Step 1: Install dependencies
```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace --set installCRDs=true
```
- ### [troubleshooting] Prometheus error:
```
$ helm install prometheus prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace --set installCRDs=true
Error: INSTALLATION FAILED: 1 error occurred:
* Prometheus.monitoring.coreos.com "prometheus-kube-prometheus-prometheus" is invalid: spec.maximumStartupDurationSeconds: Invalid value: 0: spec.maximumStartupDurationSeconds in body should be greater than or equal to 60
---
$ helm uninstall prometheus -n prometheus
release "prometheus" uninstalled
---
$ helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace \
--set installCRDs=true \
--set prometheus.prometheusSpec.maximumStartupDurationSeconds=300
```
<br>
### Step 2: Install the Slurm Operator
> https://github.com/SlinkyProject/slurm-operator/blob/release-0.3/docs/quickstart.md#slurm-operator
```
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm-operator/values.yaml \
-o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml --version=0.3.0 --namespace=slinky --create-namespace
```
- **source**
- https://github.com/SlinkyProject/slurm-operator
- **image**
- url
https://github.com/SlinkyProject/slurm-operator/pkgs/container/slurm-operator
- digest
`sha256:5afbee94555f31275823240e5e4ac3b57527339255322e83c7cf0bc52209dfd9`
- **values.yaml**
- source
https://github.com/SlinkyProject/slurm-operator/blob/v0.3.0/helm/slurm-operator/values.yaml
<br>
### Step 3: Deploy the Slurm Cluster
> https://github.com/SlinkyProject/slurm-operator/blob/release-0.3/docs/quickstart.md#slurm-cluster
```
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm/values.yaml \
-o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml --version=0.3.0 --namespace=slurm --create-namespace
```
- **source**
- https://github.com/SlinkyProject/slurm-operator
- **images**
- url
https://github.com/orgs/SlinkyProject/packages?repo_name=containers
- **values.yaml**
- source
https://github.com/SlinkyProject/slurm-operator/blob/v0.3.0/helm/slurm/values.yaml
<br>
## Verification and troubleshooting
### 1. Check Slurm status on the login node
```bash
# List the status of all partitions and nodes
sinfo
# Show detailed node info; confirm each slurmd has registered with the controller
scontrol show nodes
# Ping the controller to confirm slurmctld is reachable from this node
scontrol ping
```
These commands show node states such as `idle`, `alloc`, and `down`. If every node is `down` or `drain`, the compute nodes have not yet joined the cluster properly.

---
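When scripting this health check, the state column can be tallied instead of read by eye. A small sketch that runs against canned `sinfo -N -h -o '%N %t'`-style output (in real use you would pipe `sinfo` itself into the same awk program; the node names below are just sample data):

```shell
#!/bin/sh
# Tally node states from `sinfo -N -h -o '%N %t'`-style output.
# Canned sample input so the sketch runs without a cluster.
sinfo_sample='5glab-slurm-0 idle
5glab-slurm-1 down
5glab-slurm-2 drain'

states=$(printf '%s\n' "$sinfo_sample" \
  | awk '{ count[$2]++ } END { for (s in count) printf "%s=%d\n", s, count[s] }' \
  | sort | tr '\n' ' ')
echo "$states"   # down=1 drain=1 idle=1
```

A non-zero `down=`/`drain=` count is the scripted equivalent of "the compute nodes have not joined yet".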
### 2. Check the compute-node Pod's container status and logs
The earlier `kubectl get all -n slurm` output showed `slurm-compute-debug-0` with only 1/2 containers ready, which suggests the `slurmd` inside it may have failed to start. On your local machine, run:
```bash
# Describe the Pod to find readinessProbe / startupProbe errors
kubectl describe pod slurm-compute-debug-0 -n slurm
# View the slurmd container's logs (assuming the container is named slurmd)
kubectl logs slurm-compute-debug-0 -c slurmd -n slurm
# If there are other sidecar/helper containers, check their logs too
kubectl logs slurm-compute-debug-0 -c helper -n slurm
```
From the logs, determine whether the problem is a bad config file, failure to reach the controller, or a permissions issue, then fix the corresponding ConfigMap or Secret, or redeploy the compute Pod.

---
### 3. Fix the problem and submit a test job
1. **Fix the compute node**
Based on the log findings above, correct the slurmd settings (controller address, credentials, data directories, etc.), then restart or recreate the `slurm-compute-debug-0` Pod until it shows `READY 2/2`.
2. **Confirm the nodes are ready again**
```bash
sinfo
scontrol show nodes
```
3. **Submit a simple test job**
* **Interactive** (creates no files):
```bash
srun -N1 -n1 --pty bash
# then, inside the container, run:
hostname
exit
```
* **Batch script**:
```bash
echo -e "#!/bin/bash\nhostname" > test.sh
sbatch test.sh
# wait a few seconds, then:
squeue
sacct -j <JOBID>
```
If the test job reaches `RUNNING` and returns the correct hostname, the whole Slurm-on-K8s cluster is working properly.
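For batch scripts longer than one line, a here-document is less error-prone than `echo -e`. A sketch (the job name and resource counts are placeholders):

```shell
# Write a minimal batch script with a heredoc instead of echo -e.
cat > test-job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
hostname
EOF

# Submit only if sbatch is actually on PATH (e.g. on the login pod);
# the guard lets the sketch run anywhere.
if command -v sbatch >/dev/null 2>&1; then
  sbatch test-job.sh
fi
```

The quoted `'EOF'` delimiter prevents variable expansion, so `#SBATCH` directives and `$`-containing commands are written verbatim.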
<br>
## Slurm Cluster installation walkthrough
### helm install
```
$ helm install tj-slurm2 oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml --version=0.3.0 --namespace=tj-slurm2 --create-namespace
Pulled: ghcr.io/slinkyproject/charts/slurm:0.3.0
Digest: sha256:737436f22147658cb9f91e75b3d604c4e74094268b8dec77ba9cbeb0e8017060
NAME: tj-slurm2
LAST DEPLOYED: Wed Jun 18 07:35:18 2025
NAMESPACE: tj-slurm2
STATUS: deployed
REVISION: 1
NOTES:
********************************************************************************
SSSSSSS
SSSSSSSSS
SSSSSSSSS
SSSSSSSSS
SSSS SSSSSSS SSSS
SSSSSS SSSSSS
SSSSSS SSSSSSS SSSSSS
SSSS SSSSSSSSS SSSS
SSS SSSSSSSSS SSS
SSSSS SSSS SSSSSSSSS SSSS SSSSS
SSS SSSSSS SSSSSSSSS SSSSSS SSS
SSSSSS SSSSSSS SSSSSS
SSS SSSSSS SSSSSS SSS
SSSSS SSSS SSSSSSS SSSS SSSSS
S SSS SSSSSSSSS SSS S
SSS SSSS SSSSSSSSS SSSS SSS
S SSS SSSSSS SSSSSSSSS SSSSSS SSS S
SSSSS SSSSSS SSSSSSSSS SSSSSS SSSSS
S SSSSS SSSS SSSSSSS SSSS SSSSS S
S SSS SSS SSS SSS S
S S S S
SSS
SSS
SSS
SSS
SSSSSSSSSSSS SSS SSSS SSSS SSSSSSSSS SSSSSSSSSSSSSSSSSSSS
SSSSSSSSSSSSS SSS SSSS SSSS SSSSSSSSSS SSSSSSSSSSSSSSSSSSSSSS
SSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSSSSSSSSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSSSSSSSSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSS SSS SSSS SSSS SSSS SSSS SSSS SSSS
SSSSSSSSSSSSS SSS SSSSSSSSSSSSSSS SSSS SSSS SSSS SSSS
SSSSSSSSSSSS SSS SSSSSSSSSSSSS SSSS SSSS SSSS SSSS
********************************************************************************
CHART NAME: slurm
CHART VERSION: 0.3.0
APP VERSION: 25.05
slurm has been installed. Check its status by running:
$ kubectl --namespace=tj-slurm2 get pods -l app.kubernetes.io/instance=tj-slurm2 --watch
ssh via the Slurm login service:
$ SLURM_LOGIN_IP="$(kubectl get services -n tj-slurm2 -l app.kubernetes.io/instance=tj-slurm2,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
$ ssh -p 2222 $SLURM_LOGIN_IP
curl the Slurm restapi service:
$ SLURM_RESTAPI_IP="$(kubectl get services -n tj-slurm2 -l app.kubernetes.io/instance=tj-slurm2,app.kubernetes.io/name=slurmrestd -o jsonpath="{.items[0].spec.clusterIP})"
$ curl -H "X-SLURM-USER-TOKEN: auth/none" http://${SLURM_RESTAPI_IP}:6820/openapi/v3
```
Learn more about slurmrestd:
- Overview: https://slurm.schedmd.com/rest.html
- Quickstart: https://slurm.schedmd.com/rest_quickstart.html
- Documentation: https://slurm.schedmd.com/rest_api.html
Learn more about Slurm:
- Overview: https://slurm.schedmd.com/overview.html
- Quickstart: https://slurm.schedmd.com/quickstart.html
- Documentation: https://slurm.schedmd.com/documentation.html
- Support: https://www.schedmd.com/slurm-support/our-services/
- File Tickets: https://support.schedmd.com/
Learn more about Slinky:
- Overview: https://www.schedmd.com/slinky/why-slinky/
- Documentation: https://slinky.schedmd.com/docs/
<br>
### Connecting to the slurm cluster
- ### `kubectl get all -n <your_namespace>`

```
# On the control plane
ssh root@10.103.224.117 -p 2222
# or
ssh root@127.0.0.1 -p 31921
```
- ### The official connection snippet fails to obtain `SLURM_LOGIN_IP`
```
SLURM_LOGIN_IP="$(kubectl get services -n tj-slurm2 -l app.kubernetes.io/instance=tj-slurm2,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
echo $SLURM_LOGIN_IP
```
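A likely cause, visible in the service listings later in this note: the login Service's `EXTERNAL-IP` stays `<pending>` because the cluster has no LoadBalancer provider, so the jsonpath resolves to nothing. Falling back to the Service's NodePort works. In this sketch the `kubectl` lookup is commented out so it runs without a cluster, and the port is hard-coded from this environment's output:

```shell
#!/bin/sh
# Fallback when the LoadBalancer IP is <pending>: ssh to any node IP
# on the login Service's NodePort instead.
# port=$(kubectl get svc -n tj-slurm2 \
#   -l app.kubernetes.io/instance=tj-slurm2,app.kubernetes.io/name=login \
#   -o jsonpath='{.items[0].spec.ports[0].nodePort}')
port=31921   # hard-coded from this doc's output; derive it as above
cmd="ssh -p $port root@<node-ip>"
echo "$cmd"
```

This matches the working `ssh root@127.0.0.1 -p 31921` command shown above.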
<br>
### Common slurm test commands
```bash
# List the status of all partitions and nodes
sinfo
# Show detailed node info; confirm each slurmd has registered with the controller
scontrol show nodes
# Show details for every partition: name, state, and related parameters
scontrol show partitions
# Open an interactive shell on a given node to inspect its environment and files
# $ srun --nodelist=<node_name> --pty bash
# $ nvidia-smi
srun --nodelist=5glab-slurm-0 --pty bash
# Run nvidia-smi on a given node to check GPU usage and driver state
# $ srun --nodelist=<node_name> --pty nvidia-smi
srun --nodelist=5glab-slurm-0 --pty nvidia-smi
# List all pending and running jobs in the queue
squeue
# Query accounting info for running and completed jobs, including resources used and elapsed time
sacct
```
<br>
### Advanced slurm test commands
```
# Validate an sbatch script's syntax without actually submitting it
sbatch --test-only job_script.sh
# Validate srun scheduling (tests scheduling only; does not run the job)
srun --test-only --pty hostname
# Check that the slurmctld controller responds (run from any node)
scontrol ping
# Show detailed parameters and status for a given job
scontrol show job <jobid>
# View a running job's real-time resource usage
# $ sstat --helpformat # available fields
# AllocTRES AveCPU AveCPUFreq AveDiskRead
# AveDiskWrite AvePages AveRSS AveVMSize
# ConsumedEnergy ConsumedEnergyRaw JobID MaxDiskRead
# MaxDiskReadNode MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode
# MaxDiskWriteTask MaxPages MaxPagesNode MaxPagesTask
# MaxRSS MaxRSSNode MaxRSSTask MaxVMSize
# MaxVMSizeNode MaxVMSizeTask MinCPU MinCPUNode
# MinCPUTask Nodelist NTasks Pids
# ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
# TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask
# TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot
# TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask
# TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
#
sstat -j <jobid> --format=JobID,JobName,MaxRSS,MaxVMSize,Elapsed # incorrect (AI-suggested): JobName and Elapsed are not sstat fields (see list above); use sacct instead
sacct -j 12 --format=JobID,JobName,MaxRSS,MaxVMSize,Elapsed
# Show GPU (Gres) allocation and availability per node
sinfo --format="%N %G"
# List reasons nodes are unavailable (drain, down, etc.)
sinfo -R
# Cluster diagnostics: dump scheduler statistics and health state
sdiag
# Show the priority ordering of all current jobs
sprio
# Show the cluster's QoS settings
sacctmgr show qos
# View Slurm's full configuration
scontrol show config
```
<br>
## Troubleshooting
### Status (normal case)
```
$ kubectl --namespace=tj-slurm4 get pods -l app.kubernetes.io/instance=tj-slurm4 --watch
NAME READY STATUS RESTARTS AGE
tj-slurm4-accounting-0 0/1 Running 0 29s
tj-slurm4-compute-5glab-slurm-0 1/2 Running 0 28s
tj-slurm4-controller-0 0/3 Pending 0 29s
tj-slurm4-login-7d74dccc58-677q5 0/1 Running 0 29s
tj-slurm4-mariadb-0 0/1 Pending 0 29s
tj-slurm4-restapi-9c445cffb-lhtmt 1/1 Running 0 29s
tj-slurm4-token-create-tjqpt 2/2 Running 0 29s
tj-slurm4-controller-0 0/3 Pending 0 29s
tj-slurm4-controller-0 0/3 Init:0/2 0 29s
tj-slurm4-controller-0 0/3 Init:1/2 0 34s
tj-slurm4-mariadb-0 0/1 Pending 0 35s
tj-slurm4-mariadb-0 0/1 Init:0/1 0 35s
tj-slurm4-controller-0 1/3 PodInitializing 0 36s
tj-slurm4-controller-0 2/3 Running 0 40s
tj-slurm4-mariadb-0 0/1 Init:0/1 0 40s
tj-slurm4-controller-0 3/3 Running 0 40s
tj-slurm4-mariadb-0 0/1 PodInitializing 0 41s
tj-slurm4-mariadb-0 0/1 Running 0 43s
tj-slurm4-mariadb-0 1/1 Running 0 85s
tj-slurm4-accounting-0 1/1 Running 0 111s
tj-slurm4-login-7d74dccc58-677q5 1/1 Running 0 2m41s
tj-slurm4-token-create-tjqpt 1/2 Completed 0 2m44s
tj-slurm4-token-create-tjqpt 0/2 Completed 0 2m47s
tj-slurm4-token-create-tjqpt 0/2 Completed 0 2m48s
tj-slurm4-token-create-tjqpt 0/2 Completed 0 2m50s
tj-slurm4-token-create-tjqpt 0/2 Completed 0 2m50s
tj-slurm4-token-create-tjqpt 0/2 Completed 0 2m50s
tj-slurm4-compute-5glab-slurm-0 2/2 Running 0 2m51s
```
- `<namespace>-token-create-xxxxx`: 6 in total
- Takes about 2m45s to run; should normally finish within 3m
- The job disappears once it completes
```
$ kubectl get all -n <namespace> -o wide
NAME STATUS COMPLETIONS DURATION AGE CONTAINERS IMAGES SELECTOR
job.batch/<namespace>-token-create Running 0/1 2m41s 2m41s token-slurm ghcr.io/slinkyproject/sackd:25.05-ubuntu24.04 batch.kubernetes.io/controller-uid=16ec747b-1c5f-4209-8c19-b760a27e9859
```
- The compute-node entry comes up last (final line of the watch output)
<br>
```
$ kubectl get secret -n tj-slurm4
NAME TYPE DATA AGE
sh.helm.release.v1.tj-slurm4.v1 helm.sh/release.v1 1 24m
tj-slurm4-auth-key Opaque 1 24m
tj-slurm4-jwt-key Opaque 1 24m
tj-slurm4-login-config Opaque 1 24m
tj-slurm4-login-ssh-host-keys Opaque 3 24m
tj-slurm4-mariadb Opaque 2 24m
tj-slurm4-token-slurm Opaque 1 21m
---
# 6 secrets in total (excluding the Helm release secret)
```
<br>
```
NAME READY STATUS RESTARTS AGE
pod/tj-slurm4-accounting-0 1/1 Running 0 3m30s
pod/tj-slurm4-compute-5glab-slurm-1-0 2/2 Running 0 3m29s
pod/tj-slurm4-compute-5glab-slurm-2-0 2/2 Running 0 3m29s
pod/tj-slurm4-controller-0 3/3 Running 0 3m30s
pod/tj-slurm4-login-b976b96d6-g299l 1/1 Running 0 3m30s
pod/tj-slurm4-mariadb-0 1/1 Running 0 3m30s
pod/tj-slurm4-restapi-9c445cffb-q8rxn 1/1 Running 0 3m30s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/tj-slurm4-accounting ClusterIP None <none> 6819/TCP 3m31s
service/tj-slurm4-compute ClusterIP None <none> 6818/TCP 3m31s
service/tj-slurm4-controller ClusterIP None <none> 6817/TCP 3m31s
service/tj-slurm4-login LoadBalancer 10.100.175.119 <pending> 2222:31915/TCP 3m31s
service/tj-slurm4-mariadb ClusterIP 10.97.42.109 <none> 3306/TCP 3m31s
service/tj-slurm4-mariadb-headless ClusterIP None <none> 3306/TCP 3m31s
service/tj-slurm4-restapi ClusterIP 10.103.245.35 <none> 6820/TCP 3m31s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/tj-slurm4-login 1/1 1 1 3m30s
deployment.apps/tj-slurm4-restapi 1/1 1 1 3m30s
NAME DESIRED CURRENT READY AGE
replicaset.apps/tj-slurm4-login-b976b96d6 1 1 1 3m30s
replicaset.apps/tj-slurm4-restapi-9c445cffb 1 1 1 3m30s
NAME READY AGE
statefulset.apps/tj-slurm4-accounting 1/1 3m30s
statefulset.apps/tj-slurm4-controller 1/1 3m30s
statefulset.apps/tj-slurm4-mariadb 1/1 3m30s
```
<br>
```
$ sinfo # (abnormal case, for comparison)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
5glab-slurm up infinite 0 n/a
all* up infinite 0 n/a
$ sinfo # (normal case)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
5glab-slurm up infinite 1 idle 5glab-slurm-0
all* up infinite 1 idle 5glab-slurm-0
---
$ srun hostname # (abnormal case, for comparison)
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun hostname # (normal case)
5glab-slurm-0
---
$ sbatch --wrap="sleep 60" # (abnormal case, for comparison)
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch --wrap="sleep 60" # (normal case)
Submitted batch job 2
---
$ squeue # (abnormal case, for comparison)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$ squeue # (normal case)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 all wrap root R 0:31 1 5glab-slurm-0
---
$ sacct # (abnormal case, for comparison)
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1 hostname all root 0 FAILED 1:0
2 wrap all root 0 FAILED 1:0
$ sacct # (normal case)
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1 hostname all root 2 COMPLETED 0:0
1.extern extern root 2 COMPLETED 0:0
1.0 hostname root 2 COMPLETED 0:0
2 wrap all root 2 COMPLETED 0:0
2.batch batch root 2 COMPLETED 0:0
2.extern extern root 2 COMPLETED 0:0
---
$ scontrol show nodes # (abnormal case, for comparison)
No nodes in the system
$ scontrol show nodes # (normal case)
NodeName=5glab-slurm-0 Arch=x86_64 CoresPerSocket=18
CPUAlloc=0 CPUEfctv=72 CPUTot=72 CPULoad=2.95
AvailableFeatures=5glab-slurm
ActiveFeatures=5glab-slurm
Gres=(null)
NodeAddr=10.244.4.182 NodeHostName=5glab-slurm-0 Version=25.05.0
OS=Linux 5.15.0-83-lowlatency #92-Ubuntu SMP PREEMPT Fri Aug 18 13:09:41 UTC 2023
RealMemory=370177 AllocMem=0 FreeMem=2531 Sockets=2 Boards=1
State=IDLE+DYNAMIC_NORM ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=5glab-slurm,all
BootTime=2025-06-15T22:04:56 SlurmdStartTime=2025-06-19T02:11:59
LastBusyTime=2025-06-19T02:40:54 ResumeAfterTime=None
CfgTRES=cpu=72,mem=370177M,billing=72
AllocTRES=
CurrentWatts=0 AveWatts=0
Comment={"namespace":"tj-slurm4","podName":"tj-slurm4-compute-5glab-slurm-0"}
---
srun --nodelist=5glab-slurm-0 --pty bash
# equivalent to: srun --partition=5glab-slurm --pty bash
srun --nodelist=5glab-slurm-0 --pty nvidia-smi
# equivalent to: srun --partition=5glab-slurm --pty nvidia-smi
```
<br>
---
<br>
### Status (abnormal case)
```
$ kubectl get all -n tj-slurm4
NAME READY STATUS RESTARTS AGE
pod/tj-slurm4-accounting-0 0/1 Running 0 2m29s
pod/tj-slurm4-compute-5glab-slurm-0 1/2 Running 0 2m28s
pod/tj-slurm4-controller-0 2/3 Running 1 (112s ago) 2m29s
pod/tj-slurm4-login-7d74dccc58-9c7t5 0/1 Running 0 2m29s
pod/tj-slurm4-mariadb-0 1/1 Running 0 2m29s
pod/tj-slurm4-restapi-9c445cffb-p5sd9 1/1 Running 0 2m29s
pod/tj-slurm4-token-create-gtq7b 2/2 Running 0 2m29s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/tj-slurm4-accounting ClusterIP None <none> 6819/TCP 2m30s
service/tj-slurm4-compute ClusterIP None <none> 6818/TCP 2m30s
service/tj-slurm4-controller ClusterIP None <none> 6817/TCP 2m30s
service/tj-slurm4-login LoadBalancer 10.108.72.55 <pending> 2222:30451/TCP 2m30s
service/tj-slurm4-mariadb ClusterIP 10.108.104.185 <none> 3306/TCP 2m30s
service/tj-slurm4-mariadb-headless ClusterIP None <none> 3306/TCP 2m30s
service/tj-slurm4-restapi ClusterIP 10.98.224.69 <none> 6820/TCP 2m30s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/tj-slurm4-login 0/1 1 0 2m29s
deployment.apps/tj-slurm4-restapi 1/1 1 1 2m29s
NAME DESIRED CURRENT READY AGE
replicaset.apps/tj-slurm4-login-7d74dccc58 1 1 0 2m29s
replicaset.apps/tj-slurm4-restapi-9c445cffb 1 1 1 2m29s
NAME READY AGE
statefulset.apps/tj-slurm4-accounting 0/1 2m29s
statefulset.apps/tj-slurm4-controller 0/1 2m29s
statefulset.apps/tj-slurm4-mariadb 1/1 2m29s
NAME STATUS COMPLETIONS DURATION AGE
job.batch/tj-slurm4-token-create Running 0/1 2m29s 2m29s <---
```
- ### Inspect `job.batch/tj-slurm4-token-create`
```
$ kubectl logs -f job.batch/tj-slurm4-token-create -n tj-slurm4
Error from server (NotFound): secrets "tj-slurm4-token-slurm" not found
scontrol: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
scontrol: error: fetch_config: DNS SRV lookup failed
scontrol: error: _establish_config_source: failed to fetch config
scontrol: fatal: Could not establish a configuration source
...
```
- Normal case
After it finishes running (~3 minutes), the job disappears automatically
- ### Inspect pod/tj-slurm4-compute-5glab-slurm-0
```
$ kubectl logs pod/tj-slurm4-compute-5glab-slurm-0 -n tj-slurm4
2025-06-19 07:50:35,455 INFO Set uid to user 0 succeeded
2025-06-19 07:50:35,456 INFO supervisord started with pid 1
2025-06-19 07:50:36,458 INFO spawned: 'processes' with pid 9
2025-06-19 07:50:36,460 INFO spawned: 'slurmd' with pid 10
+ exec slurmd --systemd -Z --conf-server tj-slurm4-controller:6817 --conf Features=5glab-slurm
[2025-06-19T07:50:36.662] error: _xgetaddrinfo: getaddrinfo(tj-slurm4-controller:(null)) failed: Name or service not known
[2025-06-19T07:50:36.668] error: _xgetaddrinfo: getaddrinfo(tj-slurm4-controller:6817) failed: Name or service not known
[2025-06-19T07:50:36.668] error: slurm_set_addr: Unable to resolve "tj-slurm4-controller"
2025-06-19 07:50:37,669 INFO success: processes entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-06-19 07:50:37,670 INFO success: slurmd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[2025-06-19T07:50:46.669] error: _fetch_child: failed to fetch remote configs: Unable to contact slurm controller (connect failure)
[2025-06-19T07:50:46.670] error: _establish_configuration: failed to load configs. Retrying in 10 seconds.
```
- Normal case
```
$ k logs pod/tj-slurm5-controller-0 -n tj-slurm5
2025-06-19 08:08:43,157 INFO supervisord started with pid 1
2025-06-19 08:08:44,159 INFO spawned: 'processes' with pid 8
2025-06-19 08:08:44,160 INFO spawned: 'slurmctld' with pid 9
+ exec slurmctld --systemd
[2025-06-19T08:08:44.166] error: s_p_parse_file: unable to read "/etc/slurm/slurm.conf": Permission denied
[2025-06-19T08:08:44.166] error: ClusterName needs to be specified
[2025-06-19T08:08:44.166] fatal: Unable to process configuration file
2025-06-19 08:08:44,166 WARN exited: slurmctld (exit status 1; not expected)
2025-06-19 08:08:45,167 INFO success: processes entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-06-19 08:08:45,169 INFO spawned: 'slurmctld' with pid 10
+ exec slurmctld --systemd
[2025-06-19T08:08:45.182] warning: _prepare_run_dir: /run/slurmctld exists but is owned by 0, not SlurmUser
[2025-06-19T08:08:45.182] error: Configured MailProg is invalid
[2025-06-19T08:08:45.186] slurmctld version 25.05.0 started on cluster slurm(0)
[2025-06-19T08:08:45.187] error: mpi/pmix_v5: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2025-06-19T08:08:45.187] error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
[2025-06-19T08:08:45.187] error: MPI: Cannot create context for mpi/pmix
[2025-06-19T08:08:45.187] error: mpi/pmix_v5: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2025-06-19T08:08:45.187] error: Couldn't load specified plugin name for mpi/pmix_v5: Plugin init() callback failed
[2025-06-19T08:08:45.187] error: MPI: Cannot create context for mpi/pmix_v5
[2025-06-19T08:08:45.187] error: xsystemd_change_mainpid: connect() failed for /dev/null: Connection refused
[2025-06-19T08:08:45.195] error: _xgetaddrinfo: getaddrinfo(tj-slurm5-accounting:6819) failed: Name or service not known
[2025-06-19T08:08:45.195] error: slurm_set_addr: Unable to resolve "tj-slurm5-accounting"
[2025-06-19T08:08:45.195] error: slurm_get_port: Address family '0' not supported
[2025-06-19T08:08:45.195] error: Error connecting, bad data: family = 0, port = 0
[2025-06-19T08:08:45.195] error: _open_persist_conn: failed to open persistent connection to host:tj-slurm5-accounting:6819: No error
[2025-06-19T08:08:45.195] error: Sending PersistInit msg: No error
[2025-06-19T08:08:45.195] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
[2025-06-19T08:08:45.251] error: _xgetaddrinfo: getaddrinfo(tj-slurm5-accounting:6819) failed: Name or service not known
[2025-06-19T08:08:45.251] error: slurm_set_addr: Unable to resolve "tj-slurm5-accounting"
[2025-06-19T08:08:45.251] error: slurm_get_port: Address family '0' not supported
[2025-06-19T08:08:45.251] error: Error connecting, bad data: family = 0, port = 0
[2025-06-19T08:08:45.251] error: Sending PersistInit msg: No error
...
```
- For comparison

<br>
### STATE=inval, Reason=Low RealMemory
- ### K8s status: pods show RESTARTS
```
$ kubectl get all -n tj-slurm4
NAME READY STATUS RESTARTS AGE
pod/tj-slurm4-accounting-0 1/1 Running 3 (28h ago) 2d19h
pod/tj-slurm4-compute-5glab-slurm-1-0 2/2 Running 0 28h
pod/tj-slurm4-compute-5glab-slurm-2-0 2/2 Running 0 28h
pod/tj-slurm4-controller-0 3/3 Running 6 (28h ago) 2d19h
pod/tj-slurm4-login-7d74dccc58-ws9sm 1/1 Running 6 (28h ago) 2d19h
pod/tj-slurm4-mariadb-0 1/1 Running 3 (28h ago) 2d19h
pod/tj-slurm4-restapi-9c445cffb-pkbjf 1/1 Running 3 (28h ago) 2d19h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/tj-slurm4-accounting ClusterIP None <none> 6819/TCP 2d19h
service/tj-slurm4-compute ClusterIP None <none> 6818/TCP 2d19h
service/tj-slurm4-controller ClusterIP None <none> 6817/TCP 2d19h
service/tj-slurm4-login LoadBalancer 10.103.224.117 <pending> 2222:31921/TCP 2d19h
service/tj-slurm4-mariadb ClusterIP 10.105.112.163 <none> 3306/TCP 2d19h
service/tj-slurm4-mariadb-headless ClusterIP None <none> 3306/TCP 2d19h
service/tj-slurm4-restapi ClusterIP 10.111.142.26 <none> 6820/TCP 2d19h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/tj-slurm4-login 1/1 1 1 2d19h
deployment.apps/tj-slurm4-restapi 1/1 1 1 2d19h
NAME DESIRED CURRENT READY AGE
replicaset.apps/tj-slurm4-login-7d74dccc58 1 1 1 2d19h
replicaset.apps/tj-slurm4-restapi-9c445cffb 1 1 1 2d19h
NAME READY AGE
statefulset.apps/tj-slurm4-accounting 1/1 2d19h
statefulset.apps/tj-slurm4-controller 1/1 2d19h
statefulset.apps/tj-slurm4-mariadb 1/1 2d19h
```
- ### slurm status
```
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
5glab-slurm-1 up infinite 1 inval 5glab-slurm-1-0
5glab-slurm-2 up infinite 1 inval 5glab-slurm-2-0
all* up infinite 2 inval 5glab-slurm-1-0,5glab-slurm-2-0
```
```
$ sinfo -N -l
Mon Jun 23 02:53:55 2025
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
5glab-slurm-1-0 1 5glab-slurm-1 inval 128 2:32:2 515645 0 1 5glab-sl Low RealMemory (repo
5glab-slurm-1-0 1 all* inval 128 2:32:2 515645 0 1 5glab-sl Low RealMemory (repo
5glab-slurm-2-0 1 5glab-slurm-2 inval 128 2:32:2 515645 0 1 5glab-sl Low RealMemory (repo
5glab-slurm-2-0 1 all* inval 128 2:32:2 515645 0 1 5glab-sl Low RealMemory (repo
```
- `-N, --Node`: node-centric format
- `-l, --long`: long output; displays more information
- `REASON`: `Low RealMemory (repo` — truncated by the column width; the full reason appears in the `scontrol show node` output below
```
$ scontrol show node
NodeName=5glab-slurm-1-0 Arch=x86_64 CoresPerSocket=32
CPUAlloc=0 CPUEfctv=126 CPUTot=128 CPULoad=4.22
AvailableFeatures=5glab-slurm-1
ActiveFeatures=5glab-slurm-1
Gres=(null)
NodeAddr=10.244.2.63 NodeHostName=5glab-slurm-1-0 Version=25.05.0
OS=Linux 5.15.0-142-generic #152-Ubuntu SMP Mon May 19 10:54:31 UTC 2025
RealMemory=515645 AllocMem=0 FreeMem=482601 Sockets=2 Boards=1
CoreSpecCount=1 CPUSpecList=126-127 MemSpecLimit=514571
State=IDLE+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=5glab-slurm-1,all
BootTime=2025-06-21T21:40:26 SlurmdStartTime=2025-06-21T21:55:14
LastBusyTime=2025-06-20T08:59:38 ResumeAfterTime=None
CfgTRES=cpu=126,mem=515645M,billing=126
AllocTRES=
CurrentWatts=0 AveWatts=0
Reason=Low RealMemory (reported:515644 < 100.00% of configured:515645) [slurm@2025-06-21T21:55:14]
Comment={"namespace":"tj-slurm4","podName":"tj-slurm4-compute-5glab-slurm-1-0"}
```
- **State**
- `State=IDLE+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A`
- **Reason**
- `Reason=Low RealMemory (reported:515644 < 100.00% of configured:515645)`
- **Default unit: MB**
Slurm has used MB as its base memory unit since its early design, and most sysadmins are used to thinking of memory capacity in MB (or GB).
- The node reports 515644 (MB) < 515645 (MB) configured: 1 MB less than the healthy value
- [[stackoverflow] Slurm says drained Low RealMemory](https://stackoverflow.com/questions/68132982/)
This could be that `RealMemory=541008` in `slurm.conf` is too high for your system. Try lowering the value. Lets suppose you have indeed 541 Gb of RAM installed: change it to `RealMemory=500000`, do a `scontrol reconfigure` and then a `scontrol update nodename=transgen-4 state=resume`.
- ### Node configuration
```
$ kubectl exec -it pod/tj-slurm4-compute-5glab-slurm-1-0 -n tj-slurm4 -c slurmd -- bash
---
root@5glab-slurm-1-0:/tmp# slurmd -C
NodeName=5glab-slurm-1-0 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515644 Gres=gpu:nvidia_a40:1
Found gpu:nvidia_a40:1 with Autodetect=nvidia (Substring of gpu name may be used instead)
---
root@5glab-slurm-1-0:/tmp# free -m
total used free shared buff/cache available
Mem: 515644 28561 333519 1596 158438 487083
Swap: 0 0 0
---
root@5glab-slurm-1-0:/tmp# free -g
total used free shared buff/cache available
Mem: 503 27 325 1 154 475
Swap: 0 0 0
```
- ### Fix suggested by GPT
> gpt-o4-mini-high (2025/06/23)
1. **Set `RealMemory` manually**
In the matching `nodeConfig` block, set the correct MB value, e.g. one MB lower:
```yaml
compute:
nodesets:
- name: 5glab-slurm-1
…
nodeConfig:
RealMemory: 515644
- name: 5glab-slurm-2
…
nodeConfig:
RealMemory: 515644
```
The `slurm.conf` the operator generates will then define the node with `RealMemory=515644`, matching what slurmd reports.
2. **Or disable the memory check**
If the tiny memory discrepancy doesn't matter, change the select-type parameters in `extraSlurmConf` so memory is no longer validated:
```yaml
slurm:
extraSlurmConf:
SelectTypeParameters: CR_CPU
```
After the operator redeploys, Slurm only checks CPUs and no longer puts nodes into `DRAIN`/`INVALID_REG` over a memory mismatch.
Pick one of the two approaches, then run:
```bash
# Re-apply the Helm values
helm upgrade slurm-cluster slurm-operator/helm-chart -f your-values.yaml
# Or update live (without restarting the daemons)
scontrol update NodeName=5glab-slurm-1-0 RealMemory=515644
scontrol update NodeName=5glab-slurm-2-0 RealMemory=515644
```
Run `sinfo` again afterwards: the partitions should return to `idle` and the nodes to `IDLE`, no longer showing `inval`.
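The MB value used in the `scontrol update` calls above can be derived on the node itself. A sketch that reads MemTotal from /proc/meminfo (roughly what `slurmd -C` reports as RealMemory) and prints the command instead of executing it, so it runs without a cluster (the node name is a placeholder):

```shell
#!/bin/sh
# Derive the node's memory in MB and print the matching scontrol
# command; printing (not running) keeps the sketch cluster-free.
node_name="5glab-slurm-1-0"   # placeholder node name
real_mem_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
cmd="scontrol update NodeName=${node_name} RealMemory=${real_mem_mb}"
echo "$cmd"
```

Run on the compute pod itself (via `kubectl exec`, as shown in the node-configuration section), this yields the reported value that `slurm.conf` must not exceed.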
<br>
### Converting whiteouts fails
```
root@esc8000b-1-0:~# enroot import -o llm-gemma3-trainer-0.2.1-by-tj.sqsh 'docker://ociscloud@ociscloud/llm-gemma3-trainer:v0.2.1'
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: ociscloud
Enter host password for user 'ociscloud':
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Found all layers in cache
[INFO] Extracting image layers...
100% 55:0=43s 3153aa388d026c26a2235e1ed0163e350e451f41a8a313e1804d7e1afb857ab4
[INFO] Converting whiteouts...
0% 0:55=0s 4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /tmp/enroot.XXsLpqy45b/1/workspace/Gemma3-Finetune-master/: Not supported
3% 2:53=0s f8ca6ca18d9e3a91bbac01fbc3758d39232a89317604d6f43de447cbbd7500cf enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /tmp/enroot.XXsLpqy45b/3/usr/local/lib/python3.10/dist-packages/tests/: Not supported
5% 3:52=0s 91b7f77c53292461e64f3855290713157a645ea2725f4dc7d4dce19aaf48f955 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/4/usr/local/bin/plasma_store: Operation not permitted
32% 18:37=0s ce782753a19f67a2c94eefa0ff3dd560a4b847462aa95eb2c82be662dce4b132 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/19/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_target_encoder.py: Operation not permitted
36% 20:35=0s 6c9f88339e6283cb72af66d47db4818e17155ef072be34fddd0df91b5305de52 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/21/usr/local/lib/python3.10/dist-packages/Pillow.libs: Operation not permitted
47% 26:29=0s ba56bd485673e44a146422277ac896a790afc372443156d6f3ffa126033a97a7 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/22/usr/local/lib/python3.10/dist-packages/google/protobuf/pyext/_message.cpython-310-x86_64-linux-gnu.so: Operation not permitted
58% 32:23=0s e4cba662a9603ad133d46b375fd6a4160a906db10e6107b91c9eb753ba97903f enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/32/usr/local/share/jupyter/lab/static/vendors~main.4d56d8a74685f09ed4a6.js.LICENSE.txt: Operation not permitted
60% 33:22=0s 497e75b3d1c5473a012b207c553f76e4b11419f2f2f66bbdd07017435dd84957 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/34/usr/local/lib/python3.10/dist-packages/urllib3-2.0.4.dist-info: Operation not permitted
74% 41:14=0s 2ad84487f9d4d31cd1e0a92697a5447dd241935253d036b272ef16d31620c1e7 enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/43/tmp/cuda-9.0.patch: Operation not permitted
92% 51:4=0s 9ac855545fa90ed2bf3b388fdff9ef06ac9427b0c0fca07c9e59161983d8827e enroot-aufs2ovlfs: failed to create ovlfs whiteout: /tmp/enroot.XXsLpqy45b/53/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy-updates_universe_binary-amd64_Packages.lz4: Operation not permitted
100% 55:0=0s 3153aa388d026c26a2235e1ed0163e350e451f41a8a313e1804d7e1afb857ab4
```
- Fix
```
$ export ENROOT_DATA_PATH=/run/enroot/data
$ export ENROOT_CACHE_PATH=/run/enroot/cache
$ export TMPDIR=/run/enroot/tmp
```
- Normal run output
```
root@esc8000a-1-0:/run/enroot# enroot import -o llm-gemma3-trainer-0.2.1-by-tj.sqsh 'docker://ociscloud@ociscloud/llm-gemma3-trainer:v0.2.1'
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: ociscloud
Enter host password for user 'ociscloud':
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Found all layers in cache
[INFO] Extracting image layers...
100% 55:0=47s 3153aa388d026c26a2235e1ed0163e350e451f41a8a313e1804d7e1afb857ab4
[INFO] Converting whiteouts...
100% 55:0=0s 3153aa388d026c26a2235e1ed0163e350e451f41a8a313e1804d7e1afb857ab4
[INFO] Creating squashfs filesystem...
Parallel mksquashfs: Using 2 processors
Creating 4.0 filesystem on /run/enroot/llm-gemma3-trainer-0.2.1-by-tj.sqsh, block size 131072.
[=====================================================================================================================================================\] 501558/501558 100%
Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
uncompressed data, compressed metadata, compressed fragments,
compressed xattrs, compressed ids
duplicates are removed
Filesystem size 22946004.17 Kbytes (22408.21 Mbytes)
86.94% of uncompressed filesystem size (26392494.29 Kbytes)
Inode table size 4142611 bytes (4045.52 Kbytes)
34.07% of uncompressed inode table size (12160805 bytes)
Directory table size 4464938 bytes (4360.29 Kbytes)
40.10% of uncompressed directory table size (11133859 bytes)
Number of duplicate files found 44919
Number of inodes 352975
Number of files 316965
Number of fragments 17960
Number of symbolic links 1655
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 34355
Number of hard-links 123
Number of ids (unique uids + gids) 1
Number of uids 1
root (0)
Number of gids 1
root (0)
```
<br>
## Resource modifications
```
$ kubectl logs pod/slurm-operator-54c85757f9-wcbrp -n tj-slinky4
```
`2025-06-19T09:37:09Z ERROR Reconciler error {"controller": "nodeset-controller", "controllerGroup": "slinky.slurm.net", "controllerKind": "NodeSet", "NodeSet": {"name":"tj-slurm4-compute-5glab-slurm","namespace":"tj-slurm4"}, "namespace": "tj-slurm4", "name": "tj-slurm4-compute-5glab-slurm", "reconcileID": "7bb2eff2-2522-484f-80f8-bf6817f48fd6", "error": "Pod \"tj-slurm4-compute-5glab-slurm-0\" is invalid: spec.containers[0].resources.limits: Required value: Limit must be set for non overcommitable resources", "errorCauses": [{"error": "Pod \"tj-slurm4-compute-5glab-slurm-0\" is invalid: spec.containers[0].resources.limits: Required value: Limit must be set for non overcommitable resources"}]}`
<br>
## Scenarios
### Two partitions under the same node (one GPU per partition)


### To be organized
1. Mount a volume into a GPU worker node
2. Two GPU worker nodes, one GPU per node, in the same partition
3. Create two partitions: an A40 partition (one node with 2 A40 GPUs) and a 3090 partition (one node with 1 3090 GPU)
4. Dynamically add or remove nodes
5. Set the enroot environment variables
6. Node config has no gres information
<br>
## Other test commands
### To be organized
```
$> srun --nodelist=5glab-slurm-1-0 --pty bash
$> enroot import -o llm-gemma3-trainer-0.2.1.sqsh 'docker://ociscloud@ociscloud/llm-gemma3-trainer:v0.2.1'
$> enroot import -o alpine.sqsh 'docker://alpine:latest'
srun --nodelist=5glab-slurm-1-0 --container-image=ubuntu:22.04 --pty bash
$> export ENROOT_DATA_PATH=/run/enroot/data
$> export ENROOT_CACHE_PATH=/run/enroot/cache
$> export TMPDIR=/run/enroot/tmp
```
I created three directories (data, cache, tmp) under /run/enroot and pointed ENROOT_CACHE_PATH / ENROOT_DATA_PATH / TMPDIR at them, so the filesystem used during conversion is ext4; the conversion then completes successfully.
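The workaround above can be sketched as a small setup script. `/run/enroot` is the path used in this environment; any ext4-backed directory works, so the sketch defaults to a path under /tmp purely so it runs anywhere:

```shell
#!/bin/sh
# Point enroot's data/cache/tmp dirs at an ext4-backed location so
# whiteout conversion succeeds (overlayfs-backed paths fail with
# "failed to create opaque ovlfs whiteout ... Not supported").
base="${ENROOT_BASE:-/tmp/enroot}"   # /run/enroot in this doc
mkdir -p "$base/data" "$base/cache" "$base/tmp"

export ENROOT_DATA_PATH="$base/data"
export ENROOT_CACHE_PATH="$base/cache"
export TMPDIR="$base/tmp"
echo "TMPDIR=$TMPDIR"
```

Source this (or paste the exports) before running `enroot import`.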
<br>
## Terminology
### DRAIN
- DRAIN: once drained, the node accepts no new jobs
- pronounced [dreɪn]; to drain, to empty out
<br>
{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}