
Slurm & docs
===
###### tags: `Slurm`, `HPC`, `Linux`, `srun`, `sinfo`, `docs`

[TOC]

## Introduction
- Slurm: an open-source resource manager and job scheduler, originally developed at LLNL.
- SchedMD: founded in 2010; responsible for the ongoing development and maintenance of Slurm.
- Relationship: SchedMD is the primary developer of Slurm and offers professional support and services.
- [2025/12/16 - 🧠 NVIDIA acquires SchedMD: with Slurm, the "cluster butler", on board, will AI training scheduling get smoother? 🚀](https://vocus.cc/article/6940f0e9fd89780001e6f026)

## docs
- [slurm.conf](https://slurm.schedmd.com/slurm.conf.html)
    - [DefMemPerCPU](http://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU)
- [Control Group in Slurm](https://slurm.schedmd.com/cgroups.html)
- [Quality of Service (QOS)](https://slurm.schedmd.com/qos.html)
- [Resource Limits](https://slurm.schedmd.com/resource_limits.html)
- [Prolog and Epilog Guide](https://slurm.schedmd.com/prolog_epilog.html)

## DeepWiki
> https://deepwiki.com/SchedMD/slurm

### Looking up Slurm usage automatically
> Example:
> https://deepwiki.com/search/-slurmconf-partitionnamedebug_93d101fc-912c-4a61-8b2f-c62b28853a9a
> ![image](https://hackmd.io/_uploads/SJDwPTAtgg.png)

The steps DeepWiki reports while answering look like this:

- Let me analyze the settings and restrictions related to the `AllowGroups` configuration in SLURM.
- Let me dig into the implementation details of AllowGroups:
- Let me search for the specific description of AllowGroups:
- Let me check what the slurm.conf.5 manual says about AllowGroups:
- Let me search the documentation for the PARTITION configuration section:
- Now let me look at the detailed partition parameter descriptions:
- Let me look for more detailed partition parameter documentation:
- Let me keep searching for the partition configuration paragraphs:
- Let me look for the partition-related part of the man page:

## References
- [[twcc][@f1-manual] Introduction to Slurm](https://man.twcc.ai/@f1-manual/slurm_instructions)
- [[twcc][@AI-Pilot] Slurm basic concepts](https://man.twcc.ai/@AI-Pilot/SJfnyE_Mkg)
- [Slurm installation procedure](https://hackmd.io/@stargazerwang/S1t60i6lF)
    > The role of Slurm in an HPC cluster

## Commands - sinfo
```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
tn*          up   infinite      3   idle esc8000a-[0-1],esc8000b-0
debug        up   infinite      1   idle e900-0
```

## Commands - scontrol

### `$ scontrol show`
```
$ scontrol show config
$ scontrol show node
$ scontrol show nodes
$ scontrol show partition
$ scontrol show partition <partition-name>
$ scontrol show partitions
$ scontrol show partitions <partition-name>
```

```
$ scontrol show partition debug
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   NodeSets=e900
   Nodes=e900-0
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=72 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=12,mem=240G,node=1,billing=12,gres/gpu=1
```

### `$ scontrol update NodeName=esc8000b-0 State=RESUME`
```
root@slurm-login-6fb5d948d6-j6b7z:~# scontrol show node esc8000b-0
NodeName=esc8000b-0 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUEfctv=80 CPUTot=80 CPULoad=0.25
   AvailableFeatures=esc8000b
   ActiveFeatures=esc8000b
   Gres=(null)
   NodeAddr=10.244.1.213 NodeHostName=esc8000b-0 Version=25.05.0
   OS=Linux 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025
   RealMemory=245760 AllocMem=0 FreeMem=169470 Sockets=2 Boards=1
   State=IDLE+DRAIN+DYNAMIC_NORM ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=tn
   BootTime=2025-06-21T21:42:27 SlurmdStartTime=2025-07-25T08:23:18
   LastBusyTime=2025-07-25T08:24:39 ResumeAfterTime=None
   CfgTRES=cpu=80,mem=240G,billing=80
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   Reason=Kill task failed (JobId=561 StepId=extern) [root@2025-07-25T08:24:33]
   Comment={"namespace":"slurm","podName":"slurm-compute-esc8000b-0"}
```
- The node is drained because of `Reason=Kill task failed (JobId=561 StepId=extern) [root@2025-07-25T08:24:33]`

- ### Before the change
  ```
  root@slurm-login-6fb5d948d6-j6b7z:~# sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  tn*          up   infinite      1  drain esc8000b-0
  debug        up   infinite      0    n/a
  ```

- ### After the change
  ```
  root@slurm-login-6fb5d948d6-j6b7z:~# sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  tn*          up   infinite      1   idle esc8000b-0
  ```

### Force reset + resume of a compute node
```
$ scontrol update NodeName=esc8000b-0 State=DOWN Reason="reset for debug"
$ scontrol update NodeName=esc8000b-0 State=RESUME
```

## Commands - sacctmgr
> https://slurm.schedmd.com/accounting.html

```bash
$ sacctmgr list users
$ sacctmgr list account
$ sacctmgr list cluster
$ sacctmgr list stats
$ sacctmgr list tres
```

### `$ sacctmgr list cluster`
```
$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
     slurm    10.244.4.254         6817 11008         1                                                                                           normal
```

### `$ sacctmgr list stats`
```
root@slurm-login-6fb5d948d6-j6b7z:~# sacctmgr list stats
*******************************************************************
sacctmgr show stats output at Fri Jul 25 09:24:13 2025 (1753435453)
Data since                    Tue Jul 15 08:40:21 2025 (1752568821)
All statistics are in microseconds
*******************************************************************

Internal DBD rollup last ran Fri Jul 25 09:00:00 2025 (1753434000)
    Last cycle:   93156
    Max cycle:    2378947
    Total time:   48027091
    Total cycles: 242
    Mean cycle:   198459

Cluster 'slurm' rollup statistics
Hour last ran Fri Jul 25 09:00:00 2025 (1753434000)
    Last cycle:   48480
    Max cycle:    1293641
    Total time:   30226828
    Total cycles: 240
    Mean cycle:   125945
Day last ran Fri Jul 25 00:00:00 2025 (1753401600)
    Last cycle:   43465
    Max cycle:    132705
    Total time:   408674
    Total cycles: 11
    Mean cycle:   37152

Remote Procedure Call statistics by message type
    DBD_CLUSTER_TRES      ( 1407) count:40781 ave_time:4961 total_time:202336387
    SLURM_PERSIST_INIT    ( 6500) count:38182 ave_time:1050 total_time:40105651
    DBD_GET_USERS         ( 1415) count:38006 ave_time:147 total_time:5590324
    DBD_GET_TRES          ( 1486) count:37993 ave_time:367 total_time:13949239
    DBD_GET_QOS           ( 1448) count:37992 ave_time:298 total_time:11325751
    DBD_GET_ASSOCS        ( 1410) count:37990 ave_time:979 total_time:37195568
    DBD_GET_FEDERATIONS   ( 1494) count:37988 ave_time:328 total_time:12482373
    DBD_GET_RES           ( 1478) count:37988 ave_time:214 total_time:8145557
    DBD_GET_WCKEYS        ( 1453) count:37988 ave_time:179 total_time:6806631
    DBD_REGISTER_CTLD     ( 1434) count:19030 ave_time:51406 total_time:978256576
    DBD_STEP_START        ( 1442) count:894 ave_time:84322 total_time:75384570
    DBD_STEP_COMPLETE     ( 1441) count:881 ave_time:99862 total_time:87978781
    DBD_JOB_START         ( 1425) count:668 ave_time:106460 total_time:71115555
    DBD_SEND_MULT_MSG     ( 1474) count:526 ave_time:204597 total_time:107618176
    DBD_JOB_COMPLETE      ( 1424) count:497 ave_time:68886 total_time:34236519
    DBD_FINI              ( 1401) count:274 ave_time:1406 total_time:385249
    DBD_NODE_STATE        ( 1432) count:184 ave_time:87882 total_time:16170425
    DBD_GET_JOBS_COND     ( 1444) count:97 ave_time:4920 total_time:477322
    DBD_GET_ACCOUNTS      ( 1409) count:26 ave_time:275 total_time:7160
    DBD_ADD_ACCOUNTS_COND ( 1501) count:8 ave_time:35473 total_time:283785
    DBD_ADD_USERS_COND    ( 1502) count:7 ave_time:71481 total_time:500369
    DBD_GET_EVENTS        ( 1470) count:7 ave_time:333 total_time:2333
    DBD_GET_CLUSTERS      ( 1412) count:6 ave_time:802 total_time:4816
    DBD_REMOVE_ACCOUNTS   ( 1435) count:2 ave_time:11589 total_time:23179
    DBD_GET_STATS         ( 1489) count:2 ave_time:18 total_time:37

Remote Procedure Call statistics by user
    slurm                 ( 401) count:367372 ave_time:4650 total_time:1708302700
    root                  ( 0) count:545 ave_time:3658 total_time:1993758
    1000                  ( 1000) count:83 ave_time:866 total_time:71911
    bob                   ( 1002) count:17 ave_time:821 total_time:13964
```

## Commands - srun

### Common options
- `-c, --cpus-per-task=<num>`
  Sets the **number of CPUs each task needs**.
- `-n, --ntasks=<num>` (default: one task per allocated node)
  Sets the **total number of tasks to launch**.
- `--ntasks-per-node=<num>`
  Sets the **number of tasks to launch on each node**.
- `-N, --nodes=<num>`
  Sets the number of nodes to use.

### Examples
> `my_program` can be replaced with one of the following commands for testing:
> - `hostname`
> - `nvidia-smi -L`

- ### Example 1: a single task that needs 4 CPUs
  ```bash
  srun -c 4 my_program
  # equivalent to
  srun --cpus-per-task=4 my_program
  ```
  This means "run just 1 task and allocate 4 CPUs to it".

- ### Example 2: run 8 tasks with 2 CPUs each
  ```bash
  srun -n 8 -c 2 my_program
  # equivalent to
  srun --ntasks=8 --cpus-per-task=2 my_program
  ```
  This means "run 8 tasks in total, each with its own 2 CPUs".

- ### Example 3: set the total CPU count (many tasks, one CPU each)
  ```bash
  srun -n 16 my_program
  ```
  This runs 16 tasks, and each task uses 1 CPU by default.

- `srun --gres=gpu:2 my_program`
  Asks Slurm to allocate 2 GPUs, then runs `my_program` with access to those 2 GPUs.

## Commands - squeue
```
squeue -o "%.8i %.9P %.8j %.8u %.2t %.10M %.6D %R"
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
```

## Terms

### TRES
- TRES = Trackable RESources, Slurm's unified way of accounting for resources.
- It allows fine-grained tracking of CPUs, GPUs, memory, licenses, and other resource types.
- It makes resource allocation, limits, queries, and billing easier.

## config

### gres
- ### [PARAMETERS / GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes)
  A comma-separated list of generic resources (GRES) to be managed (for example: `GresTypes=gpu,mps`). These resources may have a GRES plugin of the same name that provides additional functionality. By default, no generic resources are managed. For correct operation, make sure this parameter is consistent across all nodes in the cluster.

  - ### Q: "These resources may have a GRES plugin of the same name that provides additional functionality"?
  - ### A: gpt-4.1 (2025/07/20)
    The sentence means: when you list a resource name in `GresTypes` (such as `gpu` or `mps`), Slurm looks for a "GRES plugin" (Generic Resource Plugin) with the same name, which manages that resource and provides extra functionality.

    In short:

    * If you set `GresTypes=gpu`, Slurm enables the `gpu` GRES plugin (for example `/usr/lib/slurm/gres_gpu.so`).
      Note: verified on the actual system
      ```bash
      # identical on the controller pod and the compute pod
      /usr/lib/x86_64-linux-gnu/slurm/gres_gpu.a
      /usr/lib/x86_64-linux-gnu/slurm/gres_gpu.la
      /usr/lib/x86_64-linux-gnu/slurm/gres_gpu.so
      /usr/lib/x86_64-linux-gnu/slurm/gres_mps.a
      /usr/lib/x86_64-linux-gnu/slurm/gres_mps.la
      /usr/lib/x86_64-linux-gnu/slurm/gres_mps.so
      /usr/lib/x86_64-linux-gnu/slurm/gres_nic.a
      /usr/lib/x86_64-linux-gnu/slurm/gres_nic.la
      /usr/lib/x86_64-linux-gnu/slurm/gres_nic.so
      /usr/lib/x86_64-linux-gnu/slurm/gres_shard.a
      /usr/lib/x86_64-linux-gnu/slurm/gres_shard.la
      /usr/lib/x86_64-linux-gnu/slurm/gres_shard.so
      ```
    * The plugin helps Slurm track, manage, and allocate that kind of resource, and can also provide features such as monitoring, limits, and reporting.
    * If you list `mps` but the `mps` GRES plugin is not installed, you only get basic counting of the resource, with no extra functionality.

    **Summary:**
    > "A GRES plugin of the same name" means that Slurm looks for a plugin whose name matches the resource name you configured (such as gpu or mps) and uses it to strengthen the management and handling of that resource.

- ### [NODE CONFIGURATION](https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION) / [Gres](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1)
  > - `Gres=gpu:A40:2`
  > - `Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G`

  - A comma-separated list of generic resource (GRES) specifications for the node. The format is "`<name>[:<type>][:no_consume]:<number>[K|M|G]`".
  - The first field is the resource name, which must match a name listed in the `GresTypes` configuration parameter.
  - The optional `type` field distinguishes the model of the generic resource (such as the GPU model). Specifying both untyped and typed GRES for the same `<name>` is not allowed.
  - The optional `no_consume` field means the resource count is not consumed when the resource is allocated (that is, it has no finite limit). `no_consume` is a per-GRES setting and applies to the resource whether or not a type is specified.
  - **If the resource has a dedicated plugin, `no_consume` should not be used. If you want several processes to share a GPU at the same time, consider the "shard" GRES instead.**
  - The last field must specify the resource count. The suffixes "K", "M", "G", "T", or "P" multiply the number by 1024, 1048576, 1073741824, and so on (for example: `Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G`).
  - By default a node has no generic resources, and the maximum supported count is that of an unsigned 64-bit integer.
  - To filter nodes by boolean flags through a job constraint, see **Features**.

- ### [NODE CONFIGURATION](https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION) / [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features)
  > - `Features=a40` <-> `--constraint=A40`
  > - `Features=e900` <-> `--constraint=e900`
  > - **Incorrect example**:
  >     - `Gres=gpu:rtx3090:1` <-> `--constraint=rtx3090`

  A comma-separated list of arbitrary strings that indicate some characteristic (feature) associated with the node. Features currently have no value or count: a node either has a feature or it does not. You can include a numeric component in a feature name (for example, to indicate processor speed), but that number is treated only as part of the feature string.

  Features are mainly used with the `--constraint` option to filter the nodes eligible to run a job. By default a node has no features. For more control (such as resource type and count), see Gres. Scheduling on features is faster than scheduling on GRES, but it is limited to Boolean operations.

  **Note:** the hostlist function `feature{myfeature}` expands to all nodes that have the specified feature. It can be used in place of, or together with, a regular hostlist expression in commands or configuration files that interact with slurmctld. For example:
  ```bash
  scontrol update node=feature{myfeature} state=resume
  ```
  or
  ```bash
  PartitionName=p1 Nodes=feature{myfeature}
  ```

## Discussion

### Who developed Slurm?
- ### GPT-4 Turbo (2025/06/16)
  Slurm (Simple Linux Utility for Resource Management) was developed at **Lawrence Livermore National Laboratory (LLNL)**. It is an open-source resource management and job scheduling system used mainly on large compute clusters, especially in high-performance computing (HPC) environments.

  Slurm is designed to manage large pools of compute resources, coordinate job execution, and maximize cluster utilization. Since its creation it has been widely adopted in the computing infrastructure of research institutions, universities, and some companies. It is one of the resource managers used on many supercomputers and large-scale clusters today.

### How is it related to SchedMD?
- ### GPT-4 Turbo (2025/06/16)
  **SchedMD** is the company that maintains and develops **Slurm**. Although Slurm was originally developed at **Lawrence Livermore National Laboratory (LLNL)**, in 2010 core Slurm developers left LLNL and founded SchedMD to focus on Slurm's further development, support, and commercialization.

  **SchedMD** is therefore the primary developer and support provider for **Slurm**, offering paid support, consulting, and custom feature development to users worldwide.

  ### Summary of the relationship:
  1. **Slurm** is an open-source resource management and scheduling system.
  2. **SchedMD** is the company that maintains and develops Slurm and provides professional support services.

  If you run into problems while using Slurm, SchedMD offers professional support, but Slurm itself remains an open-source project that is free to use.

### What is Slurm Accounting (slurmdbd) for?
- ### gpt-4.1-mini (2025/06/17)
  The Slurm Accounting component **slurmdbd** (Slurm Database Daemon) is mainly used to:

  * **Centrally manage and store the accounting data of Slurm clusters**, such as submitted job information, job execution status, resource usage (CPU, GPU, memory, and so on), and run time.
  * **Write accounting data to a database**, usually MySQL or MariaDB, so it can later be queried, analyzed, billed, or reported on.
  * **Support centralized data management across multiple clusters**, so usage from several Slurm clusters is recorded and managed in one place.
  * **Provide an interface for other tools or systems** to retrieve accounting data, which helps with automated monitoring, quota enforcement, invoice generation, and similar uses.

  In short, **slurmdbd is Slurm's accounting data collection and management server. It makes sure the resources used by every job and the historical records are fully preserved and easy to access.**

  ---

  As a simple analogy:

  * **slurmctld** manages job scheduling for the Slurm cluster.
  * **slurmdbd** specifically manages "accounting data and records after jobs have run".

  You can think of slurmdbd as Slurm's "accounting department".

### Usage of `feature{myfeature}` in `scontrol update node=feature{myfeature} state=resume`
- ### gpt-4.1 (2025/07/20)
  ### What `feature{myfeature}` means
  In Slurm, `feature{myfeature}` stands for "all nodes that have the feature myfeature". It is a Slurm **hostlist expansion syntax** that selects every node with a given feature without listing node names one by one.

  ---

  ### Usage example
  Suppose your `slurm.conf` defines several nodes, each tagged with different features, for example:

  ```
  NodeName=node1 Features=ssd,fastcpu
  NodeName=node2 Features=ssd
  NodeName=node3 Features=fastcpu
  NodeName=node4 Features=gpu
  ```

  If you want to run a command against "all nodes that have the fastcpu feature", you can write:

  ```bash
  scontrol update node=feature{fastcpu} state=resume
  ```

  This command is equivalent to:

  ```bash
  scontrol update node=node1,node3 state=resume
  ```

  because only node1 and node3 have the fastcpu feature.

  ---

  ### More use cases
  * It also works when assigning nodes to a partition:

    ```bash
    PartitionName=p1 Nodes=feature{ssd}
    ```

    The `p1` partition then contains every node that has the ssd feature.
  * Other commands also accept this hostlist expression (such as scontrol and sinfo).

  ---

  ### Summary
  * `feature{myfeature}` = all nodes that have the feature myfeature.
  * It lets you select nodes by feature instead of listing them one by one, which suits dynamically managed environments or clusters with many nodes.

{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}
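
To make the `feature{...}` selection from the last discussion item concrete, the sketch below emulates it offline in Python against the same four example `NodeName=... Features=...` lines. It is only an illustration of the selection logic, not how Slurm implements hostlist expansion: the helper names `parse_node_features` and `expand_feature` are hypothetical, and the parser assumes one plain node name per line (no `node[1-4]` ranges).

```python
def parse_node_features(conf_text):
    """Map each NodeName to its set of Features from slurm.conf-style lines.

    Assumption: one plain node name per line, no bracketed ranges.
    """
    nodes = {}
    for line in conf_text.splitlines():
        # Split "Key=Value" tokens; ignore anything without an "=".
        fields = dict(
            token.split("=", 1) for token in line.split() if "=" in token
        )
        if "NodeName" in fields:
            features = fields.get("Features", "")
            nodes[fields["NodeName"]] = {f for f in features.split(",") if f}
    return nodes


def expand_feature(nodes, feature):
    """Return the sorted node names that carry the given feature."""
    return sorted(name for name, feats in nodes.items() if feature in feats)


# Same example node definitions as in the discussion above.
CONF = """\
NodeName=node1 Features=ssd,fastcpu
NodeName=node2 Features=ssd
NodeName=node3 Features=fastcpu
NodeName=node4 Features=gpu
"""

nodes = parse_node_features(CONF)
print(",".join(expand_feature(nodes, "fastcpu")))  # node1,node3
print(",".join(expand_feature(nodes, "ssd")))      # node1,node2
```

The `fastcpu` result matches the `node=node1,node3` equivalence given in the example, and `ssd` gives the node set that a `Nodes=feature{ssd}` partition would cover for these four nodes.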