# Slurm Installation Guide

# Slurm Roles in an HPC Cluster

* User **Login** Node: the entry point of the HPC cluster; the host on which users work.
* Slurm **Control** Node: the Slurm control host, responsible for scheduling and dispatching jobs.
* Slurm **Compute** Node: a Slurm compute host, which accepts jobs dispatched by the control host.
* Slurm **Database** Node: the Slurm database host, which records job history, resource usage, and similar information. The database service is usually provided by MariaDB or MySQL installed on the same host.

Every step in this guide is labeled with the roles it applies to. For example, a step labeled "**All** Nodes" must be completed on every host, while a step labeled "User **Login** Node" only needs to be completed on the login host of the HPC cluster, and so on.

# Installation Environment

* Operating system:
    * CentOS, version 7.9.2009.
    * SELinux disabled.
    * Firewall allowing all connections originating from the HPC cluster's network segment.
    * All available updates installed.
* This guide and the example configuration files it provides assume 4 hosts, with the following host names and corresponding Slurm roles:
    * `login`: User **Login** Node
    * `slurm-control`: Slurm **Control** Node, Slurm **Database** Node
    * `slurm-compute01`: Slurm **Compute** Node
    * `slurm-compute02`: Slurm **Compute** Node
* `/etc/hosts` on every host must contain the complete, correct host names and their corresponding IP addresses, for example:

    ```
    172.100.0.10    login
    172.100.0.20    slurm-control
    172.100.0.30    slurm-compute01
    172.100.0.31    slurm-compute02
    ```

# 1. Add the `epel-release` and `mariadb` Package Repositories (Role: **All** Nodes)

```
cat >/etc/yum.repos.d/MariaDB.repo <<EOF
[mariadb]
name=MariaDB
baseurl=https://ftp.ubuntu-tw.org/mirror/mariadb/yum/10.5/centos7-amd64
gpgkey=https://ftp.ubuntu-tw.org/mirror/mariadb/yum/RPM-GPG-KEY-MariaDB
gpgcheck=1
EOF

yum makecache
yum install -y epel-release
yum makecache
```

# 2. Add the Required Users and Groups (Role: **All** Nodes)

```
groupadd -g 901 munge
useradd -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -g munge -m -s /sbin/nologin -u 901 munge

groupadd -g 902 slurm
useradd -c "Slurm Workload Manager" -d /var/lib/slurm -g slurm -m -s /bin/bash -u 902 slurm
```

# 3. Install Packages

## 3.1. Install the `munge`, `slurm`, `slurm-contribs`, and `slurm-perlapi` Packages (Role: **All** Nodes)

```
yum install -y munge slurm slurm-contribs slurm-perlapi
```

## 3.2. Install the `slurm-devel`, `slurm-pmi`, and `slurm-pmi-devel` Packages (Role: User **Login** Node)

```
yum install -y slurm-devel slurm-pmi slurm-pmi-devel
```

## 3.3. Install the `slurm-slurmctld` Package (Role: Slurm **Control** Node)

```
yum install -y slurm-slurmctld
```

## 3.4. Install the `slurm-slurmd` and `slurm-pmi` Packages (Role: Slurm **Compute** Node)

```
yum install -y slurm-slurmd slurm-pmi
```

## 3.5. Install the `MariaDB-server`, `MariaDB-client`, and `slurm-slurmdbd` Packages (Role: Slurm **Database** Node)

```
yum install -y MariaDB-server MariaDB-client slurm-slurmdbd
```

# 4. Generate and Copy the `munge.key` Key (Role: **All** Nodes)

Run the following command on any one host to generate the key file `/etc/munge/munge.key`.

```
create-munge-key
```

Then copy this file to the same location on every host in the HPC cluster (a distribution sketch follows section 6.1). Finally, for security reasons, verify the permissions and ownership of the file.

```
chmod 400 /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
```

# 5. Verify Directory Permissions and Ownership (Role: **All** Nodes)

```
find /var/log/munge /etc/munge -type d -execdir chmod 700 '{}' ';'
find /var/run/munge -type d -execdir chmod 755 '{}' ';'

find /var/log/slurm /var/spool/slurm -type d -execdir chmod 700 '{}' ';'
find /var/run/slurm /etc/slurm -type d -execdir chmod 755 '{}' ';'

chown -R munge:munge /var/log/munge /var/run/munge /etc/munge
chown -R slurm:slurm /var/log/slurm /var/run/slurm /etc/slurm /var/spool/slurm
```

# 6. Write and Copy the Configuration Files

## 6.1. Modify the `cgroup.conf` and `slurm.conf` Example Configuration Files (Role: **All** Nodes)

Modify the `cgroup.conf` and `slurm.conf` example configuration files (see the appendix) as needed, and copy them to `/etc/slurm` on every host (a distribution sketch follows this section). You can basically use the settings I wrote as-is, but the following parts of `slurm.conf` must be adapted to your environment:

* The block commented `Control Nodes`: fill in the host name of the host with the Slurm **Control** Node role.
* The block commented `Compute Nodes`: fill in the host names and hardware specifications of the hosts with the Slurm **Compute** Node role (these values can be read with `slurmd -C`; see the sketch after the appendix).
* The block commented `Partitions`: fill in the host names of the hosts with the Slurm **Compute** Node role and the partition settings.
* The `AccountingStorageHost` entry: fill in the host name of the host with the Slurm **Database** Node role.

Finally, for security reasons, verify the permissions and ownership of the files.

```
chmod 644 /etc/slurm/cgroup.conf
chmod 644 /etc/slurm/slurm.conf

chown slurm:slurm /etc/slurm/cgroup.conf
chown slurm:slurm /etc/slurm/slurm.conf
```
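Both the `munge.key` from step 4 and the configuration files from this section must be identical on every host. The following is a minimal distribution sketch, not part of the original procedure: it assumes passwordless root SSH from `slurm-control` to the other hosts, and the host list (taken from the example environment) must be adjusted to your cluster.

```
# Run on slurm-control. Pushes munge.key and the Slurm configuration
# files to the remaining hosts, then restores ownership and permissions.
# Assumes passwordless root SSH; host list matches the example environment.
for host in login slurm-compute01 slurm-compute02; do
    scp -p /etc/munge/munge.key "${host}:/etc/munge/munge.key"
    scp -p /etc/slurm/cgroup.conf /etc/slurm/slurm.conf "${host}:/etc/slurm/"
    ssh "${host}" "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done
```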
## 6.2. Modify the `slurmdbd.conf` Example Configuration File (Role: Slurm **Database** Node)

Modify the `slurmdbd.conf` example configuration file (see the appendix) as needed, and copy it to `/etc/slurm`. You can basically use the settings I wrote as-is. Finally, for security reasons, verify the permissions and ownership of the file.

```
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm:slurm /etc/slurm/slurmdbd.conf
```

# 7. Start the Services

## 7.1. Start the `munge` Service (Role: **All** Nodes)

```
systemctl start munge
systemctl enable munge
```

## 7.2. Start the `mariadb` Service (Role: Slurm **Database** Node)

```
systemctl start mariadb
systemctl enable mariadb
```

## 7.3. Configure the `mariadb` Service and Create the Required Database (Role: Slurm **Database** Node)

```
mysql_secure_installation 2>/dev/null <<EOF
n
n
y
y
y
y
EOF

mysql -e "create database slurm_acct_db;"
mysql -e "create user 'slurm'@'localhost' identified by 'slurmdbd';"
mysql -e "grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';"

systemctl restart mariadb
```

## 7.4. Start the `slurmdbd` Service (Role: Slurm **Database** Node)

```
systemctl start slurmdbd
systemctl enable slurmdbd
```

## 7.5. Start the `slurmctld` Service (Role: Slurm **Control** Node)

```
systemctl start slurmctld
systemctl enable slurmctld
```

## 7.6. Start the `slurmd` Service (Role: Slurm **Compute** Node)

```
systemctl start slurmd
systemctl enable slurmd
```

# 8. Test Slurm Status and Functionality (Role: User **Login** Node)

```
[root@login ~]# scontrol ping
Slurmctld(primary) at slurm-control is UP

[root@login ~]# scontrol show node
NodeName=slurm-compute01 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=8 CPULoad=0.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurm-compute01 NodeHostName=slurm-compute01 Version=20.11.5
   OS=Linux 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020
   RealMemory=1024 AllocMem=0 FreeMem=4096 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2021-05-25T05:24:52 SlurmdStartTime=2021-05-25T08:23:59
   CfgTRES=cpu=8,mem=1G,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

NodeName=slurm-compute02 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=8 CPULoad=0.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurm-compute02 NodeHostName=slurm-compute02 Version=20.11.5
   OS=Linux 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020
   RealMemory=1024 AllocMem=0 FreeMem=4096 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2021-05-25T05:24:52 SlurmdStartTime=2021-05-25T08:24:01
   CfgTRES=cpu=8,mem=1G,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

[root@login ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up 1-00:00:00      2   idle slurm-compute[01-02]

[root@login ~]# srun --nodes=2 --ntasks-per-node=1 bash -c "echo Hello world from \`hostname\`"
Hello world from slurm-compute01
Hello world from slurm-compute02

[root@login ~]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2                  bash      debug       root          4  COMPLETED      0:0
3                  bash      debug       root          4  COMPLETED      0:0
4                  bash      debug       root          4  COMPLETED      0:0
5                  bash      debug       root          4  COMPLETED      0:0
```
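If `scontrol ping` or the job tests fail, the first thing worth checking is MUNGE authentication between hosts. The following cross-node check is a sketch that is not part of the original procedure; it assumes the example hostnames and that the `munge` service is running on both ends.

```
# Encode a credential locally and decode it on a compute node.
# "STATUS: Success (0)" in the output confirms that both hosts share
# the same munge.key and that their clocks are in sync.
munge -n | ssh slurm-compute01 unmunge
```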
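The `srun` test above only exercises interactive execution; batch submission can be verified with a small job script. A minimal sketch follows (the file name `hello.sh` and the output pattern are illustrative, not from the original guide).

```
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_%j.out

# One task per node; each task prints the name of the host it ran on.
srun bash -c 'echo "Hello world from $(hostname)"'
```

Submit it with `sbatch hello.sh`; the job should appear in `squeue` while it runs and in `sacct` once it completes.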
# 9. Appendix: Example Configuration Files

* `cgroup.conf`

```
### Slurm cgroup.conf Template

CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup

AllowedRAMSpace=100
AllowedSwapSpace=0
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no
MaxRAMPercent=100
MaxSwapPercent=100
TaskAffinity=no
```

* `slurm.conf`

```
### Slurm slurm.conf Template

### Control Nodes
SlurmctldHost=slurm-control

### Compute Nodes
NodeName=slurm-compute[01-02] CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=1024 State=UNKNOWN

### Partitions
PartitionName=debug Nodes=slurm-compute[01-02] Default=yes MaxCPUsPerNode=8 MaxMemPerNode=1024 MaxTime=24:00:00 State=UP

### Authentication
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge

### Users
SlurmUser=slurm

### Ports
SlurmctldPort=6817
SlurmdPort=6818

### State Preservation
ReturnToService=2
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld

### Scheduling
SchedulerType=sched/backfill

### Interconnect
SwitchType=switch/none

### Default MPI Type
MpiDefault=pmi2

### Process Tracking
ProctrackType=proctrack/cgroup

### Resource Selection
SelectType=select/cons_tres
SelectTypeParameters=cr_core_memory

### Task Launch
TaskPlugin=task/affinity,task/cgroup

### Event Logging
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log

### Job Accounting Gather
JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux

### Job Accounting Storage
AccountingStorageEnforce=safe
AccountingStorageHost=slurm-control
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=yes
ClusterName=cluster

### Job Completion Logging
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=

### Process ID Logging
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid

### Timers
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=60
KillWait=60
MinJobAge=300
WaitTime=60

### Miscellaneous
KillOnBadExit=1
MailProg=/bin/true
RebootProgram=/usr/sbin/reboot
```

* `slurmdbd.conf`

```
### Slurm slurmdbd.conf Template

### Authentication
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge

### Slurmdbd
DbdHost=localhost
DbdPort=6819
DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm

### Archive Database
#ArchiveDir=/tmp
#ArchiveScript=
ArchiveEvents=no
ArchiveJobs=no
ArchiveResvs=no
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no

### Purge Database
PurgeEventAfter=1month
PurgeJobAfter=1month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=1month
PurgeUsageAfter=1month

### MariaDB/MySQL
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=slurmdbd
StoragePort=3306
StorageType=accounting_storage/mysql
StorageUser=slurm
```
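A closing note on the `Compute Nodes` block above: the hardware values (`CPUs`, `Boards`, `SocketsPerBoard`, `CoresPerSocket`, `ThreadsPerCore`, `RealMemory`) do not have to be worked out by hand. Running `slurmd -C` on each compute node prints them in `slurm.conf` syntax; the output sketched below uses the values of the example nodes in this guide, and your hardware will differ.

```
[root@slurm-compute01 ~]# slurmd -C
NodeName=slurm-compute01 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=1024
```

`slurmd -C` also prints an `UpTime` line, which is not part of the node definition and can be ignored.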