Each step in this installation guide is labeled with the role(s) it applies to. For example, a step labeled "All Nodes" must be performed on every host, while a step labeled "User Login Node" only needs to be performed on the HPC cluster's login host, and so on.
Operating system: CentOS 7
This installation guide and the provided sample configuration files assume 4 hosts, with the following host names and corresponding Slurm roles:
login: User Login Node
slurm-control: Slurm Control Node, Slurm Database Node
slurm-compute01: Slurm Compute Node
slurm-compute02: Slurm Compute Node

The /etc/hosts file on every host must contain complete and correct host name to IP address mappings, for example:
172.100.0.10 login
172.100.0.20 slurm-control
172.100.0.30 slurm-compute01
172.100.0.31 slurm-compute02
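As an optional sanity check (not part of the original steps), you can confirm that every host resolves all four names consistently; the loop below is a minimal sketch and assumes the host names from the example above:

# Check that each cluster host name resolves (run on every node)
for host in login slurm-control slurm-compute01 slurm-compute02; do
  getent hosts "${host}" || echo "WARNING: ${host} does not resolve"
done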
Add the epel-release and mariadb package repositories (Role: All Nodes)

cat >/etc/yum.repos.d/MariaDB.repo <<EOF
[mariadb]
name=MariaDB
baseurl=https://ftp.ubuntu-tw.org/mirror/mariadb/yum/10.5/centos7-amd64
gpgkey=https://ftp.ubuntu-tw.org/mirror/mariadb/yum/RPM-GPG-KEY-MariaDB
gpgcheck=1
EOF
yum makecache
yum install -y epel-release
yum makecache
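Create the munge and slurm system accounts (Role: All Nodes)

Fixed UIDs/GIDs (901 and 902) are used so that the accounts are identical on every host: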
groupadd -g 901 munge
useradd -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -g munge -m -s /sbin/nologin -u 901 munge
groupadd -g 902 slurm
useradd -c "Slurm Workload Manager" -d /var/lib/slurm -g slurm -m -s /bin/bash -u 902 slurm
Install the munge, slurm, slurm-contribs and slurm-perlapi packages (Role: All Nodes)

yum install -y munge slurm slurm-contribs slurm-perlapi
Install the slurm-devel, slurm-pmi and slurm-pmi-devel packages (Role: User Login Node)

yum install -y slurm-devel slurm-pmi slurm-pmi-devel
Install the slurm-slurmctld package (Role: Slurm Control Node)

yum install -y slurm-slurmctld
Install the slurm-slurmd and slurm-pmi packages (Role: Slurm Compute Node)

yum install -y slurm-slurmd slurm-pmi
Install the MariaDB-server, MariaDB-client and slurm-slurmdbd packages (Role: Slurm Database Node)

yum install -y MariaDB-server MariaDB-client slurm-slurmdbd
Create the munge.key key (Role: All Nodes)

Run the following command on any one host to generate the /etc/munge/munge.key key file.

create-munge-key

Then copy this file to the same location on every host in the HPC cluster (a sketch using scp follows the permission commands below). Finally, for security, verify the permissions and ownership of the following files:
chmod 400 /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
find /var/log/munge /etc/munge -type d -execdir chmod 700 '{}' ';'
find /var/run/munge -type d -execdir chmod 755 '{}' ';'
find /var/log/slurm /var/spool/slurm -type d -execdir chmod 700 '{}' ';'
find /var/run/slurm /etc/slurm -type d -execdir chmod 755 '{}' ';'
chown -R munge:munge /var/log/munge /var/run/munge /etc/munge
chown -R slurm:slurm /var/log/slurm /var/run/slurm /etc/slurm /var/spool/slurm
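For the copy step above, a minimal sketch using scp; it assumes the key was generated on the login node and that root SSH access to the other hosts is available (host names follow the example /etc/hosts):

# Distribute the munge key to the other hosts (run on the host where it was generated)
for host in slurm-control slurm-compute01 slurm-compute02; do
  scp -p /etc/munge/munge.key root@${host}:/etc/munge/munge.key
done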
Configure the cgroup.conf and slurm.conf sample configuration files (Role: All Nodes)

Modify the cgroup.conf and slurm.conf sample configuration files as needed and copy them to the /etc/slurm directory. You can basically use the settings I provide as-is, but the following parts of slurm.conf need to be adapted to your environment:

the Control Nodes block
the Compute Nodes block (see the slurmd -C sketch after the permission commands below)
the Partitions block
the AccountingStorageHost entry

Finally, for security, verify the permissions and ownership of the following files:
chmod 644 /etc/slurm/cgroup.conf
chmod 644 /etc/slurm/slurm.conf
chown slurm:slurm /etc/slurm/cgroup.conf
chown slurm:slurm /etc/slurm/slurm.conf
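To fill in the Compute Nodes block with values that match the actual hardware, slurmd can print the detected node configuration in slurm.conf syntax; run this on each compute node (the exact fields may vary with the Slurm version):

# Print this node's CPUs/Sockets/Cores/Threads/RealMemory as a NodeName line
slurmd -C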
Configure the slurmdbd.conf sample configuration file (Role: Slurm Database Node)

Modify the slurmdbd.conf sample configuration file as needed and copy it to the /etc/slurm directory. You can basically use the settings I provide as-is. Finally, for security, verify the permissions and ownership of the following file:
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm:slurm /etc/slurm/slurmdbd.conf
Start the munge service (Role: All Nodes)

systemctl start munge
systemctl enable munge
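To confirm that the shared key works across hosts, a credential generated on one host should decode on another; a minimal check, assuming SSH access between the hosts:

# Generate a credential locally and decode it locally and on a remote node
munge -n | unmunge
munge -n | ssh slurm-control unmunge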
Start the mariadb service (Role: Slurm Database Node)

systemctl start mariadb
systemctl enable mariadb
Configure the mariadb service and create the required database (Role: Slurm Database Node)

The account and password created below must match the StorageUser and StoragePass settings in slurmdbd.conf.

mysql_secure_installation 2>/dev/null <<EOF
n
n
y
y
y
y
EOF
mysql -e "create database slurm_acct_db;"
mysql -e "create user 'slurm'@'localhost' identified by 'slurmdbd';"
mysql -e "grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';"
systemctl restart mariadb
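As an optional check (not part of the original steps), the slurm database account should now be able to log in and see the accounting database:

# Verify that the slurm DB user can log in and sees slurm_acct_db
mysql -u slurm -pslurmdbd -e "show databases;"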
Start the slurmdbd service (Role: Slurm Database Node)

systemctl start slurmdbd
systemctl enable slurmdbd
Start the slurmctld service (Role: Slurm Control Node)

systemctl start slurmctld
systemctl enable slurmctld
Start the slurmd service (Role: Slurm Compute Node)

systemctl start slurmd
systemctl enable slurmd
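Verify the installation (Role: User Login Node)

With all services running, the cluster can be checked from the login node, as in the sample session below.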
[root@login ~]# scontrol ping
Slurmctld(primary) at slurm-control is UP
[root@login ~]# scontrol show node
NodeName=slurm-compute01 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUTot=8 CPULoad=0.08
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurm-compute01 NodeHostName=slurm-compute01 Version=20.11.5
OS=Linux 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020
RealMemory=1024 AllocMem=0 FreeMem=4096 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2021-05-25T05:24:52 SlurmdStartTime=2021-05-25T08:23:59
CfgTRES=cpu=8,mem=1G,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
NodeName=slurm-compute02 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUTot=8 CPULoad=0.08
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurm-compute02 NodeHostName=slurm-compute02 Version=20.11.5
OS=Linux 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020
RealMemory=1024 AllocMem=0 FreeMem=4096 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2021-05-25T05:24:52 SlurmdStartTime=2021-05-25T08:24:01
CfgTRES=cpu=8,mem=1G,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
[root@login ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up 1-00:00:00 2 idle slurm-compute[01-02]
[root@login ~]# srun --nodes=2 --ntasks-per-node=1 bash -c "echo Hello world from \`hostname\`"
Hello world from slurm-compute01
Hello world from slurm-compute02
[root@login ~]# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2 bash debug root 4 COMPLETED 0:0
3 bash debug root 4 COMPLETED 0:0
4 bash debug root 4 COMPLETED 0:0
5 bash debug root 4 COMPLETED 0:0
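The sample configuration files referenced in the steps above are listed below.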
cgroup.conf
### Slurm cgroup.conf Template
CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup
AllowedRAMSpace=100
AllowedSwapSpace=0
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no
MaxRAMPercent=100
MaxSwapPercent=100
TaskAffinity=no
slurm.conf
### Slurm slurm.conf Template
### Control Nodes
SlurmctldHost=slurm-control
### Compute Nodes
NodeName=slurm-compute[01-02] CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=1024 State=UNKNOWN
### Partitions
PartitionName=debug Nodes=slurm-compute[01-02] Default=yes MaxCPUsPerNode=8 MaxMemPerNode=1024 MaxTime=24:00:00 State=UP
### Authentication
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
### Users
SlurmUser=slurm
### Ports
SlurmctldPort=6817
SlurmdPort=6818
### State Preservation
ReturnToService=2
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld
### Scheduling
SchedulerType=sched/backfill
### Interconnect
SwitchType=switch/none
### Default MPI Type
MpiDefault=pmi2
### Process Tracking
ProctrackType=proctrack/cgroup
### Resource Selection
SelectType=select/cons_tres
SelectTypeParameters=cr_core_memory
### Task Launch
TaskPlugin=task/affinity,task/cgroup
### Event Logging
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
### Job Accounting Gather
JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux
### Job Accounting Storage
AccountingStorageEnforce=safe
AccountingStorageHost=slurm-control
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=yes
ClusterName=cluster
### Job Completion Logging
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
### Process ID Logging
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
### Timers
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=60
KillWait=60
MinJobAge=300
WaitTime=60
#### Miscellaneous
KillOnBadExit=1
MailProg=/bin/true
RebootProgram=/usr/sbin/reboot
slurmdbd.conf
### Slurm slurmdbd.conf Template
### Authentication
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
### Slurmdbd
DbdHost=localhost
DbdPort=6819
DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
### Archive Database
#ArchiveDir=/tmp
#ArchiveScript=
ArchiveEvents=no
ArchiveJobs=no
ArchiveResvs=no
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
### Purge Database
PurgeEventAfter=1month
PurgeJobAfter=1month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=1month
PurgeUsageAfter=1month
### MariaDB/MySQL
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=slurmdbd
StoragePort=3306
StorageType=accounting_storage/mysql
StorageUser=slurm