Basic HPC cluster setup with slurm (Ubuntu 22.04)

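This article walks through building a simple cluster modeled on an HPC environment. It covers:

Introduction to HPC
Introduction to Slurm
What /etc/hosts is for
How to change the hostname
Setting up NFS
Using one host as a Linux router
Basic pdsh commands
Setting up SSH authentication without a password
Setting up MUNGE
Building the Slurm cluster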

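Introduction to HPC

High-Performance Computing (HPC) means using advanced computing resources and techniques to handle large, complex workloads, such as highly complex analyses in science, engineering, and other fields.
Besides high-performance hardware, HPC also requires fine-tuning and optimizing the matching software environment so that users get a stable software service. This article builds the cluster on Ubuntu 22.04 with Slurm as the resource manager.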

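Introduction to Slurm

Slurm (Simple Linux Utility for Resource Management) is open-source software for resource scheduling and cluster management. It has three main parts:

slurm controller: manages the whole cluster and schedules jobs. It accepts the jobs users submit and, based on resource status and site policy, assigns them to available compute nodes.
slurm compute node: a node that runs the actual computation. Compute nodes provide the resources, such as processors, memory, storage, and network connectivity. When the controller assigns a job to a compute node, that node runs it.
slurm database: stores Slurm job state, user accounts, node information, and so on.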

Basic cluster architecture

This article uses the architecture described below to build a simple cluster and then submits jobs to it as a test.

Prepare four VMs with Ubuntu 22.04 installed. Each VM's role is as follows:

slurm control node
- VM resource requirements:
  - CPU: 2 cores
  - Mem: 4G
  - Storage: 10G
slurm compute node 1 / node 2
- VM resource requirements:
  - CPU: 2 cores
  - Mem: 4G
  - Storage: 10G
slurm database
- VM resource requirements:
  - CPU: 2 cores
  - Mem: 4G
  - Storage: 10G

Note.

Large HPC installations usually add a layer of login nodes for users to log in to, so that many simultaneous logins do not put heavy load on one host. In this article the login node and the control node are the same host.
Large HPC installations also have high-speed shared storage (e.g. GPFS, Lustre) for parallel file access. This article uses NFS to provide shared storage for the cluster.

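Basic setup

Installing the base packages

On the control node, install pdsh first. pdsh is a tool for running remote shell commands: it runs one command on many hosts at once, which makes the later steps on the other hosts easier.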

$apt update
$apt install pdsh -y

Network configuration

Control node

Configure the system network on the control node. On Ubuntu the network settings live in /etc/netplan/00-installer-config.yaml.

$sudo vi /etc/netplan/00-installer-config.yaml
Change the file contents to the following
```
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3: # public ip range
      dhcp4: true

    enp0s8: # private ip range
      addresses: [192.168.56.23/24]
      dhcp4: false
  version: 2
```
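- enp0s3: configured through DHCP; this NIC can reach the public network.
- enp0s8: configured with a static IP in the 192.168.56.0/24 subnet. The compute nodes of the cluster must sit in the same subnet so that the Slurm control node and the compute nodes can talk to each other.
- For background on subnets, look up CIDR or subnet mask.
- When editing the YAML file, pay attention to indentation and to the key-value contents; mistakes there readily produce errors.
$sudo netplan try      # check the network configuration for errors
$sudo netplan apply    # apply the network configuration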
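
The point about all nodes sharing one subnet can be checked with a little arithmetic. The following is a minimal bash sketch, assuming the addresses planned in this article; `ip_to_int` and `ip_in_subnet` are hypothetical helper names, not system tools, and 10.0.2.15 is just an arbitrary address outside the cluster subnet.

```shell
#!/usr/bin/env bash
# Sketch: check whether an IPv4 address lies inside a CIDR block.
# ip_to_int and ip_in_subnet are hypothetical helpers written for this example.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit status 0) if address $1 is inside CIDR block $2.
ip_in_subnet() {
    local ip=$1 net=${2%/*} prefix=${2#*/}
    # Build the netmask: the top "prefix" bits set, the rest cleared.
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

for ip in 192.168.56.23 192.168.56.24 192.168.56.25 192.168.56.26 10.0.2.15; do
    if ip_in_subnet "$ip" 192.168.56.0/24; then
        echo "$ip is inside 192.168.56.0/24"
    else
        echo "$ip is outside 192.168.56.0/24"
    fi
done
```

With a /24 prefix the first three octets form the network part, so every 192.168.56.x address passes the check and anything else fails it.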

Compute node 1

Configure the system network on compute node 1. On Ubuntu the network settings live in /etc/netplan/00-installer-config.yaml.

$sudo vi /etc/netplan/00-installer-config.yaml

Change the file contents to the following

# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3: # public ip range
      dhcp4: true

    enp0s8: # private ip range
      addresses: [192.168.56.24/24]
      dhcp4: false
      gateway4: 192.168.56.23
  version: 2

$sudo netplan try      # check the network configuration for errors
$sudo netplan apply    # apply the network configuration

Compute node 2

Configure the system network on compute node 2. On Ubuntu the network settings live in /etc/netplan/00-installer-config.yaml.

$sudo vi /etc/netplan/00-installer-config.yaml

Change the file contents to the following

# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3: # public ip range
      dhcp4: true

    enp0s8: # private ip range
      addresses: [192.168.56.25/24]
      dhcp4: false
      gateway4: 192.168.56.23
  version: 2

$sudo netplan try      # check the network configuration for errors
$sudo netplan apply    # apply the network configuration

Slurm database

Configure the system network on the Slurm database node. On Ubuntu the network settings live in /etc/netplan/00-installer-config.yaml.

$sudo vi /etc/netplan/00-installer-config.yaml

Change the file contents to the following

# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3: # public ip range
      dhcp4: true

    enp0s8: # private ip range
      addresses: [192.168.56.26/24]
      dhcp4: false
      gateway4: 192.168.56.23
  version: 2

$sudo netplan try      # check the network configuration for errors
$sudo netplan apply    # apply the network configuration

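Setting up the slurm control node as a Linux router

In this article the control node wears several hats: it is the login node and must also act as the router for the compute nodes. In HPC environments, compute nodes usually cannot reach outside networks directly, which lowers security risk.
Here the gateway of compute node 1 / 2 points to the enp0s8 NIC on the control node. On the control node, enp0s3 forwards packets to enp0s8 and enp0s8 forwards packets to enp0s3, so a compute node can reach the outside through the login node's network, while nothing on the internet can connect directly to a compute node.

The steps are as follows:

In /etc/sysctl.conf, find the net.ipv4.ip_forward parameter and uncomment it, as shown here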

# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1

Apply the IP forwarding setting

$sysctl -p

Set the packet forwarding rules

# Forward packets arriving on enp0s8 out through enp0s3
$iptables -A FORWARD -i enp0s8 -o enp0s3 -j ACCEPT

# Forward packets from enp0s3 to enp0s8 only for related or established connections
$iptables -A FORWARD -i enp0s3 -o enp0s8 -m state --state RELATED,ESTABLISHED -j ACCEPT

# Rewrite the source address of packets leaving through enp0s3
# to the address of enp0s3
$iptables -t nat -A POSTROUTING -o enp0s3 -j MASQUERADE

# Also masquerade packets leaving through enp0s8 so the two local networks can communicate
$iptables -t nat -A POSTROUTING -o enp0s8 -j MASQUERADE

Check that the forwarding rules were set

$iptables-save
>>>
...
-A FORWARD -i enp0s8 -o enp0s3 -j ACCEPT
-A FORWARD -i enp0s3 -o enp0s8 -m state --state RELATED,ESTABLISHED -j ACCEPT
...

Save the iptables forwarding rules permanently (otherwise they disappear after a reboot)

$apt install iptables-persistent
$iptables-save > /etc/iptables/rules.v4
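
Interface names differ between machines, so it can help to review the rules before running them. The sketch below only prints the four commands used above for a given pair of interfaces; `print_router_rules` is a hypothetical helper name and nothing in it touches the firewall.

```shell
#!/usr/bin/env bash
# Sketch: print the forwarding/NAT commands for a pair of interfaces.
# print_router_rules is a hypothetical helper; it only echoes commands so
# they can be reviewed first, and it does not change the firewall itself.

print_router_rules() {
    local wan=$1 lan=$2   # wan: public-facing NIC, lan: cluster-facing NIC
    cat <<EOF
iptables -A FORWARD -i $lan -o $wan -j ACCEPT
iptables -A FORWARD -i $wan -o $lan -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -t nat -A POSTROUTING -o $wan -j MASQUERADE
iptables -t nat -A POSTROUTING -o $lan -j MASQUERADE
EOF
}

# The interface names used in this article.
print_router_rules enp0s3 enp0s8
```

Once the printed commands look right, they can be run as root one by one, exactly as in the steps above.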

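Hostname setting

Set the hostnames according to the plan in this article. They must match the hostname-to-IP mappings in /etc/hosts; with a wrong hostname the name cannot be resolved to its IP.

Control node

Two files need to be changed for the hostname

/etc/hosts -> change the 127.0.1.1 line to 127.0.1.1 slurm-ctl
/etc/hostname -> change the contents of this file to slurm-ctl
Reboot for the new hostname to take effect.

Compute node 1

/etc/hosts -> change the 127.0.1.1 line to 127.0.1.1 slurm-wrk-01
/etc/hostname -> change the contents of this file to slurm-wrk-01
Reboot for the new hostname to take effect.

Compute node 2

/etc/hosts -> change the 127.0.1.1 line to 127.0.1.1 slurm-wrk-02
/etc/hostname -> change the contents of this file to slurm-wrk-02
Reboot for the new hostname to take effect.

Slurm database

/etc/hosts -> change the 127.0.1.1 line to 127.0.1.1 slurm-mariadb
/etc/hostname -> change the contents of this file to slurm-mariadb
Reboot for the new hostname to take effect.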

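Setting up the hosts file

/etc/hosts works as a local DNS for static IPs, so that a name maps to the correct IP address. Slurm's configuration files often refer to hosts by hostname.
So add the following to /etc/hosts on all four hosts

$sudo vi /etc/hosts
Add the following

#loopback address (each host keeps its own hostname on this line, as set in the previous section)
127.0.1.1 slurm-ctl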

## SLURM cluster private IP range

# Controller
192.168.56.23 slurm-ctl

# compute nodes
192.168.56.24 slurm-wrk-01
192.168.56.25 slurm-wrk-02

# mariadb
192.168.56.26 slurm-mariadb
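
Because the same four entries go onto every host, keeping them in one list avoids typos. The sketch below prints the cluster entries in /etc/hosts format and reports which names the local resolver already knows; `CLUSTER_HOSTS`, `print_hosts_block`, and `check_resolution` are hypothetical names, and the check relies on the standard `getent` tool.

```shell
#!/usr/bin/env bash
# Sketch: keep the cluster's name-to-address table in one place.
# CLUSTER_HOSTS, print_hosts_block and check_resolution are hypothetical names.

CLUSTER_HOSTS=(
    "192.168.56.23 slurm-ctl"
    "192.168.56.24 slurm-wrk-01"
    "192.168.56.25 slurm-wrk-02"
    "192.168.56.26 slurm-mariadb"
)

# Print the entries in /etc/hosts format.
print_hosts_block() {
    echo "## SLURM cluster private IP range"
    printf '%s\n' "${CLUSTER_HOSTS[@]}"
}

# Report which names the local resolver (including /etc/hosts) already knows.
check_resolution() {
    local entry name
    for entry in "${CLUSTER_HOSTS[@]}"; do
        name=${entry#* }          # strip the address, keep the hostname
        if getent hosts "$name" > /dev/null; then
            echo "$name resolves"
        else
            echo "$name does NOT resolve yet"
        fi
    done
}

print_hosts_block
check_resolution
```

Running `check_resolution` on each host after editing /etc/hosts is a quick way to confirm that all four names resolve before moving on.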

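Setting up NFS

NFS (Network File System) lets the machines of a distributed system share files.

Every node in the cluster needs an identical copy of the Slurm configuration files. We therefore create an NFS share to hold them, and the other nodes create symbolic links to the same files, which simplifies later maintenance.
For the details of setting up NFS, see the separate article NFS Setup on Ubuntu. Here the control node acts as the NFS server, set up as follows:

control node

Create the shared directory and set its permissions.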

$mkdir /shared
$chown -R nobody:nogroup /shared
$chmod -R 777 /shared

Install the Ubuntu NFS server package.

$apt update
$apt install nfs-kernel-server -y

In /etc/exports, configure the shared directory and the subnet it is exported to.

# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
#
/shared 192.168.56.0/24(rw,sync,no_root_squash,no_subtree_check)

Export all the file systems listed in /etc/exports.

$exportfs -a

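compute node 1 / compute node 2 / slurm-mariadb

The other three hosts all need the NFS client package installed and the /shared directory from the control node mounted on their own systems.

Since the network and hostnames are already configured, the control node can reach the other three hosts, so we install everything with pdsh (which runs a command on several hosts at once over ssh). pdsh logs in as root over ssh here, so it relies on the key-based root login described in the SSH authentication section below.

# Create the /shared directory on each machine and set its permissions
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "mkdir /shared"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "chown -R nobody:nogroup /shared"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "chmod -R 777 /shared"

# Install the NFS client package
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "apt update"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "apt install nfs-common -y"

# Add the NFS mount to /etc/fstab, then mount everything listed there
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "echo '192.168.56.23:/shared /shared nfs defaults 0 0' >> /etc/fstab"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "mount -a"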
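
One caveat with `echo ... >> /etc/fstab` is that running it a second time adds a duplicate entry. A minimal sketch of an append that is safe to repeat follows; `append_once` is a hypothetical helper name, and the demonstration writes to a scratch file rather than the real /etc/fstab.

```shell
#!/usr/bin/env bash
# Sketch: append a line to a file only if that exact line is not there yet.
# append_once is a hypothetical helper name.

append_once() {
    local line=$1 file=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Demonstration on a scratch file instead of the real /etc/fstab.
demo_fstab=$(mktemp)
append_once '192.168.56.23:/shared /shared nfs defaults 0 0' "$demo_fstab"
append_once '192.168.56.23:/shared /shared nfs defaults 0 0' "$demo_fstab"
cat "$demo_fstab"      # the entry appears once even though it was appended twice
```

On a real node the same function would be pointed at /etc/fstab as root, and `findmnt /shared` can then confirm that the share is mounted.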

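Setting up SSH authentication without a password

To make later node management easier, we want root on the control node to log in quickly, without a password, as root on the other nodes.
The steps are as follows:

Switch to the root user

$sudo su

Generate an RSA key pair in root's .ssh directory on the control node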

$ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa

# copy the public key content
$cat /root/.ssh/id_rsa.pub

Put the public key content into /root/.ssh/authorized_keys on compute node 1 / node 2 and slurm-mariadb. (If authorized_keys does not exist yet, create it.)
Test the ssh connection from the control node to the compute nodes.

$ssh slurm-wrk-01

$ssh slurm-wrk-02

Note.
Make sure the same user name exists on both hosts and that the key is placed in the corresponding /<user_dir>/.ssh.
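
Placing the key by hand is easy to get subtly wrong, since sshd ignores an authorized_keys file whose permissions are too open. The sketch below shows one way to install a key with restrictive permissions and without duplicates; `install_pubkey` is a hypothetical helper name, the key string is a placeholder, and the demonstration uses a scratch directory instead of /root/.ssh.

```shell
#!/usr/bin/env bash
# Sketch: install a public key into an authorized_keys file with
# restrictive permissions. install_pubkey is a hypothetical helper name.

install_pubkey() {
    local pubkey=$1 ssh_dir=$2
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
    # Add the key only if this exact line is not already present.
    grep -qxF -- "$pubkey" "$ssh_dir/authorized_keys" \
        || echo "$pubkey" >> "$ssh_dir/authorized_keys"
}

# Demonstration with a placeholder key and a scratch directory
# (on a real node the directory would be /root/.ssh).
demo_dir=$(mktemp -d)
install_pubkey 'ssh-rsa AAAAEXAMPLEKEY root@slurm-ctl' "$demo_dir/.ssh"
install_pubkey 'ssh-rsa AAAAEXAMPLEKEY root@slurm-ctl' "$demo_dir/.ssh"
ls -l "$demo_dir/.ssh/authorized_keys"
```

Where password login for the target account is still permitted, the standard `ssh-copy-id` tool performs the same steps over the network.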

Setting up the MUNGE key

MUNGE is a security tool and authentication service widely used in HPC environments. It is designed to create and validate credentials based on a user's UID and GID.
The steps are as follows:

Install munge on all four hosts (run from the control node)

# control node
$sudo apt update
$sudo apt install munge

# the other three hosts
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "apt update"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "apt install munge -y"

Copy /etc/munge/munge.key from the control node to the /shared directory on NFS.
On the control node, use pdsh to copy munge.key to the other three hosts

# the other three hosts
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "cp -a /shared/munge.key /etc/munge/munge.key"

Note that MUNGE authentication requires the clocks of all hosts in the cluster to be synchronized.
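
MUNGE only works when every node holds a byte-identical key, so it is worth confirming the copies match. The sketch below compares two files by SHA-256 digest; `same_file` is a hypothetical helper name and the scratch files merely stand in for copies of munge.key.

```shell
#!/usr/bin/env bash
# Sketch: confirm two copies of a key file are byte-identical by comparing
# SHA-256 digests. same_file is a hypothetical helper name.

same_file() {
    local a b
    a=$(sha256sum < "$1") || return 2   # 2 means a file could not be read
    b=$(sha256sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Demonstration with scratch files standing in for munge.key copies.
key_a=$(mktemp); key_b=$(mktemp); key_c=$(mktemp)
printf 'secret-1' > "$key_a"
printf 'secret-1' > "$key_b"
printf 'secret-2' > "$key_c"

same_file "$key_a" "$key_b" && echo "a and b match"
same_file "$key_a" "$key_c" || echo "a and c differ"
```

On the cluster itself, running `sha256sum /etc/munge/munge.key` through pdsh and comparing the digests gives the same answer across nodes, and `munge -n | unmunge` checks that a credential can be created and decoded locally once the munge service is running.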

Setting up the slurm service

With the base setup done, we can start building the cluster itself.
We have four hosts: one Slurm controller, two Slurm compute nodes, and one Slurm database.
Hosts with different roles need different Slurm packages but share the same configuration files, so we put the configuration files on NFS and have every host create symbolic links to them.

control node

We set up the slurm control node first.

Install the following packages on the slurm control node

$apt update
$apt install slurm-wlm slurm-wlm-doc -y

slurm-wlm: main slurm package
slurm-wlm-doc: slurm documentation

compute nodes

From the control node, we use pdsh to operate the two compute nodes.

$pdsh -w root@slurm-wrk-0[1-2] -R ssh "apt update"
$pdsh -w root@slurm-wrk-0[1-2] -R ssh "apt install slurmd -y"

slurmd: the compute node's daemon

database

We log in to the database node directly for this part.

Install mariadb-server and slurmdbd

$apt update
$apt install mariadb-server slurmdbd -y

The first time MariaDB is brought up, run the security setup

set the root password (set to example here)
remove anonymous users
remove the test database
disallow remote root login

$mysql_secure_installation

After entering the MariaDB CLI, do the following

Create the slurm_usr user
Create the slurm_acc_db and slurm_job_db databases
Grant the user specific privileges

# open the MariaDB CLI from the local host
$mariadb -u root -p
Enter password:

# logged in to the MariaDB CLI
Server version: 10.6.12-MariaDB-0ubuntu0.22.04.1 Ubuntu 22.04

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.


MariaDB [(none)]> create database slurm_acc_db; 
MariaDB [(none)]> create database slurm_job_db;

# set the password of slurm_usr to example
MariaDB [(none)]> create user 'slurm_usr'@localhost identified by 'example';


# give slurm_usr full access to slurm_acc_db and slurm_job_db
MariaDB [(none)]> grant all privileges on slurm_acc_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> grant all privileges on slurm_job_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> flush privileges;

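Setting up the slurm configuration files

There are four configuration files in total:
slurm.conf: the main configuration file. It defines the nodes in the cluster, log file locations, scheduling parameters, and so on.
slurmdbd.conf: settings for connecting to the backend database.
cgroup.conf: used to limit, isolate, and manage the resources that jobs use.
cgroup_allowed_devices_file.conf: lists the devices allowed in the control groups.

This article puts these files on NFS, and every host then creates symbolic links to them.

From the control node, create the Slurm configuration folder on NFS.

$mkdir -p /shared/HPC_SYS/slurm

Create slurm.conf in /shared/HPC_SYS/slurm as follows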

# General

ClusterName=testp
SlurmctldHost=slurm-ctl
#ProctrackType=proctrack/linuxproc
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmUser=slurm
#TaskPlugin=task/none
TaskPlugin=task/cgroup,task/affinity


# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory


# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm_usr
JobCompType=jobcomp/none
#JobAcctGatherType=jobacct_gather/linux
JobacctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmschd.log
SlurmSchedLogLevel=3

PrologFlags=Contain


# Preemptions
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE


# Node
NodeName=slurm-wrk-01 CPUs=2 RealMemory=1963
NodeName=slurm-wrk-02 CPUs=2 RealMemory=1963


# Partition
PartitionName=testp Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Create slurmdbd.conf in /shared/HPC_SYS/slurm as follows

# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2

# SlurmDBD info
DbdAddr=192.168.56.26
#DbdHost=slurm-mariadb
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid

SlurmUser=slurm
# Accounting database info
StorageType=accounting_storage/mysql
StorageHost=192.168.56.26
StoragePort=3306
StorageUser=slurm_usr # this is the DB user which owns the database
StoragePass=example
StorageLoc=slurm_acc_db

Create cgroup.conf in /shared/HPC_SYS/slurm as follows

CgroupAutomount=yes
ConstrainCores=yes

Create cgroup_allowed_devices_file.conf in /shared/HPC_SYS/slurm as follows

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/shared*

All four hosts need symbolic links to the Slurm configuration files on NFS. Start with the control node, then use pdsh from the control node to create the same links on the other three machines.

# control node
$ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf
$ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf
$ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf
$ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf


# compute nodes + slurm database node
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf"
$pdsh -w root@slurm-wrk-0[1-2],root@slurm-mariadb -R ssh "ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf"

Once everything is configured, restart the slurmdbd, slurmd, and slurmctld services from the control node

# remote slurm MariaDB node
$pdsh -w root@slurm-mariadb -R ssh "systemctl restart slurmdbd"

# remote compute nodes
$pdsh -w root@slurm-wrk-0[1-2] -R ssh "systemctl restart slurmd"

# local control node
$ systemctl restart slurmctld

On the control node, check the state of the cluster

$sinfo
>>>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testp*       up   infinite      2   idle slurm-wrk-[01-02]
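
With both nodes reported as idle, a small batch job is a natural smoke test. The script below is a minimal sketch: the partition name testp comes from the slurm.conf above, while the job name, the output file pattern, and the `describe_job` function are assumptions made for this example.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=hello
#SBATCH --partition=testp
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=hello_%j.out
# Sketch of a minimal batch job. The #SBATCH lines are ordinary comments to
# bash, so the script also runs outside Slurm for a quick syntax check.

describe_job() {
    # SLURM_JOB_ID is set by Slurm inside a job; fall back to "none" elsewhere.
    echo "job ${SLURM_JOB_ID:-none} running on $(hostname)"
}

describe_job
```

Saved under a name such as hello.sh (a hypothetical file name), it can be submitted with `sbatch hello.sh` and watched with `squeue`; the output file should then name one of the two compute nodes.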

On the control node, check the cluster configuration

$scontrol show config
>>>
Configuration data as of 2024-01-18T09:17:16
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = (null)
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
...

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = (null)
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at slurm-ctl is UP

Note.

If the slurm controller has a problem, check the following
a. Check the slurm controller status: systemctl status slurmctld
b. Read /var/log/slurm/slurmctld.log
c. Fix the problem based on what the log says
If a slurm compute node has a problem, check the following
a. Check the slurm compute node status: systemctl status slurmd
b. Read /var/log/slurm/slurmd.log
c. Fix the problem based on what the log says
If the slurm database has a problem, check the following
a. Check the slurm database status: systemctl status slurmdbd
b. Read /var/log/slurm/slurmdbd.log
c. Fix the problem based on what the log says
The clocks in the cluster must be synchronized; otherwise MUNGE authentication fails and the timestamps in the Slurm logs will be inconsistent.

Read more

NFS Setup on Ubuntu