Basic HPC cluster setup with slurm (Ubuntu 22.04)

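# Basic HPC cluster setup with slurm (Ubuntu 22.04)

This article walks through building a simple cluster modeled on an HPC setup. It covers:

* Introduction to HPC
* Introduction to Slurm
* What `/etc/hosts` is for
* How to change the hostname
* Setting up ssh authentication without password
* Setting up NFS
* The pdsh command
* Setting up MUNGE
* Setting up a Slurm cluster

### Introduction to HPC

High-Performance Computing (HPC) means using advanced computing resources and techniques to handle large, complex computational workloads, such as highly complex analyses in science, engineering, and other fields. Besides high-performance hardware, HPC also requires fine-tuning and optimizing the corresponding software environment so that users get a stable software service. This article uses Ubuntu 22.04 together with the Slurm resource manager to build the cluster.

### Introduction to Slurm

Slurm (Simple Linux Utility for Resource Management) is open-source software for resource scheduling and cluster management. It has three main parts:

1. slurm controller: manages the whole cluster and schedules jobs. It accepts the jobs that users submit and assigns them to available worker nodes according to resource state and management policy.
2. slurm compute node: a node that runs the actual computation. It owns compute resources such as processors, memory, storage, and network connections. When the controller assigns a job to a worker node, that node runs it.
3. slurm database: stores Slurm job state, user accounts, node information, and so on.

![image](https://hackmd.io/_uploads/HyXZgOSY6.png)

### Basic cluster architecture

This article uses the architecture in the figure below to build a simple cluster and dispatch jobs to it for testing.

![image](https://hackmd.io/_uploads/BJnfoT3ua.png)

Prepare four VMs with Ubuntu 22.04 installed. Each VM has the following role:

1. **slurm control node**
   - VM resource requirements:
     * CPU: 2 cores
     * Mem: 2G
     * Storage: 20G
2. **slurm compute node 1 / node 2**
   - VM resource requirements:
     * CPU: 2 cores
     * Mem: 2G
     * Storage: 20G
3. **slurm database**
   - VM resource requirements:
     * CPU: 2 cores
     * Mem: 2G
     * Storage: 20G

**Note.**
1. Large HPC installations usually add a layer of login nodes for users to log in to, so that many simultaneous logins do not load the control node. In this article the login node and the control node are the same host.
2. Large HPC installations also use high-speed shared storage (e.g. GPFS, Lustre) for parallel file access. This article uses NFS to provide shared storage to the cluster.

---

### Network setup

#### Control node

Configure the system network on the control node. On Ubuntu the system network settings live in `/etc/netplan/00-installer-config.yaml`.

1. `sudo vi /etc/netplan/00-installer-config.yaml`
2. 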
   Edit the file as follows:

   ```
   # This is the network config written by 'subiquity'
   network:
     ethernets:
       enp0s3: # public ip range
         dhcp4: true
       enp0s8: # private ip range
         addresses: [192.168.56.23/24]
         dhcp4: false
     version: 2
   ```

   - enp0s3: configured through DHCP; this NIC can reach the public network.
   - enp0s8: configured with a static IP in the `192.168.56.0/24` subnet. The work nodes of the same cluster must be in the same subnet so that the slurm control node and the work nodes can talk to each other.
   - For background on subnets, search for CIDR or subnet mask.
   - When editing the YAML file, watch the indentation and the key-value content, otherwise netplan easily reports errors.
3. `sudo netplan try` -> test the network configuration (it is rolled back unless confirmed)
4. `sudo netplan apply` -> apply the network configuration

#### Compute node 1

Configure the system network on compute node 1. On Ubuntu the system network settings live in `/etc/netplan/00-installer-config.yaml`.

1. `sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:

   ```
   # This is the network config written by 'subiquity'
   network:
     ethernets:
       enp0s3: # public ip range
         dhcp4: true
       enp0s8: # private ip range
         addresses: [192.168.56.24/24]
         dhcp4: false
     version: 2
   ```
3. `sudo netplan try` -> test the network configuration (it is rolled back unless confirmed)
4. `sudo netplan apply` -> apply the network configuration

#### Compute node 2

Configure the system network on compute node 2. On Ubuntu the system network settings live in `/etc/netplan/00-installer-config.yaml`.

1. `sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:

   ```
   # This is the network config written by 'subiquity'
   network:
     ethernets:
       enp0s3: # public ip range
         dhcp4: true
       enp0s8: # private ip range
         addresses: [192.168.56.25/24]
         dhcp4: false
     version: 2
   ```
3. `sudo netplan try` -> test the network configuration (it is rolled back unless confirmed)
4. `sudo netplan apply` -> apply the network configuration

#### Slurm database

Configure the system network on the Slurm database host. On Ubuntu the system network settings live in `/etc/netplan/00-installer-config.yaml`.

1. `sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:

   ```
   # This is the network config written by 'subiquity'
   network:
     ethernets:
       enp0s3: # public ip range
         dhcp4: true
       enp0s8: # private ip range
         addresses: [192.168.56.26/24]
         dhcp4: false
     version: 2
   ```
3. `sudo netplan try` -> test the network configuration (it is rolled back unless confirmed)
4. 
   `sudo netplan apply` -> apply the network configuration

---

### Hostname setting

**Hostnames follow the plan in this article. They determine the hostname-to-IP mapping configured in `/etc/hosts`; with a wrong hostname, the name will not resolve to the right IP.**

#### Control node

Two files need to be changed to set the hostname:

1. `/etc/hosts` -> change the `127.0.1.1` line to `127.0.1.1 slurm-ctl`
2. `/etc/hostname` -> edit this file so that its content is `slurm-ctl`
3. Reboot for the hostname setting to take effect.

#### Compute node 1

1. `/etc/hosts` -> change the `127.0.1.1` line to `127.0.1.1 slurm-wrk-01`
2. `/etc/hostname` -> edit this file so that its content is `slurm-wrk-01`
3. Reboot for the hostname setting to take effect.

#### Compute node 2

1. `/etc/hosts` -> change the `127.0.1.1` line to `127.0.1.1 slurm-wrk-02`
2. `/etc/hostname` -> edit this file so that its content is `slurm-wrk-02`
3. Reboot for the hostname setting to take effect.

#### Slurm database

1. `/etc/hosts` -> change the `127.0.1.1` line to `127.0.1.1 slurm-mariadb`
2. `/etc/hostname` -> edit this file so that its content is `slurm-mariadb`
3. Reboot for the hostname setting to take effect.

---

### Setting up the hosts file

The `/etc/hosts` file acts as a local DNS for static IPs, so that a host name resolves to the right IP address. The Slurm configuration files also commonly refer to hosts by hostname. Add the following to `/etc/hosts` on all four hosts:

1. `sudo vi /etc/hosts`
2. Add the following:

   ```
   # loopback address (on each host, keep that host's own hostname on this line)
   127.0.1.1 slurm-ctl

   ## SLURM cluster private IP range
   # Controller
   192.168.56.23 slurm-ctl
   # compute nodes
   192.168.56.24 slurm-wrk-01
   192.168.56.25 slurm-wrk-02
   # mariadb
   192.168.56.26 slurm-mariadb
   ```

### Setting up ssh authentication without password

To make later node management easier, root on the control node should be able to log in to root on the other nodes quickly and without a password. The steps are:

1. Switch to the root user.

   ```bash
   sudo su
   ```
2. On the controller node, generate an RSA key pair in root's `.ssh` directory.

   ```bash
   ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa
   # copy the public key content
   cat /root/.ssh/id_rsa.pub
   ```
3. Put the public key content into `/root/.ssh/authorized_keys` on the control node, compute node 1 / node 2, and slurm-mariadb. (Create the `authorized_keys` file if it does not exist.)
4. Test ssh connections from the controller node to each host.

   ```bash
   ssh slurm-ctl
   ssh slurm-wrk-01
   ssh slurm-wrk-02
   ssh slurm-mariadb
   ```
5. How the authentication works:

```
Control node (client)                          Compute node (server)

1. Initiate the connection
   ssh slurm-wrk-01  ----------------------->  Look up the matching public key in authorized_keys
2. 
   Receive the reply  <----------------------  Confirm that the offered key is acceptable
3. Prove possession of the key
   Sign the session data with the "private key"
   and send the signature  ----------------->  Verify the signature with the public key
4. Authentication succeeds
   Logged in (no password needed)  --------->  Open a shell session
```

### Installing pdsh

First install pdsh on the control node. pdsh runs remote shell commands, that is, it can run one command on several hosts at once, which makes the later steps on the other hosts easier.

```bash
apt update
apt install pdsh -y
```

Add the following to `/root/.bashrc` on the control node:

```
export PDSH_RCMD_TYPE=ssh
```

Apply it right away:

```
source /root/.bashrc
```

Check the pdsh command:

```
pdsh -w slurm-ctl,slurm-wrk-0[1-2],slurm-mariadb hostname
```

Note: make sure the same user name exists on both hosts and that the key is placed in the corresponding `/<user_dir>/.ssh`.

### Setting up NFS

NFS (Network File System) shares files among the hosts of a distributed system.

![image](https://hackmd.io/_uploads/r1IKQPzKa.png)

Every node in the cluster must have the same copy of the Slurm configuration files, so we create an NFS share to hold them. The other nodes then reference the same files through symbolic links, which simplifies later maintenance. For details on setting up NFS, see the separate article [NFS Setup on Ubuntu](https://hackmd.io/fXezVuTrQBiKqDoBZmdAdg). Here the controller node acts as the NFS server, set up as follows:

#### control node (NFS server)

1. Create the shared directory /shared and adjust its permissions so that all users can read and write.

   ```bash
   mkdir /shared
   chown -R nobody:nogroup /shared
   chmod -R 777 /shared
   ```
2. Install the Ubuntu package for the NFS server.

   ```bash
   apt update
   apt install nfs-kernel-server -y
   ```
3. Edit /etc/exports and add the shared directory and the subnet:

   ```bash
   # /etc/exports: the access control list for filesystems which may be exported
   #               to NFS clients.  See exports(5).
   #
   # Example for NFSv2 and NFSv3:
   # /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
   #
   # Example for NFSv4:
   # /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
   # /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
   #
   /shared 192.168.56.0/24(rw,sync,no_root_squash,no_subtree_check)
   ```
4. 
   Apply the settings.

   ```bash
   exportfs -a
   ```

#### compute node 1 / compute node 2 / slurm-mariadb (NFS client)

The other three hosts all need the NFS client package installed and the `/shared` directory from the controller node mounted. Since the network and hostnames are already configured, the controller can reach the other three hosts, so the installation is done through `pdsh` (which runs a command on several hosts at once over ssh).

```bash
# create /shared on each machine and set its permissions
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mkdir /shared"
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "chown -R nobody:nogroup /shared"
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "chmod -R 777 /shared"
# install the NFS client package
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt update"
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt install nfs-common -y"
# add the mount entry to /etc/fstab
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "echo '192.168.56.23:/shared /shared nfs defaults 0 0' >> /etc/fstab"
# mount the file system on the clients
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mount -a"
```

Check that the clients mounted it successfully:

```bash
pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mount -n | grep shared"
```

### Setting up the MUNGE key

MUNGE is a security tool and authentication service commonly used in high-performance computing (HPC) environments. It is designed to create and validate credentials based on a user's UID (user ID) and GID (group ID). The steps are:

1. Install munge on all four hosts (run from the control node).

   ```bash
   # control node
   sudo apt update
   sudo apt install munge
   # the other three hosts
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt update"
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt install munge -y"
   ```
2. Copy the control node's `/etc/munge/munge.key` to the NFS `/shared` directory.

   ```
   cp /etc/munge/munge.key /shared
   ```
3. On the control node, use pdsh to copy `munge.key` to the other three hosts.

   ```bash
   # the other three hosts
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "cp -a /shared/munge.key /etc/munge/munge.key"
   ```
4. Set the ownership of `/etc/munge/munge.key`.

   ```bash
   pdsh -w slurm-ctl,slurm-wrk-0[1-2],slurm-mariadb "chown munge:munge /etc/munge/munge.key"
   ```
5. Use md5sum to confirm that every host has the same munge key.

   ```bash
   pdsh -w slurm-ctl,slurm-wrk-0[1-2],slurm-mariadb "md5sum /etc/munge/munge.key"
   ```
6. 
   Restart the munge service so that slurm-wrk-0[1-2] and slurm-mariadb pick up the new munge key.

   ```bash
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "systemctl restart munge"
   ```

Note that MUNGE authentication requires **the clocks of all hosts in the cluster to be synchronized**.

### Setting up the slurm services

With the groundwork above in place, we can start building the cluster. Four hosts make up the cluster: one slurm controller, two slurm compute nodes, and one slurm database. Hosts with different roles need different slurm packages, but they all use the same configuration files, so we put those files on NFS and have every host create symbolic links to them.

#### control node

First set up the slurm controller node.

1. Install the following packages on the slurm control node.

   ```bash
   apt update
   apt install slurm-wlm slurm-wlm-doc -y
   ```
   * Confirm the installation. The service is inactive for now, but the corresponding service unit can be found.

     ```bash
     systemctl status slurmctld
     ```

   * `slurm-wlm`: main slurm package
   * `slurm-wlm-doc`: slurm documentation

#### compute nodes

Here we operate on the two compute nodes from the control node through pdsh. The installed slurmd is the compute node daemon, which receives the jobs dispatched by the control node and runs them.

```bash
pdsh -w slurm-wrk-0[1-2] "apt update"
pdsh -w slurm-wrk-0[1-2] "apt install slurmd -y"
```

* Confirm the installation. The service is inactive for now, but the corresponding service unit can be found.

  ```bash
  systemctl status slurmd
  ```

* `slurmd`: the compute node daemon

#### database

Log in to the database node directly for these steps.

1. Install mariadb-server and slurmdbd.

   ```bash
   apt update
   apt install mariadb-server slurmdbd -y
   ```
2. Before using mariadb for the first time, set up its security protections:
   * set the root password (`example` in this article)
   * remove anonymous users
   * remove the test database
   * disallow remote root login

   ```bash
   mysql_secure_installation
   ```
3. In the mariadb CLI, do the following:
   * create the slurm_usr user
   * create the slurm_acc_db and slurm_job_db databases
   * grant the user the required privileges

```bash
# log in to the mariadb CLI from the local host
$ mariadb -u root -p
Enter password:
# logged in to the mariadb CLI
Server version: 10.6.12-MariaDB-0ubuntu0.22.04.1 Ubuntu 22.04

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database slurm_acc_db;
MariaDB [(none)]> create database slurm_job_db;
# set the password of slurm_usr to example
MariaDB [(none)]> create user 'slurm_usr'@localhost identified by 'example';
# give slurm_usr full access to slurm_acc_db and slurm_job_db
MariaDB [(none)]> grant all privileges on slurm_acc_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> grant all privileges on slurm_job_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> flush privileges;
```

4. Check that the databases were created.

   ```
   MariaDB [(none)]> show databases;
   +--------------------+
   | Database           |
   +--------------------+
   | information_schema |
   | mysql              |
   | performance_schema |
   | slurm_acc_db       |
   | slurm_job_db       |
   | sys                |
   +--------------------+
   6 rows in set (0.001 sec)
   ```

   Both slurm_acc_db and slurm_job_db exist.
5. Check that slurm_usr was created.

   ```
   MariaDB [(none)]> select user,host from mysql.user;
   +-------------+-----------+
   | User        | Host      |
   +-------------+-----------+
   | mariadb.sys | localhost |
   | mysql       | localhost |
   | root        | localhost |
   | slurm_usr   | localhost |
   +-------------+-----------+
   4 rows in set (0.001 sec)
   ```

   slurm_usr exists.

### Setting up the slurm configuration files

There are four configuration files in total:

* `slurm.conf`: the main configuration file. It defines the node configuration of the cluster, the log file locations, scheduling parameters, and similar settings.
* `slurmdbd.conf`: settings for connecting to the backend database.
* `cgroup.conf`: used to limit, isolate, and manage the resource usage of scheduled jobs.
* `cgroup_allowed_devices_file.conf`: configures the devices allowed inside the control group.

This article puts these files on NFS, and every host then creates symbolic links to them.

1. From the control node, create the slurm configuration folder on NFS.

   ```bash
   mkdir -p /shared/HPC_SYS/slurm
   ```
2. 
   Create `slurm.conf` in `/shared/HPC_SYS/slurm` as follows:

   ```
   # General
   ClusterName=testp
   SlurmctldHost=slurm-ctl
   ProctrackType=proctrack/cgroup
   ReturnToService=2
   SlurmctldPidFile=/run/slurmctld.pid
   SlurmdPidFile=/run/slurmd.pid
   SlurmdSpoolDir=/var/lib/slurm/slurmd
   StateSaveLocation=/var/lib/slurm/slurmctld
   SlurmUser=slurm
   TaskPlugin=task/cgroup,task/affinity

   # SCHEDULING
   SchedulerType=sched/backfill
   SelectType=select/cons_tres
   SelectTypeParameters=CR_Core_Memory

   # LOGGING AND ACCOUNTING
   AccountingStorageType=accounting_storage/slurmdbd
   AccountingStorageUser=slurm_usr
   AccountingStorageHost=192.168.56.26
   AccountingStoragePort=6819
   JobCompType=jobcomp/none
   JobacctGatherType=jobacct_gather/cgroup
   SlurmctldDebug=info
   SlurmctldLogFile=/var/log/slurm/slurmctld.log
   SlurmdDebug=info
   SlurmdLogFile=/var/log/slurm/slurmd.log
   SlurmSchedLogFile=/var/log/slurm/slurmschd.log
   SlurmSchedLogLevel=3
   PrologFlags=Contain

   # Node
   NodeName=slurm-wrk-01 CPUs=2 RealMemory=1963
   NodeName=slurm-wrk-02 CPUs=2 RealMemory=1963

   # Partition
   PartitionName=testp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
   ```
3. Create `slurmdbd.conf` in `/shared/HPC_SYS/slurm` as follows:

   ```
   # Authentication info
   AuthType=auth/munge

   # SlurmDBD info
   DbdAddr=192.168.56.26
   DebugLevel=4
   LogFile=/var/log/slurm/slurmdbd.log
   PidFile=/run/slurmdbd.pid
   SlurmUser=slurm

   # Accounting database info
   StorageType=accounting_storage/mysql
   StorageHost=127.0.0.1
   StoragePort=3306
   StorageUser=slurm_usr
   # this is the DB user which owns the database
   StoragePass=example
   StorageLoc=slurm_acc_db
   ```
4. Create `cgroup.conf` in `/shared/HPC_SYS/slurm` as follows:

   ```
   CgroupAutomount=yes
   ConstrainCores=yes
   ```
5. Create `cgroup_allowed_devices_file.conf` in `/shared/HPC_SYS/slurm` as follows:

   ```
   /dev/null
   /dev/urandom
   /dev/zero
   /dev/sda*
   /dev/cpu/*/*
   /dev/pts/*
   /shared*
   ```
6. Adjust the ownership and permissions of the slurm conf files.

   ```
   chown slurm:slurm /shared/HPC_SYS/slurm/*.conf
   # configs stay world-readable so any user can run Slurm commands; slurmdbd.conf holds the DB password, so keep it private
   chmod 644 /shared/HPC_SYS/slurm/*.conf
   chmod 600 /shared/HPC_SYS/slurm/slurmdbd.conf
   ```
7. 
   All four hosts need symbolic links to these slurm configuration files on NFS. Start with the control node, then use pdsh from the control node to create the same links on the other three machines.

   ```bash
   # control node
   ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf
   ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf
   ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf
   ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf
   # work nodes + slurm database node
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf"
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf"
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf"
   pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf"
   ```
8. When everything is configured, restart the slurmdbd, slurmd, and slurmctld services from the control node.

   ```bash
   # remote slurm MariaDB database node
   pdsh -w slurm-mariadb "systemctl restart slurmdbd"
   # remote compute nodes
   pdsh -w slurm-wrk-0[1-2] "systemctl restart slurmd"
   # local control node
   systemctl restart slurmctld
   ```
9. On the control node, check the state of the cluster.

   ```bash
   $ sinfo
   >>>
   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
   testp*    up    infinite  2     idle  slurm-wrk-[01-02]
   ```
10. 
    On the control node, check the cluster configuration.

    ```bash
    $ scontrol show config
    >>>
    Configuration data as of 2024-01-18T09:17:16
    AccountingStorageBackupHost = (null)
    AccountingStorageEnforce = none
    AccountingStorageHost = localhost
    AccountingStorageExternalHost = (null)
    AccountingStorageParameters = (null)
    AccountingStoragePort = 6819
    AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
    AccountingStorageType = accounting_storage/slurmdbd
    AccountingStorageUser = N/A
    AccountingStoreFlags = (null)
    AcctGatherEnergyType = acct_gather_energy/none
    AcctGatherFilesystemType = acct_gather_filesystem/none
    AcctGatherInterconnectType = acct_gather_interconnect/none
    AcctGatherNodeFreq = 0 sec
    AcctGatherProfileType = acct_gather_profile/none
    AllowSpecResourcesUsage = No
    ...
    Cgroup Support Configuration:
    AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
    AllowedKmemSpace = (null)
    AllowedRAMSpace = 100.0%
    AllowedSwapSpace = 0.0%
    CgroupAutomount = yes
    CgroupMountpoint = /sys/fs/cgroup
    CgroupPlugin = (null)
    ConstrainCores = yes
    ConstrainDevices = no
    ConstrainKmemSpace = no
    ConstrainRAMSpace = no
    ConstrainSwapSpace = no
    MaxKmemPercent = 100.0%
    MaxRAMPercent = 100.0%
    MaxSwapPercent = 100.0%
    MemorySwappiness = (null)
    MinKmemSpace = 30 MB
    MinRAMSpace = 30 MB
    TaskAffinity = no

    Slurmctld(primary) at slurm-ctl is UP
    ```

Note.
1. If the slurm controller has a problem, check the following:
   1. Check the status of the slurm controller with `systemctl status slurmctld`.
   2. Check the content of `/var/log/slurm/slurmctld.log`.
   3. Debug based on the log content.
2. If a slurm worker has a problem, check the following:
   1. Check the status of the slurm worker with `systemctl status slurmd`.
   2. Check the content of `/var/log/slurm/slurmd.log`.
   3. Debug based on the log content.
3. If the slurm database has a problem, check the following:
   1. Check the status of the slurm database daemon with `systemctl status slurmdbd`.
   2. Check the content of `/var/log/slurm/slurmdbd.log`.
   3. Debug based on the log content.
4. Clocks in the cluster must be synchronized. Otherwise MUNGE key authentication fails, and the slurm logs also report inconsistent times.

REF

1. 
   [L2_02_Basic_HPC_Cluster_Setup_Howto_Guide.pdf](https://h3abionet.org/images/Technical_guides/L2_02_Basic_HPC_Cluster_Setup_Howto_Guide.pdf)
2. [Building a SLURM Cluster Using Amazon EC2 (AWS) Virtual Machines](https://mhsamsal.wordpress.com/2022/01/15/building-a-slurm-cluster-using-amazon-ec2-aws-virtual-machines/)
3. [Slurm Document](https://slurm.schedmd.com/overview.html)
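
### Submitting a test job (sketch)

The article sets out to dispatch a job to the cluster for testing, so here is a minimal sketch of what that could look like once `sinfo` reports both workers as idle. It is not part of the original setup: the file name `hello_slurm.sh`, the job name `hello`, and the output path under `/shared` are assumptions chosen for illustration. Only the partition name `testp` and the two-node layout come from the `slurm.conf` in this article. The script writes a small batch file and submits it only when `sbatch` is available on the host.

```bash
#!/usr/bin/env bash
# Sketch: write a tiny batch script and submit it to the "testp" partition.
set -euo pipefail

job_script="hello_slurm.sh"   # hypothetical file name

# The quoted heredoc keeps %j literal; Slurm replaces it with the job ID.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=testp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=/shared/hello_%j.out
# one task per node, each printing the hostname of the node it runs on
srun hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"   # submit the job
    squeue                 # list pending and running jobs
else
    echo "sbatch not found; run this on the control node" >&2
fi
```

Writing the output to the NFS share means the result file can be read from the control node even though the batch script itself runs on a compute node. If the job completes, the output file should list the hostnames of the two workers.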