# Basic HPC cluster setup with slurm (Ubuntu 22.04)

This article is an exercise in building a simple HPC-style cluster. It covers:

* An introduction to HPC
* An introduction to Slurm
* What `/etc/hosts` is for
* How to change the hostname
* Setting up passwordless SSH authentication
* Setting up NFS
* The pdsh command
* Configuring MUNGE
* Building a Slurm cluster

### HPC overview

High-Performance Computing (HPC) means using advanced computing resources and techniques to handle large and complex computational workloads, such as highly complex analysis problems in science, engineering, and other fields.

Besides high-performance hardware, HPC also requires fine-tuning and optimizing the corresponding software environment so that users get a stable service. This article uses Ubuntu 22.04 together with the Slurm resource manager to build the cluster.

### Slurm overview

Slurm (Simple Linux Utility for Resource Management) is an open-source resource scheduler and cluster manager. It has three main parts:

1. Slurm controller: manages the whole cluster and schedules jobs. It receives jobs submitted by users and, based on resource state and management policy, assigns them to available compute nodes.
2. Slurm compute nodes: the nodes that run the actual computation. They own the compute resources such as CPUs, memory, storage, and network. When the controller assigns work to a compute node, the node is responsible for executing it.
3. Slurm database: stores job state, user accounts, node information, and so on.

![image](https://hackmd.io/_uploads/HyXZgOSY6.png)

### Basic cluster architecture

This article uses the architecture shown below to build a simple cluster, and then dispatches a job to it for testing.

![image](https://hackmd.io/_uploads/BJnfoT3ua.png)

Prepare four VMs running Ubuntu 22.04. Their roles are:

1. slurm control node (**the Slurm controller**)
   - VM resource requirements:
     * CPU: 2 cores
     * Memory: 2G
     * Storage: 20G
2. slurm compute node 1 / node 2 (**the Slurm compute nodes**)
   - VM resource requirements:
     * CPU: 2 cores
     * Memory: 2G
     * Storage: 20G
3. slurm database (**the Slurm database node**)
   - VM resource requirements:
     * CPU: 2 cores
     * Memory: 2G
     * Storage: 20G

**Note.**
1. Large HPC systems usually have a layer of login nodes for users to log in to, so that many concurrent logins do not put a heavy load on any single host; in this article the login node and the control node share the same machine.
2. Large HPC systems also use high-speed parallel shared storage (e.g. GPFS or Lustre); in this article NFS provides the shared storage for the cluster.

---

### Network setup

#### Control node

Configure the system network for the control node. On Ubuntu the network configuration lives in `/etc/netplan/00-installer-config.yaml`.

1. `$sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:
```
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3:
      # public ip range
      dhcp4: true
    enp0s8:
      # private ip range
      addresses: [192.168.56.23/24]
      dhcp4: false
  version: 2
```
- enp0s3: configured via DHCP; this interface can reach the public network.
- enp0s8: configured with a static IP in the `192.168.56.0/24` subnet. The compute nodes of the cluster must sit in the same subnet so that the Slurm control node and compute nodes can talk to each other (a quick connectivity check is sketched at the end of this section).
  - For the subnet concept, look up CIDR and subnet masks.
  - When editing the YAML file, pay attention to indentation and key-value formatting, otherwise netplan will easily report errors.
3. `$sudo netplan try` -> check whether the network configuration has errors
4. `$sudo netplan apply` -> apply the network configuration

#### Compute node 1

Configure the system network for compute node 1; again the configuration is in `/etc/netplan/00-installer-config.yaml`.

1. `$sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:
```
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3:
      # public ip range
      dhcp4: true
    enp0s8:
      # private ip range
      addresses: [192.168.56.24/24]
      dhcp4: false
  version: 2
```
3. `$sudo netplan try` -> check whether the network configuration has errors
4. `$sudo netplan apply` -> apply the network configuration

#### Compute node 2

Configure the system network for compute node 2; again the configuration is in `/etc/netplan/00-installer-config.yaml`.

1. `$sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:
```
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3:
      # public ip range
      dhcp4: true
    enp0s8:
      # private ip range
      addresses: [192.168.56.25/24]
      dhcp4: false
  version: 2
```
3. `$sudo netplan try` -> check whether the network configuration has errors
4. `$sudo netplan apply` -> apply the network configuration

#### Slurm database

Configure the system network for the Slurm database node; again the configuration is in `/etc/netplan/00-installer-config.yaml`.

1. `$sudo vi /etc/netplan/00-installer-config.yaml`
2. Edit the file as follows:
```
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3:
      # public ip range
      dhcp4: true
    enp0s8:
      # private ip range
      addresses: [192.168.56.26/24]
      dhcp4: false
  version: 2
```
3. `$sudo netplan try` -> check whether the network configuration has errors
4. `$sudo netplan apply` -> apply the network configuration
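After applying netplan on each node, it is worth confirming that the private interface really carries its static address and that the nodes can reach one another. A minimal check from the control node, assuming the interface names and addresses used above:

```bash
# Show the address bound to the private interface
$ip -br addr show enp0s8

# Ping the other nodes on the private subnet
$ping -c 3 192.168.56.24
$ping -c 3 192.168.56.25
$ping -c 3 192.168.56.26
```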
---

### Hostname setting

**The hostnames follow the naming plan of this article. They matter because `/etc/hosts` maps each hostname to an IP address; if a hostname is set incorrectly, the IP mapping will not work.**

#### Control node

Two files need to be modified:

1. `/etc/hosts` -> change the `127.0.1.1` entry to `127.0.1.1 slurm-ctl`
2. `/etc/hostname` -> edit this file so its content is `slurm-ctl`
3. Reboot; the hostname setting then takes effect.

#### Compute node 1

1. `/etc/hosts` -> change the `127.0.1.1` entry to `127.0.1.1 slurm-wrk-01`
2. `/etc/hostname` -> edit this file so its content is `slurm-wrk-01`
3. Reboot; the hostname setting then takes effect.

#### Compute node 2

1. `/etc/hosts` -> change the `127.0.1.1` entry to `127.0.1.1 slurm-wrk-02`
2. `/etc/hostname` -> edit this file so its content is `slurm-wrk-02`
3. Reboot; the hostname setting then takes effect.

#### Slurm database

1. `/etc/hosts` -> change the `127.0.1.1` entry to `127.0.1.1 slurm-mariadb`
2. `/etc/hostname` -> edit this file so its content is `slurm-mariadb`
3. Reboot; the hostname setting then takes effect.

---

### Setting up the hosts file

The `/etc/hosts` file provides a local DNS for static IPs, so that each name resolves to the correct IP address, and the Slurm configuration commonly refers to hosts by hostname.

Therefore the following entries need to be added to `/etc/hosts` on all four hosts:

1. `sudo vi /etc/hosts`
2. Add the following content:
```
#loopback address
127.0.1.1 slurm-ctl

## SLURM cluster private IP range
# Controller
192.168.56.23 slurm-ctl

# compute nodes
192.168.56.24 slurm-wrk-01
192.168.56.25 slurm-wrk-02

# mariadb
192.168.56.26 slurm-mariadb
```

### Setting up passwordless SSH authentication

To make node management easier later on, we want to log in as root from the control node to root on the other nodes quickly and without a password.

The steps are:

1. Switch to the root user
```bash
$sudo su
```
2. On the control node, generate an RSA key pair in root's `.ssh` directory
```bash
$ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa

# copy the public key content
$cat /root/.ssh/id_rsa.pub
```
3. Put the public key content into `/root/.ssh/authorized_keys` on the control node, compute node 1 / node 2, and slurm-mariadb. (If `authorized_keys` does not exist yet, create it first. A convenient way to copy the key is sketched after this section.)
4. Test SSHing from the control node to the other nodes:
```bash
$ssh slurm-ctl
$ssh slurm-wrk-01
$ssh slurm-wrk-02
$ssh slurm-mariadb
```
5. How the authentication flow works:
```
Control node (client)                       Compute node (server)

1. Initiate connection
   ssh slurm-wrk-01        ----------------->  Check authorized_keys and
                                               find the matching public key
2. Challenge
                           <-----------------  Encrypt a random string
                                               (challenge) with the public key
3. Answer the challenge
   Decrypt the challenge with
   the *private* key and
   return a signature      ----------------->  Verify the signature with
                                               the public key
4. Authentication succeeds
   Logged in (no password) ----------------->  Open a shell session
```
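Instead of pasting the public key into each `authorized_keys` by hand, the key can usually be pushed with `ssh-copy-id`. This is a hedged sketch; it assumes password-based root SSH login is still permitted on the target nodes at this point:

```bash
# Run as root on the control node; you will be prompted once
# for the root password of each target host.
$ssh-copy-id -i /root/.ssh/id_rsa.pub root@slurm-wrk-01
$ssh-copy-id -i /root/.ssh/id_rsa.pub root@slurm-wrk-02
$ssh-copy-id -i /root/.ssh/id_rsa.pub root@slurm-mariadb
```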
### Installing pdsh

Install pdsh on the control node first. pdsh is a tool for running remote shell commands, i.e. it can issue a command to many hosts at the same time, which makes it convenient to operate on the other hosts later.

```bash
$apt update
$apt install pdsh -y
```

Add the following to `/root/.bashrc` on the control node:
```
export PDSH_RCMD_TYPE=ssh
```
Apply it immediately:
```
$source /root/.bashrc
```
Check the pdsh command:
```
$pdsh -w slurm-ctl,slurm-wrk-0[1-2],slurm-mariadb hostname
```
Note: make sure the same user name exists on both ends and that the keys are placed in the corresponding `/<user_dir>/.ssh`.

### Setting up NFS

NFS (Network File System) is used to share files across a distributed system.

![image](https://hackmd.io/_uploads/r1IKQPzKa.png)

Because every node in the cluster needs an identical copy of the Slurm configuration files, we build an NFS share and let the other nodes reference the same configuration through symbolic links, which simplifies later maintenance.

For the details of setting up NFS, see the separate article [NFS Setup on Ubuntu](https://hackmd.io/fXezVuTrQBiKqDoBZmdAdg). Here the control node acts as the NFS server. The setup is as follows:

#### Control node (NFS server)

1. Create the shared directory `/shared` and adjust its permissions so that every user can read and write it.
```bash
$mkdir /shared
$chown nobody:nogroup -R /shared
$chmod 777 -R /shared
```
2. Install the Ubuntu NFS server package.
```bash
$apt update
$apt install nfs-kernel-server -y
```
3. Edit `/etc/exports` and add the shared directory and subnet:
```bash
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes gss/krb5i(rw,sync,no_subtree_check)
#
/shared 192.168.56.0/24(rw,sync,no_root_squash,no_subtree_check)
```
4. Apply the export settings:
```bash
$exportfs -a
```

#### Compute node 1 / compute node 2 / slurm-mariadb (NFS clients)

The other three hosts need the NFS client package installed, and need to mount the `/shared` directory created on the control node into their own filesystems.

Because the network and hostnames were configured in the previous steps, the control node can reach the other three hosts, so everything here is installed with `pdsh` (which sends a command to multiple hosts at once over SSH).

```bash
# Create the /shared directory on each machine and set its permissions
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mkdir /shared"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "chown nobody:nogroup -R /shared"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "chmod 777 -R /shared"

# Install the NFS client package
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt update"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt install nfs-common -y"

# Add the mount to /etc/fstab
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "echo '192.168.56.23:/shared /shared nfs defaults 0 0' >> /etc/fstab"

# Mount the filesystem on the clients
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mount -a"
```

Check whether the clients mounted the share successfully:
```bash
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "mount -n | grep shared"
```

### Setting up the MUNGE key

MUNGE is a security tool and authentication service commonly used in HPC environments. It is designed to create and validate credentials based on a user's UID and GID.

The steps are:

1. Install the munge package on all four hosts (run from the control node)
```bash
# control node
$sudo apt update
$sudo apt install munge

# the other three hosts
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt update"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt install munge -y"
```
2. Copy the control node's `/etc/munge/munge.key` to the NFS `/shared` directory.
```
$cp /etc/munge/munge.key /shared
```
3. From the control node, use pdsh to copy `munge.key` to the other three hosts:
```bash
# the other three hosts
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "cp -a /shared/munge.key /etc/munge/munge.key"
```
4. Fix the ownership of `/etc/munge/munge.key`:
```bash
$pdsh -w slurm-ctl,slurm-wrk-[01-02],slurm-mariadb "chown munge:munge /etc/munge/munge.key"
```
5. Use md5sum to confirm that all hosts have the same munge key:
```bash
$pdsh -w slurm-ctl,slurm-wrk-[01-02],slurm-mariadb "md5sum /etc/munge/munge.key"
```
6. Restart the munge service so that slurm-wrk-[01-02] and slurm-mariadb pick up the new munge key:
```bash
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "systemctl restart munge"
```

Note that with MUNGE authentication, **the clocks of all hosts in the cluster must be synchronized**. A quick way to sanity-check the key distribution and the clocks is sketched below.
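The following sketch is not from the original article, but it is a common way to confirm that MUNGE works across nodes and that the clocks agree: a credential generated on the control node should decode successfully on a compute node, and `chrony` is one option for keeping time in sync.

```bash
# Encode a credential locally and decode it on a compute node;
# "STATUS: Success (0)" means the keys match and the clocks are close enough.
$munge -n | ssh slurm-wrk-01 unmunge

# One option for time synchronization: install chrony everywhere
$apt install chrony -y
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "apt install chrony -y"

# Confirm each host reports "System clock synchronized: yes"
$pdsh -w slurm-ctl,slurm-wrk-0[1-2],slurm-mariadb "timedatectl"
```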
### Setting up the Slurm services

With the groundwork in place, we can now build the cluster itself.

Four hosts make up the cluster: one Slurm controller, two Slurm compute nodes, and one Slurm database node. Each role needs its own Slurm packages, but they all share the same configuration files, so the configuration is kept on NFS and every host later creates symbolic links to those files.

#### Control node

First, set up the Slurm controller node.

1. Install the following packages on the Slurm control node:
```bash
$apt update
$apt install slurm-wlm slurm-wlm-doc -y
```
* Confirm the installation; the service is still inactive at this point, but the corresponding unit should exist:
```bash
$systemctl status slurmctld
```
`slurm-wlm`: main Slurm package
`slurm-wlm-doc`: Slurm documentation

#### Compute nodes

From the control node, use pdsh to operate on the two compute nodes. The `slurmd` installed here is the compute node daemon, which receives the work dispatched by the controller and executes it.

```bash
$pdsh -w slurm-wrk-0[1-2] "apt update"
$pdsh -w slurm-wrk-0[1-2] "apt install slurmd -y"
```
* Confirm the installation; the service is still inactive at this point, but the corresponding unit should exist:
```bash
$systemctl status slurmd
```
`slurmd`: compute node daemon

#### Database

Log in to the database node directly and work there.

1. Install mariadb-server and slurmdbd.
```bash
$apt update
$apt install mariadb-server slurmdbd -y
```
2. Harden MariaDB when it is started for the first time:
* set the root password (here it is set to `example`)
* remove anonymous users
* remove the test database
* disallow remote root login
```bash
$mysql_secure_installation
```
3. In the MariaDB CLI, do the following:
* create the `slurm_usr` user
* create the `slurm_acc_db` and `slurm_job_db` databases
* grant the user the required privileges
```bash
# from localhost into the MariaDB CLI
$mariadb -u root -p
Enter password:

# logged in to the MariaDB CLI
Server version: 10.6.12-MariaDB-0ubuntu0.22.04.1 Ubuntu 22.04

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database slurm_acc_db;
MariaDB [(none)]> create database slurm_job_db;

# set slurm_usr's password to "example"
MariaDB [(none)]> create user 'slurm_usr'@localhost identified by 'example';

# grant slurm_usr full access to slurm_acc_db and slurm_job_db
MariaDB [(none)]> grant all privileges on slurm_acc_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> grant all privileges on slurm_job_db.* to 'slurm_usr'@localhost;
MariaDB [(none)]> flush privileges;
```
4. Check that the databases were created:
```
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| slurm_acc_db       |
| slurm_job_db       |
| sys                |
+--------------------+
6 rows in set (0.001 sec)

slurm_acc_db and slurm_job_db exist.
```
5. Check that `slurm_usr` was created (a quick login check as this user is sketched after this section):
```
MariaDB [(none)]> select user,host from mysql.user;
+-------------+-----------+
| User        | Host      |
+-------------+-----------+
| mariadb.sys | localhost |
| mysql       | localhost |
| root        | localhost |
| slurm_usr   | localhost |
+-------------+-----------+
4 rows in set (0.001 sec)

slurm_usr exists.
```
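Before wiring slurmdbd to the database, it can be worth confirming that `slurm_usr` itself can log in and see the two databases. A hedged one-liner; the password `example` is the one set above:

```bash
# Should list slurm_acc_db and slurm_job_db (plus information_schema)
$mariadb -u slurm_usr -pexample -e "SHOW DATABASES;"
```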
### Slurm configuration files

There are four configuration files in total:

`slurm.conf`: the main configuration file; it defines the node configuration of the cluster, log file locations, scheduling parameters, and so on.
`slurmdbd.conf`: the settings for connecting to the backend database.
`cgroup.conf`: used to limit, isolate, and manage the resources used by scheduled jobs.
`cgroup_allowed_devices_file.conf`: configures which devices the control groups may access.

This article keeps these files on NFS, and every host later creates symbolic links that point to them.

1. From the control node, create the Slurm configuration folder on NFS:
```bash
$mkdir -p /shared/HPC_SYS/slurm
```
2. In `/shared/HPC_SYS/slurm`, create `slurm.conf` as follows:
```
# General
ClusterName=testp
SlurmctldHost=slurm-ctl
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmUser=slurm
TaskPlugin=task/cgroup,task/affinity

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm_usr
AccountingStorageHost=192.168.56.26
AccountingStoragePort=6819
JobCompType=jobcomp/none
JobacctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmschd.log
SlurmSchedLogLevel=3
PrologFlags=Contain

# Node
NodeName=slurm-wrk-01 CPUs=2 RealMemory=1963
NodeName=slurm-wrk-02 CPUs=2 RealMemory=1963

# Partition
PartitionName=testp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
3. In `/shared/HPC_SYS/slurm`, create `slurmdbd.conf` as follows:
```
# Authentication info
AuthType=auth/munge

# SlurmDBD info
DbdAddr=192.168.56.26
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid
SlurmUser=slurm

# Accounting database info
StorageType=accounting_storage/mysql
StorageHost=127.0.0.1
StoragePort=3306
StorageUser=slurm_usr # this is the DB user which owns the database
StoragePass=example
StorageLoc=slurm_acc_db
```
4. In `/shared/HPC_SYS/slurm`, create `cgroup.conf` as follows:
```
CgroupAutomount=yes
ConstrainCores=yes
```
5. In `/shared/HPC_SYS/slurm`, create `cgroup_allowed_devices_file.conf` as follows:
```
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/shared*
```
6. Adjust the file permissions of the Slurm configuration files:
```
$chmod 600 /shared/HPC_SYS/slurm/*.conf
$chown slurm:slurm /shared/HPC_SYS/slurm/*.conf
```
7. All four hosts need symbolic links to these Slurm configuration files on NFS. Start with the control node, then use pdsh from the control node to create the symbolic links on the other three machines:
```bash
# control node
$ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf
$ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf
$ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf
$ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf

# compute nodes + slurm database node
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/slurm.conf /etc/slurm/slurm.conf"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/slurmdbd.conf /etc/slurm/slurmdbd.conf"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/cgroup.conf /etc/slurm/cgroup.conf"
$pdsh -w slurm-wrk-0[1-2],slurm-mariadb "ln -s /shared/HPC_SYS/slurm/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf"
```
8. Once everything is configured, restart the slurmdbd, slurmd, and slurmctld services from the control node:
```bash
# remote: slurm mariadb node
$pdsh -w slurm-mariadb "systemctl restart slurmdbd"

# remote: compute nodes
$pdsh -w slurm-wrk-0[1-2] "systemctl restart slurmd"

# local: control node
$systemctl restart slurmctld
```
9. On the control node, check the cluster state:
```bash
$sinfo
>>>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testp*       up   infinite      2   idle slurm-wrk-[01-02]
```
10. On the control node, check the cluster configuration (a simple job-submission test is sketched after this list):
```bash
$scontrol show config
>>>
Configuration data as of 2024-01-18T09:17:16
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = (null)
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
...

Cgroup Support Configuration:
AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
CgroupPlugin = (null)
ConstrainCores = yes
ConstrainDevices = no
ConstrainKmemSpace = no
ConstrainRAMSpace = no
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no

Slurmctld(primary) at slurm-ctl is UP
```
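With the partition reporting idle nodes, the last step is the one the article set out to do: dispatch a job to the cluster. A minimal sketch; the script name `hello.sh` and the sbatch options are illustrative examples, not from the original article:

```bash
# Run `hostname` on both compute nodes interactively
$srun -N2 hostname

# Submit a small batch job to the testp partition
$cat > /shared/hello.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=testp
#SBATCH --nodes=1
hostname
EOF
$sbatch /shared/hello.sh

# Watch the queue and, once slurmdbd is recording, the accounting history
$squeue
$sacct --format=JobID,JobName,Partition,State,Elapsed
```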
**Note.**

1. If the Slurm controller has problems, check the following:
   a. Check the controller's status: `systemctl status slurmctld`
   b. Check the contents of `/var/log/slurm/slurmctld.log`
   c. Debug according to what the log says
2. If a Slurm worker has problems, check the following:
   a. Check the worker's status: `systemctl status slurmd`
   b. Check the contents of `/var/log/slurm/slurmd.log`
   c. Debug according to what the log says
3. If the Slurm database has problems, check the following:
   a. Check the database daemon's status: `systemctl status slurmdbd`
   b. Check the contents of `/var/log/slurm/slurmdbd.log`
   c. Debug according to what the log says
4. The clocks in the cluster must stay synchronized; otherwise MUNGE key authentication fails and the Slurm logs will report clock-skew errors.

REF
1. [L2_02_Basic_HPC_Cluster_Setup_Howto_Guide.pdf](https://h3abionet.org/images/Technical_guides/L2_02_Basic_HPC_Cluster_Setup_Howto_Guide.pdf)
2. [Building a SLURM Cluster Using Amazon EC2 (AWS) Virtual Machines](https://mhsamsal.wordpress.com/2022/01/15/building-a-slurm-cluster-using-amazon-ec2-aws-virtual-machines/)
3. [Slurm Documentation](https://slurm.schedmd.com/overview.html)
