
Installing Slurm on Ubuntu (using apt)

Install the Slurm packages with apt (all nodes)

sudo apt update -y && sudo apt upgrade -y
sudo apt install -y slurm-wlm slurm-wlm-doc

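Create the configuration files (all nodes)

All of the configuration files below go under /etc/slurm

  • The path may differ slightly depending on the Ubuntu version
  • Some versions use /etc/slurm, others use /etc/slurm-wlm
  • Use the following command to confirm the exact path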
    ls /etc | grep slurm
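
Because the directory name varies, a script can detect it instead of hard-coding it. The following is only a minimal sketch: the function name find_slurm_conf_dir and its optional root argument are illustrative helpers, not part of Slurm.

```shell
# Print the first Slurm config directory that exists under <root>/etc.
# The optional root argument exists only so the function can be tried
# against a scratch directory; on a real node, call it with no arguments.
find_slurm_conf_dir() {
    root="${1:-}"
    for name in slurm slurm-wlm; do
        if [ -d "$root/etc/$name" ]; then
            printf '%s\n' "$root/etc/$name"
            return 0
        fi
    done
    echo "no Slurm config directory found under $root/etc" >&2
    return 1
}

# Example use on a node:
#   SLURM_CONF_DIR="$(find_slurm_conf_dir)"
```

The rest of this note writes /etc/slurm; substitute whatever path the check reports on your system.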
    

Every node needs its own copy of all the configuration files below, and the copies must be exactly identical
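
Once the files are in place, one way to confirm that the copies really match is to compare checksums between nodes. This is a sketch under assumptions: it relies on sha256sum from GNU coreutils, and conf_fingerprint is a hypothetical helper name.

```shell
# Print one checksum line per *.conf file in the given directory.
# File names are printed relative to the directory, so the output from
# two nodes can be compared directly with diff.
conf_fingerprint() {
    dir="$1"
    ( cd "$dir" && sha256sum -- *.conf )
}

# Example use, run on each node and compare the results:
#   conf_fingerprint /etc/slurm
```

If the output differs on any node, that node has a stale or edited copy and should receive the file again.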

cgroup.conf

See the official example

###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

slurm.conf

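The supported fields differ between Slurm versions, so it is best to build this file with the configuration generator that ships with the package

Find the installation location

Use the following command to find where the generator is installed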

dpkg -L slurm-wlm-doc | grep html
  • You should get output similar to this
    $ dpkg -L slurm-wlm-doc | grep html
    /usr/share/doc/slurm-wlm/html
    /usr/share/doc/slurm-wlm/html/Slurm_Entity.pdf
    /usr/share/doc/slurm-wlm/html/Slurm_Individual.pdf
    ...
    

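In the output above, the generator lives under /usr/share/doc/slurm-wlm/html. Because the generator is an HTML page, it has to be used through a browser, so the simplest approach is to start an HTTP server directly in that directory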

  • You can use the HTTP server built into Python
    cd /usr/share/doc/slurm-wlm/html # replace with the path you found
    python3 -m http.server
    
  • Open a browser and go to port 8000 on that machine
  • Find configurator.html in the page that opens
  • Fill in the required fields
    • Control Machines: SlurmctldHost
      Enter the hostname of the control node
    • Compute Machines: NodeName
      Enter the hostnames of the compute nodes; you can write [A-B] to mean A, A+1, A+2, ..., B
    • Compute Machines: CPU
      Enter the number of CPU cores on each compute node
    • State Preservation
      • Change StateSaveLocation to /var/spool/slurm
      • Change SlurmdSpoolDir to /var/spool/slurm/slurmd
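  • The two State Preservation paths can be changed to other locations or left at their defaults
  • If you do change them, make sure the directory whose permissions you adjust later is the one you configured here
  • When you are done, press submit at the bottom of the page; the page is then replaced by the generated Slurm configuration file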
  • Select all of the content and copy it
  • Paste the copied content into /etc/slurm/slurm.conf
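
After pasting, it can help to print just the fields filled in above and check them by eye. The sketch below assumes the generated file writes each of these settings as KEY=value at the start of a line; show_slurm_fields is a hypothetical helper name, and the hostnames and core count in the comment are made-up examples.

```shell
# Print the slurm.conf lines for the fields set in the generator form.
# Commented-out lines (starting with '#') are skipped by the anchor '^'.
show_slurm_fields() {
    conf="$1"
    grep -E '^(SlurmctldHost|NodeName|StateSaveLocation|SlurmdSpoolDir)=' "$conf"
}

# Example use:
#   show_slurm_fields /etc/slurm/slurm.conf
#
# For a hypothetical cluster with a control node named "control" and
# compute nodes node1..node4 with 8 cores each, the matching lines
# would look something like:
#   SlurmctldHost=control
#   StateSaveLocation=/var/spool/slurm
#   SlurmdSpoolDir=/var/spool/slurm/slurmd
#   NodeName=node[1-4] CPUs=8 State=UNKNOWN
```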

Set permissions on the state (spool) directory

Create a slurm directory under /var/spool

sudo mkdir /var/spool/slurm

Change its ownership to slurm:slurm

sudo chown slurm:slurm /var/spool/slurm
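
To confirm the ownership actually took effect, a small check like the one below can be used. It is a sketch that assumes GNU stat (the -c option), as shipped on Ubuntu; check_owner is a hypothetical helper name.

```shell
# Compare a path's owner and group against an expected "user:group".
check_owner() {
    path="$1"
    expected="$2"
    actual="$(stat -c '%U:%G' "$path")" || return 1
    if [ "$actual" = "$expected" ]; then
        echo "ok: $path is owned by $actual"
    else
        echo "mismatch: $path is owned by $actual, expected $expected" >&2
        return 1
    fi
}

# Example use:
#   check_owner /var/spool/slurm slurm:slurm
```

If you chose different State Preservation paths in the generator, pass those paths instead.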

Set up the munge key (multi-node)

Run the following on the control node

sudo create-munge-key

The munge key is stored at /etc/munge/munge.key. Copy this file to every node (and place it in /etc/munge on each one)

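  • First copy the key to ~
    sudo cp /etc/munge/munge.key ~
    
  • Temporarily change its permissions to 777 so that it can be read and copied to the other nodes
    sudo chmod 777 ~/munge.key
    
  • Send it to the other nodes with scp (or any other method); here it goes into ~ on another node
    scp ~/munge.key other_node:~/munge.key
    
  • Log in to the other node and move the key into /etc/munge
    sudo mv ~/munge.key /etc/munge
    
  • Fix the permissions and ownership to meet the security requirements
    sudo chmod 400 /etc/munge/munge.key
    sudo chown munge:munge /etc/munge/munge.key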
    

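Because the key spends some time at mode 777 during the copy, it is worth checking the final mode on every node. The following sketch assumes GNU stat; check_key_mode is a hypothetical helper name, and the default of 400 simply mirrors the chmod step above.

```shell
# Check that a key file has the expected octal mode (default: 400).
check_key_mode() {
    key="$1"
    want="${2:-400}"
    mode="$(stat -c '%a' "$key")" || return 1
    if [ "$mode" = "$want" ]; then
        echo "ok: $key has mode $mode"
    else
        echo "unexpected mode: $key has mode $mode, expected $want" >&2
        return 1
    fi
}

# Example use (root is needed to stat inside /etc/munge on some systems):
#   sudo sh -c '. ./check_key_mode.sh; check_key_mode /etc/munge/munge.key'
```

The script file name in the example is an assumption. After the key is replaced, the munge service on that node usually needs a restart (sudo systemctl restart munge), and munge -n | unmunge is a quick way to confirm that a credential can be encoded and decoded locally. Remember to delete the temporary copy left in ~ on the control node.
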
Start Slurm

  • control node
    sudo systemctl start slurmctld
    
  • compute node
    sudo systemctl start slurmd
    

Verify that startup succeeded

# List information for all nodes
sinfo
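
On a larger cluster it can be easier to list only the nodes that need attention. The sketch below reads node and state pairs from sinfo and prints every node whose state is not in a short allow-list. Which states count as healthy (idle, mixed, allocated, completing) is an assumption made for this example, and find_unhealthy_nodes is a hypothetical helper name.

```shell
# Read "node state" pairs on stdin and print the pairs whose state is
# not in the allow-list below. Blank lines are ignored.
find_unhealthy_nodes() {
    while read -r node state; do
        case "$state" in
            idle|mixed|allocated|completing) ;;  # treated as healthy here
            "") ;;                               # skip blank lines
            *) printf '%s %s\n' "$node" "$state" ;;
        esac
    done
}

# Example use: -h drops the header, -N prints one line per node,
# and the format prints the node name (%N) and its state (%T).
#   sinfo -h -N -o '%N %T' | find_unhealthy_nodes
```

No output means every node reported one of the allow-listed states. A state with a trailing asterisk, such as down*, indicates a node that is not responding, which often points to slurmd not running or a munge key mismatch on that node.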