Singularity & Slurm
===
###### tags: `TWCC`, `HPC`, `Singularity`, `Slurm`, `SIF`, `OTP`, `IDExpert`, `sinfo`, `sacct`, `srun`, `sbatch`

[TOC]

## Login
### How to log in
> Host account password plus an OTP (one-time password) code

```
$ ssh <account>@ln01.twcc.ai
#####################################################
#                                                   #
#            ~~ WELCOME TO TAIWANIA 2 ~~            #
#                                                   #
#   How to log in?                                  #
#   https://docs.twcc.ai/docs/login-logout/         #
#                                                   #
#####################################################
Please select the 2FA login method.
1. Mobile APP OTP
2. Mobile APP PUSH
3. Email OTP
Login method: 1
Password:       <--- password of the host account
OTP: 750880     <--- OTP code
[PASS] The OTP verification succeeded.
```

- How to log in
  https://docs.twcc.ai/docs/login-logout/
- Create an OTP device
  https://docs.twcc.ai/docs/system-account-password-otp#判斷是否已建立-otp-載具
    - Download the `IDExpert` app on your phone
        - Google Play: https://play.google.com/store/apps/details?id=com.changingtec.idexpert
        - Apple App Store: https://apps.apple.com/tw/app/idexpert/id1447069966

### Message shown after login
```
Dear User,

  To run your jobs, use Slurm commands:
  step 1: Prepare your job script first and specify Queue and ProjectID in it.
          $ vim job.sh
  step 2: Submit your job script to Slurm and then you'll get the job id.
          $ sbatch job.sh
  step 3: Trace job id and monitor your job.
          $ squeue -u ${USER}
          $ sacct -j $job_id

  Other handy Slurm commands:
    Display the information about Slurm nodes and partitions.
          $ sinfo
    Terminate your job.
          $ scancel job_id

  Other useful query commands:
    $ wallet                (check your project id and wallet balance)
    $ wallet $PROJECT_ID    (specify a project)
    $ cd /home/${USER}      (change directory to your home folder)
    $ cd /work/${USER}      (change directory to your work folder)

=================================================
Note:
  Do NOT use the login nodes for computation.
  No Bitcoin Mining!
  Please follow the rules !
  Otherwise, We will block your account :)
=================================================
```

<hr>

## Pull the ubuntu image and schedule it with Slurm
### Pull the image: `ubuntu:20.04`
```bash=
# Pull a public docker image
$ singularity pull docker://ubuntu:20.04
# or
$ singularity pull ubuntu:20.04

# List the resulting sif file
$ ls
ubuntu_20.04.sif
```

### Run the sif file
```bash=
$ singularity exec ubuntu_20.04.sif bash
Singularity> cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
```

### Schedule with Slurm
- ### Prepare query_os_info.sh
    ```bash
    #!/bin/bash

    echo '>>>>>'
    singularity exec ubuntu_20.04.sif cat /etc/os-release
    echo '<<<<<'
    ```
    - Without the `#!/bin/bash` line, submission fails with this error:
        ```
        $ sbatch --account=ENT110209 query_os_info.sh
        sbatch: error: This does not look like a batch script.  The first
        sbatch: error: line must start with #! followed by the path to an interpreter.
        sbatch: error: For instance: #!/bin/sh
        ```

- ### Schedule the job with Slurm
    ```
    $ sbatch --gpus-per-node=1 --account=ENT21**** query_os_info.sh
    sbatch: INFO: It is recommended to specify `--nodes` and `--ntasks-per-node` together
    Submitted batch job 628675
    ```
    - The job ID is **628675**
    - Use it later to check the status of the job
    - Without the `--gpus-per-node=<num>` option, submission fails with this error:
        ```
        sbatch: error: Missing assigned gpus, try to use --gpus-per-node=<num>
        sbatch: error: Batch job submission failed: Unspecified error
        ```
    - `<num>` must be between 1 and 8:
        ```
        sbatch: error: --gpus-per-node=<num> only accept 1,2,3,4,5,6,7,8
        sbatch: error: Batch job submission failed: Unspecified error
        ```
    - If the project does not exist, submission fails with this error:
        ```
        sbatch: error: Looks like ENT110209 is a non-existent project id o.O
        sbatch: error: Or you don't belong to this project
        sbatch: error: Please check the wallet information below :)
        sbatch: error: ----------------------------------- wallet info -----------------------------------
        sbatch: error: PROJECT_ID: ENT21****, PROJECT_NAME: T01-*******, SU_BALANCE: *****
        sbatch: error: PROJECT_ID: ENT21****, PROJECT_NAME: T01-*******, SU_BALANCE: *****
        sbatch: error: PROJECT_ID: GOV11****, PROJECT_NAME: OneAI******, SU_BALANCE: *****
        sbatch: error: -----------------------------------------------------------------------------------
        sbatch: error: Batch job submission failed: Unspecified error
        ```

- ### Slurm output
    ```
    $ ll
    -rw-r--r-- 1 <account> ENT21****       120 Jul 17 16:02 query_os_info.sh
    -rw-r--r-- 1 <account> ENT21****       382 Jul 17 16:03 query_os_info.sh.log
    -rw-r--r-- 1 <account> ENT21****       394 Jul 17 16:03 slurm-628678.out
    ```
    - The job output is written to `slurm-628678.out`
        ```bash=
        $ cat slurm-628678.out
        >>>>>
        NAME="Ubuntu"
        VERSION="20.04.4 LTS (Focal Fossa)"
        ID=ubuntu
        ID_LIKE=debian
        PRETTY_NAME="Ubuntu 20.04.4 LTS"
        VERSION_ID="20.04"
        HOME_URL="https://www.ubuntu.com/"
        SUPPORT_URL="https://help.ubuntu.com/"
        BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
        PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
        VERSION_CODENAME=focal
        UBUNTU_CODENAME=focal
        <<<<<
        ```
    - query_os_info.sh.log
        > The terminal log output, redirected to a specified file

### Check the Slurm job status
> [[TWCC] Checking job status](https://man.twcc.ai/@twccdocs/doc-twnia2-main-zh/https%3A%2F%2Fman.twcc.ai%2F%40twccdocs%2Fguide-twnia2-job-state-zh)

- ### sacct
    > displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database
    >
    ```
    $ sacct
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    628675       query_os_+       gp2d  ent21****          1    RUNNING      0:0
    628675.exte+     extern             ent21****          1    RUNNING      0:0
    ```
    - Possible states include COMPLETED, PENDING, and FAILED
    - If ExitCode does not start with "0", the job ended in an error state.
    - Lifecycle
      ![](https://hackmd.io/_uploads/Hk22ZZSu0.png)

<hr>

## Pull a private docker image and schedule it with Slurm
### 1. Create a dockerhub token
1. **Click the account icon, then click Account settings**
![](https://hackmd.io/_uploads/rk8K7-H_A.png =75%x)

---

2. **Click Security**
![](https://hackmd.io/_uploads/SkSym-Su0.png)

---

3. **Click New Access Token**
![](https://hackmd.io/_uploads/H1CfEbHOC.png)

### 2. Pull the private docker image
```bash
export SINGULARITY_DOCKER_USERNAME=mycloud
export SINGULARITY_DOCKER_PASSWORD=dckr_pat_abcdefghijklmnopqrstuvwxyz0123456789
singularity pull mycloud/llm-benchmark:v0.1
```
- The current directory now contains `llm-benchmark_v0.1.sif`
    ```
    $ ls -ls
    48656 -rwxr-xr-x 1 <account> ENT21**** 49811456 Jul 17 16:46 llm-benchmark_v0.1.sif
    ```

### 3. Run `llm-benchmark_v0.1.sif` in a singularity container
```
$ singularity exec \
    --env API_KEY=********-****-****-****-************ \
    --env MODEL=llama3-70b-0520 \
    --env OUTPUT_DIR=`pwd`/llama3-70b-0520-output \
    llm-benchmark_v0.1.sif \
    bash -c "cd /workspace/src && python main.py"
```

### 4. Run `llm-benchmark_v0.1.sif` through Slurm
- Create `run_benchmark.sh`
    ```bash
    #!/bin/sh

    # Set environment variables via `SINGULARITYENV_<key>=<value>`
    export SINGULARITYENV_API_KEY=********-****-****-****-************

    # Run llm-benchmark_v0.1.sif
    # - Set environment variables via `--env`
    singularity exec \
        --env MODEL=llama3-70b-0520 \
        --env OUTPUT_DIR=`pwd`/llama3-70b-0520-output \
        llm-benchmark_v0.1.sif \
        bash -c "cd /workspace/src && python main.py"
    ```
- Submit it to Slurm
    ```
    $ sbatch --gpus-per-node=1 --account=ENT21**** run_benchmark.sh
    sbatch: INFO: It is recommended to specify `--nodes` and `--ntasks-per-node` together
    Submitted batch job 628720
    ```
- Check the Slurm job status
    - `$ sacct` (show the history of jobs scheduled through Slurm)
    - `$ sacct -j 628720` (show the status of one job)
        ```
        $ sacct -j 628720
               JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        628720       run_bench+       gp2d  ent21****          1    RUNNING      0:0
        628720.exte+     extern             ent21****          1    RUNNING      0:0
        ```
    - `$ scontrol show job` (show the parameters of a job)
        ```
        $ scontrol show job 628720
        JobId=628720 JobName=run_benchmark.sh
           UserId=<account>(15444) GroupId=ENT21****(200231) MCS_label=N/A
           Priority=10004785 Nice=0 Account=ent21**** QOS=normal
           JobState=RUNNING Reason=None Dependency=(null)
           Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
           RunTime=00:03:44 TimeLimit=2-00:00:00 TimeMin=N/A
           SubmitTime=2024-07-17T17:15:46 EligibleTime=2024-07-17T17:15:46
           AccrueTime=2024-07-17T17:15:46
           StartTime=2024-07-17T17:15:47 EndTime=2024-07-19T17:15:47 Deadline=N/A
           PreemptTime=None SuspendTime=None SecsPreSuspend=0
           LastSchedEval=2024-07-17T17:15:47
           Partition=gp2d AllocNode:Sid=ln01-twnia2:97861
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=gn1214
           BatchHost=gn1214
           NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           TRES=cpu=1,mem=90G,node=1,billing=1,gres/gpu=1
           Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryNode=90G MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK
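           Contiguous=0 Licenses=(null) Network=(null)
           Command=/home/<account>/run_benchmark.sh
           WorkDir=/home/<account>
           StdErr=/home/<account>/slurm-628720.out
           StdIn=/dev/null
           StdOut=/home/<account>/slurm-628720.out
           Power=
           TresPerNode=gpu:1
        ```
    - `$ squeue` (show the state of the queue)
        ```
        $ squeue
         JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
        ...
        628720      gp2d run_benc <account>  R       6:18      1 gn1214
        ...
        ```
    - `$ scancel 628720` (cancel a running job)
      `$ scancel <job_id>`

## Building a custom image
> https://hackmd.io/@praexisio/BJdzfcEGY/%2F%40praexisio%2Fbuild

### To be verified
```
$ singularity build new_image.sif my_container
```

### Convert a sandbox to SIF
1. Create a writable sandbox container
   Use the `--sandbox` option to create a container that can be modified. For example:
   ```bash
   singularity build --sandbox my_sandbox docker://ubuntu
   ```
2. Modify the sandbox container
   After starting the container, you can install packages or make other changes:
   ```bash
   singularity shell --writable my_sandbox
   ```
   Exit the container when the changes are done.
3. Convert the sandbox to a SIF file
   Finally, use `singularity build` to convert the modified sandbox into a new SIF image:
   ```bash
   singularity build new_image.sif my_sandbox
   ```
   This saves the modified container as a new SIF file.

### Build a SIF directly from a Docker image
> To be verified
```
$ singularity build new_image.sif docker://ubuntu
```
Singularity builds directly from the given source (here, the Ubuntu image on Docker Hub) and writes a new SIF file, with no need to create a sandbox first.

<hr>

## Discussion
### How to pass environment variables into singularity
- ### Question
    `singularity exec llm-benchmark_v0.1.sif bash`
    How do I pass environment variables in when running singularity?

- ### ChatGPT-4o
    When using a Singularity container, you can pass environment variables into the container with the `--env` option. There are several ways to do this:

    ### 1. Use the `--env` option
    You can pass environment variables directly on the command line with `--env`, for example:
    ```bash
    singularity exec --env MY_VAR=value llm-benchmark_v0.1.sif bash
    ```

    ### 2. Use the `SINGULARITYENV_` prefix
    Another way is to set variables with the `SINGULARITYENV_` prefix. Singularity strips the prefix and sets the variable inside the container. For example:
    ```bash
    export SINGULARITYENV_MY_VAR=value
    singularity exec llm-benchmark_v0.1.sif bash
    ```

    ### 3. Use an environment variable file
    If there are several variables to pass, you can put them in a file and load that file before running the container. For example, create a `.env` file with the following content:
    ```bash
    MY_VAR1=value1
    MY_VAR2=value2
    ```
    Then load the file with `source` before running the container. A plain `source` only sets shell variables, so wrap it in `set -a` / `set +a` to export them into the environment that the container inherits:
    ```bash
    set -a        # export every variable assigned from here on
    source .env
    set +a        # stop auto-exporting
    singularity exec llm-benchmark_v0.1.sif bash
    ```

    Any of these methods passes environment variables into a Singularity container. Choose the one that best fits your needs.

### When running a singularity container, how do I switch to a given directory and then run python?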
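- ### Question
    `singularity exec llm-benchmark_v0.1.sif cd /workspace/src && python main`
    How do I switch to a given directory and then run python?

- ### ChatGPT-4o
    With `singularity exec` you can run several commands in sequence, first switching to the target directory and then running the Python script, by chaining them with `&&` inside a single command string. For example:
    ```bash
    singularity exec llm-benchmark_v0.1.sif bash -c "cd /workspace/src && python main"
    ```
    Here `bash -c` takes one command string containing several commands, and the whole string runs inside the Singularity container. Without the quotes, the host shell would split the line at `&&` and run `python main` outside the container.

### How to view the Slurm log
- ### Question
    `sbatch --gpus-per-node=1 --account=ENT210000 run_benchmark.sh`
    Assuming the job_id is 628746, how do I view the log for job_id=628746?

- ### ChatGPT-4o
    To inspect a specific job (assuming the job ID is 628746), you can use the following command:
    ```bash
    sacct -j 628746 --format=JobID,State,Elapsed,ExitCode
    ```
    This shows the job's state and basic information. The job's output and error logs are written to the output and error files chosen at submission time, which are usually specified in the job script, for example:
    ```bash
    #SBATCH --output=output_%j.log
    #SBATCH --error=error_%j.log
    ```
    `%j` is replaced with the job ID, so in this case you would look at `output_628746.log` and `error_628746.log`. You can view them with:
    ```bash
    cat output_628746.log
    cat error_628746.log
    ```
    If you did not specify output and error files, both streams go by default to a file named `slurm-<job_id>.out` in the directory you submitted the job from. You can view it with:
    ```bash
    cat slurm-628746.out
    ```
    If the log file is large, you can use `less` or `tail` to view part of it, for example:
    ```bash
    less slurm-628746.out
    tail -n 50 slurm-628746.out
    ```

## How to mount HPC storage
### Install sshfs
```
sudo apt-get update -y
sudo apt-get install sshfs
```

### Mount the HPC storage
- `/mnt/mywork/myfolder` <--> HPC: `<account>@xdata1.twcc.ai:/home/<account>`
    ```
    $> mkdir -p /mnt/mywork/myfolder
    $> sshfs <account>@xdata1.twcc.ai:/home/<account> /mnt/mywork/myfolder   # mount your own home into the current environment
    ```

### Unmount the HPC storage
- `$ umount /mnt/mywork/myfolder`

## References
- [[TWCC] How to log in](https://docs.twcc.ai/docs/login-logout/)
- [[TWCC] Checking job status](https://man.twcc.ai/@twccdocs/doc-twnia2-main-zh/https%3A%2F%2Fman.twcc.ai%2F%40twccdocs%2Fguide-twnia2-job-state-zh)
- [HPC high-performance computing jobs](https://man.twcc.ai/@twccdocs/doc-hpc-job-main-zh#Slurm-%E5%9F%BA%E6%9C%AC%E6%8C%87%E4%BB%A4%E7%B0%A1%E4%BB%8B)
    - To use **8 or more GPUs** or to run cross-node high-speed computation, use Taiwania 2 (command-line interface) and refer to the following tutorials:
        - [HowTo: Create a TWNIA2 container](https://www.twcc.ai/doc?page=howto-twnia2-create-sglrt-container)
        - [HowTo: Cross-node high-speed computing with containers - AI Benchmark](https://www.twcc.ai/doc?page=howto-twnia2-run-parallel-job-container)
- [Conda update plan & usage examples](https://hackmd.io/@kmo/twcc_hpc_conda)
- [[docs] singularity exec](https://docs.sylabs.io/guides/3.3/user-guide/cli/singularity_exec.html)

<hr>

## Discussion
### Can the `/home/<USER>` directory be left unmounted?
- Question
    Singularity on the HPC seems to mount the /home directory automatically. Is there a way to avoid that?

- gpt-4o-2024-08-06
    When you use Singularity on the HPC, it mounts the user's `/home` directory by default so that you can reach your home directory from inside the container. If you do not want `/home` mounted when the container starts, you can use Singularity's `--no-home` or `--contain` option.

    ### Solutions:
    1. **Use the `--no-home` option**: this avoids mounting the host `/home` directory into the container.
       ```
       singularity exec --no-home your_image.sif python your_script.py
       ```
       This option disables mounting the host `/home` directory and uses the `/home` path inside the container, if there is one (according to the option's help text, this applies when the current working directory is not under the home directory).

    2. **Use the `--contain` option**: if you want a more isolated environment, `--contain` not only skips `/home` but also isolates other directories such as `/tmp` and `/var/tmp`, and provides only a minimal default `/dev`.
       ```
       singularity exec --contain your_image.sif python your_script.py
       ```
       This creates a cleaner container environment in which most host directories are not mounted automatically.

    3. **Use `--bind` for custom mount points**: if you want to mount other directories without mounting `/home`, use the `--bind` option to specify them. For example:
       ```
       singularity exec --no-home --bind /my/custom/dir:/mnt your_image.sif python your_script.py
       ```
       This keeps `/home` unmounted while still mounting the directories you choose.

    ### Summary:
    - `--no-home` avoids mounting the `/home` directory.
    - `--contain` creates a more isolated environment: it skips `/home` and isolates other directories as well.
    - `--bind` mounts the directories you specify, and can be combined with `--no-home`.

- ### References
    - [singularity run options](https://docs.sylabs.io/guides/3.4/user-guide/cli/singularity_run.html#options)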
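
The `--no-home` and `--bind` options discussed above can also be used inside a Slurm job script. The following is only a minimal sketch under stated assumptions, not a verified TWCC recipe: the file name `run_nohome.sh`, the image `your_image.sif`, the script `/mnt/your_script.py`, and the choice to bind `/work/${USER}` are all hypothetical placeholders, and `ENT21****` stands for your own project ID. The snippet only writes the job script and checks its syntax; it does not submit anything.

```shell
#!/bin/bash
# Sketch only: generate a Slurm job script that runs a container without the
# host home directory, then syntax-check it. Nothing is submitted here.
set -eu

job_script="run_nohome.sh"   # hypothetical file name

# The quoted 'EOF' keeps ${USER} and %j unexpanded until the job actually runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
# Placeholder project ID: replace with your own.
#SBATCH --account=ENT21****
#SBATCH --gpus-per-node=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# %j expands to the job ID.
#SBATCH --output=slurm-%j.out

# Skip the host home directory and bind only the work folder instead.
singularity exec --no-home \
    --bind "/work/${USER}:/mnt" \
    your_image.sif python /mnt/your_script.py
EOF

# Parse the generated script without executing it.
bash -n "$job_script"
echo "wrote $job_script"
```

Once the placeholders are filled in, the generated file would be submitted with `sbatch run_nohome.sh`. Putting the options in `#SBATCH` lines rather than on the `sbatch` command line keeps the submission reproducible, and specifying `--nodes` together with `--ntasks-per-node` follows the recommendation printed by `sbatch` in the examples above.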