# NVIDIA HPC-Benchmarks Multi-Node Setup

## Introduction

This article describes how to set up a containerized multi-node environment for running the HPL and HPCG benchmarks from NVIDIA HPC-Benchmarks.

## Installing Slurm

See [slurm-for-dummies by SergioMEV](https://github.com/SergioMEV/slurm-for-dummies). If you only have a few nodes, you can configure the control node to run compute jobs as well; see [Danduk82's answer on Stack Overflow](https://stackoverflow.com/questions/23497004/slurm-use-a-control-node-also-for-computing).

## Installing the GPU Driver

Version 525 is installed here; check the [NVIDIA driver download page](https://www.nvidia.com/Download/index.aspx?lang=en-us) first to pick a version that suits your hardware.

Start by checking whether a precompiled package exists for the version you want:

```bash
sudo ubuntu-drivers list --gpgpu
```

If 525 appears in the list, you can install it directly with apt:

```bash
sudo apt install nvidia-driver-525
```

If not, download the installer from the [NVIDIA driver download page](https://www.nvidia.com/Download/index.aspx?lang=en-us) and run it directly.

After installation, test it:

```bash
nvidia-smi
```

If the status of every GPU is displayed, the driver is working.

## Installing Docker (Optional)

>[!NOTE]
>Reference: https://docs.docker.com/engine/install/ubuntu/

First, set up the apt repository:

```bash
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
```

Then install the required packages:

```bash
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

After installation, test it by running the built-in hello-world image:

```bash
sudo docker run --rm hello-world
```

## Installing the NVIDIA Container Toolkit (Recommended)

This toolkit lets Docker containers make efficient use of NVIDIA GPUs.

First, set up the apt repository, as before:

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Update apt:

```bash
sudo apt-get update
```

Then install the toolkit:

```bash
sudo apt-get install -y nvidia-container-toolkit
```

Configure the Docker daemon to use the Container Toolkit:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
```

Restart Docker:

```bash
sudo systemctl restart docker
```

Finally, test whether Docker can find the GPUs:

```bash
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```

## Installing Singularity

### Installing dependencies

```bash
sudo apt-get install -y \
   autoconf \
   automake \
   cryptsetup \
   git \
   libfuse-dev \
   libglib2.0-dev \
   libseccomp-dev \
   libtool \
   pkg-config \
   runc \
   squashfs-tools \
   squashfs-tools-ng \
   uidmap \
   wget \
   zlib1g-dev
```

### Installing Go

```bash
wget https://go.dev/dl/go1.22.4.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.22.4.linux-amd64.tar.gz
# use tee: "sudo echo ... >>" would run the redirection without root
echo "export PATH=\$PATH:/usr/local/go/bin" | sudo tee -a /etc/profile
. /etc/profile
```

Verify the version:

```bash
go version
```

### Installing Singularity

```bash
export VERSION=4.1.0 && # adjust this as necessary \
    wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz && \
    tar -xzf singularity-ce-${VERSION}.tar.gz && \
    cd singularity-ce-${VERSION}
# Build and install from the extracted source tree
./mconfig && make -C builddir && sudo make -C builddir install
```

## Intel oneAPI + HPC Toolkit

Since our environment uses Intel nodes, we install Intel MPI rather than OpenMPI (OpenMPI works too, but is slower). To get it, we install the oneAPI Base Toolkit plus the HPC Toolkit.

First, set up the APT repository:

```bash
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
```

Update APT:

```bash
sudo apt update
```

Install the oneAPI Base Toolkit:

```bash
sudo apt install intel-basekit
```

Install the HPC Toolkit:

```bash
sudo apt install intel-hpckit
```

Finally, check that the installation succeeded. The oneAPI environment has to be loaded first, otherwise `mpirun` will not be on your `PATH`:

```bash
source /opt/intel/oneapi/setvars.sh
mpirun --version
```

Once everything is installed, prepare `~/hostfile`; `slots=4` means the node has 4 GPUs. An example:

```text
pca1 slots=4
pca2 slots=4
```
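Before launching any benchmark, it can be worth sanity-checking that MPI can actually reach every node in the hostfile. Below is a minimal smoke test, assuming passwordless SSH between the nodes is already set up and that your MPI launcher accepts the hostfile format above (the flags mirror the `mpirun` invocation used for HPL later):

```bash
# Should print each hostname once per slot
# (8 lines total for the two 4-slot nodes in the example hostfile)
mpirun -np 8 --hostfile ~/hostfile hostname
```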
## HPL and HPCG Tests

See the [NVIDIA HPC Benchmark Container Overview](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks).

If you installed the NVIDIA Container Toolkit, you can add `--nvccli` alongside any `singularity` command that takes the `--nv` flag.

For example, on 16 nodes with 4 GPUs each (64 GPUs total), HPL can be launched like this:

```bash
CONT='/path/to/hpc-benchmarks:24.03.sif'

srun -N 16 --ntasks-per-node=4 singularity run --nvccli \
     "${CONT}" \
     ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
```

Or, if you prefer not to use srun:

```bash
CONT='/path/to/hpc-benchmarks:24.03.sif'
HOSTFILE=hostfile

SINGULARITY_CMD="singularity exec --nvccli ${CONT}"

mpirun -np 64 --hostfile ${HOSTFILE} \
     ${SINGULARITY_CMD} ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
```

>[!NOTE]
>Edit `~/hostfile` before running this.
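HPCG runs the same way; only the wrapper script and the `.dat` file change. Below is a sketch for the same 16-node setup using the container's `hpcg.sh` wrapper — the sample `.dat` path is an assumption, so verify the actual location inside the container (for example with `singularity exec "${CONT}" ls /workspace`):

```bash
CONT='/path/to/hpc-benchmarks:24.03.sif'

# hpcg.sh is the HPCG counterpart of hpl.sh in the NVIDIA container;
# the sample-dat path below is assumed -- check it inside the image first
srun -N 16 --ntasks-per-node=4 singularity run --nvccli \
     "${CONT}" \
     ./hpcg.sh --dat /workspace/hpcg-linux-x86_64/sample-dat/hpcg.dat
```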