HPL - Fermi_v15
- CPU: 4 cores
- RAM: 4 GB
- GPU: RTX 3060
- Prerequisites: sudo apt install build-essential hwloc libhwloc-dev libevent-dev gfortran
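1. Install the NVIDIA driver
1.1 Download the runfile (.run)
Find the driver version that matches your GPU on the NVIDIA website.
Copy the download link and fetch the file with wget.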
1.2 Disable nouveau
Edit the config.
Update the kernel and reboot.
Check that nouveau was disabled successfully.
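The commands for this step did not survive in the note, so the following is only a minimal sketch of the usual approach on Ubuntu. The helper name nouveau_blacklist_conf and the file name blacklist-nouveau.conf are assumptions made for illustration; the root-only commands are left as comments so the block can be run safely as a preview.

```shell
# Sketch: print a modprobe blacklist for the nouveau driver.
# Assumption: the file name blacklist-nouveau.conf used below is illustrative;
# modprobe reads any *.conf file under /etc/modprobe.d.
nouveau_blacklist_conf() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0'
}

# Preview the two lines (writes nothing to the system).
nouveau_blacklist_conf

# To apply on the real machine (needs root, so left commented out here):
#   nouveau_blacklist_conf | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
#   sudo update-initramfs -u    # rebuild the initramfs so the blacklist applies at boot
#   sudo reboot
#   lsmod | grep nouveau        # no output means nouveau is no longer loaded
```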
1.3 Run the installer
(--no-cc-version-check): skips the compiler version check; without it the driver installer refuses to proceed when the GCC versions do not match.
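The installer offers two kernel module options:
- NVIDIA Proprietary
  - Closed source (proprietary)
  - Uses the kernel modules written and optimized by NVIDIA
  - Best performance and stability, widest feature support (e.g. CUDA, Optimus, VDPAU)
  - Under NVIDIA's proprietary license; the source is not open
- MIT/GPL
  - This is the open-source version (Open Kernel Module, OKM)
  - MIT/GPL licensed; NVIDIA has been gradually open-sourcing its kernel modules in recent years
  - Easier to integrate into open-source Linux distributions (e.g. Fedora, Arch)
  - Not yet fully equivalent to the proprietary version; on some GPUs features are incomplete or performance is slightly worse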
=> Choose NVIDIA Proprietary
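The installer detected a driver installed through apt, which may conflict with the runfile installation. If you are sure you will not use the apt driver, choose the option on the left to continue.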
The installer warns that 32-bit (i386 architecture) is not supported.
This has no effect if you only run 64-bit programs (HPL, CUDA).
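EGL is not installed; if you only use CUDA/HPL, this warning can be ignored.
If it is needed later, install the required packages and rerun the .run file.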
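The installer asks whether to run nvidia-xconfig,
which automatically updates the X server configuration file /etc/X11/xorg.conf.
- After boot, the system would use the NVIDIA driver as the default display adapter
No graphical interface is needed here because the machine only runs HPL, so choose No.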
This driver supports CUDA up to 12.8.
https://developer.nvidia.com/cuda-toolkit-archive
I tried installing CUDA 12.8.0 and 12.5.0.
2.1 Install CUDA
- Deselect the driver in the installer (it was already installed above)
- CUDA 12.8.0 installed successfully
- Set the environment variables
It is not that the toolkit is missing; the environment variables just had not been set.
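The exact lines from the note are lost, so here is a minimal sketch of the environment setup for the 12.8 toolkit. It assumes the runfile installed to its default prefix /usr/local/cuda-12.8; CUDA_HOME is the variable the note later says Make.CUDA relies on.

```shell
# Assumption: the runfile installed the toolkit to its default prefix.
export CUDA_HOME=/usr/local/cuda-12.8
# Prepend so this toolkit wins over any other copy already on the path.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# nvcc should now resolve if the toolkit is really installed there.
command -v nvcc || echo "nvcc not found under $CUDA_HOME/bin"
```

Putting these lines in ~/.bashrc makes them apply to every new shell.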
- Install the other CUDA toolkit the same way (I installed 12.5)
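2.2 Set up a symlink to switch between versions
A symlink is a symbolic link: like a "pointer", it points to the real directory.
Once the link exists, /usr/local/cuda "stands for" /usr/local/cuda-12.5.
The symlink can be pointed at another version with ln -sfn.
Edit the existing ~/.bashrc so that it uses the symlink path.
After that, switching CUDA versions only requires changing the symlink, not the environment variables.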
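The idea can be tried out without touching /usr/local. The sketch below repoints a symlink inside a scratch directory; the demo_root variable and the empty cuda-12.5 / cuda-12.8 directories are stand-ins created only for the demonstration.

```shell
# Demonstration in a scratch directory so nothing under /usr/local is touched.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/cuda-12.5" "$demo_root/cuda-12.8"

# -s: symbolic link, -f: replace an existing link,
# -n: treat an existing link to a directory as the link itself, not its target.
ln -sfn "$demo_root/cuda-12.5" "$demo_root/cuda"
readlink "$demo_root/cuda"

ln -sfn "$demo_root/cuda-12.8" "$demo_root/cuda"
readlink "$demo_root/cuda"

# On the real machine the equivalent would be (needs root):
#   sudo ln -sfn /usr/local/cuda-12.8 /usr/local/cuda
# with ~/.bashrc referring only to the link:
#   export CUDA_HOME=/usr/local/cuda
#   export PATH="$CUDA_HOME/bin:$PATH"
#   export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```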
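2.3 Create a script for quick switching
- Create a file named cuda-switch in the home directory.
- The CUDA version can now be changed by running cuda-switch.

- Add the script to PATH (in ~/.bashrc). The version can then be switched from any directory.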
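The original script body is not preserved in the note. The following is a hypothetical sketch of what such a helper could look like, written as a shell function; the name cuda_switch and the CUDA_PREFIX override are assumptions added so the logic can be exercised outside /usr/local.

```shell
# Hypothetical sketch of a cuda-switch helper (not the note's original script).
# CUDA_PREFIX is an illustrative override; it defaults to /usr/local.
cuda_switch() {
    version=${1:-}
    prefix=${CUDA_PREFIX:-/usr/local}
    target="$prefix/cuda-$version"
    if [ -z "$version" ] || [ ! -d "$target" ]; then
        echo "usage: cuda_switch <version> (no toolkit found at $target)" >&2
        return 1
    fi
    # Repoint the generic link at the requested toolkit.
    # Under /usr/local this needs root, e.g. run from a root shell.
    ln -sfn "$target" "$prefix/cuda" || return 1
    echo "cuda -> $target"
}
```

With ~/.bashrc pointing CUDA_HOME at /usr/local/cuda, a call such as cuda_switch 12.5 changes which toolkit every later command sees, and an unknown version is rejected before the link is touched.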

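3. SLURM
Installing Slurm on Ubuntu (using apt)
3.1 Install the Slurm package on all nodes
3.2 Create the configuration files on all nodes
- cgroup.conf (configuration for the resource isolation mechanism)
- slurm.conf (generated with configurator.html)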
4. Environment Modules
This also has to be set up across multiple nodes.
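5. Start compiling HPL Fermi V15
Much of what gets installed goes under /opt; on a cluster it can be shared over NFS.
5.0.1 Share /opt over NFS
In /etc/exports, the entry
/opt mpi-n1(rw,sync,no_subtree_check)
needs to be changed to
/opt mpi-n1(rw,sync,no_subtree_check,no_root_squash)
NFS enables root squash by default, which maps root on the client to nobody/nogroup, so even root cannot write to the share.
The /opt directory requires root permission to write by default.
Make /opt writable by everyone: sudo chmod 777 /opt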
5.1 Install Intel MKL
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html?operatingsystem=linux&linux-install=offline

It is installed under /opt/intel.

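Set the environment variables.
(On a cluster it seems the shared libraries also need to be linked.)
Installation complete (setvars.sh is the environment setup script shipped with Intel oneAPI):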

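5.2 Install OpenMPI (5.0.7)
Download the tarball and extract it.
Build it with the install prefix set to /opt/openmpi.
The build could be optimized further (not done yet).

Set the environment variables.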

5.3 Download HPL
After logging in on the official site:
https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64
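Edit Make.CUDA.
HPL 2.0 calls MPI-1 functions that were removed in the MPI-3.0 standard and are no longer provided by recent Open MPI releases such as 5.0.7, so some of the source has to be changed:
Lines 172, 186, 200: MPI_Address → MPI_Get_address
Line 211: MPI_Type_struct → MPI_Type_create_struct
Note that MPI_Get_Address is wrong: "address" is lowercase.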
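The two renames can also be applied mechanically rather than by hand. This is a minimal sketch using GNU sed; the function name patch_mpi_names is made up for illustration, and the file to pass in is whichever HPL source file contains the lines listed above (the note does not name it).

```shell
# Sketch: replace the removed MPI-1 names with their MPI-2 successors.
# Assumes GNU sed (for -i and \b); patch_mpi_names is an illustrative name.
patch_mpi_names() {
    # $1: path to the C source file that still uses the old names
    sed -i \
        -e 's/\bMPI_Address\b/MPI_Get_address/g' \
        -e 's/\bMPI_Type_struct\b/MPI_Type_create_struct/g' \
        "$1"
}
```

The word boundaries keep the substitution from touching longer identifiers that merely contain the old names. Checking the result with grep before rebuilding is a cheap safeguard.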

Add a symbolic link.
Start compiling.
Set up a working HPL.dat.
Edit bin/CUDA/run_linpack.
Add hpl-2.0_FERMI_v15 to LD_LIBRARY_PATH.
Run a test.
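5.4 Build against CUDA 12.5
First make a copy of the original tree.
Make.CUDA locates CUDA through CUDA_HOME, which the cuda-switch step above already takes care of, so simply rebuilding should compile against the switched version.
It did not.
TOPdir had not been changed, so I had only rebuilt the original tree.
Make.CUDA still has to be edited:
TOPdir = /work/hpl-2.0_FERMI_v15
→ TOPdir = /work/hpl-2.0_FERMI_v15_cuda12.5
Edit run_linpack so that it switches the version before it runs.
Check the CUDA version: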
ldd bin/CUDA/xhpl | grep dgemm
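This should list the matching libraries the binary is linked against.
It still does not seem conclusive, although the version numbers appear to differ.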


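Conclusion: a symlink makes the version hard to trace (everything recorded says /usr/local/cuda), so the real path should be used at compile time.
(Or is there another way?)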
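6. Configure HPL.dat
Ns: choose N so that N^2 * 8 = 0.8 * Mem, i.e. the N x N double-precision matrix (8 bytes per element) fills about 80% of memory.
At the moment N = 19000 does not run.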
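The sizing rule above can be turned into a quick calculation. This is a small sketch with awk; the function name hpl_problem_size and the block size of 192 in the example are assumptions, not values taken from the note's HPL.dat.

```shell
# Sketch: largest N with N^2 * 8 bytes <= fraction * memory, rounded down to
# a multiple of the block size NB.
# Assumptions: the function name and the NB value used below are illustrative.
hpl_problem_size() {
    mem_bytes=$1          # memory available for the matrix, in bytes
    nb=$2                 # HPL block size (NB in HPL.dat)
    fraction=${3:-0.8}    # share of memory given to the matrix
    awk -v m="$mem_bytes" -v nb="$nb" -v f="$fraction" 'BEGIN {
        n = int(sqrt(f * m / 8))   # 8 bytes per double-precision element
        print n - (n % nb)         # keep N a multiple of NB
    }'
}

# Example: 4 GiB of RAM (as on this machine) with a hypothetical NB of 192.
hpl_problem_size $((4 * 1024 * 1024 * 1024)) 192   # prints 20544
```

This is an upper estimate. Since N = 19000 already fails on this machine, the memory actually free for the matrix is evidently below 80% of the installed 4 GB, so a smaller fraction (the third argument) is the more realistic input.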
cuda 12.8
2. (This parameter set serves as the comparison against cuda 12.5.)
cuda 12.5
1. It looks like cuda 12.8 is somewhat better.
Profiling HPL