# InfiniBand Practice Notes

## Basic Setup

### Passwordless Login

1. In the VS Code SSH remote extension, add connections for the two IB nodes.
2. In `~/.ssh/config`, change the `Host` entries to `hpc-ib-n1` and `hpc-ib-n2`:

```
Host hpc-ib-n1
    HostName xxx.xx.xx.xxx
    User ntcucshpc
    Port 22054

Host hpc-ib-n2
    HostName xxx.xx.xx.xxx
    User ntcucshpc
    Port 22055
```

### Change the System Language

Add the following to `~/.bashrc`:

```bash
# use English
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```

### Edit `sources.list`

First check `/etc/apt/sources.list` and confirm that the original URL prefix is `http://tw.archive.ubuntu.com`, then switch it to the NCHC mirror:

```bash
sudo sed -i 's|http://tw.archive.ubuntu.com/ubuntu|http://free.nchc.org.tw/ubuntu|g' /etc/apt/sources.list
```

:::info
Since Ubuntu 24.04, this file has moved to `/etc/apt/sources.list.d/ubuntu.sources` and its format is different, but the same `sed` substitution still works when pointed at that file.
:::

### Install Required Packages

```bash
sudo apt update
sudo apt install -y vim build-essential net-tools tmux htop
```

## Install the MLNX_OFED Driver

### Identify the Adapter Model

```bash
$ lspci | grep Network
06:10.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
```

### Download the Driver

<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>

:::warning
According to the [official documentation](https://docs.nvidia.com/networking/display/mlnxofedv586042lts/release+notes), the ConnectX-3 Pro is no longer supported. The last release that supports it is 4.9 LTS, which only shows up after switching to Archive Versions in the download area.
:::

```bash
wget https://content.mellanox.com/ofed/MLNX_OFED-4.9-7.1.0.0/MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
tar -xzvf MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
```

```
cd MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
# Press y to confirm
```

:::danger
The following error appears:

```
...
Failed to install mlnx-ofed-kernel-dkms DEB
Collecting debug info...
See /tmp/MLNX_OFED_LINUX.135788.logs/mlnx-ofed-kernel-dkms.debinstall.log
```

Excerpt from the log:

```
/usr/bin/dpkg -i --force-confnew --force-confmiss /home/ntcucshpc/MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64/DEBS/MLNX_LIBS/mlnx-ofed-kernel-dkms_4.9-OFED.4.9.7.1.0.1_all.deb
Selecting previously unselected package mlnx-ofed-kernel-dkms.
(Reading database ...
196683 files and directories currently installed.)
Preparing to unpack .../mlnx-ofed-kernel-dkms_4.9-OFED.4.9.7.1.0.1_all.deb ...
Unpacking mlnx-ofed-kernel-dkms (4.9-OFED.4.9.7.1.0.1) ...
Setting up mlnx-ofed-kernel-dkms (4.9-OFED.4.9.7.1.0.1) ...
Loading new mlnx-ofed-kernel-4.9 DKMS files...
First Installation: checking all kernels...
Building only for 5.15.0-134-generic
Building for architecture x86_64
Building initial module for 5.15.0-134-generic
ERROR (dkms apport): unable to determine source package for mlnx-ofed-kernel-dkms
Error! Bad return status for module build on kernel: 5.15.0-134-generic (x86_64)
Consult /var/lib/dkms/mlnx-ofed-kernel/4.9/build/make.log for more information.
dpkg: error processing package mlnx-ofed-kernel-dkms (--install):
 installed mlnx-ofed-kernel-dkms package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 mlnx-ofed-kernel-dkms
```

[Related discussion](https://forums.developer.nvidia.com/t/failed-to-install-mlnx-ofed-kernel-dkms-deb-with-version-4-9-4-1-7-0/205889/2)
:::

Uninstall before continuing:

```bash
sudo /usr/sbin/ofed_uninstall.sh
```

### Adding the Kernel Version to the Installer (Did Not Work)

:::info
You can skip straight to the next section.
:::

Check the kernel version with `uname`:

```bash
$ uname -r
5.15.0-134-generic
```

The likely cause is that the kernel is too new and is not in the official [supported list](https://docs.nvidia.com/networking/display/mlnxofedv497100lts/general+support+in+mlnx_ofed#GeneralSupportinMLNX_OFED-MLNX_OFEDSupportedOperatingSystems).

Check the installer options:

```bash
$ sudo ./mlnxofedinstall --help
```

There is an `--add-kernel-support` option, which might make the kernel supported by way of `mlnx_add_kernel_support.sh`.

---

Update the kernel headers:

```bash
sudo apt install linux-headers-$(uname -r)
```

Try running the script directly:

```bash
sudo ./mlnx_add_kernel_support.sh -m ./
```

:::danger
An error appears:

```
ERROR: Failed executing "MLNX_OFED_SRC-4.9-7.1.0.0/install.pl --tmpdir /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs --kernel-only --kernel 5.15.0-134-generic --kernel-sources /lib/modules/5.15.0-134-generic/build --builddir
/tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778 --without-dkms --without-debug-symbols --build-only --distro ubuntu20.04"
ERROR: See /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/mlnx_ofed_iso.286778.log
Failed to build MLNX_OFED_LINUX for 5.15.0-134-generic
```

Log:

```
Logs dir: /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs
General log file: /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs/general.log

Below is the list of OFED packages that you have chosen
(some may have been added by the installer due to package dependencies):

ofed-scripts
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-modules
rshim-modules
iser-modules
isert-modules
srp-modules
mlnx-nvme-modules
kernel-mft-modules
knem-modules

Checking SW Requirements...
One or more required packages for installing OFED-internal are missing.
Attempting to install the following missing packages:
quilt debhelper bzip2 dh-autoreconf gcc make pkg-config build-essential
This program will install the OFED package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with OFED, do not reinstall them.

Installing new packages
Building DEB for ofed-scripts-4.9 (ofed-scripts)...
Running /usr/bin/dpkg-buildpackage -us -uc
Building DEB for mlnx-ofed-kernel-utils-4.9 (mlnx-ofed-kernel)...
-W- --with-mlx5-ipsec is enabled
Running /usr/bin/dpkg-buildpackage -us -uc
Failed to build mlnx-ofed-kernel DEB
Collecting debug info...
See /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs/mlnx-ofed-kernel.debbuild.log
```

`mlnx-ofed-kernel.debbuild.log` contains:

```
Error: CONFIG_MLX5_ESWITCH not support kernel version 5.6 or higher (current: 5.15.0-134-generic).
```

So the only option is to downgrade the kernel and then install the driver.
:::

Related articles:

- [ConnectX-3 on Ubuntu 20.04 - Infrastructure & Networking / Adapters and Cables - NVIDIA Developer Forums](https://forums.developer.nvidia.com/t/connectx-3-on-ubuntu-20-04/206201)
- [Mellanox网卡驱动固件升级案例](https://shenzhoukuntai.com/post/detail/208/4433)
- [ubuntu20.04上安装驱动 | 电脑硬件学习笔记](https://skyao.io/learning-computer-hardware/nic/hp544/driver/ubuntu/ubuntu20.04/)

## Downgrade the Kernel

In the MLNX_OFED 4.9 [supported list](https://docs.nvidia.com/networking/display/mlnxofedv497100lts/general+support+in+mlnx_ofed#GeneralSupportinMLNX_OFED-MLNX_OFEDSupportedOperatingSystems), Ubuntu 20.04 is supported only up to kernel 5.4, so the next step is to downgrade to that version.

### Install Kernel 5.4

Check the current system information:

```bash
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
$ uname -sr
Linux 5.15.0-134-generic
```

List the installable 5.4 kernels:

```bash
apt-cache search linux-image-5.4.0 | grep generic
```

The newest one is `linux-image-5.4.0-208-generic`. Install that kernel together with its matching headers:

```bash
sudo apt install linux-image-5.4.0-208-generic linux-headers-5.4.0-208-generic
```

Because the set of installed kernels has changed, `/boot/grub/grub.cfg` is regenerated automatically.

### Configure the GRUB Menu

Look up the GRUB menu entry names in `/boot/grub/grub.cfg`:

```bash
$ grep gnulinux /boot/grub/grub.cfg
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-134-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-134-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-134-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-134-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-67-generic' --class
ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-67-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-67-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-67-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.4.0-208-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.0-208-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.4.0-208-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.0-208-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
```

The boot entry we want is reached by choosing `Advanced options for Ubuntu` and then `Ubuntu, with Linux 5.4.0-208-generic`.

To boot into that kernel automatically, edit `/etc/default/grub`. Find `GRUB_DEFAULT=0`, comment it out, and add this line (the `>` separates the submenu title from the entry title):

```conf
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.4.0-208-generic"
```

After editing the file, remember to regenerate the GRUB configuration:

```bash
sudo update-grub
```

After rebooting, confirm that the kernel is now 5.4:

```bash
$ uname -r
5.4.0-208-generic
```

## Install the Driver Again

```
cd MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
# Press y to confirm
```

The following error occurs:

```
DKMS: install completed.
Building initial module for 5.15.0-134-generic
ERROR (dkms apport): unable to determine source package for mlnx-ofed-kernel-dkms
Error!
Bad return status for module build on kernel: 5.15.0-134-generic (x86_64)
```

Although the running kernel is now 5.4, DKMS still tries to build the module for the 5.15 kernels that remain installed, and that build fails.

List all installed kernels:

```bash
dpkg --list | grep linux-image
```

Remove the kernels that are no longer needed:

```bash
sudo apt purge linux-image-5.15.0-134-generic linux-headers-5.15.0-134-generic linux-image-5.15.0-67-generic linux-headers-5.15.0-67-generic
sudo apt autoremove
```

Install the driver:

```
sudo ./mlnxofedinstall
```

Load the driver:

```
sudo /etc/init.d/openibd restart
```

Run `ibstat` to confirm that the installation succeeded:

```bash
$ ibstat
CA 'mlx4_0'
    CA type: MT4103
    Number of ports: 2
    Firmware version: 2.35.5100
    Hardware version: 0
    Node GUID: 0x480fcfffffec3e10
    System image GUID: 0x480fcfffffec3e13
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40 (FDR10)
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0251486a
        Port GUID: 0x480fcfffffec3e11
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02514868
        Port GUID: 0x480fcfffffec3e12
        Link layer: InfiniBand
```

The adapter has two ports, and only port 1 is cabled.

## Test the Connection

### Start OpenSM

First confirm that OpenSM runs correctly:

```bash
sudo opensm
```

Once `Entering MASTER state` appears, press <kbd>Ctrl</kbd> + <kbd>C</kbd> to stop it, then run OpenSM as a daemon:

```bash
sudo /etc/init.d/opensmd start
```

### Ping Test Between Nodes

Use `ibstat` to confirm the adapter name (`mlx4_0`) and the LID (`1`).

n1 (server):

```bash
sudo ibping -S -C mlx4_0 -P 1
```

- `-S`: run in server mode
- `-C`: CA (channel adapter) name
- `-P`: CA port number to use

n2 (client), pinging LID 1. The usage is `ibping [options] <dest lid|guid>`:

```bash
$ sudo ibping 1
Pong from hpc-ib-n1.(none) (Lid 1): time 0.552 ms
Pong from hpc-ib-n1.(none) (Lid 1): time 0.333 ms
Pong from hpc-ib-n1.(none) (Lid 1): time 0.405 ms
^C
--- hpc-ib-n1.(none) (Lid 1) ibping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2442 ms
rtt min/avg/max = 0.333/0.430/0.552 ms
```

### Assign IP Addresses

Load the IPoIB module:

```bash
sudo modprobe ib_ipoib
```

Look up the interface name that corresponds to each port:

```bash
$ ibdev2netdev
mlx4_0 port 1 ==> ibp6s16 (Up)
mlx4_0 port 2 ==> ibp6s16d1 (Down)
```

Set the IP address on `n1`:

```bash
sudo ip addr add 10.0.10.3/17 dev ibp6s16
sudo ip link set ibp6s16 up
```

Set the IP address on `n2` (it needs its own address; the iperf3 run later in these notes shows the client as `10.0.10.4`):

```bash
sudo ip addr add 10.0.10.4/17 dev ibp6s16
sudo ip link set ibp6s16 up
```

### Measure Bandwidth with `ib_read_bw`

`ib_write_bw` can be used in the same way.

n1 (server):

```bash
$ ib_read_bw -a
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x01 QPN 0x0261 PSN 0x35ca0f OUT 0x10 RKey 0xc8010100 VAddr 0x007fcaa5401000
 remote address: LID 0x02 QPN 0x0262 PSN 0x558979 OUT 0x10 RKey 0xd0010100 VAddr 0x007f8da1a59000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 8388608    1000           39.27              39.27                0.000585
---------------------------------------------------------------------------------------
```

- `-a`: run every message size, from 2 bytes up to 2^23 bytes

n2 (client):

```bash
$ ib_read_bw -a --report_gbits 10.0.10.3
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex.
method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0260 PSN 0xb88a4e OUT 0x10 RKey 0xc0010100 VAddr 0x007fc2ba892000
 remote address: LID 0x01 QPN 0x025f PSN 0x99d360 OUT 0x10 RKey 0xb8010100 VAddr 0x007f4773345000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          1000           0.077704           0.062133             3.883315
 4          1000           0.15               0.14                 4.438859
 8          1000           0.32               0.30                 4.720743
 16         1000           0.80               0.69                 5.379386
 32         1000           1.28               1.11                 4.343282
 64         1000           2.38               2.25                 4.392266
 128        1000           5.08               4.82                 4.703802
 256        1000           10.02              9.31                 4.543610
 512        1000           19.89              17.18                4.193167
 1024       1000           38.20              36.78                4.489978
 2048       1000           38.84              37.77                2.305010
 4096       1000           38.98              38.41                1.172041
 8192       1000           39.08              39.01                0.595204
 16384      1000           38.67              38.58                0.294341
 32768      1000           39.30              39.30                0.149922
 65536      1000           39.28              39.27                0.074896
 131072     1000           39.07              39.07                0.037255
 262144     1000           39.31              39.25                0.018716
 524288     1000           39.22              39.16                0.009336
 1048576    1000           39.28              39.24                0.004677
 2097152    1000           39.24              39.17                0.002335
 4194304    1000           39.20              39.20                0.001168
 8388608    1000           39.27              39.27                0.000585
---------------------------------------------------------------------------------------
```

Result: ==39.27 Gbps== (average at the largest message size)

### Measure Bandwidth with iperf

Install iperf3, the newer iperf:

```bash
sudo apt install iperf3 -y
```

---

Run with the default parameters.

n1 (server):

```bash
iperf3 -s
```

n2 (client):

```bash
$ iperf3 -c 10.0.10.3
Connecting to host 10.0.10.3, port 5201
[  5] local 10.0.10.4 port 50144 connected to 10.0.10.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   798 MBytes  6.70 Gbits/sec  2778    282 KBytes
[  5]   1.00-2.00   sec   739 MBytes  6.20 Gbits/sec  3025    428 KBytes
[  5]   2.00-3.00   sec   775 MBytes  6.50 Gbits/sec  3687    288 KBytes
[  5]   3.00-4.00   sec   798 MBytes  6.70 Gbits/sec  3302    224 KBytes
[  5]   4.00-5.00   sec   863 MBytes  7.24 Gbits/sec  3521    255 KBytes
[  5]   5.00-6.00   sec   861 MBytes  7.22 Gbits/sec  3335    265 KBytes
[  5]   6.00-7.00   sec   827 MBytes  6.94 Gbits/sec  3989    259
KBytes
[  5]   7.00-8.00   sec   720 MBytes  6.04 Gbits/sec  3312    276 KBytes
[  5]   8.00-9.00   sec   797 MBytes  6.68 Gbits/sec  3006    195 KBytes
[  5]   9.00-10.00  sec   864 MBytes  7.25 Gbits/sec  3909    274 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  7.85 GBytes  6.75 Gbits/sec  33864             sender
[  5]   0.00-10.00  sec  7.85 GBytes  6.74 Gbits/sec                  receiver

iperf Done.
```

Result: ==6.74 Gbps==

### Switch InfiniBand (IPoIB) to Connected Mode

https://docs.nvidia.com/networking/display/mlnxofedv543750lts/ip+over+infiniband+(ipoib)

```bash
sudo ip link set ibp6s16 down
echo "connected" | sudo tee /sys/class/net/ibp6s16/mode
sudo ip link set ibp6s16 up
```

The interface was originally in datagram mode. Rerunning iperf3 with the default parameters after the switch reaches ==9.45 Gbps==.

## NFS Over RDMA

https://skyao.io/learning-ubuntu-server/command/network/nfs/nfsordma/

## Open MPI

### Install UCX

```bash
cd ~
wget https://github.com/openucx/ucx/releases/download/v1.18.0/ucx-1.18.0.tar.gz
tar xf ucx-1.18.0.tar.gz
cd ucx-1.18.0
mkdir build
cd build
../contrib/configure-release --prefix=/opt/ucx-1.18.0
# Use && so that the install only runs after the build has finished successfully
make -j && sudo make install
```

### Install Open MPI

Install the required packages:

```
sudo apt install libnuma-dev libudev-dev zlib1g-dev
```

```bash
cd ~
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.gz
tar xf openmpi-5.0.7.tar.gz
cd openmpi-5.0.7
./configure --prefix=/opt/openmpi --with-ucx=/opt/ucx-1.18.0
# Use && so that the install only runs after the build has finished successfully
make -j && sudo make install
```

## References

- [Ubuntu 切换指定版本的内核 – 陈少文的网站](https://www.chenshaowen.com/blog/set-specific-kernel-version-in-ubuntu/)
- [首頁 | Grub 探索筆記](https://samwhelp.github.io/note-about-grub/)
- [如何設定 GRUB 預設的開機項目? | MagicLen](https://magiclen.org/grub-default/)
- https://docs.redhat.com/zh-cn/documentation/red_hat_enterprise_linux/8/html-single/configuring_infiniband_and_rdma_networks/index#configuring-an-ipoib-connection-using-nmcli-commands_configuring-ipoib
- https://blog.csdn.net/msdnchina/article/details/71133494
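
## Appendix: Reading Port State from a Script

The `ibstat` check in the driver section was done by eye. When the same check has to be repeated on both nodes, it may be convenient to reduce the output to one line per port. The following is only a minimal sketch: `port_states` is a hypothetical helper name (it is not part of MLNX_OFED or infiniband-diags), and it assumes the usual `ibstat` layout with one `Key: value` field per line.

```shell
#!/usr/bin/env bash
# port_states (hypothetical helper): read `ibstat` output on stdin and
# print one line per port in the form: CA_NAME PORT STATE RATE
port_states() {
  awk '
    # A two-field line starting with CA opens a new adapter block; strip the quotes.
    $1 == "CA" && NF == 2 { ca = $2; gsub("\047", "", ca); next }
    # A line such as "Port 1:" opens a new port block ("Port GUID:" does not match).
    $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(":", "", port); next }
    # Logical link state; the "Physical state:" line has a different first field.
    $1 == "State:" { state = $2; next }
    # Rate is the last field needed, so emit the record here.
    $1 == "Rate:" { print ca, port, state, $2 }
  '
}

# Usage on a node with the driver loaded:
#   ibstat | port_states
```

For the `ibstat` listing recorded in these notes, this should print `mlx4_0 1 Active 40` and `mlx4_0 2 Down 10`, which makes it easy to spot a port that is not `Active` before starting OpenSM or the bandwidth tests.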