# InfiniBand Practice Notes
## Basic Setup
### Passwordless Login
1. In VS Code's Remote - SSH extension, add connections for the two IB nodes.
2. In `~/.ssh/config`, rename the two `Host` entries to `hpc-ib-n1` and `hpc-ib-n2`:
```
Host hpc-ib-n1
HostName xxx.xx.xx.xxx
User ntcucshpc
Port 22054
Host hpc-ib-n2
HostName xxx.xx.xx.xxx
User ntcucshpc
Port 22055
```
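The config above only names the hosts; logging in without a password also needs a key pair installed on both nodes. The following is a minimal sketch, assuming an ed25519 key with an empty passphrase is acceptable on this cluster. `setup_passwordless` and the `DRY_RUN` switch are hypothetical names introduced here for illustration; `ssh-keygen` and `ssh-copy-id` are the standard OpenSSH tools.

```bash
# setup_passwordless KEY HOST...: create KEY if it does not exist, then
# install its public half on every HOST (hypothetical helper).
# With DRY_RUN set, the commands are printed instead of executed.
setup_passwordless() {
    local key="$1"; shift
    local run=""
    [ -n "${DRY_RUN:-}" ] && run="echo"
    if [ ! -f "$key" ]; then
        # -N "" means no passphrase: convenient for a lab, weaker security.
        $run ssh-keygen -t ed25519 -f "$key" -N ""
    fi
    local host
    for host in "$@"; do
        # Host names are resolved through ~/.ssh/config.
        $run ssh-copy-id -i "$key.pub" "$host"
    done
}
```

A dry run such as `DRY_RUN=1 setup_passwordless ~/.ssh/id_ed25519 hpc-ib-n1 hpc-ib-n2` shows what would be executed before anything is changed.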
### Change the System Language
Add the following to `~/.bashrc`:
```bash
# use English
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```
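### Edit `sources.list`
First check `/etc/apt/sources.list` to confirm the existing URL prefix (here `http://tw.archive.ubuntu.com`), then switch it to the NCHC mirror. The pattern in the `sed` command must match the prefix actually found in the file:
```bash
sudo sed -i 's|http://tw.archive.ubuntu.com/ubuntu|http://free.nchc.org.tw/ubuntu|g' /etc/apt/sources.list
```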
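Because `sed -i` silently does nothing when the pattern does not match, it can help to count matches first and keep a backup. The sketch below is one way to do that; `switch_mirror` is a hypothetical helper name, and the old prefix passed to it is assumed to be whatever the file really contains.

```bash
# switch_mirror FILE OLD NEW: replace the mirror prefix OLD with NEW in FILE.
# Prints the number of lines that contained OLD; fails if there were none.
switch_mirror() {
    local file="$1" old="$2" new="$3" hits
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 2; }
    # grep -c exits non-zero on zero matches but still prints 0.
    hits=$(grep -c -F "$old" "$file" || true)
    if [ "$hits" -eq 0 ]; then
        echo "no line in $file contains $old" >&2
        return 1
    fi
    cp -p "$file" "$file.bak"          # keep the original next to the file
    # '|' as delimiter avoids escaping slashes; dots in OLD are treated as
    # regex wildcards, which is harmless for these URLs.
    sed -i "s|$old|$new|g" "$file"
    echo "$hits"
}
```

Trying it on a copy of the file first (for example one created with `cp /etc/apt/sources.list /tmp/`) avoids needing `sudo` while checking the result.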
:::info
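From Ubuntu 24.04 onward, this file has moved to `/etc/apt/sources.list.d/ubuntu.sources` and its format is different, but the same `sed` substitution still works when pointed at the new path.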
:::
### Install Required Packages
```bash
sudo apt update
sudo apt install -y vim build-essential net-tools tmux htop
```
## Installing the MLNX_OFED Driver
### Identify the NIC Model
```bash
$ lspci | grep Network
06:10.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
```
### Download the Driver
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>
:::warning
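According to the [official documentation](https://docs.nvidia.com/networking/display/mlnxofedv586042lts/release+notes), the ConnectX-3 Pro is no longer supported. The last release that supports it is 4.9 LTS, which only shows up after switching to Archive Versions in the download area.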
:::
```bash
wget https://content.mellanox.com/ofed/MLNX_OFED-4.9-7.1.0.0/MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
tar -xzvf MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
```
```bash
cd MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
# press y to confirm
```
:::danger
The following error appears:
```
...
Failed to install mlnx-ofed-kernel-dkms DEB
Collecting debug info...
See /tmp/MLNX_OFED_LINUX.135788.logs/mlnx-ofed-kernel-dkms.debinstall.log
```
Excerpt from the log:
```
/usr/bin/dpkg -i --force-confnew --force-confmiss /home/ntcucshpc/MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64/DEBS/MLNX_LIBS/mlnx-ofed-kernel-dkms_4.9-OFED.4.9.7.1.0.1_all.deb
Selecting previously unselected package mlnx-ofed-kernel-dkms.
(Reading database ... 196683 files and directories currently installed.)
Preparing to unpack .../mlnx-ofed-kernel-dkms_4.9-OFED.4.9.7.1.0.1_all.deb ...
Unpacking mlnx-ofed-kernel-dkms (4.9-OFED.4.9.7.1.0.1) ...
Setting up mlnx-ofed-kernel-dkms (4.9-OFED.4.9.7.1.0.1) ...
Loading new mlnx-ofed-kernel-4.9 DKMS files...
First Installation: checking all kernels...
Building only for 5.15.0-134-generic
Building for architecture x86_64
Building initial module for 5.15.0-134-generic
ERROR (dkms apport): unable to determine source package for mlnx-ofed-kernel-dkms
Error! Bad return status for module build on kernel: 5.15.0-134-generic (x86_64)
Consult /var/lib/dkms/mlnx-ofed-kernel/4.9/build/make.log for more information.
dpkg: error processing package mlnx-ofed-kernel-dkms (--install):
installed mlnx-ofed-kernel-dkms package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
mlnx-ofed-kernel-dkms
```
[Related discussion](https://forums.developer.nvidia.com/t/failed-to-install-mlnx-ofed-kernel-dkms-deb-with-version-4-9-4-1-7-0/205889/2)
:::
Uninstall the partial installation before continuing:
```bash
sudo /usr/sbin/ofed_uninstall.sh
```
### Adding the Kernel Version to the Installer (did not work)
:::info
You can skip straight to the next section.
:::
Check the kernel version with `uname`:
```bash
$ uname -r
5.15.0-134-generic
```
The likely cause is that the kernel is too new and is not in the official [support list](https://docs.nvidia.com/networking/display/mlnxofedv497100lts/general+support+in+mlnx_ofed#GeneralSupportinMLNX_OFED-MLNX_OFEDSupportedOperatingSystems).
Check the installer options:
```bash
$ sudo ./mlnxofedinstall --help
```
The options include `--add-kernel-support`, which suggests that `mlnx_add_kernel_support.sh` might be able to add support for this kernel.
---
Install the headers for the running kernel:
```bash
sudo apt install linux-headers-$(uname -r)
```
Try running the script directly:
```bash
sudo ./mlnx_add_kernel_support.sh -m ./
```
:::danger
An error message appears:
```
ERROR: Failed executing "MLNX_OFED_SRC-4.9-7.1.0.0/install.pl --tmpdir /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs --kernel-only --kernel 5.15.0-134-generic --kernel-sources /lib/modules/5.15.0-134-generic/build --builddir /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778 --without-dkms --without-debug-symbols --build-only --distro ubuntu20.04"
ERROR: See /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/mlnx_ofed_iso.286778.log
Failed to build MLNX_OFED_LINUX for 5.15.0-134-generic
```
Log:
```
Logs dir: /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs
General log file: /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs/general.log
Below is the list of OFED packages that you have chosen
(some may have been added by the installer due to package dependencies):
ofed-scripts
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-modules
rshim-modules
iser-modules
isert-modules
srp-modules
mlnx-nvme-modules
kernel-mft-modules
knem-modules
Checking SW Requirements...
One or more required packages for installing OFED-internal are missing.
Attempting to install the following missing packages:
quilt debhelper bzip2 dh-autoreconf gcc make pkg-config build-essential
This program will install the OFED package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with OFED, do not reinstall them.
Installing new packages
Building DEB for ofed-scripts-4.9 (ofed-scripts)...
Running /usr/bin/dpkg-buildpackage -us -uc
Building DEB for mlnx-ofed-kernel-utils-4.9 (mlnx-ofed-kernel)...
-W- --with-mlx5-ipsec is enabled
Running /usr/bin/dpkg-buildpackage -us -uc
Failed to build mlnx-ofed-kernel DEB
Collecting debug info...
See /tmp/MLNX_OFED_LINUX-4.9-7.1.0.0-5.15.0-134-generic/mlnx_iso.286778_logs/OFED.287032.logs/mlnx-ofed-kernel.debbuild.log
```
`mlnx-ofed-kernel.debbuild.log` contains:
```
Error: CONFIG_MLX5_ESWITCH not support kernel version 5.6 or higher (current: 5.15.0-134-generic).
```
So the only option is to downgrade the kernel and then install the driver.
:::
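The build log rejects kernels at version 5.6 or higher, so the mismatch can be detected before running the installer. This is a small sketch of such a check using `sort -V` for version ordering; `kernel_at_least` is a hypothetical helper, and the 5.6 threshold is taken from the error message above rather than from any official tool.

```bash
# kernel_at_least VERSION MIN: succeed when VERSION >= MIN.
kernel_at_least() {
    local version="${1%%-*}" min="$2"   # "5.15.0-134-generic" -> "5.15.0"
    # sort -V orders version strings numerically; if MIN comes out first
    # (or both are equal), VERSION is at least MIN.
    [ "$(printf '%s\n%s\n' "$min" "$version" | sort -V | head -n 1)" = "$min" ]
}

if kernel_at_least "$(uname -r)" 5.6; then
    echo "running kernel is 5.6 or newer: the MLNX_OFED 4.9 kernel modules will not build"
fi
```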
Related articles:
- [ConnectX-3 on Ubuntu 20.04 - Infrastructure & Networking / Adapters and Cables - NVIDIA Developer Forums](https://forums.developer.nvidia.com/t/connectx-3-on-ubuntu-20-04/206201)
- [Mellanox网卡驱动固件升级案例](https://shenzhoukuntai.com/post/detail/208/4433)
- [ubuntu20.04上安装驱动 | 电脑硬件学习笔记](https://skyao.io/learning-computer-hardware/nic/hp544/driver/ubuntu/ubuntu20.04/)
## Downgrading the Kernel
In the MLNX_OFED 4.9 [support list](https://docs.nvidia.com/networking/display/mlnxofedv497100lts/general+support+in+mlnx_ofed#GeneralSupportinMLNX_OFED-MLNX_OFEDSupportedOperatingSystems), Ubuntu 20.04 is supported only with the 5.4 kernel, so the next step is to downgrade to that version.
### Install the 5.4 Kernel
Confirm the current system information:
```bash
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
$ uname -sr
Linux 5.15.0-134-generic
```
List the installable 5.4 kernels:
```bash
apt-cache search linux-image-5.4.0 | grep generic
```
The newest version is `linux-image-5.4.0-208-generic`, so install the matching kernel image and headers:
```bash
sudo apt install linux-image-5.4.0-208-generic linux-headers-5.4.0-208-generic
```
Because the set of installed kernels changed, `/boot/grub/grub.cfg` is regenerated automatically.
### Configure the GRUB Menu
Check the GRUB menu entry names in `/boot/grub/grub.cfg`:
```bash
$ grep gnulinux /boot/grub/grub.cfg
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-134-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-134-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-134-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-134-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-67-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-67-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.15.0-67-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-67-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.4.0-208-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.0-208-generic-advanced-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
menuentry 'Ubuntu, with Linux 5.4.0-208-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.0-208-generic-recovery-eeb9a9be-7674-457c-b1e3-b9b42058035d' {
```
The boot path we want is `Advanced options for Ubuntu`, then `Ubuntu, with Linux 5.4.0-208-generic`.
To boot into that kernel automatically, edit `/etc/default/grub`: find `GRUB_DEFAULT=0`, comment it out, and add this line:
```conf
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.4.0-208-generic"
```
After editing the file, remember to regenerate the configuration:
```bash
sudo update-grub
```
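The `GRUB_DEFAULT` value is the submenu title and the entry title joined by `>`, and both must match `grub.cfg` exactly. As a sketch, the string can be derived from `grub.cfg` instead of being typed by hand; `grub_default_for` is a hypothetical helper, and it assumes the Ubuntu layout shown above, where titles are single-quoted and there is one submenu.

```bash
# grub_default_for CFG KERNEL: print "<submenu title>><entry title>" for the
# non-recovery entry of KERNEL, or fail if either title is not found.
grub_default_for() {
    local cfg="$1" kernel="$2"
    # With "'" as field separator, $2 is the quoted title on each line.
    awk -v k="$kernel" -F"'" '
        $1 ~ /^[[:space:]]*submenu / && submenu_title == "" { submenu_title = $2 }
        # The trailing "$" skips the "(recovery mode)" variants.
        $1 ~ /^[[:space:]]*menuentry / && $2 ~ ("Linux " k "$") && entry == "" { entry = $2 }
        END {
            if (submenu_title != "" && entry != "") print submenu_title ">" entry
            else exit 1
        }
    ' "$cfg"
}
```

For example, `grub_default_for /boot/grub/grub.cfg 5.4.0-208-generic` prints the value to put inside the quotes of `GRUB_DEFAULT="..."`.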
After rebooting, confirm that the running kernel is now 5.4:
```bash
$ uname -r
5.4.0-208-generic
```
## Installing the Driver Again
```bash
cd MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
# press y to confirm
```
The following error occurs:
```
DKMS: install completed.
Building initial module for 5.15.0-134-generic
ERROR (dkms apport): unable to determine source package for mlnx-ofed-kernel-dkms
Error! Bad return status for module build on kernel: 5.15.0-134-generic (x86_64)
```
Although the running kernel is now 5.4, DKMS still tries to build the module for the installed 5.15 kernel, and that build fails.
List all installed kernels:
```bash
dpkg --list | grep linux-image
```
Remove the kernels that are no longer needed:
```bash
sudo apt purge linux-image-5.15.0-134-generic linux-headers-5.15.0-134-generic linux-image-5.15.0-67-generic linux-headers-5.15.0-67-generic
sudo apt autoremove
```
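Rather than typing each package name, the purge list can be derived from the `dpkg --list` output. The sketch below only prints candidates and removes nothing; `kernels_to_purge` is a hypothetical helper, and it assumes the `linux-image-<version>` and `linux-headers-<version>` naming used in the command above.

```bash
# kernels_to_purge KEEP: read `dpkg --list` output on stdin and print the
# installed linux-image/linux-headers packages that do not belong to KEEP.
kernels_to_purge() {
    # Drop the flavour so the shared headers package (for example
    # linux-headers-5.4.0-208) is kept as well: 5.4.0-208-generic -> 5.4.0-208
    local keep="${1%-[a-z]*}"
    awk -v keep="$keep" '
        $1 == "ii" && $2 ~ /^linux-(image|headers)-[0-9]/ && index($2, keep) == 0 { print $2 }
    '
}
```

A possible use is `dpkg --list | kernels_to_purge 5.4.0-208-generic`, reviewing the printed names before passing them to `sudo apt purge`.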
Install the driver:
```bash
sudo ./mlnxofedinstall
```
Load the driver:
```bash
sudo /etc/init.d/openibd restart
```
Run `ibstat` to confirm the installation succeeded:
```bash
$ ibstat
CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.35.5100
Hardware version: 0
Node GUID: 0x480fcfffffec3e10
System image GUID: 0x480fcfffffec3e13
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x480fcfffffec3e11
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x480fcfffffec3e12
Link layer: InfiniBand
```
This adapter has two ports; only port 1 is cabled.
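When scripting later steps, the active ports can be pulled out of the `ibstat` output rather than read by eye. This is a rough sketch that assumes the output layout shown above; `active_ib_ports` is a hypothetical helper name.

```bash
# active_ib_ports: read `ibstat` output on stdin and print "<CA name> <port>"
# for every port whose State is Active.
active_ib_ports() {
    awk -v q="'" '
        # "CA <name>" has two fields; "CA type: ..." has three and is skipped.
        $1 == "CA" && NF == 2            { ca = $2; gsub(q, "", ca) }
        # "Port 1:" starts a port section; "Port GUID: ..." does not match.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        # "Physical state: ..." starts with "Physical", so only State matches.
        $1 == "State:" && $2 == "Active" { print ca, port }
    '
}
```

Piping the output above through it, as in `ibstat | active_ib_ports`, would print `mlx4_0 1`.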
## Testing the Connection
### Start OpenSM
First confirm that OpenSM runs correctly:
```bash
sudo opensm
```
Once `Entering MASTER state` appears, press <kbd>Ctrl</kbd> + <kbd>C</kbd> to stop it, then run OpenSM as a daemon:
```bash
sudo /etc/init.d/opensmd start
```
### Ping Test
Use `ibstat` to find the adapter name (`mlx4_0`) and the LID (`1`).
n1 (server):
```bash
sudo ibping -S -C mlx4_0 -P 1
```
- `-C`: CA (channel adapter) name
- `-P`: CA port number to use
n2 (client), pinging LID 1:
`ibping [options] <dest lid|guid>`
```bash
$ sudo ibping 1
Pong from hpc-ib-n1.(none) (Lid 1): time 0.552 ms
Pong from hpc-ib-n1.(none) (Lid 1): time 0.333 ms
Pong from hpc-ib-n1.(none) (Lid 1): time 0.405 ms
^C
--- hpc-ib-n1.(none) (Lid 1) ibping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2442 ms
rtt min/avg/max = 0.333/0.430/0.552 ms
```
### Assign IP Addresses
Load the IPoIB module:
```bash
sudo modprobe ib_ipoib
```
List the network interface that corresponds to each port:
```bash
$ ibdev2netdev
mlx4_0 port 1 ==> ibp6s16 (Up)
mlx4_0 port 2 ==> ibp6s16d1 (Down)
```
Set the IP address on `n1`:
```bash
sudo ip addr add 10.0.10.3/17 dev ibp6s16
sudo ip link set ibp6s16 up
```
Set the IP address on `n2`:
```bash
sudo ip addr add 10.0.10.4/17 dev ibp6s16
sudo ip link set ibp6s16 up
```
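The interface name has to be the same in both `ip` commands, and it is easy to mistype. One way to avoid that is to derive it from `ibdev2netdev`, as in this sketch. `ipoib_cmds` is a hypothetical helper that only prints the commands, and it assumes the first port reported as `(Up)` is the one to configure.

```bash
# ipoib_cmds CIDR: read `ibdev2netdev` output on stdin and print the two
# `ip` commands that assign CIDR to the first interface whose port is Up.
ipoib_cmds() {
    local cidr="$1" iface
    # Lines look like: "mlx4_0 port 1 ==> ibp6s16 (Up)"
    iface=$(awk '$NF == "(Up)" { print $(NF - 1); exit }')
    [ -n "$iface" ] || { echo "no port is Up" >&2; return 1; }
    printf 'ip addr add %s dev %s\n' "$cidr" "$iface"
    printf 'ip link set %s up\n' "$iface"
}
```

Running `ibdev2netdev | ipoib_cmds 10.0.10.3/17` prints the commands for review; they can then be run with `sudo`.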
### Measuring Bandwidth with `ib_read_bw`
`ib_write_bw` can be used the same way.
n1 (server):
```bash
$ ib_read_bw -a
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x0261 PSN 0x35ca0f OUT 0x10 RKey 0xc8010100 VAddr 0x007fcaa5401000
remote address: LID 0x02 QPN 0x0262 PSN 0x558979 OUT 0x10 RKey 0xd0010100 VAddr 0x007f8da1a59000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
8388608 1000 39.27 39.27 0.000585
---------------------------------------------------------------------------------------
```
- `-a`: run the test for every message size, from 2 bytes up to 2^23 bytes
n2 (client):
```bash
$ ib_read_bw -a --report_gbits 10.0.10.3
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0260 PSN 0xb88a4e OUT 0x10 RKey 0xc0010100 VAddr 0x007fc2ba892000
remote address: LID 0x01 QPN 0x025f PSN 0x99d360 OUT 0x10 RKey 0xb8010100 VAddr 0x007f4773345000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
2 1000 0.077704 0.062133 3.883315
4 1000 0.15 0.14 4.438859
8 1000 0.32 0.30 4.720743
16 1000 0.80 0.69 5.379386
32 1000 1.28 1.11 4.343282
64 1000 2.38 2.25 4.392266
128 1000 5.08 4.82 4.703802
256 1000 10.02 9.31 4.543610
512 1000 19.89 17.18 4.193167
1024 1000 38.20 36.78 4.489978
2048 1000 38.84 37.77 2.305010
4096 1000 38.98 38.41 1.172041
8192 1000 39.08 39.01 0.595204
16384 1000 38.67 38.58 0.294341
32768 1000 39.30 39.30 0.149922
65536 1000 39.28 39.27 0.074896
131072 1000 39.07 39.07 0.037255
262144 1000 39.31 39.25 0.018716
524288 1000 39.22 39.16 0.009336
1048576 1000 39.28 39.24 0.004677
2097152 1000 39.24 39.17 0.002335
4194304 1000 39.20 39.20 0.001168
8388608 1000 39.27 39.27 0.000585
---------------------------------------------------------------------------------------
```
==39.27 Gbps==
### Measuring Throughput with iperf
Install iperf3:
```bash
sudo apt install iperf3 -y
```
---
Run with default parameters:
n1 (server):
```bash
iperf3 -s
```
n2 (client):
```bash
$ iperf3 -c 10.0.10.3
Connecting to host 10.0.10.3, port 5201
[ 5] local 10.0.10.4 port 50144 connected to 10.0.10.3 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 798 MBytes 6.70 Gbits/sec 2778 282 KBytes
[ 5] 1.00-2.00 sec 739 MBytes 6.20 Gbits/sec 3025 428 KBytes
[ 5] 2.00-3.00 sec 775 MBytes 6.50 Gbits/sec 3687 288 KBytes
[ 5] 3.00-4.00 sec 798 MBytes 6.70 Gbits/sec 3302 224 KBytes
[ 5] 4.00-5.00 sec 863 MBytes 7.24 Gbits/sec 3521 255 KBytes
[ 5] 5.00-6.00 sec 861 MBytes 7.22 Gbits/sec 3335 265 KBytes
[ 5] 6.00-7.00 sec 827 MBytes 6.94 Gbits/sec 3989 259 KBytes
[ 5] 7.00-8.00 sec 720 MBytes 6.04 Gbits/sec 3312 276 KBytes
[ 5] 8.00-9.00 sec 797 MBytes 6.68 Gbits/sec 3006 195 KBytes
[ 5] 9.00-10.00 sec 864 MBytes 7.25 Gbits/sec 3909 274 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 7.85 GBytes 6.75 Gbits/sec 33864 sender
[ 5] 0.00-10.00 sec 7.85 GBytes 6.74 Gbits/sec receiver
iperf Done.
```
==6.74 Gbps==
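To put the two measurements side by side, the figures can be expressed as percentages. The sketch below treats the `Rate: 40` reported by `ibstat` as the reference for `ib_read_bw`, which is an assumption about what counts as line rate here; `percent_of` is a hypothetical helper.

```bash
# percent_of PART WHOLE: print PART as a percentage of WHOLE, one decimal.
percent_of() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

percent_of 39.27 40      # ib_read_bw average vs. the rate from ibstat -> 98.2
percent_of 6.74 39.27    # iperf3 over IPoIB vs. ib_read_bw            -> 17.2
```

So RDMA reads reach about 98% of the reported rate, while TCP over IPoIB with default settings reaches about 17% of the RDMA figure.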
### Switching IPoIB to Connected Mode
https://docs.nvidia.com/networking/display/mlnxofedv543750lts/ip+over+infiniband+(ipoib)
```bash
sudo ip link set ibp6s16 down
echo "connected" | sudo tee /sys/class/net/ibp6s16/mode
sudo ip link set ibp6s16 up
```
The mode was originally `datagram`.
Rerunning iperf3 with default parameters now reaches ==9.45 Gbps==.
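Before rerunning a benchmark it is worth confirming that the mode change actually took effect, since the value is read back from the same sysfs file that was written. A minimal sketch follows; `require_ipoib_mode` is a hypothetical helper, and its optional third argument (the sysfs root) exists only so the function can be tried against a scratch directory.

```bash
# require_ipoib_mode IFACE WANT [SYSFS_ROOT]: succeed only when the IPoIB mode
# of IFACE equals WANT ("connected" or "datagram").
require_ipoib_mode() {
    local iface="$1" want="$2" sysfs="${3:-/sys/class/net}" got
    got=$(cat "$sysfs/$iface/mode" 2>/dev/null) || {
        echo "cannot read mode of $iface" >&2
        return 2
    }
    if [ "$got" != "$want" ]; then
        echo "$iface is in $got mode, expected $want" >&2
        return 1
    fi
}
```

For example, `require_ipoib_mode ibp6s16 connected && iperf3 -c 10.0.10.3` runs the benchmark only when the interface is in connected mode.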
## NFS Over RDMA
https://skyao.io/learning-ubuntu-server/command/network/nfs/nfsordma/
## Open MPI
### Install UCX
```bash
cd ~
wget https://github.com/openucx/ucx/releases/download/v1.18.0/ucx-1.18.0.tar.gz
tar xf ucx-1.18.0.tar.gz
cd ucx-1.18.0
mkdir build
cd build
../contrib/configure-release --prefix=/opt/ucx-1.18.0
# && (not &) so that install only runs after the build succeeds
make -j"$(nproc)" && sudo make install
```
### Install Open MPI
Install the required packages:
```bash
sudo apt install libnuma-dev libudev-dev zlib1g-dev
```
```bash
cd ~
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.gz
tar xf openmpi-5.0.7.tar.gz
cd openmpi-5.0.7
./configure --prefix=/opt/openmpi --with-ucx=/opt/ucx-1.18.0
# && (not &) so that install only runs after the build succeeds
make -j"$(nproc)" && sudo make install
```
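Since both packages are installed under `/opt`, the shell will not find `mpirun` or the libraries until the search paths are extended. The following is a sketch of lines that could go into `~/.bashrc`, assuming the prefixes used above and the default `bin` and `lib` subdirectories. `prepend_once` is a hypothetical helper that avoids duplicate entries when the file is sourced more than once.

```bash
# prepend_once DIR LIST: print LIST with DIR in front, unless DIR is already
# an element of the colon-separated LIST.
prepend_once() {
    local dir="$1" list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *)          printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

export PATH="$(prepend_once /opt/openmpi/bin "$PATH")"
export LD_LIBRARY_PATH="$(prepend_once /opt/openmpi/lib "${LD_LIBRARY_PATH:-}")"
export LD_LIBRARY_PATH="$(prepend_once /opt/ucx-1.18.0/lib "$LD_LIBRARY_PATH")"
```

After opening a new shell, `ompi_info | grep -i ucx` is one way to check whether the build picked up UCX.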
## References
- [Ubuntu 切换指定版本的内核 – 陈少文的网站](https://www.chenshaowen.com/blog/set-specific-kernel-version-in-ubuntu/)
- [首頁 | Grub 探索筆記](https://samwhelp.github.io/note-about-grub/)
- [如何設定 GRUB 預設的開機項目? | MagicLen](https://magiclen.org/grub-default/)
- <https://docs.redhat.com/zh-cn/documentation/red_hat_enterprise_linux/8/html-single/configuring_infiniband_and_rdma_networks/index#configuring-an-ipoib-connection-using-nmcli-commands_configuring-ipoib>
- <https://blog.csdn.net/msdnchina/article/details/71133494>