20250412_mpi

[Does doctest or any other test framework support a feature where, if one TEST_CASE fails, it triggers other test cases that are user-specified sub-components of that TEST_CASE?](https://chatgpt.com/share/67f81023-5108-800b-b859-d4a1dee83a9e)

0: Today: finish deploying openmx and run it with slurm, 1.

1: The MPI modulefile in the actual production environment

```
$ ml show intel-mpi
-----------------------------------------------------------------------------------------------------------------------------
/opt/modulefiles/production/intel-mpi/2021.10.0.lua:
-----------------------------------------------------------------------------------------------------------------------------
whatis("Sets up the Intel(R) MPI Library environment")
pushenv("I_MPI_ROOT","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0")
prepend_path("CLASSPATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/lib/mpi.jar")
prepend_path("PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/bin")
prepend_path("LD_LIBRARY_PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/lib")
prepend_path("LD_LIBRARY_PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/lib/release")
prepend_path("MANPATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/man")
pushenv("FI_PROVIDER_PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/libfabric/lib/prov:/usr/lib64/libfabric")
prepend_path("PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/libfabric/bin")
prepend_path("LD_LIBRARY_PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/libfabric/lib")
prepend_path("LIBRARY_PATH","/opt/packages/oneapi/v2023.2.0/mpi/2021.10.0/libfabric/lib")
```

```
ml intel-compiler
ml intel-mpi
ml intel-mkl
ml intel-itac
```

### 1: Error: software built with the Intel compiler does not support the AVX2 instruction set on AMD CPUs.

```
ml intel-itac/2021.10.0
ml openmpi/4.0.5-nvhpc22.9
ml nvhpc/22.9
export OMPI_CC=nvc
export OMPI_FC=nvfortran
```

[HPC compiler reference](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/index.html)

```
-tp=px|bulldozer|piledriver|zen|zen2|zen3|sandybridge|haswell|skylake|host|native
Select target processor
px           Generic Linux/Windows x86_64 Processor.
bulldozer    AMD Bulldozer processor
piledriver   AMD Piledriver processor
zen          AMD Zen architecture (Epyc, Ryzen)
zen2         AMD Zen 2 architecture (Ryzen 2)
zen3         AMD Zen 3 architecture (Ryzen 3)
sandybridge  Intel SandyBridge processor
haswell      Intel Haswell processor
skylake      Intel Skylake Xeon processor
host         Link native version of HPC SDK cpu math library
native       Alias for -tp host
```

```
The AMD EPYC 7742 is a server/workstation processor with 64 cores, launched in August 2019. It is part of the EPYC lineup, using the Zen 2 (Rome) architecture with Socket SP3. Thanks to AMD Simultaneous Multithreading (SMT) the core-count is effectively doubled, to 128 threads.
```

Time to abandon the Intel tool chain and switch to nvhpc.

### 1:openblas installation

```
wget https://github.com/OpenMathLib/OpenBLAS/releases/download/v0.3.29/OpenBLAS-0.3.29.tar.gz
tar xzf OpenBLAS-0.3.29.tar.gz
cd OpenBLAS-0.3.29
export CC=nvc
export FC=nvfortran
export USE_OPENMP=1
make -j16 TARGET=ZEN2 USE_THREAD=1 USE_OPENMP=1 BINARY=64
make install -j4 PREFIX=/jet/home/yhuang28/opt/nvhpc-22.9/openblas/0.3.29
```

`Makefile.rule` contains a warning about PGI (nvhpc), so the current plan is to drop it and use AOCL instead.

### amd tool chain

```
ml load aocc/2.3.0
ml load aocl/3.1.0
ml load openmpi/5.0.3-aocc2.3.0
```

[clang complains about multiple symbols](https://stackoverflow.com/questions/5669328/force-ignore-duplicate-symbols)

Both turned out to be too old: both are already at `5.0.0`, so I will install the new versions. Both packages can be downloaded directly, with no compilation needed.

[Tutorial on building openmpi with aocc](https://www.amd.com/zh-tw/developer/zen-software-studio/applications/spack/hpc-applications-openmpi.html)

1: `lspci`, an interesting executable worth exploring

0: Tomorrow: get openmpi installed.

### openmpi

openmpi: use f

xpmem,knem,ucx : only ucx needs to be installed, since ucx includes the other two; it can only be built with gcc. The system has `ucx/1.11.2`

mxm,hcoll : obtained from Mellanox

### knem

```
CC=gcc ./configure --prefix=/jet/home/yhuang28/opt/gcc-10.2.0/knem/1.1.4
```

1: `knem` is too old; some APIs have changed in newer Linux kernels, so the build fails.

```
/jet/home/yhuang28/repo/knem-knem-1.1.4/driver/linux/knem_hal.h:211:46: error: 'struct task_struct' has no member named 'cpus_allowed'; did you mean 'nr_cpus_allowed'?
  211 | #define knem_get_cpus_allowed(task) ((task)->cpus_allowed)
```

1: Learned from the AI that I can use just the xpmem built into ucx

### openmpi

ml gcc/10.2.0
export CC=$(which gcc)
export CXX=$(which g++)
./contrib/configure-release --prefix=/jet/home/yhuang28/opt/gcc-10.2.0/ucx/1.18.0 --enable-mt --enable-cma --with-xpmem

1: configure output

```
configure: UCX build configuration:
configure: Build prefix: /jet/home/yhuang28/opt/gcc-10.2.0/ucx/1.18.0
configure: Configuration dir: ${prefix}/etc/ucx
configure: Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure: C compiler: /jet/packages/spack/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-tfzxq7udz2a53dmujvasy4uz33t27iwv/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler: /jet/packages/spack/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-tfzxq7udz2a53dmujvasy4uz33t27iwv/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: ASAN check: no
configure: Multi-thread: enabled
configure: MPI tests: disabled
configure: VFS support: no
configure: Devel headers: no
configure: io_demo CUDA support: no
configure: Bindings: < java >
configure: UCS modules: < >
configure: UCT modules: < ib rdmacm cma >
configure: CUDA modules: < >
configure: ROCM modules: < >
configure: IB modules: < mlx5 >
configure: UCM modules: < >
configure: Perf modules: < mad >
```

1: Showed it to the AI.

What you see in the UCX `configure` output:

```
UCT modules: < ib rdmacm cma >
IB modules: < mlx5 >
```

These fields list, for UCX (Unified Communication
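X), the **modular transport layer (UCT) backends and the underlying hardware modules (IB)**, which are enabled depending on what the system supports.

---

### ✅ What each module means:

#### `UCT modules:` (transport modules)

These are the communication channels UCX supports at the UCT layer. Common ones:

| Module | Description |
|----------|------|
| `ib` | Uses the InfiniBand Verbs API (low-latency RDMA transport) |
| `rdmacm` | Uses the RDMA Connection Manager (handles connection management on top of `ib`) |
| `cma` | Linux Cross Memory Attach, used for direct memory transfer between local processes (zero-copy IPC) |
| `tcp` | Uses ordinary TCP sockets (fallback; high latency but usable for testing) |
| `posix` | Uses POSIX shared memory (shared between multiple processes on the same machine) |
| `self` | UCX transport within a single process, similar to loopback |

---

#### `IB modules:` (hardware-specific support modules)

These are support modules specific to **InfiniBand hardware vendors and drivers**. Common options:

| Module | Description |
|----------|------|
| `mlx5` | Supports newer Mellanox NICs such as ConnectX-5/6/7 (verbs, tag matching, GPU direct, etc.) |
| `mlnx` | Module for the older Mellanox Verbs driver |
| `hfi1` | Driver support for the Intel Omni-Path architecture |
| `roce` | Supports RoCE (RDMA over Converged Ethernet) |
| `dc` | Dynamic Connection, an efficient Mellanox RDMA mode |
| `xpmem` | Zero-copy shared-memory mechanism, usable as an alternative to KNEM (fast IPC within a node) |

Your system currently has `mlx5` enabled, which means UCX successfully found a driver environment that **supports Mellanox hardware**. This is the ideal situation.

---

### 🔍 How do you list all available modules?

In the UCX source directory you can run:

```bash
./contrib/configure-release --help
```

or, after `configure` has finished, check:

```bash
ucx_info -d # show all available network interfaces and modules
ucx_info -v # show the version and supported modules
```

---

### 🎯 Summary

- `UCT modules:` shows the **transports UCX supports** (IB, TCP, POSIX, CMA, etc.).
- `IB modules:` shows the **physical NIC support modules** (Mellanox mlx5, Intel hfi1, etc.).
- The presence of the `mlx5` module means your UCX can drive a Mellanox HCA correctly.
- If you need optimized local (non-RDMA) communication, you can also consider adding the `xpmem`, `posix`, and `cma` modules.

---

If you need a list of all module names and the conditions that enable them, I can put together an overview of the `UCT` / `UCM` / `IB` modules. Would you like that?

1: Follow-up: two executables that provide information

```
ucx_info -d # show all available network interfaces and modules
ucx_info -v # show the version and supported modules
```

### openmpi

ml aocc/5.0.0
export CC=clang
export CXX=clang++
export FC=flang
./configure --prefix=/jet/home/yhuang28/opt/aocc-5.0.0/openmpi/4.1.6 \
--with-ucx=$PREFIX \
--with-pmix \
--with-hwloc \
--enable-mpirun-prefix-by-default \
--enable-mca-no-build=btl-uct \
--enable-mpi-cxx \
--enable-shared \
--enable-mpi-fortran

1: Some interesting things happened, which can be summed up as two questions.

Q1: Officially speaking, is flang actually clang?

That is a very good question, and at the current stage of Fortran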
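compiler development, **the name "flang" really is somewhat confusing**. Let me lay out the overall context and the current state.

---

## ✅ Official definition: what is Flang?

### 1. **Old Flang (Classic Flang)**

- Origin: a Fortran front end first open-sourced by PGI (NVIDIA) and later contributed to LLVM.
- Architecture: based on LLVM, but with its own complete Fortran frontend.
- Status: this version has been deprecated; **LLVM upstream no longer develops it**.

---

### 2. **New Flang (LLVM Flang), the official version**

- Origin: redeveloped by the LLVM team starting in 2019.
- Architecture:
  - Frontend: implements a real Fortran parser and semantic analysis (named `f18`, after the Fortran 2018 standard)
  - Backend: fully uses LLVM's IR and codegen system
- Command name: originally `flang-new`; **officially named `flang` since LLVM 16**.

🔧 **In other words, the `flang` that really supports Fortran is LLVM's official implementation, and it needs LLVM 15+**.

---

## ❌ What is the flang that AOCC provides?

### ➤ The `flang` in AOCC is actually:

```bash
/opt/aocc/bin/flang -> clang
```

- It simply names `clang` as `flang`, so it does not support Fortran!
- This `flang` cannot actually handle `.f90`; even given `.f90` syntax, it will not do correct Fortran semantic analysis and translation.

👉 So the `flang` in AOCC **is only a name, not a compiler that can build Fortran**!

---

## ✅ What should you do to use a real Fortran compiler?

| Need | Recommendation |
|------|-----------|
| General-purpose Fortran compiler | `gfortran` (from GCC) ✅ most stable, most compatible |
| Want the LLVM architecture | Install the real LLVM Flang (LLVM 15+) |
| Mixing within an AOCC environment | Use `AOCC clang` for C/C++ and `gfortran` for Fortran |

---

## 🧠 Extra: how do you get the real LLVM Flang?

```bash
spack install llvm +flang
# or
git clone https://github.com/llvm/llvm-project
cd llvm-project
mkdir build && cd build
# flang builds on top of clang and mlir, so those projects are enabled too
cmake ../llvm -DLLVM_ENABLE_PROJECTS="clang;mlir;flang" -G Ninja -DCMAKE_BUILD_TYPE=Release
ninja
```

Afterwards you will have:

- `flang` (the real Fortran frontend)
- `flang-new` (some versions still use this name)

---

### 🔚 Summary

| Question | Answer |
|------|------|
| Is the `flang` in AOCC the real flang? | ❌ No, it is only a symlink to `clang` and does not support Fortran |
| Who develops the real flang? | ✅ LLVM upstream; it has matured since LLVM 15 |
| What is recommended for writing Fortran programs? | ✅ `gfortran` or the real `flang` (LLVM 15+) |

If you like, I can write a modulefile or shell script that switches to `gfortran` as the Fortran compiler inside the AOCC environment, keeping AOCC for C/C++ and GNU for Fortran. Do you need it?

Q2: Wait, I can use one compiler for c/c++ and another for fortran? Is this a general approach, or do only a few compilers combine this way?

That is a great question, and it touches a core topic of high-performance computing (HPC) and mixed-language builds: **can you use one compiler for C/C++ and another for Fortran? Is it common?** The answer is:

---

## ✅ Yes, this is common practice (especially in HPC environments)

In practice, **many supercomputing centers (such as NERSC, TACC, CSC, JSC) do this as well**, especially in the following situations:

1. **A compiler's Fortran support is weak (for example AOCC's flang)**
2.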
   **Some compilers' Fortran performance or compatibility is worse than gfortran's**
3. **Mixing MPI / OpenMP + Fortran + C/C++, where the most stable combination is needed**

---

## 🛠️ How is it done?

When you compile a mixed-language program (such as C + Fortran), the **link stage** is the key.

- **Prerequisite: the ABI (Application Binary Interface) must be compatible**
- In particular: **name mangling, calling convention, and Fortran runtime library compatibility**

Fortunately:

| Combination | Common? | Feasible? |
|------|----------|----------|
| **GCC + GFortran** | ✅ Very common | ✅ Perfect |
| **AOCC Clang + GFortran** | ✅ Common in practice | ✅ Feasible (take care with the runtime lib) |
| **Intel ICC + GFortran** | ✅ Feasible but needs `-lgfortran` | ✅ Feasible |
| **Clang + Flang (LLVM)** | ⚠️ New technology, still stabilizing | ⚠️ Partly feasible |
| **NVHPC C + GFortran** | ⚠️ Not advised | ❌ Usually not recommended |

---

## 🧪 A practical case: how to mix them?

You can do it like this (as an example):

### `main.cpp`

```cpp
extern "C" void my_fortran_func_();

int main() {
    my_fortran_func_();
    return 0;
}
```

### `func.f90`

```fortran
subroutine my_fortran_func()
    print *, "Hello from Fortran"
end subroutine
```

### Build commands:

```bash
gfortran -c func.f90
clang++ main.cpp func.o -lgfortran -o mix_app
```

or with AOCC:

```bash
/opt/aocc/bin/clang++ main.cpp func.o -lgfortran -o mix_app
```

---

## ❗ Things to watch

To make sure C++ and Fortran can work together:

1. **Use `extern "C"` to avoid C++ name mangling**
2. **Mind the trailing `_` on Fortran function names (gfortran adds it by default)**
3.
   **Add `-lgfortran` to link the GFortran runtime**

---

## ✅ Summary: can you mix compilers?

| Question | Answer |
|------|------|
| AOCC / Clang for C/C++, GFortran for Fortran? | ✅ Yes, and it is common in practice |
| Why do it this way? | AOCC has no real flang, while gfortran is powerful and stable |
| What needs care? | Make sure the ABI is compatible and add the Fortran runtime (`-lgfortran`) |
| Which combinations are recommended? | `AOCC + GFortran`, `Clang + GFortran`, and `Intel + GFortran` are all feasible |

---

If you want to use this mixed approach to build OpenMPI, OpenMX, or other large packages, I can also help you design CMake flags or an autotools configure script! Would you like me to put together a general config?

```
ml aocc/5.0.0
ml gcc/10.2.0
export CC=clang
export CXX=clang++
export FC=gfortran
./configure --prefix=/jet/home/yhuang28/opt/aocc-5.0.0/openmpi/4.1.6 \
    --with-ucx=$PREFIX \
    --with-pmix \
    --with-hwloc \
    --enable-mpirun-prefix-by-default \
    --enable-mca-no-build=btl-uct \
    --enable-mpi-cxx \
    --enable-shared \
    --enable-mpi-fortran
```

```
ml aocc/5.0.0
ml gcc/10.2.0
export CC=clang
export CXX=clang++
export FC=gfortran
./configure --prefix=/jet/home/yhuang28/opt/aocc-5.0.0/openmpi/4.1.6 \
    --with-ucx=$PREFIX \
    --with-pmix \
    --with-hwloc \
    --enable-mpirun-prefix-by-default \
    --enable-mca-no-build=btl-uct \
    --enable-mpi-cxx \
    --enable-shared \
    --enable-mpi-fortran \
    --disable-mpi-fortran-optional
```

1: This is the output of the openmpi configure

```
Open MPI configuration:
-----------------------
Version: 4.1.6
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
HWLOC support: external
Libevent support: internal
Open UCC: no
PMIx support: Internal

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no
```

1: These are the ones I care about.

```
Mellanox MXM: no
Shared memory/Linux CMA: yes
Shared memory/XPMEM: no
```

1: MXM appears to be legacy and can be left out. XPMEM is not included in ucx and has to be installed separately, so skip it for now.

1: What are PIC (Position-independent code) and non-PIC? [Explanation on SO](https://stackoverflow.com/questions/54351586/how-to-understand-the-difference-between-pic-and-no-pic)

1: What is the `.lo` suffix?

### openmx

1: [openmx error of Spherical_Bessel2](https://www.openmx-square.org/forum/patio.cgi?mode=view&no=3153), I ran into this error; the fix is basically to define a macro so that the failing region is skipped.

1: Another error is ` 67 | use elpa2_compute_complex`; re-running make resolves it

```
mpirun -n 2 openmx ./Methane.dat
```

```
WARNING: There was an error initializing an OpenFabrics device.
[1744451323.031216] [r428:28903:0] ucp_context.c:2190 UCX WARN UCP API version is incompatible: required >= 1.18, actual 1.17.0 (loaded from /usr/lib64/libucp.so.0)
[r214.ib.bridges2.psc.edu:101324] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[r214.ib.bridges2.psc.edu:101324] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The calculation was terminated due to the illegal SCF calculation.
```

1: Status so far:

1. openmx has a problem; next I will test a hello_mpi.c file to make sure openmpi itself is fine.

1: load ml

```
aocc/5.0.0
aocl/5.0
openmpi/4.1.6
ucx/1.18.0
gcc/10.2.0
```

1: [Mou's MPI tutorial](https://hackmd.io/@William-Mou/HJWRSNROS)
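
The UCX warning in the `mpirun` output above reports that libucp 1.17.0 was loaded from `/usr/lib64` while 1.18 was required, which suggests the self-built UCX 1.18.0 was not the first one found at run time. Below is a minimal sketch for checking that, assuming the library is meant to be picked up through `LD_LIBRARY_PATH`. The helper name `first_dir_with` is hypothetical, and the sketch only mimics the `LD_LIBRARY_PATH` step of the dynamic loader: RPATH/RUNPATH and the ldconfig cache are ignored.

```bash
# Print the first directory in a colon-separated list that contains the
# given file name. Returns 1 (and prints nothing) if no directory has it.
# Hypothetical helper: it only approximates the LD_LIBRARY_PATH lookup.
first_dir_with() {
    local name="$1" dir
    local IFS=':'
    for dir in $2; do            # unquoted on purpose: split on ':'
        if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Which libucp.so.0 would LD_LIBRARY_PATH select, if any?
first_dir_with libucp.so.0 "${LD_LIBRARY_PATH:-}" \
    || echo "libucp.so.0 is not on LD_LIBRARY_PATH; the loader falls back to system paths"
```

If the directory printed is not the library directory under the UCX 1.18.0 prefix (assumed here to be its `lib` subdirectory), putting that directory earlier in `LD_LIBRARY_PATH` would be one thing to try. `ldd "$(which openmx)" | grep ucp` shows what the loader actually resolves.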