## Cudaq Problem 2

## Installation

### remote - Global setting

```bash
MF_ROOT="/opt/modulefiles"
OPT_ROOT="/opt"
REPO_ROOT="/home/team7admin/hlajungo/repo"
CORE_NUM=32
GENERAL_OPT_FLAG=" -O3 -march=native -mtune=native -ffast-math -fopenmp -fno-common "
```

### local - My scripts

```bash
rsync -av -e ssh /media/hlajungo/D/linux/script USER@IP:/home/hlajungo
```

### remote - My scripts

```bash
PATH=/home/hlajungo/script:$PATH
```

### remote - gcc-12.4.0

```bash
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/gmp-4.3.2.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://github.com/gcc-mirror/gcc/archive/refs/tags/releases/gcc-12.4.0.zip -o $REPO_ROOT
```

gmp-4.3.2

```bash
ml purge
cd $REPO_ROOT/gmp-4.3.2
mkdir build && cd build
../configure \
  --prefix=$OPT_ROOT/gmp/4.3.2 \
  --enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/gmp/4.3.2/ -o $MF_ROOT/gmp/4.3.2.tcl
```

mpfr-3.1.4

```bash
ml purge
ml gmp/4.3.2
cd $REPO_ROOT/mpfr-3.1.4
mkdir build && cd build
../configure \
  --prefix=$OPT_ROOT/mpfr/3.1.4 \
  --with-gmp=$GMP_ROOT \
  --enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/mpfr/3.1.4/ -o $MF_ROOT/mpfr/3.1.4.tcl
```

mpc-1.0.3

```bash
ml purge
ml gmp/4.3.2
ml mpfr/3.1.4
cd $REPO_ROOT/mpc-1.0.3
mkdir build && cd build
../configure \
  --prefix=$OPT_ROOT/mpc/1.0.3 \
  --with-gmp=$GMP_ROOT \
  --with-mpfr=$MPFR_ROOT \
  --enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/mpc/1.0.3/ -o $MF_ROOT/mpc/1.0.3.tcl
```

isl-0.18

```bash
ml purge
ml gmp/4.3.2
cd $REPO_ROOT/isl-0.18
mkdir build && cd build
../configure \
  --prefix=$OPT_ROOT/isl/0.18 \
  --with-gmp=$GMP_ROOT \
  --enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/isl/0.18/ -o $MF_ROOT/isl/0.18.tcl
```

gcc-12.4.0

```bash
ml purge
ml gmp/4.3.2
ml mpfr/3.1.4
ml mpc/1.0.3
ml isl/0.18
cd $REPO_ROOT/gcc-releases-gcc-12.4.0
mkdir build && cd build
# gcc's build rejects empty or relative entries in LIBRARY_PATH; strip them.
LIBRARY_PATH=$(echo "$LIBRARY_PATH" | tr ':' '\n' | grep -v -e '^$' -e '^\.$' -e '^\.\/' | paste -sd ':' -)
../configure \
  --prefix=$OPT_ROOT/gcc/12.4.0 \
  --with-gmp=$GMP_ROOT \
  --with-mpfr=$MPFR_ROOT \
  --with-mpc=$MPC_ROOT \
  --with-isl=$ISL_ROOT \
  --disable-werror \
  --enable-checking=release \
  --enable-languages=c,c++,fortran \
  --disable-multilib \
  --disable-bootstrap \
  --without-included-gettext \
  --enable-threads=posix \
  --enable-nls \
  --enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/gcc/12.4.0/ -o $MF_ROOT/gcc/12.4.0.tcl
```

### remote - cuda-12.1

```bash
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run -P $REPO_ROOT
ml purge
ml gcc/12.4.0 # Note: gcc-11 is too old and gcc-14.3.0 is too new for CUDA 12.1; use gcc-12.
bash cuda_12.1.1_530.30.02_linux.run --installpath=$OPT_ROOT/cuda/12.1.1
bash gen_mf_tcl_top.sh $OPT_ROOT/cuda/12.1.1 -o $MF_ROOT/cuda/12.1.1.tcl
```
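A quick sanity check (my own addition, not part of the original notes): the CUDA runfile validates the host compiler, so confirm the gcc module really is first on `PATH` and that `nvcc` landed where expected.

```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
gcc --version | head -n1      # expect: gcc (GCC) 12.4.0
which nvcc                    # expect: $OPT_ROOT/cuda/12.1.1/bin/nvcc
nvcc --version | grep release # expect: release 12.1
```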
### remote - ucx-1.18.1 prebuilt

```bash
wget https://github.com/openucx/ucx/releases/download/v1.18.1/ucx-1.18.1-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2 -P $REPO_ROOT
cd $REPO_ROOT
tar xf ucx-1.18.1-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
dpkg-deb -x ucx-1.18.1.deb $REPO_ROOT/ucx_tmp/
dpkg-deb -x ucx-cuda-1.18.1.deb $REPO_ROOT/ucx_tmp/
dpkg-deb -x ucx-xpmem-1.18.1.deb $REPO_ROOT/ucx_tmp/
mkdir -p $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/
mv $REPO_ROOT/ucx_tmp/usr/* $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/
bash gen_mf_tcl_top.sh $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/ -o $MF_ROOT/ucx/1.18.1_ofed5_cuda12.tcl
```

### remote - openmpi-5.0.8 with ucx with ib

<!--[Source - using openmpi + ucx](https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX)-->

```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml ucx/1.18.1_ofed5_cuda12
cd $REPO_ROOT/openmpi-5.0.8/
mkdir build && cd build
echo "../configure \
  --prefix=$OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1 \
  CFLAGS=\" $GENERAL_OPT_FLAG \" \
  CXXFLAGS=\" $GENERAL_OPT_FLAG \" \
  --enable-shared \
  --enable-static \
  --with-cuda=${CUDA_ROOT} \
  --with-ucx=${UCX_ROOT} \
  --with-slurm \
  --with-io-romio-flags=--with-file-system=nfs \
  --enable-mca-no-build=btl-uct \
  --enable-dlopen \
  --disable-mpi-cxx \
  --disable-cxx-exceptions \
  --disable-memchecker \
  --disable-java \
  --disable-mpi-java \
  --without-verbs \
  --without-mxm \
  --without-psm \
  --without-psm2"
# Run the output of this echo.

# Before running configure, delete these stale *.la files
# (see the libtool error below).
find $UCX_ROOT/lib -name "*.la" -delete

make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1 -o $MF_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1.tcl

# Note: Open MPI bakes paths into the installation, so it cannot be moved
# with mv afterwards. To relocate it, change the prefix and reinstall.
```

<!--
During make, we hit the error below; deleting the "*.la" files avoids it.
```
libtool: warning: library '/media/hlajungo/D/linux/opt/ucx/1.18.1_ofed5_cuda12/lib/libucp.la' was moved.
/usr/bin/grep: /usr/lib/libuct.la: No such file or directory
/usr/bin/sed: can't read /usr/lib/libuct.la: No such file or directory
libtool: error: '/usr/lib/libuct.la' is not a valid libtool archive
```
-->
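To confirm UCX support actually made it into this build (my own check, not in the original notes), `ompi_info` lists the compiled-in components and `ucx_info` reports the library version:

```bash
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1
ompi_info | grep -i ucx   # expect the "ucx" pml component to appear
ucx_info -v               # expect UCX version 1.18.1
```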
### remote - openmpi-5.0.8 no ib no ucx (fallback)

```bash
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.8.tar.gz -P $REPO_ROOT
cd $REPO_ROOT
tar xf openmpi-5.0.8.tar.gz
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
cd $REPO_ROOT/openmpi-5.0.8/
mkdir build && cd build
echo "../configure --prefix=/opt/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 \
  CFLAGS=\" $GENERAL_OPT_FLAG \" \
  CXXFLAGS=\" $GENERAL_OPT_FLAG \" \
  --with-cuda=${CUDA_ROOT} \
  --enable-mpirun-prefix-by-default"
# Run the output of this echo.
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 -o $MF_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1.tcl
```

### local - cudaq-0.11.0

```bash
rsync -av -e ssh /media/hlajungo/D/linux/repo_my/hello_cudaq/install_cuda_quantum_cu12.x86_64 \
  USER@IP:/home/hlajungo
```

### remote - cudaq-0.11.0 install

```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1
cd $REPO_ROOT
MPI_PATH=${OPENMPI_ROOT} \
  sudo -E bash install_cuda_quantum*.$(uname -m) --accept --target ./out --noexec
```

```bash
vim out/build_config.xml
```

Replace the whole file with:

```xml
<build_config>
<CUDAQ_INSTALL_PREFIX>/opt/cudaq/0.11.0/cudaq</CUDAQ_INSTALL_PREFIX>
<LLVM_INSTALL_PREFIX>/opt/cudaq/0.11.0/llvm</LLVM_INSTALL_PREFIX>
<CUQUANTUM_INSTALL_PREFIX>/opt/cudaq/0.11.0/cuquantum</CUQUANTUM_INSTALL_PREFIX>
<CUTENSOR_INSTALL_PREFIX>/opt/cudaq/0.11.0/cutensor</CUTENSOR_INSTALL_PREFIX>
</build_config>
```

```bash
./out/install.sh -s ./out -c ./out/build_config.xml -t /opt/cudaq/0.11.0/cudaq
bash gen_mf_tcl_all.sh /opt/cudaq/0.11.0 -o $MF_ROOT/cudaq/0.11.0.tcl
```

```bash
vim [modulefile path printed by the previous script]
```

The auto-generated modulefile is missing some paths; append at the end:

```tcl
prepend-path C_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/c++/v1/
prepend-path CPLUS_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/c++/v1/
prepend-path C_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/x86_64-unknown-linux-gnu/c++/v1/
prepend-path CPLUS_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/x86_64-unknown-linux-gnu/c++/v1/
prepend-path LIBRARY_PATH /opt/cudaq/0.11.0/llvm/lib/x86_64-unknown-linux-gnu
prepend-path LD_LIBRARY_PATH /opt/cudaq/0.11.0/llvm/lib/x86_64-unknown-linux-gnu
```

```bash
ml cudaq/0.11.0
```

```bash
vim $(which nvq++)
```

Modify the following lines:

```bash
llvm_install_dir="${CUDAQ_ROOT}/llvm/"
NVQPP_LD_PATH=${NVQPP_LD_PATH:-"${CUDAQ_ROOT}/llvm/bin/ld.lld"}
```

### remote - Activate cudaq's MPI

```bash
vim $CUDAQ_ROOT/cudaq/distributed_interfaces/activate_custom_mpi.sh
```

Change

```bash
$CXX -shared -std=c++17 -fPIC \
```

to

```bash
$CXX --library-mode -shared -std=c++17 -fPIC \
```

```bash
sudo -E MPI_PATH=$OPENMPI_ROOT bash /opt/cudaq/0.11.0/cudaq/distributed_interfaces/activate_custom_mpi.sh
```

Alternatively, activate against the Open MPI bundled with HPC-X:

```bash
sudo -E MPI_PATH=/home/team7admin/hlajungo/repo/hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ompi/ bash /opt/cudaq/0.11.0/cudaq/distributed_interfaces/activate_custom_mpi.sh
```

Installation complete!
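As an end-to-end smoke test (my own sketch, not part of the original procedure; it only uses `nvq++` flags and CUDA-Q APIs that appear elsewhere in these notes), compile and run a minimal Bell-state program:

```bash
ml purge
ml gcc/12.4.0 cuda/12.1.1 openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 cudaq/0.11.0

cat > /tmp/bell.cpp <<'EOF'
#include <cudaq.h>

// 2-qubit Bell state: expect roughly 50/50 counts of "00" and "11".
__qpu__ void bell() {
  cudaq::qvector q(2);
  h(q[0]);
  x<cudaq::ctrl>(q[0], q[1]);
  mz(q);
}

int main() {
  auto counts = cudaq::sample(bell);
  counts.dump();
  return 0;
}
EOF

nvq++ --library-mode /tmp/bell.cpp -o /tmp/bell && /tmp/bell
```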
## Runtime work

### local - Copy the grover code and scripts

```bash
rsync -av -e ssh /media/hlajungo/D/linux/repo_my/hello_cudaq/code \
  USER@IP:/home/hlajungo
```

### remote - grover single-node test

```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1
ml cudaq/0.11.0

nvq++ --library-mode grover_algorithm.cpp -o grover_exec
mpirun -np 2 ./grover_exec
```
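The test above runs `mpirun` directly in the shell; for queued, repeatable runs, a batch script helps. A minimal sbatch template (my own sketch; the partition and resource numbers mirror the `salloc` example in the cheat sheet below, and the module names follow this document):

```bash
#!/bin/bash
#SBATCH --job-name=grover
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --gpus=1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.log

ml purge
ml gcc/12.4.0 cuda/12.1.1 openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 cudaq/0.11.0

mpirun -np $SLURM_NTASKS ./grover_exec
```

Submit with `sbatch grover.sbatch` and watch the queue with `squeue -u $USER`.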
## Cheat sheet

slurm

```bash
# Cancel the batch job with JOBID=11
scancel 11

# Show node information
scontrol show nodes

# Get an interactive slurm shell
salloc -p debug \
  --nodes=1 \
  --ntasks=2 \
  --cpus-per-task=1 \
  --gpus=1
```

ucx

```bash
# Dump every tunable UCX variable (c: show config variables, f: full detail)
ucx_info -cf > ucx_env.txt

# List all network devices/transports
UCX_TLS=all ucx_info -d
```

## grover code

```cpp
#include <chrono>
#include <cmath>
#include <cudaq.h>
#include <functional> // for std::reference_wrapper
#include <iomanip>    // for std::setw
#include <iostream>
#include <string>

// Apply the oracle: flip the phase of the marked state.
void
oracle (auto& q, const std::string& marked_state)
{
  // Step 1: flip the qubits where marked_state has a '0',
  // mapping the marked state onto "11...1".
  for (std::size_t i = 0; i < marked_state.size (); ++i)
    {
      if (marked_state[i] == '0')
        {
          x (q[i]);
        }
    }

  // Step 2: use HXH = Z to flip the sign of "11...1" only.
  h (q[q.size () - 1]);
  // Collect the controls via reference_wrapper.
  std::vector<std::reference_wrapper<cudaq::qubit> > controls;
  for (std::size_t i = 0; i < q.size () - 1; ++i)
    {
      controls.emplace_back (std::ref (q[i]));
    }
  x (controls, q[q.size () - 1]); // multi-controlled X gate
  h (q[q.size () - 1]);

  // Step 3: undo the flips from Step 1.
  for (std::size_t i = 0; i < marked_state.size (); ++i)
    {
      if (marked_state[i] == '0')
        {
          x (q[i]);
        }
    }
  // The marked state has now been multiplied by -1; everything else is unchanged.
}

// Diffusion operator: reflection about the average.
void
diffusion (auto& q)
{
  // Step 1: map the uniform superposition to "00...0".
  h (q);
  // Step 2: map it to "11...1".
  x (q);

  // Step 3: use HXH = Z to flip the sign of "11...1".
  h (q[q.size () - 1]);
  std::vector<std::reference_wrapper<cudaq::qubit> > controls;
  for (std::size_t i = 0; i < q.size () - 1; ++i)
    {
      controls.emplace_back (std::ref (q[i]));
    }
  x (controls, q[q.size () - 1]);
  h (q[q.size () - 1]);

  // Step 4: undo, back to "00...0".
  x (q);
  // Step 5: undo, back to the original state.
  h (q);
}

/*
   s = uniform superposition state
   w = marked target state, e.g. "1111"
   oracle    -> flips the sign of w
   diffusion -> reflects about s (inversion about the mean)
*/
__qpu__ void
grover_kernel (std::size_t n, const std::string& marked, std::size_t iterations)
{
  cudaq::qvector q (n);

  // Initialize the uniform superposition.
  for (auto& qbit : q)
    h (qbit);

  // Grover iterations.
  for (std::size_t i = 0; i < iterations; ++i)
    {
      oracle (q, marked);
      diffusion (q);
    }

  // Measure.
  mz (q);
}

int
main_mpi ()
{
  cudaq::mpi::initialize ();

  if (cudaq::mpi::rank () == 0)
    {
      std::cout << "qubits target_amp time(s)\n";
    }

  for (std::size_t n = 24; n <= 24; ++n)
    {
      std::string marked (n, '1');
      // Optimal iteration count is ~ (pi/4) * sqrt(2^n); scaled by 0.61 here.
      std::size_t opt_it
          = std::floor ((M_PI / 4.0) * std::sqrt (std::pow (2, n)));
      std::size_t it = opt_it * 0.61;

      auto start = std::chrono::high_resolution_clock::now ();
      auto result = cudaq::sample (grover_kernel, n, marked, it);
      auto end = std::chrono::high_resolution_clock::now ();
      std::chrono::duration<double> elapsed = end - start;

      // Only rank 0 prints.
      if (cudaq::mpi::rank () == 0)
        {
          std::cout << std::setprecision (3) << std::setw (6) << n
                    << std::setw (11)
                    << result.probability (result.most_probable ())
                    << std::setw (8) << elapsed.count () << "\n";
        }
    }

  cudaq::mpi::finalize ();
  return 0;
}

int
main_my ()
{
  for (std::size_t n = 24; n <= 24; ++n)
    {
      std::string marked (n, '1');
      std::size_t opt_it
          = std::floor ((M_PI / 4.0) * std::sqrt (std::pow (2, n)));
      std::size_t it = opt_it * 0.61;

      auto start = std::chrono::high_resolution_clock::now ();
      auto result = cudaq::sample (grover_kernel, n, marked, it);
      auto end = std::chrono::high_resolution_clock::now ();
      std::chrono::duration<double> elapsed = end - start;

      std::cout << std::setprecision (3) << std::setw (6) << n
                << std::setw (11)
                << result.probability (result.most_probable ())
                << std::setw (8) << elapsed.count () << "\n";
    }
  return 0;
}

int
main ()
{
  main_mpi ();
  return 0;
}
```

<!--
1. Not yet familiar with slurm sbatch.
   Once slurm/sbatch is set up, write some test scripts to pin down behavior
   (paths, resources, etc.), use them to check portability, and produce a
   standard sbatch template.

2. See through to the essence.
   If quantum computing is just matrix arithmetic, then I should recognize
   that the oracle for "11...1" is a particularly simple matrix, and be able
   to solve the problem in plain C++ myself.

3. An mpirun factory.
   I need a script that generates mpirun invocations, with parameters that
   control the details.

4. cudaq errors.
   I need to get more familiar with profilers and build up intuition. Once
   the code compiles and the runtime results are correct but the speed is
   abnormal (abnormal meaning: you are unhappy with the performance), that
   is exactly where a profiler comes in.
   4.1 How should MPI be implemented correctly with cudaq?
       There are both CUDA Quantum docs and cudaq docs:
       CUDA Quantum is the platform; cudaq is the SDK.
       nvq++ --qpu cuquantum_mgmn src.cpp ...

5. Compare against yourself.
   Not seeing other people's results is always anxiety-inducing. Take the
   most basic case, e.g. a small test with 1 process and 1 CPU, as the
   baseline. Compare every optimization against that baseline and you will
   develop a feel for what optimization buys you.

6. lustre vs nfs
   Is Lustre an advantage over NFS at MPI runtime?
   From a senior via chat: scenarios that ARE affected by Lustre/NFS:
   - at program startup, every rank reads its initial data from disk
   - during the simulation, data is periodically written to files
     (checkpoints, logs)
   - during training, every epoch re-reads large amounts of data from disk
   These all go through the file system, so:
   - NFS suffers I/O contention (especially on metadata)
   - Lustre can access multiple storage targets in parallel, improving
     performance
-->