## Cudaq Problem 2
## Installation
### remote - Global setting
```bash
MF_ROOT="/opt/modulefiles"
OPT_ROOT="/opt"
REPO_ROOT="/home/team7admin/hlajungo/repo"
CORE_NUM=32
GENERAL_OPT_FLAG=" -O3 -march=native -mtune=native -ffast-math -fopenmp -fno-common "
```
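These variables have to exist in every new shell on the remote host; one way to persist them (my assumption, not part of the original setup) is to append them to `~/.bashrc`, registering `$MF_ROOT` in case it is not already on `MODULEPATH`:
```bash
# Assumed convenience step: persist the globals and expose the modulefile tree.
cat >> ~/.bashrc <<'EOF'
export MF_ROOT="/opt/modulefiles"
export OPT_ROOT="/opt"
export REPO_ROOT="/home/team7admin/hlajungo/repo"
export CORE_NUM=32
export GENERAL_OPT_FLAG=" -O3 -march=native -mtune=native -ffast-math -fopenmp -fno-common "
module use "$MF_ROOT"   # make the generated .tcl modulefiles visible to ml
EOF
```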
### local - My scripts
```bash
rsync -av -e ssh /media/hlajungo/D/linux/script USER@IP:/home/hlajungo
```
### remote - My scripts
```bash
PATH=/home/hlajungo/script:$PATH
```
### remote - gcc-12.4.0
```bash
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/gmp-4.3.2.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz -o $REPO_ROOT
bash download_unzip.sh https://gcc.gnu.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 -o $REPO_ROOT
bash download_unzip.sh https://github.com/gcc-mirror/gcc/archive/refs/tags/releases/gcc-12.4.0.zip -o $REPO_ROOT
```
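`download_unzip.sh` is a private helper that these notes do not include; a minimal stand-in, assuming all it does is fetch the URL into the `-o` directory and extract the archive there:
```bash
#!/usr/bin/env bash
# Hypothetical stand-in for download_unzip.sh. Usage: download_unzip.sh URL -o DIR
set -euo pipefail
url=$1
dest=${3:?usage: download_unzip.sh URL -o DIR}   # $2 is the literal "-o"
mkdir -p "$dest"
file="$dest/$(basename "$url")"
wget -q "$url" -O "$file"
case "$file" in
  *.tar.bz2 | *.tar.gz) tar xf "$file" -C "$dest" ;;
  *.zip)                unzip -q "$file" -d "$dest" ;;
esac
```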
gmp-4.3.2
```bash
ml purge
cd $REPO_ROOT/gmp-4.3.2
mkdir build && cd build
../configure \
--prefix=$OPT_ROOT/gmp/4.3.2 \
--enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/gmp/4.3.2/ -o $MF_ROOT/gmp/4.3.2.tcl
```
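`gen_mf_tcl_top.sh` is likewise a private helper; judging from how the modules are consumed later (`$GMP_ROOT` in mpfr's configure, the `prepend-path` lines added to the cudaq modulefile), it presumably emits a Tcl modulefile along these lines (content is my guess):
```bash
# Hypothetical output of gen_mf_tcl_top.sh for gmp/4.3.2.
cat > "$MF_ROOT/gmp/4.3.2.tcl" <<'EOF'
#%Module1.0
set root /opt/gmp/4.3.2
setenv       GMP_ROOT           $root
prepend-path LIBRARY_PATH       $root/lib
prepend-path LD_LIBRARY_PATH    $root/lib
prepend-path C_INCLUDE_PATH     $root/include
prepend-path CPLUS_INCLUDE_PATH $root/include
EOF
```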
mpfr-3.1.4
```bash
ml purge
ml gmp/4.3.2
cd $REPO_ROOT/mpfr-3.1.4
mkdir build && cd build
../configure \
--prefix=$OPT_ROOT/mpfr/3.1.4 \
--with-gmp=$GMP_ROOT \
--enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/mpfr/3.1.4/ -o $MF_ROOT/mpfr/3.1.4.tcl
```
mpc-1.0.3
```bash
ml purge
ml gmp/4.3.2
ml mpfr/3.1.4
cd $REPO_ROOT/mpc-1.0.3
mkdir build && cd build
../configure \
--prefix=$OPT_ROOT/mpc/1.0.3 \
--with-gmp=$GMP_ROOT \
--with-mpfr=$MPFR_ROOT \
--enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/mpc/1.0.3/ -o $MF_ROOT/mpc/1.0.3.tcl
```
isl-0.18
```bash
ml purge
ml gmp/4.3.2
cd $REPO_ROOT/isl-0.18
mkdir build && cd build
../configure \
--prefix=$OPT_ROOT/isl/0.18 \
--with-gmp=$GMP_ROOT \
--enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/isl/0.18/ -o $MF_ROOT/isl/0.18.tcl
```
gcc-12.4.0
```bash
ml purge
ml gmp/4.3.2
ml mpfr/3.1.4
ml mpc/1.0.3
ml isl/0.18
cd $REPO_ROOT/gcc-releases-gcc-12.4.0
mkdir build && cd build
# Scrub empty and relative entries from LIBRARY_PATH; GCC's configure aborts if it contains the current directory.
LIBRARY_PATH=$(echo "$LIBRARY_PATH" | tr ':' '\n' | grep -v -e '^$' -e '^\.$' -e '^\.\/' | paste -sd ':' -)
../configure \
--prefix=$OPT_ROOT/gcc/12.4.0 \
--with-gmp=$GMP_ROOT \
--with-mpfr=$MPFR_ROOT \
--with-mpc=$MPC_ROOT \
--with-isl=$ISL_ROOT \
--disable-werror \
--enable-checking=release \
--enable-languages=c,c++,fortran \
--disable-multilib \
--disable-bootstrap \
--without-included-gettext \
--enable-threads=posix \
--enable-nls \
--enable-shared --enable-static
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/gcc/12.4.0/ -o $MF_ROOT/gcc/12.4.0.tcl
```
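A quick smoke test of the new compiler (my addition, not in the original notes):
```bash
ml purge
ml gcc/12.4.0
gcc --version   # should report 12.4.0
echo 'int main(){return 0;}' > /tmp/t.c && gcc /tmp/t.c -o /tmp/t && /tmp/t && echo OK
```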
### remote - cuda-12.1
```bash
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run -P $REPO_ROOT
ml purge
ml gcc/12.4.0
# Note: gcc-11 too old, gcc-14.3.0 too new. Use gcc-12.
bash $REPO_ROOT/cuda_12.1.1_530.30.02_linux.run --installpath=$OPT_ROOT/cuda/12.1.1
bash gen_mf_tcl_top.sh $OPT_ROOT/cuda/12.1.1 -o $MF_ROOT/cuda/12.1.1.tcl
```
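A quick sanity check (my addition) that toolkit and driver line up:
```bash
ml cuda/12.1.1
nvcc --version   # should report release 12.1
nvidia-smi       # driver must support CUDA 12.1 (>= 530.30.02)
```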
### remote - ucx-1.18.1 prebuilt
```bash
wget https://github.com/openucx/ucx/releases/download/v1.18.1/ucx-1.18.1-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2 -P $REPO_ROOT
cd $REPO_ROOT && tar xf ucx-1.18.1-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
dpkg-deb -x ucx-1.18.1.deb $REPO_ROOT/ucx_tmp/
dpkg-deb -x ucx-cuda-1.18.1.deb $REPO_ROOT/ucx_tmp/
dpkg-deb -x ucx-xpmem-1.18.1.deb $REPO_ROOT/ucx_tmp/
mkdir -p $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/
mv $REPO_ROOT/ucx_tmp/usr/* $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/
bash gen_mf_tcl_top.sh $OPT_ROOT/ucx/1.18.1_ofed5_cuda12/ -o $MF_ROOT/ucx/1.18.1_ofed5_cuda12.tcl
```
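A quick check (my addition) that the prebuilt UCX sees the CUDA and IB transports:
```bash
ml ucx/1.18.1_ofed5_cuda12
ucx_info -v                            # build version and configure line
ucx_info -d | grep -i -e cuda -e ib    # CUDA and InfiniBand transports should appear
```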
### remote - openmpi-5.0.8 with ucx with ib
<!--[Source - using openmpi + ucx](https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX)-->
```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml ucx/1.18.1_ofed5_cuda12
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.8.tar.gz -P $REPO_ROOT
tar xf $REPO_ROOT/openmpi-5.0.8.tar.gz -C $REPO_ROOT
cd $REPO_ROOT/openmpi-5.0.8/
mkdir build && cd build
echo "
../configure \
--prefix=$OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1 \
CFLAGS=\" $GENERAL_OPT_FLAG \" \
CXXFLAGS=\" $GENERAL_OPT_FLAG \" \
--enable-shared \
--enable-static \
--with-cuda=${CUDA_ROOT} \
--with-ucx=${UCX_ROOT} \
--with-slurm \
--with-io-romio-flags=--with-file-system=nfs \
--enable-mca-no-build=btl-uct \
--enable-dlopen \
--disable-mpi-cxx \
--disable-cxx-exceptions \
--disable-memchecker \
--disable-java \
--disable-mpi-java \
--without-verbs \
--without-mxm \
--without-psm \
--without-psm2"
# Before running configure, delete these *.la files; their hard-coded build paths break libtool (see the error below).
find $UCX_ROOT/lib -name "*.la" -delete
# Run the configure command printed by the echo above, then:
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1 -o $MF_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1.tcl
# Note: Open MPI bakes absolute paths into the install, so it cannot be mv'ed afterwards. To relocate it, change --prefix and reinstall.
```
<!--
While running make, the error below appeared; deleting the "*.la" files works around it.
```
libtool: warning: library '/media/hlajungo/D/linux/opt/ucx/1.18.1_ofed5_cuda12/lib/libucp.la' was moved.
/usr/bin/grep: /usr/lib/libuct.la: No such file or directory
/usr/bin/sed: can't read /usr/lib/libuct.la: No such file or directory
libtool: error: '/usr/lib/libuct.la' is not a valid libtool archive
```-->
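Not in the original notes: a way to prove this build really selected UCX, using `ompi_info` plus a two-rank hello (the /tmp paths are arbitrary, and this assumes UCX has a usable transport on the node):
```bash
ml purge
ml gcc/12.4.0 cuda/12.1.1 ucx/1.18.1_ofed5_cuda12
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1_ucx-1.18.1
ompi_info | grep -i -e ucx -e cuda   # pml/ucx and CUDA support should be listed
# Tiny MPI hello, forced onto the UCX PML.
cat > /tmp/hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main (int argc, char** argv)
{
  MPI_Init (&argc, &argv);
  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  printf ("rank %d up\n", rank);
  MPI_Finalize ();
  return 0;
}
EOF
mpicc /tmp/hello.c -o /tmp/hello
mpirun -np 2 --mca pml ucx /tmp/hello
```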
### remote - openmpi-5.0.8 no ib no ucx (fallback)
```bash
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.8.tar.gz -P $REPO_ROOT
tar xf $REPO_ROOT/openmpi-5.0.8.tar.gz -C $REPO_ROOT
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
cd $REPO_ROOT/openmpi-5.0.8/
mkdir build && cd build
echo "../configure --prefix=/opt/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 \
CFLAGS=\" $GENERAL_OPT_FLAG \" \
CXXFLAGS=\" $GENERAL_OPT_FLAG \" \
--with-cuda=${CUDA_ROOT} \
--enable-mpirun-prefix-by-default"
# Run the configure command printed by the echo above, then:
make -j$CORE_NUM
make install
bash gen_mf_tcl_top.sh $OPT_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 -o $MF_ROOT/openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1.tcl
```
### local - cudaq-0.11.0
```bash
rsync -av -e ssh /media/hlajungo/D/linux/repo_my/hello_cudaq/install_cuda_quantum_cu12.x86_64 \
USER@IP:/home/hlajungo
```
### remote - cudaq-0.11.0 install
```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1
cd $REPO_ROOT
MPI_PATH=${OPENMPI_ROOT} \
sudo -E bash install_cuda_quantum*.$(uname -m) --accept --target ./out --noexec
```
```bash
vim out/build_config.xml
```
Replace the entire file with:
```xml
<build_config>
<CUDAQ_INSTALL_PREFIX>/opt/cudaq/0.11.0/cudaq</CUDAQ_INSTALL_PREFIX>
<LLVM_INSTALL_PREFIX>/opt/cudaq/0.11.0/llvm</LLVM_INSTALL_PREFIX>
<CUQUANTUM_INSTALL_PREFIX>/opt/cudaq/0.11.0/cuquantum</CUQUANTUM_INSTALL_PREFIX>
<CUTENSOR_INSTALL_PREFIX>/opt/cudaq/0.11.0/cutensor</CUTENSOR_INSTALL_PREFIX>
</build_config>
```
```bash
./out/install.sh -s ./out -c ./out/build_config.xml -t /opt/cudaq/0.11.0/cudaq
bash gen_mf_tcl_all.sh /opt/cudaq/0.11.0 -o $MF_ROOT/cudaq/0.11.0.tcl
```
```bash
vim [path of the modulefile written by the previous script]
```
The auto-generated modulefile misses a few paths; append at the end:
```tcl
prepend-path C_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/c++/v1/
prepend-path CPLUS_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/c++/v1/
prepend-path C_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/x86_64-unknown-linux-gnu/c++/v1/
prepend-path CPLUS_INCLUDE_PATH /opt/cudaq/0.11.0/llvm/include/x86_64-unknown-linux-gnu/c++/v1/
prepend-path LIBRARY_PATH /opt/cudaq/0.11.0/llvm/lib/x86_64-unknown-linux-gnu
prepend-path LD_LIBRARY_PATH /opt/cudaq/0.11.0/llvm/lib/x86_64-unknown-linux-gnu
```
```bash
ml cudaq/0.11.0
```
```bash
vim $(which nvq++)
```
Modify the following lines:
```bash
llvm_install_dir="${CUDAQ_ROOT}/llvm/"
NVQPP_LD_PATH=${NVQPP_LD_PATH:-"${CUDAQ_ROOT}/llvm/bin/ld.lld"}
```
### remote - Activate CUDA-Q's MPI plugin
```bash
vim $CUDAQ_ROOT/cudaq/distributed_interfaces/activate_custom_mpi.sh
```
Change
```bash
$CXX -shared -std=c++17 -fPIC \
```
to
```bash
$CXX --library-mode -shared -std=c++17 -fPIC \
```
```bash
sudo -E MPI_PATH=$OPENMPI_ROOT bash /opt/cudaq/0.11.0/cudaq/distributed_interfaces/activate_custom_mpi.sh
```
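The script builds `libcudaq_distributed_interface_mpi.so` next to itself, and CUDA-Q locates it through the `CUDAQ_MPI_COMM_LIB` environment variable. Wiring that into the modulefile is my own follow-up, not from the original notes:
```bash
# Assumed follow-up: point CUDA-Q at the freshly built MPI plugin.
echo 'setenv CUDAQ_MPI_COMM_LIB /opt/cudaq/0.11.0/cudaq/distributed_interfaces/libcudaq_distributed_interface_mpi.so' \
  >> $MF_ROOT/cudaq/0.11.0.tcl
```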
Installation complete!

Alternative: run the activation against the Open MPI bundled with HPC-X instead:
```bash
sudo -E MPI_PATH=/home/team7admin/hlajungo/repo/hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ompi/ bash /opt/cudaq/0.11.0/cudaq/distributed_interfaces/activate_custom_mpi.sh
```
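A final smoke test (my addition):
```bash
ml purge
ml cudaq/0.11.0
nvq++ --version
```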
## Runtime work
### local - Copy the Grover code and scripts
```bash
rsync -av -e ssh /media/hlajungo/D/linux/repo_my/hello_cudaq/code \
USER@IP:/home/hlajungo
```
### remote - Grover single-node test
```bash
ml purge
ml gcc/12.4.0
ml cuda/12.1.1
ml openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1
ml cudaq/0.11.0
nvq++ --library-mode grover_algorithm.cpp -o grover_exec
mpirun -np 2 ./grover_exec
```
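As I understand it, the `--library-mode` binary above holds the full state vector on every rank; for states too large for one GPU, CUDA-Q's `nvidia` target has an `mgpu` option that shards the state across ranks instead. A hedged sketch:
```bash
# Assumption: --target nvidia --target-option mgpu distributes the state vector
# across MPI ranks (one GPU per rank), per the CUDA-Q 0.11 docs.
nvq++ --target nvidia --target-option mgpu grover_algorithm.cpp -o grover_mgpu
mpirun -np 2 ./grover_mgpu
```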
## Cheat sheet
slurm
```bash
# Cancel the batch job with JOBID=11
scancel 11
# Show node information
scontrol show nodes
# Request an interactive Slurm shell
salloc -p debug \
  --nodes=1 \
  --ntasks=2 \
  --cpus-per-task=1 \
  --gpus=1
```
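The notes above only use salloc interactively; a minimal sbatch template for the same resources (partition name and binary path are assumptions):
```bash
#!/usr/bin/env bash
#SBATCH --job-name=grover
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --gpus=1
#SBATCH --output=%x_%j.out
ml purge
ml gcc/12.4.0 cuda/12.1.1 openmpi/5.0.8_gcc-12.4.0_cuda-12.1.1 cudaq/0.11.0
mpirun -np $SLURM_NTASKS ./grover_exec
```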
ucx
```bash
# Dump every tunable UCX variable (-c lists the variables, -f adds full descriptions)
ucx_info -cf > ucx_env.txt
# List all network devices and transports
ucx_info -d
# Let UCX choose among all available transports
UCX_TLS=all
```
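A hedged example of pinning UCX to a specific CUDA-aware transport set for one run (the transport list is an assumption about what this fabric offers):
```bash
# Assumed example: restrict UCX to RC verbs, shared memory, and the CUDA transports.
UCX_TLS=rc_x,sm,self,cuda_copy,cuda_ipc mpirun -np 2 ./grover_exec
```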
## Grover code
```cpp
#include <chrono>
#include <cmath>
#include <cudaq.h>
#include <functional> // for std::reference_wrapper
#include <iomanip>    // for std::setw
#include <iostream>
#include <string>

// Apply the oracle: flip the phase of the marked state.
void
oracle (auto& q, const std::string& marked_state)
{
  // Step 1: flip each qubit where marked_state has a '0'.
  for (std::size_t i = 0; i < marked_state.size (); ++i)
    {
      if (marked_state[i] == '0')
        {
          x (q[i]);
        }
    }
  // Step 2: use HXH = Z to flip the sign of "11...1".
  h (q[q.size () - 1]);
  // Collect the control qubits via reference_wrapper.
  std::vector<std::reference_wrapper<cudaq::qubit> > controls;
  for (std::size_t i = 0; i < q.size () - 1; ++i)
    {
      controls.emplace_back (std::ref (q[i]));
    }
  x (controls, q[q.size () - 1]); // multi-controlled X gate
  h (q[q.size () - 1]);
  // Step 3: undo the flips from Step 1.
  for (std::size_t i = 0; i < marked_state.size (); ++i)
    {
      if (marked_state[i] == '0')
        {
          x (q[i]);
        }
    }
  // At this point the marked state has been multiplied by -1; every other
  // amplitude is unchanged.
}

// Diffusion operator: reflection about the average.
void
diffusion (auto& q)
{
  // Step 1: map the uniform state to "00...0".
  h (q);
  // Step 2: map to "11...1".
  x (q);
  // Step 3: use HXH = Z to flip the sign of "11...1".
  h (q[q.size () - 1]);
  std::vector<std::reference_wrapper<cudaq::qubit> > controls;
  for (std::size_t i = 0; i < q.size () - 1; ++i)
    {
      controls.emplace_back (std::ref (q[i]));
    }
  x (controls, q[q.size () - 1]);
  h (q[q.size () - 1]);
  // Step 4: undo, back to "00...0".
  x (q);
  // Step 5: undo, back to the original state.
  h (q);
}

/*
s = target state "1111"
w = uniform superposition state
oracle -> reflection that phase-flips s
diffusion -> reflection about w
*/
__qpu__ void
grover_kernel (std::size_t n, const std::string& marked, std::size_t iterations)
{
  cudaq::qvector q (n);
  // Initialize to the uniform superposition.
  for (auto& qbit : q)
    h (qbit);
  // Grover iterations.
  for (std::size_t i = 0; i < iterations; ++i)
    {
      oracle (q, marked);
      diffusion (q);
    }
  // Measure.
  mz (q);
}

int
main_mpi ()
{
  cudaq::mpi::initialize ();
  if (cudaq::mpi::rank () == 0)
    {
      std::cout << "qubits target amp run time(s)\n";
    }
  for (std::size_t n = 24; n <= 24; ++n)
    {
      std::string marked (n, '1');
      // Optimal iteration count is floor((pi/4) * sqrt(2^n)).
      std::size_t opt_it
          = std::floor ((M_PI / 4.0) * std::sqrt (std::pow (2, n)));
      std::size_t it = opt_it * 0.61;
      auto start = std::chrono::high_resolution_clock::now ();
      auto result = cudaq::sample (grover_kernel, n, marked, it);
      auto end = std::chrono::high_resolution_clock::now ();
      std::chrono::duration<double> elapsed = end - start;
      // Only rank 0 prints.
      if (cudaq::mpi::rank () == 0)
        {
          std::cout << std::setprecision (3) << n << std::setw (10)
                    << result.probability (result.most_probable ())
                    << std::setw (10) << elapsed.count () << "\n";
        }
    }
  cudaq::mpi::finalize ();
  return 0;
}

int
main_my ()
{
  for (std::size_t n = 24; n <= 24; ++n)
    {
      std::string marked (n, '1');
      std::size_t opt_it
          = std::floor ((M_PI / 4.0) * std::sqrt (std::pow (2, n)));
      std::size_t it = opt_it * 0.61;
      auto start = std::chrono::high_resolution_clock::now ();
      auto result = cudaq::sample (grover_kernel, n, marked, it);
      auto end = std::chrono::high_resolution_clock::now ();
      std::chrono::duration<double> elapsed = end - start;
      std::cout << std::setprecision (3) << n << std::setw (10)
                << result.probability (result.most_probable ())
                << std::setw (10) << elapsed.count () << "\n";
    }
  return 0;
}

int
main ()
{
  main_mpi ();
  return 0;
}
```
<!--
1. Not yet comfortable with Slurm sbatch.
Once Slurm/sbatch is set up I need to write a few test scripts,
use them to pin down behavior (paths, resource requests, and so on),
and check portability through scripts. I should also keep a standard sbatch template.
2. See through to the essence.
If quantum computing is just matrix arithmetic, then I should recognize that the oracle for 11..1 is a particularly simple matrix,
simple enough to solve in hand-written C++.
3. Turn mpirun into a factory.
I need a script that generates mpirun invocations, with parameters controlling the details.
4. cudaq errors.
I need to get more familiar with profilers and build up intuition.
Once compilation passes and, at run time, the result is correct but the speed is abnormal (abnormal meaning: you are unhappy with the performance), that is exactly where a profiler belongs.
4.1 How should MPI be implemented correctly with cudaq?
I found both CUDA Quantum docs and cudaq docs.
CUDA Quantum is the platform.
cudaq is the SDK.
nvq++ --qpu cuquantum_mgmn src.cpp ...
5. Compare against yourself.
Not seeing other people's results is always stressful.
First take the most basic case, e.g. a small test with 1 process and 1 CPU, as the baseline.
Compare every optimization against that baseline and you will develop a feel for optimization.
6. Lustre vs NFS.
Does Lustre have an advantage over NFS at MPI runtime?
From a senior, via chat:
Scenarios that ARE affected by Lustre/NFS:
- at program startup, every rank reads its initial data from disk
- during the simulation, data is periodically written to files (checkpoints, logs)
- when training a model, every epoch re-reads a large dataset from disk
All of these go through the file system, therefore:
- NFS hits I/O contention (especially on metadata)
- Lustre can access multiple storage targets in parallel, improving performance
-->