---
title: Benchmark error messages
tags: APAC HPC-AI competition
---

[TOC]

# AI BERT

> Run on aspire.nscc.sg

Modified command

```bash=
Original: module load /applications/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.7-x86_64/modulefiles/hpcx-mt-ompi
After: /app/hpc-x/hpcx-v1.4.352-gcc-MLNX_OFED_LINUX-3.1-1.0.3-redhat6.6-x86_64\
/modulefiles/hpcx-ompi-mellanox-v1.8
```

Run

```bash=
time -p $MPIBINPATH/mpirun -np 4 -H ops003:2,ops004:2 \
    -mca btl_tcp_if_include enp3s0f1 -npernode 2 \
    -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python $CODE_PATH/run_classifier.py \
    --do_train=true --do_eval=false --do_predict=false \
    --task_name=$TASK_NAME --data_dir=$DATA_PATH \
    --vocab_file=$MODEL_PATH/vocab.txt \
    --bert_config_file=$MODEL_PATH/bert_config.json \
    --init_checkpoint=$MODEL_PATH/bert_model.ckpt \
    --output_dir=$RESULT_PATH \
    --learning_rate=5e-5 --num_train_epochs=0.001 \
    --max_seq_length=128 --train_batch_size=32 \
    --num_accumulation_steps=1 --save_checkpoints_steps=1000 \
    --warmup_proportion=0.1 --use_fp16 --horovod \
    2>&1 | tee py36traininglog
```

Result

```bash=
Conflicting directives for mapping policy are causing the policy
to be redefined:

  New policy: bynode
  Prior policy: BYSLOT

Please check that only one policy is defined.
--------------------------------------------------------------------------
real 0.54
user 0.02
sys 0.10
```

## Problem

- What is the mapping policy? (The command passes both `-npernode 2` and `-map-by slot`, which look like the two conflicting directives the message refers to.)

---

# NEMO Climate simulation application

## On server aspire.nscc.sg

command

```bash=
cmake .. \
    -DCMAKE_INSTALL_PREFIX=$APPROOT/deps/hdf5/hdf5 \
    -DHDF5_ENABLE_PARALLEL=1 -DHDF5_BUILD_CPP_LIB=0
```

error:

```bash=
Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS)
CMake Error at /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find MPI (missing: MPI_C_FOUND)
Call Stack (most recent call first):
  /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindMPI.cmake:1672 (find_package_handle_standard_args)
  CMakeLists.txt:606 (find_package)
```

cmake version 3.14.4

## On server btn

### Missing library [libnetcdff.so.6](https://pkgs.org/download/libnetcdff.so.6()(64bit))

- error:

```bash=
$ cd /home/marvin/NEMO/apps/nemo-4.0/cfgs/hpcx_linux_gfortran_gyre_pisces/EXP00
$ /usr/bin/time -p mpirun -np 1 --map-by core -report-bindings \
    -mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 \
    ./nemo
/usr/bin/time -p mpirun -np 1 --map-by core -report-bindings -mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 ./nemo
[btn:323518] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..]
./nemo: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated.
The first process to do so was:

  Process name: [[597,1],0]
  Exit code: 127
--------------------------------------------------------------------------
Command exited with non-zero status 127
real 2.43
user 0.03
sys 0.05
```

### OpenMPI not finding the device

- command:

```bash=
$ module purge
$ module load $APPROOT/apps/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/modulefiles/hpcx
$ /usr/bin/time -p mpirun -np $((28*4)) --map-by core -report-bindings \
    -mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 \
    ./nemo
```

- error:

```bash=
[btn:35243] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..]
[1600360575.022407] [btn:35248:0] ucp_context.c:690 UCX WARN network devices 'mlx5_0:1','mlx5_2:1' are not available, please use one or more of: 'enp0s31f6'(tcp)
```

---

# NAMD Bio-Science application

Run on server **btn**; it cannot run on server aspire.nscc.sg.

## Problem with the benchmark

- build namd

```bash=
CHARM_ARCH_UCX_GCC=ucx-linux-x86_64-gfortran-ompipmix-gcc \
CHARM_ARCH_MPI_GCC=mpi-linux-x86_64-gfortran-gcc \
CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_WORK_TREE=/HPCAI/NAMD/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
CHARM_BASE=$CHARM_CODE_DIR/built \
FFTW3_LIB_DIR=/HPCAI/NAMD/application/libs/fftw \
GCC_FFTW3_LIB_DIR=$FFTW3_LIB_DIR/3.3.8-shared-gcc840-avx2-broadwell \
APP_MPI_DIR=/HPCAI/NAMD/application/mpi \
HPCX_FILES_DIR=$APP_MPI_DIR/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
GCC_DIR=/usr \
GCC_PATH='"$GCC_DIR/bin/gcc "' \
GXX_PATH='"$GCC_DIR/bin/g++ -std=c++0x"' \
NATIVE_GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
CODE_NAME=namd \
CODE_GIT_TAG=FETCH_HEAD \
GIT_DIR=/HPCAI/NAMD/github/$CODE_NAME.git \
GIT_WORK_TREE=/HPCAI/NAMD/code \
NAMD_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
NAMD_DIR=$NAMD_CODE_DIR \
bash -c '
cd $NAMD_DIR;
CMD_BUILD_UCX_NAMD_GCC_FFTW3="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc && \
    ./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
    --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-fftw3 \
    && module purge"
CMD_BUILD_UCX_NAMD_GCC_MKL="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc && \
    ./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
    --with-mkl --mkl-prefix $MKL_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-mkl \
    && module purge"
CMD_BUILD_MPI_NAMD_GCC_FFTW3="
    module purge && module load gcc && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
    --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-fftw3 \
    && hpcx_unload && module purge"
CMD_BUILD_MPI_NAMD_GCC_MKL="
    module purge && module load gcc && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
    --with-mkl --mkl-prefix $MKL_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-mkl \
    && hpcx_unload && module purge"
eval $CMD_BUILD_UCX_NAMD_GCC_FFTW3
eval $CMD_BUILD_MPI_NAMD_GCC_FFTW3
eval $CMD_BUILD_UCX_NAMD_GCC_MKL;
eval $CMD_BUILD_MPI_NAMD_GCC_MKL;
wait
echo $CMD_BUILD_UCX_NAMD_GCC_FFTW3
echo $CMD_BUILD_MPI_NAMD_GCC_FFTW3
echo $CMD_BUILD_UCX_NAMD_GCC_MKL;
echo $CMD_BUILD_MPI_NAMD_GCC_MKL;
' | tee namdbuildlog 2>&1
```

- error message

![](https://i.imgur.com/PXqzH0x.png)

- Problem point (not understood)

```bash=
./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
    --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
```
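
One part of the excerpt above that is easy to misread is the quoting: `GCC_PATH`, `GXX_PATH` and `GCC_FLAGS` are defined with nested quotes (`'"..."'`) and only reach `./config` after an `eval`. The sketch below tries to show how that two-stage expansion behaves. It is an illustration under assumptions, not the build script itself: `show_args` is a hypothetical stand-in for `./config`, the flag value is shortened, and everything runs in one shell instead of being passed into `bash -c`.

```bash
#!/usr/bin/env bash
# Hypothetical, shortened stand-ins for the variables in the build script.
GCC_DIR=/usr
GCC_PATH='"$GCC_DIR/bin/gcc "'        # outer '...' keeps the inner "..." and $GCC_DIR literal
GCC_FLAGS='"-march=broadwell -O3"'

# Hypothetical replacement for ./config: prints each argument it receives on its own line.
show_args() { printf '[%s]\n' "$@"; }

# Stage 1: building the command string expands $GCC_PATH and $GCC_FLAGS once,
# leaving the inner double quotes and $GCC_DIR as plain text.
CMD="show_args --cc $GCC_PATH --cc-opts $GCC_FLAGS"
echo "$CMD"
# prints: show_args --cc "$GCC_DIR/bin/gcc " --cc-opts "-march=broadwell -O3"

# Stage 2: eval re-parses that text, so the inner quotes group words
# and $GCC_DIR is expanded only now.
eval "$CMD"
# prints:
# [--cc]
# [/usr/bin/gcc ]
# [--cc-opts]
# [-march=broadwell -O3]
```

If the real variables behave the same way, `./config` would receive each quoted value as a single argument, including the trailing space in `GCC_PATH`. This does not by itself explain the error in the screenshot; it only shows what the `--cc` and `--cc-opts` lines expand to.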