---
title: Benchmark error messages
tags: APAC HPC-AI competition
---
[TOC]
# AI BERT
> Run on aspire.nscc.sg
Modified command:
```bash=
# Original:
module load /applications/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.7-x86_64/modulefiles/hpcx-mt-ompi
# After:
module load /app/hpc-x/hpcx-v1.4.352-gcc-MLNX_OFED_LINUX-3.1-1.0.3-redhat6.6-x86_64/modulefiles/hpcx-ompi-mellanox-v1.8
```
Run:
```bash=
time -p $MPIBINPATH/mpirun -np 4 -H ops003:2,ops004:2 \
-mca btl_tcp_if_include enp3s0f1 -npernode 2 \
-bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
python $CODE_PATH/run_classifier.py \
--do_train=true --do_eval=false --do_predict=false \
--task_name=$TASK_NAME --data_dir=$DATA_PATH \
--vocab_file=$MODEL_PATH/vocab.txt \
--bert_config_file=$MODEL_PATH/bert_config.json \
--init_checkpoint=$MODEL_PATH/bert_model.ckpt \
--output_dir=$RESULT_PATH \
--learning_rate=5e-5 --num_train_epochs=0.001 \
--max_seq_length=128 --train_batch_size=32 \
--num_accumulation_steps=1 --save_checkpoints_steps=1000 \
--warmup_proportion=0.1 --use_fp16 --horovod \
2>&1 | tee py36traininglog
```
Result
```bash=
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: bynode
Prior policy: BYSLOT
Please check that only one policy is defined.
--------------------------------------------------------------------------
real 0.54
user 0.02
sys 0.10
```
## Problem
- What is the "policy" that this error message refers to, and why is it being redefined?
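The "policy" in the message is most likely Open MPI's process mapping policy, that is, the rule `mpirun` uses to place ranks on nodes and slots. The command above passes both `-npernode 2` and `-map-by slot`. A plausible reading of the error is that each of these options sets the mapping policy, so the two collide. The sketch below lists the mapping-related options in a command line so the clash is visible before launching. The helper name `list_mapping_flags` is hypothetical, and the set of option names it scans for is an assumption for illustration, not an exhaustive list.

```bash
#!/usr/bin/env bash
# Print every mapping-related option found in an mpirun argument list.
# The option names matched below are an assumption chosen for illustration.
list_mapping_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      -npernode|--npernode|-map-by|--map-by|-bynode|--bynode|-byslot|--byslot)
        printf '%s\n' "$arg"
        ;;
    esac
  done
}

# The placement options used in the failing command above.
list_mapping_flags -np 4 -npernode 2 -bind-to none -map-by slot
```

If more than one option is printed, keeping only one of them (for example dropping `-map-by slot` and keeping `-npernode 2`) is a reasonable first thing to try.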
---
# NEMO Climate simulation application
## On server aspire.nscc.sg
command
```bash=
cmake .. -DCMAKE_INSTALL_PREFIX=$APPROOT/deps/hdf5/hdf5 -DHDF5_ENABLE_PARALLEL=1 -DHDF5_BUILD_CPP_LIB=0
```
error:
```bash=
Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS)
CMake Error at /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find MPI (missing: MPI_C_FOUND)
Call Stack (most recent call first):
  /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  /app/cmake/cmake-3.14.4/share/cmake-3.14/Modules/FindMPI.cmake:1672 (find_package_handle_standard_args)
  CMakeLists.txt:606 (find_package)
```
cmake version 3.14.4
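The failure says that CMake's `FindMPI` module could not locate a working MPI C compiler. A common cause is that no MPI module is loaded in the shell, so no compiler wrapper is on `PATH`. The following is a minimal pre-flight check under that assumption. The helper name `first_on_path` and the candidate wrapper names are hypothetical choices for illustration; `MPI_C_COMPILER` is the `FindMPI` hint variable for pointing CMake at a specific wrapper.

```bash
#!/usr/bin/env bash
# Print the location of the first command in the argument list that exists on PATH.
# Returns non-zero if none of the candidates is found.
first_on_path() {
  local cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      command -v "$cmd"
      return 0
    fi
  done
  return 1
}

# Look for an MPI C wrapper before configuring; the candidate names are assumptions.
if mpi_cc=$(first_on_path mpicc mpiicc); then
  echo "MPI C wrapper found: $mpi_cc"
  echo "hint for cmake: -DMPI_C_COMPILER=$mpi_cc"
else
  echo "no MPI C wrapper on PATH; load an MPI module first" >&2
fi
```

If the check reports no wrapper, loading the HPC-X module used elsewhere in these notes before rerunning `cmake` would be the first thing to try.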
## On server btn
### Missing library
[libnetcdff.so.6](https://pkgs.org/download/libnetcdff.so.6()(64bit))
- error:
```bash=
$ cd /home/marvin/NEMO/apps/nemo-4.0/cfgs/hpcx_linux_gfortran_gyre_pisces/EXP00
$ /usr/bin/time -p mpirun -np 1 --map-by core -report-bindings \
-mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 \
./nemo
/usr/bin/time -p mpirun -np 1 --map-by core -report-bindings -mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 ./nemo
[btn:323518] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..]
./nemo: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[597,1],0]
Exit code: 127
--------------------------------------------------------------------------
Command exited with non-zero status 127
real 2.43
user 0.03
sys 0.05
```
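Exit code 127 together with the loader message indicates that the dynamic linker could not resolve `libnetcdff.so.6` (the netCDF Fortran library) when starting `./nemo`. `ldd` marks unresolved libraries as `not found`, so its output can be filtered to list everything that is missing at once. The sketch below does that on sample input; the helper name `missing_libs` is hypothetical, and the sample lines only imitate `ldd` output.

```bash
#!/usr/bin/env bash
# Read ldd-style output on stdin and print the names of libraries marked "not found".
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Sample input imitating ldd output.
# On the real machine one would run: ldd ./nemo | missing_libs
printf '\tlibnetcdff.so.6 => not found\n\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n' | missing_libs
```

If the library is installed under a non-default prefix, adding its directory to `LD_LIBRARY_PATH` and exporting it to the ranks with `-x LD_LIBRARY_PATH` is the usual remedy.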
### Open MPI (UCX) cannot find the requested network devices
- command:
```bash=
$ module purge
$ module load $APPROOT/apps/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/modulefiles/hpcx
$ /usr/bin/time -p mpirun -np $((28*4)) --map-by core -report-bindings \
-mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 \
./nemo
```
- error:
```bash=
[btn:35243] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..]
[1600360575.022407] [btn:35248:0] ucp_context.c:690 UCX WARN network devices 'mlx5_0:1','mlx5_2:1' are not available, please use one or more of: 'enp0s31f6'(tcp)
```
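The warning says that the devices named in `UCX_NET_DEVICES` do not exist on btn, and that the only device UCX can use there is `enp0s31f6` over TCP. The sketch below compares a requested device list against an available one before launching. The helper name `unavailable_devices` is hypothetical; on a real machine the available list could be gathered from `ucx_info -d`, which is assumed here rather than shown.

```bash
#!/usr/bin/env bash
# Print each device from a comma-separated request list that is absent
# from a comma-separated list of available devices.
unavailable_devices() {
  local requested=$1 available=$2 dev
  local IFS=','
  for dev in $requested; do
    case ",$available," in
      *",$dev,"*) ;;                 # device is available, nothing to report
      *) printf '%s\n' "$dev" ;;     # device is missing
    esac
  done
}

# Values taken from the warning above.
unavailable_devices "mlx5_0:1,mlx5_2:1" "enp0s31f6"
```

Any device printed here should be removed from `UCX_NET_DEVICES`, or the variable should be left unset so that UCX selects a device on its own.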
---
# NAMD Bio-Science application
Run on server **btn**; this application could not be run on server aspire.nscc.sg.
## Problem with the benchmark
- Build NAMD
```bash=
CHARM_ARCH_UCX_GCC=ucx-linux-x86_64-gfortran-ompipmix-gcc \
CHARM_ARCH_MPI_GCC=mpi-linux-x86_64-gfortran-gcc \
CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_WORK_TREE=/HPCAI/NAMD/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
CHARM_BASE=$CHARM_CODE_DIR/built \
FFTW3_LIB_DIR=/HPCAI/NAMD/application/libs/fftw \
GCC_FFTW3_LIB_DIR=$FFTW3_LIB_DIR/3.3.8-shared-gcc840-avx2-broadwell \
APP_MPI_DIR=/HPCAI/NAMD/application/mpi \
HPCX_FILES_DIR=$APP_MPI_DIR/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
GCC_DIR=/usr \
GCC_PATH='"$GCC_DIR/bin/gcc "' \
GXX_PATH='"$GCC_DIR/bin/g++ -std=c++0x"' \
NATIVE_GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
CODE_NAME=namd \
CODE_GIT_TAG=FETCH_HEAD \
GIT_DIR=/HPCAI/NAMD/github/$CODE_NAME.git \
GIT_WORK_TREE=/HPCAI/NAMD/code \
NAMD_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
NAMD_DIR=$NAMD_CODE_DIR \
bash -c '
cd $NAMD_DIR;
CMD_BUILD_UCX_NAMD_GCC_FFTW3="
PATH=$GCC_DIR/bin:$PATH \
module purge && module load gcc && \
./config Linux-x86_64-g++ --with-memopt \
--charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
--with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
--cc $GCC_PATH --cc-opts $GCC_FLAGS \
--cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
&& cd Linux-x86_64-g++ && time -p make -j \
&& cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-fftw3 \
&& module purge"
CMD_BUILD_UCX_NAMD_GCC_MKL="
PATH=$GCC_DIR/bin:$PATH \
module purge && module load gcc && \
./config Linux-x86_64-g++ --with-memopt \
--charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
--with-mkl --mkl-prefix $MKL_DIR \
--cc $GCC_PATH --cc-opts $GCC_FLAGS \
--cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
&& cd Linux-x86_64-g++ && time -p make -j \
&& cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-mkl \
&& module purge"
CMD_BUILD_MPI_NAMD_GCC_FFTW3="
module purge && module load gcc && \
. $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
&& PATH=$GCC_DIR/bin:$PATH \
./config Linux-x86_64-g++ --with-memopt \
--charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
--with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
--cc $GCC_PATH --cc-opts $GCC_FLAGS \
--cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
&& cd Linux-x86_64-g++ && time -p make -j \
&& cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-fftw3 \
&& hpcx_unload && module purge"
CMD_BUILD_MPI_NAMD_GCC_MKL="
module purge && module load gcc && \
. $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
&& PATH=$GCC_DIR/bin:$PATH \
./config Linux-x86_64-g++ --with-memopt \
--charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
--with-mkl --mkl-prefix $MKL_DIR \
--cc $GCC_PATH --cc-opts $GCC_FLAGS \
--cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
&& cd Linux-x86_64-g++ && time -p make -j \
&& cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-mkl \
&& hpcx_unload && module purge"
eval $CMD_BUILD_UCX_NAMD_GCC_FFTW3
eval $CMD_BUILD_MPI_NAMD_GCC_FFTW3
eval $CMD_BUILD_UCX_NAMD_GCC_MKL;
eval $CMD_BUILD_MPI_NAMD_GCC_MKL;
wait
echo $CMD_BUILD_UCX_NAMD_GCC_FFTW3
echo $CMD_BUILD_MPI_NAMD_GCC_FFTW3
echo $CMD_BUILD_UCX_NAMD_GCC_MKL;
echo $CMD_BUILD_MPI_NAMD_GCC_MKL;
' 2>&1 | tee namdbuildlog
```
- error message

- Problem point (we do not understand this part)
```bash=
./config Linux-x86_64-g++ --with-memopt \
--charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
--with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
--cc $GCC_PATH --cc-opts $GCC_FLAGS \
--cxx $GXX_PATH --cxx-opts $GCC_FLAGS
```
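This fragment is the NAMD `./config` call. Judging by its flag names, it selects the build directory `Linux-x86_64-g++`, the Charm++ location and architecture, the FFTW3 prefix, and the C and C++ compilers with their options. One likely source of confusion is quoting: `GCC_PATH`, `GXX_PATH` and `GCC_FLAGS` contain literal double quotes inside their values, so the number of arguments `./config` receives depends on whether the line is passed through `eval`, as the build script does. The sketch below shows the difference with a hypothetical `show_args` helper and a shortened, made-up flag value.

```bash
#!/usr/bin/env bash
# Print the number of arguments received, then each argument on its own line.
show_args() {
  printf 'argc=%d\n' "$#"
  printf '[%s]\n' "$@"
}

# Same quoting style as GCC_FLAGS in the build script: literal double quotes inside the value.
FLAGS='"-O3 -DNDEBUG"'

# Without eval the embedded quotes are ordinary characters and the value splits on spaces.
show_args --cc-opts $FLAGS

# With eval the line is parsed a second time, so the embedded quotes group the flags into one argument.
eval "show_args --cc-opts $FLAGS"
```

The first call reports three arguments and the second reports two. Running the fragment on its own, outside the `eval` in the build script, would therefore pass different arguments to `./config`.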