---
title: NAMD Bio-Science Application
tags: APAC HPC-AI competition
---

[Benchmark](https://www.hpcadvisorycouncil.com/events/2020/APAC-AI-HPC/pdf/HPC-AI_Competition_NAMD_Benchmark_Guideline.pdf)

[NAMD and Charm++](https://www.youtube.com/watch?v=-hiCMAtX0Hc)

[3D Deep learning](https://www.youtube.com/watch?v=T-tFg8b6R-0&list=PLa6_BhFeNAXx4ASDtGBauhZvjVnr20xxs&index=5)

# NAMD Bio-Science Application

[TOC]

## What needs to be installed
- Charm++
- NAMD
- UCX
- OpenMPI / Intel MPI

---

## NAMD serves the NIH (health) mission and uses practical supercomputing for biomedical research
- 1115000 users cannot all be computer experts
    - 18% are NIH-funded; many in other countries
    - 34000 have downloaded more than one version
    - 13000 citations of NAMD reference papers
    - 1000 users per month download the latest release
- One program available on all platforms
    - Desktops and laptops - setup and testing
    - Linux clusters - affordable local workhorse
    - Supercomputers - "most used code" at XSEDE TACC
    - Petascale - "widest-used application" on Blue Waters
    - Exascale - early science on Frontier, Aurora
    - GPUs - from desktop to supercomputer
- User knowledge is preserved across platforms
    - No change in input or output files.
    - Run any simulation on **any number of cores**.
- Available free of charge to all.

---

## NAMD embeds the Tcl scripting language instead of Python
- To enable *portable* innovation *by users*
    - No need to recompile; scripts move between platforms
- Package management, portable
- Interfaces haven't changed
- Encapsulates mini-languages
- Used in VMD
- Looks like a simple scripting language, so it doesn't scare non-programmers

## NAMD runs well on NVIDIA GPUs
### What to expect:
- 1 GPU ≈ 100 CPU cores
    - Depending on CPU and GPU
- Scaling to 10K atoms/GPU
    - Assuming a fast network
- Must use smp/multicore
    - Many cores share each GPU
- Use multicore for a single node
- At most one process per GPU
- Must use `+pemap 0-n`
- Consider `+devices i,j`

### Why it may be wrong:
- Weak GPU (e.g. laptop)
- Too few CPU cores used
- Coarse-grained simulation
- Too few atoms per GPU
- Limited by network
- Limited by MPI (use verbs)
- Limited by special features
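Putting the single-node GPU bullets above together, a minimal launch might look like the sketch below. The binary path, core count, GPU indices, and the ApoA1 input file are placeholder assumptions, not values from the guideline.

```bash=
# Sketch of a single-node run with a CUDA multicore build
# (assumed: a 16-core node with 2 GPUs, ApoA1 benchmark input as placeholder).
./namd2 +p16 +pemap 0-15 +devices 0,1 apoa1/apoa1.namd > apoa1.log
```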
---

## But GPU Builds Disable Some Features
### Disabled
- Alchemical (FEP and TI)
- Locally enhanced sampling
- Tabulated energies
- Drude (nonbonded Thole)
- Go forces
- Pairwise interaction
- Pressure profile

### Not Disabled
- Memory-optimized builds
- Conformational free energy
- Collective variables
- Grid forces
- Steering forces
- Almost everything else

---

## NAMD and Charm++ Grew Up Together
### Parallel Programming Lab achievements:
- Charm++ parallel runtime system
- Gordon Bell Prize 2002
- IEEE Fernbach Award 2012
- 16 publications at SC 2012-16
- 6+ codes on Blue Waters

### Charm++ features used by NAMD:
- Parallel C++ with data-driven objects
- Asynchronous method invocation
- Prioritized scheduling of messages/execution
- Measurement-based load balancing
- Portable messaging layer

---

## A NAMD Build Script is Pretty Short
```bash=
tar xzf NAMD_2.14_Source.tar.gz
cd NAMD_2.14_Source
tar xf charm-6.10.1.tar
cd charm-6.10.1
./build charm++ verbs-linux-x86_64 smp icc --no-build-shared --with-production -j 8
cd ..    # back to NAMD_2.14_Source before configuring NAMD itself
./config Linux-x86_64-icc Frontera --with-mkl --charm-arch verbs-linux-x86_64-smp-icc
cd Linux-x86_64-icc
make release -j 8
```

---

## Must Choose Charm++ Build Options
- Choose network layer:
    - multicore (smp but only a single process, no network)
    - netlrts (UDP over Ethernet or loopback)
    - gni-crayx[ce] (Cray Gemini or Aries network)
    - verbs or ucx (InfiniBand)
    - mpi (fall back to the MPI library; use for Omni-Path)
- Choose smp or (default) non-smp:
    - smp dedicates one core per process to communication
- Optional compiler options:
    - iccstatic uses the Intel compiler and links Intel-provided libraries statically
- Also: `--no-build-shared --with-production`

---

## Plus a Few NAMD Build Config Options
- Choose network layer by pointing at the Charm++ build:
    - `--charm-base <build-dir> --charm-arch <arch>`
- Choose FFTW 3 or Intel MKL rather than the default FFTW 2:
    - `--with-fftw3` (you want 3, but our binaries ship with 2)
- Options:
    - `--with-cuda` (NVIDIA GPU acceleration, requires smp)
    - `--with-memopt` (use compressed structure files)
        - Using an smp build is the simplest way to reduce memory usage

---

## Charm++ Launch Varies Across Platforms
- Multicore - run the namd2 binary directly, specify `+p<P>`
- MPI, ucx, and gni (Cray) builds follow the system docs
    - Typically mpirun, mpiexec, aprun, srun
    - **Specify `+ppn M`** (also `+pemap` ...) to namd2 for smp builds
- Others (verbs, netlrts) use charmrun
    - `charmrun ++n N [++ppn M] /path/to/namd2 args`
    - N is the number of processes; older Charm++ required `++p P` (total PEs)
    - With a queueing system use `++mpiexec[-no-n]`, `++remote-shell`
    - Otherwise you must set up a nodelist file (see notes.txt, Charm++ docs)
    - For a single host you can use `++local`
- Arguments to charmrun begin with "++", to namd2 with "+"
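For the charmrun case, a single-host smoke test along the lines of the bullets above might look like this sketch; the binary path, thread counts, and config file name are placeholders, not taken from the guideline.

```bash=
# ++local: launch on this host only, no nodelist file needed.
# One process (++n 1) with 7 worker threads plus a communication thread;
# worker PEs pinned to cores 1-7, the communication thread to core 0.
./charmrun ++local ++n 1 ++ppn 7 /path/to/namd2 +pemap 1-7 +commap 0 your_config.namd
```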
---

## NAMD command line is consistent
- `+pemap m-n[:stride.run+add+add][,...]`
    - For example, 0-31:8.7 = 0-6, 8-14, 16-22, 24-30
- For smp also `+commap`, e.g. 7-31:8 = 7,15,23,31
- Path(s) to NAMD simulation config file(s)
- Accepts any config option on the command line as `--name value`
- Files/arguments are treated as if concatenated
    - Matters if run/minimize commands are used
    - Startup happens at the first run/minimize command or at the end of the file(s)
- All paths are relative to the config file
    - To avoid this you can do `--source /path/to/file`

---

## NAMD/Charm++ Performance Tips
- <font color=red>DO NOT</font> use the MPI network layer (except on Omni-Path)
    - The low-level verbs, gni, pami, ucx layers exist because they are faster
    - Leverage MPI startup via `charmrun ++mpiexec`
    - See also `++scalable-start`, `++remote-shell`, `++runscript`
- <font color=red>DO</font> use SMP builds for larger simulations
    - Reduced memory usage and often faster
    - Trade-off: the communication thread is not available for work
    - Major direction of future optimization and tuning
- <font color=red>DO</font> set processor affinity explicitly
    - For example: `++ppn 7 +commap 0,8 +pemap 1-7,9-15`
    - Cray by default tends to lock all threads onto the same core
- <font color=red>DO</font> save one core for the OS to improve scaling
    - Cray `aprun -r 1` reserves the last core and forces the OS to run on it

---

## Measure Relevant Performance
- Benchmark your users' real science on your machine
- <font color=red>DO NOT</font> simply "time namd2 ..."
    - That includes startup and load balancing
    - You really want the marginal cost of an additional ns
    - Startup time is highly variable across runs
- Need 500-1000 steps for load balancing
    - Several "LDB:" outputs near the beginning of the run
- Look for several "Benchmark time:" lines in the output
    - For "TIMING:" output only the wall(clock) time matters
- Be sure to benchmark dynamics, not minimization

---

### Example 1: ALCF Theta Build and Run Options
- 64-core processors, Cray Aries network
- `build charm++ gni-crayxc persistent smp` with `-xMIC-AVX512`
- `aprun -n $((7*$nodes)) -N 7 -d 17 -j 2 -r 1`
- `+ppn 16 +pemap 0-55 +commap 56-62`

#### ALCF Theta Run Option Math
- 64 cores, reserve one for the OS (`-r 1`), leaves 63
- 63 = 9x7 = 9x(6+1) = 54 PE + 9 comm: `+ppn 12 +pemap 0-53+64 +commap 54-62`
- 63 = 7x9 = 7x(8+1) = 56 PE + 7 comm: `+ppn 16 +pemap 0-55+64 +commap 56-62`
- 60 = 4x15 = 4x(14+1) = 56 PE + 4 comm: `+ppn 28 +pemap 0-63:16.14+64 +commap 14-62:16`

### Example 2: TACC Stampede KNL Build and Run Options
- 68-core processors, Intel Omni-Path network
- `build mpi-linux-x86_64 smp icc` with `-xMIC-AVX512`
- `sbatch --ntasks=$((13*$nodes))`
- `+ppn 8 +pemap 0-51+68 +commap 53-65` (or `+ppn 4 +pemap 0-51 +commap 53-65`)

#### TACC Stampede KNL Run Option Math
- 68 cores, reserve one for the OS, leaves 67
- 65 = 13x5 = 13x(4+1) = 52 PE + 13 comm: `+ppn 8 +pemap 0-51+68 +commap 53-65`
- 66 = 6x11 = 6x(10+1) = 60 PE + 6 comm: `+ppn 20 +pemap 0-59+68 +commap 60-65`
- 68 = 4x17 = 4x(16+1) = 64 PE + 4 comm: `+ppn 32 +pemap 0-63+68 +commap 64-67`

### KNL Run Option Reasoning
- Leave a core free to isolate OS noise
- Pairs of cores on a "tile" share 1 MB of L2 cache
    - <font color=red>Do not</font> split a tile between PEs of different nodes
    - OK to split a tile between comm threads
- Use 1 or 2 hyperthreads per PE core
- Dedicate a core to each comm thread
    - Need several comm threads per host
    - Fewer for Cray Aries than for Intel Omni-Path
- Multiple copies of static data reduce memory contention
- Different configurations fit the 64-core vs 68-core models

---

![](https://i.imgur.com/3WBZldt.png)
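To pull out the numbers the benchmarking bullets above care about, something like the following works on a NAMD log (a sketch; the exact wording of the "Benchmark time:" lines can vary slightly between versions, and `namd_run.log` is a placeholder name).

```bash=
# Show the periodic benchmark lines that appear after load balancing settles.
grep "Benchmark time:" namd_run.log
# Print just the wall-clock seconds per step (the field right before "s/step").
grep "Benchmark time:" namd_run.log \
    | awk '{for (i = 1; i <= NF; i++) if ($i == "s/step") print $(i-1)}'
```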
---

## UCX NAMD Hangs on Frontera
- Appears to be an issue in the UCX library:
    - Not fixed in the UCX 1.8.1 release
    - Likely fixed in UCX master and the 1.9.x branches
    - Download from https://github.com/openucx/ucx
- Monitor the Charm++ issue for updates:
    - https://github.com/UIUC-PPL/charm/issues/2716

---

## Upcoming Changes
- NAMD 2.14 release imminent
    - Will have the same performance as the 2.14b2 release
- Waiting to merge after the 2.14 release:
    - Support for AMD's "HIP" GPU API
    - Improved tile-based AVX-512 kernel
    - https://charm.cs.illinois.edu/gerrit/q/project:namd
- NAMD 3.0 single-node alpha release
    - Greatly improved CUDA performance
    - Single process per replica, but supports multi-copy

---

## Quick reference: `module` commands
- `module help` – brief usage information for `module`
- `module avail` – list the software available on the system
- `module load` – load the specified software
- `module purge` – unload all loaded software
- `module list` – list the currently loaded software

---

## Problems with the Benchmark

1. Getting the `FFTW3 tar file` & `HPC-X 2.6 tar file` failed

```bash=
wget http://www.fftw.org/fftw-3.3.8.tar.gz \
    -O ./code/fftw-3.3.8.tar.gz
```
```
wget ./code/fftw-3.3.8.tar.gz : no such file or directory
```
:heavy_check_mark: **solution**: do not paste the command in directly; type it out yourself

2. Failed when checking out `Charm++ v6.10.1` & `NAMD 2.13` and untarring `HPC-X 2.6`

```bash=
...
GIT_WORK_TREE=./cluster/thor/code \
...
```
```bash=
APP_MPI_PATH=./cluster/thor/application/mpi \
...
```
```
fatal: Invalid path '/HPCAI/NAMD/cluster': No such file or directory
```
:heavy_check_mark: **solution**: `mkdir code` under `NAMD/` and use absolute paths

```bash=
...
GIT_WORK_TREE=/HPCAI/NAMD/code \
...
```

3. Building FFTW

```bash=
CODE_NAME=fftw \
CODE_TAG=3.3.8 \
CODE_BASE_DIR=/HPCAI/NAMD/code \
CODE_DIR=$CODE_BASE_DIR/$CODE_NAME-$CODE_TAG \
INSTALL_DIR=/HPCAI/NAMD/application/libs/fftw \
CMAKE_PATH=/usr/bin/cmake \
GCC_PATH=/usr/bin/gcc \
NATIVE_GCC_FLAGS='"-march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
bash -c '
CMD_REBUILD_CODE_DIR="rm -fr $CODE_DIR \
    && tar xf /HPCAI/NAMD/code/$CODE_NAME-$CODE_TAG.tar.gz -C $CODE_BASE_DIR"

### To build shared library (single precision) with GNU Compiler
BUILD_LABEL=$CODE_TAG-shared-gcc930-avx2-broadwell \
CMD_BUILD_SHARED_GCC=" \
    mkdir $CODE_DIR/build-$BUILD_LABEL; \
    cd $CODE_DIR/build-$BUILD_LABEL \
    && $CMAKE_PATH .. \
        -DBUILD_SHARED_LIBS=ON -DENABLE_FLOAT=ON \
        -DENABLE_OPENMP=OFF -DENABLE_THREADS=OFF \
        -DCMAKE_C_COMPILER=$GCC_PATH -DCMAKE_CXX_COMPILER=$GCC_PATH \
        -DENABLE_AVX2=ON -DENABLE_AVX=ON \
        -DENABLE_SSE2=ON -DENABLE_SSE=ON \
        -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR/$BUILD_LABEL \
        -DCMAKE_C_FLAGS_RELEASE=$GCC_FLAGS \
        -DCMAKE_CXX_FLAGS_RELEASE=$GCC_FLAGS \
    && time -p make VERBOSE=1 V=1 install -j \
    && cd $INSTALL_DIR/$BUILD_LABEL && ln -s lib64 lib | tee $BUILD_LABEL.log "

eval $CMD_REBUILD_CODE_DIR;
eval $CMD_BUILD_SHARED_GCC &
wait
echo $CMD_REBUILD_CODE_DIR;
echo $CMD_BUILD_SHARED_GCC
' | tee fftw3buildlog 2>&1
```

- Run a small FFTW benchmark with/without SIMD

```bash=
./build-3.3.8-shared-icc20-avx2-broadwell/bench -o patient -o nosimd 10240
```
- Error message:
![](https://i.imgur.com/ZCXRTi4.png)
:heavy_check_mark: **solution**: we did not install icc, so drop the icc build and use the GCC one
- Another problem with the `bash -c '...'` wrapper:
![](https://i.imgur.com/MNKfU8h.png)
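Following up on the icc issue above: with icc unavailable, the same benchmark can be pointed at the GCC build directory instead. This is only a sketch; the path is assembled from the `BUILD_LABEL` used in the build script above and assumes the `bench` tool sits in that build directory, as in the command just shown.

```bash=
# SIMD enabled (hypothetical path built from the gcc930 BUILD_LABEL above)
./build-3.3.8-shared-gcc930-avx2-broadwell/bench -o patient 10240
# SIMD disabled, for comparison
./build-3.3.8-shared-gcc930-avx2-broadwell/bench -o patient -o nosimd 10240
```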
4. Building Charm++

```bash=
CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_DIR=/HPCAI/NAMD/github/$CODE_NAME.git \
GIT_WORK_TREE=/HPCAI/NAMD/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
CHARM_DIR=$CHARM_CODE_DIR \
APP_MPI_PATH=/HPCAI/NAMD/application/mpi \
HPCX_FILES_DIR=$APP_MPI_PATH/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
HPCX_MPI_DIR=$HPCX_FILES_DIR/ompi \
HPCX_UCX_DIR=$HPCX_FILES_DIR/ucx \
UCX_DIR=$SELF_BUILT_DIR \
UCX_DIR=$HPCX_UCX_DIR \
GCC_DIR=/usr/bin \
NATIVE_GCC_FLAGS="-march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG" \
GCC_FLAGS="-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG" \
bash -c '
CMD_REBUILD_BUILD_DIR="rm -fr $CHARM_DIR/built && mkdir $CHARM_DIR/built;"

### To build UCX executables with HPC-X OpenMPI + GCC8.4.0
CMD_BUILD_UCX_CHARM_GCC="
    module purge && module load gcc \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ ucx-linux-x86_64 ompipmix \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        --basedir=$UCX_DIR \
        gcc gfortran $GCC_FLAGS \
    && module purge;"

### To build MPI executables with HPC-X OpenMPI + GCC8.4.0
CMD_BUILD_MPI_CHARM_GCC="
    module purge && module load gcc \
    && . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh \
    && hpcx_load \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ mpi-linux-x86_64 \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        gcc gfortran $GCC_FLAGS \
    && hpcx_unload && module purge;"

eval $CMD_REBUILD_BUILD_DIR;
eval $CMD_BUILD_UCX_CHARM_GCC &
eval $CMD_BUILD_MPI_CHARM_GCC &
wait
echo $CMD_REBUILD_BUILD_DIR;
echo $CMD_BUILD_UCX_CHARM_GCC;
echo $CMD_BUILD_MPI_CHARM_GCC;
' | tee charmbuildlog 2>&1
```

- When I use my own account:
![](https://i.imgur.com/ovP85SE.png)
- When I use root:
![](https://i.imgur.com/iHzEXFp.png)
:heavy_check_mark: **solution**: use `bash -i` instead of `sudo bash`
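The `bash -i` fix most likely works because `module` is usually a shell function defined by the interactive shell's startup files, so a plain non-interactive `sudo bash` never gets it. A quick way to check (generic commands, not from the guideline):

```bash=
# In a non-interactive shell "module" is often undefined:
bash -c 'type module'                      # typically reports "module: not found"
# An interactive shell sources the startup files that usually define it:
bash -i -c 'type module && module avail'
```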
5. Building NAMD

```bash=
CHARM_ARCH_UCX_GCC=ucx-linux-x86_64-gfortran-ompipmix-gcc \
CHARM_ARCH_MPI_GCC=mpi-linux-x86_64-gfortran-gcc \
CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_WORK_TREE=/HPCAI/NAMD/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
CHARM_BASE=$CHARM_CODE_DIR/built \
FFTW3_LIB_DIR=/HPCAI/NAMD/application/libs/fftw \
GCC_FFTW3_LIB_DIR=$FFTW3_LIB_DIR/3.3.8-shared-gcc840-avx2-broadwell \
APP_MPI_DIR=/HPCAI/NAMD/application/mpi \
HPCX_FILES_DIR=$APP_MPI_DIR/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
GCC_DIR=/usr \
GCC_PATH='"$GCC_DIR/bin/gcc "' \
GXX_PATH='"$GCC_DIR/bin/g++ -std=c++0x"' \
NATIVE_GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
CODE_NAME=namd \
CODE_GIT_TAG=FETCH_HEAD \
GIT_DIR=/HPCAI/NAMD/github/$CODE_NAME.git \
GIT_WORK_TREE=/HPCAI/NAMD/code \
NAMD_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-20-08-11 \
NAMD_DIR=$NAMD_CODE_DIR \
bash -c '
cd $NAMD_DIR;

CMD_BUILD_UCX_NAMD_GCC_FFTW3="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc && \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
        --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-fftw3 \
    && module purge"

CMD_BUILD_UCX_NAMD_GCC_MKL="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc && \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-mkl \
    && module purge"

CMD_BUILD_MPI_NAMD_GCC_FFTW3="
    module purge && module load gcc && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
        --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-fftw3 \
    && hpcx_unload && module purge"

CMD_BUILD_MPI_NAMD_GCC_MKL="
    module purge && module load gcc && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-mkl \
    && hpcx_unload && module purge"

eval $CMD_BUILD_UCX_NAMD_GCC_FFTW3
eval $CMD_BUILD_MPI_NAMD_GCC_FFTW3
eval $CMD_BUILD_UCX_NAMD_GCC_MKL;
eval $CMD_BUILD_MPI_NAMD_GCC_MKL;
wait
echo $CMD_BUILD_UCX_NAMD_GCC_FFTW3
echo $CMD_BUILD_MPI_NAMD_GCC_FFTW3
echo $CMD_BUILD_UCX_NAMD_GCC_MKL;
echo $CMD_BUILD_MPI_NAMD_GCC_MKL;
' | tee namdbuildlog 2>&1
```

- Error message:
![](https://i.imgur.com/PXqzH0x.png)
- The failing part (not yet understood):

```bash=
./config Linux-x86_64-g++ --with-memopt \
    --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
    --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
    --cc $GCC_PATH --cc-opts $GCC_FLAGS \
    --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
```
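The options in this `./config` call match the ones described under "Plus a Few NAMD Build Config Options" above. One thing worth checking when it fails (a guess, not a confirmed diagnosis) is whether the `--charm-arch` string really matches a directory produced by the Charm++ build in step 4:

```bash=
# The Charm++ build creates a directory named after the arch string under the
# charm base; ./config cannot proceed if --charm-arch does not match one of them.
ls "$CHARM_BASE"
ls -d "$CHARM_BASE/$CHARM_ARCH_UCX_GCC" && echo "charm arch directory found"
```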