# Introduction to HPC: molecular dynamics simulations with GROMACS

## Introduction

:::success
:dart: **Learning goals**
* Becoming familiar with the GROMACS tools used in the exercises.
* Understanding the features of the `mdrun` command line interface used in the exercises.
* Understanding the `mdrun` log file structure and its key parts.
:::

[//]: # (### Performance in MD)

[//]: # (Molecular dynamics simulations do the same calculation repeatedly, typically for a large number of time-steps.)

Authors: Szilárd Páll ([doi:10.6084/m9.figshare.22303477](https://doi.org/10.6084/m9.figshare.22303477)), adapted by Cathrine Bergh for Devana

### The GROMACS simulation engine

GROMACS is a molecular simulation package which comes with a molecular dynamics simulation engine (`mdrun`), a set of analysis tools, and the gmxapi Python API. GROMACS is highly flexible and can be built in various ways depending on the target hardware architecture and the parallelization features enabled. The GROMACS features and dependencies enabled at compile time are shown in the _version header_, which is printed by the `gmx -version` command and also appears at the top of the simulation log output.

All functionalities of the GROMACS package, the simulation engine and the tools, are provided by the `gmx` program through subcommands. The program name can have suffixes, e.g. MPI builds are typically installed as `gmx_mpi`.

#### The `mdrun` simulation tool

`mdrun` is the simulation tool, a command line program which allows running on CPUs and GPUs. It is invoked as a subcommand of the main program, e.g. `gmx mdrun`. The `mdrun` functionalities available in a specific build depend on the options GROMACS was configured with and can be seen in the version header.

The following are the main performance-related command line options used in the exercises:
* `-g LOGFILE` set a custom name for the log file (default `md.log`);
* `-pin on` enable `mdrun` internal thread affinity setting (might override externally set affinities);
* `-nsteps N` set the number of simulation steps for the current run to `N` (`N=-1` means an infinitely long run, intended to be combined with `-maxh`);
* `-maxh H` stop the simulation after `0.99*H` hours;
* `-resethway` reset performance counters halfway through the run;
* `-nb`/`-pme`/`-bonded`/`-update` task assignment options used to select whether a task runs on the CPU or the GPU.

For further information on the `mdrun` simulation tool command line options and features see the [online documentation](https://manual.gromacs.org/current/onlinehelp/gmx-mdrun.html).

#### The `mdrun` log file

The log file of the `mdrun` simulation engine contains extensive information about the GROMACS build, the hardware detected at runtime, the complete set of simulation settings, diagnostic output related to the run and its performance, as well as physics and performance statistics.

The version header in a GROMACS `mdrun` log:
![](https://hackmd.io/_uploads/B1XEckUZa.png)

The hardware detection section of the `mdrun` log:
![](https://hackmd.io/_uploads/rJ-qq1IWa.png)

The performance accounting in a multi-rank simulation:
![](https://hackmd.io/_uploads/BJFs918Zp.png)

## Simulation input system

In these exercises we will be using a simulation system consisting of the [Alcohol dehydrogenase](https://en.wikipedia.org/wiki/Alcohol_dehydrogenase) protein solvated in a box of water, set up with the AMBER99SB-ildn force field.
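Once the run input file has been downloaded (the download link follows right below), you can optionally check it against your GROMACS installation before submitting any jobs. A minimal sketch, assuming the file is saved as `topol.tpr` and the `gmx_mpi` binary from the cluster's GROMACS module is on your path:

```bash
# Print the version header: build options, SIMD/GPU/MPI support, etc.
gmx_mpi -version

# Dump the start of the run input file to verify it reads correctly
# (assumes the downloaded file was saved as topol.tpr)
gmx_mpi dump -s topol.tpr | head -n 30
```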
The simulation input file (tpr) can be obtained from:
* [this link](https://zenodo.org/record/7941558)

## Using multi-core CPUs in a compute node

In the first two exercises we will be using the NSCC/SAS Devana cluster, whose compute nodes have 2 x Intel Xeon Gold 6338 processors (32 CPU cores each) and a 100 Gbit/s HDR InfiniBand interconnect. This means we can use at most 64 cores per node for any of the CPU runs below. For further details on the architecture and usage see the [system documentation](https://userdocs.nscc.sk/devana/system_overview/introduction/).

:::success
:dart: **Learning goals**
* Understanding the impact of using multiple CPU cores on different parts of the MD computation.
* Understanding the trade-off between using threads and MPI ranks.
:::

As discussed in the lecture, there are two ways to express data parallelism in GROMACS: multi-threading and spatial decomposition across MPI ranks. First, we will explore using OpenMP multi-threading alone to run the GROMACS simulation engine on the CPU cores of a compute node. Then we will combine MPI ranks and OpenMP threads to distribute work.

:::info
:bulb: **The GROMACS log file**
Before starting, take a look at the introduction on the GROMACS `mdrun` tool and its log file.
:::

### Exercise 1.1: OpenMP multi-threading

Starting with the following job script we can explore multi-threading in GROMACS. A few notes on the job script:

* The option `-pin on` enables `mdrun` internal thread affinity. This means that threads will be pinned to cores (hardware threads) and will not be permitted to be moved around by the operating system, which improves performance.
* `-nsteps N` sets the number of simulation steps (iterations) the MD engine will run (we picked the value here so runs are reasonably fast).
* `-s` specifies the input tpr file which contains the topology information needed to run the simulation. Note that this option isn't necessary when the tpr file has the default name `topol.tpr` and is placed in the working directory (which is the case in this tutorial).

```bash=
#!/bin/bash
#SBATCH --time=00:10:00       # maximum execution time of 10 minutes
#SBATCH --nodes=1             # requesting 1 compute node
#SBATCH --ntasks=1            # use 1 MPI rank (task)
#SBATCH --exclusive           # no node sharing to make performance analysis easier
#SBATCH --cpus-per-task=...   # modify this number of CPU cores per MPI task

# load the necessary modules
module load GROMACS

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Add your input (tpr) file in the command below
mpirun -np 1 gmx_mpi mdrun -s topol.tpr -pin on -ntomp $SLURM_CPUS_PER_TASK -g ex1.1_1x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 10000 -resethway
```

* Modify the script varying the number of CPU cores used (`--cpus-per-task`) and submit 3 different jobs, one with each new configuration.
* Look at the `mdrun` log file output (the files will be named `ex1.1_1xN_jIDXXXXXX.log`) and report your three performance values to the shared [document](https://docs.google.com/spreadsheets/d/1cDMNIlsFPlJTCT0zw7-khdFS8mKICgYNTPgarx_QTds/edit?usp=sharing).

:::warning
* How does the absolute performance (ns/day) change when increasing the number of cores used?
* How does the wall-time of various computations change with the thread count (e.g. the "Force" and "PME mesh" tasks)?
:::

:::info
:bulb: `SBATCH` arguments provided in the job script header can also be passed on the command line (e.g. `sbatch --cpus-per-task=N`), overriding the setting in the job script header. Doing so allows varying submission parameters without having to edit the job script.
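For example, the whole core-count scan for this exercise could be submitted in one go. A minimal sketch, assuming the job script above is saved as `ex1.1.sh` (an illustrative name) and that its `--cpus-per-task` header line holds a valid default value (e.g. 1) so the script parses cleanly:

```bash
# Submit one job per core count; the command-line option overrides
# the --cpus-per-task value in the (hypothetical) ex1.1.sh job script header.
for n in 1 16 32 64; do
    sbatch --cpus-per-task=$n ex1.1.sh
done
```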
:::

:::spoiler Help with the solution
We can use the `cat` command to print the content of the log file, or use the `grep` and `awk` tools to pick out rows of the performance table or fields of the performance summary. E.g. `grep "Force " *.log` allows inspecting the runtime statistics of the force computation (the wall-time of that task).
```
$ grep "Performance: " *.log | awk '{print $1, "\t", $2}'
ex1.1_1x1_jID42238.log:Performance: 3.214
ex1.1_1x24_jID42242.log:Performance: 44.121
ex1.1_1x64_jID42244.log:Performance: 79.894
$ grep "Force " *.log | awk '{print $1, "\t", $(NF-2)}'
ex1.1_1x1_jID42238.log: 145.519
ex1.1_1x24_jID42242.log: 8.451
ex1.1_1x64_jID42244.log: 3.722
$ grep "PME mesh " *.log | awk '{print $1, "\t", $(NF-2)}'
ex1.1_1x1_jID42238.log: 98.029
ex1.1_1x24_jID42242.log: 7.888
ex1.1_1x64_jID42244.log: 4.005
```
:::

### Exercise 1.2: MPI ranks and OpenMP threads

Recall from the lecture that domain decomposition allows spatially decomposing simulation data into _domains_ to parallelize over the distributed resources of an HPC machine. `mdrun` assigns all computation associated with a domain to an _MPI rank_ (also called _MPI task_). This work can further be distributed across multiple OpenMP threads within a rank. Hence, an MPI rank can use one or multiple threads and can be assigned one or multiple CPU cores.

Continue from the previous experiments, now using all CPU cores in a compute node (64 on Devana). In this exercise we will explore how different ways of distributing work across MPI tasks and OpenMP threads affect performance.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00       # maximum execution time of 10 minutes
#SBATCH --nodes=1             # requesting 1 compute node
#SBATCH --ntasks=...          # vary this value
#SBATCH --cpus-per-task=...   # change this value as well

# load the necessary modules
module load GROMACS

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Add your input (tpr) file in the command below
mpirun gmx_mpi mdrun -s topol.tpr -pin on -g ex1.2_${SLURM_NTASKS}x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 20000 -resethway
```

* Submit three different combinations of MPI rank / OpenMP thread counts, making sure to use all CPU cores in a node (`ntasks` x `cpus-per-task` should be 64).
* Look at the `mdrun` log file outputs (files will be named `ex1.2_NxM_jIDXXXXXX.log`). Again, report your performance measurements to the shared [document](https://docs.google.com/spreadsheets/d/1cDMNIlsFPlJTCT0zw7-khdFS8mKICgYNTPgarx_QTds/edit?usp=sharing).

:::warning
* Check the absolute performance in the logs. Which is the most efficient setup?
:::

:::spoiler Help with the results
[Sample log files for this exercise session.](https://doi.org/10.5281/zenodo.10004861)
:::

## Scaling GROMACS across multiple nodes with CPUs

:::success
:dart: **Learning goals**
* Understanding how data and task decomposition is used in the `mdrun` simulation engine.
* Exploring the multi-node strong scaling of GROMACS simulations.
:::

In this exercise, we will explore the use of multiple compute nodes to improve simulation throughput. First we will rely on domain decomposition alone to scale to larger numbers of processors and observe the limitations of this approach; then we will look at how OpenMP can help.

### Exercise 2.1: Multi-node scaling

Continuing from Exercise 1.2, we will now investigate how the `mdrun` performance scales when using multiple compute nodes.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=...           # Vary this value and
#SBATCH --ntasks=...          # change this
#SBATCH --cpus-per-task=...   # and this value as well.

module load GROMACS

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun gmx_mpi mdrun -s topol.tpr -pin on -g ex2.1_${SLURM_NNODES}N_${SLURM_NTASKS}x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 40000 -resethway
```
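The multi-node runs described next can also be submitted by overriding the resource requests on the command line. A minimal sketch, assuming the job script above is saved as `ex2.1.sh` (an illustrative name) with valid default values in its header lines:

```bash
# Scan 2, 4, and 8 nodes with one MPI rank per core (64 cores per Devana node);
# the command-line options override the (hypothetical) ex2.1.sh header values.
for nodes in 2 4 8; do
    sbatch --nodes=$nodes --ntasks=$((nodes * 64)) --cpus-per-task=1 ex2.1.sh
done
```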
Starting with the most efficient MPI ranks / OpenMP threads setup from the previous exercise, we will extend the run to multiple compute nodes. Next, we will check whether a different MPI/OpenMP setup can give better performance at larger node counts.

* Take the most efficient MPI ranks / OpenMP threads single-node configuration from Ex 1.2 (likely the one with `--cpus-per-task=1`).
* Submit runs extending the same setup to 2, 4, and 8 nodes (e.g. `--nodes=2 --ntasks=128 --cpus-per-task=1`). Report your results to the [shared document](https://docs.google.com/spreadsheets/d/1cDMNIlsFPlJTCT0zw7-khdFS8mKICgYNTPgarx_QTds/edit?exids=71471476,71471470#gid=1548649358).
* If you have time, try to adjust the ratio between MPI ranks and OpenMP threads.

:::warning
* Look at the absolute performance in the log file. How does the simulation performance (ns/day) change when increasing the number of nodes used?
:::

:::spoiler Help with the results
[Sample log files for this exercise session.](https://doi.org/10.5281/zenodo.10004861)
:::

## GPU accelerated simulations

:::success
:dart: **Learning goals**
* Understanding how the GROMACS heterogeneous parallelization allows moving tasks between the CPU and GPU, and the impact of this on performance.
* Understanding the difference between force-offload and GPU-resident modes.
:::

The GROMACS MD engine uses heterogeneous parallelization which can flexibly utilize both CPU and GPU resources. As discussed in the lecture, there are two offload modes:
* In the _force offload_ mode some or all forces are computed on the GPU, but they are transferred to the CPU every iteration for integration.
* In the _GPU-resident_ mode the integration happens on the GPU, allowing the simulation state to reside on the GPU for tens or hundreds of iterations.

Further details can be found in the [GROMACS user guide](https://manual.gromacs.org/current/user-guide/mdrun-performance.html#running-mdrun-with-gpus) and [DOI:10.1063/5.0018516](https://aip.scitation.org/doi/full/10.1063/5.0018516).

In this exercise we will learn how moving tasks between the CPU and GPU impacts performance. We will be using Devana's accelerated nodes, each with 2 x Intel Xeon Gold 6338 CPUs and 4 x NVIDIA A100 GPUs. For further details on the architecture and usage see the [system documentation](https://userdocs.nscc.sk/devana/system_overview/introduction/).

### Exercise 3.1: GPU offloading force computations

There are three force tasks that can be offloaded to a GPU in GROMACS. The assignment of these tasks is controlled by the `mdrun` command line option corresponding to each task:
* (short-range) nonbonded: `-nb ASSIGNMENT`
* particle mesh Ewald: `-pme ASSIGNMENT`
* bonded: `-bonded ASSIGNMENT`

The possible "`ASSIGNMENT`" values are `cpu`, `gpu`, or `auto`.

We will start by using one GPU with 16 CPU cores (a quarter of a Devana accelerated node) in a simulation and assess how the performance changes as different parts of the force calculation are offloaded to the GPU.
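For the incremental-offload jobs requested after the baseline job script below, only the `mdrun` task-assignment flags need to change. A sketch of one such variant, offloading just the short-range nonbonded task (the `-g` log-file name is an arbitrary illustrative choice):

```bash
# Same as the baseline mdrun line in the script below, but with the
# short-range nonbonded task assigned to the GPU.
gmx_mpi mdrun -nb gpu -pme cpu -bonded cpu -update cpu \
    -g ex3.1_nbgpu_jID${SLURM_JOB_ID} \
    -nsteps -1 -maxh 0.017 -resethway -notunepme
```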
```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=64

# load the modules
module load GROMACS/2023.2-intelmkl-CUDA-12.0

export OMP_NUM_THREADS=16

gmx_mpi mdrun -nb cpu -pme cpu -bonded cpu -update cpu -g ex3.1_${SLURM_NTASKS}x${OMP_NUM_THREADS}_jID${SLURM_JOB_ID} -nsteps -1 -maxh 0.017 -resethway -notunepme
```

* Note that we use time-limited runs here by passing `-maxh 0.017`, which sets the run time limit in hours (~one minute); we do this because the simulation throughput changes significantly across setups and we want to avoid both very long wait times and unreliable benchmark measurements caused by just a few seconds of runtime.
* As a baseline, first launch a run assigning all tasks to the CPU (as above).
* Next, submit jobs incrementally offloading the force tasks (`-nb`, `-pme`, `-bonded`) to the GPU. Report your performance values to the shared [document](https://docs.google.com/spreadsheets/d/1cDMNIlsFPlJTCT0zw7-khdFS8mKICgYNTPgarx_QTds/edit?usp=sharing).

:::warning
* How does the performance (ns/day) change as more tasks are offloaded?
* Look at the performance table in the log and observe how the fraction of wall-time spent in the tasks left on the CPU changes. (Note that the log file performance report only contains timings of tasks executed on the CPU, not of those offloaded to the GPU, plus the CPU time spent launching GPU work and waiting for GPU results.)
:::

### Exercise 3.2: GPU-resident mode

Continuing from the previous exercise, we will now explore using the GPU-resident mode.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=64

# load the modules
module load GROMACS/2023.2-intelmkl-CUDA-12.0

export OMP_NUM_THREADS=16

gmx_mpi mdrun -nb gpu -pme gpu -bonded gpu -update gpu -g ex3.2_${SLURM_NTASKS}x${OMP_NUM_THREADS}_jID${SLURM_JOB_ID} -nsteps -1 -maxh 0.017 -resethway -notunepme
```

* Submit a fully offloaded GPU-resident job using the `-update gpu` option (as above).
* Since we moved most computation to the GPU, the CPU cores are left unused. The GROMACS heterogeneous engine allows moving work back to the CPU, so next let's try to also utilize the CPU cores for potential performance benefits: try moving the bonded tasks back to the CPU. Again, report your performance results in the shared [document](https://docs.google.com/spreadsheets/d/1cDMNIlsFPlJTCT0zw7-khdFS8mKICgYNTPgarx_QTds/edit?usp=sharing).
* If you have time, try adding more cores to the run by setting `OMP_NUM_THREADS=32` in the job script.

:::warning
* How does the GPU-resident mode perform compared to the best-performing force-offload run from Ex 3.1?
* How did the performance change when force tasks were moved back to the CPU?
:::

:::spoiler Help with the results
[Sample log files for the exercise session.](https://doi.org/10.5281/zenodo.10004861)
:::

:::info
**License**
This material is shared under the CC BY 4.0 license.
:::