# Introduction to HPC: molecular dynamics simulations with GROMACS

## Introduction

:::success
:dart: **Learning goals**
* Familiarizing with the GROMACS tools used in the exercises.
* Understanding the features of the `mdrun` command line interface used in the exercises.
* Understanding the `mdrun` log file structure and its key parts.
:::

[//]: # (### Performance in MD)

[//]: # (Molecular dynamics simulations do the same calculation repeatedly, typically for a large number of time-steps. )

### The GROMACS simulation engine

GROMACS is a molecular simulation package which comes with a molecular dynamics simulation engine (`mdrun`), a set of analysis tools, and a Python API (gmxapi). GROMACS is highly flexible and can be built in various ways depending on the target hardware architecture and the parallelization features enabled. The GROMACS features and dependencies enabled at compile time are shown in the _version header_, which can be printed with the `gmx -version` command and also appears at the top of the simulation log output.

All functionalities of the GROMACS package, the simulation engine and the tools, are provided through subcommands of the `gmx` program. The program name can have suffixes, e.g. MPI builds are typically installed as `gmx_mpi`.

#### The `mdrun` simulation tool

`mdrun` is the simulation tool, a command line program which allows running on CPUs and GPUs. The tool can be invoked as a subcommand of the main program, e.g. `gmx mdrun`. The `mdrun` functionalities available in a specific build depend on the options GROMACS was configured with and can be seen in the version header.

The following are the mainly performance-related command line options used in the exercises:
* `-g LOGFILE` sets a custom name for the log file (default `md.log`);
* `-pin on` enables `mdrun` internal thread affinity setting (might override externally set affinities);
* `-tunepme`/`-notunepme` enables/disables PME task load balancing;
* `-nsteps N` sets the number of simulation steps for the current run to `N` (`N=-1` means an infinitely long run, intended to be combined with `-maxh`);
* `-maxh H` stops the simulation after `0.99*H` hours;
* `-resethway` resets the performance counters halfway through the run;
* `-nb`/`-pme`/`-bonded`/`-update` task assignment options used to select whether tasks run on the CPU or the GPU.

For further information on the `mdrun` simulation tool command line options and features see the [online documentation](https://manual.gromacs.org/current/onlinehelp/gmx-mdrun.html).

#### The `mdrun` log file

The log file of the `mdrun` simulation engine contains extensive information about the GROMACS build, the hardware detected at runtime, the complete set of simulation settings, diagnostic output related to the run and its performance, as well as physics and performance statistics.

The version header in a GROMACS `mdrun` log:
![](https://i.imgur.com/dhkfJSr.png)

The hardware detection section of the `mdrun` log:
![](https://i.imgur.com/68AWSW1.png)

The performance accounting in a multi-rank simulation:
![](https://i.imgur.com/PUlYmaR.png)
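Throughout the exercises it is handy to pull just the headline figures out of these (fairly long) log files. A minimal sketch using standard shell tools; `md.log` is a placeholder for whatever name was set with `-g`:

```bash
# Print the version header of the installed GROMACS build
# (MPI builds are typically installed with the gmx_mpi suffix).
gmx -version | head -n 25

# Extract the final performance summary (ns/day and hour/ns) from a finished run;
# the preceding line (-B 1) holds the column labels.
grep -B 1 "Performance:" md.log
```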
## Simulation input system

In these exercises we will be using a simulation system consisting of the [Alcohol dehydrogenase](https://en.wikipedia.org/wiki/Alcohol_dehydrogenase) protein solvated in a box of water, set up with the AMBER99SB-ildn force field.

The simulation input file (tpr) can be obtained from:
* [this link](https://zenodo.org/record/7941558)
* or on MareNostrum at `/gpfs/home/nct00/nct00027/exercises/_material_/adh_dodec_with-gmx21-tpr.tar.gz`

## Using multi-core CPUs in a compute node

In the first two exercises we will be using the BSC MareNostrum4 cluster, whose nodes have 2 x Intel Xeon Platinum 8160 CPUs with 24 cores each (48 cores per node) and a 100 Gbit/s Intel Omni-Path interconnect; for further details on the architecture and usage see the [system documentation](https://www.bsc.es/supportkc/docs/MareNostrum4/intro).

:::success
:dart: **Learning goals**
* Understanding the impact of using multiple CPU cores on different parts of the MD computation.
* Understanding the trade-off between using threads and MPI ranks.
:::

As discussed in the lecture, there are two ways to express data parallelism in GROMACS: multi-threading and spatial decomposition across MPI ranks. First, we will explore using OpenMP multi-threading alone to run the GROMACS simulation engine on the CPU cores of a compute node. Then we will combine MPI ranks and OpenMP threads to distribute work.

:::info
:bulb: **The GROMACS log file**
Before starting, take a look at the introduction on the GROMACS `mdrun` tool and its log file.
:::

### Exercise 1.1: OpenMP multi-threading

Starting with the following job script, explore multi-threading in GROMACS. Notes on the job script below:
* The option `-pin on` enables `mdrun` internal thread affinity. This means that threads will be pinned to cores (hardware threads) and will not be moved around by the operating system, which improves performance.
* `-nsteps N` sets the number of simulation steps (iterations) the MD engine will run (we picked the value here so that runs are reasonably fast).

```bash=
#!/bin/bash
#SBATCH --time=00:10:00       # maximum execution time of 10 minutes
#SBATCH --nodes=1             # requesting 1 compute node
#SBATCH --ntasks=1            # 1 MPI rank (task)
#SBATCH --exclusive           # no node sharing to make performance analysis easier
#SBATCH --cpus-per-task=...   # modify this: the number of CPU cores per MPI task

# load modules for MN4
module purge
module load gcc/11.2.0_binutils openmpi/3.1.1 gromacs/2023

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun -np 1 gmx_mpi mdrun -pin on -ntomp $SLURM_CPUS_PER_TASK -g ex1.1_1x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 10000 -resethway
```

* Modify the script varying the number of CPU cores used (`--cpus-per-task`) and submit runs with each new configuration.
* Look at the `mdrun` log file output (the files will be named `ex1.1_1xN_jIDXXXXXX.log`).

:::warning
* How does the absolute performance (ns/day) change when increasing the number of cores used?
* How does the wall-time of the various computations change with the thread count (e.g. the "Force", "PME mesh", "Update" tasks)?
:::

:::info
:bulb: `SBATCH` arguments provided in the job script header can also be passed on the command line (e.g. `sbatch --cpus-per-task N`), overriding the setting in the job script header. Doing so allows varying submission parameters without having to edit the job script (a small example follows below).
:::
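For example, the core-count sweep can be submitted without editing the script at all; a minimal sketch, assuming the job script above is saved under the (hypothetical) name `ex1.1.sh`:

```bash
# Submit the same job script several times, overriding the core count on the command line;
# ex1.1.sh is a placeholder for the job script file name.
for n in 1 2 4 6 12 24 48; do
    sbatch --cpus-per-task=$n ex1.1.sh
done
```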
:::spoiler Help with the solution
```bash=
#!/bin/bash
#SBATCH --time=00:10:00       # maximum execution time of 10 minutes
#SBATCH --nodes=1             # requesting 1 compute node
#SBATCH --ntasks=1            # 1 MPI rank (task)
#SBATCH --cpus-per-task=12    # 12 CPU cores
#SBATCH --exclusive           # no node sharing to make performance analysis easier

# load modules for MN4
module purge
module load gcc/11.2.0_binutils openmpi/3.1.1 gromacs/2023

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun -np 1 gmx_mpi mdrun -pin on -ntomp $SLURM_CPUS_PER_TASK -g ex1.1_1x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 10000 -resethway
```

We can use the `grep` and `awk` tools to extract rows of the performance table or fields of the performance summary from the log file. E.g. `grep "Force " *.log` allows inspecting the runtime statistics (the wall-time) of a task.

```
$ grep "Force " *.log | awk '{print $1, "\t", $(NF-2)}'
bench_6C_1x6T.log:    36.942
bench_12C_1x12T.log:  21.678
bench_24C_1x24T.log:  13.372
$ grep "PME mesh " *.log | awk '{print $1, "\t", $(NF-2)}'
bench_6C_1x6T.log:    6.693
bench_12C_1x12T.log:  11.622
bench_24C_1x24T.log:  24.038
```
:::

### Exercise 1.2: MPI ranks and OpenMP threads

Recall from the lecture that domain decomposition allows spatially decomposing simulation data into _domains_ to parallelize over the distributed resources of an HPC machine. `mdrun` assigns all computation associated with a domain to an _MPI rank_ (also called an _MPI task_). This work can further be distributed across multiple OpenMP threads within a rank. Hence, an MPI rank can use one or multiple threads and can be assigned one or multiple CPU cores.

Continuing from the previous experiments, now using all CPU cores in a compute node (48 on MareNostrum4), in this exercise we will explore how different ways of distributing work across MPI tasks and OpenMP threads affect performance.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00       # maximum execution time of 10 minutes
#SBATCH --nodes=1             # requesting 1 compute node
#SBATCH --ntasks=...          # vary this value
#SBATCH --cpus-per-task=...   # change this value as well

# load modules for MN4
module purge
module load gcc/11.2.0_binutils openmpi/3.1.1 gromacs/2023

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun gmx_mpi mdrun -pin on -g ex1.2_${SLURM_NTASKS}x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 20000 -resethway -npme 0
```

* Note: the `-npme 0` option explicitly disables the use of the "separate PME ranks" feature for MPMD / task decomposition (we will look at its impact in Exercise 2).
* Submit a few combinations of MPI rank / OpenMP thread counts, paying attention to use all CPU cores in a node (`ntasks` x `cpus-per-task` should be 48).
* Look at the `mdrun` log file outputs (files will be named `ex1.2_NxM_jIDXXXXXX.log`).

:::warning
* Check the absolute performance in the logs. Which is the most efficient setup?
* How does the cost of the different computations vary between the different setups (e.g. compare "Force" and "PME mesh")?
* _Bonus_: Check how well the dynamic load balancing (DLB) was able to reduce imbalance during the run (a `grep` sketch follows below):
    * find in the log the "Turning on dynamic load balancing" message, which marks the point where DLB was turned on (if it was) and reports how large the imbalance was at that point;
    * find the final DLB report under "Dynamic load balancing report".
:::
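To compare the runs without opening each log, the same `grep` approach as in Exercise 1.1 can be used for the DLB-related messages mentioned in the bonus question above; a minimal sketch, with the log names matching the `-g` pattern used in this exercise:

```bash
# Where (and whether) dynamic load balancing was turned on during each run.
grep "Turning on dynamic load balancing" ex1.2_*.log

# The first few lines of the final load balancing summary in each log.
grep -A 3 "Dynamic load balancing report" ex1.2_*.log
```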
:::spoiler Help with the results
[Sample log files for this exercise session.](https://zenodo.org/record/7941556)
:::

## Scaling GROMACS across multiple nodes with CPUs

:::success
:dart: **Learning goals**
* Understanding how data and task decomposition is used in the `mdrun` simulation engine.
* Exploring the multi-node strong scaling of GROMACS simulations and of different MD algorithms.
* _Advanced_: Understand the role of load balancing.
:::

In this exercise, we will explore the use of multiple compute nodes to improve the simulation throughput. First we will rely on domain decomposition alone to scale to a larger number of processors and will observe the limitations of this approach, then we will look at how OpenMP can help. Finally, we will use task decomposition with the "separate PME ranks" `mdrun` functionality to improve efficiency and allow better scaling.

### Exercise 2.1: Multi-node scaling

Continuing from Exercise 1.2, in this exercise we will observe how the `mdrun` performance scales when using multiple compute nodes.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=...           # Vary this value and
#SBATCH --ntasks=...          # change this
#SBATCH --cpus-per-task=...   # and this value as well.

# module load ...

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun gmx_mpi mdrun -npme 0 -pin on -g ex2.1_${SLURM_NNODES}N_${SLURM_NTASKS}x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 40000 -resethway
```

Starting with the most efficient MPI rank / OpenMP thread setup on a single node, we will extend this to run on multiple compute nodes. Next, we will check whether a different MPI/OpenMP setup can give better performance at larger node counts.

* Take the most efficient MPI rank / OpenMP thread single-node configuration from Ex 1.2 (likely the one with `--cpus-per-task=1`).
* Submit runs extending the same setup to 2, 4, 8, and 16 nodes (e.g. `--nodes=2 --ntasks=96 --cpus-per-task=1`).
* Next, try using more OpenMP threads per rank, adjusting the setup so that fewer MPI ranks are used in total.

:::warning
* Look at the absolute performance in the log files. How does the simulation performance (ns/day) change when increasing the number of nodes used?
* Does the optimal number of threads per MPI rank change when using more nodes? Why?
* _Bonus_: Look at individual rows of the performance table in the log. Which computations scale well and which ones scale poorly?
:::

### Exercise 2.2: Improving multi-node scaling using task parallelism (separate PME ranks)

The scaling of the PME computation becomes limited by the communication requirements of the parallel fast Fourier transform (FFT) computation used. To circumvent this, GROMACS can specialize a subset of MPI ranks and dedicate them to the PME computation associated with the work assigned to multiple domains. Since fewer MPI ranks participate in the very costly communication involved in parallel FFTs, this reduces the communication bottleneck and improves scaling. This is called multiple-program multiple-data (MPMD) parallelization and is referred to as ["separate PME ranks" in GROMACS](https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html#separate-pme-ranks).

In the previous experiments in 2.1 we passed the `-npme 0` argument to `mdrun`. This option explicitly disables separate PME ranks, so every MPI rank takes part in the distributed FFT computation. We will now explore the benefits of separate PME ranks by removing the `-npme 0` argument. This will allow the simulation to use an automatically determined number of separate PME ranks.
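Note that, besides disabling the feature (`-npme 0`) or letting `mdrun` guess (`-npme -1`, the default), the number of separate PME ranks can also be set explicitly. This is not needed for this exercise, but a hypothetical invocation could look like:

```bash
# Dedicate 4 of the MPI ranks to PME-only work; the remaining ranks do the
# particle-particle (short-range) work. The other options are as in the job script below.
mpirun gmx_mpi mdrun -npme 4 -pin on -nsteps 40000 -resethway
```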
```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=...           # Also vary this value and
#SBATCH --ntasks=...          # change this
#SBATCH --cpus-per-task=...   # and this value as well, together with the "..." below

# module load ...

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun gmx_mpi mdrun -pin on -g ex2.2_${SLURM_NNODES}N_${SLURM_NTASKS}x${SLURM_CPUS_PER_TASK}_jID${SLURM_JOB_ID} -nsteps 40000 -resethway
```

* Modify the job script from 2.1 by removing the `-npme 0` argument (as in the script above). Submit jobs with the most efficient setup observed in 2.1.
* Next, submit jobs also with fewer OpenMP threads per rank for those configurations where this makes sense; e.g. if the most efficient setup was `--cpus-per-task=2`, also run with `--cpus-per-task=1`.

:::warning
* How does the performance change with separate PME ranks?
* Does the optimal MPI rank / OpenMP thread setup change compared to 2.1?
* _Bonus_: How do the total "PME mesh" (total PME wall-time) and "PME 3D-FFT Comm." (FFT communication time) change?
:::

### _Bonus Exercise 2.3_: Load imbalance

It is not always possible to equalize the wall-time taken by the work assigned to different compute units, which leads to _load imbalance_. As mentioned in the lecture, there can be multiple sources of imbalance, including differences in the amount of computation assigned to different processors or in the time communication takes in different parts of a compute job.

In this exercise we will observe the load imbalance present in the previous simulation experiments. The `mdrun` log file contains an imbalance report section under `Dynamic load balancing report`.

:::warning
* Observe the single-node experiments where we varied the number of ranks and threads. How does the load imbalance change across these?
* How does the load imbalance change when increasing the number of nodes used (strong scaling)?
:::

:::spoiler Help with the results
[Sample log files for this exercise session.](https://zenodo.org/record/7941556)
:::

## GPU accelerated simulations

:::success
:dart: **Learning goals**
* Understanding how the GROMACS heterogeneous parallelization allows moving tasks between CPU and GPU and the impact of this on performance.
* Understanding the difference between force-offload and GPU-resident modes.
* _Advanced_: Exploring multi-GPU scaling.
:::

The GROMACS MD engine uses a heterogeneous parallelization which can flexibly utilize both CPU and GPU resources. As discussed in the lecture, there are two offload modes:
* In the _force offload_ mode some or all forces are computed on the GPU, but they are transferred to the CPU every iteration for integration;
* In the _GPU-resident_ mode the integration also happens on the GPU, allowing the simulation state to reside on the GPU for tens or hundreds of iterations.

Further details can be found in the [GROMACS user guide](https://manual.gromacs.org/current/user-guide/mdrun-performance.html#running-mdrun-with-gpus) and in [DOI:10.1063/5.0018516](https://aip.scitation.org/doi/full/10.1063/5.0018516).

In this exercise we will learn how moving tasks between the CPU and GPU impacts performance. We will be using the BSC MareNostrum4 CTE-POWER cluster, whose nodes have 2 x IBM Power9 CPUs and 4 x NVIDIA V100 GPUs; for further details on the architecture and usage see the [system documentation](https://www.bsc.es/supportkc/docs/CTE-POWER/intro).
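As a minimal illustration of the difference between the two modes on the `mdrun` command line (ignoring the other options used in the job scripts of the exercises below), only the assignment of the update (integration) task changes:

```bash
# Force-offload mode: forces are computed on the GPU, integration (update) stays on the CPU.
gmx mdrun -nb gpu -pme gpu -bonded gpu -update cpu

# GPU-resident mode: integration is also offloaded, so the simulation state stays on the GPU.
gmx mdrun -nb gpu -pme gpu -bonded gpu -update gpu
```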
### Exercise 3.1: GPU offloading force computations

There are three force tasks that can be offloaded to a GPU in GROMACS. The assignment of these tasks is controlled by `mdrun` command line options corresponding to each task:
* (short-range) nonbonded: `-nb ASSIGNMENT`
* particle mesh Ewald: `-pme ASSIGNMENT`
* bonded: `-bonded ASSIGNMENT`

The possible `ASSIGNMENT` values are `cpu`, `gpu`, or `auto`.

We will start by using one GPU with 10 CPU cores (a quarter of a CTE-POWER node) in a simulation and assess how the performance changes with force offload.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40    # The CTE machine has 4-way SMT and we are required to request all 40 hardware threads together with a GPU.
#SBATCH --gres=gpu:1

# module load for CTE-POWER
module load gcc/7.3.0 cuda/10.2 gromacs/2021.7-no-mpi-own-fftw

export OMP_NUM_THREADS=10

gmx mdrun -ntmpi 1 -nb cpu -pme cpu -bonded cpu -update cpu \
    -g ex3.1_${SLURM_NTASKS}x${OMP_NUM_THREADS}_jID${SLURM_JOB_ID} -nsteps -1 -maxh 0.017 -resethway -notunepme
```

* Note that we use time-limited runs here by passing `-maxh 0.017`, which sets the run time limit in hours (~one minute); we do this because the simulation throughput changes significantly between the configurations, and with a fixed step count we would risk either very long wait times or unreliable benchmark measurements based on just a few seconds of runtime.
* As a baseline, first launch a run assigning all tasks to the CPU (as above).
* Next, submit jobs incrementally offloading the various force tasks (`-nb`, `-pme`, `-bonded`) to the GPU.

:::warning
* How does the performance (ns/day) change as more tasks are offloaded?
* Look at the performance table in the log and observe how the fraction of wall-time spent in the tasks left on the CPU changes. (Note that the log file performance report only contains timings of tasks executed on the CPU, including the CPU time spent launching GPU work and waiting for GPU results, but not of the tasks offloaded to the GPU.)
* _Bonus_: Enable (PME) task load balancing by replacing `-notunepme` with `-tunepme`. Try it with the PME task assigned to the CPU and to the GPU, and observe how the performance changes compared to the earlier equivalent run without load balancing.
:::

### Exercise 3.2: GPU-resident mode

Continuing from the previous exercise, we will now explore using the GPU-resident mode.

```bash=
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40    # The CTE machine has 4-way SMT and we are required to request all 40 hardware threads together with a GPU.
#SBATCH --gres=gpu:1

# module load for CTE-POWER
module load gcc/7.3.0 cuda/10.2 gromacs/2021.7-no-mpi-own-fftw

export OMP_NUM_THREADS=10

gmx mdrun -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu \
    -g ex3.2_${SLURM_NTASKS}x${OMP_NUM_THREADS}_jID${SLURM_JOB_ID} -nsteps -1 -maxh 0.017 -resethway -notunepme
```

* Submit a fully offloaded GPU-resident job using the `-update gpu` option (as above).
* Since we moved most computation to the GPU, the CPU cores are left unused. The GROMACS heterogeneous engine allows moving work back to the CPU, so next let's try to also utilize the CPU cores for potential performance benefits: try moving the PME and bonded tasks back to the CPU (a sketch follows after the questions below).
* Next, let's try adding more cores to the run by specifying `--cpus-per-task=80` and `OMP_NUM_THREADS=20` in the job script.

:::warning
* How does the GPU-resident mode perform compared to the best performing force-offload run from Ex 3.1?
* How did the performance change when force tasks were moved back to the CPU?
* _Bonus_: Try using `-tunepme` as in 3.1.
:::
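For the step above where PME and bonded work is moved back to the CPU while keeping the run GPU-resident, the task assignment might look like the following sketch (the rest of the job script stays unchanged; the `-g` log name is a placeholder chosen so it does not overwrite the fully offloaded run's log):

```bash
# GPU-resident run (update on the GPU) with the PME and bonded forces computed on the CPU cores.
gmx mdrun -ntmpi 1 -nb gpu -pme cpu -bonded cpu -update gpu \
    -g ex3.2_pme-cpu_jID${SLURM_JOB_ID} -nsteps -1 -maxh 0.017 -resethway -notunepme
```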
:::spoiler Help with the results
[Sample log files for the exercise session.](https://zenodo.org/record/7941556)
:::

:::info
**License**

This material is shared under the CC BY 4.0 license. [DOI: 10.6084/m9.figshare.22303477](https://doi.org/10.6084/m9.figshare.22303477)
:::