This is the collaborative "notebook" for the GROMACS on LUMI online workshop organized by CSC and BioExcel on 24-25 January 2024.
📅 The workshop schedule can be found here.
⁉️ This is also the place to ask questions about the content!
📔 For exercise instructions, see this page!
💡 Hint: HackMD is great for sharing information in courses like this, as code formatting is nice & easy with Markdown! Just add backticks (`` ` ``) for code blocks. Otherwise, it's like Google Docs: it allows simultaneous editing. There's a section for practice at the bottom ⬇️
Log in with `ssh <username>@lumi.csc.fi` or using the LUMI web interface (open a Login node shell). You must have set up and registered your SSH key pair prior to this. Work in the course scratch directory (`/scratch/project_465000934`) and in your own respective folders. Don't launch jobs from your `$HOME` directory.
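For example (a sketch; the per-user subfolder is just a suggested convention):

```bash
ssh <username>@lumi.csc.fi
mkdir -p /scratch/project_465000934/$USER   # your own working folder under the course project
cd /scratch/project_465000934/$USER
```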
Check your `#SBATCH` details before launching simulations. Ensure that all the details are entered correctly. Common Slurm options and error messages are detailed in the LUMI documentation.

We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone. In short:
You can find all the lecture slides for the workshop here
Day 1 (24 January)

Time | Content |
---|---|
9:30 | Workshop Introduction (communication channels, icebreaker with HackMD, code of conduct) |
9:40 | LUMI Architecture + Q/A session |
10:10 | Brief Introduction to GROMACS + Q/A session |
10:25 | Break (15 min) |
10:40 | GROMACS parallelization / heterogeneous/GPU algorithms + Q/A session |
11:30 | Lunch break (75 min) |
12:45 | AMD GPU support in GROMACS + Q/A session |
13:15 | How to run GROMACS on LUMI (Slurm; batch script option) + Q/A session |
13:30 | Break (15 min) |
13:45 | Introduction to Hands-on session |
14:05 | Hands-on session instructions |
15:20 | Finish day 1 |
Day 2 (25 January)

Time | Content |
---|---|
9:30 | Assessing and tuning performance of GROMACS simulations + Q/A session |
10:00 | Break (15 min) |
10:15 | Hands-on: GPU accelerated simulations instructions |
11:30 | Lunch break (75 min) |
12:45 | Hands-on: Scaling GROMACS across multiple GPUs instructions |
14:00 | Break (15 min) |
14:15 | Hands-on: Ensemble parallelization across multiple GPUs instructions |
15:30 | Finish day 2 |
Q1: How to get access to LUMI-D?

A: Use `--partition=largemem` in your batch job. LUMI-D visualization nodes can be accessed through the LUMI web interface accelerated visualization app.
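For example, a minimal sketch of the relevant batch lines (the account is the course project from above; other options omitted):

```bash
#SBATCH --partition=largemem          # LUMI-D large-memory CPU nodes
#SBATCH --account=project_465000934
```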
Q2: Is there a difference between a CPU and a socket?
Q3: Do we get the slides?
Q4: Is it possible to run GPU jobs across multiple nodes?
A: See the `-multidir` option and the CSC GROMACS throughput tutorial (https://docs.csc.fi/support/tutorials/gromacs-throughput/).

Q5: How to estimate the scaling limit for your system (i.e., whether it's feasible to run on multiple nodes)?
Q6: Is it possible to also run jobs on e.g. LUMI-G and LUMI-C in parallel, with communication via the high-speed interconnect?
Q7: Would uncoupled ensembles have the advantage of being unbiased? How would bias factor into the choice of ensemble parallelization?
Q8: What is the frequency of communication between domains, and where is it done?
Q9: nstlist=20-40 was mentioned in the previous presentation to work well with GPU offloading. With multi-GPU runs, the nstlist value can be automatically increased to e.g. 100 to increase performance. The GROMACS documentation recommends in some cases increasing this automatic value manually even further, to nstlist=200-300. Is there an upper limit for nstlist, and how do you know which value to choose?
Q10: (related to Q9) What should we consider specifically to choose the correct or most efficient nstlist value? Is it just a trial-and-error approach? Just testing different values and comparing the performance?
Q11: Are there any specific challenges of installing GROMACS to be able to use AMD GPUs? For example, to make sure the correct GCD-CCD mapping is used (in case the mapping is necessary at that stage).
A: You tell GROMACS (`mdrun` at the command line) how to map/use the hardware when launching the job. Installation is a separate topic; on LUMI, GROMACS is preinstalled.
Q12: Would it be possible to run with 1 full node + 1 (or 2) GPUs? Or if multiple GPU nodes are used, do they always need to be full nodes?
Q13: Is energy minimization supported on GPUs?
Q14: From my limited experience compiling GROMACS, it seems to me that GPU support is not compatible with double-precision calculations. Is that correct? Is there any plan to extend GPU calculations to double precision in future versions?
Q15: Can you explain what `ROCR_VISIBLE_DEVICES` does again?
Q16: What distinguishes `#SBATCH --gres=` from `#SBATCH --gpus-per-node=`?
Q17: Would assigning `--ntasks-per-node` > 16 have an impact on efficiency?
Q18: How to get production access to LUMI?
Q19: What would be considered good enough scaling?
Q20: What is "Wait GPU state copy" exactly?
Q21: I wondered whether, due to the LUMI architecture, one can always use `--gpus-per-node=2` when running GROMACS?
Q22: What if I have 2x22 cores per node on my HPC system; how many `--cpus-per-task` can I assign per node?
Q23: So, either use 8 GPUs/node, or if you have small systems, run multiple small simulations with `-multidir`?
Q24: Every time I increase the number of OpenMP threads, the simulation time keeps increasing, so I want to understand how `--ntasks-per-node` and `--cpus-per-task` correlate?
Q25: Are we billed based on how much time we request or how much time we actually use?
Q26: Is there a "best practice" number of steps for scaling test runs? I.e., is there some delay before, for example, dynamic load balancing starts to act properly?
Q27: The documentation says that `-tunepme` is incompatible with some mdrun options. Could you explain what these incompatibilities mean and how we can spot them? In other words, what happens if I use `-tunepme` but it is incompatible with the simulation system?
A: In such cases `-tunepme` cannot be used. The only major feature that disables PME tuning is PME decomposition (multiple GPUs for PME).

Q28: I am confused by the terms used in the GROMACS output: "Using __ MPI processes" and "Using __ OpenMP threads per MPI process". If we run simulations only on LUMI CPU nodes, where each node has 56 cores (if I'm correct), what numbers should appear in those two blanks to run the simulation efficiently without problems of dynamic load imbalance?
A: These correspond to `--ntasks-per-node` and `--cpus-per-task` (and `OMP_NUM_THREADS`). The total number (ntasks-per-node * cpus-per-task) should add up to the number of cores on the node (or twice the amount if using simultaneous multi-threading, which requires you to add the option `--hint=multithread`).
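For example, a minimal sketch of a matching CPU-only job (assuming a 128-core LUMI-C node and the `standard` partition; the 16 x 8 split is just one valid choice):

```bash
#!/bin/bash
#SBATCH --partition=standard        # LUMI-C CPU partition (assumed)
#SBATCH --account=project_465000934
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16        # MPI ranks per node
#SBATCH --cpus-per-task=8           # OpenMP threads per rank; 16 * 8 = 128 cores
#SBATCH --time=00:15:00

# load a CPU-enabled GROMACS build here (module name depends on the system)

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun gmx_mpi mdrun -deffnm md
```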
Q29: Does it take time to tune the PME at the beginning, and should the counters be reset later because of that?
A: Use `-resetstep` or `-resethway` to get more reliable performance output.

Q30: How should we change the nstlist frequency?
A: Use the `-nstlist` mdrun option. Increasing it can have a positive impact on performance, especially for multi-GPU runs.
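For example (a sketch based on the exercise command line; 100 is an illustrative value):

```bash
srun --cpu-bind=${CPU_BIND} ./select_gpu \
    gmx_mpi mdrun -nstlist 100 \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -nsteps -1 -maxh 0.017 -resethway -notunepme
```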
Q31: I tried setting nstlist = 200 with PME tuning on and got the following error: "Fatal error: PME tuning was still active when attempting to reset mdrun counters at step 1901. Try resetting counters later in the run, e.g. with gmx mdrun -resetstep."
A: `-resetstep`/`-resethway` resets our internal performance counters (after the first half of the simulation in the case of `-resethway`); it is a useful option to avoid the first few steps skewing the reported performance. But if this coincides with the time PME tuning is being performed, bad things happen. You can either try running longer (2-3 minutes) so that tuning happens in the first half of the run (change `-maxh`), or set `-resetstep` (instead of `-resethway`) such that the reset happens after the tuning.
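A sketch of the second option (the reset step and time limit are illustrative; pick a step you expect to reach only after tuning has finished):

```bash
srun --cpu-bind=${CPU_BIND} ./select_gpu \
    gmx_mpi mdrun \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -tunepme \
    -nsteps -1 -maxh 0.05 -resetstep 20000   # reset counters well after PME tuning
```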
Q32: If I have already run simulations, would it be safe to increase nstlist for a continuation run without affecting the results? Basically, only the performance changes?
Q33: I tried to activate `-tunepme` and got this error: "Fatal error: PME tuning was still active when attempting to reset mdrun counters at step 451. Try resetting counters later in the run, e.g. with gmx mdrun -resetstep." What could the reason be?
Q34: In exercise 3.2 bonus 1, we are asked to set --gpus-per-node=2 and --ntasks-per-node=3, but then I get this error:
Inconsistency in user input:
There were 3 GPU tasks found on node nid005037, but 2 GPUs were available. If
the GPUs are equivalent, then it is usually best to have a number of tasks
that is a multiple of the number of GPUs. You should reconsider your GPU task
assignment, number of ranks, or your use of the -nb, -pme, and -npme options,
perhaps after measuring the performance you can get.
I am sourcing the `lumi-affinity.sh` script and using `--cpu-bind=${CPU_BIND}` and `./select_gpu`?
#SBATCH --nodes=1 # we run on 1 node
#SBATCH --gpus-per-node=2 # fill in number of GPU devices
#SBATCH --ntasks-per-node=3 # fill in number of MPI ranks
module use /appl/local/csc/modulefiles
module load gromacs/2023.3-gpu
source ${GMXBIN}/lumi-affinity.sh
export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_FORCE_GPU_AWARE_MPI=1
srun --cpu-bind=${CPU_BIND} ./select_gpu \
gmx_mpi mdrun -npme 1 \
-nb gpu -pme gpu -bonded gpu -update gpu \
-g ex3.1_${SLURM_NTASKS}x${OMP_NUM_THREADS}_jID${SLURM_JOB_ID} \
-nsteps -1 -maxh 0.017 -resethway -notunepme
Q35: I get an error when trying to assign 3 tasks to 2 GPUs for the bonus in ex3.2: "MPICH ERROR - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0"?
Q36: Was it more efficient to run 9 tasks with 8 GPUs? Or is the difference compared to 8 tasks/8 GPUs really minimal?
Q37: In exercise 4.1 the suggested number for OMP_NUM_THREADS is 7/(ntasks-per-node/8). What does this mean? In the previous exercises the number was always 7.
Q38: Sorry, I am lost. How many MPI ranks should we use in exercise 4.1?
Q39: I am getting this error: "To run mdrun in multi-simulation mode, more then one actual simulation is required. The single simulation case is not supported." How can I solve it?
A: This error means you are running in `-multidir` mode, which is designed for running ensembles. If you want a single simulation, don't use `-multidir`. You should also make sure that `ntasks` is >= the number of ensemble members (number of directories). Did you replace the `??` after the `-multidir` flag?

Q40: How do I get the aggregate performance?
A: Run `grep Performance */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log` and calculate the sum, e.g. `grep Perf */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log | awk '{ sum += $2; n++ } END { if (n > 0) print sum ; }'`, or use `grep Perf */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log | awk '{ sum += $2; n++ } END { if (n > 0) print sum / n ; }'` to get the average.

Q41: Sorry, I am completely lost. This is my jobscript:
#SBATCH --nodes=1 # we run on 1 node
#SBATCH --gpus-per-node=8 # the number of GPU devices per node
#SBATCH --ntasks-per-node=8
module use /appl/local/csc/modulefiles
module load gromacs/2023.3-gpu
source ${GMXBIN}/lumi-affinity.sh #
export OMP_NUM_THREADS=7 # fill in the number of threads (7/(ntasks-per-node/8))
num_multi=8 # change ensemble size
srun --cpu-bind=${CPU_BIND} ./select_gpu \
gmx_mpi mdrun -multidir sim_{01..??} \
-nb gpu -pme gpu -bonded gpu -update gpu \
-g ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID} \
-nsteps -1 -maxh 0.017 -resethway -notunepme
I receive the following error immediately after submission:
Feature not implemented:
To run mdrun in multi-simulation mode, more then one actual simulation is
required. The single simulation case is not supported.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
MPICH ERROR [Rank 5] [job id 5896229.0] [Thu Jan 25 15:51:05 2024] [nid005037] - Abort(1) (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
srun: error: nid005037: tasks 0-7: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=5896229.0
A: Change the `??` to the same value as num_multi after the `-multidir` flag.
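For example, with num_multi=8 the brace expansion should read (a sketch; the rest of the command follows the script above):

```bash
num_multi=8   # ensemble size
srun --cpu-bind=${CPU_BIND} ./select_gpu \
    gmx_mpi mdrun -multidir sim_{01..08} \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -g ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID} \
    -nsteps -1 -maxh 0.017 -resethway -notunepme
```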
Q42: Sorry, should we execute the job from within the ensemble_inputs subdirectory?
Q43: How about the tradeoff of having to queue for resources? Specifically on LUMI, would it make sense to benchmark one's system for several GPU nodes (and how many), or to stick to fewer nodes to reduce queueing time at the cost of performance?
Q44: I am not sure how I can assign tasks to multiple GPU nodes. In ex4.2 we have 16 ensembles. If we use 2 GPU nodes, then we would have 112 MPI ranks. Can we have 32 tasks and 2 OpenMP threads each?
A: For example, `OMP_NUM_THREADS=1` or `OMP_NUM_THREADS=3` can be used; with `OMP_NUM_THREADS=3`, more than one ensemble member can be run on 1 GPU. As a rule of thumb: number of cores <= 1 * OMP_NUM_THREADS * number of ensembles.
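As an illustration, a sketch adapted from the ex4.1 script above (the values 16 ranks x 3 threads are assumed, giving two ensemble members per GCD on a single GPU node):

```bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=16        # one MPI rank per ensemble member

module use /appl/local/csc/modulefiles
module load gromacs/2023.3-gpu
source ${GMXBIN}/lumi-affinity.sh

export OMP_NUM_THREADS=3            # 7/(16/8) rounded down; 16 * 3 = 48 <= 56 cores
export MPICH_GPU_SUPPORT_ENABLED=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_FORCE_GPU_AWARE_MPI=1

num_multi=16
srun --cpu-bind=${CPU_BIND} ./select_gpu \
    gmx_mpi mdrun -multidir sim_{01..16} \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -nsteps -1 -maxh 0.017 -resethway -notunepme
```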
Q45: Could you post the job script used for ex4.2?
Q46: ..?
Q47: ..?
Q48: ..?
Q49: ..?
What are your expectations for this workshop?
Try out HackMD and the Markdown syntax by typing something under here!
`this is code`, **this is bold**, *this is italicized*, ~subscript~, ^superscript^

```
code block
with multiple
lines
```