Running GROMACS on LUMI workshop 24.-25.01.2024 Q&A
📅 The workshop schedule can be found here.
⁉️ This is also the place to ask questions about the content!
📔 For exercise instructions, see this page!
💡 Hint: HackMD is great for sharing information in this kind of course, as code formatting is nice & easy with Markdown! Just add backticks (`` ` ``) for the code blocks. Otherwise, it's like Google Docs as it allows simultaneous editing. There's a section for practice at the bottom ⬇️
💻 How we work
Overall setup
Other Useful links
General instructions for LUMI
- Log in with `ssh <username>@lumi.csc.fi` or using the LUMI web interface (open a Login node shell). You must have set up and registered your SSH key pair prior to this.
- Work in the project scratch directory (`/scratch/project_465000934`) and in your own respective folders. Don't launch jobs from your `$HOME` directory.
- Check the `#SBATCH` details before launching simulations. Ensure that all the details are entered correctly. Common Slurm options and error messages are detailed in the LUMI documentation. A minimal batch script skeleton is sketched below.
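For orientation, a minimal LUMI batch script skeleton might look like the sketch below (the partition, module, resource values, and file names are assumptions for illustration; always use the values given in the exercise instructions):

```bash
#!/bin/bash
#SBATCH --account=project_465000934   # workshop project (from this page)
#SBATCH --partition=small-g           # GPU partition name assumed; check the LUMI docs
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=1
#SBATCH --time=00:15:00

# Load GROMACS as instructed on the exercise page (module name assumed)
module load gromacs

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun gmx_mpi mdrun -s topol.tpr -maxh 0.2
```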
Code of conduct
We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone. In short:
Lecture Presentations
You can find all the lecture slides for the workshop here
📅 Agenda
Day 1: Wednesday 24.01.2024 (times CET)
Day 2: Thursday 25.01.2024 (times CET)
⁉️ Q&A – type your questions here!
Q1: How to get access to LUMI-D?
A: Use `--partition=largemem` in your batch job. LUMI-D visualization nodes can be accessed through the LUMI web interface accelerated visualization app.
Q2: Is there a difference between CPU and socket?
Q3: Do we get the slides?
Q4: Is it possible to run GPU jobs across multiple nodes ?
A: Yes. For running many simulations at once, see also `-multidir` (https://docs.csc.fi/support/tutorials/gromacs-throughput/).
Q5: How to estimate the scaling limit for your system (i.e., if it's feasible to run on multiple nodes)?
Q6: Is it possible to also run jobs on e.g. LUMI-G and LUMI-C in parallel, with communication via high-speed interconnect ?
Q7: Would the uncoupled ensembles have the advantage of being unbiased? How would bias factor into the choice of ensemble parallelization?
Q8: What is the frequency of communication between domains, and where is it done?
Q9: nstlist=20-40 was mentioned in previous presentation to work well with GPU offloading. With multi-GPU runs the nstlist value can be automatically increased to e.g. 100 to increase performance. The GROMACS documentation recommends in some cases to increase this automatic value manually even more to nstlist=200-300. Is there an upper limit for nstlist and how to know which value to choose?
Q10: (related to Q9) What should we consider specifically to choose the correct or most efficient nstlist value? Is it just a trial-and-error approach? Just testing different values and compare the performance?
Q11: Are there any specific challenges of installing GROMACS to be able to use AMD GPUs? For example, to make sure the correct GCD-CCD mapping is used (in case the mapping is necessary at that stage).
A: The GCD-CCD mapping is not fixed at installation time; you tell `mdrun` (at the command line) how to map/use the hardware when launching the job. Installation is a separate topic. On LUMI, GROMACS is preinstalled.
Q12: Would it be possible to run with 1 full node + 1 (or 2) GPUs? Or if multiple GPU nodes are used, do they always need to be full nodes?
Q13: Is the energy minimization supported with GPUs?
- So do we always have to run the energy minimization using the LUMI-C partition only, and the next steps using LUMI-G?
Q14: From my limited experience compiling GROMACS it seems to me GPU support is not compatible with double precision calculations. Is it correct? Is there any plan to extend GPU calculations to double precision in future versions?
Q15: Can you explain what "ROCR_VISIBLE_DEVICES" does again?
Q16: What distinguishes `#SBATCH --gres=` from `#SBATCH --gpus-per-node`?
Q17: Would assigning `--ntasks-per-node` > 16 have an impact on efficiency?
Q18: How to get production access to LUMI?
Q19: What would be considered good enough scaling?
Q20: What is "Wait GPU state copy" exactly?
Q21: I wondered whether, due to the LUMI architecture, one can always use `--gpus-per-node=2` when running GROMACS?
Q22: What if I have 2x22 cores in my HPC system, then how many `cpus-per-task` can I assign per node?
Q23: So, either use 8 gpus/node or if you have small systems then run multiple small simulations with -multidir?
Q24: Every time I increase the number of OpenMP threads, the simulation time keeps increasing, so I want to understand how `--ntasks-per-node` and `--cpus-per-task` correlate?
Q25: Are we billed based on how much time we ask or how much time we actually use?
Q26: Is there a "best practice" number of steps in scaling test runs? i.e., is there some delay before the dynamic load balancing for example starts to act properly?
Q27: The documentation says that `-tunepme` is incompatible with some mdrun options. Could you explain what these incompatibilities mean and how we can spot them? In other words, what happens if I use `-tunepme` but it is incompatible with the simulation system?
A: In that case `-tunepme` cannot be used. The only major feature that disables PME tuning is PME decomposition (multiple GPUs for PME).
Q28: I am confused by the terms used in the GROMACS output: "Using __ MPI processes" and "Using __ OpenMP threads per MPI process". If we talk about running simulations only on the LUMI CPU partition, where each node has 56 cores (if I'm correct), then what numbers should appear in those two blanks to run the simulation efficiently without problems of dynamic load imbalance?
A: These correspond to `--ntasks-per-node` and `--cpus-per-task` (and `OMP_NUM_THREADS`). The total number (ntasks-per-node * cpus-per-task) should add up to the number of cores on the node (or twice that amount if using simultaneous multi-threading, which requires you to add the option `--hint=multithread`).
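For example (a sketch with assumed values, not taken from the exercises): on a 128-core LUMI-C node you could split the cores as below and benchmark which rank/thread balance is fastest for your system:

```bash
#SBATCH --partition=standard     # LUMI-C CPU partition (name assumed)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16     # 16 MPI ranks ...
#SBATCH --cpus-per-task=8        # ... x 8 OpenMP threads = 128 cores in total

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun gmx_mpi mdrun -s topol.tpr
```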
Q29: Does it take time to tune the PME in the beginning, and should the counters be reset later because of that?
A: Yes; you can use `-resetstep` or `-resethway` to get more reliable performance output.
Q30: How should we change the nstlist frequency?
A: With the mdrun `-nstlist` option. Increasing it can have a positive impact on performance, especially for multi-GPU runs.
Q31: I tried setting nstlist = 200 with PME tuning on. I got the following error:
Fatal error: PME tuning was still active when attempting to reset mdrun counters at step 1901. Try resetting counters later in the run, e.g. with gmx mdrun -resetstep.
A: `-resetstep`/`-resethway` resets our internal performance counters after the first half of the simulation; it is a useful option to avoid the first few steps skewing the reported performance. But if this coincides with the time PME tuning is being performed, bad things happen. You can either try running longer (2-3 minutes) so that tuning happens in the first half of the run (change `-maxh`), or set `-resetstep` (instead of `-resethway`) such that it happens after the tuning.
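As a sketch (the `-maxh` value and step count below are illustrative assumptions, not exercise settings):

```bash
# Option 1: run a bit longer so PME tuning finishes before the halfway counter reset
srun gmx_mpi mdrun -s topol.tpr -maxh 0.05 -resethway

# Option 2: reset the counters at an explicit step chosen to be well after tuning has finished
srun gmx_mpi mdrun -s topol.tpr -maxh 0.05 -resetstep 10000
```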
Q32: If I have already run simulations, would it be safe to increase nstlist for a continuation without affecting the results? Basically only the performance is changed?
Q33: I tried to activate -tunepme and I got this error. What can the reason be "Fatal error: PME tuning was still active when attempting to reset mdrun counters at step 451. Try resetting counters later in the run, e.g. with gmx mdrun -resetstep. "?
Q34: In exercise 3.2 bonus 1, we are asked to set `--gpus-per-node=2` and `--ntasks-per-node=3`, but then I get this error:
A: Are you using the `lumi-affinity.sh` script and using `--cpu-bind=${CPU_BIND}` and `./select_gpu`?
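For reference, a sketch of how these pieces typically fit together in the exercise batch scripts (assuming `lumi-affinity.sh` defines `CPU_BIND` and creates the `./select_gpu` wrapper, as in the course material):

```bash
# Source the affinity helper provided with the exercises
source lumi-affinity.sh

# Bind CPU cores per rank and pick one GCD per rank via the wrapper
srun --cpu-bind=${CPU_BIND} ./select_gpu gmx_mpi mdrun -s topol.tpr
```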
Q35: I get an error when trying to assign 3 tasks to 2 GPUs for the bonus in ex3.2: MPICH ERROR - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0?
Q36: Was it more efficient to run 9 tasks with 8 GPUs? Or is the difference compared to 8 tasks/8 GPUs really minimal?
Q37: In exercise 4.1 the suggested number for OMP_NUM_THREADS is 7/(ntasks-per-node/8). What does this mean? In the previous exercises the number was always 7.
Q38: Sorry, I am lost. How many MPI ranks should we use in exercise 4.1?
Q39: I am getting this error: "To run mdrun in multi-simulation mode, more then one actual simulation is required. The single simulation case is not supported." How can I solve it?
A: This comes from `-multidir` mode, which is designed for running ensembles. If you want a single simulation, don't use `-multidir`. You should also make sure that `ntasks` is >= the number of ensemble members (number of directories). Do you have `??` after the `-multidir` flag?
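A minimal `-multidir` launch might look like the sketch below (directory names and counts are placeholders; there must be at least as many MPI ranks as member directories):

```bash
# 16 ensemble members in directories member_01 ... member_16 (names assumed),
# each containing its own topol.tpr; one MPI rank per member
srun --ntasks=16 gmx_mpi mdrun -multidir member_{01..16} -s topol.tpr
```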
Q40: How do I get the aggregate performance?
A: Run `grep Performance */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log` and calculate the sum, e.g.
`grep Perf */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log | awk '{ sum += $2; n++ } END { if (n > 0) print sum ; }'`
or, to get the average,
`grep Perf */ex4.1_${SLURM_NNODES}N_multi${num_multi}_jID${SLURM_JOB_ID}.log | awk '{ sum += $2; n++ } END { if (n > 0) print sum / n ; }'`
Q41: Sorry, I am completely lost. This is my jobscript:
I receive the following error immediately after submission:
A: Set `??` to the same value as `num_multi` after the `-multidir` flag.
Q42: Sorry, should we execute the job from within the ensemble_inputs subdirectory?
Q43: What about the tradeoff of having to queue for resources? Specifically on LUMI, would it make sense to benchmark one's system for, say, several GPU nodes (and how many), or to stick to fewer nodes to reduce queueing time at the cost of performance?
Q44: I am not sure how I can assign tasks to multiple GPU nodes. In ex4.2 we have 16 ensembles. If we use 2 GPU nodes, then we would have 112 MPI ranks. Can we have 32 tasks with 2 OpenMP threads each?
A: You could use, for example, `OMP_NUM_THREADS=1` or `OMP_NUM_THREADS=3`; `OMP_NUM_THREADS=3` can be run on 1 GPU. Make sure that number of cores <= 1 * OMP_NUM_THREADS * number of ensembles.
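As a rough worked illustration of that constraint (the numbers below are assumptions, not from the exercise):

```bash
# Illustration only: total cores needed for 16 ensemble members at 7 OpenMP threads each
num_multi=16
OMP_NUM_THREADS=7
echo $(( 1 * OMP_NUM_THREADS * num_multi ))   # 112 cores in total
```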
Q45: Could you post the job script used for ex4.2?
Q46: ..?
Q47: ..?
Q48: ..?
Q49: ..?
Feedback
What are your expectations for this workshop?
🦔 HackMD practice area
Try out HackMD and the Markdown syntax by typing something under here!
`this is code`, this is **bold**, this is *italicized*, ~subscript~, ^superscript^