# HPC course notes - Day 2
## Presentation material
- [How to improve job scripts for better resource usage](https://cicero.xyz/v3/remark/0.14.0/github.com/bast/talk-better-resource-usage/main/talk.md/)
- [NIRD Archive](https://drive.google.com/file/d/1KIXaWqOlZR8khb3YJjmJ1UnujC51n5vX/view?usp=sharing)
- [Archive Screencast](https://drive.google.com/file/d/1Te9kXeUyqLLO8-Eyx_BonF2X1Lxk47kT/view?usp=sharing)
## Notes
- Regarding the time setting: when a job usually finishes in around 16 hours, but sometimes takes longer, even 18 hours, what could be the reason? And because of this, I would set the time limit to 20 hours; is that reasonable?
- Yes, that request is reasonable. I would even ask for 24 hours for this specific example. If the job finishes in 16 hours, you will only be charged for the 16 hours. The only drawback is that the queue time might be slightly longer.
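- As a minimal sketch of how the limit is set (the account and job names are placeholders), the wall time goes into the `--time` option:
  ```bash
  #!/bin/bash
  #SBATCH --account=nnXXXXk     # placeholder project account
  #SBATCH --job-name=forecast   # placeholder job name
  #SBATCH --time=20:00:00       # wall-time limit; the job is killed if it runs longer
  ```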
- Within the job script, can you dedicate part of the resources to one command? I want to include several commands that share the total resources and run simultaneously.
- The question is not clear enough to give a solid answer. Do you mean that you want to request a pool of resources that many analyses can share?
- Say I want to run an ensemble model forecast with 10 runs, so I request 10 CPUs in the job script. I will use a for loop in the script to call the model executable 10 times, each run in the background (with a `&`). How can I tell each of the runs to use CPU #0, #1, ...? Do I use `-machinefile` in the `mpirun` command?
- Would an array job be a solution? https://documentation.sigma2.no/jobs/job_scripts/array_jobs.html
- This will not let you pick which cores are used; each run will use the next available core from your allocation (this is core-level, not CPU/socket-level, control, though).
- Thanks, the array job seems to be exactly what I need. If I calculate the total number of CPUs as the number of runs times the CPUs per run, it should work.
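- A minimal sketch of such an array-job script for the 10-member ensemble (the executable name, its flag, and the resource numbers are assumptions):
  ```bash
  #!/bin/bash
  #SBATCH --account=nnXXXXk    # placeholder project account
  #SBATCH --job-name=ensemble
  #SBATCH --array=0-9          # one array task per ensemble member
  #SBATCH --ntasks=1           # each member runs as its own job
  #SBATCH --time=01:00:00
  #SBATCH --mem-per-cpu=2G     # needed on Saga; not used on Fram/Betzy normal partition

  # Slurm sets SLURM_ARRAY_TASK_ID to 0..9; use it to select the input
  # for this ensemble member (hypothetical flag):
  ./model.exe --member "${SLURM_ARRAY_TASK_ID}"
  ```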
- On Fram, my climate model archive script has `#SBATCH --qos=preproc` and `#SBATCH --exclusive`. I know it should not use `--exclusive`, and Fram no longer needs `--qos=preproc`, but I don't know how to change this automatically or permanently. I can delete the options in the specific file, but every time I start a model run they are back. How can I avoid this?
- You can specify this outside the job script, as options to sbatch:
- `sbatch [options] jobscript`
- https://slurm.schedmd.com/sbatch.html
- Example:
- `sbatch --qos=preproc batch_script.sh`
- What exactly do you mean by "start a model run"? Which program is this? It sounds like there's a meta script/GUI or something that writes the actual run script for you, and that is what needs to be changed in order to permanently remove these options.
- There is a meta script built from the model configuration file. I don't know where and how to modify it yet, but this might be out of the scope of the course.
- Is there a pdf version of the lecture slides somewhere?
- The presentation is here: https://github.com/bast/talk-better-resource-usage/; let me see if I can export it to PDF.
- Thanks! Unfortunately I'm having internet problems today and keep getting kicked out of Zoom, so I'd be happy to receive any kind of (the very helpful, thanks!) materials.
- PDF export is possible, see https://cicero.readthedocs.io/en/latest/export.html (please contact me, radovan.bast@uit.no, if this doesn't work and I can help with the PDF version).
- Is there overhead using /usr/bin/time?
- No, `/usr/bin/time` has no significant overhead (nor does `time`).
- Just a heads up: `$ /usr/bin/time` is not the same as `$ time`; the `-v` option will not work with the shell builtin `time`.
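- For example (the program name is a placeholder):
  ```bash
  # GNU time (the binary) reports detailed resource statistics with -v,
  # including "Maximum resident set size" (peak memory):
  /usr/bin/time -v ./myprog

  # The shell builtin only reports real/user/sys timings:
  time ./myprog
  ```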
- What happens if we don't set the memory in the script? I realized that in my climate model run script I didn't specify memory.
- Depends on the machine. Betzy and Fram are "exclusive", which means that you get the full node with all cores and memory regardless (in fact you cannot specify memory in the "normal" partition here). Saga is not exclusive, and here you _have to_ specify memory, so you will get an sbatch error if you do not.
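- A minimal sketch for Saga (the amount is made up; size it from measured usage, and note that `--mem` per node is an alternative):
  ```bash
  # Memory per allocated core; mandatory on Saga:
  #SBATCH --mem-per-cpu=4G
  ```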
- The terms nodes, cores, and CPU are not used completely consistently in the presentation. Could you maybe clarify them a bit?
- Node is a compute node (one computer)
- CPU is a processor inside a node. There are 2 CPUs per node; a CPU is also called a socket.
- Core is a processing unit inside a CPU. The number of cores differs between systems; you can check with the command "freepe" to get a list of all the variations and how much is free at a given time.
- A regular Saga compute node has 2 CPUs per node and 20 cores per CPU, i.e. 40 cores per node.
- Many programs have an option `-threads`. This refers to the number of cores, right? `--ntasks` refers to the number of cores, right? `--mem-per-cpu` refers to the cores, right? This is confusing.
- Yes, definitely confusing :) The sbatch options refer to:
- `--nodes`: nodes/machines
- `--ntasks`: MPI tasks or processes, equal to "cores" only when "cpus-per-task" is set to one
- `--cpus-per-task`: number of cores per task/process
- `--mem-per-cpu`: memory per core
- `--mem`: memory per node
- If a program accepts an option called `-threads`, it typically refers to shared memory (OpenMP etc) which relates to the `--cpus-per-task` option in sbatch.
- Thanks for this. I think it might be a good idea to explain these terms and how they relate to each other for new users.
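- To illustrate how these options combine (all names and numbers here are made up), a hybrid MPI + OpenMP job could be requested like this:
  ```bash
  #!/bin/bash
  #SBATCH --account=nnXXXXk     # placeholder project account
  #SBATCH --nodes=2             # 2 machines
  #SBATCH --ntasks-per-node=4   # 4 MPI tasks (processes) per node
  #SBATCH --cpus-per-task=8     # 8 cores per task, for threading
  #SBATCH --mem-per-cpu=2G      # memory per core
  #SBATCH --time=02:00:00

  # Let the OpenMP runtime use exactly the cores reserved per task:
  export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"

  srun ./my_hybrid_program      # placeholder executable
  ```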
- Please say something about hyperthreading. What is it? Is it useful?
- Hyperthreading makes one physical core appear as two logical cores to the operating system. On our HPC systems I do not think hyperthreading is allowed; you need to specify and use only hardware cores. (I will confirm this and come back to you.)
- How could we check the optimal reservation of nodes when we run big climate models? The resources consumed by the model depend greatly on the input/case I would like to run. Should we do something like Radovan showed for each test case?
- Maybe in this case one could analyze "real" calculations: check the usage of past calculations and adapt the future ones accordingly. For a couple of them I would follow https://documentation.sigma2.no/jobs/performance.html and check the percentage of computation vs. MPI, and within the computation part also check the percentage of actual computation vs. waiting for memory.
- I have a code that works well on ~32-128 cores with parallelization through Python multiprocessing. To get more speedup from the cluster, I need a second layer of parallelization that runs many of these types of jobs. Do you have a suggestion for a technology/environment/language that works well for that on the clusters?
- There are definitely many possibilities, but have you thought of trying a Python process that generates independent Slurm job scripts? That of course depends on how much the different processes depend on each other. If you need interaction/message passing between processes on different nodes, you can also look into mpi4py, which opens the world of MPI for you.
- Indeed, the example I used was mpi4py :-) This could be the second layer if the jobs need to communicate/coordinate: mpi4py could take care of coordinating across nodes, and multiprocessing of coordinating within the node. If the jobs are completely independent, then I would go for job arrays.
- If you need help with extending your program, you can send us an email and we can work on it together.
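- A possible sketch of the job-array-as-outer-layer idea: each array task runs one instance of the multiprocessing code on its own cores (the script name, its flags, and all numbers are hypothetical):
  ```bash
  #!/bin/bash
  #SBATCH --account=nnXXXXk   # placeholder project account
  #SBATCH --array=1-20        # outer layer: 20 independent instances
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=32  # inner layer: cores for multiprocessing
  #SBATCH --mem-per-cpu=2G
  #SBATCH --time=06:00:00

  # Each array task gets a different SLURM_ARRAY_TASK_ID and therefore
  # processes a different input file:
  python analysis.py --workers "${SLURM_CPUS_PER_TASK}" \
                     --input "case_${SLURM_ARRAY_TASK_ID}.nc"
  ```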
- How did you get to this desktop.saga and show the node info? I didn't totally get it. Can you summarize it somewhere so that we can read and follow along?
- Forgot to say that you need to be on the university network; then go to https://documentation.sigma2.no/jobs/monitoring.html. You can log in with your Metacenter username and password, and the node info is at the top of the page. [Sorry for being too quick there.]
- Maybe a bit off-topic for today and related to yesterday: after compiling my code (which I use the login node for), there's often the possibility to do a small test run (e.g. `make test`) to check whether compilation worked. Do I use the login nodes for this, or is it better to use an interactive job? The documentation has examples for interactive jobs on Saga/Fram, but I'm on Betzy, so I wasn't sure how to best do that.
- Depends a bit on how heavy the test suite is: if it runs on a single core for a few seconds/minutes it's no big problem, but I would recommend an interactive job for this. The procedure for getting interactive jobs is exactly the same on Betzy, by the way, but there you probably want to add `--qos=devel` in order to be allowed to use a single node for this (see the sketch after this thread).
- Perfect, thanks! Then I'll try to use an interactive job for this. By the way, HackMD is a perfect idea for asking/sharing questions; I like this.
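- A sketch of how such an interactive job could be requested on Betzy (the account is a placeholder; check the documentation for the currently recommended command on your system):
  ```bash
  salloc --account=nnXXXXk --qos=devel --nodes=1 --time=00:30:00
  ```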
- If one does not ask for `--ntasks`: is that the same as asking for 1 core, or will the scheduler decide how many cores to use?
- As far as I know you can ask for a number of tasks, or for a number of nodes and a number of tasks per node. I prefer the former since it gives the scheduler more flexibility, and often the exact placement does not matter. But sometimes it matters how the tasks are distributed across nodes, and then you need the latter. I am sure Slurm allows for a lot more, but the above is what I use.
- I meant: if you only specify "time" and "mem", will the scheduler use one core, or several cores if that improves the memory usage?
- I think that by default it will use one core if you don't specify anything, but I am not sure (it depends on the machine). I would verify it.
- If you do not specify anything, the following are the defaults (checked on Fram and Saga by submitting an srun job without specifying nodes or tasks):
- `NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1`
- Actually, on Fram the `SLURM_CPUS_PER_TASK` variable is not set in this case. You get assigned one _full node_ with a single task, with access to all the 32 cores. The minimum job size on Fram is one (full) node, on Saga the minimum is single core.
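- You can verify what a job was actually given, for example (the job ID is a placeholder):
  ```bash
  scontrol show job 123456   # look for NumNodes, NumCPUs, NumTasks, CPUs/Task
  ```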
- Is `--ntasks` equivalent to the number of CPUs?
- Number of cores (there are 20 cores in one CPU on Saga, for example).
- See similar question above
- When running array jobs, how do I find out the sweet spot for requesting resources (number of nodes) in terms of runtime vs. wait time?
- No universal answer. One approach is to test one job and estimate the optimum before starting the array. You could use the monitoring Radovan showed to see what the resource usage is; see also the sketch below.
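- As a sketch, past jobs can be inspected with `sacct` to compare elapsed time and peak memory against what was requested (the job ID is a placeholder):
  ```bash
  sacct -j 123456 --format=JobID,Elapsed,NCPUS,MaxRSS
  ```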
- For the exercise, I ran a script and looked into the output file; I see that the "memory usage stats" and "disk usage stats" are all 0.
- Was this a real job with actual computation? The memory statistics in the Slurm output are only sampled every 30 seconds or so.
- I ran on Saga with `sbatch saga_GROMACS.sh`. In the output file there is: "MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them." Does this mean the run failed?
- That seems to be an interesting (not yet reproducible for me) problem.
- Could you contact me via email at thomas.roblitz@uib.no?
- (Question not related to today, but I find this super confusing:) what is the difference between the Feide and OpenIdP accounts? Why is it recommended to use OpenIdP at UiO?
- Which resource are you trying to access?
- I am at UiO and use Feide to access Metacenter resources.
- I meant for the NIRD Toolkit. Why does it matter which login one uses?
- You may use Feide, and I do. Who/where has recommended OpenIdP over Feide?
- Excuse me for a very badly formulated question! I got this email to prepare for the NIRD hands-on tomorrow:
- "There are regulations with regard to access to services through Feide identity applied by the University of Oslo (UiO). For participants from either UiO or for participants that does not already have a Feide account, you will need to create an account with OpenIDP Feide (via https://openidp.feide.no) in order to join the hands-on part of the NIRD Toolkit lectures."
- Also, at other workshops the organizers have asked me to use/create an OpenIDP Feide account to log in to a JupyterHub. Why is this better?
- As far as I can see, this recommendation is for "participants that does not already have a Feide account".
- Or 'For participants from either UiO'?
- Let me check with the organisers (contacted and waiting).
- Great! Thank you!
- I could not reach the exact person; can you please drop a mail to hpc-drift@usit.uio.no so that I can reply? (I do not want you to place your email address here.)
- Thank you! I will take the time to formulate my question better in an email when we are done for today.
- Thank you for the understanding. This email originated outside of UiO, and we shall find out the reasons and fix things if there is anything to fix (if possible).
- Concerning UiO users: they can only use Feide with services "activated" by UiO. Since there is no such thing as a wildcard activation, and we cannot ask UiO to manually activate every service created in the NIRD Toolkit, we use OpenIdP Feide.
---
## Disk quota and usage, data archiving
- Small local storage, 300 GB: does this mean we have the same 300 GB on Saga and Fram? Can we request larger local storage if needed? It seems we only have 20 GB in the home directory now.
- The local storage means local to the compute nodes during job execution, and is only available as long as the job runs. The 20 GiB quota will usually not be extended, but you can ask for extra project storage under `/cluster/projects/nnXXXXk`: https://www.sigma2.no/apply-e-infrastructure-resources
- The project directory of 1 TB is a real limit; we can't even store all the initial data needed for the climate model. Can we request larger storage for a project?
- Yes, you can apply for up to 10TB. https://documentation.sigma2.no/files_storage/clusters.html#project-area
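- To see current usage against the quotas on the clusters, there is the `dusage` command (run it without arguments; the output format varies per system):
  ```bash
  dusage   # shows disk usage and quotas for your user and groups
  ```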
- Just to confirm: auto-cleanup time is different for each project and depends on whether usage is above or below 70% capacity?
- No, it depends on the disk usage of the whole system.
- Roger that. Thanks for confirming!
---
## Day 2 Q&A
- Regarding parallelizing Python code, is mpi4py a solution commonly used on Sigma2 HPC systems?
- Yes, I have used this module, at least on Vilje. It is a tool for MPI parallelization, and not the only way to parallelize Python programs.
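- As a sketch, an mpi4py program is launched from a job script like any other MPI program (the script name and all numbers are placeholders):
  ```bash
  #!/bin/bash
  #SBATCH --account=nnXXXXk   # placeholder project account
  #SBATCH --ntasks=4          # one MPI rank per task
  #SBATCH --mem-per-cpu=1G
  #SBATCH --time=00:10:00

  # srun starts one Python process per task; mpi4py picks up the MPI
  # environment so that each process knows its own rank:
  srun python my_mpi_script.py
  ```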
- How does the `/nird/projects/nird/${projectID}/` directory differ from the NIRD Archive? Is it meant to be used as a staging area before proper archiving, or is it a longer-term storage directory?
- The project area is primarily for working data. You could use it as a staging area for archiving data if you want, but the project area storage belongs to your group; you use it as you see fit. Once you no longer actively need to use your data, or your project has completed and the data is valuable to your community, you could consider archiving it. The archive will keep the data for at least 10 years.