What would you like to get out of this workshop?
What is the payment model for use of the HPC resources?
What other systems (outside of Norway) do you use?
What are the biggest obstacles when starting or working on an HPC?
How much time do you spend on IT matters (HPC, programming, NIRD, …)?
How do you find the best machine for your type of work?
What are the "papercuts" for you on the HPC systems? (small but annoying things that you wish were better/ different)? I am asking this so that we improve these.
A truly small papercut: I am involved in three different projects on Saga with very similar project directory names (nn####k). I always end up working in the wrong directory.
Time estimates for when a job starts in the queue
squeue -j 12345 --start
But it is an estimate: things are shifting all the time, so it cannot always be 100% reliable.
Sometimes a job fails (before finishing or reaching the requested job time) after waiting really long in the queue. It would be nice to have a small 'grace period' of a few minutes in which one could fix the issue and resubmit (continue) the job while the nodes stay reserved for it.
EasyBuild slides: https://cicero.xyz/v2/remark/github/stigrj/easybuild-demo/main/easybuild-demo.mkd
Official tutorial: https://easybuilders.github.io/easybuild-tutorial/
Is there an easy way to clear out or tidy up the software I installed?
What do the 'setenv'(XXXX) lines shown by module show YYYYY mean? These environment variables are not standard on a computer, right? I'm used to seeing LD_LIBRARY_PATH, for instance.
I'm trying to install some packages in a conda environment, but am getting the error "[Errno 122] Disk quota exceeded". I'm only using 600 MB right now, so what's going on?
dusage -i
Often you have to clean the .conda folder. Try conda clean -a
More on that later in the next session.
File "/usr/lib64/python2.7/decimal.py", line 3872, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: Invalid literal for Decimal: '100000\nXXXXXX_g'
(where XXXXXX is the username; I'm in a clean environment on Saga with no modules loaded)
What happens if there is a conflict between the versions of the loaded modules? For instance, loading a compiled version of NetCDF will depend on a compiler, whereas another loaded library can depend on a different compiler version.
How do you configure your own environment?
What about Docker containers?
I tried to set a variable, then load a module that sets it, then unload the module.
The result is that the original value is not restored, so loading a module is apparently not fully reversible?
Where can I find out which packages are already available to install using EasyBuild?
eb -S packagename
to see available easyconfig files in the repository, after you have loaded the EasyBuild module.
Presentation based on documentation: https://documentation.sigma2.no/software/userinstallsw/python.html#installing-python-packages
Can you show the isolation node command again, please? Is this the recommended way to work when compiling somewhat larger code bases, for instance?
Follow-up: when should we DEFINITELY switch away from the login nodes to keep good cluster hygiene?
If I have created different conda environments and virtualenvs, how do I keep the most up-to-date one and easily tidy up the others?
Create a requirements.txt (virtualenv) or environment.yml (conda) and install from these files. This way you have documented what you have installed. Use one environment per project; if you use one per project or folder, it is also easier to remove an environment without affecting all other projects. Documenting the actual dependencies in environment.yml may, in my opinion, be more important than always using the latest versions, because the latest versions today will be old versions in two years, and if you return to your project in two years it is nice if the versions are documented somewhere.
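A minimal sketch of that workflow (the project and file names are placeholders, not from the course material):

```bash
# One environment per project, with the dependencies recorded in a file kept next to the code.

# conda: create an environment from environment.yml in the project folder...
conda env create --name myproject --file environment.yml
# ...and remove it again later without touching any other project's environment
conda env remove --name myproject

# virtualenv/venv + pip: the same idea with requirements.txt
python -m venv myproject-env
source myproject-env/bin/activate
pip install -r requirements.txt
```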
Presentation following: https://documentation.sigma2.no/software/userinstallsw/R.html
Given that a lot of bioinformatics tools based in R are focused on producing graphics, can R be run interactively on any of the Sigma2/NIRD systems?
ssh -X saga.sigma2.no
(thanks!)
How about putting the .libPaths() statement into the .Rprofile in your HOME? This works on most systems.
Presentation slides: How to improve job scripts for better resource usage
Regarding the time setting: my job usually finishes in around 16 hours, but sometimes it takes over 16 hours, even 18 hours, to finish. What is the reason? Because of this I set the time limit to 20 hours; is that reasonable?
Within the job script, can you specify part of the resources for one command to use? I want to include several commands that share the total resources and run simultaneously.
On Fram, my climate model archive script has #SBATCH --qos=preproc and #SBATCH --exclusive. I know it should not use exclusive, and Fram no longer needs preproc specified, but I don't know how to change this automatically or permanently. I can delete it in the specific file, but every time I start a model run it will be there again. How do I avoid this?
Is there a pdf version of the lecture slides somewhere?
Is there overhead using /usr/bin/time?
/usr/bin/time has no significant overhead (nor does time).
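For example (a small sketch; the program name is just a placeholder), the verbose flag also reports peak memory, which helps when choosing a memory request:

```bash
# GNU time with -v prints wall time and "Maximum resident set size" after the run.
/usr/bin/time -v ./my_program
```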
What happens if we don't set the memory in the script? I realized that in my climate model running script I didn't specify memory.
The terms nodes, cores, and CPU are not used completely consistently in the presentation. Could you maybe clarify this a bit more?
Many programs have an option like -threads. This refers to the number of cores, right? And --ntasks refers to the number of cores, right? And --mem-per-cpu refers to cores, right? This is confusing.
- --nodes: number of nodes/machines
- --ntasks: MPI tasks or processes; equal to "cores" only when --cpus-per-task is set to one
- --cpus-per-task: number of cores per task/process
- --mem-per-cpu: memory per core
- --mem: memory per node

As for -threads, it typically refers to shared memory (OpenMP etc.), which relates to the --cpus-per-task option in sbatch.
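A sketch of how these options can look together in a job script header (the account name, sizes, and program name are placeholders, not taken from the course):

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk       # your project account (placeholder)
#SBATCH --time=01:00:00
#SBATCH --nodes=2               # machines
#SBATCH --ntasks-per-node=8     # MPI tasks (processes) per node
#SBATCH --cpus-per-task=4       # cores per task, e.g. for OpenMP threads
#SBATCH --mem-per-cpu=2G        # memory per core

# hand the per-task core count on to the threaded part of the code
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_hybrid_program        # placeholder executable
```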
Please say something about hyperthreading. What is it? Is it useful?
How could we check the optimal reservation of nodes when we run big climate models? The resources consumed by the model will depend greatly on the input/case I would like to run. Should we do something like Radovan showed for each test case?
I have a code that works well on ~32-128 cores with parallelization through Python multiprocessing. To get more speedup from the cluster I need a second layer of parallelization that runs many of these types of jobs - do you have a suggestion for the technology / environment / language that works well for that on the clusters?
How did you get to this desktop.saga and show the node info? I didn't totally get it. Can you summarize it somewhere so that we can read and follow along?
Maybe a bit off-topic for today and related to yesterday: After compiling my code (which I use the login node for), there's often the possibility to do a small test run (e.g. make test) to check whether compilation worked. Do I use the login nodes for this, or is it better to use an interactive job? In the documentation there are examples for interactive jobs on saga/fram, but I'm on Betzy, so I wasn't sure how to best do that?
The SLURM_CPUS_PER_TASK variable is not set in this case. You get assigned one full node with a single task, with access to all 32 cores. The minimum job size on Fram is one (full) node; on Saga the minimum is a single core.
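A one-line sketch for scripts that should also work when --cpus-per-task was not requested (the fallback value of 1 is my assumption, not from the course):

```bash
# SLURM_CPUS_PER_TASK is unset if --cpus-per-task was not given, so default to a single thread.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```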
Is --ntasks equivalent to the number of CPUs?
When running array jobs, how do I find the sweet spot for requesting resources (number of nodes) in terms of runtime vs. wait time?
For the exercise, I ran a script, and looking into the output file I see that the "Memory usage stats" and "Disk usage stats" are all 0.
sbatch saga_GROMACS.sh
In the output file there is "MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1."
(Question not related to today, but I find this super confusing): what is the difference between the Feide and OpenID accounts? Why is it recommended to use OpenID at UiO?
Regarding the small local storage of 300 GB: does this mean we have the same 300 GB on Saga and Fram? Can we request larger local storage if needed? It seems we only have 20 GB in the home directory now.
/cluster/projects/nnXXXXk: https://www.sigma2.no/apply-e-infrastructure-resources
The project directory quota of 1 TB is really limiting; we can't even store all the initial data needed for a climate model. Can we request larger storage for a project?
Just to confirm: auto-cleanup time is different for each project and depends on whether usage is above or below 70% capacity?
Regarding parallelising python code, is mpi4py a solution commonly used on sigma hpc systems?
How does the /nird/projects/nird/${projectID}/ directory differ from the NIRD archive? Is it meant to be used as a staging area before proper archiving, or is it a longer-term storage directory?
ns9989k-jupyter.craas1.sigma2.no
ns9989k-minio.craas1.sigma2.no
ns9989k-rstudio.craas1.sigma2.no
shiny-ns9989k-rstudio.craas1.sigma2.no
ns9989k-jupyterhub.craas1.sigma2.no
What is a CPU in this context? ~500 cores on NIRD-TOS? How many nodes?
When I registered for Dataporten with a Feide guest account, the login credential was not connected to my user account on NIRD. So the resulting dashboard says "Your account is not allowed to register personal applications and APIs." How can I resolve this issue?
Who is the "optimal" person to make/enable/change the top docker image of a project. The project leader?
(from zoom chat) Smallest machine is "Base 1CPU 1GB", right?
When group members are given access to an application, will they effectively be using the group leader's account on NIRD to access data and Docker images? Will they all have read and write access to the files on NIRD? How do we ensure that no one breaks things unintentionally?
I'm still on "fetching events". Is the initialization a bit slow for others as well?
On my application page it shows: "Failing. The application could not start due to: The list of events may contain more details explaining what is causing the application to fail." What's the reason? Below, at Recent events | fetching events… it shows: "2m Back-off restarting failed container backoff".
Will the minio data download link still be available to share and download from after the application has been closed? Or should the application only be closed once the data has been downloaded by the intended recipient? I.e., do the data and the download link persist after closing the application?
I have accepted the invitation to join the group CRaaS1, ns9989k and Dataporten App Creators on Dataporten. When trying to install or look at any of the services I receive the notification: You are not allowed to use any projectspaces.
In this Jupyter, are the Python packages already installed after setting it up?
Can you explain again where we can check/change the Docker images used by the Jupyter notebook?
For JupyterHub authorized groups, you referred to a class and creating a new group; can you explain how to create it?
What happens if someone (student) forgets to logout?
I have accepted the invitation to join the group CRaaS1, ns9989k and Dataporten App Creators on Dataporten, but when trying to install JupyterHub, under Projectspace I see my research project as the only choice.
Not clear to me: is the point of the NIRD Toolkit that you can run tools/portals/codes to analyze and visualize massive data stored on NIRD (yourself, or students/others that you invite), or is this unrelated to actual NIRD storage? For example, just to teach someone to use Jupyter notebooks in a course?
Is it just me, or do others see this too? My JupyterHub is stuck initializing and always shows "Started container pause".
What is the difference between the docker image and the proxy image?
I think this question is related to RStudio, since there are two Docker images you are allowed to change in the advanced configuration of RStudio. The one pointing to quay.io/uninett/rstudio-server:tag is the one you should change if you would like to add software.
The proxy image sets up an nginx proxy, you would probably never need to change this.
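If you do want to extend that image (for example to fix the missing libudunits2 library mentioned below), one possible approach is to build and push your own image based on it and point the dockerimage field at your tag. This is only a sketch: the tag, the registry name, and the assumption that the base image is Debian/Ubuntu-based (so apt-get works) are mine, not from the course.

```bash
# Build a derived RStudio image with an extra system library, then push it to a
# registry you control so it can be pulled by the NIRD Toolkit. All names are placeholders.
cat > Dockerfile <<'EOF'
FROM quay.io/uninett/rstudio-server:latest
USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends libudunits2-dev \
 && rm -rf /var/lib/apt/lists/*
EOF
docker build -t myregistry.example.com/my-rstudio:2021-05 .
docker push myregistry.example.com/my-rstudio:2021-05
```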
library(sf)
Error: package or namespace load failed for ‘sf’ in dyn.load(file, DLLpath = DLLpath, …):
unable to load shared object '/usr/local/lib/R/site-library/units/libs/units.so':
libudunits2.so.0: cannot open shared object file: No such file or directory
Do you have any tips for using Emacs locally and logging in to the cluster through SSH?
Follow-up: which editor do you (instructors) use on the clusters? vim/vi? Do you have any recommended setups you could share?
How many ssh keys should one typically have? I guess it is ok to use the same key to log in to Saga, Fram, Betzy etc? How about further connecting to github etc from these machines?
Is there anything special about the SSH keys on the HPC systems compared to other systems? I have SSH key login set up for several local servers and they work. I tried setting it up for Saga in the last few days, and just now with the full instructions, and I'm still being asked for the password each time. Any suggestions on what could be going wrong?
ssh -v says "host key algorithm: ecdsa-sha2-nistp256" - does that mean ed25519 and RSA are not supported?
(ssh-keygen, etc.)
A follow-up on the editor point above: as a more novice HPC user I'm more used to GUI editors. Some offer the possibility to log in via SSH (e.g., Visual Studio Code), so that you can edit files on a server basically from your own machine with all the GUI 'benefits'. Is it possible to do this on the Metacenter machines as well?
I have problems installing any of the apps in the NIRD Toolkit. Marius helped me a bit and told me to use my UiO account instead of OpenIdP. I am quite confused: which one should we use, exactly? None of them has installed successfully so far.
Where can we get the NIRD Toolkit demo videos so that we can watch them later?