
HPC course notes from previous days

Day 1

Icebreaker

What would you like to get out of this workshop?

  • Efficient and fair use of our clusters :) +1
  • blue skies in Bergen :)
  • Learn new tricks and learn more about available services
  • Learn more about Containers and maybe some tricks for the job scripts
  • Learn how to improve my running sbatch scripts, and how to use the NIRD Toolkit for better, more direct data analysis on NIRD
  • Better understanding of job priorities in the queue (i.e. better to run shorter jobs with more cpus, or longer jobs with less cpus)
  • using NIRD toolkit for teaching

Overview about the infrastructure

  • It currently shows no GPUs on Betzy; does this mean it is also CPU-based, and that the climate models run on Fram can definitely be run on Betzy?
    • Indeed, Betzy is also CPU-based and has no GPUs (as far as I know). Porting the models should not be too much work, but moving models from one machine to another is always some work (in my experience). There are more cores per processor on Betzy and also a different processor architecture (AMD vs Intel). The programming and library environment, however, should be basically the same, and all libraries available on Fram should in principle be available on Betzy.
    • Fram regular compute node: 32 cores per node, 61 GiB available RAM. Betzy regular compute node: 128 cores per node, 245 GiB available RAM.
    • Betzy will (hopefully) be getting A100 GPUs within this year though.
  • Handouts of the presentations? Will they be made available somewhere?
    • good question. we will collect all material and link to it from top of this document
  • What is the payment model for use of the HPC resources?

  • What other systems (outside of Norway) do you use?

    • I have been using systems in Germany.
    • Have used systems in Sweden
    • Systems in Germany
    • HPC at Météo-France
  • What are the biggest obstacles when starting or working on an HPC?

    • As a student, jumping straight into HPC without previous experience was really daunting, and it was difficult to know if I was "breaking the rules" or whether I was following best practices. Besides the "dos and don'ts" in the online documentation, I never knew whether small pre/post-processing scripts were frowned upon or not. It's still a little unclear to me, actually.
    • How to run a climate model properly without knowing the details of the platform
  • How much time do you spend on IT matters (HPC, programming, NIRD, …)?

    • Probably around 90%
    • a lot, running climate model, data analysis, teaching on data visualization and programming etc.
  • How do you find the best machine for your type of work?

    • Answer: You can ask for a specific one. The RFK (Resource Allocation Committee) will try to assign you to the best-fitting system. Aspects: parallelism of jobs, GPUs, needed memory (per CPU), file system needs, (software), amount of core hours needed, need to prove that applications scale (Arm reports)
  • What are the "papercuts" for you on the HPC systems? (small but annoying things that you wish were better/ different)? I am asking this so that we improve these.

    • A truly small papercut: I am involved in three different projects on Saga with very similar project directory names (nn####k). I always end up working in the wrong directory

      • I understand, so it would be nice to have project "aliases"/"nicknames": a name instead of a number?
        • That would definitely help avoid some confusion on my part!
          • thanks for pointing it out. i will note these down so that we consider how to improve these
            • It's definitely not a big issue and the best solution might be changing my workflow, thanks though!
    • Time estimates for when a job starts in the queue

      • that they are wrong/unreliable?
      • Rephrased question: Estimates for how long time a job has to wait in the queue
    • Sometimes a job fails (before finishing or reaching requested job time) after waiting really long in a queue and it would be nice to have a small 'grace period' of a few minutes in which one could fix the issue and resubmit (continue) the job while your nodes stay reserved for your job in the 'grace period'.

      • This is tricky, as the position in the job queue is decided after considering the resources requested in the job script. I.e., if you ask for one minute of runtime and then, after getting the reservation, change it to one week, it is not fair to others.
        • Indeed, but this was not about prolonging (which we also get asked about from time to time) but about keeping the long-awaited time slot instead of seeing it disappear after a simple mistake. I see how frustrating that can be :-) What I will recommend in a talk tomorrow and Thursday is to "grow" a calculation to avoid these. But it cannot be 100% avoided.
          • True, but what would stop someone from changing the runtime or the number of cores? What if the failure is due to not asking for enough memory?
            • One would have to keep the same settings. Indeed, Slurm would not release the reserved resources but keep them for a bit longer. I know it's tricky.
      • But a great idea! I never thought of this.
      • I agree, it is something for a better user experience. What about giving users some credits, with which they could themselves increase their queue position a given number of times, or increase the runtime, according to the credit available?
      • There might also be other reasons for the job not to start properly, e.g. insufficient funds on the project. It would be good to get some sort of job diagnostics upon submission instead of realizing that it is not starting when it is finally your turn.

Installing software with EasyBuild

  • EasyBuild slides: https://cicero.xyz/v2/remark/github/stigrj/easybuild-demo/main/easybuild-demo.mkd

  • Official tutorial: https://easybuilders.github.io/easybuild-tutorial/

  • Is there an easy way to clear or tidy up the software I installed?

    • But how was it installed? With EasyBuild, or differently? It depends on what was used to install the software.
  • What do the 'setenv'(XXXX) lines shown by module show YYYYY do? These environment variables are not standard on a computer, right? I'm used to seeing LD_LIBRARY_PATH, for instance.

    • this can set any environment variables that are then used by the code in question (the example that we have seen set some variables only understood by the Arm perf-report code)
  • I'm trying to install some packages in a conda environment, but am getting the error "[Errno 122] Disk quota exceeded". I'm only using 600 MB right now, so what's going on?

    • There are actually two different file quotas in place (size and number of files). Probably you exceeded the number-of-files quota; you can check that with dusage -i. Often you have to clean the .conda folder: try conda clean -a (a short sketch follows this thread). More on that in the next session.
    • dusage -i gives the following error:
      File "/usr/lib64/python2.7/decimal.py", line 3872, in _raise_error raise error(explanation) decimal.InvalidOperation: Invalid literal for Decimal: '100000\nXXXXXX_g' (where XXXXXX is the username; I'm in a clean environment on SAGA with no modules loaded)
    • this is unfortunately a known problem (which should have been communicated better). It broke recently after a new storage pool was added. We will fix this soon; very sorry that it is still not fixed. We are also working on a new version of dusage which will give better/easier information about your quota
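A minimal sketch of the quota check and cleanup mentioned in the answer above (the environment name myenv is just a placeholder):

```bash
# check both quotas: disk space and number of files (inodes)
dusage
dusage -i

# remove conda caches: downloaded tarballs, index cache, unused packages
conda clean -a

# optionally remove an environment you no longer need (myenv is a placeholder)
conda env remove --name myenv
```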
  • What happens if there is a conflict between the versions of the loaded modules? For instance, a compiled version of NetCDF will depend on a compiler, whereas another loaded library can depend on a different compiler version.

    • this is the "dependency hell". In this case either the dependency or the code needs to be changed. These conflicts happen from time to time and it can be a bit of work to resolve them. Spack is a bit more relaxed about this, as far as I know, and allows mixing dependencies compiled with different compilers and versions.
    • "dependency hell". Hehe. Nice vocabulary.
  • Configure your own environment, how?

    • in which context was it mentioned? (I missed that part)
      • in the chat
        • aha, Thomas can you expand? …
        • instead of setting or changing environment variables in your .bashrc, you could make a small module file as demonstrated (the SetFoo.lua example); then you can easily adjust your environment and revert it by loading or unloading the module (a small sketch follows this thread)
        • as pointed out below, setting an environment variable with this method, does not allow to revert the change, so be careful
        • I guess it will not remember the previous state, so unloading means unsetting, not reverting
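A rough sketch of the idea discussed in this thread, assuming an Lmod-style module system; the directory name, module name, and variable are made up, and note the caveat above that unloading unsets rather than restores the variable:

```bash
# create a personal modulefile directory with a tiny Lua modulefile
mkdir -p ~/my_modules
cat > ~/my_modules/setfoo.lua << 'EOF'
-- set an environment variable while the module is loaded
setenv("FOO", "bar")
EOF

# make the directory known to the module system, then load/unload as needed
module use ~/my_modules
module load setfoo
echo "$FOO"          # -> bar
module unload setfoo
echo "$FOO"          # -> empty (unset, not restored to any previous value)
```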
  • What about Docker containers?

    • we will talk about this on Friday. Singularity is available on all machines and can import Docker containers, so it is possible.
  • I tried to set a variable; then load a module that sets it; then unload the module.
    Result is that the original variable is not restored, so loading a module is apparently irreversible?

    • hmm… it can be that they are not completely reversible/commutative. Also for this reason I like to load all modules in the run scripts and not "outside", to keep the loads separated and also documented.
      • side note: I often thought that programming a module system should be easy, so why is this so slow/complicated/broken? But it turns out it is complicated, with many corner cases, and now I take back all my past comments on this. It is surprisingly non-trivial.
  • Where can I find which packages I am able to install using EasyBuild already?


Installing Python packages

Presentation based on documentation: https://documentation.sigma2.no/software/userinstallsw/python.html#installing-python-packages

  • Can you show the isolation node command again please? Is this the recommended way to work when one is compiling somewhat bigger code bases, for instance?

  • Follow up: when should we DEFINITELY switch from the login-nodes to keep good cluster hygiene?

    • When it needs multiple threads and more memory for a longer time (e.g. compiling GCC).
    • Internet access should not be needed during compiling/installing
      • but sometimes it is (some codes fetch dependencies from the net at configure or build time) and I think this is now solved? Compute nodes can now access internet via proxy (? right?)
        • should be covered a bit in tomorrow's session 2 about jobs
        • depends a bit on how the Internet is accessed: it can work without any configuration out of the box, with some configuration, or not at all (note: currently only set up to work on Fram and Saga)
    • Personally I only use login nodes for editing files or submitting/monitoring jobs, and do everything else on compute nodes. I like to also put compilations into a run script. The bonus of this is that it forces me to document and isolate dependencies, which is good for me the next time I want to build it two months later, when I have forgotten everything again. (A sketch of requesting an interactive compute node follows this thread.)
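A minimal sketch of requesting an interactive shell on a compute node for compiling (Saga-style; the account nnXXXXk, the resource numbers, and the toolchain name are placeholders, and on Fram/Betzy you would not specify memory):

```bash
# ask for an interactive shell on a compute node (account is a placeholder)
srun --account=nnXXXXk --qos=devel \
     --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2G \
     --time=00:30:00 --pty bash -i

# then compile as usual inside that shell, e.g.
# module load foss/2020a    # toolchain name is only an example
# make -j 4
```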
  • If I have created different conda environments and virtualenvs, how do I keep the most up-to-date one and easily tidy up the others?

    • For both I recommend listing the dependencies that you have installed in requirements.txt (virtualenv) or environment.yml (conda) and installing from these files; this way you have documented what you have installed. Use one environment per project: if you use one per project or folder, it is also easier to remove an environment without affecting all other projects. Documenting the actual dependencies in environment.yml is, in my opinion, more important than always using the latest versions, because the latest versions today will be old versions in two years, and if you return to your project in two years it is nice if the versions are documented somewhere. (A small sketch follows below.)
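A small sketch of the "one documented environment per project" idea from the answer above (the environment and file names are placeholders):

```bash
# virtualenv: record dependencies in requirements.txt and install from it
python -m venv my_project_env
source my_project_env/bin/activate
pip install -r requirements.txt

# conda: record dependencies in environment.yml and create the environment from it
conda env create -f environment.yml

# removing an environment later does not affect other projects
rm -rf my_project_env                    # virtualenv
conda env remove --name my_project       # conda; the name comes from environment.yml
```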

Installing R packages

Presentation following: https://documentation.sigma2.no/software/userinstallsw/R.html

  • Given that a lot of bioinformatics tools based in R are focused on producing graphics, can R be run interactively on any of the Sigma2/NIRD systems?

    • great question. investigating … at the minimum we will return to this on Thu but looking for an answer now
    • indeed possible on an interactive node, but include X forwarding when logging into Saga
      • can you add the command here? how do you do it from putty?
        • i am unsure about putty but recommend using https://mobaxterm.mobatek.net/ on Windows since it has built-in support for "X server" (the graphics part) and generally more options
        • command: ssh -X saga.sigma2.no (thanks!)
  • How about putting the .libPaths() statement into the .Rprofile on your HOME? This works on most systems.

    • (I am not an R expert …) but I would recommend loading everything in a script. Then if you return to the script one year later, you see all the dependencies. It also makes it easier for staff to help, because otherwise they do not have the same environment as you. If everything is in the script, staff can take the script and reproduce the problem. (A small sketch follows below.)
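A hedged sketch of what "everything in a script" could look like for R on the cluster; the module version, package name, and script name are only examples (see "module avail R" for what is actually installed):

```bash
# load an R module (the version string is only an example)
module purge
module load R/4.0.0-foss-2020a

# install into a personal library and run the analysis, all from one script
mkdir -p ~/R/library
export R_LIBS_USER=~/R/library
Rscript -e 'install.packages("mypackage", repos = "https://cran.r-project.org")'
Rscript my_analysis.R      # my_analysis.R and mypackage are placeholders
```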

Day 2

Presentation slides: How to improve job scripts for better resource usage

  • For the time setting: when a job usually finishes in around 16 hours, but sometimes takes over 16 hours, even 18 hours, to finish, what is the reason? And because of this I set the time to 20 hours; is this reasonable?

    • Yes, that request is not unreasonable. I would even ask for 24 hours for this specific example. If the job finishes in 16, you will only be charged for the 16 hours. The only issue is that the queue time might be slightly longer.
  • Within the job script, can you specify part of the resources for one command to use? I want to include several commands that share the total resources and they run simultaneously.

    • The question is not clear enough to give a solid answer. Do you mean that you want to request a pool of resources that many analyses can share?
      • Say I want to run an ensemble model forecast with 10 runs; I request 10 CPUs in the job script. I will use a for loop in the script to call the model executable 10 times, each run in the background (with a &). How can I tell each of the runs to use CPU #0, #1, …? Do I use -machinefile in the mpirun command?
      • Would an array job be a solution: https://documentation.sigma2.no/jobs/job_scripts/array_jobs.html
      • This will not specify which cores to be used, but it will use the next available core from your allocation (core level not CPU/socket level control though)
      • Thanks, the array job seems to be exactly what I need. If I calculate the total number of CPUs = number of runs x CPUs per run, it should work (see the sketch below this thread).
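A minimal array-job sketch for the ensemble case discussed in this thread; the account, the resource numbers, and the model command are placeholders:

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk      # placeholder project account
#SBATCH --job-name=ensemble
#SBATCH --time=02:00:00
#SBATCH --ntasks=1             # resources *per ensemble member*
#SBATCH --mem-per-cpu=2G       # needed on Saga; omit on Fram/Betzy
#SBATCH --array=1-10           # 10 independent ensemble members

# each array task gets its own SLURM_ARRAY_TASK_ID (1..10) and its own allocation
srun ./model.exe --member "${SLURM_ARRAY_TASK_ID}"    # executable and option are placeholders
```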
  • On Fram, my climate model archive script has #SBATCH --qos=preproc and #SBATCH --exclusive. I know it should not use exclusive, and Fram doesn't need preproc to be specified anymore, but I don't know how to change this automatically or permanently. I mean, I can delete it in the specific file, but every time I start a model run it will be there again; how do I avoid this?

    • You can specify this outside the job script, as an option to sbatch:
    • sbatch [options] jobscript
    • https://slurm.schedmd.com/sbatch.html
    • Example:
    • sbatch --qos=preproc batch_script.sh
    • What exactly do you mean by "start a model run", which program is this? Sounds like there's a meta script/GUI or something that writes the actual run script for you, which needs to be changed in order to permanently remove these options.
      • There is a meta script generated from the model configuration file. I don't know where and how to modify it yet, but this might be out of the scope of the course.
  • Is there a pdf version of the lecture slides somewhere?

  • Is there overhead using /usr/bin/time?

    • No, /usr/bin/time has no significant overhead (nor does time).
    • Just a heads up: "$ /usr/bin/time" is not the same as "$ time"; the "-v" option will not work for "time" (see the short example below this thread).
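To illustrate the difference (myprog is a placeholder):

```bash
# shell built-in: wall-clock, user and system time only; it does not understand -v
time ./myprog

# external program: -v adds details such as the maximum resident set size (peak memory)
/usr/bin/time -v ./myprog
```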
  • What happens if we don't set the memory in the script? I realized in my climate model running script, I didn't specify memory.

    • Depends on the machine. Betzy and Fram are "exclusive", which means that you get the full node with all cores and memory regardless (in fact you cannot specify memory in the "normal" partition there). Saga is not exclusive, and there you have to specify memory, so you will get an sbatch error if you do not.
  • The terms nodes, cores, and CPU are not used completely consistently in the presentation. Could you maybe specify this a bit more clearly?

    • Node is a compute node (one computer)
    • CPU is a processor inside a node. There are 2 CPUs per node; this is also called a socket.
    • Core is a processing unit inside a CPU. This differs for each system; you can check with the command "freepe" to get a list of all the variations and how much is free at a given time.
    • A Saga regular compute node has 2 CPUs per node and 20 cores per CPU, i.e. 40 cores per node.
  • Many programs have an option -threads. This refers to the number of cores, right? The --ntasks option refers to the number of cores, right? The --mem-per-cpu option refers to cores, right? This is confusing.

    • Yes, definitely confusing :) The sbatch options refer to:
      • --nodes: nodes/machines
      • --ntasks: MPI tasks or processes, equal to "cores" only when "cpus-per-task" is set to one
      • --cpus-per-task: number of cores per task/process
      • --mem-per-cpu: memory per core
      • --mem: memory per node
    • If a program accepts an option called -threads, it typically refers to shared memory (OpenMP etc.), which relates to the --cpus-per-task option in sbatch (a small job-script sketch follows this thread).
    • Thanks for this. I think it might be a good idea to explain these words and how they relate to each other for new users.
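A small job-script header sketch showing how these options combine for a hybrid MPI + OpenMP job (the account and all numbers are placeholders):

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk        # placeholder project account
#SBATCH --time=01:00:00
#SBATCH --nodes=2                # 2 machines
#SBATCH --ntasks-per-node=4      # 4 MPI tasks (processes) per node -> 8 tasks in total
#SBATCH --cpus-per-task=5        # 5 cores per task -> 40 cores in total
#SBATCH --mem-per-cpu=2G         # memory per core (needed on Saga; omit on Fram/Betzy)

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"    # threads per MPI task
srun ./my_hybrid_program                           # placeholder executable
```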
  • Please say something about hyperthreading. What is it? Is it useful?

    • On our HPC systems I do not think hyperthreading is enabled; you need to specify and use only hardware cores. (I will confirm this and come back to you.)
  • How can we check the optimal reservation of nodes when we run big climate models? The consumed resources of the model will depend greatly on the input/case I would like to run. Should we do something like Radovan showed for each test case?

    • maybe in this case one could analyze "real" calculations, check the usage of past calculations, and from this adapt the future ones. For a couple of them I would run this: https://documentation.sigma2.no/jobs/performance.html and check the percentage of computation vs. MPI, and within the computation part also check the percentage of actual computation vs. waiting for memory.
  • I have a code that works well on ~32-128 cores with parallelization through Python multiprocessing. To get more speedup from the cluster I need a second layer of parallelization that runs many of these types of jobs - do you have a suggestion for the technology / environment / language that works well for that on the clusters?

    • There are definitely many possibilities, but have you thought of trying a Python process that generates independent Slurm job scripts? That of course depends on how much the different processes depend on each other. If you need interaction/message passing between processes on different nodes you can also look into mpi4py, which opens the world of MPI for you.
    • Indeed, the example that I used was mpi4py :-) This could be the second layer if the jobs need to communicate/coordinate: mpi4py could take care of coordinating across nodes, and multiprocessing of coordinating within the node. If they are completely independent, then I would go for job arrays.
    • If you need help with expanding your program you can send us a mail and we can work on it together.
  • How did you get to this desktop.saga and show the node info? I didn't totally get it. Can you summarize it somewhere so that we can read and follow?

    • forgot to say that you need to be on the university network and then go here: https://documentation.sigma2.no/jobs/monitoring.html - you can log in with your Metacenter username and password, and the node info is at the top of the page [sorry for being too quick there]
  • Maybe a bit off-topic for today and related to yesterday: After compiling my code (which I use the login node for), there's often the possibility to do a small test run (e.g. make test) to check whether compilation worked. Do I use the login nodes for this, or is it better to use an interactive job? In the documentation there are examples for interactive jobs on saga/fram, but I'm on Betzy, so I wasn't sure how to best do that?

    • Depends a bit on how heavy the test suite is; if it runs on a single core for a few seconds/minutes it's no big problem, but I would recommend an interactive job for this. The procedure for getting interactive jobs is exactly the same for Betzy, by the way, but here you probably want to add "--qos=devel" in order to be allowed to use a single node for this.
      • Perfect thanks! Then I'll try to use an interactive job for this. By the way hackmd is a perfect idea for asking/sharing questions, I like this.
  • If one does not ask for ntasks: is it the same as asking for 1 core, or will the scheduler decide on how many cores to use?
    • as far as i know you can ask for number of tasks or (number of nodes and number of tasks per node). i prefer the former since it gives the scheduler more flexibility and often the exact placement does not matter. but sometimes it matters how the tasks are distributed across nodes and then you need the latter. i am sure slurm allows for a lot more but the above is what i use.
    • I meant: if you only specify "time" and "mem", will the scheduler use one core, or several cores if that improves the memory usage?
    • I think that by default it will use one core if you don't specify anything but I am not sure about it (it depends on the machine). I would verify it.
    • If you do not specify, the following are the defaults (checked on Fram and Saga by submitting an srun job without specifying nodes or tasks)
      • NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1
    • Actually, on Fram the SLURM_CPUS_PER_TASK variable is not set in this case. You get assigned one full node with a single task, with access to all 32 cores. The minimum job size on Fram is one (full) node; on Saga the minimum is a single core. (A small sketch for checking what you actually got follows this thread.)
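One way (a sketch) to see what you actually received is to print the relevant Slurm environment variables from inside a job that does not request nodes or tasks:

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk     # placeholder; no nodes/tasks requested on purpose
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=1G      # memory is still required on Saga

# these variables describe what the scheduler actually gave you
echo "nodes:         ${SLURM_JOB_NUM_NODES}"
echo "tasks:         ${SLURM_NTASKS:-not set}"
echo "cpus per task: ${SLURM_CPUS_PER_TASK:-not set}"
echo "cpus on node:  ${SLURM_CPUS_ON_NODE}"
```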
  • Is --ntasks equivalent to the number of CPUs?

    • Number of cores (there are 20 cores in one CPU on Saga, for example)
    • See similar question above
  • When running array jobs; how do I find out about the sweet spot for requesting resources (number of nodes) in terms of runtime vs wait time?

    • No universal answer. One thing is to test one job and estimate the optimum before starting the array. You could use the monitoring Radovan showed to see how the resource usage is.
  • For the exercise, I ran a script and looked into the output file; I see that the "Memory usage stats" and "Disk usage stats" are all 0

    • Was this a real job with actual computation? The memory statistics in the Slurm output are only sampled every 30 seconds or so.
      • I ran on Saga with sbatch saga_GROMACS.sh. In the output file there is "MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
        NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them." Does this mean the run failed?
        • that seems to be an interesting (not yet reproducible for me) problem
        • could you contact me via email thomas.roblitz@uib.no?
  • (Question not related to today, but I find this super confusing): what is the difference between the Feide and OpenID accounts? Why is it recommended to use OpenID at UiO?

    • Which resource are you trying to access ?
    • I am at UiO and use Feide to access Metacenter resources.
      • I meant for the NIRD Toolkit. Why does it matter which login one uses?
        • You may use Feide, and I do; who/where has recommended OpenID over Feide?
      • Excuse me for a very badly formulated question! I got this e-mail to prepare for the NIRD hands-on tomorrow:
        '''There are regulations with regard to access to services through Feide identity applied by the University of Oslo (UiO). For participants from either UiO or for participants that does not already have a Feide account, you will need to create an account with OpenIDP Feide (via https://openidp.feide.no) in order to join the hands-on part of the NIRD Toolkit lectures. '''
        Also, at other workshops the organizers have asked me to use/create an OpenIDP Feide account to log in to a JupyterHub. Why is this better?
      • As far as I can see this recommendation is for "participants who does not already have a Feide account".
      • Or 'For participants from either UiO'?
      • Let me check with the organisers (contacted and waiting).
      • Great! Thank you!
      • I could not reach the exact person, can you please drop a mail to hpc-drift@usit.uio.no so I can reply (I do not want you to place your email here)
      • Thank you! I will take time to better formulate my question in an e-mail when we are done for today.
        • Thank you for the understanding; this email originated outside of UiO and we shall find out the reasons and fix it if there is anything to fix (if possible)
      • Concerning UiO users: they can only use Feide against services "activated" by UiO. Since there is no such thing as a wildcard activation, and we cannot ask UiO to manually activate every service created in the NIRD Toolkit, we use OpenIdP.

Disk quota and usage, data archiving

  • The small local storage is 300 GB; does this mean we have the same 300 GB on Saga and Fram? Can we request larger local storage if needed? It seems we only have 20 GB in the home directory now.

    • The local storage means local to the compute nodes during job execution, and is only available as long as the job runs. The 20 GiB quota will usually not be extended, but you can ask for extra project storage under /cluster/projects/nnXXXXk: https://www.sigma2.no/apply-e-infrastructure-resources
  • For the project directory, 1 TB is really limiting; we can't even store all the initial data needed for the climate model. Can we request larger storage for a project?

  • Just to confirm: the auto-cleanup time is different for each project and depends on whether usage is above or below 70% capacity?

    • No, it depends on the disk usage of the whole system.
      • Roger that. Thanks for confirming!

Day 2 Q&A

  • Regarding parallelising Python code, is mpi4py a solution commonly used on the Sigma2 HPC systems?

    • Yes, I have used this module at least on Vilje. It is a tool for MPI parallelization, and not the only way to parallelize Python programs.
  • How does the /nird/projects/nird/${projectID}/ directory differ from the NIRD archive? Is it meant to be used as a staging area before proper archiving, or is it a longer-term storage directory?

    • The project area is primarily for working data. You could use it as a staging area for archiving data if you want. But, the project area storage belongs to your group - you use it as you see fit. Once you no longer actively need to use your data, or your project has completed and the data is valuable to your community you could consider archiving it. The archive will keep the data for at least 10 years.

Day 3

  • We have set up some shared services that everyone in the course should be able to access:

ns9989k-vnc.craas1.sigma2.no

ns9989k-jupyter.craas1.sigma2.no

ns9989k-minio.craas1.sigma2.no

ns9989k-rstudio.craas1.sigma2.no
shiny-ns9989k-rstudio.craas1.sigma2.no

ns9989k-jupyterhub.craas1.sigma2.no

  • What is a CPU in this context? ~500 cores on NIRD-TOS? How many nodes?

    • I think we have 8 physical nodes on each site.
    • 500 "cores" on one site are likely virtual cores, so there is likely some oversubscription factor. If each node had 32 physical cores, oversubscription factor would be 2.
      • Ok, and with "oversubscription" you mean that the resources you request are not reserved for your application? So the 256 physical cores can hold 512 single-core applications before being "full"?
    • You can find more info on https://www.sigma2.no/systems#nird-service-platform
  • When I registered for Dataporten with a Feide guest account, the login credentials were not connected to my user account on NIRD. So the resulting dashboard says "Your account is not allowed to register personal applications and APIs." How can I resolve this issue?

    • The following is the answer from the SP team: "Concerning UiO users: they can only use Feide against services 'activated' by UiO. Since there is no such thing as a wildcard activation, and we cannot ask UiO to manually activate every service created in the NIRD Toolkit, we use OpenIdP."
    • So if you are at UiO you need to use OpenIdP (https://openidp.feide.no/).
    • We do see the usability issue here, but sorry to say this has to do with agreements and is not an implementation limitation.
      • Thanks. In terms of OpenIdP access, is there somewhere I can submit my project information online to connect to the NIRD Toolkit? Or does this have to be done by the project manager by emailing support?
        • If you could send an e-mail to sigma2@uninett.no i'll gladly assist you with this.
        • Thanks, already did
  • Who is the "optimal" person to make/enable/change the top Docker image of a project? The project leader?

  • (from zoom chat) Smallest machine is "Base 1CPU 1GB", right?

    • Yes, you are right, and please use the machine type "base"
  • When group members are given access to an application, will they effectively be using the group leader's account on NIRD to access data and Docker images? Will they all have read and write access to the files on NIRD? How does one ensure that one doesn't accidentally break things unintentionally?

    • You use your own account.
      • Okay, but do you gain writing access to the project manager's files?
        • I don't think so, but the SP team will answer how this works
        • I think the question is for files stored under the "project space" (similar to /cluster/projects/nn… on the HPC systems). On the HPC systems all project members have write access.
  • i'm still on "fetching events". Is the initialization a bit slow for others as well?

    • seems it was also slow for others … we need to investigate what is the root cause for this
  • On my application page it shows: "Failing. The application could not start due to: The list of events may contain more details explaining what is causing the application to fail." What's the reason? Below, under "Recent events | fetching events…", it shows: "2m Back-off restarting failed container backoff"

    • problem might be too high load … so maybe try again later
    • if problem persists please let us know again
  • Will the minio data download link be available to share and download after the application has been closed? Or should it only be closed once the data has been downloaded by the intended recipient? i.e. Does the data and its download link persist after closing the application?

    • The download link persists until you close the application. When you delete the bucket, the data won't be available there anymore.
  • I have accepted the invitation to join the group CRaaS1, ns9989k and Dataporten App Creators on Dataporten. When trying to install or look at any of the services I receive the notification: You are not allowed to use any projectspaces.

    • Could you send an email, so that I can check your username or send a private chat on zoom to (Dhanya Pushpadas)
      • On it!
  • In this Jupyter, are the Python packages already installed after setting it up?

  • Can you explain where we could check/change the dockerimages used by the jupyter notebook again?

    • it is coming in the next session
      • Thanks! :)
  • For JupyterHub authorized groups, you referred to a class and creating a new group; can you explain how to create it?

  • What happens if someone (student) forgets to logout?

    • service runs until someone stops it (admin or student)
    • Services that are still running in the course namespace after the entire course is over will be stopped and deleted by us.
  • I have accepted the invitation to join the group CRaaS1, ns9989k and Dataporten App Creators on Dataporten, but when trying to install jupyterhub, for the Projectspace, I only see my research project as the only choice.

    • Please try logging out and back in to the Toolkit, to ensure that your group access is refreshed.
      • Yes, now the craas1-ns9989k shows up
        • Great! 👌
  • It is not clear to me: is the point of the NIRD Toolkit that you can run tools/portals/code to analyze and visualize massive data stored on NIRD (yourself, or students/others that you invite), or is this unrelated to the actual NIRD storage? For example, just to teach someone to use Jupyter notebooks in a course?

    • The Toolkit can be used for both scenarios actually. A namespace in the toolkit is directly connected to a NIRD storage project so that you have access to any files stored there, which makes it possible to run analysis on the data without having to stage it first.
    • .. and we are using it right now in the course to teach you how it works. The namespace "craas1-ns9989k" that you have access to is connected to the NIRD storage project "NS9989K".
  • Is it just me or others as well: my JupyterHub is stuck initializing and always shows "Started container pause".

    • Looks like there is only one deployment that is stuck right now. Have you tried deleting it and deploying a new one?
      • I just deleted the JupyterHub installation and tried to install MinIO; it also keeps initializing and shows "1m Back-off restarting failed container backoff" and "2m Ingress craas1-ns9989k/minio-1616061065-minio update". Since it kept initializing, I tried reconfiguring, and it shows "Persistent storage: No persistent storage found in this projectspace". Can someone help? So far, none of the installations have succeeded.
        • If you could try another deployment and leave it, I'll check the logs.
          • I deleted all the installations and tried a new MinIO install now, but it is the same problem: it keeps initializing, with "1m Back-off restarting failed container backoff".
          • Looks like you are deploying with OpenIdP? Log in to the toolkit with Feide from your institute and try deploying again. Make sure that you have joined the Dataporten group with that Feide user as well. (Feel free to message me directly on zoom - Marius Linge)
  • What is the difference between the docker image and the proxy image?

I think this question is related to RStudio, since there are two Docker images you are allowed to change in the advanced configuration of RStudio. The one pointing to quay.io/uninett/rstudio-server:tag is the one you should change if you would like to add software.
The proxy image sets up an nginx proxy; you would probably never need to change this.

  • tried to install (installing gave no errors) and load some libraries in R, and got this error:
    • library(sf)
      Error: package or namespace load failed for ‘sf’ in dyn.load(file, DLLpath = DLLpath, …):
      unable to load shared object '/usr/local/lib/R/site-library/units/libs/units.so':
      libudunits2.so.0: cannot open shared object file: No such file or directory

    • Does it have to do with the Docker image?

Asking for help & Login via ssh keys

  • Thanks for these pointers, will definitely use these approaches when interacting with my user base ;)
  • Do you have any tips for using emacs locally and login to the cluster through ssh?

    • Can we run eshell on the remote, or is this not recommended?
    • I have seen this: https://documentation.sigma2.no/getting_started/editing_files.html, but it does not give a lot of practical tips.
      • (I am not an Emacs user so I cannot answer the first question), but I agree we need to improve the documentation for this. It can be really useful to use a local editor to edit files on a remote resource, and we should show how.
    • Especially if we are not supposed to change the .bashrc file!
      • let me clarify (I was brief): it is OK to change .bashrc for everything that does not involve calculations. It is totally OK to set up an environment there for "working"/editing. Only the computations should in my opinion not depend on .bashrc, both to simplify debugging and to allow for reproducibility of calculations
    • I'm trying to find a good setup myself. Do you accept issues and PRs for the documentation?
    • Here is a good explanation for emacs and tramp set-up: https://willschenk.com/articles/2020/tramp_tricks/
  • Follow-up: which editor do you (instructors) use on the clusters? vim/vi? Do you have any recommended setups you could share?

  • How many ssh keys should one typically have? I guess it is ok to use the same key to log in to Saga, Fram, Betzy etc? How about further connecting to github etc from these machines?

    • I use one keypair per hardware device that I own. So I have one keypair on my desktop and another on my laptop, and then I put these two public keys on all the services I need to get to (a small sketch follows this thread).
    • To access GitHub you could use key forwarding, but there I actually prefer to either clone to my laptop and scp to the cluster, or to create a separate key for the cluster which can only clone but cannot "write" to GitHub. But if you want to do development on the cluster, it might be useful to create a keypair on the cluster or to forward your key.
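A minimal sketch of the keypair-per-device approach (run on your own laptop/desktop; myusername and the comment string are placeholders):

```bash
# create a keypair on your own machine (ed25519 here; use -t rsa if it is not accepted)
ssh-keygen -t ed25519 -C "my-laptop"

# copy the public key to each cluster you use (repeat for fram/betzy)
ssh-copy-id -i ~/.ssh/id_ed25519.pub myusername@saga.sigma2.no

# from now on the key is used instead of the cluster password
ssh myusername@saga.sigma2.no
```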
  • Is there anything special about the ssh keys on the HPC systems compared to other systems? I have ssh-key login set up for several local servers and they work. I tried setting it up for Saga in the last few days, and just now following the full instructions, and I'm still being asked for the password each time. Any suggestions for what could go wrong?

    • in principle there should be no difference. Two possible reasons why it fails (try also ssh -v host to get verbose debug output): wrong file permissions for your authorized_keys (see the sketch below this thread), or maybe you use a key type which is not supported (too old or too new). I am unsure whether Saga understands ed25519 keys, for instance; I need to check. But if you use RSA keys, it should work.
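If key login still falls back to asking for the password, a quick check of the file permissions mentioned above (run on the cluster; myusername is a placeholder):

```bash
# ssh ignores keys if these are group- or world-writable
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# the home directory itself must also not be group/world-writable

# verbose output from your local machine shows which keys are offered and why they fail
ssh -v myusername@saga.sigma2.no
```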
  • ssh -v says "host key algorithm: ecdsa-sha2-nistp256" - does that mean ed25519 and RSA are not supported?

    • i wanted to write that RSA is definitely supported but i should check :-) personally i am using ed25519 but some clusters' operating systems don't support it yet
    • Ed25519 is supported on any recent installation, including the clusters' login nodes (CentOS).
    • we can also try to debug this together via screenshare if it helps
      • yes that would be nice, where could we do that?
        • Radovan maybe we can go to a breakout room?
          • Marco, if that is still the plan, did you create it?
            • oh, they are not created. Hmm, sorry, I thought the breakout rooms were already there to join. I will need to leave soon but am happy to help with the ssh setup; you can email radovan.bast@uit.no or the support line :-)
              • ok thanks!
    • Note that the host key is something else from the public-key authentication key pair. The former is just used for setting up a pre-authentication session; the latter is most important for the user (ssh-keygen &c.)
  • A follow-up on the editor point above: as a more novice HPC user I'm more used to GUI editors. Some offer the possibility to log in via SSH (e.g., Visual Studio Code), so that you can edit files on a server basically from your own machine with all the GUI "benefits". Is it possible to do so also on the Metacenter machines?

    • I think this is possible. I haven't tried it, but it should be possible and we should document how (a small ssh config sketch follows this thread).
      • Cool, thanks for the reply. Maybe I'll just try and see whether it works.
        • it should work because to the cluster it will look like any other ssh connection, and the cluster really has no way of knowing that there is an editor on the other side. One thing that might happen is that the login node gets a bit overwhelmed if the editor "bombs" it with too many requests, but I don't think this is a concern; let's solve that if it happens.
      • This is also possible to do with emacs
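For editors that connect over SSH (e.g. Visual Studio Code Remote-SSH or Emacs TRAMP), a host entry in the ~/.ssh/config on your own machine is usually all that is needed; a sketch (myusername is a placeholder):

```bash
# append a host entry on your own machine; editors and plain ssh can then reuse it
cat >> ~/.ssh/config << 'EOF'
Host saga
    HostName saga.sigma2.no
    User myusername
EOF

ssh saga    # the same connection the editor will use
```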
  • I have problems installing any of the apps in the NIRD Toolkit. Marius helped me a bit and told me to use my UiO account instead of OpenIdP; I am quite confused, which one should we use exactly? None of them have installed successfully so far.

    • the last time I used the NIRD Toolkit from a UiO account (maybe a year or two ago) I think I actually used the UiO account to launch/install the app, but used the OpenIdP account to access the running service … very confusing, also for admins I think
    • That approach worked for me as well today. Not super user friendly…
    • i think some legal work/agreement needs to be done (a "databehandleravtale", i.e. a data processing agreement) to simplify this … it's ongoing work, IIRC
  • Where can we get the nird toolkit demo-videos that we can see later?

    • somewhere (e.g. a Google Drive) via links on the course page
    • will let you know via email