# HPC course notes - Day 1
## Presentation material
- Folder with presentation material: https://drive.google.com/drive/u/1/folders/10VuU7REr4xlijXzg7R7wxfzTsamXHqJw
- EasyBuild slides: https://cicero.xyz/v2/remark/github/stigrj/easybuild-demo/main/easybuild-demo.mkd
- Pip and conda commands and info can be found in our documentation: https://documentation.sigma2.no/software/userinstallsw/python.html#installing-python-packages
## Overview of the infrastructure
- The overview currently shows no GPUs on Betzy. Does this mean it is also CPU-based, and can the climate models that run on Fram definitely run on Betzy?
- Indeed, Betzy is also CPU-based and has no GPUs (as far as I know). Porting the models should not be too much work, but moving models from one machine to another always takes some effort (in my experience). There are more cores per processor on Betzy, and a different processor architecture (AMD vs. Intel). The programming and library environment, however, should be basically the same, and all libraries available on Fram should in principle be available on Betzy.
- Fram regular compute node: 32 cores per node, 61 GB available RAM. Betzy regular compute node: 128 cores per node, 245 GB available RAM.
- Betzy will (hopefully) be getting A100 GPUs within this year though.
- Handouts of the presentations? Will they be made available somewhere?
- Good question. We will collect all the material and link to it from the top of this document.
- What is the payment model for use of the HPC resources?
- This depends on the research and funding. More on pricing is available here: https://www.sigma2.no/user-contribution-model
- Follow-up: Can you provide assistance with selecting the right model to include in, e.g., an application?
- Certainly! Please contact us at sigma2@uninett.no and we'll assist :)
- OK Thanks!
- What other systems (outside of Norway) do you use?
- I have been using systems in Germany.
- Have used systems in Sweden
- Systems in Germany
- HPC at Météo-France
- What are the biggest obstacles when starting or working on an HPC?
- As a student, jumping straight into HPC without previous experience was really daunting, and it was difficult to know whether I was "breaking the rules" or following best practices. Besides the "dos and don'ts" in the online documentation, I never knew whether small pre/post-processing scripts were frowned upon or not. It's still a little unclear to me, actually.
- How to run a climate model properly without knowing the details of the platform
- How much time do you spend on IT matters (HPC, programming, NIRD, ...)?
- Probably around 90%
- A lot: running climate models, data analysis, teaching data visualization and programming, etc.
- How do you find the best machine for your type of work?
- Answer: You can ask for a specific machine. RFK will try to assign you to the best-fitting system. Aspects considered: parallelism of jobs, GPUs, needed memory (per CPU), file system needs, (software), amount of core hours needed, and the need to demonstrate that applications scale (Arm performance reports).
- What are the "papercuts" for you on the HPC systems? (small but annoying things that you wish were better/ different)? I am asking this so that we improve these.
- A truly small papercut: I am involved in three different projects on Saga with very similar project directory names (nn####k). I always end up working in the wrong directory.
- I understand. So it would be nice to have project "aliases"/"nicknames", i.e. a name instead of a number?
- That would definitely help avoid some confusion on my part!
- Thanks for pointing it out. I will note this down so that we can consider how to improve it.
- It's definitely not a big issue and the best solution might be changing my workflow, thanks though!
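- In the meantime, one possible workaround is to define your own nicknames for the project directories in `~/.bashrc`. A small sketch (the project IDs and paths below are invented placeholders, not real projects):

  ```shell
  # Sketch: personal "nicknames" for project directories in ~/.bashrc
  # (the project IDs and paths below are invented placeholders).
  export PROJ_CLIMATE=/cluster/projects/nn1234k
  export PROJ_OCEAN=/cluster/projects/nn5678k
  # Later you can type: cd "$PROJ_CLIMATE"  instead of remembering the number.
  echo "$PROJ_CLIMATE"
  ```

  You could equally use `alias` or shell functions; the point is that the mapping lives in one documented place.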
- Time estimates for when a job starts in the queue
- that they are wrong/unreliable?
- Rephrased question: an estimate of how long a job has to wait in the queue.
- There is a command for that: `squeue -j 12345 --start`. But it is an estimate; things shift all the time, so it cannot always be 100% reliable.
- In many cases, this `--start` command only returns `N/A` for the `START_TIME`.
- list of commands: https://slurm.schedmd.com/pdfs/summary.pdf
- I made myself a cheat sheet since I can never remember these: https://github.com/bast/til/blob/master/slurm/cheat.md
- Sometimes a job fails (before finishing or reaching requested job time) after waiting really long in a queue and it would be nice to have a small 'grace period' of a few minutes in which one could fix the issue and resubmit (continue) the job while your nodes stay reserved for your job in the 'grace period'.
- This is tricky, as a job's queue position is decided based on the resources requested in the job script. I.e., if you ask for one minute of runtime and then, after getting the reservation, change it to one week, that is not fair to others.
- Indeed, but this was not about prolonging (which we also get asked about from time to time), rather about keeping the long-awaited time slot instead of seeing it disappear after a simple mistake. I see how frustrating that can be :-) What I will recommend in a talk tomorrow and Thursday is to "grow" a calculation to avoid these situations. But it cannot be 100% avoided.
- True, but what would stop users from changing the runtime or number of cores? And what if the failure is due to not requesting enough memory?
- One would have to keep the same settings. Indeed, Slurm would not release the reserved resources but keep them a bit longer. I know it's tricky.
- But a great idea! I never thought of this.
- I agree, it would make for a better user experience. What about giving users credits with which they could themselves increase their queue position a given number of times, or increase runtime according to the credit available?
- There might also be other reasons for the job not to start properly, e.g. insufficient funds on the project. It would be good to get some sort of job diagnostics upon submission instead of realizing that it is not starting when it is finally your turn.
---
## Installing software with EasyBuild
- EasyBuild slides: https://cicero.xyz/v2/remark/github/stigrj/easybuild-demo/main/easybuild-demo.mkd
- Official tutorial: https://easybuilders.github.io/easybuild-tutorial/
- Is there an easy way to clear or tidy up the software I have installed?
- But how was it installed? With EasyBuild, or differently? It depends on what was used to install the software.
- I tried pip and virtualenv, installed Python 2, later Python 3, and later Miniconda, etc., so it's a mess.
- I perfectly understand and have been there many times. For Python-based software I can recommend using either virtual environments or conda (actually both, because some codes need one and some the other), and creating a separate environment for each project so that packages "don't cross" and you can remove an environment at any time. Good resources about this: https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/ and https://researchsoftwarehour.github.io/sessions/rsh-011/
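- The per-project idea in a nutshell, as a small sketch (the environment name is just an example): create an environment, use its own interpreter, and when the project is over, deleting the folder removes everything without touching other projects.

  ```shell
  # Sketch: one disposable environment per project (names are examples).
  python3 -m venv demo-env                # create an isolated environment
  ./demo-env/bin/python --version         # this interpreter lives inside the env
  rm -rf demo-env                         # removing it is just deleting the folder
  echo "environment removed"
  ```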
- What does the `setenv(XXXX)` shown by `module show YYYYY` do? These environment variables are not standard on a computer, right? I'm used to seeing `LD_LIBRARY_PATH`, for instance.
- This can set arbitrary environment variables that are then used by the code in question (the example we saw set some variables only understood by the Arm perf-report tool).
- I'm trying to install some packages in a conda environment, but am getting the error "[Errno 122] Disk quota exceeded". I'm only using 600 MB right now, so what's going on?
- There are actually two different file quotas in place (total size and number of files). You probably exceeded the file-count quota; you can check with `dusage -i`. Often you have to clean the `.conda` folder: try `conda clean -a`. More on that in the next session.
- `dusage -i` gives the following error (where XXXXXX is the username; I'm in a clean environment on Saga with no modules loaded):
  ```
  File "/usr/lib64/python2.7/decimal.py", line 3872, in _raise_error
    raise error(explanation)
  decimal.InvalidOperation: Invalid literal for Decimal: '100000\nXXXXXX_g'
  ```
- This is unfortunately a known problem (which should have been communicated better). It broke recently after a new storage pool was added. We will fix this soon; very sorry that it is still not fixed. We are also working on a new version of `dusage` which will give better/easier information about your quota.
- What happens if there is a conflict between the versions of the loaded modules? For instance, loading a compiled version of NetCDF depends on a compiler, whereas another library can depend on a different compiler version.
- This is "dependency hell". In this case either the dependency or the code needs to be changed. It happens from time to time, and it can be a bit of work to resolve. Spack is a bit more relaxed about this, as far as I know, and allows mixing dependencies compiled with different compilers and versions.
- "dependency hell". Hehe. Nice vocabulary.
- it's a thing :-) https://en.wikipedia.org/wiki/Dependency_hell
- Configure your own environment, how?
- In which context was it mentioned? (I missed that part.)
- In the chat.
- Aha. Thomas, can you expand? ...
- Instead of setting or changing environment variables in your `.bashrc`, you could make a small module file as demonstrated (the `SetFoo.lua` example); then you can easily adjust your environment and revert it by loading or unloading the module.
- As pointed out below, setting an environment variable with this method does not allow you to revert the change, so be careful.
- I guess it does not remember the previous state, so unloading means unsetting, not reverting.
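- A minimal sketch of such a personal module file, assuming the Lmod Lua syntax demonstrated in the session (the variable name and path are invented for illustration):

  ```lua
  -- SetFoo.lua: a tiny personal module file (hypothetical example).
  help([[Sets a personal environment variable and extends PATH.]])
  setenv("FOO", "bar")  -- on unload this is unset, not restored to a previous value
  prepend_path("PATH", pathJoin(os.getenv("HOME"), "bin"))  -- reverted on unload
  ```

  Put it in a directory, run `module use /path/to/that/directory`, and then `module load SetFoo` / `module unload SetFoo`.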
- What about Docker containers?
- We will talk about this on Friday. Singularity is available on all machines, and Singularity can import Docker containers, so it is possible.
- I tried to set a variable, then load a module that sets it, then unload the module. The result is that the original value is not restored, so loading a module is apparently irreversible?
- Hmm... it can be that they are not completely reversible/commutative. Also for this reason I like to load all modules in the run scripts and not "outside", to keep the loads separated and also documented.
- Side note: I often thought that programming a module system should be easy, and wondered why it is so slow/complicated/broken. But it turns out it is complicated, with many corner cases, and now I take back all my past comments on this. It is surprisingly non-trivial.
- Where can I find which packages are already available to install using EasyBuild?
- You can use `eb -S packagename` to see the available easyconfig files in the repository, after loading the EasyBuild module.
- I also like to browse https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs
- You can also write your own easyconfig files: https://docs.easybuild.io/en/latest/
- Great thanks!
- You might also encounter situations where the software you want is available in EasyBuild, but not for the particular version that you need. Then it's not too complicated to tweak the existing easyconfigs for the new version. More on this in our documentation: https://documentation.sigma2.no/software/userinstallsw/easybuild.html
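- For context, an easyconfig is a small Python-syntax file, and a version bump usually means changing only a line or two. A hedged sketch of what one looks like (all names, versions, and URLs below are invented placeholders):

  ```python
  # Sketch of a minimal easyconfig file (placeholder values throughout).
  easyblock = 'ConfigureMake'

  name = 'examplesoftware'
  version = '1.2.3'        # typically the line you bump for a newer version

  homepage = 'https://example.org'
  description = "Example software (placeholder)"

  toolchain = {'name': 'foss', 'version': '2020b'}

  source_urls = ['https://example.org/downloads/']
  sources = ['%(name)s-%(version)s.tar.gz']

  moduleclass = 'tools'
  ```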
---
## Installing Python packages
Presentation based on documentation: https://documentation.sigma2.no/software/userinstallsw/python.html#installing-python-packages
- Can you show the isolation node command again, please? Is this the recommended way to work when compiling somewhat larger code bases, for instance?
- https://documentation.sigma2.no/jobs/interactive_jobs.html
- Follow-up: when should we DEFINITELY switch from the login nodes to keep good cluster hygiene?
- When the task needs multiple threads and more memory for a longer time (e.g. compiling GCC).
- Internet access should not be needed during compiling/installing
- But sometimes it is (some codes fetch dependencies from the net at configure or build time), and I think this is now solved? Compute nodes can now access the internet via a proxy (right?)
- should be covered a bit in tomorrow's session 2 about jobs
- It depends a bit on how the internet is accessed: it can work out of the box without any configuration, with some configuration, or not at all (note: currently only set up to work on Fram and Saga).
- Personally I only use login nodes for editing files and submitting/monitoring jobs, and do everything else on compute nodes. I also like to put compilations into a run script. The bonus is that this forces me to document and isolate dependencies, which helps the next time I want to build it two months later, when I have forgotten everything again.
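- Such a run script could look roughly like the sketch below; the account, module, and program names are invented placeholders, so adjust them to your project:

  ```bash
  #!/bin/bash
  #SBATCH --account=nn1234k        # placeholder project account
  #SBATCH --time=00:30:00          # short test runtime
  #SBATCH --ntasks=4
  #SBATCH --mem-per-cpu=2G

  # Load (and thereby document) all dependencies inside the script:
  module purge
  module load foss/2020b           # hypothetical toolchain module

  # Build and run in one place, reproducible months later:
  make clean && make
  srun ./my-program
  ```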
- If I have created different conda environments and virtualenvs, how do I keep the most up-to-date one and easily tidy up the others?
- For both I recommend listing the dependencies you have installed in `requirements.txt` (virtualenv) or `environment.yml` (conda) and installing from these files. This way you have documented what you have installed. Use one environment per project: if you use one per project or folder, it is also easier to remove environments without affecting other projects. Documenting the actual dependencies in `environment.yml` is in my opinion more important than always using the latest versions, because today's latest versions will be old versions in two years, and if you return to your project then, it's nice to have the versions documented somewhere.
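- A minimal `environment.yml` along those lines might look like this (the project name and package versions are invented examples):

  ```yaml
  # Example environment.yml: one documented conda environment per project.
  name: myproject
  channels:
    - conda-forge
  dependencies:
    - python=3.8
    - numpy=1.19
  ```

  Create it with `conda env create -f environment.yml` and remove it later with `conda env remove -n myproject`.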
---
## Installing R packages
Presentation following: https://documentation.sigma2.no/software/userinstallsw/R.html
- Given that a lot of bioinformatics tools based on R are focused on producing graphics, can R be run interactively on any of the Sigma2/NIRD systems?
- Great question, investigating... At a minimum we will return to this on Thursday, but looking for an answer now.
- Indeed, it is possible on an interactive node, but include X forwarding when logging into Saga.
- Can you add the command here? How do you do it from PuTTY?
- I am unsure about PuTTY, but I recommend using https://mobaxterm.mobatek.net/ on Windows, since it has built-in support for an X server (the graphics part) and generally more options.
- command: `ssh -X saga.sigma2.no` (thanks!)
- How about putting the `.libPaths()` statement into the `.Rprofile` in your HOME? This works on most systems.
- (I am not an R expert...) but I would recommend loading everything in a script. Then if you return to the script a year later, you see all the dependencies. It also makes it easier for staff to help, because otherwise they do not have the same environment as you; if everything is in the script, staff can take it and reproduce the problem.