# Q & A sessions

:::info
- Link to this document: https://hackmd.io/@hpc/q-a
:::

[toc]

---

**Please add questions, comments, or links below (edit button top left or top right):**

## Upcoming session

### 2024-01-25

:::info
Please write your questions above this block.
:::

## Previous sessions

### 2023-03-30

#### Upcoming NRIS courses

- [HPC on boarding course 18. -- 20. April](https://documentation.sigma2.no/training/events/2023-04-hpc-on-boarding.html)
- [HPC Best practices 9. -- 11. May](https://documentation.sigma2.no/training/events/2023-05-best-practices-on-NRIS-clusters.html)

#### Question

1. How to run Slurm array jobs whose inputs have non-sequential file names, such as sample-1.txt, sample-55.txt, sample-111.txt?
    - See the example [here](https://documentation.sigma2.no/jobs/job_scripts/array_jobs.html), which selects the dataset with `DATASET=$(head -n$SLURM_ARRAY_TASK_ID files.txt | tail -n1)`. In this case `files.txt` can be created with, for example, `ls *.out > files.txt`.
2. Persistent mounting of the project is not working inside the JupyterHub container on NIRD Toolkit?
    - This looks like a problem in the setup that we can't solve now. Please open a new ticket.

### 2023-02-23

#### Upcoming NRIS courses

- [HPC on boarding course 18. -- 20. April](https://documentation.sigma2.no/training/events/2023-04-hpc-on-boarding.html)
- [HPC Best practices 9. -- 11. May](https://documentation.sigma2.no/training/events/2023-05-best-practices-on-NRIS-clusters.html)

#### Questions

1. How best to share data (zar archives) from NIRD?
    - The NIRD Toolkit can run MinIO as an application, which provides a publicly accessible file server
        - [Doc page](https://documentation.sigma2.no/nird_toolkit/getting_started_guide.html)
        - [Youtube intro to NIRD Toolkit](https://www.youtube.com/playlist?list=PLoR6m-sar9Ai3TMU96xAGDx-UImMzLXae)
    - Is it po
2. Can you share files from NIRD by placing them in a `ssh` folder?
    - Yes, but it has to be activated for the project
    - It can be password protected
3. ...
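For the array-job question from the 2023-03-30 session above, the `files.txt` approach could be sketched as follows. This is only a minimal sketch under assumptions: the job name, account `nnXXXXk`, time limit, array range, and the demo files are placeholders that are not taken from the session notes. Outside Slurm the script falls back to task 2, so it can be tried on a laptop first.

```shell
#!/bin/bash
#SBATCH --job-name=array-example   # placeholder name
#SBATCH --account=nnXXXXk          # placeholder account
#SBATCH --time=00:10:00            # placeholder time limit
#SBATCH --array=1-3                # one task per line in files.txt

# Demo setup only (hypothetical inputs); in a real job the files already exist.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch sample-1.txt sample-55.txt sample-111.txt

# One file name per line. sort -V (GNU coreutils) orders the numbers naturally.
printf '%s\n' sample-*.txt | sort -V > files.txt

# Outside Slurm, fall back to task 2.
TASK_ID=${SLURM_ARRAY_TASK_ID:-2}

# Select line number TASK_ID from the list, as in the answer above.
DATASET=$(head -n"$TASK_ID" files.txt | tail -n1)
echo "Task $TASK_ID processes $DATASET"
```

In practice it is probably safer to create `files.txt` once before submitting, so that all array tasks read the same list, and to set `--array` to match the number of lines in it.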
### 2022-11-09

#### Upcoming NRIS courses

- [HPC on boarding course 18. -- 20. April](https://documentation.sigma2.no/training/events/2023-04-hpc-on-boarding.html)
- [HPC Best practices 9. -- 11. May](https://documentation.sigma2.no/training/events/2023-05-best-practices-on-NRIS-clusters.html)

#### Today's seminar: Helpful Tools and Services

[Slides](https://docs.google.com/presentation/d/1pgueQ6w8sFW4-1y3iRwiWgkypUhrlLfhEPTFSY2_Lw8/edit?usp=sharing)

#### Questions

1. How can you run single-node jobs on Betzy?
    - Yes, this is possible in the `preproc` partition, but only for up to one day; see https://documentation.sigma2.no/jobs/choosing_job_types.html
    - Otherwise not; you can check this with, for example, `scontrol show partition=normal` on Betzy.

### 2022-11-09

1. Is it possible to run jobs for longer than usual? The background is a Dask scheduler that would "run" for several weeks and schedule shorter jobs on the cluster
    - suggestion: run the scheduler "outside" the cluster, either on a cloud instance/VM or on a real computer
    - NREC: https://www.nrec.no/
    - if there are ssh problems, I am sure we can figure them out
2. Suggestion for future HPC: provide a side node to "park" schedulers, since users regularly need and ask for a scheduler that runs "outside" of Slurm (Dask, Snakemake)
    - This could be a separate Slurm partition for very long-running jobs that need very little memory and few resources and only poll their "sub-jobs" and submit other Slurm jobs. Many of these could then share the same node.

### 2022-09-29

Upcoming NRIS courses

- [HPC on boarding course 18. -- 20. October](https://documentation.sigma2.no/training/events/2022-10-hpc-on-boarding.html)
- [HPC Best practices 1. -- 3. November](https://documentation.sigma2.no/training/events/2022-11-best-practices-on-NRIS-clusters.html)

### Today's seminar: Bioinformatics

[Slides](https://docs.google.com/presentation/d/1pF_zZkuCITE3QwwpgEtv53zhvr-lbDklBRt30DZovQ4/edit?usp=sharing)

### Open Q&A session

1. 
Would it be beneficial to have modules for datasets/databases?
2. How to know what resources (time, cores, memory) to ask for in a job?
    - Documentation pages about [choosing memory](https://documentation.sigma2.no/jobs/choosing-memory-settings.html) and [choosing number of cores](https://documentation.sigma2.no/jobs/choosing-number-of-cores.html).
    - [Information about interactive jobs](https://documentation.sigma2.no/jobs/interactive_jobs.html) to test programs with immediate feedback
    - Short jobs have the advantage that they start quickly; best in combination with a subset

### 2022-08-30

Upcoming NRIS courses

- [HPC on boarding course 18. -- 20. October](https://documentation.sigma2.no/training/events/2022-10-hpc-on-boarding.html)
- [HPC Best practices 1. -- 3. November](https://documentation.sigma2.no/training/events/2022-11-best-practices-on-NRIS-clusters.html)

### Today's seminar: GPUs

Slides will be linked soon

### Open Q&A session

1. This is not related to GPUs, but I have been unable to use sftp to access NIRD for a while. I asked for email support but there is no response yet. When I use scp, this is the only response: "First: /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
    - Also, what is the ticket number of your email request?
    - Are you trying to scp from your laptop towards NIRD? Or from some other server/cluster?
    - The ticket number is #309254. I can access ssh without any problem. I cannot access it from either my laptop or another cluster.
    - ok, looking. i am sorry you have been waiting. i think your ticket got a bit forgotten as we moved to a different ticketing system. i will raise it to my colleagues and make sure that your request gets followed up.
    - Thanks. Crossing fingers for it. Currently I have to scp from nird to betzy, then scp to betzy :(
    - you should receive a response and follow-up soon. i apologize that you never heard back there. it got accidentally moved to a queue where it wasn't watched over summer.
2. Is there a rule-of-thumb for deciding when it's worth using a GPU (e.g. 
size of matrix, number of matrix operations)?
    - Difficult to answer in general, but if you are doing multiple matrix operations on somewhat large matrices, then I would at least try.
3. A question related to parallel Python runs. We have a lot of single-processor Python scripts for model diagnostics, which we would like to run in parallel (many simultaneous runs) in a Slurm job. But when submitting "mpirun python run.py", all the runs are placed on only 1 CPU of the allocated resources.
    - Python package: Python/3.8.2-GCCcore-9.3.0
    - I used an interactive job, started from the login node with `srun --nodes=1 --tasks-per-node=128 --time=01:00:00 --qos=devel --account=* --pty bash -i`, then ran `python run.py &` several times at the bash prompt.
    - is this the question being discussed now over audio? or a separate question?
    - It is the one being discussed =]
    - Use salloc to get the interactive session (see [https://documentation.sigma2.no/jobs/interactive_jobs.html](https://documentation.sigma2.no/jobs/interactive_jobs.html)). Then make sure that you have multiple cores (`echo $SLURM_NTASKS`)
    - Tested: using salloc instead of srun does solve the issue, now all Python instances are running on separate CPUs. Thanks!
4. What computational tasks are more suitable to run on GPUs instead of CPUs?
    - Generally, GPUs are really good at applying the same operation to a large array of data. So if you either 1) have a lot of data or 2) apply the same operations to data over many iterations, the task will be well suited to running on GPUs.
5. Question to meeting participants: what change/improvement (small or large) would make your work on computing and storage resources easier and smoother?
6. I suppose it is a very complicated task to modify a CPU code into a GPU code, right?
    - It can be, but it doesn't have to be; one can test and play around with OpenACC
    - If the code makes heavy use of matrix operations (multiplications) and the matrices are sufficiently large, it can be relatively easy to port your code and offload these operations to the GPU, since libraries for this exist.
    - [You can read more about this here](https://documentation.sigma2.no/code_development/guides/external_libraries.html#cublas-openmp)
7. Gromacs is available for AMD GPUs. Do you know if there are any efforts to port NAMD to AMD GPUs so that it would be available on LUMI? I saw some slides from AMD about it a while ago (at a LUMI meeting, actually) but haven't seen anything about it lately. (edited out the name since now we know and since we reuse this document)
    - Yes, NAMD (and NAMD3) is being ported by AMD. You can already find containers on [AMD's Infinity Hub](https://www.amd.com/en/technologies/infinity-hub/namd3)
    - The GPU team can also help you get started and help test whether your experiments can be run on LUMI
    - There is an AMD-accelerated [alpha of NAMD](https://www.ks.uiuc.edu/Research/namd/alpha/2.15_amdhip/)
    - If you have any allocation on LUMI, you can use the `eap` partition to test runs on GPUs, even without having a LUMI-G allocation at the moment
8. How do I get in contact with the GPU team?
    - [Contact information for the GPU team](https://documentation.sigma2.no/getting_help/extended_support/gpu.html)
9. ...

Interconnect diagram of a LUMI-G node:

![Crusher interconnect diagram](https://docs.olcf.ornl.gov/_images/Crusher_Node_Diagram.jpg)

### 2022-06-14

#### Today's seminar

- [NIRD Toolkit Documentation](https://documentation.sigma2.no/nird_toolkit/overview.html)
- [NIRD Toolkit Training Videos](https://www.youtube.com/watch?v=f7MKoPoNSfQ&list=PLoR6m-sar9Ai3TMU96xAGDx-UImMzLXae&index=5)

#### Open Q&A session

Please write your questions here:

- Can you show again how to get to the page you are showing now? I am just logged in...
    - https://apps.sigma2.no/packages/sigma2/jupyterhub/0.16.15/install
- How to reconfigure a stopped/failed application?
    - If it failed, it is easiest to delete it and start again
- Why is some memory not released after closing all the applications?
    - After stopping or deleting an application, it can take a couple of minutes for the service to be completely shut down. If some resource is not released, you can ask us to look at it in a support ticket. When stopping a JupyterHub service, the user services attached to it can still be running, so before stopping the hub service, go to the hub control panel and stop all running servers.
- How to apply for more resources (e.g. memory)?
    - At the moment you can apply through a support ticket to either support@nris.no or contact@sigma2.no
- Can I set the paths of the TensorBoard in Deep learning apps?
    - It is not currently configurable from the installation pages, but we have noted it as a feature request
- When using persistent data storage, would it be possible to specify the home path and the JupyterLab configuration path?
    - Similar to the question above; we will look at it.
- Are the NorESM Diagnostic Tool and ESMValTool also included in the NIRD Toolkit?
    - They are not included by default in the NIRD Toolkit. One way of adding software to the NIRD Toolkit is to build a custom Docker image that contains the software: https://documentation.sigma2.no/nird_toolkit/custom-docker-image.html

### 2022-05-10

- Slides: https://docs.google.com/presentation/d/1pgueQ6w8sFW4-1y3iRwiWgkypUhrlLfhEPTFSY2_Lw8/
- Next course: https://documentation.sigma2.no/training/events/2022-05-best-practices-on-NRIS-clusters.html
- Where to put self-installed software? Home or project folder?
    - Project is recommended, as HOME has quite small quotas for file size and number of files, and if it gets full, you struggle to do any work at all because of "Disk space exceeded" errors.
- Useful command to compare the time that a job took with the CPU time used:
    - `sacct -j JOB_ID -o NTasks,ReqCPUS,AllocCPUs,CPUTime,Elapsed,Timelimit,ExitCode,NodeList`
    - it seems `seff JOB_ID` is more useful
    - adding `AveCPU` can also help to show whether CPUs were busy
    - the job needs to run long enough to avoid effects from the sampling rate
- How to ssh to a node running my job

    ```
    [sabryr@login-3.SAGA ~]$ squeue -u $USER
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    5621650 normal example- sabryr PD 0:00 1 (Priority)
    [sabryr@login-3.SAGA ~]$ ssh c1-32
    [sabryr@c1-32 ~]$ hostname
    c1-32
    [sabryr@c1-32 ~]$ top -u $USER
    ```

    - then quit top by pressing "q"
    - log out of the compute node with "exit" and you are back on the login node
- To check job efficiency
    - `seff <JOBID>`
- Some places to look for bioinformatics pipelines and tools which are not installed on our servers: nextflow.io, https://singularity-hpc.readthedocs.io/en/latest/

---

### 2022-03-08

[**Status of Quantum Chemistry Software VASP & Gaussian - Presentation slides**](https://docs.google.com/presentation/d/13vm5-Yx_VTfg02SAgrzki9rgSlUTDW5cERVSIdKCrfc/edit#slide=id.p)

[Super short feedback form. Please fill it out](https://forms.gle/kk8E3mD8NtJ2XPMN9)

#### VASP

- Dedicated Q&A sessions with VASP?
- Since Betzy and Fram have CPUs from different companies and we are compiling VASP ourselves: is it possible to share a recommended compile setting for Betzy?
    - Yes. Based on investigations so far, the Intel toolchain with the compiler flag ``-march=core-avx2`` seems to be the safest option and also performs decently. Running VASP on AMD is not straightforward, but it is easier with VASP 6 due to better support for newer GCC versions and other cleanup in the code. We will thus prioritize getting VASP 6 deployed on Betzy first. Our compilation setup, including the full stack, which also runs the tests and builds the modules, will be accessible to users.
- Also, have you done any benchmark-like calculations on Betzy which you can share?
    - Yes, we will share this in the documentation pages. First, we will finalize the setup that produces this data, so that it will be much easier for us and for users to provide it in the future for new/other clusters as well. But we already know that the performance is not on par with the Intel hardware. We also have not yet investigated how the performance looks if we replace Intel MKL with the dedicated AMD BLAS/LAPACK.
- If I want to run several VASP calculations with ASE (Python package) in a single submitted job, is there a documented best practice for parallelizing this nicely?
- Workshops, meetings, etc. How often?
- VASP tutorials. Educating new users. How often? Local training?
- Dedicated channel in our Mattermost chat client?
- Email list that is automatically updated?
- Development of better documentation. Also in tutorial form.
    - This also needs to be a community effort. Do users want to contribute?
- Do users feel like they are part of a local VASP community?
- Do you see a user need to have VASP in a container? Do you need a reproducible environment for VASP?
- What is most pressing for the VASP community?
    - Getting VASP 6 deployed.
    - Getting VASP on Betzy. This will most likely only be for VASP 6. Is this okay for the user community?
- Any interest in the user community in using AiiDA-VASP and/or the AiiDA framework?
    - https://www.aiida.net/
    - https://github.com/aiidateam/aiida-core
    - https://github.com/aiida-vasp/aiida-vasp
    - Would maybe be available here: http://apps.sigma2.no
- Is there an overview of the numerous quantum chemistry software packages installed on NRIS? Also, is there an overview of the licenses and who pays for them?
    - https://documentation.sigma2.no/software/licenses.html
    - https://documentation.sigma2.no/software/licenses/license_list.html
    - https://documentation.sigma2.no/software/appguides.html (incomplete)

#### Gaussian

- AiiDA-Gaussian status?
Would it be useful? Could we do a joint, cross-code effort on getting this going for our users?
- What are the most pressing issues for the Gaussian users?

#### Other questions or comments

- Espen Tangen, can you share the slides for the Gaussian presentation?
    - It seems there are no slides, but I will update the docs according to my talk, or provide slides retrospectively if you prefer. Just let me know (me=Espen T)

---

### 2022-01-26

- For development access to LUMI-G, should I apply through Norway’s share in next week’s deadline?
    - If you already have a Sigma2 project, you should apply to [Sigma2 here](https://www.metacenter.no/mas/saml2/login/?next=/mas/) and we can take it from there
- For an application to LUMI-G, can I / should I enter the required number of CPU hours on the GPU nodes? Are there different quotas for CPU and GPU hours?
    - For now, yes. We will update the application process when we have clearly worked out how to convert between GPU accounting on LUMI-G and billing hours in MAS.
- Could we have the slides from the talk please? Lots of useful links there.
    - [Slides are here](https://docs.google.com/presentation/d/1mSl6q6dvi12ouY0Rt5eephgFR-G_4WzB/)
- Does the application + allocation of storage on LUMI work the same as on Saga etc.?
    - Not sure about storage. Maybe you could send us a [support request](https://documentation.sigma2.no/getting_help/support_line.html), and we can have someone from the application process answer
- Problem with multi-node Gaussian jobs on Saga
    - Jobs sometimes crash without an error message
    - reported by multiple groups, who were forced to move to a different machine
    - RB and JD will follow up and check the status with ET
- File limit problem with Conda environments
    - can lead to hitting the space or file-number quota
    - try to use the project folder instead of the home folder
    - alternative: install the conda stuff into a Singularity container (however, this needs to be done on a different machine) and then run the Singularity container
    - conda also ships Docker/Singularity containers which can be used as a base (to have less to install and also to get leaner containers)
    - we should provide an example container and document how such containers can be built
        - create an ext3 image with
            - `singularity exec docker://ubuntu:18.04 bash -c "mkdir -p overlay/upper overlay/work && dd if=/dev/zero of=overlay.img bs=1M count=50 && mkfs.ext3 -d overlay overlay.img"`
        - use the image with `singularity shell --overlay overlay.img docker://ubuntu:18.04`
        - something is missing here :-( ... doesn't work right now
        - https://sylabs.io/guides/3.7/user-guide/persistent_overlays.html#persistent-overlays
- Will there ever be VNC support for Betzy or LUMI in the same way it exists on Fram and Saga right now?
    - https://documentation.sigma2.no/getting_started/remote-desktop.html?highlight=vnc
    - LUMI: seems to be planned and work in progress
    - Betzy: we will inquire whether it's planned
- When running Slurm array jobs on GPU nodes on Saga, I don't get a "GPU usage stats:" summary for any of the array tasks or for the whole job. I only get the "GPU usage stats" when NOT running an array job. Is there any way to get GPU stats for arrays as well?
    - We will look into this; it should work
    - This seems to work and the output is included, but there are some problems with GPU statistics at the moment which are unrelated to array jobs. Maybe you could share how you specified the jobs?

---

### 2021-12-08

- Can I ask very specific questions on climate model simulation? Or is there specific staff to consult when I have questions?
    - (I wasn't at the meeting) but did this ever get answered? Perhaps in voice?
    - Yes, it was, but orally, so I don't remember what the answer was. But you can just ask again and we will discuss it.

---

### 2021-10-13

Training event in the first week of November:

- https://documentation.sigma2.no/training/events.html
- Registration will close very soon
- All the sessions will be recorded
- We have all material from the last course (March): https://documentation.sigma2.no/training/material.html#training-material-used-in-our-past-courses
- After the Fram downtime: inter-node communication (?) with Gaussian/Linda. Sometimes it crashes out of the blue, but only for multi-node jobs. Difficult to reproduce.
- Slurm environment variables: https://documentation.sigma2.no/jobs/job_scripts/environment_variables.html
- If a job crashes sometimes, what can one do?
    - Print `$SLURM_JOB_NODELIST` in the job script. This can help locate faulty nodes or core distributions which fail. If it is a faulty node, you can `--exclude` it in your job script. But even better for everybody else is to report it to us, so that we can take that node out of the system and fix it.
- We also discussed strategies for what to do if a Betzy job runs optimally on only half the cores: how to schedule it without paying for the unused cores and/or wasting resources.

---

### 2021-09-09

- Is it user friendly on Sigma2, since I am not a regular Linux user?
    - I mean the installation of custom programs and running the jobs and tests.
    - several approaches are available for software installations:
        - EasyBuild
        - Conda
        - pip install into a virtual environment
        - The more traditional configure - make - make install - make test
    - so it depends a bit, and the difficulty varies. But we have a software install team in the center which can assist with installations or take care of software installation requests
    - starting point: https://documentation.sigma2.no/software/userinstallsw.html
- So can the supercomputers you have at the center be accessed by remote login, or do people have to sit physically at the computers?
    - always via remote login. we only go physically to the computers if there is a hardware issue and something needs to be exchanged.
- Is there some kind of forum or online help where people can ask stupid questions and discuss workflow issues and ongoing problems?
    - right now the two places are these Q&A sessions or writing an email to https://documentation.sigma2.no/getting_help/support_line.html. I think that no questions are "stupid" and our documentation can always use more improvement, so don't hesitate to ask.
    - unfortunately we don't have any forum yet where questions could be discussed in the open (the email support line is 1-1)
    - I [RB] would really like us to provide a forum to users, in addition to the email support line. I believe this will come this fall.
    - thanks, I would like to have a forum, so that people who have similar projects can discuss details of the workflow and problems.
    - yes exactly. because with 1-1 email, many questions get asked again and again and nobody else can see the answer. also, sometimes the community knows a much better answer than our staff can provide, and then it would be nice if the community can help the community.
- Question about whether we have tested Singularity performance when scaling to many nodes
    - [RB] I will check with two people who have tested this on our cluster or CSCS machines
- I [KZ] am working on a rat genome assembly.
There is a huge amount (around 1 TB) of raw data to screen against. Would it be a problem to upload such a huge amount of raw data?
    - 1 TB on the project folder, can be extended upon request to 10 TB, [more info](https://documentation.sigma2.no/files_storage/clusters.html)
    - but you cannot place the TB in your home folder
    - recommending `rsync` for transferring files of this size/amount (it also checks consistency of data)
    - More long-term storage (archiving) on [NIRD](https://documentation.sigma2.no/files_storage/nird.html); the data can be accessed from there for smaller-scale computations/visualisation via the [NIRD toolkit](https://documentation.sigma2.no/files_storage/nird.html)
- Research Data Archive
    - https://documentation.sigma2.no/nird_archive/user-guide.html
    - Seems to be missing: searchability of metadata. To find anything you need to know that something is there.
- Request to host https://www.iochem-bd.org/
    - might become possible as part of the next generation NIRD
    - the group around MF at UiB is in contact with the developers of this platform; this has been requested before
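For the `rsync` recommendation in the 2021-09-09 session above, a transfer of a large raw-data directory might look roughly like the sketch below. The user name, login host, project path `nnXXXXk`, and directory names are all placeholders chosen for illustration, not values from the session. By default the script only prints the command; setting `RUN=1` would start the transfer.

```shell
#!/bin/bash
# All names below are placeholders (assumptions), not real accounts or paths.
SRC="raw_data/"   # local directory; the trailing slash means "copy its contents"
DEST="myuser@saga.sigma2.no:/cluster/projects/nnXXXXk/raw_data/"

# --archive   keep permissions, timestamps and directory structure
# --partial   keep partially transferred files so an interrupted run can resume
# --progress  print per-file progress
MSG="rsync --archive --partial --progress $SRC $DEST"

if [ "${RUN:-0}" = "1" ]; then
    rsync --archive --partial --progress "$SRC" "$DEST"
else
    echo "Would run: $MSG"
fi
```

Because `rsync` skips files that are already up to date at the destination, re-running the same command after an interruption should only transfer what is still missing.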