Q & A sessions

Link to this document: https://hackmd.io/@hpc/q-a

Please add your questions, comments, or links below (edit button at the top left or top right):

Upcoming session

2024-01-25

Please write your questions above this block.

Previous sessions

2023-03-30

Upcoming NRIS courses

Question

How to run Slurm array jobs when the input files have irregular names, like sample-1.txt, sample-55.txt, sample-111.txt?
- Put the file names in a list, one per line, and let each array task pick its own line: DATASET=$(head -n $SLURM_ARRAY_TASK_ID files.txt | tail -n 1). The list files.txt can be created with, for example, ls *.out > files.txt
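The approach above can be sketched as a small job script. This is only a sketch under assumptions: the job name, the array range 1-3 and the helper name pick_input are made up for illustration, and the mandatory options for your project (account, time limit, memory) are left out and have to be added.

```shell
#!/bin/bash
#SBATCH --job-name=array-example   # hypothetical job name
#SBATCH --array=1-3                # assumption: files.txt has three lines

# pick_input FILELIST INDEX: print line number INDEX (1-based) of FILELIST.
# Equivalent to: head -n INDEX FILELIST | tail -n 1
pick_input() {
    sed -n "${2}p" "$1"
}

# Only act when running as an array task and the list exists.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f files.txt ]; then
    DATASET=$(pick_input files.txt "$SLURM_ARRAY_TASK_ID")
    echo "Task ${SLURM_ARRAY_TASK_ID} processes ${DATASET}"
fi
```

Since each task only reads its own line from the list, the file names themselves can be arbitrary; the array range just has to match the number of lines in files.txt.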
Persistent mounting of the project is not working inside the JupyterHub container on the NIRD Toolkit?
- This looks like a problem in the setup that we cannot solve right now. Please open a new ticket.

2023-02-23

Upcoming NRIS courses

Questions

What is the best way to share data (zar archives) from NIRD?
- The NIRD Toolkit can run MinIO as an application to provide a publicly accessible file server
- Doc page
- Youtube intro to NIRD Toolkit
- Is it po
Can you share files from NIRD by placing them in a ssh folder?
- Yes, but it has to be activated for the project
- It can be password protected
…

2022-11-09

Upcoming NRIS courses

Today's seminar: Helpful Tools and Services

Slides

Questions

Can you run single-node jobs on Betzy?
- Yes, in the preproc partition, but only for up to one day, see https://documentation.sigma2.no/jobs/choosing_job_types.html
- Otherwise no; you can check this with, for example, scontrol show partition=normal on Betzy.

2022-11-09

Is it possible to run longer than usual jobs? Background is to run a dask scheduler that would "run" for several weeks and schedule shorter jobs on the cluster
- suggestions: run the scheduler "outside" the cluster, either on a cloud instance/VM or on a real computer
- NREC: https://www.nrec.no/
- if there are ssh-problems, I am sure we can figure it out
Suggestion for future HPC: provide a side-node to "park" schedulers since users regularly need and ask to have a scheduler which is "outside" of slurm (Dask, Snakemake)
- This could be a separate Slurm partition for very long running jobs that require very little memory and little resources and all they do is to poll their "sub-jobs" and submit other slurm jobs. Then many of these could be on the same node.

2022-09-29

Upcoming NRIS courses

Today's seminar: Bioinformatics

Slides

Open Q&A session

Would it be beneficial to have modules for data (datasets/databases)?
How to know what resources (time, cores, memory) to ask for a job?
- Documentation page about choosing memory and choosing number of cores.
- Information about interactive jobs to test programs with immediate feedback
- Short jobs have the advantage that they start quickly; this works best in combination with a subset of the input

2022-08-30

Upcoming NRIS courses

Today's seminar: GPUs

Slides will be linked soon

Open Q&A session

This is not related to GPUs, but I have not been able to use sftp to access NIRD for a while. I asked for support by email but there is no response yet. When I use scp, the only response is "First: /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
- Also, what is the ticket number of your email request?
- Are you trying to scp from your laptop to NIRD, or from some other server/cluster?
- The ticket number is #309254. I can access NIRD with ssh without any problem. The transfer fails from both my laptop and another cluster.
  - ok, looking. i am sorry you have been waiting. i think your ticket got a bit forgotten as we moved to a different ticketing system. i will raise it to my colleagues and make sure that your request gets followed up.
    - Thanks. Crossing fingers for it. Currently I have to scp from nird to betzy, then scp to betzy :(
      - you should receive a response and follow-up soon. i apologize that you never heard back there. it got accidentally moved to a queue where it wasn't watched over summer.
Is there a rule-of-thumb for deciding when it's worth using a GPU (e.g. size of matrix, number of matrix operations)?
- Difficult to answer in general, but if you are doing multiple matrix operations on somewhat large matrices then I would at least try.
A question related to parallel python runs. We have a lot of single-processor python scripts to do model diagnostics, which we would like to run in parallel (a lot of simultaneous runs) in a slurm job. But when submitting "mpirun python run.py", all the runs are submitted to only 1 cpu in the allocated resources.
- python package Python/3.8.2-GCCcore-9.3.0
- I used an interactive job, started from the login node with "srun --nodes=1 --tasks-per-node=128 --time=01:00:00 --qos=devel --account=* --pty bash -i", then at the bash prompt I ran "python run.py &" several times.
- is this the question being discussed now over audio? or a separate question?
  - It is the one being discussed =]
- Use salloc to get the interactive session (how-to: https://documentation.sigma2.no/jobs/interactive_jobs.html). Then make sure that you have multiple cores (echo $SLURM_NTASKS)
  - Tested, using salloc instead of srun does solve the issue, now the Python instances are spread over the CPUs. Thanks!
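Inside such an allocation, starting the scripts in the background and waiting for all of them could look roughly like the sketch below. The helper name run_many and the environment variable RUN_INDEX are hypothetical and only serve the illustration; the helper uses plain shell job control and does not depend on Slurm itself.

```shell
# run_many N CMD [ARGS...]: start N copies of CMD in the background and
# wait for all of them. Each copy sees its index in RUN_INDEX
# (hypothetical name). Returns non-zero if any copy failed.
run_many() {
    n=$1
    shift
    pids=""
    i=1
    while [ "$i" -le "$n" ]; do
        RUN_INDEX=$i "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}

# Possible use inside the salloc session (run.py as in the question):
#   run_many "${SLURM_NTASKS:-1}" python run.py
```

Waiting for the background processes matters in a batch script: without the wait, the script would end and the job would be terminated while the Python runs are still going.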
What computational tasks are more suitable to run with GPUs instead of CPUs?
- Generally GPUs are really good at applying the same operation to a large array of data. So, if you either 1) have a lot of data or 2) apply the same operations to the data over many iterations, the task will be well suited to running on GPUs.
Question to meeting participants: what change/improvement (small or large) would make your work on computing and storage resources easier and smoother?
I suppose it is a very complicated task to modify a CPU code into a GPU code, right?
- It can be, but it doesn't have to be, one can test and play around with OpenACC
- If the code makes heavy use of matrix operations (multiplications) and the matrices are sufficiently large, it can be relatively easy to port your code and offload these operations to the GPU since libraries for this exist.
  - You can read more about this here
Gromacs is available for AMD GPUs, do you know if there are any efforts porting NAMD to AMD GPUs so that it would be available on LUMI? I saw some slides from AMD about it a while ago (at a LUMI meeting actually) but haven't seen anything about it lately. (edited out the name since now we know and since we reuse this document)
- Yes, NAMD (and NAMD3) is being ported by AMD. You can already find containers on AMD's Infinity Hub
- The GPU team can also help you get started and help test if your experiments can be run on LUMI
- There is an AMD-accelerated alpha version of NAMD
- If you have any allocation on LUMI, you can use the eap partition to test run on GPUs, even without having a LUMI-G allocation at the moment
How do I get in contact with the GPU team?
- Contact information for the GPU team
…

Interconnect diagram of a LUMI-G node:

2022-06-14

Today's seminar

Open Q&A session

Please write your questions here:

Can you show again how to get to the page you are showing now? I have just logged in…
- https://apps.sigma2.no/packages/sigma2/jupyterhub/0.16.15/install
How to reconfigure a stopped/failed application?
- If it failed, it is easiest to delete it and start again.
Why is some memory not released after closing all the applications?
- After stopping or deleting an application it can take a couple of minutes for the service to be completely shut down. If some resource is not released, you can ask us to look at it in a support ticket. When stopping a JupyterHub service, the user services attached to it can still be running, so before stopping the hub service go to the hub control panel and stop all running servers.
How to apply to have more resources (e.g. memory)?
- at the moment you can apply through a support ticket to either support@nris.no or contact@sigma2.no
Can I set the paths of the tensorboard in Deep learning apps?
- it is not currently configurable from the installation pages, but we have noted it as a feature request
when using persistent data storage, would it be possible to specify the home path and jupyter lab configuration path?
- similarly to the question above, and we will look at it.
Are the NorESM Diagnostic Tool and ESMValTool also included in the NIRD Toolkit?
- They are not included by default in the NIRD Toolkit. One way of adding software to the NIRD Toolkit is to build a custom Docker image that contains the software: https://documentation.sigma2.no/nird_toolkit/custom-docker-image.html

2022-05-10

Slides: https://docs.google.com/presentation/d/1pgueQ6w8sFW4-1y3iRwiWgkypUhrlLfhEPTFSY2_Lw8/
Next course: https://documentation.sigma2.no/training/events/2022-05-best-practices-on-NRIS-clusters.html
Where to put self-installed software? Home or project folder?
- Project is recommended, as HOME has quite small quotas for disk space and number of files, and if it gets full you struggle to do any work at all because of "Disk space exceeded" errors.
useful command to compare time that a job took and cpu time used:
- sacct -j JOB_ID -o NTasks,ReqCPUS,AllocCPUs,CPUTime,Elapsed,Timelimit,ExitCode,NodeList
- it seems seff JOB_ID is more useful
- adding AveCPU can also help to show whether CPUs were busy
- the job needs to run long enough to not see effect from sampling rate
How to ssh to a node running my job:

    [sabryr@login-3.SAGA ~]$ squeue -u $USER
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               5621650    normal example-   sabryr PD       0:00      1 (Priority)
    [sabryr@login-3.SAGA ~]$ ssh c1-32
    [sabryr@c1-32 ~]$ hostname
    c1-32
    [sabryr@c1-32 ~]$ top -u $USER

- then quit top by pressing "q"
- log out of the compute node with "exit" and you are back on the login node

To check job efficiency
- seff <JOBID>
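To make the idea of CPU efficiency concrete, here is a rough sketch of the arithmetic: CPU time actually used, divided by elapsed time multiplied by the number of allocated CPUs. The helper names to_seconds and cpu_efficiency are made up for this example, and it is an assumption that the three inputs are taken from the sacct fields TotalCPU, Elapsed and AllocCPUS. Note that the sacct field CPUTime is elapsed time multiplied by the CPU count, so it does not show how busy the CPUs were.

```shell
# to_seconds TIME: convert a Slurm-style time string ([DD-]HH:MM:SS or
# MM:SS.mmm) to whole seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[-:]' '{
        n = NF
        s = $n
        if (n >= 2) s += 60 * $(n - 1)
        if (n >= 3) s += 3600 * $(n - 2)
        if (n >= 4) s += 86400 * $(n - 3)
        printf "%d\n", s
    }'
}

# cpu_efficiency USED_CPU_TIME ELAPSED NCPUS: print the percentage of the
# allocated CPU time that was actually used.
cpu_efficiency() {
    used=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    LC_ALL=C awk -v u="$used" -v w="$wall" -v n="$3" 'BEGIN {
        if (w * n == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * u / (w * n)
    }'
}

# Example: 30 minutes of CPU time in a 10 minute job on 4 CPUs:
#   cpu_efficiency 00:30:00 00:10:00 4
```

A low percentage for a job that ran long enough usually means that more cores were requested than the program could keep busy.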
Some places to look for bioinformatics pipelines and tools which are not installed on our servers: nextflow.io, https://singularity-hpc.readthedocs.io/en/latest/

2022-03-08

Status of Quantum Chemistry Software VASP & Gaussian - Presentation slides
Super short feedback form. Please fill it out

VASP

Dedicated Q&A sessions with VASP?
Since Betzy and Fram have CPUs from different vendors and we are compiling VASP ourselves, is it possible to share recommended compile settings for Betzy?
- Yes. Based on investigations so far, the Intel toolchain with the compiler flag -march=core-avx2 seems to be the safest option and also performs decently. Running VASP on AMD is not straightforward, but it is easier with VASP 6 due to better support for newer GCC versions and other cleanup in the code. We will thus prioritize getting VASP 6 deployed on Betzy first. Our compilation setup, including the full stack, which also runs the tests and builds the modules, will be accessible to users.
Also, have you done any benchmark-like calculations on Betzy which you can share?
- Yes, we will share this in the documentation pages. First, we will finalize the setup which produces this data so that it will be much easier for us and users to provide this in the future for new/other clusters as well. But we already know that the performance is not on par with the Intel hardware. We also have not yet investigated how the performance looks if we replace Intel MKL with the dedicated AMD BLAS/LAPACK.
If I want to run several VASP calculations with ASE (Python package) in a single submitted job, is there a best practice in the documentation for parallelizing this nicely?
Workshops, meetings, etc. How often?
VASP tutorials. Educating new users. How often? Local training?
Dedicated channel in our Mattermost chat client?
Email list that is automatically updated?
Development of better documentation. Also in tutorial form.
Need to also be a community effort. Do users want to contribute?
Do users feel like they are part of a local VASP community?
Do you see a user need to have VASP in a container? Do you need a reproducible environment for VASP?
What is most pressing for the VASP community?
- Getting VASP 6 deployed.
- Getting VASP on Betzy. Will most likely only be for VASP 6. Is this okay for the user community?
Any interest in the user community to utilize AiiDA-VASP and/or the AiiDA framework?
- https://www.aiida.net/
- https://github.com/aiidateam/aiida-core
- https://github.com/aiida-vasp/aiida-vasp
- Would maybe be available here: http://apps.sigma2.no
Is there an overview of the numerous quantum chemistry software packages installed on NRIS systems? Also, is there an overview of the licenses and who pays for them?

Gaussian

AiiDA-Gaussian status? Would it be useful? Could we do a joint, cross code effort on getting this going for our users?
What are the most pressing issues for Gaussian users?

2022-01-26

For development access to LUMI-G, should I apply through Norway’s share in next week’s deadline?
- If you already have a Sigma2 project you should apply to Sigma2 here and we can take it from there
For application to LUMI-G, can I / should I enter required number of CPU hours on the GPU nodes? Are there different quotas for CPU and GPU hours?
- For now, yes. We will update the application process when we have clearly worked out how to convert between GPU accounting on LUMI-G and billing hours in MAS.
Could we have the slides from the talk please? Lots of useful links there.
- Slides are here
Does the application + allocation of storage on LUMI work the same as on Saga etc.?
- Not sure about storage, maybe you could send us a support request, and we can have someone from the application process answer
Problem with multi-node Gaussian jobs on Saga
- Jobs crash sometimes, without error message
  - reported by multiple groups, who were forced to move to a different machine
  - RB and JD will follow-up and check the status with ET
File limit problem with Conda environments
- can lead to space or file number quota limitations
- try to use project folder instead of home folder
- alternative: install the conda packages into a singularity container (however, this needs to be done on a different machine) and then run the singularity container on the cluster
- conda also ships docker/singularity containers which can be used as a base (to have less to install and also to get leaner containers)
- we should provide an example container and document how such containers can be built
  - create ext3 image with
    - singularity exec docker://ubuntu:18.04 bash -c "mkdir -p overlay/upper overlay/work && dd if=/dev/zero of=overlay.img bs=1M count=50 && mkfs.ext3 -d overlay overlay.img"
  - using image singularity shell --overlay overlay.img docker://ubuntu:18.04
  - something is missing here :-( … doesn't work right now
  - https://sylabs.io/guides/3.7/user-guide/persistent_overlays.html#persistent-overlays
Will there ever be VNC support for Betzy or LUMI in the same way it exists on Fram and Saga right now?
- https://documentation.sigma2.no/getting_started/remote-desktop.html?highlight=vnc
- LUMI: seems to be planned and work in progress
- betzy: we will inquire if it's planned
When running Slurm Array jobs on GPU-nodes on Saga, I don't get a "GPU usage stats:"-summary for any of the array tasks or the whole job. I only get the "GPU usage stats" when NOT running an array job - any way to get GPU stats also for arrays?
- We will look into this, it should work
- This seems to work, the output is included, but there are some problems with GPU statistics at the moment which are unrelated to array jobs. Maybe you could share how you specified the jobs?

2021-12-08

Can I ask very specific questions on climate model simulation? Or is there specific staff to consult when I have questions?
- (I wasn't at the meeting) but did this ever get answered? Perhaps in voice?
- Yes, it was but orally, so I don't remember what the answer was. But you can just ask again and we discuss it.

2021-10-13

Training event in the first week of November:

https://documentation.sigma2.no/training/events.html
Registration will close very soon
All the sessions will be recorded
We have all material from last course (March): https://documentation.sigma2.no/training/material.html#training-material-used-in-our-past-courses
After the Fram downtime: inter-node communication (?) with Gaussian/Linda. Sometimes it crashes out of the blue, but only for multi-node jobs. Difficult to reproduce.
Slurm environment variables: https://documentation.sigma2.no/jobs/job_scripts/environment_variables.html
If a job crashes sometimes, what can one do?
- print $SLURM_JOB_NODELIST in the job script. This can help locate faulty nodes or core distributions which fail. If it is a faulty node, you can --exclude it in your job script. But even better for everybody else is if you report it to us so that we can take that node out of the system and fix it.
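A minimal sketch of how this could look in a job script is given below. The job name and the helper name log_nodes are made up for the example, the node name c1-32 is only a placeholder, and the exclude line is disabled with a second "#" until you actually know which node to avoid.

```shell
#!/bin/bash
#SBATCH --job-name=nodelist-example   # hypothetical job name
##SBATCH --exclude=c1-32              # placeholder node; remove one "#" to enable

# log_nodes: record the job ID and the nodes this job runs on, so that
# crashes can later be matched against specific nodes.
log_nodes() {
    echo "Job ${SLURM_JOB_ID:-unknown} runs on: ${SLURM_JOB_NODELIST:-unknown}"
}

log_nodes
# ... the actual job commands would follow here ...
```

If the same node shows up in the output of several crashed jobs, that is useful information to include when reporting the problem.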
We also discussed strategies of what to do if a Betzy job runs optimally on only half the cores. How to schedule it without paying for the unused cores and/or wasting resources.

2021-09-09

Is it user friendly on Sigma2, since I am not a regular Linux user?
- I mean the installation of the custom programs and running the jobs and tests.
  - several approaches are available for software installations:
    - EasyBuild
    - Conda
    - pip install into virtual environment
    - The more traditional configure - make - make install - make test
  - so it depends a bit, and with varying difficulty. But we have a software install team in the center which can assist in installations or can take care of software installation requests
  - starting point: https://documentation.sigma2.no/software/userinstallsw.html
So the supercomputers at the center can be accessed by remote login, or do people have to sit physically by the computers?
- always via remote login. we only go physically to the computers if there is a hardware issue and something needs to be exchanged.
Is there some kind of forum or online help where people can ask stupid questions and discuss workflow issues and problems they run into?
- right now the two places are these QA sessions or writing an email to https://documentation.sigma2.no/getting_help/support_line.html. I think that no questions are "stupid" and our documentation always can use more improvement so don't hesitate to ask.
- unfortunately we don't have any forum yet where questions could be discussed in the open (email support line is 1-1)
- I [RB] would really like that we provide a forum to users, in addition to email support line. I believe this will come this fall.
- thanks, I would like to have a forum, so that people who have similar projects can discuss details of the workflow and problems.
  - yes exactly. because with 1-1 email, many questions get re-asked again and again and nobody else can see the answer. also sometimes the community knows a much better answer than our staff can provide and then it would be nice if community can help the community.
Question about whether we have tested Singularity performance when scaling to many nodes
- [RB] I will check with two people who have tested this on our cluster or CSCS machines
I [KZ] am working on a rat genome assembly. There is a huge amount (around 1TB) of raw data to screen against. Would it be problem to upload such huge amount of raw data?
- 1 TB on project folder, can be extended upon request to 10 TB, more info
- but you cannot place the TB in your home folder
- we recommend rsync for transferring files of this size/amount (it also checks the consistency of the data)
- More long term storage (archiving) on NIRD, can be accessed from there for smaller scale computations/visualisation via NIRD toolkit
Research Data Archive
- https://documentation.sigma2.no/nird_archive/user-guide.html
- Seems to be missing: searchability of metadata. To find anything you need to know that something is there.
Request to host https://www.iochem-bd.org/
- might become possible as part of the next generation NIRD
- group around MF at UiB is in contact with the developers of this platform, this has been requested before
