
Using CSC HPC environment efficiently 2021

tags: puhti mahti allas

This is the common "notebook" for the "Using CSC HPC environment efficiently" course organised in April 2021 at CSC - IT Center for Science. The course page is in the e-Lena platform.

This is the place to ask questions about the workshop content! We use the Zoom chat only for posting links, reporting Zoom problems and such.

Hint: HackMD is great for sharing information in courses like this, as code formatting is nice & easy with Markdown! Just add 3 ticks (`) for the code blocks. Otherwise, it's like a Google doc: it allows simultaneous editing.

Code of conduct

We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone. In short:

  • Use welcoming and inclusive language
  • Be respectful of different viewpoints and experiences
  • Gracefully accept constructive criticism
  • Focus on what is best for the community
  • Show courtesy and respect towards other community members

TO DO before the course

Zoom instructions

  • The link to the Zoom room was sent to you via e-mail
  • Please arrive 5-10 minutes early to test your microphone setup.
  • Use your full name (Firstname Lastname)!
  • During the course, please remember to always mute your microphone when you are not speaking.
  • Please use a headset in order to avoid echo (a simple phone headset is just fine).
  • You can find all the controls (mic, video, chat, screen sharing) at the bottom of the Zoom window (when you bring your mouse there).
  • You can use the chat box for questions and comments, but please make sure you reply to "all panelists and participants" instead of just "all panelists", which is often the default.
  • If you have a spoken question/comment, please use the "raise hand" button: we will then give the floor (and microphone rights) to you.
  • Note: for questions and answers about the course topics, we will be using this living document
  • Break-out rooms: Breakout rooms are smaller sessions that are split off from the main Zoom meeting. They are completely isolated in terms of audio and video. The host will need to invite you to join the breakout room, after which you can click "Join" in the notification pop-up. More info: https://support.zoom.us/hc/en-us/articles/115005769646-Participating-in-Breakout-Rooms#collapseWeb

Q & A

You can type your questions here. We will answer them, and this document will store the answers for you for later use!


General and practical matters

  • Q1: I have difficulty pasting my questions into HackMD (here). Do you have some instructions on how to write here?

    • A: Can you see the three icons in the top left corner, next to the HackMD text? There's a pencil, a side-by-side symbol, and an eye. In the eye view you can't edit, you are just viewing. The other two reveal the Markdown (MD) version of the page, which you can edit. I find it easiest to edit in the side-by-side view.

Hint: You can also apply styling from the toolbar at the top of the editing area.

  • Q2: I cannot access most of the slides from the course page. These are marked with an earth logo.

    • A: Try to use the arrow keys or space bar to switch the slide!
  • Q3: Slides available after the course?

    • A: They sure are! We are developing self-learning material on these topics. You will have access to this HackMD document as well as the e-Lena course materials after the course.
  • Q4: Has the Zoom meeting already started? I'm just getting a "The meeting has not started" note.

    • A: This was solved :) Make sure you have the full Zoom link when connecting! It's long and contains the password :)
  • Q5: Do we get ECTS credits from this course for our university transcripts?

    • A: We will send a certificate by e-mail to the course participants who were present in both sessions, in which we recommend 0.5 credits for this course. We are not allowed to grant credits, but we can recommend them, and if you take the certificate to your university, they usually accept it :)
    • NOTE: If you participated but didn't receive the certificate, please contact our event support (event-support (at) csc.fi). We have a new process we are just starting to use, so thank you for your understanding and sorry for any inconvenience!
  • Q6: How long will "project_2004306" be available for use after the course?

    • A: The course project will still be available tomorrow (Friday 9.4.), but after that you need to do the exercises in your own projects :)

User account, logging in to Puhti, ssh

  • Q1: I have a CSC login password as well as a Haka password. I guess the CSC login password should work for Puhti?

    • A: Yes, you need CSC credentials to log in to Puhti. For my.csc.fi (and e-Lena) you can nowadays also log in with Haka credentials (= your university credentials), but to access Puhti you need the CSC username and password. You can check your username in my.csc.fi. If you have forgotten your password, check this link: https://docs.csc.fi/accounts/how-to-change-password/ - Note that it takes some time for the password to update to Puhti :)
  • Q2: I can see the course project in the list of projects in my MyCSC profile, but I can't access it, i.e. it doesn't show up under csc-workspaces.

    • A: Did you accept the terms of use in MyCSC? See: https://docs.csc.fi/accounts/how-to-add-service-access-for-project/#member
    • [screenshot of the MyCSC project page] This is what I see. I had accepted the terms.
    • Looks correct - there might be some delay before this is updated to Puhti, but I will confirm the situation with our account managers. What is your user account?
    • OK, everything SHOULD be fine - have you tried logging out and back in?
    • Yes, logging out and logging back in worked. SOLVED!

HPC environment(s)

  • Q2: Is the memory available per core the total memory divided by the number of cores in that node?

    • A: Memory is available per node, not per core. For example, OpenMP code, which uses shared memory, can run on at most one node. You can also request memory per core, which is in fact the default (1 GiB per core).
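    • For example, a batch script can request memory either per core or for the whole job (a minimal sketch; the values are just illustrative, and the two options should not be combined):

      #SBATCH --mem-per-cpu=2G   # memory per reserved core
      #SBATCH --mem=8G           # OR: total memory for the job (per node)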
  • Q3: In general, how many compute nodes can be used for a typical molecular dynamics simulation task? As I read, it might take several days to complete the job. Can we use many compute nodes at the same time? (e.g. a 69 kDa protein)

    • A: The hard limits are the partition (= queue) sizes. However, you must test how many cores can be used efficiently. This will depend on many factors, like which code you're using, how big your system is (number of atoms), what parameters you're using (PME vs. cutoff), etc. For a 69 kDa protein with water around it, PME and Gromacs, maybe around 80 cores. Please check this tutorial on performing a scaling test (testing how many cores it makes sense to use): https://docs.csc.fi/support/tutorials/cmdline-handson/#scaling-test-for-an-mpi-parallel-job
  • Q4: Any tips on optimizing an application for memory usage? I also need a good tool to monitor my application's memory usage.

    • A: There are lots of ways to optimize a code for memory usage, and it depends a lot on exactly what the algorithm is. Sometimes people make mistakes by not being careful to allocate exactly what they need. Sometimes the data does not need to be in memory, but can be read from disk in suitable chunks. I just use the 'seff' command after the run has finished to check how much memory was used.
  • Q6: Is there any possibility to run GPU jobs for more than 3 days in Puhti when submitting sbatch jobs? Or is that simply the maximum duration?

    • A: 3 days is the maximum run time for GPU jobs in Puhti. It's always a good idea to make jobs write checkpoints, so that the job can be continued if it is interrupted (by a fault or the queue time limit). The longrun queue (for CPUs) has lower priority, so you'll get to run faster in the "normal" queues. Short queues enable fair sharing and a high usage ratio, i.e. maximum benefit of the system for a large user group.
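    • A minimal sketch of a checkpoint-friendly GPU batch job on Puhti (the application name and its checkpoint flags are hypothetical placeholders):

      #!/bin/bash
      #SBATCH --account=project_xxxx    # replace with your project
      #SBATCH --partition=gpu
      #SBATCH --gres=gpu:v100:1         # one V100 GPU
      #SBATCH --time=3-00:00:00         # 3 days, the GPU maximum
      #SBATCH --cpus-per-task=10

      # The application itself should periodically write a checkpoint file
      # that a resubmitted job can restart from (the flags below are made up).
      srun my_app --checkpoint-every 1h --restart-from latest.ckpt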
  • Q7: Do you also recommend using a conda environment for running our Python scripts via sbatch in Puhti? (I need deep learning libraries such as PyTorch and TensorFlow.) Conda makes it easier to load the libraries.

    • A: Before doing your own PyTorch and TensorFlow installations, check: https://docs.csc.fi/apps/pytorch https://docs.csc.fi/apps/tensorflow/
    • Some of the PyTorch and Keras installations are done with conda. Some as Singularity containers.
    • Conda environments can suffer from slowness (on first import) on the Lustre parallel file system and create a lot of files on disk. Pip can be better, if it is possible to use.
  • Q8: How do I know how many cores (CPUs) I need to optimize my sbatch jobs? Is it a trial-and-error task, or is there a logic I could use to calculate it in advance?

    • A: I typically start with a small number of cores for a short test case, and then increase the number of cores until I find the optimal value. There is a logic as well, but it demands quite a lot of knowledge of exactly how the computer being used works and of the code's algorithm. See the sketch below for a simple scaling test.
    • A: Also check the software documentation to see if the developers give any insight on this.
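    • A minimal sketch of such a scaling test, assuming a short, representative batch script (test_job.sh is a hypothetical name); compare the run times afterwards, e.g. with seff:

      # submit the same short test job with an increasing number of cores;
      # command-line options override the script's #SBATCH lines
      for n in 1 2 4 8 16 32; do
          sbatch --ntasks=$n --job-name=scaling_$n test_job.sh
      done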
  • Q9: Is NoMachine something like ssh user@machine -XY in Linux?

  • Q10: How do I see which node I am on? A login node or a compute node?

    • A: You can see a login node from the command prompt: it says something like [username@puhti-login1 ~]$ -> you're on login node 1.
    • In general, you can use the hostname command in the terminal. On a login node the output looks like puhti-login1.bullx or puhti-login2.bullx; on other (compute or interactive) nodes you see a node name that looks like r07c49.bullx.

Disk areas

  • Q1: Is automatic deletion of scratch active in Mahti or Puhti?

    • A: The clean-up is not yet in use. We have not set a date, but /scratch is filling up fast, so we will have to turn it on soon.
  • Q2: Should I store my data in Allas and then move it to Puhti for computation, and then move it back from Puhti to Allas, correct?

    • A: Yes, this is how Allas should be used. We will talk more about this in Allas session tomorrow!
  • Q3: Should I use $TMPDIR (the temporary directory) for testing, and then the /scratch disk for the calculations themselves?

    • A: I use scratch for testing and running jobs. Use $TMPDIR for reading lots of files, e.g. when compiling a large code. $TMPDIR is local to each node, so you can't submit a batch job from the login node's $TMPDIR to compute nodes.
  • Q4: Do you recommend saving my dataset (~900 GB to 1.1 TB) in /scratch/proj_num in Puhti, or is it better to save it in Allas regardless of the dataset size?

    • A: 1.1 TB in scratch is already more than the standard disk quota (1 TiB) for a project. If you need the data often (every week), it's not a good idea to copy it back and forth to Allas all the time. You can apply for more scratch quota, but it will consume billing units. Note also the limit on the number of files.
  • Q5: What is a 'small' file in this context?

    • A: This indirectly means lots of files. 1 GB in small files can mean 100 000 files, and that's not a good idea on Lustre. Lustre is good for quickly reading large files (~10 MB and up) in parallel, but slow at reading (and writing) many (thousands or more) "small" files. This kind of usage should go to the node-local NVMe disk (see the sketch below), or, if it's database-like, then perhaps not on Puhti at all.
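    • A minimal sketch of a batch job that stages a many-small-files dataset on the node-local disk, assuming Puhti's --gres=nvme:<GB> request and the $LOCAL_SCRATCH variable from CSC's documentation (my_app and the paths are hypothetical placeholders):

      #!/bin/bash
      #SBATCH --account=project_xxxx
      #SBATCH --partition=small
      #SBATCH --time=01:00:00
      #SBATCH --gres=nvme:100           # request 100 GB of node-local NVMe disk

      # Copy the input data to the fast local disk, work there,
      # and copy the results back to /scratch at the end.
      cp -r /scratch/project_xxxx/$USER/input $LOCAL_SCRATCH/
      cd $LOCAL_SCRATCH
      srun my_app input/
      cp -r results /scratch/project_xxxx/$USER/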
  • Q6: Please, can you remind us what the connection between Lustre and Puhti is? What is Lustre useful for?

    • A: Lustre is the parallel file system that handles Puhti's disks, while Puhti is the name of the computer itself.
  • Q7: How do I access the temp folder in Puhti on Windows? echo $TMDIR prints nothing.

    • A: It's $TMPDIR, not $TMDIR.
  • Q8: Should we have access to Mahti for the Disk-areas exercises? I seem to have a pretty red X over Mahti on my Project_2004306 page. :)

    • A: Nope, we are working with Puhti here! In this course we use Puhti & Allas; those should be green in MyCSC :) NOTE: the exercises mention Mahti!

Module system

  • Q1: I can see that Python 3.7.3 is installed by default on Puhti. Would you recommend using tools like pyenv to manage other Python versions?

    • A: You can change the current version via the modules. Did you mean installing your own Python libraries?
    • Puhti already has several Python modules: python-data, pytorch, tensorflow, geoconda, bioconda.
  • Q2: Yes, I was wondering what the best way to install and manage different Python versions in Puhti is.

  • Q3: How can I launch a Jupyter notebook when using the bioconda package?

 export PROJAPPL=/projappl/project_xxx   # add your project here
 module load bioconda                    # activate the conda command on Puhti
 conda create --name gromacs-tutorials -c conda-forge -c bioconda gromacs=2020.4 matplotlib nglview notebook numpy requests pandas seaborn   # create a conda environment with your packages
 source activate gromacs-tutorials       # activate the environment
 python -m ipykernel install --user --name gromacs-tutorials --display-name "gromacs"   # you will see a separate kernel named "gromacs"
 module load python-data                 # load one of the modules that include jupyter
 start-jupyter-server                    # launch jupyter

Batch queue system & interactive use

  • Q1: Still, selecting a partition, number of hours, cores, etc. accordingly likely requires a trial-and-error approach?

    • A: More or less. After you've run some jobs, you start to build up experience of how many resources each type of job needs. But for all new kinds of jobs (new input, new application, …) check the resource usage afterwards to see how much was actually used. The number of cores you need to find with a scaling test (i.e. checking that adding more cores significantly speeds up your job). See the sketch below for checking the usage afterwards.
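    • For example (a sketch; replace <jobid> with the ID of your finished job):

      seff <jobid>                                              # summary of CPU and memory efficiency
      sacct -j <jobid> -o jobid,elapsed,totalcpu,maxrss,state   # more detailed accounting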
  • Q2: How do I comment in batch files? I tried using :: but it doesn't work.

    • A: Use hashes (#) for commenting! Note: to comment out an #SBATCH row, add a space between # and SBATCH :). See the example below.
    • I use ##, and that works for everything.
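    • For example:

      #SBATCH --time=01:00:00   # an active Slurm directive
      # SBATCH --mem=4G         # deactivated: the space makes Slurm ignore it
      ## an ordinary comment; the double hash works everywhere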
  • Q3: How many projects can I have in CSC? Can I remove my test projects? I can't find a way to do this.

    • A: What do you mean by Test projects? If you have some redundant projects, it is not a big deal.
  • Q4: Can I run open-source computing software (FDS for CFD simulation) in the HPC environment?

    • A: Yes, you can. You need to download and compile it, then run it - unless you find it among our already installed applications; in that case, use that.
  • Q4b: The executable files are available on the developer's website. I believe users from Aalto University may have installed it before. How can I check this? Should I contact servicedesk? If I need to install it, how? (FDS = Fire Dynamics Simulator)

    • A: Normally it's not a good idea to use prebuilt binaries for MPI-parallel jobs (jobs that use a lot of resources). It's better to compile them on the supercomputer to make sure the code runs as efficiently as possible. The last slide set https://a3s.fi/CSC_training/10_speed_up_jobs.html discusses this. If the Aalto group has the binaries somewhere, you can of course use them. You need to contact them, though. You can then copy the binaries to your /projappl/project_XXXX/FDS folder and use them from there.
  • Q5: The examples did not allocate a certain amount of memory. How much memory is reserved if it is not defined in the bash script?

    • A: 1 GB is the default and the minimum (less cannot be reserved). That's a lot already; don't reserve more if you don't really need it :)
  • Q6a: How do I run batch jobs, or direct their output, under /scratch/<project> or /projappl/<project>?

    • A: The simplest way to run your batch job under /scratch/project_xxxx is to change from your current directory to the scratch directory before submitting the job, e.g.
      cd /scratch/project_xxx/$USER (this assumes you have your own directory, $USER, under your project). Prepare all the data and scripts needed to run your job in that directory. Finally, submit your job with the sbatch command. Unless you redirect your output elsewhere in the script, the output files from the batch job are written to that scratch directory (see the sketch below).

    • Please note that /projappl/<project> directories are meant for installing (or sharing) software tools for your project. DON'T run any batch jobs in that directory.
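    • As a sketch (the project number and script name are placeholders):

      cd /scratch/project_xxxx/$USER   # run from your own directory under the project's scratch
      sbatch my_batch_job.sh           # output files land here by default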

  • Q6b: How can I access /scratch/<project> and /projappl/<project> via a graphical interface (e.g. FileZilla or VS Code Remote - SSH)?

  • Q7a: How do I direct all input (output?) of a batch job to a job-specific folder?
    As the file must start with #!/bin/bash -l, I don't know how to create directories or change the working directory. I am looking for something like #SBATCH mkdir output_%j or #SBATCH cd output_%j, which would precede #SBATCH --output=x_output_%j.txt and the like, so that everything related to that job would end up in the same folder.

    • A: Only the instructions to the batch job system (resource reservations etc.) start with #SBATCH. The rest is just a normal bash script. By default, the working directory of the job is the directory where you submitted the job. You can create directories and move to them during the job:
      mkdir some/directory
      cd some/directory
    
  • Q7b: How do I name the folders and files with the job ID? I tried this, but it did not work:

      mkdir out_${SLURM_JOB_ID}
      cd out_${SLURM_JOB_ID}
      srun singularity_wrapper exec R --no-save --quiet < rscript.R 2>&1 | tee out_${SLURM_JOB_ID}.R
    
    • A: Can you paste here what worked and what didn't? I would not use tee in the command. A simple redirect should be enough (srun ... < input > output). Otherwise the output (what would appear on screen) goes to #SBATCH -o <filename>, or, if that is not defined, to slurm-${SLURM_JOB_ID}.out

    • Follow-up Q7b:
      I figured it out. The problem was that the R script was not in the newly created directory, so I had to use an absolute path.

      srun ... < /scratch/project_XXX/user/input.R 2>&1 | tee output_${SLURM_JOB_ID}.R
    
    • As for the redirecting: a simple redirect won't print all R input & output, including comments, which is what I want. Unless this slows down the process considerably (does it?), I am reasonably happy with the current solution, albeit verbose.
    • Another option would be to omit | tee output_${SLURM_JOB_ID}.R and use #SBATCH --output=x_output_%j.R, but that output would have to be moved to the new folder manually. That's why I originally asked about changing the working directory before, or as part of, the #SBATCH commands in Q7a. The idea was to have the my_batch_job.sh file in a directory and, at the very beginning of that file, create a subdirectory with the job-specific name.
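    • Putting the pieces above together, a minimal sketch of such a job script (the project number, paths and the slurm_logs directory are placeholder assumptions; note that Slurm does not create the --output directory for you):

      #!/bin/bash -l
      #SBATCH --account=project_xxxx
      #SBATCH --time=01:00:00
      #SBATCH --output=slurm_logs/%j.out    # this directory must exist before submitting

      # Create a job-specific directory and run there, as discussed in Q7a/Q7b
      mkdir out_${SLURM_JOB_ID}
      cd out_${SLURM_JOB_ID}
      srun singularity_wrapper exec R --no-save --quiet \
          < /scratch/project_xxxx/$USER/rscript.R 2>&1 | tee output_${SLURM_JOB_ID}.R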

Guidelines for efficient use

  • Q1: Does this mean that using the query command watch -n 60 sacct (to check the job status every minute) is heavy on the login node?
    • A: I usually follow my jobs by having an output file where I can follow progress.
    • It's not only heavy for the login node, it is also heavy for Slurm! Don't do it.
    • It's not recommended to build "scripts" that monitor jobs and submit new ones once the old ones have finished (a job scheduler on top of a job scheduler). For this kind of need, please contact servicedesk for a better approach.
  • Q2: Follow-up to the previous: so how do we know when the job has finished?
    • A: If you can't see it in the queue anymore, it's done :) You can also have it send you an e-mail when the job is done by adding these lines to your batch script:

      #SBATCH --mail-user=your.email@university.fi
      #SBATCH --mail-type=END
    
  • Q3: Will additional disk space be charged for academic use?
    • A: Yes, in billing units, which you can apply for in MyCSC.

Allas - where to keep your active data

  • Q1: Where should I store, for example, an 80 GB geopackage (sqlite extension) so that it has as fast as possible public http access (queryable with gdal/ogr SQL commands)?

    • A: For public http access Allas is OK. Access from Puhti is faster from Puhti's local disk.
  • Q2: Can I access a single object from a bucket?

  • Q3: Is there a way to sync data using the rsync command without deleting the existing data in the destination folder?

    • A: rsync does not delete anything in the destination unless you pass the --delete flag; by default it transfers only files that are new or have changed in the source folder. The -P flag shows progress and keeps partially transferred files; use -c if you want rsync to compare files by checksum instead of size and modification time.
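    • For example (the paths and host are placeholders):

      # copies new and changed files only; nothing at the destination is deleted
      rsync -avP /scratch/project_xxxx/data/ user@server:/backup/data/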
  • Q4: How do I close the connection to a project in Allas?

    • A: You don't really have to do that :) If you change the connection to another project, the previous connection is closed.
    • A: In the case of the Swift protocol (a-commands, rclone), the "key" is stored in a session-specific environment variable, OS_AUTH_TOKEN. You can unset this variable (unset OS_AUTH_TOKEN) or just log out.
    • In the case of the S3 protocol (s3cmd) you can use the command:
      allas-conf s3remove
  • Q5: What would be the best way to take automatic backups of all/some files and directories in /scratch/project_XXX/user to Allas? Preferably as part of a batch job.

    • A: There are two options: 1) just copy the data you want to back up to Allas with tools like a-put or rclone, or 2) use allas-backup to make incremental backups. The latter is good if you want to store several versions of the same file or directory. Both a-put and allas-backup can be used in batch jobs if you set up the Allas connection with: allas-conf -k (see the sketch below).
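    • A minimal sketch of such a backup step in a batch job (the project number, path and bucket name are placeholders; the connection must have been set up beforehand with allas-conf -k):

      #!/bin/bash
      #SBATCH --account=project_xxxx
      #SBATCH --partition=small
      #SBATCH --time=02:00:00

      module load allas
      a-put /scratch/project_xxxx/$USER/results -b my-backup-bucket   # copy the results directory to Allas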

Installing your own applications

  • Q1: Are the compilers already installed on the CSC machines?
  • Q2: Note: in the exercises, there's one USERAPPL where there should be PROJAPPL.
    • A: Fixed this, thanks!
  • Q3: What are the differences between installing with conda and from GitHub?
    • A: We should add something about Conda to the installation material. Conda is an easy way to install software, but the environment folder will contain LOTS of files, so it will fill up your PROJAPPL quota soon :( You can use Conda, but it has these limitations. Check whether the same software is available as a container! See the sketch below for keeping the environment under PROJAPPL.
    • See: https://docs.csc.fi/support/tutorials/conda/
    • Note: be careful when installing multiple Conda envs.
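    • A minimal sketch of creating a Conda environment under /projappl rather than your home directory (the project number, environment name and packages are placeholders):

      module load bioconda                  # provides the conda command on Puhti
      conda create --prefix /projappl/project_xxxx/my-env -c conda-forge python=3.9 numpy
      source activate /projappl/project_xxxx/my-env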

General questions