
Intro to Scientific Computing and High Performance Computing

╔═══════════╗ ╔═════════════╗
║   WEB     ║ ║  TERMINAL   ║
║  BROWSER  ║ ║   WINDOW    ║
║  WINDOW   ║ ╚═════════════╝
║   WITH    ║ ╔═════════════╗
║   THE     ║ ║   BROWSER   ║
║  STREAM   ║ ║   W/HACKMD  ║
╚═══════════╝ ╚═════════════╝
  • Do not put names or identifying information on this page

Day 1 and Day 2

HackMD archived here: https://hackmd.io/@AaltoSciComp/ARCHIVEintroWinter2022

Day 3, 4/Feb/2022: Introduction to HPC

Today: parallel computing and scaling up to LUMI
https://scicomp.aalto.fi/training/scip/winter-kickstart/

Questions from yesterday

  • Thank you for clarification. How to choose the volume of memory and GPU/CPU usage for JupyterHub? After logging in I cannot see any choice for these parameters.
    • Currently there is no option for GPUs. GPU resources are so highly requested that we need to prioritize non-interactive usage, as that utilizes the GPUs to the greatest extent. There's also no CPU selection, for a similar reason. The memory limit should be chosen based on the size of the data you're processing.
  • Not sure if other people had the same problem, but I get an error when using lfs find directoryname: "cannot get lov name. inappropriate ioctl for device". What am I doing wrong?

Other resources

Icebreaker: Do you think your work could be parallelized? Could it be run side by side or with multiple processors?

  • All the data points in my research code can in principle be calculated independently in parallel. At the moment I'm only using core-level vectorization and OpenMP.
  • Sometimes I have to calculate aggregate statistics/estimate models on various subsets of the data.
  • Right now, due to my overall ignorance of how and where to parallelize, I fear I cannot even answer this question! I mostly run ML pipelines to analyse data, though, and do not really write software.
  • I too am going to run two ML pipelines for feature detection and am not quite sure whether that would require parallelization.

Yesterday was:

  • useful: oooooooooo
  • not useful: o
  • too fast: o
  • too slow: o
  • just right: o
  • I would recommend this course to others: ooooooooo

How should this course be made better:

  • It should be integrated into a full course ordered by skill level: shell -> this course -> Git -> Software design
    • good idea! We sort of do them separately because otherwise it becomes too long, but check out https://hands-on.coderefinery.org/ (and CR may try to advertise them together more).
  • .
  • .
  • .

Array jobs

https://scicomp.aalto.fi/triton/tut/array/

  • Question: What happens if I add multiple srun commands to the same script?
    • Each appears as a separate job step in the accounting history, with resources tracked separately. They still run one after another.
      • Thanks!
  • Q: Will the above wait for the completion of the previous step?
    • No. Array tasks are parallel and independent of each other; they do not run in series.
      • Sorry, there is a misunderstanding. The question is about the multiple srun commands in the same script.
      • All srun steps within one array task will be executed one after another within that task, independently of what is going on in the other array tasks.
      • Overall, an array job is just a number of jobs that run from the same Slurm batch script template; they differ only by $SLURM_ARRAY_TASK_ID. (See the sketch after this list.)
  • Q: How long should each parallel job in an array take to be useful instead of just running in series?
    • Depends on the system. I would say ~1 hour each, at least.
    • You can always play with and adapt your workflow: make one array task with enough srun steps that it takes at least an hour. If the question is "array job with N tasks vs N sbatch submissions": when you have a series of simulations, like the same binary with just different inputs (i.e. the same SBATCH options, running environment, etc.), it always makes sense to put them into an array. Array jobs (i) make Slurm's work more efficient, since the controller checks/sets up all the parameters only once and then copies the same parameters to all the other array tasks, and (ii) are easier to follow from the user's point of view with squeue/scancel etc.
  • Are %A and %a part of the sbatch syntax in Slurm?
    • Correct. man sbatch will tell you more.
  • Q: Will the array tasks queue separately? Or will they wait until all of them can be launched simultaneously?
    • Array tasks queue independently. For array jobs with a large number of tasks, it is common that a few tasks are running at a time while the rest are pending.
  • Q: When you request memory for X number of array jobs. Will the amount of memory be divided into the array jobs or does each array job get the X amount of memory?
    • Each will get the same memory requirement; there is no division. All other #SBATCH requirements are identical (basically copy-pasted) for all individual array tasks. So if you say #SBATCH --mem=2G and #SBATCH --array=1-10, you won't get a 20GB memory requirement; each of the 10 tasks gets a 2G requirement.
    • Consider the submit script a template: with --array=..., that number of jobs will run using the same template.
  • Q: What is the maximum length of array and is there a maximum for the argument?
    • Depends on the cluster; in the 10000s on Triton, I think.
      • But please test with smaller numbers first :) If you are going to run 10K jobs and each job lasts 2 minutes, that is not good for you or for others. It is better to pack iterations or parameters so that one job lasts a few hours; otherwise you will spend most of the time queueing and only a few minutes running.
    • If the question is the number of array tasks in one array, check 'scontrol show config | grep -i array': MaxArraySize = 200001 means array=0-200000 is OK, but array=0-200001 will produce an error (Triton example).
    • Regarding a maximum for arguments: how many arguments you can give to your running binary is not up to Slurm but to the Bash/Linux setup; see 'getconf ARG_MAX' for the command-line length limit in bytes. It is huge anyway.
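
To tie the answers above together, here is a minimal array job sketch in the spirit of the tutorial. This is a sketch, not the exact exercise script: the --seed option and the iteration count follow how the tutorial usually parameterizes pi.py, so check its actual usage.

    #!/bin/bash
    #SBATCH --time=00:30:00          # time limit applies to each array task
    #SBATCH --mem=500M               # each task gets 500M, not 500M divided by 10
    #SBATCH --array=1-10             # ten independent tasks, IDs 1..10
    #SBATCH --output=pi_%A_%a.out    # %A = array job ID, %a = array task ID

    # Every task runs this same script; only $SLURM_ARRAY_TASK_ID differs.
    # Several srun lines here would run one after another as separate job steps.
    srun python hpc-examples/slurm/pi.py --seed=$SLURM_ARRAY_TASK_ID 1000000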

Type-along: your first array job, until xx:31

https://scicomp.aalto.fi/triton/tut/array/#your-first-array-job
Just this section, exercise time is later.

I managed to do the sample:

  • yes: ooooooooo
  • no: o
  • I did not try: oo

Array exercises until xx:00, then break until xx:10

https://scicomp.aalto.fi/triton/tut/array/#exercises
At least try Exercise 1, answer Exercise 2 in HackMD, and then do whatever else you have time/desire to do.

Helsinki:
There is again the University of Helsinki "type-along" session in Break Out Room 1, paying attention to the differences between Triton and Turso.

  • Where do I download the pi.py file?
    • If you did the exercise yesterday, it is in the git repository; otherwise run these commands:

        ssh NAMEOFTHECLUSTER
        cd $WRKDIR   # or cd /scratch/work/MYUSERNAME
        git clone https://github.com/AaltoSciComp/hpc-examples.git

      The file is in the subfolder ./hpc-examples/slurm/pi.py

I managed to do exercise 1:

  • yes: ooooo
  • no: oo
  • I did not try: o

  • Q: Instead of that case thing I just did the following. Is there some difference in execution?

    #SBATCH --array=50,100,500,1000
    srun python hpc-examples/slurm/memory-hog.py ${SLURM_ARRAY_TASK_ID}\M
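
(No one answered above, so a hedged note: there should be no real difference in execution. Both ways give one independent task per listed value; here the task ID itself is the parameter passed to memory-hog.py. The case construct is mainly needed when the parameters cannot be encoded in --array directly. A rough sketch of that variant, with illustrative values rather than the exact exercise numbers:)

    #!/bin/bash
    #SBATCH --array=0-3

    # Map small sequential task IDs to arbitrary parameters
    # (illustrative values, not necessarily the exercise's):
    case $SLURM_ARRAY_TASK_ID in
        0) MEM=50M   ;;
        1) MEM=100M  ;;
        2) MEM=500M  ;;
        3) MEM=1000M ;;
    esac
    srun python hpc-examples/slurm/memory-hog.py $MEM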

Exercise 2: Do you think your problem could utilize an array-structure? Are you interested in it but you're not certain how to split your data / parameters? Answer/ask below:

  • This has certainly been the direction for me. Each data point takes 1h or more depending on the case, and running a full analysis takes a week or so in series. I have now been changing my C++ code to handle and organize data points independently and to save/load data independently as well. I'm not sure how to combine algorithm-level parallelism with the embarrassing parallelism (running jobs in parallel) at the moment.
    • If needed, help is available for Aalto people (either garage or even an RSE project), just in case.

continued

Parallel

https://scicomp.aalto.fi/triton/tut/parallel/

  • What is the difference between CPUs, cores, and nodes?
    • In Slurm terms, "core" == "CPU", and one physical processor (with multiple cores) is called a "socket"; it's the shorthand we use.
    • Though in hardware design, they would say one CPU == one chip, and one chip can have multiple computing cores in it.
    • Node == "one discrete computer's hardware"
      • Indeed, I used the term "CPU" in a bit of an ambiguous way. Usually our servers have two actual CPUs, physical chips, in two sockets. Each of these CPUs has multiple logical processors, aka cores, that can do calculations. They are often also referred to as CPUs. Slurm considers these cores as CPUs for --cpus-per-task. (Simo)
  • Hi, embarrassingly I missed the MPI part of the video; is there a way to get back to it right now on Twitch?
  • What is the difference between OpenMP and MPI?
    • OpenMP is a standard and an implementation of the shared-memory programming model, one of several; MPI (Message Passing Interface) is a standard developed for distributed-memory programming. MPI has several implementations; the most often used is OpenMPI (not to be confused with OpenMP), but there are others, like Intel MPI, MVAPICH, etc.
    • The names are similar, but that is just an unfortunate choice by the standard developers. The function is as described in the comment above.
  • Do all Triton nodes share the same basic processor architecture so that e.g. Intel core level vectorizations should work?
    • Triton has several CPU architecture generations; they are all x86_64 but differ in details. Yes, all support vectorization; the oldest we have are the pe[] and c[] nodes with the Intel Xeon E5-2680 v3 (Haswell, AVX2 instructions).
    • You can choose a specific processor architecture with the --constraint=X option. The available features are listable with 'slurm features'; for example, use --constraint=avx512 if you need AVX-512 instructions.
    • For the full picture see https://scicomp.aalto.fi/triton/overview/ , in particular the 'Arch' column.
  • So do I understand this correctly? OpenMPI would take you across nodes, but OpenMP is for distributing over the cores on the same node?
    • OpenMP is for distributing work over the cores of the same node.
    • MPI allows programs to communicate across nodes.
    • So yes :)
    • OpenMPI is one implementation of MPI.
    • OpenMP, a standard for doing shared-memory calculations, sounds unfortunately similar to OpenMPI, which is a popular MPI implementation. (See the sketch after this list for how the two look in a batch script.)
  • If I have a Mathematica code on my computer that takes a long time to run, how can I run it on a cluster?
    • You can probably use functions from Wolfram's documentation together with --cpus-per-task to run your program in parallel. It really depends on the program. It might be a good idea to contact RSE if you need help with the implementation.
    • For usage on Triton, see our Mathematica page. We'll need to add info on parallel runs there.
  • .
  • .
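
To make the OpenMP vs MPI distinction above concrete in Slurm terms, here are two minimal batch script sketches (the binary and module names are placeholders, not actual Triton software):

    #!/bin/bash
    # OpenMP (shared memory): one task with several cores on a single node
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # threads = allocated cores
    srun ./my_openmp_program                      # placeholder binary

    #!/bin/bash
    # MPI (distributed memory): many tasks that Slurm may spread across nodes
    #SBATCH --ntasks=16
    module load openmpi      # module name varies per cluster; check 'module spider'
    srun ./my_mpi_program    # placeholder binary; srun launches the 16 MPI ranks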

Break until xx:04

From laptop to LUMI - CSC services for researchers

Slides: https://github.com/AaltoSciComp/scicomp-docs/raw/master/training/scip/CSC-services_022022.pdf

Question: I have used some CSC services

  • Yes: xxxxxxx
  • No: xxxxxxxxxxxxx
  • I am going to: x

Questions to Jussi & questions about CSC:

Break until xx:00

Then GPU.

GPU

https://scicomp.aalto.fi/triton/tut/gpu/

  • How do you draw a monster for gaming?

    • Usually they are models constructed from voxels (3D pixels).
  • How much faster should the job be on GPU compared to CPU? (to be eligible for GPU)

    • That depends a lot on the algorithm, implementation, type of GPU card, amount of video memory, and other factors; some code may see a 50x speed increase, some 2x, some none. (A minimal GPU job sketch follows after this list.)
  • current example: https://scicomp.aalto.fi/triton/tut/gpu/#simple-tensorflow-keras-model

    • If the speed is the same as on CPU, but GPU is more expensive, does it make sense to run on GPU?
      • Nope. The CPU way is cheaper; if there is no difference in performance, CPUs are also easier to get.
    • Is there some price comparison for them to know how much faster GPU should be to make it worth it?
      • One NVIDIA Tesla A100 costs us 7-8 k€; a box with two modern Intel CPUs and 128G of memory costs 5-7 k€. But the A100 theoretically provides 9.6 TFLOPS (double precision), while a two-socket node with 40+ CPU cores in total is ~2.5 TFLOPS, depending on the type. Very roughly that is 9.6/7.5 ≈ 1.3 TFLOPS per k€ for the GPU versus 2.5/6 ≈ 0.4 for the CPU node, i.e. about 3x on paper; these are theoretical numbers, and live benchmarks would need to be run to see realistic FLOPS per euro.
  • Can you also use newer versions of, for example, tensorflow than there are in the anaconda module at the moment?

    • Yes, you can install your own anaconda environments with any versions of things you want. We haven't covered it much so far, but the Python page below has hints.
    • https://scicomp.aalto.fi/triton/apps/python/
    • You can also let us know and we can install a newer version for you
  • How do other clusters monitor their GPU performance?

    • HY, TUNI, ?
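
For reference, a minimal GPU batch script sketch along the lines of the tutorial page above (the partition name and the Python script are placeholders; Triton's exact options are in the GPU tutorial):

    #!/bin/bash
    #SBATCH --time=00:15:00
    #SBATCH --partition=gpu    # cluster-specific; some clusters select it automatically
    #SBATCH --gres=gpu:1       # request one GPU of any type

    # Show which GPU we were allocated, then run the actual program.
    nvidia-smi
    srun python my_gpu_script.py   # placeholder, e.g. the Keras example from above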

Exercises (done as demos/type-along)

https://scicomp.aalto.fi/triton/tut/gpu/#exercises

  • Why do I get this error?
    srun -M ukko -p gpu nvidia-smi
    srun: job 270697145 queued and waiting for resources
    srun: job 270697145 has been allocated resources
    srun: error: task 0 launch failed: Slurmd could not execve job

That is strange, I can't reproduce that now (Pekko Metsä).
If you say 'module purge', do you still have the same problem?

  • yes
    Strange... If you could visit our HPC Garage on Monday, we'd have a closer look.

Announcements

Upcoming (hybrid!) course about MPI programming. March 8-18, register here:
https://scicomp.aalto.fi/training/scip/mpi-introduction

Other courses are being added at: https://scicomp.aalto.fi/training/

The Aalto RSE service (for the Aalto community) can help you manage any of the things we discuss today, free and short consultations welcome: https://scicomp.aalto.fi/rse/ . Come to our daily garage any day at 13:00.

Feedback

Note: registered participants will also receive a form for anonymous feedback. It helps us a lot to receive your feedback, because we can make future versions of this course even better! <3

Did you feel the course was engaging, even though it was online:

  • better than in-person: oooooooo
  • same as in-person: ooo
  • worse than in-person: o

I would have:

  • attended hybrid in a lecture room: o
  • online-only was good enough: oooooooo
  • preferred lecturers to be in lecture hall: o

One good thing about the course:

  • you didn't make us feel stupid about our stupid questions
  • the fact that it can be done remotely
  • I really liked that this was divided to several shorter days, and not one or two long days. It’s easier to concentrate this way and get more out of each part.
  • Getting to use the system during the lectures
  • You guys are fantastic! Thank you!

One thing to improve for next time:

  • One longer break would be nice
  • .

Favorite lesson of the course:

  • Serial jobs
  • .

Lesson that needs most improvement:

  • .
  • .

Lesson that could be added:

  • .
  • .

Thank you and follow us on Twitter for more news and interesting links! https://twitter.com/SciCompAalto



