
Intro to Scientific Computing and High Performance Computing

╔═══════════╗ ╔═════════════╗
║   WEB     ║ ║  TERMINAL   ║
║  BROWSER  ║ ║   WINDOW    ║
║  WINDOW   ║ ╚═════════════╝
║   WITH    ║ ╔═════════════╗
║   THE     ║ ║   BROWSER   ║
║  STREAM   ║ ║   W/HACKMD  ║
╚═══════════╝ ╚═════════════╝
  • Do not put names or identifying information on this page

Day 1 and Day 2

HackMD archived here: https://hackmd.io/@AaltoSciComp/ARCHIVEintroWinter2022

Day 3, 4/Feb/2022: Introduction to HPC

Today: parallel computing and scaling up to LUMI
https://scicomp.aalto.fi/training/scip/winter-kickstart/

Questions from yesterday

  • Thank you for clarification. How to choose the volume of memory and GPU/CPU usage for JupyterHub? After logging in I cannot see any choice for these parameters.
    • Currently there is no option for GPUs. GPU resources are so highly requested that we need to prioritize non-interactive usage, as that utilizes the GPUs to the greatest extent. There's also no CPU selection, for a similar reason. The memory limit should be chosen based on the size of the data you're processing.
  • Not sure if other people had the same problem, but I get an error when using lfs find directoryname: "cannot get lov name. inappropriate ioctl for device". What am I doing wrong?

Other resources

Icebreaker: Do you think your work could be parallelized? Could it be run side by side or with multiple processors?

  • All the data points in my research code can in principle be calculated independently in parallel. At the moment I'm only using core-level vectorization and OpenMP.
  • Sometimes I have to calculate aggregate statistics/estimate models on various subsets of the data.
  • Right now, due to my overall ignorance of how and where to parallelize, I fear I cannot even answer this question! I mostly run ML pipelines to analyse data, though, and do not really write software.
  • I too am going to run two ML pipelines for feature detection and am not quite sure whether that would require parallelization.

Yesterday was:

  • useful: oooooooooo
  • not useful: o
  • too fast: o
  • too slow: o
  • just right: o
  • I would recommend this course to others: ooooooooo

How should this course be made better:

  • It should be integrated into a full course ordered by skill level: shell -> this course -> Git -> Software design
    • good idea! We sort of do them separately because otherwise it becomes too long, but check out https://hands-on.coderefinery.org/ (and CR may try to advertise them together more).
  • .
  • .
  • .

Array jobs

https://scicomp.aalto.fi/triton/tut/array/

  • Question: What happens if I add multiple srun commands to the same script?
    • Each appears as a separate job step in the accounting history, with resources tracked separately. They still run one after another.
      • Thanks!
  • Q: Will the above wait for the completion of the previous step?
    • No. Array tasks are parallel and independent of each other; they do not run in series.
      • Sorry, there is a misunderstanding. The question is about the multiple srun commands in the same script.
      • All srun steps within one array task will be executed one after another within that task, independently of what is going on in the other array tasks.
      • Overall, an array job is just a number of jobs that run from the same Slurm batch script template; they differ only by $SLURM_ARRAY_TASK_ID. (See the sketch after this list.)
  • Q: How long should each parallel job in an array take to be useful instead of just running in series?
    • Depends on the system. I would say ~1 hour each, at least.
    • You can always play with and adapt your workflow: make one array task with enough srun steps that it takes at least an hour. If the question is "array job with N tasks vs N sbatch submissions": when you have a series of simulations, like the same binary with just different inputs (i.e. the same SBATCH options, running environment, etc.), it always makes sense to put them into an array. Array jobs (i) make Slurm's work more efficient, since the controller checks/sets up all the parameters only once and then copies the same parameters to all the other array tasks, and (ii) are easier to follow from the user's point of view with squeue/scancel etc.
  • Are %A and %a part of the sbatch syntax in Slurm?
    • Correct. man sbatch will tell you more.
  • Q: Will the array tasks queue separately? Or will they wait until all of them can be launched simultaneously?
    • Array tasks queue independently. For array jobs with a large number of tasks, it is common that a few tasks are running at a time while the rest are pending.
  • Q: When you request memory for X number of array jobs. Will the amount of memory be divided into the array jobs or does each array job get the X amount of memory?
    • Each will get the same memory requirement; there is no division. All other #SBATCH requirements are identical (basically copy-pasted) for all individual array tasks. So if you say #SBATCH --mem=2G and #SBATCH --array=1-10, you won't get a 20GB memory requirement; each of the 10 tasks gets a 2G requirement.
    • Consider the submit script a template: with --array=..., that number of jobs will run using the same template.
  • Q: What is the maximum length of array and is there a maximum for the argument?
    • Depends on the cluster; in the 10000s on Triton, I think.
      • But please test with smaller numbers first :) If you are going to run 10K jobs and each job lasts 2 minutes, that is not good for you or for others. It is better to pack iterations or parameters so that one job lasts a few hours; otherwise you will spend most of the time queueing and only a few minutes running.
    • If the question is the number of array tasks in one array, check 'scontrol show config | grep -i array': MaxArraySize = 200001 means array=0-200000 is OK, but array=0-200001 will produce an error (Triton example).
    • Regarding a maximum for arguments: how many arguments you can give to your running binary is not up to Slurm but to the Bash/Linux setup; see 'getconf ARG_MAX' for the command-line length limit in bytes. It is huge anyway.
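
To tie the answers above together, here is a minimal array job sketch in the spirit of the tutorial. This is a sketch, not the exact exercise script: the --seed option and the iteration count follow how the tutorial usually parameterizes pi.py, so check its actual usage.

    #!/bin/bash
    #SBATCH --time=00:30:00          # time limit applies to each array task
    #SBATCH --mem=500M               # each task gets 500M, not 500M divided by 10
    #SBATCH --array=1-10             # ten independent tasks, IDs 1..10
    #SBATCH --output=pi_%A_%a.out    # %A = array job ID, %a = array task ID

    # Every task runs this same script; only $SLURM_ARRAY_TASK_ID differs.
    # Several srun lines here would run one after another as separate job steps.
    srun python hpc-examples/slurm/pi.py --seed=$SLURM_ARRAY_TASK_ID 1000000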

Type-along: your first array job, until xx:31

https://scicomp.aalto.fi/triton/tut/array/#your-first-array-job
Just this section, exercise time is later.

I managed to do the sample:

  • yes: ooooooooo
  • no: o
  • I did not try: oo

Array exercises until xx:00, then break until xx:10

https://scicomp.aalto.fi/triton/tut/array/#exercises
At least try Exercise 1, answer Exercise 2 in HackMD, and then do whatever else you have time/desire to do.

Helsinki:
There is again the University of Helsinki "type-along" session in Break Out Room 1, paying attention to the differences between Triton and Turso.

  • Where do I download the pi.py file?
    • If you did the exercise yesterday, it is in the git repository; otherwise run these commands:

        ssh NAMEOFTHECLUSTER
        cd $WRKDIR   # or cd /scratch/work/MYUSERNAME
        git clone https://github.com/AaltoSciComp/hpc-examples.git

      The file is in the subfolder ./hpc-examples/slurm/pi.py

I managed to do exercise 1:

  • yes: ooooo
  • no: oo
  • I did not try: o

  • Q: Instead of that case thing I just did the following. Is there some difference in execution?

    #SBATCH --array=50,100,500,1000
    srun python hpc-examples/slurm/memory-hog.py ${SLURM_ARRAY_TASK_ID}\M
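
(No one answered above, so a hedged note: there should be no real difference in execution. Both ways give one independent task per listed value; here the task ID itself is the parameter passed to memory-hog.py. The case construct is mainly needed when the parameters cannot be encoded in --array directly. A rough sketch of that variant, with illustrative values rather than the exact exercise numbers:)

    #!/bin/bash
    #SBATCH --array=0-3

    # Map small sequential task IDs to arbitrary parameters
    # (illustrative values, not necessarily the exercise's):
    case $SLURM_ARRAY_TASK_ID in
        0) MEM=50M   ;;
        1) MEM=100M  ;;
        2) MEM=500M  ;;
        3) MEM=1000M ;;
    esac
    srun python hpc-examples/slurm/memory-hog.py $MEM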

Exercise 2: Do you think your problem could utilize an array-structure? Are you interested in it but you're not certain how to split your data / parameters? Answer/ask below:

  • This has certainly been the direction for me. Each data point takes 1h or more depending on the case, and running a full analysis takes a week or so in series. I have now been changing my C++ code to handle and organize data points independently and to save/load data independently as well. I'm not sure how to combine algorithm-level parallelism with the embarrassing parallelism (running jobs in parallel) at the moment.
    • If needed, help is available for Aalto people (either garage or even an RSE project), just in case.

continued

Parallel

https://scicomp.aalto.fi/triton/tut/parallel/

  • What is the difference between CPUs, cores, and nodes?
    • In Slurm terms, "core" == "CPU", and one physical processor (with multiple cores) is called a "socket"; it's the shorthand we use.
    • Though in hardware design, they would say one CPU == one chip, and one chip can have multiple computing cores in it.
    • Node == "one discrete computer's hardware"
      • Indeed, I used the term "CPU" in a bit of an ambiguous way. Usually our servers have two actual CPUs, physical chips, in two sockets. Each of these CPUs has multiple logical processors, aka cores, that can do calculations. They are often also referred to as CPUs. Slurm considers these cores as CPUs for --cpus-per-task. (Simo)
  • Hi, embarrassingly I missed the MPI part of the video; is there a way to get back to it right now on Twitch?
  • What is the difference between OpenMP and MPI?
    • OpenMP is a standard and an implementation of the shared-memory programming model, one of several; MPI (Message Passing Interface) is a standard developed for distributed-memory programming. MPI has several implementations; the most often used is OpenMPI (not to be confused with OpenMP), but there are others, like Intel MPI, MVAPICH, etc.
    • The names are similar, but that is just an unfortunate choice by the standard developers. The function is as described in the comment above.
  • Do all Triton nodes share the same basic processor architecture so that e.g. Intel core level vectorizations should work?
    • Triton has several CPU architecture generations; they are all x86_64 but differ in details. Yes, all support vectorization; the oldest we have are the pe[] and c[] nodes with the Intel Xeon E5-2680 v3 (Haswell, AVX2 instructions).
    • You can choose a specific processor architecture with the --constraint=X option. The available features are listable with 'slurm features'; for example, use --constraint=avx512 if you need AVX-512 instructions.
    • For the full picture see https://scicomp.aalto.fi/triton/overview/ , in particular the 'Arch' column.
  • So do I understand this correctly? OpenMPI would take you across nodes, but OpenMP is for distributing over the cores on the same node?
    • OpenMP is for distributing work over the cores of the same node.
    • MPI allows programs to communicate across nodes.
    • So yes :)
    • OpenMPI is one implementation of MPI.
    • OpenMP, a standard for doing shared-memory calculations, sounds unfortunately similar to OpenMPI, which is a popular MPI implementation. (See the sketch after this list for how the two look in a batch script.)
  • If I have a Mathematica code on my computer that takes a long time to run, how can I run it on a cluster?
    • You can probably use functions from Wolfram's documentation together with --cpus-per-task to run your program in parallel. It really depends on the program. It might be a good idea to contact RSE if you need help with the implementation.
    • For usage on Triton, see our Mathematica page. We'll need to add info on parallel runs there.
  • .
  • .
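
To make the OpenMP vs MPI distinction above concrete in Slurm terms, here are two minimal batch script sketches (the binary and module names are placeholders, not actual Triton software):

    #!/bin/bash
    # OpenMP (shared memory): one task with several cores on a single node
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # threads = allocated cores
    srun ./my_openmp_program                      # placeholder binary

    #!/bin/bash
    # MPI (distributed memory): many tasks that Slurm may spread across nodes
    #SBATCH --ntasks=16
    module load openmpi      # module name varies per cluster; check 'module spider'
    srun ./my_mpi_program    # placeholder binary; srun launches the 16 MPI ranks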

Break until xx:04

From laptop to LUMI - CSC services for researchers

Slides: https://github.com/AaltoSciComp/scicomp-docs/raw/master/training/scip/CSC-services_022022.pdf

Question: I have used some CSC services

  • Yes: xxxxxxx
  • No: xxxxxxxxxxxxx
  • I am going to: x

Questions to Jussi & questions about CSC:

Break until xx:00

Then GPU.

GPU

https://scicomp.aalto.fi/triton/tut/gpu/

  • How do you draw a monster for gaming?

    • Usually they are models constructed from voxels (3D pixels).
  • How much faster should the job be on GPU compared to CPU? (to be eligible for GPU)

    • That depends a lot on the algorithm, implementation, type of GPU card, amount of video memory, and other factors; some code may see a 50x speed increase, some 2x, some none. (A minimal GPU job sketch follows after this list.)
  • current example: https://scicomp.aalto.fi/triton/tut/gpu/#simple-tensorflow-keras-model

    • If the speed is the same as on CPU, but GPU is more expensive, does it make sense to run on GPU?
      • Nope. The CPU way is cheaper; if there is no difference in performance, CPUs are also easier to get.
    • Is there some price comparison for them to know how much faster GPU should be to make it worth it?
      • One NVIDIA Tesla A100 costs us 7-8 k€; a box with two modern Intel CPUs and 128G of memory costs 5-7 k€. But the A100 theoretically provides 9.6 TFLOPS (double precision), while a two-socket node with 40+ CPU cores in total is ~2.5 TFLOPS, depending on the type. Very roughly that is 9.6/7.5 ≈ 1.3 TFLOPS per k€ for the GPU versus 2.5/6 ≈ 0.4 for the CPU node, i.e. about 3x on paper; these are theoretical numbers, and live benchmarks would need to be run to see realistic FLOPS per euro.
  • Can you also use newer versions of, for example, tensorflow than there are in the anaconda module at the moment?

    • Yes, you can install your own anaconda environments with any versions of things you want. We haven't covered it much so far, but the Python page below has hints.
    • https://scicomp.aalto.fi/triton/apps/python/
    • You can also let us know and we can install a newer version for you
  • How do other clusters monitor their GPU performance?

    • HY, TUNI, ?
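
For reference, a minimal GPU batch script sketch along the lines of the tutorial page above (the partition name and the Python script are placeholders; Triton's exact options are in the GPU tutorial):

    #!/bin/bash
    #SBATCH --time=00:15:00
    #SBATCH --partition=gpu    # cluster-specific; some clusters select it automatically
    #SBATCH --gres=gpu:1       # request one GPU of any type

    # Show which GPU we were allocated, then run the actual program.
    nvidia-smi
    srun python my_gpu_script.py   # placeholder, e.g. the Keras example from above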

Exercises (done as demos/type-along)

https://scicomp.aalto.fi/triton/tut/gpu/#exercises

  • Why do I get this error?
    srun -M ukko -p gpu nvidia-smi
    srun: job 270697145 queued and waiting for resources
    srun: job 270697145 has been allocated resources
    srun: error: task 0 launch failed: Slurmd could not execve job

That is strange, I can't reproduce that now (Pekko Metsä).
If you say 'module purge', do you still have the same problem?

  • yes
    Strange... If you could visit our HPC Garage on Monday, we'd have a closer look.

Announcements

Upcoming (hybrid!) course about MPI programming. March 8-18, register here:
https://scicomp.aalto.fi/training/scip/mpi-introduction

Other courses are being added at: https://scicomp.aalto.fi/training/

The Aalto RSE service (for the Aalto community) can help you manage any of the things we discuss today, free and short consultations welcome: https://scicomp.aalto.fi/rse/ . Come to our daily garage any day at 13:00.

Feedback

Note: registered participants will also receive a form for anonymous feedback. It helps us a lot to receive your feedback, because we can make future versions of this course even better! <3

Did you feel the course was engaging, even though it was online:

  • better than in-person: oooooooo
  • same as in-person: ooo
  • worse than in-person: o

I would have:

  • attended hybrid in a lecture room: o
  • online-only was good enough: oooooooo
  • preferred lecturers to be in lecture hall: o

One good thing about the course:

  • you didn't make us feel stupid about our stupid questions
  • the fact that it can be done remotely
  • I really liked that this was divided to several shorter days, and not one or two long days. It’s easier to concentrate this way and get more out of each part.
  • Getting to use the system during the lectures
  • You guys are fantastic! Thank you!

One thing to improve for next time:

  • One longer break would be nice
  • .

Favorite lesson of the course:

  • Serial jobs
  • .

Lesson that needs most improvement:

  • .
  • .

Lesson that could be added:

  • .
  • .

Thank you and follow us on Twitter for more news and interesting links! https://twitter.com/SciCompAalto



