Introduction to Scientific Computing and HPC / Summer 2025 (ARCHIVE)
Info and important links
Day 2
Icebreakers
Programming languages you use? (poll, add o to vote):
- python: oooooooooooooooooooooooooo
- R: oooooooo
- matlab: oooooooo0ooo
- C: ooooo
- shell: oooooo
- C++: ooooooo+1
- java: oo
- javascript: o
- Scala: oooo
- Julia: o
- SML: o
- FORTRAN: ooo
- Ruby: o
- Basic: +1
How long have you used computers? What was your first?
- Commodore 64, around 1983-84 :)
- Some Windows 98 PC in the early 2000's. Intel C2D E8500 & Radeon HD 4870 was the dream setup when I was looking into building my first PC.
- Windows 98
- Intel 80386 with MS-DOS
- That windows with the green hill bg, early 2000s (my parents owned it).
- The first we had at home was the first MacBook in 2006. Before that I had used a PC at my kindergarten :D
- Around 2007, intel p4 with xubuntu and windows xp
- HP Probook, 2019 :)
- Lenovo Yoga, 2015-ish
- 2019 Lenovo
- Commodore 128
- Probably some old Dell around 2012
- 1989 - ATARI 800 xl - coding on Basics :-D
- VIC20 <3… loved basic
- Celeron, running windows 3.1
- 2012, a Mac
- Some Desktop in preliminary school that was brand new when it was bought 20 years ago
- Windows 98 (Early 2000s)
- Windows 98 - 2006
Favorite ice cream?:
- Italian gelato: pistacchio or coffee (or both at the same time?) +3
- Just simple vanilla the best! +1+1
- Vanilla with nougat pieces
- Raspberry-lemon or just lemon/raspberry sorbet
- Salmiakki / salty licorice
- Mint chocolate chip +1
- Stracciatella
- swiss chocolate
- Ben&jerry
- vanilla with a salted caramel swirl
- Stracciatella & Yogurt
- Anything Italian
- Malaga = rum, raisins, vanilla ice cream
Follow-up questions from day 1:
-
Are we going to continue working on gutenberg fiction?
-
I tried to do the exercises in Triton but I have this error:
Same problem with srun -> command not found
- That's weird. Are you sure you were connected to Triton when you tried it? If you were, I would recommend joining the zoom.
Yes it is weird; I connect via kosh -> ssh
- Something could have messed up your PATH variable, but I couldn't say what would cause that.
-
I have a question not related to the course, I am wondering if it's ok to ask or maybe send you a mail about it
- Go ahead if it's appropriate (code of conduct and all)
- This question looks a bit too complicated for us to answer here. If you are at Aalto, you could join our garage sessions at some point?
- I am in Oulu, would it be possible to send you a mail about it?
- I'm afraid our RSE services / garage are for people at Aalto, I don't know if Oulu has something similar?
- Have no idea
- Yes, so I am using FEniCS to do calculations and there's a predefined function assemble() to calculate the integrals. The thing is, I'm using FEniCS (2019 version) to compute a Fourier-type integral of a function q with two formulas that should give the same result, but they don't.
- Rough guess for the cause would be that some of the calculations are not entirely numerically stable (for example they give very small results that could be affected by floating point errors). Then these would be handled slightly differently based on the version. That is a complete shot in the dark though.
- I see, you're probably right because for a small xi value (a parameter) the error is smaller, but thank you.
```python
import numpy as np
from dolfin import *

xi_val = [2, 1]
zeta_val = [2, 1]
x0_val = 0.0
y0_val = 0.0
d_val = 0.1

def V_0(x, xi, zeta):
    return np.exp(-(zeta[0] + xi[0]*1j)*x[0] - (zeta[1] + xi[1]*1j)*x[1])

def V_1(x, xi, zeta):
    return np.exp(((zeta[0] - xi[0]*1j)*x[0] + (zeta[1] - xi[1]*1j)*x[1])/2)

def exp_Fourier(x, xi):
    return np.exp(-2*1j*(xi[0]*x[0] + xi[1]*x[1]))

def phi(x, d, x0, y0):
    norm = np.sqrt((x[0]-x0)**2 + (x[1]-y0)**2)
    if norm < d:
        return np.exp(1 + 1 / ((norm/d)**2 - 1))
    else:
        return 0.0

def q_function_epsilon(x, d, x0, y0):
    return phi(x, d, x0, y0)

mesh = RectangleMesh(Point(-1, -1), Point(1, 1), 32, 32)
V = FunctionSpace(mesh, "CG", 3)

class PythonQExpression(UserExpression):
    def __init__(self, d, x0, y0, **kwargs):
        self.d = d
        self.x0 = x0
        self.y0 = y0
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = q_function_epsilon(x, self.d, self.x0, self.y0)
    def value_shape(self):
        return ()

q_expr = PythonQExpression(d=d_val, x0=x0_val, y0=y0_val, degree=2)
q_proj = interpolate(q_expr, V)

class PythonV0Expression(UserExpression):
    def __init__(self, xi, zeta, **kwargs):
        self.xi = xi
        self.zeta = zeta
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = V_0(x, self.xi, self.zeta)
    def value_shape(self):
        return ()

V0_expr = PythonV0Expression(xi=xi_val, zeta=zeta_val, degree=3)
V0_proj = interpolate(V0_expr, V)

class PythonVExpression(UserExpression):
    def __init__(self, xi, zeta, **kwargs):
        self.xi = xi
        self.zeta = zeta
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = V_1(x, self.xi, self.zeta)
    def value_shape(self):
        return ()

V_expr = PythonVExpression(xi=xi_val, zeta=zeta_val, degree=3)
V_proj = interpolate(V_expr, V)

class PythonQFExpression(UserExpression):
    def __init__(self, xi, **kwargs):
        self.xi = xi
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = exp_Fourier(x, self.xi)
    def value_shape(self):
        return ()

F_q_expr = PythonQFExpression(xi=xi_val, degree=3)
F_q_proj = interpolate(F_q_expr, V)

q_fourier_1 = assemble(V0_proj * V_proj * V_proj * q_proj * dx)
q_fourier_2 = assemble(F_q_proj * q_proj * dx)
print("q_fourier_1:", q_fourier_1)
print("q_fourier_2:", q_fourier_2)
```
Humans of Scientific Computing
Any questions to the instructors (about careers, etc)?:
-
How about the non-humans of scientific computing: do you use chatGPT or similar?
- It's a little helper, so yes when the output (= code) is easy to review.
- Mainly to avoid having to parse through documentation pages / to get examples in languages I'm not that familiar with
- Yes, for parsing documentation and for mindless template filling
- Yes, sometimes when I need quick code skeletons, then I start developing details that (at least until now) LLMs don't reach until you go into details in the prompt
- Sometimes. Usually when I need to reformulate a text I'm writing and cannot find the correct words, but also when I cannot find the correct syntax for the language of a framework I'm using.
-
What’s the most exciting project or task you've worked on in your role?
- Hosting+running+teaching the CodeRefinery workshop. Here the last workshop and its topics for those who are interested: https://coderefinery.github.io/2025-03-25-workshop/
- I don't have a single thing, but several projects where I clearly see that I could help the researchers save lots of their time.
- Collecting data from personal devices people carry around everywhere
- Mining Refactoring data from OSS projects and training LLMs for refactoring recommendation.
- Helping a data center for a major astronomy satellite with their data center setup.
-
What's the most important skill you look for when you hire RSEs?
- Attitude towards good coding practices and towards willingness to help people.
- Ability to help people and some public code examples (willingness to publish software)
- Many people have good technical skills, it's harder to find someone with good desire to help others/social skills. (technically, the real thing is being able to learn new things)
- I actually think it's not so much social skills and more about the will to help, and being supportive. Sure, completely hiding in your room not wanting to interact with others wouldn't help.
- At least to me it's all about willingness/social skills to work in a team, be able to explain your work, and listen to better approaches.
-
What are your recommendations for someone interested in neuroscience? What to study etc (in addition to the neuro courses university offers)?
- Reproducibility and how to enable it (version control, reproducible environment, avoiding questionable research practices like p-hacking, HARKing, etc.., registered reports or preregistrations)… well this is for any field of science in the end. :)
- Specifically for the neuroimaging field, get a good grasp of statistics!
- Terminology is subtly different between different imaging modalities/subfields, if you read books/papers be aware of what subfield the author is from.
-
What's the most common (and maybe easy to learn/fix) mistake that people working with HPCs make? And what's the most challenging part that many have a hard time to have a grasp of it?
- One hard thing to grasp is the workflow where you need to deal with waiting and be productive while your jobs are running. Usually with the queue system one needs to submit jobs and do something while waiting for the jobs to finish. This can feel strange as we're often used to doing something and seeing results of our work immediately. A common mistake and easy to fix thing might be to mix `bash` and `sbatch`. They are different things: `bash` is the terminal program while `sbatch` is Slurm's queue submission command.
- Testing before starting a large run. If you run a small test case before submitting a multi-day run, you can avoid a lot of extra wait time.
Conda
https://scicomp.aalto.fi/triton/apps/python-conda/
-
For CSC, check their Tykky page for information on how you can use conda with their containers.
- Yes! We teach generic conda here. Different clusters, especially the CSC-managed ones, have special instructions to make it scale better.
- Consider this an intro to understand those pages better.
-
Helsinki University Turso users: load mamba with the command `module load Mamba` (with a capital "M"). See `module spider Mamba` for more info.
-
to use mamba do I have to first run Triton?
- I don't understand the question. Do you mean if you need to be connected to Triton?
- Our demo is happening on Triton (where Mamba is pre-installed), but it can work on other computers if you install it first.
-
I get this error :
- What cluster are you working on?
- Lehmus, i don't think i have this module
- Some version of it probably exists, but it could be called something else. You would need to ask your cluster admins / look at your documentation for specifics.
- Check the `anaconda` and `conda` modules with `module spider MODULENAME`. Those could be ready-made environments, or one of them could be meant for building your own environment.
-
Can you clean the terminal so we can see the commands on top? +1
-
i mean how can we use mamba?
-
Is UV a good replacement for Conda? At least for local software management it's great. How about HPCs? (should be the same technically)
- We don't have much experience with uv yet, but as I understand it's essentially a better replacement for pip. Conda's biggest benefit is that you can get a lot of "system" libraries in your environment as well, which might not be as convenient to get on a cluster otherwise (behind modules etc.)
- So they both have their pros and cons essentially.
- Another benefit to conda is that most of the packages are precompiled binaries, which is nice on a heterogeneous cluster (nodes with different architectures). Otherwise you might need to do a lot of work to get things working on all nodes at the same time.
-
I didn't know about the `mamba run` thing - how new is it?
-
How do I exit the environment tab.
- You can deactivate the environment with `source deactivate`
-
Is conda like python3 venv? Where you can install various modules/dependencies?
- conda & mamba are package managers that can install various dependencies into an environment. It is often used to install python environments.
- Broadly yes, they solve similar problems. conda was made because virtualenvs didn't handle complex compiled dependencies well (often needed for science)
-
I think that in CSC conda is now deprecated?
- See the answer at the top of this section. CSC uses Tykky.
-
Why do we create environments? Can't we just work directly and install the packages needed?
- In that situation you can only have a single set of packages that are active all the time. So if you have multiple projects that need different packages you will quickly encounter problems setting up a single installation that solves needs of all projects. In addition, some of the dependencies can be really hard to install. So using tools that allow you to install complex dependencies helps a lot.
- you could re-install conda/mamba multiple times to get different base envs… but those are environments anyway. So may as well share the overhead.
- Thank you, it's clear now.
-
How does it work under the hood? What's the difference between Mamba/Conda on a local system versus on an HPC?
- No real difference. Besides that mamba is usually provided as a module, instead of being installed directly.
- Under the hood, Conda and Mamba solve the environment based on the requirements you have specified in the `environment.yml`. Then they download pre-built packages that are extracted to a common package cache. Then an environment is created by creating a folder structure and linking the relevant packages from the common package cache. Conda and Mamba use the package cache so that they can re-use already downloaded packages. But the tools work the same on local machines and on HPC machines.
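- For reference, such a file is just a short YAML description of what you want in the environment. A minimal sketch of creating and using one (the name and package list below are placeholders, not the exact file from the course notes):
```bash
# Write a tiny environment file (contents are illustrative only)
cat > environment.yml << 'EOF'
name: my-analysis
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy
  - pandas
EOF

mamba env create --file environment.yml   # solve and build the environment
source activate my-analysis               # activate it, as done in the demo
```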
-
While running 'mamba env create' I got an error 'libmamba YAML spec file environment.yml not found' and 'critical libmamba File not found. Aborting'
- Did you create an `environment.yml` file yet? `mamba env create` always assumes you have a yml file, and with no arguments expects that it is in your current directory and is called `environment.yml`.
- Ah, I created environment called 'tidyverse.yml', but followed the scicomp tutorial which had the name 'environment.yml'. Problem solved
- You can do `mamba env create -f filename` to create an env from a file with a different name.
-
If mamba doesn't exist in the cluster can we just replace it by conda right ?
- Yes, mamba is just a drop-in replacement for conda that does same things. (It's just faster since its written in C, whereas conda is in python iirc.)
- thanks
- Interestingly, you can use conda to install mamba
- How?
- You can create an environment like this:
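- A minimal sketch of what that could look like (assuming the conda-forge channel is available; the environment name here is just a placeholder):
```bash
# Use conda to make a small environment whose only purpose is to provide mamba
conda create -n mamba-env -c conda-forge mamba
source activate mamba-env
mamba --version   # mamba is now available and can build your other environments
```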
-
I am curious how your file directory looks like. Where will you have the conda env files except the R and yml file that we are seeing on the screen?
- Conda and mamba install the environments to an environment directory. By default it goes to your home directory, `~/.conda/envs`, but at the start of the session we specified that we wanted the conda environments to go to `$WRKDIR/.conda_envs/`. This was done during the first-time setup. Usually you do not need to know where the environment is. Once it is activated, programs from the environment become available from there.
- Okay. So, let's say that each of my environment has same python3 version. So each environment will then install it separately and have equal number of copies of that?
- The environments share a common package cache where the packages will be downloaded and packages in the environment point to those packages. So if you have the same python version in all environments you won't be using that much more disk space. However, you will create more files (links from the environment to the package cache).
- Thank you!
-
What are the channels?
- Channels are essentially repositories for packages. Nvidia has packages managed by Nvidia, and conda-forge is an open-source channel containing a large variety of packages.
-
Why can't I activate the environment? says 'critical libmamba Shell not initialized'
- Do `source activate ENVIRONMENT_NAME`. Using `conda activate` requires running conda's initialization script, but this comes with a bunch of other undesirable side effects on a cluster.
-
Should I always specify the exact package versions to guarantee that it is reproducible? Or is it generally safe to leave the version unspecified?
- Not specifying the version means downloading the latest available version. Usually this is safe, but if you need a specific one, it's better to mention the version.
- You can have a version of the environment that has the locked versions for e.g. showing exact versions that were used when doing a paper, but usually keeping the version numbers loose makes it easier to maintain the code.
- For a published version of my software or data, I would specify exact versions. When developing, I keep to the latest unless there is a reason to change (like the reason we set python to 3.12, it would have failed with 3.13.)
- Is there some mamba command I can use to export the exact versions of an env once I am ready to publish (i.e. update the yaml with the current versions)?
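- One way (a sketch, not something shown in the session): `conda env export` (mamba has the equivalent command) prints the solved environment with exact versions, which you can save alongside your loose `environment.yml`:
```bash
source activate my-env                      # the environment to snapshot (placeholder name)
conda env export > environment-lock.yml     # exact versions, including build strings
# --no-builds gives a slightly more portable file without the build strings
conda env export --no-builds > environment-lock.yml
```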
-
What was the command that we used to override for Cuda?
- `export CONDA_OVERRIDE_CUDA=cuda_version`. This sets an environment variable that tells conda that this version of cuda will be available.
-
I am getting the following error when trying to create the mamba env: critical libmamba filesystem error: cannot create directories: Read-only file system [/appl/scibuilder-mamba/aalto-rhel9/prod/software/mamba/2025.1/f67be15/envs/pytorch-env/conda-meta]
- It sounds like `mamba` is trying to install the environment in the installation directory of the mamba module, which can't be modified by users. Not sure why though, what commands did you run?
-
I get the error :
error libmamba Could not solve for environment specs The following package could not be installed └─ pytorch-gpu >=2.6,<2.7 * is not installable because it requires └─ pytorch [==2.6.0 cuda*_generic*200|==2.6.0 cuda*_generic*201|...|==2.6.0 cuda*_mkl*304], which requires └─ __cuda =* *, which is missing on the system.
- my yml file is this
- I used this: `mamba env create --file pytorch.yml`
I am not able to get out of this environment, I mean the page in the terminal. What should I type? ctrl 0?? ctrl X?? +1
- What does it look like? Is it asking if you want to install the environment?
- In general Ctrl + c will exit almost any command, if you are lost.
- yes this is what i wanted. thanks. but one more question, does it save though??
- You need to run `export CONDA_OVERRIDE_CUDA=cuda_version` before the mamba command.
- I get a new error, sorry having trouble formatting
- The reason for the error is that the login node does not have any GPUs, so mamba does not find any. We need to tell it we have some anyway.
- If i want to actually use some GPUs how would i activate them?
- You need to ask for GPUs in your batch script. Then you activate the environment with `source activate env-name` in the batch script.
- can you show an example please :D, not sure i get batch script fully
- A batch script is a series of commands you want the compute nodes to run. It also tells what resources you need for the command. The serial job tutorial has some examples. Later we will have a section about GPUs.
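- As a rough sketch (resource numbers, the environment name and the script are placeholders; GPU options and partition names vary between clusters):
```bash
#!/bin/bash -l
#SBATCH --time=00:30:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2
#SBATCH --gpus=1                  # ask Slurm for one GPU
#SBATCH --output=gpu-job_%j.out

module load mamba                 # or whatever provides conda/mamba on your cluster
source activate pytorch-env      # the environment created above
python3 my_training_script.py    # placeholder for your own GPU code
```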
-
what how do i resolve this error above?
- Mamba does not find a GPU on the system. This is probably because the login node does not have a GPU. You can override it with `export CONDA_OVERRIDE_CUDA=12.6` (or replace 12.6 with another CUDA version)
- This environment will work on the compute nodes that have a GPU
- Thank you that worked
-
CSC does not seem to support use of conda environments on the parallel filesystem anymore.
-
What is your policy on anaconda packages and Anaconda license?
- We strongly recommend not using defaults channel because of licensing issues. For now at least, conda-forge packages don't have licensing issues.
-
what's this error for $ source activate conda-example ? EnvironmentNameNotFound: Could not find conda environment: conda-example (helsinki cluster )
- Did you successfully run the command `mamba env create --file environment.yml` before you got the error?
- The name of the environment could be different in the file.
- actually it doesn't load Mamba.
- `module load Mamba` should work in Turso.
- Capitalization matters. So for Helsinki, is it `Mamba` and not `mamba`?
- module is loaded. thanks. just about the environment.yml
- Do you have conda? `conda --version`
- yes, 23.11.0
- You can try `conda env create --file environment.yml`, it should do the same thing.
- thank you. but error: EnvironmentFileNotFound: ' ~/environment.yml' file not found
- Did you create an environment file? The file name could be different from `~/environment.yml`. (The `~/` part refers to your home folder. If the file is in the current folder, remove that part.)
- I checked that the pkg and env are ok. but same error. did i miss something
- What is the name set inside the environment file, in the `name: something` field? That is the name you have to use to activate the environment.
- Which error was it? The `EnvironmentFileNotFound` means it does not find the .yml file and cannot create the environment. What files are in the current folder? (You can check with `ls`)
- gutenberg-fiction hpc-examples ondemand poem.txt workdir. I cannot find the environmentfile
- Indeed, no environment file there. Please create one using a text editor such as `nano`, so `nano environment.yml`, and then copy-and-paste the yml file content from the course notes, save the file (Ctrl-O) and then exit nano (Ctrl-X).
- this I solved with text editor. thank you
-
can you send the previous commands somewhere? I got lost somewhere
- thanks!
- Can the ones used in the very beginning be found somewhere?
-
what does the -l in the first line of the .sh do?
- The `-l` flag for bash tells it to run as a login shell. It doesn't make a huge difference, but it causes bash to run some profile files again etc.
- In practice on most slurm clusters it should make no difference, since by default slurm will export all environment variables you had at submission time.
-
just an aside, how do you keep track of all the commands you type on terminal in another screen (as visible on the top right of the screen)?
- We use prompt-log, which we wrote specifically for these workshops.
- That's cool. And thanks!
- All shells have a built-in history which is useful for seeing your own history.
-
Does each job have 1GB of memory allocated or 100mb?
- Each has 1GB. Array jobs essentially launch copies of the script, however many times you specified. Only difference between jobs is the value of the `SLURM_ARRAY_TASK_ID` environment variable. (And potentially the node and cpus it is running on etc.)
- And yes, this is a good question to answer: how should resource requests be modified for arrays?
-
Can you break down the whole python 3 command to the end ?
- what does this mean?
- The python command in the batch script?
- They are doing it. Nevermind
-
--step=10 what is that?
- It was an argument specific to that python script, that tells it to process every 10th book. So it starts at the value given with `--start`, then next it does `start + step` and so on. Since we had an array of 10 jobs, this means all of the books will be processed and there is no overlap between processes.
- count.py had these options specifically added to make it suitable for array jobs.
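- Schematically, the submission script does something like this (a sketch only, not the literal count-ngrams-2-array.sh; the remaining arguments to count.py are omitted):
```bash
#!/bin/bash -l
#SBATCH --time=00:30:00
#SBATCH --mem=1G
#SBATCH --array=0-9      # ten tasks, SLURM_ARRAY_TASK_ID = 0..9

# Each task starts at its own offset and jumps in steps of 10,
# so together the ten tasks cover every book with no overlap.
python3 ngrams/count.py --start=$SLURM_ARRAY_TASK_ID --step=10  # plus the input/output arguments used in the demo
```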
-
I was wondering, one could also write a bash script to split these 100 subtasks into 10 batches. I have seen people using srun to split the resources. Where can an array job be beneficial then?
- One difference is that if you have a single sbatch script with multiple srun calls, the whole thing needs to sit in the queue until it can get all the resources. Each step of the array job can queue independently and it will be easier to find 10 nodes with 10 free cpus compared to one with 100 for example.
- In addition I think the scheduler can handle array jobs a bit better, compared to same number of independent jobs. (If you were to do it the other way around)
- Got it! Thanks!
-
I followed along with what you did, however the output files are all named slurm-NUMBERS-0.out, and every single one shows 'No such file or directory when opened'
- Slurm creates these by default for each task. These are different from the files created by the python command.
- The error seems to be that the Gutenberg files are missing. Are they in the current folder?
- They should be, but I keep getting the same error –> Solved
-
Let’s say I have two hyperparameters, x1 and x2, both ranging from 0 to 9. I want to run all the different combinations of them, basically the Cartesian product x1 * x2. I can run my script like this:
python my_script.py x1 x2
What’s the best practice to manage and run all these different hyperparameter combinations? or maybe I can have more than only x1 and x2, what if I have more than 2 hyperparams to sweep?
- You could modify the Python program to take one argument and do the mapping internally (or maybe more like a separate python program to do that). But Bash can also do a 1D -> 2D mapping, I think we have an example in the docs.
- If you are asking this question I think you can figure out some ways to do it - you are on the right track!
- Relatively simple way would be to have an array of 100 jobs, then use `floor(SLURM_ARRAY_TASK_ID/10)` as `x1` and `SLURM_ARRAY_TASK_ID mod 10` as `x2`. But you could use whatever kind of logic you want.
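- As a concrete sketch of that mapping inside a batch script (assuming x1 and x2 both run 0..9, so 100 combinations):
```bash
#!/bin/bash -l
#SBATCH --array=0-99          # one array task per (x1, x2) combination

x1=$(( SLURM_ARRAY_TASK_ID / 10 ))   # integer division gives 0..9
x2=$(( SLURM_ARRAY_TASK_ID % 10 ))   # remainder gives 0..9
python my_script.py $x1 $x2
```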
-
if I want to delete the files now, how can I do it, since it looks so messed up?
- The `rm` command will delete files - check before pushing enter, since there is no undelete! If you are unsure, run `ls` with the arguments first to verify.
- You can use the `*` glob like `rm ngrams-array_*.out` to delete them in batch. It helps to name files consistently so it's easy to list and delete like this
- Thanks very much!
- It does not work to delete them all together, only individually:
- `rm ngrams-array_*.out`
- `rm ngrams-2-array-0.out` (this one works)
-
After using the command "less file_name.out" to view an output file, is there a way to get out of it? For example when I try to view the non-array output I go into this huge list of results from that output from which I can't find a way to get out.
- You can exit less by pressing `q`.
- If it says "calculating line numbers", a "control-c" will stop that, and then `q` works.
-
what is the output I should get? How do I know if it succeeded?
- The job is writing a log file (usually in the location where you submitted it). You can investigate the log file and see if it worked.
Array exercises (until xx:55)
Try to do what we have done.
Here's the submission script `count-ngrams-2-array.sh`:
Submit with `sbatch count-ngrams-2-array.sh`.
Combine data with `python3 ngrams/combine-counts.py ngrams-2-array*.out -o ngrams-2-array.out`
Try more advanced array examples here: https://scicomp.aalto.fi/triton/tut/array/#hardcoding-arguments-in-the-batch-script
I'm
- Done: ooooooooooooooooooo
- Not trying: ooo
- Having problems: o
-
Note on best practices: Running a huge number of short jobs is worse for the cluster than a smaller number of longer jobs. So if you have an array of 1000 jobs that each take 10 minutes, it's better to rewrite it as an array of 100 jobs each taking 100 minutes.
- But this doesn't matter for smaller arrays.
-
Which one consumes more credits with regard to queue priority: 1000 jobs each taking 10 mins, or 100 jobs each taking 100 mins?
- I am not sure, but 1000 jobs will have 1000 wait times in the queue (if I had to queue for a long time, then I'd rather keep my turn for longer times). But then you can run in theory 1000 jobs in parallel so it would be faster..
- You will usually get the maximum throughput (maximum amount of things done) when you have a medium sized job parallelized with array. It is much easier for the queue to fit a medium sized job than a very large one. Both should have a similar effect on the priority. You can think about the box of "CPU x RAM x time". If you do an array job you just divide a box of volume "CPU x RAM x long time" into smaller boxes "N array jobs x CPU x RAM x long time / N array jobs". Priority is calculated based on used resources.
- Thank you for answering! I had the experience before about 1000 jobs with each taking 3-4 mins. Then I divided it into 100 jobs with each taking around 25 mins. The results is 100 jobs run much faster due to the shorter queue time and also the starting time.
- If you have a lot of small jobs the startup can also take a meaningful amount of the overall job runtime. When programs start they need to fetch libraries, executables, data etc. that can take some time and if you do the same thing multiple times in the same job this data fetching is kept in the computer's memory and it does not need to fetch everything again. This is especially true if your program uses a common starting dataset and the only difference between runs is a parameter / seed change.
- Yes! That is true. I did the parameter sweeping. Thanks for your explanation.
-
i got this error on mahti,
sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
- You might be requesting resources that are not compatible OR you might be missing your CSC project ID. If you can paste the code, we can help more. (Here is the Project Number: 2014489)
- CSC always needs an `--account` parameter
-
I am trying to run this pi.py array task but I am having an issue with the batch script somehow. It cannot find the slurm/pi.py file
- are you in the hpc-examples repository?
-
nvm I found the error
-
Is there a cheat sheet for all the slurm keywords somewhere?
-
What did I miss?
- [user@login4 hpc-examples]$ sbatch pi-array.sh
sbatch: error: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
sbatch: error: Batch job submission failed: Invalid job array specification
- Can you paste the script that you are submitting? Maybe an error in the SBATCH directives?
-
^^^ There is a dot (.) instead of a dash (-) to define the range of IDs. It should be #SBATCH --array=0-4
https://slurm.schedmd.com/job_array.html
-
Is there a way of knowing how long a specific batch job will take, for example is it on the order of tens of minutes or tens of hours. Or would you just have to know how fast your own program is?
- It depends on many parameters (e.g. which node will process the job, is there lots of I/O), but as you wrote, a decent approximation is to run one case you have, and then scale "almost" linearly (e.g. 100 iterations of a pipeline, try running one iteration and then multiply times 120 for a worst case scenario)
-
Could there be something wrong with this? Because when I run it, and I open up the file I don't see a result.
I essentially see this:
- Which file is showing that? pi_$SEED.json? Could it be that it goes out of memory and gets killed?
- When I run this command after running the sbatch,
all of these files show the output above.
- Do you have any pi_$SEED.json files in the same folder? The script has a ">" which means "redirect the output to a file …json": those are the results of the computation. The .out is just a log (what you would see in the terminal if you were running it interactively).
- I do have it, I think at least
- Then look at the json files, those are the results.
- I only see this.
- You also have pi_123.json if the bit above is the content of your folder :)
- I do, thanks for holding my hand through this, very dumb of me.
Continue with some array examples:
Monitoring
https://scicomp.aalto.fi/triton/tut/monitoring/
Triton only: we made these jobs to test:
seff 7897826 (single CPU job)
seff 7897849 (multi CPU job)
seff 5246490 (for GPU job)
-
How long does the HPC system keep job history? For example, when I run `slurm history`, what is the earliest job date it will show?
- Triton: I don't think we have a delete policy, but we do prune the database for efficiency, so I would expect that beyond 1 year in the past we cannot guarantee that the history is there.
- By default `slurm history` shows the last two weeks I think, but you can give it arguments to show older jobs. For how old, see the message above.
-
Considering that we can have profilers as part of our code, which can be very detailed, how can we benefit from this short summary such as `seff`? I assume it's just there as a very easy way of knowing info.
- If you have an existing profiling solution those might indeed work better for your use case. The information stored in the slurm database is usually a good first step if you do not have such tools available to you.
- Biggest issue with fancier profilers is that they usually cause far greater overhead and are far more complicated to use. So it's just a question of how heavy-duty tool you need.
-
Is there a way to save all this cpu and mem usage info in the slurm-JOBID.out file produced at the end automatically? i.e. without having to manually check it for each job.
- What about adding `seff $SLURM_JOB_ID` as the last command inside sbatch_script.sh?
-
Seff won't give you reliable information if the job is still running. So eventually, at the end, you run a script that appends the output of seff to each log.
-
would this work, viz. seff is run last ?
-
I am not sure which tasks are going to be checked, but here you are still checking with seff the jobid that is currently running, so it won't give reliable info. One could just have a final bash script that looks at all the slurm-JOBID.out files and then runs seff (very fast to run) and appends to the slurm-JOBID.out.
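- A rough sketch of such a clean-up script (assumes the default slurm-JOBID.out naming and that the jobs have already finished):
```bash
#!/bin/bash
# Append the seff summary of each finished job to its own log file
for log in slurm-*.out; do
    jobid=${log#slurm-}      # strip the "slurm-" prefix ...
    jobid=${jobid%.out}      # ... and the ".out" suffix, leaving the job ID
    seff "$jobid" >> "$log"
done
```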
-
Good profilers:
- https://github.com/plasma-umass/scalene : Profiler for python code that does not require any changes to the program
Applications
https://scicomp.aalto.fi/triton/tut/applications/
-
Is python 2 supported in Triton? What to do if I want to run some old python 2 code?
- Conda is your best friend: create a conda environment with a specific version of python2 and you can run python2 software.
-
what is docker? I heard this term thrown around before…
-
What is the name of the cat??
- Codename is CATS
- Cute cat!
- Lovely cat
- Bonus if you know the origin of the name
- Computational Animal Troublemaker and Support?
- Cute Animal That's a Scientist?
-
Does CATS know how to use HPC?
- In the HPC-kitchen metaphor CATS is a data management specialist (eating all the misplaced food)
- So efficient waste management software?
- Garbage collector.
- yes, also a chaos monkey to improve our resilience
- CATS appears in many of the HPC-kitchen videos if you need more motivation to watch them
Responsible Computational Research
https://scicomp.aalto.fi/scicomp/rcr-scicomp/
-
What I have noticed is that some cherry-pick the "random" seed (not so random) to show stable, reproducible training.
-
ALLEA - European Code of Conduct for Research Integrity: the link is not working for me.
- I'll fix, it will probably update in a minute or so
-
The link to the ALLEA book doesn't work. I clicked on it, but the page doesn't exist. I mean the first link in the section "Level 1: Foundational ethical principles in research"
-
Can you comment about different levels of "goodness" for different projects - does every project take the same practices?
- Could you specify a bit more what kind of comment you are looking for? What we think is needed for different kinds of projects? Or what are you looking for?
- I'm not quite sure… are the recommendations for a masters thesis the same as for a large study? etc?
- in general: yes. it's just that the scale is a lot smaller, so a lot of administrative stuff should be done not by the person doing the thesis but by those providing the thesis opportunity.
- In some cases (e.g. EU horizon big projects) there are many administrative tasks that a master thesis (which could be part of the same project) does not have to.
-
Haven't there been conversations lately about younger programmers actually not knowing how to program, since many depend so much on, for example, Copilot?
- Could be! And in general some studies have shown dependency, anxiety, burnout related to genAI use… (there is one review linked in the page)
-
what does it mean 'there is no cloud computing'? What is cloud computing? Is this what I do when interacting with chatGPT to code?
- It's anything that you run "on the internet", but what's meant with "there is no cloud computing" is that you should not assume that this cloud is something that isn't controlled by someone else.
- Yes ChatGPT runs "somewhere" (= cloud = most likely some machine room in USA) and your uni/organisation cannot guarantee that they will protect your work, even though OpenAI/ChatGPT promised you to not leak your data. Maybe there are no risks with your code/data, beyond your own professional risks of being scooped.
- I heard that if we are using AI resources then we should use Aalto AI instead of ChatGPT, because it's apparently safer. But isn't it still running ChatGPT so it still has vulnerabilities?
- Indeed at Aalto (and Helsinki Uni has something similar) we have an interface to the GPT models so that they are run in Europe, in a Microsoft Azure data center in Sweden. A bit more data protection, but of course the probability of something going wrong will never be exactly zero (so don't paste your dataset there if it's secret data :)). This is only the "C" of the CIA triad (Confidentiality); there are other issues and vulnerabilities related to how LLMs work.
- Still someone else's computer and US can request it with/without a warrant?
- Maybe? :) How much do you trust Microsoft? :)
Break until xx:12
- Then "parallel", the last session of the day. We cover two things,
- Shared-memory and multiprocessing parallelism (relatively commonly used for single-node stuff)
- MPI parallelism (for the biggest codes)
Parallel - shared memory
https://scicomp.aalto.fi/triton/tut/parallel-shared/
-
.
-
What was `--pty` again? +2
- Launches the job in pseudo-terminal mode. Usually it's used to get an interactive terminal on a compute node with the combination `--pty bash`.
- You can compare yourself with `srun bash` and `srun --pty bash`.
-
How many CPUs does srun use by default? Is there a max or min?
- All of these can be set by admins, but generally the default is 1 cpu and there isn't an explicit max for CPUs (at least one you would realistically hit).
-
What is the difference between --cpus-per-task=4 and --nproc=4? Why are they both 4?
- `--nproc` is the argument specific to that code, that tells it how many cpus to use. `--cpus-per-task` is the argument for slurm that tells how many cpus you should reserve per task. (By default you have one task.)
- So they are both 4 because we wanted to reserve 4 cpus and then tell the code to use all 4.
-
what is the limit of CPUs that we can ask the program to use?
- This is going to depend on the program. Often there is no limit (assuming the program knows how to use multiple CPUs in the first place), but you might see diminishing returns with a huge number of cpus. In addition you need fancier methods to parallelize across nodes, so without those you are limited to the number of cpus on a single node.
- There is an exercise "Parallelism-1: Test Scaling" that discusses this some: look at the solution, it shows the performance decreasing as you use more and more processors.
-
I am using Puhti, I set cpus-per-task=4 but when I checked seff after the job was finished 'Cores per node: 8' In the interactive sessions it says 2 threads per core will be allocated, does it have something to do with this and what does it mean?
- In some clusters the computers have hyperthreading enabled, which allows for multiple threads to run on a single CPU. So "Core" here refers to one of these threads, so "CPUS per task * threads per CPU = Cores". So if you request 4 CPUs, you get 2 threads and thus 8 "cores". It depends on a program whether they can use hyperthreading efficiently so the number you should tell the program can be either 4 or 8.
- This exact behaviour is also going to depend on the cluster. Some clusters will treat each hyperthreaded logical core as a "CPU", in which case you would get 2 physical cores with 4 threads and logical cores.
-
How can we set the number of CPUs in the bash script? I did not really get how it automatically worked
- Slurm automatically sets various environment variables, including `$SLURM_CPUS_PER_TASK` (which should be the same as `--cpus-per-task`), which can be useful to keep the info in only one place.
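- A small sketch (the program name and its --nproc flag stand in for whatever your own code uses to set its worker count):
```bash
#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=4

# Slurm exports SLURM_CPUS_PER_TASK into the job, so the CPU count
# only needs to be written in one place (the #SBATCH line above).
srun python3 my_program.py --nproc=$SLURM_CPUS_PER_TASK
```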
-
How to parallelize cpu across nodes?
- We'll talk about MPI parallelization next, which does just that. There are some other libraries as well.
-
Why does the CPU efficiency decrease as you use more cpus?
- Not every operation can be parallelized. Basically every program has parts that need to be done sequentially, and in those cases rest of the CPUs are just waiting.
-
Why is this section called 'Parallel shared memory' if we just allocate more cpus to work in parallel? Where is the shared memory?
- The parallelization is done using shared memory paradigm. Each thread has access to shared memory pool and uses that to communicate. We will go over another version of parallelism that doesn't require shared memory later (MPI).
-
What is the best way of implementing this in our own code? Just using multithreading? What is the optimal number of threads in the cluster? Usually 8? What about multithreading?
- Often programs will get diminishing returns after 8 or 16 threads or so, but it will depend on your code. The biggest factor is how large a portion of your code can actually be parallelized. If a lot of it still has to run sequentially, you will notice diminishing returns because of the overhead from initializing the threads, among other things.
-
Is multithreading then the best way to go about it?
- It depends. Multithreading is often easier than MPI and has different pros and cons. Often the most efficient method is a combination of both, e.g. 4 MPI processes each with 4 threads, using 16 cores in total. I think we quickly go over this later as well.
-
Why does slurm q only return this:
-
Why I got 0 CPU efficiency:
-
[user@login4 hpc-examples]$ slurm q
JOBID PARTITION NAME TIME START_TIME STATE NODELIST(REASON)
- Your job is no longer in the queue. It probably already ran (or failed.) You can check what happened to it with `slurm h` or `slurm history`.
-
if I can't see anything when writing slurm q, how do I get the job id?
- Your job probably already ran and is no longer in the queue. Try `slurm h` or `slurm history` instead. (Same command, the first one is a shorthand)
Exercise (we return at xx:35)
Exercise:
- Run the same commands we did
`run-pi-4core.sh`:
- There is more you can experiment with in the documentation pages.
I'm:
- Done: ooooooooooooooo
- Did not try:
- Having problems:
MPI parallelism
https://scicomp.aalto.fi/triton/tut/parallel-mpi/
-
does --ntasks=4 mean that we run 4 independent threads?
- By itself they are 4 completely independent processes, but MPI provides means for those tasks to communicate with each other and work as a collective.
-
I only have openmpi as a module
- That should work
- ok, isn't it another different module or way of doing parallelisation?
- No. OpenMPI is one implementation of the MPI standard. Another common implementation is MPICH.
-
Note about MPI: MPI is just a standard for implementing this communication between processes and there are multiple implementations. `openmpi` is a common one, but your cluster might use another one such as `mpich`.
-
What does --nodes mean? And --ntasks?
- `--nodes` tells slurm how many nodes (physical computers) you want to request. The relevant thing here is that all CPUs/cores on a node have access to the same shared memory, allowing you to use multithreading. With multiple nodes they can't all access the same shared memory.
- `--ntasks` tells slurm how many tasks you want. Then generally srun will launch that many copies of the process and initialize the MPI environment so they can communicate with each other.
- `--nodes`, `--ntasks`, and `--cpus-per-task` have a complex relationship and you can dig deep into this.
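- A minimal sketch of an MPI batch script (module and program names are placeholders; some clusters want `mpirun` or `srun --mpi=pmix` instead, as discussed below):
```bash
#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=2             # two physical computers
#SBATCH --ntasks=8            # eight MPI processes in total

module load openmpi           # provide an MPI implementation
srun ./my-mpi-program         # srun starts all 8 ranks and sets up MPI between them
```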
-
what is the diff between srun and sbatch?
- The most important difference is that srun starts the job interactively, whereas sbatch makes it run in the background.
-
Should I run pi-mpi.sh with srun then?
-
running the slurm script returns the following error:
- What cluster are you on? Some clusters do not have their MPI implementation built with Slurm support. In that case you need to launch the processes with `mpirun` inside your script to initialize the MPI environment.
- this was run in Turso
- I'd recommend checking the cluster's docs, but `mpirun` might work as well
- mpirun worked, thank you!
- We used `srun --mpi=pmix` and that also worked. Source
-
What do the first 2 lines in the exercise do?
- They load the MPI installation (at Aalto we use OpenMPI) and compile the C code using the MPI compiler into an executable.
-
I get this error: pi-mpi: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory while i have pi-mpi in the same folder
- Do you have an mpi module loaded?
- i loaded openmpi cause couldn't find mpi in the modules
- What cluster is this? It seems to be missing some libraries but I think those should be provided by the module
- Lehmus, but it's weird that the first command, the compilation, worked
- I would suggest asking your cluster admins. You could try the zoom room?
- yes what is the link? and are you connected after the course?
- You should have received the email. What university's cluster was Lehmus by the way? I think there were at least people from helsinki and oulu.
- And I think the room will stay open for a short moment after course at least, but might close soon if there are no questions.
- It's oulu's, ok i'll try to connect now, thank you.
Exercise until xx:55
- Run the same commands we did
`pi-mpi.sh`:
I'm:
- Done: ooooooooo
- Did not try: o
- Having problems:
Feedback day 2
News for day 2 / preparation for day 3
- We covered what was in the schedule
- There are more exercises/practice at the bottom of every page.
- The main sessions are about GPUs and LLMs
- Try to check GPU instructions on your cluster, it will be useful tomorrow.
- A wrap-up session where you can ask us anything
Today was (vote for all that apply):
- too fast: ooo
- too slow: o
- right speed: oooooooooooooooo
- too slow sometimes, too fast other times: oo
- too advanced: o
- too basic: o
- right level: oooooooooooooo
- I will use what I learned today: oooooooooooooooo
- I would recommend today to others: ooooooooooooooo
- I would not recommend today to others:
One good thing about today:
- Learned how to create environments and the use of them
- Array jobs was explained well +4
- CATS <3
- Covering many topics, especially that parallelism is not "magic", it's an investment worth spending time learning.
One thing to improve for next time:
- More exercises? +3
- The section where we did Conda felt a bit rushed, and working with an example that was not in the documentation was also a bit hard to follow. +1
- A little more involved exercises (not just copy-pasting commands) and more time to do those could enhance learning. On the other hand it is understandable that this occurs when one actually wants to use Triton etc. for their own purposes.
- I missed one thing during the Conda example and couldn't find it in the documentation, it was a bit confusing
- the conda part was kinda slow / lamely presented
Any other feedback?
- CATS was lovely +11
- Nice examples and mix of topics. I also liked the philosophical break.+2
- Asking anonymous questions and getting immediate answers shows the true power of online teaching! +8
- ..
General questions continued:
Day 3
Icebreakers, day 3
What is your favorite fruit, and why?:
- Orange; easily accessible, color, works for all season.
- Watermelon; Very fun experience eating them growing up, just holding a big slice and chomping on it :D
- watermelon
- Tomato; used everyday in cooking
- Nectarine
- Avocado; healthy and tasty
- apple. a pleasant eating experience
- "Grenade apple" (Pomegranate) +1
- Mandarins
- Mango
- all fruits are vegetables
What will you do about computing next week:
- I will probably create a cookie-cutter template repo for my HPC based repositories that contains majority of steps, such as data handling via DVC, standard repo structure, MLflow for experiment tracking and so on and the dot files and other configs for Triton. One repo to rule them all!
- I would like to see whether I can get started on using COMSOL on the cluster.
- I will try to get my supervisor's python library to the cluster from git, and then hopefully use that library to do cool examples!
- I would try setup some ML code, maybe just as a test run and see how it works
- run Hector
- database stuff
- finish my thesis
- IHC image analysis
Would you be willing to do a course on "What services are available at your university" ?
- yes: ooooooooo
- no:
- depends: ooooo
We do a lot of work making videos for the course. You think it is "worth it" if the videos take less than X hours per day. What is X? (https://www.youtube.com/playlist?list=PLZLVmS9rf3nNK5qWN6FdrQPHns4fNZyMX, https://github.com/coderefinery/video-processing)
- one: ooo
- two: oooooooo
- three: o
- four: o
- five:
- ten:
Any other questions/comments:
- Do you guys have a sort of "visit day" to see how a normal day goes?
- Come to the garage (that you'll see soon)
How to ask for help with (super)computers (slides)
Questions:
GPU Computing (docs)
Questions:
-
I heard of these GPUs: V100, A100, H100. What is the difference? Is there a common site for comparing GPUs benchmarks?
- Biggest difference between these three is just that A100 is newer than V100, H100 is newer than A100. The computing power also increases significantly for the newer ones. Otherwise there are some differences in the architecture, but that won't be relevant for most users.
-
When we request a GPU node, is the GPU on that node (or the node itself) shared with other users?
- You need to specifically request however many GPUs you want, just having a job on the node isn't enough. If you request n GPUs, then none of those GPUs will be shared by anyone else. Slurm doesn't support splitting a GPU like it does with individual CPU cores.
- Helsinki cluster (Turso) users: add the options `-M ukko` and `-p gpu` to the `srun` commands in the exercises for success.
- And the module you need in Turso is called `CUDA` with capital letters. So: `module load CUDA`, e.g. (omit version for latest), and for a shared gpu, add these options to `srun`:
-
What's the rule of thumb for requesting RAM and GPU? Sometimes, when I use a large dataset, I also need to request an appropriate amount of RAM
- How much RAM you need will depend on your program. RAM and GPU memory are separate, but you usually need some RAM so that the CPU can transfer the data to GPU efficiently.
- If you're doing deep learning quite often you'll need to transfer the model parameters and data to the GPU. So the RAM requirement will be at least the amount of memory needed by the parameters and the data. In most cases you'll need to do some data preprocessing on the CPUs in order to fully utilize the GPU, so you'll usually need more memory and multiple CPUs for that.
-
What is the difference between the GPUs that e.g. Facebook has and that Aalto/CSC has? Does Facebook just have more of the same GPUs that we use?
- Generally there wouldn't be a huge difference. Facebook would mostly just have a far greater number of them, and generally newer and shinier versions.
-
If I want to train a model on Triton and then run inference on a consumer GPU (RTX4090), will different architectures be a likely problem?
- You might need to recompile the code for the different architecture. But unless you have written some extremely low-level code that depends on specific arch, it shouldn't require more than that.
- If you're using common libraries such as PyTorch they have been compiled for many different GPU architectures, so you don't need to worry about that so much. But sometimes consumer GPUs do not have the same amount of calculation units for different data types (half-precision, single precision, double precision), so some calculations might not work or might be slow.
-
When we need more GPUs than a single node can provide, how do we request multiple GPUs across nodes? Are there any best practices or documentation you can point us to for doing this?
- As long as the version of Slurm is new enough, you can request `--gpus-per-node=X`. Your software will need to actually support multinode calculations however, and might need some additional setup.
- By supporting multinode calculations it's not just multi-GPU support, right? Any documentation I can check?
- Yes, the problem is similar to parallel CPU jobs, where you no longer have access to the same shared memory (RAM). So you need to use something like CUDA-aware MPI or similar. (Or whatever software you are using needs to be able to do that behind the scenes.)
- Also as a rule of thumb, if your job is so big that you need multiple nodes worth of GPUs, you might want to consider moving your calculations to LUMI.
- Yeah I've been using LUMI all these time and still need to scale more on my future project(s).
- Many training jobs utilize a library called NCCL (pronounced "nickel") or AMD's version of that (RCCL, "rickel"?). This makes it possible to transfer data from GPU to GPU directly through the fast interconnects. It can sometimes get a bit tricky on LUMI to get this working. The launching is often done via torchrun, which creates communication between the different jobs, but you usually need to tell the launcher how it should connect to other nodes. See e.g. LUMI's example on distributed training
-
What is a GPU kernel?
- In the context of GPU computing, a kernel is a small bit of code that is sent to the GPU to execute. Generally the paradigm is that the CPU offloads specific parts of the code as kernels to the GPU and then those are executed asynchronously.
-
Is there a n00bs training course for starting out with GPUs and GPU coding? I would love to learn more in my own time.
- You can probably find some from the CSC training portal, they often have stuff. I'm not sure what is at the right level: https://csc.fi/en/trainings/
- Aalto University has this academic course that goes into various parallel stuff. Lectures online. I'm not sure if it's at a good level for new people, though: https://ppc.cs.aalto.fi/
- In the spring we hosted CSC's Practical deep learning-course on Aalto premises. They might have a re-run in the autumn, but we'll probably host it again next spring. The materials are great for self learning as well.
-
is `gpu-debug` exclusive for triton only? do we have the same stuff or equivalent on CSC's HPC?
-
have this error:
- Are you not on the Triton cluster? Different clusters will have different test partition names, see above
- You can see a list of partitions with `sinfo`. Hopefully the names are somewhat self-explanatory.
- I managed to get a good partition name but now i have this error:
- From the error it sounds like that partition doesn't have gpus available. Check your cluster documentation.
-
What is gcc? Is that a … compiler? What's a compiler?
- Yes, a compiler, the main open-source compiler (but there are newer)
- For compiled languages (C, C++, Fortran, etc.) the source code first has to be transformed into compiled machine code. Then it can be run. Newer languages like Python don't need this (but they do internally transform the code before running it)
- So does the compiler do this transformation from source to machine code?
- yes
- Compilers are important for HPC, unfortunately they aren't really taught anymore for a general audience. But, most people use code made by others and can use it from Python/R/Matlab so it's easier these days
-
what are the throws?
- Do you mean in error messages? "Throwing" an error means that something has gone wrong, and usually the program exits and prints an error message. It is possible for another part of the program to "catch" and handle the error
- If you mean in the pi example, it is part of the simulation on how to calculate the pi.
- The simulation basically throws darts and determines pi from how many hit the dart board.
-
Hello, I cannot find ./pi-gpu in hpc-examples. Where is it?
- You need to compile it using the commands given in the exercise block
- And the name of the output file where `pi-gpu` was created depends on the option used in the compiler command, `-o pi-gpu`.
-
This is maybe a bit off-topic, but if I create a code, say, with pytorch, Do i need to determine the batch size there first, and then again when running the code?
- Usually you'll want to pick a batch size that fits into the GPU memory. If you have picked e.g. a batch size that needs more than 32GB of GPU memory, you have to limit yourself to GPUs with at least that much memory; alternatively, you can reduce the batch size and use 32GB GPUs as well. Parameters such as batch size are often kept in a configuration file or command-line option, so you can adapt them to where you're running without modifying the code (see the sketch below).
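As a sketch of that last point, one common pattern (a hypothetical example, not the course code) is to take the batch size as a command-line argument, so the same script runs unchanged on GPUs with different amounts of memory:

```python
# Hypothetical training script: the batch size comes from the command line,
# so it can be adjusted per cluster/GPU without editing the code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=32,
                    help="choose based on the available GPU memory")
args = parser.parse_args()

print(f"Training with batch size {args.batch_size}")
# loader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size)
```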
-
Why is the GPU utilized 0%?
- Either it really did no work, or the program ran too fast. Slurm's CPU and GPU efficiency stats are not very reliable for extremely short jobs.
- The utilization statistics are sampled, so they are only accurate for jobs that run for several minutes.
-
I did the exercise, but I don't understand what I actually did. I just typed the commands that were shown.
- The first line loads the modules needed to compile the code (gcc is a compiler, and the CUDA libraries are needed to use the GPU).
- The second line uses `nvcc` to compile the code. `nvcc` is essentially a compiler wrapper that takes care of setting up CUDA-related things. Most of the arguments for `nvcc` tell it to compile for multiple GPU architectures, so the code works on all the GPUs on Triton.
- The meanings of the different `-arch` and `-gencode` options are elaborated here.
- Then at the end, `-o pi-gpu` is the name you want for the compiled binary, and the last argument is the name of the source code file.
- What does the pi-gpu compiled binary mean?
- The compiler turns the `pi-gpu.cu` source code into the machine-code file `pi-gpu`. These compiled machine-code files are generally called binaries, since they are saved in binary format (they don't need to be human-readable anymore).
- All these terms might be unfamiliar if you are used to working with languages like python, which take care of all of this in the background during execution.
- The last line runs `pi-gpu` in the queue.
- The sbatch version then does mostly the same things, but inside an `sbatch` script instead. We skip the compilation step since the code was already compiled in the previous exercise.
-
GPU arch question (was deleted)
- This refers to the GPU arch that we talked about. It means that in the cluster you are using, that specific arch is not available.
-
If I want to run my own code, I am getting the error 'permission denied'. My call is: srun --pty --time=00:10:00 --mem=500M --gpus=1 -p gpu-debug ./Toss.py 1000000… Full error output: srun: error: dgx16: task 0: Exited with exit code 13 srun: Terminating StepId=7926631.0
Exercises xx:55
For your site the partition names and module names might differ. Check your site's documentation first.
Commands to run the demo:
Running nvidia-smi:
Running `pi-gpu`:
`pi-gpu.sh`:
Submit using sbatch pi-gpu.sh
I'm:
- Done: oooooooooooo
- Not trying: o
- Having problems:
LLMs on Triton
- Docs:
- Example repo:
- List of Gen-AI tools at Aalto:
-
Questions:
-
What is hugging-face?
- It's a platform that developers use to publish their open-source models.
- It also provides libraries for running these models easily.
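A rough sketch of what using the Hub looks like from Python (assuming the transformers package is installed; "gpt2" is just a small example model, not one from the course):

```python
# Download a model from the Hugging Face Hub and generate a bit of text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("High-performance computing is", max_new_tokens=20)[0]["generated_text"])
```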
-
I've heard of RTX GPUs for PCs. How do those GPUs differ from the ones on Triton?
- For the purposes of HPC use, they mostly differ in scale. Server GPUs are generally more powerful and can do more calculations.
- There are also some differences in architecture. Server GPUs are made for HPC use and don't work as well for rendering graphics live (iirc), which is the usual primary use of a consumer GPU.
-
Regarding speech-to-text models: do they (for example Whisper) support Finnish-language interviews and translate them to English?
- Yes! Finnish and Swedish are both supported.
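A minimal sketch of how that can look with the open-source openai-whisper package (the file name is a placeholder; on Triton you would run this inside a job with a GPU):

```python
# Transcribe a Finnish recording and translate it to English with Whisper.
# Assumes the openai-whisper package; "interview.wav" is a placeholder file.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("interview.wav", language="fi", task="translate")
print(result["text"])
```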
-
What's the tradeoff between model size and precision?
- It's a tricky question to answer :D The reason is that precision depends on the task you are looking at. Usually models are run through some tests or benchmarks and the accuracy is reported. So the size itself cannot guarantee the precision of the model; it depends on how it was trained.
-
The instructor says --mem=80GB is system memory? Is that RAM? What is system memory?
- Yes, RAM. "RAM", "system memory", "main memory", "memory" - all are usually terms for the same thing. There are other types of "memory" though; unfortunately the terms are a bit too flexible.
-
Is system memory equivalent to the data that will be used? For example the voice recording, or the text?
- So the CPU memory is used to pre-load some of the data, and the GPU memory is where the model is loaded and does the calculation. Therefore the GPU memory should be large enough to hold the model itself plus a batch (a small part) of the data.
- In the case of text, the GPU has to hold the model parameters and the input and output layers, which are basically arrays of numbers. The input and output layer sizes depend on the context length of the LLM. Different models have been trained with different context lengths, so if the model fits into GPU memory, you can run an arbitrary number of queries through it; each query just has a maximum length equal to the context length.
- In the case of audio, the recording is often split into small snippets that are sent through the model one by one, so the length of the audio does not matter.
-
I get this error: RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: auto
-
should I select cuda?
- Are you running the llm-example?
- yes
- Are you using Triton?
- yes, I can copy the whole output
- Can you paste the command you used?
-
and I run sbatch llm_example.sh
-
Try `device_map="auto"` instead of `device="auto"`
-
I tried with device="cuda" and I got this. I copied the device="auto" from https://scicomp.aalto.fi/triton/apps/llms/
- We will need to update that example. But it might also depend on the versions you use, since this "alternating" pattern is not a requirement I was aware of, and one I haven't seen before.
-
Now I tried with device_map="auto" and the same jinja2.exceptions error happened
- It might be that at some point the chat templates used in transformers were updated to include this restriction, since (probably) some models require it. In this instance, you would need to enter only one prompt in the messages array, i.e. remove the "What is the meaning of life?" question.
- Yes, it worked after commenting out the second message (thanks!)
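For reference, a minimal sketch of the working pattern discussed above (assuming a recent transformers version with accelerate installed; the model name is only a placeholder, see the Triton LLM docs for the actual example):

```python
# Chat-style generation with a Hugging Face pipeline.
# Note: device_map="auto" (not device="auto"), and only one user message,
# since this model's chat template expects alternating user/assistant roles.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",   # placeholder model name
    device_map="auto",                      # let transformers place the model on the GPU(s)
)

messages = [{"role": "user", "content": "Hello! Tell me about HPC."}]
print(pipe(messages, max_new_tokens=50)[0]["generated_text"])
```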
-
Is this model being run locally?
- Yes, locally on Triton; not on Yu's own computer.
-
Where can I find instructions for checking memory usage in real time?
- In the Monitoring section of the course, we talked about `nvidia-smi` and other tools and "profilers" you can use.
-
What is this `watch` command?
- The `watch` command simply re-runs the command that comes after it: `watch -n 2 nvidia-smi` re-runs `nvidia-smi` every 2 seconds (the interval given by `-n 2`).
-
.
General Q&A
You can ask us anything, instructors will discuss as a group on stream.
-
I don't have a particular coding background; the reason I wanted to learn Triton was to be able to run my COMSOL simulations on it. The reason I want to do it on Triton is that the simulations become computationally too heavy, not on my personal computer, but on one of the virtual computers offered by Aalto. There is documentation about using COMSOL on Triton. Would you say following that documentation should be enough to get started running my simulations? Or should I run something "lighter" first on the cluster to see whether I actually understood how to do it?
- I would suggest just trying whether you can set up a COMSOL job (and the docs should have examples for this). If not, there are COMSOL focus days, where we have additional experts in the garage who can help with COMSOL-specific questions (but you can also come on any other day and ask; it might just be that we don't have the 'right' expert around).
-
I know I need to use a custom library for my summer project, and its README details installation guidelines which are not written for Triton and do not work in the terminal connected to the cluster… Is my best bet to come to the garage to get the custom library installed, or is there some tab in the docs that explains this kind of situation (I at least couldn't find one)?
- what kind of library (and what programming language) are we talking about here?
- I believe it is meant to be used with Python (at least for my project), but most of it is written in C++. The git repository includes an install.py that is meant to be run, but I don't know how to adjust the installation process for Triton.
- Odds are that you are missing some libraries, for example ones that would be provided by modules. That is something we can help with in the garage. (It's a lot easier for us to figure out what provides what, since some of us have probably already done something similar.)
- OK, thanks!
-
Is it possible to download this document for future reference?
- It is archived (see links above), and we'll make sure it's linked from the course page. It should stay up for a while, but your browser can probably also download it.
-
I also want to run simulations using platforms such as COMSOL or OpenFOAM. Does Helsinki Uni also provide access to these on their clusters?
- The Helsinki Uni Turso cluster has an OpenFOAM module: `module spider OpenFOAM`. We can also guide COMSOL users; please come to our garage (see below).
- Helsinki Uni users, please get started here: https://version.helsinki.fi/it-for-science/hpc/-/wikis/home
- If you have a Helsinki University user account, please join the course Zoom breakout room 1 right now so that we can get you identified and connected to the Turso cluster.
- You can also get started by sending an email to helpdesk[ät]helsinki.fi and requesting Turso cluster access. To speed things up, you can mention in the email that it's intended for the IT for Science solution team.
- And like Aalto, we have also adopted a daily HPC Garage practice to assist our users: https://version.helsinki.fi/it-for-science/hpc/-/wikis/Garage
-
.
-
.
-
.
Feedback, day 3
News for day 3 / what to do next
- We covered what was in the schedule.
- There is a lot more written material linked from the schedule (also Triton tutorials in general) that you should read for anything you need.
- Just try stuff out! And ask us for help when needed.
- Follow-up courses
- CodeRefinery: version control and software development stuff (mid-september).
- Python for Scientific Computing: late autumn
- Study a bit about shell scripting (as appropriate for you); we have a longer course on it
Join this Zoom now to meet the instructors: https://aalto.zoom.us/j/69608324491
The course was (vote for all that apply):
- too fast: oooo
- too slow: o
- right speed: oooooooooo
- too slow sometimes, too fast other times: o
- too advanced:ooo
- too basic:
- right level: ooooooo
- I will use what I learned today: ooo0ooooooo
- I would recommend today to others: ooooooooo
- I would not recommend today to others:
One good thing about today:
- to the point, and concrete.
- ..
- ..
One thing to improve for next time:
- I think the course should target an audience with a higher base level. For instance, if someone has never heard of a GPU, maybe they should be referred to more fundamental courses rather than an overview of HPC. Overall, the material could focus more on real challenges. +2
- I personally get distracted by the non-technical parts; I think it would be better to leave them all until the last day or a specific time.
- The LLM example did not work; it should be corrected
Any other feedback?
- Too fast, advanced, technical +1
- I really liked the execution of this course. Worked very nicely +2
- I wish there were more courses following this kind of execution
- Overall, i really liked your teaching way, it was encouraging to ask questions, which is really good, thank you so much for your hard work.
- Lots more today is Aalto/Triton-specific. It would be good to get input from other clusters to understand what's possible from built-in modules, and what we need to configure ourselves to keep up with the examples. +1
- Too advanced, too technical
- I really enjoyed when some of you had the terminal log showing, so I could easily follow along even when I'm not as fast with typing. I wish all of you would do it, since when it wasn't used I would easily fall behind.
- Good course! I think you covered everything important.
General questions continued:
- What now? Like I want to mess around with LLMs in my free time. How should I start? Can I somehow contribute to open source software?
- How long will the materials be available, and where?
- This chat will be archived and a link added to the course page. YouTube videos of the course will also be available on our YouTube channel
- ..
- What is CodeRefinery course?
- The CodeRefinery workshop covers more practical scicomp skills, such as git usage and general best practices.
"Thank you to all the instructors. It helped a lot. :)" +3