This is the playbook for how we'll do the 2025 summer kickstart Triton demos. It's mainly what ST and RD will do but others could add their stuff here.
:::::info
## This is a good example for instructors preparing their own lessons
- This is an example instructor prep for livestream team teaching.
- This plan covers 2 days of a ~4 hours/day course, so the prep for a single CodeRefinery lesson is *much* less.
- The format was actually used in February and June 2025, and worked very well. We can recommend it.
- The outline is made to very closely follow the actual lesson page we are doing.
- It doesn't have deep details on the actual materials. We know that well enough (and can figure out together if needed).
- It *does* have exact commands we will run as much as possible. This makes things much smoother and prevents surprises from someone doing things a new way.
- This format allows us to make sure we cover everything needed, and nothing else (to keep the time).
- It gets briefer as we go further.
This is the markdown format you see used:
````markdown
# Lesson episode/page (LeaderInitials, total duration)
```
This preformatted block is...
notes from pre-planning...
before we broke it down to chunks.
```
{Talk/Demo} (XX min) (whether this chunk is talk- or demo-based)
* First bullet - section on the page
* Action: something we show/type/etc.
* Including explicit commands
* Explanation: something we say
* Subpoints to remember
````
:::::
Questions (for the instructor prep):
- Remix the "starting a project" and data transfer parts?
- Enough time for setting up the project and cloning?
- Is the parallel session too long?
## HPC Kitchen
- Mention the video
- What is a cluster?
- More burners don't make the pasta cook any faster
- You have to become the kitchen manager to use all the burners.
## Setting up a new project (RD) (40 min)
```quote
Big example: create kickstart-2025 folder with git clones
One of us pretends to be a new person and asks things to get set up
e.g. ask for new project folder
Create $WRKDIR/kickstart-2025
Clone git (learn git!)
TODO: adjust/split the cluster-shell to "Set up your project"
```
Talk - Intro (4 min)
- Go through the images in the intro
- https://scicomp.aalto.fi/triton/tut/intro/
- Reminder: what a cluster is
- Describe a typical workflow
- Mention the options for editing files; note that for now we'll demonstrate everything through the command line
- "To demonstrate this workflow, we have created an example based on this ngram code"
- What's our problem?
- Language processing
- What's an ngram
- Why is it appropriate for the cluster
- data-based: disk→processor transfer time is expected to be the bottleneck
- can be split easily
- Plan for the day
- *demos and exercises*
- We only teach HPC-related stuff - not shell or git
- You will have homework
Demo - code (4 min)
- We first need to copy our code to the cluster
- Our example code is in a Git repository; version control makes the transfer easy
- It is designed so that you can clone it by yourself
- Where do we want to store the code?
- In our work directory (like most projects)
- Action: clone code
- `ssh triton`
- `cd $WRKDIR`
- `pwd` # to check where we are
- `git clone https://github.com/AaltoSciComp/hpc-examples.git`
- `cd hpc-examples`
Talk - data (4 min)
- Data transfer is one of the hardest parts of cluster usage
- The transfer itself isn't hard, but setting up can be (VPN, SSH keys, workarounds if you don't have that)
- Also keeping stuff in sync both ways is hard
- Two main styles
- Make a copy (we will do this now)
- Remote mount
- This is convenient once set up, but you need to learn more to get there
- Two HPC kitchen videos
Demos - data (4 min)
- Download data to local computer
- https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- Create data storage place
- Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
- Explanation: storage locations and why you have to care; why it's good to separate code, data, and installations.
- `mkdir $WRKDIR/gutenberg-fiction`
- `ls $WRKDIR`
- Copy our data over
- Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises
- Explanation: Open OnDemand (OOD)
- Action:
- Find the directory we made from the shell
- Upload data via OOD interface
- Alternative:
- `wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip`
- We need to remember this path.
- Summary
- We copied data
- Note at least three copies exist now (original, own computer, cluster)
- You have to know what is where.
Demo - all together (4 min)
- Initial test: run code on login node
- Explanation: login node vs the rest of the cluster
- Explanation: --help
- Action
- `python3 ngrams/count.py --help`
- Explanation:
- arguments (how we tell our programs what to do)
- CodeRefinery course on scripting
- Action:
- `python3 ngrams/count.py ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
- Discuss how the code reads the zip file directly (no unpacking needed)
Exercise (20 min)
```
Try to repeat what we just did. The parts:
* Download the code
* Exercise Shell-4 from https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo
* Create a directory to store the data
* Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
* Copy over the data
* Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises
If you can't finish, it's OK - most people need to play with this as homework. If your cluster is not Triton, you *will* need to do these steps in a different way.
```
Talk - summary (3 min)
- Reminder: you need to change to the same directory every time you log in
- The documentation lists better options for regular transfers (mounting) and for transferring big data (rsync etc.)
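For the "better options" above, a rough sketch of what a repeated transfer with rsync could look like, run from your own computer (the remote path is a placeholder; check the remote data page for the real recommendation):

```sh
# re-runnable copy from your laptop to the cluster; only changed files are re-sent
# -a keeps permissions/timestamps, -v prints what happens, -z compresses in transit
rsync -avz Gutenberg-Fiction-first100.zip triton:/path/to/your/workdir/gutenberg-fiction/
```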
## What is Slurm (ST) (10 min)
Explain (10 min)
- When we have 200 computers, how do people know what to use?
- Slurm is the manager: it divides up all the tasks among the different computers
- What resources are managed?
- How do you know what to request?
- After the break: we use this with our job
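A minimal sketch of what "requesting resources" looks like in practice, using the same flags we keep reusing later in the day (`hostname` is just a placeholder command):

```sh
# ask the manager (Slurm) for a small slot and run one command in it
srun --time=00:05:00 --mem=200M --cpus-per-task=1 hostname
```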
## Break
## Interactive jobs (RD) (20 min)
Big example: run Gutenberg
Talk:
- Interactive is good for testing and looks familiar
- It doesn't scale
- So we usually start here but never end here.
Demo:
- Explanation:
- srun
- --pty
- default resources (1 CPU)
- --time=00:05:00
- --mem=200M
- Explain output
- Action:
- `srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
- Add -n 2
- Add -o ngrams-2.out
- Explanation:
- pager
- Action:
- `less ngrams-2.out`
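Written out, the build-up the Demo bullets above describe (same flags and options as in the earlier demo):

```sh
# plain run inside a 5-minute, 200 MB interactive allocation
srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
# then add -n 2 and -o to write the result to a file instead of the screen
srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```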
Exercises
- (how do we have time for this?)
## First serial jobs (RD) (40 min)
Big example: run Gutenberg
Talk / Demo (10 min)
- Explanation:
- batch script (full sketch after this chunk)
- Action:
- nano
- Explanation
- shebang
- #SBATCH lines
- Action:
- sbatch
- Explanation:
- slurm history
- Action:
- slurm queue
- slurm history 1hour
- Explanation: looking at output
- stdout output
- Action:
- less slurm-NNN.out
- Check timings
- slurm history
- seff
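The batch script we would end up with in this chunk (same content as the "Previous day's submission script" under the day 2 setup; the file name `count-ngrams-2.sh` is just a suggestion):

```slurm
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=00:10:00

python3 ngrams/count.py -n 2 --words --output ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```

Submit with `sbatch count-ngrams-2.sh`, then check `slurm queue` / `slurm history 1hour`, read `less slurm-NNN.out`, and run `seff` on the job ID.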
Exercises (20 min)
Talk (10 min)
## End day 1
## Conda (JR, YT) (40 min)
Big example: Creating an environment with PyTorch
- More details: https://hackmd.io/Hvv4vE5nSmiBX9Qcxftwcw
- Mention that this is an Aalto-specific example; look at other sites' docs for their recommended solutions
- General dependency manager
- Language-agnostic, installs binaries
- Demo: create environment with R
``` yml
name: tidyverse
channels:
- conda-forge
dependencies:
- r-tidyverse
```
- Understanding the environment file
- Why you should track environments
- Demo: Python LLM environment
``` yml
name: pytorch
channels:
- nvidia
- conda-forge
dependencies:
- python==3.12
- pytorch-gpu>=2.6,<2.7
- torchvision
- torchaudio
- transformers
```
- Need to run `export CONDA_OVERRIDE_CUDA=12.6`
- but show first without, to demonstrate the error
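To actually build and use these environments on Triton, the commands would be roughly as below (the `mamba` module name and activation step follow the Aalto docs linked above, so double-check there):

```sh
module load mamba                          # cluster-provided conda/mamba
export CONDA_OVERRIDE_CUDA=12.6            # for the GPU environment; show the error without it first
mamba env create --file pytorch-env.yml    # file name = whatever we save the yml above as
source activate pytorch                    # environment name comes from the "name:" field
```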
## Bringing everyone up to speed
The following will set you up where we left off after day 1:
```sh
cd $WRKDIR
mkdir -p gutenberg-fiction
cd gutenberg-fiction
wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
cd ..
git clone https://github.com/AaltoSciComp/hpc-examples.git
cd hpc-examples
```
Previous day's submission script:
```slurm
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=00:10:00
python3 ngrams/count.py -n 2 --words --output ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```
## Array jobs (ST) (50 min)
Big example: gutenberg in parallel
Talk (5 min)
- You want the cluster to do many things at the same time
- Array jobs are one of the easiest ways to do that
- Give everyone the same instructions and one number. That number tells them what to do
- Unfortunately you need to be good at shell for this. You are the manager, not the cook.
Demo (10 min)
- Action:
- Create script `count-ngrams-2-array.sh`:
```slurm
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --array=0-9
#SBATCH --time=00:10:00
python3 ngrams/count.py \
-n 2 --words \
--start=$SLURM_ARRAY_TASK_ID --step=10 \
--output=ngrams-2-array-$SLURM_ARRAY_TASK_ID.out \
../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```
- `#SBATCH --array=0-9`
- `python3 ngrams/count.py -n 2 --words --start=$SLURM_ARRAY_TASK_ID --step=10 --output=ngrams-2-array-$SLURM_ARRAY_TASK_ID.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
- `sbatch count-ngrams-2-array.sh`
- Look at array job outputs
- Array data processing:
- Combine array job outputs to one
- `python3 ngrams/combine-counts.py ngrams-2-array*.out -o ngrams-2-array.out`
- `head ngrams-2.out`
- `head ngrams-2-array.out`
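A possible sequence for the "look at array job outputs" step (by default Slurm names array task logs `slurm-JOBID_TASKID.out`):

```sh
slurm queue                 # or: squeue -u $USER, to watch the 10 tasks
ls slurm-*_*.out            # one stdout file per array task
ls ngrams-2-array-*.out     # one counts file per task, written by count.py --output
```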
Complex example:
```sh
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-hardcoded
#SBATCH --output=pi-array-hardcoded_%a.out
#SBATCH --array=0-4
case $SLURM_ARRAY_TASK_ID in
0) SEED=123 ;;
1) SEED=38 ;;
2) SEED=22 ;;
3) SEED=60 ;;
4) SEED=432 ;;
esac
python3 slurm/pi.py 2500000 --seed=$SEED > pi_$SEED.json
```
Exercise (30 min)
- Repeat the demo.
- Do one or more of the [hardcoding examples](https://scicomp.aalto.fi/triton/tut/array/#hardcoding-arguments-in-the-batch-script)
- These examples use a program that estimates pi with random (Monte Carlo) trials
Talk (5 min)
## Monitoring (ST) (20 min)
Big example: Gutenberg monitoring for multiprocessing vs single-core.
Instructors submit jobs during the break; learners examine those jobs' performance (only works on the Aalto cluster).
- `squeue -u $USER`/ `slurm queue`
- `sacct -u $USER` / `slurm history`
- `seff 7897826` (single CPU job)
- `seff 7897849` (multi CPU job)
- `seff 5246490` (for GPU job)
- `ssh gpuXX` -> run `nvidia-smi`
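One way the instructors could queue the comparison jobs during the break; the plan above says Gutenberg, but this sketch reuses `pi.py` from the Parallel session as a stand-in, so treat the exact commands as an assumption:

```sh
# single-core vs multi-core job, to compare later with seff
sbatch --time=00:10:00 --mem=2G --wrap "python3 slurm/pi.py 50000000"
sbatch --time=00:10:00 --mem=2G --cpus-per-task=4 --wrap "python3 slurm/pi.py --nprocs 4 50000000"
```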
## Applications (RD) (20 min)
modules, containers, conda
- Our cluster has more than 1000 users, so you can't install software the way you would on your own computer
- There are ways to install software so that it can be shared
- Problems
- Different people need different versions
- Can't break the cluster OS
- See our applications page; we don't go in depth here.
- System stuff
- Demo
- `which python3`
- Modules
- Make software available relatively normally
- Demo
- `module spider matlab`
- `module load matlab`
- `which matlab`
- `matlab -nojvm`
- We have just changed how this works, so the docs may not be up to date
- Containers
- When something is really hard to install, you package a whole OS together with it
- Demo
- `module load apptainer-freesurfer`
- `apptainer exec ${IMAGE_PATH} freesurfer --help`
- Link this:
- https://github.com/coderefinery/hpc-containers
- Conda
- Conda started for Python but it's good for almost anything
- It can install almost anything into an environment
- You had a demo earlier in the morning about this
- `module load scicomp-python-env`
- `which python3`
- What you should do
- When you need something, first read the instructions or ask
- Don't try to figure it all out yourself
## Parallel (ST) (50 min)
Big example: pi example
Discuss why it's hard to know what is parallel
submitting
Could this include speed considerations? (copy from winter kickstart)
Follow the tutorials for this session more closely
- Multiprocessing:
- Explanation:
- Quick difference between array and multiprocessing
- If your code uses multiprocessing, you need to tell it how many CPUs it should use
- and you should also tell the queue (Slurm) to actually give you those CPUs
- Action:
::: warning
Let's use pi instead
- Run single core job: `srun --pty --time=00:05:00 --mem=2G python3 ngrams/count.py -n 2 --words -o ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
- Modify to run 4 core version
- `--cpus-per-task=4`
- Increase memory to `--mem=2G`
- `--threads=4`
- Change the script name to `count-multi.py`
:::
- Show example: `python3 slurm/pi.py 1000`
- Run single core job: `srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000`
- Check seff
- Modify to run with multiple cores: `srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000`
- Check seff
- Explanation:
- Check run time. Debate whether it was worth it (saves time if it works vs have to queue longer to get a good slot)
- Explain that not all programs that can utilize multiple CPUs are optimized.
- Mention `OMP_NUM_THREADS`, `SLURM_CPUS_PER_TASK`, automatic parallelization and oversubscription
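For programs that read `OMP_NUM_THREADS` (pi.py takes `--nprocs` instead), the usual pattern inside a batch script is a sketch like:

```sh
# tie the thread count to what Slurm actually allocated, instead of hardcoding it
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
```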
Exercise:
- Run the same commands we did
```sh
srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000
srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000
```
`run-pi-4core.sh`:
```sh
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --mem=2G

# run the multi-process version; tie the process count to the Slurm allocation
python3 slurm/pi.py --nprocs=$SLURM_CPUS_PER_TASK 50000000
```
- MPI:
- Explanation:
- Action:
- Compile MPI program:
```sh
module load triton/2024.1-gcc openmpi/4.1.6
mpicc -o pi-mpi slurm/pi-mpi.c
```
- Run the program with a single processor:
```sh
srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
```
- Run the program with four MPI tasks:
```sh
srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
```
- Write an sbatch script `pi-mpi.sh` for the job and submit it
```sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-mpi.out
#SBATCH --nodes=1
#SBATCH --ntasks=2
module load triton/2024.1-gcc openmpi/4.1.6
srun ./pi-mpi 1000000
```
- Explain why srun in script
Exercise:
- Run the same commands we did
```sh
module load triton/2024.1-gcc openmpi/4.1.6
mpicc -o pi-mpi slurm/pi-mpi.c
srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
```
`pi-mpi.sh`:
```sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-mpi.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
module load triton/2024.1-gcc openmpi/4.1.6
srun ./pi-mpi 1000000
```
## GPU (ST, HF) (30 min)
- Teaching:
- What are GPUs? (ST)
- How to choose GPUs (HF)
- VRAM and Compute Power
- How to get the list of available GPUs: `slurm p`
- How programs use GPUs (image of CUDA stack) (ST)
- Interactive usage:
- `-p gpu-debug` (see the sketch at the end of this section)
- MIG GPUs on Jupyter
- Why interactive usage is wasteful
- If you run a training job, you can run a single epoch to test whether your code works
- Action:
- `module load triton/2024.1-gcc gcc/12.3.0 cuda/12.2.1`
- `nvcc -arch=sm_70 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_90,code=sm_90 -o pi-gpu slurm/pi-gpu.cu`
- `srun --time=00:10:00 --mem=500M --gpus=1 ./pi-gpu 1000000`
- ```sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-gpu.out
#SBATCH --gpus=1
#SBATCH --partition=gpu-v100-16g,gpu-v100-32g,gpu-a100-80g,gpu-h100-80g
module load triton/2024.1-gcc cuda/12.2.1
./pi-gpu 1000000
```
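For the interactive-usage bullet above, one possible command (partition name from `slurm p` / the docs; the exact flag combination is an assumption):

```sh
# grab one GPU on the debug partition and check what we got
srun --partition=gpu-debug --gpus=1 --time=00:10:00 --mem=500M --pty nvidia-smi
```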