This is the playbook for how we'll do the 2025 summer kickstart Triton demos. It's mainly what ST and RD will do, but others can add their stuff here.

:::::info
## This is a good example for instructors preparing their own lessons

- This is an example instructor prep for livestream team teaching.
- This plan covers a two-day course of ~4 hours/day, so one CodeRefinery lesson prep is *much* less.
- The format was actually used in February and June 2025, and worked very well. We can recommend it.
- The outline is made to very closely follow the actual lesson page we are doing.
- It doesn't have deep details on the actual materials. We know those well enough (and can figure them out together if needed).
- It *does* have the exact commands we will run, as much as possible. This makes things much smoother and prevents surprises from someone doing things a new way.
- This format allows us to make sure we cover everything needed, and nothing else (to keep the time).
- It gets briefer as we go further.

This is the markdown format you see used:

````markdown
# Lesson episode/page (LeaderInitials, total duration)

```
This preformatted block is... notes from pre-planning... before we broke it down to chunks.
```

{Talk/Demo} (XX min)   (this chunk - talking or demo based)
* First bullet - section on the page
  * Action: something we show/type/etc.
    * Including explicit commands
  * Explanation: something we say
    * Subpoints to remember
````
:::::

questions: (for the instructor prep)
- remix the starting-a-project and data-transfer parts
- Enough time for setting up the project and cloning?
- Is parallel too long?

## HPC Kitchen

- Mention the video
- What is a cluster?
  - More burners doesn't make faster pasta
  - You have to become the chef manager to use all the burners.

## Setting up a new project (RD) (40 min)

```
Big example: create kickstart-2025 folder with git clones

One of us pretends to be a new person and asks things to get set up
e.g. ask for new project folder
Create $WRKDIR/kickstart-2025
Clone git (learn git!)

TODO: adjust/split the cluster-shell to "Set up your project"
```

Talk - Intro (4 min)
- Go through the images in the intro
  - https://scicomp.aalto.fi/triton/tut/intro/
  - Reminder: what a cluster is
  - Describe a typical workflow
  - Tell what options there are for editing; inform that for now we'll demonstrate everything through the command line
- "To demonstrate this workflow, we have created an example workflow based on this ngram code"
  - What's our problem?
    - Language processing
    - What's an ngram
  - Why is it appropriate for the cluster
    - data-based: hard disk→processor time is expected to be the bottleneck
    - can be split easily
- Plan for the day
  - *demos and exercises*
  - We only teach HPC-related stuff - not shell or git
  - You will have homework

Demo - code (4 min)
- We first need to copy our code to the cluster
  - We have our example code in Git (version control), which allows easy transfer
  - It is designed so that you can clone it by yourself
- Where do we want to store the code?
  - In our work directory (like most projects)
- Action: clone the code
  - `ssh triton`
  - `cd $WRKDIR`
  - `pwd`   # to check where we are
  - `git clone https://github.com/AaltoSciComp/hpc-examples.git`
  - `cd hpc-examples`

Talk - data (4 min)
- Data transfer is one of the hardest parts of cluster usage
  - The transfer itself isn't hard, but setting it up can be (VPN, SSH keys, workarounds if you don't have those)
  - Also, keeping stuff in sync both ways is hard
- Two main styles
  - Make a copy (we will do this now)
  - Remote mount - this is fast, but you need to learn more
- Two HPC Kitchen videos

Demos - data (4 min)
- Download data to the local computer
  - https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- Create a data storage place
  - Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
  - Explanation: storage places and why you have to care. Why it's good to separate code, data, and installations.
  - `mkdir $WRKDIR/gutenberg-fiction`
  - `ls $WRKDIR`
- Copy our data over
  - Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises
  - Explanation: Open OnDemand (OOD)
  - Action:
    - Find the directory we made from the shell
    - Upload data via the OOD interface
  - Alternative:
    - `wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip`
  - We need to remember this path.
- Summary
  - We copied data
  - Note that at least three copies exist now (original, own computer, cluster)
  - You have to know what is where.

Demo - all together (4 min)
- Initial test: run the code on the login node
  - Explanation: login node vs. cluster
  - Explanation: --help
  - Action:
    - `python3 ngrams/count.py --help`
  - Explanation:
    - arguments (how we tell our programs what to do)
    - CodeRefinery course on scripting
  - Action:
    - `python3 ngrams/count.py ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
  - Discuss how we use the zipfile directly

Exercise (20 min)

```
Try to repeat what we just did.  If you can't, it's OK: work on it as homework.

The parts:

* Download the code
  * Exercise Shell-4 from https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo
* Create a directory to store the data
  * Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
* Copy over the data
  * Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises

If you can't, it's OK - most people need to play with this from home.

If your cluster is not Triton, you *will* need to do these steps some different way.
```

Talk - summary (3 min)
- Reminder: you need to change to the same directory every time you log in
- The documentation lists better options for regular transfers (mounting) and for transferring big data (rsync etc.)

## What is Slurm (ST) (10 min)

Explain (10 min)
- When we have 200 computers, how do people know which to use?
- Slurm is the manager: it divides up all the tasks among the different computers
- What resources are managed?
- How do you know what to request?
- After the break: we use this with our job

## Break

## Interactive jobs (RD) (20 min)

Big example: run Gutenberg

Talk:
- Interactive is good for testing and looks familiar
- It doesn't scale
- So we usually start here but never end here.

Demo:
- Explanation:
  - srun
  - --pty
  - default resources (1 CPU)
  - --time=00:05:00
  - --mem=200M
  - Explain output
- Action:
  - `srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
    - Add `-n 2`
    - Add `-o ngrams-2.out`
- Explanation:
  - pager
- Action:
  - `less ngrams-2.out`

Exercises
- (how do we have time for this?)
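If there is time, a minimal sketch of what the exercise could be (the same commands as in the demo above; this assumes the morning's hpc-examples clone and Gutenberg data are already in place):

```sh
cd $WRKDIR/hpc-examples
srun --pty --time=00:05:00 --mem=200M \
    python3 ngrams/count.py -n 2 -o ngrams-2.out \
    ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
less ngrams-2.out
```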
## First serial jobs (RD) (40 min)

Big example: run Gutenberg

Talk / Demo (10 min)
- Explanation:
  - batch script
- Action:
  - nano
- Explanation:
  - shebang
  - #SBATCH lines
- Action:
  - sbatch
- Explanation:
  - slurm history
- Action:
  - slurm queue
  - slurm history 1hour
- Explanation: looking at output
  - stdout output
- Action:
  - less slurm-NNN.out
- Check timings
  - slurm history
  - seff

Exercises (20 min)

Talk (10 min)

## End day 1

## Conda (JR, YT) (40 min)

Big example: creating an environment with PyTorch

- More details: https://hackmd.io/Hvv4vE5nSmiBX9Qcxftwcw
- Mention that this is an Aalto-specific example; look at other sites' docs for their recommended solutions
- General dependency manager
  - Language-agnostic, installs binaries
- Demo: create an environment with R

  ```yaml
  name: tidyverse
  channels:
    - conda-forge
  dependencies:
    - r-tidyverse
  ```

- Understanding the environment file
- Why you should track environments
- Demo: Python LLM environment

  ```yaml
  name: pytorch
  channels:
    - nvidia
    - conda-forge
  dependencies:
    - python==3.12
    - pytorch-gpu>=2.6,<2.7
    - torchvision
    - torchaudio
    - transformers
  ```

  - Need to run `export CONDA_OVERRIDE_CUDA=12.6` - but show it first without, to demonstrate the error

## Bringing everyone up to speed

The following will set you up where we left off after day 1:

```sh
cd $WRKDIR
mkdir -p gutenberg-fiction
cd gutenberg-fiction
wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
cd ..
git clone https://github.com/AaltoSciComp/hpc-examples.git
cd hpc-examples
```

Previous day's submission script:

```slurm
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=00:10:00

python3 ngrams/count.py -n 2 --words --output ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```

## Array jobs (ST) (50 min)

Big example: Gutenberg in parallel

Talk (5 min)
- You want the cluster to do many things at the same time
- Array jobs are one of the easiest ways to do that
- Give everyone the same instructions and one number. That number tells them what to do
- Unfortunately you need to be good at shell for this. You are the manager, not the cook.

Demo (10 min)
- Action:
  - Create script `count-ngrams-2-array.sh`:

    ```slurm
    #!/bin/bash
    #SBATCH --mem=1G
    #SBATCH --array=0-9
    #SBATCH --time=00:10:00

    python3 ngrams/count.py \
        -n 2 --words \
        --start=$SLURM_ARRAY_TASK_ID --step=10 \
        --output=ngrams-2-array-$SLURM_ARRAY_TASK_ID.out \
        ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
    ```

    - `#SBATCH --array=0-9`
    - `python3 ngrams/count.py -n 2 --words --start=$SLURM_ARRAY_TASK_ID --step=10 -o ngrams-2-array-$SLURM_ARRAY_TASK_ID.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
  - `sbatch count-ngrams-2-array.sh`
  - Look at the array job outputs
- Array data processing:
  - Combine the array job outputs into one
  - `python3 ngrams/combine-counts.py ngrams-2-array*.out -o ngrams-2-array.out`
  - `head ngrams-2.out`
  - `head ngrams-2-array.out`

Complex example:

```sh=
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-hardcoded
#SBATCH --output=pi-array-hardcoded_%a.out
#SBATCH --array=0-4

case $SLURM_ARRAY_TASK_ID in
   0)  SEED=123 ;;
   1)  SEED=38  ;;
   2)  SEED=22  ;;
   3)  SEED=60  ;;
   4)  SEED=432 ;;
esac

python3 slurm/pi.py 2500000 --seed=$SEED > pi_$SEED.json
```

Exercise (30 min)
- Repeat the demo.
- Do one or more of the [hardcoding examples](https://scicomp.aalto.fi/triton/tut/array/#hardcoding-arguments-in-the-batch-script)
  - These examples use a code that calculates an estimate of pi with statistical trials

Talk (5 min)

## Monitoring (ST) (20 min)

Big example: Gutenberg monitoring for multiprocessing vs. single-core. Instructors submit jobs during the break; users examine those jobs' performance (only works on the Aalto cluster).

- `squeue -u $USER` / `slurm queue`
- `sacct -u $USER` / `slurm history`
- `seff 7897826` (single-CPU job)
- `seff 7897849` (multi-CPU job)
- `seff 5246490` (GPU job)
- `ssh gpuXX` -> run `nvidia-smi`

## Applications (RD) (20 min)

modules, containers, conda

- Our cluster has more than 1000 users. You can't install stuff like normal
- There are ways to install software so that it can be shared
- Problems
  - Different people need different versions
  - Can't break the cluster OS
- See our applications page; we don't go in depth.
- System stuff
  - Demo
    - `which python3`
- Modules
  - Make software available relatively normally
  - Demo
    - `module spider matlab`
    - `module load matlab`
    - `which matlab`
    - `matlab -nojvm`
  - We have just changed how things work, so docs may not be up to date
- Containers
  - When something is really hard to install, you install a whole OS together with it
  - Demo
    - `module load apptainer-freesurfer`
    - `apptainer exec ${IMAGE_PATH} freesurfer --help`
  - Link this:
    - https://github.com/coderefinery/hpc-containers
- Conda
  - Conda started for Python but it's good for almost anything
  - It can install almost anything into an environment
  - You had a demo earlier in the morning about this
  - `module load scicomp-python-env`
  - `which python3`
- What you should do
  - When you need something, first read the instructions or ask
  - Don't try to figure it out yourself

## Parallel (ST) (50 min)

Big example: pi example

Discuss why it's hard to know what is parallel when submitting

Could this include speed considerations? (copy from winter kickstart)

Follow the tutorials for this session more closely

- Multiprocessing:
  - Explanation:
    - Quick difference between array jobs and multiprocessing
    - If your code uses multiprocessing, you need to tell it how many CPUs it should use
    - and if you're planning on using multiprocessing, you should tell the queue to give you the resources
  - Action:
    ::: warning
    Let's use pi instead
    - Run single-core job: `srun --pty --time=00:05:00 --mem=2G python3 ngrams/count.py -n 2 --words -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip`
    - Modify to run the 4-core version
      - `--cpus-per-task=4`
      - Increase memory to `--mem=2G`
      - `--threads=4`
      - Change the script name to `count-multi.py`
    :::
    - Show example: `python3 slurm/pi.py 1000`
    - Run single-core job: `srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000`
      - Check seff
    - Modify to run with multiple cores: `srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000`
      - Check seff
  - Explanation:
    - Check run time. Debate whether it was worth it (saves time if it works vs. having to queue longer to get a good slot)
    - Explain that not all programs that can utilize multiple CPUs are optimized for it
    - Mention `OMP_NUM_THREADS`, `SLURM_CPUS_PER_TASK`, automatic parallelization, and oversubscription (see the sketch below)
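    - A minimal sketch of that last point (not part of the original demo): Slurm exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is requested, and many threaded codes read `OMP_NUM_THREADS`, so tying the two together avoids oversubscription:

      ```sh
      # Sketch only: lines that could go inside a batch script which
      # requested, e.g., --cpus-per-task=4.  Forwarding Slurm's CPU
      # allocation to OpenMP-style threaded codes keeps them from
      # starting one thread per core on the node (oversubscription).
      export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
      echo "Slurm allocated $SLURM_CPUS_PER_TASK CPUs to this task"
      ```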
  - Exercise:
    - Run the same commands we did

      ```sh
      srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000
      srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000
      ```

    - `run-pi-4core.sh`:

      ```sh
      #!/bin/bash
      #SBATCH --cpus-per-task=4
      #SBATCH --time=00:10:00
      #SBATCH --mem=2G

      # Batch version of the 4-core srun command above
      python3 slurm/pi.py --nprocs 4 50000000
      ```

- MPI:
  - Explanation:
  - Action:
    - Compile the MPI program:

      ```sh
      module load triton/2024.1-gcc openmpi/4.1.6
      mpicc -o pi-mpi slurm/pi-mpi.c
      ```

    - Run the program with a single processor:

      ```sh
      srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - Run the program with four MPI tasks:

      ```sh
      srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - Write an sbatch script `pi-mpi.sh` for the job and submit it

      ```sh
      #!/bin/bash
      #SBATCH --time=00:10:00
      #SBATCH --mem=500M
      #SBATCH --output=pi-mpi.out
      #SBATCH --nodes=1
      #SBATCH --ntasks=2

      module load triton/2024.1-gcc openmpi/4.1.6

      srun ./pi-mpi 1000000
      ```

    - Explain why srun is used in the script
  - Exercise:
    - Run the same commands we did

      ```sh
      module load triton/2024.1-gcc openmpi/4.1.6
      mpicc -o pi-mpi slurm/pi-mpi.c
      srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
      srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - `pi-mpi.sh`:

      ```sh
      #!/bin/bash
      #SBATCH --time=00:10:00
      #SBATCH --mem=500M
      #SBATCH --output=pi-mpi.out
      #SBATCH --nodes=1
      #SBATCH --ntasks=4

      module load triton/2024.1-gcc openmpi/4.1.6

      srun ./pi-mpi 1000000
      ```

## GPU (ST, HF) (30 min)

- Teaching:
  - What are GPUs? (ST)
  - How to choose GPUs (HF)
    - VRAM and compute power
    - How to get the list: `slurm p`
  - How programs use GPUs (image of the CUDA stack) (ST)
- Interactive usage:
  - `-p gpu-debug`
  - MIG GPUs on Jupyter
  - Why interactive usage is wasteful
    - If you run a training job, you can run a single epoch and test if your code works
- Action:
  - `module load triton/2024.1-gcc gcc/12.3.0 cuda/12.2.1`
  - `nvcc -arch=sm_70 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_90,code=sm_90 -o pi-gpu slurm/pi-gpu.cu`
  - `srun --time=00:10:00 --mem=500M --gpus=1 ./pi-gpu 1000000`
  - As a batch script:

    ```sh
    #!/bin/bash
    #SBATCH --time=00:10:00
    #SBATCH --mem=500M
    #SBATCH --output=pi-gpu.out
    #SBATCH --gpus=1
    #SBATCH --partition=gpu-v100-16g,gpu-v100-32g,gpu-a100-80g,gpu-h100-80g

    module load triton/2024.1-gcc cuda/12.2.1

    ./pi-gpu 1000000
    ```
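A possible wrap-up sketch (not in the original plan): submitting and checking the GPU job, assuming the batch script above is saved as `pi-gpu.sh` (that filename is our assumption):

```sh
sbatch pi-gpu.sh        # submit the batch script above (assumed filename)
slurm queue             # or: squeue -u $USER
seff <jobid>            # after it finishes: check the job's efficiency
less pi-gpu.out         # output file named by #SBATCH --output
```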