This is the playbook for how we'll do the 2025 summer kickstart Triton demos. It's mainly what ST and RD will do, but others can add their stuff here.

:::::info
## This is a good example for instructors preparing their own lessons

- This is an example instructor prep for livestream team teaching.
- This plan covers a two-day course of ~4 hours/day, so one CodeRefinery lesson prep is *much* less.
- The format was actually used in February and June 2025, and worked very well. We can recommend it.
- The outline is made to very closely follow the actual lesson page we are doing.
- It doesn't have deep details on the actual materials. We know those well enough (and can figure them out together if needed).
- It *does* have the exact commands we will run, as much as possible. This makes things much smoother and prevents surprises from someone doing things a new way.
- This format allows us to make sure we cover everything needed, and nothing else (to keep the time).
- It gets briefer as we go further.

This is the markdown format you see used:

````markdown
# Lesson episode/page (LeaderInitials, total duration)

```
This preformatted block is... notes from pre-planning... before we broke it down to chunks.
```

{Talk/Demo} (XX min)   (this chunk - talking or demo based)
* First bullet - section on the page
  * Action: something we show/type/etc.
    * Including explicit commands
  * Explanation: something we say
    * Subpoints to remember
````
:::::

questions: (for the instructor prep)
- remix the starting-a-project and data-transfer parts
- Enough time for setting up the project and cloning?
- Is parallel too long?

## HPC Kitchen

- Mention the video
- What is a cluster?
  - More burners doesn't make faster pasta
  - You have to become the chef manager to use all the burners.

## Setting up a new project (RD) (40 min)

```
Big example: create kickstart-2025 folder with git clones

One of us pretends to be a new person and asks things to get set up
e.g. ask for new project folder
Create $WRKDIR/kickstart-2025
Clone git (learn git!)

TODO: adjust/split the cluster-shell to "Set up your project"
```

Talk - Intro (4 min)
- Go through the images in the intro
  - https://scicomp.aalto.fi/triton/tut/intro/
  - Reminder: what a cluster is
  - Describe a typical workflow
  - Tell what options there are for editing; inform that for now we'll demonstrate everything through the command line
- "To demonstrate this workflow, we have created an example workflow based on this ngram code"
  - What's our problem?
    - Language processing
    - What's an ngram
  - Why is it appropriate for the cluster
    - data-based: hard disk→processor time is expected to be the bottleneck
    - can be split easily
- Plan for the day
  - *demos and exercises*
  - We only teach HPC-related stuff - not shell or git
  - You will have homework

Demo - code (4 min)
- We first need to copy our code to the cluster
  - We have our example code in Git (version control), which allows easy transfer
  - It is designed so that you can clone it by yourself
- Where do we want to store the code?
  - In our work directory (like most projects)
- Action: clone the code
  - `ssh triton`
  - `cd $WRKDIR`
  - `pwd`   # to check where we are
  - `git clone https://github.com/AaltoSciComp/hpc-examples.git`
  - `cd hpc-examples`

Talk - data (4 min)
- Data transfer is one of the hardest parts of cluster usage
  - The transfer itself isn't hard, but setting it up can be (VPN, SSH keys, workarounds if you don't have those)
  - Also, keeping stuff in sync both ways is hard
- Two main styles
  - Make a copy (we will do this now)
  - Remote mount - this is fast, but you need to learn more
- Two HPC Kitchen videos

Demos - data (4 min)
- Download data to the local computer
  - https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- Create a data storage place
  - Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
  - Explanation: storage places and why you have to care. Why it's good to separate code, data, and installations.
  - `mkdir $WRKDIR/gutenberg-fiction`
  - `ls $WRKDIR`
- Copy our data over
  - Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises
  - Explanation: Open OnDemand (OOD)
  - Action:
    - Find the directory we made from the shell
    - Upload data via the OOD interface
  - Alternative:
    - `wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip`
  - We need to remember this path.
- Summary
  - We copied data
  - Note that at least three copies exist now (original, own computer, cluster)
  - You have to know what is where.

Demo - all together (4 min)
- Initial test: run the code on the login node
  - Explanation: login node vs. cluster
  - Explanation: --help
  - Action:
    - `python3 ngrams/count.py --help`
  - Explanation:
    - arguments (how we tell our programs what to do)
    - CodeRefinery course on scripting
  - Action:
    - `python3 ngrams/count.py ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
  - Discuss how we use the zipfile directly

Exercise (20 min)

```
Try to repeat what we just did.  If you can't, it's OK: work on it as homework.

The parts:

* Download the code
  * Exercise Shell-4 from https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo
* Create a directory to store the data
  * Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
* Copy over the data
  * Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises

If you can't, it's OK - most people need to play with this from home.

If your cluster is not Triton, you *will* need to do these steps some different way.
```

Talk - summary (3 min)
- Reminder: you need to change to the same directory every time you log in
- The documentation lists better options for regular transfers (mounting) and for transferring big data (rsync etc.)

## What is Slurm (ST) (10 min)

Explain (10 min)
- When we have 200 computers, how do people know which to use?
- Slurm is the manager: it divides up all the tasks among the different computers
- What resources are managed?
- How do you know what to request?
- After the break: we use this with our job

## Break

## Interactive jobs (RD) (20 min)

Big example: run Gutenberg

Talk:
- Interactive is good for testing and looks familiar
- It doesn't scale
- So we usually start here but never end here.

Demo:
- Explanation:
  - srun
  - --pty
  - default resources (1 CPU)
  - --time=00:05:00
  - --mem=200M
  - Explain output
- Action:
  - `srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
    - Add `-n 2`
    - Add `-o ngrams-2.out`
- Explanation:
  - pager
- Action:
  - `less ngrams-2.out`

Exercises
- (how do we have time for this?)
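If there is time, a minimal sketch of what the exercise could be (the same commands as in the demo above; this assumes the morning's hpc-examples clone and Gutenberg data are already in place):

```sh
cd $WRKDIR/hpc-examples
srun --pty --time=00:05:00 --mem=200M \
    python3 ngrams/count.py -n 2 -o ngrams-2.out \
    ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
less ngrams-2.out
```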
## First serial jobs (RD) (40 min)

Big example: run Gutenberg

Talk / Demo (10 min)
- Explanation:
  - batch script
- Action:
  - nano
- Explanation:
  - shebang
  - #SBATCH lines
- Action:
  - sbatch
- Explanation:
  - slurm history
- Action:
  - slurm queue
  - slurm history 1hour
- Explanation: looking at output
  - stdout output
- Action:
  - less slurm-NNN.out
- Check timings
  - slurm history
  - seff

Exercises (20 min)

Talk (10 min)

## End day 1

## Conda (JR, YT) (40 min)

Big example: creating an environment with PyTorch

- More details: https://hackmd.io/Hvv4vE5nSmiBX9Qcxftwcw
- Mention that this is an Aalto-specific example; look at other sites' docs for their recommended solutions
- General dependency manager
  - Language-agnostic, installs binaries
- Demo: create an environment with R

  ```yaml
  name: tidyverse
  channels:
    - conda-forge
  dependencies:
    - r-tidyverse
  ```

- Understanding the environment file
- Why you should track environments
- Demo: Python LLM environment

  ```yaml
  name: pytorch
  channels:
    - nvidia
    - conda-forge
  dependencies:
    - python==3.12
    - pytorch-gpu>=2.6,<2.7
    - torchvision
    - torchaudio
    - transformers
  ```

  - Need to run `export CONDA_OVERRIDE_CUDA=12.6` - but show it first without, to demonstrate the error

## Bringing everyone up to speed

The following will set you up where we left off after day 1:

```sh
cd $WRKDIR
mkdir -p gutenberg-fiction
cd gutenberg-fiction
wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
cd ..
git clone https://github.com/AaltoSciComp/hpc-examples.git
cd hpc-examples
```

Previous day's submission script:

```slurm
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=00:10:00

python3 ngrams/count.py -n 2 --words --output ngrams-2.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```

## Array jobs (ST) (50 min)

Big example: Gutenberg in parallel

Talk (5 min)
- You want the cluster to do many things at the same time
- Array jobs are one of the easiest ways to do that
- Give everyone the same instructions and one number. That number tells them what to do
- Unfortunately you need to be good at shell for this. You are the manager, not the cook.

Demo (10 min)
- Action:
  - Create script `count-ngrams-2-array.sh`:

    ```slurm
    #!/bin/bash
    #SBATCH --mem=1G
    #SBATCH --array=0-9
    #SBATCH --time=00:10:00

    python3 ngrams/count.py \
        -n 2 --words \
        --start=$SLURM_ARRAY_TASK_ID --step=10 \
        --output=ngrams-2-array-$SLURM_ARRAY_TASK_ID.out \
        ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
    ```

    - `#SBATCH --array=0-9`
    - `python3 ngrams/count.py -n 2 --words --start=$SLURM_ARRAY_TASK_ID --step=10 -o ngrams-2-array-$SLURM_ARRAY_TASK_ID.out ../gutenberg-fiction/Gutenberg-Fiction-first100.zip`
  - `sbatch count-ngrams-2-array.sh`
  - Look at the array job outputs
- Array data processing:
  - Combine the array job outputs into one
  - `python3 ngrams/combine-counts.py ngrams-2-array*.out -o ngrams-2-array.out`
  - `head ngrams-2.out`
  - `head ngrams-2-array.out`

Complex example:

```sh=
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-hardcoded
#SBATCH --output=pi-array-hardcoded_%a.out
#SBATCH --array=0-4

case $SLURM_ARRAY_TASK_ID in
   0)  SEED=123 ;;
   1)  SEED=38  ;;
   2)  SEED=22  ;;
   3)  SEED=60  ;;
   4)  SEED=432 ;;
esac

python3 slurm/pi.py 2500000 --seed=$SEED > pi_$SEED.json
```

Exercise (30 min)
- Repeat the demo.
- Do one or more of the [hardcoding examples](https://scicomp.aalto.fi/triton/tut/array/#hardcoding-arguments-in-the-batch-script)
  - These examples use a code that calculates an estimate of pi with statistical trials

Talk (5 min)

## Monitoring (ST) (20 min)

Big example: Gutenberg monitoring for multiprocessing vs. single-core. Instructors submit jobs during the break; users examine those jobs' performance (only works on the Aalto cluster).

- `squeue -u $USER` / `slurm queue`
- `sacct -u $USER` / `slurm history`
- `seff 7897826` (single-CPU job)
- `seff 7897849` (multi-CPU job)
- `seff 5246490` (GPU job)
- `ssh gpuXX` -> run `nvidia-smi`

## Applications (RD) (20 min)

modules, containers, conda

- Our cluster has more than 1000 users. You can't install stuff like normal
- There are ways to install software so that it can be shared
- Problems
  - Different people need different versions
  - Can't break the cluster OS
- See our applications page; we don't go in depth.
- System stuff
  - Demo
    - `which python3`
- Modules
  - Make software available relatively normally
  - Demo
    - `module spider matlab`
    - `module load matlab`
    - `which matlab`
    - `matlab -nojvm`
  - We have just changed how things work, so docs may not be up to date
- Containers
  - When something is really hard to install, you install a whole OS together with it
  - Demo
    - `module load apptainer-freesurfer`
    - `apptainer exec ${IMAGE_PATH} freesurfer --help`
  - Link this:
    - https://github.com/coderefinery/hpc-containers
- Conda
  - Conda started for Python but it's good for almost anything
  - It can install almost anything into an environment
  - You had a demo earlier in the morning about this
  - `module load scicomp-python-env`
  - `which python3`
- What you should do
  - When you need something, first read the instructions or ask
  - Don't try to figure it out yourself

## Parallel (ST) (50 min)

Big example: pi example

Discuss why it's hard to know what is parallel when submitting

Could this include speed considerations? (copy from winter kickstart)

Follow the tutorials for this session more closely

- Multiprocessing:
  - Explanation:
    - Quick difference between array jobs and multiprocessing
    - If your code uses multiprocessing, you need to tell it how many CPUs it should use
    - and if you're planning on using multiprocessing, you should tell the queue to give you the resources
  - Action:
    ::: warning
    Let's use pi instead
    - Run single-core job: `srun --pty --time=00:05:00 --mem=2G python3 ngrams/count.py -n 2 --words -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip`
    - Modify to run the 4-core version
      - `--cpus-per-task=4`
      - Increase memory to `--mem=2G`
      - `--threads=4`
      - Change the script name to `count-multi.py`
    :::
    - Show example: `python3 slurm/pi.py 1000`
    - Run single-core job: `srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000`
      - Check seff
    - Modify to run with multiple cores: `srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000`
      - Check seff
  - Explanation:
    - Check run time. Debate whether it was worth it (saves time if it works vs. having to queue longer to get a good slot)
    - Explain that not all programs that can utilize multiple CPUs are optimized for it
    - Mention `OMP_NUM_THREADS`, `SLURM_CPUS_PER_TASK`, automatic parallelization, and oversubscription (see the sketch below)
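    - A minimal sketch of that last point (not part of the original demo): Slurm exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is requested, and many threaded codes read `OMP_NUM_THREADS`, so tying the two together avoids oversubscription:

      ```sh
      # Sketch only: lines that could go inside a batch script which
      # requested, e.g., --cpus-per-task=4.  Forwarding Slurm's CPU
      # allocation to OpenMP-style threaded codes keeps them from
      # starting one thread per core on the node (oversubscription).
      export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
      echo "Slurm allocated $SLURM_CPUS_PER_TASK CPUs to this task"
      ```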
  - Exercise:
    - Run the same commands we did

      ```sh
      srun --pty --time=00:10:00 --mem=2G python3 slurm/pi.py 50000000
      srun --pty --cpus-per-task=4 --time=00:10:00 --mem=2G python3 slurm/pi.py --nprocs 4 50000000
      ```

    - `run-pi-4core.sh`:

      ```sh
      #!/bin/bash
      #SBATCH --cpus-per-task=4
      #SBATCH --time=00:10:00
      #SBATCH --mem=2G

      # Batch version of the 4-core srun command above
      python3 slurm/pi.py --nprocs 4 50000000
      ```

- MPI:
  - Explanation:
  - Action:
    - Compile the MPI program:

      ```sh
      module load triton/2024.1-gcc openmpi/4.1.6
      mpicc -o pi-mpi slurm/pi-mpi.c
      ```

    - Run the program with a single processor:

      ```sh
      srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - Run the program with four MPI tasks:

      ```sh
      srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - Write an sbatch script `pi-mpi.sh` for the job and submit it

      ```sh
      #!/bin/bash
      #SBATCH --time=00:10:00
      #SBATCH --mem=500M
      #SBATCH --output=pi-mpi.out
      #SBATCH --nodes=1
      #SBATCH --ntasks=2

      module load triton/2024.1-gcc openmpi/4.1.6

      srun ./pi-mpi 1000000
      ```

    - Explain why srun is used in the script
  - Exercise:
    - Run the same commands we did

      ```sh
      module load triton/2024.1-gcc openmpi/4.1.6
      mpicc -o pi-mpi slurm/pi-mpi.c
      srun --time=00:10:00 --mem=500M ./pi-mpi 1000000
      srun --nodes=1 --ntasks=4 --time=00:10:00 --mem=500M ./pi-mpi 1000000
      ```

    - `pi-mpi.sh`:

      ```sh
      #!/bin/bash
      #SBATCH --time=00:10:00
      #SBATCH --mem=500M
      #SBATCH --output=pi-mpi.out
      #SBATCH --nodes=1
      #SBATCH --ntasks=4

      module load triton/2024.1-gcc openmpi/4.1.6

      srun ./pi-mpi 1000000
      ```

## GPU (ST, HF) (30 min)

- Teaching:
  - What are GPUs? (ST)
  - How to choose GPUs (HF)
    - VRAM and compute power
    - How to get the list: `slurm p`
  - How programs use GPUs (image of the CUDA stack) (ST)
- Interactive usage:
  - `-p gpu-debug`
  - MIG GPUs on Jupyter
  - Why interactive usage is wasteful
    - If you run a training job, you can run a single epoch and test if your code works
- Action:
  - `module load triton/2024.1-gcc gcc/12.3.0 cuda/12.2.1`
  - `nvcc -arch=sm_70 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_90,code=sm_90 -o pi-gpu slurm/pi-gpu.cu`
  - `srun --time=00:10:00 --mem=500M --gpus=1 ./pi-gpu 1000000`
  - As a batch script:

    ```sh
    #!/bin/bash
    #SBATCH --time=00:10:00
    #SBATCH --mem=500M
    #SBATCH --output=pi-gpu.out
    #SBATCH --gpus=1
    #SBATCH --partition=gpu-v100-16g,gpu-v100-32g,gpu-a100-80g,gpu-h100-80g

    module load triton/2024.1-gcc cuda/12.2.1

    ./pi-gpu 1000000
    ```
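A possible wrap-up sketch (not in the original plan): submitting and checking the GPU job, assuming the batch script above is saved as `pi-gpu.sh` (that filename is our assumption):

```sh
sbatch pi-gpu.sh        # submit the batch script above (assumed filename)
slurm queue             # or: squeue -u $USER
seff <jobid>            # after it finishes: check the job's efficiency
less pi-gpu.out         # output file named by #SBATCH --output
```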