Introduction to Scientific Computing and HPC / Summer 2025 (ARCHIVE)
Info and important links
Day 2
Icebreakers
Programming languages you use? (poll, add o to vote):
- python: oooooooooooooooooooooooooo
- R: oooooooo
- matlab: oooooooo0ooo
- C: ooooo
- shell: oooooo
- C++: ooooooo+1
- java: oo
- javascript: o
- Scala: oooo
- Julia: o
- SML: o
- FORTRAN: ooo
- Ruby: o
- Basic: +1
How long have you used computers? What was your first?
- Commodore 64, around 1983-84 :)
- Some Windows 98 PC in the early 2000's. Intel C2D E8500 & Radeon HD 4870 was the dream setup when I was looking into building my first PC.
- Windows 98
- Intel 80386 with MS-DOS
- That windows with the green hill bg, early 2000s (my parents owned it).
- The first we had at home was the first MacBook in 2006. Before that I had used a PC at my kindergarten :D
- Around 2007, intel p4 with xubuntu and windows xp
- HP Probook, 2019 :)
- Lenovo Yoga, 2015-ish
- 2019 Lenovo
- Commodore 128
- Probably some old Dell around 2012
- 1989 - ATARI 800 xl - coding on Basics :-D
- VIC20 <3… loved basic
- Celeron, running windows 3.1
- 2012, a Mac
- Some Desktop in preliminary school that was brand new when it was bought 20 years ago
- Windows 98 (Early 2000s)
- Windows 98 - 2006
Favorite ice cream?:
- Italian gelato: pistacchio or coffee (or both at the same time?) +3
- Just simple vanilla the best! +1+1
- Vanilla with nougat pieces
- Raspberry-lemon or just lemon/raspberry sorbet
- Salmiakki / salty licorice
- Mint chocolate chip +1
- Stracciatella
- swiss chocolate
- Ben&jerry
- vanilla with a salted caramel swirl
- Stracciatella & Yogurt
- Anything Italian
- Malaga = rum, raisins, vanilla ice cream
Follow-up questions from day 1:
-
Are we going to continue working on gutenberg fiction?
-
I tried to do the exercises in Triton but I have this error:
Same problem with srun -> command not found
- That's weird. Are you sure you were connected to Triton when you tried it? If you were, I would recommend joining the zoom.
Yes it is weird; I connect via kosh -> ssh
- Something could have messed up your PATH variable, but I couldn't say what would cause that.
-
I have a question not related to the course, I am wondering if it's ok to ask or maybe send you a mail about it
- Go ahead if it's appropriate (code of conduct and all)
- This question looks a bit too complicated for us to answer here. If you are at Aalto, you could join our garage sessions at some point?
- I am in Oulu, would it be possible to send you a mail about it?
- I'm afraid our RSE services / garage are for people at Aalto, I don't know if Oulu has something similar?
- Have no idea
- Yes, so I am using FEniCS to do calculations and there's a predefined function assemble() to calculate the integrals. The thing is, I'm using FEniCS (2019 version) to compute a Fourier-type integral of a function q with two formulas that should give the same result, but they don't.
- Rough guess for the cause would be that some of the calculations are not entirely numerically stable (for example they give very small results that could be affected by floating point errors). Then these would be handled slightly differently based on the version. That is a complete shot in the dark though.
- I see, you're probably right because for a small xi value (a parameter) the error is smaller, but thank you.
```python
import numpy as np
from dolfin import *

xi_val = [2, 1]
zeta_val = [2, 1]
x0_val = 0.0
y0_val = 0.0
d_val = 0.1

def V_0(x, xi, zeta):
    return np.exp(-(zeta[0] + xi[0]*1j)*x[0] - (zeta[1] + xi[1]*1j)*x[1])

def V_1(x, xi, zeta):
    return np.exp(((zeta[0] - xi[0]*1j)*x[0] + (zeta[1] - xi[1]*1j)*x[1])/2)

def exp_Fourier(x, xi):
    return np.exp(-2*1j*(xi[0]*x[0] + xi[1]*x[1]))

def phi(x, d, x0, y0):
    norm = np.sqrt((x[0]-x0)**2 + (x[1]-y0)**2)
    if norm < d:
        return np.exp(1 + 1 / ((norm/d)**2 - 1))
    else:
        return 0.0

def q_function_epsilon(x, d, x0, y0):
    return phi(x, d, x0, y0)

mesh = RectangleMesh(Point(-1, -1), Point(1, 1), 32, 32)
V = FunctionSpace(mesh, "CG", 3)

class PythonQExpression(UserExpression):
    def __init__(self, d, x0, y0, **kwargs):
        self.d = d
        self.x0 = x0
        self.y0 = y0
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = q_function_epsilon(x, self.d, self.x0, self.y0)
    def value_shape(self):
        return ()

q_expr = PythonQExpression(d=d_val, x0=x0_val, y0=y0_val, degree=2)
q_proj = interpolate(q_expr, V)

class PythonV0Expression(UserExpression):
    def __init__(self, xi, zeta, **kwargs):
        self.xi = xi
        self.zeta = zeta
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = V_0(x, self.xi, self.zeta)
    def value_shape(self):
        return ()

V0_expr = PythonV0Expression(xi=xi_val, zeta=zeta_val, degree=3)
V0_proj = interpolate(V0_expr, V)

class PythonVExpression(UserExpression):
    def __init__(self, xi, zeta, **kwargs):
        self.xi = xi
        self.zeta = zeta
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = V_1(x, self.xi, self.zeta)
    def value_shape(self):
        return ()

V_expr = PythonVExpression(xi=xi_val, zeta=zeta_val, degree=3)
V_proj = interpolate(V_expr, V)

class PythonQFExpression(UserExpression):
    def __init__(self, xi, **kwargs):
        self.xi = xi
        super().__init__(**kwargs)
    def eval(self, values, x):
        values[0] = exp_Fourier(x, self.xi)
    def value_shape(self):
        return ()

F_q_expr = PythonQFExpression(xi=xi_val, degree=3)
F_q_proj = interpolate(F_q_expr, V)

q_fourier_1 = assemble(V0_proj * V_proj * V_proj * q_proj * dx)
q_fourier_2 = assemble(F_q_proj * q_proj * dx)
print("q_fourier_1:", q_fourier_1)
print("q_fourier_2:", q_fourier_2)
```
Humans of Scientific Computing
Any questions to the instructors (about careers, etc)?:
-
How about the non-humans of scientific computing: do you use chatGPT or similar?
- It's a little helper, so yes when the output (= code) is easy to review.
- Mainly to avoid having to parse through documentation pages / to get examples in languages I'm not that familiar with
- Yes, for parsing documentation and for mindless template filling
- Yes, sometimes when I need quick code skeletons, then I start developing details that (at least until now) LLMs don't reach until you go into details in the prompt
- Sometimes. Usually when I need to reformulate a text I'm writing and cannot find the correct words, but also when I cannot find the correct syntax for the language of a framework I'm using.
-
What’s the most exciting project or task you've worked on in your role?
- Hosting+running+teaching the CodeRefinery workshop. Here the last workshop and its topics for those who are interested: https://coderefinery.github.io/2025-03-25-workshop/
- I don't have a single thing, but several projects where I clearly see that I could help the researchers save lots of their time.
- Collecting data from personal devices people carry around everywhere
- Mining Refactoring data from OSS projects and training LLMs for refactoring recommendation.
- Helping a data center for a major astronomy satellite with their data center setup.
-
What's the most important skill you look for when you hire RSEs?
- Attitude towards good coding practices and towards willingness to help people.
- Ability to help people and some public code examples (willingness to publish software)
- Many people have good technical skills, it's harder to find someone with good desire to help others/social skills. (technically, the real thing is being able to learn new things)
- I actually think it's not so much social skills and more about the will to help, and being supportive. Sure, completely hiding in your room not wanting to interact with others wouldn't help.
- At least to me it's all about willingness/social skills to work in a team, be able to explain your work, and listen to better approaches.
-
What are your recommendations for someone interested in neuroscience? What to study etc (in addition to the neuro courses university offers)?
- Reproducibility and how to enable it (version control, reproducible environment, avoiding questionable research practices like p-hacking, HARKing, etc.., registered reports or preregistrations)… well this is for any field of science in the end. :)
- Specifically for the neuroimaging field, get a good grasp of statistics!
- Terminology is subtly different between different imaging modalities/subfields, if you read books/papers be aware of what subfield the author is from.
-
What's the most common (and maybe easy to learn/fix) mistake that people working with HPCs make? And what's the most challenging part that many have a hard time to have a grasp of it?
- One hard thing to grasp is the workflow where you need to deal with waiting and be productive while your jobs are running. Usually with the queue system one needs to submit jobs and do something while waiting for the jobs to finish. This can feel strange as we're often used to doing something and seeing results of our work immediately. A common mistake and easy to fix thing might be to mix `bash` and `sbatch`. They are different things: `bash` is the terminal program while `sbatch` is Slurm's queue submission command.
- Testing before starting a large run. If you run a small test case before submitting a multi-day run, you can avoid a lot of extra wait time.
Conda
https://scicomp.aalto.fi/triton/apps/python-conda/
-
For CSC, check their Tykky page for information on how you can use conda with their containers.
- Yes! We teach generic conda here. Different clusters, especially the CSC-managed ones, have special instructions to make it scale better.
- Consider this an intro to understand those pages better.
-
Helsinki University Turso users: load mamba with the command `module load Mamba` (with a capital "M"). See `module spider Mamba` for more info.
-
to use mamba do I have to first run Triton?
- I don't understand the question. Do you mean if you need to be connected to Triton?
- Our demo is happening on Triton (where Mamba is pre-installed), but it can work on other computers if you install it first.
-
I get this error :
- What cluster are you working on?
- Lehmus, i don't think i have this module
- Some version of it probably exists, but it could be called something else. You would need to ask your cluster admins / look at your documentation for specifics.
- Check the `anaconda` and `conda` modules with `module spider MODULENAME`. Those could be ready-made environments, or one of them could be meant for building your own environment.
-
Can you clean the terminal so we can see the commands on top? +1
-
i mean how can we use mamba?
-
Is UV a good replacement for Conda? At least for local software management it's great. How about HPCs? (should be the same technically)
- We don't have much experience with uv yet, but as I understand it's essentially a better replacement for pip. Conda's biggest benefit is that you can get a lot of "system" libraries in your environment as well, which might not be as convenient to get on a cluster otherwise (behind modules etc.)
- So they both have their pros and cons essentially.
- Another benefit to conda is that most of the packages are precompiled binaries, which is nice on a heterogeneous cluster (nodes with different architectures). Otherwise you might need to do a lot of work to get things working on all nodes at the same time.
-
I didn't know about the `mamba run` thing - how new is it?
-
How do I exit the environment tab.
- You can deactivate the environment with `source deactivate`
-
Is conda like python3 venv? Where you can install various modules/dependencies?
- conda & mamba are package managers that can install various dependencies into an environment. It is often used to install python environments.
- Broadly yes, they solve similar problems. conda was made because virtualenvs didn't handle complex compiled dependencies well (often needed for science)
-
I think that in CSC conda is now deprecated?
- See the answer at the top of this section. CSC uses Tykky.
-
Why do we create environments? Can't we just work directly and install the packages needed?
- In that situation you can only have a single set of packages that are active all the time. So if you have multiple projects that need different packages you will quickly encounter problems setting up a single installation that solves needs of all projects. In addition, some of the dependencies can be really hard to install. So using tools that allow you to install complex dependencies helps a lot.
- you could re-install conda/mamba multiple times to get different base envs… but those are environments anyway. So may as well share the overhead.
- Thank you, it's clear now.
-
How does it work under the hood? What's the difference between Mamba/Conda on a local system versus on an HPC?
- No real difference. Besides that mamba is usually provided as a module, instead of being installed directly.
- Under the hood, Conda and Mamba solve the environment based on the requirements you have specified in the `environment.yml`. Then they download pre-built packages that are extracted to a common package cache. Then an environment is created by creating a folder structure and linking the relevant packages from the common package cache. Conda and Mamba use the package cache so that they can re-use already downloaded packages. But the tools work the same on local machines and on HPC machines.
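- For reference, such a file is just a short YAML description of what you want in the environment. A minimal sketch of creating and using one (the name and package list below are placeholders, not the exact file from the course notes):
```bash
# Write a tiny environment file (contents are illustrative only)
cat > environment.yml << 'EOF'
name: my-analysis
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy
  - pandas
EOF

mamba env create --file environment.yml   # solve and build the environment
source activate my-analysis               # activate it, as done in the demo
```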
-
While running 'mamba env create' I got an error 'libmamba YAML spec file environment.yml not found' and 'critical libmamba File not found. Aborting'
- Did you create an `environment.yml` file yet? `mamba env create` always assumes you have a yml file, and with no arguments expects that it is in your current directory and is called `environment.yml`.
- Ah, I created environment called 'tidyverse.yml', but followed the scicomp tutorial which had the name 'environment.yml'. Problem solved
- You can do `mamba env create -f filename` to create an env from a file with a different name.
-
If mamba doesn't exist in the cluster can we just replace it by conda right ?
- Yes, mamba is just a drop-in replacement for conda that does same things. (It's just faster since its written in C, whereas conda is in python iirc.)
- thanks
- Interestingly, you can use conda to install mamba
- How?
- You can create an environment like this:
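- A minimal sketch of what that could look like (assuming the conda-forge channel is available; the environment name here is just a placeholder):
```bash
# Use conda to make a small environment whose only purpose is to provide mamba
conda create -n mamba-env -c conda-forge mamba
source activate mamba-env
mamba --version   # mamba is now available and can build your other environments
```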
-
I am curious how your file directory looks like. Where will you have the conda env files except the R and yml file that we are seeing on the screen?
- Conda and mamba install the environments to an environment directory. By default it goes to your home directory, `~/.conda/envs`, but at the start of the session we specified that we wanted the conda environments to go to `$WRKDIR/.conda_envs/`. This was done during the first-time setup. Usually you do not need to know where the environment is. Once it is activated, programs from the environment become available from there.
- Okay. So, let's say that each of my environment has same python3 version. So each environment will then install it separately and have equal number of copies of that?
- The environments share a common package cache where the packages will be downloaded and packages in the environment point to those packages. So if you have the same python version in all environments you won't be using that much more disk space. However, you will create more files (links from the environment to the package cache).
- Thank you!
-
What are the channels?
- Channels are essentially repositories for packages. Nvidia has packages managed by Nvidia, and conda-forge is an open-source channel containing a large variety of packages.
-
Why can't I activate the environment? says 'critical libmamba Shell not initialized'
- Do `source activate ENVIRONMENT_NAME`. Using `conda activate` requires running conda's initialization script, but this comes with a bunch of other undesirable side effects on a cluster.
-
Should I always specify the exact package versions to guarantee that it is reproducible? Or is it generally safe to leave the version unspecified?
- Not specifying the version means downloading the latest available version. Usually this is safe, but if you need a specific one, it's better to mention the version.
- You can have a version of the environment that has the locked versions for e.g. showing exact versions that were used when doing a paper, but usually keeping the version numbers loose makes it easier to maintain the code.
- For a published version of my software or data, I would specify exact versions. When developing, I keep to the latest unless there is a reason to change (like the reason we set python to 3.12, it would have failed with 3.13.)
- Is there some mamba command I can use to export the exact versions of an env once I am ready to publish (i.e. update the yaml with the current versions)?
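- One way (a sketch, not something shown in the session): `conda env export` (mamba has the equivalent command) prints the solved environment with exact versions, which you can save alongside your loose `environment.yml`:
```bash
source activate my-env                      # the environment to snapshot (placeholder name)
conda env export > environment-lock.yml     # exact versions, including build strings
# --no-builds gives a slightly more portable file without the build strings
conda env export --no-builds > environment-lock.yml
```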
-
What was the command that we used to override for Cuda?
- `export CONDA_OVERRIDE_CUDA=cuda_version`. This sets an environment variable that tells conda that this version of cuda will be available.
-
I am getting the following error when trying to create the mamba env: critical libmamba filesystem error: cannot create directories: Read-only file system [/appl/scibuilder-mamba/aalto-rhel9/prod/software/mamba/2025.1/f67be15/envs/pytorch-env/conda-meta]
- It sounds like `mamba` is trying to install the environment in the installation directory of the mamba module, which can't be modified by users. Not sure why though, what commands did you run?
-
I get the error :
error libmamba Could not solve for environment specs The following package could not be installed └─ pytorch-gpu >=2.6,<2.7 * is not installable because it requires └─ pytorch [==2.6.0 cuda*_generic*200|==2.6.0 cuda*_generic*201|...|==2.6.0 cuda*_mkl*304], which requires └─ __cuda =* *, which is missing on the system.
- my yml file is this
- I used this: `mamba env create --file pytorch.yml`
I am not able to get out of this environment, I mean the page in the terminal. What should I type? ctrl 0?? ctrl X?? +1
- What does it look like? Is it asking if you want to install the environment?
- In general Ctrl + c will exit almost any command, if you are lost.
- yes this is what i wanted. thanks. but one more question, does it save though??
- You need to run `export CONDA_OVERRIDE_CUDA=cuda_version` before the mamba command.
- I get a new error, sorry having trouble formatting
- The reason for the error is that the login node does not have any GPUs, so mamba does not find any. We need to tell it we have some anyway.
- If i want to actually use some GPUs how would i activate them?
- You need to ask for GPUs in your batch script. Then you activate the environment with `source activate env-name` in the batch script.
- can you show an example please :D, not sure i get batch script fully
- A batch script is a series of commands you want the compute nodes to run. It also tells what resources you need for the command. The serial job tutorial has some examples. Later we will have a section about GPUs.
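- As a rough sketch (resource numbers, the environment name and the script are placeholders; GPU options and partition names vary between clusters):
```bash
#!/bin/bash -l
#SBATCH --time=00:30:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2
#SBATCH --gpus=1                  # ask Slurm for one GPU
#SBATCH --output=gpu-job_%j.out

module load mamba                 # or whatever provides conda/mamba on your cluster
source activate pytorch-env      # the environment created above
python3 my_training_script.py    # placeholder for your own GPU code
```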
-
what how do i resolve this error above?
- Mamba does not find a GPU on the system. This is probably because the login node does not have a GPU. You can override it with `export CONDA_OVERRIDE_CUDA=12.6` (or replace 12.6 with another CUDA version)
- This environment will work on the compute nodes that have a GPU
- Thank you that worked
-
CSC does not seem to support use of conda environments on the parallel filesystem anymore.
-
What is your policy on anaconda packages and Anaconda license?
- We strongly recommend not using defaults channel because of licensing issues. For now at least, conda-forge packages don't have licensing issues.
-
what's this error for $ source activate conda-example ? EnvironmentNameNotFound: Could not find conda environment: conda-example (helsinki cluster )
- Did you successfully run the command `mamba env create --file environment.yml` before you got the error?
- The name of the environment could be different in the file.
- actually it doesn't load Mamba.
- `module load Mamba` should work in Turso.
- Capitalization matters. So for Helsinki, is it `Mamba` and not `mamba`?
- module is loaded. thanks. just about the environment.yml
- Do you have conda? `conda --version`
- yes, 23.11.0
- You can try `conda env create --file environment.yml`, it should do the same thing.
- thank you. but error: EnvironmentFileNotFound: ' ~/environment.yml' file not found
- Did you create an environment file? The file name could be different from `~/environment.yml`. (The `~/` part refers to your home folder. If the file is in the current folder, remove that part.)
- I checked that the pkg and env are ok. but same error. did i miss something
- What is the name set inside the environment file, in the `name: something` field? That is the name you have to use to activate the environment.
- Which error was it? The `EnvironmentFileNotFound` means it does not find the .yml file and cannot create the environment. What files are in the current folder? (You can check with `ls`)
- gutenberg-fiction hpc-examples ondemand poem.txt workdir. I cannot find the environmentfile
- Indeed, no environment file there. Please create one using a text editor such as `nano`, so `nano environment.yml`, and then copy-and-paste the yml file content from the course notes, save the file (Ctrl-O) and then exit nano (Ctrl-X).
- this I solved with text editor. thank you
-
can you send the previous commands somewhere? I got lost somewhere
- thanks!
- Can the ones used in the very beginning be found somewhere?
-
what does the -l in the first line of the .sh do?
- The `-l` flag for bash tells it to run as a login shell. It doesn't make a huge difference, but it causes bash to run some profile files again etc.
- In practice on most slurm clusters it should make no difference, since by default slurm will export all environment variables you had at submission time.
-
just an aside, how do you keep track of all the commands you type on terminal in another screen (as visible on the top right of the screen)?
- We use prompt-log, which we wrote specifically for these workshops.
- That's cool. And thanks!
- All shells have a built-in history which is useful for seeing your own history.
-
Does each job have 1GB of memory allocated or 100mb?
- Each has 1GB. Array jobs essentially launch copies of the script, however many times you specified. Only difference between jobs is the value of the `SLURM_ARRAY_TASK_ID` environment variable. (And potentially the node and cpus it is running on etc.)
- And yes, this is a good question to answer: how should resource requests be modified for arrays?
-
Can you break down the whole python 3 command to the end ?
- what does this mean?
- The python command in the batch script?
- They are doing it. Nevermind
-
--step=10 what is that?
- It was an argument specific to that python script, that tells it to process every 10th book. So it starts at the value given with `--start`, then next it does `start + step` and so on. Since we had an array of 10 jobs, this means all of the books will be processed and there is no overlap between processes.
- count.py had these options specifically added to make it suitable for array jobs.
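- Schematically, the submission script does something like this (a sketch only, not the literal count-ngrams-2-array.sh; the remaining arguments to count.py are omitted):
```bash
#!/bin/bash -l
#SBATCH --time=00:30:00
#SBATCH --mem=1G
#SBATCH --array=0-9      # ten tasks, SLURM_ARRAY_TASK_ID = 0..9

# Each task starts at its own offset and jumps in steps of 10,
# so together the ten tasks cover every book with no overlap.
python3 ngrams/count.py --start=$SLURM_ARRAY_TASK_ID --step=10  # plus the input/output arguments used in the demo
```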
-
I was wondering, one could also write a bash script to split these 100 subtasks into 10 batches. I have seen people using srun to split the resources. Where can an array job be beneficial then?
- One difference is that if you have a single sbatch script with multiple srun calls, the whole thing needs to sit in the queue until it can get all the resources. Each step of the array job can queue independently and it will be easier to find 10 nodes with 10 free cpus compared to one with 100 for example.
- In addition I think the scheduler can handle array jobs a bit better, compared to same number of independent jobs. (If you were to do it the other way around)
- Got it! Thanks!
-
I followed along with what you did, however the output files are all named slurm-NUMBERS-0.out, and every single one shows 'No such file or directory when opened'
- Slurm creates these by default for each task. These are different from the files created by the python command.
- The error seems to be that the Gutenberg files are missing. Are they in the current folder?
- They should be, but I keep getting the same error –> Solved
-
Let’s say I have two hyperparameters, x1 and x2, both ranging from 0 to 9. I want to run all the different combinations of them, basically the Cartesian product x1 * x2. I can run my script like this:
python my_script.py x1 x2
What’s the best practice to manage and run all these different hyperparameter combinations? or maybe I can have more than only x1 and x2, what if I have more than 2 hyperparams to sweep?
- You could modify the Python program to take one argument and do the mapping internally (or maybe more like a separate python program to do that). But Bash can also do a 1D -> 2D mapping, I think we have an example in the docs.
- If you are asking this question I think you can figure out some ways to do it - you are on the right track!
- Relatively simple way would be to have an array of 100 jobs, then use `floor(SLURM_ARRAY_TASK_ID/10)` as `x1` and `SLURM_ARRAY_TASK_ID mod 10` as `x2`. But you could use whatever kind of logic you want.
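- As a concrete sketch of that mapping inside a batch script (assuming x1 and x2 both run 0..9, so 100 combinations):
```bash
#!/bin/bash -l
#SBATCH --array=0-99          # one array task per (x1, x2) combination

x1=$(( SLURM_ARRAY_TASK_ID / 10 ))   # integer division gives 0..9
x2=$(( SLURM_ARRAY_TASK_ID % 10 ))   # remainder gives 0..9
python my_script.py $x1 $x2
```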
-
if I want to delete the files now, how can I do it, since it looks so messed up?
- The `rm` command will delete files - check before pushing enter, since there is no undelete! If you are unsure, run `ls` with the arguments first to verify.
- You can use the `*` glob like `rm ngrams-array_*.out` to delete them in batch. It helps to name files consistently so it's easy to list and delete like this
- Thanks very much!
- It does not work to delete them all together, only individually:
- `rm ngrams-array_*.out`
- `rm ngrams-2-array-0.out` (this one works)
-
After using the command "less file_name.out" to view an output file, is there a way to get out of it? For example when I try to view the non-array output I go into this huge list of results from that output from which I can't find a way to get out.
- You can exit less by pressing `q`.
- If it says "calculating line numbers", a "control-c" will stop that, and then `q` works.
-
what is the output I should get? How do I know if it succeeded?
- The job is writing a log file (usually in the location where you submitted it). You can investigate the log file and see if it worked.
Array exercises (until xx:55)
Try to do what we have done.
Here's the submission script `count-ngrams-2-array.sh`:
Submit with `sbatch count-ngrams-2-array.sh`.
Combine data with `python3 ngrams/combine-counts.py ngrams-2-array*.out -o ngrams-2-array.out`
Try more advanced array examples here: https://scicomp.aalto.fi/triton/tut/array/#hardcoding-arguments-in-the-batch-script
I'm
- Done: ooooooooooooooooooo
- Not trying: ooo
- Having problems: o
-
Note on best practices: Running a huge number of short jobs is worse for the cluster than a smaller number of longer jobs. So if you have an array of 1000 jobs that each take 10 minutes, it's better to rewrite it as an array of 100 jobs each taking 100 minutes.
- But this doesn't matter for smaller arrays.
-
Which one consumes more credits with regard to queue priority: 1000 jobs each taking 10 mins, or 100 jobs each taking 100 mins?
- I am not sure, but 1000 jobs will have 1000 wait times in the queue (if I had to queue for a long time, then I'd rather keep my turn for longer times). But then you can run in theory 1000 jobs in parallel so it would be faster..
- You will usually get the maximum throughput (maximum amount of things done) when you have a medium sized job parallelized with array. It is much easier for the queue to fit a medium sized job than a very large one. Both should have a similar effect on the priority. You can think about the box of "CPU x RAM x time". If you do an array job you just divide a box of volume "CPU x RAM x long time" into smaller boxes "N array jobs x CPU x RAM x long time / N array jobs". Priority is calculated based on used resources.
- Thank you for answering! I had the experience before about 1000 jobs with each taking 3-4 mins. Then I divided it into 100 jobs with each taking around 25 mins. The results is 100 jobs run much faster due to the shorter queue time and also the starting time.
- If you have a lot of small jobs the startup can also take a meaningful amount of the overall job runtime. When programs start they need to fetch libraries, executables, data etc. that can take some time and if you do the same thing multiple times in the same job this data fetching is kept in the computer's memory and it does not need to fetch everything again. This is especially true if your program uses a common starting dataset and the only difference between runs is a parameter / seed change.
- Yes! That is true. I did the parameter sweeping. Thanks for your explanation.
-
i got this error on mahti,
sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
- You might be requesting resources that are not compatible OR you might be missing your CSC project ID. If you can paste the code, we can help more. (Here is the Project Number: 2014489)
- CSC always needs an `--account` parameter
-
I am trying to run this pi.py array task but I am having an issue with the batch script somehow. It cannot find the slurm/pi.py file
- are you in the hpc-examples repository?
-
nvm I found the error
-
Is there a cheat sheet for all the slurm keywords somewhere?
-
What did I miss?
- [user@login4 hpc-examples]$ sbatch pi-array.sh
sbatch: error: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
sbatch: error: Batch job submission failed: Invalid job array specification
- Can you paste the script that you are submitting? Maybe an error in the SBATCH directives?
-
^^^ There is a dot (.) instead of a dash (-) to define the range of IDs. It should be #SBATCH --array=0-4
https://slurm.schedmd.com/job_array.html
-
Is there a way of knowing how long a specific batch job will take, for example is it on the order of tens of minutes or tens of hours. Or would you just have to know how fast your own program is?
- It depends on many parameters (e.g. which node will process the job, is there lots of I/O), but as you wrote, a decent approximation is to run one case you have, and then scale "almost" linearly (e.g. 100 iterations of a pipeline, try running one iteration and then multiply times 120 for a worst case scenario)
-
Could there be something wrong with this? Because when I run it, and I open up the file I don't see a result.
I essentially see this:
- Which file is showing that? pi_$SEED.json? Could it be that it goes out of memory and gets killed?
- When I run this command after running the sbatch,
all of these files show the output above.
- Do you have any pi_$SEED.json files in the same folder? The script has a ">" which means "redirect the output to a file …json": those are the results of the computation. The .out is just a log (what you would see in the terminal if you were running it interactively).
- I do have it, I think at least
- Then look at the json files, those are the results.
- I only see this.
- You also have pi_123.json if the bit above is the content of your folder :)
- I do, thanks for holding my hand through this, very dumb of me.
Continue with some array examples:
Monitoring
https://scicomp.aalto.fi/triton/tut/monitoring/
Triton only: we made these jobs to test:
seff 7897826 (single CPU job)
seff 7897849 (multi CPU job)
seff 5246490 (for GPU job)
-
How long does the HPC system keep job history? For example, when I run `slurm history`, what is the earliest job date it will show?
- Triton: I don't think we have a delete policy, but we do prune the database for efficiency, so I would expect that beyond 1 year in the past we cannot guarantee that the history is there.
- By default `slurm history` shows the last two weeks I think, but you can give it arguments to show older jobs. For how old, see the message above.
-
Considering that we can have profilers as part of our code, which can be very detailed, how can we benefit from this short summary such as `seff`? I assume it's just there as a very easy way of knowing info.
- If you have an existing profiling solution those might indeed work better for your use case. The information stored in the slurm database is usually a good first step if you do not have such tools available to you.
- Biggest issue with fancier profilers is that they usually cause far greater overhead and are far more complicated to use. So it's just a question of how heavy-duty tool you need.
-
Is there a way to save all this cpu and mem usage info in the slurm-JOBID.out file produced at the end automatically? i.e. without having to manually check it for each job.
- What about adding `seff $SLURM_JOB_ID` as the last command inside sbatch_script.sh?
-
Seff won't give you reliable information if the job is still running. So eventually, at the end, you run a script that appends the output of seff to each log.
-
would this work, viz. seff is run last ?
-
I am not sure which tasks are going to be checked, but here you are still checking with seff the jobid that is currently running, so it won't give reliable info. One could just have a final bash script that looks at all the slurm-JOBID.out files and then runs seff (very fast to run) and appends to the slurm-JOBID.out.
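- A rough sketch of such a clean-up script (assumes the default slurm-JOBID.out naming and that the jobs have already finished):
```bash
#!/bin/bash
# Append the seff summary of each finished job to its own log file
for log in slurm-*.out; do
    jobid=${log#slurm-}      # strip the "slurm-" prefix ...
    jobid=${jobid%.out}      # ... and the ".out" suffix, leaving the job ID
    seff "$jobid" >> "$log"
done
```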
-
Good profilers:
- https://github.com/plasma-umass/scalene : Profiler for python code that does not require any changes to the program
Applications
https://scicomp.aalto.fi/triton/tut/applications/
-
Is python 2 supported in Triton? What to do if I want to run some old python 2 code?
- Conda is your best friend: create a conda environment with a specific version of python2 and you can run python2 software.
-
what is docker? I heard this term thrown around before…
-
What is the name of the cat??
- Codename is CATS
- Cute cat!
- Lovely cat
- Bonus if you know the origin of the name
- Computational Animal Troublemaker and Support?
- Cute Animal That's a Scientist?
-
Does CATS know how to use HPC?
- In the HPC-kitchen metaphor CATS is a data management specialist (eating all the misplaced food)
- So efficient waste management software?
- Garbage collector.
- yes, also a chaos monkey to improve our resilience
- CATS appears in many of the HPC-kitchen videos if you need more motivation to watch them
Responsible Computational Research
https://scicomp.aalto.fi/scicomp/rcr-scicomp/
-
What I have noticed is that some cherry-pick the "random" seed (not so random) to show stable, reproducible training.
-
ALLEA - European Code of Conduct for Research Integrity: the link is not working for me.
- I'll fix, it will probably update in a minute or so
-
The link to the ALLEA book doesn't work. I clicked on it, but the page doesn't exist. I mean the first link in the section "Level 1: Foundational ethical principles in research"
-
Can you comment about different levels of "goodness" for different projects - does every project take the same practices?
- Could you specify a bit more what kind of comment you are looking for? What we think is needed for different kinds of projects? Or what are you looking for?
- I'm not quite sure… are the recommendations for a masters thesis the same as for a large study? etc?
- in general: yes. it's just that the scale is a lot smaller, so a lot of administrative stuff should be done not by the person doing the thesis but by those providing the thesis opportunity.
- In some cases (e.g. EU horizon big projects) there are many administrative tasks that a master thesis (which could be part of the same project) does not have to.
-
Haven't there been conversations lately about younger programmers actually not knowing how to program, since many depend so much on, for example, Copilot?
- Could be! And in general some studies have shown dependency, anxiety, burnout related to genAI use… (there is one review linked in the page)
-
what does it mean 'there is no cloud computing'? What is cloud computing? Is this what I do when interacting with chatGPT to code?
- It's anything that you run "on the internet", but what's meant with "there is no cloud computing" is that you should not assume that this cloud is something that isn't controlled by someone else.
- Yes ChatGPT runs "somewhere" (= cloud = most likely some machine room in USA) and your uni/organisation cannot guarantee that they will protect your work, even though OpenAI/ChatGPT promised you to not leak your data. Maybe there are no risks with your code/data, beyond your own professional risks of being scooped.
- I heard that if we are using AI resources then we should use Aalto AI instead of ChatGPT, because it's apparently safer. But isn't it still running ChatGPT so it still has vulnerabilities?
- Indeed at Aalto (and Helsinki Uni has something similar) we have an interface to the GPT models so that they are run in Europe, in a Microsoft Azure data center in Sweden. A bit more data protection, but of course the probability of something going wrong will never be exactly zero (so don't paste your dataset there if it's secret data :)). This is only the "C" of the CIA triad (Confidentiality); there are other issues and vulnerabilities related to how LLMs work.
- Still someone else's computer and US can request it with/without a warrant?
- Maybe? :) How much do you trust Microsoft? :)
Break until xx:12
- Then "parallel", the last session of the day. We cover two things,
- Shared-memory and multiprocessing parallelism (relatively commonly used for single-node stuff)
- MPI parallelism (for the biggest codes)
Parallel - shared memory
https://scicomp.aalto.fi/triton/tut/parallel-shared/
-
.
-
What was `--pty` again? +2
- Launches the job in pseudo-terminal mode. Usually it's used to get an interactive terminal on a compute node with the combination `--pty bash`.
- You can compare yourself with `srun bash` and `srun --pty bash`.
-
How many CPUs does srun use by default? Is there a max or min?
- All of these can be set by admins, but generally the default is 1 cpu and there isn't an explicit max for CPUs (at least one you would realistically hit).
-
What is the difference between --cpus-per-task=4 and --nproc=4? Why are they both 4?
- `--nproc` is the argument specific to that code, that tells it how many cpus to use. `--cpus-per-task` is the argument for slurm that tells how many cpus you should reserve per task. (By default you have one task.)
- So they are both 4 because we wanted to reserve 4 cpus and then tell the code to use all 4.
-
what is the limit of CPUs that we can ask the program to use?
- This is going to depend on the program. Often there is no limit (assuming the program knows how to use multiple CPUs in the first place), but you might see diminishing returns with a huge number of cpus. In addition you need fancier methods to parallelize across nodes, so without those you are limited to the number of cpus on a single node.
- There is an exercise "Parallelism-1: Test Scaling" that discusses this some: look at the solution, it shows the performance decreasing as you use more and more processors.
-
I am using Puhti, I set cpus-per-task=4 but when I checked seff after the job was finished 'Cores per node: 8' In the interactive sessions it says 2 threads per core will be allocated, does it have something to do with this and what does it mean?
- In some clusters the computers have hyperthreading enabled, which allows for multiple threads to run on a single CPU. So "Core" here refers to one of these threads, so "CPUS per task * threads per CPU = Cores". So if you request 4 CPUs, you get 2 threads and thus 8 "cores". It depends on a program whether they can use hyperthreading efficiently so the number you should tell the program can be either 4 or 8.
- This exact behaviour is also going to depend on the cluster. Some clusters will treat each hyperthreaded logical core as a "CPU", in which case you would get 2 physical cores with 4 threads and logical cores.
-
How can we set the number of CPUs in the bash script? I did not really get how it automatically worked
- Slurm automatically sets various environment variables, including `$SLURM_CPUS_PER_TASK` (which should be the same as `--cpus-per-task`), which can be useful to keep the info in only one place.
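- A small sketch (the program name and its --nproc flag stand in for whatever your own code uses to set its worker count):
```bash
#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=4

# Slurm exports SLURM_CPUS_PER_TASK into the job, so the CPU count
# only needs to be written in one place (the #SBATCH line above).
srun python3 my_program.py --nproc=$SLURM_CPUS_PER_TASK
```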
-
How to parallelize cpu across nodes?
- We'll talk about MPI parallelization next, which does just that. There are some other libraries as well.
-
Why does the CPU efficiency decrease as you use more cpus?
- Not every operation can be parallelized. Basically every program has parts that need to be done sequentially, and in those cases rest of the CPUs are just waiting.
-
Why is this section called 'Parallel shared memory' if we just allocate more cpus to work in parallel? Where is the shared memory?
- The parallelization is done using shared memory paradigm. Each thread has access to shared memory pool and uses that to communicate. We will go over another version of parallelism that doesn't require shared memory later (MPI).
-
What is the best way of implementing this in our own code? Just using multithreading? What is the optimal number of threads in the cluster? Usually 8? What about multithreading?
- Often programs will get diminishing returns after 8 or 16 threads or so, but it will depend on your code. The biggest factor is how large a portion of your code can actually be parallelized. If a lot of it still has to run sequentially, you will notice diminishing returns because of the overhead from initializing the threads, among other things.
-
Is multithreading then the best way to go about it?
- It depends. Multithreading is often easier than MPI and has different pros and cons. Often the most efficient method is a combination of both, e.g. 4 MPI processes each with 4 threads, using 16 cores in total. I think we quickly go over this later as well.
-
Why does slurm q only return this:
-
Why I got 0 CPU efficiency:
-
[user@login4 hpc-examples]$ slurm q
JOBID PARTITION NAME TIME START_TIME STATE NODELIST(REASON)
- Your job is no longer in the queue. It probably already ran (or failed.) You can check what happened to it with `slurm h` or `slurm history`.
-
if I can't see anything when writing slurm q, how do I get the job id?
- Your job probably already ran and is no longer in the queue. Try `slurm h` or `slurm history` instead. (Same command, the first one is a shorthand)
Exercise (we return at xx:35)
Exercise:
- Run the same commands we did
`run-pi-4core.sh`:
- There is more you can experiment with in the documentation pages.
I'm:
- Done: ooooooooooooooo
- Did not try:
- Having problems:
MPI parallelism
https://scicomp.aalto.fi/triton/tut/parallel-mpi/
-
does --ntasks=4 mean that we run 4 independent threads?
- By itself they are 4 completely independent processes, but MPI provides means for those tasks to communicate with each other and work as a collective.
-
I only have openmpi as a module
- That should work
- ok, isn't it another different module or way of doing parallelisation?
- No. OpenMPI is one implementation of the MPI standard. Another common implementation is MPICH.
-
Note about MPI: MPI is just a standard for implementing this communication between processes and there are multiple implementations. `openmpi` is a common one, but your cluster might use another one such as `mpich`.
-
What does --nodes mean? And --ntasks?
- `--nodes` tells slurm how many nodes (physical computers) you want to request. The relevant thing here is that all CPUs/cores on a node have access to the same shared memory, allowing you to use multithreading. With multiple nodes they can't all access the same shared memory.
- `--ntasks` tells slurm how many tasks you want. Then generally srun will launch that many copies of the process and initialize the MPI environment so they can communicate with each other.
- `--nodes`, `--ntasks`, and `--cpus-per-task` have a complex relationship and you can dig deep into this.
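- A minimal sketch of an MPI batch script (module and program names are placeholders; some clusters want `mpirun` or `srun --mpi=pmix` instead, as discussed below):
```bash
#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=2             # two physical computers
#SBATCH --ntasks=8            # eight MPI processes in total

module load openmpi           # provide an MPI implementation
srun ./my-mpi-program         # srun starts all 8 ranks and sets up MPI between them
```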
-
what is the diff between srun and sbatch?
- The most important difference is that srun starts the job interactively, whereas sbatch makes it run in the background.
-
Should I run pi-mpi.sh with srun then?
-
running the slurm script returns the following error:
- What cluster are you on? Some clusters do not have their MPI implementation built with Slurm support. In that case you need to launch the processes with `mpirun` inside your script to initialize the MPI environment.
- this was run in Turso
- I'd recommend checking the cluster's docs, but `mpirun` might work as well
- mpirun worked, thank you!
- We used `srun --mpi=pmix` and that also worked. Source
-
What do the first 2 lines in the exercise do?
- They load the MPI installation (at Aalto we use OpenMPI) and compile the C code using the MPI compiler into an executable.
-
I get this error: pi-mpi: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory while i have pi-mpi in the same folder
- Do you have an mpi module loaded?
- i loaded openmpi cause couldn't find mpi in the modules
- What cluster is this? It seems to be missing some libraries but I think those should be provided by the module
- Lehmus, but it's weird that the first command, the compilation, worked
- I would suggest asking your cluster admins. You could try the zoom room?
- yes what is the link? and are you connected after the course?
- You should have received the email. What university's cluster was Lehmus by the way? I think there were at least people from helsinki and oulu.
- And I think the room will stay open for a short moment after course at least, but might close soon if there are no questions.
- It's oulu's, ok i'll try to connect now, thank you.
Exercise until xx:55
- Run the same commands we did
`pi-mpi.sh`:
I'm:
- Done: ooooooooo
- Did not try: o
- Having problems:
Feedback day 2
News for day 2 / preparation for day 3
- We covered what was in the schedule
- There are more exercises/practice at the bottom of every page.
- The main sessions are about GPUs and LLMs
- Try to check GPU instructions on your cluster, it will be useful tomorrow.
- A wrap-up session where you can ask us anything
Today was (vote for all that apply):
- too fast: ooo
- too slow: o
- right speed: oooooooooooooooo
- too slow sometimes, too fast other times: oo
- too advanced: o
- too basic: o
- right level: oooooooooooooo
- I will use what I learned today: oooooooooooooooo
- I would recommend today to others: ooooooooooooooo
- I would not recommend today to others:
One good thing about today:
- Learned how to create environments and the use of them
- Array jobs was explained well +4
- CATS <3
- Covering many topics, especially that parallelism is not "magic", it's an investment worth spending time learning.
One thing to improve for next time:
- More exercises? +3
- The section where we did Conda felt a bit rushed, and working with an example that was not in the documentation was also a bit hard to follow. +1
- A little more involved exercises (not just copy-pasting commands) and more time to do those could enhance learning. On the other hand it is understandable that this occurs when one actually wants to use Triton etc. for their own purposes.
- I missed one thing during the Conda example and couldn't find it in the documentation, it was a bit confusing
- the conda part was kinda slow / lamely presented
Any other feedback?
- CATS was lovely +11
- Nice examples and mix of topics. I also liked the philosophical break.+2
- Asking anonymous questions and getting immediate answers shows the true power of online teaching! +8
- ..
General questions continued:
Day 3
Icebreakers, day 3
What is your favorite fruit, and why?:
- Orange; easily accessible, color, works for all season.
- Watermelon; Very fun experience eating them growing up, just holding a big slice and chomping on it :D
- watermelon
- Tomato; used everyday in cooking
- Nectarine
- Avocado; healthy and tasty
- apple. a pleasant eating experience
- "Grenade apple" (Pomegranate) +1
- Mandarins
- Mango
- all fruits are vegetables
What will you do about computing next week:
- I will probably create a cookie-cutter template repo for my HPC based repositories that contains majority of steps, such as data handling via DVC, standard repo structure, MLflow for experiment tracking and so on and the dot files and other configs for Triton. One repo to rule them all!
- I would like to see whether I can get started on using COMSOL on the cluster.
- I will try to get my supervisor's python library to the cluster from git, and then hopefully use that library to do cool examples!
- I would try setup some ML code, maybe just as a test run and see how it works
- run Hector
- database stuff
- finish my thesis
- IHC image analysis
Would you be willing to do a course on "What services are available at your university" ?
- yes: ooooooooo
- no:
- depends: ooooo
We do a lot of work making videos for the course. You think it is "worth it" if the videos take less than X hours per day. What is X? (https://www.youtube.com/playlist?list=PLZLVmS9rf3nNK5qWN6FdrQPHns4fNZyMX, https://github.com/coderefinery/video-processing)
- one: ooo
- two: oooooooo
- three: o
- four: o
- five:
- ten:
Any other questions/comments:
- Do you guys have a sort of "visit day" to see how a normal day goes?
- Come to the garage (that you'll see soon)
How to ask for help with (super)computers (slides)
Questions:
GPU Computing (docs)
Questions:
-
I heard of these GPUs: V100, A100, H100. What is the difference? Is there a common site for comparing GPUs benchmarks?
- Biggest difference between these three is just that A100 is newer than V100, H100 is newer than A100. The computing power also increases significantly for the newer ones. Otherwise there are some differences in the architecture, but that won't be relevant for most users.
-
When we request a GPU node, is the GPU on that node (or the node itself) shared with other users?
- You need to specifically request however many GPUs you want, just having a job on the node isn't enough. If you request n GPUs, then none of those GPUs will be shared by anyone else. Slurm doesn't support splitting a GPU like it does with individual CPU cores.
- Helsinki cluster (Turso) users: add the options `-M ukko` and `-p gpu` to the `srun` commands in the exercises for success.
- And the module you need in Turso is called `CUDA` with capital letters. So: `module load CUDA`, e.g. (omit version for latest), and for a shared gpu, add these options to `srun`:
-
What's the rule of thumb for requesting RAM and GPU? Sometimes, when I use a large dataset, I also need to request an appropriate amount of RAM
- How much RAM you need will depend on your program. RAM and GPU memory are separate, but you usually need some RAM so that the CPU can transfer the data to GPU efficiently.
- If you're doing deep learning quite often you'll need to transfer the model parameters and data to the GPU. So the RAM requirement will be at least the amount of memory needed by the parameters and the data. In most cases you'll need to do some data preprocessing on the CPUs in order to fully utilize the GPU, so you'll usually need more memory and multiple CPUs for that.
-
What is the difference between the GPUs that e.g. Facebook has and that Aalto/CSC has? Does Facebook just have more of the same GPUs that we use?
- Generally there wouldn't be a huge difference. Facebook would mostly just have a far greater number of them, and generally newer and shinier versions.
-
If I want to train a model on Triton and then run inference on a consumer GPU (RTX4090), will different architectures be a likely problem?
- You might need to recompile the code for the different architecture. But unless you have written some extremely low-level code that depends on specific arch, it shouldn't require more than that.
- If you're using common libraries such as PyTorch they have been compiled for many different GPU architectures, so you don't need to worry about that so much. But sometimes consumer GPUs do not have the same amount of calculation units for different data types (half-precision, single precision, double precision), so some calculations might not work or might be slow.
-
When we need more GPUs than a single node can provide, how do we request multiple GPUs across nodes? Are there any best practices or documentation you can point us to for doing this?
- As long as the version of Slurm is new enough, you can request `--gpus-per-node=X`. Your software will need to actually support multinode calculations however, and might need some additional setup.
- By supporting multinode calculations it's not just multi-GPU support, right? Any documentation I can check?
- Yes, the problem is similar to parallel CPU jobs, where you no longer have access to the same shared memory (RAM). So you need to use something like CUDA-aware MPI or similar. (Or whatever software you are using needs to be able to do that behind the scenes.)
- Also as a rule of thumb, if your job is so big that you need multiple nodes worth of GPUs, you might want to consider moving your calculations to LUMI.
- Yeah I've been using LUMI all these time and still need to scale more on my future project(s).
- Many training jobs utilize a library called NCCL (pronounced "nickel") or AMD's version of that (RCCL, "rickel"?). This makes it possible to transfer data from GPU to GPU directly through the fast interconnects. It can sometimes get a bit tricky on LUMI to get this working. The launching is often done via torchrun, which creates communication between the different jobs, but you usually need to tell the launcher how it should connect to other nodes. See e.g. LUMI's example on distributed training
-
What is a GPU kernel?
- In the context of GPU computing, a kernel is a small bit of code that is sent to the GPU to execute. Generally the paradigm is that the CPU offloads specific parts of the code as kernels to the GPU and then those are executed asynchronously.
-
Is there a n00bs training course for starting out with GPUs and GPU coding? I would love to learn more in my own time.
- You can probably find some from the CSC training portal, they often have stuff. I'm not sure what is at the right level: https://csc.fi/en/trainings/
- Aalto University has this academic course that goes into various parallel stuff. Lectures online. I'm not sure if it's at a good level for new people, though: https://ppc.cs.aalto.fi/
- In the spring we hosted CSC's Practical deep learning-course on Aalto premises. They might have a re-run in the autumn, but we'll probably host it again next spring. The materials are great for self learning as well.
-
is `gpu-debug` exclusive for triton only? do we have the same stuff or equivalent on CSC's HPC?
-
have this error:
- Are you not on the Triton cluster? Different clusters will have different test partition names, see above
- You can see a list of partitions with `sinfo`. Hopefully the names are somewhat self-explanatory.
- I managed to get a good partition name but now i have this error:
- From the error it sounds like that partition doesn't have gpus available. Check your cluster documentation.
-
What is gcc? Is that a … compiler? What's a compiler?
- Yes, a compiler, the main open-source compiler (but there are newer)
- For compiled languages (C, C++, Fortran, etc.) the source code first has to be transformed into compiled machine code. Then it can be run. Newer languages like Python don't need this (but they do internally transform the code before running it)
- So does the compiler do this transformation from source to machine code?
- yes
- Compilers are important for HPC, unfortunately they aren't really taught anymore for a general audience. But, most people use code made by others and can use it from Python/R/Matlab so it's easier these days
-
what are the throws?
- Do you mean in error messages? "Throwing" an error means that something has gone wrong, and usually the program exits and prints an error message. It is possible for another part of the program to "catch" and handle the error
- If you mean in the pi example, it is part of the simulation on how to calculate the pi.
- The simulation basically throws darts and determines pi from how many hit the dart board.
-
Hello, I cannot find ./pi-gpu in hpc-examples. Where is it?
- You need to compile it using the commands given in the exercise block
- And the name of the output file where `pi-gpu` was created depends on the option used in the compiler command, `-o pi-gpu`.
-
This is maybe a bit off-topic, but if I create a code, say, with pytorch, Do i need to determine the batch size there first, and then again when running the code?
- Usually you'll want to pick a batch size that fits into the GPU memory. If you have picked e.g. a batch size that needs more than 32GB of GPU memory, you have to limit yourself to GPUs with at least that much memory; alternatively, you can reduce the batch size and use 32GB GPUs as well. Parameters such as batch size are often kept in a configuration file or command-line option, so you can adapt them to where you're running without modifying the code (see the sketch below).
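As a sketch of that last point, one common pattern (a hypothetical example, not the course code) is to take the batch size as a command-line argument, so the same script runs unchanged on GPUs with different amounts of memory:

```python
# Hypothetical training script: the batch size comes from the command line,
# so it can be adjusted per cluster/GPU without editing the code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=32,
                    help="choose based on the available GPU memory")
args = parser.parse_args()

print(f"Training with batch size {args.batch_size}")
# loader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size)
```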
-
Why is the GPU utilized 0%?
- Either it really did no work, or the program ran too fast. Slurm's CPU and GPU efficiency stats are not very reliable for extremely short jobs.
- The utilization statistics are sampled, so they are only accurate for jobs that run for several minutes.
-
I did the exercise, but I don't understand what I actually did. I just typed the commands that were shown.
- The first line loads the modules needed to compile the code (gcc is a compiler, and the CUDA libraries are needed to use the GPU).
- The second line uses `nvcc` to compile the code. `nvcc` is essentially a compiler wrapper that takes care of setting up CUDA-related things. Most of the arguments for `nvcc` tell it to compile for multiple GPU architectures, so the code works on all the GPUs on Triton.
- The meanings of the different `-arch` and `-gencode` options are elaborated here.
- Then at the end, `-o pi-gpu` is the name you want for the compiled binary, and the last argument is the name of the source code file.
- What does the pi-gpu compiled binary mean?
- The compiler turns the `pi-gpu.cu` source code into the machine-code file `pi-gpu`. These compiled machine-code files are generally called binaries, since they are saved in binary format (they don't need to be human-readable anymore).
- All these terms might be unfamiliar if you are used to working with languages like python, which take care of all of this in the background during execution.
- The last line runs `pi-gpu` in the queue.
- The sbatch version then does mostly the same things, but inside an `sbatch` script instead. We skip the compilation step since the code was already compiled in the previous exercise.
-
GPU arch question (was deleted)
- This refers to the GPU arch that we talked about. It means that in the cluster you are using, that specific arch is not available.
-
If I want to run my own code, I am getting the error 'permission denied'. My call is: srun --pty --time=00:10:00 --mem=500M --gpus=1 -p gpu-debug ./Toss.py 1000000… Full error output: srun: error: dgx16: task 0: Exited with exit code 13 srun: Terminating StepId=7926631.0
Exercises xx:55
For your site the partition names and module names might differ. Check your site's documentation first.
Commands to run the demo:
Running nvidia-smi:
Running `pi-gpu`:
`pi-gpu.sh`:
Submit using sbatch pi-gpu.sh
I'm:
- Done: oooooooooooo
- Not trying: o
- Having problems:
LLMs on Triton
- Docs:
- Example repo:
- List of Gen-AI tools at Aalto:
-
Questions:
-
What is hugging-face?
- It's a platform that developers use to publish their open-source models.
- It also provides libraries for running these models easily.
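A rough sketch of what using the Hub looks like from Python (assuming the transformers package is installed; "gpt2" is just a small example model, not one from the course):

```python
# Download a model from the Hugging Face Hub and generate a bit of text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("High-performance computing is", max_new_tokens=20)[0]["generated_text"])
```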
-
I've heard of RTX GPUs for PCs. How do those GPUs differ from the ones on Triton?
- For the purposes of HPC use, they mostly differ in scale. Server GPUs are generally more powerful and can do more calculations.
- There are also some differences in architecture. Server GPUs are made for HPC use and don't work as well for rendering graphics live (iirc), which is the usual primary use of a consumer GPU.
-
Regarding speech-to-text models: do they (for example Whisper) support Finnish-language interviews and translate them to English?
- Yes! Finnish and Swedish are both supported.
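A minimal sketch of how that can look with the open-source openai-whisper package (the file name is a placeholder; on Triton you would run this inside a job with a GPU):

```python
# Transcribe a Finnish recording and translate it to English with Whisper.
# Assumes the openai-whisper package; "interview.wav" is a placeholder file.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("interview.wav", language="fi", task="translate")
print(result["text"])
```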
-
What's the tradeoff between model size and precision?
- It's a tricky question to answer :D The reason is that precision depends on the task you are looking at. Usually models are run through some tests or benchmarks and the accuracy is reported. So the size itself cannot guarantee the precision of the model; it depends on how it was trained.
-
The instructor says --mem=80GB is system memory? Is that RAM? What is system memory?
- Yes, RAM. "RAM", "system memory", "main memory", "memory" - all are usually terms for the same thing. There are other types of "memory" though; unfortunately the terms are a bit too flexible.
-
Is system memory equivalent to the data that will be used? For example the voice recording, or the text?
- So the CPU memory is used to pre-load some of the data, and the GPU memory is where the model is loaded and does the calculation. Therefore the GPU memory should be large enough to hold the model itself plus a batch (a small part) of the data.
- In the case of text, the GPU has to hold the model parameters and the input and output layers, which are basically arrays of numbers. The input and output layer sizes depend on the context length of the LLM. Different models have been trained with different context lengths, so if the model fits into GPU memory, you can run an arbitrary number of queries through it; each query just has a maximum length equal to the context length.
- In the case of audio, the recording is often split into small snippets that are sent through the model one by one, so the length of the audio does not matter.
-
I get this error: RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: auto
-
should I select cuda?
- Are you running the llm-example?
- yes
- Are you using Triton?
- yes, I can copy the whole output
- Can you paste the command you used?
-
and I run sbatch llm_example.sh
-
Try `device_map="auto"` instead of `device="auto"`
-
I tried with device="cuda" and I got this. I copied the device="auto" from https://scicomp.aalto.fi/triton/apps/llms/
- We will need to update that example. But it might also depend on the versions you use, since this "alternating" pattern is not a requirement I was aware of, and one I haven't seen before.
-
Now I tried with device_map="auto" and the same jinja2.exceptions error happened
- It might be that at some point the chat templates used in transformers were updated to include this restriction, since (probably) some models require it. In this instance, you would need to enter only one prompt in the messages array, i.e. remove the "What is the meaning of life?" question.
- Yes, it worked after commenting out the second message (thanks!)
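For reference, a minimal sketch of the working pattern discussed above (assuming a recent transformers version with accelerate installed; the model name is only a placeholder, see the Triton LLM docs for the actual example):

```python
# Chat-style generation with a Hugging Face pipeline.
# Note: device_map="auto" (not device="auto"), and only one user message,
# since this model's chat template expects alternating user/assistant roles.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",   # placeholder model name
    device_map="auto",                      # let transformers place the model on the GPU(s)
)

messages = [{"role": "user", "content": "Hello! Tell me about HPC."}]
print(pipe(messages, max_new_tokens=50)[0]["generated_text"])
```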
-
Is this model being run locally?
- Yes, locally on Triton; not on Yu's own computer.
-
Where can I find instructions for checking memory usage in real time?
- In the Monitoring section of the course, we talked about `nvidia-smi` and other tools and "profilers" you can use.
-
What is this `watch` command?
- The `watch` command simply re-runs the command that comes after it: `watch -n 2 nvidia-smi` re-runs `nvidia-smi` every 2 seconds (the interval given by `-n 2`).
-
.
General Q&A
You can ask us anything, instructors will discuss as a group on stream.
-
I don't have a particular coding background; the reason I wanted to learn Triton was to be able to run my COMSOL simulations on it. The reason I want to do it on Triton is that the simulations become computationally too heavy, not on my personal computer, but on one of the virtual computers offered by Aalto. There is documentation about using COMSOL on Triton. Would you say following that documentation should be enough to get started running my simulations? Or should I run something "lighter" first on the cluster to see whether I actually understood how to do it?
- I would suggest just trying whether you can set up a COMSOL job (and the docs should have examples for this). If not, there are COMSOL focus days, where we have additional experts in the garage who can help with COMSOL-specific questions (but you can also come on any other day and ask; it might just be that we don't have the 'right' expert around).
-
I know I need to use a custom library for my summer project, and its README details installation guidelines which are not written for Triton and do not work in the terminal connected to the cluster… Is my best bet to come to the garage to get the custom library installed, or is there some tab in the docs that explains this kind of situation (I at least couldn't find one)?
- what kind of library (and what programming language) are we talking about here?
- I believe it is meant to be used with Python (at least for my project), but most of it is written in C++. The git repository includes an install.py that is meant to be run, but I don't know how to adjust the installation process for Triton.
- Odds are that you are missing some libraries, for example ones that would be provided by modules. That is something we can help with in the garage. (It's a lot easier for us to figure out what provides what, since some of us have probably already done something similar.)
- OK, thanks!
-
Is it possible to download this document for future reference?
- It is archived (see links above), and we'll make sure it's linked from the course page. It should stay up for a while, but your browser can probably also download it.
-
I also want to run simulations using platforms such as COMSOL or OpenFOAM. Does Helsinki Uni also provide access to these on their clusters?
- The Helsinki Uni Turso cluster has an OpenFOAM module: `module spider OpenFOAM`. We can also guide COMSOL users; please come to our garage (see below).
- Helsinki Uni users, please get started here: https://version.helsinki.fi/it-for-science/hpc/-/wikis/home
- If you have a Helsinki University user account, please join the course Zoom breakout room 1 right now so that we can get you identified and connected to the Turso cluster.
- You can also get started by sending an email to helpdesk[ät]helsinki.fi and requesting Turso cluster access. To speed things up, you can mention in the email that it's intended for the IT for Science solution team.
- And like Aalto, we have also adopted a daily HPC Garage practice to assist our users: https://version.helsinki.fi/it-for-science/hpc/-/wikis/Garage
-
.
-
.
-
.
Feedback, day 3
News for day 3 / what to do next
- We covered what was in the schedule.
- There is a lot more written material linked from the schedule (also Triton tutorials in general) that you should read for anything you need.
- Just try stuff out! And ask us for help when needed.
- Follow-up courses
- CodeRefinery: version control and software development stuff (mid-september).
- Python for Scientific Computing: late autumn
- Study a bit about shell scripting (as appropriate for you); we have a longer course on it
Join this Zoom now to meet the instructors: https://aalto.zoom.us/j/69608324491
The course was (vote for all that apply):
- too fast: oooo
- too slow: o
- right speed: oooooooooo
- too slow sometimes, too fast other times: o
- too advanced:ooo
- too basic:
- right level: ooooooo
- I will use what I learned today: ooo0ooooooo
- I would recommend today to others: ooooooooo
- I would not recommend today to others:
One good thing about today:
- to the point, and concrete.
- ..
- ..
One thing to improve for next time:
- I think the course should target an audience with a higher base level. For instance, if someone has never heard of a GPU, maybe they should be referred to more fundamental courses rather than an overview of HPC. Overall, the material could focus more on real challenges. +2
- I personally get distracted by the non-technical parts; I think it would be better to leave them all until the last day or a specific time.
- The LLM example did not work; it should be corrected
Any other feedback?
- Too fast, advanced, technical +1
- I really liked the execution of this course. Worked very nicely +2
- I wish there were more courses following this kind of execution
- Overall, i really liked your teaching way, it was encouraging to ask questions, which is really good, thank you so much for your hard work.
- Lots more today is Aalto/Triton-specific. It would be good to get input from other clusters to understand what's possible from built-in modules, and what we need to configure ourselves to keep up with the examples. +1
- Too advanced, too technical
- I really enjoyed when some of you had the terminal log showing, so I could easily follow along even when I'm not as fast with typing. I wish all of you would do it, since when it wasn't used I would easily fall behind.
- Good course! I think you covered everything important.
General questions continued:
- What now? Like I want to mess around with LLMs in my free time. How should I start? Can I somehow contribute to open source software?
- How long will the materials be available, and where?
- This chat will be archived and a link added to the course page. YouTube videos of the course will also be available on our YouTube channel
- ..
- What is CodeRefinery course?
- The CodeRefinery workshop covers more practical scicomp skills, such as git usage and general best practices.
"Thank you to all the instructors. It helped a lot. :)" +3