Introduction to Scientific Computing and HPC / Summer 2025 (ARCHIVE)

Infos and important links

This is the archived notes document for Day 1
Here you find the archived notes for Day 2: https://hackmd.io/@AaltoSciComp/scicomphpc2025Archive2
Program and materials: https://scicomp.aalto.fi/training/scip/kickstart-2025/
page**

Please do not edit above this

This is a shared document where participants and instructors can write during the course.

Test this document!

Document will be open for edits later

This is a question
- Well this is an answer… although that was not a question
  - A comment to the answer
something completely different
…

Icebreakers

Where are you and what's the weather?

Helsinki and it is sunny +1 +1
Surpisingly sunny in Espoo :)
Espoo and very nice weather!
Tampere, the weather is so good today as well!
Oulu , It's raining :)
Oulu
Espoo, it's sunny, but a bit chilly.
Espoo, it's nice and sunny :)
Espoo, very good weather today :D
Espoo
Helsinki, sunny
Helsinki, nice weather today.
Helsinki, sunny 😎
Espoo has lovely weather today!
Pohjois haaga is sunny
Espoo, Aalto, Especially sunny here next to the window +1
Helsinki, it is sunny.
Oulu, it was sunny, but cloudy now.
Espoo, sunny and very nice
Sodankylä, sunny
Kumpula, nice weather
Aalto Campus watching the sunny day!
I am at Viikki campus and weather is too pleasant
Helsinki, sunny
I am in oulu its raining here
Helsinki Kumpula Campus
Echo problem at your side (no???, fine on my side :D)
oslo, sunny
Helsinki, Aalto
Helsinki, Aalto

How do you currently do computing?

Ubuntu linux on own computer, lots of SSH to the cluster
Ubuntu 24.04, remote connections, vscode
An HPC cluster :D
Jupyter Notebook mostly, I am quite new to thI is.
Ubuntu, linux
Depends on the task, Mac/Linux, locally, lab server, CSC too… Mostly I work with Python venv per project as suggested by Aalto SciCompt in another course I took from you guys hehe ;P
WSL
Spyder/Python(Anaconda), Linux/Powershell
WSL
Lumi CSC for bigger tasks, but locally on Linux.
Using a HPC
Uhh, Tbh I am very new to this. Have used Jupyter before, though
I use both server (like CSC or Oulu servers) and my laptop. However, depend to the type of problem, I may use parallel and threads.
Mac M1, on my own computer
Ubuntu, CSC
WSL/Ubuntu
Google Colab or pter
Jupyter and spyder
Conda and Jupyter
quantum computer
mainly Jupyter and google colab.
Azure and Jupyter
CSC using SD desktop
LUMI
pop-os, python
Debian

What's your favorite animal?

bears :) +1
Lobsters
Wombats!
Owl +1
Siili! Hedgehog
Cars +1
Pelican
Panda
Tiger +1
Otter
My parents' dog
Armadillo
Cats +1
Otters! +1
(brown) Bears!! +1
Gecko
CATS +1
CATS <3
Turtle
Hourse
Fox
Tiger sharks
Crow
rabbit
squirrel
orca

Introduction, about the course [materials]

Ask questions here, like this.
Is it possible to get credits from this course?
- No unfortunately it is not. But from next academic year we will have the opportunity to get 5 credits for multiple courses by Aalto Scientific Computing and CSC.
  - Does this apply to all universities?
    - We will be able to provide certificates, but it won't use the Finnish JOO system of automatic credit transfer (at least not next year).
In twitch, Is it possible to hide the chat in theatre mode?
- I have a button on the top left of the chat that hides it. It looks like "|->"
- you may need to exit theater mode, hide the chat, then go back to theater mode. At least that used to work.
- Thank you!
My registration didn't show that I am an employee at Aalto (edited this morning). Can I be granted access to the HPC cluster? (I emailed scip@aalto.fi about this)
- Yes, if you send an email someone will probably get to it soon
  - Thank you!
I am an Aalto employee but I still have not got an IT account, can I be given a temporary one to access the HPC cluster?
- Please apply for the account (you need to accept the policy) follow the link from: https://scicomp.aalto.fi/triton/accounts/
I forgot my ssh and password for triton. How do I get it ? I am already logged in ondemand
- The password for the command "ssh USERNAME@triton.aalto.fi" is the same password you use to log in to your Aalto emails, ondemand, etc… Or did you mean the ssh key passphrase?
- I tried to connect triton via vscode and somehow its not working
  - Try with a terminal. If that works, then the issue is vscode. Any terminal is fine (powershell, linux shell, macos terminal). Remember to be inside the Aalto VPN.

Is there recording available for all the sessions?

(moved question and answer down to the kitchen section)

The HPC Kitchen [slides]

Videos

.
A giant HPC cluster of Facebook would have just more counter space?
- In this metaphor big HPC clusters have a lot of kitchens (many computer), each whith multiple stoves (multiple CPUs) and huge counter space (lots of memory). This makes it possible for a huge number of cooks (lots of users / lots of processes) to run simultaneuosly to cook a lot of food.
When the number of cores increase, I feel managing the code is more challanging. Like a kitchen with a big stove, but working with it is not as simple as a small stove.
- This is a very good point. To make full use of multiple cores often requires special handling.
- Exactly!
Can you talk a little bit about CPU vs GPU?
- There are more cores for GPUs (but with lower capabilities or restricted capabilities)
- GPUs can do multiple specific calculations much faster than a CPU, while CPUs can do all sorts of calculations. They have lots of parallel computing cores that can do same calculation with multiple different data points at the same time (also called SIMD - single instruction, multiple data). So they can do vector calculations, matrix multiplications etc. much faster than CPUs.
- Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- GPU is a parmiggiano cheese grater, CPU is a single dot for a minuscule grater :)
Could we say each GPU core is faster than a CPU core?
- GPU as a whole is faster than a CPU, but GPUs can have thousands of computing cores while CPUs usually have up to a hundred or so. So it is speed by having more calculations being done at a single time.
  - So we can not say "one GPU core" is faster than "one CPU core"!!
    - Indeed we cannot. The clock rates (how many instructions are done per a unit of time) on GPU cores are are slower than on a CPU. Modern GPU cores run around 1-2 GHz while CPU cores can boost to 3-4 GHz. In some special calculations GPUs have specialized hardware that do certain calculations faster than in CPUs (e.g. Tensor Cores in NVIDIA GPUs calculate certain matrix operations with fewer calculations than CPUs would use)
How to relate the metaphor with quantum computing?
- Oh that's a good question. I haven't thought about that at all
- Please help us what is cooking in quantum computing…
- Maybe something like: a magic pot with a lid where you put all of the ingredients and after a certain time when you open the lid you will either have the food you wanted to cook or something undetermined.
What GPU uses Triton: Nvidia or AMD?
- Mostly Nvidia, a few AMds for testing
- https://scicomp.aalto.fi/triton/overview/
How can I see the number of GPUs available?
- In general, sinfo is the command to get info about a slurm cluster. (slurm is the software that manages the allocation of resources to each cluster user)
- https://slurm.schedmd.com/sinfo.html

Connecting to the cluster

https://scicomp.aalto.fi/triton/tut/connecting/

Do we need an Aalto account to access the cluster?
- For the Aalto cluster (Triton), yes. And it needs to be activated. Other universities will have their own policies.
- A similar workflow works on other HPC clusters (CSC Mahti/Puhti, Helsinki University, Oulu University… please join the zoom if you are affiliated with HY or OY for more help)
Its asking for a password. How to set up a password. I work at Aalto
- Aalto? (yes Aalto)
- Same Aalto password
- If you have forgotten your Aalto password or haven't set it up you can use https://password.aalto.fi/ to reset your password. It will take a while though and you'll need to use your banking credentials to verify who you are.
Our university cluster portal is not active. How can we access another cluster?
- If you are in Finland, maybe you have access to CSC? Get in touch with your uni admins. Maybe it is just a temporary glitch.
- Yeah, unfortunately we can't provide a cluster to non-Aalto people.
I get this error when trying to connect: ssh: connect to host triton.aalto.fi port 22: Operation timed out. What could be the problem?
- Are you on the Aalto VPN, or Aalto device on campus? Access is only avaliable from internal Aalto networks?
- I'm on my own device on campus.
- Join the zoom (you should have the link if you registered with your Aalto account)
Do you have to VPN to Aalto to connect to the cluster?
- Yes for SSH (unless you set up ProxyJumps, which you can find in info)
- OnDemand should work from anywhere - so we recommend ondemand.
When connecting to Triton I get a "This is the Z Shell configuration function for new users,zsh-newuser-install" prompt. Is this something we should set up before starting to use Trition or can I skip it by pressing "q"?
- You can skip it - unless you want to see what it is. (Sorry, we often miss new things since our accounts are already made!)
I am trying to connect to triton with the VPN connection. I got the connection, but do not understand the instruction after that.
- Are you trying to go by SSH or OnDemand?
  - I am using SSH but I do not understand what commands I need to write. Do I need to make a SSH key?
  - Are you using Windows, Linux or Mac?
    - Windows but I have WSL working.
      - Next step after connecting the VPN is to open a terminal, this can be the Windows command prompt, windows powershell, or a WSL terminal (make sure VPN is started before WSL though (see other questions here)). Then enter 'ssh USERNAME@triton.aalto.fi' in the terminal (where USERNAME is your own triton username)
        
        This times out. But I tried command prompt and that worked!
- You can try joining the Zoom (if in Finland) for more help
  - I can't, sorry.
Is it possible to connect to the cluseter by CSC user, i am at Helsinki University.
- You can use CSC's Puhti or Mahti clusters if you have an account at CSC. https://www.mahti.csc.fi/public/
- is puhti down somehow? I can connect to mahto, but not to puhti. Thanks.
- Puhti is on scheduled service break today
- NOTE: You need to belong to a CSC computing project, to get access
  to resources like HPC clusters. You need to be university staff or equivalent to get a project, but then you can invite who you wish to the project, students, etc
- Helsinki University users are encouraged to use our Turso cluster accessible e.g. via SSH at turso.cs.helsinki.fi (if your HU user account has been added to cluster users).
I am trying to change the shell to bash. It says to wait 15 minutes but I tried to do it over 30 minuts ago and it still says /bin/bash. What should I do?
- You can mostly keep going with the whole course and it won't affect much.
- There can be an additional delay because the account synchronization needs to go from Aalto to Triton, which can take a bit more time.
On a daily basis, should I use VSCode or Linux Terminal when I work on Triton? And when should I use OnDemand?
- Really you can mix them as needed. VSCode may be nice for some editing, terminal for submitting jobs. Terminal could be via SSH or OnDemand (or even VSCode Remote-SSH). So mayn different ways to do things…
- We'll mention other options as well in the "Setting up a new project"-lesson.
Why shouldn't I run JupyterNobetooks via VSCode on Triton? The instructor said something about wasted time? Thanks
- The note about wasted time applies to any interactive use. While you are thinking the resources are idle :) But this is not a problem as long as you know.
- You can run notebooks from VSCode, both are interactive sessions.
  - You do need to make sure to setup VSCode properly however so that the notebook actually runs in an interactive session. If you just connect to Triton and launch a notebook, it will run on the login node and slow everything down for you.
About the JupyterNotebooks and VSCode. I usually work with .ipynb files on VSCode. Since, it's tricky to use this through remote ssh from VSCode, would JpyterLab thing I saw on OnDemand would work?
- Jupyterlab on OnDemand is generally the easiest way to use jupyter notebooks on Triton yes. You can also set up VSCode to start an interactive session on a computing node and run it there, but that requires more work. Here are our instructions for setting up VSCode.
I have access to Turso from UH. Should I follow the guidelines from the uni portal or the handson in this course are still applicable?
- The hands-on exercises of this course should be mostly compatible to be run on Turso. HU admins can assist if there are problems in running them.
- For connecting, you should follow the cluster's documentation. In the course we'll try to be as generic as possible, but each site has their own peculiarities so following your local instructions is usually better.
Im using my Aalto account and get this error after writing password. Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). How to overcome this?
- Try joining the Zoom and we can look together
- You requested your account to be activated, right?
I just got my work laptop (15 minutes prior to this session) and I'm having a bit trouble setting it up. Thus, I don't have any apps downloaded. Would Jupyter notebook be the easiest for the exercises? Not sure I can download any additional apps rn (and it'd take time)
- Any terminal is fine
- Alternatively use the ondemand web poartal
What were the specific things to avoid when using VSCode?
- Do not open folders like /home/user
- Do not open general storage folders/folder with lots of files (i.e. places where you have your data or conda environments)
- Try not to run jupyter notebooks
how to install vscode in linux ubuntu?
- This is the official guide: https://code.visualstudio.com/docs/setup/linux
- error: sudo: unknow user wa.xxx
- error initializing audit plugin sudoers_audit
Are there any notable differences in using Triton when having connected from a different way? I can successfully connect from powershell and log in to ondemand, but for some reason ssh from
does not work. Is there any reason why I should spend more time in trying to make WSL work, or should everything be doable from powershell?
- There shouldn't be any significant differences. The shell on Triton will work the same, and only real difference is access to different tools on your side. For example, you can use shell tools to transfer files if you connect via powershell / WSL shell, but you need to use ondemand's file transfer tools if you use ondemand.
- In some configurations, WSL does not "see" the internet
If you have already logged in to the cluster through onDemand, are you done for now?
- Yes, except if you want to try other ways to connect, which might come in handy e.g. for large file transfers (where a web interface might be incovenient)
I get the error of ssh: connect to host triton.aalto.fi port 22: Operation timed out. I am on campus, on aalto open on my own device. Is this sufficient or should I be connected to the Aalto network or VPN?
- You need to be connected to VPN if you are on your personal machine. Only aalto-managed devices can connect to internal Aalto network. ("Aalto" wifi is essentially two different networks depending on if you are on aalto-managed device or not)
This might be an out-of-scope question right now, but normally when I am working on any of my projects, if I want to see local results, I use NBs to quickly see what the shape of the result would be, and once I feel it's correct I start working on the final script, think about boosting the use of the available resources (either when I have used Puhti/Mahti). Is this a good practice?
- Do you mean, start locally, get it mostly working and debugging, and then scale up to the cluster? If so, yes, that's what many of us do!
- Yes! For some of us the limitation is that we use sensitive data that we cannot store locally. Luckily some similar open dataset can be used locally to develop and then move to the cluster for the "real" analysis
I am trying to connect from home and don't have access to Aalto VPN, how do I proceed?
- The only way possible would be to first connect to vdi (vdi.aalto.fi), and set up ssh keys. Then connect via kosh (https://scicomp.aalto.fi/triton/tut/connecting/), but generally I would highly recommend using the vpn (which you should be able to use if you have an Aalto account).
- I have an Aalto account, but it seems I can only use VPN with an Aalto device as far as I understand. I have a little problem with the Aalto laptop (I cant use it right now).
  - VPN can be used with any machine, it's not restricted to Aalto machines. go to vpn.aalto.fi
    - Oh great thanks!
how can i install vscode?
- Depends on what kind of machine you have. i it's your own: https://code.visualstudio.com/
I am in oulu what is my cluster and how can i connect
- You should have details for a zoom meeting, where admins from oulu can help you. But the ssh connection string for oulu should be : myaccount@lehmus-login1.oulu.fi
- I am in Oulu too and you can look here https://tki-kapasiteettipalvelut.version-pages.oulu.fi/lehmus-doc/docs/slurm/ you can look at login
I installed the Remote - SSH extension to VSC. How do I connect to the triton cluster (from where?!)?
- After installing the extension, you have a blue button like >< in the left bottom corner. Click on that button and follow the following guide: https://scicomp.aalto.fi/triton/tut/connecting/
- It should also install the "remote explorer" plugin automatically, which gives a symbol in the vscode extensions bar (A monitor with a small circle in the lower right). If you click on that, and then in "Remotes" hover over "SSH" and click the plus, you can then enter the connection string: username@triton.aalto.fi. Be aware, that you need to be on the Aalto network for this (e.g. via VPN).
my univ provided this. is it ok?
- Carpo2 will be shut down in couple of days. Please use our Lehmus HPC environment:
  https://tki-kapasiteettipalvelut.version-pages.oulu.fi/lehmus-doc/
Is Lehmus HPC ok?ith
- I cannot open the link. However I would say any HPC is good for this course. There are some small specifics could be different in every HPC, but the general concept (how to create a job, how to track the job, etc) are similar.
I'm trying to connect, but I'm recieving: ED25519 key fingerprint is SHA256:*** This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? n Please type 'yes', 'no' or the fingerprint:
- You would answer "yes" the first time (unless you have a reason to think there is something wrong with the network).
- If this is for Triton, you can check the fingerprints here.
  - I was able to connect once, than I entered "exit" in the terminal, tried to log in again with the same user name and password, but the server denies the access
Is there some resources on how to use the cluster via vscode and not in the login node?
- We have some documentation for this here. This is for Triton, but should be mostly applicable for other clusters. It's generally a good bit trickier than "normal" methods though.
First time connecting, got a message that The key fingerprint is: SHA256 …do I need yo save it somewhere?
- No, it's for checking you connected to the right machine. We have it listed somewhere… For Triton.
- Thank you! but I already checked, it says *This seems to be your first login SSH keys will be created for you so you can ssh in the cluster Generating public/private ed… key pair. is it like my code?
  - This one is about keyes used inside the cluster. You can connect to a compute node when you have a job running there. But this is a bit advanced, maybe more info later :)
  - got it, thank you (:
As a new user my home directory seems empty. Would it be safe to connect here via VSccode then, but I should avoid it later when I have added data for projects if I understood correctly and instead connect to the project code directories?
- There might be bunch of hidden files already, and several things can create hidden configuration files / cache files in your home directory. In general connecting to directories with small amount of files is fine, but your home is probably not going to stay very empty for long.
  - This includes e.g. conda environments (which have tens of thousands of files), and are, by default, located at $HOME/.conda. And in addition to that, it is anyways good practice to create one folder for a project, and not just put your code into your home folder, so that you can use different folders, with different vscode settings for different projects instead of one vscode which you modify (which will ultimately lead to a complete mess)
    - Exactly. Although for conda specifically we strongly recommend configuring it to install to your WKRDIR instead anyway.
- I would not do it, since vscode remembers locations you connected to, and it's too easy to just 2reconnect" to the same folder when, after some time it fills up.
  - Is it safe to SSH to Triton via VScode so that the terminal will awlays start in home (or is there a way to directly ssh to a specific directory?), as long as I do not open the home dir in the file explorer in the left sidebar?
    - the terminal starting in home is fine. It's just the "working folder" that vscode keeps scanning.
I managed to access with vpn on and from powershell (with windows). When I tried also from wsl, but that didn't work. Is it a dns-issue? I changed my resolve file, and internet works without vpn on. Any suggestions?
- WSL and VPN are a bit tricky. I think if WSL is started before the vpn is on, it will not "honor" the vpn connection, but bypass it, there are some odd things happening there, since WSL essentially starts it's own container on a very low level.
Just for the incoming lectures, are we gonna talk about how to build a bash file depending on the scenario to submit to SLURM? :)
- Yes! Exactly. First taste at end of day 1, much more day 2
- Awesome, thanks :)
Do I have to connect to Triton if I can access Turso via HY?
- No, we try to teach in a way that can work on any cluster
- Thanks!
Looks like my request for triton account hasn't been approved yet. Do you know how long this'll normally take?
- Will probably be done in the next couple of minutes.
  - Still says "Waiting for implementation". I sent the request at 10.29. Is there something I could do?
    - Could you come to zoom please, it's easier to handle it that way?
      - I didn't get the link since I was the one who added my employment status this morning (never got a reply from scip@aalto.fi)
      - (not sure if anyone noticed this)
        
        Email scip@aalto.fi (if you did not already) Thanks!
        
        Account should have been added, but it will probably take a bit for the change to be passed through the system (normally 15-60 minutes)
        
        It now changed to "resolved" so I was trying to set it up on PowerShell but it denies my account nevertheless.
I am using my aalto password to try to connect to triton but it keeps saying permission denied
- Did you request a triton account?
  - Yes
    - Could you come to zoom, to figure this out? link is in the invitation email.
please tell what do I do after opening the triton site, i dont have triton account
- If you are from aalto, request an account, since otherwise you wont get access. If you are from another university, please get in contact with your admins to get access to your cluster
Are we supposed to get to the USERNAME@login4 screen, anything after that?
- no, that's what you need!
  - Thank you!
Turso keeps terminating my connection to the remote server. Why is this and what can I do?
- Hm, I guess you need to ask the Helsinki people. I think there should be help in the Zoom for people from Finland (check email), or send a message?
- Hi, yes. Please appear into the Zoom breakout room 1 so that we can identify you and add you to Turso users list.
When working on a project, are we supposed to copy a specific directory from /scratch/… to our home directory and do the working there?
- We'll talk about this under "setting up a project" after lunch!
- But generally, you copy from your own computers to the cluster. On Triton we recommend doing most work directly in /scratch/…/ . But different clusters will have different practices - you'll need to check what is recommended.
- Your home directory on the cluster is generally small and may be OK for code, but not larger data.
Trying to connect to Triton on my home PC through Aalto VPN with WSL Ubuntu terminal gives the error "Could not resolve hostname triton.aalto.fi: Temporary failure in name resolution", but I'm able to connect with Windows PowerShell.
- In some configuration WSL does not "see" the internet
- Can you ping -c 5 www.google.com for example? If that doesn't work, you most likely have no internet connection from WSL. If it works, it's probably VPN issue.
  - Same error
- I think to remmeber, that WSL + VPN sometimes don't like each other, if you activated the VPN after WSL got started up, WSL tends to not process the VPN properly since it sets up some internal network routes. (https://superuser.com/questions/1630487/no-internet-connection-ubuntu-wsl-while-vpn)
  - Ah, I see. I did just install the Cisco client and activate VPN. Thank you.
Once you have the cluster webpage connected, is there a "timeout" feature? Where if I don't do anything in the cluster for some time it will log me out. Or maybe is it not advisable?
- Are you asking about ondemand? Or about an ssh connection?
  - ondemand yes, just using a terminal shell from ondemand.triton.aalto.fi
    - there are probably some timeouts and tokens passed along, so yes, there will be some timeout. In general, for long running commands those should best be submitted to the queue, at which point they are no longer part of the shell, but will be handled by slurm (and run independent of a current connection.) For interactive sessions, I assume (not entirely sure), that there will be no timeout since there is constant data transfer (visual)
      -Thank you for answering! I was just asking because I have the terminal open and haven't done anything so far because following the stream.
- I have the same problem in TURSO.
  - Please provide more info about the problem: Are you in a home network? Are you using VPN? Are you using WSL? We can help you also in Zoom breakout room 1 if you appear there.

PLEASE WRITE FURTHER QUESTIONS AT THE END OF THE DOCUMENT

Exercise: connecting

Try to get connected to the cluster. If you can't right now, we have quite a while still before we start using the cluster. You can take your time.
https://scicomp.aalto.fi/triton/tut/connecting/

i am connected with https://scicomp.aalto.fi/triton/tut/connecting/ now what do I do?

Try to:

Get connected to your cluster anyhow
You need a terminal to do the future exercises.
- At Aalto and CSC this can easily be through OnDemand
- SSH works everywhere

If you haven't requested an account yet, it may be too late now: in that case, sit back, relax, and enjoy watching without pressure.

I am (add an "o" symbol to vote for your status):
I am connected with https://scicomp.aalto.fi/triton/tut/connecting/ now what to do?

Done: o + 32
Tried and couldn't: oo
Not trying: o

CSC resources for Scientific Computing

CSC is the national supercomputing (+more) center of Finland. We have an introduction + demo of their services. This is also a demo of what we will see (on our own clusters) later on.

Is a HPC cluster considered as a supercomputer?
- It's essentially the same thing. Nowadays the word computing cluster is just used more often, since "supercomputers" now are just huge clusters or ordinary computers instead of single extremely powerful one.
  - "Supercomputer" also used to indicate a very powerful "single machine" and is very much misleading name wise, so high performance cluster (HPC) is probably the better name (definitely nowadays).
Pro-tip: ask all your professors/PIs to have at least one CSC project. It is always good to have one and test things out.
- Can you request a project for "general use" without a specific project in mind?
  - Depending on how you phrase it. For Lumi it might be difficult, but for other systems it should be fine
Is there a way to add more compute credits when they are depleted on the LUMI supercomputer? Additionally, are there any best practices for managing credits more efficiently?
- Yes, request additional units in my.csc.fi . Smallish requests (a lot, actually), is granted automatically
Is it okay to ask questions here? If so, could you please check whether Puhti is working today? I can't even open the website, and I need to download some files from it.
- Yes, you can ask! One instructor just said that Puhti has some planned mainenance today, so I think that's it.
What is the Parallel File Server? Whats it for?
- Oh good question… so, normal computers have one computer, one disk for storage, right?
- Cluster has a whole dedicated storage system that is mounted (available) on ALL computer nodes. It's convenience to have the same data everywhere, but it has to be fast (but the disks, and the network)
- Most clusters also have a filesystem where a large file can be split between multiple file servers. This allows you to parallelize the file loading.
- Data storage is a big deal - you can read more about it from the docs.
Are module systems like python environments?
- That's a good comparison. Modules include what would normally be system libraries, and Python environments usually cannot include those, so there's a difference.
- Another difference is that you can usually load many modules at the same time. Environments usually cannot be combined.
Why just not enforcing Docker containers? Or do you see that or something similar to be the future?
- Well, not docker, but we understand the idea: Apptainer/Singularity are made for clusters where you aren't admin
- CSC does in fact recommend and have tools for installing a lot of software in containers by default. In general, it's a good idea.
- We don't go in depth to them in this course, since it would be another huge complex thing to teach that would trip lots of people. We have had some other dedicated courses on them.
  - thanks! I understand. I guess you spend a lot of time and resources installing/managing different versions of libraries and probably solving problems for ppl. I guess something like Docker would move the ball to the user's court, so they/we have full power and control and responsibility on it
    - Containers are used somewhat frequently for installing software on the cluster as well. Sometimes it is easier than installing the software directly. Docker specifically isn't used because it requires root access though.
    - One problem with containers: They depend on the hardware (in case of GPU usage) and it's another set of tools people need to know/be trained in. And also, it sometimes becomes a lot more complex to debug issues.
  - Here is one page we have: https://scicomp.aalto.fi/triton/usage/singularity/
  - More general/broad: https://coderefinery.github.io/hpc-containers/
  - And yes, containers let users do whatever. If we make it recommended, the ball comes back to us to teach (and debug) all the problems with them. So it's a bit of a balance. (you are welcome to help us!)
    - Yeah, I see, e.g. you provide some linux courses for ppl to use the cluster, so you would need to provide also container training. I guess it's a hard problem and a trade-off. My guess is that it might be easier in the long term to only support container, but I see that it requires all users to be acquainted with it.
      - We have some videos of past workshops about containers in our YouTube channel. If you feel like joining again for a re-run, just ask us :) https://www.youtube.com/watch?v=9nHhB3Tn_BU&list=PLpLblYHCzJABy4epFn-rqsfDbUZ1ff5Pl&index=6
      - It's "easier" for the provider, but it also imposes more barriers for those who want to use it, and one big factor for a lot of these systems nowadays has become usability and accessability. More and more people do NEED these resources, but they might be ill equipped to get access, so any restriction on the use will just shift costs around (it costs the provider more to support/control multiple ways of cluster usage, but if you restrict, it costs each user more to potentially hire people that understand it/train people to understand it)
        
        good points :) thanks!
How can I evaluate what my tasks/jobs require in terms of computing? Is it trial and error ? What's the roadmap?
- Generally: Try to build a small example, or a few small examples that are getting larger. Run those with your code, see how it scales. This often gives a good impression on what's needed. And at the same time, you also get some testing in, and can debug with "known" inputs.
- I believe we also quickly go over things like this later in the course.
It was mentioned that large files (>5Gb) are difficult to transfer. Does an user need to use a particular CSC service to work with large files (e.g. large images)?
- It depends :) There are storage systems that are good for storing large files / lots of files, but they might be slow for read/write during computations. This is why HPC systems have a so called "scratch" disk that is not backed up, but that has fast I/O (read/write) access. + see the good point below on integrity
- The main issue with large file transfer is often integrity, i.e. making sure, that the data that you copied is not corrupted. Tools like rsync or specialised file services) can help there, since they do automatic file checks. They also allow resuming of data transfers, which can become important if the upload speed is limited and you don't want to risk it breaking after 80% is uploaded and have to redo it.
- It depends. These days 5GB isn't that much (if you have a good connection) but doing it thrgouh a web browser may not be fully relaiable. Command line tools like rsync can restart if the connection gets interrupted and handle almost anything.
  - Thanks to all!
Are the CSC MOOC courses free for finnish uni students?
- I am not from CSC, but I am pretty sure yes. We will ask Juho in the stream.
"billing" doesn't mean you pay, right?
- CSC resources are free for academic projects. Billing units are essentially monopoly money that your PI requests, for the purpose of making sure you use the resources sensibly.
Is there an advantage of using CSC clustre rather than TRITON?
- It is bigger, more resources (and from CSC you can scale even more to LUMI who is the 3rd biggest supercomputer in the world)
- CSC is ideal for cross-university collaborations (even cross-countries)
- CSC has very nice self-service interfaces (adding members, adding resources, services)
- CSC is not just HPC computing
- Triton is more integrated with Aalto ecosystem (laptops, workstations, usernames, projects, access control)
- Some research groups on Triton have access to dedicated resources (e.g. newest GPUs)
- Triton has a more flexible data policy (we basically don't delete, support helps for the whole project data lifecycle beyond triton)
- Triton has integrated support with other Aalto IT/Dept IT support/Research Software Engineers
If you group leader doesn't have time/knowledge to apply for a CSC project, is there anything to do?
- :sweatsmile: :)
- I guess we can talk with them and convice them? Get in touch scicomp@aalto.fi (if you are at Aalto)
Will the commands we learn for triton work the same on the CSC cluster? E.g. I understood slurm is an alias
- All the concepts apply, as an instructor said maybe 10% is different (enough you often need to check…)
- In the end they are all compatible with slurm documentation, + local minor changes https://slurm.schedmd.com/sbatch.html
What happens if you finish the billing money while running your scripts? interruption and losing everything?
- Good question! Usually you get enough email reminders before that happen. I doubt things will suddenly "stop working", all you need to do in the end is ask for more.
- I'm pretty sure I have gone over my billing units in a project before..
- If you are out of billing units, the batch system does not allow you to send any new batch jobs (apply for more billing units)
what does #SBATCH -c 128 mean in a slurm script? I use 128 cpus or cores? What is the difference between them again?
- For slurm: Core and CPU are the same. You always request compute units however your system has defined these (normally it's cores).
  - Generally it's specifically logical cores, not physical cores. (Relevant if the node has hyperthreading etc. enabled.) +1
- Regarding the syntax, that can also be written as #SBATCH --cpus-per-task=128. We will generally use the longer syntax in our tutorials for clarity.
How much can you buy for 1 billing coin? e.g. 1 CPU hour?
- The billig units are CPU/GPU/Storage GB hours. I.e. you will et 1 hour of compute time on one core for one unit. or 1 hour use of one gpu for a GPU billing unit, but addmittedly I'm not entirely sure how Storage units are calculated.
- Ok, do you get a message while starting the job? E.g. "estimated cost for this 10 coins, do you want to proceed?"
- In general, you do not need to worry too much about the billing units, only when you start doing large scale calculations, https://docs.csc.fi/accounts/billing/
- The initial amount of billing units is 10000, which is just for starting, and 100000 more is granted automatically (apply in my.csc.fi). For more, you need to explain what you plan to do, and the application will go through a review. And do not worry too much about the application, you basically need to convince the review board that you know what you are doing and there are some academic results coming.

Lunch until xx:00 (13 EEST, 12 CEST)

We will get hands-on after, so it's the last chance to get your cluster access working
We will start doing a lot through the command line, if you want to quickly read that material
You can keep asking questions below (and we will keep answering above).

CATS
Is it feasible to perform data preprocessing (chunking), embedding generation with intfloat/multilingual-e5-large-instruct, and FAISS indexing on CSC's Mahti or Puhti?
- I don't know details of what that is, but almost certainly yes.
Where can I find a list of useful commands?
- At least for triton we have carefully curated:
  - Quick ref (longer): https://scicomp.aalto.fi/triton/ref/
  - Printable quick ref: https://aaltoscicomp.github.io/cheatsheets/triton-cheatsheet.pdf
- These are good starting points

Setting up your project

We cover various sections now, not strictly following the Triton tutorials.

https://scicomp.aalto.fi/triton/tut/intro/

Copy code to the cluster: https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo

Editing and running things through the cluster is costing computation time? Or does Triton not have any credit system?
- Triton uses a fairshare system, where the amount of resources you use will affect your priority in the queue. You're free to use (almost) as much resources you need, but you will sit in the queue longer. Changes to priority have a half-life of two weeks I think, so even using large amounts of resources will not kill your priority forever.
  - This is only for the jobs (computational tasks like running simulations etc) and doesn't affect login and accessing your data.
    - Yes, that is a good point. Login node just has a soft cap of 10GB ram usage and hard cap of 20, and you can only use two cores. Running any calculations on login node is ill-advised. (You can use it to compile your code / create conda environments etc.)
How do you get the list of commands that have been run?
- I made it: https://github.com/rkdarst/prompt-log
I recommend wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip which is the same data we will be using - 100 books packed together
Should I work in $WRKDIR or $HOME in Triton?
- So, every user by default has two directories. You can create projects (by emailing us) and create new project folders as well. But to answer your question, the $HOME directory has a higher SSD speed and it's regularly backed up, but has a small size (20 GB). The working directory has a larger size (200 GB, and it can be added if needed), but there is no backup.
  - So the default answer is $WRKDIR, unless you have a very valuable file you want to keep in $HOME.
  - Private stuff (e.g. your ssh keys): $HOME
  - Work experiments, self-learning, one-person-projects: $WRKDIR
  - Articles/projects as part of a team: ask for a shared project folder
- In Turso (Helsinki University): use /wrk-vakka/users/$USER as your work directory. Like in Triton, the /home directory is very small and the quota is easily filled up. Turso doesn't employ a $WRKDIR environment variable, since more advanced users must choose either /wrk-vakka or /wrk-kappa directories depending on the cluster federation (Ukko or Kale) they are using: /wrk-vakka for Ukko and /wrk-kappa for Kale. On this course, always use /wkr-vakka and Ukko federation: #SBATCH -m ukko
what if it says "fatal: repository 'https://github.com/AaltoSciComp/hpc-examples/-git/' not found"?
- (fixed the typo, still the same)
- does it end in .git ?
- ah, i left an extra slash
What is the link to get the data? The link they are using to download the gutenberg files. Thanks!
- Aha, this is the link to the project: https://www.gutenberg.org/
- https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- They are also pre-downloaded on Triton. You can copy them from /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip
What are those two dots before the end of a directory?
- .. means the above directory. So cd .. goes up one directory, and in the python command it tells the file is found in directory one step above.
Is the code ngrams/count.py supposed to be incredibly slow locally but fast in the cluster?
- It should be about the same speed locally as in cluster.
- It could depend on the speed of your hard disks (but even then, it should be similar)

I think I missed something… Does the following look about right? (changed the username and the login). What were the steps after?

[user@login hpc-examples]$ cd hpc-examples/
bash: cd: hpc-examples/: No such file or directory
[user@login hpc-examples]$ git clone https://github.com/AaltoSciComp/hpc-examples.git
Cloning into 'hpc-examples'...
remote: Enumerating objects: 428, done.
remote: Counting objects: 100% (179/179), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 428 (delta 90), reused 128 (delta 48), pack-reused 249 (from 1)
Receiving objects: 100% (428/428), 74.86 KiB | 2.27 MiB/s, done.
Resolving deltas: 100% (177/177), done.
[user@login hpc-examples]$ ls
R           gpu           io    mpi     openmp    python  slurm
README.rst  hpc-examples  misc  ngrams  postgres  scip
[user@login hpc-examples]$ pwd
/scratch/work/USERNAME/hpc-examples
[user@login hpc-examples]$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add ``<file>`..." to include in what will be     committed)
        hpc-examples/ 
    nothing added to commit but untracked files present (use "git add" to track)

Yes this looks good.
It looks like the repository was cloned inside of itself, but that won't affect things now.

What was the command to calculate the word frequency?
- python3 ngrams/count.py --words DATAFILE
- You can add -n 2 to count 2-grams
How do I unload the .zip file??
- You do not need to unpack it. The count.py-program can work with the zipfile.
what is the difference between $HOME and $WORKDR directories? Where should I copy the example zip?
- Home directory is meant for small configuration files and ssh keys etc. that are needed by the operating system. All data and code should be in work directory. You can copy the data to $WRKDIR/gutenberg-fiction. Go to $WRKDIR and create the directory either via OnDemand or use command mkdir gutenberg-fiction.
Am I missing something?
```
$ python3 ngrams/count.py --words gutenberg-fiction/
python3: can't open file '/scratch/work/USERNAME/ngrams/count.py': [Errno 2] No such file 
or directory
```
- It could be that the file is not stored under /scratch/work/USERNAME/ngrams. You might want to check what is the content of the folder with ls /scratch/work/USERNAME/.
- Yeah, you need to adjust the data file path to match where it is on your cluster. That is where we did it on Triton but you need to add in your account name for USERNAME: /scratch/work/lastf1/…
Should my usual workflow be that I code somewhere else, and then I copy my code to the cluser and run it there? I.e. avoid to code on the cluser?
- Usually you want to do it the other way around: you want to code on the cluster because that is where you will run the code. Meaning that you'll want to modify the code that is stored in the cluster work directory. But you can use either remote editors or you can mount the filesystem so you can use your editor of choice to modify the code. You'll usually want to avoid the hassle of copying stuff to and from the cluster because it is usually the most boring and complicated part.
im getting some error with reading the zip file UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 37: invalid start byte, it worked when I unzip the file
- Hm. You are trying to run count.py on the zipfile itself?
- yes
- Did you run it with python3?
- python3 ngrams/count.py -n 2 --words ../gutenberg-fiction/Gutenberg-Fiction-first100.zip this was my command
- Edit: worked when I deleted and downloaded the zip file again
Link to OnDemand?
- Aalto: ondemand.triton.aalto.fi
- Helsinki University: https://ood.cs.helsinki.fi (HU network connection needed, e.g. VPN)
I'm getting an error 'too many values to unpack (expected 2)' when calling the –words ngrams/count.py for gutenberg fiction.+1
- Weird. It's count.py you are running, not one of the other commands?
  - The full command is 'python3 ngrams/count.py –words gutenberg-fiction' and I'm running it in the hpc-examples directory
- Ah. Needs to have a file path including .zip at the end.
  - Full file path in front of 'gutenberg-fiction'?

I am getting an error Found 0 files,

[user]$ ls
R  README.rst  gpu  gutenberg-fiction  io  misc  mpi  ngrams  openmp  postgres  python  scip  slurm
[user@login4 hpc-examples]$ python3 ngrams/count.py --words gutenberg-fiction/
Found 0 files in gutenberg-fiction/
But there is a .zip file there. What did I wrong?

Need to give it path to the .zipfile (not directory, which seems to be empty)
- Thank you! Now works (:
  Found 0 files in /scratch/work

Where is the result data saved (after running the python program)?
- In the current case it is just printing it to the terminal, not saving it. There is an option to save it to a file.
What does it mean that in linux everything starts from root? What is root?
- In Windows there is "C drive". In Linux there is / which is called the "filesystem root" - everything is available from under there. Even if it's a different disk, it is mounted at some subpath. HPC-kitchen storage talks about this some.
I get this when i run the code :
- Needs the .zip on the end of the filename (full path)
- ok becouse i took the non zip version, but i'll try with the zip thanks
  bash-4.4$ python3 ngrams/count.py -n 2 –words ../Gutenberg-Fiction-first100
  Found 0 files in ../Gutenberg-Fiction-first100
  2-grams: 0

Walltime 0.01 s
User time: 0.03 s (0.03 + 0.00)
System time: 0.00 s (0.00 + 0.00)
MaxRSS: 0.013 GiB (0.013 + 0.000)
bash-4.4$

Still a bit confused why this doesn't work: python3 ngrams/count.py –words ../gutenberg-fiction/Gutenberg-Fiction-first100.zip
```
python3: can't open file '/scratch/work/USERNAME/ngrams/count.py': [Errno 2] No such file or directory
$ ls /scratch/work/USERNAME/
gutenberg-fiction  hpc-examples
```
- Under the /scratch/work/USERNAME/ directory there is no ngrams subdirectory. Most likely before running python3 ngrams/count.py, you need to cd hpc-examples
- Check what is your current directory. Type pwd and verify that you're in the hpc-examples-folder. If not, move to the folder with cd hpc-examples
  - Got it working! Thanks!

Why I am getting this error?:

Traceback (most recent call last):
  File "/scratch/work/USERNAME/Gutenberg-Fiction/hpc-examples/ngrams/count.py", line 154, in <module>
    main()
  File "/scratch/work/USERNAME/Gutenberg-Fiction/hpc-examples/ngrams/count.py", line 133, in main
  for filename, data in filelist:

Can you add the command you used to create that error?
- python3 ngrams/count.py --words gutenberg-fiction
  - You'll need to give the path to the filename as in gutenberg-fiction/Gutenberg-Fiction-first100.zip
still i am getting this:

python3 ngrams/count.py --words gutenberg-fiction/Gutenberg-Fiction-first100.zip ..
Traceback (most recent call last):
  File "/scratch/work/USERNAME/hpc-examples/ngrams/count.py", line 154, in <module>
    main()
  File "/scratch/work/USERNAME/hpc-examples/ngrams/count.py", line 128, in main
    filelist = sum((opendir(input) for input in args.input), [])
  File "/scratch/work/USERNAME/hpc-examples/ngrams/count.py", line 128, in <genexpr>
    filelist = sum((opendir(input) for input in args.input), [])
  File "/scratch/work/USERNAME/hpc-examples/ngrams/count.py", line 64, in opendir
    z = zipfile.ZipFile(dir_)
  File "/usr/lib64/python3.9/zipfile.py", line 1250, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'gutenberg-fiction/Gutenberg-Fiction-first100.zip'

I get this error:

bash-4.4$ python3 ngrams/count.py -n 2 --words ../Gutenberg-Fiction-first100.zip
Found 200 files in ../Gutenberg-Fiction-first100.zip
Traceback (most recent call last):
  File "ngrams/count.py", line 154, in <module>
    main()
  File "ngrams/count.py", line 134, in main
    ngrams_total.update(process_file(filename, data, args))
  File "ngrams/count.py", line 97, in process_file
    data = data().read()
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 37: invalid start byte

I wonder if the file is corrupted/changed somehow. Hm…
You might want to try downloading / copying the zip file again. ok i'll try thanks
Are you running this on Triton? no on Lehmus
Student who had the same problem: i had zipped the file locally and uploaded it the first time which might be the reason i got the issue, using the wget command worked
- Out of curiosity, what OS do you use? It does sound like the encoding broke during that. , Mac . yes mac M1
I did the same zipped the file locally then send it maybe the prob comes from that, but how to use wget?
- wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip, this in the gutenberg folder, it worked thank you. :D np

I did the pi exercises. I missed where are the gutenberg books? What exercise is with that?
- The files are here: https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- The idea of the excersice is to calculate the ngrams
- You can find them all collected here: https://scicomp.aalto.fi/triton/tut/exercises-ngrams/
- But we didn't go in that exact order.

Could you add the instructions row by row what should be done from the beginning?

Solution is below:

cd $WRKDIR
git clone https://github.com/AaltoScicomp/hpc-examples.git
# Copy data to triton or run the the following commands
mkdir gutenberg-fiction
cd gutenberg-fiction
wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
cd ..
# Continue here if you have the file already
cd hpc-examples
python3 ngrams/count.py --words ../gutenberg-fiction/Gutenberg-Fiction-first100.zip

Now this should work
thank you!

-How to conduct parallel processing in case of computer vision data i.e. using point clouds or images ? I want to conduct training on huge amount of data (containing point clouds) where sometimes its very difficult to load even more than 1 batch size to a single GPU.

Exercise: setting up your project (we return at xx:45)

Try to repeat what we just did. If you can't, it's OK: work on it as homework. The parts:

Download the code
- Exercise Shell-4 from https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo
Create a directory to store the data
- Exercise Storage-1: https://scicomp.aalto.fi/triton/tut/storage/#exercises
Copy over the data
- Exercise RemoteData-1: https://scicomp.aalto.fi/triton/tut/remotedata/#exercises
- You can probably directly download (if you don't have OnDemand / command line is too hard) with the shell command wget https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip

If you can't, it's OK - most people need to play with this from home. If your cluster is not Triton, you will need to do these steps some different way.

I am:

Done: ooooooooooooooooooooooooooooooo
Not trying:ooo
Had problems: ooo
Done… 8606226 empty spaces taking lots of space

Try to count 2-grams and 3-grams with the right command line option. Use --help to find it
You can also try it with the 1000-book file (see the links to find it)
how to kill run? ctrl + c not work
- Does ctrl+z work? Note that this doesn't kill the process, it just stops it and puts it in background. You can then fully kill it by running ps, finding the PID and running kill PID_OF_THE_PROCESS.
  - need to try next time. now just I refress the page and it ended (is this now killed)
  - PID TTY TIME CMD
    1382888 pts/8 00:00:00 bash
    1476091 pts/8 00:00:00 ps
    - It doesn't look like there is anything running anymore at least. The first process is your terminal and the second is the ps command you just ran.
- Perhaps something got messed up internally (for example some cluster connection or disk access is hanging that makes things hang.) You can always log out and in again.

Break until xx:08

You can keep asking questions
If things didn't work, it's OK. These are hard, and you will get it! You can always ask someone to help you. Some thing need more eyes to solve.

What is Slurm?

https://scicomp.aalto.fi/triton/tut/slurm/

Where is the Triton cluster? Can one visit it?
- Espoo, somewhat close to Aalto. Unfortunately it requires an NDA to visit and is kind of hard. But it looks cool, about 20 racks
Q: So if we are decising on how to allocate resources for our job, we do know the available resources such as Puhti, but how do we decide accurately what we need? (Approx of course). (e.g. some days ago I tried running a job in Puhti and took the template bach file from a colleague, the initial parameters made the job wait for the required rources to be available, then I realized that I didn't need that much, so I chanedge parameters, and the job started inmediately)
- Usually estiamate from your own computer
- Or over-request, see what was used, and then reduce (on increase)
- If it's possible for you to run a short test version of your code, that can be very helpful. Request some resources, see if it ran / used most of the request, then adjust. Afterwards scale the resources to match the full code.
  - We will go over methods of checking how much resources your job actually used at some point.
- "How big is my program?": https://scicomp.aalto.fi/triton/usage/program-size/
  - I'm the guy who asked: Thanks for the link :))

Interactive jobs

https://scicomp.aalto.fi/triton/tut/interactive/

Rule of thumb: On Triton most nodes have at least 4GB RAM per core, so if you can stay below that you're good. You can of course use more if your programs requires it, but that's a good starting point when you don't know how much you need.
The srun outputs Walltime, User time, System time. What are they?
- This was specific to the code, but these terms are fairly videly used. Walltime: the "clock" time spent on it. User time: cpu time spent executing the code, this could be greater than Walltime if the program ran on multiple cores. System time: time used by OS doing things.
Is Slurm running in Kubernetes? so are these clusters ran on kubernetes?
- No, they are separate things that do somewhat similar stuff.
Hey, did I miss something?
- [user@login4 hpc-examples]$ srun –mem=1G –time=00:10:00

python3 ngrams/count.py --words ../gutenberg-fiction/Gutenbe
rg-Fiction-first100.zip
srun: error: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Oh, an admin needs to fix this.
- To elaborate: this looks like a bug that occasionally happens with our script for creating new users.
- I see! Thank you
Are you in the zoom room?
- Yeah, I'll send a message!
- fixed
- Thank you!
How did you exit out of the nano modifier?
- Ctrl + x should exit the text editor
my count-2grams.sh has failed when I look at my slurm history. How do I find out what went wrong?
- There is usually an output file ending in ".err" (on stream, just saying somethin about output files)
- Or ending in .out since by default it combines error and output. You'd look in there, see what the problem is, and repeat. It's actually rare it works on the first time.
what did you write in the nano modifier?
- In the file we edited? Example is below.
How can we estimate how much time and memory do we need?
- See above, we discussed a bit in another answer, usually trial and error. Some hints here: https://scicomp.aalto.fi/triton/usage/program-size/
I'm stuck after the sbatch command.. How do I stop it from running?
- You can force exit by pressing Ctrl + c
I am trying to to slurm queue but got this error:
bash: slurm: command not found
- I guess it's not on your cluster. squeue --me probably works.
- Yes that works why ? why slurm is not in my cluster and how to know the command addapted to the cluster
  - slurm queue or slurm q is a custom command that is not available on all clusters. It does essentially the same thing as squeue but in a slightly more user-friendly way. squeue is the stock Slurm command.
- sacct or sacct --long are replacements for slurm history.
- I see thank you.

Serial (batch) jobs

https://scicomp.aalto.fi/triton/tut/serial/

Commands we run:
- sbatch
- slurm queue or squeue -u $USER
- slurm history or sacct or sacct --long

Exercises until xx:55

Try to repeat what we did. If it takes too long, it's OK - you can take more time after the course is done. What you do now is what everything else will build on, so try to make sure you can manage. It's also good practice for terminal work.

Try running the interactive job by adding this in front of your python3-command:

srun --mem=1G --time=00:10:00 python3 ngrams/count.py --words ../gutenberg-fiction/Gutenberg-Fiction-first100.zip --output gutenberg-words-2grams.out -n 2

Try creating a submission script count-2grams.sh with the following contents:

#!/bin/bash
 #SBATCH --mem=1G
 #SBATCH --time=00:10:00

 python3 ngrams/count.py --words ../gutenberg-fiction/Gutenberg-Fiction-first100.zip --output gutenberg-words-2grams.out -n 2

and submit it with

sbatch count-2grams.sh

what is this error? srun –mem=1G –time=00:10:00 srun: fatal: No command given to execute.
- You need to give srun something to actually execute, for example srun --mem=1G --time=00:10:00 python yourscript.py, it won't do anything on its own.
What was the command line used in the bash script? I missed it. It had something like -n 2 at the end. What does it do?
- For running the 2-grams add -n 2 to the arguments for count.py
- Otherwise exactly the same as we typed before.
Why I see this on CSC MAHTI
```
  srun: error: AssocMaxSubmitJobLimit
  srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
- In CSC machine you'll need to specify a partition and an account that you'll want to use. See this page for Puhti or this page for Mahti. Do note that you do not need to request the MPI tasks for this script so no --nodes or --ntasks-per-node.
What was the command to get nano? Was it just nano? And also how do you name the script?
- just nano, yes.
- The name doesn't matter other than how you keep it organized in your mind. Our convention is to end in .sh
- In this case you can call it count-2grams.sh, for example.
  - I get that, but I am not exactly sure where you name it, like I am not seeing any name this script section.
    - found it, very dumb of me.
what do I need to do after I have created the submission script?
- Try submitting it with sbatch script_name.sh
- I don't understand. Do I first need to exit or execute?
  - After you have written your script, the sbatch command will put it in the queue and everything will happen in the background.
- I get this error: sbatch: error: Unable to open file count-2grams.sh
anyone else having this error: slurmstepd: error: execve(): ex.sh: Permission denied
- Is that the right file name for what you made? Then, what command do you run? sbatch ex.sh should be it.
- The submission script could also have incorrect file permissions, but not sure how that would happen here.
- Can you write the command you used?

file contents:

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --mem=4G
# Run your code here
python my_script.py

my_script.py simply has a hello world program in it
- This looks OK. Are you running sbatch ex.sh or ex.sh. The script ex.sh isn't marked as executable but it can be run by something else (in this case sbatch)
  - okay my bad, i run srun instead of sbatch, thanks for the clarification, but why didn't it work? is srun only for directly interacting with scripts?
    - Oh yes! That's it. Lots of stuff like that that can subtly go wrong…
    - srun tries to start an executable (instead of run a script) as an interactive session. Your script wasn't an executable, so you got that error.
- using chmod +x on the file solved the issue, thanks!
  - srun shouldn't be used to run batch script, since srun doesn't consider the #SBATCH commands inside of the file. You could do it for testing if you give it the right command line arguments, though.
What did this script exactly do? It finds the most common words used in the file, but what does adding -n 2 do?
- It finds the most common pairs of words, that is, "2-grams" (tuples) of words, with the -n 2 option. You can try increasing it to see what happens, too.
how to open the sh file? am I missed something? sbatch: error: Unable to open file run-pi.sh
- Did you name the file exactly run-pi.sh and you are in the directory you made it?
Why this error when I submit:
- sbatch: error: Unable to open file count-2grams.sh
- This is the command I used: sbatch count-2grams.sh
- Was your batch script named exactly count-2grams.sh ?
  - And also are you in the same directory as the script? You need to give the whole path to it, if not. (Relative or absolute path)

one question about quota:

$ quota
quota: Cannot resolve mountpoint path /root/admin: Permission denied
quota: error while getting quota from csfs-samba.cs.helsinki.fi:/wrk-kappa for daekim (id 1251306): Connection refused

HY admins can maybe confirm, but it sounds like you need to be an admin to run quota on HY's cluster
Yeah, seems like internal admin problems. To be reported to them.

Is it normal for the Turso system to crash often? should I reduce the jobs/tasks?
how to change or controle the number of proc used in parallel in the command?
- if you talk about the number of CPUs assigned to the program: –cpus=X either in the SBATCH script or as command line argument.
- I see, i meant processors does that means the same thing as CPUs?
  - yes. CPUS/processors/cores are viewed as mostly the same from a slurm perspective. As mentioned above somewhere, Slurm considers one logical core to be one cpu (i.e. if the machine has multithreading activated - multiple logical cores per phuysical core one cpu will refer to one of those logical units)
- Ok, so if we don't precise what the default number of cores is used ?
  - default is 1 (on most clusters), but depends on the cluster configuration.
    - I see, thank you so much.
Running the script made the output file that contains all the paired words correctly. But it also has created another output file called "slurm-7860384.out". It shows the runtime. Did it create it automatically?
- When you submit a slurm job using sbatch slurm will capture all output that the code prints. If you run the ngrams without slurm it will print that output. Slurm will save the output to a file named after the job id that the job gets when it is submitted. This allow you to monitor and check what your job has done later on.
- Also regarding the output file: By default it will save in the same directory with name slurm-JOBID.out. You can change the name and save it in different location with --output=filename argument. (#SBATCH --output=filename inside a batch script.)
after we add sbatch count-2grams.sh in nano window, how we start the process? what shall we press?
- you don't do this in nano. sbatch <script> submit the job defined in <script> to the queue. this needs to be done in the terminal, not in an editor.
- the problem is I see the window for the terminal but cannot edit there since the moment i call $ nano count-1grams.sh
Editing the socript with nano count-2grams.sh looks to give improper formatting on my machine. I get the following error
```
sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).
```
- Looks like you're using Windows that can add those line breaks.
- Write the file on the cluster directly so you don't get windows linebreaks. Alternatively we have a tool for converting them. You can run module load dos2unix and then dos2unix your_script.sh. (Assuming Triton)
What does -l on the /bin/bash line do?
- Runs bash as a login shell. This generally doesn't make a huge difference in practice.
  - But it does run a couple of scripts that would not be run on a non login shell, so it ensures that you get the same situation that you have when logging into the cluster

I tried to do the pi.py in slurm. I get the following error:

$ srun --mem=50M --time=01:15:00 pi.py
  srun: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
  srun: job 7860487 queued and waiting for resources
  srun: job 7860487 has been allocated resources
  slurmstepd: error: execve(): pi.py: Permission denied
  srun: error: pe77: task 0: Exited with exit code 13
  srun: Terminating StepId=7860487.0

Try adding python3 before pi.py. srun cannot execute the pi.py directly.

Is it valid to think of sbatch script as a way to reserve a pool of HPC resources, which are then consumed by commands within it, including srun?
- In some way yes. Maybe it's somewhat similar as having an appointment, where you still draw a ticket. By submitting you the script you have made the appointment, but things might come up for the doctor, so you still have to wait till you are called.
It is not possible to write the commands directly on the command line instead of writing bash scripts?
- It is, but then you'll need to be there connected and write them manually. By writing them in a script you can let the queue system run them later. In addition you won't forget that you wrote because it is written in a script.
I get this error: sbatch: error: Unable to open file count-2grams.sh
- Are you in the same directory as the script / are you sure the name is correct?
- How can I check the directory in nano?
  - You don't need to do it in nano directly. You can check the current directory in the shell with pwd and list the files there with ls.
  - What should I do when I exit nano when there appears the file name to write?
    - If you want to save the file, choose yes and then enter to confirm the filename.
What was the way to fix the error srun: error: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
- An admin needs to fix it for you. Are you in the zoom room?
  - I do not have Zoom installed right now. Is there another way?
    - You can mail us then, for example scip@aalto.fi. We need your username etc. and it's better to not write it down here.
    - Ok I'll do that!
does it affect the running time to use batch for multiple jobs?
- We'll talk about this more when we do array jobs tomorrow, but each job gets its own reservation so running time for each job is independent of earch other. So when running multiple jobs you won't need to add the runtimes together.
- I see so different processors take care of the different jobs?
  - Yes, and they can also be on completely separate nodes. Each job is entirely independent.
do we have homework for tomorrow?
- Keep trying examples and reading the material linked from the schedule until you feel a bit more comfortable (but days 2-3 also serve as a review of today, so you don't need to know it perfectly.)

Respond here where you are with the execises

I'm:

Done: ooooooooooooo
Not trying:
Having problems: ooo

Feedback

News for day 1 / preparation for day 2

Today we covered what was on the schedule
- More info on all topics in the schedule
You may need more time to work through the exercises in the second half of the day. Read through the tutorials linked, it presents the same things a different way.
- In the pages linked from the schedule, you can find exercises - and almost everything we did is an exercise with a solution there.
You need what we did today in order to continue tomorrow - don't be afraid to ask for help!

Today was (vote for all that apply):

too fast: oo
too slow: ooo
right speed: 0ooooooooo
too slow sometimes, too fast other times: ooooooo
too advanced: o
too basic: ooooooooo
right level: ooooooo
I will use what I learned today: oooooooooooooooooo
I would recommend today to others: oooooooo
I would not recommend today to others:

One good thing about today:

Was super interesting +4
Good explanations +6
It's nice to have so many people answering questions so quickly +11
Good amount of breaks +7
I loved your teaching way +1
Great job +1
Geat job . Thanks. +1

One thing to improve for next time:

I think it's a good idea to always mention the pain/problem first, then go for the solution. E.g, one should know why queuing is hard for clusters then we get Slurm. Nonetheless, topics covered were critical. Many thanks!
Clear instructions what to do before the lecture +4
..

Any other feedback?

would've been nice to see the solutions somewhere (i'm just a little lost)
- You can find solutions to most of it in the exercises in the linked pages (we should have highlighted this more).

General questions continued:

I am not pretty sure I got the idea behind bash scripts +2
..
..