# Introduction to Scientific Computing and HPC / Summer 2024 (archive)

:::danger
## Infos and important links

- To watch: https://www.twitch.tv/coderefinery
- To ask questions and interact (this document): [link for those registered, check email]
    * *To write on this document, click on the :pencil: (Edit) icon in the top left corner and write at the bottom, above the ending line. If you experience lags, switch back to "view mode" ("eye" icon).*
- Previous versions of this document are archived at: https://hackmd.io/@AaltoSciComp/scicomphpc2024_archive
:::

# Icebreaker

**Test how to use this HedgeDoc live document; this is how we ask questions and get help.**

- This is a question, or is it?
    - This is a reply, yes it is!
        - This is a nested comment to the reply
            - Actually... insert smart comment here
    - This is a different reply, but still valid
- This is another question, or maybe not?
- ...
- test it.
- ... write something...
- testing this again :)

What is your current position?
- Research assistant: ooooooooooooooooooooooooooo
- PhD student: ooooooooooo
- Other academic:
- Compute center staff: oooooooo
- A student of life:

Where are you now?
- Helsinki: ooooooooooo
- Kirkkonummi: o
- Oulu: oooooo
- Espoo: oooooooooooooooooooooooooooo
- Trento (Italy): o
- London: o
- Bilbao (Spain): o
- Lohja, FI: o

What's the biggest computer you have used (free choice, can also "+1")?
- LUMI +3
- Aalto Triton +4
- TU Berlin cluster
- Puhti +6
- Mahti +3
- Just a normal PC with a good CPU+GPU :( +10
- Turso

What do you work on?
- Neuroimaging, statistics, research ethics and data management ooo
- LLMs +1
- IT support: o
- Sensor data +1
- NDE o
- NLP stuff
- Reinforcement learning
- Combinatorics
- Mechanical simulations
- Bioinformatics +2
- Biophotonics and computational electrodynamics
- VAEs and RL
- Hyperspectral data
- Satellite remote sensing +1
- Hydraulic simulations
- Radiation physics simulation o
- Speech recognition
- ...

How is the audio?
- All good
- Simo is lowest
- Simo is lowest, Enrico is loudest, but the difference is minor.

# Day 1 - 4/June/2024

- BUG REPORT: The link to the YouTube playlist on the website (https://scicomp.aalto.fi/training/scip/kickstart-2024/) does not work: https://studio.youtube.com/playlist/PLZLVmS9rf3nOeuqXNa8tS-tDtdQrES2We/edit
    - Thanks!
- Remember to set the Twitch player resolution to "source".

## Day 1: Practicalities / intro

https://scicomp.aalto.fi/training/kickstart/intro/

Write any questions here:

- I am using VDI Ubuntu 20.04 with Nvidia. It is now running CUDA version 11.2. I am a research assistant at ELEC. I need CUDA 12.5 for research purposes. As I don't have any installation permissions, how can I do that?
    - That depends on whether the drivers/hardware support CUDA 12.5. If they do, you can e.g. use conda to set up a virtual environment and install the appropriate cuda-toolkit in the environment (see the sketch below this thread). However, if your hardware/drivers don't support CUDA 12.5, you will need IT support to upgrade those.
        - I don't have sudo permissions
    - Is this an Aalto VDI computer?
        - Yes, it is an Aalto VDI computer
            - OK, yeah. These aren't very flexible and aren't designed for complex scientific software installation (and don't have that much power). We'd recommend using our cluster for this; we can much more easily install software there. You are in the right place, we cover this on days 2-3!
    - What kind of tools do you need to run? Some graphical tools or non-interactive things?
        - https://github.com/google-research/timesfm , I want to run this on CUDA 12; it is a time series forecasting model
    - Is this cluster related to Triton?
        - If you are referring to "our cluster" from the previous answer, then yes, that meant using Triton.
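To make the conda suggestion above concrete: a minimal sketch, assuming the NVIDIA driver already supports CUDA 12.x and a conda/mamba installation is available. The channel and package names (`nvidia`, `conda-forge`, `cuda-toolkit`) are typical but not guaranteed; check what your setup actually provides.

```bash
# Sketch: user-level CUDA toolkit in a conda environment (no sudo needed).
# Assumes the driver supports CUDA 12.x; package/channel names may vary.
conda create -n cuda125 -c nvidia -c conda-forge python=3.11 cuda-toolkit=12.5
conda activate cuda125
nvcc --version    # should now report CUDA 12.5
```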
- We are using cPouta for our research. However, since we will be collecting research data in the future, we are thinking of switching to ePouta. Do you know the practical differences between cPouta and ePouta? So far I don't see any reason to switch to ePouta, as cPouta can collect data just fine over HTTPS (so there is encryption).
    - In general, ePouta and also the SD services are for working with sensitive data. If your data is not sensitive, you might not need to worry about these services and can stick to cPouta.
    - If your data really is "sensitive", check this table to find out whether ePouta or the SD services would suit your needs better: https://research.csc.fi/comparison-sd-services-epouta
    - And "sensitive" is legally defined. Don't over-classify your data: ePouta is very hard and complex to set up (and will slow down your work). Normal data is "confidential", and other options are easier to use for that. +2 !
        - Thanks a lot! It looks like the SD services are a better option for us (or even cPouta, if our data is not that "sensitive").

## Day 1: From data storage to your science

https://hackmd.io/@AaltoSciComp/SciCompIntro

Questions:

- Hi, I've recently started using the Puhti server for COSMO calculations. Oftentimes I run out of resources. How do I guess the correct amount of resources needed for a particular task?
    - We have a lesson on this tomorrow! In short: if you run out, keep requesting more (a lot more, as a test). Look at the usage and then decrease to what you need.
    - And of course you would usually want to do shorter testing runs for this.
    - Essentially, try to split your problem into smaller fractions. Run a small part and estimate how many resources you will need overall based on that. Then (if you can't be sure that it scales perfectly) add a bit of a buffer to your requirements.
        - Ok great, thanks.
- Should they take into account the number of CPUs, cores or threads when evaluating compute needs? Thanks.
    - Is this question related to the question above, or does it refer to something else? In general, "running out of resources" probably refers to running out of memory or exceeding the time limit. In particular, when exceeding the time limit, the number of used CPUs is important if the code can utilize more CPUs, as this will make the program run faster. This is something we talk about during the parallelization session.
        - I'll wait for the parallelization session to better understand whether the constraint to take into account is the number of CPUs or the number of cores when submitting a job.
            - In the context of clusters, "CPU" generally refers to a single logical core, so those two are generally the same. A minimal test-job sketch follows below.
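As a concrete starting point for the advice above, a minimal sketch of a short test job. The resource values and the `my_analysis.py` command are placeholders to adapt to your own program and cluster:

```bash
#!/bin/bash
#SBATCH --time=00:15:00        # short test run first
#SBATCH --cpus-per-task=1      # in Slurm terms, one "CPU" = one logical core
#SBATCH --mem=4G               # generous first guess; scale down after checking actual usage
python3 my_analysis.py --small-test-input   # placeholder: a reduced version of your real task
```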
## Day 1: (Computational) reproducibility and open science

CodeRefinery: https://coderefinery.org/

If you are interested in what we do there and want to get info when registration for the next workshop opens, sign up for our newsletter: https://coderefinery.org/about/newsletter/ or follow us on Twitter/X (coderefine), Mastodon or LinkedIn. Happy to also answer any question you may have about the project here :)

All CodeRefinery lesson materials: https://coderefinery.org/lessons/core/

:::info
Poll: **Are you already using git?**
- Yes: ooooooooo
- No: oo
:::

Questions and comments:

- Pondering about computational reproducibility subject to data expiration...
    - Yes, this is linked, and thus, if possible, you should also deposit your data in a reproducible fashion (e.g. [Zenodo](https://zenodo.org/) or [CSC Fairdata](https://www.fairdata.fi/en/)). But, especially for sensitive data, there might be legal restrictions on that (like legal obligations to delete data after some time).
- NOTE about Docker:
    - Docker is NOT shipping the machine, but as much of the software as possible. Main restriction: hardware-specific factors (like code optimised for specific hardware, or requiring e.g. specific GPUs, or even networking issues; the latter happen predominantly on clusters, due to unusually high-performance network connections, which might require specific support not present in all containers).
    - A potentially useful rule of thumb: a container replicates the OS (like Ubuntu 22.04), but of course it cannot replicate the hardware.

:::success
## Break until 13:00 EEST
Please rest and stretch your legs, feel free to ask more questions here
:::

## Behind the scenes: the humans of scientific computing

(There are no published materials, a discussion only. Ask questions.)

Questions and comments:

- I am still at the beginning of my career (master student) and I am considering academia: what is the thing that one should know when starting?
    - Try to determine what you like while you do your studies. If it's the research -> go for it. If it's helping others in their research -> have a look at support positions. While the former might be more "prestigious", at least to me the latter is more rewarding: it's often very difficult to see where your research has any direct impact, but if you help someone, you do see that you helped them. Also, don't put too much hope in the concept of academic freedom. While it's there (in theory), in practice you have to write grant applications, which are commonly closely connected to whatever the funder wants to fund, so you kind of have to do what someone else wants you to do. +2
    - Great question! I could add "if you could travel back in time, what would you recommend to the younger you?"
        - (I will raise this question when we get to a later point where it fits)
    - As for software/computation, do learn good practices like version control (git), good coding practices, documentation, etc. It will make your scientific work better and reproducible, but also help you switch to industry later if you choose to. Check e.g. the CodeRefinery workshop for topics that are good to know: https://github.com/coderefinery/lessons
    - One thing I kind of learned: never hold back a question, even if you think the answer is surely trivial. Best guess: >60% of the audience don't know the answer either :smile:
- Expanding on things to consider about academia: what types of positions are there to do research in private companies, and how does it generally differ from academic research?
I understand this can be fairly field- or even company-specific, but if there is something to say about it.
    - Without fully answering: I would distinguish industry vs. public vs. academia/private (there are semi-public research institutes which are pretty similar to academia, but which don't face the financial pressure industry does).
    - In industry it is not uncommon to have an R&D (Research and Development) unit with a focus on innovation, perhaps a crucial stage toward delivering new forms of business and products. This contrasts with research in public institutions such as universities, where the research itself is closer to the end goal (e.g. publishing, academic career development, fundraising, consortia, ...). In addition, there are opportunities for academic research in collaboration with industry, in which the industrial partners fund research they can benefit from later on; this at the same time facilitates academic publishing and possibly the unfolding of further projects.
- Do people in industry use these things you mentioned earlier, like conda, containers and HPC?
    - And git! In my industry experience git was used A LOT.
    - Also Kubernetes (an alternative to the Slurm-based HPC we have at universities) is used in industry; e.g. OpenAI GPT models are trained on a Kubernetes cluster.
    - Depends... I know some companies that use these actively. Some have their own "HPC", but LUMI is nowadays also available for companies, at a lower price than many cloud operators, so there are now also companies in Finland using HPC: https://lumi-supercomputer.eu/news-and-articles/
- Before you go into academia, think twice. Check out #ichbinhanna for problems in academia. #AcademicCrisisList
    - +1 A general rule is to try to get to know the people (supervisor, or boss of a company) before committing to the job, because sometimes having a toxic supervisor/boss/manager will kill all your enthusiasm for science/work.
    - Also a not-so-correct pedigree can be a problem: social background, wrong group, etc. +1
- Many commercial simulation packages use HPC as their backend; not Slurm, but commercial schedulers.
- What is your experience with the PhD overall? Would you say that it was worth it?
    - Yes, for me (SW) it was/is worth it. I learned all the skills that brought me to where I am today during the time I did my PhD (some of which I could also have gotten as a researcher).
        - Motivation to do courses and develop further: as a PhD student you are expected to collect some credits anyway.
        - A PhD title and experience are needed for participating in funding applications, i.e. being even more involved in research, but from the computational support side.
- If we discuss industry and academia: I noticed industry is using AWS, cloud, Tableau, Databricks, SQL, PySpark. I don't see this in academia. What is your take on this?
    - Cost: it is cheaper to buy hardware than cloud services.
        - Yes, national (CSC) or local (Aalto/HY/Oulu/...) HPC clusters are much cheaper than AWS/Azure/Google Cloud, with better protection of data, at least in the long term; it's a bit of a struggle for resources that have to be available "on demand" (i.e. services).
    - A lot of data in academia is not stored in databases, though that is changing. In general, I think the low adoption of cloud processing is partially due to data sensitivity and researchers being reluctant to post their data to the large service providers.
    - I have a feeling that these make sense at big scale, but require specialists to run them. That doesn't exist in academia, and most projects are so small/fast it's not worth it.
- Sure, I understand. But if industry demands such skills, maybe academia should teach them, because everybody knows there are not enough permanent positions in academia.
- Samantha is really cool and a great role model!
    - :clap: +3
- It was very inspiring!

## Day 1: What can you do with a computational cluster?

(Live demo: no written materials. Follow along at a high level and see the big picture only. You learn the details on days 2-3.)

- A comment for Twitch: is it possible to remove the black part around the computer screen?
    - Can you adjust the window size so that it goes away? The raw screen has very little black (it's a portrait screenshare). You might need to hide the chat first, then arrange.
        - I removed the chat and the left side. Only the presentation is visible, but there is still black space around the white terminal, left and right. But no worries.
    - Try to enter Theatre Mode (a button at the bottom of the stream, or alt+t)
        - Yes! That removed the black space.
- Could you please also introduce a bit what the purpose of running this script is? Some background info on the script and data?
    - There are many resources, but everyone has to share. Instead of running directly, we make a "recipe" and hand it off to the computer. It runs in the background (on one of the hundreds of computers) and gives us the results.
    - To make sure no one gets lost, we go over this again at the beginning of tomorrow.
- How well does the Gnome desktop work on Open OnDemand (it's running in the demo)? We are using Xfce in Oulu.
    - Oh yes. Hm, then I'm behind. I think these are generally OK.
- Wait! Did he just say he is using an indenter and he can simulate the material underneath? Ui.
    - Answered on stream: it's simulating something pushing on the simulated atoms.
    - In LAMMPS you can have a [force in some shape (fix indent)](https://docs.lammps.org/fix_indent.html) that can create an indentation in a material. The material here is constructed of individual atoms that this indenter pushes.
        - I think I will contact you for more information! Is this our example for the workshop? How cool!
            - We'll learn about things like this, yes. I don't think we have many demos of the interactive visualization, but that is relatively straightforward; we talk more about running stuff in the background.
                - I am actually interested in the science, because I do this practically :-)
- What is the difference between a login node and a compute node?
    - The login node is the only one available from the internet. Everyone goes there first (so if everyone runs things there, it gets slow). Clusters have hundreds of compute nodes that share the work via a scheduler (Slurm) that we will discuss in detail tomorrow.
    - You can think of the login node as the "keyboard" to the compute nodes (if you think of those as your computer). Essentially, on the login node you give instructions on what to run on the compute nodes.
- And do I really have to know the difference between nodes and cores? I am always confused.
    - It's somewhat important if you do parallel work. We go into it more on days 2-3 (but most importantly: we hope to tell you what options you need).
    - Node -> a full computer (multiple CPUs/cores, GPUs, memory, etc.). Core -> one processing unit (in old times 1 CPU = 1 core; in cluster speak, CPU = core).
    - Node: your laptop (n=1). Cores: the CPUs in your laptop (c=8).

:::info
### Break until xx:02

Then "connecting to the cluster", where we go more into "what is a login node!"

(The past section was a big overview: details come tomorrow.)
:::

## Connecting to the cluster

In this part, we go over how to connect. Hopefully you can, but *you don't need it until tomorrow*. If something goes wrong, you have plenty of time to ask for help.

https://scicomp.aalto.fi/triton/tut/connecting/

- Audio is a bit imbalanced: Thomas is low, Jarno is high
    - (hopefully it was fixed)
        - It's OK
- I was able to connect to Triton with no issues (both VPN and Aalto network). However, the tutorial articles talk about generating an SSH key and copying it to the server. Is this necessary for me? It seems not to be. Furthermore, after the first login there was already a .ssh folder on Triton with an authorized_keys file containing a key as well.
    - If you can connect, that is enough for this course. SSH keys are good and make things easier, but aren't necessary. (Connecting is something that happens very often, so it's worth trying to make it smooth. That's why there are so many recommendations.)
        - So the way to do this would be to use my current way of connecting to Triton and then following the "Copy public key to server" tutorial?
            - Yes
                - This is somewhat confusing. I am looking at the current authorized_keys file. It already contains an SSH key, and that key is the same as the public key in the .ssh folder.
    - If you can connect: don't do anything. On the cluster, ~/.ssh/id_ed25519.pub and ~/.ssh/authorized_keys will contain the same thing.
    - *If* you make a key on your own computer/laptop, *then* you can copy it to ~/.ssh/authorized_keys on the cluster (see the sketch at the end of this section). If you do, you don't need to enter a password when logging in. But you don't need to do this.
        - This is what I assumed; the tutorial is kind of misleading. I interpreted it as though I was supposed to copy the key on Triton instead of my own SSH key (on my laptop, e.g.), but this seemed redundant. Also, if you connect to Triton, the .ssh folder is automatically created, and the tutorial seems to want to create another .ssh folder.
            - I'll try to update it. Which page(s) were the confusing ones? The connecting tutorial seems OK.
                - Yes, the connecting tutorial is good. It's mostly the "copy public key to server" page that confused me (scicomp.aalto.fi/scicomp/ssh). It seemed like it wants me to create a new .ssh folder even though one already exists on Triton for me.
                    - Are you looking at the command line tab? I'm looking into it...
                    - Thanks, I'll work on it when I have time.
                        - The PowerShell/manual tabs
                            - Ah true, now I see it; it's less clear than the command line (Linux/macOS) instructions, we'll try to fix it. Edit: the code works as intended, but the `mkdir -p` command can be a bit confusing: it creates the .ssh folder if it doesn't exist but does nothing if it already does. Maybe the explanatory comment could be expanded a bit.
    - The keys on Triton are used for "internal" communication (and this default key is auto-generated). If you have a job running on a node, you can connect to that node via SSH.
- Triton SSH fingerprints: https://scicomp.aalto.fi/triton/usage/ssh-fingerprints/
    - Is it possible that, when on the Aalto VPN, the Triton one tries to access is a fake one?
        - Within Aalto networks it's unlikely anything can happen. But it's always good to be safe, and it might be a sign of something else going wrong.
    - And if it complains about a changed SSH fingerprint and I don't know why (e.g. I know of no updates on the server), I contact the admins?
        - Yes! An email saying "I get a key error" is always good to send to us if you don't know why.
- Hi. What about the password?
    - For Triton at Aalto: same as your Aalto password. Other clusters may have a dedicated password: check your own cluster's documentation.
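For reference, a minimal sketch of the optional key setup discussed in the thread above. Run this on your own laptop, not on the cluster; `yourusername` is a placeholder and the Triton address is the Aalto example from the tutorial:

```bash
ssh-keygen -t ed25519                     # create a key pair; pick a passphrase
ssh-copy-id yourusername@triton.aalto.fi  # appends your public key to ~/.ssh/authorized_keys on the cluster
ssh yourusername@triton.aalto.fi          # should now log in using the key
```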
- What if one forgot to close the desktop? Does leaving it open affect the functionality of the shared computing resource? Thanks!
    - It will keep your resources reserved for you. Which isn't the worst, but if you don't need it, it's good to stop the desktop (from the OnDemand interface).
- You can also "mount" the Triton filesystem on your laptop: https://scicomp.aalto.fi/triton/tut/remotedata/ (we cover this tomorrow; you need to be on the VPN).
- You can use the "Files" menu from OnDemand, but I wouldn't recommend it for large files (rather use rsync).
- University of Oulu members should use the Lehmus cluster: `ssh lehmus-login1.oulu.fi` or the https://lehmus.oulu.fi OnDemand portal.

# Temporary internet break from our broadcasting studio, stream will come back ASAP

:::info
## Exercise time + break: connecting to the cluster (until xx:00)

- Try to get connected to your cluster. Deadline: tomorrow's course.
- Once you are connected, you can take a break.
- You can ask questions here or in the Zoom sessions (or to anyone else who may be around).
- Those who registered with an @aalto.fi, @helsinki.fi or @oulu.fi email have received a reminder about the dedicated Zoom room. Join the Zoom if you have trouble connecting to your cluster.
:::

- If the standard command with triton.aalto.fi does not work, what should I do to connect over the command line? It says that the resource is temporarily unavailable.
    - Join the Zoom for help
        - Where do I get the link for the Zoom? It was not in the email I got with the link to the resources.

## How to ask for help with (super)computers

Slide deck: https://doi.org/10.5281/zenodo.8392762

Aalto garage: https://scicomp.aalto.fi/help/garage
CSC: weekly user support sessions, every Wed at 14 EEST: https://ssl.eventilla.com/usersupportcoffee/EN
LUMI: every last Wednesday of the month: https://www.lumi-supercomputer.eu/events/usercoffeebreaks/

- No offense to anyone, but sometimes it feels that the support people are bothered by simple questions. I assume they are just overloaded, but the human part is missing.
    - :(
    - This might differ from person to person: some are enthusiastic and happy to help, for others it is "just an (annoying) part of the job". I'd suggest taking it as it comes and asking follow-up questions, or using the "virtual support sessions" instead.
    - From a support person's point of view: I am usually happy to help. I only get annoyed if people seem to actively not listen, or take ages to type a few letters, and you start wondering whether they actually listen to what you try to tell them.
    - Yes, unfortunately that happens. Some of these hints hopefully let you ask better. (The worst for me isn't simple questions, but complex questions without enough background to take any action.)
    - If you escalate slowly (ask someone in your own team first, then the bigger team, mentioning that you already asked locally), you may be more likely to get a good reply.
        - Nice! Thank you.
- Comment from presenter: I forgot to say that it is OK and welcome to ask questions here also after the talk. We will be watching this.

:::success
## Short break in the stream.
We are switching back to the streaming server. Refresh your Twitch window.
:::

- I still cannot connect to the Triton network using SSH.
    - Join the dedicated Zoom at 16:00 when the stream is over.
        - OK, I will; will the link be posted somewhere at 16:00?
            - If you registered before 12:00 with your aalto.fi email account, you already have it in your mailbox. Otherwise, register again with your Aalto account.
                - I did now; will I then get the link at 16:00?

## VSCode on HPC

Questions and comments:

- Note on the Triton login node: nowadays users have their own resource limit on the login node. You can't hog that much of the shared resource, but you can eat all of your allocation and slow everything you do down to a crawl.
    - Yes! Running too much (taking too much memory) can make it so that *you* can't log in anymore.
- Is this working also for Helsinki University?
    - In theory Remote-SSH works for anything you can SSH into. It may work better/worse for some computers.
- Is the filesystem load related to the built-in git functionality, or does VS Code also enumerate files that are ignored by git? (Hopefully conda envs and other large folders would be listed in `.gitignore`.)
    - It enumerates or indexes all files in each project you open. The big problem we have had is when someone opens `/scratch` or similar: it tries to index every file accessible on the whole filesystem.
    - Besides the default indexing, some extensions can also cause problems. I think that Pylance can cause similar issues with constant indexing, for example.
- It is a really nice tool. In the past, my main issue was accessing JupyterHub on the cluster. There is room for improvement.
    - In VS Code?
        - Indeed!
- Is what you're currently doing happening completely on the login node? I read that the Remote-SSH plugin alone will only be on the login node and you have to do additional configuration to run bigger calculations.
    - That's correct.
- Maybe also good to know how to create environments and JupyterHub kernels on the cluster.
    - We have a page about Jupyter on Triton, and it tells a lot about making kernels from environments. Other pages describe making environments (conda environments for Python).
    - [Here](https://scicomp.aalto.fi/triton/apps/jupyter/) is the link to the Jupyter tutorial and [here](https://scicomp.aalto.fi/triton/apps/python/#python) is the general Python tutorial. These are fairly specific to Triton, though; you might need to take slightly different steps on other clusters.
- What is the relationship between SSH and a kernel? How do I close the kernel? Thanks, got it!
    - "Kernel" is a term from Jupyter (but also other things now): it does calculations separately from the main program (Jupyter, VS Code, etc.).
    - SSH connects to another computer.
    - These can be somewhat related, but not directly. (For what we are doing in VS Code, it's not running the kernel over SSH; another VS Code server is running over SSH.)
- What is the connection between a Jupyter kernel and other kernels, e.g. the Linux kernel? AFAIK the Linux kernel is an OS kernel that handles hardware and other OS-related tasks.
    - For all practical purposes they are unrelated, simply the same name (representing a "backend that lets stuff run").
- Sorry, I missed how to switch from the login node to a compute node in VS Code?
    - It is on this page: https://scicomp.aalto.fi/triton/apps/vscode/

## Feedback, day 1

:::info
* Tomorrow, we get to actually running things on the cluster.
* **Make sure you can connect to the cluster by tomorrow.**
* The Zoom for helping with connection problems is still open: you can join for more help.
* Other follow-up reading:
    * "Cluster from the shell": there are several options:
        * Cluster from the command line: https://scicomp.aalto.fi/triton/tut/cluster-shell/
        * Shell crash course: https://scicomp.aalto.fi/scicomp/shell/
:::

Today was:
- too fast:
- too slow: o
- right speed: ooooooooo
- too basic:
- too advanced: o
- right level: oooooo
- a tiny bit too fast: o

Good things about today:
- Great interaction, good responsiveness and informative answers to questions. +2
- Easy to get help, friendly answers +2
- I liked very much the session to get a feeling about things, not doing something. Watch and learn. +1
- 2 people asking each other questions, helpers in the back are excellent +2
- Helpers are patient. Environment of trust, I dare to ask questions.
- The documentation is great!

Things to be improved for next time:
- Explaining the key terminology in plain language or with concrete examples would be very much appreciated. +1
- Maybe tell more about "challenges" in academia, the state of RSE.

Other comments:
- Looking forward to the next sessions! Kiitos, Danke, grazie!
- Thank you for the course!
- What do I do when I don't have access to an HPC cluster? Or do I have to prepare something on another HPC system?
    - I can say about Finland: no, sorry.
    - Please check with your university whether they can provide you access to an HPC cluster; most European countries also have national clusters.
        - Yes, I think I can access an HPC cluster. Do I need to prepare something special? Sorry if I don't know it now, I just discovered the course in the afternoon by chance.
            - Check whether the cluster you will be using requires some special extras for connecting (SSH keys, VPN, etc.). Once you are in, most likely the rest is applicable without any extra preparation.
                - Thank you!

# Day 2

## Icebreakers, day 2

Audio OK?
- seems OK!
- Works
- yes!
- Yes

Have you used an HPC system (computation cluster) before?
- Never: oooooooo
- Occasionally: ooooooooo
- For many years already: oooo

How long have you used computers? What was your first?
- Before teenagehood. Laptop.
- Since childhood, using a basic desktop to play games :) +5
- Since the dial-up era... +1
- It was 1983 and I got a Commodore 64 :D
- Early 2000s, mainly for games :D
- 286 PC around ~1990
- Early 2000s, games, mini-tower computer ("Shuttle", horrible thing)
- Early 2000s, laptop.
- '83 for me too, with a Sinclair ZX Spectrum
- 1991 or so, 8086
- Since 1998, laptop
- Since childhood, late 80s-early 90s; my first computer was a Commodore 64
- Since 2006

What's your favorite ice cream flavor?
- Tiramisu ice cream!!
- Cherry
- Pistachio +1
- Ginger
- Coffee.
- Something sour, red.
- Traditional Persian ice cream, which includes saffron, frozen cream, etc.
    - Oh! Can one buy that somewhere in the Helsinki area?
        - Some Persian restaurants have them. +2
    - It's really good! Actually, it's amazing.
- Hazelnut +1
- Lemon-licorice
- Pistachio
    - Again: can one buy that somewhere in the Helsinki area?
:) I was asking about some Libyan pastry, but then the thread changed to pistachio :)
        - Kolme Kaveria brand has pretty good tubs you can get in grocery stores.
        - Jädelino sells it, grocery stores as well.
        - Cafetoria in Töölö has award-winning pistachio ice cream (~5 EUR per scoop though...)
            - LMAO
- Lemon sorbet
- Vanilla with sour cherry
- Basil-lemon

## Introduction to clusters and your work

https://scicomp.aalto.fi/triton/tut/intro/

:::success
### PSA: Set your Twitch player to "source" quality.
:::

Questions and comments:

- 12000 CPUs and 250 GPUs on the Triton cluster
    - You mean 250 GPU nodes? Asking because: in my desktop the GPU has 22 cores, if I understand correctly how those things work.
        - GPUs, not nodes. In the case of GPUs we are talking about the whole device instead of a core or such. The GPUs also tend to be a fair bit beefier than ordinary desktop ones. In case there is confusion: "node" here means a machine with several GPUs, CPUs, etc.
            - Thanks, I think I got it
    - Triton: 270 nodes, 11652 cores/CPUs, 70732 GB memory, 239 GPUs.

|          | Pe    | Skl  | csl  | milan | V100  | K80 | A100 | P100 | BigMem | AMD | Total |
| -------- | ----- | ---- | ---- | ----- | ----- | --- | ---- | ---- | ------ | --- | ----- |
| Nodes    | 91    | 48   | 48   | 32    | 27    | 3   | 14   | 5    | 1      | 1   | 270   |
| Cores    | 2192  | 1920 | 1920 | 4096  | 600   | 36  | 672  | 120  | 80     | 16  | 11652 |
| Mem (GB) | 13696 | 9216 | 9216 | 16384 | 11264 | 384 | 7042 | 1280 | 2000   | 250 | 70732 |
| GPUs     |       |      |      |       | 136   | 24  | 56   | 20   |        | 3   | 239   |

- Note about outdated tutorials: our pages have a "last updated" time at the top right of the page. If a page hasn't been updated in a while, it's possible it's outdated.

## Slurm queuing system

https://scicomp.aalto.fi/triton/tut/slurm/

- Great analogy!
- Somewhat naive question, but sometimes a task (such as a build job, for example) can take up all the RAM/CPU cores that the laptop offers. Is there a nice strategy for estimating the resources in case the process simply crashes the laptop (building on the idea of "if it works on my laptop...")?
    - In practice, the easiest way would really be to just do quick testing runs with increasing resource requests until it stops crashing.
    - A practical trick for out-of-memory crashing: double the RAM request until the program stops crashing (4G, 8G, 16G, etc.). The same for exceeding the time limit: double the time request (1h, 2h, 4h, etc.). (A sketch of this is at the end of this section.)
    - Like said on the stream: another option is to request a lot of resources to make sure the program doesn't crash, and then in subsequent runs scale down the request if possible.
        - Ooh, thanks! I was wondering about process crashes and such, because it typically requires a laptop reboot to clean up the consequent garbage in memory.
- The new Mac processors (M1 etc.) seem to have a lot of VRAM; how does that play into the rule of thumb "if it works on my laptop" for estimating cluster resources?
    - M1 and M2 chips are a bit complicated to compare, since each is a CPU and GPU in one.
        - Thanks
- In a case where you try to run code on your laptop and it says "insufficient memory", probably because the laptop memory is not enough: how would you then know the exact memory size you need to request on the cluster?
    - Same answer as to the question above: try small test runs with more and more resources until it stops crashing; then you can check with monitoring tools afterwards (which we go over later) to see how much you actually need, and reduce the requests appropriately.
    - I think what is meant is: if possible, reduce the size of your run (e.g. if you feed in data, reduce the amount of data). Test the smallest version that works, then run a slightly bigger one and check the consumption to get an idea of how the usage scales with input size (linearly or quadratically). That can give you a good estimate. If you can't run it on your machine at all, run it on the cluster with plenty of resources (e.g. if your machine has 16GB and that's not enough, give it 64GB for a test run) and a reduced workload (e.g. if you want to simulate an hour of something, only simulate a minute, or maybe just 10 seconds) and see how much it uses. If it finishes quickly, run again for a bit longer and check whether the memory use is constant. If not, you know it's growing over time and can potentially estimate how fast; if it is constant, go with whatever you found works.
        - Good point: if there is no way to do a "smaller" run of your program while keeping its resource consumption static, then that sort of method is necessary.
    - Another complication is sudden memory spikes. A lot of LLMs, for example, have a tendency to spike memory usage for a short time while they load the model to the GPU.
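A minimal sketch of the "double it until it stops crashing" strategy from the threads above; `run.sh` and the sizes are placeholders:

```bash
# Submit the same test job with increasing memory requests;
# keep the smallest request that finishes without an out-of-memory kill.
for mem in 4G 8G 16G; do
    sbatch --mem=$mem --job-name=memtest-$mem run.sh
done
# Afterwards, compare requested vs. actual usage (monitoring is covered later):
#   seff <jobid>
#   sacct -j <jobid> -o JobName,ReqMem,MaxRSS,Elapsed,State
```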
## Copy your code to the cluster

https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo

Have you gone through the "cluster-shell" tutorial?
- yes: oooooooo
- no: ooooo

* Cannot find the link to the github ... hpc-examples repo
    - https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo

## Interactive jobs

https://scicomp.aalto.fi/triton/tut/interactive/

- What is the difference between srun and sbatch (shown yesterday)?
    - sbatch is basically a convenient way to write more complicated scripts than what srun allows. In the most basic case it works just like srun would, but you have your requests written down, which is easier to keep track of.
    - Theoretically, most things you do with sbatch could be done with a combination of bash scripts and srun, but sbatch makes it more convenient.
    - We will cover `sbatch` in the next section.
- Could sinteractive be used instead of srun for starting an interactive job?
    - sinteractive is an `srun --pty bash` with a little bit of extra work to enable a graphical tunnel. At least CSC and Triton have it installed. If you don't need the graphical tunnel, `srun --pty bash` lets you tune better what you need for that interactive session.
- Can you srun with huge resource requests and show the queue command? Just for completeness? (And then cancel the job.)
    - Nice demo!
    - Add GPU requests to make it queue for a while
- With the `srun hostname` command, what does the job actually consist of?
    - It submits the job to the queue, and when it gets a node it just runs `hostname` on it (which prints the name of the node the job is running on).
- If the time and CPU options are not specified, are they set to some default value?
    - Yes, there is a default. I think on Triton it is 15 minutes, 1 CPU, 500M.
    - And GPUs are _not_ requested by default.
    - To know the defaults, first check which partition you want to investigate with `sinfo -s`, and then for one partition you can use `scontrol show partition batch-csl` (here I asked for the partition called "batch-csl") and verify that `DefaultTime=00:15:00` and `DefMemPerCPU=500`.
- What does time=4-00:00:00 correspond to? 4 days?
    - Yes: D-hh:mm:ss
    - You can also use time=96:00:00, that is, 96 hours.
- So when my task is completed, does it automatically log me out of the node, or are the resources still allocated to me?
For example, if I specified a time of 2 hr for the session as in the example and ran it using srun, but my task was done in 1 min, will that node still be assigned to me for the next 2 hrs?
    - The allocation ends when your job is finished. If you run a terminal with `--pty bash`, for example, the job won't count as finished until you exit the terminal, though.
    - If you need a single allocation for multiple jobs, you can use `salloc` for this. This shouldn't generally be necessary, though, and is only really useful for advanced users with fancy automated pipelines. (Our tutorials don't even bother talking about it.)
    - You can also use serial jobs, which we'll talk about in the next section.
- Trying the pi.py example. Error: no such file or directory. I opened the Triton desktop. Do I need to create the directory and put the file there?
    - You need to clone the repository following the instructions here: https://scicomp.aalto.fi/triton/tut/cluster-shell/#triton-tut-example-repo
- If I want to run MATLAB code with the parallelization toolbox, can it run in the console, or do I need the graphical environment?
    - It depends on the tool you use. A simple MATLAB script with (or without) the parallel toolbox can run without a graphical interface (and non-interactively, without the user monitoring it).
- How does the Jupyter terminal work with these demos? Not a fan of PowerShell :(
    - If it is on your machine, you will need to connect to the remote cluster first. If Jupyter is instead already on the remote cluster, you can basically treat its terminal like a login node (although there can be differences).
    - If you are on Windows, you can also install Git Bash for easy access to a bash shell instead.
    - If you mean Jupyter on the cluster (through Open OnDemand, for example), it works the same!
        - Wait, you need to do `unset SLURM_JOB_ID` before srun works there, I think.
- How does sinteractive differ from srun?
    - See the sinteractive answer earlier in this section.
- Do supercomputers (I'm looking at Mahti specifically for now) typically work in this "block" way with Slurm when looking at GPUs? A GPU node has multiple GPUs, but I think it is usually required to allocate an entire node at once, even if you use just one GPU.
    - You can generally request a single GPU, but splitting GPUs into separate cores like CPUs is not (generally?) possible. At least Triton allows single-GPU requests; not sure whether that applies to every cluster.
    - I think there might be some sort of GPU splitting on UoHelsinki's Turso.
- Is it generally not possible to "connect" to a job once it starts running? E.g. if you scheduled it using sbatch, maybe you would like to get access to it and perhaps even start up a VS Code instance there.
    - You can schedule an interactive job. For example, `srun <options> --pty bash` just gives you a shell on a node that you can then use to run what you want (a fuller sketch is below this thread list).
    - I think the problem is that quite a lot of supercomputers do not allow GPU interactive jobs.
        - Yes, interactive jobs with GPUs are a different matter. These are often not allowed, since GPUs are a valuable resource and interactive jobs lead to a lot of idle time.
- What is the suggestion for setting cpus-per-task for a GPU job? I remember it was twice the GPUs?
    - This will of course depend on your job, but usually you can pretty freely request 4+ CPUs. CPUs are a pretty abundant resource compared to GPUs.
    - It's also generally recommended to request a couple, so your job isn't bottlenecked by the data transfer from CPU to GPU.
- Is there a way to see recently run jobs (IDs)?
    - `slurm history`, or `slurm h` as a shorthand.
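Putting the interactive-job pieces from this section together, a minimal sketch (the resource values are just examples):

```bash
# Ask the scheduler for a small interactive shell on a compute node:
srun --time=01:00:00 --mem=2G --cpus-per-task=2 --pty bash
# ...work interactively; the prompt now shows the compute node's name...
exit    # ends the job and releases the resources
```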
- Do we need to close jobs after they are completed? If yes, how?
    - Oh, good question! If it's the `--pty bash` (getting a shell) or `sinteractive` kind, then yes, you should close it. The `exit` command will do this.
    - (Basically, if your shell prompt doesn't say `login4`, then you are in a job and can exit.)
- Does slurm history allow us to see the jobs submitted to all the nodes by each user (ourselves)? And if a job is no longer necessary, do we need to close it?
    - You can see your jobs with `slurm history` or `slurm h`. If your jobs are not exiting correctly, you can kill one with `scancel JOBID`; you can get the job ID from `slurm queue` or `slurm q`.
- Is everything that is saved (e.g. files) accessible after logout?
    - They are saved on the cluster, so you need some method of accessing the data remotely. They don't get deleted after your session ends or anything like that, though.
- -bash: slurm: command not found?
    - slurm is like an alias that does not work on all clusters; the command for your cluster is most likely `sacct`.
    - `slurm history` is equivalent to `sacct --long | less`
    - `slurm queue` is equivalent to `squeue -u $USER`

:::info
### Exercises until xx:55

https://scicomp.aalto.fi/triton/tut/interactive/#exercises

Try to do the things we just demoed, and these exercises. Get familiar with the queuing system. If you need help and have an email address ending with aalto|helsinki|oulu.fi, join the Zoom link that you received at 11:50 via email.

I am:
- done: ooooo
- requesting more time:
- not trying: o

This:
- worked: ooooo
- didn't work:

Would you like us to demo an exercise?
- 1: o
- 2:
- 3:
- 4:
- 5:
:::

## Serial (batch) jobs

https://scicomp.aalto.fi/triton/tut/serial/

- "shebang" = "#" ((ha)sh) + "!" (bang)
- How do you know whether a `#` line is a comment or read by the computer?
    - It depends on the language (Python, bash, R, etc.), but it usually marks a comment.
    - In the batch script, `#` lines are comments to bash (the thing that actually runs it), but Slurm gets the script first, detects the `#SBATCH` lines, and processes them.
- A note on shell scripts: you'll probably have to do `chmod +x script.sh`. I didn't have permission at first and needed to make it executable.
    - If you run it with `sbatch`, that probably isn't needed (unless it's missing `#!/bin/bash` at the top?).
    - If you do `./your_script.sh`, it needs to be executable, but then Slurm won't read the resource requests (a common problem that leads to puzzling results...).
        - Ahh okay, then I messed it up :) Didn't read far enough.
- Did you already answer how to decide how much memory to ask for? And how do you see whether the program makes good use of the available resources?
    - Monitoring is next on the agenda after serial jobs.
    - For choosing resource parameters, first estimate how many resources your computer has and use that as a starting point. For more tips and tricks, you can read this page: https://scicomp.aalto.fi/triton/usage/program-size/#how-big-is-my-program
- What if srun is used on an application that needs multiple threads or even GPU access, within a running batch job that does have those allocated? So, if it was called like `./program &` it would work as expected; would it work the same way with `srun program`? (EDIT: do not forget to `wait` at the end of the script.)
    - In theory it would work, but this is sort of "working around the queue system."
I've seen some people try this, and it minimally works, but `srun` also claims the resources and might not allow anything else to run alongside it (in the same job).
    - Normally you'd try to write programs so that the program itself can divide up the resources within itself.
        - I meant to be able to get the benefit of the resource-usage tracking that srun has (it shows up as a job step in the history). Supposedly there is no other reason to use srun within an sbatch script, right?
            - For MPI jobs you'll want to use `srun` as well, to make certain that the MPI job is allocated correctly to the requested resources.
                - MPI jobs?
                    - We'll talk about MPI (Message Passing Interface) parallelism tomorrow.
                - Okay, so for normal single-node jobs I'm guessing there isn't that much difference(?)
                    - Usually, not that much of a difference.
    - Using an ampersand (`&`) after the command will also make the shell interpreter continue processing lines, and when it reaches the end of the script it might quit the job. So usually you'll avoid using `&` unless you want to run some background service.
        - Well, if there were, say, 8 parallel programs that should be run, then it would be done with some kind of for loop and `&` at the end of them, and then `wait` after the loop (from what I understand).
            - I would say: why not submit 8 separate jobs (like an "array job", which we will cover tomorrow)? Can you raise this again when we get there?
                - Sure, but to quickly answer: it might be just one step of a bigger job, one that was allocated with sbatch.
            - Overall I would say: if you are using `&` in a job, it might work, but it is likely a sign that the job should be adapted to fit Slurm better. `&` makes sense for running multiple things on your own computer, but with Slurm it's usually not needed.
- If I use scancel and by mistake put in another job ID, is there a risk of cancelling someone else's job?
    - No risk: it understands who you are and what you are allowed to do.
- I am not able to connect to Triton. It shows "permission denied" after I enter my Aalto password in the Linux terminal. Can I do something?
    - "ssh: connect to host triton.aalto.fi port 22: Connection timed out": this error comes when I run the command in Windows PowerShell.
    - Are you on the Aalto network / VPN?
        - I am on the Wi-Fi network named "aalto"; I am not on any VPN.

A useful helper for cancelling all your jobs at once (save the Python script below as `~/cancel_all.py`):

```bash=
alias cancelall='/usr/bin/python3 ~/cancel_all.py'
```

```python=
import subprocess

# Run the squeue command and capture the output
output = subprocess.check_output(['squeue', '-u', 'USERNAME_HERE']).decode('utf-8')
print("squeue output:")
print(output)
print("")

# Parse the output to find all JOBIDs (skip the header line and the trailing empty line)
job_ids = []
for line in output.split('\n')[1:-1]:
    job_id = line.split()[0]
    job_ids.append(job_id)

# If no jobs, exit
if len(job_ids) == 0:
    print("no jobs found")
    exit(0)

# Print the list of JOBIDs
print(f"Found {len(job_ids)} jobs:")
for job_id in job_ids:
    print(job_id)

# Prompt the user to confirm whether to cancel all jobs
answer = input("Do you want to cancel all jobs? (y/n) ")
if answer.lower() == 'y':
    # Cancel all jobs using the scancel command
    subprocess.run(['scancel'] + job_ids)
    print("All jobs cancelled.")
else:
    print("No jobs cancelled.")
```

- In the .sh file, how do I force the output file to have the same name as the .sh file? Is there any syntax for that? I mean, how do I fill in the following option: `--output=?????.out`
    - Yes, there is a syntax: https://slurm.schedmd.com/sbatch.html (search for "--output"). It is good to use `%A` and `%a` so that the job number (and array index) are stored in the filename, which is good for later tracking of what failed and what was successful. (A sketch is below.)
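A minimal sketch of those filename patterns. By default, sbatch uses the script's filename as the job name, so `%x` gets you close to "name the output after the .sh file"; `%j` is the job ID, and `%A`/`%a` are the array job ID and task index:

```bash
#!/bin/bash
#SBATCH --job-name=pi-estimate   # optional; defaults to the batch script's filename
#SBATCH --output=%x-%j.out       # e.g. pi-estimate-123456.out
# For array jobs you could use instead:  #SBATCH --output=%x_%A_%a.out
srun python3 pi.py 1000000       # placeholder command
```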
- How do these commands work, i.e. what do they do? `#SBATCH --array=1`, `#SBATCH --cpus-per-task=4`
    - `--array` is used to submit array jobs, which are one way of submitting multiple jobs in parallel. Array jobs are discussed tomorrow in the parallelization session.
        - Thanks, what about --cpus-per-task?
            - It allocates multiple CPUs for your job. We'll talk about these tomorrow.

:::info
### Exercises until xx:00 and break until xx:10

https://scicomp.aalto.fi/triton/tut/serial/#exercises

Take a good long time for these - they're the basis of all the rest. Also read the page and catch up on anything else you missed. Also read up on the cluster-shell lesson if you missed it: that's what these batch scripts are programmed in.

For those at Aalto|Helsinki|Oulu: the Zoom is open if you need help or have specific questions related to the differences with the Helsinki/Oulu clusters.

We'll do Serial-3 and Serial-5 after the break.
:::

- Is it normal that the default shell is zsh? I don't mind it, but I assumed it would be bash.
    - Aalto computers make the default shell `zsh`. Triton uses whatever Aalto sets for you (the "connecting" lesson says how to change it).
        - Do I do this while being on Triton? Where can I find the key fingerprint to check whether it matches the printed one?
            - Fingerprints are available at: https://scicomp.aalto.fi/triton/usage/ssh-fingerprints/
            - To change to a different shell: https://scicomp.aalto.fi/triton/tut/connecting/ (scroll to the dropdown that talks about changing the shell to bash; you need to do this from a generic Aalto Linux server like kosh.org.aalto.fi, and it takes up to 1 h for the change to propagate everywhere).
- How do I leave the nano file creation when I have created my file?
    - Control-X and then confirm with "y" (it gives hints at the bottom; `^` means "control").
        - Thanks
    - Also, to save without exiting: Control-O ("O" for overwrite).
- How do I tell which job ID belongs to which .sh file? If I clear the command window, that info is gone, and apparently there is no way to check what was for what, right?
    - You'd need to check the queue/history to see what is running and hopefully figure it out from there.
    - For jobs in the queue: `slurm queue` / `squeue -u $USER`. For jobs that finished: `slurm history` / `sacct -u $USER`.
    - For seeing what line was used to submit the script, you can do something like: `sacct -u $USER -X -o JobID,SubmitLine%40`
    - You can also add `#SBATCH --job-name=<descriptive name>` to help keep track of what is what.
        - Thanks
- What is the minimum command to run a file with the name F.py? `srun python F.py`?
    - Yes. `srun python F.py` will submit the job requesting the default resources (1 CPU, 500M of RAM, 15 min).
        - I got the error "Unable to allocate resources: No partition specified or system default partition" when merely using that minimum command.
            - On some other clusters it might require more options. You'll need to read their info.
                - Thanks
- To allocate multiple CPUs to a single code, should I do something specific in the code, or can any runnable code use more CPUs?
    - The code you run is a command like `python mycode.py`. That command goes inside a shell script which contains some #SBATCH directives for Slurm. In the #SBATCH directives you ask for the resources you need; then your code will see those resources (e.g. 8 CPUs and 16G of RAM). Your code of course has to be written to use all those resources: the code itself does not benefit from multiple CPUs if it is not written for parallel computing. (A sketch is below this thread list.)
        - Thanks.
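A minimal sketch of that request-then-use pattern; `my_parallel_code.py` and its `--threads` flag are hypothetical stand-ins for a program that can actually use several CPUs:

```bash
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --cpus-per-task=4
# Tell the program how many CPUs it was actually given;
# --threads is a hypothetical flag, use whatever your tool provides:
srun python3 my_parallel_code.py --threads=$SLURM_CPUS_PER_TASK
```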
- Is the video playback already available? :) Thanks
    - If you go to the videos on the CodeRefinery Twitch channel, you can look at the current broadcast under "Recent broadcasts" and rewind if you want an instant replay: https://www.twitch.tv/coderefinery/videos
        - Nice, I didn't know this!
- Maybe not relevant to this course, but is there any course for learning how to write code using parallel approaches?
    - There are a lot around, for example from CSC and other organizations (this is in fact what most HPC courses focus on).
        - Does CodeRefinery have such a course?
            - There are materials on modular coding, splitting the computation into parts. So while not directly about parallelisation per se, modular approaches can help in those cases that are "embarrassingly parallel" (a jargon way of saying that the parallel processes do not depend on each other, e.g. running an analysis pipeline for each subject independently in parallel).
                - Thanks

A potentially useful helper for keeping an sbatch history: the bash function below submits a job and then logs the job ID, job name and script path to `~/slurmjobs.txt` via the Python script (save it as `~/sbatchadd.py`). (Not sure if this clutters too much. Can we put these on git or a pastebin somewhere and link to them, with a bit more context about what they do?)

```bash=
sbatchadd(){
    # Check if a file name is provided
    if [ "$#" -ne 1 ]; then
        echo "Usage: sbatchadd <sbatch_file>"
        return 1
    fi

    # Submit the job with sbatch and capture the output
    output=$(sbatch "$1")
    retval=$?
    echo "$output"
    if [ $retval -ne 0 ]; then
        echo "Error submitting job"
        return 1
    fi

    # Extract the job id from the output ("Submitted batch job <id>")
    jobid=$(echo "$output" | awk '{print $4}')

    # Pass the job id and the sbatch file name to sbatchadd.py
    /usr/bin/python3 ~/sbatchadd.py "$1" "$jobid"
}
```

```python=
import sys
import re
import datetime
import os

def extract_job_name(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            match = re.search(r'#SBATCH --job-name=(\S+)', line)
            if match:
                return match.group(1)
    return None

def append_to_file(job_name, job_id, file_path, sbatch_file):
    file_path = os.path.expanduser(file_path)
    with open(file_path, 'a') as file:
        file.write(f'[{datetime.datetime.now()}] {job_id} {job_name} {sbatch_file}\n')

def main():
    if len(sys.argv) != 3:
        print('Usage: python script.py <sbatch_file_path> <job_id>')
        return
    sbatch_file_path = sys.argv[1]
    job_id = sys.argv[2]
    job_name = extract_job_name(sbatch_file_path)
    if job_name is None:
        print(f'Could not find job name in {sbatch_file_path}')
        return
    append_to_file(job_name, job_id, '~/slurmjobs.txt', sbatch_file_path)
    print(f"Added job ({job_id},{job_name},{sbatch_file_path}) to slurmjobs.txt")

if __name__ == '__main__':
    main()
```

- I am not able to connect to Triton. It shows "permission denied" after I enter my Aalto password in the Linux terminal. Can I do something?
    - Join the Zoom and we can help.
        - How can I join the Zoom?
            - If you registered, it's in your email.
                - I registered today with my Aalto email; yesterday I mistakenly registered with a personal email, so as of now I don't have any Zoom link.
- How do I cancel all of my jobs using a single command?
    - `scancel --me` should work.
        - Thanks.
- Caution: you set the `.sh` file as the output!
    - Very good catch!
- Good demo!
- I remember there was a warning on Turso to "never use tail -f", or something along those lines.
    - It's always important to read and follow the cluster-specific documentation/guidelines.
    - `tail -f` polls the file quite often, which is probably the problem, because of the I/O load. Using something like `tail -f -s 30`, so it only checks every 30 seconds, would probably be fine (see the sketch below).
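A gentler way to follow an output file, per the thread above; `slurm-1234.out` is a placeholder filename:

```bash
tail -f -s 30 slurm-1234.out   # follow, but only poll every 30 seconds
tail -n 20 slurm-1234.out      # or just peek at the last lines now and then
```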
## Monitoring

https://scicomp.aalto.fi/triton/tut/monitoring/

- I got this error: "bash: seff: command not found".
    - It's not a default Slurm command, so your cluster may not have it installed. I'd say request it, it is very useful. (`sacct` can give some of the same data, though.)
        - I am using a LUMI workstation; maybe it's related to this?
            - Probably. I'm a bit surprised they don't have it.
    - The current view shows ReqMem and MaxRSS, and memory use can be computed from that. CPUTime/(walltime × NCPUs) will give the CPU efficiency.
- How reliable are the seff and history utilized-resource estimates?
    - They are pretty reliable, but they are averages from samples that are taken every 30 s or so. So they can capture the maxima and the overall behaviour quite well if your program is consistent in its memory/CPU use. If the program does things in big spikes (e.g. loads a big dataset), the average and the maximum start to differ, and this might require additional profiling.

:::info
### Exercises until xx:50

https://scicomp.aalto.fi/triton/tut/monitoring/#exercises

I am:
- done:
- requesting more time:
- not trying: +1
:::

- https://pastebin.com/cvi0vckZ : this monitoring script is something I made a while ago because I tend to run my jobs in a quick turnaround cycle, and I didn't want to spam my email, so it uses Windows "toasts" to notify me when a job starts and ends (I only made it work for a single job, but that was enough for me). It SSHes into the supercomputer and runs a Python script there that periodically does squeue, while the local script interprets that.
    - Thanks! This kind of automation is pretty cool and really shows how "you aren't doing work, you are directing others to do work."
        - Glad to hear!
- How do I execute the command on the run-pi.sh file? ^T does not show any result.
    - Can you elaborate a bit?
        - Sorry, I mean run-pi.sh, the file we are trying to submit.
            - You submit the file with `sbatch run-pi.sh`

:::info
## Break until xx:12

Then "lightning talks" on some topics, then a Q&A session.
:::

- Should the time be xx:12?
    - Thanks! Fixed.
- :cat: has been told to come visit if it wants food.
- There are quite a lot of commands to get used to.

## Applications

https://scicomp.aalto.fi/triton/tut/applications/

- Containers course: https://coderefinery.github.io/ttt4hpc_containers/

## Modules

https://scicomp.aalto.fi/triton/tut/modules/

- Is shell scripting case-sensitive?
    - In general, yes.
- When I load a module, let's say Anaconda, and then install some packages through a .yml file: can you explain how this environment is created, and where/how I access it or disable it?
    - Usually the environment is created in your home directory by default; for Triton, our tutorial page has a first-time setup to set the directory to your work directory instead. (This will also show you the actual folder path where the environments are written.)
    - To activate the env, you would generally need to have a module loaded that provides conda/mamba, and then run `source activate environment_name` (a sketch is below this thread).
    - Note: `source activate` instead of `conda activate`. Using `conda activate` requires you to run `conda init`, which is not recommended on the cluster since it can cause issues that are hard to debug.
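A minimal sketch of the module + environment flow described above. The module name varies per cluster (`miniconda` here is an assumption; check `module spider conda` or your cluster docs), and `myenv` stands for whatever name your environment.yml defines:

```bash
module load miniconda                 # hypothetical module name; cluster-specific
conda env create -f environment.yml   # creates the environment (on Triton, in your work dir after the first-time setup)
source activate myenv                 # note: `source activate`, not `conda activate`, on the cluster
```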
- When running R, how can we visualize what variables are in the environment and see the structure of the variables? Thanks!
  - `ls()` and `str(variablename)` are a good starting point if all you have is the terminal. I personally prefer RStudio, which shows the environment in its own subwindow. RStudio can be used on HPC clusters; there is documentation at both Aalto and CSC.
- How can I check what toolboxes are installed in the Matlab module?

## Data storage

https://scicomp.aalto.fi/triton/tut/storage/

- Can you mention the separate quotas for files (number of them) and for total size?
  - At least on Triton you can see them by running `quota`
- How do you connect VS Code to Triton?
  - See the lesson that was given yesterday; the videos are on YouTube. Alternatively, see: https://scicomp.aalto.fi/triton/apps/vscode/
-

## Remote access to data

https://scicomp.aalto.fi/triton/tut/remotedata/

- .

## Q&A (ask us anything)

You can ask us anything about today or other questions (we might say we'll answer tomorrow)

- Where is the cat?
  - Sitting in the hallway watching me. I'll try to attract it.
  - I think it knows it is wanted, so it is staying away
    - sweet :)
- I'm trying to set up ProxyJump in an .ssh/config. I can connect to Kosh, and I can also connect to Triton (with VPN), but when I try to use the jump, the connection can't be completed. Is the...
  - Does connecting to Kosh work without VPN? This could also be because you need both ssh keys and a password for Kosh.
  - [Here](https://scicomp.aalto.fi/triton/tut/connecting/) is the tutorial page.
    - Yes, I have read that. This is the jump config I tried: https://scicomp.aalto.fi/scicomp/ssh/#example-config-for-ssh
    - Could it be just a rights/ownership thing? Or, back to the original question, are these instructions accurate after the recent upgrade of Triton?
      - They should work, but if you have several ssh keys you might also need to add `IdentityFile ~/.ssh/<privatekeyname>`
        - Nice, thank you, that worked!
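  - A minimal sketch of the ProxyJump config discussed above, based on the linked Aalto example (hostnames, username, and key file name are assumptions — adjust to your own):

```
# ~/.ssh/config (sketch)
Host kosh
    HostName kosh.aalto.fi
    User myusername
    IdentityFile ~/.ssh/my_private_key   # needed if you have several keys

Host triton
    HostName triton.aalto.fi
    User myusername
    ProxyJump kosh
    IdentityFile ~/.ssh/my_private_key
```

  After this, `ssh triton` hops through Kosh automatically.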
- Isn't the cluster the VM? Thanks for clarifying! Ok!
- How do you see using VS Code connected via ssh to modify the code, the folders, and the .sh files, and then running them from the OOD shell with sbatch?
  - I think something was written about not running from the VS Code terminal (but maybe that is outdated now, or I am remembering it wrongly?)
    - It's correct that you shouldn't run big jobs (lots of CPU/mem) from the VS Code terminal directly. See the VS Code lesson linked from yesterday.
- Is it possible to work on Triton directly with the files on the Aalto disks, without moving files to /scratch, which is not backed up?
  - The I/O (speed of read and write) for those backed-up Aalto disks is very bad, which means that if you start running multiple jobs that read and write from those disks you will take down the whole Aalto teamwork storage network. HPC has dedicated, very fast I/O storage that will not fail with thousands of parallel reads/writes, but you pay with the risk of "not being backed up". If you accidentally have an `rm -r *` in your script (I've seen it :)), you lose everything with no way to go back.
  - "Not backed up" means there are no file versions, but the system is reliable against hardware failure
- For people interested in my setup: https://github.com/tmuxinator/tmuxinator [name=Simo]
- What is the difference between clusters, workstations, and virtual machines? Your short answer is appreciated.
  - See the picture at https://scicomp.aalto.fi/triton/usage/workflows/
  - workstation: a single physical computer
  - virtual machine: like a single physical computer, but actually a virtual computer built on top of a real computer with hundreds of CPUs and lots of RAM
  - cluster: a collection of physical (or virtual) computers
- What are the benefits of OnDemand with respect to terminals?
  - Mostly ease of use. The UI of OnDemand is more familiar to people used to a Windows-like GUI, for example.
- Can you mention the "file search problems" you mentioned for VS Code again?
  - If you open a directory, it will try to read and index every single file. So if you open `/scratch` (the whole thing) or similar, it will use lots of CPU. Make sure you open just your project directory.
- Can we deploy a robotic simulator (NVIDIA Isaac Sim) in a container (requiring large GPU resources) to run in a cloud mode on Triton?
  - I'm not sure about cloud mode specifically, but Isaac Sim has been used in a container on Triton before. I don't think we have a module created for it on the new Triton yet, unfortunately.
- Would you please share the recorded lectures from yesterday?
  - Here is the YouTube link: https://youtu.be/XAn8daIGeiw
- Is it precise to say that LUMI is a cluster, then? What makes a cluster a thing of multiple workstations? We can connect two workstations to each other to make a single workstation — so what is the difference between a cluster of 2 workstations and a single workstation made of 2 connected workstations?

## Feedback, day 2

:::info
News for day 2:
- Tomorrow, we will move on to more advanced uses: running many things at once. This is the real power of the cluster.
- Try to make sure that you understand the batch jobs and monitoring parts of today, since they will be used tomorrow.
:::

Today was:
- too fast: ooo
- too slow:
- right speed: oooooooo
- too basic:
- too advanced:
- right level: ooooooooo

Good things about today:
- There was a lot of information, but the discussion-style presentation was very comfortable to follow
- Lots of new knowledge that requires more time to digest, but overall it was good that you touched on it all during the presentation.
- .
- Insights that go beyond pure textbook information, super interesting!!

Things to be improved for next time:
- The schedule for exercises was very strict; however, the demos and the explanations worked really well
- .
- .
- .

Any other comments?
- .
- It would be good to have a demo with Jupyter notebooks (.ipynb files).
- I'd love to see a quick demo of a Python workflow from VS Code with an array of CLI parameters passed to the Python program, as well as custom pip packages (+ GPU)
  - pip packaging (Python for SciComp is another livestream course we run in the autumn): https://aaltoscicomp.github.io/python-for-scicomp/packaging/
  - Arrays we get to tomorrow.
- The workshop is amazing, thank you so much!!

How should the format be adapted for the future?
- I feel like the first day could be condensed into 2 hours, and some of the topics of the 2nd and 3rd days could be expanded with more practical examples.
  - +1

# Day 3

## Icebreakers, day 3

What is your favorite weather?
- 21 degrees and sunny +5
- Finland in winter +1
- 27+ C
- Beginning of autumn
- -30 degrees celsius
- -275.152 degrees celsius
- A stable one
- +24 degrees, slight rain and sun at the same time, a rainbow in the sky, and in a mountainous region.
- Happily parallel!
- In "embarrasingly parallel" yes (in shared memory: no) - Running the same simulation with different random sets, but inside each simulation I use GPU for large matrix multiplications. Each simulation in different CPU but how to handle the GPU part? - I'll raise this at the end - So, in essense it does not worth to parallelize the simulations in multiple CPUs when the same GPU will do the hard calculations. Or... ok, it depends how occupied the GPU becomes. - whats the full meaning of MPI? - "Message passing interface" - it's a common standard interface, with different libaries that implement it - Could you use the pasta metaphor for GPU?Thanks! - For more cooking metaphors see here: https://docs.google.com/presentation/d/16BTILZlUvEzCt6FfMsB9sSZm0PZHHXLBthE5QfoSrjo/edit - . ## Array jobs https://scicomp.aalto.fi/triton/tut/array/ - My simulations are relatively fast, like 1 minute. Doesn't it get more "slow" to have thousands of short simulations with this approach? Because for each simulation I will have to queue - Exactly. And the more you request/queue, the more you affect your priority for future jobs. So there is a perfect number between how many simulations to batch into a single array job, and how many array jobs to ask. A good rule of thumb is 1hour for one array job, so for example you would have 60 simulations batched together into each array job. - A kind of an overall for-loop? - Yes and the array jobs in the loop are submitted at the same time - is the number assigned to the array randomly done? or based on something? could you explain. - With examples: - `--array=1-5` yields 5 jobs with $SLURM_ARRAY_TASK_ID values 1, 2, 3, 4, and 5 - `--array=10,20` yields 2 jobs with $SLURM_ARRAY_TASK_ID values 10 and 20 - If i get it correct, using array jobs will result in multiple sessions (multiple nodes will be assigned to me, one for each job)? - Correct - What does the -l flag do in #!/bin/bash - local as if you would be ssh into the server and start an interactive session (so it loads your default settings, bash profile, etc) - Is it possible to have more than one array? (if I have, for example, 2 parameters)? - the linear indexes of the array job (from 1 to 100) can be remapped into 2 dimensions (e.g. a matrix of size 10,10). I personally prefer to do this type of mapping inside the script so that `python myfunction.py $SLURM_ARRAY_TASK_ID` will take the array ID as input paramater and then inside I code the logic of which row/column of the 2 parameters matrix I will use - I got confused, does this command --array=1-10 only runs a same program 10 times? - Indeed the same program runs 10 times, and then it is up to you if you want to use the integer SLURM_ARRAY_TASK_ID (from 1 to 10) to use different parameters for each of the 10 indepedent runs (e.g. set a different seed of the random generator for each of the 10 different runs). - Example: Within my Python script, I decide that $SLURM_ARRAY_TASK_ID=1 corresponds to distance metric "euclidean", $SLURM_ARRAY_TASK_ID=2 to distance metric "manhattan", and so forth. - exactly so that within your python code you can take the integer as input parameter and then you would have a "case" that sets which distance metric to use according to the SLURM_ARRAY_TASK_ID - Thanks, now I got it. - Great :) - Just a simple question, whats the difference between jobs and program (e.g python script)? - A job is a request you send to the slurm manager to run something that is basically a bash script. 
- Just a simple question: what's the difference between a job and a program (e.g. a Python script)?
  - A job is a request you send to the Slurm manager to run something that is basically a bash script. Inside the bash script you might call multiple programs, e.g. the various steps of a pipeline. Very often people just run "one program" inside the job, and then that program has the pipeline logic (e.g. the multiple steps of your analysis pipeline). But yes, if one wants to be semantic about it: jobs are programs :) (but not all programs are jobs)
- Is it possible to check from inside the array jobs whether the other ones are still running? The use case: aggregate the individual array-task results on a single node and do some post-processing.
  - In theory yes, it could run `squeue` to see what else is running. Probably better are job dependencies: you submit the jobs, get the job IDs, and then submit another job that says "run only after these other ones are done": https://scicomp.aalto.fi/triton/tut/dependency/
  - Basically, if your array job ID is `N`, you can run the aggregation job with `#SBATCH --dependency=afterok:N`
    - Fantastic, thanks!
- .
- How do I get the individual outputs? I ran the first array job and got "Submitted batch job xxxxxx"; it says the output is saved somewhere. How do I access it?
  - Usually, in the folder where you submitted, you should have log files called slurm-JOBID_ARRAYID.out (unless you specified a different name/location with `#SBATCH --output=...`)
    - How do I know the JOBID_ARRAYID?
      - `slurm q` to see the job IDs that are running, `slurm h` to see those that have finished (if you are not at Aalto, the equivalents are `squeue -u $USER` and `sacct --long`)
        - I found the JOBID, but then I typed "slurm-JOBID_ARRAYID.out" and it says command not found
          - That slurm-JOBID_ARRAYID.out file is just a text file, so you cannot run it as a command. You can inspect it with terminal commands such as `less slurm-JOBID_ARRAYID.out`, or open it with a terminal editor: `nano slurm-JOBID_ARRAYID.out`
            - I now tried `ls slurm-JOBID_ARRAYID.out` and it says no file or directory with that name
              - You could run `ls` alone to see which files you have in the current folder. I like `ls -alt`, which shows all the details about the files
                - Ah thanks, I found it — I didn't know the correct name. Thanks!
                  - NP! JOBID and ARRAYID will be different for each run.
- A potentially confusing thing in the array-job terminology: submitting an array job with sbatch results in a single Slurm job with its own JOBID. Then, all the array tasks are also Slurm jobs with their own JOBIDs. Nevertheless, we call the integer number corresponding to each array task $SLURM_ARRAY_TASK_ID — so "task" instead of "job". I might be wrong, but it seems a bit confusing.
  - Yes it is! And when you go deeper there is ArrayJobID (the same for all array tasks), ArrayTaskID, JobID (unique for each array task), and JobArrayID in `squeue`'s output (= ArrayJobID if it's an array and JobID if it's not?). I usually run some sample jobs and look at what comes out each time I need to figure it out.
    - Ah, I didn't know about the ArrayJobID — yet another thing :)
- In the material, it is said that `--array` can also be given as a parameter to the sbatch command. Does this mean that parameters given to `sbatch` overwrite those given in the script?
  - Yes, you can `sbatch --array=N-M script.sh` and this will override anything in the file.
  - In general, I'd say it is recommended to include things like `--array` in the sbatch script, or at least document the exact commands used if not (which is good practice anyway).
  - I'd use the command-line override to re-run certain tasks that failed, for example
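  - A sketch of exactly that (task numbers made up):

```bash
# The script itself says '#SBATCH --array=1-100'; tasks 7 and 13 failed,
# so re-run only those without editing the file:
sbatch --array=7,13 script.sh
```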
- How should my program (let's say pi.py) get the argument $SLURM_ARRAY_TASK_ID? I mean, what should I write/do in my program so that it does different things for different array IDs?
  - `SLURM_ARRAY_TASK_ID` is an "environment variable"; if you search for your language you can see how to access such variables.
  - We'll talk about this a bit in the next session on shared-memory parallelism; the same methods can be re-used for array jobs.
  - A quick answer for Python: `import os; slurm_array_task_id = os.getenv("SLURM_ARRAY_TASK_ID")`
  - Bash is good at handling them, and any other language can too. This is why it's good to know shell scripting well.
  - Alternatively you can pass it in like any argument to the code (have a look at argparse or similar tools)
- Hi, in the "Bash case style" tutorial, can you explain how "SEED" works?
  - In this example `SEED=x` is a Bash variable. So based on the value of `SLURM_ARRAY_TASK_ID` we set the variable to some value. Afterwards, when we call the `pi.py` program, the value of the variable is recalled using `$SEED` and given to the program with `--seed=$SEED`.
    - So, is it like we have some values for the seed in some file, and then using SLURM_ARRAY_TASK_ID we recall a particular value — i.e. if we say SEED=123, it reads the value corresponding to entry 123?
      - Yes. We basically map the `SLURM_ARRAY_TASK_ID` to some other number that is hard-coded into the sbatch script.

:::info
## Exercises until xx:20
https://scicomp.aalto.fi/triton/tut/array/#exercises

Try one of the seed exercises like Array-1.

Aalto/Helsinki/Oulu: if you are stuck with the exercise or want to ask more questions, join us in the exercise zoom room.
:::

## Shared memory parallelism

https://scicomp.aalto.fi/triton/tut/parallel-shared/

- I got confused: what is the difference and benefit of allocating multiple CPUs for a single program versus allocating multiple array tasks? (see the sketch after this list)
  - (Discussed by voice: both have their use cases and are often even combined.)
  - Quick written answer: arrays launch separate jobs that can themselves use any number of CPUs. It will depend on how your code can parallelize.
- You talked about the number of processors, yet the output contains the term "processes". Can these terms be used interchangeably, or do they refer to different things? +
  - "A process is a thing that can use a processor"
  - "process": computer term for a running program
  - "processor": hardware term for the CPU that can execute programs
  - It might not be entirely consistently used here...
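  - A minimal shared-memory sketch, assuming the course's `pi.py` and its `--threads` option (the resource numbers and sample count are placeholders):

```bash
#!/bin/bash -l
#SBATCH --cpus-per-task=4   # one process, four cores on the same node
#SBATCH --time=00:15:00
#SBATCH --mem=2G

# Tell the program how many cores it actually got; many threaded
# libraries also respect OMP_NUM_THREADS
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun python3 pi.py 10000000 --threads=$SLURM_CPUS_PER_TASK
```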
:::info
### Break until xx:01
:::

## Laptops to Lumi: CSC resources

https://github.com/AaltoSciComp/scicomp-docs/raw/master/training/scip/CSC-services_062024.pdf

"CSC — IT Center for Science" is the Finnish national center for high-performance computing (and many other things). We'll hear about what resources it can provide. The computing resources are quite similar to ours but much larger, so if you need more power, go there.

CSC weekly user support session: https://csc.fi/en/training-calendar/csc-research-support-coffee-every-wednesday-at-1400-finnish-time/ ("virtual office hours" in addition to support via servicedesk@csc.fi)

- Where does the abbreviation CSC come from?
  - These days it doesn't officially stand for anything; we can ask. I think of it as "center for scientific computing", but I don't know if that was the original meaning.
    - Apparently that is correct: https://fi.wikipedia.org/wiki/CSC_%E2%80%93_Tieteen_tietotekniikan_keskus#Historiaa
- My code is not using GPUs (or at least I think so); is LUMI only beneficial for GPU-type work?
  - LUMI has a huge CPU part too, so it's not only for GPUs
- Is there a way to restart a failed job on CSC without resubmitting the job file (some jobs on Puhti occasionally terminate for an unknown reason)?
  - Typically you won't want to just resubmit a job if it fails. Figuring out the reason for the failure and then resubmitting is usually the best approach.
- If CSC is a non-profit organization, how can it employ its personnel and fund its projects?
  - Most of CSC's funding comes from the Finnish Ministry of Education.
    - Does that mean that taxes are used for its budget?
      - Yes.
        - Thanks. I am happy that taxes are used for such positive ends.
  - CSC is also involved in many projects funded by the Academy of Finland, the EU, etc. (basically also tax money)... Roughly half of the funding comes directly from the Ministry of Education and Culture.
- .

## MPI parallelism

https://scicomp.aalto.fi/triton/tut/parallel-mpi/

- .

## Break until xx:58

Then GPUs

## GPUs

https://scicomp.aalto.fi/triton/tut/gpu/

- [GPU/Python question] I want to use TensorFlow and PyTorch and a niche Python package (on pip) in a Slurm job. To make both TensorFlow and PyTorch work, I used the `scicomp-python-env` module (otherwise I got CUDA errors all the time). How can I load the niche pip package on top of `scicomp-python-env` in my job? I tried to set up a conda environment that only installs `pip: my_niche_package`, but then TensorFlow and PyTorch from the `scicomp-python-env` module weren't available anymore during the Slurm job.
  - `scicomp-python-env` can't be used to create your own environments. You need to create a fresh environment with `mamba`.
    - That was my very first attempt, but I never got the CUDA versions right for TensorFlow and PyTorch at the same time. I have the additional restriction that I can't use TensorFlow >2.15.1. But then I'll look more into that — I just wanted to make sure that I'm on the right track before I commit more time to it. Thanks!!
      - Here is the part of `scicomp-python-env`'s env.yaml that has the TensorFlow and Torch + CUDA things, for reference: (removed to fix autoscroll)
        - Thank you so much, I'll try that now!
- So we can only connect to interactive nodes that we have an active job on?
  - Yes, if you don't have a job on the node you can't ssh to it.
- Where can I find the code of pi.py?
  - All here: https://github.com/AaltoSciComp/hpc-examples/
    - In which folder of that repository can I find the file used for the shared-memory parallelization? The slurm folder?
      - Yes, it should be slurm/pi-gpu.cu and slurm/pi.py
        - thanks.
- Do we need to allocate separately the memory the GPU will use, or is this automatic? (see the sketch after this thread)
  - You always get all the memory the GPU has; there is no way to allocate it separately. (On Triton at least, but I'd expect that to apply to basically all clusters.)
  - You should still request some RAM for the CPUs to use for data transfer, as was just talked about.
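  - Putting those pieces together, a hedged GPU batch-script sketch (Triton-style options; the binary name and resource numbers are placeholders). You request CPUs and host RAM as usual, plus `--gpus`; the GPU's own memory comes along automatically:

```bash
#!/bin/bash -l
#SBATCH --gpus=1            # one GPU; its on-board memory is all yours
#SBATCH --cpus-per-task=2   # CPU cores for feeding the GPU
#SBATCH --mem=8G            # host RAM for data loading/transfer
#SBATCH --time=01:00:00

module load gcc cuda        # needed for compiled CUDA code; for Python,
                            # CUDA usually comes from your mamba environment
srun ./pi-gpu               # placeholder, e.g. built from slurm/pi-gpu.cu
```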
- Can I start an interactive job on a GPU node, like for CPU nodes in https://scicomp.aalto.fi/triton/tut/interactive/? I tried `srun -p interactive --time=2:00:00 --mem=600M --pty bash`, but that didn't work.
  - Yes (but remember these are very expensive and limited resources; make sure you release them when you are done).
  - Try these options: `--gpus 1 --time=XXX --mem=XXX --pty bash` (you have to remove `-p interactive` and add the GPUs).
    - Oh nice, thanks! Yes, I just wanted an option to test CUDA environments quickly (and of course release the resources right after :) )
      - In that case you could use `-p gpu-debug`. This partition is meant for jobs shorter than 30 min and will likely get you through the queue quicker.
        - awesome, thanks!
- So every time we want to use a GPU, we first need to `module load gcc/12.3.0 cuda/12.2.1` before anything else in our code?
  - With Python it's usually more convenient to have CUDA in your mamba environment. If you have C code etc. that is compiled with CUDA, then yes.
    - What about Matlab?
      - CUDA yes. Someone can correct me if this is wrong, but Matlab should already know the version of `gcc` it was compiled with, so loading that shouldn't be necessary (and can be detrimental if you load the wrong version).
- Can a metaphor be made with the pasta examples for conda, TensorFlow, PyTorch, CUDA? Thanks!
  - They are all spaghetti.
  - CUDA is maybe something like the salt in the water. Conda is something like different sets of pots. TensorFlow and PyTorch are more like the spaghetti....
- .

## Q&A

- Are there some rules of thumb about whether to consider a GPU, or is it normally sufficient to deal with CPUs?
- Are the garage sessions similar to the Zoom sessions on this course?
  - Roughly yes: a lot of the same people are there (just Aalto though), and you can ask about anything related to scientific computing.
- If I have a problem and consider coming to the garage for help: how much time should I spend trying to solve the problem myself? I think I'm leaning towards trying to solve everything myself and spending hours and hours on problems because I don't want to bother you. Then again, the docs often say "ask us RSEs early".
  - It really depends on you and the problem: if it's something core to your work, spend more time. If you know it's unrelated, come later. If you hate messing with it, come sooner. Etc. 10–15 min of searching the docs and the internet is a reasonable minimum, though.
    - haha ok, that helps a lot, thanks :)
    - Is an appropriate question something like: "I have task X, and it's a calculation of type Y. Should I use GPUs for this task?"
      - Yes! We always like people making sure they take the right path early, rather than doing something off-track and us having to {tell them it's not going to work, try to make it work so the time isn't wasted even if it's harder in the end}
      - If anything, we might sometimes be too eager to teach you the best practices.
- What's the difference between mamba and conda? I have experience using conda for virtual environments but haven't heard of mamba.
  - Mamba is a drop-in replacement for conda. It's written in C++ while conda is written in Python, so it can be an order of magnitude faster at solving environments etc.
  - So basically any command where you would use conda, you can just replace it with mamba.
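  - For example (environment and package names are just illustrative):

```bash
# Same syntax as the conda equivalents, only the command name changes:
mamba create -n myenv python=3.11 numpy   # conda create ...
mamba env create -f environment.yml       # conda env create ...
mamba install -n myenv scipy              # conda install ...
```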
- When using OnDemand Jupyter: does that process need to be killed manually? When I used it and exited (pressed a red button, IIRC), it seemed that the process stayed running after exiting.
  - If you just close the tab in your browser, the job will keep running. If you select "Shut Down" from the File menu, it will stop. If you just close the tab it stays active (you can resume later), and you can terminate it from the OnDemand interface if needed.
  - Delete also works and kills the job.
    - Yeah, I noticed the Jupyter job with `slurm q` and cancelled it via scancel. That seemed to kill the process too.
      - Yes, it's just a normal job behind the scenes, so scancel also works.
- How can I contact you in the future if I have a question about these topics?
  - Come to our daily garage, or send an email to scicomp@aalto.fi
  - Our [help](https://scicomp.aalto.fi/help/) page has a list of the different channels.
- .
- .

## Feedback, day 3 and entire course

:::info
News for day 3:
- We hope that you are inspired by this course. There is plenty more to learn — keep reading and asking.
- The website will be updated with some more recommended follow-up info (extra special topics we haven't covered this time).
- Things we would recommend as follow-ups:
  - Python environments with conda (one page: scicomp.aalto.fi/triton/apps/python-conda/)
  - CodeRefinery workshop (coderefinery.org, one in September) — it's also in this format
  - Python for Scientific Computing (also a livestream, later this year)
:::

Today was:
- too fast: o
- too slow:
- right speed: oooooooooooo
- too advanced: ooooooo +3 (but really good to know)
- too basic:
- right level: oooooooo

I would recommend this course to others:
- yes: oooooooooooooooo
- no:
- maybe:

Good things about today / this course:
- Prompt interaction +2
- Great course; the explanations and especially the metaphors were insightful +2
- The cat :) +3
- Good information and tools to use, and where to get more information +1
- Very good hands-on course, lots of new stuff to process and good examples to practice with.
- Resources well structured; easy to find the links to documents and relevant materials to help the learning +1
- Well structured course +1
- The Zoom session was a very nice experience, thank you for helping!
- Very good information; good, fast, and informative interaction. Even dumb questions (by me) already answered in the docs got a good answer. +1
- The Markdown-based HedgeDoc document was fantastic!
- The production quality is <chef's kiss>

To be improved next time:
- .
- .
- .
- .

Other comments:
- I had read ~most of the materials beforehand, but it really was worth it to attend this course; it offered a solid ground to build upon
- I was quite afraid of whether I'd be able to join every day, and whether it's worth it to participate in only part of the course. But I'm going to recommend this to everyone!
  - We appreciate your presence (!) and your critical feedback, and the recordings are important! Thank you very much!
- I couldn't participate in all the sessions, but in the ones I did, I could follow the clear instructions. I'm looking forward to revisiting the videos and reading the material.
- It would be great to have an advanced course (a continuation of this one) on parallel and distributed computing at CSC, i.e. multi-node jobs — preferably including hands-on examples and covering best practices for writing the code for such jobs. +1
  - We can recommend this: TTT4HPC aka workflows on HPC https://www.youtube.com/watch?v=n9SQthz_St4&list=PLpLblYHCzJABy4epFn-rqsfDbUZ1ff5Pl
  - We have run a course on MPI in the past; we can run it again if there is demand. Let us know what you would like to see next, and take the survey -> https://link.webropol.com/s/scipod
- Condensing the first day a little bit. In my honest opinion, the practical information as well as conceptual understanding are the most important; background information is less relevant (it's also possible to add more information to that day). Though it's possible not to attend those parts, I prefer to attend the whole thing in order not to miss anything +1
- THANK YOU!!! +1