# Introduction to Scientific Computing and HPC / Summer 2023 (ARCHIVE)

:::danger
## Infos and important links

- To watch: https://www.twitch.tv/coderefinery
- To ask questions and interact (this document): https://notes.coderefinery.org/scicomphpc2023
  - *To write on this document, click on the :pencil: (Edit) icon in the top left corner and write at the bottom, above the ending line. If you experience lags, switch back to "view mode" ("eye" icon).*
- Previous versions of this document are archived right here
- Matters that require a one-to-one chat with our helpers (e.g. an installation issue or non-Aalto-specific settings):
  - PLEASE USE THE ZOOM LINK SENT TO REGISTERED PARTICIPANTS IN THE INVITATION EMAIL (Only for participants in the Nordics)
- Program and materials: https://scicomp.aalto.fi/training/scip/kickstart-2023/
- Prerequisites:
  - Make sure you have a working terminal on your machine and an account on an HPC cluster.
  - New to the linux terminal? Familiarize yourself with linux shell commands by watching this video
- Suggestion: if you have just one screen (e.g. a laptop), we recommend arranging your windows like this:

```bash
╔═══════════╗ ╔═════════════╗
║  WEB      ║ ║  PYTHON     ║
║  BROWSER  ║ ║  WINDOW     ║
║  WINDOW   ║ ╚═════════════╝
║  WITH     ║ ╔═════════════╗
║  TWITCH   ║ ║  BROWSER    ║
║  STREAM   ║ ║  W/Q&A      ║
╚═══════════╝ ╚═════════════╝
```

* **Do not put names or identifying information on this page**
:::

*Please do not edit above this*

# Day 1 - Intro to SciComp

## Icebreaker

**Test how to use this hedgedoc live document; this is how we ask questions and get help.**

- This is a question, or is it?
  - This is a reply, yes it is!
    - This is a nested comment to the reply
      - Actually... insert smart comment here
  - This is a different reply, but still valid
- This is another question, or maybe not?
- ...
- ...
- test it.
- ... write something...
- testing this again :)

What university are you from?: (add `o` to vote)

- Aalto: ooooooooooooooooooooooooooooo
- Helsinki: oooooooooo
- Tampere: oooooooo
- Oulu: oooooooooo

- What do you expect from this course?
  - Understanding how to use "basic" HPC +2
  - Learning how to run my matlab codes in CSC machines
  - Run my Python scripts in a more powerful environment.
  - First intro to using HPC. Hope to run R tasks faster. +1
  - How to use HPC to do machine learning tasks +1
  - I would like to start using the clusters for more computationally intensive tasks. +2
  - I would love to try out HPCs and learn how I can use them to speed up my code. +3
  - Understand HPC basics +2

- What's your ideal way to spend a summer day?:
  - Walking and reading outside :)
  - Doing some high performance computing, naturally :)
  - Swimming, grilling and sitting around outside in the long summer night :) (ideally with no mosquitoes)
  - Chilling, swimming and eating good food outside
  - Work hard and smile
  - Play Zelda TotK +1
  - Hanging out with friends, having good food and spending time outside
  - Travelling, walking outside
  - Being near some water
  - Gaming (indoors)
  - Walking, eating, drinking, sleeping, movies
  - Swimming (unfortunately the water is freezing in Helsinki)
  - Cycling, going on walks, just hanging out in nature
  - Wrestling, drinking, and hiking
  - BBQ and beer +2
  - Chilling at our summer cabin. Maybe SUP-boarding.
  - Definitely on the beach!

*Here below are the questions for the material covered on day 1.*

## Intro to course

https://scicomp.aalto.fi/training/kickstart/intro/

*Ask any questions regarding the intro here.*

- After studying remotely for 1.5 years and having lots of online classes, I highly appreciate the amazing audio quality here. Many thanks for that!
- I am trying to ssh into triton from my Ubuntu installation from Windows. I have connected to vpn2.aalto.fi, but ping triton.aalto.fi is failing. Any tips?
  - Quick tip: vdi.aalto.fi (a virtual desktop that you can connect to from any place / OS and run Ubuntu with the network set up enough to connect to Triton)
  - I have done that, but I also want to learn how to connect from the vpn.
    - A later lesson will have more in-depth instructions for connecting, and also people helping on zoom in case you have issues connecting.

## A short intro to scientific computing

https://hackmd.io/@AaltoSciComp/SciCompIntro

- How much does a typical researcher need to know about computer hardware these days?
  - Depends a lot on what kind of research they're doing. Many researchers use existing libraries and don't need to know it that well. But many do weird new things that existing software cannot do, so they need to know a bit.
- What happens when you are running something and the internet connection is lost?
  - We will learn more about this later, but if you use a batch job your process will keep on running without issue.
  - In a nutshell, if you run in batch mode on the cluster, losing the connection does not impact already queued jobs. If you run online (in the foreground), that process will die when your SSH session is closed.
- Is there some sort of limit on how much data or CPU we can reserve?
  - For triton specifically, there is a cap on cpu and memory utilization at once, but the real bottleneck will usually be the queue time: the more resources you request, the longer your job will sit in the queue. (You sit in the queue waiting for the resources to free up from other people running their jobs.) Other clusters should be similar in that regard.
  - The Triton reference also lists what's available and reservable in each partition: https://scicomp.aalto.fi/triton/ref/ . Other clusters should have similar information. (If you request more CPUs or memory than is available, you will get an error message - no harm done.)
  - With respect to data, i.e. disk space: each user or project has a quota, though it is expandable.
  - For very heavy resource use (esp. CPU), CSC clusters might be a better choice.
- How are energy-saving plans affecting the availability of resources?
  -
  -

## What is parallel computing? An analogy with cooking

https://docs.google.com/presentation/d/e/2PACX-1vQLTzWkRy7Du3jjPJ6Y9BqKczU_JcSTEL6XsndrNJ7ylzi4RWeEy8lhfWZQu_lpwbAKroh51qqLoPFG/pub

- Amazing metaphors guys, thank you!
- What does "wait a lot" when requesting resources mean? A couple of hours or days?
  - Unfortunately "it depends": it depends on how much other people are running and what you are running (the more you run, the lower your priority). Run small enough and you get it almost immediately (like when getting started and testing). Run 10000 jobs and it could be days.
  - There are also short/debug queues for fast development.
- What is the analogy to the cluster node?
  - Perhaps you could say "one apartment" or "one kitchen". It takes a lot longer to communicate across kitchens, and you can't share supplies.
- I have a feeling that 1 cook using 4 burners is more efficient than 4 cooks using 1 burner each (although a bit slower). Is this also the case in computing?
  - It can be! It really depends on the problem and how it's programmed. You are right that communication between people can be much slower than one person juggling the resources themselves.
  - We'll see tomorrow that "array jobs" + the "one-cook" strategy solves most things well these days (without needing MPI = multiple cooks).
- https://www.thekitchn.com/the-best-way-to-cook-dried-pasta-23086266
  - This has some crazy pasta cooking strategies!
- Any tips on how to use HPCs with Comsol simulations?
  - Our docs page has a tutorial for Comsol: https://scicomp.aalto.fi/triton/apps/comsol/

## Break until xx:00

:::info
Remember to walk around some. Then "how big is my calculation". You may keep asking questions above or below.

**Note:** notes.coderefinery.org is what we are calling "hackmd" for unfortunate historical reasons. We'll call it "notes" or "collaborative notes" from now on!
:::

- Did anyone adjust their cooking when energy was more scarce in the winter?
  - Yes of course! But some Italian friends rejected the method with the lid on the pot!

## How big is my program?

We'll talk about estimating how big your program is... since we often get asked this.

https://scicomp.aalto.fi/triton/usage/program-size/

- What if you underestimate the resources needed and you "run out"! (TUNI especially Narvi)
  - Process death! Well, maybe. If you run out of memory, the job dies (Aalto has some grace, you can go a bit over, but it will get killed if you run out). If time runs out, the job is killed (again with a short grace period, ~15min maybe?). If CPUs run out, it runs slower but works.
  - Narvi also uses slurm and should behave similarly; the only significant difference could be how big the grace period is, etc. Check your documentation for specifics.
  - Recommendation: give it a try with something small, it's not bad if it runs out, just try again.
- Relating queueing times and amount of resources requested: How long would I expect to wait to get one laptop's worth of CPUs and memory (4 CPUs and 16G memory)?
  - For 4 cpus and 16GB of memory, usually not very long. Depends on the system, of course. Particularly on Triton: seconds to minutes, so pretty much instantaneous.
- Is there a way to save my results from death when I can already see that I reserved too little time for my calculations to finish properly?
  - Most well-developed big HPC programs can a) checkpoint (they write the state out periodically so they can resume from the checkpoint file), or b) they get a signal before the kill, so they have a chance to save then.
  - In short, it's up to the program author whether they support this. Definitely possible.
  - Your data produced by the code, if dumped regularly, is in the cross-mounted folder that you can access while the job is running. You are free to analyze the output and make a backup.
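- To illustrate option b), here is a minimal sketch of a batch script that catches Slurm's early-warning signal and saves state before the kill. `--signal=B:USR1@120` is a real Slurm option (send SIGUSR1 to the batch shell ~120 s before the time limit); the `checkpoint` function, file names and timings below are made-up placeholders, not a recipe for any specific program:

```shell
#!/bin/bash
# Hypothetical sketch: ask Slurm for SIGUSR1 ~120 s before the time limit.
# The B: prefix means "signal the batch script itself", so the trap below fires.
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120

# Placeholder "save state" step - a real program would write its own checkpoint file.
checkpoint() {
    echo "checkpoint at step $i" > checkpoint.txt
    exit 0
}
trap checkpoint USR1

# Stand-in for the real computation: many short steps, progress written as we go.
for i in $(seq 1 10); do
    echo "step $i" >> progress.txt
    sleep 0.1
done
```

Outside Slurm you can test the trap yourself with `kill -USR1 <pid>`; note that bash only runs the trap after the current foreground command (here `sleep`) returns.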
- Can we mention fairshare and how it affects waiting time (and the fact that we can't know the waiting time)?
  - I'll answer here: the more you run, the lower your priority gets, so it takes longer to run your next thing. In the long run, it balances out, but people can get more resources in a short time.
  - This fairshare concept makes it really hard to answer this question. In general, say what you need and trust it does the right thing. Ask for help if you notice something off.
  - There is a section with explanations 'for non-admins' at: https://slurm.schedmd.com/fair_tree.html
- In an iterative process where you don't know how many steps are needed, how can one estimate the needed time? A breadth-first algorithm, for example.
  - Now that's a good question! I guess the level-one approximation is to give far more than you think it needs (is there some upper limit?) - if it ends early, you won't get "billed" for that extra time at least.
  - If it might really take a long time, then implement checkpointing so you can restart (see above).
  - First, ask around whether your group members have experience with the code running on the local cluster; if the code is new, ask the developers or the user forum/mailing list; and lastly, try on your own: reserve a sufficient amount of resources, run a job of the size you plan to run in production, and then shrink the resource limits down to what is really used.
- Can you monitor/check your results while the computation is ongoing?
  - Yes! (yes, you can see outputs as it's running). Dedicated lesson on this tomorrow.
- How can we record the throughput and execution time for deep learning trainings?
  - You'll see examples tomorrow. In short, Slurm (the thing that manages all the jobs) records execution time automatically. Your job should write out throughput as part of its outputs (and maybe its time too).
- Does it matter where the raw data is located?
  - It can, depending on the system and the amount of data.
For example, in deep learning training with a lot of data, the speed of loading the data makes a big difference. Otherwise the main issue is disk space.
- Is it okay to overestimate your job by some margin? Let's say by a factor of 2.
  - It is ok to do this for testing; otherwise it is wasting others' time.
  - Yes, especially when you are first starting out. You can lower it once you get more familiar with it.
- What if I always set the runtime to the maximum possible runtime allowed by the machine?
  - On triton in particular, you will wait significantly longer in the queue. The answer below elaborates on how it is a problem for others, but you are making things harder for yourself too.
  - Bad idea in general, since the time limit you set is read by the batch system (=SLURM), and the resources you request are reserved for you for this period of time, and thus out of the queue and out of use for others. Rule of thumb: test your application / code, find out the sufficient time, memory, gpus etc., and then limit the requested resources down to those numbers plus maybe 10%.
  - As a reminder: it is not you against others :) a cluster like Triton is in common usage.

## Humans of Scientific Computing

More information concerning RSEs: https://scicomp.aalto.fi/rse/

Good source for practical scientific computing knowledge: https://coderefinery.org/

No slides, you can ask questions here:

- For Richard: What motivates you to make a transition to data science?
  - Wanting a different set of skills to help my career, by taking an obvious path.
  - But Richard also needs to apply for funding every now and then, doesn't he?
    - Not individual funding, but as a team. And we work together on it, so it feels much better.
    - But the School of Science does guarantee our funding, so our positions are secure.
- What are the key skills and technologies required for a research software engineer?
  - The CodeRefinery workshops (link above) are a really good starting point. Just what a junior researcher needs to start with.
  - Learning how to use version control is a good idea. +1
  - Learning version control well enough is one of the best investments in your career.

## Break until xx:05

:::info
then "connecting to the cluster"

I feel about the course (vote for all which apply):

- is good: oooooooooooo
- is ok:
- needs improvement:
- too slow:
- too fast:
- good topics: oooo
- need better topics:
:::

- .
- .
- .

## Connecting to the cluster

:::info
https://scicomp.aalto.fi/triton/tut/connecting/

* Our goal is to get connected.
* We'll give a quick demo and give plenty of time to ask for help / work on it yourself - if it works, you have a break until the next hour.
* If it doesn't work, then it's OK and it's homework.
* If you need live help and registered, the Zoom link you got has helpers from Aalto+HY+TUNI+Oulu.
* If you are from one of these 4 universities and are unsure about zoom help, Enrico is one email away: scip@aalto.fi
:::

Triton ssh fingerprints here: https://scicomp.aalto.fi/triton/usage/ssh-fingerprints/

Questions here:

- How to connect when you are using your personal laptop from home?
  - When you are, or are not, using a personal laptop?
  - If you are not on the aalto network, you need to either use a vpn or do a proxy jump (the `-J` option you see in the info), elaborated right now.
  - If the vpn feels like cooperating, that is generally the easiest option.
  - vdi.aalto.fi can be of great help
- What if the key fingerprint is wrong (after the first time connecting)?
  - It is very rare, but you could be on an insecure network (e.g. an open wifi in a café) that tries to intercept your network traffic. In that case, don't type your password. (Ideally you should connect without a password using ssh keys, but that is a longer story -> https://scicomp.aalto.fi/triton/quickstart/connecting/#setting-up-ssh-for-passwordless-access)
  - This is the equivalent of a certificate warning when browsing the internet (e.g. you try to access https://gmail... but it complains that the certificate is not right... it might be that it is not the real gmail website)
  - You can send the output to your cluster admins and ask for advice. They need to know anyway.
  - So abort and contact admins?
    - Yes, pretty much.
- Is there a link for the Zoom group for Oulu?
  - Same one as for all of us, in the registration email
  - If you registered in the last few hours, you have not received it. Email scip@aalto.fi
- Narvi connection details can be found here: https://narvi-docs.readthedocs.io/connecting.html
  - There are 2 login nodes: narvi.tut.fi and narvi-shell.tut.fi .
  - narvi-shell is recommended if you wish to do light development on the login node.
- I did not understand the point of using ood. Is it another way to use triton instead of the "terminal way"?
  - Exactly, another interface that some people prefer. It lets you get a terminal through the web browser, and has some built-in apps.
  - Along with the normal "shell session", Open OnDemand (=ood) provides a native environment for gui apps; in the near future, we will also enable a file transfer option; it is somewhat like Aalto's vdi.aalto.fi service, but for Triton only and available within the Aalto net only
- Can we use Matlab and VS code at the same time through OOD?
  - Yes, but not on the same node. Essentially, you will have two different windows for the two programs.
  - Technically yes. Put it this way: you can run Matlab or VS Code as separate instances, in which case you get two different sessions, and node allocation is done normally through SLURM. Alternatively, you can start a desktop, a normal Gnome, and run Matlab etc. from within that session; this way you get both at once
- /usr/bin/man: can't set the locale; make sure $LC_* and $LANG are correct (do I have to do something?) - login works otherwise
  - Not a big issue (I guess you are on mac?). You can set this in your terminal preferences, and there are a few programs where this can cause issues.
  - Yes, on mac!
    - https://apple.stackexchange.com/questions/21096/where-does-lang-variable-gets-set-in-mac-os-x <-- There is a tip in the accepted answer on how you can set it.
      - Thanks, I will try!!
    - https://scicomp.aalto.fi/triton/usage/faq/#command-line-interface Check the third question and answer.
- For the University of Helsinki it is the same, but to the Puhti service, right?
  - Please join the breakout room for HY. You have received the link if you registered before 8am this morning. :cat2: +5
- Any ideas about Narvi's fingerprint location?
  - Please ask the Narvi administrator in Breakout room 3
  - There is room for improvement in the documentation. Fingerprints will be added shortly.
    - narvi-shell.cc.tut.fi ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMLf7VgwBl6e4PUNxOvPmlRSPzOCzP9O0k9o6YvRwPF0
    - narvi-shell.cc.tut.fi ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBKG2IfmfjSn8qUHLmPEPhk/B1RY/VYz+xoilTyFrcXM83GDEyNkwtJsICHIpQIuGMcWIbpfqQX9OuRzCv40/bf8=
    - narvi.cc.tut.fi ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILIuNnPRWhJHi6vETtDi0xqN6NiCb3csRoLa/i056Gmm
    - narvi.cc.tut.fi ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBB1P2Au53FSSXLkd3szZfHkAG4PIknJVK39XxCoJiPsogacSSHvzhfFg5VvdYL70yQ629VVbWGbbpcmmH2pFis4=

## Connecting time / break until xx:00

:::info
* Keep asking questions if you have them
* Work on getting connected to the cluster: https://scicomp.aalto.fi/triton/tut/connecting/
* Once that's done, you get a nice long break
* After the break, we will do the "cluster shell" lesson - you can preview it if you want
* and then two live examples of what we do on the cluster, at xx:25.
**I am (please vote):**

- connected: oooooooooooooooooooooooooooooo
- have problems:
- not trying:
:::

**It is important to test the connection today because tomorrow all exercises will happen on the cluster and there won't be time to fix connection issues or account issues.**

- I have a problem
  - I removed your name :) please join the zoom if you are from Aalto or Helsinki uni or Tampere or Oulu, or just ask here :)

## Using the cluster from the shell

:::info
https://scicomp.aalto.fi/triton/tut/cluster-shell/

* This contains far more than we can go over now - we will summarize and you should explore some as homework.
:::

Link to git intro: https://coderefinery.github.io/git-intro/ (not necessary for this course)

Questions here:

- Please specify what is for Aalto and what is also possible for HY, I got a bit mixed up...
  - Most things shown here should work on any HPC system (anything that works like the Linux terminal).
  - Please check the documentation of the HY cluster or join zoom breakout room 1 to ask for clarification directly from the HY cluster admins.
  - HY cluster (Turso) documentation is here: https://wiki.helsinki.fi/display/it4sci/HPC+Environment+User+Guide
- Is the home directory accessible from any cluster?
  - I think this is referring to HY where there is more than one cluster, right? Usually home is home everywhere, but at Aalto the "cluster home" is not the same home as on the workstations...
  - Each cluster has a home directory.
- I think I'm connected and at ~ but ls shows no folders (or files). Is this normal?
  - Does pwd do anything? Your home should be empty since you haven't added any files to the cluster yet.
  - If you just got your account, there won't be files there yet. You can try making some when you are there
  - In the page, we'll leave that for self-study.
  - "The more you know" kind of information: `~` is your home folder.
- How do we move a program from a local repository on my computer?
  - We'll talk about this a bit tomorrow.
We aren't trying to teach it yet (for simplicity).
  - For triton: https://scicomp.aalto.fi/triton/quickstart/data/#copy-data-to-and-from-triton
  - There are multiple ways to move data around; rsync is one of the ways to do it from a terminal
- Is there a way for us to copy links from your command prompt? (or your notes)
  - The notes "hackmd" document has links to copy. Is that what you mean? Also, all links should be in the current lesson page.
  - I meant the command for git clone, but I just rewrote it myself - and it worked well
    - You can also find those commands on the relevant tutorial page. The git clone syntax in particular is repeated multiple times before the exercises.
    - You can find the commands in the material. There is a copy button at the top right of every code block. You can paste them into the terminal using different key combinations; in Linux it is usually ctrl + shift + v.
- How do you show the command history in real time like you did? Where is the log file in my system?
  - This is using prompt-log: https://github.com/rkdarst/prompt-log
  - When you exit the shell it saves to ~/.bash_history, and you can reuse it.
  - Push the up arrow key to see your own history live!
- Text editors like nano, vim ... use keyboard key combinations to edit, save, etc.; which one do you recommend for a starter?
  - Nano is simple enough to learn and makes a good starting point. The only real pain point with nano is that the keyboard shortcuts don't match what newer software tends to have. Vim is significantly more advanced and you might want to avoid it as a beginner.

:::info
Exercises until xx:23

All exercises are here: https://scicomp.aalto.fi/triton/tut/cluster-shell/

* You *won't* have time to finish, but play around some and continue tonight.
* The "shell crash course" also says other important things (link at the bottom of the page).
:::

## Two real-life examples and general Q&A.
:::info
- This is a demo only
- Don't try to type along; you don't know how to do this yet
- You'll learn these things on day 3
:::

But you can ask questions:

- First we are doing an "array job". It lets you run a similar thing many times, over and over again, really easily.
- `--array=0-100` lets us run 100 things at the same time!
- We can paste the submission script here (remember, we only learn this on day 3):

```
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --mem=200M
#SBATCH --output=array_example_%A_%a.out
#SBATCH --array=0-1000

T="$(echo 0.01*$SLURM_ARRAY_TASK_ID | bc)"

srun julia --project ising.jl --Temperature $T > array_run/out_$SLURM_ARRAY_TASK_ID
```

- Are you running Jupyter on Triton too? I mean, are the results available for plotting without any delay?
  - Yes. Through jupyter.triton.aalto.fi
- What is the difference between using some multiprocessing package in your script and submitting a job, versus using an array job?
  - We will talk about different ways of parallelization on day 3. In general though, array jobs are only practical in situations where you want to run the same program multiple times (possibly with different parameters etc.).
- Should we prepare our home dir (or whatever) for tomorrow's practical parts (HY)?
  - For all: you might want to download the repository with the exercises: `git clone https://github.com/AaltoSciComp/hpc-examples.git`
  - I would not store it in the home folder, but rather in your work directory, which is often called `$WRKDIR` in many clusters; check the documentation of your cluster for details.
  - (Aalto specific) you can get to your work folder with `cd $WRKDIR`; work uses a faster file system and is recommended for general use.
  - (HY) The work directory in the Turso cluster is /wrk-vakka/users/YOUR_USERNAME . The $WRKDIR variable is not set by default.
  - Is it possible to move the hpc-examples there after I already did git clone elsewhere?
    - You can use the `mv` command, but it also isn't critical for these exercises.
Just something that is good to get into the habit of.
- What was the trick to show possible commands in the command prompt?
  - There's a link above, "prompt-log", but there are others
- When using Jupyter, if we have some heavy computation for the GPU, does it go to the queue and wait for resources to be freed? / So, in Jupyter we only use the CPU, do I get your explanation correctly?
- What is the maximum number of processors we can specify in the submit script?
  - It depends on the type of job you run. Assuming you run it on a single node, it then depends on the type of hardware you have. On triton (https://scicomp.aalto.fi/triton/overview/#hardware) 128 cores is the maximum on a single node.
- Will we talk about the ssh config too?
  - Not any more in this course - you should read up on it on your own
  - If you have the time, it is worthwhile to set up. Ssh keys are not required to use the cluster, but they will make logging in faster.
  - Or do you use ProxyJump to get into the aalto network, or ProxyCommand?
    - ProxyCommand is more general than ProxyJump. ProxyJump is simpler and thus recommended if it works.
    - ok thanks a lot :D
- Do I need to somehow disconnect from Triton at the end of the day? I connected via ssh
  - You don't have to, but you can. If you aren't running something on the login node then there are no downsides
  - To disconnect, just type exit.
- Where can I find the Ising model example just shown?
- I can close the prompt on my workstation - and my calculations on triton will not die, right?
  - If submitted with sbatch, then they won't. Which is what you should be doing most of the time.
- Any reason to use tmux to leave jobs running?
  - More fragile: there can be internal problems in the cluster that break the connection, and then your job is lost.
  - And definitely don't leave them running on the login node itself!
- If I log out, will my job still be in the queue?
  - Yes, if submitted through sbatch (which is what you want to do most of the time).
Interactive jobs will be killed.
- I can connect to triton in the terminal, but not on jupyter.triton.aalto.fi. What should I do?
  - Are you on an Aalto network or vpn?
  - It should work from anywhere. What kind of problem do you get?
    - I'm on my home network. No VPN to Aalto. This is the error: Spawn failed: sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
      - Have you been granted a Triton account as part of a course? This might happen for those accounts. I'd recommend joining the Zoom room tomorrow to check this.
- When using Jupyter, if we have some heavy computation for the GPU, does it go to the queue and wait for resources to be freed? / So, in Jupyter we only use the CPU, do I get your explanation correctly?
  - Correct. Ask us for help if you need more; we can go over the options.
- What is the home folder used for?
  - In Aalto: currently home is mainly meant for storing your settings, ssh keys etc. In the unlikely event that the work directory is down or undergoing maintenance, having access to the home directory makes it easier to debug problems. We might let it go in the future. The work directory (`/scratch`) is the main workhorse of the cluster.
- Is the work directory in narvi /scratch? Should I make a directory there with my username?

## Feedback, day 1

:::info
Thanks to everyone! Please keep adding feedback. "Homework" and all will come to email + be on the workshop page.

* Make sure you connect to the cluster
* Go over the cluster-shell lesson
:::

Today was:

- too fast:
- too slow: ooo
- right speed: ooooooooooooooooo
- too advanced: o
- too basic: ooo
- right level: ooooooooooo
- I learned something today: oooooooooooooooooooo
- I will recommend this course: ooooooooooooooo
- I won't recommend this course:

One good thing about today:

- I know how to connect to triton
- The programme list itself acts as a good guideline for future reference. The amount of material available is otherwise overwhelming.
  - Yes, we're sorry about that. At least we tried to warn you! (Days 2/3 should be slower so you have a chance to follow along.)
- Great analogy between computer hardware and cooking pasta
- Good overview of what HPC is all about, given the time.
- The format and the organization are great. The explanations are very accessible.
- Zoom room helpful!
- Great explanations of the basics

One thing to be improved for next time:

- more cats :) oooo
- Find out participants' levels with a survey, and give suggestions on pre-reading for those who do not have experience with certain things. Skip the certain things if most participants are familiar with those. oooooo
  - Enrico here: I think it is better to be as inclusive as possible and assume learners with no familiarity with the topics covered. We make the materials available days before and people can choose what to skip.
- No more pasta, I got hungry ^^
- The "How big is my program?" part was maybe a bit overly simplified.
- First a bit slow and too fast towards the end: o

Any other comments about the course and format:

- If you use HackMD in your vocab, include it in the email - section: "Ask questions via collaborative document..." o
  - Yeah... that went pretty badly. We'll definitely need to fix this.
- The (twitch) vertical screen thing is genius and should be used in way more (online) lectures o
  - Thanks! I really wish others could think outside the box enough to do this.
- The framework is better than any other workshop I've ever attended - in terms of interaction and audio quality. HackMD is great.

Other questions about materials of today:

- .
- .
- .
- Question from HY: tried to git clone hpc-examples, but the permission was denied.
  - Can you paste the exact command you ran and the error?
    - `$ git clone https://github.com/AaltoSciComp/hpc-examples.git` fatal: could not create work tree dir 'hpc-examples': Permission denied
    - Update for HY, but still: `git clone https://version.helsinki.fi/it-for-science/hpc.git` fatal: could not create work tree dir 'hpc': Permission denied
      - The problem is it can't make a new directory on the cluster itself. Wherever you are (`pwd`), do you have permission to make a new directory there?
        - For the `pwd` command I get only "/"
          - Yeah, `/` is the root of all data, where you don't have permission to write. Read about changing directories on the cluster-shell page and where you should store files at Oulu.
      - Does the problem still persist today (Day 2)? Please go to Zoom Breakout room 1 to get help with this.

---

# Day 2 - Intro to HPC pt.1

## Icebreaker

What stage of your academic career are you at? (add `o` to vote)

- bachelor student: ooooooooo
- masters student: ooooooooo
- phd student: oooooooo
- postdoc: oooooooo
- professor: o
- a student of life: oo
- working at aalto: ooooo

I have already done these from yesterday:

- connected to the cluster: ooooooooooooooooooooooooooooo
- gone over the "cluster and shell" lesson: oooooooo0o0oooooo
- none of the above:
- I don't know what those are:

What's the worst bug you have ever solved:

- A " (instead of '') in a json file that python could not read. Took 3 hours and 2 professional engineers to finally find it
  - only 2 :D
- Not a bug, but I tried to rewrite an R library to fit my needs (not a nice job to do alone, maybe)
- Segmentation fault.
- I once worked in a team of about 10 people. Early on, I made a mistake in the calculation code, which I noticed about 10 months later. Initially it looked like "100 months of work down the drain", but we managed to use most of the work others had done in the meantime.
Are you a:
- cat person: oooooooooo
- dog person: oooooooooooooooo
- both: ooooo
- Cat person
- a cat: oo

Do you have any "real examples" you would like us to discuss during the final Q&A? If at Aalto, add them to the #teaching stream on scicomp.zulip.cs.aalto.fi, or discuss below (let's just hope we have time...):
- Maybe a naïve question -- but what I am hoping to do is transfer my R workflow to the cluster for speed/parallelization of tasks. I see that it is possible to run RStudio on the cluster (for UoH), which would be nice in terms of familiarity. Would a demo of how to do this be possible?

## About clusters and your work

https://scicomp.aalto.fi/triton/tut/intro/

Questions here:
- So CPU and GPU are not parts of the same node?
  - GPU nodes also have CPUs, but the majority of nodes are CPU-only.
  - List of different nodes here if you are interested: https://scicomp.aalto.fi/triton/ref/#hardware
  - And normally you shouldn't be using a GPU node without using the GPUs.
- How do we compare different clusters (for example Triton vs CSC)? Is it just memory and the number of CPUs?
  - CSC has a larger amount of resources, so most likely the waiting time is smaller. A local cluster (Triton, Turso, Carpo2) has the advantage of having the data "local", i.e. integrated with your other storage systems. Also, you can get better support from local HPC staff. But if all you care about is the amount of CPUs and RAM (e.g. large simulations where you don't need to transfer lots of data back and forth), maybe you want to use CSC.

## Slurm: the queuing system

https://scicomp.aalto.fi/triton/tut/slurm/

Questions about slurm:
- Is Slurm the same for Turso, or does it use a manager with another name?
  - Yes, slurm on most clusters these days. (the power of open source)
  - But the configuration is a bit different, so you need to adjust the slurm scripts some.
- (HY) Turso users can log in at https://version.helsinki.fi/it-for-science/hpc/-/issues/387 and post your questions/issues there; our team will help you there with Turso-specific stuff.
- If I submit a job on slurm, and then after the job is submitted I modify the code to test something new: when the job runs after the queue time, which version of the code will be run? So you can't submit multiple jobs using the same code base?
  - Is using a config file to pass the parameters a good way?
    - If you only need to change parameters, you can do that inside the sbatch file. That's a perfectly fine method.
  - Mostly slurm will run the code as it is when it starts.
  - We also have an exercise exactly about this later, but as mentioned, slurm doesn't snapshot your code until it starts running.
  - To run multiple jobs with the same code, you can either
    - run in a different folder for each copy (the code does not have to be in the folder you run it from)
    - use different input and output files for the copies
    - copy the code to another folder if you need to make changes to the code itself
- Can I submit multiple slurm jobs at once?
  - You can have several jobs running at once, yes. This is the whole point behind array jobs.
- What happened when you ran the job without srun? Did it skip the queue?
  - `srun hostname` # runs the command "hostname" on one node of the cluster
  - `hostname` # runs the command hostname where you are (the login node)
- What resources does srun reserve by default (if you don't specify --time etc.)?
  - It depends on the cluster; on Triton the default settings are 15 minutes, 500M of RAM, 1 CPU, so a pretty tiny job.

## Interactive jobs

https://scicomp.aalto.fi/triton/tut/interactive/

Questions about interactive jobs:
- Can we run interactive jobs on CSC clusters too?
  - At least on Puhti and Mahti, yes. Someone else can hopefully answer about Lumi.
- Do we need to install programs to run things in different clusters,
  like different python libraries?
  - Most clusters use modules for different software; if you need some specific python libraries you might want to make your own conda environment. Check your cluster's documentation for this.
  - We'll also talk about this later in the software lesson.
- I got an error: `srun --mem=100M --time=00:10:00 python3 slurm/pi.py 1000` srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
  - Your cluster probably needs other options for this. What site are you at?
  - Oulu University
    - Since no Oulu answers: check the docs and see what it takes to run. Maybe you need a partition.
    - We don't know if that's the problem here, but on some clusters you need to specify what "account" (= project) the resources will be billed to.
- srun: job 2398567 queued and waiting for resources -- not able to get resources, what could be the problem?
  - What was your submission script and your cluster?
  - carpo2, and I just typed srun hostname
    - Sometimes it can just be that it takes a while, if the cluster is busy.
- So just to double-check: we submit jobs to slurm from the login node (login3)?
  - Yes, the login node is the interface for doing anything on the cluster. As long as you use either srun or sbatch to run your things, slurm assigns them to actual computation nodes.
  - But if I want to run jupyter notebook, I need to log in to another node?
    - There are different ways to use Jupyter. You can submit it as a job and connect to it running on a node. Aalto's Triton has JupyterHub.
    - I'd recommend discussing with your university's staff to see what's the easiest way for you.
    - Triton info: https://scicomp.aalto.fi/triton/apps/jupyter/
- What is the difference between running on the login node vs. asking for the queue?
  - The login node is a shared resource used by everyone to log in to the cluster; if you run code on the login node it will slow down performance for everyone and possibly make people unable to log in.
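  - To make the login-vs-queue distinction concrete, here is the course's pi.py example (from the hpc-examples repository used in the exercises) run both ways:
    ```bash
    # runs on the shared login node -- only OK for very short tests
    python3 slurm/pi.py 1000

    # runs the same script on a compute node through the Slurm queue
    srun --mem=100M --time=00:10:00 python3 slurm/pi.py 1000
    ```
    The commands look almost identical; the `srun` prefix is what sends the work to the scheduler instead of running it where you are.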
- I ran the srun command and got the following response:
  ```
  srun: job 19208096 queued and waiting for resources
  srun: job 19208096 has been allocated resources
  usage: pi.py [-h] [--nprocs NPROCS] [--seed SEED] [--sleep SLEEP] [--optimized] [--serial SERIAL] iters
  pi.py: error: the following arguments are required: iters
  srun: error: csl48: task 0: Exited with exit code 2
  srun: launch/slurm: _step_signal: Terminating StepId=19208096.0
  ```
  - pi.py needs some number of iterations as an argument; it seems you didn't give it any.
  - Ok, then the command provided on the website needs the iterations added
    - Good catch, this needs fixing. The script is indeed wrong.
    - Updated
    - Or can we pretend this was practice reading error messages?
- When doing git clone, does it matter where (in what folder) you are?
  - Yes, the cloned folder will be a subfolder of where you are. Normally it is good to go to your $WRKDIR (many clusters use this shell variable, which points to something like /scratch/work/USERNAME) with `cd $WRKDIR`
- Related to the question above: cd $WRKDIR takes me to a scratch folder; can I make my own folder there?
  - Yes, this is your area to organize how it makes sense to you.
  - I can also see and access everyone else's folders in scratch, is that not an issue?
    - You can usually see other people's directories, but you shouldn't be able to access them.
- Yesterday we ran the script without slurm; what is the difference? What happens if we don't use the scheduler?
  - `srun hostname` # runs the command "hostname" on one node of the cluster
  - `hostname` # runs the command hostname where you are (the login node)
  - If you run without slurm you are running code on the login node. The problems with this are elaborated in the answer above, but basically you are hogging a shared resource that everyone needs for basic cluster usage.
- Does it happen often that people forget about using slurm and they just clog the login node?
  - Yes, this happens sometimes.
    Usually if your process hogs too much of the login node it will get killed fairly quickly, but it is still somewhat of an annoyance for other users.
- When we ran git clone yesterday on the login node, does that mean the cloning runs on the login node?
  - Running the git clone itself is not an issue. Yesterday we also ran some code on the login node, which is technically bad practice, but the code was so fast it didn't matter. You can run very short scripts on the login node without impacting other people; it just shouldn't be used for anything requiring significant computing power.
- Where did everyone get the test code? For me it says slurm/pi.py does not exist.
  - You need to clone the git repository; pi.py is available inside it, and you can navigate there using the cd command.
  - The command for cloning the repository is `git clone https://github.com/AaltoSciComp/hpc-examples.git`
- Sorry, how to change from the login node to our workspace?
  - These exercises are so small that you can run them in your home directory as well. Just type `cd` to get to your home directory. If you're at Aalto, you can get to your work directory with `cd $WRKDIR`
- Does anyone know the operation that is similar to 'srun' at Oulu's cluster, carpo2?
  - If you can share some documentation, we can have a look (Aalto admin here)
  - srun --mem=100M --time=0:10:00 python3 slurm/pi.py srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
    - You probably need to set a partition like `-p interactive` or `-p batch`. You can get partition names via `sinfo` or `slurm partitions`.
    - I checked the Oulu docs, and the linked page doesn't say much... it doesn't say what's missing from the command. We might need to wait for them to answer.
  - I also get the same error and srun does not work for me --- what should I type instead of srun when connecting to carpo2 (Univ Oulu)?
    - Can anyone from Oulu answer?
    - It seems the instructors from Oulu couldn't make it today.
      If you can join the zoom, there are people trying to figure stuff out through your documentation.
- Hi, I need to compile the OpenFOAM development version on Triton (Triton only has OpenFOAM 9). I git cloned and compiled it on the login node. Is that bad practice? Or should I run that also through srun?
  - Compiling on the login node is pretty normal. If it was a really long compilation it could be submitted as a job (but sometimes the nodes don't have compilers or other things needed).
- How to exit login?
  - `exit` closes the current shell. "Control-D" also exits.
- Slurm is a command? Only on the Aalto node?
  - Yeah, we have an extra `slurm` command installed that makes some of the other commands a lot simpler. CSC also has it. Other clusters may not have it (and that's why we also give the generic commands).
  - We try to give the generic alternative commands in the text.
  - It is often a wrapper of "sinfo" https://slurm.schedmd.com/sinfo.html or "sacct" https://slurm.schedmd.com/sacct.html which are more standard commands
- It is suggested to do the jobs in the gpu partition on the carpo2 cluster; how to do it?

## Exercises until xx:10

:::info
Exercises + break until xx:10 o
Try to do what you can. When we come back we'll go over them + you'll have a bit more time.
**Don't forget to take a 10 minute break!**

I am:
- done: oooooooooooooooo
- had problems: o
- need more time: oooo
- not trying: oo
:::

Questions about the exercises:
- Is it expected that I am not able to run the first exercise because "requested jobs are busy"?
  - Which cluster are you running on?
  - csl48 maybe?
    - Try running the commands from the login node. If you're running them from an interactive session (from csl48 etc.), you already have resources booked for you and srun will behave differently.
- There are now too many past slurm jobs for me. How can I clear them? (and I assume it will only clear my info, and not everyone else's?)
  - `slurm history 1hour` will show just one hour (if using the slurm command);
    by default it's two days or something.
    - beautiful
  - If using `sacct`, there are options to limit what it shows
  - The Slurm database records everything for months.
- What happens when you do not specify the time in the srun command? If I do not know how long my program will run, is it an option to not specify it?
  - Slurm has some default time set; what exactly it is might depend on your cluster. It's generally not very long, though. If you don't specify the time it will just use the default setting.
- Is there anyone from Oulu?
  - I have emailed the Carpo2 admins; join the zoom, one Aalto RSE is trying to help Oulu people there.
- If I allocate time for 1 hour and the process is finished in 30 mins, do the resources get freed for others?
  - It depends. Usually yes, but with programs like matlab, if you don't have an "exit" statement (or use the "batch" flag) the matlab terminal will be there waiting for more commands... (see a few examples of this at https://scicomp.aalto.fi/triton/apps/matlab/#simple-serial-script)
- Can anyone on the cloud kill my running program? Or admins only?
  - It depends. Usually only admins can cancel your jobs or kill your programs. There are some cases where some GPU nodes have a higher priority for certain research groups (they bought those nodes), but they can be used by others when they are idle. If you decide to use one of those nodes because they are idle, you can get kicked out by the researchers who bought the node. It's a very specific corner case of course :)
- Not completely related, but: can we customize bash (colors etc.) on Triton?
  - Sure! You can add your custom colors in your ~/.bashrc
  - Keep in mind that you have a separate .bashrc on Triton, so you need to copy your configs over if you want to use your old configs etc.
- If I run a program using the srun command but accidentally close my shell, how can I get back the results when I log in again?
  - Unfortunately your results are gone if you close your shell or your connection is interrupted.
    Next lesson we learn how to use batch jobs, which run independently.
- How to kill a command that I just ran?
  - With srun, or just like that? Without srun: Ctrl+C. With srun: Ctrl+C too; if it doesn't react, you can also open another connection and kill the job (`scancel` with the jobID, which you can get with `slurm q`)
  - with srun
    - Control-C kills many programs, including srun-stuff

## Serial (batch) jobs

https://scicomp.aalto.fi/triton/tut/serial/

Questions about batch jobs here:
- Is #SBATCH a command or a comment?
  - It's technically a comment, but slurm treats it as an argument. Bash ignores those parts and they are only processed by slurm.
- Should the file be made directly in the work directory? Via the terminal?
  - You can make the file however you want: directly through the terminal or copied over. The location of the script on the cluster doesn't matter, besides needing to think about the relative paths in your submission script.
- Does the time need to be in HH:MM:SS format?
  - From the slurm documentation: "Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"."

:::info
### Exercises until xx:15
Try what you can; at least be able to do the basic ones, because we'll be using this a lot (the advanced ones are not too important). **Remember to take a 10-minute break**.
https://scicomp.aalto.fi/triton/tut/serial/#exercises

I am:
- done: oooooooooo
- not trying:
- had problems: oo
- need more time: ooo
- want to go over in detail: oooo
- let's move on quickly:

What exercises do you want us to do as demos?
- Ex1:
- Ex2:
- Ex3:
- Ex4: o
- Ex5: ooo
:::

Questions about exercises (if you have a cluster-specific question, please specify your cluster):
- How do you leave after using 'less filename'?
  - press `q`
- How to exit less pi.out? I have a highlighted (END) and I can't quit it.
  - press `q`
- How can I leave the view in the command window?
  - GNU nano 2.9.8 run-pi.sh Modified
    - If using nano, then "Control-x", then "y" and "enter".
    - If using less, then "q".
- I don't see any jobs after running 'sbatch run-pi.sh' and slurm queue.
  - It probably ended so fast that it's already done. Is the output there?
  - On Triton, you can also do `slurm h` to check your history and see if it ran.
  - Yes, it was there.
- Is it necessary to use srun in the slurm .sh file? The program seems to run just fine on the node without it, too. (`python pi.py` instead of `srun python pi.py`)
  - You can run it without `srun` as well, but using `srun` makes Slurm record extra information about that step. For certain parallel programs you will need to use it.
  - A rule of thumb: using `srun` doesn't hurt, so you can always add it.
- I'm on the CSC cluster. The process ended in an error: execve(): python: Permission denied.
  - Try `python3` instead of `python`. Or you might need to load a Python module to provide it. This error can also mean "can't find a program named that".
  - Thanks! It was that.
- Is there any way to debug the code on the CSC cluster and see where the error is?
  - What kind of debugging? Of the python code itself?
  - yes
- Is it possible to do anything -- execute some command or delete anything, for example -- that would break the server for other people? I.e. what do I need to be careful not to do while using the cluster and the jobs?
  - Good question! You should not be able to remove other people's files, but you can delete your own files, so be careful when running commands that remove files (avoid wildcards etc.). When you're running programs in the queue, each program will get its own share of the resources and they should not affect each other. You'll want to avoid stuff like millions of web requests, sending spam emails or submitting thousands of jobs at the same time. You might also want to verify that your code does not create/write/read thousands of files, as that can be bad for the shared file system.
  - There are (often) some shared resources. The login node is a good example. If you run a big computation on the login node, you can prevent others from using the system. The file system is another. If you create or read millions of files quickly, you can slow the file system down for others.
  - tl;dr: if you can break something for someone else, that's a cluster problem. You *can* cause congestion on the login node, or use too much disk input/output bandwidth, but these are rare.
- Can we include other things in the bash scripts, like virtual environment setup?
  - Yes. You can run other commands before the main computation. Each line in the script is executed in order.
- How can I view the output from the slurm job?
  - The output file should be created in (I think) the same directory as where you ran the script from. If you didn't specify a name for the output file with `--output=name` then it should be called slurm-JobID.out
  - Yes I have; is there a way to view the contents of it?
    - Good options from the terminal are nano or less. Alternatively, if it's short you can print it to the terminal with `cat filename`
- Can I clone a private repository from GitHub into the workdir?
  - Yes, but you need a personal access token or to set up ssh keys on the cluster (as you would on Linux; instructions at https://docs.github.com/en/enterprise-server@3.4/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens for PAT and https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account for ssh keys)
- The slurm h command gives an output with uneven columns. Is there a way to fix it?
  ```
  19210786 the_script.sh 06-07 13:35:55 100M - 00:11.028 00:00:12 none 1 1 0:0 COMP csl48
  └─ batch * 06-07 13:35:55 1M 00:00.056 00:00:12 1 1 1 0:0 COMP csl48
  └─ extern * 06-07 13:35:55 1M 00:00.001 00:00:13 1 1 1 0:0 COMP csl48
  └─ 0 python 06-07 13:35:56 1M 00:10.970 00:00:11 1 1 1 0:0 COMP csl48
  19211507 the_script.sh 06-07 13:43:40 1000M - 00:55.280 00:00:56 none 1 1 0:0 COMP csl48
  └─ batch * 06-07 13:43:40 1M 00:00.048 00:00:56 1 1 1 0:0 COMP csl48
  └─ extern * 06-07 13:43:40 1M 00:00.001 00:00:56 1 1 1 0:0 COMP csl48
  └─ 0 python 06-07 13:43:40 1M 00:55.229 00:00:56 1 1 1 0:0 COMP csl48
  ```
  - Does it get better if you make the terminal wider?
    - Nope
  - It should look fine when your terminal is wide enough to fit everything on one line. It does look quite messy before that.
  - I have two 4k screens. Still doesn't help. The other commands look just fine. I also get this error when I open zsh: /usr/bin/man: can't set the locale; make sure $LC_* and $LANG are correct
    - I suppose you are connecting from a Mac. Is it so? OK, so it is another thing: basically some terminals carry over some variables from where you connect. It can be debugged and fixed. If you are at Aalto, join our garage some other day. In general this is not blocking your work.
    - I use Windows with WSL (Ubuntu). I have oh-my-zsh installed, but the behaviour is still the same without it. Yeah, it doesn't seem to affect anything else.
    - Also, for what it's worth, we recommend BASH rather than ZSH on the cluster.
    - Tried bash. All the same, with the same error.
    - If you are at Aalto, the fix is here: https://version.aalto.fi/gitlab/AaltoScienceIT/triton/-/issues/546#note_96854 (Available for Aalto users only)
- How long are the files stored in the cloud? Or should we delete them manually?
  - Depends on the cluster -- some automatically delete, some let you keep files there longer. Can you ask again when we are talking about storage later today? - ok
- How to leave watching the tail output?
  It doesn't seem like I can type cancel or anything else
  - Control-c
- I got hold of the person responsible from Oulu. I am now able to run srun. Has the issue been solved for the other participant(s) from Oulu?
  - Is the solution something you could write here for other people to use as a reference?
    - Not really, but the administrator told me he can be reached via Teams. Check zoom if you don't have his contact.
- How can I cancel a bunch of jobs at once in slurm?

## Monitoring

https://scicomp.aalto.fi/triton/tut/monitoring/

- Can I explore the results directory while my calculations are running?
  - Yes! And it's good to do so to make sure stuff is working correctly.
  - Can I also use these files (for postprocessing etc.) while the job is still running?
    - Yes! And this is also a good idea. Just be aware they are changing: they might not be complete, and when you read lines they might be cut off halfway because of buffering (if you keep appending to a file), etc.
- Note: `seff` is an extra thing that needs to be installed; your cluster might not have it.
  - Same with slurm h / q; they are Aalto-specific wrappers for sacct / squeue
  - seff unfortunately is not easily replicated with normal slurm commands, because it computes efficiency from various elements of `sacct`

:::info
## Exercises until xx:05
https://scicomp.aalto.fi/triton/tut/monitoring/#exercises
Do what you can. The rest of the day/tomorrow doesn't depend on this *too* much. It's mostly practice with the stuff we know + a bit more.

I am:
- done: oooooo
- not trying: ooooo
- had problems:
- need more time: oo
- want to go over in detail:
- let's move on quickly:
:::

Questions about exercises here:
- How to do `slurm history` on another cluster that does not have the slurm command?
  - `sacct -u <username> -S DATE --format=JobID,JobName`, with DATE e.g. being 01/01 for "since January first" (essentially -S indicates a start time/date).
    You can add additional fields to the output, see: https://slurm.schedmd.com/sacct.html#OPT_helpformat
  - `slurm` was made by our previous admins at Aalto. You can also find the source code for the `slurm` tool here: https://github.com/jabl/slurm_tool
    - You can use it yourself or you can ask your site admins to install it for you.
- Is there a lag before the job appears in slurm history? Sometimes after running a script and checking that it's not in the queue anymore, it doesn't show in the history unless I run it again.
  - If the job runs very quickly, it might not show up properly in slurm history; try using `sacct` and check if you can see it there.
  - I think I *have* seen a lag sometimes, but usually not. Does it appear in the history eventually?
    - No, it only appears when I run it again. So I don't see it twice in the history
- Is the seff cpu efficiency of a `srun` step a measure of how efficient that step was, or a measure of how large a portion of the total job it was? Like, an efficient but small step shows low efficiency if the total job was large. How to see the efficiency of a single step?
  - Yes, seff doesn't work well with job steps, so one should manually try to divide the total time into steps and get the individual steps' efficiency. Something we could add to seff one day.
  - `sacct -j JOBID --format Jobid,CPUTime,TotalCPU` # should give you the data to compute the efficiency % for each step
- How to use the tail command? I'm not able to see the steps in the loop.
  - Do you see them at the end when the job is done?
  - yes
    - There is buffering while the job is running, so not everything is logged in real time. Buffering can be disabled -- give me a few seconds to dig out what the variable to set for that was :)
    - `export PYTHONUNBUFFERED=true` (assuming you are using python)
- Is it just me or is there no sound? ooo
  - yeah, I was about to reboot my machine :D
  - Is there sound now? I can hear it.
    - Yes, the instructors just forgot to unmute
- Is the walltime supposed to be shorter in exercise 4? It is not for me.
  - It should be, but maybe you got a node with a very strong cpu first and then one with two significantly weaker ones? Or something was wrong with your script / pi arguments.
  - Another reason could be that the code ran so fast that the gain from using two cpus was offset by the overhead from parallelization.

## Applications

https://scicomp.aalto.fi/triton/tut/applications/

- .

## Modules

https://scicomp.aalto.fi/triton/tut/modules/

- So if I'd like to load a module and some conda environment from within a .sh file, what line should I write?
  - Different clusters can have different recommendations, so make sure you check first.
  - Roughly, from a shell on Aalto computers:
    - `module load miniconda`
    - `source activate /path/to/conda/env`
  - CSC recommends always making containers to store conda environments, which is a good idea but another thing to learn.
    - https://docs.csc.fi/computing/containers/tykky/
  - But as a quick TL;DR, sbatch files are just bash scripts at their core, so the commands should be roughly the same as what you'd do normally on the terminal.
- .

## Data storage

https://scicomp.aalto.fi/triton/tut/storage/

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvzv <- CATS :cat:

- So is it usual to keep all the important files on my own workstation and just use Triton for calculations -- which means sending data back and forth?
  - At least the data you want to back up, yes. Normally this does not happen daily. However, the code should be backed up daily. "In case of fire: git commit, git push and leave the building". In practice all I want to back up are the final figures for the paper and the code (the original data is backed up of course).
- How do I ask for quota?
  - Contact your own cluster admins (however their docs say). Everyone has some default quota.
  - If you need a lot of quota, most likely it is because it is a big project.
    In that case it is better to ask for it as part of a team (project), so that if you win the lottery and move to a desert island, other people in the team can continue from where you left off.
- Now the TV stream is somehow working badly and disappears, or the quality becomes poor.
  - Try refreshing; it looks good here. You can set twitch to keep a constant bandwidth (in the quality settings, do not use "auto")
  - As another note, having "low latency" on from the advanced settings sometimes messes with audio quality.
- Can I run a program such that the output file is stored in a github repository? Also, reading a big matlab file from the repository as well?
  - github has some limitations on the size of the files you can add. There are data versioning solutions (e.g. git-annex, git-lfs) that can be used. Personally I find it simpler to make sure that the data folder points to a local, backed-up solution, rather than having the data versioned with the code. Sometimes, however, the data is inherent to the code (e.g. some large AI models on huggingface); then the data could be stored in a different location / system outside git (and git-annex or git-lfs can be used for versioning)
- If I want to run a program which needs different python modules, should I install them in the work folder, or are modules available by default?
  - Use virtual environment tools such as "conda". The conda environment should live with the project, so ideally you want to version control it and store it with the rest of the files. Here is some help on how to get started with conda: https://scicomp.aalto.fi/triton/apps/python-conda/
  - Never run "pip install xyz" unless you have activated a virtual environment. Which means: do not affect your global python settings; keep the settings specific to each project with virtual environments.
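  - As a rough sbatch sketch combining the module and conda advice above (the module name, environment path and script name are examples only; check your own cluster's documentation):
    ```bash
    #!/bin/bash
    #SBATCH --time=00:30:00
    #SBATCH --mem=1G

    # load the cluster's conda module (the exact name varies per cluster)
    module load miniconda
    # activate a project-specific environment created beforehand
    source activate $WRKDIR/my-project-env

    # run the actual computation inside the environment
    srun python3 my_script.py
    ```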
## Ask us anything, day 2

- If you have time, can you also show how to install something with pip in your environment, maybe for day 3?
  - There was a quick demo - Python + Conda: https://www.youtube.com/watch?v=dmTlNh3MWx8&list=PLZLVmS9rf3nOmS1XIWTB0Iu7Amvf79r-f&index=5&pp=iAQB

## Feedback

**Thanks all for coming! See you tomorrow with more about using many processors at the same time.**

Today was:
- too fast: oo
- too slow:
- right speed: oooooooooooo
- too advanced:
- too basic:
- right level: ooooooooooo
- I would recommend this day: oooooooo
- I wouldn't recommend this day:

One good thing about today:
- Got more comfortable using the terminal, nano etc. +1
- the exercises were really useful
- Exercises really made me memorize some specific commands
- good documentation that one can follow if they get behind the teaching
- Same as yesterday, the analogies were quite helpful. And good to understand how the network is structured / how the parts communicate with each other.

One thing to be improved next time:
- Maybe an actual example of loading a module of some software - and making sure it works. -o
- An example of loading data from our workstation and then saving results to our workstation. -oooooo
- The last hour (i.e. the four small lectures) was quite rough in terms of the amount of information; also jumping a lot from one topic to another, but I guess it helps to just read it again in a few days
- Was the problem of Carpo solved for UO students‽
  - It was for me, but it was too late for me to do the exercises.
  - Yeah. It was solved for me as well, and then I tried to do the basic examples ;)) Good experience though.

Any other comments?
- A bit long day
- I think the end of today is reaching the end of my comprehension! Tomorrow might need to slow down +1
- .
- Theory a bit slow but the exercises too fast - o +1
- Yes, I would appreciate doing an example of loading a module.
  - No wait, this is good ahahaha
  - :D

---

# Day 3 - Intro to HPC pt.2

## Icebreakers

What is the best ice cream flavor?
- Pistachio: ooo
- Salted Caramel
- Chocolate: oooo
- Melon
- white chocolate
- TIRAMISU: ooooooo
- Old fashioned vanilla: oo
- Banoffee
- Hazelnut
- Cookie Dough
- Butterscotch
- Banana split: o

Day 1/2 I was:
- bored:
- ok:
- happy: ooooooo
- overwhelmed: ooooooooo
- eager for more: oooo
- unsure of what's next: o
- .
- The last couple of lectures were a bit overwhelming after the exercises
- I didn't manage to join the second half yesterday, unsure if I can follow today.
  - If you didn't have time to check the basics of batch jobs it might be tricky; try having the tutorial open for reference: https://scicomp.aalto.fi/triton/tut/serial/ In the worst case, these are all recorded, so you can just return later if something goes over your head.
  - If you were there for serial jobs, you'll probably be fine.

What learning outcomes are you getting from this course?
- how to work with slurm and how to estimate resources needed for a program to run
- How to estimate resources needed and how to use slurm and triton to run programs
- .

What learning outcomes do you want? (please sort into the question above if it fits there)
- Intro to where to start with GPU calculations o
  - Good news: this is part of today.
- How to load my data onto the cluster and get outputs saved o
  - I don't think we'll touch on this too much today, but you can check these tutorials afterwards for further reference: https://scicomp.aalto.fi/triton/tut/storage/ https://scicomp.aalto.fi/triton/tut/remotedata/
- The workflow for turning my GitHub Python script into a functioning job on Triton o

extra:
- For more info on git-annex: https://scicomp.aalto.fi/scicomp/git-annex/
- Videos are available here: https://www.youtube.com/@aaltoscientificcomputing3454

## Intro

Questions:
- How should we determine whether we need to do a task on the GPU instead of the CPU?
  - .
## Basics of Parallel Computing

:::info
https://scicomp.aalto.fi/triton/tut/parallel/
:::

Questions here:
- Where's the right place to save processed videos?
    - On the Triton cluster it is $WRKDIR, i.e. your work directory on the Lustre filesystem (/scratch/work/...)
- Why are they called embarrassingly parallel jobs?
    - This is just a solidified nickname for jobs that can be perfectly parallelized.
    - Compared to the people doing "real" parallel computing, it's embarrassing that it's so easy yet you still call it parallel!
- Is MPI the way that LLMs are trained?
    - ChatGPT was trained with MPI (or so they say)
- MPI: What are some examples of the messages being passed? A numpy array containing some parameters?
    - Usually the transferred information is numerical arrays, but at a lower level than numpy arrays.
    - Anything that the main process requests from a worker within the MPI communication. For example, the communication can be: give me the value of that variable once it is computed, or similar. It can be an array etc. Types of communication differ; that is a topic for a whole lecture.
- I have an M1 chip with 8-core GPU and 8-core CPU, but when I check my Activity Monitor the GPU is never used. Is it only used for gaming? I thought 4k videos or 3D modelling would use it.
    - On an NVIDIA card, if CUDA is enabled, it can be used for calculations as well; whether watching video uses the GPU depends on the video player and codec you use/have; overall, depending on the application, some computational tasks can be offloaded to the GPU
    - it seems CUDA is for NVIDIA, not for Macs
    - MacBooks seem to have Intel HD ones, yes, but still some computing can be offloaded to Intel cards as well [google has a bunch of links on 'intel hd computing']
        - It is not Intel, it's Apple's own chip
        - Depends on the model, my MacBook from 2016 has Intel HD Graphics 530; anyway, a GPU is a graphics processing unit, and if your code can utilize it, then you can offload some tasks to the GPU
    - Nice, I'm gonna try that.
- Another possible thing: does your laptop have another dedicated GPU? That could explain why it's not used.
    - These are my chip specs: Apple M1 chip, 8-core CPU with 4 performance cores and 4 efficiency cores, 8-core GPU, 16-core Neural Engine
        - yes it is ooo

## Array jobs

:::info
https://scicomp.aalto.fi/triton/tut/array/
:::

Questions here:
- When I try to write the array_examples.sh, it won't save because it says it is read-only? Didn't have this problem with nano yesterday
    - check your current directory with `pwd`. make sure you are where you expect to be. -- yes I am
    - what directory and what cluster? -- turso, hpc_examples
    - It might be a good idea for you to join Helsinki's zoom breakout room to get it fixed.
    - (HY) Turso is having the /home directories temporarily in read-only mode during maintenance. Use your work directory on Turso /wrk-vakka/users/$USER for writing output.
- Can you give an example of using array jobs (like the embarrassingly parallel method) in a use case (e.g. training a neural network)?
    - As an additional explanation: an array job description file is nothing else than a template for a number of exactly the same jobs that only differ by the array task ID. Any embarrassingly parallelizable algorithm can (and should) use it. The example with Pi is a good one.
    - You can, for example, run different hyperparameters (like learning rate, model size, etc.) on independent array tasks if you map the ``$SLURM_ARRAY_TASK_ID`` to different parameter combinations. Training a network is rarely training just **one** network.

:::info
## Exercises until xx:15

https://scicomp.aalto.fi/triton/tut/array/#exercises

Try to do at least the first three... array jobs are one of the most likely things you'll be using!

**Remember to have a 10 minute break**

I am (multi-vote):
- done: o
- not trying:
- need more time: ooo
- successful: o
- not successful: o
:::

Questions regarding the exercises:
- How can I set parameters (arguments) with different values in each run?
(exercise 1)
    - The option that requires the least shell scripting knowledge is probably using a case statement; you can adapt the script from one of the examples on the tutorial page (but in general `$SLURM_ARRAY_TASK_ID` is a variable you can use like any other):
      ```
      case $SLURM_ARRAY_TASK_ID in
          0) SEED=123 ;;
          1) SEED=38  ;;
          2) SEED=22  ;;
          3) SEED=60  ;;
          4) SEED=432 ;;
      esac
      ```
- It writes the output in the error file, even when the execution was successful and without errors? +1
    - can you show the #SBATCH directives in your script?
        - `#SBATCH --output=pi-array-hardcoded_%a.out`
        - `#SBATCH --error=pi-array-error_%a.out`
    - That is because the example has been written to do some of its output there (we'll adapt it in the future), don't worry about it.
        - https://github.com/AaltoSciComp/hpc-examples/blob/master/slurm/pi.py#L30
        - print("...", file=sys.stderr)
- is there echo?
    - Not on this part of the planet, maybe you have 2 twitch windows open?
    - it was from zoom
    - Fixed now
- Could you elaborate on ex.5? I didn't complete that one but would like to know about this memory case.
    - Is there something specific you would like to know that is either not in the solution or not clear enough?
    - At least I am struggling with the solutions that don't give the code you need to type, like the #SBATCH info etc.
    - That makes sense, the solutions are far from perfect. For ex5 I think something like this should work:
      ```
      #!/bin/bash
      #SBATCH --time=00:10:00
      #SBATCH --mem=250M
      #SBATCH --job-name=memory-array-hardcoded
      #SBATCH --output=memory-array-hardcoded_%a.out
      #SBATCH --array=0-4

      case $SLURM_ARRAY_TASK_ID in
          0) MEM=50M   ;;
          1) MEM=100M  ;;
          2) MEM=500M  ;;
          3) MEM=1000M ;;
          4) MEM=5000M ;;
      esac

      srun python slurm/memory-use.py $MEM
      ```
    - please don't remove the three backquotes above, it ends the code block (sorry!)
    - thank you! still trying this one... it says 'error: the following arguments are required: memory'
        - What was your submission script and what cluster are you on?
It seemed to work fine for me on triton.
        - just sbatch and that .sh name, on turso

## Shared Memory Parallelism

:::info
https://scicomp.aalto.fi/triton/tut/parallel-shared/
:::

Questions here:
- You surely have realized that there is a "copy code" button in the top right corner of your memos
    - Yes, I type each time to slow down enough that you all have a chance to keep up.
- If I have my code in Python and I want to use the cluster, do I need to re-write it in shell language?
    - no, you'd modify it to be able to use multiple processors (if you want), and bash is just the controller of it that lets you script it more easily
    - shell/bash is just a way to script what you would manually type in the terminal interactively (e.g. python3 nameoffile.py)
    - Now I get it, of course, just confused with so much info :D
        - I know! That's why we are here

:::info
### Exercises: shared-memory parallelism (until xx:50)
### and break (until xx:00):

https://scicomp.aalto.fi/triton/tut/parallel-shared/

Try to do what you can, nothing else will depend on this.

**Remember to take a 10 minute break before xx:00**

**Note: "multiprocessing"-type models are also run this way.**

- done: ooo
- not trying: oo
- need more time: ooo
- successful:
- not successful:

**After the break, we will have a guest speaker from CSC, and then go over the exercises + future stuff at around xx:30**
:::

- On turso, `seff JOBID` is only returning 'job not found'. I also cannot find the job with slurm q or slurm h
    - Please join breakout room 1
        - ok
    - Turso: Use seff -M ukko JOBID
- There is no numpy in my cluster :'v I'm in csc puhti.
    - Are you sure it's not available as a module somehow? `module load python-data` should make it available if my quick google search is correct.
        - It is, thanks!
    - Any idea what the module name is for HY Turso?
- After running 'srun --time=00:10:00 --mem=1G python pi.py 100000000', the job has been allocated resources, but then for a minute I could not write anything (even seff).
The terminal would not allow me to do anything (writing starts from the beginning of the line). Is it working correctly?
    - It is probably running. You can cancel the job with Ctrl + C if it does not respond.
    - To add: if that was your exact submission command, then you are running it as an interactive job and it will hog your terminal until it finishes.
    - You can write the requests and commands into a slurm script if you want to run it non-interactively.
- srun --cpus-per-task=4 --time=00:10:00 --mem=1G python pi.py --nprocs=4 100000000 has been hogging my terminal for over 30 min. Is this normal?
    - Something seems wrong, it shouldn't take that long. Try ctrl+c or close the terminal and start a new one. The only scenario I can think of where that would be normal is if you have been waiting in the queue the whole time.
        - I did get resources allocated since the beginning. I was able to abort and got a message about Out of Memory.
    - pi.py does eat quite a bit of memory at higher numbers of iterations. I tried running it with your settings but 4G of memory and it finished in 30 seconds; after requesting only 1G it hangs and the output tells you there were several out-of-memory errors.
        - Thanks, that fixed it, the 4G was enough.
- This is not a question about these exercises specifically, but if I run "srun python pi.py 1000000" for example without specifying memory or time, how much memory is reserved for the job?
    - The defaults are defined by the admins. On Triton you'll get 15 minutes and 500M of memory.
    - As a general rule of thumb, unless you are running something tiny the defaults are probably too low for your purposes.

## Laptops to Lumi: an overview of CSC resources

You can ask questions from Juha here and we'll direct the questions to him in the stream:
- What type of research projects benefit from the LUMI supercomputer?
- Are Bachelor's/Master's theses allowed use cases for CSC/LUMI?
- When would using LUMI make more sense than using Triton in terms of computation?
- Using the CSC environment efficiently: https://csc-training.github.io/csc-env-eff/
- Can I create a personal project? How many resources can I ask for?
    - For a project you basically need to be a staff member. If a student, you would ask to join your supervisor's project.
    - Are the projects officially funded projects? Do they need to be approved?
        - For most projects, the projects are just allocated billing units that are internal, usage-measuring "money". Actual money comes from the Ministry of Education etc. that funds CSC, and users do not need to think about it. Projects are approved by the service desk. For getting large amounts of BUs there might be additional steps.
        - billing units: 100k automatic, 5M units is large, 20M is very large.
- .
- How to access LUMI?
    - See the following documentation pages:
        - https://www.lumi-supercomputer.eu/get-started-2021/users-in-finland/
        - https://docs.lumi-supercomputer.eu/firststeps/accessLUMI/
        - https://docs.csc.fi/accounts/how-to-create-new-project/#creating-a-lumi-project-and-applying-for-resources
- Could you compare LUMI vs. AWS or Azure?
    - LUMI is an HPC system while AWS and Azure are cloud services. You can build your own computing cluster on top of cloud services, but you'll have to pay for the resources and set it up yourself.
- For information on LUMI: https://www.lumi-supercomputer.eu/
- More information on CSC machines: https://docs.csc.fi/
- Is support for using the CSC services available in the daily garage (or is it only for Triton)?
- I'm in a project with LUMI, but how do I start a session on it?
    - https://docs.csc.fi/computing/connecting/#setting-up-ssh-keys

## Checking exercises together

https://scicomp.aalto.fi/triton/tut/parallel-shared/

Questions here:
- Why didn't you specify memory or time in this case (ex.1)?
    - There was still a low enough number of iterations that the default values were sufficient.
- 333% cpu, what does this mean?
    - On average 3.3 CPUs were used in the time it took.
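To make the "333% CPU" figure concrete: the utilization that tools like `seff` report is roughly total CPU time summed over all allocated cores divided by elapsed wall-clock time. A toy calculation with made-up numbers:

```python
# Made-up numbers for a job that kept ~3.3 cores busy on average.
cpu_time_s = 100.0    # CPU seconds summed across all allocated cores
wall_time_s = 30.0    # elapsed wall-clock seconds
utilization = cpu_time_s / wall_time_s
print(f"{utilization:.0%}")   # 333%, i.e. ~3.3 CPUs busy on average
```

If you requested 4 CPUs, 333% means roughly 3.3 of them were doing useful work, so the request was close to right-sized.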
- How was this script parallelized onto multiple CPUs? This script wasn't an array job. How did shared memory work in this case?
    - So pi.py runs using shared-memory parallelism, which means the cpus can communicate directly with each other without needing MPI workers since they are all on the same machine (slightly inaccurate, check the answer below).
    - Technically it's the `multiprocessing` Python module (which isn't classic shared memory). Other things are R futures, Python joblib, etc.

## Break until xx:05

## MPI (very quickly)

https://scicomp.aalto.fi/triton/tut/parallel-mpi/

-
-

## GPUs

https://scicomp.aalto.fi/triton/tut/gpu/

- Do you have an example of a code created for a GPU?
    - Deep learning frameworks like tensorflow, torch and so on.
    - [mumax3](https://mumax.github.io/)
- can you share a module name that requires a gpu?
    - On Triton: `cp2k/8.2-cuda-openmpi-scalapack-psmp-volta`
- When I try to run the example, I get "Unable to allocate resources: Requested node configuration is not available"
    - what cluster? you might need to specify GPU partitions, e.g. `-p gpu`.
        - Sorry, how to check which GPU partitions are available?
            - You need to check the instructions for the cluster you are on. If you say which one, maybe someone knows
- Do we need to manually select which GPU our job would run on? Isn't the idea that the system will decide which free GPU our job would run on?
    - It will give you some GPU, if you do not specify a specific GPU architecture.
- If my program does not use the full GPU memory, is it okay if I run several jobs on the same GPU?
    - If you have free GPU cores, yes. However, usually programs reserve the full GPU when they are running on it.
        - Does it mean that I cannot use a part of a GPU, but always need to use the whole GPU (8k cores)?
            - You can't reserve partial gpus, so it would be ideal if your code uses as much of it as possible so resources aren't going to waste.
- Are there only 43 GPUs in Triton?
(https://scicomp.aalto.fi/triton/overview/)
    - https://scicomp.aalto.fi/triton/tut/gpu/#available-gpus-and-architectures
    - That might be nodes; each node has 4-8 GPUs. It should be closer to 250.
- Using `sacct -j JOBID -o comment -p` does not print anything, it is empty. And `seff` does not give GPU job efficiency, just a line: --------. What could have happened?
    - Which cluster? Triton?

## Q&A: Ask us anything

- Would you recommend some further tutorials for GPU calculations? I've never run parallel calculations but I'm thinking about trying to use a GPU for my calculations.
    - CSC has a good course on CUDA programming: https://github.com/csc-training/CUDA
- If I am using numpy for a large program on my personal laptop that doesn't have a proper GPU, would the processor run the program?
    - Yes, if there are no GPUs, then the CPUs will run the program. Depending on the packages you use, you can have code that works on both and you don't have to do anything, or you might have code that can only work on a GPU.
- What are TPUs and how do they differ from GPUs and CPUs?
    - Basically Google's own gpus :) Made specifically for TensorFlow
- So the gaming GPU is of no use for my own calculations?
    - They can still be quite efficient, but not as good as a GPU specifically designed for the tasks you want to do.
- What causes crashing of such big GPUs, out of curiosity?
    - What do you mean by crashing? The most common error we see (not programming-wise) is a lack of memory.

## Feedback, day 3

Today was:
- too fast: oooo
- too slow:
- right speed: oooooooooooooo
- too basic:
- too advanced: oooooo
- right level: oooo
- I would recommend today: ooooooo
- I wouldn't recommend today: o

One good thing about today OR this course overall:
- I really enjoyed doing the exercises right after the introduction of the topics. o
- Really good and informative course. I know where to start in order to use Triton, so it was very useful.
- Good introduction for someone who didn't really know anything about HPC before starting the course. o
- More information about GPU based computation
- it's great that the material is so easily accessible also after the course, to go through things at my own pace again oo
- Course was really helpful. It gave me a base to start with. o

One thing to be improved about today OR this course overall:
- The topics today were not deep and detailed enough.
- I didn't manage to finish the exercises + take a proper break (a couple more minutes of break would fix it)
- Some applications, maybe trying DL programs with the GPUs.
- Not sure if it could be improved, but at some point I got too tired and could not follow everything. But it's important that lots of important topics were covered and the resources and recordings are available for future reference. ooo
    - Also, there wasn't support for CSC on Zoom.
- I kind of lost the flow after the array job examples, wanted a bit slower transition between topics +2
- Today was a bit advanced due to a lack of advanced shell commands
- The tasks were a bit too advanced

Comments on the course format:
- Super responsive teachers on here and zoom were really helpful o
- Really good format with the streaming and the shared document for questions. ooooo
- The cat kept me focused in the lecture
- Really well organised and easy to follow. Nice amount of breaks as well.
- Include more pictures for explanation
- Live interaction with the instructors was very helpful and the exercises were nice
- Would be nice to have twice the amount of time for exercises
- There isn't enough time for the exercises

Comments on the course topics and schedule:
- Got too lost to do most of today's exercises, but the info presented was clear and useful oo
- I feel that the GPU topic deserves at least one whole afternoon, but I also understand that there's a time constraint. o
-

Any other comments?
- I really appreciate that the instructors took the time to explain the jargon, instead of just letting it fly around. o
- I liked the analogies because they helped to visualise/understand the topics better for someone not as familiar with computers and HPC.
- Maybe make a deeper course for GPUs? https://enccs.se/events/2023-06-gpu-programming/
    - For this one, the registration closed in May :(
        - :(
        - I'm sure we'll do it again
- Thank you very much for all of your effort and time! o
- The fact that the instructors were really nice contributed to the good course experience. Thanks for that! o
    - Thanks!
- Thanks for this online course!

Anyone watching with more than one person for one stream session?
- ..
- .

---

:::info
**This is the end of the document, please always write at the bottom, ABOVE THIS LINE ^^**

HackMD/HedgeDoc can feel slow if more than 100 participants are editing at the same time: If you do not need to write, please switch to "view mode" by clicking the eye icon on top left.
:::