---
tags: Kickstart
---

# ARCHIVE Intro to Scientific Computing and High Performance Computing - Summer 2022

:::danger
## Infos and important links

* To watch: https://www.twitch.tv/coderefinery
* To ask questions and interact (this document): https://hackmd.io/@AaltoSciComp/IntroSummer2022
  * *To write on this document, click on the :pencil: (Edit) icon in the top right corner and write at the bottom, above the ending line. If you experience lags, switch back to "view mode" ("eye" icon).*
* Matters that require a one-to-one chat with our helpers (e.g. an installation issue or non-Aalto-specific settings):
  - PLEASE USE THE ZOOM LINK SENT TO REGISTERED PARTICIPANTS IN THE INVITATION EMAIL
* Program: https://scicomp.aalto.fi/training/scip/kickstart-2022-summer/
* Prerequisites: a computer with an internet connection and with a shell terminal
  - Linux/Mac: you already have a bash terminal
  - Windows: we recommend https://gitforwindows.org/
  - Alternatively, if you have access to a university Linux server, you do not need to install anything: you can just open PowerShell and run something like `ssh username@machine.uni.fi`
* Remote SSH to a native Linux machine:
  - Aalto: https://scicomp.aalto.fi/aalto/remoteaccess/
  - HY: https://wiki.helsinki.fi/display/it4sci/Remote+access+to+University+resources
  - TUNI: https://www.tuni.fi/it-services/handbook/2724/3229
  - OULU: https://www.oulu.fi/ict/remote
* Suggestion: if you have just one screen (e.g.
a laptop), we recommend arranging your windows like this:

  ```
  ╔═══════════╗ ╔═════════════╗
  ║ WEB       ║ ║ TERMINAL    ║
  ║ BROWSER   ║ ║ WINDOW      ║
  ║ WINDOW    ║ ╚═════════════╝
  ║ WITH      ║ ╔═════════════╗
  ║ THE       ║ ║ BROWSER     ║
  ║ STREAM    ║ ║ W/HACKMD    ║
  ╚═══════════╝ ╚═════════════╝
  ```
* **Do not put names or identifying information on this page**
:::

# 7/June - Intro to Scientific Computing

*Ask anything, write always at the bottom.*
(*Please include your organization in the question, as there can be differences between the Aalto/Helsinki/Tampere/Oulu university clusters.*)

## Testing: asking questions via hackMD

- This is a question, isn't it?
  - Yes, it is! And this is a reply
  - I have a different reply for you
    - Nice, I am a nested comment!
- And here another question... or not?
  - check this out
  - ..
  - ..
- Just write on here to ask questions, then we can easily answer them, or someone who knows can answer. Feel free to do so if you know the answer to a question.

## Icebreaker

### What university are you from?

- Aalto: oooooooooooooooooooooooooooooooooooooooooo
- Helsinki: oooooooooooo
- Turku: o
- Tampere: o
- Jyväskylä: o
- Politecnico di Torino: o

### What kind of research do you do?

- wastewater analysis/modelling
- brain stuffs +4
- drug design
- drug combination prediction
- metabolic modelling
- materials research simulations
- Convolutional neural networks
- Flat band superconductivity
- neuroscience +3
- Nanomagnetism and Spintronics +1
- Machine learning +1
- Space & plasma physics
- Bayesian analysis and model selection
- inorganic chemistry
- materials research with machine learning
- Nuclear Materials engineering
- quantum spin liquids and neural networks
- whole brain modelling
- Linear algebra
- Particle physics
- Federated learning
- Mental Rotation
- Probabilistic data imputation
- Deep learning in combinatorial optimization
- Computational Fluid Dynamics
- Quantum Error Correction

---

### What do you expect from this course? What do you want to learn?
- best practices for high performance scientific computing
- parallel calculation for very big CFD models
- running large-scale R and Python experiments
- best practices about security, i.e. similar to non-root docker stuff
- .
- .
- .

### Any other initial questions?

- could you please test that everyone's mic is equally loud? Enrico's a bit quieter than others
- how about numbering the questions to address them more easily?
  - Questions will mostly be answered here in the hackMD and only rarely directly on stream. But yes, we can number them

## Introduction

https://scicomp.aalto.fi/training/kickstart/intro/

- in what way will the videos be available later?
  - They will be made available and the link sent out to registered participants, and they will be available on Twitch for 14 days.

## From data storage to your science

https://hackmd.io/@AaltoSciComp/SciCompIntro

- What is more important, data or computing?
  - Both are equally important. If your processing doesn't work properly, you get junk; if your data is junk, the same.
  - are they even comparable in this manner?
    - Depends on what you consider comparable. The problems you run into if one of them is wrong are different, but the invalidity of the end result holds true regardless of which was wrong.
- Enrico, twitch request for increased microphone (but I know you tried...)
  - +1. I can hear, but it's less clear compared to others. Headphones / max twitch volume / 70 Windows volume.
  - can you turn up your volume? Because I can hear him alright, albeit he could be louder.
  - I [name=rkdarst] tried to increase the gain on the broadcasting side, too.
- Are these lecture notes available somewhere?
  - See the link at the top of this section and on the course webpage.
    - this lecture: https://hackmd.io/@AaltoSciComp/SciCompIntro
    - all lectures: https://scicomp.aalto.fi/training/scip/kickstart-2022-summer/

## What is parallel computing?
An analogy with cooking 🍝

Presentation: https://docs.google.com/presentation/d/e/2PACX-1vQLTzWkRy7Du3jjPJ6Y9BqKczU_JcSTEL6XsndrNJ7ylzi4RWeEy8lhfWZQu_lpwbAKroh51qqLoPFG/pub

- the presentation link at https://scicomp.aalto.fi/training/scip/kickstart-2022-summer/#id1 does not work for clicking, but works if you copy-paste it
  - Refresh the page, I just noticed and fixed it. Thanks!
    - Yes, thanks, it works now.
- how much CPU and GPU resources are available for regular Aalto users at Aalto HPC?
  - Triton has ~3000 CPUs, 100-200 GPUs (I think).
    - It's more like ~9000 CPUs (https://scicomp.aalto.fi/triton/overview/)
  - The limit is unlimited, but you are usually limited by the queue. As the system is shared, other people will want to run their programs on it as well. Typically you can get maybe a few hundred CPUs or maybe 4-8 GPUs at a time, but it depends on your past usage. Maximum MPI job size is 16 nodes. [name=Simo]
  - Hardware overview: https://scicomp.aalto.fi/triton/overview/
    - 👍 (btw a table there overflows the background, probably because of {max-width: 800px} in the 'wy-nav-content' class)
      - yeah, it's too big. interesting, zoom in and mobile mode lets you scroll. webdev help welcome!
- maybe a question for the later days: What tools can be used to check what resources were used during a job?
  - We will be talking about job monitoring tomorrow: https://scicomp.aalto.fi/triton/tut/monitoring/ [name=Simo]
  - On Triton: `seff` will give you an overview of the memory and CPU efficiency of a finished job.
- How do you figure out which form of parallelism a code uses?
  - Is the question what kind of parallelisation you should use, or how to analyse existing code?
  - Read the docs, there is no other way; or go through the source code to see the operands
- For Thomas: please don't break pasta, it's very rude
  - But what if I only have a small pot
    - Think about the Italian pasta maker who spent a lot of time to craft that beautiful pasta ..
    - use a dedicated pasta type such as farfalle, rotelle etc.
- How does one figure out which form of parallelisation is the best for their task?
  - In general: think about whether your code has parts that are completely independent, or whether the code can be split (e.g. in a for loop) into things that have no interaction with each other. That commonly points to an "embarrassingly parallel" problem, which is simply running the same thing multiple times with different starting points. If different tasks need to talk with each other, you will have to use things like OpenMP or MPI.
- How to check if the task can be parallelized using a GPU, or only using a multithreaded CPU?
  - In theory most code can be pushed to GPUs, but the question is whether the effort of adapting it is too much.
  - Nowadays people rarely want to write their own low-level GPU code. There are many libraries that provide algorithms that run on GPUs. Googling for the specific algorithm and "GPU implementation" can help find the library you need. Many programs can also run on GPUs (many physics codes, machine learning codes etc.). Of course you can learn how to code for the GPU itself, but that usually requires knowledge of C++ & CUDA etc. [name=Simo]
- Using the Narvi HPC cluster at Tampere University, I have some problems using the GPU while tunneling to a Jupyter notebook on localhost. Will I learn in this course how to fix my problem?
  - We'll go over some general things, but you'll probably need to ask Tampere local specialists during the exercise sessions on days 2-3.
  - Yep, it might be quite tricky to get GPU resources for a Jupyter notebook, as we don't have any JupyterHub like e.g. Aalto does. You'll need to take an approach like [this](https://medium.com/@sankarshan7/how-to-run-jupyter-notebook-in-server-which-is-at-multi-hop-distance-a02bc8e78314)
- What is the difference between multithreading and multiprocessing?
  - The difference is a bit technical: processes are individual program executions, while threads are basically smaller program executions within a process. Processes can launch both subprocesses and threads. Threading is usually used in web servers and in data loading, where some parts of the program are waiting for outside data to load. Multiprocessing usually means that one process launches multiple copies of itself that then do some calculation.
- Are there any specific software requirements for GPU coding?
  - Probably yes; one can find them out by checking the specific library, such as CUDA.
  - That depends on how deep into GPU coding you want to go. Assuming you don't want to code hardware drivers, you will need some libraries that offer access to the GPUs, and commonly those frameworks have their own style that you are likely required to follow.
  - If you use Python, you can use libraries such as [CuPy](https://cupy.dev/) to do normal NumPy calculations on the GPU with minimal changes. There are plenty of R packages with GPU support, and Matlab has its own [gpuArray](https://uk.mathworks.com/help/parallel-computing/gpuarray.html) for doing array calculations on the GPU. Usually one can use GPUs without learning how to "code on a GPU". If you code C or C++, you can also use OpenMP to offload calculations to the [GPU](https://www.openmp.org/wp-content/uploads/2021-10-20-Webinar-OpenMP-Offload-Programming-Introduction.pdf) with small changes.
- Do you know any good resource to learn CUDA programming?
  - Maybe [this article](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) by NVIDIA. Even though that article is named "An Even Easier Introduction to CUDA", the complexity of the article really shows how complicated it is to do low-level computing with GPUs.
    - Thanks! I wanted to learn that for very hacky stuff
  - Many frameworks such as Numba have good [APIs](https://numba.pydata.org/numba-doc/latest/cuda/index.html) for writing extensions that are compiled and run on GPUs. Typically one uses some higher-level framework suitable to the task and then extends it via CUDA kernels that run on GPUs.

## Break until xx:06, then "Humans of scientific computing"

## Humans of Scientific Computing

This is a freeform discussion. Please actively ask questions:

- Do you think that what you learned in academia can be reused outside in the real world? What is that thing?
  - For me (Enrico), version control (git) and continuous integration + testing are the best skills that can immediately be reused outside academia. Of course also all the critical thinking + scientific-method approach etc. Oh, also Linux + the shell terminal, very useful outside academia!!
  - how do you usually do CI and testing with research code? Since it often feels like you're still in exploratory mode right until you're publishing.
    - I learned CI from CodeRefinery :) https://coderefinery.github.io/automation/ But it's true it is more useful if you are maintaining a toolbox/package/reusable code that you re-use in more studies. I never test one-off scripts...
- How long did it take for you to get the hang of scientific computing?
  - 20+ years and I'm still learning :) Most of it was self-learning. I can recommend this self-learning course: https://handsonscicomp.readthedocs.io/en/latest/ (and you can get credits in Finland via FITech: https://fitech.io/en/studies/hands-on-scientific-computing/ )
    - Great, thank you :D
- Links
  - Postdocs in Companies (Finland): https://www.podoco.fi/
  - Nordic Research Software Engineers: https://nordic-rse.org/
  - I [name=rkdarst] want to run an "academic skills in companies/the real world" kind of session at the autumn unconference.
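To make the CI/testing discussion above concrete, here is a minimal, hypothetical sketch of what a testable piece of reusable research code can look like. The function, file names, and numbers are made up for illustration and are not from the course material; with pytest installed, running `pytest` would discover and run `test_moving_average` automatically.

```python
# analysis.py -- a hypothetical, reusable research function
def moving_average(xs, window):
    """Return the moving average of xs over the given window size."""
    if window < 1 or window > len(xs):
        raise ValueError("window must be between 1 and len(xs)")
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]


# test_analysis.py -- pytest collects functions named test_*
def test_moving_average():
    # averages of [1,2], [2,3], [3,4]
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
```

A CI service (e.g. GitHub Actions, as in the CodeRefinery automation lesson linked above) can then run the test suite on every push, so regressions in code you reuse across studies are caught automatically.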
- Might be too general a question: I am fairly ok with math and making the formulas for my research project, but coding them is dreadful to say the least. I would procrastinate on it, mostly because I don't know where to start etc. Any tips for people in this kind of situation?
  - Write a function that takes some input and applies the formula? Many programming courses use this as a learning example.
  - Hang out with people that do things better. Join and watch various relevant communities. Most of this knowledge is passed person-to-person instead of by courses. If at Aalto, come to our daily garage often and see our tips.
  - Personally, coding is very connected with a mathematical way of thinking. I usually use coding as a way to describe my thought process about the problem, from the most general things down to lower levels. As long as you express WHAT you need to do, you can worry less about HOW to do it, because it is usually already done in various libraries or on StackOverflow.
  - I usually try to find a framework or popular project that implements most of the ugly stuff (data loading, logging etc.) and that can be easily extended to do the stuff that actually needs to be done [name=Simo].
  - Maybe it's not the programming that you struggle with, but the broader development process. If you don't use the right tools, it can be *very* frustrating. The CodeRefinery course (see below) is good there.
  - But also, people do different things, don't feel bad if it's not your thing!
- How could we learn better coding practices even in academia, for example by applying some tools/practices from industry? Do you have some sources or workshops? Is there any place where we could ask questions or get feedback?
  - CodeRefinery/Software Carpentry could be a starting point.
  - Next CodeRefinery in September (also live from Otaniemi): https://coderefinery.org/workshops/upcoming/
  - You could also check [The Good Research Code Handbook](https://goodresearch.dev) which is a very easy first read.
- where is the garage?
  - https://scicomp.aalto.fi/help/garage/ - online every day at 13:00 for Aalto researchers.
  - HY has its own one (link will come later)

## How do you actually install software on the cluster?

https://scicomp.aalto.fi/triton/apps/python-conda/

- **No need to type along.** This is a demo; the page above essentially has all the commands that you will need and a lot of additional information.
- For Aalto, application-specific information can be found here:
  - https://scicomp.aalto.fi/triton/#applications
- does everyone logging into Aalto HPC get their own home directory?
  - Any account on Triton has a home and a work directory. The home directory has 20GB of storage, which is a pretty hard cap; the work directory has 200GB, and can easily be increased if you need more.
- The problem Simo just had.. was it because there was already an environment with the same name, or because the specific path already had a conda environment?
  - Because there was already an environment with the same name
- Are there text editors other than nano available on Triton?
  - Yes, some common ones, and we can install more - we use nano for demos because it is simple and available by default on many Linux distributions
- What are the possible/preferred visual remote desktop connection clients on Triton? X2Go, XRDP, other?
  - Currently there are a few different ways to connect to Triton. For some common tools, there are Open OnDemand installations for the particular tool. But in general the suggestion is to use shell/ssh. Triton itself does not have a remote desktop installation as such.
  - Short answer is "no": you can connect via https://vdi.aalto.fi and from there open a terminal with "ssh -X triton.aalto.fi". We are testing remote desktop via "Open OnDemand" but it's not ready.
  - https://jupyter.triton.aalto.fi is a somewhat graphical interface.
- There was a suggestion to install the libraries to your work directory instead of home dir.. What is essentially my work directory?
The one where the conda environment is set up/installed?
  - You have two folders on Triton: /home/username (HOME) and /scratch/work/username (WORK). If not changed, tools like conda, R, or other programming languages tend to install their packages into home, which is limited in size.
  - We'll go into this more on day 2
- I have already installed Anaconda on Windows. On the Anaconda prompt I installed environments with the following commands:
  ```
  (base) C:\Users\user>conda create --name env_name
  (base) C:\Windows\system32>conda activate env_name
  (env_name) C:\Windows\system32>conda install scipy
  ```
  Please tell me if I did this right.
  - It's not wrong, but it's not best practice. The problem with installing after creating an environment is that it gets difficult to keep track of what's in the environment. So keeping your requirements in an environment.yml file would be the better option. And yes, this might require redoing the environment, but it will avoid waiting for resolution of new packages every so often.
  - You can create environments from the command line, but as mentioned above and [in the documentation](https://scicomp.aalto.fi/triton/apps/python-conda/#installing-new-packages-into-an-environment) there is a risk that you cannot replicate the environment if you want to use the same environment on e.g. an HPC cluster. Using environment.yml files is a good idea so that you can always get a working environment with the same packages. But as long as it works for you, everything is ok. [name=Simo]

## Break until xx:00

:::info
* After the break, we have a 1-hour demo of SSH advanced tips and tricks. This is optional but useful (good reference material for later).
* One hour from then, we do the "how to connect to the cluster" part. This is needed for Day 2.
* Choose which of these you come to.
:::

## SSH (Secure Shell) tips and tricks

https://scicomp.aalto.fi/scicomp/ssh/

- https://scicomp.aalto.fi/scicomp/ssh/#first-time-login says "compare at this **link**" - there is no link there
  - Thanks, I am updating it.
- Can I try this both on Windows and then on Linux? (Does it somehow get tricky if I connect from my Windows PC and then from VDI Aalto Ubuntu right after?)
  - Both work at the same time or separately, no worries on the Triton side. (Make as many connections as you want.)
- what network should we use? aalto/aalto open/eduroam?
  - `aalto`, if you have an Aalto laptop, lets you directly connect to Triton without the proxy jump. Otherwise, it doesn't matter. (And you'll see how to make the ProxyJump happen automatically, so that it matters even less.)
- can we use kitty to connect as well?
  - Yes. Any ssh client works.
- Doesn't some of the convenience get lost after adding a passphrase? I.e. isn't it then the same as just using Aalto username + password?
  - The "agent" can remember the decrypted key locally, so you only have to enter the passphrase once every time you reboot the computer. Many computers can handle this fully automatically.
  - While this isn't about convenience, it is also safer since you aren't sending the password to other computers.
- I accidentally had a typo in the name of the SSH key when generating it (did not yet add it to the server), so how can I delete the key with the wrong name? (on Windows cmd)
  - The file is in the folder you indicated when generating it. You can simply delete the files.
- ssh-add gives "no such file or directory". The file is created, it is visible in Explorer. Windows 10, both admin PowerShell and admin cmd. UPD: never mind, I did not run `ssh-agent` before
  - Did you create the key with the given name?
    - yes
  - Use cmd. PowerShell might have different names for the environment variables.
    - same issue on cmd
- Why is this more secure than just a password? Is this strictly necessary?
  - The security is in that you do not necessarily need to type your password (which is like a master key to everything). The real benefit of using ssh keys is that you can store the ssh key in the agent, which means you do not need to type any passwords across multiple connections. An additional security benefit is that you can revoke an ssh key's access if you lose the key; with a password, you need to change the password.
- I ran into a strange issue:
  ```
  [user@login3]~% ssh-copy-id -i ~/.ssh/id_rsa_triton.pub user@triton.aalto.fi
  /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa_triton.pub"
  The authenticity of host 'triton.aalto.fi (130.xxx.xxx.xxx)' can't be established.
  ECDSA key fingerprint is SHA256:0..........
  ECDSA key fingerprint is MD5:a.................
  Are you sure you want to continue connecting (yes/no)?
  ```
  - I think this is because you're doing the ssh-copy-id command from the login node. You're basically copying from Triton to Triton instead of from your machine to Triton.
- doesn't the information from the stream breach security? (visible keys)
  - They are showing public keys, not private keys. One cannot (at least not with current hardware) recreate the private key from the public key. If they showed private keys, I'm pretty certain they would delete the keys quite quickly. You should never show your private keys to anyone. [name=Simo]
- is `kosh` used instead of the Aalto VPN when connecting from off-campus?
  - Correct. Kosh is one of Aalto's shell servers, so it is available to the public internet. One needs to use it as a proxy host in order to get to the Aalto network.
  - Is it preferred to use the Kosh proxy over using the Aalto VPN?
- Is the ssh key pair necessary if I am using Triton only a few times during the summer? Is it okay to then log in with the password even if it is not as secure?
  - You can use passwords, but it gets tiresome quickly.
However, it is a good idea to learn about ssh keys, as they are used for authentication by version control systems such as version.aalto.fi and GitHub. There, passwords will not work.
- I am from the University of Helsinki. I connect to the server using `ssh turso.cs.helsinki.fi`. Then I think it is successful; my prompt changed to `myusername@turso02:~$`. What should I do next?
  - This means you are connected to Helsinki. The command prompt change indicates that you are on turso02, which I assume is the Helsinki university cluster's login server. What to do next, you will probably best ask in the Zoom session afterwards. But essentially it means you are set up.
    - Thanks. Do I need to connect to the Aalto network? I will join the Zoom session.
- font size slightly larger?
- An issue when connecting from off-campus Windows to `kosh`:
  ```
  C:\WINDOWS\system32>type C:\Users\USER\.ssh\id_rsa_triton.pub | ssh MYLOGIN@kosh.aalto.fi "cat >> .ssh/authorized_keys"
  MYLOGIN@kosh.aalto.fi's password:
  zsh:1: no such file or directory: .ssh/authorized_keys
  ```
  - This might be because you have never connected to kosh before and it does not have a .ssh folder. If you do `ssh MYLOGIN@kosh.aalto.fi "ls .ssh"`, does it give a similar error? You might want to try connecting to kosh with `ssh MYLOGIN@kosh.aalto.fi` and then doing `ssh MYLOGIN@triton.aalto.fi` from kosh. By accepting the fingerprint you should get the folder.
    - worked, thanks!
- I am able to log into Triton, but I am not actually able to create directories (error msg: mkdir: cannot create directory ‘testi’: Permission denied) or run python (even though I have the proper modules installed and I am in the work directory). If it helps, my account got Triton access only a few hours ago.
  - we will discuss that later in the connection Zoom sessions and on the next days.
- What would be the best way to save the ssh passphrase? How strong should a "strong password" be here?
  - For me it is just another password for each key.
So basic rules apply: more than 10 characters, special characters, numbers, small and big letters, and something that you can easily remember. [name=Simo]
  - https://xkcd.com/936/
- Can I somehow remove an ssh key from Triton? I somehow managed to not save the ssh passphrase to my password manager.
  - You can just remove the line from `.ssh/authorized_keys` that describes the public key. This will revoke the ssh access. For example `nano .ssh/authorized_keys` and then remove that line.
- config file done, public key copied to `kosh` and `triton_via_kosh`, `ssh kosh` works fine, `ProxyJump kosh` is present in the `config` file as suggested in the guide.
  ```
  C:\WINDOWS\system32>ssh triton
  ssh: connect to host triton port 22: Connection timed out
  ```
  - You should use `ssh triton_via_kosh` if that is the host name you configured.
    - heck, that's true, you're right. works fine with this.
      - Nice!
- On macOS, when trying to add the private key to the ssh-agent, I get this error: `ssh-add --apple-use-keychain ~/.ssh/id_rsa_triton` → `unknown option -- -`
  - We'll have a look at this in the connection session.
  - Maybe if you copied the command, the Mac might have changed the dashes to the long dashes ("ajatusviiva") that nobody uses but are grammatically correct. If you try typing the dashes yourself, maybe that works.
  - Had the same problem; this (might) work with OS versions older than 12.0 Monterey: `ssh-add -K ~/.ssh/[your-private-key]`
    - Good to know.
    - I have the latest version, just updated this morning.
  - What does `sw_vers` return if you call it?
    - ProductVersion: 12.4, BuildVersion: 21F79.

## Break until xx:05

:::info
Then, connecting to Triton.
This summarizes ssh and gives the bare minimum to get by tomorrow.
* https://scicomp.aalto.fi/triton/tut/connecting/
* Remember to check out the "shell crash course" under "Preparation" on the course page, if that is new to you: https://scicomp.aalto.fi/scicomp/shell/
:::

## Connecting to the cluster

* Aalto instructions: https://scicomp.aalto.fi/triton/tut/connecting/

## Feedback, day 1

One good thing about today:

- I love the portrait mode of the stream. Really helpful and accessible +1
- The hands-on approach that was easy to follow
- Very encouraging attitude of the teachers +1
- Overall you guys seem to know what you're doing when it comes to hosting this kind of thing. Also liked the way questions were presented and discussed.
- good to know that I'm not supposed to know everything +1
- Was really comforting to hear that it takes time to learn it all
- the introduction was very well presented and clear, you guys also hosted this thing very well
- I really liked the metaphors and analogies for explaining the basics of when/how to use a supercomputer. Made the ideas very stickable +1
- General management of time was good. Felt like there wasn't any time wasted +1
- Right amount of breaks +1
- I think it was interesting to hear about your backgrounds a bit and to discuss academia and scientific computing more generally. This also helped to stay focused and motivated to learn more somehow. +1

One thing to improve:

- It would be cool to see the history of the commands like in the BASH introductory video (https://www.youtube.com/watch?v=56p6xX0aToI) +1
  - +1. We'll be better the next days!
- Please make sure that everyone's microphones are loud enough; it was sometimes difficult to hear when someone else talked
  - thanks, we can try to do better, but it is hard for us to know. Mention it as soon as you hear the problem and whose volume is too high/low.
- .

Any other comments:

- Keep on doing what you're doing!
- Very nice work!
Thank you very much
- Thank you, this was very informative!
- Thomas, please get a bigger pot when you cook spaghetti 😂
  - But... But... But... Well... I guess... Ok :smile:

---

# 9/June - Intro to High Performance Computing

*Ask anything, write always at the bottom.*
(*Please include your organization in the question, as there can be differences between the Aalto/Helsinki/Tampere/Oulu university clusters.*)

**The hackMD notes from the first day are stored here: https://hackmd.io/@AaltoSciComp/ArchiveIntroSummer2022**

## Testing: asking questions via hackMD

- This is a question, isn't it?
  - Yes, it is! And this is a reply
  - I have a different reply for you
    - Nice, I am a nested comment!
- And here another question... or not?
  - check this out
  - ..
  - ..
- Just write on here to ask questions, then we can easily answer them, or someone who knows can answer. Feel free to do so if you know the answer to a question.

## Icebreakers

1) What programs do you need to run for your work?
- A mix of Python, R, and Matlab.
- Neuroimaging tools like FSL and AFNI
- LAMMPS, Ovito
- Python, Julia
- Python, R
- Ovito, Python, bash
- Python, neural network frameworks, PyTorch
- Matlab

2) What uni + department are you in?
- Aalto, NBE +3
- Aalto, CS +2
- Aalto, PHYS +6
- Aalto, SCI +2
- Aalto, ELEC +2
- Aalto, CHEM
- Helsinki, CS +2
- Helsinki, Phys +3
- Helsinki, Chem +1

3) Are you connected to a cluster?
- yes: oooooooooooooooooooo00ooo
- no: o
- ...

4) 1st day was: (put an "o" to mark your answer)
- useful: oooo0oooooöooooo0oo
- stuff I already knew: ooo
- too slow: oo
- too fast: o0

5) Do you like this livestream course format?
- good: ooooooooooooo0oooo
- ok: ooo
- not preferred:
- bad:

6) Any more questions or comments on the material from the 1st day?
- ...
- ..
- .

There is a type-along session for Uni of Helsinki in the Zoom Break Out Room 1. There are many little differences in the examples between Helsinki and Aalto, so in BoR-1 I'll try to emphasize those. Please come!
The daily garage for Uni of Helsinki: https://wiki.helsinki.fi/display/it4sci/HPC+Garage
The HPC User Guide of Helsinki: https://wiki.helsinki.fi/display/it4sci/HPC+Environment+User+Guide

## About clusters and your work

https://scicomp.aalto.fi/triton/tut/intro/

- TODO: add the chat link
- also link directly to the other help page, /help/

### Real example 1: array jobs

- For those that are interested, the program is this: https://github.com/p-lambda/jukemir/ . Demo code is not yet shared [name=Simo]
- Does running on the CPU cluster also reduce one's priority for submitting jobs? I heard that CPU clusters are mostly free compared to GPU ones.
  - Yes, it reduces priority, but since there are enough CPUs it's not an issue for CPU jobs.
  - for GPU jobs, it is a problem that we are working on (running CPU jobs reduces GPU priority) - unfortunately Slurm is not good at this.
  - If you notice this affecting you, contact us.
- Should testing/debugging code be handled differently, or from the user's perspective is it the same as submitting jobs? For example, if we would like to quickly test and run some script.
  - Testing and developing on a cluster can be very challenging. Ideally you can work with "interactive sessions" to develop/debug, and submit non-interactive jobs once you know your code is doing what it is supposed to do. Interactive sessions are also jobs.
- whenever the load on Triton from users is low, is it used for some distributed science projects, e.g. Folding@home?
  - We used to be connected to the Finngrid (?) so that other universities in Finland could request Triton's CPUs when they were idle, but I am unsure if that project is still alive right now. I will check.
  - No, Triton at the moment is fully dedicated to Aalto users only. But we have no issues with idling CPU cycles; there are always jobs in the queue.
- Could you explain a bit which params for the bash script must be included (and which are recommended)?
For example, memory or time may be hard to guess when specifying? Is it recommended to specify memory, or will the necessary memory be allocated dynamically by default? - Exactly, you will see in the examples that you can request an amount of RAM, a number of CPUs or GPUs, specify which type of GPU, etc. - Please write at the bottom as we cannot monitor the whole hackmd. If a parameter is not specified, the default will apply. Defaults are small (15 minutes of time, 1 CPU, 500M of RAM) - As for memory: at least Triton is quite permissive with memory, so you can go over your memory requirement as long as there is free memory on the node, but your job can be killed if there is none left. - I've got $ mpirun - bash: mpirun: command not found - is that how it should be? - Answered on stream: don't worry for now :) ## Real example 2: MPI https://zqex.dk/index.php/method/lammps-demo - What was the "spider" command used for? - We'll talk about modules later. They are a way to share centrally installed software. The command was `module spider`. [name=Simo] - Why are MPI runs at HY clusters done with 'srun --mpi=pmix_v3' instead of 'mpirun'? The OS is RHEL. - We'll talk about srun later when we talk about serial and parallel jobs. The `--mpi=pmix_v3` tells how the MPI should communicate with SLURM. It depends on the cluster. [name=Simo] - I've run programs using sbatch, what is the difference between that and mpirun? - sbatch submits a script to the queue. `mpirun` runs a program with MPI. We'll talk about sbatch today and `mpirun` tomorrow. - sbatch + name of script with instructions for slurm - srun/mpirun + command that you want to run ## Interactive jobs https://scicomp.aalto.fi/triton/tut/interactive/ - Which directory do I need to be in to run this? - your home folder is fine. The python one-liner is going to return the name of the remote computer you are connected to (without srun). On Aalto's Triton, the name is login3.triton.aalto.fi.
When `srun` goes in front, the python one-liner command is sent to some node in the cluster. - What is the benefit of an interactive session over just running the 'run' command on the login node and requesting resources? - on the login node everyone is there with you, and if you don't request an interactive session your commands are run on the login node. Sometimes people forget this and start using all the RAM and CPUs of the login node, blocking everyone else. With "srun --pty bash" you get a similar terminal but on a dedicated node, and what you do there won't affect others. It is called interactive because you are still on a terminal typing commands interactively. If instead you have a script that does not require interaction, you can submit it like in the example `srun python3 -c ....` - could you show a quick example of an interactive shell with graphics? ### Exercises: interactive jobs :::success #### Break until xx:05, Exercises until xx:30 https://scicomp.aalto.fi/triton/tut/interactive/#exercises Try to run most of these yourself. They are simple tasks but let you explore how you set parameters and check jobs. We will connect this to serial jobs next. ::: - It is good to do the exercise in your WORK folder. `cd $WRKDIR` e.g. on Triton it will be `/scratch/work/USERNAME/` - If you are stuck with the exercise, join zoom! - If you are from HY, it is worth joining the zoom since there might be differences. - I'm on Windows 10 `cmd`, connected to triton via kosh. Some keys like Delete, Home, End or key combinations like Shift+Insert don't work, even though they work in cmd itself. Same on PowerShell. - There are a few keys that don't work for me either, on Ubuntu. - Could you try PowerShell? - Same. When I press Shift+Insert, ;2~ is printed in the terminal and the error sound is emitted - What should Shift+Insert do? - It's an alternative command for Paste, like Ctrl+V. - Ah ok, I use right click on top of the terminal window. Not sure if it's standard or something I set up some time ago.
- Haven't tested this but check this out https://superuser.com/questions/16313/keyboard-shortcut-to-paste-in-windows-command-prompt - Does right-clicking on top of the terminal window work for you? - Yes. Seems like the behaviour is now overridden by Bash. Is there any guide on simple Bash things like selecting part of a command, copy, paste? - They are OS specific. On Aalto Windows 10, I know that when I select, it automatically copies. - If you run `echo $SHELL` do you get `zsh`? The issue is most likely due to zsh not working correctly. You can change your shell by following [these instructions](https://scicomp.aalto.fi/triton/usage/faq/#command-line-interface) - Yes, I do get `zsh` - `zsh` is (for some reason) the default shell in Aalto. We usually recommend changing to `bash` if you do not need some `zsh`-specific features. The change takes 15-30 minutes to propagate everywhere. - What was the command to view the queue status in a separate window? - `slurm watch queue`. CTRL+C to quit. - Should we do the exercises on the login node or in an interactive shell? - You can run these from the login node - What do the "extern" and "0" lines mean? - Where exactly are you encountering those lines? Could you post an example? - In the output of `slurm history` :::danger If you have issues with the exercise, join the zoom, raise your hand there or just start talking. ::: - Exercise 1.c How much memory can you use before the job fails? - 542M - write your answer here - 2054M +2 - 2000M +1 - 2G - 1750M - 2200M - In exercise 2, where should I type seff JOBID? - on a terminal where you are logged in to the cluster, it doesn't have to be the login node - How is it going with the exercises? - would like to have 5 extra mins - The Helsinki computer doesn't have the `slurm` command. - `sacct --long` or `sacct --long | less` options, maybe? `q` to quit from `less`. - What does --time do? - for `srun` / `sbatch`, it says "Slurm will reserve that much time for the job".
If it goes over time, it will get killed. - This is used so Slurm can efficiently schedule thousands of jobs, by knowing when each job should end. - ^^^When I do 'srun --mem=500M --time=60 time python hpc-examples/slurm/pi.py --threads=5 500000000' it doesn't kill the job after 60 seconds? Why is this? - Isn't 500000000 == 500M? It might also be that the oom-killer gives you leeway as no one is using the memory. (Also note that `--time=60` means 60 minutes, not 60 seconds; Slurm's default time unit is minutes.) :::info I've managed to do exercise 1: yes: ooooooooooooooooooooooo no: oo didn't try: o ::: :::info I've managed to do exercise 2: yes: oooooooooooooooooooo no: oo didn't try: o ::: :::info I've managed to do exercise 3: yes: oooooooooooooooooo no: o didn't try: oo ::: :::info I've managed to do exercise 4: yes: ooooooooooooo no: oo didn't try: ooooo ::: :::info I've managed to do exercise 5: yes: oooooo no: oo didn't try: ooooooooo ::: - Why doesn't the HY cluster have the 'slurm' commands? i.e. slurm history - "slurm" is a script/utility tool that was developed at Aalto, and I don't think it is widely distributed. Not the slurm scheduler, but the `slurm` command for users of a slurm server. - It's on github, ask them to install it for you. - Can you point to the repository? - https://github.com/jabl/slurm_tool, you can probably just download it to your personal bin folder (~/bin) and use it. - What exactly was I supposed to get as a result from exercise 2? What IS the relation between TotalCPUTime and WallTime? ## Break until xx:00 ## Serial jobs https://scicomp.aalto.fi/triton/tut/serial/ - Is this tutorial type-along? - no, it will be fast and you can do it in 5 min - - Where is the output file saved? - same directory where you submit, or whatever you specify if you give a full path (or relative path) - - What do the "batch" "extern" "0" mean in the JobID column when running slurm q?
- "batch" is the batch script itself - "extern" is "all other tasks that aren't included in some other line" - "0" is the first job step ### Exercises until xx:40 :::info https://scicomp.aalto.fi/triton/tut/serial/#exercises Try at least the three non-advanced ones ::: :::info I am: - done: oooooo - not done: oo - not trying: oo ::: :::info I've managed to do exercise 1: yes: oooooøoөoo no: ooo didn't try: o ::: :::info I've managed to do exercise 2: yes: ooooooꙩoo no: oo didn't try: o ::: :::info I've managed to do exercise 3: yes: ooõooѻoóo no: oo didn't try: o ::: Questions on exercises: - What if I write a batch job without specifying the values: ``` #SBATCH --time=... #SBATCH --mem-per-cpu=... #SBATCH --output= ... ``` - You would use the system defaults. Usually it is good idea to write what you need, though. - In the third exercise, do we need to make the loop within srun? like `srun for i in $(seq 30); do ...`? this gives a syntax error near unexpected token `do` - Without `srun`. The `date`-statement in the for-loop is the program we want to execute. The for-loop is just used to execute this program multiple times. You can either have `srun` for the `date`-command or not. You can use `slurm history` to see the difference. - Are there any advantages in using "sbatch" against "screen" + "srun"? - When you're using "screen" + "srun" your jobs can crash if the login node crashes or runs out of memory. - Array jobs, mpi jobs etc. do not work that well without sbatch as well - If you have mental capicity to manage, OK I guess. Still see above - When you do many things at once, you need sbatch - Writing a script (= that you pass to sbatch) makes your workflow reproducible. In X months when you need to run things again, everything is in the script that you submitted. If you work interactively you also need to keep track of all the interactive stuff you did which often leads to non-reproducible workflow. 
- What was meant when saying "since this is a trivial program, we are not adding srun in front"? - Unless you are doing MPI jobs (passing messages around the "cooks" in different kitchens), srun is not really needed; however, the commands that have srun in front get their own entry in `slurm history`, so if they are heavy you can monitor the resources for that part of the script. - So basically, the only advantage of using srun in batch jobs is the easy-to-follow history we obtain? - it is needed for MPI jobs, but basically yes - Is it possible to kill a job with the job name? ("scancel JOB-NAME") - https://slurm.schedmd.com/scancel.html - `-n, --jobname=job_name, --name=job_name` - seems possible, but yes, job names aren't unique so it might kill multiple things. Maybe useful? - How do I write that for-loop shell script and run it with srun? (I tried adding srun before the script and got a bunch of errors) - you need srun in front of individual commands (`date`), not the `for`. - `tail -f filename` is not advised if Lustre file locking is enforced and there is a chance that you write, remove, and read the same files concurrently. This will cause issues and major locking contention. - With srun before date in the for loop, it works fine for five iterations, but after that it starts printing "srun: error: Unable to create step for job JOBID: Job/step already completing or completed". What could be causing this? - The slurm controller might be under heavy load if the job steps take too little time to finish. We'll investigate further. Should not happen with longer jobs, though. [name=Simo] ## Break until xx:05, then monitoring ## Monitoring https://scicomp.aalto.fi/triton/tut/monitoring/ - Any commands similar to 'seff' for GPUs? - unfortunately none that we know of. This Aalto thing with the sacct comment field is the best we have. - Do any other sites have something better?
- We have the same comment-field thing as Aalto at Tampere, with a modified seff to read it, if it contains something. - Apparently the sacct format flag is non-standard, or is that dependent on the slurm version? ## Monitoring exercises :::info **Until xx:30** https://scicomp.aalto.fi/triton/tut/monitoring/ Try what you can in 15 minutes, mainly exercise #1 ::: Questions on exercises: - How do I use seff to check the performance of individual job steps? (job steps are inside one script) - seff JOBID.JOBSTEP, where JOBSTEP is the `0`, `1`, `batch`, etc. - . - How do you print the output of the python file through a slurm script? - what do you mean? By default, any output goes to the Slurm output files. (ooooh ok, I thought it would print it out to the terminal) - I get `srun: error: Unable to create step for job 5295495: More processors requested than permitted` when doing exercise 2 - I guess there were too many `srun`s too fast. Maybe the controller is overloaded or something; see above for the same problem. This is new to us. - I get the same error when trying to run Multiple Thread (Monitoring-2). :::info I've managed to do exercise 1: yes: ooo no: oo didn't try: o ::: ## Software modules https://scicomp.aalto.fi/triton/tut/modules/ - maybe unrelated, but how can I get the command history that is shown here? (I mean to track my own commands) - that is github.com/rkdarst/prompt-log, but this is just for demos. Your shell already tracks this in .bash_history in your home directory. - more comments later, you may need to set it up well if you want to see it right away. - or type `history` to see recent ones in this shell - . - . ## Data storage https://scicomp.aalto.fi/triton/tut/storage/ - In practice, when should you use the home directory, and when should you use the work directory? - The home directory is mainly useful for configuration files & ssh keys. ## Remote access to data https://scicomp.aalto.fi/triton/tut/remotedata/ - . - . - .
## Feedback, day 2 Today was (multi-choice): - too fast: ooooooo - too slow: o - just right: oooooooo - useful: ooooooooooooooo - not useful: o - livestream format was good: ooooooooooo - I would prefer in-person attendance: oo - I would prefer more hands-on for the hard stuff: oooooo - I would prefer more discussion: oo One good thing: - Liked that you kept the metaphor game strong! - Returning to the cooking metaphor today made it make more sense than initially on Monday. +1 - thank you for the quick answers in zoom! - The ratio of demo/exercises/breaks was just right. +4 - I liked that you went through some of the exercises after we tried them ourselves +11 - your cat :) +5 One thing that could be improved upon: - screen sharing was a bit messy at times, would be easier to follow if the console was a little larger - indeed, sometimes it was a bit hard to know where to look, probably because of all the things going on on the screen - . When should this course be run (if available online all the time): - june: ooooo - january and june: - only online is enough: # 10/June - Parallelization on High Performance Computing *Ask anything, always write at the bottom* (*please include your organization in the question, as there can be differences between the Aalto/Helsinki/Tampere/Oulu university clusters.*) **The hackMD notes from the previous days are stored here https://hackmd.io/@AaltoSciComp/ArchiveIntroSummer2022** --- ## Icebreaker: Do you think your work could be parallelized? Could it be run side by side or with multiple processors? - In my case I process individual subjects independently so parallelization can happen across subjects - I can run simulations of my structure with different parameters (e.g. width of the structure, frequency), and a single simulation runs on a GPU by itself - ... - ...
Yesterday was: - useful: ooooooooo - not useful: o - too fast: oooooo - too slow: o - just right: ooo - I would recommend this course to others: ooooooo What type of activity do you prefer that we do more of (multi-answer): - primary discussion/lecture: oooo - demo (without type-along expectation): oooooooooooo - type-along in main room: ooooo - exercises (independent work): ooooooooo - going over the exercises in main room: ooooooooooooooo - Q&A via hackmd: oo - homework (do exercises yourself after class to save time): oooo How should this course be made better? If you don't like the format of this course, we really want to hear! Consider joining the zoom feedback session afterwards, or write your comments here in the icebreaker feedback. - The cat should make more appearances :D +5 - . - . Other comments: - How do I know which software I am able to run? If it is not available, how can I request that it be installed on HPC? - In the case of UH, you can make a feature request (ready templates are available) on [GitLab](https://version.helsinki.fi/it-for-science/hpc/-/issues?sort=updated_desc&state=opened) - This applies to common software, but not to very specific user applications. In those cases [users can create modules themselves](https://wiki.helsinki.fi/display/it4sci/Module+System#ModuleSystem-1.2Creatingmodulesforyourownsoftware). - You can also use containers at UH; Singularity is readily available on the compute nodes. - In Aalto, check: https://scicomp.aalto.fi/triton/help/ - And if you have heard of Docker: Docker is not an option on HPC, but there is Singularity, which is basically the same. You can build your container image on a machine where you have root access, and then port it to HPC and run it with any software you need inside.
A short guide for Aalto Triton is available at https://scicomp.aalto.fi/triton/usage/singularity/ --- ## Simple parallelization with array jobs https://scicomp.aalto.fi/triton/tut/array/ - Do array jobs run in parallel on different cores? I don't understand what "embarrassingly" means. Can you show how we give different parameters to the array jobs (there was an example where a different dataset can be selected for some of them)? - Array jobs could be described as embarrassingly parallel. E.g., there is no communication between the tasks. Each task is an independent process that has no communication with the other processes of the array. - Consider array jobs as a bunch of exactly the same binaries running with exactly the same requirements with respect to memory, CPU count, GPUs, etc., that only differ in the binary's input parameters - Can I kill all of the array processes at once with the parent jobid? - Each of the jobs in the array is independent. - Yes, scancel allows you to kill either all jobs at once or just a single one - Can you somehow "synchronize"/manipulate the order of these parallel array jobs? - You can control the job execution order. Slurm supports functionality such as requiring job X to complete with exit code 0 before starting the next job in the array. - See https://slurm.schedmd.com/job_array.html#dependencies - The array job tasks will be run in parallel and there is no way to say that SLURM must run task xxxx_2 only when xxxx_1 is done or the like; the dependency=... is applied to the job id (i.e. the master job in the case of an array job). Example: 'sbatch --dependency=afterok:JOBID --array [1-10] array-job.sh' - all ten array tasks will run only after JOBID has finished successfully, where JOBID is the id of any other job launched before this submission - How are the names of the output files constructed, i.e. what does #SBATCH --output=array_example_%A_%a.out mean?
- %A will be replaced by the value of SLURM_ARRAY_JOB_ID, %a will be replaced by the value of SLURM_ARRAY_TASK_ID - Reference here: https://slurm.schedmd.com/job_array.html#file_names - How do I change the input data, if I have different datasets, with the array jobs? - Here you have options: parameters can be in different files/directories, which is what most people do; or you can use a file with parameters one per line; or simply use the array job task id as an input. Array tasks differ by the task's number (an integer index) saved in the environment variable $SLURM_ARRAY_TASK_ID - You could also have an array with the settings and your array job indicates which parameter set from that array is used. - See the course page a bit after the part discussed now https://scicomp.aalto.fi/triton/tut/array/#reading-input-files - What's the difference between variables starting with '$' and '%'? - $... is a variable in BASH, while %... is a SLURM-specific parameter. % is used in the output file names etc., described in 'man sbatch' - Keep in mind that a SLURM batch script is nothing other than a BASH script with SLURM directives of the kind #SBATCH ... - Does the output data have to be in the .out format? - I think that is just a naming convention. The running software outputs whatever it wants in whatever format. - .out is not really a "format", it's just a common file extension for output files, which can essentially contain anything. To some extent, it is kind of bad to use .out because you never know what's inside. - the ".out" is not output from the program, it is a log (text format) of its run. But it's just a name. Your program will still output separately wherever you tell it to. - Possibly .log would actually be a better extension, as it's clearer what's in the file. - Does the `#SBATCH --mem` parameter have to signify the total amount of memory needed for ALL tasks or a single task? - `--mem` is per array task (one number in the array).
All tasks have the same CPU and memory requirements (this is part of the way arrays work). - The --mem option designates the amount of memory you would like to have. The downside is that the memory allocation can then come from anywhere. To make sure every process has its amount of memory locally guaranteed, use --mem-per-cpu. - We'll talk about `--mem-per-cpu` in the parallel part. `--mem` basically means: set a request for total memory usage. Usually this is the easiest [name=Simo]. - In short, --mem works ok for jobs that run on a single node. - For the memory hog, if I'd like to reserve a different amount of memory for each job, do I just set this parameter as an `srun` argument? - Ah, this is a good point. Array jobs have the *same* memory for all tasks. So you can't change it per job. Use the higher amount for all in this case. - The idea behind the exercise is that each job DOES get the same memory limit. So some of them are supposed to fail based on the parameter they ran with. [name=Simo] - What if I don't specify `--mem` as an `#SBATCH` parameter but write it as an `srun` parameter instead? - It won't work, as the overall limit is set by the sbatch parameter. However, by setting a different value with `srun` inside your sbatch script you can limit an individual job step's resources. This is rare, though. [name=Simo] - Do `SBATCH` parameters override `srun` parameters, or are the `srun` parameters ignored? - sbatch sets the overall requirements for the job. E.g. if you request `#SBATCH --mem=2G`, you will get 2GB for the overall job. If you then ask `srun --mem=4G` there's no memory to give, as the job has 2GB. - One can think of SBATCH parameters as creating a playground of resources. You then use srun to claim part, or all, of the resources you have defined for your playground. If your claim exceeds what you asked your playground to include, your job will be terminated for claiming more than you requested.
- You can try the previous discussion with: ```sh #!/bin/bash #SBATCH --mem=2G srun --mem=4G hostname ``` You will get: `srun: error: Unable to create step for job JOBID: Memory required by task is not available` - Do you have answers from hackmd uploaded somewhere? - If you mean previous days: https://hackmd.io/@AaltoSciComp/ArchiveIntroSummer2022 ### Array exercises :::info **Until xx:40** https://scicomp.aalto.fi/triton/tut/array/#exercises If you can do #1 and think about #2, that is good. This is some basic practice. I am: - not done: ooooooooo - done: ooooooooooo - not trying: o ::: - `can't open file hpc-examples/slurm/memory-hog.py`, no such file or directory - nvm, used `$WRKDIR/hpc-examples/..` and it worked - how do I run slurm/memory-hog.py on the HY cluster? - you can git clone the repository and run it with any version of Python. The note on cloning with git is right above. Run with `python hpc-examples/slurm/memory-hog.py`. - See exercise 1 from yesterday https://scicomp.aalto.fi/triton/tut/interactive/#exercises N.B. at UH, turso has python3, not python, so use 'python3 hpc-examples/slurm/memory-hog.py' - `seq $((n*CHUNKSIZE)) $(((n + 1)*CHUNKSIZE - 1))` is this creating a sequence like we would expect in python? - Not sure what you would expect in python, but yes, that is Bash for making a sequence of integers in a certain interval. I personally prefer to code these things inside the python script so that the only input parameter is the arrayID (n) and then the python script figures out which chunk to run. This is important so that all the logic of the code is in a single place, rather than having bits in the slurm script and bits in the python code. - could you modify the memory requirements, like %a*100M for example, if you have to find a way to have different memory requirements? +1 - Array jobs are not really designed for this; when requirements are very different, it is better to write separate jobs.
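For the "file with parameters one per line" approach mentioned earlier, here is a minimal sketch of the mapping. The file contents are invented, and the task index is given a default by hand for illustration; in a real array job Slurm sets `SLURM_ARRAY_TASK_ID` itself:

```sh
#!/bin/bash
# In a real array job Slurm sets this variable; default to 2 for illustration.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-2}

# Hypothetical parameter file: one parameter set per line.
printf '0.1 dataset-a.csv\n0.5 dataset-b.csv\n0.9 dataset-c.csv\n' > params.txt

# sed -n "Np" prints only line N; tasks are numbered from 1 here.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
echo "task ${SLURM_ARRAY_TASK_ID} runs with: ${PARAMS}"
```

In an actual sbatch script the last line would instead be something like `srun python3 your_script.py $PARAMS` (script name is a placeholder), with `#SBATCH --array=1-3` matching the number of lines in the file.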
- How do you do basic math operations (arithmetic operations) in BASH/scripts? - I will post a link, but I'd say don't do it and do the logic of mapping/math/etc inside your python (or whatever) script. - https://www.shell-tips.com/bash/math-arithmetic-calculation/#gsc.tab=0 - If it's simple, I would consider doing it in bash, and keep the code itself simple to do one task, not N tasks. Unless it's really complex. - I prefer keeping the logic of mapping from arrayID to input parameter inside the script that I run, makes it more reproducible since everything is in one place. - https://scicomp.aalto.fi/training/scip/shell-scripting/ - why was the 1000M mem-hog not killed even though `--mem=500M` in my case? Aalto Triton - Was this on UH? If yes, this is because there is a bug in the system. - The kill starts kicking in when you go 800% over the limit (or something like that, it can vary depending on what's going on on the node where you are). The idea being that briefly "going over memory" is allowed. - So does the case matching have to be exhaustive, i.e. contain all the possible array job IDs? (Does something bad happen if no match is found?) - usually, but of course it depends on your program and the script. - This is what we did: ``` MEM=$(( $SLURM_ARRAY_TASK_ID * 100 )) srun python hpc-examples/slurm/memory-hog.py ${MEM}M --sleep 60 ``` - Do you use $ to specify new variables in a bash script? - In Bash, $VARIABLE is one way to access the content of something defined with VARIABLE=something - Here is a reference from our course https://aaltoscicomp.github.io/linux-shell/variables-functions-environments/ - It is a self-learning course on shell scripting; we also run it as a zoom+in-person course once a year. - `bash` documentation: [Bash Reference Manual](https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html) - How about the pending time? After sbatch, the job's status is pending. How is it decided which job has priority?
- Slurm can give you an estimated start time (`slurm q`), but it's only an estimate, other jobs can come in with a higher priority - [Bash hotkeys](https://www.makeuseof.com/linux-bash-terminal-shortcuts/) - Could you please show the script for exercise 1 again? / or post it here ``` #!/bin/bash #SBATCH --time=00:15:00 #SBATCH --mem=5000M #SBATCH --output=array_exercise_%A_%a.out #SBATCH --array=0-4 case $SLURM_ARRAY_TASK_ID in 0) MEM=50M ;; 1) MEM=100M ;; 2) MEM=500M ;; 3) MEM=1000M ;; 4) MEM=5000M ;; esac srun python hpc-examples/slurm/memory-hog.py $MEM --sleep 60 ``` (Note: the array range should match the cases defined, here 0-4.) - I would recommend `$WRKDIR/hpc-examples/hpc-examples/slurm/memory-hog.py` - I can't get this array job to work, even though I copied the above - what is the error message in the .out file? - I'm not sure what the out file name is, using this script? - try running the `ls` command to list files - found the issue :D my path to the file was wrong. - if you did days 1-2, `$WRKDIR/hpc-examples/..` would work if you cloned it there. - it works now. thanks for your help! - I got this exercise to work by hardcoding the arguments but not with a text file. What could be the reason? Each job seems to have the correct parameter from the text file, but they all succeed and each .out file just has one line e.g. "Using 6037504 bytes so far (allocated: 2)" after the line "Trying to hog x bytes of memory". - I had the same issue. I forgot to include M after the amount in the argument to the python script (${memory}M), so the code just interpreted it as bytes instead of megabytes. - Thank you! This solved the problem. ## Break until xx:02 - How do you load a datafile in a bash script to input it to a program? Or, is it better to do some arithmetic mapping to tell the program what file to load? - You don't so much load it in bash (bash doesn't have data structures for real data), as give the other program the name of the file to load - So, you can point to a datafile to be input to the program? - as a command line argument, yes.
The "universal interface" - My personal preference is to have all the mapping from arrayID to whatever (e.g. a file name) inside the program. So I only need to debug one place if there's an error in 1 year when I go back to that code. ## Parallel computing https://scicomp.aalto.fi/triton/tut/parallel/ - - So OpenMP is parallel in one node and MPI is parallel that can be among multiple nodes - OpenMP is shared memory job, so yes, in practice you can do OpenMP parallelization as long as you stay in single node. MPI on the other hand includes node-to-node communications. - In practice, yes. - In OpenMP, MP="Multi Processing". In MPI, MP="Message Passing", that's one way to try to remember the difference :) (messages are needed between different nodes) - An importan comment: MPI job will run perfectly on the single node also, MPI implementation have built-in mechanism to run on the shared memory too, that is MPI approach is for both within the node and within the whole cluster, while shared memory approach (like OpenMP and other implementations) is for within the node only; OpenMP is easier to program, but MPI is more universal - Is it possible to run parallel processes (ex. simulation of different environments) and also use GPU in each process? If each process needs to run DL model inference. - when you use GPUs you can also request (multiple) CPUs, yes. Code needs to know how to use it. - We'll talk about this in the GPU lesson, but you usually want to use more than one processor when using GPUs. Data loading usually requires multiple CPUs. [name=Simo] - What is the '-l' for in the #!/bin/bash line - "login shell", make sure environment is a bit more clean. I think it works without, but we usually include. - '-l' is for the ZSH users to proceed with the initialization; BASH users may skip it - Just a curiosity question: there is a small text window that contains dublicactes for all entered commands. How does this magic work? 
- there are multiple solutions but this one is https://github.com/rkdarst/prompt-log/ . It is basically bash scripting and hooks, fairly standard stuff. - What's the difference between `mpirun` and `srun`, and which one should be used when doing MPI jobs? - srun is a somewhat elaborate wrapper around mpirun. In fact, one should use srun with slurm and try to avoid using mpirun (at least at UH). - With modern versions of srun and OpenMPI, there are a lot of benefits to using srun to launch the MPI tasks. - `srun` does extra things that mpirun does not do: it tells the MPI job which nodes it should use, it organizes the network layer and it handles communication needed when jobs are launched. With mpirun you usually need to do these steps yourself and there's a risk that the MPI process does not utilize the high-speed interconnects. Feedback on this session (what was good? what to improve?): - Unfortunately, good examples of parallelism are hard to come by, as the programs are so different from each other. [name=Simo] - The cat trying to play with the mic +2 - Did you understand the presentation? Did you get what you needed, even if it was "not needed now but what I do later"? - . - . - How can we set "--cpus-per-task", i.e. how do we determine how many are needed? ## Break until xx:00 Then a presentation by CSC, then GPUs, then Q&A. ## Presentation by CSC https://github.com/AaltoSciComp/scicomp-docs/raw/master/training/scip/CSC-services_062022.pdf - What is the main difference between Triton and CSC? Is CSC more or less occupied in general? - The best comparison is "Triton" vs. "[name of CSC computer]" (Triton is a computer and CSC is an organization) - CSC is larger and sometimes less occupied (also leading to shorter waits for a given amount of work), but CSC needs a separate project application for each thing you do. - So what if you want to use software from a private repository that you have access to, to run e.g. a simulation on Puhti, is there a way to do that?
    - When you apply for CSC service access (e.g. Puhti), you get a project number and a folder for storage. You can install programs that should not be visible to other users into the project folder.
    - The same applies to other clusters: you can install software with licensing restrictions into your work folder or your project's folder.
    - Could you somehow achieve that with Docker containers? Or can you run Docker containers on clusters?
- Why AMD and not NVIDIA?
    - There are probably many reasons, but cost-efficiency, performance, etc. can be some of them. Supercomputers cannot be bought from computer stores, so it really depends on the vendors and what they provide. [name=Simo]
    - I guess we could also say that, at the scale of LUMI, it is more efficient to use AMD and provide support for porting the code (and then more spare GPUs are available for others).
- Will these slides be available somewhere?
    - They are now linked from the course page.
    - The link is also above, under the header.
        - Oh, thanks!
- Conda environments in containers: https://docs.csc.fi/computing/containers/tykky/
- Sensitive data services at CSC that were mentioned: https://research.csc.fi/sensitive-data-services-for-research

## Break until xx:52

Then GPUs, feedback, and Q&A.

## GPUs

https://scicomp.aalto.fi/triton/tut/gpu/

- If you have code that should get GPU support, contact us and we have someone who can help with that (at Aalto, at least).
- Could we check which GPU types (e.g. Tesla or other) are available (not in use), to then specify those?
    - I guess you would specify the generic requirements and the scheduler would choose whatever is free. `slurm features` and other monitoring commands can tell you more.
    - `sinfo` can also give you information on what is currently available. The following command will print a lot of information on current usage. Not specific CPU load, but the general state (i.e.
idle CPUs/GPUs etc.): `sinfo --Node --Format="NodeHost:10,StateCompact:10,FreeMem:10,Memory:10,AllocMem:10,CPUs:10,CPUsState:15,CPUsLoad:10,Gres:40,GresUsed:40"`
- Can you write down the physics frameworks you just mentioned?
    - Some of them: https://www.cp2k.org/ , https://www.lammps.org/ , https://www.gromacs.org/ , https://charm.readthedocs.io/
    - There is a huge number of different physics simulation packages.
- `sacct -j JOBID -o comment -p` doesn't work at UH?
    - The command works, but the GPU info will likely not be in the requested field, since it is filled in there by a script on Triton.
- In an interactive session, can we use a GPU for testing/debugging? And does it affect priority?
    - Yes, but remember to exit the job. GPU usage is comparatively expensive and counts against your priority, and it doesn't matter whether the GPU is actually being used: the "price" is for blocking the resource, interactive or not.
- Could you briefly comment on the different types of GPUs?
    - Newer is better, if you know why they are better. They are usually faster.
    - Better = more features that might make some code work; some code doesn't work on older ones.
- Can you briefly summarise the meaning of the CUDA core count?
    - It's the number of individual calculators in a GPU.
- Any input on reproducibility when using Triton? E.g. should I always use the same type of GPU? Sometimes the results differ between different GPUs.
    - That sounds like numeric instability. If your results depend on the architecture, I would suggest trying to find the parts of the result that do not depend on it, since the part with stability issues will not really be reproducible. But you can and should, of course, state all specs of the hardware used for computations.
    - Is it a stochastic algorithm that is different each time you run it?
        - Thank you!
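As a concrete version of the GPU advice above (one GPU plus several CPUs for data loading), a minimal batch-script sketch; the flag spelling, module name, and `train.py` are placeholders that vary between clusters:

```shell
#!/bin/bash -l
#SBATCH --gres=gpu:1          # one GPU; some sites use --gpus=1 instead
#SBATCH --cpus-per-task=4     # CPU workers to keep the GPU fed with data
#SBATCH --mem=8G
#SBATCH --time=01:00:00
module load cuda              # placeholder; check your cluster's module list
srun python train.py          # train.py stands in for your own code
```

Remember that the allocation is billed for the whole requested time, so exit interactive GPU sessions as soon as you are done.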
          This was more of a general question for the future, since I've encountered this kind of instability in deep learning (the dataset was small, though, which is probably one of the main reasons), even when setting random seeds etc. Of course there are always some differences between runs in deep learning.
    - There are plenty of tools for reproducibility in deep learning. For example, see [this](https://towardsdatascience.com/achieve-reproducibility-in-machine-learning-with-these-two-tools-7bb20609cbb8) blog post. I suggest checking them out. [name=Simo]
    - If there is any possibility for it, you could try to set seeds to reproduce the selection of the training order. And/or try to run it multiple times and present the average along with the variation of the results.
- How powerful are the professional GPUs (e.g. A100) compared to the consumer ones (e.g. the 3000 series)?
    - I think the biggest difference is error correction and error-correcting memory. If you care about bit errors you want the professional GPUs, but per CUDA core they are more expensive.
- I know about the daily garage, but is there another way to ask questions, like this HackMD? The garage for Triton is once per week, so I mean an asynchronous channel, if one exists.
    - The garage (synchronous) or our chat (asynchronous) are best: https://scicomp.aalto.fi/help/#chat . Or maybe the issue tracker, if you want to be sure you get a resolution, though it may be slower (https://scicomp.aalto.fi/triton/help/#issuetracker).
    - The Triton garage is every day at 13:00 (since summer 2020), not only weekly.
        - Isn't the focus day Thursday, or can we ask about Triton (or anything else) on any day, without following the focus days?
            - Focus days essentially only increase your chance of getting an expert for a specific question. In general most people are around every day, so most questions can come on any day.

## Feedback, day3 and the whole course

Join the learner Zoom after the course for in-person discussion.
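Before the feedback: one concrete way to combine two threads from above (run-to-run variation and array jobs) is to make the random seed the array index, so each run is individually reproducible and the spread across runs can be reported. `train.py` and its `--seed` flag are hypothetical:

```shell
#!/bin/bash -l
#SBATCH --array=0-9           # ten runs, ten fixed seeds
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
# Each task gets a deterministic seed, so any single run can be redone exactly;
# the average and variation over the array quantify the instability.
srun python train.py --seed "$SLURM_ARRAY_TASK_ID"
```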
Today was (multi-answer):

- too fast: o
- too slow: o
- just right: oooöoooooooo
- should have covered more: ooooo
- should have covered less: oo

Today, there should have been more:

- primary discussion/lecture: ooo
- demo (without type-along expectation): oooooo
- type-along in main room: oooo
- exercises (independent work): ooooo
- going over the exercises in main room: oooo
- Q&A via hackmd: o
- homework (do exercises yourself after class to save time): o

One thing you liked about today/the course:

- the CSC intro +1
- HackMD
- hands-on exercises (e.g. on array jobs) +2
- Overall organization of the course with Twitch and HackMD worked really well. +7
- The beginning of the course with exercises was really clear and easy to follow; I really learned a lot +1

One thing to improve for next time (today or the course):

- It was a bit hard to focus, especially in the latter parts today, since there were no exercises (or we didn't do them now) +4
- More on how to actually run your own programs; it was only briefly mentioned how you can run a separate Python program etc., but a lot of practical detail was missing, for me at least +1

General comments:

- There should be a bash basics course with this same kind of concept with Twitch/HackMD!

----

:::info
**This is the end of the document, WRITE ABOVE THIS LINE ^^**

HackMD can feel slow if more than 100 participants are editing at the same time: if you do not need to write, please switch to "view mode" by clicking the eye icon on the top left :eye:
:::