
ARCHIVE Intro to Scientific Computing and High Performance Computing

╔═══════════╗ ╔═════════════╗
║    WEB    ║ ║  TERMINAL   ║
║  BROWSER  ║ ║   WINDOW    ║
║  WINDOW   ║ ╚═════════════╝
║   WITH    ║ ╔═════════════╗
║    THE    ║ ║   BROWSER   ║
║  STREAM   ║ ║  W/ HACKMD  ║
╚═══════════╝ ╚═════════════╝
  • Do not put names or identifying information on this page

02/Feb - Intro to Scientific Computing

Ask anything, and always write at the bottom (please include your organization in the question, as there can be differences between the Aalto/Helsinki/Tampere/Oulu university clusters).

Icebreaker - How did you start your computing? Or how are you starting?

  • I'm part of staff at Aalto, I started during my uni studies with Matlab
  • I am a PhD student and need computing power for bioinformatics
  • I studied bioinformatics, and even before was doing some sys-admin stuff at school.
  • I am a postdoc at Aalto, have a degree in cs, reasonably fluent in python mostly in anaconda/jupyter. I run data analysis processes that can take a long time so I think HPC may be helpful
  • I am a physicist. I never had formal training. I started because I needed to record temperature curves around 0 K. These days I am bombarded with hundreds of GB of data and I need to get through it fast because publication, publication, publication :(
  • I am a postdoc at Helsinki Univ, Computer Science department, Data analysis
  • I'm a Ph. D. student in Physics. I have learned a lot during my studies on specialized courses. I'm here to learn to use Triton effectively in my research. I have coded mostly in C++ and Python.
  • I am a Doctoral Student in Physics, need more computing power for particle-in-cell simulations. Have no previous experience in HPC.
  • I am a postdoc at Oulu, I learned to code in R, use GitHub, and am learning Jupyter; Water Resources, Environmental
  • I am a postdoctoral researcher in Ecology at Helsinki Univ. I started coding during my PhD. I code in R, use GitHub, and need to use server to run "big" statistical models
  • I am working at University of Oulu and started analyzing data during my postdoctoral research.
  • I am Real Estate Ph.D. candidate. I learned coding myself.
  • I'm a CS PhD Student at University of Helsinki working on ML. Mostly work with Python, but interested in speeding up workflows with HPC resources.
  • I am a PhD student at Aalto CS, but have a mathematical background.
  • I am a Postdoc working with computational simulations. I have a background in biology. I work at the University of Helsinki. I mostly do scripting with python, but I have no formal programming training. I use the CSC resources but I am afraid I am not taking full advantage of them.
  • I am a PhD candidate in the concrete technology department. I started coding for the sake of my thesis, which relies on machine learning methods.
  • I am a PhD student at Aalto in economics programme. I have learned coding mostly myself.
  • Started computing/programming during PhD project (theoretical chemistry)
  • I am a PhD student working on reinforcement learning and I have used Triton for a recent project, which was mostly a result of self-teaching and possible inefficient use :-)
  • I am a postdoc at Oulu Univ. Working in the field of space physics. I have to start using HPC soon for my work. So here I am
  • I am a postdoc at University of Helsinki. I started during my MSc for my thesis work. Now a postdoc and I still feel lost sometimes.
  • Postdoc at Aalto, doing remote sensing
  • Master's student at Aalto, preparing for an audio captioning challenge and working part-time. Needed to learn scientific computing for both.
  • Staff scientist at Aalto Studios, preparing for measuring collective behaviour and experiences
  • I am a PhD candidate of biomedicine in Oulu Univ. I have NGS data to analyze, and start to learn to use HPC, the Carpo2, to help me with that.
  • I'm a second-year undergraduate physics student who wants to combine an interest in computing/programming with an interest in physics.
  • I am a postdoctoral researcher at the University of Helsinki. I work with geospatial and mainly textual social media data
  • I am a PhD student with very little previous experience of programming, but would like to learn and explore the uses in my field

Intro

  • this is a question
  • and here another one
    • a reply
      • a comment to the reply
  • ..

Me and HPC or: How I learned to stop worrying and love the computing.

  • What is scientific computing vs computational science vs using computers for science?

    • Scientific computing vs using computers for science: you could distinguish by whether you create code or pipelines, or whether you use predefined pipelines/tools.
    • Scientific computing is about using methods and techniques refined for optimal use of resources to perform numerical and statistical tasks.
  • What kind of problems did you encounter during your studies?

    • My chief problems are twofold: due to a lack of formal training in the subject, I am not aware of what tools are available. Secondly, due to a lack of formal training in programming, the scripts I write feel clunky and dirty, instead of feeling like 'software'. :+2:
    • I was obsessed with learning the "best practices"; maybe I am a perfectionist, so I couldn't get anything done. Then I realised that "good enough" is the new "best" :) I still have time to learn the best practices +1
    • I am always overwhelmed. I have trouble reading Python library manuals properly and learning them quickly. It is embarrassing. And yes, I want to do programming correctly and not quick-and-dirty anymore, but I don't have that much time. +1 (me too)
    • making iterations and loops to build my own computational models; lately the computation is sometimes too demanding for a laptop
    • Handling large amounts of data and programming physics calculations/simulations in as optimized fashion as possible using e.g. linear algebra libraries has been really time-consuming and bug-prone for me. My problem has been to find good strategies to handle these things efficiently.
    • Something runs into an error after 8 hours and I feel bad for wasting so many resources :-(
      • Yes! Plus the anxiety of having lost a day of work
      • It happens :) But is there a way to save the result every hour or so?
    • In many cases my computer needs a long time to process the data, run simulations, etc., so I want to be able to use more powerful computers. Also, my data is almost always messy, so I have had to supervise the process for any type of error during the computation. It would be beneficial to know how to prepare the code to avoid such issues.
    • I was lucky to have "nerdy friends" so I got introduced to Linux early, but this shouldn't be a reason, maybe we need more "friends" and communities to learn together
    • I frequently find bugs in my code, but it can take a really long time to fix them, which is frustrating since it does not feel like efficient work even though it is usually necessary. It's hard to know how to organize code better to reduce the amount of this.
    • I would like to get into the habit of using Git more fluently. At the moment my use of version control is pretty basic.
    • I often find myself using solutions found online; for example, I am now trying to apply a computational geometry solution to my problems, and I feel I need to find a better way of approaching it than browsing Stack Overflow alone.
    • I have never worked with Linux; thus, HPC was always too much for me to start with. All the training videos I googled required some background and skills that were not introduced during the HPC training. I hope that this training will start with the basics and explicitly explain the material. +1
  • What is your recommendation for how long to spend on a problem in computer science? One week and then give up and try a different route?

    • 1 day and then talk to people? :)
    • I work by searching for solutions to similar problems online if I cannot solve it in the first week.
  • search "rubber duck debugging" :duck:

  • three fundamental skills

    • command line environment
    • version control
    • data management (micro and macro)
  • Do you have material for data management (folders, code, etc) for a Python research workflow?

    • let me find it
    • https://coderefinery.org/ workshops are a great starting point - there is one in March!
    • Thanks. This is what I always have issue with when starting research projects and they grow wild.
    • "data organization": https://scicomp.aalto.fi/data/organization/ (though you can find many other things). CodeRefinery for automation and the code management itself.

Scientific computing workflows

https://hackmd.io/@AaltoSciComp/SciCompIntro

  • Can I download the HackMD at the end of today, and how can I do it?

    • You can copy and paste it but we'll also archive it permanently on the webpage
      • I thought the HackMD would be reset every day? At least this happened in the last Aalto HPC course.
      • we archive it in another spot to keep the main one small and responsive, but it's never gone - ask us if you can't find it. There is usually a link at the top.
      • Thanks :D
  • It used to be that CPU/RAM were very important in scientific computing. Are they still so important? Has the focus now moved to GPU specifics?

    • Only if you have very specific code, with specific instruction sets that are necessary and which are only available on specific CPUs/GPUs. Otherwise the only difference it makes is a bit of computing time. RAM can sometimes be a limit, especially if you want to pre-compute a lot of things to speed up your computation, but commonly it's mainly a tradeoff against time.
    • CPU/RAM is still an essential part of the common PC architecture; a GPU is an accelerator that comes as an extension of the common compute node. GPUs became very popular in the last 10 years, though there are other accelerators available. Your code can be CPU-only or a CPU/GPU hybrid; the latter means you start your application normally on the CPU, but then the core of the code, or some parts of it, run on the GPU. GPU cards have their own memory, and its size also often matters.
    • On modern systems, the most cost-efficient amount of RAM per CPU in HPC systems has stayed pretty much the same for many years (around 4 GB/CPU). This means that your code will "fit" into an HPC cluster best if it has a similar RAM-per-CPU ratio. Simo
  • Nice memory vs storage analogy!

  • What does shared memory mean, please? It came up recently and I did not understand this expression.

    • Usually this refers to a type of parallelization: you can have programs that run on multiple computers and have to communicate everything explicitly, or programs that run on the same computer, where all the workers are hooked up to the same processor/memory.
      • Practical difference: shared memory can't be split across computers (though some very specialized hardware does share memory across machines).
    • Yes, it came up in that sense of parallelization. The software was supposed to reduce memory use (?) while copying data between the workers, I think
      • What is the name of this very specialised hardware, please?
      • Searching; it is so special that I am not that familiar with it! Anyone know? I think this is hardly ever used, not worth it. https://en.wikipedia.org/wiki/Distributed_shared_memory
      • I often come across the expression FPGA, but have never used them. Probably not worth it currently. Sorry.
    • Do not confuse the shared memory architecture with the shared memory programming model. Shared memory is the local memory within one box; even if you have several CPUs, the memory is connected by the memory bus and thus communication is fast. In contrast, a distributed memory system is when you have several nodes (= computers) interconnected by a network, so a CPU in box one has no direct access to the memory in box two; communication in that case happens by means of message passing, which is non-local and slower, but easily scalable. The programming models that come on top of these are the shared memory programming model and the MPI programming model. The shared memory programming model is widely represented by OpenMP, OpenACC, etc.; GPU programming is also an example of shared memory programming.
      • Thank you!
  • I like the pasta analogy :-)

    • so, the trick is finding a recipe that can make use of more pots, rather than trying to find a hotter stove?
    • or prepare one sauce but cook many different pasta types in parallel and then sample which combination fits best :+1:
    • The point here is that the application's execution time, when run in parallel, will be the time of the slowest chain. The spaghetti example stands: if boiling is the longest part of the chain and takes 10 minutes, the whole task will always take at least 10 minutes, no matter how many pots/workers you have. The opposite example would be making 10 sandwiches: they can be made in parallel by ten people, so the total execution time would be the time of making one sandwich.
  • In Enrico's picture of various resources he has written "Big Data (slow)" within the HPC box.

    • HPC data storage may be slow compared to RAM, but it is much faster than data in the cloud, and also faster than your laptop's disk.
  • What about hierarchies (memory, computing):

    • memory: large and slow, small and fast
    • computing: small and easy to use, large and harder to use
    • always needing to work up?
  • "there is no cloud, there is only someone else's computer" ?

    • behind each virtual cloud there is always some physical hardware; a cloud is a combination of the hardware and a software layer that provides a way of accessing, allocating, and dynamically distributing that hardware
  • I wonder how I should estimate the resources I need for HPC?

    • it is always an estimation: even running the same software but simply increasing the size of the problem, or changing an input parameter that triggers a different algorithm, may have a huge impact on the resources needed: execution time, total memory usage, I/O, scalability over a number of CPUs (especially in MPI cases). The first step is to ask your group members who run the same code on the same cluster (you are most probably not alone), then consider looking for public benchmarks or forum threads that discuss the optimal setup, and then try it out, asking for a bit more than you expect and tuning the requested resources according to the results. We are coming to this in the next days.
    • make a small example, run it, see how much it uses and try to estimate. Depending on how long you expect your code to run, I would request either somewhat more (long run times - days) or somewhat fewer resources (minutes or hours of runtime).
      • the concept of small in HPC is so different from our home PC, so what would be small for HPC?
      • small would be "fewer data points", much fewer iterations, etc. Essentially the same code, but instead of running it for 1000 samples, run it for 5 and maybe 10. This way you get an estimate of both the increase in memory usage and how much longer it takes.
        • so do the resources/computational power required scale linearly with the dataset size?
          • not necessarily, but it's a first approximation, and by running two small samples you can potentially estimate how it scales (see the sketch after this list). Also the number of CPUs requested (if the code can actually use more than one, which you can also figure out with a small example) will influence the computation time, but the "hardest" limit is commonly memory: if you run out of memory, your job is dead, while you can't run out of CPU power, it only takes longer. Finally: for anything that takes more than a few hours, always try to put in some checkpointing mechanism (i.e. regularly saving the current state in order to avoid running into walltime limits and losing all the time of your job).
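
A minimal sketch of this "small test first" approach on a generic Slurm cluster; the script name, sample counts, and resource numbers below are placeholders, not a recipe:

    # Submit a scaled-down version of the job
    sbatch --time=00:15:00 --mem=2G --job-name=smalltest \
           --wrap="python analysis.py --n-samples 10"

    # After it finishes, check what it actually used (standard sacct fields)
    sacct -j JOBID --format=JobID,Elapsed,MaxRSS,TotalCPU,State

    # Repeat with a somewhat larger sample count, compare the two runs,
    # and extrapolate before requesting resources for the full-size run.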

Break until xx:00

Next up: "When and how to ask for help": materials

you can continue asking questions:


When and how to ask for help

materials: https://cicero.xyz/v3/remark/0.14.0/github.com/bast/help-with-supercomputers/main/talk.md/

  • Comment: Sometimes the tone on Stack Exchange is harsh. And I don't like the downvoting. Aren't there any stupid questions? These days I hide behind a pseudonym and ask on GitHub. I don't have dedicated RSEs or a garage around me.
    • This is an unfortunate side effect of many social networks. I personally find it at the same time funny and sad that in many programming fields people turn into purists who try to define what is right or wrong. GitHub issues are usually better, but it really depends on the developer. Simo
    • I also think empathy is often forgotten in stackexchange and sometimes even in our organisation's support tickets. Someone might have struggled for days with an issue so a more welcoming + empathic approach should be used.
    • If you don't have a community around you, you are always welcome to join open communities like CodeRefinery -> https://coderefinery.org/organization/get-involved/ You come as a learner, then maybe become a helper for some workshops and if you feel this is your thing you can become an instructor!
  • Comment: And there will be others who have the same question.
  • What about the role of chat/informal methods for simple questions that may or may not have a clear answer?
    • Chat isn't that common, but at Aalto we are trying to use it more and more for these small "is it a good idea" questions. For big things we will direct you to an issue tracker so it won't be forgotten.
  • Do you have any examples of a major XY problem case?
    • asking for more RAM/memory or access to some huge-memory queue, where the real problem was one line in the Python code and we could solve it by changing from reading all the data in at once to reading the data in batches
  • I feel that in my organisation I cannot really talk with anyone to get help.
    • Yeah, that's a major problem. Unfortunately we can't do much about that directly, but we should work together to inspire better support. Join CodeRefinery and talk to us there, both about your questions and about how we can work to motivate better support.
    • support is a big issue in many places! It really does need more of a vision.
    • there are plenty of forums around; in Finland, CSC is a good place, and then there are courses like this one; help does not need to be local

A tour of scientific computing skills and tools

Materials: https://hackmd.io/@AaltoSciComp/ToolsOfScientificComputing

  • Richard what do you think about "worse is better"?
    • Radovan sometimes "worse" and familiar can be easier/better for a collaboration than "better"/unfamiliar: the risk of the unfamiliar tool is that only 1-2 people then know it and all questions go to them and others feel disconnected from their own code/product/project
  • Should I use Jupyter Notebooks as my only programming environment if I am using python? Because that is what has happened and I am wondering if I should change that habit.
    • Jupyter notebooks are good for starting out, but when switching to HPC systems, it might be good to write your code in a way that it can be run from the command line. There's info on that in the "Running code from command line"-section, if you're interested. Simo
    • One big difference I would say that a notebook is designed to be inherently interactive. You have to be around to press a button to run a cell in the notebook. Simo
    • They are a fine place to start and good for a lot of work. As we'll see in days 2-3, they can be limiting when you need to scale up to larger resources. But it's not that big a step to using other things along with notebooks.
    • I find notebooks very useful for linear workflows: read data, do some stats, plot data. Then they are great. For some applications they are less good of a fit.
      • The only thing you have to be careful about is that your notebook is executed in order, at which point moving larger parts into functions can avoid unintended errors.
    • My usage is indeed linear. However, I do not have a good guide for organizing non-linear frameworks.
    • I sketch and debug in a notebook, but then I export to a Python script so I can run it for longer times on a remote machine (see the sketch at the end of this section)
    • This part about "writing things as a script" is a bit relevant here. notebooks go from start to end, but you need to add the "interface" part often by copying out into a script.
    • Will there be a guide towards how code should be organized in real programs?
  • Version control: CodeRefinery workshops! https://coderefinery.org, join our workshop in March
    • Nice!
  • Upcoming is a panel discussion and we can get some debates going on some of these practical questions - start thinking already!
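
A minimal sketch of the notebook-to-script route mentioned above, assuming a hypothetical notebook called analysis.ipynb and that the jupyter command is available in your environment:

    # Export the notebook to a plain Python script (produces analysis.py)
    jupyter nbconvert --to script analysis.ipynb

    # Run it non-interactively, e.g. on a remote machine or inside a Slurm job
    python analysis.py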

Break until (xx:03)

Panel discussion questions

You can ask any questions about practical scientific computing topics: how you do X, what do you recommend for Y, is Z possible, etc?

  • (from above) Notebooks? vs scripts? role of each and how to move from one to the other
    • Notebooks are a bit of a playground where you can work with your data and code; scripts are more linear, "run and done" things.
    • It might be a good idea to make some parts of the computations into functions that you call from the notebook.
  • How to version your data?
    • rkdarst I am interested in a tool called git-annex for data management; unfortunately the user interface is bad and it's too complex for mainstream use (definitely a case of "better is worse")
    • version the metadata
    • keep track of how the data was generated (version of code, environment, steps)
    • there can be versions of the datasets when the original data is updated (e.g. after quality control)
  • (from above) What to do when you can't get help in your organization
    • Let's build a cross-organization support network? We are trying this in different shapes: Nordic research software engineers, CodeRefinery network.
    • Get in contact with relevant groups (nordicRSE - weekly coffee break 9am - Helsinki Time , coderefinery e.g. via zulip)
  • Can you tell what you teach in a Code Refinery workshop? Is it worth taking it?
    • we teach version control, collaboration with colleagues and with your future self, how to document and test, how to keep track of dependencies and how to share code and data and workflows. I think it is very worth it but I am biased Radovan
  • Keeping track of requirements/dependencies: E.g. package updates in R. What is your experience here, and do you know of any solutions?
    • Each (modern) language has some way to manage dependencies: don't invent your own, learn it and try to use it! Also the other way around: make sure your programs can be installed using these standard methods. "Package your software well."
    • I have used renv and hear good things about it to keep track of dependency versions in R
  • Can you recommend some data folder system handling library in C++? I'm currently just writing my own folder/csv file handling but perhaps there are some good existing solutions.
    • Maybe an HDF5 library could help since it can store data in hierarchical layout, compressed, and other codes and languages can also understand it.
    • also consider different data formats (is that the root question?)
    • Apache Arrow is a popular data format for big data (especially table data). Here's the C++ API. Data created by it is easy to read to other languages.
  • What is the role of containers (Docker etc.) in scientific computing? :arrow_up:
    • a container is basically an operating system in a file: a way to bring the whole installation to other systems (a minimal example is sketched after this list)
    • by making containers, you have to somehow write down the steps to create them, which is good for reproducibility
    • you can preserve an operating system and all the environment for the future, to rerun something 5-10 years later
    • rkdarst the danger to avoid is cases where software cannot be used outside the container, because it is too difficult to install.
  • Should I write scripts (functions in a series executed) or should I go for a class when I do want to program a flow process?
    • more functional (preferring functions) or more object-oriented (preferring classes and objects) is often a matter of preference (my preference would be to start with composable functions that have no side effects and introduce classes on "the outside of my code"). Design your code to be easily testable and changeable.
    • Perhaps the real question is about workflow managers: if you are trying to automate a process, there is a whole class of frameworks that handle that.
  • Related to code organization, how to handle boiler plate (e.g. in C++) effectively? Should you begin with classes at the beginning or use them as a way for refactoring the code later on (to avoid boilerplate at the beginning)?
    • explanation: "boilerplate": repetitive-feeling code/text, having to write quite a bit to achieve standard/common tasks
    • rkdarst In what cases do you all use classes in scientific computing?
    • RSE = research software engineer: people we have available to help with your computational science so that you don't have to know everything (several times we've been able to make things many times faster).
  • What is the difference in between HPC and scientific computing, or is it the same?
    • rkdarst I'd say that HPC is a subset of scientific computing. :+1: "HPC (high performance computing)" often refers specifically to high-performance parallel programs; many people don't need to go to that extreme these days.
    • These days "HPC" often means "resource intensive" (and there are many different types of resources and ways to use them)
    • "high throughput computing" means "you need a lot done, but it can be distributed"
  • I'm a physics student who likes to write C++ and parallel programs. I thought that scientific computing would be a way to combine my interests. Are there people doing scientific computing with C++, or am I falling into the trap of screwing in nails with a hammer (wrong tool for the job)?
    • Radovan is the question whether C++ is the right tool? many many people and many groups use C++ for research software and/or HPC so I would not consider this to be a trap but being in very good company. But also check out the Rust programming language.
  • (from above, some answer there) How to estimate the resources needed for HPC?
    • Start with a small example and grow it (larger system size and more cores/nodes) and from that study you can extrapolate and also perhaps see bottlenecks. It is difficult to give a more general approach that would work for all domains/problem. I try to create an example that takes 1-2 hours on one core and then I study how it scales as I increase number of tasks/threads/cpus/nodes and extrapolate.
  • (from above, some answer there) Distributed shared memory architectures
    • These are quite rare and usually very technical, but for example MPI standard has ways of accessing other computer's memory (one-sided communication). Likewise NVIDIA's NCCL-library allows one to access memory of remote GPUs. Simo
  • How to balance between reinventing the wheel and having lot of dependencies
    • Thomas In general I'd say that if the dependency would be added for a very minor function that is just a few lines, then rewrite it yourself (there is no point in adding a multi-MB library for a 3-line function); otherwise, rather use existing established libraries than create (likely) comparatively inefficient versions.
    • One JavaScript developer famously removed a package with a single function from the NPM repositories and broke half the internet. Do not import a dependency for a very simple function only :) Simo
      • Thomas This example shows that it depends on the dependency. If that dependency only has 17 lines, sure, use it. So in the end it's a bit: how much additional stuff does the dependency add, in contrast to how much it saves you.
    • I sometimes rewrite because I would like to understand it
      • That's a nice sentiment. As an addition, I'd maybe even say that "if they have done better job than I could do, why not use their work?" Simo
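
A minimal sketch of the container workflow mentioned in the containers question above, assuming Singularity/Apptainer is available on the cluster (on many clusters it is provided as a module) and using a hypothetical script name:

    # Pull a container image from a registry into a local .sif file
    singularity pull python_3.10.sif docker://python:3.10

    # Run your own script with the software stack frozen inside the image
    singularity exec python_3.10.sif python3 my_script.py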

Break until xx:03

Then "connecting to the cluster"

Connecting to a computer cluster

https://scicomp.aalto.fi/triton/tut/connecting/

Notes:

  • This is different for every cluster (but the general pattern the same)
  • Many ways to do it, but they all end up at a command line environment
  • We give a demo here and you can try to follow along, but you can take your time and get it set up before tomorrow.
  • Windows "powershell" also has "ssh" implemented
  • https://vdi.aalto.fi for a virtual desktop at Aalto
  • https://vdi.helsinki.fi
  • https://mobaxterm.mobatek.net/
  • SSH jump hosts are also available on Windows.
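
A hedged sketch of a shorthand SSH setup (this also answers a question that comes up later today); the host names and username below are placeholders for an Aalto-style setup, so substitute your own cluster's values:

    # ~/.ssh/config
    Host triton
        HostName triton.aalto.fi
        User myusername
        # Only needed when connecting from outside the university network:
        ProxyJump myusername@kosh.aalto.fi

    # After this, connecting is simply:
    #   ssh triton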

Comments & suggestions from this day

One good thing about today:

  • Comprehensiveness
  • Cat

One thing to improve for next time:

  • The main point of HPC usage for a regular researcher is to connect to the cluster and run something like Jupyter/R/Matlab there (as long as only those are available with a GUI, as far as I understand). Most of the broad explanations could be skipped by just providing a clear flow for connecting on Windows (most regular people still use Windows) and running the usual software.
  • Data transfer lesson on day 1 [from zoom]

Now to learner Zoom for hands-on connecting support

Zoom link by email to registered participants

Day2 - 3/Feb/2022 Introduction to HPC (part 1)

Materials: https://scicomp.aalto.fi/training/scip/winter-kickstart/

Icebreaker

What programs do you need to run for your work?

  • Singularity containers, matlab and general development tools.
  • R, conda, Snakemake
  • PyTorch
  • Matlab in combination with Simulink
  • Matlab, Simulink, python
  • QIIME2
  • Matlab (CasADi + IPOPT), Python (PyTorch), and Julia (DynamicalSystems)
  • Python, jupyter, python packages like tensorflow
  • Python, Jupyter notebooks, scikit-learn with conda environments :+
  • Jupyter Notebook, R Studio
  • Matlab (Computer Vision TB)
  • C++, Python 3 (NumPy,Matplotlib)
  • Macaulay2, julia
  • open source particle-in-cell code written on Fortran95 using MPI

I am connected to a cluster:

  • yes: oooooooooooooooooooooooo
  • no: o
  • I'm not trying to: oo
  • Yes, connected to Triton and at scratch/work/

Yesterday was:

  • useful: ooooooooooooo
  • stuff I already knew: ooooooo
  • too slow: oo
  • too fast: ooo
  • Useful: quite generic but the Q&A was very useful and I got a few nice hints, especially from the "tools" part

Any more questions or comments on the material from yesterday:

  • If I have some jobs running, is there a way to suspend them so that I can do the exercises?
    • If they are running in the batch queue, you don't necessarily need to suspend them
  • A question?
    • an answer
      • not really a question more like a comment
    • Well actually
  • How do I open Jupyter via the Linux shell? Is it possible? Or how do I apply for JupyterHub? I just need some GUI, not a Linux shell, to run my code faster than on my Windows laptop, and some data storage to work with
    • JupyterHub is an easy way. You can also use ssh to forward a connection to your own Jupyter: https://scicomp.aalto.fi/triton/apps/sjupyter/ (but you'll need to use ssh + the shell anyway; a minimal port-forwarding sketch follows below). Open OnDemand will support this soon, similar to JupyterHub
    • Thank you for clarification. How to choose the volume of memory and GPU/CPU usage for JupyterHub? After logging in I cannot see any choice for these parameters.
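
A hedged sketch of the ssh port-forwarding approach mentioned above (the linked Triton instructions are the authoritative version, and on a cluster the notebook usually runs on a compute node, which adds a step); the port number and host name here are placeholders:

    # On the cluster: start a notebook server without opening a browser
    jupyter notebook --no-browser --port=8899

    # On your own machine: forward that port over ssh
    ssh -L 8899:localhost:8899 myusername@triton.aalto.fi

    # Then open http://localhost:8899 in your local browser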

About clusters

  • Can you say something about Slurm as a language?
    • It's a set of programs and configuration options. You'll see a lot more now and later today
    • SLURM is not a language but the queueing system (= batch system). The submit script is a shell script under the hood that contains SLURM-specific directives (#SBATCH); Slurm recognizes those directives and uses them to find a suitable free compute node to run your submitted job. The submit script itself can be an arbitrarily long BASH (= default shell) script with all the usual BASH programming logic. Note: ASC has a separate course on BASH programming.

Real example: array jobs

this is a demo, follow along and watch

  • Are the array jobs always independent?
    • Yes. There is no built-in communication or even synchronization; they are separate processes that may or may not run at the same time.
    • It is called embarrassingly parallel in the literature, though it should not be confused with shared memory or MPI parallelization (a minimal array-job script is sketched after this list).
  • Can you get the Node ID from the submission queue?
    • Yes, slurm can tell you that; it's in the last column of slurm history - more on this later.
    • 'squeue' has plenty of options and can show you everything SLURM knows about the job (see 'man squeue' for the details). On top of that we have a wrapper called 'slurm', a separate script developed in-house by us, so 'slurm queue' (or shortly 'slurm q') can show you pretty much everything you need about currently running/pending jobs. The above-mentioned 'slurm history' is useful for completed jobs.
  • Will we be able to do something similar later today or tomorrow?
    • Yes! array jobs first thing tomorrow
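
A minimal sketch of an array job, assuming a hypothetical script process.py that takes a numeric argument; tomorrow's session covers this properly:

    #!/bin/bash -l
    #SBATCH --time=00:30:00
    #SBATCH --mem=1G
    #SBATCH --array=0-9                 # ten independent tasks
    #SBATCH --output=array_%A_%a.out    # one output file per task

    # Each task sees its own value of SLURM_ARRAY_TASK_ID (0..9 here)
    srun python process.py --index $SLURM_ARRAY_TASK_ID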

Real life example: lammps and MPI

this is a demo, follow along and watch.

  • The demo: https://zqex.dk/index.php/method/lammps-demo
  • What is the difference between the times: user, real and sys?
    • Real is the time it actually took according to a wall clock
    • User is the amount of time the processors spent on the problem. If you use multiple processors, this can be larger than the real time
    • Sys is something I usually ignore: the amount of time the system (Linux kernel or similar) spent on this.
    • 'man time' has a good explanation of the subject
  • Are all five processors from the same node?
    • In this case yes, but mpirun can use processors from multiple nodes if it's set up to do so.
    • Yes, SLURM always tries to allocate CPUs on the same node even if you do not ask for it, because it is more efficient from the cluster's point of view. MPI jobs may still get CPUs on different nodes, depending on how busy the cluster is. OpenMP (shared memory) will always be on the same node, i.e. shared memory is always limited to the number of CPU cores available on one node.
  • What is the difference between mpirun and sbatch?
    • sbatch will add the job to the slurm queue. It will run on a compute node once it clears the queue
    • common suggestion: forget about mpirun unless you are testing on an already-allocated node and are advanced enough to know what you are doing. Always use 'srun your_binary': it takes care of all the MPI settings you need, including the number of processes, and it communicates with SLURM, providing all the technical info you may need later, like memory usage (a minimal MPI submit script is sketched at the end of this section).
    • mpirun does not know about slurm, so it will just run on the login node.
      • with mpirun, does it work similarly to sbatch, just with different resources being used?
        • sbatch does more: it queues the job and runs it on the compute nodes. mpirun just runs "here" (it essentially won't use any nodes from the cluster except the login node, which is normally not what you want).
  • How does one install MPI?
    • normally you can load most tools by loading the respective module; in the case of MPI it should be loaded automatically on all nodes.
    • MPI by itself is a "standard" (www.mpi-forum.org) and it has several implementations; the most popular is OpenMPI, but Intel has its own implementation, and there are others. If you need MPI for compiling/linking code, you can load an MPI package of your choice as a module from the list of preinstalled MPI packages (like 'module load openmpi'). If you run MPI-ready software, it loads the needed MPI libs on the fly once you load the application through a module (try 'module load lammps' and then 'module list'; you will see all the packages that were loaded along with lammps itself).
  • I guess the -n flag for mpirun is capped at the maximum number of processors for the total nodes selected, right?
    • Depends on the setup. If it's not capped, it can run more processes than there are processors, which is rarely useful.
      • Oh! So how does one cap it manually?
        • You do that when you install / set up MPI. On the cluster you don't do it, the admins should have.
  • Also, was lammps already mpirun 'aware'?
    • Lammps is MPI-aware.
  • slurm-command is something triton specific?
    • Right, a wrapper to other commands: squeue, sacct, sinfo, see /usr/bin/slurm, it is plain text
    • correct, written by Tapio Leipälä and developed further by other ASC team members. You can use the following commands to get the same info:
      • sinfo
      • squeue
      • sprio
      • sstat
      • sshare
      • scontrol
      • sacct
    • here a handy cheatsheet for all those commands https://slurm.schedmd.com/pdfs/summary.pdf (straight from the creators of slurm)
  • Is it possible to put a chain of batch jobs in one .sh file? I have noticed that if a shell script is executed at ./, then no matter what the working directory is, the output of the shell script is written at ./. What should I do to specify the output file elsewhere (overriding the global preset)?
    • It is also possible to use more advanced features of slurm such as --dependency=afterok:jobid to set a batch job with dependency to other job. See sbatch-documentation for more information on this feature. Simo
    • Files are written to your working directory by default. You can change it in a script using the cd command.
    • You can specify the full path of a file (e.g. /scratch/school/user/my_file)
      • I tried to use cd or write the full path in the shell script, but what I wanted to do is: run [first programme] with output at /A, then run [second programme] with output at /B, so that each programme's output is placed accordingly. However, I have seen that even if I use cd or even specify EXPORT, the output is placed where my shell script was executed. (Does this depend on the programme?)
        • If you did not specify the directory to the programme, it should only know the working directory. Maybe try printing the working directory (pwd) after cd /B.
        • I have to change directory to let [programme 2] to run, else it will end up with error, so I am pretty sure it changed its directory, as I usually do not type the full path for it to run a programme.
        • How do you write from the program? Do you use a pipe (> or >>)? Do you specify a filename as an argument? Or does the program just create a file on its own?
          • I was not even using pipe, they are in different lines.
        • Is the output file called slurm-??????.out? That one is created by slurm, so it is always in the folder from which you submitted the job. It contains the standard output of all your commands. To avoid this, use command > filename.
          • Will this not capture all my output? For example, one of my jobs in the same shell script is cp *.txt ../; with command > filename, would it not write that into filename?
          • You can put this in the script and use a different file name for each command. (It should not affect the cp-line anyway, since it should not output anything.)
          • So for example:

                #!/bin/bash -l
                #SBATCH ....

                cd /A
                command1 > outputfile
                cd /B
                command2 > outputfile

          writes the files /A/outputfile and /B/outputfile
          • right, ok, I will try and see how it works :D Thanks!
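
A hedged sketch pulling the MPI pieces above together into a submit script; the module name and the binary my_mpi_program are placeholders, so adapt them to whatever your cluster provides:

    #!/bin/bash -l
    #SBATCH --time=01:00:00
    #SBATCH --mem-per-cpu=1G
    #SBATCH --ntasks=4            # number of MPI processes

    # Load an MPI implementation (or the application module, e.g. lammps,
    # which pulls its MPI libraries in automatically)
    module load openmpi

    # srun starts the 4 MPI processes and passes the Slurm settings to them;
    # no separate mpirun is needed inside the batch script
    srun ./my_mpi_program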

Interactive jobs

https://scicomp.aalto.fi/triton/tut/interactive/

  • What is the difference between srun and sinteractive? And if I run over the time limit, I guess the job will be terminated automatically, am I right?
    • sinteractive is again an in-house developed script (/usr/local/bin/sinteractive) that does the node allocation and the ssh to the node on your behalf, so you get an interactive shell. srun is for interactive execution on a remote node: you run one program, and once the execution is done, your session is done, while with sinteractive you run a shell session and it remains open even if nothing is running in it. Regarding termination, you can set the time limit with 'sinteractive --time=...'; the default is 2 hours (see the sketch after this list).
    • Yes, it will be terminated (after a grace period).
    • With srun, slurm will just run your script in the background.
      • not correct, try 'srun hostname', it goes to the queue
    • With sinteractive, it opens a terminal you can interact with.
  • How do I escape from an sinteractive job and use my login node to work during that time? And I guess I can do the same with srun?
    • You can close the interactive job by typing exit.
      • so if you exit, is the interactive session gone forever, so that you will not be able to resume your previous work?
        • yes. But you still have the files and can start another interactive job.
        • Sinteractive is an ssh to an allocated node. It takes over your terminal and goes to another node.
    • Otherwise just open another terminal and ssh into the cluster.
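
A minimal sketch of the two interactive approaches discussed above; sinteractive is the Triton-specific wrapper, while the srun --pty form works on most Slurm clusters (exact option names can vary by site, so check your cluster's documentation):

    # Triton wrapper: allocate a node and open an interactive shell there
    sinteractive --time=01:00:00 --mem=2G

    # Generic Slurm equivalent: run a shell as an interactive job step
    srun --time=01:00:00 --mem=2G --pty bash

    # Leave the interactive session (and end the allocation) with:
    exit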

Have you managed to run srun hostname in the queue:

  • yes: oooooooooooooooooo
  • no: oo
  • I'll run turso examples in the Zoom Break Out Room 1 now. Please go there if you are turso (UH) user.
  • Oulu University users in Zoom Room 2
  • TUNI users in Zoom Room 3

    --pty (from 'man srun'): Execute task zero in pseudo terminal mode. Implicitly sets --unbuffered. Implicitly sets --error and --output to /dev/null for all tasks except task zero, which may cause those tasks to exit immediately (e.g. shells will typically exit immediately in that situation). This option applies to step allocations.

How do you terminate a job if it does not go through? What is the command to terminate?

Break until xx:11

  • then we go to the exercises of interactive jobs

Interactive jobs was:

  • useful: ooooo
  • not useful: o
  • too chaotic: ooo
  • too fast: ooooooooo
  • too slow: o

Exercises until xx:38

https://scicomp.aalto.fi/triton/tut/interactive/#exercises
Try to get done with 1-2, don't worry if you can't get farther than that. If you can't do any, that is OK too.

git command to get the examples: git clone https://github.com/AaltoSciComp/hpc-examples.git

I am:
  • done: oooooooooo
  • need more time: oo
  • not trying: o
  • I want you to demo the exercises: o

  • It is good to be in your Work directory before running git clone...
    • cd $WRKDIR
    • At Aalto cd /scratch/work/MYUSERNAME/
  • I got this error when I ran srun --mem=500M python hpc-examples/slurm/memory-hog.py 50M
    • me too; the cluster is Carpo2/Oulu (Room 2 on Zoom)

      srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    • What cluster are you on?
    • csc puhti
      • In puhti, you need to specify a project that you want to use for the billing with --account=<account name>. See CSC interactive usage docs for more information. Simo
      • thanks
  • How can we clean the CLI and remove previous text?
    • clear will clear the terminal window of previous commands.
    • Thank you!
  • If you are blocked, and you know it, come to zoom!

  • for me exercise 1.c -> it starts failing with 2.5G

  • Why do I have an error in Puhti?

    • Can you paste the command and the error here?
      slurmstepd: error: execve(): memory-hog.py: Permission denied
      srun: error: r18c48: task 0: Exited with exit code 13
    • Is the file executable? Try running chmod +x memory-hog.py
      • This is not needed if the command run is "python memory-hog.py" # assuming you are in the folder where the file.py is stored.
      • Thanks, it works with the chmod +x
    • python ../../memory-hog.py maybe?
      • written like that, it means you are two subfolders below the folder where the "memory-hog.py" file is stored (parent/parent/memory-hog.py)
    • Can you paste the full command you are running?
      • e.g.: srun mem=500M python hpc-examples/slurm/memory-hog.py 3G
  • how did you zoom out from the screen in the terminal?

    • control + or - possibly. depends on your terminal
  • On turso is there a reason to use other partitions than test sometimes?

  • Do I need to get a cat to be better at coding? ;)

    • it reminds you to take breaks!
    • PERFECT ANSWER!
  • What is the UH analogue for slurm history?

    • sacct is the raw command, but it can be hard to make the output look as nice.
    • feel free to copy: https://github.com/jabl/slurm_tool, or here is what it does: 'sacct -s ca,cd,f,to,nf,s -Pno jobid,jobname,start,reqmem,maxrss,totalcpu,elapsed,ntasks,alloccpus,nnodes,exitcode,state,user,nodelist -S 2022-02-02T14:16 -E Now', with formatting on top of that. And one is supposed to set -S, the starting date/time
    • Just sacct on its own gives a similar output (If I recall correctly it shows you the jobs from the last 7 days)
  • I contacted Tapio Leipälä to ask whether the slurm command is available online and could be distributed. It seems very portable.

    • How to make the slurm command available on Puhti?
      • Let's wait what the author says
      • Tapio now works at CSC, but the script was developed while he was still working at Aalto, and we later developed it further (Janne, Ivan, Mikko); here you go: https://github.com/jabl/slurm_tool

Break until xx:05

Interactive jobs was:

  • useful: ooooooooo
  • not useful: o
  • too chaotic: o
  • too fast: o
  • too slow: o

More questions:

  • How do you terminate the job?
    • If you're running it with srun, Ctrl+C will stop it (this also works in the macOS terminal)
    • A submitted job, one pending in the queue, can be killed with 'scancel JOBID', where JOBID is the number you got after submission (or you can look it up with 'slurm queue').
  • How could I terminate interactive shell when the time is over?
    • exit will log you out (but be careful not to log out of the login node by running it one more time)
  • If I need to run something on a huge dataset, where do I actually store it in the cluster ? (not necessarily interactive)
    • We'll talk about that later today.

Serial jobs

https://scicomp.aalto.fi/triton/tut/serial/

  • Does the SBATCH time in a sbatch script reserve time for each srun separately (if many are in sequence) or for them all?

    • What do you mean by a sequence?
    • I mean that if I have multiple srun commands in the script
      • OK. The time is the full run time for all the steps.
      • Each step can additionally be given its own time limit (see below).
    • For advanced users: you can give --time= to every srun within the submit script, to make sure that one particular step does not take more time than expected. There are also several techniques to run srun steps within the script in parallel, like sending them to the background, but that is outside the scope of this kickstart (see the sketch after this list).
  • Why is the -l option needed for bash in #!?

    • It behaves like a "login shell", a shell you get when you log in to the system. This is probably what you are used to.
    • It is also a common trick for ZSH users (since on Triton we have some). If you are a native BASH user, you can safely skip '-l'.
      • If zsh is your default shell (you can type echo $SHELL to check) and would rather use bash, you can change it with chsh on kosh.aalto.fi (at least in Aalto).
  • Why do we need srun in the script, why not just echo whatever?

    • srun lets you record resource usage per step. It's not critical, but can be nice (see the extra row in slurm history when using srun).
    • srun is a big thing in itself: it implements what 'mpirun' does, with all the options that you need (i.e. for MPI jobs). srun can take pretty much all the command-line options that you can put into the sbatch script: you can define the amount of memory, time, CPU cores, etc. that can be used by the step. An srun is a step within your job, and there can be many sruns within one submit script. See 'man srun' for the details. We recommend using srun for every single big step in your submitted job; it helps to distinguish the stages in your run and to debug and diagnose if issues occur. You get a report of resource usage for every single step, instead of total numbers only.
  • Could I load all the modules myself, and then delete the line for loading modules?

    • You could load the modules in the ".bashrc" (= the file that sets the default settings for your bash), but for reproducibility I would explicitly have "module load" in the .sh sbatch file. Even better, specify which version of the module, so module load matlab/r2017b is better than just module load matlab.
  • On turso, when I run sbatch hello.sh and then sacct, the partition column is empty. Is this because there is a separate line for the actual job?
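
A minimal sketch of a serial submit script with several srun steps, as discussed above; the module and script names are placeholders, and the per-step time limits are optional:

    #!/bin/bash -l
    #SBATCH --time=02:00:00        # total wall time for the whole job
    #SBATCH --mem-per-cpu=1G
    #SBATCH --output=serial_%j.out

    # Load (and pin, if possible) whatever your code needs
    module load anaconda

    # Each big step as its own srun, so usage is reported per step
    srun --time=00:30:00 python preprocess.py
    srun --time=01:00:00 python analyse.py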

Exercises until xx:55, then break until xx:05

https://scicomp.aalto.fi/triton/tut/serial/#exercises
This is the most important lesson of the course, so try to get 1-3 done and ask for help if you need it.

I am:

  • done: ooooooooo
  • need more time: ooo
  • not trying: o
  • they were too simple: o
  • they were too hard: o

Self learning materials for Shell / Command Line: https://aaltoscicomp.github.io/linux-shell/

  • (from the Twitch chat:) Will we cover array jobs and dependencies?

  • Do we always need to specify the time and memory for batch scripts?

    • Short answer is yes. If you don't specify them, then default time (15 minutes at Aalto) and default memory (500M at Aalto) are used, and these numbers are usually too small
  • What if the time specified for a batch script is not enough for the total work? Would the job get terminated when the time is up, with no final result left?

    • yes. A well-written HPC code saves intermediate results and can continue from near where it was cancelled.
  • exercise 3: do we write the program in another file? Is it a Python program? Could we write it directly in the bash file?

    • The program is a bash script. You can paste it directly into the slurm script, or into a separate file.
      • how would we use srun for such a program ?
        • The program you're actually running is the sleep-program inside the for-loop. You can put srun in front of it and check how different the slurm history is after the job has finished.
    • Bash has features such as for-loops and if-else-structures that allow for all kinds of nice tricks. Using it as a glue around your actual code is often helpful.
  • With the scancel command, can we accidentally cancel other users' jobs due to a typo, or does it only apply to our own jobs?

    • You cannot cancel other users' jobs by accident. Only admins can cancel other users' jobs.
    • But feel free to try it out and see the result :)
  • My code in Ex3 fails with error "syntax error near unexpected token do":

    #!/bin/bash -l
    #SBATCH --time=01:15:00
    #SBATCH --mem-per-cpu=500M
    #SBATCH --output=hello.out
    srun for i in $(seq 30); do
    date
    sleep 10
    done
    
    • you don't need "srun", or at least written it like that only the first line goes to "srun" while the rest is out of it
    • In a real world case, the srun would probably go before sleep 10. It is typically the parallel part that is run with srun. Most of the lines in the script are run only by a single program in a single node of the cluster, while the srun will launch with the full multi-tasking power. Multiple copies of the command possibly on multiple computation nodes.
    • It could be like below, where 'sleep' and 'date' are supposed to be the main steps in your script body (in this dummy case both sruns are pointless, though):

        ...
        for i in {0..30}; do
          srun date
          srun sleep 10
        done
    
  • I noticed that if there is an error in the code itself, it does not really say that something went wrong, i.e., the job gets "submitted" by sbatch but never starts. So when checking what's in the queue using slurm q, it just shows that no jobs were started. Is there a better way to check that the code contained some error? Or is it not being queued the way to check?

    • The error would be visible in the log. Ideally you run the commands interactively first, and once you know that they work, you run them non-interactively with sbatch. This is also why it is good to have the output log named with the job ID, so that you can always debug what went wrong in a specific job (see the sketch after this list). See the slurm reference https://slurm.schedmd.com/sbatch.html and search for "output".
  • sometimes you "accidentally" queue hundreds of array jobs and scancel saves the day (happened to a "friend" of mine)

  • Would you mind explaining later how to set up such a shorthand way for connecting to Triton? Seems quite useful!

  • tail -f: opens file and prints new lines as they appear. Control-c to exit.
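
A small sketch tying together the output-log and monitoring tips above, assuming a hypothetical submit script job.sh; %j is expanded by Slurm to the job ID:

    # In job.sh, name the log after the job ID so runs don't overwrite each other
    #SBATCH --output=job_%j.out

    # Submit and note the job ID that sbatch prints
    sbatch job.sh            # e.g. "Submitted batch job 123456"

    # Follow the log while the job runs (Ctrl-C stops tail, not the job)
    tail -f job_123456.out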

Monitoring

https://scicomp.aalto.fi/triton/tut/monitoring/

Modules

https://scicomp.aalto.fi/triton/tut/modules/

Data storage and remote data access

https://scicomp.aalto.fi/triton/tut/storage/
https://scicomp.aalto.fi/triton/tut/remotedata/


Feedback day 2

Today was:

  • too slow: o
  • just right: ooo
  • too fast: oo
  • useful: oooooooooo
  • not useful: o
  • boring: o

One good thing:

  • The exercises were doable and not overwhelming
  • Exercises were really helpful!

One thing to be improved:

  • Maybe a lunch break (some 20 mins) would have been nicer
    • good point! In our timezone it's right after a typical lunchtime, but maybe we could split in two?
  • Use less time for things that are specific to the environment at Aalto

Question

  • I have data on a teamwork drive at Aalto. What is the best way to access it on Triton?

    • Short answer: copy (with rsync) the data to /scratch/, work in /scratch/, and copy what you need to preserve back to teamwork (see the rsync sketch after this list). Reasons: scratch is a huge filesystem attached to all computing nodes and very efficient for computations. If an array job with hundreds of parallel jobs were to write to teamwork, we might take teamwork down :) However, scratch is not backed up because of its large size. That might seem like a negative thing, but in the long term it makes you more aware that you need a reproducible workflow, so that you can reproduce your outputs by rerunning one (or a few) scripts. Having a reproducible workflow is very important, especially if you publish and during peer review you are asked to rerun everything with some parameter changed. (enrico)
  • I have recently been introduced to the 'local disk' at CSC; can you explain more about how this compares to the normal way we submit bash scripts to slurm?

    • If it is like what we have at Aalto, then it is a temporary disk physically in the node where your job runs, so it is even faster to read/write. Sometimes with GPUs you need to read lots of data fast, so in those cases you might first copy the data to the local disk and then start computing. When the job is over, what is on the local disk is deleted.
      • then how is it different from /scratch?
        • /scratch at CSC is a massive file system shared among all the nodes. It is the main work directory; similarly at Aalto. Local disks are only on the node itself, and they are cleaned after the job on the node finishes.
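
A hedged sketch of the copy-to-scratch workflow described above, with placeholder paths and username (check the storage tutorial for where your teamwork drive is actually mounted):

    # Copy input data to your scratch work directory
    rsync -avz /path/to/teamwork/mydata/  /scratch/work/myusername/mydata/

    # ... run your jobs against /scratch ...

    # Copy only the results you need to keep back to the backed-up location
    rsync -avz /scratch/work/myusername/results/  /path/to/teamwork/results/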

for tomorrow:

  • on turso, I noticed that sacct does not display all my jobs. I am thinking that this is because they are moved to a different cluster. I see them with sacct -L. Is this correct?

    • sacct also accepts -M (cluster name) argument, or if every cluster is desired, then -M all
  • Thank you for clarification. How to choose the volume of memory and GPU/CPU usage for JupyterHub? After logging in I cannot see any choice for these parameters.


