TTT4HPC (Tuesdays Tools & Techniques for HPC)
tags: Training, TTT4HPC
Timeline for development (in weeks)
9/April meeting
Agenda:
25 I have access to Aalto University Triton cluster
11 I have access to CSC (Puhti and Mahti)
5 I have access to LUMI
4 I have access to NRIS/Sigma2 clusters
3 I have access to Uppsala University's UPPMAX
0 I have access to another cluster (please write which one in the comment box)
1 I do not have access to any HPC cluster, but I still want to watch and learn
7 I have access to Tetralith at NSC, Linköping
5 I have access to Dardel at PDC, Stockholm
3 I have access to Leonardo Booster (Italy)
9 I have access to another cluster (please write which one in the comment box)
2 I do not have access to any HPC cluster, but I still want to watch and learn
assigning roles for day 1
zoom studio room
List of supported clusters and persons who will test the exercises
List of people who will test the exercises
Days/content
Below is mostly a copy-paste from the past, plus a whole new day for Singularity. Please comment in a way that makes it clear which content is new, for example: EG comment: I think this is great!
1. Tue 16/04/2024 :: Computational resources (memory/cpus/gpus, monitoring computations, monitoring I/O, local disks/ramdisks) -> this will become day 1
Suggested coordinator: Jarno
Helpers/Instructors/LessonsDevelopers: Richard Darst, Simo Tuomisto, Diana Iusan, Dhanya Pushpadas, ??, ??
1.1 Benchmarking & choosing job parameters (50 min) (DI, RB: we have material in Norway for memory and num of cores calibration, DI: same in SE, I can contribute with something)
1.2 Monitoring I/O (~50 min) (JR, ST)
Three points
Motivating example/exercise:
strace -c -e trace=%file,read,write ./command
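A minimal sketch of how this could look in practice (the wrapped commands are placeholders):

```bash
# Count file-related, read, and write syscalls; -c prints a summary table at exit
strace -c -e trace=%file,read,write python my_analysis.py input.dat

# Add -f to also follow child processes, e.g. if the program forks workers
strace -f -c -e trace=%file,read,write ./command
```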
Container/archive formats for data
story 11
Three points
Example case: conda+container? better data formats?
mldb? Loads the data into memory, so only for small datasets.
webdatasets
Using local disks and ramdisk
Three points
https://github.com/hoytech/vmtouch (tool to mention perhaps)
Motivating example/exercise:
UPPMAX material:
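A minimal sketch of the staging pattern this episode could demo, assuming the cluster exposes node-local disk via $TMPDIR (conventions differ per cluster; paths are placeholders):

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=8G

# Stage the input to node-local disk instead of reading it repeatedly
# from the shared filesystem.
cp /scratch/project/input.dat "$TMPDIR"/

# Run against the local copy.
cd "$TMPDIR"
./analyze input.dat > result.out

# Copy results back before the job ends: local disk is wiped afterwards.
cp result.out /scratch/project/results/
```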
2. Tue 23/04/2024 :: Working on clusters (interactive sessions, data (and code) access/moving/versioning, graphical tools)
-> this will become day 2
Suggested coordinator: Samantha
Helpers/Instructors/LessonsDevelopers: Enrico Glerean, Jarno Rantharju, Hossein Firooz, ??, ??
2.1 From laptop to cluster: Syncing code (and data) (45 min) (EG, RD, SW)
2.2 Side episode: sshfs, short demo (10 min) (??, ??)
2.3 Interactively working on a cluster (45 min) (EG, RD, SW)
2.4 Remote interactive example: vscode (25 min) (HF, RD)
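A minimal sketch of the kinds of commands episodes 2.1-2.3 might demo (hostnames and paths are placeholders):

```bash
# 2.1: push local code to the cluster, excluding bulky artifacts
rsync -av --exclude '.git' --exclude 'data/' ./myproject/ cluster:~/myproject/

# 2.2: mount a cluster directory locally over SSH (unmount with fusermount -u)
sshfs cluster:/home/me/myproject ~/mnt/myproject

# 2.3: request an interactive shell on a compute node via Slurm
srun --time=01:00:00 --mem=4G --cpus-per-task=2 --pty bash
```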
3. Tue 07/05/2024 :: Containers on clusters (everything from zero to infinity)
Suggested coordinator: Simo Tuomisto (+ Enrico Glerean)
Helpers/Instructors/LessonsDevelopers: MP, DP, ??, ??
Note, text below has part 3 and 4 from a brainstorm with ChatGPT, + past discussions with Simo, + Enrico's ideas
3.1 Introduction to Singularity for HPC (30 minutes)
3.2 Basic Singularity Commands (30 minutes)
3.3 Advanced Features and Parameters (30 minutes)
3.4 Hands-On Exercise Session (30 minutes)
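For 3.2, a minimal command sketch (image name and bind paths are just examples):

```bash
# Pull an image from a registry and convert it to a .sif file
singularity pull docker://python:3.11-slim

# Run a single command inside the container
singularity exec python_3.11-slim.sif python --version

# Open an interactive shell inside the container
singularity shell python_3.11-slim.sif

# Bind-mount a host directory (e.g. scratch) into the container
singularity exec --bind /scratch:/scratch python_3.11-slim.sif ls /scratch
```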
4. Tue 14/05/2024 :: Parallelization
Suggested coordinator: Thomas Pfau
Helpers/Instructors/LessonsDevelopers: Radovan Bast, Pavlin Mitev, Diana Iusan, Simo Tuomisto, Teemu Ruokolainen, ??
4.1 Parallelizing code without parallelizing (TP, RB, PM(15) DI interested in Slurm solutions, SW) (90 min)
4.2 Workflow automation tools (TR, ??)
4.3 Hyperscaling pitfalls (??, ST)
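For 4.1, a minimal sketch of "parallelizing without parallelizing" via a Slurm array job (file names are hypothetical):

```bash
#!/bin/bash
#SBATCH --array=0-99
#SBATCH --time=00:30:00

# Run the same serial program on 100 different inputs in parallel,
# without touching the code: each array task picks its own input line.
INPUT=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" inputs.txt)
./serial_program "$INPUT" > "output_${SLURM_ARRAY_TASK_ID}.txt"
```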
Day 4 practicalities
Main: TR and RD
Host: EG
Screenshare: always Richard
Lead: usually done by Richard and explicitly pass the mic to Teemu
done: EG move pitfalls under concepts and add bullet points in the pitfalls pages to avoid the wall of text on the stream
done: make sure they are somewhere (e.g. conclusion) TP
ZOOM EXERCISES:
done: EG can add the coderefinery snakemake exercise as an optional exercise if somebody wants to try that
Other issues:
Meeting 27.3.
Agenda:
Quickly go over existing docs
Conceptual idea
Comments
Next steps:
Organizational
Existing "starting" repos:
14/Feb Meeting agenda
OLD CONTENT HERE BELOW
December 2023 meeting summary and voting
Poll 1: Vote for a name
Note: "Workflows" might be confusing for those using snakemake or nextflow. For some options I tried to keep the name of the week + word starting with same letter.
Poll 2: Vote for the Days
Notes:
December 2023
December 2023 meeting agenda
Actions and comments
Old overall plan
Schedule
Day 1: doing actual work with a cluster
From laptop to cluster: Syncing code (and data) (45 min) (EG, RD)
Side episode: sshfs, short demo (10 min) (??, ??)
Interactively working on a cluster (45 min) (EG, RD)
Remote interactive example: vscode (25 min) (RD, ??)
Day 2: managing resources
Benchmarking & choosing job parameters (50 min) (ST, RD, RB: we have material in Norway for memory and num of cores calibration, DI: same in SE, I can contribute with something)
Monitoring I/O (~25 min) (??, ??)
Container/archive formats for data (25 min) (DP, ??)
Using local disks and ramdisk (~25min) (??, ??)
The filesystem-related topics
Day 3: making the most out of a cluster
Parallelizing code without parallelizing (TP, RB, PM(15) DI interested in Slurm solutions) (90 min)
Workflow automation tools (??, ??)
Hyperscaling pitfalls (??, ST)
February 2023 Meeting
Let's check where we are
When: 13:00 CET
Where: https://aalto.zoom.us/j/69608324491
Agenda
Old stuff for reference
Task 1 - Let’s describe real problems that the course + teaching materials can solve (a.k.a. user stories)
Write a list of potential questions a user might have that could be answered in this workshop (these are like “user stories”). The first examples below are based on the topics above. Feel free to be redundant and write similar questions; we can then merge things together (e.g. see the sweeping-parameters question below).
I want to parallelize my code, how do I do it in practice? (TP: could do this)
I develop on my laptop/workstation and also on the HPC cluster, how can I keep things in sync with minimum effort?
I want to automate my workflow, where do I start?
I need to sweep through multiple parameters, how do I code that? TO MERGE WITH 1
How can I work interactively (non-gui or gui) on a cluster?
I am getting different results on my laptop and on the HPC cluster: how can I move my environment around? (this could be not just conda, but also containers)
Do I need to parallelize within my code to compute things at the same time? (What if parallelizing at the code level is unfeasible, or the skills are lacking?)
Why would I want to put effort in learning workflow tools when I can do a lot with bash scripts?
What are good tools and best practices for developing and small-scale testing of code on a cluster?
I would like to run a graphical IDE on HPC, how can I do that? And should I do that? (this could be matlab, rstudio, comsol, spyder, etc) (remote ssh in vscode?)
I am out of file-number quota, I have 1M+ small files as results of my analysis, is there a way to optimise disk space, e.g. by tarring them all together? Remapping them to a database? Using other file formats? Will I be able to read them again if I merged them or do I have to “unzip” them each time?
again containers might save the day
Is it better to fix this issue before writing 1M tiny output files, or to post-process and gather all the results afterwards?
Easily becomes an info dump, as there are lots of different file formats. See for example file formats in Python for SciComp. Good examples could avoid this.
File cleaning is important (a good habit to learn beyond how to practically do it: if everything can be reproduced, there is no need to hoard files). "Live as if scratch were to die tomorrow."
Think of "data appraisal": what is truly important
Is this a good story?: ooo
The I/O of my code is slow and I have heard that I could take advantage of the local disk of the computational node: how can I do that in practice? Manually move files around? What if the job fails - will I lose the intermediate results? (DI interested in contributing, but it would be good to have someone else from a different center)
I have x TB of data to analyze, how do I get it on the cluster, where and how do I store it? How and where do I share the results?
I want to collaborate with others on a task using HPC, how do we share and organize code and data?
Pavlin Mitev: Here is the situation - you have written a serial Python code that runs and everything is fine. You have done some reasonable optimizations and cleaning of the code, but there is this problem… You have to run the code on (let's say 500,000) inputs from a single file (a Molecular Dynamics trajectory is the real data behind the inspiration for this tutorial). Just to make things worse - you CANNOT load the entire set of inputs into the computer memory to use the traditional methods of applying a function to each element of a list… https://github.com/pmitev/almost-embarrassingly-parallel-python - the content online is working, but it is far from complete… and perhaps too long to be a part/section…
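Pavlin's repo approaches this in Python; a bash-level sketch of the same idea, assuming the input file is line-oriented, would be to chunk the file and fan it out as a Slurm array:

```bash
# Split the big input into 100 roughly equal chunks without breaking lines
split -d -n l/100 trajectory.txt chunk_

# Then, in a job script with: #SBATCH --array=0-99
./process_chunk "chunk_$(printf '%02d' "$SLURM_ARRAY_TASK_ID")"
```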
I need to run a set of similar jobs and after they are done and only after they are done, I can start a follow-up step with the outputs from step 1 (DI interested)
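A minimal sketch of how Slurm job dependencies cover this story (script names are hypothetical):

```bash
# Step 1: submit the set of similar jobs as one array, capturing the job id
jid=$(sbatch --parsable --array=0-49 step1.sh)

# Step 2: starts only after ALL step-1 array tasks have finished successfully
sbatch --dependency=afterok:${jid} step2.sh
```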
I am unsure where to parallelize: Inside the code? inside the Slurm script? Outside the slurm script? What are typical approaches and pros and cons?
What are the options to connect to a cluster and do some work? (SSH, OOD, jupyter,…)
How should I arrange my project efficiently?
How should I arrange my project?
Data harvesting from an API
COMSOL
Effective use of conda -> see also story 6
Data collection
Implementing "workers" for doing very large parallel jobs rather than thousands of array jobs (Pavlin, in that case it was done in bash because user did not want to switch to snakemake or similar workflow tools)
Is it better to write a script and never check your jobs, or to check your jobs and "waste time"? Where is the good balance between the two?
- Enrico: I just had this conversation with a user
What tools do I have available? What can I do with them?
- Counterargument: first I need to have a problem, and then I should learn about the tool
Presenting good general tools useful for many (some IDEs, some parallelization tools)
Scaling calculations on a cluster: e.g. how to estimate a job's runtime from a smaller job's runtime, how to estimate memory consumption from how the data size scales, how to estimate how long a full analysis takes based on how long a single analysis takes, and how to decide whether optimization is needed.
Benchmarking/profiling (big-picture, overall efficiency)
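A minimal sketch of the measure-then-scale workflow, using standard Slurm accounting tools (seff is a Slurm contrib and may not be installed everywhere; <jobid> is a placeholder):

```bash
# After a small test run, check what the job actually used
seff <jobid>                      # CPU and memory efficiency summary
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem

# Then extrapolate: if a 10%-size test needed 2 GB and 20 min,
# request roughly 20 GB and a few hours for the full run, plus margin.
```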
Other questions to answer
Notes
Zoom chat copy of what is relevant
From Simo Tuomisto to Everyone 11:01 AM
https://github.com/bast/singularity-conda
From Radovan Bast to Everyone 11:01 AM
i use it personally mostly to get python codes to run on my NixOS which is very strict about library dependencies
From Simo Tuomisto to Everyone 11:02 AM
https://github.com/CSCfi/hpc-container-wrapper.git
From Radovan Bast to Everyone 11:02 AM
i like that it forces people to document their dependencies in a file
From Pavlin Mitev to Everyone 11:35 AM
https://pmitev.github.io/UPPMAX-Singularity-workshop/
From Richard Darst to Everyone 11:52 AM
https://coderefinery.zulipchat.com/#narrow/stream/141114-help/topic/inspection.2Fperformance.20monitoring.20tools/near/308556198
From Radovan Bast to Everyone 12:05 PM
suggestion: it would be good to incorporate/synthesize the user stories and add instructions in writing on what we expect from all as the next step. this will also allow those who were not here (Sabry, Samantha, Matias) to join
From Pavlin Mitev to Everyone 12:27 PM
https://github.com/hoytech/vmtouch
What is it good for?
Discovering which files your OS is caching
Telling the OS to cache or evict certain files or regions of files
Locking files into memory so the OS won't evict them
Preserving virtual memory profile when failing over servers
Keeping a "hot-standby" file-server
Plotting filesystem cache usage over time
Maintaining "soft quotas" of cache usage
Speeding up batch/cron jobs
And much more…
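A minimal usage sketch for the points above (file names are placeholders):

```bash
# How much of this file is currently in the page cache?
vmtouch -v bigdata.dat

# Load ("touch") a directory tree into the cache before a read-heavy job
vmtouch -t dataset/

# Evict it from the cache again when done
vmtouch -e dataset/
```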
Extra
* Containers (+ conda?) (50 min) (ST)
* PM: Singularity, perhaps conda and/or python-venv
* Three points:
  * What is a container?
  * Basic usage of Singularity
  * The benefits of packaging code into a container.
* Is this going to day 2?