# TTT4HPC 14/05/2024 ARCHIVE

## Tuesday Tools & Techniques for High Performance Computing - Episode 4 :: Parallelization and workflows

:::danger
## Infos and important links
- Program for the day: https://coderefinery.github.io/TTT4HPC_parallel_workflows/
- Materials: https://coderefinery.github.io/TTT4HPC_parallel_workflows/
:::

*Please do not edit above this*

# Tuesday Tools & Techniques for High Performance Computing - Episode 4 :: Parallelization and workflows

## Icebreakers

Test this collaborative notes document, click the pencil icon on top and start writing!

### 1. This is the last episode of the TTT4HPC series, will you go back to the recordings of the other sessions?

CodeRefinery YouTube channel: https://www.youtube.com/channel/UC47aupE7HKGduAjXKt1Gwrg

- yes, I often use YouTube to learn new things
- maybe, more likely to use the written material
- Lecture material first, recording maybe later.
- ...

### 2. What is your research domain and can you parallelize your analysis? How many parallel instances can you run?

- Neuroimaging. We usually preprocess each subject's brain images in parallel. I think my maximum was 300 subjects in parallel using Slurm array jobs. Lots of I/O stress for the disks, since neuroimaging pipelines are well known for generating hundreds of thousands of files...
- Quantum chemistry. Using 24 CPUs at the moment.
- Bioinformatics/computational genetics
- Bioinformatics, omics data
- ...

### 3. Which (researcher) skill do you want to learn or wish to improve next?

- installing OSS using 'standard' libraries
- What to think about when automating a workflow. And anything I have not known yet about parallel computing and HPC would be great to learn.
- databases

## Questions and comments

- Did anyone else just get a network error on Twitch? (reloading the page fixed it)
    - yep, same here
- ...
- ...

## 1. Motivation

https://coderefinery.github.io/TTT4HPC_parallel_workflows/motivation/

- Sometimes it feels that my laptop's processors are faster than the same number of processors on the HPC cluster. Why is my laptop faster?
    - Very good question that most likely others have also experienced when working with HPC. The reason is mostly related to the hardware architecture: the CPUs of your laptop are more optimised and faster in terms of operations per second. The good thing is that on a cluster you can request more CPUs than the ones you have on your laptop AND you don't need to keep the laptop open, with fans spinning at full speed, waiting for something to finish.
    - Laptop CPUs usually focus on reducing the latency that the user experiences (they have a fast boost clock speed for short bursts of time). This works fine when you have thermal headroom (the laptop is cool) and you can temporarily, essentially, overclock the CPU. In HPC clusters the CPUs are typically in constant use, so they are often optimized for running at 100% load for long periods of time. Thus they often saturate at a performance ceiling that might be lower than your laptop's.
    - However, HPC clusters are much more **wide** than your laptop. They have lots of CPUs while your laptop has only a few. Plus, your laptop won't burn your lap while you're doing work on the HPC cluster.
- ...

## 2. Concepts - "embarrassingly/trivially/beautifully" parallel code

https://coderefinery.github.io/TTT4HPC_parallel_workflows/parallelization/concepts/

- Is it better to have more parallelization and shorter jobs? Or longer jobs and less parallelization? Where is the optimal point? Any rules of thumb?
    - Usually there is overhead in setting up a job. The Slurm queue manager needs to manage the jobs, and large numbers of jobs can cause problems for the queue manager. In addition, each time you launch a new job you need to set up the calculation environment (e.g. get application files, get data files), which can create big I/O bottlenecks in your code's execution.
    - At Aalto we recommend at least 15-30 minutes per job as a minimum. So if your single parallel unit takes a few seconds, it is better to batch them together, otherwise you will end up spending more time waiting to access resources and for the computations to start (and affect your "fairshare", lowering your priority for getting access to resources). A sketch of this batching pattern is shown below.
    - One good rule of thumb: ask yourself when you expect the results to appear. If you want to view them immediately, it might be a good idea to run a short job with **part** of your whole workload and view the results. If you expect the results to be done by tomorrow, you can easily bunch multiple small jobs into bigger jobs that take longer, because you won't be checking the results until tomorrow anyway.
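A minimal sketch of this batching idea (the script name `process_one.py` and the resource numbers are made up; adapt them to your own workload):

```bash
#!/bin/bash
#SBATCH --time=00:30:00       # one job long enough to be worth scheduling
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

# Instead of submitting 20 jobs that each run for a few seconds, run the
# 20 short tasks one after another inside a single Slurm job.
for i in $(seq 1 20); do
    python process_one.py --task "$i"
done
```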
- How does the mapping of the container CPUs to cluster CPUs work? I.e. if the container has two CPUs, does the node running a job in such a container "know" it?
    - If you're using Apptainer to launch the jobs (so you're using lightweight containers), Apptainer will use any CPUs provided to it by the queue system. Slurm uses a Linux feature called cgroups (control groups) to limit the available CPU resources that a job can use to the amount of resources the job has requested. In most cases Slurm automatically binds processes to specific hardware CPUs so that the processes do not need to jump between CPUs. Apptainer just launches a process, and these cgroups then send the processes to the CPUs they're supposed to run on.

## 3. Pitfalls with parallelization

https://coderefinery.github.io/TTT4HPC_parallel_workflows/pitfalls/concurrency_issues/
https://coderefinery.github.io/TTT4HPC_parallel_workflows/pitfalls/os_side/

- What is a good way to do concurrent writing? Or is it better that each job writes to its own file?
    - This is a very good, but complex question. If you have individual parallel processes, you can try out databases, but they can sometimes cause huge numbers of small I/O writes. Usually it is easier to collect all writes from a single job into a single file by using a good data format like Parquet or HDF5. So if your program creates N files, you can instead try to write all N datasets into a single file that can hold multiple datasets. A sketch of the per-job-file-then-merge pattern is shown below.
    - One extra problem with concurrent writing is that it can cause situations where one job needs to wait for another job to finish writing before it can continue. So it can cause I/O bottlenecks.
- Is it safe to append to the same file from multiple jobs (assuming the writes are small enough to be atomic)?
    - Usually I would not recommend it. There is a risk that the writes arrive in the wrong order, which might corrupt the data.
- Is it that the database would have a queue for the concurrent writing operations?
    - Usually yes. You'll want to verify from the database's documentation that it is "thread-safe" and that it supports some locking mechanism which makes certain that it knows multiple workers are writing to the same database.
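A minimal sketch of the "one output file per job, merge afterwards" pattern mentioned above (the `results_*.csv` file names and the CSV format are only assumptions for illustration):

```bash
#!/bin/bash
# merge_results.sh -- hypothetical post-processing step, run once after all
# jobs have finished. Each job is assumed to have written its own
# results_<ID>.csv file instead of appending to a shared file.

# Keep the header from the first file only, then append the data rows
# of every per-job file into one combined file.
head -n 1 results_1.csv > all_results.csv
for f in results_*.csv; do
    tail -n +2 "$f" >> all_results.csv
done
```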
- Should more people use databases for HPC concurrency?
    - The problem with databases in an HPC context is that the underlying filesystem (for example, Lustre) is not designed for the type of I/O that databases generate (small writes all the time). In addition, databases are usually not designed for these kinds of systems either. If a database in an HPC system is slow, it can create a bottleneck for multiple jobs at the same time.
    - That being said, databases are really good for collecting small pieces of information and for managing big workflows, for example to keep track of the parameters you have already gone through. So you can use a database to hold e.g. a table that contains hyperparameters, some result numbers (basically small amounts of numbers), and a path to where the actual big data is stored. Storing the big data itself in the database is usually problematic.
- Is there a Linux command to test the I/O speed of the disk I am using? How do I know which disks are slow and which are fast?
    - It depends on the cluster. We showed the "time" command in the first episode, so you could maybe time some large read and write operations.
    - dd man page: https://man7.org/linux/man-pages/man1/dd.1.html how-to: https://www.cyberciti.biz/faq/howto-linux-unix-test-disk-performance-with-dd-command/
        - Thank you, I didn't know this!
    - You can also try [ior](https://github.com/hpc/ior/). This is what admins often use to measure speeds.
- .

## 4. Parallelization examples - from Jupyter to HPC

https://coderefinery.github.io/TTT4HPC_parallel_workflows/parallelization/jupyter_to_script/

- I know of Snakemake and FireWorks workflow tools, are they similar? It seems at least that FireWorks needs a database, which may be tricky to set up?
    - Yes, you are correct, they are both common workflow manager tools. Snakemake does not require a database; it uses the disk as a database to track which files have been processed and which ones still need to be processed. There will be more about Snakemake after the break.
    - FireWorks is similar, but as mentioned, it uses a database as a backend. FireWorks is great when used with tools that are designed around it (material simulations etc.).
    - I think the best way to decide which one to learn is to check what is used by other researchers in your field. There are other options too (e.g. Nextflow).
- Do I always need to make a container to convert my notebook to a Slurm script?
    - Not necessarily. In the example here the Apptainer image just contains the Python installation you want to use. If you get your Python installation via the module system or by some other means, you can use that. The Apptainer image is used here to make the examples reproducible across different clusters. Each cluster has its own recommended Python installations, and it is easier to use Apptainer in the examples.
    - The main focus should be on how you can convert the notebook `.ipynb` to a Python script `.py` that you can then use in the Slurm script.
- What is the container used in the examples?
    - The docs point to this https://github.com/coderefinery/TTT4HPC_parallel_workflows/blob/main/content/code/container/singularity.def

```bash
Bootstrap: docker
From: python:3.10.14-slim

%post
    # Install any needed dependencies specified in requirements.txt
    pip install --no-cache-dir scikit-learn numpy pandas matplotlib
```

- ...
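A minimal sketch of the notebook-to-batch-job path, assuming the container above has been built as `container.sif` and the notebook is called `analysis.ipynb` (both names are placeholders):

```bash
# Convert the notebook into a plain Python script (needs jupyter/nbconvert
# installed); this produces analysis.py next to the notebook.
jupyter nbconvert --to script analysis.ipynb

# Build the container image from the definition file shown above.
apptainer build container.sif singularity.def

# Write a minimal Slurm batch script that runs the converted script inside
# the container; the resource requests here are placeholders.
cat > run_analysis.sh << 'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

apptainer exec container.sif python3 analysis.py
EOF

# Submit the job to the queue.
sbatch run_analysis.sh
```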
## 5. Parallelizing with scripts

https://coderefinery.github.io/TTT4HPC_parallel_workflows/parallelization/parallelize_using_script

- If I understood correctly, the Python block will call `sbatch submission.sh {i}`. Can you pass input parameters to Slurm .sh files just like that? Are they treated as bash inputs, $1 $2 etc.?
    - Yes. Any arguments given after the submission script are treated as arguments to the submission script.
- Isn't it going to become difficult to manage if we submit many independent jobs like in that for loop? Wouldn't an array job be more efficient for rerunning something if it failed?
    - Usually, yes. With array scripts you need to write the `$SLURM_ARRAY_TASK_ID` -> my-parameters mapping in the Slurm script or in the code that the Slurm script calls when the job is running. The example here is simple; you can probably configure bash yourself for what you need.
- Feedback: It might be good to add a "sleep X" in the Python launch script, because Slurm might block you if you do this kind of massive submission.
    - We'll make a note of that. This is an added benefit of array jobs: for array jobs Slurm does not need to make individual job entries for each job (Slurm uses a database as its backend to keep track of the jobs). Instead it can just take the whole array job into the queue and write the jobs' data whenever they start.
- Are there situations where I should prefer this approach over array jobs? I've always used array jobs and I'm wondering if this has some pros compared to that?
    - I'd probably always use array jobs. For anything big, this approach isn't good for scheduling efficiency.

:::info
We can try the code in the hands-on session after lunch and you can run it on your cluster. Register to get the Zoom link.
:::

## 6. Parallelisation using Slurm Array jobs

https://coderefinery.github.io/TTT4HPC_parallel_workflows/parallelization/array_jobs/

- Is the difference that now the input parameter is the variable `$SLURM_ARRAY_TASK_ID`? Which one is better to use, the custom Python submission or the array job?
    - Array jobs are typically better. However, you need to map the `$SLURM_ARRAY_TASK_ID` to the input parameters that you want to use during your computations.
    - So basically, previously there was: Python -> Slurm script with extra args -> run Python script with these extra args
    - Now: Slurm script with array structure -> run Python job with `$SLURM_ARRAY_TASK_ID` -> determine args from this number in the Python script
- I can't hear Richard
    - thanks!
- How many variables can we use in Slurm array jobs?
    - When launching an array job, the different array tasks only differ by `$SLURM_ARRAY_TASK_ID`, which is a number. However, you can map this parameter to any number of arguments (e.g. get the line with that number from a file that holds the parameters, or ask a table what parameters are in row `$SLURM_ARRAY_TASK_ID`).
    - I like to make a table for the inputs so that every row is accessed with the `SLURM_ARRAY_TASK_ID` variable and the columns are all the parameters I need for that specific run. The table is useful because later I can check which array task ID failed and see what input parameters were used. (A sketch of this pattern is below.)
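A minimal sketch of the parameter-table pattern described above (the file `params.txt` and the script `analysis.py` are hypothetical; each line of the file is assumed to hold the parameters for one run):

```bash
#!/bin/bash
#SBATCH --array=1-100          # one array task per line in params.txt
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1

# params.txt is assumed to contain one whitespace-separated parameter set
# per line, e.g. "0.01 64 relu". Pick the line matching this array task.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

# Pass the parameters on to the actual program (unquoted on purpose, so
# each column becomes its own command-line argument).
python analysis.py $PARAMS
```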
## 7. Parallelisation with workflow managers (SnakeMake)

https://coderefinery.github.io/TTT4HPC_parallel_workflows/parallelization/parallelize_using_workflow_manager/

- Can the container image be a Docker image or should it be Apptainer (i.e. *.sif)?
    - Apptainer/Singularity will be the option for you on a shared system (like HPC) where you do not have administrator rights. On your own machines you can also use Docker.
- Will Snakemake generate the Slurm script in the correct format automatically? What if my system is not Slurm, or I need some additional custom Slurm parameters?
    - Depending on how you submit jobs in your cluster (e.g. if you have a Kubernetes cluster), you can check the existing options for Snakemake: https://snakemake.readthedocs.io/en/v5.8.0/executing/cluster-cloud.html (note that only specific versions of Snakemake support this; I think the most recent version does not have this yet)

> We'll pilot a **Snakemake on HPC hackathon** next week (22.5). It is currently full, but if this is something you are interested in, please register to the waitlist: https://ssl.eventilla.com/snakemake_hack (we might still accept people, or if not, you'll still get the link to the materials and we know that there is interest and may offer it again)

- Do not let problems with the Snakemake installation dissuade you from using Snakemake. We'll try to fix the issue ASAP.

## Feedback, day 4

:::info
News for day 4:
- Exercise session in one hour, we will go over these things
- Today is the last of the days
- Material stays online indefinitely
- Credit info is on the website (join Zoom for exercises)

(the Snakemake problem was solved: I had a ~/.irods directory, so Snakemake tried to automatically use it! You will see the run in the Zoom session)
:::

Today was:
- too fast:
- too slow: oo
- right speed: ooo
- too basic: o
- too advanced:
- right level: ooo

One good thing about today:
- Enjoyed the commentary on Snakemake-specific syntax.
- Slurm arrays look interesting
- Workflow managers -- this was completely new to me
- Motivation for using (or not using) workflow managers

One thing to be improved for next time:
- Working Snakemake so we know what we're aiming for (even if it's difficult)
- Perhaps a comment on the preferred workflow managers for specific clusters? And/or a comment on the HyperQueue executor for Snakemake (which is a section on the CSC Puhti page)
- .

Other comments:
- It is great if examples work, but on the other hand it is so comforting to see that errors happen also to experts (or that things do not work as expected)
- .
- .

---

:::info
**This is the end of the document; always write at the bottom, ABOVE THIS LINE ^^**

HackMD/HedgeDoc can feel slow if more than 100 participants are editing at the same time: If you do not need to write, please switch to "view mode" by clicking the eye icon on top left.