# TTT4HPC 16/04/2024 ARCHIVE DOCUMENT

## Tuesday Tools & Techniques for High Performance Computing - Episode 1 :: HPC resources

:::danger
## Info and important links
- This is the archive for the questions from episode 1 of TTT4HPC
- Program for the day: https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#episode-1-16-04-2024-hpc-resources-ram-cpus-gpus-i-o
- Materials: https://coderefinery.github.io/TTT4HPC_resource_management/
:::

*Please do not edit above this*

# Tuesday Tools & Techniques for High Performance Computing - Episode 1 :: HPC resources

## Icebreakers

Test this shared notes document! Click on the :pencil: in the top bar and start writing.

### 1. Where are you connecting from?

- Otaniemi, Espoo, Finland! ooooo
- Helsinki :) o
- Bergen, Norway :)
- Oslo, Norway
- Tromso, Norway
- Delft, The Netherlands
- Espoo, Otaniemi
- Modena, Italy
- Gothenburg, Sweden o
- Dresden, Germany
- Kirkkonummi, Finland
- Warsaw, Poland
- Stockholm, Sweden
- Bergen, Norway
- Linköping, Sweden
- Stockholm, Sweden
- Gothenburg, Sweden

### 2. Which cluster do you have access to?

Add an "o" to the cluster(s) you have access to:

- Aalto University Triton cluster: ooooooooooooo
- CSC (Puhti and Mahti): oooooooo
- LUMI: oo
- NRIS/Sigma2 clusters: ooo
- Rackham at UPPMAX, Uppsala: o
- Tetralith at NSC, Linköping: oooo
- Dardel at PDC, Stockholm: ooo
- Leonardo Booster (Italy): oo
- DelftBlue: o
- ex3 Simula Cluster: o
- Write your own cluster name: o
- No access, just watching and learning: o
- LUNARC
- Puhti
- ICM (Warsaw) and PSNC (Poznan): o
- Tetralith, NSC, Linköping o
- NSC Tetralith and Berzelius, LUNARC Cosmos, and the MAX IV cluster
- Vera

### 3. How would you explain HPC to someone from the 19th century?

- Instead of 10 processors, put 10 professors together who work at the same time on a project
- Start with a mechanical calculator: we made this much, much faster (using electricity!)
    - Then, that still wasn't fast enough. So we connected multiple of them together really, really fast (using electricity - like telegraphs!)
- It's like having the world's largest factory filled with all the most experienced computers (the profession) all together calculating one very big calculation.
- a ballet troupe
    - yes! nice analogy

### 4. Have you ever taken part in a livestream course?

- yes: ooooooooooo
- no: oooooooooooo
- what?:
- no...

---

## Questions

- Here a question
    - And here an answer
    - another one
- I need to ask questions about the registration and ECTS credits. I have emailed "scip@aalto.fi" but got no response. Where can I ask for help?
    - My bad, I wanted to answer via the webpage, so I updated the details on the website: https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#credits
    - And I gathered other questions at: https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#questions
- What is the course structure?
    - Morning: livestream demos, general Q&A
    - Afternoon: Zoom (or local rooms?): bring your own code and ready-prepared problems, and we look at them together
- Thanks, the registration question was answered. Could you explain how to register the one ECTS credit in Sweden?
    - Please check https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#credits later; if something is missing there, I can make it clearer. The short answer is: yes, you can get 1 ECTS credit that your study coordinator can register in your study system.
## Intro section

Test that you can edit this!

- How many years have you used HPC?
    - 0-1: oooooooo
    - 1-2: ooo
    - 2-4: ooooooo
    - 5-8: o
    - +8: ooooooo
    - ...

## Resources

https://coderefinery.github.io/TTT4HPC_resource_management/

### Job scheduling and Slurm basics

https://coderefinery.github.io/TTT4HPC_resource_management/scheduling/

- question
    - answer
- Do you distinguish between HPC and HTC (high-throughput computing) in the course?
    - I think it's not something we comment on much, but many lessons are specific to the high-performance communication part of HPC. Many lessons will apply to both. (I don't remember any HTC-specific parts.)
    - (For anyone watching: HTC is mainly about lots of computing but without the fast communication between nodes, so it can be more distributed.)
- Beautiful analogy to Tetris!!
- This is the most beautiful Slurm explanation I have ever heard! (10+ years of HPC experience) +1
    - It depends on the policy. If it is strict first-in-first-out, then the Slurm behavior might be different.
    - Do many clusters of the attendees use FIFO without backfilling?
    - I think UPPMAX has "priorities", so "backfilling" or "cut-in".
- What is the difference between `ntasks` and `ntasks-per-node`?
    - `ntasks`: you want this many tasks; they can be distributed among the nodes in any way.
    - `ntasks-per-node`: require this many tasks per node (probably used together with `--nodes`).
    - Depending on your problem, you may want things more evenly distributed, so this gives you more control.
    - Someone asked: what are cases when one is better than the other?
        - In a case where you have e.g. a physics simulation on a 3D grid, you might want to split the grid into blocks of equal size. So for example an 8x8x8 grid might be split into 4x4x4 = 64 blocks for 64 CPUs. This can reduce the communication overhead between processes.
- How do you know the Slurm job number so that the second job can wait for the first job?
    - I think you just have to submit the first one, get its number, then use that for the second. You can find examples of automating this (see the sketch a few questions further down).
- Is there a way to transfer data using compute nodes that is as efficient as using the login node?
    - Transfer to/from a location outside the cluster? There might be bandwidth bottlenecks when leaving the cluster network.
    - Yeah, I think usually you'd want the data on the cluster before starting. Next week we talk a bit about data syncing.
    - (I once heard of "burst buffers" in Slurm; the name is confusing, but I understood that it's some way to do work before all the main resources are allocated, so it could be used for transferring data. Does anyone know about this?)
- What does the "s" mean in "sbatch", "scancel", "squeue"?
    - Just short for Slurm, IMO +2
- Is the allocation working like Tetris or more like a puzzle? That is, if there is a free spot of the size of my job somewhere in between other jobs, can it be allocated there?
    - Yes! This is called backfilling (as someone pointed out above, it could be disabled or configured in some other way).
- In the Tetris analogy, if my Slurm job has several subjobs, are they all still treated as one Tetris block or each independently?
    - If it's within one `sbatch` submission, it's one block (what is being discussed now is relevant to this question, though).
    - If you use job arrays, which we'll learn about in episode 4, they may start at different times.
    - Oh yes, job arrays are one submission but count as different jobs for scheduling.
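- Returning to the earlier question about making a second job wait for the first: a minimal sketch using `sbatch --parsable` to capture the job ID and a `--dependency` on it (the script names are hypothetical):

  ```bash
  # --parsable makes sbatch print only the job ID of the submitted job
  jobid=$(sbatch --parsable first_step.sh)

  # start the second job only after the first one has finished successfully
  sbatch --dependency=afterok:"$jobid" second_step.sh
  ```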
- So do I have to ask for exactly the same amount of time and number of cores for the job to start (to make a square block)? If I want 4 GB of memory, should I ask for 4 hours of runtime, to make a square?
    - No, it doesn't have to be square. But the way most schedulers work, it has to be a *rectangle*: the number of CPUs and the amount of memory don't change throughout the time of the job.
    - (This gets close to a good point: you don't want to ask for all the memory and only one CPU, since the other CPUs couldn't be used. So it's good to try to use CPUs and memory in about the same proportion, if possible. Not possible everywhere, of course.)
- Are we sure that our job will be tied to the 4 (or n) CPUs that we requested in the Slurm script? Generally, the scheduler will do context switches, and it can happen that a new CPU is assigned which is quite far away from the other CPUs, RAM, etc. allotted to my current job, right? Is each job "pinned" to particular CPUs?
    - The exact CPUs allocated could change (or they could be pinned); this probably depends on the cluster. But overall it will stick with the same number of CPUs through the whole job.
    - Keywords to search for are "cpu affinity" or "slurm cpu affinity" for related discussions.
    - Quite often the affinity is enabled at the system level, but you can ask Slurm to distribute the tasks in a different fashion with several different flags for `sbatch`. [See Slurm's multicore support docs for more information](https://slurm.schedmd.com/mc_support.html)
- When using software that uses the GPU very efficiently for short times, but then only a few CPU cores in between: is the best way then to split the code into pieces and alternate the Slurm commands...?
    - It depends on whether you need to access the results/memory of the CPU part. If it is short in time and you need it for the GPU-heavy part of the simulation, I would keep it in one job. It also depends on whether you're using one code or it's already broken into several codes/scripts.
    - Solving these problems is definitely part of the "science of HPC"; they really are good, interesting problems to solve to make the most of the resources you can get.
- Can you explain re-queueing with Tetris?
    - I'd say: remove a block from the graph/Tetris and re-insert it.
    - Something with higher priority comes and kicks you out. Can your block automatically be added back to the top for re-scheduling, or does your job change some state/files so that if you re-run it, it'll break or do something bad? This would be why re-queueing isn't the default.
- I am confused about when to use `srun` in the job script. What is the difference to just typing the program command?
    - `srun` or `mpirun` should be used if the code is MPI-parallelized. This is typically specified in the documentation. You may also try to search/`grep` the source code for `MPI`.
    - `srun` will tell Slurm to record information on this specific job step, so you can later check how long the step took and what resources it used. It will also start MPI-enabled codes using all of the resources that have been reserved for the job. In addition, you can optionally use it to launch a program with only part of the resources allocated to the job. For example, you could launch e.g. a Spark in-memory database with some resources in the background and then run a separate process that uses that database with the remaining resources.
    - We'll bring this up again in the exercise session so that we can explain MPI vs OpenMP jobs in more detail; a small preview is sketched below.
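- As a preview of the MPI vs OpenMP difference, a minimal sketch (the program name `./mysim` and the resource numbers are placeholders, not from the course material): an MPI code is given several tasks and launched with `srun`, while an OpenMP code is given one task with several CPUs and launched directly.

  ```bash
  #!/bin/bash
  # MPI: several tasks, placed on one or more nodes by the scheduler
  #SBATCH --time=00:30:00
  #SBATCH --ntasks=8
  srun ./mysim            # srun starts one process per task
  ```

  ```bash
  #!/bin/bash
  # OpenMP: one task with several CPUs, all on the same node
  #SBATCH --time=00:30:00
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  ./mysim                 # the program starts its own threads
  ```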
- If I request 1 hour in my sbatch file but only use 5 minutes, will I be charged for 1 hour or for 5 minutes?
    - You will be accounted only for the time the job actually ran, so 5 minutes in this case.
    - However, the queue system needs to fit your job into the queue according to the resources that are requested. So if you have a big mismatch between requested and used resources, you will be queueing longer than you should.
- What would happen if I do not specify the requested time in my batch file?
    - There is a default time (configured by the system administrators), which depends on the cluster. On Rackham it is 1 hour.

### How to choose number of cores by timing a series of runs

https://coderefinery.github.io/TTT4HPC_resource_management/num-cores/

- How would you evaluate the time to get the job started? I.e. maybe for 10 steps all the time goes to whatever needs to be done at the start, while that is insignificant for 10000 steps.
    - Do you mean job steps within one submitted (non-array) job?
        - Rather when there's something that needs to be installed on the node at the start and it takes time.
    - You can run your program for `t` steps and then run it for `t+1` steps. Then you can subtract one from the other to get rid of the initial startup time and say that each timestep takes `T(t+1) - T(t) = \delta t`. Of course, in practice you might run it maybe 10% longer so that you have more timesteps to extrapolate over. If you want to calculate the startup time, you can then say that `T_startup = T(t) - t * \delta t`.
    - Ah. Then yeah, that's a good point, especially since it is node-dependent. Luckily, since most nodes are the same, often you don't need to re-install. For things like transferring data, search for "burst buffer" elsewhere in this document; I think the concept would apply to other things too.
- Can't we use the `htop` command and check our resource consumption instead of polling everything using `squeue --me`?
    - If you can SSH to the nodes, then yes (we'll often do that as a first interactive check).
    - For automatic recording, these other techniques can be useful.
- Is this session about "strong scaling / weak scaling"?
    - I don't think we discussed it explicitly, but it comes quite close. Let's raise this during the Q&A time; remind us if we forget.
    - Yes, the planets exercise is about scaling. It would be even easier to see if the run time was plotted against the number of cores, but the table gives a good idea as well. (A sketch of such a timing series is included at the end of this subsection.)
- I have done this type of testing for low-resolution simulations, but I struggle to estimate how many cores to choose when running the same case at a higher resolution (which requires enough resources that I can only reasonably run the case once). Do you have advice on this?
    - In this case I'd request a long time (hours or days, it depends). Once I have done a couple of similar simulations, I will have a better estimate of the time needed.
    - If you know how your program scales, you can estimate how the time/resource use increases when you increase the problem size. Assume that `T(medium-sized job) / T(small-sized job) = (scaling function)`. For example, if you first run a small problem and scale it up by a factor of 2, but the runtime increases by a factor of 4, you can assume that the scaling function is quadratic. Then you can calculate how much bigger the large job is compared to the medium-sized job.
    - To continue with this question: when I know the scale factor of that larger job, should I increase the computation time or the number of cores? Both of these factors influence the "area" of the job.
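- For reference, one way to run this kind of timing series (a minimal sketch; `run_planets.sh` is a hypothetical batch script around the planets example, and the core counts are just examples):

  ```bash
  # Submit the same batch script with a range of core counts;
  # the command-line options override the values inside the script.
  for n in 1 2 4 8 16; do
      sbatch --ntasks="$n" --job-name="scaling_$n" --output="scaling_$n.out" run_planets.sh
  done
  # afterwards, compare the reported run times, e.g.:
  # grep completed scaling_*.out
  ```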
- Does anyone know about burst buffers in Slurm? What are they used for? Does anyone use them? I was once in a Slurm presentation and heard that (despite the name) they are a general way to do work before the main job starts, but I don't know more.
    - Here is the [Slurm burst buffer guide](https://slurm.schedmd.com/burst_buffer.html). I have never used it myself.
    - Burst buffers are used at some sites to move data needed by the program to a faster filesystem before the job starts. This is very site-specific, and you should check whether your site supports them.

### Measuring memory

https://coderefinery.github.io/TTT4HPC_resource_management/memory/

- With the sacct tool, can't we change the sampling interval if we know that our program takes a small amount of time (say, less than 30 s)?
    - Usually you should not run very short jobs, as they can be problematic for the queue system.
    - Good point. Most jobs are anyway longer than 30 seconds, and they also should be: jobs that take only a few seconds are so short that the start-up time might be longer than the computation. The main point is that sampling might miss peaks between the sampling points.
- How do I add `/usr/bin/time -v` when the job is submitted from a GUI?
    - Do you specify the command name somewhere in this GUI?
        - I think I only press a "Run" button, and I do not edit the command in the GUI.
        - Perhaps one could add the time command after seeing it running in `top`?
    - Some GUIs may not allow you to modify the command. What I would do in this case, if I wanted to know the memory consumption, is to look at `top` and see how much memory the process consumes. But then I only see the current moment, and if it varies a lot, I might miss the peak.
- From my experience, I have used `seff jobnumber` to check memory use. But I am not sure whether it samples or not...
    - To my knowledge it samples. (+1, it uses the same database as sacct)
    - It presents the sacct output in a different format.
    - `seff` is good. I did not show it because it is not installed everywhere, but for most calculations it gives a good and short overview of the resource use.

:::info
## Break until xx:15, then I/O-related considerations
:::

*Feel free to add more questions or suggest some improvements.*

- I missed the beginning of the morning session and now had some problems with the Aalto network and could not connect for a few minutes. Will the session be recorded and, if yes, where will the recording be made available? Thanks in advance!
    - At least the recording will be available at the same Twitch link for the next 7 days. We might archive it to YouTube, but because this is a pilot run we are still considering this.
- I am using UPPMAX. What is the recommended module I should load in my bash for this course?
    - Is it to compile the planets.c code, or to run the example.py?
        - For the lecture material and lab session.
    - https://notes.coderefinery.org/ttt4hpc-20240416-rackham
- May I have the link to the material of the lab session?
    - It is at the bottom of the same materials for this day: https://coderefinery.github.io/TTT4HPC_resource_management/exercises/
- Since there are three factors influencing the job "area" (time, memory, cores): should we test these three factors one by one? And when I want to increase the "area" of the job, which factor should I increase first?
    - Yes, often they can be studied separately, but they are dependent (time and cores), and there can also be a dependency between memory and cores (some jobs need to ask for more cores not to get more compute resources but to get access to their memory).
    - I would first start with: how can I reduce the system while still making the answer meaningful, and get a feeling for the timing. At the same time I measure memory (measuring memory does not cost anything). Then, once I have a feeling for the timing of a reasonable starting guess, I start varying the number of cores/threads to get a feeling for how it scales with cores/threads (in the talk I did not distinguish between cores, threads, and tasks to keep it simple, but the strategy for studying scaling is the same). A minimal measurement sketch is shown below.
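- To make the "measure memory while timing" step concrete, a minimal sketch of wrapping a program with `/usr/bin/time -v` inside a batch script (the program name is a placeholder); the "Maximum resident set size" line in the job's output/error file then shows the peak memory:

  ```bash
  #!/bin/bash
  #SBATCH --time=00:30:00
  #SBATCH --ntasks=1
  #SBATCH --mem=2G

  # -v prints a detailed report (to stderr), including
  # "Maximum resident set size (kbytes)"
  /usr/bin/time -v ./myprogram
  ```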
### I/O Best Practices

https://coderefinery.github.io/TTT4HPC_resource_management/io-best-practices/

- How do we know or identify what the bottleneck of our computation is: the CPU cores, the memory provided to the program, or the I/O interactions? Are there ways to know this other than profiling the program?
    - As you write, profiling will often reveal this in more detail.
    - Most often computations are not CPU-bound but memory- or disk-bound.
    - Trying a scaling study by increasing the number of cores/threads/tasks might give you an indication of bottlenecks.
    - Another way to get an indication of where a bottleneck might be is to try different job groupings (e.g. all on one node or distributed across nodes).
    - Tools like `strace` (to be demonstrated).
    - If this is your own code, then inserting timing information into the code can also help reveal this: then you know which section of the code uses the most time, and you can look at what that code portion is doing. I often find this easier than running profiling tools, and it won't slow down my code (which some profiling tools might).
    - You can also reason that if the code is not using the CPUs, it is waiting for I/O. So if your CPU utilization is much less than 100%, the program could be waiting for I/O to finish.
- How come there are more open calls than files?
    - Importing Python libraries involves file reads.
- Can the "tarfile trick" be used for file transfer? Usually I copy many small files from somewhere to the cluster using rsync or lftp.
    - Yes, it is much faster, up to the point where the file is too big and might have issues (e.g. the transfer crashes for network reasons). (A small sketch is included at the end of this subsection.)
- How did you create the tar file?
    - `create_archive.py` in the example uses Python to create a tarfile.
- A follow-up question: does creating the tar file just move the slowness of opening files to another script?
    - The archive still has to be created once, but after that the cost of opening many small files is paid only once instead of on every run.
- Can you do the tar file trick with pickle objects?
    - There are other formats to store everything as one file, and pickle is one of them.
    - `Tarfile.extractfile()` (what's used in the example) gets you a file object that can be used in anything that supports it. You could read a pickle from it: `f = xxx.extractfile('name'); pickle.load(f)`
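- As a concrete version of the tarfile trick mentioned above, a minimal sketch (assuming the cluster provides node-local scratch via `$TMPDIR`; the variable name and the `dataset/` directory are placeholders and vary between clusters):

  ```bash
  # beforehand: pack the many small files into a single archive
  tar -cf dataset.tar dataset/

  # inside the job: move one big file to node-local disk and unpack it there
  cp dataset.tar "$TMPDIR"/
  cd "$TMPDIR"
  tar -xf dataset.tar
  # ... run the analysis against $TMPDIR/dataset ...
  ```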
- How do I know what type of filesystem my cluster has, or which filesystem a folder is using?
    - You should check the cluster's docs; that's the easiest way. They may also give other local advice about what to do or not do. (`mount` shows what is mounted, but it's quite technical and doesn't show how things are actually configured.)
    - If the system uses a shared network filesystem, the actual technical details do not really matter: there will always be delays, as the data is not stored on the compute node.
- What are the recommendations when, for example, using R scripts? Is .rds fine for loading input and writing results, or should one rather use .csv, .json, or other formats?
    - .rds is better than .csv or .json because it is a binary format, at least for intermediate data.
    - Parquet and Feather are also great formats for storing a `data.frame`. The [arrow package](https://arrow.apache.org/docs/r/articles/arrow.html) provides access to these formats.
- Are we using the same notes during the Zoom session?
    - Yes, I think so; we can continue using these to have everything in one place.

## Feedback

:::info
Daily notes:
- **Lunch break now; after lunch there are Zoom sessions for exercises and/or bring your own code and we'll look at it together.**
- If you are registered, come by the "exercise / bring your own code (BYOC)" session about an hour after we finish.
- Many of the partners listed on the website also offer support: they would be *very happy* to receive your questions and help you with anything discussed today.
:::

Today was:

- too fast: oo
- too slow:
- right speed: oooooooooo
- too advanced:
- too basic:
- right level: ooooooo

One good thing about today:

- the time command, but it is not available on my favorite cluster...
    - Can you tell us which cluster? Maybe we can influence the admins :)
    - Ask your admins to install it!
- tar trick :) +1
- getting more information about a job
- this questions-and-answers document was amazing +1
- really clear explanation of why a "calibration study" for time and memory usage is useful and how to approach it
- I will check the time command on the Tetralith cluster as well. I also like the `time -v` trick; I did not know about it so far. (Now we already have a ticket for that feature.)
- ...

One thing to improve for next time:

- Don't be nervous :) You're doing great :heart_eyes_cat:
- Really dig the course and the instructors. Good job!
- Great job, nothing to say as of now
- ..

Any other feedback?

- In the second part, I was a bit confused at times, since I found it harder to remind myself why we were doing the things that were presented. Might be due to missing prior knowledge on my side. +2
- For me, the Twitch broadcast got blurry quite often. I wonder if this is on my side or somewhere else?! +2
    - We should have mentioned forcing Twitch to use max quality (otherwise it tries to optimise for your bandwidth and blurs things). You can set this in the Twitch window with the gear icon in the bottom right corner.
    - Ah ok, I do not use Twitch so often. Yes, it was often switching between blurry and fine.
- You mentioned that it will also be recorded. Where can I find the recording later? I think I cannot join next time, so it would be nice to have the recording available somewhere. Ah ok, good to know. Thanks.
    - https://twitch.tv/coderefinery/videos (for 7 days only), and maybe YouTube later.
- For me personally, it is easier to concentrate if only one presenter is presenting at a time. But this is just a personal preference, I guess. Otherwise, very nice demos in the morning!

## Archive

The morning questions and answers have been archived to https://hackmd.io/@coderefinery/ttt4hpc_day1_archive

# Welcome to the hands-on zoom session!

- We start at 12:00 Oslo time / 13:00 Helsinki time.
- This session will **not** be recorded so that everyone can interact freely.
- Usual Zoom etiquette: please mute yourself if you don't want to talk.
- Ask questions in this document or raise your Zoom hand to ask live (we will write live questions into this document anyway for documentation purposes).
- We follow the CodeRefinery Code of Conduct: https://coderefinery.org/about/code-of-conduct/
    - If you feel that the Code of Conduct was violated, please report it to any of the instructors (Host/Co-host in this Zoom) or via scip@aalto.fi (which also reaches people who are not here).
- If you would like to receive the credit, please send a direct message to the host (Enrico Glerean) to mark your presence. Enrico will confirm that your presence is marked.

## Exercises

Materials: https://coderefinery.github.io/TTT4HPC_resource_management/exercises/

Note: the link in exercise 2.2 should be https://github.com/coderefinery/CIFAR100_example

- For Rackham (@UPPMAX) instructions, please check https://notes.coderefinery.org/ttt4hpc-20240416-rackham?view
- For Tetralith, use `mpprun` instead of `srun` and `Anaconda` instead of `miniconda`.
- For Dardel, `srun` is OK, and use `Anaconda3` instead of `miniconda`.
- For Lunarc, you may use `Anaconda3` instead of `miniconda`. Also, remove the `M` when you specify the memory, as the default unit is MB.
- For NRIS clusters, load the specific `miniconda` version that is available on the specific cluster.

## Questions on exercises

- I have resources on LUNARC; can I still do the exercises?
    - Yes, but if you run into a problem, let us know.
- For the scaling study, up to how many cores do I need to go?
    - You can try 1, 2, 4, 8, 16, ... and go only as far as you have available. The more we ask for, the longer we might also queue. It's mainly about trying the technique.
    - You may also check the scaling for different arguments (the size of the input). If you reduce `--network-penalty` to 100 or even 1, the cost of communication will go down and one can scale to more cores.
- What is LUNARC compared to COSMOS? The host of COSMOS?
    - Yes :-) COSMOS is the cluster, which is hosted by LUNARC, the group/centre running it.
    - Thanks, so COSMOS is the computers, and LUNARC is more like the administration?
        - Yes.
- Can you write down the cluster where you are running the exercises:
    - Aalto Triton: ooo
    - CSC: o
    - DelftBlue (TU Delft):
    - NRIS/Sigma2 (Saga/Fram/Betzy):
    - Rackham (UPPMAX, Uppsala, Sweden): o
    - Dardel (PDC, Stockholm, Sweden): o
    - Tetralith (NSC, Linköping, Sweden)
    - Cosmos (Lunarc, Lund, Sweden): o
    - ex3 (Simula Research Center)
    - C3SE Vera (Chalmers University, Gothenburg, Sweden)
- What should I put for the requirements in the conda environment?
    - This is not relevant for exercises 1.1-1.4.
    - For exercise 2.2, it's in https://github.com/coderefinery/CIFAR100_example/blob/main/requirements.txt
    - For exercise 2.4, https://github.com/coderefinery//meteorological-data-processing-exercise/blob/main/requirements.txt
- On Triton, which module should I load for mpicc?
    - `module load openmpi/4.1.5` seems to work.
- What would happen if I set more than one node for OpenMP software?
    - I don't think it will fail, but by default it will only use the number of cores available on one node during run time, and it will waste the resources allocated on the other nodes. :(
- On Triton, I get a "More processors requested than permitted" error when running sbatch with anything more than ntasks=8. Do we need to split it up somehow, or did I make a mistake somewhere else?
    - Do you mean that when replacing 8 with a larger number the script fails?
    - It was probably an issue with me trying to submit multiple Slurm jobs with various ntasks parameters using environment variables. Still testing.
        - +1!
        - This kind of worked, but I still got "Unspecified error" messages in some jobs: `for i in 1 2 4 8 16 32 64 128 256; do sbatch --job-name "${i}_cores" --ntasks=$i --output="tasks_${i}.out" job.slrm; done`
- From the lectures I understand that if someone has access to the same nodes as you, they also have access to your data? In this case, if I have sensitive data that I do not want other people to access, should I use a specific cluster, say COSMOS-Sens?
    - You should use a cluster meant for sensitive data in particular (Bianca @UPPMAX, COSMOS-Sens, ...). Or you could use the `--exclusive` SBATCH option, which makes sure no other user gets allocated resources on the same node(s) as you, irrespective of whether you're effectively using only part of the node(s).
    - But aren't the files protected by file permissions at least?
        - Norway/NRIS perspective: files created by users are by default only readable by the file creators. Also, only a person who has jobs actively running on a node can log into it.
        - Yes, but of course for sensitive data you need to be even more careful, and that is the reason one either asks for `--exclusive` or runs on clusters that are specially made for sensitive data.
- In the first exercise, you didn't set the number of nodes explicitly. Do we need to set this value?
    - Good observation. I chose not to, and let the cluster place the tasks. This gives the scheduler more flexibility. But the downside can be that the same number of cores can yield different timings if there is a lot of communication: communication within a node or across nodes can have different latency. For some codes it can make a difference, but for many codes it does not matter, and I typically start by not specifying where exactly the tasks should be placed.
    - You do not need to set the number of nodes (`--nodes`), although I consider it good practice. As there is a different number of cores per node on different clusters, not specifying the number of nodes ensures that our example will work for everyone.
    - Does that mean that if I am not sure about the number-of-nodes setting, I can ignore it? For example, if I don't know whether the software is MPI or OpenMP, do I not need to set this value?
        - I think a nice way to check how many nodes are reserved for a job is `scontrol show job 1234567`, where 1234567 is a sample job ID. (A few more such commands are sketched below.)
        - Great point and reminder! If the software is OpenMP-parallelized, you have to worry about which node the threads are placed on: they all need to be on the same node. But it can sometimes be difficult to tell whether something is OpenMP or MPI: https://coderefinery.github.io/TTT4HPC_resource_management/num-cores/#what-is-mpi-and-openmp-and-how-can-i-tell
            - Sometimes you might need to ask for help or write to the code authors to figure it out.
        - Indeed. Also, for an OpenMP job I would specify `--nodes=1` and a suitable number of tasks, for example:

          ```bash
          ...
          #SBATCH --ntasks-per-node=1
          #SBATCH --cpus-per-task=20

          module load uppasd
          export OMP_NUM_THREADS=20

          sd > out.log
          ```

          Here 20 is the number of cores per node on that particular cluster. If the application does not use a lot of memory, I recommend you set `--cpus-per-task` to the total number of cores per node. But if the application requires more memory, you may want to tweak the number of CPUs per task.
        - Thank you so much!
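        - A few related commands for checking what a job actually got and used (a minimal sketch; the job ID is a placeholder, and `seff` is not installed everywhere, as noted earlier):

          ```bash
          scontrol show job 1234567   # allocation details while the job is queued, running, or recently finished
          sacct -j 1234567 --format=JobID,NNodes,NCPUS,Elapsed,MaxRSS   # accounting information afterwards
          seff 1234567                # short efficiency summary, where available
          ```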
        - However, to be honest, when I typed `scontrol show job id 1234567`, it said "too many arguments for keyword:show"; if I type `scontrol show 1234567`, it says "invalid entity". 1234567 is my job ID. Could you please tell me what I did wrong, or should I do something else before typing this command?
            - The entity keyword is `job`, immediately followed by the ID, so: `scontrol show job 1234567` (without `id` in between).
- May I ask what the difference is between the "devel" and "devcore" partitions on Rackham? Can I do a multi-core, multi-node test in the "devel" or "devcore" partition?
    - The `devcore` partition should be used if you ask for at most 1 node. If you use the `devel` one, you can ask for a maximum of 2 nodes. A nice command to use is `sinfo -p devel` (or any partition name); it will list whether there are available (= idle) nodes in that partition.
- Question:

  ```bash
  #!/usr/bin/env bash
  #SBATCH --account=XXX
  #SBATCH --job-name='TTT4HPC'
  #SBATCH --time=0-00:05:00
  #SBATCH --mem-per-cpu=1500M
  #SBATCH --partition=devel
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-core=2

  cat /etc/hostname
  sleep 1
  ```

  The output is:

  ```bash
  r483.uppmax.uu.se
  ```

  Why do I only have one output, not 4 outputs?
    - Only the main node, got it. Here is a script that would be suitable for testing (it prints the hostname from each of the MPI tasks): https://github.com/AaltoSciComp/hpc-examples/blob/master/slurm/pi-mpi.c
    - The commands in the batch script body run only once, on the first node of the allocation; to run a command once per task, launch it with `srun` (e.g. `srun cat /etc/hostname`).
- For exercise 1-1 I am getting wildly inconsistent results for different numbers of ntasks (on Triton). I suspect this might be due to running on various different partitions. Is this likely to be the case?

  ```bash
  grep completed *.out | sort --version-sort
  tasks_1.out:Simulation completed on 1 cores after 435.68 sec: 30000 planets and 100 steps.
  tasks_2.out:Simulation completed on 2 cores after 216.59 sec: 30000 planets and 100 steps.
  tasks_4.out:Simulation completed on 4 cores after 243.04 sec: 30000 planets and 100 steps.
  tasks_8.out:Simulation completed on 8 cores after 137.84 sec: 30000 planets and 100 steps.
  tasks_16.out:Simulation completed on 16 cores after 63.42 sec: 30000 planets and 100 steps.
  tasks_32.out:Simulation completed on 32 cores after 24.92 sec: 30000 planets and 100 steps.
  tasks_64.out:Simulation completed on 64 cores after 190.43 sec: 30000 planets and 100 steps.
  ```
    - The reason might be communication across nodes if the nodes are "far apart". In this particular example there is almost no disk access, so disk should not be an issue here (in reality it often is).
    - In this example, I would test reducing the communication with the `--network-penalty` option; reduce it to 1.
    - For a real code that you cannot easily change, I would run this test again and check on which nodes it was running. Sometimes my calculation has landed on an "ill" node where it took a lot longer.
- I ran into some issues when doing my own calculation. These are the script settings:

      #!/bin/bash
      #SBATCH --time=00:15:00
      #SBATCH --partition=batch
      #SBATCH --nodes=1
      #SBATCH --ntasks=4
      #SBATCH --mem=2G
      #SBATCH --job-name=validate
      #SBATCH --output=job.out
      #SBATCH --error=job.err

      for l in `seq 3.6 0.1 4.5` ; do
          something
      done

  I should have results for all these values, 3.6 to 4.5, but I lost some results in between, like 3.7. Do you know the reason? How can I solve this issue?
    - I think we will need to see more details about "something", since the script itself looks OK to me.
- Feedback for Rackham:

  ```bash
  1) gcc/12.3.0   2) openmpi/4.1.5    <- the defaults of these work
  1) intel/20.4   2) openmpi/3.1.6    <- the defaults of these will not
  ```

  `mpicc -O3 planets.c -o planets -lm` will give a syntax error; `mpicc -std=c11 -O3 planets.c -o planets -lm` will work.
    - Thanks!
    - We will test and adjust the material for this.
    - I usually use the intel module together with the intelmpi one. This works:

      ```bash
      module load intel/20.4 intelmpi/20.4
      mpiicc -std=c11 -O3 planets.c -o planets -lm
      ```
- What are the mandatory lab exercises for students to get the credit? Some exercises are "bring your own code", which I guess is not mandatory?
    - Correct, the "bring your own code" exercises are optional.

### Limit to certain nodes on Aalto Triton

With

```
#SBATCH --nodelist=pe[1-48,65-81]
```

you limit your job to a specific architecture. The list of architectures is available at: https://scicomp.aalto.fi/triton/overview/

- `--nodelist=<node_name_list>`: request a specific list of hosts. The job will contain *all of these hosts* and possibly additional hosts as needed to satisfy the resource requirements.

## Were you able to do part 1 of the exercises?

- yes: oo
- still working on it: ooooo
- no/not-trying:

## Were you able to do part 2 of the exercises?

- yes:
- still working on it: ooo
- no/not-trying: o