![](https://media.enccs.se/2024/09/julia-for-hpc-autumn2-2024.webp)

<p style="text-align: center"><b><font size=5 color=blueyellow>Julia for High-Performance Scientific Computing - Day 3</font></b></p>

:::success
**Julia for High-Performance Scientific Computing — Schedule**: https://hackmd.io/@yonglei/julia-hpc-2024-schedule
:::

## Schedule for Day 3 -- [Julia for HPC](https://enccs.github.io/julia-for-hpc/)

| Time (CET) | Time (EET) | Instructors | Contents |
| :---------: | :---------: | :---------: | :------: |
| 09:30-10:30 | 10:30-11:30 | Jaan | Dagger |
| 10:30-11:30 | 11:30-12:30 | Jaan | Running on HPC |
| 11:30-12:30 | 12:30-13:30 | | ==Lunch Break== |
| 12:30-13:30 | 13:30-14:30 | Francesco | MPI |
| 13:30-14:30 | 14:30-15:30 | | Buffer time, Q&A |

---

## Exercises and Links

**Lesson material**:
- [Introduction to programming in Julia](https://enccs.github.io/julia-intro/)
- [Julia for high-performance scientific computing](https://enccs.github.io/julia-for-hpc/)

:::warning
- Exercises for [XXX]()
:::

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

## Questions, answers and information

- Is this how to ask a question?
    - Yes, and an answer will appear like so!

### Parallel execution with Dagger

- If I try out the given solution to the Dagger exercise, I get an error: "LoadError: ArgumentError: Invalid Dagger task expression: begin... ". Is it only me, or does somebody else have the same problem?
    - I solved it like this:
      ```
      @everywhere function h(x::Integer, y::Integer)
          map(one(x):x) do i
              Dagger.@spawn sleep(1)
              Dagger.@spawn y+i
          end
      end
      ```
    - I see the error as well. Something has changed between Dagger.jl versions.
        - OK, thanks. I additionally removed `@everywhere` from `task_graph()` and it works now.
- What is this exactly? "If the course has a resource reservation, we can use the `--reservation=<name>` option to use it."
    - Sometimes when a course is held on an HPC resource, some nodes of the cluster are reserved for the course. That way, any job submitted under that reservation usually doesn't need to wait in the queue, since there are nodes already booked for it. This is not the case for this week's workshop, though, so you can ignore it.

### Julia on HPC cluster

### Message passing

- In Julia, are the data sent/received objects like in Python?
    - Yes, they can be structs or primitive datatypes. Julia does not have objects (in the sense of instances of classes), but the concept of passed/received data is the same.
- When should we use the `mpiexecjl` command?
    - `mpiexecjl` works well with the `MPIPreferences` package. For example, I may want my code to use a different MPI binary depending on the project (project in the Julia sense); in that case, `mpiexecjl` will read which `mpiexec` or `mpirun` to use depending on the project, and/or preload certain libraries, etc. In this case, loading the `julia-mpi` module made it so that `mpiexecjl` is simply aliased to `srun` (and it also includes a library for GPU-aware MPI). Generally, `mpiexecjl` is useful if you want to run the literal same script with the same `sbatch.sh`, but otherwise you can achieve the same effect with just `mpiexec`/`mpirun`/`srun`. (A typical invocation is sketched a bit further down, after the GPU-aware MPI question.)
- Can you provide an example for using GPU-aware MPI on LUMI?
    - Here is one example from the LUMI website: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/lumig-job/
    - [This](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) is an example of ROCm-aware MPI that LUMI should support. Basically, you send GPU arrays in buffers directly; see the sketch below.
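To illustrate the "send GPU arrays in buffers directly" part of the answer above, here is a minimal, untested sketch. It assumes exactly two ranks, one GPU visible per rank, and a ROCm-aware MPI (on LUMI that also means enabling the Cray MPICH GPU support, `MPICH_GPU_SUPPORT_ENABLED=1`, which the `julia-mpi` module mentioned above should take care of); the file name `rocm_pingpong.jl` is made up:

```
# rocm_pingpong.jl -- hypothetical sketch of ROCm-aware MPI:
# the GPU buffers are handed to MPI directly, with no copy to host memory.
# Assumes 2 ranks and one GPU per rank (e.g. via Slurm's GPU binding).
using MPI
using AMDGPU

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

dst = mod(rank + 1, 2)
src = mod(rank - 1, 2)

# Both buffers live in GPU memory
send_msg = ROCArray(fill(Float64(rank), 1024))
recv_msg = similar(send_msg)

sreq = MPI.Isend(send_msg, comm; dest=dst, tag=0)
rreq = MPI.Irecv!(recv_msg, comm; source=src, tag=0)
MPI.Waitall([sreq, rreq])

# Copy back to the host only for printing
println("rank $rank received a buffer filled with $(Array(recv_msg)[1])")
```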
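And to come back to the `mpiexecjl` question above: outside of LUMI (where the `julia-mpi` module already aliases it to `srun`), a typical invocation might look like the sketch below. The wrapper is installed once with `MPI.install_mpiexecjl()` from MPI.jl; `ring.jl` here just refers to the ring example used later on this page, and the rank count is arbitrary.

```
# Install the launcher wrapper once (by default it ends up in ~/.julia/bin,
# which then needs to be on your PATH):
#   julia -e 'using MPI; MPI.install_mpiexecjl()'

# Run a script with 4 MPI ranks, using whatever MPI binary the project's
# MPIPreferences settings point to:
mpiexecjl --project=. -n 4 julia ring.jl
```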
- What exactly is overloaded over the `srun` command?
    - `srun` is one of the commands of the Slurm Workload Manager; it asks to run a parallel job on a cluster managed by Slurm.
    - If necessary, `srun` will first create a resource allocation (number of cores, number of tasks, and memory) in which to run the parallel job.
- Why is it necessary to load julia and julia-mpi at the same time? Are they not both Julia installations?
  ```
  module use /appl/local/csc/modulefiles
  module load julia
  module load julia-mpi
  ```
    - No. The `julia` module contains the actual Julia binary. The `julia-mpi` module just creates a `Project.toml` file which contains the name of the MPI launcher (`srun`) and a couple of environment variables to make sure MPI is launched correctly; it doesn't itself contain any Julia binaries. You can check this by looking at the Lua files with `cat /appl/local/csc/modulefiles/julia-mpi/0.20.0.lua`
    - It is the same with `julia-amdgpu`, which we will use tomorrow: it sets a couple of environment variables and checks that the system libraries needed for GPUs are loaded.
- Can MPI be combined with Distributed? I mean: can you have a function that runs code in parallel using MPI, and then call this function with Distributed using `pmap`? If there is an example, I would appreciate it.
    - In principle yes, but I would caution against it. If you start with MPI, go MPI all the way. One common approach, though, is hybrid shared+distributed memory parallelism (in other languages it's called OpenMP+MPI). In Julia this would mean using either Distributed or MPI and having Threads inside each MPI rank/Distributed worker (a sketch is included further down, after the CPU-binding question). But I have to say I have never done this myself, so I can't comment much more than this. Also, this makes sense only if each MPI rank/Distributed worker sits on a different node, otherwise you risk oversubscribing the cores of the CPU.
- I'm getting the following error, any ideas?
  ```
  srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x0000000000000000000000000000028000000000000000000000000000000280.
  ```
    - Can you share your sbatch script and julia script?
      ```
      #!/bin/bash
      #SBATCH --account=project_465001310
      #SBATCH --partition=small
      #SBATCH --nodes=2
      #SBATCH --ntasks-per-node=2
      #SBATCH --cpus-per-task=1
      #SBATCH --mem-per-cpu=1000
      #SBATCH --time="00:15:00"

      module use /appl/local/csc/modulefiles
      module load julia
      module load julia-mpi

      julia --project=. -e 'using Pkg; Pkg.instantiate()'
      srun julia -p 2 --project=. ring.jl
      ```
      ```
      using MPI
      MPI.Init()

      comm = MPI.COMM_WORLD
      rank = MPI.Comm_rank(comm)
      size = MPI.Comm_size(comm)

      dst = mod(rank+1, size)
      src = mod(rank-1, size)

      send_mesg = Array{Float64}(undef, 5)
      recv_mesg = Array{Float64}(undef, 5)
      fill!(send_mesg, Float64(rank))

      print("$rank: Sending $rank -> $dst = $send_mesg\n")
      sreq = MPI.Isend(send_mesg, comm, dest=dst, tag=rank+32)
      rreq = MPI.Irecv!(recv_mesg, comm, source=src, tag=src+32)

      stats = MPI.Waitall!([rreq, sreq])

      print("$rank: Received $src -> $rank = $recv_mesg\n")
      ```
    - The issue is that you're launching julia with `-p`, I think. `srun` will spawn 4 ranks (2 tasks per node x 2 nodes), but then Julia is also trying to create 2 of its own workers per task, and you don't have that many CPUs available. Try just `srun julia ring.jl`.
- Changed the requirements in the batch script to `#SBATCH --nodes=1` and `#SBATCH --ntasks-per-node=4`, altered the launch line to `srun julia ring.jl`, and I still get the same error.
    - Strange, it seems to work for me. Have you run interactive jobs before this one? It could be that you're still in the shell of one of them. Can you `exit` LUMI and then log in again? If you have to `exit` more than once to get back to your local computer, that might be the culprit.
        - That did the job ;-D
            - :D
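Here is the hybrid MPI + Threads sketch promised above (untested; the file name `hybrid.jl`, the toy sum, and the exact resource numbers are made up for illustration). The main point is that `--cpus-per-task` and Julia's `-t` flag (or `JULIA_NUM_THREADS`) should agree, so that each rank's threads get their own cores.

```
# hybrid.jl -- hypothetical sketch: MPI across nodes, threads within each rank.
# Launch with something like:
#   srun --nodes=2 --ntasks-per-node=1 --cpus-per-task=4 julia -t 4 hybrid.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Split the range 1:n across MPI ranks; threads then parallelise the local part
n = 10_000_000
chunk = cld(n, nranks)
lo = rank * chunk + 1
hi = min((rank + 1) * chunk, n)

local_terms = zeros(hi - lo + 1)
Threads.@threads for i in lo:hi
    local_terms[i - lo + 1] = 1.0 / i^2
end

# Combine the per-rank partial sums with MPI
total = MPI.Allreduce(sum(local_terms), +, comm)
rank == 0 && println("sum ≈ $total (should approach π²/6 ≈ $(π^2 / 6))")
```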
- What is the difference between a task and a CPU?
  ```
  #SBATCH --ntasks-per-node=4
  #SBATCH --cpus-per-task=1
  ```
    - You mean "between", so where is the other one for the comparison?
    - A task is an MPI rank, which can itself occupy multiple CPUs. So, for example, if a node has 128 cores, you can have 4 tasks of 32 cores each, 1 task of 128 cores, 128 tasks of 1 core each, etc. Each task can then use multithreading to make use of the CPUs assigned to it.
- Regarding the Julia integration within other software based on C++ (e.g. OpenFOAM = OF): assume that Julia functions are called from within OF. If I call MPI.Init() in OF first, what should I call on the Julia side to work with this MPI instance?
    - This is a tricky one. How is the integration between OF and Julia made? Do you even need Julia to be MPI-aware in this case?
        - Yes, Julia is responsible for the GPU code, OF is CPU-only. Julia has its own data structures that should be transferred directly from the CPU buffers to GPU ones via MPI. The integration is achieved by wrapping the Julia functions in a C interface and calling the latter from within OF.
    - I don't think you need MPI to transfer from CPU to GPU if each CPU rank has its own GPU. I'd say that if you pass the OF volFields (or probably the matrices they contain) to Julia and then let Julia handle the CPU-to-GPU communication, you can just treat each OF rank as a separate process from the Julia side. If you only have one GPU, or just a random set of GPUs (not one per rank), I don't have a solution off the top of my head right now.
- What argument do I pass to Julia so that it can use MPI? I totally missed that.
    - None, as long as you load both the julia and julia-mpi modules and launch julia with srun, either in a batch script or interactively by passing srun all the `-p`, `-t`, `-A`, etc. flags.
    - You can look at the MPI job exercise here https://enccs.github.io/julia-for-hpc/hpc-cluster/#exercises to get some inspiration.
        - What about running it on the command line? Do I just run `julia -p 10 ring.jl`?
            - No, that would give you Distributed workers. MPI needs srun.

---

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::

---