![](https://media.enccs.se/2022/09/AMD-workshop.jpg)

# Developing Applications with the AMD ROCm Ecosystem

### November 29 - December 2, 2022

## General information

- ENCCS: https://enccs.se/
- This hackMD: https://hackmd.io/@enccs/amd-nov2022
- Archive hackMD: https://hackmd.io/@enccs/rySrI2Nws
- Lesson material: https://enccs.github.io/AMD-ROCm-development/
- Workshop feedback form: https://events.prace-ri.eu/event/1440/surveys/1044
- The workshop will be recorded. Edited recordings will be linked from enccs.se/lessons
- Upcoming AMD GPU hackathon: https://enccs.se/events/2023-03-amd-gpu-hackathon/

### Follow ENCCS

- Sign up for the ENCCS newsletter: https://enccs.se/newsletter
- Follow us on [LinkedIn](https://www.linkedin.com/company/enccs) or [Twitter](https://twitter.com/EuroCC_Sweden)

### Schedule

**Nov 29, 2022, 13:00-17:00 — Programming Environments**

- HIP (George Markomanolis, AMD)
- HIP Exercises on ACP cloud
- Break
- Hipify — porting applications to HIP
- Hipify exercises
- Getting Started with OpenMP® Offload Applications on AMD Accelerators (Jose Noudohouenou, AMD)
- Break
- OpenMP exercises
- Developing Fortran Applications, HIPFort & Flang (Bob Robey and Brian Cornille, AMD)
- Fortran exercises -- HIPFort

**Nov 30, 2022, 13:00-17:00 — Understanding the Hardware**

- The AMD MI250X GPUs (George Markomanolis, AMD)
- AMD Communication Fabrics (Mahdieh Ghazimirsaeed, AMD)
- Memory Systems (Bob Robey, AMD)
- Break
- Exercises – MPI Benchmark and Memory Systems
- Roofline Model (Noah Wolfe, AMD)
- Affinity — Placement, Ordering and Binding (Gina Sitaraman and Bob Robey, AMD)
- Exercises -- Affinity

**Dec 1, 2022, 13:00-17:00 — Tools**

- Profiler - rocprof
- Exercises - rocprof
- Profiler - Omnitrace
- Break
- Profiler - Omniperf
- Break
- Debuggers — rocgdb
- Debugging - exercises

**Dec 2, 2022, 13:00-17:00 — Special Topics**

- Using OpenMP® (Michael Klemm, AMD)
- Break
- Introduction to ML frameworks (Alessandro Fanfarillo, AMD)
- Break
- Discussion and
feedback

---

### AMD Compute Node Reservations

- Nov 29th – 14:00 to 17:00
- Nov 30th – 14:30 to 17:30
- Dec 1st – 13:30 to 17:00
- Dec 2nd – 16:00 to 17:00

### Training material

Detailed instructions: https://hackmd.io/@enccs/HJHVWuQvi
Second day: https://hackmd.io/VpT0JnqGT82Z77kxj95Odg?both#November-30
Third day: https://hackmd.io/gYQqaedKTh-gptKOF3u07g?view

---

### Feedback

At the end of each day we will ask you for short feedback:

- one thing you liked or found useful
- one thing that can be improved

---

### Instructors and helpers

- George Markomanolis, AMD
- Jose Noudohouenou, AMD
- Bob Robey, AMD
- Brian Cornille, AMD
- Noah Wolfe, AMD
- Gina Sitaraman, AMD
- Mahdieh Ghazimirsaeed, AMD
- Alessandro Fanfarillo, AMD
- Michael Klemm, AMD
- Suyash Tandon, AMD
- Thor Wikfeldt, ENCCS
- Qiang Li, ENCCS

---

### Code of conduct

We strive to follow the [Contributor Covenant Code of Conduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct/) to foster an inclusive and welcoming environment for everyone. [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/ENCCS/event-organisation/blob/main/CODE_OF_CONDUCT.md)

In short:

- Use welcoming and inclusive language
- Be respectful of different viewpoints and experiences
- Gracefully accept constructive criticism
- Focus on what is best for the community
- Show courtesy and respect towards other community members

Contact details to report CoC violations can be [found here](https://enccs.se/kjartan-thor-wikfeldt).

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

---

## Connecting to AMD Cloud

Detailed instructions: https://hackmd.io/@enccs/HJHVWuQvi

---

## Questions, answers and information

:::success
- [Day 1 Porting to HIP, George](https://drive.google.com/file/d/1hyKGZMMv7JajIcXPwx9UurFESbozYAz1/view?usp=share_link)
- [Day 1 Intro to HIP, George](https://drive.google.com/file/d/1NvZwpKfgxWw7sMwOUoyaIV1fmO_ZE986/view?usp=sharing)
- [Day 1 Developing in Fortran, Bob](https://drive.google.com/file/d/1TyAP-kB8V1CvqxXgYxkzizc0TWVjIrHl/view?usp=sharing)
- [Day 1 OpenMP, Jose](https://drive.google.com/file/d/1xUCZoDai-leOY2-1DLrj1PsvfEZ0Kxkb/view?usp=sharing)
- [Day 2 architecture intro, George](https://drive.google.com/file/d/1CWSBMHas6aPddeHC1uJB9I4VId5dWTRF/view?usp=sharing)
- [Day 2 AMD communication, Mahdieh](https://drive.google.com/file/d/1SKzjjR8TCqbGviJm6R-jbK67ArN8i64D/view?usp=sharing)
- [Day 2 memory systems, Bob](https://drive.google.com/file/d/1RYa9R58RlUHjDyylHtHb4W5bE1Bj4-Ql/view?usp=sharing)
- [Day 2 Roofline model, Noah](https://drive.google.com/file/d/1t8pl2RyQyG2Ue9X4qacU0xzBGnmeIqpt/view?usp=sharing)
- [Day 2 Affinity, Gina and Bob](https://drive.google.com/file/d/1e0Ap8wzu4UyAKtM3k2Dhz2wHnPZBprmF/view?usp=sharing)
- [Day 3 rocprof intro](https://github.com/ENCCS/AMD-ROCm-development/blob/main/content/slides/intro_rocprof.pdf)
- [Day 3 Omniperf intro](https://github.com/ENCCS/AMD-ROCm-development/blob/main/content/slides/intro_omnitools.pdf)
- [Day 3 rocgdb intro](https://github.com/ENCCS/AMD-ROCm-development/blob/main/content/slides/ENCCS-AMD-rocgdb-Dec2022.pdf)
- [Day 4 OpenMP programming](https://github.com/ENCCS/AMD-ROCm-development/blob/main/content/slides/OpenMP_Offload_Programming.pdf)
- [Day 4 Intro to ML frameworks](https://github.com/ENCCS/AMD-ROCm-development/blob/main/content/slides/ML_Training_ENCCS_Dec2_22.pdf)
- [Day 4 Summary, discussion, feedback](https://enccs.github.io/AMD-ROCm-development/summary/)
:::

### Icebreaker question

Are you working on a project that you want/need to port
to AMD GPUs? If so, what is it about?

- Thor: Not working on a specific project but want to start exploring AMDGPU.jl in Julia on AMD GPUs
- Hugo: Want to implement the hybridization-expansion continuous-time quantum Monte Carlo method. The core problem is doing fast Markov chain updates through fast determinant updates, adding one (or a few) row(s)+column(s) to an already known determinant matrix. This boils down to a few matrix-matrix operations with sizes $\sim 100$. Since the algorithm is trivially parallelisable, the goal is to run as many Markov chains in parallel as possible on each GPU, using as many GPUs as possible. Here is a minimal (CPU) implementation: https://github.com/HugoStrand/cthyb.jl
- Simppa: Wish to support others porting to Lumi, but also try to achieve performance portability on a specific OpenMP-offloaded code.
- Peter: Not working on a particular project but I would like to explore AMD as a possibility for our (current or future) HPC infrastructure. Additionally interested in AMDGPU.jl and in AMD support in AMBER, ...
- Balthasar: I'm working at ECMWF on porting the IFS to GPU architectures, including LUMI for Destination Earth's Digital Twin.
- Simon: As part of the BioDiversity Digital Twin project I am working on running joint species distribution models on the LUMI supercomputer. I am working with a group that is porting their R code to TensorFlow to run on GPUs on LUMI.
- Michael: Also working at ECMWF on porting the IFS (numerical weather prediction) to GPUs, including AMD GPUs
- Yorgos: Working on many-body physics simulation; also want to learn it for personal ML projects
- Igor: Not working on any AMD-related projects at the moment but I am interested to know more about AMD GPUs and how they could be used for some of our projects in the future
- Waclaw: Currently not porting, but hope to improve the GPU port of an MD code.
- Kevin: Our gyrokinetic code (plasma physics) was recently partially ported to a new machine similar to Lumi through an external collaboration, and I want to be brought up to speed before starting to work on it.
- Andreas: Working at ECMWF on porting weather prediction codes (mostly Fortran+OpenACC calling some C++ with CUDA) to AMD GPUs. Especially interested in hipBLAS and hipFFT.
- Kin: Interested in porting a CUDA lattice Boltzmann code in Fortran for multiphase flow to the AMD platform.
- Giannis: Would have to port CUDA code in a DFT code to AMD GPUs
- Gaurav: I work for HPE/Cray and worked a bit on the Lumi procurement behind the scenes, mostly on hardware. Interested to know more about the ROCm stack from AMD colleagues and in understanding how researchers plan to use the machine and ROCm, so that we can support them better :)

---

## Questions, answers and information

### Day 1

- How do I run jobs on AMD cloud outside of the course reservation?
    - Use the standard Slurm commands to allocate resources. You do not need the reservation argument when outside of the course reservation.
- My jobs are not running in AMD cloud (user `enccs_tr_36`)
  ```
  $ squeue -u enccs_tr_36
  JOBID  PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
  16576  MI250      jobscrip  enccs_tr  PD  0:00  1      (ReqNodeNotAvail, May be reserved for other job)
  16577  MI250-x4-  jobscrip  enccs_tr  PD  0:00  1      (Priority)
  ```
  even though there are nodes available
  ```
  $ sinfo
  PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
  MI210*       up     2-00:00:00  1      mix    mun-node-1
  MI210*       up     2-00:00:00  1      alloc  mun-node-2
  MI250-x4-IB  up     1-00:00:00  2      plnd   mun-node-[3-4]
  MI250-x4-IB  up     1-00:00:00  1      mix    mun-node-7
  MI250-x4-IB  up     1-00:00:00  1      alloc  mun-node-8
  MI250        up     1-00:00:00  2      mix    mun-node-[6,9]
  MI250        up     1-00:00:00  3      idle   mun-node-[5,13-14]
  ```
  what am I doing wrong?
    - Did you use the reservation argument?
    - No, I commented it out; here is the head of my job script
      ```
      #!/bin/bash
      ##SBATCH -p MI250
      #SBATCH -p MI250-x4-IB
      #SBATCH -N 1
      ##SBATCH --reservation=enccs
      #SBATCH --gpus=1
      ```
    - The batch script looks good. There may be another reservation.
    - Hmm, there is the `ML_Denoiser` reservation, but it only uses the MI210 nodes. Check the output of `scontrol show reservation`
    - I don't see it either. And our reservation does not start until 14:00. There are idle nodes, so it should start.
    - I think so too, but that is not the case.
    - This just worked for me
      ```bash
      #!/bin/bash
      #SBATCH -p MI250
      ##SBATCH -p MI250-x4-IB
      #SBATCH -N 1
      #SBATCH --reservation=enccs
      #SBATCH --gpus=1
      ```
    - I tried running a job in the MI250 partition and see the same problem.
    - I just tried getting an interactive allocation in MI250 and that works with the reservation argument: `salloc -N 1 -p MI250 --gpus=1 --reservation=enccs -t 10`. The batch job with the reservation is just in a PD state. I just checked it again and it ran and there is output.
- I expect to see the HIP API, which is pretty much identical to CUDA. In my mind the CUDA API/programming structure reflects the NVIDIA hardware. Now, should I make the same assumption when looking at HIP structures, that there is a direct mapping to equivalent hardware? What is the "AMD internal" language that includes the internal knowledge? What do you use in-house?
    - The hardware for GPUs has to be similar across vendors to support the graphics function. There are some special functions that are different, especially for compute functions such as matrix ops or tensor ops. The HIP language is translated into "HSA", which is the lower-level language, but programming is done at the HIP language level. CUDA translates into PTX as its lower level.
- Are HIP errors propagated to host code during kernel execution, or only at the end of the kernel or at some barriers?
  (Assuming asynchronous runs)
    - Note that the HIP error check only checks that the call is correct, not that it executed properly. Kernel failures are returned at the end of the operation. Kernel operations are asynchronous and the operations are batched; they do not get executed until information is requested from the GPU, so the errors cannot be sent back until the execution. However, it is possible to execute the kernel in the rocgdb debugger, where you can step through the execution and query the status. The rocgdb debugger will be presented at the start of Thursday's session.
    - Is it basically checking only on the API level, not within kernels?
        - Exactly. Kernel failures usually fail silently and the data is not returned.
- There doesn't seem to be a stream defined in the error checking. How does one know which stream the error belongs to?
    - The HIP error check is on the API call. The stream defined in the call would be the one for which the error occurred. If the failure is in a series of kernels, it may not be apparent which kernel failed. Inserting calls that request information back after a call will force the calls to execute one-by-one and help diagnose the problem.
- Does the device know of the streams, or do they live only in the API?
    - Interesting question. It depends on what you consider the "device". There is a scheduler on the GPU that manages which work gets done. It knows the dependencies and controls the execution of the work. The scheduler also knows when a workgroup is stalled on a memory access (or other requirement) and will swap out the workgroup for another workgroup that can make progress.
- What limits the number of streams?
    - There is a limit, but I haven't seen anyone hit it. I don't remember what it is; 32668 is the max according to one report.
- Assume my threads seem to start separating into two categories as the computation progresses. Is there an established/efficient way to regroup threads? To have warps doing the same branching.
    - Thread divergence is an important consideration with SIMD or SIMT types of processors (vector or threads). Short divergence is fine, but long divergence can cause poor performance.
- What does "GPU%" in `rocm-smi` tell?
    - GPU utilization
- Where should we put questions for the exercises? Also in this hackmd document?
    - Yes. Or raise a hand and ask verbally.
- I set up the amdcloud connection yesterday, but it doesn't work today. Is mine "" particularly down?
    - I just connected with that
    - port 8560?
    - ssh -i id_ed25519 -p 8560 enccs_tr_XX@
    - The third client machine I tried
    - Status OK now; the two first client machines had issues. IDK the issue with the first, the 2nd had a too-old ssh client.
- The HIP example `sbatch hip_batch.sh` gives me "sbatch: error: Invalid directive found in batch script: –t"
    - try removing `-t`! The `–t` in the error message is an en dash rather than an ASCII `-t` (often the result of copy-paste); retyping the option by hand fixes it.
    - `--time` for the script seems to work
- /global/modulefiles/StdEnv.lua writes `See this ticket for details`: [45](https://github.com/ROCmSoftwarePlatform/ACP/issues/45), but the link is broken. What's the issue there?
    - that repository has perhaps been made private
- In terms of using AMD GPUs without HIP and OpenMP: what do you think about using OpenACC with the Cray compiler?
    - Cray has mature OpenACC and OpenMP support in their Fortran compiler. C/C++ has switched to an LLVM basis and does not have OpenACC support any more. They have not implemented all of the latest OpenACC standard. AMD is supporting the development of the GCC compiler by Siemens with OpenACC support in Fortran/C/C++. The exercises at the end of the day include an example with Siemens. Note that there is a Siemens module on the ACP cloud system.
- OpenCL?
    - OpenCL is supported by AMD, though it is not promoted as much today. It is a lower-level model.
- gcc performance for NVIDIA was bad in my experience? What about AMD?
    - Performance is still being worked on in GCC. Use the latest release from Siemens, as they are working on it and upstreaming it.
      You should see performance improvement by the summer release. The version of Siemens on ACP is the June release; a more recent release has been made in the last couple of weeks.
- How are the wavefronts actually executed? Is a wavefront executed on one SIMD unit?
- Is there a link to the slides or are we getting the slides later?
    - slides will be sent out to everyone asap!

**Exercise: https://hackmd.io/VpT0JnqGT82Z77kxj95Odg?both#Hipify-example**

- What does it take to gain "permanent" access to
    - You can submit a request at AMD Cloud Platform (ACP) - https://acp.amd.com in the upper right corner.
- I am having trouble compiling the OpenMP examples
  ```
  $ make
  amdclang -o saxpy codelet.o main.o -fopenmp
  ld.lld: error: undefined symbol: __kmpc_push_target_tripcount_mapper
  >>> referenced by codelet.c:5 (/mnt/beegfs/home/enccs_tr_36/openmp_samples/saxpy_gcc/codelet.c:5)
  >>> codelet.o:(saxpy)
  ld.lld: error: undefined symbol: __tgt_target_teams_mapper
  >>> referenced by codelet.c:5 (/mnt/beegfs/home/enccs_tr_36/openmp_samples/saxpy_gcc/codelet.c:5)
  >>> codelet.o:(saxpy)
  ld.lld: error: undefined symbol: __tgt_register_requires
  >>> referenced by codelet.c:0 (/mnt/beegfs/home/enccs_tr_36/openmp_samples/saxpy_gcc/codelet.c:0)
  >>> codelet.o:(.omp_offloading.requires_reg)
  clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
  make: *** [Makefile:16: saxpy] Error 1
  ```
    - You are compiling OpenMP for the CPU. Do you have the ROCm module loaded? It looks like it is not finding the OpenMP libraries.
    - I was adding the `--offload-arch=gfx90a` to the CFLAGS; moving it to the LDFLAGS works much better.
    - Makes sense. The -fopenmp and the --offload-arch are needed for the link.
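    - To summarize where the flags go, here is a sketch of the full build based on the commands above (`gfx90a` matches the MI250 nodes on the ACP cloud; adjust for your target):
      ```bash
      # compile: -fopenmp and --offload-arch are needed so device code is generated
      amdclang -O3 -std=c99 -fopenmp --offload-arch=gfx90a -c codelet.c -o codelet.o
      amdclang -O3 -std=c99 -fopenmp --offload-arch=gfx90a -c main.c -o main.o
      # link: the same flags are needed again so the offload runtime is linked in;
      # omitting them here is what produces the undefined __tgt_*/__kmpc_* symbols
      amdclang -fopenmp --offload-arch=gfx90a -o saxpy codelet.o main.o
      ```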
- Building the OpenMP saxpy_gcc with the GPU OpenMP offloading gives warnings
  ```
  $ make
  amdclang -O3 -g -std=c99 -fopenmp --offload-arch=gfx90a -c codelet.c -o codelet.o
  warning: codelet.c:4:1: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
  1 warning generated.
  amdclang -O3 -g -std=c99 -c main.c -o main.o
  amdclang -o saxpy codelet.o main.o -fopenmp --offload-arch=gfx90a
  warning: Linking two modules of different data layouts: '/tmp/main-f28731.o' is '' whereas 'llvm-link' is 'e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7'
  enccs_tr_36@mun-node-5:~/openmp_samples/saxpy_gcc$ cat codelet.c
  void saxpy(int n, float a, float *restrict x, float *restrict y, float *restrict z)
  {
  #pragma omp target teams distribute parallel for simd map(to: x[0:n],y[0:n]) map(from: z[0:n])
      for (int i = 0; i < n; i++)
          z[i] = a*x[i] + y[i];
  }
  ```
  what am I doing wrong?
    - There is nothing wrong with your compile. The compiler is just telling you that it cannot vectorize the loop.
    - Fair enough, but why is it warning in the linking stage?
        - Good question. I don't think I have seen that before. Will look into this ...
- Are the compiler flags for the MI250 the same as for the MI200 series? (What about the MI250X?)
    - Yes, the same. The difference with the MI250X is the coherency with the system memory; there are also some other differences in the hardware.
- To what degree can I mix HIP and OpenMP offload? (e.g. OpenMP-allocated memory with HIP kernels)
    - +1
    - This should be possible. You must be careful to give the device pointer and not the host pointer to the HIP kernel. There are tools within OpenMP to get the device pointer, but it can be confusing.
    - If you have HIP and OpenMP in separate files, you can combine them.
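    - A minimal, host-compilable sketch of the device-pointer handoff. `launch_hip_kernel` is a hypothetical stand-in for a real HIP kernel launch (e.g. via `hipLaunchKernelGGL`), so the example stays self-contained; the OpenMP part is the pattern that matters:
      ```c
      #include <stdio.h>
      #include <stdlib.h>

      /* Stand-in for a real HIP kernel launch; when offloading is active
         it must receive a *device* pointer, which use_device_ptr provides. */
      static void launch_hip_kernel(double *dx, int n, double a)
      {
          for (int i = 0; i < n; i++)
              dx[i] *= a;
      }

      int main(void)
      {
          enum { N = 8 };
          double *x = malloc(N * sizeof *x);
          for (int i = 0; i < N; i++)
              x[i] = i;

          /* Map the buffer with OpenMP, then make x refer to the device
             copy inside the inner block before handing it to HIP. */
          #pragma omp target data map(tofrom: x[0:N])
          {
              #pragma omp target data use_device_ptr(x)
              launch_hip_kernel(x, N, 2.0);
          }

          printf("%g %g\n", x[0], x[N - 1]);
          free(x);
          return 0;
      }
      ```
      Without an offloading compiler the pragmas are no-ops and the code still runs on the host, which makes the pattern easy to test incrementally.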
- Related to HIP with OpenMP: why doesn't the compiler give me any warnings or errors when I try to use OpenMP offloading inside a HIP file? I only found out from Lumi support that OpenMP offloading and HIP cannot be used in the same file. The code compiled fine but failed at runtime with GPU memory faults.
    - I need to check. Last year I could not compile such a case; I had errors. However, I think it is improving, and OpenMP 5.1 brings this concept closer.
- Another question about using HIP with OpenMP offloading: hipfft requires data to be provided as `float2*` or `hipfftComplex*`. These data types are defined in HIP (`hip/hip_vector_types.h`). I can use float arrays created on the GPU with OpenMP in a different file by using something like `float2 *dx = reinterpret_cast<float2*>(x)` in my HIP file. Is this the recommended way to do this or is there a better way?
- The pace we are going through the material at is incredible! Is this going to continue for the rest of the week?
    - too fast?
        - Even if I have experience with OpenMP and just went to a HIP course, this is a lot of information flowing. At best I can grasp specific new pieces of information. There is zero chance that the lecture prepares one for the exercises.
        - +1
        - +1
        - +1
    - I am finding the material too fast; the presentations are information overload, very technical. However, I am not very experienced in HPC/GPU development.
        - thanks for this feedback - what language are you interested in? Do you know what GPU language/approach you are most interested in? The presentations are giving a broad coverage of different approaches. Once you decide which approach, it is possible to get a more specific, step-by-step guide.
        - My background is higher level, e.g. R/Python, so all the low-level details are a bit lost on me. This is the sort of direction I'm coming from: https://www.amd.com/en/graphics/servers-solutions-rocm-ml
        - You may be particularly interested in the presentation on machine learning on Friday.
          While machine learning still has all of the concerns discussed in these presentations, the main compute routines have for the most part already been written and are accessed through PyTorch, TensorFlow and other libraries.
- It would be great if we could have the slides already during the lecture, because that would allow us to check if we missed something on the previous slide when we lose track.
    - good point, we'll try to make all slides available
- OpenMP is quite high level compared to HIP. How do I e.g. know what is in shared memory and what is in global memory?
    - This is generally up to the compiler to determine; e.g. private arrays can be stored in shared memory if the compiler deems this optimal. Some compilers will emit diagnostics indicating what was done.
    - Additionally, in OpenMP 5.0 the `allocate` directive and clause were introduced. This lets you specify an allocator (e.g. `omp_low_lat_mem_space`) to use for variables within kernels. However, it is still up to the implementation to map that to hardware.
- Do function calls **within kernels** on devices have a heap like on the host? (Perhaps everything is inlined?)
    - The data will be moved over to the device in the memory copies. Then the pointer to this data is passed to the kernel.
    - **within kernels** is a good question. The pragma-based languages did not handle this well at first and everything needed to be inlined. Now there are ways to mark subroutines as "device" subroutines so that they can be called from within another kernel. The data in the subroutine must also be on the device and have special declarations. Since these are newer additions to the standard, they may not be implemented in every compiler yet, and the compiler may not give good warnings to the programmer.
- Not sure if I missed it in the lecture: can I create pinned memory with OpenMP?
    - With OpenMP 5.0 this should be possible using a custom allocator definition (`pinned` is one of the allocator traits that can be set when defining an allocator). However, I do not know what the implementation status of this is in any compiler.
- What are the best points of contact to report compiler issues for each of the compilers listed in the previous presentation?
    - For the Siemens compilers, we are reporting issues to them with each of their releases. Right now, they are focusing on functionality. They are very interested in issues that are found with more complicated applications. For now you can pass them to a member of this team and they will report them.
        - I'm certainly intrigued and will give it a try. Thanks!
    - Issues with the Cray compilers are reported to HPE.
    - AOMP and HIP are supported by the ROCm team.
        - Thanks!
    - If you are not getting a response on a particular compiler issue or can't find a place to report it, you can check with one of our team of presenters and they will follow up.
- Should I expect to see some timing output with `LIBOMPTARGET_KERNEL_TRACE=1` for the saxpy_gcc example? Setting this variable doesn't change anything for me.
    - Try `GCN_DEBUG=true`. The debug and trace information has a compiler-dependent environment variable, so it depends on which compiler you are using.
- For the fib_gfortran exercise I am getting the following warning (just FYI): `F90-W-1206-OpenMP GPU - Array sections in "map" clause is not implemented yet, the entire array will be mapped (freduce.f90: 18)`
    - That is expected and is just information from the compiler that it hasn't implemented the slice feature.
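- Following up on the pinned-memory question above, here is a compilable sketch of the OpenMP 5.0 allocator-trait API. It is guarded so it falls back to `malloc` where the 5.0 allocator API is unavailable, and, as noted above, whether the `pinned` trait is actually honored is implementation-dependent:
  ```c
  #include <stdio.h>
  #include <stdlib.h>
  #if defined(_OPENMP) && _OPENMP >= 201811
  #include <omp.h>
  #endif

  int main(void)
  {
      size_t n = 1000;
      double *buf;

  #if defined(_OPENMP) && _OPENMP >= 201811
      /* OpenMP 5.0: request pinned (page-locked) host memory via an
         allocator trait; implementations may silently ignore the trait. */
      omp_alloctrait_t traits[1] = { { omp_atk_pinned, omp_atv_true } };
      omp_allocator_handle_t pinned =
          omp_init_allocator(omp_default_mem_space, 1, traits);
      buf = omp_alloc(n * sizeof *buf, pinned);
  #else
      buf = malloc(n * sizeof *buf);  /* fallback without OpenMP 5.0 */
  #endif

      double sum = 0.0;
      for (size_t i = 0; i < n; i++) {
          buf[i] = (double)i;
          sum += buf[i];
      }
      printf("%.0f\n", sum);  /* 0 + 1 + ... + 999 */

  #if defined(_OPENMP) && _OPENMP >= 201811
      omp_free(buf, pinned);
      omp_destroy_allocator(pinned);
  #else
      free(buf);
  #endif
      return 0;
  }
  ```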
- For the OpenMP saxpy example I get:
  ```
  gcc -o saxpy codelet.o main.o -fopenmp
  lto-wrapper: fatal error: could not find accel/nvptx-none/mkoffload in /usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/ (consider using ‘-B’)
  ```
    - Have you changed the compiler?
    - Now I get
      ```
      amdclang -o saxpy codelet.o main.o -fopenmp
      ld.lld: error: undefined symbol: GOMP_target_ext
      >>> referenced by codelet.c:5
      >>> /tmp/codelet-792bd9.o:(saxpy)
      ```
    - had the same; do a `make clean` to have no old artifacts left
        - Perfect, thanks!
- Could you tell us the mapping between the various OpenMP parallelisms and the equivalent HIP parallelisms for the supporting compilers, please? (This information is quite critical.) E.g. SIMD <--> warp, thread group <--> block, or however it goes.
    - This actually depends on the compiler. For all of them `teams` => blocks; for the AMD and GNU compilers `parallel` => threads; for Cray `simd` => threads. The most portable way to use OpenMP is `teams distribute` for the outer level of parallelism and `parallel for simd` for the inner level of parallelism.
    - OpenMP is a prescriptive language and the directives are supposed to tell the compiler exactly how to do the parallelism. But as you see from above, the compilers do it differently. OpenACC is a descriptive language and leaves the details to the compiler. OpenMP has added the `loop` directive as a descriptive option that replaces the complex syntax of earlier OpenMP directives and gives the compiler some freedom in how it implements the parallelism.

The OpenMP exercises are in `/global/training/enccs/exercise/openmp_samples`

- What are the modules needed for the [HIPFort-Example](https://hackmd.io/@enccs/HJHVWuQvi#HIPFort-Example)?
  ```bash
  hipfc -v --offload-arch=${ROCM_GPU} hip_implementation.cpp main.f03
  /global/software/rocm/rocm-5.3.0/bin/hipcc -fno-gpu-rdc -fPIC --offload-arch=gfx90a -c hip_implementation.cpp -o /tmp/hipcc.o
  /usr/bin/gfortran -cpp -I/mnt/beegfs/home/enccs_tr_15/hipfort/include/hipfort/amdgcn main.f03 /tmp/hipcc.o -L/mnt/beegfs/home/enccs_tr_15/hipfort/lib -lhipfort-amdgcn -L/global/software/rocm/rocm-5.3.0/lib -lamdhip64 -Wl,-rpath=/global/software/rocm/rocm-5.3.0/lib -lstdc++ -o /mnt/beegfs/home/enccs_tr_15/hipfort-source/test/f2003/vecadd/a.out
  lto1: internal compiler error: compressed stream: data error
  0x7fce85771082 __libc_start_main
          ../csu/libc-start.c:308
  Please submit a full bug report, with preprocessed source if appropriate.
  Please include the complete backtrace with any bug report.
  See <file:///usr/share/doc/gcc-9/README.Bugs> for instructions.
  lto-wrapper: fatal error: /usr/bin/gfortran returned 1 exit status
  compilation terminated.
  /usr/bin/ld: error: lto-wrapper failed
  collect2: error: ld returned 1 exit status
  ERROR: The following command failed with return code 1.
  /usr/bin/gfortran -cpp -I/mnt/beegfs/home/enccs_tr_15/hipfort/include/hipfort/amdgcn main.f03 /tmp/hipcc.o -L/mnt/beegfs/home/enccs_tr_15/hipfort/lib -lhipfort-amdgcn -L/global/software/rocm/rocm-5.3.0/lib -lamdhip64 -Wl,-rpath=/global/software/rocm/rocm-5.3.0/lib -lstdc++ -o /mnt/beegfs/home/enccs_tr_15/hipfort-source/test/f2003/vecadd/a.out
  #Info: Temp files kept in /tmp
  ```
    - I am running with the default ROCm 5.3.0 module loaded. Your output also shows rocm-5.3.0. I think you are doing everything right. My compile line looks exactly the same, but it works. Perhaps it is having some trouble with the /tmp files?
      ```bash
      hipfc -v --offload-arch=${ROCM_GPU} hip_implementation.cpp main.f03
      /global/software/rocm/rocm-5.3.0/bin/hipcc -fno-gpu-rdc -fPIC --offload-arch=gfx90a -c hip_implementation.cpp -o /tmp/hipcc.o
      /usr/bin/gfortran -cpp -I/mnt/beegfs/home/brobey/hipfort/include/hipfort/amdgcn main.f03 /tmp/hipcc.o -L/mnt/beegfs/home/brobey/hipfort/lib -lhipfort-amdgcn -L/global/software/rocm/rocm-5.3.0/lib -lamdhip64 -Wl,-rpath=/global/software/rocm/rocm-5.3.0/lib -lstdc++ -o /mnt/beegfs/home/brobey/hipfort-source/test/f2003/vecadd/a.out
      #Info: Temp files kept in /tmp
      ./a.out
      -- Running test 'vecadd' (Fortran 2003 interfaces)- device: - PASSED!
      ```

### Day 2

- Where did the day 1 exercises go? The link https://hackmd.io/@enccs/HJHVWuQvi only shows day 2 exercises now
    - everything is collected now at: https://enccs.github.io/AMD-ROCm-development/
        - very nice, thanks!!

#### Lesson D2.L1 architecture info

- Do we need to add those "pak" calls ourselves? Doesn't the compiler know to use them? Are they assembly or API/kernel functions?
    - If you use rocBLAS you just need to call the routines; it is hidden from you. The same goes for other libraries that support packed FP32.
- The 26.5 TFLOPS peak performance per GCD -- is that FP32? Or FP64?
    - Double (FP64), in special cases; usually it is less. HPL, for example, does not achieve that high.
- Does SLURM have a plugin that knows what is connected to what? Does it know to attach the job to the correct CPU-GPU combination?
    - The Affinity talk will discuss all of these.
    - SLURM does not.
    - Feel free to ask AMD colleagues to write one! :-)
- How is the Lumi/Frontier Slingshot connected to the GPUs? Is that the "NIC"?
    - Yes, the NIC; they have 4 NICs per node.

#### Lesson D2.L2 GPU aware MPI

- Is there GPU-direct-storage for AMD? Saving data directly to Lustre from the GPU?
    - I will let a colleague add a comment, but on AMD the GPU is connected to the interconnect; there is no CPU involvement, and during MPI I/O, as long as the buffer is on the GPU, it should not involve CPU work.
    - The GPU is connected to another GPU. There is no connection directly to storage. It still has to go through the CPU to access storage.
    - Storage is connected via InfiniBand (or Slingshot), typically.
    - Check the slide with the CPU and the GPUs: there is one NIC per GPU to connect to the network; there is no NIC connected to the CPU.
    - Correct. My question is whether there is support for writing to Lustre without going through host memory.
        - https://jira.whamcloud.com/browse/LU-14798
    - I would expect that if the data are on the GPUs, no CPU is used, as basically the I/O is MPI to the storage; I have in mind MPI I/O with GPU-aware always.
        - So it is most likely some variation of RDMA transfer?
    - Sorry, are you doing I/O and seeing a variation? It is easy to have some variation.
        - These are still hypothetical questions.
    - I am not sure we have a lot of data, as a vendor usually does not have a big system with storage, but it is something we want to check.
- Is there software support? Drivers?
    - The software support is through the MPI implementation. The presentation shows how to tell if the MPI supports GPU-direct communication.
- I see HIP code; what about OpenMP together with GPU-aware MPI?
    - If a GPU buffer is used in the MPI call, it will communicate from the GPU. There are ways to get the device buffer pointer in OpenMP offloading functions.
        - I see! :-)
- What happens when a GPU-aware MPI is called with pointers allocated with malloc_shared? Does it detect that the data is on the CPU or the GPU? Or does it assume that the data is a GPU pointer or a CPU pointer by default?
    - I guess if the coherency is activated, then it will be used, but you will pay the cost of the transfer from DRAM, which is slower than HBM.
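- As a concrete example of enabling GPU-aware MPI at runtime: on Cray MPICH systems such as LUMI (an assumption; other MPI stacks have their own switches, so check your site documentation), GPU support must be turned on explicitly before GPU buffers can be passed to MPI calls:
  ```bash
  # Cray MPICH: allow hipMalloc'd buffers directly in MPI_Send/MPI_Recv etc.
  export MPICH_GPU_SUPPORT_ENABLED=1
  # e.g. an OSU point-to-point benchmark with device buffers on both sides
  srun ./osu_bw D D
  ```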

#### Lesson D2.L3 memory systems

- On page 9, where is `array` defined? Didn't get that.
    - It could be a typo and be `a`
- Is the page migration available only on the MI200 series or also on the MI100?
    - On the MI250X with the Trento processor; not on the MI210 or MI250.
- My experience with managed memory on NVIDIA GPUs was that one should try it, but if the code for some reason fails there was not much chance to debug it. Is there any help to debug managed memory if there are issues?
    - That's a good question for our debugging session tomorrow, but I expect our ROCgdb, which handles CPU and GPU, could help.
    - I believe the example in the debugger session tomorrow uses the test case of commenting out the malloc and then detecting the problem with the rocgdb debugger.
- Is there anything like the coarse/fine grain for NVIDIA? Or is this something to only look out for in ROCm? "Performance portability..."
    - I do not know of an equivalent for NVIDIA yet. This can be done in a portable manner with some care across vendors. There will be some differences in performance with specific blocks of code between vendors or even between GPU models.

#### Exercise session

> Please remember the --reservation=enccs_2 flag today.

Instructions: https://hackmd.io/@enccs/HJHVWuQvi
Source files: /global/training/enccs/AMDTrainingExamples_ver0.2.tgz

#### Lesson D2.L4 roofline

- Is there a way to export the data from omniperf from a cluster and analyze it locally (if you have no way to run the Grafana)?
    - Yes, there is also a standalone GUI, besides the Grafana one.
- Is there a need for specific compiler options or so for the binary?
    - Not really
- Will this work for OpenMP offload?
    - Yes.
    - Last time I looked, the whole thing was a single huge kernel. :-( But that was not for AMD.
- is MD O(N) or O(N log N)?
    - depends on the treatment of long-range electrostatic interactions. Clever schemes bring it down to O(N) but with a prefactor; FFT for long-range interactions gives O(N log N).
- In MD, interactions have finite support.

### Break until 16:13 CET

#### Lesson D2.L5 Affinity
- I thought that LUMI-G has 64 cores per node, but in the slides there are 128. How does this change the affinity issues? Which cores are close to which GCD?
- The hardware cores can have hyperthreads enabled -- two per core. So the operating system sees 128 virtual cores, also called hardware threads. The hyperthreads are usually the higher-numbered virtual cores. Placing the processes on the virtual cores/hardware threads in the domain connected to the GCD will give good performance.
- Got it. Thanks!
- A comment on the slides: using the long version of the Slurm options might make the srun command a little more intuitive for a student to read.
- Thanks for the suggestion. The options can be confusing, especially with different ones for every tool.
- Is there a utility that allows you to play around with the binding outcome without submitting a job? Jülich had something like this.
- Interesting. We'll have to look into that.

### Day 3

#### rocprof intro
- As everything in this example is `hipXXX` calls, will the profiling work on NVIDIA cards?
- No. However, the application will run with CUDA, and nvprof, nsys, etc. would then be the appropriate tools for profiling.
- The exercise instructions for rocprof end with looking at the file rocprof_counters.csv, correct? Maybe I missed this: is there a way to visualize these counters? (I'm used to looking at counter measurements with HPCToolkit.)
- There are instructions in the exercises to copy results.json over to your laptop and visualize it with Perfetto. If the question is whether you can visualize the data in rocprof_counters.csv directly: it can be opened in an application such as Excel and the data plotted. Currently there is no integrated tool at the rocprof level that will visualize the CSV files.
rocprof is a lower-level tool that collects the data; the results will be used in omnitrace and omniperf for more powerful analysis.

#### omnitrace intro
- Skipping calls: ncu allows tracing not all consecutive calls of a single kernel, by skipping some of the calls. Is that possible with omnitrace?
```
/opt/nvidia/hpc_sdk/Linux_x86_64/22.3/profilers/Nsight_Compute/ncu --help | grep -A6 kernel-id
  --kernel-id arg   Set the identifier to use for matching the kernel to
                    profile. The identifier is of the format
                    "context-id:stream-id:[name-operator:]kernel-name:invocation-nr".
                    Skip entries that shouldn't be matched, e.g. use
                    "::foobar:2" to match the second invocation of "foobar"
                    in any context or stream.
```
- Not sure what I'm doing wrong. I get:
```
hipcc -o MatrixTranspose MatrixTranspose.cpp
Traceback (most recent call last):
  File "/global/software/rocm/rocm-5.2.3/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/global/software/rocm/rocm-5.2.3/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/global/software/rocm/rocm-5.2.3/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted
```
- Please compile declaring the architecture: `hipcc --offload-arch=gfx90a -o MatrixTranspose MatrixTranspose.cpp`
- If you have only one GPU through Slurm, it blocks access to the other GPUs. This causes rocm_agent_enumerator to fail. The fix is to supply the GPU type with `--offload-arch=gfx90a` on the compile line, so the autodetection of the GPU type is skipped.
- Which module do I need to load to use Omnitrace on LUMI?
- There is not yet a module for it on LUMI, but there will be soon.

#### rocgdb
- I probably missed this: is the use of rocgdb limited to binaries compiled with hipcc? Or would it also work with, e.g., applications built with Cray compilers (C, Fortran)?
- It also works for them, as long as you have compiled with ROCm or HIP.
Since rocgdb is just an enhanced gdb, it will run on an executable without HIP, but the enhanced functionality won't do anything.
- Is the saxpy source available? (I don't want to copy it from the slides, as it is a picture...)
- Working on it.
- Copying worked ;)
```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

__constant__ float a = 1.0f;

__global__ void saxpy(int n, float const* x, int incx, float* y, int incy)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
}

int main()
{
    int n = 256;
    std::size_t size = sizeof(float) * n;

    float* d_x;
    float* d_y;
    hipMalloc(&d_x, size);
    hipMalloc(&d_y, size);

    int num_groups = 2;
    int group_size = 128;
    saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1);
    hipDeviceSynchronize();
}
```
- :thumbsup:
- Now at /global/training/enccs/saxpy.cpp

#### omniperf
- Questions here.

### Day 4
:::danger
Slides and exercises for today: https://enccs.github.io/AMD-ROCm-development/
:::
:::info
- Add a few empty lines for a new question to avoid hackMD issues!
- Please go to view mode (:eye: symbol) when you're not editing.
:::

#### OpenMP presentation
- Any general advice on what to do with `PRESENT` clauses in OpenACC when porting Fortran+OpenACC to OpenMP?
- In the Fortran+OpenACC code that I'm porting to OpenMP there are a lot of statements like
```
!$ACC PARALLEL LOOP DEFAULT(NONE) PRIVATE(...) PRESENT(...)
```
Naively I would think this should translate in OpenMP to
```
!$OMP TARGET PARALLEL DO DEFAULT(NONE) PRIVATE(...) SHARED(...)
```
but SHARED feels wrong when using OpenMP just for GPU offloading. Should I rather remove the DEFAULT(NONE)?
- I only use PRIVATE and SHARED with CPU OpenMP directives. I don't think DEFAULT(NONE), PRIVATE, or SHARED are needed for the GPU with OpenACC or OpenMP. OpenMP will expect the data to be on the device unless you add the MAP clause. The TEAMS DISTRIBUTE is necessary to invoke the parallel hardware.
SIMD is good if you are also running on Cray, to get portable behavior and distribute the work across threads. But let's get Michael's thoughts on this.
```
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
```
- Any idea about the level of support for `use_device_ptr` in Fortran on LUMI?
- With AOMP? Cray? Which compiler?
- Both. There are also gfortran and flang.
- For LLVM Clang: https://clang.llvm.org/docs/OpenMPSupport.html
- For all OpenMP compilers: https://www.openmp.org/resources/openmp-compilers-tools/
- Let's ask Michael about flang -- yes, for flang. Not clear for Cray Fortran, but probably, because they support OpenMP 5.0. Have to check for gfortran -- GCC and/or Siemens.
- What should we do with data types defined in HIP (like float2) when we are not allowed to use them in our OpenMP file? (This seems to be necessary when using hipFFT with data created with OpenMP.)
- Question 1: Any general advice on what to do with `PRESENT` clauses in OpenACC when porting Fortran+OpenACC to OpenMP?
- Question 2: What should we do with data types defined in HIP (like float2) when we are not allowed to use them in our OpenMP file? (Seems to be necessary when using hipFFT with data created with OpenMP.)
- Not a teacher here, but I successfully converted a Fortran array of N floats to a float2 array of N/2 elements.
- This is something I am not sure about... can we create streams for the GPU with OpenMP offloading, without using HIP?
- I think the compiler decides automatically whether to create one or not, but Michael can answer.
- Yes and no. The underlying implementation of OpenMP is somewhat different; it doesn't have the underlying concept of a stream. It can use some of the underlying concepts, but unfortunately there is no portable method. Maybe in the future -- OpenMP 6.
- I realized something: of course OpenMP cannot expose shared memory. OpenMP would need to consolidate the mapping between parallelization levels on GPUs; otherwise there is, e.g., no well-defined mapping of with whom to share.
(Various implementations map warps/wavefronts to SIMD lanes or threads in an implementation-specific way.)
- How does the GPU's maximum number of concurrent (different) kernels map to OpenMP limits? Like the number of different work-sharing tasks or something?
- Resources:
  - https://www.openmp.org/resources/openmp-compilers-tools/
  - https://clang.llvm.org/docs/OpenMPSupport.html

### ML frameworks
- In ML: does "model" mean the neurons, or the neurons with the "correct" weights?
- If one's goal is to do/learn reinforcement learning, should one start with TF/PyTorch tutorials, or is there a direct approach to jump straight there?
- Thank you everyone! I have to leave a bit early. Thank you for the great workshop! :-) -- Simppa

:::info
*Always ask questions at the very bottom of this document, right above this.*
:::