![](https://media.enccs.se/2023/04/GPU-Course-why-when-how.jpg)

# GPU programming - Why, When and How?

### June 9 & 12, 2023

## General information

- ENCCS: https://enccs.se/
- Upcoming events: https://enccs.se/events/
- Event page: https://enccs.se/events/2023-06-gpu-programming/
- LUMI: https://www.lumi-supercomputer.eu/
- This hackMD: https://hackmd.io/@enccs/general-gpu-june2023
- Lesson material: https://enccs.github.io/gpu-programming/
- Workshop feedback form: https://events.prace-ri.eu/event/1501/surveys/1082

:::warning
Please take 5 minutes to participate in the 2023 ENCCS services survey! It helps us improve existing services and develop new services, which you can benefit from.
https://enccs.se/services-survey-2023
:::

### Follow ENCCS

- Sign up for the ENCCS newsletter: https://enccs.se/newsletter
- Follow us on [LinkedIn](https://www.linkedin.com/company/enccs) or [Twitter](https://twitter.com/EuroCC_Sweden)
- YouTube: [ENCCS](https://www.youtube.com/@enccs)

### Schedule

**Friday June 9, 09:00-16:00 CEST**

|Time | Topic |
| --- | ----- |
|9:00-9:15 | Welcome|
|9:15-9:30 | Why GPUs?|
|9:30-9:50 | The GPU hardware and software ecosystem|
|9:50-10:10 | What problems fit to GPU?|
|10:10-10:25 | Break|
|10:25-10:50 | GPU programming concepts|
|10:50-11:10 | Introduction to GPU programming models|
|11:10-11:50 | High-level language support|
|11:50-12:50 | Lunch break|
|12:50-13:40 | Directive-based models|
|13:40-14:30 | Multi-GPU programming with MPI|
|14:30-14:45 | Break|
|14:45-16:00 | Non-portable kernel-based models|

**Monday June 12, 09:00-16:00 CEST**

|Time | Topic |
| --- | ----- |
|9:00-10:00 | Non-portable kernel-based models|
|10:00-10:15 | Break|
|10:15-11:30 | Portable kernel-based models|
|11:30-12:00 | Preparing code for GPU porting I|
|12:00-13:00 | Lunch break|
|13:00-13:20 | Preparing code for GPU porting II|
|13:20-14:00 | Recommendations and discussions|
|14:00-14:15 | Break|
|14:15-15:45 | Problem example|
|15:45-16:00 | Wrap-up|

**Friday June 16, 09:00-16:00 CEST**

|Time | Topic |
| --- | ----- |
| 9:00 - 15:00 | Mini-hackathon / bring your code |

### Joining the mini-hackathon

:::info
Do you have a code that you want to port to GPUs and need help? Join us for a mini-hackathon on Friday June 16! Register at https://events.prace-ri.eu/event/1502/registrations/1102/.
:::

Please also visit this hackMD and describe your code, what you want to do, and how we can access your code: https://hackmd.io/@enccs/mini-hackathon-june2023

It's most convenient if your code is publicly accessible on GitHub, GitLab or similar and you write the URL in the hackMD, but you can also send an email to Thor Wikfeldt (thor.wikfeldt@enccs.se) with a file archive or instructions on how to access it.

---

### Training

https://enccs.github.io/gpu-programming/

---

### Instructors, helpers and lesson developers

- Andrey Alekseenko, KTH
- Cristian-Vasile Achim, CSC
- Erik Heggeli, University of Tromsø
- Hicham Agueny, University of Bergen
- Jaro Hokkanen, CSC
- Jussi Heikonen, CSC
- Magnar Bjorgve, University of Tromsø
- Pedro Ojeda May, HPC2N Umeå
- Qiang Li, ENCCS
- Richard Darst, Aalto University
- Stephan Smuts, University of Aarhus
- Stepas Toliautas, Vilnius University
- Thor Wikfeldt, ENCCS
- Tuomas Lunttila, CSC
- Wei Li, ENCCS
- Yonglei Wang, ENCCS

---

### Code of conduct

We strive to follow the [Contributor Covenant Code of Conduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct/) to foster an inclusive and welcoming environment for everyone.
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/ENCCS/event-organisation/blob/main/CODE_OF_CONDUCT.md)

In short:
- Use welcoming and inclusive language
- Be respectful of different viewpoints and experiences
- Gracefully accept constructive criticism
- Focus on what is best for the community
- Show courtesy and respect towards other community members

Contact details to report CoC violations can be [found here](https://enccs.se/kjartan-thor-wikfeldt).

---

### LUMI

First, if you have not accepted the Puhuri invitation yet, do so ASAP if you want access to LUMI!

The next step is to upload an SSH key. Instructions for uploading an SSH key: https://docs.lumi-supercomputer.eu/firststeps/SSH-keys/

In short:
1. Generate a local SSH key pair (you can also use an existing key)
2. Upload your public key to https://mms.myaccessid.org/profile/
3. Wait 1-2 hours for the key to propagate to the cluster, and then try logging in: `ssh -i <path-to-private-key> <username>@lumi.csc.fi`

Existing CSC users need to link their existing accounts: https://docs.csc.fi/accounts/how-to-manage-user-information/#how-to-link-your-csc-user-account-to-external-authentication-sources

Further information about MyAccessID authentication: https://puhuri.neic.no/user_guides/myaccessid_registration/

#### Running on LUMI

Interactive job, 1 node, 1 GPU, 1 hour:
`$ salloc -A project_465000485 -N 1 -t 1:00:0 -p standard-g --gpus-per-node=1`

Exit the interactive allocation with `exit`.

Interactive terminal session on a compute node:
`$ srun --account=project_465000485 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash`

Corresponding batch script:
```console=
#!/bin/bash -l
#SBATCH --job-name=gpulesson
#SBATCH --output=job.o%j
#SBATCH --error=job.e%j
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=0:20:00
#SBATCH --account=project_465000485

srun <some_command>
```

Submit your job: `sbatch batch-script.sh`

Monitor your job: `squeue --me`

Kill a job: `scancel <JOB_ID>`

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

---

## Questions, answers and information

:::warning
HackMD questions from Day 1: https://hackmd.io/@enccs/H14EFs7vh
:::

### Day 2

- I lost the e-mail with the Zoom link. Can you send it again? Very stupid of me. Too many e-mails.
  - I found it. Constructive feedback: it would be nice to have the Zoom link resent before each day. People (like me) are not always so structured :sweat_smile:
  - Very good point, it should have been in the wrap-up email after the session on Friday.
- Does working in CUDA/HIP mean writing C++? Or are there more languages?
  - Yes, CUDA/HIP are C++-based. There are wrappers for Rust, [Python](https://documen.tician.de/pycuda/tutorial.html#executing-a-kernel), etc., but you still have to write the kernel in C++ (or C).

## Non-portable kernel-based models (continued)

https://enccs.github.io/gpu-programming/9-non-portable-kernel-models/

- Why is it recommended to multiply with 32? What is this magic number?
  - On NVIDIA devices, threads are executed in groups of 32 (so-called ["warps"](https://enccs.github.io/gpu-programming/4-gpu-concepts/#cuda-threads-warps-blocks)). The number of threads will be rounded up to the nearest multiple of 32 anyway, but the additional threads will do nothing. A minimal sketch of this rounding is shown below.
  - The AMD MI250 devices we have in LUMI have a warp ([AMD calls it "wavefront"](https://enccs.github.io/gpu-programming/4-gpu-concepts/#terminology)) of 64.
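On the rounding question above, here is a minimal sketch (hypothetical kernel and array size, not from the lesson material) of the usual 1-D launch pattern: the block size is a multiple of the warp/wavefront size, the grid is rounded up, and threads past the end of the array simply do nothing.

```cpp
// Minimal sketch (hypothetical kernel and sizes); compile as CUDA (.cu) or HIP.
// The grid is rounded up so that num_blocks * block_size >= n.
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // "extra" threads beyond n just return
        x[i] *= 2.0f;
}

void launch(float *x_device, int n)
{
    const int block_size = 256;                               // multiple of 32 (or 64 on MI250)
    const int num_blocks = (n + block_size - 1) / block_size; // round up
    scale<<<num_blocks, block_size>>>(x_device, n);
}
```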
- Is it recommended to have the memory allocation/freeing in the main function of a program and not in separate subfunctions? Would it not be "cleaner" to separate it out?
  - For real programs -- yes, it would be cleaner to separate it. For code examples, it is easier to have a "linear" flow of logic, so we don't have to jump back and forth over the source code.
  - We also skip the error checking in our examples; you should definitely check returned status codes / catch exceptions in real programs.
  - We probably should add a disclaimer to the course that the code is there to illustrate programming concepts, not best practices.
- How can we calculate the bandwidth? Or know what it is?
  - We calculate the size of the matrix in bytes, multiply by two (because we first read and then write the whole matrix), and divide by the average kernel time; roughly `bandwidth = 2 * N * N * sizeof(element) / t_kernel`.
- Is the only difference between the naive and the CPU implementation of the matrix transpose to use shared GPU memory and synchronization?
  - This seems to be doable.
  - Well, yes, it's about knowing how and when to use it :) Just using shared memory won't make your code faster; you need to spot inefficient access patterns and use shared memory to "transform" them into efficient ones. And don't forget synchronizing (CUDA has the nice `compute-sanitizer` tool set to check for memory errors at runtime). A minimal tiled-transpose sketch is shown after this Q&A block.
- What happens in the example if offset = i * (N/stream_size) gives a fractional result? How does C++ deal with this and ints? I'm not so steady in C++. Would the offset not be converted to the nearest int, either up or down? Is there not a potential for memory errors there?
  - In C/C++, integer division truncates, so it will be rounded down if N is not divisible by stream_size (e.g. `10 / 4 == 2`).
  - But will the offset then always give the right value?
  - As Cristian mentioned, there's a typo: it should be N/n_streams to get the right offset. And, for simplicity, we assume that they are nicely divisible here.
  - But in the real world one needs to pay attention to this, right? There is a potential for an error there, is there not?
- What is the difference between ROCm and HIP? Which one is comparable to CUDA?
  - ROCm is the whole software toolset, including HIP; HIP is specifically the CUDA-like thing.
  - What would ROCm be comparable to in NVIDIA terms?
  - Don't think they have a similar term. CUDA is often used to broadly mean the whole software stack, so sometimes they can be equated like that.
- I am porting some CPU code to GPU with the purpose of running it on LUMI. I have an RTX 3090 in my computer. Can I develop the software while testing it on the 3090 and then run it on LUMI without doing a lot more work? I.e., can I easily do HIP programming and run it on a 3090?
  - One of the selling points of HIP is that it can target NVIDIA devices. So, theoretically, you can. HIP does not work *great* on NVIDIA, though; you might waste some time getting it to run (and, hopefully, you're not using Windows). There are also a lot of hardware differences between the 3090 and the MI250; *a lot*. So, most non-trivial optimizations and tuning will have to be re-done. (Using shared memory for matrix transpose is a trivial optimization in this context.)
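Referring to the shared-memory transpose discussed above, here is a minimal sketch (hypothetical names and sizes, not the exact lesson code): threads of a block load a tile into shared memory, synchronize, and then write it out transposed, so that both the global reads and the global writes are coalesced.

```cpp
// Minimal tiled-transpose sketch (hypothetical, not the exact lesson code).
// in and out are n x n row-major matrices; launch with dim3 block(TILE, TILE)
// and dim3 grid(n/TILE, n/TILE), assuming n is divisible by TILE.
constexpr int TILE = 32;

__global__ void transpose_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read

    __syncthreads();                                   // wait until the whole tile is loaded

    // swap the block indices so the write is also coalesced
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
```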
- What would be the best way to develop the software: write the code and run it on LUMI while I develop, or is it better to have a workstation with a suitable GPU to do the development and then later tune it for LUMI?
  - A nasty thing about AMD hardware is that they have quite different architectures for their desktop and server GPUs; so, in terms of tuning, an RX 6600 would not be much better than an RTX 3090 (although at least the AMD card will nicely run HIP, despite not being officially supported).
  - The point of this course is that there is rarely a one-size-fits-all approach. It depends heavily on your code, the available time, and your priorities (e.g., do you want to get the best possible performance vs. get the code running with minimal effort). It might be viable to first port to CUDA, and then convert automatically (or manually) to HIP. Or to develop on LUMI (it is not overloaded now, so getting a node to test your code should not be a huge problem). Or, indeed, to get HIP on an NVIDIA card. Or to use a portable kernel-based model.
  - One argument for first porting to CUDA: NVIDIA has *very* good profiling/debugging/sanitizing tools. And a huge market share. So, you would not have to fight the tooling during your initial porting, and you will get a code suitable for a wide range of existing machines. Then, [converting it to HIP is not hard](https://enccs.github.io/gpu-programming/11-gpu-porting/#porting-between-different-gpu-frameworks) (optimizations might be, but that's the part you'll have to do anyway if you switch between different devices).
  - My experience with porting is that one could use HIP to write a *functional* code on desktop computers and do some simple optimizations which are clear enough and work everywhere (coalesced accesses, overlapping, ...), but when going further with optimization there is a lot that will be different.
- A couple of years back I tried to get HIP running on my 6900XT with no luck. I might have gotten somewhere by spending more time trying to make it work, but I gave up. How will it be getting HIP to work on an NVIDIA card, say a 3090? On Linux of course (maybe there's even a recommended distro?) : )
  - I think it should work now. I recommend looking into using [`spack`](https://spack.io/).
- In the Hello World exercises, I get the printout twice. Is the program run two times? This happened on `02_hello.cpp` and `03_hello.cpp` as well.

```
klassonm@uan02:~/gpu-programming/content/examples/cuda-hip/hip/01_hello_world> hipcc -O2 --offload-arch=gfx90a 01_hello.cpp
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
klassonm@uan02:~/gpu-programming/content/examples/cuda-hip/hip/01_hello_world> srun -p standard-g --gpus 2 -N 1 -n 2 -c 4 --time=00:10 --account=project_465000485 ./a.out
srun: job 3675004 queued and waiting for resources
srun: job 3675004 has been allocated resources

----------------------
Hello World from CPU!

----------------------
Hello World from CPU!
```

  - The `02_hello.cu` should print only once. The `srun ...` command contains `-n 2`, which means 2 MPI processes per node. So there are 2 instances of the program running.
- What is the difference between the `.c` and `.cu` extensions?
  - `.cu` extensions are for CUDA code (e.g., code containing `__global__` kernels and other non-standard-C++ language constructs). It is not strictly required (you can configure your build system to compile `.c` or `.cpp` files with the CUDA compiler; same for HIP), just a very common convention.
- There seems to be an error in `05_hello.cu`. I can't compile it?
  - What error do you get?

```console
~/gpu-programming/content/examples/cuda-hip/cuda/01_hello_world> hipcc -O2 --offload-arch=gfx90a 05_hello.cu
05_hello.cu:8:21: error: use of undeclared identifier 'blockIdx'
    const int bid = blockIdx.x;
                    ^
05_hello.cu:9:22: error: use of undeclared identifier 'threadIdx'
    const int tidx = threadIdx.x;
                     ^
05_hello.cu:10:22: error: use of undeclared identifier 'threadIdx'
    const int tidy = threadIdx.y;
                     ^
05_hello.cu:11:5: error: no matching function for call to 'printf'
    printf("Hello World from GPU (block-%d and thread-(%d, %d))!\n", bid, tidx, tidy);
    ^~~~~~
/usr/include/stdio.h:332:12: note: candidate function not viable: call to __host__ function from __global__ function
extern int printf (const char *__restrict __format, ...);
           ^
05_hello.cu:20:11: error: unknown type name 'dim3'
    const dim3 block_size(4, 8);
          ^
05_hello.cu:21:19: error: use of undeclared identifier __hipPushCallConfiguration
    hello_from_gpu<<<1, block_size>>>();
                  ^
05_hello.cu:26:5: error: use of undeclared identifier 'cudaDeviceSynchronize'
    cudaDeviceSynchronize(); // cudaDeviceReset();
    ^
7 errors generated when compiling for gfx90a.
:~/gpu-programming/content/examples/cuda-hip/cuda/01_hello_world> cat 05_hello.cu
//
// nvcc 05_hello.cu
//
#include <stdio.h>

__global__ void hello_from_gpu()
{
    const int bid = blockIdx.x;
    const int tidx = threadIdx.x;
    const int tidy = threadIdx.y;
    printf("Hello World from GPU (block-%d and thread-(%d, %d))!\n", bid, tidx, tidy);
}

int main(int argc, const char * argv[])
{
    printf("\n----------------------\n");
    printf("Hello World from CPU! Before calling 'hello_from_gpu' kernel function.\n");

    const dim3 block_size(4, 8);
    hello_from_gpu<<<1, block_size>>>();
    printf("Hello World from CPU! After calling 'hello_from_gpu' kernel function.\n");
    printf("\n----------------------\n");

    cudaDeviceSynchronize(); // cudaDeviceReset();

    return 0;
}
```

  - Only the HIP examples will work on LUMI. Check the `05_hello.cu` in the `hip` folder.
  - Okay, my bad. Did you guys already say this?
  - No. Sorry about the confusion.
- Can you build one "hip binary" which works on MI100, MI210, MI250, MI300? Similar to CUDA supporting multiple GPU models?
  - `$ hipcc 03_hello.cpp -o 03_hello --amdgpu-target=gfx906,gfx908,gfx90a`
- hipEventElapsedTime returns time in ms, no? (In the examples it says seconds)
  - True. The time is given in s, but the bandwidth is in `GB/s`. It is a more natural measure. (See the timing sketch below.)
  - But hipEventElapsedTime returns for example "elapsed 3.860036 sec", while the entire code takes less than 1 second (including initializations). For 03_matrix_summation_GPU_2D2D_2D1D_1D1D.cpp
  - Are you referring to the code on the lecture page? There are of course big overheads associated with the CPU initialization, GPU allocations, **slurm**
  - hipEventElapsedTime returns **larger times** than the total wall time! (if we assume the units in the example are correct)
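On the units question above: both `cudaEventElapsedTime` and `hipEventElapsedTime` report the elapsed time in *milliseconds*, so a printout in seconds has to divide by 1000. A minimal sketch (hypothetical kernel and sizes, not the lesson code) of timing a kernel with events and deriving an effective bandwidth:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical kernel declared elsewhere (e.g. a tiled transpose).
__global__ void transpose_tiled(const float *in, float *out, int n);

// Times one kernel launch with HIP events and prints the effective bandwidth.
void time_transpose(const float *d_in, float *d_out, int n)
{
    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    dim3 block(32, 32);
    dim3 grid(n / 32, n / 32);

    hipEventRecord(start, 0);
    transpose_tiled<<<grid, block>>>(d_in, d_out, n);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);              // wait for the kernel and the stop event

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    double bytes = 2.0 * n * n * sizeof(float);   // the whole matrix is read once and written once
    printf("kernel: %.3f ms, ~%.1f GB/s\n", ms, bytes / (ms / 1000.0) / 1.0e9);

    hipEventDestroy(start);
    hipEventDestroy(stop);
}
```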
- When I run example 1 I only get the printed messages twice, even though I ask for 4 GPUs. Why?
  - What is the launching command?
  - Of course. My bad. The command is for GPUs, so of course I have to ask for more of those. Thanks
- For local development: is it possible to use a container (Docker or Singularity) that has the GPU functionality inside and then run this container on a cluster later? Maybe this is not possible since the architecture is not in the container, but it would be REALLY nice if it was possible to try things locally during development before wasting supercomputer CPU/GPU power on solving bugs
  - It is possible to use Singularity containers on LUMI, but you still need to have some hardware locally if you want to test/debug things.
- My experience is that getting NVIDIA cards working with TensorFlow / PyTorch (on Linux) can be a real hassle because I have to be very careful with choosing the correct driver and software versions to get everything running. Is it an equal hassle to get CUDA running? Do you have a recommended guide to set up an NVIDIA card on a Linux computer for running CUDA software?
  - TensorFlow uses CUDA internally, so it's a bit easier (you only have CUDA and the driver that should match, no need to match the TensorFlow version). Usually for me, downloading CUDA from https://developer.nvidia.com/cuda-downloads?target_os=Linux and installing the bundled driver works pretty well.
  - Will I be able to switch between CUDA and HIP without installing / changing drivers / software?
  - If you want to use HIP for an AMD card, you can use them independently, no problems. If you want to use HIP for NVIDIA, you might have to play around to find a matching version (no experience how hard it is); but you can definitely use them together, HIP uses CUDA internally for NVIDIA GPUs.
- Do you guys have recommendations for how to go from a Docker to a Singularity container?
  - `singularity build ...` :)
  - Yes, but it is not always enough. For instance, how Docker and Singularity write to disk is different. I'm currently trying to figure out how to go from Docker to Singularity. And write permissions inside the container are NOT trivial to figure out.

## Portable kernel-based models

https://enccs.github.io/gpu-programming/10-portable-kernel-models/

- Are all of these wrappers around CUDA/HIP etc.?
  - OpenCL is a separate thing. Kokkos and SYCL (the two major implementations, DPC++ and hipSYCL) use CUDA/HIP internally for NVIDIA/AMD support.
- Is Kokkos actually a shell? Ref the Makefile in the example?
  - No, it's just a Makefile in which the Kokkos Makefile is included; KOKKOS_ARCH, KOKKOS_DEVICES, and KOKKOS_CUDA_OPTIONS are "inputs" to the included Kokkos Makefile, and the KOKKOS_* variables in the build case are "outputs"
- Can I test Kokkos locally?
  - Yes. It can also be compiled for a CPU-only architecture. Jaro can give more details.
  - I would really like more details on how to do this :) :+1:
  - Sure, just use "OpenMP" for KOKKOS_DEVICES and "SNB" for KOKKOS_ARCH in the Makefile, and for example g++ for the CXX (if it does not compile, copy the Makefile contents again from the course page; I updated the order in the compile command for the OpenMP case, the original did not work)
  - To play with Kokkos on LUMI:
    1. mkdir kokkos_test
    2. cd kokkos_test
    3. git clone https://github.com/kokkos/kokkos.git
    4. *Copy Makefile into the current folder*
    5. *Copy hello.cpp into the current folder*
    6. make
    7. srun ./hello
- He showed us a Makefile; how does Kokkos work with CMake?
  - Separate compilation, i.e., the normal way to include a library: `find_package(Kokkos REQUIRED); [...] 
target_link_libraries(myTarget Kokkos::kokkos)` - In-tree build (ie, compile together, similar to the Makefile approach), use `add_subdirectory(<path to Kokkos dir relative to your CMakeList.txt>) include_directories(${Kokkos_INCLUDE_DIRS_RET}) target_link_libraries(myTarget kokkos)` https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/Compiling.html - can you provide the example GitHub-repo link? - The course repo https://github.com/ENCCS/gpu-programming.git - To get (grouped) examples on LUMI: git clone https://github.com/ENCCS/gpu-programming.git - However, portable-kernel model examples are web-only (they do not have associated `examples/[some-model]` directory) - General feedback: looks like we skipped the scheduled break. Not so good for my focus! This is evry hard to follow when all the material is new. The one at 10.00? Short one - Which one? There was a break after non-portable models. 10:00-10:20, actually :) But sure, it was an extension of exercise session for some. - Ah. It was in the exercise mode? I lost that and just did the exercises the whole time. Maybe you could give the info more explicit again :) - Will try! - :hearts: ## Preparing code for GPU porting https://enccs.github.io/gpu-programming/11-gpu-porting/ - Do you have an actualy example where the fortran code is ported to GPU? Can you share it? - Not for this one, unfortunately. - Could you make a psedu code? I think this example is really useful for understanting how to port from CPU to GPU. It is still a bit difficult for me to understand how to go from hello-world examples to my actual research legacy code. - I will try to put the code here, but without context it is difficult to understand it: <details><summary>🖱️ Click to show code </summary> ```cpp __global__ void cuda_get_soap_der_one(double *soap_rad_der_d, double *soap_azi_der_d, double *soap_pol_der_d, double *multiplicity_array_d, double *trans_soap_rad_der_d, double *trans_soap_azi_der_d,double *trans_soap_pol_der_d, cuDoubleComplex *cnk_d, cuDoubleComplex *cnk_rad_der_d, cuDoubleComplex *cnk_azi_der_d, cuDoubleComplex *cnk_pol_der_d, int *k2_i_site_d, bool *skip_soap_component_d, int n_sites, int n_atom_pairs, int n_soap, int k_max, int n_max, int l_max) { int k2 = threadIdx.x+blockIdx.x*blockDim.x; if (k2<n_atom_pairs){ int i_site=k2_i_site_d[k2]-1; int counter=0; int counter2=0; for(int n=1;n<=n_max;n++){ for(int np=n;np<=n_max;np++){ for(int l=0;l<=l_max;l++){ if(!skip_soap_component_d[l+(l_max+1)*(np-1+(n-1)*n_max)]){ //if( skip_soap_component(l, np, n) )cycle // if it happens lots of time, do it in reverse counter++; double my_soap_rad_der=0; //trans_soap_rad_der_d[k2+(counter-1)*n_atom_pairs]; //soap_rad_der_d[counter-1+k2*n_soap]; double my_soap_azi_der=0; //trans_soap_azi_der_d[k2+(counter-1)*n_atom_pairs]; //soap_azi_der_d[counter-1+k2*n_soap]; double my_soap_pol_der=0; //trans_soap_pol_der_d[k2+(counter-1)*n_atom_pairs]; //soap_pol_der_d[counter-1+k2*n_soap]; for(int m=0;m<=l; m++){ int k=1+l*(l+1)/2+m; counter2++; /* if(threadIdx.x==121 && blockIdx.x==154){ printf("\n Pair %d \n" , k2, i_site); } */ cuDoubleComplex tmp_1_cnk_d=cnk_d[i_site+n_sites*(k-1+(n-1)*k_max)]; //trans_cnk_d[i_site+n_sites*(k-1+(n-1)*k_max)]; //cnk_d[k-1+ k_max*(n-1 +i_site*n_max)]; cuDoubleComplex tmp_2_cnk_d=cnk_d[i_site+n_sites*(k-1+(np-1)*k_max)]; //trans_cnk_d[i_site+n_sites*(k-1+(np-1)*k_max)]; //cnk_d[k-1+k_max*(np-1+i_site*n_max)]; cuDoubleComplex tmp_1_cnk_rad_d=cnk_rad_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; 
//trans_cnk_rad_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; // cnk_rad_der_d[k-1+k_max*(n-1 +k2*n_max)]; cuDoubleComplex tmp_2_cnk_rad_d=cnk_rad_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; //trans_cnk_rad_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; // cnk_rad_der_d[k-1+k_max*(np-1+k2*n_max)]; cuDoubleComplex tmp_1_cnk_azi_d=cnk_azi_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; //trans_cnk_azi_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; //cnk_azi_der_d[k-1+k_max*(n-1 +k2*n_max)]; cuDoubleComplex tmp_2_cnk_azi_d=cnk_azi_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; //trans_cnk_azi_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; //cnk_azi_der_d[k-1+k_max*(np-1+k2*n_max)]; cuDoubleComplex tmp_1_cnk_pol_d=cnk_pol_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; //trans_cnk_pol_der_d[k2+n_atom_pairs*(k-1+(n-1)*k_max) ]; //cnk_pol_der_d[k-1+k_max*(n-1 +k2*n_max)]; cuDoubleComplex tmp_2_cnk_pol_d=cnk_pol_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; //trans_cnk_pol_der_d[k2+n_atom_pairs*(k-1+(np-1)*k_max)]; //cnk_pol_der_d[k-1+k_max*(np-1+k2*n_max)]; my_soap_rad_der+=multiplicity_array_d[counter2-1]*(tmp_1_cnk_rad_d.x*tmp_2_cnk_d.x+tmp_1_cnk_rad_d.y*tmp_2_cnk_d.y+ tmp_1_cnk_d.x*tmp_2_cnk_rad_d.x+tmp_1_cnk_d.y*tmp_2_cnk_rad_d.y); my_soap_azi_der+=multiplicity_array_d[counter2-1]*(tmp_1_cnk_azi_d.x*tmp_2_cnk_d.x+tmp_1_cnk_azi_d.y*tmp_2_cnk_d.y+ tmp_1_cnk_d.x*tmp_2_cnk_azi_d.x+tmp_1_cnk_d.y*tmp_2_cnk_azi_d.y); my_soap_pol_der+=multiplicity_array_d[counter2-1]*(tmp_1_cnk_pol_d.x*tmp_2_cnk_d.x+tmp_1_cnk_pol_d.y*tmp_2_cnk_d.y+ tmp_1_cnk_d.x*tmp_2_cnk_pol_d.x+tmp_1_cnk_d.y*tmp_2_cnk_pol_d.y); } trans_soap_rad_der_d[k2+(counter-1)*n_atom_pairs]=my_soap_rad_der; //soap_rad_der_d[counter-1+k2*n_soap]=my_soap_rad_der; trans_soap_azi_der_d[k2+(counter-1)*n_atom_pairs]=my_soap_azi_der; //soap_azi_der_d[counter-1+k2*n_soap]=my_soap_azi_der; trans_soap_pol_der_d[k2+(counter-1)*n_atom_pairs]=my_soap_pol_der; //soap_pol_der_d[counter-1+k2*n_soap]=my_soap_pol_der; } } } } } } __global__ void cuda_get_soap_der_two_one(double *soap_d, double *sqrt_dot_p_d, double *soap_rad_der_d, double *soap_azi_der_d, double *soap_pol_der_d, double *trans_soap_rad_der_d, double *trans_soap_azi_der_d, double *trans_soap_pol_der_d, double *tdotoprod_der_rad, double *tdotoprod_der_azi, double *tdotoprod_der_pol, int *k2_i_site_d, int n_sites, int n_atom_pairs, int n_soap, int k_max, int n_max, int l_max) { int k2=blockIdx.x; int tid=threadIdx.x; int i_site=k2_i_site_d[k2]-1; __shared__ double sh_soap_rad_der_dot[tpb]; __shared__ double sh_soap_azi_der_dot[tpb]; __shared__ double sh_soap_pol_der_dot[tpb]; double this_dotprod_rad=0.0;double this_dotprod_azi=0.0;double this_dotprod_pol=0.0; for(int s=tid;s<n_soap;s=s+tpb){ this_dotprod_rad+=soap_d[s+i_site*n_soap]*soap_rad_der_d[s+k2*n_soap]; this_dotprod_azi+=soap_d[s+i_site*n_soap]*soap_azi_der_d[s+k2*n_soap]; this_dotprod_pol+=soap_d[s+i_site*n_soap]*soap_pol_der_d[s+k2*n_soap]; } sh_soap_rad_der_dot[tid]=this_dotprod_rad; sh_soap_azi_der_dot[tid]=this_dotprod_azi; sh_soap_pol_der_dot[tid]=this_dotprod_pol; __syncthreads(); //reduction for (int s=tpb/2; s>0; s>>=1) // s=s/2 { if (tid < s) { sh_soap_rad_der_dot[tid] +=sh_soap_rad_der_dot[tid + s]; sh_soap_azi_der_dot[tid] +=sh_soap_azi_der_dot[tid + s]; sh_soap_pol_der_dot[tid] +=sh_soap_pol_der_dot[tid + s]; } __syncthreads(); } for(int s=tid;s<n_soap;s=s+tpb){ tdotoprod_der_rad[s+k2*n_soap]=sh_soap_rad_der_dot[0]; tdotoprod_der_azi[s+k2*n_soap]=sh_soap_azi_der_dot[0]; tdotoprod_der_pol[s+k2*n_soap]=sh_soap_pol_der_dot[0]; } } __global__ void 
cuda_get_soap_der_two_two(double *soap_d, double *sqrt_dot_p_d, double *soap_rad_der_d, double *soap_azi_der_d, double *soap_pol_der_d, double *tdotoprod_der_rad, double *tdotoprod_der_azi, double *tdotoprod_der_pol, int *k2_i_site_d, int n_sites, int n_atom_pairs, int n_soap, int k_max, int n_max, int l_max) { int k2=blockIdx.x; int tid=threadIdx.x; int i_site=k2_i_site_d[k2]-1; double loc_sqrt_dot_p=sqrt_dot_p_d[i_site]; for(int s=tid;s<n_soap;s=s+tpb){ double my_soap=soap_d[s+i_site*n_soap]; double my_soap_rad_der=soap_rad_der_d[s+k2*n_soap]; double my_soap_azi_der=soap_azi_der_d[s+k2*n_soap]; double my_soap_pol_der=soap_pol_der_d[s+k2*n_soap]; double myprod_der_rad=tdotoprod_der_rad[s+k2*n_soap]; double myprod_der_azi=tdotoprod_der_azi[s+k2*n_soap]; double myprod_der_pol=tdotoprod_der_pol[s+k2*n_soap]; soap_rad_der_d[s+k2*n_soap]=my_soap_rad_der/loc_sqrt_dot_p -my_soap/(loc_sqrt_dot_p*loc_sqrt_dot_p*loc_sqrt_dot_p)*myprod_der_rad; soap_azi_der_d[s+k2*n_soap]=my_soap_azi_der/loc_sqrt_dot_p -my_soap/(loc_sqrt_dot_p*loc_sqrt_dot_p*loc_sqrt_dot_p)*myprod_der_azi; soap_pol_der_d[s+k2*n_soap]=my_soap_pol_der/loc_sqrt_dot_p -my_soap/(loc_sqrt_dot_p*loc_sqrt_dot_p*loc_sqrt_dot_p)*myprod_der_pol; } } __global__ void cuda_get_soap_der_thr_one(double3 *soap_cart_der_d, double *soap_rad_der_d, double *soap_azi_der_d, double *soap_pol_der_d, double *thetas, double *phis, double *rjs, int *k3_index, int n_sites, int n_atom_pairs, int n_soap, int k_max, int n_max, int l_max) { int k2=blockIdx.x; int tid=threadIdx.x; int k3=k3_index[k2]-1; double my_theta=thetas[k2]; double my_phi=phis[k2]; double my_rj=rjs[k2]; for(int s=tid;s<n_soap;s=s+tpb){ if(k3!=k2){ double my_soap_rad_der=soap_rad_der_d[s+k2*n_soap]; double my_soap_azi_der=soap_azi_der_d[s+k2*n_soap]; double my_soap_pol_der=soap_pol_der_d[s+k2*n_soap]; double3 my_soap_cart_der; my_soap_cart_der.x=sin(my_theta)*cos(my_phi)*my_soap_rad_der -cos(my_theta)*cos(my_phi)/my_rj*my_soap_pol_der -sin(my_phi)/my_rj*my_soap_azi_der; my_soap_cart_der.y=sin(my_theta)*sin(my_phi)*my_soap_rad_der -cos(my_theta)*sin(my_phi)/my_rj*my_soap_pol_der +cos(my_phi)/my_rj*my_soap_azi_der; my_soap_cart_der.z=cos(my_theta)*my_soap_rad_der +sin(my_theta)/my_rj*my_soap_pol_der; soap_cart_der_d[s+k2*n_soap]=my_soap_cart_der; } } } __global__ void cuda_get_soap_der_thr_two(double3 *soap_cart_der_d, double *soap_rad_der_d, double *soap_azi_der_d, double *soap_pol_der_d, double *thetas, double *phis, double *rjs, int *n_neigh_d, int *i_k2_start_d, int *k2_i_site_d, int *k3_index_d, int n_sites, int n_atom_pairs, int n_soap, int k_max, int n_max, int l_max, int maxneigh) { int i_site=blockIdx.x; int tid=threadIdx.x; int my_start=i_k2_start_d[i_site]-1; int k3=my_start; int my_n_neigh=n_neigh_d[i_site]; for(int s=tid;s<n_soap;s=s+tpb){ double3 loc_sum; loc_sum.x=0,loc_sum.y=0,loc_sum.z=0; int k2=my_start+1; for(int j=1;j<my_n_neigh; j++){ double3 my_soap_cart_der=soap_cart_der_d[s+k2*n_soap]; loc_sum.x-=my_soap_cart_der.x; loc_sum.y-=my_soap_cart_der.y; loc_sum.z-=my_soap_cart_der.z; k2++; } soap_cart_der_d[s+k3*n_soap]=loc_sum; } } ``` </details> - - in order to improve the access the problem was transposed. The orginal loop had 3 parts and the kernels were ported for each part. - To the above, it would be extremely helpful to assemble sort of a library or gallery of various ported codes, illustrating all sorts of frameworks. A lot to ask, but maybe this material is already out there somewhere? - You will have a sneak peek of this in the late afternoon, worked example episode! 
(Still not *everything*, but at least a variety to choose from.)
  - Also, GPU programming is still very case-by-case research.
  - That's why I still do not feel comfortable with being able to do this myself after the workshop :nerd_face:
    - Did you apply for the Friday hackathon?
    - You are not alone (and you hear this from the *presenter*). A lot to learn!
- In the legacy code I'm working with, somebody has written "nested" calls to the MPI functions. In the main function there is an initialization part

```cpp
// MPI initialization
int numprocessors, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocessors);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
char name[50];
int count;
MPI_Get_processor_name(name, &count);
if (rank == 0) {
    cout << "Running MainIso" << endl;
    cout << "using " << numprocessors << " nodes with " << omp_get_max_threads()
         << " threads/node for a total of " << numprocessors*omp_get_max_threads()
         << " processes." << endl;
}
```

  and then a call is made to a class method that also does an MPI AND an OpenMP initialization and calls them again. Is this type of code organization recommended? (I understand that it is hard to answer generally, but to naive me this looks a bit like a "nested" MPI initialization, and it makes the code quite difficult to read.) What do you experts think about this way to organize/invoke parallelism?
  - Let me know if it is not understandable what I'm trying to ask about :P
  - Do you mean calling `MPI_Init` etc. twice? Or just combining MPI with OpenMP?
  - It looks like they are doing both: calling `MPI_Init` twice, and they also combine it with OpenMP. Is this recommended/smart? I thought it was enough to initialize once?
  - MPI must be initialized exactly once, so calling `MPI_Init` twice is a bug (see the sketch below for a defensive check). Combining MPI with OpenMP is quite common; that's an advanced technique, but a lot of codes use it.
  - I am not able to see the "nested" initialization. Combining MPI and OpenMP is valid.
  - What about calling `MPI_Comm_size(MPI_COMM_WORLD, &numprocessors)` AND `MPI_Comm_rank(MPI_COMM_WORLD, &rank);` several times? Why would you do that?
  - No issues, but it might slow down the code a little depending on how many times it is done. We would have to see a little more of the code to be able to comment. Also, one could define custom communicators, and in that case the rank changes.
  - Unless it's done in a hot loop, probably nothing to worry about.
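On the double-`MPI_Init` point above, a minimal defensive sketch (hypothetical helper, not the code being discussed): library-style code can query `MPI_Initialized` instead of unconditionally calling `MPI_Init` a second time.

```cpp
#include <mpi.h>
#include <iostream>

// Hypothetical helper that a class method could call: it initializes MPI only
// if nobody has done so yet, instead of calling MPI_Init a second time (an error).
void ensure_mpi_initialized(int *argc, char ***argv)
{
    int already_initialized = 0;
    MPI_Initialized(&already_initialized);
    if (!already_initialized) {
        MPI_Init(argc, argv);
    }
}

int main(int argc, char **argv)
{
    ensure_mpi_initialized(&argc, &argv);   // first call: actually initializes MPI
    ensure_mpi_initialized(&argc, &argv);   // later call (e.g. from a class method): does nothing

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // querying rank/size repeatedly is cheap and legal
    if (rank == 0) {
        std::cout << "MPI initialized once" << std::endl;
    }

    MPI_Finalize();
    return 0;
}
```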
## Porting between different GPU frameworks: automatic translation tools

https://enccs.github.io/gpu-programming/11-gpu-porting/#porting-between-different-gpu-frameworks

- Can hipify-clang be used with any ROCm version? ROCm above 5.2.3 isn't supported on Ubuntu 18.04 LTS
  - Below 5.x.x -- not sure. But there's little reason to use an older version.
- Does hipify-clang offer any diagnostics when the code relies on architecture assumptions (e.g., ballots over a warp of 32 vs 64 threads)?
  - It should print a warning message about it, but we did not try it
  - It is always recommended to avoid hardcoding the warp size into the code.
  - A lot of legacy codes do that :(
  - The translation tools can do a lot of things and help with the porting, but in the end the programmer has to do the final "touches".
  - That's why the question was about warnings. It might be hard to fix automatically, but guidance ("check here! something might not work right") would be helpful when dealing with a huge legacy codebase.
  - Also, I expect that the programmer has knowledge of the code.
- CLACC GitHub: ?
  - `git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git`
  - So a branch within LLVM/Clang
- I tried to use Remote-SSH to log in to LUMI through VSC, but I can't make it work. Looks like Erik figured it out? Is there a trick?
  - No special trick.
  - How do you set up your keys? A local config file? I'm just being asked to type my password again and again.
  - Were you able to log in with other programs?
  - Yes, no problem with regular ssh through the terminal. Just not with VSC.
  - Set up the Remote-SSH plugin and add the same command that you used to log in with the other program.
  - Exactly what I did, but maybe something is fishy with my computer. I have had problems on Saga (supercomputer in Norway) before.
  - What platform? I have a Mac, and I know Linux works
  - WSL, Ubuntu 22.04
  - One has to give the path to the key manually, I guess (`-i <path-to-key>`). On Linux and Mac, VSC uses `~/.ssh`; on WSL I cannot say.
  - I'm using Remote Explorer (extension to VSC), can't remember any special tricks - Erik
- I would like to ask you experts about editors on the clusters. Vi or VSC? Big debate. Lots of efficiency to gain. Which editor do you guys use :)?
  - Personal opinion, VSC
  - I use VSC with the Vim extension
  - VSC and Remote Explorer
  - My impression is that the "youngsters" use VSC and the "oldies" (like me) use Vim :P
  - One can even edit locally with whatever is preferable, then transfer through SSH/SFTP and do last-second fixes with e.g. Vim.
  - I try to do most editing locally (with CLion), and use Vim if some fixes are needed remotely
  - Overlapping :)
  - I use Vim for small fixes, otherwise xemacs for a large amount of text

## Recommendations

https://enccs.github.io/gpu-programming/12-recommendations/

- Can you show an example with an implementation where you are going from CPU to GPU code? I would still like to see more examples of how to do this :)
  - After the break it's nothing but this :)
  - :heart_decoration:

### Exercise

1. Has your mental model of how GPUs work and how they are programmed changed?
   - I think the course was quite low level (from a programming point of view); I now know more of what happens under the hood. I still struggle to think how I could go from a high-level programming language (Python) to this low-level stuff.
   - I think so; very nervous to try this myself in my real academic legacy-code world
   - Mental model is the same, but now I feel more informed about different means to achieve things (frameworks, high/low-level abstractions etc.)
   - A small change. I'll probably use more high-level languages.
2. Do you have a better idea about what framework is right for your code?
   - Yes, if I need to dig this deep into optimization, I know what to use.
   - Personally I feel a bit overwhelmed by the choice of frameworks. I'm sure I will have a better idea for my use case once I digest everything
   - For me portability is generally the most important aspect. Typically you have a lot of different machines to run on. It is difficult to write the code generally enough that it can work in the future AND be portable to new architectures. Do you have general advice on how to be portable AND fast? :P
     - You should use a portable framework. E.g., between OpenACC and OpenMP, you should choose the latter
   - Portability is important; however, for fast prototyping I would use some high-level language/library/framework (e.g. Python). I wish I had more time to learn and go to a lower-level approach.
3. What questions do you have? Ask us anything!
- Assuming I have a problem that lends itself to running on a GPU, is it "easy" in your experience (whatever easy means) to get a performance boost even from some rather quick'n'dirty GPU programming, or do you often end up having to tweak around a lot to get back to the performance of a well-tuned CPU code?
  - It depends on the initial code and a little on the language/framework too, but yes, it is possible to have modest gains at first and maximum gains after tweaking.
- Relating to the above, did we ever find out why the Colab GPU code example from Friday ran 100 times *slower* than the CPU one?
  - It did not. It ran faster than pure/common Python code, but slower than the optimized/JIT-compiled CPU version, which is essentially C and not Python anymore. Depending on how you write the Python code, the same optimizer for GPU can perform comparatively better or worse.
  - Well yes, almost anything runs faster than pure Python. But are you implying that Friday's GPU example could have been sped up to give a performance improvement over the CPU JIT version?
  - That's up to research :) If we are talking about the same algorithm, but not necessarily the same code lines, then there will be a Python CPU vs GPU example after the break.
- If you have two codes, one for CPU and one for GPU usage, how can I test that the results are the same? It is easy to make mistakes when you are maintaining two code bases!
  - Well, the same way you test that your CPU code works correctly :) (you have CPU tests, right?).
  - I always write those, but my senior professor guys, they do not. :scream:
  - Well, still, the approach is the same (as in, just like on CPU, it can be done in tons of different ways). During the initial porting, just tons of `printf` here and there comparing intermediate results on CPU and GPU is common :) Long-term, some testing is needed; how exactly -- depends.
  - NVIDIA has a tool for comparing CPU/GPU results: https://developer.nvidia.com/blog/detecting-divergence-using-pcast-to-compare-gpu-to-cpu-results/
- This is maybe not easy to answer: for very heavy codes that are supposed to run on LUMI-type machines, what sort of performance can one expect, in units of performance / theoretical peak performance of the hardware? 0.1%, 1%, 10%, 70%? I am mostly interested in scientific software (that basically simulates processes from a mathematical model)
- Most of the scientific codes are in Fortran; do we have any choice except OpenMP and OpenACC? Also, our code is a mix of Fortran and C++. Would it be possible to use CUDA for C++ and OpenACC for Fortran?
  - It is possible to combine languages. Leave the Fortran code as it is, except for the parts that are intensive. Using `iso_c_binding` one can call C/C++ functions from Fortran. In fact, at the moment that is the only way to do `fortran + hip` on LUMI. (A minimal sketch of the C++ side is shown below.)
  - CUDA has pretty good Fortran support (not covered in this course): https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html
  - CUDA Fortran is a pretty cool thing, but it only works with NVIDIA. It would be useless on LUMI :(.
  - Well, the question was asking about CUDA and OpenACC, so... But yes, CUDA things are limited to NVIDIA, and there does not seem to be much love for Fortran from AMD.
  - AMD has [hipfort](https://github.com/ROCmSoftwarePlatform/hipfort), but it still requires kernels to be written in C++
  - There are various solutions to this problem and the answer depends on the case. One could put everything intensive in CUDA/HIP kernels and call them from C or Fortran. Or one could use a combination. There is a cost/benefit evaluation.
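On the `iso_c_binding` point above, a minimal sketch (hypothetical names, not from any of the codes discussed) of the C++/HIP side: an `extern "C"` wrapper launches the kernel, and the Fortran side can then declare it with `bind(C)` and call it like any other routine (the corresponding Fortran interface is only sketched in the comment).

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel: scales a device array by a constant.
__global__ void scale_kernel(double *x, double alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= alpha;
}

// extern "C" gives the wrapper an unmangled name so Fortran can bind to it, e.g.:
//
//   interface
//     subroutine scale_on_gpu(x, alpha, n) bind(C, name="scale_on_gpu")
//       use iso_c_binding, only: c_ptr, c_double, c_int
//       type(c_ptr), value    :: x      ! device pointer (e.g. from hipfort's hipMalloc)
//       real(c_double), value :: alpha
//       integer(c_int), value :: n
//     end subroutine
//   end interface
//
extern "C" void scale_on_gpu(double *x_device, double alpha, int n)
{
    const int block = 256;
    const int grid = (n + block - 1) / block;
    scale_kernel<<<grid, block>>>(x_device, alpha, n);
    hipDeviceSynchronize();   // make sure the work is done before returning to Fortran
}
```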
- **Feedback**: in the course description, the course requirements mentioned different programming languages (C++, Fortran, Python, Julia). Yet most of the course had C++ code and went quite low level. It could be nice to include a checklist of requirements to revise/study beforehand. Also, it would be nice to have a section on how to use GPUs from high-level languages (such as Python).
  - Like https://enccs.github.io/gpu-programming/6-language-support/ ?
  - Yes, except the Python example is minimal and does not run on LUMI (because of the hardware choice)
  - Yes, please spam the Numba developers for ROCm support :)
  - Will do :)
  - Still, we are listening to what is important/popular among the participants and will definitely take this into account.
- Comment: Converted libQuEST (3.5.0) to HIP using hipify-perl and created a DSO using hipcc. Works.

## GPU programming example: stencil computation

https://enccs.github.io/gpu-programming/13-examples/#gpu-programming-example-stencil-computation

- Finite element methods (engineering!!)
- Ising model
- Cell-to-first-neighbours signalling (Notch, for example) in a fixed grid
- Computational Fluid Dynamics (1D, 2D, 3D)
- Game of Life
- Lattice Boltzmann method
- Weather forecasting?
- Finite Volume Method, Finite Difference Method, and Finite Element Method
- https://en.wikipedia.org/wiki/Lattice_model_(physics)#Examples_2 ?
- Some other game theory models

Poll: which programming model/framework are you most interested in today?
- OpenMP offloading (C++): oooo
- SYCL:
- Python (numba/CUDA): oo
- Julia: oo
- Other (specify below)
  - Porting to AMD GPUs in general
  - Using Kokkos with C++
  - .
  - .
  - .
- Why is the 1000 grid side lower than 2000?
  - Is it because of the small-grid effect?

How is heat flow computation expected to scale with respect to the number of time steps?
- Linearly: oooo
- Quadratically:
- Exponentially:

How is stencil application (grid update) expected to scale with respect to the size of the grid side?
- Linearly:
- Quadratically: oooo
- Exponentially:

(Optional) Do you expect GPU-accelerated computations to suffer from the memory effects observed above? Why/why not?

- So can we use shared memory between the threads? Or was it blocks?
  - Shared memory can be used to share data between *threads within a block*

---

- Does this mean that all GPU threads can read from memory simultaneously, without being in each other's way?
- Can you comment on speed differences between SYCL implementations? (OpenSYCL, oneAPI, hipSYCL, ComputeCpp, a.o.)
  - Preface:
    - OpenSYCL and hipSYCL are the same project
    - ComputeCpp nowadays switched to embedded devices; their high-end GPU work is part of oneAPI
  - Performance is typically quite close. They are all based on Clang, so the differences in performance are usually due to implementation differences that are arbitrary.
- Why do we need to allocate resources for running?
  - On LUMI (and most other supercomputers) you SSH to a "login node" by default, which is not supposed to be used for computing and does not have a GPU. You then request a certain number of nodes (one in our case) and run your code there. `srun` is a wrapper that launches the program on the allocated node. There are other ways to do it (an interactive session, where you log in to the node; or a batch script, which you submit -- that's the most common way to run large tasks).
- When would I introduce MPI and not only use OpenMP?
  - So, an alternative to MPI does not really exist in the case of several nodes that you want to use at maximum capacity at the same time?
  - In the data-science world, Hadoop/Spark is used, but it targets somewhat different use cases than a typical HPC workload
  - Not really an alternative, but performing a somewhat similar job: Coarray Fortran and UPC
- Do you guys have experience with Spack as a dependency tool (alternative to module load)? https://github.com/spack/spack
  - Do you mean an alternative to module?
  - Yes, for creating reproducible workflows across machines
  - I think it is a great tool. There is a module on LUMI supporting `spack` for user installation.
  - You should try to avoid installing applications from install scripts.
  - And how to do it instead?? Ask the sysadmin?
  - Local installation, where the user has access. This is very specific to each supercomputer, but it is pretty common. Check for example on [LUMI](https://docs.lumi-supercomputer.eu/software/installing/spack/)

## Final Q&A

- .
- .

## Feedback (only 3 survived? `:(` )

Today was (multi-vote):
- too fast:
- right speed: oo
- too slow:
- too advanced: o
- right level: o
- too basic:
- I would recommend this course: ooo

One good thing about today/this course:
- Really good with the last example!
- The exercise from the last part of the day was a great summary and discussion. I also loved that it had several different versions so that everyone could dig deeper into what is most interesting to them.
- Lots of different frameworks/examples

One thing to be improved for next time:
- Difficult for you guys to do something about in the short term, but where are all the females?
- Swapping between presenters in the same time slot is a bit complicated and often wastes time. I believe it's better to use breaks to separate this.
- Consider separating into basic and advanced stuff, or high/low-level frameworks, or similar
- Two full days is very intense; the material could be spread out a bit.
  - :+1: for half days!

Any other comments:
- Good course! I would have liked the practical info sections to be repeated a bit more often. Like, now we are in session X, now you take a break for 10 min or continue to work through the break. Repeat the Zoom link to people. Easy to get lost.
- Great job! Worked really nicely considering it was the first time :)
- .
- .

----

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::

----