
<p style="text-align: center"><b><font size=5 color=blueyellow>GPU Programming: When, Why, and How - Day 1</font></b></p>
:::success
**Nov. 25 - 27, 2025, 09:00 - 12:30 (CET)**
:::
:::success
**GPU Programming: When, Why, and How - Schedule**: https://hackmd.io/@ENCCS-Training/gpu-programming-2025-schedule
:::
## Schedule
| Time | Contents |
| :---------: | :----------: |
| 09:00-09:10 | Welcome |
| 09:10-10:30 | Directive-based models (OpenACC, OpenMP) |
| 10:30-10:40 | Break |
| 10:40-12:00 | Non-portable kernel-based models (CUDA, HIP) |
| 12:00-12:30 | Q/A session |
---
## Exercises and Links
:::warning
- Exercises for [Directive-based models](https://enccs.github.io/gpu-programming/6-directive-based-models/#)
:::
Examples: clone the repository with `git clone https://github.com/ENCCS/gpu-programming.git`.
The examples are under `gpu-programming/content/examples/`; for this section use:
- `gpu-programming/content/examples/acc`
- `gpu-programming/content/examples/omp`

Workspace on LUMI (check with `lumi-workspaces`):
```bash
Project: project_465002387
Project is hosted on lustrep4
/projappl/project_465002387 46G/54G 25K/100K
/scratch/project_465002387 408G/55T 178K/2.0M
/flash/project_465002387 4.1K/2.2T 1/1.0M
----------------------------------------------------------------------
```
Modules needed:
```bash
ml purge
ml LUMI/24.03
ml partition/G
ml PrgEnv-cray/8.5.0
ml rocm/6.0.3
```
### OpenACC
OpenACC is supported only by the Cray Fortran compiler. The C and C++ compilers have **NO** support for OpenACC. To enable OpenACC, use the `-hacc` flag.
```
ftn -hacc vec_add_kernels.f90 -o vec_add_kernels_f_gpu
```
### OpenMP
OpenMP is turned off by default, which is the opposite of how earlier versions of the CCE compilers worked. It is turned on with the `-homp` or `-fopenmp` flag:
```
cc -fopenmp vec_add_target.c -o vec_add_target_c_gpu
```
```
ftn -homp vec_add_kernels.f90 -o vec_add_kernels_f_gpu
```
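For reference, the code that these commands compile looks roughly like the sketch below. This is illustrative only, not the actual exercise source (which is in `gpu-programming/content/examples/omp`); the variable names were chosen to match the debug output shown further down.
```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 102400;   /* 102400 doubles = 819200 bytes per array */
    double *vecA = malloc(n * sizeof(double));
    double *vecB = malloc(n * sizeof(double));
    double *vecC = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { vecA[i] = 1.0; vecB[i] = 2.0; }

    /* The target construct offloads the loop to the GPU;
       map() describes the host<->device copies. */
    #pragma omp target teams distribute parallel for \
            map(to: vecA[0:n], vecB[0:n]) map(from: vecC[0:n])
    for (int i = 0; i < n; i++)
        vecC[i] = vecA[i] + vecB[i];

    printf("vecC[0] = %f, vecC[n-1] = %f\n", vecC[0], vecC[n - 1]);
    free(vecA); free(vecB); free(vecC);
    return 0;
}
```
Compile it with the `cc -fopenmp ...` line above and run it with `srun` once you have an allocation (see the Slurm section below).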
To get runtime debug output from the Cray offloading runtime, set `export CRAY_ACC_DEBUG=1` (levels 1, 2, or 3; higher means more information).
### Slurm queue
Project number: project_465002387
Allocate 1 GPU with 7 CPU cores:
```
salloc -p dev-g -t 30:00 -n 1 -c 7 --gpus-per-task=1 -A project_465002387
```
Check the allocated resources:
```
squeue --me
# the output will look like this (USER and NODELIST will differ):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14923734 dev-g interact xxxxx R 17:01 1 nid005001
```
Run the executable on the allocated node:
```
srun ./vec_add_kernels_f_gpu
```
---
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
### 1. [Directive-based models](https://enccs.github.io/gpu-programming/6-directive-based-models/)
- Is this how to ask a question?
- Yes, and an answer will appear like so!
- Can you repeat how to get debug output when running the executable
- `export CRAY_ACC_DEBUG=2` (or any level 1, 2 or 3, higher means more information)
- How do we get output from srun on the screen
- When the program is executed with `srun ./vec_add_kernels_f_gpu`, the output appears on the terminal where it is executed
- Follow-up on previous question: The instructor showed the Debug information from the compiler. How do we see that?
- I don't see the above project as part of my workspaces
- Please join the breakout room and ask to be added to the project
- I tried but they don't appear in my zoom (it says I need to be added by host)
- I can try to add you if you tell me your Zoom name
- thank you :)
- I loaded all the modules given on the HackMD, but when compiling I get:
  `cc vec_add_kernels.c -hacc`
  `clang: warning: argument unused during compilation: '-h acc' [-Wunused-command-line-argument]`
- OpenACC only works with Fortran here. The C code is most probably an OpenMP code, which requires `-fopenmp` or `-homp`
- The Cray system supports OpenACC Fortran, but OpenACC C/C++ only partially.
- I got confirmation from my colleagues at CSC: OpenACC in C is not supported by the Cray compiler on LUMI.
- I'm getting this warning with the C exercise (`cc -O2 -fopenmp -o vec_add_diff vec_add_diff.c`):
  `warning: vec_add_diff.c:14:5: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]`
- Is the code running faster than before?
- There does appear to be offloading?
- I (Yann) also see this – I think there is just not enough work for the compiler to add a vector "layer" of parallelisation in this case. You do see there are 400 blocks x 256 threads nevertheless.
```bash
ACC: Version 6.0 of HIP already initialized, runtime version 60032831
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 3 items from vec_add_diff.c:14
ACC: allocate, copy to acc 'vecC' (819200 bytes)
ACC: allocate, copy to acc 'vecA' (819200 bytes)
ACC: allocate, copy to acc 'vecB' (819200 bytes)
ACC: End transfer (to acc 2457600 bytes, to host 0 bytes)
ACC: Execute kernel __omp_offloading_73ac72ce_d5001a7c_main_l14_cce$noloop$form blocks:400 threads:256 from vec_add_diff.c:14
ACC: Start transfer 3 items from vec_add_diff.c:14
ACC: copy to host, free 'vecB' (819200 bytes)
ACC: copy to host, free 'vecA' (819200 bytes)
ACC: copy to host, free 'vecC' (819200 bytes)
ACC: End transfer (to acc 0 bytes, to host 2457600 bytes)
```
- Is the node allocation from `salloc -p dev-g -t 30:00 -n 1 -c 7 --gpus-per-task=1 -A project_465002387` not an interactive node? When I run `hostname` it prints 'uan01', whereas `squeue --me` shows `nid005276`.
- What about `srun hostname` ?
- Yes, `srun hostname` gives `nid005276`
- In order to use the allocation one can log in to the node or just run applications with `srun ...`.
- In my experience on various clusters, `salloc` always gives an interactive session. Is that not the case here? Does it depend on the admin settings in the Slurm config?
- Yes. On LUMI, Slurm is configured a little differently from what we are used to.
- The code from https://github.com/ENCCS/gpu-programming/blob/main/content/examples/omp/vec_add_target.c does not have any print statements or time measurement. Is this left out on purpose, and just monitored through the Cray environment variables for now?
- Indeed, we did not add that to keep the focus of the exercises on the directives. To be honest, in these very straightforward examples it will be tricky to really see big performance differences unless you use very inefficient parameters on purpose.
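- If you do want to measure it yourself, here is a rough sketch (illustrative, not the exercise code) that wraps the offloaded loop and a plain CPU loop in `omp_get_wtime()` calls and prints both times plus a check value:
```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 24;   /* larger arrays make the comparison a bit fairer */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* Time the offloaded loop (includes the host<->device transfers). */
    double t0 = omp_get_wtime();
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    double t_gpu = omp_get_wtime() - t0;

    /* Time the same loop on the host for comparison. */
    t0 = omp_get_wtime();
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    double t_cpu = omp_get_wtime() - t0;

    printf("GPU (incl. transfers): %.4f s, CPU: %.4f s, c[0] = %.1f\n", t_gpu, t_cpu, c[0]);
    free(a); free(b); free(c);
    return 0;
}
```
- Note that for a simple vector add the host<->device transfers typically dominate the GPU time, which is part of why the difference is hard to see in these small exercises.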
- What happens if `map(tofrom:...)` is used but the memory between CPU and GPU differs?
- The question is a little ambiguous, but I hope this answers it: it is OK for them to differ. In practice the data should stay on the GPU as long as possible and only be updated on the CPU when needed, e.g. for saving.
- With `to` it will copy to the GPU when entering the region, and with `from` copy back from the GPU to the CPU when exiting. Hopefully your code on the GPU will indeed have changed the values! :)
- Thanks, that helps. So at the start `{...` it's CPU->GPU and at the end `...}` it's GPU->CPU transfer. So `map(tofrom:...)` needs curly braces, right?
- I think technically it doesn't "need" curly braces, but in C/C++ these delineate the block of code to which the pragma applies. I think if you put such a pragma before a function call it will apply to that call. Without curly braces the pragma applies only to the next statement; see the sketch below.
- Also, with unstructured data regions there are no curly braces at all.
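- To make the braces point concrete, here is a small hedged C sketch (illustrative, not from the exercises) showing a structured `target data` region, where the `{ }` block bounds the copies, next to the equivalent unstructured `enter/exit data` form that needs no braces:
```c
#include <stdio.h>

#define N 1024

int main(void)
{
    double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    /* Structured data region: x is copied to the GPU at '{' and back at '}'. */
    #pragma omp target data map(tofrom: x[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            x[i] *= 2.0;
    }   /* <- x is copied back to the host here */

    /* Unstructured version: no braces, the copies happen at the enter/exit directives. */
    #pragma omp target enter data map(to: x[0:N])
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; i++)
        x[i] *= 2.0;
    #pragma omp target exit data map(from: x[0:N])

    printf("x[0] = %f\n", x[0]);   /* expect 4.0 */
    return 0;
}
```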
- What does the good portability, listed among the pros for OpenMP and OpenACC, refer to exactly? Can you expand on this?
- Portability usually means that you can take your code and run it (efficiently) on a different machine/GPU/architecture without "too much" effort. With these approaches today you should be able to run easily on both Nvidia and AMD GPUs for example, it doesn't require more from you as a user (assuming the system has a compiler supporting OpenMP/OpenACC) whereas you will see shortly that if you write a CUDA code, it won't be quite straightforward to try and run it on AMD.
:::danger
**Break until XX:55**
:::
### 2. [Non-portable kernel-based models](https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/)
- Exercise material: https://github.com/ENCCS/gpu-programming/tree/main/content/examples/cuda-hip.
- Will the slides be made available? (Or are they already?)
- Self-study material is available at https://enccs.github.io/gpu-programming/
- (Johan) The slides are not available yet, I will ask the instructor about them when we go into the exercise session.
- Description of the Slurm options:
- `-N x` = `--nodes=x` --> number of nodes
- `-n y` = `--ntasks=y` --> total number of tasks
- `-c z` = `--cpus-per-task=z` --> number of cores available to a task
- `-t hh:mm:ss` = `--time=hh:mm:ss` --> time requested for the job
- `-p part` = `--partition=part` --> partition where the job should run
- With the `makefile` in the example directories you can build all the programs in the directory with `make all-examples`
- Try setting the number of tasks and the number of GPUs to higher values, e.g. `srun -p dev-g --gpus 8 -N 1 -n 8 -c 7 --time=00:10:00 --account=project_465002387 ./02_matrix_thread_index_info`
- `nvcc` seems to be not available on LUMI?
- LUMI only has AMD GPUs. `nvcc` is the CUDA compiler. Use these examples on LUMI: https://github.com/ENCCS/gpu-programming/tree/main/content/examples/cuda-hip/hip. If you have access to a machine with CUDA you can also try the other examples in the `../cuda` folder.
- Do any partitions or other servers exist with NVIDIA GPUs?
- In Finland Puhti and Mahti at CSC, Triton at Aalto Uni. In Sweden I think Dardel has GH200
- In Sweden, ALVIS has NVIDIA GPUs
- Dardel at PDC has 8 Grace Hopper nodes, each node with 4 GH200
- A stupid question: why do I get "slurmstepd: error: execve(): 01_hello.cpp: Permission denied" when running the 'hello' examples? I tried with fewer cores and nodes, and then it worked.
- Could you paste the exact command which did not work?
:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---