# Advanced OpenMP presentation

## Memory Pragmas

Set up your environment:

```
export LIBOMPTARGET_INFO=-1
export OMP_TARGET_OFFLOAD=MANDATORY
```

You can also be more selective in the output generated by using the individual bit masks:

```
export LIBOMPTARGET_INFO=$((0x01 | 0x02 | 0x04 | 0x08 | 0x10 | 0x20))
```

The first example code uses just a single pragma with a map clause at the computational loop. Examine this code, then compile and run it:

```
mkdir build && cd build
cmake ..
make
./mem1
```

You should get output like the following:

```
Libomptarget device 0 info: Entering OpenMP kernel at mem1.cc:89:1 with 5 arguments:
Libomptarget device 0 info: firstprivate(n)[4] (implicit)
Libomptarget device 0 info: from(z[0:n])[80000]
Libomptarget device 0 info: firstprivate(a)[8] (implicit)
Libomptarget device 0 info: to(x[0:n])[80000]
Libomptarget device 0 info: to(y[0:n])[80000]
Libomptarget device 0 info: Creating new map entry with HstPtrBase=0x0000000001772200, ...
Libomptarget device 0 info: Creating new map entry with HstPtrBase=0x000000000174b0e0, ...
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x000000000174b0e0, ...
Libomptarget device 0 info: Creating new map entry with HstPtrBase=0x000000000175e970, ...
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x000000000175e970, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x0000000001772200, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x000000000174b0e0, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x000000000175e970, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x000000000175e970, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x000000000174b0e0, ...
Libomptarget device 0 info: Mapping exists with HstPtrBegin=0x0000000001772200, ...
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x00007f617c420000, ...
Libomptarget device 0 info: Removing map entry with HstPtrBegin=0x000000000175e970, ...
Libomptarget device 0 info: Removing map entry with HstPtrBegin=0x000000000174b0e0, ...
Libomptarget device 0 info: Removing map entry with HstPtrBegin=0x0000000001772200, ...
-Timing in Seconds: min=0.010115, max=0.010115, avg=0.010115
-Overall time is 0.010505
Last Value: z[9999]=7.000000
```

Explore examples 2 through 5 and observe the output produced when the `LIBOMPTARGET_INFO` environment variable is set.

Mem2 pattern: Add enter/exit data alloc/delete where the memory is created and freed.

After the `new` calls:

```
mem2.cc:#pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])
```

Add a loop around the computational loop and keep the map clauses on the computational loop. The to/from maps check whether the data already exists on the device; if not, they allocate it on entry and delete it on exit. They then perform the copies to and from the device. The reference counter for each entry is incremented at the start of the target region and decremented at its end.

```
mem2.cc:#pragma omp target teams distribute parallel for simd map(to: x[0:n], y[0:n]) map(from: z[0:n])
```

Before the `delete` calls:

```
mem2.cc:#pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])
```

Mem3 pattern: Replacing map to/from with update directives to bypass the unneeded device memory presence check.

After the `new` calls:

```
mem3.cc:#pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])
```

Before the computational loop, the data is copied to the device; the reference counter does not change.
```
mem3.cc:#pragma omp target update to(x[0:n], y[0:n])
mem3.cc:#pragma omp target teams distribute parallel for simd
```

After the computational loop:

```
mem3.cc:#pragma omp target update from(z[0:n])
```

Before the `delete` calls:

```
mem3.cc:#pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])
```

Mem4 pattern: Replacing delete with release to use reference counting.

```
mem4.cc:#pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])
mem4.cc:#pragma omp target exit data map(release: x[0:n], y[0:n], z[0:n])
mem4.cc:#pragma omp target teams distribute parallel for simd map(to: x[0:n], y[0:n]) map(from: z[0:n])
```

Mem5 pattern: Using enter/exit data with map to/from and alloc/delete to reduce memory copies.

```
mem5.cc:#pragma omp target enter data map(to: x[0:n], y[0:n]) map(alloc: z[0:n])
mem5.cc:#pragma omp target exit data map(from: z[0:n]) map(delete: x[0:n], y[0:n])
mem5.cc:#pragma omp target teams distribute parallel for simd map(to: x[0:n], y[0:n]) map(from: z[0:n])
```

### One solution that minimizes data transfer

Mem6 pattern: Using enter data alloc/delete with an update directive at the end.

```
mem6.cc:#pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])
mem6.cc:#pragma omp target teams distribute parallel for simd
mem6.cc:#pragma omp target update from(z[0])
mem6.cc:#pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])
mem6.cc:#pragma omp target teams distribute parallel for simd
```

### Unified Shared Memory

Mem7 pattern: Using Unified Shared Memory to automatically move data.

```
mem7.cc:#pragma omp requires unified_shared_memory
mem7.cc:#pragma omp target teams distribute parallel for simd
mem7.cc:#pragma omp target teams distribute parallel for simd
```

For this example, `HSA_XNACK=1` needs to be set:

```
export HSA_XNACK=1
make mem7
./mem7
```

### Unified Shared Memory with backwards compatibility

Mem8 pattern: Demonstrating Unified Shared Memory with maps retained for backward compatibility. Set `HSA_XNACK=1` at runtime.

```
mem8.cc:#pragma omp requires unified_shared_memory
mem8.cc:#pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])
mem8.cc:#pragma omp target teams distribute parallel for simd
mem8.cc:#pragma omp target update from(z[0])
mem8.cc:#pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])
mem8.cc:#pragma omp target teams distribute parallel for simd
```

### APU Code -- Unified Address in OpenMP

We now switch to how unified address programming looks in other languages; the language we will work with most is OpenMP. We'll start by looking at the unified address code shown in slides 31 and 32. It is also in the mem12.cc file in the directory given below. You should also compare it to the original GPU code using explicit memory management in mem1.cc through mem6.cc.

We'll now run the unified address example if we have access to an MI300A GPU. If you don't have access to an MI300A, we can run nearly the same code in mem7.cc with managed memory on the MI200-series GPUs. We'll be looking at all of the versions of this code in the Advanced OpenMP presentation.

```
cd ~/HPCTrainingExamples/Pragma_Examples/OpenMP/CXX/memory_pragmas
module load amdclang
make mem12
./mem12
```

## Kernel Pragmas

Download the exercises and go to the directory with the kernel pragma examples:

```
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/Pragma_Examples/OpenMP/CXX/kernel_pragmas
```

Set up your environment. You should unset the `LIBOMPTARGET_INFO` environment variable from the previous exercise:

```
unset LIBOMPTARGET_INFO
```

```
export CXX=amdclang++
export LIBOMPTARGET_KERNEL_TRACE=1
export OMP_TARGET_OFFLOAD=MANDATORY
export HSA_XNACK=1
```

The base version 1 code is the Unified Shared Memory example from the previous exercises:

```
mkdir build && cd build
cmake ..
make kernel1
./kernel1
```

Kernel2: add `num_threads(64)`

Kernel3: add `num_threads(64) thread_limit(64)`

On your own: uncomment the line in CMakeLists.txt with `-faligned-allocation -fnew-alignment=256`. Another option is to add the alignment argument `(std::align_val_t(128))` to each `new` expression. For example:

```
double *x = new (std::align_val_t(128)) double[n];
```