:::info
This code is taken from the repository of David Henty and implements a regular 2D CFD simulation of an incompressible fluid flowing in a cavity using the Navier-Stokes equation.
:::
The code is composed of four files:
- boundary.f90 : boundary initialization
- cfdio.f90 : defines routines for I/O operations at the end of the program
- jacobi.f90 : contains routines implementing the Jacobi step and the error calculation
- cfd.f90 : main program, implements the main loop of the CFD code
Welcome to the first edition of the <font color="#F7A004">Epicure</font> hackathons!
The event will be hosted at
Edificio INFN-CINECA , Tecnopolo
Via Stalingrado 84/3
40128, Bologna
:runner: How to get there? Link here
This HackMD document collects information and a guided walkthrough of the hands-on exercises proposed here: https://gitlab.hpc.cineca.it/training/epicure-gpu-hackathon
CUDA exercises
The exercises are collected at this link
- Vector Addition
- Electrostatic Particle-In-Cell code
- Matrix Multiplication
- Matrix Transpose
:::info
The non-GPU version of this code is taken from the repository of David Henty.
:::
This is the MPI-distributed version of the CFD code previously offloaded to a single GPU. In this case, each MPI rank is bound to one GPU for the offload. The aim is to implement efficient GPU-to-GPU communications.
The code is composed of four files:
- cfd.f90 : contains the main loop of the program
- jacobi.f90 : contains the Jacobi step and the reduction for the error calculation
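Before any GPU-to-GPU communication can work, each rank must select its own device. A minimal sketch of one possible rank-to-GPU binding at startup, assuming the `openacc` module API and a single-node run where the world rank can be used directly (on multi-node runs a node-local rank, e.g. from `MPI_Comm_split_type`, should be used instead):

```fortran
! Illustrative rank-to-GPU binding; not the repository's actual code.
program bind_rank_to_gpu
  use mpi
  use openacc
  implicit none
  integer :: ierr, rank, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Query the visible GPUs and pick one per rank, round-robin.
  ngpus = acc_get_num_devices(acc_device_nvidia)
  if (ngpus > 0) then
     call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)
  end if

  call MPI_Finalize(ierr)
end program bind_rank_to_gpu
```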
The command line interface
To profile the program, run it with
nsys profile [optional command_switch_options] [application_executable] [optional application_options]
This command generates a report in the .nsys-rep format, which can be opened in the GUI.
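For example (executable name, arguments, and report name are illustrative, not the exercise's actual ones):

```shell
# Profile the CFD executable; -o names the output report file.
nsys profile -o cfd_report ./cfd 4 5000
# Open cfd_report.nsys-rep in the Nsight Systems GUI, or print a summary on the command line:
nsys stats cfd_report.nsys-rep
```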
:::info
The GUI can be installed on your laptop from https://developer.nvidia.com/nsight-systems
:::
Steps
1. Add data and parallel loop/seq directives, without optimization clauses.
2. Open the Makefile and add instructions to target Leonardo's accelerators; use -acc=noautopar to inhibit automatic loop optimizations done by the compiler and -Minfo=accel to get information on how the code is compiled for GPUs.
3. Modify the jobscript in order to compile and run the code on compute nodes.
4. Modify the parallel directives by adding clauses for loop optimizations; rerun the code.
5. Try also the kernels directive with -acc=autopar.
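As an illustration of step 1, the Jacobi sweep in jacobi.f90 might be annotated along these lines (the array names psi/psinew and the bounds m, n are assumptions, not necessarily the repository's exact code):

```fortran
! Illustrative OpenACC annotation of a Jacobi sweep (names are assumed).
!$acc data copyin(psi) copyout(psinew)
!$acc parallel loop collapse(2)   ! step 4: add gang/vector or tile clauses here
do j = 1, n
   do i = 1, m
      psinew(i,j) = 0.25d0 * ( psi(i-1,j) + psi(i+1,j) &
                             + psi(i,j-1) + psi(i,j+1) )
   end do
end do
!$acc end data
```

With -Minfo=accel the compiler reports, for each loop, whether it was parallelized as gang/vector or left sequential, which answers the first question below.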
Questions
How does the compiler offload the loops in the different cases?
Compare the time to solution of the GPU code in the three cases. Do you observe a performance improvement?
:::info
This code is taken from the repository of NVIDIA's OpenACC Best Practices Guide.
:::
In this code you will start from the blocked version of the Mandelbrot exercise and use OpenMP threads to send each block to a different GPU in the node. To do this, you need to bind one of the available GPUs to each thread in a round-robin fashion.
Thread-GPU binding
As a first step, we need to use the OpenMP and OpenACC/CUDA APIs to query the number of available OpenMP threads and bind the threads to GPUs.
Consider that the number of threads is equal to the number of GPUs on the node, which is unknown a priori. Use the following APIs:
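A minimal sketch of how these APIs can be combined, assuming the `omp_lib` and `openacc` modules (the program and variable names are illustrative):

```fortran
! Illustrative thread-to-GPU round-robin binding; names are assumed.
program thread_gpu_binding
  use omp_lib
  use openacc
  implicit none
  integer :: ngpus, tid

  ! Query how many GPUs are visible on the node.
  ngpus = acc_get_num_devices(acc_device_nvidia)

  ! Spawn one thread per GPU and bind each thread to its own device.
  !$omp parallel private(tid) num_threads(ngpus)
  tid = omp_get_thread_num()
  call acc_set_device_num(mod(tid, ngpus), acc_device_nvidia)
  !$omp end parallel
end program thread_gpu_binding
```

Any OpenACC region launched by a thread after its `acc_set_device_num` call then runs on that thread's GPU.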
:::info
This code is taken from the repository of NVIDIA's OpenACC Best Practices Guide and reproduces a common operation in image processing. The image is loaded into an array, where each element corresponds to a pixel. Each pixel is processed within a loop by a mandelbrot function.
:::
The code is composed of two files:
- main.* : contains the main loop of the program;
- mandelbrot.* : contains the definition of the mandelbrot function.
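The per-pixel loop described above might look roughly like this (a sketch following the description; the array name, bounds, and flat indexing are assumptions, not necessarily the repository sources):

```fortran
! Illustrative per-pixel loop: each element of image is filled by mandelbrot().
do y = 0, height - 1
   do x = 0, width - 1
      image(y * width + x + 1) = mandelbrot(x, y)
   end do
end do
```

Because every pixel is computed independently, this loop nest is a natural candidate for offloading.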
Offload
:::info
This code is taken from the repository of David Henty and implements a regular 2D CFD simulation of an incompressible fluid flowing in a cavity using the Navier-Stokes equation.
:::
In this exercise you will offload the serial version of the code using the OpenACC programming model.
The code is composed of four files:
- boundary.* : contains routines to define boundary conditions and initialization;
- cfdio.* : contains routines for the final I/O operations;
This toy code is composed of three files:
- mod_hostdata.f90 : contains the definitions of the global variables
- mod_functions.f90 : contains the initialisation routines
- gemm.f90 : main program, contains the GEMM operation on the CPU and on the GPU
Steps
- Manage the data movements with enter data and exit data directives in the initialisation/finalisation routines.
- After computing ZGEMM on the CPU, add a call to cublasZgemm on the GPU. Be careful to provide the device buffer (with OpenACC) as an input to the cuBLAS API.
- Add NVTX ranges to wrap the ZGEMM operation on the CPU and on the GPU.
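The three steps might combine along these lines. This is a sketch, assuming nvfortran's `cublas` interface module and an `nvtx` wrapper module (such as the one shipped with the NVIDIA HPC SDK examples); the array names a, b, c, the size n, and the scalars alpha, beta are illustrative:

```fortran
! Illustrative GPU path for the ZGEMM exercise (names and modules assumed).
use cublas        ! nvfortran's cuBLAS interface module
use nvtx          ! NVTX range wrappers (nvtxStartRange/nvtxEndRange)

!$acc enter data copyin(a, b) create(c)    ! placed once in the init routine

call nvtxStartRange("zgemm_gpu")
! host_data exposes the device addresses of a, b, c to the cuBLAS call.
!$acc host_data use_device(a, b, c)
call cublasZgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)
!$acc end host_data
call nvtxEndRange

!$acc exit data delete(a, b) copyout(c)    ! placed once in the finalise routine
```

Without the host_data region, cuBLAS would receive host pointers for arrays that live on the device, which is the pitfall the second step warns about.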