Single-node multi-GPU Mandelbrot
================================
:::info
This code is taken from the [OpenACC Best Practices Guide repository](https://github.com/OpenACC/openacc-best-practices-guide/tree/main/examples/mandelbrot) from NVIDIA.
:::
In this exercise you will start from the block version of the Mandelbrot code and use OpenMP threads to send each block to a different GPU in the node. To do this, you need to bind each thread to one of the available GPUs in a round-robin fashion.
Thread-GPU binding
------------------
As a first step, we need to use the OpenMP and OpenACC/CUDA APIs to query the number of OpenMP threads available and bind threads to GPUs.
Assume that the number of threads equals the number of GPUs on the node, which is unknown a priori. Use the following APIs:
- `acc_set_device_num`
- `acc_get_num_devices`
- `omp_get_thread_num`
from the OpenACC and OpenMP libraries. Do not forget to `use` the corresponding modules (`openacc` and `omp_lib`) at the top of the program.
To check that the binding works, have each thread print its GPU:
`print*, "Thread:",my_gpu,"is using GPU",acc_get_device_num(acc_device_nvidia)`
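A minimal sketch of this binding step might look as follows (the variable names `my_gpu` and `num_gpus` are illustrative, not taken from the exercise code):

```fortran
program bind_gpus
   use omp_lib      ! omp_get_thread_num
   use openacc      ! acc_get_num_devices, acc_set_device_num, acc_get_device_num
   implicit none
   integer :: my_gpu, num_gpus

   ! Query how many NVIDIA GPUs are visible on this node.
   num_gpus = acc_get_num_devices(acc_device_nvidia)

   ! One OpenMP thread per GPU; each thread binds to one device round-robin.
   !$omp parallel private(my_gpu) num_threads(num_gpus)
   my_gpu = mod(omp_get_thread_num(), num_gpus)
   call acc_set_device_num(my_gpu, acc_device_nvidia)
   print*, "Thread:", my_gpu, "is using GPU", acc_get_device_num(acc_device_nvidia)
   !$omp end parallel
end program bind_gpus
```

With the NVIDIA HPC SDK this can be built with something like `nvfortran -mp -acc bind_gpus.f90`.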
Multi-GPU offload
-----------------
Distribute the outer do loop among OpenMP threads and parallelize the inner loops. Also use the `async` and `wait` clauses to send each block's processing and value update to queue 1 of its GPU; check the behaviour in the timeline view. Is the `async` clause needed?
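Putting the two steps together, a hedged sketch of the multi-GPU block loop might look as follows (the array and parameter names `image`, `num_blocks`, and `blocksize`, and the inline Mandelbrot iteration, are assumptions for illustration, not the exercise's actual identifiers):

```fortran
program mandelbrot_multigpu
   use omp_lib
   use openacc
   implicit none
   integer, parameter :: width = 1024, height = 1024, num_blocks = 8
   integer, parameter :: blocksize = height / num_blocks
   integer :: image(width, height)
   integer :: ib, ix, iy, it, niter, my_gpu, num_gpus
   complex :: z, c

   ! max(...,1) guards against division by zero when no GPU is visible.
   num_gpus = max(acc_get_num_devices(acc_device_nvidia), 1)

   ! Outer loop over blocks distributed among OpenMP threads.
   !$omp parallel do private(ib, ix, iy, it, niter, my_gpu, z, c)
   do ib = 0, num_blocks - 1
      ! Round-robin thread-to-GPU binding, as set up in the previous step.
      my_gpu = mod(omp_get_thread_num(), num_gpus)
      call acc_set_device_num(my_gpu, acc_device_nvidia)
      ! Offload this block to queue 1 of the bound GPU.
      !$acc parallel loop collapse(2) async(1) &
      !$acc copyout(image(:, ib*blocksize+1:(ib+1)*blocksize))
      do iy = ib*blocksize + 1, (ib+1)*blocksize
         do ix = 1, width
            c = cmplx(-2.0 + 3.0*real(ix-1)/width, -1.5 + 3.0*real(iy-1)/height)
            z = (0.0, 0.0)
            niter = 0
            do it = 1, 255
               z = z*z + c
               if (abs(z) > 2.0) exit
               niter = it
            end do
            image(ix, iy) = niter
         end do
      end do
      !$acc wait(1)
   end do
   !$omp end parallel do
end program mandelbrot_multigpu
```

Note that each thread waits on queue 1 immediately after launching its single kernel; thinking about what work, if any, overlaps in this pattern is a good way to approach the `async` question above.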
Solution
--------
:::spoiler
On Leonardo there are 4 GPUs per node, which leads to output like the following (thread order may vary between runs):
```
Thread: 0 is using GPU 0
Thread: 1 is using GPU 1
Thread: 3 is using GPU 3
Thread: 2 is using GPU 2
```
The block loop is parallelized among OpenMP threads, while the inner do loops are offloaded to the GPUs via OpenACC `parallel loop`. The timeline shows that each OpenMP thread offloads one block to queue 1 of its bound GPU.
:::