Optimise matmul
===============

Steps
-----
1. Add data and `parallel loop`/`seq` directives, without optimization clauses (see the first sketch at the end of this page).
2. Open the Makefile and add instructions to target Leonardo's accelerators; use `-acc=noautopar` to inhibit automatic loop optimizations done by the compiler and `-Minfo=accel` to get information on how the code is compiled for GPUs.
3. Modify the jobscript in order to compile and run the code on the compute nodes.
4. Modify the parallel directives by adding clauses for loop optimizations; rerun the code (second sketch below).
5. Try also with the `kernels` directive and `-acc=autopar` (third sketch below).

Questions
---------
- How does the compiler offload the loops in the different cases?
- Compare the time to solution of the GPU code in the three cases. Do you observe a performance improvement?
- Compare how the loops are mapped onto the GPU hardware using the profiler.

:::info
:bulb: Collect GPU counters with `--gpu-metrics=true`
:::

Solution
--------
:::spoiler

| Version     | Time (Fortran) | Grid size (X,Y,Z) |
| ----------- | -------------- | ----------------- |
| Single loop | 0.7526160 s    | (63,1,1)          |
| Optimized   | 0.3015850 s    | (500000,1,1)      |

With the optimization clauses, the grid size along X (and, equivalently, the number of gangs) has increased significantly, while the block size remains unchanged. The `collapse` clause increases the amount of parallelism exposed for `gang` parallelization, but in both cases the inner loop contains a reduction that has to be executed sequentially within a gang.

With nsys, we can also check the number of active SMs and the warp occupancy:

![image](https://hackmd.io/_uploads/Sya0mu9ekg.png)

:::
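
Directive sketches
------------------

The sketches below are illustrative, not the exercise's actual source: the program name, the array names `a`, `b`, `c`, and the size `n = 8000` are assumptions (that size, together with the default vector length of 128, is consistent with the 63 and 500000 gangs reported above). For step 1, the kernel with data and `parallel loop`/`seq` directives and no optimization clauses might look like this:

```fortran
! Sketch for step 1 (assumed names and size; the build line is illustrative):
!   nvfortran -acc=gpu,noautopar -Minfo=accel matmul.f90
program matmul_acc
  implicit none
  integer, parameter :: n = 8000        ! assumed problem size
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: tmp
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0; b = 2.0d0; c = 0.0d0

  !$acc data copyin(a,b) copyout(c)
  !$acc parallel loop                   ! only the outer loop is parallelized
  do j = 1, n
    !$acc loop seq                      ! inner loops stay sequential
    do i = 1, n
      tmp = 0.0d0
      !$acc loop seq
      do k = 1, n
        tmp = tmp + a(i,k) * b(k,j)
      end do
      c(i,j) = tmp
    end do
  end do
  !$acc end data

  print *, 'c(1,1) =', c(1,1)
end program matmul_acc
```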
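For step 4, one possible set of optimization clauses (a drop-in replacement for the loop nest in the sketch above; whether it matches the exercise's reference directives is an assumption): `collapse(2)` merges the two outer loops into a single iteration space for `gang vector` scheduling, while the `k` reduction runs sequentially in each thread, as described in the solution.

```fortran
  ! Sketch for step 4: collapse the two outer loops, keep the reduction sequential.
  !$acc parallel loop collapse(2) gang vector
  do j = 1, n
    do i = 1, n
      tmp = 0.0d0
      !$acc loop seq                    ! per-thread sequential reduction over k
      do k = 1, n
        tmp = tmp + a(i,k) * b(k,j)
      end do
      c(i,j) = tmp
    end do
  end do
```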
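For step 5, the same loop nest under `kernels` (again a drop-in replacement inside the first sketch's data region): the compiler analyzes the nest and chooses the mapping itself, so it is worth comparing the `-Minfo=accel` reports obtained with `-acc=autopar` and `-acc=noautopar`.

```fortran
  ! Sketch for step 5: let the compiler map the loops (rebuild with -acc=gpu,autopar).
  !$acc kernels
  do j = 1, n
    do i = 1, n
      c(i,j) = 0.0d0
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
  !$acc end kernels
```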