# COSMICA optimization

## Setup

- Allocate the resources on a Leonardo node
- Run the regular code
- Set up multiple GPUs

## Profiling

- Nsight Systems profile:
    - The code is strongly dominated by the execution of the Heliospheric Propagation kernel (the main computation function)
    - Memory load is negligible
    - Other kernels and initialization are negligible
    - The HelProp kernel is called multiple times (once for each initial particle energy)

![Epicure_Nsys](https://hackmd.io/_uploads/SyxippJ-1e.png)

- Nsight Compute:
    - Register pressure (COSMICA uses 109 registers per thread, against a maximum of 32)
    - The number of blocks needs to be greater than the number of SMs
    - The 5k quasi-particle parallelization is too small (it needs expanded parallelization or multiple kernel launches on the same GPU)
    - Large L2-L1 data transfer (possibly reads and writes of non-local variables)
    - Warp scheduling overheads or workload imbalances

## Improvements & optimizations

- Relieved register pressure with an architecture-specific compiler flag
- Optimized the number of warps per block based on the A100 SM count ()
- Discovered an unexpectedly large usage of the L2 cache (probably a symptom of incorrect non-local allocation of variables)

![ncu_report_mem](https://hackmd.io/_uploads/HJdgDJ-Wkx.png)

- We commented out the atomic histogram section (a possible cause of the L2 pressure problem) and ruled it out as a major contributor to this issue

![ncu_report_mem2](https://hackmd.io/_uploads/SJoAYkbZ1e.png)

- Nevertheless, we achieved a speedup (not directly related to the execution time of the histogram kernels, which is negligible compared to the propagation kernel)

![ncu_report_compare](https://hackmd.io/_uploads/H1nfRJW-1l.png)

- Tested and profiled inline compilation of the linked model computational functions (slight performance degradation)

![ncu_report3](https://hackmd.io/_uploads/r1uxwyZWke.png)

- Thread execution time divergence (with an increased time step), which needs deeper systematic testing

![ncu_report_thread_div](https://hackmd.io/_uploads/SyNBOyWZ1e.png)

- cuRAND keeps the RNG state in global memory and is executed at every propagation step iteration (a possible cause of delay). Try to keep the RNG output in registers, or generate random numbers with the Xorshift algorithm, to test the memory access
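
To make the Xorshift idea in the last item concrete, here is a minimal host-side C++ sketch, not COSMICA code. It uses Marsaglia's 32-bit xorshift with the 13/17/5 shift triple, whose whole state is one 32-bit word that a compiler can keep in a register. The names `xorshift32`, `uniform01` and `mean_of_draws` are hypothetical, and the loop is only a stand-in for the propagation steps. The same functions could be marked as device functions in the CUDA code, but that port is an assumption and has not been profiled here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: one xorshift32 step (shift triple 13, 17, 5).
// The whole generator state is a single 32-bit word, so it can live in a
// register instead of being loaded from and stored to global memory.
inline std::uint32_t xorshift32(std::uint32_t &state) {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// Hypothetical helper: map the top 24 bits of a draw to a float in [0, 1).
inline float uniform01(std::uint32_t &state) {
    return static_cast<float>(xorshift32(state) >> 8) * (1.0f / 16777216.0f);
}

// Stand-in for a propagation loop: the state is copied into a local
// variable once, updated locally at every step, and never written back
// to memory inside the loop.
inline float mean_of_draws(std::uint32_t seed, int n_steps) {
    assert(seed != 0);   // an all-zero state would stay zero forever
    assert(n_steps > 0);
    std::uint32_t state = seed;  // local copy, a register candidate
    float sum = 0.0f;
    for (int i = 0; i < n_steps; ++i) {
        sum += uniform01(state);
    }
    return sum / static_cast<float>(n_steps);
}
```

For example, a call such as `mean_of_draws(2463534242u, 100000)` averages 100000 draws from a single locally held state. Note that this generator has a period of only 2^32 - 1, so whether its statistical quality is sufficient for the propagation would need to be checked separately from the memory-access test.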