# COSMICA optimization
## Setup
- Allocate resources on a Leonardo node
- Run the regular code
- Set up multi-GPU execution
## Profiling
- Nsight Systems profile:
  - Execution is strongly dominated by the Heliospheric Propagation kernel (HelProp, the main computation function)
  - Memory load is negligible
  - Other kernels and initialization are negligible
  - The HelProp kernel is called multiple times (once for each initial particle energy)

- Nsight Compute:
  - Register pressure (COSMICA uses 109 registers per thread, against the 32 per thread that full occupancy allows)
  - The number of blocks needs to be greater than the number of SMs
  - Parallelizing over 5k quasi-particles is too small (needs wider parallelization or multiple kernel launches on the same GPU)
  - Large L2-L1 data transfers (possibly reads and writes of non-local variables)
  - Warp scheduling overheads or workload imbalances
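
The register and block-count findings above can be checked with a little arithmetic. The following is a rough sketch, not part of COSMICA: the function names are hypothetical, and the hardware limits are assumed values for an A100-class GPU (compute capability 8.0) that should be verified on the actual device, for example with `cudaGetDeviceProperties`. It only estimates an upper bound from registers alone; Nsight Compute applies further limits.

```cpp
#include <algorithm>
#include <cassert>

// Assumed limits for an NVIDIA A100 (compute capability 8.0).
// These are assumptions of this sketch; query the real device to confirm.
constexpr int kRegsPerSm = 65536;    // 32-bit registers available per SM
constexpr int kMaxWarpsPerSm = 64;   // 2048 resident threads / 32 threads per warp
constexpr int kWarpSize = 32;
constexpr int kRegAllocUnit = 8;     // assumed per-thread register allocation granularity

// Upper bound on resident warps per SM when registers are the only limit.
int estimate_resident_warps(int regs_per_thread) {
    assert(regs_per_thread > 0);
    // Round the per-thread register count up to the allocation granularity.
    int rounded = (regs_per_thread + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    int warps = kRegsPerSm / (rounded * kWarpSize);
    return std::min(warps, kMaxWarpsPerSm);
}

// Number of blocks needed to cover n_particles with one thread per particle.
int blocks_needed(int n_particles, int threads_per_block) {
    assert(n_particles >= 0 && threads_per_block > 0);
    return (n_particles + threads_per_block - 1) / threads_per_block;
}
```

Under these assumptions, `estimate_resident_warps(109)` gives 18 of 64 warps (about 28% occupancy), while `estimate_resident_warps(32)` gives the full 64. With a hypothetical block size of 128 threads, `blocks_needed(5000, 128)` gives 40 blocks, fewer than the 108 SMs of a standard A100, so part of the GPU would stay idle.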
## Improvements & optimizations
- Relieve register pressure with architecture compiler flag
- Optimize the number of warps per block based on the A100 SM count ()
- Discovery of unexpectedly heavy L2 cache usage (probably a symptom of variables incorrectly allocated in non-local memory)

- We commented out the atomic histogram section (a possible cause of the L2 pressure problem) and ruled it out as a major contributor to this issue

- Nevertheless, we achieved a speedup (not directly related to the execution time of the histogram kernels, which is negligible compared to the propagation kernel)

- Test and profile inline compilation of the linked model computational functions (slight performance degradation)

- Thread execution times diverge (with an increased time step); this needs deeper, systematic testing

- cuRAND stores the RNG state in global memory and is invoked at every propagation step iteration (a possible cause of delay). Try keeping the RNG output in registers, or generating random numbers with an Xorshift algorithm, to test the memory access
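
For the Xorshift experiment mentioned in the last item, a minimal sketch of a generator small enough to live in registers could look like the following. It is not the COSMICA implementation: `Xorshift32` and the `HD` macro are hypothetical names, and the shift constants (13, 17, 5) are those of Marsaglia's 32-bit xorshift. The macro lets the same code compile as plain C++ on the host and as device code under `nvcc`. It is meant only to test the memory-access hypothesis; the statistical quality of such a simple generator would need separate validation before use in production runs.

```cpp
#include <cassert>
#include <cstdint>

// Compile as host-only C++ or as CUDA host/device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical register-resident generator: the whole state is one 32-bit word.
struct Xorshift32 {
    std::uint32_t state;  // must be nonzero, otherwise the sequence stays at zero

    // Advance the state and return the next 32-bit value.
    HD std::uint32_t next() {
        assert(state != 0u);
        std::uint32_t x = state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        state = x;
        return x;
    }

    // Uniform float in [0, 1), built from the top 24 bits.
    HD float next_uniform() {
        return (next() >> 8) * (1.0f / 16777216.0f);
    }
};
```

In a kernel, each thread would hold its own `Xorshift32` as a local variable, seeded with a distinct nonzero value, so that no global-memory access is needed per propagation step.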