# NEMO/GPU OpenACC training/workshop
## Goals:
- Perform hands-on practice for the methodology, tools and techniques that are used for NEMO/GPU porting
- Alexey, Rommel: review and improvement of the existing hands-on materials and engineering tools
- Carlos, Aitor: acquire practical skills in using OpenACC in NEMO
- Laleh: check the correctness and ability to execute the Tracers and Ocean Dynamics regions on MN5-ACC
## Results
### Hands-on materials (Alexey & Rommel)
Updated the introductory slides and added a couple of new points:
- the scalars in GPU code: workaround for NVIDIA
- handling of OpenMP decomposition code in NEMO
- general way of porting subroutines in NEMO
URL: https://docs.google.com/presentation/d/1DNjQx3oWBne6s7fqcQXI2VwrOb3xQxOI
### Methodology and tools (Rommel & Alexey)
- updated the Loki-based tool that reviews Fortran modules in NEMO and extracts their basic structure: this helps to make decisions before porting new modules to GPU
- updated the tool that connects compiler diagnostics to source code
- how to automate creating the data copy clauses for separate-memory code? Brainstormed with Nikolaos
### Code: general and utility code (Alexey)
- made an update to the LBC code to use uniform `lbc_lnk` calls in both GPU and CPU code
Now, instead of this duplicated code for `lbc_lnk` and `lbc_lnk_multi`:
```
oncpu CALL lbc_lnk( 'icedyn_adv_pra', zei_max, 'T', 1.0_wp )
oncpu CALL lbc_lnk( 'icedyn_adv_pra', zes_max, 'T', 1.0_wp )
ongpu CALL lbc_lnk_gpu( 'icedyn_adv_pra', zei_max, 'T', 1.0_wp )
ongpu CALL lbc_lnk_gpu( 'icedyn_adv_pra', zes_max, 'T', 1.0_wp )
```
we can keep the `lbc_lnk` and `lbc_lnk_multi` calls as in the original code, without any changes:
```
CALL lbc_lnk( 'icedyn_adv_pra', zei_max, 'T', 1.0_wp )
CALL lbc_lnk( 'icedyn_adv_pra', zes_max, 'T', 1.0_wp )
```
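One way the updated LBC code could achieve this is to dispatch to the GPU variant inside `lbc_lnk` itself, so every call site stays target-agnostic. The following is only a hypothetical sketch: the `key_gpu` guard and the `lbc_lnk_cpu` name (standing in for the original implementation) are assumptions, not the actual change made to the LBC code.
```
SUBROUTINE lbc_lnk( cdname, ptab, cd_nat, psgn )
   CHARACTER(len=*), INTENT(in   ) ::   cdname      ! name of the calling routine
   REAL(wp),         INTENT(inout) ::   ptab(:,:)   ! field whose halos are exchanged
   CHARACTER(len=1), INTENT(in   ) ::   cd_nat      ! grid-point nature ('T', 'U', ...)
   REAL(wp),         INTENT(in   ) ::   psgn        ! sign used at the north fold
   !
#if defined key_gpu
   CALL lbc_lnk_gpu( cdname, ptab, cd_nat, psgn )   ! device-aware halo exchange
#else
   CALL lbc_lnk_cpu( cdname, ptab, cd_nat, psgn )   ! placeholder for the original CPU exchange
#endif
END SUBROUTINE lbc_lnk
```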
- made an update to the GPU macros to simplify handling of the OpenMP decomposition code
Now, instead of this code for the OpenMP decomposition:
```
oncpu CALL nompinfo( itid, ithreads )
oncpu ji1 = nompstas(itid,npti)
oncpu ji2 = nompends(itid,npti)
ongpu itid = 0; ithreads = 1
ongpu ji1 = 1
ongpu ji2 = npti
```
we may use a single GPU-aware macro that hides all the details:
```
gpu_aware_omp_decomposition(npti, ji1, ji2)
```
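A minimal sketch of how such a macro could be defined via CPP; the `key_gpu` guard and the exact macro bodies below are assumptions for illustration, the real definition in the build system may differ:
```
#if defined key_gpu
!  GPU build: no OpenMP threading, the device processes the whole 1..n range
#  define gpu_aware_omp_decomposition(n, i1, i2)  itid = 0 ; ithreads = 1 ; i1 = 1 ; i2 = n
#else
!  CPU build: query the OpenMP thread info and take this thread's slice of 1..n
#  define gpu_aware_omp_decomposition(n, i1, i2)  CALL nompinfo(itid, ithreads) ; i1 = nompstas(itid, n) ; i2 = nompends(itid, n)
#endif
```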
### Code: Tracers (Laleh)
We checked that the existing ported code builds and runs on MN5-ACC, and verified its reproducibility.
#### Tracers: good modules
The modules that are reproducible over at least the 64 time steps checked:
- File: `OCE/TRA/tradmp.F90`
- File: `OCE/TRA/trasbc.F90`
- File: `OCE/TRA/tranxt.F90`
- File: `OCE/TRA/traldf_iso.F90`
- File: `OCE/TRA/trazdf.F90`
#### Tracers: the modules with issues
The modules with reproducibility issues:
- File: `OCE/TRA/eosbn2.F90`
> Issue: Results are not identical when using 64 time steps, though they are consistent at 16 time steps.
- File: `OCE/TRA/traadv_fct.F90`
> Issue: Reproducibility fails with 32 time steps or more.
> Observation: Modifying the lateral boundary condition link from `lbc_lnk_gpu` to `lbc_lnk` did not resolve the issue.
- File: `OCE/TRA/trabbl.F90`
> Issue: Results are not reproducible even with 16 time steps.
### Code: Ocean dynamics modules (Laleh)
We checked that the existing ported code builds and runs on MN5-ACC, and verified its reproducibility.
We also ported some new code in `dynzdf.F90`.
#### Ocean dynamics: good modules
Reproducible and working OK:
- File: `OCE/DYN/dynzad.F90`
#### Ocean dynamics: the modules with issues
The modules with detected issues:
- File: `OCE/DYN/dynzdf.F90`
> Issue: A three-dimensional loop has a backward dependency in the third dimension, preventing reproducibility.
> Observation: Even when using `!$acc parallel loop seq` on the outermost loop, reproducibility is not achieved. As a workaround, this loop is currently executed on the CPU (a sketch of this kind of recurrence follows this list).
- File: `OCE/DYN/dynkeg.F90`
> Issue: Long execution time leading to failure.
> Error Message: Execution stops with the error `Failing at address: 0xc4f1002c245c`.
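For reference, the kind of vertical recurrence described for `dynzdf.F90`, sketched with illustrative array names; this is the standard parallel-horizontal / sequential-vertical pattern for such a dependency, not necessarily the exact loop that was tried:
```
!$acc parallel loop gang vector collapse(2) default(present)
DO jj = 1, jpj
   DO ji = 1, jpi
      !$acc loop seq
      DO jk = jpkm1, 1, -1
         ! level jk depends on the already-updated level jk+1,
         ! so the vertical loop cannot be parallelised
         zwd(ji,jj,jk) = zwd(ji,jj,jk) - zcoef(ji,jj,jk) * zwd(ji,jj,jk+1)
      END DO
   END DO
END DO
!$acc end parallel loop
```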
### Tracers and Ocean dynamics modules: summary and recommendations:
- Further investigation is required to isolate the root causes of non-reproducibility, especially in `traadv_fct.F90`, `eosbn2.F90`, and `dynzdf.F90`.
- Consider examining data dependencies under different execution models (CPU vs GPU).
- Profiling and debugging tools should be employed to trace the execution failure in `dynkeg.F90`, in particular by deactivating `async`.
- Code restructuring or loop reordering may be necessary to resolve backward dependencies that hinder reproducibility.
- The recommendation is to prioritize the Tracer modules:
  - First, we should understand why the EOS part (`eosbn2.F90`) fails bitwise comparisons at later time steps
  - Second, we have to detect and eliminate the bit-identity issues in `trabbl.F90` and `traadv_fct.F90`
  - Then it is worth trying to make the whole `step_tracers` work in separate-memory mode:
    - with explicit data transfer directives in each Fortran module on GH (build with `MANAGED_MEMORY=0`)
    - then test whether the same code works on MN5-ACC
    - try to gather all explicit data transfers into one big cumulative directive near the beginning of `step_tracers` (see the sketch below)
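A rough sketch of what such a cumulative directive at the start of `step_tracers` could look like; the field names are placeholders (assumed to come from USE'd modules), not the verified list of arrays used by the tracer kernels:
```
SUBROUTINE step_tracers( kt )
   INTEGER, INTENT(in) :: kt          ! time-step index
   !
   ! hypothetical cumulative transfer: bring every field the tracer kernels
   ! need onto the device once, instead of per-module copy clauses
   !$acc enter data copyin( tsb, tsn, un, vn, wn )
   !
   ! ... calls to the individual tracer routines ...
   !
   ! copy the updated tracers back to the host and release the rest
   !$acc exit data copyout( tsn ) delete( tsb, un, vn, wn )
END SUBROUTINE step_tracers
```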
### Code: Sea Ice rheology module (Aitor)
The module `src/ICE/icedyn_rhg_evp.F90`, along with the function `ice_var_sshdyn` (`src/ICE/icevar.F90`) and the subroutines `ice_strength` (`src/ICE/icedyn_rdgrft.F90`) and `bdy_ice_dyn` (`src/OCE/BDY/bdyice.F90`), was ported to work on GPUs in both managed and separate memory modes. Here are some observations made during the work:
#### icedyn_rhg_evp
- `zfmask`: Temporary arrays whose names start with `z` in the NEMO code usually need to be `!$acc create`'d on the GPU and do not persist after the execution of the subroutine. That is not the case for `zfmask`. The subroutine `icedyn_rhg_evp` receives the iteration number `kt` as an argument, and in the original code `zfmask` is only allocated when `kt == nit000`, i.e. on the first iteration; it then stays in memory for the whole run and is never deallocated. When porting the code to GPU, the variable is initialised at `kt == nit000` and transferred back to the CPU. As it is read-only in subsequent iterations, it is brought onto the device with `!$acc copyin` when needed.
```
IF ( kt == nit000 ) THEN
oncpu !$omp barrier
oncpu !$omp master
ALLOCATE( zfmask(jpi,jpj) )
ongpu !$acc data create(zfmask)
ongpu !$acc parallel loop gang vector collapse(2) default(present)
DO jj = 1, jpjm1
DO ji = 1, jpim1 ! NO vector opt.
zfmask(ji,jj) = tmask(ji,jj,1) * tmask(ji+1,jj,1) * tmask(ji,jj+1,1) * tmask(ji+1,jj+1,1)
END DO
END DO
...
ongpu !$acc update host(zfmask)
ongpu !$acc end data
ENDIF
...
ongpu !$acc data copyin(zfmask)
ongpu !$acc parallel loop gang vector collapse(2) default(present)
DO jj = MAX(1,jj1), MIN(jj2,jpjm1)
DO ji = 1, jpim1
! shear at F points
zds(ji,jj) = ( ( u_ice(ji,jj+1) * r1_e1u(ji,jj+1) - u_ice(ji,jj) * r1_e1u(ji,jj) ) * e1f(ji,jj) * e1f(ji,jj) &
& + ( v_ice(ji+1,jj) * r1_e2v(ji+1,jj) - v_ice(ji,jj) * r1_e2v(ji,jj) ) * e2f(ji,jj) * e2f(ji,jj) &
& ) * r1_e1e2f(ji,jj) * zfmask(ji,jj)
END DO
END DO
ongpu !$acc end parallel loop
ongpu !$acc end data
...
```
- The main calculation consists of a big outer loop over the `jter` index with double-nested loops inside it. As the range of `jter` was not known, this outer loop was left as in the original code, while the internal double-nested loops were parallelised and collapsed by default with `!$acc parallel loop gang vector collapse(2)`:
```
DO jter = 1 , nn_nevp
...
ongpu !$acc parallel loop gang vector collapse(2) default(present)
DO jj = MAX(1,jj1), MIN(jj2,jpj)
DO ji = 1, jpi
zu_ice(ji,jj) = u_ice(ji,jj) * umask(ji,jj,1) ! velocity at previous time step
zv_ice(ji,jj) = v_ice(ji,jj) * vmask(ji,jj,1)
END DO
END DO
ongpu !$acc end parallel loop
...
```
The function `ice_var_sshdyn` and the subroutines `ice_strength` and `bdy_ice_dyn` are only called from inside this subroutine, so they can be ported directly to GPU without creating alternate GPU versions.
#### rhg_evp
A minor subroutine. Most of the arrays are already present on the GPU, except `zres`, which is `!$acc create`'d.
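A minimal sketch of the pattern used for such a scratch array; the loop bounds and body below are placeholders, not the actual `rhg_evp` code:
```
!$acc data create(zres)                 ! device-only scratch, no host copy needed
!$acc parallel loop gang vector collapse(2) default(present)
DO jj = 1, jpj
   DO ji = 1, jpi
      zres(ji,jj) = 0._wp              ! placeholder computation
   END DO
END DO
!$acc end parallel loop
!$acc end data
```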
#### ice_strength
`SUM(..., dim=3)` had to be restructured to work on GPU; consider changing the CPU version as well for the sake of homogeneity:
```
oncpu strength(:,:) = rn_pstar * SUM( v_i(:,:,:), dim=3 ) * EXP( -rn_crhg * ( 1._wp - SUM( a_i(:,:,:), dim=3 ) ) )
ongpu !$acc parallel loop gang vector collapse(2) present(strength,a_i,v_i)
ongpu DO jj = 1, jpj
ongpu DO ji = 1, jpi
ongpu strength(ji,jj) = rn_pstar * SUM( v_i(ji,jj,:) ) * EXP( -rn_crhg * ( 1._wp - SUM( a_i(ji,jj,:) ) ) )
ongpu END DO
ongpu END DO
ongpu !$acc end parallel loop
```
The variable `strength` is the one that is modified and returned.
#### ice_var_sshdyn
A small function with a couple of whole-array assignments and no explicit loops.
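Whole-array assignments like these can typically be offloaded with a `kernels` region; a sketch with placeholder names, not the actual `ice_var_sshdyn` statements:
```
!$acc kernels default(present)
! placeholder whole-array assignment standing in for the real ones
zsshdyn(:,:) = pssh(:,:) * tmask(:,:,1)
!$acc end kernels
```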
#### bdy_ice_dyn
Double-nested loops inside a `SELECT CASE` control structure. The approach was conservative, only adding `!$acc` directives inside each `CASE` block, so some loops are still treated sequentially. A more aggressive optimisation would entail moving the loops outside the `SELECT CASE` structure or removing it altogether.
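A minimal sketch of the conservative pattern; the selector, case values and loop body are placeholders, not the actual `bdy_ice_dyn` code:
```
SELECT CASE( cn_bdy_type )            ! placeholder selector
CASE( 'frs' )
   !$acc parallel loop gang vector collapse(2) default(present)
   DO jj = 1, jpj
      DO ji = 1, jpi
         u_ice(ji,jj) = u_ice(ji,jj) * umask(ji,jj,1)     ! placeholder body
      END DO
   END DO
   !$acc end parallel loop
CASE( 'none' )
   ! nothing to do on this boundary
END SELECT
```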
### Code: Ocean Dynamics dyn_vor module (Carlos)
I ported the `dyn_vor.F90` module and achieved reproducible results using the reproducible build type.
I focused on the subroutine used by the `vor_een` option in the eORCA1 namelist, but the loop pattern is similar and can likely be adapted easily for the other options.
I did not have time to port the trends call to `trd_dyn`, which in any case is not used in the eORCA1 configuration we tested.
#### Code Transformations
- `MAX()` and `MIN()` expressions in loop bounds caused incorrect results under OpenACC. Precomputing them outside the loop (e.g., `kj1_aux`, `kj2_aux`) fixed it:
```
! Before (incorrect results)
!$acc parallel loop gang vector present(...)
DO jk = 1, jpkm1
DO jj = MAX(2, kj1), MIN(kj2, jpjm1)
DO ji = fs_2, fs_jpim1
...
! After (correct results)
kj1_aux = MAX(2, kj1)
kj2_aux = MIN(kj2, jpjm1)
!$acc parallel loop gang vector present(...)
DO jk = 1, jpkm1
DO jj = kj1_aux, kj2_aux
DO ji = fs_2, fs_jpim1
```
- Moving the `SELECT CASE` outside the main loop over `jk` improves GPU parallelisation and avoids issues with control flow inside parallel regions. Interestingly, in this case `MAX` and `MIN` did not interfere with good code generation:
```
! Before (cannot collapse loops, switch inside)
DO jk = 1, jpkm1
SELECT CASE(kvor)
CASE(np_RVO)
DO jj = MAX(2,kj1), MIN(kj2,jpj)
DO ji = 2, jpi
zwt(ji,jj,jk) = ...
END DO
END DO
...
END SELECT
END DO
SELECT CASE(kvor)
CASE(np_RVO)
!$acc parallel loop collapse(3) ...
DO jk = 1, jpkm1
DO jj = kj1_aux, kj2_aux
DO ji = 2, jpi
zwt(ji,jj,jk) = ...
END DO
END DO
END DO
...
END SELECT
```
- The original single `jk` loop combining the flux and vorticity calculations produced wrong results under OpenACC kernels, so we split it into three loops:
```
! Loop 1: fluxes
DO jk = 1, jpkm1
   zwx(...) = ...
   zwy(...) = ...
   ztne(...) = 0._wp
   ...
END DO
! Loop 2: the jj = 2 row
IF ( kj1 <= 2 .AND. kj2 >= 2 ) THEN
   DO jk = 1, jpkm1
      DO ji = ...
         ztne(ji,2,jk) = ...
      END DO
   END DO
ENDIF
! Loop 3: general vorticity (jj >= 3)
DO jk = 1, jpkm1
   DO jj = ...
      DO ji = ...
         ztne(ji,jj,jk) = ...
      END DO
   END DO
END DO
```
#### NEMO_build Suggestions/Requests
- Add an option to the `dnb.sh` command to restart from the `makenemo` step. This is useful when only a single file has changed, which is usually the case when continuously checking the reproducibility of new changes.
- For RISC-V I will need a way of using nemo-build with a test configuration, because I need to work with really small grids (e.g., 20,20,10) to be able to run under the emulator.