# IFS-NEMO MN5-ACC (GPU)
## Build
### Forks and branches to use:
Use the forks and branches listed below; a minimal checkout sketch follows the list.
1) Exact pull-request source code state:
- ifs-bundle:
- `ssh://git@git.ecmwf.int/~ecme6549/ifs-bundle-accel.git`
- `feature/mn5-accel-partition`
- RAPS:
- export IFS_BUNDLE_RAPS_GIT:
`ssh://git@git.ecmwf.int/~ecme6549/raps-accel.git`
- export IFS_BUNDLE_RAPS_VERSION:
`feature/mn5-accel-partition`
- ifs-source:
- export IFS_BUNDLE_IFS_SOURCE_GIT:
`ssh://git@git.ecmwf.int/~ecme6549/ifs-source-accel.git`
- export IFS_BUNDLE_IFS_SOURCE_VERSION:
`feature/mn5-accel-partition`
2) Or ifs-bundle branch tuned for local tests:
- ifs-bundle:
- `ssh://git@git.ecmwf.int/~ecme6549/ifs-bundle-accel.git`
- `feature/mn5-accel-partition-localtest`
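A minimal checkout sketch combining the pieces above, assuming the usual clone/checkout/`create` workflow of ifs-bundle (only the URLs, branch names and `IFS_BUNDLE_*` variable names come from the lists above; the exact sequence is an assumption):
```
# Hypothetical checkout sequence for option 1 (exact pull-request state).
git clone ssh://git@git.ecmwf.int/~ecme6549/ifs-bundle-accel.git ifs-bundle
cd ifs-bundle
git checkout feature/mn5-accel-partition

# Point RAPS and ifs-source at the forks/branches listed above.
export IFS_BUNDLE_RAPS_GIT="ssh://git@git.ecmwf.int/~ecme6549/raps-accel.git"
export IFS_BUNDLE_RAPS_VERSION="feature/mn5-accel-partition"
export IFS_BUNDLE_IFS_SOURCE_GIT="ssh://git@git.ecmwf.int/~ecme6549/ifs-source-accel.git"
export IFS_BUNDLE_IFS_SOURCE_VERSION="feature/mn5-accel-partition"

./ifs-bundle create   # fetch sources; should honour the IFS_BUNDLE_* overrides
```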
### Bundle build args:
The `ifs-bundle build` flags that I use:
```
./ifs-bundle build --arch arch/eurohpc/mn5-acc --with-single-precision \
--with-gpu --with-gpu-aware-mpi \
--with-double-precision-nemo --nemo-version=V40 \
--nemo-grid-config=eORCA1_GO8_Z75 --nemo-ice-config=SI3 \
--with-multio-for-nemo-sglexe --dry-run --verbose \
--nemovar-grid-config=ORCA1_Z42 --nemovar-ver=DEV --build-dir=build
```
### VERY important notes:
**NOTE:** mind the `--with-gpu` and `--with-gpu-aware-mpi` flags!
**NOTE:** the build can be done **only** on MN5-ACC compute nodes. On GPP nodes you will hit module load errors, and on the MN5-ACC *login* node the compilation runs out of memory partway through.
**NOTE:** `make -j16` seems to be a good enough level of build parallelism. Even with 16 parallel processes, compilation occasionally stops at around the 25% progress mark with symptoms of a race condition in a Makefile. This happens rarely, but be aware of possible spontaneous compilation failures; restarting the compilation helps.
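A hedged sketch of a full build session on an MN5-ACC compute node, assuming an interactive allocation with the `bsc32` / `acc_debug` account/QOS combination listed in the runtime section below (the `salloc` options and the final compile step are assumptions; adapt them to your project):
```
# Hypothetical interactive build on an MN5-ACC compute node (allocation options are an assumption).
salloc -A bsc32 -q acc_debug -N 1 -t 02:00:00

# Inside the allocation:
./ifs-bundle create                                   # fetch sources (see forks/branches above)
./ifs-bundle build --arch arch/eurohpc/mn5-acc ...    # flags as in the previous subsection
# Then compile in the build directory, e.g. with make -j16 (see the note above);
# restart the compilation if the rare ~25% race-condition failure shows up.
```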
---
## Runtime
### DVC repositories for tco399 and tco1279:
```
rm -rf dvc-cache-de340
export DNB_IFSNEMO_DVC_INIT="module load singularity"
export DNB_IFSNEMO_DVC_CMD="singularity exec -e --bind /gpfs/projects/bsc32/bsc032120/dvc-cache-de340_container /gpfs/projects/bsc32/bsc032120/dvc.sif dvc"
export DNB_IFSNEMO_DVC_CACHE_DIR="/gpfs/scratch/ehpc01/data/.dvc.ehpc01/cache"
branch=ClimateDT-phase2
eval $DNB_IFSNEMO_DVC_INIT
git clone -b $branch https://earth.bsc.es/gitlab/kkeller/dvc-cache-de340.git
cd dvc-cache-de340
$DNB_IFSNEMO_DVC_CMD config --project cache.dir $DNB_IFSNEMO_DVC_CACHE_DIR
$DNB_IFSNEMO_DVC_CMD checkout
# tco1279:
# nemogrid=eORCA12_Z75
# icmcl=ICMCL_1279_1990_extra
# resol=1279
# yyyymmddzz=1990010100
# expver=hz9o
# tco399:
# !!NOTE!! ntsres=1200 -> ntsres=900 in hres!!
# nemogrid=eORCA025_Z75
# icmcl=ICMCL_399_1990
# resol=399
# yyyymmddzz=1990010100
# expver=hz9n
```
**NOTE:** double-check the values for the DVC repository branch and the cache dir
**NOTE:** for `tco399` the NEMO timestep has to be decreased manually to `900` in the hres file, otherwise it hits instability checks in NEMO at the beginning of the simulation (the same applies to the CPU version).
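For reference, the commented tco399 values from the DVC block can be set as shell variables matching the `hres` invocation in the next section:
```
# tco399 values taken from the comments in the DVC block above.
nemogrid=eORCA025_Z75
icmcl=ICMCL_399_1990
resol=399
yyyymmddzz=1990010100
expver=hz9n
# Remember: the NEMO timestep must be lowered manually to 900 in the hres file (see note above).
```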
### Model params:
The general idea is given in the file `bin/SLURM/mn5/hres_tco79.eORCA1.mn5-acc.slurm` inside RAPS.
The same idea is duplicated below (the flag set here may differ and be more up to date):
```
export DR_HOOK_TRAPFPE=0
wrapper=affinity_setup_mn5-acc.sh
hres \
-n $nodes -p $mpi -t $omp -h $ht \
-j $jobid -J $jobname \
-H $host -C $compiler \
-d $yyyymmddzz -e $expver -L $label \
--nemo-grid=$nemogrid --icmcl $icmcl \
-T $gtype -r $resol -l $levels -f $fclen \
--inproot-namelists \
-N $nproma \
-x $ifsMASTER \
-w $wrapper \
$MODEL_ARGS \
--run-directory="$run_dir" --ifs-bundle-build-dir="$build_dir"
```
where `MODEL_ARGS` for now looks like:
```
MODEL_ARGS+="--experiment=hist "
MODEL_ARGS+="--nemo --nemo-ver=V40 --nonemopart -R --deep "
MODEL_ARGS+="--keepnetcdf "
MODEL_ARGS+="--teos10 "
MODEL_ARGS+="--nemo-xproc=-1 --nemo-yproc=-1 "
MODEL_ARGS+="--no-ozone --sicoupl=1 "
MODEL_ARGS+="--noreferencecheck "
MODEL_ARGS+="--realization=1 "
```
This set of model args is for an IFS-NEMO simulation without I/O. On top of it, the I/O options may be added; for tco399 on 16 nodes they may look like this:
```
ioflags+="--multio-write-fdb "
ioflags+="--keep-fdb "
ioflags+="--ifs-and-wam-multio "
ioflags+="--nextgemsout=6 "
ioflags+="--iotasks=16 "
ioflags+="--nemo-multio-server-num=2 "
```
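A minimal sketch of how the I/O flags can be folded into the model args before the `hres` call shown above (plain string concatenation; the variable names are the ones used on this page):
```
# Hypothetical: append the I/O options to MODEL_ARGS when output is wanted.
MODEL_ARGS+="$ioflags"
# MODEL_ARGS is then expanded as-is in the hres command line shown earlier.
```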
**NOTE:** parallel configuration: 16 ranks per node, 5 OpenMP threads
**NOTE:** NPROMA=32
**NOTE:** mind the `export DR_HOOK_TRAPFPE=0` setting; without it the run tends to fail with SIGFPEs due to known nvfortran (NVHPC) issues
**NOTE:** mind the `--inproot-namelists` flag, which is required for the latest RAPS versions
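A hedged sketch of the parallel-layout variables used by the `hres` call above, matching the notes on this page for tco399 on 16 nodes (the `ht` value and the exact meaning of each RAPS variable are assumptions):
```
# Hypothetical layout variables for tco399 on 16 nodes (see notes above).
nodes=16
mpi=$(( nodes * 16 ))   # 16 MPI ranks per node
omp=5                   # 5 OpenMP threads per rank
ht=1                    # hyperthreading factor (assumption)
nproma=32               # NPROMA=32
```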
### Runtime configuration, queues and accounts:
The modules which are expected to be loaded at runtime:
```
module load nvidia-hpc-sdk/25.1 mkl/2024.0 hdf5/1.14.1-2-nvidia-nvhpcx fftw/3.3.10-gcc-nvhpc pnetcdf/1.12.3-nvidia-nvhpcx aec/1.1.2-gcc eccodes/2.34.1-gcc python/3.11 gdb/14.2-gcc
```
(these must be the same as listed in `ifs-bundle/arch/eurohpc/mn5-acc/default/env.sh`, but that `env.sh` list also contains some modules that are required at the build stage only).
Possible account/qos combinations for SLURM:
- bsc32 / acc_debug (or acc_bsces)
- ehpc01 / acc_debug (or acc_bsces)
- ehpc01 / acc_ehpc
**NOTE:** always compile and run from the *accelerated partition* login or compute nodes! Running/compiling from general purpose partition nodes will fail with some weird diagnostics.
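A hedged sketch of the corresponding SLURM header, using one of the account/QOS combinations above and the parallel layout from the runtime notes (directive names are standard SLURM; any MN5-ACC-specific partition/constraint line is left out here as an unknown):
```
#!/bin/bash
# Hypothetical job header: account/QOS from the list above, layout from the runtime notes.
#SBATCH --account=ehpc01
#SBATCH --qos=acc_ehpc
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16   # 16 MPI ranks per node
#SBATCH --cpus-per-task=5      # 5 OpenMP threads per rank
```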