# How to DaCe
## Daint
running verification
```
DACE_execution_general_check_args=0 DACE_compiler_cuda_max_concurrent_streams=-1 PYTHONUNBUFFERED=1 FV3_DACEMODE=True FV3_STENCIL_REBUILD_FLAG=False srun --time=180 -N 6 -C gpu -A s1053 --partition=normal python -m pytest --data_path=/scratch/snx3000/tobwi/sbox/c12_6ranks_standard -v -s -rsx --disable-warnings --backend='gtc:dace:gpu' -m parallel ../fv3core/tests/ --which_modules=DynCore --print_failures --threshold_overrides_file=../fv3core/tests/translate/overrides/standard.yaml
```
running nsys profiling on daint
```
DACE_execution_general_check_args=0 DACE_compiler_cuda_max_concurrent_streams=-1 PYTHONUNBUFFERED=1 FV3_DACEMODE=True FV3_STENCIL_REBUILD_FLAG=False srun --time=180 -N 6 -C gpu -A s1053 --partition=normal nsys profile --force-overwrite=true -o ./profiling_results/%h.%q{SLURM_NODEID}.%q{SLURM_PROCID}.qdstrm --trace=cuda,mpi,nvtx --mpi-impl=mpich --stats=true python ../fv3core/examples/standalone/runfile/simple_acoustics.py /scratch/snx3000/tobwi/sbox/128_6ranks_baroclinic_acoustics 2 gtc:dace:gpu
```
running performance
```
DACE_execution_general_check_args=0 DACE_compiler_cuda_max_concurrent_streams=-1 PYTHONUNBUFFERED=1 FV3_DACEMODE=True FV3_STENCIL_REBUILD_FLAG=False srun --time=30 -N 6 -C gpu -A s1053 --partition=debug python ../fv3core/examples/standalone/runfile/simple_acoustics.py /scratch/snx3000/tobwi/sbox/128_6ranks_baroclinic_acoustics 2 gtc:dace:gpu
```
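The three Daint commands above share the same environment setup. A hypothetical helper script factoring out that shared environment is sketched below; the account (`s1053`), data paths, and timings are copied from the examples and will differ per user, and the final line is a dry-run `echo` rather than a real `srun` call so the script can be checked without a node allocation.

```shell
# Shared DaCe/FV3 environment used by all three Daint runs above.
export DACE_execution_general_check_args=0           # skip argument validation
export DACE_compiler_cuda_max_concurrent_streams=-1  # default CUDA stream only
export PYTHONUNBUFFERED=1
export FV3_DACEMODE=True
export FV3_STENCIL_REBUILD_FLAG=False

# Allocation settings for the performance run (debug partition).
SRUN_PERF="srun --time=30 -N 6 -C gpu -A s1053 --partition=debug"

# Dry run: print the command instead of executing it.
echo "$SRUN_PERF python ../fv3core/examples/standalone/runfile/simple_acoustics.py /scratch/snx3000/tobwi/sbox/128_6ranks_baroclinic_acoustics 2 gtc:dace:gpu"
```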
## Local runs
performance run
```
PYTHONUNBUFFERED=1 FV3_DACEMODE=True FV3_STENCIL_REBUILD_FLAG=False mpirun -np 6 --oversubscribe python ../fv3core/examples/standalone/runfile/simple_acoustics.py ../../c12_6ranks_standard/ 3 gtc:dace
```
verification
```
PYTHONUNBUFFERED=1 FV3_DACEMODE=True FV3_STENCIL_REBUILD_FLAG=False mpirun -np 6 --oversubscribe python -m pytest --data_path=/home/tobiasw/work/c12_6ranks_standard -v -s -rsx --disable-warnings --backend='gtc:dace' -m parallel ../fv3core/tests/ --which_modules=DynCore --print_failures --threshold_overrides_file=../fv3core/tests/translate/overrides/standard.yaml
```
## Special DaCe flags
- `DACE_compiler_cuda_syncdebug=1` enables CUDA synchronization debugging (memcheck-style checks)
- `DACE_profiling=1` profiles just the compiled `.so` over 100 runs
- `DACE_compiler_cuda_max_concurrent_streams=-1` runs everything in the default stream (`0` is unlimited, `X` spawns `X` extra streams)
- `DACE_compiler_cpu_openmp_sections=0` turns off the (currently faulty) OpenMP sectioning
- `DACE_execution_general_check_args=0` turns off argument validation
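These flags are ordinary environment variables with a `DACE_` prefix, so they can also be set from Python before `dace` is imported. A minimal sketch follows; `apply_flags` and `DACE_DEBUG_FLAGS` are hypothetical helpers (not part of DaCe), and the flag names and values are taken from this document.

```python
import os

# Flags from the list above (values as strings, since that is how the
# environment carries them).
DACE_DEBUG_FLAGS = {
    "DACE_compiler_cuda_syncdebug": "1",                # CUDA sync debugging
    "DACE_execution_general_check_args": "0",           # skip argument validation
    "DACE_compiler_cuda_max_concurrent_streams": "-1",  # default stream only
}


def apply_flags(flags, env=os.environ):
    """Set the given DACE_* variables; return only the ones that changed."""
    newly_set = {}
    for name, value in flags.items():
        if env.get(name) != value:
            env[name] = value
            newly_set[name] = value
    return newly_set
```

Call `apply_flags(DACE_DEBUG_FLAGS)` before the first `import dace` so the configuration is picked up when DaCe initializes.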
## Exploration
### Goals & status of DaCe collaboration
Project goals that DaCe can help with:
- On GPU, aiming for a 5x speedup vs. Fortran for the dycore
- On CPU, aiming for within ~10% of Fortran for the dycore
- Develop a debuggable and "friendly" modelling framework
- Target Piz Daint for a hero run of the model at scale (target: ~1.6 km resolution)
Extended goals (very active development toward those):
- Have a full model by porting & integrating FMS physics (the first ports are all there, the microphysics is optimized, and integration is ongoing)
- The physics uses somewhat different patterns than the dynamics, so implementing those is on the horizon
Where we are:
- Work on the acoustics timestep has shown validation & speedup on GPU without halo exchange (on `daint` (P100) and `nslb` (V100))
- Halo exchange work has been conducted on the DaCe side
- Regions work for the simplest cases (pointwise) but need some extra work to function throughout the code
- Right now most of the code is changed and the regions are represented in NumPy. This will most likely need to change if we move fully to DaCe; otherwise all the other backends will become very slow
### Questions to answer
Towards full dycore:
- Halo exchange integration
- Full acoustics numbers
  - numbers without halo exchange exist and can be extrapolated
  - numbers with halo exchange are a work in progress
- Testing features we need
- Indirect accesses
- While loop
- Frozen stencil > Future stencil (ongoing)
- Update to FV3Core master state
- regions
- while loop / indirect access (or implement the one stencil in Python)
- minor speedbumps along the way
  - the `nord` stencils: 4 versions with one different external; currently manually unrolled
- Document open errors and validation issues
Towards scale:
- Future stencil generation
- Measuring cache generation time (without the per-stencil `.so` compile)
- Test on a large node configuration (>100-200)
Towards Physics:
- Testing features we need:
- For loop
- Extra data dimensions
Towards a better understanding of the framework:
- Run GTBench with DaCe (CPU and GPU)
- Assess the extensibility of the full-program mode (for new features or exploratory work)
Workflow with the DaCe team:
- Discuss the form of contributions from AI2
- Look at DaCe debugging tools
- Look at DaCe performance analysis tools
Extended goals:
- Full dycore performance number
- Full dycore validation
- Microphysics validation