
Brainstorming Cycle 30 07/25

(checked boxes mean the project is shaped)

Brainstorming Blueline production

GT4Py

  • concat_where (Hannes) (see the usage sketch after this list)
    • merge passes (Hannes could take this)
    • embedded (TODO)
    • integration into passmanager (not reviewed)
    • dace backend (under testing)
  • Different output domains (Hannes) -> let's shape this (see the sketch after this list)
    • when do we need it?
    • Advantages:
      • Nicer user-code
        • Enables solving nlevp1 problem properly
      • More optimization potential exposed (by composing bigger operators)
    • discuss some examples
  • Make the implicit domain (deduced from the size of the output field) available to static args; PMAP could profit from these compile-time domain sizes (Enrique)
    • blocked by compile() for field_operators
    • TODO: check if it works on all backends
  • (Caching)
    • switching between GPU and CPU breaks the translation cache (in a compiled program?)
  • (Investigate locking problems)
  • Unstructured extent analysis
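
A minimal usage sketch for the concat_where and output-domain items above. This is a sketch only: the concat_where import path and the field type syntax differ between gt4py versions, and names like set_surface are made up for illustration.

```python
import gt4py.next as gtx
from gt4py.next.ffront.experimental import concat_where  # import path varies by gt4py version

KDim = gtx.Dimension("K", kind=gtx.DimensionKind.VERTICAL)
KField = gtx.Field[gtx.Dims[KDim], float]

@gtx.field_operator
def set_surface(interior: KField, surface: KField) -> KField:
    # level 0 comes from `surface`, all other levels from `interior`
    return concat_where(KDim == 0, surface, interior)

@gtx.program
def run_set_surface(interior: KField, surface: KField, out: KField, nlev: gtx.int32):
    # explicit output domain: write only levels [0, nlev)
    set_surface(interior, surface, out=out, domain={KDim: (0, nlev)})
```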

DaCe (optimizations)

  • Continue optimizations (Philip)
    • Which programs did DaCe not see? Christoph's (30to38) and the one that Hannes touches (15to28)
    • Performance improvements of the transformations
  • (Code generator is nondeterministic)
  • A number of issues in the lowering to SDFG and dace backend in general (Edoardo):
    • Lowering of symbolic expressions (sympy symbols instead of tasklets; see the illustration after this list)
    • Scalar input to concat_where with empty domain: requires special handling of dynamic memlet on scalars (see dace PR #2064)
    • CUDA codegen issue: wrong CUDA code is generated for a very simple stencil with one scalar input and one field output, which writes the scalar value to a subset of the field (init_constant_cell_kdim_field).
    • Segmentation fault in dycore stencil (to be investigated, could it be related to ScalarToSymbolPromotion pass?)
  • Investigate what's missing to test DaCe performance directly in ICON4Py instead of in the benchmarking repo (Edoardo, Enrique, Magdalena)
    • datatests/stenciltests with static params etc.
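
For context on the symbolic-expression item: DaCe represents sizes and index bounds as sympy-backed symbols, so expressions like N - 1 live in the SDFG rather than in generated tasklet code. A standalone illustration (not the failing GT4Py lowering itself):

```python
import dace
import numpy as np

N = dace.symbol("N")  # sympy-backed symbolic size

@dace.program
def shift_right(inp: dace.float64[N], out: dace.float64[N]):
    out[1:] = inp[:-1]  # bounds such as N - 1 stay symbolic in the SDFG

x, y = np.arange(10.0), np.zeros(10)
shift_right(x, y)  # N is inferred from the argument shapes
```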

ICON4Py

  • Finish the last program (Christoph)
  • Implement CFL stencil in velocity advection with compatible interface to ICON (see below, Christoph, Chia Rui)
    • combine 8_to_18
  • Do a round of refactoring of all combined programs (Hannes)
    • concat_where only where needed, etc
    • Document the strategy that we currently apply
      • Why we do it
      • How should it look in the future
      • etc.
  • Connectivities from Fortran are allocated at nproma size (Hannes)
    • temporaries will use the connectivities
    • fix: shrink connectivities to their respective sizes (see the sketch after this list)
    • Hannes: think about mch_ch2 and GlobalIndices, which don't have nproma sizes
  • CI/Testing/benchmarking (Enrique, Magdalena):
    • Benchmark infrastructure and an experiment with a benchmark-relevant size
    • switch the DSL experiment to the smaller mch_ch2_small
    • Reduce what we run (with the GT4Py strategy)
    • Caching of programs (like in PMAP)
      • Till summarizes the PMAP strategy
      • FileCache or more?
      • Can we just leave this on scratch?
  • (Delete liskov (and programs that are only there for liskov))
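
A possible shape of the connectivity fix mentioned above (hypothetical names and sizes; the real tables come from the Fortran side):

```python
import numpy as np

def shrink_connectivity(table: np.ndarray, num_valid: int) -> np.ndarray:
    """Drop the nproma padding: keep only the first `num_valid` rows."""
    return np.ascontiguousarray(table[:num_valid, :])

# e.g. an edge-to-cell table padded to nproma * nblocks rows:
padded_e2c = np.zeros((32, 2), dtype=np.int32)     # 32 = padded row count
e2c = shrink_connectivity(padded_e2c, num_valid=27)  # 27 actual edges
assert e2c.shape == (27, 2)
```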

Integration

  • Memory consumption (Christoph)
    • Investigate memory consumption
    • Push OpenACC -> CUDA mempool from Dmitry
    • Fortran: check that we don't have allocations that are not used
    • Does the backend keep temporaries too long?
      • How important is this?
      • DaCe+gtfn: re-use/release temporaries within a program
    • DaCe/gtfn: probably shouldn't use the CUDA memory pool
      • check that DaCe does cudaMallocAsync if switched to per-run allocation
    • ICON4Py manual temporaries (decrease scope to where it's needed)
      • make cupy use the CUDA memory pool (see the sketch after this list)
  • Crash in mch_ch2 (Hannes)
    • not resolved in gtfn
    • check with DaCe
  • Make dynamic substepping available from Fortran (pass the relevant variables to Python), combine the programs further, and test the CFL-exceeded cases (extra diffusion and extra substep) (Chia Rui)
  • Continue deployment: uenv + venv including ICON Fortran (Christoph)
  • Performance benchmarks (Christoph)
    • Do we have a production relevant experiment running in CI?
      • soon
    • Is total currently a good comparison of DSL vs OpenACC?
      • Use total plus model_init, or time with an external timer
    • median would be ideal, but does not exist right now. Let's use min for now.
      • switch back to average as soon as we pre-compile
    • Get the mch_ch2 and mch_ch1 running (not in CI).
    • icon4py: mch-ch2 and mch-ch1-medium on 1 GPU
    • icon-exclaim: mch-ch1-medium on 1 GPU (+ mch-ch2 on 4 GPUs, as soon as possible)
    • see also the task in ICON4Py
  • (Integrate DaCe runs into CI)
  • (Return a scalar to Fortran: do we need it for MCH production? (Hannes))
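
For the "make cupy use the CUDA memory pool" item: one option is CuPy's stream-ordered pool, which is backed by cudaMallocAsync (CUDA >= 11.2). Whether this is the same pool the rest of the model uses still needs to be checked; sketch:

```python
import cupy as cp

# Route all CuPy allocations through the stream-ordered
# (cudaMallocAsync-backed) pool instead of CuPy's default device pool.
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)

tmp = cp.zeros((1024, 1024))  # now served by the async pool
```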

Greenline

  • Halos for distributed (Magdalena)
  • Consistent torus implementation (Magdalena)
  • (Structured grid?)

Other

  • (Continuous benchmarking in GT4Py)
  • Investigate: Fix "During handling of the above exception, another exception occurred:" and show the actual exception (in CustomMapping and workflows; see the sketch after this list). (Hannes)
  • (Better asserts for out-of-bounds: forward shape inference (can compute) is a subset of backward inference (need to read))
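
Background for the exception-chaining item: the banner comes from Python's implicit exception chaining inside except blocks; re-raising with `from` makes the cause explicit (or `from None` hides it). Generic sketch, not the actual CustomMapping code:

```python
class MissingFieldError(KeyError):
    pass

def lookup(mapping: dict, key: str):
    try:
        return mapping[key]
    except KeyError as err:
        # `from err` records the KeyError as the direct cause;
        # `from None` would suppress the confusing
        # "During handling of the above exception ..." banner entirely.
        raise MissingFieldError(f"no entry for {key!r}") from err
```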

Came up later

(put stuff here that we didn't discuss together)

  • (The dace orchestration feature currently does not support stencil precompilation, because precompilation was added later. The icon4py granules make use of precompiled programs and are therefore not compatible with dace orchestration. The diffusion granule was the only component that worked with dace orchestration (and only on CPU, because of compilation errors on GPU). With the latest gt4py version we have to disable the dace orchestration tests on the diffusion granule. Do we want to keep maintaining this feature?) -> disable the tests for now