
[DaCe] Optimization IX

  • Shaped by: Philip
  • Appetite (FTEs, weeks):
  • Developers:

Problem

We essentially continue our work on the DaCe optimizer. The current state is as follows:

  • 15_to_28: Currently being cleaned up by Hannes. According to Christoph, the corrector is more or less done, but the predictor is not. Giacomo has started on it, but it is unclear whether he will continue.
  • 29_to_37: Currently handled by Christoph; no ETA known.
  • 39_to_60: Very near completion, currently handled by Ioannis.

The work would be pretty straightforward given what we have done before. However, it was suggested that the stencils be cleaned up first, before new stencils are analyzed. We consider this a dependency, or even a task within the scope of this project, but we do not handle it.

Left Overs

A list of standing issues in DaCe and/or GT4Py can be found here. In previous cycles this list was part of the respective shaping document, but it has now been decided that it should become its own document.

Something that has become urgent is the n_lev issue, which is related to the promotion of a Scalar to a symbol on an InterstateEdge; we are working on it.

Appetite

As long as it takes.

Solution

No-Go

Rabbit holes

Uncountable.

Important TODOs

  • Failing verification

    • dy_15_to_28_predictor
      • Multiple writes (race conditions) in multiple access nodes
    • calculate_nabla2_and_smag_coefficients_for_vn
      • Started failing after switching to using only the default stream. Not sure why; the SDFG looks simple.
      • Problem with https://github.com/GridTools/gt4py/pull/2178
      • It seems to be the same bug that affects apply_diffusion_to_theta_and_exner.
    • apply_diffusion_to_theta_and_exner
  • DaCe transformations

    • Enable VerticalMapFusion only if all the input edges of an AccessNode are generated by maps
      • See dy_41_to_60_corrector SDFG where the same map is split between 1:14 and 14:80 (issue originates from the pattern shown below around next_w)
        • [Image not available: pattern around next_w]
    • Fix next_w copies in 41_to_60_corrector/39_to_60_predictor
      • [Image not available]
    • Check 30_to_38 to see if there's anything to improve
      • Corrector/predictor in SL2 look good
      • See SL1 performance overhead for predictor below
  • Other performance aspects

    • Compare SDFGs of stencils that have differences between SL1 and SL2 performance
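
The VerticalMapFusion guard listed under "DaCe transformations" could be sketched roughly as follows. This is a toy model in plain Python, not the DaCe API: the Edge class, the "map"/"access" labels, and the node names field_a and field_b are hypothetical and exist only to illustrate the condition "all input edges of an AccessNode are generated by maps".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    """Toy edge of a dataflow graph (hypothetical, not a DaCe type)."""
    src: str       # name of the producing node
    src_kind: str  # toy label: "map", "access", "tasklet", ...
    dst: str       # name of the consuming access node


def all_writers_are_maps(access_node: str, edges: List[Edge]) -> bool:
    """Return True only if every edge writing into `access_node` comes from a map."""
    incoming = [e for e in edges if e.dst == access_node]
    # An access node without any writer is not treated as a fusion candidate.
    return bool(incoming) and all(e.src_kind == "map" for e in incoming)


# Hypothetical example: field_a is written by two maps only,
# field_b is additionally written by a copy from another access node.
edges = [
    Edge("map_lower", "map", "field_a"),
    Edge("map_upper", "map", "field_a"),
    Edge("map_full", "map", "field_b"),
    Edge("other_field", "access", "field_b"),
]

fusion_allowed = {name: all_writers_are_maps(name, edges) for name in ("field_a", "field_b")}
```

In this sketch the fusion would be enabled for field_a and skipped for field_b, because one of the writers of field_b is not a map.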

Progress

  • 15_to_28
    • https://github.com/C2SM/icon4py/pull/772
    • Check if the corrector is really okay.
    • Check predictor when ready (currently failing).
    • Check the state of the current integration into the benchmark repo (fixed values and flavors).
    • Update the OpenACC reference time.
    • Actual work:
      • Fix ConstantSubstitution for AccessNodes as well
      • Check why it is failing with the current version
  • 30_to_38
    • Ask Christoph when it is done and how its state is.
    • Find out which flavors there are and which values are correct.
    • Integrate into the optimizer.
    • Determine the OpenACC reference time.
    • Actual work:
      • Check the SDFGs to see if there's anything to improve
  • 39_to_60
    • Write 41_to_60_corrector in a single fieldop
      • Not really possible because next_w and vertical_mass_flux need to be written for one extra level
        • Check if they really have to be written for this extra level by running the timeloop tests in icon4py
    • Write 39_to_60_predictor in a single fieldop
      • Same as above
      • Symbol substitution fails for n_lev if it's set from the gtx.program
        • Works if it's passed as a parameter to the gtx.program and then to the gtx.fieldop
        • Need Edoardo's fix for the symbolic expressions
    • https://github.com/C2SM/icon4py/pull/784
      • Check why 41_to_60_corrector verification fails on GPU only with benchmark_2
      • Check why CI needs a specific parallelism setting; otherwise it fails with CUDA_ILLEGAL_ADDRESS_ERROR
  • Check SL1 vs SL2 performance
  • In VerticalMapFusion check if there are cases where we can fuse maps that are before/after an AccessNode that has overlapping edges
    • To fuse successfully the maps in this case we would have to duplicate the computations of the overlapping range which might not be beneficial
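
To judge whether fusing across an AccessNode with overlapping edges is worthwhile, one could first quantify how many vertical levels would have to be recomputed. The following is a minimal sketch in plain Python, assuming map ranges are half-open intervals [start, stop); the helper names overlap and duplicated_levels are hypothetical and not part of DaCe or GT4Py.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # assumed half-open vertical range [start, stop)


def overlap(a: Range, b: Range) -> Optional[Range]:
    """Return the intersection of two half-open ranges, or None if they are disjoint."""
    start, stop = max(a[0], b[0]), min(a[1], b[1])
    return (start, stop) if start < stop else None


def duplicated_levels(a: Range, b: Range) -> int:
    """Number of levels that would be computed twice if both ranges were fused as-is."""
    ov = overlap(a, b)
    return 0 if ov is None else ov[1] - ov[0]


# Adjacent ranges share no level, so nothing would be duplicated.
adjacent = duplicated_levels((1, 14), (14, 80))

# Hypothetical overlapping ranges: levels 14..19 would be computed twice.
overlapping = duplicated_levels((0, 20), (14, 80))
```

A small count relative to the full vertical extent suggests the duplication may be acceptable, while a large one suggests the fusion is unlikely to pay off; the actual threshold would have to come from measurements.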