# Strided vs indirect addressing

https://hackmd.io/1fkccfepQge5AiVOxmzV1g

## First steps

### Stencil benchmarks

Familiarize with stencil benchmarks and compare P100 - A100 - H100.

- TODO(havogt): find the NVIDIA H100 document with performance data

### gtbench

Run on santis and evaluate performance. Compare to performance extrapolation.

### Repeat Cartesian experiments

see [previous work](#Previous-work)

## Main project: optimize strided access for ICON like grids

### Steps

- get a regular ICON-like mesh
- implement strided using the "coloring" strategy: TODO(havogt) add link to hackmd document describing how to go from neighbor tables to strided
- find a meaningful example (at least 2 neighbor reductions)
- optimize with blocking
- try to think of possible other optimizations

### Previous work

#### Presentation Carlos

TODO

https://github.com/cosunae/cuda_stencils

#### Bachelor thesis Andre Rösti (at SPCL)

https://andreroesti.com/attachments/bachelor-thesis.pdf

https://docs.google.com/presentation/d/1aIKuymloD4Iz_JCkLzxkT4QBWbvdCQtaE-ntbgPWkO0/edit#slide=id.g6e8f0c7100_0_188

https://github.com/andrej/stencil-performance

##### Work plan

Efficient GPU Implementation of Stencils on Unstructured Grids

Weather and climate models solve governing equations of the atmosphere by discretizing them on a structured or unstructured grid. The finite difference and finite volume methods used to discretize the equations have a low arithmetic intensity. As a result, weather and climate codes are very memory bandwidth-hungry, and applying data-locality optimizations is performance-critical.

Loop fusion and loop tiling are known to be effective for structured grid codes. They can be implemented easily since the neighbor relations are known at compile-time, which allows us to apply overlapped tiling. Unstructured grid codes are typically implemented using indirect addressing, which complicates the implementation of loop fusion and loop tiling.
Although indirect addressing enables arbitrary data-layouts, the grid elements of the unstructured grid should be stored in a way that ensures the memory accesses of neighboring threads coalesce. Data-layout optimizations are thus another important concern when tuning the performance of unstructured grid codes.

In this thesis, we plan to develop possible implementation strategies for unstructured grid codes. In particular, we plan to evaluate the performance penalty of indirect addressing and strive to develop efficient implementation strategies for unstructured grids. We may thereby use the structure inherent to the unstructured grids used in weather and climate [1].

We plan to work on the following tasks:

Task 1 - Compare Direct and Indirect Addressing

Indirect addressing is supposedly more expensive than direct addressing. In this task, we compare direct and indirect addressing for existing COSMO benchmark kernels (rectangular grid). We will re-implement them using indirect addressing and compare the performance to the original variants that implement direct addressing.

Task 2 - Optimized Unstructured Grid Codes

We then switch from rectangular grids to icosahedral grids and plan to tune the memory performance of unstructured grid codes. Possible optimizations are data-layout transformations (space-filling curves, stripes of triangles) or addressing optimizations. Despite their name, the unstructured grids used in weather and climate are mostly structured and dense (fully structured in the k-dimension and mostly structured in the ij-dimension). The irregularities are limited to corner cases which potentially enables the implementation of efficient compression schemes for the index arrays describing the neighbourhood relations.

Task 3 - Loop Fusion and Loop Tiling

Overlapped tiling has been successfully applied to structured grid codes. In this task, we evaluate possible data-locality transformations for unstructured codes.
Our goal is to formalize the cost of different implementation variants by considering different performance metrics such as execution time, achieved memory bandwidth, and register pressure.

References

[1] http://cgi.cs.arizona.edu/~mstrout/Papers/Papers05-09/smashing-LCPC08.pdf
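
##### Sketch: direct vs indirect addressing (Task 1)

To make the comparison in Task 1 concrete, here is a minimal sketch of the same 5-point Laplacian written once with direct (strided) addressing and once with indirect addressing through a neighbor table. It is plain Python and only illustrates the addressing logic, not GPU performance. The function names, the flat `i + j * nx` layout, and the periodic boundaries are assumptions made for this illustration and are not taken from the COSMO kernels or the thesis code.

```python
def flat_index(i, j, nx):
    """Map 2D grid coordinates to a position in a flat, row-major array."""
    return i + j * nx


def laplacian_direct(field, nx, ny):
    """Direct addressing: neighbor positions are computed from (i, j) and the strides."""
    out = [0.0] * (nx * ny)
    for j in range(ny):
        for i in range(nx):
            c = flat_index(i, j, nx)
            west = flat_index((i - 1) % nx, j, nx)  # periodic wrap (assumption)
            east = flat_index((i + 1) % nx, j, nx)
            south = flat_index(i, (j - 1) % ny, nx)
            north = flat_index(i, (j + 1) % ny, nx)
            out[c] = (field[west] + field[east] + field[south] + field[north]
                      - 4.0 * field[c])
    return out


def build_neighbor_table(nx, ny):
    """Precompute the (west, east, south, north) flat indices for every cell."""
    table = []
    for j in range(ny):
        for i in range(nx):
            table.append((
                flat_index((i - 1) % nx, j, nx),
                flat_index((i + 1) % nx, j, nx),
                flat_index(i, (j - 1) % ny, nx),
                flat_index(i, (j + 1) % ny, nx),
            ))
    return table


def laplacian_indirect(field, neighbors):
    """Indirect addressing: neighbor positions are loaded from the table."""
    out = [0.0] * len(field)
    for c, nbrs in enumerate(neighbors):
        out[c] = sum(field[n] for n in nbrs) - 4.0 * field[c]
    return out


# Small usage example on a 5 x 4 periodic grid with integer-valued data.
nx, ny = 5, 4
field = [float((3 * i + 7 * j * j) % 11) for j in range(ny) for i in range(nx)]
neighbors = build_neighbor_table(nx, ny)
assert laplacian_direct(field, nx, ny) == laplacian_indirect(field, neighbors)
```

The indirect variant pays one extra load per neighbor (the table entry) but needs no knowledge of `nx` or `ny`, so it runs unchanged on any mesh for which such a table exists. The direct variant relies on the regular layout, which is the property the strided approach for ICON-like grids tries to recover.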