# Lorenzo Internship
## First tasks
- Go through the GT4Py Laplacian together: [code](https://github.com/GridTools/gt4py/blob/main/tests/next_tests/integration_tests/multi_feature_tests/ffront_tests/test_laplacian.py)
- Setup VSCode environment with Julia Debugger
- Julia basics
Press `;` to get a shell repl inside a running julia terminal
Press `]` to open the package manager repl
- Laplacian in Julia
https://docs.julialang.org/en/v1/manual/arrays/
Profile
https://docs.julialang.org/en/v1/manual/profile/
Benchmark
https://github.com/JuliaCI/BenchmarkTools.jl
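As a warm-up before GridTools.jl, the stencil can be written directly on plain arrays. A minimal sketch of the 5-point Laplacian (function name and the interior-only handling are illustrative, not prescribed):

```julia
# Hypothetical 5-point Laplacian on a 2D array; boundary rows/columns
# are left at zero, only interior points are computed.
function laplacian(in_field::AbstractMatrix{T}) where {T}
    out = zeros(T, size(in_field))
    @inbounds for j in 2:size(in_field, 2)-1, i in 2:size(in_field, 1)-1
        out[i, j] = -4 * in_field[i, j] +
                    in_field[i-1, j] + in_field[i+1, j] +
                    in_field[i, j-1] + in_field[i, j+1]
    end
    return out
end
```

A plain-loop version like this also makes a convenient baseline for the profiling and benchmarking links above (e.g. `@btime laplacian(A)` with BenchmarkTools).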
- Get familiar with SLURM
https://user.cscs.ch/access/running/
- Execute a job on the multicore partition and the GPU partition of Piz Daint
- Execute an interactive job (e.g. a shell) and "look around" on the node, e.g. how many cores do you have, run `nvidia-smi`
- Execute a job and connect via ssh to the compute node in a different terminal
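For reference, a sketch of the kinds of commands involved (account and node names are placeholders; exact flags and constraints may differ, see the CSCS docs):

```shell
# Interactive job on the GPU partition (Daint uses -C gpu / -C mc constraints)
srun -A <account> -C gpu -N 1 -t 00:30:00 --pty bash
# Inside the allocation, "look around":
nproc          # number of cores on the node
nvidia-smi     # GPU model and utilization

# Batch job on the multicore partition:
sbatch -A <account> -C mc -N 1 -t 00:10:00 --wrap "hostname"

# Find the compute node of a running job, then ssh to it from another terminal:
squeue -u $USER
ssh <nodename>
```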
- Install GT4Py on your local machine and on Piz Daint (use Spack) and run all tests using pytest.
Make sure to use a supported Python version.
CSCS Knowledge base for Multifactor authentication: https://user.cscs.ch/access/auth/mfa/
Daint SSH: `daint.cscs.ch` via VPN or see here: https://confluence.cscs.ch/display/KB/Direct+access+to+the+computing+systems+from+local+clients
Spack links:
https://spack.readthedocs.io/en/latest/getting_started.html
https://spack.readthedocs.io/en/latest/environments.html
Spack environment file
```yaml
spack:
  specs:
  - gcc@12
  - python@3.10
  - py-pip
  - boost
  - cmake
  - py-cupy cuda_arch=60 ^cuda+allow-unsupported-compilers ^nccl+cuda cuda_arch=60
  view: true
  concretizer:
    unify: true
```
- Generated functions
Write a Julia function that has the same behaviour as the following
function, but dispatches at compile time. Generated function docs [here](https://docs.julialang.org/en/v1/manual/metaprogramming/#Generated-functions).
```julia
foo(x::Int) = 1
foo(x::Float64) = 2
```
Ensure the dispatching is happening at compile time using `@code_llvm` or `@code_native` macro.
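One possible solution sketch (named `foo2` here to avoid clashing with the methods above): inside a `@generated` function the argument `x` is bound to its *type*, so the branch is resolved while the method body is being generated, not at run time.

```julia
# Sketch: the if/elseif runs at code-generation time; each concrete
# argument type gets a body that is just a constant.
@generated function foo2(x)
    if x === Int
        return :(1)
    elseif x === Float64
        return :(2)
    else
        return :(error("unsupported type"))
    end
end
```

`@code_llvm foo2(1.0)` should then show nothing but a constant being returned.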
- Macros
Write a derivative macro that, given an expression, returns its derivative with respect to `x`.
```
expr_d_dx = @derivative x^2
@assert expr_d_dx == :(2*x)
expr_d_dx = @derivative x^2+x^3
@assert expr_d_dx == :(2*x+3*x^2)
```
Handy tools and methods:
```
dump(:(x^2)) # print expr tree
@macroexpand @my_macro(my_args) # show expr after macro expansion
```
https://docs.julialang.org/en/v1/manual/metaprogramming/
Take a look at https://github.com/FluxML/MacroTools.jl. This is often handy when working with expressions.
Use pattern matching from MacroTools to have a cleaner version: http://fluxml.ai/MacroTools.jl/dev/pattern-matching/
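A possible solution sketch covering only the power and sum rules, with literal integer exponents assumed (MacroTools' `@capture` would make the pattern matching cleaner than the raw `Expr` inspection used here):

```julia
# Sketch: tiny symbolic d/dx. Anything outside x^n and sums errors out.
d_dx(c::Number) = 0
d_dx(s::Symbol) = s === :x ? 1 : 0
function d_dx(e::Expr)
    if e.head === :call && e.args[1] === :+
        # Sum rule: differentiate each summand.
        return Expr(:call, :+, map(d_dx, e.args[2:end])...)
    elseif e.head === :call && e.args[1] === :^ && e.args[2] === :x && e.args[3] isa Int
        # Power rule, simplifying x^1 to x.
        n = e.args[3]
        return n == 2 ? :(2 * x) : :($n * x ^ $(n - 1))
    else
        error("unsupported expression: $e")
    end
end

# QuoteNode makes the macro return the expression itself (unevaluated
# and unhygienized), so the caller receives :(2 * x) as a value.
macro derivative(e)
    return QuoteNode(d_dx(e))
end
```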
- @inline, @inbounds, Optimization flags
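A small illustrative example of both macros (function names are made up):

```julia
# @inline hints the compiler to inline a small helper at its call sites;
# @inbounds elides bounds checks inside a loop we trust to be in range.
@inline twice(x) = 2x

function sum_twice(v::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(v)
        s += twice(v[i])
    end
    return s
end
```

Compare timings with and without the annotations, and try the corresponding command-line flags (`julia -O3 --check-bounds=no`).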
- Import a Python package in Julia using PyCall, execute something, set a breakpoint in the Python code and inspect the surrounding state
- Run a CUDA.jl example on Piz Daint (or Tödi if available again)
- Type stability https://m3g.github.io/JuliaNotes.jl/stable/instability/
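A minimal illustration of the idea from the linked notes (function names are made up):

```julia
# Type-unstable: the return type depends on the run-time value
# (Float64 for positive input, Int 0 otherwise), forcing a Union.
unstable(x) = x > 0 ? x : 0

# Type-stable: zero(x) keeps the return type equal to typeof(x).
stable(x) = x > 0 ? x : zero(x)
```

`@code_warntype unstable(1.0)` highlights the `Union{Float64, Int64}` return type in red, while `stable` infers cleanly.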
- Extending Broadcasting:
- https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting
- https://docs.julialang.org/en/v1/manual/interfaces/#extending-in-place-broadcast
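A minimal sketch of the opt-in pattern from the interfaces chapter; `MyField` is a made-up stand-in for a GridTools.jl-style field type:

```julia
struct MyField{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(f::MyField) = size(f.data)
Base.getindex(f::MyField, i::Int) = f.data[i]
Base.setindex!(f::MyField, v, i::Int) = (f.data[i] = v)

# Opt in to a custom broadcast style and tell broadcasting how to
# allocate the output, so `a .+ b` returns a MyField again.
Base.BroadcastStyle(::Type{<:MyField}) = Broadcast.ArrayStyle{MyField}()
function Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{MyField}},
                      ::Type{T}) where {T}
    return MyField(Vector{T}(undef, length(axes(bc, 1))))
end
```

With this in place, `MyField([1.0, 2.0]) .+ 1` produces a `MyField` rather than a plain `Vector`.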
- GT4Py quickstart guide: https://github.com/GridTools/gt4py/blob/main/docs/user/next/QuickstartGuide.md
- First tasks in GridTools.jl
- Generate documentation on your local machine
- [Optional] Fix CI
- Refactor tests in `gt4py_fo_exec.jl` to assert result values of all field operators are correct
Example structure of a test after refactoring:
```julia
a = Field(Cell, collect(1.:15.))
b = Field(Cell, collect(-1.:-1:-15.))
out = Field(Cell, zeros(Float64, 15))
@field_operator function fo_addition(a::Field{Tuple{Cell_}, Float64}, b::Field{Tuple{Cell_}, Float64})::Field{Tuple{Cell_}, Float64}
return a .+ b
end
fo_addition(a, b, backend = "py", out = out)
@test all(out.data .== 0)
```
To avoid duplicating every field operator as a regular Julia function, one approach is to compare the Python backend against the embedded backend. Ideally the results are simple enough that this is not needed; e.g. instead of checking the entire field, one could just compare the sum of all values in a field.
- Write laplacian and laplacian of laplacian test using GridTools.jl
- Small cleanup in the src/atlas folder. Extract all files except `atlas_mesh.jl` and place them in an example folder. Adapt the README with instructions on how to run the code. Note: the [atlas4py](https://github.com/GridTools/atlas4py) package is needed in the Python virtual environment.
- Set up benchmarking infrastructure.
- Look at [AirspeedVelocity.jl](https://github.com/MilesCranmer/AirspeedVelocity.jl)
- Set up CSCS CI for benchmarking (TODO @tehrengruber: set up CSCS CI token etc.)
- Make the tests work on GPU
- Write benchmarks
- Start with all elementary operations
- Measure the memory bandwidth on your machine
- Use STREAM installed via spack https://www.amd.com/de/developer/zen-software-studio/applications/spack/stream-benchmark.html
- Derive what performance you expect on your machine given the measured bandwidth
- Compare with the performance you measure in your benchmarks. Note: real performance is expected to be close to the theoretical value in the case of addition, but not for the neighbour sum (embedded). (No assumption for the Python backend; slow performance for large fields.)
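The derivation for the addition case can be scripted directly; a sketch (the 3 × 8 bytes-per-element model ignores write-allocate traffic, so it is a lower bound on the actual memory traffic):

```julia
# Estimate the achieved bandwidth of a field addition and compare it
# with the STREAM number for the machine.
n = 10^7
a = rand(n); b = rand(n); out = similar(a)
out .= a .+ b                                     # warm up / compile
t = minimum(@elapsed(out .= a .+ b) for _ in 1:10)
bytes = 3 * 8 * n                                 # 2 reads + 1 write, 8 B per Float64
bandwidth_gbs = bytes / t / 1e9
println("≈ $(round(bandwidth_gbs, digits = 1)) GB/s")
```

The printed number should land near the STREAM triad result when the fields are larger than the last-level cache.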
# Tentative goals / schedule until the end of the internship
Updated: 22.08
- Fix laplacian tests with embedded backend
- Add additional embedded backend running on the GPU
Potential packages to use: https://github.com/JuliaGPU/CUDA.jl
Many operations (e.g. arithmetic between fields) might work directly as soon as the broadcast is GPU aware; others, e.g. [`remap`](https://github.com/GridTools/GridTools.jl/blob/main/src/GridTools.jl#L444), need custom handling. For the broadcast this is the central loop: https://github.com/GridTools/GridTools.jl/blob/main/src/embedded/cust_broadcast.jl#L260
- Benchmarks that should work:
- basic benchmarks (arithmetic etc.)
- laplacian & double laplacian benchmark
- atlas based benchmarks
- nabla operator (optional: compare with C++ results)
- The new backend should be tested in the CI (CSCS CI required in order to use GPUs)
- Set up AirspeedVelocity.jl in the CSCS CI with the goals
- to see performance degradation between main and a PR to main
- to execute the benchmarks that print the memory bandwidth and allow easy comparison with the bandwidth as measured using STREAM
# Final Week
Updated: 10.09
1. Run benchmarks & stream in the CSCS CI directly without airspeed velocity
2. Check & understand the bandwidth numbers
3. Run CSCS CI for PRs with airspeed velocity comparing to main. Add "speedups" / table to PR automatically.
# Topics for GridTools.jl
Features expansion:
- Implement a Julia-based GPU backend for Nvidia and/or AMD GPUs (using CUDA.jl, AMDGPU.jl).
- Enhance the Julia CPU backend to support variable memory layouts.
Performance & Optimization:
- Develop a benchmark suite representative of common computational patterns in weather and climate codes.
- Execute the benchmark suite on CSCS supercomputers and identify potential performance bottlenecks.
Python to Julia transpiler:
- Implement a transpiler from FOAST or the internal representation in GT4Py (ITIR) to the Julia DSL to enable easy transition from the Python DSL to the Julia DSL.
- Apply the transpiler to an existing code, for example the test suite of GT4Py or an actual weather model, and validate the results.
Ideas:
- Physics in Julia. e.g. CloudSC
- Benchmarking & Optimize GridTools.jl. Advection sphere. ICON Nabla.
- Embedded backend with GPU
- Compiler passes in Julia. E.g. Function inliner.
- CSCS CI for GridTools.jl?
# Visit 02.07.24
- Start with an overview of the GT4Py toolchain
- Briefly explain what models we are working with
- ICON
- MCH
- Exclaim
- PMAP
- FV3
- Install GridTools.jl and get debugger running
- Overview of GridTools.jl
- Show Node visitor in GT4Py
- NodeTranslator
- Showed `foast_to_itir.py:visit_Constant`
- Usual procedure for extending GridTools.jl
- Implement feature in embedded
- Easy to debug & implement
- Extend the jast to foast lowering
- Allows executing with GT4Py