# Support GPU backends from Python

###### tags: `cycle 15`

- Shaped by: Enrique, Rico
- Appetite: 1/2 cycle, 2 developers
- Developers: Enrique for the storages part (WIP)

## Problem

`gt4py.next` is able to generate valid CUDA GPU code, as shown in ICON4Py where the generated code is called from Fortran. However, it is still not possible to call the generated GPU code directly from Python.

The two main blockers for supporting GPUs in Python are the incomplete state of the memory allocation utilities (currently working only for CPU) and the lack of a proper compilation step for CUDA code in Python.

## Appetite

This is an important step towards the goal of getting GT4Py into a usable state for general users, so at least half a cycle.

## Solution

As mentioned above, two pieces are missing to support executing the `gt4py.next` GPU backends from Python: GPU memory allocation through the `gt4py.storage` module, and extending the code-generation and compilation utilities of the `gt4py.next` toolchain to support CUDA. Both tasks are linked and should be tested together, but most of the development work can be done independently.

### Storages

During the last cycle, the [open PR](https://github.com/GridTools/gt4py/pull/1211) by Linus extending the CPU/GPU memory allocation utilities to `gt4py.next` was refactored to follow a different interface, but this task is still WIP (by @egparedes). The refactoring mainly deals with:

+ Decoupling the memory layout and alignment scheme from the backends, so that the memory scheme can be shared by different backends.
+ Decoupling the memory allocation from the dimension tags by providing two interfaces: a low-level interface dealing only with the allocation of buffer objects, and a high-level interface (meant to be used by users) to tag the allocated buffers with the dimension tags.

The goal for this cycle is to finalize this refactoring and test that it works properly with the compiled GPU backends.

### CUDA backend support

- C++ code-generation: the current `gtfn` backend needs to be forked to generate the same GridTools C++ templated code, but instantiated with a C++ GPU backend (if needed, check with @havogt, @petiaccja or @ricoh for details).
- Bindings code-generation: the `otf.bindings` package needs to be extended to support passing GPU buffers from Python at execution time. GPU buffers cannot be extracted from Python objects using the _Buffer Protocol_ because it only supports CPU memory. GridTools C++ already supports creating GPU SIDs from Python using the [CUDA Array Interface](https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html) (see code snippets here [[1](https://github.com/GridTools/gridtools/blob/eef2309238716b5a7098b5764fd96c2793277a89/tests/regression/py_bindings/driver.py#L39)], [[2](https://github.com/GridTools/gridtools/blob/eef2309238716b5a7098b5764fd96c2793277a89/include/gridtools/storage/adapter/python_sid_adapter.hpp#L271)]). `gt4py.cartesian` uses this mechanism and it works fine, so we will follow the same approach here, using the cartesian implementation as a reference when needed (a minimal sketch follows this list).
- Compilation: the `otf.compilation` subpackage will be updated to discover, set up, and call the CUDA compiler with appropriate flags and options. Again, `gt4py.cartesian` is already able to compile CUDA code, so the cartesian implementation can be used as a reference if needed (see the second sketch below).
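To illustrate why the Buffer Protocol is not enough for the bindings, here is a minimal sketch of what a GPU buffer exposes via the CUDA Array Interface. This only demonstrates the Python side; the consumer would be the `python_sid_adapter.hpp` code linked above:

```python
import cupy as cp
import numpy as np

# CPU buffers can be handed to C++ through the Buffer Protocol:
cpu_buf = np.zeros((3, 4), dtype=np.float64)
memoryview(cpu_buf)  # fine: NumPy arrays implement the Buffer Protocol

# GPU buffers cannot: `memoryview(gpu_buf)` would raise a TypeError,
# since CuPy arrays do not implement the Buffer Protocol. Instead they
# expose the device pointer and layout metadata through the CUDA Array
# Interface, a plain dict attribute:
gpu_buf = cp.zeros((3, 4), dtype=cp.float64)
cai = gpu_buf.__cuda_array_interface__
print(cai["data"])     # (device_pointer, read_only_flag)
print(cai["shape"])    # (3, 4)
print(cai["typestr"])  # '<f8' (little-endian float64)
```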
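For the compilation step, a second sketch of how the toolchain might probe for a CUDA compiler before configuring the generated CMake project. The `find_cuda_compiler` helper is hypothetical and not part of the current `otf.compilation` API:

```python
import shutil
import subprocess


def find_cuda_compiler() -> str | None:
    """Locate a working `nvcc` on the PATH.

    Hypothetical helper: the actual otf.compilation API may differ.
    """
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    # `nvcc --version` is a cheap sanity check that the binary runs.
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return nvcc if result.returncode == 0 else None


# Since CMake drives the build, enabling CUDA mostly amounts to passing
# extra options when configuring the generated project, e.g.:
#   cmake -DCMAKE_CUDA_COMPILER=<path-to-nvcc> -DCMAKE_CUDA_ARCHITECTURES=80 ...
```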
## Rabbit holes

In the current `gt4py.next` compilation pipeline, CMake handles the hard tasks of figuring out include paths and compilation flags, so we don't expect major source changes to deal with CUDA compilation. We acknowledge, however, that complications _might_ happen and the compilation step might turn out to be trickier than expected. To mitigate that, a vertical slice should be prepared first to test at least a simple case.

## Optional extra tasks

- If time allows, examples comparing `gt4py.cartesian` and `gt4py.next` could be written using storages: https://github.com/GridTools/gt4py/pull/1202
- [DLPack](https://github.com/dmlc/dlpack) is a standard interface for sharing memory buffers that supports multiple devices and frameworks, and it has been chosen as the _data interchange layer_ for the next revision of the [Python Array Standard](https://data-apis.org/array-api/latest/design_topics/data_interchange.html). Since both NumPy and CuPy already support DLPack, it might be worth using it in the Python bindings for the CPU and GPU backends to unify both code paths (see the sketch below). If this project finishes earlier than anticipated and there are no other urgent tasks, the developers could try a time-boxed exploration of this idea for a couple of days.
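A minimal sketch of the unified code path DLPack would enable. The `as_device_array` helper is hypothetical; `from_dlpack` itself is the real public API in both NumPy (>= 1.22) and CuPy:

```python
import cupy as cp
import numpy as np


def as_device_array(buf, device: str):
    """Import any DLPack-exporting buffer without copying.

    Hypothetical helper sketching how the CPU and GPU bindings code
    paths could share one entry point instead of going through the
    Buffer Protocol and the CUDA Array Interface separately.
    """
    xp = np if device == "cpu" else cp
    return xp.from_dlpack(buf)


# The same entry point serves both backends:
host = as_device_array(np.arange(10.0), device="cpu")  # stays on the CPU
dev = as_device_array(cp.arange(10.0), device="gpu")   # stays on the GPU
```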