# [Greenline] Cleanup / closing loose ends
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Magdalena,
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
### backend settings - Nikki - done
Currently, icon4py uses a global setting for the selected backend. This makes the backend hard to change and does not allow simple backend switching for CI runs, which is important for performance measurements.
### initialization of python granules - Nikki - done
The `SolveNonHydro` and `Diffusion` classes separate construction from initialization. This was originally done because the prototype of the CFFI Fortran wrapper needed a granule instance to exist before calling the init function. That is no longer needed, and the split has the drawback that a user might create a granule instance that is not ready to run because it has not been initialized.
### fix various issues and quick fixes from GPU runs
GPU runs of the JW test case and the granule integration led to various type casts (int to int32) and device/host copies scattered throughout the code. This should be consolidated.
### ~~KHalfDim in diffusion~~
Postponed. (Observation: the group of people who want to push this is growing.)
This cleanup is a step towards a better user interface in icon4py. Currently, due to a gt4py restriction, icon4py uses the same dimension `K` for fields on full and half levels. There is a half-level dimension `KHalfDim`, which is in some places used to allocate fields with the correct vertical size, but the dimension is dropped after allocation. Some exploratory work was done to fully use the half-level dimension in `Diffusion`.
The work could not be finished due to restrictions in gt4py. The idea is a simple implementation on the gt4py side that has no performance penalties and does not require a full "staggered" dimension implementation, while still allowing consistent use of the vertical dimension in icon4py.
TODO (magdalena): check with Hannes/Enrique, to decide whether we should work on this.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
Half a cycle, depending on what we do.
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
### backend settings
These steps should be done for all model components (SolveNonHydro, Diffusion, Saturation Adjustment):
- Pass the backend as a parameter to the constructor of the model components (diffusion, dycore).
- Cache the stencils that are run in the component with `.with_backend(backend)` (see the sketch below).
- Use the backend for the allocation of local fields.
- Use the "cached" `.with_backend` stencils everywhere at `run` time.
- Tests: use the `backend` fixture in the datatest tests.
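A minimal sketch of the intended pattern; the stencil (`apply_diffusion`), the grid/config attributes, and the exact gt4py allocation call are assumptions for illustration:
```
# Sketch only: `apply_diffusion`, `grid`, and the field shapes are hypothetical,
# and the exact gt4py allocation call may differ.
import gt4py.next as gtx

CellDim = gtx.Dimension("Cell")
KDim = gtx.Dimension("K", kind=gtx.DimensionKind.VERTICAL)


class Diffusion:
    def __init__(self, grid, config, backend):
        self._grid = grid
        self._backend = backend
        # compile and cache the stencils once with the selected backend
        self._apply_diffusion = apply_diffusion.with_backend(backend)
        # allocate local (temporary) fields through the backend's allocator
        self._tmp = gtx.zeros(
            {CellDim: grid.num_cells, KDim: grid.num_levels},
            allocator=backend,
        )

    def run(self, prognostic_state, dtime):
        # run time uses only the cached, backend-bound stencils
        self._apply_diffusion(
            prognostic_state.w,
            self._tmp,
            dtime,
            offset_provider=self._grid.offset_providers,
        )
```
With this pattern the backend becomes an explicit per-instance choice, so CI can instantiate the same component with different backends for performance measurements.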
Additional steps:
- Replace the `xp` import from `settings.py` with a simple conditional import that uses cupy if it is installed (see the sketch after this list). If CUDA is available it will be used, and when converting to gt4py fields `gtx.as_field` takes care of possible device/host copies.
- IconGrid: with the conditional import, the `on_gpu` flag for the grid can be removed and the conditional `xp` import used instead. As an exception, the `start_index` and `end_index` arrays need to stay on the CPU.
- Move `settings.py` to `tools/py2fgen` and remove the `backend` argument from the stencils (needs to be discussed with Sam!).
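A minimal sketch of the conditional `xp` import (where exactly it lives, `settings.py` or `tools/py2fgen`, is part of the open discussion above):
```
import numpy as np  # index arrays (start_index / end_index) always stay on host

try:
    import cupy as xp  # use cupy when it is installed, i.e. for GPU runs
except ImportError:
    xp = np            # CPU fallback: xp is plain numpy
```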
### fix various issues and quick fixes from GPU runs
see PRs [555](https://github.com/C2SM/icon4py/pull/555), [553](https://github.com/C2SM/icon4py/pull/553)
#### strategy
- Agree on common patterns (see the sketch below):
  - device/host copies for fields: through gt4py allocation or conversion.
  - for arrays: use `xp.asarray` only where we really need raw arrays; they should be on device when running on GPU.

There are some special cases:
- `start_indices`, `end_indices` *must* be on host; explicitly use numpy.
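A minimal sketch of these patterns; `backend` and `xp` are taken from the backend-settings section above, the field/index values are made up, and the `allocator` keyword of `gtx.as_field` is assumed:
```
import numpy as np
import gt4py.next as gtx

CellDim = gtx.Dimension("Cell")
host_values = np.zeros(10, dtype=np.float64)

# fields: let gt4py do the (possible) host-to-device copy
field = gtx.as_field((CellDim,), host_values, allocator=backend)
# raw arrays: only where really needed; on device when xp is cupy
arr = xp.asarray(host_values)
# index arrays: explicitly numpy, they must stay on host
start_index = np.asarray([0, 3, 7], dtype=np.int32)
```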
Keep in mind the 0d array/scalar issue in numpy vs. cupy: numpy treats 0d values as scalars, whereas cupy treats them as 0d arrays since it does not have scalar types.
Cupy:
```
import cupy as cp

x = cp.asarray(2)  # 0d array - cp.ndarray (dtype defaults to int64)
val = x.item()     # scalar, but a standard Python int (int64-like, not int32); device-to-host copy
```
Numpy vs. cupy (via `xp`):
```
import numpy as np
import cupy as cp  # only available in GPU installations; the last two lines show the cupy case

xp = np  # or cp, depending on the conditional import

ar = xp.arange(10, dtype=xp.int32)
v = ar[2].item()              # plain Python int (for cupy: device-to-host copy)
v = ar[2]                     # numpy: np.int32, behaves as a scalar; cupy: 0d array (cupy has no scalars)
v = ar[2][()]                 # np.int32 in numpy; still a 0d array in cupy
v32 = xp.int32(ar[2].item())  # copy the value to a Python int and cast to int32
v_np = cp.asnumpy(ar[2])      # np.ndarray: explicit device-to-host copy
v2 = cp.asnumpy(ar)[2][()]    # np.int32: copy the whole array to host, then index
```
### initialization of python granules
Initialize the component/granule in the constructor and remove the `init` function.
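A minimal before/after sketch (constructor arguments are hypothetical):
```
# before: two-step setup, easy to end up with a granule that is not ready to run
diffusion = Diffusion()
diffusion.init(grid, config, params)
diffusion.run(prognostic_state, dtime)

# after: the constructor does the full initialization, no separate init() call
diffusion = Diffusion(grid, config, params)
diffusion.run(prognostic_state, dtime)
```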
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
- [x] Subtask A
- [x] Subtask X
- [ ] Task 2
- [x] Subtask H
- [ ] Subtask J
- [ ] Discovered Task 3
- [ ] Subtask L
- [ ] Subtask S
- [ ] Task 4