# Profiling Results: ICON4Py Diffusion Optimization
This document outlines the performance testing and evaluation of several optimizations applied to the ICON4Py Diffusion module. These optimizations aim to reduce the runtime of any ICON4Py granule that is called from Fortran through Py2F.
## Overview
The optimizations discussed here are part of the `optimise-overhead` branch: [GitHub Pull Request #616](https://github.com/C2SM/icon4py/pull/616) and [#624](https://github.com/C2SM/icon4py/pull/624). Key features of this branch include:
- Usage of `FrozenProgram` in diffusion.
- Caching of pointers passed from Fortran to minimize redundant unpacking into NumPy views.
Further optimizations to reduce the overhead of calling gt4py programs can be found in [GitHub Pull Request #1536](https://github.com/GridTools/gt4py/pull/1536). Key features of this pull request include:
- Elimination of `isinstance` checks in `gtfn.py`.
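
To illustrate why removing per-call `isinstance` checks can reduce overhead, the sketch below contrasts a converter that inspects every argument on every call with one that inspects the argument types once and then reuses the result. It does not reproduce the gt4py functions `convert_args` or `extract_connectivity_args`; all names in it (`convert_checked`, `make_converter`, `_buffer_info`, `_passthrough`) are hypothetical.

```python
import numpy as np


def _buffer_info(arr):
    # Stand-in for whatever a backend needs from an array argument.
    return (arr.ctypes.data, arr.shape)


def _passthrough(value):
    # Scalars are forwarded unchanged.
    return value


def convert_checked(args):
    """Inspect the type of every argument on every call."""
    out = []
    for arg in args:
        if isinstance(arg, np.ndarray):
            out.append(_buffer_info(arg))
        else:
            out.append(_passthrough(arg))
    return tuple(out)


def make_converter(example_args):
    """Inspect the argument types once and return a check-free converter."""
    handlers = tuple(
        _buffer_info if isinstance(arg, np.ndarray) else _passthrough
        for arg in example_args
    )

    def convert(args):
        # No isinstance calls here: the handler for each position is known.
        return tuple(handler(arg) for handler, arg in zip(handlers, args))

    return convert


# Usage: build the converter on the first call, reuse it afterwards.
field = np.zeros((4, 3))
args = (field, 2.5, 7)
convert = make_converter(args)
print(convert(args) == convert_checked(args))
```

This approach assumes that the type of the argument at each position does not change between calls, which is why it fits a program that is invoked repeatedly with the same fields.
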
---
## Profiling Results
### ICON-DSL OpenACC (GPU)
| Configuration | Avg. Runtime / Timestep (s) | Notes |
|--------------------------------|-------------|--------------------------------|
| **Reference Run** | 0.00077 | Faster than the prior reference (0.00178 s/timestep). |
---
### ICON4Py Called from Fortran on GPU
| Optimization Stage | Avg. Runtime / Timestep (s) | Speedup vs Baseline |
|-------------------------------------------------------|-------------|---------------------|
| **Baseline** | 0.31645 | - |
| **FrozenProgram** | 0.01342 | ~23.6x |
| **FrozenProgram + Pointer Caching** | 0.01259 | ~25.1x |
| **FrozenProgram + Pointer Caching + No `isinstance` checks in `extract_connectivity_args`** | 0.00349 | ~90.7x |
| **FrozenProgram + Pointer Caching + No `isinstance` checks in `convert_args` and `extract_connectivity_args`** | 0.00231 | ~137x |
- **Note:** The removal of all `isinstance` checks in `gtfn.py` follows [GridTools PR #1536](https://github.com/GridTools/gt4py/pull/1536).
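
The speedup column in the table above is the baseline runtime divided by the runtime of each stage. A short sketch that reproduces the figures from the measured values (the variable names and shortened stage labels are only for illustration):

```python
# Average runtime per timestep in seconds, taken from the GPU table above.
BASELINE = 0.31645
stages = {
    "FrozenProgram": 0.01342,
    "FrozenProgram + Pointer Caching": 0.01259,
    "+ no isinstance checks in extract_connectivity_args": 0.00349,
    "+ no isinstance checks in convert_args as well": 0.00231,
}

# Speedup relative to the baseline: baseline runtime / stage runtime.
speedups = {name: BASELINE / runtime for name, runtime in stages.items()}

for name, factor in speedups.items():
    print(f"{name}: ~{factor:.1f}x")
```
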
---
### ICON-DSL CPU
| Configuration | Avg. Runtime / Timestep (s) | Notes |
|--------------------------------|-------------|--------------------------------|
| **Reference Run** | 0.12834 | Baseline CPU runtime. |
---
### ICON4Py Called from Fortran on CPU
On the CPU, Python-level optimizations beyond `FrozenProgram` have minimal impact because stencil execution dominates the runtime.
| Optimization Stage | Avg. Runtime / Timestep (s) | Notes |
|-------------------------------------------------------|-------------|--------------------------------|
| **Baseline (No FrozenProgram)** | 0.61873 | - |
| **FrozenProgram + Pointer Caching** | 0.31154 | Significant improvement (~2.0x vs baseline). |
| **FrozenProgram + Pointer Caching + No `isinstance` checks** | 0.30565 | Minimal additional gain (~1.9% over the previous stage). |
---
## Summary of GPU Performance Improvements
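These results highlight the significant impact of the optimizations on GPU performance: the runtime of ICON4Py diffusion called from Fortran drops from 0.31645 s to 0.00231 s per timestep (~137x), which leaves it about 3x slower than the ICON-DSL OpenACC reference run (0.00077 s). On the CPU the gain is about 2x (0.61873 s to 0.30565 s), with diminishing returns from the later Python-level optimizations because stencil computation dominates the runtime.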
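
The comparison between the optimized runs and the reference runs can be reproduced from the tables above. The sketch below computes, for each device, the overall speedup over the baseline and the remaining factor relative to the reference run; the dictionary layout and the `summarize` helper are only for illustration.

```python
# Average runtime per timestep in seconds, taken from the tables above.
runs = {
    "GPU": {"reference": 0.00077, "baseline": 0.31645, "optimized": 0.00231},
    "CPU": {"reference": 0.12834, "baseline": 0.61873, "optimized": 0.30565},
}


def summarize(run):
    """Return (speedup over baseline, slowdown relative to the reference run)."""
    speedup = run["baseline"] / run["optimized"]
    gap_to_reference = run["optimized"] / run["reference"]
    return speedup, gap_to_reference


for device, run in runs.items():
    speedup, gap = summarize(run)
    print(f"{device}: ~{speedup:.2f}x faster than baseline, "
          f"~{gap:.2f}x slower than the reference run")
```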