# [BlueLine] Performance debugging

- Shaped by: Christoph
- Appetite (FTEs, weeks): 1 cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

The combined stencil speedups do not show up in the blueline as a faster dycore runtime. We would therefore like to gather more data, such as the runtime of each combined stencil, and display that information in bencher. To further understand the performance of the DaCe backend, we can look at timelines and check the start-up overhead and other overheads. All of these tasks can be done for an at-scale experiment without halo exchanges (`icon-ch1-medium`) and with halo exchanges (`icon-ch2`).

## Appetite

<!-- Explain how much time we want to spend and how that constrains the solution -->

## Solution

* Add timers to the Fortran dycore for each combined stencil.
* Make sure this information is displayed in bencher, at least for `icon-ch1-medium` and `icon-ch2`.
* Produce timelines for DaCe runs, first without halo exchanges, then with. Check for start-up overheads, kernel launch overheads, unexpected gaps, etc.
* Compare runtimes between icon-nwp built with spack for double precision and icon-exclaim OpenACC built manually; see whether the performance discrepancy originates there.
* In general, move from experiments without halo exchanges to experiments with halo exchanges.
* Speed up the slow CFL reduction in velocity advection by providing CuPy with a contiguous array.

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren't covering to fit the appetite or make the problem tractable -->

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [ ] Add combined stencil timers to the Fortran dycore
  - [x] Add timers in PR (separate timers for predictor and corrector programs)
  - [x] In the PR, Jenkins on balfrin runs `exp.mch_icon-ch1_medium` twice with different timer levels, so once timing the stencils and once timing the whole dycore
  - [x] PRs open: https://github.com/C2SM/icon-exclaim/pull/375 and https://github.com/C2SM/icon-exclaim/pull/376
  - [ ] Merge PRs
  - [ ] Refactor the experiment (the PR creates a second experiment that is a copy of the first with a different timer level; it can be refactored if the timer level can be parametrized)
  - [ ] Check in bencher that all data arrives
- [ ] Time combined stencils in the Python dycore
  - [ ] Does DaCe have NVTX timers? (see the NVTX sketch after the second progress list below)
- [ ] Create timelines, check for gaps
- [ ] Compare icon-nwp built with the MCH vs. the EXCLAIM software stack on balfrin
- [ ] Debug the CFL condition max-reduction being very slow (the reason is that non-contiguous arrays cannot be accelerated by CuPy)
  - [x] Have a single-GPU, performance-relevant (grid big enough) dycore experiment ready for balfrin/santis: we can run JW on r2b7 with `python model/driver/src/icon4py/model/driver/icon4py_driver.py /scratch/mch/cong/data/nh35_tri_jws_sb_exclaim_r2b7/ser_data/ --experiment_type=jabw --grid_root=2 --grid_level=7 --enable_output --enable_profiling --icon4py_driver_backend=gtfn_gpu` on a compute node (probably launched with `srun`). Serialized data is needed to run this; it can be found on balfrin at `/scratch/mch/cmueller/remove_liskov/icon-exclaim/build_serialize/experiments/exclaim_nh35_tri_jws_sb/ser_data` (recommended by Chia Rui).
  - [ ] Measure the current situation (slow CFL reduction)
  - [ ] Comment out the CFL reduction completely, measure the upper bound
  - [ ] Implement option 1 (what is used in FVM, might cause a copy) and measure; if fast, done. The code (given by Till) follows, with a micro-benchmark sketch after it:

```
import sys
from typing import Any, Callable, Optional

import numpy as np  # with the GPU backend, `np` is CuPy instead

# `Field` and `device` are provided by the surrounding FVM module.


def generate_wrapper(np_func_name: str) -> Callable[..., Any]:
    np_func = getattr(np, np_func_name)

    def _wrapper(field: Field, axis: Optional[tuple[int, ...]] = None):
        assert isinstance(field, Field)
        data = field["physical"]
        axis = axis or tuple(range(0, data.ndim))
        if device == "gpu" and len(axis) == data.ndim:
            # Slicing produces a non-contiguous array, making full reductions
            # perform poorly on GPU. As a workaround, we create a 1-d copy of
            # the data which is contiguous in memory, so that CuPy will launch
            # the fast CUB reduction kernels instead.
            data = data.ravel(order="K")
            axis = 0
        result = np_func(data, axis=axis)
        if result.ndim == 0:
            # CuPy returns a 0-dim array; use item() to get a CPU scalar
            return result.item()
        return result

    return _wrapper


for func in ["min", "max", "sum", "average"]:
    setattr(sys.modules[__name__], func, generate_wrapper(func))
```
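To make the contiguity effect concrete, here is a minimal micro-benchmark sketch (assuming CuPy and `cupyx.profiler.benchmark` are available; the field shape is illustrative) comparing a full reduction over a non-contiguous view with the same reduction over a contiguous `ravel(order="K")` copy:

```
import cupy as cp
from cupyx.profiler import benchmark

# A sliced field behaves like this: a non-contiguous view into a larger buffer
field = cp.random.random((10000, 80, 2))
view = field[:, :, 0]

# Full reduction over the strided view: CuPy falls back to a generic kernel
print(benchmark(lambda: view.max(), n_repeat=100))
# Contiguous 1-d copy first: CuPy can dispatch the fast CUB reduction kernel
print(benchmark(lambda: view.ravel(order="K").max(), n_repeat=100))
```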
- [ ] If that is still too slow: at initialization, get the raw pointer of the array and set all its bytes, including padding, to zero, and record the length in bytes. RELY ON doubles being 8-byte aligned. For the max operation, take the raw pointer and the length in bytes (including padding), interpret the bytes as doubles, and take the max, relying on the padding being zero because we initialized it. Here is some untested code showing how that could look.

Initialize the array (including padding) with zero. !!!CODE incorrect: numpy's `nbytes` does not include padding bytes; this needs to be fixed (see the sketch after this block)!!!

```
import ctypes

import numpy as np


def secure_zeros(shape, dtype=np.float64, xp=np):
    """
    Create an array with all underlying bytes (including padding) set to zero.

    Parameters:
        shape (tuple): Shape of the array.
        dtype (np.dtype or cp.dtype): Data type of the array.
        xp (module): Either the numpy or the cupy module.

    Returns:
        numpy.ndarray or cupy.ndarray: Zero-initialized array with secure memory clearing.
    """
    # Allocate uninitialized memory
    arr = xp.empty(shape, dtype=dtype)
    if xp.__name__ == "numpy":
        # Get raw pointer and byte size (NOTE: nbytes counts element bytes only!)
        ptr = arr.ctypes.data
        n_bytes = arr.nbytes
        # Zero the memory with ctypes
        ctypes.memset(ptr, 0, n_bytes)
    elif xp.__name__ == "cupy":
        # CuPy uses its own memory manager and GPU pointers
        ptr = arr.data.ptr
        n_bytes = arr.nbytes
        # Use CUDA's memset to zero the GPU memory
        xp.cuda.runtime.memset(ptr, 0, n_bytes)
    else:
        raise ValueError(f"Unsupported array module: {xp.__name__}")
    return arr
```
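One possible way to fix the padding problem flagged above, as a minimal sketch: allocate the padded buffer explicitly with `zeros` (which zero-fills the padding region as well) and keep it alongside the logical view, so that the full byte length is known. `zeros_with_known_padding` and `padded_shape` are illustrative names, not existing API:

```
import numpy as np


def zeros_with_known_padding(shape, padded_shape, dtype=np.float64, xp=np):
    # Allocate the padded buffer explicitly; xp.zeros() zero-fills the
    # padding region as well, so no raw memset is needed.
    buf = xp.zeros(padded_shape, dtype=dtype)
    # The logical array is a view into the padded buffer.
    view = buf[tuple(slice(0, s) for s in shape)]
    # buf.nbytes covers the padding, unlike view.nbytes.
    return view, buf
```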
""" if nbytes % 8 != 0: raise ValueError("Byte size must be a multiple of 8 for float64 data") n_elems = nbytes // 8 if xp.__name__ == 'numpy': import ctypes double_array_type = ctypes.c_double * n_elems c_array = double_array_type.from_address(ptr) arr = xp.ctypeslib.as_array(c_array) return arr.max() elif xp.__name__ == 'cupy': memptr = xp.cuda.MemoryPointer(xp.cuda.UnownedMemory(ptr, nbytes, None), 0) arr = xp.ndarray((n_elems,), dtype=xp.float64, memptr=memptr) return arr.max() else: raise ValueError(f"Unsupported array module: {xp.__name__}") ``` ## Progress (Edoardo, Hannes, ...) - [x] Performance analysis of `ch1_medium` - [x] with DaCe backend: gpu2py_verify the experiment (Edoardo) - [ ] run OpenACC with https://github.com/C2SM/icon-exclaim/pull/375 and verify that the numbers are consistent with the numbers used in the dace benchmarking repository (Hannes, ask Ioannis to verify) - [x] apply the reduction fix (Hannes) - [x] double-checking how much faster the ravel reduction is: ravel reduction has same performance as no reduction within error bars - [x] compare to no reduction: no reduction is about 4% faster on the `integrate_nh` `t_min` timer - [x] merge the fix https://github.com/C2SM/icon4py/pull/872 - [ ] run granule with gtfn and dace and compare vs OpenACC (Edoardo, Hannes) - [ ] dace - [ ] total time - [x] per stencil time - [x] gtfn - [x] total time - [x] per stencil time - [x] run on 4 gpus - MPI performance - [x] run mch-ch1_medium on 4 GPUs: gtfn blueline 2.8x slower than openacc - [x] remove halo-exchanges in multi-node setup to investigate if performance is recovered: performance is mostly recovered (1.3x slower than openacc) - [x] replace halo-exchanges by just syncs to check if it's just Python overhead (that is otherwise fully compensated): overhead is relatively small (~4% on ch1_medium) - [ ] Investigate Python overhead in GT4Py (higher than in the past, but not critical due to async execution) - [ ] Double-check bindings overhead - [ ] Introduce/check ICON timer around the same block of code that is in the granule - [ ] Investigate GHex `MPICH_GPU_IPC_ENABLED=1` problem, see https://hackmd.io/jB38vtGdSJWOheYpbhq7vA - [ ] Individual program timers with DaCe backend - [ ] vertically implicit solvers perform poorly ### Status summary (last updated 2025-09-12 14:45): - reduction performance is irrelevant after https://github.com/C2SM/icon4py/pull/872 - gtfn_gpu 20% slower than openacc on `integrate_nh` `t_min` - gtfn_gpu performance overhead quite large (at least 20%, but should be mostly compensated by async execution) ### Hannes' log All experiments are run with the following options: ``` uenv start icon/25.2:v3 --view default source externals/icon4py/.venv/bin/activate export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1 export GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT export PYTHONOPTIMIZE=2 # except when running py2fgen! 
## Progress (Edoardo, Hannes, ...)

- [x] Performance analysis of `ch1_medium`
  - [x] with DaCe backend: gpu2py_verify the experiment (Edoardo)
  - [ ] run OpenACC with https://github.com/C2SM/icon-exclaim/pull/375 and verify that the numbers are consistent with the numbers used in the dace benchmarking repository (Hannes; ask Ioannis to verify)
- [x] apply the reduction fix (Hannes)
  - [x] double-check how much faster the ravel reduction is: the ravel reduction has the same performance as no reduction, within error bars
  - [x] compare to no reduction: no reduction is about 4% faster on the `integrate_nh` `t_min` timer
  - [x] merge the fix https://github.com/C2SM/icon4py/pull/872
- [ ] run the granule with gtfn and dace and compare vs. OpenACC (Edoardo, Hannes)
  - [ ] dace
    - [ ] total time
    - [x] per stencil time
  - [x] gtfn
    - [x] total time
    - [x] per stencil time
- [x] run on 4 GPUs - MPI performance
  - [x] run mch-ch1_medium on 4 GPUs: the gtfn blueline is 2.8x slower than OpenACC
  - [x] remove halo exchanges in the multi-node setup to investigate whether the performance is recovered: it is mostly recovered (1.3x slower than OpenACC)
  - [x] replace halo exchanges by plain syncs to check whether it is just Python overhead (which is otherwise fully compensated): the overhead is relatively small (~4% on ch1_medium)
- [ ] Investigate the Python overhead in GT4Py (higher than in the past, but not critical thanks to async execution)
  - [ ] Double-check the bindings overhead
- [ ] Introduce/check an ICON timer around the same block of code that is in the granule
- [ ] Investigate the GHex `MPICH_GPU_IPC_ENABLED=1` problem, see https://hackmd.io/jB38vtGdSJWOheYpbhq7vA
- [ ] Individual program timers with the DaCe backend (see the NVTX sketch after this list)
- [ ] vertically implicit solvers perform poorly
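On the NVTX question from the first progress list and the individual program timers item above: independent of whether DaCe ships its own NVTX instrumentation, CuPy exposes NVTX range markers, so individual stencil programs could be bracketed manually and inspected in an Nsight Systems (`nsys`) timeline. A minimal sketch; `run_program` is a placeholder for a stencil-program call, not existing icon4py API:

```
from cupy.cuda import nvtx


def timed_program(name, run_program, *args, **kwargs):
    # The named range shows up in the nsys timeline, making per-program
    # gaps and kernel-launch overheads visible.
    nvtx.RangePush(name)
    try:
        return run_program(*args, **kwargs)
    finally:
        nvtx.RangePop()
```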
### Status summary (last updated 2025-09-12 14:45)

- reduction performance is irrelevant after https://github.com/C2SM/icon4py/pull/872
- gtfn_gpu is 20% slower than OpenACC on the `integrate_nh` `t_min` timer
- the gtfn_gpu performance overhead is quite large (at least 20%, but should be mostly compensated by async execution)

### Hannes' log

All experiments are run with the following options:

```
uenv start icon/25.2:v3 --view default
source externals/icon4py/.venv/bin/activate
export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1
export GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT
export PYTHONOPTIMIZE=2  # except when running py2fgen!
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/user-environment/linux-sles15-zen3/gcc-12.3.0/gcc-13.2.0-4or6n7qyzqwxr3lcsis4e6sqgkc4obtv/lib64"  # for serialbox, if needed
```

- `/scratch/mch/vogtha/icon_granule_benchmark_openacc/icon-exclaim` contains an ICON build with the https://github.com/C2SM/icon-exclaim/pull/375 timers

#### OpenACC performance for ch1-medium

```
-------------------------------------------------------------------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
name                                                                                          # calls t_min        t_avg        t_max        total min (s) total max (s) total avg (s)
-------------------------------------------------------------------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
total 1 33.9355s 33.9355s 33.9355s 33.935 33.935 33.935
L integrate_nh 360 0.08038s 0.08609s 0.29092s 30.994 30.994 30.994
L nh_solve 1800 0.00634s 0.00662s 0.00861s 11.925 11.925 11.925
L nh_solve.veltend 2160 0.00093s 0.00097s 0.00167s 2.089 2.089 2.089
L compute_derived_horizontal_winds_and_ke_and_contravariant_correction 2 0.00010s 0.00030s 0.00051s 0.001 0.001 0.001
L compute_derived_horizontal_winds_and_ke_and_contravariant_correction_skip 718 0.00000s 0.00015s 0.00030s 0.106 0.106 0.106
L compute_contravariant_correction_and_advection_in_vertical_momentum_equation 1 0.00063s 0.00063s 0.00063s 0.001 0.001 0.001
L compute_contravariant_correction_and_advection_in_vertical_momentum_equation_ski 359 0.00034s 0.00035s 0.00036s 0.126 0.126 0.126
L compute_advection_in_vertical_momentum_equation 3600 0.00007s 0.00030s 0.00054s 1.066 1.066 1.066
L compute_advection_in_horizontal_momentum_equation 4320 0.00006s 0.00014s 0.00024s 0.624 0.624 0.624
L nh_solve.cellcomp 3600 0.00023s 0.00040s 0.00059s 1.440 1.440 1.440
L compute_perturbed_quantities_and_interpolation 1800 0.00056s 0.00057s 0.00058s 1.021 1.021 1.021
L interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleratio 1800 0.00022s 0.00023s 0.00025s 0.409 0.409 0.409
L nh_solve.edgecomp 3600 0.00034s 0.00054s 0.00072s 1.948 1.948 1.948
L compute_horizontal_velocity_quantities_and_fluxes 1800 0.00059s 0.00060s 0.00061s 1.074 1.074 1.074
L compute_averaged_vn_and_fluxes_and_prepare_tracer_advection 1440 0.00041s 0.00041s 0.00067s 0.592 0.592 0.592
L compute_averaged_vn_and_fluxes_and_prepare_tracer_advection_first 360 0.00033s 0.00033s 0.00034s 0.120 0.120 0.120
L nh_solve.vnupd 3600 0.00038s 0.00063s 0.00090s 2.263 2.263 2.263
L compute_theta_rho_face_values_and_pressure_gradient_and_update_vn 1800 0.00086s 0.00087s 0.00089s 1.572 1.572 1.572
L apply_divergence_damping_and_update_vn 1800 0.00037s 0.00038s 0.00039s 0.682 0.682 0.682
L nh_solve.vimpl 3600 0.00102s 0.00106s 0.00115s 3.830 3.830 3.830
L compute_dwdz_and_boundary_update_rho_theta_w 1800 0.00004s 0.00004s 0.00006s 0.072 0.072 0.072
L update_mass_flux_weighted 1440 0.00002s 0.00002s 0.00003s 0.030 0.030 0.030
L update_mass_flux_weighted_first 360 0.00002s 0.00003s 0.00003s 0.009 0.009 0.009
L nh_solve.exch 7200 0.00000s 0.00000s 0.00001s 0.003 0.003 0.003
L boundary_halo_cleanup 1800 0.00002s 0.00002s 0.00004s 0.041 0.041 0.041
L nh_hdiff_initial_run 1 0.00189s 0.00189s 0.00189s 0.002 0.002 0.002
L nh_hdiff 360 0.00171s 0.00173s 0.00187s 0.624 0.624 0.624
L transport 360 0.01163s 0.01196s 0.01905s 4.307 4.307 4.307
L adv_horiz 360 0.00751s 0.00767s 0.01478s 2.760 2.760 2.760
L adv_hflx 360 0.00686s 0.00702s 0.01412s 2.527 2.527 2.527
L back_traj 1080 0.00001s 0.00025s 0.00067s 0.267 0.267 0.267
L adv_vert 360 0.00394s 0.00412s 0.00476s 1.484 1.484 1.484
L adv_vflx 360 0.00335s 0.00353s 0.00418s 1.270 1.270 1.270
L action 360 0.00001s 0.00001s 0.00014s 0.004 0.004 0.004
global_sum 369 0.00000s 0.00000s 0.00000s 0.000 0.000 0.000
wrt_output 11 0.28796s 0.28886s 0.29114s 3.177 3.177 3.177
L wait_for_async_io 11 0.00001s 0.00002s 0.00003s 0.000 0.000 0.000
vertically_implicit_solver_at_predictor_step 2880 0.00008s 0.00054s 0.00101s 1.547 1.547 1.547
vertically_implicit_solver_at_predictor_step_first 720 0.00008s 0.00055s 0.00104s 0.398 0.398 0.398
vertically_implicit_solver_at_corrector_step 2160 0.00000s 0.00052s 0.00105s 1.119 1.119 1.119
vertically_implicit_solver_at_corrector_step_first 720 0.00000s 0.00053s 0.00108s 0.384 0.384 0.384
vertically_implicit_solver_at_corrector_step_last 720 0.00000s 0.00055s 0.00111s 0.397 0.397 0.397
rbf_vector_interpolation_of_u_v_vert_before_nabla2 361 0.00009s 0.00009s 0.00012s 0.033 0.033 0.033
calculate_nabla2_and_smag_coefficients_for_vn 361 0.00028s 0.00029s 0.00032s 0.104 0.104 0.104
calculate_diagnostic_quantities_for_turbulence 361 0.00016s 0.00017s 0.00017s 0.060 0.060 0.060
rbf_vector_interpolation_of_u_v_vert_before_nabla4 361 0.00009s 0.00009s 0.00010s 0.032 0.032 0.032
apply_diffusion_to_vn 361 0.00025s 0.00025s 0.00027s 0.092 0.092 0.092
apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence 361 0.00025s 0.00026s 0.00026s 0.092 0.092 0.092
calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools 361 0.00002s 0.00002s 0.00002s 0.006 0.006 0.006
apply_diffusion_to_theta_and_exner 361 0.00036s 0.00036s 0.00037s 0.132 0.132 0.132
physics 361 0.03169s 0.03732s 0.23668s 13.471 13.471 13.471
L nwp_radiation 7 0.18138s 0.19155s 0.19879s 1.341 1.341 1.341
L preradiaton 7 0.00572s 0.00614s 0.00690s 0.043 0.043 0.043
L phys_acc_sync 361 0.00082s 0.00087s 0.00110s 0.314 0.314 0.314
L ordglb_sum 361 0.00000s 0.00000s 0.00002s 0.002 0.002 0.002
L satad 361 0.00058s 0.00202s 0.00209s 0.731 0.731 0.731
L phys_u_v 361 0.00020s 0.00021s 0.00024s 0.075 0.075 0.075
L nwp_turbulence 721 0.00800s 0.00843s 0.01506s 6.077 6.077 6.077
L nwp_turbtrans 361 0.00800s 0.00864s 0.01505s 3.118 3.118 3.118
L nwp_turbdiff 360 0.00507s 0.00511s 0.00573s 1.841 1.841 1.841
L nwp_surface 360 0.00787s 0.00846s 0.02033s 3.047 3.047 3.047
L nwp_microphysics 360 0.00326s 0.00346s 0.00361s 1.246 1.246 1.246
L rediag_prog_vars 361 0.00012s 0.00035s 0.00036s 0.126 0.126 0.126
L sso 31 0.00200s 0.00202s 0.00212s 0.063 0.063 0.063
L cloud_cover 61 0.00229s 0.00233s 0.00388s 0.142 0.142 0.142
L radheat 361 0.00062s 0.00063s 0.00069s 0.228 0.228 0.228
nh_diagnostics 1126 0.00000s 0.00023s 0.18387s 0.256 0.256 0.256
diagnose_pres_temp 72 0.00000s 0.00107s 0.07256s 0.077 0.077 0.077
model_init 3 4.4389s 7.4197s 13.1485s 22.259 22.259 22.259
L compute_domain_decomp 1 0.76680s 0.76680s 0.76680s 0.767 0.767 0.767
L compute_intp_coeffs 1 3.5856s 3.5856s 3.5856s 3.586 3.586 3.586
L init_ext_data 1 0.18742s 0.18742s 0.18742s 0.187 0.187 0.187
L init_icon 1 3.8979s 3.8979s 3.8979s 3.898 3.898 3.898
L init_latbc 1 5.6957s 5.6957s 5.6957s 5.696 5.696 5.696
L init_nwp_phy 1 0.75640s 0.75640s 0.75640s 0.756 0.756 0.756
upper_atmosphere 4 0.00001s 0.00013s 0.00048s 0.001 0.001 0.001
L upatmo_construction 2 0.00002s 0.00025s 0.00048s 0.000 0.000 0.000
L upatmo_destruction 2 0.00000s 0.00001s 0.00001s 0.000 0.000 0.000
write_restart 2 0.00000s 0.00003s 0.00005s 0.000 0.000 0.000
L write_restart_io 1 0.00000s 0.00000s 0.00000s 0.000 0.000 0.000
L write_restart_communication 1 0.00000s 0.00000s 0.00000s 0.000 0.000 0.000
optional_diagnostics_atmosphere 11 0.00005s 0.00005s 0.00006s 0.001 0.001 0.001
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

Without detailed timers:

```
total 1 33.4043s 33.4043s 33.4043s 33.404 33.404 33.404
L integrate_nh 360 0.07988s 0.08576s 0.28688s 30.873 30.873 30.873
L nh_solve 1800 0.00621s 0.00648s 0.00838s 11.664 11.664 11.664
L nh_hdiff_initial_run 1 0.00172s 0.00172s 0.00172s 0.002 0.002 0.002
L nh_hdiff 360 0.00164s 0.00166s 0.00188s 0.598 0.598 0.598
L transport 360 0.01208s 0.01251s 0.01778s 4.503 4.503 4.503
L adv_horiz 360 0.00800s 0.00828s 0.01358s 2.981 2.981 2.981
L adv_hflx 360 0.00736s 0.00764s 0.01292s 2.750 2.750 2.750
L adv_vert 360 0.00388s 0.00405s 0.00467s 1.459 1.459 1.459
L adv_vflx 360 0.00329s 0.00346s 0.00409s 1.247 1.247 1.247
```
#### Blueline performance - icon4py main `b30000f109f1e245895932b8c11bdfddb9577368`

```
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
name                            # calls t_min        t_avg        t_max        total min (s) total max (s) total avg (s)
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
total 1 10m50s 10m50s 10m50s 650.156 650.156 650.156
L integrate_nh 360 0.10765s 1.7978s 10m06s 647.205 647.205 647.205
L nh_hdiff_run 360 0.00147s 0.00154s 0.00308s 0.556 0.556 0.556
L nh_hdiff_run_initial 1 0.10614s 0.10614s 0.10614s 0.106 0.106 0.106
L transport 360 0.01168s 0.01200s 0.01983s 4.320 4.320 4.320
L adv_horiz 360 0.00753s 0.00769s 0.01536s 2.769 2.769 2.769
L adv_hflx 360 0.00689s 0.00705s 0.01463s 2.536 2.536 2.536
L back_traj 1080 0.00001s 0.00025s 0.00075s 0.268 0.268 0.268
L adv_vert 360 0.00394s 0.00413s 0.00494s 1.487 1.487 1.487
L adv_vflx 360 0.00334s 0.00354s 0.00436s 1.273 1.273 1.273
--------------------------------------------------------------------------------------------------------------------------------------
```

#### Blueline performance - icon4py `b30000f109f1e245895932b8c11bdfddb9577368` + no cfl reduction

```
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
name                            # calls t_min        t_avg        t_max        total min (s) total max (s) total avg (s)
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
total 1 43.4010s 43.4010s 43.4010s 43.401 43.401 43.401
L integrate_nh 360 0.10332s 0.11231s 1.1390s 40.431 40.431 40.431
L nh_hdiff_run 360 0.00148s 0.00155s 0.00205s 0.559 0.559 0.559
L nh_hdiff_run_initial 1 0.01977s 0.01977s 0.01977s 0.020 0.020 0.020
L transport 360 0.01168s 0.01209s 0.02026s 4.353 4.353 4.353
L adv_horiz 360 0.00753s 0.00779s 0.01601s 2.805 2.805 2.805
L adv_hflx 360 0.00690s 0.00715s 0.01534s 2.573 2.573 2.573
L back_traj 1080 0.00001s 0.00025s 0.00068s 0.267 0.267 0.267
L adv_vert 360 0.00394s 0.00412s 0.00513s 1.483 1.483 1.483
L adv_vflx 360 0.00334s 0.00353s 0.00455s 1.269 1.269 1.269
```

#### Blueline performance - icon4py `b30000f109f1e245895932b8c11bdfddb9577368` + ravel reduction

```
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
name                            # calls t_min        t_avg        t_max        total min (s) total max (s) total avg (s)
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
total 1 43.8645s 43.8645s 43.8645s 43.864 43.864 43.864
L integrate_nh 360 0.10342s 0.11357s 1.9240s 40.884 40.884 40.884
L nh_hdiff_run 360 0.00146s 0.00152s 0.00185s 0.547 0.547 0.547
L nh_hdiff_run_initial 1 0.16758s 0.16758s 0.16758s 0.168 0.168 0.168
L transport 360 0.01166s 0.01198s 0.02007s 4.314 4.314 4.314
L adv_horiz 360 0.00753s 0.00770s 0.01554s 2.770 2.770 2.770
L adv_hflx 360 0.00689s 0.00705s 0.01481s 2.538 2.538 2.538
L back_traj 1080 0.00001s 0.00025s 0.00075s 0.267 0.267 0.267
L adv_vert 360 0.00394s 0.00411s 0.00441s 1.479 1.479 1.479
L adv_vflx 360 0.00334s 0.00351s 0.00382s 1.265 1.265 1.265
```
#### GT4Py timers `b7d1d874d00320c7dae2cae8a79d6d1831d7689a`

```
program                                                                                            compute     +/-         total       +/-
init_diffusion_local_fields_for_regular_timestep[run_gtfn_gpu_cached] 1.25766e-03 nan 2.70191e-01 nan
init_nabla2_factor_in_upper_damping_zone[run_gtfn_gpu_cached] 3.99828e-04 nan 1.38418e-01 nan
en_smag_fac_for_zero_nshift[run_gtfn_gpu_cached] 1.40820e-02 nan 1.46089e-02 nan
setup_fields_for_initial_step[run_gtfn_gpu_cached] 2.19345e-05 nan 2.39611e-04 nan
scale_k[run_gtfn_gpu_cached] 1.75854e-05 1.52221e-06 1.19927e-04 6.80252e-05
mo_intp_rbf_rbf_vec_interpol_vertex[run_gtfn_gpu_cached] 7.27094e-05 3.25194e-06 1.72939e-04 1.31312e-05
calculate_nabla2_and_smag_coefficients_for_vn[run_gtfn_gpu_cached] 3.29836e-04 1.78760e-05 4.81302e-04 2.39179e-05
calculate_diagnostic_quantities_for_turbulence[run_gtfn_gpu_cached] 1.68754e-04 3.34668e-05 2.82492e-04 3.80208e-05
apply_diffusion_to_vn[run_gtfn_gpu_cached] 1.76844e-04 6.92977e-06 3.12787e-04 1.33358e-05
copy_field[run_gtfn_gpu_cached] 3.84589e-05 1.94620e-06 8.76529e-05 6.72821e-06
apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence[run_gtfn_gpu_cached] 1.79593e-04 1.91568e-05 3.17537e-04 2.92677e-04
calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools[run_gtfn_gpu_cached] 2.77870e-05 2.48274e-06 1.18942e-04 1.07733e-05
apply_diffusion_to_theta_and_exner[run_gtfn_gpu_cached] 2.64476e-04 2.59229e-05 4.02147e-04 3.33173e-05
init_test_fields[run_gtfn_gpu_cached] 8.61490e-05 2.86221e-06 1.63204e-04 1.04721e-05
compute_derived_horizontal_winds_and_ke_and_contravariant_correction[run_gtfn_gpu_cached] 3.36442e-04 4.26624e-05 5.07392e-04 5.19911e-05
compute_contravariant_correction_and_advection_in_vertical_momentum_equation[run_gtfn_gpu_cached] 4.97097e-04 2.09760e-04 1.22271e-03 1.09144e-02
compute_advection_in_horizontal_momentum_equation[run_gtfn_gpu_cached] 2.95554e-04 1.25958e-05 5.60976e-04 4.83986e-03
compute_rayleigh_damping_factor[run_gtfn_gpu_cached] 1.42641e-05 1.38238e-06 9.93543e-05 8.37004e-06
compute_perturbed_quantities_and_interpolation[run_gtfn_gpu_cached] 5.82366e-04 3.11630e-05 8.03504e-04 3.79817e-05
compute_hydrostatic_correction_term[run_gtfn_gpu_cached] 2.32368e-05 1.66370e-06 1.30556e-04 9.54038e-06
compute_theta_rho_face_values_and_pressure_gradient_and_update_vn[run_gtfn_gpu_cached] 1.02551e-03 3.25966e-05 1.28198e-03 4.03655e-05
compute_horizontal_velocity_quantities_and_fluxes[run_gtfn_gpu_cached] 5.17958e-04 9.10940e-06 6.82094e-04 1.62208e-05
vertically_implicit_solver_at_predictor_step[run_gtfn_gpu_cached] 1.20273e-03 5.03044e-05 1.46847e-03 5.84206e-05
stencils_61_62[run_gtfn_gpu_cached] 3.26300e-05 1.59429e-06 1.16745e-04 7.85839e-06
compute_dwdz_for_divergence_damping[run_gtfn_gpu_cached] 1.99375e-05 1.23579e-06 1.05248e-04 7.16532e-06
compute_advection_in_vertical_momentum_equation[run_gtfn_gpu_cached] 4.78566e-04 5.07931e-05 7.90797e-04 5.99868e-03
interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleration[run_gtfn_gpu_cached] 2.26513e-04 1.87310e-06 3.70967e-04 1.08420e-05
apply_divergence_damping_and_update_vn[run_gtfn_gpu_cached] 1.98171e-04 3.10816e-06 3.57234e-04 1.32128e-05
compute_averaged_vn_and_fluxes_and_prepare_tracer_advection[run_gtfn_gpu_cached] 2.76253e-04 2.30162e-05 3.92734e-04 2.43990e-05
vertically_implicit_solver_at_corrector_step[run_gtfn_gpu_cached] 1.19461e-03 1.96620e-05 1.44936e-03 3.29604e-05
init_cell_kdim_field_with_zero_wp[run_gtfn_gpu_cached] 1.49455e-05 1.26407e-06 6.64724e-05 6.86017e-06
update_mass_flux_weighted[run_gtfn_gpu_cached] 2.40930e-05 1.72399e-06 9.66651e-05 7.94954e-06
compute_theta_and_exner[run_gtfn_gpu_cached] 6.65557e-05 1.70982e-06 1.25171e-04 6.57184e-06
compute_exner_from_rhotheta[run_gtfn_gpu_cached] 1.98780e-05 1.37872e-06 7.16677e-05 6.03315e-06
update_theta_v[run_gtfn_gpu_cached] 3.97960e-06 4.68814e-07 6.79938e-05 7.29643e-06
```

#### MCH-CH2 1h simulation

##### OpenACC

```
-------------------------------------------------------------------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
name                                                                                          # calls t_min        min rank t_avg        t_max        max rank total min (s) total min rank total max (s) total max rank total avg (s) # PEs
-------------------------------------------------------------------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
total 4 25.6346s [1] 25.6347s 25.6347s [3] 25.635 [1] 25.635 [3] 25.635 4
L integrate_nh 720 0.12242s [3] 0.14209s 0.53990s [2] 25.575 [0] 25.577 [2] 25.576 4
L nh_solve 3600 0.01082s [0] 0.01127s 0.01414s [1] 10.123 [0] 10.151 [1] 10.142 4
L nh_solve.veltend 4320 0.00148s [1] 0.00152s 0.00266s [2] 1.642 [1] 1.651 [2] 1.646 4
L compute_derived_horizontal_winds_and_ke_and_contravariant_correction 8 0.00011s [2] 0.00048s 0.00085s [0] 0.001 [3] 0.001 [0] 0.001 4
L compute_derived_horizontal_winds_and_ke_and_contravariant_correction_skip 1432 0.00000s [0] 0.00024s 0.00050s [0] 0.085 [2] 0.086 [0] 0.086 4
L compute_contravariant_correction_and_advection_in_vertical_momentum_equation 4 0.00090s [0] 0.00091s 0.00093s [3] 0.001 [0] 0.001 [3] 0.001 4
L compute_contravariant_correction_and_advection_in_vertical_momentum_equation_ski 716 0.00055s [1] 0.00056s 0.00057s [0] 0.100 [3] 0.100 [2] 0.100 4
L compute_advection_in_vertical_momentum_equation 7200 0.00010s [2] 0.00047s 0.00085s [0] 0.845 [1] 0.846 [2] 0.846 4
L compute_advection_in_horizontal_momentum_equation 8640 0.00009s [0] 0.00025s 0.00041s [2] 0.527 [1] 0.534 [2] 0.530 4
L nh_solve.cellcomp 7200 0.00037s [2] 0.00065s 0.00097s [2] 1.175 [1] 1.179 [3] 1.178 4
L compute_perturbed_quantities_and_interpolation 3600 0.00092s [1] 0.00093s 0.00096s [2] 0.839 [1] 0.842 [0] 0.841 4
L interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleratio 3600 0.00036s [2] 0.00037s 0.00038s [1] 0.331 [2] 0.334 [3] 0.332 4
L nh_solve.edgecomp 7200 0.00057s [1] 0.00090s 0.00114s [0] 1.616 [2] 1.632 [0] 1.622 4
L compute_horizontal_velocity_quantities_and_fluxes 3600 0.00098s [1] 0.00100s 0.00102s [0] 0.894 [1] 0.900 [0] 0.896 4
L compute_averaged_vn_and_fluxes_and_prepare_tracer_advection 2880 0.00069s [2] 0.00070s 0.00072s [0] 0.500 [2] 0.509 [0] 0.504 4
L compute_averaged_vn_and_fluxes_and_prepare_tracer_advection_first 720 0.00056s [1] 0.00057s 0.00058s [0] 0.102 [2] 0.103 [0] 0.102 4
L nh_solve.vnupd 7200 0.00064s [1] 0.00104s 0.00146s [3] 1.854 [1] 1.872 [2] 1.866 4
L compute_theta_rho_face_values_and_pressure_gradient_and_update_vn 3600 0.00140s [1] 0.00142s 0.00146s [3] 1.267 [1] 1.282 [3] 1.275 4
L apply_divergence_damping_and_update_vn 3600 0.00064s [1] 0.00065s 0.00067s [0] 0.582 [1] 0.589 [2] 0.586 4
L nh_solve.vimpl 7200 0.00167s [1] 0.00174s 0.00188s [3] 3.126 [1] 3.149 [3] 3.133 4
L compute_dwdz_and_boundary_update_rho_theta_w 3600 0.00003s [0] 0.00003s 0.00004s [0] 0.028 [1] 0.029 [2] 0.029 4
L update_mass_flux_weighted 2880 0.00002s [0] 0.00002s 0.00006s [2] 0.012 [1] 0.012 [0] 0.012 4
L update_mass_flux_weighted_first 720 0.00002s [1] 0.00002s 0.00003s [0] 0.004 [1] 0.004 [0] 0.004 4
L nh_solve.exch 14400 0.00010s [0] 0.00014s 0.00067s [1] 0.486 [0] 0.555 [1] 0.515 4
L boundary_halo_cleanup 3600 0.00003s [0] 0.00003s 0.00006s [2] 0.025 [1] 0.025 [2] 0.025 4
L nh_hdiff_initial_run 4 0.00538s [0] 0.00542s 0.00546s [2] 0.005 [0] 0.005 [2] 0.005 4
L nh_hdiff 720 0.00283s [1] 0.00294s 0.00332s [0] 0.521 [1] 0.535 [3] 0.530 4
L transport 720 0.02018s [3] 0.02083s 0.04604s [2] 3.741 [3] 3.753 [0] 3.750 4
L adv_horiz 720 0.01355s [1] 0.01412s 0.03921s [2] 2.533 [3] 2.549 [0] 2.542 4
L adv_hflx 720 0.01250s [3] 0.01307s 0.03813s [2] 2.344 [3] 2.360 [0] 2.353 4
L back_traj 2160 0.00001s [0] 0.00041s 0.00141s [3] 0.218 [1] 0.220 [3] 0.219 4
L adv_vert 720 0.00613s [1] 0.00638s 0.00728s [2] 1.143 [1] 1.154 [3] 1.149 4
L adv_vflx 720 0.00529s [1] 0.00557s 0.00648s [3] 0.996 [1] 1.006 [3] 1.002 4
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

##### blueline

```
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
name                            # calls t_min        min rank t_avg        t_max        max rank total min (s) total min rank total max (s) total max rank total avg (s) # PEs
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
total 4 11m36s [1] 11m36s 11m36s [2] 696.579 [1] 696.579 [2] 696.579 4
L integrate_nh 720 0.25442s [0] 3.8695s 10m45s [2] 696.516 [0] 696.518 [2] 696.517 4
L nh_hdiff_run 720 0.00491s [1] 0.00701s 0.01436s [0] 1.166 [3] 1.377 [0] 1.262 4
L nh_hdiff_run_initial 4 3.1259s [2] 3.1262s 3.1265s [3] 3.126 [2] 3.127 [3] 3.126 4
L transport 720 0.02388s [1] 0.02508s 0.04912s [0] 4.481 [2] 4.537 [3] 4.514 4
L adv_horiz 720 0.01342s [2] 0.01394s 0.03981s [0] 2.492 [3] 2.540 [0] 2.508 4
L adv_hflx 720 0.01237s [2] 0.01289s 0.03861s [0] 2.302 [3] 2.351 [0] 2.319 4
L back_traj 2160 0.00001s [0] 0.00041s 0.00145s [3] 0.218 [1] 0.221 [3] 0.220 4
L adv_vert 720 0.00610s [2] 0.00628s 0.00708s [1] 1.123 [2] 1.139 [3] 1.131 4
L adv_vflx 720 0.00528s [2] 0.00547s 0.00627s [1] 0.977 [2] 0.991 [3] 0.984 4
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
### Edoardo's log

#### Blueline (DaCe) icon4py:main (`d795ebe` with ravel reduction) + icon-exclaim:add_timers

##### ch1-medium (1 GPU)

Note: during the lowering of stencil programs with the SDFG timers enabled, a corner case was hit that resulted in a C++ compile error: the stencil `update_theta_v` was lowered to an empty SDFG. This is not verified yet, but I suspect it was the result of an empty compute domain. We could catch "empty" compute domains at an early stage of the gt4py workflow and return an empty program-call decorator; a sketch follows.
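A minimal sketch of what such an early guard could look like (all names are illustrative, not existing GT4Py API): detect the empty compute domain before lowering and return a no-op program call instead of generating an empty SDFG:

```
def compile_or_noop(lower_and_compile, program, domain):
    # Hypothetical guard in the compilation workflow: if any extent of the
    # compute domain is zero, there is nothing to compute, so skip code
    # generation entirely and hand back a callable that does nothing.
    if any(extent == 0 for extent in domain):
        def noop_program_call(*args, **kwargs):
            return None
        return noop_program_call
    return lower_and_compile(program, domain)
```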
```
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
name                            # calls t_min        t_avg        t_max        total min (s) total max (s) total avg (s)
------------------------------- ------- ------------ ------------ ------------ ------------- ------------- -------------
total 1 01m03s 01m03s 01m03s 63.278 63.278 63.278
L integrate_nh 360 0.10878s 0.16755s 19.0027s 60.318 60.318 60.318
L nh_solve 1800 0.01200s 0.02292s 18.5158s 41.253 41.253 41.253
L nh_hdiff_initial_run 1 0.16169s 0.16169s 0.16169s 0.162 0.162 0.162
L nh_hdiff 360 0.00217s 0.00227s 0.00273s 0.816 0.816 0.816
L transport 360 0.01166s 0.01214s 0.02031s 4.371 4.371 4.371
L adv_horiz 360 0.00751s 0.00784s 0.01580s 2.824 2.824 2.824
L adv_hflx 360 0.00687s 0.00720s 0.01506s 2.592 2.592 2.592
L back_traj 1080 0.00001s 0.00025s 0.00074s 0.267 0.267 0.267
L adv_vert 360 0.00393s 0.00412s 0.00521s 1.483 1.483 1.483
L adv_vflx 360 0.00334s 0.00353s 0.00463s 1.269 1.269 1.269
```

Note that with the DaCe backend we cannot collect compute timers and total timers at the same time: the SDFG instrumentation that collects the compute time (using C++ chrono calls) reports the timers through a file-based interface, which introduces a large overhead in the total time measured on the Python call.
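For reference, a minimal sketch of how SDFG-level timer instrumentation is typically enabled in DaCe (a standalone example, not the icon4py integration; the resulting reports are written as JSON files under `.dacecache/`, which is the file-based interface mentioned above):

```
import dace
import numpy as np


@dace.program
def axpy(a: dace.float64, x: dace.float64[100], y: dace.float64[100]):
    y[:] = a * x + y


sdfg = axpy.to_sdfg()
# Wall-clock (std::chrono) timers around the SDFG; results are written
# to report files under .dacecache/ rather than returned in-process.
sdfg.instrument = dace.InstrumentationType.Timer
sdfg(a=2.0, x=np.ones(100), y=np.ones(100))
```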
Stencil timers (python):

```
program                                                                                                total       +/-
init_diffusion_local_fields_for_regular_timestep[run_dace_gpu_cached_opt] 3.18851e+00 nan
init_nabla2_factor_in_upper_damping_zone[run_dace_gpu_cached_opt] 1.19200e-01 nan
en_smag_fac_for_zero_nshift[run_dace_gpu_cached_opt] 8.84807e-01 nan
setup_fields_for_initial_step[run_dace_gpu_cached_opt] 1.03114e-02 nan
scale_k[run_dace_gpu_cached_opt] 1.01499e-04 6.26159e-04
mo_intp_rbf_rbf_vec_interpol_vertex[run_dace_gpu_cached_opt] 7.62519e-05 2.58749e-04
calculate_nabla2_and_smag_coefficients_for_vn[run_dace_gpu_cached_opt] 1.78379e-04 9.10504e-04
calculate_diagnostic_quantities_for_turbulence[run_dace_gpu_cached_opt] 9.49575e-05 4.73035e-05
apply_diffusion_to_vn[run_dace_gpu_cached_opt] 1.02475e-04 1.02635e-04
copy_field[run_dace_gpu_cached_opt] 6.14805e-05 4.60277e-05
apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence[run_dace_gpu_cached_opt] 1.26753e-04 5.13211e-05
calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools[run_dace_gpu_cached_opt] 6.90685e-05 2.89623e-05
apply_diffusion_to_theta_and_exner[run_dace_gpu_cached_opt] 1.01652e-04 4.52337e-05
init_test_fields[run_dace_gpu_cached_opt] 8.27297e-05 2.15429e-05
compute_derived_horizontal_winds_and_ke_and_contravariant_correction[run_dace_gpu_cached_opt] 4.99264e-03 9.20888e-02
compute_contravariant_correction_and_advection_in_vertical_momentum_equation[run_dace_gpu_cached_opt] 2.87408e-02 5.42472e-01
compute_advection_in_horizontal_momentum_equation[run_dace_gpu_cached_opt] 1.25758e-04 4.09338e-05
compute_rayleigh_damping_factor[run_dace_gpu_cached_opt] 5.23726e-05 1.34450e-05
compute_perturbed_quantities_and_interpolation[run_dace_gpu_cached_opt] 2.12533e-04 5.63240e-05
compute_hydrostatic_correction_term[run_dace_gpu_cached_opt] 7.64754e-05 1.73255e-05
compute_theta_rho_face_values_and_pressure_gradient_and_update_vn[run_dace_gpu_cached_opt] 2.24263e-04 3.83123e-04
compute_horizontal_velocity_quantities_and_fluxes[run_dace_gpu_cached_opt] 1.35636e-04 3.97159e-05
vertically_implicit_solver_at_predictor_step[run_dace_gpu_cached_opt] 1.38926e-03 4.73321e-02
stencils_61_62[run_dace_gpu_cached_opt] 7.65293e-05 4.39056e-05
compute_dwdz_for_divergence_damping[run_dace_gpu_cached_opt] 1.97924e-04 6.09962e-03
compute_advection_in_vertical_momentum_equation[run_dace_gpu_cached_opt] 3.25645e-03 1.31405e-01
interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleration[run_dace_gpu_cached_opt] 9.66491e-05 2.15038e-05
apply_divergence_damping_and_update_vn[run_dace_gpu_cached_opt] 9.77245e-05 1.98966e-05
compute_averaged_vn_and_fluxes_and_prepare_tracer_advection[run_dace_gpu_cached_opt] 7.92636e-05 2.33745e-05
vertically_implicit_solver_at_corrector_step[run_dace_gpu_cached_opt] 2.57367e-04 1.09847e-04
init_cell_kdim_field_with_zero_wp[run_dace_gpu_cached_opt] 5.15143e-05 5.60557e-05
update_mass_flux_weighted[run_dace_gpu_cached_opt] 6.29583e-05 1.26561e-05
compute_theta_and_exner[run_dace_gpu_cached_opt] 5.48734e-05 1.25564e-05
compute_exner_from_rhotheta[run_dace_gpu_cached_opt] 6.99233e-05 1.15953e-05
update_theta_v[run_dace_gpu_cached_opt] 4.77750e-05 8.61083e-06
```

Stencil timers (SDFG compute):

```
program                                                                                                compute     +/-
init_diffusion_local_fields_for_regular_timestep[run_dace_gpu_cached_opt] 2.90000e-05 nan
init_nabla2_factor_in_upper_damping_zone[run_dace_gpu_cached_opt] 2.00000e-05 nan
en_smag_fac_for_zero_nshift[run_dace_gpu_cached_opt] 2.70000e-05 nan
setup_fields_for_initial_step[run_dace_gpu_cached_opt] 1.80000e-05 nan
scale_k[run_dace_gpu_cached_opt] 1.67861e-05 1.54643e-06
mo_intp_rbf_rbf_vec_interpol_vertex[run_dace_gpu_cached_opt] 7.50903e-05 5.36889e-06
calculate_nabla2_and_smag_coefficients_for_vn[run_dace_gpu_cached_opt] 5.64700e-04 4.41293e-04
calculate_diagnostic_quantities_for_turbulence[run_dace_gpu_cached_opt] 1.72353e-04 6.44717e-06
apply_diffusion_to_vn[run_dace_gpu_cached_opt] 2.05139e-04 4.90212e-06
copy_field[run_dace_gpu_cached_opt] 2.43083e-05 2.31207e-06
apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence[run_dace_gpu_cached_opt] 2.10756e-04 3.57984e-05
calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools[run_dace_gpu_cached_opt] 3.46611e-05 4.23694e-06
apply_diffusion_to_theta_and_exner[run_dace_gpu_cached_opt] 3.24900e-04 6.42211e-06
init_test_fields[run_dace_gpu_cached_opt] 8.79178e-05 3.20192e-06
compute_derived_horizontal_winds_and_ke_and_contravariant_correction[run_dace_gpu_cached_opt] 2.66075e-04 1.00612e-05
compute_contravariant_correction_and_advection_in_vertical_momentum_equation[run_dace_gpu_cached_opt] 2.28336e-04 6.92948e-06
compute_advection_in_horizontal_momentum_equation[run_dace_gpu_cached_opt] 3.27090e-04 6.65463e-06
compute_rayleigh_damping_factor[run_dace_gpu_cached_opt] 1.86556e-05 1.83093e-06
compute_perturbed_quantities_and_interpolation[run_dace_gpu_cached_opt] 6.07774e-04 7.57908e-06
compute_hydrostatic_correction_term[run_dace_gpu_cached_opt] 2.88344e-05 4.01309e-06
compute_theta_rho_face_values_and_pressure_gradient_and_update_vn[run_dace_gpu_cached_opt] 9.03842e-04 5.15589e-06
compute_horizontal_velocity_quantities_and_fluxes[run_dace_gpu_cached_opt] 4.80483e-04 1.00173e-05
vertically_implicit_solver_at_predictor_step[run_dace_gpu_cached_opt] 1.30816e-03 2.69263e-05
stencils_61_62[run_dace_gpu_cached_opt] 3.79444e-05 1.21347e-05
compute_dwdz_for_divergence_damping[run_dace_gpu_cached_opt] 2.34539e-05 6.49916e-06
compute_advection_in_vertical_momentum_equation[run_dace_gpu_cached_opt] 3.81733e-04 9.28016e-06
interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleration[run_dace_gpu_cached_opt] 2.15563e-04 6.41855e-06
apply_divergence_damping_and_update_vn[run_dace_gpu_cached_opt] 1.93829e-04 1.09547e-05
compute_averaged_vn_and_fluxes_and_prepare_tracer_advection[run_dace_gpu_cached_opt] 2.86422e-04 2.81291e-05
vertically_implicit_solver_at_corrector_step[run_dace_gpu_cached_opt] 1.23604e-03 2.55282e-05
init_cell_kdim_field_with_zero_wp[run_dace_gpu_cached_opt] 2.14333e-05 2.07814e-06
update_mass_flux_weighted[run_dace_gpu_cached_opt] 3.00033e-05 8.26728e-06
compute_theta_and_exner[run_dace_gpu_cached_opt] 6.89922e-05 8.45818e-06
compute_exner_from_rhotheta[run_dace_gpu_cached_opt] 1.30598e-04 7.91002e-06
update_theta_v[run_dace_gpu_cached_opt] 3.05556e-08 1.72158e-07
```

##### MCH-CH2 1h simulation

```
total 4 01m32s [1] 01m32s 01m32s [0] 92.413 [1] 92.413 [0] 92.413 4
L integrate_nh 720 0.26703s [0] 0.51307s 40.4648s [2] 92.351 [0] 92.354 [2] 92.353 4
L nh_solve 3600 0.02920s [0] 0.08244s 34.1437s [3] 73.067 [0] 75.154 [1] 74.196 4
L nh_hdiff_initial_run 4 0.68132s [1] 0.68300s 0.68724s [3] 0.681 [1] 0.687 [3] 0.683 4
L nh_hdiff 720 0.00496s [2] 0.00657s 0.01094s [0] 1.109 [1] 1.230 [0] 1.182 4
L transport 720 0.02235s [2] 0.02372s 0.04754s [0] 4.202 [2] 4.303 [0] 4.270 4
L adv_horiz 720 0.01356s [1] 0.01405s 0.03766s [0] 2.519 [1] 2.541 [0] 2.529 4
L adv_hflx 720 0.01251s [1] 0.01300s 0.03650s [0] 2.330 [1] 2.353 [0] 2.340 4
L back_traj 2160 0.00001s [0] 0.00040s 0.00133s [3] 0.217 [1] 0.220 [2] 0.219 4
L adv_vert 720 0.00611s [2] 0.00629s 0.00713s [1] 1.132 [2] 1.133 [0] 1.132 4
L adv_vflx 720 0.00529s [2] 0.00547s 0.00632s [3] 0.985 [3] 0.988 [0] 0.985 4
```