# WP2: FALL3D & CINECA
In this document, I'll provide information about the FALL3D model, with a special focus on the parallel I/O system used to access NetCDF files.
Additionally, you'll find instructions to download and run a test case for profiling and assessment of parallel I/O operations.
By parallel I/O operations we mean the whole routines of the nc_IO module (see section [Module mod_nc_IO.F90](#module-mod_nc_iof90)). Looking at the profiling report you sent us, the *Scalasca* results do not report total times, as you surely know, so we assume that measuring elapsed times would be suitable for checking whether I/O is significant in terms of performance and whether further optimizations should be considered.
## Current status
* Currently, parallel I/O to netCDF-4 is enabled in FALL3D
* Initially, the performance was very poor (worse than in the serial case).
* In that case, parallel file access was independent (each processor could access the data without waiting for the others)
* We significantly improved performance by forcing collective access (all processors must participate)
* Now, for large problems parallel and serial I/O have similar performance. For small problems, parallel I/O is slightly better.
* Now, all the computing processors participate in the I/O
* More tests and profiling are required!
## Next deliverable
The next deliverable associated with WP2 is due on **30 April 2020**
|WP no.|Del Rel. No.|Title|Description|Lead Beneficiary|Est. Del. Date|Owner|Reviewer 1|Reviewer 2|Status|
|--|--|--|--|--|--|--|--|--|--|
|WP2|D2.2|First report of code optimization and tuning |First report on optimization and tuning for the selected applications on the available hardware| CINECA| 30 Apr 2020 | Piero Lanucara|Mauricio Hanzich| Josep de La Puente | Pending |
## Proposed goals
* T2.2.4: Vectorization analysis and improvement
- **Done!**
* T2.4: Assessment of parallel I/O performance and profiling
- Execute and evaluate the I/O performance for small and large cases
- We would like to know whether optimizations are worth doing
- Comparison between serial and parallel I/O
- Measure and report relative times for the following subroutines (a timing sketch is given after this list):
- nc_IO_read_dbs/nc_IO_out_dbs
- nc_IO_out_grid/nc_IO_out_res
- nc_IO_out_rst
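As a reference for how these elapsed times could be collected, here is a minimal, self-contained sketch (not part of the FALL3D source): it wraps a timed region with `MPI_Wtime` and reports the maximum elapsed time across ranks. The commented-out call to `nc_IO_out_res` stands in for any of the routines listed above.
```fortran
program io_timing_sketch
  ! Sketch only: measure the elapsed (wall-clock) time of an I/O phase with
  ! MPI_Wtime and report the maximum across ranks.
  use mpi
  implicit none
  integer          :: ierr, rank
  double precision :: t0, t1, t_local, t_max

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize before timing
  t0 = MPI_Wtime()
  ! call nc_IO_out_res(...)                ! <-- timed I/O routine goes here
  t1 = MPI_Wtime()

  t_local = t1 - t0
  call MPI_Reduce(t_local, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, ierr)
  if(rank == 0) write(*,'(a,f12.6,a)') 'elapsed I/O time: ', t_max, ' s'

  call MPI_Finalize(ierr)
end program io_timing_sketch
```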
In the future (depending on the profiling results):
* T2.4: Optimizations
- Collective access for parallel I/O operations (**Done!**)
- Other optimizations?
* T2.4: Asynchronous I/O
- Currently, all the computing processors participate in the I/O operations. It would be important to decide whether dedicated processors for I/O should be defined (a minimal communicator-split sketch is given below).
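If dedicated I/O processors were adopted, one possible (purely illustrative) way to organize them would be to split the global communicator into a compute group and an I/O group. The sketch below is not FALL3D code; the rank assignment and communicator names are assumptions for the example only.
```fortran
program dedicated_io_sketch
  ! Sketch only: reserve the last MPI rank as a dedicated I/O processor by
  ! splitting MPI_COMM_WORLD into a compute communicator and an I/O communicator.
  use mpi
  implicit none
  integer :: ierr, world_rank, world_size, color, new_comm
  logical :: is_io_rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ! The last rank acts as the dedicated I/O processor; the rest compute.
  is_io_rank = (world_rank == world_size-1)
  color = merge(1, 0, is_io_rank)
  call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, new_comm, ierr)

  if(is_io_rank) then
     ! receive output fields from the compute ranks and write them to netCDF
  else
     ! advance the model on new_comm and send output fields to the I/O rank
  end if

  call MPI_Comm_free(new_comm, ierr)
  call MPI_Finalize(ierr)
end program dedicated_io_sketch
```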
## General information
We created a branch **cineca_io** on GitLab for CINECA use. You can get the source code with:
```bash
git clone -b cineca_io https://gitlab.com/fall3d/development.git
```
Feel free to modify the code as you want in the branch **cineca_io**.
Please make sure to work with the latest version of the **master** branch. You can update your branch with:
```bash
git checkout cineca_io
git fetch origin
git merge origin/master
```
You can compile and install FALL3D in the usual way:
```bash
./configure
make
make install
```
You will need the netCDF Fortran library built with parallel I/O (MPI) support.
Further information about the FALL3D model is available in our [wiki site](https://gitlab.com/fall3d-distribution/v8.0/-/wikis/home).
## Test case for CINECA
We prepared a test case for CINECA with high I/O requirements. The user can define the grid size of the problem. Required files:
* Input parameter file: io_cineca.inp
- This is the configuration file
* Input met data: io_cineca.gfs.nc
- Meteorological input data
You can download these files from: [CINECA files](https://drive.google.com/open?id=1qqK64yq4a79CcbnNZzIq5E9pPrEHgWRI)
### Executing FALL3D
Run an MPI job using:
```bash
mpirun -n $NT ${FALL3D_EXEC} ALL io_cineca.inp $NX $NY $NZ
```
where NT = NX x NY x NZ is the total number of processors and NX, NY, and NZ are the number of processors along each direction (for example, NX=4, NY=4, NZ=1 gives NT=16 MPI tasks). Note that these values are independent of the grid size set in the input file.
You can change the size of your problem by editing the following block in the **io_cineca.inp** file:
```
----
GRID
----
(...)
NX = 20
NY = 20
NZ = 12
```
By choosing different values for the grid size of your domain, you can test different problem sizes. For a realistic problem, choose NX=NY=1000 and NZ=100.
To select between serial and parallel I/O, set the `PARALLEL_IO` option in the block:
```
-------------
MODEL_OUTPUT
-------------
!
PARALLEL_IO = NO
```
## I/O diagram
This is the diagram of file access we are interested in testing. It involves access to the following netCDF files: `io_cineca.gfs.nc`, `io_cineca.dbs.nc`, `io_cineca.res.nc`, and `io_cineca.*.rst.nc`.

Notice that the reading of the input file `io_cineca.gfs.nc` is serial in all cases. Consequently, we should focus exclusively on the I/O operations related to the files `io_cineca.res.nc`, `io_cineca.*.rst.nc`, and `io_cineca.dbs.nc`.
## Module mod_nc_IO.F90
The I/O-related routines are located in the Fortran module **mod_nc_IO.F90**. The important routines are:
* nc_IO_read_dbs/nc_IO_out_dbs: read and write the database file (io_cineca.dbs.nc)
* nc_IO_out_grid/nc_IO_out_res: write the results file (io_cineca.res.nc)
* nc_IO_out_rst: write the restart files (io_cineca.*.rst.nc)
The operations related to parallel I/O are controlled by a logical variable called `PARALLEL_IO`. For instance, for creating a netCDF file you will find the following code structure:
```fortran
if(PARALLEL_IO) then
   ! Parallel creation: all processors in COMM_MODEL create/open the file via MPI-IO
   mode_flag = IOR(NF90_NETCDF4, NF90_MPIIO)
   mode_flag = IOR(mode_flag, NF90_CLOBBER)
   istat = nf90_create_par(nc_file,               &
                           cmode = mode_flag,     &
                           comm  = COMM_MODEL,    &
                           info  = MPI_INFO_NULL, &
                           ncid  = ncID)
else if(master_model) then
   ! Serial creation: only the master PE creates the file
   mode_flag = IOR(NF90_CLOBBER, NF90_NETCDF4)
   istat = nf90_create(nc_file, cmode=mode_flag, ncid=ncID)
end if
```
If the parallel I/O option is activated (in the input file **io_cineca.inp**), the logical variable is set to `PARALLEL_IO = .true.` and `nf90_create_par` is called. Otherwise, only the master PE creates the netCDF file by calling `nf90_create`.
We found a significant performance improvement by setting collective access when `PARALLEL_IO = .true.`. For this purpose, for each input/output field we call:
```fortran
istat = nf90_var_par_access(ncID,varID,access = NF90_COLLECTIVE)
```
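To make the collective pattern concrete, here is a minimal, self-contained sketch (not taken from FALL3D; the file, variable, and dimension names are illustrative) in which each rank writes its own slab of a 1-D field with a single collective `nf90_put_var` call after enabling `NF90_COLLECTIVE` access:
```fortran
program collective_write_sketch
  ! Sketch only: create a netCDF-4 file in parallel, enable collective access
  ! for a variable, and let each rank write its own slab
  ! (error checking omitted for brevity).
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nloc = 10           ! points owned by each rank
  integer :: ierr, istat, rank, nprocs
  integer :: ncid, dimid, varid, cmode
  integer :: start(1), count(1)
  real    :: field(nloc)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  field = real(rank)                        ! dummy data

  ! Create a netCDF-4 file for parallel (MPI-IO) access
  cmode = ior(NF90_NETCDF4, NF90_CLOBBER)
  istat = nf90_create('sketch.nc', cmode, ncid, &
                      comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)

  istat = nf90_def_dim(ncid, 'x', nloc*nprocs, dimid)
  istat = nf90_def_var(ncid, 'field', NF90_REAL, (/dimid/), varid)
  istat = nf90_enddef(ncid)

  ! Force collective access for this variable (the key optimization)
  istat = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)

  ! All ranks participate in the call; each writes its own slab
  start(1) = rank*nloc + 1
  count(1) = nloc
  istat = nf90_put_var(ncid, varid, field, start=start, count=count)

  istat = nf90_close(ncid)
  call MPI_Finalize(ierr)
end program collective_write_sketch
```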
This result should probably be mentioned in the upcoming deliverable document along with the new profiling results.