# Post-SCF and real-space data with HDF5
### Outline:
1. Current DIRAC schema and what we want to achieve.
2. Extending the DIRAC schema - a few scenarios.
3. Real-space data in HDF5 format + FDE example.
4. Response data in HDF5 format.
5. Labeled storage in DIRAC: future and general concerns.
<br/>
> notes for DIRAC's hackathon day, 17/01/2022
> [link to these notes](https://hackmd.io/@gosia/rk9iB1DnK/edit), [link to miro boards](https://miro.com/welcome/MjJFUkFzUG83eFhScEppUGZHVjJqN3lEcEdEUDgycHV0eE1FYkwyaHRGSk1oQTMxZUNiTEdVSEpmT2M2RVdFWnwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=859971972633)
> DIRAC branch: `gosia/fde-with-hdf5`
> tests (on that branch): `fde_import_export_density_hdf5`
---
### Current DIRAC schema:
* single `CHECKPOINT.h5` file for a single run
* storage: wavefunctions (HF+DFT) + description
### What we want to achieve:
* possibility to use many h5 checkpoint files in a single run:
* FDE: import h5 files for an active subsystem and for frozen subsystem(s)
* FDE data (freeze-and-thaw iteration): import data, export updated data
* possibility to work with real-space data:
* FDE: import/export of embedding potential or frozen densities
* VISUAL: export data on custom grids
* possibility to work with post-SCF data:
* FDE: import/export of perturbed densities (requires response calculations)
* response calculations: checkpoint response parameters
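As a minimal taste of the multi-file goal, a hypothetical h5py sketch (all file names and group paths below are placeholders, not the actual `DIRACschema.txt` layout):
```python
# Hypothetical sketch: one FDE run reading the active subsystem's checkpoint
# plus two frozen-subsystem checkpoints; file names and paths are placeholders.
import h5py

with h5py.File("active/CHECKPOINT.h5", "r") as active, \
     h5py.File("frozen1/CHECKPOINT.h5", "r") as f1, \
     h5py.File("frozen2/CHECKPOINT.h5", "r") as f2:
    for frozen in (f1, f2):
        # e.g. pull a frozen density defined on some grid
        density = frozen["/result/grid_data/density"][...]  # placeholder path
```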
---
### Scenario 1 - extend current DIRAC schema
![](https://i.imgur.com/gSy75jl.jpg)
[miro board](https://miro.com/welcomeonboard/V0NQZjFmQTM4c09EVmN3VGV2NXIxcnE1RWNRWk1odlk4cTAyZDByNzVsSGRGOVYzYkJrQzh2U1pRNm02UXhIVHwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=636686901137)
---
### Scenario 2 - write separate schema for modules
![](https://i.imgur.com/ObjBcbE.jpg =x450)
[miro board](https://miro.com/welcomeonboard/ZnpDdngzNk1GVGEyb2F3Qkk4bG9UUEtJUEM0U2g0dmZMemtidExwWmVFS3FJcEtIMnNNajVlbExpRzNveENSQXwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=289216627870)
---
### FDE in scenario 2: `FDE_schema.txt` - key elements:
* one `FDE_checkpoint.h5` for one FDE workflow (which can span several runs in the case of freeze-and-thaw)
* basic groups:
* subsystems: `/{input,result}/subsystem`
* their 'labels' are updated according to the order of files (`--mol="f1 f2 f3 ..."`):
* we assume `f1` is 'active'; `f2`, `f3`,... are 'frozen'
* grids: `/{input,result}/grid`:
* grids have distinct IDs (`/{input,result}/grid/grid_id`)
* basic subgroup: `.../subsystem/grid_function`:
* `grid_function` = any property on a grid (densities, potentials, ...)
* each `grid_function` can be defined on its own grid:
* `.../subsystem/grid_function/grid_id` links to `/{input,result}/grid/grid_id`
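A minimal h5py sketch of this layout (assuming, for illustration only, that 'composite array' groups map to numbered subgroups; names and values are placeholders):
```python
# Illustrative sketch of the FDE_checkpoint.h5 layout described above.
import h5py
import numpy as np

with h5py.File("FDE_checkpoint.h5", "w") as f:
    # one imported grid, identified by its grid_id
    grid = f.create_group("input/grid")
    grid["grid_id"] = 1
    grid["grid_p_num"] = 4
    grid["grid_p_xyz"] = np.zeros((4, 3))      # xyz-coordinates of grid points
    grid["grid_p_weights"] = np.ones(4)        # weights of grid points
    grid["status_io"] = "import"

    # active subsystem with one grid function defined on grid_id = 1
    sub = f.create_group("input/subsystem_1")  # assuming numbered array groups
    sub["label"] = "active"
    sub["name"] = "f1"
    gf = sub.create_group("grid_function_1")
    gf["grid_id"] = 1                          # links back to /input/grid/grid_id
    gf["property_label"] = "density"
```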
---
### FDE in scenario 2: `FDE_schema.txt`
* now: some redundancy wrt `DIRACschema.txt`:
```
*schema
input composite single required # definition of the calculation
result composite single required # results of the calculation
*end
*input
subsystem composite array required # information about each subsystem (on input)
grid composite single required # information about imported grids
*end
*result
execution composite single required # information about the run
subsystem composite array required # information about each subsystem (on output)
grid composite single required # information about exported grids
*end
*subsystem
molecule composite single optional # topology of the molecular system (optional=may not be known for all subsystems)
aobasis composite single optional # atomic orbital basis set descriptions (optional=may not be known for all subsystems)
grid_function composite array required # grid data and quantities imported/exported on that grid
label string single required # is this subsystem `active` or `frozen`?
name string single required # unique name or label for a subsystem
operators composite single optional # matrix representations of operators (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
fde_method composite single optional # FDE setup details
wavefunctions composite single optional # results for each wave functions that was optimized (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
*end
*grid_function
grid_id integer single required # grid_id of a grid on which this grid function is defined/is to be defined
property_label string single required # property label
property_name real single generic # real-space representation of a property (on a specified grid; property_name is a string of length 16)
status_io string single optional # 'import' or 'export'
action string single optional # update/do not update
*end
*grid
grid_p_num integer single required # number of grid points
grid_p_xyz real single generic # xyz-coordinates of grid points
grid_p_weights real array required # weights of grid points
grid_id integer single required # unique grid id or hash
status_io string single required # 'import' or 'export'
action string single optional # prune/combine/compress/... (TODO)
*end
```
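This plain-text format is easy to parse; a toy Python parser, illustrative only (not the actual `utils/process_schema.py`):
```python
# Toy parser for the schema format above: each '*group' block contains lines
# of the form 'name kind arity requirement # comment'.
def parse_schema(text):
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop trailing comments
        if not line:
            continue
        if line == "*end":
            current = None
        elif line.startswith("*"):
            current = groups.setdefault(line[1:], [])
        elif current is not None:
            name, kind, arity, req = line.split()[:4]
            current.append({"name": name, "kind": kind,
                            "arity": arity, "requirement": req})
    return groups

# usage: parse_schema(open("FDE_schema.txt").read())["grid_function"]
```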
---
### Working with real-space data in DIRAC: FDE example (scenario 2)
![](https://i.imgur.com/36xGsMQ.jpg)
[miro board](https://miro.com/welcomeonboard/MGtwMEpvaTUzcUN1ZEF0SXZOMzVYbGZBZEhvem9MQjBXYTRWNEhYbUltd25XNWV5YmlHamc2b2dLOEdacDl5bXwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=206001469506)
---
### This example from the inside: modifications in DIRAC
* in `src`: `fde/fde_checkpoint.F90` is a modified copy of `gp/checkpoint.F90`:
* we needed more flexibility (many variables are hardcoded in `gp/checkpoint.F90`)
* future: transform `gp/checkpoint.F90` into a generic module?
* in `utils`:
* `FDE_schema.txt`
* `pam.in`:
* `--mol="file1 file2 ..."`
* `--fdeh5`
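Putting the two flags together, an FDE run on this branch could be launched along these lines (input and molecule file names are illustrative):
```
pam --inp=fde.inp --mol="active.mol frozen1.mol frozen2.mol" --fdeh5
```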
---
### Real-space data in FDE and VISUAL - TODO list:
* if scenario 2 is chosen for FDE:
* read subsystem data from their `CHECKPOINT.h5` files
* rework `FDE_schema.txt` (avoid repetition wrt `DIRACschema.txt`)
* add support for grid manipulation:
* bind to external tools and/or write a separate 'grid' module (FDE and VISUAL)
* examples: compare grids to avoid double storage, combine/prune grids, enable adaptive grids (see the fingerprint sketch after this list)
* better handling of I/O of grid data (think of easy post-processing)
* overlap of FDE and VISUAL modules:
* come up with consistent schema for FDE and VISUAL grid data
* VISUAL as a library of real-space properties
* FDE:
* keep import/export of data important to FDE (densities + embedding potential)
* but call VISUAL if export of another real-space property is requested for a subsystem
* parallelization:
* keep track of distribution of grids and grid functions to MPI processes
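For the grid-comparison item above, one option is to derive `grid_id` from a content hash so that identical grids are detected and stored only once; a hypothetical sketch (not DIRAC code):
```python
# Fingerprint a grid from its points and weights: equal grids hash equal,
# so an exporter can skip grids that are already stored.
import hashlib
import numpy as np

def grid_fingerprint(points, weights, decimals=10):
    buf = np.ascontiguousarray(np.round(points, decimals)).tobytes()
    buf += np.ascontiguousarray(np.round(weights, decimals)).tobytes()
    return hashlib.sha256(buf).hexdigest()
```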
---
### Working with real-space data: inspirations
* logical grids (l-grids) and data grids (d-grids), [source: https://doi.org/10.1002/cpe.4165](https://doi.org/10.1002/cpe.4165):
* the l-grid carries information about the structure of, and updates to, the d-grids (parallelization, scalability):
![](https://i.imgur.com/vdloCCy.png =x400)
* data structures in VTK-m: [source](https://www.cs.uoregon.edu/Reports/AREA-201703-Kress.pdf):
![](https://i.imgur.com/G5FKhLL.png =x400)
---
### Storage of post-SCF data: response parameters
* step 1: bring the linear response data stored in the `PAMXVC` file to HDF5
* decision: extend `DIRACschema.txt` (scenario 1) or write a separate schema (scenario 2)?
* decision: labelling ideas for response:
* property labels used now in `src/prp`:
* variables: `PRPLBL` (internal) or `PRPNAM` (user defined)
* `PRPNAM` is what is stored on `PAMXVC`
* property labels used in `src/openrsp`:
```
type(prop_field_info) :: field_list(14) = & !nc an ba ln qu
(/prop_field_info('EXCI', 'Generalized "excitation" field' , 1, F, F, T, T), &
  prop_field_info('FREQ', 'Generalized "frequency" field'        , 1, F, F, T, T), &
prop_field_info('AUX*', 'Auxiliary integrals on file' , 1, F, F, T, F), &
prop_field_info('PNC' , 'PNC' , 1, F, F, T, F), &
prop_field_info('EL' , 'Electric field' , 3, F, F, T, F), &
prop_field_info('VEL' , 'Velocity' , 3, T, F, T, F), &
prop_field_info('MAGO', 'Magnetic field w/o. London orbitals' , 3, T, F, F, T), &
prop_field_info('MAG' , 'Magnetic field with London orbitals' , 3, T, T, F, F), &
prop_field_info('ELGR', 'Electric field gradient' , 6, F, F, T, F), &
prop_field_info('VIBM', 'Displacement along vibrational modes',-1, F, T, F, F), &
prop_field_info('GEO' , 'Nuclear coordinates' ,-1, F, T, F, F), & !-1=mol-dep
prop_field_info('NUCM', 'Nuclear magnetic moment' ,-1, F, T, F, T), & !-1=mol-dep
prop_field_info('AOCC', 'AO contraction coefficients' ,-1, F, T, F, F), & !-1=mol-dep
prop_field_info('AOEX', 'AO exponents' ,-1, F, T, F, F)/) !-1=mol-dep
```
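One possible (by no means decided) HDF5 layout for checkpointing linear response parameters, keyed by the `PRPNAM`-style labels stored on `PAMXVC` today; all paths and names below are illustrative:
```python
# Illustrative sketch: store each response vector under its property label
# and frequency.
import h5py
import numpy as np

with h5py.File("PRP_CHECKPOINT.h5", "w") as f:
    rsp = f.create_group("result/response/ZDIPLEN")  # PRPNAM-style label
    rsp["frequency"] = 0.0
    rsp["solution_vector"] = np.zeros(10)            # placeholder data
```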
---
### Extending labeled storage in DIRAC - general concerns:
1. code flexibility:
* easy-to-extend schema(s) (scenario 1? scenario 2? something in between?)
* support different strategies for working with real-space data:
* import real-space data to DIRAC:
* grid + data on that grid; no need to store anything else
* data can come from other software (e.g. ADF)
* (re)generate real-space data from checkpoint files:
* data can come from DIRAC only
* e.g.: unperturbed density can be generated from `CHECKPOINT.h5` on any grid
* perturbed densities can be generated if response data is also checkpointed
* easy mechanisms for updating the data (overwriting) and for tracking how it changes
* ensure easy integration with external codes
2. code maintainability:
* if scenario 2: generalize `utils/process_schema.py` and `gp/checkpoint.F90`?
* better documentation (what to update to extend the schema)
* consistent labelling, using variables for groups and labels, etc.
3. scalability:
* strategies for large grids
4. performance:
* favor storing or (re)generating the data?
* store data in one large `CHECKPOINT.h5` file or create many checkpoint files (`FDE_CHECKPOINT.h5`, `PRP_CHECKPOINT.h5`, etc.)?
* automate decisions about when to store/overwrite the data
* separate data computation from data I/O
* enable restarts of computations on grids
---