
Post-SCF and real-space data with HDF5

Outline:

  1. Current DIRAC schema and what we want to achieve.
  2. Extending the DIRAC schema - a few scenarios.
  3. Real-space data in HDF5 format + FDE example.
  4. Response data in HDF5 format.
  5. Labeled storage in DIRAC: future and general concerns.

notes for the DIRAC hackathon day, 17/01/2022
link to these notes, link to miro boards
[DIRAC branch: gosia/fde-with-hdf5]
[tests (on that branch): fde_import_export_density_hdf5]


Current DIRAC schema:

  • single CHECKPOINT.h5 file for a single run
  • storage: wavefunctions (HF+DFT) + description

What do we want to achieve:

  • possibility to use many h5 checkpoint files in a single run:

    • FDE: import h5 files for an active subsystem and for frozen subsystem(s)
    • FDE data (freeze-and-thaw iteration): import data, export updated data
  • possibility to work with real-space data:

    • FDE: import/export of embedding potential or frozen densities
    • VISUAL: export data on custom grids
  • possibility to work with post-SCF data:

    • FDE: import/export of perturbed densities (require response calculations)
    • response calculations: checkpoint response parameters

Scenario 1 - extend current DIRAC schema


miro board


Scenario 2 - write separate schema for modules


miro board


FDE in scenario 2: FDE_schema.txt - key elements:

  • one FDE_checkpoint.h5 for one FDE workflow (which can mean a few runs in the case of FDE freeze-and-thaw)
  • basic groups:
    • subsystems: /{input,result}/subsystem
      • their 'labels' are updated according to the order of files (--mol="f1 f2 f3 ..."):
        • we assume f1 is 'active'; f2, f3, ... are 'frozen'
    • grids: /{input,result}/grid:
      • grids have distinct IDs (/{input,result}/grid/grid_id)
  • basic subgroup: .../subsystem/grid_function:
    • grid_function = any property on a grid (densities, potentials, ...)
    • each grid_function can be defined on its own grid:
      • .../subsystem/grid_function/grid_id links to /{input,result}/grid/grid_id (a small sketch of this layout follows below)
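
A minimal h5py sketch of this layout (dataset names follow FDE_schema.txt shown in the next section; the file name, the values, and the indexing of array groups as subsystem/0 and grid_function/0 are only illustrative assumptions):

```python
import numpy as np
import h5py

# hypothetical 3-point grid and one grid function (a density) of the active subsystem
points = np.random.default_rng(0).random((3, 3))   # xyz coordinates, one row per point
weights = np.ones(3)
density = np.zeros(3)

with h5py.File("FDE_checkpoint.h5", "w") as f:
    # the grid is registered once under /input/grid, with its own id
    grid = f.create_group("input/grid")
    grid.create_dataset("grid_id", data=1)
    grid.create_dataset("grid_p_num", data=points.shape[0])
    grid.create_dataset("grid_p_xyz", data=points)
    grid.create_dataset("grid_p_weights", data=weights)

    # the active subsystem, with one grid function defined on that grid
    sub = f.create_group("input/subsystem/0")
    sub.create_dataset("label", data="active")
    gf = sub.create_group("grid_function/0")
    gf.create_dataset("grid_id", data=1)            # link back to /input/grid/grid_id
    gf.create_dataset("density", data=density)      # 'density' is a placeholder name
```

The point is that every grid_function only carries the grid_id of the grid it lives on, so grids are stored once and can be shared.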

FDE in scenario 2: FDE_schema.txt

  • for now there is some redundancy w.r.t. DIRACschema.txt:
*schema
input            composite single    required    # definition of the calculation
result           composite single    required    # results of the calculation
*end

*input
subsystem        composite array     required    # information about each subsystem (on input)
grid             composite single    required    # information about imported grids
*end

*result
execution        composite single    required    # information about the run
subsystem        composite array     required    # information about each subsystem (on output)
grid             composite single    required    # information about exported grids
*end

*subsystem
molecule         composite single    optional    # topology of the molecular system (optional=may not be known for all subsystems)
aobasis          composite single    optional    # atomic orbital basis set descriptions (optional=may not be known for all subsystems)
grid_function    composite array     required    # grid data and quantities imported/exported on that grid
label            string    single    required    # is this subsystem `active` or `frozen`?
name             string    single    required    # unique name or label for a subsystem
operators        composite single    optional    # matrix representations of operators (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
fde_method       composite single    optional    # FDE setup details
wavefunctions    composite single    optional    # results for each wave function that was optimized (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
*end

*grid_function
grid_id          integer   single    required    # grid_id of a grid on which this grid function is defined/is to be defined
property_label   string    single    required    # property label
property_name    real      single    generic     # real-space representation of a property (on a specified grid; property_name is a string of length 16)
status_io        string    single    optional    # 'import' or 'export'
action           string    single    optional    # update/do not update
*end

*grid
grid_p_num       integer   single    required    # number of grid points
grid_p_xyz       real      single    generic     # xyz-coordinates of grid points
grid_p_weights   real      array     required    # weights of grid points
grid_id          integer   single    required    # unique grid id or hash
status_io        string    single    required    # 'import' or 'export'
action           string    single    optional    # prune/combine/compress/... (TODO)
*end
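
On the reading side, the grid_id link can be resolved as in this sketch (continuing the hypothetical file from the sketch above; the 'density' dataset name and the /0 indexing are again assumptions):

```python
import h5py

# find the grid on which a given grid_function is defined, via its grid_id
with h5py.File("FDE_checkpoint.h5", "r") as f:
    gf = f["input/subsystem/0/grid_function/0"]
    wanted_id = gf["grid_id"][()]

    grid = f["input/grid"]
    assert grid["grid_id"][()] == wanted_id, "grid_function refers to an unknown grid"

    npoints = grid["grid_p_num"][()]
    xyz = grid["grid_p_xyz"][...]          # (npoints, 3) coordinates
    weights = grid["grid_p_weights"][...]  # quadrature weights
    values = gf["density"][...]            # the grid function itself
```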

Working with real-space data in DIRAC: FDE example (scenario 2)


miro board


This example from the inside: modifications in DIRAC

  • in src: fde/fde_checkpoint.F90 is a modified copy of gp/checkpoint.F90:
    • we needed more flexibility (many variables are hardcoded in gp/checkpoint.F90)
    • future: transform gp/checkpoint.F90 to a generic module?
  • in utils:
    • FDE_schema.txt
  • pam.in:
    • --mol = "file1 file2 ...."
    • --fdeh5
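
For orientation, a run on that branch might then be launched roughly as pam --inp=fde.inp --mol="active.mol frozen.mol" --fdeh5 (the input and molecule file names here are made up); the first file after --mol is taken as the active subsystem and the remaining ones as frozen, as noted above.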

Real-space data in FDE and VISUAL - TODO list:

  • if scenario 2 for FDE:

    • read subsystem data from their CHECKPOINT.h5 files
    • rework FDE_schema.txt (avoid repetition w.r.t. DIRACschema.txt)
  • add support for grid manipulation:

    • bind to external tools and/or write a separate 'grid' module (FDE and VISUAL)
    • examples: compare grids to avoid double storage, combine/prune grids, enable adaptive grids (see the hashing sketch after this list)
  • better handling of I/O of grid data (think of easy post-processing)

  • overlap of FDE and VISUAL modules:

    • come up with consistent schema for FDE and VISUAL grid data
    • VISUAL as a library of real-space properties
    • FDE:
      • keep import/export of data important to FDE (densities + embedding potential)
      • but call VISUAL if export of another real-space property is requested for a subsystem
  • parallelization:

    • keep track of distribution of grids and grid functions to MPI processes
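
As a possible starting point for the 'compare grids to avoid double storage' item above: grids could be fingerprinted by hashing their points and weights (only a sketch; FDE_schema.txt declares grid_id as an integer, so the digest would have to be truncated to an int or the field type relaxed):

```python
import hashlib
import numpy as np

def grid_fingerprint(xyz: np.ndarray, weights: np.ndarray) -> str:
    """Hash grid points and weights so that identical grids get the same id.

    Rounding guards against insignificant floating-point noise; the 12-decimal
    tolerance is an arbitrary choice for this sketch.
    """
    h = hashlib.sha256()
    h.update(np.round(xyz, 12).tobytes())
    h.update(np.round(weights, 12).tobytes())
    return h.hexdigest()

# two copies of the same physical grid collapse to one stored grid
xyz = np.random.default_rng(1).random((100, 3))
w = np.full(100, 0.01)
assert grid_fingerprint(xyz, w) == grid_fingerprint(xyz.copy(), w.copy())
```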

Working with real-space data: inspirations

  • logical grids (l-grids) and data grids (d-grids), source: https://doi.org/10.1002/cpe.4165:

    • the l-grid carries information about the structure of, and updates to, the d-grids (parallelization, scalability)


  • data structure in vtk-m: source:



Storage of post-SCF data: response parameters

  • step 1: bring linear response data currently stored in the PAMXVC file to HDF5 (a storage sketch follows at the end of this section)
  • decision: extend DIRACschema (scenario 1) or write a separate one (scenario 2)?
  • decision: labelling ideas for response:
    • property labels used now in src/prp:
      • variables: PRPLBL (internal) or PRPNAM (user defined)
      • PRPNAM is what is stored on PAMXVC
    • property labels used in src/openrsp:

      type(prop_field_info) :: field_list(14) = &                         !nc an ba ln qu
        (/prop_field_info('EXCI', 'Generalized "excitation" field'      , 1, F, F, T, T), &
          prop_field_info('FREQ', 'Generalized "freqency" field'        , 1, F, F, T, T), &
          prop_field_info('AUX*', 'Auxiliary integrals on file'         , 1, F, F, T, F), &
          prop_field_info('PNC' , 'PNC'                                 , 1, F, F, T, F), &
          prop_field_info('EL'  , 'Electric field'                      , 3, F, F, T, F), &
          prop_field_info('VEL' , 'Velocity'                            , 3, T, F, T, F), &
          prop_field_info('MAGO', 'Magnetic field w/o. London orbitals' , 3, T, F, F, T), &
          prop_field_info('MAG' , 'Magnetic field with London orbitals' , 3, T, T, F, F), &
          prop_field_info('ELGR', 'Electric field gradient'             , 6, F, F, T, F), &
          prop_field_info('VIBM', 'Displacement along vibrational modes',-1, F, T, F, F), &
          prop_field_info('GEO' , 'Nuclear coordinates'                 ,-1, F, T, F, F), & !-1=mol-dep
          prop_field_info('NUCM', 'Nuclear magnetic moment'             ,-1, F, T, F, T), & !-1=mol-dep
          prop_field_info('AOCC', 'AO contraction coefficients'         ,-1, F, T, F, F), & !-1=mol-dep
          prop_field_info('AOEX', 'AO exponents'                        ,-1, F, T, F, F)/)  !-1=mol-dep

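One conceivable HDF5 layout for step 1, keyed by PRPNAM-style labels (the group structure, attribute names, and vector length below are made up for this sketch; the file name echoes the PRP_CHECKPOINT.h5 idea mentioned later in these notes):

```python
import numpy as np
import h5py

# one linear-response solution vector per (property label, frequency)
responses = {
    ("ZDIPLEN", 0.0): np.zeros(10),   # example label; vector length is arbitrary here
    ("ZDIPLEN", 0.1): np.zeros(10),
}

with h5py.File("PRP_CHECKPOINT.h5", "w") as f:
    root = f.create_group("result/response")
    for (label, freq), vec in responses.items():
        g = root.create_group(f"{label}/freq_{freq:.6f}")
        g.attrs["property_label"] = label
        g.attrs["frequency"] = freq
        g.create_dataset("solution_vector", data=vec)
```
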
Extending labeled storage in DIRAC - general concerns:

  1. code flexibility:

    • easy-to-extend schema(s) (scenario 1? scenario 2? something in between?)
    • support different strategies for working with real-space data:
      • import real-space data to DIRAC:
        • the grid + the data on that grid; no need to store anything else
        • data can come from other software (e.g. ADF)
      • (re)generate real-space data from checkpoint files:
        • data can come from DIRAC only
        • e.g.: unperturbed density can be generated from CHECKPOINT.h5 on any grid
        • perturbed densities can be generated if response data is also checkpointed
    • easy mechanisms for updating the data (overwriting) and for tracking how it changes (see the small sketch at the end of these notes)
    • ensure easy integration with external codes
  2. code maintainability:

    • if scenario 2: generalize utils/process_schema.py and gp/checkpoint.F90?
    • better documentation (what to update to extend the schema)
    • consistent labelling, using variables for groups and labels, etc.
  3. scalability:

    • strategies for large grids
  4. performance:

    • favor storing or (re)generating the data?
    • store data in one large CHECKPOINT.h5 file or create many checkpoint files (FDE_CHECKPOINT.h5, PRP_CHECKPOINT.h5, etc.)?
    • automate decisions about when to store/overwrite the data
    • separate data computation from data I/O
    • enable restarts of computations on grids
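
Relating to the 'updating and tracking how it changes' point under code flexibility above, a minimal sketch (attribute names invented for this example) of overwriting a grid function in place while recording provenance in HDF5 attributes:

```python
from datetime import datetime, timezone
import h5py

def update_grid_function(path, dset_path, new_values, step):
    """Overwrite an existing grid-function dataset and stamp provenance attributes."""
    with h5py.File(path, "a") as f:
        dset = f[dset_path]
        dset[...] = new_values                      # in-place overwrite, shapes must match
        dset.attrs["last_updated"] = datetime.now(timezone.utc).isoformat()
        dset.attrs["updated_by"] = step             # e.g. "freeze-thaw cycle 3"
        dset.attrs["n_updates"] = dset.attrs.get("n_updates", 0) + 1
```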