
Post-SCF and real-space data with HDF5

Outline:

  1. Current DIRAC schema and what we want to achieve.
  2. Extending the DIRAC schema - a few scenarios.
  3. Real-space data in HDF5 format + FDE example.
  4. Response data in HDF5 format.
  5. Labeled storage in DIRAC: future and general concerns.

notes for the DIRAC hackathon day, 17/01/2022
link to these notes, link to miro boards
[DIRAC branch: gosia/fde-with-hdf5]
[tests (on that branch): fde_import_export_density_hdf5]


Current DIRAC schema:

  • single CHECKPOINT.h5 file for a single run
  • storage: wavefunctions (HF+DFT) + description

What do we want to achieve:

  • possibility to use many h5 checkpoint files in a single run:

    • FDE: import h5 files for an active subsystem and for frozen subsystem(s)
    • FDE data (freeze-and-thaw iteration): import data, export updated data
  • possibility to work with real-space data:

    • FDE: import/export of embedding potential or frozen densities
    • VISUAL: export data on custom grids
  • possibility to work with post-SCF data:

    • FDE: import/export of perturbed densities (require response calculations)
    • response calculations: checkpoint response parameters

Scenario 1 - extend current DIRAC schema


miro board


Scenario 2 - write separate schema for modules


miro board


FDE in scenario 2: FDE_schema.txt - key elements:

  • one FDE_checkpoint.h5 for one FDE workflow (which can mean a few runs in the case of FDE freeze-and-thaw)
  • basic groups:
    • subsystems: /{input,result}/subsystem
      • their 'labels' are updated according to the order of files (--mol="f1 f2 f3 ..."):
        • we assume f1 is 'active'; f2, f3, ... are 'frozen'
    • grids: /{input,result}/grid:
      • grids have distinct IDs (/{input,result}/grid/grid_id)
  • basic subgroup: .../subsystem/grid_function:
    • grid_function = any property on a grid (densities, potentials, ...)
    • each grid_function can be defined on its own grid:
      • .../subsystem/grid_function/grid_id links to /{input,result}/grid/grid_id (a small sketch of this layout follows below)
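
A minimal h5py sketch of this layout (dataset names follow FDE_schema.txt shown in the next section; the file name, the values, and the indexing of array groups as subsystem/0 and grid_function/0 are only illustrative assumptions):

```python
import numpy as np
import h5py

# hypothetical 3-point grid and one grid function (a density) of the active subsystem
points = np.random.default_rng(0).random((3, 3))   # xyz coordinates, one row per point
weights = np.ones(3)
density = np.zeros(3)

with h5py.File("FDE_checkpoint.h5", "w") as f:
    # the grid is registered once under /input/grid, with its own id
    grid = f.create_group("input/grid")
    grid.create_dataset("grid_id", data=1)
    grid.create_dataset("grid_p_num", data=points.shape[0])
    grid.create_dataset("grid_p_xyz", data=points)
    grid.create_dataset("grid_p_weights", data=weights)

    # the active subsystem, with one grid function defined on that grid
    sub = f.create_group("input/subsystem/0")
    sub.create_dataset("label", data="active")
    gf = sub.create_group("grid_function/0")
    gf.create_dataset("grid_id", data=1)            # link back to /input/grid/grid_id
    gf.create_dataset("density", data=density)      # 'density' is a placeholder name
```

The point is that every grid_function only carries the grid_id of the grid it lives on, so grids are stored once and can be shared.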

FDE in scenario 2: FDE_schema.txt

  • for now there is some redundancy w.r.t. DIRACschema.txt:
*schema
input            composite single    required    # definition of the calculation
result           composite single    required    # results of the calculation
*end

*input
subsystem        composite array     required    # information about each subsystem (on input)
grid             composite single    required    # information about imported grids
*end

*result
execution        composite single    required    # information about the run
subsystem        composite array     required    # information about each subsystem (on output)
grid             composite single    required    # information about exported grids
*end

*subsystem
molecule         composite single    optional    # topology of the molecular system (optional=may not be known for all subsystems)
aobasis          composite single    optional    # atomic orbital basis set descriptions (optional=may not be known for all subsystems)
grid_function    composite array     required    # grid data and quantities imported/exported on that grid
label            string    single    required    # is this subsystem `active` or `frozen`?
name             string    single    required    # unique name or label for a subsystem
operators        composite single    optional    # matrix representations of operators (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
fde_method       composite single    optional    # FDE setup details
wavefunctions    composite single    optional    # results for each wave function that was optimized (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
*end

*grid_function
grid_id          integer   single    required    # grid_id of a grid on which this grid function is defined/is to be defined
property_label   string    single    required    # property label
property_name    real      single    generic     # real-space representation of a property (on a specified grid; property_name is a string of length 16)
status_io        string    single    optional    # 'import' or 'export'
action           string    single    optional    # update/do not update
*end

*grid
grid_p_num       integer   single    required    # number of grid points
grid_p_xyz       real      single    generic     # xyz-coordinates of grid points
grid_p_weights   real      array     required    # weights of grid points
grid_id          integer   single    required    # unique grid id or hash
status_io        string    single    required    # 'import' or 'export'
action           string    single    optional    # prune/combine/compress/... (TODO)
*end
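
On the reading side, the grid_id link can be resolved as in this sketch (continuing the hypothetical file from the sketch above; the 'density' dataset name and the /0 indexing are again assumptions):

```python
import h5py

# find the grid on which a given grid_function is defined, via its grid_id
with h5py.File("FDE_checkpoint.h5", "r") as f:
    gf = f["input/subsystem/0/grid_function/0"]
    wanted_id = gf["grid_id"][()]

    grid = f["input/grid"]
    assert grid["grid_id"][()] == wanted_id, "grid_function refers to an unknown grid"

    npoints = grid["grid_p_num"][()]
    xyz = grid["grid_p_xyz"][...]          # (npoints, 3) coordinates
    weights = grid["grid_p_weights"][...]  # quadrature weights
    values = gf["density"][...]            # the grid function itself
```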

Working with real-space data in DIRAC: FDE example (scenario 2)


miro board


This example from the inside: modifications in DIRAC

  • in src: fde/fde_checkpoint.F90 is a modified copy of gp/checkpoint.F90:
    • we needed more flexibility (many variables are hardcoded in gp/checkpoint.F90)
    • future: transform gp/checkpoint.F90 to a generic module?
  • in utils:
    • FDE_schema.txt
  • pam.in:
    • --mol = "file1 file2 ...."
    • --fdeh5
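
For orientation, a run on that branch might then be launched roughly as pam --inp=fde.inp --mol="active.mol frozen.mol" --fdeh5 (the input and molecule file names here are made up); the first file after --mol is taken as the active subsystem and the remaining ones as frozen, as noted above.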

Real-space data in FDE and VISUAL - TODO list:

  • if scenario 2 for FDE:

    • read subsystem data from their CHECKPOINT.h5 files
    • rework FDE_schema.txt (avoid repetition w.r.t. DIRACschema.txt)
  • add support for grid manipulation:

    • bind to external tools and/or write a separate 'grid' module (FDE and VISUAL)
    • examples: compare grids to avoid double storage, combine/prune grids, enable adaptive grids (see the hashing sketch after this list)
  • better handling of I/O of grid data (think of easy post-processing)

  • overlap of FDE and VISUAL modules:

    • come up with consistent schema for FDE and VISUAL grid data
    • VISUAL as a library of real-space properties
    • FDE:
      • keep import/export of data important to FDE (densities + embedding potential)
      • but call VISUAL if export of another real-space property is requested for a subsystem
  • parallelization:

    • keep track of distribution of grids and grid functions to MPI processes
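
As a possible starting point for the 'compare grids to avoid double storage' item above: grids could be fingerprinted by hashing their points and weights (only a sketch; FDE_schema.txt declares grid_id as an integer, so the digest would have to be truncated to an int or the field type relaxed):

```python
import hashlib
import numpy as np

def grid_fingerprint(xyz: np.ndarray, weights: np.ndarray) -> str:
    """Hash grid points and weights so that identical grids get the same id.

    Rounding guards against insignificant floating-point noise; the 12-decimal
    tolerance is an arbitrary choice for this sketch.
    """
    h = hashlib.sha256()
    h.update(np.round(xyz, 12).tobytes())
    h.update(np.round(weights, 12).tobytes())
    return h.hexdigest()

# two copies of the same physical grid collapse to one stored grid
xyz = np.random.default_rng(1).random((100, 3))
w = np.full(100, 0.01)
assert grid_fingerprint(xyz, w) == grid_fingerprint(xyz.copy(), w.copy())
```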

Working with real-space data: inspirations

  • logical grids (l-grids) and data grids (d-grids), source: https://doi.org/10.1002/cpe.4165:

    • the l-grid carries information about the structure of, and updates to, the d-grids (parallelization, scalability)


  • data structure in vtk-m: source:



Storage of post-SCF data: response parameters

  • step 1: bring linear response data currently stored in the PAMXVC file to HDF5 (a storage sketch follows at the end of this section)
  • decision: extend DIRACschema (scenario 1) or write a separate one (scenario 2)?
  • decision: labelling ideas for response:
    • property labels used now in src/prp:
      • variables: PRPLBL (internal) or PRPNAM (user defined)
      • PRPNAM is what is stored on PAMXVC
    • property labels used in src/openrsp:

      type(prop_field_info) :: field_list(14) = &                         !nc an ba ln qu
        (/prop_field_info('EXCI', 'Generalized "excitation" field'      , 1, F, F, T, T), &
          prop_field_info('FREQ', 'Generalized "freqency" field'        , 1, F, F, T, T), &
          prop_field_info('AUX*', 'Auxiliary integrals on file'         , 1, F, F, T, F), &
          prop_field_info('PNC' , 'PNC'                                 , 1, F, F, T, F), &
          prop_field_info('EL'  , 'Electric field'                      , 3, F, F, T, F), &
          prop_field_info('VEL' , 'Velocity'                            , 3, T, F, T, F), &
          prop_field_info('MAGO', 'Magnetic field w/o. London orbitals' , 3, T, F, F, T), &
          prop_field_info('MAG' , 'Magnetic field with London orbitals' , 3, T, T, F, F), &
          prop_field_info('ELGR', 'Electric field gradient'             , 6, F, F, T, F), &
          prop_field_info('VIBM', 'Displacement along vibrational modes',-1, F, T, F, F), &
          prop_field_info('GEO' , 'Nuclear coordinates'                 ,-1, F, T, F, F), & !-1=mol-dep
          prop_field_info('NUCM', 'Nuclear magnetic moment'             ,-1, F, T, F, T), & !-1=mol-dep
          prop_field_info('AOCC', 'AO contraction coefficients'         ,-1, F, T, F, F), & !-1=mol-dep
          prop_field_info('AOEX', 'AO exponents'                        ,-1, F, T, F, F)/)  !-1=mol-dep

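One conceivable HDF5 layout for step 1, keyed by PRPNAM-style labels (the group structure, attribute names, and vector length below are made up for this sketch; the file name echoes the PRP_CHECKPOINT.h5 idea mentioned later in these notes):

```python
import numpy as np
import h5py

# one linear-response solution vector per (property label, frequency)
responses = {
    ("ZDIPLEN", 0.0): np.zeros(10),   # example label; vector length is arbitrary here
    ("ZDIPLEN", 0.1): np.zeros(10),
}

with h5py.File("PRP_CHECKPOINT.h5", "w") as f:
    root = f.create_group("result/response")
    for (label, freq), vec in responses.items():
        g = root.create_group(f"{label}/freq_{freq:.6f}")
        g.attrs["property_label"] = label
        g.attrs["frequency"] = freq
        g.create_dataset("solution_vector", data=vec)
```
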
Extending labeled storage in DIRAC - general concerns:

  1. code flexibility:

    • easy-to-extend schema(s) (scenario 1? scenario 2? something in between?)
    • support different strategies for working with real-space data:
      • import real-space data to DIRAC:
        • the grid + the data on that grid; no need to store anything else
        • data can come from other software (e.g. ADF)
      • (re)generate real-space data from checkpoint files:
        • data can come from DIRAC only
        • e.g.: unperturbed density can be generated from CHECKPOINT.h5 on any grid
        • perturbed densities can be generated if response data is also checkpointed
    • easy mechanisms for updating the data (overwriting) and for tracking how it changes (see the small sketch at the end of these notes)
    • ensure easy integration with external codes
  2. code maintainability:

    • if scenario 2: generalize utils/process_schema.py and gp/checkpoint.F90?
    • better documentation (what to update to extend the schema)
    • consistent labelling, using variables for groups and labels, etc.
  3. scalability:

    • strategies for large grids
  4. performance:

    • favor storing or (re)generating the data?
    • store data in one large CHECKPOINT.h5 file or create many checkpoint files (FDE_CHECKPOINT.h5, PRP_CHECKPOINT.h5, etc.)?
    • automate decisions about when to store/overwrite the data
    • separate data computation from data I/O
    • enable restarts of computations on grids
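
Relating to the 'updating and tracking how it changes' point under code flexibility above, a minimal sketch (attribute names invented for this example) of overwriting a grid function in place while recording provenance in HDF5 attributes:

```python
from datetime import datetime, timezone
import h5py

def update_grid_function(path, dset_path, new_values, step):
    """Overwrite an existing grid-function dataset and stamp provenance attributes."""
    with h5py.File(path, "a") as f:
        dset = f[dset_path]
        dset[...] = new_values                      # in-place overwrite, shapes must match
        dset.attrs["last_updated"] = datetime.now(timezone.utc).isoformat()
        dset.attrs["updated_by"] = step             # e.g. "freeze-thaw cycle 3"
        dset.attrs["n_updates"] = dset.attrs.get("n_updates", 0) + 1
```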