# Post-SCF and real-space data with HDF5
### Outline:
1. Current DIRAC schema and what we want to achieve.
2. Extending the DIRAC schema - a few scenarios.
3. Real-space data in HDF5 format + FDE example.
4. Response data in HDF5 format.
5. Labeled storage in DIRAC: future and general concerns.
<br/>
> notes for DIRAC's hackathon day, 17/01/2022
> [link to these notes](https://hackmd.io/@gosia/rk9iB1DnK/edit), [link to miro boards](https://miro.com/welcome/MjJFUkFzUG83eFhScEppUGZHVjJqN3lEcEdEUDgycHV0eE1FYkwyaHRGSk1oQTMxZUNiTEdVSEpmT2M2RVdFWnwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=859971972633)
> DIRAC branch: `gosia/fde-with-hdf5`
> tests (on that branch): `fde_import_export_density_hdf5`
---
### Current DIRAC schema:
* single `CHECKPOINT.h5` file for a single run
* storage: wavefunctions (HF+DFT) + description
### What we want to achieve:
* possibility to use many h5 checkpoint files in a single run:
* FDE: import h5 files for an active subsystem and for frozen subsystem(s)
* FDE data (freeze-and-thaw iteration): import data, export updated data
* possibility to work with real-space data:
* FDE: import/export of embedding potential or frozen densities
* VISUAL: export data on custom grids
* possibility to work with post-SCF data:
* FDE: import/export of perturbed densities (requires response calculations)
* response calculations: checkpoint response parameters
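As a minimal taste of the multi-file goal, a hypothetical h5py sketch (all file names and group paths below are placeholders, not the actual `DIRACschema.txt` layout):
```python
# Hypothetical sketch: one FDE run reading the active subsystem's checkpoint
# plus two frozen-subsystem checkpoints; file names and paths are placeholders.
import h5py

with h5py.File("active/CHECKPOINT.h5", "r") as active, \
     h5py.File("frozen1/CHECKPOINT.h5", "r") as f1, \
     h5py.File("frozen2/CHECKPOINT.h5", "r") as f2:
    for frozen in (f1, f2):
        # e.g. pull a frozen density defined on some grid
        density = frozen["/result/grid_data/density"][...]  # placeholder path
```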
---
### Scenario 1 - extend current DIRAC schema
![](https://i.imgur.com/gSy75jl.jpg)
[miro board](https://miro.com/welcomeonboard/V0NQZjFmQTM4c09EVmN3VGV2NXIxcnE1RWNRWk1odlk4cTAyZDByNzVsSGRGOVYzYkJrQzh2U1pRNm02UXhIVHwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=636686901137)
---
### Scenario 2 - write separate schema for modules
![](https://i.imgur.com/ObjBcbE.jpg =x450)
[miro board](https://miro.com/welcomeonboard/ZnpDdngzNk1GVGEyb2F3Qkk4bG9UUEtJUEM0U2g0dmZMemtidExwWmVFS3FJcEtIMnNNajVlbExpRzNveENSQXwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=289216627870)
---
### FDE in scenario 2: `FDE_schema.txt` - key elements:
* one `FDE_checkpoint.h5` for one FDE workflow (which can span several runs in the case of freeze-and-thaw)
* basic groups:
* subsystems: `/{input,result}/subsystem`
* their 'labels' are updated according to the order of files (`--mol="f1 f2 f3 ..."`):
* we assume `f1` is 'active'; `f2`, `f3`,... are 'frozen'
* grids: `/{input,result}/grid`:
* grids have distinct IDs (`/{input,result}/grid/grid_id`)
* basic subgroup: `.../subsystem/grid_function`:
* `grid_function` = any property on a grid (densities, potentials, ...)
* each `grid_function` can be defined on its own grid:
* `.../subsystem/grid_function/grid_id` links to `/{input,result}/grid/grid_id`
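A minimal h5py sketch of this layout (assuming, for illustration only, that 'composite array' groups map to numbered subgroups; names and values are placeholders):
```python
# Illustrative sketch of the FDE_checkpoint.h5 layout described above.
import h5py
import numpy as np

with h5py.File("FDE_checkpoint.h5", "w") as f:
    # one imported grid, identified by its grid_id
    grid = f.create_group("input/grid")
    grid["grid_id"] = 1
    grid["grid_p_num"] = 4
    grid["grid_p_xyz"] = np.zeros((4, 3))      # xyz-coordinates of grid points
    grid["grid_p_weights"] = np.ones(4)        # weights of grid points
    grid["status_io"] = "import"

    # active subsystem with one grid function defined on grid_id = 1
    sub = f.create_group("input/subsystem_1")  # assuming numbered array groups
    sub["label"] = "active"
    sub["name"] = "f1"
    gf = sub.create_group("grid_function_1")
    gf["grid_id"] = 1                          # links back to /input/grid/grid_id
    gf["property_label"] = "density"
```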
---
### FDE in scenario 2: `FDE_schema.txt`
* now: some redundancy wrt `DIRACschema.txt`:
```
*schema
input composite single required # definition of the calculation
result composite single required # results of the calculation
*end
*input
subsystem composite array required # information about each subsystem (on input)
grid composite single required # information about imported grids
*end
*result
execution composite single required # information about the run
subsystem composite array required # information about each subsystem (on output)
grid composite single required # information about exported grids
*end
*subsystem
molecule composite single optional # topology of the molecular system (optional=may not be known for all subsystems)
aobasis composite single optional # atomic orbital basis set descriptions (optional=may not be known for all subsystems)
grid_function composite array required # grid data and quantities imported/exported on that grid
label string single required # is this subsystem `active` or `frozen`?
name string single required # unique name or label for a subsystem
operators composite single optional # matrix representations of operators (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
fde_method composite single optional # FDE setup details
wavefunctions composite single optional # results for each wave functions that was optimized (useful if ever doing freeze-thaw inside DIRAC and wanting to update frozen subsystem data)
*end
*grid_function
grid_id integer single required # grid_id of a grid on which this grid function is defined/is to be defined
property_label string single required # property label
property_name real single generic # real-space representation of a property (on a specified grid; property_name is a string of length 16)
status_io string single optional # 'import' or 'export'
action string single optional # update/do not update
*end
*grid
grid_p_num integer single required # number of grid points
grid_p_xyz real single generic # xyz-coordinates of grid points
grid_p_weights real array required # weights of grid points
grid_id integer single required # unique grid id or hash
status_io string single required # 'import' or 'export'
action string single optional # prune/combine/compress/... (TODO)
*end
```
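This plain-text format is easy to parse; a toy Python parser, illustrative only (not the actual `utils/process_schema.py`):
```python
# Toy parser for the schema format above: each '*group' block contains lines
# of the form 'name kind arity requirement # comment'.
def parse_schema(text):
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop trailing comments
        if not line:
            continue
        if line == "*end":
            current = None
        elif line.startswith("*"):
            current = groups.setdefault(line[1:], [])
        elif current is not None:
            name, kind, arity, req = line.split()[:4]
            current.append({"name": name, "kind": kind,
                            "arity": arity, "requirement": req})
    return groups

# usage: parse_schema(open("FDE_schema.txt").read())["grid_function"]
```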
---
### Working with real-space data in DIRAC: FDE example (scenario 2)
![](https://i.imgur.com/36xGsMQ.jpg)
[miro board](https://miro.com/welcomeonboard/MGtwMEpvaTUzcUN1ZEF0SXZOMzVYbGZBZEhvem9MQjBXYTRWNEhYbUltd25XNWV5YmlHamc2b2dLOEdacDl5bXwzNDU4NzY0NTE1ODY0NTk1Njcy?invite_link_id=206001469506)
---
### This example from the inside: modifications in DIRAC
* in `src`: `fde/fde_checkpoint.F90` is a modified copy of `gp/checkpoint.F90`:
* we needed more flexibility (many variables are hardcoded in `gp/checkpoint.F90`)
* future: transform `gp/checkpoint.F90` into a generic module?
* in `utils`:
* `FDE_schema.txt`
* `pam.in`:
* `--mol="file1 file2 ..."`
* `--fdeh5`
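Putting the two flags together, an FDE run on this branch could be launched along these lines (input and molecule file names are illustrative):
```
pam --inp=fde.inp --mol="active.mol frozen1.mol frozen2.mol" --fdeh5
```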
---
### Real-space data in FDE and VISUAL - TODO list:
* if scenario 2 is chosen for FDE:
* read subsystem data from their `CHECKPOINT.h5` files
* rework `FDE_schema.txt` (avoid repetition wrt `DIRACschema.txt`)
* add support for grid manipulation:
* bind to external tools and/or write a separate 'grid' module (FDE and VISUAL)
* examples: compare grids to avoid double storage, combine/prune grids, enable adaptive grids (see the fingerprint sketch after this list)
* better handling of I/O of grid data (think of easy post-processing)
* overlap of FDE and VISUAL modules:
* come up with consistent schema for FDE and VISUAL grid data
* VISUAL as a library of real-space properties
* FDE:
* keep import/export of data important to FDE (densities + embedding potential)
* but call VISUAL if export of another real-space property is requested for a subsystem
* parallelization:
* keep track of distribution of grids and grid functions to MPI processes
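For the grid-comparison item above, one option is to derive `grid_id` from a content hash so that identical grids are detected and stored only once; a hypothetical sketch (not DIRAC code):
```python
# Fingerprint a grid from its points and weights: equal grids hash equal,
# so an exporter can skip grids that are already stored.
import hashlib
import numpy as np

def grid_fingerprint(points, weights, decimals=10):
    buf = np.ascontiguousarray(np.round(points, decimals)).tobytes()
    buf += np.ascontiguousarray(np.round(weights, decimals)).tobytes()
    return hashlib.sha256(buf).hexdigest()
```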
---
### Working with real-space data: inspirations
* logical grids (l-grids) and data grids (d-grids), [source: https://doi.org/10.1002/cpe.4165](https://doi.org/10.1002/cpe.4165):
* the l-grid carries information about the structure of, and updates to, the d-grids (parallelization, scalability):
![](https://i.imgur.com/vdloCCy.png =x400)
* data structures in VTK-m: [source](https://www.cs.uoregon.edu/Reports/AREA-201703-Kress.pdf):
![](https://i.imgur.com/G5FKhLL.png =x400)
---
### Storage of post-SCF data: response parameters
* step 1: bring the linear response data stored in the `PAMXVC` file to HDF5
* decision: extend `DIRACschema.txt` (scenario 1) or write a separate schema (scenario 2)?
* decision: labelling ideas for response:
* property labels used now in `src/prp`:
* variables: `PRPLBL` (internal) or `PRPNAM` (user defined)
* `PRPNAM` is what is stored on `PAMXVC`
* property labels used in `src/openrsp`:
```
type(prop_field_info) :: field_list(14) = & !nc an ba ln qu
(/prop_field_info('EXCI', 'Generalized "excitation" field' , 1, F, F, T, T), &
  prop_field_info('FREQ', 'Generalized "frequency" field'        , 1, F, F, T, T), &
prop_field_info('AUX*', 'Auxiliary integrals on file' , 1, F, F, T, F), &
prop_field_info('PNC' , 'PNC' , 1, F, F, T, F), &
prop_field_info('EL' , 'Electric field' , 3, F, F, T, F), &
prop_field_info('VEL' , 'Velocity' , 3, T, F, T, F), &
prop_field_info('MAGO', 'Magnetic field w/o. London orbitals' , 3, T, F, F, T), &
prop_field_info('MAG' , 'Magnetic field with London orbitals' , 3, T, T, F, F), &
prop_field_info('ELGR', 'Electric field gradient' , 6, F, F, T, F), &
prop_field_info('VIBM', 'Displacement along vibrational modes',-1, F, T, F, F), &
prop_field_info('GEO' , 'Nuclear coordinates' ,-1, F, T, F, F), & !-1=mol-dep
prop_field_info('NUCM', 'Nuclear magnetic moment' ,-1, F, T, F, T), & !-1=mol-dep
prop_field_info('AOCC', 'AO contraction coefficients' ,-1, F, T, F, F), & !-1=mol-dep
prop_field_info('AOEX', 'AO exponents' ,-1, F, T, F, F)/) !-1=mol-dep
```
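One possible (by no means decided) HDF5 layout for checkpointing linear response parameters, keyed by the `PRPNAM`-style labels stored on `PAMXVC` today; all paths and names below are illustrative:
```python
# Illustrative sketch: store each response vector under its property label
# and frequency.
import h5py
import numpy as np

with h5py.File("PRP_CHECKPOINT.h5", "w") as f:
    rsp = f.create_group("result/response/ZDIPLEN")  # PRPNAM-style label
    rsp["frequency"] = 0.0
    rsp["solution_vector"] = np.zeros(10)            # placeholder data
```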
---
### Extending labeled storage in DIRAC - general concerns:
1. code flexibility:
* easy-to-extend schema(s) (scenario 1? scenario 2? something in between?)
* support different strategies for working with real-space data:
* import real-space data to DIRAC:
* grid + data on that grid; no need to store anything else
* data can come from other software (e.g. ADF)
* (re)generate real-space data from checkpoint files:
* data can come from DIRAC only
* e.g.: unperturbed density can be generated from `CHECKPOINT.h5` on any grid
* perturbed densities can be generated if response data is also checkpointed
* easy mechanisms for updating the data (overwriting) and for tracking how it changes
* ensure easy integration with external codes
2. code maintainability:
* if scenario 2: generalize `utils/process_schema.py` and `gp/checkpoint.F90`?
* better documentation (what to update to extend the schema)
* consistent labelling, using variables for groups and labels, etc.
3. scalability:
* strategies for large grids
4. performance:
* favor storing or (re)generating the data?
* store data in one large `CHECKPOINT.h5` file or create many checkpoint files (`FDE_CHECKPOINT.h5`, `PRP_CHECKPOINT.h5`, etc.)?
* automate decisions about when to store/overwrite the data
* separate data computation from data I/O
* enable restarts of computations on grids
---