# Metadata and ontologies for potential-energy sampling

**Authors:** James Kermode, Tristan Bereau, Christian Carbogno, Omar Valsson, Chuanxun Su, Berk Onat

**Acknowledgements:** Markus Kühbach, Ron Miller

## Introduction

The existing NOMAD repository infrastructure provides an example of how to systematically store inputs and outputs of electronic-structure calculations in accordance with FAIR principles [GhiringhelliNPGCompMat2017, doi:10.1038/s41524-017-0048-5]. In general, the existing workflows for parsing input and output files, normalising the data, and generating metadata can be adapted to molecular dynamics (MD), building on existing conversion layers for MD codes already supported by NOMAD (comprising `LAMMPS`, `Gromacs`, `Amber`, `NAMD`, `GROMOS`, `CHARMM`, `Tinker` and `DL_POLY`). However, storing and sharing inputs and outputs for dynamical simulations, ranging from *ab initio* MD through empirical potentials to coarse-grained MD, raises a number of specific challenges in comparison to electronic-structure calculations. These challenges arise principally because empirical potential-energy sampling techniques are used to investigate a wider variety of scientific questions than electronic-structure techniques. They thus address larger length scales, longer time scales, and more complex phase-space exploration algorithms and workflows:

- (i) In many cases, the investigated systems feature thousands of atoms with complex short- and long-range order and disorder, e.g., describing microstructural evolution such as crack propagation. This requires large, complex simulation cells with a range of chemical species to be correctly described and categorized.
- (ii) In close analogy to electronic-structure approximations, force fields exist in a wide variety of flavors that require proper classification. On top of that, they allow for granular fine-tuning of the interactions, even for individual atoms. Faithfully representing complex force fields thus also requires capturing the topology that is often needed to define the actual interactions.
- (iii) Compared to first-principles calculations, the larger length and longer time scales allow for a greater variety of simulation protocols, using specific boundary conditions, thermostats, constraints, integrators, etc. This allows additional observables to be computed, both instantaneously and as statistical averages or correlations. Furthermore, various user-defined, possibly non-standard observables are often computed and stored. Representing these properties requires extending the existing NOMAD metadata infrastructure.
Practically, this also implies the need to efficiently store and access large volumes of data, e.g., large trajectories, including positions, and possibly also velocities and forces, for each atom at each time step.
- (iv) Eventually, the infrastructure described for workflows (cf. discussion of workflows, section **X.Y**) can be extended to workflows that feature a sequence of force-field-based simulations, e.g., for observables that can only be derived from replicated and/or collective simulations and ensembles such as parallel tempering.

## Current capabilities

For the purposes of illustration, we start by identifying some typical use cases, then describe what is currently implemented within the NOMAD infrastructure and metadata (see **Figure 1**) and what is still missing. The examples we adopt fall into two classes: (i) high-throughput systems which are individually *simple* (~1,000-10,000 particles) but where the value of sharing comes from the resulting ability to run analyses across many variants of, e.g., chemical composition or force field; (ii) *heroic* simulations of very large systems or very long time scales which cannot readily be repeated by other researchers and which are thus individually valuable to share.

<p style="text-align: center;"><img src="https://i.imgur.com/oB4FV7r.jpg" alt="drawing" width="500"/></p>

- **Figure 1:** A schematic metadata structure of the calculation context for workflows in NOMAD Meta Info. The different stages of the simulation are shown in two contexts: by entering new **sampling** and/or **constraints** information in each **simulation run**, or by adding multiple **sampling** and **constraints** sections that are either created or referenced in a single **simulation run** through the different frames of the MD simulation. Here, a simplified version of the metadata hierarchy is illustrated; for a more complete and detailed definition, please see the NOMAD Meta Info browser at https://metainfo.nomad-coe.eu/nomadmetainfo_public/archive.html.

Examples of the first, *simple* class could be MD simulations in the $NVT$ ensemble for systems such as liquid butane or bulk silicon, performed at different temperatures $T$ or volumes $V$, using different but standard, well-defined force fields (e.g., CHARMM and Stillinger-Weber, respectively). Quantities of interest are typically computed during the MD simulation (e.g., liquid densities). For flexibility, full trajectory files should also be stored (for these systems they would typically not exceed ~1 GB in size), but some important observables might be worth precomputing (e.g., radial distribution functions). The second, *heroic* class of very large systems could include multi-billion-atom MD simulations of dislocation formation [Bulatov, doi:10.1038/nature23472] or solidification [Shibuta, doi:10.1088/1361-651X/ab1e8b, doi:10.1088/1361-651X/ab1d28, doi:10.1038/s41467-017-00017-5], or very long time-scale simulations of protein folding [doi:10.1016/j.sbi.2013.12.006].
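Both classes of example would ultimately be represented by the hierarchical metadata structure sketched in Figure 1. As a purely illustrative sketch of that layout (the section names below follow the style of the public NOMAD Meta Info, e.g., `section_run`, `section_sampling_method`, `section_frame_sequence`, but the exact fields shown are assumptions for this example, not normative definitions), a single $NVT$ run might be recorded as:

```python
# Illustrative (not normative) layout of one MD upload as nested metadata
# sections, mirroring the "simulation run / sampling / frames" structure of
# Figure 1. Field names follow the style of NOMAD Meta Info but are simplified.
import json

archive_entry = {
    "section_run": {
        "program_name": "LAMMPS",                    # example MD engine
        "section_sampling_method": {
            "sampling_method": "molecular_dynamics",
            "ensemble_type": "NVT",
            "temperature_K": 300.0,                  # illustrative parameter
        },
        "section_frame_sequence": {
            "number_of_frames_in_sequence": 1000,
            # each frame would reference one single-configuration calculation
            "frames": [
                {"time_ps": 0.0, "potential_energy_eV": -1234.5},
                # ... further frames ...
            ],
        },
    },
}

print(json.dumps(archive_entry, indent=2))
```

The complete definitions in the NOMAD Meta Info browser referenced in the caption of Figure 1 contain many more quantities per section; the point here is only the hierarchical grouping of sampling, constraints, and frames within a simulation run.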
With reference to these examples, we consider how each of the *FAIR* principles can be achieved:

- **Findable**: Here, the existing infrastructure developed for electronic-structure calculations is an excellent starting point, although there is some scope for augmentation. All data associated with a given simulation are already assigned a unique and persistent identifier. Augmenting this with automatically detected meta-information about the composition, in terms of the chemical (e.g., Si; already implemented) or molecular (e.g., butane; not yet implemented but relatively straightforward) species present in the simulation cell, makes a data record searchable. Similarly, the existing infrastructure can be used to search and group calculations according to the employed force field, ensemble, and thermodynamic conditions (see also Interoperable).
- **Accessible**: Again, here we can reuse the electronic-structure machinery, linking all input and output files to the unique identifier of a simulation, as well as hashing the original input and output files.
- **Interoperable**: Supporting this principle requires conversion of all inputs and outputs to normalized formats to enable comparisons; this is already in place for standard force fields (e.g., CHARMM, AMBER) and MD ensembles (e.g., NVE, NVT, NPT).
- **Reusable**: Ensuring that data uploaded to a repository are reusable requires appropriate licence agreements to be in place, along with provenance information. Licensing information for proprietary codes and access permissions for stored information are already implemented in the existing NOMAD Meta Info. In some cases, more detailed information about the specific version of a code may be required, since local modifications are likely more common for MD codes than for DFT codes. This requirement could be addressed by storing the `git` SHA hash of the code version used, information which is output by many codes.

## Future Perspectives

We now move to a series of more complex use cases, encompassing a broader set of simulations for which we expect the current infrastructure to *not* yet abide by the FAIR principles. A number of challenges are outlined, namely: (i) complex systems; (ii) custom force fields; (iii) advanced sampling; (iv) other classes of dynamical simulations; (v) structure prediction; (vi) long time- and large length-scale simulations; and finally (vii) reproducibility.

The above-mentioned systems had a straightforward architecture: either bulk, or a (macro)molecule solvated in bulk water. Detecting the molecules present in the simulation box is straightforward and enables the generation of the relevant metadata to make the data Findable. Adding support for **more complex systems**, while of great interest, raises both scientific and technical challenges. First, geometries beyond the bulk are an essential extension toward interfaces and surfaces (e.g., water/oil or water/air interfaces). On the materials side, individual point (e.g., vacancies, solutes), line (e.g., dislocations), and planar (e.g., grain boundaries) crystal defects help reach systems of greater scientific and technological relevance. Characterising these systems will require an _ontology_ to describe such setups in human- and machine-readable forms. Beyond the description of chemical composition or space group, these complex systems will require the representation of more subtle properties, including microstructure, phase, and polycrystallinity. To enable automated upload of simulation data, attention will need to be paid to algorithms that automatically annotate structures, e.g., to identify surfaces, but also to more ambitious approaches such as labelling classes of atomic environments within a simulation cell.
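At the simpler end of such automated annotation, the molecular composition mentioned under Findable (e.g., recognising a box of butane molecules) can be detected from the bond graph of a configuration. The following is a minimal sketch, assuming ASE and SciPy are available; the input file name and cutoff scaling are illustrative choices, not part of any existing NOMAD normaliser.

```python
# Sketch: automatic detection of molecular species in a simulation cell, as
# could feed the molecular-composition metadata discussed above.
from collections import Counter

from ase import Atoms
from ase.io import read
from ase.neighborlist import natural_cutoffs, build_neighbor_list
from scipy.sparse.csgraph import connected_components

atoms = read("trajectory.xyz", index=0)              # hypothetical input file

# Bond graph from covalent-radius-based cutoffs
nl = build_neighbor_list(atoms,
                         cutoffs=natural_cutoffs(atoms, mult=1.2),
                         self_interaction=False)
adjacency = nl.get_connectivity_matrix(sparse=True).tocsr()

# Each connected component of the bond graph is treated as one molecule
n_molecules, labels = connected_components(adjacency, directed=False)
symbols = atoms.get_chemical_symbols()
formulas = Counter(
    Atoms([s for s, m in zip(symbols, labels) if m == mol]).get_chemical_formula()
    for mol in range(n_molecules)
)

# e.g. Counter({'C4H10': 64}) for a box of butane molecules
print(formulas)
```

Labelling defects, surfaces, or coarse-grained beads would require considerably more sophisticated criteria than this bond-graph heuristic, which is precisely where the proposed ontology work comes in.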
On the biomolecular side, we envision the systematic identification of structural units to be of great value, especially in the context of proteins, DNA, RNA, small molecules, etc. Finally, we stress the challenge of incorporating information for coarse-grained models, where multiple atoms are lumped into each particle. In coarse-graining, the link to the chemistry is sometimes unclear and possibly counterproductive (e.g., generic models are not chemically specific).

The simplest inclusion of metadata for **force fields** amounts to storing the force field's name, akin to what is currently done for exchange-correlation functionals in electronic-structure calculations. This strategy is both pragmatic and sufficient when relying on a common transferable force field (e.g., AMBER, CHARMM or OPLS). Several initiatives aim at representing force fields (e.g., OpenKIM in materials science [https://openkim.org/, doi:10.1007/s11837-011-0102-6] or OpenForceField for biomolecules). While a reasonably small number of transferable force fields accounts for a large share of research studies, several communities rely heavily on tailor-made force fields (e.g., for coarse-graining). Encoding *variations* of a force field could be achieved by a metadata description of both the functional forms (e.g., harmonic bond) and the parameters (e.g., spring constant). We note, for instance, that OpenMM stores force-field parameters in XML files. These sets of force-field parameters are often overwhelmingly large, due to the need to define many atom types, effectively encoding atoms in specific chemical environments. While force-field models and their driver codes are also accessible as packages through archives such as OpenKIM, OpenForceField and WebFF, there is no standard force-field metadata covering both topology-based and non-topology-based force fields. Each archive has its own metadata or model representation, which leads to redefining force-field parameters and topologies, or to writing specific codes to match the archive's hierarchical structure. Given the large number of parameters involved, this makes the manual uploading of interactions (e.g., in OpenKIM for topology-based force fields and in OpenForceField for non-topology-based many-body potentials) a significant effort. An exhaustive description of functional forms and parameters, or alternatively tabulated potentials, would provide the basis to compare whether two force fields are equivalent. Beyond these standard force fields, the incorporation of nonparametric models (e.g., machine learning) would require the storage of the entire model. However, the support of machine-learning models would pose new challenges, for instance in how to establish whether two such force fields are equivalent. Finally, we point to potential licensing issues with certain force fields that may not be publicly available, leading to reusability issues (e.g., POTCAR files from VASP).
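To make the idea of encoding functional forms and parameters concrete, the sketch below records a single harmonic bond term as structured, unit-annotated metadata and serialises it to JSON. The field names, values, and provenance entries are purely illustrative assumptions; they do not correspond to an existing NOMAD, OpenMM, or OpenKIM schema.

```python
# Hypothetical, minimal encoding of one force-field interaction term as
# metadata. Field names are illustrative only; units are stated explicitly to
# keep the record self-describing and comparable across codes.
import json

harmonic_bond = {
    "interaction_type": "bond",
    "functional_form": "harmonic",          # E(r) = 0.5 * k * (r - r0)**2
    "atom_types": ["CT", "HC"],             # example atom types, force-field specific
    "parameters": {
        "k":  {"value": 284512.0, "unit": "kJ/mol/nm^2"},   # example spring constant
        "r0": {"value": 0.1090,   "unit": "nm"},            # example equilibrium length
    },
    "source": {
        "force_field": "custom-variant-of-a-transferable-FF",  # illustrative provenance
        "citation": None,
    },
}

print(json.dumps(harmonic_bond, indent=2))
```

Whether two force fields are equivalent could then, in principle, be checked by comparing such normalised records term by term (after unit conversion), rather than by comparing code-specific input files.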
Beyond simulations in a standard thermodynamic ensemble (e.g., $NVT$ and $NPT$), we would like to describe **advanced-sampling methods**. In a first stage, simulations could describe evolving thermodynamic ensembles, for instance during temperature annealing. On the technical level, this would require a top-level grouping, i.e., a calculation context for the workflow in the metadata, to represent the different stages of an MD simulation (see **Figure 1**). The thermodynamic parameters could then be defined for part of a simulation, or even per frame. Next, we consider enhanced-sampling schemes that aim at biasing the conformational sampling towards specific regions. Such schemes might, for example, be: (i) methods based on collective variables, e.g., metadynamics [doi:10.1073/pnas.202427399, doi:10.1146/annurev-physchem-040215-112229] or umbrella sampling [doi:10.1016/0021-9991(77)90121-8]; (ii) methods based on trajectories, e.g., transition path sampling [doi:10.1016/j.actamat.2017.02.027, doi:10.1038/nature15364] or forward flux sampling [doi:10.1088/0953-8984/21/46/463102]; and (iii) free-energy methods, such as alchemical transformations. This would require adding support for popular MD plugins such as PLUMED [doi:10.1016/j.cpc.2013.09.018], SSAGES [doi:10.1063/1.5008853], COLVARS [doi:10.1080/00268976.2013.813594], OpenPathSampling [doi:10.1021/acs.jctc.8b00626], and others that are used to drive such enhanced-sampling simulations. In this regard, it would also be interesting to connect to the PLUMED-NEST [https://www.plumed-nest.org/], the public repository of the PLUMED consortium [ref add later], for example by allowing for automatic uploading of PLUMED input files to the PLUMED-NEST when uploading to NOMAD.

The complexity associated with many MD simulations would benefit from, and ideally demands, a connection to _workflows_: a complex MD script may contain several steps (e.g., energy minimization, annealing, equilibration, or production). Incorporating these in a workflow could offer several conceptual and technical advantages. For example, workflows could be compared with similar implementations to identify possibly error-prone workflows or to improve complex, state-of-the-art simulation techniques.

The support of **other classes of dynamical simulations** would entail several aspects. For instance, the incorporation of Monte Carlo simulations would require the types of moves attempted to be stored. Linking back to enhanced sampling, the simulation of multiple replicas that are dynamically connected (e.g., parallel tempering) would require extra metadata to record the swaps that occurred. Further, more complex thermodynamic ensembles such as grand-canonical MD would require some adaptation, since the number of atoms in the simulation cell is no longer constant. Finally, we note that the incorporation of multiscale methods such as QM/MM would largely rely on a combination of the metadata from the two types of calculation involved (i.e., electronic-structure and classical), along with a description of the coupling method.

**Structure prediction** methodologies such as CALYPSO [doi:10.1016/j.cpc.2012.05.008], AIRSS [doi:10.1088/0953-8984/23/5/053201], and USPEX [doi:10.1016/j.cpc.2006.07.020] are becoming increasingly popular and useful. These methods typically generate thousands of distinct atomic structures per run, whose geometries are then optimized with a DFT code. The global minimum and the structures in the low-energy range are expected to be stable or metastable, and are selected for further analysis. Structure prediction is able to determine atomic structures, and many predicted structures have since been confirmed by experiments [doi:10.1103/PhysRevLett.106.015503, doi:10.1103/PhysRevLett.106.145501, doi:10.1073/pnas.1119375109, doi:10.1103/PhysRevLett.110.136403, doi:10.2138/am.2012.3973].
However, large numbers of such calculations are not well archived and are barely reused. Every structure-prediction calculation therefore starts from scratch, and the valuable data from previous calculations cannot be exploited. The numerous structures collected from structure prediction could be used to train machine-learning models, so that subsequent calculations can build on this experience and explore the potential-energy surface more efficiently. A previous work proposed collecting and reusing the theoretical structures generated by structure prediction [doi:10.1088/1361-648X/aa63cd], but little progress has been made since. Archiving these calculations in a database and creating the relevant metadata would be an excellent solution. The metadata could record the key parameters of the structure prediction, such as chemical composition, formula unit, volume, minimum interatomic-distance constraints, pressure, etc. Optimizing these structures generates large numbers of output files (e.g., OUTCAR files from VASP), and storing all of them could be a major challenge. This could be addressed by extracting all potentially useful information from the output files, so that the raw files no longer need to be stored.

For **long time-** and **large length-scale simulations**, several questions arise: How should we deal with these simulations, where the amount of data produced becomes too large to systematically store and share? Can we afford to store and share all of it? If the storage is limited, how can we identify the significant and crucial parts of the simulation and store them in a reduced form? Keeping the full data locally and sharing the metadata together with only the most important parts of the simulations would be a viable alternative, assuming the different servers provide enough redundancy. Standard analysis techniques, such as similarity analysis or monitoring the dynamics, can also be used to identify changes in structure and dynamics so that only the significant frames, or specific regions, of an MD simulation are stored (e.g., QM/MM models use large MM buffer-atom regions that may not need to be stored in full); a sketch of such frame selection is given below. Furthermore, the cost/benefit of storing versus running a new simulation must be weighed. On the other hand, researchers may soon face increased constraints from funding agencies to store their data for a number of years, in which case the present endeavour offers a convenient implementation. We also note ongoing developments on compression algorithms for trajectories, e.g., [doi:10.1021/acs.jcim.8b00501].
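As a minimal illustration of such data reduction, the sketch below keeps a trajectory frame only when it differs from the last stored frame by more than a chosen displacement threshold. The file names, the threshold, the use of ASE, and the neglect of periodic boundary wrapping are all simplifying assumptions for the purpose of the example.

```python
# Sketch: store only "significant" frames of a trajectory, defined here as
# frames whose maximum atomic displacement relative to the last stored frame
# exceeds a threshold. Periodic boundary wrapping is ignored for simplicity.
import numpy as np
from ase.io import read, write

frames = read("trajectory.xyz", index=":")   # hypothetical full trajectory
threshold = 0.5                              # Angstrom, illustrative choice

stored = [frames[0]]
for frame in frames[1:]:
    # Maximum per-atom displacement since the last stored frame
    displacement = np.linalg.norm(
        frame.get_positions() - stored[-1].get_positions(), axis=1)
    if displacement.max() > threshold:
        stored.append(frame)

print(f"kept {len(stored)} of {len(frames)} frames")
write("reduced_trajectory.xyz", stored)      # reduced data set to archive
```

More sophisticated criteria, e.g., structural-similarity measures or observable-based checks, could replace the displacement test without changing the overall pattern.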
Finally, we share some thoughts on the **reproducibility** of MD data. Unlike many examples from electronic-structure calculations, where quantities can be compared precisely and deterministically, reproducibility in MD simulations is often of interest in a statistical sense only (e.g., the conformational average of an observable or a free energy). Ongoing efforts to systematically test interatomic potentials and software, such as OpenKIM, can be complementary in helping to ensure reproducibility of various aspects of an MD simulation. While containerisation of codes (e.g., Docker or Singularity) could help provide the exact software environment used for an MD simulation, describing all the software and library versions of the running environment in metadata and storing clones of the available source codes in local archives should be enough to achieve reproducibility, at least in a statistical sense, within a certain tolerance. This could be formalised using metrics such as the Kullback-Leibler divergence to compare the probability distributions associated with a particular quantity of interest (a minimal sketch of such a comparison is given after the summary below).

To summarize, we associate the main challenges discussed above with the letters of *FAIR* as follows:

- **Findable**: Custom force fields, Complex systems, Advanced sampling
- **Accessible**: Longer and larger simulations, Complex state-of-the-art simulations, Custom workflows
- **Interoperable**: Custom force fields, Advanced sampling, Reproducibility
- **Reusable**: Custom force fields, Custom workflows and sampling
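As an illustration of the statistical comparison mentioned above, the following sketch estimates the Kullback-Leibler divergence between the distributions of a scalar observable (e.g., a liquid density or a collective variable) sampled in two nominally equivalent simulations. The synthetic data, bin count, and acceptance tolerance are illustrative assumptions; in practice the observable would be read from the archived trajectories.

```python
# Sketch: statistical comparison of two MD runs via the Kullback-Leibler
# divergence between histograms of a scalar observable. All numbers are
# illustrative stand-ins for archived simulation data.
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
obs_run_a = rng.normal(1.00, 0.05, size=50_000)   # stand-in observable, run A
obs_run_b = rng.normal(1.01, 0.05, size=50_000)   # stand-in observable, run B

# Common binning for both runs; a small offset avoids log(0) in empty bins
bins = np.linspace(min(obs_run_a.min(), obs_run_b.min()),
                   max(obs_run_a.max(), obs_run_b.max()), 100)
p, _ = np.histogram(obs_run_a, bins=bins, density=True)
q, _ = np.histogram(obs_run_b, bins=bins, density=True)
p, q = p + 1e-12, q + 1e-12

kl = entropy(p, q)     # D_KL(P || Q); 0 means identical distributions
print(f"KL divergence: {kl:.4f}")
if kl < 0.05:          # illustrative tolerance
    print("statistically reproducible within the chosen tolerance")
else:
    print("distributions differ beyond the chosen tolerance")
```

The appropriate tolerance, and indeed the appropriate divergence measure, would itself need to be agreed upon and recorded as metadata alongside the compared quantity.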