Draft specifications for trajectories

# Draft specifications for trajectories

Live document used during the [2021 OPTIMADE workshop](https://www.cecam.org/workshop-details/52).

## Things to add

- allow to "link" a structure to the trajectory it belongs to (if any), and trajectory frames with a structure explicitly defined (if any)

## Open questions

- Do we need to differentiate between not sampled positions, and non existing atoms?
  - a null at the highest level means the property is unknown. nulls at lower level can have meaning.
  - if `cartesian_site_positions=null` or is not defined: it means not sampled
  - if `cartesian_site_positions` is a `nframes*nsites*3` array, some positions can be null and this means that atom is not in the simulation for that frame
    - REQUIREMENT: either all three xyz are null or they are all floats.
    - if a coordinate vector == `[null,null,null]` the particle is not in the box
  - note: to return missing frames, the WHOLE frame should be null even if it's an array. E.g. for `lattice_vectors`:

    ```
    [
      [[1,0,0],[0,1,0],[0,0,1]],
      null,
      [[1,0,0],[0,1,0],[0,0,1]],
      null,
      ...
    ]
    ```

    and *NOT*:

    ```
    [
      [[1,0,0],[0,1,0],[0,0,1]],
      [[null,null,null],[null,null,null],[null,null,null]],
      [[1,0,0],[0,1,0],[0,0,1]],
      [[null,null,null],[null,null,null],[null,null,null]],
      ...
    ]
    ```

    (this syntax is reserved for site_positions, to say an atom is not in the box).
- We should make the data structure flexible enough to accommodate grand canonical data, but not define it in the standard yet. Mention how (with null positions, fixed nelements/nsites, etc.)
  - -> discuss the condition on the i-th atom being always the same (e.g. it holds unless the atom "disappears from the simulation") [what happens if one extracts a subset of a trajectory? ~~Or *mark* atoms that don't have this property, i.e. are part of a reservoir~~]
- allow only some of the particle positions to be returned. For example, not returning the positions of solvent molecules.
  - -> let the client decide which ones from the reference frame, then allow to provide a list of atom numbers (0-based).
- allow data in json format to be transferred in blocks where client and server can apply a maximum chunk size. Yes
- we should make tools that can convert our json format to common trajectory formats, reusing existing (python?) libraries where possible (e.g. exporting to a very well defined and general format, and then letting the library convert to the others, or even integrating/adding the OPTIMADE format to those libraries).
- Should users be able to ask for a range of frames (using the frame number, i.e. a counter)? YES, ideally required (SHOULD), but it may need more discussion.
  - if SHOULD: /info should tell if the feature exists
  - if SHOULD: discuss also how bulk works
  - Is it allowed for the server to return only a subset of what was asked (e.g. max size)?
    - Yes, with a `next` metadata link (using the same querystring syntax as for frames)
    - This will happen only for DBs that can return subsets
  - Is there a way for the client to request a maximum size of the request? YES
    - e.g. the client could get the response size from the reference frame, estimate the maximum number of frames it wants from a request, and ask that as a limit
    - but actually this does not need additional features, it is just asking for a range of frames (note that the client might still need to follow the `next` link if the server paginates the frames)
  - Also: what if I ask max_frame_number > nframes?
    - ERROR, similarly for negative numbers, max < min, ...
- Can users ask for a single frame, different from the server-decided reference frame? E.g. to take only frame 1: `/optimade/v1/trajectories/ID/data?frame_range=10:20:1&response_fields=cartesian_site_positions,lattice_vectors,...`
  - use the syntax used for the (optional) feature above to get frame ranges (only if the server has the feature of course)
  - To be rediscussed?
    Shyam's suggestion of a special endpoint that acts exactly like a structure endpoint. Reason: if the feature is optional, the endpoint will not be used.
- Can we assume that other properties are only sampled at frames for which we also have site_positions, and not e.g. with a denser sampling, different sampling, ...? NO
  - Note that by using an "explicit" type for the cartesian_site_positions, it's actually possible to skip some frames if the positions were not sampled, by setting the values to `null` (or using one of the extensions of option 2.2); the assumption remains that there is a "common" frame index (that is, you don't have times sampled on a grid, positions in a completely different incommensurate grid, ...)
- time: Can we filter on it? Especially if there is not a regular timestep
  - if the query is "give me the first ps": no need to "query", better to ask JUST for the time array (should be small) and then reuse it to query for the right frames
  - We should think if there is some general query to do (considering that time might be optional)
- lattice_vectors and cartesian_site_positions for the full trajectory?
  - possibility to define lattice_vectors only once (constant) for the whole trajectory
  - return also the times (and the frame ids)
  - some simple syntax to return subsets of the trajectories (see above and below)
- file types: Do we want to support multiple file types?
  - reuse the format specification of OPTIMADE in the `/info` endpoint? See e.g. https://optimade.materialsproject.org/v1/info/structures that returns `{"data":{"formats":["json"],"description": ...`
  - there are two formats: one is the response format (JSON, but could be XML) -> format on the main endpoint; there could be a /files or /download endpoint, and this could have specific (even binary) formats.
  - One specific format (JSON or similar) will be mandatory (so all implementations have it) -> will be the one discussed below in the specification.
    Others are optional (and can be inspected in /info, and can also contain subsets of information, e.g. only the coordinates xyz, or binary). Reason: efficiency.
  - If we allow multiple file types to be returned, how do we inform the user which file types are available? In `/optimade/v1/trajectories/ID/data/info`?
  - See also issue 360 on GitHub
- constraints (constant stress, ...) or properties that change during the simulation (temperature) -> discussion on which metadata we want to have -> all of these are "calculation" metadata; we need a specific discussion on these.
  - For temperature, we might need to define at least two (instantaneous and target from thermostat)
  - When there is a thermostat, how to distinguish a target value that is not sampled and/or stored from one that is just not kept constant
- We removed this sentence from `nframes`, to think if there will be an equivalent array for which we can query the length:
  - Note: queries on this property can equivalently be formulated using `elements LENGTH`{.filter-fragment}. -> REMOVE
- Do we need assemblies? Maybe remove and add only if needed. Or for now keep among the constant things like elements, ...
- Do we want a format where all properties (nelements, elements, nspecies, ...) except the reference structure can be specified at each frame?
  - In this case, the reference frame should also contain all general properties.
  - Example:

    ```json
    {
        "reference_frame": {
            # The content is IDENTICAL to a structure
            "nelements": 2,
            "elements": ["H", "O"],
            ...
            "cartesian_site_positions": ...,
            "lattice_vectors": ...,
        },
        "reference_frame_id": 10,
        "nframes": 100,
        ...
    }
    ```

  - Queries will run at this global "trajectory level": all queries on structures apply *only* to the reference_frame. This reference frame has to be one of the structures in the trajectory (add note: if no specific choice, use frame number 0).
  - We will not allow (for now) to query specific frames, only the reference frame and the global trajectory metadata.
- do we want trajectories to be linked such that, if we have a long trajectory split into pieces, you know the previous_trajectory and next_trajectory chunk? MAYBE NOT, only if required by a specifically defined usecase (can be extended) (but fix names to avoid confusion with `next` (for frames/pagination))
  - if instead the goal is to "group" trajectories: move this to a different issue (there is the same issue for structures)
- revisit the syntax of the sparse_explicit examples (zero-based, improve names, decide if things should instead be ordered by frame)

## Ideas for extension

- TODO: discuss here the outcome of discussions on how to extend, for things that we don't implement now
- Do we want to introduce particle/site ids?
  - Now particles are identified by the order of the sites. In case the number of sites/particles changes it is not possible to follow sites in this way.
  - It should be optional
  - describe a usecase (describe in the context of grand-canonical)

### Ideas that go beyond the scope of trajectories

- We need to add the concept of residues (not changing over a simulation) -> This is not specific for trajectories but also applies to structures, so it would perhaps be better to discuss this elsewhere.

### List definitions

List of terms we want to define that are specific to trajectories:

- trajectory: a trajectory is a list of frames
- frame: each frame is essentially a structure; but all frames in the trajectory share a certain number of common properties (see below)
- frame number: a sequential integer from 0 (or 1?) to (N-1) (or N) enumerating the frames, where N is the total number of frames
- total number of frames: see above
- frame time: the time (in ps?)
  of that frame, from some "arbitrary" origin, but consistent across all frames of a given trajectory
- frame id **REMOVE THIS CONCEPT**, just clarify when speaking about the frame number that this is just the array position, and will in most cases be different from the frame ID returned by the underlying simulation code: ~~an integer that can be used by the server to distinguish frames (e.g. it could be an index of the moves in a MC simulation, it can be the internal number used in an MD integrator where ids can be non-sequential, if e.g. only 1 frame every 10 is stored, ...): e.g. 1000, 1010, 1020, ...~~

Trajectory Entries
------------------

A Trajectory is an ordered list of structures that are related, e.g. created by the same procedure.
Some examples include molecular dynamics trajectories, relaxations of a molecule or crystal structure, Monte Carlo simulations, etc.
It is assumed that the number of atoms and the type of atoms do not change during the trajectory (type of species, their order and composition, ...) **MAYBE DISCUSS ONLY IF (OR NOT) IT'S POSSIBLE TO DESCRIBE VARIABLE NUMBER OF ATOMS**.
In addition, it is assumed that the order of sites does not change during the trajectory (so, e.g. one can compute the MSD of a site (an atom) by looking always at the i-th site, across all frames).

`trajectories`{.entry} entries (or objects) have the properties described above in section [Properties Used by Multiple Entry Types](#properties-used-by-multiple-entry-types), as well as the following properties:

The following properties have the same definitions as described in the structures entry point. ~~They are assumed to be the same for all frames of the trajectory. This means that they are to be considered as a "global" property of the trajectory entry, and not specific to a single frame~~

EXPLAIN WHAT IS CONSTANT:

    'reference_frame': {
        ALL THE SAME AS A STRUCTURE
        cartesian_site_positions
        lattice_vectors
        elements
        nelements
        ...
    }

- elements
- nelements
- elements_ratios
- chemical_formula_descriptive
- chemical_formula_reduced
- chemical_formula_hill
- chemical_formula_anonymous
- dimension_types
- nperiodic_dimensions
- nsites
- species_at_sites
- species
- assemblies
- structure_features
- ~~cartesian_site_positions_reference_frame~~ [NO: just `cartesian_site_positions`, and put all bullet points inside a dictionary `reference_frame`]
  - **Description**: Cartesian positions of each site in one reference frame from the trajectory. This reference frame MAY be one of the frames (see reference_frame_number for more information). The goal is to provide an example frame to give a quick impression and visualize it.
  - For the rest of the specification, see `cartesian_site_positions` in the `structures` endpoint.
- ~~lattice_vectors_reference_frame~~ [NO: just `lattice_vectors`, and put all bullet points inside a dictionary `reference_frame`]
  - **Description**: The three lattice vectors in Cartesian coordinates, in ångström (Å), for the reference frame from the trajectory. This reference frame MAY be one of the frames (see reference_frame_number for more information).
  - For the rest of the specification, see `lattice_vectors` in the `structures` endpoint.
- reference_frame_id [YES, and keep it outside the `reference_frame` dictionary]: an integer indicating the ID of the frame used to return the coordinates in `cartesian_site_positions_reference_frame`. If set to `null`, it indicates that the reference frame is not part of the trajectory frames (for instance, if the reference frame comes from an experiment).

### nframes

- **Description**: Number of frames in the trajectory as an integer.
- **Type**: integer
- **Requirements/Conventions**:
  - **Support**: SHOULD be supported by all implementations, i.e., SHOULD NOT be `null`{.val}.
  - **Query**: MUST be a queryable property with support for all mandatory filter features.
  - The integer value MUST be equal to the length of the trajectory, that is, the number of frames.
  - The integer MUST be a positive non-zero value.
- **Examples**:
  - `3`{.val}
- **Querying**:
  - A filter that matches trajectories that have exactly 100 frames: `nframes=100`{.filter}.
  - A filter that matches trajectories that have between 100 and 1000 frames: `nframes>=100 AND nframes<=1000`{.filter}.

## Getting the actual data of a trajectory

- Option to get data on the trajectory, allowing to get multiple properties in one shot (but also to select which ones): `/optimade/v1/trajectories/ID/data?response_format=...&response_fields=..,..,..`
  - Includes the following properties (see specifications and full list below):
    - cartesian_site_positions
    - lattice_vectors
    - time
    - frame_number
    - frame_id
    - ...
  - Default format is JSON but more are possible
  - This endpoint should always return data at least for the JSON response format (but which fields are returned is still to be discussed)
  - A filter could be frame_range=20:200:5 to get frames from 20 to 200 every 5 (syntax to be discussed)
    - Only one range can be defined; to get more, run more REST API requests
- `/optimade/v1/trajectories/ID/file?response_format=AMBER`
  - This would already return a file (possibly binary) in the AMBER format (this is just an example), containing (only) the fields required by that format (e.g. site positions but not time)
  - Name: file vs. download vs. ...; to be extended also to structures
  - Optional; use /info to declare which formats exist

### Specification of the properties of the trajectory data

#### cartesian_site_positions

- `n_actual_frames*nsites*3` array, same format as for the structure endpoint (angstrom, ...)
- Order: xyz index loops fastest
- n_actual_frames is either n_frames or the number of frames requested with the filter frame_range

#### lattice_vectors

- 3x3 array, same
  - => we might need a way to express the fact that lattice_vectors are constant throughout the trajectory (probably: go for option 1 - TASK: document the rationale of the decision) (to discuss: properties dumped by the code only every X steps -> maybe just use `null`)

We first discuss in option 1.x whether to group data by frame (and, for each frame, provide all properties, like 1.2, 1.3), or to group by property (and for each property, provide an array with that property at all frames, like 1.0, 1.1)

## Option 1.0 (example with 2 sites)

```
{
    'cartesian_site_positions': [
        # Frame 1
        [[0, 0, 0], [1, 1, 1]],
        # Frame 2
        [[0, 0, 0], [1, 1, 1]],
    ],
    'lattice_vectors': [
        # Frame 1
        [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        # Frame 2
        [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    ]
}
```

Advantages:

- Simple
- data is stored per type, making it easier to process for the client.

Disadvantages:

- If a property is not present at each frame a lot of null values would have to be stored.
- All the data has to be read to obtain all information belonging to 1 frame

## Option 1.1 (example with 2 sites)

```
{
    'cartesian_site_positions': {
        'type': 'explicit',
        'values': [
            # Frame 1
            [[0, 0, 0], [1, 1, 1]],
            # Frame 2
            [[0, 0, 0], [1, 1, 1]],
        ]
    },
    'lattice_vectors': {
        'type': 'explicit',
        'values': [
            # Frame 1
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
            # Frame 2
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        ],
    },
    'temperature': {
        'type': 'constant',
        # The same for all frames
        'values': 100,
    }
}
```

Advantages:

- Having a type allows different ways of organizing the values. There no longer has to be a value for each frame

Disadvantages:

- If a property is not present at each frame a lot of null values would have to be stored.
- All the data has to be read to obtain all information belonging to 1 frame

## Option 1.2 (example with 2 sites)

```json
{
    'response_fields': ['cartesian_site_positions', 'lattice_vectors'],
    'data': [
        # Frame 1
        [
            # positions of frame 1
            [[0, 0, 0], [1, 1, 1]],
            # lattice vectors of frame 1
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        ],
        # Frame 2
        [
            # positions of frame 2
            [[0, 0, 0], [1, 1, 1]],
            # lattice vectors of frame 2
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        ],
    ]
}
```

## Option 1.3 (example with 2 sites) -> cost of additional data transfer is acceptable?

```json
{
    'data': [
        # Frame 1
        {
            # positions of frame 1
            'cartesian_site_positions': [[0, 0, 0], [1, 1, 1]],
            # lattice vectors of frame 1
            'lattice_vectors': [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        },
        # Frame 2
        {
            # positions of frame 2
            'cartesian_site_positions': [[0, 0, 0], [1, 1, 1]],
            # lattice vectors of frame 2
            'lattice_vectors': [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        },
    ]
}
```

#### Discussion on chosen format (to be moved to section about future extensions?)

We also assume that we chose option 1.1 above. Note that this has the disadvantage that the whole data must be in memory even just to dump the JSON, but it allows a standard format, and one can ask for a subset of the trajectory (similar to pagination).

We might want to define a different format, ideally not JSON but something easier to stream (e.g. one JSON per line, each JSON is a frame), that does not require loading everything in memory.
Example (first line MUST be the header, the rest MUST be the frames in order; this could be called `streamable-json`):

```json
{'type': 'header', 'lattice_vectors': {'type': 'constant', 'value': [[1,0,0],[0,1,0], [0,0,1]]}}
{'frame_id': 1, 'cartesian_site_positions': [[0,0,0], [1,1,1]], 'temperature': 100}
{'frame_id': 2, 'cartesian_site_positions': [[0,0,0], [1,1,1]], 'temperature': null} # Might also skip 'temperature' if not defined as frames
{'frame_id': 3, 'cartesian_site_positions': [[0,0,0], [1,1,1]], 'temperature': 100}
{'frame_id': 4, 'cartesian_site_positions': [[0,0,0], [1,1,1]], 'temperature': null}
```

This can be read as a stream: you parse one frame as soon as you find a newline.

### Ways to pass properties with "holes" (i.e. not sampled at every frame)

The assumption here is that other properties are only subsampled at frames for which we have site_positions, and not e.g. with a denser sampling, different sampling, ...

**We will discuss extensions to case 1.1 above.**

**NOTE ABOUT BULK** when pagination is not available on the server:

- the server can still print the JSON as a sequence of JSONs, one per "group" of frames, up to a manageable size; each group has syntax 1.1 or 2.2
- the client should be able to communicate a maximum "size" of the group (e.g. max number of frames?). This is ok because the server can do less if not able to. It's a similar concept to the max page size (it's not the max_frame_number!)
- Question: should we enforce that every "group" has the *same* number of frames (except the last) or is it OK for the server to dynamically change it? [maybe no need to add this requirement]

Suggestion: only support Option 2.1 as it's simpler, and we can extend ~~later~~ with 2.2 if it's really important (actually: implement some of 2.2, to be decided which (e.g. explicit_general_sparse, and the regular one?)).

Note: also other properties that now are constant (e.g.
species at site) might in future versions have the same syntax (constant, explicit, ...). In this case we can enforce that all constant properties (species, ...) must be returned by this endpoint as well.

#### Option 2.0

```json
{'energies': [null, 0.12223, null, -.32343, null, 0.42342423],
}
```

#### Option 2.1

```json
{'energies': {
    'type': 'explicit',
    'values': [null, 0.12223, null, -.32343, null, 0.42342423],
    },
 'temperature': {
    'type': 'constant',
    'value': 100.0,
    },
 'time': {
    'type': 'linear',
    'offset': 1.2,
    'step_size': 0.01
    },
}
```

Note that this is just an example, but any property can have any type, e.g.

```json
'time': {
    'type': 'explicit',
    'values': [0.01, 0.03, 0.08, ...]
},
```

#### Option 2.2 (example with 2 sites)

Extension to more general and compact syntaxes (data can still be expressed by using `null` with 2.1, but this makes things more compact) for types `explicit_regular_sparse`, `explicit_general_sparse`, `explicit_general_sparse_segments`.

```json
{
    'cartesian_site_positions': {
        'type': 'explicit',
        'values': [
            # Frame 1
            [[0, 0, 0], [1, 1, 1]],
            # Frame 2
            [[0, 0, 0], [1, 1, 1]],
        ]
    },
    'lattice_vectors': {
        'type': 'explicit',
        'values': [
            # Frame 1
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
            # Frame 2
            [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        ]
    },
    'energies': {
        'type': 'explicit',
        'values': [
            0.1234,
            -3.443,
        ]
    },
    'kinetic_energies': {
        'type': 'explicit_regular_sparse',
        'offset': 2,  # this means starting at frame ID 2
        'step': 2,    # and sampled at every 2 steps: 2, 4, 6, ..
        'values': [
            0.1234,
            -3.443,
            2.123,
        ]
    },
    'potential_energies': {
        'type': 'explicit_general_sparse',
        'frame_ids': [2, 4, 10],  # explicitly specify the frame IDs of the values below
        'values': [
            0.1234,
            -3.443,
            1.2
        ]
    },
    'temperatures': {
        'type': 'explicit_general_sparse_segments',
        # explicit value for frames 1 (T=100), 2 (T=200), 3 (T=300), 4 (T=400),
        # then constant with value 500 for frames 5 to 10,
        # then unknown/set to `null` from 11 to 20, then unknown from 21 to 30,
        # then T=900 from 31 to 40
        # note: there should be validation that all frame IDs are valid
        # (>= 0 or 1, <= nframes-1 or nframes, and non-overlapping)
        'frame_ids': [[1], [2], [3], [4], [5,10], [11,20], [31,40]],
        'values': [
            100,
            200,
            300,
            400,
            500,
            null,  # null means that for that range, we don't know the value (e.g. it's not sampled or not set)
            900
        ]
    },
}
```

#### Option 2.3

```json
{'energies': [0.12223, -.32343, 0.42342423],
 'energy_step_ids': [2, 4, 6]
}
```

#### Option 2.4

```json
{'data': [
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
    },
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
        'energy': 0.12223,
    },
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
    },
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
        'energy': -.32343,
    },
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
    },
    {
        'cartesian_site_positions': [[0,0,0], [1,1,1]],
        'energy': 0.42342423,
    }
]}
```

#### Option 2.5

```json
{'energies': [
    {
        'frame_id': 2,
        'value': 0.12223,
    },
    {
        'frame_id': 4,
        'value': -.32343,
    },
    {
        'frame_id': 6,
        'value': 0.42342423,
    }
]}
```

### Closed Questions

- Do we want to return trajectories as frame major or data major?
  - e.g.
    - Frame Major

      ```json
      {
          'response_fields': ['cartesian_site_positions', 'lattice_vectors'],
          'data': [
              # Frame 1
              [
                  # positions of frame 1
                  [[0, 0, 0], [1, 1, 1]],
                  # lattice vectors of frame 1
                  [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
              ],
              # Frame 2
              [
                  # positions of frame 2
                  [[0, 0, 0], [1, 1, 1]],
                  # lattice vectors of frame 2
                  [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
              ],
          ]
      }
      ```

    - Data Major

      ```json
      {
          'cartesian_site_positions': [
              # Frame 1
              [[0, 0, 0], [1, 1, 1]],
              # Frame 2
              [[0, 0, 0], [1, 1, 1]],
          ],
          'lattice_vectors': [
              # Frame 1
              [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
              # Frame 2
              [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
          ]
      }
      ```

  We are going to serve the data in data-major order.
- Should the reference structure be optional? NO
- Should the frame number be 0-based or 1-based? 0 BASED
  - notes:
    1. say in the specs that everything but pages is zero-based in OPTIMADE?
    2. check below that we used the wording "frame id" and "frame number" correctly
- Should we indicate at the info endpoint that a server has trajectory data? YES
- Should we include which properties exist and their `type`, so you know already which properties in the frames are constant? YES
  - Should this go in `/optimade/v1/trajectories/ID/data/info`? (this e.g. can be used to know which are the valid `response_fields` to ask back) YES
- Do we want to support the concept of a frame id? NO, but databases can provide their own database-specific keyword.
- How do we query for frames between 10 and 20 inclusive?
  - first_frame=10
  - last_frame=20
  - frame_step=1
- How do we query for even frames between 10 and 20 inclusive?
  - first_frame=10
  - last_frame=20
  - frame_step=2
- Do we allow to get one structure only in the *same* way as a structure? (I have put it here after the general discussion)
  - e.g. `/optimade/v1/trajectories/ID/structure?frame=123`
    NO. It would not offer functionality that is not already present by requesting a single frame.
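
As an illustration of the `first_frame` / `last_frame` / `frame_step` answers above, the following is a minimal sketch of how a server might turn these parameters into zero-based frame numbers. It assumes an inclusive `last_frame`, and it follows the note in the open questions that out-of-range or negative values and max < min are errors. The function name `select_frames`, its default values, and the use of `ValueError` are assumptions made for this sketch only and are not part of the draft.

```python
def select_frames(nframes, first_frame=0, last_frame=None, frame_step=1):
    """Return the zero-based frame numbers selected by an inclusive range.

    Hypothetical helper: the name, the defaults and the use of ValueError
    are assumptions of this sketch, not something defined by the draft.
    """
    if nframes < 1:
        raise ValueError("nframes must be a positive non-zero integer")
    if last_frame is None:
        # Assumed default: go up to the last frame of the trajectory.
        last_frame = nframes - 1
    if first_frame < 0 or last_frame < 0:
        raise ValueError("frame numbers must not be negative")
    if last_frame > nframes - 1:
        raise ValueError("last_frame is beyond the end of the trajectory")
    if first_frame > last_frame:
        raise ValueError("first_frame must not be larger than last_frame")
    if frame_step < 1:
        raise ValueError("frame_step must be a positive integer")
    # last_frame is inclusive, hence the + 1.
    return list(range(first_frame, last_frame + 1, frame_step))


# Even frames between 10 and 20 inclusive, for a trajectory of 100 frames.
print(select_frames(100, first_frame=10, last_frame=20, frame_step=2))
# prints [10, 12, 14, 16, 18, 20]
```

A server that paginates could apply the same selection first and then split the resulting list into pages, which keeps the range semantics independent of the page size.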