# Dataset API
## Design goals
1. The end result of the dataset should be an XArray (but in principle it could be something else).
2. "Streamable": we may not know in advance how many data we will have, so we might want to continuously append data to the dataset.
- This could happen for the case of any feedback in measurement control, when number of repetitions depends on measured data
- This could happen for a "monitoring" loop: say, we continuously measure fridge temperature in an event loop.
3. The structure of the settables and gettables must in some form define the structure of the dataset.
- But we still want to be compatible with QCoDeS parameters being used as settables and gettables.
## Notes about existing dataset solutions
### QCoDeS Dataset
Writes the data to an SQLite database.
To figure out data types and sizes, QCoDeS requires "registering" parameters with (their) measurement control.
## Proposed API
### Disregarded option: binary streaming
The natural way to process data from pretty much every readout module is to dump the data in some (module-specific) form into an array and then return this array. On the Quantify side, we would expect the data to have a very specific gridded structure, suitable for conversion to a NumPy array:
```python3
data = b"" # Data is just some bytes array
n_repetitions = len(data) / repetition_size_in_bytes
data_array = numpy.ndarray((n_repetitions,), dtype=some_complex_dtype, buffer=data)
data_xarray = xarray.DataArray(data_array, coords=..., dims=...)
```
Typically, going from the raw device data to this gridded structure is a fairly trivial reshape/transpose/concatenate, which can be done efficiently at the backend level.
The problem with this approach is `some_complex_dtype`. It would work if settables and gettables provided their dtypes: say, if I always set `np.float64` and always get `np.float64` and `np.int32`, then I can rely on each batch of data being 8 + 8 + 4 = 20 bytes, `some_complex_dtype` can be `numpy.dtype([("s1", np.float64), ("g1", np.float64), ("g2", np.int32)])`, and this can be used to reconstruct the target raw-data XArray at the processing stage.
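For illustration, a minimal sketch of such a reconstruction, assuming the structured dtype above (the example values and variable names are made up):
```python3
import numpy as np
import xarray as xr

some_complex_dtype = np.dtype([("s1", np.float64), ("g1", np.float64), ("g2", np.int32)])
assert some_complex_dtype.itemsize == 20  # 8 + 8 + 4 bytes per repetition

# Pretend this is the raw byte stream coming from the readout module:
raw_bytes = np.array([(0.1, 1.0, 7), (0.2, 2.0, 8)], dtype=some_complex_dtype).tobytes()

# Reconstruct the structured array and split it into one data variable per field:
data_array = np.frombuffer(raw_bytes, dtype=some_complex_dtype)
dataset = xr.Dataset(
    {name: ("repetition", data_array[name]) for name in some_complex_dtype.names}
)
```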
The reason this won't work is that we cannot reconstruct the data types of QCoDeS parameters.
The only feasible way to deal with this is a "let XArray decide" behaviour, which means that we have to "stream" not raw byte data, but XArrays themselves.
### Probably feasible option: gettable knows better
- `Gettable.get()` should return an XArray `DataArray` or `Dataset`.
- A `DataArray` is simpler to handle on the Quantify-Core side: concatenating several gettables into a `Dataset` is a no-brainer, since a `Dataset` is a set of `DataArray`s.
- A `Dataset` would probably be more convenient for complex circuits, where we may need some more elaborate structure to avoid `NaN`-padding (if we want to).
- Merging `Dataset`s is fine in most sane cases, but we need to watch carefully for naming collisions, especially silent failures that lead to loss of data (see the sketch below).
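To make the concern concrete, a hypothetical collision between two gettables that both produce a variable named `g1` (the names are made up; this reflects current xarray behaviour as far as I know):
```python3
import xarray as xr

ds1 = xr.Dataset({"g1": ("repetition", [1.0, 2.0])})
ds2 = xr.Dataset({"g1": ("repetition", [3.0, 4.0])})

xr.merge([ds1, ds2])  # raises MergeError: conflicting values for "g1" -- loud, fine
ds1.update(ds2)       # silently overwrites ds1["g1"] -- exactly the data loss to avoid
```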
- If the return value is not an XArray (I am looking at you, QCoDeS parameters), it should be cast to an XArray by some well-defined rules, with well-defined naming.
- Settables follow the same XArray-casting rules (and probably also accept XArrays).
- `MeasurementControl` stacks all settables and gettables together into a dataset; both settables and gettables are considered data variables. In the end this dataset must have a `repetition` index (see the sketch below).
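A minimal sketch of what that casting and stacking could look like, assuming a plain scalar coming from a QCoDeS-style parameter (all names here are hypothetical):
```python3
import numpy as np
import xarray as xr

settable_value = 0.42  # e.g. the value a settable was set to
gettable_value = 1.3   # e.g. what some qcodes_parameter.get() returned

# Casting rule: wrap plain scalars into one-element DataArrays along ``repetition``.
settable_arr = xr.DataArray(np.atleast_1d(settable_value), dims=("repetition",), name="s1")
gettable_arr = xr.DataArray(np.atleast_1d(gettable_value), dims=("repetition",), name="g1")

# MeasurementControl stacks settables and gettables into one dataset,
# all of them as data variables along the ``repetition`` dimension.
dataset = xr.Dataset({arr.name: arr for arr in (settable_arr, gettable_arr)})
```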
- These datasets are streamed to the data backend using an interface in this style:
```python3
with container.lock() as f:
    # The container is locked against anyone else writing to it;
    # reading is still fine, we need that for monitoring loops.
    f.write(dataset)
```
- `write` always concatenates to the dataset along the `repetition` index, disregarding the index values themselves, so `MeasurementControl` does not have to track repetitions. If the user needs to track something, there should be a separate settable or gettable in the experiment for that (see the sketch below).
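A minimal sketch of what `write` could do internally under that rule (the helper and its name are hypothetical):
```python3
from typing import Optional

import xarray as xr

def _append(existing: Optional[xr.Dataset], incoming: xr.Dataset) -> xr.Dataset:
    # Drop any repetition coordinate the caller may have set; only the order matters.
    incoming = incoming.drop_vars("repetition", errors="ignore")
    if existing is None:
        return incoming
    return xr.concat([existing, incoming], dim="repetition")
```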
- Containers themselves are created at the backend level, and probably acquired by TUID and `ExperimentID`. I don't know yet what `ExperimentID` should be; so far the closest thing we have to it is the name we append to the TUID in the name of the experiment folder:
```python3
import pytest

# Create a container
a_container = backend.container(tuid, experiment_id)
# Acquire a reference to the same container
the_same_container = backend.container(tuid, experiment_id)

with pytest.raises(ExperimentIDMismatch):
    # Nope, not possible: the ExperimentID must match the one used at creation
    another_container = backend.container(tuid, another_experiment_id)
```
- Retrieving the data from the container should be something like:
```python3
dataset = a_container.dataset(filter=...)
```
- The filter should be `None` or some XArray-compatible filter. Its purpose is to, say, grab the data for a specific time slice or for a specific subsystem (see the sketch after this list).
- This should work in a non-blocking manner; preferably it should be a read-only view into the dataset.
- It should be possible to extract this from an unfinished experiment. Say, we are monitoring a fridge temperature in the background and want to use this data in some analysis without terminating the monitoring loop itself.
- The backend should be allowed to cancel the request with some `RetrieveError` if, say, the experiment has terabytes of data (a really long loop) and the user forgot to specify a filter.
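For illustration, a couple of hypothetical filters and how a backend could apply them (the coordinate names and the `.sel()`-based application are assumptions, not an agreed API):
```python3
# Grab only a specific time slice, or only a specific subsystem:
dataset = a_container.dataset(filter={"time": slice("2024-01-01", "2024-01-02")})
dataset = a_container.dataset(filter={"subsystem": "q0"})

# Inside the backend, an XArray-compatible filter could be applied roughly as:
view = full_dataset if filter is None else full_dataset.sel(indexers=filter)
```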
- ~~`plotmon` should subscribe to the updates of the dataset. The callback function receives a `dataset` as an argument and does something to update `plotmon`. The API would look like:~~
```python3
a_container.subscribe(lambda dataset: None)
```
~~This callback function gets exactly the same `dataset`.~~
**NOOOOO:** Both `plotmon` and `container` should subscribe to `MeasurementControl` updates, in the spirit of https://gitlab.com/quantify-os/quantify-core/-/merge_requests/265. This needs further clarification. Probably it should be a configurable option whether MC creates a default plotmon, a default container, etc. Need to think about it. The subscriber API should then likely be more complicated and have "experiment started", "data acquired", and "experiment stopped" events (see the sketch below).
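To make those events concrete, a minimal sketch of what such a subscriber interface could look like (class and method names are hypothetical, not an agreed API):
```python3
from typing import Protocol

import xarray as xr

class MeasurementSubscriber(Protocol):
    """Hypothetical subscriber to MeasurementControl events."""

    def experiment_started(self, tuid: str) -> None:
        ...

    def data_acquired(self, dataset: xr.Dataset) -> None:
        ...

    def experiment_stopped(self, tuid: str) -> None:
        ...
```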