Dataset modelling changes
===

###### tag: `renku`, `KG`

:::info
**CURRENT STATUS**
- **Stage:** shaping
- **Pitch due date**: (?)
- **People**: Kuba, Ralf, Sekhar (or someone from the UI)
:::

:::warning
**NEXT STEPS**
- Add more detailed info about changes to the CLI -- use the same ID for all Datasets imported from the same source (having the same `sameAs`); changes to the Dataset object types in the CLI.
    - **People**: Ralf
- Add info about changes to the UI -- the UI should prevent importing an already imported Dataset (Renku or external)
    - **People**: Sekhar
:::

:calendar: Timelines
---

```mermaid
gantt
    %%{init: { 'theme': 'forest' } }%%
    title Dataset modelling changes
    dateFormat YYYY-MM-DD
    axisFormat %d-%m
    section Cycle 1
    Shaping :active, shape1, 2022-08-31, 2022-09-14
```

🤔 Problem and Context
---

The move to schema 9 was undoubtedly a very big success. It took the modelling of the Renku metadata much closer to the way we think and talk about the data. However, over time some areas were identified where things do not work as we'd like. Specifically, the modelling of the concept of a Dataset revealed some obvious flaws. A gap between the language that settled in Renku and the data constructs, the efficiency and complexity of working with the data, and poor UX caused by confusion in understanding the data are all manifestations of overly simplistic design choices. For the users, the limitations of the metadata are visible in search, dataset detail views, and in dataset provenance.

A specific case of this simplistic model is the notion of an imported dataset. This is currently expressed through relations in the knowledge graph (e.g. `derivedFrom` and `sameAs`), but the way that we apply those relations in practice actually _modifies_ the way the entities themselves are treated (e.g. imported vs. created Dataset). This semantic difference is not well captured in the current metadata.
The problem is exacerbated when projects are forked because it becomes impossible to reliably identify the true source of the datasets, leading to confusing search results.

In this context, we might think of "users" as those who use the frontend (UI), but also as more advanced users who may want to consume the metadata directly. For the latter especially, an imported dataset is difficult to identify when inspecting the metadata generated by Renku, as it is only implied through the relations between datasets.

The desired solution addresses the following issues:

- imported datasets are _easy_ to identify in metadata; this improves understanding of the KG and improves search performance
- imported datasets are identified as such in the UI and CLI
- datasets are imported only from true sources to maintain accurate provenance information
- datasets in forked projects do not appear as if they are owned by the fork

🍴 Appetite
---

As we don't like to live in a world where things are awkward, and we're always hungry for tasty, well-done solutions, we think it's high time to make Datasets tasty!

The correctness of dataset metadata is critical if we want to further promote the use of dataset features. From the larger Renku perspective, dataset metadata and especially provenance are the "easy" kinds of provenance to record in Renku and can be used to effectively demonstrate the usefulness of the Renku knowledge graph. The issues described above make this use case difficult to promote and encourage.

We should dedicate a full 6-week cycle to implementing a solution.

🎯 Solution
---

A richer Dataset type structure can mitigate the mentioned problems. A proof of the idea might be the Knowledge Graph services (KG), which identified the necessity of more precise Dataset modelling. Such an approach brought invaluable benefits, such as more precise, more robust and closer-to-real-life data constructs (e.g. the concept of an imported Dataset), as well as gains in compile-time validation.
The need for cleaner metadata as expressed above may be satisfied with increased precision in the Dataset modelling. That means introducing more fine-grained typing in the Dataset space. Based on that, the querying, validation, and UX might be improved.

### 1. TS and KG changes

Currently, KG specifies five Dataset flavours; with the proposed change, however, four plus a marker type seem to be sufficient to cover all the scenarios:

* `InternalDataset` - only for Datasets that are created in Renku and neither modified nor imported (with no `sameAs` property);
* `ImportedExternalDataset` - for all Datasets that are imported from an external source like Zenodo or Dataverse (with `sameAs` pointing to the Dataset origins in the external source);
* `ImportedInternalDataset` - for all Datasets imported from a Renku-created Dataset (`InternalDataset`) or imported from a modification to any Dataset (`ModifiedDataset`) (with `sameAs` pointing to the id of the source Dataset);
* `ModifiedDataset` - for Datasets that were created as a result of a modification of a Dataset; they never have the `sameAs` property but instead their `wasDerivedFrom` points to the id of the Dataset that was modified; parents of a `ModifiedDataset` might be `InternalDataset`, `ImportedExternalDataset`, `ImportedInternalDataset` or `ModifiedDataset`;
* `DiscoverableDataset` - a special marker type to indicate that the Dataset can be presented to the user; unlike the other types, this type might be a temporary one: a Dataset will be decorated with it only when it can be imported to another Project (e.g. it's the latest, or the only, version of the Dataset) and will lose the type when it gets modified; however, the `DiscoverableDataset` type can stay on a Dataset even after it gets modified if it was imported to another Project and that imported Dataset still exists, does not have modifications or links to other imported Datasets, and has not been invalidated.
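
The type rules above can be sketched in Python as a small classifier. This is only an illustration under stated assumptions: the `DatasetRecord` class, its field names, the `RENKU_PREFIX` constant and the `is_renku_url` helper are hypothetical and are not taken from the Renku CLI or KG code. In particular, recognising a Renku source by URL prefix is a simplification, and the `DiscoverableDataset` marker is left out because it depends on the state of other Projects.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical prefix used to recognise Renku-internal sources (assumption of this sketch).
RENKU_PREFIX = "https://renkulab.io/"


class DatasetKind(Enum):
    INTERNAL = "InternalDataset"
    IMPORTED_EXTERNAL = "ImportedExternalDataset"
    IMPORTED_INTERNAL = "ImportedInternalDataset"
    MODIFIED = "ModifiedDataset"


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical, simplified view of a Dataset's provenance links."""

    id: str
    same_as: Optional[str] = None
    was_derived_from: Optional[str] = None


def is_renku_url(url: str) -> bool:
    """Simplified check for a Renku-internal source (assumption)."""
    return url.startswith(RENKU_PREFIX)


def classify(ds: DatasetRecord) -> DatasetKind:
    """Map provenance links to one of the four proposed Dataset types."""
    if ds.was_derived_from is not None:
        # A modified Dataset never carries sameAs.
        if ds.same_as is not None:
            raise ValueError("a modified Dataset must not have sameAs")
        return DatasetKind.MODIFIED
    if ds.same_as is None:
        # Created in Renku, neither modified nor imported.
        return DatasetKind.INTERNAL
    if is_renku_url(ds.same_as):
        return DatasetKind.IMPORTED_INTERNAL
    return DatasetKind.IMPORTED_EXTERNAL
```

Keeping the decision in one function makes the mutual exclusivity of `sameAs` and `wasDerivedFrom` explicit, which is the kind of validation the finer-grained typing is meant to enable.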

The scenarios that the concept was validated on are:

![](https://i.imgur.com/oZ04oc2.png)

There's an issue for KG already created: https://github.com/SwissDataScienceCenter/renku-graph/issues/1077

### 2. CLI changes

Improvements to the CLI include:

1. A richer Dataset type structure to keep the model in sync with the domain language. Specifically, add the concepts of `InternalDataset`, `ImportedExternalDataset`, `ImportedInternalDataset` and possibly `ModifiedDataset` (though ideally this would stay `InternalDataset`) mentioned above to the CLI code and the JSON-LD schemas as well. These could be subclasses of the existing `Dataset` class.
2. Simplified Dataset modelling for Datasets that have the same origin. Generation of ids should be consistent, so e.g. the same dataset imported in two projects should result in matching `ImportedInternalDataset`s with identical IDs.
3. Disallow importing `ImportedExternalDataset` and `ImportedInternalDataset` into a project (optionally, prompt the user to import the source dataset of those instead).
4. Disallow exporting `ImportedExternalDataset` and `ImportedInternalDataset` into project.
5. Disallow editing of `ImportedExternalDataset` and `ImportedInternalDataset`, except in the case where a user unlocks them for editing with a specific command, creating a new `InternalDataset` in the process (and being prompted to enter a new name/title/description etc.). The new Dataset still needs a pointer to what it was based on, i.e. the original `InternalDataset` or the URL of the external dataset (e.g. Zenodo), instead of the `ImportedExternalDataset`/`ImportedInternalDataset` node it was created from.
6. Disallow tag creation on `ImportedExternalDataset` and `ImportedInternalDataset`.
7. (Maybe) add new edges from the Project node to the different types of datasets from 1. so they can be easily separated.
8. Change CLI commands (`list`, `show`) to clearly display the different types of datasets.
9. Enhance the service so `datasets.list` can return results filtered by type.
10. Add a migration (doctor check? though it is a big change) to migrate the old structure.
11. (Maybe) add support in graph export for flags like `is_fork` and `forked_at_commit`, and transform `InternalDataset` to `ImportedInternalDataset` on the fly when exporting to deal with forks (this could also be done by the KG instead).

### 3. UI changes

As of now, importing a Dataset that is not an original Dataset but just an import can lead to confusion, so the UI would need to stop offering such an option. Instead, a suggestion to import the original Dataset might be given to the user.

![](https://i.imgur.com/kpcR7hW.png)

🐰 Rabbit holes
---

🙅 No-Gos
---

## Notes
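
As a note on CLI items 2 and 3, the sketch below shows one possible way to derive identical ids for Datasets imported from the same source and to refuse importing an already imported Dataset. It is a minimal illustration under assumptions, not the Renku implementation: the namespace URL, the function names and the `ImportNotAllowed` exception are all hypothetical, and hashing `sameAs` with a name-based UUID is just one option for consistent id generation.

```python
import uuid
from typing import Optional

# Hypothetical fixed namespace; any UUID works as long as every client uses the same one.
DATASET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/renku-datasets")

# Types that should not be importable themselves (CLI item 3).
IMPORTED_KINDS = {"ImportedExternalDataset", "ImportedInternalDataset"}


def imported_dataset_id(same_as: str) -> str:
    """Derive a stable id from the source the Dataset was imported from."""
    # uuid5 is deterministic: the same namespace and name always give the same UUID.
    return uuid.uuid5(DATASET_NAMESPACE, same_as.strip()).hex


class ImportNotAllowed(Exception):
    """Raised when the user tries to import a Dataset that is itself an import."""


def check_importable(kind: str, same_as: Optional[str]) -> None:
    """Reject imports of imported Datasets and point to the true source if known."""
    if kind in IMPORTED_KINDS:
        hint = f"; import the source instead: {same_as}" if same_as else ""
        raise ImportNotAllowed(f"cannot import a {kind}{hint}")
```

With this approach two projects importing from the same `sameAs` value end up with the same id without any coordination, at the cost of making the id depend on the exact spelling of the source URL.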