# DATALAD DATA PORTAL

# Collaborative notes, January 30th, 10:30AM

**Who's here?** Stephan, Ben, Adina, Laura, Alex, Michael, Christian, Michał

**Agenda**
1. **Stephan**'s initial draft:
    * The Data Portal Kit should be a lean way for people to build a decentralized portal for data
    * Data portal managers have full control of the portal, e.g., the required metadata, its form, and the way it is submitted
    * The portal should integrate any extension that might be useful (metalad, catalog)
    * Components: Data portal (catalog-based, metadata about all data in the portal), Storage
    * Each component needs a technical realization
2. Planning the hackathon (tomorrow 9AM-4PM)
3. Group discussions about permissions

**ACTION ITEMS for Tuesday, January 31st:**
- **Generate an iconic set of user stories to support (20 minutes today, each person contributes their 3 main user stories in bullet points)**

## Minutes

__Discussion about Stephan's initial draft__
- The DataLad Portal *KIT* comes only once we have figured out all of its components and tools
- mih defines the data portal concept: "The portal is an address one can navigate to with their browser, and it contains pointers to the data (where that data lives). Archiving in the scope of the institute means 'knowing where we put it'."
- **aqw**: Is catalog generation fully decentral or fully central? Fully central would create issues we can't handle from the perspective of data access.
- **Answer to the question**: We will be building something decentral. This decentral solution can also be used in a centralized way.
- **jsheunis**: A centralized place defines a specification of requirements for data and metadata. Anyone submitting to the portal can do the metadata generation locally.
- **bpoldrack**: Cautions that the name "portal" might be misleading. The portal could be a central instance, but doesn't need to be.
- **aqw** clarifies after discussion: The portal needs the ability to know about permissions for access
- **mslw**: Thinks about the portal as potentially entirely centrally managed as well, and uses OpenNeuroPET as an example. But it could also be run (partially) decentrally, e.g., in the SFB1451, and envisions a GitHub-like possibility where external contributors can submit metadata to be added to the portal
- (**cmoench**: Catalog runtime/environment comment)
- Final clarification for INM-7: There is no additional dataset. Our portal will be based on the existing superdataset, and all relevant metadata from the superdataset will be used in the portal. We need more metadata, however, for access management. The metadata any random person can see in our portal will be public/not restricted. The data portal will provide users with information on how to access any specific data holding
- **adswa**: Let's define the terms catalog and portal
    - Portal: HTML/web-based representation, definition not yet settled
    - Catalog: DataLad catalog

__Planning the hackathon tomorrow__
- Scope of this meeting:
    - Targets for tomorrow: Find out
        - what are the permission levels (e.g., dataset, filenames, content)?
        - what are the permission scopes (e.g., read/write/...)?
        - which technical pieces exist, and which will need to be developed?
    - Create an illustration of the concepts
    - Brainstorm the technical solutions necessary to support our proposed user stories

__Discussion about Permissions__
- Components that need dedicated permissions:
    - Dataset title and ID
    - ...
    - ...
- mslw brings up a permission example: A consumer starts with the public-facing rendering and clicks "get access"; a web service checks for access and, if there is none, displays what needs to be done to get access; access is granted for a specific user, time-limited, for a specific Azure storage blob
- mih adds that one of our aims is to work out such a permission management. He thinks any data request should go through an HTTP request, which allows us to determine access rights

----------------------------------------------------

# Collaborative notes, January 31st

**Who's here?** Stephan, Ben, Adina, Laura, Alex, Michael, Christian, Michał

**Agenda**
- Discuss everyone's generated user stories
- Create a figure illustrating workflow elements and processes

**ACTION ITEMS for Thursday, Feb. 2nd**: There are four main use cases. Everyone assigns themselves to one use case and tries to map the elements in the figure to that use case:
1. OpenNeuroPET - @jsheunis
2. ICF store - @mih / @cmoench
3. INM-7 archive - @loj / @aqw / @bpoldrack
4. SFB-1451 - @mslw @adswa

Things to work with during the mapping process:
- terminology
- what would an operator need to do?
- workflow specifications

## Space for user stories

### Adina

1) Alice has joined the institute. She will work on sex differences in resting-state fMRI. Susanne and Kaustubh task her to find at least 10000 subjects' rs-fMRI data in the data holdings of the INM-7.
2) Bob has finished his post-doctoral fellowship in the institute. During his stay, he produced two Nature publications. An attentive member of his former group wants to place the associated projects into the data portal.
3) Jamie is an outside collaborator from Aachen who does not have a juseless account, but needs to collaborate with an INM-7 group on one of their past projects.

### Stephan

1. Jay, a PhD student or post-doc in a lab, is tasked with creating a way for possible external collaborators to browse high-level descriptions of their group's datasets and find interesting data. The "way" should allow people to request access to individual datasets and to get the data as seamlessly as possible.
2. I am Groot, an fMRI researcher looking for (semi-)open data on movement disorders. I know about OpenNeuro, but I want to be able to browse through such data on a more granular level, e.g., search for datasets containing subjects in a specific age range that underwent a specific task, etc. And then I want to group these subjects' data together and access them together somehow as a single unit.
3. Alexis, the sys-admin or so-called "technical data person/group leader" of an institute, is tasked with making their institute's data F.A.I.R. Their datasets include sensitive human data as well as data that can be shared publicly without any hassle. If they were to use the DataLad Portal Kit for this, their questions could/would be something like:
    - what are the architectural and software requirements?
    - are there existing containerized workflows that I can run to set up "the portal"?
    - which steps or workflows are needed in order to maintain the portal (i.e., what needs to be done per new dataset, and how do I do that)?

### Michał

1. A group of consortium members generated many DataLad datasets, stored in various places (some data are open, some not). They want to let the world, and each other, know what resources they have.
They all agree to publish the git part of their datasets (including at least some kind of CITATION or minimeta file) to some git-capable hosting and share it with Darcy, the curator. Darcy periodically fetches updates, runs basic metadata extraction, and maintains a catalog which serves as the consortium's data portal.
2. Same as above, but this time consortium members don't need to expose the datasets, and instead run metadata extraction on their own, providing Darcy with catalog-ready metadata. Darcy wants the submission process to require minimal effort, but also to leave some room for moderation.

### Laura

1. Kitty ran CAT 12.8 across open and restricted datasets and wants to make the results available to the institute. A student uses this data for their project and publishes a paper. A year later Kitty runs the pipelines again with CAT 12.9. She wants to update the data available to the institute, but the old data needs to be kept because it was used for a publication.
2. Ori applied for and got access to a DUA-restricted dataset for his project. He put in the work to convert it into BIDS. He thinks others in the institute would benefit from the data.
3. Jessica collected her own data for her PhD. She's moving on to a postdoc elsewhere. The data needs to be archived, but will be used by others in her group after she leaves.

### Alex

1. Hannibal is a new PhD student and has been told that they will work on the Roman dataset. How do they discover whether the dataset requires a DUA, how to apply to upstream for permission, and where to submit the proof-of-upstream-approval to gain access to the restricted portions of the catalog/datastore/Data Portal?
2. Willis works at an institute with a wide collection of open and restricted data. They are tasked with tracking DUA compliance and access to the datasets. A data breach occurs. They need to disclose who is approved for which datasets over what timespans --- with proof.
3. Vesuvius has an analysis that will produce 500 TiB of interim data that is of no value for re-use, but is deemed essential until the publication is accepted after 3 rejections and 8 years have passed. Once published, the required data is 1.44 MiB.

### Christian

1. Otto creates a dataset of traffic patterns that are recorded by automated sensors; the dataset is updated once a day. He would like the dataset to be accessible through the institute portal in its most up-to-date state, including summary metadata that is calculated from the respective state (see the sketch after this list).
2. Otto would also like to make the latest version of the traffic pattern dataset automatically available to Google's dataset search.
3. Ignatz creates metadata from restricted personal data and would like to publish it for all individuals who have access to the restricted data. He has of course agreed with the dataset owners about this plan and received permission to make his metadata available to all clients with a certain authorization.
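A minimal sketch of what Otto's daily refresh (story 1) could look like, assuming the traffic data lives in a DataLad dataset at a hypothetical path and the summary metadata is a small JSON file tracked in the same dataset; nothing here is an agreed-upon design:

```python
# Hypothetical nightly refresh for Otto's sensor dataset. The dataset path,
# file layout, and summary fields are made up for illustration; only the
# datalad.api.save() call is a real DataLad operation.
import json
from pathlib import Path

import datalad.api as dl

DATASET = Path("/data/traffic-patterns")  # assumed location of the DataLad dataset

def update_summary(ds_path: Path) -> None:
    """Recompute simple summary metadata from the current dataset state."""
    recordings = sorted(ds_path.glob("recordings/*.csv"))
    summary = {
        "n_recordings": len(recordings),
        "latest_recording": recordings[-1].name if recordings else None,
    }
    (ds_path / "summary.json").write_text(json.dumps(summary, indent=2))

update_summary(DATASET)
# Save new sensor files and the refreshed summary as one new dataset version.
dl.save(dataset=str(DATASET), message="nightly sensor update + refreshed summary metadata")
```

A job like this could run from cron; publishing the saved state to a portal-facing sibling (e.g. `datalad push --to portal`) would be a separate, equally automatable step.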
### Benjamin

### Michael
- An overview of user-specific "load" put on the data portal (e.g., a leader-board of data submitters, or a transparent overview of costs)

## Summary of use cases:
- Find datasets by property, by publication
- Access (internal) data from the "outside"
- Discover datasets by browsing/accident
- Request access to a dataset/see terms
- Requirements for adoption
- Offer a showcase/catalog for bragging
- Workflow for maintaining a catalog with the least bottleneck
- Deal with continuously changing resources
- Auditing, protocol of who had access to what for how long
- Onboarding info/training for portal users. How do they know what to do?
- How do derivative datasets need to be annotated to make the association with the original clear?
- How does the data portal encourage deletion, while reducing anxiety about deletion and avoiding pollution (can the data portal solve a general failure of leadership?)
- How to inherit terms from a source dataset?
- Discoverability via Google dataset search
- Accounting and auditability (every X years, by e.g. the data protection officer)

## Minutes
- First discussed and summarised user stories
- alex: how do we move large unused parts of data to a different storage backend without exposing this process to the user; related to the concept of temperature of data
- mih: the underlying issue/challenge relates to the granularity of data/sets; mentions the annex-wanted expression
- adina: how to conceptualize different elements of the portal/dataset, and what permission levels do we assign to those? and we provide a general concept of how "things" are distributed
- aqw: An HTML rendering of a catalog is a client. What's in there is the maximum the people viewing it are allowed to see. Datasets contain their own catalog HTML file, and access to datasets unlocks access to those catalogs
- mih: a valid challenge yet to be solved is at which level of granularity do we apply deviating access granting processes
- christian asks whether the catalog has any role in authentication/access; would it contain searchable data to which access has to be authorized?
- mih: possibility for circular discussion. if we have the bigger picture with components represented visually, we might quickly get to a common understanding. e.g. if restricted access is needed on file level, the same access process could be used for restricted data inside the "catalog"
- Start discussing Adina's diagram:
    - Web frontend for the catalog is HTML going to the client from an HTML server
    - JS in the client will make requests for specific data files; some might be restricted and some might not
    - Access granter has a table that lives somewhere, containing user identifiers and linking them to resources
- adswa: what about granularity of access on dataset/file level, and how does this relate to what specific users are able to see in the catalog?
- mih: should this separation be done on dataset level, or should the whole portal be able to deal with this?
- aqw: the catalog client shouldn't care where data comes from
- mslw: think about it as views
- mih: we know how to do data transport; how far can we get by focusing on requesting data from file systems (and not focusing on trying to support searches across combinatorial space, i.e. across different access levels)
- aqw: agree with mih - we should not have any form of cascading combinatorial insanity
- cmoench: would metadata generated from restricted data also be restricted / access-controlled?
- ...
- mih: people come in with their identity and a dataset/file identifier; a site-specific access granter will authenticate them and give out resource-specific credentials
- Discussing Michael's drawing:
    - mih: is there any way we can avoid having to create the running service "access granter"? Thinks not. One version of an access granter will come from the OpenNeuroPET collaboration (they're outsourcing development)
    - mslw: the DUA signing process might exist because of a legal requirement (e.g. GDPR in the case of OpenNeuroPET), or just mean filling in a few fields like email/reason for request.
    - adswa: can someone be operator and user simultaneously?
    - mih: wants to stay away from a user permission-level granting service

Figuring out which pieces need to be built based on Michael's drawing:
- The DataLad-based retrieval is implemented
- We need an access granter. How do its concepts map to the git-annex/DataLad world?
- When the hosting of a file moves into the data management system (INM-7 storage/S3/...), the data portal knows that this file is available. The access granting method needs to happen at this stage
- aqw: possible conclusion = we won't be creating a token-based access granting service

![](https://i.imgur.com/YJamfJS.png)

The above figure should be altered to distinguish between "services" and "processes". Often a service is one way to implement a process, but not the only way. For example, "signing a DUA" (the process) could be implemented as a web service, but also as a physical office.

Q: operator vs maintainer?
Q: Have a figure that just has operations in the flowchart, no implementations?
Q: access granter (the process) is essentially just an authentication workflow (implementation-wise)
Q: which roles do we want to distinguish?
Q: which "operations" do we want to support?
Q: can we come up with a "matrix view" that compares the different use cases on a number of common dimensions/aspects?
Q: three levels of description: abstract, implementation by service, implementation by DataLad building block
Q: what is the additional cost of a DataLad-based system vs a system that solves the same problem in another way?

-----------------------------------------------------------------

# Collaborative notes, February 2nd

**Who's here?** Stephan, Ben, Adina, Laura, Alex, Michael, Christian, Michał

**Agenda**
- Discuss everyone's mapping of use cases to figure components
- Create a matrix with different dimensions of "core concepts" distilled from the use cases

**ACTION ITEMS for Tuesday, Feb. 7th**: Collaboratively map use cases into a table and differentiate them on various dimensions (Adina starts). Christian tries to outline a systematic approach in writing.

## Use cases

### OpenNeuroPET

(The distilled output of our discussion (Christian and Stephan) is currently represented in [this drawing](https://github.com/psychoinformatics-de/data-portal-kit/blob/main/whiteboard-openneuropet-use-case-20230201-1421.svg), i.e. lost in authentication details, but here is a quick overview):
- Data is stored in access-controlled data stores S1 ... Sn
- The browsing machinery (HTML, JS, CSS, ...) including an initial configuration is accessible on a public HTTP server
- The top-level, i.e. public, dataset information is also accessible on public HTTP servers and described in the initial configuration
- There is a portal-instance-specific access granter (AGRANT) that can take a user ID (UID) and a storage system description for Sx and return a credential that is valid to access Sx, if the UID has access to the data stored on it (see the sketch below)
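A minimal sketch of the AGRANT idea, assuming authentication of the UID has already happened and the permission table is just a lookup. The table, secret, and token format below are placeholders for illustration, not the planned implementation; in practice the returned credential would more likely be a time-limited token for the storage system itself (e.g. an Azure or S3 credential).

```python
# Toy sketch of the AGRANT role: map (UID, storage system) to a short-lived,
# signed credential. Permission table, secret, and token format are placeholders.
import hashlib
import hmac
import time

SECRET = b"portal-instance-secret"          # would live in the portal's configuration
PERMISSIONS = {                             # hypothetical UID -> accessible stores
    "alice@example.org": {"S1", "S3"},
}

def grant_access(uid: str, store: str, valid_for: int = 3600) -> str | None:
    """Return a signed, time-limited token for `store`, or None if not permitted."""
    if store not in PERMISSIONS.get(uid, set()):
        return None
    expires = int(time.time()) + valid_for
    payload = f"{uid}:{store}:{expires}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"

print(grant_access("alice@example.org", "S1"))   # token, valid for one hour
print(grant_access("alice@example.org", "S2"))   # None: no access to S2
```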
### INM-ICF

ICF users/data consumers are FZJ employees and external collaborators/customers. ICF operators are FZJ employees. Data read access is implemented via authenticated HTTPS to physical servers located at the JSC. There is **no unauthenticated access**. Operator access is a direct server console login via SSH.

**Data organization**: The data store is organized by "studies" that contain "visits". Each study corresponds to a directory on a storage server's file system. Payload data are contained in TAR archives.

**DataLad assimilation workflow**: The native data organization does not involve DataLad data types. Study-visit data are indexed as DataLad datasets *after* original data deposition on the server has already been completed. Each TAR archive is accessible via an HTTPS URL. Archives are registered via an *uncurl* special remote setup, and archive content is indexed via the *archivist* special remote. The indexing procedure is fully automated via a script that takes a study and a visit identifier as input. Generated DataLad datasets contain only metadata and are deposited alongside the data payload in a study directory on the storage server. Deposition is performed by ICF operators. Catalog record generation is done once a DataLad dataset has been generated. The DataLad dataset of a study visit is *not* meant to be updated (ever), hence will not contain the catalog metadata. Instead, it may be added to a study DataLad dataset, or be directly deposited at the catalog site.

**Data catalog**: A catalog site lists basic information on studies, but only aggregate statistics on visits. The catalog site is accessible on the intranet. The ICF catalog is maintained by ICF operators.

**Permission handling**: Access permissions are implemented as location-based permissions. Users get read access to a directory that contains all information/data of a study (incl. all visits). There is no access granting service. ICF operators hand out credentials to individual ICF users, and assign them manually to any relevant study. This is manageable because the number of studies is low and the number of users is low too. Catalog access may be limited to the union of all users.

**DUA handling**: There is no (explicit) DUA handling. ICF users can be considered *data owners*, as they recruit the ICF as a service provider to perform data acquisition on their behalf and based on their authorization to conduct specific research projects.

### INM-7 archive

**Storage Services**
- Access:
    - Read (HTTPS) access for users
    - Write (SSH) access for operators/publishers
- Permissions:
    - DUA scope cannot be more granular than one datastore
    - HTTP Basic auth with per-user passwords

**New Project Requests**
- Create a fresh DataLad dataset
- Register the dataset UUID as a resource ID in the resource/DUA/user mapping tables.
- When the user wishes to move/share/archive content, they follow the Submission Request (handcuffs emoji) steps.

**Submission/Modification Request**

User/submitter:
- Open a PR
- Validation bot(s) to check/guide:
    - required metadata
    - structure (e.g. BIDS, etc.)
    - provenance records
    - data retention information
    - DUA compliance/restrictions
    - suggested reviewers
- UNKNOWN: what should the PR contain, and how should the bots access the data (since this should be automated as much as possible), given that it is not yet included on any storage backend

Operator/Curator:
- Upload to the appropriate storage backend
- Add to the INM-7 super-catalog (as appropriate)
- Register appropriate permissions in the DUA->User<->dataset UUID tables

**New(d) Bits**

DUA process
- Where does a user submit their DUA docs?
- DataLad dataset to track the resource<->user relationship
    - Modifications are tracked by git history
    - supporting documents (e.g. a signed upstream DUA doc) are tracked
    - export resource<->user tables (if needed)
    - GitLab workflows available for PRs, etc.

Submission/validation bots
- use Dartmouth pieces?
- ...

### SFB-1451

**Roles**: Within the SFB 1451, an _operator_ is a dedicated SFB Z- or INF-project associate such as Michał, and _users_ are scientific personnel of different career levels, from student to PI, who create data. PIs are users with elevated privileges, and can control resource permissions. External collaborators and re-users of the data are considered _users_ as well (crucially: users may want to submit or retrieve content).

**Data portal front-end**: A catalog linking all available datasets. An exemplary data portal _front-end_ is [the existing catalog (1)](https://jsheunis.github.io/sfb1451-data-catalog-website/#/dataset/1b26b5d1-5729-4dd5-b990-fff74960c949/89d15906f570e2dbb27fa1f6cf2e114ecd33f156) and [(2)](https://psychoinformatics-de.github.io/sfb1451-projects-catalog).

**Storage and access**: _Users_ create an account at the _storage service_ (Sciebo) via their institution. PIs create and/or give users access to dedicated project folders in the SFB's Sciebo project box. Users can obtain a Sciebo guest account via the INF project if their institution doesn't subscribe to Sciebo. Alternative storage services such as GIN are possible. If some data needs to stay intramural, a derived metadata-only dataset gets deposited instead? (Note: how much of the data can we realistically expect to be in externally accessible places?)

**DUA handling**: ? (adina isn't sure - no explicit DUA handling, all done via Sciebo/GIN/... access? I.e., retrieval of the data hosted on Sciebo is contingent on a Sciebo account with which a given project box was shared) (michał - probably by direct contact; a central "matchmaking" service where requests are made and approved could be implemented in theory, but is probably not practical)

**Assimilation workflow**: A submission of resources to the data portal requires the following steps:
* a _user_ creates a DataLad dataset and saves all payload to it
* the _user_ adds specified metadata (citation information and associated publications) in specified file formats (CITATION.cff, .ris/.nbib) to the dataset
* the _user_ uploads the dataset to the _storage service_
* the operator generates a catalog from the dataset and adds it to the SFB's general catalog (is this a process that is simply done every X time intervals, or is there a way to ping the operator about an upload?)

**Updates**:
* ? (adina isn't sure - _users_ push updates to the existing project folders, and catalog re-generation incorporates the update?)
(michał also )

## Minutes
- Run through everyone's thoughts on how the use cases map onto concepts
- mih: thinking about the paper, the idea is to give a high-level description of the whole kit given a few core concepts, then move on to demonstrating how specific use cases map onto the concepts, and highlight some possible implementation details as they relate to core concepts
- adswa: can we improve our understanding of the concepts and the big picture by talking about a specific use case (as an outcome for today's discussion)
- cmoench: understanding is that a core principle is that data with the same access permissions are put in the same place. Should we identify core principles?
- mih: what are the similarities of the different use cases, can we draw a matrix view of that?
    - data portal user ID (auth):
        - required for OpenNeuroPET, used for everything (including e.g. the access granter)
        - ICF: unclear, possibly FZJ system-based, dealing with external people too, probably used in an htaccess file
        - INM7: unclear
        - SFB1451: public catalog, with links to specific datasets that might have their own requirements for auth/access
- bpoldrack: is the portal ID just an authorization against the access granting process?
- bpoldrack: reminder about statistics being available with authenticated access, and not without
- mih: it is the site's decision whether they want to implement a workflow that precludes statistics or not
- jsheunis: should superdatasets (and relatedly) be seen as a core concept with different implementation details per use case?
    - sfb1451: useful for representing organization/governance
    - registry: definitely does not want a superdataset

### Core questions from @mih based on Tuesday's discussion

Q: operator vs maintainer?
Q: Have a figure that just has operations in the flowchart, no implementations?
Q: access granter (the process) is essentially just an authentication workflow (implementation-wise)
Q: which roles do we want to distinguish?
Q: which "operations" do we want to support?
Q: can we come up with a "matrix view" that compares the different use cases on a number of common dimensions/aspects?
- catalog maintainers are data controllers?
- authorization process
- authentication process
- need for/purpose of authentication (need for a "portal user ID")
- choice of metadata standards
- metadata homogenization by whom/when/how
- metadata deposition (alongside data?)
- requirements for data deposition
- process of data deposition
- frequency of dataset updates
- frequency of metadata updates

Q: three levels of description: abstract, implementation by service, implementation by DataLad building block
Q: what is the additional cost of a DataLad-based system vs a system that solves the same problem in another way?

### Core Concepts Matrix

| Concept | OpenNeuroPET | ICF | INM-7 | SFB-1451 |
| -------- | -------- | -------- | -------- | -------- |
| **catalog maintainers are data controllers/processors?** | yes | yes | no | no |
| **authorization** | central service based on user accounts | central service based on academic affiliation | permissions at storage backend | permissions at storage backend |
| **authentication process** | time-limited token | HTTPS | HTTPS | HTTPS |
| **need for/purpose of authentication (need for a "portal user ID")** | yes, to map resource permissions | no (VPN-based general portal access) | yes, to map to resource metadata permissions | no, public portal |
| **choice of metadata standards** | BIDS | DICOM | study mini-meta, BIDS | CITATION.cff, .ris/.nbib, BIDS |
| **metadata homogenization by whom/when/how** | submitter | automated pipeline? | submitter, aided by validation bot | submitter (aided?) |
| **metadata deposition (alongside data?)** | part of the data | part of the data | alongside | alongside, often metadata only |
| **requirements for data deposition** | BIDS compliance, legal stuff? | ICF personnel | INM-7 membership | consortium membership, Sciebo account |
| **process of data deposition** | some to-be-developed and -described process, probably involving signing controller/processor agreements followed by data upload via SFTP or other means | automated indexing of structured TAR files with the _uncurl_ special remote | PR with metadata against a JuGit repo by the submitter | upload to Sciebo [^1] |
| **frequency of dataset updates** | uncommon | never | dependent on dataset | dependent on dataset |
| **frequency of metadata updates** | every few years | never | common | common |

[^1]: Although we can use Sciebo for git-only uploads, Michał wonders whether it would be worth using an existing git hosting (GIN?) or even deploying our own instance of something like Forgejo (a Gitea fork), just to allow reviewed PRs so that a maintainer can ensure that the metadata is solid -- but this is a separate discussion.

---------------------------------------

# Collaborative Notes, Tuesday Feb 7th

**Who's here?** Stephan, Ben, Adina, Laura, Alex, Michael, Christian, Michał

**Agenda**
+ Discussion of Christian's attempt at a systematic approach
+ Discussion of a number of late-night thoughts @mih had the day before

**ACTION ITEMS for Thursday, Feb 9th**: Adina cleans up/structures this hackpad.

### An attempt at a systematic approach

[Here](https://docs.google.com/document/d/1jpUjBV9BeS2n0GBzP5k66aWbg2gu0_7z-aX0Qoso22A/edit?usp=sharing) (I left it on Google Docs for now, because I was editing the images a lot) is a document that attempts to clearly separate the new ideas, identify the necessary mapping onto system components, and map out the possible implementation space.
It might also serve as a blueprint for a future paper, but that is a matter of taste. It is based on my understanding of our discussions. The document is not yet finished, but I would appreciate any feedback on its utility, correctness, etc.

### Michael's chat comments in response to the draft

> OK, went through it -- my brain wants me to comment, my gut tells me to do it tomorrow ;-) I tried reading it from different angles and I conclude two things:
> - we will have a hard time describing our development goal to someone who has worked with something like django, and have that person NOT conclude: "this is just like our stuff". we have to create a glossary and enforce it with corporal punishment
> - there are only so many meanings for "dataset" or "content", and we have exhausted all of them
>
> As pointed out in the google doc, it already starts with "metadata". I don't think we will be allowed to use that term at all, or dataset, or data, or content, or any such short terms. that all sounds more negative than intended. I think the approach is the correct one (conceptual vs implementation vs user vs operator). So thanks for the kick-off!
>
> This is all very frustrating. I start to think the framing of all this data portal kit stuff is not right. I will attach my inner monologue to this chat now. please disconnect if you do not want to suffer. I think a key question underlying all this is "why?". why would anyone want to read about this, or even use our tooling? i believe it boils down to this: they have something related to data management going, but it isn't fancy or featureful enough for some particular use case
> - openneuropet: they have the process, the money, the storage, the legal clearance, but they lack something that makes it cheap and workable long-term
> - sfb1451: they have a mess and need to make it look coherent
> - inm-icf: they are a service infrastructure that is disconnected from its users
> - inm7: we have invested in an rdm tool that is not a sufficient solution for our needs
>
> what we can offer to any such party is a wide variety of features that are built around a few principles, with the fundamental aim to connect existing third-party services that are also ideally paid for by third parties. so everything revolves around datalad datasets. if the stuff is not a datalad dataset, it is out of scope. sometimes we have datalad datasets already and they need to go places, sometimes they need to be created first. and this is where things start: a datalad dataset exists; anything prior to that is out of scope. if a datalad dataset is at the core, the next Q is: what is a datalad dataset? what it is (ignoring special cases) is a metadata structure in a particular deserialized format. that metadata structure describes a dataset (not a datalad dataset, but a dataset). a dataset is a collection of files. datasets and files are the only datatypes that datalad can perform operations on. the metadata may (and does) contain other entities, but they are no more than stuffing wrt the data structure. so we have files and collections of files described in metadata, and we have a system that can export such metadata in serialized form from a datalad dataset. ok, so with datalad we have a system for managing collections of files that need not be located in the same place, and need not be described in a homogeneous fashion. datalad offers the means to describe essential properties of the files and the collections in a homogeneous (and thereby actionable) fashion.
> the metalad->catalog system does the same thing for more of this metadata contained in datalad datasets. homogenizing more, in order to be able to perform more operations on it. the key aim here is "discovery": find stuff we don't know where it is, or if it exists. so in some sense, we build an integrated system to aid discoverability, accessibility, interoperability and all that through homogenization. how is that different from what everybody else is doing? we do not make copies of data, we operate on pointers. they can point to any number of places and systems, and can be heterogeneous while operations still go through a homogeneous api. we do not require curation of metadata to a fixed standard as an admission criterion for the system. instead, we require that metadata can be mapped onto a fixed standard, programmatically and repeatedly. and because this all sounds complicated and expensive, we need to come up with nice rewards for people that bite the bullet and invest in this system. in the end it is this: if you pay the entrance fee (commit to maintaining datalad datasets), here is what you get. so it is not really about data portals or kits. it is about "roll your own" on top of standard infra/services accessible to someone. in the FAIRly big paper this was about "roll your own reproducible computing" when neither the hardware nor the software wants to be cooperative; now it is about "roll your own data showcase". giving particular individuals access (to the showcased data) is a sidenote, and may not even be possible (see the sfb1451 use case). it really is about empowering anyone who cares to curate a collection of datasets to be able to present them, in a structured, human- and machine-accessible fashion, to the largest possible audience. cheers, and good night

## Minutes
- Discussion about @mih's comments in the chat about @christian's draft of a systematic approach
    - "data portal" is a misleading term, shouldn't be used
    - it's rather about a "mix and match" approach of tools to assemble for a specific use case
- "serialized versus deserialized metadata formats" discussion
    - mih proposes: A DataLad dataset is only metadata - a catalog is metadata in a serialized form
- exploring 3 dimensions for classification: heterogeneous/homogeneous **access**, heterogeneous/homogeneous **metadata**, and heterogeneous/homogeneous **governance**
    - INM-7 use case: "data discovery tool"
    - ICF use case: "data discovery tool"; homogeneous metadata, homogeneous access, homogeneous governance
    - OpenNeuroPET use case: remote cloud storage with a central/homogeneous authentication process, but a homogeneous format (BIDS)
    - SFB use case: heterogeneous everything (how/where to get the data, metadata description, governance by separate entities)
- Collaborative curation relates directly to governance
- "Roll your own showcase"
- @adswa: if we classify use cases along these main dimensions, we do not yet have the granularity to discern them on a technical level
- Reminder about 3 levels of description:
    - fully abstract (no mention of DataLad): file, collection of files, metadata about these
    - conceptual implementation: mention DataLad and other concepts
    - concrete implementation: use cases
- Important way to think about the process: who is the agent performing metadata homogenization and/or standardization
- @mih: We have too many terms to define, and their definitions are too complex. This needs to be shorter and simpler.
- @mih: We need a "first author" to make decisions and guide processes. Christian volunteers.
- @stephan proposes that the "dumbest person" should do it (as in naive, unencumbered by previous experiences that could confine one's perspective); everyone agrees. Stephan volunteers as well. Christian and Stephan will work on this together.

### A new set of terms arising in the current meeting's minutes :eyes:
- File
- Collection of files
- Metadata??
- Homogenization
- Curation
- serialized versus deserialized
- governance
- showcase
- agent

# Space for anything else

## Relevant technical features:

The catalog must learn how to:
- annotate permission levels (see the sketch at the end of this section)
- provide information on access

DataLad-MetaLad:
- fine-grained authorization

## Terms

A currently non-exhaustive and possibly overlapping set of relevant terms. The goal would be to disambiguate their meanings, refine the terms, and ultimately define a set of core terms that serve to explain the functionality and components of a DataLad Data Portal Kit.

- Access
- Access granularity
- Archiving
- Assimilation
- Authentication
- Authorization
- Catalog
- Catalog entry
- Catalog generation
- Catalog metadata
- Catalog update
- Centralized
- Confidentiality
- Data
- DataLad dataset
- DataLad sub-dataset
- DataLad super-dataset
- Dataset
- Dataset-level
- Decentralized
- Deposit
- File-content
- File-level
- Metadata
- Metadata extraction
- Metadata store
- Metadata translation
- Operator
- Payload
- Payload-pointer
- Payload-metadata
- Portal
- Schema
- Submission
- Submitter
- Storage
- Storage URL
- Temperature
- Validation
- Workflow
- Workflow execution environment

## Example diagram

```plantuml
start
if (condition A) then (yes)
  :text1;
elseif (condition B) then (yes)
  :text2;
endif
stop
```
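As a companion to the "Relevant technical features" list above, a minimal sketch of what a permission-level annotation on a catalog record could look like; all field names are illustrative placeholders and not part of the datalad-catalog schema:

```python
# Hypothetical catalog record carrying a permission-level annotation and
# access instructions. Field names and values are made up for illustration.
catalog_record = {
    "dataset_id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    "dataset_version": "0.1.0",
    "name": "Example study",
    # e.g. "public" | "intranet" | "restricted"
    "permission_level": "restricted",
    "access": {
        "instructions": "Request access via the portal's access granter.",
        "contact": "data-steward@example.org",
    },
}

# A portal front-end could filter what it renders based on this annotation,
# e.g. hide restricted entries from unauthenticated visitors:
public_view = [r for r in [catalog_record] if r["permission_level"] == "public"]
```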