Introducing OpSci Commons

# Introducing OpSci Commons

OpSci Commons is built to serve web-native scientists who perform the majority of their research activities on the internet. The avalanche of massive scientific data produced by ambitious “super projects” such as the Obama BRAIN Initiative, the Human Connectome Project (HCP), and the Adolescent Brain Cognitive Development (ABCD) Study has opened the door to a combinatorial explosion of scientific hypotheses that can be tested purely in digital code. For the very first time, thousands of petabytes of data will become available for researchers to generate insights about the human brain in unprecedented detail. Unfortunately, sufficient digital infrastructure to support open access to this data does not yet exist. High-resolution images exceeding an exabyte in size are impossible to render on most consumer hardware, so the benefits of this new knowledge are accessible only to institutions with the required resources and expertise.

OpSci Commons solves this data sharing and reuse problem by breaking datasets like these into efficient chunks and storing them on distributed storage networks, using peer-to-peer data routing to optimize the bandwidth available for requested images. Massive datasets can be sustainably archived for free through OpSci Commons by leveraging file storage networks engineered to drive costs down through incentives for public participation. Filecoin Plus rewards storage providers for storing data that serves the public good by making ordinary commercial storage providers pay for that overhead. Data archived on Filecoin Plus is cryptographically guaranteed to be available for a predetermined time. In this way, Commons provides a persistent data layer that exists through public participation in a distributed storage network rather than through dependence on a third-party service provider such as AWS.
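
As a rough illustration of the chunking and content addressing described above, the sketch below splits a byte array into fixed-size chunks and names each chunk by a hash of its contents. It is not the IPFS or Filecoin implementation: the names `toyHash`, `chunkDataset`, and `rootId` are hypothetical, and a non-cryptographic 32-bit FNV-1a hash stands in for the cryptographic hashes that real content identifiers rely on.

```typescript
// Toy content addressing. This is NOT the IPFS CID algorithm: real content
// identifiers are built on cryptographic hashes. The non-cryptographic 32-bit
// FNV-1a hash below is a stand-in so that the sketch needs no imports.

/** Hash bytes with 32-bit FNV-1a and return 8 lowercase hex characters. */
function toyHash(bytes: Uint8Array): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    // Multiply by the FNV prime 16777619 (2^24 + 2^8 + 2^7 + 2^4 + 2 + 1),
    // written as shifts and kept within 32 bits by the unsigned shift.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

/** A chunk of a dataset, named by the hash of its own bytes. */
interface Chunk {
  id: string;
  bytes: Uint8Array;
}

/** Split a dataset into fixed-size chunks (the last one may be shorter). */
function chunkDataset(data: Uint8Array, chunkSize: number): Chunk[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  const chunks: Chunk[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const bytes = data.slice(offset, offset + chunkSize);
    chunks.push({ id: toyHash(bytes), bytes: bytes });
  }
  return chunks;
}

/** Derive one identifier for the whole dataset from its ordered chunk ids. */
function rootId(chunks: Chunk[]): string {
  const joined = chunks.map((chunk) => chunk.id).join("");
  const bytes = new Uint8Array(joined.length);
  for (let i = 0; i < joined.length; i++) {
    bytes[i] = joined.charCodeAt(i); // ids are hex, so every code fits in a byte
  }
  return toyHash(bytes);
}

// Example: a 10-byte "dataset" split into chunks of 4 bytes.
const demoChunks = chunkDataset(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), 4);
console.log(demoChunks.map((chunk) => chunk.id), rootId(demoChunks));
```

Because identifiers are derived from content, any two peers holding the same bytes compute the same chunk identifiers, so a requested chunk can be served by whichever peer has it.
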
A persistent data layer of this kind is a significant upgrade for decentralized web-native science communities that depend on public funding of infrastructure providers to archive open-access datasets and keep them available.

![](https://i.imgur.com/p7f0vh9.png)

**An Open Source Infrastructure.** OpSci Commons is an open source decentralized application (dApp) that anyone can fork to host their own scientific data commons and archival workflows. Commons was built with the expectation that others would fork the code and join the distributed scientific data-sharing network, increasing resiliency, bandwidth, and potential for collaboration. Several open-source tools are integrated into Commons to solve data-sharing challenges. The Brain Imaging Data Structure (BIDS), DataLad, and git-annex are at the heart of automated pipelines for dataset provenance and archival. InterPlanetary File System (IPFS), libp2p, Filecoin, and Ethers.js are integrated with scientific tooling to enable peer-to-peer discovery of shared datasets.

Although Commons was built with neuroimaging data in mind, the architecture is general enough to support public health, bioinformatics, civics, climate, astronomy, and other valuable open science datasets. The OpSci Commons architecture was designed to scale to meet the general challenges of sharing massive datasets by using distributed storage networks and offering a seamless UX for discovery, archival, and publication.

To get started with integrating OpSci Commons, users can follow the documentation for the API back-end on the GitHub repository. The RESTful API provides several methods for publishing, downloading, and querying metadata for knowledge artifacts published on Commons (diagram for upload below).

![](https://i.imgur.com/wSPuMKp.png)

**Incentive Mechanism Design for FAIR Practice.** The majority of researchers believe that the products of scientific research should be Findable, Accessible, Interoperable, and Reusable (FAIR).
[Yet only a small minority report achieving FAIR research practices, citing high cost and little incentive to justify the investment of money, time, and personnel.](https://doi.org/10.1038/s41597-022-01325-2) Research funding is typically allocated to budget line items that generate scientific results and less so to maintaining or communicating those results. It should be no surprise that very few low-cost solutions allow researchers to publish large scientific datasets. [Researchers also report that open science practice is a privilege](http://doi.org/10.1089/bio.2020.0037) for a minority with abundant funding who can afford to hire support staff, pay for expensive software, and publish data without fear of being scooped.

Commons is an experiment in incentive mechanism engineering that uses free archival, enhanced discoverability, and verifiable open science impact certificates as motivating factors for data sharing. Users can archive and publish datasets of any size for free on Commons, with the promise that the data will be persistently available as long as there are participants in [the distributed storage network, which follows its own incentive mechanisms for data storage as a commodity.](https://filecoin.io/2020-engineering-filecoins-economy-en.pdf) To unlock this benefit, 1) the user must sign in with a verified ORCID iD, and 2) the uploaded data must pass an automated validation check for machine readability.

Let’s unpack this. ORCID is a persistent identifier for researchers that allows their research activities to be tracked and become more discoverable by altmetric data aggregators and scientific web services. By using ORCID, researchers can add published datasets to their records of scientific works, gaining recognition for scientific productivity beyond authorship of a journal article. Verifying dataset authorship is just the first step in linking open science impact to demonstrable FAIR science practice.
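
The two conditions for free archival described above can be sketched as a simple gate. The check-character calculation follows ORCID's published identifier structure (ISO 7064 MOD 11-2). Everything else is a hypothetical illustration rather than the Commons back-end's actual code: the `Submission` shape, the `orcidVerified` flag (standing in for a completed ORCID sign-in), and `validationErrors` (standing in for the output of the automated validator) are assumptions. Note that a well-formed iD is not the same as a verified one, which is why the sketch checks both.

```typescript
/** Compute the ORCID check character (ISO 7064 MOD 11-2) for the 15 base digits. */
function orcidCheckChar(baseDigits: string): string {
  let total = 0;
  for (let i = 0; i < baseDigits.length; i++) {
    total = (total + Number(baseDigits.charAt(i))) * 2;
  }
  const result = (12 - (total % 11)) % 11;
  return result === 10 ? "X" : String(result);
}

/** Check the 0000-0000-0000-000X layout and the trailing check character. */
function isWellFormedOrcid(orcid: string): boolean {
  if (!/^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$/.test(orcid)) {
    return false;
  }
  const compact = orcid.replace(/-/g, "");
  return orcidCheckChar(compact.slice(0, 15)) === compact.charAt(15);
}

// Hypothetical shape of an upload request; not the real Commons schema.
interface Submission {
  orcid: string;
  orcidVerified: boolean; // assumed to be set after a successful ORCID sign-in
  validationErrors: string[]; // assumed output of the machine-readability check
}

/** Both conditions must hold before a dataset is accepted for free archival. */
function canArchiveForFree(submission: Submission): boolean {
  return (
    submission.orcidVerified &&
    isWellFormedOrcid(submission.orcid) &&
    submission.validationErrors.length === 0
  );
}

// Example using the sample iD from ORCID's own documentation.
console.log(
  canArchiveForFree({
    orcid: "0000-0002-1825-0097",
    orcidVerified: true,
    validationErrors: [],
  })
);
```
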
The I (Interoperable) and R (Reusable) in FAIR are tightly interlinked and can often be addressed through machine-readable standards for dataset structure. In the Commons workflow, datasets submitted for publication are accepted only if they conform to a data standard. For neuroimaging datasets, we implement the BIDS standard, a dataset structure that requires rich metadata. This provides transparency into the acquisition parameters, protocol details, and provenance of the dataset. Computational analysis workflows can be expected to run largely autonomously if the source dataset conforms to a consistent structure: a developer can write an algorithm and reasonably expect it to execute without error on any BIDS dataset. This enhances both interoperability and reproducibility by making it easier to run the same code on any given dataset.

Lastly, all data stored on Commons is addressed by a content identifier, meaning that the content of the data uniquely identifies it on a distributed network. Content addressability goes beyond the DOI system by providing a direct link between the bits and the “name” of the dataset. This makes the dataset more discoverable and less likely to be confused with other data, improving the odds that researchers can reproduce findings from it.

Free storage and enhanced discoverability are just the tip of the incentive iceberg that can be explored with a distributed data commons model.

### Roadmap

We have the following deliverables planned for the future:

- **OS-C-1 (User Research):** Ongoing user research, scientist interviews, and user requirement sourcing to generate feedback, collect frequently asked questions, pilot new features, and refine UI/UX design for various application flows.
  - **OS-C-1.1:** Interview 20 researchers for feedback on the OpSci Commons flow deployed as of Sept 2022.
    Source feedback on ORCID user sessions, identify bugs, and collect data on expected dataset sizes to be uploaded.
  - **OS-C-1.2:** Interview 20+ researchers for feedback on the OpSci Commons flow deployed as of Dec 2022. Source feedback on front-end enhancements and upgraded search features, and identify requirements for queries and complex filters.
  - **OS-C-1.3:** Interview 20+ researchers for feedback on the OpSci Commons flow deployed as of May 2023. Source feedback on back-end enhancements, impact certificate minting, and preferences for funding published projects, and identify attitudes towards metrics for impact.
- **OS-C-2 (Back-End Enhancements):** A searchable database of archived datasets with rich metadata, such as authorship, experimental protocols, key terms, and associated scientific artifacts. Tight integration with existing academic web services, OpSci Verse, and the IC-NFT flow.
  - **OS-C-2.1:** Scope OAuth linking with user sessions for DOI, ORCID, Figshare, Dryad, and Open Science Framework.
  - **OS-C-2.2:** Bidirectional synced events with on-chain syncing onto ORCID records and/or OSF projects.
  - **OS-C-2.3:** Deployment of the metadata schema with an on-chain subgraph (The Graph or Tableland, TBD).
  - **OS-C-2.4:** Upgrades to the metadata schema following requirements sourced from open science contributors and ecosystem participants (i.e., DANDI schema).
  - **OS-C-2.5:** Integration of IPNS for human-readable persistent content-based identifiers linked with existing standards such as DOI.
  - **OS-C-2.6:** Token-based gating based on OpSci Society membership and account linking.
- **OS-C-3 (Front-End Enhancements):**
  - **OS-C-3.1:** UX improvements sourced from user research for a seamless publishing and sharing flow.
  - **OS-C-3.2:** Embedded in-browser application (Neuroglancer) for visualizing multi-dimensional data published in Commons.
- **OS-C-4 (Production Performance):**
  - **OS-C-4.1:** Collaboration with the PiKNiK Filecoin archival service provider to identify server architecture for performance optimization.
  - **OS-C-4.2:** Demonstrations of storage architecture performance in comparison with S3-only solutions.
- **OS-C-5 (Impact Certificate Development):** Deployment of Hypercert/Impact Certificate certification using an off-chain database for the prototype, followed by a web3-native integration with either The Graph or Tableland for complex SQL relational queries and dynamic NFT metadata.
  - **OS-F-3.1:** Smart contracts that link proposals, associated metadata, and computed impact metrics with the minted IC-NFT.
  - **OS-F-3.2:** Smart contract expansion to include minting an IC-NFT contribution record for funders, linked to the parent IC-NFT.
  - **OS-F-3.3:** Wireframe design of the Impact Certificate with responsive elements, embedded within the Commons front-end.
- **OS-C-6 (Open Source Dataset Archival Pipelines):** Automated Filecoin archival of public open source datasets, targeting 10 PB of data stored on the Distributed Archives for Neurophysiology Data Integration (DANDI), in collaboration with the MIT Senseable Intelligence Group.

### Our development updates from the past month

#### TypeScript Migration

TypeScript is a superset of JavaScript with optional static typing that helps build and manage large-scale JavaScript projects. It can be thought of as JavaScript with additional features such as static typing, a compilation step, and stronger support for object-oriented programming. TypeScript reports type errors at compile time, during development.
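
To make compile-time checking concrete, here is a minimal sketch. The `DatasetRecord` shape and the `describeDataset` function are hypothetical examples and do not come from the Commons back-end.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface DatasetRecord {
  title: string;
  sizeBytes: number;
  authors: string[];
}

/** Build a one-line summary; the annotations document what callers must pass. */
function describeDataset(record: DatasetRecord): string {
  const sizeGb = record.sizeBytes / 1e9;
  return `${record.title} (${sizeGb.toFixed(1)} GB, ${record.authors.length} author(s))`;
}

const record: DatasetRecord = {
  title: "Example BIDS dataset",
  sizeBytes: 2500000000,
  authors: ["A. Researcher"],
};
console.log(describeDataset(record));

// The compiler rejects the call below before the program runs, because
// `sizeBytes` is a string and `authors` is missing. In plain JavaScript the
// same call would only fail once it executed.
// describeDataset({ title: "Broken", sizeBytes: "2.5 GB" });
```

Uncommenting the last call turns the mistake into a build failure rather than an error discovered by a user.
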
Catching type errors before the code runs makes runtime errors less likely than in plain JavaScript, which is interpreted and surfaces such errors only during execution.

Considering these benefits, we decided to migrate to TypeScript to add type safety to our back-end. This task included adding TypeScript dependencies, changing file extensions, configuring TypeScript, rewriting the code in ECMAScript syntax, and adding type annotations for variables and functions. While migrating, we made sure to keep the logic of the code the same. You can follow our progress on the `adopt-typescript` branch of the Commons-Backend repo: https://github.com/opscientia/commons-backend/tree/adopt-typescript

#### ORCID Integration

We are working on integrating ORCID as one of the authentication methods for our users, which is one of the major updates for OpSci Commons. ORCID is a persistent identifier for researchers that allows their research activities to be tracked and become more discoverable by altmetric data aggregators and scientific web services. So far we have:

- Added ORCID authentication, using JWT for session management, on the /login page.
- Displayed the user's ORCID iD on the protected /dashboard page once they are logged in.
- Displayed the user's ORCID iD on the Accounts page.

### How to contribute?

Connect with us on [Discord](https://discord.com/invite/n7UBwrGywZ) and on [Twitter](https://www.twitter.com/opscientia).

Stay updated with our [newsletter](https://pulse.opsci.io/).