# DCPPC Key Capability Working Groups

## KC1. Development and Implementation Plan for Community Supported FAIR Guidelines and Metrics

Coordinator: Avi Ma'ayan

The research community has begun to embrace FAIR principles for data, standards, and tooling; however, there are no clear guidelines for what it means to be FAIR or how to measure FAIR-ness. For the NIH Data Commons to be FAIR compliant, there need to be community endorsed guidelines and metrics on applying FAIR principles to digital research assets, roles, and relationships. The development of FAIR guidelines and metrics will require structured reporting methods, a quantification of FAIR-ness, FAIR use cases, and interfaces to capture and report FAIR-ness statistics. Proposed guidelines and metrics will need to be assessed for their usability and utility. The guidelines and metrics must be developed through engagement with the research community and have the community’s demonstrated endorsement. The applicant should propose approaches to developing community endorsed FAIR guidelines and metrics that enable biomedical scientists to annotate and release the products of science digitally, and in a FAIR manner, so the products can be part of a FAIR compliant Data Commons.

## KC2. Global Unique Identifiers (GUID) for FAIR Biomedical Digital Objects

Coordinator: Mercè Crosas

The Data Commons will need to uniquely identify any and all FAIR digital objects community engagement with and endorsement of the proposed methods.

## KC3. Open Standard APIs

Coordinators: Zak Kohane and Paul Avillach

The DCPPC should develop a strategy for maximizing interoperability and reuse of web-based biomedical APIs, through the development of standards for API metadata, registries and workflows.
Working with existing communities that are defining API standards, such as the Global Alliance for Genomics and Health (https://genomicsandhealth.org), and adopting and extending community-defined API standards will be critical to the success of the Data Commons.

## KC4. Cloud Agnostic Architecture and Frameworks

(Part of the Full Stacks; Stan Ahalt, Coordinator)

The ability to exchange data, semantics, universal identifier conventions, and tooling between cloud infrastructures is crucial to FAIR-compliance and to providing a long-term, adaptable, and future-proof Data Commons platform. Commercial cloud storage and cloud computational infrastructure are likely to be the long-term underlying infrastructure for the Data Commons. The Data Commons Framework will need to provide a cloud agnostic abstraction layer that enables researchers to access, contribute to, and learn from data, tooling, and semantics available in the Data Commons. To the extent that the technical and implementation differences between cloud providers allow, this must be possible without requiring direct knowledge of the underlying infrastructure of the Commons.

A successful Data Commons Framework will:

a) enable multiple, highly interoperable, and cross-discoverable data commons platforms/efforts.
b) implement use of multiple clouds including on-premise and hybrid clouds.
c) provide services for authentication, authorization, digital IDs, metadata and data access that span multiple commons platforms/clouds so that data can be accessed transparently across commons platforms/clouds (supporting FAIR principles and compliance).
d) provide services for executing reproducible workflows across clouds so analysis tools can be ported easily across commons platforms/clouds, and so queries and analysis pipelines can be distributed across clouds and the results gathered.
e) minimize ingress and egress charges between clouds and, if applicable, make those charges predictable and well understood.
f) support resiliency and high availability of services available in the cloud.
g) create a community-driven process for implementing and promulgating existing standards for data, tools, and semantics.

## KC5. Workspaces for Computation

(Part of the Full Stacks; Stan Ahalt, Coordinator)

Workspaces for computation should provide users with the ability to store, create, and publish digital objects and analytical pipelines such that they can access and analyze diverse datasets, and visualize results. Workspaces should also provide users with comprehensive and user-friendly interfaces to build and run existing analysis pipelines, to visualize the results, and to allow them to bring their own datasets into the environment.

## KC6. Research Ethics, Privacy, and Security

Coordinator: Karen Davis

Approaches to address research ethics, privacy, and security in the context of the Data Commons should incorporate or be consistent with applicable laws, regulations, rules, and policy on human subjects protections, privacy, and information security. Approaches would allow for cloud-based storage of and access to information that alone or in combination with other information may be considered identifiable or otherwise protected information. At a minimum, approaches should address policy considerations associated with the amalgamation of large datasets, controlled-access oversight management, aggregation of informed consent metadata and tracking. To achieve the required level of data transparency and data use, it is likely that authentication-controlled access solutions will need to be integrated deeply into cloud service providers’ Identity and Access Management (IAM) infrastructure and tools. This component is critical to meeting the FAIR compliant goals of the Data Commons.
It will also allow NIH supported programs (MODs, TOPMed, and GTEx) to share data with the wider research community, creating research synergies while complying with applicable federal protections for research participants, research participant consents, privacy and security requirements, and NIH security policies.

## KC7. Indexing and Search

Coordinator: Ian Foster

The Data Commons will index metadata among different projects through FAIR compliant APIs. This will allow users to search for and identify data of interest in order to create ‘synthetic cohorts’ and conduct meta-analyses of pooled data. This OT ROA is not intended for the research and development of new methods, but rather for the modification and deployment of systems that are already available and have some community support.

## KC8. Scientific Use Cases

Coordinator: Anthony Philippakis

The Data Commons must enable users to address scientific questions of interest from datasets stored on the cloud platform or be able to bring their data to the cloud platform for analysis against datasets stored there. The DCPPC expects to develop scientific use cases with TOPMed, GTEx, and MODs datasets and tools. NIH expects successful applicants to this OT ROA to work collaboratively with the NIH funded investigators and their groups who manage these datasets as part of the Stage 1 consortium (see details below). Scientific use cases should enable analysis of individual TOPMed, GTEx, or MODs datasets or enable analysis across these datasets to better support genotype-phenotype associations. The Data Commons should ultimately provide novel ways to interconnect phenotypic, clinical, imaging, biospecimen, and model organism data.

## KC9. Coordination and Training

Coordinators: Owen White and Titus Brown

The Data Commons Consortium will require coordination and will need a governance model. The Data Commons will also require significant training and outreach activities.
## Full Stack Working Group

Coordinator: Stan Ahalt

The "full stack" of Data Commons technology (KCs 2-7) will need to interoperate within and between KCs and teams.
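
One way to picture how pieces of the stack might interoperate is a GUID index (KC2) sitting on top of a cloud agnostic storage layer (KC4). The Python sketch below is purely illustrative and is not a DCPPC design: the names `StorageBackend`, `InMemoryBackend`, and `Commons`, the GUID string format, and the backend labels `cloud-a` and `cloud-b` are all hypothetical, and the in-memory backend merely stands in for a real cloud object store.

```python
from typing import Dict, Protocol, Tuple


class StorageBackend(Protocol):
    """Hypothetical minimal interface that each cloud adapter would implement."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in for a real cloud object store; it keeps objects in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


class Commons:
    """Maps each GUID to (backend name, key) so callers never name a cloud."""

    def __init__(self) -> None:
        self._backends: Dict[str, StorageBackend] = {}
        self._index: Dict[str, Tuple[str, str]] = {}

    def register_backend(self, name: str, backend: StorageBackend) -> None:
        self._backends[name] = backend

    def put(self, guid: str, backend_name: str, key: str, data: bytes) -> None:
        # Store the object, then record where the GUID currently lives.
        self._backends[backend_name].write(key, data)
        self._index[guid] = (backend_name, key)

    def locate(self, guid: str) -> Tuple[str, str]:
        return self._index[guid]

    def get(self, guid: str) -> bytes:
        # Resolve the GUID first; the caller only ever supplies the GUID.
        backend_name, key = self._index[guid]
        return self._backends[backend_name].read(key)

    def move(self, guid: str, target_backend: str) -> None:
        """Copy the object to another backend and repoint the GUID to it."""
        backend_name, key = self._index[guid]
        data = self._backends[backend_name].read(key)
        self._backends[target_backend].write(key, data)
        self._index[guid] = (target_backend, key)


# Usage: the GUID stays stable while the object changes clouds.
commons = Commons()
commons.register_backend("cloud-a", InMemoryBackend("cloud-a"))
commons.register_backend("cloud-b", InMemoryBackend("cloud-b"))
commons.put("guid:example-0001", "cloud-a", "cohort/variants.tsv", b"chr1\t12345\tA\tG\n")
commons.move("guid:example-0001", "cloud-b")
print(commons.locate("guid:example-0001"))
```

The design choice illustrated here is that identifiers, not storage locations, are the stable handle: relocating an object between providers changes only the index entry, so code written against `get` keeps working. A real system would also need the authentication, authorization, and metadata services described under KC4 and KC6, which this sketch omits.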