# NOAA Omics Data Management Guide
# Home
Welcome to the NOAA Omics Data Management Guide!
## Overview
Management of data generated in research is critical to ensure fidelity of the scientific enterprise and maximum utility of research funding. Management of omics data presents additional challenges because of its large datasets, new and diverse technologies generating those datasets, and the many types of data and environmental data involved.
The motivation of this guide is to facilitate finding the data produced by every 'omics project and ensuring that it is usable by future researchers.
The focus of this guide is to help omics users decide which data goes to which repositories (NCEI, NCBI, etc.), and how to deposit their data so that it will be FAIR (findable, accessible, interoperable, and reusable), including following community standards for raw data, processed data, and contextual metadata and environmental data.
## Contents
The wiki is organized into the following pages:
1. **[Omics Data](https://github.com/aomlomics/omics-data-management/wiki/1-Omics-Data)** – How to format, QA/QC, and share omics data (raw, processed, or tabular)
2. **[Metadata & Environmental Data](https://github.com/aomlomics/omics-data-management/wiki/2-Metadata-&-Environmental-Data)** – How to format, QA/QC, and share metadata (about environments, samples, or methods) and other non-omics data
3. **[FAIR Data Principles](https://github.com/aomlomics/omics-data-management/wiki/3-FAIR-Data-Principles)** – How to standardize and share your data to be findable, accessible, interoperable, and reusable (FAIR)
4. **[Checklists](https://github.com/aomlomics/omics-data-management/wiki/4-Checklists)** – How to decide which standards and repositories you should use for your data
## Motivation
This section outlines the advantages of good data management for NOAA's mission, including machine readability and compliance with US Government mandates.
### Why manage data
#### FAIR Data Principles
The FAIR data principles were defined in the 2016 *Scientific Data* paper [The FAIR Guiding Principles for scientific data management and stewardship](https://www.nature.com/articles/sdata201618) and are listed [here](https://www.go-fair.org/fair-principles/). The criteria of the FAIR principles are as follows:
* Findable – Data/metadata should be richly described, include the identifier of the data they describe, be assigned a globally unique and persistent identifier, and be indexed in a searchable resource.
* Accessible – Data/metadata should be retrievable by an identifier using a standardized communications protocol, which is open, free, and universally implementable, and the metadata should be accessible even when the data are no longer available.
* Interoperable – Data/metadata should use a formal, accessible, shared, and broadly applicable language for knowledge representation, using vocabularies that follow FAIR principles, and include qualified references to other metadata and/or data.
* Reusable – Data/metadata are richly described with a plurality of accurate and relevant attributes, are released with a clear and accessible data usage license, are associated with detailed provenance, and meet domain-relevant community standards.
### Machine readability
### Government mandates
* [White House Executive Order - Making Open and Machine Readable the New Default for Government Information (2013)](https://obamawhitehouse.archives.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-)
* [NOAA Plan for Increasing Public Access to Research Results (PARR) (2015)](https://www.glerl.noaa.gov/review2016/reviewer_docs/NOAA_PARR_Plan_v5.04.pdf) (also see [PARR website](https://www.ngdc.noaa.gov/parr.html))
# 1 Omics Data
Omics datasets should be sent to the relevant long-term data repository in accordance with the publication requirements of the field of study. While NCEI can be used to curate some small omics datasets (< ~20 GB), it is not easily searchable nor does it allow for critically important interactive querying (e.g., BLAST), and so should not be the lone repository for omics data if other options apply. Raw data (e.g., FASTQ files from the sequencing center) should be submitted to a repository for proper archiving. Data analysis products (e.g., MAG/genome assemblies) are useful to the scientific community, and should be submitted to relevant repositories. We offer guidelines for archival locations for omics datasets below and in Table 1.
## Data Types vs. Methods
Omics research encompasses diverse methods that produce many types of data, with many methods generating similar kinds of data. To simplify the decision process for how to manage your data, here we provide guidance that is both data type-specific (*e.g., Where do I submit DNA sequence data?*) and method-specific (*e.g., Where do I submit metabarcoding sequences, taxonomy tables, and feature tables?*).
## Repositories
Recommended destinations for different data types
### Table 1 - Data repositories
Suggested formats and destination repositories for common environmental omics datasets. Please note that, although NOAA's Coral Reef Information System ([CoRIS](https://www.coris.noaa.gov/CoRIS)) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.
!! Note that some of these are metadata and can be moved to section 2
Data type | Data formats (non-exhaustive) | Repository
-- | -- | --
Station location and sampling metadata (CTD or underway sensor data, physicochemical or nutrient data, trawl data, etc.) | Tab-delimited text following MIxS, OADS, ESIP, or ISO standards. MIxS is preferred unless another metadata format is mandated (e.g., OADS for OAP-funded work). | [NCEI](https://www.ncei.noaa.gov/archive)
Locations of omics data ('big data'), feature observation tables and metadata, and analysis code (see below) | Tab-delimited text containing persistent identifiers for each dataset, such as Digital Object Identifiers (DOIs) or Persistent Uniform Resource Locators (PURLs) | [NCEI](https://www.ncei.noaa.gov/archive)
DNA reference sequences | GenBank format | [NCBI GenBank](https://www.ncbi.nlm.nih.gov/genbank/submit/)
DNA sequence data (amplicon, metagenomic, RAD-Seq) | Raw FASTQ | [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra/docs/submit/)
Amplicon Sequence Variants | Reference FASTA | [NCEI](https://www.ncei.noaa.gov/archive)
RNA sequence data (RNA-Seq) | Raw FASTQ | [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra/docs/submit/)
Functional genomics data (quantitative gene expression, ChIP-Seq, HiC-seq, methylation seq) | Metadata, processed data (e.g., raw read counts), raw FASTQ | [NCBI GEO](https://www.ncbi.nlm.nih.gov/geo/info/submission.html) (raw data submitted to NCBI SRA for you)
RNA transcript assemblies | FASTA or SQN file | [NCBI TSA](https://www.ncbi.nlm.nih.gov/genbank/tsa/)
Genome assemblies | FASTA or SQN file, optional AGP file to orient scaffolds | [NCBI WGS](https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/)
Quantitative PCR data | Tab-delimited text | [NCEI](https://www.ncei.noaa.gov/archive)
Mass spectrometry data (metabolomics, proteomics) | Raw mass spectra, MZML, MZID | [ProteomeXChange](https://www.proteomexchange.org/), [Metabolomics Workbench](https://www.metabolomicsworkbench.org/)
Coral reef data | Tab-delimited text, HDF, or netCDF (less preferable) | [CoRIS](https://www.coris.noaa.gov/CoRIS) (via [NCEI](https://www.ncei.noaa.gov/archive))
Feature observation tables and feature metadata | BIOM (HDF5) format (feature observation tables), tab-delimited text (feature metadata) | [NCEI](https://www.ncei.noaa.gov/archive) (size permitting), [Zenodo](https://zenodo.org/), or [Figshare](https://figshare.com/)
Reference database | FASTA (sequences) and TSV (taxonomy) | [Zenodo](https://zenodo.org/) or [FigShare](https://figshare.com/) or [Dryad](https://datadryad.org/stash)
Analysis code | Commented code and Jupyter notebooks | GitHub (optionally archived on [Zenodo](https://zenodo.org/) or [FigShare](https://figshare.com/) or [Dryad](https://datadryad.org/stash))
Figure code | Commented code for recreating figures (R, etc) | GitHub (optionally archived on [Zenodo](https://zenodo.org/) or [FigShare](https://figshare.com/) or [Dryad](https://datadryad.org/stash))
Notes on data formats:
* Tab-delimited text files should contain a single row at the top containing column headers. Column headers should be written in camelCase or with_underscores (no spaces) with units included in the column headers.
* Tab-delimited text files can be created as Microsoft Excel or Google Sheet files, edited, and then saved as tab-delimited text (.txt, Excel) or tab-separated values (.tsv, Sheets), which are the same except for the extension.
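
The header conventions above can be checked automatically before a file is shared or submitted. The following is a minimal sketch, not an official validator: the function name `check_headers` and the regular expression are assumptions made for illustration. The pattern only enforces "letters, digits, and underscores, starting with a letter", and it does not verify that units are present.

```python
import csv
import io
import re

# Assumed rule: camelCase or with_underscores, no spaces or other punctuation.
HEADER_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_headers(tsv_text):
    """Return a list of problems found in the header row of tab-delimited text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    headers = next(reader, [])  # the single header row at the top of the file
    problems = []
    for name in headers:
        if not HEADER_PATTERN.match(name):
            problems.append(f"invalid header: {name!r}")
    # Duplicate column names make a table ambiguous for downstream tools.
    duplicates = {name for name in headers if headers.count(name) > 1}
    for name in sorted(duplicates):
        problems.append(f"duplicate header: {name!r}")
    return problems


# Hypothetical example: the third header contains a space and is reported.
example = "sample_name\tdepth_m\twater temp\nS1\t5\t24.1\n"
print(check_headers(example))
```

An empty list means the header row passed these checks; any other result lists the headers to rename before saving the file.
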
## Storage & Backup
Best practices for storage and backup of 'omics data should ensure that the data files are associated with their metadata and backed up securely.
### Raw data
Upon receipt of raw sequencing data, FASTQ files should be immediately stored in two locations (e.g., a local drive for analyses and on the NOAA Google Drive). The location of these files and their [metadata](https://github.com/aomlomics/omics-data-management/wiki/2-Metadata-&-Environmental-Data) (e.g., file name, associated project, sequence submission date) should be recorded on a Google Spreadsheet. This spreadsheet should be backed up on a regular basis by downloading it as a tab-delimited file and saving to an external drive.
Storage locations of raw sequence data should have a regularly scheduled backup plan (e.g., RAID on server drives, external backup of Google Drive).
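
One way to keep the record of raw files consistent is to generate it from the files themselves. The sketch below writes a tab-delimited inventory of FASTQ files. The column names, the `*.fastq.gz` file pattern, and the MD5 checksum column are assumptions added for illustration (the checksum lets you confirm that the two stored copies are identical); none of them is a NOAA requirement.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# Hypothetical column names for the inventory file.
FIELDS = ["file_name", "project", "size_bytes", "md5", "date_recorded"]


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(fastq_dir, project, out_tsv):
    """Write one inventory row per FASTQ file in fastq_dir and return the rows."""
    rows = []
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        rows.append({
            "file_name": path.name,
            "project": project,
            "size_bytes": path.stat().st_size,
            "md5": md5sum(path),
            "date_recorded": date.today().isoformat(),
        })
    with open(out_tsv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For example, `build_inventory("raw_fastq", "my_project", "fastq_inventory.tsv")` (hypothetical paths and project name) produces a file that can be imported into the tracking spreadsheet and compared against the checksums of the backup copy.
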
### Intermediate and processed data files
Depending on your 'omics method, you will generate various types of intermediate and final processed files. For files produced by analyses that are computationally or time-intensive (e.g., trimmed FASTQ files, ASVs, assembled RADseq loci), it is a good idea to back them up in a second location until the project is completed or the files are uploaded to a repository such as Dryad, Zenodo, or Figshare.
## Archiving and cross-linking
All NOAA ‘omics projects that are eligible for NCEI should include a project submission to the [National Centers for Environmental Information (NCEI)](https://www.ncei.noaa.gov/), and provide a README file locating where all products of that project have been submitted. This file should contain a description of the data and a link to a persistent digital object identifier (DOI) or NCBI accession number. It should also include links to all raw data, metadata, data analysis products, and code used for the ‘omics project, including a self-referential link to the NCEI project submission where the README file is found.
This NCEI accession number can be cross-listed in other repositories containing the project's data. For example, you can provide a link to the NCEI project in the "Related Resources" section of an NCBI BioProject.
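
As an illustration of the cross-linking README described above, the sketch below renders a small table of project products and their identifiers. The function name `render_readme`, the table columns, and every identifier shown are placeholders invented for this example, not real accessions or a required layout.

```python
def render_readme(project_title, products):
    """Build README text listing where each project product is archived."""
    lines = [
        f"# {project_title}",
        "",
        "Product | Repository | Identifier",
        "-- | -- | --",
    ]
    for item in products:
        identifier = item.get("identifier", "").strip()
        if not identifier:
            # Every product should resolve to a DOI or accession number.
            raise ValueError(f"no persistent identifier for {item['product']!r}")
        lines.append(f"{item['product']} | {item['repository']} | {identifier}")
    return "\n".join(lines) + "\n"


# Placeholder entries; replace with the project's actual DOIs and accessions.
products = [
    {"product": "Raw FASTQ", "repository": "NCBI SRA", "identifier": "PRJNA000000"},
    {"product": "Analysis code", "repository": "Zenodo", "identifier": "10.5281/zenodo.0000000"},
]
print(render_readme("Example project", products))
```

Raising an error for a missing identifier is a deliberate choice in this sketch: it prevents a README from being generated while a product still lacks a persistent link.
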
An additional organized and accessible data archive can further support reproducibility of a project. This archive can be hosted on Figshare, Dryad, or GitHub and archived with Zenodo. It should include cross-listed repository links or files for all raw data, metadata, data analysis products, code, and any research products (manuscripts, figures, etc.). Optionally, it can also include details on lab or field protocols.
The [AOML ’Omics](https://github.com/aomlomics/) GitHub organization contains the repository [Datasets](https://github.com/aomlomics/datasets) that lists the datasets generated by AOML ’Omics and links to the primary data location (BioProject, ProteomeXchange, or NCEI). It can be viewed as a [GitHub Page](https://aomlomics.github.io/datasets/), which is more user-friendly than a repository page. ’Omics users can edit the README.md file (containing the table of datasets) directly from a web browser (command-line Git is not required).
# 2 Metadata & Environmental Data
Metadata are contextual data about your experimental data. Metadata are the who, what, when, where, and why of these data. Metadata put these data into context. For 'omics studies, metadata include information about the sample: when it was collected, where it was collected from, what kind of sample it is, and what were the properties of the environment or experimental condition from which the sample was taken. Information about sample processing is also metadata: methods used to extract and purify molecules (e.g., DNA) from the sample, type of DNA sequencing or other ’omics analyses done, and where the raw experimental data are located.
For an 'omics study, metadata exist at multiple stages along the path from samples to analysis. For example, contextual information about when and where the sample was collected could include descriptors like date, time, and geospatial coordinates. Processing the sample in the lab for analysis requires different protocols and instrumentation. Once these data become electronic and are analyzed, the results should be accompanied by the versioned software used.
## Categories of metadata
### Sample metadata
*Most of this would not be considered metadata in the strict (non-omics) sense. It is data, but it is used as contextual data for the omics data.*
* Sample metadata: this is specified by the MIxS Package
### Experimental metadata
* Preparation metadata: how samples were extracted, how samples were sequenced; this is specified by the MIxS
* Data processing/analysis metadata: includes both primary
* Feature metadata: ID number, sequence, taxonomy (multiple)
## Metadata repositories
| Project type | Metadata repositories |
| --------------------- | ------------------------- |
| Environmental survey | [NCEI](https://www.ncei.noaa.gov/archive) & [NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/submit/#accepted-data) |
| Mesocosm experiment | [NCEI](https://www.ncei.noaa.gov/archive) & [NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/submit/#accepted-data) |
| Laboratory experiment | [NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/submit/#accepted-data) |
| Genome/reference sequencing | [NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/submit/#accepted-data) |
| Non-DNA/RNA data (e.g.: metabolomics, proteomics) | Specialized repository and/or [NCEI](https://www.ncei.noaa.gov/archive) (if environmental), [NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/submit/#accepted-data) |
<!--
Can we submit metadata to BioSample without sequence data?
\!-->
## Recording metadata
Planning what metadata to record, and how to record it in a standardized way, from the start of every project will save considerable time when submitting your data to repositories and will improve overall research reproducibility. Typically, each individual sample is recorded as a row and metadata attributes for each sample are recorded in columns. The specific metadata attributes will vary depending on your 'omics method, but generally the sample metadata should contain the minimum information needed to comprehensively describe the physical and chemical conditions of the sample, from collection through sequencing. A (non-comprehensive) list could include the sampling environment, molecular lab procedure names, sequencing platform and model (required by NCBI SRA), data processing steps, and external identifiers from additional processing steps (e.g., sample IDs from a sequencing organization).
Metadata should be collated from primary sources (e.g., paper notes, emails, external collaborators, lab notebooks) and associated with sample IDs as soon as possible after it is generated. Primary sources of metadata should be backed up in case they are needed as a reference later. During this stage, the metadata should be evaluated for potential errors (e.g., incorrect GPS coordinates, missing data) and followed up on with the original data collector.
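
The error checks mentioned above (missing data, implausible GPS coordinates) can be partly automated. The following is a minimal sketch that assumes each sample is a dictionary of strings; the column names `sample_name`, `collection_date`, `decimal_latitude`, and `decimal_longitude` are hypothetical and should be replaced with the names in your own template.

```python
def find_metadata_problems(rows, required=("sample_name", "collection_date")):
    """Return (row_number, message) tuples for values that need follow-up."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in required:
            if not (row.get(field) or "").strip():
                problems.append((number, f"missing {field}"))
        # Latitude must lie within +/-90 degrees, longitude within +/-180.
        for field, limit in (("decimal_latitude", 90.0), ("decimal_longitude", 180.0)):
            raw = (row.get(field) or "").strip()
            if not raw:
                continue  # absence is reported only for required fields
            try:
                value = float(raw)
            except ValueError:
                problems.append((number, f"{field} is not a number: {raw!r}"))
                continue
            if not abs(value) <= limit:
                problems.append((number, f"{field} out of range: {value}"))
    return problems


# Hypothetical samples: the second has no date and an impossible latitude.
samples = [
    {"sample_name": "S1", "collection_date": "2021-06-01",
     "decimal_latitude": "25.73", "decimal_longitude": "-80.16"},
    {"sample_name": "S2", "collection_date": "",
     "decimal_latitude": "125.73", "decimal_longitude": "-80.16"},
]
for row_number, message in find_metadata_problems(samples):
    print(row_number, message)
```

A check like this only flags values that are impossible or absent; a coordinate that is valid but wrong (for example, a swapped sign) still has to be confirmed with the original data collector.
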
## Refining metadata to a standard
Standardizing the format of your metadata can both facilitate sharing of your results with others and improve identification of errors within your own metadata. Having a method-specific template Google Sheet or Excel file for metadata that you use across all similar studies can be very helpful. This template should include a second sheet or file with a "data dictionary" defining the desired attribute columns and formats. Refining your metadata to a standard has benefits for internal use, publication, and repository submission. Refined metadata should have the following characteristics:
1. Standardized: The data in each column should be written in a consistent manner, i.e. in the same format.
2. Well-defined attributes: The names of the metadata parameters of each sample should be obvious in definition, with units included if applicable.
3. Comprehensive: The collected metadata should be comprehensive, but without extraneous or unnecessary attributes.
A general workflow for transforming and refining metadata:
1. Fill in missing metadata: consult cruise/ship notes and other potential sources of unconsolidated information about the samples. For missing data that cannot be filled in, we recommend following the [INSDC standardized missing value reporting language](https://ena-docs.readthedocs.io/en/latest/submit/samples/missing-values.html).
2. Compare the columns present in your metadata with those in the data dictionary, which defines the minimum information to be submitted with the publicly available dataset.
3. Standardize the values in each column to the format that the data dictionary specifies for that column.
4. Optional: input columns to relational database
5. Publish data to NCEI and other potential funding specific sites
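
Steps 1 to 3 of the workflow above can be sketched in a few lines. This example assumes the metadata rows are dictionaries and the data dictionary is a list of column names. The function name and the default of `not collected` are illustrative choices; the tuple holds only a subset of the INSDC missing value terms, and the appropriate term depends on why a value is missing.

```python
# A subset of the INSDC missing value reporting terms.
INSDC_MISSING_TERMS = ("not applicable", "not collected", "not provided", "restricted access")


def align_to_dictionary(rows, dictionary_columns, missing_value="not collected"):
    """Return (aligned_rows, extra_columns).

    aligned_rows contain exactly the data dictionary columns, with empty or
    absent values replaced by an INSDC missing value term. extra_columns lists
    columns found in the input that the data dictionary does not define.
    """
    if missing_value not in INSDC_MISSING_TERMS:
        raise ValueError(f"unrecognized missing value term: {missing_value!r}")
    aligned = []
    extras = set()
    for row in rows:
        extras.update(key for key in row if key not in dictionary_columns)
        aligned.append({
            column: (row.get(column) or "").strip() or missing_value
            for column in dictionary_columns
        })
    return aligned, sorted(extras)


# Hypothetical input: depth is blank, temperature is absent, notes is extra.
rows = [{"sample_name": "S1", "depth_m": "", "notes": "windy"}]
aligned, extras = align_to_dictionary(rows, ["sample_name", "depth_m", "temp_C"])
print(aligned)
print(extras)
```

Reporting the extra columns rather than silently dropping them leaves the decision to the curator: they may be unnecessary, or they may indicate that the data dictionary needs a new attribute.
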
## Submitting metadata and environmental data to repositories
OAR ’Omics projects should make their metadata and non-’omics data (non-“big data”) publicly accessible to the appropriate repositories (also see [Table 1](https://github.com/aomlomics/omics-data-management/wiki/1-Omics-Data#repositories)).
### NCEI
Send metadata and environmental data to the [National Centers for Environmental Information (NCEI)](https://www.ncei.noaa.gov/), generally excluding ’omics data, as these large datasets should be stored in the relevant [long-term data repository](https://github.com/aomlomics/omics-data-management/wiki/1-Omics-Data#repositories) and linked from NCEI to those records with persistent identifiers. Although NCEI can curate ’omics datasets smaller than 20 GB (i.e., most proteomic and metabolomic datasets), it does not permit the critically important interactive querying feature that is integral to all ’omics-tailored data repositories and so should not be the lone repository for ’omics data.
### Domain-specific databases
* Ocean acidification (OA) data generated through the Ocean Acidification Program (OAP) should be submitted to a special section of NCEI, the [Ocean Acidification Data Stewardship (OADS) Project](http://www.nodc.noaa.gov/oceanacidification).
* Coral and coral reef data should be sent to NCEI, where it will then be posted on NCEI and referenced/cross-listed by NOAA’s [Coral Reef Information System (CoRIS)](https://www.coris.noaa.gov/).
* [Earth Science Information Partners (ESIP)](https://www.esipfed.org/esip-endorsed) Biological Data Standards.
* [International Organization for Standardization (ISO)](https://www.iso.org/) standards with guidance from NOAA’s [National Geophysical Data Center (NGDC)](https://www.ngdc.noaa.gov/wiki/index.php/MI_Metadata).
* Biodiversity data: GBIF, OBIS [(see guide)](https://docs.gbif.org/publishing-dna-derived-data/1.0/en/)
### Metadata standards
Environmental metadata should be formatted according to one or more of the following standards:
* General ’omics projects: [Genomic Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS)](https://gensc.org/mixs/).
* OAP-funded projects: [Ocean Acidification Data Stewardship (OADS)](https://www.ncei.noaa.gov/products/ocean-acidification-data-stewardship) metadata guidelines.
* Water samples (AOML standard): [Metadata Guide - Water](https://docs.google.com/document/d/1SHdzxy-Snh0QxIBS1ucXna3oZLHYSxwbXQDBWCneaGA/edit#)
* Additional guidance from [ENA](https://ena-docs.readthedocs.io/) (European Nucleotide Archive), including [The ENA Metadata Model — ENA Training Modules 1 documentation](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/metadata.html) and [Reporting Missing Values — ENA Training Modules 1 documentation](https://ena-docs.readthedocs.io/en/latest/submit/samples/missing-values.html).
### Timing of submission
The suggested deadline for data to be published and accessible in NCEI is one year after the end date of the project for NOAA intramural PIs, two years after the end date of the project for extramural PIs, or before a paper is published using these data (whichever is sooner). This schedule is based on the [OAP Data Management Agreement](https://www.ncei.noaa.gov/products/ocean-acidification-data-stewardship).
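
The deadline rule above can be expressed as a small calculation. This is a sketch of one reading of the rule, in which a planned publication date caps the deadline; the function names are hypothetical, and the handling of a 29 February end date is an assumption.

```python
from datetime import date


def add_years(day, years):
    """Return the same calendar date a number of years later."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        # 29 February in a target year that is not a leap year.
        return day.replace(year=day.year + years, day=28)


def suggested_deadline(project_end, intramural, publication_date=None):
    """Latest suggested date for data to be public in NCEI."""
    deadline = add_years(project_end, 1 if intramural else 2)
    if publication_date is not None:
        # "Whichever is sooner": publication can only move the deadline earlier.
        deadline = min(deadline, publication_date)
    return deadline


print(suggested_deadline(date(2023, 9, 30), intramural=True))
print(suggested_deadline(date(2023, 9, 30), intramural=False,
                         publication_date=date(2024, 1, 15)))
```

The first call prints `2024-09-30`; the second prints `2024-01-15`, because the paper would be published before the two-year extramural limit.
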
**Point of contact for submissions**: The PI is responsible for working with NCEI to publish the data in a timely manner.
## Useful links
* Metadata guides
+ [National Microbiome Data Collaborative (NMDC)](https://microbiomedata.org/introduction-to-metadata-and-ontologies/) -- covers additional metadata
+ [Earth Microbiome Project (EMP)](https://earthmicrobiome.org/protocols-and-standards/metadata-guide/)
* Metadata standards
+ [GSC defined terms](https://www.gensc.org/pages/standards/all-terms.html)
+ [Biosample Attributes](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/)
+ [BeBOP-OBON Protocol Collection Template](https://github.com/BeBOP-OBON/0_protocol_collection_template)
+ [BeBOP-OBON Minimum Information about an Omics Protocol](https://github.com/BeBOP-OBON/miop)
+ [GBWG - Sustainable DarwinCore MIxS Interoperability - TDWG](https://www.tdwg.org/community/gbwg/MIxS/)
## Templates
(pull from MIxS/AOML)
**Sequence data**
- A [template generator](https://submit.ncbi.nlm.nih.gov/biosample/template/) is provided by NCBI, which allows the user to select the type of sequence data and relevant standards. Genomic Standards Consortium MIxS packages are included in the generator. The template can be downloaded as an Excel file or a .tsv file. Additionally, a definition or example record can be accessed for each standard package.
# 3 Publishing
## Findable & Accessible
How to find NOAA and other data, and how to make your data findable (keywords, persistent identifiers, cross-listing)
Standard language for publications, timing of publishing
## Interoperable & Reusable
Protocols -- fits with interoperable and reusable
# 4 Checklists
## General checklist
- [ ] **Start early** -- Develop your data management plan (with help from this guide) at the proposal stage.
- [ ] **Identify your data type(s)** -- The types of data you generate (e.g., amplicon, metagenomic, proteomic) dictate which standards are available and some of the metadata you'll want to collect (i.e., the non-environmental metadata).
- [ ] **Identify your environment(s)** -- If your samples came from the environment (e.g., seawater) or an environmental system (mesocosm), there are specific parameters you'll want to collect (e.g., temperature, salinity). If your samples are from an experimental system, you'll want to capture your experimental factors and possibly also ambient conditions.
- [ ] **Locate checklists and templates** -- Find the relevant standards checklists and templates for your data type(s) and environment(s).
- [ ] **Complete your metadata** -- As soon as you have sample or experimental metadata, start recording it in the appropriate format. Make sure to back up original and formatted metadata files.
- [ ] **Identify (meta)data repositories** -- List the relevant data and metadata repositories you will use, and verify submission instructions and user accounts.
- [ ] **Submit your data!** -- Recommended timeline for submission is one year after the project end date (NOAA intramural PIs), two years after the project end (extramural PIs), or before a paper is published using these data (whichever is sooner).
### Core metadata for all projects
| Sampling metadata | Description | Format |
| -------------------------- | ------------------------------------------------------------ | ------------------------------------- |
| Sample name | Identifier for a sample that is at least unique within the project | no spaces |
| Species or unclassified sequence type | Most descriptive organism name (to the species, if possible). | For unidentified species, choose the appropriate Genus and include 'sp.', e.g., "Escherichia sp.". When sequencing a genome from a non-metagenomic source, include a strain or isolate name too, e.g., "Pseudomonas sp. UK4". For metagenomic samples (environment or cultured), follow the [unclassified sequences](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=12908) format. |
| Collection date | Date on which the sample was collected; can also provide time of collection, and date/time ranges | Collection times are supported by adding "T", then the hour and minute after the date, and must be in Coordinated Universal Time (UTC), otherwise known as "Zulu Time" (Z); supported formats include "DD-Mmm-YYYY", "Mmm-YYYY", "YYYY" or ISO 8601 standard "YYYY-mm-dd", "YYYY-mm", "YYYY-mm-ddThh:mm:ss" |
| Broad environment type | Add terms that identify the major environment type(s) where your sample was collected. | Recommend subclasses of biome [ENVO:00000428]. Multiple terms can be separated by one or more pipes e.g.: mangrove biome [ENVO:01000181] \| estuarine biome [ENVO:01000020] |
| Local environment type | Add terms that identify environmental entities having causal influences upon the entity at time of sampling | multiple terms can be separated by pipes, e.g.: shoreline [ENVO:00000486] \| intertidal zone [ENVO:00000316] |
| Environmental medium | Add terms that identify the material displaced by the entity at time of sampling. | Recommend subclasses of environmental material [ENVO:00010483]. Multiple terms can be separated by pipes e.g.: estuarine water [ENVO:01000301] \| estuarine mud [ENVO:00002160] |
| Location name | Geographical origin of the sample | Use the appropriate name from this list http://www.insdc.org/documents/country-qualifier-vocabulary. Use a colon to separate the country or ocean from more detailed information about the location, eg "Canada: Vancouver" or "Germany: halfway down Zugspitze, Alps" |
| Latitude, longitude | The geographical coordinates of the location where the sample was collected. | Specify as degrees latitude and longitude in format "d[d.dddd] N\|S d[dd.dddd] W\|E" |
| Project summary | Short description of project goals | |
| External accessions | Accession numbers from any external resources to which the sample was submitted | eg, BioSample|
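
The collection date and coordinate formats in the table above are easy to get wrong by hand. The sketch below converts a timezone-aware timestamp to the `YYYY-mm-ddThh:mm:ss` form in UTC and decimal degrees to the `d[d.dddd] N|S d[dd.dddd] W|E` form. The function names and the default of four decimal places are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_collection_date(moment):
    """Format a timezone-aware datetime as 'YYYY-mm-ddThh:mm:ss' in UTC."""
    if moment.tzinfo is None:
        raise ValueError("datetime must be timezone-aware to convert to UTC")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


def format_lat_lon(latitude, longitude, decimals=4):
    """Format signed decimal degrees as 'd.dddd N|S d.dddd W|E'."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    north_south = "N" if latitude >= 0 else "S"
    east_west = "E" if longitude >= 0 else "W"
    return (f"{abs(latitude):.{decimals}f} {north_south} "
            f"{abs(longitude):.{decimals}f} {east_west}")


# Hypothetical sample collected at 09:30 local time, four hours behind UTC.
local_time = datetime(2021, 6, 1, 9, 30, tzinfo=timezone(timedelta(hours=-4)))
print(format_collection_date(local_time))
print(format_lat_lon(25.7317, -80.1621))
```

Requiring a timezone-aware input is intentional: a local time recorded without its offset cannot be converted to UTC reliably, so the function refuses to guess.
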
**Assay-level metadata**
This covers any metadata directly related to the preparation of the biomaterial undergoing the assay and the process of performing the assay.
| Assay metadata | Description | Format |
| -------------------- | ------------------------------------------------------------ | ----------------------------------- |
| 'Omics strategy | The type of omics method to prepare the sample for analysis (eg, WGS, amplicon) | See SRA Strategy list for all types |
| Sample 'omics source | Genomic, transcriptomic, metagenomic, etc. | See SRA Source list                 |
| Selection method | Laboratory method to select or enrich the data prior to sequence/analysis (eg, PCR, Reduced Representation) | See SRA Selection list |
| Platform | The type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform | |
| Instrument Model | The specific instrument on which the assay was performed. Essential for QC purposes. | |
| External accessions | Accession numbers from external resources to which assay or protocol information was submitted | eg protocols.io |
**Analysis-level metadata**
This includes any metadata related to the files that come out of the experiment, from the sequencing or imaging files generated directly by the machine to files generated during the various stages of processing and analysis, as well as details of any analyses performed.
| Analyses metadata | Description | Format |
| --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| analysis type | The type of analysis performed | E.g., taxonomy assignment, differential gene expression, SNP genotyping, ontology field - e.g. EFO, OBI or EDAM |
| computational method | The specific computational method or algorithm used as part of the analysis | ontology field - e.g. EFO or EDAM |
| file format | The file format in which the analysis is provided | ontology field - e.g. EDAM |
| file storage location | The location in which the data files are stored | |
| software package | The software package used for data analysis | |
| software version | The exact version number of the software package | |
| analysis code         | A GitHub link or accession number of the code used to perform the analysis |                                                              |
| analysis date | The date on which the analysis was performed | |
| read index | The sequencing read a specific file represents, eg read1 or index1 | |
| read length | The length of a sequenced read in this file, in nucleotides. | |
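
Several of the analysis-level fields above (software package, software version, analysis date) can be captured by the analysis script itself rather than typed in afterwards. The following sketch assumes a Python-based analysis; the function name, the dictionary keys, the extra `python_version` field, and the example URL and path are all hypothetical.

```python
import json
import sys
from datetime import date
from importlib import metadata


def analysis_record(analysis_type, package, code_url, storage_location):
    """Collect analysis-level metadata for one analysis step."""
    try:
        version = metadata.version(package)  # version of the installed package
    except metadata.PackageNotFoundError:
        version = "unknown"
    return {
        "analysis_type": analysis_type,
        "software_package": package,
        "software_version": version,
        "python_version": sys.version.split()[0],
        "analysis_code": code_url,
        "file_storage_location": storage_location,
        "analysis_date": date.today().isoformat(),
    }


# Placeholder values; the package name, URL, and path are illustrative only.
record = analysis_record(
    "taxonomy assignment",
    "example-package",
    "https://example.org/analysis-code",
    "/path/to/results",
)
print(json.dumps(record, indent=2))
```

Saving this record next to the output files keeps the exact software version tied to the results it produced.
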
## Checklists for specific use cases
Below are detailed data management guidelines for common 'omics use cases at NOAA. The core metadata required for all NOAA Omics projects is listed above, while any additional required metadata is specified in the checklist for each use case.
### Template
#### Metadata standard(s) & requirements
#### Metadata template/example
#### Metadata repo
#### Data standard(s) & requirements
#### Data example
#### Data repo
#### Reporting of methods
#### Reporting of (meta)data availability
**NEED: Core metadata checklist that applies to all projects**
---
### Targeted quantitative surveys (qPCR, ddPCR...) [ZACK]
#### Metadata standard(s) & requirements
Currently there are no U.S.-based required metadata standards or requirements for qPCR efforts. However, there are a series of best practices and guidelines including, but not limited to:
1. [Abbott et al. 2021](https://westernregionalpanel.org/wp-content/uploads/2021/04/Canada_eDNAGuidanceDoc.pdf) from the Canadian Science Advisory Secretariat
2. [Hays et al. 2022](https://link.springer.com/article/10.1208/s12248-022-00686-1) American Association of Pharmaceutical Scientists Workshop
3. [Sanders et al. 2018](https://www.sciencedirect.com/science/article/pii/S2214753517302097)
4. [Langlois et al. 2021](https://onlinelibrary.wiley.com/doi/full/10.1002/edn3.164)
Recommended minimum information to be included from Abbott et al. 2021:
Sampling metadata | Description
-- | --
Species targeted | Provide the common name, Latin name, and taxonomic identifier of the species targeted by the study. The taxonomic identifier should ideally conform to NCBI Taxonomy, although alternatives include the Barcode of Life Data System ([BOLD](http://www.boldsystems.org/)), the Global Biodiversity Information Facility ([GBIF](https://discourse.gbif.org/t/understanding-gbif-taxonomic-keys-usagekey-taxonkey-specieskey/3045)), and the World Register of Marine Species ([WoRMS](https://www.marinespecies.org/aphia.php?p=taxdetails&id=1)). **NOTE** Taxonomy may not align between taxonomic identifiers and databases. Caution is advised.
Study Objectives | Provide the reason for conducting single-species targeted analyses
Geographic location | GPS coordinates of sample locations following MIxS, OADS, ESIP, or ISO standards. For species of concern/masking of location data, provide the general description of the study area.
Sampling date | Dates for individual sampling following MIxS, OADS, ESIP, or ISO standards
Sample depth | depth of sample (in meters)
Sample type | Indicate the substrate sampled
Geographic location/region | broad units that describe the geographic area and/or context of the study
Sites | physical places where samples have been collected; should be relatively independent of each other
Stations | spatially distinct sampling locations within a site (i.e. spatial replicates)
Field sample replicates | also known as biological replicates; separate sample units collected as close as possible to the same point in space and time, stored in separate containers, and analyzed independently
Technical qPCR replicates | distinct qPCR reactions from the same DNA extract
Technical filter replicates | splits of filters (e.g. half of a filter) that are tested separately
Environmental metadata | following MIxS, OADS, ESIP, or ISO standards
Sampling metadata | Description
-- | --
Equipment & protocol used | Description of the sampling equipment used to collect the data, including a link to a permanent and stable archive of the associated protocol. Protocols should include 1) equipment used for sample collection in the field, 2) field sample processing, including filter type, pore size, diameter, manufacturer product number, number of filters used, etc., 3) preservation method, including preservative, 4) DNA extraction protocol, and 5) lab protocol (see Better Biomolecular Ocean Practices [BeBOP](https://github.com/BeBOP-OBON/0_protocol_collection_template) for examples of detailed protocols). The description should be detailed enough for the work to be fully reproduced.
Volume/Weight of sample collected | include units of sample volume/weight
Negative controls | include information on field blanks, extraction blanks, PCR blanks, no-template controls, etc.
Positive controls | include information on positive controls (gBlocks, exogenous DNA, tissue, etc.), covering both standards and spike-in controls
Storage time | include information on time between sample collection and processing
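
The minimum fields in the tables above can be checked automatically before deposit. The following is a minimal sketch, assuming one dictionary per sample; the column names (`species_latin_name`, `decimal_latitude`, and so on) are hypothetical and are not prescribed by any of the guidance documents or standards listed above. It checks only for presence, ISO 8601 calendar dates, decimal-degree ranges, and a non-negative depth in meters.

```python
from datetime import date

# Hypothetical column names for illustration; map them to your own template.
REQUIRED_FIELDS = [
    "species_latin_name",
    "decimal_latitude",
    "decimal_longitude",
    "collection_date",
    "sample_depth_m",
    "sample_type",
]


def check_sample_record(record):
    """Return a list of problems found in one sample metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if str(record.get(field, "")).strip() == "":
            problems.append(f"missing {field}")

    # Dates are expected in ISO 8601 calendar form (YYYY-MM-DD).
    raw_date = str(record.get("collection_date", "")).strip()
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            problems.append(f"collection_date not ISO 8601: {raw_date}")

    # Coordinates are expected in decimal degrees; depth in meters, not negative.
    numeric_limits = {
        "decimal_latitude": (-90.0, 90.0),
        "decimal_longitude": (-180.0, 180.0),
        "sample_depth_m": (0.0, float("inf")),
    }
    for field, (low, high) in numeric_limits.items():
        raw = str(record.get(field, "")).strip()
        if not raw:
            continue
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{field} is not a number: {raw}")
            continue
        if not low <= value <= high:
            problems.append(f"{field} out of range: {raw}")
    return problems


example = {
    "species_latin_name": "Merluccius productus",
    "decimal_latitude": "46.25",
    "decimal_longitude": "-124.50",
    "collection_date": "2019-08-15",
    "sample_depth_m": "50",
    "sample_type": "seawater",
}
print(check_sample_record(example))
```

An empty list means the record passed these checks. It does not mean the record complies with MIxS or any other standard, which define many more terms and controlled vocabularies.
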
#### Metadata template/example
Example is from NWFSC Hake qPCR eDNA project, specifically [Shelton et al. 2022](https://github.com/nwfsc-cb/eDNA-Hake-public).
Metadata type | Link
-- | --
CTD data | https://github.com/nwfsc-cb/eDNA-Hake-public/blob/main/Data/qPCR/CTD_hake_data_10-2019.csv
Sample metadata | https://github.com/nwfsc-cb/eDNA-Hake-public/blob/main/Data/qPCR/Hake%20eDNA%202019%20qPCR%20results%202021-07-15%20sample%20details.csv
#### Metadata repo
#### Data standard(s) & requirements
#### Data example
Data type | Link
-- | --
Standards data | https://github.com/nwfsc-cb/eDNA-Hake-public/blob/main/Data/qPCR/Hake%20eDNA%202019%20qPCR%20results%202020-01-04%20standards.csv
qPCR data | https://github.com/nwfsc-cb/eDNA-Hake-public/blob/main/Data/qPCR/Hake%20eDNA%202019%20qPCR%20results%202021-01-04%20results.csv
#### Data repo
#### (Meta)data availability statement
---
### Amplicon datasets (environmental DNA/RNA, metabarcoding, microbiome/microbial 16S surveys) [LUKE]
Amplicon or metabarcoding studies involve the collection of community DNA (e.g., filtration or precipitation of microbial or environmental DNA), extraction of that DNA, PCR amplification using primers for a particular genetic locus, and DNA sequencing (e.g., Illumina MiSeq or NovaSeq). Samples are typically collected from the environment and will be associated with environmental parameters that are important for interpretation of the DNA sequence data. In addition to this sample metadata, preparation metadata (e.g., about DNA extraction), sequencing metadata (e.g., about primers and PCR conditions), data processing metadata (e.g., how sequences were denoised and assigned taxonomy), and feature metadata (e.g., taxonomic assignments) will also be helpful for use and reuse.
#### Metadata standard(s) & requirements
#### Metadata template/example
#### Metadata repo
#### Data standard(s) & requirements
#### Data example
#### Data repo
#### Reporting of methods
- Primer sequences
- PCR conditions
- Sequencing center
- Sequencing technology
#### Reporting of (meta)data availability
- "Raw sequences were deposited in the NCBI Sequence Read Archive and sample metadata was deposited in NCBI BioSample, both accessible with BioProject ID XXX. Processed feature (ASV) tables, ASV sequences, and ASV taxonomy tables were deposited to Zenodo with DOI XXX."
---
### Other (functional gene surveys, RFLP(?)...)
### Eukaryote genomics (WGS, genome skimming/resequencing, microsatellites, population genomics, RAD-seq...) [KATHERINE]
#### Metadata standard(s) & requirements
#### Metadata template/example
#### Metadata repo
#### Data standard(s) & requirements
#### Data example
#### Data repo
#### Reporting of methods
#### Reporting of (meta)data availability
---
### Functional genomics (transcriptomics-RNA-seq, epigenetics-methylation) [KATHERINE]
Functional genomics data (e.g., quantitative gene expression from RNAseq, chromatin ChIP-Seq, HiC-seq, methylation seq) often have unique associated metadata, such as the experimental treatment, strain or population of origin, tissue type, and sampling time point. These metadata are critical for interpreting the results of a functional genomics study in the broader context. These data types are also characterized by processed data files that represent the function of interest (e.g., a matrix of gene expression counts, genomic location of methylated sites per sample).
#### Metadata standard(s) & requirements
[MINSEQE (Minimum Information about a high-throughput nucleotide SEQuencing Experiment)](https://www.fged.org/projects/minseqe/) was developed by the Functional Genomics Data Society and is a widely adopted standard. NCBI GEO is MINSEQE compliant. MINSEQE describes five elements of a sequencing experiment:
1. The description of the biological system, samples, and the experimental variables being studied
2. The sequence read data for each assay
3. The ‘final’ processed (or summary) data for the set of assays in the study
4. General information about the experiment and sample-data relationships
5. Essential experimental and data processing protocols
[Functional Annotation of Animal Genomes (FAANG)](https://github.com/FAANG/dcc-metadata/blob/master/docs/faang_sample_metadata.md) - FAANG provides a set of orthogonal standards for the capture of well-structured metadata for experiments, samples and analyses in the animal genomics domain. The FAANG standards support the MIAME and MINSEQE guidelines, and aim to convert them to a concrete specification.
**Minimum metadata for NOAA Functional 'Omics'**
These tables of suggested minimum metadata fields are adapted from the [FAIR cookbook on transcriptomics metadata](https://faircookbook.elixir-europe.org/content/recipes/interoperability/transcriptomics-metadata.html). They assume you do not have human samples, which have their own distinct metadata and privacy standards. **These metadata are in addition to the set of common metadata for all projects(link).**
| Sample Metadata | Required? | Definition | Comment or recommended format |
|---|---|---|---|
| tissue/organism part | required | The tissue from which the sample was taken | [Uberon](https://www.ebi.ac.uk/ols/ontologies/uberon) |
| disease | required | Any diseases that may affect the sample | [MONDO ontology](http://purl.obolibrary.org/obo/MONDO_0000001) or relevant standard to the organism |
| sex | required | The biological/genetic sex of the sample | [PATO](http://purl.obolibrary.org/obo/PATO_0000047) |
| development stage | required | The developmental stage of the sample | [Uberon](https://www.ebi.ac.uk/ols/ontologies/uberon) or species dependent |
| external accessions | recommended | Accession numbers from any external resources to which the sample was submitted | e.g., NCBI BioSample |
| strain/breed | recommended | Strain or breed of the species, if applicable | NCBITaxonomy |
| ancestry/ethnicity | recommended | Ancestry or population group of the individual | |
| age | recommended | Age of the organism from which the sample was collected | |
| cell type | recommended | The cell type(s) known or selected to be present in the sample | |
| growth conditions | recommended | Features relating to the growth and/or maintenance of the sample | e.g., diet, environmental controls |
| genetic variation | recommended | Any relevant genetic differences between the specimen or sample and the expected genomic information for this species, e.g., abnormal chromosome counts, major translocations, or indels | |
| phenotype | recommended | Any relevant (usually abnormal) phenotypes of the specimen or sample | |
| cell location | recommended | The cell location from which genetic material was collected (usually either nucleus or mitochondria) | |
**Assay-level metadata**
This covers any metadata directly related to the preparation of the biomaterial undergoing the assay and the process of performing the assay.
| Assay Metadata | Required? | Definition | Comment or recommended format |
|---|---|---|---|
| molecule | required | The type of material that was extracted from the sample, e.g., polyA RNA | suggested format: [EFO biological macromolecule](http://www.ebi.ac.uk/efo/EFO_0004446) |
| nucleic acid extraction method | required | Technique used to extract the nucleic acid from the cell | ontology field - e.g. EFO or OBI |
| biological or technical replicate | required | Whether the sample on which the assay was performed was a biological or a technical replicate | boolean |
| end bias | recommended | The type of tag or end bias the library has, e.g., 3 prime tag or 5 prime end bias | method-specific |
| assay start time | recommended | The exact time at which the assay was started | |
| assay end time | recommended | The exact time at which the assay was completed | |
| assay duration | recommended | The duration, in a relevant time unit (e.g., minutes or hours), of the assay from start to finish | |
| sample quality | recommended | Quality information on extracted molecule | e.g., Bioanalyzer RIN |
| chemical compound | recommended | Any relevant chemical compounds used in the assay | ontology field - e.g. ChEBI |
| spike-in kit used | recommended | Information about the spike-in kit used during sequencing library preparation | |
| cDNA primer | recommended | Type of primer used for cDNA synthesis from RNA, e.g., polyA or random | standardised field or ontology |
| library strandedness | recommended | The strandedness of the cDNA library | standardised field or ontology |
| cell quality | recommended | Information about the quality of a single cell such as morphology or percent viability | standardised field or ontology |
| cell barcode | recommended | Information about the cell identifier barcode used to tag individual cells in single cell sequencing | |
| UMI barcode | recommended | Information about the Unique Molecular Identifier barcodes used to tag DNA fragments | |
**Analysis-level metadata**
This includes any metadata related to the files that come out of the experiment, from the sequencing or imaging files generated directly by the machine to files generated during the various stages of processing and analysis, as well as details of any analyses performed.
| Analyses metadata field | Required? | Definition | Comment |
|---|---|---|---|
| reference assembly | required | The reference assembly sequence to which raw data were mapped | UCSC or NCBI genome build number (e.g., hg18, mm9, human NCBI genome build 36), or reference sequence used for read alignment |
| normalization strategy | required | The approach used to normalize the data | ontology field - e.g. EFO or EDAM |
| assembly type | recommended | The assembly type of the genome reference file, e.g., primary, complete, or patch assembly | |
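
The "required" rows in the three tables above lend themselves to a simple completeness check. The sketch below assumes metadata is held as a nested dictionary keyed by level (`sample`, `assay`, `analysis`); that layout and the function name `missing_required` are hypothetical, chosen for illustration only. The field names are copied from the tables.

```python
# Required fields copied from the three tables above; "recommended" fields are omitted.
REQUIRED = {
    "sample": ["tissue/organism part", "disease", "sex", "development stage"],
    "assay": [
        "molecule",
        "nucleic acid extraction method",
        "biological or technical replicate",
    ],
    "analysis": ["reference assembly", "normalization strategy"],
}


def missing_required(metadata):
    """Map each metadata level to the required fields that are absent or empty."""
    report = {}
    for level, fields in REQUIRED.items():
        values = metadata.get(level, {})
        missing = []
        for field in fields:
            value = values.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            report[level] = missing
    return report


# Illustrative record with made-up values.
record = {
    "sample": {
        "tissue/organism part": "gill",
        "disease": "none",
        "sex": "female",
        "development stage": "adult",
    },
    "assay": {"molecule": "polyA RNA"},
}
print(missing_required(record))
```

The check only confirms that a value is present. Whether a value is a valid ontology term (Uberon, EFO, and so on) would need a separate lookup against that ontology.
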
#### Metadata template/example
* [Download link](https://www.ncbi.nlm.nih.gov/geo/info/examples/seq_template.xlsx) for metadata submission template to NCBI GEO, with examples for RNA-seq, ChIP-seq, and scRNA-seq.
* [Example Github repository](https://github.com/RobertsLab/paper-tanner-crab) for a differential gene expression project, recording metadata from the experiment through analyses. From [Crandall et al. 2022](https://pubmed.ncbi.nlm.nih.gov/35262806/).
* [Example Open Science Framework repository](https://osf.io/x5waz/) for methylation metadata.
#### Data standard(s) & requirements
**Raw data**
You do not need to submit raw data directly to SRA. If you have already done so and want corresponding GEO entries, include the following additional information in the GEO metadata template (second tab):
[1] List the BioProject accession (PRJNAnnnn) in a 'BioProject' field in the STUDY section.
[2] Add an 'experiment' column to the SAMPLES section and include the corresponding SRA Experiment accessions (SRXnnnnnn) or SRA Run accessions (SRRnnnnnn) so that GEO can create the appropriate links between the SRA Experiments and GEO Samples.
[3] Add a 'BioSample' column to the SAMPLES section and include the corresponding BioSample accessions (SAMNnnnnnn).
NOTE: Do not upload the raw files and do not list the raw file names in the Metadata template.
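
Accessions entered in the template can be screened for typos before submission. This is a rough sketch that assumes only the accession shapes quoted above (`PRJNA`, `SRX`/`SRR`, and `SAMN` prefixes followed by digits); accessions issued by other archives use different prefixes and are deliberately not covered. The function name `is_valid_accession` is hypothetical.

```python
import re

# Patterns follow the accession shapes quoted above: a prefix followed by digits.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJNA\d+"),
    "experiment": re.compile(r"SR[XR]\d+"),
    "BioSample": re.compile(r"SAMN\d+"),
}


def is_valid_accession(kind, value):
    """Return True if `value` has the expected shape for the given column."""
    return ACCESSION_PATTERNS[kind].fullmatch(value.strip()) is not None


print(is_valid_accession("experiment", "SRR23587914"))
print(is_valid_accession("BioSample", "SRR23587914"))
```

A matching shape does not prove that the accession exists or belongs to your project; it only catches values pasted into the wrong column or truncated.
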
The raw data files should be the original files containing reads and quality scores, as generated by the sequencing instrument (unless the raw files are barcoded/multiplexed, see below for further instructions).
Raw Data File Formats: Acceptable file formats include FASTQ, as well as other formats described in the [SRA File Format Guide](https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/). Files that do not conform to supported format requirements will be deleted from GEO's systems.
Barcode/Multiplexed Data: Whenever possible, files should be demultiplexed so that each barcoded sample ends up with a dedicated run file. However, for single-cell sequencing studies (e.g. 10x Genomics, Drop-Seq, InDrops), GEO can support the submission of multiplexed files in cases where these files are required for reanalysis in your pipeline, or when demultiplexing would create an unmanageable number of files.
**Reference assembly**
FASTA file or accession number
Optional: WIG, bedGraph, GFF, or GTF files if necessary for analysis
**Processed data files**
Processed data are a required part of GEO submissions. The final processed data are defined as the data on which the conclusions in the related manuscript are based. GEO does not expect standard alignment files (e.g., BAM, SAM, BED) as processed data, since conclusions are expected to be based on further-processed data. When standard alignments are the only processed data available, contact GEO to ask whether your data are suitable for submission. Requirements for processed data files are not fully standardized and will depend on the nature of the experiment:
* Expression profiling analysis usually generates quantitative data for features of interest, which may be genes, transcripts, exons, miRNA, or some other genetic entity. Two levels of data are often generated: raw counts of sequencing reads for the features of interest, and normalized abundance measurements (e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR). Either or both of these data types may be supplied as processed data. They may be formatted either as a matrix table or as individual files for each sample. Provide complete data with values for all features (e.g., genes) and all samples, not only lists of differentially expressed genes.
* ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.
* Features (e.g., genes, transcripts) in processed data files should be traceable using public accession numbers or chromosome coordinates. The reference assembly used (e.g., hg19, mm9, GCF_000001405.13) should be provided in the metadata spreadsheet.
A description of the format and content of processed data files should be provided in the metadata spreadsheet data processing fields.
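
The "matrix table" layout with values for all features and all samples can be illustrated as follows. This is a sketch, not a GEO requirement: it assumes raw read counts held in a per-sample dictionary, writes a tab-separated table, and records a feature that was never observed in a sample as 0. That zero-fill is only appropriate for raw counts. The function name, column header, and gene identifiers are hypothetical.

```python
import csv
import io


def write_count_matrix(counts_by_sample, handle):
    """Write a feature-by-sample matrix with a value for every feature and sample."""
    samples = sorted(counts_by_sample)
    features = sorted({f for counts in counts_by_sample.values() for f in counts})
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["feature_id"] + samples)
    for feature in features:
        # A feature never observed in a sample is written as 0 rather than dropped.
        writer.writerow(
            [feature] + [counts_by_sample[s].get(feature, 0) for s in samples]
        )


# Made-up counts for two samples.
raw_counts = {
    "S1": {"geneA": 3, "geneB": 7},
    "S2": {"geneA": 5},
}
buffer = io.StringIO()
write_count_matrix(raw_counts, buffer)
print(buffer.getvalue())
```

Feature identifiers in a real submission should be traceable public accessions or chromosome coordinates, as noted above.
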
#### Data example
#### Metadata and data repo
The [NCBI Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/) repository is designed to associate raw data, processed data files, and relevant metadata for a specific study in an archived and accessible manner. See [Studivan & Voss 2020](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107688) for an example of a GEO submitted NOAA Omics project.
#### (Meta)data availability statement
---
### Microbial (meta)transcriptomics (RNA-seq) [LUKE]
---
### Proteomics [RACHAEL]
#### Metadata standard(s) & requirements
*Note: The information in this table is for submission to the [PRIDE](https://www.ebi.ac.uk/pride/) website. Additional file types and information can be found on the repositories outlined at [ProteomeXchange](https://www.proteomexchange.org/). See the submission guidelines document [here](http://www.proteomexchange.org/docs/guidelines_px.pdf).*
| Sample Metadata | Required? | Definition or Example | Recommended Format |
|---|---|---|---|
|SEARCH| Y, for partial submissions | Files from the software analysis tool (e.g. .dat from Mascot).| Format based on analysis tool- see [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats). |
|RESULT| Y | Standard file formats from HUPO-PSI to report peptide/protein identification and quantification results. | [mzIdentML](https://github.com/HUPO-PSI/mzIdentML) and [mzTab](https://github.com/HUPO-PSI/mzTab). |
|PEAK| Y | The peak list files contain the set of MS/MS peaks used for peptide/protein identification (e.g. mgf Mascot generic files).|Various formats are accepted; see [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats) for an exhaustive list. |
|SPECTRUM_LIBRARY| N | Spectrum libraries used to perform spectrum search.| .msp file format |
|GEL| N | Image files with the gels of the experiment. | All image formats are accepted. |
|PARAMETERS_FILE| N | The parameters file contains information about the parameters that were used to perform the experiment (e.g. MaxQuant param file).| .json or .txt |
|OTHER| N | Additional files that have been used to perform the experiment.| .doc, .pdf, .xls |
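
Before starting a submission, it can help to sort the files on disk into the file types in the table above and see which required types are still missing. The sketch below is illustrative only: the extension-to-type mapping is inferred from the examples in the table, the `.mzid` and `.mztab` extensions for mzIdentML and mzTab are assumptions, and image files (type GEL) are not handled. The PRIDE file format page is the authoritative list.

```python
from pathlib import Path

# Extension-to-type mapping inferred from the examples in the table above.
# .mzid and .mztab are assumed extensions for mzIdentML and mzTab files.
TYPE_BY_EXTENSION = {
    ".dat": "SEARCH",
    ".mzid": "RESULT",
    ".mztab": "RESULT",
    ".mgf": "PEAK",
    ".msp": "SPECTRUM_LIBRARY",
    ".json": "PARAMETERS_FILE",
    ".txt": "PARAMETERS_FILE",
}
# Rows marked "Y" without conditions in the table above.
REQUIRED_TYPES = {"RESULT", "PEAK"}


def classify(filenames):
    """Group file names by file type, falling back to OTHER."""
    grouped = {}
    for name in filenames:
        file_type = TYPE_BY_EXTENSION.get(Path(name).suffix.lower(), "OTHER")
        grouped.setdefault(file_type, []).append(name)
    return grouped


def missing_types(filenames):
    """Return required file types with no matching file, sorted by name."""
    return sorted(REQUIRED_TYPES - set(classify(filenames)))


files = ["run1.mgf", "notes.pdf", "search_params.json"]
print(classify(files))
print(missing_types(files))
```
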
#### Metadata template/example
* MS data:
- A complete submission to [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats) can be found [here](https://www.ebi.ac.uk/pride/archive/projects/PXD039714), with example files for metadata.
- An example of each metadata type is outlined [here](https://www.ebi.ac.uk/pride/markdown/submitdatapage/files/Submission_Summary_File_Format.pdf) for submission to [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats).
* Sequence Data:
- A [template generator](https://submit.ncbi.nlm.nih.gov/biosample/template/) is provided by NCBI, which allows the user to select type of sequence data and relevant standards. [Genome Standards Consortium MIxS packages](https://www.gensc.org/pages/standards-intro.html) are included in the generator. The template can be downloaded as an excel file or a .tsv file. Additionally, a definition or example record can be accessed for each standard package.
#### Metadata repo
Proteomics datasets are often multifaceted and have various types of metadata associated with them. A consortium of major proteomics databases ([ProteomeXchange](https://www.proteomexchange.org/)) was formed to ensure standards for data submission and access to a variety of archival resources. The most inclusive resource (i.e., one that can accept raw MS or sequence data, metadata, and 'other' files) is [PRIDE](https://www.ebi.ac.uk/pride/). Instructions and requirements for submission of metadata can be found [here](https://www.ebi.ac.uk/pride/markdownpage/pridesubmissiontool#additional_submission_metadata). If sequence data is uploaded to the [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra), metadata should be uploaded to [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/). Instructions for submission to BioSample can be found [here](https://www.ncbi.nlm.nih.gov/biosample/docs/submission/faq/).
#### Data examples, standard(s) & requirements
| Sample Data | Required? | Definition or Example | Recommended Format | Repository |
|---|---|---|---|---|
|MS data|Y|Original proprietary files provided by the instruments used in the study (e.g. Thermo RAW)| [mzML](https://www.psidev.info/mzML); <br> *Controlled vocabulary:* [MS ontology](https://www.ebi.ac.uk/ols/ontologies/ms); <br> *File formatting details:* [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats)| [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats) |
|Sequencing data | N | Amino acid sequences, Whole genome sequences, RNA seq, Whole Exome Sequences | [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/), [FASTQ](https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#:~:text=Fastq%20consists%20of%20a%20defline,There%20are%20many%20variations.)| [MassIVE](https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp), <br> [PRIDE](https://www.ebi.ac.uk/pride/markdownpage/pridefileformats) (as optional data), [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra)|
**Other options for repositories, as well as general data submission [guidelines](http://www.proteomexchange.org/docs/guidelines_px.pdf), can be found on the [ProteomeXchange](https://www.proteomexchange.org/) website.**
References:
1. [Caufield et al., 2021](https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.1c00177)
2. [Vaudel et al., 2015](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/pmic.201500295)
#### Data example
* Raw MS data:
  - .mzML format: [Yandi et al., Study ID ST002474](https://www.metabolomicsworkbench.org/data/show_archive_contents_link.php?STUDY_ID=ST002474)
* Sequencing Data:
  - FASTA/FASTQ format: [SRA Accession SRX19473245](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR23587914&display=download)
#### Data repo
A main data repository for proteomics data is [PRIDE](https://www.ebi.ac.uk/pride/); instructions and requirements for submission of data can be found [here](https://www.ebi.ac.uk/pride/markdownpage/pridesubmissiontool#additional_submission_metadata). Accompanying sequence data can be uploaded to [MassIVE](https://ccms-ucsd.github.io/MassIVEDocumentation/#submit_data/), [PRIDE](https://www.ebi.ac.uk/pride/), or the [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra/docs/submit/). Predicted protein sequences can be deposited at [UniProt](https://www.uniprot.org/).
#### Reporting of methods
#### Reporting of (meta)data availability
"Proteomics data files, including raw files, PD search files (msf)(...etc.) have been deposited to the PRIDE/ProteomeXchange database under the accession number XXX. Accompanying raw sequences were submitted to NCBI Sequence Read Archive and sample metadata was deposited to NCBI BioSample. These data are accessible via BioProject ID XXX."
---
### Metabolomics [RACHAEL]
Relevant papers from the [Metabolomics Standards Initiative (MSI)](https://wiki.metabolomicssociety.org/index.php/Metabolomics_Standards_Initiative):
* [Sumner et al., 2007](https://link.springer.com/article/10.1007/s11306-007-0082-2)
* [Griffin et al., 2007](https://link.springer.com/article/10.1007/s11306-007-0077-z)
* [Fiehn et al., 2007](https://link.springer.com/article/10.1007/s11306-007-0070-6)
#### Metadata standard(s) & requirements
*Note: The information in this table is for submission to the [Metabolomics Workbench](https://www.metabolomicsworkbench.org/data/faq.php) website.*
| Sample Data | Required? | Definition or Example | Recommended Format |
|---|---|---|---|
|Analytical Metadata| Y | - Sample collection, storage, preparation, and extraction methods. <br> - Analytical methods.| Include enough detail for independent replication |
|Biological Metadata | Y |- Taxonomy of organisms used in *in vivo* and *in vitro* experiments <br> - Animal husbandry, dietary information, etc. | Taxonomy is required; other information is optional. |
|Metabolites | Y |Data matrix of sample IDs linked to metabolites (known and/or unknown)| - For known metabolites: The [InChIKey](http://inchi.info/inchikey_overview_en.html) or [PubChem](https://pubchem.ncbi.nlm.nih.gov/) ID should be included. <br> - For unknown metabolites: Local identifier and other annotations (i.e. m/z value, retention index) should be included <br> - Units of measurement are required. |
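
The identifier and unit requirements in the Metabolites row can be screened with a short script. This sketch assumes one dictionary per metabolite with hypothetical keys `inchikey`, `pubchem_cid`, `local_id`, and `units`; none of these names come from the Metabolomics Workbench specification. It relies on the standard InChIKey layout of 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and 1 uppercase letter.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def check_metabolite(row):
    """Return a list of problems found in one metabolite annotation row."""
    problems = []
    if not str(row.get("units", "")).strip():
        problems.append("units missing")
    inchikey = str(row.get("inchikey", "")).strip()
    pubchem = str(row.get("pubchem_cid", "")).strip()
    local_id = str(row.get("local_id", "")).strip()
    if inchikey and not INCHIKEY.fullmatch(inchikey):
        problems.append("malformed InChIKey")
    if pubchem and not pubchem.isdigit():
        problems.append("PubChem ID is not numeric")
    # Known metabolites need an InChIKey or PubChem ID; unknowns need a local ID.
    if not (inchikey or pubchem or local_id):
        problems.append("no identifier")
    return problems


# A well-formed InChIKey (water) and an unknown feature with a local identifier.
print(check_metabolite({"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "units": "uM"}))
print(check_metabolite({"local_id": "unknown_0042", "units": ""}))
```

The pattern check confirms only the shape of the key, not that it resolves to a real compound.
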
#### Metadata template/example
* MS data:
- Specifications for metadata submission and examples of each type based on the [Metabolomics Standards Initiative (MSI)](https://wiki.metabolomicssociety.org/index.php/Metabolomics_Standards_Initiative) can be found [here](https://www.metabolomicsworkbench.org/data/mwTab_specification.pdf).
- An example study from [Metabolomics Workbench](https://www.metabolomicsworkbench.org/data/faq.php) can be found [here](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&DataMode=ProjectData&StudyID=ST002190&StudyType=MS&ResultType=5#DataTabs), with relevant metadata.
* Sequence Data:
- A [template generator](https://submit.ncbi.nlm.nih.gov/biosample/template/) is provided by NCBI, which allows the user to select type of sequence data and relevant standards. [Genome Standards Consortium MIxS packages](https://www.gensc.org/pages/standards-intro.html) are included in the generator. The template can be downloaded as an excel file or a .tsv file. Additionally, a definition or example record can be accessed for each standard package.
#### Metadata repo
Metadata for NMR or MS metabolomics data can be uploaded to the [Metabolomics Workbench](https://www.metabolomicsworkbench.org/data/faq.php). Instructions for submission to the Metabolomics Workbench can be found [here](https://www.metabolomicsworkbench.org/data/DRCCDataDeposit.php). For European entries, [MetaboLights](https://www.ebi.ac.uk/metabolights/) is also available (instructions for submission [here](https://www.ebi.ac.uk/metabolights/guides/Quick_start_Guide/Quick%20start%20overviewQuick%20start%20overviewQuick%20start%20overviewQuick%20start%20overview)). Metadata for accompanying sequencing data (whole genome, amplicon, transcriptome) should be uploaded to [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/). Instructions for submission to BioSample can be found [here](https://www.ncbi.nlm.nih.gov/biosample/docs/submission/faq/).
#### Data standard(s) & requirements
| Sample Data | Required? | Definition or Example | Recommended Format | Repository |
|---|---|---|---|---|
|Raw NMR or MS data| Y | *NMR*: can be free induction decay (FID) or Fourier transformed (FT); should also include instrument and software versions.| Open Source Formats ([mzML](https://www.psidev.info/mzML), [mzXML](https://sashimi.sourceforge.net/schema_revision/mzXML_2.1/Doc/mzXML_2.1_tutorial.pdf), [CDF](https://cdf.gsfc.nasa.gov/))| [Metabolomics Workbench](https://www.metabolomicsworkbench.org/data/faq.php)|
|Sequencing Data| N | Whole genome, Amplicon, Transcriptome | [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/), [FASTQ](https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#:~:text=Fastq%20consists%20of%20a%20defline,There%20are%20many%20variations.) | [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra)|
#### Data examples
* Raw MS data:
  - .CDF format: [Sumner et al., Study ID ST000054](https://www.metabolomicsworkbench.org/data/show_archive_contents_link.php?STUDY_ID=ST000054)
  - .mzML format: [Yandi et al., Study ID ST002474](https://www.metabolomicsworkbench.org/data/show_archive_contents_link.php?STUDY_ID=ST002474)
  - .mzXML format: [Brown et al., Study ID ST002465](https://www.metabolomicsworkbench.org/data/show_archive_contents_link.php?STUDY_ID=ST002465)
* Sequencing Data:
  - FASTA/FASTQ format: [SRA Accession SRX19473245](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR23587914&display=download)
#### Data repo
Raw NMR or MS metabolomics data can be uploaded to the [Metabolomics Workbench](https://www.metabolomicsworkbench.org/data/faq.php), which is a resource sponsored by the Common Fund of the National Institutes of Health. Instructions for submission to the Metabolomics Workbench can be found [here](https://www.metabolomicsworkbench.org/data/DRCCDataDeposit.php). For European entries, [MetaboLights](https://www.ebi.ac.uk/metabolights/) is also available (instructions for submission [here](https://www.ebi.ac.uk/metabolights/guides/Quick_start_Guide/Quick%20start%20overviewQuick%20start%20overviewQuick%20start%20overviewQuick%20start%20overview)). Accompanying sequencing data (whole genome, amplicon, transcriptome) should be uploaded to [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra). Instructions for submission to SRA can be found [here](https://www.ncbi.nlm.nih.gov/sra/docs/submit/).
#### Reporting of methods
#### Reporting of (meta)data availability
"Metabolomics data files, including raw files, biological metadata, (…etc.) have been deposited to the Metabolomics Workbench database under the study ID XXX. Accompanying raw sequences were submitted to NCBI Sequence Read Archive and sample metadata was deposited to NCBI BioSample. These data are accessible via BioProject ID XXX."
---
### DNA reference sequences (plastids/mitogenome) [SEAN]
---
### Microbial (meta)genomics (MAG/SAG, genome assemblies) [SEAN]
## MIxS Standard
[Minimum Information about any Sequence (MIxS) Standard](https://genomicsstandardsconsortium.github.io/mixs)
URI: http://w3id.org/mixs
### MIxS Checklists
| Checklist | Description |
| -------- | -------- |
| [MIGSEukaryote](https://genomicsstandardsconsortium.github.io/mixs/MIGSEukaryote) | Minimal Information about a Genome Sequence: eukaryote |
| [MIGSBacteria](https://genomicsstandardsconsortium.github.io/mixs/MIGSBacteria) | Minimal Information about a Genome Sequence: cultured bacteria/archaea |
| [MIGSPlant](https://genomicsstandardsconsortium.github.io/mixs/MIGSPlant) | Minimal Information about a Genome Sequence: plant |
| [MIGSVirus](https://genomicsstandardsconsortium.github.io/mixs/MIGSVirus) | Minimal Information about a Genome Sequence: virus |
| [MIGS Org](https://genomicsstandardsconsortium.github.io/mixs/MIGS) | Minimal Information about a Genome Sequence: org |
| [MIMS](https://genomicsstandardsconsortium.github.io/mixs/MIMS) | Metagenome or Environmental |
| [MIMARKSSpecimen](https://genomicsstandardsconsortium.github.io/mixs/MIMARKSSpecimen) | Minimal Information about a Marker Specimen: specimen |
| [MIMARKSSurvey](https://genomicsstandardsconsortium.github.io/mixs/MIMARKSSurvey) | Minimal Information about a Marker Specimen: survey |
| [MISAG](https://genomicsstandardsconsortium.github.io/mixs/MISAG) | Minimum Information About a Single Amplified Genome |
| [MIMAG](https://genomicsstandardsconsortium.github.io/mixs/MIMAG) | Minimum Information About a Metagenome-Assembled Genome |
| [MIUVIG](https://genomicsstandardsconsortium.github.io/mixs/MIUVIG) | Minimum Information About an Uncultivated Virus Genome |
### MIxS Packages
| Package | Description |
| -------- | -------- |
| [Agriculture](https://genomicsstandardsconsortium.github.io/mixs/Agriculture) | agriculture |
| [Air](https://genomicsstandardsconsortium.github.io/mixs/Air) | air |
| [BuiltEnvironment](https://genomicsstandardsconsortium.github.io/mixs/BuiltEnvironment) | built environment |
| [Food-animalAndAnimalFeed](https://genomicsstandardsconsortium.github.io/mixs/Food-animalAndAnimalFeed) | food-animal and animal feed |
| [Food-farmEnvironment](https://genomicsstandardsconsortium.github.io/mixs/Food-farmEnvironment) | food-farm environment |
| [Food-foodProductionFacility](https://genomicsstandardsconsortium.github.io/mixs/Food-foodProductionFacility) | food-food production facility |
| [Food-humanFoods](https://genomicsstandardsconsortium.github.io/mixs/Food-humanFoods) | food-human foods |
| [Host-associated](https://genomicsstandardsconsortium.github.io/mixs/Host-associated) | host-associated |
| [Human-associated](https://genomicsstandardsconsortium.github.io/mixs/Human-associated) | human-associated |
| [Human-gut](https://genomicsstandardsconsortium.github.io/mixs/Human-gut) | human-gut |
| [Human-oral](https://genomicsstandardsconsortium.github.io/mixs/Human-oral) | human-oral |
| [Human-skin](https://genomicsstandardsconsortium.github.io/mixs/Human-skin) | human-skin |
| [Human-vaginal](https://genomicsstandardsconsortium.github.io/mixs/Human-vaginal) | human-vaginal |
| [HydrocarbonResources-cores](https://genomicsstandardsconsortium.github.io/mixs/HydrocarbonResources-cores) | hydrocarbon resources-cores |
| [HydrocarbonResources-fluidsSwabs](https://genomicsstandardsconsortium.github.io/mixs/HydrocarbonResources-fluidsSwabs) | hydrocarbon resources-fluids/swabs |
| [MicrobialMatBiofilm](https://genomicsstandardsconsortium.github.io/mixs/MicrobialMatBiofilm) | microbial mat/biofilm |
| [MiscellaneousNaturalOrArtificialEnvironment](https://genomicsstandardsconsortium.github.io/mixs/MiscellaneousNaturalOrArtificialEnvironment) | miscellaneous natural or artificial environment |
| [Plant-associated](https://genomicsstandardsconsortium.github.io/mixs/Plant-associated) | plant-associated |
| [QuantityValue](https://genomicsstandardsconsortium.github.io/mixs/QuantityValue) | used to record a measurement |
| [Sediment](https://genomicsstandardsconsortium.github.io/mixs/Sediment) | sediment |
| [Soil](https://genomicsstandardsconsortium.github.io/mixs/Soil) | soil |
| [Symbiont-associated](https://genomicsstandardsconsortium.github.io/mixs/Symbiont-associated) | symbiont-associated |
| [WastewaterSludge](https://genomicsstandardsconsortium.github.io/mixs/WastewaterSludge) | wastewater/sludge |
| [Water](https://genomicsstandardsconsortium.github.io/mixs/Water) | water |