# ENA submission

## Register a study (or project)

* studies are typically registered before any data submission
* see https://ena-docs.readthedocs.io/en/latest/submit/study/programmatic.html

### Create the study XML

* project.xml
* see ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/ENA.project.xsd

```xml
<PROJECT_SET>
    <PROJECT alias="cheddar_cheese">
        <TITLE>Characterisation of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
        <DESCRIPTION>This study aimed to characterise the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk</DESCRIPTION>
        <SUBMISSION_PROJECT>
            <SEQUENCING_PROJECT/>
        </SUBMISSION_PROJECT>
    </PROJECT>
</PROJECT_SET>
```

### Create the submission XML

* submission.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION>
    <ACTIONS>
        <ACTION>
            <ADD/>
        </ACTION>
    </ACTIONS>
</SUBMISSION>
```

* The submission XML declares one or more Webin submission service actions. In this case the action is ``<ADD/>``, which is used to submit new objects.

### Submit the XML using curl

* test command (submissions to the test server are only kept for 24 h):

```
ENAUSERNAME=Webin-53556
ENAPW=XXX
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "PROJECT=@project.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
```

* To check whether the submission was successful, look at the first line of the ``<RECEIPT>`` block.
* The receipt will contain the accession numbers of the objects that you have submitted. In the case of an ENA study this is likely to be the accession that you will be including in a publication.
* if everything looks fine, change the submission command to use the production server:

```
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "PROJECT=@project.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"
```

## Register a sample

* it is important to choose the most relevant sample checklist available to you and to provide at least the minimum metadata
* Please also make sure you are familiar with the ENA’s taxonomy services and use the correct taxonomy to describe your samples.
* standard taxonomies: https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html
* environmental taxonomies: https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html#environmental-taxonomic-classifications
* all sample checklists can be found here: https://www.ebi.ac.uk/ena/submit/checklists

## Register samples programmatically

* ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd
* You can register one or more samples at the same time by using one ``<SAMPLE></SAMPLE>`` block for each sample.
* Most of the sample information comes in the form of ``<TAG>`` and ``<VALUE>`` pairs that belong in ``<SAMPLE_ATTRIBUTE>`` blocks. You can have any number of ``<SAMPLE_ATTRIBUTE>`` blocks in your samples.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
    <SAMPLE alias="MT5176" center_name="">
        <TITLE>human gastric microbiota, mucosal</TITLE>
        <SAMPLE_NAME>
            <TAXON_ID>1284369</TAXON_ID>
            <SCIENTIFIC_NAME>stomach metagenome</SCIENTIFIC_NAME>
            <COMMON_NAME></COMMON_NAME>
        </SAMPLE_NAME>
        <SAMPLE_ATTRIBUTES>
            <SAMPLE_ATTRIBUTE>
                <TAG>investigation type</TAG>
                <VALUE>mimarks-survey</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>sequencing method</TAG>
                <VALUE>pyrosequencing</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>collection date</TAG>
                <VALUE>2010</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>host body site</TAG>
                <VALUE>Mucosa of stomach</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>human-associated environmental package</TAG>
                <VALUE>human-associated</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (latitude)</TAG>
                <VALUE>1.81</VALUE>
                <UNITS>DD</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (longitude)</TAG>
                <VALUE>-78.76</VALUE>
                <UNITS>DD</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (country and/or sea)</TAG>
                <VALUE>Colombia</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (region and locality)</TAG>
                <VALUE>Tumaco</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (biome)</TAG>
                <VALUE>coast</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (feature)</TAG>
                <VALUE>human-associated habitat</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (material)</TAG>
                <VALUE>gastric biopsy</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>ENA-CHECKLIST</TAG>
                <VALUE>ERC000014</VALUE>
            </SAMPLE_ATTRIBUTE>
        </SAMPLE_ATTRIBUTES>
    </SAMPLE>
</SAMPLE_SET>
```

### Create a sample XML

* ``sample.xml``

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
    <SAMPLE alias="MT5176">
        <TITLE>human gastric microbiota, mucosal</TITLE>
        <SAMPLE_NAME>
            <TAXON_ID>1284369</TAXON_ID>
        </SAMPLE_NAME>
        <SAMPLE_ATTRIBUTES>
            <SAMPLE_ATTRIBUTE>
                <TAG>collection date</TAG>
                <VALUE>2010</VALUE>
            </SAMPLE_ATTRIBUTE>
        </SAMPLE_ATTRIBUTES>
    </SAMPLE>
</SAMPLE_SET>
```

### Create the submission XML

* ``submission.xml``
* this is the same as for the study submission (see above)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION>
    <ACTIONS>
        <ACTION>
            <ADD/>
        </ACTION>
    </ACTIONS>
</SUBMISSION>
```

### Submit via curl

* as for a study, first against the test server:

```
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "SAMPLE=@sample.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
```

* if everything looks fine:

```
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "SAMPLE=@sample.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"
```

## Preparing files for submission

* accepted data formats: https://ena-docs.readthedocs.io/en/latest/submit/fileprep/reads.html

### ONT data

Oxford Nanopore native data must be submitted as a single tar.gz archive containing basecalled fast5 files from Metrichor or Albacore.

```bash
XYZ/reads/downloads/fail/
XYZ/reads/downloads/pass/
```

How to archive all files in the XYZ downloads directory on the Linux command line:

```bash
cd <directory containing XYZ directory>
tar -cvzf XYZ.tar.gz XYZ/reads/downloads/
```

### Preparing FASTQ files for upload

* gzip each file: ``gzip test.fq``
* compute the MD5 checksum: ``md5sum test.fq.gz > test.fq.gz.md5``
* With the exception of Oxford Nanopore FAST5 files, do not tar archive any collections of files; each file should be uploaded separately.

### Upload

#### Using FTP Command Line Client On Linux/Mac

* Open a terminal and type `ftp webin.ebi.ac.uk`.
* Enter the username and password associated with your Webin submission account.
* Type `bin` to use binary mode.
* Type `ls` to check the content of your drop box.
* Type `prompt` to switch off confirmation for each file uploaded.
* Use the `mput` command to upload files, for example `mput *.gz *.md5`.
* Use the `bye` command to exit the ftp client.

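The compression and checksum steps from "Preparing FASTQ files for upload" can be batched before opening the FTP session. The following is only a minimal sketch: `prepare_fastq` is a hypothetical helper name (not part of any ENA tool), and it assumes `gzip` and GNU `md5sum` are available on the machine.

```shell
#!/usr/bin/env bash
# Sketch only: gzip one FASTQ file and write an MD5 file next to it.
# prepare_fastq is a hypothetical helper name, not an ENA command.
prepare_fastq() {
    local fq="$1"
    local dir base
    dir=$(dirname "$fq")
    base=$(basename "$fq")
    (
        # Work inside the file's directory so the .md5 file records a bare file name.
        cd "$dir" || exit 1
        gzip "$base" || exit 1
        md5sum "$base.gz" > "$base.gz.md5" || exit 1
        # Re-check the compressed file against the checksum file just written.
        md5sum -c "$base.gz.md5" > /dev/null
    )
}

# Possible usage (commented out so that sourcing this file has no side effects):
# for f in *.fq; do prepare_fastq "$f"; done
```

After this, the `mput *.gz *.md5` step in the FTP session picks up both the compressed files and their checksum files.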
#### Using Aspera ascp Command Line Program

Aspera is a commercial file transfer protocol that may provide better transfer speeds than FTP over long distances. For short-distance file transfers we recommend the use of FTP.

Download the Aspera CLI from https://downloads.asperasoft.com/en/downloads/62. Please select the correct operating system. The ascp command line client is distributed as part of the Aspera CLI in the cli/bin folder.

Your command should look similar to this:

```bash
ascp -QT -l300M -L- <file(s)> <Webin-N>@webin.ebi.ac.uk:.
```

The ``-l300M`` option sets the upload speed limit to 300 Mbit/s. You may wish to lower this value to increase the reliability of the transfer. The ``-L-`` option prints logs while transferring. The ``<file(s)>`` can be a file mask (e.g. ``*.cram``), a list of files or a single file. ``<Webin-N>`` is your Webin submission account name.

## Submit raw reads

* raw reads are represented as 'run' and 'experiment' objects
* The run submission holds information about the raw read files generated in a run of sequencing.
* The experiment submission holds metadata that describe the methods used to sequence the sample.

### Submit raw reads programmatically

https://ena-docs.readthedocs.io/en/latest/submit/reads/programmatic.html

* using experiment and run XMLs

An experiment object represents the library solution that is created from a sample and used in a sequencing experiment. The experiment object contains details about the sequencing platform and library protocols. A run object represents a lane (or equivalent) on a sequencing machine and is used to attach sequence read data to experiments.

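Because a run attaches uploaded read files to an experiment, each file appears in the run XML as a ``<FILE>`` element carrying its MD5 checksum. The following is a minimal sketch of generating such an element; `file_element` is a hypothetical helper name, GNU `md5sum` is assumed, and the file is assumed to sit in the root of the Webin drop box so that only its base name is written.

```shell
#!/usr/bin/env bash
# Sketch only: print one <FILE> element for a run XML.
# file_element is a hypothetical helper name, not an ENA command.
# $1: path to the (already compressed) data file
# $2: value for the filetype attribute (defaults to "fastq")
file_element() {
    local path="$1"
    local filetype="${2:-fastq}"
    local checksum
    [ -f "$path" ] || return 1
    # md5sum prints "<checksum>  <file name>"; keep only the checksum.
    checksum=$(md5sum "$path" | awk '{print $1}')
    printf '<FILE filename="%s" filetype="%s" checksum_method="MD5" checksum="%s"/>\n' \
        "$(basename "$path")" "$filetype" "$checksum"
}

# Possible usage (commented out so that sourcing this file has no side effects):
# file_element mantis_religiosa_R1.fastq.gz
# file_element 2019-01-25_aquadiva_test.tar.gz OxfordNanopore_native
```

The printed lines can be pasted into the ``<FILES>`` block of a run, which avoids copying checksums by hand.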
* ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.experiment.xsd
* ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.run.xsd

#### Run XML: part of experiment

```xml
<EXPERIMENT_REF accession="ERX123456"/>
```

or

```xml
<EXPERIMENT_REF refname="exp_mantis_religiosa"/>
```

#### Experiment XML: part of study

```xml
<STUDY_REF accession="ERP123456"/>
```

#### Experiment XML: associated with sample

```xml
<SAMPLE_DESCRIPTOR accession="SRS462875"/>
```

#### Upload data files

* the data files must be uploaded before the run XML is submitted
* If the files are uploaded to the root directory, simply enter the file name in the Run XML when referring to it:

```xml
<FILE filename="mantis_religiosa_R1.fastq.gz" ... />
```

#### Create the Run and Experiment XML

##### Run AND Experiment XML: paired fastq

Below is an example of Illumina HiSeq 2000 paired-end reads being submitted in Fastq format. The experiment points to a pre-registered sample and study using their accessions. The run points to the experiment using the experiment’s alias.

Experiment XML:

```xml
<EXPERIMENT_SET>
    <EXPERIMENT alias="exp_mantis_religiosa">
        <TITLE>The 1KITE project: evolution of insects</TITLE>
        <STUDY_REF accession="SRP017801"/>
        <DESIGN>
            <DESIGN_DESCRIPTION/>
            <SAMPLE_DESCRIPTOR accession="SRS462875"/>
            <LIBRARY_DESCRIPTOR>
                <LIBRARY_NAME/>
                <LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
                <LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>
                <LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION>
                <LIBRARY_LAYOUT>
                    <PAIRED NOMINAL_LENGTH="250" NOMINAL_SDEV="30"/>
                </LIBRARY_LAYOUT>
                <LIBRARY_CONSTRUCTION_PROTOCOL>Messenger RNA (mRNA) was isolated using the Dynabeads mRNA Purification Kit (Invitrogen, Carlsbad Ca. USA) and then sheared using divalent cations at 72 °C. These cleaved RNA fragments were transcribed into first-strand cDNA using II Reverse Transcriptase (Invitrogen, Carlsbad Ca. USA) and N6 primer (IDT). The second-strand cDNA was subsequently synthesized using RNase H (Invitrogen, Carlsbad Ca. USA) and DNA polymerase I (Invitrogen, Shanghai China). The double-stranded cDNA then underwent end-repair, a single 'A' base addition, adapter ligation, and size selection on an agarose gel (250 ± 20 bp). At last, the product was indexed and PCR amplified to finalize the library preparation for the paired-end cDNA.</LIBRARY_CONSTRUCTION_PROTOCOL>
            </LIBRARY_DESCRIPTOR>
        </DESIGN>
        <PLATFORM>
            <ILLUMINA>
                <INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
            </ILLUMINA>
        </PLATFORM>
        <EXPERIMENT_ATTRIBUTES>
            <EXPERIMENT_ATTRIBUTE>
                <TAG>library preparation date</TAG>
                <VALUE>2010-08</VALUE>
            </EXPERIMENT_ATTRIBUTE>
        </EXPERIMENT_ATTRIBUTES>
    </EXPERIMENT>
</EXPERIMENT_SET>
```

Run XML:

```xml
<RUN_SET>
    <RUN alias="run_mantis_religiosa" center_name="">
        <EXPERIMENT_REF refname="exp_mantis_religiosa"/>
        <DATA_BLOCK>
            <FILES>
                <FILE filename="mantis_religiosa_R1.fastq.gz" filetype="fastq"
                    checksum_method="MD5" checksum="9b8932f85caa54e687eba62fca3edce2"/>
                <FILE filename="mantis_religiosa_R2.fastq.gz" filetype="fastq"
                    checksum_method="MD5" checksum="183d6a24e0c3704e993bebe75bbbd989"/>
            </FILES>
        </DATA_BLOCK>
    </RUN>
</RUN_SET>
```

You can submit several experiments and runs at the same time by using multiple ``<EXPERIMENT>`` and ``<RUN>`` blocks.

Experiment XML:

```xml
<EXPERIMENT_SET>
    <EXPERIMENT alias="exp_01">
        ...
    </EXPERIMENT>
    ...
    <EXPERIMENT alias="exp_05">
        ...
    </EXPERIMENT>
</EXPERIMENT_SET>
```

Run XML:

```xml
<RUN_SET>
    <RUN alias="run_01">
        <EXPERIMENT_REF refname="exp_01"/>
        <DATA_BLOCK>
            <FILES>
                ...
            </FILES>
        </DATA_BLOCK>
    </RUN>
    ...
    <RUN alias="run_05">
        <EXPERIMENT_REF refname="exp_05"/>
        <DATA_BLOCK>
            <FILES>
                ...
            </FILES>
        </DATA_BLOCK>
    </RUN>
</RUN_SET>
```

##### Experiment XML: Oxford Nanopore

```xml
<EXPERIMENT_SET>
    <EXPERIMENT alias="exp_groundwater_aquadiva_nanopore_02">
        <TITLE></TITLE>
        <STUDY_REF accession="PRJEB35315"/>
        <DESIGN>
            <DESIGN_DESCRIPTION/>
            <SAMPLE_DESCRIPTOR refname="aquadiva_02um"/>
            <LIBRARY_DESCRIPTOR>
                <LIBRARY_NAME/>
                <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
                <LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE>
                <LIBRARY_SELECTION>size fractionation</LIBRARY_SELECTION>
                <LIBRARY_LAYOUT>
                    <SINGLE/>
                </LIBRARY_LAYOUT>
                <LIBRARY_CONSTRUCTION_PROTOCOL>We performed Nanopore sequencing ...</LIBRARY_CONSTRUCTION_PROTOCOL>
            </LIBRARY_DESCRIPTOR>
        </DESIGN>
        <PLATFORM>
            <OXFORD_NANOPORE>
                <INSTRUMENT_MODEL>MinION</INSTRUMENT_MODEL>
            </OXFORD_NANOPORE>
        </PLATFORM>
        <EXPERIMENT_ATTRIBUTES>
            <EXPERIMENT_ATTRIBUTE>
                <TAG>library preparation date</TAG>
                <VALUE>2019-01</VALUE>
            </EXPERIMENT_ATTRIBUTE>
        </EXPERIMENT_ATTRIBUTES>
    </EXPERIMENT>
</EXPERIMENT_SET>
```

##### Run XML: Oxford Nanopore

If you wish to submit your nanopore sequencing reads in their native FAST5 format, you must prepare an individual gzipped tar archive for each run.

The run XML should look as follows:

```xml
<RUN_SET>
    <RUN alias="run_groundwater_aquadiva_nanopore_02" center_name="">
        <EXPERIMENT_REF refname="exp_groundwater_aquadiva_nanopore_02"/>
        <DATA_BLOCK>
            <FILES>
                <FILE filename="h52_test_guppy2-3-1.fastq.gz" filetype="fastq"
                    checksum_method="MD5" checksum="35739c52e8d4f62fdf705af09d912c37"/>
            </FILES>
        </DATA_BLOCK>
    </RUN>
    <RUN alias="run_groundwater_aquadiva_nanopore_02_raw" center_name="">
        <EXPERIMENT_REF refname="exp_groundwater_aquadiva_nanopore_02"/>
        <DATA_BLOCK>
            <FILES>
                <FILE filename="2019-01-25_aquadiva_test.tar.gz" filetype="OxfordNanopore_native"
                    checksum_method="MD5" checksum="3b4723db8575755a6884917658084bda"/>
            </FILES>
        </DATA_BLOCK>
    </RUN>
</RUN_SET>
```

#### Create the submission XML

* see above
* test submission:

```
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "EXPERIMENT=@experiment.xml" -F "RUN=@run.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
```

* if everything looks fine:

```
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "EXPERIMENT=@experiment.xml" -F "RUN=@run.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"
```

## How to submit assemblies

https://ena-docs.readthedocs.io/en/latest/submit/assembly.html

## How to submit annotated sequences

https://ena-docs.readthedocs.io/en/latest/submit/sequence.html

-----

## Coverage calculation for ENA assembly upload

The coverage calculation is a bit longer than I thought and involves mapping reads back to your assembly. This is the script:

```bash
SCRIPT=https://github.com/EBI-Metagenomics/assembly-pipeline/blob/master/pipeline_lib/fasta_coverage.py
```

I think you probably only want the second half, from line 63 onwards. The steps essentially are:

1. index contig file
2. map reads to assemblies with BWA-MEM
3. sort output bam file & index
4. calculate coverage depth per contig (METABAT used in this case)
5. multiply contig length by average depth (get total of all)
6. divide total by input base count

An example of a coverage file is here:

```bash
EXAMPLE=/hps/nobackup2/production/metagenomics/results/assemblies/SRP1899/SRP189971/SRR8822/SRR8822466/metaspades/001/coverage/coverage.tab
```

### Commands

```bash
bwa index {contig_file}

# write one bsub mapping job per assembly into MAPPING.sh
for i in Hain_*.fasta.gz; do
    BN=$(basename $i _metaspades3-13.fasta.gz)
    NAME=$(echo $BN | sed 's/Hain_//g' | sed 's/um_R/_/g' | awk 'BEGIN{FS="_"};{printf $1"_"; if($2=="01"){printf "0_1"}else{printf "0_2"}; printf "_"$3}')
    if [[ $NAME == "PNK108_0_2_02um" ]]; then NAME="PNK108_H32_0_2"; fi
    printf "bsub -n 8 -M 24.0G \"bwa mem -t 8 ${i} ~/data/reads/illumina/2019-12-11_aquadiva_nextseq/${NAME}_R1.fastq.gz ~/data/reads/illumina/2019-12-11_aquadiva_nextseq/${NAME}_R2.fastq.gz | samtools view -uS - -o ${NAME}.bam\"\n"
done > MAPPING.sh
# general form:
# bwa mem -t 4 {index} {raw-file-1} {raw-file-2} | samtools view -uS - -o unsorted.bam

# sort and index each bam file, then calculate the coverage depth per contig
for i in *.bam; do
    BN=$(basename $i .bam)
    bsub -n 4 -M 24.0G "samtools sort -@ 4 $i -o $BN.sorted.bam; samtools index $BN.sorted.bam; jgi_summarize_bam_contig_depths --outputDepth $BN.coverage.tab $BN.sorted.bam"
done
# That gives you coverage.tab.
# coverage.tab is a list of coverage depths per contig.
# For each line in that file we then need to multiply length by read depth and get a total,
# then divide that by the base count of your raw reads.

# use this:
source /hps/nobackup2/production/metagenomics/assembly-pipeline/prod/venv/bin/activate
for R1 in ~/data/reads/illumina/2019-12-11_aquadiva_nextseq/*_R1.fastq.gz; do
    BN=$(basename $R1 _R1.fastq.gz)
    DN=$(dirname $R1)
    R2=${DN}/${BN}_R2.fastq.gz
    bsub -n 2 -M 8.0G "python coverage.py --input ${R1},${R2} --coverage $BN.coverage.tab > ${BN}.cov"
done
```

### Assembly upload

```bash
# register new study using submission.xml and project.xml
# (ENAUSERNAME and ENAPW hold the Webin account name and password; never store the password in this file)
curl -u $ENAUSERNAME:$ENAPW -F "SUBMISSION=@submission.xml" -F "PROJECT=@project.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"

# upload assembly to the previously registered study using manifest.txt
java -jar /hps/nobackup2/production/metagenomics/assembly-pipeline/prod/webin_cli/webin-cli-1.8.8.jar -context genome -manifest manifest.txt -submit -userName $ENAUSERNAME -password $ENAPW -outputDir upload_report
```
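
Going back to the coverage calculation, steps 5 and 6 can also be sketched directly with awk when the pipeline's `coverage.py` is not at hand. This is not the EBI script, only an illustration of the arithmetic described above. Both function names are hypothetical, and the sketch assumes that the depth table is tab-separated with a header line and with contig length in column 2 and average depth in column 3 (the layout written by `jgi_summarize_bam_contig_depths`).

```shell
#!/usr/bin/env bash
# Sketch only; both helper names are hypothetical and not part of any pipeline.

# Count the bases in one or more gzipped FASTQ files
# (the sequence is the 2nd line of every 4-line record).
fastq_base_count() {
    gzip -dc "$@" | awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }'
}

# $1: depth table (assumed columns: contigName, contigLen, totalAvgDepth, ...)
# $2: total base count of the raw reads
# Prints sum(contig length x average depth) / raw base count.
assembly_coverage() {
    local depth_table="$1"
    local raw_bases="$2"
    awk -F'\t' -v raw="$raw_bases" '
        NR > 1 { total += $2 * $3 }   # skip the header line
        END {
            if (raw > 0) printf "%.4f\n", total / raw
            else exit 1               # refuse to divide by zero
        }
    ' "$depth_table"
}

# Possible usage (commented out so that sourcing this file has no side effects):
# raw=$(fastq_base_count sample_R1.fastq.gz sample_R2.fastq.gz)
# assembly_coverage sample.coverage.tab "$raw"
```

Compare the result with the output of `coverage.py` on one sample before relying on it, since the column layout is an assumption here.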