
# Diamond-MEGAN Pipeline
<br>

## Pipeline Idea
* Download the NCBI nr database
* Build a Diamond database from nr
* Run Diamond blastx on each pair of sample FASTQ files, generating two .daa files
* Run daa2rma on each pair to generate a single .rma file per sample
* Use Meganizer

## About paired reads
"The only way to do a paired-end analysis is to use the daa2rma tool which will create a new single .rma file from your two input .daa files. This program has command-line options to setup paired-read import. Use the options: `--paired --pairedSuffixLength <length>` Here `<length>` is the number of trailing letters in the first word of the read name that distinguishes between two paired reads. If the two reads (in the two separate files) have exactly the same name (that is, first word in header line), then length should be set to 0, if they differ by one letter, then 1, etc."

Source:
* http://megan.informatik.uni-tuebingen.de/t/paired-end-read-analysis-diamond-megan/1340
* http://megan.informatik.uni-tuebingen.de/t/generic-pipeline-using-diamond-and-megan6/50

See here: http://megan.informatik.uni-tuebingen.de/t/paired-reads-daa-meganizer-daa2rma/1241

## Diamond
### On the cloud instance:
#### Create new directory for Diamond operations
```
mkdir diamond
```
### Create a conda environment with the required packages
```
conda create -y -n diamond_megan diamond megan
conda activate diamond_megan
```
### Build Diamond database
#### Download NCBI nr database
Either of the following commands fetches the file:
```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
```
```
rsync --copy-links --times --verbose \
    rsync://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz .
```
> File version 7/11/20, 7:39:00 PM, 76.2 GB

Decompress the file:
```
gunzip nr.gz
```
The decompressed file is ~140 GB in size.
#### Make Diamond database using nr
```
diamond makedb --in nr -d nr -p 42
```
> Processed 300467018 sequences, 108494634730 letters.
> Total time = 5723.16s

The above command creates a binary DIAMOND database file with the specified name (nr.dmnd, ~143 GB).
<br>

## Diamond Example Run
#### Create sample query file
Diamond accepts only a single query file, so the paired FASTQ files have to be concatenated.
```
cd /home/cmicro/samples
cat sample1_R1_trimmed.fastq.gz sample1_R2_trimmed.fastq.gz > \
    sample1_comb_trimmed.fastq.gz
```
#### Add symbolic link to combined sample1 fastq file in diamond directory
```
cd /home/cmicro/diamond
ln -s /home/cmicro/samples/sample1_comb_trimmed.fastq.gz .
```
#### Create symbolic links to fastq files for sample 1
```
ln -s /vol_c/metagenomic-read-files/sample1_R1_trimmed.fastq.gz .
ln -s /vol_c/metagenomic-read-files/sample1_R2_trimmed.fastq.gz .
```
#### Run Diamond BLAST search on query file using pre-built database
(Not using the more sensitive flag; testing run time.)
```
diamond blastx --threads 42 --query sample1_R1_trimmed.fastq.gz --db nr \
    --daa sample1_R1_trimmed.daa
```
> Total time = 24478.4s
> Reported 80323993 pairwise alignments, 80371409 HSPs.
> 3507729 queries aligned.

```
diamond blastx --threads 42 --query sample1_R2_trimmed.fastq.gz --db nr \
    --daa sample1_R2_trimmed.daa
```
> Total time = 21612.6s
> Reported 69715453 pairwise alignments, 69754248 HSPs.
> 3062178 queries aligned.

Back up the resulting .daa files:
```
mkdir daa_backup
cp *.daa daa_backup/
```
#### Get MEGAN6 mapping files
```
wget https://software-ab.informatik.uni-tuebingen.de/download/megan6/megan-map-Jul2020-2.db.zip
```
```
sudo apt install unzip
unzip megan-map-Jul2020-2.db.zip
```
Diamond run on the combined query file, writing tabular output:
```
diamond blastx --threads 42 --db nr --query sample1_comb_trimmed.fastq.gz \
    -o sample1_comb_trimmed.m8
```
Example daa2rma call:
```
daa2rma -i 10daa/reads.daa -o 20rma/reads.rma -g2t gi_taxid.bin -g2kegg gi2kegg.bin -fun KEGG
```
Paired-read daa2rma call for sample1:
```
daa2rma --paired true --in sample1_R1_trimmed.daa sample1_R2_trimmed.daa \
    --acc2taxa megan-map-Jul2020-2.db -v true -o megan_out
```
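
#### Sketch: estimating a value for --pairedSuffixLength
The paired `daa2rma` call above enables `--paired` but does not set `--pairedSuffixLength`, which the forum quote describes as the number of trailing letters that distinguish the two mates' names. The following is a minimal sketch of how that number could be derived from the first read header of each file. The helper names `paired_suffix_length` and `first_read_name` are hypothetical and not part of MEGAN or DIAMOND, and the sketch assumes both mates' names have the same length.

```shell
# Hypothetical helper: print how many trailing characters must be stripped
# from two read names before they become identical.
# Assumption: both names have the same length (e.g. "read1/1" and "read1/2").
paired_suffix_length() {
    psl_a=$1
    psl_b=$2
    psl_n=0
    if [ "${#psl_a}" -ne "${#psl_b}" ]; then
        echo "read names differ in length: $1 vs $2" >&2
        return 1
    fi
    # Strip one trailing character from each name until the remainders match.
    while [ "$psl_a" != "$psl_b" ]; do
        psl_a=${psl_a%?}
        psl_b=${psl_b%?}
        psl_n=$((psl_n + 1))
    done
    echo "$psl_n"
}

# Hypothetical helper: first word of the first header line of a gzipped
# FASTQ file, with the leading "@" removed.
first_read_name() {
    gzip -cd "$1" | head -n 1 | awk '{ sub(/^@/, "", $1); print $1 }'
}
```

With the file names used in these notes, the value could be obtained as `paired_suffix_length "$(first_read_name sample1_R1_trimmed.fastq.gz)" "$(first_read_name sample1_R2_trimmed.fastq.gz)"` and then passed to `daa2rma` via `--pairedSuffixLength`. Checking only the first read assumes that every read in the files follows the same naming scheme.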