
Diamond-MEGAN Pipeline


Pipeline Idea

  • Download NCBI nr database
  • Build Diamond database from nr
  • Run Diamond blastx on each file of a sample's fastq pair, generating 2 daa files per sample
  • Run daa2rma on each pair of daa files to generate a single rma file per sample
  • Use Meganizer
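The per-sample steps above can be sketched as a small command generator. This is only a sketch: it prints the commands instead of running them, it assumes the `<sample>_R1_trimmed.fastq.gz` / `<sample>_R2_trimmed.fastq.gz` naming used later in these notes, and the sample list (`sample1 sample2`) is hypothetical. The flags are copied from the commands recorded in these notes.

```shell
# Print (not run) the per-sample commands; pipe the output to a shell to execute.
THREADS=42                    # thread count used elsewhere in these notes
DB=nr                         # Diamond database prefix (nr.dmnd)
MAP=megan-map-Jul2020-2.db    # MEGAN mapping file

sample_commands() {
    local s="$1" mate
    # One Diamond run per mate file, each producing its own .daa
    for mate in R1 R2; do
        echo "diamond blastx --threads $THREADS --db $DB" \
             "--query ${s}_${mate}_trimmed.fastq.gz --daa ${s}_${mate}_trimmed.daa"
    done
    # One daa2rma call per sample, combining both .daa files
    echo "daa2rma --paired true --in ${s}_R1_trimmed.daa ${s}_R2_trimmed.daa" \
         "--acc2taxa $MAP -v true -o megan_out"
}

# Hypothetical sample list; replace with the real sample prefixes.
for sample in sample1 sample2; do
    sample_commands "$sample"
done
```

Printing first makes it easy to review the exact commands before committing to runs that take several hours each.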

About paired reads

"The only way to do a paired-end analysis is to use the daa2rma tool which will create a new single .rma file from your two input .daa files. This program has command-line options to setup paired-read import. Use the options:

paired pairedSuffixLength <length>

Here <length> is the number of trailing letters in the first word of the read name that distinguishes between two paired reads. If the two reads (in the two separate files) have exactly the same name (that is, first word in header line), then length should be set to 0, if they differ by one letter, then 1, etc."

Source: http://megan.informatik.uni-tuebingen.de/t/paired-end-read-analysis-diamond-megan/1340

http://megan.informatik.uni-tuebingen.de/t/generic-pipeline-using-diamond-and-megan6/50

See here: http://megan.informatik.uni-tuebingen.de/t/paired-reads-daa-meganizer-daa2rma/1241
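To make the suffix-length rule from the quote concrete, here is a minimal sketch that compares the first word of two read headers and counts the trailing letters beyond their common prefix. The function name `paired_suffix_length` and the example read names are hypothetical, and this is only one reading of the quoted rule; check the result against your own headers before relying on it.

```shell
# Guess the suffix length from the first word of two paired read headers.
paired_suffix_length() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Keep only the first word of each header line
        sub(/[ \t].*/, "", a)
        sub(/[ \t].*/, "", b)
        # Walk along the common prefix of the two names
        i = 1
        while (i <= length(a) && i <= length(b) && substr(a, i, 1) == substr(b, i, 1))
            i++
        # Trailing letters of the first name that lie beyond the common prefix
        print length(a) - (i - 1)
    }'
}

paired_suffix_length "@read42/1 extra" "@read42/2 extra"   # prints 1
paired_suffix_length "@read42 1:N:0" "@read42 2:N:0"       # prints 0
```

In practice the two headers would come from the first line of each FASTQ file of a pair.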

Diamond

On the cloud instance:

Create new directory for Diamond operations

mkdir diamond

Create a conda environment with the required packages

conda create -y -n diamond_megan diamond megan
conda activate diamond_megan

Build Diamond database

Download NCBI nr database (wget or rsync; either one is enough)

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
rsync --copy-links --times --verbose \
rsync://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz .

File version 7/11/20, 7:39:00 PM, 76.2 GB
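With a download of this size it may be worth checking the file before decompressing it. The sketch below compares a file's MD5 against the first field of a checksum file. It assumes that a matching `nr.gz.md5` is published next to `nr.gz` on the NCBI server (not verified in these notes); the helper name `verify_md5` is hypothetical.

```shell
# Return success if FILE's MD5 matches the first field of SUM_FILE.
verify_md5() {
    local file="$1" sum_file="$2" expected actual
    expected=$(awk '{ print $1; exit }' "$sum_file")
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$expected" = "$actual" ]
}

# Intended use (assumes nr.gz.md5 exists next to nr.gz on the server):
#   wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz.md5
#   verify_md5 nr.gz nr.gz.md5 && echo "nr.gz OK"

# Small self-contained demonstration on a throwaway file.
demo=$(mktemp)
printf 'ACGT\n' > "$demo"
md5sum "$demo" > "$demo.md5"
verify_md5 "$demo" "$demo.md5" && echo "checksum matches"
rm -f "$demo" "$demo.md5"
```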

Decompress file

gunzip nr.gz

~140 GB in size

Make Diamond database using nr

diamond makedb --in nr -d nr -p 42

Processed 300467018 sequences, 108494634730 letters.
Total time = 5723.16s

The command above creates a binary DIAMOND database file with the specified name (nr.dmnd, ~143 GB)
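Given the sizes recorded in these notes, a rough disk-space check before building the database might look like the sketch below. It assumes `gunzip` has already replaced `nr.gz` with `nr`, so the decompressed FASTA and `nr.dmnd` are the two large files on disk; any temporary space Diamond needs during `makedb` is not counted. The helper name `avail_gb` is hypothetical.

```shell
# Sizes reported in these notes, in GB.
NR_FASTA=140   # decompressed nr
NR_DMND=143    # nr.dmnd

# Available space in GB on the filesystem holding DIR (POSIX df, 1 KiB blocks).
avail_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

needed=$((NR_FASTA + NR_DMND))   # nr and nr.dmnd side by side
have=$(avail_gb .)

if [ "$have" -ge "$needed" ]; then
    echo "enough space: ${have} GB available, ~${needed} GB needed"
else
    echo "warning: ${have} GB available, ~${needed} GB needed"
fi
```

If `nr.gz` is kept as well, add its 76.2 GB to the estimate.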


Diamond Example Run

Create sample query file

Diamond accepts only a single query file per run, so paired FASTQs have to be either merged into one file or searched one file at a time.

cd /home/cmicro/samples
cat sample1_R1_trimmed.fastq.gz sample1_R2_trimmed.fastq.gz > \
sample1_comb_trimmed.fastq.gz
cd /home/cmicro/diamond
ln -s /home/cmicro/samples/sample1_comb_trimmed.fastq.gz .
ln -s /vol_c/metagenomic-read-files/sample1_R1_trimmed.fastq.gz .
ln -s /vol_c/metagenomic-read-files/sample1_R2_trimmed.fastq.gz .
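Concatenating gzip files with `cat` produces a valid multi-member gzip stream, so the merged file can be sanity-checked by comparing read counts. The sketch below counts reads under the assumption of standard 4-line FASTQ records; the helper name `count_reads` and the tiny demonstration reads are hypothetical.

```shell
# Count reads in a gzipped FASTQ, assuming 4 lines per record.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Demonstration on two tiny throwaway files (hypothetical reads).
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > "$tmp/R1.fastq.gz"
printf '@r1/2\nTTGA\n+\nIIII\n' | gzip > "$tmp/R2.fastq.gz"
cat "$tmp/R1.fastq.gz" "$tmp/R2.fastq.gz" > "$tmp/comb.fastq.gz"

count_reads "$tmp/comb.fastq.gz"   # prints 2
rm -rf "$tmp"
```

For the real files, the count for `sample1_comb_trimmed.fastq.gz` should equal the sum of the counts for the R1 and R2 files.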

Run a Diamond blastx search on each query file using the prebuilt database

(Not using a more sensitive mode here, to test run time)

diamond blastx  --threads 42 --query sample1_R1_trimmed.fastq.gz --db nr \
--daa sample1_R1_trimmed.daa

Total time = 24478.4s
Reported 80323993 pairwise alignments, 80371409 HSPs.
3507729 queries aligned.

diamond blastx  --threads 42 --query sample1_R2_trimmed.fastq.gz --db nr \
--daa sample1_R2_trimmed.daa

Total time = 21612.6s
Reported 69715453 pairwise alignments, 69754248 HSPs.
3062178 queries aligned.
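The "Total time" values are easier to compare in hours and minutes. A small conversion sketch (the helper name `to_hms` is hypothetical; fractional seconds are dropped):

```shell
# Convert a Diamond "Total time" value in seconds to hours, minutes, seconds.
to_hms() {
    awk -v s="$1" 'BEGIN {
        s = int(s)   # drop fractional seconds
        printf "%dh%02dm%02ds\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

to_hms 24478.4   # R1 run: prints 6h47m58s
to_hms 21612.6   # R2 run: prints 6h00m12s
```

So with 42 threads and the default sensitivity, each mate file took roughly 6 to 7 hours against nr.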

mkdir daa_backup
cp *.daa daa_backup/

Get MEGAN6 mapping files

wget https://software-ab.informatik.uni-tuebingen.de/download/megan6/megan-map-Jul2020-2.db.zip
sudo apt install unzip
unzip megan-map-Jul2020-2.db.zip 
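
Alternative: run Diamond on the combined file with tabular (.m8) output

diamond blastx  --threads 42 --db nr --query sample1_comb_trimmed.fastq.gz \
-o sample1_comb_trimmed.m8

Reference daa2rma call using the older GI-based mapping files

daa2rma -i 10daa/reads.daa -o 20rma/reads.rma -g2t gi_taxid.bin -g2kegg gi2kegg.bin -fun KEGG

Convert the paired daa files into a single rma file

daa2rma --paired true --in sample1_R1_trimmed.daa sample1_R2_trimmed.daa \
--acc2taxa megan-map-Jul2020-2.db -v true -o megan_out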
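Since each Diamond run takes hours, it may help to confirm that both daa files for a sample exist and are non-empty before starting daa2rma. A minimal guard, assuming the `<sample>_R1_trimmed.daa` / `<sample>_R2_trimmed.daa` naming used above (the helper name `have_daa_pair` is hypothetical):

```shell
# Succeed only if both mate .daa files for SAMPLE exist and are non-empty.
have_daa_pair() {
    local s="$1" dir="${2:-.}"
    [ -s "$dir/${s}_R1_trimmed.daa" ] && [ -s "$dir/${s}_R2_trimmed.daa" ]
}

if have_daa_pair sample1; then
    echo "sample1: both .daa files present, ready for daa2rma"
else
    echo "sample1: missing or empty .daa file, skipping" >&2
fi
```

This only checks presence and size; it does not validate the contents of the daa files.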