# Workflow to reconstruct metagenome-assembled genomes (MAGs) from microbiomes

Sherlynette Pérez Castro, Ruff Lab - Marine Biological Laboratory, Woods Hole, MA

**Description**

Pipeline to reconstruct population genomes from microbiomes, including quality check, assembly, and binning. This pipeline is tailored to the servers and clusters of the Josephine Bay Paul Center computing facility at the Marine Biological Laboratory. The code can easily be modified for your own sequencing data.

### scp command to transfer files

> scp [OPTION] [user@]SRC_HOST:file1 [user@]DEST_HOST:file2
> scp sperezcastro@luna:/workspace/sperezcastro/Guaymas/UBA2244.faa /Users/sherlynette/Downloads

## Raw data processing

Quality check is the first step: summary statistics are calculated for the metagenomes, including the number of read sequences, minimum and maximum read length, quality scores of the fastq sequences, number of duplicate sequences, and ambiguous bases (N's).

### Step 1a: Calculate pre-stats

#### FastQC

> module load fastqc
> fastqc 7E3_R1.fastq 7E3_R2.fastq -o /workspace/microeco/READS_QC/pre-QC_reports

#### Prinseq

> module load prinseq-lite/0.20.4
> prinseq-lite.pl -verbose -fastq 7E3_R1.fastq -fastq2 7E3_R2.fastq -stats_all > /workspace/microeco/READS_QC/pre-QC_reports/7E_stats.txt

http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1

### Step 1b: Filtering and trimming

Settings for good-quality sequences (these settings can be changed according to the desired stringency):

- **Phred quality score**
    - Minimum mean quality: 20 (-min_qual_mean 20)
    - 5' and 3' base quality threshold: 20 (-trim_qual_left 20 -trim_qual_right 20)
- **Read length**
    - Minimum read length: 60 bp (-min_len 60)
- **Uncalled bases (Ns)**
    - Uncalled bases allowed per read: 0 (-ns_max_n 0)
- **Duplicates**
    - Removal of all forms of duplicates (-derep 12345)

#### PRINSEQ

> prinseq-lite.pl -fastq 7E3_R1.fastq -fastq2 7E3_R2.fastq -min_len 60 -min_qual_mean 20 -ns_max_n 0 -derep 12345 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_type min -trim_qual_rule lt -out_format 3 -out_good /workspace/microeco/GOOD_READS/7E3 -out_bad null -verbose

#### Trimmomatic

> module load trimmomatic

##### Elena's code

> trimmomatic PE -phred33 7E3_R1.fastq 7E3_R2.fastq 7E3_R1_paired.fastq 7E3_R1_unpaired.fastq 7E3_R2_paired.fastq 7E3_R2_unpaired.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

(Trimmomatic PE expects the two input files followed by four output files, in the order paired forward, unpaired forward, paired reverse, unpaired reverse.)

#### Other: Trim Galore!

> trim_galore

#### Run post-FastQC

> fastqc 7E3_1.fastq 7E3_2.fastq -o /workspace/microeco/READS_QC/post-QC_reports

Inspect the pre-FastQC and post-FastQC reports.

#### Concatenate forward and reverse clean reads for co-assembly:

> cat *_1.fastq > concatenated_1.fastq
> cat *_2.fastq > concatenated_2.fastq

## Assembly

### SPAdes

> clusterize -n 12 -log SPAdes_assembly spades.py -1 concatenated_1.fastq -2 concatenated_2.fastq -o SPAdes_Assembly --meta

#### Command to filter out contigs <1000 bp

> anvi-script-reformat-fasta contigs.fasta -o contigs-fixed.fa -l 1000 --simplify-names

### MEGAHIT

> module load megahit
> megahit -1 7E3_1.fastq -2 7E3_2.fastq -o MEGAHIT_assembly --min-contig-len 1000

## Assembly stats

### MetaQUAST

> clusterize -n 12 -log MetaQUAST_SPAdes_fixed.log metaquast.py contigs-fixed.fa -o metaQUAST --max-ref-number 0

**At this point you can use Anvi'o to process your contigs, visualize, identify and/or refine genome bins**

## Binning

### metaWRAP

metaWRAP offers a hybrid approach for extracting draft metagenomes (bins, MAGs) by using metaBAT2, CONCOCT, and MaxBin2.

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap binning -o <output_folder> -t <num_threads> -a <final_assembly> --<binning_program> --<binning_program> --<binning_program> <clean_reads> --run-checkm`
> `clusterize -n 12 -log metaWRAP_binning.log -m sperezcastro@mbl.edu metawrap binning -o INITIAL_BINNING -t 12 -a contigs-fixed.fa --metabat2 --maxbin2 --concoct GOOD_READS/*fastq --run-checkm`

Inspect the bin folders to see how many bins each algorithm produced.

#### Bin_refinement

metaWRAP consolidates all 3 bin sets by picking the best version of each bin based on completion and redundancy (contamination) thresholds.

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap bin_refinement -o <output_folder> -t <num_threads> -A <bininput_A> -B <bininput_B> -C <bininput_C> -c <min_completion> -x <max_redundancy>`
> `clusterize -n 12 -log refinement.log -m mdesmarais@mbl.edu metawrap bin_refinement -o BIN_REFINEMENT -t 12 -A INITIAL_BINNING/metabat2_bins/ -B INITIAL_BINNING/maxbin2_bins/ -C INITIAL_BINNING/concoct_bins/ -c 50 -x 10`

Inspect the number of bins meeting the completion and redundancy settings for each of the 3 algorithms and for the metaWRAP consolidated bins:

> `cat BIN_REFINEMENT/<algorithm>_bins.stats | awk '$2><min_completion> && $3<<max_redundancy>' | wc -l`
> `cat BIN_REFINEMENT/concoct_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/maxbin2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metabat2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metawrap_bins.stats | awk '$2>50 && $3<10' | wc -l`

*A note on completion and contamination:*

*The -c and -x arguments can be set to whichever completion and contamination levels are desired. The completion and contamination standards for high-quality MAGs are >90% and <5%, respectively.*

#### Reassemble_bins

Reassemble_bins collects the reads belonging to each bin and reassembles them with a "permissive" and a "strict" algorithm.
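The `awk` one-liners used to count qualifying bins can be tried on toy data first. This is a minimal sketch: the bin names and values below are made up, and the toy file omits the header line that real metaWRAP `*_bins.stats` files carry; only the first three columns (bin, completion, contamination) matter for the filter.

```shell
# Create a made-up stats file: bin name, completion %, contamination %.
cat > toy_bins.stats <<'EOF'
bin.1	95.2	1.3
bin.2	63.0	4.1
bin.3	48.7	12.9
EOF

# Keep bins with >50% completion and <10% redundancy, then count them.
awk '$2>50 && $3<10' toy_bins.stats | wc -l   # 2 of the 3 toy bins pass
```

Tightening the thresholds to the high-quality standard (`$2>90 && $3<5`) would keep only bin.1 here.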
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap reassemble_bins -o <output_folder> -1 <clean_forward_reads> -2 <clean_reverse_reads> -t <num_threads> -c <min_completion> -x <max_redundancy> -b <metawrap_bins>`
> `clusterize -n 12 -log reassembly.log -m mdesmarais@mbl.edu metawrap reassemble_bins -o BIN_REASSEMBLY -1 CLEAN_READS/ALL_READS_1.fastq -2 CLEAN_READS/ALL_READS_2.fastq -t 12 -c 50 -x 10 -b BIN_REFINEMENT/metawrap_bins`

Inspect the number of reassembled bins meeting the completion and redundancy settings:

> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>90 && $3<5' | wc -l`

## Bins QC

### CheckM

https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands

> module load python/2.7.15-201901021238

### Other metaWRAP tools

#### Quant_bins

Calculate the abundance of each contig in each sample, and the bin abundance, with Salmon. (**Optional**: read abundance recruited to contigs and bins will be assessed downstream with anvi'o, and Quant_bins results cannot be imported as a layer into anvi'o. Quant_bins is still a great tool to visualize contig, sample, and bin abundance. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Blobology.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap quant_bins -b <metawrap_bins> -o <output_folder> -a <final_assembly> <clean_reads>`
> `clusterize -n 12 -log quant.log -m mdesmarais@mbl.edu metawrap quant_bins -b BIN_REFINEMENT/metawrap_bins -o QUANT_BINS -a ASSEMBLY/final_assembly.fasta CLEAN_READS/*fastq`

#### Classify_bins

Determine the taxonomy of each contig and bin set with Taxator-tk. (**Optional**: bin taxonomy will be assessed downstream with anvi'o, and Classify_bins results cannot be imported as a layer into anvi'o. Classify_bins/PROKKA is still a great tool to accurately assign gene functions and other ORFs to contigs and bin sets. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Blobology.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap classify_bins -b <reassembled_bins> -o <output_folder> -t <num_threads>`
> `clusterize -n 4 -log tax.log -m mdesmarais@mbl.edu metawrap classify_bins -b BIN_REASSEMBLY/reassembled_bins -o BIN_CLASSIFICATION -t 4`

#### Annotate_bins

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap annotate_bins -o <output_folder> -t <num_threads> -b <reassembled_bins>`
> `clusterize -n 4 -log annotation.log -m mdesmarais@mbl.edu metawrap annotate_bins -o FUNCT_ANNOT -t 4 -b BIN_REASSEMBLY/reassembled_bins/`

## Anvi'o

### anvi-gen-contigs-database

Generate the Anvi'o contigs database.

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-gen-contigs-database -f <final_assembly> -o contigs.db -n 'An example contigs database'`
> `clusterize -n 4 -log database.log -m mdesmarais@mbl.edu anvi-gen-contigs-database -f ASSEMBLY/final_assembly.fasta -o contigs.db -n 'EF_THE_shotgun'`

### anvi-run-hmms

Search sequence databases for sequence homologs with HMMER.

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-hmms -c contigs.db`
> `clusterize -n 4 -log hmms.log -m mdesmarais@mbl.edu anvi-run-hmms -c contigs.db --num-threads 4`

### anvi-run-ncbi-cogs

Annotate the gene calls with functions from NCBI's Clusters of Orthologous Groups (COGs).

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads <num_threads>`
> `clusterize -n 4 -log cogs.log -m mdesmarais@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads 4`

### Centrifuge

Assign taxonomy to the gene calls with Centrifuge according to this [article](http://merenlab.org/2016/06/18/importing-taxonomy/#centrifuge) from Anvi'o. This will help you with the manual removal of contaminants in the anvi-refine step.
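The Centrifuge steps operate on a plain FASTA of gene calls, so a quick sanity check at any point is simply counting header lines. A minimal sketch on a toy file (`toy.fa` is a made-up stand-in for `gene-calls.fa`):

```shell
# Create a tiny made-up FASTA standing in for gene-calls.fa.
cat > toy.fa <<'EOF'
>gene_0
ATGAAACGCATT
>gene_1
ATGTTTGGC
EOF

# Count sequences by counting '>' header lines.
grep -c '^>' toy.fa   # prints 2
```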
Export all gene calls from contigs.db:

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`
> `clusterize -n 40 -log getseq.log -m mdesmarais@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`

Run Centrifuge with the blastdb database:

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`
> `clusterize -n 40 -log centrifuge.log -m mdesmarais@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`

Check that the output files are not empty on the server:

> `wc -l centrifuge_report.tsv centrifuge_hits.tsv`

Import the Centrifuge results into the contigs database:

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge`
> `clusterize -n 40 -log cimport.log -m mdesmarais@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i Centrifuge/centrifuge_report.tsv Centrifuge/centrifuge_hits.tsv -p centrifuge`

### anvi-display-contigs-stats

Inspect the contigs database statistics.

SSH into the server (not a cluster) of choice with port 8008:

> `ssh -L 8008:localhost:8008 <username>@<server>`
> `ssh -L 8008:localhost:8008 mdesmarais@minnie`

Generate the statistics:

> `anvi-display-contigs-stats contigs.db --server-only -P 8008`

Listen to port 8008 in an internet browser (Chrome): http://localhost:8008

Double-check that the assembly statistics agree with the previous QUAST report of the assembly in the metaWRAP module.

### anvi-init-bam

Sort and index the .bam files generated by the mapping software, found in INITIAL_BINNING/work_files.

Create a text file (SAMPLE_IDs.txt) containing the sample names in the project directory using GNU nano or another text editor (no extension names, commas, or spaces).
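Instead of typing SAMPLE_IDs.txt by hand, it can be derived from the .bam file names. This is a sketch on a hypothetical directory with made-up sample names; in the real workflow the glob would point at INITIAL_BINNING/work_files/*.bam.

```shell
# Make a demo directory with empty stand-in .bam files.
mkdir -p demo_work_files
touch demo_work_files/7E3.bam demo_work_files/7E4.bam

# Strip the directory and the .bam extension to get one sample ID per line.
for f in demo_work_files/*.bam; do
    basename "$f" .bam
done > SAMPLE_IDs.txt

cat SAMPLE_IDs.txt   # 7E3 and 7E4, one per line
```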
![](https://i.imgur.com/I4Do9ZS.png)

Create a bash script `anvi-init.sh` to automate anvi-init-bam with SAMPLE_IDs.txt:

> `nano ./anvi-init.sh`

Input script (change the number of threads based on the cluster used):

```
#!/bin/bash
for sample in `cat SAMPLE_IDs.txt`
do
    clusterize -n <num_threads> -log init_${sample}.log -m <username>@mbl.edu anvi-init-bam INITIAL_BINNING/work_files/$sample.bam -o INITIAL_BINNING/work_files/$sample-init.bam
done
```

Make the script executable and run it:

> `chmod +x anvi-init.sh`
> `./anvi-init.sh`

### anvi-profile

Create a database with sample-specific information about the contigs. It will calculate the mean coverage, standard deviation of coverage, and average coverage for the inner quartiles (Q1 and Q3) for each contig. It will also characterize SNVs.

Create the PROFILES directory:

> `mkdir PROFILES`

Create a bash script `anvi-profile.sh` to automate anvi-profile with SAMPLE_IDs.txt, which was created in the previous step:

> `nano ./anvi-profile.sh`

Input script (change the number of threads based on the cluster used):

```
#!/bin/bash
for sample in `cat SAMPLE_IDs.txt`
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log profile_${sample}.log anvi-profile -i INITIAL_BINNING/work_files/${sample}-init.bam -c contigs.db --min-contig-length 1000 --output-dir PROFILES/${sample} --sample-name ${sample} --num-threads <num_threads>
done
```

Make the script executable and run it:

> `chmod +x anvi-profile.sh`
> `./anvi-profile.sh`

### anvi-merge

Merge the Anvi'o profiles, skipping CONCOCT binning since our own metaWRAP refined bin collection will be imported in the next step. The --enforce-hierarchical-clustering option can be removed for faster processing. Consult the [anvi'o tutorial](http://merenlab.org/2016/06/22/anvio-tutorial-v2/#anvi-merge) for more information.
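Before merging, it is worth verifying that every sample listed in SAMPLE_IDs.txt actually produced a profile database, since a failed profiling job would otherwise surface only as a confusing merge error. A minimal sketch on hypothetical demo paths and sample names (one profile deliberately missing):

```shell
# Demo layout: 7E3 has a profile database, 7E4 does not.
mkdir -p PROFILES_demo/7E3 PROFILES_demo/7E4
touch PROFILES_demo/7E3/PROFILE.db
printf '7E3\n7E4\n' > samples_demo.txt

# Report any sample without a PROFILE.db.
missing=0
while read -r sample; do
    if [ ! -f "PROFILES_demo/${sample}/PROFILE.db" ]; then
        echo "missing profile for ${sample}"
        missing=$((missing + 1))
    fi
done < samples_demo.txt

echo "missing: ${missing}"   # prints "missing: 1"
```

In the real workflow the loop would read SAMPLE_IDs.txt and check PROFILES/${sample}/PROFILE.db.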
> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log merge.log anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`

### anvi-summarize

Summarize your bin collection in terms of completion and redundancy with anvi'o to assess which bins need to be refined. Alternatively, the CheckM results of the Bin_refinement module of metaWRAP can be used to identify bins to refine, but it is advisable to use anvi-summarize for both the pre-QC and post-QC of anvi-refine, since quality metrics can vary slightly between anvi'o CheckM results and metaWRAP CheckM results.

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`

The --quick-summary flag can be removed if a detailed summary of the collection is needed.

### anvi-summarize

Summarize once again to see how the quality (completion and redundancy) of your bins has improved after refining.

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`

The --quick-summary flag can be removed if a detailed summary of the collection is needed.

### anvi-interactive (optional)

If you want, you can now visualize all bins (including curated and unrefined bins) with the interactive interface of anvi'o on a server.
SSH into the server (not a cluster) of choice with port 8008:

> `ssh -L 8008:localhost:8008 <username>@<server>`
> `ssh -L 8008:localhost:8008 mdesmarais@minnie`

Start the interactive interface:

> `anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C METAWRAP --server-only -P 8008`

Listen to port 8008 in an internet browser (Chrome): http://localhost:8008

## Prodigal

> clusterize -n 5 -m sperezcastro@mbl.edu -log prodigal_UBA2244.log prodigal -i Candidatus_Marinimicrobia_bacterium_UBA2242.fna -o UBA2244.genes -a UBA2244.faa -p meta

```
nano ./prodigalloop.sh
```

```
#!/bin/bash
for sample in `cat highqmags.txt`
do
    clusterize -n 5 -m sperezcastro@mbl.edu -log prodigal${sample}.log prodigal -i HIGHQ_MAGS/${sample}.fa -o PRODIGAL/GENES/${sample}.genes -a PRODIGAL/FAA/${sample}.faa -p meta
done
```

```
chmod +x prodigalloop.sh
./prodigalloop.sh
```

## PSORTb

https://psort.org/psortb/index.html

> module purge
> module load psortb/3.0.6
> psort [-p|-n|-a] mysequences.txt > mysequences.out
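After the Prodigal loop, a quick way to compare gene yield across MAGs is to tally the protein entries in each .faa file. This is a sketch on made-up toy files; in the real workflow the glob would point at PRODIGAL/FAA/*.faa.

```shell
# Build two tiny made-up .faa files standing in for Prodigal output.
mkdir -p FAA_demo
printf '>g1\nMKR\n>g2\nMDE\n' > FAA_demo/mag_A.faa
printf '>g1\nMST\n' > FAA_demo/mag_B.faa

# Print "<MAG name><TAB><protein count>" for each file.
for f in FAA_demo/*.faa; do
    printf '%s\t%s\n' "$(basename "$f" .faa)" "$(grep -c '^>' "$f")"
done
```

For these toy files the loop reports mag_A with 2 proteins and mag_B with 1.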