# Workflow to reconstruct metagenome-assembled genomes (MAGs) from microbiomes
Sherlynette Pérez Castro
**Description**
Pipeline to reconstruct population genomes from microbiomes, including quality check, assembly, and binning. This pipeline is tailored to the servers and clusters of the Josephine Bay Paul Center computing facility at the Marine Biological Laboratory.
## Raw data processing
Quality checking is the first step: summary statistics are calculated for each metagenome, including the number of reads, minimum and maximum read length, quality scores for fastq sequences, the number of duplicate sequences, and the count of ambiguous bases (N's).
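These statistics can also be spot-checked directly on the command line before running the full reports. A minimal awk sketch, run here on a tiny made-up FASTQ (a real run would point at 7E3_R1.fastq):

```shell
# Create a tiny example FASTQ (two records) purely for illustration
cat > sample_R1.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTNACG
+
IIIIIIII
EOF

# In a standard 4-line-per-record FASTQ, line 2 of each record is the sequence;
# count reads, min/max read length, and reads containing ambiguous bases (N)
awk 'NR % 4 == 2 {
       n++
       len = length($0)
       if (min == "" || len < min) min = len
       if (len > max) max = len
       if ($0 ~ /N/) ns++
     }
     END { printf "reads=%d min_len=%d max_len=%d reads_with_N=%d\n", n, min, max, ns }' sample_R1.fastq
```

This is only a sanity check; FastQC and PRINSEQ below compute the full set of metrics (per-base quality, duplicates, etc.).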
### Step 1a: Calculate pre-stats
#### Fastqc
> module load fastqc
> fastqc 7E3_R1.fastq 7E3_R2.fastq -o /workspace/microeco/READS_QC/pre-QC_reports
#### Prinseq
> module load prinseq-lite/0.20.4
> prinseq-lite.pl -verbose -fastq 7E3_R1.fastq -fastq2 7E3_R2.fastq -stats_all > /workspace/microeco/READS_QC/pre-QC_reports/7E_stats.txt
http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1
### Step 1b: Filtering and trimming
Settings for good quality sequences (these settings can be changed according to desired stringency):
- **Phred quality score**
- Minimum mean quality: 20 (-min_qual_mean 20)
- 5' and 3' bases quality threshold: 20 (-trim_qual_left 20 -trim_qual_right 20)
- **Read length**
- Minimum read length: 75 bp (-min_len 75)
- **Uncalled bases (Ns)**
- Uncalled bases allowed per read: 0 (-ns_max_n 0)
- **Duplicates**
- Removal of all forms of duplicates (-derep 12345)
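Phred scores relate to error probability as Q = -10*log10(P), so Q20 means a 1% chance the base call is wrong. In Phred+33 FASTQ encoding, each quality character's ASCII code minus 33 gives Q. A quick sketch of the mean-quality calculation that `-min_qual_mean` applies, using a made-up quality string:

```shell
# Hypothetical quality string: 'I' encodes Q40, '5' encodes Q20 (Phred+33)
qual='IIIII55555'

# od dumps each byte as an unsigned decimal; subtract the +33 offset to get Q
mean=$(printf '%s' "$qual" \
  | od -An -tu1 \
  | awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } } END { print sum / n }')
echo "mean quality: $mean"   # mean quality: 30
```

A read like this passes `-min_qual_mean 20` (mean Q30) even though its trailing bases sit right at the Q20 trimming threshold, which is why the per-end `-trim_qual_left/right` settings are applied as well.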
#### PRINSEQ
> prinseq-lite.pl -fastq 7E3_R1.fastq -fastq2 7E3_R2.fastq -min_len 75 -min_qual_mean 20 -ns_max_n 0 -derep 12345 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_type min -trim_qual_rule lt -out_format 3 -out_good /workspace/microeco/GOOD_READS/7E3 -out_bad null -verbose
#### Trimmomatic
> module load trimmomatic
##### Elena's code
> trimmomatic PE -phred33 7E3_R1.fastq 7E3_R2.fastq 7E3_R1_paired.fastq 7E3_R1_unpaired.fastq 7E3_R2_paired.fastq 7E3_R2_unpaired.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75
#### Run post-FastQC
> fastqc 7E3_1.fastq 7E3_2.fastq -o /workspace/microeco/READS_QC/post-QC_reports
Inspect pre-FastQC and post-FastQC reports.
#### Concatenate forward and reverse clean reads for co-assembly:
> cat *_1.fastq > concatenated_1.fastq
> cat *_2.fastq > concatenated_2.fastq
## Assembly
### SPAdes
> clusterize -n 12 -log SPAdes_assembly spades.py -1 concatenated_1.fastq -2 concatenated_2.fastq -o SPAdes_Assembly --meta
#### Command to filter contigs <1000 bp
> anvi-script-reformat-fasta contigs.fasta -o contigs-fixed.fa -l 1000 --simplify-names
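The length filter itself is simple to reason about: drop every FASTA record whose sequence is shorter than the cutoff. An awk sketch of that logic (the demo file and the 10 bp cutoff are made up for illustration; the real command above uses 1,000 bp and also simplifies deflines):

```shell
# Tiny example FASTA: one short and one long contig
cat > contigs_demo.fasta <<'EOF'
>contig_1
ACGT
>contig_2
AAAAAAAAAACCCCCCCCCC
EOF

# Keep only records whose (possibly multi-line) sequence is at least MIN bp
awk -v MIN=10 '
  /^>/ { if (seq != "" && length(seq) >= MIN) print hdr "\n" seq
         hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "" && length(seq) >= MIN) print hdr "\n" seq }
' contigs_demo.fasta > contigs_demo-filtered.fa
```

Only contig_2 (20 bp) survives the 10 bp demo cutoff. Prefer anvi-script-reformat-fasta for real data, since anvi'o also needs the simplified names later.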
### MEGAHIT
> module load megahit
> megahit -1 7E3_1.fastq -2 7E3_2.fastq -o MEGAHIT_assembly --min-contig-len 1000
## Assembly stats
### MetaQUAST
> clusterize -n 12 -log MetaQUAST_SPAdes_fixed.log metaquast.py contigs-fixed.fa -o metaQUAST --max-ref-number 0
**At this point you can use Anvi'o to process your contigs, visualize, identify and/or refine genome bins**
## Binning
### metaWRAP
MetaWRAP offers a hybrid approach for extracting draft metagenomes (bins, MAGs) by combining metaBAT2, CONCOCT, and MaxBin2.
> module load metawrap
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap binning -o <output_folder> -t <num_threads> -a <final_assembly> --<binning_program> --<binning_program> --<binning_program> <clean_reads> --run-checkm`
> clusterize -n 12 -log metaWRAP_binning.log -m sperezcastro@mbl.edu metawrap binning -o INITIAL_BINNING -t 12 -a contigs-fixed.fa --metabat2 --maxbin2 --concoct GOOD_READS/*fastq --run-checkm
Inspect the bin folders to see how many bins each algorithm produced.
#### Bin_refinement
MetaWRAP consolidates all 3 bin sets by picking the best version of each bin based on completion and redundancy (contamination) thresholds.
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap bin_refinement -o <output_folder> -t <num_threads> -A <bininput_A> -B <bininput_B> -C <bininput_C> -c <min_completion> -x <max_redundancy>`
> `clusterize -n 12 -log refinement.log -m mdesmarais@mbl.edu metawrap bin_refinement -o BIN_REFINEMENT -t 12 -A INITIAL_BINNING/metabat2_bins/ -B INITIAL_BINNING/maxbin2_bins/ -C INITIAL_BINNING/concoct_bins/ -c 50 -x 10`
Inspect the number of bins meeting the completion and redundancy settings for each of the 3 algorithms and for the metaWRAP consolidated bins:
> `cat BIN_REFINEMENT/<algorithm_bins.stats> | awk '$2><min_completion> && $3<<max_redundancy>' | wc -l`
> `cat BIN_REFINEMENT/concoct_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/maxbin2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metabat2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metawrap_bins.stats | awk '$2>50 && $3<10' | wc -l`
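The four checks above can be collapsed into one loop over the .stats tables. A sketch using a made-up demo table (columns 2 and 3 are completion % and redundancy %, as in the metaWRAP output):

```shell
# Hypothetical demo table (bin, completion %, redundancy %); a real run would
# already have metabat2/maxbin2/concoct/metawrap .stats files in BIN_REFINEMENT
mkdir -p BIN_REFINEMENT
printf 'bin.1\t95.2\t2.1\nbin.2\t40.0\t1.0\nbin.3\t60.5\t20.3\n' > BIN_REFINEMENT/demo_bins.stats

# Count bins passing completion > 50 and redundancy < 10 in every .stats table
for stats in BIN_REFINEMENT/*_bins.stats; do
  n=$(awk '$2 > 50 && $3 < 10 { c++ } END { print c + 0 }' "$stats")
  echo "$stats: $n bins"
done
```

Here only bin.1 passes (95.2% complete, 2.1% redundant); bin.2 fails on completion and bin.3 on redundancy.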
*A note on completion and contamination:*
*The -c and -x arguments can be modified to whichever completion and contamination levels are desired. The completion and contamination standards for high-quality MAGs are >90% and <5%, respectively.*
#### Reassemble_bins
Collects the reads belonging to each bin and reassembles them with both a "permissive" and a "strict" algorithm.
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap reassemble_bins -o <output_folder> -1 <clean_forward_reads> -2 <clean_reverse_reads> -t <num_threads> -c <min_completion> -x <max_redundancy> -b <metawrap_bins>`
> `clusterize -n 12 -log reassembly.log -m mdesmarais@mbl.edu metawrap reassemble_bins -o BIN_REASSEMBLY -1 CLEAN_READS/ALL_READS_1.fastq -2 CLEAN_READS/ALL_READS_2.fastq -t 12 -c 50 -x 10 -b BIN_REFINEMENT/metawrap_bins`
Inspect the number of reassembled bins meeting the completion and redundancy settings:
> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>90 && $3<5' | wc -l`
## Bins QC
### CheckM
https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands
> module load python/2.7.15-201901021238
### Other metaWRAP tools
#### Quant_bin
Calculate the abundance of each contig in each sample, and the abundance of each bin, with the Salmon tool.
(**Optional**: read abundance recruited to contigs and bins will be assessed downstream with anvi'o, and Quant_bins results cannot be imported as a layer into anvi'o. Quant_bins is still a great tool to visualize contig, sample, and bin abundance. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Blobology.)
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap quant_bins -b <metawrap_bins> -o <output_folder> -a <final_assembly> <clean_reads>`
> `clusterize -n 12 -log quant.log -m mdesmarais@mbl.edu metawrap quant_bins -b BIN_REFINEMENT/metawrap_bins -o QUANT_BINS -a ASSEMBLY/final_assembly.fasta CLEAN_READS/*fastq`
#### Classify_bins
Determine the taxonomy of each contig and bin set with Taxator-tk.
(**Optional**: bin taxonomy will be assessed downstream with anvi'o, and Classify_bins results cannot be imported as a layer into anvi'o. Classify_bins/PROKKA is still a great tool to accurately assign gene functions and other ORFs to contigs and bin sets. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Blobology.)
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap classify_bins -b <reassembled_bins> -o <output_folder> -t <num_threads>`
> `clusterize -n 4 -log tax.log -m mdesmarais@mbl.edu metawrap classify_bins -b BIN_REASSEMBLY/reassembled_bins -o BIN_CLASSIFICATION -t 4`
#### Annotate_bins
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap annotate_bins -o <output_folder> -t <num_threads> -b <reassembled_bins>`
> `clusterize -n 4 -log annotation.log -m mdesmarais@mbl.edu metawrap annotate_bins -o FUNCT_ANNOT -t 4 -b BIN_REASSEMBLY/reassembled_bins/`
## Anvi'o
Re-format your input FASTA (use `-l 0` to keep all contigs, or `-l 1000` to drop contigs <1,000 bp):
> `anvi-script-reformat-fasta transcripts.fasta -o transcripts-fixed.fa -l 0 --simplify-names`
> `anvi-script-reformat-fasta transcripts.fasta -o transcripts-fixed.fa -l 1000 --simplify-names`

Start a screen session, generate the contigs database, and run the HMMs:
> `screen -S anvio`
> `anvi-gen-contigs-database -f transcripts-fixed.fa -o transcripts.db -n 'Sticky_Roots'`
> `anvi-run-hmms -c transcripts.db`

SSH into a server (not a cluster) of choice, forwarding port 8008:
> `ssh -L 8008:localhost:8008 sperezcastro@minnie.jbpc-np.mbl.edu`

Generate statistics:
> `anvi-display-contigs-stats transcripts.db --server-only -P 8008`

Annotate gene calls (NCBI COGs, eggNOG-mapper, InterProScan):
> `anvi-run-ncbi-cogs -c transcripts.db --num-threads 4`
> `anvi-get-sequences-for-gene-calls -c transcripts.db --get-aa-sequences -o amino-acid-sequences.fa`
> `anvi-script-run-eggnog-mapper -c transcripts.db --annotation amino-acid-sequences.fa.emapper.annotations --use-version 1.0.3`
> `./interproscan.sh -i amino-acid-sequences.fa -f tsv -o interpro-output.tsv`
> `anvi-import-functions -c contigs.db -i interpro-output.tsv -p interproscan`
### anvi-gen-contigs-database
Generate Anvi'o contigs database
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-gen-contigs-database -f <final_assembly> -o contigs.db -n 'An example contigs database'`
> `clusterize -n 4 -log database.log -m mdesmarais@mbl.edu anvi-gen-contigs-database -f ASSEMBLY/final_assembly.fasta -o contigs.db -n 'EF_THE_shotgun'`
### anvi-run-hmms
Search sequence databases for sequence homologs with HMMER
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-hmms -c contigs.db`
> `clusterize -n 4 -log hmms.log -m mdesmarais@mbl.edu anvi-run-hmms -c contigs.db --num-threads 4`
### anvi-run-ncbi-cogs
Annotate the genes found with HMMER with functions from NCBI's Clusters of Orthologous Groups (COGs).
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads <num_threads>`
> `clusterize -n 4 -log cogs.log -m mdesmarais@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads 4`
### Centrifuge
Annotate genes found with HMMER with Centrifuge for taxonomy, following this [article](http://merenlab.org/2016/06/18/importing-taxonomy/#centrifuge) by the Anvi'o team. This will help with manually removing contaminants during the anvi-refine step.
Export all gene calls from contigs.db
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`
> `clusterize -n 40 -log getseq.log -m mdesmarais@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`
Run Centrifuge with blastdb database
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`
> `clusterize -n 40 -log centrifuge.log -m mdesmarais@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`
Check that output files are not empty on server
> `wc -l centrifuge_report.tsv centrifuge_hits.tsv`
Import Centrifuge results into contigs database
> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge`
> `clusterize -n 40 -log cimport.log -m mdesmarais@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i Centrifuge/centrifuge_report.tsv Centrifuge/centrifuge_hits.tsv -p centrifuge`
### anvi-display-contigs-stats
Inspect contigs database statistics.
SSH into a server (not a cluster) of choice, forwarding port 8008
> `ssh -L 8008:localhost:8008 <username>@<server>`
> `ssh -L 8008:localhost:8008 mdesmarais@minnie`
Generate statistics
> `anvi-display-contigs-stats contigs.db --server-only -P 8008`
Open http://localhost:8008 in an internet browser (e.g., Chrome).
Double check that assembly statistics agree with previous QUAST report of assembly in metaWRAP module.
### anvi-init-bam
Sort and index the .bam files generated by the mapping software, found in INITIAL_BINNING/work_files.
Create a text file (SAMPLE_IDs.txt) containing the sample names in the project directory using GNU nano or another text editor (one name per line; no extensions, commas, or spaces).
![](https://i.imgur.com/I4Do9ZS.png)
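SAMPLE_IDs.txt can also be generated straight from the BAM filenames instead of typing it by hand. A sketch (the .bam files created here are placeholders standing in for the real mapping output):

```shell
# Hypothetical BAM files standing in for real mapping output
mkdir -p INITIAL_BINNING/work_files
touch INITIAL_BINNING/work_files/7E3.bam INITIAL_BINNING/work_files/7E4.bam

# One sample name per line: strip the directory and the .bam extension
for bam in INITIAL_BINNING/work_files/*.bam; do
  basename "$bam" .bam
done > SAMPLE_IDs.txt

cat SAMPLE_IDs.txt
```

This guarantees the names in SAMPLE_IDs.txt match the .bam basenames exactly, which the loops below depend on.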
Create a bash script `anvi-init.sh` to automate anvi-init-bam over SAMPLE_IDs.txt
> `nano ./anvi-init.sh`
Input script (change the number of threads based on the cluster used):
```
#!/bin/bash
for sample in `cat SAMPLE_IDs.txt`
do
clusterize -n <num_threads> -log init_${sample}.log -m <username>@mbl.edu anvi-init-bam INITIAL_BINNING/work_files/$sample.bam -o INITIAL_BINNING/work_files/$sample-init.bam
done
```
Make script executable and run it
> `chmod +x anvi-init.sh`
> `./anvi-init.sh`
### anvi-profile
Create a database with sample-specific information about contigs. It will calculate the mean coverage, the standard deviation of coverage, and the mean coverage of the inner quartiles (between Q1 and Q3) for each contig. It will also characterize SNVs.
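The inner-quartile mean is just the average over the middle 50% of per-position coverage values, which makes it robust to short coverage spikes. A sketch with made-up coverage values for one hypothetical contig:

```shell
# Per-position coverage values for one hypothetical contig (note two outliers)
printf '%s\n' 1 2 3 4 5 6 7 8 100 200 > cov.txt

# Sort numerically, drop the lowest and highest 25% of positions, average the rest
sort -n cov.txt | awk '
  { v[NR] = $1 }
  END {
    lo = int(NR / 4) + 1; hi = NR - int(NR / 4)
    for (i = lo; i <= hi; i++) { sum += v[i]; n++ }
    printf "interquartile mean = %.1f\n", sum / n
  }'
```

For these values the overall mean is 33.6 but the inner-quartile mean is 5.5; the two outlier positions barely affect the latter, which is why anvi'o reports it alongside the plain mean.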
Create PROFILES directory
> `mkdir PROFILES`
Create a bash script `anvi-profile.sh` to automate anvi-profile over SAMPLE_IDs.txt, which was created in the previous step
> `nano ./anvi-profile.sh`
Input script (change the number of threads based on the cluster used):
```
#!/bin/bash
for sample in `cat SAMPLE_IDs.txt`
do
clusterize -n <num_threads> -m <username>@mbl.edu -log profile_${sample}.log anvi-profile -i INITIAL_BINNING/work_files/${sample}-init.bam -c contigs.db --min-contig-length 1000 --output-dir PROFILES/${sample} --sample-name ${sample} --num-threads <num_threads>
done
```
Make script executable and run it
> `chmod +x anvi-profile.sh`
> `./anvi-profile.sh`
### anvi-merge
Merge the Anvi'o profiles, skipping CONCOCT binning since your own metaWRAP-refined bin collection will be imported in the next step. The --enforce-hierarchical-clustering option can be removed for faster processing. Consult the [anvi'o tutorial](http://merenlab.org/2016/06/22/anvio-tutorial-v2/#anvi-merge) for more information.
> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log merge.log anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`
### anvi-summarize
Summarize your bin collection in terms of completion and redundancy with anvi'o to assess which bins need to be refined. Alternatively, the CheckM results from the Bin_refinement module of metaWRAP can be used to identify bins to refine, but it is advisable to use anvi-summarize for both the pre-QC and post-QC of anvi-refine, since quality metrics can vary slightly between anvi'o's CheckM results and metaWRAP's CheckM results.
> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`
The --quick-summary flag can be removed if a detailed summary of the collection is needed.
### anvi-summarize (after refining)
Summarize once again, after refining problem bins with anvi-refine, to see how the quality (completion and redundancy) of your bins has improved.
> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`
> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`
The --quick-summary flag can be removed if a detailed summary of the collection is needed.
### anvi-interactive (optional)
If you want, you can now visualize all bins (curated and unrefined alike) with the anvi'o interactive interface on a server.
SSH into a server (not a cluster) of choice, forwarding port 8008
> `ssh -L 8008:localhost:8008 <username>@<server>`
> `ssh -L 8008:localhost:8008 mdesmarais@minnie`
Start the interactive interface
> `anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C METAWRAP --server-only -P 8008`
Open http://localhost:8008 in an internet browser (e.g., Chrome).
## Prodigal
> clusterize -n 5 -m sperezcastro@mbl.edu -log prodigal_UBA2244.log prodigal -i Candidatus_Marinimicrobia_bacterium_UBA2242.fna -o UBA2244.genes -a UBA2244.faa -p meta
```
nano ./prodigalloop.sh
```
```
#!/bin/bash
for sample in `cat highqmags.txt`
do
clusterize -n 5 -m sperezcastro@mbl.edu -log prodigal${sample}.log prodigal -i HIGHQ_MAGS/${sample}.fa -o PRODIGAL/GENES/${sample}.genes -a PRODIGAL/FAA/${sample}.faa -p meta
done
```
```
chmod +x prodigalloop.sh
./prodigalloop.sh
```
## PSORTb
https://psort.org/psortb/index.html
> module purge
> module load psortb/3.0.6
> psort [-p|-n|-a] mysequences.txt > mysequences.out