# Keck Facility Metagenomic Analysis Pipeline

**Miguel Desmarais**, W. M. Keck Ecological and Evolutionary Genetics Facility, Marine Biological Laboratory, Woods Hole, MA

**Description**

Metagenomic analysis pipeline for the recovery and optimization of high-quality metagenome-assembled genomes (MAGs) from shotgun metagenomic reads using [PRINSEQ](http://prinseq.sourceforge.net/), [metaWRAP](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1) and [Anvi'o](https://peerj.com/articles/1319/).

**Instruction**

This code is tailored to the servers and clusters of the [Josephine Bay Paul Center](https://www.mbl.edu/jbpc/) computing facility at the Marine Biological Laboratory and to the demultiplexed raw shotgun metagenomic sequencing data generated at the [Keck Facility](https://www.mbl.edu/jbpc/keck/). The code can easily be modified for your own work environment and sequencing data.

Replace the `<variable_name>` placeholders (username, run date, project directory, project name, email address and number of threads (-n)) to match the end user, the project and the Josephine Bay Paul Center server or cluster used. The `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu` part of each command can be removed if the command is run on a server or personal computer instead of a cluster. If following the exact code below, always work from /workspace/<username>/<project_directory>/ once the raw reads have been renamed. While anvi'o is already loaded in the work environment, metaWRAP has to be loaded with `module load metawrap` before running its commands. Once the pipeline is finished, the whole directory can be compressed and sent along with this HackMD for reference.

![](https://i.imgur.com/GC3sE0d.png)

## Raw data processing

### Raw reads

#### Create project working directory in /workspace/ under your username

> `mkdir /workspace/<username>/<project_directory>/`

#### Create symbolic link to the tarred and compressed project directory (<project_name>.tar.gz) containing **demultiplexed** raw reads in the Keck Facility sequencing directory

> `ln -s /xraid2-2/sequencing/Illumina/<rundate_ns>/<project_name>.tar.gz /workspace/<username>/<project_directory>/`

#### Untar the <project_name>.tar.gz directory

> `tar -zxvf /workspace/<username>/<project_directory>/*tar.gz`

#### Rename the untarred directory to RAW_READS

> `mv <project_name> RAW_READS`

#### Decompress the fastq files

> `gunzip RAW_READS/*.fastq.gz`

#### Rename files to match the metaWRAP raw reads format

Pre-renaming:

![](https://i.imgur.com/mcq6ZIW.png)

Post-renaming:

![](https://i.imgur.com/Sc3qukS.png)

Move the forward and reverse reads to separate folders /R1/ and /R2/ to rename them separately

> `mkdir RAW_READS/R1`
> `mkdir RAW_READS/R2`
> `mv RAW_READS/*R1_001.fastq RAW_READS/R1`
> `mv RAW_READS/*R2_001.fastq RAW_READS/R2`

While in RAW_READS/R1/, rename the forward reads (the 13 `?` characters strip the trailing 13-character `_R1_001.fastq` suffix before `_1.fastq` is appended)

> `for i in ?????????????*; do mv -i "$i" "${i%?????????????}"; done`
> `for i in *; do mv "$i" "$i"_1.fastq; done`

While in RAW_READS/R2/, rename the reverse reads

> `for i in ?????????????*; do mv -i "$i" "${i%?????????????}"; done`
> `for i in *; do mv "$i" "$i"_2.fastq; done`

Move the reads back to the original folder and delete the /R1/ and /R2/ folders

> `mv RAW_READS/R1/*_1.fastq RAW_READS/`
> `mv RAW_READS/R2/*_2.fastq RAW_READS/`
> `rm -r RAW_READS/R1`
> `rm -r RAW_READS/R2`
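If your files follow the standard Illumina `_R1_001.fastq` / `_R2_001.fastq` naming, an equivalent sketch with explicit suffixes (which also skips the /R1/ and /R2/ shuffle) may be easier to adapt; the suffixes are assumptions to adjust against your own file names:

```bash
#!/bin/bash
# Rename Illumina-style read files to the metaWRAP <sample>_1.fastq / <sample>_2.fastq
# format in place. Assumes files end in _R1_001.fastq / _R2_001.fastq; adjust if yours differ.
cd RAW_READS
for f in *_R1_001.fastq; do
    mv -i "$f" "${f%_R1_001.fastq}_1.fastq"
done
for f in *_R2_001.fastq; do
    mv -i "$f" "${f%_R2_001.fastq}_2.fastq"
done
```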
### PRINSEQ (for high quality filtering)

Quality filtering of raw reads with the [PRINSEQ software](http://prinseq.sourceforge.net/).

**Do PRINSEQ for "high quality" filtering OR Read_qc for "good quality" filtering.**

PRINSEQ has been implemented in this pipeline instead of the metaWRAP Read_qc module since it provides more flexible and stringent quality filtering. Alternatively, the metaWRAP [Read_qc module](https://hackmd.io/E-ZBxJj2Q6WLV2HEhxitfg?view#Read_qc-good-quality) can be used for quality filtering of raw reads to retain as many reads as possible, at the cost of lower overall clean-read quality. BMTagger (also found in the metaWRAP Read_qc module) can be used after PRINSEQ to remove host contaminant reads if necessary.

Note: Illumina adapters and barcodes have already been removed from the raw reads by Keck Facility staff. If Illumina adapters and barcodes are present in raw reads from other facilities, use software such as Cutadapt before quality filtering with PRINSEQ.

Settings for high quality filtering (these settings can be changed to the stringency desired by the principal investigator):

- **Phred quality score**
    - Minimum mean quality: 20 (-min_qual_mean 20)
    - 5' and 3' base quality threshold: 20 (-trim_qual_left 20 -trim_qual_right 20)
- **Read length**
    - Minimum read length: 60 bp (-min_len 60)
- **Uncalled bases (Ns)**
    - Uncalled bases allowed per read: 0 (-ns_max_n 0)
- **Duplicates**
    - Removal of all forms of duplicates (-derep 12345)

#### Setup

Create a text file (rawreadslist.txt) containing the sample names in the project directory using GNU nano or another text editor (no extension names, commas or spaces).

![](https://i.imgur.com/I4Do9ZS.png)

Create the needed directories

> `mkdir READ_QC`
> `mkdir READ_QC/pre-QC_reports`
> `mkdir READ_QC/post-QC_reports`
> `mkdir READ_QC/logs`
> `mkdir READ_QC/singletons`
> `mkdir CLEAN_READS`
> `mkdir CLEAN_READS/ALL_READS`

#### Run pre-FastQC

Create the script `preqc.sh`

> `nano ./preqc.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat rawreadslist.txt)
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log READ_QC/logs/preqc_${sample}.log fastqc RAW_READS/${sample}_1.fastq RAW_READS/${sample}_2.fastq --outdir READ_QC/pre-QC_reports
done
```

Make the script executable and run it

> `chmod +x preqc.sh`
> `./preqc.sh`

#### Run PRINSEQ

Load the PRINSEQ module

> `module load prinseq-lite/0.20.4`

Create `read_qc_loop.sh`

> `nano ./read_qc_loop.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat rawreadslist.txt)
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log READ_QC/logs/prinseq_${sample}.log prinseq-lite.pl -fastq RAW_READS/${sample}_1.fastq -fastq2 RAW_READS/${sample}_2.fastq -min_len 60 -min_qual_mean 20 -ns_max_n 0 -derep 12345 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_type min -trim_qual_rule lt -out_format 3 -out_good CLEAN_READS/${sample} -out_bad null -verbose
done
```

Make the script executable and run it

> `chmod +x read_qc_loop.sh`
> `./read_qc_loop.sh`

Move the singletons to a separate directory

> `mv CLEAN_READS/*singletons.fastq READ_QC/singletons`

#### Run post-FastQC

Create the script `postqc.sh`

> `nano ./postqc.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat rawreadslist.txt)
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log READ_QC/logs/postqc_${sample}.log fastqc CLEAN_READS/${sample}_1.fastq CLEAN_READS/${sample}_2.fastq --outdir READ_QC/post-QC_reports
done
```

Make the script executable and run it

> `chmod +x postqc.sh`
> `./postqc.sh`

Inspect the PRINSEQ logs to see how many reads were removed by each setting (duplicates, quality, short reads, etc.). You can also inspect the pre-FastQC and post-FastQC reports to confirm read quality.
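To quantify how many read pairs survived filtering, the raw and clean files can also be counted directly (a FASTQ record is four lines); a minimal sketch assuming the rawreadslist.txt naming and directory layout used above:

```bash
#!/bin/bash
# Report per-sample read pair counts before and after PRINSEQ filtering.
for sample in $(cat rawreadslist.txt); do
    raw=$(( $(wc -l < RAW_READS/${sample}_1.fastq) / 4 ))
    clean=$(( $(wc -l < CLEAN_READS/${sample}_1.fastq) / 4 ))
    echo -e "${sample}\traw pairs: ${raw}\tclean pairs: ${clean}"
done
```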
#### Concatenate forward and reverse clean reads for co-assembly:

> `cat CLEAN_READS/*_1.fastq > CLEAN_READS/ALL_READS/ALL_READS_1.fastq`
> `cat CLEAN_READS/*_2.fastq > CLEAN_READS/ALL_READS/ALL_READS_2.fastq`

### Read_qc (for good quality filtering)

Trim Galore (Cutadapt + FastQC) + BMTagger

**Do PRINSEQ for "high quality" filtering OR Read_qc for "good quality" filtering.**

Default metaWRAP settings for Trim Galore (can't be changed):

- Detection and removal of standard Illumina adapters ('AGATCGGAAGAGC')
- Phred quality score threshold: >=20
- Read length: >=20 bp
- Does not remove reads with uncalled bases
- Does not remove short reads beyond the 20 bp minimum
- Does not remove duplicates

BMTagger:

- Removes human reads automatically
- To skip: --skip-bmtagger
- To change the host genome: -x

Since the metaWRAP Read_qc module does not allow its default settings to be changed, PRINSEQ can be used if more stringent quality filtering, such as the removal of shorter reads, uncalled bases and duplicates, is needed. The resulting clean reads can easily be fed into the metaWRAP assembly module as long as they follow the metaWRAP naming format.

#### Setup

Create a text file (rawreadslist.txt) containing the sample names in the project directory using GNU nano or another text editor (no extension names, commas or spaces).

![](https://i.imgur.com/I4Do9ZS.png)

#### Create the script read_qc_loop.sh and the READ_QC directory

> `mkdir READ_QC`
> `nano ./read_qc_loop.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat rawreadslist.txt)
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log read_qc_${sample}.log metawrap read_qc -1 RAW_READS/${sample}_1.fastq -2 RAW_READS/${sample}_2.fastq -t <num_threads> -o READ_QC/${sample}
done
```

#### Make the script executable and run it

> `chmod +x read_qc_loop.sh`
> `./read_qc_loop.sh`

#### Inspect the QC reports, then move the trimmed/decontaminated good quality reads to the /CLEAN_READS/ folder and rename them after the sample names

> `mkdir CLEAN_READS`
> `mkdir CLEAN_READS/ALL_READS`

```
for i in READ_QC/*; do
    b=${i#*/}    # strip the leading READ_QC/ to recover the sample name
    mv ${i}/final_pure_reads_1.fastq CLEAN_READS/${b}_1.fastq
    mv ${i}/final_pure_reads_2.fastq CLEAN_READS/${b}_2.fastq
done
```

#### Concatenate forward and reverse clean reads for co-assembly:

> `cat CLEAN_READS/*_1.fastq > CLEAN_READS/ALL_READS/ALL_READS_1.fastq`
> `cat CLEAN_READS/*_2.fastq > CLEAN_READS/ALL_READS/ALL_READS_2.fastq`

## metaWRAP

Following the published [article](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1) and the GitHub [metaWRAP tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md), modified for the Keck Facility's demultiplexed raw NextSeq sequences.

metaWRAP is used in this pipeline for the power of its binning modules (Binning, Bin_refinement & Reassemble_bins) and the recovery of a high number of high-quality bins (MAGs). metaWRAP's modules are fully independent of each other, so steps such as read filtering, assembly or binning can be run with different software and the results fed into the next module. Make sure to check out the figures generated by the various metaWRAP modules to assess the efficiency of the assembly, binning, bin refinement and bin reassembly steps.
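Before running the Assembly module, it can be worth confirming that every forward clean-reads file has a reverse mate, since the metaWRAP modules expect the `_1.fastq`/`_2.fastq` pairing; a minimal check on the CLEAN_READS layout used above:

```bash
#!/bin/bash
# Flag any forward read file in CLEAN_READS that lacks a reverse mate.
for f in CLEAN_READS/*_1.fastq; do
    mate=${f%_1.fastq}_2.fastq
    [ -f "$mate" ] || echo "Missing mate for ${f}"
done
```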
### Assembly

Assemble the reads into contigs with --megahit or --metaspades and create an assembly report with QUAST

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap assembly -1 <all_forward_cleanreads> -2 <all_reverse_cleanreads> -m <memory> -t <num_threads> --<assembly_program> -o <output_folder>`

> `clusterize -n 12 -log megahit.log -m mdesmarais@mbl.edu metawrap assembly -1 CLEAN_READS/ALL_READS/ALL_READS_1.fastq -2 CLEAN_READS/ALL_READS/ALL_READS_2.fastq -m 1024 -t 12 --megahit -o ASSEMBLY`

Inspect the assembly_report.html file for assembly statistics. The metaWRAP assembly module automatically removes contigs shorter than 1000 bp.

While metaSPAdes yields the best assembly statistics, MEGAHIT is better at capturing micro-diversity. metaSPAdes is also more demanding in terms of computing power than MEGAHIT, and therefore usually takes longer to run. Picking between the two depends on your sample source and the aim of the study. See this [article](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169662) for more information. Other assembly algorithms can also be used with metaWRAP.

At this point, the Anvi'o part of the metagenomic pipeline ([anvi-gen-contigs-database](https://hackmd.io/E-ZBxJj2Q6WLV2HEhxitfg?view#anvi-gen-contigs-database)) can be run in parallel until the refined metaWRAP bins are needed for import into the profiles (anvi-import-collection).

### Kraken (optional)

Assess the taxonomic composition of the clean reads and assembly (contigs) with the Kraken module.

(**Optional**: taxonomic composition will be assessed downstream with anvi'o, and Kraken results cannot be imported as a layer into anvi'o. Kraken is still a great tool to visualize the taxonomic composition of the sample communities (clean reads), and to inspect which taxonomic groups assembled better than others (clean reads vs. contigs). See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Kraken.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap kraken -o <output_folder> -t <num_threads> -s <reads_subset> <clean_reads> <final_assembly>`

> `clusterize -n 16 -log kraken.log -m mdesmarais@mbl.edu metawrap kraken -o KRAKEN -t 16 -s 1000000 CLEAN_READS/*fastq ASSEMBLY/final_assembly.fasta`

### Binning

Initial binning with metabat2, maxbin2 & concoct (other algorithms can also be used)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap binning -o <output_folder> -t <num_threads> -a <final_assembly> --<binning_program> --<binning_program> --<binning_program> <clean_reads> --run-checkm`

> `clusterize -n 12 -log binning.log -m mdesmarais@mbl.edu metawrap binning -o INITIAL_BINNING -t 12 -a ASSEMBLY/final_assembly.fasta --metabat2 --maxbin2 --concoct CLEAN_READS/*fastq --run-checkm`

Inspect the bin folders to see how many bins each algorithm produced (see the sketch below). These are the total bins produced and are not quality filtered.
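A quick way to count them, assuming metaWRAP's default `*_bins` output folders with one `.fa` file per bin:

```bash
#!/bin/bash
# Count the bins produced by each binning algorithm.
for dir in INITIAL_BINNING/*_bins; do
    echo "${dir}: $(ls "${dir}"/*.fa 2>/dev/null | wc -l) bins"
done
```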
### Bin_refinement

Consolidate all 3 bin sets by picking the best version of each bin based on completion and redundancy (contamination) thresholds

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap bin_refinement -o <output_folder> -t <num_threads> -A <bininput_A> -B <bininput_B> -C <bininput_C> -c <min_completion> -x <max_redundancy>`

> `clusterize -n 12 -log refinement.log -m mdesmarais@mbl.edu metawrap bin_refinement -o BIN_REFINEMENT -t 12 -A INITIAL_BINNING/metabat2_bins/ -B INITIAL_BINNING/maxbin2_bins/ -C INITIAL_BINNING/concoct_bins/ -c 50 -x 10`

Inspect the number of bins meeting the completion and redundancy settings for each of the 3 algorithms as well as for the metaWRAP consolidated bins

> `cat BIN_REFINEMENT/<algorithm>_bins.stats | awk '$2><min_completion> && $3<<max_redundancy>' | wc -l`

> `cat BIN_REFINEMENT/concoct_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/maxbin2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metabat2_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REFINEMENT/metawrap_bins.stats | awk '$2>50 && $3<10' | wc -l`

*A note on completion and redundancy:*

*According to this anvi'o [article](http://merenlab.org/2016/06/09/assessing-completion-and-contamination-of-MAGs/), the completion and redundancy gold standards are 90% and 10%, respectively. In order to report a bin, it should be more than 50% complete with less than 10% contamination.*

*The -c and -x arguments can be modified to whichever completion and contamination (redundancy) levels are desired. By default, the metaWRAP minimum completion is 70% and the maximum redundancy is 5%. A lower completion % and higher redundancy % were used in this example to retain as many bins as possible and to explore the potential of the Binning & Bin_refinement modules (metaWRAP) combined with the anvi-refine module (anvi'o). Using Binning & Bin_refinement with less stringent parameters provides the user with a greater number of lower-quality bins. A high redundancy % may point to bins containing multiple genomes, so anvi-refine can further improve such bins through manual decontamination and bin splitting. It has the potential to generate new bins (MAGs) with greater completion and lower redundancy, and thus to increase the number of high-quality bins. This process can be long and tedious.*
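The per-algorithm checks above can also be wrapped into a single loop (assuming, as in the commands above, completion in column 2 and redundancy in column 3 of each `.stats` file):

```bash
#!/bin/bash
# Count bins passing the >50% completion / <10% redundancy thresholds
# in every bin set produced by Bin_refinement.
for stats in BIN_REFINEMENT/*_bins.stats; do
    echo "${stats}: $(awk '$2>50 && $3<10' "${stats}" | wc -l) bins pass"
done
```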
#### Anvi'o

**After the Bin_refinement module, the metaWRAP bin collection can be imported into anvi'o for human-guided curation of individual bins (in other words, to remove contaminating contigs that are not believed to belong to a given genome). This can potentially improve the completion and redundancy metrics of the bins. This step is highly subjective and time-consuming. It should only be done if the P.I. agrees to members of the Keck Facility manually modifying the binning results. Move on to the next step of metaWRAP if a fast and automated binning method is wanted, after which the bins will be ready to be passed on to the P.I.**

### Blobology (optional)

Visualize the entire assembly on a GC vs. abundance plane, and annotate contigs with taxonomy and bin information.

(**Optional**: the taxonomic composition of contigs and bins will be assessed downstream with anvi'o, and Blobology results cannot be imported as a layer into anvi'o. Blobology is still a great tool to visualize the taxonomic composition of the contigs and bins. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Blobology.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap blobology -a <final_assembly> -t <num_threads> -o <output_folder> --bins <metawrap_bins> <clean_reads>`

> `clusterize -n 12 -log blobology.log -m mdesmarais@mbl.edu metawrap blobology -a ASSEMBLY/final_assembly.fasta -t 12 -o BLOBOLOGY --bins BIN_REFINEMENT/metawrap_bins CLEAN_READS/*fastq`

### Quant_bins (optional)

Calculate the abundance of each contig in each sample, and the bin abundances, with the Salmon tool.

(**Optional**: the read abundances recruited to contigs and bins will be assessed downstream with anvi'o, and Quant_bins results cannot be imported as a layer into anvi'o. Quant_bins is still a great tool to visualize contig, sample and bin abundances. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Quant_bins.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap quant_bins -b <metawrap_bins> -o <output_folder> -a <final_assembly> <clean_reads>`

> `clusterize -n 12 -log quant.log -m mdesmarais@mbl.edu metawrap quant_bins -b BIN_REFINEMENT/metawrap_bins -o QUANT_BINS -a ASSEMBLY/final_assembly.fasta CLEAN_READS/*fastq`

### Reassemble_bins

At this point, the input bins have either been refined through the anvi'o part of this pipeline or, if anvi'o human-guided refinement has been skipped, come straight from the Bin_refinement module. Re-assemble the consolidated bins by collecting the reads belonging to each bin and re-assembling them. Only bins that improved will be kept.

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap reassemble_bins -o <output_folder> -1 <clean_forward_reads> -2 <clean_reverse_reads> -t <num_threads> -c <min_completion> -x <max_redundancy> -b <metawrap_bins>`

> `clusterize -n 12 -log reassembly.log -m mdesmarais@mbl.edu metawrap reassemble_bins -o BIN_REASSEMBLY -1 CLEAN_READS/ALL_READS/ALL_READS_1.fastq -2 CLEAN_READS/ALL_READS/ALL_READS_2.fastq -t 12 -c 50 -x 10 -b BIN_REFINEMENT/metawrap_bins`

Inspect the number of reassembled bins meeting the reporting thresholds (>50% completion, <10% redundancy) and the high-quality thresholds (>90% completion, <5% redundancy)

> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>50 && $3<10' | wc -l`
> `cat BIN_REASSEMBLY/reassembled_bins.stats | awk '$2>90 && $3<5' | wc -l`

**The binning part of this pipeline is now finished. Bins can be renamed directly in the folder and unwanted low-quality bins can be manually discarded. At this point, the bins (as well as the entire assembly, the working directory and this HackMD for reference) can be compressed and sent directly to the P.I. for downstream taxonomic and functional analysis. The bins can also be fed into the following modules to get a glimpse into their functional and taxonomic composition.**
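A minimal sketch of the packaging step (the archive name is illustrative; run from /workspace/<username>/):

```bash
# Compress the whole project directory for transfer to the P.I.
tar -czvf <project_name>_results.tar.gz <project_directory>/
```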
### Classify_bins (optional)

Determine the taxonomy of each contig and bin set with Taxator-tk.

(**Optional**: bin taxonomy will be assessed downstream with anvi'o, and Classify_bins results cannot be imported as a layer into anvi'o. Classify_bins/Taxator-tk is still a great tool to accurately assign taxonomy to contigs and bin sets. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Classify_bins.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap classify_bins -b <reassembled_bins> -o <output_folder> -t <num_threads>`

> `clusterize -n 4 -log tax.log -m mdesmarais@mbl.edu metawrap classify_bins -b BIN_REASSEMBLY/reassembled_bins -o BIN_CLASSIFICATION -t 4`

### Annotate_bins (optional)

Functionally annotate bins with PROKKA.

(**Optional**: functional annotation will be done downstream with anvi'o (HMMER and Prodigal), and Annotate_bins results cannot be imported as a layer into anvi'o. Annotate_bins/PROKKA is still a great tool to accurately assign gene functions and other ORF annotations to contigs and bin sets. See the [tutorial](https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md) notes on Annotate_bins.)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu metawrap annotate_bins -o <output_folder> -t <num_threads> -b <reassembled_bins>`

> `clusterize -n 4 -log annotation.log -m mdesmarais@mbl.edu metawrap annotate_bins -o FUNCT_ANNOT -t 4 -b BIN_REASSEMBLY/reassembled_bins/`

## Anvi'o (optional)

Following the published [article](https://peerj.com/articles/1319/) and the [Anvi'o metagenomics tutorial](http://merenlab.org/2016/06/22/anvio-tutorial-v2/#creating-an-anvio-contigs-database), modified for the metaWRAP assembly and bin set.

The metaWRAP bin collection can be imported into anvi'o for human-guided curation of individual bins (in other words, to remove contaminating contigs that are not believed to belong to a given genome). This can potentially improve the completion and redundancy metrics of the bins. This step is highly subjective and time-consuming. It should only be done if the P.I. agrees to members of the Keck Facility manually modifying the binning results. Move on to the next step of metaWRAP if a fast and automated binning method is wanted, after which the bins will be ready to be passed on to the P.I.

### anvi-gen-contigs-database

Generate the Anvi'o contigs database

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-gen-contigs-database -f <final_assembly> -o contigs.db -n 'An example contigs database'`

> `clusterize -n 4 -log database.log -m mdesmarais@mbl.edu anvi-gen-contigs-database -f ASSEMBLY/final_assembly.fasta -o contigs.db -n 'EF_THE_shotgun'`

### anvi-run-hmms

Search for sequence homologs with HMMER against anvi'o's built-in HMM profiles

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-hmms -c contigs.db`

> `clusterize -n 4 -log hmms.log -m mdesmarais@mbl.edu anvi-run-hmms -c contigs.db --num-threads 4`

### anvi-run-ncbi-cogs

Annotate gene calls with functions from NCBI's Clusters of Orthologous Groups (COGs)

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads <num_threads>`

> `clusterize -n 4 -log cogs.log -m mdesmarais@mbl.edu anvi-run-ncbi-cogs -c contigs.db --num-threads 4`

### Centrifuge

Assign taxonomy to gene calls with Centrifuge, following this [article](http://merenlab.org/2016/06/18/importing-taxonomy/#centrifuge) by the anvi'o developers. This will help with the manual removal of contaminants in the anvi-refine step.
Export all gene calls from contigs.db

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`

> `clusterize -n 40 -log getseq.log -m mdesmarais@mbl.edu anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa`

Run Centrifuge with the blastdb database

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`

> `clusterize -n 40 -log centrifuge.log -m mdesmarais@mbl.edu centrifuge -f -x /blastdb/centrifuge/p+h+v/p+h+v gene-calls.fa -S centrifuge_hits.tsv`

Check that the output files are not empty on the server

> `wc -l centrifuge_report.tsv centrifuge_hits.tsv`

Import the Centrifuge results into the contigs database

> `clusterize -n <num_threads> -log <log> -m <username>@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge`

> `clusterize -n 40 -log cimport.log -m mdesmarais@mbl.edu anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge`

### anvi-display-contigs-stats

Inspect the contigs database statistics.

SSH into the server (not a cluster) of choice, forwarding port 8008

> `ssh -L 8008:localhost:8008 <username>@<server>`

> `ssh -L 8008:localhost:8008 mdesmarais@minnie`

Generate the statistics

> `anvi-display-contigs-stats contigs.db --server-only -P 8008`

Open http://localhost:8008 in an internet browser (Chrome)

Double-check that the assembly statistics agree with the previous QUAST report of the assembly from the metaWRAP module.

### anvi-init-bam

Sort and index the .bam files generated by the mapping software, found in INITIAL_BINNING/work_files.

Create a text file (SAMPLE_IDs.txt) containing the sample names in the project directory using GNU nano or another text editor (no extension names, commas or spaces).

![](https://i.imgur.com/I4Do9ZS.png)

Create the bash script `anvi-init.sh` to automate anvi-init-bam with SAMPLE_IDs.txt

> `nano ./anvi-init.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat SAMPLE_IDs.txt)
do
    clusterize -n <num_threads> -log init_${sample}.log -m <username>@mbl.edu anvi-init-bam INITIAL_BINNING/work_files/${sample}.bam -o INITIAL_BINNING/work_files/${sample}-init.bam
done
```

Make the script executable and run it

> `chmod +x anvi-init.sh`
> `./anvi-init.sh`

### anvi-profile

Create a database with sample-specific information about the contigs. It will calculate the mean coverage, standard deviation of coverage and average coverage for the inner quartiles (Q1 and Q3) for each contig. It will also characterize SNVs.

Create the PROFILES directory

> `mkdir PROFILES`

Create the bash script `anvi-profile.sh` to automate anvi-profile with SAMPLE_IDs.txt, which was created in the previous step

> `nano ./anvi-profile.sh`

Input script (change the number of threads based on the cluster used)

```bash
#!/bin/bash
for sample in $(cat SAMPLE_IDs.txt)
do
    clusterize -n <num_threads> -m <username>@mbl.edu -log profile_${sample}.log anvi-profile -i INITIAL_BINNING/work_files/${sample}-init.bam -c contigs.db --min-contig-length 1000 --output-dir PROFILES/${sample} --sample-name ${sample} --num-threads <num_threads>
done
```

Make the script executable and run it

> `chmod +x anvi-profile.sh`
> `./anvi-profile.sh`
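Before merging the profiles in the next step, it is worth confirming that every sample produced a profile database; a minimal check assuming the PROFILES layout created above:

```bash
#!/bin/bash
# Compare the number of completed profiles against the sample list.
expected=$(wc -l < SAMPLE_IDs.txt)
found=$(ls -d PROFILES/*/PROFILE.db 2>/dev/null | wc -l)
echo "Profiles found: ${found} / ${expected}"
```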
### anvi-merge

Merge the Anvi'o profiles, skipping CONCOCT binning since our own metaWRAP-refined bin collection will be imported in the next step. The --enforce-hierarchical-clustering option can be removed for faster processing. Consult the [anvi'o tutorial](http://merenlab.org/2016/06/22/anvio-tutorial-v2/#anvi-merge) for more information.

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`

> `clusterize -n 10 -m mdesmarais@mbl.edu -log merge.log anvi-merge PROFILES/*/PROFILE.db -o SAMPLES-MERGED -c contigs.db --skip-concoct-binning --enforce-hierarchical-clustering`

### anvi-import-collection

Substitute the period in the metawrap_bins.contigs bin names with an underscore so the names agree with the anvi'o format (make a copy of your metawrap_bins.contigs file first to make sure you do not lose anything). Run this command on a server (not a cluster) in the BIN_REFINEMENT directory

> `sed 's/bin\./bin_/g' <metawrap_bins.contigs >metawrap_bins1.contigs`

Import the collection into the merged profile database

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-import-collection <bin_refinement_contigs> -p <merged_profile.db> -c contigs.db -C <Collection_name> --contigs-mode`

> `clusterize -n 10 -m mdesmarais@mbl.edu -log impcoll.log anvi-import-collection BIN_REFINEMENT/metawrap_bins1.contigs -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C METAWRAP --contigs-mode`

### anvi-summarize

Summarize your bin collection in terms of completion and redundancy with anvi'o to assess which bins need to be refined. Alternatively, the CheckM results from the metaWRAP Bin_refinement module can be used to identify bins to refine, but it is advised to use anvi-summarize for both the pre-QC and post-QC of anvi-refine, since quality metrics can slightly vary between anvi'o's estimates and metaWRAP's CheckM results.

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`

> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`

The --quick-summary flag can be removed if a detailed summary of the collection is needed.

### anvi-refine

Visualize bins individually and manually curate/refine them by removing contaminating contigs, as per this [article](http://merenlab.org/2015/05/11/anvi-refine/) by the anvi'o developers. Coverage, taxonomy and GC content can be used to identify contaminating splits.

SSH into the server (not a cluster) of choice, forwarding port 8008

> `ssh -L 8008:localhost:8008 <username>@<server>`

> `ssh -L 8008:localhost:8008 mdesmarais@minnie`

> `anvi-refine -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C <collection> -b <bin_to_be_refined> --server-only -P 8008`

> `anvi-refine -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C METAWRAP -b bin_95 --server-only -P 8008`

Open http://localhost:8008 in an internet browser (Chrome)

Select all splits in the selected bin, and remove the splits believed to be contaminants by right-clicking and removing the split from the bin. Once finished, click on **Store refined bins in the database**. Make sure to name the bins in a way that agrees with the original bin set generated by the Bin_refinement module. Repeat for all bins that you want refined.
### anvi-summarize

Summarize once again to see how the quality (completion and redundancy) of your bins has improved.

> `clusterize -n <num_threads> -m <username>@mbl.edu -log <log> anvi-summarize -p <profile_database> -c <contigs_database> -o <output_folder> -C <bin_collection> --quick-summary`

> `clusterize -n 10 -m mdesmarais@mbl.edu -log summ.log anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C METAWRAP --quick-summary`

The --quick-summary flag can be removed if a detailed summary of the collection is needed.

**Now, move on to the Reassemble_bins module of metaWRAP to reassemble your curated bins.**

### anvi-interactive (optional)

If you want, you can now visualize all bins (including curated and unrefined bins) with the interactive interface of anvi'o on a server.

SSH into the server (not a cluster) of choice, forwarding port 8008

> `ssh -L 8008:localhost:8008 <username>@<server>`

> `ssh -L 8008:localhost:8008 mdesmarais@minnie`

Launch the interactive interface

> `anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C METAWRAP --server-only -P 8008`

Open http://localhost:8008 in an internet browser (Chrome)
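metaWRAP's Reassemble_bins module (-b) expects a folder of per-bin FASTA files. If the curated collection is heading back for reassembly (see the note above), one way to collect the bin FASTAs from a full (non --quick-summary) anvi-summarize run is sketched below; the `bin_by_bin/<bin>/<bin>-contigs.fa` layout and the REFINED_BINS folder name are assumptions to verify against your anvi'o version:

```bash
#!/bin/bash
# Gather curated bin FASTAs from the anvi-summarize output into one folder
# that can be passed to metawrap reassemble_bins with -b.
# REFINED_BINS is an illustrative folder name.
mkdir -p REFINED_BINS
for f in SAMPLES-SUMMARY/bin_by_bin/*/*-contigs.fa; do
    bin=$(basename "${f%-contigs.fa}")
    cp "$f" "REFINED_BINS/${bin}.fa"
done
```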