Genome decontamination

# Genome decontamination (*Anvi'o*) ###### tags: `Assembly` ## Workflow 1. Map raw reads to raw assembly * `bowtie` or `minimap2` * Input: Raw fasta & assembly fasta * Output: Alignment `.sam` file 2. Convert alignment `.sam` file to `.bam` format * `samtools` 3. Inspect ONT mapping report * `samtools` 4. Annotate assembly taxonomies * `blastn` 5. Initially assess contamination * `blobtools` * Input: Alignment `.bam` files, raw assembly fasta & raw assembly taxonomic annotations 6. Create contigs database (CDB) * `anvi'o` * Holds basic information assembly scaffolds (k-mer frequency, GC-content, ...) * Additional steps for: * COG gene annotations * `centrifuge` taxonomy annotations 7. Profile `.bam` file to CDB * `anvi’o` Profiles describes statistics per scaffold in `.bam` file (average coverage, portion of each scaffold covered by at least one read, ...) 8. Merge `anvi’o` profiles * Computes hierarchical clustering of scaffolds 9. Visualize merged data on `anvi’o`'s interface * Allows identification of draft genome bins & removal of contaminants ## MBL Cluster Preparation File paths: * Home directory (`~`): `/groups/fr2020/Isa/` * ONT (to decontaminate): * Raw reads: `~/ONT_raw_reads/ONT_2019_2021_all.fastq.gz` * Assembly: `~/pbjelly/Ds_jelly-assembly.fasta` * Illumina (if needed): * Reads: `~/groups/fr2020/Isa/Illumina_ds_paired_reads/` * Assembly: `~/Ds_illumina_genome/Dst_b1v03.fasta` * Annotations: `~/Ds_illumina_genome/Dst_b1v03.max_arth_b2g_droso_b2g_ctg-names.gff` # [BlobTools](https://blobtoolkit.genomehubs.org/blobtools2/) ## Preparing `blobtools` **Goal:** Re-run `blobtools` with lastest NCBI `nt` database (October 2025), and compare with coverage files. ### 1. Map reads to the assembly #### Illumina Mapping the Illumina reads to the `pbjelly` assembly with [`bowtie`](https://blobtoolkit.genomehubs.org/blobtools2/): ``` cd /groups/fr2020/Isa/ bowtie2 -p 12 -x Ds_jelly-assembly \ -1 Illumina_ds_paired_reads/trimmomatic/Darwinula_stevensoni_250_350_pass_paired_1_fixed.fastq.gz \ -2 Illumina_ds_paired_reads/trimmomatic/Darwinula_stevensoni_250_350_pass_paired_2_fixed.fastq.gz \ -S pbjelly/coverage/Ds_jelly-assembly_250_350.sam ``` #### ONT Mapping the ONT reads to the `pbjelly` assembly with `minimap2`: ``` module load minimap2 cd pbjelly minimap2 -ax map-ont Ds_jelly-assembly.mm ../ONT_raw_reads/ONT_2019_2021_all.fastq.gz -t 42 > coverage/Ds_jelly-assembly_ONT.sam ``` ### 2. Convert alignment `.sam` to `.bam` Convert SAM to BAM format: ``` cd coverage samtools view Ds_jelly-assembly_250_350.sam -S -b > Ds_jelly-assembly_250_350.bam samtools sort Ds_jelly-assembly_250_350.bam > Ds_jelly-assembly_250_350.sort.bam samtools view Ds_jelly-assembly_ONT.sam -S -b > Ds_jelly-assembly_ONT.bam samtools sort Ds_jelly-assembly_ONT.bam > Ds_jelly-assembly_ONT.sort.bam ``` ### 3. Inspect ONT mapping report View the mapping report of ONT reads (important for later): ``` samtools flagstat Ds_jelly-assembly_ONT.sort.bam 3588820 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 3385697 + 0 mapped (94.34%:-nan%) ``` ### 4. Annotate taxonomies with `blastn` Blast the raw assembly against NCBI's most recent *nt* database: ``` blastn -query ../Ds_jelly-assembly.fasta \ -db nt \ -outfmt '6 qseqid staxids bitscore std' \ -max_target_seqs 1 \ -max_hsps 1 \ -evalue 1e-25 \ -num_threads 20 \ -out ../blobtools/Ds_jelly-assembly-blastn.out ``` :::info * `-outfmt '6 qseqid staxids bitscore std'`: * [Output format six](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6) with 3 specified columns, and then `std` * `std`: Standard columns (n = 11) * Resulting table has 11 columns * `qseqid` Is both manually specified, and included in `std` * Therefore, it is included twice ::: :::warning The BLAST database on the MBL cluster is outdated. Therefore, this was done using RIT BLAST databases. ::: ### 5. Initially assess contamination via [`blobtools`](https://blobtoolkit.genomehubs.org/blobtools2/) :::info I have [`blobtools`](https://blobtoolkit.genomehubs.org/blobtools2/) installed in my home directory; it's not in bioware. ::: #### Illumina Run [`blobtools`](https://blobtoolkit.genomehubs.org/blobtools2/): ``` ~/bin/blobtools/blobtools create \ -i ../Ds_jelly-assembly.fasta \ -b ../coverage/Ds_jelly-assembly_250_350.sort.bam \ -t Ds_jelly-assembly-blastn.out \ -o Ds_jelly-assembly_Ill-blastn-blobplot ~/bin/blobtools/blobtools view \ -i Ds_jelly-assembly-blastn.blobDB.json ~/bin/blobtools/blobtools plot -i Ds_jelly-assembly-blastn.blobDB.json ``` #### ONT Run [`blobtools`](https://blobtoolkit.genomehubs.org/blobtools2/): ``` ~/bin/blobtools/blobtools create \ -i ../Ds_jelly-assembly.fasta \ -b ../coverage/Ds_jelly-assembly_ONT.sort.bam \ -t Ds_jelly-assembly-blastn.out \ -o Ds_jelly-assembly-blastn_ONT-blobplot ~/bin/blobtools/blobtools view \ -i Ds_jelly-assembly-blastn.blobDB.json ~/bin/blobtools/blobtools plot -i Ds_jelly-assembly-blastn.blobDB.json ``` :::warning **NOT SURE if we can combine with in one plot report (!!!)** ::: **EVERYTHING** in: `/groups/fr2020/Isa/pbjelly/blobtools/` Compare old report (old BLAST DB) in `/groups/fr2020/Isa/pbjelly/blobtools/oldDB/` with new ones: Illumina: `Ds_pbjelly.bestsum.Illumina.png` `Ds_pbjelly.bestsum.Illumina.read_cov.png` ONT: `Ds_pbjelly.bestsum.ONT.png` `Ds_pbjelly.bestsum.ONT.read_cov.png` :::info **++Result interpretation++** * Illumina: * Little (negligible) contamination * 8.20 % did not map * Illumina + old NCBI `nt` database: * Suggested 11.65% of mapped reads **(11.64% increase)** came from conaminants * (Did not affect mapping) * ONT + latest NCBI `nt` database: * Some contamination -> Requires cleanup * Mostly Proteobacteria * 49.85 % did not map -> Reverse reads? No idea (FR). But I get (samtools flagstat) 94.34% of ONT reads mapped to the Ds pbjellt assembly with minimap2. **We need to investigate this.** ::: # ANVIO ### 6. Create contigs database Create CDB: ``` module load anvio/8-conda cd ../anvio anvi-gen-contigs-database -f ../Ds_jelly-assembly.fasta \ -o contigs.db -T 24 -n Ds_jelly -g 1 ``` :::info * `-f`: The FASTA file that contains reference sequences you mapped your samples against. * `-n`: Name of the project. * `-g`: Translation table to use ::: `anvio 8` Output clarifies used parameters: ``` ##uses k-mer = 4 ##--split-length 20000 (default) ##--prodigal-translation-table 1 (in PRODIGAL -g 1) ### CONTIGS DB CREATE REPORT =============================================== Split Length .................................: 20,000 K-mer size ...................................: 4 Skip gene calling? ...........................: False External gene calls provided? ................: False Ignoring internal stop codons? ...............: False Splitting pays attention to gene calls? ......: True Contigs with at least one gene call ..........: 1235 of 1243 (99.4%) Contigs database .............................: A new database, contigs.db, has been created. Number of contigs ............................: 1,243 Number of splits .............................: 20,912 Total number of nucleotides ..................: 427,252,629 Gene calling step skipped ....................: False Splits broke genes (non-mindful mode) ........: False Desired split length (what the user wanted) ..: 20,000 Average split length (what anvi'o gave back) .: 20,433 ``` "Populate `contigs.db` with Hidden Markov Model hits": ``` anvi-run-hmms -c contigs.db --num-threads 20 ``` #### Additional CDB annotations Protocol reference can be found [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC4824900/) ##### Annotation via COGs Annotate genes in `contigs-db` with functions using NCBI’s Clusters of Orthologus Groups (COGs) database: ``` anvi-run-ncbi-cogs -c contigs.db \ --cog-data-dir /blastdb/cogs -T 24 anvi-get-dna-sequences-for-gene-calls -c contigs.db \ -o gene-calls.fa ``` *"How many genes did Anvi'o find on your contigs?"*: ``` egrep ">" gene-calls.fa | wc -l ``` ``` 468511 ``` ##### Annotation via `centrifuge` Protocol reference can be found [here](https://merenlab.org/2016/06/18/importing-taxonomy/). Classify microbes in metagenomic sequences using `centrifuge`: ``` export PATH=/bioware/centrifuge-1.0.3-beta:$PATH export CENTRIFUGE_BASE="/groups/fr2020/Isa/pbjelly/anvio/CENTRIFUGE" cd $CENTRIFUGE_BASE wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz tar -zxvf p+h+v.tar.gz && rm -rf p+h+v.tar.g centrifuge -f -x $CENTRIFUGE_BASE/p+h+v/p+h+v gene-calls.fa \ -S centrifuge_hits.tsv ``` ##### Import annotations to CDB Import taxonomies into `anvi'o`: ``` anvi-import-taxonomy -c contigs.db \ -i centrifuge_report.tsv centrifuge_hits.tsv \ -p centrifuge ``` ### 7. Profile `.bam` to CDB Profile `.bam` files relative to `anvi'o` CDB: ``` anvi-profile -i ../coverage/Ds_jelly-assembly_250_350.sort.bam \ --output-dir ./illumina_profile \ --sample-name illumina_250_350PE \ -c contigs.db --min-contig-length 800 --num-threads 24 anvi-profile -i ../coverage/Ds_jelly-assembly_ONT.sort.bam \ --output-dir ./ONT_profile \ --sample-name ONT \ -c contigs.db --min-contig-length 800 --num-threads 24 ``` #### 8. Merge profiles Merge the different `anvi'o` profiles: ``` anvi-merge illumina_profile/PROFILE.db ONT_profile/PROFILE.db \ -o Ds_illumina-ONT_merge -c contigs.db \ -S Ds_pbjelly --enforce-hierarchical-clustering ``` Interactive (fun!) mode: ``` anvi-interactive -p Ds_illumina-ONT_merge/PROFILE.db \ -c contigs.db --server-only ``` :::info * For use of `anvi-interactive` from a PC outside the MBL network, connect to the servers using Windows Powershell and the following command: ``` ssh -L8123:localhost:8080 -J ${USERNAME}@evol5.mbl.edu ${USERNAME}@minnie ``` ::: :::info ++**Report is found at:**++ `/groups/fr2020/Isa/pbjelly/anvio/bacteria/Ds_illumina-ONT_merge/SUMMARY_default/` ::: # NOW WHAT? Goods news: there are not many contaminants ~16 Megabases (~250 contigs) Bad news: we have to remove those. Anvio gives you the Fasta file with the Two Bins (bacteria vs Ds) in the reprot (bin_by_bin). But I think it would be ideal to combine both reports (blobtools and anvio), like a big table with contigs and taxonomic assignation and extract our filtered Darwinula ONLY contigs for future analysis. :) ### Combine `blobtools` and `anvi'o` reports **Goal:** One table, listing all contigs and their taxonomies (see below) |Contig|tax_blobtools_Ill|tax_blobtools_ONT|bin_anvio| |-|-|-|-| | Contig 0 | Arthropoda | Arthropoda |Ds| |...|...|...|...| |Contig1242|Proteobacteria|Proteobacteria|Bacteria| Input files: * `blobtools/Ds_jelly-assembly-blastn_ONT.Ds_jelly-assembly-blastn_ONT-blobplot.blobDB.table.txt` * `blobtools/Ds_jelly-assembly-blastn-blobplot.blobDB.table.txt` * `anvio/bacteria/Ds_illumina-ONT_merge/SUMMARY_default/bin_by_bin/Bin_bacteria/Bin_bacteria-contigs.fa` * `anvio/bacteria/Ds_illumina-ONT_merge/SUMMARY_default/bin_by_bin/Bin_Ds/Bin_Ds-contigs.fa` :::warning **What about `anvio/Ds_illumina-ONT_merge/SUMMARY_default/bin_by_bin/...`?** ::: Processing: * `blobtools` files: 1. Remove `#`-containing lines (`grep`) 2. Extract columns of interest (`awk`) as `blobtools_contigs` ``` grep -v "#" blobtools/Ds_jelly-assembly-blastn-blobplot.blobDB.table.txt \ | awk '{print $1, $6}' \ > blobtools/blobtools_comb-report-prep_Ill.txt grep -v "#" blobtools/Ds_jelly-assembly-blastn_ONT.Ds_jelly-assembly-blastn_ONT-blobplot.blobDB.table.txt \ | awk '{print $6}' \ > blobtools/blobtools_comb-report-prep_ONT.txt paste blobtools/blobtools_comb-report-prep_Ill.txt blobtools/blobtools_comb-report-prep_ONT.txt | awk '{print $1, $2, $4}' | sort > blobtools/blobtools_comb-report-prep_ONT-Ill.txt ``` :::warning `Contig0` misses from `anvio_comb-report-prep.txt` ::: * `anvi'o` files: 1. Extract headers (`grep`) 2. Remove `>` (`sed`) 3. Export lists of contigs as `anvio_contigs_Ds` and `anvio_contigs_Bacteria` ``` grep ">" anvio/bacteria/Ds_illumina-ONT_merge/SUMMARY_default/bin_by_bin/Bin_Ds/Bin_Ds-contigs.fa | sed 's/>//g' | awk '{print $1, "Ds"}' > anvio/anvio_contigs_Ds.txt grep ">" anvio/bacteria/Ds_illumina-ONT_merge/SUMMARY_default/bin_by_bin/Bin_bacteria/Bin_bacteria-contigs.fa | sed 's/>//g' | awk '{print $1, "Bacteria"}'> anvio/anvio_contigs_Bacteria.txt cat anvio/anvio_contigs_Ds.txt anvio/anvio_contigs_Bacteria.txt | sort > anvio/anvio_comb-report-prep.txt join blobtools/blobtools_comb-report-prep_ONT-Ill.txt anvio/anvio_comb-report-prep.txt | awk 'BEGIN{print "contig", "blob_Illumina", "blob_ONT", "anvio"}1' > blob_anvio_report.txt ``` 5. Create a new, two-column file in a for-loop (see table below) as `anvio_contigs`: * This file lists the contig (in the 1^st^ column), and the `anvi'o` taxonomy (in the 2^nd^column, added through a conditional statement comparing to the `anvio_contigs_...` files) ``` touch anvio_comb-report-prep_Ds_Bacteria.txt ``` 6. Horizontally concatenate `blobtools_contigs` and `anvio_contigs` (`paste`) 7. Add header line (`cat`) Table from step 4: |contig|bin_anvio| |-|-| |Contig1| Bacteria| |...|...| |Contig1242| Ds| ### Extract *Darwinula* contigs 1. Unzip reads: ``` gunzip /groups/fr2020/Isa/ONT_raw_reads/ONT_2019_2021_all.fastq.gz ``` 2. Filter summary file to *D. stevensoni* contigs (`awk`) 3. Extract *D. stevensoni* contigs by name (`asub`) ### Assemble *Cardinium* genome 1. Filter summary file to *Cardinium* contigs (`awk`) * Filter to Proteobacteria, or something more specific? 2. Extract *Cardinium* contigs by name from the summary file (`asub`) 3. Map raw ONT reads to extracted *Cardinium* contigs (`minimap2`) 4. Extract raw ONT reads which mapped to *Cardinium* contigs from `.sam` file (`awk`) 5. Assemble *Cardinium* genome (`flye`)