Pangenomic with Anvio

Before running

:warning: Disclaimer :warning:

If you are here and you didn't get the link from me. Please go to the official Anvio tutorial here and here. Meren, his team and Mike Lee do a fantastic job on their website and it will be much better.

Installing softwares

This page for anvio

And we'll need a side set of script from Mike Lee:

conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit

Preparing run

We run the Pangenomic analysis with our Bacta annotated genomes.

First place the name of genome of interest in a Gene.list.txt file

Our genomes:

BH3
MYb71
MYb15
MYb49
MYb68
JUb45
JUb46
BIGb0125

Our genbank files are not compatible with the default anvio pipeline and we need to hack our way in. Thankfully, Mike Lee did all the job for us with his bit environment.

All Prokka gbk file are copied in the GBK folder and we create a clean folder for having the genbank corrected files stored in.

for i in $(cat Gene.list.txt);
    do cp /PATH/$i GBK/;
done
mkdir clean

In brief the script rename Bacta genbank contig name to unique contig per genomes

conda activate bit
for i in $(ls -d GBK/*);
  do a=$(echo $i | cut -f2 -d"/" |gsed "s/gbff\///g; s/\.gbff//g");
  echo $i $a;
  bit-genbank-locus-clean-slate -i $i -o clean/$a.gbk -w $a;
done
cat clean/*gbk> all_refs.gbff
conda deactivate

Running Anvio

Anvio is a very comprehensive pipeline with many modules. We'll barrel through the pipeline here, for more information go to the Anvio website.

conda activate anvio-8

Import all the information we have from our genbank files into compatible file format

anvi-script-process-genbank -i all_refs.gbff --output-gene-calls all_refs_gene_calls.tsv --output-functions all_refs_functions.tsv --output-fasta all_refs.fa --include-locus-tags-as-functions

Create a contig database, with all the gene call information

anvi-gen-contigs-database -f all_refs.fa -o contigs.db -n Wormbiome --external-gene-calls all_refs_gene_calls.tsv  --split-length -1 --num-threads 12
anvi-import-functions -c contigs.db -i all_refs_functions.tsv

Look for unique copy marker genes:

anvi-run-hmms -c contigs.db --num-threads 12

Create a collection file

cd clean
for genome in $(ls *gbk | cut -f1 -d ".");
  do   grep "LOCUS" "$genome".gbk | tr -s " " "\t" | cut -f2 > "$genome"_contigs.tmp;
    for contig in $(cat "$genome"_contigs.tmp);
       do     echo "$genome";
     done > "$genome"_name.tmp;
   paste "$genome"_contigs.tmp "$genome"_name.tmp > "$genome"_for_cat.tmp;
done
cat *_for_cat.tmp > collection.tsv && rm *.tmp
cd ..
mv  clean/collection.tsv ./
anvi-profile -c contigs.db -o profile -S profile --blank-profile --min-contig-length 0 --skip-hierarchical-clustering
anvi-import-collection collection.tsv -c contigs.db -p profile/PROFILE.db -C Wormbiome --contigs-mode

Get the genome database and file ready:

echo -e "name\tbin_id\tcollection_id\tprofile_db_path\tcontigs_db_path" > header.tmp

cut -f2 collection.tsv | uniq > name_and_bin_id.tmp
for i in $(cat name_and_bin_id.tmp); do echo "Wormbiome"; done > collection_id.tmp
for i in $(cat name_and_bin_id.tmp); do echo "$PWD/profile/PROFILE.db"; done > profile_db_path.tmp
for i in $(cat name_and_bin_id.tmp); do echo "$PWD/contigs.db"; done > contigs_db_path.tmp
paste name_and_bin_id.tmp name_and_bin_id.tmp collection_id.tmp profile_db_path.tmp contigs_db_path.tmp > body.tmp
cat header.tmp body.tmp > internal_genomes.tsv && rm *.tmp

anvi-gen-genomes-storage -i internal_genomes.tsv -o wormbiome-GENOMES.db --gene-caller NCBI_PGAP

Finally run the pangenome comparison

anvi-pan-genome -g wormbiome-GENOMES.db \
               -n wormbiome-mcl \
               --num-threads 12 --mcl-inflation 10 

Get everything exported for downstream analysis

anvi-script-add-default-collection -c contigs.db -p wormbiome-mcl/wormbiome-mcl-PAN.db -C Wormbiome
#Export gene clusters
anvi-summarize -p wormbiome-mcl/wormbiome-mcl-PAN.db -g wormbiome-GENOMES.db -C Wormbiome -o Project_summary
#Export gene names
anvi-export-gene-calls -c contigs.db --gene-caller NCBI_PGAP -o gene_call.txt --skip-sequence-reporting

Voila