# Pangenomic with Anvio
## Before running
### :warning: Disclaimer :warning:
If you are here and you didn't get the link from me. Please go to the official Anvio tutorial [here](https://merenlab.org/2016/11/08/pangenomics-v2/) and [here](https://merenlab.org/2018/12/01/combining-annotation-sources-for-pan/). Meren, his team and [Mike Lee](https://anvio.org/people/AstrobioMike/) do a fantastic job on their website and it will be much better.
### Installing softwares
This [page](https://anvio.org/install/) for anvio
And we'll need a side set of script from [Mike Lee](https://merenlab.org/2018/12/01/combining-annotation-sources-for-pan/):
```{shell}
conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
```
## Preparing run
We run the Pangenomic analysis with our Bacta annotated genomes.
First place the name of genome of interest in a `Gene.list.txt` file
Our genomes:
```{text}
BH3
MYb71
MYb15
MYb49
MYb68
JUb45
JUb46
BIGb0125
```
Our genbank files are not compatible with the default anvio pipeline and we need to hack our way in. Thankfully, Mike Lee did all the job for us with his `bit` environment.
All Prokka gbk file are copied in the `GBK` folder and we create a `clean` folder for having the genbank corrected files stored in.
```{shell}
for i in $(cat Gene.list.txt);
do cp /PATH/$i GBK/;
done
mkdir clean
```
In brief the script rename Bacta genbank contig name to unique contig per genomes
```{shell}
conda activate bit
for i in $(ls -d GBK/*);
do a=$(echo $i | cut -f2 -d"/" |gsed "s/gbff\///g; s/\.gbff//g");
echo $i $a;
bit-genbank-locus-clean-slate -i $i -o clean/$a.gbk -w $a;
done
cat clean/*gbk> all_refs.gbff
conda deactivate
````
## Running Anvio
Anvio is a very comprehensive pipeline with many modules. We'll barrel through the pipeline here, for more information go to the Anvio website.
```{R}
conda activate anvio-8
```
Import all the information we have from our genbank files into compatible file format
```{R}
anvi-script-process-genbank -i all_refs.gbff --output-gene-calls all_refs_gene_calls.tsv --output-functions all_refs_functions.tsv --output-fasta all_refs.fa --include-locus-tags-as-functions
```
Create a contig database, with all the gene call information
```{R}
anvi-gen-contigs-database -f all_refs.fa -o contigs.db -n Wormbiome --external-gene-calls all_refs_gene_calls.tsv --split-length -1 --num-threads 12
anvi-import-functions -c contigs.db -i all_refs_functions.tsv
```
Look for unique copy marker genes:
```
anvi-run-hmms -c contigs.db --num-threads 12
```
Create a collection file
```{R}
cd clean
for genome in $(ls *gbk | cut -f1 -d ".");
do grep "LOCUS" "$genome".gbk | tr -s " " "\t" | cut -f2 > "$genome"_contigs.tmp;
for contig in $(cat "$genome"_contigs.tmp);
do echo "$genome";
done > "$genome"_name.tmp;
paste "$genome"_contigs.tmp "$genome"_name.tmp > "$genome"_for_cat.tmp;
done
cat *_for_cat.tmp > collection.tsv && rm *.tmp
cd ..
mv clean/collection.tsv ./
anvi-profile -c contigs.db -o profile -S profile --blank-profile --min-contig-length 0 --skip-hierarchical-clustering
anvi-import-collection collection.tsv -c contigs.db -p profile/PROFILE.db -C Wormbiome --contigs-mode
```
Get the genome database and file ready:
```{R}
echo -e "name\tbin_id\tcollection_id\tprofile_db_path\tcontigs_db_path" > header.tmp
cut -f2 collection.tsv | uniq > name_and_bin_id.tmp
for i in $(cat name_and_bin_id.tmp); do echo "Wormbiome"; done > collection_id.tmp
for i in $(cat name_and_bin_id.tmp); do echo "$PWD/profile/PROFILE.db"; done > profile_db_path.tmp
for i in $(cat name_and_bin_id.tmp); do echo "$PWD/contigs.db"; done > contigs_db_path.tmp
paste name_and_bin_id.tmp name_and_bin_id.tmp collection_id.tmp profile_db_path.tmp contigs_db_path.tmp > body.tmp
cat header.tmp body.tmp > internal_genomes.tsv && rm *.tmp
anvi-gen-genomes-storage -i internal_genomes.tsv -o wormbiome-GENOMES.db --gene-caller NCBI_PGAP
```
Finally run the pangenome comparison
```{R}
anvi-pan-genome -g wormbiome-GENOMES.db \
-n wormbiome-mcl \
--num-threads 12 --mcl-inflation 10
```
Get everything exported for downstream analysis
```{R}
anvi-script-add-default-collection -c contigs.db -p wormbiome-mcl/wormbiome-mcl-PAN.db -C Wormbiome
#Export gene clusters
anvi-summarize -p wormbiome-mcl/wormbiome-mcl-PAN.db -g wormbiome-GENOMES.db -C Wormbiome -o Project_summary
#Export gene names
anvi-export-gene-calls -c contigs.db --gene-caller NCBI_PGAP -o gene_call.txt --skip-sequence-reporting
```
Voila