# Pangenomic with Anvio ## Before running ### :warning: Disclaimer :warning: If you are here and you didn't get the link from me. Please go to the official Anvio tutorial [here](https://merenlab.org/2016/11/08/pangenomics-v2/) and [here](https://merenlab.org/2018/12/01/combining-annotation-sources-for-pan/). Meren, his team and [Mike Lee](https://anvio.org/people/AstrobioMike/) do a fantastic job on their website and it will be much better. ### Installing softwares This [page](https://anvio.org/install/) for anvio And we'll need a side set of script from [Mike Lee](https://merenlab.org/2018/12/01/combining-annotation-sources-for-pan/): ```{shell} conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit ``` ## Preparing run We run the Pangenomic analysis with our Bacta annotated genomes. First place the name of genome of interest in a `Gene.list.txt` file Our genomes: ```{text} BH3 MYb71 MYb15 MYb49 MYb68 JUb45 JUb46 BIGb0125 ``` Our genbank files are not compatible with the default anvio pipeline and we need to hack our way in. Thankfully, Mike Lee did all the job for us with his `bit` environment. All Prokka gbk file are copied in the `GBK` folder and we create a `clean` folder for having the genbank corrected files stored in. ```{shell} for i in $(cat Gene.list.txt); do cp /PATH/$i GBK/; done mkdir clean ``` In brief the script rename Bacta genbank contig name to unique contig per genomes ```{shell} conda activate bit for i in $(ls -d GBK/*); do a=$(echo $i | cut -f2 -d"/" |gsed "s/gbff\///g; s/\.gbff//g"); echo $i $a; bit-genbank-locus-clean-slate -i $i -o clean/$a.gbk -w $a; done cat clean/*gbk> all_refs.gbff conda deactivate ```` ## Running Anvio Anvio is a very comprehensive pipeline with many modules. We'll barrel through the pipeline here, for more information go to the Anvio website. ```{R} conda activate anvio-8 ``` Import all the information we have from our genbank files into compatible file format ```{R} anvi-script-process-genbank -i all_refs.gbff --output-gene-calls all_refs_gene_calls.tsv --output-functions all_refs_functions.tsv --output-fasta all_refs.fa --include-locus-tags-as-functions ``` Create a contig database, with all the gene call information ```{R} anvi-gen-contigs-database -f all_refs.fa -o contigs.db -n Wormbiome --external-gene-calls all_refs_gene_calls.tsv --split-length -1 --num-threads 12 anvi-import-functions -c contigs.db -i all_refs_functions.tsv ``` Look for unique copy marker genes: ``` anvi-run-hmms -c contigs.db --num-threads 12 ``` Create a collection file ```{R} cd clean for genome in $(ls *gbk | cut -f1 -d "."); do grep "LOCUS" "$genome".gbk | tr -s " " "\t" | cut -f2 > "$genome"_contigs.tmp; for contig in $(cat "$genome"_contigs.tmp); do echo "$genome"; done > "$genome"_name.tmp; paste "$genome"_contigs.tmp "$genome"_name.tmp > "$genome"_for_cat.tmp; done cat *_for_cat.tmp > collection.tsv && rm *.tmp cd .. mv clean/collection.tsv ./ anvi-profile -c contigs.db -o profile -S profile --blank-profile --min-contig-length 0 --skip-hierarchical-clustering anvi-import-collection collection.tsv -c contigs.db -p profile/PROFILE.db -C Wormbiome --contigs-mode ``` Get the genome database and file ready: ```{R} echo -e "name\tbin_id\tcollection_id\tprofile_db_path\tcontigs_db_path" > header.tmp cut -f2 collection.tsv | uniq > name_and_bin_id.tmp for i in $(cat name_and_bin_id.tmp); do echo "Wormbiome"; done > collection_id.tmp for i in $(cat name_and_bin_id.tmp); do echo "$PWD/profile/PROFILE.db"; done > profile_db_path.tmp for i in $(cat name_and_bin_id.tmp); do echo "$PWD/contigs.db"; done > contigs_db_path.tmp paste name_and_bin_id.tmp name_and_bin_id.tmp collection_id.tmp profile_db_path.tmp contigs_db_path.tmp > body.tmp cat header.tmp body.tmp > internal_genomes.tsv && rm *.tmp anvi-gen-genomes-storage -i internal_genomes.tsv -o wormbiome-GENOMES.db --gene-caller NCBI_PGAP ``` Finally run the pangenome comparison ```{R} anvi-pan-genome -g wormbiome-GENOMES.db \ -n wormbiome-mcl \ --num-threads 12 --mcl-inflation 10 ``` Get everything exported for downstream analysis ```{R} anvi-script-add-default-collection -c contigs.db -p wormbiome-mcl/wormbiome-mcl-PAN.db -C Wormbiome #Export gene clusters anvi-summarize -p wormbiome-mcl/wormbiome-mcl-PAN.db -g wormbiome-GENOMES.db -C Wormbiome -o Project_summary #Export gene names anvi-export-gene-calls -c contigs.db --gene-caller NCBI_PGAP -o gene_call.txt --skip-sequence-reporting ``` Voila