---
tags: ISS Staph paper
title: ISS Staph paper pangenomics notes
---

# ISS Staph paper pangenomics notes

[toc]

> **NOTE**
> These are my notes while building; eventually the [OSF](https://osf.io/8sy2a/wiki/3.%20Pangenomics/) will be updated appropriately. What's below is imperfect for following along.

## Cleaning genbank LOCUS names

For all input genbank files, I ran `bit-genbank-locus-clean-slate` to avoid problems with anvio and headers, e.g.:

```bash
conda activate bit

bit-genbank-locus-clean-slate -i OBSA1.gbk -w OBSA1 -o OBSA1.gb
```

```bash
for i in $(cat ../unique-genome-IDs.txt); do
    echo $i
    bit-genbank-locus-clean-slate -i ${i}.gb -w ${i} -o ${i}.tmp
    mv ${i}.tmp ${i}.gb
done
```

## Anvio

### Install

Anvio was installed as described on the OSF here: https://osf.io/8sy2a/wiki/3.%20Pangenomics/

### Processing each genome into contigs and profile dbs

Just doing 2 here while building the workflow.

```bash
conda activate anvio
```

```bash
# ignoring that the annotation version will be different for those annotated with JCVI's PGAP and NCBI's PGAP, not important here
mkdir input-files-for-anvio

anvi-script-process-genbank -i all-genbank-files/OBSA1.gb -O input-files-for-anvio/OBSA1
anvi-script-process-genbank -i all-genbank-files/OBSA2.gb -O input-files-for-anvio/OBSA2
```

### Mapping

```bash
mkdir -p bowtie2-indexes bam-files

bowtie2-build --threads 20 input-files-for-anvio/OBSA1-contigs.fa bowtie2-indexes/OBSA1
bowtie2-build --threads 20 input-files-for-anvio/OBSA2-contigs.fa bowtie2-indexes/OBSA2

bowtie2 -q --threads 50 -x bowtie2-indexes/OBSA1 -1 ../all-err-corr-reads/OBSA1-err-corr-R1.fq.gz -2 ../all-err-corr-reads/OBSA1-err-corr-R2.fq.gz --no-unal 2> bam-files/OBSA1-mapping.log | samtools view -b | samtools sort -@ 20 > bam-files/OBSA1.bam
samtools index -@ 20 bam-files/OBSA1.bam

bowtie2 -q --threads 50 -x bowtie2-indexes/OBSA2 -1 ../all-err-corr-reads/OBSA2-err-corr-R1.fq.gz -2 ../all-err-corr-reads/OBSA2-err-corr-R2.fq.gz --no-unal 2> bam-files/OBSA2-mapping.log | samtools view -b | samtools sort -@ 20 > bam-files/OBSA2.bam
samtools index -@ 20 bam-files/OBSA2.bam
```

### Making and populating contigs dbs

```bash
mkdir -p contigs-dbs profile-dbs

anvi-gen-contigs-database -f input-files-for-anvio/OBSA1-contigs.fa -o contigs-dbs/OBSA1-contigs.db -n OBSA1 --external-gene-calls input-files-for-anvio/OBSA1-external-gene-calls.txt -T 50
anvi-gen-contigs-database -f input-files-for-anvio/OBSA2-contigs.fa -o contigs-dbs/OBSA2-contigs.db -n OBSA2 --external-gene-calls input-files-for-anvio/OBSA2-external-gene-calls.txt -T 50

anvi-import-functions -c contigs-dbs/OBSA1-contigs.db -i input-files-for-anvio/OBSA1-external-functions.txt
anvi-import-functions -c contigs-dbs/OBSA2-contigs.db -i input-files-for-anvio/OBSA2-external-functions.txt

anvi-run-hmms -T 50 -I Bacteria_71 -c contigs-dbs/OBSA1-contigs.db
anvi-run-hmms -T 50 -I Bacteria_71 -c contigs-dbs/OBSA2-contigs.db

anvi-scan-trnas -T 50 -c contigs-dbs/OBSA1-contigs.db
anvi-scan-trnas -T 50 -c contigs-dbs/OBSA2-contigs.db

anvi-run-ncbi-cogs -c contigs-dbs/OBSA1-contigs.db --cog-data-dir ~/ref-dbs/anvio/COGs/ -T 50 --sensitive
anvi-run-ncbi-cogs -c contigs-dbs/OBSA2-contigs.db --cog-data-dir ~/ref-dbs/anvio/COGs/ -T 50 --sensitive

anvi-run-kegg-kofams -c contigs-dbs/OBSA1-contigs.db --kegg-data-dir ~/ref-dbs/anvio/KOs/ -T 50
anvi-run-kegg-kofams -c contigs-dbs/OBSA2-contigs.db --kegg-data-dir ~/ref-dbs/anvio/KOs/ -T 50
```

### Profiling

```bash
anvi-profile -c contigs-dbs/OBSA1-contigs.db -i bam-files/OBSA1.bam -o profile-dbs/OBSA1-profile -S OBSA1 --cluster-contigs --min-contig-length 1000 -T 50
anvi-profile -c contigs-dbs/OBSA2-contigs.db -i bam-files/OBSA2.bam -o profile-dbs/OBSA2-profile -S OBSA2 --cluster-contigs --min-contig-length 1000 -T 50
```

### Pangenomics

Making the external genomes file (assuming a starting file listing the wanted input genomes, as made here):

```bash
printf "OBSA1\nOBSA2\n" > test-genomes.txt

printf "name\tcontigs_db_path\n" > external-genomes.tsv
sed 's/^/contigs-dbs\//' test-genomes.txt | sed 's/$/-contigs.db/' > paths.tmp
paste test-genomes.txt paths.tmp >> external-genomes.tsv
rm paths.tmp
```

Making the genomes storage db:

```bash
anvi-gen-genomes-storage -e external-genomes.tsv -o our-isolates-GENOMES.db --gene-caller NCBI_PGAP
```

Running the pangenome:

```bash
# first doing min of 1 so includes singletons
anvi-pan-genome -g our-isolates-GENOMES.db --mcl-inflation 6 --min-occurrence 1 -n ISS_Staph_mcl_6 -o ISS-Staph-pan -T 50 --sensitive

# now re-running (doesn't need to do the blastp again) without singletons, just need to change --min-occurrence and -n
anvi-pan-genome -g our-isolates-GENOMES.db --mcl-inflation 6 --min-occurrence 2 -n ISS_Staph_mcl_6_min_2 -o ISS-Staph-pan -T 50 --sensitive
```

### Workflow

The workflow built is on GitHub here: https://github.com/AstrobioMike/ISS-Staph-anvio-wf
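
As a side note on the external-genomes step in the Pangenomics section: the two `sed` calls and the temporary `paths.tmp` file could probably be collapsed into a single `while read` loop. This is only a sketch under the same assumptions as in these notes, namely that `test-genomes.txt` holds one genome ID per line and that each contigs db sits at `contigs-dbs/<ID>-contigs.db`; it is not what the linked workflow necessarily does.

```bash
# Genome IDs, one per line (the same two test genomes used in these notes)
printf "OBSA1\nOBSA2\n" > test-genomes.txt

# Header line for the external-genomes table
printf "name\tcontigs_db_path\n" > external-genomes.tsv

# Append one row per genome: the ID, then the assumed path to its contigs db
while read -r id; do
    printf "%s\tcontigs-dbs/%s-contigs.db\n" "${id}" "${id}"
done < test-genomes.txt >> external-genomes.tsv
```

This should produce the same tab-separated table as the `sed`/`paste` version, and swapping `test-genomes.txt` for a full list of genome IDs should be enough to scale it beyond the two test genomes.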