STAMPS 2025 Day 5 (afternoon) - metabolism estimation and pangenomics with anvi'o

# STAMPS 2025 Day 5 - metabolism estimation and pangenomics with anvi'o

## Afternoon session: pangenomics

Iva Veseli

July 19, 2025

We're going to continue working through the [_Vibrio jasicida_ pangenome tutorial](https://merenlab.org/tutorials/vibrio-jasicida-pangenome), which we started this morning. You should already have the 8 genomes in their contigs databases. Let's go back to the working directory that contains those databases.

If you started from FASTA files, this is how to get there:

```
cd ~/DAY_05_anvio/PANGENOMICS/V_jascida_genomes
```

Alternatively, if you downloaded the datapack of pre-made contigs databases, this is the path to go to:

```
cd ~/DAY_05_anvio/PANGENOMICS/V_jascida_contigs_dbs
```

1. take a look at the genome stats

Before we make a pangenome, we want to make sure we trust all of our input genomes. We can view an interactive webpage summarizing the genome stats:

```
anvi-display-contigs-stats *db
```

Open the interactive interface in your web browser using [the same link](https://github.com/mblstamps/stamps2025/wiki/Accessing-our-cloud-computers#attendees) as before.

You should see a few charts of assembly metrics, single-copy core gene annotations, and a big table of all the genome statistics. Do you notice anything strange? (Hint: try comparing all the MAGs to the reference genome, which you know is high quality.)

<details>
<summary>Click here for the answer.</summary>

What is up with MAG 52??? Not only is it much longer than the other genomes, it also seems to have more duplicates of single-copy core genes. This genome is likely contaminated -- it contains sequences that do not belong to _Vibrio jasicida_.

</details>

2. make an external genomes file

We've already used an external genomes file before, when we were running metabolism estimation on multiple genomes. This time, we don't have one ready-made, but it is fairly easy to make one. Anvi'o will help you:

```
anvi-script-gen-genomes-file --input-dir . \
                             -o external-genomes.txt
```

I highly suggest removing the problematic genome that we found earlier so that we don't include it in our pangenome analysis.

<details>
<summary>Click here for the code to remove it.</summary>

You could just open the external genomes file and delete the line manually. Or, you could use this convenient Bash one-liner:

```
grep -v 52 external-genomes.txt > new; mv new external-genomes.txt
```

</details>

There. Now we will make a pangenome out of 7 _V. jasicida_ genomes.

3. make the genomes storage database

This program will consolidate all of the genome information needed for pangenomics into one database file.

```
anvi-gen-genomes-storage -e external-genomes.txt \
                         -o V_jascida-GENOMES.db
```

It should be very fast.

4. make the pangenome

Making a pangenome is very straightforward:

```
anvi-pan-genome -g V_jascida-GENOMES.db \
                --project-name V_jascida \
                --num-threads 8
```

While it runs, we should discuss a few of the tunable parameters. I personally like the explanations found [here](https://merenlab.org/2016/11/08/pangenomics-v2/#running-a-pangenome-analysis).

According to those explanations, we probably should have set the MCL inflation parameter higher than the default (2), like maybe `--mcl-inflation 5` or `10`, since we are working with genomes from the same species. Increasing that parameter makes the clustering more sensitive, so we would generally get more gene clusters. Feel free to experiment and see what changes! It will be _slightly_ faster when you re-do the pangenome, since anvi'o can reuse the Diamond output that is already stored in the working directory (but unfortunately that step is not really the bottleneck here).

5. visualize the pangenome

We are going to use the "pangenome mode" of the anvi'o interactive interface to look at our data.

```
anvi-display-pan -p V_jascida/V_jascida-PAN.db \
                 -g V_jascida-GENOMES.db
```

As usual, you'll need to open it in your web browser.
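If you want to follow up on the suggestion in step 4 and compare a few MCL inflation values side by side, here is a minimal sketch. It only uses the `anvi-pan-genome` flags shown above. The `V_jascida_mcl<N>` project names, the `run` helper, and the `DRY_RUN` switch are my own assumptions for illustration, not anything anvi'o provides. By default it just prints the commands so you can check them first.

```shell
#!/usr/bin/env bash
# Sketch: one pangenome per MCL inflation value, each under its own project name.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: sweep_mcl_inflation GENOMES_DB INFLATION [INFLATION ...]
sweep_mcl_inflation() {
    local genomes_db="$1"; shift
    local inflation project
    for inflation in "$@"; do
        # Hypothetical naming scheme: V_jascida_mcl5, V_jascida_mcl10, ...
        project="V_jascida_mcl${inflation}"
        run anvi-pan-genome -g "$genomes_db" \
            --project-name "$project" \
            --mcl-inflation "$inflation" \
            --num-threads 8
    done
}

sweep_mcl_inflation V_jascida-GENOMES.db 5 10
```

Assuming the output directory follows the project name, as it did for `V_jascida/` above, each result should end up in its own folder, and you could open one with `anvi-display-pan -p V_jascida_mcl5/V_jascida_mcl5-PAN.db -g V_jascida-GENOMES.db`.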
There is a lot to talk about in the interface, and we will probably spend a long time discussing its features and tweaking the visualization :) But once we are ready to move on, there are a couple more additions we can make to the pangenome figure. Don't forget to save the 'state' before quitting the interface.

6. (Optional) add an ANI heatmap

You can compute ANI between all pairs of genomes by giving the external genomes file to yet another anvi'o program. If you also provide the pangenome database, it will automatically store the ANI data inside the db for visualization later:

```
anvi-compute-genome-similarity --external-genomes external-genomes.txt \
                               --program pyANI \
                               --output-dir ANI \
                               --num-threads 8 \
                               --pan-db V_jascida/V_jascida-PAN.db
```

It will take some time :) It took about 2.5 minutes when I tried it.

If you see an error saying something like `PyANI returned with non-zero exit code`, don't worry. What a good learning opportunity! What does the error message say -- what should we do?

<details>
<summary>Click here for the fix.</summary>

The error message says to look in the log file, so let's do that. The anvi'o terminal output from the program will tell you where it is (see the line starting with `[PyANI] Log file path`). If you `cat` that file, you should see an ugly Python traceback that ends in something like "AttributeError: module 'matplotlib.pyplot' has no attribute 'register_cmap'".

An internet search would tell you that the PyANI code expects a particular version of `matplotlib`, but the version we have is too new. You can fix it by downgrading to an earlier version of `matplotlib`, one in which the `register_cmap` attribute still exists:

```
pip install matplotlib==3.8.2
```

After that, you can re-run the program and it should work just fine.

</details>

Once the ANI calculations are done, you can re-load the pangenome display and tweak some of the layer settings to make the ANI heatmap appear.

```
anvi-display-pan -p V_jascida/V_jascida-PAN.db \
                 -g V_jascida-GENOMES.db
```

7. study the pangenome interactively

We will take some time to go through some of the exploratory pangenome analyses you can do in the interactive interface. Many of these are described [here in the main tutorial](https://merenlab.org/tutorials/vibrio-jasicida-pangenome/#studying-the-pangenome) in case you ever need a reminder.

It could be a nice exercise to try exploring things on your own. Take 20-30 minutes to work with the pangenome in the interface and ask me questions. :) Then we can regroup and discuss what we found.

8. time for questions, and perhaps an extra demo

Before we wrap up, I wanted to leave some space for open questions. And if there is time (and interest), I can go through a few cool things that anvi'o can do (not necessarily related to pangenomics or metabolism). Let me know what you are interested in seeing! You might get some ideas from the anvi'o program page: [https://anvio.org/help/main/](https://anvio.org/help/main/)
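
One note for when you go back to step 2 on your own data: `grep -v 52` drops every line that contains `52` anywhere, including inside a file path or another genome's name. Here is a slightly more careful sketch that matches the genome name exactly. It assumes the usual tab-delimited external genomes layout with a header line and the genome name in the first column. The `drop_genome` function and the `MAG_52` name are placeholders I made up for illustration.

```shell
# Sketch: drop one genome from an external genomes file by exact name.
# Assumes a tab-delimited file with a header line and the genome name in column 1.
drop_genome() {
    local name="$1" file="$2" tmp
    tmp="$(mktemp)" || return 1
    # Keep the header (line 1) and every row whose first column is not $name.
    if awk -F '\t' -v drop="$name" 'NR == 1 || $1 != drop' "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage (MAG_52 is a placeholder; use the name shown in your own file):
# drop_genome MAG_52 external-genomes.txt
```

Because the comparison is on the whole first column, a genome called `MAG_152`, or a database stored under a folder like `run52/`, would be left alone.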