# STAMPS 2025 Day 5 - metabolism estimation and pangenomics with anvi'o

## Morning session: anvi'o basics + metabolism

Iva Veseli
July 19, 2025

The lab below is based on two existing anvi'o tutorials:

- [An exercise on metabolic reconstruction in anvi'o](https://anvio.org/tutorials/fmt-mag-metabolism)
- [Vibrio jascida pangenome](https://merenlab.org/tutorials/vibrio-jasicida-pangenome/)

Due to time constraints, we'll only be going over a small portion of these tutorials. The relevant commands are copied below, with slight modifications so that they work on your [cloud computers](https://github.com/mblstamps/stamps2025/wiki/Accessing-our-cloud-computers). The context for each step is very minimal in this document, so please ask questions as needed (I may forget to explain something out loud).

We might not get through everything, and that is okay :) The materials will remain available online for anyone who wants more.

## anvi'o basics - working with a single genome

We will work with the datasets that we'll be using later, by following [this section of the metabolism tutorial](https://anvio.org/tutorials/fmt-mag-metabolism/#running-the-metabolism-suite-of-programs---a-single-genome-example). We follow it almost exactly as written, except that we point to a specific KEGG data directory.

1. activate the anvi'o environment

```
conda activate anvio
```

Try typing `anvi-` and then pressing TAB twice to see the list of all anvi'o programs.

2. make a directory in which to start running stuff

```
mkdir DAY_05_anvio
cd DAY_05_anvio/
```

Since we have two tutorials to go through today, we'll actually make two subdirectories to keep things organized:

```
mkdir METABOLISM
mkdir PANGENOMICS
```

Let's go into the metabolism subdir for now.

```
cd METABOLISM
```

3. obtain the first genome we will work with

This is an _Akkermansia muciniphila_ genome from NCBI.
```
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/731/575/GCF_009731575.1_ASM973157v1/GCF_009731575.1_ASM973157v1_genomic.fna.gz && \
gunzip GCF_009731575.1_ASM973157v1_genomic.fna.gz
```

4. try making a contigs database

Look at the help page first:

```
anvi-gen-contigs-database -h
```

You can see what this program expects from you, as well as the optional parameters you can provide.

5. take a look at the FASTA file

I want to know what is in this genome FASTA file. For instance, how many contigs do we have? Can anyone suggest how we could find that out?

<details>
<summary>Click here for the answer.</summary>

```
head GCF_009731575.1_ASM973157v1_genomic.fna

# how many contigs do we have?
grep -c '>' GCF_009731575.1_ASM973157v1_genomic.fna

# what are those contigs? a complete genome and its plasmid
grep '>' GCF_009731575.1_ASM973157v1_genomic.fna
```

</details>

Notice something about the contig header: it is very long, and contains some interesting characters. Not all downstream bioinformatics tools can deal with non-alphanumeric characters in sequence headers, and when they do work with them, sometimes they do weird (and inconsistent) things. If we leave those characters in, then at some point downstream in our analysis we might get errors, and then we'd have to go all the way back to the beginning. To avoid such frustration, anvi'o requires us to have only alphanumeric contig headers.

Let's try making a contigs database from this FASTA file anyway:

```
anvi-gen-contigs-database --contigs-fasta GCF_009731575.1_ASM973157v1_genomic.fna \
                          --project-name A_muciniphila \
                          --output-db-path A_muciniphila-CONTIGS.db \
                          --num-threads 2
```

You should get an error :) So, we need to reformat the FASTA file first.

6. reformat the FASTA file

Once again, let's take a look at the help page first. At the bottom, you should see a link to the program's documentation page.
```
anvi-script-reformat-fasta -h
```

We will need at least the parameter `--simplify-names` to fix the contig headers (aka "deflines"). A few other parameters might look useful as well: see `--report-file` and `--prefix`. They don't matter for our purposes today, but they can be critical when reformatting large datasets.

Then we can run the program for real:

```
anvi-script-reformat-fasta GCF_009731575.1_ASM973157v1_genomic.fna \
                           --simplify-names \
                           --output-file A_muciniphila.fa \
                           --report-file A_muciniphila_reformat_report.txt
```

Take a look at the terminal output. Anvi'o in general is very verbose, and likes to talk to you while it does stuff so that you know exactly what is going on.

Then you can take a look at the two output files:

```
head A_muciniphila.fa
head A_muciniphila_reformat_report.txt
```

We seem to be ready to move on now :)

7. make the contigs database

While this command is running, we can talk a bit about what anvi'o is doing behind the scenes.

```
anvi-gen-contigs-database --contigs-fasta A_muciniphila.fa \
                          --project-name A_muciniphila \
                          --output-db-path A_muciniphila-CONTIGS.db \
                          --num-threads 2
```

Once it is done (it should take <1 minute), you should see the contigs database in your working directory.

If you are curious and want to see what is actually inside the contigs database, you can download the `.db` file and open it with an application like [DB Browser for SQLite](https://sqlitebrowser.org/). Just to remove the element of mystery from this file type :)

From the command line, a quick way to identify what's in an anvi'o database is to use the program `anvi-db-info`:

```
anvi-db-info A_muciniphila-CONTIGS.db
```

You might notice that the output of that program has a section called `AVAILABLE FUNCTIONAL ANNOTATION SOURCES`, which indicates that we haven't annotated the gene sequences in this genome yet. We will need annotations later!
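If you'd rather stay in the terminal than download DB Browser, the `sqlite3` command-line client (if it is installed on your machine) can open the same kind of file. Here is a minimal sketch on a throwaway database rather than the real contigs database; the `self` table name mimics the key/value metadata table anvi'o uses, but treat that detail as an assumption:

```sh
# An anvi'o .db file is plain SQLite under the hood. Demo on a throwaway
# database; 'self' imitates anvi'o's metadata table (an assumption here).
sqlite3 demo.db "CREATE TABLE self (key text, value text);
                 INSERT INTO self VALUES ('project_name', 'A_muciniphila');"

# list the tables in the file
sqlite3 demo.db '.tables'

# and read a value back, roughly what anvi-db-info does for real databases
sqlite3 demo.db "SELECT value FROM self WHERE key = 'project_name';"
```

Pointing the same `.tables` command at `A_muciniphila-CONTIGS.db` shows you every table anvi'o maintains, with no extra software needed.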
So let's start a functional annotation program, and while it runs in the background, we can talk about the theory behind functional annotation strategies.

8. annotate the genes with KEGG KOfams

[KEGG](https://www.kegg.jp/) is a commonly-used database of genomic data, and they provide a very large set of annotation models (profile hidden Markov models, or pHMMs) for the 'KEGG Ortholog' protein families. You should have access to the KEGG database at `/opt/shared/ref-dbs/anvio/kegg`.

```
# with 4 threads, this takes about 5 minutes to process (perfect time to go over the theory)
anvi-run-kegg-kofams -c A_muciniphila-CONTIGS.db \
                     --kegg-data-dir /opt/shared/ref-dbs/anvio/kegg \
                     -T 4
```

You might see some output in the terminal about an annotation heuristic -- we will talk about this a bit later.

If you check the database info again, this time you should see the KEGG annotations:

```
anvi-db-info A_muciniphila-CONTIGS.db
```

9. questions so far?

Let's stop for a moment to check in about everything so far :)

## anvi'o basics - working with multiple genomes

We're briefly going to switch gears to our pangenomics dataset. Let's move to the other subdirectory:

```
cd ../PANGENOMICS
```

One of the most useful things that one can learn from workshops like these is (perhaps ironically) not how to use a specific bioinformatics tool, but how to use BASH tricks to level up your speed at processing data. We are going to create and annotate a bunch of contigs databases for several _Vibrio jascida_ genomes, which we will use later for our pangenome analysis. That means we need to repeat the steps we took above multiple times (once per genome), and we don't want to do that manually. Instead, we are going to use BASH loops and variables to process each genome automatically.

**NOTE:** we only have to do this part if there is interest (and enough time)! If not, we can all skip to number (6) and download the pre-built datapack of contigs databases.

1. download the FASTA files

Our primary genomes are isolates of _Vibrio jascida_ that were generated by the Microbial Diversity course at the Marine Biological Laboratory in 2018.

```
curl -L https://ndownloader.figshare.com/files/28965090 \
     -H "User-Agent: Chrome/115.0.0.0" \
     -o V_jascida_genomes.tar.gz

# unpack it
tar -zxvf V_jascida_genomes.tar.gz

# go into the new directory
cd V_jascida_genomes

# take a look at what we have in our hands
ls
```

There should be 8 FASTA files. 7 of these are the isolate genomes from the MBL Microbial Diversity course (named with numbers; see a table of where they came from [here](https://merenlab.org/tutorials/vibrio-jasicida-pangenome/#isolate-genomes)), and the 8th (`ref_scaffolds.fasta`) is a reference genome of _Vibrio jascida_ downloaded from NCBI.

2. prepare for looping

If we want to loop over each of our FASTA files, it would be convenient to have a way to access the name of each genome without having to parse it from the file name every single time. So let's parse the names out once, and save them into a text file that we can use later.

Luckily for us, all the FASTA files are consistently named (`*_scaffolds.fasta`), so all we need to do is extract the filename prefix. This code uses the pipe character `|`, which is special BASH syntax for "take the output of the previous command and send it as input to the next command". It also uses the BASH command `cut`, which splits a string on a delimiting character (in this case, the underscore '_') and extracts one or more of the resulting "fields" (in this case, only the first one).

```
ls *_scaffolds.fasta | cut -d '_' -f 1 > genomes.txt

# take a look at the result
cat genomes.txt
```

Now we have our list of genome names to loop over.

3. reformat all the FASTA files

Just like before, we want to make sure the contig headers for each genome are anvi'o-compatible.
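If BASH loops are new to you, the pattern can be tried risk-free with `echo` before unleashing it on real data. This is a toy sketch -- the file and genome names below are made up for the demo:

```sh
# a throwaway genomes list, mimicking genomes.txt
printf '12\n52\nref\n' > toy_genomes.txt

# loop over each row of the file; echo stands in for a real anvi'o command
for g in `cat toy_genomes.txt`
do
    # the braces keep BASH from looking for a variable named 'g_scaffolds'
    echo "would reformat ${g}_scaffolds.fasta"
done
```

Once the `echo` output looks right, you can swap in the real command with confidence.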
Here is a BASH loop to run `anvi-script-reformat-fasta` on each of our original FASTA files:

```
for g in `cat genomes.txt`
do
    echo "Working on $g ..."
    anvi-script-reformat-fasta ${g}_scaffolds.fasta \
                               --min-len 2500 \
                               --simplify-names \
                               -o ${g}_scaffolds_2.5K.fasta
done
```

The whole thing should take less than 30 seconds to run. A few notes:

- `g` is a variable. In each iteration of the loop, it holds a different value -- in this case, each row of the `genomes.txt` file
- we use `g` when we _create_ the BASH variable (in the loop header), and we use `$g` when we want to access the current _value_ of the variable (inside the loop). `${g}` is a special form of `$g` that is used when combining variable names with other strings that contain underscores (so that BASH doesn't think we are trying to access a variable called `$g_scaffolds`)
- the `echo` statement just prints output to the terminal for us to keep track of what is going on

Why are we using `--min-len 2500` here? It turns out that some of these genomes have very short contigs that might be low-quality or contaminating sequences. Gene sequences are often over 1 kb in length, so very short contigs are unlikely to have a full gene on them, which makes them useless for our pangenome anyway. Therefore, we get rid of all contigs <2,500 base pairs in length so that we only work with high-quality contigs likely to contain at least one gene.

4. make all the contigs databases

We make a similar loop, but this time for `anvi-gen-contigs-database`. Try it yourself! Use the same structure as the previous loop, and insert an `anvi-gen-contigs-database` command similar to what we did earlier (but with variables to name the input and output files).

<details>
<summary>Click here for the answer.</summary>

```
for g in `cat genomes.txt`
do
    echo "Working on $g ..."
    anvi-gen-contigs-database -f ${g}_scaffolds_2.5K.fasta \
                              -o V_jascida_${g}.db \
                              --num-threads 6 \
                              -n V_jascida_${g}
done
```

</details>

The loop should take about 15 seconds per genome to run, or about 2 minutes in total.

5. annotate everything!

Finally, we will loop over each genome and run a handful of annotation programs. This loop will take much longer to run, so we will just start it and move back to the lecture (and coffee break) while it does its thing.

```
for g in *.db
do
    anvi-run-hmms -c $g --num-threads 6
    anvi-run-ncbi-cogs -c $g --num-threads 6 --cog-data-dir /opt/shared/ref-dbs/anvio/cog
    anvi-scan-trnas -c $g --num-threads 6
done
```

What each of these programs annotates:

- `anvi-run-hmms`: single-copy core genes and ribosomal RNAs
- `anvi-run-ncbi-cogs`: functions in the NCBI COGs (Clusters of Orthologous Groups) database. Similar to KEGG Orthologs, but the functional categories are a bit more broad
- `anvi-scan-trnas`: tRNA genes (self-explanatory)

6. (OPTIONAL) How to download ready-to-go contigs databases for these genomes (in case you couldn't run the loops above)

In case you need a shortcut, the following datapack gives you access to anvi'o v8 contigs databases for all the genomes (with annotations).

```
curl -L https://figshare.com/ndownloader/files/41600745 \
     -H "User-Agent: Chrome/115.0.0.0" \
     -o V_jascida_contigs_dbs.tar.gz

# unpack it
tar -zxvf V_jascida_contigs_dbs.tar.gz

# go into the new directory
cd V_jascida_contigs_dbs
```

## Metabolism estimation - single genome

Let's go back to the [metabolism tutorial](https://anvio.org/tutorials/fmt-mag-metabolism/) and work through a few of the sections.

```
cd ~/DAY_05_anvio/METABOLISM
```

1. know the location and version of the KEGG data

Thanks to Mike Lee, we can all access a common KEGG database at `/opt/shared/ref-dbs/anvio/kegg/`. That means we don't individually have to run `anvi-setup-kegg-data`.
Instead, let's just take a quick look at what is in the KEGG data directory:

```
ls /opt/shared/ref-dbs/anvio/kegg/
```

You should see a file called `MODULES.db`. That's where all the information about KEGG's metabolic modules is stored. Just like any other anvi'o database, you can inspect it with `anvi-db-info`:

```
anvi-db-info /opt/shared/ref-dbs/anvio/kegg/MODULES.db
```

In the output, you will see a `hash` value of `a2b5bde358bb`. This is essentially like a version number -- the hash value summarizes the contents of the database. The hash value is how anvi'o makes sure the annotation data in your contigs database comes from the same KEGG snapshot as the one you use to estimate metabolism. It doesn't matter much for us right now, since we only have access to one snapshot anyway, but it is good to be aware of for the future.

2. estimate metabolism

We already ran our annotations earlier, so we can jump right into estimating metabolism. First, check out the help page:

```
anvi-estimate-metabolism -h
```

There are lots of options, but most of them won't apply to us today. The ones I want you to pay attention to are the 'OUTPUT' options. If you find yourself needing even more context or details, you can always check out the [program help page](https://anvio.org/help/8/programs/anvi-estimate-metabolism/).

Let's run the program.

```
anvi-estimate-metabolism -c A_muciniphila-CONTIGS.db \
                         -O A_muciniphila \
                         --output-modes modules \
                         --kegg-data-dir /opt/shared/ref-dbs/anvio/kegg/
```

We will probably spend quite a few minutes talking about the contents of the output file. The [main tutorial page](https://anvio.org/tutorials/fmt-mag-metabolism/#estimating-metabolism) includes some discussions about interpreting this output that you might find helpful. The output types are also explained extensively on [this documentation page](https://anvio.org/help/8/artifacts/kegg-metabolism/).
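Long-format output like this is easy to slice with the same BASH tools from earlier. A hedged sketch: column names and positions in the real file may differ between anvi'o versions (check your own header with `head -1`), so the demo below runs on a tiny mock table instead of the real output:

```sh
# mock long-format modules output (real files have many more columns)
printf 'module\tmodule_name\tpathwise_module_completeness\n' >  mock_modules.txt
printf 'M00001\tGlycolysis\t1.0\n'                           >> mock_modules.txt
printf 'M00002\tSome partial pathway\t0.40\n'                >> mock_modules.txt

# keep modules at or above anvi'o's default completeness threshold of 0.75
awk -F'\t' 'NR > 1 && $3 >= 0.75 {print $1}' mock_modules.txt   # → M00001
```

Swap in your real file name and the correct column number, and you have a one-liner list of (mostly) complete modules in a genome.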
There are other output types and aspects of this program covered in the main tutorial which we will not talk about today (copy numbers, user-defined pathways, etc). If any of it piques your interest, though, we can discuss it in the evening session.

## Metabolism estimation - multiple genomes

Now we are going to estimate metabolism for multiple genomes, and I will show you one way to compare the output visually by making a _metabolism heatmap_. This is the [section of the main tutorial](https://anvio.org/tutorials/fmt-mag-metabolism/#metabolism-estimation-and-enrichment-on-a-real-world-dataset) that we'll be working through.

1. download and unpack the new data

This datapack contains 40 metagenome-assembled genomes (MAGs) recovered from human gut metagenomes, specifically from individuals who donate samples for fecal microbiota transplant (FMT). They are already in contigs databases, and annotated.

```
curl -L https://figshare.com/ndownloader/files/42699766 \
     -H "User-Agent: Chrome/115.0.0.0" \
     -o FMT_MAGS_FOR_METABOLIC_ENRICHMENT.tar.gz

tar -xvf FMT_MAGS_FOR_METABOLIC_ENRICHMENT.tar.gz && cd FMT_MAGS_FOR_METABOLIC_ENRICHMENT/
```

I recommend reading [this section](https://anvio.org/tutorials/fmt-mag-metabolism/#a-dataset-of-high--and-low-fitness-mags) of the tutorial to learn more about what this dataset represents. The TL;DR is that we have two groups of MAGs (20 per group) and we want to compare their metabolic capacity.

2. estimate metabolism on all of them

Don't worry, we are done with BASH loops for today. `anvi-estimate-metabolism` can work with multiple genomes at once.

If you run `ls` in your working directory, you should see an 'external genomes' file (`external-genomes.txt`). Take a look at it:

```
head external-genomes.txt
```

You will see that the path to each contigs database is listed in the file. You can give a file like this to many anvi'o programs, and anvi'o will treat them all as individual inputs. The same goes for `anvi-estimate-metabolism`.
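This datapack ships with `external-genomes.txt` ready-made, but for your own projects you can build one with a short loop. A sketch under the assumption that your databases follow the `*.db` naming we used earlier; the `name`/`contigs_db_path` header is the two-column format anvi'o expects for external genomes files, and empty placeholder files stand in for real databases here:

```sh
# demo in a scratch directory with placeholder "databases"
mkdir -p extg_demo && cd extg_demo
touch V_jascida_12.db V_jascida_ref.db

# two-column file: genome name (file name minus .db) and absolute path
printf 'name\tcontigs_db_path\n' > my-external-genomes.txt
for db in *.db
do
    printf '%s\t%s\n' "${db%.db}" "$PWD/$db" >> my-external-genomes.txt
done

cat my-external-genomes.txt
cd ..
```

The `${db%.db}` expansion strips the `.db` suffix, which is why the genome names come out clean.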
I want to create a heatmap of module completeness scores so that we can visually compare our two groups of MAGs. This means that I need `anvi-estimate-metabolism` to output a matrix rather than a long-format file.

Take a look at the program help page again. Can you figure out how to run it on an external genomes file and generate an output matrix?

<details>
<summary>Click here for the answer.</summary>

```
anvi-estimate-metabolism -e external-genomes.txt \
                         -O FMT_MAG_metabolism \
                         --matrix-format \
                         --kegg-data-dir /opt/shared/ref-dbs/anvio/kegg/
```

</details>

When you run the program, you should get a lot of matrix files out. We want to compare module completeness scores, so which output file should we look at? (hint: which type of completeness metric should we most likely use for individual genomes?)

<details>
<summary>Click here for the answer.</summary>

We should look at pathwise completeness scores :)

```
head FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt
```

</details>

In the matrix, each row is a different metabolic module and each column is one of the 40 MAGs.

3. make a heatmap

Anvi'o can interactively visualize arbitrary data matrices in its so-called 'manual mode'. We're going to use it to make a heatmap from our matrix of completeness scores.

You might recall that anvi'o uses profile databases to store visualization settings for the interface. We don't have a profile database yet, but that is okay -- anvi'o will create one as long as we provide a name for the database with the `-p` parameter.

```
anvi-interactive -d FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt \
                 -p modules_heatmap.db \
                 --manual \
                 --title "FMT MAG METABOLISM HEATMAP"
```

The interactive interface runs in a web browser (Google Chrome usually works best). You can open it by directing your browser to the link `http://<YOUR_CLOUD_IP>:8080/`, where `<YOUR_CLOUD_IP>` is replaced by the IP address of your cloud instance.
Either copy-paste that link into your browser window, or click on your personal anvi'o link from [this table](https://github.com/mblstamps/stamps2025/wiki/Accessing-our-cloud-computers#attendees).

I will show you how to turn the resulting visualization into a nice-looking heatmap. But most of the instructions can be found [in this section](https://anvio.org/tutorials/fmt-mag-metabolism/#estimating-metabolism-for-these-mags) of the main tutorial webpage.

<details>
<summary>Click here for a quick reference of the additional heatmap commands.</summary>

Remember to save the state (visualization settings) before you close the interface each time.

```
# organize the data so that modules with similar completeness
# across the genomes are close to each other
anvi-matrix-to-newick FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt \
                      -o module_organization.newick

# flip the matrix, then organize the genomes so that those
# with similar metabolic capacity are close to each other
anvi-script-transpose-matrix FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt \
                             -o FMT_MAG_metabolism-module_pathwise_completeness-MATRIX-TRANSPOSED.txt
anvi-matrix-to-newick FMT_MAG_metabolism-module_pathwise_completeness-MATRIX-TRANSPOSED.txt \
                      -o mag_organization.newick

# import the module dendrogram into the profile database
anvi-import-items-order -i module_organization.newick \
                        -p modules_heatmap.db \
                        --name module_organization

# a little BASH trick to get the genome dendrogram into the database:
TREE=$(cat mag_organization.newick)  # copy the dendrogram into a variable called 'TREE'
echo -e "item_name\tdata_type\tdata_value\nmag_organization\tnewick\t$TREE" > layer_order.txt  # put it into a file

# import it into the database
anvi-import-misc-data -p modules_heatmap.db \
                      -t layer_orders \
                      layer_order.txt

# visualize again
anvi-interactive -d FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt \
                 -p modules_heatmap.db \
                 --manual \
                 --title "FMT MAG METABOLISM HEATMAP"

# add some information about the MAG groups to the profile database
anvi-import-misc-data -p modules_heatmap.db \
                      -t layers \
                      MAG_groups.txt

# visualize again
anvi-interactive -d FMT_MAG_metabolism-module_pathwise_completeness-MATRIX.txt \
                 -p modules_heatmap.db \
                 --manual \
                 --title "FMT MAG METABOLISM HEATMAP"
```

</details>

4. time for questions!

Before we break for lunch, there will hopefully be some time for questions. In case we skipped something on the main tutorial page that interested you, this could also be a good time to think about those topics. Here are some ideas:

- want to know what the MAGs are? we can run `anvi-estimate-scg-taxonomy`
- wondering what metabolic pathways are most associated with a given group of MAGs? we can run `anvi-compute-metabolic-enrichment`
- want to throw off the shackles of the KEGG database? we can talk about "user-defined pathways"
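One last BASH aside: the command-substitution (`$(...)`) and `echo -e` pattern used for the layer order file in the heatmap reference is broadly useful, and you can try it anywhere with a toy newick string (all file names below are made up for the demo):

```sh
# toy stand-in for mag_organization.newick
echo "(mag_01,(mag_02,mag_03));" > toy.newick

# $(...) captures the command's output into a variable
TREE=$(cat toy.newick)

# echo -e expands \t and \n into real tabs and newlines,
# producing a two-line, three-column TSV in one shot
echo -e "item_name\tdata_type\tdata_value\nmag_organization\tnewick\t$TREE" > toy_layer_order.txt

cat toy_layer_order.txt
```

Inspecting `toy_layer_order.txt` with `cat -A` (which makes tabs visible as `^I`) is a nice way to convince yourself the escapes really were expanded.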