STAMPS 2024 MAG concepts

# STAMPS 2024 MAG concepts

---

[toc]

---

## Overview

Metagenomics attempts to sequence all the DNA present in a sample. It can provide a window into the taxonomy and functional potential of a mixed community. There are a **ton** of things that can be done with metagenomics data, as this *non-exhaustive* overview figure begins to highlight:

<a href="https://astrobiomike.github.io/images/metagenomics_overview.png"><img src="https://astrobiomike.github.io/images/metagenomics_overview.png"></a>

This page is an introduction to some concepts about one of the things we can try to do with metagenomics data: recovering **m**etagenome-**a**ssembled **g**enomes (MAGs).

---

## Key concepts

### What are MAGs?

MAGs (metagenome-assembled genomes) are most often not the same thing as isolate genomes. An isolate genome is usually extremely close to an individual-genome assembly. For example: a microbe is isolated on a plate -> a colony from that plate is selected and grown overnight in a flask -> DNA is extracted and sequenced. Even then, some small number of mutations will exist in the population that grew from that starting isolation, but we consider what was sequenced and assembled to be a virtually identical population.

In many natural microbial communities, just as some variation arises overnight in a flask, some variation exists within what we'd consider "the same bug" (or "population"). Often, this level of variation is collapsed when we perform an assembly and recover a MAG. So it is best to think of MAGs as something like composite representative genomes of *very* closely related genomic lineages. This isn't necessarily a problem; MAGs are still extremely valuable!

### Why are MAGs valuable?
Relatively recently, the recovery of MAGs has become a very powerful approach in microbial ecology, drastically expanding the known Tree of Life by granting us genomic access to as-yet unculturable microbial populations and expanding our understanding of the breadth of biodiversity on Earth (e.g., [Hug et al. 2016](https://www.nature.com/articles/nmicrobiol201648); [Parks et al. 2017](https://www.nature.com/articles/s41564-017-0012-7)).

**Basically, MAGs are opening our eyes to the real extent of microbial genomic diversity out there 🎉**

### What is a bin vs a MAG?

The process of recovering MAGs typically involves assembly of metagenomic data (taking shorter reads and stitching them together to create longer contigs) and then "binning" (grouping) contigs together based on characteristics such as sequence composition and relative abundance.

After we "bin" contigs together, we typically call the groups of contigs we created "bins". Then, based on passing some quality thresholds, we will often say high-quality bins are MAGs. This is an arbitrary delineation with no single agreed-upon cutoff, but a commonly used one is >90% estimated completeness and <10% estimated redundancy (sometimes called contamination). Bins that meet those criteria might be "promoted" to being called MAGs. Other commonly used metrics are listed in [Bowers et al. 2017](https://www.nature.com/articles/nbt.3893).

The important part is to be clear about what thresholds and methods we're using for quality assessment and MAG delineation.

### How do we recover MAGs conceptually?
A typical process for recovering bins is depicted in the slide below (the numbers 5 and 2 represent "abundances" of those unreasonably perfect genomes following DNA extraction):

![image](https://hackmd.io/_uploads/rkUb9cA_R.png)

[PDF download](https://ndownloader.figshare.com/files/16241105); [Keynote](https://ndownloader.figshare.com/files/16240550); [PowerPoint](https://ndownloader.figshare.com/files/16240814)

This can be done with an individual sample, but it becomes much more powerful and more accurate when multiple samples can be co-assembled (when appropriate). This is because instead of just having coverage ("abundance") information from one sample, we can then have differential coverage information across many samples for each starting genomic population. This can improve our ability to properly bin the assembled contigs together.

Deciding when to co-assemble is not straightforward; some discussion on that can be found [here](https://astrobiomike.github.io/genomics/metagen_anvio#what-is-a-co-assembly).

### MAGs are almost always a subset of the data

With metagenomics data, we can try to look at all the data we recovered by working directly with the starting reads, for example by trying to taxonomically classify and functionally annotate them just as they are. A plus to this is that we're trying to look at "all" the data, but a downside is that it can be difficult to taxonomically classify and functionally annotate what are most frequently short reads.

We can also try to perform an assembly with the reads we collected in an effort to get longer contigs, and then do things like taxonomically classify our contigs, call genes, and functionally annotate those genes. Taxonomic classification and functional annotation generally improve with longer sequences, so that is a plus to assembly. However, often not all of our reads will assemble. So an assembly is already a subset of our starting data.

It is common to map all the reads we used to make an assembly back to that very assembly, in order to get an idea of how much of our starting data the assembly represents. Maybe 90% of our reads map back to our assembly, which tells us it represents roughly 90% of our reads. It is not uncommon for some environments to yield assemblies that capture much lower proportions of our starting reads.

When we bin contigs, we are then getting a subset of our assembly (only those contigs that binned successfully into groups). And if we then focus only on the bins that surpass the quality thresholds we use to call them MAGs, we're looking at a subset of even those.

Just as we map our starting reads to the assembly to get an idea of how much of our starting data it represents, we can map our reads to our recovered MAGs to get an idea of how much of our starting data they encompass.

The thing to keep in mind is that an assembly is usually a subset of the starting reads, and bins/MAGs are then a subset of the assembly. Which is totally fine! If we want, we can examine both our reads and our MAGs. They are just different levels of resolution into a system :+1:

---

## Some workflows for MAG-recovery

- In my [bit package](https://github.com/AstrobioMike/bit/tree/master?tab=readme-ov-file#bioinformatics-tools-bit) I have an assembly-focused metagenomics workflow that includes MAG recovery and characterization that can be found [**here**](https://github.com/AstrobioMike/bit/tree/master/workflows/metagenomics-wf#overview).
- [metaWRAP](https://github.com/bxlab/metaWRAP?tab=readme-ov-file#metawrap---a-flexible-pipeline-for-genome-resolved-metagenomic-data-analysis) runs a few binning programs and tries to identify the best bins out of the possibilities.

---
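
To make two of the key concepts above a bit more concrete (promoting bins to MAGs based on quality thresholds, and each step being a subset of the one before it), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Bin` class, the `is_mag` and `percent` helpers, and all of the numbers are made up, and the >90% / <10% cutoffs are just the example thresholds mentioned earlier, not a standard. It also assumes each read is counted at most once, which real mapping results may not guarantee.

```python
from dataclasses import dataclass

# Example cutoffs from the text; a common convention, not a fixed standard.
MIN_COMPLETENESS = 90.0  # estimated percent completeness
MAX_REDUNDANCY = 10.0    # estimated percent redundancy ("contamination")


@dataclass
class Bin:
    """Hypothetical summary of one bin (all values are made up below)."""
    name: str
    completeness: float  # estimated percent completeness
    redundancy: float    # estimated percent redundancy
    mapped_reads: int    # reads that mapped to this bin's contigs


def is_mag(b, min_completeness=MIN_COMPLETENESS, max_redundancy=MAX_REDUNDANCY):
    """Return True if a bin passes the example quality thresholds."""
    return b.completeness > min_completeness and b.redundancy < max_redundancy


def percent(part, whole):
    """Percent of `whole` represented by `part` (0.0 if `whole` is 0)."""
    return 100.0 * part / whole if whole else 0.0


# Made-up numbers purely for illustration.
total_reads = 1_000_000
reads_mapped_to_assembly = 900_000
bins = [
    Bin("bin_1", completeness=97.2, redundancy=1.4, mapped_reads=300_000),
    Bin("bin_2", completeness=91.0, redundancy=12.5, mapped_reads=200_000),
    Bin("bin_3", completeness=55.3, redundancy=0.8, mapped_reads=100_000),
]

# Bins passing the thresholds get "promoted" to MAGs.
mags = [b for b in bins if is_mag(b)]

# Each level is a subset of the one before it: reads -> assembly -> bins -> MAGs.
reads_in_bins = sum(b.mapped_reads for b in bins)
reads_in_mags = sum(b.mapped_reads for b in mags)

print("MAGs:", [b.name for b in mags])
print(f"assembly represents {percent(reads_mapped_to_assembly, total_reads):.1f}% of reads")
print(f"bins represent      {percent(reads_in_bins, total_reads):.1f}% of reads")
print(f"MAGs represent      {percent(reads_in_mags, total_reads):.1f}% of reads")
```

With these made-up values only `bin_1` passes both thresholds, and the percentages shrink at each step, which is the "subset of a subset" idea. In practice the completeness and redundancy estimates and the mapped-read counts would come from whatever quality-assessment and read-mapping tools we are using, and we should report which thresholds we applied.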