---
tags: BGC, biosynthetic gene clusters, 2023, sourmash
---
Biosynthetic Gene Cluster Project
===
## Background
**The abstract of [Meleshko et al](https://genome.cshlp.org/content/29/8/1352.short) nicely lays out the problem:**
>"Predicting biosynthetic gene clusters (BGCs) is critically important for discovery of antibiotics and other natural products. While BGC prediction from complete genomes is a well-studied problem, predicting BGCs in fragmented genomic assemblies remains challenging. The existing BGC prediction tools often assume that each BGC is encoded within a single contig in the genome assembly, a condition that is violated for most sequenced microbial genomes where BGCs are often scattered through several contigs, making it difficult to reconstruct them. The situation is even more severe in shotgun metagenomics, where the contigs are often short, and the existing tools fail to predict a large fraction of long BGCs."
### Why is our lab suited to this research question?
The dib-lab has two tools designed to tackle these sorts of challenges, **sourmash** and **spacegraphcats**. Both use k-mers, short subsequences of length k, to search for sequence shared between two datasets (in our case, a metagenome query and a reference set of BGCs). We also have workflows for k-mer based machine learning methods (more below).
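
To make the k-mer idea concrete, here is a minimal sketch in plain Python. The sequences are made up for illustration (not real BGC data), and the helper names `kmers` and `containment` are hypothetical, not part of sourmash or spacegraphcats.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, reference):
    """Fraction of the query's k-mers that also occur in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


# Toy sequences, invented for this example.
bgc = "ATGGCGTACGTTAGCCGATAGC"
metagenome_read = "TTTACGTTAGCCGATAGCAAA"

k = 5
bgc_kmers = kmers(bgc, k)
read_kmers = kmers(metagenome_read, k)
shared = bgc_kmers & read_kmers

print(f"{len(shared)} shared {k}-mers")
print(f"fraction of read k-mers found in the BGC: {containment(read_kmers, bgc_kmers):.2f}")
```

Real tools use longer k-mers and also treat a k-mer and its reverse complement as the same thing, which this toy version skips.
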
**[sourmash](https://sourmash.readthedocs.io/en/latest/):**
- Sourmash turns each sequence file into a "sketch" of k-mers and compares sketches against each other. **There are two special things about it:**
    1. Instead of using the entire dataset (all of the k-mers), we can select a subset of them for comparison. As long as we do this randomly (selection is not based on the base pair sequence, so you're not biasing the dataset towards some base pair sequences over others) and systematically (the same way for each dataset), the comparison remains accurate. This means sourmash can compare large datasets quickly.
    2. When we search, we first find all* k-mers shared between the two datasets. Then, in a post-processing step, we consider which reference matches have the largest coverage by the query k-mers, and assign groups of k-mers to the best match first. This means each k-mer in the dataset will be assigned to a single best-match reference BGC.
        - all* = "all sketched k-mers".
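
The two ideas above (systematic subsampling of k-mers, then best-match-first assignment) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not sourmash's implementation: the hash function is a stand-in from `hashlib`, reverse complements are ignored, the sequences are invented, and the names `hash_kmer`, `sketch`, and `greedy_assign` are hypothetical.

```python
import hashlib

MAX_HASH = 2**64


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in for the tool's real hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, scaled=2):
    """Keep only k-mers whose hash lands in the lowest 1/scaled of hash space.

    The same rule is applied to every dataset, so two sketches stay comparable.
    """
    threshold = MAX_HASH // scaled
    hashes = set()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h < threshold:
            hashes.add(h)
    return hashes


def greedy_assign(query, references):
    """Repeatedly pick the reference that covers the most unassigned query hashes."""
    remaining = set(query)
    candidates = dict(references)  # copy, so the caller's dict is untouched
    results = []
    while remaining and candidates:
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        overlap = candidates[best] & remaining
        if not overlap:
            break
        results.append((best, len(overlap)))
        remaining -= overlap  # each hash is assigned to one reference only
        del candidates[best]
    return results, remaining


# Toy reference "BGCs" and query, invented for this example.
bgc_refs = {
    "bgc_A": sketch("AAAAACCCCC", k=5, scaled=1),
    "bgc_B": sketch("CCCCCGGGGG", k=5, scaled=1),
}
query = sketch("AAAAACCCCCGG", k=5, scaled=1)

matches, unassigned = greedy_assign(query, bgc_refs)
for name, n_hashes in matches:
    print(name, n_hashes)
```

Here `scaled=1` keeps every k-mer so the toy example is easy to follow; a real sketch would use a much larger value so that only a small fraction of k-mers is retained. Note that the k-mer shared by both references is counted only for the first (best) match.
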
**[spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/):**
- The first step of many BGC workflows is to assemble the dataset so that you obtain longer contiguous sequences ('contigs') that likely make up the genome of the organism you sequenced. However, assembling a dataset may not always work well, meaning you can a) lose BGC that couldn't assemble, or b) lose BGC that assembled but were not assembled into a single contig, as required by many detection tools.
- The first step of assembly is to build an assembly graph from the reads, which shows how they all fit together. See the [dib-lab *De Bruijn* graph notes here](https://dib-lab.github.io/dib_rotation/08_bin_completion_with_spacegraphcats/).
- [spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/) is a tool that allows you to work with the assembly graph itself, the graph of how all reads relate to each other _before_ an assembler has made guesses to build contigs. Specifically, spacegraphcats builds an atlas (CATlas :) to search connected portions of the graph. We can then "query" the graph with a reference BGC, and find the region of the graph that has shared k-mers. The idea of 'magnet fishing' might be useful here -- you have some known sequence. You fish in the graph for that exact sequence and pull out that matching sequence _plus_ whatever was adjacent to it in the graph. That means, you have the exactly-matched k-mers _plus_ the reads those were present in, plus the reads that overlapped with parts of those reads, etc. **This is the part that may allow spacegraphcats to improve on other methods.**
- [Meleshko et al](https://genome.cshlp.org/content/29/8/1352.short) is an existing assembly graph approach for BGC prediction/detection, and would be a good reference to read. Their abstract continues (from above):
> "While it is difficult to assemble BGCs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs. We describe biosyntheticSPAdes, a tool for predicting BGCs in assembly graphs and demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomics data sets."
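
The "magnet fishing" idea described above can be illustrated with a toy k-mer graph. This is only a sketch of the concept: spacegraphcats builds and indexes its graph very differently (the CAtlas), while the code below is a plain breadth-first search. The reads, the query, the `radius` parameter, and the function names are all assumptions made for this example.

```python
from collections import deque


def kmers(seq, k):
    """List the k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Toy graph: k-mers that are adjacent in any read are linked."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def neighborhood(graph, query_seq, k, radius):
    """Return (exact matches, exact matches plus everything within `radius` edges)."""
    seeds = {km for km in kmers(query_seq, k) if km in graph}
    seen = set(seeds)
    frontier = deque((km, 0) for km in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand past the requested radius
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seeds, seen


# Toy reads: the first two overlap, the third is unrelated.
reads = ["AAACCCGG", "CCGGTTTA", "GGGGGGGA"]
graph = build_graph(reads, k=4)

# "Fish" with a short known sequence and pull out its surroundings.
seeds, pulled = neighborhood(graph, "ACCCG", k=4, radius=2)
print("exact matches:", sorted(seeds))
print("with neighbors:", sorted(pulled))
```

The query matches k-mers from the first read only, but the neighborhood also reaches into the second read through their overlap, while the unrelated third read is left behind.
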
**Machine learning:**
We have had success with random forest classifiers based on k-mers; see the [IBD paper](https://www.biorxiv.org/content/10.1101/2022.06.30.498290v1.abstract) and the [generalized distillerycats workflow](https://github.com/dib-lab/distillerycats).
Existing machine learning methods for BGC:
- [Hannigan et al., 2019](https://academic.oup.com/nar/article/47/18/e110/5545735) (DeepBGC: deep learning for BGC detection, with random forests for product class and activity prediction)
> Natural products represent a rich reservoir of small molecule drug candidates utilized as antimicrobial drugs, anticancer therapies, and immunomodulatory agents. These molecules are microbial secondary metabolites synthesized by co-localized genes termed Biosynthetic Gene Clusters (BGCs). The increase in full microbial genomes and similar resources has led to development of BGC prediction algorithms, although their precision and ability to identify novel BGC classes could be improved. Here we present a deep learning strategy (DeepBGC) that offers reduced false positive rates in BGC identification and an improved ability to extrapolate and identify novel BGC classes compared to existing machine-learning tools. We supplemented this with random forest classifiers that accurately predicted BGC product classes and potential chemical activity. Application of DeepBGC to bacterial genomes uncovered previously undetectable putative BGCs that may code for natural products with novel biologic activities. The improved accuracy and classification ability of DeepBGC represents a major addition to in-silico BGC identification.
- [Almeida et al 2020](https://academic.oup.com/nargab/article/2/4/lqaa098/6007553) (predicts on amino acid sequences)
> Fungal secondary metabolites (SMs) are an important source of numerous bioactive compounds largely applied in the pharmaceutical industry, as in the production of antibiotics and anticancer medications. The discovery of novel fungal SMs can potentially benefit human health. Identifying biosynthetic gene clusters (BGCs) involved in the biosynthesis of SMs can be a costly and complex task, especially due to the genomic diversity of fungal BGCs. Previous studies on fungal BGC discovery present limited scope and can restrict the discovery of new BGCs. In this work, we introduce TOUCAN, a supervised learning framework for fungal BGC discovery. Unlike previous methods, TOUCAN is capable of predicting BGCs on amino acid sequences, facilitating its use on newly sequenced and not yet curated data. It relies on three main pillars: rigorous selection of datasets by BGC experts; combination of functional, evolutionary and compositional features coupled with outperforming classifiers; and robust post-processing methods. TOUCAN best-performing model yields 0.982 F-measure on BGC regions in the Aspergillus niger genome. Overall results show that TOUCAN outperforms previous approaches. TOUCAN focuses on fungal BGCs but can be easily adapted to expand its scope to process other species or include new features.
## Research direction:
Explore assembly-free methods to reliably find BGCs in unassembled metagenome datasets. Some combination of sourmash, spacegraphcats, and k-mer machine learning methods may provide advantages over existing strategies.
> Beyond the functionality of our existing workflows and tools, we always have lots of ideas for improvements that may be useful here. For example, we could add HMM search into the spacegraphcats graph (this exists for other assembly graph tools). Or we could search with more flexible types of protein k-mers, etc.
Potential end points of this project:
1. Publish the method, ideally in automated/workflow-ed form. Yay, other people can use it and find more BGC than they were previously able to!
2. Do some interesting biology!
    - find a/some BGC(s) we care about, and some public data where that BGC might be found. Analyze the presence/absence of that BGC in relation to treatment, environment, etc. Or, look at the sequence evolution of that BGC, etc.
    - later: if all that works and is interesting, search ALL THE PUBLIC DATA for this BGC, and make some conclusions about what the results mean. See [branchwater](https://dib-lab.github.io/2022-paper-branchwater-software/) and its usage for [cyanobacteria](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1.abstract).
3. Don't do anything with it, but take your GitHub and sourmash and spacegraphcats skills forward to another project.
## Getting Started:
### References that may be useful
- "standard" HMM assembly-based methods, highly used
    - [original antiSMASH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125804/)
    - [antiSMASH 6.0](https://academic.oup.com/nar/article/49/W1/W29/6274535)
    - [PRISM 4](https://www.nature.com/articles/s41467-020-19986-1)
- assembly graph methods:
    - [Meleshko et al 2019](https://genome.cshlp.org/content/29/8/1352.short)
    - follow-up: [Meleshko et al 2022](https://academic.oup.com/bioinformatics/article/38/1/1/6354349)
- machine learning methods:
    - [Hannigan et al., 2019](https://academic.oup.com/nar/article/47/18/e110/5545735)
    - [Almeida et al 2020](https://academic.oup.com/nargab/article/2/4/lqaa098/6007553)
    - follow-up: [Almeida et al 2022](https://academic.oup.com/bioinformatics/article-abstract/38/16/3984/6619162)
A recent review: [Medema et al., 2020](https://pubs.rsc.org/en/content/articlehtml/2021/np/d0np00090f)
Refs with a little more biology:
- [Discovery of an Abundance of Biosynthetic Gene Clusters in Shark Bay Microbial Mats](https://www.frontiersin.org/articles/10.3389/fmicb.2020.01950/full) (uses antiSMASH)
- [gutSMASH predicts specialized primary metabolic pathways from the human gut microbiota](https://www.nature.com/articles/s41587-023-01675-1)
### A potential starting point
- Find / build good test data (pull from BGC papers/tools, ideally)
    - small set of BGCs that existing methods can reliably detect
    - 2-3 datasets that contain those BGCs
- Format those datasets + databases for sourmash + spacegraphcats usage
    - use the dib-lab rotation commands!
    - database will require `sourmash sketch`
    - metagenome dataset: `sourmash sketch` AND building a CAtlas/Atlas with spacegraphcats