2/27/2023 Lab Discussion with Dan Udwary, Christy Grettenberger - BGCs and distance decay
===
## Biosynthetic Gene Clusters (with Dan Udwary)
*(not "operons" bc there can be multiple translatable units)*
Q: Are some BGC's environment-specific?
## Branchwater discussion / Approach:
- search BGC w/ branchwater; analyze environmental metadata (and/or gps level)
- need to use/adapt ctb's environmental metadata SRA breakdown; some test/stats re: evaluating bias per environment
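One possible way to do the "bias per environment" check, as a minimal sketch: compare the environments of BGC hits against the background environment breakdown of everything searched, and ask whether an environment is over-represented. The counts, environment labels, and function names below are hypothetical placeholders, not real branchwater output, and the hypergeometric test is just one reasonable choice of statistic.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K belong to the category."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

def environment_enrichment(background, hits):
    """background: env -> number of metagenomes searched (assumed SRA breakdown).
    hits: env -> number of metagenomes with a match to the BGC.
    Returns env -> (observed, expected, fold enrichment, one-sided p-value)."""
    N = sum(background.values())
    n = sum(hits.values())
    results = {}
    for env, K in background.items():
        k = hits.get(env, 0)
        expected = n * K / N
        fold = k / expected if expected else float("nan")
        results[env] = (k, expected, fold, hypergeom_sf(k, N, K, n))
    return results

# Hypothetical counts, for illustration only.
background = {"marine": 200, "soil": 500, "gut": 300}
hits = {"marine": 12, "soil": 5, "gut": 3}

results = environment_enrichment(background, hits)
for env, (k, expected, fold, p) in results.items():
    print(f"{env}: observed={k} expected={expected:.1f} fold={fold:.2f} p={p:.3g}")
```

Testing many BGCs and environments this way would need multiple-testing correction, and the background counts themselves inherit whatever sampling bias the SRA metadata has.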
BGC size: 1 gene --> 200kb (large extreme)
[MIBiG database](https://mibig.secondarymetabolites.org/)
-- individual nucleotide sequences that are well-annotated
[BiG-FAM](https://bigfam.bioinformatics.nl/home) - biosynthetic gene families
> This database version (1.0.0) is based on a GCF clustering of 1,225,071 BGCs taken from multiple publicly available sources (details). This large-scale analysis was performed using the BiG-SLiCE software (1.0.0) with an arbitrary clustering threshold (T=900.0), which resulted in the construction of 29,955 GCF models, each representing distinct protein domain and sequence features shared by the BGCs.
- BGCs don't just contain biosynthesis genes - also have transport genes, self-resistance genes, etc - sometimes called "pathogenicity islands"
- important that they are (as a group/cluster) laterally transferable - "laterally transferable cassettes"
- genes are not always in the same order
- BGCs - sometimes you get recombination between parts of BGC's.
- with a circular genome, you often find BGCs in specific portions of the circular genome (3:00, 10:00, etc) -- Dan's hypothesis is that there's lots of recombination in some parts of the genome (~6:00), then as they mature they move to more stable parts of the genome.
How many BGCs?
- 0-50 (50 in some Streptomyces)
(Note: Dan is building a Data Portal for BGC's)
Metagenome assembly is problematic for these -- often it's the secondary metabolism genes that are broken in genome assemblies / MAGs.
BGC db is not v complete -- vast majority of BGCs that have been studied come from Actinomycetes.
**protein SRA -- would be helpful here too**
--> could search in protein space instead.
Q: Would you expect BGCs to be environment-specific?
- some marine-specific, e.g. compounds that are heavily brominated more likely to be from marine envs rather than, e.g. soil
Can you get to function?
- sometimes, e.g. if structure resembles commonly known antibiotic, it might have antibiotic properties
Co-occurrence important/interesting? Co-occurrence networks.
Why are people interested in BGC's??
- drug discovery (find similar)
- environmental reasons -- catalog/understand what products are being produced by the organism(s)
Metabolites specific to crc
- strain-specific metabolites produced by biosynthetic gene clusters?
- disease-specific metabolome + strain-specific genomic signal
hand-wavey statement: gene loss happens so rapidly in bacteria/archaea that if a gene is present, it's (more) likely to be used (active) than not.
- ecosystem function BGC -- e.g. organism might need certain product in certain environment (e.g. compound that might help with toxic/polluted envs, etc)
Ideas for specific environments/conditions to look for BGC's / to specifically investigate:
- immunocompromised ppl / inflammatory guts
- deserts have - BGC's
- challenging environments -- usually more BGC's, organisms need to try to keep their niche/make compounds to avoid competition.
- Antibiotic resistance (not BGC, bc usually not biosynthetic)
- BGC's associated with antibiotic-treated pigs, etc
## Spacegraphcats discussion / Approach
- in a single / set of metagenomes, search using a set of BGC's to pull out reads that cover / are physically connected to sequence that matched the BGC
- can then assemble those
- with this small a set of reads, could switch back to an overlap-layout-consensus assembly approach to avoid the issues k-mer strategies have with repeats
- aggressive assembly
- could ask what the distance between two interesting genomic components is (e.g. how far on sgc graph)
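To make the "how far on the graph" question concrete, here is a toy sketch using breadth-first search over an adjacency dict. This is not the spacegraphcats API; the node names and the `graph_distance` helper are made up for illustration, and a real assembly graph would have unitigs (or sgc neighborhoods) as nodes.

```python
from collections import deque

def graph_distance(adjacency, start, goal):
    """Number of edges on a shortest path from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr == goal:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

# Hypothetical undirected graph: each node lists its neighbors.
toy_graph = {
    "bgc_core": ["u1"],
    "u1": ["bgc_core", "u2", "u3"],
    "u2": ["u1", "transporter"],
    "u3": ["u1"],
    "transporter": ["u2"],
    "island": [],
}

print(graph_distance(toy_graph, "bgc_core", "transporter"))
print(graph_distance(toy_graph, "bgc_core", "island"))
```

Distance in edges is only a proxy for genomic distance; weighting edges by unitig length would be closer to base pairs.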
More fun with Assembly graphs:
- if you have a repetitive sequence found in four different regions of the genome, this should let you figure out _how many_ repeats were in the genome based on how many connections there were to the repetitive region (as long as there's some genomic distinction)
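A toy version of the repeat-counting idea, assuming a directed graph where the collapsed repeat is one node and each distinct flanking sequence is its own node. The edge list and the `min_repeat_copies` name are hypothetical; the count is only a lower bound, since copies with identical flanks collapse together (the "genomic distinction" caveat above).

```python
def min_repeat_copies(edges, repeat_node):
    """Lower bound on repeat copy number from distinct flanking nodes.

    edges: iterable of (from_node, to_node) pairs in a directed assembly graph.
    Self-loops on the repeat node (tandem copies) are ignored here.
    """
    preds = {u for u, v in edges if v == repeat_node and u != repeat_node}
    succs = {v for u, v in edges if u == repeat_node and v != repeat_node}
    return max(len(preds), len(succs))

# Hypothetical graph: repeat "R" sits in four distinct genomic contexts.
edges = [
    ("left1", "R"), ("R", "right1"),
    ("left2", "R"), ("R", "right2"),
    ("left3", "R"), ("R", "right3"),
    ("left4", "R"), ("R", "right4"),
    ("right1", "left2"),  # unrelated edge elsewhere in the graph
]

print(min_repeat_copies(edges, "R"))
```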
### JGI Project - sequencing 10k bioactive bacteria collection (Pfizer?)
- short reads bc so many.
- paired metabolomics.
- spacegraphcats --> assembly approach might help with this project.
BGC diversity across metagenomes:
- pull out graph nbhd across many metagenomes --> meta-pan BGC --> insight into evolutionary characteristics of the BGC
note: many ppl want to use BGC sequences to synthesize the secondary metabolite + assess /use the function. Seq error, other flaws --> often doesn't work. If we can grab the reads, that might help...
- evolutionary patterns --> find novel BGCs
cluster finders = lots of false negatives, not really used anymore
- EvoMining = take a set of genomes, try to figure out the core genome vs secondary metabolism
JGI approach -- whole genome assembly (as good as possible without worrying about secondary metabolism), then antiSMASH (rules-based, i.e. based on domain content) -- scans with HMMs + does some addl analysis on domains to give you more info
- a couple machine learning methods are good, but they are all trained on antiSMASH data
- limitation with rules based = only find things you already have
- limitation w/machine learning - still only find things you trained on, also get false positives -- bc will pick up on regulatory sections, etc
* There's a Paired Omics database geared towards natural products...
## Microbial Ecology / distance-decay (with Christy Grettenberger)
- distance-decay relationship of organisms/strains from a specific environment, e.g. gut bacteria from wastewater treatment plant
Q: Can we track waste based on the presence/absence of certain organisms (gut bacteria) / by tracking community composition?
- PCA /NMDS
follow-up: can we apply machine learning --> learn components of a "contaminated" community + use it to detect?
Q: If microbes are dispersal-limited, communities will be more different the farther apart they are. If not dispersal-limited, might expect the same / a highly similar community.
- in areas with the same geochemistry, we
example: three ponds separated by 500k each. Dispersal-limited:: communities are more different the further apart they are.
- Microbial communities -- Baas Becking (BB) hypothesis = everything is everywhere, but the environment selects
- Hot Springs, Antarctic Lakes etc... = dispersal limitation. Less extreme environments = haven't really found much dispersal limitation.
Community composition / intragenetic diversity across a range of expected high-low dispersal limitation.
- compare community structure /composition
## Potential Projects
- e.g. 100 marine metagenomes across a latitudinal transect
- community Jaccard similarity --> regression to look at community similarity across distance
- w/in a species: ANI regression across distance
but really, two considerations:
genetic distance <--> geographical distances
genetic distance <--> geochemical distance
(need to control for or consider geochemical properties)
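A minimal sketch of the similarity-vs-distance regression, assuming each sample is a (lat, lon) location plus a set of taxa or k-mer hashes. The sample names, coordinates, and set contents are invented for illustration. Only geographic distance is used here; a geochemical distance matrix could be swapped in or added as a second predictor.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km = mean Earth radius

def least_squares(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def distance_decay(samples):
    """samples: name -> ((lat, lon), set of features). Returns (slope, intercept)."""
    xs, ys = [], []
    for a, b in combinations(samples, 2):
        (loc_a, set_a), (loc_b, set_b) = samples[a], samples[b]
        xs.append(haversine_km(loc_a, loc_b))
        ys.append(jaccard(set_a, set_b))
    return least_squares(xs, ys)

# Hypothetical transect: four samples at 10-degree latitude steps.
samples = {
    "s1": ((0.0, -30.0), {"a", "b", "c", "d"}),
    "s2": ((10.0, -30.0), {"a", "b", "c", "e"}),
    "s3": ((20.0, -30.0), {"a", "b", "e", "f"}),
    "s4": ((30.0, -30.0), {"a", "e", "f", "g"}),
}

slope, intercept = distance_decay(samples)
print(f"slope per km = {slope:.2e}, intercept = {intercept:.2f}")
```

A negative slope is the distance-decay signal. Pairwise comparisons are not independent observations, so significance is usually assessed with a permutation approach such as a Mantel test rather than the ordinary regression p-value.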
PCA similarity
- for a genome/microbe of interest
-- find all members (all containment) across SRA metagenomes
- look for pattern across distance (containment/ANI)
- how does environment affect this pattern? Any interesting outliers, etc?
- look for presence/absence/utilization of diff accessory elements in specific environment
- associated spacegraphcats project:: build environment-specific pan-metagenome == define accessory elements/accessory k-mers only found in specific environments... (or learning approach:: are certain k-mers consistently associated with specific environments?)
--> Actinomycetes (lots of BGCs)
- alt: community analysis::: presence/absence
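For the containment idea in the list above, a toy sketch with exact k-mer sets. Real searches (sourmash / branchwater) work on subsampled hash sketches with much larger k; the sequences, `k=5`, and helper names here are assumptions for illustration, and reverse complements are ignored.

```python
def kmers(seq, k):
    """All distinct k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(query, target):
    """Fraction of query k-mers found in target: |Q & T| / |Q|."""
    return len(query & target) / len(query) if query else 0.0

# Hypothetical sequences: the "genome" is fully embedded in the "metagenome".
genome = kmers("ATGGCGTACG", 5)
metagenome = kmers("TTATGGCGTACGAA", 5)

print(containment(genome, metagenome))   # genome in metagenome
print(containment(metagenome, genome))   # metagenome in genome
```

Containment is asymmetric, which is why it suits the "find this genome inside a large metagenome" question better than Jaccard does.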