2/27/2023 Lab Discussion with Dan Udwary, Christy Grettenberger - BGCs and distance decay
===

## Biosynthetic Gene Clusters (with Dan Udwary)

*(not "operons" bc there can be multiple translatable units)*

Q: Are some BGC's environment-specific?

## Branchwater discussion / Approach:

- search BGC w/ branchwater; analyze environmental metadata (and/or gps level)
- need to use/adapt ctb's environmental metadata SRA breakdown; some test/stats re: evaluating bias per environment

BGC size: 1 gene --> 200kb (large extreme)

[MIBiG database](https://mibig.secondarymetabolites.org/) -- individual nucleotide sequences that are well-annotated

[BiG-FAM](https://bigfam.bioinformatics.nl/home) - biosynthetic gene families

> This database version (1.0.0) is based on a GCF clustering of 1,225,071 BGCs taken from multiple publicly available sources (details). This large-scale analysis was performed using the BiG-SLiCE software (1.0.0) with an arbitrary clustering threshold (T=900.0), which resulted in the construction of 29,955 GCF models, each representing distinct protein domain and sequence features shared by the BGCs.

- BGC's don't just contain biosynthesis genes - also have transport genes, self-resistance genes, etc
    - sometimes called "pathogenicity islands"
- important that they are (as a group/cluster) laterally transferable - "laterally transferable cassettes"
- genes are not always in the same order
- sometimes you get recombination between parts of BGC's
- with circular genome, you often find BGC's in specific portions of the circular genome (3:00, 10:00, etc) -- Dan's hypothesis is that there's lots of recombination in some parts of the genome (~6:00), then as the BGC's mature they move to more stable parts of the genome.

How many BGCs?
- 0-50 per genome (50 in some Streptomyces)

(Note: Dan is building a Data Portal for BGC's)

Metagenome assembly is problematic for these -- often it's the secondary metabolism genes that are broken in genome assemblies / MAGs.
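
Re: the "some test/stats re: evaluating bias per environment" item in the branchwater approach above, here is one possible shape for that test, as a rough sketch only. It assumes we already have (a) the number of SRA metagenomes per environment category and (b) the number of those with a BGC hit; the category names and counts below are made up for illustration, and nothing here reflects branchwater's actual output format. It uses a one-sided hypergeometric tail to ask whether hits in an environment exceed what the SRA's own composition would predict.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n 'successes', N draws)."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible terms drop out on their own.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def environment_enrichment(hits_by_env, totals_by_env):
    """Compare observed BGC hits per environment with the count expected
    from how many metagenomes of each environment are in the database.

    Returns {env: (observed_hits, expected_hits, p_value)}.
    """
    M = sum(totals_by_env.values())  # all metagenomes searched
    N = sum(hits_by_env.values())    # all metagenomes with a hit
    results = {}
    for env, n_env in totals_by_env.items():
        k = hits_by_env.get(env, 0)
        expected = N * n_env / M
        results[env] = (k, expected, hypergeom_sf(k, M, n_env, N))
    return results


# Hypothetical counts, for illustration only.
totals = {"marine": 50, "soil": 100, "gut": 50}
hits = {"marine": 12, "soil": 3, "gut": 0}

enrichment = environment_enrichment(hits, totals)
for env, (obs, exp, p) in enrichment.items():
    print(f"{env}: observed={obs} expected={exp:.2f} p(enriched)={p:.3g}")
```

Caveats: testing many environments (or many BGCs) would need a multiple-testing correction, and metadata categories in the SRA are themselves noisy, so this only controls for how often each label appears, not for whether the label is right.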
BGC db is not v complete -- vast majority of BGCs that have been studied come from Actinomycetes.

**protein SRA -- would be helpful here too** --> could search in protein space instead.

Q: Would you expect BGC's to be environment-specific?
- some marine-specific, e.g. compounds that are heavily brominated are more likely to be from marine envs than from, e.g., soil

Can you get to function?
- sometimes, e.g. if structure resembles a commonly known antibiotic, it might have antibiotic properties

Co-occurrence important/interesting? Co-occurrence networks.

Why are people interested in BGC's??
- drug discovery (find similar)
- environmental reasons -- catalog/understand what products are being produced by the organism(s)

Metabolites specific to CRC
- strain-specific metabolites produced by biosynthetic gene clusters?
- disease-specific metabolome + strain-specific genomic signal
- hand-wavey statement: gene loss happens so rapidly in bacteria/archaea that if a gene is present, it's (more) likely to be used (active) than not.
- ecosystem function BGC -- e.g. organism might need certain product in certain environment (e.g. compound that might help with toxic/polluted envs, etc)

Ideas for specific environments/conditions to look for BGC's / to specifically investigate:
- immunocompromised ppl / inflammatory guts
- deserts have BGC's
- challenging environments -- usually more BGC's, organisms need to try to keep their niche/make compounds to avoid competition.
- Antibiotic resistance (not BGC, bc usually not biosynthetic)
    - BGC's associated with antibiotic-treated pigs, etc

## Spacegraphcats discussion / Approach

- in a single metagenome / set of metagenomes, search using a set of BGC's to pull out reads that cover / are physically connected to sequence that matched the BGC
    - can then assemble those
    - with this small a set of reads, could switch back to overlap-layout-consensus assembly approach to avoid issues k-mer strategies have with repeats - aggressive assembly
- could ask what the distance between two interesting genomic components is (e.g. how far apart on sgc graph)

More fun with Assembly graphs:
- if you have a repetitive sequence found in four different regions of the genome, this should let you figure out _how many_ repeats were in the genome based on how many connections there were to the repetitive region (as long as there's some genomic distinction)

### JGI Project

- sequencing 10k bioactive bacteria collection (Pfizer?)
- short reads bc so many.
- paired metabolomics.
- spacegraphcats --> assembly approach might help with this project.

BGC diversity across metagenomes:
- pull out graph nbhd across many metagenomes --> meta-pan BGC --> insight into evolutionary characteristics of the BGC
    - note: many ppl want to use BGC sequences to synthesize the secondary metabolite + assess/use the function. Seq error, other flaws --> often doesn't work. If we can grab the reads, that might help...
- evolutionary patterns --> find novel BGCs

cluster finders = lots of false negatives, not really used anymore
- EvoMining = take a set of genomes, try to figure out the core genome vs secondary metabolism

JGI approach -- whole genome assembly (as good as possible without worrying about secondary metabolism), then antiSMASH (rules based/based on domain content)
- antiSMASH scans for HMMs + does some addl analysis on domains to give you more info
- a couple machine learning methods are good, but they are all trained on antiSMASH data
- limitation with rules based = only find things you already have
- limitation w/machine learning = still only find things you trained on, also get false positives -- bc will pick up on regulatory sections, etc

* There's a Paired Omics database geared towards natural products...

## Microbial Ecology / distance-decay (with Christy Grettenberger)

- distance-decay relationship of organisms/strains from a specific environment, e.g. gut bacteria from wastewater treatment plant

Q: Can we track waste based on the presence/absence of certain organisms (gut bacteria) / by tracking community composition?
- PCA / NMDS

follow-up: can we apply machine learning --> learn components of a "contaminated" community + use it to detect?

Q: If microbes are dispersal-limited, communities will be less similar the farther apart they are. If not dispersal limited, might expect the same community / highly similar community.
- in areas with the same geochemistry; example: three ponds separated by 500k each. Dispersal-limited: communities are more different the further apart they are.
- Microbial communities -- Baas Becking (BB) = everything is everywhere and the environment selects
- Hot Springs, Antarctic Lakes etc... = dispersal limitation. Less extreme environments = haven't really found much dispersal limitation.

Community composition / intragenetic diversity across a range of expected high-low dispersal limitation.
- compare community structure / composition

## Potential Projects

- e.g. 100 marine metagenomes across a latitudinal transect
    - community jaccard similarity --> regression to look at community similarity across distance
    - w/in a species: ANI regression across distance

but really, two considerations:
- genetic distance <--> geographical distance
- genetic distance <--> geochemical distance (need to control for or consider geochemical properties)

PCA similarity

- for a genome/microbe of interest -- find all members (all containment) across SRA metagenomes
    - look for pattern across distance (containment/ANI)
    - how does environment affect this pattern? Any interesting outliers, etc?
    - look for presence/absence/utilization of diff accessory elements in specific environment
- associated spacegraphcats project:: build environment-specific pan-metagenome == define accessory elements/accessory k-mers only found in specific environments... (or learning approach:: are certain k-mers consistently associated with specific environments?) --> Actinomycetes (lots of BGCs)
- alt: community analysis::: presence/absence
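
The "community jaccard similarity --> regression across distance" idea above could be prototyped roughly as follows. This is a minimal sketch, not an agreed method: it assumes each sample comes with a latitude/longitude and a set of taxa (or k-mer hashes), uses great-circle (haversine) distance as the geographic distance, and fits a plain least-squares line. The sample names, coordinates, and taxa are invented for illustration, and geochemical distance is not modelled at all.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distance_decay(samples):
    """samples: {name: (lat, lon, set_of_taxa)}.

    Returns (slope, intercept, pairs), where pairs is a list of
    (distance_km, jaccard_similarity) for every pair of samples.
    """
    pairs = []
    for (_, s1), (_, s2) in combinations(samples.items(), 2):
        d = haversine_km(s1[0], s1[1], s2[0], s2[1])
        pairs.append((d, jaccard(s1[2], s2[2])))
    n = len(pairs)
    mx = sum(d for d, _ in pairs) / n
    my = sum(j for _, j in pairs) / n
    sxx = sum((d - mx) ** 2 for d, _ in pairs)
    sxy = sum((d - mx) * (j - my) for d, j in pairs)
    slope = sxy / sxx  # change in similarity per km
    return slope, my - slope * mx, pairs


# Hypothetical three-site latitudinal transect.
transect = {
    "site_A": (0.0, -150.0, {"a", "b", "c", "d"}),
    "site_B": (10.0, -150.0, {"b", "c", "d", "e"}),
    "site_C": (20.0, -150.0, {"d", "e", "f", "g"}),
}

slope, intercept, pairs = distance_decay(transect)
print(f"slope = {slope:.2e} similarity units per km")
```

A negative slope is the distance-decay signal. Pairwise points are not independent, so the fit is only descriptive; a permutation-based approach such as a Mantel test is the usual way to get significance, and a partial version of it would be one way to bring in the geochemical-distance consideration. The same skeleton would work for the within-species version by swapping `jaccard` for an ANI estimate.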