---
tags: BRAILLE
title: BRAILLE update (3-Feb-2021)
---
[toc]
---
>**Interactive function plots and tables are [here](https://astrobiomike.shinyapps.io/braille-mg/)**.
---
> Previous update docs
> - [https://hackmd.io/@astrobiomike/BRAILLE-notes-12-Dec-2020](https://hackmd.io/@astrobiomike/BRAILLE-notes-12-Dec-2020)
# BRAILLE update (3-Feb-2021)
Metagenomic processing repository with data and code is at [OSF here](https://osf.io/uhk48/wiki/home/). It's currently not up to date with everything below.
# Working on key metabolism overview figures
Here is an example for [Nitrogen metabolism](https://www.genome.jp/kegg-bin/show_pathway?map00910):
<a href="https://i.imgur.com/gnyEXPy.png"><img src="https://i.imgur.com/gnyEXPy.png"></a>
I'm not sure yet how easy it will be to put together the others we want, but I will move forward with the OSF pages I put together for Methane and Sulfur next.
[This paper](https://bmcmicrobiol.biomedcentral.com/articles/10.1186/s12866-019-1521-8) Diana sent along has a nice overview of carbon fixation, nitrogen, methane, and sulfur metabolisms as Figure 6:
<a href="https://i.imgur.com/OB8qugU.jpg"><img src="https://i.imgur.com/OB8qugU.jpg"></a>
# Implemented a better normalization method
In ironing out how to do differential abundance testing, I incorporated a more sophisticated normalization procedure than the coverage-per-million I was using before. Coverage-per-million is great for human interpretability, but less ideal for other things like clustering and differential abundance testing, so I generated median-ratio normalized data for both of those. [This page](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) gives a very nice explanation of this approach with example data (under the section "DESeq2-normalized counts: Median of ratios method"), so it's probably clearer to look there, but here's the gist:
1. For each unit (our KOs in this case), calculate the geometric mean for that KO across all the samples (this new column of geometric means for all KOs becomes a pseudo-reference).
2. For each sample, divide each KO's coverage by the pseudo-reference value for that KO (giving us a ratio for each sample's KOs over that KO's geometric mean).
3. For each sample, take the median of those ratios across all of its KO terms; that median becomes the normalization factor for that sample.
4. Divide each sample's KO coverages by the sample-specific normalization factors.
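The steps above can be sketched in code. This is just a minimal illustration (function and variable names are mine), assuming a KO-by-sample coverage table; following the DESeq2 convention, KOs with a zero in any sample are left out of the factor estimation, since their geometric mean is undefined in log space:

```python
import numpy as np
import pandas as pd

def median_ratio_normalize(cov):
    """Median-of-ratios normalization on a KO-by-sample coverage table.

    `cov` is a pandas DataFrame: rows are KO terms, columns are samples.
    Returns the normalized table and the per-sample normalization factors.
    """
    log_cov = np.log(cov.where(cov > 0))              # zeros become NaN
    # step 1: pseudo-reference = per-KO geometric mean (log-space average),
    # restricted to KOs observed in every sample
    log_geo = log_cov.mean(axis=1)
    log_geo[log_cov.isna().any(axis=1)] = np.nan
    # step 2: each sample's KO coverage over the pseudo-reference
    log_ratios = log_cov.sub(log_geo, axis=0)
    # step 3: per-sample median of those ratios = normalization factor
    size_factors = np.exp(log_ratios.median(axis=0))
    # step 4: divide each sample's coverages by its factor
    return cov.div(size_factors, axis=1), size_factors
```

As a sanity check on the behavior: if one sample's coverages are a uniform 2x of another's, its normalization factor comes out 2x as large and the normalized values match.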
This minimally altered our hierarchical clustering, but I also cleaned up that figure and added sample-characteristic icons:
<a href="https://i.imgur.com/mNFUM1v.png"><img src="https://i.imgur.com/mNFUM1v.png"></a>
## Coverages across samples
Here's an example of how we can display relative coverages across the samples (similar to our MAG overview, but now about coverage in the samples rather than presence/absence in a genome):
<a href="https://i.imgur.com/gGugK9R.png"><img src="https://i.imgur.com/gGugK9R.png"></a>
Each row is a "module" of a "pathway" (the key on the right holds pathway names and IDs), and each module is a summation of the genes within that module.
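That module-level summation boils down to a grouped sum over the KO-by-sample coverage table. A minimal sketch (the table, mapping, and the specific KO/module IDs here are just illustrative placeholders, not our actual data):

```python
import pandas as pd

# Hypothetical KO-by-sample coverage table
ko_cov = pd.DataFrame(
    {"sample1": [5.0, 3.0, 2.0], "sample2": [1.0, 4.0, 6.0]},
    index=["K00370", "K00371", "K00958"],
)

# Hypothetical mapping of each KO to its KEGG module
ko_to_module = pd.Series(
    {"K00370": "M00530", "K00371": "M00530", "K00958": "M00596"}
)

# Sum the KO coverages within each module, per sample
module_cov = ko_cov.groupby(ko_to_module).sum()
```

The same grouped sum against a pathway mapping would give pathway-level values.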
I just threw these 4 together, but will probably add more and then filter out less interesting rows that we don't need/want to include in an overview summary figure (we could also decide what to include based on differential abundance testing, discussed next).
Let me know if there are any KEGG functions/modules/pathways you'd like to see in there! Any individual function can still be plotted at our [shiny app here](https://astrobiomike.shinyapps.io/braille-mg/).
## Differential abundance testing
I started looking into this; it's what led me to the better normalization we're using now. DESeq2 is the gold standard for differential expression analysis. It's commonly used for amplicon data too, and could be used for metagenomics data as well, but *only* if our units are the same across samples. DESeq2 doesn't normalize for gene length because it's built to work on the same unit (the same gene) across all samples, so length doesn't matter; it takes raw counts and handles everything else. We couldn't do that because we don't have the same genes in all samples, so we had to summarize things based on functional annotations. That means we no longer have the same units across all samples, and our gene lengths may vary.
So we're performing their median-ratio normalization procedure and then running the testing on that. With lots of KO terms and only 2 replicates for each of our qualitative groups, not much is coming out of this right now. I'm doing a Kruskal-Wallis test (a non-parametric analogue of ANOVA), followed by Dunn's post-hoc test, which leaves us pretty under-powered, but the data aren't normally distributed and are heteroskedastic, so I don't think we have much of a choice. I have a meeting with one of my stats buddies on Friday and will pick their brain.
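A minimal sketch of the per-KO testing step (function and variable names are mine; Dunn's post-hoc step, available e.g. through the `scikit-posthocs` package, isn't shown), assuming a KO-by-sample array of median-ratio-normalized coverages and a label per sample:

```python
import numpy as np
from scipy import stats

def kruskal_per_row(values, groups):
    """Run a Kruskal-Wallis test on each row (KO) of `values`,
    splitting the columns (samples) by the labels in `groups`,
    then Benjamini-Hochberg adjust the p-values."""
    labels = np.unique(groups)
    pvals = np.array([
        stats.kruskal(*(row[groups == g] for g in labels)).pvalue
        for row in values
    ])
    # Benjamini-Hochberg adjustment (step-up FDR procedure)
    n = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    padj = np.empty(n)
    padj[order] = np.clip(adj, 0, 1)
    return pvals, padj
```

With only 2 samples per group, the rank-based statistic has very little resolution, which is consistent with nothing clearing a standard adjusted-p threshold.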
As-is, nothing comes through with adjusted p-values ≤ 0.1. But we also aren't really testing any hypotheses. We can look at the most different KOs, even if they aren't significant by any standard threshold, and decide if we want to mention or look at them specifically.
We can also run this at the summed module level (or pathway), rather than the individual gene level. I'll probably do this just to see what that looks like too.
## Chatted with Maggie
She is going to send me their current paper on carbon ratios. We'll see if we can tie that to the carbon-fixation coverages we have, or if any of our recovered MAGs could be involved with carbon fixation in the samples where their ratios suggest C-fixation is going on.