---
tags: BRAILLE
---

# Exploration of functions across our samples

[toc]

> **This page details how to access and use a web-based environment that lets us explore our metagenomic data based on our [KEGG Orthology (KO)](https://www.genome.jp/kegg/ko.html) annotations. The primary processing code is detailed in the [Gene-level functional analysis in R section](https://osf.io/uhk48/wiki/2a.%20Gene-level%20functional%20analysis%20in%20R/) of our OSF. This page doesn't cover any of the processing details, just info on using the web environment – message Mike on Slack if you need/want access to the OSF (it's not needed to explore things in this web-based environment).**
>
> I've tried to make this such that you don't need to be familiar with R at all to use this and look at any given function of interest – but if you hit any snags, please let me know and I'll be happy to help! And if you do want to learn more about R from scratch, [here](https://astrobiomike.github.io/R/basics) is one place you could start 🙂

## Getting started

### Accessing the environment

**1. Click this badge to access the environment**

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/AstrobioMike/binder-BRAILLE/main?urlpath=rstudio)

That will open in a new window and after a few seconds should look something like this:

![](https://i.imgur.com/6bw4pF2.png)

> There are rarely any snags when opening, but if it doesn't load or you get an error, try hitting refresh. If that fails, close it and start again from clicking the above badge.

**2. Open our `explore.R` script**

In the bottom-right panel we can see a browser of the files in this environment. We want to click on the **`explore.R`** file, which will open in the top-left panel like so:

![](https://i.imgur.com/BvM7NBr.png)

It's in this text-editor panel on the top-left that we will enter, modify, and run lines of code.

### How to run code here

**1. Place the cursor on the line we want to run**

In the top-left panel, we can click so that our cursor is on the line that says `load("BRAILLE-mg.RDATA")`:

![](https://i.imgur.com/6LPLqW1.png)

**2. Execute the line**

We can execute that line either with the keyboard (**`cmd + return`** on Mac, **`ctrl + enter`** on Windows), or by clicking the "Run" icon at the top of the panel (shown at the bottom-right of this image):

![](https://i.imgur.com/Xpf8RAT.png)

After we execute that line, it is sent to the "console" on the bottom-left:

![](https://i.imgur.com/rJp01sc.png)

When the ">" symbol returns to the bottom-left console window as depicted above, it means that operation is done and it's ready for us to do more things. That line just loaded all our data, which also added a bunch of items to the top-right panel; those are the objects holding our data in here.

**3. Select a block of lines**

Again in the top-left panel, we can highlight and select multiple lines to run, like all these `library` lines that load the packages we need:

![](https://i.imgur.com/uJG2qBe.png)

**4. Execute the selected block of code**

We can then execute that block of code the same way we did a single line, either with the keyboard (`cmd + return` on Mac, `ctrl + enter` on Windows), or by clicking the "Run" icon at the top of our top-left panel. It will be sent to the console as the work is done, and when it's finished in a few seconds it will look like this with our ">" prompt ready again:

![](https://i.imgur.com/8hyBzRq.png)

That's all there is to running code in here! And we can change and modify things in the top-left text-editor as we'd like before running them from there, as demonstrated below.

> **NOTE**
>
> Those lines above, the `load` and all the `library` lines, need to be run first any time the environment is opened, or we will get an error when using the functions below.

## Searching for functions (and their KO IDs) of interest

> Our genes are annotated with [KEGG Orthologs (KOs)](https://www.genome.jp/kegg/ko.html).
There are two main web-based ways we can look for KO terms:

**1. [Primary KO database page](https://www.genome.jp/kegg/ko.html)**

This one is probably better if we want to search by a specific term or gene name, e.g. if we wanted to look for `nifH`:

![](https://i.imgur.com/lCj1V4q.png)

That takes us here:

![](https://i.imgur.com/GecvFFO.png)

And those KO IDs ("K02588", "K02589", "K02590") are what we would use to look for them in our data (examples below).

**2. [Explore grouped and collapsible lists of all KOs](https://www.genome.jp/kegg-bin/get_htext?ko00001)**

While this one has a search function too and will highlight what we search for in red, it's probably more useful when wanting to explore the KO structure a little more. It starts like this:

![](https://i.imgur.com/zfRgES1.png)

And we could click our way through to find KO terms we might be interested in. E.g. if we click "Metabolism", then "Energy metabolism", then "Methane metabolism", it expands out to show us all the KO IDs under that grouping:

![](https://i.imgur.com/Axm8noT.png)

And again those KO IDs, like the "K10944" for the row highlighted there, are what we will use to look for them in our data.

## Examples

> Our genes were annotated with [KEGG Orthologs (KOs)](https://www.genome.jp/kegg/ko.html).

### Plotting a specific function

We're set up to quickly look at things normalized in different ways. One way is considering all genes, as demonstrated first below. Normalizing to all genes also includes un-annotated gene coverages. We can also normalize to specific metabolisms that are already prepared (discussed [here](https://osf.io/uhk48/wiki/2a.%20Gene-level%20functional%20analysis%20in%20R/); let Mike know if you want access), demonstrated second below.

#### Normalized to all genes

Let's say we want to look into a function that we saw above, like the "K02588" nifH we searched for in the first example.
We can try to plot its normalized coverage across our 10 samples like so:

```r
plot_single_KO("K02588")
```

If it's not there, as is the case with this one, it will tell us. Let's try one that is there, the "K10944" methane/ammonia monooxygenase example from above (the KO ID is all we change to plot a different one, *but make sure it is within quotes* like shown here – this tells R to treat it like text, rather than thinking it is a stored object it should look for):

```r
plot_single_KO("K10944")
```

![](https://i.imgur.com/Zx5meMi.png)

> This is how we can plot different ones, we just need to change the KO ID inside the quotes.

That plots in the bottom-right panel, but might fit poorly there. It might be easier to view if we click "Zoom" in that panel to get a separate window for our figures:

![](https://i.imgur.com/bjR406w.png)

Which breaks it out as its own window:

![](https://i.imgur.com/Qu3mhdM.png)

In these plots, the left y-axis is the normalized coverage, and the right y-axis is the number of gene copies that were annotated as this KO, indicated by plus signs on the figure. By default, these plots try to round the coverage axis to the nearest hundred; this one would be more useful rounded to the nearest ten. We can adjust that with another argument, like so (note that the number 10 is not in quotes, since numbers don't need to be told to be treated like text as the KO ID does):

```r
plot_single_KO("K10944", coverage_round_acc = 10)
```

Here we added `coverage_round_acc = 10` in order to set the coverage rounding accuracy to 10:

![](https://i.imgur.com/SSouXLe.png)

That might be useful to adjust depending on what we are plotting.

#### Normalized to a specific metabolism

If instead of normalizing to all genes we want to normalize within a specific metabolism, we just need to provide another argument to our `plot_single_KO()` function that holds the correct object for whatever metabolism we care about (these are listed in the `explore.R` file and in the next section).
Here's an example looking at the same "K10944" we just did, but now normalized to only those KOs that are within [KEGG's Methane metabolism pathway](https://www.genome.jp/kegg-bin/show_pathway?map00680):

```r
plot_single_KO("K10944", methane_KO_gene_df_list)
```

![](https://i.imgur.com/0CpfYcj.png)

In this case, the trend doesn't look too different from the one normalized to everything just above. (Note that `methane_KO_gene_df_list` is not in quotes. This is because it *is* a stored object we want R to look for, rather than treat it as text like we want it to do with the "K10944".)

### Plotting overview of specific metabolism

We can also plot the most abundant KOs in a given metabolism. The ones currently available in here and their object names are listed in the next section, but here is how we could do it with Nitrogen:

```r
plot_overview_of_KOs_with_highest_coverages(N_KO_gene_df_list)
```

Which plots out this:

<a href="https://i.imgur.com/WMKgIKn.png"><img src="https://i.imgur.com/WMKgIKn.png"></a>

And prints out a table of the KOs that were plotted in our bottom-left console panel:

![](https://i.imgur.com/KpCXBa1.png)

If we wanted to adjust the number we were going to plot, we could do so by adding the `number_to_plot` argument like so:

```r
plot_overview_of_KOs_with_highest_coverages(N_KO_gene_df_list, number_to_plot = 12)
```

## Metabolisms currently in here

We can specify any of the metabolisms in here to normalize to for a single plot, and to generate an overview of them as shown above. Each one has a directory associated with it that can be seen in the bottom-right panel if we click on "Files". Within each one's directory, we have files with: 1) gene-copy counts for each associated KO; 2) normalized coverage for each KO; and 3) a KO annotation file with information on the KOs associated with that metabolism. That last one may be helpful if you just want to look over what's included in a given metabolism.
Here's a table with the metabolisms in here and the object name that we need to provide in order to specify them:

| Metabolism | Object name |
|:----|----|
|[Nitrogen](https://www.genome.jp/dbget-bin/www_bget?map00910)|N_KO_gene_df_list|
|[Methane](https://www.genome.jp/dbget-bin/www_bget?map00680)|methane_KO_gene_df_list|
|[Sulfur](https://www.genome.jp/dbget-bin/www_bget?map04122)|sulfur_KO_gene_df_list|
|[Carbon](https://www.genome.jp/dbget-bin/www_bget?map01200)|carbon_KO_gene_df_list|
|[Diverse microbial metabolism](https://www.genome.jp/dbget-bin/www_bget?map01120)|diverse_microbial_metabolism_KO_gene_df_list|

For example, to plot a single KO normalized within the Carbon metabolism pathway:

```r
plot_single_KO("K00196", carbon_KO_gene_df_list)
```

![](https://i.imgur.com/akdbcJS.png)

Or to plot an overview of the top 12 most abundant within the Diverse microbial metabolism pathway:

```r
plot_overview_of_KOs_with_highest_coverages(diverse_microbial_metabolism_KO_gene_df_list, number_to_plot = 12)
```

<a href="https://i.imgur.com/S4fORUK.png"><img src="https://i.imgur.com/S4fORUK.png"></a>

![](https://i.imgur.com/5QtOHzk.png)

## Functions currently in here

Let me know if you have ideas of other things you'd like to be able to do and I'll see if I can add them! For now, this is what's in here.

### get_KO_info()

This will print out general info about any provided KO ID, including a link to the main page for it and whether or not it was assembled and detected in our data.

```r
get_KO_info("K00196")
```

![](https://i.imgur.com/QVlU9KV.png)

### plot_single_KO()

Plots a single KO's coverage across all samples.
Normalized to all gene-coverages (including un-annotated):

```r
plot_single_KO("K10944")
```

Normalized to a specific metabolism (does not include un-annotated):

```r
plot_single_KO("K10944", methane_KO_gene_df_list)
```

### plot_overview_of_KOs_with_highest_coverages()

This plots the most abundant KOs in a grouping. To run it on all of our KOs, we can provide it with the object `all_our_KO_gene_df_list` like so:

```r
plot_overview_of_KOs_with_highest_coverages(all_our_KO_gene_df_list)
```

It plots 25 by default, but we can change the number to be plotted, e.g.:

```r
plot_overview_of_KOs_with_highest_coverages(all_our_KO_gene_df_list, number_to_plot = 4)
```

## Last note on the web-environment

This session only lasts as long as we're using it. Files can be uploaded to it and downloaded from it, but nothing will remain on the session/in the environment after we close it, or when it sits idle for too long (I think this might be around 10 minutes). Clicking the launch badge again will open a new one with no retained files or work. Let me know if you find yourself limited by that and want to know more about what we can do about it 🙂
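
Since nothing sticks around after the session ends, one way to hold on to a figure is to write it to a file and then download that file from the "Files" tab in the bottom-right panel. Below is a minimal sketch of that using base R's `pdf()` and `dev.off()`. The coverage values and the file name are made up for illustration so the example runs on its own, and it assumes the plotting functions in here draw to whatever graphics device is currently open, which hasn't been checked for each of them:

```r
# made-up coverage values standing in for a real plot (illustration only)
example_coverages <- c(S1 = 120, S2 = 85, S3 = 240)

# hypothetical output file name
out_file <- "example-figure.pdf"

# open a PDF graphics device; anything plotted from here on is written to the file
pdf(out_file, width = 8, height = 5)

# stand-in plot; in the environment this is where a plotting call would go
barplot(example_coverages, ylab = "Normalized coverage", main = "Example figure")

# close the device, which finishes writing the file
dev.off()

# check that the file was written
file.exists(out_file)
```

In the environment, the `barplot()` line would be swapped for something like `plot_single_KO("K10944")`. The resulting file should then show up in the "Files" tab, where it can be ticked and downloaded through "More" and then "Export".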