Data analysis with R (Saturday AM)

# Outline: + 9:10 - 9:35am [unix leftovers](https://hackmd.io/qtds9NkrTHSKnrrh2_I64Q?view#Section-2-Working-with-files-CSV-scavenger-hunt) + 9:40 - 10:00am [More R: phyloseq and ggplot](https://hackmd.io/4Zpa35WESgG72gi325AbWw)/ [R Tutorials](#R-Tutorials) + 10:30 - 10:50 Break + 10:50 - 12:00pm more open work on tutorials: [ggplot](#05-ggplot) and [phyloseq](#06-Phyloseq) + If finished early, go over [mike's glorious commands](https://astrobiomike.github.io/unix/six-glorious-commands) # R Tutorials + data (you probably have this downloaded from day 1) + [covariates](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/data/FWS_covariates.txt) + [abundances](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/data/FWS_OTUs.txt) + 01 - 04 (you probably have this downloaded from day 1) + 01: [data in R](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/01-data.Rmd) .Rmd file + [.Rmd with solutions](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/01-data-solutions.Rmd) + 02: [functions](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/02-functions.Rmd) .Rmd file + [.Rmd with solutions](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/02-functions-solutions.Rmd) + 03: [dplyr](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/03-dplyr.Rmd) .Rmd file + 04: [plot results](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/04-plot-results.Rmd) .Rmd file + 05 - 06 (these are new today) + You should download the .Rmd file for each lesson by following the link and downloading the file. You should save the file in the **same folder** where tutorials 01-04 are saved on your computer or on your RStudio server. + 05: [phyloseq](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/05-phyloseq.Rmd) .Rmd file + [.Rmd with solutions](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/05-phyloseq-solutions.Rmd) + 06: [ggplot](https://github.com/mblstamps/stamps2023/blob/main/day1/intro_R/06-ggplot.Rmd) .Rmd file # More R: phyloseq and ggplot2 [link](https://hackmd.io/4Zpa35WESgG72gi325AbWw) # 05: Phyloseq ## Phyloseq `phyloseq` is an R package that is used to store, analyze, and plot microbiome abundance data. We will start by importing it from the Bioconductor package repository. ```r! if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!requireNamespace("phyloseq", quietly = TRUE)) { BiocManager::install("phyloseq") } # load in the phyloseq package library(phyloseq) # load in tidyverse library(tidyverse) ``` :::info :bulb: If you have any issues installing `phyloseq` on your local computer, we recommend switching over to your R Studio server account where `phyloseq` is already installed. ::: ## Load in data We'll be using the same `covariates` and `abundances` datasets and show how to make them into a `phyloseq` object. ```r! covariates <- read.csv("data/FWS_covariates.txt", sep = "\t") abundances <- read.csv("data/FWS_OTUs.txt", sep = "\t", row.names = 1, header = T) ``` ## Creating phyloseq objects A `phyloseq` object is made up of three tables. 1. OTU table 2. Taxonomy table 3. Sample data table Let's prepare these three tables from our data. ```r! # our covariates data frame is our sample data table. # let's move the "SampleName" column to be row names for our data frame row.names(covariates) <- covariates$SampleName covariates <- covariates %>% select(-SampleName) # our abundance table doesn't need to have all of the taxonomic information. # let's update the row names by numbering our OTUs. # first we'll save the current rownames, because we need that taxonomy information later taxonomy <- row.names(abundances) # replace row names with "taxa1" up through "taxa387" with the "paste0()" function row.names(abundances) <- paste0("taxa", 1:nrow(abundances)) # now we can make a taxonomy table # we would like to separate each taxonomic identification into separate columns for each hierarchical level # make a data frame for taxonomy tax_df <- data.frame(tax = taxonomy) # separate the single column tax by creating a new column each time a ";" is seen # name the newly created columns based on their rank tax_df_sep <- separate_wider_delim(data = tax_df, cols = tax, delim = ";", names = c("kingdom", "phylum", "class", "order", "family", "genus")) # turn the data frame into a matrix # (phyloseq prefers a matrix of taxonomy to a data frame) # (I don't know why) tax_mat <- as.matrix(tax_df_sep) # make sure the row names of the taxonomy matrix match the row names of the OTU table row.names(tax_mat) <- row.names(abundances) # now we can finally make a phyloseq object! phylo_obj <- phyloseq(otu_table(abundances, taxa_are_rows = TRUE), tax_table(tax_mat), sample_data(covariates)) phylo_obj ``` Great, we have a `phyloseq` object! But what have we actually done here? We have linked all of our information about our data (covariates for each sample, abundances, and taxonomic information about OTUs) into a single object. That's pretty cool! ## Using phyloseq objects Once we have our `phyloseq` object, we can do a lot with it! Let's start out by learning the basics of it (which we probably already know from our explorations of the data). ```{r} # check out our sample data nsamples(phylo_obj) sample_names(phylo_obj) sample_variables(phylo_obj) # check out our abundance data sample_sums(phylo_obj) ntaxa(phylo_obj) head(taxa_sums(phylo_obj), 20) # check out our taxonomy table rank_names(phylo_obj) ``` We can also change the level of taxonomic classification we are working with. What if we only cared about differences up to the order level. ```r! # use tax_glom to aggregate, where the second argument is the taxonomic level to aggregate to phylo_order <- tax_glom(phylo_obj, "order") ntaxa(phylo_order) head(tax_table(phylo_order)) # note that now the family and genus levels are all NA value because we have # aggregated to the phylum level # NA is a data type in R that refers to missing data or data that doesn't exist ``` We can also subset taxa. What if we only want to look at taxa from the phylum *Proteobacteria*? ```r! # use subset_taxa to subset, using the second argument to specify which taxa you want at which taxonomic rank phylo_proteo <- subset_taxa(phylo_obj, phylum == "Proteobacteria") ntaxa(phylo_proteo) head(tax_table(phylo_proteo)) ``` ## Plotting phyloseq objects One very useful aspect of `phyloseq` is its analysis and plotting capacity. Although you could make any plot you want with the two data sets and some time spent with `dplyr` and `ggplot`, `phyloseq` makes this process more convenient for you. Imagine we want to look into alpha diversity for our data. ```r! # estimate several types of alpha diversity estimate_richness(phylo_obj) # plot Shannon and Simpson diversity measures by location # note: this plotting function is a wrapper for ggplot, so we can customize it as we would a ggplot object plot_richness(phylo_obj, x = "Location", color = "Season", measures = c("Shannon", "Simpson")) + # use a black and white background theme_bw() + # add a title ggtitle("Shannon and Simpson diversities by location") + # center the title theme(plot.title = element_text(hjust = 0.5)) ``` Now imagine we want to make an ordination plot where we compare communities based on Bray-Curtis dissimilarities. ```r! # calculate bray-curtis dissimilarities between each sample # then perform dimension reduction with nMDS (non-metric Multi Dimensional Scaling) phylo_bray <- ordinate(phylo_obj, "NMDS", "bray") # color our plot by season plot_ordination(phylo_obj, phylo_bray, type="samples", color="Season") + # use a black and white background theme_bw() + # add a title ggtitle("nMDS of samples with Bray-Curtis dissimilarities") + # center the title theme(plot.title = element_text(hjust = 0.5)) ``` ## Getting data out of phyloseq Finally, there are some settings where you might rather work with individual data tables than with your phyloseq object. Pretend for a moment that you started with your `phylo_obj` object and have no access to the `covariates` or `abundances` data frames. `phyloseq` has handy functions for us to extract each of its three tables. ```r! # extract sample data table samp_tab <- sample_data(phylo_obj) head(samp_tab) # extract otu table otu_tab <- otu_table(phylo_obj) head(otu_tab) # extract taxonomy table taxon_tab <- tax_table(phylo_obj) head(taxon_tab) ``` Now you have each of these tables as their own R objects and can manipulate and analyze them however you would like! ## Exercise Using the `phylo_obj` object, subset to only taxa that belong to *Firmicutes*. Build an **ordination plot** based on the subsetted `phyloseq` object you've created using Bray-Curtis dissimilarities and nMDS (as in the above section) and color the points by the location where they were collected. :::spoiler *Hint 1* :::info :bulb: subset to a specific taxa with `subset_taxa()`. ::: :::spoiler *Hint 2* :::info :bulb: locations are saved as the "Location" variable in the `covariates` dataset. ::: :::spoiler *Click to Reveal a Solution* :::success ```r! phylo_firm <- subset_taxa(phylo_obj, phylum == "Firmicutes") phylo_firm_bray <- ordinate(phylo_firm, "NMDS", "bray") # color our plot by season plot_ordination(phylo_firm, phylo_firm_bray, type="samples", color="Location") + # use a black and white background theme_bw() + # add a title ggtitle("nMDS of samples with Bray-Curtis dissimilarities") + # add a subtitle labs(subtitle = "only Firmicutes taxa") + # center the title and subtitle theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) ``` ::: ## References ```{r} citation("phyloseq") citation("ggplot2") ``` Some materials adapted from this [tutorial](https://www.nicholas-ollberding.com/post/introduction-to-phyloseq/) by Nicholas Ollberding. # 06: ggplot ## ggplot2 In yesterday's tutorial, we tested out generating a PCoA plot using ggplot (and vegan). Today, we'll delve into ggplot in more detail, exploring its various features and functionalities. ggplot2 (commonly just referred to as just ggplot) is a R package used to make high quality plots for your data. *ggplot2 is part of the tidyverse suite!* It offers consistent syntax and allows for detailed customization of almost every element of your plots. Visualization options range from bargraphs, boxplots, scatterplot, pie charts, histograms and more. to get started, lets load tidyverse (we're going to use a few dplyr functions in addition to ggplot, so loading tidyverse is a quick solution) ```{r} library(tidyverse) ``` Lets check out a new science (fiction) dataset, and check out it's columns: ```{r} powers <- read_csv("https://raw.githubusercontent.com/mblstamps/stamps2023/main/day1/intro_R/super_data/super_hero_powers.csv") powers ``` see how a few of the names have spaces in them, and some of them start with uppercase letters? the package `janitor` has a function that "cleans up" column names according to their standards (read more here:https://rpubs.com/jenrichmond/clean_names) ```{r} if (!requireNamespace("janitor", quietly = TRUE)) { install.packages("janitor") } library(janitor) powers <- clean_names(powers) powers ``` ### Practice 1. Try out janitor on the supers_hero_powers data ```{r} superhero_info <- read_csv("https://raw.githubusercontent.com/mblstamps/stamps2023/main/day1/intro_R/super_data/heroes_information.csv") # add code to clean names below, and update (overwrite) the superhero_info object with the clean names ``` lets take that superpowers dataframe and make a plot! ```{r} superhero_info %>% arrange(desc(height)) ``` say we want to make a bar chart to visualize the relative heights of superheros. This is about the simplest version we can make using ggplot. ```{r} superhero_info %>% ggplot(aes(x = name, y = height)) + geom_col() ``` Although the above code is equivalent to: ```{r} ggplot(superhero_info, aes(x = name, y = height)) + geom_col() ``` We tend to put the data first and then use the pipe (%>%) to send it to the ggplot() function. This is nice because when we add further data wrangling functions between the data and the ggplot(). For example, a filter! ```{r} superhero_info %>% filter(skin_color == "green") %>% ggplot(aes(x = name, y = height)) + geom_col() ``` The lines that come before the ggplot() function are piped, whereas from ggplot() onwards you have to use +. This is because we are now adding different layers and customization to the same plot. aes() stands for aesthetics - things we can see. These set the axes for the plot, and infact you can just plot the aesthetics like this: ```{r} superhero_info %>% filter(skin_color == "green") %>% ggplot(aes(x = name, y = height)) ``` Lets clean this up a little bit- we can fix those axis labels like this: ```{r} superhero_info %>% filter(skin_color == "green") %>% ggplot(aes(x = name, y = height)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # hjust sets the alignment of the text. geom_col() ``` ```{r} superhero_info %>% filter(skin_color == "green") %>% ggplot(aes(x = name, y = height)) + geom_col()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+ labs(title = "Heights of Super Heros", x = "Hero Name", y = "Height (cm)") ``` Adding another aesthetic, fill, lets us fill in color by a variable of interest. ```{r} superhero_info %>% filter(skin_color == "green") %>% ggplot(aes(x = name, y = height, fill = gender)) + geom_col()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+ labs(title = "Heights of Super Heros", x = "Hero Name", y = "Height (cm)") ``` ### Practice 2. edit the code below to make a barchart of your own, choosing your own: - filter - x, y, and fill aesthetic - title and axis labels try choosing a different x aesthetic in particular (tip: for a barchart, usually y is numeric data. ) ```{r, eval=FALSE} superhero_info %>% filter(_____) %>% ggplot(aes(x = ___, y = ___, fill = ___)) + geom_col()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+ labs(title = "___", x = "___", y = "___") ``` lets test out a scatter plot, by choosing a continuous variable for both the weights and height. note that the aesthetic for coloring the points is "color", not "fill", and the geom() layer is "point". ```{r} superhero_info %>% ggplot(aes(x = weight, y = height, color = gender)) + geom_point()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+ labs(title = "Heights of Super Heros", x = "Weight", y = "Height (cm)") ``` ## Applying to our microbial abundance data Lets test out some plotting with our microbial abundance data, first by generating a relative abundance column by running the code below ```{r} abundances <- read_tsv("data/FWS_OTUs.txt") rel_abundances <- abundances %>% rename("taxonomy" = `...1`) %>% pivot_longer(!taxonomy, names_to = "name", values_to = "value") %>% separate(name, c("where", "when"), remove = F) %>% group_by(where, when) %>% mutate(relative_abundance = value/sum(value)) rel_abundances ``` Here we can make a scatter plot, plotting the relative abundance of each species by sample. adding `aplha = .5` in the geom layer makes the points 50% translucent ```{r} rel_abundances %>% ggplot(aes(x = name, y = relative_abundance, col = when)) + geom_point(alpha=.5) + labs(x = "", y = "Relative\nabundance")+ theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) ``` Lets test out a new plot. Here is code to make a relative abundances ```{r} rel_abundances <- rel_abundances %>% mutate(when = factor(when, levels = c("Jan", "Feb", "May", "Jun", "Jul", "Aug"))) rel_abundances ``` ```{r} rel_abundances %>% filter(taxonomy == "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Burkholderiaceae;Ralstonia") %>% ggplot(aes(x = name, y = relative_abundance, col = when)) + geom_point() + labs(y = "Relative\nabundance")+ theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) ``` ### Practice 3. Here is code to pivot the data wide again. first run this ```{r} rel_abundances_wide <- abundances %>% rename("taxonomy" = 1) %>% pivot_longer(-1) %>% mutate(relative_abundance = value/sum(value)) %>% filter(taxonomy %in% c("Bacteria;Cyanobacteria;Cyanobacteria;SubsectionIV;Unassigned;Anabaena", "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Burkholderiaceae;Ralstonia"))%>% select(!value) %>% pivot_wider(names_from = taxonomy, values_from = relative_abundance) %>% rename("Cyanobacteria" = 2, "Proteobacteria" = 3) rel_abundances_wide ``` Now, try to make a scatter plot of the relative abundances of Amy's favorite 2 taxa, against each other. - add the x & y aesthetics to match the 2 columns in rel_abundances_wide that represent Amy's favorite taxa. - debug the geom layer - fill in the x & y labels, and add a title ```{r, eval=F} rel_abundances_wide %>% ggplot(aes(x = ____, y = ____, fill = ____)) + geom_spot() + labs(x="___", y = "___")+ theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.