Introduction to R for RNA-Seq Workshop April 2022

--- tags: workshop --- # Introduction to R for RNA-Seq Workshop April 2022  **When:** Wednesday, April 27, 2022, 10 am - 12 pm PDT [world clock](https://www.timeanddate.com/worldclock/fixedtime.html?msg=+Introduction+to+R+for+RNA-Sequencing+Analysis&iso=20220427T10&p1=3736&ah=2) **Where:** [Zoom](https://zoom.us/j/7575820324?pwd=d2UyMEhYZGNiV3kyUFpUL1EwQmthQT09) and [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nih-cfde/training-rstudio-binder/data?urlpath=rstudio) **Instructors:** Dr. Rayna Harris **Helpers:** Jeremy Walter and Dr. Jessica Lumian **Organizer:** [The Common Fund Data Ecosystem](https://training.nih-cfde.org/) **Description:** RNA-Sequencing (RNA-Seq) is a popular method for determining the presence and quantity of RNA in biological samples. In this 2 hour workshop, we will use R to explore publicly-available RNA-Seq data from the [Gene Expression Tissue Project (GTEx)](https://gtexportal.org/home/). Attendees will be introduced to the R syntax, variables, functions, packages, and data structures common to RNA-Seq projects. We will use RStudio to import, tidy, transform, and visualize RNA-Seq count data. Attendees will learn tips and tricks for making the processes of data wrangling and data harmonization more manageable. This workshop will not cover cloud-based workflows for processing RNA-seq reads or statistics and modeling because these topics are covered in our [RNA-Seq Concepts](https://osf.io/kj5av/) and [RNA-Seq in the Cloud](https://github.com/nih-cfde/rnaseq-in-the-cloud/blob/stable/rnaseq-workflow.pdf) workshops. Rather, this workshop will focus on general R concepts applied to RNA-Seq data. Familiarity with R is not required but would be useful. :::info ### Learning Objectives In this workshop, you will learn how to use R and RStudio to: - import and view files commonly associated with RNA-sequencing experiments - visualize data using bar graphs, scatter plots, and box plots - filter observations and select variables that are relevant to research questions - create and rename variables - join data frames by common variables ### Research questions You will produce graphs and tables to answer motivating questions like: - How many RNA-sequencing samples are in the GTEx project? - Do you have enough samples to test the effects of sex, age, hardy scale, and their interactions for all tissues? - What is the effect of age on gene expression in the heart? - How does the expression of gene X differ between age groups? ::: ![](https://hackmd.io/_uploads/SkkxxSHeq.png =350x) ![](https://i.imgur.com/D6U6vlW.png =350x) ## Introduction The book [“R for Data Science”](https://r4ds.had.co.nz/index.html) provides an excellent framework for using data science to turn raw data into understanding, insight, and knowledge. We will use this framework as an outline for this workshop. **R** is a statistical computing and data visualization programming language. **RStudio** is an integrated development environment, or IDE, for R programming. R and RStudio work on Mac, Linux, and Windows operating systems. The RStudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface. When you start RStudio, you’ll see two key regions in the interface: the console and the output. When working in R, you can type directly into the console, or you can type into a script. Saving commands in a script will make it easier to reproduce. You will learn more as we go along! ![](https://hackmd.io/_uploads/H1a8-HHx5.png) For today’s lesson, we will focus on data from the [Gene-Tissue Expression (GTEx) Project](https://commonfund.nih.gov/gtex). GTEx is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. We will work with a few different files and make the following graphs. ![](https://i.imgur.com/AfjBaPE.png) ![](https://i.imgur.com/XyTGHTK.png) ![](https://i.imgur.com/Z15NE7l.png) ### Getting Started 1. Click the [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nih-cfde/training-rstudio-binder/data?urlpath=rstudio) button to generate a computing environment for this workshop. 2. Navigate to the GTEx folder. 3. Click `GTEx.Rproj` and click “Yes” to open up an Rproject. This will set the working directory to `~/GTEx/`. 4. If you open the `r4rnaseq-workshop.R` file which contains all the commands for today’s workshop, you can click through this and all the commands should run successfully. 5. If you open a new R Script by clicking **File > New File > R Script**, you can code along by typing out all the commands for today’s lesson as I type them. Click “Run” to send commands from a script to the console or click command enter. ### R is a calculator You can perform simple and advanced calculations in R. ``` r 2 + 2 * 100 ``` ## [1] 202 ``` r log(202) ``` ## [1] 5.308268 You can save variable and recall them later. ``` r pval <- 0.05 pval ``` ## [1] 0.05 ``` r -log10(pval) ``` ## [1] 1.30103 You can save really long lists of things with a short, descriptive names that are easy to recall later. ``` r favorite_genes <- c("BRCA1", "JUN", "GNRH1", "TH", "AR") favorite_genes ``` ## [1] "BRCA1" "JUN" "GNRH1" "TH" "AR" ### Loading R packages Many of the functions we will use are pre-installed. The [Tidyverse](https://www.tidyverse.org/) is a collection of R packages that include functions, data, and documentation that provide more tools and capabilities when using R. You can install the popular data visualization package `ggplot2` with the command `install.packages("ggplot2")`). It is a good idea to “comment out” this line of code by adding a `#` at the beginning so that you don’t re-install the package every time you run the script. For this workshop, the packages listed in the `.binder/environment.yml` file were pre-installed with Conda. ``` r #install.packages("ggplot2") ``` After installing packages, we need to load the functions and tools we want to use from the package with the `library()` command. Let’s load the `ggplot2` package. ``` r library(ggplot2) ``` :::warning #### Challenge We will also use functions from the packages `tidyr` and `dplyr` to tidy and transform data. What command would you run to load these packages? :::spoiler `library(tidyr)` `library(dplyr)` ::: You can also navigate to the “Packages” tab in the bottom right pane of RStudio to view a list of available packages. Packages with a checked box next to them have been successfully loaded. You can click a box to load installed packages. Clicking the “Help” Tab will provide a quick description of the package and its functions. :::success #### Key functions | Function | Description | |----------------------|----------------------------------------------| | `<-` | The assignment variable | | `log10()` | A built-in function for a log transformation | | `install.packages()` | An R function to install packages | | `library()` | The command used to load installed packages | ::: ## Importing and viewing data Data can be imported using packages from base R or the tidyverse. What are some differences between the data objects imported by base R functions such as `read.csv()` and Tidyverse functions such as `read_csv()`? To begin with, `read.csv()` replaces spaces and dashes periods in column names, and it also preserves row.names. On the other hand, `read_csv()` preserves spaces and dashes in column names but drops row.names. For this workshop, we will use `read_csv()`, which means we may have to replace dashes with periods so that our sample names in all objects with sample name information. Today, I will show you how to import the following files: 1. data/samples.csv 2. data/GTExHeart_20-29_vs_50-59.tsv 3. data/colData.HEART.csv 4. data/countData.HEART.csv.gz Later, you can practice on your own using the following files: 1. data/GTExMuscle_20-29_vs_70-79.tsv 2. data/colData.MUSCLE.csv 3. data/countData.MUSCLE.csv.gz The `GTExPortal.csv` file in `./data/` contains information about all the samples in the GTEx portal. Let’s import this file using `read.csv()`. ``` r samples <- read.csv("./data/samples.csv") ``` After importing a file, there are multiple ways to view the data. `head()` to view the first few lines of each file. `names()` will print just the column names. `str` will compactly displaying the internal structure. `summary` will compute statistics. ``` r View(samples) head(samples) ``` ## SUBJID SEX AGE DTHHRDY SAMPID SMTS ## 1 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue ## 2 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0426-SM-5EGHI Muscle ## 3 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0526-SM-5EGHJ Blood Vessel ## 4 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0626-SM-5N9CS Blood Vessel ## 5 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0726-SM-5GIEN Heart ## 6 GTEX-1117F Female 60-69 Slow death GTEX-1117F-1326-SM-5EGHH Adipose Tissue ## SMNABTCH SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN SMATSSCR ## 1 BP-43693 2013-09-17 TruSeq.v1 RNASEQ B1 6.8 0 ## 2 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 7.1 0 ## 3 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 8.0 0 ## 4 BP-43956 2013-09-25 TruSeq.v1 RNASEQ B1 6.9 1 ## 5 BP-44261 2013-10-03 TruSeq.v1 RNASEQ B1 6.3 1 ## 6 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 5.9 1 ``` r tail(samples) ``` ## SUBJID SEX AGE DTHHRDY SAMPID ## 1479 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-0926-SM-5O9AR ## 1480 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1026-SM-5O9B4 ## 1481 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1126-SM-5SIAT ## 1482 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1226-SM-5SIB6 ## 1483 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1326-SM-5O98Q ## 1484 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1426-SM-5RQJS ## SMTS SMNABTCH SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN ## 1479 Small Intestine BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 7.4 ## 1480 Stomach BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 7.4 ## 1481 Colon BP-47616 2013-12-18 TruSeq.v1 RNASEQ B1 6.9 ## 1482 Ovary BP-47616 2013-12-18 TruSeq.v1 RNASEQ B1 7.3 ## 1483 Uterus BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 8.5 ## 1484 Vagina BP-48437 2014-01-17 TruSeq.v1 RNASEQ B1 7.2 ## SMATSSCR ## 1479 1 ## 1480 1 ## 1481 1 ## 1482 1 ## 1483 1 ## 1484 1 ``` r str(samples) ``` ## 'data.frame': 1484 obs. of 13 variables: ## $ SUBJID : chr "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" ... ## $ SEX : chr "Female" "Female" "Female" "Female" ... ## $ AGE : chr "60-69" "60-69" "60-69" "60-69" ... ## $ DTHHRDY : chr "Slow death" "Slow death" "Slow death" "Slow death" ... ## $ SAMPID : chr "GTEX-1117F-0226-SM-5GZZ7" "GTEX-1117F-0426-SM-5EGHI" "GTEX-1117F-0526-SM-5EGHJ" "GTEX-1117F-0626-SM-5N9CS" ... ## $ SMTS : chr "Adipose Tissue" "Muscle" "Blood Vessel" "Blood Vessel" ... ## $ SMNABTCH : chr "BP-43693" "BP-43495" "BP-43495" "BP-43956" ... ## $ SMNABTCHD: chr "2013-09-17" "2013-09-12" "2013-09-12" "2013-09-25" ... ## $ SMGEBTCHT: chr "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" ... ## $ SMAFRZE : chr "RNASEQ" "RNASEQ" "RNASEQ" "RNASEQ" ... ## $ SMCENTER : chr "B1" "B1" "B1" "B1" ... ## $ SMRIN : num 6.8 7.1 8 6.9 6.3 5.9 6.6 6.3 6.5 5.8 ... ## $ SMATSSCR : int 0 0 0 1 1 1 1 1 2 1 ... ``` r summary(samples) ``` ## SUBJID SEX AGE DTHHRDY ## Length:1484 Length:1484 Length:1484 Length:1484 ## Class :character Class :character Class :character Class :character ## Mode :character Mode :character Mode :character Mode :character ## ## ## ## SAMPID SMTS SMNABTCH SMNABTCHD ## Length:1484 Length:1484 Length:1484 Length:1484 ## Class :character Class :character Class :character Class :character ## Mode :character Mode :character Mode :character Mode :character ## ## ## ## SMGEBTCHT SMAFRZE SMCENTER SMRIN ## Length:1484 Length:1484 Length:1484 Min. : 3.200 ## Class :character Class :character Class :character 1st Qu.: 6.300 ## Mode :character Mode :character Mode :character Median : 7.000 ## Mean : 7.061 ## 3rd Qu.: 7.700 ## Max. :10.000 ## SMATSSCR ## Min. :0.0000 ## 1st Qu.:0.0000 ## Median :1.0000 ## Mean :0.8491 ## 3rd Qu.:1.0000 ## Max. :3.0000 Count files can be very long and wide, so it is a good idea to only view the first (or last) few rows and columns. Typically, a gene identifier (like an ensemble id) will be used as the row names. We can use `dim` to see how many rows and columns are in the file. ``` r counts <- read.csv("./data/countData.HEART.csv", row.names = 1) dim(counts) ``` ## [1] 63856 306 ``` r head(counts)[1:5] ``` ## GTEX.12ZZX.0726.SM.5EGKA GTEX.13D11.1526.SM.5J2NA ## ENSG00000278704 0 0 ## ENSG00000277400 0 0 ## ENSG00000274847 0 0 ## ENSG00000277428 0 0 ## ENSG00000276256 0 0 ## ENSG00000278198 0 0 ## GTEX.ZAJG.0826.SM.5PNVA GTEX.11TT1.1426.SM.5EGIA ## ENSG00000278704 0 0 ## ENSG00000277400 0 0 ## ENSG00000274847 0 0 ## ENSG00000277428 0 0 ## ENSG00000276256 0 0 ## ENSG00000278198 0 0 ## GTEX.13VXT.1126.SM.5LU3A ## ENSG00000278704 0 ## ENSG00000277400 0 ## ENSG00000274847 0 ## ENSG00000277428 0 ## ENSG00000276256 0 ## ENSG00000278198 0 This “countData” was generated by using `recount3` as described in the file `scripts/recount3.Rmd`. It comes from a Ranged Summarized Experiment (rse) which contains quantitative information about read counts as well as quality control information and sample descriptions. The “colData” from an rse can also be obtained. This information *should* match the information in our samples file, but there can be subtle differences in formatting We will read the colData in a later section. Very large tabular files are often saved as .tsv files. These can be imported with `read.table()` or `read_tsv()`. You can also specify the tab delimiter as well as the row and column names. You can import files using the default parameters or you can change them. Because the first column in the .tsv files does not have a row name, by default, `read.table()`, imports the first column as the row.names. When `sep = "\t", header = TRUE` is specified, the fist column is imported as column one and given the column name `X`. ``` r # with row.names results <- read.table("./data/GTEx_Heart_20-29_vs_50-59.tsv") head(results) ``` ## logFC AveExpr t P.Value adj.P.Val B ## A1BG 0.67408600 1.6404652 2.1740238 0.03283291 0.1536518 -3.617093 ## A1BG-AS1 0.23168690 -0.1864802 1.0403316 0.30150123 0.5316030 -4.984225 ## A2M 0.02453974 9.8251848 0.1948624 0.84602333 0.9215696 -5.783835 ## A2M-AS1 0.38115436 2.4535892 2.4839630 0.01520646 0.1033370 -3.067127 ## A2ML1 0.58865741 -1.0412696 1.8263856 0.07173966 0.2328150 -4.065276 ## A2MP1 0.31631081 -0.8994146 1.4061454 0.16377753 0.3730822 -4.583435 :::warning #### Challenge What commands could you use to read the following files: 1. GTEx results comparing the muscles of 20-29 year old to 70-79 year olds? 1. The csv file information describing the muscle samples? :::spoiler 1. `read.table("./data/GTEx_Muscle_20-29_vs_70-79.tsv")` 2. `read.csv("./data/countData.MUSCLE.csv", row.names = 1)` ::: #### Quick summary statistics and sample size You have now seen a variety of options for importing files. You may use many more in your R-based RNA-seq workflow, but these basics will get you started. Let’s now explore the functions `summary()`, `length()`, `dim()`, and `count()` us to quickly summarize and compare data frames to answer the following questions. How many samples do we have? Over 1400! ``` r dim(samples) ``` ## [1] 1484 13 How many samples are there per tissue? ``` r dplyr::count(samples, SMTS) ``` ## SMTS n ## 1 Adipose Tissue 133 ## 2 Adrenal Gland 20 ## 3 Blood Vessel 138 ## 4 Brain 75 ## 5 Breast 50 ## 6 Colon 77 ## 7 Esophagus 143 ## 8 Heart 99 ## 9 Kidney 6 ## 10 Liver 28 ## 11 Lung 67 ## 12 Muscle 96 ## 13 Nerve 70 ## 14 Ovary 16 ## 15 Pancreas 30 ## 16 Pituitary 37 ## 17 Prostate 24 ## 18 Salivary Gland 22 ## 19 Skin 155 ## 20 Small Intestine 23 ## 21 Spleen 16 ## 22 Stomach 29 ## 23 Testis 37 ## 24 Thyroid 68 ## 25 Uterus 14 ## 26 Vagina 11 How many samples are there per tissue and sex? Can we test the effect of sex on gene expression in all tissues? For many samples, yes, but not all tissues were samples from both males and females. ``` r dplyr::count(samples, SMTS, SEX) ``` ## SMTS SEX n ## 1 Adipose Tissue Female 40 ## 2 Adipose Tissue Male 93 ## 3 Adrenal Gland Female 7 ## 4 Adrenal Gland Male 13 ## 5 Blood Vessel Female 48 ## 6 Blood Vessel Male 90 ## 7 Brain Female 23 ## 8 Brain Male 52 ## 9 Breast Female 16 ## 10 Breast Male 34 ## 11 Colon Female 25 ## 12 Colon Male 52 ## 13 Esophagus Female 37 ## 14 Esophagus Male 106 ## 15 Heart Female 30 ## 16 Heart Male 69 ## 17 Kidney Female 1 ## 18 Kidney Male 5 ## 19 Liver Female 4 ## 20 Liver Male 24 ## 21 Lung Female 20 ## 22 Lung Male 47 ## 23 Muscle Female 25 ## 24 Muscle Male 71 ## 25 Nerve Female 19 ## 26 Nerve Male 51 ## 27 Ovary Female 16 ## 28 Pancreas Female 12 ## 29 Pancreas Male 18 ## 30 Pituitary Female 10 ## 31 Pituitary Male 27 ## 32 Prostate Male 24 ## 33 Salivary Gland Female 5 ## 34 Salivary Gland Male 17 ## 35 Skin Female 47 ## 36 Skin Male 108 ## 37 Small Intestine Female 8 ## 38 Small Intestine Male 15 ## 39 Spleen Female 8 ## 40 Spleen Male 8 ## 41 Stomach Female 9 ## 42 Stomach Male 20 ## 43 Testis Male 37 ## 44 Thyroid Female 17 ## 45 Thyroid Male 51 ## 46 Uterus Female 14 ## 47 Vagina Female 11 How many samples are there per sex, age, and hardy scale? Do you have enough samples to test the effects of Sex, Age, and Hardy Scale in the Heart? ``` r dplyr::count(samples, SMTS, SEX, AGE, DTHHRDY ) ``` ## SMTS SEX AGE DTHHRDY n ## 1 Adipose Tissue Female 20-29 Ventilator Case 3 ## 2 Adipose Tissue Female 30-39 Ventilator Case 2 ## 3 Adipose Tissue Female 40-49 Fast death of natural causes 1 ## 4 Adipose Tissue Female 40-49 Ventilator Case 5 ## 5 Adipose Tissue Female 40-49 Violent and fast death 2 ## 6 Adipose Tissue Female 50-59 Fast death of natural causes 3 ## 7 Adipose Tissue Female 50-59 Slow death 1 ## 8 Adipose Tissue Female 50-59 Ventilator Case 8 ## 9 Adipose Tissue Female 60-69 Fast death of natural causes 3 :::warning #### Challenge What series commands would you use to import the `data/colData.MUSCLE.csv` and count the number of muscles samples per sex, age? How many female muscles samples are there from age group 30-39? *Hint: use head() or names() after importing a file to verify the variable names.* :::spoiler `df <- read.csv("./data/colData.MUSCLE.csv")` `dplyr::count(df, SMTS, SEX, AGE)` `# 3 samples are in the female group age 30-39` ::: :::success #### Key functions for importing and quickly viewing raw and summarized data | Function | Description | |-----------------------|-----------------------------------------------------------------| | `read.csv()` | A base R function for importing comma separated tabular data | | `read_csv()` | A tidyR function for importing .csv files as tibbles | | `read.table()` | A base R function for importing tabular data with any delimiter | | `read_tsv()` | A tidyR function for importing .tsv files as tibbles | | `as_tibble()` | Convert data frames to tibbles | | `head()` and `tail()` | Print the first or last 6 lines of an object | | `dim()` | A function that prints the dimensions of an object | | `length()` | Calculate the length of an object | | `count()` | A dplyr function that counts number of samples per group | | `str()` | A function that prints the internal structure of an object | | `summary()` | A function that summarizes each variable | ::: ## Visualizing data with ggplot2 `ggplot2` is a very popular package for making visualization. It is built on the “grammar of graphics”. Any plot can be expressed from the same set of components: a data set, a coordinate system, and a set of “geoms” or the visual representation of data points such as points, bars, line, or boxes. This is the template we build on: ggplot(data = <DATA>, aes(<MAPPINGS>)) + <geom_function>() + ... We just used the `count()` function to calculate how many samples are in each group. The function for creating bar graphs (`geom_bar()`) also makes use of `stat = "count"` to plot the total number of observations per variable. Let’s use ggplot2 to create a visual representation of how many samples there are per tissue, sex, and hardiness. ``` r ggplot(samples, aes(x = SMTS)) + geom_bar(stat = "count") ``` ![](https://i.imgur.com/YV8nArr.png) ![](./images/bar1-1.png) In the last section, we will discuss how to modify the `themes()` to adjust the axes, legends, and more. For now, let’s flip the x and y coordinates so that we can read the sample names. We do this by adding a layer and the function `coord_flip()` ``` r ggplot(samples, aes(x = SMTS)) + geom_bar(stat = "count") + coord_flip() ``` ![](https://i.imgur.com/xl1gav9.png) ![](./images/bar2-1.png) Now, there are two ways we can visualize another variable in addition to tissue. We can add color or we can add facets. Let’s first color the data by age bracket. Color is an aesthetic, so it must go inside the `aes()`. If you include `aes(color = AGE)` inside `ggplot()`, the color will be applied to every layer in your plot. If you add `aes(color = AGE)` inside `geom_bar()`, it will only be applied to that layer (which is important later when you layer multiple geoms. head(samples) ``` r ggplot(samples, aes(x = SMTS, color = AGE)) + geom_bar(stat = "count") + coord_flip() ``` ![](https://i.imgur.com/8vzyMHo.png) ![](./images/bar3-1.png) Note that the bars are outlined in a color according to hardy scale. If instead, you would the bars “filled” with color, use the aesthetic `aes(fill = AGE)` ``` r ggplot(samples, aes(x = SMTS, fill = AGE)) + geom_bar(stat = "count") + coord_flip() ``` ![](https://i.imgur.com/lNLwxpM.png) ![](./images/bar4-1.png) Now, let’s use `facet_wrap(~SEX)` to break the data into two groups based on the variable sex. ``` r ggplot(samples, aes(x = SMTS, fill = AGE)) + geom_bar(stat = "count") + coord_flip() + facet_wrap(~SEX) ``` ![](./images/bar5-1.png) ![](https://i.imgur.com/AfjBaPE.png) With this graph, we have an excellent overview of the total numbers of RNA-Seq samples in the GTEx project, and we can see where we are missing data (for good biological reasons). However, this plot doesn’t show us Hardy Scale. It’s hard to layer 4 variables, so let’s remove Tissue as a variable by focusing just on one Tissue. One thing these plots show us is that we don’t have enough samples to test the effects of all our experimental variables (age, sex, tissue, and hardy scale) and their interactions on gene expression. We can, however, focus on one or two variables or groups at a time. Earlier, we imported the file “data/GTEx_Heart_20-29_vs_70-79.tsv”)” and saved it as “results”. This file contains the results of a differential gene expression analysis comparing heart tissue from 20-29 to heart tissue from 30-39 year olds. This is a one-way design investigating only the effect of age (but not sex or hardy scale) on gene expression in the heart. Let’s visualize these results. [Volcano Plots](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)) are a type of scatter plot that show the log fold change (logFC) on the x axis and the inverse log (`-log10()`) of a p-value that has been corrected for multiple hypothesis testing (adj.P.Val). Let’s create a Volcano Plot using the `gplot()` and `geom_point()`. *Note: this may take a minute because there are 15,000 points that must be plotted* ``` r ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point() ``` ![](https://i.imgur.com/itaUZwg.png) ![](./images/volcano1-1.png) The inverse log of p \< 05 is 1.30103. We can add a horizontal line to our plot using `geom_hline()` so that we can visually see how many genes or points are significant and how many are not. ``` r ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point() + geom_hline(yintercept = -log10(0.05)) ``` ![](https://i.imgur.com/hqlmU53.png) ![](./images/volcano2-1.png) ``` r ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point(aes(color = ifelse( adj.P.Val < 0.05, "p < 0.05", "NS"))) + geom_hline(yintercept = -log10(0.05)) ``` ![](https://i.imgur.com/pZLY8qo.png) ![](./images/volcano3-1.png) ``` r ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point(aes(color = ifelse( adj.P.Val < 0.05, "p < 0.05", "NS"))) + geom_hline(yintercept = -log10(0.05)) + theme(legend.position = "bottom") + labs(color = "20-29 vs 50-59 year olds", subtitle = "Heart Tissue Gene Expression") ``` ![](https://i.imgur.com/qU4wcMi.png) ![](./images/volcano4-1.png) :::warning #### Challenge Create a volcano plot for the results comparing the heart tissue of 20-29 year olds to that of 70-79 year olds? Are there more or less differential expressed gene between 20 and 50 year olds or 20 and 70 year olds? :::spoiler df <- read.table("./data/GTEx_Heart_20-29_vs_70-79.tsv") ggplot(df, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point() + geom_hline(yintercept = -log10(0.05)) # more ::: In addition to containing information about the donor tissue, the samples file contains has a column with a RIN score, which tells us about the quality of the data. If we wanted to look for interactions between RIN score (SMRIN) and sequencing facility (SMCENTER), we can use a box plot. ``` r ggplot(samples, aes(x = SMCENTER, y = SMRIN)) + geom_boxplot() + geom_jitter(aes(color = SMRIN)) ``` ![](https://i.imgur.com/JjdADL3.png) ![](./images/boxplot-1.png) Now you know a handful of R functions for importing, summarizing, and visualizing data. In the next section, we will tidy and transform our data so that we can make even better summaries and figures. In the last section, you will learn ggplot function for making fancier figures. :::success #### Key functions | Function | Description | |----------------|-------------------------------------------------------------------------------------------------------------------------| | `ggplot2` | An open-source data visualization package for the statistical programming language R | | `ggplot()` | The function used to construct the initial plot object, and is almost always followed by + to add component to the plot | | `aes()` | Aesthetic mappings that describe how variables in the data are mapped to visual properties (aesthetics) of geoms | | `geom_point()` | A function used to create scatter plots | | `geom_bar()` | A function used to create bar plots | | `coord_flip()` | Flips the x and y axis | | `geom_hline()` | Add a horizontal line to plots | ::: ## Tidying and transforming data [Data wrangling](https://en.wikipedia.org/wiki/Data_wrangling) is the process of tidying and transforming data to make it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data. **Tidying** your data means storing it in a consistent form. When your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions. Some tidying functions include `pivot_longer()`, `pivot_wider()`, `separate()`, `unite()`, `drop_na()`, `replace_na()`. The “lubridate” package has a number of functions for tidying dates. You may also use `mutate()` function to convert objects from, say, characters or integers to factors or rename observations and variables. **Transforming** your data includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Summary functions such as `summarize()` and `count()` to create new tables with statistics. Before summarizing or counting a whole data frame, you can use `group_by()` to group variables. You can use `filter()` and `select()` to isolate specific rows or columns, respectively. If you want to sort columns, `arrange()` and `arrange(desc())` are two functions to familiarize yourself with. **Combining tables** can be accomplished in one of two ways. If all the columns or all the rows have all the same names, you can use `rbind()` or `cbind()`, respectively, to join the data frames. If however, each data frame has a column (or multiple columns) that contain unique identifiers, then you can use the family of join functions (`inner_join()`, `outter_join()`, `left_join()`, and `right_join()`) For each downstream analysis, you will likely use a series of tidying and transforming steps in various order to get your data in the appropriate format. Instead of creating dozens of intermediate files after each step, we will use the `%>%` operator to “pipe” the output of one function to the input of the other. Instead of going into each function or each process in detail in isolation, let’s start with some typical research questions and then piece together R functions to get the desired information ### Filtering Data Filtering is done in a few different ways depending on the type of variable. You can use `>` and less `<` to filter greater or less than a number. `==` and `!=` are used to filter by characters or factors that match or do not match a specific pattern. `%in% c()` is used to filter by things in a list. Let’s filter by adjusted p-value. You can use `|` and `&` to mean “or” or “and” To explore filtering data, let’s answer the following question: What are the approved names and symbols of the differentially expressed genes (DEGs) in the heart tissue between 20-29 and 30-29 year olds? To answer this question, we need a subset of information from both the results and genes files. We need, in no particular order, to: 1. filter by adj.p.value \< 0.05 (or desired alpha) 2. filter by results by logFC > 1 or \<-1 3. filter by a list of gene symbols ``` r results %>% filter(adj.P.Val < 0.05) %>% head() ``` ## logFC AveExpr t P.Value adj.P.Val B ## AAGAB -0.4011097 4.904460 -4.013690 0.0001394737 0.02151260 0.8922610 ## ABCA6 0.7016965 5.597551 3.502285 0.0007779680 0.03216746 -0.6657214 ## ABCA9 0.6970969 6.237353 3.114336 0.0026040559 0.04866231 -1.7334759 ## ABCB7 -0.3708764 5.088921 -3.134484 0.0024510401 0.04755891 -1.6840509 ## ABCD3 -0.6082187 6.190704 -3.966022 0.0001646305 0.02282632 0.7409476 ## ABCE1 -0.3764537 5.332213 -3.336582 0.0013173889 0.03802552 -1.1360534 ``` r results %>% filter(logFC > 1 | logFC < -1) %>% head() ``` ## logFC AveExpr t P.Value adj.P.Val B ## ACTA1 -1.358451 10.437939 -2.446102 0.016765666 0.10888918 -3.1289828 ## ADAMTSL2 1.338257 5.208146 2.900833 0.004871530 0.06245142 -2.2916348 ## ADH1B 1.259668 7.381462 3.382176 0.001141426 0.03624785 -0.9658612 ## ADIPOQ -1.119484 1.117207 -1.720643 0.089405972 0.26371302 -4.3157948 ## AJAP1 1.010117 -1.212201 2.790425 0.006659281 0.07018819 -2.4766685 ## ALAS2 1.122494 -1.039829 1.840601 0.069603628 0.22901993 -4.0459220 Sometimes its nice to arrange by pvalue. ``` r results %>% filter(adj.P.Val < 0.05, logFC > 1 | logFC < -1) %>% arrange(adj.P.Val) %>% head() ``` ## logFC AveExpr t P.Value adj.P.Val B ## EDA2R 1.253278 1.0260046 6.019187 5.853141e-08 0.0004544671 6.8806284 ## PTCHD4 1.962957 -1.7174066 6.067597 4.781186e-08 0.0004544671 5.4766844 ## BTBD11 -1.194207 0.4506981 -5.289726 1.153068e-06 0.0044764970 4.2906292 ## MTHFD2P1 1.825674 -1.6578790 5.341073 9.392086e-07 0.0044764970 3.5204863 ## C4orf54 -2.824211 2.7276196 -4.502382 2.398852e-05 0.0159674585 2.4089730 ## LOC101929331 1.129013 -1.6306733 4.312543 4.813231e-05 0.0175055847 0.8558573 ``` r resultsDEGs <- results %>% filter(adj.P.Val < 0.05, logFC > 1 | logFC < -1) %>% arrange(adj.P.Val) %>% rownames(.) resultsDEGs ``` ## [1] "EDA2R" "PTCHD4" "BTBD11" "MTHFD2P1" "C4orf54" ## [6] "LOC101929331" "FMO3" "KLHL41" "ETNPPL" "HOPX" ## [11] "PDIA2" "RPL10P7" "FCMR" "RAD9B" "LMO3" ## [16] "NXF3" "FHL1" "EREG" "CHMP1B2P" "MYPN" ## [21] "VIT" "XIRP1" "DNASE1L3" "LIPH" "PRELP" ## [26] "CSRP3" "FZD10-AS1" "LINC02268" "GDF15" "PHF21B" ## [31] "CPXM1" "IL24" "ADH1B" "MCF2" "WWC1" ## [36] "SGPP2" "COL24A1" "SEC24AP1" "ANKRD1" "CDO1" ## [41] "CCL28" "SLC5A10" "XIRP2" :::warning #### Challenge Replace the input results file with a different file, such as the results of the comparison of 20-29 and 50-59 year old heart samples. What are the deferentially expressed genes? :::spoiler You could use the following code to get this result below resultsDEGs2 <- read.table("./data/GTEx_Heart_20-29_vs_50-59.tsv") %>% filter(adj.P.Val < 0.05, logFC > 1 | logFC < -1) %>% arrange(adj.P.Val) %>% rownames(.) resultsDEGs2 [1] "EDA2R" "PTCHD4" "BTBD11" "MTHFD2P1" "C4orf54" "LOC101929331" [7] "FMO3" "KLHL41" "ETNPPL" "HOPX" "PDIA2" "RPL10P7" [13] "FCMR" "RAD9B" "LMO3" "NXF3" "FHL1" "EREG" [19] "CHMP1B2P" "MYPN" "VIT" "XIRP1" "DNASE1L3" "LIPH" [25] "PRELP" "CSRP3" "FZD10-AS1" "LINC02268" "GDF15" "PHF21B" [31] "CPXM1" "IL24" "ADH1B" "MCF2" "WWC1" "SGPP2" [37] "COL24A1" "SEC24AP1" "ANKRD1" "CDO1" "CCL28" "SLC5A10" [43] "XIRP2" ::: ### Mutating Data Most RNA-Seq pipelines require that the counts file to be in a matrix format where each sample is a column and each gene is a row and all the values are integers or doubles with all the experimental factors in a separate file. More over, we need a corresponding file where the row names are the sample id and they match the column names of the counts file. When you type `rownames(colData) == colnames(counts)` you should see many TRUE statments. If the answer if FALSE your data cannot be processed by downstream tools. ``` r colData <- read.csv("./data/colData.HEART.csv", row.names = 1) head(colData) ``` ## SAMPID SMTS ## GTEX-12ZZX-0726-SM-5EGKA GTEX-12ZZX-0726-SM-5EGKA Heart ## GTEX-13D11-1526-SM-5J2NA GTEX-13D11-1526-SM-5J2NA Heart ## GTEX-ZAJG-0826-SM-5PNVA GTEX-ZAJG-0826-SM-5PNVA Heart ## GTEX-11TT1-1426-SM-5EGIA GTEX-11TT1-1426-SM-5EGIA Heart ## GTEX-13VXT-1126-SM-5LU3A GTEX-13VXT-1126-SM-5LU3A Heart ## GTEX-14ASI-0826-SM-5Q5EB GTEX-14ASI-0826-SM-5Q5EB Heart ## SMTSD SUBJID SEX AGE SMRIN ## GTEX-12ZZX-0726-SM-5EGKA Heart - Atrial Appendage GTEX-12ZZX Female 40-49 7.1 ## GTEX-13D11-1526-SM-5J2NA Heart - Atrial Appendage GTEX-13D11 Female 50-59 8.9 ## GTEX-ZAJG-0826-SM-5PNVA Heart - Left Ventricle GTEX-ZAJG Female 50-59 6.4 ## GTEX-11TT1-1426-SM-5EGIA Heart - Atrial Appendage GTEX-11TT1 Male 20-29 9.0 ## GTEX-13VXT-1126-SM-5LU3A Heart - Left Ventricle GTEX-13VXT Female 20-29 8.6 ## GTEX-14ASI-0826-SM-5Q5EB Heart - Atrial Appendage GTEX-14ASI Male 60-69 6.4 ## DTHHRDY SRA DATE ## GTEX-12ZZX-0726-SM-5EGKA Violent and fast death SRR1340617 2013-10-22 ## GTEX-13D11-1526-SM-5J2NA Ventilator Case SRR1345436 2013-12-04 ## GTEX-ZAJG-0826-SM-5PNVA Intermediate death SRR1367456 2013-10-31 ## GTEX-11TT1-1426-SM-5EGIA Ventilator Case SRR1378243 2013-10-24 ## GTEX-13VXT-1126-SM-5LU3A Ventilator Case SRR1381693 2013-12-17 ## GTEX-14ASI-0826-SM-5Q5EB Fast death of natural causes SRR1335164 2014-01-17 ``` r head(rownames(colData) == colnames(counts)) ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE ``` r head(colnames(counts)) ``` ## [1] "GTEX.12ZZX.0726.SM.5EGKA" "GTEX.13D11.1526.SM.5J2NA" ## [3] "GTEX.ZAJG.0826.SM.5PNVA" "GTEX.11TT1.1426.SM.5EGIA" ## [5] "GTEX.13VXT.1126.SM.5LU3A" "GTEX.14ASI.0826.SM.5Q5EB" ``` r head(rownames(colData)) ``` ## [1] "GTEX-12ZZX-0726-SM-5EGKA" "GTEX-13D11-1526-SM-5J2NA" ## [3] "GTEX-ZAJG-0826-SM-5PNVA" "GTEX-11TT1-1426-SM-5EGIA" ## [5] "GTEX-13VXT-1126-SM-5LU3A" "GTEX-14ASI-0826-SM-5Q5EB" The row and col names don’t match because the the dashes were replaced with periods when the data were imported. This is kind of okay because `DESeq2` would complain if your colnames had dashes. We can use `gsub()` to replace the dashes with periods. ``` r colData_tidy <- colData %>% mutate(SAMPID = gsub("-", ".", SAMPID)) rownames(colData_tidy) <- colData_tidy$SAMPID mycols <- rownames(colData_tidy) head(mycols) ``` ## [1] "GTEX.12ZZX.0726.SM.5EGKA" "GTEX.13D11.1526.SM.5J2NA" ## [3] "GTEX.ZAJG.0826.SM.5PNVA" "GTEX.11TT1.1426.SM.5EGIA" ## [5] "GTEX.13VXT.1126.SM.5LU3A" "GTEX.14ASI.0826.SM.5Q5EB" Then, we rename the row names. We can use `select(all_of())` to make sure that all the rows in colData are represented at columns in countData. We could modify the original files, but since they are so large and importing taking a long time, I like to save “tidy” versions for downstream analyses. ``` r counts_tidy <- counts %>% select(all_of(mycols)) head(rownames(colData_tidy) == colnames(counts_tidy)) ``` ## [1] TRUE TRUE TRUE TRUE TRUE TRUE ### Joining Data Genes can be identified by their name, their symbol, an Ensemble ID, or any number of other identifiers. Our results file uses gene symbols, but our counts file uses Ensemble IDs. Let’s read a file called “genes.txt” and combine this with our results file so that we have gene symbols, names, and ids, alongside with the p-values and other statistics. ``` r genes <- read.table("./data/genes.txt", sep = "\t", header = T, fill = T) head(genes) ``` ## HGNC.ID Approved.symbol ## 1 HGNC:5 A1BG ## 2 HGNC:37133 A1BG-AS1 ## 3 HGNC:24086 A1CF ## 4 HGNC:6 A1S9T ## 5 HGNC:7 A2M ## 6 HGNC:27057 A2M-AS1 ## Approved.name Chromosome ## 1 alpha-1-B glycoprotein 19q13.43 ## 2 A1BG antisense RNA 1 19q13.43 ## 3 APOBEC1 complementation factor 10q11.23 ## 4 symbol withdrawn, see [HGNC:12469](/data/gene-symbol-report/ ## 5 alpha-2-macroglobulin 12p13.31 ## 6 A2M antisense RNA 1 12p13.31 ## Accession.numbers NCBI.Gene.ID Ensembl.gene.ID ## 1 1 ENSG00000121410 ## 2 BC040926 503538 ENSG00000268895 ## 3 AF271790 29974 ENSG00000148584 ## 4 NA ## 5 BX647329, X68728, M11313 2 ENSG00000175899 ## 6 144571 ENSG00000245105 ## Mouse.genome.database.ID ## 1 MGI:2152878 ## 2 ## 3 MGI:1917115 ## 4 ## 5 MGI:2449119 ## 6 The column with genes symbols is called `Approved.symbol`. To combine this data frame without results. We can use the mutate function to create a new column based off the row names. Let’s save this as `resultsSymbol`. ``` r resultsSymbol <- results %>% mutate(Approved.symbol = row.names(.)) head(resultsSymbol) ``` ## logFC AveExpr t P.Value adj.P.Val B ## A1BG 0.67408600 1.6404652 2.1740238 0.03283291 0.1536518 -3.617093 ## A1BG-AS1 0.23168690 -0.1864802 1.0403316 0.30150123 0.5316030 -4.984225 ## A2M 0.02453974 9.8251848 0.1948624 0.84602333 0.9215696 -5.783835 ## A2M-AS1 0.38115436 2.4535892 2.4839630 0.01520646 0.1033370 -3.067127 ## A2ML1 0.58865741 -1.0412696 1.8263856 0.07173966 0.2328150 -4.065276 ## A2MP1 0.31631081 -0.8994146 1.4061454 0.16377753 0.3730822 -4.583435 ## Approved.symbol ## A1BG A1BG ## A1BG-AS1 A1BG-AS1 ## A2M A2M ## A2M-AS1 A2M-AS1 ## A2ML1 A2ML1 ## A2MP1 A2MP1 Now, we can use one of the join functions to combine two data frames. `left_join` will return all records from the left table and any matching values from the right. `right_join` will return all values from the right table and any matching values from the left. `inner_join` will return records that have values in both tables. `full_join` will return everything. ``` r resultsName <- left_join(resultsSymbol, genes, by = "Approved.symbol") head(resultsName) ``` ## logFC AveExpr t P.Value adj.P.Val B ## 1 0.67408600 1.6404652 2.1740238 0.03283291 0.1536518 -3.617093 ## 2 0.23168690 -0.1864802 1.0403316 0.30150123 0.5316030 -4.984225 ## 3 0.02453974 9.8251848 0.1948624 0.84602333 0.9215696 -5.783835 ## 4 0.38115436 2.4535892 2.4839630 0.01520646 0.1033370 -3.067127 ## 5 0.58865741 -1.0412696 1.8263856 0.07173966 0.2328150 -4.065276 ## 6 0.31631081 -0.8994146 1.4061454 0.16377753 0.3730822 -4.583435 ## Approved.symbol HGNC.ID Approved.name Chromosome ## 1 A1BG HGNC:5 alpha-1-B glycoprotein 19q13.43 ## 2 A1BG-AS1 HGNC:37133 A1BG antisense RNA 1 19q13.43 ## 3 A2M HGNC:7 alpha-2-macroglobulin 12p13.31 ## 4 A2M-AS1 HGNC:27057 A2M antisense RNA 1 12p13.31 ## 5 A2ML1 HGNC:23336 alpha-2-macroglobulin like 1 12p13.31 ## 6 A2MP1 HGNC:8 alpha-2-macroglobulin pseudogene 1 12p13.31 ## Accession.numbers NCBI.Gene.ID Ensembl.gene.ID ## 1 1 ENSG00000121410 ## 2 BC040926 503538 ENSG00000268895 ## 3 BX647329, X68728, M11313 2 ENSG00000175899 ## 4 144571 ENSG00000245105 ## 5 AK057908 144568 ENSG00000166535 ## 6 M24415 3 ENSG00000256069 ## Mouse.genome.database.ID ## 1 MGI:2152878 ## 2 ## 3 MGI:2449119 ## 4 ## 5 ## 6 Congratulations! You have successfully joined two tables. Now, you can filter and select columns to make a pretty table of the DEGS. . ``` r resultsNameTidy <- resultsName %>% filter(adj.P.Val < 0.05, logFC > 1 | logFC < -1) %>% arrange(adj.P.Val) %>% select(Approved.symbol, Approved.name, Ensembl.gene.ID, logFC, AveExpr, adj.P.Val) head(resultsNameTidy) ``` ## Approved.symbol Approved.name Ensembl.gene.ID logFC ## 1 EDA2R <NA> <NA> 1.253278 ## 2 PTCHD4 <NA> <NA> 1.962957 ## 3 BTBD11 BTB domain containing 11 ENSG00000151136 -1.194207 ## 4 MTHFD2P1 <NA> <NA> 1.825674 ## 5 C4orf54 chromosome 4 open reading frame 54 ENSG00000248713 -2.824211 ## 6 LOC101929331 <NA> <NA> 1.129013 ## AveExpr adj.P.Val ## 1 1.0260046 0.0004544671 ## 2 -1.7174066 0.0004544671 ## 3 0.4506981 0.0044764970 ## 4 -1.6578790 0.0044764970 ## 5 2.7276196 0.0159674585 ## 6 -1.6306733 0.0175055847 ### Lengthening data The matrix form of the count data is required for some pipelines, but many R programs are better suited to data in a long format where each row is an observation. I like to create `counts_tidy_long` file that can be easily subset by variables or genes of interest. Because the count files are so large, it is good to filter the counts first. I’ll filter by `rowSums(.) > 0` and then take the top 6 with `head()`. Then create a column for lengthening. ``` r counts_tidy_slim <- counts_tidy %>% filter(rowSums(.) >0 ) %>% head() %>% mutate(Ensembl.gene.ID = row.names(.) ) head(counts_tidy_slim)[1:5] ``` ## GTEX.12ZZX.0726.SM.5EGKA GTEX.13D11.1526.SM.5J2NA ## ENSG00000223972.5 630 877 ## ENSG00000278267 1686 1467 ## ENSG00000227232.5 60803 59166 ## ENSG00000284332 0 0 ## ENSG00000243485.5 459 356 ## ENSG00000237613.2 717 0 ## GTEX.ZAJG.0826.SM.5PNVA GTEX.11TT1.1426.SM.5EGIA ## ENSG00000223972.5 1211 344 ## ENSG00000278267 3369 675 ## ENSG00000227232.5 71878 37289 ## ENSG00000284332 0 0 ## ENSG00000243485.5 192 237 ## ENSG00000237613.2 303 0 ## GTEX.13VXT.1126.SM.5LU3A ## ENSG00000223972.5 76 ## ENSG00000278267 1198 ## ENSG00000227232.5 43725 ## ENSG00000284332 76 ## ENSG00000243485.5 288 ## ENSG00000237613.2 252 Now we can pivot longer. We use `cols` to specify which column names will be turned into observations and we use `names_to` to specify the name of the new column that contains those observations. We use `values_to` to name the column with the corresonding value, in this case we will call the new columns, `SAMPID` and `counts`. ``` r counts_tidy_long <- counts_tidy_slim %>% pivot_longer(cols = all_of(mycols), names_to = "SAMPID", values_to = "counts") head(counts_tidy_long) ``` ## # A tibble: 6 × 3 ## Ensembl.gene.ID SAMPID counts ## <chr> <chr> <dbl> ## 1 ENSG00000223972.5 GTEX.12ZZX.0726.SM.5EGKA 630 ## 2 ENSG00000223972.5 GTEX.13D11.1526.SM.5J2NA 877 ## 3 ENSG00000223972.5 GTEX.ZAJG.0826.SM.5PNVA 1211 ## 4 ENSG00000223972.5 GTEX.11TT1.1426.SM.5EGIA 344 ## 5 ENSG00000223972.5 GTEX.13VXT.1126.SM.5LU3A 76 ## 6 ENSG00000223972.5 GTEX.14ASI.0826.SM.5Q5EB 895 Now, that we have a `SAMPID` column, we can join this with our colData_tidy and then make some pretty plots. ``` r counts_tidy_long_joined <- counts_tidy_long%>% inner_join(., colData_tidy, by = "SAMPID") %>% arrange(desc(counts)) head(counts_tidy_long_joined) ``` ## # A tibble: 6 × 12 ## Ensembl.gene.ID SAMPID counts SMTS SMTSD SUBJID SEX AGE SMRIN DTHHRDY ## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> ## 1 ENSG00000227232.5 GTEX.1… 136022 Heart Heart… GTEX-… Male 60-69 6.1 Fast d… ## 2 ENSG00000227232.5 GTEX.1… 132414 Heart Heart… GTEX-… Male 60-69 6.4 Fast d… ## 3 ENSG00000227232.5 GTEX.1… 127732 Heart Heart… GTEX-… Fema… 50-59 6.3 Fast d… ## 4 ENSG00000227232.5 GTEX.Y… 108293 Heart Heart… GTEX-… Fema… 40-49 7.4 Violen… ## 5 ENSG00000227232.5 GTEX.1… 99074 Heart Heart… GTEX-… Fema… 50-59 6.5 Interm… ## 6 ENSG00000227232.5 GTEX.Z… 97872 Heart Heart… GTEX-… Fema… 60-69 6.3 Violen… ## # … with 2 more variables: SRA <chr>, DATE <chr> ``` r library(scales) counts_tidy_long_joined %>% ggplot(aes(x = AGE, y = counts)) + geom_boxplot() + geom_point() + facet_wrap(~Ensembl.gene.ID, scales = "free_y") + scale_y_log10(labels = label_number_si()) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` ![](https://i.imgur.com/hfwQLDx.png) ![](./images/boxplot2-1.png) That completes our section on tidying and transforming data. :::success #### Key functions: Tidy and Transform | Function | Description | |------------------|-------------------------------------------------------------------------------------------------| | `filter()` | A function for filtering data | | `mutate()` | A function for create new columns | | `select()` | A function for selecting/reordering columns | | `arrange()` | A function for ordering observations | | `full_join()` | Join 2 tables, return all observations | | `left_join()` | Join 2 tables, return all observations in the left and matching observations in the right table | | `inner_join()` | Join 2 tables, return observations with values in both tables | | `pivot_wider()` | Widen a data frame | | `pivot_longer()` | Lengthen a data frame | | `drop_na()` | Remove missing values | | `separate()` | Separate a column into two columns | ::: ## Communicate ### R Markdown The workshop notes for using this repository to teach an Introduction to R for RNA-seq are created with the file `r4rnaseq-workshop.Rmd`. ### References - [R for Data Science by Hadley Wickham and Garrett Grolemund](https://r4ds.had.co.nz/index.html) - [Rouillard et al. 2016. The Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database (Oxford).](http://database.oxfordjournals.org/content/2016/baw100.short) ### Additional Resources - [RStudio cheatsheet for readr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf) - [RStudio cheatsheet for dplyr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf) - [RStudio cheatsheet for data Wrangling with dplyr](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) - [ggplot point shapes](http://www.sthda.com/english/wiki/ggplot2-point-shapes) - [Angus 2019 Intro to R Lesson](https://angus.readthedocs.io/en/2019/R_Intro_Lesson.html) - [Angus 2019 Differential Gene Expression in R Lesson](https://angus.readthedocs.io/en/2019/diff-ex-and-viz.html) - [Software Carpentry R Lesson](http://swcarpentry.github.io/r-novice-inflammation/) *Note: the source document [r4rnaseq-workshop.Rmd](https://github.com/nih-cfde/training-rstudio-binder/blob/data/GTEx/r4rnaseq-workshop.Rmd) was last modified 27 April, 2022.*

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.