owned this note
owned this note
Published
Linked with GitHub
---
tags: workshop
---
# An Introduction to R for RNA-Seq Analysis March 2022
<!---
*Even though you can
[![hackmd-github-sync-badge](https://hackmd.io/2ArmQGwGT0uUyL5Ehqy0hQ/badge)](https://hackmd.io/2ArmQGwGT0uUyL5Ehqy0hQ),
please make changes by editing [this .Rmd
file](https://github.com/nih-cfde/training-rstudio-binder/blob/data/GTEx/r4rnaseq-workshop.Rmd).*
--->
**Description**
This free two-hour workshop will provide an overview of the R
programming language and the user-friendly RStudio environment for
exploratory RNA-sequencing analysis. You will be introduced to R syntax,
variables, functions, packages, and data structures. You will learn the
basics of data wrangling including importing data from files,
sub-setting, joining, filtering, and more. You will gain hands-on
practice calculating and visualizing differential gene expression using
popular open-source packages and public RNA-Seq data from the Gene
Expression Tissue Project (GTEx).
*Please fill out the [pre-workshop survey](https://forms.gle/SkADeiK6F65VcWTZA) to help us keep these workshops free.*
**When:** Wednesday, March 23, 10 am - 12 pm PT
**Where:**
[Zoom](https://zoom.us/j/7575820324?pwd=d2UyMEhYZGNiV3kyUFpUL1EwQmthQT09)
**Helpers:** Dr. Amanda Charbonneau and Dr. Jessica Lumian
**Instructor:** Dr. Rayna Harris
Click
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nih-cfde/training-rstudio-binder/data?urlpath=rstudio)
to generate a computing environment for this workshop. Souce code available at: https://github.com/nih-cfde/training-rstudio-binder/releases/tag/2022.03
### Overview
:::info
#### Learning Objectives
In this workshop, you will learn how to use R and RStudio to import,
tidy, transform, and visualize data structures commonly associated with
RNA-sequencing experiments. Specifically, you will:
- Explore public RNA-Seq data from the Gene-Tissue Expression Project
- Select variables and observations that are relevant to research
questions
- Rename variables and experimental factors for data joining and
plotting
- Visualize raw and summarized data using bar graphs, scatter plots,
and box plots
:::
[TOC]
## Introduction
The book [“R for Data Science”](https://r4ds.had.co.nz/index.html)
provides an excellent framework for using data science to turn raw data
into understanding, insight, and knowledge. We will use this framework
as an outline for this workshop.
**R** is a statistical computing and data visualization programming
language. **RStudio** is an integrated development environment, or IDE,
for R programming. R and RStudio work on Mac, Linux, and Windows
operating systems. The RStudio layout displays lots of useful
information and allows us to run R code from a script and view and save
outputs all from one interface.
When you start RStudio, you’ll see two key regions in the interface: the
console and the output. When working in R, you can type directly into
the console, or you can type into a script. Saving commands in a script
will make it easier to reproduce. You will learn more as we go along!
![reference link](https://hackmd.io/_uploads/SkkxxSHeq.png =300x)
![](https://hackmd.io/_uploads/H1a8-HHx5.png =300x)
For today’s lesson, we will focus on data from the [Gene-Tissue Expression (GTEx) Project](https://commonfund.nih.gov/gtex). The GTEx is an
ongoing effort to build a comprehensive public resource to study
tissue-specific gene expression and regulation. Samples were collected
from 54 non-diseased tissue sites across nearly 1000 individuals,
primarily for molecular assays including WGS, WES, and RNA-Seq.
By the end of today’s workshop, you create tables and plots like the
ones below that give an overview of the samples collected and the
variables that can be used for differential gene expression analysis.
More importantly, you will learn the basics of importing, tidying,
transforming, and visualizing data, which are the key component of an R
workflow.
### Some motivating questions
- How many RNA-sequencing samples are in the GTEx project?
- What is the age and sex of each donor?
- What was the cause of death?
- What is the effect of age on gene expression in the heart?
- Do you have enough samples to test the effects of sex, age, hardy
scale, and their interactions for all tissues?
- How do I combine, clean, modify, separate, etc. data sets and
variables?
- How is my gene of interest affected by age in the heart and muscle?
![](https://hackmd.io/_uploads/BJW4ua_Mq.png =300x) ![](https://hackmd.io/_uploads/r1wddadG9.png =300x)
![](https://hackmd.io/_uploads/S1iVdTOM5.png =300x) ![](https://hackmd.io/_uploads/HkwOO6df9.png =300x)
### Getting Started
1. Click the
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nih-cfde/training-rstudio-binder/data?urlpath=rstudio)
button to generate a computing environment for this workshop.
2. Navigate to the GTEx folder.
3. Click `GTEx.Rproj` and click “Yes” to open up an Rproject. This will
set the working directory to `~/GTEx/`.
4. Open the `r4rnaseq-workshop.R` file which contains all the commands
for today’s workshop. This contains all the commands we will build
today. This is your reference.
5. Open a new R Script by clicking **File > New File > R Script**. You
will type most commands for today’s lesson here and click “Run” to
send them to the console.
### R packages
[**Tidyverse**](https://www.tidyverse.org/) is a collection of R
packages that include functions, data, and documentation that provide
more tools and capabilities when using R. There are many packages for R,
but we are using this set because the packages are designed to work well
together -and they are especially useful (and popular!) for doing data
science.
You can install the complete tidyverse with a single line of code:
`install.packages("tidyverse")`, or you can install packages
individually (e.g. `install.packages("ggplot2")`). It is a good idea to
“comment out” this line of code by adding a `#` at the beginning so that
you don’t re-install the package every time you run the script. For this
workshop, the packages listed in the `.binder/environment.yml` file were
pre-installed with Conda. For some reason, the tidyverse package doesn’t
always install properly, so we installed each package individually.
``` r
#install.packages("tidyverse")
#install.packages("ggplot2")
#install.packages("tidyr")
```
After installing packages, we need to load the functions and tools we
want to use from the package with the `library()` command. Let’s load
the following packages: `ggplot2, tidyr, dplyr, readr, and tibble` by
copying and pasting or typing these commands into your new script.
``` r
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(tibble)
```
:::warning
#### Challenge
We will also use `cowplot` and `scales` to make pretty visualizations,
`forcats.` for working with factors, and `stringr` for parsing text.
What commands do you need to add to your script to load these packages?
:::spoiler
`library(cowplot)`
`library(scales)`
`library(forcats)`
`library(stringr)`
:::
You can also navigate to the “Packages” tab in the bottom right pane of
RStudio to view a list of available packages. Packages with a checked
box next to them have been successfully loaded. You can click a box to
load installed packages. Clicking the “Help” Tab will provide a quick
description of the package and its functions.
:::success
#### Key functions
| Function | Description |
|----------------------|---------------------------------------------|
| `install.packages()` | An R function to install packages |
| `library()` | The command used to load installed packages |
:::
## Import
Data can be imported using packages from base R or the tidyverse. What
are some differences between the data objects imported by base R
functions such as `read.csv()` and Tidyverse functions such as
`read_csv()`? To begin with, `read.csv()` replaces spaces and dashes
periods in column names, and it also preserves row.names. On the other
hand, `read_csv()` preserves spaces and dashes in column names but drops
row.names. For this workshop, we will use `read_csv()`, which means we
may have to replace dashes with periods so that our sample names in all
objects with sample name information.
Today, I will show you how to import the following files:
1. data/GTExPortal.csv
2. data/GTExHeart_20-29_vs_70-79.tsv
3. data/colData.HEART.csv
4. data/countData.HEART.csv.gz
Then, you will practice on your own using the following files:
1. data/GTExMuscle_20-29_vs_70-79.tsv
2. data/colData.MUSCLE.csv
3. data/countData.MUSCLE.csv.gz
The `GTExPortal.csv` file in `./data/` contains information about all
the samples in the GTEx portal. Let’s import this file using
`read.csv()`.
``` r
samples <- read.csv("./data/GTExPortal.csv")
```
After importing a file, there are multiple ways to view the data.
`head()` to view the first few lines of each file. `names()` will print
just the column names.
``` r
head(samples)
```
## Tissue.Sample.ID Tissue Subject.ID Sex Age.Bracket
## 1 GTEX-1117F-0126 Skin - Sun Exposed (Lower leg) GTEX-1117F female 60-69
## 2 GTEX-1117F-0226 Adipose - Subcutaneous GTEX-1117F female 60-69
## 3 GTEX-1117F-0326 Nerve - Tibial GTEX-1117F female 60-69
## 4 GTEX-1117F-0426 Muscle - Skeletal GTEX-1117F female 60-69
## 5 GTEX-1117F-0526 Artery - Tibial GTEX-1117F female 60-69
## 6 GTEX-1117F-0626 Artery - Coronary GTEX-1117F female 60-69
## Hardy.Scale Pathology.Categories
## 1 Slow death
## 2 Slow death
## 3 Slow death clean_specimens
## 4 Slow death
## 5 Slow death monckeberg, sclerotic
## 6 Slow death
## Pathology.Notes
## 1 6 pieces, minimal fat, squamous epithelium is ~50-70 microns
## 2 2 pieces, ~15% vessel stroma, rep delineated
## 3 2 pieces, clean specimens
## 4 2 pieces, !5% fibrous connective tissue, delineated (rep)
## 5 2 pieces, clean, Monckebeg medial sclerosis, rep delineated
## 6 2 pieces, up to 4mm aderent fat/nerve/vessel, delineated
``` r
names(samples)
```
## [1] "Tissue.Sample.ID" "Tissue" "Subject.ID"
## [4] "Sex" "Age.Bracket" "Hardy.Scale"
## [7] "Pathology.Categories" "Pathology.Notes"
Very large tabular files are often saved as .tsv files. These can be
imported with `read.table()` or `read_tsv()`. You can also specify the
tab delimiter as well as the row and column names. You can import files
using the default parameters or you can change them. Because the first
column in the .tsv files does not have a row name, by default,
`read.table()`, imports the first column as the row.names. When
`sep = "\t", header = TRUE` is specified, the fist column is imported as
column one and given the column name `X`.
``` r
# with row.names
results <- read.table("./data/GTEx_Heart_20-29_vs_30-39.tsv")
head(results)
```
## logFC AveExpr t P.Value adj.P.Val B
## A1BG 0.10332788 1.3459363 0.3221575 0.7482217611 0.87480317 -5.672644
## A1BG-AS1 0.13609230 -0.2381928 0.6395041 0.5244264675 0.73078056 -5.345563
## A2M -0.01605178 9.7981987 -0.1132389 0.9101410387 0.95645802 -5.956689
## A2M-AS1 0.60505571 2.5392220 3.4884410 0.0008131523 0.05545654 -0.635100
## A2ML1 0.35413535 -1.1667406 1.0788316 0.2840898578 0.52922642 -4.948617
## A2MP1 0.65764737 -0.7564399 3.2615528 0.0016630789 0.06067003 -1.358971
``` r
# without row.names
results2 <- read.table("./data/GTEx_Heart_20-29_vs_30-39.tsv", sep = "\t", header = TRUE )
head(results2)
```
## X logFC AveExpr t P.Value adj.P.Val B
## 1 A1BG 0.10332788 1.3459363 0.3221575 0.7482217611 0.87480317 -5.672644
## 2 A1BG-AS1 0.13609230 -0.2381928 0.6395041 0.5244264675 0.73078056 -5.345563
## 3 A2M -0.01605178 9.7981987 -0.1132389 0.9101410387 0.95645802 -5.956689
## 4 A2M-AS1 0.60505571 2.5392220 3.4884410 0.0008131523 0.05545654 -0.635100
## 5 A2ML1 0.35413535 -1.1667406 1.0788316 0.2840898578 0.52922642 -4.948617
## 6 A2MP1 0.65764737 -0.7564399 3.2615528 0.0016630789 0.06067003 -1.358971
I find it helpful to import a table of gene names and symbols that can
be merged with other tables with gene information or searched for useful
information. This table of gene names was downloaded from
<https://www.genenames.org/download/custom/>.
In this case, we use `fill = T` to fill missing fields with a NULL
value.
``` r
genes <- read.table("./data/genes.txt", sep = "\t", header = T, fill = T)
head(genes)
```
## HGNC.ID Approved.symbol
## 1 HGNC:5 A1BG
## 2 HGNC:37133 A1BG-AS1
## 3 HGNC:24086 A1CF
## 4 HGNC:6 A1S9T
## 5 HGNC:7 A2M
## 6 HGNC:27057 A2M-AS1
## Approved.name Chromosome
## 1 alpha-1-B glycoprotein 19q13.43
## 2 A1BG antisense RNA 1 19q13.43
## 3 APOBEC1 complementation factor 10q11.23
## 4 symbol withdrawn, see [HGNC:12469](/data/gene-symbol-report/
## 5 alpha-2-macroglobulin 12p13.31
## 6 A2M antisense RNA 1 12p13.31
## Accession.numbers NCBI.Gene.ID Ensembl.gene.ID
## 1 1 ENSG00000121410
## 2 BC040926 503538 ENSG00000268895
## 3 AF271790 29974 ENSG00000148584
## 4 NA
## 5 BX647329, X68728, M11313 2 ENSG00000175899
## 6 144571 ENSG00000245105
## Mouse.genome.database.ID
## 1 MGI:2152878
## 2
## 3 MGI:1917115
## 4
## 5 MGI:2449119
## 6
:::info Using the Terminal to uncompress files
Very large data files, such as files with RNA-Seq counts are often
compressed before they are shared. To uncompress a file, click on the
Terminal tab and run the following command.
gunzip -k ./data/countData.HEART.csv.gz
:::
Once that file is uncompressed, it can be imported. Count files can be
very long and wide, so it is a good idea to only view the first (or
last) few rows and columns. Typically, a gene identifier (like an
ensemble id) will be used as the row names. We can use `dim` to see how
many rows and columns are in the file.
``` r
counts <- read.csv("./data/countData.HEART.csv",
header = TRUE, row.names = 1)
dim(counts)
```
## [1] 63856 306
``` r
head(counts)[1:5]
```
## GTEX.12ZZX.0726.SM.5EGKA.1 GTEX.13D11.1526.SM.5J2NA.1
## ENSG00000278704.1 0 0
## ENSG00000277400.1 0 0
## ENSG00000274847.1 0 0
## ENSG00000277428.1 0 0
## ENSG00000276256.1 0 0
## ENSG00000278198.1 0 0
## GTEX.ZAJG.0826.SM.5PNVA.1 GTEX.11TT1.1426.SM.5EGIA.1
## ENSG00000278704.1 0 0
## ENSG00000277400.1 0 0
## ENSG00000274847.1 0 0
## ENSG00000277428.1 0 0
## ENSG00000276256.1 0 0
## ENSG00000278198.1 0 0
## GTEX.13VXT.1126.SM.5LU3A.1
## ENSG00000278704.1 0
## ENSG00000277400.1 0
## ENSG00000274847.1 0
## ENSG00000277428.1 0
## ENSG00000276256.1 0
## ENSG00000278198.1 0
This “countData” was generated by using `recount3` as described in the
file `scripts/recount3.Rmd`. It comes from a Ranged Summarized
Experiment (rse) which contains quantitative information about read
counts as well as quality control information and sample descriptions.
The “colData” from an rse can also be obtained. This information
*should* match the information in our samples file, but there can be
subtle differences in formatting Let’s read the colData file for the
heart samples.
``` r
colData <- read.csv("./data/colData.HEART.csv")
head(colData)
```
## X external_id
## 1 GTEX-12ZZX-0726-SM-5EGKA.1 GTEX-12ZZX-0726-SM-5EGKA.1
## 2 GTEX-13D11-1526-SM-5J2NA.1 GTEX-13D11-1526-SM-5J2NA.1
## 3 GTEX-ZAJG-0826-SM-5PNVA.1 GTEX-ZAJG-0826-SM-5PNVA.1
## 4 GTEX-11TT1-1426-SM-5EGIA.1 GTEX-11TT1-1426-SM-5EGIA.1
## 5 GTEX-13VXT-1126-SM-5LU3A.1 GTEX-13VXT-1126-SM-5LU3A.1
## 6 GTEX-14ASI-0826-SM-5Q5EB.1 GTEX-14ASI-0826-SM-5Q5EB.1
## gtex.smtsd study gtex.smts gtex.subjid gtex.sampid
## 1 Heart - Atrial Appendage HEART Heart GTEX-12ZZX GTEX-12ZZX-0726-SM-5EGKA
## 2 Heart - Atrial Appendage HEART Heart GTEX-13D11 GTEX-13D11-1526-SM-5J2NA
## 3 Heart - Left Ventricle HEART Heart GTEX-ZAJG GTEX-ZAJG-0826-SM-5PNVA
## 4 Heart - Atrial Appendage HEART Heart GTEX-11TT1 GTEX-11TT1-1426-SM-5EGIA
## 5 Heart - Left Ventricle HEART Heart GTEX-13VXT GTEX-13VXT-1126-SM-5LU3A
## 6 Heart - Atrial Appendage HEART Heart GTEX-14ASI GTEX-14ASI-0826-SM-5Q5EB
## gtex.run_acc gtex.sex gtex.age gtex.dthhrdy gtex.smrin gtex.smcenter
## 1 SRR1340617 2 40-49 1 7.1 C1
## 2 SRR1345436 2 50-59 0 8.9 B1
## 3 SRR1367456 2 50-59 3 6.4 C1
## 4 SRR1378243 1 20-29 0 9.0 B1
## 5 SRR1381693 2 20-29 0 8.6 B1
## 6 SRR1335164 1 60-69 2 6.4 C1
## gtex.smpthnts
## 1 2 pieces, adherent/interstitial fat is ~40% of specimens
## 2 2 pieces, no abnormalities, ~25% fat, delineated
## 3 2 pieces, mild-moderate interstitial fibrosis, mild ischemic changes
## 4 2 pieces, one piece is 40% fat
## 5 2 pieces; 1 piece contains 30% external fat
## 6 2 pieces
## gtex.smnabtchd recount_qc.aligned_reads..chrm recount_qc.aligned_reads..chrx
## 1 10/22/2013 21.68 1.95
## 2 12/04/2013 22.77 1.82
## 3 10/31/2013 27.67 1.76
## 4 10/24/2013 23.99 1.98
## 5 12/17/2013 33.66 1.52
## 6 01/17/2014 15.45 1.99
## recount_qc.aligned_reads..chry recount_qc.bc_auc.all_reads_all_bases
## 1 0.00 5121146510
## 2 0.00 6606164884
## 3 0.01 5307211837
## 4 0.05 4433076550
## 5 0.00 7188560773
## 6 0.08 7130421400
## Date
## 1 2013-10-22
## 2 2013-12-04
## 3 2013-10-31
## 4 2013-10-24
## 5 2013-12-17
## 6 2014-01-17
:::warning
#### Challenge
What commands could you use to read the following files: 1. GTEx results
comparing the muscles of 20-29 year old to 70-79 year olds? 1. The csv
file information describing the muscle samples?
:::spoiler
1. `read.table("./data/GTEx_Muscle_20-29_vs_70-79.tsv")`
2. `read.csv("./data/colData.MUSCLE.csv")`
:::
#### Quick summary statistics and sample size
You have now seen a variety of options for importing files. You may use
many more in your R-based RNA-seq workflow, but these basics will get
you started. Let’s now explore the functions `summary()`, `length()`,
`dim()`, and `count()` us to quickly summarize and compare data frames
to answer the following questions.
How many transcripts were counted in the Heart tissues? Over 63,000.
``` r
dim(counts)
```
## [1] 63856 306
``` r
length(row.names(counts))
```
## [1] 63856
How many genes are in this version of the human transcriptome? Over
15,000.
``` r
dim(genes)
```
## [1] 15659 8
``` r
length(genes$Approved.symbol)
```
## [1] 15659
How many samples are there in the GTEx portal (as of this version)? Over
25,000!
``` r
dim(samples)
```
## [1] 25713 8
``` r
length(samples$Tissue.Sample.ID)
```
## [1] 25713
How many samples are there per tissue?
``` r
dplyr::count(samples, Tissue)
```
## Tissue n
## 1 Adipose - Subcutaneous 978
## 2 Adipose - Visceral (Omentum) 815
## 3 Adrenal Gland 717
## 4 Artery - Aorta 858
## 5 Artery - Coronary 662
## 6 Artery - Tibial 979
## 7 Bladder 130
## 8 Brain - Cerebellum 426
## 9 Brain - Cortex 428
## 10 Breast - Mammary Tissue 894
## 11 Cervix - Ectocervix 38
## 12 Cervix - Endocervix 42
## 13 Colon - Sigmoid 800
## 14 Colon - Transverse 937
## 15 Esophagus - Gastroesophageal Junction 787
## 16 Esophagus - Mucosa 963
## 17 Esophagus - Muscularis 958
## 18 Fallopian Tube 43
## 19 Heart - Atrial Appendage 614
## 20 Heart - Left Ventricle 761
## 21 Kidney - Cortex 557
## 22 Kidney - Medulla 49
## 23 Liver 610
## 24 Lung 860
## 25 Minor Salivary Gland 247
## 26 Muscle - Skeletal 1001
## 27 Nerve - Tibial 975
## 28 Ovary 255
## 29 Pancreas 865
## 30 Pituitary 428
## 31 Prostate 599
## 32 Skin - Not Sun Exposed (Suprapubic) 818
## 33 Skin - Sun Exposed (Lower leg) 1001
## 34 Small Intestine - Terminal Ileum 798
## 35 Spleen 874
## 36 Stomach 939
## 37 Testis 592
## 38 Thyroid 902
## 39 Uterus 237
## 40 Vagina 276
How many samples are there per tissue and sex? Can we test the effect of
sex on gene expression in all tissues? For many samples, yes, but not
all tissues were samples from both males and females.
``` r
dplyr::count(samples, Tissue, Sex)
```
## Tissue Sex n
## 1 Adipose - Subcutaneous female 325
## 2 Adipose - Subcutaneous male 653
## 3 Adipose - Visceral (Omentum) female 269
## 4 Adipose - Visceral (Omentum) male 546
## 5 Adrenal Gland female 232
## 6 Adrenal Gland male 485
## 7 Artery - Aorta female 281
## 8 Artery - Aorta male 577
## 9 Artery - Coronary female 227
## 10 Artery - Coronary male 435
## 11 Artery - Tibial female 324
## 12 Artery - Tibial male 655
## 13 Bladder female 51
## 14 Bladder male 79
## 15 Brain - Cerebellum female 121
## 16 Brain - Cerebellum male 305
## 17 Brain - Cortex female 121
## 18 Brain - Cortex male 307
## 19 Breast - Mammary Tissue female 306
## 20 Breast - Mammary Tissue male 588
## 21 Cervix - Ectocervix female 38
## 22 Cervix - Endocervix female 42
## 23 Colon - Sigmoid female 261
## 24 Colon - Sigmoid male 539
## 25 Colon - Transverse female 313
## 26 Colon - Transverse male 624
## 27 Esophagus - Gastroesophageal Junction female 256
## 28 Esophagus - Gastroesophageal Junction male 531
## 29 Esophagus - Mucosa female 327
## 30 Esophagus - Mucosa male 636
## 31 Esophagus - Muscularis female 325
## 32 Esophagus - Muscularis male 633
## 33 Fallopian Tube female 43
## 34 Heart - Atrial Appendage female 198
## 35 Heart - Atrial Appendage male 416
## 36 Heart - Left Ventricle female 252
## 37 Heart - Left Ventricle male 509
## 38 Kidney - Cortex female 167
## 39 Kidney - Cortex male 390
## 40 Kidney - Medulla female 16
## 41 Kidney - Medulla male 33
## 42 Liver female 192
## 43 Liver male 418
## 44 Lung female 282
## 45 Lung male 578
## 46 Minor Salivary Gland female 74
## 47 Minor Salivary Gland male 173
## 48 Muscle - Skeletal female 335
## 49 Muscle - Skeletal male 666
## 50 Nerve - Tibial female 325
## 51 Nerve - Tibial male 650
## 52 Ovary female 255
## 53 Pancreas female 287
## 54 Pancreas male 578
## 55 Pituitary female 121
## 56 Pituitary male 307
## 57 Prostate male 599
## 58 Skin - Not Sun Exposed (Suprapubic) female 267
## 59 Skin - Not Sun Exposed (Suprapubic) male 551
## 60 Skin - Sun Exposed (Lower leg) female 335
## 61 Skin - Sun Exposed (Lower leg) male 666
## 62 Small Intestine - Terminal Ileum female 261
## 63 Small Intestine - Terminal Ileum male 537
## 64 Spleen female 289
## 65 Spleen male 585
## 66 Stomach female 317
## 67 Stomach male 622
## 68 Testis male 592
## 69 Thyroid female 308
## 70 Thyroid male 594
## 71 Uterus female 237
## 72 Vagina female 276
How many samples are there per sex, age, and hardy scale in the HEART
tissues? Do you have enough samples to test the effects of Sex, Age, and
Hardy Scale in the Heart?
*Note: Let’s use the colData data frame for this since it is specific to
Heart. Later we will talk about subsetting.*
``` r
#names(colData)
dplyr::count(colData, gtex.smts, gtex.sex, gtex.age, gtex.dthhrdy)
```
## gtex.smts gtex.sex gtex.age gtex.dthhrdy n
## 1 Heart 1 20-29 0 2
## 2 Heart 1 20-29 2 1
## 3 Heart 1 30-39 0 12
## 4 Heart 1 30-39 1 4
## 5 Heart 1 40-49 0 15
## 6 Heart 1 40-49 2 7
## 7 Heart 1 40-49 3 2
## 8 Heart 1 40-49 4 3
## 9 Heart 1 50-59 0 32
## 10 Heart 1 50-59 2 24
## 11 Heart 1 50-59 3 5
## 12 Heart 1 50-59 4 7
## 13 Heart 1 60-69 0 18
## 14 Heart 1 60-69 1 4
## 15 Heart 1 60-69 2 36
## 16 Heart 1 60-69 3 12
## 17 Heart 1 60-69 4 9
## 18 Heart 1 70-79 0 2
## 19 Heart 1 70-79 2 3
## 20 Heart 2 20-29 0 6
## 21 Heart 2 30-39 0 3
## 22 Heart 2 40-49 0 15
## 23 Heart 2 40-49 1 3
## 24 Heart 2 40-49 3 1
## 25 Heart 2 50-59 0 27
## 26 Heart 2 50-59 2 7
## 27 Heart 2 50-59 3 4
## 28 Heart 2 50-59 4 2
## 29 Heart 2 60-69 0 24
## 30 Heart 2 60-69 1 1
## 31 Heart 2 60-69 2 9
## 32 Heart 2 60-69 3 3
## 33 Heart 2 60-69 4 3
:::warning
#### Challenge
What series commands would you use to import the
`data/colData.MUSCLE.csv` and count the number of muscles samples per
sex, age?
How many female muscles samples are there from age group 30-39?
*Hint: use head() or names() after importing a file to verify the
variable names.*
:::spoiler
df \<- read.csv(“./data/colData.MUSCLE.csv”) dplyr::count(df, gtex.smts,
gtex.sex, gtex.age) # 3 samples are in the female group age 30-39
:::
Finally, the `str()` and `summary()` commands are also quite useful for
summarizing every variable in a data frame. These let you know if R has
imported columns properly as integers, characters or factors.
``` r
str(samples)
```
## 'data.frame': 25713 obs. of 8 variables:
## $ Tissue.Sample.ID : chr "GTEX-1117F-0126" "GTEX-1117F-0226" "GTEX-1117F-0326" "GTEX-1117F-0426" ...
## $ Tissue : chr "Skin - Sun Exposed (Lower leg)" "Adipose - Subcutaneous" "Nerve - Tibial" "Muscle - Skeletal" ...
## $ Subject.ID : chr "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" ...
## $ Sex : chr "female" "female" "female" "female" ...
## $ Age.Bracket : chr "60-69" "60-69" "60-69" "60-69" ...
## $ Hardy.Scale : chr "Slow death" "Slow death" "Slow death" "Slow death" ...
## $ Pathology.Categories: chr "" "" "clean_specimens" "" ...
## $ Pathology.Notes : chr "6 pieces, minimal fat, squamous epithelium is ~50-70 microns" "2 pieces, ~15% vessel stroma, rep delineated" "2 pieces, clean specimens" "2 pieces, !5% fibrous connective tissue, delineated (rep)" ...
``` r
summary(samples)
```
## Tissue.Sample.ID Tissue Subject.ID Sex
## Length:25713 Length:25713 Length:25713 Length:25713
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Age.Bracket Hardy.Scale Pathology.Categories Pathology.Notes
## Length:25713 Length:25713 Length:25713 Length:25713
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
``` r
str(results2)
```
## 'data.frame': 15529 obs. of 7 variables:
## $ X : chr "A1BG" "A1BG-AS1" "A2M" "A2M-AS1" ...
## $ logFC : num 0.1033 0.1361 -0.0161 0.6051 0.3541 ...
## $ AveExpr : num 1.346 -0.238 9.798 2.539 -1.167 ...
## $ t : num 0.322 0.64 -0.113 3.488 1.079 ...
## $ P.Value : num 0.748222 0.524426 0.910141 0.000813 0.28409 ...
## $ adj.P.Val: num 0.8748 0.7308 0.9565 0.0555 0.5292 ...
## $ B : num -5.673 -5.346 -5.957 -0.635 -4.949 ...
``` r
summary(results2)
```
## X logFC AveExpr t
## Length:15529 Min. :-2.917546 Min. :-2.303 Min. :-4.64984
## Class :character 1st Qu.:-0.131376 1st Qu.: 2.108 1st Qu.:-1.12926
## Mode :character Median : 0.009261 Median : 4.171 Median : 0.07798
## Mean : 0.026300 Mean : 3.806 Mean : 0.03410
## 3rd Qu.: 0.170189 3rd Qu.: 5.586 3rd Qu.: 1.21902
## Max. : 2.659235 Max. :13.593 Max. : 5.38745
## P.Value adj.P.Val B
## Min. :0.0000008 Min. :0.01209 Min. :-6.250
## 1st Qu.:0.0568327 1st Qu.:0.22729 1st Qu.:-5.839
## Median :0.2412433 Median :0.48246 Median :-5.306
## Mean :0.3345941 Mean :0.49944 Mean :-4.852
## 3rd Qu.:0.5772327 3rd Qu.:0.76952 3rd Qu.:-4.282
## Max. :0.9999816 Max. :0.99998 Max. : 5.635
``` r
str(counts[1:5])
```
## 'data.frame': 63856 obs. of 5 variables:
## $ GTEX.12ZZX.0726.SM.5EGKA.1: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GTEX.13D11.1526.SM.5J2NA.1: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GTEX.ZAJG.0826.SM.5PNVA.1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GTEX.11TT1.1426.SM.5EGIA.1: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GTEX.13VXT.1126.SM.5LU3A.1: int 0 0 0 0 0 0 0 0 0 0 ...
``` r
summary(counts[1:5])
```
## GTEX.12ZZX.0726.SM.5EGKA.1 GTEX.13D11.1526.SM.5J2NA.1
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 95 Median : 76
## Mean : 75254 Mean : 103343
## 3rd Qu.: 9058 3rd Qu.: 9584
## Max. :164409376 Max. :407066080
## GTEX.ZAJG.0826.SM.5PNVA.1 GTEX.11TT1.1426.SM.5EGIA.1
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 70 Median : 38
## Mean : 80789 Mean : 69070
## 3rd Qu.: 8988 3rd Qu.: 7366
## Max. :338010074 Max. :263027953
## GTEX.13VXT.1126.SM.5LU3A.1
## Min. : 0
## 1st Qu.: 0
## Median : 47
## Mean : 111238
## 3rd Qu.: 6710
## Max. :602494565
:::success
#### Key functions for importing and quickly viewing raw and summarized data
| Function | Description |
|-----------------------|-----------------------------------------------------------------|
| `read.csv()` | A base R function for importing comma separated tabular data |
| `read_csv()` | A tidyR function for importing .csv files as tibbles |
| `read.table()` | A base R function for importing tabular data with any delimiter |
| `read_tsv()` | A tidyR function for importing .tsv files as tibbles |
| `as_tibble()` | Convert data frames to tibbles |
| `head()` and `tail()` | Print the first or last 6 lines of an object |
| `dim()` | A function that prints the dimensions of an object |
| `length()` | Calculate the length of an object |
| `count()` | A dplyr function that counts number of samples per group |
| `str()` | A function that prints the internal structure of an object |
| `summary()` | A function that summarizes each variable |
:::
## Visualize
`ggplot2` is a very popular package for making visualization. It is
built on the “grammar of graphics”. Any plot can be expressed from the
same set of components: a data set, a coordinate system, and a set of
“geoms” or the visual representation of data points such as points,
bars, line, or boxes. This is the template we build on:
ggplot(data = <DATA>, aes(<MAPPINGS>)) +
<geom_function>() +
...
### Bar plots
We just used the `count()` function to calculate how many samples are in
each group. The function for creating bar graphs (`geom_bar()`) also
makes use of `stat = "count"` to plot the total number of observations
per variable. Let’s use ggplot2 to create a visual representation of how
many samples there are per tissue, sex, and hardiness.
``` r
ggplot(samples, aes(x = Tissue)) +
geom_bar(stat = "count")
```
![](./images/bar1-1.png)
![](https://hackmd.io/_uploads/SJFrK0Uf5.png)
In the last section, we will discuss how to modify the `themes()` to
adjust the axes, legends, and more. For now, let’s flip the x and y
coordinates so that we can read the sample names. We do this by adding a
layer and the function `coord_flip()`
``` r
ggplot(samples, aes(x = Tissue)) +
geom_bar(stat = "count") +
coord_flip()
```
![](./images/bar2-1.png)
![](https://hackmd.io/_uploads/rkE8YCIM5.png)
Now, there are two ways we can visualize another variable in addition to
tissue. We can add color or we can add facets.
Let’s first color the data by age bracket. Color is an aesthetic, so it
must go inside the `aes()`. If you include `aes(color = Age.Bracket)`
inside `ggplot()`, the color will be applied to every layer in your
plot. If you add `aes(color = Age.Bracket)` inside `geom_bar()`, it will
only be applied to that layer (which is important later when you layer
multiple geoms.
head(samples)
``` r
ggplot(samples, aes(x = Tissue, color = Age.Bracket)) +
geom_bar(stat = "count") +
coord_flip()
```
![](./images/bar3-1.png)
![](https://hackmd.io/_uploads/ryJIFRIG9.png)
Note that the bars are outlined in a color according to hardy scale. If
instead, you would the bars “filled” with color, use the aesthetic
`aes(fill = Hardy.Scale)`
``` r
ggplot(samples, aes(x = Tissue, fill = Age.Bracket)) +
geom_bar(stat = "count") +
coord_flip()
```
![](./images/bar4-1.png)<!-- -->
![](https://hackmd.io/_uploads/Bks8F0IG9.png)
Now, let’s use `facet_wrap(~Sex)` to break the data into two groups
based on the variable sex.
``` r
ggplot(samples, aes(x = Tissue, fill = Age.Bracket)) +
geom_bar(stat = "count") +
coord_flip() +
facet_wrap(~Sex)
```
![](./images/bar5-1.png)<!-- -->
![](https://hackmd.io/_uploads/SkGDFALG5.png)
With this graph, we have an excellent
overview of the total numbers of RNA-Seq samples in the GTEx project,
and we can see where we are missing data (for good biological reasons).
However, this plot doesn’t show us Hardy Scale. It’s hard to layer 4
variables, so let’s remove Tissue as a variable by focusing just on one
Tissue.
:::warning
#### Challenge
Create a plot showing the total number of samples per Sex, Age Bracket,
and Hardy Scale for *just* the Heart samples. Paste the code you used in
the chat.
:::spoiler
There are many options. Here are a few.
ggplot(colData, aes(x = gtex.dthhrdy, fill = gtex.age)) +
geom_bar(stat = “count”) + facet_wrap(\~gtex.sex)
ggplot(colData, aes(x = gtex.age, fill = as.factor(gtex.dthhrdy))) +
geom_bar(stat = “count”) + facet_wrap(\~gtex.sex)
:::
One thing these plots show us is that we don’t have enough samples to
test the effects of all our experimental variables (age, sex, tissue,
and hardy scale) and their interactions on gene expression. We can,
however, focus on one or two variables or groups at a time.
### Volcano plots
Earlier, we imported the file “data/GTEx_Heart_20-29_vs_70-79.tsv”)” and
saved it as “results”. This file contains the results of a differential
gene expression analysis comparing heart tissue from 20-29 to heart
tissue from 30-39 year olds. This is a one-way design investigating only
the effect of age (but not sex or hardy scale) on gene expression in the
heart. Let’s visualize these results.
[Volcano Plots](https://en.wikipedia.org/wiki/Volcano_plot_(statistics))
are a type of scatter plots that show the log fold change (logFC) on the
x axis and the inverse log (`-log10()`) of a p-value that has been
corrected for multiple hypothesis testing (adj.P.Val). Let’s create a
Volcano Plot using the `gplot()` and `geom_point()`. *Note: this may
take a minute because there are 15,000 points that must be plotted*
``` r
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point()
```
![](./images/volcano1-1.png)<!-- -->
The inverse log of p \< 05 is 1.30103. We can add a horizontal line to
our plot using `geom_hline()` so that we can visually see how many genes
or points are significant and how many are not.
``` r
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() +
geom_hline(yintercept = -log10(0.05))
```
![](./images/volcano2-1.png)<!-- -->
![](https://hackmd.io/_uploads/ryXKt08G9.png)
:::warning #### Challenge
Create a volcano plot for the results comparing the heart tissue of
20-29 year olds to that of 70-70 year olds? Are there more or less
differential expressed gene between 20 and 30 year olds or 20 and 70
year olds?
:::spoiler
df <- read.table("data/GTEx_Heart_20-29_vs_70-79.tsv")
ggplot(df, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() +
geom_hline(yintercept = -log10(0.05))
# more
:::
### Boxplots
In addition to containing information about the donor tissue, the
“colData” file contains has a column with a RIN score, which tells us
about the quality of the data, and the facility where the RNA was
processed. If we wanted to confirm visually that were negligible
technical differences in RNA quality due to sequencing location, we
could use a box plot.
``` r
ggplot(colData, aes(x = gtex.smcenter, y = gtex.smrin)) +
geom_boxplot() +
geom_jitter(aes(color = gtex.smrin))
```
![](./images/boxplot-1.png)<!-- -->
![](https://hackmd.io/_uploads/ByzhYCLz9.png)
Now you know a handful of R functions for importing, summarizing, and
visualizing data. In the next section, we will tidy and transform our
data so that we can make even better summaries and figures. In the last
section, you will learn ggplot function for making fancier figures.
:::success
#### Key functions
| Function | Description |
|----------------|-------------------------------------------------------------------------------------------------------------------------|
| `ggplot2` | An open-source data visualization package for the statistical programming language R |
| `ggplot()` | The function used to construct the initial plot object, and is almost always followed by + to add component to the plot |
| `aes()` | Aesthetic mappings that describe how variables in the data are mapped to visual properties (aesthetics) of geoms |
| `geom_point()` | A function used to create scatterplots |
| `geom_bar()` | A function used to create bar plots |
| `coord_flip()` | Flips the x and y axis |
| `geom_hline()` | Add a horizontal line to plots |
:::
## Tidy and Transform Data
[Data wrangling](https://en.wikipedia.org/wiki/Data_wrangling) is the
process of tidying and transforming data to make it more appropriate and
valuable for a variety of downstream purposes such as analytics. The
goal of data wrangling is to assure quality and useful data. Data
analysts typically spend the majority of their time in the process of
data wrangling compared to the actual analysis of the data.
**Tidying** your data means storing it in a consistent form. When your
data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus
your struggle on questions about the data, not fighting to get the data
into the right form for different functions. Some tidying functions
include `pivot_longer()`, `pivot_wider()`, `separate()`, `unite()`,
`drop_na()`, `replace_na()`. The “lubridate” package has a number of
functions for tidying dates. You may also use `mutate()` function to
convert objects from, say, characters or integers to factors or rename
observations and variables.
**Transforming** your data includes narrowing in on observations of
interest (like all people in one city, or all data from the last year),
creating new variables that are functions of existing variables (like
computing speed from distance and time), and calculating a set of
summary statistics (like counts or means). Summary functions such as
`summarize()` and `count()` to create new tables with statistics. Before
summarizing or counting a whole data frame, you can use `group_by()` to
group variables. You can use `filter()` and `select()` to isolate
specific rows or columns, respectively. If you want to sort columns,
`arrange()` and `arrange(desc())` are two functions to familiarize
yourself with.
**Combining tables** can be accomplished in one of two ways. If all the
columns or all the rows have all the same names, you can use `rbind()`
or `cbind()`, respectively, to join the data frames. If however, each
data frame have a column (or multiple columns) that contain unique
identifiers, then you can use the family of join functions
(`inner_join()`, `outter_join()`, `left_join()`, and `right_join()`)
For each downstream analysis, you will likely use a series of tidying
and transforming steps in various order to get your data in the
appropriate format. Interest of creating dozens of intermediate files
after each step, we will use the `%>%` operator to “pipe” the output of
one function to the input of the other.
Instead of going into each function or each process in detail in
isolation, let’s start with some typical research questions and then
piece together R functions to get the desired information
### Question 1: What are the gene names and Ensemble IDs of the top 20 most differentially expressed genes (DEGs) in the heart tissue between 20-29 and 30-29 year olds?
To answer this question, we need a subset of information from both the
results and genes files. We need, in no particular order, to:
1. create a column in results with the gene symbols named “Approved.symbol”
1. filter the results for genes adj.p.value \< 0.05 (or desired alpha) and
logFC \> 1 or \<-1
1. arrange the results by descending adj.p.value 1. join the results and genes data frames by “Approved.symbol”
1. select the gene symbol, name, and ensemble id, lfc, and adj.p.val
1. create lists of ensemble ideas and gene symbols of DEGs
Let’s go deeper into each step.
#### 1.1 Create (mutate) or rename a column in results with the gene symbols named “Approved.symbol”
Currently, we have “results” with gene symbols as the row names and
“results2” has a column called “X” with the gene symbol. Depending on
which data frame we use, e can either use mutate to create a new column
with the name “Approved.symbol”, or we can rename column X.
``` r
results %>% mutate(Approved.symbol = row.names(.))
results2 %>% rename(Approved.symbol = X)
```
## # A tibble: 15,529 × 7
## logFC AveExpr t P.Value adj.P.Val B Approved.symbol
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0.103 1.35 0.322 0.748 0.875 -5.67 A1BG
## 2 0.136 -0.238 0.640 0.524 0.731 -5.35 A1BG-AS1
## 3 -0.0161 9.80 -0.113 0.910 0.956 -5.96 A2M
## 4 0.605 2.54 3.49 0.000813 0.0555 -0.635 A2M-AS1
## 5 0.354 -1.17 1.08 0.284 0.529 -4.95 A2ML1
## 6 0.658 -0.756 3.26 0.00166 0.0607 -1.36 A2MP1
## 7 0.0655 6.54 0.565 0.574 0.767 -6.07 A4GALT
## 8 0.192 5.48 1.99 0.0504 0.214 -4.37 AAAS
## 9 0.0341 4.51 0.441 0.660 0.823 -6.11 AACS
## 10 0.355 1.88 2.94 0.00432 0.0763 -2.04 AADAT
## # … with 15,519 more rows
## # A tibble: 15,529 × 7
## Approved.symbol logFC AveExpr t P.Value adj.P.Val B
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A1BG 0.103 1.35 0.322 0.748 0.875 -5.67
## 2 A1BG-AS1 0.136 -0.238 0.640 0.524 0.731 -5.35
## 3 A2M -0.0161 9.80 -0.113 0.910 0.956 -5.96
## 4 A2M-AS1 0.605 2.54 3.49 0.000813 0.0555 -0.635
## 5 A2ML1 0.354 -1.17 1.08 0.284 0.529 -4.95
## 6 A2MP1 0.658 -0.756 3.26 0.00166 0.0607 -1.36
## 7 A4GALT 0.0655 6.54 0.565 0.574 0.767 -6.07
## 8 AAAS 0.192 5.48 1.99 0.0504 0.214 -4.37
## 9 AACS 0.0341 4.51 0.441 0.660 0.823 -6.11
## 10 AADAT 0.355 1.88 2.94 0.00432 0.0763 -2.04
## # … with 15,519 more rows
#### 1.2 Filter the results for genes adj.p.value \< 0.05 (or desired alpha)
Filter is done in a few different ways depending on the type of
variable. You can use `>` and less `<` to filter greater or less than a
number. `==` and `!=` are used to filter by characters or factors that
match or do not match a specific pattern. `%in% c()` is used to filter
by things in a list. Let’s filter by adjusted p-value. You can use `|`
and `&` to mean “or” or “and”
``` r
results %>% filter(adj.P.Val < 0.05,
logFC > 1 | logFC < -1)
```
## logFC AveExpr t P.Value adj.P.Val B
## ANKRD1 -1.189850 11.5066380 -3.832167 2.605709e-04 0.04495879 0.3442151
## C4orf54 -2.236262 3.0509899 -3.625726 5.200638e-04 0.04767478 -0.2579115
## FN1 -1.021743 9.5528978 -3.665273 4.563675e-04 0.04759268 -0.1398436
## IL1R2 -1.703322 0.7638632 -3.913550 1.972401e-04 0.04456995 0.4781060
## KRT80 -1.176516 -1.0145047 -3.714169 3.878470e-04 0.04632981 -0.2947158
## LINC00310 1.218272 -1.2321395 4.098561 1.034725e-04 0.04456995 0.6526078
## LINC02610 1.185186 -1.0647221 3.942760 1.783369e-04 0.04456995 0.2795004
## MTHFD2P1 1.616058 -1.7650730 5.015657 3.394989e-06 0.02636039 2.9968710
## NAV2-AS2 1.002273 1.0153707 3.822544 2.692335e-04 0.04495879 0.2476898
## PLEKHG4B 1.061661 -1.3135604 3.942497 1.784989e-04 0.04456995 0.2317681
## RAD9B -1.140384 0.3431846 -4.018721 1.369645e-04 0.04456995 0.7204587
## RANBP3L 1.044263 0.1329902 3.626793 5.182397e-04 0.04767478 -0.3692743
## RXRG 1.088903 2.1133650 4.326187 4.577186e-05 0.04456995 1.8528496
## SPRED3 -1.006501 -0.1510558 -3.590380 5.840643e-04 0.04857888 -0.4938726
## TNC -1.642206 4.7894623 -3.699919 4.067330e-04 0.04656770 -0.1151009
## TNN 1.724698 -1.6035527 3.634121 5.058730e-04 0.04759268 -0.5822258
## TSPEAR 2.161166 1.3779887 3.918461 1.939328e-04 0.04456995 0.5533325
## TUBA1C -1.083854 4.5752140 -4.649841 1.381098e-05 0.04456995 2.9939469
#### 1.3 Arrange the results by adj.p.value, from smallest to largest
We can use the arrange function to reorder observations.
``` r
results %>% arrange(adj.P.Val)
```
## logFC AveExpr t P.Value adj.P.Val B
## NMRK1 0.6401956 3.797183 5.387450 7.786610e-07 0.01209183 5.6346663
## MTHFD2P1 1.6160575 -1.765073 5.015657 3.394989e-06 0.02636039 2.9968710
## AAGAB -0.4180789 4.886796 -3.988491 1.521860e-04 0.04456995 0.7797524
## ABHD14A 0.4435404 4.400213 4.028887 1.321830e-04 0.04456995 0.9225181
## ADCY10P1 0.6464649 3.621729 3.915168 1.961445e-04 0.04456995 0.5924164
## AJUBA-DT 0.9256980 -1.281389 4.331900 4.483107e-05 0.04456995 1.2622196
Let’s combine the three previous steps into one series of commands and
save the this as a temporary file that we will join with.
```
resultsDEGs <- results %>%
mutate(Approved.symbol = row.names(.)) %>%
filter(adj.P.Val < 0.05,
logFC > 1 | logFC < -1) %>%
arrange(adj.P.Val)
resultsDEGs
```
## logFC AveExpr t P.Value adj.P.Val B
## MTHFD2P1 1.616058 -1.7650730 5.015657 3.394989e-06 0.02636039 2.9968710
## IL1R2 -1.703322 0.7638632 -3.913550 1.972401e-04 0.04456995 0.4781060
## LINC00310 1.218272 -1.2321395 4.098561 1.034725e-04 0.04456995 0.6526078
## LINC02610 1.185186 -1.0647221 3.942760 1.783369e-04 0.04456995 0.2795004
## PLEKHG4B 1.061661 -1.3135604 3.942497 1.784989e-04 0.04456995 0.2317681
## RAD9B -1.140384 0.3431846 -4.018721 1.369645e-04 0.04456995 0.7204587
## RXRG 1.088903 2.1133650 4.326187 4.577186e-05 0.04456995 1.8528496
## TSPEAR 2.161166 1.3779887 3.918461 1.939328e-04 0.04456995 0.5533325
## TUBA1C -1.083854 4.5752140 -4.649841 1.381098e-05 0.04456995 2.9939469
## ANKRD1 -1.189850 11.5066380 -3.832167 2.605709e-04 0.04495879 0.3442151
## NAV2-AS2 1.002273 1.0153707 3.822544 2.692335e-04 0.04495879 0.2476898
## KRT80 -1.176516 -1.0145047 -3.714169 3.878470e-04 0.04632981 -0.2947158
## TNC -1.642206 4.7894623 -3.699919 4.067330e-04 0.04656770 -0.1151009
## FN1 -1.021743 9.5528978 -3.665273 4.563675e-04 0.04759268 -0.1398436
## TNN 1.724698 -1.6035527 3.634121 5.058730e-04 0.04759268 -0.5822258
## C4orf54 -2.236262 3.0509899 -3.625726 5.200638e-04 0.04767478 -0.2579115
## RANBP3L 1.044263 0.1329902 3.626793 5.182397e-04 0.04767478 -0.3692743
## SPRED3 -1.006501 -0.1510558 -3.590380 5.840643e-04 0.04857888 -0.4938726
## Approved.symbol
## MTHFD2P1 MTHFD2P1
## IL1R2 IL1R2
## LINC00310 LINC00310
## LINC02610 LINC02610
## PLEKHG4B PLEKHG4B
## RAD9B RAD9B
## RXRG RXRG
## TSPEAR TSPEAR
## TUBA1C TUBA1C
## ANKRD1 ANKRD1
## NAV2-AS2 NAV2-AS2
## KRT80 KRT80
## TNC TNC
## FN1 FN1
## TNN TNN
## C4orf54 C4orf54
## RANBP3L RANBP3L
## SPRED3 SPRED3
#### 1.4 Join the results and genes data frames by “Approved.symbol”
``` r
left_join(resultsDEGs, genes, by = "Approved.symbol")
```
## # A tibble: 18 × 14
## logFC AveExpr t P.Value adj.P.Val B Approved.symbol HGNC.ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1.62 -1.77 5.02 0.00000339 0.0264 3.00 MTHFD2P1 <NA>
## 2 -1.70 0.764 -3.91 0.000197 0.0446 0.478 IL1R2 HGNC:5994
## 3 1.22 -1.23 4.10 0.000103 0.0446 0.653 LINC00310 <NA>
## 4 1.19 -1.06 3.94 0.000178 0.0446 0.280 LINC02610 <NA>
## 5 1.06 -1.31 3.94 0.000178 0.0446 0.232 PLEKHG4B <NA>
## 6 -1.14 0.343 -4.02 0.000137 0.0446 0.720 RAD9B <NA>
## 7 1.09 2.11 4.33 0.0000458 0.0446 1.85 RXRG <NA>
## 8 2.16 1.38 3.92 0.000194 0.0446 0.553 TSPEAR HGNC:1268
## 9 -1.08 4.58 -4.65 0.0000138 0.0446 2.99 TUBA1C HGNC:20768
## 10 -1.19 11.5 -3.83 0.000261 0.0450 0.344 ANKRD1 HGNC:15819
## 11 1.00 1.02 3.82 0.000269 0.0450 0.248 NAV2-AS2 <NA>
## 12 -1.18 -1.01 -3.71 0.000388 0.0463 -0.295 KRT80 <NA>
## 13 -1.64 4.79 -3.70 0.000407 0.0466 -0.115 TNC HGNC:5318
## 14 -1.02 9.55 -3.67 0.000456 0.0476 -0.140 FN1 <NA>
## 15 1.72 -1.60 3.63 0.000506 0.0476 -0.582 TNN HGNC:22942
## 16 -2.24 3.05 -3.63 0.000520 0.0477 -0.258 C4orf54 HGNC:27741
## 17 1.04 0.133 3.63 0.000518 0.0477 -0.369 RANBP3L <NA>
## 18 -1.01 -0.151 -3.59 0.000584 0.0486 -0.494 SPRED3 HGNC:31041
## # … with 6 more variables: Approved.name <chr>, Chromosome <chr>,
## # Accession.numbers <chr>, NCBI.Gene.ID <int>, Ensembl.gene.ID <chr>,
## # Mouse.genome.database.ID <chr>
#### 1.5 Select the gene symbol, name, and ensemble id, lfc, and adj.p.val
By now, our results file is getting quite wide. We can use the
`select()` function to subset data frames by specific column names.
``` r
resultsDEGs %>% select(Approved.symbol, logFC, adj.P.Val)
```
## Approved.symbol logFC adj.P.Val
## MTHFD2P1 MTHFD2P1 1.616058 0.02636039
## IL1R2 IL1R2 -1.703322 0.04456995
## LINC00310 LINC00310 1.218272 0.04456995
## LINC02610 LINC02610 1.185186 0.04456995
## PLEKHG4B PLEKHG4B 1.061661 0.04456995
## RAD9B RAD9B -1.140384 0.04456995
## RXRG RXRG 1.088903 0.04456995
## TSPEAR TSPEAR 2.161166 0.04456995
## TUBA1C TUBA1C -1.083854 0.04456995
## ANKRD1 ANKRD1 -1.189850 0.04495879
## NAV2-AS2 NAV2-AS2 1.002273 0.04495879
## KRT80 KRT80 -1.176516 0.04632981
## TNC TNC -1.642206 0.04656770
## FN1 FN1 -1.021743 0.04759268
## TNN TNN 1.724698 0.04759268
## C4orf54 C4orf54 -2.236262 0.04767478
## RANBP3L RANBP3L 1.044263 0.04767478
## SPRED3 SPRED3 -1.006501 0.04857888
Now, let’s combine all five steps into one. However, instead of
rearranging by p-value, let’s arrange by gene symbol. LEt’s also convert
this object to a tibble with `as_tibble()` for easier viewing.
``` r
resultsDEGs <- results %>%
mutate(Approved.symbol = row.names(.)) %>%
filter(adj.P.Val < 0.05,
logFC > 1 | logFC < -1) %>%
arrange(Approved.symbol) %>%
left_join(., genes, by = "Approved.symbol") %>%
select(Approved.symbol, Ensembl.gene.ID, logFC, adj.P.Val, Approved.name ) %>%
as_tibble()
resultsDEGs
```
## # A tibble: 18 × 5
## Approved.symbol Ensembl.gene.ID logFC adj.P.Val Approved.name
## <chr> <chr> <dbl> <dbl> <chr>
## 1 ANKRD1 ENSG00000148677 -1.19 0.0450 ankyrin repeat domain 1
## 2 C4orf54 ENSG00000248713 -2.24 0.0477 chromosome 4 open reading fr…
## 3 FN1 <NA> -1.02 0.0476 <NA>
## 4 IL1R2 ENSG00000115590 -1.70 0.0446 interleukin 1 receptor type 2
## 5 KRT80 <NA> -1.18 0.0463 <NA>
## 6 LINC00310 <NA> 1.22 0.0446 <NA>
## 7 LINC02610 <NA> 1.19 0.0446 <NA>
## 8 MTHFD2P1 <NA> 1.62 0.0264 <NA>
## 9 NAV2-AS2 <NA> 1.00 0.0450 <NA>
## 10 PLEKHG4B <NA> 1.06 0.0446 <NA>
## 11 RAD9B <NA> -1.14 0.0446 <NA>
## 12 RANBP3L <NA> 1.04 0.0477 <NA>
## 13 RXRG <NA> 1.09 0.0446 <NA>
## 14 SPRED3 ENSG00000188766 -1.01 0.0486 sprouty related EVH1 domain …
## 15 TNC ENSG00000041982 -1.64 0.0466 tenascin C
## 16 TNN ENSG00000120332 1.72 0.0476 tenascin N
## 17 TSPEAR ENSG00000175894 2.16 0.0446 thrombospondin type laminin …
## 18 TUBA1C ENSG00000167553 -1.08 0.0446 tubulin alpha 1c
:::warning
#### Challenge
Replace the input results file with a different file, such as the
results of the comparison of 20-29 and 30-39 year old muscle samples.
What are the differentially expressed genes?
:::spoiler
```
resultsDEGs <- results %>%
mutate(Approved.symbol = row.names(.)) %>%
filter(adj.P.Val < 0.05,
logFC > 1 | logFC < -1) %>%
arrange(Approved.symbol) %>%
left_join(., genes, by = “Approved.symbol”) %>%
select(Approved.symbol, Ensembl.gene.ID,
logFC, adj.P.Val, Approved.name ) %>%
filter(grepl(“ENSG”, Ensembl.gene.ID)) %>%
as_tibble()
resultsDEGs
```
:::
``` r
resultsDEGs <- results %>%
mutate(Approved.symbol = row.names(.)) %>%
filter(adj.P.Val < 0.05,
logFC > 1 | logFC < -1) %>%
arrange(Approved.symbol) %>%
left_join(., genes, by = "Approved.symbol") %>%
select(Approved.symbol, Ensembl.gene.ID, logFC, adj.P.Val, Approved.name ) %>%
filter(grepl("ENSG", Ensembl.gene.ID)) %>%
as_tibble()
resultsDEGs
```
## # A tibble: 8 × 5
## Approved.symbol Ensembl.gene.ID logFC adj.P.Val Approved.name
## <chr> <chr> <dbl> <dbl> <chr>
## 1 ANKRD1 ENSG00000148677 -1.19 0.0450 ankyrin repeat domain 1
## 2 C4orf54 ENSG00000248713 -2.24 0.0477 chromosome 4 open reading fra…
## 3 IL1R2 ENSG00000115590 -1.70 0.0446 interleukin 1 receptor type 2
## 4 SPRED3 ENSG00000188766 -1.01 0.0486 sprouty related EVH1 domain c…
## 5 TNC ENSG00000041982 -1.64 0.0466 tenascin C
## 6 TNN ENSG00000120332 1.72 0.0476 tenascin N
## 7 TSPEAR ENSG00000175894 2.16 0.0446 thrombospondin type laminin G…
## 8 TUBA1C ENSG00000167553 -1.08 0.0446 tubulin alpha 1c
With this exercise, you have explored a few functions related to tidying
and transforming data. Let’s try one more exercise
#### 1.6 Create lists of ensemble ids and gene names of DEGs
``` r
DEGsENSG <- resultsDEGs %>% pull(Ensembl.gene.ID)
DEGsENSG
```
## [1] "ENSG00000148677" "ENSG00000248713" "ENSG00000115590" "ENSG00000188766"
## [5] "ENSG00000041982" "ENSG00000120332" "ENSG00000175894" "ENSG00000167553"
``` r
DEGsSymbol <- resultsDEGs %>% pull(Approved.symbol)
DEGsSymbol
```
## [1] "ANKRD1" "C4orf54" "IL1R2" "SPRED3" "TNC" "TNN" "TSPEAR"
## [8] "TUBA1C"
### Question 2: What are the raw counts of the differentially expressed genes?
Most RNA-Seq pipelines require that the counts file to be in a matrix
format where each sample is a column and each gene is a row and all the
values are integers or doubles with all the experimental factors in a
separate file. However, many R tools prefer to have these data combined
in a single, tidy data frameor tibble. To process the counts data using
the DESeq2 pipeline, we need a corresponding file where the row names
are the sample id and they match the column names of the counts file. We
confirm this by asking if `rownames(colData) == colnames(counts)` or by
checking the dimensions of each. Using `head()` is a good way to only
print 5 rows. Using `[1:5]` is a good way to only print 5 columns.
I also like to create `counts_tidy_long` file that can be easily subset
by variables or genes of interest. To create this, we need to combine
three data frames: genes, colData (or Samples), and counts.
In this section, we will combine tidying, transforming, and visualizing
to answer the question “What are the raw counts of the differentially
expressed genes?”
``` r
head(rownames(colData) == colnames(counts))
```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
``` r
head(colnames(counts))
```
## [1] "GTEX.12ZZX.0726.SM.5EGKA.1" "GTEX.13D11.1526.SM.5J2NA.1"
## [3] "GTEX.ZAJG.0826.SM.5PNVA.1" "GTEX.11TT1.1426.SM.5EGIA.1"
## [5] "GTEX.13VXT.1126.SM.5LU3A.1" "GTEX.14ASI.0826.SM.5Q5EB.1"
``` r
head(rownames(colData))
```
## [1] "1" "2" "3" "4" "5" "6"
``` r
head(colData$X)
```
## [1] "GTEX-12ZZX-0726-SM-5EGKA.1" "GTEX-13D11-1526-SM-5J2NA.1"
## [3] "GTEX-ZAJG-0826-SM-5PNVA.1" "GTEX-11TT1-1426-SM-5EGIA.1"
## [5] "GTEX-13VXT-1126-SM-5LU3A.1" "GTEX-14ASI-0826-SM-5Q5EB.1"
``` r
colData_tidy <- colData %>%
mutate(X = gsub("-", ".", X)) %>%
filter(gtex.age %in% c("20-29", "70-79")) %>%
mutate(gtex.age = factor(gtex.age))
rownames(colData_tidy) <- colData_tidy$X
head(rownames(colData_tidy) == colnames(counts))
```
## Warning in rownames(colData_tidy) == colnames(counts): longer object length is
## not a multiple of shorter object length
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
``` r
mycols <- colData_tidy %>% dplyr::pull(X)
counts_tidy <- counts %>%
select(all_of(mycols))
head(rownames(colData_tidy) == colnames(counts_tidy))
```
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
``` r
counts_tidy_long <- counts %>%
select(all_of(mycols)) %>%
mutate(Ensembl.gene.ID = rownames(.)) %>%
separate(Ensembl.gene.ID, into = c("Ensembl.gene.ID", "version"),
sep = "\\.") %>%
filter(Ensembl.gene.ID %in% all_of(DEGsENSG)) %>%
pivot_longer(cols = all_of(mycols), names_to = "X",
values_to = "counts") %>%
inner_join(., colData_tidy, by = "X") %>%
arrange(desc(counts)) %>%
inner_join(., genes, by = "Ensembl.gene.ID") %>%
select(Ensembl.gene.ID, Approved.name, Approved.symbol, counts,
X, gtex.smtsd, study, gtex.age, gtex.sex, gtex.dthhrdy,
gtex.smcenter)
head(counts_tidy_long)
```
## # A tibble: 6 × 11
## Ensembl.gene.ID Approved.name Approved.symbol counts X gtex.smtsd study
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 ENSG00000148677 ankyrin repea… ANKRD1 5.54e7 GTEX.… Heart - L… HEART
## 2 ENSG00000148677 ankyrin repea… ANKRD1 4.65e7 GTEX.… Heart - L… HEART
## 3 ENSG00000148677 ankyrin repea… ANKRD1 1.84e7 GTEX.… Heart - A… HEART
## 4 ENSG00000148677 ankyrin repea… ANKRD1 1.58e7 GTEX.… Heart - A… HEART
## 5 ENSG00000148677 ankyrin repea… ANKRD1 1.22e7 GTEX.… Heart - L… HEART
## 6 ENSG00000148677 ankyrin repea… ANKRD1 1.12e7 GTEX.… Heart - L… HEART
## # … with 4 more variables: gtex.age <fct>, gtex.sex <int>, gtex.dthhrdy <int>,
## # gtex.smcenter <chr>
``` r
counts_tidy_long %>%
ggplot(aes(x = gtex.age, y = counts)) +
geom_boxplot() +
geom_point() +
facet_wrap(~Approved.symbol, scales = "free_y") +
scale_y_log10(labels = label_number_si())
```
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
![](./images/boxplot2-1.png)<!-- -->
![](https://hackmd.io/_uploads/Syj5KRIf5.png)
:::warning
#### Challenge
Plot the gene expression of your five favorite genes.
:::spoiler
*Hint: Create a list of your favorite genes. Use the gene Ensemble ides
to get the counts and use the colData or sample information to explore
how age, sex, tissue, and other variable may influence gene expression.*
:::
:::success
#### Key functions: Tidy and Transform
| Function | Description |
|------------------|-------------|
| `pivot_wider()` | widen a data frame |
| `pivot_longer()` | lengthen a data frame |
| `separate()` | separate columns |
| `drop_na()` | drop missing variables |
| `select()` | select columns of interest |
| `arrange()` | arrange in an ascending (default) or descending order |
| `summarize()` | perform a mathmatical function |
| `mutate()` | create a new variable |
| `full_join()` | join all columns and rows of a data frame |
| `left_join()` | add columns from the right data frame to the left when there are mathing rows |
| `inner_join()` | join only rows with values in both data frames |
:::
## Communicate
### R Markdown
The workshop notes for using this repository to teach an Introduction to
R for RNA-seq are crated with the file [r4rnaseq-workshop.Rmd](https://github.com/nih-cfde/training-rstudio-binder/blob/data/GTEx/r4rnaseq-workshop.Rmd).
### References
- [R for Data Science by Hadley Wickham and Garrett
Grolemund](https://r4ds.had.co.nz/index.html)
- [Rouillard et al. 2016. The Harmonizome: a collection of processed
datasets gathered to serve and mine knowledge about genes and
proteins. Database
(Oxford).](http://database.oxfordjournals.org/content/2016/baw100.short)
### Additional Resources
- [RStudio cheatsheet for
readr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf)
- [RStudio cheatsheet for
dplyr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf)
- [RStudio cheatsheet for data Wrangling with
dplyr](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
- [ggplot point
shapes](http://www.sthda.com/english/wiki/ggplot2-point-shapes)
- [Angus 2019 Intro to R
Lesson](https://angus.readthedocs.io/en/2019/R_Intro_Lesson.html)
- [Angus 2019 Differential Gene Expression in R
Lesson](https://angus.readthedocs.io/en/2019/diff-ex-and-viz.html)
- [Software Carpentry R
Lesson](http://swcarpentry.github.io/r-novice-inflammation/)
*Note: the source document
[r4rnaseq-workshop.Rmd](https://github.com/nih-cfde/training-rstudio-binder/blob/data/GTEx/r4rnaseq-workshop.Rmd)
was last modified 23 March, 2022.*