# RNA-Seq DGE Pipeline - Hebert Lab
**Note:** when running any of the scripts below, it is assumed that you are running them from the same directory from which you set up the project space (i.e. the parent directory of all the folders you create in step I.1.i below). The scripts frequently use relative paths; if they are run from a different location, the scripts will fail.
## I. Prepare to run the pipeline
### I.1. Set up the project space
**i.** Run the below commands to set up the folders that will be used throughout the pipeline.
```
mkdir 00_cat_reads
mkdir 01_genome_index
mkdir 02_mapped_reads
mkdir 03_transcript_quant
mkdir logs
mkdir metrics
mkdir raw_reads
mkdir references
mkdir scripts
```
**ii.** Link to the preloaded human genome and annotation file on the cluster.
```
ln -s /share/data/umw_biocore/genome_data/human/hg38/hg38.fa references/
ln -s /share/data/umw_biocore/genome_data/human/hg38/ucsc.gtf references
```
**iii.** Manually copy each of the <a href="#bash_scripts">BASH scripts </a> at the bottom of this page into separate files in the `scripts` directory, named accordingly.
<font size=2> For example, to copy the first script ... if using emacs as your text editor, you could simply type `emacs scripts/00_fastqc.sh`, then copy the associated code into the file, save, exit, and it's ready to run.</font>
### I.2. Check quality of reads from sequencing center
<font size=2> The first step for any pipeline is to check the quality of the reads. We'll use the program [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the samples individually, and then collate the information for all the samples together using [MultiQC](https://multiqc.info/), which will output a single coherent summary of sequence quality across the samples.
</font>
**Run script `00_fastqc.sh`**
**Expected output**
- FastQC reports `logs/fastqc_reports`
- MultiQC report `logs/multiqc_report`
Transfer both folders to a local directory and open the multiqc html file `multiqc_reports/multiqc_report.html`.
Use these results as a guide for the filtering steps that are necessary downstream. **In general, we want everything to be green**.
The MultiQC html file has links that can help you determine what to do if a sample (or several) fail the quality checks, and potential reasons why.
The UMass sequencing core demultiplexes and cleans samples before sharing the sequences. Assuming the MultiQC report is positive (ie mostly all green), we can proceed to mapping without any additional quality filtering.
<font size=2>
Note: RNA-Seq data will commonly have high sequence duplication levels, which could indicate overamplification during the final PCR during library prep; however, this could also represent real expression, with the tagged duplicates representing transcripts with high abundance. So if the sequence duplication levels seem high, it's not necessarily cause for concern (but do think critically about whether the results make sense and whether it is worth an additional step to remove PCR duplicates).
</font>
## II. Read mapping and quantification with STAR and RSEM
<font size=2> Choice of mapping software is generally not critical on downstream DGE results ([see Cost-Silva et al 2017](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152)). Here, we will use [STAR](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html), which is a very fast and accurate program for aligning reads to a reference genome. For this section we essentially follow Alternate Protocol 7 from the [STAR protocol](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4631051/). </font>
### II.1. Index the reference for fast alignment
<font size=2>Accurately aligning billions of bases of sequence data to a reference genome that is itself billions of bases long is - to put it mildly - a computationally-intensive task. One way to speed up the process is to provide an "index" for the genome. We can think of this index like the index at the back of a book that we use to find the page where specific search terms are referenced within the text. In the context of a genome, an index allows us to find where our reads are located in the genome without having to query the whole genome for each read.
We will create an index for our reference genome using [STAR](https://github.com/alexdobin/STAR) and call it with the program [RSEM](https://deweylab.github.io/RSEM/). We will first use this index for read mapping with STAR and then to quantify expression with RSEM. RSEM is stricter with its requirements for indices and may reject an index created with STAR if it isn't called through RSEM. So, we use a built-in option in RSEM to call STAR's algorithm to prepare the index such that it meets the requirements of both programs.
See [STAR manual](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf) and [RSEM README](https://deweylab.github.io/RSEM/README.html) file for details
</font>
**RUN SCRIPT `01_prep_index.sh`**
**Expected output**
All the files needed for mapping will be output to the directory specified in the script (in this case the folder is called `01_genome_index`).
### II.2. Concatenate reads and map to the indexed reference
<font size=2>Once we've prepared the reference genome with an index, it's time to map our reads to the indexed genome. One of the primary bioinformatics challenges with aligning reads to a reference genome is splicing. Particularly if we focus our efforts on mRNAs, then the sequence data we obtain are likely to span introns that are included in the reference genome. We can mitigate this issue by using a splice-aware aligner, like [STAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/) (Spliced Transcripts Alignment to a Reference). STAR has the added benefit of incorporating annotations in the alignment and outputting an annotated transcriptome alignment that we can use to quantify our reads.
If samples were sequenced across multiple lanes, we need to concatenate the lanes for each sample together before mapping to the reference genome. There is a small chance that data will be lost during this step, so we also want to perform a quick check to make sure the concatenated files received the correct amount of data from the raw files.
</font>
The script for this part of the pipeline is split into three steps corresponding to the tasks listed above:
1) Concatenate lanes for each sample together so there is a single forward and reverse input file for each sample.
2) Check that files were collated properly by comparing lines in raw files to lines in concatenated files.
3) Map reads to the indexed reference
<font size=2> Note that if a strand-specific protocol was used, it is not necessary to specify strandedness *in this step*. That will come into play when we quantify transcript abundance in the following step. </font>
**RUN SCRIPT `02_map_to_reference.sh`**
**Relevant output**
**1) Abridged report on % of unmapped reads**
- `metrics/unmapped_reads.txt`
<font size=2> This file quickly reports the percent of unmapped reads for each sample. We'd like this number to be 5% or less for the human genome.</font>
**2) Log file from STAR**
- `02_mapped_reads/[file_name]Log.final.out`
<font size=2> This file has more details about mapping success/failure. It is a good idea to look at this file for each sample, especially if the abridged report reveals a low percentage of mapped reads. </font>
**3) Comparison of line count from concatenated file vs separate files for each sample**
- `metrics/line_counts.txt`
<font size=2> The numbers under the "Forward" and "Reverse" headings for each sample should match. </font>
**4) Annotated transcriptome alignment**
- `02_mapped_reads/[file_name]Aligned.toTranscriptome.out.bam`.
<font size=2> This is the primary output file that will be used as input in the next step of the pipeline. </font>
### II.3 Quantify transcript abundance
<font size=2> Now that we have mapped our reads to a reference genome with STAR, the next step is to count the number of reads that map to each transcript/gene. One of the challenges here is that RNA-Seq reads may map to multiple genes or isoforms. RSEM's model accounts for this uncertainty and gives us the option to report confidence intervals for reads that map to multiple regions. By default, we won't request this information. </font>
**RUN SCRIPT `03_count_reads.sh`**
<font size=2> This is the longest step in the protocol and can take a day or longer to run, so plan accordingly </font>
**Expected output** (in `03_transcript_quant`)
1) `[sample_name].genes.results`
2) `[sample_name].isoforms.results`
3) `all.counts.matrix`
- <font size=2> This is the file that contains counts for all the samples and genes. We will import this file into R for differential expression analysis</font>
## III. Differential Expression Analysis with EdgeR
<font size=2>Now that we have our matrix of counts for all genes and samples, we can import the data into R and fit a statistical model to determine which genes are differentially expressed among our different experimental conditions. The main statistical challenge here is determining the dispersion (variance) of the data. edgeR does this in multiple steps that are rather impressive. See the [edgeR User Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) for more details and for a tutorial on using the software.
The following steps can be done using the R interface on the cluster, but I prefer to work in R locally ... </font>
- Transfer the file `all.counts.matrix` to your local desktop using a program such as [FileZilla](https://filezilla-project.org/) or [WinSCP](https://winscp.net/eng/download.php).
- Open R and set your working directory to the same location as the counts matrix (or be sure to edit the R script to include the path to the file).
- Then, using the code view on this site (pencil in top left), copy *everything* below this point from just below the heading DGE SCRIPT START to just above the heading <a href="#dge_end"> DGE SCRIPT END </a> into a new Markdown file in R (File > New File > R Markdown).
<font size=2> [The edgeR manual](https://dockflow.org/workflow/rnaseq-gene-edgerql/) is very helpful for understanding and adapting the script below. I highly recommend having it open as a reference when working with the below script .. or anytime you use edgeR, really. </font>
### **DGE SCRIPT START**
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
rm(list=ls()) #Clear workspace
#Install bioconductor to install edgeR - run below two lines together
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR") #Install edgeR from Bioconductor
browseVignettes("edgeR") #Look up edgeR vigenette (if desired)
install.packages(c("tidyverse","statmod","RColorBrewer", "pheatmap")) #Install other packages from CRAN
```
Here's a really useful resource: https://dockflow.org/workflow/rnaseq-gene-edgerql/
Model and comparisons including ALG6
```{r}
library(edgeR)
library(tidyverse)
library(statmod)
library(RColorBrewer)
library(pheatmap)
############READ IN DATA###########
counts <- data.frame(data.matrix(read.table(file="total.counts.matrix"))) #Import count data to R
colnames(counts) <- c("G62S10", "G62S11", "G62S12", "G62S13", "G62S14", "G62S15","G62S01", "G62S02", "G62S03", "G62S04", "G62S05", "G62S06", "G62S07", "G62S08", "G62S09") #rename column names to sample names
#Reorder according to sample number; confirmed that sample names are correct
counts <- counts %>%
select(G62S01, G62S02, G62S03, G62S04, G62S05, G62S06, G62S07, G62S08, G62S09, G62S10, G62S11, G62S12, G62S13, G62S14, G62S15)
#Name groups - each has three samples
group <- factor(c(rep(c("ALG6", "ALG6.UGGT1", "ALG6.UGGT2", "ALG6.UGGT1.UGGT2", "WT.parental"), each = 3)), levels=c("ALG6", "ALG6.UGGT1", "ALG6.UGGT2", "ALG6.UGGT1.UGGT2", "WT.parental")) #Group samples by replicates
y <- DGEList(counts = counts, group = group) #Create DGEList object with groups; confirmed groups and samples are labeled correctly
keep <- filterByExpr(y) #Filter out lowly expressed genes that will mess with comparisons
y <- y[keep,,keep.lib.sizes=FALSE]
#Make MDS plot to ensure samples group together
points <- c(rep(c(19,18,15,16,17), each=3))
colors <- c(rep(c("blue", "red", "green", "purple", "black"), each = 3))
plotMDS(y, col=colors, pch=points)
legend("topleft", legend=levels(group), pch=points[c(1,4,7,10,13)], col=colors[c(1,4,7,10,13)], ncol=1)
###################SET PARAMETERS AND FIT MODEL###############
y <- calcNormFactors(y) #Normalize library sizes to account for some samples having genes that absorb a disproportionate amount of read depth
design <- model.matrix(~0+group) #Create design matrix (0+ replaces the intercept with a group)
y <- estimateDisp(y, design, robust=TRUE) #Estimate dispersion for negative binomial model
fit <- glmQLFit(y, design, robust=TRUE) #Fit the negative binomial model
plotQLDisp(fit) #Plot quasi-likelihood dispersion estimates - see edgeR manual for interpretation
#Make design matrix for comparisons
comparisons <- makeContrasts(ALG6_vs_ALG6.UGGT1 = groupALG6 - groupALG6.UGGT1,
ALG6_vs_ALG6.UGGT1.UGGT2 = groupALG6 - groupALG6.UGGT1.UGGT2,
ALG6_vs_ALG6.UGGT2 = groupALG6 - groupALG6.UGGT2,
ALG6_vs_WT = groupALG6 - groupWT.parental,
ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2 = groupALG6.UGGT1 - groupALG6.UGGT1.UGGT2,
ALG6.UGGT1_vs_ALG6.UGGT2 = groupALG6.UGGT1 - groupALG6.UGGT2,
ALG6.UGGT1_vs_WT = groupALG6.UGGT1 - groupWT.parental,
ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2 = groupALG6.UGGT1.UGGT2 - groupALG6.UGGT2,
ALG6.UGGT1.UGGT2_vs_WT = groupALG6.UGGT1.UGGT2 - groupWT.parental,
ALG6.UGGT2_vs_WT = groupALG6.UGGT2 - groupWT.parental,
levels=design)
#Combine all comparisons into a single F statistic and p-value and extract top tags
comps <- glmQLFTest(fit, contrast=comparisons)
topTags(comps)
##Make heatmap of all comparisons##
logcpm <- cpm(y, log=TRUE) #calculate counts per million
rownames(logcpm) <- rownames(y$counts) #save gene names as rownames
colnames(logcpm) <- paste(y$samples$group, sep="_") #name samples by group
o <- order(comps$table$PValue) #order differentially expressed genes by p value and save row index value
logcpm <- logcpm[o[1:30],] #extract counts per million of the highest DEGs specified above
#Create heat map
pheatmap(logcpm, cluster_col=FALSE, scale="row")
```
PERFORM PAIRWISE COMPARISONS
Samples:
1-3: ALG6-/-
4-6: ALG6/UGGT1-/-
7-9: ALG6/UGGT2-/-
10-12: ALG6/UGGT1/UGGT2-/-
13-15: WT HEK293EBNA1-6E cells (parental)
```{r}
##Comparison 1: ALG6-/- vs ALG6/UGGT1-/-; samples 1:3 vs samples 4:6##
#Perform differential expression analysis
qlf_ALG6_vs_ALG6.UGGT1 <- glmQLFTest(fit, contrast=comparisons[,"ALG6_vs_ALG6.UGGT1"])
#Make MD plot of comparisons
plotMD(qlf_ALG6_vs_ALG6.UGGT1, main = "ALG6-/- vs ALG6/UGGT1-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6_vs_ALG6.UGGT1))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs1 <- topTags(qlf_ALG6_vs_ALG6.UGGT1, n=16314)
#Histogram of FDR values
top_DGEs1 <- as.data.frame(top_DGEs1)
hist(top_DGEs1$FDR, main="ALG6-/- vs ALG6/UGGT1-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs1, file="Top_tags_ALG6_vs_ALG6.UGGT1.csv", row.names=TRUE)
#Comparison 2: ALG6-/- vs ALG6/UGGT2-/-; samples 1:3 vs samples 7:9
qlf_ALG6_vs_ALG6.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6_vs_ALG6.UGGT2"])
#Make MD plot of comparisons
plotMD(qlf_ALG6_vs_ALG6.UGGT2, main = "ALG6-/- vs ALG6/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6_vs_ALG6.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs2 <- topTags(qlf_ALG6_vs_ALG6.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs2 <- as.data.frame(top_DGEs2)
hist(top_DGEs1$FDR, main="ALG6-/- vs ALG6/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs2, file="Top_tags_ALG6_vs_ALG6.UGGT2.csv", row.names=TRUE)
#Comparison 3: ALG6 vs ALG6/UGGT1/UGGT2-/- ; samples 1:3 vs samples 10:12
qlf_ALG6_vs_ALG6.UGGT1.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6_vs_ALG6.UGGT1.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6_vs_ALG6.UGGT1.UGGT2, main = "ALG6-/- vs ALG6/UGGT1/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6_vs_ALG6.UGGT1.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs3 <- topTags(qlf_ALG6_vs_ALG6.UGGT1.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs3 <- as.data.frame(top_DGEs3)
hist(top_DGEs3$FDR, main="ALG6-/- vs ALG6/UGGT1/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs3, file="Top_tags_ALG6_vs_ALG6.UGGT1.UGGT2.csv", row.names=TRUE)
#Comparison 4: ALG6-/- vs WT ; samples 1:3 vs samples 13:15
qlf_ALG6_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6_vs_WT, main = "ALG6-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs4 <- topTags(qlf_ALG6_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs4 <- as.data.frame(top_DGEs1)
hist(top_DGEs4$FDR, main="ALG6-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs4, file="Top_tags_ALG6_vs_WT.csv", row.names=TRUE)
#Comparison 5: ALG6/UGGT1 vs ALG6/UGGT2 samples 4:6 vs samples 7:9
qlf_ALG6.UGGT1_vs_ALG6.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_ALG6.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_ALG6.UGGT2, main = "ALG6/UGGT1-/- vs ALG6/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_ALG6.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs5 <- topTags(qlf_ALG6.UGGT1_vs_ALG6.UGGT2, n=16314)
#Histogram of FDR values - need to include all samples above
top_DGEs5 <- as.data.frame(top_DGEs5)
hist(top_DGEs5$FDR, main="ALG6/UGGT1-/- vs ALG6/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs5, file="Top_tags_ALG6.UGGT1_vs_ALG6.UGGT2.csv", row.names=TRUE)
#Comparison 6: ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/- ; samples 4:6 vs samples 10:12
qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2, main = "ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs6 <- topTags(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs6 <- as.data.frame(top_DGEs6)
hist(top_DGEs6$FDR, main="ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs6, file="Top_tags_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2.csv", row.names=TRUE)
#Comparison 7: ALG6/UGGT1-/- vs WT ; samples 4:6 vs samples 13:15
qlf_ALG6.UGGT1_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_WT, main = "ALG6/UGGT1-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs7 <- topTags(qlf_ALG6.UGGT1_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs7 <- as.data.frame(top_DGEs7)
hist(top_DGEs7$FDR, main="ALG6/UGGT1-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs7, file="Top_tags_ALG6.UGGT1_vs_WT.csv", row.names=TRUE)
#Comparison 8: ALG6/UGGT1/UGGT2 -/- vs ALG6/UGGT2-/- ; samples 10:12 vs samples 7:9
qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2, main = "ALG6/UGGT1/UGGT2-/- vs ALG6/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs8 <- topTags(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs8 <- as.data.frame(top_DGEs8)
hist(top_DGEs8$FDR, main="ALG6/UGGT1/UGGT2-/- vs ALG6/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs8, file="Top_tags_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2.csv", row.names=TRUE)
#Comparison 9: ALG6/UGGT2-/- vs WT ; samples 7:9 vs samples 13:15
qlf_ALG6.UGGT2_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT2_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT2_vs_WT, main = "ALG6/UGGT2-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT2_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs9 <- topTags(qlf_ALG6.UGGT2_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs9 <- as.data.frame(top_DGEs9)
hist(top_DGEs9$FDR, main="ALG6/UGGT2-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs9, file="Top_tags_ALG6.UGGT2_vs_WT.csv", row.names=TRUE)
#Comparison 10: ALG6/UGGT1/UGT2-/- vs WT ; samples 10:12 vs samples 13:15
qlf_ALG6.UGGT1.UGGT2_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1.UGGT2_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1.UGGT2_vs_WT, main = "ALG6/UGGT1/UGGT2-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1.UGGT2_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs10 <- topTags(qlf_ALG6.UGGT1.UGGT2_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs10 <- as.data.frame(top_DGEs10)
hist(top_DGEs10$FDR, main="ALG6/UGGT1/UGGT2-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs10, file="Top_tags_ALG6.UGGT1.UGGT2_vs_WT.csv", row.names=TRUE)
```
Model and comparisons NOT including ALG6
```{r}
library(edgeR)
library(tidyverse)
library(statmod)
library(RColorBrewer)
library(pheatmap)
############READ IN DATA###########
counts <- data.frame(data.matrix(read.table(file="total.counts.matrix"))) #Import count data to R
head(counts)
colnames(counts) <- c("G62S10", "G62S11", "G62S12", "G62S13", "G62S14", "G62S15","G62S01", "G62S02", "G62S03", "G62S04", "G62S05", "G62S06", "G62S07", "G62S08", "G62S09") #rename column names to sample names
#Reorder according to sample number and remove ALG6; confirmed that sample names are correct
counts <- counts %>%
select(G62S04, G62S05, G62S06, G62S07, G62S08, G62S09, G62S10, G62S11, G62S12, G62S13, G62S14, G62S15)
#Name groups - each has three samples
group <- factor(c(rep(c("ALG6.UGGT1", "ALG6.UGGT2", "ALG6.UGGT1.UGGT2", "WT.parental"), each = 3)), levels=c("ALG6.UGGT1", "ALG6.UGGT2", "ALG6.UGGT1.UGGT2", "WT.parental")) #Group samples by replicates
y <- DGEList(counts = counts, group = group) #Create DGEList object with groups; confirmed groups and samples are labeled correctly
keep <- filterByExpr(y) #Filter out lowly expressed genes that will mess with comparisons
y <- y[keep,,keep.lib.sizes=FALSE]
#Make MDS plot to ensure samples group together
points <- c(rep(c(18,15,16,17), each=3))
colors <- c(rep(c("red", "green", "purple", "black"), each = 3))
plotMDS(y, col=colors, pch=points)
legend("bottomleft", legend=levels(group), pch=points[c(1,4,7,10,13)], col=colors[c(1,4,7,10,13)], ncol=1)
###################SET PARAMETERS AND FIT MODEL###############
y <- calcNormFactors(y) #Normalize library sizes to account for some samples having genes that absorb a disproportionate amount of read depth
design <- model.matrix(~0+group) #Create design matrix (0+ replaces the intercept with a group)
y <- estimateDisp(y, design, robust=TRUE) #Estimate dispersion for negative binomial model
fit <- glmQLFit(y, design, robust=TRUE) #Fit the negative binomial model
plotQLDisp(fit) #Plot quasi-likelihood dispersion estimates - see edgeR manual for interpretation
#Make design matrix for comparisons
comparisons <- makeContrasts( ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2 = groupALG6.UGGT1 - groupALG6.UGGT1.UGGT2,
ALG6.UGGT1_vs_ALG6.UGGT2 = groupALG6.UGGT1 - groupALG6.UGGT2,
ALG6.UGGT1_vs_WT = groupALG6.UGGT1 - groupWT.parental,
ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2 = groupALG6.UGGT1.UGGT2 - groupALG6.UGGT2,
ALG6.UGGT1.UGGT2_vs_WT = groupALG6.UGGT1.UGGT2 - groupWT.parental,
ALG6.UGGT2_vs_WT = groupALG6.UGGT2 - groupWT.parental,
levels=design)
#Combine all comparisons into a single F statistic and p-value and extract top tags
comps <- glmQLFTest(fit, contrast=comparisons)
topTags(comps)
##Make heatmap of all comparisons##
logcpm <- cpm(y, log=TRUE) #calculate counts per million
rownames(logcpm) <- rownames(y$counts) #save gene names as rownames
colnames(logcpm) <- paste(y$samples$group, sep="_") #name samples by group
o <- order(comps$table$PValue) #order differentially expressed genes by p value and save row index value
logcpm <- logcpm[o[1:30],] #extract counts per million of the highest DEGs specified above
#Create heat map
pheatmap(logcpm, cluster_col=FALSE, scale="row")
```
PERFORM PAIRWISE COMPARISONS
Samples:
4-6: ALG6/UGGT1-/-
7-9: ALG6/UGGT2-/-
10-12: ALG6/UGGT1/UGGT2-/-
13-15: WT HEK293EBNA1-6E cells (parental)
```{r}
#Comparison 5: ALG6/UGGT1 vs ALG6/UGGT2 samples 4:6 vs samples 7:9
qlf_ALG6.UGGT1_vs_ALG6.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_ALG6.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_ALG6.UGGT2, main = "ALG6/UGGT1-/- vs ALG6/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_ALG6.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs5 <- topTags(qlf_ALG6.UGGT1_vs_ALG6.UGGT2, n=16314)
#Histogram of FDR values - need to include all samples above
top_DGEs5 <- as.data.frame(top_DGEs5)
hist(top_DGEs5$FDR, main="ALG6/UGGT1-/- vs ALG6/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs5, file="NO_ALG6_Top_tags_ALG6.UGGT1_vs_ALG6.UGGT2.csv", row.names=TRUE)
#Comparison 6: ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/- ; samples 4:6 vs samples 10:12
qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2, main = "ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs6 <- topTags(qlf_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs6 <- as.data.frame(top_DGEs6)
hist(top_DGEs6$FDR, main="ALG6/UGGT1-/- vs ALG6/UGGT1/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs6, file="NO_ALG6-Top_tags_ALG6.UGGT1_vs_ALG6.UGGT1.UGGT2.csv", row.names=TRUE)
#Comparison 7: ALG6/UGGT1-/- vs WT ; samples 4:6 vs samples 13:15
qlf_ALG6.UGGT1_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1_vs_WT, main = "ALG6/UGGT1-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs7 <- topTags(qlf_ALG6.UGGT1_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs7 <- as.data.frame(top_DGEs7)
hist(top_DGEs7$FDR, main="ALG6/UGGT1-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs7, file="NO_ALG6-Top_tags_ALG6.UGGT1_vs_WT.csv", row.names=TRUE)
#Comparison 8: ALG6/UGGT1/UGGT2 -/- vs ALG6/UGGT2-/- ; samples 10:12 vs samples 7:9
qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2 <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2, main = "ALG6/UGGT1/UGGT2-/- vs ALG6/UGGT2-/-")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs8 <- topTags(qlf_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2, n=16314)
#Histogram of FDR values
top_DGEs8 <- as.data.frame(top_DGEs8)
hist(top_DGEs8$FDR, main="ALG6/UGGT1/UGGT2-/- vs ALG6/UGGT2-/-", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs8, file="NO_ALG6-Top_tags_ALG6.UGGT1.UGGT2_vs_ALG6.UGGT2.csv", row.names=TRUE)
#Comparison 9: ALG6/UGGT2-/- vs WT ; samples 7:9 vs samples 13:15
qlf_ALG6.UGGT2_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT2_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT2_vs_WT, main = "ALG6/UGGT2-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT2_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs9 <- topTags(qlf_ALG6.UGGT2_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs9 <- as.data.frame(top_DGEs9)
hist(top_DGEs9$FDR, main="ALG6/UGGT2-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs9, file="NO_ALG6-Top_tags_ALG6.UGGT2_vs_WT.csv", row.names=TRUE)
#Comparison 10: ALG6/UGGT1/UGT2-/- vs WT ; samples 10:12 vs samples 13:15
qlf_ALG6.UGGT1.UGGT2_vs_WT <- glmQLFTest(fit, contrast=comparisons[,"ALG6.UGGT1.UGGT2_vs_WT"]) #
#Make MD plot of comparisons
plotMD(qlf_ALG6.UGGT1.UGGT2_vs_WT, main = "ALG6/UGGT1/UGGT2-/- vs WT")
#Show number of genes differentially expressed (up and down)
summary(decideTestsDGE(qlf_ALG6.UGGT1.UGGT2_vs_WT))
#Extract all differentially expressed genes at FDR < 0.05 differentially expressed genes (DEGs) - set n high so the limit is the p_value cutoff
top_DGEs10 <- topTags(qlf_ALG6.UGGT1.UGGT2_vs_WT, n=16314)
#Histogram of FDR values
top_DGEs10 <- as.data.frame(top_DGEs10)
hist(top_DGEs10$FDR, main="ALG6/UGGT1/UGGT2-/- vs WT", xlab = "FDR")
#Save top DEGs in a csv file
write.csv(top_DGEs10, file="NO_ALG6-Top_tags_ALG6.UGGT1.UGGT2_vs_WT.csv", row.names=TRUE)
```
### **<p id="dge_end">DGE SCRIPT END</p>**
## <p id="bash_scripts">BASH Scripts</p>
**00_fastqc.sh**
```
#!/bin/bash
#run fastqc
#BSUB -q short
#BSUB -W 4:00
#BSUB -R rusage[mem=1000]
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -e logs/QC1.err
#BSUB -oo logs/QC1.log
module load fastqc/0.11.5
for file in 00_raw_reads/*.fastq.gz
do
echo $file
fastqc $file --outdir=logs/fastqc_reports
done
module load python3/3.5.0_packages/multiqc/1.4
multiqc logs/fastqc_reports/ -outdir logs/multiqc_report
```
**01_prep_index.sh**
```
#!/bin/bash
#Index reference genome
#BSUB -q short
#BSUB -W 4:00
#BSUB -R rusage[mem=16000]
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -e logs/01_prep_index.err
#BSUB -oo logs/01_prep_index.log
#Load required modules
module load star/2.7.0e
module load gcc/8.1.0
module load RSEM/1.3.0
#Arguments in order are: annotation file (--gtf flag), reference genome file, [output directory/file_prefix], use STAR to prepare the index
rsem-prepare-reference --gtf references/ucsc.gtf references/hg38.fa 01_genome_index/rsem_hg38 --star
```
**02_map_to_reference.sh**
```
#!/bin/bash
#Map reads to indexed reference genome
#BSUB -q long
#BSUB -W 40:00
#BSUB -R rusage[mem=32000]
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -e logs/02_map_to_reference.err
#BSUB -oo logs/02_map_to_reference.log
#Load required modules
module load star/2.7.0e
module load gcc/8.1.0
#May need to manually adjust input for steps 1 & 2 below to adjust for number of lanes. The present script has four lanes with the same prefix.
####Step 0: check if files created in the script below already exist. If they do, delete previous versions.
if test -f ./metrics/line_counts.txt; then
rm ./metrics/line_counts.txt
fi
if test -f ./metrics/unmapped_reads.txt; then
rm ./metrics/unmapped_reads.txt
fi
####Step 1: Concatenate F and R reads from each sample####
for lane in raw_reads/*L001_R1*.fastq.gz
do
sample=${lane%%_L001*} #Store path and basename of file minus read info
out_name=${sample##*/} #Remove path, but keep basename top be used as a prefix to output
##Add additional samples to below forward and reverse read lines as necessary
##Forward read
cat ${sample}_L001_R1_001.fastq.gz ${sample}_L002_R1_001.fastq.gz ${sample}_L003_R1_001.fastq.gz ${sample}_L004_R1_001.fastq.gz > 00_cat_reads/${out_name}_R1.fastq.gz
##Reverse read
cat ${sample}_L001_R2_001.fastq.gz ${sample}_L002_R2_001.fastq.gz ${sample}_L003_R2_001.fastq.gz ${sample}_L004_R2_001.fastq.gz > 00_cat_reads/${out_name}_R2.fastq.gz
echo "Finished with sample $sample"
done
wait
####Step 2: Count lines in raw files and concatenated files to ensure they match (ie no data loss)
touch ./metrics/line_counts.txt
for file in raw_reads/*L001_R1*.fastq.gz
do
sample1=${file%%_L001*} #Store path and basename of file minus read info
out_name1=${sample1##*/} #Remove path, but keep basename top be used as a prefix to output
echo "Line count for Sample $out_name1" >> ./metrics/line_counts.txt
echo "*****Forward*****" >> ./metrics/line_counts.txt
#Make a variable for each lane - add or remove lanes as necessary
f1=${sample1}_L001_R1*
f2=${sample1}_L002_R1*
f3=${sample1}_L003_R1*
f4=${sample1}_L004_R1
#Add each of the variables specified above to the wc -l command - add or remove lanes as necessary
linesF=$(wc -l $f1 $f2 $f3 $f4 | awk '/total/ {print $1}')
c1=00_cat_reads/${out_name1}_R1*
linesF_cat=$(wc -l $c1 | awk '{print $1}')
echo "raw reads: $linesF" >> ./metrics/line_counts.txt
echo "concatenated reads: $linesF_cat" >> ./metrics/line_counts.txt
echo "*****Reverse*****" >> ./metrics/line_counts.txt
#Make a variable for each lane - add or remove as necessary
r1=${sample1}_L001_R2*
r2=${sample1}_L002_R2*
r3=${sample1}_L003_R2*
r4=${sample1}_L004_R2*
#Add each of the variables specified above to the wc -l command - add or remove as necessary
linesR=$(wc -l $r1 $r2 $r3 $r4 | awk '/total/ {print $1}')
c2=00_cat_reads/${out_name1}_R2*
linesR_cat=$(wc -l $c2 | awk '{print $1}')
echo "raw reads: $linesR" >> ./metrics/line_counts.txt
echo -e "concatenated reads: $linesR_cat \n" >> ./metrics/line_counts.txt
done
wait
####Step 3: Map each sample to indexed reference genome using STAR
touch ./metrics/unmapped_reads.txt
for f in 00_cat_reads/*R1.fastq.gz
do
sample2=${f%%R1.fastq.gz} #Store path and basename of file minus read info
out_name2=${sample2##*/} #Remove path, but keep basename top be used as a prefix to output
STAR --runThreadN 8 --genomeDir 01_genome_index --readFilesIn ${sample2}R1.fastq.gz ${sample2}R2.fastq.gz --readFilesCommand zcat --quantMode TranscriptomeSAM --outFileNamePrefix ./02_mapped_reads/$out_name2
wait
echo "sample $out_name2" >> ./metrics/unmapped_reads.txt
grep "unmapped" ./02_mapped_reads/${out_name2}Log.final.out >> ./metrics/unmapped_reads.txt
done
```
**03_count_reads.sh**
```
#!/bin/bash
#Calculate transcript abundance
#BSUB -q long
#BSUB -W 40:00
#BSUB -R rusage[mem=16000]
#BSUB -n 12
#BSUB -R span[hosts=1]
#BSUB -e logs/03_count_reads2.err
#BSUB -oo logs/03_count_reads2.log
#Load required modules
module load RSEM/1.3.0
for num in $(seq 1 9)
do
file=02_mapped_reads/G62S${num}_S${num}_Aligned.toTranscriptome.out.bam
sample=${file%%_Aligned.toTranscriptome.out.bam} #Store path and basename of file minus read info
out_name=${sample##*/} #Remove path, but keep basename top be used as a prefix to output
#Arguments are (after paired-end): input file is bam file, don't output bam file, use 12 threads, specify paired end sequences, stranded data, location/prefix of rsem index files, location/prefix of rsem calulate expression output files (from this run)
rsem-calculate-expression --bam --no-bam-output -p 12 --paired-end --strandedness reverse $file 01_genome_index/rsem_hg38 03_transcript_quant/$out_name
done
#Once the above loop is finished, combine results from all samples into one data matrix for R
rsem-generate-data-matrix 03_transcript_quant/*.genes.results > all.counts.matrix
```