# GGG 201(a) 2019 - class 1 - NOTES
Titus Brown, ctbrown@ucdavis.edu
## A very brief intro
Modern sequencing data analysis relies on a fairly standard set of data science approaches. I'm going to slowly walk through a particular RNAseq workflow and show you how things work. This will be a brief intro to the basic concepts that we'll explore more thoroughly in the GGG 201(b) labs.
If you feel completely overwhelmed - never fear! This is a much faster introduction than we do in 201(b), when we have more time to explore.
(Note, [this other RNAseq lesson from ANGUS 2019](https://angus.readthedocs.io/en/2019/diff-ex-and-viz.html) gives a slightly different perspective on a similar workflow.)
## Getting started!
Go [here](https://github.com/ngs-docs/2019-ggg-201a-rnaseq-oneday) and click on "Run RStudio." This runs RStudio in a ["binder"](https://binder.readthedocs.io/en/latest/user-manual/overview/intro.html), which means you don't need to install anything on your local computer! Here, everything is running in the [Google Cloud](https://cloud.google.com/compute/).
Wait ~10-15 seconds. You should see an RStudio Server window ...sooner or later!
Next, open a Terminal by clicking on the Terminal tab, and copy and past the following commands in,.
## Install [salmon](https://salmon.readthedocs.io):
Salmon is software to do RNAseq quantification from Illumina short reads. Here we are using [conda](https://angus.readthedocs.io/en/2019/conda_tutorial.html) to install the software.
```
conda install -c bioconda -y salmon
```
## Download some data from [Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/)
Next, let's download some actual RNAseq data!
I've stored this data on [osf.io](http://osf.io) under a class-specific folder ([here](https://osf.io/vzfc6/)) because it's faster to download from there.
```
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
```
These are [FASTQ files](https://en.wikipedia.org/wiki/FASTQ_format) containing [shotgun Illumina sequencing](https://en.wikipedia.org/wiki/Shotgun_sequencing) of RNA from yeast. You can look at the contents of one of these files like so:
```
gunzip -c ERR458501.fastq.gz | head
```
## Download the yeast reference transcriptome:
Salmon needs to know which genes to align the sequencing reads to - we'll give it the predicted Open Reading Frames (ORFs) from yeast. Download that file.
```
curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
```
Take a look at the contents and compare with the above FASTQ reads:
```
gunzip -c orf_coding.fasta.gz | head
```
## Index the yeast transcriptome:
In order to work, salmon needs to index the ORFs for fast search:
```
salmon index --index yeast_orfs --transcripts orf_coding.fasta.gz
```
## Run salmon individually on each of the samples:
Quantify the number of times each gene in the yeast transcriptome is "hit" by an RNAseq read:
```
for i in *.fastq.gz
do
salmon quant -i yeast_orfs --libType U -r $i -o $i.quant --seqBias --gcBias
done
```
Note here that we're using a little bit of shell programming (the for loop) to write a command that will process _every_ `fastq.gz` file in the folder.
## Collect all of the sample counts using [this Python script](https://github.com/ngs-docs/2018-ggg201b/blob/master/lab6-rnaseq/gather-counts.py):
salmon puts each of the counts in their own directory; gather them all up and rename them so that R can load them more easily.
```
curl -L -O https://raw.githubusercontent.com/ngs-docs/2018-ggg201b/master/lab6-rnaseq/gather-counts.py
python2 gather-counts.py
```
## Load in the counts and use edgeR to analyze the samples for differentially expressed genes:
In the RStudio console, copy and paste these lines - note, you can do it one block at a time, if you like.
```
library("edgeR")
files <- c(
"ERR458493.fastq.gz.quant.counts",
"ERR458494.fastq.gz.quant.counts",
"ERR458500.fastq.gz.quant.counts",
"ERR458501.fastq.gz.quant.counts"
)
labels=c("A", "B", "C", "D")
data <- readDGE(files)
print(data)
###
group <- c(rep("mut", 2), rep("wt", 2))
dge = DGEList(counts=data, group=group)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)
# make an MA-plot
et <- exactTest(dge, pair=c("wt", "mut"))
etp <- topTags(et, n=100000)
etp$table$logFC = -etp$table$logFC
pdf("yeast-edgeR-MA-plot.pdf")
plot(
etp$table$logCPM,
etp$table$logFC,
xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
col = ifelse( etp$table$FDR < .2, "red", "black" ) )
dev.off()
# plot MDS
pdf("yeast-edgeR-MDS.pdf")
plotMDS(dge, labels=labels)
dev.off()
# output CSV for 0-6 hr
write.csv(etp$table, "yeast-edgeR.csv")
```
## Question: how close is this to a "good" publishing workflow?
What more could/should/would we need to do to make it match an ideal publishing workflow, where we're communicating clearly about the analysis methods?
(Assume no additional *analysis* needs to be done, and we're just talking about communicating what we *did*.)
## Broad advice
Knowledge of data science in general, and sequence analysis in particular, is really useful.
Learn R **by using it**.
Learn shell **by using it**.
**Participate in the local (and global) community of practice** in data science.
## Other links you might find interesting
UC Davis "stuff":
* [Davis R Users Group](https://d-rug.github.io/) and [Meet and Analyze Data](https://dib-training.readthedocs.io/en/pub/) help groups
* [Data Science Initiative / DataLab](http://dsi.ucdavis.edu/)
* [Data Intensive Biology Summer Institute](http://ivory.idyll.org/dibsi/)
* in Winter 2019/2020, I teach GGG 201(b) Lab Section, which will focus on three basic sequence analysis techniques: variant calling, de novo assembly, and RNAseq gene expression analysis.
* in Winter 2019/2020, I plan to be teaching a GGG elective course that dives into more of the data science toolchain (shell, HPC, cloud, workflows, R, Python, etc.)