GGG 201(a) 2019 - class 1 - NOTES

# GGG 201(a) 2019 - class 1 - NOTES Titus Brown, ctbrown@ucdavis.edu ## A very brief intro Modern sequencing data analysis relies on a fairly standard set of data science approaches. I'm going to slowly walk through a particular RNAseq workflow and show you how things work. This will be a brief intro to the basic concepts that we'll explore more thoroughly in the GGG 201(b) labs. If you feel completely overwhelmed - never fear! This is a much faster introduction than we do in 201(b), when we have more time to explore. (Note, [this other RNAseq lesson from ANGUS 2019](https://angus.readthedocs.io/en/2019/diff-ex-and-viz.html) gives a slightly different perspective on a similar workflow.) ## Getting started! Go [here](https://github.com/ngs-docs/2019-ggg-201a-rnaseq-oneday) and click on "Run RStudio." This runs RStudio in a ["binder"](https://binder.readthedocs.io/en/latest/user-manual/overview/intro.html), which means you don't need to install anything on your local computer! Here, everything is running in the [Google Cloud](https://cloud.google.com/compute/). Wait ~10-15 seconds. You should see an RStudio Server window ...sooner or later! Next, open a Terminal by clicking on the Terminal tab, and copy and past the following commands in,. ## Install [salmon](https://salmon.readthedocs.io): Salmon is software to do RNAseq quantification from Illumina short reads. Here we are using [conda](https://angus.readthedocs.io/en/2019/conda_tutorial.html) to install the software. ``` conda install -c bioconda -y salmon ``` ## Download some data from [Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/) Next, let's download some actual RNAseq data! I've stored this data on [osf.io](http://osf.io) under a class-specific folder ([here](https://osf.io/vzfc6/)) because it's faster to download from there. ``` curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz ``` These are [FASTQ files](https://en.wikipedia.org/wiki/FASTQ_format) containing [shotgun Illumina sequencing](https://en.wikipedia.org/wiki/Shotgun_sequencing) of RNA from yeast. You can look at the contents of one of these files like so: ``` gunzip -c ERR458501.fastq.gz | head ``` ## Download the yeast reference transcriptome: Salmon needs to know which genes to align the sequencing reads to - we'll give it the predicted Open Reading Frames (ORFs) from yeast. Download that file. ``` curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz ``` Take a look at the contents and compare with the above FASTQ reads: ``` gunzip -c orf_coding.fasta.gz | head ``` ## Index the yeast transcriptome: In order to work, salmon needs to index the ORFs for fast search: ``` salmon index --index yeast_orfs --transcripts orf_coding.fasta.gz ``` ## Run salmon individually on each of the samples: Quantify the number of times each gene in the yeast transcriptome is "hit" by an RNAseq read: ``` for i in *.fastq.gz do salmon quant -i yeast_orfs --libType U -r $i -o $i.quant --seqBias --gcBias done ``` Note here that we're using a little bit of shell programming (the for loop) to write a command that will process _every_ `fastq.gz` file in the folder. ## Collect all of the sample counts using [this Python script](https://github.com/ngs-docs/2018-ggg201b/blob/master/lab6-rnaseq/gather-counts.py): salmon puts each of the counts in their own directory; gather them all up and rename them so that R can load them more easily. ``` curl -L -O https://raw.githubusercontent.com/ngs-docs/2018-ggg201b/master/lab6-rnaseq/gather-counts.py python2 gather-counts.py ``` ## Load in the counts and use edgeR to analyze the samples for differentially expressed genes: In the RStudio console, copy and paste these lines - note, you can do it one block at a time, if you like. ``` library("edgeR") files <- c( "ERR458493.fastq.gz.quant.counts", "ERR458494.fastq.gz.quant.counts", "ERR458500.fastq.gz.quant.counts", "ERR458501.fastq.gz.quant.counts" ) labels=c("A", "B", "C", "D") data <- readDGE(files) print(data) ### group <- c(rep("mut", 2), rep("wt", 2)) dge = DGEList(counts=data, group=group) dge <- estimateCommonDisp(dge) dge <- estimateTagwiseDisp(dge) # make an MA-plot et <- exactTest(dge, pair=c("wt", "mut")) etp <- topTags(et, n=100000) etp$table$logFC = -etp$table$logFC pdf("yeast-edgeR-MA-plot.pdf") plot( etp$table$logCPM, etp$table$logFC, xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, col = ifelse( etp$table$FDR < .2, "red", "black" ) ) dev.off() # plot MDS pdf("yeast-edgeR-MDS.pdf") plotMDS(dge, labels=labels) dev.off() # output CSV for 0-6 hr write.csv(etp$table, "yeast-edgeR.csv") ``` ## Question: how close is this to a "good" publishing workflow? What more could/should/would we need to do to make it match an ideal publishing workflow, where we're communicating clearly about the analysis methods? (Assume no additional *analysis* needs to be done, and we're just talking about communicating what we *did*.) ## Broad advice Knowledge of data science in general, and sequence analysis in particular, is really useful. Learn R **by using it**. Learn shell **by using it**. **Participate in the local (and global) community of practice** in data science. ## Other links you might find interesting UC Davis "stuff": * [Davis R Users Group](https://d-rug.github.io/) and [Meet and Analyze Data](https://dib-training.readthedocs.io/en/pub/) help groups * [Data Science Initiative / DataLab](http://dsi.ucdavis.edu/) * [Data Intensive Biology Summer Institute](http://ivory.idyll.org/dibsi/) * in Winter 2019/2020, I teach GGG 201(b) Lab Section, which will focus on three basic sequence analysis techniques: variant calling, de novo assembly, and RNAseq gene expression analysis. * in Winter 2019/2020, I plan to be teaching a GGG elective course that dives into more of the data science toolchain (shell, HPC, cloud, workflows, R, Python, etc.)