RNA-seq count quantification with Salmon

# RNA-seq count quantification with Salmon The RNA-seq analysis we'll be doing today has a workflow like this: ![](https://i.imgur.com/qubM8Ta.png) We're starting with the second to last step (Salmon), and then moving to R/Rstudio for the Differential Expression Analysis and Visualization. We definitely don't have the time in class to do everything you can do with RNASeq data, so there are some links at the bottom of this document that you can follow to tailor this workflow to your needs. To begin: Log on to the CRC Make a working directory for today and cd into it: ``` mkdir SEP_29 cd SEP_29 ``` Copy files for today to your working directory: ``` cp /tmp/ND_ICG_SEP_29/* . ls ERR458493.qc.fq.gz ERR458500.qc.fq.gz GCA_000146045.2_R64_rna_from_genomic.fna.gz ERR458494.qc.fq.gz ERR458501.qc.fq.gz ERR458495.qc.fq.gz ERR458502.qc.fq.gz salmon.ba ``` What's here? - 6 fastq files. These are pre-trimmed and quality checked RNASeq reads. They can stay zipped. - 1 fasta file. This is the yeast transcriptome we'll be mapping our reads to. - 1 bash script (salmon.ba). I've pre-written this to save time, you can modify this code to work with any RNASeq fastq files/transcriptomes. ## Salmon to quantify counts ``` module load bio/2.0 ``` To make sure the software is working correctly: ``` salmon -h ``` Should see a output of a help menu with information about salmon version and usage. Open and examine the script 'salmon.ba' (use 'cat', 'nano', or 'less') to see the code we're using today. The first line uses salmon to index the yeast transcriptome. As we saw with BWA, this takes a giant data file (transcriptome) and turns it into a type of library that the algorithm can use to quickly map reads to. The output of this is an indexed transcriptome - a folder we've told Salmon to name sc_quant. ``` salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz ``` The second line is a for loop that uses Salmon to map the reads to the indexed transcriptome and then quantify the mapped reads (aka the 'counts'). The output of this will be 6 new directories (e.g. ERR458493.qc.fq.gz -> ERR458493.qc.fq.gz_quant). These directories contain your count information that we'll import into R in the next step. ``` for i in *.qc.fq.gz do salmon quant -i sc_index --libType A -r ${i} -o ${i}_quant --seqBias --gcBias --validateMappings done ``` Close salmon.ba and run it: ``` bash salmon.ba ``` How well did this work? You can look at the percentage of mapped reads with the following command: ``` for infile in *_quant/aux_info/meta_info.json do echo ${infile} grep "percent_mapped" ${infile} done ``` ![](https://i.imgur.com/7RpRSik.png) Looks like over 85% of reads in our samples mapped to the transcriptome - let's move on! ## RNA Seq Differential Expression in R Using DESeq2 On your local machine (or wherever you're using Rstudio), make a working directory and move into it. To continue our analysis, we need the files we just created on the CRC. Use scp to transfer them to this directory. If you use windows, use explorer to navigate to your working directory. Once you're in the correct directory, hold down shift and right-click to pull up a menu that will let you open powershell. In powershell, you can use scp to transfer your files. This command should work for both mac terminals and powershells. ``` #this is what my command looks like - your directory paths will differ. scp -r crivaldi@crcfe01.crc.nd.edu:Classes/ND_ICG_2020/RNA-seq/data/*_quant . ``` :::info #### If you need to download a clean copy of the salmon output, read this info! You can download the count files here: https://osf.io/5dbrh/download These files will most likely download to your "Downloads" folder, and you need to move them to your current working directory, either via command line (Mac) or drag & drop (Windows, unless you're comfortable in PowerShell). They will download as a single zipped file, and you need to unzip them. For Mac people, the command is this: tar xvzf salmon_quant ::: Now that you have the files downloaded, let's get started in R. Copy this entire script into a new R script - save (I recommend putting it in the same directory as your data) ```r #if you haven't yet installed packages: install.packages("tidyverse") if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") BiocManager::install("tximport") #after you've installed packages: library("tidyverse") library("tximport") library("DESeq2") # I had an older version of a package or two and so needed to do this install - you probably don't need to do this, # but read your error messages carefully if you're getting them. # BiocManager::install("RSQLite") # library("RSQLite") # store the directory path to your data (the place where you downloaded the *_quant directories) in an object dir <- "/Users/criva/Box Sync/TA/Intro_Comp_Genom/RNA_Seq/data" # Transcript - Gene ID Data Frame # read in the file from url, store in new object tx2gene_map <- read_tsv("https://osf.io/a75zm/download") # look at the first 6 lines head(tx2gene_map) # What is this file and how do I make one? # Need to associate Transcript IDs with GeneIDs - Salmon only provides Transcript IDs # Making a tx2gene file is often a little different for each organism. # If your organism has a transcriptome (or *rna_from_genomic.fna.gz file) on RefSeq, you can often get the information from the *feature_table.txt.gz. # You might also be able to parse a gtf or gff file to produce the information you need. # This information is sometimes also found in the fasta headers of the transcriptome file itself. # https://bioconductor.org/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html (under Import transcript-level estimates) # https://support.bioconductor.org/p/106319/ # Sample Data - # read in file from url, store in new object samples <- read_csv("https://osf.io/84asy/download") # look at the first 6 lines head(samples) # What is this file and how do I make one? # You can make these a few different ways - this is your sample 'metadata', and, depending on the experiment, it can get quite large. # This was pre-made for class, but you can make one in R, command line, or excel. # honestly most of the time I use excel since that's how my sample data comes back from the sequencing facility. Just remember to save your file as a csv when it's ready to be imported. # Create an object with your file names Qfiles <- file.path(dir, samples$quant_file) Qfiles # Combine these files and paths using the tximport function. From the documentation: # The tximport package has a single function for importing transcript-level estimates. # The type argument is used to specify what software was used for estimation. # A simple list with matrices, “abundance”, “counts”, and “length”, is returned, where the transcript level information is summarized to the gene-level. # The “length” matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices, as shown below. txi <- tximport(files = Qfiles, type = "salmon", tx2gene = tx2gene_map) #Examine this a bit names(txi) head(txi$counts) summary(txi) #Change the column names to something more meaningful (e.g., the names of the samples) colnames(txi$counts) <- samples$sample # Prepare this data for the differential expression algorithm # Look through the documentation for this function to find out all the options you have here # documentation in R is found by a command like "?DESeqDataSetFromTximport") dds <- DESeqDataSetFromTximport(txi = txi, colData = samples, design = ~condition) # Run DESeq on this dataset dds <- DESeq(dds) # Store DESeq results in a new object res <- results(dds) # Check out results head(res) # Store a subset of results in a new object, in this case, the ones with an adjusted p-value of < 0.05 res_sig <- subset(res, padj<.05) # Out of the subset we created above, subset the results that changed between conditions res_lfc <- subset(res_sig, abs(log2FoldChange) > 1) head(res_lfc) #We aren't scheduled to start visualizing yet, but it's fun to make graphs and you worked hard today so try this one: plotMA(res) ``` ### End of Class Tuesday --- ### Beginning of Class Thursday We left off by making our first plot (!) which is an MA plot. The MA plot is useful for looking to see that your expression data is normalized, and works off of the assumption that most genes will not be differentially expressed (you can check this by seeing if the convergence of data points is zero). DESeq2 has a few other plotting functions built in to the package. You might want to see how an individual gene is expressed differently across conditions. If you have a specific one in mind, you can use that gene by name: ```r plotCounts(dds, gene="ITS1-1", intgroup="condition") ``` You can also look at the gene that has the most significant adjusted p-value: ```r plotCounts(dds, gene=which.min(res$padj), intgroup="condition") ``` We can use an MDS plot to see a spatial distribution of the samples ```r #Normalize raw counts (dataset before we applied DESeq algorithm) on stabalizing variance vsd <- vst(dds) # calculate sample distances sample_dists <- assay(vsd) %>% t() %>% dist() %>% as.matrix() head(sample_dists) #Next, let’s calculate the MDS values from the distance matrix. mdsData <- data.frame(cmdscale(sample_dists)) mds <- cbind(mdsData, as.data.frame(colData(vsd))) # combine with sample data head(mds) ggplot(mds, aes(X1, X2, shape = condition)) + geom_point(size = 3) + theme_minimal() ``` We can also plot heatmaps with our data: ```r install.packages("pheatmap") library(pheatmap) #Next, we can select a subset of genes to plot. Although we could plot all ~6000 yeast genes, let’s choose the 20 genes with the largest positive log2fold change. genes <- order(res_lfc$log2FoldChange, decreasing=TRUE)[1:20] #We can also make a data.frame that contains information about our samples that will appear in the heatmap. We will use our samples data.frame from before to do this. annot_col <- samples %>% column_to_rownames('sample') %>% select(condition) %>% as.data.frame() head(annot_col) # And now plot the heatmap! pheatmap(assay(vsd)[genes, ], cluster_rows=TRUE, show_rownames=TRUE, cluster_cols=FALSE, annotation_col=annot_col) ``` #### ggplots ```r d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE) library("ggplot2") ggplot(d, aes(x=condition, y=count)) + geom_point(position=position_jitter(w=0.1,h=0)) + scale_y_log10(breaks=c(25,100,400)) ``` ## Gene Ontology/Expression Analysis If you're working with a well studied organism, you can usually find a description of the genes that you're working with. Here's a site that has the descriptions of the Saccharomyces cerevisiae genome - very useful! https://www.yeastgenome.org/search?q=&category=resource Select the GO Slim mapping file, and it will start a download of a text file we can use to attach descriptions to our genes. # Gene Ontology ```r #load mapping file: GO_map <- read_tsv("/Users/crivaldi/Box/TA/Intro_Comp_Genom/RNA_Seq/go_slim_mapping.tab", col_names = F) colnames(GO_map) <- c("Standard Name", "Systematic Name", "SGD ID", "CFP", "Description","GO ID","Feature Type") head(rownames(res_lfc)) res_lfc.df <- as.data.frame(res_lfc) res_lfc.df <- rownames_to_column(res_lfc.df, var = "Standard Name") res_descriptions <- merge(res_lfc.df, GO_map, "Standard Name") top_res_descriptions <- res_descriptions %>% filter(abs(log2FoldChange) > 4) ggplot(top_res_descriptions) + geom_boxplot(aes(x = `Description`, y = `log2FoldChange`)) + theme(axis.text.x=element_text(angle = 45)) ``` --- ## External Links This workflow adapted from the tutorial taught at ANGUS 2019 (UC Davis) https://angus.readthedocs.io/en/2019/diff-ex-and-viz.html# Experimental dataset that can be used for exploring more analyses: https://bioconductor.org/packages/3.11/data/experiment/html/airway.html Paper describing full analysis of the airway dataset: https://pubmed.ncbi.nlm.nih.gov/24926665/ What is an MA plot and why is it useful? https://reneshbedre.github.io/blog/ma.html Tools for evaluating splicing: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz126/5648232 Identifying alternative splice junctions: https://academic.oup.com/bioinformatics/article/25/9/1105/203994