---
tags: ggg, ggg2026, ggg201b
---
# Week 8 - Summary of bulk RNAseq analysis - Canine Differential Gene Expression! - GGG 201B lab section: Genomics (Winter 2026)
2/27/26
Jamie Norris, jknorris@ucdavis.edu
* https://codeberg.org/jamiekaj/GGG201B
* https://hackmd.io/6JOdliknSkmz50YFt88yZQ
<hr>
#### Table of Contents
[TOC]
# At the beginning
* start zoom / recording!
* Introduction and questions?
* <a href="https://hackmd.io/aa1jWbvBTbGTeo9j6NLkqA?view#Cleaning-up-old-disk-space">How's everyone doing with available disk space? </a>
```
free_up_some_space() {
rm -rf ~/.cache/rstudio-server/
rm -rf ~/.RData
rm -rf ~/.local/share/rstudio/
rm -rf ~/.config/rstudio
conda clean --all
rm -rf ~/.conda/pkgs
rm -rf ~/.conda/envs/*/pkgs
}
```
## Review
* Rstudio / OnDemand
* Snakemake
* Genomic data alignment and SNV calling
* Data alignment visualization with IGV
* *E.coli* genome assembly & annotation
## Learning objectives, hands-on section of week 8:
* **At the end of today, you'll kinda understand how to:**
* find and download raw RNAseq data from a paper
* differential expression analysis
* gene ontology search with this data
* produce publication quality figures?
## RNAseq > DEG analysis Overview
#### a complicated (simplified) overview
```mermaid
graph TD;
A[Raw Fastq Files] --> B[Trim & Filter];
B -- Trim Galore / Trimmomatic --> C[Align];
B -- Trim Galore / Trimmomatic --> D[Pseudoalign & Quantify];
C -- HiSat2 --> E[Quantify];
C -- STAR --> E[Quantify];
C -- Minimap2 --> E[Quantify];
E -- FeatureCounts --> F[DE Genes];
E -- htseqcount --> F[DE Genes];
D -- Kallisto --> F[DE Genes];
D -- Salmon --> F[DE Genes];
F -- edgeR --> H[Further Analysis];
F -- limma --> H[Further Analysis];
F -- DESeq2 --> H[Further Analysis];
```
# Multiple RNA-seq Pipelines:
## Historic (~2009–2014):
- fastq files
1. (Optional) cutadapt / FASTX filtering
2. TopHat (align)
3. CuffLinks (transcript assembly)
4. CuffMerge (merge transcriptome)
5. CuffDiff (differential gene expression)
## Alignment to a reference genome assembly (~2014–Current)
- fastq files
1. Trimmomatic / TrimGalore (optional trimming for adapters & quality)
2. Alignment to reference genome
- HISAT2 or STAR for short reads
- Minimap2 for long reads (~2018)
3. Gene-level counting
- htseq-count or featureCounts
4. Differential expression analysis
- DESeq2 (or edgeR / limma-voom)
## Pseudoalignment Emergence (~2015–2018)
- Kallisto and Salmon are introduced
## Modern Bulk RNA-seq (~2018–current)
1. **Preprocessing:**
* (Optional) Trimming and low-quality read cleaning with Trim Galore or Trimmomatic
2. **Quantification with Salmon:**
* Indexing the reference transcriptome (`salmon index`)
* Quantifying isoforms using your RNA-seq data (`salmon quant`)
* hashing to rapidly determine which transcripts a read could originate from, without performing a full base-by-base alignment
* Lightweight transcriptome-level pseudoalignment/quantification
<div style="text-align: center;"><a href="https://combine-lab.github.io/salmon/getting_started/"><img src="https://hackmd.io/_uploads/r1qjwXL_-e.png", alt="Salmon!" width="500"></a></div>
3. **Importing Salmon results into R:**
* Creating a transcript-to-gene mapping (`tx2gene`) using `GenomicFeatures`
* Importing Salmon quantification files with `tximport`
4. **DESeq2 differential expression analysis:**
* DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth.
* (scales the data so you can compare "apples to apples")
* Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts.
* (average variance across all genes and "pulls" (shrinks) outlier genes toward that average)
* Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.
* doesn't follow a bell curve - generated the final p-value to suggest the change is real and exciting!
5. **Exploration and visualization:**
* Principal Component Analysis (PCA)
* Volcano plot
* Heatmap
6. **Output:**
* Writing the results to a CSV file
## Today
**Option A (Genome-based)**
STAR / HISAT2 → featureCounts → DESeq2
**Option B (Transcriptome-based, dominant in many labs now)**
Salmon / Kallisto → tximport → DESeq2
# Let's start:
Please go [here](https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts/new):
<div style="text-align: center"><a href="https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts/new" target="_blank"><img src="https://hackmd.io/_uploads/Syz_tXL_Wl.jpg", alt="FARM!" width="300"></a></div>
* Account: ctbrowngrp
* Partition: low
* **Number of cores: 1**
* **Amount of memory: 12**
* Conda environment: r-4.4.2
* Number of hours: 3
leave everything else as-is, and click "Launch".
# Organize your project directory!
```
mkdir -p ~/201b-RNAseq && cd ~/201b-RNAseq
git clone https://codeberg.org/jamiekaj/GGG201B.git
cd ~/201b-RNAseq/GGG201B
```
**Now we need some raw RNAseq data to work with.**
## What RNAseq data do we want and where is it?
<div style="text-align: center"><img src="https://hackmd.io/_uploads/S11D9Rnd-e.png", alt="puppy!" width="300"></div>
* **Domestication!**
* Does domestication look the same (transcriptomically) in all animals?
* Compared the frontal cortex gene transcription (mRNA) of dogs/wolves, pigs/boars, and domesticated/wild rabbits
* Wolves > Dogs, ~15,000 years ago
* spolier alert: the answer is mostly no... domestication is different for every species.
* Conclusion: The cortical transcriptomes are different, but not dramatically.
* ~fdr 0.01, 30 DE genes between dogs and wolves
* Most upregulated in dogs:TKTL1
* ~47-fold higher expression in dogs
* Involved in aerobic glycolysis
* Most upregulated in wolves: AP5Z1 (KIAA0415)
* ~2.5-fold higher in wolves
* Involved in intracellular protein trafficking
* Dogs had strong gene enrichment for:
* “immune system process”
* “leukocyte migration”
* immune regulation changes may be part of dog domestication.
* Dog domestication appears to involve modest regulatory tuning rather than massive brain transcriptome reprogramming.
<hr>
**neither endorsed by Titus nor UCDavis**
## Scihub & find the PROJECT ID!
* <a href = "https://sci-hub.st/10.1371/journal.pgen.1002962" > https://sci-hub.st/10.1371/journal.pgen.1002962 </a>
<iframe height="150" width="300" src="https://sci-hub.st/10.1371/journal.pgen.1002962" title="scihub"></iframe>
#### Please go here and look for the BioProject ID:
<a href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002962">Albert, Frank W., Mehmet Somel, Miguel Carneiro, et al. “A Comparison of Brain Gene Expression Levels in Domesticated and Wild Animals.” PLOS Genetics 8, no. 9 (2012): e1002962. https://doi.org/10.1371/journal.pgen.1002962.</a>
* Hint: Use "find"
* **Linux/Windows**: ctrl + f > "accession"
*or*
* **Mac**: command(⌘) +f > "accession"
* We're looking for something like an "accession code"
* "Raw sequencing reads are available in the ArrayExpress archive
(http://www.ebi.ac.uk/arrayexpress/) as accession **E-MTAB-1249**."
* https://www.ebi.ac.uk/biostudies/arrayexpress/studies?query=E-MTAB-1249
* https://www.ncbi.nlm.nih.gov/biosample/?term=E-MTAB-1249
<hr>
#### Purely so you guys have it
##### (~300Mb is smaller than 112Gb)
* **don't run this** - it takes a while and takes up a lot of space
```
# Create conda environment
conda create -p \
~ctbrown/scratch3/2026-conda/$USER/ncbi-entrez \
-y entrez-direct sra-tools pigz
# softlink the conda enviroment into your home conda environment list
ln -s ~ctbrown/scratch3/2026-conda/$USER/ncbi-entrez ~/.conda/envs
# activate the conda environment
conda activate ncbi-entrez
```
* replace the *** with the BioProject ID
```
PROJECT_ID="***" # <<< here - PRJEB3197
# Download metadata for the project
esearch -db sra -query "$PROJECT_ID" | efetch -format runinfo > metadata.csv
# Filter for RNA-Seq runs AND only dog/wolf samples
grep "RNA-Seq" metadata.csv | grep -iE "dog|wolf" | cut -d ',' -f 1 > srr_list_dog_wolf.txt
for SRR in $(cat srr_list_dog_wolf.txt); do
echo "Downloading $SRR ..."
fasterq-dump --threads 8 --progress --split-files --temp /tmp "$SRR"
pigz -p 8 "${SRR}_1.fastq" "${SRR}_2.fastq"
done
conda deactivate
```
## SNAKEMAKE + salmon
```
# Create conda environment
conda create -p \
~ctbrown/scratch3/2026-conda/$USER/salmon_test \
-y salmon pigz agat
# softlink the conda enviroment into your home conda environment list
ln -s ~ctbrown/scratch3/2026-conda/$USER/salmon_test ~/.conda/envs
# activate the conda environment
conda activate salmon_test
conda deactivate
```
### Snakefile
```
############################################
# Assign some variables to make your life a little easier
############################################
RAW_DIR = "../raw"
TRANS_DIR = "../transcriptome"
TRIM_DIR = "../trim"
LOG_DIR = "../log"
OUT_DIR = "Snakemake_output"
TRANSCRIPTOME = "GCF_011100685.1_UU_Cfam_GSD_1.0_rna.fna"
INDEX_NAME = "DOG_GCF_011100685_index"
ADAPTERS = f"{RAW_DIR}/all_illumina.fa"
INDEX_THREADS = 8
PIGZ_THREADS = 8
QUANT_THREADS = 12
TRIM_THREADS = 8
############################################
# Sampe detection...
############################################
SAMPLES, = glob_wildcards(f"{RAW_DIR}/{{sample}}_1.fastq.gz")
############################################
# ALL rule
############################################
rule all:
input:
directory(f"{TRANS_DIR}/{INDEX_NAME}"),
# Quantifications
expand(f"{OUT_DIR}/quants/{{sample}}/quant.sf", sample=SAMPLES),
# Illumina Adapters...
f"{RAW_DIR}/all_illumina.fa",
# FastQC reports
expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html", sample=SAMPLES),
expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html", sample=SAMPLES),
# Reference
f"{TRANS_DIR}/{TRANSCRIPTOME}",
# MultiQC
f"{OUT_DIR}/multiqc/multiqc_report.html"
############################################
# download the dog reference from NCBI
############################################
rule download_reference:
output:
fasta = f"{TRANS_DIR}/{TRANSCRIPTOME}"
params:
url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/100/685/GCF_011100685.1_UU_Cfam_GSD_1.0"
threads: PIGZ_THREADS
conda: "ncbi-entrez"
shell:
"""
mkdir -p {TRANS_DIR}
wget -nc {params.url}/{TRANSCRIPTOME}.gz -P {TRANS_DIR}
pigz -d -p {threads} -f {TRANS_DIR}/{TRANSCRIPTOME}.gz
"""
############################################
# FASTQC - on the raw data
############################################
rule fastqc_raw:
input:
r1 = f"{RAW_DIR}/{{sample}}_1.fastq.gz",
r2 = f"{RAW_DIR}/{{sample}}_2.fastq.gz"
output:
html1 = f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html",
html2 = f"{OUT_DIR}/fastqc/raw/{{sample}}_2_fastqc.html"
conda: "qc_test"
shell:
"""
mkdir -p {OUT_DIR}/fastqc/raw
fastqc {input.r1} {input.r2} --outdir {OUT_DIR}/fastqc/raw
"""
############################################
# trimmomatic
############################################
rule create_illumina_adapters:
output:
adapters = f"{RAW_DIR}/all_illumina.fa"
run:
adapters = {
"PrefixNX/1": "AGATGTGTATAAGAGACAG",
"PrefixNX/2": "AGATGTGTATAAGAGACAG",
"Trans1": "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG",
"Trans1_rc": "CTGTCTCTTATACACATCTGACGCTGCCGACGA",
"Trans2": "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG",
"Trans2_rc": "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC",
"PrefixPE/1": "TACACTCTTTCCCTACACGACGCTCTTCCGATCT",
"PrefixPE/2": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
"PCR_Primer1": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
"PCR_Primer1_rc": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
"PCR_Primer2": "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
"PCR_Primer2_rc": "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG",
"FlowCell1": "TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC",
"FlowCell2": "TTTTTTTTTTCAAGCAGAAGACGGCATACGA",
"TruSeq2_SE": "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG",
"TruSeq2_PE_f": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
"TruSeq2_PE_r": "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
"TruSeq3_IndexedAdapter": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
"TruSeq3_UniversalAdapter": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"
}
with open(output.adapters, "w") as f:
for header, seq in adapters.items():
f.write(f">{header}\n{seq}\n")
rule trimmomatic:
input:
r1 = f"{RAW_DIR}/{{sample}}_1.fastq.gz",
r2 = f"{RAW_DIR}/{{sample}}_2.fastq.gz",
full_adapter_list = f"{RAW_DIR}/all_illumina.fa"
output:
r1_paired = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz",
r1_unpaired = f"{TRIM_DIR}/{{sample}}_1_unp.fastq.gz",
r2_paired = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz",
r2_unpaired = f"{TRIM_DIR}/{{sample}}_2_unp.fastq.gz"
threads: TRIM_THREADS
conda: "salmon_test"
shell:
"""
mkdir -p {TRIM_DIR}
trimmomatic PE -threads {threads} -phred33 \
{input.r1} {input.r2} \
{output.r1_paired} {output.r1_unpaired} \
{output.r2_paired} {output.r2_unpaired} \
ILLUMINACLIP:{input.full_adapter_list}:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
"""
############################################
# FASTQC - on the trimmed data
############################################
rule fastqc_trimmed:
input:
r1 = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz",
r2 = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz"
output:
html1 = f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html",
html2 = f"{OUT_DIR}/fastqc/trimmed/{{sample}}_2_paired_fastqc.html"
conda: "qc_test"
shell:
"""
mkdir -p {OUT_DIR}/fastqc/trimmed
fastqc {input.r1} {input.r2} --outdir {OUT_DIR}/fastqc/trimmed
"""
############################################
# SALMON Transcriptome Index...
############################################
rule salmon_index:
input:
transcriptome = f"{TRANS_DIR}/{TRANSCRIPTOME}"
output:
idx = directory(f"{TRANS_DIR}/{INDEX_NAME}")
threads: INDEX_THREADS
conda: "salmon_test"
shell:
"salmon index -t {input.transcriptome} -i {output.idx} -p {threads}"
############################################
# SALMON!
############################################
rule salmon_quant:
input:
index = f"{TRANS_DIR}/{INDEX_NAME}",
r1 = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz",
r2 = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz"
output:
quant = f"{OUT_DIR}/quants/{{sample}}/quant.sf",
dir = directory(f"{OUT_DIR}/quants/{{sample}}")
threads: QUANT_THREADS
conda: "salmon_test"
shell:
"""
salmon quant -i {input.index} -l A \
-1 {input.r1} \
-2 {input.r2} \
-p {threads} --validateMappings \
-o {output.dir}
"""
############################################
# Multiqc - see how we did and look at some cool QC metrics
############################################
rule multiqc:
input:
expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html", sample=SAMPLES),
expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_2_fastqc.html", sample=SAMPLES),
expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html", sample=SAMPLES),
expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_2_paired_fastqc.html", sample=SAMPLES),
output:
f"{OUT_DIR}/multiqc/multiqc_report.html"
conda: "qc_test"
shell:
"""
multiqc {OUT_DIR} -o {OUT_DIR}/multiqc --force
"""
```
## Import Salmon Quant Files into R
### R SETUP - load/install some libraries
```
# 🎃️ SETUP ----
## Remove any stored variables ----
rm(list=ls())
## Install the Libraries ----
### Ensure BiocManager is installed ----
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
packages_to_use <- c("readr", "stringr", "openxlsx", "dplyr", "pheatmap",
"tximport", "DESeq2", "GenomicFeatures", "ggplot2", "svglite",
"AnnotationDbi", "matrixStats", "gprofiler2", "rbioapi", "txdbmaker")
### Install missing packages ----
missing_packages <- packages_to_use[!packages_to_use %in% rownames(installed.packages())]
if (length(missing_packages) > 0) {
BiocManager::install(missing_packages, ask = FALSE, update = FALSE)
}
### Load the libraries ----
invisible(lapply(packages_to_use, library, character.only = TRUE))
```
### Fancy RSTUDIO setwd()
```
# if we're using Rstudio, set the working directory to wherever this file is saved
if(.Platform$GUI == "RStudio") {
if (!(TRUE %in% grepl(pattern = "rstudioapi", x = rownames(installed.packages())))) {
install.packages('rstudioapi')
}
library(rstudioapi)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
} else {
setwd("~/201b-RNAseq/GGG201B/Week8_RNAseq/analysis")
}
getwd()
```
### Variables
```
# Define some variables ----
# what do you want to consider "statistically relevant"?
## Fold Change?
tmp_fc <- 1
## False-discovery rate adjusted p-value?
tmp_padj <- 0.05
highcolor <- "coral" # or #FF7F50
midcolor <- "lightgrey" # or #d3d3d3
lowcolor <- "skyblue" # or #87CEEB
today_date <- format(Sys.Date(), "%y%m%d")
```
### Load the metadata
```
# Import your metadata ----
meta_raw <- read.csv("../raw/canine_metadata.csv", header = FALSE)
meta <- data.frame(
sample = meta_raw$V1,
title = meta_raw$V12,
cute_title = stringr::str_extract(meta_raw$V12, "^[^_]+_[^_]+"), # we only want the "title" - but want to remove everything after the second "_"
sex = meta_raw$V35)
meta$species <- ifelse(grepl("^Wolf", meta$title), "Wolf",
ifelse(grepl("^Dog", meta$title), "Dog", NA))
meta$species <- factor(meta$species)
meta$sex <- factor(meta$sex)
meta$batch <- ifelse(grepl("^ERR266", meta$sample), "Batch1", "Batch2")
meta$batch <- factor(meta$batch)
## Explore ----
table(meta$species, meta$sex)
table(meta$species, meta$batch)
```
### Transcripts to genes
```
# Transcript to gene ----
## 1. Make TxDb ----
txdb <- makeTxDbFromGFF("../genome/GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.gff")
## 2. Extract transcripts with gene mapping ----
tx_df <- as.data.frame(transcripts(txdb, columns=c("tx_name","gene_id")))
## 3. Fill missing gene IDs with tx_id ----
tx_df$gene_id[is.na(tx_df$gene_id)] <- tx_df$tx_id[is.na(tx_df$gene_id)]
## 4. Ensure unique transcript names ----
tx_df$tx_name <- make.unique(tx_df$tx_name)
## 5. Create transcript to gene dataframe ----
tx2gene_df <- tx_df[, c("tx_name","gene_id")]
colnames(tx2gene_df) <- c("TXNAME","GENEID")
tx2gene_df$TXNAME <- as.character(tx2gene_df$TXNAME)
tx2gene_df$GENEID <- as.character(tx2gene_df$GENEID)
```
### Load the read quantification files
```
# Get the quantification files from Salmon
samples_list <- read.csv("../raw/srr_list_rnaseq.txt", header = FALSE)
# actually find the files
quant_files <- file.path("Snakemake_output/quants",samples_list[,1], "quant.sf")
# use our sample_list to name our files for later (aesthetic)
names(quant_files) <- samples_list[,1]
# view an example
read.table(quant_files[1], header=T)
# convert to dataframe
tx2gene_df <- as.data.frame(tx2gene_df)
# transcript import
txi <- tximport(
files = quant_files,
type = "salmon",
tx2gene = tx2gene_df,
countsFromAbundance = "lengthScaledTPM"
)
```
### Differential Gene Expression Analysis - DESeq2
```
# 🐺+🐶 DESeq2 ----
# Here's the cool part - DESeq2 adjusts for unwanted variation (batch, sex) while testing your main biological question (species differences)!
dds <- DESeqDataSetFromTximport(
txi = txi,
colData = meta,
design = ~ batch + sex + species
)
# fit DESeq's neg. binomial model & calculate the stats
dds_fit <- DESeq(dds)
# print the available comparisons
resultsNames(dds_fit)
# define the contrast that you want to focus on
contrast_list <- c("species", "Dog", "Wolf")
# extract the specific results that you want
res_species_Dog_vs_Wolf <- results(dds_fit, contrast = contrast_list, alpha = tmp_padj)
# preview your results!
head(x = res_species_Dog_vs_Wolf, n = 10)
# important - how would you get the other contrasts???
res_sex_male_vs_female <- results(dds_fit, contrast = c("sex", "male", "female"), alpha = tmp_padj)
res_sex_female_vs_male <- results(dds_fit, contrast = c("sex", "female", "male"), alpha = tmp_padj)
# define a contrast name variable
comparison_name <- paste0(contrast_list[2], "_V_", contrast_list[3])
```
#### Write to CSV
```
## SAVE to TEXT ----
csv_filename <- paste0(today_date, "_", comparison_name, ".csv")
write.csv(x = res_species_Dog_vs_Wolf, file = csv_filename)
```
#### Write to Excel
```
## SAVE to EXCEL ----
res_species_Dog_vs_Wolf$gene <- rownames(res_species_Dog_vs_Wolf)
res <- res_species_Dog_vs_Wolf %>%
as.data.frame() %>%
mutate(
Group = case_when(
log2FoldChange > tmp_fc & padj < tmp_padj ~ "Up-regulated",
log2FoldChange < -tmp_fc & padj < tmp_padj ~ "Down-regulated",
TRUE ~ "Not-significant"
)
)
# up in dog compared to wolf
up_df <- res %>% filter(Group == "Up-regulated")
# down in dog compared to wolf
down_df <- res %>% filter(Group == "Down-regulated")
# establish our workbook
wb <- openxlsx::createWorkbook()
# add a sheet - "all genes"
addWorksheet(wb, "All_Genes")
writeData(wb, "All_Genes", res)
# add a sheet - "UP"
addWorksheet(wb, "Upregulated")
writeData(wb, "Upregulated", up_df)
# add a sheet - "DOWN"
addWorksheet(wb, "Downregulated")
writeData(wb, "Downregulated", down_df)
# add a sheet - "Just rawcounts"
addWorksheet(wb, "RawCounts")
writeData(wb = wb, sheet = "RawCounts", x =data.frame(gene = rownames(txi$counts), txi$counts))
# add a sheet - "Just TPM values"
addWorksheet(wb, "TPM")
writeData(wb, "TPM",
data.frame(gene = rownames(txi$abundance), txi$abundance))
xlsx_filename <- paste0(today_date, "_", comparison_name, ".xlsx")
saveWorkbook(wb = wb, file = xlsx_filename, overwrite = TRUE)
```
### PCA plot
```
# PCA ----
vsd <- DESeq2::vst(dds) # Quickly estimate dispersion trend and apply a variance stabilizing transformation
pca_species <- plotPCA(vsd, intgroup = "species")
pca_species
pca_cute <- plotPCA(vsd, intgroup = "cute_title")
pca_cute
# save your PCA - *.svg vector images!
svg_filename <- paste0(today_date, "_pca.svg")
ggsave(filename = svg_filename, plot = pca_species, width = 6, height = 6)
```
### Volcano Plot
```
# Volcano ----
# Define the results object for brevity
res <- res_species_Dog_vs_Wolf
# Create a color vector
my_colors <- ifelse(
res$padj < 0.05 & res$log2FoldChange > 1, highcolor,
ifelse(res$padj < 0.05 & res$log2FoldChange < -1, lowcolor, midcolor)
)
# 1. Open the file device
png(filename = "Dog_vs_Wolf_Volcano.png", width = 1600, height = 1600, units = "px", res = 300)
#svg(filename = "Dog_vs_Wolf_Volcano.svg", width = 6, height = 6)
## Pixels (Raster) vs. Inches (Vector)
# 2. Plot
plot(res$log2FoldChange, -log10(res$padj),
pch = 20,
main = "Dog vs. Wolf Volcano Plot",
xlab = "Log2 Fold Change",
ylab = "-Log10 Adjusted P-value",
col = my_colors)
# 3. Add the lines
abline(h = -log10(0.05), col = "black", lty = 2)
abline(v = c(-1, 1), col = "black", lty = 2)
# 4. Close the file device
dev.off()
list.files()
```
### Heatmap
```
# Heatmap ----
# 1. Get the top 20 most variable genes (with gene names)
vsd_filtered <- vsd[!grepl("^LOC", rownames(vsd)), ]
top_genes <- head(order(rowVars(assay(vsd_filtered)), decreasing = TRUE), 20)
mat <- assay(vsd_filtered)[top_genes, ]
# 2. Subtract the row mean so we see relative changes
mat <- mat - rowMeans(mat)
# 3. Plot & save it!
png(filename = "Dog_vs_Wolf_Heatmap.png", width = 1600, height = 1600, units = "px", res = 300)
pheatmap(mat,
annotation_col = as.data.frame(colData(vsd)["species"]),
main = "Top 20 Variable Genes: Dog vs. Wolf")
dev.off()
```
## So you have a bunch of genes that are statistically exciting - but what does it mean biologically?
### STRINGdb - Functional Protein Association Networks
* Protein-protein interactions
* Enrichment based on interaction networks
<div style="text-align:center;"><a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1"><img src="https://hackmd.io/_uploads/ryEjy4IO-e.gif", alt="Salmon!" width="600"></a></div>
<br>
* *How did I find the Taxonomy ID for "canis familiaris?"*
* [link](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1)
```
# STRINGdb Functional Protein Association Networks! - Protein-Protein interactions ----
# https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1
proteins_mapped_df_up <- rbioapi::rba_string_map_ids(ids = rownames(up_df), species = 9615)
rbioapi::rba_string_enrichment(ids = proteins_mapped_df_up$stringId,
species = 9615,
split_df = TRUE)
proteins_mapped_df_down <- rbioapi::rba_string_map_ids(ids = rownames(down_df), species = 9615)
rbioapi::rba_string_enrichment(ids = proteins_mapped_df_down$stringId,
species = 9615,
split_df = TRUE)
```
### g:Profiler - gene set enrichment analysis
[link](https://biit.cs.ut.ee/gprofiler/page/organism-list)
```
# Gprofiler - gene set enrichment analysis / functional enrichment analysis / over-representation analysis (ORA)! ----
# https://biit.cs.ut.ee/gprofiler/page/organism-list
gprofiler_output_up <- gprofiler2::gost(
query = rownames(up_df),
organism = "clfamiliaris",
correction_method = "fdr",
significant = TRUE
)
# view the output
View(gprofiler_output_up$result)
gprofiler_plot_up <- gprofiler2::gostplot(gprofiler_output_up,
capped = FALSE,
interactive = TRUE)
gprofiler_plot_up
gprofiler_output_down <- gprofiler2::gost(
query = rownames(down_df),
organism = "clfamiliaris",
correction_method = "fdr",
significant = TRUE
)
View(gprofiler_output_down$result)
gprofiler_plot_down <- gprofiler2::gostplot(gprofiler_output_down, capped = FALSE, interactive = FALSE)
gprofiler_plot_down
# manual searches? ----
UP_named_genes <- up_df[which( grepl(pattern = "^LOC", x = rownames(up_df)) == FALSE ),]
DOWN_named_genes <- down_df[which( grepl(pattern = "^LOC", x = rownames(down_df)) == FALSE ),]
# if you want to manually look these up
cat(rownames(UP_named_genes), sep = "\n")
cat(rownames(DOWN_named_genes), sep = "\n")
```
### output for manual searches?
```
# manual searches? ----
UP_named_genes <- up_df[which( grepl(pattern = "^LOC", x = rownames(up_df)) == FALSE ),]
DOWN_named_genes <- down_df[which( grepl(pattern = "^LOC", x = rownames(down_df)) == FALSE ),]
# if you want to manually look these up
cat(rownames(UP_named_genes), sep = "\n")
cat(rownames(DOWN_named_genes), sep = "\n")
```
## Learning resources
<div style="text-align: center;"><a href="https://ucdavisdatalab.github.io/workshop_index/#r"><img src="https://hackmd.io/_uploads/HJNkuBU_-l.png", alt="UCD_DataLab!" width="300"></a>
<br><hr>
<a href="https://cran.r-project.org/doc/manuals/r-release/R-intro.html">
<img src="https://hackmd.io/_uploads/HyfzdQL_bg.png", alt="R-INTRO!" width="150">
</a>
<a href="https://learning.oreilly.com/library/view/r-for-data/9781492097396/">
<img src="https://hackmd.io/_uploads/ryfCFBIube.jpg", alt="R-OREILLY!" width="150">
</a>
<a href="https://docs.posit.co/ide/user/ide/get-started/">
<img src="https://hackmd.io/_uploads/SkUtdmUdZg.png", alt="Learn More RSTUDIO!" width="100">
</a>
</div><hr>
## R / Rstudio Pro Tips
**I always start R scripts with this - just removes all stored variables so you start fresh**
```
rm(list = ls())
```
**Code Snippets**
```
https://docs.posit.co/ide/user/ide/guide/productivity/snippets.html
```
**Code Tools**
- reformat selection
**Code Organization**
```
# SECTION ----
# followed by 4x "#" or "-"
#####
#----
# use this space to name your sections ####
# section ----
## sub-section ----
### sub-sub-section ----
#### etc. ----
```
# The End!
**please email me if you would like additional tips or have questions!**
jknorris@ucdavis.edu