Week 8 - Summary of bulk RNAseq analysis - Differential Gene Expression! - GGG 201B lab section: Genomics (Winter 2 copy

--- tags: ggg, ggg2026, ggg201b --- # Week 8 - Summary of bulk RNAseq analysis - Canine Differential Gene Expression! - GGG 201B lab section: Genomics (Winter 2026) 2/27/26 Jamie Norris, jknorris@ucdavis.edu * https://codeberg.org/jamiekaj/GGG201B * https://hackmd.io/6JOdliknSkmz50YFt88yZQ <hr> #### Table of Contents [TOC] # At the beginning * start zoom / recording! * Introduction and questions? * <a href="https://hackmd.io/aa1jWbvBTbGTeo9j6NLkqA?view#Cleaning-up-old-disk-space">How's everyone doing with available disk space? </a> ``` free_up_some_space() { rm -rf ~/.cache/rstudio-server/ rm -rf ~/.RData rm -rf ~/.local/share/rstudio/ rm -rf ~/.config/rstudio conda clean --all rm -rf ~/.conda/pkgs rm -rf ~/.conda/envs/*/pkgs } ``` ## Review * Rstudio / OnDemand * Snakemake * Genomic data alignment and SNV calling * Data alignment visualization with IGV * *E.coli* genome assembly & annotation ## Learning objectives, hands-on section of week 8: * **At the end of today, you'll kinda understand how to:** * find and download raw RNAseq data from a paper * differential expression analysis * gene ontology search with this data * produce publication quality figures? ## RNAseq > DEG analysis Overview #### a complicated (simplified) overview ```mermaid graph TD; A[Raw Fastq Files] --> B[Trim & Filter]; B -- Trim Galore / Trimmomatic --> C[Align]; B -- Trim Galore / Trimmomatic --> D[Pseudoalign & Quantify]; C -- HiSat2 --> E[Quantify]; C -- STAR --> E[Quantify]; C -- Minimap2 --> E[Quantify]; E -- FeatureCounts --> F[DE Genes]; E -- htseqcount --> F[DE Genes]; D -- Kallisto --> F[DE Genes]; D -- Salmon --> F[DE Genes]; F -- edgeR --> H[Further Analysis]; F -- limma --> H[Further Analysis]; F -- DESeq2 --> H[Further Analysis]; ``` # Multiple RNA-seq Pipelines: ## Historic (~2009–2014): - fastq files 1. (Optional) cutadapt / FASTX filtering 2. TopHat (align) 3. CuffLinks (transcript assembly) 4. CuffMerge (merge transcriptome) 5. CuffDiff (differential gene expression) ## Alignment to a reference genome assembly (~2014–Current) - fastq files 1. Trimmomatic / TrimGalore (optional trimming for adapters & quality) 2. Alignment to reference genome - HISAT2 or STAR for short reads - Minimap2 for long reads (~2018) 3. Gene-level counting - htseq-count or featureCounts 4. Differential expression analysis - DESeq2 (or edgeR / limma-voom) ## Pseudoalignment Emergence (~2015–2018) - Kallisto and Salmon are introduced ## Modern Bulk RNA-seq (~2018–current) 1. **Preprocessing:** * (Optional) Trimming and low-quality read cleaning with Trim Galore or Trimmomatic 2. **Quantification with Salmon:** * Indexing the reference transcriptome (`salmon index`) * Quantifying isoforms using your RNA-seq data (`salmon quant`) * hashing to rapidly determine which transcripts a read could originate from, without performing a full base-by-base alignment * Lightweight transcriptome-level pseudoalignment/quantification <div style="text-align: center;"><a href="https://combine-lab.github.io/salmon/getting_started/"><img src="https://hackmd.io/_uploads/r1qjwXL_-e.png", alt="Salmon!" width="500"></a></div> 3. **Importing Salmon results into R:** * Creating a transcript-to-gene mapping (`tx2gene`) using `GenomicFeatures` * Importing Salmon quantification files with `tximport` 4. **DESeq2 differential expression analysis:** * DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth. * (scales the data so you can compare "apples to apples") * Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts. * (average variance across all genes and "pulls" (shrinks) outlier genes toward that average) * Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test. * doesn't follow a bell curve - generated the final p-value to suggest the change is real and exciting! 5. **Exploration and visualization:** * Principal Component Analysis (PCA) * Volcano plot * Heatmap 6. **Output:** * Writing the results to a CSV file ## Today **Option A (Genome-based)** STAR / HISAT2 → featureCounts → DESeq2 **Option B (Transcriptome-based, dominant in many labs now)** Salmon / Kallisto → tximport → DESeq2 # Let's start: Please go [here](https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts/new): <div style="text-align: center"><a href="https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts/new" target="_blank"><img src="https://hackmd.io/_uploads/Syz_tXL_Wl.jpg", alt="FARM!" width="300"></a></div> * Account: ctbrowngrp * Partition: low * **Number of cores: 1** * **Amount of memory: 12** * Conda environment: r-4.4.2 * Number of hours: 3 leave everything else as-is, and click "Launch". # Organize your project directory! ``` mkdir -p ~/201b-RNAseq && cd ~/201b-RNAseq git clone https://codeberg.org/jamiekaj/GGG201B.git cd ~/201b-RNAseq/GGG201B ``` **Now we need some raw RNAseq data to work with.** ## What RNAseq data do we want and where is it? <div style="text-align: center"><img src="https://hackmd.io/_uploads/S11D9Rnd-e.png", alt="puppy!" width="300"></div> * **Domestication!** * Does domestication look the same (transcriptomically) in all animals? * Compared the frontal cortex gene transcription (mRNA) of dogs/wolves, pigs/boars, and domesticated/wild rabbits * Wolves > Dogs, ~15,000 years ago * spolier alert: the answer is mostly no... domestication is different for every species. * Conclusion: The cortical transcriptomes are different, but not dramatically. * ~fdr 0.01, 30 DE genes between dogs and wolves * Most upregulated in dogs:TKTL1 * ~47-fold higher expression in dogs * Involved in aerobic glycolysis * Most upregulated in wolves: AP5Z1 (KIAA0415) * ~2.5-fold higher in wolves * Involved in intracellular protein trafficking * Dogs had strong gene enrichment for: * “immune system process” * “leukocyte migration” * immune regulation changes may be part of dog domestication. * Dog domestication appears to involve modest regulatory tuning rather than massive brain transcriptome reprogramming. <hr> **neither endorsed by Titus nor UCDavis** ## Scihub & find the PROJECT ID! * <a href = "https://sci-hub.st/10.1371/journal.pgen.1002962" > https://sci-hub.st/10.1371/journal.pgen.1002962 </a> <iframe height="150" width="300" src="https://sci-hub.st/10.1371/journal.pgen.1002962" title="scihub"></iframe> #### Please go here and look for the BioProject ID: <a href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002962">Albert, Frank W., Mehmet Somel, Miguel Carneiro, et al. “A Comparison of Brain Gene Expression Levels in Domesticated and Wild Animals.” PLOS Genetics 8, no. 9 (2012): e1002962. https://doi.org/10.1371/journal.pgen.1002962.</a> * Hint: Use "find" * **Linux/Windows**: ctrl + f > "accession" *or* * **Mac**: command(⌘) +f > "accession" * We're looking for something like an "accession code" * "Raw sequencing reads are available in the ArrayExpress archive (http://www.ebi.ac.uk/arrayexpress/) as accession **E-MTAB-1249**." * https://www.ebi.ac.uk/biostudies/arrayexpress/studies?query=E-MTAB-1249 * https://www.ncbi.nlm.nih.gov/biosample/?term=E-MTAB-1249 <hr> #### Purely so you guys have it ##### (~300Mb is smaller than 112Gb) * **don't run this** - it takes a while and takes up a lot of space ``` # Create conda environment conda create -p \ ~ctbrown/scratch3/2026-conda/$USER/ncbi-entrez \ -y entrez-direct sra-tools pigz # softlink the conda enviroment into your home conda environment list ln -s ~ctbrown/scratch3/2026-conda/$USER/ncbi-entrez ~/.conda/envs # activate the conda environment conda activate ncbi-entrez ``` * replace the *** with the BioProject ID ``` PROJECT_ID="***" # <<< here - PRJEB3197 # Download metadata for the project esearch -db sra -query "$PROJECT_ID" | efetch -format runinfo > metadata.csv # Filter for RNA-Seq runs AND only dog/wolf samples grep "RNA-Seq" metadata.csv | grep -iE "dog|wolf" | cut -d ',' -f 1 > srr_list_dog_wolf.txt for SRR in $(cat srr_list_dog_wolf.txt); do echo "Downloading $SRR ..." fasterq-dump --threads 8 --progress --split-files --temp /tmp "$SRR" pigz -p 8 "${SRR}_1.fastq" "${SRR}_2.fastq" done conda deactivate ``` ## SNAKEMAKE + salmon ``` # Create conda environment conda create -p \ ~ctbrown/scratch3/2026-conda/$USER/salmon_test \ -y salmon pigz agat # softlink the conda enviroment into your home conda environment list ln -s ~ctbrown/scratch3/2026-conda/$USER/salmon_test ~/.conda/envs # activate the conda environment conda activate salmon_test conda deactivate ``` ### Snakefile ``` ############################################ # Assign some variables to make your life a little easier ############################################ RAW_DIR = "../raw" TRANS_DIR = "../transcriptome" TRIM_DIR = "../trim" LOG_DIR = "../log" OUT_DIR = "Snakemake_output" TRANSCRIPTOME = "GCF_011100685.1_UU_Cfam_GSD_1.0_rna.fna" INDEX_NAME = "DOG_GCF_011100685_index" ADAPTERS = f"{RAW_DIR}/all_illumina.fa" INDEX_THREADS = 8 PIGZ_THREADS = 8 QUANT_THREADS = 12 TRIM_THREADS = 8 ############################################ # Sampe detection... ############################################ SAMPLES, = glob_wildcards(f"{RAW_DIR}/{{sample}}_1.fastq.gz") ############################################ # ALL rule ############################################ rule all: input: directory(f"{TRANS_DIR}/{INDEX_NAME}"), # Quantifications expand(f"{OUT_DIR}/quants/{{sample}}/quant.sf", sample=SAMPLES), # Illumina Adapters... f"{RAW_DIR}/all_illumina.fa", # FastQC reports expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html", sample=SAMPLES), expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html", sample=SAMPLES), # Reference f"{TRANS_DIR}/{TRANSCRIPTOME}", # MultiQC f"{OUT_DIR}/multiqc/multiqc_report.html" ############################################ # download the dog reference from NCBI ############################################ rule download_reference: output: fasta = f"{TRANS_DIR}/{TRANSCRIPTOME}" params: url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/100/685/GCF_011100685.1_UU_Cfam_GSD_1.0" threads: PIGZ_THREADS conda: "ncbi-entrez" shell: """ mkdir -p {TRANS_DIR} wget -nc {params.url}/{TRANSCRIPTOME}.gz -P {TRANS_DIR} pigz -d -p {threads} -f {TRANS_DIR}/{TRANSCRIPTOME}.gz """ ############################################ # FASTQC - on the raw data ############################################ rule fastqc_raw: input: r1 = f"{RAW_DIR}/{{sample}}_1.fastq.gz", r2 = f"{RAW_DIR}/{{sample}}_2.fastq.gz" output: html1 = f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html", html2 = f"{OUT_DIR}/fastqc/raw/{{sample}}_2_fastqc.html" conda: "qc_test" shell: """ mkdir -p {OUT_DIR}/fastqc/raw fastqc {input.r1} {input.r2} --outdir {OUT_DIR}/fastqc/raw """ ############################################ # trimmomatic ############################################ rule create_illumina_adapters: output: adapters = f"{RAW_DIR}/all_illumina.fa" run: adapters = { "PrefixNX/1": "AGATGTGTATAAGAGACAG", "PrefixNX/2": "AGATGTGTATAAGAGACAG", "Trans1": "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG", "Trans1_rc": "CTGTCTCTTATACACATCTGACGCTGCCGACGA", "Trans2": "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG", "Trans2_rc": "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC", "PrefixPE/1": "TACACTCTTTCCCTACACGACGCTCTTCCGATCT", "PrefixPE/2": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT", "PCR_Primer1": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT", "PCR_Primer1_rc": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT", "PCR_Primer2": "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT", "PCR_Primer2_rc": "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG", "FlowCell1": "TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC", "FlowCell2": "TTTTTTTTTTCAAGCAGAAGACGGCATACGA", "TruSeq2_SE": "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG", "TruSeq2_PE_f": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT", "TruSeq2_PE_r": "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG", "TruSeq3_IndexedAdapter": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC", "TruSeq3_UniversalAdapter": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA" } with open(output.adapters, "w") as f: for header, seq in adapters.items(): f.write(f">{header}\n{seq}\n") rule trimmomatic: input: r1 = f"{RAW_DIR}/{{sample}}_1.fastq.gz", r2 = f"{RAW_DIR}/{{sample}}_2.fastq.gz", full_adapter_list = f"{RAW_DIR}/all_illumina.fa" output: r1_paired = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz", r1_unpaired = f"{TRIM_DIR}/{{sample}}_1_unp.fastq.gz", r2_paired = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz", r2_unpaired = f"{TRIM_DIR}/{{sample}}_2_unp.fastq.gz" threads: TRIM_THREADS conda: "salmon_test" shell: """ mkdir -p {TRIM_DIR} trimmomatic PE -threads {threads} -phred33 \ {input.r1} {input.r2} \ {output.r1_paired} {output.r1_unpaired} \ {output.r2_paired} {output.r2_unpaired} \ ILLUMINACLIP:{input.full_adapter_list}:2:30:10 \ LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 """ ############################################ # FASTQC - on the trimmed data ############################################ rule fastqc_trimmed: input: r1 = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz", r2 = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz" output: html1 = f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html", html2 = f"{OUT_DIR}/fastqc/trimmed/{{sample}}_2_paired_fastqc.html" conda: "qc_test" shell: """ mkdir -p {OUT_DIR}/fastqc/trimmed fastqc {input.r1} {input.r2} --outdir {OUT_DIR}/fastqc/trimmed """ ############################################ # SALMON Transcriptome Index... ############################################ rule salmon_index: input: transcriptome = f"{TRANS_DIR}/{TRANSCRIPTOME}" output: idx = directory(f"{TRANS_DIR}/{INDEX_NAME}") threads: INDEX_THREADS conda: "salmon_test" shell: "salmon index -t {input.transcriptome} -i {output.idx} -p {threads}" ############################################ # SALMON! ############################################ rule salmon_quant: input: index = f"{TRANS_DIR}/{INDEX_NAME}", r1 = f"{TRIM_DIR}/{{sample}}_1_paired.fastq.gz", r2 = f"{TRIM_DIR}/{{sample}}_2_paired.fastq.gz" output: quant = f"{OUT_DIR}/quants/{{sample}}/quant.sf", dir = directory(f"{OUT_DIR}/quants/{{sample}}") threads: QUANT_THREADS conda: "salmon_test" shell: """ salmon quant -i {input.index} -l A \ -1 {input.r1} \ -2 {input.r2} \ -p {threads} --validateMappings \ -o {output.dir} """ ############################################ # Multiqc - see how we did and look at some cool QC metrics ############################################ rule multiqc: input: expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_1_fastqc.html", sample=SAMPLES), expand(f"{OUT_DIR}/fastqc/raw/{{sample}}_2_fastqc.html", sample=SAMPLES), expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_1_paired_fastqc.html", sample=SAMPLES), expand(f"{OUT_DIR}/fastqc/trimmed/{{sample}}_2_paired_fastqc.html", sample=SAMPLES), output: f"{OUT_DIR}/multiqc/multiqc_report.html" conda: "qc_test" shell: """ multiqc {OUT_DIR} -o {OUT_DIR}/multiqc --force """ ``` ## Import Salmon Quant Files into R ### R SETUP - load/install some libraries ``` # 🎃️ SETUP ---- ## Remove any stored variables ---- rm(list=ls()) ## Install the Libraries ---- ### Ensure BiocManager is installed ---- if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } packages_to_use <- c("readr", "stringr", "openxlsx", "dplyr", "pheatmap", "tximport", "DESeq2", "GenomicFeatures", "ggplot2", "svglite", "AnnotationDbi", "matrixStats", "gprofiler2", "rbioapi", "txdbmaker") ### Install missing packages ---- missing_packages <- packages_to_use[!packages_to_use %in% rownames(installed.packages())] if (length(missing_packages) > 0) { BiocManager::install(missing_packages, ask = FALSE, update = FALSE) } ### Load the libraries ---- invisible(lapply(packages_to_use, library, character.only = TRUE)) ``` ### Fancy RSTUDIO setwd() ``` # if we're using Rstudio, set the working directory to wherever this file is saved if(.Platform$GUI == "RStudio") { if (!(TRUE %in% grepl(pattern = "rstudioapi", x = rownames(installed.packages())))) { install.packages('rstudioapi') } library(rstudioapi) setwd(dirname(rstudioapi::getSourceEditorContext()$path)) } else { setwd("~/201b-RNAseq/GGG201B/Week8_RNAseq/analysis") } getwd() ``` ### Variables ``` # Define some variables ---- # what do you want to consider "statistically relevant"? ## Fold Change? tmp_fc <- 1 ## False-discovery rate adjusted p-value? tmp_padj <- 0.05 highcolor <- "coral" # or #FF7F50 midcolor <- "lightgrey" # or #d3d3d3 lowcolor <- "skyblue" # or #87CEEB today_date <- format(Sys.Date(), "%y%m%d") ``` ### Load the metadata ``` # Import your metadata ---- meta_raw <- read.csv("../raw/canine_metadata.csv", header = FALSE) meta <- data.frame( sample = meta_raw$V1, title = meta_raw$V12, cute_title = stringr::str_extract(meta_raw$V12, "^[^_]+_[^_]+"), # we only want the "title" - but want to remove everything after the second "_" sex = meta_raw$V35) meta$species <- ifelse(grepl("^Wolf", meta$title), "Wolf", ifelse(grepl("^Dog", meta$title), "Dog", NA)) meta$species <- factor(meta$species) meta$sex <- factor(meta$sex) meta$batch <- ifelse(grepl("^ERR266", meta$sample), "Batch1", "Batch2") meta$batch <- factor(meta$batch) ## Explore ---- table(meta$species, meta$sex) table(meta$species, meta$batch) ``` ### Transcripts to genes ``` # Transcript to gene ---- ## 1. Make TxDb ---- txdb <- makeTxDbFromGFF("../genome/GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.gff") ## 2. Extract transcripts with gene mapping ---- tx_df <- as.data.frame(transcripts(txdb, columns=c("tx_name","gene_id"))) ## 3. Fill missing gene IDs with tx_id ---- tx_df$gene_id[is.na(tx_df$gene_id)] <- tx_df$tx_id[is.na(tx_df$gene_id)] ## 4. Ensure unique transcript names ---- tx_df$tx_name <- make.unique(tx_df$tx_name) ## 5. Create transcript to gene dataframe ---- tx2gene_df <- tx_df[, c("tx_name","gene_id")] colnames(tx2gene_df) <- c("TXNAME","GENEID") tx2gene_df$TXNAME <- as.character(tx2gene_df$TXNAME) tx2gene_df$GENEID <- as.character(tx2gene_df$GENEID) ``` ### Load the read quantification files ``` # Get the quantification files from Salmon samples_list <- read.csv("../raw/srr_list_rnaseq.txt", header = FALSE) # actually find the files quant_files <- file.path("Snakemake_output/quants",samples_list[,1], "quant.sf") # use our sample_list to name our files for later (aesthetic) names(quant_files) <- samples_list[,1] # view an example read.table(quant_files[1], header=T) # convert to dataframe tx2gene_df <- as.data.frame(tx2gene_df) # transcript import txi <- tximport( files = quant_files, type = "salmon", tx2gene = tx2gene_df, countsFromAbundance = "lengthScaledTPM" ) ``` ### Differential Gene Expression Analysis - DESeq2 ``` # 🐺+🐶 DESeq2 ---- # Here's the cool part - DESeq2 adjusts for unwanted variation (batch, sex) while testing your main biological question (species differences)! dds <- DESeqDataSetFromTximport( txi = txi, colData = meta, design = ~ batch + sex + species ) # fit DESeq's neg. binomial model & calculate the stats dds_fit <- DESeq(dds) # print the available comparisons resultsNames(dds_fit) # define the contrast that you want to focus on contrast_list <- c("species", "Dog", "Wolf") # extract the specific results that you want res_species_Dog_vs_Wolf <- results(dds_fit, contrast = contrast_list, alpha = tmp_padj) # preview your results! head(x = res_species_Dog_vs_Wolf, n = 10) # important - how would you get the other contrasts??? res_sex_male_vs_female <- results(dds_fit, contrast = c("sex", "male", "female"), alpha = tmp_padj) res_sex_female_vs_male <- results(dds_fit, contrast = c("sex", "female", "male"), alpha = tmp_padj) # define a contrast name variable comparison_name <- paste0(contrast_list[2], "_V_", contrast_list[3]) ``` #### Write to CSV ``` ## SAVE to TEXT ---- csv_filename <- paste0(today_date, "_", comparison_name, ".csv") write.csv(x = res_species_Dog_vs_Wolf, file = csv_filename) ``` #### Write to Excel ``` ## SAVE to EXCEL ---- res_species_Dog_vs_Wolf$gene <- rownames(res_species_Dog_vs_Wolf) res <- res_species_Dog_vs_Wolf %>% as.data.frame() %>% mutate( Group = case_when( log2FoldChange > tmp_fc & padj < tmp_padj ~ "Up-regulated", log2FoldChange < -tmp_fc & padj < tmp_padj ~ "Down-regulated", TRUE ~ "Not-significant" ) ) # up in dog compared to wolf up_df <- res %>% filter(Group == "Up-regulated") # down in dog compared to wolf down_df <- res %>% filter(Group == "Down-regulated") # establish our workbook wb <- openxlsx::createWorkbook() # add a sheet - "all genes" addWorksheet(wb, "All_Genes") writeData(wb, "All_Genes", res) # add a sheet - "UP" addWorksheet(wb, "Upregulated") writeData(wb, "Upregulated", up_df) # add a sheet - "DOWN" addWorksheet(wb, "Downregulated") writeData(wb, "Downregulated", down_df) # add a sheet - "Just rawcounts" addWorksheet(wb, "RawCounts") writeData(wb = wb, sheet = "RawCounts", x =data.frame(gene = rownames(txi$counts), txi$counts)) # add a sheet - "Just TPM values" addWorksheet(wb, "TPM") writeData(wb, "TPM", data.frame(gene = rownames(txi$abundance), txi$abundance)) xlsx_filename <- paste0(today_date, "_", comparison_name, ".xlsx") saveWorkbook(wb = wb, file = xlsx_filename, overwrite = TRUE) ``` ### PCA plot ``` # PCA ---- vsd <- DESeq2::vst(dds) # Quickly estimate dispersion trend and apply a variance stabilizing transformation pca_species <- plotPCA(vsd, intgroup = "species") pca_species pca_cute <- plotPCA(vsd, intgroup = "cute_title") pca_cute # save your PCA - *.svg vector images! svg_filename <- paste0(today_date, "_pca.svg") ggsave(filename = svg_filename, plot = pca_species, width = 6, height = 6) ``` ### Volcano Plot ``` # Volcano ---- # Define the results object for brevity res <- res_species_Dog_vs_Wolf # Create a color vector my_colors <- ifelse( res$padj < 0.05 & res$log2FoldChange > 1, highcolor, ifelse(res$padj < 0.05 & res$log2FoldChange < -1, lowcolor, midcolor) ) # 1. Open the file device png(filename = "Dog_vs_Wolf_Volcano.png", width = 1600, height = 1600, units = "px", res = 300) #svg(filename = "Dog_vs_Wolf_Volcano.svg", width = 6, height = 6) ## Pixels (Raster) vs. Inches (Vector) # 2. Plot plot(res$log2FoldChange, -log10(res$padj), pch = 20, main = "Dog vs. Wolf Volcano Plot", xlab = "Log2 Fold Change", ylab = "-Log10 Adjusted P-value", col = my_colors) # 3. Add the lines abline(h = -log10(0.05), col = "black", lty = 2) abline(v = c(-1, 1), col = "black", lty = 2) # 4. Close the file device dev.off() list.files() ``` ### Heatmap ``` # Heatmap ---- # 1. Get the top 20 most variable genes (with gene names) vsd_filtered <- vsd[!grepl("^LOC", rownames(vsd)), ] top_genes <- head(order(rowVars(assay(vsd_filtered)), decreasing = TRUE), 20) mat <- assay(vsd_filtered)[top_genes, ] # 2. Subtract the row mean so we see relative changes mat <- mat - rowMeans(mat) # 3. Plot & save it! png(filename = "Dog_vs_Wolf_Heatmap.png", width = 1600, height = 1600, units = "px", res = 300) pheatmap(mat, annotation_col = as.data.frame(colData(vsd)["species"]), main = "Top 20 Variable Genes: Dog vs. Wolf") dev.off() ``` ## So you have a bunch of genes that are statistically exciting - but what does it mean biologically? ### STRINGdb - Functional Protein Association Networks * Protein-protein interactions * Enrichment based on interaction networks <div style="text-align:center;"><a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1"><img src="https://hackmd.io/_uploads/ryEjy4IO-e.gif", alt="Salmon!" width="600"></a></div> <br> * *How did I find the Taxonomy ID for "canis familiaris?"* * [link](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1) ``` # STRINGdb Functional Protein Association Networks! - Protein-Protein interactions ---- # https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1 proteins_mapped_df_up <- rbioapi::rba_string_map_ids(ids = rownames(up_df), species = 9615) rbioapi::rba_string_enrichment(ids = proteins_mapped_df_up$stringId, species = 9615, split_df = TRUE) proteins_mapped_df_down <- rbioapi::rba_string_map_ids(ids = rownames(down_df), species = 9615) rbioapi::rba_string_enrichment(ids = proteins_mapped_df_down$stringId, species = 9615, split_df = TRUE) ``` ### g:Profiler - gene set enrichment analysis [link](https://biit.cs.ut.ee/gprofiler/page/organism-list) ``` # Gprofiler - gene set enrichment analysis / functional enrichment analysis / over-representation analysis (ORA)! ---- # https://biit.cs.ut.ee/gprofiler/page/organism-list gprofiler_output_up <- gprofiler2::gost( query = rownames(up_df), organism = "clfamiliaris", correction_method = "fdr", significant = TRUE ) # view the output View(gprofiler_output_up$result) gprofiler_plot_up <- gprofiler2::gostplot(gprofiler_output_up, capped = FALSE, interactive = TRUE) gprofiler_plot_up gprofiler_output_down <- gprofiler2::gost( query = rownames(down_df), organism = "clfamiliaris", correction_method = "fdr", significant = TRUE ) View(gprofiler_output_down$result) gprofiler_plot_down <- gprofiler2::gostplot(gprofiler_output_down, capped = FALSE, interactive = FALSE) gprofiler_plot_down # manual searches? ---- UP_named_genes <- up_df[which( grepl(pattern = "^LOC", x = rownames(up_df)) == FALSE ),] DOWN_named_genes <- down_df[which( grepl(pattern = "^LOC", x = rownames(down_df)) == FALSE ),] # if you want to manually look these up cat(rownames(UP_named_genes), sep = "\n") cat(rownames(DOWN_named_genes), sep = "\n") ``` ### output for manual searches? ``` # manual searches? ---- UP_named_genes <- up_df[which( grepl(pattern = "^LOC", x = rownames(up_df)) == FALSE ),] DOWN_named_genes <- down_df[which( grepl(pattern = "^LOC", x = rownames(down_df)) == FALSE ),] # if you want to manually look these up cat(rownames(UP_named_genes), sep = "\n") cat(rownames(DOWN_named_genes), sep = "\n") ``` ## Learning resources <div style="text-align: center;"><a href="https://ucdavisdatalab.github.io/workshop_index/#r"><img src="https://hackmd.io/_uploads/HJNkuBU_-l.png", alt="UCD_DataLab!" width="300"></a> <br><hr> <a href="https://cran.r-project.org/doc/manuals/r-release/R-intro.html"> <img src="https://hackmd.io/_uploads/HyfzdQL_bg.png", alt="R-INTRO!" width="150"> </a> <a href="https://learning.oreilly.com/library/view/r-for-data/9781492097396/"> <img src="https://hackmd.io/_uploads/ryfCFBIube.jpg", alt="R-OREILLY!" width="150"> </a> <a href="https://docs.posit.co/ide/user/ide/get-started/"> <img src="https://hackmd.io/_uploads/SkUtdmUdZg.png", alt="Learn More RSTUDIO!" width="100"> </a> </div><hr> ## R / Rstudio Pro Tips **I always start R scripts with this - just removes all stored variables so you start fresh** ``` rm(list = ls()) ``` **Code Snippets** ``` https://docs.posit.co/ide/user/ide/guide/productivity/snippets.html ``` **Code Tools** - reformat selection **Code Organization** ``` # SECTION ---- # followed by 4x "#" or "-" ##### #---- # use this space to name your sections #### # section ---- ## sub-section ---- ### sub-sub-section ---- #### etc. ---- ``` # The End! **please email me if you would like additional tips or have questions!** jknorris@ucdavis.edu

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.