---
title: 'Patrick Murphy Bulk RNA-Seq'
disqus: hackmd
---
Patrick Murphy bulk RNA-Seq Analysis
===
## Table of Contents
[TOC]
## 1. Introduction
This is a bulk RNA-Seq project, which includes human data. We received the data in two batches.
1. First batch included 35 samples
2. Second batch included 5 samples.
In the first batch, 3 samples were extra and not part of the analysis. In total, only 32 samples were included in the first batch.
In the second batch, all 5 samples were included.
In total, we have 37 samples for this analysis.
Though we received the samples in 2 batches, in reality the samples were sequenced in 7 batches, giving raise to the possibility of batch effects.
Metadata sheet was provided labeling sample groups for differential expression analysis. Some samples were not given a group label (NA's) and such samples can be removed from performing DE analysis.
The table below shows the 37 samples used in the analysis and the details of color highlights are:
1. Samples 18-240 and 19-59 were orginally labelled as BAD_DATA by Pat's group.
2. Samples 18-141, 18-218 and 19-224 are poorly mapped by STAR aligner and has high rRNA contamination (discussed in detail in QC and Mapping)
3. Sample 19-255 does not have status information for both genes SF3B1 and BAP1.
In total, the remaining 31 samples were considered for differential expression analysis.

> Paper style methods [TODO]
>
## 2. Data
The first batch data was received on HTC in the folder: `/bgfs/uchandran/shared/uchandran_psm25`
The second batch was downloaded via DNAnexus links provided in excel sheet.
The analysis folder on HTC can be found in: `/bgfs/uchandran/Pat_Murphy`
The raw data can be found in `/bgfs/uchandran/Pat_Murphy/Fastq/Raw`
Since, we planned to implement GDC pipeline for RNA-Seq analysis in this project, all the reference files were downloaded from GDC [website](https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files). They were downloaded to: `/bgfs/genomics/refs/refs/GDC_Refs/RNA-Seq`. Since, transcript fasta file was not provided, it was downloaded from [gencode](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.transcripts.fa.gz) website for Salmon mapping.
## 3. QC on Reads
### 3.1. FastQC and Adapter trimming
FastQC tool was run on all 37 input samples to check for basic QC and adapter content.
Most of the samples had reads in the range of 30 million to 70 million per strand. Samples labeled BAD_DATA (18-240 and 19-59) had very few reads ~200k and these samples were removed from further analysis.
We also noticed unusual GC% in most samples and adapter contamination. We suspected this unusual GC is because of possible rRNA and adapter contamination.
Using cutadapt, we removed Illumina universal adapter sequences in all samples and re-ran FastQC. We noticed no adapter contamination, but still the unusual GC% persisted.
GC% after trimming in FastQC can be seen below:

### 3.2. FastQ Screen
We ran a tool called to check for rRNA and any other contamination in the sequences.
We noticed 3 samples (18-141, 18-218 and 19-224) had high rRNA contamination compared to the rest.

We proceeded with mapping in all 35 samples, including the the above mentioned 3 rRNA contaminated samples.
### 3.3. SortMeRNA
SortMeRNA version 4.2.0 ([Github repo](https://github.com/biocore/sortmerna)) was used to determine the % of rRNA in each sample (n=35). We used the [8 rRNA databases](https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases) that SortMeRNA supplies (master branch, commit 18e59e1). The following command line was used:
```
sortmerna -ref refs/rfam-5.8s-database-id98.fasta \
-ref refs/rfam-5s-database-id98.fasta \
-ref refs/silva-arc-16s-id95.fasta \
-ref refs/silva-arc-23s-id98.fasta \
-ref refs/silva-bac-16s-id90.fasta \
-ref refs/silva-bac-23s-id98.fasta \
-ref refs/silva-euk-18s-id95.fasta \
-ref refs/silva-euk-28s-id98.fasta \
-reads reads/SAMPLENAME_R1.cutadapt.fastq.gz \
-reads reads/SAMPLENAME_R2.cutadapt.fastq.gz \
-num_alignments 1 \
-a 8 \
-workdir SAMPLENAME
```
where SAMPLENAME is replaced with the name of the sample. The output was collated into a table called `sortmerna_results.tsv` (one row per sample).
* Archive samples [run1] -

* 02_05_21 40 samples [run2] - same version and parameters as Run1. Only 1 sample with high %RNA at 26% (TPF-20-108)
* 02_17_21 12 samples [run3] - same version and parameters as Run1. All had %RNA < 4.1%.
* 02_26_21 25 samples [run4] - same version and parameters as Run1. There is a high concordance between Archive batch with 02_26_21 (run4) batch since they are from the same sequencing run.

See [Paul HackMD](https://hackmd.io/OpBuG5BETHm9HPTdwYOcZw?both) for info about future runs
## 4. STAR mapping
GDC pipeline workflow and parameters were followed for STAR mapping and the references provided by them were used. Here is the [link](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) to the workflow.
In total STAR aligner was run on 35 samples (excluding BAD_DATA)

As can be seen, the 3 rRNA contaminated samples had poor mapping %. These samples were not considered for further analysis because of their poor mapping rate.
## 5. HT-Seq
Counts per gene were generated using HT-seq counts on the remaining 32 samples using gencode gtf as provided by GDC.
Each row in the count file represents a gene, it's a ensembl gene id.
## 6. Differential Expression
### 6.1. edgeR analysis (gene level)
The counts genereated from HT-Seq were used for DE analysis by edgeR.
Moving forward, sample 19-255 was removed from DE analysis, as it does not have any status for SF3B1 gene and BAP1 gene and proceed with 31 samples.
The sample split up for the two conditions is as follows
| Status | wt | mut | Total |
| -------- | -------- | -------- |-------- |
| SF3B1 | 17 | 14 | 31|
| BAP1 | 22 | 8 | 30|
We have fewer samples for BAP1 status because, sample 18-136 has SF3B1 status, but no BAP1 status.
MDS plots for were different conditions can be seen below:
**MDS plot for by Batches**

**MDS plot by SF3B1 status**

**MDS plot by BAP1 status**

Results for edgeR DE analysis is shared in Box folder.
###### tags: `PatrickMurphy` `RNA-Seq`