# RNA-Seq analysis
Prepared by: Eric Juo
Date: 2022/04/15
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@atomic-notes/Sk-TQULN9
---
## What is RNA-Seq?
- RNA sequencing
- especially mRNA
- compare gene expression levels
---
Genome-guided or de novo

credit: www.nature.com
---
## de novo assembly
- de novo (latin) means 'from the begining'
- without genome guided
- commonly used for non-model species
---
Flowchart of a typical RNA-seq in de novo assembly route

credit: Raghavan et al., 2022
---
## RNA library preparation
- Total RNA extraction
- mRNA enrichment
- Fragmentation
- First-strand cDNA synthesis
- RNA digestion
- Second-strand cDNA synthesis (stranded)
- 3'end repair
- 3'end adenylation
- adapter ligation
---
## Total RNA gel

credit: Bawazeer et al., 2017
---
## mRNA enrichment
- 90% RNAs are rRNA
- 1-2% RNAs are mRNA
- mRNA contains poly-A tails
---
### Poly-T magnetic beads

credit: www.lexogen.com
---
## Fragmentation

credit: www.international.neb.com
---
## Stranded RNA library protocol

credit: zhao et al., 2015
---
## Sequencing
- flow cells
https://www.youtube.com/watch?v=pfZp5Vgsbw0
---
### Sequence extend one base at a time

credit: www.intechopen.com
---
## FastQ format
- read name
- sequence
- "+"
- quality
```
@SRR799770.1 1 length=73
CGACAAACAGTTATTGTGCTGACGTTGCTTCCAGAGGCNNTTTGGAACTTGGTAGTGAGTGTTACAACCAAGG
+
IIIIIIHGIIIIIIHIIIIIGIIIIIIIIIIIHIH>@>##579557CCCE5=7<=ACAC?GEECEHGGHG@EB
```
---
### Phred33 score
- ASCII characters: https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html
- Phred quality score = ord(ascii) - 33
- Phred score 40 = 1 in 10000 chances called incorrectly
----
### Hands on practice
Create a project directory
```
$ mkdir rna_seq_practice
$ cd rna_seq_practice
```
Create an virtual conda environment for the project
```
$ conda create -n rna_seq
```
Activate rna_seq environment
```
$ conda activate rna_seq
```
List conda environments
```
$ conda env list
```
Remove rna_seq environment
```
$ conda deactivate
$ conda env remove -n rna_seq
```
Install mamba (an superior version of conda)
```
$ conda install -c conda-forge mamba
```
----
### Download example data from GEO
Install sra-tools (an application to download from GEO)
```
$ mamba install -c bioconda sra-tools=2.9.6
```
Create directory for raw data
```
$ mkdir -p data/01_raw/raw_reads/
```
Download brown trout raw reads (~260Mb)
```
$ prefetch -O data/01_raw/raw_reads/ SRR799769
```
Convert SRA format to fastq format
```
$ fasterq-dump -O data/02_intermediate/fastq \
data/01_raw/raw_reads/SRR799769.sra
```
----
### Manipulate fastq files
Peek first 10 lines of fastq file
```
$ head data/02_intermediate/fastq/SRR799769.sra_1.fastq
```
----
### Bioawk
Install bioawk
```
$ mamba install -c bioconda bioawk
```
Caculate read numbers in a file
```
$ bioawk -c fastx 'END{print NR}' data/02_intermediate/fastq/SRR799769.sra_1.fastq
```
---
## Initial quality report using FastQC
Quality overview for short reads
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
---
### Quality metrics
- per base quality
- per sequence quality
- per base sequence content
- per sequence GC content
- Duplicated sequences
- Overrepresented sequences
- Adatper contents
https://htmlpreview.github.io/?https://github.com/ericjuo/salmo_trutta_rna_seq/blob/master/data/02_intermediate/inital_fastqc/SRR799769_1_fastqc.html#M1
----
## Run FastQC
Install FastQC
```
$ mamba install -c bioconda fastqc
```
Create directory for initial QC report
```
$ mkdir -p report/initialqc
```
Run fastqc with multithreading mode
```
$ fastqc -t 2 -o report/initialqc data/02_intermediate/fastq/SRR799769.sra_1.fastq data/02_intermediate/fastq/SRR799769.sra_2.fastq
```
Check CPU cores
```
$ lscpu
```
---
## Quality trimming using Trimmomatic
Trimmer for illumina sequence data
http://www.usadellab.org/cms/?page=trimmomatic
---
### Run Trimmomatic
Install Trimmomatic
```
$ mamba install -c bioconda trimmomatic
```
Download adapter sequences
```
$ mkdir -p data/01_raw/adapters
```
https://github.com/usadellab/Trimmomatic/tree/main/adapters
```
$ mkdir -p data/02_intermediate/trimmomatic/
```
Usage(Pair end):
```
$ trimmomatic PE -threads 3 -phred33 data/02_intermediate/fastq/SRR799769.sra_1.fastq data/02_intermediate/fastq/SRR799769.sra_1.fastq data/02_intermediate/trimmomatic/SRR799769_paired_1.fastq data/02_intermediate/trimmomatic/SRR799769_unpaired_1.fastq data/02_intermediate/trimmomatic/SRR799769_paired_2.fastq data/02_intermediate/trimmomatic/SRR799769_unpaired_2.fastq ILLUMINACLIP:data/01_raw/adapters/TruSeq2-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:35
```
---
### Trimmomatic parameters
- ILLUMINACLIP:[adapter sequence file]:[mismatches]:[palindrome score]:[simple threshold]
- SLIDINGWINDOW:[window size]:[quality threshold]
- MINLEN:[minimum read size after trimmed]
---
## Foreign species contaminant detection using Fastq Screen
Species classifier
- bowtie, bowtie2 (default), bwa
- Burrows-Wheeler Transform
- customize database
- Do not support PE mode
https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
---

---
## Run Fastq screen
Downgrade Python version (The latest fastq screen is not compatible with Python 3.10)
```
$ conda remove mamba
$ conda install python=3.12.9
$ conda install -c conda-forge mamba
```
Install Fastq Screen
```
$ mamba install -c bioconda fastq-screen
```
Usage:
```
$ fastq_screen -threads 10 -conf contaminants/FastQ_Screen_Genomes/fastq_screen.conf --aligner 'bowtie2' --outdir report/post_fastq_screen/ data/02_intermediate/filter_foreign_contaminants/SRR799769_unclassified_1.fastq
```
----
### Fastq screen database
- http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/
- Adapters, Arabidopsis, Drosophila, E coli, Human,
Lambda, Mitochondira, Mouse, PhiX, Rat, Vectors, Worm,
Yeast, rRNA
- 12 GB
Download database
```
$ fastq_screen --get_genomes --outidr contaminants/
```
```
mkdir -p data/01_raw/fastq_screen/Drosophila/
cd data/01_raw/fastq_screen/Drosophila/
wget http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Drosophila/*
wget -O data/01_raw/fastq_screen/fastq_screen.conf http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/fastq_screen.conf
```
---
## Remove contaminants with Kraken2
Another species classifer
- K mer method
- Ignore sequences in common across species
- Assign classification based on weighted taxonomy
- https://ccb.jhu.edu/software/kraken2/
---
## Kraken2 output
```
99.17 2528249 2528249 U 0 unclassified
0.83 21108 37 R 1 root
0.82 21008 2193 R1 131567 cellular organisms
0.68 17351 0 D 2759 Eukaryota
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0.68 17351 0 O4 314295 Hominoidea
0.68 17351 0 F 9604 Hominidae
0.68 17351 0 F1 207598 Homininae
0.68 17351 0 G 9605 Homo
0.68 17351 17351 S 9606 Homo sapiens
```
----
### Run Kraken2
Install Kraken2
```
$ mamba -c bioconda kraken2
```
Usage:
```
$
--->
{"metaMigratedAt":"2023-06-16T23:13:12.066Z","metaMigratedFrom":"YAML","title":"RNA-Seq overivew slide","breaks":true,"description":"lectures and hands on tutorial","contributors":"[{\"id\":\"2516dd70-7c3d-4c57-90a7-5b25e0df1c8a\",\"add\":11536,\"del\":3730}]"}