SETAC-NA RNAseq installation
===
###### tags: `workshop`
# Backup files
If you weren't able to create this input file from the previous pipeline, you can copy ones that we created here:
```
/opt/rnaseq/
```
* Input raw fastq data
```
curl -L https://osf.io/365fg/download -o nema_subset_0Hour.zip
curl -L https://osf.io/9tf2g/download -o nema_subset_6Hour.zip
unzip nema_subset_0Hour.zip
unzip nema_subset_6Hour.zip
```
* Trimmed reads
```
/opt/rnaseq/trim/*.qc.fq.gz
```
* Transcriptome assembly
```
/opt/rnaseq/assembly/nema_trinity/Trinity.fasta
```
* Annotation table (contig ID to gene name), created in the annotation section below
```
/opt/rnaseq/annotation/Trinity.fasta.dammit/nema_gene_name_id.csv
```
* Salmon quantification files
```
/opt/rnaseq/quant/*.quant
```
# Image creation
Launch instance: "Ubuntu 18.04 Devel and Docker" base image (Oct 1, 2018 by jfischer), m1.large (CPU: 10, Mem: 30 GB, Disk: 60 GB)
Allocation Source: TG-MCG180142
https://angus.readthedocs.io/en/2018/jetstream-bioconda-config.html
Install in `/opt`, per [Jetstream installation instructions](https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/17465521/Imaging+Guidelines):
```
sudo chmod a+w /opt
mkdir /opt/rnaseq
cd /opt/rnaseq
```
```
curl -O -L https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh #install in /opt/miniconda3
echo export PATH=$PATH:/opt/miniconda3/bin >> ~/.bashrc
source ~/.bashrc
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
```
Install R Studio:
```
sudo apt-get update && sudo apt-get -y install gdebi-core r-base
```
Then:
```
wget https://download2.rstudio.org/rstudio-server-1.1.453-amd64.deb
sudo gdebi -n rstudio-server-1.1.453-amd64.deb
```
Install the command-line tools (Trimmomatic and friends):
```
source ~/.bashrc
conda install -y fastqc multiqc trimmomatic trinity time osfclient salmon khmer
sudo apt-get install tree sl
```
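A quick sanity check, as a sketch, that the conda and apt installs actually put the core tools on the `PATH` (the `check_tools` helper is ours, not part of the workshop material):

```shell
# Sanity check: confirm the core workshop tools landed on the PATH.
check_tools() {
    local tool
    for tool in fastqc multiqc trimmomatic Trinity salmon; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: ok"
        else
            echo "$tool: MISSING" >&2
        fi
    done
}
check_tools
```

Anything reported `MISSING` should be reinstalled before moving on.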
Make a backup data products dir:
```
mkdir /opt/rnaseq
cd /opt/rnaseq
mkdir data
cd data
```
# QC
https://angus.readthedocs.io/en/2018/quality-and-trimming.html
https://github.com/WhiteheadLab/2018-setacna-rnaseq/blob/master/quality-trimming.md
https://rnaseq-workshop-2017.readthedocs.io/en/latest/quality-trimming.html
Download data:
```
curl -L https://osf.io/365fg/download -o nema_subset_0Hour.zip
curl -L https://osf.io/9tf2g/download -o nema_subset_6Hour.zip
unzip nema_subset_0Hour.zip
unzip nema_subset_6Hour.zip
```
(Except the "subset" was WAY too large:)
```
-bash-4.4$ zgrep -c "^@" 0Hour_ATCACG_L002_R1_001.fastq.gz
4836256
```
So, made a new subset (400K was still too many; tried 10K lines), then uploaded to [osf](https://osf.io/72bu3/):
```
for file in *R1*; do cat $file | head -10000 > ${file}_subset; done
for file in *R2*; do cat $file | head -10000 > ${file}_subset; done
rm -rf *.fastq
for filename in *_subset; do base=$(basename $filename _subset); mv $filename $base; done
export OSF_PASSWORD=<osf-password>
export OSF_USERNAME=<osf-username>
zip nema_subset_small_0Hour.zip 0Hour_*
zip nema_subset_small_6Hour.zip 6Hour_*
osf -p 72bu3 upload nema_subset_small_0Hour.zip subset/nema_subset_small_0Hour.zip
osf -p 72bu3 upload nema_subset_small_6Hour.zip subset/nema_subset_small_6Hour.zip
```
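Note that each FASTQ record spans exactly four lines (header, sequence, `+`, qualities), so `head -10000` keeps 2,500 reads, not 10,000; also, `zgrep -c "^@"` (used above) can overcount, because quality strings may begin with `@`. A quick sketch for counting reads correctly (the `count_reads` helper is ours, not part of the pipeline):

```shell
# Each FASTQ record is exactly 4 lines (header, sequence, '+', qualities),
# so the read count is the line count divided by 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example (illustrative filename):
# head -10000 0Hour_ATCACG_L002_R1_001.fastq > subset.fastq
# count_reads subset.fastq
```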
Rename, stripping the `.gz` extension:
```
for filename in *gz; do base=$(basename $filename .gz); mv $filename $base; done
```
Quality trimming:
```
cd ..
mkdir quality
cd quality
ln -s ../data/*.fastq .
fastqc *.fastq
multiqc .
```
Setup trim directory:
```
cd ..
mkdir trim
cd trim
ln -s ../data/*.fastq .  # note: this should be *.fastq.gz
cat /opt/miniconda3/share/trimmomatic*/adapters/* > combined.fa
```
Run Trimmomatic:
```
for filename in *_R1_*.fastq
do
# first, make the base by removing .fastq
base=$(basename $filename .fastq)
echo $base
# now, construct the R2 filename by replacing R1 with R2
baseR2=${base/_R1_/_R2_}
echo $baseR2
# finally, run Trimmomatic
trimmomatic PE ${base}.fastq ${baseR2}.fastq \
${base}.qc.fq s1_se \
${baseR2}.qc.fq s2_se \
ILLUMINACLIP:combined.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
# gzip output
gzip -9c ${base}.qc.fq > ${base}.qc.fq.gz
gzip -9c ${baseR2}.qc.fq > ${baseR2}.qc.fq.gz
# save the orphans
gzip -9c s1_se s2_se >> orphans.qc.fq.gz
rm -f s1_se s2_se
done
rm -rf *.qc.fq
```
### Running alternative trimming
* Want to produce only gzipped files, to save space and to match the true identity of the files (`.gz`)
```
for filename in *_R1_*.fastq.gz
do
# first, make the base by removing fastq.gz
base=$(basename $filename .fastq.gz)
echo $base
# now, construct the R2 filename by replacing R1 with R2
baseR2=${base/_R1_/_R2_}
echo $baseR2
# finally, run Trimmomatic
trimmomatic PE ${base}.fastq.gz ${baseR2}.fastq.gz \
${base}.qc.fq.gz s1_se.gz \
${baseR2}.qc.fq.gz s2_se.gz \
ILLUMINACLIP:combined.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
# save the orphans
zcat s1_se.gz s2_se.gz >> orphans.qc.fq.gz
rm -f s1_se.gz s2_se.gz
done
```
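A rough sketch for checking how many reads survived trimming (the `trim_survival` helper and the example paths are ours, not part of the pipeline):

```shell
# Rough check of how many reads survived trimming. Both file paths in the
# commented example are illustrative; substitute a pair from your own run.
trim_survival() {
    local raw=$1 trimmed=$2
    echo "$(( $(zcat "$trimmed" | wc -l) / 4 )) of $(( $(zcat "$raw" | wc -l) / 4 )) reads kept"
}

# trim_survival ../data/0Hour_ATCACG_L002_R1_001.fastq.gz 0Hour_ATCACG_L002_R1_001.qc.fq.gz
```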
Now, run fastqc again on trimmed files:
```
fastqc *.qc.fq.gz
multiqc .
```
# Trinity assembly
https://setac-omics.readthedocs.io/en/latest/transcriptome-assembly.html
## Install the TRINITY assembler
The [Trinity assembler](https://www.ncbi.nlm.nih.gov/pubmed/21572440) can be installed through `conda`:
```
conda install -y trinity
```
Make assembly directory:
```
cd ..
mkdir assembly
cd assembly
```
Softlink trimmed files:
```
ln -s ../trim/*.qc.fq.gz .
```
Combine all fq into 2 files (left.fq and right.fq)
```
zcat *R1*.qc.fq.gz > left.fq
zcat *R2*.qc.fq.gz > right.fq
zcat orphans.qc.fq.gz >> left.fq
```
Run Trinity:
```
time Trinity --seqType fq --max_memory 30G --CPU 10 --left left.fq --right right.fq --output nema_trinity
```
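Once Trinity finishes, a quick sketch for gauging assembly size (the `count_contigs` helper is ours):

```shell
# Each contig in the assembly is one FASTA record, so counting
# '>' header lines gives the number of assembled transcripts.
count_contigs() { grep -c '^>' "$1"; }

# count_contigs nema_trinity/Trinity.fasta
```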
# Quant
https://setac-omics.readthedocs.io/en/latest/rnaseq-quant.html
```
cd ..
mkdir quant
cd quant
```
Soft link assembly:
```
ln -s ../assembly/nema_trinity/Trinity.fasta .
```
Index the assembly:
```
salmon index --index nema --type quasi --transcripts Trinity.fasta
```
Link in trimmed reads:
```
ln -s ../trim/0Hour*.qc.fq.gz .
ln -s ../trim/6Hour*.qc.fq.gz .
```
Run salmon
```
for R1 in *R1*.qc.fq.gz
do
sample=$(basename $R1 extract.qc.fq.gz)
echo sample is $sample, R1 is $R1
R2=${R1/R1/R2}
echo R2 is $R2
salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}.quant
done
```
Take a look at quant output, [examples](https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst)
```
head 0Hour_ATCACG_L002_R1_001.qc.fq.gz.quant/quant.sf
```
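`quant.sf` is tab-separated with a header row of `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`. A sketch for listing the most highly expressed transcripts (the `top_transcripts` helper is ours, not part of salmon):

```shell
# quant.sf is tab-separated with a header row:
#   Name  Length  EffectiveLength  TPM  NumReads
# Skip the header and sort by TPM (column 4, general-numeric, descending)
# to list the most highly expressed transcripts.
top_transcripts() {
    tail -n +2 "$1" | sort -t $'\t' -k4,4gr | head -n "${2:-5}"
}

# top_transcripts 0Hour_ATCACG_L002_R1_001.qc.fq.gz.quant/quant.sf 10
```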
All the mapping rates:
```
find . -name salmon_quant.log -exec grep -H "Mapping rate" {} \;
```
# Annotation (not run during the SETAC workshop; this just generates the annotation files required for DE analysis in the next lesson)
https://angus.readthedocs.io/en/2018/dammit_annotation.html
* [dammit](https://github.com/camillescott/dammit)
* [databases](http://www.camillescott.org/dammit/databases.html)
Takes ~1 hr to install dammit + databases:
```
conda create -y --name py3.dammit python=3
source activate py3.dammit
conda install -y dammit
dammit databases --install --database-dir /opt/.dammit/databases --busco-group eukaryota
cd ..
mkdir annotation
cd annotation
ln -s ../assembly/nema_trinity/Trinity.fasta .
```
Grab *Nematostella* protein database from uniprot:
```
curl -LO ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001593_45351.fasta.gz
gunzip -c UP000001593_45351.fasta.gz > nema.reference.prot.faa
rm UP000001593_45351.fasta.gz
```
Annotate (~20 min):
```
export DAMMIT_DB_DIR=/opt/.dammit/databases
dammit annotate Trinity.fasta --busco-group eukaryota --user-databases nema.reference.prot.faa --n_threads 14
```
Convert contig ID to gene names (in Python):
Change directory to dammit output:
```
cd Trinity.fasta.dammit
```
Then, start python to make a table of contig id and gene name (required for tximport for DE analysis):
```
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "Trinity.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
# selects one annotation per contig by sorting by e-value (less than 1e-05) and picking the top
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
# removes contigs with no name
new_file = names.dropna(axis=0,how='all')
# converts dammit ID to contig ID
conversions_dammit = pd.read_csv("Trinity.fasta.dammit.namemap.csv")
conversions_dammit['Name'], conversions_dammit['info'] = conversions_dammit['original'].str.split(' ', 1).str
conversions_dammit = conversions_dammit[['Name','renamed']]
conversions_dammit.columns = ['Name','seqid']
new_file = pd.merge(new_file,conversions_dammit,on="seqid")
new_file.columns = ['seqid','Name','contig']
new_file.to_csv("nema_gene_name_id.csv")
```
# Differential Expression
https://setac-omics.readthedocs.io/en/latest/DE.html
(if in the annotation conda environment)
```
source deactivate
```
Change password on RStudio server:
```
sudo passwd username
```
Print the RStudio server URL to check that it is running:
```
echo My RStudio Web server is running at: http://$(hostname):8787/
```
Install R dependencies:
```
sudo apt-get install -y libxml2 libxml2-dev libcurl4-gnutls-dev libssl-dev
```
Script to install differential expression software
```
curl -O -L https://github.com/ngs-docs/angus/raw/2017/_static/install-deseq2.R
sudo Rscript --no-save install-deseq2.R
```
RStudio only recognizes files in home `~/`. So, soft link files there:
```
mkdir ~/DE
cd ~/DE
mkdir quant
cd quant
ln -s /opt/rnaseq/quant/*.quant .
```
Download a custom PCA script that labels samples, and copy the annotation table into the home directory:
```
cd
wget https://raw.githubusercontent.com/ngs-docs/2017-dibsi-rnaseq/master/plotPCAWithSampleNames.R
cp /opt/rnaseq/annotation/Trinity.fasta.dammit/nema_gene_name_id.csv .
```
# RStudio
```
library(DESeq2)
library("lattice")
library(tximport)
library(readr)
library(gplots)
library(RColorBrewer)
source('~/plotPCAWithSampleNames.R')
setwd("~/DE/quant/")
dir<-'~/DE'
files_list = list.files()
files <- file.path(dir, "quant",files_list, "quant.sf")
names(files) <- c("0Hour_1","0Hour_2","0Hour_3","0Hour_4","0Hour_5","6Hour_1","6Hour_2","6Hour_3","6Hour_4","6Hour_5")
files
print(file.exists(files))
```
DE analysis:
```
tx2gene <- read.csv("~/nema_gene_name_id.csv")
tx2gene <- tx2gene[,c(4,3)]
cols<-c("transcript_id","gene_id")
colnames(tx2gene)<-cols
head(tx2gene)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene,importer=read.delim)
head(txi.salmon$counts)
dim(txi.salmon$counts)
condition = factor(c("0Hour","0Hour","0Hour","0Hour","0Hour","6Hour","6Hour","6Hour","6Hour","6Hour"))
ExpDesign <- data.frame(row.names=colnames(txi.salmon$counts), condition = condition)
ExpDesign
dds <- DESeqDataSetFromTximport(txi.salmon, ExpDesign, ~condition)
dds <- DESeq(dds, betaPrior=FALSE)
counts_table = counts( dds, normalized=TRUE )
filtered_norm_counts<-counts_table[!rowSums(counts_table==0)>=1, ]
filtered_norm_counts<-as.data.frame(filtered_norm_counts)
GeneID<-rownames(filtered_norm_counts)
filtered_norm_counts<-cbind(filtered_norm_counts,GeneID)
dim(filtered_norm_counts)
head(filtered_norm_counts)
plotDispEsts(dds)
log_dds<-rlog(dds)
plotPCAWithSampleNames(log_dds, intgroup="condition", ntop=40000)
res<-results(dds,contrast=c("condition","6Hour","0Hour"))
head(res)
res_ordered<-res[order(res$padj),]
GeneID<-rownames(res_ordered)
res_ordered<-as.data.frame(res_ordered)
res_genes<-cbind(res_ordered,GeneID)
dim(res_genes)
head(res_genes)
dim(res_genes)
res_genes_merged <- merge(res_genes,filtered_norm_counts,by=unique("GeneID"))
dim(res_genes_merged)
head(res_genes_merged)
res_ordered<-res_genes_merged[order(res_genes_merged$padj),]
write.csv(res_ordered, file="nema_DESeq_all.csv" )
# threshold padj<0.05
resSig = res_ordered[res_ordered$padj < 0.05, ]
# threshold +- log2foldchange 1, but let's not do this - controversial
#resSig = resSig[resSig$log2FoldChange > 1 | resSig$log2FoldChange < -1,]
write.csv(resSig,file="nema_DESeq_padj0.05_log2FC1.csv")
plot(log2(res_ordered$baseMean), res_ordered$log2FoldChange, col=ifelse(res_ordered$padj < 0.05, "red","gray67"),main="nema (padj<0.05, log2FC = ±1)",xlim=c(1,20),pch=20,cex=1,ylim=c(-12,12))
abline(h=c(-1,1), col="blue")
genes<-resSig$GeneID
mygenes <- resSig[,]
baseMean_mygenes <- mygenes[,"baseMean"]
log2FoldChange_mygenes <- mygenes[,"log2FoldChange"]
text(log2(baseMean_mygenes),log2FoldChange_mygenes,labels=genes,pos=2,cex=0.60)
d<-resSig
dim(d)
head(d)
colnames(d)
d<-d[,c(8:17)]
d<-as.matrix(d)
d<-as.data.frame(d)
d<-as.matrix(d)
rownames(d) <- resSig[,1]
head(d)
hr <- hclust(as.dist(1-cor(t(d), method="pearson")), method="complete")
mycl <- cutree(hr, h=max(hr$height/1.5))
clusterCols <- rainbow(length(unique(mycl)))
myClusterSideBar <- clusterCols[mycl]
myheatcol <- greenred(75)
heatmap.2(d, main="nema (padj<0.05, log2FC = ±1)",
Rowv=as.dendrogram(hr),
cexRow=0.75,cexCol=0.8,srtCol= 90,
adjCol = c(NA,0),offsetCol=2.5,
Colv=NA, dendrogram="row",
scale="row", col=myheatcol,
density.info="none",
trace="none", RowSideColors= myClusterSideBar)
```
# Notes
Received this error (10/13/2018); it typically means the shell's current working directory was deleted out from under it, and `cd`-ing into a valid directory resolves it:
```
ljcohen@js-16-157:~/databackup/quality$ Error occurred during initialization of VM
java.lang.Error: Properties init: Could not determine current working directory.
at java.lang.System.initProperties(Native Method)
at java.lang.System.initializeSystemClass(System.java:1166)
```
Original [eelpond](https://eel-pond.readthedocs.io/en/latest/)
[SIO Rnaseq workshop](https://rnaseq-workshop-2017.readthedocs.io/en/latest/quality-trimming.html)