SETAC-NA RNAseq installation
===
###### tags: `workshop`
# Backup files
If you weren't able to create this input file from the previous pipeline, you can copy ones that we created here:
```
/opt/rnaseq/
```
* Input raw fastq data
```
curl -L https://osf.io/365fg/download -o nema_subset_0Hour.zip
curl -L https://osf.io/9tf2g/download -o nema_subset_6Hour.zip
unzip nema_subset_0Hour.zip
unzip nema_subset_6Hour.zip
```
* Trimmed reads
```
/opt/rnaseq/trim/*.qc.fq.gz
```
* Transcriptome assembly
```
/opt/rnaseq/assembly/nema_trinity/Trinity.fasta
```
* Annotation table (contig ID to gene name), created in the annotation section below
```
/opt/rnaseq/annotation/Trinity.fasta.dammit/nema_gene_name_id.csv
```
* Salmon quantification files
```
/opt/rnaseq/quant/*.quant
```
# Image creation
Launch instance: "Ubuntu 18.04 Devel and Docker" base image (Oct 1, 2018 by jfischer), m1.large (CPU: 10, Mem: 30 GB, Disk: 60 GB)
Allocation Source: TG-MCG180142
https://angus.readthedocs.io/en/2018/jetstream-bioconda-config.html
Install in `/opt`, per [Jetstream installation instructions](https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/17465521/Imaging+Guidelines):
```
sudo chmod a+w /opt
mkdir /opt/rnaseq
cd /opt/rnaseq
```
```
curl -O -L https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh #install in /opt/miniconda3
echo export PATH=$PATH:/opt/miniconda3/bin >> ~/.bashrc
source ~/.bashrc
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
```
Install R Studio:
```
sudo apt-get update && sudo apt-get -y install gdebi-core r-base
```
Then:
```
wget https://download2.rstudio.org/rstudio-server-1.1.453-amd64.deb
sudo gdebi -n rstudio-server-1.1.453-amd64.deb
```
Install the command-line tools (Trimmomatic and friends):
```
source ~/.bashrc
conda install -y fastqc multiqc trimmomatic trinity time osfclient salmon khmer
sudo apt-get install tree sl
```
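A quick sanity check, as a sketch, that the conda and apt installs actually put the core tools on the `PATH` (the `check_tools` helper is ours, not part of the workshop material):

```shell
# Sanity check: confirm the core workshop tools landed on the PATH.
check_tools() {
    local tool
    for tool in fastqc multiqc trimmomatic Trinity salmon; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: ok"
        else
            echo "$tool: MISSING" >&2
        fi
    done
}
check_tools
```

Anything reported `MISSING` should be reinstalled before moving on.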
Make a backup data products dir:
```
mkdir /opt/rnaseq
cd /opt/rnaseq
mkdir data
cd data
```
# QC
https://angus.readthedocs.io/en/2018/quality-and-trimming.html
https://github.com/WhiteheadLab/2018-setacna-rnaseq/blob/master/quality-trimming.md
https://rnaseq-workshop-2017.readthedocs.io/en/latest/quality-trimming.html
Download data:
```
curl -L https://osf.io/365fg/download -o nema_subset_0Hour.zip
curl -L https://osf.io/9tf2g/download -o nema_subset_6Hour.zip
unzip nema_subset_0Hour.zip
unzip nema_subset_6Hour.zip
```
(Except the "subset" was WAY too large:)
```
-bash-4.4$ zgrep -c "^@" 0Hour_ATCACG_L002_R1_001.fastq.gz
4836256
```
So, made a new subset (400K was still too many; tried 10K lines), then uploaded to [osf](https://osf.io/72bu3/):
```
for file in *R1*; do cat $file | head -10000 > ${file}_subset; done
for file in *R2*; do cat $file | head -10000 > ${file}_subset; done
rm -rf *.fastq
for filename in *_subset; do base=$(basename $filename _subset); mv $filename $base; done
export OSF_PASSWORD=<osf-password>
export OSF_USERNAME=<osf-username>
zip nema_subset_small_0Hour.zip 0Hour_*
zip nema_subset_small_6Hour.zip 6Hour_*
osf -p 72bu3 upload nema_subset_small_0Hour.zip subset/nema_subset_small_0Hour.zip
osf -p 72bu3 upload nema_subset_small_6Hour.zip subset/nema_subset_small_6Hour.zip
```
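Note that each FASTQ record spans exactly four lines (header, sequence, `+`, qualities), so `head -10000` keeps 2,500 reads, not 10,000; also, `zgrep -c "^@"` (used above) can overcount, because quality strings may begin with `@`. A quick sketch for counting reads correctly (the `count_reads` helper is ours, not part of the pipeline):

```shell
# Each FASTQ record is exactly 4 lines (header, sequence, '+', qualities),
# so the read count is the line count divided by 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example (illustrative filename):
# head -10000 0Hour_ATCACG_L002_R1_001.fastq > subset.fastq
# count_reads subset.fastq
```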
Rename, stripping the `.gz` extension:
```
for filename in *gz; do base=$(basename $filename .gz); mv $filename $base; done
```
Quality trimming:
```
cd ..
mkdir quality
cd quality
ln -s ../data/*.fastq .
fastqc *.fastq
multiqc .
```
Setup trim directory:
```
cd ..
mkdir trim
cd trim
ln -s ../data/*.fastq .  # note: this should be *.fastq.gz
cat /opt/miniconda3/share/trimmomatic*/adapters/* > combined.fa
```
Run Trimmomatic:
```
for filename in *_R1_*.fastq
do
# first, make the base by removing .fastq
base=$(basename $filename .fastq)
echo $base
# now, construct the R2 filename by replacing R1 with R2
baseR2=${base/_R1_/_R2_}
echo $baseR2
# finally, run Trimmomatic
trimmomatic PE ${base}.fastq ${baseR2}.fastq \
${base}.qc.fq s1_se \
${baseR2}.qc.fq s2_se \
ILLUMINACLIP:combined.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
# gzip output
gzip -9c ${base}.qc.fq > ${base}.qc.fq.gz
gzip -9c ${baseR2}.qc.fq > ${baseR2}.qc.fq.gz
# save the orphans
gzip -9c s1_se s2_se >> orphans.qc.fq.gz
rm -f s1_se s2_se
done
rm -rf *.qc.fq
```
### Running alternative trimming
* Want to produce only gzipped files, to save space and to match the true identity of the files (`.gz`)
```
for filename in *_R1_*.fastq.gz
do
# first, make the base by removing fastq.gz
base=$(basename $filename .fastq.gz)
echo $base
# now, construct the R2 filename by replacing R1 with R2
baseR2=${base/_R1_/_R2_}
echo $baseR2
# finally, run Trimmomatic
trimmomatic PE ${base}.fastq.gz ${baseR2}.fastq.gz \
${base}.qc.fq.gz s1_se.gz \
${baseR2}.qc.fq.gz s2_se.gz \
ILLUMINACLIP:combined.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
# save the orphans
zcat s1_se.gz s2_se.gz >> orphans.qc.fq.gz
rm -f s1_se.gz s2_se.gz
done
```
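A rough sketch for checking how many reads survived trimming (the `trim_survival` helper and the example paths are ours, not part of the pipeline):

```shell
# Rough check of how many reads survived trimming. Both file paths in the
# commented example are illustrative; substitute a pair from your own run.
trim_survival() {
    local raw=$1 trimmed=$2
    echo "$(( $(zcat "$trimmed" | wc -l) / 4 )) of $(( $(zcat "$raw" | wc -l) / 4 )) reads kept"
}

# trim_survival ../data/0Hour_ATCACG_L002_R1_001.fastq.gz 0Hour_ATCACG_L002_R1_001.qc.fq.gz
```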
Now, run fastqc again on trimmed files:
```
fastqc *.qc.fq.gz
multiqc .
```
# Trinity assembly
https://setac-omics.readthedocs.io/en/latest/transcriptome-assembly.html
## Install the TRINITY assembler
The [Trinity assembler](https://www.ncbi.nlm.nih.gov/pubmed/21572440) can be installed through `conda`:
```
conda install -y trinity
```
Make assembly directory:
```
cd ..
mkdir assembly
cd assembly
```
Softlink trimmed files:
```
ln -s ../trim/*.qc.fq.gz .
```
Combine all fq into 2 files (left.fq and right.fq)
```
zcat *R1*.qc.fq.gz > left.fq
zcat *R2*.qc.fq.gz > right.fq
zcat orphans.qc.fq.gz >> left.fq
```
Run Trinity:
```
time Trinity --seqType fq --max_memory 30G --CPU 10 --left left.fq --right right.fq --output nema_trinity
```
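Once Trinity finishes, a quick sketch for gauging assembly size (the `count_contigs` helper is ours):

```shell
# Each contig in the assembly is one FASTA record, so counting
# '>' header lines gives the number of assembled transcripts.
count_contigs() { grep -c '^>' "$1"; }

# count_contigs nema_trinity/Trinity.fasta
```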
# Quant
https://setac-omics.readthedocs.io/en/latest/rnaseq-quant.html
```
cd ..
mkdir quant
cd quant
```
Soft link assembly:
```
ln -s ../assembly/nema_trinity/Trinity.fasta .
```
Index the assembly:
```
salmon index --index nema --type quasi --transcripts Trinity.fasta
```
Link in trimmed reads:
```
ln -s ../trim/0Hour*.qc.fq.gz .
ln -s ../trim/6Hour*.qc.fq.gz .
```
Run salmon
```
for R1 in *R1*.qc.fq.gz
do
sample=$(basename $R1 extract.qc.fq.gz)
echo sample is $sample, R1 is $R1
R2=${R1/R1/R2}
echo R2 is $R2
salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}.quant
done
```
Take a look at quant output, [examples](https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst)
```
head 0Hour_ATCACG_L002_R1_001.qc.fq.gz.quant/quant.sf
```
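`quant.sf` is tab-separated with a header row of `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`. A sketch for listing the most highly expressed transcripts (the `top_transcripts` helper is ours, not part of salmon):

```shell
# quant.sf is tab-separated with a header row:
#   Name  Length  EffectiveLength  TPM  NumReads
# Skip the header and sort by TPM (column 4, general-numeric, descending)
# to list the most highly expressed transcripts.
top_transcripts() {
    tail -n +2 "$1" | sort -t $'\t' -k4,4gr | head -n "${2:-5}"
}

# top_transcripts 0Hour_ATCACG_L002_R1_001.qc.fq.gz.quant/quant.sf 10
```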
All the mapping rates:
```
find . -name salmon_quant.log -exec grep -H "Mapping rate" {} \;
```
# Annotation (not run during the SETAC workshop; this just generates the annotation files required for DE analysis in the next lesson)
https://angus.readthedocs.io/en/2018/dammit_annotation.html
* [dammit](https://github.com/camillescott/dammit)
* [databases](http://www.camillescott.org/dammit/databases.html)
Takes ~1 hr to install dammit + databases:
```
conda create -y --name py3.dammit python=3
source activate py3.dammit
conda install -y dammit
dammit databases --install --database-dir /opt/.dammit/databases --busco-group eukaryota
cd ..
mkdir annotation
cd annotation
ln -s ../assembly/nema_trinity/Trinity.fasta .
```
Grab *Nematostella* protein database from uniprot:
```
curl -LO ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001593_45351.fasta.gz
gunzip -c UP000001593_45351.fasta.gz > nema.reference.prot.faa
rm UP000001593_45351.fasta.gz
```
Annotate (~20 min):
```
export DAMMIT_DB_DIR=/opt/.dammit/databases
dammit annotate Trinity.fasta --busco-group eukaryota --user-databases nema.reference.prot.faa --n_threads 14
```
Convert contig ID to gene names (in Python):
Change directory to dammit output:
```
cd Trinity.fasta.dammit
```
Then, start python to make a table of contig id and gene name (required for tximport for DE analysis):
```
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "Trinity.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
# selects one annotation per contig by sorting by e-value (less than 1e-05) and picking the top
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
# removes contigs with no name
new_file = names.dropna(axis=0,how='all')
# converts dammit ID to contig ID
conversions_dammit = pd.read_csv("Trinity.fasta.dammit.namemap.csv")
conversions_dammit['Name'], conversions_dammit['info'] = conversions_dammit['original'].str.split(' ', 1).str
conversions_dammit = conversions_dammit[['Name','renamed']]
conversions_dammit.columns = ['Name','seqid']
new_file = pd.merge(new_file,conversions_dammit,on="seqid")
new_file.columns = ['seqid','Name','contig']
new_file.to_csv("nema_gene_name_id.csv")
```
# Differential Expression
https://setac-omics.readthedocs.io/en/latest/DE.html
(if in the annotation conda environment)
```
source deactivate
```
Change password on RStudio server:
```
sudo passwd username
```
Print the RStudio server URL to check that it is running:
```
echo My RStudio Web server is running at: http://$(hostname):8787/
```
Install R dependencies:
```
sudo apt-get install -y libxml2 libxml2-dev libcurl4-gnutls-dev libssl-dev
```
Script to install differential expression software
```
curl -O -L https://github.com/ngs-docs/angus/raw/2017/_static/install-deseq2.R
sudo Rscript --no-save install-deseq2.R
```
RStudio only recognizes files in home `~/`. So, soft link files there:
```
mkdir ~/DE
cd ~/DE
mkdir quant
cd quant
ln -s /opt/rnaseq/quant/*.quant .
```
Download a custom PCA script that labels samples, and copy the annotation table into the home directory:
```
cd
wget https://raw.githubusercontent.com/ngs-docs/2017-dibsi-rnaseq/master/plotPCAWithSampleNames.R
cp /opt/rnaseq/annotation/Trinity.fasta.dammit/nema_gene_name_id.csv .
```
# RStudio
```
library(DESeq2)
library("lattice")
library(tximport)
library(readr)
library(gplots)
library(RColorBrewer)
source('~/plotPCAWithSampleNames.R')
setwd("~/DE/quant/")
dir<-'~/DE'
files_list = list.files()
files <- file.path(dir, "quant",files_list, "quant.sf")
names(files) <- c("0Hour_1","0Hour_2","0Hour_3","0Hour_4","0Hour_5","6Hour_1","6Hour_2","6Hour_3","6Hour_4","6Hour_5")
files
print(file.exists(files))
```
DE analysis:
```
tx2gene <- read.csv("~/nema_gene_name_id.csv")
tx2gene <- tx2gene[,c(4,3)]
cols<-c("transcript_id","gene_id")
colnames(tx2gene)<-cols
head(tx2gene)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene,importer=read.delim)
head(txi.salmon$counts)
dim(txi.salmon$counts)
condition = factor(c("0Hour","0Hour","0Hour","0Hour","0Hour","6Hour","6Hour","6Hour","6Hour","6Hour"))
ExpDesign <- data.frame(row.names=colnames(txi.salmon$counts), condition = condition)
ExpDesign
dds <- DESeqDataSetFromTximport(txi.salmon, ExpDesign, ~condition)
dds <- DESeq(dds, betaPrior=FALSE)
counts_table = counts( dds, normalized=TRUE )
filtered_norm_counts<-counts_table[!rowSums(counts_table==0)>=1, ]
filtered_norm_counts<-as.data.frame(filtered_norm_counts)
GeneID<-rownames(filtered_norm_counts)
filtered_norm_counts<-cbind(filtered_norm_counts,GeneID)
dim(filtered_norm_counts)
head(filtered_norm_counts)
plotDispEsts(dds)
log_dds<-rlog(dds)
plotPCAWithSampleNames(log_dds, intgroup="condition", ntop=40000)
res<-results(dds,contrast=c("condition","6Hour","0Hour"))
head(res)
res_ordered<-res[order(res$padj),]
GeneID<-rownames(res_ordered)
res_ordered<-as.data.frame(res_ordered)
res_genes<-cbind(res_ordered,GeneID)
dim(res_genes)
head(res_genes)
dim(res_genes)
res_genes_merged <- merge(res_genes,filtered_norm_counts,by=unique("GeneID"))
dim(res_genes_merged)
head(res_genes_merged)
res_ordered<-res_genes_merged[order(res_genes_merged$padj),]
write.csv(res_ordered, file="nema_DESeq_all.csv" )
# threshold padj<0.05
resSig = res_ordered[res_ordered$padj < 0.05, ]
# threshold +- log2foldchange 1, but let's not do this - controversial
#resSig = resSig[resSig$log2FoldChange > 1 | resSig$log2FoldChange < -1,]
write.csv(resSig,file="nema_DESeq_padj0.05_log2FC1.csv")
plot(log2(res_ordered$baseMean), res_ordered$log2FoldChange, col=ifelse(res_ordered$padj < 0.05, "red","gray67"),main="nema (padj<0.05, log2FC = ±1)",xlim=c(1,20),pch=20,cex=1,ylim=c(-12,12))
abline(h=c(-1,1), col="blue")
genes<-resSig$GeneID
mygenes <- resSig[,]
baseMean_mygenes <- mygenes[,"baseMean"]
log2FoldChange_mygenes <- mygenes[,"log2FoldChange"]
text(log2(baseMean_mygenes),log2FoldChange_mygenes,labels=genes,pos=2,cex=0.60)
d<-resSig
dim(d)
head(d)
colnames(d)
d<-d[,c(8:17)]
d<-as.matrix(d)
d<-as.data.frame(d)
d<-as.matrix(d)
rownames(d) <- resSig[,1]
head(d)
hr <- hclust(as.dist(1-cor(t(d), method="pearson")), method="complete")
mycl <- cutree(hr, h=max(hr$height/1.5))
clusterCols <- rainbow(length(unique(mycl)))
myClusterSideBar <- clusterCols[mycl]
myheatcol <- greenred(75)
heatmap.2(d, main="nema (padj<0.05, log2FC = ±1)",
Rowv=as.dendrogram(hr),
cexRow=0.75,cexCol=0.8,srtCol= 90,
adjCol = c(NA,0),offsetCol=2.5,
Colv=NA, dendrogram="row",
scale="row", col=myheatcol,
density.info="none",
trace="none", RowSideColors= myClusterSideBar)
```
# Notes
Received this error (10/13/2018); it typically means the shell's current working directory was deleted out from under it, and `cd`-ing into a valid directory resolves it:
```
ljcohen@js-16-157:~/databackup/quality$ Error occurred during initialization of VM
java.lang.Error: Properties init: Could not determine current working directory.
at java.lang.System.initProperties(Native Method)
at java.lang.System.initializeSystemClass(System.java:1166)
```
Original [eelpond](https://eel-pond.readthedocs.io/en/latest/)
[SIO Rnaseq workshop](https://rnaseq-workshop-2017.readthedocs.io/en/latest/quality-trimming.html)