# RNA-Seq (Transcriptome Guided) [Salmon]

https://hackmd.io/@xrissymae/B1TQlR36V

For our project, we're working with White Lupin, a non-model organism. To run a differential expression analysis, we need a transcriptome to map and count our short reads against. Thankfully, a reference transcriptome already exists that we can use: the *Lupinus albus* Gene Index version 2 (LAGI02).

### Pipeline:
![Salmon Pipeline](https://i.imgur.com/WomUBtj.png)

### Reference:
1. [ANGUS DIBSI Tutorial 2018](https://angus.readthedocs.io/en/2018/rna-seq.html)
2. [Salmon Documents](https://salmon.readthedocs.io/en/latest/salmon.html)

## **Before Starting**: Install programs
1. Log into Comet
2. Download and install Miniconda (this provides `conda`, which we use to install Bioconda packages)
   ```
   curl -O -L https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
   bash Miniconda3-latest-Linux-x86_64.sh
   ```
   Say yes to everything. If you see ">>>", you're still within the installation process. Once finished, run the following command to activate the conda environment.
   ```
   source ~/.bashrc
   ```
   Now enable various channels for software installation
   ```
   conda config --add channels defaults
   conda config --add channels conda-forge
   conda config --add channels bioconda
   ```
3. Install Salmon using *conda*
   ```
   conda install salmon
   ```
4. Install multiqc
   ```
   conda install multiqc
   ```

## Downloading our Sequences
Instead of using our very large real data set, we'll be using a practice RNA-Seq set from [Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/).
1. Create a logs folder
   ```
   cd ~
   mkdir logs
   ```
2. Create a new directory for sequences in Oasis.
   ```
   cd /oasis/scratch/comet/$USER/temp_project
   mkdir data
   cd data
   ```
3. Download data from yeast RNA-Seq Study.
   #### RNA-Seq Data
   ```
   curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
   curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
   curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
   curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
   curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
   curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
   ```
4. Change permissions just in case. This removes write privileges so the data can't be modified by accident.
   * If you type `ls -l`, you should see:
     ```
     -rw-r--r-- 1 klorilla ceb101  59532325 May 29 21:21 ERR458493.fastq.gz
     -rw-r--r-- 1 klorilla ceb101  58566854 May 29 21:21 ERR458494.fastq.gz
     -rw-r--r-- 1 klorilla ceb101  58114810 May 29 21:21 ERR458495.fastq.gz
     -rw-r--r-- 1 klorilla ceb101 102201086 May 29 21:21 ERR458500.fastq.gz
     -rw-r--r-- 1 klorilla ceb101 101222099 May 29 21:21 ERR458501.fastq.gz
     -rw-r--r-- 1 klorilla ceb101 100585843 May 29 21:22 ERR458502.fastq.gz
     ```
   * Now remove the write privilege by typing `chmod a-w *` and check permissions again with `ls -l`. The `rw-r--r--` should now be `r--r--r--`.

## Link data files into your working directory
* Head back to your main directory `~` and link the files from Oasis.
  ```
  cd ~
  mkdir data
  cd data
  ln -fs /oasis/scratch/comet/$USER/temp_project/data/* .
  ls -l
  ```
* We do this so that the data is easier to work with than having to type `/oasis/scratch/comet/$USER/temp_project/data/` all the time.

## Quality Check with FastQC
1. Create a new folder for fastqc
   ```
   cd ~
   mkdir quality
   cd quality
   ```
2. Download the fastqc script into this folder.
   ```
   wget https://raw.githubusercontent.com/xrissymae/biobasics/master/fastqc.sh
   ```
3. Edit the fastqc.sh file with `vim`.
4. Run fastqc on the yeast RNA-seq files.
   ```
   sbatch fastqc.sh
   ```
5. Run `multiqc .` to consolidate the fastqc data.
6. Download the `.html` files to your local computer through Globus or secure copy.
   * **Globus**: Log in via app.globus.org.
   * **Command line**: In your local terminal (not connected to Comet), type the following. If your Comet username differs from your local one, replace `$USER` with your Comet username.
     ```
     mkdir ~/Desktop/fastqc
     cd ~/Desktop/fastqc
     scp $USER@comet.sdsc.xsede.org:~/quality/*.html .
     ls
     ```

### Example Reports
1. [Good Illumina Data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)
2. [Bad Illumina Data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)
* [More FastQC Information](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

## Trim Adapters with Trimmomatic
1. Download the adapters you will be using. For our real data, we have a custom adapter file (`adapters.fa`), which is based on the latest Illumina Paired-End TruSeq3 Library. But for this tutorial, we'll be using the standard file for Illumina Paired-End TruSeq2 Libraries (`TruSeq2-PE.fa`).
   ```
   cp /opt/biotools/trimmomatic/adapters/TruSeq2-PE.fa .
   ```
2. Download the [trimmomatic script](https://raw.githubusercontent.com/xrissymae/biobasics/master/trim.sh) to your working directory.
   ```
   wget https://raw.githubusercontent.com/xrissymae/biobasics/master/trim.sh
   ```
3. Edit the Trimmomatic shell file.
4. Run the script (~4min).

## FastQC Again
1. We'll run another quality check on our now trimmed data by editing our fastqc.sh file and changing the input directory and extension (`~/data/` -> `~/quality`; `fastq` -> `qc.fq`)
2. Download the `.html` files to your local computer to view.

## Read Mapping & Counting with Salmon
After quality checks, we can now map our reads to a reference transcriptome. We will use [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html).
1. Download the reference transcriptome.
   ```
   cd ~
   mkdir index
   cd index
   curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
   ```
2. Make a new folder for your rnaseq data
   ```
   cd ~
   mkdir rnaseq
   cd rnaseq
   ```
3. Create an index for the reference transcriptome. The `--type quasi` option comes from older Salmon releases; if your installed version rejects it, try dropping the flag.
   ```
   salmon index --index yeast_orfs --type quasi --transcripts ~/index/orf_coding.fasta.gz
   ```
4. Map reads to the reference index.
   ```
   # Run from ~/rnaseq so the yeast_orfs index is found.
   # basename keeps each .quant output folder in the current directory
   # instead of writing it next to the reads in ~/quality.
   for i in ~/quality/*.fq.gz
   do
       salmon quant -i yeast_orfs --libType U -r "$i" -o "$(basename "$i").quant" --seqBias --gcBias
   done
   ```

## Gather Counts
* This [python script](https://raw.githubusercontent.com/ngs-docs/angus/2018/scripts/gather-counts.py) by Titus Brown takes the raw counts from the Salmon `.quant` output and writes them out as new count files.
  ```
  curl -L -O https://raw.githubusercontent.com/ngs-docs/2018-ggg201b/master/lab6-rnaseq/gather-counts.py
  python2 gather-counts.py
  ```

## Data Analysis with edgeR
1. Download and run the [R script](https://raw.githubusercontent.com/ngs-docs/angus/2018/scripts/yeast.salmon.R).
   ```
   curl -L -O https://raw.githubusercontent.com/ngs-docs/angus/2018/scripts/yeast.salmon.R
   Rscript --no-save yeast.salmon.R
   ```
   * Uses the package edgeR: [manual](https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
   * Outputs
     1. `yeast-edgeR-MA-plot.pdf` MA-plot
     2. `yeast-edgeR-MDS.pdf` MDS Plot
     3. `yeast-edgeR.csv` CSV file of DEGs
2. Download the 3 files to your personal computer to view. Alternatively, you can run this R script on your personal computer by downloading the count files from the "Gather Counts" step.
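
To make the "Gather Counts" step less of a black box, here is a minimal shell sketch of the same idea: pull the transcript name and read count out of each Salmon result. It is not the `gather-counts.py` script itself. The function name `gather_counts` and the output layout (a two-column, tab-separated `.counts` file per sample) are assumptions made for illustration, and it relies on Salmon's `quant.sf` columns being `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`.

```shell
#!/usr/bin/env bash
# Sketch only: gather_counts is a hypothetical helper, not part of Salmon
# or of gather-counts.py. It assumes each *.quant folder holds a quant.sf
# file whose tab-separated columns are:
#   Name  Length  EffectiveLength  TPM  NumReads

gather_counts() {
    local dir="${1:-.}"    # folder that holds the *.quant directories
    local quant sample
    for quant in "$dir"/*.quant/quant.sf; do
        # If the glob matched nothing, the literal pattern is returned; skip it.
        [ -e "$quant" ] || continue
        # Name the output after the .quant folder, e.g. sample.fq.gz.quant
        sample="$(basename "$(dirname "$quant")")"
        # Keep column 1 (Name) and column 5 (NumReads), header line included.
        cut -f 1,5 "$quant" > "$dir/$sample.counts"
        echo "wrote $dir/$sample.counts"
    done
}
```

To try it, load the function into your shell and point it at the folder that contains your `.quant` directories, for example `gather_counts .` from inside that folder. The header and column layout of the real script's output may differ, so treat these files as a way to inspect the counts rather than as a drop-in input for `yeast.salmon.R`.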