Transcriptome Assembly Practical DOCKER ELAEVO 2019

The following details the steps involved in:

Generating a de novo RNA-Seq assembly using the Oyster River Protocol
Evaluating the quality of the assembly
- BUSCO and TransRate
Quantifying transcript expression levels
- Using Salmon
Functionally annotating transcripts (won't get here, but see code)
- dammit
Predicting coding regions (won't get here, but see code)
- TransDecoder

Oyster River Protocol

One of the nice things about the ORP is that it does, automatically, the assembly, evaluation, and quantification of the assembly. I will more fully introduce these procedures in the lecture part of the day.

All required software and data are provided pre-installed on a Docker VM. For when you get home, and want to install this on your own machines, see http://oyster-river-protocol.readthedocs.io/en/latest/aws_setup.html and https://oyster-river-protocol.readthedocs.io/en/latest/docker_install.html

The workshop materials here expect that you have some familiarity with UNIX. If you need a refresher, the Command Line Bootcamp is good (http://nhinbre.org/bioinformatics-modules/). For extended prep, see DataCarpentry and SoftwareCarpentry organizations.

About the Tutorial

This tutorial uses a very small dataset, so that everyone can have the opportunity to complete an assembly without using a truly massive amount of compute. BUT, if you can assembly the small dataset, you can assembly a bigger one, too, assuming you have a big enough computer.

About the data

The data used in this tutorial are from SRR1221220, which is a RNAseq dataset from the Japanese fire belly newt (Cynops pyrrhogaster). This is what I did, before the class.

Assembled the full dataset using the ORP
Quantitated gene expression using Salmon
Extracted out 50 of the highest expression transcripts (numbers 451-500 to be exact).
Mapped the reads to the fasta file of those 50 contigs using bwa mem, producing a SAM file.
Used Picard to move from SAM to fastq
Extracted a random 80,000 read pairs from the larger fastq using seqtk.

Run Docker image

Do it!

docker run -it macmaneslab/orp:2.2.8 bash

At this point, your terminal prompt should look something like this orp@b09c83d73388:~$

Tutorial Begins

Assemble using the ORP

Activate the conda environment:

conda activate orp

Update the ORP

cd Oyster_River_Protocol/
git pull
cd

At this point, your command prompt should look something like (orp) orp@1312b3d53c18:~$. Note the prefix (orp).

The 1st thing you should always do when logging in to a new machine is explore the directory structure. Where are the data, where are programs installed, etc.

Make a directory that will contain all of your assembly materials. In general, it's smart practice to have individual folders for each step in your bioinformatics pipeline.

mkdir $HOME/assembly_practical && cd $HOME/assembly_practical

What is this && thing?? It basically serves to link two commands together, IF the 1st one succeeds. So, do command 1, do command 2 if 1 succeeds, do command 3 if command 2 succeeds

$HOME/Oyster_River_Protocol/oyster.mk \
STRAND=RF \
TPM_FILT=1 \
MEM=2 \
CPU=2 \
READ1=$HOME/Oyster_River_Protocol/sampledata/test.1.fq.gz \
READ2=$HOME/Oyster_River_Protocol/sampledata/test.2.fq.gz \
RUNOUT=sampledata

the final assembly will be at assemblies/sampledata.ORP.fasta

can we talk about the assembly stats? Why are they so good/so bad??

Let's understand this command:

$HOME/Oyster_River_Protocol/oyster.mk the ORP is written as a Makefile, which is a nice way to organize computational pipelines. With few exceptions, if your run fails mid-way through the process, restarting is using the same command will pick up at the point at which it failed.
STRAND Was the library prepared using a strand-specific approach. In 2019, most libraries are, and RF is the most common.
TPM_FILT Do you want to remove* lowly expressed, mostly non-biological transcripts. (you probably do)
MEM How much RAM do you require (in Gb)? I usually set this to about 10% less than what the computer has.
CPU use all that you have.
READ1 and READ2 The ORP requires PE reads, sorry. Assembling with SE reads is not really worth it.. The reads can be gzipped. Always safe® to use the full path to the reads.
RUNOUT Name the output. This name will be included in the final assembly, so choose wisely. No special characters (e.g., \|*?/)
appending --dry-run to the end of the command will print out to your screen the commands that will be run on your dataset, but they won't actually be run.

What is the ORP doing. We'll talk about this in the lecture part of the class.

Trim adapters and low quality bases (Phred<2)
Correct the reads
Asssembly using Trinity, SPAdes k=55, SPAdes k=75, TransABySS
For iso-groups from the 4 assemblies, and pick the best member of each isogroup.
BLAST assemblies using Diamond to Swiss-Prot and make sure that we have all the unique stuff from each assembly.
cd-hit the assembly
Run Salmon on the assembly
Optionally, filter very lowly expressed trasnscripts, and make sure we are not throwing out "real" stuff.
Run BUSCO
Run TransRate.
Print a report

How many resources do you need?

RAM The amount of RAM you need scales with the number of unique kmers. The number of unique kmers is positively correlated with number of reads, which is a far more accessible number. In general, you need about 1GB RAM per million reads.
CPUs More is better, but machines with between 12 and 64 cores are common. Note the assembly process is not able to leverage MPI.

Evaluating content using BUSCO and TransRate

So you have an assembly, now how good is it? One way is to look at the assembly content. Are all the expected genes present? Fo this part of the exercise, We're going to use a 'real' assembly.

mkdir $HOME/assembly_practical/assembly_eval && cd $HOME/assembly_practical/assembly_eval

$HOME/Oyster_River_Protocol/report.mk main \
ASSEMBLY=$HOME/assembly_practical/assemblies/sampledata.ORP.fasta \
CPU=2 \
MEM=2 \
LINEAGE=eukaryota_odb9 \
READ1=$HOME/Oyster_River_Protocol/sampledata/test.1.fq.gz \
READ2=$HOME/Oyster_River_Protocol/sampledata/test.2.fq.gz \
RUNOUT=samplledataquality

Understanding this command:

$HOME/Oyster_River_Protocol/report.mk the ORP is written as a Makefile, which is a nice way to organize computational pipelines. With few exceptions, if your run fails mid-way through the process, restarting is using the same command will pick up at the point at which it failed.
MEM How much RAM do you require (in Gb)? I usually set this to about 10% less than what the computer has.
CPU use all that you have.
LINEAGE Which BUSCO database are you going to use. Here, we will use the eukaryote database.
READ1 and READ2 The ORP requires PE reads, sorry. Assembling with SE reads is not really worth it.. The reads can be gzipped. Always safe® to use the full path to the reads.
RUNOUT Name the output. This name will be included in the final assembly, so choose wisely. No special characters (e.g., \|*?/)
appending --dry-run to the end of the command will print out to your screen the commands that will be run on your dataset, but they won't actually be run.

This will take about 15 minutes to run. One member for your group - write results on the board. Which assembly is the best?

Note: The ORP runs BUSCO automatically, and has done so for your dummy assembly. See $HOME/assembly_practical/reports/run*ORP/short_summary*txt. A real assembly would score much better, hopefully…

Note: The ORP runs TransRate automatically, and has done so for your dummy assembly. See $HOME/assembly_practical/reports/transrate*/assemblies.csv. This little code snippet will make it easier to view this file.

paste <(sed -n 1p $HOME/assembly_practical/reports/transrate*/assemblies.csv | tr , '\n') \
<(sed -n 2p $HOME/assembly_practical/reports/transrate*/assemblies.csv | tr , '\n')

n_seqs	104
smallest	206
largest	3036
n_bases	90282
mean_len	868.09615
n_under_200	0
n_over_1k	33
n_over_10k	0
n_with_orf	53
mean_orf_percent	75.34392
n90	357
n70	875
n50	1326
n30	1751
n10	2218
gc	0.48795
bases_n	0
proportion_n	0.0
fragments	79999
fragments_mapped	79400
p_fragments_mapped	0.99251
good_mappings	77022
p_good_mapping	0.96279
bad_mappings	2378
potential_bridges	20
bases_uncovered	1605
p_bases_uncovered	0.01778
contigs_uncovbase	32
p_contigs_uncovbase	0.30769
contigs_uncovered	2
p_contigs_uncovered	0.01923
contigs_lowcovered	15
p_contigs_lowcovered	0.14423
contigs_segmented	8
p_contigs_segmented	0.07692
score	0.57688
optimal_score	0.66115
cutoff	0.23415
weighted	5638.68556

 12|  #
 11|  #
 10|  #
  9| ##
  8| ##
  7| ##
  6| ##
  5| ##
  4| ##
  3| ##
  2| ##
  1| ##        #
    -----------

------------------------------------
|             Summary              |
------------------------------------
|         observations: 22         |
|       min value: -1.000000       |
|         mean : -0.900045         |
|       max value: 0.980000        |
------------------------------------


*****  See the following link for interpretation *****
*****  https://oyster-river-protocol.readthedocs.io/en/latest/strandexamine.html *****



*****  QUALITY REPORT FOR: sampledata using the ORP version 2.2.7 ****
*****  THE ASSEMBLY CAN BE FOUND HERE: /home/orp/assemblies/sampledata.ORP.fasta ****

*****  BUSCO SCORE ~~~~~~~~~~~~~~~~~~~~~~>	C:0.0%[S:0.0%,D:0.0%],F:0.3%,M:99.7%,n:303
*****  TRANSRATE SCORE ~~~~~~~~~~~~~~~~~~>      0.36072
*****  TRANSRATE OPTIMAL SCORE ~~~~~~~~~~>      0.61637
*****  UNIQUE GENES ORP ~~~~~~~~~~~~~~~~~>      40
*****  UNIQUE GENES TRINITY ~~~~~~~~~~~~~>      31
*****  UNIQUE GENES SPADES55 ~~~~~~~~~~~~>      22
*****  UNIQUE GENES SPADES75 ~~~~~~~~~~~~>      23
*****  UNIQUE GENES TRANSABYSS ~~~~~~~~~~>      35
*****  READS MAPPED AS PROPER PAIRS ~~~~~>      99.10%

Quantification

The ORP does this.. using Salmon..

Annotation using dammit (time permitting)

See http://dib-lab.github.io/dammit/install/

Installing (let's go rogue!!)

conda deactivate #to exit your current conda environment.  
conda create -y --name dammit # make a new conda environent
conda activate dammit #run the new envoronment
conda install -y -c bioconda dammit # install dammit, this will take a few minutes

I've installed the databases for you, using this command. You don't have to do this

dammit databases --database-dir $HOME/share/Day3/dammit_dbs \
--install --busco-group eukaryota

Running the Annotation (minimally tested and potentially broken)

mkdir -p $HOME/assembly_practical/dammit/ && cd $HOME/assembly_practical/dammit

dammit annotate $HOME/assembly_practical/assemblies/sampledata.ORP.fasta \
--busco-group eukaryota \
--n_threads 8 \
--database-dir $HOME/share/Day3/dammit_dbs

Look at the main output files:

$HOME/assembly_practical/dammit/sampledata.ORP.fasta.dammit/sampledata.ORP.fasta.dammit.fasta
$HOME/assembly_practical/dammit/sampledata.ORP.fasta.dammit/sampledata.ORP.fasta.dammit.gff3

What next??

This is your final annotated assembly - you should save it someplace(s)
Use it for whatever downstream analysis.
Upload it, the reads, and the code you used to make the assembly, to a public repository when you publish.

Bibliography

Best practices: doi 10.1101/035642
Oyster River Protocol: doi: 10.7717/peerj.5428
BUSCO: doi: 10.1093/molbev/msx319
TransRate: doi: 10.1101/gr.196469.115
Salmon: doi: 10.1038/nmeth.4197
dammit: http://dib-lab.github.io/dammit/