One of the nice things about the ORP is that it automatically handles assembly, evaluation, and quantification. I will introduce these procedures more fully in the lecture part of the day.
All required software and data are provided pre-installed on a Docker VM. To install this on your own machines when you get home, see http://oyster-river-protocol.readthedocs.io/en/latest/aws_setup.html and https://oyster-river-protocol.readthedocs.io/en/latest/docker_install.html
The workshop materials here expect that you have some familiarity with UNIX. If you need a refresher, the Command Line Bootcamp is good (http://nhinbre.org/bioinformatics-modules/). For extended prep, see the DataCarpentry and SoftwareCarpentry organizations.
This tutorial uses a very small dataset, so that everyone can have the opportunity to complete an assembly without using a truly massive amount of compute. BUT, if you can assemble the small dataset, you can assemble a bigger one, too, assuming you have a big enough computer.
The data used in this tutorial are from SRR1221220, which is an RNAseq dataset from the Japanese fire belly newt (Cynops pyrrhogaster). This is what I did before the class.
Do it!
At this point, your terminal prompt should look something like this: orp@b09c83d73388:~$
Activate the conda environment:
Update the ORP
At this point, your command prompt should look something like (orp) orp@1312b3d53c18:~$. Note the prefix (orp).
The first thing you should always do when logging in to a new machine is explore the directory structure. Where are the data? Where are programs installed?
Make a directory that will contain all of your assembly materials. In general, it's smart practice to have individual folders for each step in your bioinformatics pipeline.
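As a rough sketch of that advice, the snippet below creates one project directory with a subfolder per step. The project path matches the $HOME/assembly_practical location used later in this tutorial; the subfolder names are hypothetical placeholders for a generic pipeline, not something the ORP requires.

```shell
# Base directory for the practical; set PROJECT beforehand to put it elsewhere.
PROJECT="${PROJECT:-$HOME/assembly_practical}"

# Hypothetical per-step layout: one folder for each stage of the pipeline.
mkdir -p "$PROJECT/raw_reads" "$PROJECT/assembly" "$PROJECT/annotation"

# Confirm what was created.
ls "$PROJECT"
```

Using mkdir -p means the command is safe to re-run: it does not complain if the folders already exist.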
What is this && thing? It links two commands together so that the second one runs only IF the first one succeeds. So: do command 1, do command 2 if command 1 succeeds, do command 3 if command 2 succeeds.
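Here is a small, self-contained demonstration of that behavior. It works in a throwaway temporary directory, and the directory and variable names are made up for illustration.

```shell
# Work in a scratch directory so nothing in your home folder is touched.
workdir=$(mktemp -d)

# Each command runs only if the one before it exited successfully.
mkdir "$workdir/demo" && cd "$workdir/demo" && echo "now in $(basename "$PWD")"

# 'false' always fails, so the assignment after && is skipped.
ran="no"
false && ran="yes"
echo "second command ran: $ran"
```

The first chain prints "now in demo"; the last line reports "no", because the failing command stopped the chain before the assignment could run.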
The final assembly will be at assemblies/sampledata.ORP.fasta
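For orientation, here is a minimal sketch of how an invocation could be put together from the parameters this tutorial describes. The read file names, their location, and the numeric values are all assumptions chosen for illustration; only the script path, the parameter names, and the sampledata output name come from this page. Depending on your ORP version, additional arguments may be needed that are not shown here. The snippet only prints the command, so it is safe to run anywhere.

```shell
# Placeholder values: adjust to your machine and data (all are assumptions).
READS_DIR="$HOME/reads"   # hypothetical location of the FASTQ files
STRAND=RF                 # strand-specific library type
TPM_FILT=1                # illustrative expression threshold
MEM=15                    # GB of RAM, a bit below what the machine has
CPU=8                     # number of threads to use
RUNOUT=sampledata         # matches assemblies/sampledata.ORP.fasta

# Build the command as a string so it can be inspected before running.
orp_cmd="$HOME/Oyster_River_Protocol/oyster.mk STRAND=$STRAND TPM_FILT=$TPM_FILT MEM=$MEM CPU=$CPU READ1=$READS_DIR/reads_1.fastq.gz READ2=$READS_DIR/reads_2.fastq.gz RUNOUT=$RUNOUT"

# Print it with --dry-run appended; nothing is executed here.
echo "$orp_cmd --dry-run"
```

Printing first is a cheap way to catch a mistyped path or parameter before committing hours of compute.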
Let's understand this command:
$HOME/Oyster_River_Protocol/oyster.mk
The ORP is written as a Makefile, which is a nice way to organize computational pipelines. With few exceptions, if your run fails mid-way through the process, restarting using the same command will pick up at the point at which it failed.
STRAND
Was the library prepared using a strand-specific approach? In 2019, most libraries are, and RF is the most common.
TPM_FILT
Do you want to remove lowly expressed, mostly non-biological transcripts? (You probably do.)
MEM
How much RAM do you require (in GB)? I usually set this to about 10% less than what the computer has.
CPU
Use all that you have.
READ1 and READ2
The ORP requires PE reads, sorry. Assembling with SE reads is not really worth it. The reads can be gzipped. It is always safer to use the full path to the reads.
RUNOUT
Name the output. This name will be included in the final assembly, so choose wisely. No special characters (e.g., \|*?/).
--dry-run
Adding this to the end of the command will print out to your screen the commands that will be run on your dataset, but they won't actually be run.

What is the ORP doing?
We'll talk about this in the lecture part of the class.
How many resources do you need?
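As a starting point for the MEM and CPU settings, the sketch below applies the "about 10% less than what the computer has" rule of thumb and counts the available cores. The function name suggest_mem_gb is made up for this example, and the /proc/meminfo lookup assumes a Linux machine.

```shell
# Suggest a MEM value: about 90% of total RAM in GB, rounded down.
suggest_mem_gb() {
  echo $(( $1 * 9 / 10 ))
}

# CPU: use every core the machine reports (falls back to 1 if unknown).
CPU=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# On Linux, total RAM can be read from /proc/meminfo (kB converted to GB).
if [ -r /proc/meminfo ]; then
  total_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
  echo "MEM=$(suggest_mem_gb "$total_gb") CPU=$CPU"
fi

# Example: a 16 GB machine.
suggest_mem_gb 16   # prints 14
```

Leaving some headroom matters because the operating system and other processes also need memory while the assembly runs.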
So you have an assembly, now how good is it? One way is to look at the assembly content. Are all the expected genes present? For this part of the exercise, we're going to use a 'real' assembly.
Understanding this command:
$HOME/Oyster_River_Protocol/report.mk
The ORP is written as a Makefile, which is a nice way to organize computational pipelines. With few exceptions, if your run fails mid-way through the process, restarting using the same command will pick up at the point at which it failed.
MEM
How much RAM do you require (in GB)? I usually set this to about 10% less than what the computer has.
CPU
Use all that you have.
LINEAGE
Which BUSCO database are you going to use? Here, we will use the eukaryote database.
READ1 and READ2
The ORP requires PE reads, sorry. Assembling with SE reads is not really worth it. The reads can be gzipped. It is always safer to use the full path to the reads.
RUNOUT
Name the output. This name will be included in the final assembly, so choose wisely. No special characters (e.g., \|*?/).
--dry-run
Adding this to the end of the command will print out to your screen the commands that will be run on your dataset, but they won't actually be run.

This will take about 15 minutes to run. One member of your group: write the results on the board. Which assembly is the best?
Note: The ORP runs BUSCO automatically, and has done so for your dummy assembly. See $HOME/assembly_practical/reports/run*ORP/short_summary*txt. A real assembly would score much better, hopefully…
Note: The ORP runs TransRate automatically, and has done so for your dummy assembly. See $HOME/assembly_practical/reports/transrate*/assemblies.csv. This little code snippet will make it easier to view this file.
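One possible way to do this, sketched under the assumption that the file is a simple CSV with a header row followed by one row per assembly, is to print each column name next to its value. The function name show_csv is made up for this example, and the parsing is naive: it would mis-split fields that contain quoted commas.

```shell
# Print a CSV as "column name <TAB> value" pairs, one pair per line.
show_csv() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }  # remember the header
    {
      for (i = 1; i <= NF; i++) printf "%s\t%s\n", name[i], $i
      print ""                                                # blank line between rows
    }
  ' "$1"
}

# Apply it to the TransRate report, if one is present on this machine.
for f in "$HOME"/assembly_practical/reports/transrate*/assemblies.csv; do
  if [ -e "$f" ]; then
    show_csv "$f"
  fi
done
```

A wide, single-row table is hard to read in a terminal; turning it on its side puts every metric on its own line.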
The ORP does this automatically, using Salmon.
See http://dib-lab.github.io/dammit/install/
Installing (let's go rogue!!)
Look at the main output files: