This document describes how to run command-line BLAST with custom search-able databases on Colby's nscc.
"BLAST+" is a suite of linux programs that run the different variations of the basic local alignment search tool (BLAST) algorithm. It can be run on n26
(the Biology node). Which program you need to run (blastn
, blastp
, blastx
or tblastn
) depends to whether your search query is a nucleotide or amino acid (protein) sequence and whether you're searching a database of nucleotide or protein sequences.
Once you know the variety of BLAST you need, the basic form of the command is:
Notice that the default is for BLAST to send its output to "stdout", that is, the screen. So typically you want to redirect that output to a named file.
Here's a more realistic example.
Notice a few things about this command.
./
. So, I typically run the BLAST command from the folder where I want the output file to be. Typically this would be a project-specific folder under /export/groups/drangeli/
Dmel.act5C
, in this exampleNlug.OGSv1-0.CDS
tblastn
.txt
extension should make it clear that this is human-readable outputIn order to keep things organized on nscc, it's helpful to use consistent and clear file names. For species, they are typically identified by a four letter abbreviation consisting of the first initial of the genus, followed by the first three letters of the specific epithet. So, Oncopeltus fasciatus becomes Ofas
.
There are a few exceptions to this rule. The pea aphid, Acyrthosiphon pisum, would become Apis
, which would certainly get confused with the honeybee genus, Apis. So the aphid community has adopted Acypi
as their digital handle.
Below is a list of paths to each of the custom BLAST search-able databases available to our lab group on nscc
. Custom BLAST databases each reside in a folder, but consist of several different files (with the extensions nhr
, nin
, nog
, nsd
, nsi
, and nsq
). To run BLAST from the linux command line, you will need to refer to the path of the database and the base name of those files. List the contents of the database folder if you're not sure what this looks like.
At the end of this document, you can find instructions for creating your own BLAST database.
This is the reference transcriptome that was used to annotate the O. fasciatus genome project. The CDS (coding sequence) FASTA file was kindly provided by Kristen Panfilio. When she shared the file, Kristen told us the data originate from "pooled Illumina RNAseq reads from mixed juvenile, adult male, and adult female libraries". But we have no more specific information about the ages of those individuals. We might be able to follow-up with Kristen about that.
More recently, we also have BLAST databases for the genome assembly Ofas_2.0_genome
and the annotated gene set Ofas_OGSv1.2_peptide
.
This database is built from the 'grand union' of transcripts assembled from samples of the Plantation Key ecotype from whole bodies, dorsal thorax and gonads of first day adults. It is the most complete J. haematoloma transcriptome we have and is estimated (by BUSCO) to include 89% of conserved genes.
Additionally, we have transcriptomes and BLAST databases for samples from male and female gonads (testes and ovaries). I'm not sure if the testes also included the accessory glands.
J. sanguinolenta is a seperate species of soapberry bug whose range overlaps that of J. haematoloma in South Florida. They are not know to hybridize in the wild and attempts to do so in the lab have not produced offspring. J. sanguinolenta is not known to produce short-wing morphs in the wild, but they have appeared in the lab, suggesting a hidden reaction norm.
At the time we made this transcriptome, we thought the identification of this bug might have been Jadera hinnulea, which is why these files are labeled Jhin
rather than Jsan
. (I'd rename that, but I'm afraid the contents of these files might refer to their own file names, which would cause problems. Eventually, we should just re-build the BLAST database, so eliminate that name issue.)
The eastern box elder bug, Boisea trivittata, is the only rhopalid native to northeastern North America. Our lab created a transcriptome for this species with individuals collected in Waterville, ME.
The USDA i5k project has a number of other genome projects in the works for Hemiptera at various stages of completion. There are currently five species with transcriptomes available, and I've made BLAST databases from these.
To understand how these species might be useful, it's helpful to have some idea of the phylogenetic relationships of these species as well as their unique features. Unfortunately, the taxonomy of true bugs is a mess. Mid-century entomologists recognised two orders: Heteroptera, the true bugs, and Homoptera, the cicadas and their allies. However, molecular phylogenetics has found that Heteroptera is nested within what used to be 'Homoptera'. So the term 'Homoptera' is now defunct, and people recognise a single order Hemiptera, with 3 suborders: Heteroptera (the same name with the same species in it), Auchenorrhyncha (cicadas, leafhoppers, treehoppers, planthoppers, and spittlebugs) and Sternorrhyncha (aphids, whiteflies, and scale insects).
Here's a figure Kristen and I produced for a review we wrote on hemipteran genomics. Below these species are listed in decreasing relatedness to Oncopeltus.
The invasive brown marmorated stink bug is a member of the Pentatomomorpha, the infraorder that includes Oncopeltus (family Lygaeidae and Jadera (family Rhopalidae).
The bedbug represents the infraorder Cimicomorpha.
The water strider represents the infraorder Gerromorpha, the earliest diverging lineage of the Heteroptera.
The brown planthopper is currently the only member of the Auchenorrhyncha with an available transcriptome.
The pea pahid genome was one of the first big insect genome projects. This pest represents the Sternorrhyncha, the earliest-branching major lineage of Hemiptera.
Pachypsylla venusta is a gall-forming psyllid. This species represents a relatively early-branching lineage within the Sternorrhyncha.
Thrips such as F. occidentalis are members of the Phthiraptera, the sister-order to the Hemiptera.
The default output for local BLAST is a human-readable tabular format, give a long list of matches, including some that are very poor. However, you can adjust BLAST's output with some additional parameters in the command line. The BLAST+ manual has detailed information on all the suite's functionality. But here's a summary of some useful switches. These switches can be used in combination with one another.
This will limit the output to the top 10 matches.
It's possible for a query search and a subject sequence to align more of less well along different stretches. This is most relevant when the database includes very long sequences, that might have tandem duplications. This output can be reduced to show only the best such alignments. The example below will report only the best alignment between any query and subject sequence pair.
nscc is a multi-processor computing cluster. By default, most programs will use only one processor, but you can direct many of them to "multi-thread" and use multiple processors simultaneously.
Do this only with caution, since it will impact other users on nscc. As a rule, you should only use multi-threading on node 26. The biology node has 32 processors, and as a courtesy to other users, it's good form to use at most half of those at any time. The example below uses 16 processors.
BLAST has the ability to render its output in format other than the default table and alignment. Several options are possible, and they're described in Table C1 in the Appendix to the BLAST+ manual. If you look at this table, scroll down or search for the outfmt
entry.
For many big data bioinformatics approaches, it can be useful to get your BLAST output in a TSV format that can be imported into R or other applications. One line per result is produced with -outfmt 6
. But you'll want to specify exactly which metadata the BLAST returns. They are specified in double quotes, as in the example below.
It is also possible to create your own BLAST database that you can search locally. This can be useful if you want to restrict the database to a set of sequences that is known only by our lab group, such as a transcriptome or genome that is not yet in a public database. The makeblastdb
command can be a little finicky, so using complete paths may be helpful. Start by making a folder where the new database will reside. Here's an example.
For a complete description of the makeblastdb
functionality, check out NCBI's documentation for the command.
Blast the J. haematoloma reference transcriptome and pull the resulting sequence
Blast the J. haematoloma genome assembly.
-Dave Angelini, March 2020 (updated 15 April 2022)