hybpiper_wiki
This page provides the full options and parameters for each subcommand, along with additional explanations and links where neccessary. The available subcommands can be viewed using the command hybpiper --help
:
To view the full parameters for a particular subcommand, use hybpiper <subcommand> --help
.
hybpiper assemble
Additional information:
--diamond
By default (i.e. when not providing the flag --bwa
to the hybpiper assemble
command) HybPiper will use BLASTx to map reads against a target file of amino-acid sequences. While BLASTx can be very sensitive, it can also be quite slow, and the read-mapping step of the pipeline can take some time if the input read files are large. An alternative is to use DIAMOND in place of BLASTx. DIAMOND offers a faster alternative to BLASTx that with results that can be equally sensitive. The sensitivity of the DIAMOND algorithm can be adjusted using the parameter --diamond_sensitivity
.
--cov_cutoff
By default, HybPiper performs per-sample/gene assemblies using SPAdes with the parameter --cov-cutoff 8
. As noted here, if the --cov-cutoff
option is manually provided to SPAdes, it is interpreted as an "average nucleotide coverage". This value can be changed using the HybPiper pipeline parameter --cov_cutoff <int>
. Reducing this value to 1, for example, results in a) longer contigs for many samples/genes, and/or b) contigs being generated for some samples/genes that previously had none. However, there is an obvious tradeoff between more/increased length contigs, and confidence in base-level accuracy of contigs - you might want to lower this value to only 4 or 5 and accept some reduced contig lengths/presences for some genes.
--single_cell_assembly
This parameter can be used to run per-sample/gene assemblies using SPAdes in MDA (single-cell) mode, i.e. supplying SPAdes with the flag --sc
. This setting provides uneven coverage optimisation implemented for single-cell data, which can also assist with the uneven coverage that can result from target-catpure sequencing approaches. As for the --cov_cutoff
setting above, using the --single_cell_assembly
flag can result in a) longer contigs for many samples/genes, and/or b) contigs being generated for some samples/genes that previously had none. However, --sc
mode tends to produce many more contigs that standard mode for each gene, and this can sometimes result in the introduction of incorrect sequence in the gene sequences output by HybPiper. Therefore, we currently only reccommend running HybPiper with the --single_cell_assembly
flag if essential genes are not recovered using the default SPAdes assembly, and that users manually examine the sequence output.
--timeout_assemble
In some situations SPAdes assembly for some genes can take a very long time; this is often due to the presence of many (i.e. millions) of low-complexity reads mapping to a given gene. While this situation should be avoided (see here for discussion and recommendations), the --timeout_assemble
parameter can also be used to provide a time limit for SPAdes assembly. Any SPAdes assembly that is taking longer than the a percentage value provided to the paramters (e.g. --timeout_assemble 200
) will be stopped, and no sequences will be recovered for that gene.
--timeout_exonerate_contigs
In some situations (see --timeout_assemble
above) the SPAdes assembly for some genes will comprise hundreds of low-complexity contigs. This causes the next stage of the pipeline - extraction of coding sequences from the SPAdes contigs - to take a very long time, and can produce enormous log files if --verbose_logging
is used (see below). By default, sequence extraction for any gene that takes longer than 120 seconds will be cancelled, and no final sequence will be produced NOTE logged and printed to screen. This value can be altered using the --timeout_exonerate_contigs
parameter. Note that HybPiper logs and prints a list of genes if they were cancelled due to exceeding the timeout.
--target
This parameter can be used to provide a single target file taxon name, e.g. Artocarpus
. The target file sequence from this taxon will be used during Exonerate searches of SPAdes contigs for each gene. Alternatively, a tab-delimited file can be provided, e.g.:
Gene name | Target file taxon to use |
---|---|
gene001 | Artocarpus |
gene002 | Morus |
NOTE: do not include the column names row in your file! If a file is provided, it must specify a gene name and a target file taxon name for each gene in your target file.
By default, HybPiper will select the 'best' target file sequence to be used in Exonerate searches for each gene via the highest cumalative MAPQ score (if using BWA) or the highest cumulative bit score (if using BLASTx or DIAMOND).
--exclude
This parameter can be used to provide a single target file taxon name, e.g. Artocarpus
. No target file sequence from this taxon will be used during Exonerate searches of SPAdes contigs, for any gene.
--no_stitched_contig
When reads are assembled in to contigs using SPAdes for a given gene/sample, multiple contigs can be produced. These contigs can contain exons corresponding to different regions of the target sequence. By default, HybPiper will attempt to recover as much gene sequence as possible by extracting these exons from different contigs and "stitching" them together to create a contiguous sequence. HybPiper calls these multi-contig sequences stitched_contigs
.
This approach is not without its downsides, particularly for samples that have high levels of paralogy. In such cases, creating stitched_contigs
can result in contigs from different paralogs being stitched together, resulting in 'chimeric' locus sequences. This potentially results in reduced phylogenetic resolution or misleading results. HybPiper therefore provides the optional flag --no_stitched_contig
. When used, stitched_contigs
will not be constructed and only single longest sequence originating from a single contig will be returned for each sample/gene.
Note: In cases where your target file corresponds to multi-exon genes with long introns, running the pipeline with the --no_stitched_contig
flag can dramatically reduce the total loci sequence length recovered.
--chimeric_stitched_contig_edit_distance
As described above for the flag --no_stitched_contig
, HybPiper can potentially create 'chimeric' gene sequences comprising regions from multiple paralogs.
HybPiper implements a work-in-progress method to detect such putative chimeras. Details, including the effect of the related pipeline parameters:
--chimeric_stitched_contig_edit_distance
, --chimeric_stitched_contig_discordant_reads_cutoff
, --bbmap_memory
, --bbmap_subfilter
and --bbmap_threads
…are described [here].
Note: The HybPiper command hybpiper paralog_retriever
will create two folders, one with all "main" and paralog sequences, and one without any putative chimeric sequences. The default folder names are paralogs_all
and paralogs_no_chimeras
, respectively; these name can be user-specified.
--depth_multiplier
When HybPiper detect 'long' paralogs for a given gene/sample, it assigns one of them as the "main" sequence, and add the suffix .main
to the sequence FASTA header. All other paralogs are arbitrarily assignedan incremment number as a suffix, starting at zero (<sequence_name>.0
, <sequence_name>.1
, etc.). The .main
paralog is assigned first by the read depth of the corresponding SPAdes contig, i.e. if a contig has a coverage depth greater than <depth_multiplier> times that of all other long paralogs. The default value for --depth_multiplier
is 10. Note that if no paralog contig has a depth that fulfills this critereon, the .main
paralog is assigned via the greatest percentage sequence similarity to the reference target file protein sequence.
--merged
When paired-end sequencing is generated using libraries with short insert sizes, the forwards and reverse reads of a given pair can overlap. In such cases, target locus recovery can be improved when SPAdes assemblies are performed with merged reads*, that is, when a single contig is produced from overlapping pairs when possible. This can be done using the flag --merged
. If provided, paired-end reads will be merged using bbmerge.sh
[link] with default settings, and SPAdes assemblies will be carried out with both merged and any remaining unmerged output.
*CJJ: benchmarking is required here; this is currently word-of-mouth although it makes intuitive sense…
--keep_intermediate_files
By default, HybPiper will delete intermediate files and folders for each gene after processing. This greatly helps to reduce the total number of files output by the HybPiper pipeline. For example, running the hybpiper assemble
commmand on the tutorial test dataset produces 1,924 files by default, or 19,935 files when using the --keep_intermediate_files
flag. Total file number can be a limiting factor when running the pipeline on some HPC systems. Intermediate files and folders are:
*.log
file within each gene subfolder, containing details from running the exonerate_hits.py
module for the given gene. This log file is usually re-logged to the main sample *.log
file and deleted.*.FNA
file contains nucleotide sequences, and one *.FAA
contains amino-acid sequences. These files are found in each <gene_name>/<sample_name>
directory.chimera_test_stitched_contig.sam
. A Sequence Alignment Map file containing mapping details for paired-end reads mapped against a stitched contig reference; used in chimera tests. Found in each <gene_name>/<sample_name>
directory.--no_padding_supercontigs
If the flag --run_intronerate
is provided, HybPiper will attempt to recover intron sequences (if present for a given gene/sample) and a supercontig
sequence. The supercontig
sequence comprises both exons and introns; in cases where it has been constructed from more than one SPAdes contig, HybPiper will add a stretch of 10 'N' characters bewtween abutting contigs. This can be turned off using the flag --no_padding_supercontigs
--paralog_min_length_percentage
HybPiper tests for the presence of paralogs for a given gene/sample in two different ways:
From a set of exon-only sequences extracted from a set of SPAdes contigs (and where each exon-only sequence originates from a single contig only), it searches for more than one sequence matching a given target file reference, where each sequence is greater than 75% of the reference length. If such sequences are found, they are written to *.fasta
file and can be recovered using the hybpiper paralog_retriever
command.
From a set of exon-only sequences extracted from a set of SPAdes contigs (and where each exon-only sequence originates from a single contig only), calculate sequence coverage across the length of the query protein. If coverage is >1 for a given percentage threshold of the query length, produce a paralog warning. This warning is useful as it provides an indication that paralogs might be present for the given gene/sample, without the requirement that SPAdes has assembled each paralog into an almost full-length contig. Note that these sequences are not written to file, as there is currently no way to definitively group contigs containing partial gene sequences into single-paralog clusters. Please let us know if you can think of a way!
Hybpiper allows the length percentage cut-off used by each paralog detection approach to be specified using the --paralog_min_length_percentage
parameter. The default value is 0.75.
--verbose_logging
When this flag is supplied, hybpiper will capture and log additional data from the pipeline run. This can be useful if you run in to problems/debugging, but not that it can increase the size of the log file by an order of magnitide or more!
hybpiper stats
hybpiper recovery_heatmap
hybpiper retrieve_sequences
hybpiper paralog_retriever
NOTE: if paralogs are detected for a given gene/sample, the sequence in the paralog folder with the suffix *.main
will not necessarily be identical to the corresponding *.FNA
sequence. This is because each paralog sequence is recovered from a single SPAdes contig only, whereas the *.FNA
sequence could be derived from a stitched contig (comprising sequence from more than one SPAdes contig).
hybpiper check_dependencies
hybpiper check_targetfile
Additional information:
--sliding_window_size
By default, the sliding window size used to scan across each sequence in the target file is 50 characters (amino-acids) for a protein file, and 100 characters (nucleotides) for a DNA file.
--complexity_minimum_threshold
By default, the threshold below which a DNA sequence within the sliding window will be flagged as low-complexity is set to 1.5
. However, for a DNA target file containing only ATCG
characters (i.e. assuming no N
or ambiguity characters), the maximum 'complexity value' acheivable is 2.0
, and so alternative values up to 2.0
can be used.
By default, the threshold below which a protein sequence within the sliding window will be flagged as low-complexity is set to 3.0
. However, for a protein target file containing only ARNDCQEGHILKMFPSTWYV
characters (i.e. assuming no ambiguity characters), the maximum 'complexity value' acheivable is 4.0
, and so alternative values up to 4.0
can be used.