###### tags: `hybpiper_wiki`

# Full pipeline parameters

[TOC]

This page provides the full options and parameters for each subcommand, along with additional explanations and links where necessary. The available subcommands can be viewed using the command `hybpiper --help`:

```
usage: hybpiper [-h] [--version]
                {assemble,stats,retrieve_sequences,paralog_retriever,recovery_heatmap,check_dependencies,check_targetfile}
                ...

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         Print the HybPiper version number.

Subcommands for HybPiper:
  Valid subcommands:

  {assemble,stats,retrieve_sequences,paralog_retriever,recovery_heatmap,check_dependencies,check_targetfile}
    assemble            Assemble gene, intron, and supercontig sequences
    stats               Gather statistics about the HybPiper run(s)
    retrieve_sequences  Retrieve sequences generated from multiple runs of HybPiper
    paralog_retriever   Retrieve paralog sequences for a given gene, for all samples
    recovery_heatmap    Create a gene recovery heatmap for the HybPiper run
    check_dependencies  Run a check for all pipeline dependencies and exit
    check_targetfile    Check the target file for sequences with low-complexity regions, then exit

To view parameters and help for a subcommand, use e.g. "assemble --help"
```

To view the full parameters for a particular subcommand, use `hybpiper <subcommand> --help`.

## 1.0 **`hybpiper assemble`**

```
usage: hybpiper assemble [-h] --readfiles READFILES [READFILES ...]
                         (--targetfile_dna TARGETFILE_DNA | --targetfile_aa TARGETFILE_AA)
                         [--bwa | --diamond]
                         [--diamond_sensitivity {mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}]
                         [--start_from {map_reads,distribute_reads,assemble_reads,exonerate_contigs}]
                         [--cpu CPU] [--evalue EVALUE]
                         [--max_target_seqs MAX_TARGET_SEQS]
                         [--cov_cutoff COV_CUTOFF] [--single_cell_assembly]
                         [--kvals KVALS [KVALS ...]] [--thresh THRESH]
                         [--paralog_min_length_percentage PARALOG_MIN_LENGTH_PERCENTAGE]
                         [--depth_multiplier DEPTH_MULTIPLIER]
                         [--prefix PREFIX]
                         [--timeout_assemble TIMEOUT_ASSEMBLE]
                         [--timeout_exonerate_contigs TIMEOUT_EXONERATE_CONTIGS]
                         [--target TARGET] [--exclude EXCLUDE]
                         [--unpaired UNPAIRED] [--no_stitched_contig]
                         [--bbmap_memory BBMAP_MEMORY]
                         [--bbmap_subfilter BBMAP_SUBFILTER]
                         [--bbmap_threads BBMAP_THREADS]
                         [--chimeric_stitched_contig_edit_distance CHIMERIC_STITCHED_CONTIG_EDIT_DISTANCE]
                         [--chimeric_stitched_contig_discordant_reads_cutoff CHIMERIC_STITCHED_CONTIG_DISCORDANT_READS_CUTOFF]
                         [--merged] [--run_intronerate]
                         [--keep_intermediate_files]
                         [--no_padding_supercontigs] [--verbose_logging]

optional arguments:
  -h, --help            show this help message and exit
  --readfiles READFILES [READFILES ...], -r READFILES [READFILES ...]
                        One or more read files to start the pipeline. If
                        exactly two are specified, will assume it is paired
                        Illumina reads.
  --targetfile_dna TARGETFILE_DNA, -t_dna TARGETFILE_DNA
                        FASTA file containing DNA target sequences for each
                        gene. If there are multiple targets for a gene, the
                        id must be of the form: >Taxon-geneName
  --targetfile_aa TARGETFILE_AA, -t_aa TARGETFILE_AA
                        FASTA file containing amino-acid target sequences for
                        each gene. If there are multiple targets for a gene,
                        the id must be of the form: >Taxon-geneName
  --bwa                 Use BWA to search reads for hits to target. Requires
                        BWA and a target file that is nucleotides!
  --diamond             Use DIAMOND instead of BLASTx.
  --diamond_sensitivity {mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}
                        Use the provided sensitivity for DIAMOND searches.
  --start_from {map_reads,distribute_reads,assemble_reads,exonerate_contigs}
                        Start the pipeline from the given step. Note that
                        this relies on the presence of output files for
                        previous steps, produced by a previous run attempt.
                        Default is map_reads
  --cpu CPU             Limit the number of CPUs. Default is to use all cores
                        available.
  --evalue EVALUE       e-value threshold for blastx hits, default: 0.0001
  --max_target_seqs MAX_TARGET_SEQS
                        Max target seqs to save in BLASTx search, default: 10
  --cov_cutoff COV_CUTOFF
                        Coverage cutoff for SPAdes. Default is: 8
  --single_cell_assembly
                        Run SPAdes assemblies in MDA (single-cell) mode.
                        Default is False
  --kvals KVALS [KVALS ...]
                        Values of k for SPAdes assemblies. SPAdes needs to be
                        compiled to handle larger k-values! Default is
                        auto-detection by SPAdes.
  --thresh THRESH       Percent identity threshold for retaining Exonerate
                        hits. Default is 55, but increase this if you are
                        worried about contaminant sequences.
  --paralog_min_length_percentage PARALOG_MIN_LENGTH_PERCENTAGE
                        Minimum length percentage of a contig Exonerate hit
                        vs reference protein length for a paralog warning and
                        sequence to be generated. Default is 0.75
  --depth_multiplier DEPTH_MULTIPLIER
                        Assign a long paralog as the "main" sequence if it
                        has a coverage depth <depth_multiplier> times all
                        other long paralogs. Set to zero to not use depth.
                        Default is 10
  --prefix PREFIX       Directory name for pipeline output, default is to use
                        the FASTQ file name.
  --timeout_assemble TIMEOUT_ASSEMBLE
                        Kill long-running gene assemblies if they take longer
                        than X percent of average.
  --timeout_exonerate_contigs TIMEOUT_EXONERATE_CONTIGS
                        Kill long-running processes if they take longer than
                        X seconds. Default is 120
  --target TARGET       Use the target file sequence with this taxon name in
                        Exonerate searches for each gene.
                        Other targets for that gene will be used only for
                        read sorting. Can be a tab-delimited file (one
                        <gene>\t<taxon_name> per line) or a single taxon
                        name.
  --exclude EXCLUDE     Do not use any sequence with the specified taxon name
                        string in Exonerate searches. Sequenced from this
                        taxon will still be used for read sorting.
  --unpaired UNPAIRED   Include a single FASTQ file with unpaired reads along
                        with two paired read files
  --no_stitched_contig  Do not create any stitched contigs. The longest
                        single Exonerate hit will be used.
  --bbmap_memory BBMAP_MEMORY
                        GB memory (RAM ) to use for bbmap.sh with
                        exonerate_hits.py. Default is 1.
  --bbmap_subfilter BBMAP_SUBFILTER
                        Ban alignments with more than this many
                        substitutions. Default is 7.
  --bbmap_threads BBMAP_THREADS
                        Number of threads to use for BBmap when searching for
                        chimeric stitched contig. Default is 2.
  --chimeric_stitched_contig_edit_distance CHIMERIC_STITCHED_CONTIG_EDIT_DISTANCE
                        Minimum number of differences between one read of a
                        read pair vs the stitched contig reference for a read
                        pair to be flagged as discordant.
  --chimeric_stitched_contig_discordant_reads_cutoff CHIMERIC_STITCHED_CONTIG_DISCORDANT_READS_CUTOFF
                        Minimum number of discordant reads pairs required to
                        flag a stitched contig as a potential chimera of
                        contigs from multiple paralogs
  --merged              For assembly with both merged and unmerged
                        (interleaved) reads.
  --run_intronerate     Run intronerate to recover fasta files for
                        supercontigs with introns (if present), and
                        introns-only.
  --keep_intermediate_files
                        Keep all intermediate files and logs, which can be
                        useful for debugging. Default action is to delete
                        them, which greatly reduces the total file number).
  --no_padding_supercontigs
                        If Intronerate is run, and a supercontig is created
                        by concatenating multiple SPAdes contigs, do not add
                        10 "N" characters between contig joins. By default,
                        Ns will be added.
  --verbose_logging     If supplied, enable verbose login.
                        NOTE: this can increase the size of the log files by
                        an order of magnitude.
```

**Additional information:**

**`--diamond`**

By default (i.e. when **not** providing the flag `--bwa` to the `hybpiper assemble` command) HybPiper will use BLASTx to map reads against a target file of amino-acid sequences. While BLASTx can be very sensitive, it can also be quite slow, and the read-mapping step of the pipeline can take some time if the input read files are large. An alternative is to use [DIAMOND](https://github.com/bbuchfink/diamond) in place of BLASTx. DIAMOND offers a faster alternative to BLASTx, with results that can be equally sensitive. The sensitivity of the DIAMOND algorithm can be adjusted using the parameter `--diamond_sensitivity`.

**`--cov_cutoff`**

By default, HybPiper performs per-sample/gene assemblies using SPAdes with the parameter `--cov-cutoff 8`. As noted [here][1], if the `--cov-cutoff` option is manually provided to SPAdes, it is interpreted as an "average nucleotide coverage". This value can be changed using the HybPiper pipeline parameter `--cov_cutoff <int>`. Reducing this value to 1, for example, results in a) longer contigs for many samples/genes, and/or b) contigs being generated for some samples/genes that previously had none. However, there is a trade-off between recovering more or longer contigs and confidence in the base-level accuracy of those contigs; you might prefer to lower this value only to 4 or 5 and accept shorter or missing contigs for some genes.

**`--single_cell_assembly`**

This parameter can be used to run per-sample/gene assemblies using SPAdes in MDA (single-cell) mode, i.e. supplying SPAdes with the flag `--sc`. This setting applies the uneven-coverage optimisation implemented for single-cell data, which can also assist with the uneven coverage that can result from target-capture sequencing approaches.
As for the `--cov_cutoff` setting above, using the `--single_cell_assembly` flag can result in a) longer contigs for many samples/genes, and/or b) contigs being generated for some samples/genes that previously had none. However, `--sc` mode tends to produce many more contigs than standard mode for each gene, and this can sometimes introduce incorrect sequence into the gene sequences output by HybPiper. Therefore, we currently only recommend running HybPiper with the `--single_cell_assembly` flag if essential genes are not recovered using the default SPAdes assembly, and we recommend that users **manually examine** the sequence output.

**`--timeout_assemble`**

In some situations SPAdes assembly for some genes can take a very long time; this is often due to the presence of many (i.e. millions of) low-complexity reads mapping to a given gene. While this situation should be avoided (see [here]() for discussion and recommendations), the `--timeout_assemble` parameter can also be used to provide a time limit for SPAdes assembly. Any SPAdes assembly that takes longer than the percentage of the average assembly time provided to the parameter (e.g. `--timeout_assemble 200`) will be stopped, and no sequences will be recovered for that gene.

**`--timeout_exonerate_contigs`**

In some situations (see `--timeout_assemble` above) the SPAdes assembly for some genes will comprise hundreds of low-complexity contigs. This causes the next stage of the pipeline (extraction of coding sequences from the SPAdes contigs) to take a very long time, and can produce enormous log files if `--verbose_logging` is used (see below). By default, sequence extraction for any gene that takes longer than 120 seconds will be cancelled, and no final sequence will be produced. This value can be altered using the `--timeout_exonerate_contigs` parameter. Note that HybPiper logs and prints to screen a list of any genes that were cancelled for exceeding the timeout.
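
As a rough illustration of the per-gene time limit described for `--timeout_exonerate_contigs`, the sketch below cancels a child process that runs longer than a given number of seconds. It uses only Python's standard `subprocess` module; the helper name `run_with_timeout` and the stand-in command are hypothetical, and this is not HybPiper's own implementation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return True if it finished in time, False if it was cancelled."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        subprocess.run(cmd, timeout=timeout_seconds, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for a slow sequence-extraction step (sleeps for 5 seconds).
    slow_step = [sys.executable, "-c", "import time; time.sleep(5)"]
    finished = run_with_timeout(slow_step, timeout_seconds=1)
    print("finished" if finished else "cancelled: exceeded timeout")
```

A gene whose step is cancelled in this way simply yields no sequence, which matches the behaviour described above.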

**`--target`**

This parameter can be used to provide a single target file taxon name, e.g. `Artocarpus`. The target file sequence from this taxon will be used during Exonerate searches of SPAdes contigs for each gene. Alternatively, a tab-delimited file can be provided, e.g.:

| Gene name | Target file taxon to use |
| -------- | -------- |
| gene001 | Artocarpus |
| gene002 | Morus |

**NOTE:** do **not** include the column names row in your file!

If a file is provided, it **must** specify a gene name and a target file taxon name for **each** gene in your target file.

By default, HybPiper will select the 'best' target file sequence to be used in Exonerate searches for each gene via the highest cumulative MAPQ score (if using BWA) or the highest cumulative bit score (if using BLASTx or DIAMOND).

**`--exclude`**

This parameter can be used to provide a single target file taxon name, e.g. `Artocarpus`. No target file sequence from this taxon will be used during Exonerate searches of SPAdes contigs, for any gene.

**`--no_stitched_contig`**

When reads are assembled into contigs using SPAdes for a given gene/sample, multiple contigs can be produced. These contigs can contain exons corresponding to different regions of the target sequence. By default, HybPiper will attempt to recover as much gene sequence as possible by extracting these exons from different contigs and "stitching" them together to create a contiguous sequence. HybPiper calls these multi-contig sequences `stitched_contigs`.

This approach is not without its downsides, particularly for samples that have high levels of paralogy. In such cases, creating `stitched_contigs` can result in contigs from **different paralogs** being stitched together, resulting in 'chimeric' locus sequences. This potentially results in reduced phylogenetic resolution or misleading results. HybPiper therefore provides the optional flag `--no_stitched_contig`.
When used, `stitched_contigs` will not be constructed, and only the single longest sequence originating from a single contig will be returned for each sample/gene.

**Note:** In cases where your target file corresponds to multi-exon genes with long introns, running the pipeline with the `--no_stitched_contig` flag can dramatically reduce the total locus sequence length recovered.

**`--chimeric_stitched_contig_edit_distance`**

As described above for the flag `--no_stitched_contig`, HybPiper can potentially create 'chimeric' gene sequences comprising regions from multiple paralogs. HybPiper implements a work-in-progress method to detect such putative chimeras. Details, including the effect of the related pipeline parameters **`--chimeric_stitched_contig_edit_distance`**, **`--chimeric_stitched_contig_discordant_reads_cutoff`**, **`--bbmap_memory`**, **`--bbmap_subfilter`** and **`--bbmap_threads`**, are described [here].

**Note:** The HybPiper command `hybpiper paralog_retriever` will create two folders, one with all "main" and paralog sequences, and one without any putative chimeric sequences. The default folder names are `paralogs_all` and `paralogs_no_chimeras`, respectively; these names can be user-specified.

**`--depth_multiplier`**

When HybPiper detects 'long' paralogs for a given gene/sample, it assigns one of them as the "main" sequence, and adds the suffix `.main` to the sequence FASTA header. All other paralogs are arbitrarily assigned an incrementing number as a suffix, starting at zero (`<sequence_name>.0`, `<sequence_name>.1`, etc.). The `.main` paralog is assigned first by the read depth of the corresponding SPAdes contig, i.e. if a contig has a coverage depth greater than `<depth_multiplier>` times that of all other long paralogs. The default value for `--depth_multiplier` is 10.
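
The selection of the `.main` paralog for `--depth_multiplier` can be sketched as follows. This is a minimal illustration rather than HybPiper's code: the function name `pick_main_paralog` and the input layout (a dict mapping a contig name to its depth and its percent similarity to the target) are assumptions, and the fallback to sequence similarity is included so that the function always returns a result.

```python
def pick_main_paralog(paralogs, depth_multiplier=10):
    """Return the name of the paralog to label as '.main'.

    paralogs maps a name to (contig_depth, percent_similarity_to_target).
    """
    if depth_multiplier > 0:  # a value of zero disables the depth rule
        for name, (depth, _) in paralogs.items():
            others = [d for other, (d, _) in paralogs.items() if other != name]
            # Depth rule: depth must exceed depth_multiplier x every other paralog.
            if others and all(depth > depth_multiplier * d for d in others):
                return name
    # Fallback: greatest percentage similarity to the target sequence.
    return max(paralogs, key=lambda name: paralogs[name][1])


if __name__ == "__main__":
    clear_winner = {"contig_a": (120.0, 91.5), "contig_b": (9.0, 96.2)}
    no_winner = {"contig_a": (50.0, 91.5), "contig_b": (9.0, 96.2)}
    print(pick_main_paralog(clear_winner))  # depth rule applies
    print(pick_main_paralog(no_winner))     # falls back to similarity
```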
Note that if no paralog contig has a depth that fulfills this criterion, the `.main` paralog is assigned via the greatest percentage sequence similarity to the reference target file protein sequence.

**`--merged`**

When paired-end sequencing is generated using libraries with short insert sizes, the forward and reverse reads of a given pair can overlap. In such cases, target locus recovery can be improved when SPAdes assemblies are performed with merged reads\*, that is, when a single contig is produced from overlapping pairs when possible. This can be done using the flag `--merged`. If provided, paired-end reads will be merged using `bbmerge.sh`[link] with default settings, and SPAdes assemblies will be carried out with both merged and any remaining unmerged output.

\*CJJ: benchmarking is required here; this is currently word-of-mouth although it makes intuitive sense...

**`--keep_intermediate_files`**

By default, HybPiper will delete intermediate files and folders for each gene after processing. This greatly helps to reduce the total number of files output by the HybPiper pipeline. For example, running the `hybpiper assemble` command on the tutorial test dataset produces 1,924 files by default, or 19,935 files when using the `--keep_intermediate_files` flag. Total file number can be a limiting factor when running the pipeline on some HPC systems.

Intermediate files and folders are:

- The SPAdes assembly folder for each gene.
- A `*.log` file within each gene subfolder, containing details from running the `exonerate_hits.py` module for the given gene. This log file is usually re-logged to the main sample `*.log` file and deleted.
- Fasta files containing the extracted and trimmed (if necessary) Exonerate hits used to create stitched contigs. One `*.FNA` file contains nucleotide sequences, and one `*.FAA` contains amino-acid sequences. These files are found in each `<gene_name>/<sample_name>` directory.
- `chimera_test_stitched_contig.sam`.
  A Sequence Alignment Map file containing mapping details for paired-end reads mapped against a stitched contig reference; used in chimera tests. Found in each `<gene_name>/<sample_name>` directory.

**`--no_padding_supercontigs`**

If the flag `--run_intronerate` is provided, HybPiper will attempt to recover intron sequences (if present for a given gene/sample) and a `supercontig` sequence. The `supercontig` sequence comprises both exons **and** introns; in cases where it has been constructed from more than one SPAdes contig, HybPiper will add a stretch of 10 'N' characters between abutting contigs. This can be turned off using the flag `--no_padding_supercontigs`.

**`--paralog_min_length_percentage`**

HybPiper tests for the presence of paralogs for a given gene/sample in two different ways:

1) From a set of exon-only sequences extracted from a set of SPAdes contigs (and where each exon-only sequence originates from a single contig only), it searches for more than one sequence matching a given target file reference, where each sequence is greater than 75% of the reference length. If such sequences are found, they are written to a `*.fasta` file and can be recovered using the `hybpiper paralog_retriever` command.

2) From a set of exon-only sequences extracted from a set of SPAdes contigs (and where each exon-only sequence originates from a single contig only), it calculates sequence coverage across the length of the query protein. If coverage is >1 for a given percentage threshold of the query length, a paralog warning is produced. This warning is useful as it provides an indication that paralogs might be present for the given gene/sample, without the requirement that SPAdes has assembled each paralog into an almost full-length contig. Note that these sequences are not written to file, as there is currently no way to definitively group contigs containing partial gene sequences into single-paralog clusters. Please let us know if you can think of a way!
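
The two checks in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layouts (hit lengths per contig, and 0-based end-exclusive hit coordinates on the query protein), and the use of `>=` when comparing against the threshold are all hypothetical and not taken from HybPiper's source.

```python
def long_paralogs(hit_lengths, reference_length, min_length_percentage=0.75):
    """Approach 1: return hits longer than the cut-off; more than one suggests paralogs."""
    cutoff = min_length_percentage * reference_length
    return [name for name, length in hit_lengths.items() if length > cutoff]


def paralog_warning_by_depth(hit_ranges, query_length, min_length_percentage=0.75):
    """Approach 2: warn if more than one hit covers enough of the query protein."""
    depth = [0] * query_length
    for start, end in hit_ranges:  # 0-based, end-exclusive query coordinates
        for pos in range(max(start, 0), min(end, query_length)):
            depth[pos] += 1
    # Fraction of query positions covered by more than one hit.
    covered_twice = sum(1 for d in depth if d > 1)
    return covered_twice / query_length >= min_length_percentage


if __name__ == "__main__":
    lengths = {"contig_1": 290, "contig_2": 250, "contig_3": 90}
    print(long_paralogs(lengths, reference_length=300))
    print(paralog_warning_by_depth([(0, 280), (20, 300)], query_length=300))
```

Note that the second check can fire even when no single contig passes the first, which is why it is reported as a warning only.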

HybPiper allows the length percentage cut-off used by each paralog detection approach to be specified using the `--paralog_min_length_percentage` parameter. The default value is 0.75.

**`--verbose_logging`**

When this flag is supplied, HybPiper will capture and log additional data from the pipeline run. This can be useful for debugging if you run into problems, but note that it can increase the size of the log file by an order of magnitude or more!

## 2.0 **`hybpiper stats`**

```
usage: hybpiper stats [-h]
                      (--targetfile_dna TARGETFILE_DNA | --targetfile_aa TARGETFILE_AA)
                      [--seq_lengths_filename SEQ_LENGTHS_FILENAME]
                      [--stats_filename STATS_FILENAME]
                      {gene,GENE,supercontig,SUPERCONTIG} namelist

positional arguments:
  {gene,GENE,supercontig,SUPERCONTIG}
                        Sequence type (gene or supercontig) to recover lengths
                        for
  namelist              Text file with names of HybPiper output directories,
                        one per line

optional arguments:
  -h, --help            show this help message and exit
  --targetfile_dna TARGETFILE_DNA, -t_dna TARGETFILE_DNA
                        FASTA file containing DNA target sequences for each
                        gene. If there are multiple targets for a gene, the
                        id must be of the form: >Taxon-geneName
  --targetfile_aa TARGETFILE_AA, -t_aa TARGETFILE_AA
                        FASTA file containing amino-acid target sequences for
                        each gene. If there are multiple targets for a gene,
                        the id must be of the form: >Taxon-geneName
  --seq_lengths_filename SEQ_LENGTHS_FILENAME
                        File name for the sequence lengths *.tsv file.
                        Default is <seq_lengths.tsv>.
  --stats_filename STATS_FILENAME
                        File name for the stats *.tsv file.
                        Default is= <hybpiper_stats.tsv>
```

## 3.0 **`hybpiper recovery_heatmap`**

```
usage: hybpiper recovery_heatmap [-h] [--heatmap_filename HEATMAP_FILENAME]
                                 [--figure_length FIGURE_LENGTH]
                                 [--figure_height FIGURE_HEIGHT]
                                 [--sample_text_size SAMPLE_TEXT_SIZE]
                                 [--gene_text_size GENE_TEXT_SIZE]
                                 [--heatmap_filetype {png,pdf,eps,tiff,svg}]
                                 [--heatmap_dpi HEATMAP_DPI]
                                 seq_lengths_file

positional arguments:
  seq_lengths_file      Filename for the seq_lengths file (output of the
                        'hybpiper stats' command)

optional arguments:
  -h, --help            show this help message and exit
  --heatmap_filename HEATMAP_FILENAME
                        Filename for the output heatmap, saved by default as
                        a *.png file. Defaults to "recovery_heatmap"
  --figure_length FIGURE_LENGTH
                        Length dimension (in inches) for the output heatmap
                        file. Default is automatically calculated based on
                        the number of genes
  --figure_height FIGURE_HEIGHT
                        Height dimension (in inches) for the output heatmap
                        file. Default is automatically calculated based on
                        the number of samples
  --sample_text_size SAMPLE_TEXT_SIZE
                        Size (in points) for the sample text labels in the
                        output heatmap file. Default is automatically
                        calculated based on the number of samples
  --gene_text_size GENE_TEXT_SIZE
                        Size (in points) for the gene text labels in the
                        output heatmap file. Default is automatically
                        calculated based on the number of genes
  --heatmap_filetype {png,pdf,eps,tiff,svg}
                        File type to save the output heatmap image as.
                        Default is *.png
  --heatmap_dpi HEATMAP_DPI
                        Dot per inch (DPI) for the output heatmap image.
                        Default is 300
```

## 4.0 **`hybpiper retrieve_sequences`**

```
usage: hybpiper retrieve_sequences [-h]
                                   (--targetfile_dna TARGETFILE_DNA | --targetfile_aa TARGETFILE_AA)
                                   [--sample_names SAMPLE_NAMES]
                                   [--single_sample_name SINGLE_SAMPLE_NAME]
                                   [--hybpiper_dir HYBPIPER_DIR]
                                   [--fasta_dir FASTA_DIR]
                                   [--skip_chimeric_genes]
                                   [--stats_file STATS_FILE]
                                   [--filter_by column comparison_symbol threshold]
                                   {dna,aa,intron,supercontig}

positional arguments:
  {dna,aa,intron,supercontig}
                        Type of sequence to extract

optional arguments:
  -h, --help            show this help message and exit
  --targetfile_dna TARGETFILE_DNA, -t_dna TARGETFILE_DNA
                        FASTA file containing DNA target sequences for each
                        gene. If there are multiple targets for a gene, the
                        id must be of the form: >Taxon-geneName
  --targetfile_aa TARGETFILE_AA, -t_aa TARGETFILE_AA
                        FASTA file containing amino-acid target sequences for
                        each gene. If there are multiple targets for a gene,
                        the id must be of the form: >Taxon-geneName
  --sample_names SAMPLE_NAMES
                        Directory containing Hybpiper output OR a file
                        containing HybPiper output names, one per line
  --single_sample_name SINGLE_SAMPLE_NAME
                        A single sample name to recover sequences for
  --hybpiper_dir HYBPIPER_DIR
                        Specify directory containing HybPiper output
  --fasta_dir FASTA_DIR
                        Specify directory for output FASTA files
  --skip_chimeric_genes
                        Do not recover sequences for putative chimeric genes
  --stats_file STATS_FILE
                        Stats file produced by "hybpiper stats", required for
                        selective filtering of retrieved sequences
  --filter_by column comparison threshold
                        Provide three space-separated arguments: 1) column of
                        the stats_file to filter by, 2) "greater" or
                        "smaller", 3) a threshold - either an integer (raw
                        number of genes) or float (percentage of genes in
                        analysis). This parameter can be supplied more than
                        once to filter by multiple criteria..
```

## 5.0 **`hybpiper paralog_retriever`**

```
usage: hybpiper paralog_retriever [-h]
                                  (--targetfile_dna TARGETFILE_DNA | --targetfile_aa TARGETFILE_AA)
                                  [--fasta_dir_all FASTA_DIR_ALL]
                                  [--fasta_dir_no_chimeras FASTA_DIR_NO_CHIMERAS]
                                  [--paralog_report_filename PARALOG_REPORT_FILENAME]
                                  [--paralogs_above_threshold_report_filename PARALOGS_ABOVE_THRESHOLD_REPORT_FILENAME]
                                  [--paralogs_list_threshold_percentage PARALOGS_LIST_THRESHOLD_PERCENTAGE]
                                  [--heatmap_filename HEATMAP_FILENAME]
                                  [--figure_length FIGURE_LENGTH]
                                  [--figure_height FIGURE_HEIGHT]
                                  [--sample_text_size SAMPLE_TEXT_SIZE]
                                  [--gene_text_size GENE_TEXT_SIZE]
                                  [--heatmap_filetype {png,pdf,eps,tiff,svg}]
                                  [--heatmap_dpi HEATMAP_DPI]
                                  namelist

positional arguments:
  namelist              Text file containing list of HybPiper output
                        directories, one per line.

optional arguments:
  -h, --help            show this help message and exit
  --targetfile_dna TARGETFILE_DNA, -t_dna TARGETFILE_DNA
                        FASTA file containing DNA target sequences for each
                        gene. Used to extract unique gene names for paralog
                        recovery. If there are multiple targets for a gene,
                        the id must be of the form: >Taxon-geneName
  --targetfile_aa TARGETFILE_AA, -t_aa TARGETFILE_AA
                        FASTA file containing amino-acid target sequences for
                        each gene. Used to extract unique gene names for
                        paralog recovery. If there are multiple targets for a
                        gene, the id must be of the form: >Taxon-geneName
  --fasta_dir_all FASTA_DIR_ALL
                        Specify directory for output FASTA files (ALL).
                        Default is "paralogs_all".
  --fasta_dir_no_chimeras FASTA_DIR_NO_CHIMERAS
                        Specify directory for output FASTA files (no putative
                        chimeric sequences). Default is
                        "paralogs_no_chimeras".
  --paralog_report_filename PARALOG_REPORT_FILENAME
                        Specify the filename for the paralog *.tsv report
                        table
  --paralogs_above_threshold_report_filename PARALOGS_ABOVE_THRESHOLD_REPORT_FILENAME
                        Specify the filename for the *.txt list of genes with
                        paralogs in <paralogs_list_threshold_percentage>
                        number of samples
  --paralogs_list_threshold_percentage PARALOGS_LIST_THRESHOLD_PERCENTAGE
                        Percent of total number of samples and genes that
                        must have paralog warnings to be reported in the
                        <genes_with_paralogs.txt> report file. The default is
                        0.0, meaning that all genes and samples with at least
                        one paralog warning will be reported
  --heatmap_filename HEATMAP_FILENAME
                        Filename for the output heatmap, saved by default as
                        a *.png file. Defaults to "paralog_heatmap"
  --figure_length FIGURE_LENGTH
                        Length dimension (in inches) for the output heatmap
                        file. Default is automatically calculated based on
                        the number of genes
  --figure_height FIGURE_HEIGHT
                        Height dimension (in inches) for the output heatmap
                        file. Default is automatically calculated based on
                        the number of samples
  --sample_text_size SAMPLE_TEXT_SIZE
                        Size (in points) for the sample text labels in the
                        output heatmap file. Default is automatically
                        calculated based on the number of samples
  --gene_text_size GENE_TEXT_SIZE
                        Size (in points) for the gene text labels in the
                        output heatmap file. Default is automatically
                        calculated based on the number of genes
  --heatmap_filetype {png,pdf,eps,tiff,svg}
                        File type to save the output heatmap image as.
                        Default is png
  --heatmap_dpi HEATMAP_DPI
                        Dots per inch (DPI) for the output heatmap image.
                        Default is 300
```

**NOTE:** if paralogs are detected for a given gene/sample, the sequence in the paralog folder with the suffix `*.main` will not necessarily be identical to the corresponding `*.FNA` sequence.
This is because each paralog sequence is recovered from a single SPAdes contig only, whereas the `*.FNA` sequence could be derived from a stitched contig (comprising sequence from more than one SPAdes contig).

[1]: https://github.com/ablab/spades/issues/18#issuecomment-305639398

## 6.0 **`hybpiper check_dependencies`**

```
usage: hybpiper check_dependencies [-h]

optional arguments:
  -h, --help  show this help message and exit
```

## 7.0 **`hybpiper check_targetfile`**

```
usage: hybpiper check_targetfile [-h]
                                 (--targetfile_dna TARGETFILE_DNA | --targetfile_aa TARGETFILE_AA)
                                 [--sliding_window_size SLIDING_WINDOW_SIZE]
                                 [--complexity_minimum_threshold COMPLEXITY_MINIMUM_THRESHOLD]

optional arguments:
  -h, --help            show this help message and exit
  --targetfile_dna TARGETFILE_DNA, -t_dna TARGETFILE_DNA
                        FASTA file containing DNA target sequences for each
                        gene. If there are multiple targets for a gene, the
                        id must be of the form: >Taxon-geneName
  --targetfile_aa TARGETFILE_AA, -t_aa TARGETFILE_AA
                        FASTA file containing amino-acid target sequences for
                        each gene. If there are multiple targets for a gene,
                        the id must be of the form: >Taxon-geneName
  --sliding_window_size SLIDING_WINDOW_SIZE
                        Number of characters (single-letter DNA or amino-acid
                        codes) to include in the sliding window for
                        low-complexity check
  --complexity_minimum_threshold COMPLEXITY_MINIMUM_THRESHOLD
                        Minimum threshold value. Beneath this value, the
                        sequence in the sliding window is flagged as
                        low-complexity, and the corresponding target file
                        sequence is reported as having low-complexity regions
```

**Additional information:**

**`--sliding_window_size`**

By default, the sliding window size used to scan across each sequence in the target file is 50 characters (amino-acids) for a protein file, and 100 characters (nucleotides) for a DNA file.

**`--complexity_minimum_threshold`**

By default, the threshold below which a DNA sequence within the sliding window will be flagged as low-complexity is set to `1.5`.
However, for a DNA target file containing only `ATCG` characters (i.e. assuming no `N` or ambiguity characters), the maximum 'complexity value' achievable is `2.0`, and so alternative values up to `2.0` can be used.

By default, the threshold below which a protein sequence within the sliding window will be flagged as low-complexity is set to `3.0`. However, for a protein target file containing only `ARNDCQEGHILKMFPSTWYV` characters (i.e. assuming no ambiguity characters), the maximum 'complexity value' achievable is `4.0`, and so alternative values up to `4.0` can be used.
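
To make the sliding-window idea concrete, here is a minimal sketch of a low-complexity scan. It assumes the 'complexity value' behaves like base-2 Shannon entropy of the characters in each window, which is consistent with the maximum of `2.0` quoted above for a four-letter DNA alphabet; this is an assumption for illustration, and HybPiper's actual metric may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (base 2) of the characters in window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, window_size=100, threshold=1.5):
    """Return start positions of windows that score below threshold."""
    starts = []
    # Slide one character at a time; a short sequence gives a single window.
    for start in range(0, max(len(sequence) - window_size, 0) + 1):
        window = sequence[start:start + window_size]
        if window and shannon_entropy(window) < threshold:
            starts.append(start)
    return starts


if __name__ == "__main__":
    balanced = "ACGT" * 30       # all four bases equally frequent
    repetitive = "AT" * 60       # only two bases present
    print(low_complexity_windows(balanced))         # no windows flagged
    print(len(low_complexity_windows(repetitive)))  # every window flagged
```

Under this assumption, raising the threshold towards the maximum flags more windows, so values close to the maximum are likely to report most target sequences.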