UPDATE FOR ADDITION OF FIX_TARGETFILE: Troubleshooting, common issues, and recommendations

###### tags: `hybpiper_wiki` # UPDATE FOR ADDITION OF FIX_TARGETFILE: Troubleshooting, common issues, and recommendations - [1.0 Target file](#10-target-file) * [1.1 Creating a target file tailored for your study](#11-creating-a-target-file-tailored-for-your-study) + [1.1.1 Angiosperms353](#111-angiosperms353) * [1.2 Target files with low-complexity sequences: troubleshooting](#12-target-files-with-low-complexity-sequences-troubleshooting) * [1.3 Target files with protein-coding nucleotide sequences in the wrong frame](#13-target-files-with-protein-coding-nucleotide-sequences-in-the-wrong-frame) * [1.4 Target files where some representative gene sequences are much shorter than others](#14-target-files-where-some-representative-gene-sequences-are-much-shorter-than-others) * [1.5 Filtering and correcting your target file with `hybpiper fix_targetfile`](#15-filtering-and-correcting-your-target-file-with-`hybpiper-fix_targetfile`) - [2.0 Read mapping](#20-read-mapping) * [2.1 BWA vs BLAST/DIAMOND](#21-bwa-vs-blastdiamond) ## 1.0 Target file ### 1.1 Creating a target file tailored for your study Bioinformatic sequence recovery for universal target-capture bait kits can be [substantially improved](https://github.com/chrisjackson-pellicle/NewTargets) by appropriate tailoring of target files to the group under study. If you have poor gene recovery for many samples, you may want to extract the best sequences at each gene and add them to your target file. Rerunning HybPiper with this new set of targets should improve recovery for many samples. #### 1.1.1 Angiosperms353 Here are some suggestions if you are using the Angiosperms353 Universal Probes for flowering plants: - Use the "mega353" curated target file [available here](https://github.com/chrisjackson-pellicle/NewTargets). - Search for close representative sequences in the [Kew PAFTOL Tree of Life Explorer](https://treeoflife.kew.org/specimen-viewer) - Run HybPiper with distantly related targets and extract coding sequences from the best samples. Add these to the target file and re-run HybPiper for the samples with low recovery. ### 1.2 Target files with low-complexity sequences: troubleshooting Target files containing sequences with regions of low complexity can cause issues with HybPiper. This is because: 1) Many low-complexity sample reads can map to regions of low complexity in the target file sequences. These low-complexity reads are likely to comprise largely off-target reads. **Millions** of low-complexity reads can map to a low-complexity region in a target sequence. 2) The SPAdes assembler then attempts to assemble the large, low-complexity read set for the given gene. This can take a long time, and often produces hundreds of low-complexity contigs. 3) These low complexity contigs are passed on to the exon-extraction stage of the pipeline, during which the 'best' target file protein sequence for a given gene (translated from a DNA sequence if necessary) is aligned with the DNA contigs using the program Exonerate. Again, this can take a very long time if there are many contigs, and produces astronomically large log files if the `--verbose_logging` flag is used with `hybpiper assemble`. To deal with these issues, HybPiper provides: - The parameter `--timeout_assemble X`. If provided, this will kill long-running SPAdes gene assemblies if they take longer than X percent of the average gene assembly time. A running median assembly time is calculated after at least 3 assemblies have been completed. Note that **no gene sequence will be produced for the sample** if the gene assembly is killed. By default, all assemblies will be allowed to complete with no time limit. - The parameter `--timeout_exonerate_contigs X`. The default timeout for the exon-extraction stage is 120 seconds. After this time, processing of a given gene will be cancelled and **no gene sequence will be produced for the sample**. HybPiper will log a list of genes that were cancelled due to timeout. In our testing, no gene that did *not* have hundreds of low complexity SPAdes contigs took longer than 120 seconds, but this number is adjustable using the `--timeout_exonerate_contigs` parameter if necessary. **STRONGLY RECOMMENDED:** We recommend checking your target file for sequences with low-complexity regions **prior** to running `hybpiper assemble`. To assist with this, HybPiper provides the command [`hybpiper check_targetfile`](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters/_edit#70-hybpiper-check_targetfile). This will: 1) Scan along each sequence in your target file using a sliding window of 100 nucleotides (default for a DNA target file) or 50 amino-acids (default for a protein target file). The sliding window size is adjustable using the parameter `--sliding_window_size`. Within each sliding window, the complexity of the sequence will be calculated using Shannon entropy. 2) If a sequence within a sliding window falls beneath a given 'complexity threshold' (see below), the sequence is reported as having low-complexity regions. The sequence name will also be written to the `fix_targetfile_<date_time>.ctl` text file output by `hybpiper check_targetfile`. The Shannon entropy is calculated using a base of 2. This means that for a DNA target file containing only ATCG characters (i.e. assuming no N or ambiguity characters), the maximum 'complexity value' achievable is ~2.0. For a protein target file containing only ARNDCQEGHILKMFPSTWYV characters (i.e. assuming no ambiguity characters), the maximum 'complexity value' achievable is ~4.0. The default thresholds used by the `hybpiper check_targetfile` command are 1.5 (DNA) and 3.0 (protein), but this can be adjusted using the `--complexity_minimum_threshold` parameter. **Note:** Using these defaults, the sequences flagged as having low-complexity regions might differ between a DNA target file and its corresponding protein translation. We recommend removing flagged sequences from your target file before running `hybpiper assemble`, ensuring that your file still contains other representative sequences for the corresponding genes. This can be done manually, or using the command `hybpiper fix_targetfile` (see [here] for full command options, and [below] for further details). ### 1.3 Target files with protein-coding nucleotide sequences in the wrong frame **Short summary:** If your target file contains sequences that are closely related to your samples, then having protein-coding nucleotide sequences in your target file that do not begin in the correct frame (i.e. can't be translated into the corresponding protein sequence in the first forwards frame) won't matter much; the 5' and 3' ends of some recovered gene sequences might be slightly truncated. However, if your target file does *not* contain sequences closely related to your samples, having out-of-frame nucleotide sequences in your target file can matter a lot - you might see much larger truncations in the sequences recovered, or no sequence might be recovered at all for some genes. **Long explanation:** After HybPiper has assembled contigs for a given gene, it extracts coding sequence (that is, exons only) from the contigs via the following process: 1) Select the 'best' target file sequence for a given gene as a reference protein (translated from the nucleotide sequence in the first forwards frame if a nucleotide target file is provided). This _should_ correspond to the target file sequence that is most similar to a given gene/sample. 2) Align the reference protein to the nucleotide contigs using the program Exonerate, with an intron-aware model (protein2genome). 3) Extract and concatenate regions of the contigs identified as exons in the Exonerate search (i.e. regions that, when translated, have good matches to the reference protein; note that Exonerate translates the contigs in all 6 frames when trying to align the protein reference). Given the above process, issues can arise when the reference target file protein sequence is translated in the wrong frame, and the taxon from which the reference protein originates is somewhat distantly related to the given sample. This is particularly problematic if the reference sequence has been translated from the third wobble position, as these positions are more likely to have mutated at a higher rate (due to reduced positive or purifying selection) and to different nucleotides when comparing the target file sequence and the gene contig. This can often mean that the translated amino-acids differ between the two sequences. To illustrate this, here's a fragment of an Exonerate alignment between a target file reference protein and a nucleotide contig that's been assembled from sample reads. In this case, the reference protein (top) is closely related to the sample and has been translated in the correct frame: ``` 20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA : 41 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA 600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537 ``` ...whereas here is the alignment when the reference protein has been translated in the third frame: ``` 18 : ArgIleTyrTyrAlaThrGlyLeuGlyCysLysArg***LeuProAsnTrpTrpLysCysPheA : 39 !:!||||||||||||||||||||| !!||| !!|||||||||||||||||||||||| !!|||| GlnIleTyrTyrAlaThrGlyLeuArgCys***Arg***LeuProAsnTrpTrpLysArgPheA 604 : CAGATTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCA : 541 ``` As can be seen, the second alignment is less similar and contains stop codons, as expected given the query reference 'protein' (and hence the best contig hit translation) has been translated in the wrong frame. However, if the taxa from which the two sequences originate are fairly closely related, the translation of the nucleotide contig in the third frame can still match the reference 'protein' well enough for Exonerate to produce an alignment, and this results in HybPiper recovering a sequence for this gene. However, if the target file sequence is less closely related, it doesn't work as well. Here's an Exonerate alignment using the same gene contig but with a more distantly related protein reference (translated in the correct frame): ``` 20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysHisIleGlyGlyAsnIleSerThrA : 41 |||||||||||||||||||||||||||||||||||||||!!.||||||||||||:!!||||||| PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA 600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537 ``` This looks fine, but if the reference protein is translated in the third frame (i.e. sequence `N*SWLCVGEFKLLCGK*GVYYAA*PGSKR*LSHRWQHFN*RRWPTFHTLRFTSWKCTWS*SCLGRWNCSRYAYYFKERQHWI*SEALIYWK*RLTRNCH*NCNTY`), then the similarity falls beneath the threshold Exonerate requires to return an alignment; the result of the Exonerate search is empty, and HybPiper does not recover a sequence for this gene. A future release of HybPiper will include a command to correct the reading frame of such problematic target files, where possible. In the meantime, the script linked [here](https://github.com/mossmatters/HybPiper/issues/82#issuecomment-1158575310) might be useful in some cases. **STRONGLY RECOMMENDED:** We recommend correcting or removing sequences from your target file that do not start in the correct reading frame **prior** to running `hybpiper assemble`. To assist with this, HybPiper provides the command [`hybpiper fix_targetfile`](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters/#70-hybpiper-fix_targetfile) (see [here] for full command options, and [below] for further details). ### 1.4 Target files where some representative gene sequences are much shorter than others If your target file contains more than one representative sequence for a given gene, and one or more of the sequences is substantially shorter than the longest representative(s), this can potentially limit the maximum size of the gene sequence recovered for a given gene and sample. This is because HybPiper identifies exonic coding sequence in assembled SPAdes DNA contigs by aligning them with the protein sequence of the 'best' target file reference (translated from DNA if a DNA target file is used). If the 'best' reference is one of the shorter sequences for a given gene, the maximum size of the final gene sequence is limited to this length. In general, it is best to make sure that all representative sequences for a given gene are roughly the same length. You can filter your target file to acheive this using the command [`hybpiper fix_targetfile`](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters/#70-hybpiper-fix_targetfile) (see [here] for full command options, and [below] for further details) ### 1.5 Filtering and correcting a target file with `hybpiper fix_targetfile` To address the issues raised in sections 1.2 - 1.4, your target file can be filtered and fixed using the command `hybpiper fix_targetfile`. ***Inputs:*** - A text file named `fix_targetfile_<date_time>.ctl`, as written by the command `hybpiper check_targetfile`. This file contains details about the target file that was checked, the parameters used to check for sequences with low-complexity regions (and a list of any such sequence names), and some default parameters for the `hyppiper fix_targetfile` command. - If using a target file with DNA sequences, the user can optionally supply a `*.fasta` file of protein sequences that contains one correct amino-acid sequence for any given gene. Sequence `*.fasta` headers should be formatted in the same manner as the targetfile (i.e. ">TaxonID-geneID"). The amino-acid sequences will be used as references to select the correct reading frame for a given DNA gene/sequence, where necessary/possible. Note that the reference protein file does not HAVE to contain a sequence for each gene. ***Pipeline:*** If using a target file containing **DNA sequences**: 1) Select correct reading frames. Translate each sequence in each of the three forwards frames, and check for stop codons in each translation. If the `*.ctl` file provided contains the entry `NO_TERMINAL_STOP_CODONS True` or the flag `--no_terminal_stop_codons` is used, any stop codon at the C-terminal end of the translation is considered unexpected; otherwise these stop codons are ignored. 2) If only one of the three translations lacks stop codons, this frame is considered to be correct and the nucleotide sequence is adjusted accordingly (if necessary). If more than one translation lacks stop codons, the program will check for the presence of an external reference amino-acid sequence (as provided in an optional fasta file via the flag `--reference_protein_file`). i) If an external reference is found, an alignment is performed between the reference amino-acid sequence and each candidate translation, and a distance matrix is calculated. The translation with the least distance from the reference is selected, and the corresponding nucleotide sequence is adjusted accordingly (if necessary). Note that the maximum distance allowed for a candidate translation to be considered can be provided via the parameter `--maximum_distance`; this can be useful if you think some of your sequences might not actually correspond to the desired gene, and/or to filter out sequences with frameshifts that do NOT introduce stop codons. ii) If no external reference is found, each candidate translation for a given gene is set aside until all sequences in the target file have been examined. Then, for a given gene, if any sequence in the target file could be translated without stop codons in only one frame, this translation is considered to be correct, and is used as an *internal* reference to test candidate translations from other representative sequences for the given gene. Note that if more than one possible internal reference exists, the longest sequence is chosen. For each sequence with mutiple potential translations, the correct frame is chosen via an alignment/distance matrix, as described above for external references. iii) If no interal reference is found, the sequence is not included in the filtered target file, and is instead written to separate `*.fasta` file (see [Outputs](link) below). 3) Filter by sequence length. If your frame-corrected target file sequences contain more than one representative sequence for any gene, the longest sequence is selected; all other sequences are only retained if they are above a given length percentage of the longest sequence. Note that by default this length percentage is 0.0, so all sequences are retained. 4) Remove sequences with low-complexity regions. If the `*.ctl` file provided does *not* contain the entry `LOW_COMPLEXITY_SEQUENCES None`, any sequences identified as having low-complexity regions using the command `hybpiper check_targetfile` is removed. Note that such sequences can be optionally retained using the flag `--keep_low_complexity_sequences`. In either scenario, all such sequences are written to a separate `*.fasta` file (see [Outputs](link) below). 5) Alignment of protein sequences from the filtered, corrected target file. If the user provides the optional flag `alignments`, a per-gene alignment of protein sequences will be generated. If the target file contains DNA sequencs, these protein sequences will correspond to filtered, frame-corrected protein translations. If using a target file containing **protein sequences**: It is assumed that the protein sequences are correct. Only the length filtering and removal of sequences with low-complexity regions steps will be performed, along with the optional protein alignment step. **NOTE:** For each of the steps above (frame selection, length filtering, low-complexity sequence removal), it's possible that all representative sequences for a given gene will be removed. This is disallowed by default, and instead the program will exit with a list of affected genes. If you are happy to remove entire genes, you can use the flag `--allow_gene_removal`. ***Outputs:*** If using a target file (e.g. `target_file.fasta`) containing **DNA sequences**: - `target_file_fixed.fasta`. A target file in `*fasta` format containing only in-frame, filtered sequences. - `target_file_stop_codons_all_frames.fasta`. A file in `*fasta` format containing any sequences with unexpected stop codons in all forwards frames. - `target_file_undetermined_frame.fasta`. A file in `*fasta` format containing sequences that: 1) had more than one translation without unexpected stop codons; and 2) lacked an external or internal protein reference. A separate sequence is written for each such frame. - `target_file_above_maximum_distance_frames.fasta`. A file in `*fasta` format containing sequences that: 1) had more than one translation without unexpected stop codons; 2) had an external or internal protein reference; and 3) had all candidate translation greater than the given maximum allowable distance. A separate sequence is written for each such frame. - `target_file_low_complexity_regions.fasta`. A file in `*fasta` format containing sequences with low-complexity regions. - `01_dna_targetfile_gene_translated_seqs_unaligned`. A folder containing per-gene `*.fasta` files with unaligned protein translations of in-frame, filtered sequences. - `01_dna_targetfile_gene_translated_seqs_aligned`. A folder containing per-gene `*.fasta` files with aligned protein translations of in-frame, filtered sequences. If using a target file containing **protein sequences**: - `target_file_fixed.fasta`. A target file in `*fasta` format containing only filtered sequences. - `target_file_low_complexity_regions.fasta`. A file in `*fasta` format containing sequences with low-complexity regions. - `01_protein_targetfile_seqs_unaligned`. A folder containing per-gene `*.fasta` files with unaligned protein sequences of filtered sequences. - `02_protein_targetfile_seqs_aligned`. A folder containing per-gene `*.fasta` files with aligned protein sequences of filtered sequences. ## 2.0 Read mapping ### 2.1 BWA vs BLAST/DIAMOND HybPiper can map sample reads to a target file containing nucleotide sequences using BWA, or protein sequences using BLASTx or DIAMOND. While results may vary depending on the dataset and/or target file, our testing has found that using a target file with protein sequences and mapping reads to target sequences using BLASTx (translated nucleotide query and protein database) results in longer contigs for many genes, and the generation of contigs for some samples/genes that had none when using BWA mapping. This can occur even when BLASTx maps *fewer* reads overall than BWA. Consequently, **we recommend using a target file containing protein sequences and BLASTx mapping**. See below for some example comparisons between BWA and BLASTx/DIAMOND. ----- **Sample 1** (15,758,918 reads, Angiosperms353 target file): | Mapping tool | Total reads mapped | Genes with sequences | Run time * | | -------- | -------- | -------- | -------- | | BWA | 1,103,583 | 310 | 394 | | BLASTx | 1,304,929 | 340 | 2204 | | DIAMOND** | 1,325,989 | 340 | 1269 | \* 30 threads, AMD EPYC 7601 2700 MHz ** Default sensitivity For genes where both mapping methods recovered a sequence: - Average increase in length of genes for BLASTx vs BWA: ~45% (median = 20%) - Average increase in length of genes for DIAMOND vs BWA: ~45% (median = 20%) ----- **Sample 2** (611,894 reads, Angiosperms353 target file): | Mapping tool | Total reads mapped | Genes with sequences | Run time (sec) * | | -------- | -------- | -------- | -------- | | BWA | 115,784 | 206 | 202 | | BLASTx | 35,338 | 252 | 312 | | DIAMOND** | 31,707 | 249 | 222 | \* 30 threads, AMD EPYC 7601 2700 MHz ** Default sensitivity For genes where both mapping methods recovered a sequence: - Average increase in length of genes for BLASTx vs BWA: ~30% (median = 4%) - Average increase in length of genes for DIAMOND vs BWA: ~25% (median = 1%) ----- As can be seen from the statistics above, BLASTx takes significantly longer than BWA, and **DIAMOND appears to provide a good balance between sensitivity and speed**. **Note:** if you provide a nucleotide target file to `hybpiper assemble` via the `t_dna`/`targetfile_dna` parameter, but you do not provide the flag `--bwa`, HybPiper will automatically translate your target file and use BLASTx for reads mapping. It is expected that your nucleotide target file can be translated correctly (i.e. in-frame with no stop codons) from the first codon positions in the forwards frame. For nucleotide target file sequences that either a) were not a multiple of three and hence did not comprise complete codons only; or b) contained stop codons in the translated protein, a warning and list of gene names will be logged to file. **Note:** When using BLASTx/DIAMOND, we have found that an eValue of 1e-4 works well, and this is the default. The user can change this using the parameter `--evalue`.