Assessing Paralogs

###### tags: `hybpiper_wiki` # Assessing Paralogs When running the `hybpiper assemble` command, you may notice warnings about paralogs. These warnings are triggered in two scenarios: 1) When HybPiper detects multiple contigs containing long coding sequences - by default at least 75% of the reference sequence. HybPiper will choose among these competing long contigs by first checking whether one of the contigs has coverage depth that greatly exceeds the others (10x by default). If all competing long contigs have similar depth, the sequence with the greatest percent identity to the reference is chosen. 2) When the set of sequences extracted from contigs has a coverage of >1 for 75% (default) of the reference sequence length. These warnings are collated and written to report files for each sample within the main sample directory e.g.: - `EG98/EG98_genes_with_long_paralog_warnings.txt` - `EG98/EG98_genes_with_paralog_warnings_by_contig_depth.csv` ## Scenerio 1: long paralogs In scenario 1 (long paralogs detected), the criteria used to select the main sequence may not be ideal in all cases. Choosing the appropriate gene copy to use for phylogenetics requires careful consideration, and so HybPiper flags the genes that may require further attention. Note that there may be other reasons for multiple long-length contigs: recent polyploidy, contamination, or even allelic variation may also result in this. HybPiper includes the post-processing command `hybpiper paralog_retriever` to recover coding sequences from alternative long paralogs. To run it on the test data, provide the `namelist.txt` file and the target_file (used to recover gene names): ``` hybpiper paralog_retriever namelist.txt test_targets.fasta ``` The command will produce a number of output files/folders. These are: - **`paralog_report.tsv`**: a report file in tab-seperated-values format. Tabulates the number of paralogs for each gene and sample e.g.: | Species | gene030 | gene111 | gene293 | | -------- | -------- | -------- | -------- | | EG30 | 1 | 1 | 1 | | EG98 | 5 | 2 |2 | | MWL2 | 4 | 1 |2 | | etc... | | | | - **`paralog_heatmap.png`**: a heatmap visualisation of the `paralog_report.tsv` file, e.g.: ![](https://i.imgur.com/LTXau3C.png) - **`paralogs_above_threshold_report.txt`**: a text report file that lists 1) The number and names of genes with paralogs in a minimum percentage of samples; 2) The number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs will be reported. - **`paralogs_all`**: a folder containing a `*.fasta` file for each sample/gene, containing paralog sequences of present, or the `*.FNA` sequence recovered by HybPiper is no paralogs were detected. - **`paralogs_no_chimeras`**: a folder containing a `*.fasta` file for each sample/gene as above, but with any putative chimeric `*.FNA` sequences removed. If many samples have more than one copy for many genes, it may indicate an ancient gene duplication. If one sample tends to have many copies, it may indicate it is a polyploid. Not all samples have paralog warnings, such as EG30. Some genes do not have any paralogs found, such as gene001. However, HybPiper recovered coding sequence from two contigs in the sample EG98 for gene002: ``` >EG98.main NODE_3_length_1380_cov_36.695132,Artocarpus-gene002,0,282,90.85,(1),17,869 >EG98.0 NODE_4_length_1359_cov_18.996753,Artocarpus-gene002,17,282,87.59,(1),0,786 ``` The suffix `.main` refers to the paralog that was chosen during the initial run of HybPiper. The original names of the contigs from SPAdes are also shown, along with some statistics from Exonerate searches of the contigs. The `.main` contig had about twice as much depth of coverage as the other paralog (36x vs 18x), and also had a higher percent identity (90.85% vs 87.59%). Are these two sequences paralogs or alleles? The best way to check is to use multiple samples and build gene trees from the sequences. The unaligned `*.fasta` files generated by `hybpiper paralog_retriever` can be used in a phylogenetics pipeline to align and reconstruct a phylogeny. ---- If you have `mafft` and `FastTree` installed, you can create the tree directly from a `*.paralogs_all.fasta` file using pipes: ``` cat gene660_paralogs_all.fasta | mafft --auto - | FastTree -nt -gtr > gene660.paralogs.tre ``` ***FixMe:*** _None of the genes now return a toplogy like that in the original figure below i.e. multiple genes with 'paralogs' that group together. There are one or two trees with a 'paralog' pair that groups together - sufficient?_ TREE FIGURE HERE From the tree above (plotted using FigTree), you can see that for each species, the two sequences recovered form a clade. This means there is no evidence of an ancient duplication event for this gene. There may be another explanation for why HybPiper commonly found two competing sequences, such as alleles. For phylogenetics, choosing the `main` sequence for each species is sufficient. Other genes have more complicated histories: ``` cat gene660_paralogs_all.fasta | mafft --auto - | FastTree -nt -gtr > gene660.paralogs.tre ``` ![](https://i.imgur.com/aF2p4K7.png) We see that there are two distinct clades of sequences, sister to a clade of the outgroups (NZ874, EG30, and NZ281). The `.main` sequences form one clade, suggesting that this gene had an ancestral duplication within the outgroup. Cases like the one above were common in the *Artocarpus* data on which the test dataset is based. By investigating each gene tree, it is possible to delineate which sequence is appropraite for phylogenetics. However, some samples may be missing one or both copies of the duplicated gene. Although the genes may really be lost, it is also possible that HybPiper was not able to recover both copies. ***FixMe:*** _Matt - would you still recommend the approach described below?_ **MJ** I think so, [it is now published](https://academic.oup.com/sysbio/article/70/3/558/5911134) and we could cite it. To test this possibility, one option is to create a new target file for HybPiper that includes examples of both copies. HybPiper will map reads against each copy separately, treating them as separate loci. For the *Artocarpus* dataset, this allowed us to nearly double the number of "loci" for phylogenetics! ## Scenerio 1: paralogs warning by contig depth In scenario 2, paralog warning are produced when the set of sequences extracted from SPAdes contigs has a depth greater than 1 across a given percentage length (default 75%) of the reference sequence for that gene/sample. This warning is useful as it provides an indication that paralogs might be present for the given gene/sample, without the requirement that SPAdes has assembled each paralog into an almost full-length contig. The latter scenario is unlikely if your target genes contain long introns, meaning that long paralog warnings would nor be produced even though paralogs are present in your dataset for that gene/sample. Note that for paralog warning based on sequence depth, **no sequences are written to file**. This is because there is currently no way to definitively group contigs containing partial gene sequences into single-paralog clusters for concatenation. Please let us know if you can think of a way!