Agarista HybSeq

# Agarista HybSeq We received 79 samples of Agarista for target enrichment with Angiosperms353. We prepared libraries using the SPTLabTech Mosquito HV and Dragonfly liquid handling robots, allowing us to use 1/10 the normal reaction volume. Enriched libraries were combined with several other projects and sequenced using NovaSeq6000 with an SP 2x250 flow cell. ## Trimming Sequences I used [FastP](https://github.com/OpenGene/fastp) to trim sequences using default settings. FastP is good at removing polyG sequences, which is common when the insert size is less than the read length, as is the case here. ## HybPiper ### Custom Target File Recovering sequences is much more efficient when using a custom target file with representative sequences closer to the sequenced taxa. I downloaded sequences from [The Kew Tree of Life Explorer ](https://https://treeoflife.kew.org/tree-of-life) for *[Agarista salsifolia](https://treeoflife.kew.org/tree-of-life/12189)* and two other members of the same clade of Ericaceae, [Craibiodendron yunnanense](https://treeoflife.kew.org/tree-of-life/11987), and [Pieris formosa](https://treeoflife.kew.org/tree-of-life/12487). I used a quick Python script to transform the Angiosperms353 sequences from all three species into the format required for HybPiper (e.g. `Agarista-4471`, `Pieris-4471`) and combined them into a single target file for HybPiper. I used [HybPiper](https://github.com/mossmatters/hybpiper) version 2.3.0 to assemble gene sequences for all 79 samples. First, I put all library names were put into a textfile `DALASTRAseq.txt` and used `parallel` to run five assemblies at once, each using 12 CPU threads. ```bash parallel -j 5 hybpiper assemble -r /media/johnsonseq5TB/HybSeq_20241014/fastq/Dalastra/{}/{}*.fastq.gz -t_dna agarista.angiosperms353.targets.fasta --prefix {} --compress_sample_folder --cpu 12 --bwa --force_overwrite :::: DALASTRAseq.txt ``` I then ran the `stats` and `recovery_heatmap` functions of HybPiper. ```bash hybpiper stats -t_dna agarista.angiosperms353.targets.fasta gene DALASTRAseq.txt hybpiper recovery_heatmap seq_lengths.tsv ``` ![agarista.recovery_heatmap](https://hackmd.io/_uploads/r17xnjflyg.png) Things are looking pretty great! The two genes missing were also not present in any of the three reference sequences. An average of 313 genes were recovered with sequences at least 50% of the expected length. That number is a bit lower only because of a few samples that seemed to not work well at all: | Name | NumReads | ReadsMapped | PctOnTarget | GenesMapped | GenesWithContigs | GenesWithSeqs | |-|-|-|-|-|-|-| | DALASTRAseq016 | 11782 | 1599 | 13.6 | 68 | 1 | 1 | | DALASTRAseq044 | 79870 | 39072 | 48.9 | 342 | 93 | 87 | | DALASTRAseq072 | 147984 | 76534 | 51.7 | 347 | 244 | 232 | | DALASTRAseq031 | 176010 | 102022 | 58 | 349 | 182 | 171 | As expected, gene recovery was very good (> 200 genes) when there were at least 150000 mapped reads. ## Paralogs The HybPiper stats script revealed about 30 genes per sample were getting flagged as having paralogs. I ran the `hybpiper paralog_retriever` function to get more information. I used 3% as a (very conservative) threshold for identifying a gene or sample as having too many paralogs. ```bash hybpiper paralog_retriever -t_dna agarista.angiosperms353.targets.fasta DALASTRAseq.txt --paralogs_list_threshold_percentage 0.03 ``` HybPiper identified the 75 genes as possibly having paralogs (at least 4 samples). Since the sequence recovery was so good, I think the fastest way to deal with this is to simply delete these genes from the downstream analysis. ```python paralogs = [x.rstrip() for x in open("genes_with_paralogs.txt")] all_genes = [x.rstrip() for x in open("genelist.txt")] with open("genelist_noparalogs.txt",'w') as outfile: for gene in set(all_genes).difference(set(paralogs)): outfile.write(gene + "\n") ``` As for samples, there were 34 samples with paralogs identified in at least 10 genes. However three samples stood out as having paralogs identified in at least 30 genes, and another three with paralogs in 20 or more genes: | Species | Paralogs | |-|-| | DALASTRAseq068 | 54 | | DALASTRAseq057 | 53 | | DALASTRAseq074 | 49 | | DALASTRAseq055 | 27 | | DALASTRAseq063 | 21 | | DALASTRAseq062 | 20 | Without additional information I think it's best to keep all of these in the analysis but just keep these samples in mind in case there's something real strange happening downstream. ## Sequences and Alignments ### Retrieving Sequences First, I separated the sequences from the three references taken from Kew into 351 gene files. ```python from Bio import SeqIO seqs={} for seq in SeqIO.parse("../agarista.angiosperms353.targets.fasta","fasta"): try: seqs[seq.id.split("-")[1]].append(seq) except KeyError: seqs[seq.id.split("-")[1]] = [seq] for gene in seqs: SeqIO.write(seqs[gene],"Agarista."+gene+".FNA",'fasta') ``` I then used `hybpiper retrieve_sequences` to extract gene (coding) sequence from the HybPiper output. ``` hybpiper retrieve_sequences -t_dna agarista.angiosperms353.targets.fasta --fasta_dir recovered_sequences --sample_names DALASTRAseq.txt dna ``` I made a list of all the genes recovered: ```bash= ls recovered_sequences/ | cut -f 1 -d '.' > genelist.txt ``` I merged the reference and recovered sequences for each gene. ```bash= parallel "cat reference_seqs/Agarista.{}.FNA recovered_sequences/{}.FNA > combined_sequences/Agarista.{}.unaligned.FNA" :::: genelist.txt ``` ### Alignment and Trimming I used MAFFT v7.520 (2023/Mar/22) to align each gene sequence: ```bash= parallel "mafft combined_sequences/Agarista.{}.unaligned.FNA > MAFFT/Agarista.{}.mafft.fasta" :::: genelist.txt ``` Note: since these are coding sequences it would probably be better to compare alignment of these sequences to a frame-aware alignment using MACSE. I used [trimal](https://vicfero.github.io/trimal/) version v1.4.rev15 build[2013-12-17] to remove any columns present in less than 50% of the samples: ``` parallel trimal -gt 0.5 -in MAFFT/Agarista.{}.mafft.fasta -out trimal/Agarista.{}.trimal.fasta :::: genelist.txt ``` ## Gene Trees I used a gene tree using [IQTree](http://www.iqtree.org/doc/) version 2.2.2.7 with the `-m MFP` flag to search for the optimal substitution model, and `-B 1000` to get gene tree support using 1000 'fast bootstraps': ```bash parallel -j 10 iqtree -s Agarista.{}.trimal.fasta -m MFP -B 1000 :::: ../genelist.txt ``` Gene tree branches should be collapsed before being given to astral, for this I used `newick_utils` to collapse all branches with less than 30% gene tree bootstrap support: ```bash parallel "nw_ed trimal/Agarista.{}.trimal.fasta.treefile 'i & b<=30' o > collapsed/Agarista.{}.collapsed.tre" :::: genelist_noparalogs.txt ``` Gene trees were combined into a single file, skipping the genes that had evidence of paralogs in > 3 samples. Also, need to remove some extraneous characters from the sequences so they will nicely combine in Astral. ```bash! parallel sed -i "s/-{}//g" collapsed/Agarista.{}.collapsed.tre :::: ../genelist.txt cat collapsed/*.tre > astral/Agarista.genetrees.tre ``` ## Species Tree I inferred a species tree using [Astral-III](https://github.com/smirarab/ASTRAL) (version 5.6.3) from 276 gene trees: ```bash! java -jar /opt/apps/Software/ASTRAL/Astral/astral.5.6.3.jar -i Agarista.genetrees.tre > Agarista.astral.tre ``` I have a script that converts the library IDs to something more useful for a tree using a simple comma-separated file. ```bash! python ../replace_labels.py Agarista.astral.tre ../taxon_subst.csv > Agarista.astral.names.tre ``` The resulting tree is below - support values are Astral's Local Posterior probability. Branch lengths correspond to amount of concordance among genes (short branches mean more discordance). ![Agarista.astral.names.tre](https://hackmd.io/_uploads/SyyEVTfl1e.png)