Mixed DNA Hybseq

# Mixed DNA Hybseq [TOC] ## Sample List Data sequenced September 2021 on MiSeq 2x300 ```csvpreview {header="true"} # Sample ID Type Adapter ng/uL Size Description 1 ahp-voucher-redo-A1 voucher G2 14 432 pure vouchered saffron 2 corn-silk-A voucher B2 11.6 506 pure corn silk 3 gardenia-A voucher H10 3.4 379 pure vouchered gardenia jasmoides 4 safflower-A voucher E7 3.56 358 pure safflower 10 URI20 voucher F6 23.8 528 Artemisia absinthium 11 GF9 voucher G6 17.6 511 Hydrastis canadensis 12 27-Jim Duke voucher G7 23.6 542 Mentha peperita 13 18-Jim Duke voucher D11 15.3 474 Silybum marianum 14 wen13439 voucher G10 14 499 Ginkgo biloba 15 5-Jim-Duke voucher B4 23 489 Panax quinquefolius 16 wen13341 voucher B5 14.1 448 Camellia sinensis 17 1-Jim Duke voucher B8 21.1 460 Ilex paraguariensis 18 clover-honey mixed-unknown G1 0.562 313 unfiltered clover honey (spun down to collect particulates and extracted) 19 wildflower-honey mixed-unknown H1 1.36 328 unfiltered wildflower honey (spun down to collect particulates and extracted) 5 cornspike-10A mixed-known H2 14.5 399 mock adulterated sample, 10% corn silk 90% saffron 6 gardspike-10B mixed-known G9 11.5 448 mock adulterated sample, 10% gardenia jasmoides 90% saffron 7 saffspike-10A mixed-known E11 13.4 438 mock adulterated sample, 10% safflower 90% saffron 8 sup-2-1 mixed-known C4 12.3 357 mock dietary supplement (Ginkgo biloba, Panax quinquefolius, Camellia sinensis, Ilex paraguariensis) 9 sup-4-1 mixed-known D4 11.5 408 mock dietary supplement (Artemisia absinthium, Hydrastis canadensis, Mentha peperita, Silybum marianum) 20 saff-spike-5A mixed-known E9 17.5 413 mock adulterated sample, 5% safflower 95% saffron 21 gard-spike-5A mixed-known H11 12.7 436 mock adulterated sample, 5% gardenia 95% saffron 22 corn-spike-5A mixed-known G4 11.4 386 mock adulterated sample, 5% corn 95% saffron 23 corn-spike-20A mixed-known H4 14.6 418 mock adulterated sample, 20% corn 80% saffron 24 NOBI-B unknown E4 17.3 479 saffron sample from amazon, chemists have indicated it is adulterated in some way ``` In the `fda_mixed` directory with the `fastq.gz` files: `ls | cut -f 1 -d '_' | sort -u > namelist.txt` (Remove the Undetermined line) ## Trimming Reads with FastP `fastp` [is a command line tool](https://github.com/OpenGene/fastp) for trimming FASTQ files. With the default parameters, it will: * Trim adapters in both directions * Delete reads with mean quality score < 20 * Delete reads if more than 40% of the bases have Q < 15 * Delete reads if less than 15 bp remain after filtering `parallel fastp -i /media/johnsonseq5TB/fda_mixed/{}*R1_001.fastq.gz -I /media/johnsonseq5TB/fda_mixed/{}*R2_001.fastq.gz -o trimmed_reads/{}.trimmed.R1.fq.gz -O trimmed_reads/{}.trimmed.R2.fq.gz -j {}.fastp.json -h {}.fastp.html :::: namelist.txt` Results - number of read pairs passing filter: ```csvpreview 1-Jim-Duke.fastp.json 2337002 18-Jim-Duke.fastp.json 1720676 27-Jim-Duke.fastp.json 1358532 5-Jim-Duke.fastp.json 2073868 ahp-voucher-redo-A1.fastp.json 1038950 clover-honey.fastp.json 379558 corn-silk-A.fastp.json 2438902 cornspike-10A.fastp.json 2097452 corn-spike-20A.fastp.json 2825694 corn-spike-5A.fastp.json 3002666 gardenia-A.fastp.json 671326 gardspike-10B.fastp.json 2063402 gard-spike-5A.fastp.json 1947184 GF9.fastp.json 1951208 NOBI-B.fastp.json 2051470 safflower-A.fastp.json 1028830 saffspike-10A.fastp.json 2151534 saff-spike-5A.fastp.json 2315246 sup-2-1.fastp.json 1105520 sup-4-1.fastp.json 3457636 URI20.fastp.json 1373764 wen13341.fastp.json 2410718 wen13439.fastp.json 1757508 wildflower-honey.fastp.json 427370 ``` ### prep for Hybpiper `cd trimmed_reads` `gunzip *` ## HybPiper ### Target File Using the Mega353 target file from [the NewTargets github site](https://github.com/chrisjackson-pellicle/NewTargets), which has curated target sequences for the Angiosperms353 genes across many angiosperms, with sequences from the 1KP transcriptomes. ### Initial HybPiper Commands ```bash= while read name do /opt/apps/Software/HybPiper/reads_first.py \ -r ../trimmed_reads/"$name"*.fq \ -b ../mega353.fasta \ --cpu 50 \ --bwa \ --prefix $name python /opt/apps/Software/HybPiper/cleanup.py $name done < ../namelist.txt ``` ### HybPiper Stats ```bash= python /opt/apps/Software/HybPiper/get_seq_lengths.py ../mega353.fasta ../namelist.txt dna > sep2021_seqlengths.txt python /opt/apps/Software/HybPiper/hybpiper_stats.py sep2021_seqlengths.txt ../namelist.txt > sep2021_hybpiper_stats.txt ``` Stats explanation: * GenesMapped = at least 1 read on target * GenesWithContigs = any sequence assembled * GenesWithSeqs = any sequence extracted * GenesAt25Pct = 25% of targeted length * ParalogWarnings = genes with > 1 contig representing 75% of target length ```csvpreview {header=T} Name NumReads ReadsMapped PctOnTarget GenesMapped GenesWithContigs GenesWithSeqs GenesAt25pct GenesAt50pct GenesAt75pct Genesat150pct ParalogWarnings 18-Jim-Duke 1757343 542604 0.309 352 343 329 323 304 217 0 0 1-Jim-Duke 2379083 808875 0.340 353 350 346 342 317 258 0 17 27-Jim-Duke 1413080 524184 0.371 353 324 308 297 269 195 0 74 5-Jim-Duke 2115417 765874 0.362 353 346 334 326 280 189 0 21 ahp-voucher-redo-A1 1042808 264124 0.253 353 242 147 103 38 14 0 0 clover-honey 380986 76454 0.201 323 114 0 0 0 0 0 0 corn-silk-A 2463767 503856 0.205 353 331 287 277 233 150 0 9 cornspike-10A 2105432 587917 0.279 353 296 227 181 95 35 0 3 corn-spike-20A 2839733 810109 0.285 353 319 272 239 162 73 1 8 corn-spike-5A 3013073 906937 0.301 353 313 243 197 102 35 0 5 gardenia-A 677978 240587 0.355 352 314 284 253 165 89 0 4 gardspike-10B 2070701 510448 0.247 353 285 200 150 74 25 0 3 gard-spike-5A 1951537 396369 0.203 352 253 141 99 33 13 0 2 GF9 1994916 667050 0.334 352 348 340 335 307 229 1 3 NOBI-B 2059690 545492 0.265 352 296 207 159 77 32 0 3 safflower-A 1040873 476493 0.458 353 328 307 288 205 109 0 2 saffspike-10A 2156910 432816 0.201 352 267 180 121 56 20 0 1 saff-spike-5A 2321613 574120 0.247 352 279 187 145 62 21 0 1 sup-2-1 1110902 430304 0.387 353 299 281 232 138 74 0 5 sup-4-1 3518310 1260437 0.358 353 346 327 316 272 197 2 38 URI20 1391856 249203 0.179 352 312 265 258 230 161 0 6 wen13341 2437307 795193 0.326 353 345 333 318 276 184 0 10 wen13439 1758226 284962 0.162 353 163 38 19 5 2 0 0 wildflower-honey 428493 197451 0.461 346 88 0 0 0 0 0 0 ``` ### HybPiper Gene Recovery Heatmap Every column is a gene; the color in the cell is the proportion of the target length recovered. ![](https://i.imgur.com/R5ppm4S.png) ### Notes * The two honey samples did not have any genes recovered. However they still had 1000s of reads on target and a few dozen contigs assembled. * In cases where the potential adulterants are known, we could generate more specific target files (e.g. for corn and saffron) and re-run HybPiper * Perhaps HybPiper is not doing well with mixed samples? * Can still check the contigs for assembled sequence. * Using merged reads in upcoming new HybPiper update should help * `wen13439` having low success makes sense because it's _Ginkgo biloba_ and therefore might have trouble getting recovered by _Angiosperms353_ ... * Still 280K reads on target, could try a gymnosperm target file ### Honey `wildflower-honey` had 88 genes with contigs. The most was for gene `6128` although the majority of this is repetitive DNA. It was also the only gene to return any exonerate results. Looking further into the contigs: This sequence from `6544` had sequence similarity to the asian hornet: ``` >NODE_1_length_519_cov_21.823980 AACTATAAATACCGAATAAGCTGCGCTCATTAAATGAGTCCAACCAATAGAATTGCCGTA ACGAACAATCAAGTGCCGGCAATGAGGTAATCGATACATTGGACGATCCATCGGTATGAA CAGTAAGTCTGGGAGAAGACTGAGACACTGAAGACGACGATGACAACGACGACGTCGTCG CCAACGACGAAGAAAAACGAGAAGTAGACGACGCGACCGCAACAGAAACCAAAGAAGAAG AAGAAGAAGAAGTCGAAGAAAACGACGACAACATTGTCGGTGTAGTAACTACGGTTGTTA CTACAGCTGTCCCTGGCACAGAAATCACATTCACACGTCGCAAAAAACATACATTTTTAC CCAAATGTGCGAATAAATGATCATCGCTATGAACACGAGGTAATTTGTTCACCGATCGAC TAACAACACACTCGTCACTATTCATTTTACCAATTAAATTACCATTACTACCATCATCAT TATCAACAACATCATCATCATAATCATCATCATTATTAT ``` `clover-honey` had 114 genes with contigs, with `6128` again with the most. Both honey samples use the sequence `>DHAW_grafted_with_ZSSR-6128` as the target reference: ``` >DHAW_grafted_with_ZSSR-6128 RRYSSDEDSDSISGSEDDESEPTGSDSEYSDSESARAGAGARRKRKERRRRKQKEERRRR REKRKRREREEEKKKKKKRRRKKEKKKKEKKEKGKVGAVTDSWGKYGIIRETDMWNKRPE FTAWLSEVKQVNLEHLANWEEKQMFKQFMEDHNTATFPSKKYYSLDAYYRKKIEKQVKKG LTKVVQTERTVFNDEEERRQEMMRVREKEKEAEVEALKRSMQSGLAQAMKEQARLREEMA YQYKVGNLEAAAAIQRRLDPDAE* ``` There could be some issue with the highly repetitive amino acid sequence causing issues with read mapping/assembly. I *think* this is getting fixed in HybPiper 1.4, check with CJ to be sure. Would it be worth using a clover reference to see how many reads map? I got the *Trifolium pratense* [sequences from the Kew Explorer](https://treeoflife.kew.org/specimen-viewer/19181) and made a target file. There were 88 genes with mapped reads, 32 genes with contigs, but no recovered sequences. One of the contigs recovered is `6318` which in an NCBI BLAST search returns as 71% identity to the *Solanum lycopersicum* genome. I tried mapping the `clover-honey` reads to the *Trifolium* genome, and 51% of the reads did map. A very quick look in IGV suggests most of these are in isolated regions where they are PCR duplicates. Another approach is *de novo* assembly of target capture reads with subsequent matching to target genes. Using `spads.py` I generated an assembly with 528 contigs, including one that's 7693 bp with 11x coverage. The seqeuence, `>NODE_1_length_7693_cov_11.017446` has a best NCBI BLAST hit to an *Apis mellifera* filamentous virus which I suppose makes sense! ## NOBI-B This sample of saffron from Amazon is suspected to be adulterated. Some quick BLAST searches of extracted sequences: * `6875` (135 aa): 81% ID to *Apostasia shenzhenica* (grass orchid) * `5670` (180 aa): 83% ID to *Cocos nucifera* (coconut palm) * `5919` (149 aa): 85% ID to *Asparagus officinalis* mitochondrial gene GenBank probably isn't the best guide here, but interesting that they are hitting non-grass monocots. Iridaceae is probably poorly represented in GenBank ### Phylogeny of Iridaceae with A353 Saffron (*Crocus sativus*) is in the Croceae tribe of the Iridaceae. According to [Goldblatt et al. 2008](https://www.ingentaconnect.com/contentone/aspt/sb/2008/00000033/00000003/art00004?crawler=true&mimetype=application/pdf) (using plastid markers), the closest relatives of *Crocus* are *Syringodea*, *Afrocrocus*, and *Romulea*, PAFTOL "Pilot" phylogeny from Angiosperms353, with Iridaceae highlighed ![](https://i.imgur.com/c0t6CVm.png) There are 23 samples of Iridaceae in PAFTOL but 10 of these are *Babiana* species. Of these, the closest relatives to *Crocus* are probably *Tritoniopsis*, *Gladiolus*, *Zygotritonia*, and *Lapeirousia*. The phylogeny should have these 13 plus related Asparagales sequences. To this we can add the saffron voucher sequences. ### Gene phylogeny with mixed samples The potential adulterants are: * Gardenia jasmoides (Rubiaceae) * Zea mays (Poaceae) * Carthamus tinctorius (Asteraceae) Reduce A353 pilot gene phylogeny to those three families plus Iridaceae. Sequences were recovered for NOBI-B for 207 genes. Identifying which genes has the most sequences corn/saffron/safflower: ``` parallel --tag "grep '>' {} | wc -l" ::: $(grep NOBI * | cut -f 1 -d":") ``` 170 genes have sequences from all 12 samples. #### Adding sequences to PAFTOL alignments Extracting the PAFTOL "pilot" alignments and restricting to just the needed samples: ```python= import sys from Bio import SeqIO samplelist = sys.argv[1] geneNum = sys.argv[2] paftol_samples = [x.rstrip() for x in open(samplelist)] paftol_samples = set(paftol_samples) with open("{}.paftol.fasta".format(geneNum),"w") as outfile: for seq in SeqIO.parse("/scratch/joh97948/paftol/nuclear/family_level_seqs_dna/trimmed_seqs/{}.opt.trimal.FNA".format(geneNum),"fasta"): if seq.id in paftol_samples: SeqIO.write(seq,outfile,'fasta') ``` Running for all genes, followed by `mafft --add` to add the new sequences, and quick ML tree inference using `iqtree`: ``` parallel python ../extract_paftol.py paftol_samples.txt {} :::: ../genes_for_phylo.txt parallel "mafft --preservecase --keeplength --add ../sequences/{}.FNA {}.paftol.fasta > {}.saffron.fasta" :::: ../genes_for_phylo.txt parallel -j 25 iqtree -s {}.saffron.fasta -B 1000 -m MFP :::: ../genes_for_phylo.txt ``` ## Mixed Samples ### Read Mapping In the absence of a good way to extract full-length sequences (because that's what I'm asking for money to develop!) we can also take the same steps as Hunter et al., to use read-mapping to a reference as a proxy for assessing mixtures. #### Compiling reference sequences The four references are: ``` corn-silk-A gardenia-A safflower-A ahp-voucher-redo-A1 ``` Because the saffron `ahp-voucher-redo-A1` had the worst gene recovery (likely due to the poor sampling in this family), should restrict the analysis to genes recovered in all four vouchers. Using Python: ```python= safflower = set([x.split("\t")[0] for x in open("hybpiper/safflower-A/genes_with_seqs.txt")]) corn = set([x.split("\t")[0] for x in open("hybpiper/corn-silk-A/genes_with_seqs.txt")]) gard = set([x.split("\t")[0] for x in open("hybpiper/gardenia-A/genes_with_seqs.txt")]) saffron = set([x.split("\t")[0] for x in open("hybpiper/ahp-voucher-redo-A1/genes_with_seqs.txt")]) inall = saffron & safflower & corn & gard with open("voucher_genes.txt","w") as outfile: for g in inall: outfile.write(f"{g}\n") ``` This gives 124 genes. Extracting into separate "reference genomes" for each voucher: ```bash= parallel "cat hybpiper/{1}/{2}/{1}/sequences/intron/{2}_supercontig.fasta >> voucher_refs/{1}.supercontigs.fasta" ::: corn-silk-A gardenia-A safflower-A ahp-voucher-redo-A1 :::: voucher_genes.txt ``` Making BWA, samtools, and gatk reference index files: ```bash= parallel gatk CreateSequenceDictionary -R {}.supercontigs.fasta ::: corn-silk-A gardenia-A safflower-A ahp-voucher-redo-A1 parallel samtools faidx {}.supercontigs.fasta ::: corn-silk-A gardenia-A safflower-A ahp-voucher-redo-A1 parallel bwa index {}.supercontigs.fasta ::: corn-silk-A gardenia-A safflower-A ahp-voucher-redo-A1 ``` #### Mapping and deduplicating reads Reads will need to be de-duplicated because of the PCR step in target capture. ```bash= prefix=$2 #read1fn="$prefix.R1.paired.fastq" #read2fn="$prefix.R2.paired.fastq" read1fn=../trimmed_reads/"$prefix".trimmed.R1.fq read2fn=../trimmed_reads/"$prefix".trimmed.R2.fq mkdir -p $prefix cd $prefix #Align read files to reference sequence and map bwa mem $reference $read1fn $read2fn | samtools view -bS - | samtools sort - -o "$prefix.sorted.bam" gatk FastqToSam -F1 $read1fn -F2 $read2fn -O $prefix.unmapped.bam -SM $prefix.sorted.bam #Replace read groups to mapped and unmapped bam files using library prep and sequencing information gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix #Combine mapped and unmapped BAM files gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference #Remove duplicate sequences gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt samtools index $prefix.marked.bam ``` #### Mapping #### Stats ![](https://i.imgur.com/P4olQIb.png)