# Soil Seed Bank HybPiper ###### tags: `courtney` After DNA was extracted using CTAB protocol, I ran that DNA through Hybpiper in order to create species reference files. The following is a link to a 2021 paper written about the process of DNA extraction, Angiosperms353, and the benefits of using herbarium specimens: https://bsapubs.onlinelibrary.wiley.com/doi/full/10.1002/aps3.11419 In my project, a soil seed bank experimental protocol was used. When determining what species are present in an environment, the easiest way to do so is to identify all the adult plants that have reached maturation. Because many seeds don't germinate unless specific events occur such as a wildfire or an extreme freeze, there are usually many more plant species present in an environment than can be detected visually. The species that are in the soil can be detected by using a soil seed bank. Usually in a soil seed bank, a sample of soil is taken from an area is taken from the ground and the seeds present in that soil are determined by which plants end up germinating. This is a lengthy and inaccurate process because some seeds require harsh cold, extreme heat, or even passing through an animal's digestive system in order to germinate. Since not all species present in a soil seed bank are guaranteed to germinate, it is possible to not identify some species present in the environment the soil seed bank was taken from. Another way that people have determined the species present in a soil seed bank is to extract all seeds from the soil and identify them. Since it is possible for some seeds to be mistaken for rocks or not be seen at all due to how tiny they are, this potentially results in missing the detection of some species present in a soil seed bank. This would also allow the presence of plants that germinate in extreme conditions to be detected in the soil. If successful, this process should also prove useful in determining plant species present in mixtures such as herbal supplements and in tracking the usage of endangered plant species. In my study, a soil seed bank was taken from New Mexico by Dr. Akasha Faist and some of her students. As with a traditional soil seed bank procedure, the soil sample was watered and cared for in an attempt to germinate all species present in the soil. Part of the soil was not subjected to watering and sunlight and was instead split into two groups that would become my unknown seed mix and my pooled soil sample. The batch of soil that underwent the process of germination resulted in the identification of the 8 species I am using in my study as reference species. Leaf tissue DNA was extracted from each plant using CTAB protocol and was sequenced using angiosperms353. Seed tissue was extracted from the seeds produced by the adult plants grown in the soil seed bank. The seeds remaining after DNA extraction were mixed together and placed in a bag without soil. This entire sample was then run through the CTAB protocol and angiosperms353 to produce my known seed mix DNA file. The unknown seed mix was created by taking some of the soil seed bank that did not undergo the previous procedure. Instead, seeds were extracted from the soil by using tweezers and a magnifying glass in an attempt to find all seeds present in the soil. The seeds extracted from the soil were then run through the CTAB protocol and angiosperms353 to produce my unknown seed mix DNA file. The pooled soil sample was created by running the remaining soil not subjected to either of the above variables through CTAB protocol and angiosperms353 to produce my pooled soil mix DNA file. If it is possible to detect more species in the pooled soil sample, it shows that the hypothesis that several species are being missed by using seed extraction and seed germination methods typical of the traditional soil seed bank procedure was correct. My results indicated more uniquely mapping reads present in the unknown and pooled soil samples than were present in the known seed mix. The negative control used was empty tubes that were run through the DNA sequencer and subjected to CTAB protocol and angiosperms353 as though they contained seeds or soil. Since very little DNA was detected using this procedure, it shows that the dna extraction procedure was uncontaminated with extra eDNA from sources other than the soil seed bank. Angiosperms353 works by finding 353 unique protein coding genes and extracting them from the plant's genome. This is important to do for plant dna because plant genomes are much larger than the genomes of most other organisms due to multiple genome duplication events. If the entire plant genome was sequenced, it would take a lot of time to do and multiple redundant sequences would be sequenced. Angiosperms353 detects the genes unique to angiosperms so that we can skip over the redundant genes present in most other plants. What hybpiper does is take the extracted genes and sews them together to form a complete sequence. With the DNA extracted from both my seed mixtures and the untampered with soil, I used hybpiper to organize each file into complete sequences. I then used hybpiper on the DNA extracted from each species' seed and leaf DNA in order to create species reference files. After that, I used the bwa mem command in hybpiper to compare each seed bank variable sequence against each individual reference species file. In the known seed mix, all 8 species should be detectable. In the unknown seed mix, all 8 species plus some more should be detectable. In the pooled soil sample, there should be multiple more species detectable assuming multiple species are being missed by traditional seed bank species detection methods. My results showed that the number of unique reads in the pooled soil was much higher than the known and unknown mixtures. This indicates that more species than the 8 known species are present. In the future, creating more species references for Texas angiosperms and running them against the pooled soil should show more species present. In order to create species reference files for the 8 species in my study, I used hybpiper. The following is a link to the github for hybpiper: https://github.com/mossmatters/HybPiper To create the species reference files, I first copied the DNA extracted using CTAB protocol and Angiosperms353 into my home directory: ``` cp -r ../SSB/ /scratch/mil39387/SSB2 ``` SSB2 is what I decided to name the copied over seed bank DNA sequences. After copying over the DNA for the 8 species into my home directory, I recorded which species was present in each data file I obtained. The DNA for my experiment was collected from the leaf tissue of plants that germinated from the soil seed bank. The seed tissue DNA was collected from the seeds of the plants the leaf tissue was taken from. The following is the data labels for the species and tissue samples used in my project: data1=pooled soil data2=known seed mix data3=unknown seed mix data100=negative control Seed and leaf samples with corresponding data number: Silene antirrhina Caryophyllaceae seed=data7 Silene antirrhina Caryophyllaceae leaf tissue=data25 Descurainia pinnata Brassicaceae leaf tissue=data11 Descurainia pinnata Brassicaceae seed=data28 Lotus salsuginosus var. brevivexillus Fabaceae seed=data17 Lotus salsuginosus var. brevivexillus Fabaceae leaf tissue=data33 Lotus strigosus Fabaceae seed=data18 Lotus strigosus Fabaceae leaf tissue=data34 Salvia columbariae Lamiaceae seed=data23 Salvia columbariae Lamiaceae leaf tissue=data24 Astragalus nuttalianus Fabaceae seed=data26 Astragalus nuttalianus Fabaceae leaf tissue=data27 Plantago patagonica Plantaginaceae seed=data29 Plantago patagonica Plantaginaceae leaf tissue=data30 Thelypodium lasiophyllum Brassicaceae seed=data31 Thelypodium lasiophyllum Brassicaceae leaf tissue=data32 After labeling all the sources of my DNA, I copied mega353.fasta over as well. All the samples were run against mega353 because it detects the 353 DNA sequences unique to flowering plants. Since plants experienced so many gene duplication events, it would be impractical to use the entire plant genome. This way the uniquely identifying DNA for a species is more easily found. ``` cd mega353.fasta /scratch/mil39387 ``` Using the files mega353 and each tissue sample, I trimmed each sequence with the following command: ``` fastp -i SSB2/SSBseq001_S135_L002_R1_001.fastq -I SSB2/SSBseq001_S135_L002_R2_001.fastq -o trimmed/SSBseq001.trimmed.R1.fastq -O trimmed/SSBseq001.trimmed.R2.fastq -j SSBseq001.json -h SSBseq001.html ``` After trimming the data, I ran in (hybpiper). Each sample was labled according to the sequence number they were assigned. SSBseq001 became data1, SSBseq002 became data2, etc: ``` hybpiper assemble -t_dna mega353.fasta -r trimmed/SSBseq001.trimmed.R*.fastq --prefix data1 --bwa --cpu 48 ``` next I ran: ``` --run_intronerate --start_from exonerate_contigs ``` After that, I created namelist.txt and typed the name of each resulting data file into namelist.txt. I then ran hybpiper stats for everything in namelist.txt: ``` hybpiper stats -t_dna mega353.fasta gene namelist.txt ``` At this point I learned how to use the "screen" command so that my sequencing can run while I am not actively using my computer. As some of the sequences took an hour to run, this was a huge time saver. Learn how to use screen: https://linuxize.com/post/how-to-use-linux-screen/ ``` echo $STY ``` if gives a value, am in a screen Name the screen hybpiper to lessen confusion ctrl+a then press d to exit screen At this point, I have data1, data2 and data3, data7, data11, data17, and data18 for seq001, seq002, seq003, seq007, seq0011, seq0017, seq0018 with trimmed data, hybpiper assemble, and the intronerate step completed. I also wrote data1, data2, data3, data7, data11, data17, data18, and data100 inside the namelist.txt folder. After running hybpiper stats, created a heatmap of my results: https://slack-files.com/T8WH5QU5U-F04ALTGJ9R8-2bb5d2e0fc ``` hybpiper recovery_heatmap seq_lengths.tsv ``` Next, I created SSB3 and copied the data from ``` /media/johnsonseq/2021-02-08_novase tq/SSB ``` by using the command ``` cp -r ../SSB/ /scratch/mil39387/SSB3 ``` After copying over the rest of the DNA, I repeated the same process as above for the rest of the tissue samples. I trimmed all the data, run hybpiper assemble, then add the assembled data to namelist.txt to create a heatmap with all the data: https://slack-files.com/T8WH5QU5U-F04B7N8TUBH-4a36ac101c Now that the heatmap has been created, I started making species references by first retrieving all the sequences using: ``` hybpiper retrieve_sequences --single_sample_name data1 -t_dna mega353.fasta supercontig ``` have run: 1,2,3,7,11,17,18,100,23,24,25,26,27,28,29,30,31,32,33,34 All were labled following the pattern: 'data"x"_supercontig.fasta' After that, I ran the supercontigs through bwa index. then going to use bwa mem fastq fasta. After completing this, will use samtools depth gene pos depth ran ``` bwa index -a bwtsw data32_supercontig.fasta index ``` All tissue samples were used to make reference files. The data I didn't use to make reference files were data 100, 1, 2, and 3 because they are to be compared to the reference files since they contain more than one species. As a reminder, data100 is the negative control, data1 is the pooled soil, data2 is the known seed mixture, and data3 is the unknown seed mixture. After running bwa index, I used bwa mem to map the seed (read) of one species against the leaf (reference) of the same species: ``` bwa mem data25_supercontig.fasta /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq007.trimmed.R1.fastq /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq007.trimmed.R2.fastq | samtools sort -o data7.map2.25.bam - ``` dash at the end is important to include After running bwa mem, I looked at the results by running: ``` samtools flagstat data7.map2.25.bam ``` After running both bwa mem and samtools flagstat for each species comparing the seed tissue to the leaf reference files, I did the reverse and compared the leaf tissue reads to the seed reference. Compared the results of data7.map2.25.bam (seed against the leaf reference) data25.map2.7.bam (leaves against the seed reference). for each species. Results of seed mapped against leaf reference for each species: Silene-70% Lotus.sal-82% lotus.str-74% salvia-38% astragalus-40% plantago-25% thelypodium-56% descurainia-55% Results of leaf mapped against seed references for each species: silene-12% lotus.sal-29% lotus.str-33% salvia-40% astragalus-42% plantago-28% thelypodium-55% descurainia-95% Since some of the results were better when comparing leaf vs seed and others were better when comparing seed to leaf, I decided to combine the seed references and leaf references for each species into one file using cat. Found command for bwa mem instructions at https://biology.stackexchange.com/questions/59493/how-to-convert-bwa-mem-output-to-bam-format-without-saving-sam-file Ran the following from hybpiper_output folder with r1 and r2: ``` cat /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq007.trimmed.R1.fastq /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq025.trimmed.R1.fastq > sileneR1.trimmed.fastq ``` Produced silene.R1.trimmed.fastq and silene.R2.trimmed.fastq After combining all the seeds and leaves into one reference file, I ran hybpiper assemble and hybpiper stats to see if the number of genes recovered increased with the larger, more complete species references. It did result in a larger number of genes recovered for each species as expected, so used the combined files as the species reference for the rest of my project. Ran hybpiper assemble without the "intronerate" part then ran with intronerate and exonerate ``` hybpiper assemble -t_dna /scratch/mil39387/seed_banks/references/mega353.fasta -r /scratch/mil39387/seed_banks/hybpiper_output/silene.R*.trimmed.fastq --prefix silene --bwa --cpu 48 --run_intronerate --start_from exonerate_contigs ``` ``` hybpiper stats -t_dna /scratch/mil39387/seed_banks/references/mega353.fasta gene readfiles.txt ``` I created a text file called "readfiles.txt" and wrote "silene" in it for running hybpiper stats. Resulted in an increase of 50% reads from ~160 to 226 and increased 75% reads from ~75 to 115. After seeing similar results from the other species, I created species reference files for silene, lotus.sal, lotus.str, salvia, astragalus, plantago, thelypodium, and descurainia. With the more complete reference files, I ran retrieve sequences, bwa index, bwa mem, and samtools flagstat again for each species. ``` hybpiper retrieve_sequences --single_sample_name silene -t_dna /scratch/mil39387/seed_banks/references/mega353.fasta supercontig ``` ``` bwa index -a bwtsw silene_supercontig.fasta index ``` ``` bwa mem silene_supercontig.fasta /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq007.trimmed.R1.fastq /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq007.trimmed.R2.fastq | samtools sort -o data7.map2.silene.bam - ``` ``` samtools flagstat data7.map2.silene.bam ``` The "map to self" part of my poster that data100,1,2,and3 were compared to were the results of running the seeds of each species against the species reference files of combined seed and leaf tissue. Results of samtools flagstat for each species (seed to self): Silene-72% lotus.sal-80% lotus.str-77% salvia-39% astragalus-40% plantago-27% thelypodium-57% descurainia-54% Ran "data2" (known seed mix) against each species reference file. Expected results of this were the presence of each of the 8 species being shown since they were known to be present in the mixture. Results of data2 vs each species reference file: silene-13% lotus.sal-37% lotus.str-38% salvia-19% astragalus-34% plantago-17% thelypodium-35% descurainia-33% With the data from all data2.map2.species.bam files, created a double bar chart coupling the reads to self with the reads of data2. Results of data1(pooled soil) vs species reference files: silene-17% lotus.sal-18% lotus.str-18% salvia-16% astragalus-19% plantago-18% thelypodium-21% descurainia-23% With the data from data1, created double bar chart coupling readings to self against the reads of pooled soil. Results of data3 (unknown mix) vs species reference files: silene-16% lotus.sal-27% lotus.str-27% salvia-18% astragalus-25% plantago-18% thelypodium-19% descurainia-20% With the results of data3, created double bar chart coupling readings to self against the reads of unknown seed mix. Results of data100 (neg control) vs species reference files: silene-1% lotus.sal-1% lotus.str-1% salvia-1% astragalus-1% plantago-1% thelypodium-1% descurainia-1% With the results of data100, created a double bar chart coupling readings to self against the reads of the negative control. After creating all the bar charts, I wanted to see how many of the reads attributed to each were unique and didn't map to the other species. In order to find how many of the reads indicated above were uniquely mapping to each species, I ran the following for each: ``` parallel "samtools view -F 4 {} | cut -f1 > {.}.mapped.txt" ::: data2.*.bam ``` to create .txt files of each data2 reading. Be sure to replace "data2" with "data 1" or whatever the new reads being analyzed are called. Then, used: ``` python /scratch/joh97948/soil_seedbank/unique_reads.py *.txt ``` to see how many uniquely mapping reads there were for each species. Be sure to write "data2" or whatever the file name is directly to the left of *.txt The third column created is the unique reads. Created a pie chart in excel by copying over the data and switching the columns so that the species names were next to the unique reads. I also recorded the total number of unique reads that were given for data1, data2, and data3. As expected, the number of unique reads was much larger in the pooled soil sample, which indicates that more species are present than the 8 detected. There were more total uniquely mapping reads than unique reads that mapped to the 8 species. Because of this, a future direction of this project would be to create more species reference files for texas flowering plants. Running the pooled soil sample against many more species reference files should indicate the presence of multiple other species. I decided against doing a pie chart for the negative control since there were very few reads (as expected for running empty tubes through the machine). In order to speed up the process of finding unique reads for each species, used parallel commands: ``` parallel "bwa mem {}_supercontig.fasta /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq003.trimmed.R1.fastq.gz /scratch/mil39387/seed_banks/hybpiper_output/trimmed/SSBseq003.trimmed.R2.fastq.gz | samtools sort -o data3.map2.{}.bam -" ::: silene lotus.sal lotus.str salvia astragalus plantago thelypodium descurainia ``` This allowed all species to be run at the same time when doing each. For example, rather than running the "known seed mix" group 8 times (once for each species), I only had to run it once and told it to repeat the same command for each thing starting with "data2". In conclusion, the results of my project were very promising in indicating a way to detect species present in samples of plant DNA. The continuation of this project using hybpiper and angiosperms353 could prove beneficial in revolutionizing biotechnology and ecology. Another future direction of this project could be to create an angiosperms353 file specific to the flowering plants of texas in order to increase the number of reads detected to be uniquely mapping to each species. Creating a larger bank of species references could help hold companies producing plant products accountable by indicating the presence of endangered species or species not advertised to be present in the herbal mixture. Findings post-poster printing: creating a reference file for aquifolium and oncotheca and running each mixture against them resulted in an increase of uniquely mapping reads. Since things that are closely related share some dna, that likely indicates that relatives in the same genus of both are present in each mixture. The total number of uniquely mapping reads increased by: 60,918 in known mixture 156,596 in unknown mixture 152,887 in pooled soil