---
title: "Amplicons!!! (Marne's version)"
breaks: FALSE
tags: RADSeq
---
Marne Quigg (p32585)
---
# Amplicons!!! (Marne's version)
### Quick links
- **[Powerpoint from Suzy](https://github.com/CBG-Conservation-Genomics/BotanyWorkshop2024/blob/main/2.BaselineSkills/2.BaseSkills.pptx)**
- **[Linux commands from Suzy](https://github.com/CBG-Conservation-Genomics/BotanyWorkshop2024/blob/main/2.BaselineSkills/2.Bioinfo_unix_command_cheatsheet.pdf)**
- **[Quest User Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=505)**: Extensive manual on Quest
- **[Slurm Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964#section-partitions)**: In depth guide for submitting jobs to Quest using Slurm
- **[Basic Research Computing Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=2004#quest-fundamentals)**: Suite of various tutorial videos
- **[Globus](https://app.globus.org/file-manager)**: Globus file transfer service for up/downloading files to Quest
- **[Job Resource Logs](https://docs.google.com/spreadsheets/d/1P1zpt1aznNnXwMjUfynWioOXvGkjydFfhUBG_EppNMQ/edit?usp=sharing)**: List of jobs and their associated resources/efficiencies
- **[Jupyter Hub](https://jupyter.questanalytics.northwestern.edu)**: Link to Jupyter Hub (requires NU VPN)
---
## Pre-analysis Processing
1) Organize and FastQC: make sure sequencing went well
2) Trimmomatic: Illumina Adapter trimming and filtering out reads <30 bases long
3) BWA-MEM: Alignment to the reference genome (or to the amplicon fasta; both are tried in Step 3)
4) Primerclip or TrimPrimers: trim primers
5) Picard tools: On-target and coverage uniformity calculation
6) Variant calling (GATK HaplotypeCaller for germline calls; LoFreq, Mutect, or VarDict for somatic or low-frequency variant calls)
---
### Step 1: Organize and FastQC
***Goal:** Standardize file names and check the quality of the sequencing data!*
**Software:** FastQC and MultiQC
**Resources:**
- **[Interpretation](https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/)**
**Organize!**
Let's start by standardizing the file names...
Current format: SampleID_S##_R#_001.fastq.gz
Desired format: SampleID_R#.fastq.gz
```bash!
#start by navigating to the correct folder
cd /projects/p32585/Elm/'Batch 1'/
pwd #yay you're in the right place
#to remove the S##:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001_:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
#adapted from Brenden's code
```
**FastQC**
This allows us to look for sequencing errors or samples that did not sequence well.
``` bash!
#start by activating fastqc
module load fastqc
#check the version and record it (important for publication)
fastqc --version
#version _____
#next we want to open up a text editor to write out the script
nano fastqc.sh
#the below code will run a loop and do fastqc on all the samples in the folder
#!/bin/bash
output_dir="batch_1_qc_results" #set the output directory
mkdir -p "$output_dir" #create the output directory if it doesn't exist
for file in *_*.fastq.gz; do fastqc -t 40 -o "$output_dir" "$file"
done
#press ctrl+x then Y then enter to save it
#change file permissions so you can run it
chmod +x fastqc.sh
#time to run it!
bash fastqc.sh
#Rose taught me how to do this
```
**MultiQC**
This combines all of the individual reports from the FastQC into one so you don't have to look at individual reports for each sample.
``` bash!
#load the module
module load multiqc
multiqc --version
#version 1.25.1
#make a home for the multiqc output
mkdir -p /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/multiqc
cd /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/multiqc
ln -s -f /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/batch_1_qc_results/
#go to where the output will be
cd /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/
#run it
multiqc .
```
**Duplicate Original Data**
I'm a little scared of accidentally *damaging* my data, so I'm going to copy the folder just to have a backup set of data.
``` bash!
#format
cp -r original_folder copy_folder
#ran
cp -r 'Batch 1' copy_batch_1
```
*So that didn't actually work due to file permissions, and apparently none of these programs alter your original files anyway, so I'm just gonna continue.
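If a backup copy is ever actually needed, two things worth trying for the permission errors (a sketch only, untested here; neither helps if the source files simply aren't readable by your account):
```bash!
#copy without trying to preserve the original mode/ownership
cp -r --no-preserve=mode,ownership 'Batch 1' copy_batch_1
#or rsync without carrying over owner/group
rsync -a --no-o --no-g 'Batch 1'/ copy_batch_1/
```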
---
### Step 2: Trimming
***Goal:** Remove Illumina adapters and filter out reads <30 bp.*
**Software:** Trimmomatic (includes a .fasta file with the adapter sequences - has the universal adapter)
**Resources:**
- **[Github](https://github.com/usadellab/Trimmomatic/tree/main)**
- **[USA Del Lab](http://www.usadellab.org/cms/?page=trimmomatic)**
- **[Very helpful PDF](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf)**
Had to move everything onto Curie because I ran out of storage on Quest :( but here is an updated (6/10/2025) script on how to run Trimmomatic.
```bash!
#get all the softwares
conda activate bioconda
conda install bioconda::trimmomatic #version 0.39
conda install bioconda::java-jdk
conda install -c conda-forge openjdk #version 1.8.0_92
cd /data/labs/Fant/Quigg/all_data
#time to run it
nano trimmomatic2.sh
#!/bin/bash
# For loop to process all R1 samples
for R1 in *_R1.fastq.gz; do
R2=${R1//_R1.fastq.gz/_R2.fastq.gz}
R1p=${R1//.fastq.gz/_paired.fastq.gz}
R1u=${R1//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
# Run Trimmomatic
trimmomatic PE -phred33 \
-threads 20 "$R1" "$R2" "$R1p" "$R1u" "$R2p" "$R2u" \
ILLUMINACLIP:/home/imh4101/miniconda3/envs/bioconda/share/trimmomatic-0.39-2/adapters/TruSeq3-PE.fa:2:30:10:2:true \
LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40
done
#exit
#I added 20 threads and changed the path to trimmomatic
#change permissions
chmod +x trimmomatic2.sh
screen -L bash trimmomatic2.sh
#to disconnect the screen (keep it running in the background)
#type this into the screen itself
#press ctrl+a, then d
#to cancel the job
screen -ls
screen -X -S 2801504.pts-10.curie quit
```
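A quick optional sanity check after trimming (a sketch, not part of the pipeline; `SampleID` is a placeholder): the paired R1 and R2 outputs should contain the same number of reads.
```bash!
#count reads in the trimmed paired files (4 FASTQ lines = 1 read);
#the two numbers should match if Trimmomatic kept the pairs in sync
echo $(( $(zcat SampleID_R1_paired.fastq.gz | wc -l) / 4 ))
echo $(( $(zcat SampleID_R2_paired.fastq.gz | wc -l) / 4 ))
```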
---
### Step 3: Alignment
***Goal:** Align the samples to the reference genome.*
**Software:** BWA-MEM (can only do 96 samples at a time), SAMtools, Picard
**Resources:**
- **[Github](https://github.com/lh3/bwa)**
- **[Website](https://bio-bwa.sourceforge.net/)**
- **[Illumina resource](https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/bwa-aligner.html)**
We are going to try two techniques for this, Brenden's (align to the amplicon fasta) and Suzy's (align to the genomes), and compare to see which one works better.
***I did a few different alignment trials. The first was to a csv of all potential regions that I sent to IDT a long time ago (don't use that). Then I aligned to the elm genome plus the 3 disease genomes (good); I had the best ploidy estimations on samples aligned to the elm genome, and for these you can use Picard's CollectAlignmentSummaryMetrics. Then I aligned to the fasta file that came from IDT of the target regions so I could run Picard's CollectTargetedPcrMetrics.**
**Script that will do BWA -> Primerclip**
```bash!
#this starts with bwa alignment and ends with primerclip
cd /projects/p32585/Elm/batch_1/post_trimming/
#Don't forget to index
bwa index /projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta
nano pipeline.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=40G # Adjust memory based on previous usage
#SBATCH --job-name=full_pipeline
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=full_pipeline.err
#SBATCH --output=full_pipeline.out
### Step 1: BWA MEM Alignment ###
module purge
module load bwa
FASTQ_DIR="/projects/p32585/Elm/batch_1/post_trimming/paired_unpaired_data/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
ALIGNMENT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/"
# Create output directory if needed
mkdir -p "$ALIGNMENT_DIR"
# Align all paired-end FASTQ files with BWA MEM
for R1 in "$FASTQ_DIR"/*_R1_paired.fastq.gz; do
SAMPLE=$(basename "$R1" _R1_paired.fastq.gz)
R2="$FASTQ_DIR/${SAMPLE}_R2_paired.fastq.gz"
SAM_OUTPUT="$ALIGNMENT_DIR/${SAMPLE}.sam"
echo "Aligning $SAMPLE..."
bwa mem -M -t 8 "$REFERENCE" "$R1" "$R2" > "$SAM_OUTPUT"
echo "Alignment complete for $SAMPLE!"
done
echo "All alignments completed!"
### Step 2: Convert SAM to BAM ###
module load samtools
BAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/"
mkdir -p "$BAM_DIR"
for SAM_FILE in "$ALIGNMENT_DIR"/*.sam; do
SAMPLE=$(basename "$SAM_FILE" .sam)
BAM_OUTPUT="$BAM_DIR/${SAMPLE}.bam"
echo "Converting $SAM_FILE to BAM..."
samtools view -b "$SAM_FILE" -o "$BAM_OUTPUT"
echo "Conversion completed for $SAMPLE!"
done
### Step 3: Sort BAM & Generate Flagstats ###
SORTED_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/sorted_bams/"
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/flagstats/"
mkdir -p "$SORTED_DIR" "$FLAGSTAT_DIR"
for BAM_FILE in "$BAM_DIR"/*.bam; do
SAMPLE=$(basename "$BAM_FILE" .bam)
SORTED_BAM="$SORTED_DIR/${SAMPLE}_sorted.bam"
echo "Sorting BAM file for $SAMPLE..."
samtools sort -o "$SORTED_BAM" "$BAM_FILE"
echo "Generating flagstat for $SAMPLE..."
samtools flagstat "$SORTED_BAM" > "$FLAGSTAT_DIR/${SAMPLE}_flagstat.txt"
done
echo "Sorting and flagstat analysis completed!"
### Step 4: Sort SAM Files by Name ###
INDS=($(for i in "$ALIGNMENT_DIR"/*.sam; do echo $(basename "$i" .sam); done))
for IND in "${INDS[@]}"; do
samtools sort -n "$ALIGNMENT_DIR/${IND}.sam" -o "$ALIGNMENT_DIR/${IND}_nsort.sam"
done
echo "All SAM sorting completed!"
### Step 5: Run PrimerClip ###
conda activate primerclip
PRIMERFILE="/projects/p32585/Elm/genomes/cp927_Masterfile.txt"
for IND in "${INDS[@]}"; do
echo "Running PrimerClip on $IND..."
primerclip "$PRIMERFILE" "$ALIGNMENT_DIR/${IND}_nsort.sam" "$ALIGNMENT_DIR/${IND}_clipped.sam"
done
echo "Primer clipping completed!"
#exit
sbatch pipeline.sh
#job 3567647
#took about an hour
```
**Brenden/Emma's Technique:** this pipeline has you map the reads to the primer-sequence fasta file that you generated from all of the gene fasta sequences. *Note: you did not remove the genes that we didn't use in the final primer set.*
This was a fun process to troubleshoot. Brenden's pipeline only worked on a couple of samples, then claimed to be complete but wasn't. So then we tried a loop that Emma created that separated everything out into individual steps.
```bash!
#downstream analyses require a concatenated file of the unpaired reads
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
```
1) We need a fasta file of all the sequences we submitted to IDT. To do this you copied and pasted all of the fasta sequences and names (without the >). Use this [website](https://cdcgov.github.io/CSV2FASTA/) to convert the csv file to a fasta file (a command-line alternative is sketched below the scp command). Make sure you have column labels. Move them to your genomes folder on Quest.
***note: you had to redo this step and align to the genome NOT the primer sequences**
***another note: you had to mask the disease regions from the elm genome and align it to that**
```bash!
scp "C:\Users\mquig\Downloads\primers_redo.fasta" imh4101@quest.northwestern.edu:/projects/p32585/Elm/genomes/
```
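For reference, a possible command-line alternative to the CSV2FASTA website (a sketch only; `primer_regions.csv` is a made-up file name, and it assumes a comma-separated file with a header row, names in column 1, and sequences in column 2):
```bash!
#print each csv row as a FASTA record: ">name" on one line, the sequence on the next
#(NR > 1 skips the header row)
awk -F',' 'NR > 1 {print ">"$1"\n"$2}' primer_regions.csv > primers_redo.fasta
```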
:::spoiler 2) Now let's do a quick trial run with one sample.
```bash!
#start by making sure we have the packages
module load bwa
#version 0.7.12
module load samtools
#version 1.6
#start by telling the program what to map to
#bwa index /projects/p32585/Elm/genomes/primers4.fasta
#or you can index to the reference genome
bwa index /projects/p32585/Elm/genomes/genome.fa
#took about 20 minutes for this one
#if it's not in the same folder as the files then include the path
#try one sample
bwa mem -M -t 4 /projects/p32585/Elm/genomes/primers4.fasta NY05_R1_paired.fastq.gz NY05_R2_paired.fastq.gz > NY05.sam
#check to see how it went
samtools flagstat NY05.sam
#[W::sam_hdr_parse] Duplicated sequence 'KY866673.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866675.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866674.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866678.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866677.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866676.1'
#[W::sam_hdr_parse] Duplicated sequence 'EF123160.1'
#790452 + 0 in total (QC-passed reads + QC-failed reads)
#3816 + 0 secondary
#0 + 0 supplementary
#0 + 0 duplicates
#742109 + 0 mapped (93.88% : N/A)
#786636 + 0 paired in sequencing
#393318 + 0 read1
#393318 + 0 read2
#734708 + 0 properly paired (93.40% : N/A)
#737190 + 0 with itself and mate mapped
#1103 + 0 singletons (0.14% : N/A)
#2362 + 0 with mate mapped to a different chr
#192 + 0 with mate mapped to a different chr (mapQ>=5)
#ok so this isn't great, we want it to be in the high 90s
```
:::
:::spoiler 3) Ok now that we know it works, let's make a loop to run all of the samples. (Actually there is an easier way to do this; that script is a few steps down.)
```bash!
nano slurm_bwa.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=batch_1_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=CUP_bwamem.out.error
#SBATCH --output=bwamem.out
module purge
module load bwa
# Run bwa mem with 8 threads
for i in *_R1_paired.fastq.gz; do
NM=${i//_R1_paired.fastq.gz/} # Extract the base name (e.g., NM_ARK23-01)
REVERSE=${i//_R1_paired.fastq.gz/_R2_paired.fastq.gz}  # Swap _R1_paired for _R2_paired to get the reverse reads
#bwa mem -M -t 8 /projects/p32585/Elm/genomes/primers4.fasta "$i" "$REVERSE" > "${NM}.sam"
bwa mem -M -t 8 /projects/p32585/Elm/genomes/genome.fa "$i" "$REVERSE" > "${NM}.sam"
done
#exit
sbatch slurm_bwa.sh
#took about 11 minutes to run which is suspicious because Brenden and Emma's took like 8 hours
```
:::
:::spoiler 4) Now we need to verify that running the loop and running the single sample was the same thing.
```bash!
samtools flagstat NY05.sam
#[W::sam_hdr_parse] Duplicated sequence 'KY866673.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866675.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866674.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866678.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866677.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866676.1'
#[W::sam_hdr_parse] Duplicated sequence 'EF123160.1'
#790452 + 0 in total (QC-passed reads + QC-failed reads)
#3816 + 0 secondary
#0 + 0 supplementary
#0 + 0 duplicates
#742109 + 0 mapped (93.88% : N/A)
#786636 + 0 paired in sequencing
#393318 + 0 read1
#393318 + 0 read2
#734708 + 0 properly paired (93.40% : N/A)
#737190 + 0 with itself and mate mapped
#1103 + 0 singletons (0.14% : N/A)
#2362 + 0 with mate mapped to a different chr
#192 + 0 with mate mapped to a different chr (mapQ>=5)
#looks similar to me!
```
:::
:::spoiler 5) sam2bam! And sort! To make the files smaller and conserve storage, switch the file type to .bam
```bash!
nano sam2bam_sort.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=batch_1_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=sam2bam.out.error
#SBATCH --output=sam2bam.out
module load samtools
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
#exit
sbatch sam2bam_sort.sh
```
:::
6. USE THIS SCRIPT!!! Ok so here is a script that will convert all of your files to .bam (requires less storage), and run a flagstat check to see what percentage of reads mapped to the reference.
```bash!
nano sam2bam_sort.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=sam2bam_sort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=sam2bam_sort.out.error
#SBATCH --output=sam2bam_sort.out
module load samtools
module load bwa
# Process one sample at a time: convert SAM to BAM, sort, and clean up
for i in *.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM_sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM_sorted.bam"
# Generate flagstat report
echo "Generating flagstat report for $NM_sorted.bam..."
samtools flagstat "$NM_sorted.bam" > "$NM_flagstat.txt"
echo "Flagstat report saved to $NM_flagstat.txt"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_batch_1_bams_redo
mv "$NM.sorted.bam" sorted_batch_1_bams_redo/
mv "$NM.flagstat.txt" sorted_batch_1_bams_redo/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No *.sam files found."
exit 1
fi
done
echo "All *.sam files processed. Sorted BAM files and flagstat reports moved to sorted_bams directory."
#exit
sbatch sam2bam_sort.sh
```
*this was adapted from a script that Emma gave me
So after running the redo samples that script didn't work (the version I first ran had `$NM_sorted.bam` instead of `${NM}_sorted.bam`, which bash reads as an undefined variable, so the sort and flagstat steps never produced proper files); it just generated the .bam files. So below is the slurm script you used to sort and flag.
```bash!
nano bam_sort_flag.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=bam_sort_flag
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=bam_sort_flag.out.error
#SBATCH --output=bam_sort_flag.out
# Load necessary modules
module load samtools
# Define input and output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/"
SORTED_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/"
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/flagstats/"
# Create output directories if they don't exist
mkdir -p "$SORTED_DIR" "$FLAGSTAT_DIR"
# Process each BAM file
for bamfile in "$INPUT_DIR"/*.bam; do
filename=$(basename "$bamfile" .bam)
# Sort BAM file
samtools sort -o "$SORTED_DIR/${filename}_sorted.bam" "$bamfile"
# Generate flagstat report
samtools flagstat "$SORTED_DIR/${filename}_sorted.bam" > "$FLAGSTAT_DIR/${filename}_flagstat.txt"
done
echo "Processing complete."
#exit
sbatch bam_sort_flag.sh
```
7. Now let's check to see how the mapping went!
```bash!
nano flagstat_summary.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=process_flagstat
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=process_flagstat.out.error
#SBATCH --output=process_flagstat.out
# Set the directory containing the .flagstat.txt files
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/sorted_batch_1_bams/"
# Create a summary file
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"
# Loop through all .flagstat.txt files
for file in "$FLAGSTAT_DIR"/*_flagstat.txt; do
if [[ -e "$file" ]]; then
# Extract the sample name (remove directory and extension)
SAMPLE=$(basename "$file" _flagstat.txt)
# Extract relevant information
QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')
# Append the data to the summary file
echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
else
echo "No _flagstat.txt files found in $FLAGSTAT_DIR."
exit 1
fi
done
echo "Summary of all _flagstat.txt files saved to $SUMMARY_FILE."
#exit
sbatch flagstat_summary.sh
#job 2352119
#well that was quick
```
Well, it seems to have gone ok: most samples are in the low 90s or upper 80s, except for NY06, which is about 49%. Oh well. We could increase the mapping percentages by loosening some of the bwa mem parameters (see the sketch below), BUT I don't want to do that because it could increase the number of incorrectly aligned reads, and I'm already worried about that because of how low our annealing temperature was during PCR.
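For reference, these are the bwa mem knobs one could loosen to raise the mapping rate (a sketch only, not something I ran; the values and the `NY05_lenient.sam` output name are just illustrations of "more permissive than default"):
```bash!
# -k  minimum seed length              (default 19; smaller = more lenient seeding)
# -B  mismatch penalty                 (default 4;  smaller = mismatches cost less)
# -T  minimum output alignment score   (default 30; smaller = more alignments kept)
bwa mem -M -t 8 -k 15 -B 3 -T 20 /projects/p32585/Elm/genomes/genome.fa \
    NY05_R1_paired.fastq.gz NY05_R2_paired.fastq.gz > NY05_lenient.sam
```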
Wow, aligning to the genome resulted in a much higher percentage mapped! This is great!
This is the end here until we use PrimerClip to purge the bad stuff. See you in step 4!
---
### Step 4: Primer Trimming
***Goal:** Remove low quality bases and PCR primer sequences*
**Software:** Primerclip or TrimPrimers
**Resources:**
- **[Important note from Illumina](https://sfvideo.blob.core.windows.net/sitefinity/docs/default-source/application-note/primerclip-a-tool-for-trimming-primer-sequences-application-note.pdf?sfvrsn=cf83e107_10)**
- **[PrimerClip Github](https://github.com/swiftbiosciences/primerclip)**
- **[TrimPrimers Website](https://fulcrumgenomics.github.io/fgbio/tools/latest/TrimPrimers.html)**
Ok so first we need to create a conda environment to get PrimerClip because Quest doesn't have it :(
```bash!
#create a conda environment
conda create -n primerclip
#activate it
conda activate primerclip
#this conda environment lives in your home folder
#/home/imh4101/.conda/envs/primerclip
#download primerclip
conda install bioconda::primerclip
#takes a few seconds to activate it
#not sure what this does but brenden did it
conda install bioconda/label/cf201901::samtools
#no errors yay!
module list
#says theres no modules loaded so hopefully this works lol
which primerclip
#says we do have it but not which environment
primerclip --version
#that didnt work
#make sure we can actually run primerclip
chmod a+x /home/imh4101/.conda/envs/primerclip
```
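Since `primerclip --version` didn't cooperate, another way to record the installed version for the methods section is to read conda's own package metadata:
```bash!
#list the primerclip package installed in the primerclip conda environment,
#which shows the version and the channel it came from
conda list -n primerclip primerclip
```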
Now we need to convert our .bam files to .sam files and sort them by name. Before this you sorted them, but not by name, and PrimerClip requires them sorted by name.
Convert .bam files to .sam files...
```bash!
#we should run this as a slurm
nano bam2sam2.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=bam2sam2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=bam2sam2.out.error
#SBATCH --output=bam2sam2.out
# Set the directory containing BAM files
BAM_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/"
# Loop through all BAM files in the directory
for bamfile in "$BAM_DIR"/*.bam; do
# Extract the base filename without the extension
base_name=$(basename "$bamfile" .bam)
# Convert BAM to SAM using samtools
samtools view -h "$bamfile" > "$BAM_DIR/${base_name}.sam"
echo "Converted $bamfile to ${base_name}.sam"
done
echo "All conversions completed!"
#exit
sbatch bam2sam2.sh
```
Sort the resulting .sam files by name..
```bash!
nano nsort_sam.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=nsort_sam
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=nsort_sam.error
#SBATCH --output=nsort_sam.out
module load samtools
INDS=($(for i in ./*.sam; do echo $(basename "$i" .sam); done))
for IND in ${INDS[@]};
do
samtools sort -n ${IND}.sam -o ${IND}_nsort.sam;
done
#exit
sbatch nsort_sam.sh
```
Yay! You did it! Now we can actually run PrimerClip on the .sam files sorted by name. This took about 25 minutes for 48 individuals.
```bash!
conda activate primerclip
nano primerclip.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=6GB # Adjust memory based on file sizes
#SBATCH --job-name=primerclip
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=primerclip.out.error
#SBATCH --output=primerclip.out
#Load Conda Environment
INDS=($(for i in ./*_nsort.sam; do echo $(basename ${i%_nsort*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
primerclip /projects/p32585/Elm/genomes/cp927_Masterfile.txt ${IND}_nsort.sam ${IND}_clipped.sam;
done
#exit
sbatch primerclip.sh
```
---
### Step 5: On-target and coverage uniformity calculation
***Goal:** See which loci amplified (on-target reads) and how uniform the coverage is!*
**Software:** Picard tools
**Resources:**
- **[Website](https://broadinstitute.github.io/picard/)**
- **[Github](https://github.com/broadinstitute/picard/blob/master/README.md)**
- **[Picard tool for .bed file to interval list](https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard)**
- **[Bedtools guide](http://quinlanlab.org/tutorials/bedtools.html)**
**Input**
- Clipped and sorted bam files
To make the clipped and sorted .bam files I consulted Copilot and used this code. I just ran it in bash; it took about 5 minutes.
```bash!
nano sam2bam_sort.sh
#!/bin/bash
# Directory containing SAM files
INPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/clipped_sams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/clipped_bams/"
# Convert each SAM file to BAM format
for samfile in "$INPUT_DIR"/*.sam; do
if [[ -f "$samfile" ]]; then
bamfile="$OUTPUT_DIR/$(basename "$samfile" .sam).bam"
echo "Converting $samfile to $bamfile..."
samtools view -Sb "$samfile" -o "$bamfile"
fi
done
echo "Conversion completed."
#exit
module load samtools
chmod +x sam2bam_sort.sh
./sam2bam_sort.sh
```
**CollectAlignmentSummary**
This is what you use when you align to the genome. You can do this with the elm genome and the 3 disease genomes.
```bash!
#try it on one sample CollectAlignmentSummaryMetrics
cd /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectAlignmentSummaryMetrics \
R=/projects/p32585/Elm/genomes/genome.fa \
I=NA01_clipped_sorted.bam \
O=NA01_alignment_metrics.txt
#good it worked now lets do all of them
nano picard_align.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=6GB # Adjust memory based on file sizes
#SBATCH --job-name=picard
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=picard.out.error
#SBATCH --output=picard.out
module load picard # Load Picard module if necessary
PICARD_JAR="/software/picard/1.131/picard-tools-1.131/picard.jar"
REFERENCE="/projects/p32585/Elm/genomes/genome.fa"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/"
for bamfile in /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/*_clipped_sorted.bam; do
filename=$(basename "$bamfile" .bam)
java -Xmx12G -jar "$PICARD_JAR" CollectAlignmentSummaryMetrics \
R="$REFERENCE" \
I="$bamfile" \
O="$OUTPUT_DIR/${filename}_alignment_metrics.txt"
done
#exit
sbatch picard_align.sh
#run multiqc
multiqc .
```
#### Elm
Results: [link to multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/multiqc_report.html#picard-alignmentsummary)

**CollectTargetedPcrMetrics**
In order to do this I needed to realign the reads to cp927_OnTarget_105Refs.fasta. This is the fasta file that IDT sent me of all the gene regions that made it into the final amplicon panel.
Then we did two different runs of CollectTargetedPcrMetrics. The first run uses the full target+primer insert for both interval lists. The second run separates the primer and the amplicon target region.

In the diagram (not shown here), the amplicon/primer is the orange/reddish bit on the end, the target is the green part, and the whole purple thing is the insert.
```bash!
#start by indexing the fasta file that we aligned to
samtools faidx cp927_OnTarget_105Refs.fasta
#this is the format to run on one sample
java -jar picard.jar CollectTargetedPcrMetrics \
I=input.bam \
O=output_pcr_metrics.txt \
R=reference.fasta \
AMPLICON_INTERVALS=amplicon.interval_list \
TARGET_INTERVALS=targets.interval_list
```
Full insert: we are also interested in the variation in the primers/amplicons, not just the target regions.
```bash!
#first we are going to do the same interval list because we are ok with keeping the primers
#our version trial on NA01
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I=NA01_clipped.sam \
O=NA01_output_pcr_metrics.txt \
R=/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta \
AMPLICON_INTERVALS=/projects/p32585/Elm/genomes/cp927_amplicon_interval_list \
TARGET_INTERVALS=/projects/p32585/Elm/genomes/cp927_Assays_interval_list
nano pcr_metrics_same_interval.sh
#slurm script
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on file sizes
#SBATCH --job-name=pcr_metrics_same
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=pcr_metrics_same.out.error
#SBATCH --output=pcr_metrics_same.out
# Define input and output directories
SAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
AMPLICON_INTERVALS="/projects/p32585/Elm/genomes/cp927_amplicon_interval_list"
TARGET_INTERVALS="/projects/p32585/Elm/genomes/cp927_Assays_interval_list"
# Loop through all clipped SAM files
for SAM_FILE in "$SAM_DIR"/*_clipped.sam; do
SAMPLE_ID=$(basename "$SAM_FILE" _clipped.sam)
OUTPUT_FILE="$OUTPUT_DIR/${SAMPLE_ID}_output_pcr_metrics.txt"
echo "Processing $SAMPLE_ID..."
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I="$SAM_FILE" \
O="$OUTPUT_FILE" \
R="$REFERENCE" \
AMPLICON_INTERVALS="$AMPLICON_INTERVALS" \
TARGET_INTERVALS="$TARGET_INTERVALS"
echo "Completed PCR Metrics for $SAMPLE_ID!"
done
echo "All samples processed!"
#exit
sbatch pcr_metrics_same_interval.sh
#job 3694727
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/
multiqc .
```
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/multiqc_report.html)

Primers/amplicons + targets: this time we used the separate bed files from IDT that have the coordinates for the amplicons and the target regions.
```bash!
#second we are going to do it with the amplicon intervals and targets from the bed files IDT sent
#this is so we can see how well it worked for our target region
#make the interval lists
cd /projects/p32585/Elm/genomes/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar BedToIntervalList \
I=cp927_Inserts_v5.bed \
O=cp927_Inserts_v5_interval_list \
SD=cp927_Assays.dict
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar BedToIntervalList \
I=cp927_Primers_v5.bed \
O=cp927_Primers_v5_interval_list \
SD=cp927_Assays.dict
#yay those both worked!
cd /projects/p32585/Elm/batch_1/post_trimming/redo_aligned_IDT_fasta/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I=NA01_clipped.sam \
O=NA01_2intervals_output_pcr_metrics.txt \
R=/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta \
AMPLICON_INTERVALS=/projects/p32585/Elm/genomes/cp927_Primers_v5_interval_list \
TARGET_INTERVALS=/projects/p32585/Elm/genomes/cp927_Inserts_v5_interval_list
#yay it worked!!!
nano pcr_metrics_2_interval.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on file sizes
#SBATCH --job-name=pcr_metrics_same
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=pcr_metrics_same.out.error
#SBATCH --output=pcr_metrics_same.out
# Define input and output directories
SAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
AMPLICON_INTERVALS="/projects/p32585/Elm/genomes/cp927_Primers_v5_interval_list"
TARGET_INTERVALS="/projects/p32585/Elm/genomes/cp927_Inserts_v5_interval_list"
# Loop through all clipped SAM files
for SAM_FILE in "$SAM_DIR"/*_clipped.sam; do
SAMPLE_ID=$(basename "$SAM_FILE" _clipped.sam)
OUTPUT_FILE="$OUTPUT_DIR/${SAMPLE_ID}_2_intervals_output_pcr_metrics.txt"
echo "Processing $SAMPLE_ID..."
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I="$SAM_FILE" \
O="$OUTPUT_FILE" \
R="$REFERENCE" \
AMPLICON_INTERVALS="$AMPLICON_INTERVALS" \
TARGET_INTERVALS="$TARGET_INTERVALS"
echo "Completed PCR Metrics for $SAMPLE_ID!"
done
echo "All samples processed!"
#exit
sbatch pcr_metrics_2_interval.sh
#job 3695173
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/
multiqc .
```
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/multiqc_report.html)

#### Coverage of Loci
To calculate the coverage of each locus we used bedtools coverage and saved the output as a .txt file. Then we converted it to a csv file and extracted the column that has the fraction of coverage at each locus of interest. From there, we put it in R and graphed it.
```bash!
##########################################################
#calculate the coverage
##########################################################
nano bedtools_fraction_coverage.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4 # Test with 8 threads
#SBATCH --mem=16GB # Adjusted memory based on previous usage
#SBATCH --job-name=coverage_csv
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=coverage_csv.err
#SBATCH --output=coverage_csv.out
# Load required modules
module load samtools
module load bedtools
# Set input/output paths
BED_FILE="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_interval.bed"
BAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Loop through BAM files and save Bedtools coverage output
for BAM_FILE in "$BAM_DIR"/*_clipped_sorted.bam; do
SAMPLE_ID=$(basename "$BAM_FILE" _clipped_sorted.bam)
OUTPUT_FILE_TXT="$OUTPUT_DIR/${SAMPLE_ID}_coverage.txt"
# Run bedtools coverage and save output
bedtools coverage -a "$BED_FILE" -b "$BAM_FILE" > "$OUTPUT_FILE_TXT"
echo "Saved coverage results for $SAMPLE_ID."
done
echo "Bedtools coverage analysis completed! Results saved in: $OUTPUT_DIR"
#exit
sbatch bedtools_fraction_coverage.sh
##########################################################
#convert to csv
##########################################################
#try to convert each .txt file into a csv
nano txt2csv.sh
#!/bin/bash
# Set input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Loop through all `.txt` files and convert to `.csv`
for TXT_FILE in "$INPUT_DIR"/*_coverage.txt; do
SAMPLE_ID=$(basename "$TXT_FILE" _coverage.txt)
CSV_FILE="$OUTPUT_DIR/${SAMPLE_ID}_coverage.csv"
# Convert tab-separated text to properly formatted CSV
awk 'BEGIN{FS="\t"; OFS=","} {print $1, $2, $3, $4, $5, $6, $7, $8}' "$TXT_FILE" > "$CSV_FILE"
echo "Converted: $TXT_FILE → $CSV_FILE"
done
echo "Batch conversion complete! CSV files saved in: $OUTPUT_DIR"
##########################################################
#extract the fraction of coverage column
##########################################################
nano fractions.sh
#!/bin/bash
# Set directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/"
OUTPUT_CSV="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/coverage_fractions.csv"
# Initialize CSV file with headers
echo -n "Sample_Name" > "$OUTPUT_CSV"
for CSV_FILE in "$INPUT_DIR"/*_coverage.csv; do
SAMPLE_ID=$(basename "$CSV_FILE" _coverage.csv)
echo -n ",$SAMPLE_ID" >> "$OUTPUT_CSV"
done
echo "" >> "$OUTPUT_CSV"
# Extract **column G (7th column)** from each file and write to CSV
for CSV_FILE in "$INPUT_DIR"/*_coverage.csv; do
awk -F',' '{print $7}' "$CSV_FILE" > "/tmp/$(basename "$CSV_FILE" .csv)_colG.txt"
done
paste -d ',' /tmp/*_colG.txt >> "$OUTPUT_CSV"
# Cleanup temporary files
rm /tmp/*_colG.txt
echo "Extraction complete! Column G summary saved in: $OUTPUT_CSV"
##########################################################
#switch to R and graph it
##########################################################
library(tidyverse)
###import data###
fractions=read.csv("coverage_fractions_final.csv", header = T)
###reformat data###
##right now we have the first column as the region names and each subsequent column is a sample and the fractions
bedtools_fract_df=data.frame(fractions)
bedtools_cor=data.frame(t(bedtools_fract_df[-1]))
colnames(bedtools_cor) <- bedtools_fract_df[, 1]
#yay now our columns are the gene region and the row are our samples
####let's take a look at which regions didn't amplify####
long_data <- pivot_longer(bedtools_cor, cols = everything())
fractions_summary <- long_data %>%
group_by(name) %>%
summarize(
n.val = n(),
mean.val = mean(value),
sd.val = sd(value),
se.val = sd.val/sqrt(n.val),
lci = mean.val - 1.96 * se.val,
uci = mean.val + 1.96 * se.val)
ggplot(data = fractions_summary,
aes(x = name, y = mean.val, ymin = lci, ymax = uci, color = name)) +
geom_point(size = 2, show.legend = FALSE) +
geom_errorbar(width = 1, show.legend = FALSE) +
ylab("Fraction of Coverage") +
xlab("Gene Region") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
```

#### Ophiostoma novo-ulmi (DED)
I used the same script to run CollectAlignmentSummary statistics.
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_ono/alignment_metrics/multiqc_report.html)

#### Phytoplasma
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_phytoplasma/alignment_metrics/multiqc_report.html)

#### Dothiorella (Mal secco) *need to update for pleno
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_dothiorella/alignment_metrics/multiqc_report.html)

---
### Step 6: Genome size estimation
***Goal:** Determine ploidy because GATK can only handle one ploidy at a time and it needs to be specified.*
**Software:** Jellyfish tools (kmers), ploidyNGS, nQuire.
**Resources:**
- **[GATK notes on ploidy](https://gatk.broadinstitute.org/hc/en-us/articles/360035889691-Does-GATK-work-on-non-diploid-organisms)**
    - So it works on polyploids BUT you need to specify the ploidy of each individual
    - Requires each group to have identical ploidy, so we will need to separate samples by ploidy before calling SNPs
**Input**
- aligned reads but in a fastq format
Average genome sizes for each ploidy are from Whittemore & Olson 2011.

**kmer plots**
- **[Github](https://github.com/gmarcais/Jellyfish)**
- **[Tutorial](https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/)**
The kmer estimation tool didn't work, so I'm not going to bother transcribing everything here. There were no peaks. It is supposed to be used on WGS data, and we have whole genes so I thought it might work out, but it didn't.
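For the record, the basic shape of the jellyfish commands is below (a sketch with illustrative parameters and placeholder file names, not the exact commands I ran; k=21 is just a common choice):
```bash!
#count canonical 21-mers from the reads, then build the k-mer depth histogram
#that is supposed to show a coverage peak (ours didn't)
jellyfish count -m 21 -s 100M -t 8 -C <(zcat SampleID_R1_paired.fastq.gz) -o SampleID_mers.jf
jellyfish histo SampleID_mers.jf > SampleID_histo.txt
```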
**ploidyNGS**
- **[Documentation](https://academic.oup.com/bioinformatics/article/33/16/2575/3104472)**
- **[Github](https://github.com/diriano/ploidyNGS)**
Suzy sent this to try, so let's go! For this we need aligned and sorted .bam files, which we already have! I tried running it on the samples aligned to the reference genome and on the ones aligned to the fasta files; the genome-aligned run was much more accurate but still not great. This slurm script should create a log file for each sample's ploidy estimation and collect the estimated ploidies into one csv file.
:::spoiler Pipeline
```bash!
cd /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/
#start by creating a conda environment to download and run ploidyNGS
conda create --name ploidyNGS
#clone the program from github
git clone https://github.com/diriano/ploidyNGS.git
#load R and install ggplot2
module load R/4.4.0
R
install.packages("ggplot2")
#press ctrl+d to exit R
# to actually run it
# Load necessary modules and activate environment
conda activate ploidyNGS
cd ~/ploidyNGS
module load python-miniconda3
module load R/4.4.0
module load nano
source activate ploidyNGS # Use Conda's activation instead
nano slurm_ploidy.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=10GB # Adjust memory based on file sizes
#SBATCH --job-name=slurm_ploidy
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=slurm_ploidy.out.error
#SBATCH --output=slurm_ploidy.out
# Set input and output directories
BAM_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/ploidy_estimation/"
CSV_FILE="$OUTPUT_DIR/ploidy_results.csv"
# Initialize CSV file with headers
echo "Sample_ID,Closest_Ploidy" > "$CSV_FILE"
# Loop through BAM files and run ploidyNGS
for BAM_FILE in "$BAM_DIR"/*_sorted.bam; do
SAMPLE_ID=$(basename "$BAM_FILE" _sorted.bam)
OUTPUT_LOG="$OUTPUT_DIR/${SAMPLE_ID}_ploidy.log"
echo "Running ploidy estimation for $SAMPLE_ID..."
# Pass sample name explicitly to ploidyNGS
./ploidyNGS.py --out "$OUTPUT_DIR" --bam "$BAM_FILE" --guess_ploidy > "$OUTPUT_LOG" 2>&1
# Extract the closest ploidy value from the output
PLOIDY=$(grep "the closest ploidy to yours is" "$OUTPUT_LOG" | awk '{print $NF}')
# Append results to the CSV file
echo "$SAMPLE_ID,$PLOIDY" >> "$CSV_FILE"
echo "Processed $SAMPLE_ID → Closest Ploidy: $PLOIDY"
done
echo "Ploidy estimation completed! Results saved in: $CSV_FILE"
#exit
sbatch slurm_ploidy.sh
```
:::
*might want to increase the computing power for this; it took about 1.5 hours
**nQuire**
nQuire is a piece of software that Dylan sent, and I found a paper that said it works with herbarium specimens... they lied. It was about 56% accurate. That's not great, but better than ploidyNGS.
**Resources:**
- **[Paper about the pipeline for herbaria](https://pmc.ncbi.nlm.nih.gov/articles/PMC6667659/)**
- **[Github](https://github.com/clwgg/nQuire)**
:::spoiler Code for a single sample
```bash!
#to get started I want to try on one sample
#using NA02 because it has a known ploidy
#first everything needs to be sorted
samtools sort NA02_clipped.bam -o NA02_cs.bam
#start by making the bin file
nQuire create -b NA02_cs.bam -o NA02
#now to get rid of all the noise
nQuire denoise NA02.bin -o NA02_denoised
#Positions to analyse:
# Before: 3838
# After: 1440 (37.5%)
#now to run the actual model
nQuire lrdmodel NA02.bin
#file free dip tri tet d_dip d_tri d_tet
#NA02.bin 2439.456970 1576.276581 1718.501688 1887.025762 863.180388 720.955281 552.431208
#its giving tetraploid but its a diploid
#try another model
nQuire modeltest NA02.bin
#final diploid loglik: 1576.276581
# sd1 = 0.197052
#final triploid loglik: 1718.501688
# sd1 = 0.124503
# sd2 = 0.124539
#final tetraploid loglik: 1887.025762
# sd1 = 0.079604
# sd2 = 0.076289
# sd3 = 0.079892
#ok one more model test
nQuire estmodel NA02.bin
#more numbers
#histogram test
nQuire histotest NA02.bin
#Diploid:
# Norm SSR: 0.0379481
# y-y slope: 0.120563, with std.Err: 0.0544135
# r^2: 0.0768159
#Triploid:
# Norm SSR: 0.0324301
# y-y slope: -0.157796, with std.Err: 0.0798268
# r^2: 0.0621147
#Tetraploid:
# Norm SSR: 0.0155547
# y-y slope: -0.0117182, with std.Err: 0.128294
# r^2: 0.000141382
#well I know this isn't a tetraploid but the model says it is
```
:::
:::spoiler Here's the full pipeline
```bash!
#lets goooooo
nano nquire_batch.sh
#!/bin/bash
# Set paths
BAM_DIR="/data/labs/Fant/Quigg/batch_1"
OUT_DIR="/data/labs/Fant/Quigg/ploidy_tests"
SORTED_DIR="/data/labs/Fant/Quigg/batch_1/sorted_bams"
SUMMARY_FILE="$OUT_DIR/ploidy_summary.tsv"
THREADS=5
# Make output directories
mkdir -p "$SORTED_DIR" "$OUT_DIR"
# Initialize summary file
echo -e "Sample\tDiploid_logL\tTriploid_logL\tTetraploid_logL\tBest_Ploidy" > "$SUMMARY_FILE"
# Loop through BAMs
for BAM in "$BAM_DIR"/*.bam; do
SAMPLE=$(basename "$BAM" .bam)
echo "Processing $SAMPLE..."
# Step 1: Sort BAM
SORTED_BAM="$SORTED_DIR/${SAMPLE}_sorted.bam"
samtools sort -o "$SORTED_BAM" "$BAM"
# Step 2: nQuire pipeline
BIN_PREFIX="$OUT_DIR/$SAMPLE"
DENOISED="$OUT_DIR/$SAMPLE.denoised.bin"
MODEL_OUT="${BIN_PREFIX}.model.txt"
nQuire create -b "$SORTED_BAM" -o "$BIN_PREFIX"
nQuire denoise "${BIN_PREFIX}.bin" -o "${BIN_PREFIX}.denoised"
nQuire lrdmodel -t $THREADS "${BIN_PREFIX}.denoised.bin" > "${BIN_PREFIX}.model.txt"
# Step 3: Parse results
DIP=$(awk '/diploid/ {print $2}' "$MODEL_OUT")
TRI=$(awk '/triploid/ {print $2}' "$MODEL_OUT")
TET=$(awk '/tetraploid/ {print $2}' "$MODEL_OUT")
BEST=$(printf "diploid\t$DIP\ntriploid\t$TRI\ntetraploid\t$TET" | sort -k2 -n | tail -n1 | cut -f1)
# Step 4: Record summary
echo -e "$SAMPLE\t$DIP\t$TRI\t$TET\t$BEST" >> "$SUMMARY_FILE"
done
echo "✅ Batch ploidy analysis complete. Results saved to $SUMMARY_FILE"
#exit
chmod +x nquire_batch.sh
./nquire_batch.sh
#that didnt extract the values properly so this is the script to correctly do that
cd /data/labs/Fant/Quigg/ploidy_tests
nano ploidy_tsv.sh
#!/bin/bash
# Directory containing your model.txt files
MODEL_DIR="/data/labs/Fant/Quigg/ploidy_tests"
SUMMARY_FILE="$MODEL_DIR/ploidy_summary.tsv"
echo -e "Sample\tDiploid_logL\tTriploid_logL\tTetraploid_logL\tBest_Ploidy" > "$SUMMARY_FILE"
for MODEL in "$MODEL_DIR"/*.model.txt; do
SAMPLE=$(basename "$MODEL" .model.txt)
# Extract log-likelihoods from the second line (skip header)
read -r _ FREE DIP TRI TET _ < <(tail -n 1 "$MODEL")
if [[ -z "$DIP" || -z "$TRI" || -z "$TET" ]]; then
echo -e "$SAMPLE\tNA\tNA\tNA\tFAILED"
echo -e "$SAMPLE\tNA\tNA\tNA\tFAILED" >> "$SUMMARY_FILE"
continue
fi
# Determine best ploidy (based on highest log-likelihood)
BEST=$(printf "diploid\t$DIP\ntriploid\t$TRI\ntetraploid\t$TET" | sort -k2 -n | tail -n1 | cut -f1)
echo -e "$SAMPLE\t$DIP\t$TRI\t$TET\t$BEST" >> "$SUMMARY_FILE"
done
echo "✅ Updated summary complete: $SUMMARY_FILE"
#exit
chmod +x ploidy_tsv.sh
./ploidy_tsv.sh
```
:::
**nQuack**
nQuack was recommended by Shelly Gaynor, the wonderful person who created it. She worked in the Soltis lab at UF and tried to do her PhD on PopGen of plants with similar ploidy problems, then rerouted to focus on creating techniques and programs to handle ploidy problems like this because she didn't like any of the software available. She recommended I analyze each cytotype (individuals of the same ploidy) separately.
**Resources:**
- **[Github](https://github.com/mgaynor1/nQuack)**
- **[Tutorial](https://mlgaynor.com/nQuack/articles/DataPreparation.html)**
- **[Example with known ploidy](https://mlgaynor.com/nQuack/articles/BasicExample.html#bootstrap-replicates)**
---
### Step 7: SNP calling
***Goal:** Find any SNPs present in the genes!*
**Software:** GATK tools
**Resources:**
- **[Website](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)**
- **[Best Practices](https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices)**
- **[Notes on Ploidy](https://gatk.broadinstitute.org/hc/en-us/articles/360035889691-Does-GATK-work-on-non-diploid-organisms)**
- So it works on polyploids BUT you need to specify the ploidy of each individual (just going to run as if everything is diploid, will have excess heterozygosity)
**Input**
- Clipped and sorted bam files
- Remove or mark duplicate reads (MarkDuplicates), add read groups (AddOrReplaceReadGroups), and index (samtools index)
**Output**
- VCF or GVCF file with raw, unfiltered SNP and indel calls
Before running the slurm script a couple notes:
- You need to prepare the genome or reference, I have code in the slurm to do this but it's commented out.
- You have to load an updated version of Java or it will fail.
- Make sure you give it plenty of time: I gave it 12 hours to do 48 samples and it timed out after getting 46 samples done, so budget ~30 minutes per sample.
```bash!
conda activate gatk
nano gatk_processing.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G # Adjust memory based on previous usage
#SBATCH --job-name=gatk_processing
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=gatk_processing.err
#SBATCH --output=gatk_processing.out
# Load required modules
module load gatk/4.4.0.0
module load samtools
module load picard
module load java/jdk-17.0.2+8
# Set input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/clipped_sorted_bams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Prepare reference genome (only need to do this once)
#gatk CreateSequenceDictionary -R /projects/p32585/Elm/genomes/genome_elm.fa
#samtools faidx /projects/p32585/Elm/genomes/genome_elm.fa
#echo "Reference genome prepared."
# Loop through all BAM files in the input directory
for INPUT_BAM in "$INPUT_DIR"/*.bam; do
SAMPLE_ID=$(basename "$INPUT_BAM" .bam)
SORTED_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_sorted.bam"
MD_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_md.bam"
RG_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_rg.bam"
METRICS_FILE="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_marked_dup_metrics.txt"
OUTPUT_VCF="$OUTPUT_DIR/${SAMPLE_ID}.g.vcf.gz"
echo "Processing sample: $SAMPLE_ID..."
# Step 1: Sort BAM file
samtools sort "$INPUT_BAM" -o "$SORTED_BAM"
echo "Sorting complete for $SAMPLE_ID."
# Step 2: Mark duplicates
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar MarkDuplicates \
I="$SORTED_BAM" \
O="$MD_BAM" \
M="$METRICS_FILE" \
REMOVE_DUPLICATES=false \
VALIDATION_STRINGENCY=SILENT
echo "Duplicate marking complete for $SAMPLE_ID."
# Step 3: Add read groups
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar AddOrReplaceReadGroups \
I="$MD_BAM" \
O="$RG_BAM" \
RGID="$SAMPLE_ID" \
RGLB=lib1 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM="$SAMPLE_ID" \
VALIDATION_STRINGENCY=SILENT
echo "Read group addition complete for $SAMPLE_ID."
# Step 4: Index the BAM file
samtools index "$RG_BAM"
echo "Indexing complete for $SAMPLE_ID."
# Step 5: Run GATK HaplotypeCaller
gatk HaplotypeCaller \
-R /projects/p32585/Elm/genomes/genome_elm.fa \
-I "$RG_BAM" \
-O "$OUTPUT_VCF" \
-ERC GVCF
echo "Variant calling complete for $SAMPLE_ID. Output saved to: $OUTPUT_VCF"
done
echo "Batch processing of all samples completed!"
#exit
sbatch gatk_processing.sh
```
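If a batch ever does need to be called as something other than diploid (say, the tetraploids once the cytotypes are separated), HaplotypeCaller takes a ploidy argument. A minimal sketch of how that one step would change (not something run above, where everything was treated as diploid):
```bash!
#same HaplotypeCaller call as in the script above, but telling GATK the sample
#is tetraploid; --sample-ploidy (alias -ploidy) defaults to 2
gatk HaplotypeCaller \
    -R /projects/p32585/Elm/genomes/genome_elm.fa \
    -I "$RG_BAM" \
    -O "$OUTPUT_VCF" \
    -ERC GVCF \
    --sample-ploidy 4
```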
So this next part prepares them to be put into R for analysis and visualization. It looks like ploidy is NOT a problem here (yay!). We need to combine the gvcfs into one cohort gvcf and then genotype it.
```bash!
#note to future Marne: run this as a slurm it took soooooo long (like 80 minutes)
nano combine_gvcf.sh
#!/bin/bash
# Load necessary modules
module load gatk/4.4.0.0
# Define input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/"
OUTPUT_GVCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/cohort.g.vcf.gz"
REFERENCE="/projects/p32585/Elm/genomes/genome_elm.fa"
# Initialize an empty variant list
GVCF_LIST=""
# Loop through all `.g.vcf.gz` files in the folder and add to GVCF_LIST
for GVCF in "$INPUT_DIR"/*.g.vcf.gz; do
GVCF_LIST+=" --variant ${GVCF}"
done
# Run GATK CombineGVCFs
gatk CombineGVCFs \
-R "$REFERENCE" \
$GVCF_LIST \
-O "$OUTPUT_GVCF"
echo "GVCF combination complete! Output saved at: $OUTPUT_GVCF"
#exit
chmod +x combine_gvcf.sh
./combine_gvcf.sh
#now genotype the output of that one
nano genotype_gvcf.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=32G # Adjust memory based on previous usage
#SBATCH --job-name=genotype_gvcf
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=genotype_gvcf.err
#SBATCH --output=genotype_gvcf.out
# Load required module
module load gatk/4.4.0.0
# Define input/output files
REFERENCE="/projects/p32585/Elm/genomes/genome_elm.fa"
COHORT_GVCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/cohort.g.vcf.gz"
FINAL_VCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/final.vcf.gz"
echo "Running GenotypeGVCFs on $COHORT_GVCF..."
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R "$REFERENCE" \
-V "$COHORT_GVCF" \
-O "$FINAL_VCF"
echo "Genotyping complete! Output saved at: $FINAL_VCF"
#exit
sbatch genotype_gvcf.sh
```
---
### VCF Filtering
So this is one of those points where we need to know ploidy. Why? Because you need to specify whether each sample carries 2 or 4 allele copies at each locus (a starting sketch for the diploid case is below).
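As a placeholder until the ploidy question is settled, here is a minimal filtering sketch for the diploid case (it assumes bcftools and vcftools are available; the thresholds are illustrative, not project decisions):
```bash!
#keep only biallelic SNPs from the genotyped VCF (final.vcf.gz from the GenotypeGVCFs step)
bcftools view -m2 -M2 -v snps final.vcf.gz -Oz -o final_biallelic_snps.vcf.gz
#basic site-level filters: minimum quality, minimum mean depth, maximum missingness
vcftools --gzvcf final_biallelic_snps.vcf.gz \
    --minQ 30 --min-meanDP 10 --max-missing 0.8 \
    --recode --recode-INFO-all --out final_filtered
```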
## Microsatellites
## Analysis and Visualization
- Tutorials from Landscape Genetics
- [How to import VCFs](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_1.html#WE_1)
- [How to measure Genetic Diversity](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_3.html#WE_3)
- [How to look at Population Structure](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_9.html#WE_9)
## Objective 1: Analyze range-wide genetic structure and areas of high genetic diversity.
## Objective 2: Determine if geographic barriers are correlated to genetic differentiation and ploidy differences of these genetic clusters (e.g. geographic regions or watersheds).
## Objective 3: Compare current and historic genetic diversity and identify if there is evidence of a bottleneck caused by DED.
## Objective 4: Support establishment of a seed provenancing plan for U. americana restoration that maximizes genetic diversity, disease resistance, and local adaptation, then distribute to restoration managers.
## Objective 5: Identify if any of the trees sampled were infected with disease.