---
title: "Amplicons!!! (Marne's version)"
breaks: FALSE
tags: RADSeq
---
Marne Quigg (p32585)
---
# Amplicons!!! (Marne's version)
### Quick links
- **[Powerpoint from Suzy](https://github.com/CBG-Conservation-Genomics/BotanyWorkshop2024/blob/main/2.BaselineSkills/2.BaseSkills.pptx)**
- **[Linux commands from Suzy](https://github.com/CBG-Conservation-Genomics/BotanyWorkshop2024/blob/main/2.BaselineSkills/2.Bioinfo_unix_command_cheatsheet.pdf)**
- **[Quest User Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=505)**: Extensive manual on Quest
- **[Slurm Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964#section-partitions)**: In depth guide for submitting jobs to Quest using Slurm
- **[Basic Research Computing Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=2004#quest-fundamentals)**: Suite of various tutorial videos
- **[Globus](https://app.globus.org/file-manager)**: Globus file transfer service for up/downloading files to Quest
- **[Job Resource Logs](https://docs.google.com/spreadsheets/d/1P1zpt1aznNnXwMjUfynWioOXvGkjydFfhUBG_EppNMQ/edit?usp=sharing)**: List of jobs and their associated resources/efficiencies
- **[Jupyter Hub](https://jupyter.questanalytics.northwestern.edu)**: Link to Jupyter Hub (requires NU VPN)
---
## Pre-analysis Processing
1) Organize and FastQC: make sure sequencing went well
2) Trimmomatic: Illumina Adapter trimming and filtering out reads <30 bases long
3) BWA-MEM: Alignment to the reference genome (or to the amplicon fasta; both are tried in Step 3)
4) Primerclip or TrimPrimers: trim primers
5) Picard tools: On-target and coverage uniformity calculation
6) Variant calling (GATK HaplotypeCaller for germline calls; LoFreq, Mutect, or VarDict for somatic or low-frequency variant calls)
---
### Step 1: Organize and FastQC
***Goal:** Standardize file names and check the quality of the sequencing data!*
**Software:** FastQC and MultiQC
**Resources:**
- **[Interpretation](https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/)**
**Organize!**
Let's start by standardizing the file names...
Current format: SampleID_S##_R#_001.fastq.gz
Desired format: SampleID_R#.fastq.gz
```bash!
#start by navigating to the correct folder
cd /projects/p32585/Elm/'Batch 1'/
pwd #yay you're in the right place
#to remove the S##:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001_:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
#adapted from Brenden's code
```
**FastQC**
This allows us to look for sequencing errors or samples that did not sequence well.
``` bash!
#start by activating fastqc
module load fastqc
#check the version and record it (important for publication)
fastqc --version
#version _____
#next we want to open up a text editor to write out the script
nano fastqc.sh
#the below code will run a loop and do fastqc on all the samples in the folder
#!/bin/bash
output_dir="batch_1_qc_results" #set the output directory
mkdir -p "$output_dir" #create the output directory if it doesn't exist
for file in *_*.fastq.gz; do fastqc -t 40 -o "$output_dir" "$file"
done
#press ctrl+x then Y then enter to save it
#change file permissions so you can run it
chmod +x fastqc.sh
#time to run it!
bash fastqc.sh
#Rose taught me how to do this
```
**MultiQC**
This combines all of the individual reports from the FastQC into one so you don't have to look at individual reports for each sample.
``` bash!
#load the module
module load multiqc
multiqc --version
#version 1.25.1
#make a home for the multiqc output
mkdir -p /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/multiqc
cd /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/multiqc
ln -s -f /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/batch_1_qc_results/
#go to where the output will be
cd /projects/p32585/Elm/'Batch 1'/batch_1_qc_results/
#run it
multiqc .
```
**Duplicate Original Data**
I'm a little scared of accidentally *damaging* my data, so I'm going to copy the folder just to have a backup set of data.
``` bash!
#format
cp -r original_folder copy_folder
#ran
cp -r 'Batch 1' copy_batch_1
```
*So that didn't actually work due to file permissions, and apparently none of these programs alter your original files anyway, so I'm just gonna continue.
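If a backup copy is ever actually needed, two things worth trying for the permission errors (a sketch only, untested here; neither helps if the source files simply aren't readable by your account):
```bash!
#copy without trying to preserve the original mode/ownership
cp -r --no-preserve=mode,ownership 'Batch 1' copy_batch_1
#or rsync without carrying over owner/group
rsync -a --no-o --no-g 'Batch 1'/ copy_batch_1/
```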
---
### Step 2: Trimming
***Goal:** Remove Illumina adapters and filter out reads <30 bp.*
**Software:** Trimmomatic (includes a .fasta file with the adapter sequences - has the universal adapter)
**Resources:**
- **[Github](https://github.com/usadellab/Trimmomatic/tree/main)**
- **[USA Del Lab](http://www.usadellab.org/cms/?page=trimmomatic)**
- **[Very helpful PDF](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf)**
Had to move everything onto Curie because I ran out of storage on Quest :( but here is an updated (6/10/2025) script on how to run Trimmomatic.
```bash!
#get all the softwares
conda activate bioconda
conda install bioconda::trimmomatic #version 0.39
conda install bioconda::java-jdk
conda install -c conda-forge openjdk #version 1.8.0_92
cd /data/labs/Fant/Quigg/all_data
#time to run it
nano trimmomatic2.sh
#!/bin/bash
# For loop to process all R1 samples
for R1 in *_R1.fastq.gz; do
R2=${R1//_R1.fastq.gz/_R2.fastq.gz}
R1p=${R1//.fastq.gz/_paired.fastq.gz}
R1u=${R1//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
# Run Trimmomatic
trimmomatic PE -phred33 \
-threads 20 "$R1" "$R2" "$R1p" "$R1u" "$R2p" "$R2u" \
ILLUMINACLIP:/home/imh4101/miniconda3/envs/bioconda/share/trimmomatic-0.39-2/adapters/TruSeq3-PE.fa:2:30:10:2:true \
LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40
done
#exit
#I added 20 threads and changed the path to trimmomatic
#change permissions
chmod +x trimmomatic2.sh
screen -L bash trimmomatic2.sh
#to disconnect the screen (keep it running in the background)
#type this into the screen itself
#press ctrl+a, then d
#to cancel the job
screen -ls
screen -X -S 2801504.pts-10.curie quit
```
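A quick optional sanity check after trimming (a sketch, not part of the pipeline; `SampleID` is a placeholder): the paired R1 and R2 outputs should contain the same number of reads.
```bash!
#count reads in the trimmed paired files (4 FASTQ lines = 1 read);
#the two numbers should match if Trimmomatic kept the pairs in sync
echo $(( $(zcat SampleID_R1_paired.fastq.gz | wc -l) / 4 ))
echo $(( $(zcat SampleID_R2_paired.fastq.gz | wc -l) / 4 ))
```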
---
### Step 3: Alignment
***Goal:** Align the samples to the reference genome.*
**Software:** BWA-MEM (can only do 96 samples at a time), SAMtools, Picard
**Resources:**
- **[Github](https://github.com/lh3/bwa)**
- **[Website](https://bio-bwa.sourceforge.net/)**
- **[Illumina resource](https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/bwa-aligner.html)**
We are going to try two techniques for this, Brenden's (align to the amplicon fasta) and Suzy's (align to the genomes), and compare to see which one works better.
***I did a few different alignment trials. The first was to a csv of all potential regions that I sent to IDT a long time ago (don't use that). Then I aligned to the elm genome plus the 3 disease genomes (good); I had the best ploidy estimations on samples aligned to the elm genome, and for these you can use Picard's CollectAlignmentSummaryMetrics. Then I aligned to the fasta file that came from IDT of the target regions so I could run Picard's CollectTargetedPcrMetrics.**
**Script that will do BWA -> Primerclip**
```bash!
#this starts with bwa alignment and ends with primerclip
cd /projects/p32585/Elm/batch_1/post_trimming/
#Don't forget to index
bwa index /projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta
nano pipeline.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=40G # Adjust memory based on previous usage
#SBATCH --job-name=full_pipeline
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=full_pipeline.err
#SBATCH --output=full_pipeline.out
### Step 1: BWA MEM Alignment ###
module purge
module load bwa
FASTQ_DIR="/projects/p32585/Elm/batch_1/post_trimming/paired_unpaired_data/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
ALIGNMENT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/"
# Create output directory if needed
mkdir -p "$ALIGNMENT_DIR"
# Align all paired-end FASTQ files with BWA MEM
for R1 in "$FASTQ_DIR"/*_R1_paired.fastq.gz; do
SAMPLE=$(basename "$R1" _R1_paired.fastq.gz)
R2="$FASTQ_DIR/${SAMPLE}_R2_paired.fastq.gz"
SAM_OUTPUT="$ALIGNMENT_DIR/${SAMPLE}.sam"
echo "Aligning $SAMPLE..."
bwa mem -M -t 8 "$REFERENCE" "$R1" "$R2" > "$SAM_OUTPUT"
echo "Alignment complete for $SAMPLE!"
done
echo "All alignments completed!"
### Step 2: Convert SAM to BAM ###
module load samtools
BAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/"
mkdir -p "$BAM_DIR"
for SAM_FILE in "$ALIGNMENT_DIR"/*.sam; do
SAMPLE=$(basename "$SAM_FILE" .sam)
BAM_OUTPUT="$BAM_DIR/${SAMPLE}.bam"
echo "Converting $SAM_FILE to BAM..."
samtools view -b "$SAM_FILE" -o "$BAM_OUTPUT"
echo "Conversion completed for $SAMPLE!"
done
### Step 3: Sort BAM & Generate Flagstats ###
SORTED_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/sorted_bams/"
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/flagstats/"
mkdir -p "$SORTED_DIR" "$FLAGSTAT_DIR"
for BAM_FILE in "$BAM_DIR"/*.bam; do
SAMPLE=$(basename "$BAM_FILE" .bam)
SORTED_BAM="$SORTED_DIR/${SAMPLE}_sorted.bam"
echo "Sorting BAM file for $SAMPLE..."
samtools sort -o "$SORTED_BAM" "$BAM_FILE"
echo "Generating flagstat for $SAMPLE..."
samtools flagstat "$SORTED_BAM" > "$FLAGSTAT_DIR/${SAMPLE}_flagstat.txt"
done
echo "Sorting and flagstat analysis completed!"
### Step 4: Sort SAM Files by Name ###
INDS=($(for i in "$ALIGNMENT_DIR"/*.sam; do echo $(basename "$i" .sam); done))
for IND in "${INDS[@]}"; do
samtools sort -n "$ALIGNMENT_DIR/${IND}.sam" -o "$ALIGNMENT_DIR/${IND}_nsort.sam"
done
echo "All SAM sorting completed!"
### Step 5: Run PrimerClip ###
conda activate primerclip
PRIMERFILE="/projects/p32585/Elm/genomes/cp927_Masterfile.txt"
for IND in "${INDS[@]}"; do
echo "Running PrimerClip on $IND..."
primerclip "$PRIMERFILE" "$ALIGNMENT_DIR/${IND}_nsort.sam" "$ALIGNMENT_DIR/${IND}_clipped.sam"
done
echo "Primer clipping completed!"
#exit
sbatch pipeline.sh
#job 3567647
#took about an hour
```
**Brenden/Emma's Technique:** this pipeline has you map the reads to the primer-sequence fasta file that you generated from all of the gene fasta sequences. *Note: you did not remove the genes that we didn't use in the final primer set.*
This was a fun process to troubleshoot. Brenden's pipeline only worked on a couple of samples, then claimed to be complete but wasn't. So then we tried a loop that Emma created that separated everything out into individual steps.
```bash!
#downstream analyses require a concatenated file of the unpaired reads
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
```
1) We need a fasta file of all the sequences we submitted to IDT. To do this you copied and pasted all of the fasta sequences and names (without the >). Use this [website](https://cdcgov.github.io/CSV2FASTA/) to convert the csv file to a fasta file (a command-line alternative is sketched below the scp command). Make sure you have column labels. Move them to your genomes folder on Quest.
***note: you had to redo this step and align to the genome NOT the primer sequences**
***another note: you had to mask the disease regions from the elm genome and align it to that**
```bash!
scp "C:\Users\mquig\Downloads\primers_redo.fasta" imh4101@quest.northwestern.edu:/projects/p32585/Elm/genomes/
```
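For reference, a possible command-line alternative to the CSV2FASTA website (a sketch only; `primer_regions.csv` is a made-up file name, and it assumes a comma-separated file with a header row, names in column 1, and sequences in column 2):
```bash!
#print each csv row as a FASTA record: ">name" on one line, the sequence on the next
#(NR > 1 skips the header row)
awk -F',' 'NR > 1 {print ">"$1"\n"$2}' primer_regions.csv > primers_redo.fasta
```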
:::spoiler 2) Now let's do a quick trial run with one sample.
```bash!
#start by making sure we have the packages
module load bwa
#version 0.7.12
module load samtools
#version 1.6
#start by telling the program what to map to
#bwa index /projects/p32585/Elm/genomes/primers4.fasta
#or you can index to the reference genome
bwa index /projects/p32585/Elm/genomes/genome.fa
#took about 20 minutes for this one
#if it's not in the same folder as the files then include the path
#try one sample
bwa mem -M -t 4 /projects/p32585/Elm/genomes/primers4.fasta NY05_R1_paired.fastq.gz NY05_R2_paired.fastq.gz > NY05.sam
#check to see how it went
samtools flagstat NY05.sam
#[W::sam_hdr_parse] Duplicated sequence 'KY866673.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866675.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866674.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866678.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866677.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866676.1'
#[W::sam_hdr_parse] Duplicated sequence 'EF123160.1'
#790452 + 0 in total (QC-passed reads + QC-failed reads)
#3816 + 0 secondary
#0 + 0 supplementary
#0 + 0 duplicates
#742109 + 0 mapped (93.88% : N/A)
#786636 + 0 paired in sequencing
#393318 + 0 read1
#393318 + 0 read2
#734708 + 0 properly paired (93.40% : N/A)
#737190 + 0 with itself and mate mapped
#1103 + 0 singletons (0.14% : N/A)
#2362 + 0 with mate mapped to a different chr
#192 + 0 with mate mapped to a different chr (mapQ>=5)
#ok so this isn't great, we want it to be in the high 90s
```
:::
:::spoiler 3) Ok now that we know it works, let's make a loop to run all of the samples. (Actually there is an easier way to do this; that script is a few steps down.)
```bash!
nano slurm_bwa.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=batch_1_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=CUP_bwamem.out.error
#SBATCH --output=bwamem.out
module purge
module load bwa
# Run bwa mem with 8 threads
for i in *_R1_paired.fastq.gz; do
NM=${i//_R1_paired.fastq.gz/} # Extract the base name (e.g., NM_ARK23-01)
REVERSE=${i//_R1_paired.fastq.gz/_R2_paired.fastq.gz}  # Swap _R1_paired for _R2_paired to get the reverse reads
#bwa mem -M -t 8 /projects/p32585/Elm/genomes/primers4.fasta "$i" "$REVERSE" > "${NM}.sam"
bwa mem -M -t 8 /projects/p32585/Elm/genomes/genome.fa "$i" "$REVERSE" > "${NM}.sam"
done
#exit
sbatch slurm_bwa.sh
#took about 11 minutes to run which is suspicious because Brenden and Emma's took like 8 hours
```
:::
:::spoiler 4) Now we need to verify that running the loop and running the single sample was the same thing.
```bash!
samtools flagstat NY05.sam
#[W::sam_hdr_parse] Duplicated sequence 'KY866673.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866675.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866674.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866678.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866677.1'
#[W::sam_hdr_parse] Duplicated sequence 'KY866676.1'
#[W::sam_hdr_parse] Duplicated sequence 'EF123160.1'
#790452 + 0 in total (QC-passed reads + QC-failed reads)
#3816 + 0 secondary
#0 + 0 supplementary
#0 + 0 duplicates
#742109 + 0 mapped (93.88% : N/A)
#786636 + 0 paired in sequencing
#393318 + 0 read1
#393318 + 0 read2
#734708 + 0 properly paired (93.40% : N/A)
#737190 + 0 with itself and mate mapped
#1103 + 0 singletons (0.14% : N/A)
#2362 + 0 with mate mapped to a different chr
#192 + 0 with mate mapped to a different chr (mapQ>=5)
#looks similar to me!
```
:::
:::spoiler 5) sam2bam! And sort! To make the files smaller and conserve storage, switch the file type to .bam
```bash!
nano sam2bam_sort.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=batch_1_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=sam2bam.out.error
#SBATCH --output=sam2bam.out
module load samtools
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
#exit
sbatch sam2bam_sort.sh
```
:::
6. USE THIS SCRIPT!!! Ok so here is a script that will convert all of your files to .bam (requires less storage), and run a flagstat check to see what percentage of reads mapped to the reference.
```bash!
nano sam2bam_sort.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=sam2bam_sort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=sam2bam_sort.out.error
#SBATCH --output=sam2bam_sort.out
module load samtools
module load bwa
# Process one sample at a time: convert SAM to BAM, sort, and clean up
for i in *.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM_sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM_sorted.bam"
# Generate flagstat report
echo "Generating flagstat report for $NM_sorted.bam..."
samtools flagstat "$NM_sorted.bam" > "$NM_flagstat.txt"
echo "Flagstat report saved to $NM_flagstat.txt"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_batch_1_bams_redo
mv "$NM.sorted.bam" sorted_batch_1_bams_redo/
mv "$NM.flagstat.txt" sorted_batch_1_bams_redo/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No *.sam files found."
exit 1
fi
done
echo "All *.sam files processed. Sorted BAM files and flagstat reports moved to sorted_bams directory."
#exit
sbatch sam2bam_sort.sh
```
*this was adapted from a script that Emma gave me
So after running the redo samples that script didn't work (the version I first ran had `$NM_sorted.bam` instead of `${NM}_sorted.bam`, which bash reads as an undefined variable, so the sort and flagstat steps never produced proper files); it just generated the .bam files. So below is the slurm script you used to sort and flag.
```bash!
nano bam_sort_flag.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=bam_sort_flag
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=bam_sort_flag.out.error
#SBATCH --output=bam_sort_flag.out
# Load necessary modules
module load samtools
# Define input and output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/"
SORTED_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/"
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/flagstats/"
# Create output directories if they don't exist
mkdir -p "$SORTED_DIR" "$FLAGSTAT_DIR"
# Process each BAM file
for bamfile in "$INPUT_DIR"/*.bam; do
filename=$(basename "$bamfile" .bam)
# Sort BAM file
samtools sort -o "$SORTED_DIR/${filename}_sorted.bam" "$bamfile"
# Generate flagstat report
samtools flagstat "$SORTED_DIR/${filename}_sorted.bam" > "$FLAGSTAT_DIR/${filename}_flagstat.txt"
done
echo "Processing complete."
#exit
sbatch bam_sort_flag.sh
```
7. Now let's check to see how the mapping went!
```bash!
nano flagstat_summary.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=process_flagstat
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=process_flagstat.out.error
#SBATCH --output=process_flagstat.out
# Set the directory containing the .flagstat.txt files
FLAGSTAT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/sorted_batch_1_bams/"
# Create a summary file
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"
# Loop through all .flagstat.txt files
for file in "$FLAGSTAT_DIR"/*_flagstat.txt; do
if [[ -e "$file" ]]; then
# Extract the sample name (remove directory and extension)
SAMPLE=$(basename "$file" _flagstat.txt)
# Extract relevant information
QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')
# Append the data to the summary file
echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
else
echo "No _flagstat.txt files found in $FLAGSTAT_DIR."
exit 1
fi
done
echo "Summary of all _flagstat.txt files saved to $SUMMARY_FILE."
#exit
sbatch flagstat_summary.sh
#job 2352119
#well that was quick
```
Well, it seems to have gone ok: most samples are in the low 90s or upper 80s, except for NY06, which is about 49%. Oh well. We could increase the mapping percentages by loosening some of the bwa mem parameters (see the sketch below), BUT I don't want to do that because it could increase the number of incorrectly aligned reads, and I'm already worried about that because of how low our annealing temperature was during PCR.
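For reference, these are the bwa mem knobs one could loosen to raise the mapping rate (a sketch only, not something I ran; the values and the `NY05_lenient.sam` output name are just illustrations of "more permissive than default"):
```bash!
# -k  minimum seed length              (default 19; smaller = more lenient seeding)
# -B  mismatch penalty                 (default 4;  smaller = mismatches cost less)
# -T  minimum output alignment score   (default 30; smaller = more alignments kept)
bwa mem -M -t 8 -k 15 -B 3 -T 20 /projects/p32585/Elm/genomes/genome.fa \
    NY05_R1_paired.fastq.gz NY05_R2_paired.fastq.gz > NY05_lenient.sam
```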
Wow, aligning to the genome resulted in a much higher percentage mapped! This is great!
This is the end here until we use PrimerClip to purge the bad stuff. See you in step 4!
---
### Step 4: Primer Trimming
***Goal:** Remove low quality bases and PCR primer sequences*
**Software:** Primerclip or TrimPrimers
**Resources:**
- **[Important note from Illumina](https://sfvideo.blob.core.windows.net/sitefinity/docs/default-source/application-note/primerclip-a-tool-for-trimming-primer-sequences-application-note.pdf?sfvrsn=cf83e107_10)**
- **[PrimerClip Github](https://github.com/swiftbiosciences/primerclip)**
- **[TrimPrimers Website](https://fulcrumgenomics.github.io/fgbio/tools/latest/TrimPrimers.html)**
Ok so first we need to create a conda environment to get PrimerClip because Quest doesn't have it :(
```bash!
#create a conda environment
conda create -n primerclip
#activate it
conda activate primerclip
#this conda environment lives in your home folder
#/home/imh4101/.conda/envs/primerclip
#download primerclip
conda install bioconda::primerclip
#takes a few seconds to activate it
#not sure what this does but brenden did it
conda install bioconda/label/cf201901::samtools
#no errors yay!
module list
#says theres no modules loaded so hopefully this works lol
which primerclip
#says we do have it but not which environment
primerclip --version
#that didnt work
#make sure we can actually run primerclip
chmod a+x /home/imh4101/.conda/envs/primerclip
```
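Since `primerclip --version` didn't cooperate, another way to record the installed version for the methods section is to read conda's own package metadata:
```bash!
#list the primerclip package installed in the primerclip conda environment,
#which shows the version and the channel it came from
conda list -n primerclip primerclip
```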
Now we need to convert our .bam files to .sam files and sort them by name. Before this you sorted them, but not by name, and PrimerClip requires them sorted by name.
Convert .bam files to .sam files...
```bash!
#we should run this as a slurm
nano bam2sam2.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=bam2sam2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=bam2sam2.out.error
#SBATCH --output=bam2sam2.out
# Set the directory containing BAM files
BAM_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/"
# Loop through all BAM files in the directory
for bamfile in "$BAM_DIR"/*.bam; do
# Extract the base filename without the extension
base_name=$(basename "$bamfile" .bam)
# Convert BAM to SAM using samtools
samtools view -h "$bamfile" > "$BAM_DIR/${base_name}.sam"
echo "Converted $bamfile to ${base_name}.sam"
done
echo "All conversions completed!"
#exit
sbatch bam2sam2.sh
```
Sort the resulting .sam files by name..
```bash!
nano nsort_sam.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=nsort_sam
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=nsort_sam.error
#SBATCH --output=nsort_sam.out
module load samtools
INDS=($(for i in ./*.sam; do echo $(basename "$i" .sam); done))
for IND in ${INDS[@]};
do
samtools sort -n ${IND}.sam -o ${IND}_nsort.sam;
done
#exit
sbatch nsort_sam.sh
```
Yay! You did it! Now we can actually run PrimerClip on the .sam files sorted by name. This took about 25 minutes for 48 individuals.
```bash!
conda activate primerclip
nano primerclip.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=6GB # Adjust memory based on file sizes
#SBATCH --job-name=primerclip
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=primerclip.out.error
#SBATCH --output=primerclip.out
#Load Conda Environment
INDS=($(for i in ./*_nsort.sam; do echo $(basename ${i%_nsort*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
primerclip /projects/p32585/Elm/genomes/cp927_Masterfile.txt ${IND}_nsort.sam ${IND}_clipped.sam;
done
#exit
sbatch primerclip.sh
```
---
### Step 5: On-target and coverage uniformity calculation
***Goal:** See which loci amplified (on-target reads) and how uniform the coverage is!*
**Software:** Picard tools
**Resources:**
- **[Website](https://broadinstitute.github.io/picard/)**
- **[Github](https://github.com/broadinstitute/picard/blob/master/README.md)**
- **[Picard tool for .bed file to interval list](https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard)**
- **[Bedtools guide](http://quinlanlab.org/tutorials/bedtools.html)**
**Input**
- Clipped and sorted bam files
To make the clipped and sorted .bam files I consulted Copilot and used this code. I just ran it in bash; it took about 5 minutes.
```bash!
nano sam2bam_sort.sh
#!/bin/bash
# Directory containing SAM files
INPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam/clipped_sams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/clipped_bams/"
# Convert each SAM file to BAM format
for samfile in "$INPUT_DIR"/*.sam; do
if [[ -f "$samfile" ]]; then
bamfile="$OUTPUT_DIR/$(basename "$samfile" .sam).bam"
echo "Converting $samfile to $bamfile..."
samtools view -Sb "$samfile" -o "$bamfile"
fi
done
echo "Conversion completed."
#exit
module load samtools
chmod +x sam2bam_sort.sh
./sam2bam_sort.sh
```
**CollectAlignmentSummary**
This is what you use when you align to the genome. You can do this with the elm genome and the 3 disease genomes.
```bash!
#try it on one sample CollectAlignmentSummaryMetrics
cd /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectAlignmentSummaryMetrics \
R=/projects/p32585/Elm/genomes/genome.fa \
I=NA01_clipped_sorted.bam \
O=NA01_alignment_metrics.txt
#good it worked now lets do all of them
nano picard_align.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=6GB # Adjust memory based on file sizes
#SBATCH --job-name=picard
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=picard.out.error
#SBATCH --output=picard.out
module load picard # Load Picard module if necessary
PICARD_JAR="/software/picard/1.131/picard-tools-1.131/picard.jar"
REFERENCE="/projects/p32585/Elm/genomes/genome.fa"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/"
for bamfile in /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/*_clipped_sorted.bam; do
filename=$(basename "$bamfile" .bam)
java -Xmx12G -jar "$PICARD_JAR" CollectAlignmentSummaryMetrics \
R="$REFERENCE" \
I="$bamfile" \
O="$OUTPUT_DIR/${filename}_alignment_metrics.txt"
done
#exit
sbatch picard_align.sh
#run multiqc
multiqc .
```
#### Elm
Results: [link to multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_clipped_bams/multiqc_report.html#picard-alignmentsummary)

**CollectTargetedPcrMetrics**
In order to do this I needed to realign the reads to cp927_OnTarget_105Refs.fasta. This is the fasta file that IDT sent me of all the gene regions that made it into the final amplicon panel.
Then we did two different runs of CollectTargetedPcrMetrics. The first run uses the full target+primer insert for both interval lists. The second run separates the primer and the amplicon target region.

In the diagram (not shown here), the amplicon/primer is the orange/reddish bit on the end, the target is the green part, and the whole purple thing is the insert.
```bash!
#start by indexing the fasta file that we aligned to
samtools faidx cp927_OnTarget_105Refs.fasta
#this is the format to run on one sample
java -jar picard.jar CollectTargetedPcrMetrics \
I=input.bam \
O=output_pcr_metrics.txt \
R=reference.fasta \
AMPLICON_INTERVALS=amplicon.interval_list \
TARGET_INTERVALS=targets.interval_list
```
Full insert: we are also interested in the variation in the primers/amplicons, not just the target regions.
```bash!
#first we are going to do the same interval list because we are ok with keeping the primers
#our version trial on NA01
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I=NA01_clipped.sam \
O=NA01_output_pcr_metrics.txt \
R=/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta \
AMPLICON_INTERVALS=/projects/p32585/Elm/genomes/cp927_amplicon_interval_list \
TARGET_INTERVALS=/projects/p32585/Elm/genomes/cp927_Assays_interval_list
nano pcr_metrics_same_interval.sh
#slurm script
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on file sizes
#SBATCH --job-name=pcr_metrics_same
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=pcr_metrics_same.out.error
#SBATCH --output=pcr_metrics_same.out
# Define input and output directories
SAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
AMPLICON_INTERVALS="/projects/p32585/Elm/genomes/cp927_amplicon_interval_list"
TARGET_INTERVALS="/projects/p32585/Elm/genomes/cp927_Assays_interval_list"
# Loop through all clipped SAM files
for SAM_FILE in "$SAM_DIR"/*_clipped.sam; do
SAMPLE_ID=$(basename "$SAM_FILE" _clipped.sam)
OUTPUT_FILE="$OUTPUT_DIR/${SAMPLE_ID}_output_pcr_metrics.txt"
echo "Processing $SAMPLE_ID..."
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I="$SAM_FILE" \
O="$OUTPUT_FILE" \
R="$REFERENCE" \
AMPLICON_INTERVALS="$AMPLICON_INTERVALS" \
TARGET_INTERVALS="$TARGET_INTERVALS"
echo "Completed PCR Metrics for $SAMPLE_ID!"
done
echo "All samples processed!"
#exit
sbatch pcr_metrics_same_interval.sh
#job 3694727
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/
multiqc .
```
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_same_interval/multiqc_report.html)

Primers/amplicons + targets: this time we used the separate bed files from IDT that have the coordinates for the amplicons and the target regions.
```bash!
#second we are going to do it with the amplicon intervals and targets from the bed files IDT sent
#this is so we can see how well it worked for our target region
#make the interval lists
cd /projects/p32585/Elm/genomes/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar BedToIntervalList \
I=cp927_Inserts_v5.bed \
O=cp927_Inserts_v5_interval_list \
SD=cp927_Assays.dict
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar BedToIntervalList \
I=cp927_Primers_v5.bed \
O=cp927_Primers_v5_interval_list \
SD=cp927_Assays.dict
#yay those both worked!
cd /projects/p32585/Elm/batch_1/post_trimming/redo_aligned_IDT_fasta/
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I=NA01_clipped.sam \
O=NA01_2intervals_output_pcr_metrics.txt \
R=/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta \
AMPLICON_INTERVALS=/projects/p32585/Elm/genomes/cp927_Primers_v5_interval_list \
TARGET_INTERVALS=/projects/p32585/Elm/genomes/cp927_Inserts_v5_interval_list
#yay it worked!!!
nano pcr_metrics_2_interval.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on file sizes
#SBATCH --job-name=pcr_metrics_same
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=pcr_metrics_same.out.error
#SBATCH --output=pcr_metrics_same.out
# Define input and output directories
SAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/"
REFERENCE="/projects/p32585/Elm/genomes/cp927_OnTarget_105Refs.fasta"
AMPLICON_INTERVALS="/projects/p32585/Elm/genomes/cp927_Primers_v5_interval_list"
TARGET_INTERVALS="/projects/p32585/Elm/genomes/cp927_Inserts_v5_interval_list"
# Loop through all clipped SAM files
for SAM_FILE in "$SAM_DIR"/*_clipped.sam; do
SAMPLE_ID=$(basename "$SAM_FILE" _clipped.sam)
OUTPUT_FILE="$OUTPUT_DIR/${SAMPLE_ID}_2_intervals_output_pcr_metrics.txt"
echo "Processing $SAMPLE_ID..."
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar CollectTargetedPcrMetrics \
I="$SAM_FILE" \
O="$OUTPUT_FILE" \
R="$REFERENCE" \
AMPLICON_INTERVALS="$AMPLICON_INTERVALS" \
TARGET_INTERVALS="$TARGET_INTERVALS"
echo "Completed PCR Metrics for $SAMPLE_ID!"
done
echo "All samples processed!"
#exit
sbatch pcr_metrics_2_interval.sh
#job 3695173
cd /projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/
multiqc .
```
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_sam/pcr_metrics_2_interval/multiqc_report.html)

#### Coverage of Loci
To calculate the coverage of each locus we used bedtools coverage and saved the output as a .txt file. Then we converted it to a csv file and extracted the column that has the fraction of coverage at each locus of interest. From there, we put it in R and graphed it.
```bash!
##########################################################
#calculate the coverage
##########################################################
nano bedtools_fraction_coverage.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4 # Test with 8 threads
#SBATCH --mem=16GB # Adjusted memory based on previous usage
#SBATCH --job-name=coverage_csv
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=coverage_csv.err
#SBATCH --output=coverage_csv.out
# Load required modules
module load samtools
module load bedtools
# Set input/output paths
BED_FILE="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_interval.bed"
BAM_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Loop through BAM files and save Bedtools coverage output
for BAM_FILE in "$BAM_DIR"/*_clipped_sorted.bam; do
SAMPLE_ID=$(basename "$BAM_FILE" _clipped_sorted.bam)
OUTPUT_FILE_TXT="$OUTPUT_DIR/${SAMPLE_ID}_coverage.txt"
# Run bedtools coverage and save output
bedtools coverage -a "$BED_FILE" -b "$BAM_FILE" > "$OUTPUT_FILE_TXT"
echo "Saved coverage results for $SAMPLE_ID."
done
echo "Bedtools coverage analysis completed! Results saved in: $OUTPUT_DIR"
#exit
sbatch bedtools_fraction_coverage.sh
##########################################################
#convert to csv
##########################################################
#try to convert each .txt file into a csv
nano txt2csv.sh
#!/bin/bash
# Set input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Loop through all `.txt` files and convert to `.csv`
for TXT_FILE in "$INPUT_DIR"/*_coverage.txt; do
SAMPLE_ID=$(basename "$TXT_FILE" _coverage.txt)
CSV_FILE="$OUTPUT_DIR/${SAMPLE_ID}_coverage.csv"
# Convert tab-separated text to properly formatted CSV
awk 'BEGIN{FS="\t"; OFS=","} {print $1, $2, $3, $4, $5, $6, $7, $8}' "$TXT_FILE" > "$CSV_FILE"
echo "Converted: $TXT_FILE → $CSV_FILE"
done
echo "Batch conversion complete! CSV files saved in: $OUTPUT_DIR"
##########################################################
#extract the fraction of coverage column
##########################################################
nano fractions.sh
#!/bin/bash
# Set directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/"
OUTPUT_CSV="/projects/p32585/Elm/batch_1/aligned_cp927_OnTarget_105Refs_fasta/clipped_bam/clipped_sort/bedtools_coverage/bedtools_coverage_csvs/coverage_fractions.csv"
# Initialize CSV file with headers
echo -n "Sample_Name" > "$OUTPUT_CSV"
for CSV_FILE in "$INPUT_DIR"/*_coverage.csv; do
SAMPLE_ID=$(basename "$CSV_FILE" _coverage.csv)
echo -n ",$SAMPLE_ID" >> "$OUTPUT_CSV"
done
echo "" >> "$OUTPUT_CSV"
# Extract **column G (7th column)** from each file and write to CSV
for CSV_FILE in "$INPUT_DIR"/*_coverage.csv; do
awk -F',' '{print $7}' "$CSV_FILE" > "/tmp/$(basename "$CSV_FILE" .csv)_colG.txt"
done
paste -d ',' /tmp/*_colG.txt >> "$OUTPUT_CSV"
# Cleanup temporary files
rm /tmp/*_colG.txt
echo "Extraction complete! Column G summary saved in: $OUTPUT_CSV"
##########################################################
#switch to R and graph it
##########################################################
library(tidyverse)
###import data###
fractions=read.csv("coverage_fractions_final.csv", header = T)
###reformat data###
##right now we have the first column as the region names and each subsequent column is a sample and the fractions
bedtools_fract_df=data.frame(fractions)
bedtools_cor=data.frame(t(bedtools_fract_df[-1]))
colnames(bedtools_cor) <- bedtools_fract_df[, 1]
#yay now our columns are the gene region and the row are our samples
####let's take a look at which regions didn't amplify####
long_data <- pivot_longer(bedtools_cor, cols = everything())
fractions_summary <- long_data %>%
group_by(name) %>%
summarize(
n.val = n(),
mean.val = mean(value),
sd.val = sd(value),
se.val = sd.val/sqrt(n.val),
lci = mean.val - 1.96 * se.val,
uci = mean.val + 1.96 * se.val)
ggplot(data = fractions_summary,
aes(x = name, y = mean.val, ymin = lci, ymax = uci, color = name)) +
geom_point(size = 2, show.legend = FALSE) +
geom_errorbar(width = 1, show.legend = FALSE) +
ylab("Fraction of Coverage") +
xlab("Gene Region") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
```

#### Ophiostoma novo-ulmi (DED)
I used the same script to run CollectAlignmentSummary statistics.
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_ono/alignment_metrics/multiqc_report.html)

#### Phytoplasma
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_phytoplasma/alignment_metrics/multiqc_report.html)

#### Dothiorella (Mal secco) *need to update for pleno
Results: [multiqc](https://m-b55e05.a0115.5898.data.globus.org/projects/p32585/Elm/batch_1/aligned_dothiorella/alignment_metrics/multiqc_report.html)

---
### Step 6: Genome size estimation
***Goal:** Determine ploidy because GATK can only handle one ploidy at a time and it needs to be specified.*
**Software:** Jellyfish tools (kmers), ploidyNGS, nQuire.
**Resources:**
- **[GATK notes on ploidy](https://gatk.broadinstitute.org/hc/en-us/articles/360035889691-Does-GATK-work-on-non-diploid-organisms)**
    - So it works on polyploids BUT you need to specify the ploidy of each individual
    - Requires each group to have identical ploidy, so we will need to separate samples by ploidy before calling SNPs
**Input**
- aligned reads but in a fastq format
Average genome sizes for each ploidy are from Whittemore & Olson 2011.

**kmer plots**
- **[Github](https://github.com/gmarcais/Jellyfish)**
- **[Tutorial](https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/)**
The kmer estimation tool didn't work, so I'm not going to bother transcribing everything here. There were no peaks. It is supposed to be used on WGS data, and we have whole genes so I thought it might work out, but it didn't.
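For the record, the basic shape of the jellyfish commands is below (a sketch with illustrative parameters and placeholder file names, not the exact commands I ran; k=21 is just a common choice):
```bash!
#count canonical 21-mers from the reads, then build the k-mer depth histogram
#that is supposed to show a coverage peak (ours didn't)
jellyfish count -m 21 -s 100M -t 8 -C <(zcat SampleID_R1_paired.fastq.gz) -o SampleID_mers.jf
jellyfish histo SampleID_mers.jf > SampleID_histo.txt
```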
**ploidyNGS**
- **[Documentation](https://academic.oup.com/bioinformatics/article/33/16/2575/3104472)**
- **[Github](https://github.com/diriano/ploidyNGS)**
Suzy sent this to try, so let's go! For this we need aligned and sorted .bam files, which we already have! I tried running it on the samples aligned to the reference genome and on the ones aligned to the fasta files; the genome-aligned run was much more accurate but still not great. This slurm script should create a log file for each sample's ploidy estimation and collect the estimated ploidies into one csv file.
:::spoiler Pipeline
```bash!
cd /projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/
#start by creating a conda environment to download and run ploidyNGS
conda create --name ploidyNGS
#clone the program from github
git clone https://github.com/diriano/ploidyNGS.git
#load R and install ggplot2
module load R/4.4.0
R
install.packages("ggplot2")
#press ctrl+d to exit R
# to actually run it
# Load necessary modules and activate environment
conda activate ploidyNGS
cd ~/ploidyNGS
module load python-miniconda3
module load R/4.4.0
module load nano
source activate ploidyNGS # Use Conda's activation instead
nano slurm_ploidy.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=48:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem=10GB # Adjust memory based on file sizes
#SBATCH --job-name=slurm_ploidy
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=slurm_ploidy.out.error
#SBATCH --output=slurm_ploidy.out
# Set input and output directories
BAM_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/post_trimming/sam_redos_aligned/sorted_bams/ploidy_estimation/"
CSV_FILE="$OUTPUT_DIR/ploidy_results.csv"
# Initialize CSV file with headers
echo "Sample_ID,Closest_Ploidy" > "$CSV_FILE"
# Loop through BAM files and run ploidyNGS
for BAM_FILE in "$BAM_DIR"/*_sorted.bam; do
SAMPLE_ID=$(basename "$BAM_FILE" _sorted.bam)
OUTPUT_LOG="$OUTPUT_DIR/${SAMPLE_ID}_ploidy.log"
echo "Running ploidy estimation for $SAMPLE_ID..."
# Pass sample name explicitly to ploidyNGS
./ploidyNGS.py --out "$OUTPUT_DIR" --bam "$BAM_FILE" --guess_ploidy > "$OUTPUT_LOG" 2>&1
# Extract the closest ploidy value from the output
PLOIDY=$(grep "the closest ploidy to yours is" "$OUTPUT_LOG" | awk '{print $NF}')
# Append results to the CSV file
echo "$SAMPLE_ID,$PLOIDY" >> "$CSV_FILE"
echo "Processed $SAMPLE_ID → Closest Ploidy: $PLOIDY"
done
echo "Ploidy estimation completed! Results saved in: $CSV_FILE"
#exit
sbatch slurm_ploidy.sh
```
:::
*might want to increase the computing power for this; it took about 1.5 hours
**nQuire**
nQuire is a piece of software that Dylan sent, and I found a paper that said it works with herbarium specimens... they lied. It was about 56% accurate. That's not great, but better than ploidyNGS.
**Resources:**
- **[Paper about the pipeline for herbaria](https://pmc.ncbi.nlm.nih.gov/articles/PMC6667659/)**
- **[Github](https://github.com/clwgg/nQuire)**
:::spoiler Code for a single sample
```bash!
#to get started I want to try on one sample
#using NA02 because it has a known ploidy
#first everything needs to be sorted
samtools sort NA02_clipped.bam -o NA02_cs.bam
#start by making the bin file
nQuire create -b NA02_cs.bam -o NA02
#now to get rid of all the noise
nQuire denoise NA02.bin -o NA02_denoised
#Positions to analyse:
# Before: 3838
# After: 1440 (37.5%)
#now to run the actual model
nQuire lrdmodel NA02.bin
#file free dip tri tet d_dip d_tri d_tet
#NA02.bin 2439.456970 1576.276581 1718.501688 1887.025762 863.180388 720.955281 552.431208
#its giving tetraploid but its a diploid
#try another model
nQuire modeltest NA02.bin
#final diploid loglik: 1576.276581
# sd1 = 0.197052
#final triploid loglik: 1718.501688
# sd1 = 0.124503
# sd2 = 0.124539
#final tetraploid loglik: 1887.025762
# sd1 = 0.079604
# sd2 = 0.076289
# sd3 = 0.079892
#ok one more model test
nQuire estmodel NA02.bin
#more numbers
#histogram test
nQuire histotest NA02.bin
#Diploid:
# Norm SSR: 0.0379481
# y-y slope: 0.120563, with std.Err: 0.0544135
# r^2: 0.0768159
#Triploid:
# Norm SSR: 0.0324301
# y-y slope: -0.157796, with std.Err: 0.0798268
# r^2: 0.0621147
#Tetraploid:
# Norm SSR: 0.0155547
# y-y slope: -0.0117182, with std.Err: 0.128294
# r^2: 0.000141382
#well I know this isn't a tetraploid but the model says it is
```
:::
:::spoiler Here's the full pipeline
```bash!
#lets goooooo
nano nquire_batch.sh
#!/bin/bash
# Set paths
BAM_DIR="/data/labs/Fant/Quigg/batch_1"
OUT_DIR="/data/labs/Fant/Quigg/ploidy_tests"
SORTED_DIR="/data/labs/Fant/Quigg/batch_1/sorted_bams"
SUMMARY_FILE="$OUT_DIR/ploidy_summary.tsv"
THREADS=5
# Make output directories
mkdir -p "$SORTED_DIR" "$OUT_DIR"
# Initialize summary file
echo -e "Sample\tDiploid_logL\tTriploid_logL\tTetraploid_logL\tBest_Ploidy" > "$SUMMARY_FILE"
# Loop through BAMs
for BAM in "$BAM_DIR"/*.bam; do
SAMPLE=$(basename "$BAM" .bam)
echo "Processing $SAMPLE..."
# Step 1: Sort BAM
SORTED_BAM="$SORTED_DIR/${SAMPLE}_sorted.bam"
samtools sort -o "$SORTED_BAM" "$BAM"
# Step 2: nQuire pipeline
BIN_PREFIX="$OUT_DIR/$SAMPLE"
DENOISED="$OUT_DIR/$SAMPLE.denoised.bin"
MODEL_OUT="${BIN_PREFIX}.model.txt"
nQuire create -b "$SORTED_BAM" -o "$BIN_PREFIX"
nQuire denoise "${BIN_PREFIX}.bin" -o "${BIN_PREFIX}.denoised"
nQuire lrdmodel -t $THREADS "${BIN_PREFIX}.denoised.bin" > "${BIN_PREFIX}.model.txt"
# Step 3: Parse results
DIP=$(awk '/diploid/ {print $2}' "$MODEL_OUT")
TRI=$(awk '/triploid/ {print $2}' "$MODEL_OUT")
TET=$(awk '/tetraploid/ {print $2}' "$MODEL_OUT")
BEST=$(printf "diploid\t$DIP\ntriploid\t$TRI\ntetraploid\t$TET" | sort -k2 -n | tail -n1 | cut -f1)
# Step 4: Record summary
echo -e "$SAMPLE\t$DIP\t$TRI\t$TET\t$BEST" >> "$SUMMARY_FILE"
done
echo "✅ Batch ploidy analysis complete. Results saved to $SUMMARY_FILE"
#exit
chmod +x nquire_batch.sh
./nquire_batch.sh
#that didnt extract the values properly so this is the script to correctly do that
cd /data/labs/Fant/Quigg/ploidy_tests
nano ploidy_tsv.sh
#!/bin/bash
# Directory containing your model.txt files
MODEL_DIR="/data/labs/Fant/Quigg/ploidy_tests"
SUMMARY_FILE="$MODEL_DIR/ploidy_summary.tsv"
echo -e "Sample\tDiploid_logL\tTriploid_logL\tTetraploid_logL\tBest_Ploidy" > "$SUMMARY_FILE"
for MODEL in "$MODEL_DIR"/*.model.txt; do
SAMPLE=$(basename "$MODEL" .model.txt)
# Extract log-likelihoods from the second line (skip header)
read -r _ FREE DIP TRI TET _ < <(tail -n 1 "$MODEL")
if [[ -z "$DIP" || -z "$TRI" || -z "$TET" ]]; then
echo -e "$SAMPLE\tNA\tNA\tNA\tFAILED"
echo -e "$SAMPLE\tNA\tNA\tNA\tFAILED" >> "$SUMMARY_FILE"
continue
fi
# Determine best ploidy (based on highest log-likelihood)
BEST=$(printf "diploid\t$DIP\ntriploid\t$TRI\ntetraploid\t$TET" | sort -k2 -n | tail -n1 | cut -f1)
echo -e "$SAMPLE\t$DIP\t$TRI\t$TET\t$BEST" >> "$SUMMARY_FILE"
done
echo "✅ Updated summary complete: $SUMMARY_FILE"
#exit
chmod +x ploidy_tsv.sh
./ploidy_tsv.sh
```
:::
**nQuack**
nQuack was recommended by Shelly Gaynor, the wonderful person who created it. She worked in the Soltis lab at UF and tried to do her PhD on PopGen of plants with similar ploidy problems, then rerouted to focus on creating techniques and programs to handle ploidy problems like this because she didn't like any of the software available. She recommended I analyze each cytotype (individuals of the same ploidy) separately.
**Resources:**
- **[Github](https://github.com/mgaynor1/nQuack)**
- **[Tutorial](https://mlgaynor.com/nQuack/articles/DataPreparation.html)**
- **[Example with known ploidy](https://mlgaynor.com/nQuack/articles/BasicExample.html#bootstrap-replicates)**
---
### Step 7: SNP calling
***Goal:** Find any SNPs present in the genes!*
**Software:** GATK tools
**Resources:**
- **[Website](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)**
- **[Best Practices](https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices)**
- **[Notes on Ploidy](https://gatk.broadinstitute.org/hc/en-us/articles/360035889691-Does-GATK-work-on-non-diploid-organisms)**
- So it works on polyploids BUT you need to specify the ploidy of each individual (just going to run as if everything is diploid, will have excess heterozygosity)
**Input**
- Clipped and sorted bam files
- Remove or mark duplicate reads (MarkDuplicates), add read groups (AddOrReplaceReadGroups), and index (samtools index)
**Output**
- VCF or GVCF file with raw, unfiltered SNP and indel calls
Before running the slurm script a couple notes:
- You need to prepare the genome or reference, I have code in the slurm to do this but it's commented out.
- You have to load an updated version of Java or it will fail.
- Make sure you give it plenty of time: I gave it 12 hours to do 48 samples and it timed out after getting 46 samples done, so budget ~30 minutes per sample.
```bash!
conda activate gatk
nano gatk_processing.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G # Adjust memory based on previous usage
#SBATCH --job-name=gatk_processing
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=gatk_processing.err
#SBATCH --output=gatk_processing.out
# Load required modules
module load gatk/4.4.0.0
module load samtools
module load picard
module load java/jdk-17.0.2+8
# Set input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/clipped_sorted_bams/"
OUTPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/"
# Create output directory if it doesn’t exist
mkdir -p "$OUTPUT_DIR"
# Prepare reference genome (only need to do this once)
#gatk CreateSequenceDictionary -R /projects/p32585/Elm/genomes/genome_elm.fa
#samtools faidx /projects/p32585/Elm/genomes/genome_elm.fa
#echo "Reference genome prepared."
# Loop through all BAM files in the input directory
for INPUT_BAM in "$INPUT_DIR"/*.bam; do
SAMPLE_ID=$(basename "$INPUT_BAM" .bam)
SORTED_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_sorted.bam"
MD_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_md.bam"
RG_BAM="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_rg.bam"
METRICS_FILE="/projects/p32585/Elm/batch_1/aligned_elm/md_rg_indexed_bams/${SAMPLE_ID}_marked_dup_metrics.txt"
OUTPUT_VCF="$OUTPUT_DIR/${SAMPLE_ID}.g.vcf.gz"
echo "Processing sample: $SAMPLE_ID..."
# Step 1: Sort BAM file
samtools sort "$INPUT_BAM" -o "$SORTED_BAM"
echo "Sorting complete for $SAMPLE_ID."
# Step 2: Mark duplicates
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar MarkDuplicates \
I="$SORTED_BAM" \
O="$MD_BAM" \
M="$METRICS_FILE" \
REMOVE_DUPLICATES=false \
VALIDATION_STRINGENCY=SILENT
echo "Duplicate marking complete for $SAMPLE_ID."
# Step 3: Add read groups
java -jar /software/picard/1.131/picard-tools-1.131/picard.jar AddOrReplaceReadGroups \
I="$MD_BAM" \
O="$RG_BAM" \
RGID="$SAMPLE_ID" \
RGLB=lib1 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM="$SAMPLE_ID" \
VALIDATION_STRINGENCY=SILENT
echo "Read group addition complete for $SAMPLE_ID."
# Step 4: Index the BAM file
samtools index "$RG_BAM"
echo "Indexing complete for $SAMPLE_ID."
# Step 5: Run GATK HaplotypeCaller
gatk HaplotypeCaller \
-R /projects/p32585/Elm/genomes/genome_elm.fa \
-I "$RG_BAM" \
-O "$OUTPUT_VCF" \
-ERC GVCF
echo "Variant calling complete for $SAMPLE_ID. Output saved to: $OUTPUT_VCF"
done
echo "Batch processing of all samples completed!"
#exit
sbatch gatk_processing.sh
```
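If a batch ever does need to be called as something other than diploid (say, the tetraploids once the cytotypes are separated), HaplotypeCaller takes a ploidy argument. A minimal sketch of how that one step would change (not something run above, where everything was treated as diploid):
```bash!
#same HaplotypeCaller call as in the script above, but telling GATK the sample
#is tetraploid; --sample-ploidy (alias -ploidy) defaults to 2
gatk HaplotypeCaller \
    -R /projects/p32585/Elm/genomes/genome_elm.fa \
    -I "$RG_BAM" \
    -O "$OUTPUT_VCF" \
    -ERC GVCF \
    --sample-ploidy 4
```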
So this next part prepares them to be put into R for analysis and visualization. It looks like ploidy is NOT a problem here (yay!). We need to combine the gvcfs into one cohort gvcf and then genotype it.
```bash!
#note to future Marne: run this as a slurm it took soooooo long (like 80 minutes)
nano combine_gvcf.sh
#!/bin/bash
# Load necessary modules
module load gatk/4.4.0.0
# Define input/output directories
INPUT_DIR="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/"
OUTPUT_GVCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/cohort.g.vcf.gz"
REFERENCE="/projects/p32585/Elm/genomes/genome_elm.fa"
# Initialize an empty variant list
GVCF_LIST=""
# Loop through all `.g.vcf.gz` files in the folder and add to GVCF_LIST
for GVCF in "$INPUT_DIR"/*.g.vcf.gz; do
GVCF_LIST+=" --variant ${GVCF}"
done
# Run GATK CombineGVCFs
gatk CombineGVCFs \
-R "$REFERENCE" \
$GVCF_LIST \
-O "$OUTPUT_GVCF"
echo "GVCF combination complete! Output saved at: $OUTPUT_GVCF"
#exit
chmod +x combine_gvcf.sh
./combine_gvcf.sh
#now genotype the output of that one
nano genotype_gvcf.sh
#!/bin/bash
#SBATCH --account=p32585
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=32G # Adjust memory based on previous usage
#SBATCH --job-name=genotype_gvcf
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marnequigg2025@u.northwestern.edu
#SBATCH --error=genotype_gvcf.err
#SBATCH --output=genotype_gvcf.out
# Load required module
module load gatk/4.4.0.0
# Define input/output files
REFERENCE="/projects/p32585/Elm/genomes/genome_elm.fa"
COHORT_GVCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/cohort.g.vcf.gz"
FINAL_VCF="/projects/p32585/Elm/batch_1/aligned_elm/gvcf_output/final.vcf.gz"
echo "Running GenotypeGVCFs on $COHORT_GVCF..."
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R "$REFERENCE" \
-V "$COHORT_GVCF" \
-O "$FINAL_VCF"
echo "Genotyping complete! Output saved at: $FINAL_VCF"
#exit
sbatch genotype_gvcf.sh
```
---
### VCF Filtering
So this is one of those points where we need to know ploidy. Why? Because you need to specify whether each sample carries 2 or 4 allele copies at each locus (a starting sketch for the diploid case is below).
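As a placeholder until the ploidy question is settled, here is a minimal filtering sketch for the diploid case (it assumes bcftools and vcftools are available; the thresholds are illustrative, not project decisions):
```bash!
#keep only biallelic SNPs from the genotyped VCF (final.vcf.gz from the GenotypeGVCFs step)
bcftools view -m2 -M2 -v snps final.vcf.gz -Oz -o final_biallelic_snps.vcf.gz
#basic site-level filters: minimum quality, minimum mean depth, maximum missingness
vcftools --gzvcf final_biallelic_snps.vcf.gz \
    --minQ 30 --min-meanDP 10 --max-missing 0.8 \
    --recode --recode-INFO-all --out final_filtered
```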
## Microsatellites
## Analysis and Visualization
- Tutorials from Landscape Genetics
- [How to import VCFs](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_1.html#WE_1)
- [How to measure Genetic Diversity](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_3.html#WE_3)
- [How to look at Population Structure](https://bookdown.org/hhwagner1/LandGenCourse_book/WE_9.html#WE_9)
## Objective 1: Analyze range-wide genetic structure and areas of high genetic diversity.
## Objective 2: Determine if geographic barriers are correlated to genetic differentiation and ploidy differences of these genetic clusters (e.g. geographic regions or watersheds).
## Objective 3: Compare current and historic genetic diversity and identify if there is evidence of a bottleneck caused by DED.
## Objective 4: Support establishment of a seed provenancing plan for U. americana restoration that maximizes genetic diversity, disease resistance, and local adaptation, then distribute to restoration managers.
## Objective 5: Identify if any of the trees sampled were infected with disease.