---
title: "Ill Elms (Marne's version)"
breaks: FALSE
tags: RADSeq
---
Marne Quigg (p32585)
---
# The Tortured Elms Department (Marne's version)

As part of my amplicon panel I included genes from 3 pathogens that infect elms: *Ophiiostoma novo-ulmi*, *Candidatus phytoplasma*, and *Plenodomus tracheiphilus*. This are the causal agents for Dutch elm disease (DED), elm yellows/phytoplasma, and Mal Secco. DED and Mal secco are both fungal, and elm yellows is bacterial.
For most of the contemporary samples, we have notes on whether or not the tree was diseased. Of course, this means that the tree had to be showing OBVIOUS signs of disease.
It is possible to sequence phytoplasma from leaf samples, as evidenced by the contaminated genome...
## Gene Regions
| Pathogen | Accession | Gene Region |
| -------- | -------- | -------- |
| Ophiostoma | AF380339 | col1
| Ophiostoma | ON209352 | ribosomal RNA gene
| Ophiostoma | MN551193 | BT
| Ophiostoma | AJ519672 | cu
| Ophiostoma | MH283430 | tef1
| Ophiostoma | MH062829 | CAL
| Phytoplasma | MPL16SRIII | 16S
| Phytoplasma | MW563863 | 16S-23S
| Phytoplasma | KX784497 | secY
| Plenodomus tracheiphilus | OQ412812 | TUB2
| Plenodomus tracheiphilus | PP321284 | RPB2
| Plenodomus tracheiphilus | PP321285 | Act
| Plenodomus tracheiphilus | PQ351261 | ITS
*I don't think all of these went into the final amplicon design.
---
## Pipeline
1) Organize and FastQC: make sure sequencing went well*
2) Trimmomatic: Illumina Adapter trimming and filtering out reads <30 bases long*
3) BWA-MEM: Alignment (to fasta file of disease gene regions)
4) Picard: Mark and remove duplicates
5) Bedtools: Coverage to idetify depth and breadth of sequences at each gene region**
6) Blast and MEGAN/Kraken2: Verify the taxonomy of the reads that correspond to the disease genes
7) arcGIS: Map it!
*these steps are included in the Amplicons!!! (Marne's version)
**can include a step here to call variants and possibly determine which strain of the pathogens
---
### Step 1: Organize and FastQC
***Goal:** Standardize file names and check the quality of the sequencing data!*
---
### Step 2: Trimmomatic
***Goal:** Trim the primer adapters!*
---
### Step 3: BWA-MEM
***Goal:** Align to the fasta files of disese gene regions!*
:::spoiler This is the full pipeline adapted for this but it didn't work beyond BWA
#this starts with bwa alignment and ends with primerclip
conda create -n disease_tests python=3.9
conda activate disease_tests
#download the softwares
conda install bioconda::bwa #version 0.7.17
conda install -c conda-forge -c bioconda samtools openssl=1.1
#version 1.5
conda install bioconda::primerclip #version 0.3.8
#Dont forgot to index
#only need to do this once
dos2unix /data/labs/Fant/Quigg/genomes/disease_regions.fasta
#its in a windows format
bwa index /data/labs/Fant/Quigg/genomes/disease_regions.fasta
#start your ENGINESSSSSSSS
nano pipeline1.sh
#!/bin/bash
# Usage: process_sample.sh SAMPLE_NAME
SAMPLE=$1
# Paths
FASTQ_DIR="/data/labs/Fant/Quigg/02.trimming"
REFERENCE="/data/labs/Fant/Quigg/genomes/disease_regions.fasta"
ALIGNMENT_DIR="/data/labs/Fant/Quigg/03b.alignment_diseases"
BAM_DIR="$ALIGNMENT_DIR/bams"
SORTED_DIR="$ALIGNMENT_DIR/sorted_bams"
FLAGSTAT_DIR="$ALIGNMENT_DIR/flagstats"
PRIMERFILE="/data/labs/Fant/Quigg/genomes/cp927_Masterfile.txt"
R1="$FASTQ_DIR/${SAMPLE}_R1_paired.fastq.gz"
R2="$FASTQ_DIR/${SAMPLE}_R2_paired.fastq.gz"
SAM="$ALIGNMENT_DIR/${SAMPLE}.sam"
BAM="$BAM_DIR/${SAMPLE}.bam"
SORTED="$SORTED_DIR/${SAMPLE}_sorted.bam"
NSORT="$ALIGNMENT_DIR/${SAMPLE}_nsort.sam"
CLIPPED="$ALIGNMENT_DIR/${SAMPLE}_clipped.sam"
FLAGSTAT="$FLAGSTAT_DIR/${SAMPLE}_flagstat.txt"
log() { echo -e "[\033[1;36m$SAMPLE\033[0m] $1"; }
fail() { echo -e "[\033[1;31m$SAMPLE\033[0m] ❌ $1" >&2; exit 1; }
# Input check
[[ ! -f "$R1" || ! -f "$R2" ]] && fail "Missing FASTQ files: $R1 or $R2"
log "Aligning with BWA..."
bwa mem -M -t 4 "$REFERENCE" "$R1" "$R2" > "$SAM" || fail "bwa mem failed"
log "Converting SAM to BAM..."
samtools view -b "$SAM" -o "$BAM" || fail "samtools view failed"
log "Sorting BAM by coordinate..."
samtools sort -o "$SORTED" "$BAM" || fail "samtools sort failed"
log "Generating flagstat..."
samtools flagstat "$SORTED" > "$FLAGSTAT" || fail "flagstat failed"
log "Sorting SAM by name..."
samtools sort -n "$SAM" -o "$NSORT" || fail "name sort failed"
log "Running PrimerClip..."
primerclip "$PRIMERFILE" "$NSORT" "$CLIPPED" || fail "PrimerClip failed"
log "✅ Sample completed successfully."
#exit
chmod +x pipeline1.sh
#to launch the pipeline
nano launch_pipeline1.sh
#!/bin/bash
ALIGNMENT_DIR="/data/labs/Fant/Quigg/03b.alignment_diseases"
FASTQ_DIR="/data/labs/Fant/Quigg/02.trimming"
mkdir -p "$ALIGNMENT_DIR"/{bams,sorted_bams,flagstats}
ls "$FASTQ_DIR"/*_R1_paired.fastq.gz | sed 's/_R1_paired.fastq.gz//' | xargs -n1 basename > sample_list.txt
screen -S ploidy_alignments -dm bash -c "
echo '📡 Launching parallel alignments...'
parallel -j 4 ./pipeline1.sh :::: sample_list.txt
echo '✅ All samples processed.'
"
#exit
chmod +x launch_pipeline1.sh
./launch_pipeline1.sh
screen -L parallel -j 4 ./pipeline1.sh :::: sample_list.txt
:::
---
### Step 4 and 5: Picard and Bedtools
***Goal:** Mark and remove any duplicate reads to avoid false inflation of sequencing depth! Calculate coverage and depth of sequences!*
Start by prepping everything to run the full pipeline including indexing the fasta and creating the bed file. Also make sure the files are in a sorted bam file.
```bash!
#index and make the bed file needed for bedtools
samtools faidx disease_regions.fasta
awk '{print $1"\t0\t"$2}' disease_regions.fasta.fai > disease_regions.bed
#make a conda environment for this
conda create -n disease_analysis python=3.9
conda activate disease_analysis
conda install bioconda::samtools
conda install bioconda::picard
conda install bioconda::bedtools
conda install bioconda::blast
```
Now here's the full pipeline that will run 4 samples at once.
```bash!
#!/bin/bash
# ---- Configuration ----
BAM_DIR="bam" # Directory with input BAMs
REF="disease_genes.fasta" # FASTA of full gene sequences (from 3 pathogens)
THREADS=4 # Number of samples to process in parallel
# ---- Step 0: Generate BED file from FASTA index ----
#samtools faidx $REF
#awk '{print $1"\t0\t"$2}' ${REF}.fai > gene_regions.bed
#echo "Generated BED file: gene_regions.bed"
# ---- Function to process each sample ----
process_sample() {
BAM="$1"
SAMPLE=$(basename "$BAM" .bam)
echo "Processing $SAMPLE"
# Step 1: Mark duplicates
picard MarkDuplicates \
I=$BAM \
O=${SAMPLE}.dedup.bam \
M=${SAMPLE}.dedup.metrics.txt \
REMOVE_DUPLICATES=true \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=SILENT
# Step 2: Gene-level coverage
bedtools coverage \
-a gene_regions.bed \
-b ${SAMPLE}.dedup.bam > ${SAMPLE}.coverage.tsv
# Step 3: Extract covered genes (80% breadth, >10x depth per base)
awk '($7 >= 0.8 && $5 >= 10){print $1"\t"$2"\t"$3}' ${SAMPLE}.coverage.tsv > ${SAMPLE}.present_regions.bed
# Step 4: Pull sequences for BLAST
if [[ -s ${SAMPLE}.present_regions.bed ]]; then
bedtools getfasta \
-fi $REF \
-bed ${SAMPLE}.present_regions.bed \
-fo ${SAMPLE}.present_regions.fasta
# Step 5: BLAST each detected gene
blastn -query ${SAMPLE}.present_regions.fasta \
-db /home/general/Databases/nt \
-remote \
-out ${SAMPLE}.blastn.out \
-evalue 1e-10 -outfmt 6
else
echo "No high-confidence hits for $SAMPLE"
fi
}
# Export function and variables for GNU parallel
export -f process_sample
export REF
# ---- Run pipeline in parallel across all BAMs ----
find $BAM_DIR -name "*.bam" | parallel -j $THREADS process_sample
```
---
### Step 6: Analysis
***Goal:** Verify the taxonomy of the sequences!*