Disease presence (Marne's version)

--- title: "Ill Elms (Marne's version)" breaks: FALSE tags: RADSeq --- Marne Quigg (p32585) --- # The Tortured Elms Department (Marne's version) ![american-elm-1a](https://hackmd.io/_uploads/B1-o7CW4ge.jpg) As part of my amplicon panel I included genes from 3 pathogens that infect elms: *Ophiiostoma novo-ulmi*, *Candidatus phytoplasma*, and *Plenodomus tracheiphilus*. This are the causal agents for Dutch elm disease (DED), elm yellows/phytoplasma, and Mal Secco. DED and Mal secco are both fungal, and elm yellows is bacterial. For most of the contemporary samples, we have notes on whether or not the tree was diseased. Of course, this means that the tree had to be showing OBVIOUS signs of disease. It is possible to sequence phytoplasma from leaf samples, as evidenced by the contaminated genome... ## Gene Regions | Pathogen | Accession | Gene Region | | -------- | -------- | -------- | | Ophiostoma | AF380339 | col1 | Ophiostoma | ON209352 | ribosomal RNA gene | Ophiostoma | MN551193 | BT | Ophiostoma | AJ519672 | cu | Ophiostoma | MH283430 | tef1 | Ophiostoma | MH062829 | CAL | Phytoplasma | MPL16SRIII | 16S | Phytoplasma | MW563863 | 16S-23S | Phytoplasma | KX784497 | secY | Plenodomus tracheiphilus | OQ412812 | TUB2 | Plenodomus tracheiphilus | PP321284 | RPB2 | Plenodomus tracheiphilus | PP321285 | Act | Plenodomus tracheiphilus | PQ351261 | ITS *I don't think all of these went into the final amplicon design. --- ## Pipeline 1) Organize and FastQC: make sure sequencing went well* 2) Trimmomatic: Illumina Adapter trimming and filtering out reads <30 bases long* 3) BWA-MEM: Alignment (to fasta file of disease gene regions) 4) Picard: Mark and remove duplicates 5) Bedtools: Coverage to idetify depth and breadth of sequences at each gene region** 6) Blast and MEGAN/Kraken2: Verify the taxonomy of the reads that correspond to the disease genes 7) arcGIS: Map it! *these steps are included in the Amplicons!!! (Marne's version) **can include a step here to call variants and possibly determine which strain of the pathogens --- ### Step 1: Organize and FastQC ***Goal:** Standardize file names and check the quality of the sequencing data!* --- ### Step 2: Trimmomatic ***Goal:** Trim the primer adapters!* --- ### Step 3: BWA-MEM ***Goal:** Align to the fasta files of disese gene regions!* :::spoiler This is the full pipeline adapted for this but it didn't work beyond BWA #this starts with bwa alignment and ends with primerclip conda create -n disease_tests python=3.9 conda activate disease_tests #download the softwares conda install bioconda::bwa #version 0.7.17 conda install -c conda-forge -c bioconda samtools openssl=1.1 #version 1.5 conda install bioconda::primerclip #version 0.3.8 #Dont forgot to index #only need to do this once dos2unix /data/labs/Fant/Quigg/genomes/disease_regions.fasta #its in a windows format bwa index /data/labs/Fant/Quigg/genomes/disease_regions.fasta #start your ENGINESSSSSSSS nano pipeline1.sh #!/bin/bash # Usage: process_sample.sh SAMPLE_NAME SAMPLE=$1 # Paths FASTQ_DIR="/data/labs/Fant/Quigg/02.trimming" REFERENCE="/data/labs/Fant/Quigg/genomes/disease_regions.fasta" ALIGNMENT_DIR="/data/labs/Fant/Quigg/03b.alignment_diseases" BAM_DIR="$ALIGNMENT_DIR/bams" SORTED_DIR="$ALIGNMENT_DIR/sorted_bams" FLAGSTAT_DIR="$ALIGNMENT_DIR/flagstats" PRIMERFILE="/data/labs/Fant/Quigg/genomes/cp927_Masterfile.txt" R1="$FASTQ_DIR/${SAMPLE}_R1_paired.fastq.gz" R2="$FASTQ_DIR/${SAMPLE}_R2_paired.fastq.gz" SAM="$ALIGNMENT_DIR/${SAMPLE}.sam" BAM="$BAM_DIR/${SAMPLE}.bam" SORTED="$SORTED_DIR/${SAMPLE}_sorted.bam" NSORT="$ALIGNMENT_DIR/${SAMPLE}_nsort.sam" CLIPPED="$ALIGNMENT_DIR/${SAMPLE}_clipped.sam" FLAGSTAT="$FLAGSTAT_DIR/${SAMPLE}_flagstat.txt" log() { echo -e "[\033[1;36m$SAMPLE\033[0m] $1"; } fail() { echo -e "[\033[1;31m$SAMPLE\033[0m] ❌ $1" >&2; exit 1; } # Input check [[ ! -f "$R1" || ! -f "$R2" ]] && fail "Missing FASTQ files: $R1 or $R2" log "Aligning with BWA..." bwa mem -M -t 4 "$REFERENCE" "$R1" "$R2" > "$SAM" || fail "bwa mem failed" log "Converting SAM to BAM..." samtools view -b "$SAM" -o "$BAM" || fail "samtools view failed" log "Sorting BAM by coordinate..." samtools sort -o "$SORTED" "$BAM" || fail "samtools sort failed" log "Generating flagstat..." samtools flagstat "$SORTED" > "$FLAGSTAT" || fail "flagstat failed" log "Sorting SAM by name..." samtools sort -n "$SAM" -o "$NSORT" || fail "name sort failed" log "Running PrimerClip..." primerclip "$PRIMERFILE" "$NSORT" "$CLIPPED" || fail "PrimerClip failed" log "✅ Sample completed successfully." #exit chmod +x pipeline1.sh #to launch the pipeline nano launch_pipeline1.sh #!/bin/bash ALIGNMENT_DIR="/data/labs/Fant/Quigg/03b.alignment_diseases" FASTQ_DIR="/data/labs/Fant/Quigg/02.trimming" mkdir -p "$ALIGNMENT_DIR"/{bams,sorted_bams,flagstats} ls "$FASTQ_DIR"/*_R1_paired.fastq.gz | sed 's/_R1_paired.fastq.gz//' | xargs -n1 basename > sample_list.txt screen -S ploidy_alignments -dm bash -c " echo '📡 Launching parallel alignments...' parallel -j 4 ./pipeline1.sh :::: sample_list.txt echo '✅ All samples processed.' " #exit chmod +x launch_pipeline1.sh ./launch_pipeline1.sh screen -L parallel -j 4 ./pipeline1.sh :::: sample_list.txt ::: --- ### Step 4 and 5: Picard and Bedtools ***Goal:** Mark and remove any duplicate reads to avoid false inflation of sequencing depth! Calculate coverage and depth of sequences!* Start by prepping everything to run the full pipeline including indexing the fasta and creating the bed file. Also make sure the files are in a sorted bam file. ```bash! #index and make the bed file needed for bedtools samtools faidx disease_regions.fasta awk '{print $1"\t0\t"$2}' disease_regions.fasta.fai > disease_regions.bed #make a conda environment for this conda create -n disease_analysis python=3.9 conda activate disease_analysis conda install bioconda::samtools conda install bioconda::picard conda install bioconda::bedtools conda install bioconda::blast ``` Now here's the full pipeline that will run 4 samples at once. ```bash! #!/bin/bash # ---- Configuration ---- BAM_DIR="bam" # Directory with input BAMs REF="disease_genes.fasta" # FASTA of full gene sequences (from 3 pathogens) THREADS=4 # Number of samples to process in parallel # ---- Step 0: Generate BED file from FASTA index ---- #samtools faidx $REF #awk '{print $1"\t0\t"$2}' ${REF}.fai > gene_regions.bed #echo "Generated BED file: gene_regions.bed" # ---- Function to process each sample ---- process_sample() { BAM="$1" SAMPLE=$(basename "$BAM" .bam) echo "Processing $SAMPLE" # Step 1: Mark duplicates picard MarkDuplicates \ I=$BAM \ O=${SAMPLE}.dedup.bam \ M=${SAMPLE}.dedup.metrics.txt \ REMOVE_DUPLICATES=true \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=SILENT # Step 2: Gene-level coverage bedtools coverage \ -a gene_regions.bed \ -b ${SAMPLE}.dedup.bam > ${SAMPLE}.coverage.tsv # Step 3: Extract covered genes (80% breadth, >10x depth per base) awk '($7 >= 0.8 && $5 >= 10){print $1"\t"$2"\t"$3}' ${SAMPLE}.coverage.tsv > ${SAMPLE}.present_regions.bed # Step 4: Pull sequences for BLAST if [[ -s ${SAMPLE}.present_regions.bed ]]; then bedtools getfasta \ -fi $REF \ -bed ${SAMPLE}.present_regions.bed \ -fo ${SAMPLE}.present_regions.fasta # Step 5: BLAST each detected gene blastn -query ${SAMPLE}.present_regions.fasta \ -db /home/general/Databases/nt \ -remote \ -out ${SAMPLE}.blastn.out \ -evalue 1e-10 -outfmt 6 else echo "No high-confidence hits for $SAMPLE" fi } # Export function and variables for GNU parallel export -f process_sample export REF # ---- Run pipeline in parallel across all BAMs ---- find $BAM_DIR -name "*.bam" | parallel -j $THREADS process_sample ``` --- ### Step 6: Analysis ***Goal:** Verify the taxonomy of the sequences!*