# Notes + code from reference-aligned Castilleja ddRAD data analysis
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Resources:**
1. Link to STACKS website:
https://catchenlab.life.illinois.edu/stacks/
2. Bioinformatics guide by Eric C. Anderson
https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to-guide
https://speciationgenomics.github.io/mapping_reference/
4. Slurm script how-to from NU https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
5. Globus - file management
https://www.globus.org/
------
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. mapping ddRADseq reads to reference genome with bwa + samtools
3. STACKS: gstacks
4. STACKS: populations
5. filtering with VCFtools and PLINK
----
# Process radtags
1. `process_radtags`
* Notes: truncated to 100 bp, used inline_null, allowed 2 adapter mismatches, filtered reads for quality and uncalled bases
```
process_radtags -1 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R1_001.fastq.gz -2 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R2_001.fastq.gz -i gzfastq -b /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt -o /projects/p32037/caco_mn/radtags_out_P2 --inline_null --renz_1 NsiI --renz_2 BfaI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Adapter sequences from UWBC (TruSeq kit):
```
Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
```
**I made a new directory for each plate: "radtags_out_P1" and "radtags_out_P2"**
#### Slurm script I used for Castilleja radtags:
* This script was originally written for ApeKI; the command below uses the NsiI + BfaI combination (see the `--renz` flags)
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=50GB
#SBATCH --job-name=CACO_radtags_P2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags_P2.error
#SBATCH --output=CACO_radtags_P2.out
module purge
module load stacks/2.68-gcc-12.3.0 ##specify environment to stacks version 2.68
process_radtags -1 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R1_001.fastq.gz -2 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R2_001.fastq.gz -i gzfastq -b /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt -o /projects/p32037/caco_mn/radtags_out_P2 --inline_null --renz_1 NsiI --renz_2 BfaI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 --threads 10
```
Ran once per plate (one enzyme combination per plate); results below:
```
sbatch radtags_P1.sh
sbatch radtags_P2.sh
```
| Plate | Total sequences | Reads with adapter | Barcode not found | Low quality read drops | RAD cutsite not found drops | Retained reads |
| ------------------ | ----------- | ---------------- | ----------------- | ---------------------- | --------------------------- | ----------------- |
| Plate 1 | 1,327,435,862 | 46,852 (0.0%) | 38,002,790 (2.9%) | 3,268,685 (0.2%) | 21,801,131 (1.6%) | 1,264,316,404 (95.2%) |
| Plate 2 | 1,111,231,964 | 36,885 (0.0%) | 30,640,340 (2.8%) | 2,615,228 (0.2%) |17,226,720 (1.6%) | 1,060,712,791 (95.5%) |
Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample, and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
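The log is organized into named sections that `stacks-dist-extract` can pull out directly; a sketch, assuming the per-sample section is called `per_barcode_raw_read_counts` as in recent Stacks releases (running `stacks-dist-extract` on the log with no section argument lists what is available):
```
# Pull the per-sample read counts out of the log
stacks-dist-extract process_radtags.log per_barcode_raw_read_counts | less
```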
STACKS outputs four files per sample when using paired-end reads:
```
sample_name.1.fq.gz
sample_name.2.fq.gz
sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
```
We only use the files without `.rem.`; the `.rem.` files can be deleted:
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
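A quick sanity check that every barcode yielded a sample file (paths as above):
```
# One forward read file per sample -- should match the number of barcode lines
ls *.1.fq.gz | wc -l
wc -l < /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt
```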
**Removed .rem files and moved all files into a single directory in "caco_mn" called "readfiles_all"**
I was curious about the raw read quality of the demultiplexed sequences, so I ran FastQC on the read files:
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/optimization/readfiles_all/fastqc -t 6 *.fq.gz
```
Hard to review each file one by one... so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once:
```
[ewf7555@quser32 fastqc]$ module load multiqc
[ewf7555@quser32 fastqc]$ multiqc .
# this searches the current directory tree and aggregates the FastQC reports it finds
```
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/caco_mn/readfiles_all/fastqc/multiqc_report_1.html) report of raw sequence data of Plate 1 and 2
# Mapping reads to a reference genome
### Preparing the reference:
1. Make a new directory for your reference genome files.
The reference needs to be uncompressed for this analysis.
```
#make new directory
mkdir ./reference
#move reference to this directory
mv Castillea_coccinea.hifiasm.bp.p_ctg.fa ./reference/
```
2. Index the reference genome with bwa + samtools + GATK:
```
#BWA indexing for general alignment
bwa index Castillea_coccinea.hifiasm.bp.p_ctg.fa
#create the FASTA index file
samtools faidx Castillea_coccinea.hifiasm.bp.p_ctg.fa
#create the sequence dictionary (in case you want to use GATK)
conda activate gatk
gatk CreateSequenceDictionary -R Castillea_coccinea.hifiasm.bp.p_ctg.fa -O /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.dict
```
> This will create a few other files with the same basename. They all need to be in the same directory with the genome file in order for mapping to work.
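For reference, a minimal check that all the index files landed next to the FASTA; the exact set below is what I'd expect from the three indexing commands above:
```
ls reference/
# Castillea_coccinea.hifiasm.bp.p_ctg.dict      (GATK sequence dictionary)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.amb    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.ann    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.bwt    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.fai    (samtools faidx)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.pac    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.sa     (bwa index)
```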
### Map the reads:
> Run from the directory with the read files. The loop further below aligns each pair of read files, generating a .sam file per sample.
Code for Castilleja mapping for sample CUP1 (trial run):
```
bwa mem -M -t 4 /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa CUP1.1.fq.gz CUP1.2.fq.gz > CUP1.sam
```
## Evaluating with samtools:
```
samtools flagstat CUP1.sam
12265267 + 0 in total (QC-passed reads + QC-failed reads)
110941 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
12159958 + 0 mapped (99.14% : N/A)
12154326 + 0 paired in sequencing
6077163 + 0 read1
6077163 + 0 read2
11670300 + 0 properly paired (96.02% : N/A)
12035334 + 0 with itself and mate mapped
13683 + 0 singletons (0.11% : N/A)
345608 + 0 with mate mapped to a different chr
65326 + 0 with mate mapped to a different chr (mapQ>=5)
```
> Note: Good mapping percentages are in the 90s. You can experiment with the `-k` value of bwa mem (minimum seed length; the default is 19 when bwa mem is run without `-k`) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the `-k`, the fewer exact base pairs need to match the reference in order to seed the mapping of a read. It can be a good idea to run bwa mem on one sample with several values of `-k` (default 19; 16; 14; 12; 10), evaluate which maps the most reads, and then use that value in the loop for the rest of the samples.
> Note: For this sample, the default `-k 19` works great.
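A sketch of that one-sample comparison (assumes bwa and samtools are loaded; the intermediate `.sam` names exist only for the test):
```
# Try several minimum seed lengths on one sample and compare mapping rates
REF=/projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa
for k in 19 16 14 12 10; do
    bwa mem -M -t 4 -k $k "$REF" CUP1.1.fq.gz CUP1.2.fq.gz > CUP1.k${k}.sam
    echo "== -k $k =="
    samtools flagstat CUP1.k${k}.sam | grep 'mapped ('
done
```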
Basic loop:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa $i $REVERSE > $NM.sam ; done
```
Slurm Script I used for a single population group
> Note: this took about 8-10 hours per pop, using ~11 GB memory
:::spoiler bwa mem slurm Script
#### Slurm script for batch processing
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=CUP_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CUP_bwamem_group.out.error
#SBATCH --output=CUP_bwamem_group.out
module purge
module load bwa
# Run bwa mem with 8 threads
for i in CUP*.1.fq.gz; do
NM=${i//.1.fq.gz/}  # Extract the base name (e.g., CUP1)
REVERSE=${i//.1.fq.gz/.2.fq.gz} # Replace .1.fq.gz with .2.fq.gz for reverse reads
bwa mem -M -t 8 /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
:::
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done
```
As long as your sorted .bam files look good, you can delete the .sam files; they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
:::spoiler sam to bam to sorted bam slurm script (with samtools flagstat)
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=Fcaco_sambamsort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=Fcaco_sambamsort.out.error
#SBATCH --output=Fcaco_sambamsort.out
module load samtools
# Process one sample at a time: convert SAM to BAM, sort, and clean up
for i in FUG*.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM.sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM.sorted.bam"
# Generate flagstat report
echo "Generating flagstat report for $NM.sorted.bam..."
samtools flagstat "$NM.sorted.bam" > "$NM.flagstat.txt"
echo "Flagstat report saved to $NM.flagstat.txt"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_caco_bams
mv "$NM.sorted.bam" sorted_caco_bams/
mv "$NM.flagstat.txt" sorted_caco_bams/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No *.sam files found."
exit 1
fi
done
echo "All *.sam files processed. Sorted BAM files and flagstat reports moved to sorted_bams directory."
```
:::
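Optionally, index the sorted BAMs; the Stacks steps below don't appear to need the index, but it makes spot-checking regions with `samtools view`/`tview` much faster (directory and suffix match the script above):
```
# Index each sorted BAM so specific regions can be inspected later
module load samtools
for i in sorted_caco_bams/*.sorted.bam; do samtools index "$i"; done
```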
---
## **Evaluating the alignments**
### **1. samtools flagstat**
:::spoiler Script for collecting **samtools flagstat data** from .txt files
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=process_flagstat
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=process_flagstat.out.error
#SBATCH --output=process_flagstat.out
# Set the directory containing the .flagstat.txt files
FLAGSTAT_DIR="/projects/p32037/caco_mn/readfiles_all/sorted_caco_bams"
# Create a summary file
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"
# Loop through all .flagstat.txt files
for file in "$FLAGSTAT_DIR"/*.flagstat.txt; do
if [[ -e "$file" ]]; then
# Extract the sample name (remove directory and extension)
SAMPLE=$(basename "$file" .flagstat.txt)
# Extract relevant information
QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')
# Append the data to the summary file
echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
else
echo "No .flagstat.txt files found in $FLAGSTAT_DIR."
exit 1
fi
done
echo "Summary of all .flagstat.txt files saved to $SUMMARY_FILE."
```
:::
### 2. CollectAlignmentSummaryMetrics from Picard tools
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this generic Picard command:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
      R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
I=input.bam \
O=output.txt
```
:::spoiler Bash script for this called run_picard.sh in the output_qual directory
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/"
# Output directory
OUTPUT_DIR="/projects/p32037/optimization/readfiles_all/output_qual/"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
:::
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags).
[MultiQC report here](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/picard/multiqc_report_1.html)

# Stacks: SNP calling
#### Basic command for ref_map.pl
```
ref_map.pl -o /projects/p32037/caco_mn/stacks_r8 \
--popmap /projects/p32037/caco_mn/popmap/popmap_all.txt \
--samples /projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/ \
-X "populations: -r 0.7 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf"
```
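For reference, `popmap_all.txt` follows the standard Stacks popmap format: one tab-separated line per sample, sample prefix then population label. The entries below are placeholders, not the real sample list:
```
# <sample prefix><TAB><population label>
CUP1	CUP
CUP2	CUP
FUG1	FUG
SPK5	SPK
```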
:::spoiler ref_map.pl slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=long
#SBATCH --time=24:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1 # threads
#SBATCH --mem=50GB #
#SBATCH --job-name=stacks_r8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_r8.error
#SBATCH --output=stacks_r8.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl -o /projects/p32037/caco_mn/stacks_r8 \
--popmap /projects/p32037/caco_mn/popmap/popmap_all.txt \
--samples /projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/ \
-X "populations: -r 0.7 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf"
```
:::
## GStacks output
```
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > r70_bam_stats_per_sample.tsv
```
> Pull the per-sample BAM stats out of the gstacks log distribs file, then move the table into R for plotting.
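Before plotting, it can help to confirm the column layout; a sketch (the kept-fraction column index is an assumption, so check it against the commented header lines):
```
# The leading '#' lines hold the column headers for this section
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | head -n 3
# Sort samples by the (assumed) kept-alignment fraction in column 4
sort -t$'\t' -k4,4g r70_bam_stats_per_sample.tsv | head
```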
### Fraction of alignments kept by sample

### Fraction of alignments kept by population

## Populations
Example script with parameters: `-r 0.8 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf`
:::spoiler populations slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=25GB # Adjusted memory based on previous usage
#SBATCH --job-name=populations_r8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=populations_r80.error
#SBATCH --output=populations_r80.out
module load stacks
populations -P /projects/p32037/caco_mn/stacks_r8/ -M /projects/p32037/caco_mn/popmap/popmap_all.txt -t 4 -r 0.8 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf
```
:::
#### Ran populations with -r = 0.7, 0.8, 0.9 to test SNP recovery
| r | Loci retained | Total sites | Fixed sites (filtered out) | Variant sites retained | Mean sites per locus (bp) | % sites covered by multiple loci |
| --- | ------------- | ----------- | --- | --- | --- | --- |
| 0.7 | 99,304 | 20,055,663 | 3,495,534 | 319,749 | 167.04 | 1.8% |
| 0.8 | 87,506 | 17,687,927 | 3,145,076 | 267,076 | 166.54 | 1.3% |
| 0.9 | 73,628 | 14,896,403 | 2,722,026 | 199,623 | 165.81 | 0.8% |
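A loop makes the threshold comparison reproducible; a sketch, assuming one output directory per value (the directory names are hypothetical; `-O` is populations' out-path option):
```
# Re-run populations at several -r thresholds, writing each to its own directory
for r in 0.7 0.8 0.9; do
    OUT=/projects/p32037/caco_mn/populations_r${r}
    mkdir -p "$OUT"
    populations -P /projects/p32037/caco_mn/stacks_r8/ \
        -M /projects/p32037/caco_mn/popmap/popmap_all.txt \
        -t 4 -r $r -O "$OUT" --vcf
done
```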
# vcftools introduction
Dylan's example code for filtering the populations.snps.vcf file:
```
vcftools --vcf input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out renamed.output
```
### VCF Parameters:
**Depth:**
```
--min-meanDP <float>
--max-meanDP <float>
```
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.
**Hardy-Weinberg Equilibrium:**
```
--hwe <float>
```
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.
**Missing data:**
```
--max-missing <float>
```
Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
```
--max-missing-count <integer>
```
Exclude sites with more than this number of missing genotypes over all individuals.
**Phased:**
```
--phased
```
Excludes all sites that contain unphased genotypes.
**Miscellaneous filtering:**
```
--minQ <float>
```
Includes only sites with a Quality value above this threshold.
### Example code
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.80 --recode --out pop_r8_miss80_mac3_DP5
```
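To see what a given filter combination kept, count the non-header lines in the recoded VCF and generate a per-individual missingness report (`--missing-indv` writes `<prefix>.imiss`; the output prefix below just reuses the filter name):
```
# Sites retained after filtering (non-header lines in the recoded VCF)
grep -vc '^#' pop_r8_miss80_mac3_DP5.recode.vcf
# Per-individual missingness; see the F_MISS column of the .imiss output
vcftools --vcf pop_r8_miss80_mac3_DP5.recode.vcf --missing-indv --out pop_r8_miss80_mac3_DP5
```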
## VCF filtering results

## SNPFiltR Filtering results
Missing by sample distribution


Looks like only two samples (SPK5 and SPK6) have a really high proportion of missing data; see the removal sketch below.
* Mean depth across dataset = 25.5
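If SPK5 and SPK6 end up getting dropped, vcftools can exclude them by name; a sketch (the output prefix is hypothetical):
```
# Remove the two high-missingness individuals before downstream filtering
vcftools --vcf pop_r8_miss80_mac3_DP5.recode.vcf \
    --remove-indv SPK5 --remove-indv SPK6 \
    --recode --out pop_r8_miss80_mac3_DP5_noSPK
```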
## Results of filtering with SNPFiltR and SNPRelate
