# GBS Optimization Analysis
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. Map ddRADseq reads to the reference genome with bwa + samtools
3. Evaluate the alignments with Picard tools and a MultiQC report
4. STACKS: gstacks + populations - call SNPs
----
# Process radtags
## Code
1. `process_radtags`
* Notes: reads truncated to 100 bp, used inline_null barcodes, allowed 2 adapter mismatches, filtered reads for quality and uncalled bases
```
process_radtags -1 PstIBfaI_S389_L001_R1_001.fastq.gz -2 PstIBfaI_S389_L001_R2_001.fastq.gz -i fastq -b GBS_barcodes.PstI-BfaI-193-288.txt -o radtags_out --inline_null --renz_1 PstI --renz_2 BfaI --adapter_1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Adapter sequences from UWBC (TruSeq kit):
Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
**I made 4 new directories, one per enzyme combo, within the "optimization" folder: "radtags_out_NM", "radtags_out_PB", "radtags_out_A", and "radtags_out_NB"**
#### Slurm script I used for Castilleja radtags:
* This was for ApeKI
* FYI: this was originally run on an older version of STACKS (2.64-gcc-9.2.0)
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20GB ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=CACO_radtags_ApeKI ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags_ApeKI.error # Enter a error log file name
#SBATCH --output=CACO_radtags_ApeKI.out # Enter your out.log file name
module purge
module load stacks/2.68-gcc-12.3.0 ##specify environment
process_radtags -1 ApeKI_S130_L008_R1_001.fastq.gz -2 ApeKI_S130_L008_R2_001.fastq.gz -i gzfastq -b GBS_barcodes.ApeKI.txt -o radtags_out_A --inline_null --renz_1 ApeKI --adapter_1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Ran for each enzyme combination, results below:
```
sbatch rad_tags_ApeKI.sh
sbatch rad_tags_NsiI-BfaI.sh
sbatch rad_tags_NsiI-MspI.sh
sbatch rad_tags_PstI-BfaI.sh
```
| Enzyme | Total sequences | Reads with adapter | Barcode not found | Low quality read drops | RAD cutsite not found drops | Retained reads |
| ------------------ | ----------- | ---------------- | ----------------- | ---------------------- | --------------------------- | ----------------- |
| NsiI-MspI | 232,059,608 | 66,208,505 (28.5%) | 9,703,700 (4.2%) | 36,118 (0.0%) | 598,817 (0.3%) | 155,512,468 (67.0%) |
| PstI-BfaI | 136,230,766 | 18,778,977 (13.8%) | 4,472,128 (3.3%) | 20,945 (0.0%) | 1,993,870 (1.5%) | 110,964,846 (81.5%) |
| ApeKI | 227,755,212 | 64,479,630 (28.3%) | 874,000 (0.4%) | 3,957,680 (1.7%) | 801,463 (0.4%) | 157,642,439 (69.2%) |
| NsiI-BfaI | 201,448,460 | 41,565,500 (20.6%) | 8,509,654 (4.2%) | 3,018,687 (1.5%) | 1,822,925 (0.9%) | 146,531,694 (72.7%) |
Once process_radtags has finished running, change to your output directory and inspect the process_radtags.log file. This gives basic information on how many sequences were found per sample and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
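To pull just the per-sample table out of the log, the BEGIN/END section markers that STACKS writes can be used directly. This is a sketch; the section name `per_barcode_raw_read_counts` follows the STACKS manual, and `stacks-dist-extract` (bundled with STACKS) does the same job if it is on your PATH:

```shell
per_barcode_counts() {
    # The log wraps each summary table in "BEGIN <section>" / "END <section>"
    # markers; this prints only the per-barcode read-count table.
    sed -n '/^BEGIN per_barcode_raw_read_counts/,/^END per_barcode_raw_read_counts/p' \
        "${1:-process_radtags.log}"
}
```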
STACKS outputs four files per sample when using paired-end reads:
* sample_name.1.fq.gz
* sample_name.2.fq.gz
* sample_name.rem.1.fq.gz
* sample_name.rem.2.fq.gz

We only use the files without .rem; the files with .rem can be deleted.
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
**Removed the .rem files, added an enzyme identifier ("A/NM/PB/NB") to the file names, and moved all files into a single directory in "optimization" called "readfiles_all"**
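That renaming/move step could be scripted as follows. This is a sketch: the helper name `prefix_reads` is mine, and it assumes each enzyme's output still sits in its own radtags_out_* directory:

```shell
# prefix_reads PB radtags_out_PB readfiles_all
#   -> readfiles_all/PB_<sample>.1.fq.gz, readfiles_all/PB_<sample>.2.fq.gz, ...
prefix_reads() {
    local prefix=$1 src=$2 dest=$3
    mkdir -p "$dest"
    for f in "$src"/*.fq.gz; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        mv "$f" "$dest/${prefix}_$(basename "$f")"
    done
}
```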
I was curious about the raw read quality of the demultiplexed sequences, so I ran FastQC on the read files
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/optimization/readfiles_all/fastqc -t 6 *.fq.gz
```
Reviewing each file individually is tedious, so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once
```
module load multiqc
multiqc .
# this searched through the entire directory and found the fastqc reports generated
```
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/multiqc_report_149.html) report of raw sequence data of NM and PB enzymes (from UWBC)
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/muliqc_report_150.html) report of raw sequence data of A and NM enzymes (from UWBC)
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/fastqc/multiqc_report.html) report of reads after process_radtags
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/fastqc/multiqc_report_1.html) report of reads after process_radtags with Truseq adaptors
# Mapping reads to a reference genome
### Preparing the reference:
1. Make a new directory for your reference genome files.
The reference needs to be uncompressed for this analysis.
```
#make new directory
mkdir ./reference
#move reference to this directory
mv tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa ./reference/
```
2. Index the reference genome with bwa + samtools + GATK:
```
#BWA indexing for general alignment
bwa index tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
#create the FASTA index file
samtools faidx tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
#create the sequence dictionary (in case you want to use GATK)
gatk CreateSequenceDictionary -R tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa -O /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.dict
```
> This will create a few other files with the same basename. They all need to be in the same directory as the reference .fa file in order for mapping to work.
I also indexed and created the FASTA index + sequence dictionary for the repeat-masked genome file (in a separate directory: /projects/p32037/optimization/reference/RM/PO3051_Castilleja_pilosa.RepeatMasked.fasta)
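The indexing steps above should leave a set of sibling files next to the .fa (bwa index writes .amb/.ann/.bwt/.pac/.sa, samtools faidx writes .fai, and GATK writes the .dict). A quick existence check (my own helper, not part of any tool) catches a half-finished index before mapping:

```shell
check_ref_index() {
    local ref=$1 ext missing=0
    # bwa index outputs plus the samtools faidx .fai
    for ext in amb ann bwt pac sa fai; do
        [ -e "${ref}.${ext}" ] || { echo "missing ${ref}.${ext}"; missing=1; }
    done
    # GATK's sequence dictionary replaces the .fa suffix with .dict
    [ -e "${ref%.fa}.dict" ] || { echo "missing ${ref%.fa}.dict"; missing=1; }
    return "$missing"
}
```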
### Map the reads:
> Run from the directory with the read files; the loop will align the reads for each pair of read files, generating a .sam file
Code for Castilleja mapping for sample A_FA23-01 (trial run):
```
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa A_FA23-01.1.fq.gz A_FA23-01.2.fq.gz > A_FA23-01.sam
```
#### Slurm script for batch processing
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=5:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4GB # Adjusted memory based on previous usage
#SBATCH --job-name=bwamem_A_group
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_A_group.out.error
#SBATCH --output=bwamem_A_group.out
module purge
module load bwa
# Run bwa mem for the A group with 8 threads
for i in A*.1.fq.gz; do
NM=${i//.1.fq.gz/} # Extract the base name (e.g., A_FA23-01)
REVERSE=${i//.1.fq.gz/.2.fq.gz} # Replace .1.fq.gz with .2.fq.gz for reverse reads
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
* This took about 15 mins for one sample...
## Evaluating with samtools:
```
samtools flagstat A_FA23-01.sam
25584356 + 0 in total (QC-passed reads + QC-failed reads)
244644 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
17295138 + 0 mapped (67.60% : N/A)
25339712 + 0 paired in sequencing
12669856 + 0 read1
12669856 + 0 read2
16222828 + 0 properly paired (64.02% : N/A)
16906604 + 0 with itself and mate mapped
143890 + 0 singletons (0.57% : N/A)
465940 + 0 with mate mapped to a different chr
127171 + 0 with mate mapped to a different chr (mapQ>=5)
```
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=4 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=2GB ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=bwamem_NB_FA23-01.out ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_NB_FA23-01.out.error # Enter a error log file name
#SBATCH --output=bwamem_NB_FA23-01.out # Enter your out.log file name
module purge
module load bwa ##specify environment
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa NB_FA23-01.1.fq.gz NB_FA23-01.2.fq.gz > NB_FA23-01.sam
```
```
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa PB_FA23-01.1.fq.gz PB_FA23-01.2.fq.gz > PB_FA23-01.sam
```
Let's try the other enzyme combos to see which one has the highest mapping rate for A_FA23-01 (default seed length -k = 19):
| Enzyme | QC-Passed reads | Mapped | Properly paired |
| ------ | --- | -------------------- | -------------------- |
| A | 25,584,356 | 17,295,138 (67.60%) | 16,222,828 (64.02%) |
| NB | 21,854,484 | 19,941,013 (91.24%) | 17,994,898 (83.31%) |
| NM | 32,224,772 | 28,359,324 (88.00%) | 25,619,866 (80.83%) |
| PB | 17,913,790 | 13,954,441 (77.90%) | 13,053,964 (73.29%) |
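The flagstat numbers in the table above can be collected in one pass. This sketch (my own helper; it assumes samtools is on PATH and parses flagstat's usual `NNN + 0 mapped (67.60% : N/A)` line) prints a per-sample mapping rate for every .sam in the current directory:

```shell
mapping_summary() {
    local sam
    for sam in *.sam; do
        [ -e "$sam" ] || continue   # skip if the glob matched nothing
        # Grab the percentage from the first "mapped (...)" line of flagstat
        printf '%s\t%s\n' "$sam" \
            "$(samtools flagstat "$sam" | awk -F'[()]' '/ mapped \(/{split($2,a," "); print a[1]; exit}')"
    done
}
```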
> Note: Good mapping percentages are in the 90s. You can experiment with bwa mem's minimum seed length (-k) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the -k, the fewer exact base pairs need to match the reference in order to seed an alignment, so more reads map but with less confidence. The default is 19 when bwa mem is run without a -k value. It can be a good idea to run bwa mem on one sample with several values of -k (default, 16, 14, 12, 10), evaluate which maps the most reads, and then use the loop for the rest of the samples.
> Note: This took a long time! I increased threads to 8 and ran separate bash scripts for each enzyme group, but it still took 8+ hours!
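The seed-length sweep suggested in the note above might look like this sketch (the helper name is mine; bwa mem's minimum seed length flag is lowercase -k, default 19, and bwa/samtools are assumed on PATH):

```shell
REF=/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa

seed_sweep() {
    local k
    for k in 19 16 14 12 10; do
        # Map one sample at this seed length, then report the flagstat mapped line
        bwa mem -M -t 8 -k "$k" "$REF" A_FA23-01.1.fq.gz A_FA23-01.2.fq.gz > "A_FA23-01.k${k}.sam"
        echo "k=$k  $(samtools flagstat "A_FA23-01.k${k}.sam" | grep -m1 ' mapped (')"
    done
}
```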
Basic loop:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam" ; done
```
Slurm Script I used for a single enzyme group:
:::spoiler bwa mem slurm Script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=short
#SBATCH --time=2:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=bwamem_PB_group
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_PB_group.out.error
#SBATCH --output=bwamem_PB_group.out
module purge
module load bwa
# Run bwa mem for the PB group with 8 threads
for i in PB*.1.fq.gz; do
NM=${i//.1.fq.gz/}
REVERSE=${i//.1.fq.gz/.2.fq.gz}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
:::
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view "$i" -b -o "$NM.bam"; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort "$i" -o "$NM.sort.bam"; done
```
As long as your sorted .bam files look good, you can delete the .sam files, since they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
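Before deleting the .sam files, samtools quickcheck can confirm each sorted BAM is intact: it exits non-zero (and with -v names the offending file) if a BAM is truncated or missing its EOF block. A small sketch (helper name is mine):

```shell
check_bams() {
    # quickcheck reads only the headers and EOF marker, so this is fast
    samtools quickcheck -v ./*.sort.bam && echo "all sorted BAMs look ok"
}
```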
:::spoiler sam to bam to sorted bam slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=A_sambamsort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=A_sambamsort.out.error
#SBATCH --output=A_sambamsort.out
module load samtools
# Process one sample at a time for files that start with "A": convert SAM to BAM, sort, and clean up
for i in A*.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM.sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM.sorted.bam"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_bams
mv "$NM.sorted.bam" sorted_bams/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No A*.sam files found."
exit 1
fi
done
echo "All A*.sam files processed and sorted BAM files moved to sorted_bams directory."
```
:::
---
**Evaluating the alignments**
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this template:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
      R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
      I=input.bam \
      O=output.txt
```
:::spoiler Bash script for this called run_picard.sh in the output_qual directory
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/"
# Output directory
OUTPUT_DIR="/projects/p32037/optimization/readfiles_all/output_qual/"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
:::
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags)
[MultiQC report here](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/multiqc/multiqc_report.html)

# Stacks: SNP calling
#### Basic loop for ref_map.pl
```
ref_map.pl -o /projects/p32037/optimization/stacks_A --popmap /projects/p32037/optimization/popmap/popmap.txt --samples /projects/p32037/optimization/readfiles_all/sorted_bams/A -X "populations:--fstats --r 0.7 --genepop --smooth --ordered-export --fasta-loci --hwe -t 8"
```
:::spoiler ref_map.pl slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=stacks_A
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A.error
#SBATCH --output=stacks_A.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl -o /projects/p32037/optimization/stacks_A \
--popmap /projects/p32037/optimization/popmap/popmap.txt \
--samples /projects/p32037/optimization/readfiles_all/sorted_bams/A \
-X "populations:--fstats --r 0.7 --genepop --smooth --ordered-export --fasta-loci --hwe -t 8"
```
:::
## GStacks output
```
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > PB_bam_stats_per_sample.tsv
```
> Get gstacks log distribs from output file and move into R
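If you extract bam_stats_per_sample for each enzyme into files named like PB_bam_stats_per_sample.tsv (as above), a sketch like this tags each row with its enzyme so everything can be read into R as one table (the helper name is mine):

```shell
combine_bam_stats() {
    local f
    for f in *_bam_stats_per_sample.tsv; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        # Use the filename prefix (e.g. "PB") as a leading enzyme column
        awk -v e="${f%%_*}" 'BEGIN{OFS="\t"}{print e,$0}' "$f"
    done > bam_stats_all.tsv
}
```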
### Fraction of alignments kept by sample

### Fraction of alignments kept by enzyme

### Fraction of alignments kept by enzyme and lab

### Populations (without kernel smooth on)
> The only way I could get populations to run was to turn off all flags related to kernel smoothing (--smooth, --ordered-export, --smooth-fstats). This also lowered the memory usage.
Parameters: r = 0.80 with --fstats and --vcf only
:::spoiler ref_map.pl script w/out --smooth
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=100GB
#SBATCH --job-name=stacks_A
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A.error
#SBATCH --output=stacks_A.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl --samples /projects/p32037/optimization/readfiles_all/sorted_bams/A --popmap /projects/p32037/optimization/popmap/popmap_A.txt -o /projects/p32037/optimization/stacks_A -T 8 -X "populations: -r 0.8 --fstats --vcf"
```
:::
| Enzyme | Lab | Private | Variant Sites | SNPs | % polymorphic | Obs_het (variant sites) |
| --------- | ----------- | ------- | ------------- | ------ | ------------- | ----------------------- |
| ApeKI | dePamphilis | 26,367 | 69,024 | 55,524 | 0.69936 | 0.21064 |
| ApeKI | Fant | 13,364 | 89,820 | 63,317 | 0.62603 | 0.21064 |
| NsiI/BfaI | dePamphilis | 6,271 | 63,393 | 60,484 | 0.89491 | 0.26274 |
| NsiI/BfaI | Fant | 2,909 | 17,211 | 10,940 | 0.13479 | 0.16774 |
| PstI/BfaI | dePamphilis | 14,595 | 46,543 | 39,452 | 0.79369 | 0.1956 |
| PstI/BfaI | Fant | 7,044 | 39,591 | 24,949 | 0.4769 | 0.17651 |
| NsiI/MspI | dePamphilis | 5,340 | 31,549 | 29,569 | 0.9057 | 0.25902 |
| NsiI/MspI | Fant | 1,969 | 11,135 | 5,784 | 0.14035 | 0.14035 |
# Repeat Masked genome outputs
## Using repeat masked genome to filter reads
#### Process overview:
1. Turn the RM genome into a .bed file for filtering
:::spoiler Make .bed file from RepeatMasked Genome
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=7:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=make_bed
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=make_bed.out.error
#SBATCH --output=make_bed.out
awk 'BEGIN{FS=""; OFS="\t"}
/^>/ {if (seq) print_masked_regions(); header=$0; seq=""; next}
{seq=seq""$0}
END {print_masked_regions()}
function print_masked_regions() {
start=0; in_mask=0;
for (i=1; i<=length(seq); i++) {
base=substr(seq, i, 1);
if (base ~ /[a-z]/ && !in_mask) {
start=i; in_mask=1;
} else if (base ~ /[A-Z]/ && in_mask) {
print substr(header, 2), start-1, i-1;
in_mask=0;
}
}
}' PO3051_Castilleja_pilosa.RepeatMasked.fasta > masked_regions_capi.bed
```
:::
2. Filter bam files using .bed file
:::spoiler Filtering bam files based on repeat regions in RM genome
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjusted memory based on previous usage
#SBATCH --job-name=intersect
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=intersect_RM.out.error
#SBATCH --output=intersect_RM.out
module load bedtools
# Directory containing your sorted BAM files
bam_dir="/projects/p32037/optimization/readfiles_all/sorted_bams"
# BED file with masked regions
bed_file="/projects/p32037/optimization/reference/RM/masked_regions_capi.bed"
# Output directory for filtered BAM files
output_dir="/projects/p32037/optimization/readfiles_all/sorted_bams/filtered_bams"
mkdir -p "$output_dir" # Create the output directory if it doesn't exist
# Loop through all sorted BAM files in the specified directory
for bam_file in "$bam_dir"/*.sorted.bam; do
# Extract the base name of the BAM file
base_name=$(basename "$bam_file" .sorted.bam)
# Define the output filtered BAM file
output_bam="$output_dir/${base_name}_filtered.bam"
# Run bedtools intersect to filter out reads that overlap with masked regions
echo "Processing $bam_file..."
bedtools intersect -v -a "$bam_file" -b "$bed_file" > "$output_bam"
echo "Filtered BAM file saved to $output_bam"
done
echo "All BAM files processed."
```
:::
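Before running STACKS on the filtered bams, it can be worth checking how many reads the repeat-mask filter removed per sample. A sketch (my own helper; it assumes samtools is on PATH and the directory layout used in the script above):

```shell
filter_counts() {
    local bam base before after
    for bam in sorted_bams/*.sorted.bam; do
        [ -e "$bam" ] || continue
        base=$(basename "$bam" .sorted.bam)
        # samtools view -c counts alignment records in a BAM
        before=$(samtools view -c "$bam")
        after=$(samtools view -c "sorted_bams/filtered_bams/${base}_filtered.bam")
        printf '%s\t%s\t%s\n' "$base" "$before" "$after"
    done
}
```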
3. Run stacks on filtered .bam files
:::spoiler stacks ref_map.pl run example script for RM filtered reads (ApeKI )
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=25GB # Adjusted memory based on previous usage
#SBATCH --job-name=stacks_A_RM
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A_RM.error
#SBATCH --output=stacks_A_RM.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks
# Run the ref_map.pl command
ref_map.pl --samples /projects/p32037/optimization/readfiles_all/sorted_bams/filtered_bams/A --popmap /projects/p32037/optimization/popmap/popmap_A.txt -o /projects/p32037/optimization/sta$
```
:::
#### gstacks output
| Enzyme | Loci built | Coverage | Mean number of sites per locus | Primary alignments kept | % Primary alignments kept |
| --------- | ---------- | ----------------------------------------------- | ------------------------------ | ---------------------- | ------------------------ |
| NsiI/BfaI | 61,977 | mean=30.8x, stdev=11.1x, min=2.5x, max=40.8x | 164.9 | 16,352,381 | 26.40% |
| PstI/BfaI | 37,763 | mean=104.5x, stdev=21.4x, min=62.9x, max=141.0x | 196.5 | 37,025,447 | 36.80% |
| NsiI/MspI | 39,778 | mean=46.8x, stdev=16.3x, min=4.6x, max=58.7x | 152.9 | 12,945,256 | 11.90% |
| ApeKI | 129,316 | mean=23.1x, stdev=6.1x, min=12.1x, max=31.6x | 160.4 | 23,821,835 | 13.70% |
#### populations output from this run
## Mapping directly to masked genome
# GATK SNP calling
### Get reference genome ready
1. FASTA Indexing with Samtools (done)
2. Create the Sequence Dictionary with GATK (done)
### Mark duplicates in sorted bam files
:::spoiler mark_dups.sh
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=mark_duplicates_A
#SBATCH --output=mark_duplicates_A.out
#SBATCH --error=mark_duplicates_A.err
#SBATCH --time=4:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=5G # Memory allocation
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
module load samtools
# Set the directory containing your sorted BAM files
BAM_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/A"
OUTPUT_DIR="/projects/p32037/optimization/gatk/bams"
# Loop through each BAM file in the directory
for bamfile in ${BAM_DIR}/*.bam; do
# Extract the sample name from the BAM file (without the .bam extension)
sample=$(basename ${bamfile} .bam)
# Set the output file names
marked_bam=${OUTPUT_DIR}/${sample}_marked.bam
metrics_file=${OUTPUT_DIR}/${sample}_marked_dup_metrics.txt
# Mark duplicates
gatk MarkDuplicates \
-I ${bamfile} \
-O ${marked_bam} \
-M ${metrics_file}
# Index the marked BAM file
samtools index ${marked_bam}
echo "Processed ${sample}: duplicates marked and indexed."
done
echo "All BAM files processed."
```
:::
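To compare duplicate rates across samples, the PERCENT_DUPLICATION column can be pulled out of each Picard metrics file. This sketch (my own helper) relies on Picard's usual metrics layout: comment lines, a tab-separated header row, then one value row:

```shell
dup_rates() {
    local f
    for f in *_marked_dup_metrics.txt; do
        [ -e "$f" ] || continue
        awk -v s="${f%_marked_dup_metrics.txt}" '
            /^#/ || /^$/ {next}                              # skip comments and blanks
            !cols { for (i=1;i<=NF;i++)                      # first data line = header:
                        if ($i=="PERCENT_DUPLICATION") col=i # find the column index
                    cols=1; next }
            col   { print s"\t"$col; exit }                  # second data line = values
        ' "$f"
    done
}
```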
### Call SNPs using Haplotype Caller
:::spoiler Haplotype caller
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=haplotype_caller_A
#SBATCH --output=haplotype_caller_A.out
#SBATCH --error=haplotype_caller_A.err
#SBATCH --time=8:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=20G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your marked BAM files
BAM_DIR="/projects/p32037/optimization/gatk/bams"
OUTPUT_DIR="/projects/p32037/optimization/gatk/gvcfs"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Loop through each BAM file in the directory
for bamfile in ${BAM_DIR}/*_marked.bam; do
# Extract the sample name from the BAM file (without the _marked.bam extension)
sample=$(basename ${bamfile} _marked.bam)
# Set the output GVCF file name
gvcf_output=${OUTPUT_DIR}/${sample}.g.vcf.gz
# Run HaplotypeCaller
gatk HaplotypeCaller \
-R ${REFERENCE} \
-I ${bamfile} \
-O ${gvcf_output} \
-ERC GVCF
echo "Processed ${sample}: GVCF generated."
done
echo "All samples processed with HaplotypeCaller."
```
:::
### Combine GVCFs
:::spoiler Joint GVCF
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=combine_gvcfs
#SBATCH --output=combine_gvcfs.out
#SBATCH --error=combine_gvcfs.err
#SBATCH --time=4:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=10G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your GVCF files
GVCF_DIR="/projects/p32037/optimization/gatk/gvcfs"
OUTPUT_DIR="/projects/p32037/optimization/gatk/combined"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Output file for combined GVCFs
combined_gvcf=${OUTPUT_DIR}/combined_samples.g.vcf.gz
# Combine the GVCF files
gatk CombineGVCFs \
-R ${REFERENCE} \
$(for gvcf in ${GVCF_DIR}/*.g.vcf.gz; do echo "-V ${gvcf} "; done) \
-O ${combined_gvcf}
echo "All GVCF files combined."
```
:::
### Genotype GVCFs
:::spoiler Genotype GVCFs
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=genotype_gvcfs
#SBATCH --output=genotype_gvcfs.out
#SBATCH --error=genotype_gvcfs.err
#SBATCH --time=8:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=20G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your combined GVCF
OUTPUT_DIR="/projects/p32037/optimization/gatk/combined"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Combined GVCF file from previous step
combined_gvcf=${OUTPUT_DIR}/combined_samples.g.vcf.gz
# Output file for final multi-sample VCF
final_vcf=${OUTPUT_DIR}/final_joint_genotyping.vcf.gz
# Run joint genotyping
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ${combined_gvcf} \
-O ${final_vcf}
echo "Joint genotyping completed. Final VCF generated: ${final_vcf}"
```
:::
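A quick sanity check on the joint-genotyped VCF is to count its record lines, i.e. everything that is not a `#` header line (sketch; `vcf_count` is my own helper):

```shell
vcf_count() {
    # Decompress and count non-header lines (one per variant record)
    gzip -cd "${1:?usage: vcf_count file.vcf.gz}" | grep -vc '^#'
}
```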