# GBS Optimization Analysis
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. Map ddRADseq reads to the reference genome with bwa + samtools
3. Evaluate the alignments with Picard tools and a MultiQC report
4. STACKS: gstacks + populations - call SNPs
----
# Process radtags
## Code
1. `process_radtags`
* Notes: reads truncated to 100 bp, used inline_null barcodes, allowed 2 adapter mismatches, filtered reads for quality and uncalled bases
```
process_radtags -1 PstIBfaI_S389_L001_R1_001.fastq.gz -2 PstIBfaI_S389_L001_R2_001.fastq.gz -i fastq -b GBS_barcodes.PstI-BfaI-193-288.txt -o radtags_out --inline_null --renz_1 PstI --renz_2 BfaI --adapter_1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Adapter sequences from UWBC (TruSeq kit):
Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
**I made 4 new directories, one per enzyme combo, within the "optimization" folder: "radtags_out_NM", "radtags_out_PB", "radtags_out_A", and "radtags_out_NB"**
#### Slurm script I used for Castilleja radtags:
* This was for ApeKI
* FYI: this was originally run on an older version of STACKS (2.64-gcc-9.2.0)
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20GB ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=CACO_radtags_ApeKI ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags_ApeKI.error # Enter a error log file name
#SBATCH --output=CACO_radtags_ApeKI.out # Enter your out.log file name
module purge
module load stacks/2.68-gcc-12.3.0 ##specify environment
process_radtags -1 ApeKI_S130_L008_R1_001.fastq.gz -2 ApeKI_S130_L008_R2_001.fastq.gz -i gzfastq -b GBS_barcodes.ApeKI.txt -o radtags_out_A --inline_null --renz_1 ApeKI --adapter_1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Ran for each enzyme combination, results below:
```
sbatch rad_tags_ApeKI.sh
sbatch rad_tags_NsiI-BfaI.sh
sbatch rad_tags_NsiI-MspI.sh
sbatch rad_tags_PstI-BfaI.sh
```
| Enzyme | Total sequences | Reads with adapter | Barcode not found | Low quality read drops | RAD cutsite not found drops | Retained reads |
| ------------------ | ----------- | ---------------- | ----------------- | ---------------------- | --------------------------- | ----------------- |
| NsiI-MspI | 232,059,608 | 66,208,505 (28.5%) | 9,703,700 (4.2%) | 36,118 (0.0%) | 598,817 (0.3%) | 155,512,468 (67.0%) |
| PstI-BfaI | 136,230,766 | 18,778,977 (13.8%) | 4,472,128 (3.3%) | 20,945 (0.0%) | 1,993,870 (1.5%) | 110,964,846 (81.5%) |
| ApeKI | 227,755,212 | 64,479,630 (28.3%) | 874,000 (0.4%) | 3,957,680 (1.7%) | 801,463 (0.4%) | 157,642,439 (69.2%) |
| NsiI-BfaI | 201,448,460 | 41,565,500 (20.6%) | 8,509,654 (4.2%) | 3,018,687 (1.5%) | 1,822,925 (0.9%) | 146,531,694 (72.7%) |
Once process_radtags has finished running, change to your output directory and inspect the process_radtags.log file. This gives basic information on how many sequences were found per sample and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
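To pull just the per-sample table out of the log, the BEGIN/END section markers that STACKS writes can be used directly. This is a sketch; the section name `per_barcode_raw_read_counts` follows the STACKS manual, and `stacks-dist-extract` (bundled with STACKS) does the same job if it is on your PATH:

```shell
per_barcode_counts() {
    # The log wraps each summary table in "BEGIN <section>" / "END <section>"
    # markers; this prints only the per-barcode read-count table.
    sed -n '/^BEGIN per_barcode_raw_read_counts/,/^END per_barcode_raw_read_counts/p' \
        "${1:-process_radtags.log}"
}
```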
STACKS outputs four files per sample when using paired-end reads:
* sample_name.1.fq.gz
* sample_name.2.fq.gz
* sample_name.rem.1.fq.gz
* sample_name.rem.2.fq.gz

We only use the files without .rem; the files with .rem can be deleted.
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
**Removed the .rem files, added an enzyme identifier ("A/NM/PB/NB") to the file names, and moved all files into a single directory in "optimization" called "readfiles_all"**
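That renaming/move step could be scripted as follows. This is a sketch: the helper name `prefix_reads` is mine, and it assumes each enzyme's output still sits in its own radtags_out_* directory:

```shell
# prefix_reads PB radtags_out_PB readfiles_all
#   -> readfiles_all/PB_<sample>.1.fq.gz, readfiles_all/PB_<sample>.2.fq.gz, ...
prefix_reads() {
    local prefix=$1 src=$2 dest=$3
    mkdir -p "$dest"
    for f in "$src"/*.fq.gz; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        mv "$f" "$dest/${prefix}_$(basename "$f")"
    done
}
```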
I was curious about the raw read quality of the demultiplexed sequences, so I ran FastQC on the read files
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/optimization/readfiles_all/fastqc -t 6 *.fq.gz
```
Reviewing each file individually is tedious, so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once
```
module load multiqc
multiqc .
# this searched through the entire directory and found the fastqc reports generated
```
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/multiqc_report_149.html) report of raw sequence data of NM and PB enzymes (from UWBC)
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/muliqc_report_150.html) report of raw sequence data of A and NM enzymes (from UWBC)
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/fastqc/multiqc_report.html) report of reads after process_radtags
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/fastqc/multiqc_report_1.html) report of reads after process_radtags with Truseq adaptors
# Mapping reads to a reference genome
### Preparing the reference:
1. Make a new directory for your reference genome files.
The reference needs to be uncompressed for this analysis.
```
#make new directory
mkdir ./reference
#move reference to this directory
mv tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa ./reference/
```
2. Index the reference genome with bwa + samtools + GATK:
```
#BWA indexing for general alignment
bwa index tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
#create the FASTA index file
samtools faidx tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
#create the sequence dictionary (in case you want to use GATK)
gatk CreateSequenceDictionary -R tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa -O /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.dict
```
> This will create a few other files with the same basename. They all need to be in the same directory as the reference .fa file in order for mapping to work.
I also indexed and created the FASTA index + sequence dictionary for the repeat-masked genome file (in a separate directory: /projects/p32037/optimization/reference/RM/PO3051_Castilleja_pilosa.RepeatMasked.fasta)
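The indexing steps above should leave a set of sibling files next to the .fa (bwa index writes .amb/.ann/.bwt/.pac/.sa, samtools faidx writes .fai, and GATK writes the .dict). A quick existence check (my own helper, not part of any tool) catches a half-finished index before mapping:

```shell
check_ref_index() {
    local ref=$1 ext missing=0
    # bwa index outputs plus the samtools faidx .fai
    for ext in amb ann bwt pac sa fai; do
        [ -e "${ref}.${ext}" ] || { echo "missing ${ref}.${ext}"; missing=1; }
    done
    # GATK's sequence dictionary replaces the .fa suffix with .dict
    [ -e "${ref%.fa}.dict" ] || { echo "missing ${ref%.fa}.dict"; missing=1; }
    return "$missing"
}
```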
### Map the reads:
> Run from the directory with the read files; the loop will align the reads for each pair of read files, generating a .sam file
Code for Castilleja mapping for sample A_FA23-01 (trial run):
```
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa A_FA23-01.1.fq.gz A_FA23-01.2.fq.gz > A_FA23-01.sam
```
#### Slurm script for batch processing
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=5:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4GB # Adjusted memory based on previous usage
#SBATCH --job-name=bwamem_A_group
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_A_group.out.error
#SBATCH --output=bwamem_A_group.out
module purge
module load bwa
# Run bwa mem for the A group with 8 threads
for i in A*.1.fq.gz; do
NM=${i//.1.fq.gz/} # Extract the base name (e.g., A_FA23-01)
REVERSE=${i//.1.fq.gz/.2.fq.gz} # Replace .1.fq.gz with .2.fq.gz for reverse reads
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
* This took about 15 mins for one sample...
## Evaluating with samtools:
```
samtools flagstat A_FA23-01.sam
25584356 + 0 in total (QC-passed reads + QC-failed reads)
244644 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
17295138 + 0 mapped (67.60% : N/A)
25339712 + 0 paired in sequencing
12669856 + 0 read1
12669856 + 0 read2
16222828 + 0 properly paired (64.02% : N/A)
16906604 + 0 with itself and mate mapped
143890 + 0 singletons (0.57% : N/A)
465940 + 0 with mate mapped to a different chr
127171 + 0 with mate mapped to a different chr (mapQ>=5)
```
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=4 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=2GB ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=bwamem_NB_FA23-01.out ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_NB_FA23-01.out.error # Enter a error log file name
#SBATCH --output=bwamem_NB_FA23-01.out # Enter your out.log file name
module purge
module load bwa ##specify environment
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa NB_FA23-01.1.fq.gz NB_FA23-01.2.fq.gz > NB_FA23-01.sam
```
```
bwa mem -M -t 4 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa PB_FA23-01.1.fq.gz PB_FA23-01.2.fq.gz > PB_FA23-01.sam
```
Let's try the other enzyme combos to see which one has the highest mapping rate for A_FA23-01 (default seed length -k = 19):
| Enzyme | QC-Passed reads | Mapped | Properly paired |
| ------ | --- | -------------------- | -------------------- |
| A | 25,584,356 | 17,295,138 (67.60%) | 16,222,828 (64.02%) |
| NB | 21,854,484 | 19,941,013 (91.24%) | 17,994,898 (83.31%) |
| NM | 32,224,772 | 28,359,324 (88.00%) | 25,619,866 (80.83%) |
| PB | 17,913,790 | 13,954,441 (77.90%) | 13,053,964 (73.29%) |
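The flagstat numbers in the table above can be collected in one pass. This sketch (my own helper; it assumes samtools is on PATH and parses flagstat's usual `NNN + 0 mapped (67.60% : N/A)` line) prints a per-sample mapping rate for every .sam in the current directory:

```shell
mapping_summary() {
    local sam
    for sam in *.sam; do
        [ -e "$sam" ] || continue   # skip if the glob matched nothing
        # Grab the percentage from the first "mapped (...)" line of flagstat
        printf '%s\t%s\n' "$sam" \
            "$(samtools flagstat "$sam" | awk -F'[()]' '/ mapped \(/{split($2,a," "); print a[1]; exit}')"
    done
}
```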
> Note: Good mapping percentages are in the 90s. You can experiment with bwa mem's minimum seed length (-k) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the -k, the fewer exact base pairs need to match the reference in order to seed an alignment, so more reads map but with less confidence. The default is 19 when bwa mem is run without a -k value. It can be a good idea to run bwa mem on one sample with several values of -k (default, 16, 14, 12, 10), evaluate which maps the most reads, and then use the loop for the rest of the samples.
> Note: This took a long time! I increased threads to 8 and ran separate bash scripts for each enzyme group, but it still took 8+ hours!
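The seed-length sweep suggested in the note above might look like this sketch (the helper name is mine; bwa mem's minimum seed length flag is lowercase -k, default 19, and bwa/samtools are assumed on PATH):

```shell
REF=/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa

seed_sweep() {
    local k
    for k in 19 16 14 12 10; do
        # Map one sample at this seed length, then report the flagstat mapped line
        bwa mem -M -t 8 -k "$k" "$REF" A_FA23-01.1.fq.gz A_FA23-01.2.fq.gz > "A_FA23-01.k${k}.sam"
        echo "k=$k  $(samtools flagstat "A_FA23-01.k${k}.sam" | grep -m1 ' mapped (')"
    done
}
```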
Basic loop:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam" ; done
```
Slurm Script I used for a single enzyme group:
:::spoiler bwa mem slurm Script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=short
#SBATCH --time=2:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=bwamem_PB_group
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=bwamem_PB_group.out.error
#SBATCH --output=bwamem_PB_group.out
module purge
module load bwa
# Run bwa mem for the PB group with 8 threads
for i in PB*.1.fq.gz; do
NM=${i//.1.fq.gz/}
REVERSE=${i//.1.fq.gz/.2.fq.gz}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
:::
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view "$i" -b -o "$NM.bam"; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort "$i" -o "$NM.sort.bam"; done
```
As long as your sorted .bam files look good, you can delete the .sam files, since they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
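Before deleting the .sam files, samtools quickcheck can confirm each sorted BAM is intact: it exits non-zero (and with -v names the offending file) if a BAM is truncated or missing its EOF block. A small sketch (helper name is mine):

```shell
check_bams() {
    # quickcheck reads only the headers and EOF marker, so this is fast
    samtools quickcheck -v ./*.sort.bam && echo "all sorted BAMs look ok"
}
```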
:::spoiler sam to bam to sorted bam slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=A_sambamsort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=A_sambamsort.out.error
#SBATCH --output=A_sambamsort.out
module load samtools
# Process one sample at a time for files that start with "A": convert SAM to BAM, sort, and clean up
for i in A*.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM.sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM.sorted.bam"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_bams
mv "$NM.sorted.bam" sorted_bams/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No A*.sam files found."
exit 1
fi
done
echo "All A*.sam files processed and sorted BAM files moved to sorted_bams directory."
```
:::
---
**Evaluating the alignments**
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this template:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
      R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
      I=input.bam \
      O=output.txt
```
:::spoiler Bash script for this called run_picard.sh in the output_qual directory
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/"
# Output directory
OUTPUT_DIR="/projects/p32037/optimization/readfiles_all/output_qual/"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
:::
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags)
[MultiQC report here](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/optimization/readfiles_all/multiqc/multiqc_report.html)

# Stacks: SNP calling
#### Basic loop for ref_map.pl
```
ref_map.pl -o /projects/p32037/optimization/stacks_A --popmap /projects/p32037/optimization/popmap/popmap.txt --samples /projects/p32037/optimization/readfiles_all/sorted_bams/A -X "populations:--fstats --r 0.7 --genepop --smooth --ordered-export --fasta-loci --hwe -t 8"
```
:::spoiler ref_map.pl slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=stacks_A
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A.error
#SBATCH --output=stacks_A.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl -o /projects/p32037/optimization/stacks_A \
--popmap /projects/p32037/optimization/popmap/popmap.txt \
--samples /projects/p32037/optimization/readfiles_all/sorted_bams/A \
-X "populations:--fstats --r 0.7 --genepop --smooth --ordered-export --fasta-loci --hwe -t 8"
```
:::
## GStacks output
```
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > PB_bam_stats_per_sample.tsv
```
> Get gstacks log distribs from output file and move into R
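If you extract bam_stats_per_sample for each enzyme into files named like PB_bam_stats_per_sample.tsv (as above), a sketch like this tags each row with its enzyme so everything can be read into R as one table (the helper name is mine):

```shell
combine_bam_stats() {
    local f
    for f in *_bam_stats_per_sample.tsv; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        # Use the filename prefix (e.g. "PB") as a leading enzyme column
        awk -v e="${f%%_*}" 'BEGIN{OFS="\t"}{print e,$0}' "$f"
    done > bam_stats_all.tsv
}
```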
### Fraction of alignments kept by sample

### Fraction of alignments kept by enzyme

### Fraction of alignments kept by enzyme and lab

### Populations (without kernel smooth on)
> The only way I could get populations to run was to turn off all flags related to kernel smoothing (--smooth, --ordered-export, --smooth-fstats). This also lowered the memory usage.
Parameters: r = 0.80 with --fstats and --vcf only
:::spoiler ref_map.pl script w/out --smooth
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=100GB
#SBATCH --job-name=stacks_A
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A.error
#SBATCH --output=stacks_A.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl --samples /projects/p32037/optimization/readfiles_all/sorted_bams/A --popmap /projects/p32037/optimization/popmap/popmap_A.txt -o /projects/p32037/optimization/stacks_A -T 8 -X "populations: -r 0.8 --fstats --vcf"
```
:::
| Enzyme | Lab | Private | Variant Sites | SNPs | % polymorphic | Obs_het (variant sites) |
| --------- | ----------- | ------- | ------------- | ------ | ------------- | ----------------------- |
| ApeKI | dePamphilis | 26,367 | 69,024 | 55,524 | 0.69936 | 0.21064 |
| ApeKI | Fant | 13,364 | 89,820 | 63,317 | 0.62603 | 0.21064 |
| NsiI/BfaI | dePamphilis | 6,271 | 63,393 | 60,484 | 0.89491 | 0.26274 |
| NsiI/BfaI | Fant | 2,909 | 17,211 | 10,940 | 0.13479 | 0.16774 |
| PstI/BfaI | dePamphilis | 14,595 | 46,543 | 39,452 | 0.79369 | 0.1956 |
| PstI/BfaI | Fant | 7,044 | 39,591 | 24,949 | 0.4769 | 0.17651 |
| NsiI/MspI | dePamphilis | 5,340 | 31,549 | 29,569 | 0.9057 | 0.25902 |
| NsiI/MspI | Fant | 1,969 | 11,135 | 5,784 | 0.14035 | 0.14035 |
# Repeat Masked genome outputs
## Using repeat masked genome to filter reads
#### Process overview:
1. Turn the RM genome into a .bed file for filtering
:::spoiler Make .bed file from RepeatMasked Genome
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=7:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjusted memory based on previous usage
#SBATCH --job-name=make_bed
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=make_bed.out.error
#SBATCH --output=make_bed.out
awk 'BEGIN{FS=""; OFS="\t"}
/^>/ {if (seq) print_masked_regions(); header=$0; seq=""; next}
{seq=seq""$0}
END {print_masked_regions()}
function print_masked_regions() {
start=0; in_mask=0;
for (i=1; i<=length(seq); i++) {
base=substr(seq, i, 1);
if (base ~ /[a-z]/ && !in_mask) {
start=i; in_mask=1;
} else if (base ~ /[A-Z]/ && in_mask) {
print substr(header, 2), start-1, i-1;
in_mask=0;
}
}
}' PO3051_Castilleja_pilosa.RepeatMasked.fasta > masked_regions_capi.bed
```
:::
2. Filter bam files using .bed file
:::spoiler Filtering bam files based on repeat regions in RM genome
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjusted memory based on previous usage
#SBATCH --job-name=intersect
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=intersect_RM.out.error
#SBATCH --output=intersect_RM.out
module load bedtools
# Directory containing your sorted BAM files
bam_dir="/projects/p32037/optimization/readfiles_all/sorted_bams"
# BED file with masked regions
bed_file="/projects/p32037/optimization/reference/RM/masked_regions_capi.bed"
# Output directory for filtered BAM files
output_dir="/projects/p32037/optimization/readfiles_all/sorted_bams/filtered_bams"
mkdir -p "$output_dir" # Create the output directory if it doesn't exist
# Loop through all sorted BAM files in the specified directory
for bam_file in "$bam_dir"/*.sorted.bam; do
# Extract the base name of the BAM file
base_name=$(basename "$bam_file" .sorted.bam)
# Define the output filtered BAM file
output_bam="$output_dir/${base_name}_filtered.bam"
# Run bedtools intersect to filter out reads that overlap with masked regions
echo "Processing $bam_file..."
bedtools intersect -v -a "$bam_file" -b "$bed_file" > "$output_bam"
echo "Filtered BAM file saved to $output_bam"
done
echo "All BAM files processed."
```
:::
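Before running STACKS on the filtered bams, it can be worth checking how many reads the repeat-mask filter removed per sample. A sketch (my own helper; it assumes samtools is on PATH and the directory layout used in the script above):

```shell
filter_counts() {
    local bam base before after
    for bam in sorted_bams/*.sorted.bam; do
        [ -e "$bam" ] || continue
        base=$(basename "$bam" .sorted.bam)
        # samtools view -c counts alignment records in a BAM
        before=$(samtools view -c "$bam")
        after=$(samtools view -c "sorted_bams/filtered_bams/${base}_filtered.bam")
        printf '%s\t%s\t%s\n' "$base" "$before" "$after"
    done
}
```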
3. Run stacks on filtered .bam files
:::spoiler stacks ref_map.pl run example script for RM filtered reads (ApeKI )
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=25GB # Adjusted memory based on previous usage
#SBATCH --job-name=stacks_A_RM
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_A_RM.error
#SBATCH --output=stacks_A_RM.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks
# Run the ref_map.pl command
ref_map.pl --samples /projects/p32037/optimization/readfiles_all/sorted_bams/filtered_bams/A --popmap /projects/p32037/optimization/popmap/popmap_A.txt -o /projects/p32037/optimization/sta$
```
:::
#### gstacks output
| Enzyme | Loci built | Coverage | Mean number of sites per locus | Primary alignments kept | % Primary alignments kept |
| --------- | ---------- | ----------------------------------------------- | ------------------------------ | ---------------------- | ------------------------ |
| NsiI/BfaI | 61,977 | mean=30.8x, stdev=11.1x, min=2.5x, max=40.8x | 164.9 | 16,352,381 | 26.40% |
| PstI/BfaI | 37,763 | mean=104.5x, stdev=21.4x, min=62.9x, max=141.0x | 196.5 | 37,025,447 | 36.80% |
| NsiI/MspI | 39,778 | mean=46.8x, stdev=16.3x, min=4.6x, max=58.7x | 152.9 | 12,945,256 | 11.90% |
| ApeKI | 129,316 | mean=23.1x, stdev=6.1x, min=12.1x, max=31.6x | 160.4 | 23,821,835 | 13.70% |
#### populations output from this run
## Mapping directly to masked genome
# GATK SNP calling
### Get reference genome ready
1. FASTA Indexing with Samtools (done)
2. Create the Sequence Dictionary with GATK (done)
### Mark duplicates in sorted bam files
:::spoiler mark_dups.sh
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=mark_duplicates_A
#SBATCH --output=mark_duplicates_A.out
#SBATCH --error=mark_duplicates_A.err
#SBATCH --time=4:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=5G # Memory allocation
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
module load samtools
# Set the directory containing your sorted BAM files
BAM_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/A"
OUTPUT_DIR="/projects/p32037/optimization/gatk/bams"
# Loop through each BAM file in the directory
for bamfile in ${BAM_DIR}/*.bam; do
# Extract the sample name from the BAM file (without the .bam extension)
sample=$(basename ${bamfile} .bam)
# Set the output file names
marked_bam=${OUTPUT_DIR}/${sample}_marked.bam
metrics_file=${OUTPUT_DIR}/${sample}_marked_dup_metrics.txt
# Mark duplicates
gatk MarkDuplicates \
-I ${bamfile} \
-O ${marked_bam} \
-M ${metrics_file}
# Index the marked BAM file
samtools index ${marked_bam}
echo "Processed ${sample}: duplicates marked and indexed."
done
echo "All BAM files processed."
```
:::
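To compare duplicate rates across samples, the PERCENT_DUPLICATION column can be pulled out of each Picard metrics file. This sketch (my own helper) relies on Picard's usual metrics layout: comment lines, a tab-separated header row, then one value row:

```shell
dup_rates() {
    local f
    for f in *_marked_dup_metrics.txt; do
        [ -e "$f" ] || continue
        awk -v s="${f%_marked_dup_metrics.txt}" '
            /^#/ || /^$/ {next}                              # skip comments and blanks
            !cols { for (i=1;i<=NF;i++)                      # first data line = header:
                        if ($i=="PERCENT_DUPLICATION") col=i # find the column index
                    cols=1; next }
            col   { print s"\t"$col; exit }                  # second data line = values
        ' "$f"
    done
}
```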
### Call SNPs using Haplotype Caller
:::spoiler Haplotype caller
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=haplotype_caller_A
#SBATCH --output=haplotype_caller_A.out
#SBATCH --error=haplotype_caller_A.err
#SBATCH --time=8:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=20G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your marked BAM files
BAM_DIR="/projects/p32037/optimization/gatk/bams"
OUTPUT_DIR="/projects/p32037/optimization/gatk/gvcfs"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Loop through each BAM file in the directory
for bamfile in ${BAM_DIR}/*_marked.bam; do
# Extract the sample name from the BAM file (without the _marked.bam extension)
sample=$(basename ${bamfile} _marked.bam)
# Set the output GVCF file name
gvcf_output=${OUTPUT_DIR}/${sample}.g.vcf.gz
# Run HaplotypeCaller
gatk HaplotypeCaller \
-R ${REFERENCE} \
-I ${bamfile} \
-O ${gvcf_output} \
-ERC GVCF
echo "Processed ${sample}: GVCF generated."
done
echo "All samples processed with HaplotypeCaller."
```
:::
### Combine GVCFs
:::spoiler Joint GVCF
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=combine_gvcfs
#SBATCH --output=combine_gvcfs.out
#SBATCH --error=combine_gvcfs.err
#SBATCH --time=4:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=10G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your GVCF files
GVCF_DIR="/projects/p32037/optimization/gatk/gvcfs"
OUTPUT_DIR="/projects/p32037/optimization/gatk/combined"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Output file for combined GVCFs
combined_gvcf=${OUTPUT_DIR}/combined_samples.g.vcf.gz
# Combine the GVCF files
gatk CombineGVCFs \
-R ${REFERENCE} \
$(for gvcf in ${GVCF_DIR}/*.g.vcf.gz; do echo "-V ${gvcf} "; done) \
-O ${combined_gvcf}
echo "All GVCF files combined."
```
:::
### Genotype GVCFs
:::spoiler Genotype GVCFs
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --job-name=genotype_gvcfs
#SBATCH --output=genotype_gvcfs.out
#SBATCH --error=genotype_gvcfs.err
#SBATCH --time=8:00:00 # Adjust the time as needed
#SBATCH --partition=short # Partition to use
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1
#SBATCH --mem=20G # Adjust memory based on your needs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
# Load necessary modules (adjust as per your environment)
conda activate GATK
module load gatk
# Set the directory containing your combined GVCF
OUTPUT_DIR="/projects/p32037/optimization/gatk/combined"
REFERENCE="/projects/p32037/reference_genomes/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa" # Update with your reference genome path
# Combined GVCF file from previous step
combined_gvcf=${OUTPUT_DIR}/combined_samples.g.vcf.gz
# Output file for final multi-sample VCF
final_vcf=${OUTPUT_DIR}/final_joint_genotyping.vcf.gz
# Run joint genotyping
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ${combined_gvcf} \
-O ${final_vcf}
echo "Joint genotyping completed. Final VCF generated: ${final_vcf}"
```
:::
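A quick sanity check on the joint-genotyped VCF is to count its record lines, i.e. everything that is not a `#` header line (sketch; `vcf_count` is my own helper):

```shell
vcf_count() {
    # Decompress and count non-header lines (one per variant record)
    gzip -cd "${1:?usage: vcf_count file.vcf.gz}" | grep -vc '^#'
}
```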