# Notes + code from reference-aligned Castilleja ddRAD data analysis
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Resources:**
1. Link to STACKS website:
https://catchenlab.life.illinois.edu/stacks/
2. Bioinformatics guide by Eric C. Anderson
https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to-guide
https://speciationgenomics.github.io/mapping_reference/
4. Slurm script how-to from NU https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
5. Globus - file management
https://www.globus.org/
------
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. mapping ddRADseq reads to reference genome with bwa + samtools
3. STACKS: gstacks
4. STACKS: populations
5. filtering with VCFtools and PLINK
----
# Process radtags
1. `process_radtags`
* Notes: truncated to 100 bp, used inline_null, allowed 2 adapter mismatches, filtered reads for quality and uncalled bases
```
process_radtags -1 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R1_001.fastq.gz -2 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R2_001.fastq.gz -i gzfastq -b /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt -o /projects/p32037/caco_mn/radtags_out_P2 --inline_null --renz_1 NsiI --renz_2 BfaI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Adapter sequences from UWBC (TruSeq kit):
```
Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
```
**I made a new directory for each plate: "radtags_out_P1" and "radtags_out_P2"**
#### Slurm script I used for Castilleja radtags:
* This script was originally written for ApeKI; the command below uses the NsiI + BfaI combination (see the `--renz` flags)
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=50GB
#SBATCH --job-name=CACO_radtags_P2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags_P2.error
#SBATCH --output=CACO_radtags_P2.out
module purge
module load stacks/2.68-gcc-12.3.0 ##specify environment to stacks version 2.68
process_radtags -1 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R1_001.fastq.gz -2 /projects/p32037/caco_mn/raw_reads/FANT-11-15-2024-PLATE-2_S2_L004_R2_001.fastq.gz -i gzfastq -b /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt -o /projects/p32037/caco_mn/radtags_out_P2 --inline_null --renz_1 NsiI --renz_2 BfaI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 --threads 10
```
Ran once per plate (one enzyme combination per plate); results below:
```
sbatch radtags_P1.sh
sbatch radtags_P2.sh
```
| Plate | Total sequences | Reads with adapter | Barcode not found | Low quality read drops | RAD cutsite not found drops | Retained reads |
| ------------------ | ----------- | ---------------- | ----------------- | ---------------------- | --------------------------- | ----------------- |
| Plate 1 | 1,327,435,862 | 46,852 (0.0%) | 38,002,790 (2.9%) | 3,268,685 (0.2%) | 21,801,131 (1.6%) | 1,264,316,404 (95.2%) |
| Plate 2 | 1,111,231,964 | 36,885 (0.0%) | 30,640,340 (2.8%) | 2,615,228 (0.2%) |17,226,720 (1.6%) | 1,060,712,791 (95.5%) |
Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample, and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
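The log is organized into named sections that `stacks-dist-extract` can pull out directly; a sketch, assuming the per-sample section is called `per_barcode_raw_read_counts` as in recent Stacks releases (running `stacks-dist-extract` on the log with no section argument lists what is available):
```
# Pull the per-sample read counts out of the log
stacks-dist-extract process_radtags.log per_barcode_raw_read_counts | less
```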
STACKS outputs four files per sample when using paired-end reads:
```
sample_name.1.fq.gz
sample_name.2.fq.gz
sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
```
We only use the files without `.rem.`; the `.rem.` files can be deleted:
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
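A quick sanity check that every barcode yielded a sample file (paths as above):
```
# One forward read file per sample -- should match the number of barcode lines
ls *.1.fq.gz | wc -l
wc -l < /projects/p32037/caco_mn/raw_reads/MN_barcodes_P2.txt
```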
**Removed .rem files and moved all files into a single directory in "caco_mn" called "readfiles_all"**
I was curious about the raw read quality of the demultiplexed sequences, so I ran FastQC on the read files:
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/optimization/readfiles_all/fastqc -t 6 *.fq.gz
```
Hard to review each file one by one... so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once:
```
[ewf7555@quser32 fastqc]$ module load multiqc
[ewf7555@quser32 fastqc]$ multiqc .
# this searches the current directory tree and aggregates the FastQC reports it finds
```
[MultiQC](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/caco_mn/readfiles_all/fastqc/multiqc_report_1.html) report of raw sequence data of Plate 1 and 2
# Mapping reads to a reference genome
### Preparing the reference:
1. Make a new directory for your reference genome files.
The reference needs to be uncompressed for this analysis.
```
#make new directory
mkdir ./reference
#move reference to this directory
mv Castillea_coccinea.hifiasm.bp.p_ctg.fa ./reference/
```
2. Index the reference genome with bwa + samtools + GATK:
```
#BWA indexing for general alignment
bwa index Castillea_coccinea.hifiasm.bp.p_ctg.fa
#create the FASTA index file
samtools faidx Castillea_coccinea.hifiasm.bp.p_ctg.fa
#create the sequence dictionary (in case you want to use GATK)
conda activate gatk
gatk CreateSequenceDictionary -R Castillea_coccinea.hifiasm.bp.p_ctg.fa -O /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.dict
```
> This will create a few other files with the same basename. They all need to be in the same directory with the genome file in order for mapping to work.
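For reference, a minimal check that all the index files landed next to the FASTA; the exact set below is what I'd expect from the three indexing commands above:
```
ls reference/
# Castillea_coccinea.hifiasm.bp.p_ctg.dict      (GATK sequence dictionary)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.amb    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.ann    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.bwt    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.fai    (samtools faidx)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.pac    (bwa index)
# Castillea_coccinea.hifiasm.bp.p_ctg.fa.sa     (bwa index)
```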
### Map the reads:
> Run from the directory with the read files. The loop further below aligns each pair of read files, generating a .sam file per sample.
Code for Castilleja mapping for sample CUP1 (trial run):
```
bwa mem -M -t 4 /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa CUP1.1.fq.gz CUP1.2.fq.gz > CUP1.sam
```
## Evaluating with samtools:
```
samtools flagstat CUP1.sam
12265267 + 0 in total (QC-passed reads + QC-failed reads)
110941 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
12159958 + 0 mapped (99.14% : N/A)
12154326 + 0 paired in sequencing
6077163 + 0 read1
6077163 + 0 read2
11670300 + 0 properly paired (96.02% : N/A)
12035334 + 0 with itself and mate mapped
13683 + 0 singletons (0.11% : N/A)
345608 + 0 with mate mapped to a different chr
65326 + 0 with mate mapped to a different chr (mapQ>=5)
```
> Note: Good mapping percentages are in the 90s. You can experiment with the `-k` value of bwa mem (minimum seed length; the default is 19 when bwa mem is run without `-k`) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the `-k`, the fewer exact base pairs need to match the reference in order to seed the mapping of a read. It can be a good idea to run bwa mem on one sample with several values of `-k` (default 19; 16; 14; 12; 10), evaluate which maps the most reads, and then use that value in the loop for the rest of the samples.
> Note: For this sample, the default `-k 19` works great.
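A sketch of that one-sample comparison (assumes bwa and samtools are loaded; the intermediate `.sam` names exist only for the test):
```
# Try several minimum seed lengths on one sample and compare mapping rates
REF=/projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa
for k in 19 16 14 12 10; do
    bwa mem -M -t 4 -k $k "$REF" CUP1.1.fq.gz CUP1.2.fq.gz > CUP1.k${k}.sam
    echo "== -k $k =="
    samtools flagstat CUP1.k${k}.sam | grep 'mapped ('
done
```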
Basic loop:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 8 /projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa $i $REVERSE > $NM.sam ; done
```
Slurm Script I used for a single population group
> Note: this took about 8-10 hours per pop, using ~11 GB memory
:::spoiler bwa mem slurm Script
#### Slurm script for batch processing
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8 # Test with 8 threads
#SBATCH --mem=20GB # Adjusted memory based on previous usage
#SBATCH --job-name=CUP_bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CUP_bwamem_group.out.error
#SBATCH --output=CUP_bwamem_group.out
module purge
module load bwa
# Run bwa mem with 8 threads
for i in CUP*.1.fq.gz; do
NM=${i//.1.fq.gz/}  # Extract the base name (e.g., CUP1)
REVERSE=${i//.1.fq.gz/.2.fq.gz} # Replace .1.fq.gz with .2.fq.gz for reverse reads
bwa mem -M -t 8 /projects/p32037/caco_mn/reference/Castillea_coccinea.hifiasm.bp.p_ctg.fa "$i" "$REVERSE" > "${NM}.sam"
done
```
:::
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done
```
As long as your sorted .bam files look good, you can delete the .sam files; they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
:::spoiler sam to bam to sorted bam slurm script (with samtools flagstat)
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5GB # Adjust memory based on previous usage
#SBATCH --job-name=Fcaco_sambamsort
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=Fcaco_sambamsort.out.error
#SBATCH --output=Fcaco_sambamsort.out
module load samtools
# Process one sample at a time: convert SAM to BAM, sort, and clean up
for i in FUG*.sam; do
# Check if the file exists before proceeding
if [[ -e $i ]]; then
NM=${i%.sam} # Remove the .sam extension
# Convert the .sam file to .bam
echo "Converting $i to BAM format..."
samtools view -b "$i" -o "$NM.bam"
# Check if conversion was successful
if [[ $? -eq 0 ]]; then
echo "$i successfully converted to $NM.bam"
# Sort the .bam file
echo "Sorting $NM.bam..."
samtools sort "$NM.bam" -o "$NM.sorted.bam"
# Check if sorting was successful
if [[ $? -eq 0 ]]; then
echo "$NM.bam successfully sorted into $NM.sorted.bam"
# Generate flagstat report
echo "Generating flagstat report for $NM.sorted.bam..."
samtools flagstat "$NM.sorted.bam" > "$NM.flagstat.txt"
echo "Flagstat report saved to $NM.flagstat.txt"
# Delete the original .sam file
echo "Deleting $i to free up space..."
rm "$i"
# Move the sorted BAM file to the sorted_bams folder
mkdir -p sorted_caco_bams
mv "$NM.sorted.bam" sorted_caco_bams/
mv "$NM.flagstat.txt" sorted_caco_bams/
else
echo "Error sorting $NM.bam" >&2
exit 1
fi
else
echo "Error converting $i" >&2
exit 1
fi
else
echo "No *.sam files found."
exit 1
fi
done
echo "All *.sam files processed. Sorted BAM files and flagstat reports moved to sorted_bams directory."
```
:::
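Optionally, index the sorted BAMs; the Stacks steps below don't appear to need the index, but it makes spot-checking regions with `samtools view`/`tview` much faster (directory and suffix match the script above):
```
# Index each sorted BAM so specific regions can be inspected later
module load samtools
for i in sorted_caco_bams/*.sorted.bam; do samtools index "$i"; done
```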
---
## **Evaluating the alignments**
### **1. samtools flagstat**
:::spoiler Script for collecting **samtools flagstat data** from .txt files
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=01:00:00 # Adjust based on expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB # Adjust memory based on file sizes
#SBATCH --job-name=process_flagstat
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=process_flagstat.out.error
#SBATCH --output=process_flagstat.out
# Set the directory containing the .flagstat.txt files
FLAGSTAT_DIR="/projects/p32037/caco_mn/readfiles_all/sorted_caco_bams"
# Create a summary file
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"
# Loop through all .flagstat.txt files
for file in "$FLAGSTAT_DIR"/*.flagstat.txt; do
if [[ -e "$file" ]]; then
# Extract the sample name (remove directory and extension)
SAMPLE=$(basename "$file" .flagstat.txt)
# Extract relevant information
QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')
# Append the data to the summary file
echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
else
echo "No .flagstat.txt files found in $FLAGSTAT_DIR."
exit 1
fi
done
echo "Summary of all .flagstat.txt files saved to $SUMMARY_FILE."
```
:::
### 2. CollectAlignmentSummaryMetrics from Picard tools
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this generic Picard command:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
      R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
I=input.bam \
O=output.txt
```
:::spoiler Bash script for this called run_picard.sh in the output_qual directory
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/optimization/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/optimization/readfiles_all/sorted_bams/"
# Output directory
OUTPUT_DIR="/projects/p32037/optimization/readfiles_all/output_qual/"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
:::
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags).
[MultiQC report here](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/picard/multiqc_report_1.html)

# Stacks: SNP calling
#### Basic command for ref_map.pl
```
ref_map.pl -o /projects/p32037/caco_mn/stacks_r8 \
--popmap /projects/p32037/caco_mn/popmap/popmap_all.txt \
--samples /projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/ \
-X "populations: -r 0.7 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf"
```
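For reference, `popmap_all.txt` follows the standard Stacks popmap format: one tab-separated line per sample, sample prefix then population label. The entries below are placeholders, not the real sample list:
```
# <sample prefix><TAB><population label>
CUP1	CUP
CUP2	CUP
FUG1	FUG
SPK5	SPK
```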
:::spoiler ref_map.pl slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=long
#SBATCH --time=24:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1 # threads
#SBATCH --mem=50GB #
#SBATCH --job-name=stacks_r8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=stacks_r8.error
#SBATCH --output=stacks_r8.out
# Load any necessary modules (e.g., stacks, samtools)
module load stacks/2.68-gcc-12.3.0
# Run the ref_map.pl command
ref_map.pl -o /projects/p32037/caco_mn/stacks_r8 \
--popmap /projects/p32037/caco_mn/popmap/popmap_all.txt \
--samples /projects/p32037/caco_mn/readfiles_all/sorted_caco_bams/ \
-X "populations: -r 0.7 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf"
```
:::
## GStacks output
```
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > r70_bam_stats_per_sample.tsv
```
> Pull the per-sample BAM stats out of the gstacks log distribs file, then move the table into R for plotting.
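Before plotting, it can help to confirm the column layout; a sketch (the kept-fraction column index is an assumption, so check it against the commented header lines):
```
# The leading '#' lines hold the column headers for this section
stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | head -n 3
# Sort samples by the (assumed) kept-alignment fraction in column 4
sort -t$'\t' -k4,4g r70_bam_stats_per_sample.tsv | head
```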
### Fraction of alignments kept by sample

### Fraction of alignments kept by population

## Populations
Example script with parameters: `-r 0.8 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf`
:::spoiler populations slurm script
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00 # Adjust based on your expected runtime
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=25GB # Adjusted memory based on previous usage
#SBATCH --job-name=populations_r8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=populations_r80.error
#SBATCH --output=populations_r80.out
module load stacks
populations -P /projects/p32037/caco_mn/stacks_r8/ -M /projects/p32037/caco_mn/popmap/popmap_all.txt -t 4 -r 0.8 --fstats --smooth --smooth-fstats --ordered-export --hwe --genepop --vcf
```
:::
#### Ran populations with -r = 0.7, 0.8, 0.9 to test SNP recovery
| r | Loci retained | Total sites | Fixed sites (filtered out) | Variant sites retained | Mean sites per locus (bp) | % sites covered by multiple loci |
| --- | ------------- | ----------- | --- | --- | --- | --- |
| 0.7 | 99,304 | 20,055,663 | 3,495,534 | 319,749 | 167.04 | 1.8% |
| 0.8 | 87,506 | 17,687,927 | 3,145,076 | 267,076 | 166.54 | 1.3% |
| 0.9 | 73,628 | 14,896,403 | 2,722,026 | 199,623 | 165.81 | 0.8% |
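A loop makes the threshold comparison reproducible; a sketch, assuming one output directory per value (the directory names are hypothetical; `-O` is populations' out-path option):
```
# Re-run populations at several -r thresholds, writing each to its own directory
for r in 0.7 0.8 0.9; do
    OUT=/projects/p32037/caco_mn/populations_r${r}
    mkdir -p "$OUT"
    populations -P /projects/p32037/caco_mn/stacks_r8/ \
        -M /projects/p32037/caco_mn/popmap/popmap_all.txt \
        -t 4 -r $r -O "$OUT" --vcf
done
```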
# vcftools introduction
Dylan's example code for filtering the populations.snps.vcf file:
```
vcftools --vcf input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out renamed.output
```
### VCF Parameters:
**Depth:**
```
--min-meanDP <float>
--max-meanDP <float>
```
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.
**Hardy-Weinberg Equilibrium:**
```
--hwe <float>
```
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.
**Missing data:**
```
--max-missing <float>
```
Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
```
--max-missing-count <integer>
```
Exclude sites with more than this number of missing genotypes over all individuals.
**Phased:**
```
--phased
```
Excludes all sites that contain unphased genotypes.
**Miscellaneous filtering:**
```
--minQ <float>
```
Includes only sites with a Quality value above this threshold.
### Example code
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.80 --recode --out pop_r8_miss80_mac3_DP5
```
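To see what a given filter combination kept, count the non-header lines in the recoded VCF and generate a per-individual missingness report (`--missing-indv` writes `<prefix>.imiss`; the output prefix below just reuses the filter name):
```
# Sites retained after filtering (non-header lines in the recoded VCF)
grep -vc '^#' pop_r8_miss80_mac3_DP5.recode.vcf
# Per-individual missingness; see the F_MISS column of the .imiss output
vcftools --vcf pop_r8_miss80_mac3_DP5.recode.vcf --missing-indv --out pop_r8_miss80_mac3_DP5
```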
## VCF filtering results

## SNPFiltR Filtering results
Missing by sample distribution


Looks like only two samples (SPK5 and SPK6) have a really high proportion of missing data; see the removal sketch below.
* Mean depth across dataset = 25.5
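If SPK5 and SPK6 end up getting dropped, vcftools can exclude them by name; a sketch (the output prefix is hypothetical):
```
# Remove the two high-missingness individuals before downstream filtering
vcftools --vcf pop_r8_miss80_mac3_DP5.recode.vcf \
    --remove-indv SPK5 --remove-indv SPK6 \
    --recode --out pop_r8_miss80_mac3_DP5_noSPK
```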
## Results of filtering with SNPFiltR and SNPRelate
