GT-Seq Panel Design

--- # GT-Seq Panel Design Workflow This document outlines a pipeline for designing a GT-seq panel using reference-guided RAD-seq data. It follows the full process from SNP discovery to primer validation. ## Background Panel design begins with the reference-guided RAD-seq analysis workflow, where SNP discovery is constrained to loci with sufficient and continuous coverage (≥134 bp). The initial coverage estimates and filtering steps are critical for defining target regions. ## Step 1: SNP Discovery ### Inputs * Aligned BAM files * Regions file (ANGSD-style): continuous 134+ bp coverage regions * Coverage stats file for depth filtering * Needed for an need an estimate of locus-specific coverage to remove loci above and below a specific threshold (particular to your dataset) > This is also generated in the reference-guided RAD analysis workflow ### Step 1a: Call SNPs **Script:** `angsd_covtargets.sh` ` Script needed` Requires: * BAM list text file * Regions file in format: ``` SUPER_9:99709714-99710394 SUPER_9:99711983-99712365 ... ``` * ANGSD identifies SNPs in specified regions with locus coverage between 1–15x (dataset-specific threshold). Coverage summary: ``` locus.mean locus.sd Min. : 0.3553 Min. : 0.5123 Median : 1.5455 Median : 1.7558 Max. :455.3050 Max. :270.9477 ``` SNPs are retained only if they occur in regions: * ≥134 bp in length * With 1–15x depth Loci were filtered based on sequencing depth to exclude low-confidence and over-amplified regions. Specifically, loci with coverage below 1× or above 15× were removed. The final target region list for ANGSD retained only loci with continuous coverage of at least 134 bp and coverage between 1× and 15×, ensuring consistent and reliable representation across regions. ## Step 2: Identify Microhaplotype Regions ### Goal To improve computational efficiency, the genome was first divided into regions of interest. SNP density was then calculated using sliding windows across the genome. Windows containing between 2 and 5 SNPs were retained as candidate microhaplotype regions based on the desired window size. ### Requirements * Install `bedops` * Generate `.bed` from `.mafs`: ```bash awk 'BEGIN { OFS="\t" }{print $1, $2}' input.mafs > mafs.bed ``` Generate a bed file from your SNP calls. Because we used ANGSD to id SNPs, it didn't generate a VCF file, but a MAF file (minor allele frequency file) which looks like this: ``` chromo position major minor unknownEM nInd scaffold_116_arrow_ctg1 16522 G A 0.053692 61 scaffold_116_arrow_ctg1 16588 G A 0.460261 77 scaffold_116_arrow_ctg1 16610 C A 0.449404 77 scaffold_116_arrow_ctg1 16611 A G 0.451667 77 scaffold_72_arrow_ctg1 368 C T 0.073181 81 scaffold_72_arrow_ctg1 491 G A 0.459636 83 ``` ### Chromosome Extent Now identify the extent of each chromosome that contains snps, based on your bed file with SNP positions: > you can do this in R on our cluster ``` module load r-rocker-ml-verse/4.2.3+apptainer library("tidyverse") mafs<-read.table("mafs.bed",header=TRUE) chrbeg<-mafs %>% group_by(chromo) %>% slice_min(order_by = position) chrend<-mafs %>% group_by(chromo) %>% slice_max(order_by = position) chextent<-left_join(chrbeg,chrend,by="chromo") chextent<-chextent[,c(1,2,7)] write.table(chextent,"chromextent.bed",sep="\t",quote = FALSE, col.names = FALSE, row.names = FALSE) ``` ### Filter by SNP Density filter in R to keep regions with 2-5 SNPs > Don't want more than that as can indicate possible sequencing errors or hypervariable regions ``` module load r-rocker-ml-verse/4.2.3+apptainer #pull in snp density for filtering loci library(tidyverse) snpdens<-read.table("snpdensity.bed", sep = '\t', header=FALSE) names(snpdens)<-c("chrom","start","stop","snps") microhaps<-snpdens %>% dplyr::filter(snps>2) %>% dplyr:filter(snps<5) #this filters down to 300bp regions that have between 2 and 5 snps in them. sum(microhaps$snps) #still leaves 94375 SNPs, and this iw without phased data so not sure if these microhaplotype regions actually have more than two haplotypes write.table(microhaps[,1:3],"brazil_microhaps.bed",sep="\t",quote = FALSE, col.names = FALSE, row.names = FALSE) ``` ### BEDOPS Windowing The `bedops` step divides the input BED file (representing chromosome extents of interest) into 300 bp windows with a 10 bp overlap between adjacent windows. Since BED format is zero-based, the start positions are adjusted using $2 - 1 to ensure correct window boundaries. Repeat for multiple populations and intersect: ```bash intersectBed -a pop1.bed -b pop2.bed -wao -f 0.8 > microhap_overlap.bed awk '$7 > 0' microhap_overlap.bed > microhap_overlap.filtered.bed ``` Extract overlapping SNPs: ```bash intersectBed -wa -wb -a overlap.filtered.bed -b snps.bed > filteredregionsnps.bed awk -vOFS="\t" '{ print $4, $6 }' filteredregionsnps.bed > snp_filter_list.bed ``` The `HI_overlappingmicrohapregions.bed` file contains 300 bp regions with 2–5 SNPs shared between two populations. The `HI_snps.bed` file includes all SNPs identified in one population, originally from a .maf file and converted to BED format (a VCF could also be used for this step). To track which SNPs fall within each region, use `-wa -wb `with bedtools intersect so the output retains the original entries from file B (`HI_snps.bed`). This is useful for identifying which SNPs overlap each region. > Note: The rightmost three columns of the output list the SNPs overlapping each region. Remember that the actual SNP position is given in the last column, since BED format uses 0-based start and 1-based end coordinates. ``` awk -vOFS="\t" '{ print $4, $6; }' HIfilteredregionsnps.bed >HIfilteredregionsnps.2.bed #just get the columns with chromo and SNP for use in filtering maf file ``` ### Filter MAF File ``` module load r-rocker-ml-verse/4.2.3+apptainer library(dplyr) allmafs<-read.table("covrestrict_0.05_IND48.mafs",header=TRUE, sep='\t') filtsnps<-read.table("HIfilteredregionsnps.2.bed",header=FALE) filtsnps$name<-paste(filtsnps$V1, filtsnps$V2, sep="_") #make combined name for searching allmafs$name<-paste(allmafs$chromo, allmafs$position, sep="_") filt_mafs<-allmafs %>% filter(name %in% filtsnps$name) #filter to keep just the rows that match filtered mcrohaplotype region names write.table(filt_mafs[,1:6],"HI_filteredSNPs.mafs"sep="\t",quote = FALSE, col.names = TRUE, row.names = FALSE)) ``` Confirm the line count to verify the number of SNPs (e.g., 4,315 SNPs currently identified for the population of interest). Download the SNP BED file and import it into R for downstream analysis, such as running CKMR. ## Step 3: Simulate Kinship Power with CKMRsim Using ckmrsim for this, John's script [here](https://github.com/MolEcolConsLab/RAD_SNP_filtering/blob/main/Kinship_UPDATED.Rmd) modified to use input maf file: > Need to filter before this to get down to less than 10k SNPs ### Simulate with Filtered MAF File Using a filtered SNP file containing 4,315 SNPs (shared between the focal and reference populations), analyses were run without accounting for physical linkage. Simulated kinship relationships included: * U = unrelated * PO = parent-offspring * HS = half-sibling * FS = full-sibling * HAN = half aunt-niece Error rate thresholds were estimated based on recommendations from the CKMR vignette, which suggests setting the false positive rate (FPR) to ~10–100 times lower than the reciprocal of the number of pairwise comparisons. For n = 87 individuals, the resulting FPR threshold was 1.32e-05. To distinguish half-siblings from unrelated pairs, a log-likelihood cutoff of 60 yielded a false negative rate of 0 and a very low FPR, indicating strong discriminatory power for this relationship class. Error rates for distinguishing parent-offspring vs. unrelated were not formally estimated, as the distributions were clearly separated. Marker filtering criteria: * Regions with 2–5 SNPs within 170 bp * First SNP must start after position 20 within the window (to avoid primer-binding sites) This resulted in a final marker set composed primarily of regions containing multiple SNPs. The majority of these regions were multi-allelic, making them well-suited for downstream analyses. ### Example Simulations suggest a log-likelihood value of 60 distinguishes unrelated vs half-sib relationships with minimal FPR and FNR. ### Final Loci Selection Filtered by: * 2–5 SNPs within 170 bp * Exclude SNPs in first 20 bp of region ## Step 4: Primer Design & Validation ### Primer3 & Thermodynamics * Use BatchPrimer3 and IDT OligoAnalyzer * ΔG should be > -9 for both primers ### BLAST for Amplicon Similarity ``` makeblastdb -in targetamps11.21.fa -dbtype nucl -out targetamps11.21_db blastn -db targetamps11.21_db -query targetamps11.21.fa -out ampcompare.txt -outfmt 6 awk '$3 <100' ampcompare.txt # look at matches that weren't 100% which were the amplicons matching to themselves ``` Amplicons with high sequence similarity to others were flagged for review. For overlapping or near-duplicate regions, retain the best-quality representative based on design considerations (e.g., coverage, SNP content, or primer position). Only a small number of loci were removed during this filtering step. Remove redundant amplicons. ### Check for Primer Off-Target Matches * Run blastn-short for each primer Drop: * amplicon similarity * multi-mapping * dual primer hits ### Final QC with Samtools ```bash samtools coverage ... ``` Drop loci with coverage > 1.5× mean ## Summary 1. Use ANGSD to call SNPs from regions with continuous 134+ bp coverage 2. Identify 300 bp windows with 2–5 SNPs (putative microhaplotypes) 3. Intersect across populations to find shared regions 4. Simulate power to detect kinship with CKMRsim 5. Design primers and validate using BLAST and thermodynamic rules 6. Finalize loci based on uniqueness, GC content, coverage depth, and amplicon spacing **Result:** High-confidence, non-redundant loci suitable for GT-seq panel development *Authored by Jamie Adkins*:[ GT-Seq Panel Design](https://hackmd.io/@jlastoll/r11Uvxs96) *Last updated by Mikayla Newbrey*