---
# GT-Seq Panel Design Workflow
This document outlines a pipeline for designing a GT-seq panel using reference-guided RAD-seq data. It follows the full process from SNP discovery to primer validation.
## Background
Panel design begins with the reference-guided RAD-seq analysis workflow, where SNP discovery is constrained to loci with sufficient and continuous coverage (≥134 bp). The initial coverage estimates and filtering steps are critical for defining target regions.
## Step 1: SNP Discovery
### Inputs
* Aligned BAM files
* Regions file (ANGSD-style): continuous 134+ bp coverage regions
* Coverage stats file for depth filtering
* Needed for an need an estimate of locus-specific coverage to remove loci above and below a specific threshold (particular to your dataset)
> This is also generated in the reference-guided RAD analysis workflow
### Step 1a: Call SNPs
**Script:** `angsd_covtargets.sh`
` Script needed`
Requires:
* BAM list text file
* Regions file in format:
```
SUPER_9:99709714-99710394
SUPER_9:99711983-99712365
...
```
* ANGSD identifies SNPs in specified regions with locus coverage between 1–15x (dataset-specific threshold).
Coverage summary:
```
locus.mean locus.sd
Min. : 0.3553 Min. : 0.5123
Median : 1.5455 Median : 1.7558
Max. :455.3050 Max. :270.9477
```
SNPs are retained only if they occur in regions:
* ≥134 bp in length
* With 1–15x depth
Loci were filtered based on sequencing depth to exclude low-confidence and over-amplified regions. Specifically, loci with coverage below 1× or above 15× were removed. The final target region list for ANGSD retained only loci with continuous coverage of at least 134 bp and coverage between 1× and 15×, ensuring consistent and reliable representation across regions.
## Step 2: Identify Microhaplotype Regions
### Goal
To improve computational efficiency, the genome was first divided into regions of interest. SNP density was then calculated using sliding windows across the genome. Windows containing between 2 and 5 SNPs were retained as candidate microhaplotype regions based on the desired window size.
### Requirements
* Install `bedops`
* Generate `.bed` from `.mafs`:
```bash
awk 'BEGIN { OFS="\t" }{print $1, $2}' input.mafs > mafs.bed
```
Generate a bed file from your SNP calls. Because we used ANGSD to id SNPs, it didn't generate a VCF file, but a MAF file (minor allele frequency file) which looks like this:
```
chromo position major minor unknownEM nInd
scaffold_116_arrow_ctg1 16522 G A 0.053692 61
scaffold_116_arrow_ctg1 16588 G A 0.460261 77
scaffold_116_arrow_ctg1 16610 C A 0.449404 77
scaffold_116_arrow_ctg1 16611 A G 0.451667 77
scaffold_72_arrow_ctg1 368 C T 0.073181 81
scaffold_72_arrow_ctg1 491 G A 0.459636 83
```
### Chromosome Extent
Now identify the extent of each chromosome that contains snps, based on your bed file with SNP positions:
> you can do this in R on our cluster
```
module load r-rocker-ml-verse/4.2.3+apptainer
library("tidyverse")
mafs<-read.table("mafs.bed",header=TRUE)
chrbeg<-mafs %>% group_by(chromo) %>%
slice_min(order_by = position)
chrend<-mafs %>% group_by(chromo) %>%
slice_max(order_by = position)
chextent<-left_join(chrbeg,chrend,by="chromo")
chextent<-chextent[,c(1,2,7)]
write.table(chextent,"chromextent.bed",sep="\t",quote = FALSE, col.names = FALSE, row.names = FALSE)
```
### Filter by SNP Density
filter in R to keep regions with 2-5 SNPs
> Don't want more than that as can indicate possible sequencing errors or hypervariable regions
```
module load r-rocker-ml-verse/4.2.3+apptainer
#pull in snp density for filtering loci
library(tidyverse)
snpdens<-read.table("snpdensity.bed", sep = '\t', header=FALSE)
names(snpdens)<-c("chrom","start","stop","snps")
microhaps<-snpdens %>%
dplyr::filter(snps>2) %>%
dplyr:filter(snps<5) #this filters down to 300bp regions that have between 2 and 5 snps in them.
sum(microhaps$snps) #still leaves 94375 SNPs, and this iw without phased data so not sure if these microhaplotype regions actually have more than two haplotypes
write.table(microhaps[,1:3],"brazil_microhaps.bed",sep="\t",quote = FALSE, col.names = FALSE, row.names = FALSE)
```
### BEDOPS Windowing
The `bedops` step divides the input BED file (representing chromosome extents of interest) into 300 bp windows with a 10 bp overlap between adjacent windows. Since BED format is zero-based, the start positions are adjusted using $2 - 1 to ensure correct window boundaries.
Repeat for multiple populations and intersect:
```bash
intersectBed -a pop1.bed -b pop2.bed -wao -f 0.8 > microhap_overlap.bed
awk '$7 > 0' microhap_overlap.bed > microhap_overlap.filtered.bed
```
Extract overlapping SNPs:
```bash
intersectBed -wa -wb -a overlap.filtered.bed -b snps.bed > filteredregionsnps.bed
awk -vOFS="\t" '{ print $4, $6 }' filteredregionsnps.bed > snp_filter_list.bed
```
The `HI_overlappingmicrohapregions.bed` file contains 300 bp regions with 2–5 SNPs shared between two populations. The `HI_snps.bed` file includes all SNPs identified in one population, originally from a .maf file and converted to BED format (a VCF could also be used for this step).
To track which SNPs fall within each region, use `-wa -wb `with bedtools intersect so the output retains the original entries from file B (`HI_snps.bed`). This is useful for identifying which SNPs overlap each region.
> Note: The rightmost three columns of the output list the SNPs overlapping each region. Remember that the actual SNP position is given in the last column, since BED format uses 0-based start and 1-based end coordinates.
```
awk -vOFS="\t" '{ print $4, $6; }' HIfilteredregionsnps.bed >HIfilteredregionsnps.2.bed #just get the columns with chromo and SNP for use in filtering maf file
```
### Filter MAF File
```
module load r-rocker-ml-verse/4.2.3+apptainer
library(dplyr)
allmafs<-read.table("covrestrict_0.05_IND48.mafs",header=TRUE, sep='\t')
filtsnps<-read.table("HIfilteredregionsnps.2.bed",header=FALE)
filtsnps$name<-paste(filtsnps$V1, filtsnps$V2, sep="_") #make combined name for searching
allmafs$name<-paste(allmafs$chromo, allmafs$position, sep="_")
filt_mafs<-allmafs %>% filter(name %in% filtsnps$name) #filter to keep just the rows that match filtered mcrohaplotype region names
write.table(filt_mafs[,1:6],"HI_filteredSNPs.mafs"sep="\t",quote = FALSE, col.names = TRUE, row.names = FALSE))
```
Confirm the line count to verify the number of SNPs (e.g., 4,315 SNPs currently identified for the population of interest). Download the SNP BED file and import it into R for downstream analysis, such as running CKMR.
## Step 3: Simulate Kinship Power with CKMRsim
Using ckmrsim for this, John's script [here](https://github.com/MolEcolConsLab/RAD_SNP_filtering/blob/main/Kinship_UPDATED.Rmd) modified to use input maf file:
> Need to filter before this to get down to less than 10k SNPs
### Simulate with Filtered MAF File
Using a filtered SNP file containing 4,315 SNPs (shared between the focal and reference populations), analyses were run without accounting for physical linkage. Simulated kinship relationships included:
* U = unrelated
* PO = parent-offspring
* HS = half-sibling
* FS = full-sibling
* HAN = half aunt-niece
Error rate thresholds were estimated based on recommendations from the CKMR vignette, which suggests setting the false positive rate (FPR) to ~10–100 times lower than the reciprocal of the number of pairwise comparisons.
For n = 87 individuals, the resulting FPR threshold was 1.32e-05.
To distinguish half-siblings from unrelated pairs, a log-likelihood cutoff of 60 yielded a false negative rate of 0 and a very low FPR, indicating strong discriminatory power for this relationship class. Error rates for distinguishing parent-offspring vs. unrelated were not formally estimated, as the distributions were clearly separated.
Marker filtering criteria:
* Regions with 2–5 SNPs within 170 bp
* First SNP must start after position 20 within the window (to avoid primer-binding sites)
This resulted in a final marker set composed primarily of regions containing multiple SNPs. The majority of these regions were multi-allelic, making them well-suited for downstream analyses.
### Example
Simulations suggest a log-likelihood value of 60 distinguishes unrelated vs half-sib relationships with minimal FPR and FNR.
### Final Loci Selection
Filtered by:
* 2–5 SNPs within 170 bp
* Exclude SNPs in first 20 bp of region
## Step 4: Primer Design & Validation
### Primer3 & Thermodynamics
* Use BatchPrimer3 and IDT OligoAnalyzer
* ΔG should be > -9 for both primers
### BLAST for Amplicon Similarity
```
makeblastdb -in targetamps11.21.fa -dbtype nucl -out targetamps11.21_db
blastn -db targetamps11.21_db -query targetamps11.21.fa -out ampcompare.txt -outfmt 6
awk '$3 <100' ampcompare.txt # look at matches that weren't 100% which were the amplicons matching to themselves
```
Amplicons with high sequence similarity to others were flagged for review. For overlapping or near-duplicate regions, retain the best-quality representative based on design considerations (e.g., coverage, SNP content, or primer position). Only a small number of loci were removed during this filtering step.
Remove redundant amplicons.
### Check for Primer Off-Target Matches
* Run blastn-short for each primer
Drop:
* amplicon similarity
* multi-mapping
* dual primer hits
### Final QC with Samtools
```bash
samtools coverage ...
```
Drop loci with coverage > 1.5× mean
## Summary
1. Use ANGSD to call SNPs from regions with continuous 134+ bp coverage
2. Identify 300 bp windows with 2–5 SNPs (putative microhaplotypes)
3. Intersect across populations to find shared regions
4. Simulate power to detect kinship with CKMRsim
5. Design primers and validate using BLAST and thermodynamic rules
6. Finalize loci based on uniqueness, GC content, coverage depth, and amplicon spacing
**Result:** High-confidence, non-redundant loci suitable for GT-seq panel development
*Authored by Jamie Adkins*:[ GT-Seq Panel Design](https://hackmd.io/@jlastoll/r11Uvxs96)
*Last updated by Mikayla Newbrey*