# Identifying microhaplotypes for kinship assignment in Cownose Rays
The goal of this pipeline is to identify RAD loci containing multiple SNPs which, when phased into "microhaplotypes" ([Baestcher et. al. 2018](https://onlinelibrary-wiley-com.silk.library.umass.edu/doi/full/10.1111/1755-0998.12737?casa_token=CWK7YvZGj_0AAAAA%3Aim9AFqSbw6drvZ3f0l4POj4FZ43lH6O0AM-cpaYv4_5ibv2_KAVhHYbjHf-LiS3dz6Wy3Co7Z2EiRg)), provide sufficient power for kinship assignment at the level of parent-offspring and half-sibling.
This protocol should work for any system that lacks a reference genome, and should also be easily adaptable to reference-based workflows. My hope is that others in the lab will find this protocol useful as well.
The GitHub repository with the scripts mentioned here can be found at this link: https://github.com/JDSwenson/Cownose_Ray_RAD/tree/main/02_kinship
The upstream steps are documented here: https://hackmd.io/6pi-p8ftQVug71nTvYWUbQ
## Background
<font size=2>[Close-kin mark-recapture](https://projecteuclid.org/journals/statistical-science/volume-31/issue-2/Close-Kin-Mark-Recapture/10.1214/16-STS552.full) is an emerging method for using DNA and relatedness among individuals to estimate abundance of a population. Relying as it does on accurate kinship assignment, implementing CKMR requires genetic markers that can reliably differentiate among different categories of kin (e.g. parent-offspring, half-sibling, unrelated).
A year ago (January 2021), I obtained sequence data from cownose rays via RAD-Seq, and have been working to identify regions of DNA that are informative for kinship and stock assignment, with the hopes of eventually applying CKMR to cownose rays in Chesapeake Bay. I've been particularly interested in [microhaplotypes](https://onlinelibrary-wiley-com.silk.library.umass.edu/doi/full/10.1111/1755-0998.12737?casa_token=x8uUi2wn7kgAAAAA%3At6qRr0uy6g-WiOSveeYb9LCQ_0IajhMNeOUxrINlfSnrL7LnzRQsGllzfL2JXBV3G4RtqE_7_rVWCA) (i.e. loci with multiple SNPs), as these have been found to significantly improve the power of kinship assignment while also reducing false positives. The increased power of microhaplotypes arises from the fact that they are multi-allelic, while SNPs are generally bi-allelic. Once I've identified an array of microhaplotypes that can be used for kinship assignment, I aim to develop a capture panel - either [Rapture](https://academic.oup.com/genetics/article/202/2/389/5930231?login=true) or [GT-Seq](https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.12357) - that will reduce costs and analytical complexity for an eventual CKMR-based abundance estimate.
In addition, capture panels that are designed for one species [may also work for related species](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12980?casa_token=Akd_cLvhc80AAAAA:GWUfu1fO6T-6IwxzNyZiRwtCTs4znRixSv4jiZj7rQ6XiZljwXAl9JGBMU6G8e2d_hthsigNmmnBoio). Cownose Rays are closely related to Mobula Rays (*Mobula sp.*) and one step further removed from Eagle Rays (*Aetobatus sp.*). Both of these genera include multiple threatened or endangered species. Therefore, I am also interested in whether the microhaplotypes I identify in Cownose Rays are also present in these successive sister taxa. If so, the capture panels I develop may enable CKMR and other kinship-based methods for these taxa.
The steps outline below follow initial parameter optimization and locus formation conducted with Stacks2. The upstream steps end at Step III (hence why we start with Step IV below) and are documented here: https://hackmd.io/6pi-p8ftQVug71nTvYWUbQ</font>
### Step IV: Identify microhaplotypes that are informative for kinship in Cownose Rays
#### Step IV.1: Re-run populations with only POPs
<font size=2>**Script run: 05_run_populations_CNR_POPsOnly2.sh**
**Purpose**
I re-ran the Stacks populations program to filter loci and generate a vcf file that only includes samples from known parent-offspring pairs (7 pairs = 14 individuals). My goal is to identify loci that are informative for kinship, so I decided to focus entirely on these samples to find those loci, with the idea that I will examine their variability in other non-related individuals and species after.
**Filters applied (all filters were applied haplotype-wise):**
- **maximum observed heterozygosity:** 0.7
- Loci that are heterozygous in 70% of individuals or more will be removed.
- **minimum minor allele count:** 3
- Loci with a minor allele count < 3 will be removed. This number ensures that each minor allele will be present in *at least* two individuals (one homozygote and one heterozygote).
- **SNP must be present in 80% of individuals**
- I wanted to focus on SNPs that are present in at least 6 POPs (i.e. 12/14 individuals); setting this value to 80% means that a locus can be missing in a maximum of two individuals.
- **Filter haplotype-wise**
- Stacks phases SNPs that occur at the same locus into haplotypes. Because I'm looking for microhaplotypes, I applied the above filters haplotype-wise.
**Results**
- Identified **59704 SNPs** that passed the above filters
**Questions**
- Any concerns or thoughts about the above parameters?</font>
#### Step IV.2: Filter loci for missingness and multiple SNPs (microhaplotypes)
<font size=2>**Script run: 06_CNR_POPsOnly2_SNP_filter.Rmd**
**Purpose**
Here, my goal was to filter SNPs down to the most variable and reliable SNPs among the parent-offspring pairs (POPs) in the dataset. I focused on SNPs associated with microhaplotypes (i.e. SNPs that occur nearby one another that can be phased into a multi-allelic locus).
**Filters applied:**
- **Paralog filters:** D = |4|; H = 0.95
- Instructions for filtering paralogs based on the below graph suggest trying to capture the rightward pointing "finger-like projection" that contains the dense cloud of points. In this case, the cloud is spread out into vertical lines I suspect because I only used 14 samples, which sets a limit to the values H can take.
- **D** represents the "read ratio deviation", which is high when paralogs are collapsed into a single locus. It appears we can eliminate most outliers by setting D to 4 (which also seems to be a solid default based on other sources)
- **H** represents heterozygosity. I am reluctant to filter for heterozygosity here, since I've already instituted a heterozygosity filter. The filter used was based on haplotypes, so it's not unexpected that partitioning into SNPs will results in high levels of heterozygosity. Still, that point at 1 seems pointless, so I got rid of it.

- **minimum depth per locus:** 7 (i.e. 2m + 1)
- m is the parameter that was used in the [initial Stacks pipeline](https://hackmd.io/6pi-p8ftQVug71nTvYWUbQ) to remove reads that were sequenced to a depth less than m. This helps reduce the potential for retaining sequencing errors in the dataset. However, there is a chance that for heterozygotes, one allele will meet this depth requirement while the other will be dropped, potentially causing heterozygotes to be called as homozygotes. By setting a filter here that requires the minimum depth of a locus to be 2m + 1, we are more likely to eliminate false homozygous bases from our dataset.
- **IPass:** 80
- Any sample missing IPass % of loci will be removed before calculating missingness.
- I set this number high because I did not want to remove any sample before further filtering.

- **Imiss:** 30
- Samples missing data from this percent of loci or more after iterative filtering will be removed from the dataset.
- This seemed like a reasonable number, but ultimately I was interested in identifying loci that are present in all the samples. This filter ultimately removed one sample - the one with more missing data than the others in the above graph - and retained the other 13 samples.
- **Lmiss:** 0
- While samples are iteratively filtered down to Imiss, loci are iteratively filtered down to Lmiss in a loop.
- I wanted to target loci that will reliably sequence in cownose rays, so set the locus missingness filter to 0. I still ended up with X SNPs spread across Y loci.
These filters gave rise to the below graphs:

**Additional filters**
- **Mean read depth:** 18
- The plot in the bottom right of the graph above shows a bimodal distribution of read depth. Generally, focusing on loci with a mean coverage of 20x is considered a safe bet to avoid miscalling genotyping/sequencing errors as SNPs. Further, elasmobranchs are known to have lots of repetitive elements in their genomes, and these tend to be a challenge to sequence accurately. To play things safe, I set a filter for mean read depth to 18 to eliminate loci with mean read depths in the lower hump, in case those loci represent errors. After this filter, the bimodal distribution is now unimodal (is that a word? seems like it probably is):

This gives me a total of **9,966 SNPs distributed over 8,027 loci**.
Again, I'm most interested in loci that contain multiple SNPs. When I break down the loci a little further to look at the number of SNPs per locus, this is what we see:

Though most loci only contain one SNP, there are also over 1,000 loci that contain multiple SNPs. These are the loci I want to target.
From here, I isolated the SNPs that were associated with a microhaplotype locus. This reduced the dataset down to **3553 SNPs distributed over 1614 loci**.
Heterozygosity values per SNP seem pretty solid.

**Questions**
Are the above filters too stringent?
Other filters I should add?</font>
#### Step IV.3: Re-run populations to generate a new vcf file for microhaplotypes
<font size=2>**Script run: 07_run_populations_w_whitelist.sh**
**Purpose**
The Stacks populations program will take a whitelist of loci, as well as previous files generated by the Stacks pipeline, and produce various output formats using only the samples and loci specified. Here, my main motivation for using Stacks again was to generate radpainter files, which I can feed to a program called [fineRADstructure](https://academic.oup.com/mbe/article/35/5/1284/4883220) to confirm that the microhaplotypes assign kinship as expected.
**Filters applied**
- **Whitelist of loci**
- These are the microhaplotypes identified in the previous step</font>
#### Step IV.4: Test that loci accurately assign kinship via fineRADstructure and CKMRSim
<font size=2>**Scripts run:**
**1. 08.fineRADstructure**
**2. 08.2a_fineRADstructurePlot**
**3. 09_CNR_kinship**
**Purpose**
The **fineRADstructure** scripts use the haplotype data from Stacks to examine genetic similarity; if the loci are informative for kinship assignment, then this program should reveal that.
The below graph was generated by fineRADstructure, and shows that kinship is appropriately assigned for 5/6 of the groups that include both pairs (the program makes a pdf, and this is just a snapshot, but you can trust me that the blue boxes correspond to the correct kin assignment :))

The **09_CNR_kinship** script formats the data for CKMRSim and runs simulations to assess the power of the markers, as well as to assign kinship among individuals.
The below plot comes from simulations using the markers I've identified. The simulations show the power of distinguishing different relationships from one another. The fact that the below distributions are separated (except for PO and FS) suggests that the markers I've selected have pretty good power to differentiate among different types of kinship. FS (full sibling) and PO (parent-offspring) are difficult to distinguish in general, so the hope is that I would be able to tell them apart irl using additional data from the samples (e.g. age). 
When I run CKMRSim on my acutal samples with the same markers, it correctly identifies the relatives.
**Filters applied**
None. These scripts are testing whether the previous filtering steps were sufficient, but are not applying any new filters.
**Questions**
Is it concerning that one of the pairs is more ambiguously assigned in fineRADstructure when the other five pairs have very strong support?</font>
### Step V: Assess the presence/absence of SNPs in Mobula Rays at the same loci
<font size=2>[Hosegood et. al. (2020)](https://onlinelibrary-wiley-com.silk.library.umass.edu/doi/full/10.1111/mec.15683) did RAD-Sequencing of over 100 mobula rays, using the same enzyme I used with cownose rays. I retrieved their publicly-available data and have been working to see if the microhaplotypes I've identified in cownose rays are also variable in mobula rays. If so, then I could conceivably develop RAD-capture panels that can be used to facilitate CKMR-based abundance estimates with multiple species.</font>
#### Step V.1: Download Mobula Ray data and align to microhaplotype sequences
<font size=2>I don't remember how I did this but I did. Used the interactive queue, I think. Moving on.</font>
#### Step V.2: Align Mobula Ray and Cownose Ray data to microhaplotype sequences
<font size=2>**Script run: 10.2_align_2_whitelistCatalog.sh**
**Purpose**
Here, I used bowtie2 with the --very-sensitive flag to align the raw reads from the mobula data AND the rest of my cownose ray data to the whitelisted microhaplotypes identified above. After alignment, I passed the output alignments through samtools (multiple times) to sort and index the files to prepare for SNP calling.
**Filters applied**
- the **--very-sensitive** flag was given to bowtie2, which maximizes alignment accuracy at the expense of computing speed.</font>
#### Step V.3: Call SNPs for Mobula Rays and full Cownose Ray dataset
<font size=2>**Scripts run:**
**1. 11_callSNPs_FreeBayes**
**2. 12_callSNPs_StacksRef**
**Purpose**
I used two different SNP callers on the alignment files produces above to see how the results vary and whether one algorithm is more successful in this instance. They both appeared to produce similar results, so the below steps represent the StacksRef algorithm.
**Filters applied**
- **significance level for SNP/variant call:** 0.01
- The default for this program is 0.05, but due to the whole repetitive element thing, I wanted to make this filter a bit more stringent.</font>
#### Step V.4: Examine variation among Mobula Rays
<font size=2>**Scripts run: 13_CNR_POPs2_WhiteList_cross-species_comp.Rmd**
**Filters applied**
- **Minor allele count within Mobula Rays:** 2
- **Locus must be variable *within* at least two species of mobula rays**
- Most of the SNPs identified in mobula rays were only variable among species. I'd like to find SNPs that are variable *within* species
**Results**
There were a total of 44,140 SNPs called for the full dataset (which includes cownose rays) from the alignments to the microhaplotype sequences. The majority of these were missing from mobula rays. After applying the above filters, I am left with **191 SNPs distributed over 62 loci**
Soooo ... not many. Heterozygosity looks alright, though not nearly as high as cownose rays at the same loci (which is to be expected):

### Other approaches explored
The steps outlined above represent, I believe, the best approach to identifying microhaplotypes that are informative for kinship in cownose rays and *also* variable in mobula rays. However, I tried several other strategies before arriving at this one. In particular, I came at the issue from the opposite direction: I ran multiple iterations of Stacks where I combined my cownose ray data with the mobula ray data first, optimized Stacks parameters for the combined datasets, identified shared SNPs between the species, and *then* looked for a subset of these SNPs that are informative for kinship. Unfortunately, the loci that were variable between species were not variable within cownose rays and vice versa.</font>