# A proposed addendum to account for both locus and marker definitions in human microhaplotype nomenclature
## Abstract
Under construction.
## Background
Microhaplotypes have emerged in recent years as a novel type of genetic marker with promising qualities, and interest in their applications continues to grow within the forensics, anthropology, and population genetics communities. A microhaplotype (*microhap* or *MH*) has been defined as a short region of DNA that (1) spans multiple common SNPs, (2) exhibits multiple allelic combinations, and (3) can be covered by a single next-generation sequencing (NGS) read (CITATION Kidd 2014).
In 2016, Kidd proposed nomenclature guidelines for microhaps (CITATION Kidd 2016). According to the proposed specification, each microhaplotype is assigned an identifier composed of a standard fixed prefix ("mh"), a two-digit chromosome label, a unique symbol representing the laboratory or principal investigator publishing the microhap, a hyphen, and a lab-specific number or designation. For example, `mh05KK-170` refers to the Ken Kidd lab's microhap #170 on chromosome 5. This proposal has been adopted widely as a *de facto* standard in the forensic genetics literature (Table 1) and in community resources such as the MicroHapDB database (CITATION Standage 2020).
**Table 1. Identifiers of a representative set of microhaplotypes published by several independent laboratories.**
| Identifier | Source |
| -------- | -------- |
| mh01NH-04 | Hiroaki 2015 |
| mh14KK-101 | Kidd 2018 |
| mh11PK-63643 | van der Gaag 2018 |
| mh08CP-009 | Chen 2019 |
| mh03USC-3qC | de la Puente 2020 |
| mh18ZBF-002 | Jin 2020 |
| mh19ZHA-009 | Kureshi 2020 |
| mh02ZHA-013 | Sun 2020 |
| mh06SHY-005 | Wu 2021 |
| mh11FHL-007 | Fan 2022 |
| mh04WL-052 | Yu 2022 |
| mh03HYP-09 | Zou 2022 |
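
The identifier layout described above (prefix, two-digit chromosome label, laboratory symbol, hyphen, lab-specific designation) can be checked mechanically. The following is a minimal sketch, not the MicroHapDB implementation: the regular expression, the `MicrohapID` type, and the `parse_microhap_id` function are hypothetical names introduced for illustration, and the sketch assumes that laboratory symbols contain only letters and that designations contain only letters and digits, which holds for the entries in Table 1 but is not guaranteed by the guidelines.

```python
import re
from typing import NamedTuple


class MicrohapID(NamedTuple):
    """Components of an identifier such as mh05KK-170 (hypothetical type)."""
    chromosome: str   # two-digit chromosome label, e.g. "05"
    lab: str          # laboratory or principal investigator symbol, e.g. "KK"
    designation: str  # lab-specific number or designation, e.g. "170"


# Assumed pattern: "mh" + two digits + letters + "-" + letters/digits.
# Labels other than two digits (e.g. for sex chromosomes) are not handled here.
_PATTERN = re.compile(r"mh(\d{2})([A-Za-z]+)-([A-Za-z0-9]+)")


def parse_microhap_id(identifier: str) -> MicrohapID:
    """Split an identifier into its components, or raise ValueError."""
    match = _PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid microhap identifier: {identifier!r}")
    return MicrohapID(*match.groups())


# Three identifiers taken from Table 1 and the running example.
for name in ("mh05KK-170", "mh03USC-3qC", "mh11PK-63643"):
    print(name, parse_microhap_id(name))
```

Note that nothing in this layout indicates whether the identifier names a locus or a particular set of SNPs at that locus.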
This naming system *raises* the distinction between a locus and an allele at that locus but does not conclusively *resolve* the distinction in terms of assigning identifiers to microhaps. As additional laboratories report novel microhaps or propose adjustments to previously published microhap loci, the community lacks a shared understanding of what precisely constitutes a microhaplotype and of the scope of any given microhap name. It seems appropriate now to consider explicitly whether a microhap identifier is intended to refer to (1) a locus or (2) a specific set of SNPs that will define alleles (haplotypes) at that locus. The authors admit that, until recently, they courteously disagreed on this point.
It is important to acknowledge the extent to which developments in detection technology and laboratory techniques since the 1980s have influenced how microhaps and microhap alleles are defined. Early studies of human haplotype variation targeted individual sites (then called RFLPs, now called SNPs) whose genomic coordinates and precise relative distances were unknown. These sites were genotyped individually using Southern blotting and phased statistically to determine the haplotype. As the human genome sequence was released and real-time PCR-based genotyping assays became available for some SNPs, TaqMan became the preferred method for genotyping specific defined sites in a genomic region. One rarely studied all sites known to vary at a locus, and TaqMan assays were usually available only for the more heterozygous sites. Under those circumstances the “definition” of a haplotype marker necessarily involved the SNP assays used, with haplotype alleles depicted as an aggregate of the targeted sites (such as A-G-G-T for a haplotype represented by four SNPs).
The advent and rapid improvement of DNA sequencing technologies have offered contrasting and complementary strengths. NGS permits the recovery of a complete microhap sequence, the entire extent of which can be informative and empirically phase-resolved. Detection of MH alleles (haplotypes) at the region of interest does not depend on targeting specific SNPs, or on the availability of any particular SNP in, for example, a TaqMan assay. Instead, the microhap can be defined by the extent of the region of interest: specifically, the 5’ and 3’ boundaries of the region containing the variation of interest. Microhap sequencing assays then depend not on a set of targeted SNPs but on primers flanking each microhap to enable PCR amplification, or on probes spanning the locus to enable hybridization capture enrichment.
While microhaps have always been envisioned with NGS assays in mind, early proofs of concept demonstrating microhaps for forensic applications used SNP-based assays and allele representations. Notably, much of the published population frequency data for microhaps—upon which forensic interpretation relies—remains in this format.
Together, these factors place competing demands on microhap nomenclature. On one hand is the requirement for a microhap locus identifier that is stable regardless of whether the alleles at that microhap are represented as complete sequences or as the composite alleles at specific sites. On the other hand is the requirement for an identifier to distinguish between different SNP sets that have historically been used to define alleles at a particular locus.
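
The tension described above can be made concrete with a toy example. The sketch below uses entirely hypothetical data: the read sequence, the two sets of SNP offsets, and the `composite_allele` helper are invented for illustration and do not correspond to any published microhap. It shows that a single sequenced haplotype at one locus yields different allele representations depending on which set of SNPs is used to define the marker.

```python
from typing import Sequence


def composite_allele(sequence: str, offsets: Sequence[int]) -> str:
    """Join the bases found at the given 0-based offsets, e.g. 'A-G-G-T'."""
    return "-".join(sequence[i] for i in offsets)


# Hypothetical phased sequence spanning a single microhap locus (toy data).
read = "TTAGCCGTAGGCTTCA"

# Two hypothetical SNP sets defined at the same locus, given as 0-based
# offsets into the read. Set B includes one site that set A does not.
snp_set_a = [2, 6, 9, 12]
snp_set_b = [2, 6, 9, 12, 15]

print(composite_allele(read, snp_set_a))  # A-G-G-T
print(composite_allele(read, snp_set_b))  # A-G-G-T-A
```

The locus and the underlying sequence are identical in both cases; only the marker definition differs. Population frequencies reported for alleles under one SNP set therefore cannot be applied directly to alleles under the other.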
Here, we present a proposed addendum to the standard microhap nomenclature that resolves this tension by accommodating both locus and marker/assay information in microhap identifiers. Any set of SNPs defined at the same locus is assigned the same root identifier, but a unique suffix is also assigned to unambiguously distinguish each specific set of SNPs. Rules for merging duplicate records at the same locus are also proposed. These recommendations have been implemented in MicroHapDB to facilitate ongoing work in the forensic genetics community to define a standard set of microhap panels for different forensic applications.
## Specification
### Microhaplotype Identifiers
Under construction.
MicroHapDB nomenclature module: ID parsing and validation
### Suffix Assignment
Under construction.
### Merging of Identifiers
Under construction.
## Discussion
Under construction.
Emphasize future mootness point