---
title: Field Sample Identification
subtitle: Associating persistent identifiers with samples in the field.
tags: identifier, field, sample
---

# Field Sample Identification

Problem: Associating a globally unique, resolvable, persistent identifier (GURPI) with a physical sample collected in the field, where connectivity and technical capacity are limited.

There are three basic approaches to ensuring uniqueness of generated identifiers:

1. Algorithmic global uniqueness
2. Selection from a predefined list
3. Context dependent uniqueness

These approaches overlap to some extent; for example, a predefined list may be used within a context.

Persistence requires that the identifier is stored in a system that promotes content persistence.

Resolvability requires that the identifier is addressable through some mechanism (e.g. Internet service, card catalog).

Persistence of resolvability should also be considered: URLs are convenient as addressable identifiers but are notoriously fragile, so identifier schemes designed for persistence, such as DOI and ARK, should be preferred.

## Algorithmic Uniqueness

An algorithmically generated unique identifier relies on entropy to produce values that are statistically unique. For example, [UUIDs](https://en.wikipedia.org/wiki/Universally_unique_identifier) are 128-bit values that rely on physical hardware identifiers and timestamps (version 1) or random values (version 4) to make collisions extremely unlikely. Reaching a 50% chance of a single collision requires generating about 2.7x10^18 version 4 UUIDs, or about 1 billion per second for 86 years.

At 36 characters in their canonical text form, UUIDs are unwieldy for field work or anywhere else that transcription is required. [Expressed as a URN](https://datatracker.ietf.org/doc/html/rfc4122), a UUID takes the form:

```
urn:uuid:41504c16-6eb2-4B5d-9898-42a951d50e56
```

Pros:

- Easy to generate
- Globally unique with overwhelming probability

Cons:

- Lengthy, unsuitable for transcription
- Field generation requires technology

## Predefined Uniqueness

A list of identifiers may be generated and verified to be unique.
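
As a minimal sketch of pre-generating such a list, the Python below draws short random identifiers and accepts each one only after checking it against those already issued. The function name `generate_identifiers`, the reduced `ALPHABET`, and the `FS-` prefix are illustrative assumptions, not part of any established scheme.

```python
import secrets

# Hypothetical alphabet: digits and uppercase letters, omitting the easily
# confused characters 0/O and 1/I/L to ease hand transcription.
ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"


def generate_identifiers(count, length=6, prefix=""):
    """Return `count` distinct random identifiers of `length` characters."""
    if count > len(ALPHABET) ** length:
        raise ValueError("identifier space is too small for the requested count")
    seen = set()
    identifiers = []
    while len(identifiers) < count:
        candidate = prefix + "".join(
            secrets.choice(ALPHABET) for _ in range(length)
        )
        if candidate not in seen:  # verify uniqueness before accepting
            seen.add(candidate)
            identifiers.append(candidate)
    return identifiers


# Example: a batch of 1000 labels that could be printed before a field trip.
labels = generate_identifiers(1000, length=6, prefix="FS-")
```

Uniqueness here is only verified within one generated batch; separately generated batches would still need coordination, for instance through distinct prefixes.
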
A variant of precalculated identifiers is a [linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator) (LCG), where the list of identifiers is algorithmically predefined as a sequence of values that appear random. [Hashids](https://hashids.org/) can serve a similar purpose when driven by a counter, encoding sequential integers as short strings that appear random.

A challenge with predefined identifiers is recording which of them have been used: of a list of, say, 1000 generated identifiers, only 100 may be used.

Pros:

- Pre-generation of labels negates need for field technology
- Verifiably unique
- Opportunity for pre-registration of common metadata

Cons:

- Managing unused identifiers
- Coordination of generation needed to prevent potential overlap

## Contextual Uniqueness

A context, or sequence of contexts, can provide a namespace within which uniqueness can be guaranteed. It is simple to create a unique identifier within a tightly controlled context. If the context itself is globally unique, then identifiers within it are also globally unique when combined with the context identifier.

For example, a single sheet of paper may have individually numbered written observations for different samples. The individual numbers are unique within the context of the page. The page is unique within the notebook, and the notebook is unique within the project. Thus, given a globally unique identifier for the project, a globally unique identifier for an observation could be constructed from the combination of context identifiers:

```
project + notebook + page + observation = GURPI
```

Any number of context levels may be constructed, with the principal requirement being uniqueness of a sub-context within its parent context.

Pros:

- Simple, may be created by hand
- Natural workflow
- Contexts combine common metadata

Cons:

- Requires careful context management
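
The combination of context identifiers described for contextual uniqueness can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `compose_identifier`, the `/` separator, and the project name `example-project` are hypothetical, and the project identifier is assumed to be globally unique already.

```python
SEPARATOR = "/"  # hypothetical delimiter; it must never occur inside a part


def compose_identifier(*contexts):
    """Join context identifiers, outermost first, into one identifier."""
    parts = [str(part) for part in contexts]
    for part in parts:
        # An empty part, or one containing the separator, would make the
        # combined identifier ambiguous, so reject it.
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid context identifier: {part!r}")
    return SEPARATOR.join(parts)


# project + notebook + page + observation
first = compose_identifier("example-project", 3, 12, 7)
# The same observation number on a different page gives a different identifier.
second = compose_identifier("example-project", 3, 13, 7)
print(first)   # example-project/3/12/7
print(second)  # example-project/3/13/7
```

Rejecting parts that contain the separator is one small piece of the context management noted under Cons: without it, two different context paths could collapse into the same string.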