Misc identifier notes

---
title: Misc identifier notes
---

# General notes on Identifiers

An identifier is a sequence of characters used to identify a resource. A resolvable identifier is an identifier which, when provided to a resolving service, returns the location of the identified resource. There are many implementations of resolver services, most tied to particular types of identifier, or to identifiers that conform to a particular scheme.

Each identifier scheme defines syntax and behavior peculiar to the scheme. However, there are elements of common syntax and behavior across the common successful schemes. This document attempts to describe the common syntax and behavior, with particular emphasis on DOI, IGSN, and ARK identifiers within the broad context of the earth, environmental, and biological sciences.

## Syntax

Identifiers generally have several components:

$$
\color{navy}{\mathtt{Scheme}}:\color{salmon}{\mathtt{Prefix}}/\color{red}{\mathtt{Shoulder}}\color{darkorange}{\mathtt{Suffix}}\color{lightblue}{\mathtt{Extra}}
$$

$\color{navy}{\mathtt{Scheme}}$ indicates the type of identifier, for example "doi", "ark", or "igsn".

$\color{salmon}{\mathtt{Prefix}}$ is an allocating agent identifier: a string of characters (generally numeric) identifying the agent that created the identifier. Allocating agents are registered with the identifier infrastructure. The prefix is equivalent to a DOI "prefix" or an ARK "Name Assigning Authority Number" (NAAN).

$\color{red}{\mathtt{Shoulder}}$ is a set of characters at the start of the identifier value that an allocating agent can use to group identifiers. This is useful for activities such as expeditions or experiments where there is a logical grouping of identifiers, or where the allocating agent delegates minting ability to multiple agents at the same time. In the latter case, agents may mint without concern for overlap since the Shoulder + Suffix would be unique.
Note that for legacy IGSNs, the shoulders (called "namespaces" in the IGSN domain) are collaboratively managed, so that shoulders are unique across all allocating agents.

$\color{darkorange}{\mathtt{Suffix}}$ is the part of the identifier that is specific to a single resource within the context of a scheme, allocating agent, and namespace combination.

$\color{lightblue}{\mathtt{Extra}}$ includes any characters beyond the minted identifier. These offer no value in resolving an identifier, but a resolver should pass these characters verbatim to the resource provider when included in a resolution request.

A $\color{green}{\mathtt{Namespace}}$ is the combination of $\color{salmon}{\mathtt{Prefix}}/\color{red}{\mathtt{Shoulder}}$ and uniquely identifies a set of identifiers within the $\color{navy}{\mathtt{Scheme}}$.

An identifier $\color{black}{\mathtt{Value}}$ is the combination of $\color{navy}{\mathtt{Scheme}}:\color{salmon}{\mathtt{Prefix}}/\color{red}{\mathtt{Shoulder}}\color{darkorange}{\mathtt{Suffix}}$ and is a globally unique handle to a resource.

### DOI

$$
\color{navy}{\mathtt{doi}}:\color{salmon}{\mathtt{10.1234}}/\color{red}{\mathtt{zz}}\color{darkorange}{\mathtt{fq98d}}\color{lightblue}{\mathtt{?k1=v1\&k2=v2}}
$$

### ARK

$$
\color{navy}{\mathtt{ark}}:\color{salmon}{\mathtt{1234}}/\color{red}{\mathtt{zz}}\color{darkorange}{\mathtt{fq98d}}\color{lightblue}{\mathtt{?k1=v1\&k2=v2}}
$$

### IGSN

The IGSN scheme is currently (mid 2022) undergoing a transition. The existing (legacy) IGSN pattern is:

$$
\color{navy}{\mathtt{igsn}}:\color{red}{\mathtt{zz}}\color{darkorange}{\mathtt{fq98d}}\color{lightblue}{\mathtt{?k1=v1\&k2=v2}}
$$

The future IGSN pattern will be the same as DOIs:

$$
\color{navy}{\mathtt{igsn}}:\color{salmon}{\mathtt{10.1234}}/\color{red}{\mathtt{zz}}\color{darkorange}{\mathtt{fq98d}}\color{lightblue}{\mathtt{?k1=v1\&k2=v2}}
$$

## Case Sensitivity

DOIs and IGSNs are not case sensitive.
For example, the DOI `doi:10.1234/zzfq89d` is equivalent to `doi:10.1234/ZZFQ89D`. ARKs are case sensitive, apart from the `ark:` label and percent-encoded characters as described below.

The case normalized form of a DOI is upper case using ASCII case folding [^1]. Similarly, the case normalized form of an IGSN is upper case [^i].

The case normalized form of an ARK is more complicated, though lower case is prevalent [^2]:

> Normalization of an ARK for the purpose of octet-by-octet equality comparison with another ARK consists of four steps. First, any upper case letters in the "ark:" label and the two characters following a `%' are converted to lower case. The case of all other letters in the ARK string must be preserved.

(Note that there are three more steps for ARK normalization.)[^3]

Case consistency matters in any system that compares identifiers character by character. One example is when identifiers are used as IRIs in an RDF environment. In that system, literal character comparison is used to determine IRI equivalence, and hence normalization of identifiers is necessary to ensure consistency across uses.

## Character Sets

DOIs may incorporate any printable character from UCS-2 of ISO/IEC 10646. ARKs and IGSNs use US-ASCII (note that future IGSNs may adopt the DOI character set).

## Reserved Characters

[^1]: https://www.doi.org/doi_handbook/2_Numbering.html#2.4
[^i]: See § "Recommended Practice" in https://igsn.github.io/syntax/
[^2]: https://arks.org/about/running-minters-and-resolvers/
[^3]: See § 2.7 in https://www.ietf.org/archive/id/draft-kunze-ark-18.txt
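
Returning to the Syntax and Case Sensitivity sections, the following is a minimal Python sketch of how an identifier string might be split into components and case normalized. It is illustrative only and is not taken from any resolver implementation. The names `Identifier`, `parse`, and `normalize_case` are hypothetical. The sketch assumes that Extra begins at the first `?` (as in the examples above), leaves Shoulder and Suffix joined because they cannot be separated without knowing the registered shoulders, keeps the scheme label in lower case for all schemes, and implements only the first of the four ARK normalization steps.

```python
import re
import string
from typing import NamedTuple, Optional

# ASCII-only upper-casing table; non-ASCII characters are left untouched.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


class Identifier(NamedTuple):
    scheme: str            # e.g. "doi", "ark", "igsn"
    prefix: Optional[str]  # allocating agent; None for legacy IGSNs
    name: str              # Shoulder + Suffix, left unsplit (assumption)
    extra: str             # everything from the first "?" onward (assumption)


def parse(text: str) -> Identifier:
    """Split 'scheme:prefix/name?extra' into its components."""
    scheme, sep, rest = text.partition(":")
    if not sep or not scheme:
        raise ValueError(f"no scheme label in {text!r}")
    # Assumption: Extra starts at the first "?", as in the examples above.
    body, qmark, query = rest.partition("?")
    prefix, slash, name = body.partition("/")
    if not slash:
        # Legacy IGSN pattern: no prefix, only Shoulder + Suffix.
        return Identifier(scheme.lower(), None, body, qmark + query)
    return Identifier(scheme.lower(), prefix, name, qmark + query)


def normalize_case(ident: Identifier) -> str:
    """Return the case-normalized Value; Extra is not part of the Value."""
    body = ident.name if ident.prefix is None else f"{ident.prefix}/{ident.name}"
    if ident.scheme in ("doi", "igsn"):
        # Upper case via ASCII case folding.
        body = body.translate(_ASCII_UPPER)
    elif ident.scheme == "ark":
        # First ARK step only: lower-case the two characters after each "%";
        # the case of all other letters is preserved.
        body = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), body)
    else:
        raise ValueError(f"unsupported scheme {ident.scheme!r}")
    return f"{ident.scheme}:{body}"
```

With these definitions, `normalize_case(parse("doi:10.1234/zzfq89d"))` and `normalize_case(parse("doi:10.1234/ZZFQ89D"))` both give `doi:10.1234/ZZFQ89D`, so the two spellings compare equal after normalization, while an ARK keeps the case of its letters outside the label and percent-encoded characters.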