Sample Reliquary Draft

---
title: Sample Reliquary Draft
subtitle: A product of the iSamples Project
funding: This work is sponsored by the US National Science Foundation under Grant Numbers 2004839, 2004562, 2004642, and 2004815.
---

# Sample Reliquary Draft

This document contains draft information on the development of sample reliquary support in the iSamples project.

## Summary

> **reliquary** <sub>*noun*</sub>
> a container or shrine in which sacred relics are kept

A **sample reliquary** is a mechanism to reference a set of samples. Sample references are achieved through the use of persistent, resolvable, globally unique identifiers. Services must be available to perform actions based on information contained in reliquary documents.

There are two major aspects of sample reliquary implementation:

1. The reliquary document.
2. An ecosystem of services for generating, viewing, and otherwise supporting actions on reliquary documents.

:::info
**Note** There is an implicit assumption that entities (samples, datasets, documents, etc.) are identified with globally unique, resolvable identifiers.
:::

### The Sample Reliquary Document

A reliquary document can be generalized to "just" a list of identifiers, ideally accompanied by predicates that indicate the relationship between the subject of the reliquary (i.e. the publication, study, or other artifact producing the reliquary document) and the identifiers contained therein. For example, a reliquary document may contain identifiers of samples that were used in some analysis, but also contain identifiers of samples that were deliberately excluded. In this case a simple list of identifiers must be augmented with predicates indicating the use of the identifiers within the context of the analysis.

A reliquary document should contain sufficient information to identify the context to which it is applicable.
At a minimum, this may be an identifier referring to the artifact for which it was produced, but time stamps and other contextual metadata can be helpful when interpreting or rendering a reliquary document.

A reliquary document may be perused by humans, but many uses of a sample reliquary require interaction with a number of different services. Hence, a reliquary document should be in a machine readable format that may be readily presented in a human readable form.

In summary, a reliquary document should:

- Contain a list of sample identifiers with indication of the use of the identifiers.
- Include contextual metadata indicating the purpose of the document and properties to distinguish revisions of the document (e.g. a "date modified" timestamp).
- Be machine readable.

### The Sample Reliquary Ecosystem of Services

Certain fundamental operations must be supported by systems interacting with reliquary documents. Reliquary documents are primarily composed of grouped sets of identifiers referencing other artifacts. For each identifier, it should be possible to:

- Retrieve metadata about the identifier
- Retrieve metadata about the identified artifact
- Retrieve the identified artifact

With general support of these operations, the holder of a reliquary document could dynamically expand the referenced resources (identifiers) to gather additional details about each item, and so generate a more complete view of the reliquary document that goes far beyond the simple list of identifiers (e.g. taxonomic and spatio-temporal details for each referenced resource). Such dynamically expanded documents could be persisted so as not to require dynamic creation with each viewing. They should be persisted in a manner consistent with the basic reliquary document structure, so that an expanded reliquary document contains no less information than the original.

Invocation of these actions can be used to trigger events such as incrementing metrics counts and other notification of use.
Hence, systems supporting the resolution of identifiers should also include mechanisms for tracking access events and reporting such events in a consistent manner (e.g. [COUNTER Code of Practice for Research Data](https://www.countermetrics.org/code-of-practice/)).

The sample reliquary ecosystem of services includes:

- Reliquary document generators
  - Create a reliquary document from some set of identifiers
  - Expand a reliquary document by realizing metadata for each of the contained identifiers
  - Persist a reliquary document in a suitable format
  - Regenerate a reliquary document by re-querying sources and/or re-expanding sources
- Identifier resolvers / collections
  - Resolve an identifier to metadata or a resource depending on the request
  - Record or emit events pertaining to content access
- Reliquary document viewers
  - Present a human readable representation of a reliquary document.
  - Expand a reliquary as a reliquary document generator.
  - Facilitate annotation of reliquary documents using Hypothesis or similar infrastructure.
- Reliquary document discovery
  - Enable search and discovery of reliquary documents
  - Facilitate reverse lookup, i.e. given a sample identifier, what reliquary documents does it appear in (for a predicate)?

See also:

* https://www.rd-alliance.org/group/complex-citations-working-group/case-statement/complex-citations-working-group-case-statement
* https://docs.google.com/document/d/1xGYLWdtCYDz_KJ4JOeU1w0_gvjdAKCOLPT83h4yPmTU/edit

# Functional Use Cases

The following use cases are a work in progress, though they are generally representative of the goals for reliquary support in iSamples. These use cases are deliberately narrow in focus, as they are intended to help identify the technical implementation needed to support them.

## 1. As an author, I would like to provide a record of all samples used [for a particular purpose] in a publication.

Reference identifiers of samples directly in the publication.
![](https://i.imgur.com/0uGHIVn.png)

and/or:

Produce a reliquary document that contains a list of identifiers. Generate an identifier for the document. Reference the identifier in the publication.

![](https://i.imgur.com/hrfUw1A.png)

It may be prudent or necessary for a publication to reference multiple sets of identifiers. This implies that reliquaries may be nested: a reliquary may contain identifiers of other reliquaries.

![](https://i.imgur.com/HUs8ANr.png)

## 2. As an author, I would like to generate a list of citations to acknowledge the samples contributing to my publication.

For each referenced sample, get metadata, group by creator. Generate citations for each group.

Consider including a summary report as part of the reliquary - more convenient for human consumption.

![](https://i.imgur.com/t8S04VB.png)

## 3. As an agent with interest in samples [collector, administrator, curator], I would like to know how samples I have interest in are being used.

Given one or more samples, find identifiers of the containing sample sets (number of sample sets containing interesting records).

Given one or more sample set identifiers, find entities that reference those sample set identifiers, i.e. given the set of triples `<subject (S), predicate (P), object (O)>`, recursively gather `S` given `O` for any `P`, where `O_{n+1} = S_n`.

![](https://i.imgur.com/fjl0EEx.png)

## 4. As a sample collector (agent who curated, participated in collecting event, uploaded to system, or identified), I would like to receive acknowledgement of use of any of my samples.

Given a sample, find identifiers of all the containing sample sets.

Given one or more sample set identifiers, find all publications that reference any of those identifiers.

There is a need to identify the role of agents associated with the sample.

Samples may be split - track derivatives.

Consider also CARE principles - how samples are actually being used from the perspective of Indigenous stewards.
Consider also access control and use notification. Associating local context identifiers with samples.

Pattern is functionally like 3.

## 5. As a collection manager, I would like to record major actions (e.g. transfer, deaccession) performed on sets of records.

Generate reliquary documents, preserve them, reference the reliquary document identifiers.

Functionally like 1.

## 6. As a researcher, I would like to repeat an analysis described in a publication with exactly the same samples.

Given a sample set, retrieve the sample records.

Functionally like 2, adding `get_data(pid)`, which returns the data the identifier references.

## 7. As a researcher, I would like to preserve a set of records for later use.

Create a document that contains the list of sample identifiers. Include any other pertinent information. Verify services can provide sample data and metadata for future use. Alternatively, gather that information to be contained in the document.

Functionally like 1.

## 8. As a researcher, I would like to know all the samples that contributed to a publication.

Extract identifiers from the publication. Determine if identifiers refer to samples or sample set documents. The full set of sample identifiers is the union of the individual sample identifiers and the identifiers contained in all referenced sample set documents.

## 9. As a researcher, I would like to know if more samples may be available to refine a previous analysis.

Given the information used to create a sample set, create a new sample set with the same information, e.g. re-issue the query. This requires that the query and its context of execution be captured in the reliquary.

## 10. As a researcher, I would like to combine several sets of records into a single set.

Use a tool designed for such purpose. Conceptually this could be the union of all sample identifiers, but would also need consideration of handling multiple time stamps, queries, and other document properties.

## 11. As a researcher, I would like to share a set of records with my colleagues.
Place the sample set document in a persistent location where it can be reliably retrieved using a persistent, globally unique identifier.

---

# Core Operations

`get(pid:str)->Any:`

Given an identifier, retrieve the identified entity. Side effects include recording the action for metrics collation. Other actions may depend on the type of item. For example, retrieval of a reliquary document may trigger transitive notification to contained identifiers.

`get_meta(pid:str)->Dict[str,Any]:`

Given an identifier, retrieve information about the identified entity. Side effects include recording the action for metrics collation.

`get_idmeta(pid:str)->Dict[str,Any]:`

Given an identifier, retrieve information about the identifier.

`get_spo(s:str=None, p:str=None, o:str=None)->List[Tuple[str, str, str]]:`

Given one or more of subject s, predicate p, or object o, return the matching statements (i.e. the matching set of `s,p,o`).

`get_subjects(pid)`

Given an identifier, retrieve identifiers of entities that reference the identifier.

`find_pids(query)`

Retrieve identifiers of entities that match a given set of entity properties, entity metadata properties, or identifier properties.

Other operations:

`reliquary_regenerate(pid)`

Given a reliquary document, regenerate it (e.g. re-execute the query used to generate it).

`reliquary_diff(pid_1, pid_2)`

Compute the difference between two reliquary documents.

---

# Core Structures

`reliquary`

A data structure containing a list of identifiers, along with when the list was created, who created it, and how it was created. Identifiers should be accompanied by predicates, for which the subject is the holder of the reliquary.

`reliquery_index`

A data structure that captures reliquary contents, enabling lookup of reliquary identifier(s) given sample identifier(s).
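
To make the two core structures and the `get_spo` / `get_subjects` operations more concrete, the following is a minimal in-memory sketch in Python. It is illustrative only and is not the iSamples implementation: the class names `Reliquary` and `ReliquaryIndex`, the field names, the method `gather_subjects`, and the predicates `used`, `excluded`, and `contains` are all assumptions introduced here. The index stores statements as `(subject, predicate, object)` triples, and `gather_subjects` follows the recursive pattern from use case 3, where the subjects found in one step become the objects of the next.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Reliquary:
    """Hypothetical in-memory form of a reliquary document."""
    pid: str                   # identifier of the reliquary itself
    created: str               # when the list was created (ISO 8601)
    creator: str               # who created it
    generator: Dict[str, str]  # how it was created, e.g. the query used
    # (predicate, identifier) pairs; the subject is the reliquary holder.
    statements: List[Tuple[str, str]] = field(default_factory=list)

    def triples(self) -> List[Triple]:
        return [(self.pid, p, o) for p, o in self.statements]


class ReliquaryIndex:
    """Hypothetical reverse index over registered reliquaries."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()
        self._by_object: Dict[str, Set[str]] = defaultdict(set)

    def add(self, reliquary: Reliquary) -> None:
        for s, p, o in reliquary.triples():
            self._triples.add((s, p, o))
            self._by_object[o].add(s)

    def get_spo(self, s: Optional[str] = None, p: Optional[str] = None,
                o: Optional[str] = None) -> List[Triple]:
        """Return statements matching every supplied term (None = wildcard)."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def get_subjects(self, pid: str) -> Set[str]:
        """Identifiers of entities that directly reference pid."""
        return set(self._by_object.get(pid, set()))

    def gather_subjects(self, pid: str) -> Set[str]:
        """Transitively collect every entity that references pid."""
        found: Set[str] = set()
        pending = [pid]
        while pending:
            current = pending.pop()
            for subject in self._by_object.get(current, set()):
                if subject not in found:  # also guards against cycles
                    found.add(subject)
                    pending.append(subject)
        return found


if __name__ == "__main__":
    index = ReliquaryIndex()
    index.add(Reliquary("rel:1", "2023-01-10T15:05:34Z", "user:bob",
                        {"query": "source:SESAR"},
                        [("used", "IGSN:SLB0000GZ"),
                         ("excluded", "IGSN:SLB0000H0")]))
    # A nested reliquary that contains the first one.
    index.add(Reliquary("rel:2", "2023-01-11T00:00:00Z", "user:alice", {},
                        [("contains", "rel:1")]))
    print(sorted(index.gather_subjects("IGSN:SLB0000GZ")))  # ['rel:1', 'rel:2']
```

A persistent implementation would presumably back the index with a search engine or triple store rather than Python sets; the sketch only shows the shape of the lookups.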

---

# Initial implementation Notes

* [Issue #247](https://github.com/isamplesorg/isamples_inabox/issues/247)

```json
{
  "@type": "reliquary",
  "timestamp": "2023-01-10T15:05:34.721295Z",
  "generator": {
    "service": "http://localhost:8000/thing/reliquery",
    "parameters": {
      "query": "source:SESAR"
    },
    "user": "URI of user"
  },
  "entity_count": 4606504,
  "entity_list_count": 10000,
  "identifiers": [
    "IGSN:SLB0000GZ",
    "IGSN:SLB0000H0",
    "IGSN:JBG00006B",
    …
  ],
  "description": "Arbitrary supplied text"
}
```

```
POST /reliquery
  query: {SOLR Query}
  description: {Optional description}
```

## Supporting Attribution

Naive approach: ping every identifier in the reliquary doc when it is retrieved. The result is traffic amplification and scalability issues.

```plantuml
actor Bob
actor Alice
participant iSamples
participant IDProvider
participant Source
Source -> iSamples: records
== Search ==
Bob -> iSamples: preserve query
activate iSamples
iSamples -> iSamples: generate reliquary
iSamples -> IDProvider: getID
activate IDProvider
IDProvider -> iSamples: reliquary_id
deactivate IDProvider
iSamples -> Bob: reliquery
deactivate iSamples
== Later ==
Bob -> Alice: reliquary_id
Alice -> IDProvider: resolve reliquary_id
activate IDProvider
IDProvider -> Alice: iSamples
deactivate IDProvider
Alice -> iSamples: get reliquary_id
activate iSamples
iSamples -> Alice: reliquary
group async loop [each identifier]
iSamples -> Source: doc
Source -> Source: increment counter
end
deactivate iSamples
```

Alternatively, the record source gathers use metrics and exposes via API. This also enables notification of use for samples: a system presenting a reliquary may notify the origin of access to the identifiers contained therein. This is more efficient, but requires stronger agreements between systems.
```plantuml
actor Alice
participant iSamples
participant IDProvider
participant Source
Alice -> IDProvider: resolve reliquary_id
activate IDProvider
IDProvider -> Alice: iSamples
deactivate IDProvider
Alice -> iSamples: get reliquary_id
activate iSamples
iSamples -> iSamples: increment counter
iSamples -> Alice: reliquary
deactivate iSamples
Source -> iSamples: getMetrics
activate iSamples
iSamples -> iSamples: count
iSamples -> Source: counts
deactivate iSamples
```
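
The second sequence, where the system holding the reliquary counts accesses locally and a source later asks for its counts, could be sketched roughly as follows. This is a hedged illustration only: the class `ReliquaryHost`, its methods, and the use of an identifier prefix to decide which identifiers belong to a source are assumptions made for this example, not part of the iSamples API.

```python
from collections import Counter
from typing import Dict, List


class ReliquaryHost:
    """Hypothetical service that stores reliquaries and counts identifier use."""

    def __init__(self) -> None:
        self._documents: Dict[str, List[str]] = {}
        self._counts: Counter = Counter()

    def put(self, reliquary_id: str, identifiers: List[str]) -> None:
        self._documents[reliquary_id] = list(identifiers)

    def get(self, reliquary_id: str) -> List[str]:
        """Return the identifiers and record one use of each, locally."""
        identifiers = self._documents[reliquary_id]
        # No request is sent to any source here, which avoids the
        # traffic amplification of the naive approach.
        self._counts.update(identifiers)
        return list(identifiers)

    def get_metrics(self, prefix: str) -> Dict[str, int]:
        """Counts for identifiers starting with prefix.

        The prefix is an assumed stand-in for "identifiers owned by the
        requesting source".
        """
        return {pid: n for pid, n in self._counts.items()
                if pid.startswith(prefix)}


if __name__ == "__main__":
    host = ReliquaryHost()
    host.put("rel:1", ["IGSN:SLB0000GZ", "IGSN:SLB0000H0", "ark:/99999/x1"])
    host.get("rel:1")
    host.get("rel:1")
    print(host.get_metrics("IGSN:"))
```

With this arrangement a retrieval costs the sources nothing at request time; each source polls `get_metrics` on whatever schedule the systems agree upon, which is where the stronger agreements between systems come in.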