# FAIR PID: Current Status of Implementation

A short report on the FAIR principles and what we have done so far to enable them in our system.

## Table Of Contents

1. [Introduction](#What-is-all-this-about?)
2. [How Content Negotiation and Signposting works](#How-Content-Negotiation-and-Signposting-works)
3. [Resolvers and General Methods of Accessing Information](#Resolvers-General-Methods-of-Accessing-Information)
4. [Flows](#Flows)
5. [Summary/Review](#Summary/Review)
6. [Issues](#Issues/Future-Improvements)

---

## What is all this about?

We want to provide multiple ways of making data publicly available. If someone wants to find a published resource, RAISE should allow that. Because of the academic nature of the content, we can expect all types of clients (machines, crawlers, human users) to want information about a research object.

**FAIR (Findable, Accessible, Interoperable, and Reusable) data** is what allows us to achieve this goal, along with mechanisms like **signposting** that help machines and humans discover related content. Delivering resources in different formats depending on who is requesting them is handled through **content negotiation.**

So what are these things and what do they mean?

1. **FAIR:** These are properties meant to ensure that digital assets (e.g. datasets) can be easily found, accessed, and understood by both humans and machines. The whole point is to reduce friction: **if something is FAIR, it is easier to discover, share, and reuse, in less time than it would normally take**. This is especially useful in academic work, where data is central.

2. **Signposting:** You can think of this as a way to make sure everybody knows where they can find the resource they're interested in. It's essentially a method that increases the discoverability and accessibility of metadata and data in general.
For our purposes, we use HTTP `Link` headers and HTML `<link>` elements, as you will see later on. **Regardless of the road a client takes, they should be able to find their destination if there are signs pointing to it everywhere.**

3. **Content Negotiation:** This is a way for clients (machines or not) to specify the format in which they would like their content. We do this through the **Accept header**, and **Apache and our API** handle each request accordingly. So if someone wants the metadata of a dataset in XML format, the API should respond with the metadata in that format, provided we support it.

## How Content Negotiation and Signposting works

### 1. Link Headers on Backend Responses

- **Where:** Metadata and resource endpoints on the FAIR PID controller.
- **What:** The backend includes `Link` headers in its HTTP responses that point to the linkset endpoint for the given PID.
- **Why:** This enables machines to automatically discover related information in a standardized way. When a machine requests metadata or resource data, the `Link` header points to the linkset endpoint, which in turn grants access to further information about the PID. This reinforces interoperability and allows for automatic discovery of information.

---

### 2. `<link>` Elements on Frontend

- **Where:** PID landing page, specifically `rai-finder/{pid}`.
- **What:** Angular SSR renders the Rai-Finder page, attaching a `<link>` element to the document head.
- **Why:** This ensures that crawlers and automated agents that do not execute JavaScript can still follow the `<link>` element to retrieve the linkset and, through it, the metadata and the related PID resource, increasing findability and accessibility.

For example, this appears when resolving a specific PID:

```html
<link rel="linkset" type="application/json" href="https://develop.api.portal.raise-science.eu/fair-pid/linksets/linksets.json/21.T15999/raise-dev/<PID-NUMBER>">
```

---

### 3. Content Negotiation for Metadata and Resource requests

*While this may concern multiple types of clients, we refer to machines specifically, as they are the focus of this section.*

Once a machine has the linkset for a given PID, it has access to that PID's metadata and resource. **The linkset lists the available formats** for each "type" of data object (Resource, Metadata), and the client can request the data in whichever of those formats it prefers. **This is content negotiation and it is accomplished via the Accept header.**

In practice, a request to one of the endpoints below with a **valid** Accept header gets a response in the requested format. For more info on how this is done, [see here.](#2.-Access-PID-Metadata-and/or-Resource-via-Linkset)

*If the Accept header is invalid, the server responds with an error stating that a valid Accept header must be provided.*

***This only applies to the metadata and resource endpoints on the API**. Namely:*

```bash
https://develop.api.portal.raise-science.eu/fair-pid/metadata/{pid}
https://develop.api.portal.raise-science.eu/fair-pid/resource/{pid}
```

---

## Resolvers, General Methods of Accessing Information

*This section goes over all methods of obtaining resources and related information about a PID. For more specific flows, see [here](#Flows).*

### `curl`

These examples show how machines (or clients in general) can access metadata and landing pages and negotiate content using standard `curl` commands. **They work the same with similar tools like Postman or other software that doesn't run JavaScript**.

#### 1. Access Linkset via `link` tag in HTML (Crawlers)

When entering the Rai-Finder page, **Angular SSR** takes care of serving the page and **attaches a `<link>` element to the document head**.
For example:

```html
<link rel="linkset" type="application/json" href="https://develop.api.portal.raise-science.eu/fair-pid/linksets/linksets.json/21.T15999/raise-dev/5">
```

Since this page is server-side rendered, **crawlers and other bots that don't run JavaScript can easily access the linkset** of a PID to get metadata and other information on the resource it represents.

You can use this command to try it out yourself. It sends a request to the Rai-Finder page and returns only the `<link>` element:

```bash
curl -s https://develop.portal.raise-science.eu/rai-finder/21.T15999/raise-dev/5 \
  | grep -o '<link[^>]*rel="linkset"[^>]*>'
```

#### 2. Access PID Metadata and/or Resource via Linkset

The linkset contains links that point to the metadata endpoint on the backend, one for each available metadata format, alongside the link to the resource. The machine/client can then simply follow a link to get the metadata for the PID.

The following is an example of the response from the linkset endpoint.

```json
[
  {
    "rel": "item",
    "href": "https://develop.api.portal.raise-science.eu/fair-pid/resource/21.T15999/raise-dev/5",
    "type": "application/json"
  },
  {
    "rel": "describedby",
    "href": "https://develop.api.portal.raise-science.eu/fair-pid/metadata/21.T15999/raise-dev/5",
    "type": "application/vnd.datacite.datacite+json"
  }
]
```

*Any machine/client can follow the `href` links to retrieve the data they need. As an example, a simple `curl` suffices:*

```bash
curl https://develop.api.portal.raise-science.eu/fair-pid/resource/21.T15999/raise-dev/5
```

### DOI Resolver

The DOI resolver (doi.org) successfully redirects to the PID landing page when given a RAI PID.
So, for example, `21.T15999/raise-dev/5` redirects to this page: `https://develop.portal.raise-science.eu/rai-finder/21.T15999/raise-dev/5`

*A successful resolution returns an **HTTP 302 Found** redirect, followed by HTTP 200 OK (which Apache returns for the landing page). Bots typically follow redirects automatically, so this should work fine.*

*This is very similar to how [DataCite Content Negotiation](https://support.datacite.org/docs/datacite-content-resolver) works.*

---

### Mendeley

Mendeley Web Importer support requires a specific meta tag in our HTML: `<meta name="citation_doi" content="INSERT-PID">`

Our app uses **Angular SSR** so that the meta tag is already present in the document head when the Web Importer plugin looks for the reference. This also works for the **Mendeley Reference Manager** (via "Add Reference manually"), where users can input a PID and the information will be retrieved.

## Flows

In this section, we go into further detail on how all these systems tie together to facilitate easy access to data. We describe separate flows, but in reality they all combine to form one system; separating them just makes it easier to understand.

### Human Flow (with Resolver or not)

*This one is pretty self-explanatory, but let's go over it briefly.*

1. Client visits our website (RAISE Portal).
2. Client clicks on the Rai-Finder "tab" to enter the page.
3. The client inserts a PID and sees its information.

This can also be done directly by typing the appropriate URL and identifier (e.g. typing `https://develop.portal.raise-science.eu/rai-finder/21.T15999/raise-dev/5` into the address bar).

**With the usage of a resolver like** [DOI](https://www.doi.org/):

1. Client visits [doi.org](https://www.doi.org/).
2. Client inserts a PID.
3. [DOI](https://www.doi.org/) redirects to our Rai-Finder page, displaying the appropriate information.
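
Whichever route is taken, the flows above end on the server-rendered Rai-Finder page, whose head carries both the `citation_doi` meta tag and the linkset `<link>` element. As a minimal sketch of what a client that does not run JavaScript could read from that page, the Python below parses both tags out of an HTML string. The sample HTML, the class name `LandingPageTags`, and the function `read_landing_page` are illustrative assumptions and not part of the RAISE codebase; the real page markup may differ.

```python
from html.parser import HTMLParser


class LandingPageTags(HTMLParser):
    """Collects the linkset URL and the citation_doi value from page HTML."""

    def __init__(self):
        super().__init__()
        self.linkset_href = None
        self.citation_doi = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel") == "linkset":
            # Signposting entry point: URL of the linkset document.
            self.linkset_href = attr.get("href")
        elif tag == "meta" and attr.get("name") == "citation_doi":
            # Tag that the Mendeley Web Importer looks for.
            self.citation_doi = attr.get("content")


def read_landing_page(html):
    """Return (linkset_href, citation_doi); either may be None if absent."""
    parser = LandingPageTags()
    parser.feed(html)
    return parser.linkset_href, parser.citation_doi


# Hypothetical head section, modeled on the tags described in this report.
SAMPLE_HTML = """<html><head>
<meta name="citation_doi" content="21.T15999/raise-dev/5">
<link rel="linkset" type="application/json"
      href="https://develop.api.portal.raise-science.eu/fair-pid/linksets/linksets.json/21.T15999/raise-dev/5">
</head><body></body></html>"""

linkset_href, citation_doi = read_landing_page(SAMPLE_HTML)
print(linkset_href)
print(citation_doi)
```

Because the parser only needs the raw HTML, the same function could be fed the body of a plain HTTP GET against the landing page, which is the situation a crawler is in.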
### Crawler/Machine Flow

*This goes over how web crawlers can access metadata (or any other type of information regarding a resource).*

1. Crawler visits the Rai-Finder page with a PID (e.g. `https://develop.portal.raise-science.eu/rai-finder/21.T15999/raise-dev/5`).
2. The Rai-Finder page is server-side rendered and has a `<link>` element in the document head that points to the linkset for that PID.
3. The crawler (or any type of bot that doesn't run JavaScript) can simply read the `<link>` element and follow its URL to access the linkset.
4. Then, through the linkset, **the metadata and the resource itself can be accessed.**

### General Flow

- Human clients use their browser to see the content.
- Machines/crawlers inspect the `<link>` element on the Rai-Finder page and can then use **content negotiation** to get the information they are seeking. We also provide automatic content discovery in the backend (via `Link` headers on the API), which makes that information easier to find.
- **Researchers using Mendeley** can enter a PID manually or use the **Mendeley Web Importer** browser plugin to instantly get all information regarding a PID.

## Summary/Review

We have 2 types of information: **Metadata** and the **PID Resource** itself. Through the [above methods](#Resolvers-General-Methods-of-Accessing-Information) and the [above flows](#Flows), machines, crawlers and human clients can access all available information. The entry points are the `Link` headers in the responses of commonly hit endpoints and the `<link>` elements on the frontend; both lead to the linkset documents, which in turn grant access to the actual metadata and resource endpoints.

**Current Supported Accept Header (MIME Type) Formats:** `application/vnd.datacite.datacite+json`

---

## Issues/Future Improvements

1. **DOI Resolving**: Dataset, Script and Result PIDs redirect to this page: `https://www.raise.eu/{dataset | script | result}/record`. This is wrong; they should point to the `Rai-Finder` page.
2. Mendeley doesn't fill in information on a reference when given one of our PIDs.
3. Metadata for RAI PIDs is not yet implemented.
4. The resource endpoint does not take the Accept header into account.
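
Related to issue 4: each linkset entry already advertises a `type`, so a client can send that value as its Accept header today, and its requests should keep working if the resource endpoint later starts honoring the header. The Python below is a minimal sketch of that client-side step under stated assumptions: the sample linkset is modeled on the example in this report, and the helper names `pick_entry` and `build_request` are hypothetical. The requests are only prepared, not sent.

```python
import json
from urllib.request import Request

# Sample linkset, modeled on the example in this report (assumed shape).
SAMPLE_LINKSET = """
[
  {"rel": "item",
   "href": "https://develop.api.portal.raise-science.eu/fair-pid/resource/21.T15999/raise-dev/5",
   "type": "application/json"},
  {"rel": "describedby",
   "href": "https://develop.api.portal.raise-science.eu/fair-pid/metadata/21.T15999/raise-dev/5",
   "type": "application/vnd.datacite.datacite+json"}
]
"""


def pick_entry(linkset, rel):
    """Return the first linkset entry with the given relation, or None."""
    return next((entry for entry in linkset if entry.get("rel") == rel), None)


def build_request(entry):
    """Prepare (but do not send) a GET request asking for the advertised type."""
    return Request(entry["href"], headers={"Accept": entry["type"]})


linkset = json.loads(SAMPLE_LINKSET)
metadata_request = build_request(pick_entry(linkset, "describedby"))
resource_request = build_request(pick_entry(linkset, "item"))

print(metadata_request.full_url, metadata_request.get_header("Accept"))
print(resource_request.full_url, resource_request.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.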