### Integrating data and analysis technologies within leading environmental research infrastructures: Challenges and approaches
---
### Research Infrastructures (RI)
RIs are platforms that acquire, curate and publish continuous observation data for research and policy making, and that should make these data accessible and reusable **not only for human users but also for machines**.
- 80% of the effort is spent on data preparation
---

---
### Surveys of state-of-the-art approaches
The survey highlights the heterogeneity of existing practices and shows that there is enormous potential for practice harmonization.
- API
- Programming language
---
<!-- ### Structure of the paper
2. Survey
3. Solutions
4. Discussion
5. Roadmap
6. Conclusions -->
## 2. Survey
- a systematic review of the approaches implemented by world-leading RIs in Earth System and Environmental Sciences to enable data and metadata access for both humans and machines.
---
### 2.1. Selected research infrastructures
----
#### PANGAEA (www.pangaea.de)
Our services are generally open for archiving, publishing, and reuse of data. The World Data Center PANGAEA is a member of the World Data System.
a data publisher in Earth & Environmental Science
----
#### The Terrestrial Ecosystem Research Network (TERN, https://www.tern.org.au)
TERN provides open data, research and management tools, data infrastructure and site-based research equipment.
Australia’s terrestrial ecosystems
----
#### AuScope (https://www.auscope.org.au/)
We provide research tools, data, analytics and support to Australia’s geoscience community.
----
#### The Commonwealth Scientific and Industrial Research Organization (https://www.csiro.au)
Australia’s national research agency, covering a broad spectrum of science, engineering and medical research domains
----
#### The National Ecological Observatory Network (NEON, https://www.neonscience.org)
Good science is built on good data
The National Ecological Observatory Network, or NEON, offers expert ecological data from sites across the continent to power the most important science being done today.
----
#### The Chinese Ecosystem Research Network (CERN, http://www.cern.ac.cn/)
CERN
----
#### The Integrated European Long-Term Ecosystem, Critical Zone & SocioEcological Research Infrastructure (eLTER RI https://www.lter-europe.net/elter-esfri)
Taking Europe's pulse: research for our continent's future
----
#### The Integrated Carbon Observation System (ICOS, https://www.icos-cp.eu/)
a European-wide greenhouse gas research infrastructure.
----
#### The European Network for Earth System Modeling Climate Data Infrastructure (ENES CDI, https://is.enes.org/)
Infrastructure for the European Network for Earth System Modelling
----
#### NCI Australia (National Computational Infrastructure) (https://nci.org.au/)
high-performance supercomputer infrastructure and cloud systems that generate data, process data streams or analyze data
---
### 2.2 Design
- persistent identifiers used
- metadata and data formats and standards offered
- access protocols and interfaces implemented
<!-- we asked the participants to provide example links to both
data and metadata resources as well as the required authentication
protocols. -->
---
### Metadata
- Dublin Core
- Schema.org
  - It allows us to describe data sets in detail using the Schema.org/Dataset type
  - Its use improves search engine harvesting and thus the visibility and discoverability of the described data sets
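
As a rough illustration of what such a description can look like, the sketch below builds a minimal Schema.org/Dataset record as JSON-LD. The DOI, name, description and download URL are placeholders chosen for this example, not real records from any of the surveyed RIs, and `make_dataset_jsonld` is a hypothetical helper.

```python
import json


def make_dataset_jsonld(doi, name, description, download_url):
    """Build a minimal Schema.org/Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "identifier": f"https://doi.org/{doi}",
        "name": name,
        "description": description,
        # A distribution tells a machine where the actual data can be fetched.
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": download_url,
        },
    }


# Placeholder values, for illustration only.
record = make_dataset_jsonld(
    doi="10.1234/example",
    name="Example water temperature series",
    description="Placeholder description of an example data set.",
    download_url="https://example.org/data.csv",
)
print(json.dumps(record, indent=2))
```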
---

<!--
we asked the respondents to gather practices and
thoughts on the following questions: Describe how a data scientist can
write a script (any language) that based on a DOI/PID loads the identified
(meta)data into a data frame (native data structure in your language of
choice). Do you or third parties offer special libraries for data access? If a
DOI/PID is not sufficient, what information does the data scientist need to
load the data of your RI into a data frame?
-->
---
#### 2.2.1 Embedded metadata
- Metadata can be embedded in the HTML of a page
- Links (Uniform Resource Locators, URLs) to data objects can be embedded in the same way
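
A minimal sketch of how a script could read such embedded metadata, assuming it is provided as JSON-LD inside a `<script type="application/ld+json">` element. The HTML sample and its URL are invented for this example, and `JsonLdExtractor` is a hypothetical helper built on the standard library only.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect every JSON-LD block embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Invented page content, standing in for a downloaded HTML document.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example data set",
 "distribution": {"@type": "DataDownload",
                  "contentUrl": "https://example.org/data.csv"}}
</script>
</head><body>Landing page</body></html>"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

# The embedded link to the data object can now be followed by a machine.
data_url = parser.records[0]["distribution"]["contentUrl"]
print(data_url)
```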
---
#### 2.2.2 Content negotiation
The client application sends an HTTP request whose `Accept` header specifies the expected response format as a MIME type.
```
GET /10.1594/PANGAEA.80968 HTTP/1.1
Host: doi.org
Accept: application/ld+json
```
**Content negotiation and Schema.org/Dataset are recommended by Google**
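
The same kind of request can be sketched in Python with the standard library. This is only an illustration: `build_negotiated_request` is a hypothetical helper, it assumes the DOI is resolved through `https://doi.org/`, and whether JSON-LD actually comes back depends on the server behind the DOI. The request is built but not sent, so the example runs without network access.

```python
import urllib.request


def build_negotiated_request(doi, mime_type="application/ld+json"):
    """Prepare a GET request that asks for a specific representation of a DOI."""
    url = f"https://doi.org/{doi}"
    # The Accept header carries the MIME type the client would like to receive.
    return urllib.request.Request(url, headers={"Accept": mime_type})


request = build_negotiated_request("10.1594/PANGAEA.80968")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))

# To actually send it (requires network access):
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```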
---
### 2.3 Evaluation
F-UJI is a web service to programmatically assess the FAIRness of research data objects

---
### A schematic overview of HTTP-based methods

---

----
#### GraphQL
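- A protocol layered on top of HTTP
- Data is exchanged in the shape that the request asks for
- The number of endpoints does not grow
- Implementation advantages
- The API can carry type information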
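
To make the points above concrete, here is a minimal sketch of a GraphQL call prepared with the Python standard library. The endpoint URL and the schema (`dataset`, `title`, `variables`, `name`, `unit`) are hypothetical and not taken from any of the surveyed RIs. The query names exactly the fields it wants, is typed (`$doi: String!`), and is posted to one single endpoint.

```python
import json
import urllib.request

# Hypothetical schema: the client lists exactly the fields it needs.
QUERY = """
query DatasetByDoi($doi: String!) {
  dataset(doi: $doi) {
    title
    variables { name unit }
  }
}
"""


def build_graphql_request(endpoint, query, variables):
    """Prepare the POST request that carries a GraphQL query as JSON."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and DOI; the request is built but not sent.
request = build_graphql_request(
    "https://example.org/graphql", QUERY, {"doi": "10.1234/example"}
)
print(request.get_method(), request.full_url)
```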
---

---
## 3. Solution
Some of the investigated RIs have proposed solutions by developing specialized software libraries that automate such ingestion.
----
- PANGAEA has recently published the Python library (pangaeapy)
```
from pangaeapy.pandataset import PanDataSet
pandata = PanDataSet('doi:10.1594/PANGAEA.889516')
```
----
- ICOS provides a Python library (icoscp)
```
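from icoscp.cpb.dobj import Dobj
# The same data object can be addressed by its ID, its handle, or its landing page URL
icosdata = Dobj('XA_Ifq7BKqS0tkQd4dGVEFnM')
icosdata = Dobj('https://hdl.handle.net/11676/XA_Ifq7BKqS0tkQd4dGVEFnM')
icosdata = Dobj('https://meta.icos-cp.eu/objects/XA_Ifq7BKqS0tkQd4dGVEFnM')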
```
----
Having data available in data frames is an important first step in overall data processing and analysis.
```
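from icoscp.cpb.dobj import Dobj
from pangaeapy.pandataset import PanDataSet

# Load each data set from its PID and obtain the observations as a data frame
icosdata = Dobj('https://hdl.handle.net/11676/xgu4rfCmqvXb4w1wGGD6mYsB')
icosdata_frame = icosdata.get()
pandata = PanDataSet('https://doi.org/10.1594/PANGAEA.889516')
pandata_frame = pandata.data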
```
---
### Plot water temperature observations in degrees Celsius

<!-- Time series of water temperature in degree Celsius for two ships crossing the Atlantic ocean. On the left in orange, the data set published by PANGAEA, with
observations from Europe to South America. On the right in blue, the data set published by ICOS, with data collected from Europe to Brazil. (For interpretation of the
references to colour in this figure legend, the reader is referred to the web version of this article.) -->
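The two data frames use different key names for the same variable, so the keys have to be harmonized before the data can be plotted together.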
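
A minimal sketch of such a harmonization step, using plain dictionaries in place of data frames so that it runs without third-party packages. The column names (`Temp`, `temp_water`, `water_temperature_degC`) and the values are invented for this example and are not the actual keys used by the two libraries.

```python
# Hypothetical provider-specific key names mapped to one shared name.
KEY_MAP = {
    "Temp": "water_temperature_degC",
    "temp_water": "water_temperature_degC",
}


def harmonize_keys(columns, key_map):
    """Return a copy of a column dict with provider-specific keys renamed."""
    return {key_map.get(name, name): values for name, values in columns.items()}


# Invented sample columns standing in for the two loaded data sets.
provider_a = {"Temp": [18.2, 18.9], "Latitude": [40.1, 39.5]}
provider_b = {"temp_water": [17.5, 18.1], "Latitude": [41.0, 40.2]}

harmonized_a = harmonize_keys(provider_a, KEY_MAP)
harmonized_b = harmonize_keys(provider_b, KEY_MAP)

# Both data sets now expose the same keys and can share one plotting routine.
print(sorted(harmonized_a) == sorted(harmonized_b))
```

With pandas data frames, the same renaming can be done with `DataFrame.rename(columns=KEY_MAP)`.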
---
##### Map

<!-- The PANGAEA and ICOS data sets plotted with geolocation and colour gradients (dark blue, minimum, to yellow, maximum) to represent the sampled water
temperature. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) -->
---
##### Citation

---
The presented chart and map plotting examples nicely show the
advantage of libraries that streamline the data ingestion and harmonization tasks and thus contribute to ensuring that data scientists can focus more on data analysis.
---
## 4. Discussion
### Although the approach of using specialized libraries is promising, our use case revealed some open issues
- it takes time to get familiar with multiple libraries
- each library is available for only one programming language
---
Community-specific formats → generic, easy-to-implement formats
- Decreasing interest in ISO19115 after about 2011
- Increasing interest in Schema.org
----

---
<!--
### Findable, Accessible, Interoperable and Reusable (FAIR)
FAIR principles have had overwhelming success within the scientific community
Both the FAIR data principles as well as search engine optimisation
(SEO) approaches have similar requirements for domain agnostic pro-
vision of metadata and have a comparably high standard with respect to
detail and completeness. As Schema.org serves two purposes (SEO and
FAIR), it is now used by a rapidly increasing number of data providers to
enable FAIR metadata and data provision.
The use of persistent identifiers is another prerequisite for FAIR data,
and their advantages have been described in detail by Philipson (2019).
-->
### CSV on the Web


---
### Frictionless data

---
## 5. Roadmap

---
## 6. Conclusions
- The largest environmental research infrastructures (RIs) make their data available to the scientific community
- Accessing these data is not straightforward for machines
- FAIR principles
- Findable, Accessible, Interoperable and Reusable