---
tags: DaaS
---

# Data-as-a-Service #

> **_STATUS:_** THIS IS CURRENTLY A DRAFT VERSION

```yaml
title: UCloud Data-as-a-Service
authors: Per Møldrup-Dalum, Kristoffer Nielbo
contact: kln@cas.au.dk
```

# 1. Overview #

The need for fast and seamless access to data has been growing in all research areas during the last decade. Many areas do not have adequate data infrastructure to utilize recent advances in the computational sciences, and their research risks coming to a dead end or being outpaced by private/commercial actors. With the Data-as-a-Service project, we want to develop a data discovery and access layer in UCloud. We will work with typical use cases from the social sciences/humanities, focusing particularly on access to siloed and restricted data at national data providers (e.g., national royal libraries and collections).

# 2. Project Justification #

__Researcher needs__

- Searchable interface for data (discovery platform for data sets)
- Reduced overhead in applying for and getting access to data (onboarding and activation)
- Secure and collaborative environment for working on restricted data
- Opportunities to test and use new solutions for sharing and collaborating on research data (e.g., federated learning)

__Provider needs__

- User recruitment and retention
- Trustworthy security
- Easy ways of a) increasing awareness of existing data, and b) sharing data in standardized ways
- Easy way of securing payments for services

# 3. Objectives #

- Provide researchers and students seamless access to (research-relevant) data in UCloud from public and private data providers
- Develop the necessary UCloud infrastructure to interface with data from multiple heterogeneous sources (develop against the data providers' APIs)
- Manage diverse access policies (copyright, GDPR requirements, etc.)
- Manage diverse technical solutions on the providers' end (API/no API)
- Manage diverse curational facets of data sets/providers
- Convince data providers that UCloud DaaS is a secure and sustainable way to share data with researchers
- Provide a secure platform for sharing data
  - Based on a data-responsible PI
  - Access to data via projects managed by a data-responsible PI in UCloud (PI: Project + Person)
- Increase awareness and knowledge of available data sets
- Uniform and easy search interface for data sets in UCloud

# 4. Use cases #

We have collected a number of use cases that exemplify the typical needs of social sciences/humanities researchers. Currently, the use cases focus on the Danish Royal Library.

## Use case #1 -- API access to open data at the Royal Library ##

:::info
A researcher from Aalborg University needs to access open data from the Royal Library in UCloud to compute and visualize descriptive statistics interactively in JupyterLab.
:::

The Royal Library has been experimenting with exposing its open data through standard APIs. These experiments will probably lead to more extensive use of this technology for more cultural heritage collections in the future. OAuth is being considered as the authentication layer.

The experimental API exposes part of the newspaper collection through a Swagger API at http://labs.statsbiblioteket.dk/labsapi/api//api-docs?url=/labsapi/api/openapi.yaml#/

An example of usage could be trying to answer the question:

> "How many articles contain Kierkegaard per year?"
This can be answered with the following request:

http://labs.statsbiblioteket.dk/labsapi/api/aviser/stats/timeline?query=kierkegaard&filter=recordBase%3Adoms_aviser&granularity=year&startTime=1666&endTime=2021&elements=articles&elements=pages&elements=editions&structure=header&structure=content&format=CSV

which gives us CSV data with the columns

    "timestamp","pages","articles","editions","pages_percentage","articles_percentage","editions_percentage"

and rows like these:

    ...
    "2004",1729,2004,1393,0.3657981868765405,0.04137638845243635,12.277454609554027
    "2005",4818,5091,3493,1.0608222363136757,0.1106481993641043,32.97460587180213
    "2006",3397,3560,2459,0.6592647364202526,0.07505275829427849,20.9383514986376
    "2007",918,1051,840,0.1997667212144424,0.02631518987373466,6.749156355455568
    "2008",395,441,372,0.2294470616254146,0.02637238040334792,7.101947308132875
    "2009",251,274,223,0.279338934950754,0.03208370803357314,7.590197413206263
    "2010",33,37,28,0.16500825041252062,0.02442340950796731,3.10077519379845
    "2011",45,46,45,0.23651844843897823,0.03234061700272786,6.364922206506365
    "2012",87,87,87,0.33844238699136386,0.04650839556726878,9.613259668508288
    ...

Another request could be for all articles containing both Kierkegaard and Grundtvig:

http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields?query=kierkegaard%20AND%20grundtvig&fields=link&fields=recordID&fields=timestamp&fields=pwa&fields=cer&fields=fulltext_org&fields=pageUUID&fields=editionUUID&fields=titleUUID&fields=editionId&fields=familyId&fields=newspaper_page&fields=newspaper_edition&fields=lplace&fields=location_name&fields=location_coordinates&max=10&structure=header&structure=content&format=JSON

which gives us the full text of the matching articles that are not protected by copyright; this means that they must have been published before 1881. The result is returned as JSON data, including a set of metadata.
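From JupyterLab, such a timeline query could be composed and parsed programmatically. The following is a minimal Python sketch: the endpoint and parameter names mirror the example URL above, and the parsed rows are copied from the sample CSV response, so the sketch runs without network access (the commented-out fetch shows how a live request might look).

```python
# Sketch: build a labsapi timeline query URL and tabulate article counts
# per year from the CSV response. Parameters mirror the example above.
import csv
import io
from urllib.parse import urlencode

BASE = "http://labs.statsbiblioteket.dk/labsapi/api/aviser/stats/timeline"

def timeline_url(query, start=1666, end=2021):
    """Build a yearly-granularity timeline query for the given search term."""
    params = [
        ("query", query),
        ("filter", "recordBase:doms_aviser"),
        ("granularity", "year"),
        ("startTime", start),
        ("endTime", end),
        ("elements", "articles"),
        ("elements", "pages"),
        ("elements", "editions"),
        ("structure", "header"),
        ("structure", "content"),
        ("format", "CSV"),
    ]
    return BASE + "?" + urlencode(params)

# A live request could look like:
#   import urllib.request
#   body = urllib.request.urlopen(timeline_url("kierkegaard")).read().decode()
# Here we instead parse two rows copied from the sample response above.
sample = '''"timestamp","pages","articles","editions","pages_percentage","articles_percentage","editions_percentage"
"2004",1729,2004,1393,0.3657981868765405,0.04137638845243635,12.277454609554027
"2005",4818,5091,3493,1.0608222363136757,0.1106481993641043,32.97460587180213
'''

rows = list(csv.DictReader(io.StringIO(sample)))
# Map year -> number of articles mentioning the search term.
by_year = {r["timestamp"]: int(r["articles"]) for r in rows}
print(by_year["2005"])  # -> 5091
```

The same pattern applies to the export endpoint: only the path and the `fields` parameters change, and `format=JSON` makes the response directly loadable with `json.loads`.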
## Use case #2 -- Complex data source: accessing data from Netarkivet (closed data, no API, research only, KB) ##

:::info
A researcher from the University of Southern Denmark needs to access large and complex Danish web data in UCloud for training a language (ML) model.
:::

This use case assumes:

- that the data size is large, i.e., potentially on the order of terabytes;
- that the researcher knows about the existence of the archive, but knows neither the curational nor the technical aspects of how the data is collected, stored, or accessed when it comes to the collection as data[^1];
- that the researcher has perhaps previously had access to the graphical front end of the Netarkiv, where it is possible to search the complete archive or browse archived versions of web pages.

The steps necessary for gaining access to the collection as data are as follows:

1) Acquire an understanding of the collection as data, including how it is collected and stored. This is necessary as the Netarkiv is complex with regard to both its storage format and how the data is collected. E.g.,
   a. some websites are collected several times a day, while others only a few times a year;
   b. text resources are stored in multiple copies, while e.g. images have only one copy that is referenced on subsequent collections;
   c. modern HTML pages are often more form and scaffolding than content;
   d. what, all in all, is actually collected for the archive.
2) Define the subset of the collection that should be extracted from the archive. This definition should be in the terminology of the archive or have a one-to-one relationship with the terminology of the archive.
   This could be as simple as a Solr query or a more complex set of instructions and computations.
3) Ensure access to a system (for storage and computation) designed for storing sensitive personal data (e.g., UCloud).
4) Sign release agreements, whereby the researcher or their institution becomes data responsible.
5) Define any transformations that are needed, or that could advantageously be performed on the data, during the extraction process.
6) Define the hand-over of the data, e.g., a simple data dump or an API.

_How could this be implemented in a Data-as-a-Service?_

In the Netarkiv application in the DaaS application in UCloud, a researcher should be able to acquire all the necessary information and requirements, and perform all the necessary steps to go from idea, to dialogue with the Royal Library (RL), to getting the data. The above points in detail:

*Ad 1.* This requires either very detailed and up-to-date documentation of the Netarkiv, or an office that can handle inquiries on a case-by-case basis. Given the expected number of requests, this would probably best be implemented as a dedicated office. In time, more and better documentation could be developed.

*Ad 2.* Same as above.

*Ad 3 and 4.* Develop a template for per-institution agreements between the Royal Library (RL) and the Danish universities. Ensure an existing data processing agreement between the relevant parties in the academic sector, including RL.

*Ad 5.* If transformations are needed, they can only be performed by the above-mentioned office under some form of economic agreement based on "time and materials".

*Ad 6.* Handing over the data should be done via an API.

### Example of a previous use case ###

A researcher had a research question concerning the development of user tracking across web sites. To that end, the researcher needed HTML documents and metadata on the collection of each HTML document.
To obtain a fairly representative subset of the collection, a previous research project had defined such a subset, selecting one broad harvest (out of several yearly harvests) from each year from 2006 to 2016. That project had also developed an algorithm for selecting one version of a document when multiple copies of the same document had been collected.

To extract the actual HTML code and metadata, a custom program had to be developed by RL, as only RL can have the necessary privileges to actually perform such an extraction. When the extraction and transformation were complete, the derived data was loaded onto the computation and processing platform DeiC National Cultural Heritage Cluster at RL, which was a forerunner to UCloud.

## Use case #3 -- 'Henrik Ibsens Skrifter' from an open collection without an API ##

:::info
A researcher from Aarhus University needs to access _Henrik Ibsens Skrifter_ in UCloud for stylistic analysis in RStudio.
:::

Henrik Ibsens Skrifter is an open collection hosted by the University of Oslo and available online at https://www.ibsen.uio.no/. The collection is not a proper data set, and it does not provide API access. Instead, it is necessary to create a gold standard data set from the collection as part of the research project. Afterwards, UCloud's DaaS can include the curated data set as part of its services. The original host (University of Oslo) needs to be acknowledged in publications that use the data.

Tasks:

- How do we define and create a data set from a collection?
- Where are data sets that result from research stored?

## Use case #4 -- Proprietary news data with a complicated ToS and API access ##

:::info
A research group from Aarhus University and Copenhagen University needs to access newspaper data in UCloud.
:::

Infomedia (owned by Opoint Technology) is a commercial provider of news media data. Bulk access is possible as a paid service through their API (the cost should be covered by the research project).
Because Infomedia acts as an information broker of news (they hold only a limited right of use), it is necessary that the project obtains the right to use the data for research from the original owners (i.e., the news media companies). This procedure is not standardized, but based on several successful projects' applications, it is possible to draft a template. Furthermore, this use case requires a data sharing agreement between two (or more) Danish universities.
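Use case #3 above asks how a data set can be defined and created from a collection that offers no API. A minimal, hypothetical Python sketch of the per-page extraction step is shown below, using only the standard library. The inline HTML and the record layout are stand-ins for illustration, not the actual structure of the Oslo collection; real pages at https://www.ibsen.uio.no/ would need site-specific handling and crawling logic.

```python
# Sketch: turn one HTML page into a plain curation record (title, text,
# provenance URL) as one building block of a gold standard data set.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the <title> and the visible text of an HTML document."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._skip = 0          # nesting depth inside <script>/<style>
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text_parts.append(data.strip())

def make_record(html, source_url):
    """Extract one curation record, keeping provenance for acknowledgement."""
    parser = TextExtractor()
    parser.feed(html)
    return {
        "source": source_url,
        "title": parser.title,
        "text": " ".join(parser.text_parts),
    }

# Inline stand-in for a collection page (hypothetical content):
page = ("<html><head><title>Et dukkehjem</title></head>"
        "<body><p>NORA. Ja!</p></body></html>")
record = make_record(page, "https://www.ibsen.uio.no/")
print(record["title"])  # -> Et dukkehjem
```

Keeping the `source` field in every record also answers part of the acknowledgement requirement: the provenance of each text travels with the curated data set into UCloud.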