# 1. Introduction
## 1.1. Data management
The vast amounts of data generated by scientific research create new difficulties in storing, curating, and distributing that data. Data management has therefore become a challenging and crucial topic for the scientific community. Nowadays, almost all research generates data, requires storage, re-uses previously collected data, and relies on synchronization and backups. However, many scientists have lost data because they did not regularly back it up to external storage, or have been confused by the outputs and processing steps of other scientists because metadata was missing. Different labeling standards, metadata forms, and cloud backups have been developed to address these problems, but they solve them only partially. Finding data among a large number of files, keeping confidential data safe, and making data accessible to other users are further problems that either remain unsolved or are handled manually, which is time-consuming. In such situations, data management software and protocols help scientific communities apply the FAIR (Findable, Accessible, Interoperable, and Reusable) principles to their data and address the problems mentioned above. Software such as Yoda, ManGO, CKAN (Comprehensive Knowledge Archive Network), and iRODS (integrated Rule-Oriented Data System) has been developed to meet scientists' data management requirements, and each has its pros and cons. iRODS is an open-source tool that helps researchers manage their data in diverse ways, such as user-specific access levels, metadata development, and cloud storage for backup and synchronization, and it can be customized to the needs of an organization.
Compared to other data management software such as CKAN, iRODS offers data virtualization for seamless access to distributed resources, the ability to enforce data policies, scalability and performance in large-scale data environments, robust metadata management, customization, and integration capabilities. However, iRODS is operated from a command-line terminal with numerous icommands. This interface is difficult for non-experts and requires a lot of support from developers, in contrast to user-friendly software like CKAN. Combining iRODS with a user-friendly interface such as ManGO and with SURF's capacity to host big-data, large-scale projects would therefore provide a FAIR, quick, and easily manageable solution. In this pilot project for the Delft University of Technology, we plan to integrate ManGO, powered by iRODS, with the SURF hosting service to provide a better data management package for two prospective projects: (a) floating wind energy, and (b) DAPWell for geothermal energy. The following sections explain this pilot project and its usage in detail.
## 1.2. ManGO
ManGO is a new data management service for active research data. It is based on iRODS, open-source software for advanced data management, and adds a user-friendly web-based interface. With ManGO, researchers can store, describe, and share their research data and automate data workflows in a secure and efficient way. This **[LINK](https://www.kuleuven.be/rdm/en/news/mango-launched)** from KU Leuven, the developer of ManGO, provides further details.
:::info
:bulb: **Some of ManGO's capabilities are listed below:**
1. Storing data on reliable, secure, hosted systems.
2. Describing data and files using metadata, which can be added using templates, via automatic metadata extraction, or completely manually.
3. Automating data workflows.
4. Sharing files with other users and groups inside and outside the university.
:::
# 2. Necessity
## 2.1. Managing data
Nowadays, most research projects have follow-up projects. This leads to the acquisition of more data, more processing, a larger storage demand, and the involvement of different kinds of expertise. DAPWell and floating wind energy are two projects with a very large number of datasets. For instance, in the DAPWell project, [**different types of data**](https://surfdrive.surf.nl/files/index.php/s/ZeqmW0GWafmW2wM) will be collected, uploaded, downloaded, processed, and used by technicians, engineers, data stewards, data managers, graduate and undergraduate students, Ph.D. candidates, postdoctoral researchers, and professors. The data types include: (a) hundreds of rock samples, which will undergo various experimental analyses and produce data in [**different amounts and formats**](https://surfdrive.surf.nl/files/index.php/s/KpyyI6qgG4UUL9q) (e.g., .pdf, .csv, .ascii, .xlsx, .img, .jpeg); (b) CT-scan images of rock samples, estimated to produce 20 TB of data in the first 2 years; (c) distributed acoustic sensors (DAS), which will produce 40 TB of data within one year; (d) geophysical monitoring, estimated to produce 3 TB of data per year; and (e) several engineering and geological data types (borehole logging, real-time operations, well tests, lithological logs, etc.), produced at a rate on the order of 50 GB per year. A summary of the information model can be found [**here**](https://surfdrive.surf.nl/files/index.php/s/aLhSVFUgnnollU6). Therefore, storing and managing active data is a priority for the DAPWell project.
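
The volumes quoted above can be combined into a rough storage estimate. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: every stream grows at a constant yearly rate, the DAS, geophysical, and engineering rates stay the same in later years, and 1 TB = 1000 GB. Data from the rock-sample experiments is left out because no volume is given for it. The names `YEARLY_TB` and `estimated_storage_tb` are illustrative.

```python
# Rough yearly data volumes in TB, taken from the figures quoted above.
# Assumptions: constant rates over time; 1 TB = 1000 GB.
YEARLY_TB = {
    "ct_scans": 20 / 2,        # 20 TB over the first 2 years
    "das": 40.0,               # 40 TB within one year
    "geophysical": 3.0,        # 3 TB per year
    "engineering": 50 / 1000,  # on the order of 50 GB per year
}


def estimated_storage_tb(years: float) -> float:
    """Return the cumulative volume in TB after `years`, at constant rates."""
    return sum(rate * years for rate in YEARLY_TB.values())


if __name__ == "__main__":
    for years in (1, 2):
        print(f"after {years} year(s): {estimated_storage_tb(years):.2f} TB")
```

Under these assumptions the total is roughly 53 TB after one year and roughly 106 TB after two, before counting any replicas or backup copies.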
The data management challenges of these two projects include: data acquisition and collection, uploading to and downloading from a database, safe storage and synchronization, encryption, metadata development and labeling, granting access permissions to users, managing users and their access, sharing data, and providing support. Users will have different levels of access to the system (data upload, download, share, search, view, and edit), as determined by the data steward(s) and data manager(s). By using a customized ManGO powered by iRODS and SURF, this project will help users and departments cover all of the above items and minimize potential errors and the loss of data, time, and money.
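
One way to picture the access levels listed above is as a set of flags per role. The sketch below is illustrative only: the role names and the actions assigned to each role are hypothetical. In ManGO the actual permissions are granted by the data stewards and data managers, not defined in code like this.

```python
from enum import Flag, auto


class Action(Flag):
    """The six access levels named in the text."""
    UPLOAD = auto()
    DOWNLOAD = auto()
    SHARE = auto()
    SEARCH = auto()
    VIEW = auto()
    EDIT = auto()


# Hypothetical role-to-action mapping, for illustration only.
ROLE_ACTIONS = {
    "student": Action.SEARCH | Action.VIEW | Action.DOWNLOAD,
    "researcher": Action.SEARCH | Action.VIEW | Action.DOWNLOAD | Action.UPLOAD,
    "data_manager": (Action.UPLOAD | Action.DOWNLOAD | Action.SHARE
                     | Action.SEARCH | Action.VIEW | Action.EDIT),
}


def is_allowed(role: str, action: Action) -> bool:
    """Return True if `role` has been granted `action`; unknown roles get nothing."""
    granted = ROLE_ACTIONS.get(role, Action(0))
    return action in granted
```

Defaulting unknown roles to no permissions keeps the sketch conservative: access has to be granted explicitly.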
## 2.2. ManGO for DAPWell data management
ManGO's user-friendly interface will assist and accelerate data management, including storing and processing measured data. Several features make ManGO suitable for big projects like DAPWell: access from anywhere at any time, previews of data and its metadata (e.g., size, modified date, content, format, and owner), different levels of access to data, and the ability to search for a specific item among a huge amount of data. DAPWell is a complex project that will last ten years or more, and it therefore requires a well-organized data management plan and software. With the abilities briefly mentioned above and explained comprehensively in the following section, ManGO can bring different data and users together so that they benefit from the project's outputs without concerns about data management.
# 3. Structure of ManGO
ManGO, powered by iRODS and hosted by SURF, has the following tabs as its main structure:
:::info
:bulb: **ManGO's main menu includes:**
1. Collections
2. Search
3. Metadata
4. Group admin
5. Trash
6. Main admin
![](https://hackmd.io/_uploads/H1gxE9-F2.png)
:::
The following subsections describe the function of each tab in detail, as a tutorial on how to use ManGO for data management.
## 3.1. Collections
Within the Collections tab, users can create, copy, move, or delete a collection. Each collection has a name and is listed with its owner, creation time, modification time, and size. It is also possible to add metadata to the collection and to decide who has access to it using the permission tab.
![](https://hackmd.io/_uploads/B1AkvqZYh.png)
### 3.1.1. Data properties
Clicking on a data object shows further information about it, such as its owner, creation and modification dates, size, internal ID, status, and checksum (used to verify data integrity).
![](https://hackmd.io/_uploads/HJeOdcWF3.png)
### 3.1.2. Data permission
Within the permission tab, data owners can edit who has access to the uploaded data by adding or removing already defined users.
![](https://hackmd.io/_uploads/BkIzK9ZK2.png)
### 3.1.3. Data preview
This section shows a preview of uploaded data in specific formats, such as .jpg, provided the file is smaller than 200 MB.
**Data info:** *The image is a NASA ASTER satellite image of an area in Iran, processed by Fardad and presented as a false-color composite (not true colors).*
![](https://hackmd.io/_uploads/ry0zq5bt3.png)
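
The preview rule described above (a supported format and a size below 200 MB) can be sketched as a simple check. This is not ManGO's implementation. The text only names .jpg, so the other extensions in `PREVIEW_EXTENSIONS` are guesses, and reading "200 MB" as 200 × 10^6 bytes is also an assumption.

```python
from pathlib import PurePosixPath

# Assumptions: the text only names .jpg; the other extensions are guesses,
# and "200 MB" is read as 200 * 1000 * 1000 bytes.
PREVIEW_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MAX_PREVIEW_BYTES = 200 * 1000 * 1000


def can_preview(name: str, size_bytes: int) -> bool:
    """Return True if a file has a previewable format and is under the size limit."""
    extension = PurePosixPath(name).suffix.lower()
    return extension in PREVIEW_EXTENSIONS and size_bytes < MAX_PREVIEW_BYTES
```

For example, a 5 MB `.JPG` file would pass this check, while a 250 MB `.jpg` file or a `.csv` file of any size would not.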
### 3.1.4. Metadata extraction
This tab shows properties extracted from the uploaded data, such as data precision, type, resolution, number of pixels, and version.
![](https://hackmd.io/_uploads/rys-sq-Y2.png)
## 3.2. Search
After you fill in the available options in the search section, such as type, collection, name, creation date, and metadata (name, value, and unit), ManGO finds the specific data you are looking for.
![](https://hackmd.io/_uploads/Skfq25bYh.png)
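
Conceptually, the search form combines its filled-in fields with a logical AND. The sketch below illustrates that idea on an in-memory list and is not how ManGO or iRODS queries the catalogue. The `DataObject` class, the `search` function, and the catalogue entries are all hypothetical, and the "type" filter is omitted for brevity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DataObject:
    name: str
    collection: str
    created: date
    # Metadata as (name, value, unit) triples, as in the search form.
    metadata: list = field(default_factory=list)


def search(objects, name_contains: Optional[str] = None,
           collection: Optional[str] = None,
           created_on_or_after: Optional[date] = None,
           metadata: Optional[tuple] = None):
    """Return the objects that match every filter that was filled in."""
    results = []
    for obj in objects:
        if name_contains is not None and name_contains not in obj.name:
            continue
        if collection is not None and obj.collection != collection:
            continue
        if created_on_or_after is not None and obj.created < created_on_or_after:
            continue
        if metadata is not None and metadata not in obj.metadata:
            continue
        results.append(obj)
    return results


# Hypothetical catalogue entries, for illustration only.
CATALOGUE = [
    DataObject("core_01.csv", "/dapwell/rock_samples", date(2023, 5, 2),
               [("porosity", "0.21", "fraction")]),
    DataObject("scan_01.img", "/dapwell/ct_scans", date(2023, 6, 15),
               [("resolution", "30", "micrometre")]),
]
```

Leaving a filter empty (`None`) means it is ignored, so `search(CATALOGUE)` returns everything and each additional field narrows the result.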
## 3.3. Metadata schemas
Here, data managers and data stewards define the labeling format and access permissions. They can choose predefined labeling standards or define their own format.
![](https://hackmd.io/_uploads/H1Oha9ZF2.png)
## 3.4. Group administration
This section shows the different groups of users, including data managers, admins, researchers, etc. Admins can define new groups and add users to or remove users from each group.
![](https://hackmd.io/_uploads/r1xWko-K3.png)
## 3.5. Trash
This folder contains temporarily deleted data. The data remains here and can be recovered within 30 days. After 30 days, it is automatically and permanently deleted.
![](https://hackmd.io/_uploads/S1qdyiZth.png)
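
The 30-day recovery window can be expressed as a small date calculation. The sketch below assumes the window is exactly 30 × 24 hours counted from the moment of deletion. How ManGO actually counts the days (calendar days, time zone, time of the clean-up job) is not stated here, so treat the function names and the cut-off as illustrative.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # retention period stated in the text


def permanent_deletion_time(deleted_at: datetime) -> datetime:
    """Return the moment after which a trashed item can no longer be recovered."""
    return deleted_at + RETENTION


def is_recoverable(deleted_at: datetime, now: datetime) -> bool:
    """Return True while the item is still within the 30-day window."""
    return now < permanent_deletion_time(deleted_at)
```

Under this assumption, data deleted on 1 July at 12:00 stays recoverable until 31 July at 12:00.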
## 3.6. Main admin
This section is only available to developers and support members and is not visible to other users. If a problem occurs, the main admins are responsible for troubleshooting it.
# 4. Protocols and scenarios
The steps and scenarios for managing project data with iRODS are listed below.
:::success
:thumbsup: iRODS data management process
```mermaid
graph LR
Data-->Login-->Upload-->iRODS
```
```mermaid
graph LR
virtualization-->Discovery-->Metadata-->Statistics
```
:::
:::success
:thumbsup: Data manager
```mermaid
graph LR
User_definition-->Access-->Upload/Download/Share
```
:::
# 5. iRODS icommands cheat sheet
This section is for developers and iRODS users who want to use icommands directly in their projects, where required. The table below summarizes important iRODS icommands. For more information on iRODS installation, icommands, and code, please check this [**LINK**](https://github.com/irods/irods_training/blob/main/beginner/irods_beginner_training_2023.pdf/).
| icommand | Function |
| -------- | -------- |
| -h | Show help for an icommand (e.g., `ils -h`) |
| iinit and ienv | Start an iRODS session (log in) and show the current iRODS environment settings |
| iexit | End the iRODS session |
| iuserinfo | Show information about a user |
| iadmin mkuser | Create a user |
| iadmin moduser password | Set or change a user's password |
| ichmod | Give a user permission to download your files |
| ipwd | Show the current collection (working directory) |
| icd | Change the current collection |
| ils | List the data objects and sub-collections stored in a collection |
| imkdir | Make a collection |
| ilsresc | List iRODS resources |
| ifsck | Check whether local files or directories are consistent with the associated data objects or collections in iRODS |
| ilocate | Search for data objects by name |
| imcoll | Manage mounted iRODS collections |
| ichksum | Checksum one or more data objects or collections |
| ichmod | Modify access to data objects and collections |
| imiscsvrinfo | Connect to the server and retrieve basic server information |
| ireg | Register a file or a directory in iRODS |
| iput | Upload data from a local directory into iRODS |
| iget | Copy data from an iRODS collection to a local directory |
| irm | Remove data from a collection (moved to the trash unless forced) |
| imv | Move or rename data within iRODS (for example, to move deleted data out of the trash) |
| irepl | Replicate data objects to another storage resource |
| ibun | Upload and download structured (e.g., tar) files |
| iphymv | Physically move a data object to another storage resource |
| icp | Copy data within iRODS |
| irsync | Synchronize data between a local directory and iRODS, or between collections |
| iscan | Check whether local files or directories are registered in iRODS |
| ierror | Convert an iRODS error code to text |
| isysmeta | Show or modify system metadata |
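
As an illustration of how a few of these icommands fit together, the sketch below builds the command lines for a simple upload workflow without executing them. It uses `imeta`, a standard icommand that is not listed in the table, to attach one metadata attribute. The function name, the zone, the paths, the user name, and the `project`/`DAPWell` metadata pair are all hypothetical, and the sequence is one possible workflow, not a prescribed procedure.

```python
import shlex


def upload_workflow(local_file: str, collection: str, user: str) -> list:
    """Build (but do not run) icommand lines for a simple upload workflow."""
    file_name = local_file.rsplit("/", 1)[-1]
    data_object = f"{collection}/{file_name}"
    commands = [
        ["imkdir", "-p", collection],            # create the collection
        ["iput", "-K", local_file, collection],  # upload and verify the checksum
        ["imeta", "add", "-d", data_object,      # attach one metadata attribute
         "project", "DAPWell"],
        ["ichmod", "read", user, data_object],   # let `user` download the file
        ["ils", "-L", collection],               # list the result in long format
    ]
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Hypothetical zone, path, and user names.
    for line in upload_workflow("scans/core_01.img",
                                "/exampleZone/home/alice/ct", "bob"):
        print(line)
```

Printing the commands first makes it possible to review them before running them in a terminal where an iRODS session has been started with `iinit`.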