(AM Session) Foundational Open Science Skills (FOSS) Lesson 3: Intro to Data Management

Date: 2020-09-28
Today's Lead Instructor: Jason
Today's Helpers: Michele, Tina, Tyson
Course Website: https://cyverse-learning-materials.github.io/foss
Zoom Link: https://arizona.zoom.us/j/86152278453

Instant Feedback: (please complete before you leave class) Complete Form

Agenda

Warm-up (5 minutes):

Questions & Comments about Project Management left over from last week?

Introduction to Data Management (50 minutes)

Discussion Q/A

BioBreak (5 minutes)

Data Management Cont.

Introduction to CyVerse

https://user.cyverse.org - User Portal (Account Creation)

https://learning.cyverse.org/what_is_cyverse/

Homework Assignment

Work through Self-Paced CyVerse Intro: https://cyverse-learning-materials.github.io/cyverse_mooc/

Intro Data Mgmt self assessment

***If you give your data to a colleague who has not been involved with your project, would they be able to make sense of it? Would they be able to use it properly?

Nope, not a chance. The data is spread between three different programs and we haven't figured out how best to streamline data sharing. But I am here!
Maybe; it would depend on the project, but for the most part I think they might have a hard time figuring out what is going on
I believe they would, as I include detailed metadata for the variables and use a tidy format for the columns and rows
In order for someone to make sense of my data, I will have to provide a readme file that explains how the data is organized. As my data is stored now on my computer, some of it would not be understandable by anybody else.
Maybe. I comment my code well, but…
Maybe, if they have the exact skills to understand it

***If you come back to your own data in five years, will you be able to make sense of it? Will you be able to use it properly?

Yes, if I refer to the code I used for the analysis. I now keep track of everything with Markdown

I think so, relying on my memory of the project attributes
Maybe, not really worrying about that.

Yes. I think so. It would take some time though.

Again, it would depend on the project. I think generally I would be able to make sense of it, but I've certainly stumped myself in these situations before and it's been WAY less than 5 years lol

***When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?

Yes, my data is available. Versions don't make sense to me in this context.

Yes, I've gotten in the habit of submitting my data with all publications and making a version that is understandable

Usually. But it depends how complicated the project is. I use the date in my analysis files YYYYMMDD, and usually the most recent one is correct, but…

Probably not, as I am still getting used to sharing data. I've been a loner and never really needed to share. Plus, my most recent job usually required us not to share.

Usually yes, I try to put all the files (including data) that go into a publication in one folder
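
One answer above mentions putting a YYYYMMDD date in analysis file names. As a minimal sketch of how that convention can help locate the most recent version, the snippet below pulls the date stamp out of each name and picks the latest one. The helper names (`stamp_of`, `latest_version`) and the example file names are hypothetical, not something shown in class.

```python
import re
from datetime import datetime

# Eight consecutive digits that are not part of a longer number.
DATE_STAMP = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def stamp_of(name):
    """Return the first valid YYYYMMDD date found in a file name, or None."""
    for candidate in DATE_STAMP.findall(name):
        try:
            return datetime.strptime(candidate, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a real calendar date
    return None


def latest_version(names):
    """Return the name carrying the most recent date stamp, or None."""
    dated = [(stamp_of(name), name) for name in names]
    dated = [(day, name) for day, name in dated if day is not None]
    return max(dated)[1] if dated else None


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    files = ["analysis_20200915.csv", "analysis_20200928.csv", "notes.txt"]
    print(latest_version(files))
```

As the same answer notes, the most recent file is only usually the correct one, so a date stamp narrows the search but does not replace a record of which version went into the paper.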

Discussion on Data types

In small groups, discuss the following questions. You will be provided with a space for documenting our shared answers.

What are the two or three data types that you most frequently work with?
- Think about the sources (observational, experimental, simulated, compiled/derived)
- Also consider the formats (tabular, sequence, database, image, etc.)

Formats

Excel files, CSV files
REDCap
FASTQ files of sequence data
Qualtrics -> CSV files
shapefiles and rasters

Sources

Experimental
observational
surveys from people
Remotely Sensed imagery - drones, satellite

What is the scale of your data?

samples - small
county-level (hundreds of square miles)
Massive datasets - terabytes

Discussion on Data Strategies

What is your strategy for storing and backing up your data?

external hard drives
Google Drive
UA Box
OneDrive
SyncBack

What is your strategy for verifying the integrity of your data? (i.e., verifying that your data has not been altered)

Use code to run lots of checks
working with mentees - provide training and trust in them
Running checks on datasets

"data testing"
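
One common way to check that files have not been altered is to record a checksum for each file and compare it later. This is a minimal sketch using Python's standard `hashlib`, not a method any participant described; the function names and the idea of keeping a "manifest" dictionary are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed, or modified between two manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

A typical use would be to save the manifest (for example as JSON) when the data is first collected, rebuild it before analysis or publication, and review anything `changed_files` reports.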

What is your strategy for searching your data?

project databases
Windows search bar
organized folder structures
file naming

What is your strategy for sharing (and getting credit for) your data? (i.e., How will you share with your community/clients? How is that sharing documented? How do you evaluate the impact of data shared?)

Repositories; generated DOIs

Question

How do you decide on what metadata to use?

I think about what other potential users might want to use or know for searching
I imagine what I would need to reproduce my own work.
I document meanings for all variables
For spatial data I would want to think about when it was collected, how it was collected, geographic coverage, culturally sensitive areas in the area of interest (AOI)
I use the metadata that best describes the dataset for the other people who will use it
I'm not sure. Maybe use whatever is standard in the field; e.g., photos have EXIF.
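
Several answers above come down to documenting what every variable means. As a rough sketch of how that habit can be checked automatically, the snippet below compares a CSV header against a data dictionary and reports any column without a description. The column names, the dictionary contents, and the function name `undocumented_columns` are all hypothetical.

```python
import csv
import io


def undocumented_columns(csv_text, data_dictionary):
    """Return header columns that have no (non-empty) description."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        column
        for column in header
        if not data_dictionary.get(column, "").strip()
    ]


if __name__ == "__main__":
    # Hypothetical dataset and data dictionary for illustration only.
    sample = "site_id,collected_on,ndvi\nA1,2020-09-28,0.61\n"
    dictionary = {
        "site_id": "Identifier of the sampling site",
        "collected_on": "Collection date in ISO 8601 format (YYYY-MM-DD)",
    }
    print(undocumented_columns(sample, dictionary))
```

Keeping the dictionary in a plain file next to the data (a readme or a small JSON file) means a colleague can read it without running any code.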

Question

What repositories do you use?

CyVerse, of course
Mendeley data
University repository
Figshare

Links shared in class

Ontology: https://en.wikipedia.org/wiki/Ontology_(information_science)
Schema v. Ontology: https://www.w3.org/wiki/SchemaVsOntology
How to FAIR/metadata: https://howtofair.dk/how-to-fair/metadata/
Get your ORCID: https://orcid.org/
Data Testing Tutorial: https://www.guru99.com/data-testing.html
