# (AM Session) Foundational Open Science Skills (FOSS) Lesson 3: Intro to Data Management
:::info
**Date**: `2020-09-28`
**Today Lead Instructor:** Jason
**Today Helpers:** Michele, Tina, Tyson
**Course Website:** https://cyverse-learning-materials.github.io/foss
**Zoom Link:** https://arizona.zoom.us/j/86152278453
**Instant Feedback:** (please complete before you leave class) [Complete Form](https://docs.google.com/forms/d/e/1FAIpQLSeVyEB8sU99Mn4IuzQ561Crp7v_wDl-yEcD2iutBxXRfrHo-Q/viewform?usp=sf_link)
:::
## :stopwatch: Agenda
### Warm-up (5 minutes):
#### Questions & Comments about Project Management left over from last week?
## Introduction to Data Management (50 minutes)
## Discussion Q/A
## :toilet: BioBreak (5 minutes) :coffee: :tea:
## Data Management Cont.
---
## Introduction to CyVerse
https://user.cyverse.org - User Portal (Account Creation)
https://learning.cyverse.org/what_is_cyverse/
### Homework Assignment
Work through Self-Paced CyVerse Intro: https://cyverse-learning-materials.github.io/cyverse_mooc/
### Intro Data Mgmt self assessment
:::info
***If you give your data to a colleague who has not been involved with your project, would they be able to make sense of it? Would they be able to use it properly?***
:::
- Nope, not a chance. The data is spread between three different programs and we haven't figured out how best to streamline data sharing. But I am here!
- Maybe; it would depend on the project, but for the most part I think they might have a hard time figuring out what is going on
- I believe they would as I include detailed metadata for the variables and tidy format for the columns and rows
- In order to make sense of my data I will have to provide some readme file that explains how data is organized. As my data is stored now in my computer, some data would not be understandable by anybody else.
- Maybe. I comment my code well, but...
- Maybe, if they have the exact skills to understand it
:::info
***If you come back to your own data in five years, will you be able to make sense of it? Will you be able to use it properly?***
:::
- Yes, if I refer to the code I used for the analysis. I now keep track of everything with Markdown.
- I think so, having memory of the project attributes
- Maybe, not really worrying about that.
- Yes. I think so. It would take some time though.
- Again, it would depend on the project. I think generally I would be able to make sense of it, but I've certainly stumped myself in these situations before and it's been WAY less than 5 years lol
:::info
***When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?***
:::
- No
- Yes, my data is available. "Versions" doesn't make sense to me in this context.
- Yes, I've gotten in the habit of submitting my data with all publications and making a version that is understandable
- Usually. But it depends how complicated the project is. I use the date in my analysis files YYYYMMDD, and usually the most recent one is correct, but...
- Probably not, as I am still getting used to sharing data. I've been a loner and never really needed to share. Plus, my most recent job usually required us not to share.
- Usually yes, I try to put all the files (including data) that go into a publication in a folder
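
One answer above mentions putting a `YYYYMMDD` date in analysis file names and treating the most recent one as current. A minimal sketch of that idea follows. It assumes a hypothetical naming pattern `<name>_<YYYYMMDD>.<ext>`; the pattern and the function names are illustrative, not something used in class.

```python
import re
from datetime import datetime

# Hypothetical naming convention: <name>_<YYYYMMDD>.<ext>
DATED_NAME = re.compile(r"^(?P<stem>.+)_(?P<stamp>\d{8})\.(?P<ext>\w+)$")


def parse_stamp(filename):
    """Return the date embedded in the file name, or None if there is none."""
    match = DATED_NAME.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


def latest_version(filenames):
    """Return the file name with the most recent embedded date, or None."""
    dated = [(parse_stamp(name), name) for name in filenames]
    dated = [(stamp, name) for stamp, name in dated if stamp is not None]
    if not dated:
        return None
    return max(dated)[1]


if __name__ == "__main__":
    files = ["analysis_20200901.csv", "analysis_20200928.csv", "notes.txt"]
    print(latest_version(files))
```

As the answer hints with its trailing "but...", the newest date is not always the correct version, so a sketch like this only narrows the search. Version control or a changelog is a more reliable record.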
## Discussion on Data types
In small groups, discuss the following questions. You will be provided with a space for documenting our shared answers.
1. What are the two or three data types that you most frequently work with?
    - Think about the sources (observational, experimental, simulated, compiled/derived)
    - Also consider the formats (tabular, sequence, database, image, etc.)
Formats
- Excel files
- CSV files
- [REDCap](https://www.project-redcap.org/)
- FASTQ files of sequence data
- Qualtrics -> csv files
- shape files and rasters
Sources
- Experimental
- observational
- surveys from people
- Remotely Sensed imagery - drones, satellite
2. What is the scale of your data?
- samples - small
- county-level (hundreds of square miles)
- Massive datasets - terabytes
## Discussion on Data Strategies
3. What is your strategy for storing and backing up your data?
- external hard drives
- Google Drive
- UA Box
- OneDrive
- [SyncBack](https://www.2brightsparks.com/)
4. What is your strategy for verifying the integrity of your data? (i.e., verifying that your data has not been altered)
- Use code to run lots of checks
- working with mentees - provide training and trust in them
- Running checks on datasets
- ["data testing"](https://www.guru99.com/data-testing.html)
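
One common way to check that files have not been altered is to record a checksum for each file and compare it again later. The sketch below uses SHA-256 from Python's standard `hashlib`. The function names and the manifest layout (relative path mapped to hex digest) are assumptions made for illustration, not a method presented in class.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (by relative path) to its checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, manifest):
    """List files that were added, removed, or modified since the manifest."""
    current = build_manifest(data_dir)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

A manifest built once (for example, right after data collection) can be saved next to the data and rechecked before each analysis. This only detects that bytes changed. It complements, rather than replaces, the content checks mentioned in the answers above.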
5. What is your strategy for searching your data?
- project databases
- Windows search bar
- organized folder structures
- file naming
6. What is your strategy for sharing (and getting credit for) your data? (i.e., How will you share with your community/clients? How is that sharing documented? How do you evaluate the impact of data shared?)
- Repositories; generated DOIs
### Question
How do you decide on what metadata to use?
- I think about what other potential users might want to use or know for searching
- I imagine what I would need to reproduce my own work.
- I document meanings for all variables
- For spatial data I would want to think about when it was collected, _how_ it was collected, geographic coverage, culturally sensitive areas in the AOI
- I use the data that best describes the dataset for use by other people
- I'm not sure. Maybe use whatever is standard in the field; e.g., photos have EXIF.
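
The answers above (variable meanings, when and how the data was collected, geographic coverage) could be captured in a small metadata record stored next to the data. The sketch below checks such a record for gaps. The field names are hypothetical and do not follow any particular standard. Where a field standard exists, as one answer suggests, it is likely the better choice.

```python
import json

# Hypothetical minimum set of fields, drawn from the answers above; not a standard.
REQUIRED_FIELDS = (
    "title",
    "collected_on",
    "collection_method",
    "geographic_coverage",
    "variables",
)


def missing_fields(metadata):
    """List required fields that are absent or empty in the metadata record."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


def undocumented_columns(metadata, columns):
    """List data columns that have no description under 'variables'."""
    described = metadata.get("variables") or {}
    return [column for column in columns if not described.get(column)]


# Made-up example record for illustration only.
example = {
    "title": "Example drone survey",
    "collected_on": "2020-09-28",
    "collection_method": "drone imagery, hypothetical flight plan",
    "geographic_coverage": "hypothetical study area",
    "variables": {
        "site_id": "Identifier of the sampling site",
        "temp_c": "Air temperature in degrees Celsius",
    },
}

if __name__ == "__main__":
    print(json.dumps(example, indent=2))
    print(missing_fields(example))
    print(undocumented_columns(example, ["site_id", "temp_c", "notes"]))
```

Saving the record as JSON (for example with `json.dumps`) keeps it readable by both people and programs.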
### Question
What repositories do you use?
- CyVerse, of course
- Mendeley data
- University repository
- Figshare
### Links shared in class
- Ontology: https://en.wikipedia.org/wiki/Ontology_(information_science)
- Schema v. Ontology: https://www.w3.org/wiki/SchemaVsOntology
- How to FAIR/metadata: https://howtofair.dk/how-to-fair/metadata/
- Get your ORCID: https://orcid.org/
- Data Testing Tutorial: https://www.guru99.com/data-testing.html