
(AM Session) Foundational Open Science Skills (FOSS) Lesson 3: Intro to Data Management

Date: 2020-09-28
Today's Lead Instructor: Jason
Today's Helpers: Michele, Tina, Tyson
Course Website: https://cyverse-learning-materials.github.io/foss
Zoom Link: https://arizona.zoom.us/j/86152278453

Instant Feedback: (please complete before you leave class) Complete Form

Agenda

Warm-up (5 minutes):

Questions & comments about Project Management left over from last week?

Introduction to Data Management (50 minutes)

Discussion Q/A

BioBreak (5 minutes)

Data Management Cont.


Introduction to CyVerse

https://user.cyverse.org - User Portal (Account Creation)

https://learning.cyverse.org/what_is_cyverse/

Homework Assignment

Work through Self-Paced CyVerse Intro: https://cyverse-learning-materials.github.io/cyverse_mooc/

Intro Data Mgmt self assessment

***If you give your data to a colleague who has not been involved with your project, would they be able to make sense of it? Would they be able to use it properly?

  • Nope, not a chance. The data is spread between three different programs and we haven't figured how to best streamline data sharing. But I am here!

  • Maybe; it would depend on the project, but for the most part I think they might have a hard time figuring out what is going on

  • I believe they would as I include detailed metadata for the variables and tidy format for the columns and rows

  • In order to make sense of my data I will have to provide some readme file that explains how data is organized. As my data is stored now in my computer, some data would not be understandable by anybody else.

  • Maybe. I comment my code well, but

  • Maybe, if they have the exact skills to understand it

***If you come back to your own data in five years, will you be able to make sense of it? Will you be able to use it properly?

  • Yes, if I refer to the code I used for the analysis. I now keep track of everything with Markdown.

  • I think so, having memory of the project attributes.

  • Maybe; I'm not really worrying about that.

  • Yes, I think so. It would take some time, though.

  • Again, it would depend on the project. I think generally I would be able to make sense of it, but I've certainly stumped myself in these situations before and it's been WAY less than 5 years lol

***When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?

  • No

  • Yes, my data is available. Versions don't make sense to me in this context.

  • Yes, I've gotten in the habit of submitting my data with all publications and making a version that is understandable

  • Usually. But it depends how complicated the project is. I use the date in my analysis files YYYYMMDD, and usually the most recent one is correct, but

  • Probably not, as I am still getting used to sharing data. I've been a loner and never really needed to share. Plus, my most recent job usually required us not to share.

  • Usually yes, I try to put all the files (including data) that go into a publication in one folder
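
One response above mentions putting a YYYYMMDD date in analysis file names and treating the most recent one as current. A minimal sketch of that convention in Python follows; the function names and the example file names are hypothetical, and it assumes the date appears as a standalone run of eight digits somewhere in the name.

```python
import re
from datetime import datetime

# Eight digits that are not part of a longer run of digits (assumed convention).
DATE_PATTERN = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def date_in_name(filename):
    """Return the YYYYMMDD date embedded in a file name, or None if absent or invalid."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date, e.g. 20201340.
        return None


def latest_by_date(filenames):
    """Return the file name carrying the most recent embedded date, or None."""
    dated = [(date_in_name(name), name) for name in filenames]
    dated = [(day, name) for day, name in dated if day is not None]
    if not dated:
        return None
    return max(dated)[1]


# Hypothetical file names for illustration only.
files = ["analysis_20200115.csv", "analysis_20200928.csv", "notes.txt"]
print(latest_by_date(files))  # analysis_20200928.csv
```

Because YYYYMMDD sorts chronologically, a plain alphabetical sort of such names gives the same order; parsing the date simply guards against names with no date or an invalid one.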

Discussion on Data types

In small groups, discuss the following questions. You will be provided with a space for documenting our shared answers.

  1. What are the two or three data types that you most frequently work with?
    • Think about the sources (observational, experimental, simulated, compiled/derived)
    • Also consider the formats (tabular, sequence, database, image, etc.)

Formats

  • Excel files, CSV files
  • REDCap
  • FASTQ files of sequence data
  • Qualtrics -> CSV files
  • shapefiles and rasters

Sources

  • Experimental
  • observational
  • surveys from people
  • Remotely Sensed imagery - drones, satellite
  2. What is the scale of your data?
  • samples - small
  • county-level (hundreds of square miles)
  • Massive datasets - terabytes

Discussion on Data Strategies

  1. What is your strategy for storing and backing up your data?
  • external hard drives
  • Google Drive
  • UA Box
  • OneDrive
  • SyncBack
  2. What is your strategy for verifying the integrity of your data? (i.e., verifying that your data has not been altered)
  • Use code to run lots of checks
  • working with mentees - provide training and trust in them
  • Running checks on datasets
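
One common way to check that files have not been altered is to record a checksum for each file and compare against it later. The sketch below is only an illustration of that idea using Python's standard library, not a method described in class; the function names are hypothetical, and it assumes the data lives in a single directory tree.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, manifest):
    """List manifest entries that are now missing or have a different checksum."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A typical use would be to call `build_manifest` once when the data is finalized, save the result (for example as JSON next to the data), and run `changed_files` before each analysis or before sharing. Note that this sketch reports only changed or missing files, not newly added ones.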

"data testing"

  3. What is your strategy for searching your data?
  • project databases
  • Windows search bar
  • organized folder structures
  • file naming
  4. What is your strategy for sharing (and getting credit for) your data? (i.e., How will you share with your community/clients? How is that sharing documented? How do you evaluate the impact of data shared?)
  • Repositories; generated DOIs

Question

How do you decide on what metadata to use?

  • I think about what other potential users might want to use or know for searching

  • I imagine what I would need to reproduce my own work.

  • I document meanings for all variables

  • For spatial data I would want to think about when it was collected, how it was collected, geographic coverage, culturally sensitive areas in the AOI

  • I use the metadata that best describes the dataset for the other people who will use it

  • I'm not sure. Maybe use whatever is standard in the field; e.g., photos have EXIF.
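
Several answers above come down to documenting what every variable means. One way to make that checkable is to keep a small data dictionary and test that every column in a table has a description. This is a minimal sketch under that assumption; the function name, column names, and descriptions are hypothetical.

```python
import csv
import io


def undocumented_columns(csv_text, data_dictionary):
    """Return header columns that have no non-empty description in the dictionary."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if not data_dictionary.get(name, "").strip()]


# Hypothetical table and data dictionary for illustration only.
table = "site_id,temp_c,notes\n1,20.5,ok\n"
dictionary = {
    "site_id": "Unique site identifier",
    "temp_c": "Air temperature in degrees Celsius",
}
print(undocumented_columns(table, dictionary))  # ['notes']
```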

Question

What repositories do you use?

  • CyVerse, of course
  • Mendeley data
  • University repository
  • Figshare