Foundational Open Science Skills (FOSS) Lesson 2: Intro to Data Management

Date: 09/12/2024
Today's Lead Instructor: Michele Cosi
Today's Helpers: Jeff, Tina
Course Website: https://foss.cyverse.org/02_managing_data/
Hack(pad)-of-Hack(pads): https://hackmd.io/-4TgToyFRU2eX7lmDZLRnQ

Instant Feedback: (please complete before you leave class) Complete Form


Questions & Comments about Week 1?


Self Assessment 1: answer these questions with y or n

Q: If you give your data to a colleague who has not been involved with your project, would they be able to make sense of it? Would they be able to use it properly?

I don't think so. I should do a better job of working on my code and data.
I think others can use the data, but I have to provide a data directory. In addition, I have to inform them about the ethical concerns.
Yes, I make sure to leave detailed documentation. I also find file folder organization to be an important tool in making things clear to others.
I think final versions that I post in repos would be understandable by others, but stuff that I am currently working on is probably only understandable by me and maybe god.
No, not at this point because I have not been as organized as I could have been but I am working on making it more accessible to others. I am hoping the skills I learn through R4R will help!
No, not at all, unless we provide proper documentation of the data, including collection background and purpose of use.
I don't think so, unless I give them the data dictionary.
Yes, I'm pretty obsessed with data documentation so typically my data files and folders are pretty decently assembled

Yes, I create a spreadsheet that contains the name of my raw data (as downloaded), the name that I give it for the project, any transformations I did to the data, and the subsequent name of the resulting data from the transformation (this is all spatial data). All data is contained in a raw_data folder and a geodatabase for the project.
No, I don't think so. They need my assistance in understanding the data.
I would hope so, but it depends on how well the data is organized and the comments that are made to explain that organization. So, yes, assuming their background is appropriate and it's well organized.

Q: If you come back to your own data in one year, will you be able to make sense of it? Will you be able to use it properly?

Yes - I recently had to return to a large amount of project data after a long (~2 year) review process and was able to quickly find what I needed to respond to reviewer comments.
Generally yes. It might take me a moment to refamiliarize myself with the project, but I have never been unable to restart a project or understand my code.
Yes, I would use it properly.
Yes, I oftentimes have to reuse data in projects and have become efficient in finding it.
Yes, it will make sense to me after one year.
Yes. I'm pretty consistent with my data organization habits.
It depends on whether you have properly documented the data; otherwise you may miss information.

Yes, I will, as long as I have the code that I used previously.

Yes, as long as I have commented and organized it well. It's difficult to remember to make comments about things you do quickly, but the time it takes to comment something will save so much more time in the future.

Q: When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?

In most cases - I run into problems when dealing with figures that involve collaborator data; I am OK on my own but sometimes it becomes unclear where certain things come from.
Yes.
It may take me a bit of time to gather the correct versions of everything, but yes, it is not too hard to find the correct versions.

Discussion on data types and strategies: Breakout 1

Q1: What are the two or three data types that you most frequently work with?

  • Think about the sources (observational, experimental, simulated, compiled/derived)
  • Also consider the formats (tabular, sequence, database, image, etc.)
  • observation and model input and output files
    observational+derived / database+map+image

Formats

Molecular simulation inputs and outputs: PDB, PSF, COR, DCD
Protein sequences and alignments
Experimental derived +
Observational +
Genomic (fasta, RNA-seq, fastq)
Previously obtained data
Remote Sensing (Planetscope) and outputs: Maps(GeoTIFF)
Observational data (surveys, interviews, deliberations)
Spatial data (vector and raster), crowd-sourced data (Strava, cell phone, geotagged photos), and qualitative data (interviews, textual data)
Simulated
json
GeoTIFFs, .las
Observational and Model Generated/Simulated (NetCDF), GeoTIFF; mostly derived data from remote sensing, or direct observation

Observational and Experimental data
International Consortium for Atmospheric Research on Transport and Transformation (ICARTT)
csv
HDF5
CSV files mostly
Unstructured text, archival data

Sources

Q2: What is the scale of your data?

100's of GBs to TBs
10s of GBs for GeoTIFFs, much less for tabular data
100 GB (observational, 1Hz)
TBs

Q3: What is your strategy for storing and backing up your data?

Primary data storage on a compute cluster, backups on external drives +
HPC or external drive; some on CyVerse
Most of my data are stored in UAbox and on my computer
Laptop, external drives, OneDrive, Mule (provided by the lab).
I pay for Google Drive and Box :(

Q4: What is your strategy for verifying the integrity of your data? (i.e., verifying that your data has not been altered)

checksums
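
As a rough illustration of the checksum approach, the sketch below records a SHA-256 digest for every file in a data folder and later reports which files no longer match. The function names (sha256_of, write_manifest, verify_manifest) and the manifest layout are assumptions made for this example, not something prescribed in class; the "digest, two spaces, path" layout simply mirrors what the sha256sum tool writes. Keep the manifest outside the data folder so it does not end up checksumming itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_path):
    """Write one 'digest  relative/path' line per file found under data_dir."""
    data_dir = Path(data_dir)
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    lines = [
        f"{sha256_of(p)}  {p.relative_to(data_dir).as_posix()}" for p in files
    ]
    Path(manifest_path).write_text("\n".join(lines) + "\n")


def verify_manifest(data_dir, manifest_path):
    """Return relative paths that are missing or whose digest has changed."""
    data_dir = Path(data_dir)
    changed = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        recorded, rel_path = line.split("  ", 1)
        target = data_dir / rel_path
        if not target.is_file() or sha256_of(target) != recorded:
            changed.append(rel_path)
    return changed
```

A typical pattern would be to call write_manifest once after the raw data is downloaded, then run verify_manifest after each transfer or restore from backup; an empty list means every recorded file is unchanged.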

Q5: What is your strategy for searching your data?

Search by file type if you can't remember exactly what you named it, e.g. *.tif
find command (names of files)
ROI (Place_name) > Event Names > Dates > Sensors
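
The search strategies above (by file type, by file name) can be sketched in a few lines of Python using only the standard library. The helper names find_by_extension and find_by_name are hypothetical, chosen for this example; the same result is available from the find command mentioned above. A consistent folder hierarchy such as the ROI, event, date, sensor layout makes the returned paths self-describing.

```python
from pathlib import Path


def find_by_extension(root, extension):
    """Return sorted file paths under root with the given extension, ignoring case."""
    wanted = extension.lower()
    if not wanted.startswith("."):
        wanted = "." + wanted
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == wanted
    )


def find_by_name(root, pattern):
    """Return sorted file paths under root whose name matches a glob pattern."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

For example, find_by_extension("raw_data", "tif") would list every .tif or .TIF file below a raw_data folder, and find_by_name("raw_data", "*2021*") would list files with 2021 anywhere in the name.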

Q6: What is your strategy for sharing (and getting credit for) your data? (i.e., How will you share with your community/clients? How is that sharing documented? How do you evaluate the impact of your shared data?)


Licenses

Q: Have you ever specifically applied a license to your data or other IP?

No
Yes, I use MIT for my code repos.
No experience with obtaining a license for data or other IP. However, my mentors have a license through Creative Commons for their work on Indigenous Data Governance. https://creativecommons.org/licenses/by-nc-nd/4.0/
No
No experience, other than seeing the licenses on GitHub and wondering how to use them. I understand the idea, but didn't know that it was this simple to apply.
Yes, Creative Commons for a manuscript.

Q: Do you have a favorite license type?


Instant Feedback: (please complete before you leave class) Complete Form
