# Foundational Open Science Skills (FOSS) Lesson 2: Data Management & Documentation
:::info
**Date**: 09/09/2025
**Today's Lead Instructor:** Michele Cosi
**Today's Helpers:** Tina
**Course Website:** https://foss.cyverse.org/02_managing_data/
**Instant Feedback:** (please complete before you leave the session) [Complete Form](https://docs.google.com/forms/d/1dQbtjGYUpKrYP3zJ5BICwGiPNgQYuBG471EmGzuGh3g)
:::
[toc]
---
::: success
**Questions & Comments about Week 1?**
:::
---
## Self Assessment
:::danger
**Q: If you give your data to a colleague who has not been involved with your project, would they be able to make sense of it? Would they be able to use it properly?**
:::
- They would probably understand the basics, especially if it’s one of my BME 225 classmates, since we all used similar data (like ECG signals). My file names are simple and descriptive, e.g., “ECG_lab2.csv.” However, someone outside the class might not know what each column represents or what the units are.
- Within reason. Those with a background in my field should be familiar with the terminology. Recently some people have been joining from CS or IS fields. Occasionally people will ask questions about preprocessing.
- Yes, if I provide the data dictionary as well. Without the data dictionary, the coding may be a bit confusing.
- Today, it would require a brief training; on their own, it might not make the most sense. I think there would be some usable data, but again, it would be better if I could explain.
- If you follow the principles of an SOP, the data should be understandable to the same degree as the methodology.
- Some of it; it depends on what they already know.
- The final product should be easier to use than raw data, but for the interpretation they might need some understanding of the method used to estimate it and what the metrics mean.
- Yes, I think it is pretty straightforward, but you need to see the whole data.
- If I give the processed data in a nice format, which I have done already, I think so; otherwise I don't think so. If I give the raw data, it won't be understandable.
- If I provide the Statistical Analysis Plan and the Data dictionary, with some background knowledge and experience, they should be able to understand and use the data. If it is a review paper, I will need to provide the protocol document.
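
Several answers above point to a data dictionary as the thing that makes a dataset usable by someone else. As a minimal sketch (not part of the lesson material), the function below drafts a dictionary skeleton from a CSV header so the descriptions and units can be filled in by hand. The function name `draft_data_dictionary`, the output column names, and the assumption that the first row of the file is a header are all illustrative choices.

```python
import csv


def draft_data_dictionary(data_csv, dictionary_csv):
    """Write one row per column of data_csv, leaving blanks for a human to fill in.

    Assumes the first row of data_csv is a header row (hypothetical convention).
    """
    with open(data_csv, newline="") as source:
        reader = csv.reader(source)
        header = next(reader)          # column names
        first_row = next(reader, [])   # one example value per column, if any

    with open(dictionary_csv, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(["column", "example_value", "description", "units"])
        for index, name in enumerate(header):
            example = first_row[index] if index < len(first_row) else ""
            # description and units are intentionally left empty
            writer.writerow([name, example, "", ""])
    return header
```

Running it on a file such as `ECG_lab2.csv` would produce a second CSV with one line per column, which can then be shared alongside the data.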
::: danger
**Q: If you come back to your own data in one year, will you be able to make sense of it? Will you be able to use it properly?**
:::
- If it’s soon after class, yes, because I remember the labs.
- If I had access to the data dictionary, then yes.
- Yes.
- Yes.
- yes
- Yes
- Yes
- Yes, but it is easier when I have documented it
- Yes
- Without clear documentation of the code and proper note-taking, not really.
::: danger
**Q: When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?**
:::
- I haven’t published yet, but even for class projects I sometimes end up with multiple versions (“final.csv,” “final2.csv”). That can get confusing.
- When publishing, I have said that people should reach out to the provider.
- Yes, I always record the dates on which I updated the data. The original version pulled from REDCap remains intact.
- Yes, I try to be organized so I can understand what I did.
- Yes for me, because of how I name my documents.
- Ideally, but I have not yet published. Looking forward to learning best practices.
- It usually is, but sometimes I have saved too many "final" versions.
- I sometimes get lost between the multiple folders, but I have improved the way I store the data with dates and versions.
- I have learned the hard way to be very detailed and organized when keeping updated versions of my paper. Still learning!
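
The "final.csv" / "final2.csv" problem mentioned above is often handled with dated, numbered filenames. The sketch below shows one possible convention (`YYYY-MM-DD_name_vNN.ext`); the pattern and the helper names `versioned_name` and `next_version` are assumptions for illustration, not a standard prescribed by the course.

```python
import re
from datetime import date
from pathlib import Path

# Matches a trailing "_v<number>" in a filename stem, e.g. "..._v02"
VERSION_PATTERN = re.compile(r"_v(\d+)$")


def versioned_name(stem, suffix, version, on=None):
    """Build a name like '2025-09-09_ecg_lab2_v02.csv' (hypothetical convention)."""
    on = on or date.today()
    return f"{on.isoformat()}_{stem}_v{version:02d}{suffix}"


def next_version(folder, stem, suffix):
    """Return one more than the highest version already present in folder."""
    highest = 0
    for path in Path(folder).glob(f"*_{stem}_v*{suffix}"):
        match = VERSION_PATTERN.search(path.stem)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1
```

For example, `versioned_name("ecg_lab2", ".csv", 2, date(2025, 9, 9))` returns `"2025-09-09_ecg_lab2_v02.csv"`. Because the date comes first in ISO format, sorting by name also sorts by date.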
---
## Discussion on data types and strategies
::: danger
**Q1: What are the two or three data types that you most frequently work with?**
- Think about the sources (observational, experimental, simulated, compiled/derived)
- Also consider the formats (tabular, sequence, database, image, etc.)
:::
- Tabular data (CSV/Excel): used for storing biomedical signal values like heart rate or EMG measurements.
  Time-series signals: recorded in Python or MATLAB during labs, often representing analog or digital signals.
  Images (TIFF/PNG): simple biomedical images used in MATLAB for processing practice.
- Observation and model input and output files
- Observational + derived / database + map + image
- Observational, experimental, field data
- Numerical: chemical concentrations, flow rates, intensity. Images: for qualitative interpretation
- Numerical: observations (gridded and point) and model inputs/outputs
- Compiled, text data
- Often it's observational data that is numeric
- Audiovisual: photos, videos, and sound recordings
**Formats**
- Videos (MP4)
- Genomic data (fasta)
- PDF lol
- Data sheets
- NetCDF, GRIB, tables
- .ciff, .xlsx, .py, .csv, .dat
- RAW, TIFF, mp4, .ply, WAV
- Raw data: csv, xlsx
**Sources**
- Public repository for genomic data (https://www.ncbi.nlm.nih.gov/)
- Historical climate and field data: CLIMAS, Forest Service
- National Centers for Environmental Information (NCEI)
- NSF NCAR Geoscience Data Exchange
- Special Collections at the U of A, public databases such as YouTube or Google, directly from the source
- Raw data collection from interviews, observations
- Medical records
- Directly from DOTs, occasionally open-source data
- American Mineralogist Crystal Structure Database
::: danger
**Q2: What is the scale of your data?**
:::
Add a `+` to each category you deal with:
- <1MB: + +
- 1MB-100MB: + +
- 100MB-1GB:
- 1GB-10GB:
- 10GB-50GB:
- 50GB-100GB:
- 100GB-1TB: +
- 1TB<:
- 1MB - 5TB
::: danger
**Q3: What is your strategy for storing and backing up your data?**
:::
- Right now, I save everything on my laptop and Google Drive. Google Drive automatically backs up, which is good protection against losing files.
- My own private HD
- Multiple SSDs with data redundancy
- Multiple clouds
- Google Drive, Box, and HD
- University Box Health and Box, mostly
- Using Box and Google Drive at the moment
- External Drives, HPC
::: danger
**Q4: What is your strategy for verifying the integrity of your data? (i.e., verifying that your data has not been altered)**
:::
- At the moment, I just open the file and check if it looks correct. That’s fine for small student datasets.
- checksums
- At data collection for the self-administered interviews I included a check question which helped me weed out double counting
- Obtaining historical data directly from sources
- have a secondary researcher perform statistical analysis using a separate methodology
- Review previous studies that use the data, visualize it, and run basic statistical analysis
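
Since "checksums" came up above, here is a minimal sketch of what that can look like in practice using Python's standard `hashlib` module: record a SHA-256 digest for every file once, then re-check later to see which files changed or went missing. The function names and the manifest layout (`digest`, two spaces, relative path) are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_path):
    """Record one 'digest  relative/path' line per file under data_dir."""
    data_dir = Path(data_dir)
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(data_dir).as_posix()}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")


def verify_manifest(data_dir, manifest_path):
    """Return relative paths that are missing or whose digest no longer matches."""
    data_dir = Path(data_dir)
    changed = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split("  ", 1)
        target = data_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(rel_path)
    return changed
```

Keep the manifest file outside `data_dir` (or alongside a backup copy) so it is not hashed as part of the data itself. An empty list from `verify_manifest` means every listed file still matches its recorded digest; files added after the manifest was written are not reported.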
::: danger
**Q5: What is your strategy for searching your data?**
:::
- Command + F
- rummage through the BOX!
- I organize data into folders by class, project, and workshop, I also use descriptive filenames with dates. Right now I rely on my computer’s search bar.
- I use a specific structure between directories, names, etc. so I know where it is located or how to
- Depends. In MS Excel I use Ctrl + F, and in SAS I use exploration.
- Organize in folders with specific dates. Also have specific SSD for a certain project
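
For those relying on descriptive filenames and the computer's search bar, the same two kinds of search (by filename and by file contents) can be scripted. This is only a sketch using the standard library; the function names and the default list of text suffixes are assumptions, and it is not meant to replace a proper index or database for large collections.

```python
from pathlib import Path


def find_by_name(root, keyword):
    """Return files under root whose name contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and keyword in p.name.lower()
    )


def find_in_text(root, needle, suffixes=(".csv", ".txt", ".md")):
    """Return (path, line_number, line) for each line containing needle.

    Only files with one of the given suffixes are opened (assumed to be text).
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path, number, line.rstrip("\n")))
    return hits
```

The name search benefits directly from the dated, descriptive filenames several people mentioned; the content search is the scripted equivalent of Ctrl + F across a whole folder.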
::: danger
**Q6: What is your strategy for sharing (and getting credit for) your data? (i.e., How will you share with your community/clients? How is that sharing documented? How do you evaluate the impact of your shared data?)**
:::
- For class, I just share files with Google Drive links or email, we sometimes have discussions, class presentations. I don’t have a system for credit since these are just assignments.
- Exhibitions, printed materials, panel discussions, website
- Publishing
- QR code to a PDF
---
## Licenses
::: danger
**Q: Have you ever specifically applied a license to your data or other IP?**
:::
- No, not yet.
- No
::: danger
**Q: Do you have a favorite license type?**
:::
- N
---
::: success
**Instant Feedback:** (please complete before you leave the session) [Complete Form](https://docs.google.com/forms/d/1dQbtjGYUpKrYP3zJ5BICwGiPNgQYuBG471EmGzuGh3g)
:::