# FOSS Spring-2021
- **Week**: 2
- **Section**: Tuesday
----
## Topic: Data Management
**FOSS Materials**
- Overview of Data management: https://learning.cyverse.org/projects/foss/en/latest/Data_management/overview.html
- FAIR self-assessment: https://ardc.edu.au/resources/working-with-data/fair-data/fair-self-assessment-tool/
- Data Stewardship Wizard: https://ds-wizard.org/
----
### Discussion and notes
Just a few points for each question - enough to get a sense of what people in each room discussed.
You can make bullet points with `-` at the beginning of a line. Adding an empty line between blocks makes non-bulleted text easier to read when HackMD renders the formatting.
**Room 1**
- Q1 - Surveys and text
- Q2 - GBs and TBs; collection takes on the order of days
- Q3 - Dropbox, Google Drive, Box, GitHub
- Q4 - Using apps to collect the information, so no one has to enter it manually into a computer
- Q5 - Excel and Notepad
**Room 2**
- Q1: Sequencing, animal and plant behavior, imaging (animal brains & satellite), questionnaire and health data
- Q2: Large (TBs)
- Q3: Box
- Q4: Beyond basic QC, we do not really have one
- Q5:
**Room 3**
Data types (2 or 3 per person):
- Genomic
- Meteorological data
- Carbon flux
- Environmental variables
- Air quality
- LIDAR data (canopy structure, biodiversity)
- Image data
- Text, numeric data
What is the scale of your data (volume, the files themselves, velocity, variety)?
- Kayla: Relatively small, 3-4 spreadsheets per summer, 4-5 variables
- Ayanna: Hard to answer; several MB, two types, including raw data
- Justina: 5 parameters, one uniform data type from mobile data sensors; MB scale
- Nick: 60 students producing 6 images per lab
- Jesse: MB to GB scale; velocity not that high
Strategies for storing/backing up:
- Kayla: Google Drive to GitHub
- Ayanna: GitHub, Box Drive
- Nick: Google Drive
- Justina: Google Drive
- Jesse: NCBI, OSF, GitHub to some extent
Verifying integrity:
- Kayla: OpenRefine; checks data range values against a standard (see the range-check sketch after this list)
- Ayanna: For rice data, uses standards to make sure values are within range; NEON data are already QC'd
- Justina: Just started and not yet working on big data; only uses data from peer-reviewed journals
- Nick: Would like to get to this
- Jesse: Visualizing data to see if anything is wonky
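
Several of the strategies above reduce to checking whether values fall inside an expected range. The sketch below shows that idea in Python with pandas; the column names, ranges, and file name are invented placeholders, not anyone's actual QC workflow.

```python
# Minimal range-check sketch (assumed column names and ranges, not a real standard).
import pandas as pd

# Hypothetical acceptable ranges for a few example variables
EXPECTED_RANGES = {
    "air_temp_c": (-40, 55),
    "relative_humidity_pct": (0, 100),
    "leaf_area_cm2": (0, 500),
}

def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where any monitored column falls outside its expected range."""
    mask = pd.Series(False, index=df.index)
    for column, (low, high) in EXPECTED_RANGES.items():
        if column in df.columns:
            mask |= ~df[column].between(low, high)
    return df[mask]

if __name__ == "__main__":
    data = pd.read_csv("field_measurements.csv")  # hypothetical input file
    problems = flag_out_of_range(data)
    print(f"{len(problems)} of {len(data)} rows fall outside the expected ranges")
```

Flagged rows can then be inspected by hand (e.g., in OpenRefine) rather than silently dropped.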
Searching for data:
- PubMed / Google Scholar
**Room 4**
Data types:
- MRI data - 50 GB - 2/sec
- Genetic SNPs
- Public opinion data - couple of GB - monthly
- Social media
- Economic
- Aerial photography - 1 TB - annual
Storage:
- Dropbox
- Google Cloud
- Google BigQuery
Integrity:
- Preprocessed by the scientist
- Share raw data
**Room 5:** lots of plant biologists but also other research types
- genomics data, leaf physiology
- data scale: gigabytes to terabytes
- some databases used; some raw files
- field data + sequencing data
- data storage on departmental servers, Google Drive, R packages
- searching data: Excel
**Room 6:**
- Question 1 - Large amounts of images, point cloud data, environmental data, spectral data
- Question 2 - 10 TB a day
- Question 3 - CyVerse infrastructure: the Data Commons and Data Store. We also have structured metadata
- Question 4 - Data is protected with user privileges. Not too sure, but we are told it's protected
- Question 5 - All raw data have a unique identifier, and we use metadata to identify them; timestamps are also in the file names (a small naming sketch follows this list)
- Question 6 - Shared through CyVerse, using user interfaces to aid in visualization; code is Dockerized/containerized
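
As an illustration of the naming scheme mentioned under Question 5 (unique identifier plus timestamp in the file name, with metadata kept alongside), here is a minimal Python sketch; the directory layout, metadata fields, and instrument label are hypothetical and not CyVerse's actual convention.

```python
# Minimal sketch of unique, timestamped raw-file names with a JSON metadata sidecar.
# Paths, field names, and the instrument label are hypothetical examples.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def register_raw_file(source: Path, out_dir: Path, instrument: str) -> Path:
    """Move a raw file into out_dir under a timestamped unique name and write a sidecar."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    unique_id = uuid.uuid4().hex[:8]
    target = out_dir / f"{timestamp}_{unique_id}{source.suffix}"

    out_dir.mkdir(parents=True, exist_ok=True)
    source.rename(target)  # move the raw file into the managed directory

    sidecar = target.parent / (target.name + ".json")
    sidecar.write_text(json.dumps({
        "original_name": source.name,
        "instrument": instrument,
        "collected_utc": timestamp,
        "id": unique_id,
    }, indent=2))
    return target

# Example call (hypothetical paths):
# register_raw_file(Path("scan001.tif"), Path("data/raw"), instrument="drone-rgb")
```

Keeping the sidecar next to the file means the identifier, timestamp, and original name survive even if the file is later moved between storage systems.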
----
### Team work
----