# FOSS Spring-2021
- **Week**: 2
- **Section**: Wednesday
----
## Topic: Data Management
**FOSS Materials/Useful links**
- Instant Feedback (please complete before you leave class):
- Overview of Data Management: https://learning.cyverse.org/projects/foss/en/latest/Data_management/overview.html
- FAIR self-assessment: https://ardc.edu.au/resources/working-with-data/fair-data/fair-self-assessment-tool/
- Data Stewardship Wizard: https://ds-wizard.org/
----
### Discussion and notes
**General notes**
FOSS is a buffet experience, but be selective...
- How can we make data more valuable and more effective to use, with less effort?
---
**Breakout notes**
**Room 1**
Alma, Kangsan, Julia, Amy, Taylor, Ida, Peggy :+1:
Data Management group questions
- Q1: image data, sequence data (FASTA), numeric data, categorical data, text-based data, geodata, code
- Q2: Volume: GBs; Velocity: streaming (sensor data), a few times daily, yearly intervals, single infrequent batches; Variety: databases, raw data, corrected data, images, videos
- Q3: cloud-based options (OneDrive, Google Drive, Box), institutional data storage (?), personal computers, hard drives, server storage at the lab
- Q4: some cloud storage can verify integrity; check metadata info
- Q5: using a repository/platform that has a search interface, setting time/location boundary information, CLI-based tools
- Q6: OneDrive sharing option; using a public repository w/ standard citation format and metrics; options are spotty
**Room 2**
Reza, Javier, Julia, Monsour, Nicole
Data Management group questions
- Q1: numeric, text, models (computer code), images, instrument-specific formats
- Q2: Size ranges from a few MB to hundreds of TB. Velocity depends on the volume and also on the temporal availability of the data (online data). A combination of raw and preprocessed data is used (e.g., databases).
- Q3: server, Google Drive, hard drive, personal computer, HPC storage, external hard drives, OSF, GitHub
- Q4: Verifying that our data has not been altered is done manually; we never do it in a systematic manner (one way to make it systematic is sketched after this list).
- Q5: searching the repository; writing code to list all the files and find the one we are looking for when a subdirectory holds too many files
- Q6: GitHub, OSF, papers, Dataverse (for political science)
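
Room 2's Q4 and Q5 point at the same gap: integrity checks are manual and file discovery is ad hoc. Below is a minimal sketch of how a short script could make both systematic. It is not the group's actual workflow, and the `data/` directory and `checksums.json` manifest names are assumptions; it walks a data directory, records a SHA-256 checksum per file, and on later runs reports anything changed or missing.

```python
# Minimal sketch (assumed paths): hash every file under a data directory and
# compare against a saved manifest to spot silent changes or missing files.
import hashlib
import json
from pathlib import Path

DATA_DIR = Path("data")            # assumed project data directory
MANIFEST = Path("checksums.json")  # assumed manifest location

def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot() -> dict:
    """Hash every file under DATA_DIR (this doubles as a full file listing)."""
    return {str(p.relative_to(DATA_DIR)): sha256(p)
            for p in sorted(DATA_DIR.rglob("*")) if p.is_file()}

if __name__ == "__main__":
    current = snapshot()
    if MANIFEST.exists():
        saved = json.loads(MANIFEST.read_text())
        changed = [name for name, digest in current.items() if saved.get(name) != digest]
        missing = [name for name in saved if name not in current]
        print("changed:", changed or "none")
        print("missing:", missing or "none")
    else:
        MANIFEST.write_text(json.dumps(current, indent=2))
        print(f"Wrote manifest for {len(current)} files to {MANIFEST}")
```

Run it once to write the manifest, then re-run after a transfer or backup to confirm nothing was silently altered.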
**Room 3**
Data Management group questions
- Q1: JSON, Excel (CSV), point clouds (.ply), Cerner records, nucleic acid sequences
- Q2: Cerner holds a lot of patient health information; with repeat visits it is very expansive and takes a while to analyze completely.
- Q3: for code -> GitHub; for biological data -> external hard drives
- Q4: repeatability, third-party validation, checking against existing literature for consistency
- Q5: health records -> asking permission to collect data from hospitals; searching categories to separate data
- Q6: public fileshare, MyChart, Bitbucket
**Room 4**
Data Management group questions
- Q1: epi/environmental data (CSV), images, surveys (CSV), Excel sheets, biological sequences (FASTQ/FASTA), GIS data, databases
- Q2:
    - biological data: large (TBs)
    - 3D point clouds: large (TBs)
    - tables, CSVs: GBs
    - not many databases
- Q3: hard drive, HPC, GitHub, Google Drive, Box, CyVerse Data Store, Microsoft cloud (Excel data)
- Q4: visual checks, manual checks, checksums, version control (GitHub), pointblank (R package); a small Python analogue of such rule-based checks is sketched after this list
- Q5: patience, good documentation, shell, online tools (ArcGIS)
- Q6: ReDATA (UA), Figshare, OSF; Google Drive link-sharing, CyVerse Data Store, BisQue
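
Room 4's Q4 answer mentions pointblank, an R package for rule-based table validation. The group's actual rules are not in these notes, so here is a loosely analogous, hypothetical sketch in Python/pandas; the `plots.csv` file, every column name, and every range are invented for illustration.

```python
# Hypothetical rule-based checks in pandas, loosely analogous to the kind of
# declarative column validation the group mentioned doing with pointblank in R.
import pandas as pd

df = pd.read_csv("plots.csv")  # assumed measurement table

checks = {
    "height_cm is positive": (df["height_cm"] > 0).all(),
    "latitude in range":     df["latitude"].between(-90, 90).all(),
    "no duplicate plot_id":  not df["plot_id"].duplicated().any(),
    "no missing species":    df["species"].notna().all(),
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

Keeping checks like these under version control (as the Q4 answer also suggests) makes the validation itself documented and repeatable.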
**Room 5**
Kyle, Connel, Rachade
Data Management group questions
- Q1: questionnaire data, automatic logging of interactions with software, data from sensors, biological sequences (FASTA), numeric data (e.g., plant measurements)
- Q2: GBs; very little in the TB range. Some work ingests very small amounts of data (KBs) frequently (e.g., rows in databases); other work involves large captures of data that occur relatively infrequently.
- Q3: Google Drive/OneDrive, Dropbox, GitHub, AWS S3, internal shared drives
- Q4: trust and control over access; GitHub
- Q5: Python scripts, SQL queries, visualization
- Q6: only internal sharing
**Room 6**
Data Management group questions
- Q1:
    - RGB images, multispectral/hyperspectral images, LiDAR point cloud data
    - raster datasets
    - tree physiology data
    - DNA sequence data and metadata
    - biometric and behavioral patient data
- Q2:
    - TBs
    - GBs, received infrequently in batches
- Q3:
    - multiple hard drives, computer clusters
    - departmental servers, cloud services
- Q4:
    - visual observation of images; using 3D reconstruction software such as Metashape to identify outliers; comparing data sizes
- Q5:
    - For data cleaning and quality assessment, I conduct both visual and statistical assessments, using plots and standard deviations to identify unusual data points, along with clinically determined cutoffs for extreme changes (a minimal version of this screen is sketched after these notes).
- Q6:
    - deposited on NCBI
    - GitHub
    - Figshare
    - Zenodo
    - view/download metrics on specific repository sites
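
Room 6's Q5 answer describes flagging unusual points with standard deviations plus clinically determined cutoffs. The sketch below shows that screen in miniature, using simulated numbers rather than any real patient data; the 3-SD threshold and the cutoff of 250 are arbitrary assumptions.

```python
# Minimal sketch of a standard-deviation screen plus a fixed cutoff,
# run on simulated values (no real data; thresholds are assumptions).
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=120, scale=10, size=200)  # simulated measurements
values[5], values[42] = 300.0, -15.0              # planted extreme points

k = 3                # flag points more than k standard deviations from the mean
hard_cutoff = 250.0  # assumed clinically determined upper limit

mean, sd = values.mean(), values.std()
sd_outliers = np.abs(values - mean) > k * sd
cutoff_outliers = values > hard_cutoff

print("std-dev outliers at indices:", np.flatnonzero(sd_outliers))
print("above hard cutoff at indices:", np.flatnonzero(cutoff_outliers))
```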
---
### Homework Reminders
----