FOSS Spring-2021

# FOSS Spring-2021

- **Week**: 2
- **Section**: Wednesday

----

## Topic: Data Management

**FOSS Materials/Useful links**

- Instant Feedback (please complete before you leave class):
- Overview of Data management: https://learning.cyverse.org/projects/foss/en/latest/Data_management/overview.html
- FAIR self-assessment: https://ardc.edu.au/resources/working-with-data/fair-data/fair-self-assessment-tool/
- Data Stewardship Wizard: https://ds-wizard.org/

----

### Discussion and notes

**General notes**

FOSS is a buffet experience, but be selective...

- How can we make data more valuable and more effective, with less effort?

---

**Breakout notes**

**Room 1**

Alma, Kangsan, Julia, Amy, Taylor, Ida, Peggy :+1:

Data Management group questions

- Q1: image data, sequence data (FASTA), numeric data, categorical data, text-based data, geodata, code
- Q2: Volume: GB; Velocity: streaming (sensor data), a few times daily, yearly intervals, infrequent single batches; Variety: databases, raw data, corrected data, images, videos
- Q3: Cloud-based options (OneDrive, Google Drive, Box), institutional data storage?, personal computer, hard drives, server storage at the lab
- Q4: some cloud storage can verify integrity, check metadata info
- Q5: using a repository/platform that has a search interface, setting time/location boundary information, CLI-based tools
- Q6: OneDrive sharing option, using a public repository with a standard citation format and metrics; options are spotty

**Room 2**

Reza, Javier, Julia, Monsour, Nicole

Data Management group questions

- Q1: Numeric, text, models (computer code), image, instrument-specific
- Q2: It ranges from a few MB to hundreds of TBs. Velocity of data depends on the volume and also on the temporal availability of the data (online data). A combination of raw data and preprocessed data is used (i.e., database)
- Q3: Server, Google Drive, hard drive, personal computer, HPC storage, external hard drives, OSF, GitHub
- Q4: Verifying that our data has not been altered is done manually; we never do it in a systematic manner.
- Q5: Searching in the repository; writing code to list all the files and then find the one we are looking for if there are too many files in the subdirectory
- Q6: GitHub, OSF, papers, DataVerse (for political sciences)

**Room 3**

Data Management group questions

- Q1: JSON, Excel (csv), point cloud (.ply), Cerner, nucleic acid sequences
- Q2: Cerner has a lot of data regarding health information of patients. There are repeat visits, super expansive. Takes a while to analyze completely with so much info.
- Q3: For coding -> GitHub. For biological data -> external hard drives.
- Q4: Repeatability, third-party validation, checking with already existing literature for consistencies
- Q5: Health records -> asking permission to collect data from hospitals. Searching categories to separate data
- Q6: Public fileshare, mycharts, Bitbucket

**Room 4**

Data Management group questions

- Q1: epi/environ/ (csv), images, surveys (csv), Excel sheets, biological sequences (fastq/fasta), GIS, databases
- Q2:
    - biological data, big size (TBs)
    - 3D point clouds, big size (TBs)
    - tables, csvs, GBs
    - not a lot of DBs
- Q3: hard drive, HPC, GitHub, Google Drive, Box, CyVerse Data Store, Microsoft Cloud (Excel Data)
- Q4: visually check, manually check, checksums, version control (GitHub), pointblank (R package)
- Q5: patience, good documentation, shell, online tools (ArcGIS)
- Q6: ReData (UA), Figshare, OSF; Google Drive link-sharing, CyVerse Data Store, BisQue

**Room 5**

Kyle, Connel, Rachade

Data Management group questions

- Q1: Questionnaire data, automatic logging from interactions with software, data from sensors, biological sequences (fasta), numeric data (i.e. plant measurements)
- Q2: GB, very little in the TB range. Some work ingests very small amounts (KB) of data frequently (i.e. rows in databases). Other work is large captures of data that occur relatively infrequently.
- Q3: Google/One Drive, Dropbox, GitHub, AWS S3, internal shared drives
- Q4: Trust and control over access. GitHub.
- Q5: Python scripts, SQL queries, visualization
- Q6: Only internal sharing

**Room 6**

Data Management group questions

- Q1:
    - RGB images, multispectral/hyperspectral images, LiDAR point cloud data
    - raster datasets
    - tree physiology data
    - DNA sequence data and metadata
    - biometric and behavioral patient data
- Q2:
    - TBs
    - GBs, received infrequently in batches
- Q3:
    - multiple hard drives; computer clusters
    - departmental servers, cloud services
- Q4:
    - visual observation of images; using 3D reconstruction software such as Metashape to find outliers; comparing data size
- Q5:
    - For data cleaning and quality assessment, I conduct both visual and statistical assessments, using plots and standard deviations to identify unusual data points, along with clinically determined cutoffs for extreme changes.
- Q6:
    - Deposited on NCBI
    - GitHub
    - Figshare
    - Zenodo
    - view/download metrics on specific repository sites

---

### Homework Reminders

----
