FOSS Spring-2021

  • Week: 2
  • Section: Wednesday

Topic: Data Management

FOSS Materials/Useful links


Discussion and notes

General notes

FOSS is a buffet experience, but be selective

  • How can we make data more valuable, so that it is more effective to use and takes less effort?

Breakout notes

Room 1 Alma, Kangsan, Julia, Amy, Taylor, Ida, Peggy

Data Management group questions

  • Q1- image data, sequence data (FASTA), numeric data, categorical data, text-based data, geodata, code

  • Q2- Volume: GB; Velocity: streaming (sensor data), a few times daily, yearly intervals, or a single infrequent batch; Variety: databases, raw data, corrected data, images, videos

  • Q3 - Cloud-based options (OneDrive, Google Drive, Box), institutional data storage?, personal computers, hard drives, server storage at the lab

  • Q4- some cloud storage can verify integrity; check metadata info

  • Q5 - using a repository/platform that has search interface, set time/location boundary information. CLI-based tools

  • Q6 - OneDrive sharing option; using a public repository with a standard citation format and metrics; options are spotty

Room 2

Reza, Javier, Julia, Monsour, Nicole

Data Management group questions

  • Q1 Numeric, text, models (computer code), image, Instrument-specific
  • Q2 It ranges from a few MB to hundreds of TBs. Velocity of data depends on the volume and also on the temporal availability of the data (online data). A combination of raw and preprocessed data is used (i.e., database)
  • Q3 Server, Google Drive, Hard Drive, Personal computer, HPC storage, External hard drives, OSF, GitHub
  • Q4 Verifying that our data has not been altered is done manually; we never do it in a systematic manner.
  • Q5 Searching in the repository; writing code to list all the files and then find the one we are looking for when there are too many files in a subdirectory
  • Q6 GitHub, OSF, papers, DataVerse (for political sciences)
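
Room 2's Q5 answer mentions writing code to list all files and then pick out the one needed. A minimal sketch of that idea in Python, using only the standard library, might look like the following. The function name `find_files` and the example pattern are illustrative assumptions, not something the group specified.

```python
from pathlib import Path


def find_files(root, pattern):
    """Recursively list files under `root` whose names match a glob pattern.

    `find_files` is a hypothetical helper name used for illustration only.
    """
    # rglob walks every subdirectory; is_file() skips directories that
    # happen to match the pattern.
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Example: show every CSV file below the current directory.
    for path in find_files(".", "*.csv"):
        print(path)
```

Sorting the results keeps the listing stable between runs, which makes it easier to compare two listings by eye or with a diff tool.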

Room 3

Data Management group questions

  • Q1: JSON, excel (csv), point cloud (.ply), Cerner, Nucleic acid sequences
  • Q2: Cerner has a lot of data regarding health information of patients. There are repeat visits, so it is super expansive. It takes a while to analyze completely with so much info.
  • Q3: For coding -> GitHub. For biological data -> external hard drives.
  • Q4: Repeatability, third party validation, checking with already existing literature for consistencies
  • Q5: Health records -> asking permission to collect data from hospitals. Searching categories to separate data
  • Q6: Public fileshare, mycharts, bitbucket

Room 4

Data Management group questions

  • Q1: epi/environ/ (csv), images, surveys (csv), Excel sheets, biological sequences (fastq/fasta), GIS, databases
  • Q2:
    • biological data - big size (TBs)
    • 3D point clouds, big size (TBs)
    • tables, csvs, GBs
    • not a lot of DBs
  • Q3: hard drive, HPC, GitHub, Google Drive, Box, CyVerse Data Store, Microsoft Cloud (Excel Data)
  • Q4: visually check, manually check, checksums, version control (GitHub), pointblank (R Package)
  • Q5: patience, good documentation, shell, Online Tools (ArcGIS)
  • Q6: ReData (UA), Figshare, OSF, Google Drive link-sharing, CyVerse Data Store, BisQue
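
Room 4 lists checksums as one way to verify that data has not changed. As a rough sketch of how that could be done systematically, the Python below records a SHA-256 checksum for every file in a directory and later reports which files differ. The helper names (`sha256_of`, `build_manifest`, `changed_files`) and the choice of SHA-256 are assumptions made for illustration; the group did not name a specific tool or algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed, or altered since `manifest` was built."""
    current = build_manifest(root)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

One possible workflow is to build the manifest when the data first arrives, store it alongside the data (for example as JSON), and call `changed_files` before each analysis.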

Room 5 Kyle, Connel, Rachade

Data Management group questions

  • Q1 Questionnaire data, automatic logging from interactions with software, data from sensors, biological sequences (fasta), numeric data (e.g. plant measurements)
  • Q2 GB; very little in the TB range. Some work ingests very small amounts (KB) of data frequently (e.g. rows in databases). Other work is large captures of data that occur relatively infrequently.
  • Q3 Google Drive/OneDrive, Dropbox, GitHub, AWS S3, internal shared drives
  • Q4 Trust and control over access. GitHub.
  • Q5 Python scripts, SQL queries, Visualization
  • Q6 Only internal sharing

Room 6

Data Management group questions

  • Q1:
    • RGB images, multispectral/hyperspectral images, LiDAR point cloud data;
    • raster datasets,
    • tree physiology data
    • DNA sequence data and metadata
    • biometric and behavioral patient data
  • Q2
    • TBs
    • GBs, received infrequently in batches
  • Q3
    • multiple hard drives; computer clusters;
    • departmental servers, cloud services
  • Q4
    • visual observation of images; using 3D reconstruction software such as Metashape to find outliers; comparing data size
  • Q5
    • For data cleaning and quality assessment, I conduct both visual and statistical assessments, using plots and standard deviations to identify unusual data points, along with clinically determined cutoffs for extreme changes.
  • Q6
    • Deposited on NCBI
    • GitHub
    • Figshare
    • Zenodo
    • view/download metrics on specific repository sites
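
Room 6's Q5 answer describes using standard deviations to identify unusual data points. A minimal sketch of that check in Python could look like this. The function name `flag_outliers` and the default threshold of 2 standard deviations are assumptions for illustration; the notes do not state which threshold was used, and the clinically determined cutoffs mentioned above are not modeled here.

```python
import statistics


def flag_outliers(values, z_threshold=2.0):
    """Return the indices of values lying more than `z_threshold`
    sample standard deviations away from the mean.

    The threshold is a hypothetical default, not a recommendation.
    """
    if len(values) < 2:
        return []  # a spread cannot be estimated from fewer than two points
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [
        i for i, v in enumerate(values)
        if abs(v - mean) / spread > z_threshold
    ]
```

Because a single extreme value also inflates the standard deviation, a check like this works best alongside the visual inspection of plots that the same answer describes.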

Homework Reminders


Select a repo