# Data Science Fellows Spring 2020 - Statistics Sprint

## Day 5: April 6 2020

#### Agenda:

1. Go through Google Slides on more Bayesian statistics and networks
2. Spin up container on CyVerse
3. Go through final markdown file

#### What are some of the advantages of using a probabilistic statistical framework?

- Works well with large amounts of observations
- Networks are sexy

#### What are some of the disadvantages of using a probabilistic statistical framework?

- Does not work well with a small number of observations (stochastic framework)
- Relies on good data; prone to the garbage in, garbage out problem

#### What have you heard about Markov Chains?

- Used them extensively for protein domain annotations (HMMs)
- Gene prediction with hidden Markov models
- Often used by (simple) robots in state modeling / motion planning
- Financial modelling: stock and financial asset price prediction
- PageRank from Google

[Lesson 4: Extending Bayes into Networks](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson4_BayesianNetworks.md)

---

## Day 4: April 3 2020

#### Agenda:

1. Go through Google Slides on clustering and dimensionality
2. Open CyVerse container
3. Go through content in markdown file

#### What do you think "data clustering" or "data partitioning" is?

- Clustering: identifying meaningful groups of data items; Partitioning: identifying meaningful divisions of a mathematical space
- Identifying the underlying structure of the data.
- Data clustering: identifying features in datasets that are used to measure similarity/distance across those datasets
- Combining a set of variables or cases (depends) into similar groups according to some features that make them closer to the assigned group than to the groups not assigned
- Data organization: identifying similar groups in your dataset. A type of unsupervised clustering.

[Lesson 3: Dimensionality](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson3_Dimensionality.md)

---

## Day 3: April 2 2020

#### Agenda:

1. Go through Google Slides for data intuition
2. Spin up rstudio-stats on CyVerse
3. Go through content on GitHub markdown file (linked below)

[Lesson 2: Study Designs and Data Distributions](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson2_DistributionsAndData.md)

---

## Day 2: April 1 2020

#### Agenda:

1. Go through Google Slides for background on probability and regression
2. Spin up rstudio-stats on CyVerse
3. Go through content on GitHub markdown file (linked below)

#### What do you think about when you hear "probability theory"? Or just "probability"?

- Odds, betting, chance
- Creating predictions about future data based on current data
- The math behind creating distributions and asking the question "what is the chance that..."
- Probability distributions, CLT and LLN
- Providing a degree of certainty regarding some outcome
- How likely it is that an event will occur in a random experiment

[Lesson 1: Probability and Regression](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson1_Probability_Regression.md)

---

## Day 1: March 30 2020

#### Agenda:

1. Go through short Google Slides for statistics background
2. Spin up rstudio-stats application on CyVerse DE
3. Run through the Carpentries Introduction to R

#### List some topics or concepts you know from statistics:

- Ex. Central limit theorem
- Exploratory stats (mean, median, mode, standard deviation, etc.)
- Sampling and inference
- Visualizing data properties without graphics
- Get a bunch of numbers. Do stuff with numbers. Mostly, develop a hypothesis and a null hypothesis, and then test.
- t-tests
- ANOVA, TukeyHSD, tests for determining the significance of values
- Lines of best fit, R-squared values
- Nobody really understands p-values

#### Introducing the R Language

[Carpentries Introduction to R](https://datacarpentry.org/R-genomics/01-intro-to-R.html)

#### Outcomes:

- Assess what everyone already knows about statistics
- Gain familiarity with R before we get into lessons 1-4

---

## Day 0: March 26 2020

* Post your initials under here as a sub-bullet to let me know if you got the invite and were able to access this file:
    * RB
    * HE
    * MO
    * GA
    * EL
    * AB

---