# Data Science Fellows Spring 2020 - Statistics Sprint
## Day 5: April 6 2020
#### Agenda:
1. Go through Google Slides on more Bayesian statistics and networks
2. Spin up container on CyVerse
3. Go through final markdown file
#### What are some of the advantages of using a probabilistic statistical framework?
- Works well with large amounts of observations
- Networks are sexy
#### What are some of the disadvantages of using a probabilistic statistical framework?
- Does not work well with a small number of observations (stochastic framework)
- Relies on good data; prone to the garbage in, garbage out problem
#### What have you heard about Markov Chains?
- Used them extensively for protein domain annotations (HMMs)
- Gene prediction with hidden Markov models
- Often used by (simple) robots in state modeling / motion planning
- Financial modeling: stock and financial asset price prediction
- PageRank from Google
[Lesson 4: Extending Bayes into Networks](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson4_BayesianNetworks.md)
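
Several of the answers above (HMMs, robot state modeling, PageRank) rest on the same core idea: a system moves between states according to fixed transition probabilities, and after many steps the probability of being in each state settles to a stationary distribution. The sketch below illustrates that idea in plain Python rather than the R used in the lessons, only to keep it dependency-free. The three weather states and every probability in `TRANSITIONS` are made up for illustration and are not taken from the lesson.

```python
# Hypothetical 3-state weather chain; all probabilities are invented for illustration.
STATES = ["sunny", "cloudy", "rainy"]

# TRANSITIONS[i][j] = P(next state is j | current state is i); each row sums to 1.
TRANSITIONS = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]


def step(dist, transitions):
    """Advance a probability distribution over states by one time step."""
    n = len(transitions)
    return [sum(dist[i] * transitions[i][j] for i in range(n)) for j in range(n)]


def stationary(transitions, tol=1e-12, max_iter=10_000):
    """Start from a uniform distribution and step until it stops changing."""
    n = len(transitions)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        new = step(dist, transitions)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


pi = stationary(TRANSITIONS)
for name, p in zip(STATES, pi):
    print(f"{name}: {p:.3f}")
```

PageRank applies the same repeated-stepping idea to a graph of web pages, with an added damping factor that this sketch leaves out.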
---
## Day 4: April 3 2020
#### Agenda:
1. Go through Google Slides on clustering and dimensionality
2. Open CyVerse Container
3. Go through content in markdown file
#### What do you think "data clustering" or "data partitioning" is?
- Clustering: identifying meaningful groups of data items; Partitioning: identifying meaningful divisions of a mathematical space
- Identifying the underlying structure of the data.
- Data clustering: identifying features in datasets that are used to measure similarity/distance across those datasets
- Combining a set of variables or cases (depending on the problem) into groups, so that each item is closer to its assigned group than to the other groups according to some features
- Data organization
- Identifying similar groups in your dataset; a type of unsupervised learning.
[Lesson 3: Dimensionality](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson3_Dimensionality.md)
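
The answers above describe clustering as grouping items so that each one is closer to its own group than to any other. One common way to do that is k-means, sketched here in plain Python (not the R used in the lessons) under some loud assumptions: the ten 2-D points are invented, the number of clusters `k` is chosen by hand, and the starting centers come from a simple farthest-point rule picked here for determinism rather than from any method named in the lesson.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: dist2(p, centers[c])) for p in points]


def kmeans(points, k, n_iter=20):
    # Farthest-point initialization (an illustrative, deterministic choice):
    # start at the first point, then keep adding the point farthest from all centers so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))

    for _ in range(n_iter):
        labels = assign(points, centers)  # assignment step
        for c in range(k):                # update step: move each center to its cluster mean
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assign(points, centers), centers


# Two made-up, well-separated groups of 2-D points.
blob_a = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (0.9, 1.4), (1.1, 0.7)]
blob_b = [(8.0, 8.3), (8.4, 7.9), (7.7, 8.1), (8.2, 8.6), (7.9, 7.8)]
points = blob_a + blob_b

labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

No group labels are given to the algorithm, which is what makes this unsupervised: the grouping comes only from the distances between points.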
---
## Day 3: April 2 2020
#### Agenda:
1. Go through Google Slides for data intuition
2. Spin up rstudio-stats on CyVerse
3. Go through content on GitHub markdown file (linked below)
[Lesson 2: Study Designs and Data Distributions](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson2_DistributionsAndData.md)
---
## Day 2: April 1 2020
#### Agenda:
1. Go through Google Slides for background on probability and regression
2. Spin up rstudio-stats on CyVerse
3. Go through content on GitHub markdown file (linked below)
#### What do you think about when you hear "probability theory"? Or just "probability"?
- Odds, betting, chance
- Creating predictions about future data based on current data
- The math behind creating distributions and asking the question "what is the chance that...?"
- Probability distributions, the central limit theorem (CLT), and the law of large numbers (LLN)
- Providing a degree of certainty regarding some outcome
- How likely it is that an event will occur in a random experiment
[Lesson 1: Probability and Regression](https://github.com/rbartelme/rstudio-stats/blob/master/lessons/Lesson1_Probability_Regression.md)
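
Two of the ideas named in the answers above, the law of large numbers (LLN) and the central limit theorem (CLT), can be seen in a short simulation. This is a minimal sketch in plain Python rather than the R used in the lessons; the fair six-sided die, the seed, and the sample sizes are all arbitrary choices made here for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so repeated runs give the same numbers


def roll_mean(n):
    """Mean of n rolls of a fair six-sided die."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))


# LLN: the sample mean drifts toward the true mean (3.5) as n grows.
for n in (10, 1_000):
    print(f"mean of {n} rolls: {roll_mean(n):.3f}")
big_mean = roll_mean(100_000)
print(f"mean of 100000 rolls: {big_mean:.3f}")

# CLT: means of many size-30 samples cluster around 3.5 with a spread
# close to sigma / sqrt(30), where sigma^2 = 35/12 for a fair die.
means = [roll_mean(30) for _ in range(2_000)]
observed_sd = statistics.stdev(means)
theoretical_sd = math.sqrt(35 / 12) / math.sqrt(30)
print(f"observed sd of sample means:    {observed_sd:.3f}")
print(f"theoretical sd (sigma/sqrt(n)): {theoretical_sd:.3f}")
```

A histogram of `means` would look roughly bell-shaped even though a single die roll is uniformly distributed, which is the practical content of the CLT.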
---
## Day 1: March 30 2020
#### Agenda:
1. Go through short Google Slides for statistics background
2. Spin up rstudio-stats application on CyVerse DE
3. Run through the Carpentries Introduction to R
#### List some topics or concepts you know from statistics:
- Ex. Central limit theorem
- Exploratory stats (mean, median, mode, standard deviation, etc.)
- Sampling and inference
- Visualizing data properties without graphics
- Get a bunch of numbers. Do stuff with numbers. Mostly, develop a hypothesis and a null hypothesis, and then test.
- t-tests
- ANOVA, Tukey HSD, tests for determining significance of values
- lines of best fit, R-squared values
- Nobody really understands p-values
#### Introducing the R Language
[Carpentries Introduction to R](https://datacarpentry.org/R-genomics/01-intro-to-R.html)
#### Outcomes:
- Assess what everyone already knows about statistics
- Gain familiarity with R before we get into lessons 1-4
---
## Day 0: March 26 2020
* Post your initials under here as a sub-bullet to let me know if you got the invite and were able to access this file:
    * RB
    * HE
    * MO
    * GA
    * EL
    * AB
---