---
tags: ggg, ggg2020, ggg298
---
# GGG 298 - Week 2
[toc]
## Wednesday Lab Outline - 1/15
[See UNIX tutorial](https://github.com/ngs-docs/2020-GGG298/tree/master/Week2-UNIX_for_file_manipulation)
## Friday Discussion - 1/17
* check on farm account login status - has everyone logged in?
* reminder to install conda: [instructions](https://hackmd.io/PZJuNsFOTWKLWuJgmymu_Q)
Some topics to cover --
* Expectation maximization, iterative algorithms, and optimization problems
* [wikipedia for EM](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
* key ideas: we are missing data (usually true!), the missing data is assumed to be pretty similar to the observed data (maybe true?), and we have a general idea of the shape of the distribution (here is where things like "i.i.d." samples, independent and identically distributed, come in handy.)
* so data + equations, with missing parameters...
* ...then you can iteratively "fit" the equations to the data to infer the missing parameters, which you can then use to understand the likely shape of the underlying distribution.
* bootstrapping
* [wikipedia for bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))
* bootstrapping in e.g. populations with resampling, vs in e.g. phylogenetics
* Specific question: does bootstrapping inflate your statistical values inappropriately?
* supervised vs unsupervised classification
* supervised already has labels, unsupervised does not
* think: genome taxonomy classification.
* supervised: take a bunch of known genomes with known taxonomies, classify unknown genomes according to that taxonomy based on genome similarity
* unsupervised: take a bunch of known genomes with no labels, cluster them all based on genome similarity, assign taxonomy based on the resulting clusters
* Bayesian inference / Bayes' theorem
* [wikipedia](https://en.wikipedia.org/wiki/Bayes%27_theorem)
* [recording from "Gentle Intro to Bayesian Inference" workshop lecture](http://datalab.ucdavis.edu/eventscalendar/dsi-workshop-bayesian-inference/)
* PageRank / Google, inference of importance via "graph structure"
* if a bunch of pages link to your page for the term "quidnunc", then probably your page is a pretty good source for quidnunc
* other ideas & phrasing: guilt by association
* ...how does this go awry? :)
* how do "data driven" methodologies work in e.g. machine learning?
* training vs test (vs validation)
* role of _theory development_ vs _implementation_
* theoretical vs practical application
* where does PCA fit in?
* need to search large data sets (e.g. BLAST)
* role of heuristics and parameters!
* _search_ as a prelude to _design_, e.g. primers
* "budget constraints" and data intensive methodologies
* what do you need?
* what costs?
* exploratory data analysis (EDA)
* role of visualization
* new environments for doing data analysis
* Jupyter and R
* (some Mathematica history... :)
* ...vs "bad old" systems...
* (are notebooks good or bad for science??)
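To make the expectation maximization item in the list above a bit more concrete, here is a minimal sketch in plain Python. It assumes a toy setup that is not from the discussion itself: 1-D data drawn from two Gaussians, where the "missing data" is which Gaussian each point came from. The function names (`normal_pdf`, `em_two_gaussians`), the starting guesses, and the simulated data are all illustrative assumptions, not a reference implementation.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, n_iter=100):
    """Iteratively fit a two-component 1-D Gaussian mixture (toy EM)."""
    # Crude starting guesses (an assumption): the two extremes of the data.
    means = [min(data), max(data)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: for each point, the probability that it came from
        # each component, given the current parameter guesses.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from those soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)  # avoid a zero-width component
            weights[k] = nk / len(data)
    return means, stds, weights

# Simulated data (hypothetical): two overlapping groups with hidden labels.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
data += [random.gauss(5.0, 1.0) for _ in range(300)]

means, stds, weights = em_two_gaussians(data)
print("estimated means:", means)
print("estimated weights:", weights)
```

The estimated means should land near the two values used to simulate the data. How well this works depends on the assumed distribution shape being roughly right, which is exactly the "maybe true?" caveat in the list.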
maybe some meta thinking?
* "I have some data, what can I find in it" - exploratory
* "I have a hypothesis about what's in the data" - hypothesis testing
* "I want to classify a bunch of new data based on some old data" - classification
* "I need to work with really large amounts of data" - scalability of theory & methods
* how to read (data science) papers as a non-expert:
* look at claims
* look at data the claims rely on
* given the claims, does it pass the "sniff tests"?
* big enough data set
* broad enough data set
* unbiased enough data
* what are the key assumptions their approach relies on?
* (often the key assumptions can be at least vaguely understood based on the type of method, e.g. "supervised" => "do I believe the labels?")
* if it's a surprising paper ("wow! they cured cancer!") Google the URL and/or the paper title, and look for e.g. Twitter commentary.
* if there is some, browse the tone
* if there isn't any, ...beware :)
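Going back to the "classify a bunch of new data based on some old data" case above, and the training vs test distinction, here is a minimal sketch of supervised classification with a held-out test set. Everything in it is an illustrative assumption: the two toy "taxa", the 2-D feature vectors standing in for genome similarity, and the nearest-centroid rule (one simple classifier among many).

```python
import random

def centroid(points):
    """Mean position of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(train):
    """Supervised step: compute one centroid per known label."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is closest."""
    return min(model, key=lambda label: sq_dist(model[label], features))

# Hypothetical toy data: two labeled groups of 2-D feature vectors.
random.seed(1)
data = [((random.gauss(0, 1), random.gauss(0, 1)), "taxon_A") for _ in range(100)]
data += [((random.gauss(6, 1), random.gauss(6, 1)), "taxon_B") for _ in range(100)]
random.shuffle(data)

# Hold out data the model never sees during fitting.
train, test = data[:150], data[150:]
model = fit(train)
accuracy = sum(predict(model, f) == label for f, label in test) / len(test)
print("held-out accuracy:", accuracy)
```

The held-out accuracy only means something if the labels are trustworthy and the test data resembles the data you actually care about, which ties back to the "do I believe the labels?" and "unbiased enough data" questions.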
## Homework for Week 3
### Assignment
(Due Friday 1/24 at 11am, entered into [this form](https://docs.google.com/forms/d/e/1FAIpQLSfYEV2hp3Ejl9qNpI_CX9th9uQgY_Un8S6Tnt2UHLlSogdBPQ/viewform).)
Please read [How Science Works](https://undsci.berkeley.edu/lessons/pdfs/how_science_works.pdf) and write 2-3 sentences about where you see data science in general (or, if you prefer, DNA sequencing data analysis specifically) fitting into science. In particular, is it part of hypothesis testing, and if so, how?