# Module 4: Modeling

(Notes from initial scoping in Feb)

- Taught:
  - Discuss one or two important ML methods (e.g. trees, neural networks) and compare them to linear regression - which one is good for what?
  - Practical tips on how to tune your models when doing feature engineering and selection.
  - Model evaluation (choosing baselines, metrics, establishing statistical significance of the findings).
- EDI in modeling:
  - Case studies where modeling went wrong.
  - Who is modeling and for whom?
  - Involving communities.
  - Recognising contributors and making labour visible.
- Hands-on project:
  - European Social Survey or Gentrification data - teams will be asked to build a machine learning classification model and achieve a certain improvement over a baseline. They should be able to justify the chosen experimental setup (baseline, model, comments on comparison).
  - Teams will need to find ways of assessing the "history" and representativeness of the dataset and the limits of the collection and data science approaches adopted.

## Notes from 13/07/21

Candidate model is logistic regression on self-reported health (see Aldabe et al., 2010).

Modelling principles take time. The models should be as simple as possible. No multi-level modelling.

Taught or hands-on first? We could use the dataset and code from our taught part. The hands-on should be goal-orientated, with specific things that we want them to do and code they can edit.

**Advanced**: What factors affect health?
**Simplified**: Can you predict self-reported health from this data?

The overarching question `What factors predict health?` is for M1->M3. For M4 we break it down. This also mimics what we do in practice.

How do we know when the model is good? Simplified: two columns predicting the outcome. The benchmark score is something to aim for - but beware of overfitting.

[GitHub issue](https://github.com/alan-turing-institute/rds-course/issues/15) describing how the RQ would be used across modules.

---

### Brainstorming important concepts

- Measurement matters.
- Hypotheses are not models.
- Overfitting: fitting is easy, prediction is hard.
- Quantifying uncertainty.
- Probability distributions.
- Weighted average.
- Feature extraction, training.

### Resources

- Bishop
- Statistical Rethinking
- Gelman - Data Analysis Using Regression and Multilevel/Hierarchical Models

---

## Syllabus drafting 20/07/21

[modelling principles hack](https://hackmd.io/94y-n1emTku_mMMhLPT32Q)

### **1. Modelling is finding patterns within uncertainty** (link with M3)

- We have explored relationships between variables. We have seen that there are some trends (patterns/regularities), and some uncertainty around those trends.
- Modelling is the practice of **capturing patterns**, mathematically, **amid data uncertainty**.
  - Describing the data generating process mathematically (see the sketch after this list).
- Talk about different sources of uncertainty that get in the way of the true data generating process.
- The goal of modelling is to be able to predict future observations arising from some data-generating phenomenon. If we are able to do that well, we can say that we know something about that topic (and build theories or disprove hypotheses etc.). **That's Science, bitch!** (We may want to expand for different modelling use cases.)
- **Inference** is about learning something from data using models. We can use models to test hypotheses.
- Mixture of markdown and visuals carried over.
- Introduce the mode of teaching: we are going to build a model using the data from previous modules, and while doing so we will encounter modelling principles that we will take a moment to emphasise...
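To make "capturing patterns amid uncertainty" concrete in the taught material, something like the following minimal sketch could work: a known data generating process, noisy observations, and a simple fitted model that recovers the trend while the leftover noise stays in the residuals. The parameter values and variable names are placeholders, not drawn from the course data.

```python
# Minimal sketch: a known data generating process vs. a fitted model.
# Assumes numpy; all parameter values are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(42)

# True data generating process: y = 2 + 0.5 * x, observed with noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.5, size=n)

# The model captures the pattern (trend) amid the uncertainty (noise).
slope, intercept = np.polyfit(x, y, deg=1)
print("true:   intercept=2.00, slope=0.50")
print(f"fitted: intercept={intercept:.2f}, slope={slope:.2f}")

# What the model does not capture is left in the residuals.
residuals = y - (intercept + slope * x)
print(f"residual std (estimate of the noise): {residuals.std():.2f}")
```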
### **2. Building a simple model**

So look at our RQ and try to build a model that captures the trends we want.

- We want to understand `What factors predict health?`
- What do we know already? (recap of previous modules)
- So what variables will be useful? (big list!)
  - These are a massive interaction machine. We want to start simple and build up gradually. **feature extraction**
- First, we need an outcome measure. What type of measure is it?
  - Ordered categorical.
  - TODO: read up on best practice
- How to predict this, as a combination of variables: **regression**.
  - All models should start simple.
  - Most models are regression models.
  - **baseline**
- Briefly, say how a regression model is specified mathematically and is fitted (a worked sketch follows after section 5).
- Briefly, Bayes vs frequentist. Figure out how to do this concisely.

### **3. Interpreting a model**

- Models only know about the world you build for them.
- Methods of interpretation vary depending on the model. For regression models you get coefficients on your inputs.
  - Coefficients are the contribution of that variable to the outcome _assuming you already know all the other variables_.
  - This is important because the selection of other variables affects your coefficients. You can't interpret coefficients in splendid isolation.
  - Parameter uncertainty + errors.
  - Brief interlude on probability: Bayes vs frequentist.
- Visualisation helps you understand what the model is telling you. Ask yourself: what does the model think is going on in your data?
- Explicitly test hypotheses. (Can we do inference here? Or do we need uncertainty/residuals from the next section?)
- Prediction/simulation.
- Aside: interpretability of complex models. And maybe case studies?

### **4. Validating a model**

- But is this any good?
- Ask yourself: how useful is my model at explaining patterns in the data? Is there variability/uncertainty in the data that my model does not capture well?
- This process is called model validation. How to assess this?
  - Trends vs uncertainty. In regression, uncertainty = residuals.
  - Reporting/quantifying uncertainty.
  - Brief interlude on probability: Bayes vs frequentist.
  - Different sources of uncertainty: measurement uncertainty, fitting uncertainty. (In multilevel models you also have modelled uncertainty, e.g. random effects.)
- We want to learn something general. Fitting is easy, prediction is hard. The importance of this underlies most model evaluation.
  - Overfitting: model complexity vs out-of-sample prediction. (variance) (regularisation)
  - Underfitting: not enough useful information. (bias)
  - Carry over the previous section's visuals and underlying data.
- But how do we assess out-of-sample error?
  - Cross-validation.
  - Simulations (do they qualitatively match up to your data?).

### **5. Improving a model**

- If so, how do we adapt our model to explain this variability?
  - Do we give it more information? (i.e. another variable, more data to train on...)
  - Or do we change the structure of the model? Increase the complexity...
- You can always improve a model but there are real-world considerations: time, expense, expertise.
- Improvement isn't a one-dimensional thing: higher precision, higher out-of-sample accuracy, better clarity of communication? Parsimony is often desirable (especially in theoretical models).
- Practicalities: benchmarking, book-keeping & version control.
- Models are always wrong. Model evaluation is about understanding why your model is wrong and whether the level of incorrectness is acceptable.
- Real-world significance of models. Think about how your data is structured; remember these are real people.
  - Should not treat everyone the same.
  - Multilevel models? (For hands-on sessions maybe there should be three streams.)
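A sketch of how sections 2 and 3 might be tied together in code, using logistic regression as the candidate model. The column names (`srh_good`, `depr_index`, `age`) and the synthetic data are hypothetical stand-ins for the ESS-derived variables, and `statsmodels` is assumed as the fitting library; swap in whatever the taught notebooks actually use.

```python
# Sketch: specify, fit, and interpret a simple logistic regression.
# Column names (srh_good, depr_index, age) are hypothetical placeholders,
# not the real survey variable names; the data is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Stand-in data with the same shape as the real problem:
# a binary outcome and two candidate predictors.
n = 1000
df = pd.DataFrame({
    "depr_index": rng.integers(1, 6, size=n),   # deprivation quintile
    "age": rng.integers(18, 90, size=n),
})
logit_p = 2.0 - 0.4 * df["depr_index"] - 0.02 * df["age"]
df["srh_good"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Model specification: log-odds of good health as a linear combination
# of predictors. Fitting is by maximum likelihood (Bernoulli likelihood).
model = smf.logit("srh_good ~ depr_index + age", data=df).fit()
print(model.summary())

# Exponentiated coefficients are odds ratios: the multiplicative change
# in the odds of good health per unit increase in each predictor,
# holding the other predictors fixed.
print(np.exp(model.params))

# A trivial baseline to compare against: always predict the majority class.
baseline_acc = max(df["srh_good"].mean(), 1 - df["srh_good"].mean())
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

The majority-class baseline also gives the "improvement over a baseline" target in the hands-on brief something concrete to beat.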
### Hands-on notes

- We need to be super clear about what we want.
- Should be a few different streams.
- Provide a template of how to collaborate.
  - e.g. rotate who 'drives' every 30 mins. Have a separate notetaker to the coder. Use GitHub to sync up after 30 mins.

#### Hands-on session 1: Build your own model

- A good description of the model (inputs, outputs):
  - includes rationale
- Reporting the results in any way you see fit.
  - includes interpretation
  - validation
- Evaluating your model
  - What have you learnt?
  - What are the limitations?
  - How could it be improved further?
- You can interleave visuals and text in any way you see fit.

#### Hands-on session 2: Apply this model to a different country

#### Hands-on session 3:

## Resources

- [p-hacking in observational data](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149144)
- [manifesto for reproducible science](https://www.nature.com/articles/s41562-016-0021.pdf)
- [538 on nutrition research](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/)
- [Introductory text to Bayesian modelling - Statistical Rethinking](https://xcelab.net/rmpubs/sr2/statisticalrethinking2_chapters1and2.pdf). The first two chapters might be useful. From Oscar.
- [Modeling and inference](https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html#1_probabilistic_modeling). From Oscar.

# Notes from 4.3

QUESTIONS = TODO LIST (in comment - click to expand)

GENERAL NOTES:

For us: how are regression coefficients fitted? MLE, using the Bernoulli likelihood to evaluate the dataset against the model's predicted p(x).

It strikes me that a nice way of introducing models is by getting averages of data, and learning the distribution of the data around that average. Make the distinction between learning average patterns and the observations that go into learning those patterns.

How do we tweak this so it is less 'how to do logistic regression' and more 'how to do modelling'? When reviewing we should advise people to try and notice when something general crops up.

Note - 4.1 to include:

- We are HARKing here by basing our model development almost exclusively on the relationships within the data.
- We are not prescribing a model.
- In modelling for explanation you need to understand the model in detail.

FOR THIS SECTION (4.3):

- Add text sections to later plots to motivate the choice of another variable. The hacky way that we've done it is to take a slice of DeprIndex and explore which variables still vary across SRH (so are orthogonal to DeprIndex).
- Talk about how we are essentially controlling for age. Age is a confound that has an effect on the outcome but that we are not interested in directly.
- We need to introduce odds ratios as a way of assessing the effect of a predictor on the outcome.
- Add annotations to plots.
- Prettify with margin notes etc.

FOR SECTION 4.4 EVALUATION/VALIDATING:

- Confusion matrix and talk about decision boundaries.
- F1/ROC curves.
- Train-test split in an overfitting context (a minimal sketch follows below).
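One possible shape for the 4.4 evaluation material, assuming scikit-learn and the same hypothetical column names as the sketch above; the data is a synthetic stand-in, not the actual course dataset.

```python
# Sketch of the 4.4 evaluation ideas: train/test split, confusion matrix,
# F1 and ROC AUC. Feature/outcome names are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data; in the course this would be the ESS-derived dataframe.
n = 1000
X = pd.DataFrame({
    "depr_index": rng.integers(1, 6, size=n),
    "age": rng.integers(18, 90, size=n),
})
logit_p = 2.0 - 0.4 * X["depr_index"] - 0.02 * X["age"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Hold out data the model never sees during fitting: out-of-sample
# performance is what guards against overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression().fit(X_train, y_train)

# Hard predictions at the default 0.5 decision boundary -> confusion matrix, F1.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(f"F1: {f1_score(y_test, y_pred):.2f}")

# Predicted probabilities -> ROC AUC, which is threshold-independent.
y_prob = clf.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.2f}")
```

Thresholding `predict_proba` at values other than 0.5 is a natural hook for the decision-boundary discussion.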