# Callum notes on modelling the data

Steps

- [ ] Plot the variables I have selected. Assess variability and collinearity.
- [ ] Make notes on what to expect from the variables.
- [ ] Start fitting simple models.
- [ ] Figure out a good way to report and plot the results.
- [ ] Ordinal logistic regression is probably the better way.

RQ: `We want to investigate the contribution of material, occupational, and psychosocial factors on the self reported health (SRH) across different European countries. We will use SRH information collected by the Wave 2 and 3 of the EQLTS survey, aware that they offer only a partial representation of European populations and that SRH is per-se a highly subjective indicator, difficult to compare across countries.`

[Data](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/details)

If you download the CSV version, you get a folder with this directory structure:

```
UKDA-7724-csv
- csv #here is the data
  - eqls_2007.csv
  - eqls_2007and2011.csv
  - eqls_2011.csv
- mrdoc #here is additional info
  - allissue #data dictionaries
  - excel
    - eqls_api_map.csv #description of what the column names mean
    - eqls_concordance_grid.xlsx #describes which variables were included in which waves, and the mapping between waves
  - pdf #user guide
  - UKDA #study info
```

### [User guide](http://doc.ukdataservice.ac.uk/doc/7724/mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf)

- The intended dimensions to cover are "employment and work-life balance, income and deprivation, housing and local environment, family and social contacts, health and mental well-being, subjective wellbeing (e.g. happiness, life satisfaction), social exclusion, perceived quality of society (e.g. tensions, trust in institutions) as well as access to and perceived quality of public services."
- 195 variables. See eqls_concordance_grid.xlsx for those that were only included in wave 3.
- Variables are grouped into primary and secondary topics.
They are _also_ grouped into variable groupings. These differ slightly. The topics are an attempt to succinctly describe the domain of each variable. The variable groupings overlap slightly (e.g. Health crops up twice) and also include indicators such as 'Derived Variables', which isn't a domain.
- Derived variables 'group numeric responses of other related variables or to collapse groupings of related categorical variables into fewer categories'. The derived variables are probably what we should be using. They aim to:
  - enhance the data quality by aggregating the responses into a more usable and consistent format across both waves of the survey,
  - provide a clearer structure of datasets by reducing the number of variables,
  - ensure confidentiality and anonymity of personal information and all respondents.

### aside on Aldabe

- Variables used:
  - Occupation, Education level for SES
  - In addition there was a huge list of _material_, _occupational_ and _psychosocial_ factors. See Table 1 of paper. They do not use any derived variables.
- It seems like most of the variables used were associated with better or worse self-reported health.
- 81.14 % of men were in good health, 76.91 % of women were in good health.
- Model 1 was SES predicting SRH, with age added as a control.

### Selecting Variables.

The derived variables are made for easier analysis of the dataset. Which of these group onto

Useful links on regression with sklearn

- https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/
- statsmodels.org/dev/examples/notebooks/generated/ordinal_regression.html
- https://rikunert.com/ordinal_rating
- [Multicollinearity](https://www.datasklr.com/ols-least-squares-regression/multicollinearity)
- hand-holding walkthrough: https://ajaytech.co/python-logistic-regression/
- https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704

![](https://i.imgur.com/gknnmlF.png)

- [MLE for logistic regression](https://web.stanford.edu/class/archive/cs/cs109/cs109.1178/lectureHandouts/220-logistic-regression.pdf)
- [MLE for logistic regression 2](https://towardsdatascience.com/understanding-maximum-likelihood-estimation-mle-7e184d3444bd)
- https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html
- https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
- https://machinelearningmastery.com/implement-logistic-regression-stochastic-gradient-descent-scratch-python/
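
Note to self on what the MLE links are getting at: logistic regression is fitted by picking the weights that maximise the log-likelihood of the observed labels. Below is a minimal sketch of that idea, with several assumptions: SRH is collapsed to a binary good / not-good outcome (a simplification, not the ordinal model the steps list favours), the data is synthetic rather than EQLS, and the single predictor is a hypothetical standardised score. It uses plain batch gradient ascent in pure Python, so none of it is sklearn or statsmodels API.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function: maps log-odds to a probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(weights, rows, labels):
    """Bernoulli log-likelihood of the 0/1 labels under the logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(rows, labels, learning_rate=0.5, n_iter=500):
    """Maximise the log-likelihood by batch gradient ascent."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            # Gradient of the log-likelihood is sum((y - p) * x) over rows.
            error = y - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                grad[j] += error * xi
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


# Synthetic data (assumption): 1 = good SRH, 0 = not good.
# The predictor stands in for a hypothetical standardised deprivation score.
random.seed(0)
true_weights = [0.5, -1.0]  # intercept, slope on the log-odds scale
rows, labels = [], []
for _ in range(500):
    x = random.gauss(0, 1)
    p_good = sigmoid(true_weights[0] + true_weights[1] * x)
    rows.append([1.0, x])  # leading 1.0 is the intercept column
    labels.append(1 if random.random() < p_good else 0)

weights = fit_logistic(rows, labels)
odds_ratio = math.exp(weights[1])  # multiplicative change in odds per unit of x

print("intercept, slope:", weights)
print("odds ratio for the predictor:", odds_ratio)
print("log-likelihood:", log_likelihood(weights, rows, labels))
```

The slope is on the log-odds scale, so `math.exp(slope)` is the odds ratio, which is probably the easiest number to report. For the real analysis a library fit (e.g. the statsmodels ordinal regression notebook linked above) would also give standard errors, which this sketch does not.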