# Methods 2 - 3rd portfolio
Prepping for data:
```{r}
# load packages
library(tidyverse)
library(dplyr)
library(rstanarm)
library(MASS)
# The next package is used for 3D plots only.
# On macOS, plot3D requires XQuartz to be installed.
# Install once from the console if needed: install.packages("plot3D")
library(plot3D)
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, comment=NA)
setwd("~/Documents/Methods2/Portfolios/portfolio-assignment-3-study-group-7")
```
```{r}
# set nice theme
theme_set(theme_bw())
```
```{r}
# load data
getwd()
child_iq <- read.csv("/Users/LauraSoerineVoldgaard/Desktop/portfolio-assignment-3-study-group-7/KidIQ/data/child_iq.csv")
child_iq <- child_iq %>%
rename(child_test_score = ppvt, mom_educ = educ_cat, mom_age = momage)
```
## Exercise 10.5
### a)
Fit a regression of child test scores on mother’s age, display the data and fitted model, check assumptions, and interpret the slope coefficient. Based on this analysis, when do you recommend mothers should give birth? What are you assuming in making this recommendation?
```{r}
M1 <- stan_glm(child_test_score ~ mom_age, data = child_iq, refresh = 0)
# scatter plot of the data, with the fitted regression line drawn on top
plot(child_iq$mom_age, child_iq$child_test_score, xlab = "Mom's age",
ylab = "Child's test score")
abline(coef(M1), col = "hotpink")
print(M1)
```
The regression of child test scores on mother's age shows that the fitted line rises slightly as the mother's age increases.
The slope coefficient of 0.8 means that, comparing two mothers whose ages at birth differ by 1 year, the 3-year-old child of the older mother is expected to score on average 0.8 points higher on the test.
The intercept (67.7) isn't really useful in this context: it is the test score the regression line indicates for a child whose mother gave birth at age 0, which is impossible.
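To make the slope concrete, the fitted line can be evaluated by hand at a few ages. This is a minimal sketch that uses the rounded estimates reported above (intercept 67.7, slope 0.8); the three ages and the variable names are chosen only for illustration.

```r
# Rounded point estimates taken from the text above
intercept <- 67.7
slope <- 0.8

# Illustrative ages of the mother at birth
ages <- c(17, 23, 29)

# Predicted test score on the regression line for each age
predicted <- intercept + slope * ages
print(data.frame(mom_age = ages, predicted_score = predicted))
```

The predicted score rises by 0.8 for each additional year, so the two extreme illustrative ages differ by 12 × 0.8 = 9.6 points.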
## Assumptions
### Representativeness
In the data set provided, the mothers' ages range from 17 to 29 years and are roughly normally distributed within this range.
```{r}
hist(child_iq$mom_age, main = "Histogram of mothers' age")
```
The limited age range means that the model might not be representative of mothers who give birth after the age of 30. This is a limitation that needs to be taken into consideration when interpreting the model and using it to make predictions.
### Normality of errors
```{r}
# the normality assumption concerns the errors, so we inspect the model residuals
qqnorm(resid(M1))
qqline(resid(M1))
```
### Equal variance of errors (homoscedasticity), linearity, outliers
```{r}
ggplot(data = child_iq, aes(x = predict(M1), y = resid(M1))) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Fitted values", y = "Residuals") +
ggtitle("Residuals vs. Fitted values")
```
Here, we are checking whether the residuals show any trend against the fitted values, which they shouldn't.
The linearity assumption seems reasonable, as the blue line stays close to 0 on the y-axis. We can also see that we have homoscedasticity, because the residuals are more or less equally spread along the x-axis.
Looking at the plot, we conclude there are no critical outliers.
After checking the assumptions, we conclude that our fitted model meets all of them.
## Conclusion
Based on this analysis, we recommend that mothers give birth as late as possible for a higher child_test_score, assuming that their biological clock allows it AND that there are no other predictors influencing this outcome.
### b)
Repeat this for a regression that further includes mother’s education, interpreting both slope coefficients in this model. Have your conclusions about the timing of birth changed?
##### Interpretation for the child_iq dataset
1 = not completed high school
2 = high school education
3 = college
4 = a graduate degree
```{r}
# Creating the regression with mom_educ as a predictor as well.
M2 <- stan_glm(child_test_score ~ mom_age + mom_educ, data = child_iq, refresh = 0)
summary(M2)
```
These new estimates indicate that the mother's education level is a stronger predictor of child IQ than the mother's age.
Age is now a weaker predictor than it was before, possibly because age and education level are correlated: a 19-year-old could not possibly have finished a graduate degree.
The regression model here is
$$
\text{child_test_score} = 69.3 + 4.7 \times \text{mom_educ} + 0.3 \times \text{mom_age} + \varepsilon
$$
where 4.7 is the coefficient for the mother's education level (a discrete variable) and 0.3 is the coefficient for the mother's age. Each coefficient is the expected difference in test score associated with a one-unit difference in that predictor, assuming that the other predictor remains constant.
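As a rough worked example of reading the two slopes together, the equation above can be evaluated for a few hypothetical mothers. The sketch uses the rounded coefficients from the equation; the helper `predict_score()` and the example mothers are made up for illustration and are not part of the analysis.

```r
# Hypothetical helper: evaluates the rounded regression equation from the text
predict_score <- function(mom_educ, mom_age) {
  69.3 + 4.7 * mom_educ + 0.3 * mom_age
}

# Two made-up mothers of the same age whose education differs by one category
same_age <- predict_score(mom_educ = c(1, 2), mom_age = 22)

# Two made-up mothers with the same education whose ages differ by one year
same_educ <- predict_score(mom_educ = 2, mom_age = c(22, 23))

diff(same_age)   # difference associated with one education category
diff(same_educ)  # difference associated with one year of age
```

Holding one predictor fixed while changing the other by one unit recovers the corresponding coefficient, which is what "assuming that the other predictor remains constant" means in practice.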
### Checking for assumptions again
#### Representativeness
```{r}
mom_educ_table <- table(child_iq$mom_educ)
barplot(mom_educ_table, col = "lightblue")
```
Regarding representativeness, we can see that mothers in the highest education category are not as well represented as those at the other education levels. Nonetheless, we will continue.
#### Equal variance of errors (homoscedasticity), linearity, outliers
```{r}
# Residuals vs. fitted values for the model with both predictors
ggplot(data = child_iq, aes(x = predict(M2), y = resid(M2))) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Fitted values", y = "Residuals") +
ggtitle("Residuals vs. Fitted values")
```
The plot still supports the linearity assumption. Regarding homoscedasticity, we don't see any pattern, and the residuals are spread equally within each discrete value category.
### Conclusion
Our previous recommendation, that the mother should give birth later in order to get a higher IQ score for her child, has changed. Based on our results, we don't recommend that the timing of birth be guided much by age itself; instead, the mother's education level is the better indicator of when she should give birth for her child to achieve a higher IQ score.
It is, however, important to note that the mother's education level might reflect other lifestyle factors, such as standard of living, general wealth and social connectivity, that could also affect the child's development.
### c)
Now create an indicator variable reflecting whether the mother has completed high school or not. Consider interactions between high school completion and mother’s age. Also create a plot that shows the separate regression lines for each high school completion status group.
```{r}
#Creating an indicator variable for highschool completion
child_iq <- child_iq %>%
mutate(highschool_binary = ifelse(mom_educ > 1, 1, 0))
# centering age so that the intercept refers to a mother of mean age
mean(child_iq$mom_age)
child_iq <- child_iq %>%
mutate(standard_age = mom_age-mean(mom_age))
M3_1 <- stan_glm(child_test_score ~ mom_age, data = filter(child_iq, highschool_binary == 1), refresh = 0)
M3_2 <- stan_glm(child_test_score ~ mom_age, data = filter(child_iq, highschool_binary == 0), refresh = 0)
M_interact <- stan_glm(child_test_score ~ standard_age * highschool_binary, data = child_iq, refresh = 0)
summary(M_interact)
```
The coefficients of the model with interactions:
We can see that the average age of the mothers is 22.8. To avoid the confusion of trying to interpret imaginary IQ scores for children of mothers aged 0, we subtracted the mean of the mothers' ages from the age column.
The intercept is now 77. That is the expected IQ in our dataset of a child whose mother is of average age (22.8 years) and has no high school education.
The age coefficient is -1.2, indicating that for mothers without a high school education, we would expect a lower IQ score for their children the older the mothers are.
The high school coefficient is 11.9, indicating that at the average age, a mother who has completed high school is expected to have a child with an IQ score 11.9 points higher than that of a child of a same-aged mother who has not completed high school.
The coefficient of the interaction is 2.2, meaning that having completed high school changes the age coefficient by 2.2. For mothers who have completed high school, the age coefficient is therefore -1.2 + 2.2 = 1.0, so they are expected to have children with higher IQ scores the later they give birth.
This agrees well with what you see on the plot.
```{r}
plot(child_iq$mom_age, child_iq$child_test_score,
main = "Childs IQ and Mom age",
xlab = "Mom's age",
ylab = "Child's IQ",
col = ifelse(child_iq$highschool_binary == 1, "blue", "red"))
legend("topleft",
pch = c(1, 1),
c("no hs", "hs"),
col = c("red", "blue"))
abline(M3_1, col = "blue")
abline(M3_2, col = "red")
```
The plot above is made with the non-centered mom_age data. We judge that it reflects the centered model well and is a very intuitive way to view our data.
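The coefficient arithmetic for the interaction model can also be written out explicitly. This sketch plugs in the rounded estimates quoted above (77, -1.2, 11.9 and 2.2); the variable names are illustrative only.

```r
# Rounded estimates from the interaction model, as quoted in the text
b_intercept <- 77     # expected score at mean age, no high school
b_age <- -1.2         # age slope for mothers without high school
b_hs <- 11.9          # shift at mean age for mothers with high school
b_interaction <- 2.2  # change in the age slope for mothers with high school

# Implied regression line (in centered age) for each group
lines_by_group <- data.frame(
  group = c("no hs", "hs"),
  score_at_mean_age = c(b_intercept, b_intercept + b_hs),
  age_slope = c(b_age, b_age + b_interaction)
)
print(lines_by_group)
```

The opposite signs of the two slopes are why the two lines move apart towards higher ages.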
### d)
Finally, fit a regression of child test scores on mother’s age and education level for the first 200 children and use this model to predict test scores for the next 200. Graphically display comparisons of the predicted and actual scores for the final 200 children.
```{r}
# selecting first 200 for training dataset
child_iq_training <- head(child_iq,200)
# selecting last 200 for test dataset
child_iq_test <- tail(child_iq,200)
# fitting a model with mother's age and education level to the first 200 children
M_train <- stan_glm(child_test_score ~ mom_age + mom_educ, data = child_iq_training, refresh = 0)
summary(M_train)
# making predictions for the last 200 children (the test rows are passed as newdata)
predz <- predict(M_train, newdata = child_iq_test)
# creating a data frame with the actual scores and predicted scores
pred_df <- data.frame(actual_scores = child_iq_test$child_test_score, predicted_scores = predz)
# plot the actual scores vs the predicted scores
ggplot(pred_df, aes(x = actual_scores, y = predicted_scores)) +
geom_point() + geom_abline(color = "hotpink") +
labs(x = "Observed scores", y = "Predicted scores") +
ggtitle("Observed scores compared to predicted scores") +
xlim(20, 150) + ylim(20, 150)
```
The graphical display of the comparison of the predicted and observed scores shows that the model isn't very good at predicting the IQ scores of the final 200 children in the dataset. If the model had been good at this, we would see the points centered along the pink line.
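A holdout comparison like this relies on the held-out rows being passed to `predict()` through its `newdata` argument. The sketch below shows the pattern on simulated data, with `lm()` as a quick stand-in for `stan_glm()`; the data frame `toy` and all variable names are assumptions made for the example.

```r
set.seed(1)

# Simulated toy data (not the child_iq data): 40 rows, one predictor
toy <- data.frame(x = rnorm(40))
toy$y <- 2 + 3 * toy$x + rnorm(40)

# First half for fitting, second half held out
train <- head(toy, 20)
test <- tail(toy, 20)

# lm() is used here only as a fast stand-in for stan_glm()
fit <- lm(y ~ x, data = train)

# Predictions for the held-out rows: the new rows go in `newdata`
pred_test <- predict(fit, newdata = test)

# Without `newdata`, predict() returns fitted values for the training rows
pred_train <- predict(fit)

# Compare held-out predictions with the observed held-out outcomes
comparison <- data.frame(observed = test$y, predicted = pred_test)
head(comparison)
```

If the predictions are accidentally made for the training rows, the comparison plot pairs each observed test score with an unrelated fitted value, so it is worth checking that the prediction vector really corresponds to the test rows.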
## Exercise 10.6
```{r}
# Reading the data
beauty <- read.csv("Beauty/data/beauty.csv")
```
### a)
Run a regression using beauty (the variable `beauty`) to predict course evaluations (`eval`), adjusting for various other predictors. Graph the data and fitted model, and explain the meaning of each of the coefficients along with the residual standard deviation. Plot the residuals versus fitted values.
```{r}
# Running a regression to predict the evaluation by beauty and all the other various predictors.
# Making the regression
B1 <- stan_glm(eval ~ beauty, data = beauty, refresh = 0)
# Making several regressions to see how each predictor influences evaluation scores.
B2 <- stan_glm(eval ~ beauty + female, data = beauty, refresh = 0)
B3 <- stan_glm(eval ~ beauty + age, data = beauty, refresh = 0)
B4 <- stan_glm(eval ~ beauty + minority, data = beauty, refresh = 0)
B5 <- stan_glm(eval ~ beauty + nonenglish, data = beauty, refresh = 0)
B6 <- stan_glm(eval ~ beauty + lower, data = beauty, refresh = 0)
B7 <- stan_glm(eval ~ beauty + female + nonenglish, data = beauty, refresh = 0)
# Making simpler regressions that exclude beauty as a predictor
B22 <- stan_glm(eval ~ female, data = beauty, refresh = 0)
B55 <- stan_glm(eval ~ nonenglish, data = beauty, refresh = 0)
print(B1)
print(B2)
print(B3)
print(B4)
print(B5)
print(B6)
print(B7)
# Plotting the data and fitted models by manually writing in the regression model, where the other predictor is kept at a constant of 0.
b <- B7$coefficients
sex_eval_0 <- b[1] + b[2] * beauty$beauty + b[3] * 0 + b[4] * 0
sex_eval_1 <- b[1] + b[2] * beauty$beauty + b[3] * 1 + b[4] * 0
english_eval_0 <- b[1] + b[2] * beauty$beauty + b[3] * 0 + b[4] * 0
english_eval_1 <- b[1] + b[2] * beauty$beauty + b[3] * 0 + b[4] * 1
ggplot(beauty, aes(x = beauty, y = eval, shape = factor(female))) +
geom_point(size = 1) +
geom_line(aes(x = beauty, y= sex_eval_0 ), color = "blue" ) +
geom_line(aes(x = beauty, y= sex_eval_1 ), color = "red") + ggtitle("Beauty and evaluation scores along with the regression of eval ~ female")
ggplot(beauty, aes(x = beauty, y = eval, shape = factor(nonenglish))) +
geom_point(size = 1) +
geom_line(aes(x = beauty, y= english_eval_0), color = "green" ) +
geom_line(aes(x = beauty, y= english_eval_1), color = "yellow") + ggtitle("Beauty and evaluation scores along with the regression of eval ~ nonenglish")
```
In the first plot above, which includes the female variable as a predictor, the blue line shows the fitted model for men and the red line for women.
In the second plot above, which includes the nonenglish variable as a predictor, the green line shows English speakers and the yellow line non-English speakers.
### Interpretation of adjusted models
Adjusting for sex as an additional predictor reveals that, in addition to beauty, sex also plays a role in predicting evaluation. Adding other predictors such as minority, nonenglish and lower also shows a (small) impact on the evaluation scores. Age, however, makes no difference to the prediction of evaluation once beauty is included, so it doesn't need to be adjusted for in this context.
For our final regression model, we simply chose the predictors (in addition to beauty) whose coefficients have the largest impact (furthest away from 0).
Then we end up with the following regression model:
$$
\text{evaluation} = 4.1 + 0.1 \times \text{beauty} -0.2 \times \text{female} -0.3 \times \text{nonenglish} + \epsilon
$$
We think this also makes sense, since non-English speakers might have an accent that makes them harder to understand, and there have been indications of a gender bias in teacher evaluation scores, where women are on average rated lower than men (https://doi.org/10.1093/jeea/jvx057).
The regression model has an intercept of 4.1, which means that if all the predictors are 0 (a man who speaks English and has a beauty score of 0), the predicted evaluation score is 4.1.
Adjusting for the various predictors (beauty, sex and whether or not the instructor is a non-English speaker) affects the predicted evaluation scores in the following ways:
Assuming the other predictors remain constant, the coefficient for beauty is the slope of the regression with respect to the beauty score. This means that if we move 1 unit along the beauty axis, the predicted evaluation increases by 0.1.
The same applies to the female and nonenglish predictors, with the difference that those are binary variables. Being female lowers the predicted evaluation score by 0.2, and not speaking English lowers it by 0.3. All of these interpretations apply only if we assume that the other predictors remain constant when varying the particular predictor. Finally, the residual standard deviation of the regression model is 0.5, meaning that observed evaluation scores typically fall about 0.5 points away from the model's predictions.
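The interpretation of the coefficients can be checked with a small worked example. This sketch uses the rounded coefficients from the regression equation above; the three instructors are hypothetical and chosen only to show how the terms add up.

```r
# Rounded coefficients from the regression equation in the text
coefs <- c(intercept = 4.1, beauty = 0.1, female = -0.2, nonenglish = -0.3)

# Hypothetical instructors (made up for illustration)
instructors <- data.frame(
  beauty = c(0, 1, 1),
  female = c(0, 0, 1),
  nonenglish = c(0, 0, 1)
)

# Predicted evaluation = intercept + sum of coefficient * predictor value
instructors$predicted_eval <- coefs[["intercept"]] +
  coefs[["beauty"]] * instructors$beauty +
  coefs[["female"]] * instructors$female +
  coefs[["nonenglish"]] * instructors$nonenglish
print(instructors)
```

The first row reproduces the intercept, the second differs from it only by the beauty coefficient, and the third adds the female and nonenglish shifts on top.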
Finally we are plotting the residuals vs fitted values:
#### Residuals vs. fitted values plot
```{r}
ggplot(data = beauty, aes(x = predict(B7), y = resid(B7))) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Fitted values", y = "Residuals") +
ggtitle("Residuals vs. Fitted values")
```
The two regression plots earlier in this exercise each show a simplified model in which one predictor is left out. The following plots show the same thing, visualized in 3D instead.
#### Creating 3D plots
```{r}
# First for the beauty, evaluation and female variables
# set the x, y, and z variables of our 3D plot between beauty and female variables as a predictor for evaluation score.
x <- beauty$female
y <- beauty$beauty
z <- beauty$eval
# Compute the linear regression
fit <- stan_glm(z ~ x + y)
# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# create the fitted points for droplines to the surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 1, cex = 1,colvar = NULL, col="lightblue",
theta = 30, phi = 20, bty="b",
xlab = "Sex", ylab = "beauty", zlab = "evaluation",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="grey"), main = "3D model for beauty and sex as a predictor for evaluation")
```
```{r}
# Then for the beauty, evaluation and nonenglish variables
# set the x, y, and z variables of our 3D plot between beauty and nonenglish variables as a predictor for evaluation score.
x1 <- beauty$nonenglish
y1 <- beauty$beauty
z1 <- beauty$eval
# Compute the linear regression
fit1 <- stan_glm(z1 ~ x1 + y1)
# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred1 <- seq(min(x1), max(x1), length.out = grid.lines)
y.pred1 <- seq(min(y1), max(y1), length.out = grid.lines)
xy1 <- expand.grid( x1 = x.pred1, y1 = y.pred1)
z.pred1 <- matrix(predict(fit1, newdata = xy1),
nrow = grid.lines, ncol = grid.lines)
# create the fitted points for droplines to the surface
fitpoints1 <- predict(fit1)
# scatter plot with regression plane
scatter3D(x1, y1, z1, pch = 1, cex = 1,colvar = NULL, col = "lightblue",
theta = 30, phi = 20, bty="b",
xlab = "nonenglish", ylab = "beauty", zlab = "evaluation",
surf = list(x = x.pred1, y = y.pred1, z = z.pred1,
facets = TRUE, fit = fitpoints1, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="grey"), main = "3D model for beauty and nonenglish as a predictor for evaluation")
```
### b)
Fit some other models, including beauty and also other predictors. Consider at least one model with interactions. For each model, explain the meaning of each of its estimated coefficients.
For this exercise, we chose a range of different fits as listed and explained below:
```{r}
#The different regressions with some interaction effects, that we found interesting:
B8 <- stan_glm(beauty$eval ~ beauty$female * beauty$beauty, refresh = 0)
B9 <- stan_glm(beauty$eval ~ beauty$beauty + beauty$female * beauty$age, refresh = 0)
B10 <- stan_glm(beauty$eval ~ beauty$beauty + beauty$minority * beauty$nonenglish, refresh = 0)
print(B8)
print(B9)
print(B10)
```
For **B8** the intercept is 4.1, meaning that if all the predictors are at 0, the predicted evaluation score is 4.1. For the predictors to be at 0, one would have to be a man with a beauty score of 0. For each 1 point gained in beauty score, the model indicates that a man's evaluation score increases by 0.2. For a woman with the same beauty score of 0, the predicted evaluation score is 0.2 lower. The coefficient for the interaction between beauty$female and beauty$beauty is -0.1, meaning that for women the slope of beauty is 0.1 lower than for men: each additional point in beauty score raises a woman's predicted evaluation by 0.2 - 0.1 = 0.1.
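Written out with the rounded B8 estimates quoted above, the interaction splits the model into one line per group. This is a worked restatement of the paragraph above, not additional model output:

$$
\begin{aligned}
\text{evaluation} &= 4.1 + 0.2 \times \text{beauty} - 0.2 \times \text{female} - 0.1 \times \text{female} \times \text{beauty} + \epsilon \\
\text{men (female = 0):} \quad \text{evaluation} &= 4.1 + 0.2 \times \text{beauty} + \epsilon \\
\text{women (female = 1):} \quad \text{evaluation} &= (4.1 - 0.2) + (0.2 - 0.1) \times \text{beauty} + \epsilon = 3.9 + 0.1 \times \text{beauty} + \epsilon
\end{aligned}
$$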
For **B9** the intercept is 4.0, meaning that if all the predictors are at 0, the predicted evaluation score is 4.0. For the predictors to be at 0, one would have to be a 0-year-old man with a beauty score of 0 (making the intercept non-interpretable). For each 1 point gained in beauty score, the model indicates that the evaluation score increases by 0.1, assuming that the other predictors remain constant. For a woman with the same beauty score and an age of 0, the predicted evaluation score is 0.3 higher. The coefficient for age indicates no impact on the evaluation score in this model.
The coefficient for the interaction between beauty$female and beauty$age also indicates no effect on the evaluation score.
For **B10** the intercept is still 4.0, meaning that if all the predictors are at 0, the predicted evaluation score is 4.0. For the predictors to be at 0, one would have to have a beauty score of 0, not be part of a minority and be English speaking.
For each 1 point gained in beauty score, the model indicates that the evaluation score increases by 0.1, assuming that the other predictors remain constant (e.g. still a non-minority, English-speaking person).
For a minority instructor with the same beauty score who is still English speaking (other predictors remaining constant), the predicted evaluation score is 0.1 lower.
If, on the other hand, you are non-English speaking but not part of a minority and still have a beauty score of 0, the model indicates an evaluation score that is 0.3 lower.
The coefficient for the interaction between beauty$minority and beauty$nonenglish indicates no effect on the evaluation score.
## Exercise 10.7
### a)
Instructor A is a 50-year-old woman who is a native English speaker and has a beauty score of -1. Instructor B is a 60-year-old man who is a native English speaker and has a beauty score of -0.5. Simulate 1000 random draws of the course evaluation rating of these two instructors. In your simulation, use posterior_predict to account for the uncertainty in the regression parameters as well as predictive uncertainty.
For this exercise we decided to use the model from exercise 10.6a).
We created a dataframe for each professor and used the model to predict 1000 possible evaluation scores for each of them.
```{r}
B7 <- stan_glm(eval ~ beauty + female + nonenglish, data = beauty, refresh = 0)
prof_A <- tibble(beauty = -1, female = 1, age = 50, nonenglish = 0)
prof_B <- tibble(beauty = -0.5, female = 0, age = 60, nonenglish = 0)
prof_A_eval <- posterior_predict(B7,newdata = prof_A, draws = 1000 )
prof_B_eval <- posterior_predict(B7,newdata = prof_B, draws = 1000 )
head(prof_A_eval, 10)
head(prof_B_eval, 10)
```
### b)
Make a histogram of the difference between the course evaluations for A and B. What is the probability that A will have a higher evaluation?
With the 1000 scores predicted for each professor in a) we calculated the difference between scores for professor A and Professor B and plotted the difference in a histogram.
```{r}
diff_AB <- tibble(diff = as.numeric(prof_A_eval - prof_B_eval))
hist(diff_AB$diff, xlab = 'Difference', main = 'Difference in predicted course evaluations')
```
The histogram is approximately Gaussian and centered just below 0, indicating that professor B is predicted to get slightly higher evaluation scores than professor A.
This is in accordance with our interpretation of the coefficients in exercise 10.6.
```{r}
(sum(with(diff_AB, diff > 0))/1000)*100
```
We calculated that the probability that professor A will have a higher evaluation score than professor B is 30.5%. This again supports our interpretation of the histogram and the model's coefficients.
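To see what the predictive simulation is doing, the same comparison can be approximated by hand. This is only a sketch under strong assumptions: it plugs in the rounded point estimates from exercise 10.6 (intercept 4.1, beauty 0.1, female -0.2, residual standard deviation 0.5) and draws normal errors with `rnorm()`, so unlike `posterior_predict` it ignores the uncertainty in the regression parameters and will not reproduce the numbers above exactly. All variable names are illustrative.

```r
set.seed(123)
n_sims <- 1000

# Rounded point estimates from exercise 10.6 (treated as fixed here)
b0 <- 4.1
b_beauty <- 0.1
b_female <- -0.2
sigma <- 0.5

# Expected evaluations: A is a woman with beauty -1, B is a man with beauty -0.5
# (both are native English speakers, so the nonenglish term drops out)
mu_A <- b0 + b_beauty * (-1) + b_female * 1
mu_B <- b0 + b_beauty * (-0.5) + b_female * 0

# Predictive draws: expected value plus normally distributed error
sim_A <- rnorm(n_sims, mean = mu_A, sd = sigma)
sim_B <- rnorm(n_sims, mean = mu_B, sd = sigma)

# Share of simulations in which A scores higher than B
prob_A_higher <- mean(sim_A > sim_B)
prob_A_higher
```

Because the predictive standard deviation is large relative to the gap between the two expected evaluations, A still comes out ahead in a sizeable share of the draws even though B has the higher expected score.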