---
title: "Sesame Street Case Study"
author: 'Matthew Oberholzer, Alex Chen, Steven Fontanella'
date: 'September 22, 2019'
output:
  pdf_document:
    toc: yes
    toc_depth: '2'
---
```{r, include=FALSE}
# wrap each plot's LaTeX output in blank lines so figures sit in their own paragraphs
my_plot_hook <- function(x, options)
paste("\n", knitr::hook_plot_tex(x, options), "\n")
knitr::knit_hooks$set(plot = my_plot_hook)
```
```{r Front Matter, include=FALSE}
# clean up & set default chunk options
rm(list = ls())
knitr::opts_chunk$set(echo = FALSE)
library(readxl)
library(corrplot)
sesame <- read_excel("sesame.xlsx")
sesame$viewcat <- as.factor(sesame$viewcat)
sesame$site <- as.factor(sesame$site)
sesame[sesame$sex == 2, ]$sex <- 0 #male = 1, female = 0#
sesame <- sesame[-c(5, 163),] #deleting the miscoded rows#
```
# Project Description
*A client named Dr. Education recently began working for the production company Children's Television Workshop and was tasked with presenting the findings of a study he was not involved with. Prior to Dr. Education's hiring, the production company hired Educational Testing Services to collect data on the effectiveness of its show Sesame Street. Little information was provided about the data collection techniques, other than that the subjects were children aged 3 to 5 sampled from 5 different sites around the United States. This sampling approach suggests the study was observational, meaning no causation can be inferred. Now that the data have been collected, we can analyze how effective Sesame Street is overall and which subject areas it teaches most effectively. The analysis aims to characterize the association between the amount of Sesame Street watched and children's learning. Because the data were not randomly sampled from the entire population but rather from specific sites, it will be difficult to generalize the results to the entire United States population.*
## Research Questions
**Question 1:** Does Sesame Street improve children’s knowledge of letters, numbers, and body parts?
**Question 2:** What, if any, area should the client focus on for improvement? E.g., is the client better at teaching letters than they are at numbers?
## Statistical Questions
**Question 1:** Is the population slope of test improvement versus Sesame Street viewing significantly different from zero for each test? That is,
$$\beta_{\text{letters}} \stackrel{?}{=} 0$$
$$\beta_{\text{numbers}} \stackrel{?}{=} 0$$
$$\beta_{\text{body}} \stackrel{?}{=} 0$$
**Question 2:** Are the mean test improvements different between the three tests?
$$\mu_{\text{letters}} \stackrel{?}{=} \mu_{\text{numbers}} \stackrel{?}{=} \mu_{\text{body}}$$
## Variables
### Response variable
Post-test minus pre-test (improvement score): this quantifies each child's improvement over the 6-month period of watching Sesame Street, computed separately for each of the three tests observed: letters, body parts, and numbers. Because the scales differ across tests, scores are standardized so they can be compared for the second research question. The unit of measure for this variable is points on the test.
### Possible explanatory variables
Site: coded as 1, 2, 3, 4, or 5 according to the aforementioned sampling sites.
Viewcat: viewing category, coded from 1 (the child rarely watched the show) to 4 (the child watched the show an average of more than 5 times per week).
Peabody: the child's mental age score from the administration of the Peabody Picture Vocabulary Test, a measure of vocabulary maturity.
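As a quick sanity check on these candidate predictors, their coding and distribution can be inspected directly (a minimal sketch; column names are assumed to match those in `sesame.xlsx`):
```{r, eval = FALSE}
# inspect the coding of the candidate explanatory variables
str(sesame[, c("site", "viewcat", "peabody")])
# how viewing frequency is distributed across the five sampling sites
table(sesame$site, sesame$viewcat)
```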
# Exploratory Data Analysis (EDA)
```{r, include=FALSE}
# normalize by max score
# work on a copy so the raw scores in `sesame` stay untouched
data <- sesame
data$prebody <- data$prebody / 32
data$postbody <- data$postbody / 32
data$prelet <- data$prelet / 58
data$postlet <- data$postlet / 58
data$prenumb <- data$prenumb / 54
data$postnumb <- data$postnumb / 54
```
```{r, echo=FALSE}
# define ys
data$ybody <- data$postbody - data$prebody
data$ynumb <- data$postnumb - data$prenumb
data$ylet <- data$postlet - data$prelet
ys <- list("body"=data$ybody, "number"=data$ynumb, "letter"=data$ylet)
```
Before any analysis, we remove two observations whose recorded scores exceed the reported maximum score of the test. No bonus points were given on these tests, so we assume this was a data-entry error. We also normalize all test scores by their maximums, bringing them into the range $[0, 1]$; this lets us compare the effectiveness of Sesame Street across the three focus areas on a common scale.
For ease of analysis, we define:
$$y_{\text{body}} = \text{postbody} - \text{prebody}$$
$$y_{\text{numb}} = \text{postnumb} - \text{prenumb}$$
$$y_{\text{let}} = \text{postlet} - \text{prelet}$$
We start by examining plots of the relationship between Sesame Street viewing frequency (`viewcat`) and each $y$ value. At first glance, we see little to no improvement across `viewcat` on the body test, some improvement on the number test, and larger improvement on the letter test.
```{r, echo=FALSE}
x <- data$viewcat
m <- mapply(function(name, y){
boxplot(y ~ x, xlab="Viewing Category", ylab=paste("Difference in", name, "score"), main=paste("Viewing category vs. change in", name, "score"))
}, names(ys), ys)
```
We also examine the $y$ values independently of `viewcat`. This time we see that improvement on the body test is lower overall, even without accounting for Sesame Street viewing. Additionally, at least 75% of students improved their scores on each test subject.
```{r, echo=FALSE}
# numerical summaries
mapply(function(name, y){
summary(y)
}, names(ys), ys)
```
```{r, echo=FALSE}
m <- mapply(function(name, y){
hist(y, main=paste("Histogram of", name, "improvements"), xlab=paste(name, "improvement (proportion of max score)"), xlim=c(-.8, .8), ylim=c(0,60))
}, names(ys), ys)
```
Two variables were undocumented: `regular` and `encour`. We found that
$$
regular = \begin{cases}
0, & viewcat = 1 \\
1, & viewcat > 1
\end{cases}
$$
and
$$
encour = \begin{cases}
0, & viewenc = 2 \\
1, & viewenc = 1
\end{cases}
$$
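A minimal sketch of how these codings can be checked against the documented variables (this assumes `viewenc` is a column in the data, as the definition above suggests):
```{r, eval = FALSE}
# cross-tabulations to verify the coding of the undocumented variables
table(regular = sesame$regular, viewcat = sesame$viewcat)
table(encour = sesame$encour, viewenc = sesame$viewenc)
```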
# Statistical Analysis
We fit three multiple linear regression models, one for each of our focus areas (letters, numbers, and body parts).
```{r, echo = FALSE}
# fit one multiple regression per improvement score, each on the normalized scale
numbMod <- lm(ynumb ~ site + age + setting + prenumb + viewcat, data = data)
bodyMod <- lm(ybody ~ site + sex + age + prebody + viewcat, data = data)
letMod  <- lm(ylet ~ site + age + prelet + viewcat, data = data)
```
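As an illustration, the numbers model has the form below, where `site` and `viewcat` enter as dummy-coded factors with site 1 and viewing category 1 as the baseline levels; the body-parts and letters models are analogous, with their respective pre-test scores and the covariate sets shown in the code above.
$$
y_{\text{numb}} = \beta_0 + \sum_{s=2}^{5}\beta_{\text{site}_s}\,\mathrm{I}(\text{site}=s) + \beta_{\text{age}}\,\text{age} + \beta_{\text{setting}}\,\text{setting} + \beta_{\text{prenumb}}\,\text{prenumb} + \sum_{v=2}^{4}\beta_{\text{viewcat}_v}\,\mathrm{I}(\text{viewcat}=v) + \varepsilon
$$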
## Assumptions of the multiple linear regression models
### Linearity - a scatterplot of the residuals versus the fitted values helps determine whether the data can be linearly related
If the scatterplot of the residuals versus the fitted values shows that the average of the residuals stays close to 0, the linearity assumption is met. Refer to Figure 1.
### Independence - an assumption that each sample/row in the data is collected separately from the other samples/rows
The details of the sampling method for these data are not known, so we cannot confirm that the data were independently sampled. However, because the residuals appear random and unpredictable, we see little evidence that the data are not independent.
### Normality - checks that the sample data have been drawn from a normally distributed population
The residuals, when plotted, should exhibit an approximately normal distribution. A Q-Q plot can be drawn to check whether this assumption is met; the points should lie close to the reference line. Refer to Figure 2.
### Equivariance - the variance of the residuals should be homogeneous so that the spread does not change across fitted values
A scatterplot of the residuals versus the fitted values should show no pattern; the points should be randomly scattered for the equivariance assumption to be met. Refer to Figure 3.
## Interpretation of results
For interpretation, each coefficient can be converted back to the original point scale by multiplying by the maximum score used for normalization: multiply the body-parts coefficients by 32, the letters coefficients by 58, and the numbers coefficients by 54.
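A minimal sketch of this rescaling applied to the full coefficient vectors (the models were fit on scores normalized to $[0, 1]$, so multiplying by the maximum score returns each coefficient to raw test points):
```{r, eval = FALSE}
# convert coefficients from the normalized scale back to raw test points
coef(bodyMod) * 32 # body-parts test (maximum score 32)
coef(letMod) * 58  # letters test (maximum score 58)
coef(numbMod) * 54 # numbers test (maximum score 54)
```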
### Numbers
```{r, echo = FALSE}
summary(numbMod)
```
From the p-value of `site2`, a child from site 2 appears significantly different from a child in site 1, by about 5 points in the numbers improvement score (post-test minus pre-test). The other significant explanatory variables are the numbers pre-test score and viewing category. Each additional point a child scored on the numbers pre-test lowered the improvement score by about 0.4 points. Lastly, the more a child watched Sesame Street, the greater their score improvement: relative to children who watched the least (viewing category 1), children in viewing categories 2, 3, and 4 improved by approximately 5, 8, and 9 more points, respectively.
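The point estimates above can be supplemented with interval estimates; a minimal sketch for the numbers model, with the intervals rescaled to raw test points:
```{r, eval = FALSE}
# 95% confidence intervals for the numbers model, rescaled to raw test points
confint(numbMod) * 54
```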
### Body
```{r, echo = FALSE}
summary(bodyMod)
```
The statistically significant predictors of the improvement in body-parts score are site, age, the body-parts pre-test, and viewing category. Sites 2, 4, and 5 all appear better than site 1, with improvement scores approximately 2 points higher. For age, since the p-value is less than $\alpha=0.05$, we reject the null hypothesis that the coefficient of age is 0 and conclude that age is a significant predictor: an additional month of age increases the improvement score by about 0.09 points. For the body-parts pre-test, each additional point a child scored on the pre-test is associated with a 0.62-point decrease in the improvement score. Lastly, viewing categories 2, 3, and 4 are significantly different from viewing category 1. A child in viewing category 2 will have an improvement score about 2.66 points higher than a child in viewing category 1; similarly, children in viewing categories 3 and 4 will have improvement scores approximately 4 and 4.5 points higher, respectively, than a child in category 1.
### Letters
```{r, echo = FALSE}
summary(letMod)
```
Similar to the other areas, the significant predictors of the letters improvement score are site, age, the letters pre-test, and viewing category. Within site, the significant levels are site 2 and site 3: a child from site 2 scores about 7.714 points higher in improvement than a child from site 1, while a child from site 3 scores about 5 points lower than a child from site 1. In addition, one more month of age is associated with about a 0.255-point increase in the improvement score. For every point a child scores on the letters pre-test, the improvement score is about 0.4 points lower. Lastly, all levels of viewing category are significant: a child in viewing category 2 scores about 6 points higher in improvement than a child in viewing category 1, and children in viewing categories 3 and 4 score about 12 points higher than a child in viewing category 1.
# Conclusion
## Research Questions revisited
### 1) Does Sesame Street improve children’s knowledge of letters, numbers, and body parts?
All models have a significant viewing-category slope at an alpha level of 0.01. We conclude that increased watching of Sesame Street is significantly associated with improvement in children's knowledge of letters, numbers, and body parts, although we cannot account for potential confounding factors. These limitations are discussed further below.
### 2) What, if any, area should the client focus on for improvement? E.g., is the client better at teaching letters than they are at numbers?
Based on the normalized improvement scores, the area the client should focus on is body parts. A child in viewing category 4 shows an increase of approximately 0.14 in the normalized improvement score for body parts, compared with increases of approximately 0.17 and 0.21 for numbers and letters, respectively. Thus, the client should focus on improving how Sesame Street teaches body parts.
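These normalized viewing-category-4 effects can be read directly off the fitted models; a minimal sketch:
```{r, eval = FALSE}
# normalized viewing-category-4 coefficients referenced above, one per model
c(body = unname(coef(bodyMod)["viewcat4"]),
  number = unname(coef(numbMod)["viewcat4"]),
  letter = unname(coef(letMod)["viewcat4"]))
```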
## Recommendations
**Question 1:**
Our client should conduct further studies attempting to account for potential confounding factors such as income level before making a causal connection between Sesame Street and learning.
**Question 2:**
Our client should focus on body parts, which showed the lowest improvement among the focus subject areas.
## Scope of inference and Study limitations
We note that because the data were observed rather than derived from an experiment, a causal link cannot be established. For example, there may be differences between children who view Sesame Street regularly and those who do not, such as household income. Socioeconomic status was controlled for in this study; however, because the sampling method of these groups is unknown, there may be bias in the groups. We recommend that more investigation be performed on the subject.
# Technical Appendix
## Model Assumptions
### Linearity
#### Figure 1
```{r, include = TRUE}
plot(numbMod, which = c(1), main = "Residuals vs. Fitted for Numbers")
plot(bodyMod, which = c(1), main = "Residuals vs. Fitted for Body Parts")
plot(letMod, which = c(1), main = "Residuals vs. Fitted for Letters")
```
Since all three plots show the average of the residuals staying close to 0 across the fitted values, the linearity assumption is met.
### Normality
#### Figure 2
```{r, include = TRUE}
plot(numbMod, which = c(2), main = "Q-Q plot for Numbers")
plot(bodyMod, which = c(2), main = "Q-Q plot for Body Parts")
plot(letMod, which = c(2), main = "Q-Q plot for Letters")
```
Since none of the plots show points that stray too far from the normal reference line, the normality assumption is met.
### Equivariance
#### Figure 3
```{r, include = TRUE}
plot(numbMod, which = c(1), main = "Residuals vs. Fitted for Numbers")
plot(bodyMod, which = c(1), main = "Residuals vs. Fitted for Body Parts")
plot(letMod, which = c(1), main = "Residuals vs. Fitted for Letters")
```
Since none of the plots show any pattern in the residuals versus the fitted values, the equivariance assumption is met.
### R Script
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(car)
```
```{r, eval = FALSE}
library(readxl)
library(corrplot)
sesame <- read_excel("sesame.xlsx")
sesame$viewcat <- as.factor(sesame$viewcat)
sesame$site <- as.factor(sesame$site)
sesame[sesame$sex == 2, ]$sex <- 0 #male = 1, female = 0#
sesame <- sesame[-c(5, 163),] #deleting the miscoded rows#
```
## Creating a subset matrix with only the variables that we care about
```{r, eval = FALSE}
new_sesame <- sesame[, !colnames(sesame) %in% c("agecat", "viewenc", "viewcat",
                                                "preform", "prerelat", "preclasf",
                                                "postform", "postrelat", "postclasf")]
summary(new_sesame)
```
Correlation matrix of our subset matrix.
```{r, eval = FALSE}
# keep only the numeric columns; `site` is a factor and would break cor()
NewSES <- cor(new_sesame[, sapply(new_sesame, is.numeric)])
corrplot(NewSES)
```
## Histograms
```{r, eval = FALSE}
hist(data$ynumb) #slight left skew#
hist(data$ybody) #fairly normal#
hist(data$ylet) #not normal, slight right skew#
hist(sesame$peabody) #right skew with a very low outlier#
```
## QQ-Plots
```{r, eval = FALSE}
qqnorm(data$ynumb); qqline(data$ynumb) #fairly normal with some outliers in the left tail#
qqnorm(data$ybody); qqline(data$ybody) #fairly normal with some outliers in both tails#
qqnorm(data$ylet); qqline(data$ylet) #not really normal#
```
## Preliminary Regression Models
### Model for ynumb
```{r, eval = FALSE}
ynumb <- data$ynumb #improvement score for numbers, defined in the EDA section#
lm_ynumb <- lm(ynumb ~ sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$encour + sesame$viewcat)
#summary(lm_ynumb)#
lm_ynumb1 <- lm(ynumb ~ sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$encour + sesame$viewcat)
#summary(lm_ynumb1)#
lm_ynumb2 <- lm(ynumb ~ sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$viewcat)
#summary(lm_ynumb2)#
lm_ynumb3 <- lm(ynumb ~ sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$viewcat)
#summary(lm_ynumb3)#
lm_ynumb4 <- lm(ynumb ~ sesame$setting + sesame$prenumb + sesame$peabody + sesame$viewcat) #all predictors here are significant at alpha = 0.1; setting is the only one not significant at alpha = 0.05 (its p-value is 0.0666)#
#summary(lm_ynumb4)#
lm_ynumb5 <- lm(ynumb ~ sesame$prenumb + sesame$peabody + sesame$viewcat) #all predictors are significant here at 0.05 level#
summary(lm_ynumb5)
#Model with interaction#
lm_ynumb_int <- lm(ynumb ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$encour + sesame$viewcat)*sesame$sex)
#summary(lm_ynumb_int)#
lm_ynumb_int1 <- lm(ynumb ~ (sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$encour + sesame$viewcat)*sesame$sex + sesame$site)
#summary(lm_ynumb_int1)#
lm_ynumb_int2 <- lm(ynumb ~ (sesame$sex + sesame$age + sesame$setting + sesame$prenumb + sesame$peabody + sesame$viewcat)*sesame$sex + sesame$site + sesame$encour)
#summary(lm_ynumb_int2)#
lm_ynumb_int3 <- lm(ynumb ~ (sesame$sex + sesame$setting + sesame$prenumb + sesame$peabody + sesame$viewcat)*sesame$sex + sesame$site + sesame$encour + sesame$age)
#summary(lm_ynumb_int3)#
lm_ynumb_int4 <- lm(ynumb ~ (sesame$sex + sesame$setting + sesame$prenumb + sesame$viewcat)*sesame$sex + sesame$site + sesame$encour + sesame$age + sesame$peabody)
#summary(lm_ynumb_int4)#
lm_ynumb_int5 <- lm(ynumb ~ (sesame$sex + sesame$prenumb + sesame$viewcat)*sesame$sex + sesame$site + sesame$encour + sesame$age + sesame$peabody + sesame$setting)
#summary(lm_ynumb_int5)#
lm_ynumb_int6 <- lm(ynumb ~ (sesame$sex + sesame$prenumb)*sesame$sex + sesame$site + sesame$encour + sesame$age + sesame$peabody + sesame$setting + sesame$viewcat)
#summary(lm_ynumb_int6)#
lm_ynumb_int7 <- lm(ynumb ~ (sesame$sex + sesame$prenumb)*sesame$sex + sesame$site + sesame$age + sesame$peabody + sesame$setting + sesame$viewcat)
#summary(lm_ynumb_int7)#
lm_ynumb_int8 <- lm(ynumb ~ (sesame$sex + sesame$prenumb)*sesame$sex + sesame$age + sesame$peabody + sesame$setting + sesame$viewcat)
#summary(lm_ynumb_int8)#
lm_ynumb_int9 <- lm(ynumb ~ (sesame$sex + sesame$prenumb)*sesame$sex + sesame$peabody + sesame$setting + sesame$viewcat)
#summary(lm_ynumb_int9)#
lm_ynumb_int10 <- lm(ynumb ~ (sesame$sex + sesame$prenumb)*sesame$sex + sesame$peabody + sesame$viewcat)
summary(lm_ynumb_int10) #every predictor is significant except for sex, but its interaction is significant#
#Comparing AIC
AIC(lm_ynumb_int10, lm_ynumb5)
#Using step function to select a model
select_ynumb <- step(lm_ynumb)
summary(select_ynumb)
select_ynumbint <- step(lm_ynumb_int)
summary(select_ynumbint)
#comparing all AICs#
AIC(select_ynumb, lm_ynumb_int10, select_ynumbint)
#Looking at VIFs#
vif(lm_ynumb_int10)
vif(select_ynumb)
vif(select_ynumbint)
#Plotting selected model#
plot(select_ynumb)
```
`select_ynumbint` has the smallest AIC, but some of its VIF values are greater than 4, indicating multicollinearity issues in that model. Model `select_ynumb` has a similar AIC, with all VIF values smaller than 2, so we choose model `select_ynumb`.
### Model for ylet
```{r, eval = FALSE}
ylet <- data$ylet #improvement score for letters, defined in the EDA section#
lm_ylet <- lm(ylet ~ sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prelet + sesame$peabody + sesame$encour + sesame$viewcat)
#summary(lm_ylet)#
lm_ylet1 <- lm(ylet ~ sesame$site + sesame$sex + sesame$age + sesame$prelet + sesame$peabody + sesame$encour + sesame$regular)
#summary(lm_ylet1)#
lm_ylet2 <- lm(ylet ~ sesame$site + sesame$sex + sesame$age + sesame$prelet + sesame$peabody + sesame$regular)
#summary(lm_ylet2)#
lm_ylet3 <- lm(ylet ~ sesame$site + sesame$age + sesame$prelet + sesame$peabody + sesame$regular)
#summary(lm_ylet3)#
lm_ylet4 <- lm(ylet ~ sesame$site + sesame$prelet + sesame$peabody + sesame$regular) #all predictors are significant at a level of 0.1, except site#
#summary(lm_ylet4)#
lm_ylet5 <- lm(ylet ~ sesame$prelet + sesame$peabody + sesame$regular) #all predictors are significant at an alpha level of 0.05#
summary(lm_ylet5)
#Model with interaction#
lm_ylet_int <- lm(ylet ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prelet + sesame$peabody + sesame$encour + sesame$regular)*sesame$sex)
#summary(lm_ylet_int)#
lm_ylet_int1 <- lm(ylet ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prelet + sesame$peabody + sesame$regular)*sesame$sex + sesame$encour)
#summary(lm_ylet_int1)#
lm_ylet_int2 <- lm(ylet ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prelet + sesame$peabody)*sesame$sex + sesame$encour + sesame$regular)
#summary(lm_ylet_int2)#
lm_ylet_int3 <- lm(ylet ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$peabody)*sesame$sex + sesame$encour + sesame$regular + sesame$prelet)
#summary(lm_ylet_int3)#
lm_ylet_int4 <- lm(ylet ~ (sesame$site + sesame$sex + sesame$age + sesame$setting)*sesame$sex + sesame$encour + sesame$regular + sesame$prelet + sesame$peabody)
#summary(lm_ylet_int4)#
lm_ylet_int5 <- lm(ylet ~ (sesame$sex + sesame$age + sesame$setting)*sesame$sex + sesame$encour + sesame$regular + sesame$prelet + sesame$peabody + sesame$site)
#summary(lm_ylet_int5)#
lm_ylet_int6 <- lm(ylet ~ (sesame$sex + sesame$age + sesame$setting)*sesame$sex + sesame$regular + sesame$prelet + sesame$peabody + sesame$site)
#summary(lm_ylet_int6)#
lm_ylet_int7 <- lm(ylet ~ (sesame$sex + sesame$age)*sesame$sex + sesame$regular + sesame$prelet + sesame$peabody + sesame$site + sesame$setting)
#summary(lm_ylet_int7)#
lm_ylet_int8 <- lm(ylet ~ (sesame$sex + sesame$age)*sesame$sex + sesame$regular + sesame$prelet + sesame$peabody + sesame$site)
#summary(lm_ylet_int8)#
lm_ylet_int9 <- lm(ylet ~ sesame$sex + sesame$regular + sesame$prelet + sesame$peabody + sesame$site)
#summary(lm_ylet_int9)#
lm_ylet_int10 <- lm(ylet ~ sesame$regular + sesame$prelet + sesame$peabody + sesame$site)
#summary(lm_ylet_int10)#
lm_ylet_int11 <- lm(ylet ~ sesame$regular + sesame$prelet + sesame$peabody)
summary(lm_ylet_int11)
#checking for interaction turned out to be unnecessary since all interaction terms were insignificant.#
#Using step() function to find best models#
select_ylet <- step(lm_ylet)
summary(select_ylet)
select_yletint <- step(lm_ylet_int) #checking if step() function found any interactions to be significant, it did not#
summary(select_yletint)
#plotting best model#
plot(select_ylet)
#Comparing AICs#
AIC(select_ylet, lm_ylet5)
#Checking VIFs
vif(select_ylet)
```
By comparing the AIC of the model selected by the step() function with that of the model we found through backward stepwise elimination, we find our best model to be `select_ylet`.
### Model for ybody
```{r, eval = FALSE}
ybody <- data$ybody #improvement score for body parts, defined in the EDA section#
lm_ybody <- lm(ybody ~ sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prebody + sesame$peabody + sesame$encour + sesame$viewcat)
#summary(lm_ybody)#
lm_ybody1 <- lm(ybody ~ sesame$site + sesame$sex + sesame$age + sesame$prebody + sesame$peabody + sesame$encour + sesame$viewcat)
#summary(lm_ybody1)#
lm_ybody2 <- lm(ybody ~ sesame$site + sesame$sex + sesame$age + sesame$prebody + sesame$peabody + sesame$viewcat)
#summary(lm_ybody2)#
lm_ybody3 <- lm(ybody ~ sesame$site + sesame$sex + sesame$age + sesame$prebody + sesame$viewcat)
#summary(lm_ybody3)#
lm_ybody4 <- lm(ybody ~ sesame$site + sesame$sex + sesame$prebody + sesame$viewcat)
#summary(lm_ybody4)#
lm_ybody5 <- lm(ybody ~ sesame$site + sesame$prebody + sesame$viewcat)#all predictors are significant at an alpha level of 0.05#
summary(lm_ybody5)
#Model with interaction#
lm_ybody_int <- lm(ybody ~ (sesame$site + sesame$sex + sesame$age + sesame$setting + sesame$prebody + sesame$peabody + sesame$encour + sesame$viewcat)*sesame$sex)
#summary(lm_ybody_int)#
lm_ybody_int1 <- lm(ybody ~ (sesame$sex + sesame$age + sesame$setting + sesame$prebody + sesame$peabody + sesame$encour + sesame$viewcat)*sesame$sex + sesame$site)
#summary(lm_ybody_int1)#
lm_ybody_int2 <- lm(ybody ~ (sesame$sex + sesame$age + sesame$setting + sesame$prebody + sesame$encour + sesame$viewcat)*sesame$sex + sesame$site + sesame$peabody)
#summary(lm_ybody_int2)#
lm_ybody_int3 <- lm(ybody ~ (sesame$sex + sesame$age + sesame$setting + sesame$encour + sesame$viewcat)*sesame$sex + sesame$site + sesame$peabody + sesame$prebody)
#summary(lm_ybody_int3)#
lm_ybody_int4 <- lm(ybody ~ (sesame$sex + sesame$age + sesame$setting + sesame$encour)*sesame$sex + sesame$site + sesame$peabody + sesame$prebody + sesame$viewcat)
#summary(lm_ybody_int4)#
lm_ybody_int5 <- lm(ybody ~ (sesame$sex + sesame$setting + sesame$encour)*sesame$sex + sesame$site + sesame$peabody + sesame$prebody + sesame$viewcat + sesame$age)
#summary(lm_ybody_int5)#
lm_ybody_int6 <- lm(ybody ~ (sesame$sex + sesame$encour)*sesame$sex + sesame$site + sesame$peabody + sesame$prebody + sesame$viewcat + sesame$age + sesame$setting)
#summary(lm_ybody_int6)#
lm_ybody_int7 <- lm(ybody ~ (sesame$sex + sesame$encour)*sesame$sex + sesame$site + sesame$peabody + sesame$prebody + sesame$viewcat + sesame$age)
#summary(lm_ybody_int7)#
lm_ybody_int8 <- lm(ybody ~ sesame$sex + sesame$site + sesame$peabody + sesame$prebody + sesame$viewcat + sesame$age)
#summary(lm_ybody_int8)#
lm_ybody_int9 <- lm(ybody ~ sesame$sex + sesame$site + sesame$prebody + sesame$viewcat + sesame$age)
#summary(lm_ybody_int9)#
lm_ybody_int10 <- lm(ybody ~ sesame$sex + sesame$site + sesame$prebody + sesame$viewcat)
#summary(lm_ybody_int10)#
lm_ybody_int11 <- lm(ybody ~ sesame$site + sesame$prebody + sesame$viewcat)
summary(lm_ybody_int11)
#No interactions were significant, so we arrived at the same model without any interaction terms#
#Using step() function to find model#
select_ybody <- step(lm_ybody) #step() function found the same model I did#
summary(select_ybody)
select_ybodyint <- step(lm_ybody_int) #step() function did not find any significant interactions either#
summary(select_ybodyint)
#Checking AICs#
AIC(select_ybody, lm_ybody5)
#Checking VIFs#
vif(select_ybody)
#Plotting our selected model#
plot(select_ybody)
```
The best model we found was `select_ybody`.