--- title: "Logistic Regression: GSS Analysis Lab" output: html_notebook --- # Tutorial 12: GLR For this lab on logistic regression, you will work through each of the chunks and respond to the questions. You should knit the lab to html (not Word). Please submit your knitted html and Rmd to Sakai by noon on November 21st. We are using dta from the General Social Survey (GSS) which is a survey that occurs every few years and provides researchers and policymakers a better understanding of Americans views on a variety of policy issues. Please read more about the GSS here: https://gss.norc.org/About-The-GSS. **The central purpose of our modeling is to determine if there is a relationship between time spent in nature, access to nature and beliefs about spending on the environment.** #### Response Variable #### **natenvir** is a variable that measures the response to the question: I would like to talk with you about some things people think about today. We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount: Improving the environment." #### Explanatory Variables of Interest #### **nattime** is a variable that captures the response to the question: Usually, I spend time in natural environments, such as public parks, gardens or trails, at least once a week. **nataccess** is a variable that captures the response to the question: I have easy access to natural environments, such as public parks, gardens or trails. Before you get started, please install the package "readstata13". This package allows us to read in data from Stata (.dta).I downloaded the GSS data in .dta form from the website: https://gss.norc.org/get-the-data/stata. ```{r library} library(ggplot2) library(readstata13) library(dplyr) ``` The following chunk reads in the GSS data from the year 2018 and makes a dataframe which I have named gss.df. You will receive some messages about duplicated factor levels. That's okay--please ignore. ```{r} gss.df<-read.dta13("GSS2018.dta", generate.factors=T, nonint.factors=T) head(gss.df) ``` In the chunk below, look at the structure of the gss dataframe and answer the following questions. Please answer in the white space below each of the questions. **Question 1** How many observations are in the dataset? (5 point) **Question 2** How many variables are in the dataset? (5 point) ```{r} ``` We are interested in a subset of variables, so we will make a smaller dataframe of the variables of interest using the dplyr package. We will call the new dataframe gss_subset. Please see the metadata and the questionnaire in the lab folder on Sakai for these variables. You can also easily search for the variable by name at this website to get the information. https://gssdataexplorer.norc.org/variables/vfilter?doslider=1&keyword=&page_v=8&search_type=&subjects=&years=&yrmax=2018&yrmin=2018#var_459 natenvir (response variable) nattime educ sex race income16 pres16 age extra variables that we won't use, but you can play around with. natpark natsci natenrgy hunt1 hrs2 ```{r} gss_select<-gss.df %>% dplyr::select( "natenvir", "nattime", "nataccess", "natpark", "natenrgy", "natsci", "confed", "consci", "hunt1", "hrs2", "age", "sex", "racecen1", "hispanic", "educ", "income16", "pres16") head(gss_select) ``` **Question 3** In the chunk below, use the levels() function to determine the levels of the response variable natenvir. What are the levels of the variable natenvir? What are the levels of the two explanatory variables of interest (nattime, nataccess)? (5 points) ```{r} ``` **Question 4** Below you will make a basic barplot of the variable natenvir. Properly label and title the plot in the chunk below. Add color to make the plot attractive. Check out STHDA bar plots for more information (http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization). [You can earn five bonus points if you figure out how to developed a stacked bar chart of multiple varibles in the same figure.] (5 points) ```{r barplot} ggplot(data=gss_select, aes(x=natenvir))+ geom_bar() ``` We are interested in developing a model to explain the variation in beliefs about spending on the environment (the variable natenvir). To make things easier (although I typically wouldn't recommend this), we will recode the variable into two categories so we can use a logistic model. We will make a binary "too little" and "enough or too much". To do this, we need to recode the variable. we will use the dplyr function recode. Read more about recode here: https://dplyr.tidyverse.org/reference/recode.html. Before I recode, I often like to see the counts of each category. https://dplyr.tidyverse.org/reference/count.html. Using the pipe and the count() function, R calculates how many respondents said that spending on environment is about right, too little or too much. ```{r counts} gss_select %>% count(natenvir) ``` **Question 5** In the same way we calculated the counts for natenvir above, in the chunk below, please calculate the counts for the variables nattime and nataccess.Describe the frequencies in a sentence or two. (5 points) ```{r} ``` Let's now calculate the percent of respondents across gender and how they responded to the question about funding to improve the environment (natenvir). We can do this using the table() and prop.table() functions in base R. The function table() gives us counts of each of the levels across the two variables. The function prop.table() calculates the proportion. I calculated the proportion by column, so i set the argument as 2. Please read more about prop.table() in the help. What happens if you switch the argument in prop.table to 1? ```{r} table.gss<-table(gss_select$natenvir, gss_select$sex) table.gss prop.table.gss<-prop.table(table.gss,2) prop.table.gss ``` **Question 6** In the chunk below, calculate the proportion of women who spend various amounts of time in nature. Remember the text of the question is: Usually, I spend time in natural environments, such as public parks, gardens or trails, at least once a week. Discuss the proportions in a sentence. (5 points) ```{r} ``` Now that we know the counts and proportions, let's recode the variable natenvir into a binary variable so we can use a logistic model. I will call the new variable natenvir.binary. As you see, there are three extraneous levels (DK, IAP, NA). None of the three have any observations. I will code "too much" as a "1", "about right" as a "1" and too little as a "0". We want the variable to be a factor, so we will make it as a factor as well. ```{r} gss_select$natenvir.binary<-as.factor(dplyr::recode(gss_select$natenvir, "too much"= 1, "about right" = 1, "too little" = 0)) gss_select %>% count(natenvir.binary) ``` **Question 7** Before we make the model, please run counts on each of the additional explanatory variables. Discuss why it may be challenging to include the variables as they currently are in the model. Also discuss any shortcomings for how these variables are measured (5 points). ```{r} gss_select %>% count(nattime) gss_select %>% count(nataccess) gss_select %>% count(sex) gss_select %>% count(racecen1) gss_select %>% count(hispanic) gss_select %>% count(income16) gss_select %>% count(pres16) gss_select %>% count(educ) gss_select %>% count(age) ``` Now that we have seen the data, we will need to clean it up a bit to get it ready for our model. We can change the variables age and educ into integers (numeric) to include into our model. For ease of interpretation (not ideal, but fine for our purposes), we will treat responses to nattime (time in nature) and nataccess (access to nature) as continuous numeric data. Run the chunk below and see how R recoded the factor variables to numeric variables. ```{r recode} gss_select$age<-as.numeric(gss_select$age) #makes the variable age numeric gss_select$educ<-as.numeric(gss_select$educ) #makes education numeric gss_select$nattime.int<-as.numeric(gss_select$nattime) #makes nattime numeric gss_select$nataccess.int<-as.numeric(gss_select$nataccess) #makes nataccess numeric head(gss_select$nattime) head(gss_select$nattime.int) head(gss_select$nataccess) head(gss_select$nataccess.int) ``` As you saw above, income has multiple categories. We probably don't want to include that many variables in our model. We will recode the income variable into three levels (low, medium, high). It is challenging to determine what should be considered low, medium and high income since incomes vary across regions in the US and a livable income depends on the number of household members, among other factors. For the sake of our model, We will consider high income anything over $110,000. Medium income between $50,000 and $109999 and low income below $50,000. I did the recoding for you. You're welcome (you owe me a beer =) ). ```{r} gss_select$income16.recode<-dplyr::recode(gss_select$income16, "under $1 000" = "low", "$1 000 to 2 999" = "low", "$3 000 to 3 999" = "low", "$4 000 to 4 999" = "low", "$5 000 to 5 999" = "low", "$6 000 to 6 999" = "low", "$7 000 to 7 999" = "low", "$8 000 to 9 999" = "low", "$10000 to 12499" = "low", "$12500 to 14999" = "low", "$15000 to 17499" = "low", "$17500 to 19999" = "low", "$20000 to 22499" = "low", "$22500 to 24999" = "low", "$25000 to 29999" = "low", "$30000 to 34999" = "low", "$35000 to 39999" = "low", "$40000 to 49999" = "low", "$50000 to 59999" = "medium", "$60000 to 74999" = "medium", "$75000 to $89999" = "medium", "$90000 to $109999" = "medium", "$110000 to $129999" = "high", "$130000 to $149999" = "high", "$150000 to $169999" = "high", "$170000 or over" = "high" ) gss_select %>% count(income16.recode) ``` **Question 8** Now we will look at the variable racecen1. Please run the chunk below to see the counts of the racecen1 variable. Why is it challenging and not ideal to include the variable 'as is' in a model as an explanatory variable? (5 points) ```{r} gss_select %>% count(racecen1) ``` As you see, many Asian ethnicities are represented in the sample. For the purpose of this model (and I recognize that this isn't ideal), I will recode the racecen1 variable to group Asian ethnicities together.There are also 28 respondents who listed their race as 'some other race'. I will combine Guamanian and Other Pacific Islander with this group (again, not ideal). ```{r} gss_select$racecen1.recode<-dplyr::recode(gss_select$racecen1, "asian indian" = "Asian", "chinese" = "Asian", "filipino" = "Asian", "japanese" = "Asian", "korean" = "Asian", "vietnamese" = "Asian", "other asian" = "Asian", "guamanian or chamorro" = "some other race", "other pacific islander" = "some other race") gss_select %>% count(racecen1.recode) ``` And now we will recode the pres16 variable. If respondents reported voting for President Trump in 2016, we will recode the variable a 1, all else a zero. Here I am using the function ifelse(). The new recoding variable is called pres16.recode. ```{r} gss_select$pres16.recode<-ifelse(gss_select$pres16 == "trump", 1, 0) gss_select %>% count(pres16) gss_select %>% count(pres16.recode) ``` Let's now make our model to predict natenvir.binary. We want to explain the response variable (natenvir.binary) with variables of nattime and nataccess, controlling for age, gender, income, race, Hispanic ethnicity, and political ideology (proxy: if they voted for Trump in 2016). Remember that our response variable is 1= enough or too much spending on improving the environment and zero represents too little spending. When we include categorical variables in a model, R automatically drops one level--because of the dummy variable trap. For the variable racecen1.recode, the level not included in the model is 'white', so when interpreting the race coefficients, the odds ratios are compared to white.The level that is left out of the income16.recode variable is low--therefore when discussing the odds ratio on income16.recodemedium, the comparison group is low income. ```{r} model.1<-glm(data=gss_select, natenvir.binary~nattime.int + nataccess.int + age+racecen1.recode+sex+educ+ pres16.recode + income16.recode, family=binomial) summary(model.1) ``` Now we can exponentiate the coefficients by using the function exp(coef(model_name)). ```{r} exp(coef(model.1)) # to get the odds ratios ``` **Question 9** Discuss the model results above in terms of the original question about the relationship between time and nature, access to nature and support for funding for the environment. Interpret the odds ratios on age and the two explanatory variables of interest--time in nature (nattime) and access to nature (nataccess). What variables are significant in the model? (10 points)