Tutorial 10: Simple Linear Regression

--- title: "Tutorial 10: Simple Linear Regression" author: "Elizabeth A. Albright, PhD" output: html_notebook --- # Tutorial 10: Simple Linear Regression This tutorial will walk you through the steps of running a simple linear regression (SLR) in R. I encourage you to read about Yale's Environmental Performance Index (EPI) here: https://epi.envirocenter.yale.edu/. ### Variables: GDP and Environmental Performance Index (EPI) For the tutorial we will explore the relationship between GDP per capita (GDPpc) and the Yale Environmnetal Performance Index. The Environmental Performance Index (EPI.new) is an index for each nation on its environmental performance. It is an index combined of numerous weighted measures. Read more about the data on the Yale website and the technical appendix on Sakai (Wolf et al., 2022). EPI.new: Environmental Performance Index HLT.new: Environmental Health AIR.new: Air quality index HAD.new: Household Air Pollution [DALY rate, Disability Adjusted Life Year] DALY: "One DALY represents the loss of the equivalent of one year of full health" (World Health Organization, https://www.who.int/data/gho/indicator-metadata-registry/imr-details/158) PMD.new: Ambient PM2.5 [DALY rate] OZD.new: Ozone [DALY rate] ```{r libraries} library(ggplot2) library(dplyr) library(wbstats) library(readr) library(broom) rm(list=ls()) ``` We will read in the data with read_csv() (a readr function). ```{r import} epi.df<-read_csv("epi2022.csv") head(epi.df) ``` ### Import GDP per capita data from World Bank (in current USD) ```r wb_data <- wb_data( indicator = c("NY.GDP.PCAP.CD"), country = "countries_only", start_date = 2020, end_date = 2020 )%>% glimpse() ``` ### Renaming the GDP variable We will now rename "NY.GDP.PCAP.CD" to gdp_pc. The renamed variable will be in the wb_data data frame. ```r rename the variable wb_data <- rename(wb_data, gdp_pc = NY.GDP.PCAP.CD) glimpse(wb_data) ``` Let's look at the gdp_pc data to make sure it looks good. ```r head(wb_data, 10) ``` ### Selecting variables that we will keep Using the head and glimpse functions above, you saw that there are some variables that we probably don't need. Let's clean that up by using the **select()** function in dplyr. We will select the variables we want using select() and then will look at the data frame with the glimpse() function. ```r wb_data<-wb_data%>% select(iso3c, date, gdp_pc)%>% glimpse() ``` You should now have three variables in the data frame: iso3c, date, and gdp_pc. ### Joining data Using the full_join() function in dplyr We should now join wb_data with the epi.df using the three letter country code that is in both data frames epi.df (iso) and wb_data (iso3c). A full join joins all rows that are in either x or y. Call the new joined data frame epi_gdp.df. Read more about the joining functions in dplyr here: https://dplyr.tidyverse.org/reference/mutate-joins.html https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti ```r ``` You should now have 11 variables in the new data frame (epi_gdp.df). Yay! We can now make our scatterplot. ### Scatterplots You will now make a scatter plot of GDP per capita (x) and EPI.new (Y). ```r scatter.plot options(scipen = 999) #this bit of code turns off scientific notation p1 <- epi_gdp.df %>% ggplot(aes(gdp_pc, EPI.new, label = iso))+ geom_point(size = 1.5) + theme_minimal()+ ylab("Environmental Performance Index")+ xlab("Per Capita Gross Domestic Product (USD)")+ ylim(0,100) p1 ``` You can see in the scatterplot that GDP per capita is positively skewed. We should take the log of GDP per capita and then make the scatterplot with the logged gdp_pc data. Use the mutate() function to add in log.gdp_pc to the epi_gdp.df data frame. Do so in the chunk below. ```{r log_gdp} ``` Develop a scatter plot of EPI.new ~ log.gdp_pc. Notice the units on the Y-axis. You can set the limits of the y axis by adding the code: ylim(0,100). Add text labels to the plot by setting label=iso in the aesthetic, and then adding geom_text(nudge=0.25). Please add appropriate labels for the x- and y-axes. ```{r log_scatter} ``` We could also visualize the original data on a log scale of the x-axis by adding scale_x_log10(). This adjusts the scale to log10. What is the advantage of using this approach to visualizing the data vs logging the data and plotting the logged data? Any disadvantages? ```r p1+scale_x_log10() ``` ## Calculating correlations The correlation coefficient ( r ) is a measure of linear relationship and can range between -1 and 1. Please calculate the correlation coefficients on both the original data (EPI.new and gdp_pc) and the (natural) logged data of GDPpc. The function for correlation coefficient is cor(). For correlations, it does not matter which variable you place first. if there are missing data, you should use the argument use="pairwise.complete.obs". Please type ?cor in the console so you can see the arguments before running the chunk below. ```{r correlation} ``` What are the two correlation coefficients? Please interpret these coefficients. ## Simple linear model The function to run a linear model in R is lm() To see the arguments of lm() please type ?lm in the console. The model is written Y~X. ```r linearmodel model.1 <- lm(EPI.new ~ gdp_pc , data = epi_gdp.df) model.1 ``` The output of R gives us the equation for the simple linear model. The equation can be written as EPI.new = 37.21 + 0.0004415*GDPpc Why do you think the slope is such a small number? We can interpret this equation as, a one dollar increase in per capita GDP is associated with a 0.0004415 increase in predicted Environmental Performance Index. A dollar is such a small amount, so it is probably better to interpret in terms of $1000 increase in GDPpc is associated with a 0.4415 increase in Environmental Performance Index. I calculated this number by multiplying 1000*0.0004415. We can also summarize the model to see the estimates, SE, t, and p-values as well as the R2 value and F-statistic ```r summarylm summary(model.1) ``` The test of significance on the coefficient tests whether the slope of the line (of the populations) is zero. There is conclusive evidence to suggest that the slope is different than zero (a flat line). R2 is the percent of the variation in the Y variable that is explained by the X variable. In this case, 46.9%. We won't worry about adjusted R2 and the F statistic until we get to multiple linear regression. We can now add a linear regression line to the plot. This is the line that is drawn by ordinary least squares (OLS). We can use the function stat_smooth and the argument method=lm to tell it to draw the line that we calculated from a linear model. This will draw the confidence interval around the line based on the linear model. ```r scatter.lm ggplot(data=epi_gdp.df, aes(x = gdp_pc, y = EPI.new)) + geom_point() + stat_smooth(method=lm)+ labs(x="GDP per capita (USD)", y="Environmental Performance Index") ``` Now based on this plot, we can see that this relationship isn't linear. We should run the simple linear model using the logGDPpc data. ```r model.2 <- lm(EPI.new ~ log_gdp , data = epi_gdp.df) model.2 summary(model.2) ``` As you might have guessed, we now have a new equation when we use logGDPpc. We no longer have a linear model, but rather what we call a linear-log model. To interpret a linear-log model, an increase in x (GDPpc) by one percent is associated with an increase in predicted y by ($\beta$/100) units (EPI). So, a 1% increase in per capita GDP is associated with a 0.0652 increase in predicted EPI. The R2 is now 0.53. Let's use the package broom to make a nice tidy tibble of the regression results. First I'll turn scientific notation back on. ```r options(scipen=0) tidy(summary(model.2)) ``` Let's make a new scatterplot with the loggedGDPpc data that includes the regression line. Please run the chunk below. ```r scatterlog ggplot(data=epi_gdp.df, aes(x = log_gdp, y = EPI.new)) + geom_point() + stat_smooth(method=lm)+ labs(x="Logged GDP per capita (USD)", y="Environmental Performance Index") ``` Awesome! That scatter plot looks pretty darn good. If we are meeting the assumptions of ordinary least squares, we want equal spread of the observations above and below the regression line across all values of our explanatory variable (log gdp per capita). **I encourage you to play around with one of the other variables and look at its relationship with GDP per capita.** **Congratulations on finishing Tutorial 10!** Citations: Wolf, M. J., Emerson, J. W., Esty, D. C., de Sherbinin, A., Wendling, Z. A., et al. (2022). 2022 Environmental Performance Index. New Haven, CT: Yale Center for Environmental Law & Policy. epi.yale.edu