---
title: 'Studio 8: Regression'
layout: post
label: studio
geometry: margin=2cm
tags: studio
---
# CS 100: Studio 8
### Simple Linear Regression
##### November 2, 2022
### Instructions
In this week’s studio, you will be learning how to create and interpret a simple linear regression model in R. You will also get some practice with data transformations.
Upon completion of all tasks, a TA will give you credit for today's studio. If you do not manage to complete all the assigned work during the studio period, do not worry. You can continue to work on this assignment until Sunday, November 6 at 7 PM. Come by TA hours any time before then to show us your completed work and get credit for today's studio.
### Objectives
By the end of this studio, you will know:
- How to determine when two variables are related linearly, so that a simple linear regression model is applicable
- How to create a simple linear regression model in R
- How to interpret a simple linear regression model in R
- How to transform data to search for linear relationships
- What Tukey’s ladder of transformations is
### Regression
*Regression* is a type of supervised learning used to model a relationship among numerical variables where it is hypothesized that one depends on the others. An example is modeling the price of a house using features like the location, the number of bedrooms, the number of bathrooms, and the age of the house.
##### Linear Regression
*Linear regression* is the most widely used tool for establishing a link between two (or more) variables. A linear regression represents the "line of best fit" between a response variable $Y$ and one or more explanatory variables $X_1, X_2, \ldots, X_n$.
An *explanatory* variable (also called a *predictor*) is a variable that is (usually) independent of the others, meaning its value does not change with the values of the other variables. Explanatory variables are typically plotted on the $x$-axis.
On the other hand, a *response* variable, typically plotted on the $y$-axis, is a variable whose value may depend on the value of the explanatory variables. For example, if we were investigating the effect of parents’ heights on children’s heights, the parents’ heights would be the explanatory variable and the children’s heights would be the response variable.
In the case of just one explanatory variable $X$, you can create a scatter plot of pairs of $x_i$ and corresponding $y_i$ values. Generalizing from the data in such a scatter plot, a linear regression is simply the *line* that best fits the data, meaning the line that minimizes the error between the predicted (*a.k.a.* fitted) and the observed values. Given an unobserved value of $X$, such a model can be used to predict the corresponding value of $Y$.
**Aside:** Recall that the equation of a line is $Y = mX + b$, where $m$ is the slope of the line (which tells you how steep it is) and $b$ is the $y$ intercept (which tells you where the line intersects the $y$ axis). For any given value of $x$, you can find the corresponding value of $y$ by calculating $mx + b$.
There are two types of linear regression: simple and multiple.
Simple linear regression is a method of modeling a quantitative response $Y$ on the basis of a single predictor $X$. The model assumes that there is a linear relationship between $X$ and $Y$. We write this linear relationship as follows:
$$Y = \beta_0 + \beta_1 X.$$
Here $\beta_0$ and $\beta_1$ are the two coefficients of the linear model, representing the intercept and slope, respectively. Specifically, $\beta_1$ describes how $Y$ changes in response to changes in $X$.
**Note:** Because regression generalizes to multiple variables, we replaced $b$ by $\beta_0$, and $m$ by $\beta_1$ in the familiar $Y = mX + b$ formula.
Multiple linear regression is an extension of simple linear regression that handles multiple explanatory variables. In general, when there are $n$ distinct predictors, multiple linear regression takes the following form:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n.$$
Here the $\beta_i$, for $i \in \{ 0, \ldots, n \}$, are the coefficients of the linear model, representing the gradients (i.e., slopes) in each direction, respectively. Specifically, $\beta_i$ describes how $Y$ changes in response to changes in variable $X_i$, holding the values of all other variables fixed.
##### Linear Regression in R
To carry out a linear regression in R between explanatory variable `x` and response variable `y`, you can use the command `lm(y ~ x)`.
The syntax `y ~ x` tells R to model `y` using `x`, while the function `lm` tells R to create a linear model.
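The same formula syntax extends to multiple linear regression: for example, `lm(y ~ x1 + x2)` models `y` using two predictors, `x1` and `x2`.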
To understand the output of `lm` in R, we’ll go through an example.
Here's a scatterplot of weights in pounds (Weight) versus heights in inches (Height) for 250 Americans, plus the line of best fit.

Let’s fit a linear regression model to these data and analyze the output.
~~~
fit <- lm(healthMeasures$Weight ~ healthMeasures$Height)
summary(fit)
~~~

For this example, our model is $\text{Weight} = -194.4861 + 5.2995 \times \text{Height}$.
Here’s how we interpret a few of the key metrics in the output of `summary(fit)`:
* The $y$-intercept of -194.4861 is the predicted weight of a person whose height is 0. Because no one is 0 inches tall, this $y$-intercept is meaningless. (More generally, predictions are not usually valid outside the range of the $x$ values observed in the data, and become more and more precise the closer the queried $x$ value is to the sample mean $\bar{x}$.)
* The slope of 5.2995 indicates that for every one inch increase in height, we can expect a 5.2995 pound increase in weight.
* The $p$-value for the $y$-intercept is less than $1.67 \times 10^{-6}$ (i.e., nearly 0). Likewise, the $p$-value for the slope is less than $2 \times 10^{-16}$ (i.e., essentially 0). When a $p$-value is less than 0.05, we usually conclude that the predictor is significant (i.e., including this predictor in the model yields better results than not, where "not" means assuming a coefficient of 0 for that predictor).
* The $R^2$ value is 0.2631. This value of $R^2$ tells us how much of the variation (from the mean) in weight can be explained by variation (from the mean) in height. An $R^2$ value can range from 0 to 1; the closer it is to 1, the better the model explains the data. In this example, we might expect $R^2$ to be low, because there are many other factors besides height that affect a person’s weight.
### Part I: Randomly Generated Data
Now it’s time for you to create your own linear models. Before working with real data, you are going to practice with randomly generated data.
##### Part 1a
The `runif(n, min, max)` command generates a vector of `n` random numbers uniformly distributed between `min` and `max`. (*Note:* `runif` stands for "random uniform.")
Using `runif()`, generate two vectors, `x` and `y`, where each has 50 numbers ranging between 0 and 100.
Create a scatterplot of `x` and `y` using the command `plot(x, y)`. Do you see any patterns?
Report the correlation between `x` and `y` using `cor(x, y)`. Is there a strong correlation between the two variables?
Regardless of your answer, run a linear regression using `my_model <- lm(y ~ x)`. Add a line of best fit to your scatterplot using `abline(my_model)`. Display the output of your model using `summary(my_model)`.
What is the slope of your model? Is it significant? Explain why or why not. (Hint: Check the $p$-value.) Look at the $R^2$ value. What does it tell you about how well your model explains the data?
A **residual error** in a model is the distance between an observed value $y_i$ and its predicted value, which is usually denoted $\hat{y}_i$; that is, the $i$th residual is $y_i - \hat{y}_i$.
Plot the residuals. *Hint:* To do so, use the following command: `plot(my_model, which = 1)`. The parameter `which = 1` tells R that you are interested in the residual plot. If you’d like to see some other diagnostic plots as well, you can omit this argument.
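Putting these steps together, here is a minimal sketch of one possible Part 1a workflow (since the data are random, your plots and statistics will differ from run to run):
~~~
# Generate 50 uniformly random values for each variable.
x <- runif(50, 0, 100)
y <- runif(50, 0, 100)

plot(x, y)                  # scatterplot: expect no visible pattern
cor(x, y)                   # correlation: expect a value near 0

my_model <- lm(y ~ x)       # fit a linear model anyway
abline(my_model)            # add the line of best fit to the scatterplot
summary(my_model)           # inspect the slope, p-value, and R^2

plot(my_model, which = 1)   # residual plot
~~~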
##### Part 1b
The uniformly random data you just generated did not exhibit a linear relationship. In this problem, you are going to randomly generate data according to a linear model: $Y = a + bX + \epsilon$, where $\epsilon$ denotes the noise in the model. (Test yourself: What are the coefficients of this linear model?)
Using `runif()` again, generate a vector `x` of 50 numbers ranging from, say, 0 to 100, and a second vector `e` (which stands for error) of 50 numbers ranging from, say, -10 to 10.
Choose values for `a` and `b`, and then combine `x` and `e` to produce a vector `y` as follows: `y <- a + b * x + e`.
Calculate the correlation coefficient between `y` and `x`. By design, these variables are very highly correlated!
Now, create a scatter plot displaying `y` as a function of `x`. Discuss with your partner what you observe.
Create a linear model to predict `y` from `x`. Comment on the model obtained. How do the estimated values of the slope and the intercept compare to their true values?
Add the regression line to your scatterplot using `abline`. You can use the `col` parameter to specify a color: `abline(my_model, col = "red")`. Additionally, plot the true linear model: i.e., $Y= a + bX$. You can do so using `abline(a = a, b = b, col = "blue")`.
*Hint*: You can use the `legend()` command to add an appropriate legend to your plot as follows: `legend("topleft", legend = c("Best Fit", "True Model"), col = c("red", "blue"), lty = 1)`.
How do the two lines differ? How well does the best fit line fit the data? Compare the $r$ and $R^2$ values to their counterparts in Part 1a. Lastly, as a final sanity check, plot the residuals. What do you observe?
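Here is a minimal sketch of one possible Part 1b workflow; the values of `a` and `b` below are arbitrary choices, so feel free to pick your own:
~~~
a <- 10                                # true intercept (arbitrary choice)
b <- 2                                 # true slope (arbitrary choice)
x <- runif(50, 0, 100)                 # predictor
e <- runif(50, -10, 10)                # noise
y <- a + b * x + e                     # response generated from the true model

cor(x, y)                              # expect a correlation close to 1
plot(x, y)

my_model <- lm(y ~ x)
summary(my_model)                      # compare the estimates to a and b
abline(my_model, col = "red")          # line of best fit
abline(a = a, b = b, col = "blue")     # true model
legend("topleft", legend = c("Best Fit", "True Model"),
       col = c("red", "blue"), lty = 1)

plot(my_model, which = 1)              # residual plot
~~~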
### Part II: World Population Growth
Not all relationships among data are linear. For example, the world population is growing exponentially: i.e., it is growing *very* quickly over time. If we take the logarithm of something that grows exponentially, we should observe linear growth instead.
Load world population data into R as follows:
~~~
world_pop <- read.csv("http://cs.brown.edu/courses/cs100/studios/data/8/worldpop.csv")
~~~
For convenience, rename the inconveniently named population variable, and filter `world_pop` to only include years since 1000.
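For example, the renaming and filtering might look something like the following sketch; the column index `2` is an assumption, so inspect the actual column names with `names(world_pop)` first:
~~~
names(world_pop)                                  # check the actual column names
names(world_pop)[2] <- "Population"               # rename the awkward column (index assumed)
world_pop <- world_pop[world_pop$Year >= 1000, ]  # keep only years since 1000
~~~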
Now plot "Population" vs. "Year": i.e., create a scatter plot with "Year" on the x-axis and Population on the y-axis. What kind of trend do you observe? Although "Year" appears to be a significant predictor of "Population", the relationship between the two variables is not linear. Still, it is possible we could transform one or both of these variables to make the relationship linear. Try plotting a few transformations of the data. What do you notice? How does population appear to have grown over time? What, then, is an appropriate transformation?
Many data relationships become linear after an appropriate transformation. The famous statistician [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) defined what he called a **ladder of transformations**: a family of power transformations that can often turn a nonlinear relationship into a (more or less) linear one.
Tukey’s ladder invokes a power transformation, in which a variable, say $v$, is raised to a power that depends on a parameter $\lambda$. If $\lambda$ is positive, then the transformation is simply $v^{\lambda}$. If $\lambda$ is negative, then the transformation is $-v^{\lambda}$. Finally, if $\lambda = 0$, then the transformation is $\log{v}$. You can read more about Tukey’s ladder [here](http://onlinestatbook.com/2/transformations/tukey.html).
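For reference, here is a minimal sketch of Tukey’s power transformation written as an R function, following the case analysis above:
~~~
# Tukey's power transformation: v^lambda if lambda > 0,
# log(v) if lambda == 0, and -(v^lambda) if lambda < 0.
tukey <- function(v, lambda) {
  if (lambda > 0) {
    v ^ lambda
  } else if (lambda == 0) {
    log(v)
  } else {
    -(v ^ lambda)
  }
}
~~~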
For the remainder of this studio, you will search for the best *positive* value of $\lambda$ you can find under which a power transformation creates a linear relationship in population data. As you know, there are multiple diagnostics that evaluate the quality of a linear relationship, but for the purpose of this studio $R^2$ will suffice.
We recommend you read through all of the following steps before proceeding.
*Hint*: In this part of studio, you will repeat the same steps multiple times for different values of $\lambda$. Therefore, you will probably find it worthwhile to encapsulate these steps in a function or a loop.
**Initialization**: Create a vector of a few values of $\lambda$ to try. *Hint*: Use the `c` function initially; you can add additional values of $\lambda$ that turn out to be of interest later using the `append` function: e.g., `append(c(1, 2, 3, 4), 5)` evaluates to `[1] 1 2 3 4 5`.
Next, initialize an empty vector in which to hold the $R^2$ values of the linear models that correspond to the power transformations of each value of $\lambda$. Do so like this: `R2 <- c()`, and then use the `append` function to append values to *the end* of this vector.
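For instance (the starting values below are arbitrary):
~~~
lambdas <- c(0.5, 1, 2, 3)      # arbitrary starting values
lambdas <- append(lambdas, 4)   # add more values as the search progresses
R2 <- c()                       # will hold one R^2 value per lambda
~~~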
**Loop**: For each $\lambda$ value, transform the "Year" variable. For example, to transform "Year" by raising it to your first $\lambda$ power, you might write code like this:
~~~
world_pop$lambda1 <- world_pop$Year ^ lambdas[1]
~~~
Once you’ve transformed the "Year" variable, plot it against "Population" to check whether it looks linear. Then create a linear model to summarize the relationship. Save the $R^2$ value of this linear model in your `R2` vector. You can access this value here: `summary(my_model)$r.squared`.
**Search**: Your goal is to find a value of $\lambda$ that achieves an $R^2$ value above $0.9$. To guide your search, you can plot your $\lambda$s against the corresponding $R^2$ values. Such a plot would help you identify new values of $\lambda$ to try. *Hint*: Make sure your $\lambda$ and $R^2$ vectors are always of the same length; otherwise, you cannot easily plot them against one another.
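Putting the loop and search together, a minimal sketch might look like this (it assumes the `lambdas` and `R2` vectors above, as well as the renamed "Population" column):
~~~
for (i in seq_along(lambdas)) {
  transformed <- world_pop$Year ^ lambdas[i]           # power transformation
  my_model <- lm(world_pop$Population ~ transformed)   # fit the linear model
  R2 <- append(R2, summary(my_model)$r.squared)        # record its R^2
}
plot(lambdas, R2)   # identify promising lambdas to try next
~~~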
### End of Studio
When you are done, please call over a TA to review your work and check you off for this studio. If you do not finish within the two-hour studio period, remember to come to TA office hours to get checked off.