# Lab 12: Regression
## An Introduction to Regression and Econometrics
A core practice within economics is econometrics, the use of statistical methods and economic interpretation to understand the underlying relationship between two or more variables - how one variable affects another. The tool by which economists and statisticians do this is regression. We predict some variable Y, known as the outcome or dependent variable, using another variable X, known as the regressor, explanatory variable, or independent variable.
As we have learned from Data 8, regression is simply the method of fitting a line to a collection of data points. Through this, we select the slope and intercept that minimize the sum of squared errors. The line generated by this method is called the line of best fit.
Given that line, the coefficients on the variables become important explanatory tools for understanding the effect of one variable on another. This notebook gives an introduction to single- and multi-variable regression and their interpretations in economic contexts.
## Terminology
#### Left-Hand Side
$y$ - Outcome variable, dependent variable
#### Right-Hand Side
$x$ - Regressor, independent variable, explanatory variable. In machine learning, this is called a feature.
$\alpha$ or $\beta$ - Coefficient on a variable or, if it is not attached to any variable, an intercept term.
$\varepsilon$ - Error term, containing any unexplained variation that the model does not capture.
Categorical Variable - When a right-hand side variable is a 0-1 variable, in econometrics we call it a dummy variable, whereas in machine learning we call it a one-hot encoding. When the left-hand side variable is a 0-1 variable, we call this a classification problem in ML, and we would usually call the specification a logistic regression.
## Introducing our dataset: NLSY79
Throughout the notebook, we will be using the NLSY79 dataset, a survey of young men and women who were 14-22 years old when it was first conducted in 1979. It contains information such as years of schooling, intelligence measured through a test called the AFQT, and annual earnings.
For this lab, we will aim to predict individuals' annual earnings from different information provided by the dataset. Thus, using the terminology above:
$y$ - Annual earnings
$x$ - Years of schooling, AFQT
## Single-Variable Regression
The underlying formula that guides linear regression is the following. It is also called the regression line.
The general notation is:
$$
y = \alpha + \beta \cdot x + \varepsilon
$$
- $y$ represents the outcome or the thing we want to predict. It is also known as the dependent variable.
- $\alpha$ is the intercept term.
- $\beta$ is the slope of the regression line, or the coefficient on the $x$ variable.
- $\varepsilon$ is the error term. It captures the variation in the data that the model does not explain, and is also called noise.
The idea behind this formula is that if my $x$ value increases by 1, I expect my $y$ value to change by $\beta$ - rise over run. That's why we also call $\beta$ the slope of the regression line. We assume that in the world, the "true model" follows this equation: there are "true" $\alpha$ and $\beta$ values and some random noise, and the $y$ that we observe is generated by this equation.
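To make this concrete, here is a minimal simulation sketch in Python. The "true" values $\alpha = 2$ and $\beta = 0.5$ (and everything else about this toy dataset) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" model, chosen only for illustration.
alpha, beta = 2, 0.5
n = 1000

x = rng.uniform(0, 10, n)        # the regressor
epsilon = rng.normal(0, 1, n)    # random, unobserved noise
y = alpha + beta * x + epsilon   # the outcome we actually observe
```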
Since the error is random, with our linear model, we aim to predict our best estimate of $\alpha$ and $\beta$. We will call them $\hat{\alpha}$ and $\hat{\beta}$. These are read as "alpha hat" and "beta hat". The 'hats' represent estimates of the true values.
First, let our model prediction be called $\hat{y}$, which is given by:
$$\hat{y} = \hat{\alpha} + \hat{\beta}x$$
While we could pick $\hat{\alpha}$ and $\hat{\beta}$ arbitrarily, we want to pick the values whose predictions $\hat{y}$ are as close as possible to the actual $y$ values. To achieve this, we minimize a loss function called the "Root Mean Squared Error" (RMSE), which is defined as
$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - \hat{y}_i \right ) ^2 }
$$
$n$ is the number of observations. The effect of this is to take the mean of the squared distance between each prediction $\hat{y}_i$ and its corresponding value $y_i$; squaring keeps each term positive, and taking the square root at the end returns the error to the same units as $y$.
Plugging the formula for $\hat{y}$ into the RMSE formula, we get
$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\hat{\alpha} + \hat{\beta}x_i) \right ) ^2 }
$$
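Continuing the toy simulation from above (and reusing its `x` and `y`), a short sketch of this loss function might look like the following; the guessed values are arbitrary:

```python
def rmse(alpha_hat, beta_hat, x, y):
    """Root mean squared error of the line y_hat = alpha_hat + beta_hat * x."""
    y_hat = alpha_hat + beta_hat * x
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Guesses close to the true values give a smaller error than arbitrary ones.
print(rmse(2, 0.5, x, y))   # close to the noise SD of 1
print(rmse(0, 1, x, y))     # noticeably larger
```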
By doing a bit of math (which we will not go over in this class), we get the following formulas for $\hat{\alpha}$ and $\hat{\beta}$
$$\Large
\hat{\beta} = r\frac {SD_y} {SD_x}
$$
$$\Large
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}
$$
- $r$ is the correlation between $x$ and $y$
- ${SD_y}$ is the standard deviation of $y$
- ${SD_x}$ is the standard deviation of $x$
- $\bar{y}$ is the average of all our $y$ values
- $\bar{x}$ is the average of all our $x$ values
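We can sanity-check these formulas on the simulated data from the earlier sketch (recall that the true values 2 and 0.5 there were made up):

```python
r = np.corrcoef(x, y)[0, 1]                      # correlation between x and y
beta_hat = r * np.std(y) / np.std(x)             # slope estimate
alpha_hat = np.mean(y) - beta_hat * np.mean(x)   # intercept estimate

print(alpha_hat, beta_hat)   # should come out close to the true 2 and 0.5
```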
## Ordinary Least Squares
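Choosing the $\hat{\alpha}$ and $\hat{\beta}$ that minimize the squared error in this way is known as ordinary least squares (OLS). It is the estimation method behind the formulas above and behind the `statsmodels` code later in this lab.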
## Multi-Variable Regression
So far we have been operating under a large limitation: we are only using one feature, years of schooling, as our explanatory variable! Intuitively, using more than one feature should give our model more explanatory power. Suppose we want to predict future earnings - it is plausible that both years of schooling and some measure of intelligence contribute to one's earnings. A multi-variable model is useful here.
In its written form, the multiple regression model is very similar to the single-variable model; the only difference is the additional explanatory variables. The following is an example of a multiple regression model using two features:
$$
y = \alpha + \beta_{1} \cdot x_{1} + \beta_{2} \cdot x_{2} + \varepsilon
$$
$\beta_{1}$ is the slope coefficient on $x_{1}$, and $\beta_{2}$ is the slope coefficient on $x_{2}$. You can interpret each coefficient as the expected marginal change in $y$ resulting from a 1 unit change in the corresponding regressor, holding all else constant.
How is this different from doing two single-variable regressions? Let's go through a hypothetical example. Suppose we regress earnings on years of schooling and generate a coefficient of $5000$, meaning for each additional year of schooling, we expect annual earnings to increase by \$5000. Then, suppose we regress earnings on some measure of intelligence, like AFQT, and we generate a coefficient of $400$, meaning that for each additional point on the AFQT scale we expect a rise in earnings of \$400 annually. Does this mean that if we do a multi-variable regression, with years of schooling as $x_1$ and intelligence as $x_2$, we will get a $\beta_1$ of $5000$ and a $\beta_2$ of $400$? Not necessarily.
To find out why, think about the relationship between years of schooling and intelligence. If I tell you that someone has 20 years of schooling, you can probably make some reasonable conclusions about their intelligence, and vice versa: if I tell you that someone is particularly intelligent, you can probably assume they have more years of schooling. Knowing this, return to the regression of earnings on years of schooling. The coefficient of 5000 means that for a 1 year increase in schooling, we expect a \$5000 increase in annual earnings. However, we have also just observed that a 1 year increase in schooling tends to be associated with an increase in intelligence as well. Therefore, when we say "for a 1 year increase in schooling...", implicit in this is also an increase in intelligence, and the coefficient of 5000 reflects the effect of schooling on earnings *as well as* the effects of intelligence that accompany a rise in schooling.
When we do multi-variable regression, the coefficients that the program outputs reflect the expected effect of a change in one variable *keeping all other variables constant*. So were we to do multi-variable regression of earnings on years of schooling and intelligence, we would likely not get coefficients of 5000 and 400, respectively. Rather, the coefficients would likely be less than 5000 and 400, as these two coefficients include multiple effects, as we saw earlier. If we want to observe just the effect of years of schooling on earnings, without the associated change in intelligence, we can expect an effect of less than \$5000.
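To see this in action, here is a small simulation sketch. Every number in it is made up for illustration: schooling and intelligence are generated to be positively correlated, and both raise earnings. The single-variable slope on schooling then comes out larger than its "true" effect, because it also absorbs part of the effect of intelligence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (all coefficients are made up):
# intelligence and schooling are positively correlated, and both raise earnings.
intelligence = rng.normal(50, 10, n)                        # AFQT-like score
schooling = 10 + 0.1 * intelligence + rng.normal(0, 1, n)   # years of schooling
earnings = 2_000 * schooling + 300 * intelligence + rng.normal(0, 5_000, n)

def fit(X, y):
    """Least-squares coefficients of y on X, with an intercept column added."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Single-variable regression: the schooling coefficient also picks up the
# effect of intelligence, so it is well above the "true" 2,000.
print(fit(schooling, earnings))

# Multi-variable regression: holding intelligence constant, the schooling
# coefficient is close to the "true" 2,000.
print(fit(np.column_stack([schooling, intelligence]), earnings))
```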
## The `statsmodels` Package for Regression
Statsmodels is a popular Python package used to create and analyze various statistical models. To create a linear regression model in statsmodels, we use the following code in general:
```python
import numpy as np
import statsmodels.api as sm

# Separate features and target. Here we assume `data` is a datascience Table,
# `features` is a list of column names, and `target` is the outcome column name.
X = data.select(features).to_df().values   # feature matrix
Y = data.column(target)                    # outcome variable

X = sm.add_constant(X)     # add a column of 1s so the model includes an intercept
model = sm.OLS(Y, X)       # initialize the OLS regression model
results = model.fit()      # fit the regression model and store the fitted results
print(results.summary())   # display the estimated coefficients and other statistics
```
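One detail worth noting: `sm.OLS` does not add an intercept automatically, which is why we call `sm.add_constant` before fitting. The fitted intercept then shows up as `const` in the summary table.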
### Interpreting Regression Results
The `summary()` method outputs a detailed description of various relevant results from our regression, including the number of observations, the fitted $\beta$ coefficients, and the value of $\alpha$. The tabular results are formatted similarly to regression summaries in other languages popular in econometrics, such as Stata.
For the purposes of this lab, we will focus on the `coef` column. Here are the interpretations of each value:
- `const`: $\alpha$, the OLS intercept term
- `x1`: The OLS value of $\beta_1$
- `x2`: The OLS value of $\beta_2$
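The same estimates can also be pulled out of the fitted results object directly, which is convenient for later calculations:

```python
print(results.params)   # fitted coefficients, in the order const, x1, x2, ...
print(results.bse)      # standard errors of those coefficients
```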
## Categorical and Dummy Variables
Perhaps one useful indicator to predict earnings is an individual's gender. Historically, men have earned more than women, so incorporating gender into our regression may be helpful as an explanatory variable in predicting earnings.
But how would we encode this into our model? After all, being male or female is not a number, unlike years of schooling.
So far, we have assumed that the inputs to our regression model were continuous values (i.e., numbers). However, not all data is numeric, and such data cannot be directly fed into a regression model. Categorical variables are a common example of this.
Categorical variables are not necessarily binary like gender. Another example of a categorical variable is a person's race - we could have any number of race categories or subgroups depending on our dataset.
To translate any categorical variable into inputs our regression model can use, we convert it into dummy variables - binary, numeric variables that represent the subgroups of the categorical variable. Each subgroup gets its own 0-1 variable indicating whether or not a particular observation belongs to that subgroup.
Hence, to do dummy encoding for gender, we would create a variable for each category, or each gender in our case. When the unit is male, the variable for male would be 1 and the variable for female would be 0. Our regression would follow the form:
$$y = \alpha + \beta_1x_{\text{education}} + \beta_2x_{\text{male}} + \beta_3x_{\text{female}}$$
Notably, $\beta_2-\beta_3$ would be the difference in earnings associated with being male rather than female.
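As a concrete illustration, here is a minimal sketch of how the two dummy columns could be built. It assumes the table has a column literally named "gender" containing the strings "male" and "female"; the column name and labels are hypothetical:

```python
# Build 0-1 dummy columns from a hypothetical "gender" column.
data = data.with_columns(
    "male",   (data.column("gender") == "male").astype(int),
    "female", (data.column("gender") == "female").astype(int),
)
```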
## OPTIONAL READING: Reading Economics Papers
In upper division economics courses, you'll often read economics papers that use ordinary least squares to conduct regression. Now that we have familiarized ourselves with multi-variable regression, let's familiarize ourselves with reading the results of economics papers!
Let's consider an existing empirical study conducted by David Card, a professor at UC Berkeley, that regresses income on education:

Every column here is from a different regression: the first column predicts the log hourly earnings from years of education, the fifth column predicts the log annual earnings from years of education, and so on. For now, let's focus on the first column, which states the linear regression as follows:
$$
\ln{(\text{hourly earnings})_i} = \alpha + \beta \cdot (\text{years of schooling})_i + \varepsilon_i
$$
From the table, the education coefficient is 0.100, with (0.001) underneath it. This means that our $\beta$ estimate is equal to 0.100. What does the (0.001) mean? It is the standard error, which is essentially a measure of our uncertainty. From Data 8, the standard error is most similar to the standard deviation of sample means, which measures how much the sample mean varies from sample to sample. Similarly, the standard error here measures how much the estimated coefficient would vary from sample to sample. We can use the standard error to construct a confidence interval for the true coefficient: a 95% confidence interval extends roughly 2 standard errors above and below the reported value.
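For example, with the reported estimate and standard error, a 95% confidence interval for the education coefficient is roughly
$$
0.100 \pm 2 \times 0.001 = [0.098,\ 0.102]
$$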
The effect of schooling on income is captured by the education coefficient: 0.100. This means that a 1 unit (year) increase in education is associated with an increase in log hourly earnings of 0.1, which corresponds to approximately a 10% increase in wages per year of schooling.
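To see where the 10% figure comes from, recall that a change of $\beta$ in $\ln(\text{earnings})$ multiplies earnings by $e^{\beta}$; this back-of-the-envelope conversion is not part of the paper itself:
$$
e^{0.100} - 1 \approx 0.105 \approx 10\%
$$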