acaprau
# Lab 12: Regression

## An Introduction to Regression and Econometrics

A core practice within economics is econometrics: the use of statistical concepts and economic interpretation to understand the underlying relationship between two or more variables, that is, how one variable affects another. The tool economists and statisticians use for this is regression. We predict some variable $Y$, called the outcome or dependent variable, using another variable $X$, known as the regressor or explanatory variable.

As we have learned from Data 8, regression is simply the method of fitting a line to a bunch of data points. Through this, we select the slope and intercept that minimize the sum of squared errors. The line generated by this method is called the line of best fit. Given that line, the coefficients on the variables become important explanatory tools for understanding the effect of one variable on another.

This notebook gives an introduction to single- and multi-variable regression and their interpretation in economic contexts.

## Terminology

#### Left-Hand Side

$y$ - Outcome variable, dependent variable

#### Right-Hand Side

$x$ - Regressor, independent variable, explanatory variable. In machine learning, this is called a feature.

$\alpha$ or $\beta$ - Coefficient on the variable or, if it is not associated with any variable, an intercept term.

$\varepsilon$ - Error term, containing any unexplained variation that the model does not capture.

Categorical Variable - When the right-hand side variable is a 0-1 variable, in econometrics we call this a dummy variable, whereas in machine learning we call this a one-hot encoding. When the left-hand side variable is a 0-1 variable, we call this a classification problem in ML, and we would usually call the specification a logistic regression.

## Introducing our dataset: NLSY79

Throughout the notebook, we will be using the NLSY79 dataset.
This is a survey of young men and women who were 14 to 22 years old when it was first collected in 1979. It contains information such as years of schooling, intelligence measured through a test called AFQT, and annual earnings.

For this lab, we will aim to predict individuals' annual earnings from different information provided by the dataset. Thus, using the terminology above:

$y$ - Annual earnings

$x$ - Years of schooling, AFQT

## Single-Variable Regression

The underlying formula that guides linear regression is the following. It is also called the regression line. The general notation is:

$$ y = \alpha + \beta \cdot x + \varepsilon $$

- $y$ represents the outcome, or the thing we want to predict. It is also known as the dependent variable.
- $\alpha$ is the intercept term.
- $\beta$ is the slope of the regression line, or the coefficient on the $x$ variable.
- $\varepsilon$ is the error term. It captures the variation in $y$ that the line does not explain, and is also called noise.

The idea behind this formula is that if my $x$ value increases by 1, I expect my $y$ value to change by $\beta$. That is rise over run, which is why we also call $\beta$ the slope of the regression line.

We assume that in the world, the "true model" follows this equation. There is a "true" $\alpha$ and $\beta$ value and some random noise. The $y$ that we observe is a linear combination of these. Since the error is random, with our linear model we aim to find our best estimates of $\alpha$ and $\beta$. We will call them $\hat{\alpha}$ and $\hat{\beta}$, read as "alpha hat" and "beta hat". The 'hats' represent estimates of the true values.

First, let our model prediction be called $\hat{y}$, which is given by:

$$\hat{y} = \hat{\alpha} + \hat{\beta}x$$

While we could pick $\hat{\alpha}$ and $\hat{\beta}$ arbitrarily, we want the values whose predictions $\hat{y}$ are closest to the actual $y$ values.
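As a small illustration of what "closest" means, the sketch below scores two hand-picked guesses for $(\hat{\alpha}, \hat{\beta})$ by their sum of squared errors. The five data points and both guesses are made up for illustration; they are not taken from NLSY79.

```python
# Hypothetical data: years of schooling and annual earnings for five people.
schooling = [10, 12, 14, 16, 18]
earnings = [28000, 37000, 41000, 49000, 53000]


def predict(alpha_hat, beta_hat, x):
    """Return y_hat = alpha_hat + beta_hat * x for a single x value."""
    return alpha_hat + beta_hat * x


def sum_squared_errors(alpha_hat, beta_hat, xs, ys):
    """Add up the squared gaps between each actual y and its prediction."""
    return sum((y - predict(alpha_hat, beta_hat, x)) ** 2 for x, y in zip(xs, ys))


# Two arbitrary guesses for (alpha_hat, beta_hat).
sse_a = sum_squared_errors(0, 3000, schooling, earnings)
sse_b = sum_squared_errors(5000, 2000, schooling, earnings)

print("guess A:", sse_a)  # 8000000
print("guess B:", sse_b)  # 425000000
```

Guess A has the much smaller total, so its line sits closer to these points. The goal is to find the guess that makes this kind of error as small as possible.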
To achieve this, we want to minimize a loss function called the "Root Mean Squared Error", which is defined as

$$ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - \hat{y}_i \right ) ^2 } $$

where $n$ is the number of observations. This takes the distance of each prediction $\hat{y}_i$ from its corresponding value $y_i$, squares these distances to keep them positive, averages them, and then takes the square root so that the error is in the same units as $y$.

Plugging the formula for $\hat{y}$ into the RMSE formula, we get

$$ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\hat{\alpha} + \hat{\beta}x_i) \right ) ^2 } $$

By doing a bit of math (which we will not go over in this class), we get the following formulas for $\hat{\alpha}$ and $\hat{\beta}$:

$$\Large \hat{\beta} = r\frac {SD_y} {SD_x} $$

$$\Large \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} $$

- $r$ is the correlation between $x$ and $y$
- ${SD_y}$ is the standard deviation of $y$
- ${SD_x}$ is the standard deviation of $x$
- $\bar{y}$ is the average of all our $y$ values
- $\bar{x}$ is the average of all our $x$ values

## Ordinary Least Squares

## Multi-Variable Regression

So far we have been operating under a large limitation: we are only using one feature, years of schooling, as our explanatory variable! Intuitively, using more than one feature gives the model more explanatory power. Suppose we want to predict future earnings: it would make sense that both years of schooling and some measure of intelligence could contribute to one's earnings. A multi-variable model is useful here.

Visually, the multiple regression model is very similar to a single-variable regression model. The only difference is the additional explanatory variables.
The following is an example of a multiple regression model using two features:

$$ y = \alpha + \beta_{1} \cdot x_{1} + \beta_{2} \cdot x_{2} + \varepsilon $$

$\beta_{1}$ is the slope coefficient on $x_{1}$, and $\beta_{2}$ is the slope coefficient on $x_{2}$. You can interpret each coefficient as the expected marginal change in $y$ resulting from a 1 unit change in the corresponding regressor, holding all else constant.

How is this different from doing two single-variable regressions? Let's go through a hypothetical example. Suppose we regress earnings on years of schooling and get a coefficient of $5000$, meaning that for each additional year of schooling, we expect annual earnings to increase by \$5000. Then, suppose we regress earnings on some measure of intelligence, like AFQT, and get a coefficient of $400$, meaning that for each additional point on the AFQT scale we expect earnings to rise by \$400 annually.

Does this mean that if we do a multi-variable regression, with years of schooling as $x_1$ and intelligence as $x_2$, we will get a $\beta_1$ of $5000$ and a $\beta_2$ of $400$? Not necessarily. To see why, think about the relation between years of schooling and intelligence. If I tell you that someone has 20 years of schooling, you can probably make some reasonable conclusions about their intelligence, and vice versa: if I tell you that someone is particularly intelligent, you can probably assume they have more years of schooling.

Knowing this, return to the regression of earnings on years of schooling. The coefficient of 5000 means that for a 1 year increase in schooling, we expect a \$5000 increase in annual earnings. However, we have also just observed that a 1 year increase in schooling tends to be associated with a small increase in intelligence as well. Therefore, when we say "for a 1 year increase in schooling..."
implicit in this is also an increase in intelligence, and the coefficient of 5000 reflects the effect of schooling on earnings *as well as* the effects of intelligence that accompany a rise in schooling.

When we do multi-variable regression, the coefficients that the program outputs reflect the expected effect of a change in one variable *keeping all other variables constant*. So were we to do a multi-variable regression of earnings on years of schooling and intelligence, we would likely not get coefficients of 5000 and 400, respectively. Rather, the coefficients would likely be less than 5000 and 400, because each single-variable coefficient bundles multiple effects, as we saw earlier. If we want to observe just the effect of years of schooling on earnings, without the associated change in intelligence, we can expect an effect of less than \$5000.

## The `statsmodels` Package for Regression

Statsmodels is a popular Python package used to create and analyze various statistical models. To create a linear regression model in statsmodels, we generally use the following code:

`import numpy as np`

`import statsmodels.api as sm`

`X = data.select(features).values  # Separate features and target, as NumPy arrays`

`Y = data.select(target).values`

`X = sm.add_constant(X)  # Add the intercept column, reported as const`

`model = sm.OLS(Y, X)  # Initialize the OLS regression model`

`results = model.fit()  # Fit the regression model and keep the fitted results`

`print(results.summary())`

Note that `fit()` returns a separate results object, and `summary()` is called on that object rather than on the unfitted model.

### Interpreting Regression Results

The `summary()` method outputs a detailed description of various relevant results from our regression, including the number of observations, the fitted $\beta$ coefficients, and the value of $\alpha$. The tabular results are formatted similarly to regression summaries in other popular econometrics languages such as Stata. For the purposes of this lab, we will focus on the `coef` column.
Here are the interpretations of each value:

- `const`: $\alpha$, the OLS intercept term
- `x1`: The OLS value of $\beta_1$
- `x2`: The OLS value of $\beta_2$

## Categorical and Dummy Variables

Perhaps one useful indicator to predict earnings is an individual's gender. Historically, men have earned more than women, so incorporating gender into our regression may be helpful as an explanatory variable in predicting earnings. But how would we encode this into our model? After all, being male or female is not a number, unlike years of schooling.

So far, we have assumed that the inputs to our regression model are continuous values (that is, numbers). However, not all data is numeric, and such data cannot be fed directly into a regression model. Categorical variables are a common case of this.

Categorical variables are not necessarily binary, like gender. Another example of a categorical variable is a person's race: we could have any number of race categories or subgroups depending on our dataset.

To translate any categorical variable into numeric inputs for our regression model, we convert it into dummy variables: binary, numeric variables that represent the subgroups of the categorical variable. Each subgroup is coded as either 0 or 1, indicating whether a particular observation belongs to that subgroup.

Hence, to do dummy encoding for gender, we would create a variable for each category, or each gender in our case. When the unit is male, the variable for male would be 1 and the variable for female would be 0. Our regression would follow the form:

$$y = \alpha + \beta_1x_{\text{education}} + \beta_2x_{\text{male}} + \beta_3x_{\text{female}}$$

Notably, $\beta_2-\beta_3$ would be the difference in predicted earnings that is associated with being male rather than female. Because the male and female dummies always sum to 1, only this difference, not $\beta_2$ and $\beta_3$ separately, can be pinned down when the model also contains the intercept $\alpha$.
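To make the encoding concrete, here is a small sketch using six made-up observations (not NLSY79 data). It builds the male and female dummies and checks that they always sum to 1. Then, as a simplifying assumption, it regresses earnings on the male dummy alone, using the single-variable formulas $\hat{\beta} = r \cdot SD_y / SD_x$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$ from earlier in this lab. Education is left out to keep the arithmetic easy to follow.

```python
from math import sqrt

# Hypothetical observations: a gender label and annual earnings.
gender = ["male", "male", "male", "female", "female", "female"]
earnings = [50000, 60000, 70000, 40000, 50000, 60000]

# Dummy encoding: one 0/1 variable per category.
male = [1 if g == "male" else 0 for g in gender]
female = [1 if g == "female" else 0 for g in gender]

# Every row belongs to exactly one category, so the dummies sum to 1.
assert all(m + f == 1 for m, f in zip(male, female))


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Population standard deviation."""
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))


def correlation(xs, ys):
    """Correlation r: the mean of the products of standard units."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


# Single-variable regression of earnings on the male dummy.
beta_hat = correlation(male, earnings) * sd(earnings) / sd(male)
alpha_hat = mean(earnings) - beta_hat * mean(male)

male_mean = mean([e for e, m in zip(earnings, male) if m == 1])
female_mean = mean([e for e, m in zip(earnings, male) if m == 0])

print("slope on male dummy:", beta_hat)
print("intercept:", alpha_hat)
print("difference in group means:", male_mean - female_mean)
```

Up to floating-point rounding, the slope equals the difference in group means (10000 here) and the intercept equals the female group's mean (50000). With only the male dummy in the model, its coefficient plays the role of $\beta_2 - \beta_3$.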
## OPTIONAL READING: Reading Economics Papers

In upper division economics courses, you'll often read economics papers that use ordinary least squares to run regressions. Now that we have familiarized ourselves with multi-variable regression, let's practice reading the results of economics papers!

Let's consider an existing empirical study by David Card, a professor at UC Berkeley, that regresses income on education:

![](https://i.imgur.com/FPLII4s.png)

Every column here is from a different regression: the first column predicts log hourly earnings from years of education, the fifth column predicts log annual earnings from years of education, and so on. For now, let's focus on the first column, which states the linear regression as follows:

$$ \ln{(\text{hourly earnings})_i} = \alpha + \beta \cdot (\text{years of schooling})_i + \varepsilon_i $$

From the table, the education coefficient is 0.100, with a (0.001) underneath it. This means that our $\beta$ value is equal to 0.100. What does the (0.001) mean? It is the standard error, which is essentially a measure of our uncertainty. In Data 8 terms, the standard error is most similar to the standard deviation of sample means, which measures how much the sample mean varies from sample to sample. Similarly, the standard error here measures how much the estimated coefficient would vary from sample to sample. We can use the standard error to construct a confidence interval for the actual coefficient: an approximate 95% confidence interval extends 2 standard errors above and below the reported value.

The effect of schooling on income is captured by the education coefficient: 0.100. This means that an increase of 1 unit (year) of education is associated with an increase in log hourly earnings of 0.1. This approximately corresponds to a 10% increase in wages per year of schooling.
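The arithmetic in the last two paragraphs can be checked with a few lines of Python. The coefficient and standard error are the values quoted above from the first column of the table. The conversion from a change in log earnings to an exact percentage change, $e^{\beta} - 1$, is a standard identity added here for comparison; it does not come from the table itself.

```python
import math

beta = 0.100       # education coefficient quoted from the table
std_error = 0.001  # standard error reported beneath it

# Approximate 95% confidence interval: 2 standard errors on either side.
ci_low = beta - 2 * std_error
ci_high = beta + 2 * std_error

# A change of beta in log earnings multiplies earnings by exp(beta).
exact_pct_change = (math.exp(beta) - 1) * 100
approx_pct_change = beta * 100

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")           # 95% CI: [0.098, 0.102]
print(f"approximate change: {approx_pct_change:.1f}%")    # approximate change: 10.0%
print(f"exact change: {exact_pct_change:.1f}%")           # exact change: 10.5%
```

The exact figure, about 10.5%, is slightly above the 10% rule of thumb. Reading the coefficient directly as a percentage works well when the coefficient is small.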
