# CS 410/1411 Homework 6: Linear Regression & The Bias-Variance Tradeoff
**Due Date: 10/23/2024 at 12pm**
**Need help?** Remember to check out [Edstem](https://edstem.org/us/courses/61309) and our website for TA assistance.
## Assignment Overview
The topic of this assignment is linear regression, a means of computing a set of coefficients $w \in \mathbb{R}^{d}$ given a matrix $X \in \mathbb{R}^{n \times d}$ such that $X w \in \mathbb{R}^{n}$ approximates a vector $y \in \mathbb{R}^{n}$. You will run a simple linear regression on just one variable (i.e., $d = 1$), and you will run a polynomial regression, where you transform each column $X_k$ of $X$ into alternatives like $X_k^2$ and $X_k^3$.
Using all of this machinery, you will build a model from a real-world data set to estimate life expectancy, given explanatory variables like health outcomes, population size, education level, and so on. First, you will compare simple linear regression, a model with high bias and low variance, to multiple linear regression, a model with low bias and high variance. Then you will perform an analysis of the bias-variance tradeoff on synthetic data. Finally, you will return to the modeling task, where you will use regularization to select a model with a good balance of bias and variance.
### Learning Objectives
What you will know:
- the bias-variance decomposition theorem
- linear regression, polynomial regression, ridge regression, and LASSO
What you will be able to do:
- build simple and multiple linear regression models using a real-world dataset
- regularize a linear regression model to trade off between bias and variance
## Part 1: Simple Linear Regression
Over the course of this semester, Steve has been working hard to cultivate the beginnings of a successful society--improving high-school graduation rates, filtering spam emails, winning at Connect-4. Now that Steve's civilization is pretty well developed, he's interested in predicting the life expectancy of the inhabitants of his city. What a caring leader! While researching how to go about this, Steve stumbled upon some life expectancy data collected by the [World Health Organization (WHO)](https://www.who.int/). Since Steve is still relatively new to machine learning, he's hoping you can use these data to help him predict the life expectancy of his villagers.
### Data
You can find the `life_expectancy.csv` dataset in your project folder, within the `data/` subfolder. This dataset contains health indicators for a variety of countries--ranging from Afghanistan to Zimbabwe--between the years 2000 and 2015. It contains statistics on life expectancy for each country, related health categories like infant deaths, incidence of measles, and BMI, as well as country-specific data like GDP and population size. In addition, it labels countries as developing or developed. The intent is to enable governments to improve life expectancy by identifying and then addressing the most impactful factors.
The original dataset is located [here](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download). We have **cleaned** this dataset for you--a necessary step before embarking upon any machine learning task. In particular, we deleted all rows with missing values. We also **standardized** the feature values. To standardize a feature, you subtract its sample mean (i.e., its average value) and then divide by its sample standard deviation (i.e., the square root of its sample variance). After standardizing, each feature has a sample mean of $0$ and a sample standard deviation of $1$. Standardization is important when building linear models, because it prevents overweighting a feature's importance based solely on its scale.
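For concreteness, standardizing the columns of a feature matrix with numpy might look like the following sketch (the array `X` here is a random placeholder, not the provided dataset):

```python
import numpy as np

# A minimal sketch of standardization; X is a random placeholder, not the course dataset.
X = np.random.rand(100, 5)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # subtract each column's mean, divide by its std
# Each column of X_std now has mean 0 and standard deviation 1.
```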
### Linear Regression
Linear regression is a statistical technique that estimates a linear relationship between a **response**, or **dependent**, variable, such as life expectancy, and a set of **features**, also called **explanatory** variables or **predictors**, such as infant death rate, average education level, etc.
Given a matrix $X \in \mathbb{R}^{n \times d}$ of $n$ observations, each one described by $d$ feature values, together with associated responses $y \in \mathbb{R}^n$, one way to frame the linear regression problem is as searching for a weight vector $w \in \mathbb{R}^d$ that minimizes the squared error, i.e., $(X w - y)^T (X w - y)$. This optimization problem has a closed-form solution, namely $w = (X^T X)^{-1} X^T y$.
A bias term in a linear model is a parameter that offsets the model from the origin. For example, in the linear equation $y = mx + b$, the $y$-intercept $b$ is a bias term. To incorporate a bias term, say $w_{d+1}$, into a linear model, it suffices to append a $d+1$st column of 1s to the $X$ matrix. After doing so, the vector $X w$ includes the addition of $w_{d+1}$, the bias term, in all of its entries.
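As a rough numpy sketch of the ideas above (placeholder data; `np.linalg.solve` is used instead of forming the inverse explicitly, which is equivalent in exact arithmetic but numerically preferable):

```python
import numpy as np

# Sketch of the closed-form least-squares solution with a bias column; X and y are placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

X_b = np.hstack([X, np.ones((X.shape[0], 1))])   # append a column of 1s for the bias term
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)      # solves (X^T X) w = X^T y
y_hat = X_b @ w                                  # fitted values
```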
A **simple linear regression** regresses the response variable on only one explanatory variable. An advantage to running a simple regression is that the dimensionality of $X$ is $n \times 1$, i.e., $X$ is actually just a vector, say $x \in \mathbb{R}^n$, in which case $X^T X = x^T x$ is a scalar, meaning $X^T X$ is invertible (as long as $x \ne 0$, which would not make for a very interesting explanatory variable!).
:::spoiler **Bias Term**
Note that $X^T X$ remains invertible even with an additional parameter, since what is required for $X^T X$ to be invertible is that the columns of $X$ are linearly independent. Just as a column of 0's would not make for a very interesting explanatory variable, nor would any constant-valued column.
:::
In this part of the assignment, you will be building several simple linear models--models based on just one feature. One challenge will be to choose explanatory variables that are sufficiently predictive.
### Tasks
In this part of the assignment, you will build a `LinearRegression` model. This model will extend `sklearn`'s `BaseEstimator` class. Inheriting from `BaseEstimator` imposes the requirement on your models that they implement the `fit` and `predict` methods. Your `LinearRegression` class will also provide a method to compute mean-squared error (defined below).
:::info
**Task 1.1**
In the `LinearRegression` class in `models.py`, implement the closed-form solution to least-squares optimization in the `fit` function, and the `predict` function, which makes a prediction using the output of `fit`.
:::
:::success
**Signature**
The `fit` method estimates the model's parameters' values $w \in \mathbb{R}^{d}$, called `coefficients`, by fitting them to a given dataset $X \in \mathbb{R}^{n \times d}$ and corresponding $y \in \mathbb{R}^{n}$. The `predict` method returns a vector $X w \in \mathbb{R}^{m}$ of estimated values, given a dataset $X \in \mathbb{R}^{m \times d}$, of arbitrary size $m$.
Both methods receive their inputs as `np.ndarray`s: `fit` takes $X$ and $y$, while `predict` takes only $X$.
:::
:::danger
**⚠️WARNING⚠️**
Avoid `for` loops! Write vectorized code only!
:::
:::spoiler **Hint** Incorporating Bias
To incorporate a bias term, you should append a column of 1s to the matrix $X$ before computing the closed-form solution.
:::
Given a vector of predictions $\hat{y} \in \mathbb{R}^n$ and a vector of observations $y \in \mathbb{R}^n$, the **mean-squared error (MSE)** between $\hat{y}$ and $y$ is defined as follows:
$$\textrm{MSE} (\hat{y}, y) = \frac{1}{n} \sum_{i = 1}^n \left( \hat{y}_i - y_i \right)^2$$
:::info
**Task 1.2**
Implement the `mse` function in the `LinearRegression` class to compute the mean-squared error between a vector of predictions $X w$ and a vector of observations $y$.
:::
:::success
**Signature**
The `mse` function takes as input a matrix $X$ and a vector $y$, computes $\hat{y} = X w$, and outputs $\textrm{MSE} (\hat{y}, y)$.
:::
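In numpy, the MSE formula above reduces to a single vectorized expression; here is a sketch (your `mse` method will additionally compute $\hat{y} = Xw$ first):

```python
import numpy as np

def mse_from_predictions(y_hat, y):
    """Mean-squared error between predictions y_hat and observations y (1-D arrays)."""
    return np.mean((y_hat - y) ** 2)
```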
**Correlation** is a statistical measure between two random variables that describes their linear relationship to one another. Its values range from $-1$ to $+1$, with negative values suggesting a negative correlation (one increases as the other decreases), and positive values suggesting a positive correlation. Correlation is computed as the covariance of two random variables, divided by the product of their standard deviations.
Covariance generalizes variance. The **covariance** of two random variables $Y$ and $Z$ is defined as follows:
$$\textrm{Cov}(Y, Z) = \mathbb{E} \left[ (Y - \mathbb{E} \left[ Y \right]) (Z - \mathbb{E} \left[ Z \right]) \right]$$ **Standard deviation** is defined as the square root of the variance. Therefore,
$$\textrm{Corr}(Y, Z) = \frac{\textrm{Cov}(Y, Z)}{\sqrt{\textrm{Var}(Y)} \sqrt{\textrm{Var}(Z)}}$$
:::spoiler **Standardization**
As in standardizing feature values, dividing by the standard deviations of $Y$ and $Z$ serves to standardize covariance values, so that correlation always lies in the range $[-1, 1]$.
:::
Your next task is to calculate the **(sample) correlation matrix** $R$, which describes the correlations between all the features in the life expectancy dataset $X$. In particular, $r_{ij}$ is the (sample) correlation between feature $i$ and feature $j$, for $i, j \in \{ 1, \ldots, d \}$, e.g., the correlation between `life_expectancy` and `infant_deaths`.
A **sample** correlation, like a sample mean, is calculated from data. The **sample mean** of the `life_expectancy` feature, for example, is the average of all the `life_expectancy` values. Formally, the sample mean $\bar{x}$ of a vector $x \in \mathbb{R}^n$ is defined as $\frac{1}{n} \sum_{i=1}^n x_i$.
The formula for the sample correlation matrix $R$, given a matrix $X$, is as follows:
$$r_{ij} = \frac{\sum_{k=1}^n \left( x_{ki} - \bar{x}_i \right) \left( x_{kj} - \bar{x}_j \right)}{\sqrt{\sum_{k=1}^n \left( x_{ki} - \bar{x}_i \right)^2} \sqrt{\sum_{k=1}^n \left( x_{kj} - \bar{x}_j \right)^2}}$$ Intuitively, each column $k$ of the matrix $X$ represents samples of some underlying random variable $X_k$ (e.g., `HIV/AIDS`), and the sample correlation $r_{ij}$ is the sample covariance of $X_i$ and $X_j$ divided by the sample standard deviations of each.
:::info
**Task 1.3**
Write the `corr` function in the `life_expectancy.py` file to compute a correlation matrix for all pairs of features in a given matrix $X$. Then run `python life_expectancy.py --correlation` to run `corr`. In your README, report on two or three of the correlation coefficients you find most interesting.
:::
:::spoiler **Hint** numpy
Feel free to use numpy!
:::
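For instance, `np.corrcoef` makes a handy sanity check for your own `corr` implementation (a sketch with placeholder data; `rowvar=False` tells numpy to treat columns, rather than rows, as variables):

```python
import numpy as np

X = np.random.rand(100, 4)            # placeholder data, one column per feature
R = np.corrcoef(X, rowvar=False)      # (4, 4) matrix of pairwise sample correlations
```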
:::info
**Task 1.4**
Choose three features that are highly correlated--either positively or negatively--with life expectancy, and regress `life_expectancy` on each one using your `fit` and `predict` functions. Add your implementation to `single_feature_regression`. Then use your `mse` function to calculate the mean-squared error for each model, both on a training set and a test set.
You can run a single-feature regression using the command-line argument `--feature`: for example, `python life_expectancy.py --feature BMI` or `python life_expectancy.py --feature Polio`. To see a list of available features, run `python life_expectancy.py --feature help`.
In your README, summarize your three simple linear regression models, reporting the training and test error in all cases. Which one seemed to discover the best fit visually? Did this one also yield the lowest training and test MSE?
:::
:::spoiler **Hint** Plots
Use the `plot_model` method provided by `LinearRegression` to visualize the three linear models.
:::
:::spoiler **Hint** Holdout Data
Remember, you can use the `train_test_split` function in `sklearn` to hold out data for testing purposes.
:::
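Building on that hint, a usage sketch with placeholder arrays standing in for the life expectancy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)   # placeholder features
y = np.random.rand(100)      # placeholder responses

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```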
Steve likes what he sees so far with these simple linear regression models. Gaining in confidence, he decides to build a linear model that incorporates *all* the features.
Run `python life_expectancy.py --all-features --linear` to train a `LinearRegression` model with all available features. This command outputs a training and test MSE, alongside the coefficient values for each feature.
Steve looks at the training MSE. Steve looks at the test MSE. He looks at the train error again. He looks back at the test error. He sighs, and says "Huh? That's not quite right".
On this "all-features" model, the training MSE is significantly lower than the test MSE, a sign that the model is **overfitting** the training data, because it does not generalize to the test data. On the other hand, a simple linear regression is likely **underfit** to the training data; it is missing important trends in the data that do generalize. Complex (resp. simple) models are more likely to overfit (resp. underfit) the training data. Complexity is tied to generalization error through the statistical measures of **bias** and **variance**.
## Part 2: The Bias-Variance Tradeoff
Recall that an **estimator** is a rule $\hat{f}_{\mathcal{D}}$ for computing a **statistic**, i.e., estimating a quantity $\hat{f}_{\mathcal{D}}(x)$ from data $\mathcal{D}$. The expected value of an estimator $\mathbb{E} [\hat{f}_{\mathcal{D}}(x)]$ is the expected value with respect to $\mathcal{D}$ of the predictions generated by the corresponding estimators $\hat{f}_{\mathcal{D}}$.
The bias and variance of an estimator are two statistical quantities central to understanding how machine learning models perform. The **bias** of an estimator $\hat{f}$ at a point $x$ is defined as
$$\mathrm{Bias} \left( \hat{f}_{\mathcal{D}}, f; x \right) = \mathbb{E} [\hat{f}_{\mathcal{D}}(x)] - f(x)$$ In other words, bias is the difference between the expected value of the estimator evaluated at $x$ and the underlying function $f$ evaluated at $x$. The **variance** of an estimator at a point $x$ is given by
$$\mathrm{Var} \left( \hat{f}_{\mathcal{D}}; x \right) = \mathbb{E} \left[ \left( \hat{f}_{\mathcal{D}} (x)- \mathbb{E}_{\mathcal{D}} \left[ \hat{f}_{\mathcal{D}} (x) \right] \right)^2 \right]$$
So why are bias and variance so important? Because all the mean-squared error in a machine learning model can be decomposed in terms of bias and variance!
Stated in terms of expectations (rather than samples), the **mean-squared error** is defined as follows:
$$\mathrm{MSE} \left( \hat{f}_{\mathcal{D}}, f; x \right) = \mathbb{E} \left[ \left( \hat{f}_{\mathcal{D}} (x) - f (x) \right)^2 \right]$$
This quantity decomposes as follows:
$$\mathrm{MSE} \left( \hat{f}_{\mathcal{D}}, f \right) = \mathrm{Bias} \left( \hat{f}_{\mathcal{D}}, f \right)^2 + \mathrm{Var} \left( \hat{f}_{\mathcal{D}} \right)$$
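To see why, add and subtract $\mathbb{E} [\hat{f}_{\mathcal{D}}(x)]$ inside the square (the argument $x$ is suppressed for brevity); the cross term vanishes because $\mathbb{E} \left[ \hat{f}_{\mathcal{D}} - \mathbb{E} [\hat{f}_{\mathcal{D}}] \right] = 0$:
$$\begin{aligned} \mathbb{E} \left[ \left( \hat{f}_{\mathcal{D}} - f \right)^2 \right] &= \mathbb{E} \left[ \left( \left( \hat{f}_{\mathcal{D}} - \mathbb{E} [\hat{f}_{\mathcal{D}}] \right) + \left( \mathbb{E} [\hat{f}_{\mathcal{D}}] - f \right) \right)^2 \right] \\ &= \mathbb{E} \left[ \left( \hat{f}_{\mathcal{D}} - \mathbb{E} [\hat{f}_{\mathcal{D}}] \right)^2 \right] + \left( \mathbb{E} [\hat{f}_{\mathcal{D}}] - f \right)^2 \\ &= \mathrm{Var} \left( \hat{f}_{\mathcal{D}} \right) + \mathrm{Bias} \left( \hat{f}_{\mathcal{D}}, f \right)^2 \end{aligned}$$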
### Tasks
In this part of the assignment, you will run experiments that demonstrate the bias-variance tradeoff in action. You will use **polynomial regression** to build a model $\hat{f}_{\mathcal{D}}$ given data $\mathcal{D}$ drawn from a synthetic, noisy data generation function $f(x) + \epsilon$, where $x \in \mathbb{R}$ and $\epsilon$ represents noise.
<!-- The y-noise is a normal distribution around f(x) with standard deviation $\sigma$, the x are uniformly sampled from [-1, 1], the noise isn't truncated. -->
A **polynomial regression** of degree $d$ is simply a linear regression performed on an augmented matrix of feature values. For example, given a feature vector $x \in \mathbb{R}^n$, a polynomial regression of degree 3 would build a matrix $\begin{bmatrix} x^0 & x^1 & x^2 & x^3 \end{bmatrix}$, where the entries of $x^i$, for each $i \in \{ 0, 1, 2, 3 \}$, are simply
$$\begin{bmatrix} x_1^i \\ x_2^i \\ \vdots \\ x_n^i \end{bmatrix}$$ and then fit a (linear) model $\hat{f}_{\mathcal{D}} (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$.
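A numpy broadcasting sketch of this transformation (the vector `x` below is a placeholder):

```python
import numpy as np

x = np.linspace(-1, 1, 5)                        # placeholder inputs
degree = 3
X_poly = x[:, None] ** np.arange(degree + 1)     # shape (n, degree + 1); column i holds x**i
# np.vander(x, degree + 1, increasing=True) builds the same matrix.
```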
:::info
**Task 2.1**
In `models.py`, implement `PolynomialRegression`. Like `LinearRegression`, `PolynomialRegression` inherits from `BaseEstimator`, and thus requires `fit` and `predict`.
:::
:::success
Prediction in polynomial regression is identical to prediction in linear regression. Fitting differs only in that the data are transformed before a linear model is fit.
:::
The data in this part of the assignment is synthetic, and generated by a known function, $f(x) = \sin (\pi x)$. For pedagogical purposes, however, we will pretend that this function is unknown, and that only noisy samples $y$ of $f(x)$ are available. These samples distort $f(x)$ by additive noise $\epsilon \sim \mathcal{N}(0, \sigma)$. That is, $y = f(x) + \epsilon$.
:::spoiler **Normal Distribution**
The notation $\mathcal{N}(\mu, \sigma)$ denotes the normal distribution with mean $\mu$ and standard deviation $\sigma$.
:::
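As a sketch of one dataset draw, with $x$ values sampled uniformly from $[-1, 1]$ (the sample size and noise level below are illustrative, not the stencil's defaults):

```python
import numpy as np

rng = np.random.default_rng()
n, sigma = 30, 0.3                               # illustrative sample size and noise level
x = rng.uniform(-1, 1, size=n)                   # x values sampled uniformly from [-1, 1]
y = np.sin(np.pi * x) + rng.normal(0, sigma, n)  # f(x) plus Gaussian noise with std sigma
```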
To investigate the bias-variance tradeoff, you will build multiple estimators from multiple sampled datasets: i.e., multiple pairs $(x, y) \in \mathbb{R}^{n \times 2}$. The procedure goes as follows:
1. `sample_function`: Sample a dataset from the data-generating function $y = f(x) + \mathcal{N}(0, \sigma)$, by generating sample $x$ values and calculating a corresponding $y$ value for each.
2. `build_sample_estimator`: Fit an estimator to your sample dataset $(x, y)$.
3. `build_many_estimators`: Repeat steps 1 and 2 `num_estimators` times.
:::info
**Task 2.2**
Implement `sample_function`, `build_sample_estimator`, and `build_many_estimators` in `bias_and_variance.py`.
:::
:::success
Build your sample estimators using the `fit` method in your `PolynomialRegression` class.
:::
Run `python bias_and_variance.py --visualize-sample` to visualize $f(x)$ and the set of points sampled by `sample_function`.
Run `python bias_and_variance.py --visualize-estimators -d 3` to visualize the sampled estimators and their mean, computed using polynomial regression of degree 3. Compare these plots with those generated using polynomial regression of degrees 1 and 7.
These plots will depict each of the sample estimators (lighter lines) as well as their mean predictions (darker line).
Polynomial regression models with lower degrees (i.e., less complexity) tend to be similar to one another--they have low variance--but their high bias causes them to miss important trends. As a result, they tend to underfit the training data.
Polynomial regression models with higher degrees (i.e., more complexity) tend to overfit the training data. Moreover, they tend to differ from one another quite drastically, as each is overfit to the particular dataset it was trained on. In other words, high complexity models tend to exhibit high variance.
:::info
**Task 2.3**
Implement `calculate_bias` and `calculate_variance` in `bias_and_variance.py`.
:::
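The expectations in the bias and variance definitions are taken over datasets $\mathcal{D}$, so in practice you estimate them from the predictions of your sampled estimators. A minimal sketch, assuming a placeholder array `preds` of shape `(num_estimators, num_points)` and the true values `f_x` at the same points (these names are illustrative, not the stencil's):

```python
import numpy as np

preds = np.random.rand(50, 100)   # placeholder: one row of predictions per sampled estimator
f_x = np.random.rand(100)         # placeholder: true f(x) at the same test points

mean_pred = preds.mean(axis=0)                 # estimate of E[f_hat(x)] at each point
bias_sq = np.mean((mean_pred - f_x) ** 2)      # squared bias, averaged over the test points
variance = np.mean((preds - mean_pred) ** 2)   # variance, averaged over the test points
```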
After implementing these methods, run `python bias_and_variance.py --visualize-loss`. This program will build many estimators from sampled datasets, and then report the squared bias, variance, MSE, and squared bias + variance for this set of estimators. According to the bias-variance decomposition theorem, squared bias + variance should equal MSE for each polynomial degree.
The program will also produce the figure `figs/train_test_error.png`, which depicts the relationship between training error, test error, bias, variance, and model complexity (the degree of the polynomial).
:::info
**Task 2.4**
Answer the following question in your README:
Explain in 2-3 sentences any connections you perceive between the results depicted in `figs/train_test_error.png` and the trends you observed while working with decision trees in Homework 4. How do you think the bias and variance of decision trees are affected as they become more or less complex (e.g., as their maximum depth varies)?
:::
If we wanted to find the best model for this task, we could perform the same search process as in Homework 4 (i.e., cross-validation over polynomials of different degrees), thereby searching the parameter space for a good balance of bias and variance.
## Part 3: Ridge Regression
<!-- Next, we introduce a tool for preventing overfitting, and thereby helping to optimize the bias-variance tradeoff: **regularization**. Regularization is a set of techniques that penalizes models for their complexity, albeit by altering their objective function. -->
Take a look at `figs/estimators_d_9.png` (reproduced below), which should depict some sample models that have clearly overfit their training sets, and so do not match the underlying function closely at all. What makes these models vary so wildly? It turns out they share a commonality: many of their coefficients are very large--in the 1,000s or 10,000s. In almost all polynomial regression models, large coefficients are an indication that the model has overfit. So what if we were to force our models to prefer smaller parameters? Doing so is a form of **regularization**.

Standard linear regression seeks to minimize the squared error, i.e., $(X w - y)^T (X w - y)$. **Ridge regression** adds an additional term--called a **regularizer**--to this objective, namely $\lambda w^T w$, for some hyperparameter $\lambda \ge 0$. This regularizer can be interpreted as penalizing large values of $w$, since the aim is to *minimize* the objective. The objective, or loss function $\mathcal{L}$, for ridge regression is thus given by:
$$\mathcal{L} = (X w - y)^T (X w - y) + \lambda w^T w$$
Just as in the standard linear regression case, we minimize this loss function by solving for $w^*$ such that $\nabla_w \mathcal{L} (w^*) = 0$. And just as in the standard linear regression case, there is a closed-form solution for $w$, namely:
$$w = (X^TX +\lambda I)^{-1}X^Ty$$
This closed-form solution is a simple generalization of linear regression, in which $\lambda I$ is added to $X^TX$ before computing the inverse. The value of the hyperparameter $\lambda$ controls how much the model is penalized for using large coefficients.
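As a numpy sketch of this closed form (placeholder data; `lam` stands in for $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))    # placeholder data
y = rng.normal(size=50)
lam = 1.0

d = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lambda I)^{-1} X^T y
```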
:::info
**Task 3.1**
Implement the `fit` method of `RidgeRegression` in `models.py`.
:::
:::warning
**Note** Since `RidgeRegression` inherits from `LinearRegression`, you do *not* need to implement `predict`.
:::
To see the results of running ridge regression, run `python bias_and_variance.py --visualize-estimators -d 7 --ridge --lam 1.0`, modifying the `lam` parameter as desired.
:::spoiler Why use variables called `lam` and not `lambda`?
`lambda` is a reserved keyword in Python (and many other languages). It is used to create anonymous functions (i.e., functions without names). The history of anonymous functions is rich, dating back to the [$\lambda$-calculus](https://en.wikipedia.org/wiki/Lambda_calculus), which was fundamental to the theory of computation before computers even existed!
:::
:::info
**Task 3.2**
Answer the following question in your README:
Compare the bias and variance of various 7th degree polynomial regression models created using different values of the hyperparameter $\lambda$. How does changing the value of $\lambda$ affect a model's bias and variance?
<!-- So far, we have kept a number of other variables the same, including $f(x)$, $\sigma$, the number of sampled estimators, and the number of sampled points to train each estimator. Choose another function for $f(x)$ (but not a polynomial). Run `python bias_and_variance.py --visualize-loss` with your new $f(x)$ and report your findings. Do you see the same trends in bias, variance, and error that you observed with $f(x)=\sin(\pi x)$? -->
:::
## Part 4: Life Expectancy Redux
Steve has learned so much about linear regressions, bias, and variance, and feels ready to return to predicting life expectancy!
Ridge regression seemed to work pretty well on the synthetic data. Let's see how it does on our real-world data set.
Run `python life_expectancy.py --all-features --ridge --lam 0.1` to build a ridge regression model on the life expectancy dataset with $\lambda = 0.1$.
How does the performance compare to a standard linear regression? You can use `python life_expectancy.py --all-features --ridge --lam 0.1 --linear` to run both models and compare their errors.
Ridge Regression should perform better than Linear Regression, but only slightly so.
Ridge regression adds a penalty term to the usual linear regression objective using $w^T w = \sum_i w_i^2$ as the regularizer. **LASSO** is based on the same intuitive idea--penalizing large $w$s--but it uses a different regularizer, namely $\sum_i |w_i|$.
Unfortunately, the absolute value function is not everywhere differentiable, so we cannot resort to calculus as usual to solve for a closed-form solution to LASSO. But not to fear, there are other tools in our toolbox!
Remember our good friend gradient descent? (Steve remembers her well!) Of course, gradient descent relies on a gradient, but it generalizes to an algorithm called *subgradient* descent, which is slower, but still effective at minimizing the LASSO loss function.
<!-- Adding an absolute value to the objective function introduces issues for our normal closed form approach to linear regression. Absolute value functions are not-differentiable everywhere. So how can you find a good solution when you can't just set the gradient to 0? Using gradient descent! -->
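A minimal sketch of subgradient descent on the LASSO objective, using $\mathrm{sign}(w)$ as a subgradient of the absolute-value term (the data, step size, and iteration count here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # placeholder data
y = rng.normal(size=100)
lam, lr, steps = 0.1, 1e-3, 5000

w = np.zeros(X.shape[1])
for _ in range(steps):
    # Subgradient of (Xw - y)^T (Xw - y) + lam * sum_i |w_i|
    subgrad = 2 * X.T @ (X @ w - y) + lam * np.sign(w)
    w -= lr * subgrad
```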
To build a LASSO regression model, you will use `sklearn`'s LASSO implementation (`sklearn.linear_model.Lasso`). We have already done you the favor of constructing the LASSO regression object in the `compare_models` method of `life_expectancy.py`.
Run `python life_expectancy.py --all-features --all-models --lam 0.1`. Observe the performance difference between LASSO, ridge, and standard linear regression in terms of both their training and test error.
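For reference, fitting `sklearn`'s `Lasso` looks roughly like the following sketch (placeholder data; the stencil already constructs the model for you, and `sklearn` calls the regularization strength `alpha` rather than $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.random.rand(100, 5)   # placeholder features
y = np.random.rand(100)      # placeholder responses

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
y_hat = lasso.predict(X)
print(lasso.coef_)           # LASSO tends to drive some coefficients exactly to zero
```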
:::info
**Task 4.1**
Answer the following question in your README:
LASSO and ridge regression both supposedly penalize large coefficients. Compare the coefficient values for each model. Do the LASSO and ridge coefficients differ significantly from the standard regression values? Do they differ from each other?
:::
## Downloads
Please click [here](https://classroom.github.com/a/1lPYEsr-) to download the assignment code.
### Support Code
* `life_expectancy.csv`: This is the cleaned life expectancy data set.
### Stencil Code
* `life_expectancy.py`: This is where you will analyze the life expectancy data we have provided.
* `bias_and_variance.py`: This is where you will experiment with the bias-variance tradeoff.
* `models.py`: This is where you will implement the various regression models.
## Submission
### Handin
Your handin should contain the following:
- all modified files, including comments describing the logic of your implementations, and your tests
- the graphs you generated
- and, as always, a README containing:
- a brief paragraph summarizing your implementation
- your solutions to any conceptual questions
- known problems in the code
- anyone you worked with
- any outside resources used (e.g., Stack Overflow, ChatGPT)
### Gradescope
Submit your assignment via Gradescope.
To submit through GitHub, follow these commands:
1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`
Now, you are ready to upload your repo to Gradescope.
*Tip*: If you are having difficulties submitting through GitHub, you may submit by zipping up your hw folder.
### Rubric
| Component | Points | Notes |
|-------------------|------|--------------------------------|
| 1.1 Linear Regression | 10 | Points awarded for correct implementation of `fit` and `predict`. Your linear regression models should find the correct coefficients when provided a sample dataset and should return the correct $\hat{y}$ when `predict` is called. The closed form solution is deterministic, so your values should match exactly. |
| 1.2 Mean-squared error | 5 | Points awarded for correct calculation of `mse`. Your values should match ours exactly. |
| 1.3 Correlation Matrix | 15 | Points awarded for correct implementation of `corr`. Your code should return a correct correlation matrix, but how you use the information in `corr` to isolate correlated features of interest is up to you. Points also awarded for clearly reporting findings in your README.|
| 1.4 Single Feature Regressions | 10 | Points awarded for correct implementation of `single_feature_regression` and for clearly reporting findings for the three simple linear regressions in your README. |
| 2.1 Polynomial Regression | 10 | Points awarded for correctly implementing a degree-$d$ polynomial regression, meaning a correct implementation of `fit` and `predict`. Your coefficients should match ours exactly. |
| 2.2 Building Estimators | 15 | Points awarded for correct implementation of the estimator-building procedure. Your solution should return a list of estimators fit on sampled data. |
| 2.3 Bias and Variance | 10 | Points awarded for correct calculations of bias squared and variance. **Hint**: There is an easy way to confirm that your MSE, bias squared, and variance are correct. |
| 2.4 Answer questions | 5 | Points awarded for clear answer to the question and thoughtful connections to HW4 in README. |
| 3.1 Ridge regression | 5 | Points awarded for correct implementation of `fit`. |
| 3.2 Answer questions | 5 | Points awarded for correct interpretation of the figure and clear articulation of the bias-variance tradeoff for ridge regression in README. |
| 4.1 Answer questions | 10 | Points awarded for clearly identifying any meaningful differences between coefficients across regression types.|
:::success
Congrats on submitting your homework; Steve is proud of you!!


:::