# Linear Regression, Model Selection, and Regularization
## Introduction
Linear regression is a fundamental statistical technique for modeling linear relationships between a continuous dependent variable and one or more independent variables. It is widely used across diverse domains to analyze these relationships and predict how changes in the independent variables will affect the dependent variable.
## Simple Linear Regression
The simple linear regression model with a single independent variable $X$ and a dependent variable $Y$ is:
$$
Y = \beta_0 + \beta_1 X + \epsilon
$$
where $\epsilon$ represents the random error term.
### Estimating the Coefficients (Least Squares Method)
Given a dataset of $n$ observations $\{(x_i, y_i)\}_{i=1}^n$, we estimate the best-fit line by finding coefficients $\hat{\beta}_0$ and $\hat\beta_1$ that minimize the **residual sum of squares (RSS)**:
$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.
$$
The optimal values for the coefficients are:
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$
where $\bar x$ and $\bar y$ are the sample means of the predictor and response variables, respectively. This method is called the *least squares method*.
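As a quick check of these formulas, here is a small NumPy sketch that computes $\hat\beta_0$ and $\hat\beta_1$ on synthetic data (the data-generating values below are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # true beta0 = 2, beta1 = 0.5

x_bar, y_bar = x.mean(), y.mean()
# Closed-form least squares estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)   # should be close to 2 and 0.5
```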
### Example
We use the `sm.OLS` function from the statsmodels library to fit a simple linear regression model, with `medv` as the response and `lstat` as the predictor, using the Boston housing dataset.
```python=
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize
Boston = load_data("Boston")
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]), 'lstat': Boston['lstat']})
y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
```
The output shows the estimated coefficients and their statistical significance:
```python
coef std err t P>|t|
intercept 34.5538 0.563 61.415 0.0
lstat -0.9500 0.039 -24.528 0.0
```
The `get_prediction` method can be used to obtain predictions, confidence intervals, and prediction intervals for new values of `lstat`.
```python=+
new_df = pd.DataFrame({'lstat': [5, 10, 15]})
newX = MS(['lstat']).fit_transform(new_df)
new_predictions = results.get_prediction(newX)
print("Predicted values:", new_predictions.predicted_mean)
print("95% Confidence Intervals:", new_predictions.conf_int(alpha=0.05))
print("95% Prediction Intervals:", new_predictions.conf_int(obs=True, alpha=0.05))
```
```python
Predicted values: [29.80359411 25.05334734 20.30310057]
95% Confidence Intervals: [[29.00741194 30.59977628]
[24.47413202 25.63256267]
[19.73158815 20.87461299]]
95% Prediction Intervals: [[17.56567478 42.04151344]
[12.82762635 37.27906833]
[ 8.0777421 32.52845905]]
```
## Multiple Linear Regression
Multiple linear regression extends the concept to handle multiple independent variables:
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon,
$$
where:
* $X_j$ represents the $j$-th independent variable
* $\beta_j$ quantifies the association between that variable and the response, holding the other variables constant
* $\epsilon$ is the random error term
### Estimating the Coefficients (Least Squares Method)
Given estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$, predictions are made using:
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \dots + \hat \beta_p x_p.
$$
The coefficients are estimated by minimizing the RSS:
$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2
$$
using the least squares method.
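Equivalently, in matrix form the least squares estimate solves $\min_\beta \lVert \mathbf{y} - \mathbf{X}\beta \rVert^2$; a minimal sketch with made-up data, using `np.linalg.lstsq`:
```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Minimize RSS = ||y - X beta||^2 over beta
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)   # close to beta_true
```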
### Example
In this example, we fit a multiple linear regression model to the Boston housing data, using all the variables except `medv` as predictors and `medv` as the response.
```python=
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize
import statsmodels.api as sm
Boston = load_data("Boston")
# Exclude 'medv' from the predictors
terms = Boston.columns.drop('medv')
# Construct the design matrix X
X = MS(terms).fit_transform(Boston)
# Set 'medv' as the response variable
y = Boston['medv']
# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()
# Summarize the results
summarize(results)
```
The output shows the estimated coefficients and their statistical significance:
```python
coef std err t P>|t|
intercept 41.6173 4.936 8.431 0.000
crim -0.1214 0.033 -3.678 0.000
zn 0.0470 0.014 3.384 0.001
indus 0.0135 0.062 0.217 0.829
chas 2.8400 0.870 3.264 0.001
nox -18.7580 3.851 -4.870 0.000
rm 3.6581 0.420 8.705 0.000
age 0.0036 0.013 0.271 0.787
dis -1.4908 0.202 -7.394 0.000
rad 0.2894 0.067 4.325 0.000
tax -0.0127 0.004 -3.337 0.001
ptratio -0.9375 0.132 -7.091 0.000
lstat -0.5520 0.051 -10.897 0.000
```
To perform a regression using all variables except `age`, we can use the following code:
```python=+
minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())
```
```python
coef std err t P>|t|
intercept 41.5251 4.920 8.441 0.000
crim -0.1214 0.033 -3.683 0.000
zn 0.0465 0.014 3.379 0.001
indus 0.0135 0.062 0.217 0.829
chas 2.8528 0.868 3.287 0.001
nox -18.4851 3.714 -4.978 0.000
rm 3.6811 0.411 8.951 0.000
dis -1.5068 0.193 -7.825 0.000
rad 0.2879 0.067 4.322 0.000
tax -0.0127 0.004 -3.333 0.001
ptratio -0.9346 0.132 -7.099 0.000
lstat -0.5474 0.048 -11.483 0.000
```
## Assessing Model Accuracy
Key statistics for evaluating the performance of a linear regression model:
* **F-statistic:** $F = \frac{(\text{TSS} - \text{RSS}) / p}{\text{RSS} / (n - p - 1)}$
* A high $F$-statistic indicates that at least one independent variable has a significant effect on the dependent variable.
* **R-squared ($R^2$):** $R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$
* Measures the proportion of variability in the dependent variable $Y$ that is explained by the model.
* A value closer to 1 indicates a better fit.
* **Residual Standard Error (RSE):** $\text{RSE} = \sqrt{\frac{1}{n - p - 1} \text{RSS}}$
* Measures the average amount that the response deviates from the true regression line.
* A smaller RSE indicates a better fit.
where:
* $n$ is the number of observations
* $p$ is the number of predictors
* TSS (Total Sum of Squares) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$, the total variability in the dependent variable
* RSS (Residual Sum of Squares) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, the variability unexplained by the model
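As a sketch, these quantities can be computed by hand and checked against the attributes statsmodels exposes, assuming `results`, `X`, and `y` still refer to the multiple regression fit above:
```python
import numpy as np

n = len(y)
p = X.shape[1] - 1                     # number of predictors (excluding the intercept column)
y_hat = results.fittedvalues
RSS = np.sum((y - y_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)

F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
R2 = 1 - RSS / TSS
RSE = np.sqrt(RSS / (n - p - 1))

print(F, results.fvalue)               # F-statistic (by hand vs. statsmodels)
print(R2, results.rsquared)            # R-squared
print(RSE, np.sqrt(results.scale))     # residual standard error
```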
## Beyond the Basics: Model Selection and Regularization
While the standard linear model serves as a useful baseline, we can often improve its performance and interpretability through model selection and regularization methods. These techniques are important for:
* **Model Interpretability:** Simpler models with only the most relevant variables are easier to understand and explain.
* **Prediction Accuracy:** Carefully controlling model complexity can prevent overfitting and lead to better predictions on new, unseen data.
### Key Techniques
1. **Subset Selection:** Identifying the most important subset of predictors related to the response variable, such as best subset selection and stepwise selection.
2. **Shrinkage (Regularization):** Retaining all predictors but reducing the magnitude of their coefficients towards zero, using methods like Ridge Regression and Lasso.
3. **Dimension Reduction:** Transforming the predictors into a lower-dimensional space by creating linear combinations (projections) of the original variables, as in *Principal Components Regression (PCR)* and *Partial Least Squares (PLS)*.
## Subset Selection
In linear regression with multiple variables, subset selection techniques help identify the most important subset of predictors:
### Best Subset Selection
* Exhaustively evaluates all possible combinations of predictors to find the theoretically optimal model.
* Use cross-validation or a validation set to prevent overfitting and select a model that generalizes well to unseen data.
:::success
<font color=blue>Algorithm: Best Subset Selection</font>
**Input:**
- Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
- Evaluation metric (e.g., cross-validation error, AIC, BIC)
**Output:** The "best" subset of predictors according to the chosen evaluation metric
**Procedure:**
1. **Initialize:**
- Let `M_0` be the null model (no predictors).
- Set `best_model` = `M_0`
- Set `best_error` = Evaluate(`M_0`) (using the chosen metric)
2. **Iterate over subset sizes (k):**
- For k = 1 to p:
- Generate all possible subsets: Create all possible combinations of `k` predictors from the total `p` predictors.
- Fit and evaluate:
- For each subset `S` of size `k`:
- Fit the linear regression model using predictors in `S`.
- Calculate `error` = Evaluate(model using `S`)
- Update best: If `error` < `best_error`:
- `best_model` = model using `S`
- `best_error` = `error`
- Store the current `best_model` as `M_k`
3. **Select the best:**
* Return `best_model` (which will be one of the stored `M_k` models)
:::
#### Key Points
* **Exhaustive Search:** This algorithm explores all possible combinations of predictors, making it computationally expensive for large `p`.
* **Evaluation Metric:** The choice of metric is crucial. Cross-validation error is often preferred to prevent overfitting. AIC and BIC are alternatives that penalize model complexity.
* **Model Selection:** The algorithm doesn't just identify the best subset, but also fits the corresponding regression model for you.
* **Practical Considerations:** For large datasets, you might consider stepwise selection or regularization methods (like Lasso or Ridge regression) as computationally efficient alternatives.
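As a concrete illustration of the procedure, here is a minimal sketch of best subset selection on synthetic data, using AIC from statsmodels as the (assumed) evaluation metric:
```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)     # only predictors 0 and 2 matter

best = {}                                              # best (AIC, subset) for each size k
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        aic = sm.OLS(y, sm.add_constant(X[:, subset])).fit().aic
        if k not in best or aic < best[k][0]:
            best[k] = (aic, subset)

# Choose the overall winner M_k across subset sizes
k_star = min(best, key=lambda k: best[k][0])
print("Selected predictors:", best[k_star][1])
```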
### Stepwise Selection
Computationally efficient alternatives for large datasets:
#### Forward Stepwise Selection
* Begins with no predictors and iteratively adds the most beneficial predictor at each step.
:::success
<font color=blue>Algorithm: Forward Stepwise Selection</font>
**Input:**
* Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
* Evaluation metric (e.g., cross-validation error, AIC, BIC)
**Output:** The "best" subset of predictors according to the chosen evaluation metric
**Procedure:**
1. **Initialize:**
* Let `M_0` be the null model (no predictors).
* Set `remaining_predictors` = all predictors (`X_1` to `X_p`)
2. **Iterate, adding one predictor at a time:**
* While `remaining_predictors` is not empty:
* **Find the best predictor to add:**
* For each predictor `X_j` in remaining_predictors:
* Fit the model by adding X_j to the current model.
* Calculate `error` = Evaluate(model with `X_j` added)
* Select the predictor `X_best` that results in the lowest `error`.
* **Update the model and remaining predictors:**
* Add `X_best` to the current model.
* Remove `X_best` from remaining_predictors.
* Store the current model as `M_{k+1}` (where k is the current number of predictors in the model).
3. **Select the best:**
* Choose the best model among all `M_k` models using the evaluation metric.
:::
##### Key Points
* **Greedy Algorithm:** Forward Stepwise Selection is a greedy algorithm. It adds predictors one at a time, choosing the one that improves the model the most at each step. This doesn't guarantee finding the absolute best subset but is computationally less expensive than Best Subset Selection.
* **Evaluation Metric:** Just like Best Subset Selection, the choice of metric is crucial for avoiding overfitting.
* **Stopping Criterion:** The algorithm continues until all predictors have been added. In practice, you might use a stopping criterion based on the evaluation metric (e.g., stop when the error doesn't decrease significantly anymore).
* **Model Selection:** The algorithm identifies the best subset and the corresponding models at each step. You can then choose the final model based on your evaluation metric or other criteria like interpretability.
#### Backward Stepwise Selection
* Starts with all predictors and iteratively removes the least significant one at each step.
* Can be advantageous if you expect many predictors to be relevant.
:::success
<font color=blue>Algorithm: Backward Stepwise Selection</font>
**Input:**
* Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
* Evaluation metric (e.g., cross-validation error, AIC, BIC)
**Output:** The "best" subset of predictors according to the chosen evaluation metric
**Procedure:**
1. **Initialize:**
* Let `M_p` be the full model (all predictors).
* Set `current_predictors` = all predictors (`X_1` to `X_p`)
2. **Iterate, removing one predictor at a time:**
* While `current_predictors` is not empty:
* **Find the best predictor to remove:**
* For each predictor `X_j` in `current_predictors`:
* Fit the model by removing `X_j` from the current model.
* Calculate `error` = Evaluate(model without `X_j`)
* Select the predictor `X_worst` that results in the lowest `error` (or the least degradation of model performance).
* **Update the model and predictors:**
* Remove `X_worst` from the current model.
* Remove `X_worst` from `current_predictors`.
* Store the current model as `M_{k-1}` (where k is the current number of predictors in the model).
3. **Select the best:**
* Choose the best model among all `M_k` models using the evaluation metric.
:::
##### Key Points
* **Greedy Algorithm:** Like Forward Stepwise Selection, Backward Stepwise Selection is also a greedy algorithm. It starts with the full model and removes predictors one by one, aiming to improve the model at each step.
* **Evaluation Metric:** The choice of metric is important to guide the predictor removal process and avoid overfitting.
* **Stopping Criterion:** The algorithm runs until no predictors are left. In practice, you can use a stopping criterion based on the evaluation metric (e.g., stop when removing a predictor significantly worsens the error).
* **Model Selection:** The algorithm keeps track of the best model at each step. You can then choose the final model based on the evaluation metric or consider a trade-off between model performance and complexity.
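Since the worked examples later in this section use forward selection, here is a minimal sketch of the backward procedure on synthetic data, again with AIC as the (assumed) evaluation metric and stopping once a single predictor remains:
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

current = list(range(p))                                # start from the full model M_p
models = {p: (sm.OLS(y, sm.add_constant(X)).fit().aic, tuple(current))}
while len(current) > 1:
    # Remove the predictor whose deletion yields the lowest AIC
    scores = []
    for j in current:
        keep = [c for c in current if c != j]
        scores.append((sm.OLS(y, sm.add_constant(X[:, keep])).fit().aic, j))
    best_aic, worst = min(scores)
    current.remove(worst)
    models[len(current)] = (best_aic, tuple(current))

# Pick the best model size across the whole backward path
k_star = min(models, key=lambda k: models[k][0])
print("Selected predictors:", models[k_star][1])
```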
### Model Selection Criteria
To prevent overfitting, we use criteria that balance goodness of fit with model complexity:
* **$C_p$, AIC, BIC:** Lower values generally indicate better models. They penalize complexity with different strengths (BIC has the strongest penalty).
* $C_p = \frac{1}{n}(\text{RSS} + 2d\hat{\sigma}^2)$
* $\text{AIC} = \frac{1}{n \hat{\sigma}^2}(\text{RSS} + 2d\hat{\sigma}^2)$
* $\text{BIC} = \frac{1}{n \hat{\sigma}^2} (\text{RSS}+\log(n)d\hat{\sigma}^2)$
Here $\hat{\sigma}^2$ is the estimated variance of the error term, and $d$ is the number of predictors in the model under consideration.
* **Adjusted $R^2$:** Also penalizes complexity. Higher values (up to 1) are better, indicating that the model explains a larger proportion of the variance in the response variable.
* $\text{Adjusted} \ R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}$
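These criteria can be computed directly from a model's RSS; a minimal sketch, where $\hat\sigma^2$ is assumed to be estimated from the full model (e.g. via `OLS(Y, X).fit().scale`, as in the example below):
```python
import numpy as np

def selection_criteria(RSS, TSS, n, d, sigma2):
    """C_p, AIC, BIC, and adjusted R^2 for a model with d predictors, per the formulas above."""
    Cp = (RSS + 2 * d * sigma2) / n
    AIC = (RSS + 2 * d * sigma2) / (n * sigma2)
    BIC = (RSS + np.log(n) * d * sigma2) / (n * sigma2)
    adj_R2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
    return Cp, AIC, BIC, adj_R2

# Illustrative (made-up) numbers
print(selection_criteria(RSS=1200.0, TSS=5000.0, n=100, d=5, sigma2=12.0))
```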
### Example: Implementing Subset Selection Techniques
To begin, we import necessary libraries and tools that facilitate data manipulation, statistical analysis, and model fitting.
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
# ISLP is the package accompanying the ISL textbook; it provides datasets and model-building helpers
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise, sklearn_selected
from functools import partial
from l0bnb import fit_path
```
#### Forward Selection using $C_p$
We use the `Hitters` dataset, which contains statistics of baseball players, to predict players' salaries. The initial step involves data cleaning to ensure a robust dataset.
```python=
Hitters = load_data('Hitters').dropna()
print(f"Dataset shape: {Hitters.shape}")
```
```python
Dataset shape: (263, 20)
```
Once the dataset is prepared, defining a scoring function based on the *negative $C_p$* statistic helps evaluate model performance, considering both fit and complexity.
```python=+
def nCp(sigma2, estimator, X, Y):
n, p = X.shape
Yhat = estimator.predict(X)
RSS = np.sum((Y - Yhat)**2)
return -(RSS + 2 * p * sigma2) / n
```
Preparing the dataset involves separating the target variable `Salary` from the predictors and estimating the error variance $\sigma^2$.
```python=+
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
sigma2 = OLS(Y, X).fit().scale
```
Forward stepwise selection is applied, selecting predictors that incrementally improve the model's performance.
```python=+
neg_Cp = partial(nCp, sigma2)
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
hitters_Cp = sklearn_selected(OLS,
                              strategy,
                              scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```
```python
('Assists', 'AtBat', 'CAtBat', 'CRBI', 'CRuns', 'CWalks', 'Division', 'Hits', 'PutOuts', 'Walks')
```
The result contains 10 (out of 19) selected predictors.
#### Best Subset Selection with `l0bnb`
The `l0bnb` package enables finding the optimal subset of predictors. This approach generates a series of models, identifying the one that achieves a balance between complexity and accuracy.
```python=+
from l0bnb import fit_path
D = design.fit_transform(Hitters).drop('intercept', axis=1)
X = np.asarray(D)
path = fit_path(X, Y, max_nonzeros=X.shape[1])
print(path[3])
```
```python
{'B': array([0. , 3.254844, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.677753 , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ]),
'B0': -38.98216739555494,
'lambda_0': 0.011416248027450194,
'M': 0.5829861733382011,
'Time_exceeded': False}
```
The `path[3]` output represents the model with 3 non-zero coefficients. The `B` array shows the coefficient values for each predictor, with 0 indicating that the predictor is not included in the model. To determine the best model, you would evaluate the performance (e.g., using cross-validation) of models with different numbers of non-zero coefficients and select the one that balances goodness of fit and complexity.
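One rough way to do this comparison is a validation-set split: refit the path on a training portion and score each solution on held-out data using the `B` and `B0` entries shown above. A sketch (not a full cross-validation; the split and seed are arbitrary):
```python
from sklearn.model_selection import train_test_split

# Refit the l0bnb path on a training split and score each solution on held-out data
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=0)
val_path = fit_path(X_train, Y_train, max_nonzeros=X_train.shape[1])

for sol in val_path:
    pred = X_val @ sol['B'] + sol['B0']                 # predictions from this solution's coefficients
    mse = np.mean((Y_val - pred) ** 2)
    nnz = int(np.sum(sol['B'] != 0))
    print(f"{nnz} nonzero coefficients, validation MSE: {mse:.1f}")
```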
## Shrinkage Methods
In linear regression with many predictors, especially when some are correlated, shrinkage methods are essential to prevent overfitting and improve model generalization:
### Ridge Regression
Ridge regression penalizes the size of the coefficients, estimating them by minimizing $$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$ where $\lambda \ge 0$ is a tuning parameter.
* As $\lambda$ increases, coefficients shrink towards zero. When $\lambda=0$, ridge regression becomes equivalent to least squares.
* Ridge regression has a *constrained optimization form* that makes explicit the restriction on the sum of squares of the coefficients: $$\min_{\beta}{\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2}, \text{ subject to } \sum_{j=1}^{p} \beta_j^2 \le s$$
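For intuition, the ridge solution also has a closed form, $\hat\beta^{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$, when the data are centered and no intercept is fit. A minimal sketch on synthetic standardized data, checked against scikit-learn's `Ridge`:
```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)                          # standardize the predictors
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 0.3]) + rng.normal(size=n)
y = y - y.mean()                                        # center the response so no intercept is needed

lam = 10.0
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # (X'X + lambda*I)^{-1} X'y
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))           # the two solutions agree
```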
### The Lasso
The Lasso minimizes:
$$
\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
$$
* The Lasso's L1 penalty leads to sparse solutions, setting some coefficients to *exactly zero*, making it useful for feature selection.
* The Lasso has a constrained optimization form that highlights restricting the sum of the absolute values of the coefficients: $$\min_{\beta}{\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2}, \text{ subject to } \sum_{j=1}^{p} |\beta_j|\le s.$$
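A quick sketch of this sparsity, using scikit-learn's `Lasso` on synthetic data in which only the first two predictors matter (the penalty value is arbitrary):
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)      # only the first two predictors matter

coefs = Lasso(alpha=0.5).fit(X, y).coef_
print(np.round(coefs, 2))                               # most entries are exactly zero
print("Nonzero predictors:", np.flatnonzero(coefs))
```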
### Lasso vs. Ridge Regression
* The Lasso yields sparse, more interpretable models involving only a subset of the predictors, and it tends to predict better when only a few features have substantial effects; ridge regression tends to perform better when the response depends on many predictors of roughly similar importance.
* Standardizing features is often recommended before applying shrinkage methods.
* Cross-validation is crucial for selecting the optimal tuning parameter $\lambda$ that balances bias and variance.
#### Visualizing the Difference

Contours of the error (RSS) and constraint functions for the lasso (left) and ridge regression (right).
* The solid blue areas represent the constraint regions ($\sum_{j=1}^{p} |\beta_j|\le s$ for lasso, $\sum_{j=1}^{p} \beta_j^2 \le s$ for ridge), while the red ellipses are the contours of the RSS.
* The lasso constraint has corners, which allows coefficients to be set to exactly zero when the solution is at a corner. In contrast, the ridge constraint is smooth, so coefficients are shrunk but not set to zero.
### Example: Implementing Ridge Regression and Lasso
In this section, we explore the implementation of Ridge Regression and Lasso using Python and the scikit-learn library. These techniques are particularly useful when dealing with multicollinearity among predictors or when the number of predictors is large.
#### Ridge Regression
We demonstrate how to apply Ridge Regression over a range of $\lambda$ values to identify the optimal setting.
```python=
# Standardize the predictors
Xs = (X - X.mean(0)) / X.std(0)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
# Perform Ridge Regression using ElasticNet with l1_ratio set to 0
soln_array = skl.ElasticNet.path(Xs, Y, l1_ratio=0., alphas=lambdas)[1]
# Examine the shape of the solution array
print(soln_array.shape)
```
```python
(19,100)
```
The `soln_array` is a matrix with 19 rows (one for each predictor) and 100 columns (one for each value of $\lambda$).
```python=+
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path
```
```python
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League[N] Division[W] PutOuts Assists Errors NewLeague[N]
negative log(lambda)
-12.310855 0.000800 0.000889 0.000695 0.000851 0.000911 0.000900 0.000812 0.001067 0.001113 0.001064 0.001141 0.001149 0.000993 -0.000029 -0.000390 0.000609 0.000052 -0.000011 -0.000006
-12.078271 0.001010 0.001122 0.000878 0.001074 0.001150 0.001135 0.001025 0.001346 0.001404 0.001343 0.001439 0.001450 0.001253 -0.000037 -0.000492 0.000769 0.000065 -0.000014 -0.000007
-11.845686 0.001274 0.001416 0.001107 0.001355 0.001451 0.001433 0.001293 0.001698 0.001772 0.001694 0.001816 0.001830 0.001581 -0.000046 -0.000621 0.000970 0.000082 -0.000017 -0.000009
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10.482412 -290.958537 337.377455 37.587122 -60.158256 -26.750044 134.964477 -16.954081 -389.423414 88.164178 -13.219329 478.398404 258.967059 -213.505412 31.253747 -58.435983 78.761230 53.677759 -22.184893 -12.376191
10.714996 -290.986528 337.470648 37.642077 -60.243522 -26.800522 134.987027 -16.900054 -389.760135 87.864551 -13.416889 478.881540 259.321007 -213.584869 31.252760 -58.431454 78.761235 53.689152 -22.179964 -12.370587
```
Visualizing the path of coefficients as $\lambda$ changes helps in understanding the effect of regularization:
```python=+
path_fig, ax = plt.subplots(figsize=(8, 8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');
# Select a specific lambda value and show the corresponding coefficients
beta_hat = soln_path.loc[soln_path.index[39]]
print("Lambda:", lambdas[39])
print("Coefficients at this lambda:")
print(beta_hat)
```
```python
Lambda: 25.53538897200662
Coefficients at this lambda:
AtBat 5.433750
Hits 6.223582
HmRun 4.585498
Runs 5.880855
RBI 6.195921
Walks 6.277975
Years 5.299767
CAtBat 7.147501
CHits 7.539495
CHmRun 7.182344
CRuns 7.728649
CRBI 7.790702
CWalks 6.592901
League[N] 0.042445
Division[W] -3.107159
PutOuts 4.605263
Assists 0.378371
Errors -0.135196
NewLeague[N] 0.150323
Name: -3.240065292879872, dtype: float64
```

#### Lasso
The Lasso performs variable selection by setting some coefficients to exactly zero. We apply Lasso using cross-validation to select the optimal $\lambda$.
```python=
from sklearn.pipeline import Pipeline

# Set up the Lasso model (ElasticNet with l1_ratio=1 is the lasso)
lassoCV = skl.ElasticNetCV(n_alphas=100, l1_ratio=1, cv=5)
pipeCV = Pipeline(steps=[('scaler', StandardScaler()), ('lasso', lassoCV)])
pipeCV.fit(X, Y)
# Optimal lambda value
tuned_lasso = pipeCV.named_steps['lasso']
optimal_lambda = tuned_lasso.alpha_
print("Optimal lambda for Lasso:", optimal_lambda)
```
```python
Optimal lambda for Lasso: 3.1472370031649866
```
Visualizing the Lasso coefficient paths:
```python=+
# Compute Lasso path
lambdas, soln_array = skl.lasso_path(Xs, Y, alphas=np.linspace(0.001, optimal_lambda, 100))[:2]
# Plotting the coefficient trajectories
plt.figure(figsize=(8, 8))
plt.plot(-np.log10(lambdas), soln_array.T)
plt.xlabel('-log10(lambda)')
plt.ylabel('Standardized Coefficients')
plt.title('Coefficient Paths - Lasso')
plt.legend(D.columns, loc='upper right')
plt.show()
```

As $\lambda$ increases, some coefficients are set to exactly zero, performing variable selection.
```python=+
print("Minimum Mean Squared Error:", np.min(tuned_lasso.mse_path_.mean(1)))
print("Coefficients at optimal lambda:")
print(tuned_lasso.coef_)
```
```python
Minimum Mean Squared Error: 114690.73118253653
Coefficients at optimal lambda:
[-210.01008773 243.4550306 0. 0. 0.
97.69397357 -41.52283116 -0. 0. 39.62298193
205.75273856 124.55456561 -126.29986768 15.70262427 -59.50157967
75.24590036 21.62698014 -12.04423675 -0. ]
```
#### Cross-Validation Performance
Evaluating model performance through cross-validation:
```python=
# Cross-validation for Lasso
mse_path = pipeCV.named_steps['lasso'].mse_path_
mean_mse = np.mean(mse_path, axis=1)
opt_index = np.argmin(mean_mse)
opt_lambda = pipeCV.named_steps['lasso'].alphas_[opt_index]
plt.figure(figsize=(8, 6))
plt.plot(-np.log10(pipeCV.named_steps['lasso'].alphas_), mean_mse)
plt.axvline(-np.log10(opt_lambda), linestyle='--', color='k', label='Optimal lambda')
plt.xlabel('-log10(lambda)')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation')
plt.legend()
plt.show()
```

```python=+
print(tuned_lasso.coef_)
```
```python
array([-210.01008773, 243.4550306 , 0. , 0. ,
0. , 97.69397357, -41.52283116, -0. ,
0. , 39.62298193, 205.75273856, 124.55456561,
-126.29986768, 15.70262427, -59.50157967, 75.24590036,
21.62698014, -12.04423675, -0. ])
```
## Dimension Reduction Methods
When facing high-dimensional data in linear regression, dimension reduction is essential:
1. **Constructing New Predictors:**
* Transform the original predictors $(X_1, X_2, ..., X_p)$ into a smaller set $(Z_1, Z_2, ..., Z_M)$, where $M < p$.
* Each new predictor is a linear combination: $Z_m = \sum_{j=1}^{p}\phi_{jm}X_j$
2. **Fitting the Model:** Fit a standard linear regression using the new predictors: $y_i=\theta_0+\sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i$
### Principal Components Regression (PCR)
* Uses **Principal Component Analysis (PCA)** to create orthogonal components for the regression model.
* The first component identifies the direction that explains the most variance in the data.
* Subsequent components maximize the remaining variance while being orthogonal to previous components.
* PCR performs linear regression using a subset of these components as predictors.
### Partial Least Squares (PLS)
* PLS considers both the predictors and the response variable when constructing new features, making it useful when predictors are highly collinear.
* Each new component is chosen to maximize the covariance between the response and the component, while being orthogonal to previous components.
* PLS prioritizes features that explain variance in the predictors and have a strong relationship with the response.
The optimal number of components for PCR and PLS is typically chosen through cross-validation.
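To make the construction of the $Z_m$ concrete, the sketch below builds principal-component scores by hand from the PCA loadings $\phi_{jm}$ and regresses the response on them (synthetic data; in practice one would use a pipeline as in the examples that follow):
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 200, 8, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

Xc = X - X.mean(0)                      # PCA operates on centered predictors
pca = PCA(n_components=M).fit(Xc)
phi = pca.components_.T                 # (p, M) loading matrix; column m holds the phi_jm
Z = Xc @ phi                            # Z_m = sum_j phi_jm X_j  (same as pca.transform(Xc))

pcr = LinearRegression().fit(Z, y)      # regress the response on Z_1, ..., Z_M
print(pcr.intercept_, pcr.coef_)
```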
### Examples of Dimension Reduction Techniques
#### Principal Components Regression
```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import sklearn.model_selection as skm
import sklearn.dummy as skd
```
```python=
# Configure the pipeline with PCA and linear regression
pca = PCA() # Let GridSearchCV choose the number of components
linreg = LinearRegression()
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', pca),
('linreg', linreg)
])
# 5-fold cross-validation, reused for both the PCR and PLS examples
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
# Use GridSearchCV to choose the number of components
param_grid = {'pca__n_components': range(1, X.shape[1])}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
# Best model and number of components
best_model = grid.best_estimator_
best_n_components = grid.best_params_['pca__n_components']
print(f"Best number of components: {best_n_components}")
print(f"Coefficients from PCR: {best_model.named_steps['linreg'].coef_}")
# Visualize the cross-validated MSE for different numbers of components
plt.figure(figsize=(8, 8))
plt.errorbar(param_grid['pca__n_components'], -grid.cv_results_['mean_test_score'], yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
plt.ylabel('Cross-validated MSE')
plt.xlabel('Number of Principal Components')
plt.xticks(param_grid['pca__n_components'])
plt.axvline(best_n_components, color='red', linestyle='--', label=f'Best: {best_n_components} components')
plt.legend()
plt.show()
```
```python
Best number of components: 17
Coefficients from PCR: [ 106.36859204 -21.60350456 24.2942534 -36.9858579 -58.41402748
62.20632652 24.63862038 15.82817701 29.57680773 99.64801199
-30.11209105 20.99269291 72.40210574 -276.68551696 -74.17098665
422.72580227 -347.05662353]
```

The graph shows the cross-validated MSE for PCR models with different numbers of principal components. The red dashed line indicates the best number of components chosen by `GridSearchCV`.
To provide context, we can compare the performance of PCR to a null model that always predicts the mean of the response variable:
```python=+
# Initialize a dummy regressor to represent the null model
null_model = skd.DummyRegressor(strategy="mean")
# Cross-validate the null model
cv_results_null = skm.cross_validate(null_model,
X, # The null model doesn't use the features, so X can be passed directly
Y,
cv=kfold,
scoring='neg_mean_squared_error')
# Print the raw cross-validation results, then the mean MSEs
print(cv_results_null)
# test_score holds negative MSEs, so negate to report MSE
mean_mse_null = -np.mean(cv_results_null['test_score'])
print(f"Mean CV MSE of the Null Model: {mean_mse_null:.2f}")
print(f"Mean CV MSE of the Best PCR Model: {-grid.best_score_:.2f}")
```
```python
{'fit_time': array([0.01192951, 0.00453162, 0.00129294, 0.00115967, 0.00858212]), 'score_time': array([0.00078082, 0.00064325, 0.00054502, 0.00050497, 0.00090194]), 'test_score': array([-119988.92291156, -331360.53801348, -225425.91968383,
-202274.93309837, -149185.5382921 ])}
Mean CV MSE of the Null Model: 205647.17
Mean CV MSE of the Best PCR Model: 116222.02
```
The PCR model achieves a much lower MSE compared to the null model, indicating that the principal components are capturing useful information for predicting the response.
After fitting the PCA model, we can examine the proportion of variance explained by each principal component:
```python=+
# Assuming 'best_model' is the best PCR model from GridSearchCV
explained_variance_ratios = best_model.named_steps['pca'].explained_variance_ratio_
print("Explained Variance Ratios:")
for i, ratio in enumerate(explained_variance_ratios, start=1):
print(f"PC{i}: {ratio:.3f}")
```
```python
Explained Variance Ratios:
PC1: 0.383
PC2: 0.218
PC3: 0.107
PC4: 0.082
PC5: 0.053
PC6: 0.043
PC7: 0.036
PC8: 0.027
PC9: 0.013
PC10: 0.010
PC11: 0.007
PC12: 0.007
PC13: 0.005
PC14: 0.003
PC15: 0.003
PC16: 0.001
PC17: 0.001
```
#### Partial Least Squares (PLS)
```python=+
from sklearn.cross_decomposition import PLSRegression
# Configure PLS
pls = PLSRegression()
# Optimize the number of components using GridSearchCV
param_grid = {'n_components': range(1, X.shape[1])}
grid = GridSearchCV(pls, param_grid, cv=kfold, scoring='neg_mean_squared_error')
grid.fit(X, Y)
# Best model and number of components
best_model = grid.best_estimator_
best_n_components = grid.best_params_['n_components']
print(f"Best number of components: {best_n_components}")
print(f"Coefficients from PLS: {best_model.coef_}")
# Visualizing the MSE across different component counts
plt.figure(figsize=(8, 8))
plt.errorbar(param_grid['n_components'], -grid.cv_results_['mean_test_score'], yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
plt.ylabel('Cross-validated MSE')
plt.xlabel('Number of PLS Components')
plt.xticks(param_grid['n_components'])
plt.axvline(best_n_components, color='red', linestyle='--', label=f'Best: {best_n_components} components')
plt.legend()
plt.show()
```
```python
Best number of components: 12
Coefficients from PLS: [[-343.21483428]
[ 372.88649473]
[ 25.36751813]
[ -30.4196112 ]
[ -28.80370164]
[ 135.61719771]
[ -45.41307594]
[-156.7355504 ]
[ 145.38910249]
[ 79.51339391]
[ 258.86694584]
[ 116.72379857]
[-195.85925799]
[ 23.90121611]
[ -60.97623923]
[ 79.17512777]
[ 45.77911642]
[ -17.35633328]
[ -7.56084319]]
```

Similar to PCR, the graph shows the cross-validated MSE for PLS models with different numbers of components. The red dashed line indicates the best number of components chosen by `GridSearchCV`.
In summary,
* PCR focuses on capturing the maximum variance in the predictors, while PLS also considers the relationship between the predictors and the response.
* The optimal number of components is typically chosen through cross-validation to balance model complexity and predictive performance.
## Regression in High Dimensions
High-dimensional datasets, where the number of predictors $p$ can vastly exceed the number of observations $n$, pose distinct challenges for linear regression:
* **Overfitting:** Ordinary least squares tends to overfit the data, resulting in poor performance on new data.
* **Multicollinearity:** Strong correlations among predictors make it difficult to identify the true drivers of the response variable, leading to unstable coefficient estimates.
* **Curse of Dimensionality:** Adding more features can degrade model performance unless they are genuinely informative about the response.
* **Inadequacy of Classical Validation:** Traditional model selection metrics like $C_p$, AIC, BIC, and adjusted $R^2$ may not be reliable when $p$ is large relative to $n$.
To address these challenges, specialized techniques have been developed:
1. **Regularization:** Methods like ridge regression and lasso introduce penalties on the magnitude of coefficients, preventing overfitting and improving generalization.
2. **Feature Selection:** Techniques like forward stepwise selection aim to identify a relevant subset of predictors that strongly influence the response variable.
3. **Dimension Reduction:** Approaches such as PCR and PLS create a smaller set of derived predictors that capture the most important information in the original feature space.
Careful selection of tuning parameters through cross-validation, assessing model performance on independent validation sets, and interpreting results with caution are crucial in high-dimensional regression problems.
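As a small illustration of the overfitting issue, the sketch below compares least squares and ridge regression when $p > n$ on synthetic data (all settings are arbitrary):
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 60, 100                                  # more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                  # only 5 predictors truly matter
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("Least squares", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    train_mse = np.mean((y_tr - model.predict(X_tr)) ** 2)
    test_mse = np.mean((y_te - model.predict(X_te)) ** 2)
    print(f"{name}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```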
## Reference
* Chapters 3 and 6 in [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)