# Linear Regression, Model Selection, and Regularization

## Introduction

Linear regression is a fundamental statistical technique for modeling linear relationships between a continuous dependent variable and one or more independent variables. It is widely used across diverse domains to analyze these relationships and predict how changes in the independent variables will affect the dependent variable.

## Simple Linear Regression

The simple linear regression model with a single independent variable $X$ and a dependent variable $Y$ is:
$$
Y = \beta_0 + \beta_1 X + \epsilon
$$
where $\epsilon$ represents the random error term.

### Estimating the Coefficients (Least Squares Method)

Given a dataset of $n$ observations $\{(x_i, y_i)\}_{i=1}^n$, we estimate the best-fit line by finding coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the **residual sum of squares (RSS)**:
$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.
$$
The optimal values for the coefficients are:
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$
where $\bar x$ and $\bar y$ are the sample means of the predictor and response variables, respectively. This method is called the *least squares method*.

### Example

We use the `sm.OLS` function from the statsmodels library to fit a simple linear regression model, where the response variable is `medv` and the predictor is `lstat` from the Boston housing dataset.

```python=
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize

Boston = load_data("Boston")
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]),
                  'lstat': Boston['lstat']})
y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
```

The output shows the estimated coefficients and their statistical significance:

```python
              coef  std err        t  P>|t|
intercept  34.5538    0.563   61.415    0.0
lstat      -0.9500    0.039  -24.528    0.0
```

The `get_prediction` method can be used to obtain predictions, confidence intervals, and prediction intervals for new values of `lstat`.

```python=+
new_df = pd.DataFrame({'lstat': [5, 10, 15]})
newX = MS(['lstat']).fit_transform(new_df)
new_predictions = results.get_prediction(newX)
print("Predicted values:", new_predictions.predicted_mean)
print("95% Confidence Intervals:", new_predictions.conf_int(alpha=0.05))
print("95% Prediction Intervals:", new_predictions.conf_int(obs=True, alpha=0.05))
```

```python
Predicted values: [29.80359411 25.05334734 20.30310057]
95% Confidence Intervals:
[[29.00741194 30.59977628]
 [24.47413202 25.63256267]
 [19.73158815 20.87461299]]
95% Prediction Intervals:
[[17.56567478 42.04151344]
 [12.82762635 37.27906833]
 [ 8.0777421  32.52845905]]
```

## Multiple Linear Regression

Multiple linear regression extends the concept to handle multiple independent variables:
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon,
$$
where:

* $X_j$ represents the $j$-th independent variable
* $\beta_j$ quantifies the association between that variable and the response, holding the other variables constant
* $\epsilon$ is the random error term

### Estimating the Coefficients (Least Squares Method)

Given estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$, predictions are made using:
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \dots + \hat \beta_p x_p.
$$

The coefficients are estimated by minimizing the RSS:
$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2
$$
using the least squares method.

### Example

In this example, we fit a multiple linear regression model to the Boston housing data, using all the variables except `medv` as predictors and `medv` as the response.

```python=
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize
import statsmodels.api as sm

Boston = load_data("Boston")

# Exclude 'medv' from the predictors
terms = Boston.columns.drop('medv')

# Construct the design matrix X
X = MS(terms).fit_transform(Boston)

# Set 'medv' as the response variable
y = Boston['medv']

# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()

# Summarize the results
summarize(results)
```

The output shows the estimated coefficients and their statistical significance:

```python
              coef  std err        t  P>|t|
intercept  41.6173    4.936    8.431  0.000
crim       -0.1214    0.033   -3.678  0.000
zn          0.0470    0.014    3.384  0.001
indus       0.0135    0.062    0.217  0.829
chas        2.8400    0.870    3.264  0.001
nox       -18.7580    3.851   -4.870  0.000
rm          3.6581    0.420    8.705  0.000
age         0.0036    0.013    0.271  0.787
dis        -1.4908    0.202   -7.394  0.000
rad         0.2894    0.067    4.325  0.000
tax        -0.0127    0.004   -3.337  0.001
ptratio    -0.9375    0.132   -7.091  0.000
lstat      -0.5520    0.051  -10.897  0.000
```

To perform a regression using all variables except `age`, we can use the following code:

```python=+
minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())
```

```python
              coef  std err        t  P>|t|
intercept  41.5251    4.920    8.441  0.000
crim       -0.1214    0.033   -3.683  0.000
zn          0.0465    0.014    3.379  0.001
indus       0.0135    0.062    0.217  0.829
chas        2.8528    0.868    3.287  0.001
nox       -18.4851    3.714   -4.978  0.000
rm          3.6811    0.411    8.951  0.000
dis        -1.5068    0.193   -7.825  0.000
rad         0.2879    0.067    4.322  0.000
tax        -0.0127    0.004   -3.333  0.001
ptratio    -0.9346    0.132   -7.099  0.000
lstat      -0.5474    0.048  -11.483  0.000
```

## Assessing Model Accuracy

Key statistics for evaluating the performance of a linear regression model:

* **F-statistic:** $F = \frac{(\text{TSS} - \text{RSS}) / p}{\text{RSS} / (n - p - 1)}$
    * A high $F$-statistic indicates that at least one independent variable has a significant effect on the dependent variable.
* **R-squared ($R^2$):** $R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$
    * Measures the proportion of variability in the dependent variable $Y$ that is explained by the model.
    * A value closer to 1 indicates a better fit.
* **Residual Standard Error (RSE):** $\text{RSE} = \sqrt{\frac{1}{n - p - 1} \text{RSS}}$
    * Measures the average amount that the response deviates from the true regression line.
    * A smaller RSE indicates a better fit.

where

* $n$ is the number of observations
* $p$ is the number of predictors
* TSS (Total Sum of Squares) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$, the total variability in the dependent variable
* RSS (Residual Sum of Squares) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, the variability unexplained by the model
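These quantities can be read directly off a fitted statsmodels results object, or recomputed from the formulas above. Below is a minimal sketch, assuming `results` is the multiple-regression fit on the Boston data from the example above (`nobs`, `df_model`, `ssr`, `centered_tss`, `fvalue`, and `rsquared` are standard statsmodels attributes):

```python
import numpy as np

# Pull the ingredients of the formulas above from the fitted model
n = int(results.nobs)          # number of observations
p = int(results.df_model)      # number of predictors
RSS = results.ssr              # residual sum of squares
TSS = results.centered_tss     # total sum of squares

F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
R2 = 1 - RSS / TSS
RSE = np.sqrt(RSS / (n - p - 1))

print(f"F-statistic: {F:.2f} (statsmodels: {results.fvalue:.2f})")
print(f"R-squared:   {R2:.3f} (statsmodels: {results.rsquared:.3f})")
print(f"RSE:         {RSE:.3f}")
```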
## Beyond the Basics: Model Selection and Regularization

While the standard linear model serves as a useful baseline, we can often improve its performance and interpretability through model selection and regularization methods. These techniques are important for:

* **Model Interpretability:** Simpler models with only the most relevant variables are easier to understand and explain.
* **Prediction Accuracy:** Carefully controlling model complexity can prevent overfitting and lead to better predictions on new, unseen data.

### Key Techniques

1. **Subset Selection:** Identifying the most important subset of predictors related to the response variable, using methods such as best subset selection and stepwise selection.
2. **Shrinkage (Regularization):** Retaining all predictors but reducing the magnitude of their coefficients towards zero, using methods like Ridge Regression and the Lasso.
3. **Dimension Reduction:** Transforming the predictors into a lower-dimensional space by creating linear combinations (projections) of the original variables, as in *Principal Components Regression (PCR)* and *Partial Least Squares (PLS)*.

## Subset Selection

In linear regression with multiple variables, subset selection techniques help identify the most important subset of predictors.

### Best Subset Selection

* Exhaustively evaluates all possible combinations of predictors to find the theoretically optimal model.
* Use cross-validation or a validation set to prevent overfitting and select a model that generalizes well to unseen data.

:::success
<font color=blue>Algorithm: Best Subset Selection</font>

**Input:**
- Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
- Evaluation metric (e.g., cross-validation error, AIC, BIC)

**Output:** The "best" subset of predictors according to the chosen evaluation metric

**Procedure:**
1. **Initialize:**
    - Let `M_0` be the null model (no predictors).
    - Set `best_model` = `M_0`
    - Set `best_error` = Evaluate(`M_0`) (using the chosen metric)
2. **Iterate over subset sizes (k):**
    - For k = 1 to p:
        - Generate all possible subsets: create all combinations of `k` predictors from the total `p` predictors.
        - Fit and evaluate:
            - For each subset `S` of size `k`:
                - Fit the linear regression model using the predictors in `S`.
                - Calculate `error` = Evaluate(model using `S`)
                - Update best: if `error` < `best_error`:
                    - `best_model` = model using `S`
                    - `best_error` = `error`
        - Store the current `best_model` as `M_k`
3. **Select the best:**
    * Return `best_model` (which will be one of the stored `M_k` models)
:::

#### Key Points

* **Exhaustive Search:** This algorithm explores all possible combinations of predictors, making it computationally expensive for large `p`.
* **Evaluation Metric:** The choice of metric is crucial. Cross-validation error is often preferred to prevent overfitting; AIC and BIC are alternatives that penalize model complexity.
* **Model Selection:** The algorithm doesn't just identify the best subset; it also fits the corresponding regression model for you.
* **Practical Considerations:** For large datasets, you might consider stepwise selection or regularization methods (like the Lasso or ridge regression) as computationally efficient alternatives.
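As a concrete (if brute-force) illustration of the algorithm above, the sketch below runs an exhaustive search over a small synthetic dataset, scoring every candidate model by AIC. The toy data and variable names are illustrative only; with many predictors this loop quickly becomes infeasible.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data: only X1 and X3 actually influence the response
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=['X1', 'X2', 'X3', 'X4'])
y = 2 * X['X1'] - 3 * X['X3'] + rng.normal(size=100)

best_aic, best_subset = np.inf, ()
for k in range(1, X.shape[1] + 1):              # subset sizes 1..p
    for subset in combinations(X.columns, k):   # all subsets of size k
        design = sm.add_constant(X[list(subset)])
        aic = sm.OLS(y, design).fit().aic
        if aic < best_aic:
            best_aic, best_subset = aic, subset

print("Best subset:", best_subset, "with AIC:", round(best_aic, 2))
```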
### Stepwise Selection

Computationally efficient alternatives for large datasets:

#### Forward Stepwise Selection

* Begins with no predictors and iteratively adds the most beneficial predictor at each step.

:::success
<font color=blue>Algorithm: Forward Stepwise Selection</font>

**Input:**
* Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
* Evaluation metric (e.g., cross-validation error, AIC, BIC)

**Output:** The "best" subset of predictors according to the chosen evaluation metric

**Procedure:**
1. **Initialize:**
    * Let `M_0` be the null model (no predictors).
    * Set `remaining_predictors` = all predictors (`X_1` to `X_p`)
2. **Iterate, adding one predictor at a time:**
    * While `remaining_predictors` is not empty:
        * **Find the best predictor to add:**
            * For each predictor `X_j` in `remaining_predictors`:
                * Fit the model obtained by adding `X_j` to the current model.
                * Calculate `error` = Evaluate(model with `X_j` added)
            * Select the predictor `X_best` that results in the lowest `error`.
        * **Update the model and remaining predictors:**
            * Add `X_best` to the current model.
            * Remove `X_best` from `remaining_predictors`.
            * Store the current model as `M_{k+1}` (where k is the current number of predictors in the model).
3. **Select the best:**
    * Choose the best model among all `M_k` models using the evaluation metric.
:::

##### Key Points

* **Greedy Algorithm:** Forward stepwise selection is a greedy algorithm. It adds predictors one at a time, choosing the one that improves the model the most at each step. This doesn't guarantee finding the absolute best subset but is computationally much cheaper than best subset selection.
* **Evaluation Metric:** Just like best subset selection, the choice of metric is crucial for avoiding overfitting.
* **Stopping Criterion:** The algorithm continues until all predictors have been added. In practice, you might use a stopping criterion based on the evaluation metric (e.g., stop when the error no longer decreases significantly).
* **Model Selection:** The algorithm identifies the best subset and the corresponding models at each step. You can then choose the final model based on your evaluation metric or other criteria like interpretability.

#### Backward Stepwise Selection

* Starts with all predictors and iteratively removes the least significant one at each step.
* Can be advantageous if you expect many predictors to be relevant.

:::success
<font color=blue>Algorithm: Backward Stepwise Selection</font>

**Input:**
* Dataset with predictors $X = (X_1, \dots, X_p)$ and response $Y$
* Evaluation metric (e.g., cross-validation error, AIC, BIC)

**Output:** The "best" subset of predictors according to the chosen evaluation metric

**Procedure:**
1. **Initialize:**
    * Let `M_p` be the full model (all predictors).
    * Set `current_predictors` = all predictors (`X_1` to `X_p`)
2. **Iterate, removing one predictor at a time:**
    * While `current_predictors` is not empty:
        * **Find the best predictor to remove:**
            * For each predictor `X_j` in `current_predictors`:
                * Fit the model obtained by removing `X_j` from the current model.
                * Calculate `error` = Evaluate(model without `X_j`)
            * Select the predictor `X_worst` whose removal results in the lowest `error` (or the least degradation of model performance).
        * **Update the model and predictors:**
            * Remove `X_worst` from the current model.
            * Remove `X_worst` from `current_predictors`.
            * Store the current model as `M_{k-1}` (where k is the current number of predictors in the model).
3. **Select the best:**
    * Choose the best model among all `M_k` models using the evaluation metric.
:::

##### Key Points

* **Greedy Algorithm:** Like forward stepwise selection, backward stepwise selection is a greedy algorithm. It starts with the full model and removes predictors one by one, aiming to improve the model at each step.
* **Evaluation Metric:** The choice of metric is important to guide the predictor removal process and avoid overfitting.
* **Stopping Criterion:** The algorithm runs until no predictors are left. In practice, you can use a stopping criterion based on the evaluation metric (e.g., stop when removing a predictor significantly worsens the error).
* **Model Selection:** The algorithm keeps track of the best model at each step. You can then choose the final model based on the evaluation metric or consider a trade-off between model performance and complexity.
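For comparison, scikit-learn ships a greedy stepwise selector, `SequentialFeatureSelector`, which scores candidate additions or removals by cross-validation rather than by AIC/BIC/$C_p$ and stops at a prespecified number of features. A hedged sketch on the same kind of toy data as above (the data and settings are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Toy data: only X1 and X3 actually influence the response
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=['X1', 'X2', 'X3', 'X4'])
y = 2 * X['X1'] - 3 * X['X3'] + rng.normal(size=100)

for direction in ('forward', 'backward'):
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=2,  # stop at 2 predictors
                                    direction=direction,
                                    scoring='neg_mean_squared_error',
                                    cv=5)
    sfs.fit(X, y)
    print(f"{direction} stepwise kept:", list(X.columns[sfs.get_support()]))
```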
### Model Selection Criteria

To prevent overfitting, we use criteria that balance goodness of fit with model complexity:

* **$C_p$, AIC, BIC:** Lower values generally indicate better models. They penalize complexity with different strengths (BIC has the strongest penalty).
    * $C_p = \frac{1}{n}(\text{RSS} + 2d\hat{\sigma}^2)$
    * $\text{AIC} = \frac{1}{n \hat{\sigma}^2}(\text{RSS} + 2d\hat{\sigma}^2)$
    * $\text{BIC} = \frac{1}{n \hat{\sigma}^2} (\text{RSS}+\log(n)d\hat{\sigma}^2)$

    Here $d$ is the number of predictors in the model and $\hat{\sigma}^2$ is the estimated variance of the error term.
* **Adjusted $R^2$:** Also penalizes complexity. Higher values (up to 1) are better, indicating that the model explains a larger proportion of the variance in the response variable.
    * $\text{Adjusted} \ R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}$

### Example: Implementing Subset Selection Techniques

To begin, we import the necessary libraries and tools that facilitate data manipulation, statistical analysis, and model fitting.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.api import OLS
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler

# ISLP provides the datasets and model-building helpers used below
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise, sklearn_selected

from functools import partial
from l0bnb import fit_path
```

#### Forward Selection using $C_p$

We use the `Hitters` dataset, which contains statistics of baseball players, to predict players' salaries. The initial step involves data cleaning to ensure a robust dataset.

```python=
Hitters = load_data('Hitters').dropna()
print(f"Dataset shape: {Hitters.shape}")
```

```python
Dataset shape: (263, 20)
```

Once the dataset is prepared, defining a scoring function based on the *negative $C_p$* statistic helps evaluate model performance, considering both fit and complexity.

```python=+
def nCp(sigma2, estimator, X, Y):
    """Negative Cp statistic: larger values indicate better models."""
    n, p = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * p * sigma2) / n
```

Preparing the dataset involves separating the target variable `Salary` from the predictors and estimating the error variance $\sigma^2$.

```python=+
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
sigma2 = OLS(Y, X).fit().scale
```

Forward stepwise selection is applied, selecting predictors that incrementally improve the model's performance.

```python=+
neg_Cp = partial(nCp, sigma2)
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))
hitters_Cp = sklearn_selected(OLS, strategy, scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_
```

```python
('Assists', 'AtBat', 'CAtBat', 'CRBI', 'CRuns', 'CWalks',
 'Division', 'Hits', 'PutOuts', 'Walks')
```

The result contains 10 of the 19 available predictors.

#### Best Subset Selection with `l0bnb`

The `l0bnb` package enables finding the optimal subset of predictors. This approach generates a series of models, identifying the one that achieves a balance between complexity and accuracy.
```python=+
from l0bnb import fit_path

D = design.fit_transform(Hitters).drop('intercept', axis=1)
X = np.asarray(D)
path = fit_path(X, Y, max_nonzeros=X.shape[1])
print(path[3])
```

```python
{'B': array([0.      , 3.254844, 0.      , 0.      , 0.      , 0.      ,
       0.      , 0.      , 0.      , 0.      , 0.      , 0.677753,
       0.      , 0.      , 0.      , 0.      , 0.      , 0.      ,
       0.      ]),
 'B0': -38.98216739555494,
 'lambda_0': 0.011416248027450194,
 'M': 0.5829861733382011,
 'Time_exceeded': False}
```

`path[3]` is the fourth model along the regularization path. The `B` array shows the coefficient values for each predictor, with 0 indicating that the predictor is not included in the model; here only two coefficients are non-zero (corresponding to `Hits` and `CRBI`). To determine the best model, you would evaluate the performance (e.g., using cross-validation) of models with different numbers of non-zero coefficients and select the one that balances goodness of fit and complexity.

## Shrinkage Methods

In linear regression with many predictors, especially when some are correlated, shrinkage methods are essential to prevent overfitting and improve model generalization.

### Ridge Regression

Ridge regression introduces a penalty on the size of the coefficients, minimizing
$$
\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
$$

* As $\lambda$ increases, the coefficients shrink towards zero. When $\lambda=0$, ridge regression becomes equivalent to least squares.
* Ridge regression has a *constrained optimization form* that highlights restricting the sum of the squared coefficients:
$$\min_{\beta}{\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2}, \text{ subject to } \sum_{j=1}^{p} \beta_j^2 \le s$$

### The Lasso

The Lasso minimizes:
$$
\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
$$

* The Lasso's L1 penalty leads to sparse solutions, setting some coefficients to *exactly zero*, making it useful for feature selection.
* The Lasso has a constrained optimization form that highlights restricting the sum of the absolute values of the coefficients:
$$\min_{\beta}{\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2}, \text{ subject to } \sum_{j=1}^{p} |\beta_j|\le s.$$

### Lasso vs. Ridge Regression

* The Lasso yields sparser, more interpretable models and can offer superior predictive accuracy when only a few features are truly important; ridge regression tends to do better when many predictors contribute comparably.
* Standardizing the features is often recommended before applying shrinkage methods.
* Cross-validation is crucial for selecting the optimal tuning parameter $\lambda$ that balances bias and variance.

#### Visualizing the Difference

![image](https://hackmd.io/_uploads/Hkr_VnHa6.png)

Contours of the error (RSS) and constraint functions for the lasso (left) and ridge regression (right).

* The solid blue areas represent the constraint regions ($\sum_{j=1}^{p} |\beta_j|\le s$ for the lasso, $\sum_{j=1}^{p} \beta_j^2 \le s$ for ridge), while the red ellipses are the contours of the RSS.
* The lasso constraint has corners, which allows coefficients to be set to exactly zero when the solution is at a corner. In contrast, the ridge constraint is smooth, so coefficients are shrunk but not set to zero.
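A small synthetic demonstration of this difference (illustrative only; the data and the penalty strength `alpha=1.0` are arbitrary choices): with the same penalty level, the lasso zeroes out the irrelevant coefficients while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first 3 of 20 predictors matter
rng = np.random.default_rng(1)
n, p = 100, 20
X_syn = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]
y_syn = X_syn @ beta + rng.normal(size=n)

Xs_syn = StandardScaler().fit_transform(X_syn)  # standardize before shrinking

ridge = Ridge(alpha=1.0).fit(Xs_syn, y_syn)
lasso = Lasso(alpha=1.0).fit(Xs_syn, y_syn)

print("Coefficients set exactly to zero by ridge:", int(np.sum(ridge.coef_ == 0)))
print("Coefficients set exactly to zero by lasso:", int(np.sum(lasso.coef_ == 0)))
```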
### Example: Implementing Ridge Regression and Lasso

In this section, we explore the implementation of Ridge Regression and the Lasso using Python and the scikit-learn library. These techniques are particularly useful when dealing with multicollinearity among predictors or when the number of predictors is large.

#### Ridge Regression

We demonstrate how to apply Ridge Regression over a range of $\lambda$ values to identify the optimal setting.

```python=
# Standardize the predictors
Xs = (X - X.mean(0)) / X.std(0)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()

# Perform Ridge Regression using ElasticNet with l1_ratio set to 0
soln_array = skl.ElasticNet.path(Xs, Y, l1_ratio=0., alphas=lambdas)[1]

# Examine the shape of the solution array
print(soln_array.shape)
```

```python
(19, 100)
```

The `soln_array` is a matrix with 19 rows (one for each predictor) and 100 columns (one for each value of $\lambda$).

```python=+
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path
```

```python
                            AtBat        Hits      HmRun       Runs        RBI       Walks      Years      CAtBat      CHits     CHmRun       CRuns        CRBI      CWalks  League[N]  Division[W]    PutOuts    Assists     Errors  NewLeague[N]
negative log(lambda)
-12.310855               0.000800    0.000889   0.000695   0.000851   0.000911    0.000900   0.000812    0.001067   0.001113   0.001064    0.001141    0.001149    0.000993  -0.000029    -0.000390   0.000609   0.000052  -0.000011     -0.000006
-12.078271               0.001010    0.001122   0.000878   0.001074   0.001150    0.001135   0.001025    0.001346   0.001404   0.001343    0.001439    0.001450    0.001253  -0.000037    -0.000492   0.000769   0.000065  -0.000014     -0.000007
-11.845686               0.001274    0.001416   0.001107   0.001355   0.001451    0.001433   0.001293    0.001698   0.001772   0.001694    0.001816    0.001830    0.001581  -0.000046    -0.000621   0.000970   0.000082  -0.000017     -0.000009
...                           ...         ...        ...        ...        ...         ...        ...         ...        ...        ...         ...         ...         ...        ...          ...        ...        ...        ...           ...
10.482412             -290.958537  337.377455  37.587122 -60.158256 -26.750044  134.964477 -16.954081 -389.423414  88.164178 -13.219329  478.398404  258.967059 -213.505412  31.253747   -58.435983  78.761230  53.677759 -22.184893    -12.376191
10.714996             -290.986528  337.470648  37.642077 -60.243522 -26.800522  134.987027 -16.900054 -389.760135  87.864551 -13.416889  478.881540  259.321007 -213.584869  31.252760   -58.431454  78.761235  53.689152 -22.179964    -12.370587
```

Visualizing the path of coefficients as $\lambda$ changes helps in understanding the effect of regularization:

```python=+
path_fig, ax = plt.subplots(figsize=(8, 8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel(r'$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');

# Select a specific lambda value and show the corresponding coefficients
beta_hat = soln_path.loc[soln_path.index[39]]
print("Lambda:", lambdas[39])
print("Coefficients at this lambda:")
print(beta_hat)
```

```python
Lambda: 25.53538897200662
Coefficients at this lambda:
AtBat           5.433750
Hits            6.223582
HmRun           4.585498
Runs            5.880855
RBI             6.195921
Walks           6.277975
Years           5.299767
CAtBat          7.147501
CHits           7.539495
CHmRun          7.182344
CRuns           7.728649
CRBI            7.790702
CWalks          6.592901
League[N]       0.042445
Division[W]    -3.107159
PutOuts         4.605263
Assists         0.378371
Errors         -0.135196
NewLeague[N]    0.150323
Name: -3.240065292879872, dtype: float64
```

![下載 (19)](https://hackmd.io/_uploads/BkcqpRMeA.png)
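The path above does not by itself tell us which $\lambda$ to use. One option (not shown in the original example) is to let scikit-learn choose $\lambda$ by cross-validation with `RidgeCV`; this is a sketch only, and it assumes `X`, `Y`, `lambdas`, and the `skl` alias from the previous blocks are still in scope.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Choose the ridge tuning parameter over the same grid of lambdas by 5-fold CV
ridgeCV = skl.RidgeCV(alphas=lambdas, cv=5)
pipe_ridge = Pipeline(steps=[('scaler', StandardScaler()),
                             ('ridge', ridgeCV)])
pipe_ridge.fit(X, Y)

tuned_ridge = pipe_ridge.named_steps['ridge']
print("Optimal lambda for Ridge:", tuned_ridge.alpha_)
# Unlike the lasso below, ridge leaves essentially every coefficient non-zero
print("Coefficients set exactly to zero:", int(np.sum(tuned_ridge.coef_ == 0)))
```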
#### The Lasso

The Lasso performs variable selection by setting some coefficients to exactly zero. We apply the Lasso using cross-validation to select the optimal $\lambda$.

```python=
from sklearn.pipeline import Pipeline

# Set up the Lasso model with 5-fold cross-validation
lassoCV = skl.ElasticNetCV(n_alphas=100, l1_ratio=1, cv=5)
pipeCV = Pipeline(steps=[('scaler', StandardScaler()),
                         ('lasso', lassoCV)])
pipeCV.fit(X, Y)

# Optimal lambda value
tuned_lasso = pipeCV.named_steps['lasso']
optimal_lambda = tuned_lasso.alpha_
print("Optimal lambda for Lasso:", optimal_lambda)
```

```python
Optimal lambda for Lasso: 3.1472370031649866
```

Visualizing the Lasso coefficient paths:

```python=+
# Compute the Lasso path
lambdas, soln_array = skl.Lasso.path(Xs, Y,
                                     alphas=np.linspace(0.001, optimal_lambda, 100))[:2]

# Plot the coefficient trajectories
plt.figure(figsize=(8, 8))
plt.plot(-np.log10(lambdas), soln_array.T)
plt.xlabel('-log10(lambda)')
plt.ylabel('Standardized Coefficients')
plt.title('Coefficient Paths - Lasso')
plt.legend(D.columns, loc='upper right')
plt.show()
```

![image](https://hackmd.io/_uploads/H1wGQI_xR.png)

As $\lambda$ increases, some coefficients are set to exactly zero, performing variable selection.

```python=+
print("Minimum Mean Squared Error:", np.min(tuned_lasso.mse_path_.mean(1)))
print("Coefficients at optimal lambda:")
print(tuned_lasso.coef_)
```

```python
Minimum Mean Squared Error: 114690.73118253653
Coefficients at optimal lambda:
[-210.01008773  243.4550306     0.            0.            0.
   97.69397357  -41.52283116   -0.            0.           39.62298193
  205.75273856  124.55456561 -126.29986768   15.70262427  -59.50157967
   75.24590036   21.62698014  -12.04423675   -0.        ]
```

#### Cross-Validation Performance

Evaluating model performance through cross-validation:

```python=
# Cross-validation curve for the Lasso
mse_path = pipeCV.named_steps['lasso'].mse_path_
mean_mse = np.mean(mse_path, axis=1)
opt_index = np.argmin(mean_mse)
opt_lambda = pipeCV.named_steps['lasso'].alphas_[opt_index]

plt.figure(figsize=(8, 6))
plt.plot(-np.log10(pipeCV.named_steps['lasso'].alphas_), mean_mse)
plt.axvline(-np.log10(opt_lambda), linestyle='--', color='k', label='Optimal lambda')
plt.xlabel('-log10(lambda)')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation')
plt.legend()
plt.show()
```

![下載 (1)](https://hackmd.io/_uploads/ry0U49tyR.png)

```python=+
print(tuned_lasso.coef_)
```

```python
array([-210.01008773,  243.4550306 ,    0.        ,    0.        ,
          0.        ,   97.69397357,  -41.52283116,   -0.        ,
          0.        ,   39.62298193,  205.75273856,  124.55456561,
       -126.29986768,   15.70262427,  -59.50157967,   75.24590036,
         21.62698014,  -12.04423675,   -0.        ])
```

## Dimension Reduction Methods

When facing high-dimensional data in linear regression, dimension reduction is essential:

1. **Constructing New Predictors:**
    * Transform the original predictors $(X_1, X_2, ..., X_p)$ into a smaller set $(Z_1, Z_2, ..., Z_M)$, where $M < p$.
    * Each new predictor is a linear combination: $Z_m = \sum_{j=1}^{p}\phi_{jm}X_j$
2. **Fitting the Model:** Fit a standard linear regression using the new predictors: $y_i=\theta_0+\sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i$

### Principal Components Regression (PCR)

* Uses **Principal Component Analysis (PCA)** to create orthogonal components for the regression model.
* The first component identifies the direction that explains the most variance in the data.
* Subsequent components maximize the remaining variance while being orthogonal to previous components.
* PCR performs linear regression using a subset of these components as predictors.

### Partial Least Squares (PLS)

* PLS considers both the predictors and the response variable when constructing new features, making it useful when predictors are highly collinear.
* Each new component is chosen to maximize the covariance between the response and the component, while being orthogonal to previous components.
* PLS prioritizes features that explain variance in the predictors and have a strong relationship with the response.

The optimal number of components for PCR and PLS is typically chosen through cross-validation.

### Examples of Dimension Reduction Techniques

#### Principal Components Regression

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import sklearn.model_selection as skm
import sklearn.dummy as skd
```

```python=
# Cross-validation splitter used throughout this example
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)

# Configure the pipeline with PCA and linear regression
pca = PCA()  # Let GridSearchCV choose the number of components
linreg = LinearRegression()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', pca),
    ('linreg', linreg)
])

# Use GridSearchCV to choose the number of components
param_grid = {'pca__n_components': range(1, X.shape[1])}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)

# Best model and number of components
best_model = grid.best_estimator_
best_n_components = grid.best_params_['pca__n_components']
print(f"Best number of components: {best_n_components}")
print(f"Coefficients from PCR: {best_model.named_steps['linreg'].coef_}")

# Visualize the cross-validated MSE for different numbers of components
plt.figure(figsize=(8, 8))
plt.errorbar(param_grid['pca__n_components'],
             -grid.cv_results_['mean_test_score'],
             yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
plt.ylabel('Cross-validated MSE')
plt.xlabel('Number of Principal Components')
plt.xticks(param_grid['pca__n_components'])
plt.axvline(best_n_components, color='red', linestyle='--',
            label=f'Best: {best_n_components} components')
plt.legend()
plt.show()
```

```python
Best number of components: 17
Coefficients from PCR: [ 106.36859204  -21.60350456   24.2942534   -36.9858579   -58.41402748
   62.20632652   24.63862038   15.82817701   29.57680773   99.64801199
  -30.11209105   20.99269291   72.40210574 -276.68551696  -74.17098665
  422.72580227 -347.05662353]
```

![下載 (21)](https://hackmd.io/_uploads/H1L3beXlC.png)

The graph shows the cross-validated MSE for PCR models with different numbers of principal components. The red dashed line indicates the best number of components chosen by `GridSearchCV`.
To provide context, we can compare the performance of PCR to a null model that always predicts the mean of the response variable:

```python=+
# Initialize a dummy regressor to represent the null model
null_model = skd.DummyRegressor(strategy="mean")

# Cross-validate the null model
cv_results_null = skm.cross_validate(null_model,
                                     X,  # The null model ignores the features, so X can be passed directly
                                     Y,
                                     cv=kfold,
                                     scoring='neg_mean_squared_error')
print(cv_results_null)

# Calculate and print the cross-validated MSE of both models
mse_null = -np.mean(cv_results_null['test_score'])
print(f"Cross-validated MSE of the Null Model: {mse_null:.2f}")
print(f"Cross-validated MSE of the Best PCR Model: {-grid.best_score_:.2f}")
```

```python
{'fit_time': array([0.01192951, 0.00453162, 0.00129294, 0.00115967, 0.00858212]),
 'score_time': array([0.00078082, 0.00064325, 0.00054502, 0.00050497, 0.00090194]),
 'test_score': array([-119988.92291156, -331360.53801348, -225425.91968383,
       -202274.93309837, -149185.5382921 ])}
Cross-validated MSE of the Null Model: 205647.17
Cross-validated MSE of the Best PCR Model: 116222.02
```

The PCR model achieves a much lower MSE compared to the null model, indicating that the principal components are capturing useful information for predicting the response.

After fitting the PCA model, we can examine the proportion of variance explained by each principal component:

```python=+
# 'best_model' is the best PCR pipeline found by GridSearchCV
explained_variance_ratios = best_model.named_steps['pca'].explained_variance_ratio_
print("Explained Variance Ratios:")
for i, ratio in enumerate(explained_variance_ratios, start=1):
    print(f"PC{i}: {ratio:.3f}")
```

```python
Explained Variance Ratios:
PC1: 0.383
PC2: 0.218
PC3: 0.107
PC4: 0.082
PC5: 0.053
PC6: 0.043
PC7: 0.036
PC8: 0.027
PC9: 0.013
PC10: 0.010
PC11: 0.007
PC12: 0.007
PC13: 0.005
PC14: 0.003
PC15: 0.003
PC16: 0.001
PC17: 0.001
```

#### Partial Least Squares (PLS)

```python=+
from sklearn.cross_decomposition import PLSRegression

# Configure PLS
pls = PLSRegression()

# Optimize the number of components using GridSearchCV
param_grid = {'n_components': range(1, X.shape[1])}
grid = GridSearchCV(pls,
                    param_grid,
                    cv=kfold,
                    scoring='neg_mean_squared_error')
grid.fit(X, Y)

# Best model and number of components
best_model = grid.best_estimator_
best_n_components = grid.best_params_['n_components']
print(f"Best number of components: {best_n_components}")
print(f"Coefficients from PLS: {best_model.coef_}")

# Visualize the MSE across different component counts
plt.figure(figsize=(8, 8))
plt.errorbar(param_grid['n_components'],
             -grid.cv_results_['mean_test_score'],
             yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
plt.ylabel('Cross-validated MSE')
plt.xlabel('Number of PLS Components')
plt.xticks(param_grid['n_components'])
plt.axvline(best_n_components, color='red', linestyle='--',
            label=f'Best: {best_n_components} components')
plt.legend()
plt.show()
```

```python
Best number of components: 12
Coefficients from PLS: [[-343.21483428]
 [ 372.88649473]
 [  25.36751813]
 [ -30.4196112 ]
 [ -28.80370164]
 [ 135.61719771]
 [ -45.41307594]
 [-156.7355504 ]
 [ 145.38910249]
 [  79.51339391]
 [ 258.86694584]
 [ 116.72379857]
 [-195.85925799]
 [  23.90121611]
 [ -60.97623923]
 [  79.17512777]
 [  45.77911642]
 [ -17.35633328]
 [  -7.56084319]]
```

![image](https://hackmd.io/_uploads/ByHczlQlR.png)

Similar to PCR, the graph shows the cross-validated MSE for PLS models with different numbers of components. The red dashed line indicates the best number of components chosen by `GridSearchCV`.
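For a direct comparison with the PCR and null-model numbers above, the cross-validated MSE of the selected PLS model can be read from the same `grid` object fitted in the previous block (the resulting value depends on the CV splits, so it is not shown here):

```python
# Cross-validated MSE of the best PLS model, for comparison with PCR above
print(f"Cross-validated MSE of the best PLS model: {-grid.best_score_:.2f}")
```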
In summary:

* PCR focuses on capturing the maximum variance in the predictors, while PLS also considers the relationship between the predictors and the response.
* The optimal number of components is typically chosen through cross-validation to balance model complexity and predictive performance.

## Regression in High Dimensions

High-dimensional datasets, where the number of predictors $p$ can vastly exceed the number of observations $n$, pose distinct challenges for linear regression:

* **Overfitting:** Ordinary least squares tends to overfit the data, resulting in poor performance on new data.
* **Multicollinearity:** Strong correlations among predictors make it difficult to identify the true drivers of the response variable, leading to unstable coefficient estimates.
* **Curse of Dimensionality:** Adding more features can degrade model performance unless they are genuinely informative about the response.
* **Inadequacy of Classical Validation:** Traditional model selection metrics like $C_p$, AIC, BIC, and adjusted $R^2$ may not be reliable when $p$ is large relative to $n$.

To address these challenges, specialized techniques have been developed:

1. **Regularization:** Methods like ridge regression and the lasso introduce penalties on the magnitude of the coefficients, preventing overfitting and improving generalization.
2. **Feature Selection:** Techniques like forward stepwise selection aim to identify a relevant subset of predictors that strongly influence the response variable.
3. **Dimension Reduction:** Approaches such as PCR and PLS create a smaller set of derived predictors that capture the most important information in the original feature space.

Careful selection of tuning parameters through cross-validation, assessing model performance on independent validation sets, and interpreting results with caution are crucial in high-dimensional regression problems.

## Reference

* Chapters 3 & 6 in [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)