# Beyond Linear Regression: Nonlinear Models

## Overview

Linear regression, while foundational, often doesn't fully capture the intricate relationships within real-world datasets. To address this, advanced regression techniques provide the flexibility to model non-linear patterns, leading to better predictions and deeper insights.

### Key Techniques

* **Polynomial Regression:** Introduces non-linearity by including powers of the predictors in the model.
* **Step Functions:** Partition a predictor into segments for piecewise modeling.
* **Regression Splines:** Fit piecewise polynomials, ensuring smoothness at knots.
* **Smoothing Splines:** Balance fitting the data and maintaining a smooth curve.
* **Local Regression:** Fit models using subsets of data near the target point.
* **Generalized Additive Models (GAMs):** Allow non-linear functions of each predictor, combined additively.

## Polynomial Regression

Polynomial regression models a non-linear relationship between a predictor $X$ and the response $Y$ by introducing powers of the predictor:

$$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \ldots + \beta_d X^d + \epsilon $$

### Key Considerations

* **Degree $d$:** Controls the complexity of the curve. Higher degrees allow for more flexible shapes.
* **Overfitting:** Use caution with high-degree polynomials. They can overfit the training data, harming their ability to generalize to new observations. In practice, it is unusual to use $d$ greater than 3 or 4.

### Example

We start with some of our standard imports.

```python=
import numpy as np, pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (summarize, poly, ModelSpec as MS)
```

Here we use the `Wage` dataset provided with the ISLP package, which contains 3,000 observations.

```python=+
Wage = load_data('Wage')
y = Wage['wage']
age = Wage['age']
poly_age = MS([poly('age', degree=4)]).fit(Wage)
M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```

```
                            coef  std err        t  P>|t|
intercept               111.7036    0.729  153.283  0.000
poly(age, degree=4)[0]  447.0679   39.915   11.201  0.000
poly(age, degree=4)[1] -478.3158   39.915  -11.983  0.000
poly(age, degree=4)[2]  125.5217   39.915    3.145  0.002
poly(age, degree=4)[3]  -77.9112   39.915   -1.952  0.051
```

```python=+
age_grid = np.linspace(age.min(), age.max(), 100)
age_df = pd.DataFrame({'age': age_grid})
```

```python=+
def plot_wage_fit(age_df, basis, title):
    X = basis.transform(Wage)
    Xnew = basis.transform(age_df)
    M = sm.OLS(y, X).fit()
    preds = M.get_prediction(Xnew)
    bands = preds.conf_int(alpha=0.05)  # 95% confidence interval
    fig, ax = subplots(figsize=(8,8))
    ax.scatter(age, y, facecolor='gray', alpha=0.5)
    for val, ls in zip([preds.predicted_mean, bands[:,0], bands[:,1]],
                       ['b', 'r--', 'r--']):
        ax.plot(age_df.values, val, ls, linewidth=3)
    ax.set_title(title, fontsize=20)
    ax.set_xlabel('Age', fontsize=20)
    ax.set_ylabel('Wage', fontsize=20);
    return ax

plot_wage_fit(age_df, poly_age, 'Degree-4 Polynomial');
```

![Degree-4 polynomial fit of wage on age](https://hackmd.io/_uploads/rk34-n2J0.png)
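The `poly()` transform above appears to use an orthogonalized polynomial basis (the identical standard errors are a giveaway), but the fitted curve does not depend on that choice: any basis spanning $age, age^2, age^3, age^4$ gives the same predictions. As a sanity check, here is a minimal sketch, using only `numpy`, `pandas`, and `statsmodels`, that rebuilds the design matrix from raw powers and refits; the column names are purely illustrative.

```python=
# Hypothetical check: fit the same degree-4 polynomial using raw powers of age.
X_raw = pd.DataFrame({f'age^{k}': age**k for k in range(1, 5)})
X_raw = sm.add_constant(X_raw)   # add the intercept column
M_raw = sm.OLS(y, X_raw).fit()

# The coefficients differ from the orthogonalized basis above, but the fitted
# values (and hence the plotted curve) should agree, since both bases span the
# same column space.
print(np.allclose(M_raw.fittedvalues, M.fittedvalues))
```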
## Step Functions

Step functions provide flexibility for modeling non-linear relationships when a single global structure (like a polynomial) isn't suitable. They achieve this by partitioning the range of the predictor $X$ into intervals.

### Model

1. **Define cutpoints:** Select cutpoints $c_1, c_2, \dots, c_K$ that divide the range of $X$.
2. **Create indicator variables:** For each interval, create a variable $C_1(X), C_2(X), \dots, C_K(X)$ taking the value 1 if $X$ falls in that interval, 0 otherwise.
3. **Fit the linear model:** Use these indicator variables as predictors in a linear model:
$$ Y = \beta_0 + \beta_1 C_1(X) + \beta_2 C_2(X) + \beta_3 C_3(X) + \ldots + \beta_K C_K(X) + \epsilon $$

### Considerations

* **Cutpoint placement:** Crucial for a good fit! Ideally, align the cutpoints with underlying changes in the data's behavior.
* **Potential abruptness:** Step functions create piecewise-constant fits. For smooth transitions, consider splines.

### Example

```python=
# Divide age into 4 intervals containing roughly equal numbers of observations
cut_age = pd.qcut(age, 4)
# get_dummies creates the columns of the model matrix for this categorical variable
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
```

```python
                     coef  std err       t  P>|t|
(17.999, 33.75]   94.1584    1.478  63.692    0.0
(33.75, 42.0]    116.6608    1.470  79.385    0.0
(42.0, 51.0]     119.1887    1.416  84.147    0.0
(51.0, 80.0]     116.5717    1.559  74.751    0.0
```

Here `pd.qcut()` automatically picked the cutpoints at the 25%, 50%, and 75% quantiles, and returned a categorical array whose categories are the four intervals.

![Piecewise-constant fit of wage on age](https://hackmd.io/_uploads/BJUcfwOeC.png)

## Basis Functions

Polynomial and piecewise-constant regression, despite their differences, illustrate the power of basis functions to enhance the capabilities of linear models.

### The Concept

1. **Transformation:** We begin with a set of pre-defined functions $b_1(X), b_2(X), \dots, b_K(X)$. These transform the original predictor $X$.
2. **Linear Fitting:** Instead of a direct linear model on $X$, we fit:
$$ Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \beta_3 b_3(X) + \ldots + \beta_K b_K(X) + \epsilon. $$
This model, while linear in the transformed variables, captures non-linearity with respect to the original $X$ (see the sketch below).

### Advantages of Basis Functions

* **Flexibility:** Diverse families of basis functions (polynomials, splines, etc.) allow us to model a wide range of non-linear patterns.
* **Interpretability:** Even after complex transformations, the linear structure in terms of the coefficients aids in interpreting the model's behavior.
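To make the idea concrete, here is a minimal hand-built sketch using only the objects already loaded above: the four interval indicators from `pd.qcut` plus the raw `age` column serve as basis functions $b_1(X), \dots, b_5(X)$, and the fit is still plain OLS. The particular basis (and the variable names `basis_cols`, `basis_fit`) is chosen only for illustration.

```python=
# A hand-built basis: interval indicators C_1(X), ..., C_4(X) plus b_5(X) = X.
basis_cols = pd.get_dummies(pd.qcut(age, 4)).astype(float)  # cast to float for OLS
basis_cols['age'] = age
basis_fit = sm.OLS(y, basis_cols).fit()
summarize(basis_fit)
```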
## Regression Splines

Regression splines offer a powerful way to model non-linear relationships between a predictor $X$ and a response variable. They do this by fitting low-degree polynomials within different regions of $X$.

### Piecewise Polynomial

The core idea is to divide the range of $X$ into segments. A separate low-degree polynomial is fitted within each segment.

#### Cubic Splines

A commonly used and visually appealing choice is the piecewise cubic polynomial. Its general form is:

$$
Y = \begin{cases}
\beta_{01} + \beta_{11}X + \beta_{21}X^2 + \beta_{31}X^3 & \text{if } X < c_1 \\
\beta_{02} + \beta_{12}X + \beta_{22}X^2 + \beta_{32}X^3 & \text{if } c_1 \le X < c_2 \\
\vdots \\
\beta_{0,K+1} + \beta_{1,K+1}X + \beta_{2,K+1}X^2 + \beta_{3,K+1}X^3 & \text{if } c_K \le X
\end{cases}
$$

Each of these polynomial functions can be fit using least squares applied to simple functions of the original predictor.

* **Knots:** The points $c_1, c_2, \dots, c_K$ where the polynomial pieces connect are called knots.

### Ensuring Smoothness

To ensure a visually smooth fit, we impose constraints:

* **Continuity:** The polynomial pieces must meet at the knots.
* **Smoothness:** Typically, we also require the first and second derivatives to match at the knots. This prevents sharp bends.

#### Key Terms

* **Cubic Spline:** A popular choice due to its smoothness at the knots.
* **Natural Spline:** A cubic spline with the added constraint of linear behavior beyond the boundary knots. This can improve its behavior at the edges of the predictor's range.

### Advantages of Splines

* **Flexibility:** Can model a wide variety of non-linear patterns.
* **Smoothness:** Produce more natural-looking curves than step functions.
* **Control:** Knot placement offers control over where the model has more flexibility (more knots in a region = a more flexible fit there).

### Example

Two helper functions construct an appropriate matrix of basis functions for fitting a regression spline: `BSpline` and `bs`.

1. `BSpline`
By default, the B-splines produced are cubic. To change the degree, use the `degree` argument.

```python=
from ISLP.models import bs, ns
from ISLP.transforms import BSpline, NaturalSpline

bs_ = BSpline(internal_knots=[25, 40, 60], intercept=True).fit(age)
bs_age = bs_.transform(age)
print(bs_age.shape)
```

```
(3000, 7)
```

2. `bs`

```python=+
bs_age = MS([bs('age', internal_knots=[25, 40, 60], name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```

```python
               coef  std err      t  P>|t|
intercept   60.4937    9.460  6.394  0.000
bs(age)[0]   3.9805   12.538  0.317  0.751
bs(age)[1]  44.6310    9.626  4.636  0.000
bs(age)[2]  62.8388   10.755  5.843  0.000
bs(age)[3]  55.9908   10.706  5.230  0.000
bs(age)[4]  50.6881   14.402  3.520  0.000
bs(age)[5]  16.6061   19.126  0.868  0.385
```

We could also use the `df` (degrees of freedom) option to specify the complexity of the spline. In the case below, using `degree=0` results in piecewise-constant functions, as in our example with `pd.qcut` above.

```python=+
bs_age0 = MS([bs('age', df=3, degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())
```

```python
                               coef  std err       t  P>|t|
intercept                   94.1584    1.478  63.687    0.0
bs(age, df=3, degree=0)[0]  22.3490    2.152  10.388    0.0
bs(age, df=3, degree=0)[1]  24.8076    2.044  12.137    0.0
bs(age, df=3, degree=0)[2]  22.7814    2.087  10.917    0.0
```

To fit natural splines, we use the `NaturalSpline` transform with the corresponding helper `ns`.

```python=+
ns_age = MS([ns('age', df=5)]).fit(Wage)
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```

```python
                     coef  std err       t  P>|t|
intercept         60.4752    4.708  12.844  0.000
ns(age, df=5)[0]  61.5267    4.709  13.065  0.000
ns(age, df=5)[1]  55.6912    5.717   9.741  0.000
ns(age, df=5)[2]  46.8184    4.948   9.463  0.000
ns(age, df=5)[3]  83.2036   11.918   6.982  0.000
ns(age, df=5)[4]   6.8770    9.484   0.725  0.468
```

```python=+
plot_wage_fit(age_df, ns_age, 'Natural spline, df=5')
```

![Natural spline fit of wage on age, df=5](https://hackmd.io/_uploads/BJZngSxxR.png)

## Smoothing Splines

Smoothing splines address the risk of overfitting by striking a balance between closely fitting the observed data and maintaining a smooth curve.

### How Smoothing Splines Work

A smoothing spline finds the function $g$ that minimizes

$$ \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt \quad (*) $$

where:

* The first term ensures fidelity to the data.
* The second term penalizes roughness: the integral measures the total change in the slope $g'(t)$ of the fitted function.
* Tuning parameter $\lambda$:
    * $\lambda = 0$ favors fitting the data exactly (potential overfitting).
    * $\lambda \to \infty$ favors smoothness (the fit approaches a straight line).

### Choosing the Smoothing Parameter $\lambda$

The optimal value of $\lambda$ is often found using cross-validation. Leave-one-out cross-validation (LOOCV) is especially efficient for smoothing splines thanks to the formula

$$ \text{RSS}_{cv}(\lambda) = \sum_{i=1}^n \left( y_i - \hat g_{\lambda}^{(-i)}(x_i) \right)^2 = \sum_{i=1}^n \left[ \frac{y_i - \hat g_{\lambda}(x_i)}{1 - \{ S_\lambda \}_{ii}} \right]^2, $$

where

* $\hat g_{\lambda}^{(-i)}(x_i)$ is the fitted value at $x_i$ when the $i$-th observation is excluded, and
* $S_\lambda$ is the $n \times n$ smoother matrix such that $\hat g_{\lambda} = S_\lambda y$, with $\hat g_{\lambda}$ the solution to $(*)$.

### Example

A smoothing spline is in fact a special case of a GAM with squared-error loss and a single feature, so here we use GAM software to fit one. `LinearGAM` is specified by associating each column of a model matrix with a particular smoothing operation: `s` for smoothing spline, `l` for linear, and `f` for factor or categorical variables. For example, `LinearGAM(s(0) + l(1) + f(2))`.

```python=
from pygam import s as s_gam, l as l_gam, f as f_gam, LinearGAM, LogisticGAM
from ISLP.pygam import approx_lam, degrees_of_freedom, plot as plot_gam

X_age = np.asarray(age).reshape((-1, 1))
# s_gam builds a spline term; 0 refers to the first (and only) column of X_age
gam = LinearGAM(s_gam(0, lam=0.6))  # lam is the penalty parameter λ
gam.fit(X_age, y)
```

```python
LinearGAM(callbacks=[Deviance(), Diffs()], fit_intercept=True,
          max_iter=100, scale=None, terms=s(0) + intercept,
          tol=0.0001, verbose=False)
```

```python=+
age_grid = np.linspace(age.min(), age.max(), 100)

# Plotting
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):  # vary lam from 10^-2 to 10^6
    gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
    ax.plot(age_grid, gam.predict(age_grid),
            label='{:.1e}'.format(lam), linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$');

# Search for an optimal smoothing parameter.
gam_opt = gam.gridsearch(X_age, y)
ax.plot(age_grid, gam_opt.predict(age_grid), label='Grid search', linewidth=4)
ax.legend()
fig
```

![Smoothing spline fits of wage on age for several values of λ](https://hackmd.io/_uploads/rJho0yJgA.png)
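Rather than scanning λ values by eye, it is often more intuitive to specify a smoothing spline's flexibility through its effective degrees of freedom. The sketch below uses `approx_lam` (already imported from `ISLP.pygam` above, and called the same way in the GAM examples later) to back out the λ corresponding to a chosen df for the single `age` smoother; the target of four degrees of freedom and the names `gam_df4`, `ss_term` are arbitrary illustrations.

```python=
# Sketch: choose λ indirectly, by targeting an effective degrees of freedom.
gam_df4 = LinearGAM(s_gam(0))
gam_df4.fit(X_age, y)

ss_term = gam_df4.terms[0]
# Back out the λ giving roughly 4 df (the +1 accounts for the intercept,
# mirroring the df=4+1 call used in the GAM example later), then refit.
ss_term.lam = approx_lam(X_age, ss_term, df=4 + 1)
gam_df4 = gam_df4.fit(X_age, y)
```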
## Local Regression

Local regression offers a flexible way to model non-linear relationships where a single global structure doesn't describe the data well. Instead of fitting one line across the entire dataset, local regression fits models within local neighborhoods of a target point $x_0$.

### Algorithm

:::success
<font color=blue>Algorithm (Local Regression At $X = x_0$)</font>

**Input:**
* Dataset with predictors $X = (x_1, \dots, x_n)$ and response $Y = (y_1, \dots, y_n)$
* Target point $x_0$ for prediction
* Kernel function $K(x_i, x_0)$ (e.g., Gaussian kernel, Epanechnikov kernel)
* Bandwidth parameter (controls the size of the neighborhood)

**Output:** Predicted value $\hat{f}(x_0)$ at the target point

**Procedure:**
1. **Define the Neighborhood:**
    * Calculate distances between $x_0$ and all $x_i$ in the dataset.
    * Select the $k$ nearest neighbors of $x_0$ based on these distances.
2. **Assign Weights (Kernelize):**
    * For each $x_i$ in the neighborhood, calculate the weight $K_{i0} = K(x_i, x_0)$ using the chosen kernel function and bandwidth.
    * For all $x_i$ outside the neighborhood, set $K_{i0} = 0$.
3. **Fit Weighted Least Squares Regression:**
    * Find $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the weighted residual sum of squares:
    $$ \sum_{i=1}^{n} K_{i0}\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. $$
4. **Predict at the Target Point:**
    * Calculate the predicted value $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
:::
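To make the algorithm above concrete, here is a minimal from-scratch sketch in NumPy using a Gaussian kernel and a span-style neighborhood. The helper name `local_linear_predict` and the bandwidth choice are illustrative only, not part of any library.

```python=
def local_linear_predict(x, y, x0, frac=0.2):
    """Weighted least squares fit of a line in a neighborhood of x0.
    Illustrative sketch only -- not a library function."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(int(frac * n), 2)                  # neighborhood size: k nearest points
    dist = np.abs(x - x0)
    neighbors = np.argsort(dist)[:k]
    h = max(dist[neighbors].max(), 1e-8)       # bandwidth = radius of the neighborhood
    w = np.zeros(n)
    w[neighbors] = np.exp(-dist[neighbors]**2 / (2 * h**2))  # Gaussian weights; 0 outside
    X = np.column_stack([np.ones(n), x])       # columns for beta_0 and beta_1
    sw = np.sqrt(w)                            # weighted LS via sqrt-weighted design
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * x0              # prediction at the target point

# Evaluate the local fits over the same age grid used in the earlier plots.
local_fit = np.array([local_linear_predict(age, y, x0) for x0 in age_grid])
```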
### Key Points and Considerations

* **Local Fitting:** The core idea is to fit a simple linear regression model to a small neighborhood around the target point. This allows the model to adapt to local patterns in the data.
* **Kernel Function:** The kernel function determines how the weights decrease as the distance from the target point increases. Common choices include:
    * Gaussian kernel: $K(x_i, x_0) = \exp\left(-\frac{(x_i - x_0)^2}{2h^2}\right)$
    * Epanechnikov kernel: $K(x_i, x_0) = \frac{3}{4}\left(1 - \frac{(x_i - x_0)^2}{h^2}\right)$ for $|x_i - x_0| \leq h$.
* **Bandwidth Selection:** Choosing the right bandwidth is crucial. A small bandwidth captures local detail but might lead to overfitting; a large bandwidth makes the model smoother but less responsive to local fluctuations.
* **Non-linear Regression:** While the pseudocode focuses on linear regression, the idea extends to locally weighted polynomial regression for more complex relationships.

![Local regression of wage on age at a target point (ISLR Figure 7.9)](https://hackmd.io/_uploads/B12Ulj1xC.png)

* The <font color=lightblue>blue curve</font> represents the true $f(x)$.
* The <font color=orange>light orange curve</font> corresponds to the local regression estimate $\hat f(x)$.
* The <font color=orange>orange points</font> are local to the target point $x_0$.
* The <font color=#EAC100>yellow bell shape</font> superimposed on the plot indicates the weights assigned to each point, decreasing to zero with distance from the target point.

### Advantages

* **Adaptability:** Captures complex patterns that global models might miss.
* **No Single Structure:** Doesn't assume a fixed functional form across all data.

### Beyond One Dimension

* **Multiple Features:** Local regression can be combined with global modeling, fitting locally with respect to specific features.
* **Higher Dimensions:** Can be extended to scenarios where we want the fit to be local with respect to two or more predictors.

### Example

```python=
# lowess: locally weighted scatterplot smoothing
lowess = sm.nonparametric.lowess
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for span in [0.2, 0.5]:
    fitted = lowess(y, age, frac=span, xvals=age_grid)
    ax.plot(age_grid, fitted, label='{:.1f}'.format(span), linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='span', fontsize=15);
```

![Lowess fits of wage on age for spans 0.2 and 0.5](https://hackmd.io/_uploads/Hk94gjJlR.png)

## Generalized Additive Models

Generalized additive models (GAMs) extend linear models to capture complex relationships between predictors and the response variable. Their structure is:

$$ Y = \beta_0 + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon. $$

* **Building Blocks:** The functions $f_j$ represent smooth, non-linear transformations of the predictors. Common choices include splines, local regression, or polynomials.
* **Additivity:** The key idea is to model each predictor's effect separately and then combine these effects.

### Pros

* **Enhanced Modeling:** Handle non-linear patterns that linear models cannot, potentially improving predictions.
* **No Predefined Transformations:** The flexible fitting process avoids the need to manually determine transformations.
* **Interpretability:** The additive structure aids in understanding the isolated effect of each predictor.
* **Flexibility:** The smoothness of each function $f_j$ can often be controlled (e.g., degrees of freedom in splines).
### Cons

* **Additivity Limitation:** Interactions between predictors might be missed unless explicitly included as interaction terms (e.g., $X_j X_k$) or interaction functions of the form $f_{jk}(X_j, X_k)$.

### GAMs for Classification: Logistic Extension

For binary classification problems (outcomes of 0 or 1), logistic regression is a standard tool:

$$ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p, $$

where $p(X) = \mathbb{P}(Y=1 \mid X)$ is the conditional probability of the positive class. GAMs extend this to non-linear relationships:

$$ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p). $$

### Example

1. **Manual Approach:** Using natural splines and piecewise-constant functions.

```python=
ns_age = NaturalSpline(df=4).fit(age)
ns_year = NaturalSpline(df=5).fit(Wage['year'])
Xs = [ns_age.transform(age),
      ns_year.transform(Wage['year']),
      pd.get_dummies(Wage['education']).values]
X_bh = np.hstack(Xs)  # shape = (3000, 14)
gam_bh = sm.OLS(y, X_bh).fit()
```

```python=+
age_grid = np.linspace(age.min(), age.max(), 100)
# Hold every column at its mean, then vary only the age basis columns
X_age_bh = X_bh.copy()[:100]
X_age_bh[:] = X_bh.mean(0)[None, :]
X_age_bh[:, :4] = ns_age.transform(age_grid)
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(age_grid, partial_age, 'b', linewidth=3)
ax.plot(age_grid, bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid, bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);
```

![Partial dependence of wage on age from the manually constructed GAM](https://hackmd.io/_uploads/rJoFKzyl0.png)

```python=+
year_grid = np.linspace(Wage['year'].min(), Wage['year'].max(), 100)
# Same construction as above, now varying the year basis columns
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh.mean(0)[None, :]
X_year_bh[:, 4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha=0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(year_grid, partial_year, 'b', linewidth=3)
ax.plot(year_grid, bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid, bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```

![Partial dependence of wage on year from the manually constructed GAM](https://hackmd.io/_uploads/SJROsz1xA.png)
2. **Using the `pygam` Package:**

```python=
from pygam import s as s_gam, l as l_gam, f as f_gam, LinearGAM, LogisticGAM
from ISLP.pygam import approx_lam, degrees_of_freedom, plot as plot_gam, anova as anova_gam

gam_full = LinearGAM(s_gam(0) + s_gam(1, n_splines=7) + f_gam(2, lam=0))
Xgam = np.column_stack([age, Wage['year'], Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)
```

```python=+
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6', fontsize=20);
```

![Partial dependence of wage on age from the pygam fit with the default lam=0.6](https://hackmd.io/_uploads/rkB3KBgxR.png)

```python=+
# Re-set the penalty for each smooth term to target df=4+1, then refit
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam, age_term, df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam, year_term, df=4+1)
gam_full = gam_full.fit(Xgam, y)
```

```python=+
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 1, ax=ax)  # plot the second term (year)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20)
```

![Partial dependence of wage on year from the pygam fit](https://hackmd.io/_uploads/BJChFrxl0.png)

#### ANOVA Tests for Additive Models

An *analysis of variance* (*ANOVA*) tests the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. The determination is based on an F-test.

**Condition:** The models M1 and M2 must be nested, i.e., the space spanned by the predictors in M1 must be a subspace of the space spanned by the predictors in M2.

Here, we perform a series of ANOVA tests to determine which of these three models is best:

1. A GAM that excludes year (M1)
2. A GAM that uses a linear function of year (M2)
3. A GAM that uses a spline function of year (M3)

```python=
# ANOVA test for year
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term + l_gam(1, lam=0) + f_gam(2, lam=0))
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```

```python
       deviance           df  deviance_diff   df_diff          F    pvalue
0  3.714362e+06  2991.004005            NaN       NaN        NaN       NaN
1  3.696746e+06  2990.005190   17616.542840  0.998815  14.265131  0.002314
2  3.693143e+06  2987.007254    3602.893655  2.997936   0.972007  0.435579
```

Based on the results of this ANOVA, M2 (a GAM with a linear function of year) is preferred: the small p-value comparing M1 to M2 (0.0023) indicates that including year improves the fit, while the large p-value comparing M2 to M3 (0.436) gives no evidence that a non-linear function of year is needed.

We can repeat the same process for age. There we see very clear evidence that a non-linear term is required for `age`.
```python=+
# ANOVA test for age
gam_0 = LinearGAM(year_term + f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) + year_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```

```python
       deviance           df  deviance_diff   df_diff           F        pvalue
0  3.975443e+06  2991.000589            NaN       NaN         NaN           NaN
1  3.850247e+06  2990.000704  125196.137317  0.999884  101.270106  1.681120e-07
2  3.693143e+06  2987.007254  157103.978302  2.993450   42.447812  5.669414e-07
```

```python=+
Yhat = gam_full.predict(Xgam)
```

#### Logistic Regression GAM

```python=+
high_earn = Wage['high_earn']  # binary high-earner indicator provided with the Wage data
gam_logit = LogisticGAM(age_term + l_gam(1, lam=0) + f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)
```

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education', fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```

![Partial dependence of high earner status on education, all education levels](https://hackmd.io/_uploads/ByxurUbx0.png)

```python=+
pd.crosstab(Wage['high_earn'], Wage['education'])
```

```python
education   1. < HS Grad  2. HS Grad  3. Some College  4. College Grad  5. Advanced Degree
high_earn
False                268         966              643              663                 381
True                   0           5                7               22                  45
```

We see that there are no high earners in the lowest education category, which makes the model hard to fit. We therefore refit the logistic regression GAM excluding all observations in this category, which gives more sensible results.

```python=+
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'],
                         Wage_['year'],
                         Wage_['education'].cat.codes - 1])
high_earn_ = Wage_['high_earn']
```

```python=+
gam_logit_ = LogisticGAM(age_term + year_term + f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)
```

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on education', fontsize=20)
ax.set_xticklabels(Wage['education'].cat.categories[1:], fontsize=8)
```

![Partial dependence of high earner status on education, excluding the lowest education category](https://hackmd.io/_uploads/S1K_P8Zx0.png)

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 1)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on year', fontsize=20)
```

![Partial dependence of high earner status on year](https://hackmd.io/_uploads/rkWYwL-g0.png)

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 0)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20)
```

![Partial dependence of high earner status on age](https://hackmd.io/_uploads/H1qtwU-eA.png)

## Reference

* Chapter 7 of [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)