# Beyond Linear Regression: Nonlinear Models
## Overview
Linear regression, while foundational, often doesn't fully capture the intricate relationships within real-world datasets. To address this, advanced regression techniques provide the flexibility to model non-linear patterns, leading to better predictions and deeper insights.
### Key Techniques
* **Polynomial Regression:** Introduces non-linearity by including powers of the predictors in the model.
* **Step Functions:** Partition a predictor into segments for piecewise modeling.
* **Regression Splines:** Fit piecewise polynomials, ensuring smoothness at knots.
* **Smoothing Splines:** Balance fitting data and maintaining a smooth curve.
* **Local Regression:** Fit models using subsets of data near the target point.
* **Generalized Additive Models (GAMs):** Allow non-linear functions of predictors.
## Polynomial Regression
Polynomial regression offers a way to model non-linear relationships between a predictor $X$ and the response $Y$ by introducing powers of the predictor:
$$
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \ldots + \beta_d X^d + \epsilon
$$
### Key Considerations
* **Degree $d$:** Controls the complexity of the curve. Higher degrees allow for more flexible shapes.
* **Overfitting:** Use caution with high-degree polynomials: they can overfit the training data and generalize poorly to new observations. In practice it is unusual to use $d$ greater than 3 or 4.
### Example
We start with some of our standard imports.
```python=
import numpy as np, pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (summarize, poly, ModelSpec as MS)
```
Here we use the `Wage` dataset from the `ISLP` package, which contains 3,000 observations.
```python=+
Wage = load_data('Wage')
y = Wage['wage']
age = Wage['age']
poly_age = MS([poly('age', degree=4)]).fit(Wage)
M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```
```
coef std err t P>|t|
intercept 111.7036 0.729 153.283 0.000
poly(age, degree=4)[0] 447.0679 39.915 11.201 0.000
poly(age, degree=4)[1] -478.3158 39.915 -11.983 0.000
poly(age, degree=4)[2] 125.5217 39.915 3.145 0.002
poly(age, degree=4)[3] -77.9112 39.915 -1.952 0.051
```
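The identical standard errors in the table above reflect the fact that `poly()` builds an orthogonal polynomial basis by default rather than using raw powers of `age`. Either basis spans the same column space, so the fitted values agree; below is a quick check of this, where the raw-power design matrix is built by hand purely for illustration.
```python
# Raw-power design matrix: columns age^0, age^1, ..., age^4 (illustration only)
X_raw = np.power.outer(np.asarray(age, dtype=float), np.arange(5))
M_raw = sm.OLS(y, X_raw).fit()
# Same fitted values as the orthogonal-polynomial fit, different coefficients
np.allclose(M.fittedvalues, M_raw.fittedvalues)
```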
```python=+
age_grid = np.linspace(age.min(),age.max(),100)
age_df = pd.DataFrame({'age': age_grid})
```
```python=+
def plot_wage_fit(age_df, basis, title):
    X = basis.transform(Wage)
    Xnew = basis.transform(age_df)
    M = sm.OLS(y, X).fit()
    preds = M.get_prediction(Xnew)
    bands = preds.conf_int(alpha=0.05)  # 95% confidence interval
    fig, ax = subplots(figsize=(8, 8))
    ax.scatter(age, y, facecolor='gray', alpha=0.5)
    for val, ls in zip([preds.predicted_mean, bands[:, 0], bands[:, 1]],
                       ['b', 'r--', 'r--']):
        ax.plot(age_df.values, val, ls, linewidth=3)
    ax.set_title(title, fontsize=20)
    ax.set_xlabel('Age', fontsize=20)
    ax.set_ylabel('Wage', fontsize=20)
    return ax

plot_wage_fit(age_df, poly_age, 'Degree-4 Polynomial');
```

## Step Functions
Step functions provide flexibility for modeling non-linear relationships when a single global structure (like a polynomial) isn't suitable. They achieve this by partitioning the range of the predictor $X$ into intervals.
### Model
1. **Define cutpoints:** Select cutpoints $c_1, c_2, \dots, c_K$ that divide the range of $X$ into $K+1$ intervals.
2. **Create indicator variables:** For each interval, create an indicator $C_0(X), C_1(X), \dots, C_K(X)$ taking the value 1 if $X$ falls in that interval and 0 otherwise.
3. **Fit the linear model:** Use $C_1(X), \dots, C_K(X)$ as predictors in a linear model (the leftmost interval, $C_0$, is absorbed into the intercept): $$ Y = \beta_0 + \beta_1 C_1(X) + \beta_2 C_2(X) + \ldots + \beta_K C_K(X) + \epsilon$$
### Considerations
* **Cutpoint placement:** Crucial for good fit! Ideally, align them with underlying changes in the data's behavior.
* **Potential abruptness:** Step functions create piecewise-constant fits. For smooth transitions, consider splines.
### Example
```python=
# Divide age into 4 intervals at its quartiles
cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
# get_dummies creates the columns of the model matrix for this categorical variable
```
```python
coef std err t P>|t|
(17.999, 33.75] 94.1584 1.478 63.692 0.0
(33.75, 42.0] 116.6608 1.470 79.385 0.0
(42.0, 51.0] 119.1887 1.416 84.147 0.0
(51.0, 80.0] 116.5717 1.559 74.751 0.0
```
Here `pd.qcut()` automatically picks the cutpoints at the 25%, 50%, and 75% quantiles and returns a categorical array whose categories are the four resulting intervals.
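If we prefer to choose the cutpoints ourselves (the "cutpoint placement" consideration above) rather than rely on quantiles, `pd.cut()` accepts explicit bin edges. A short sketch with arbitrarily chosen cutpoints at ages 35, 50, and 65:
```python
# Manually chosen cutpoints (illustrative only); intervals are
# (min-1, 35], (35, 50], (50, 65], (65, max]
cut_age_manual = pd.cut(age, bins=[age.min() - 1, 35, 50, 65, age.max()])
summarize(sm.OLS(y, pd.get_dummies(cut_age_manual)).fit())
```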

## Basis Functions
Polynomial and piecewise-constant regression, despite their differences, illustrate the power of basis functions to enhance the capabilities of linear models.
### The Concept
1. **Transformation:** We begin with a set of pre-defined functions $b_1(X), b_2(X), ..., b_K(X)$. These transform the original predictor $X$.
2. **Linear Fitting:** Instead of a direct linear model on $X$, we fit: $$Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \beta_3 b_3(X) + \ldots + \beta_K b_K(X) + \epsilon.$$
This model, while linear in the transformed variables, captures non-linearity with respect to the original $X$.
### Advantages of Basis Functions
* **Flexibility:** Diverse families of basis functions (polynomials, splines, etc.) allow us to model a wide range of non-linear patterns.
* **Interpretability:** Even after complex transformations, the linear structure in terms of the coefficients aids in interpreting the model's behavior.
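As a concrete illustration (reusing `age` and `y` from above), the sketch below fits a model that is linear in the coefficients but non-linear in `age`, using a hand-built basis $b_1(X)=X$, $b_2(X)=X^2$, $b_3(X)=\log X$; these particular basis functions are arbitrary and chosen only for illustration.
```python
# Hand-built basis matrix: intercept, X, X^2, log(X) (illustrative basis)
age_arr = np.asarray(age, dtype=float)
B = np.column_stack([np.ones_like(age_arr),
                     age_arr,
                     age_arr**2,
                     np.log(age_arr)])
basis_fit = sm.OLS(y, B).fit()
basis_fit.params  # one coefficient per basis function
```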
## Regression Splines
Regression splines offer a powerful way to model non-linear relationships between a predictor $X$ and a response variable. They do this by fitting low-degree polynomials within different regions of $X$.
### Piecewise Polynomial
The core idea is to divide the range of $X$ into segments. Separate low-degree polynomials are fitted within each segment.
#### Cubic Splines
A commonly used choice is a piecewise cubic polynomial, which becomes a *cubic spline* once the smoothness constraints below are imposed. With knots $c_1 < c_2 < \dots < c_K$, the piecewise form is:
$$
Y = \begin{cases}
\beta_{01} + \beta_{11}X + \beta_{21}X^2 + \beta_{31}X^3 & \text{if } X < c_1 \\
\beta_{02} + \beta_{12}X + \beta_{22}X^2 + \beta_{32}X^3 & \text{if } c_1 \le X < c_2 \\
\vdots \\
\beta_{0,K+1} + \beta_{1,K+1}X + \beta_{2,K+1}X^2 + \beta_{3,K+1}X^3 & \text{if } c_K \le X
\end{cases}
$$
Each of these polynomial functions can be fit using least squares applied to simple functions of the original predictor.
* **Knots:** The points $c_1, c_2, \dots, c_K$ where the polynomial pieces connect are called knots.
### Ensuring Smoothness
To ensure a visually smooth fit, we impose constraints:
* **Continuity:** The polynomial pieces must meet at the knots.
* **Smoothness:** Typically, we require the first and second derivatives to also match at the knots. This prevents sharp bends.
#### Key Terms
* **Cubic Spline:** A popular choice due to its smoothness at knots.
* **Natural Spline:** A cubic spline with the added constraint of linear behavior beyond the boundary knots. This can improve its behavior at the edges of the predictor's range.
### Advantages of Splines
* **Flexibility:** Can model a wide variety of non-linear patterns.
* **Smoothness:** Produce more natural-looking curves than step functions.
* **Control:** Knot placement offers control over where the model has more flexibility (more knots in a region = more flexible fit there).
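One concrete way to see why the smoothness constraints reduce a piecewise cubic with $K$ knots from $4(K+1)$ free coefficients to $K+4$ is the *truncated power basis*: $1, X, X^2, X^3$ together with $(X-c_k)^3_+$ for each knot. Adding a term $(X-c_k)^3_+$ changes only the third derivative at $c_k$, so $f$, $f'$, and $f''$ remain continuous there. A minimal sketch follows (this is *not* how the `bs()` helper below builds its basis; B-splines span the same space but are numerically better behaved):
```python
def truncated_power_basis(x, knots):
    """Cubic spline basis via truncated powers: 1, x, x^2, x^3, (x - c_k)^3_+."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - c, 0, None)**3 for c in knots]
    return np.column_stack(cols)

X_tp = truncated_power_basis(age, knots=[25, 40, 60])
X_tp.shape  # (3000, 7): 4 polynomial columns + one per knot
```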
### Example
Two helpers construct an appropriate matrix of basis functions for fitting a regression spline: the `BSpline` transform and the `bs` model-spec helper.
1. `BSpline`
By default, the B-splines produced are cubic. To change the degree, use the `degree` argument.
```python=
from ISLP.models import bs, ns
from ISLP.transforms import BSpline, NaturalSpline
bs_ = BSpline(internal_knots=[25, 40, 60], intercept=True).fit(age)
bs_age = bs_.transform(age)
print(bs_age.shape)
```
```
(3000, 7)
```
2. `bs`
```python=+
bs_age = MS([bs('age', internal_knots=[25, 40, 60], name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```
```python
coef std err t P>|t|
intercept 60.4937 9.460 6.394 0.000
bs(age)[0] 3.9805 12.538 0.317 0.751
bs(age)[1] 44.6310 9.626 4.636 0.000
bs(age)[2] 62.8388 10.755 5.843 0.000
bs(age)[3] 55.9908 10.706 5.230 0.000
bs(age)[4] 50.6881 14.402 3.520 0.000
bs(age)[5] 16.6061 19.126 0.868 0.385
```
We could also use the `df` (degrees of freedom) option to specify the complexity of the spline. In the example below, using `degree=0` results in piecewise-constant functions, as in our `pd.qcut` example above.
```python=+
bs_age0 = MS([bs('age', df=3, degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())
```
```python
coef std err t P>|t|
intercept 94.1584 1.478 63.687 0.0
bs(age, df=3, degree=0)[0] 22.3490 2.152 10.388 0.0
bs(age, df=3, degree=0)[1] 24.8076 2.044 12.137 0.0
bs(age, df=3, degree=0)[2] 22.7814 2.087 10.917 0.0
```
To fit natural splines, we use the `NaturalSpline` transform with the corresponding helper `ns`.
```python=+
ns_age = MS([ns('age', df=5)]).fit(Wage)
# natural spline with five degrees of freedom
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```
```python
coef std err t P>|t|
intercept 60.4752 4.708 12.844 0.000
ns(age, df=5)[0] 61.5267 4.709 13.065 0.000
ns(age, df=5)[1] 55.6912 5.717 9.741 0.000
ns(age, df=5)[2] 46.8184 4.948 9.463 0.000
ns(age, df=5)[3] 83.2036 11.918 6.982 0.000
ns(age, df=5)[4] 6.8770 9.484 0.725 0.468
```
```python=+
plot_wage_fit(age_df, ns_age, 'Natural spline, df=5')
```

## Smoothing Splines
Smoothing splines address the risk of overfitting by striking a balance between closely fitting the observed data and maintaining a smooth curve.
### How Smoothing Splines Work
A smoothing spline finds a function $g$ that minimizes:
$$
\sum_{i=1}^n(y_i-g(x_i))^2+\lambda \int g''(t)^2 dt \quad (*)
$$
Where:
* The first term ensures fidelity to the data.
* The second term penalizes roughness. Its integral measures the total change in the slope $g'(t)$ of the fitted function.
* Tuning Parameter $\lambda$:
* $\lambda = 0$ favors data fit (potential overfitting).
* $\lambda \to \infty$ favors smoothness (approaches a linear fit).
### Choosing the Smoothing Parameter $\lambda$
The optimal value of $\lambda$ is often found using cross-validation. Leave-one-out cross-validation (LOOCV) is especially efficient for smoothing splines due to this formula:
$$
\text{RSS}_{cv}(\lambda)=\sum_{i=1}^n\left(y_i-\hat g_{\lambda}^{(-i)}(x_i)\right)^2=\sum_{i=1}^n \left[\frac{y_i-\hat g_{\lambda}(x_i)}{1-\{ S_\lambda \}_{ii}} \right]^2,
$$
where
* $\hat g_{\lambda}^{(-i)}(x_i)$: The fitted value at $x_i$ when the $i$-th observation is excluded.
* $S_\lambda$: An $n \times n$ matrix such that $\hat g_{\lambda} = S_\lambda y$ where $\hat g_{\lambda}$ is the solution to $(*)$.
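This identity holds for any *linear smoother*, i.e. any fit of the form $\hat g_\lambda = S_\lambda y$. Below is a small numerical check on simulated data, using ordinary least squares on a cubic polynomial basis as the linear smoother (so $S = X(X^\top X)^{-1}X^\top$); this sketch only verifies the formula and is not how smoothing splines are fit in practice.
```python
# Simulated data, purely for illustration
rng = np.random.default_rng(0)
x_sim = np.sort(rng.uniform(0, 1, 50))
y_sim = np.sin(2 * np.pi * x_sim) + rng.normal(scale=0.3, size=50)

# Linear smoother: cubic polynomial OLS fit, S = X (X^T X)^{-1} X^T
X_sim = np.power.outer(x_sim, np.arange(4))
S = X_sim @ np.linalg.solve(X_sim.T @ X_sim, X_sim.T)
fitted = S @ y_sim

# Brute-force LOOCV: refit without observation i, predict at x_i
loo = []
for i in range(50):
    keep = np.arange(50) != i
    beta = np.linalg.lstsq(X_sim[keep], y_sim[keep], rcond=None)[0]
    loo.append(y_sim[i] - X_sim[i] @ beta)
rss_cv_brute = np.sum(np.square(loo))

# Shortcut using only the full fit and the diagonal of S
rss_cv_short = np.sum(((y_sim - fitted) / (1 - np.diag(S)))**2)
print(np.allclose(rss_cv_brute, rss_cv_short))   # True
```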
### Example
A smoothing spline is in fact a special case of a GAM with squared-error loss and a single feature, so here we use the `pygam` package to fit one.
A `LinearGAM` model is specified by associating each column of the model matrix with a particular smoothing operation: `s` for a smoothing spline, `l` for a linear term, and `f` for a factor (categorical) variable. For example, `LinearGAM(s_gam(0) + l_gam(1) + f_gam(2))` with the aliases imported below.
```python=
from pygam import s as s_gam, l as l_gam, f as f_gam, LinearGAM, LogisticGAM
from ISLP.pygam import approx_lam, degrees_of_freedom, plot as plot_gam
X_age = np.asarray(age).reshape((-1, 1))
# s is for spline and 0 means the first predictor
gam = LinearGAM(s_gam(0, lam=0.6))
# lam is the penalty parameter λ
gam.fit(X_age, y)
```
```python
LinearGAM(callbacks=[Deviance(), Diffs()], fit_intercept=True,
max_iter=100, scale=None, terms=s(0) + intercept, tol=0.0001,
verbose=False)
```
```python=+
age_grid = np.linspace(age.min(),age.max(),100)
# Plotting
fig , ax = subplots(figsize=(8,8))
ax.scatter(age , y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):
    # Vary lam from 10^-2 to 10^6
    gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:.1e}'.format(lam),
            linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title=r'$\lambda$');
# Search for an optimal smoothing parameter
gam_opt = gam.gridsearch(X_age, y)
ax.plot(age_grid, gam_opt.predict(age_grid), label='Grid search', linewidth=4)
ax.legend()
fig
```

## Local Regression
Local regression offers a flexible way to model non-linear relationships where a single global structure doesn't describe the data well. Instead of fitting one line across the entire dataset, local regression focuses on fitting models within local neighborhoods of a target point $x_0$.
### Algorithm
:::success
<font color=blue>Algorithm (Local Regression At $X = x_0$)</font>
**Input:**
* Dataset with predictors $X=(x_1, \dots, x_n)$ and response $Y=(y_1, \dots, y_n)$
* Target point $x_0$ for prediction
* Kernel function $K(x_i, x_0)$ (e.g., Gaussian kernel, Epanechnikov kernel)
* Bandwidth parameter (controls the size of the neighborhood)
**Output:** Predicted value $\hat{f}(x_0)$ at the target point
**Procedure:**
1. **Define the Neighborhood:**
* Calculate distances between $x_0$ and all $x_i$ in the dataset.
* Select the 'k' nearest neighbors of $x_0$ based on these distances.
2. **Assign Weights (Kernelize):**
* For each $x_i$ in the neighborhood:
* Calculate the weight $K_{i0} = K(x_i,x_0)$ using the chosen kernel function and bandwidth.
* For all $x_i$ outside the neighborhood, set $K_{i0} =0$.
3. **Fit Weighted Least Squares Regression:**
* Find $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the weighted residual sum of squares: $$\sum_{i=1}^{n} K_{i0}(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2.$$
4. **Predict at the Target Point:**
* Calculate the predicted value: $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1x_0$.
:::
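The algorithm above is straightforward to implement directly. Below is a minimal NumPy sketch of local *linear* regression with a Gaussian kernel; the function names and the fixed bandwidth `h=5.0` (in years of age) are illustrative choices, and for simplicity every observation is kernel-weighted rather than truncated to the $k$ nearest neighbors.
```python
def gaussian_kernel(xi, x0, h):
    # Weight K_{i0} decays with distance from the target point x0
    return np.exp(-(xi - x0)**2 / (2 * h**2))

def local_linear(x, y, x0, h=5.0):
    """Weighted least squares line fit around x0; returns f_hat(x0). h is illustrative."""
    w = gaussian_kernel(x, x0, h)
    X = np.column_stack([np.ones_like(x), x])
    Xw = X * w[:, None]                            # apply weights without forming diag(w)
    b0, b1 = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # minimizes sum_i K_{i0}(y_i - b0 - b1 x_i)^2
    return b0 + b1 * x0

x_arr = np.asarray(age, dtype=float)
y_arr = np.asarray(y, dtype=float)
grid = np.linspace(x_arr.min(), x_arr.max(), 100)
fit_local = np.array([local_linear(x_arr, y_arr, x0) for x0 in grid])
```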
### Key Points and Considerations
* **Local Fitting:** The core idea is to fit a simple linear regression model to a small neighborhood around the target point. This allows the model to adapt to local patterns in the data.
* **Kernel Function:** The kernel function determines how the weights decrease as the distance from the target point increases. Common choices include:
* Gaussian Kernel: $K(x_i,x_0) = \exp(-\frac{(x_i-x_0)^2}{2h^2})$
* Epanechnikov Kernel: $K(x_i,x_0) = \frac34 (1-\frac{(x_i-x_0)^2}{h^2})$ for $|x_i-x_0| \leq h$.
* **Bandwidth Selection:** Choosing the right bandwidth is crucial. A small bandwidth captures local details but might lead to overfitting. A large bandwidth makes the model smoother but less responsive to local fluctuations.
* **Non-linear Regression:** While the pseudocode focuses on linear regression, you can extend the idea to fit locally weighted polynomial regression models for more complex relationships.

In the usual illustration of local regression at a target point $x_0$:
* <font color=lightblue>Blue curve</font> represents the true $f(x)$.
* <font color=orange>Light orange curve</font> corresponds to the local regression estimate $\hat f(x)$.
* The <font color=orange>orange colored points</font> are local to the target point $x_0$.
* The <font color= #EAC100>yellow bell-shape</font> superimposed on the plot indicates weights assigned to each point, decreasing to zero with distance from the target point.
### Advantages
* **Adaptability:** Captures complex patterns that global models might miss.
* **No Single Structure:** Doesn't assume a fixed functional form across all data.
### Beyond One Dimension
* **Multiple Features:** Local regression can be combined with global modeling, fitting locally with respect to specific features.
* **Higher Dimensions:** Can be extended to scenarios where we want the fit to be local with respect to two or more predictors.
### Example
```python=
lowess = sm.nonparametric.lowess  # locally weighted scatterplot smoothing
fig, ax = subplots(figsize=(8, 8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for span in [0.2, 0.5]:
    fitted = lowess(y,
                    age,
                    frac=span,
                    xvals=age_grid)
    ax.plot(age_grid,
            fitted,
            label='{:.1f}'.format(span),
            linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='span', fontsize=15);
```

## Generalized Additive Models
Generalized additive models (GAMs) extend linear models to capture complex relationships between predictors and the response variable. Their structure is:
$$
Y = \beta_0 + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon.
$$
* **Building Blocks:** The functions $f_j$ represent smooth, non-linear transformations of the predictors. Common choices include splines, local regression, and polynomials.
* **Additivity:** The key idea is to model each predictor's effect separately and then combine these effects.
### Pros
* **Enhanced Modeling:** Handle non-linear patterns that linear models cannot, potentially improving predictions.
* **No Predefined Transformations:** Flexible fitting process avoids the need to manually determine transformations.
* **Interpretability:** The additive structure aids in understanding the isolated effect of each predictor.
* **Flexibility:** Smoothness of each function $f_j$ can often be controlled (e.g., degrees of freedom in splines).
### Cons
* **Additivity Limitation:** Interactions between predictors are missed unless explicitly included as interaction terms (e.g. $X_j X_k$ or functions like $f_{jk}(X_j, X_k)$); see the sketch below.
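A minimal sketch of adding such an interaction in the manual (OLS-based) approach used in the Example below, reusing `Wage`, `y`, and `NaturalSpline` from earlier; pairing `age` and `year` in a product column is purely illustrative:
```python
# Additive natural-spline terms for age and year, dummies for education,
# plus an explicit age x year product column as an interaction term (illustrative)
ns_age_i = NaturalSpline(df=4).fit(Wage['age'])
ns_year_i = NaturalSpline(df=5).fit(Wage['year'])
X_int = np.hstack([ns_age_i.transform(Wage['age']),
                   ns_year_i.transform(Wage['year']),
                   pd.get_dummies(Wage['education']).values,
                   np.asarray(Wage['age'] * Wage['year'], dtype=float)[:, None]])
gam_interact = sm.OLS(y, X_int).fit()
```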
### GAMs for Classification: Logistic Extension
For binary classification problems (outcomes of 0 or 1), logistic regression is a standard tool:
$$
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p,
$$
where $p(X) = \mathbb{P}(Y=1|X)$ is the conditional probability of the positive class.
GAMs allow this to be extended for non-linear relationships:
$$
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p).
$$
### Example
1. **Manual Approach:** Using natural splines for `age` and `year` and dummy (piecewise-constant) variables for `education`.
```python=
ns_age = NaturalSpline(df=4).fit(age)
ns_year = NaturalSpline(df=5).fit(Wage['year'])
Xs = [ns_age.transform(age), ns_year.transform(Wage['year']), pd.get_dummies(Wage['education']).values]
X_bh = np.hstack(Xs) # shape = (3000, 14)
gam_bh = sm.OLS(y, X_bh).fit()
```
```python=+
age_grid = np.linspace(age.min(),
age.max(),
100)
X_age_bh = X_bh.copy()[:100]
X_age_bh[:] = X_bh[:].mean(0)[None ,:]
X_age_bh[:,:4] = ns_age.transform(age_grid)
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig , ax = subplots(figsize=(8,8))
ax.plot(age_grid , partial_age , 'b', linewidth=3)
ax.plot(age_grid , bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid , bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);
```

```python=+
year_grid = np.linspace(Wage['year'].min(),
                        Wage['year'].max(),
                        100)
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh[:].mean(0)[None ,:]
X_year_bh[:,4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha =0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig , ax = subplots(figsize=(8,8))
ax.plot(year_grid , partial_year , 'b', linewidth=3)
ax.plot(year_grid , bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid , bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```

2. Using the `pygam` Package:
```python=
from pygam import s as s_gam, l as l_gam, f as f_gam, LinearGAM, LogisticGAM
from ISLP.pygam import approx_lam, degrees_of_freedom, plot as plot_gam, anova as anova_gam
gam_full = LinearGAM(s_gam(0) + s_gam(1, n_splines=7) + f_gam(2, lam=0))
Xgam = np.column_stack([age, Wage['year'], Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)
```
```python=+
fig , ax = subplots(figsize=(8,8))
plot_gam(gam_full , 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6',
fontsize=20);
```

```python=+
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam , age_term , df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam , year_term , df=4+1)
gam_full = gam_full.fit(Xgam , y)
```
```python=+
fig , ax = subplots(figsize=(8,8))
plot_gam(gam_full, 1, ax=ax)
# Plot the second term (year)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20)
```

#### ANOVA Tests for Additive Models
An *analysis of variance* (*ANOVA*) tests the null hypothesis that a simpler model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. The comparison is based on an F-test.
**Condition:** The models M1 and M2 must be nested, i.e., the space spanned by the predictors in M1 must be a subspace of the space spanned by the predictors in M2.
Here, we perform a series of ANOVA tests to determine which of these three models is best:
1. A GAM that excludes year (M1)
2. A GAM that uses a linear function of year (M2)
3. A GAM that uses a spline function of year (M3)
```python=
# ANOVA test for year
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term + l_gam(1, lam=0) + f_gam(2, lam=0))
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```
```python
deviance df deviance_diff df_diff F pvalue
0 3.714362e+06 2991.004005 NaN NaN NaN NaN
1 3.696746e+06 2990.005190 17616.542840 0.998815 14.265131 0.002314
2 3.693143e+06 2987.007254 3602.893655 2.997936 0.972007 0.435579
```
Based on the results of this ANOVA, M2 (a GAM with a linear function of `year`) is preferred: the p-value comparing M1 and M2 (≈0.002) shows that including `year` improves the model, while the p-value comparing M2 and M3 (≈0.44) gives no evidence that a non-linear function of `year` is needed.
We can repeat the same process for age as well. We see very clear evidence that a non-linear term is required for `age`.
```python=+
# ANOVA test for age
gam_0 = LinearGAM(year_term + f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) +
year_term +
f_gam(2, lam=0))
gam_0.fit(Xgam , y)
gam_linear.fit(Xgam , y)
anova_gam(gam_0 , gam_linear , gam_full)
```
```python
deviance df deviance_diff df_diff F pvalue
0 3.975443e+06 2991.000589 NaN NaN NaN NaN
1 3.850247e+06 2990.000704 125196.137317 0.999884 101.270106 1.681120e-07
2 3.693143e+06 2987.007254 157103.978302 2.993450 42.447812 5.669414e-07
```
```python=+
Yhat = gam_full.predict(Xgam)
```
#### Logistic Regression GAM
```python=+
# Binary response: wage above $250,000 (wage is recorded in thousands of dollars)
high_earn = Wage['high_earn'] = y > 250
gam_logit = LogisticGAM(age_term + l_gam(1, lam=0) + f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)
```
```python=+
fig , ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit , 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories , fontsize=8);
```

```python=+
pd.crosstab(Wage['high_earn'], Wage['education'])
```
```python
education 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad 5. Advanced Degree
high_earn
False 268 966 643 663 381
True 0 5 7 22 45
```
We see that there are no high earners in the first category of education, meaning that the model will have a hard time fitting. We will fit a logistic regression GAM excluding all observations falling into this category. This provides more sensible results.
```python=+
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'], Wage_['year'], Wage_['education'].cat.codes - 1])
high_earn_ = Wage_['high_earn']
```
```python=+
gam_logit_ = LogisticGAM(age_term + year_term + f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)
```
```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 2)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on education', fontsize=20)
ax.set_xticklabels(Wage['education'].cat.categories[1:], fontsize=8)
```

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 1)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on year', fontsize=20)
```

```python=+
fig, ax = subplots(figsize=(8, 8))
ax = plot_gam(gam_logit_, 0)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of high earner status on age', fontsize=20)
```

## Reference
* Chap 7 in [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)