# Simple Linear Regression
### 套件輸入
```python=
import pandas as pd
import numpy as np
# Modeling ("sklearn + statsmodels" use these two packages to modeling)
from sklearn.metrics import mean_squared_error, explained_variance_score, r2_score
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.formula.api as smf # for the ols()
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.graphics.regressionplots import *
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from scipy import stats # statistic analysis
# packages for the plot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("seaborn-white")
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
```
### Simple Linear Regression
:::info
**`ISLR`** 中使用 **`Boston`** 作為實作資料集, 包含了506筆在Boston的 **`medv(房價中位數)`** , 可以透過資料集中的12個欄位(預測因素)來去預測 **`medv`**
:::
```python=
Boston = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Statistical Learning/Linear Regression/Boston.csv", index_col = "Unnamed: 0") # the original dataset have the first col "Unnamed: 0", change this col as the DataFrame's index.
Boston.index = Boston.index-1 # After the change, the index start from 1. Code like this to let the index start from 0.
Boston.head(5)
```
- **use the ols() to fit a simple linear regression model:**
```python=
# est = smf.ols(y ~ x, data)
est = smf.ols("medv ~ lstat", data = Boston).fit()
print(est.summary())
```
- **Another way for using scikit-learn like API as follows:**
```python=
X = Boston["lstat"] # data for the X1 in regression equation
X = sm.add_constant(X) # add a X0=1(intercept) to complete a simple regression equation
y = Boston["medv"]
model = sm.OLS(y, X).fit()
print(model.summary())
```
- **create testing data as a dataframe to predict:**
```python=
X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new.
est.predict(X_new) # use the model(est) we fit before, and predict "medv" by the testing data we simulate.
```
:rotating_light: **Question:
Noticed that the model we fit from SKlearn got a constant, not a intercept.
How to solve it?**
-> **`OLS`** is a class, and **`ols`** is a method of the **`OLS`** class that is inherited from **`statsmodels.base.model.Model`**
```python=
[Input]: X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new.
model.predict(X_new)
[Output]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-beea597c26ef> in <module>()
1 X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new.
----> 2 model.predict(X_new)
1 frames
/usr/local/lib/python3.7/dist-packages/statsmodels/regression/linear_model.py in predict(self, params, exog)
362 exog = self.exog
363
--> 364 return np.dot(exog, params)
365
366 def get_distribution(self, params, scale, exog=None, dist_class=None):
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (3,1) and (2,) not aligned: 1 (dim 1) != 2 (dim 0)
```
:::warning
:fairy: **Supplemental Information:**
When **`statsmodel`** detected as a categorical variable, an integer column can be treated as categorical using: **`model = ols("VIQ ~ C(Gender)", data).fit()`**
It's like the concept of **`one-hot encoding(單熱編碼)`** introduced in **`ISLR`**
:::
- **Confidence Interval**
```python=
est.conf_int(alpha = 0.05)
```
- **Confidence & Prediction Interval**
```python=
prediction = est.get_prediction(X_new)
prediction.summary_frame() # in the summary_frame(), alpha=0.05 is by default.
# mean_ci -> confidence interval
# obs_ci -> prediction interval
```
- plot **`medv`** and **`lstat`** along with the least squares regression line using **`matplotlib`** or **`regplot`** functions.
```python=
sns.scatterplot(x = "lstat", y = "medv", data = Boston) # scatterplot with "lstat" / "medv"
X = pd.DataFrame({"lstat": [Boston.lstat.min(), Boston.lstat.max()]})
Y_pred = est.predict(X)
sns.lineplot(x = X.values[:,0], y = Y_pred.values, color = "red") # [:,0] -> [first_row:last_row, column_0]
```

```python=
sns.regplot(x = "lstat", y = "medv", data = Boston, scatter_kws = {"color": "Teal"}, line_kws = {"color": "lightsalmon"})
```

```python=
influence = OLSInfluence(est)
influence.summary_frame()
```
```python=
ols_sm_resid = est.resid # residuals from "est"
ols_fitted = est.fittedvalues
prstd = wls_prediction_std(est)[0] # wls_prediciton_std 會返回 標準偏差 和 信賴區間 (prstd / ivLow / ivUp)
ols_sm_resid_stud = influence.resid_studentized_internal # studentized residuals or influence.resid_studentized_internal. BUT the front one will have different value with the latter one...? so i decided to use the latter one at here
leverage = OLSInfluence(est).hat_matrix_diag
# fig setting
f, axes = plt.subplots(2, 2, sharex = False, sharey = False)
f.set_figheight(15)
f.set_figwidth(20)
#作圖寫法: 可以從sns.~plot()裡的ax參數設定將該圖畫在subplot劃分的哪個位置 or 直接選擇axes[?,?].scatter()來將圖畫在指定位置
# regression plot
sns.regplot(x = "lstat", y = "medv", data = Boston, ax = axes[0,0], scatter_kws = {"alpha": 0.2}) # the alpha is for the transparency of the scatter dot
axes[0,0].set_title("reg_plot")
# residual plot
sns.scatterplot(x = ols_fitted, y = ols_sm_resid, ax = axes[0,1], alpha = 0.2)
axes[0,1].set_xlabel("fittedvalues")
axes[0,1].set_ylabel("residual")
axes[0,1].set_title("residual plot")
# leverage plot
axes[1,0].scatter(x = leverage, y = ols_sm_resid_stud, alpha = 0.3, color = "red")
axes[1,0].set_xlabel("leverage")
axes[1,0].set_ylabel("studentized residuals")
axes[1,0].set_title("leverage plot")
# studentized residual plot
axes[1,1].scatter(x = ols_fitted, y = ols_sm_resid_stud, alpha = 0.2, color = "magenta")
axes[1,1].axhline(y = 0, linestyle = ":", color = ".2") # Add a horizontal line across the axis
axes[1,1].axhline(y = 3, linestyle = ":", color = "gray")
axes[1,1].axhline(y = -3, linestyle = ":", color = "gray")
axes[1, 1].set_ylim(-5, 5)
axes[1,1].set_xlabel("fittedvalues")
axes[1,1].set_ylabel("studentized residuals")
axes[1,1].set_title("studentized residual plot")
#fittedvalues和ols_sm_resid_stud作為不同的變數 寫在同一行他會去比較嗎? 如果會的話 x最後存的值就是ols_sm_resid_stud超過[-3:3]這個區間範圍的fittedvalues的值嗎?
x = est.fittedvalues[np.logical_or(ols_sm_resid_stud > 3, ols_sm_resid_stud < -3)] # logical_or 會返回布林值
y = ols_sm_resid_stud[np.logical_or(ols_sm_resid_stud > 3, ols_sm_resid_stud < -3)]
for i, x, y in zip(x.index, x, y):
axes[1,1].annotate(i, xy = (x,y)) # Axes.annotate(s, xy, *args, **kwargs) s: 註釋文字的內容 xy: 被註釋的座標點
```
