Simple Linear Regression

# Simple Linear Regression ### 套件輸入 ```python= import pandas as pd import numpy as np # Modeling ("sklearn + statsmodels" use these two packages to modeling) from sklearn.metrics import mean_squared_error, explained_variance_score, r2_score from sklearn import linear_model from sklearn.preprocessing import PolynomialFeatures import statsmodels.formula.api as smf # for the ols() import statsmodels.api as sm from statsmodels.sandbox.regression.predstd import wls_prediction_std from statsmodels.stats.outliers_influence import OLSInfluence from statsmodels.graphics.regressionplots import * from statsmodels.stats.outliers_influence import variance_inflation_factor as vif from scipy import stats # statistic analysis # packages for the plot import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline plt.style.use("seaborn-white") sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5}) ``` ### Simple Linear Regression :::info **`ISLR`** 中使用 **`Boston`** 作為實作資料集, 包含了506筆在Boston的 **`medv(房價中位數)`** , 可以透過資料集中的12個欄位(預測因素)來去預測 **`medv`** ::: ```python= Boston = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Statistical Learning/Linear Regression/Boston.csv", index_col = "Unnamed: 0") # the original dataset have the first col "Unnamed: 0", change this col as the DataFrame's index. Boston.index = Boston.index-1 # After the change, the index start from 1. Code like this to let the index start from 0. Boston.head(5) ``` - **use the ols() to fit a simple linear regression model:** ```python= # est = smf.ols(y ~ x, data) est = smf.ols("medv ~ lstat", data = Boston).fit() print(est.summary()) ``` - **Another way for using scikit-learn like API as follows:** ```python= X = Boston["lstat"] # data for the X1 in regression equation X = sm.add_constant(X) # add a X0=1(intercept) to complete a simple regression equation y = Boston["medv"] model = sm.OLS(y, X).fit() print(model.summary()) ``` - **create testing data as a dataframe to predict:** ```python= X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new. est.predict(X_new) # use the model(est) we fit before, and predict "medv" by the testing data we simulate. ``` :rotating_light: **Question: Noticed that the model we fit from SKlearn got a constant, not a intercept. How to solve it?** -> **`OLS`** is a class, and **`ols`** is a method of the **`OLS`** class that is inherited from **`statsmodels.base.model.Model`** ```python= [Input]: X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new. model.predict(X_new) [Output]: --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-19-beea597c26ef> in <module>() 1 X_new = pd.DataFrame({"lstat": [5,10,15]}) # create a dataframe that contains the data of lstat, named X_new. ----> 2 model.predict(X_new) 1 frames /usr/local/lib/python3.7/dist-packages/statsmodels/regression/linear_model.py in predict(self, params, exog) 362 exog = self.exog 363 --> 364 return np.dot(exog, params) 365 366 def get_distribution(self, params, scale, exog=None, dist_class=None): <__array_function__ internals> in dot(*args, **kwargs) ValueError: shapes (3,1) and (2,) not aligned: 1 (dim 1) != 2 (dim 0) ``` :::warning :fairy: **Supplemental Information:** When **`statsmodel`** detected as a categorical variable, an integer column can be treated as categorical using: **`model = ols("VIQ ~ C(Gender)", data).fit()`** It's like the concept of **`one-hot encoding(單熱編碼)`** introduced in **`ISLR`** ::: - **Confidence Interval** ```python= est.conf_int(alpha = 0.05) ``` - **Confidence & Prediction Interval** ```python= prediction = est.get_prediction(X_new) prediction.summary_frame() # in the summary_frame(), alpha=0.05 is by default. # mean_ci -> confidence interval # obs_ci -> prediction interval ``` - plot **`medv`** and **`lstat`** along with the least squares regression line using **`matplotlib`** or **`regplot`** functions. ```python= sns.scatterplot(x = "lstat", y = "medv", data = Boston) # scatterplot with "lstat" / "medv" X = pd.DataFrame({"lstat": [Boston.lstat.min(), Boston.lstat.max()]}) Y_pred = est.predict(X) sns.lineplot(x = X.values[:,0], y = Y_pred.values, color = "red") # [:,0] -> [first_row:last_row, column_0] ``` ![Minion](https://i.imgur.com/8yuevxr.jpg =400x300) ```python= sns.regplot(x = "lstat", y = "medv", data = Boston, scatter_kws = {"color": "Teal"}, line_kws = {"color": "lightsalmon"}) ``` ![Minion](https://i.imgur.com/uHMFHR0.png =400x300) ```python= influence = OLSInfluence(est) influence.summary_frame() ``` ```python= ols_sm_resid = est.resid # residuals from "est" ols_fitted = est.fittedvalues prstd = wls_prediction_std(est)[0] # wls_prediciton_std 會返回標準偏差和信賴區間 (prstd / ivLow / ivUp) ols_sm_resid_stud = influence.resid_studentized_internal # studentized residuals or influence.resid_studentized_internal. BUT the front one will have different value with the latter one...? so i decided to use the latter one at here leverage = OLSInfluence(est).hat_matrix_diag # fig setting f, axes = plt.subplots(2, 2, sharex = False, sharey = False) f.set_figheight(15) f.set_figwidth(20) #作圖寫法: 可以從sns.~plot()裡的ax參數設定將該圖畫在subplot劃分的哪個位置 or 直接選擇axes[?,?].scatter()來將圖畫在指定位置 # regression plot sns.regplot(x = "lstat", y = "medv", data = Boston, ax = axes[0,0], scatter_kws = {"alpha": 0.2}) # the alpha is for the transparency of the scatter dot axes[0,0].set_title("reg_plot") # residual plot sns.scatterplot(x = ols_fitted, y = ols_sm_resid, ax = axes[0,1], alpha = 0.2) axes[0,1].set_xlabel("fittedvalues") axes[0,1].set_ylabel("residual") axes[0,1].set_title("residual plot") # leverage plot axes[1,0].scatter(x = leverage, y = ols_sm_resid_stud, alpha = 0.3, color = "red") axes[1,0].set_xlabel("leverage") axes[1,0].set_ylabel("studentized residuals") axes[1,0].set_title("leverage plot") # studentized residual plot axes[1,1].scatter(x = ols_fitted, y = ols_sm_resid_stud, alpha = 0.2, color = "magenta") axes[1,1].axhline(y = 0, linestyle = ":", color = ".2") # Add a horizontal line across the axis axes[1,1].axhline(y = 3, linestyle = ":", color = "gray") axes[1,1].axhline(y = -3, linestyle = ":", color = "gray") axes[1, 1].set_ylim(-5, 5) axes[1,1].set_xlabel("fittedvalues") axes[1,1].set_ylabel("studentized residuals") axes[1,1].set_title("studentized residual plot") #fittedvalues和ols_sm_resid_stud作為不同的變數寫在同一行他會去比較嗎? 如果會的話 x最後存的值就是ols_sm_resid_stud超過[-3:3]這個區間範圍的fittedvalues的值嗎? x = est.fittedvalues[np.logical_or(ols_sm_resid_stud > 3, ols_sm_resid_stud < -3)] # logical_or 會返回布林值 y = ols_sm_resid_stud[np.logical_or(ols_sm_resid_stud > 3, ols_sm_resid_stud < -3)] for i, x, y in zip(x.index, x, y): axes[1,1].annotate(i, xy = (x,y)) # Axes.annotate(s, xy, *args, **kwargs) s: 註釋文字的內容 xy: 被註釋的座標點 ``` ![](https://i.imgur.com/dIgCVSr.png)