---
# Linear Regression and K-Nearest Neighbors (KNN)
---

## 0.1 Formulas (Matrix Calculus)

### 0.1.1 Derivative of a vector with respect to a vector, $\frac{\partial{\boldsymbol{y}}}{\partial{\boldsymbol{x}}}$

| Condition | Expression | Numerator layout (in terms of $\boldsymbol{y}$ and $\boldsymbol{x}^T$) | Denominator layout (in terms of $\boldsymbol{y}^T$ and $\boldsymbol{x}$) |
| -------- | -------- | -------- | -------- |
| $\boldsymbol{a}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{a}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{0}$ | $\boldsymbol{0}$ |
| $\boldsymbol{x}$ with respect to itself | $\frac{\partial{\boldsymbol{x}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{I}$ | $\boldsymbol{I}$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{A}\boldsymbol{x}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}$ | $\boldsymbol{A}^T$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{x}^T\boldsymbol{A}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}^T$ | $\boldsymbol{A}$ |
| $a$ is not a function of $\boldsymbol{x}$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{a\boldsymbol{u}}}{\partial{\boldsymbol{x}}}=$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ |
| $a=a(\boldsymbol{x})$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{a\boldsymbol{u}}}{\partial{\boldsymbol{x}}}=$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\boldsymbol{u}\frac{\partial{a}}{\partial{\boldsymbol{x}}}$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{a}}{\partial{\boldsymbol{x}}}\boldsymbol{u}^T$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{A}\boldsymbol{u}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\boldsymbol{A}^T$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$, $\boldsymbol{v}=\boldsymbol{v}(\boldsymbol{x})$ | $\frac{\partial{(\boldsymbol{u}+\boldsymbol{v})}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{\boldsymbol{v}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{\boldsymbol{v}}}{\partial{\boldsymbol{x}}}$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{f}(\boldsymbol{g}(\boldsymbol{u}))}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{f}(\boldsymbol{g})}}{\partial{\boldsymbol{g}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{f}(\boldsymbol{g})}}{\partial{\boldsymbol{g}}}$ |
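
The identities above are easy to check numerically. The sketch below is only an illustrative sanity check of the numerator-layout entry $\frac{\partial{\boldsymbol{Ax}}}{\partial{\boldsymbol{x}}}=\boldsymbol{A}$ using central finite differences; the matrix `A`, the point `x0`, and the step size `eps` are made-up values, not anything defined elsewhere in this note.

```
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # arbitrary constant matrix (independent of x)
x0 = rng.normal(size=4)       # arbitrary point at which to evaluate the Jacobian
eps = 1e-6                    # finite-difference step size

# Build the Jacobian of y = A x column by column with central differences.
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = eps
    J[:, j] = (A @ (x0 + e) - A @ (x0 - e)) / (2 * eps)

# In numerator layout the Jacobian of A x is exactly A.
print(np.allclose(J, A))      # True
```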
## 1.1 Linear Models and Least Squares

Given an input vector $X^T=(X_1, X_2,..., X_p)$, where $X^T$ has shape $K\times p$ and each component $X_j$ has shape $K\times 1$, we predict the output $Y$ with the following model:
\begin{equation}\hat{Y}=\hat{\beta}_0+\sum_{j=1}^{p}{X_j\hat{\beta}_j}\tag{1.1}\end{equation}
$\hat{\beta}_0$ is the **intercept**, also known as the **bias** in machine learning. Usually we include the constant variable 1 in $X$ and absorb $\hat{\beta}_0$ into the vector of coefficients $\hat{\beta}$, so that $(1.1)$ can be written as an inner product:
\begin{equation}\hat{Y}={X^T\hat{\beta}}\tag{1.2}\end{equation}
where $X^T$ is the $K\times p$ matrix, or the transpose of the vector $X$ ($X$ being a column vector). In general, if $\hat{Y}$ is a vector of length $K$, then $\beta$ is a $p\times 1$ vector of coefficients. In the $(p+1)$-dimensional input-output space, $(X,\hat{Y})$ represents a hyperplane.

* If the constant 1 is included in $X$, the hyperplane contains the origin and is a subspace.
* If not, it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$.

In what follows we assume that the constant 1 is included in $X$, i.e., the intercept is part of $\beta$.

Viewed over the $p$-dimensional input space ($X^T$ is a $K\times p$ matrix), $f(X)=X^T\beta$ is a linear function. Its gradient $f'(X)=\beta$ is a vector in input space that points in the steepest uphill direction.

Q1: How do we fit this linear model to the training data?

A1: There are many methods, but by far the most popular is least squares. With this method we pick the coefficients $\beta$ that minimize the residual sum of squares ($RSS$):
\begin{equation}RSS(\beta)=\sum_{i=1}^{N}{(y_i-x_i^T \beta)^2} \tag{1.3}\end{equation}

Q2: Why use $RSS$?

A2: By definition, $RSS$ is **the sum of squared differences between the observed values of the dependent variable and the values predicted by the model**. Intuitively, the smaller the gap between observed and predicted values, the more accurate the model. (This raises a series of further questions, e.g., why square the residuals rather than use the fourth power or the absolute value? We set these aside for now.)

From $(1.3)$ we can see that $RSS(\beta)$ is a quadratic function of the parameters $\beta$, so its minimum always exists, although it need not be unique. Writing $RSS(\beta)$ in matrix form and differentiating makes it easy to find the minimum and the corresponding $\beta$ (bold symbols below denote vectors or matrices):
\begin{equation}RSS(\beta)=(\boldsymbol{y}-\boldsymbol{X}\beta)^T(\boldsymbol{y}-\boldsymbol{X}\beta) \tag{1.4}\end{equation}
where $\boldsymbol{X}$ is an $N\times p$ matrix whose rows are the input vectors, and $\boldsymbol{y}$ is the $N$-vector of outputs in the training set. Differentiating with respect to $\beta$ and setting the derivative to zero gives:
\begin{equation}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\beta)=0 \tag{1.5}\end{equation}
If $\boldsymbol{X}^T\boldsymbol{X}$ is nonsingular, this has the unique solution:
\begin{equation}\hat{\beta}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \tag{1.6}\end{equation}
The fitted value at the $i$-th data point $x_i$ is $\hat{y}_i=\hat{y}(x_i)=x_i^T\hat{\beta}$, and the entire fitted surface is characterized by the $p$ parameters $\hat{\beta}$. Intuitively, it seems we do not need a very large data set to fit this model.

The following example fits a simple one-variable linear regression to the iris data set:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Download 'iris.data' from the link below:
#    https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
# 2. Rename 'iris.data' as 'iris.csv'
#    Import 'iris.csv' and create headers:
#    'sepal_length', 'sepal_width', 'petal_length', 'petal_width' and 'class'
dataset = pd.read_csv('iris.csv',
                      names=['sepal_length', 'sepal_width',
                             'petal_length', 'petal_width', 'class'])
print(dataset.shape)
print(dataset.head())

# 3. Initialize input and output data:
X = dataset['sepal_length'].values
Y = dataset['sepal_width'].values

# mean of input and output:
X_mean = np.mean(X)
Y_mean = np.mean(Y)

# total number of data points:
n = len(X)

# Calculate beta1 and beta0
numerator = 0
denominator = 0
for i in range(n):
    numerator += (X[i] - X_mean) * (Y[i] - Y_mean)
    denominator += (X[i] - X_mean) ** 2
beta1 = numerator / denominator
beta0 = Y_mean - (beta1 * X_mean)
print('beta0: ', beta0, 'beta1: ', beta1)

# 4. Data Visualization
X_max = np.max(X)
X_min = np.min(X)

# Calculate line values of x and y
x = np.linspace(X_min, X_max, 1000)
y = beta0 + beta1 * x
plt.plot(x, y, color='#00ff00', label='Linear Regression')

# Plot the data points
plt.scatter(X, Y, color='#ff0000', label='Data Point')

# x-axis label
plt.xlabel('sepal length')
# y-axis label
plt.ylabel('sepal width')
plt.legend()
plt.show()
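
# Cross-check of the closed-form solution (1.6): a minimal illustrative sketch
# using only numpy and the X, Y, n defined above. A column of ones is added to
# the design matrix so the intercept is absorbed into beta, and the normal
# equations (X^T X) beta = X^T y are solved with np.linalg.solve instead of
# explicitly inverting X^T X.
X_design = np.column_stack([np.ones(n), X])    # n x 2 design matrix [1, x_i]
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ Y)
print('closed-form beta0, beta1: ', beta_hat)  # should match beta0 and beta1 above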

# 5. Model evaluation:
# Calculate Root Mean Squared Error (RMSE)
rmse = 0
for i in range(n):
    Y_pred = beta0 + beta1 * X[i]
    rmse += (Y[i] - Y_pred) ** 2
rmse = np.sqrt(rmse / n)
print('Root Mean Squared Error: ', rmse)
# Root Mean Squared Error: 0.43

# Calculate R-square
# Total Sum of Squares (SST), Sum of Squares of Residuals (SSR)
# R-square = 1 - (SSR / SST)
sst, ssr = 0, 0
for i in range(n):
    Y_pred = beta0 + beta1 * X[i]
    sst += (Y[i] - Y_mean) ** 2
    ssr += (Y[i] - Y_pred) ** 2
r_square = 1 - (ssr / sst)
print('R-square: ', r_square)
# R-square: 0.012
```

## Interview Questions

### Linear Regression and Basic Statistics

1. What is the main null hypothesis of a multiple **linear** regression?
    * There is no **linear** relationship between the X variables and the Y variable. Since the model is linear, the null hypothesis assumes that no linear relationship exists, and we hope the evidence is significant enough to reject it.
2. What is a **p-value**?
    * The **p-value** is the level of marginal significance within a statistical hypothesis test, representing the probability of the occurrence of a given event. It is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected.
3. What does **p-value < 0.05** mean?
    * We reject the null hypothesis if **p-value < 0.05**;
    * We fail to reject the null hypothesis if **p-value > 0.05** (for a concrete computation, see the sketch after this list).
4. Significance level ($\alpha$)
    * The significance level $\alpha$ is the probability of rejecting the null hypothesis when it is true.
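
To make questions 1-3 concrete, the sketch below fits the same two iris variables as the example above and reports the two-sided p-value of the slope's t-test, whose null hypothesis is that there is no linear relationship between X and Y. It assumes `iris.csv` has been prepared as in steps 1-2 above and that scipy is installed; `scipy.stats.linregress` is not used elsewhere in this note and is brought in purely for illustration.

```
import pandas as pd
from scipy import stats

# Re-load the same two columns used in the regression example above.
dataset = pd.read_csv('iris.csv',
                      names=['sepal_length', 'sepal_width',
                             'petal_length', 'petal_width', 'class'])
X = dataset['sepal_length'].values
Y = dataset['sepal_width'].values

# linregress tests the slope against the null hypothesis of no linear
# relationship; `pvalue` is the two-sided p-value of that test.
result = stats.linregress(X, Y)
print('slope:     ', result.slope)      # should match beta1 above
print('intercept: ', result.intercept)  # should match beta0 above
print('p-value:   ', result.pvalue)

# Decision at the 0.05 significance level:
if result.pvalue < 0.05:
    print('Reject the null hypothesis: the linear relationship is significant.')
else:
    print('Fail to reject the null hypothesis at the 0.05 level.')
```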