---
# Linear Regression and K-Nearest Neighbors (KNN)
---
## 0.1 Formulas (Matrix Calculus)
### 0.1.1 Derivative of a Vector with Respect to a Vector $\frac{\partial{\boldsymbol{y}}}{\partial{\boldsymbol{x}}}$
| Condition | Expression | Numerator layout (in terms of $\boldsymbol{y}$ and $\boldsymbol{x}^T$) | Denominator layout (in terms of $\boldsymbol{y}^T$ and $\boldsymbol{x}$) |
| -------- | -------- | -------- | -------- |
| $\boldsymbol{a}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{a}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{0}$ | $\boldsymbol{0}$ |
| derivative of $\boldsymbol{x}$ with respect to itself | $\frac{\partial{\boldsymbol{x}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{I}$ | $\boldsymbol{I}$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{Ax}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}$ | $\boldsymbol{A}^T$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$ | $\frac{\partial{\boldsymbol{x}^T\boldsymbol{A}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}^T$ | $\boldsymbol{A}$ |
| $a$ is not a function of $\boldsymbol{x}$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{a\boldsymbol{u}}}{\partial{\boldsymbol{x}}}=$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ |
| $a=a(\boldsymbol{x})$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{a\boldsymbol{u}}}{\partial{\boldsymbol{x}}}=$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\boldsymbol{u}\frac{\partial{a}}{\partial{\boldsymbol{x}}}$ | $a\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{a}}{\partial{\boldsymbol{x}}}\boldsymbol{u}^T$ |
| $\boldsymbol{A}$ is not a function of $\boldsymbol{x}$, $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{Au}}}{\partial{\boldsymbol{x}}}=$ | $\boldsymbol{A}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\boldsymbol{A}^T$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$, $\boldsymbol{v}=\boldsymbol{v}(\boldsymbol{x})$ | $\frac{\partial{(\boldsymbol{u}+\boldsymbol{v})}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{\boldsymbol{v}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}+\frac{\partial{\boldsymbol{v}}}{\partial{\boldsymbol{x}}}$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}$ |
| $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ | $\frac{\partial{\boldsymbol{f}(\boldsymbol{g}(\boldsymbol{u}))}}{\partial{\boldsymbol{x}}}=$ | $\frac{\partial{\boldsymbol{f}(\boldsymbol{g})}}{\partial{\boldsymbol{g}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}$ | $\frac{\partial{\boldsymbol{u}}}{\partial{\boldsymbol{x}}}\frac{\partial{\boldsymbol{g}(\boldsymbol{u})}}{\partial{\boldsymbol{u}}}\frac{\partial{\boldsymbol{f}(\boldsymbol{g})}}{\partial{\boldsymbol{g}}}$ |
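These identities are easy to sanity-check numerically. Below is a minimal sketch (the matrix sizes and random seed are arbitrary choices of mine) that verifies the numerator-layout identity $\frac{\partial{\boldsymbol{Ax}}}{\partial{\boldsymbol{x}}}=\boldsymbol{A}$ with central finite differences:
```
import numpy as np

# Numerical check of the numerator-layout identity d(Ax)/dx = A.
# A is constant with respect to x; the sizes below are arbitrary.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # constant matrix
x = rng.normal(size=4)        # point at which we differentiate
eps = 1e-6

# Build the Jacobian J[i, j] = d(Ax)_i / d(x_j) by central differences.
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = eps
    J[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * eps)

print(np.allclose(J, A))      # True: numerator layout gives A
print(np.allclose(J.T, A.T))  # the transpose corresponds to denominator layout
```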
## 1.1 Linear Models and Least Squares
Given an input vector $X^T=(X_1, X_2,\ldots, X_p)$, where $X^T$ has shape $K\times p$ and each component such as $X_1$ has shape $K\times 1$, we want to predict the output vector $Y$ via the following model:
\begin{equation}\hat{Y}=\hat{\beta}_0+\sum_{j=1}^{p}{X_j\hat{\beta}_j}\tag{1.1}\end{equation}
$\hat{\beta}_0$ is the **intercept**, also called the **bias** in machine learning. It is often convenient to include the constant variable 1 in $X$ and absorb $\hat{\beta}_0$ into the vector of coefficients $\hat{\beta}$, so that $(1.1)$ can be written as an inner product in vector form:\begin{equation}\hat{Y}={X^T\hat{\beta}}\tag{1.2}\end{equation}
where $X^T$ is the transpose of the matrix or vector of size $K\times p$ ($X$ itself being a column vector).
In general, if $\hat{Y}$ is a vector of length $K$, then $\beta$ is a $p\times 1$ matrix of coefficients. In the $(p+1)$-dimensional input-output space, $(X,\hat{Y})$ represents a hyperplane.
* If the constant 1 is included in $X$, the hyperplane passes through the origin and is a subspace.
* If the constant 1 is not included in $X$, the hyperplane is an affine set that cuts the $Y$-axis at the point $(0, \hat{\beta}_0)$.
In the discussion that follows, we assume that the constant 1 is included in $X$, i.e. the intercept is absorbed into $\beta$.
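As a concrete illustration of absorbing the intercept, the sketch below (toy numbers of my own choosing) prepends a column of ones to the inputs so that $(1.2)$ includes the bias term:
```
import numpy as np

# Toy inputs: 4 observations, 2 features (arbitrary numbers).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
beta0 = 0.5                      # intercept (bias)
beta = np.array([1.2, -0.7])     # coefficients of the 2 features

# Absorb the intercept: prepend a constant column of ones to X
# and prepend beta0 to the coefficient vector.
X1 = np.column_stack([np.ones(len(X)), X])
beta_full = np.concatenate([[beta0], beta])

# (1.1) and (1.2) now agree:
y_hat_11 = beta0 + X @ beta        # Y_hat = beta0 + sum_j X_j * beta_j
y_hat_12 = X1 @ beta_full          # Y_hat = X^T beta with the constant absorbed
print(np.allclose(y_hat_11, y_hat_12))   # True
```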
In the $p$-dimensional input space ($X^T$ being a $K\times p$ matrix), $f(X)=X^T\beta$ is a linear function. Its gradient is $f'(X)=\beta$, a vector in the input space that points in the steepest uphill direction.
Q1: How do we fit this linear model to a set of training data?
A1: There are many different methods, but by far the most popular is **least squares**. In this method, we pick the coefficients $\beta$ that minimize the following quantity, the residual sum of squares ($RSS$):
\begin{equation}RSS(\beta)=\sum_{i=1}^{N}{(y_i-x_i^T \beta)^2} \tag{1.3}\end{equation}
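As a quick illustration of $(1.3)$, here is a minimal sketch on synthetic data; the sizes and the trial $\beta$ are arbitrary choices of mine:
```
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))        # each row x_i is an input vector
y = rng.normal(size=N)             # observed outputs
beta = rng.normal(size=p)          # a trial coefficient vector

# RSS(beta) per (1.3): sum over the data points of the squared residuals
rss = sum((y[i] - X[i] @ beta) ** 2 for i in range(N))

# Equivalent vectorized computation
residuals = y - X @ beta
print(rss, residuals @ residuals)  # the two values agree
```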
Q2: Why use $RSS$?
A2: Recall the definition of $RSS$: **the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model**. Intuitively, the smaller the difference between observed and predicted values, the more accurate our model. (This raises further questions, e.g. why square the residuals rather than use the fourth power or the absolute value? We will not discuss these for now.)
From $(1.3)$ we can see that $RSS(\beta)$ is a quadratic function of $\beta$ alone, so its minimum always exists, although it may not be unique. Writing $RSS(\beta)$ in matrix form and differentiating makes it easy to find the minimum of $RSS$ and the corresponding value of $\beta$ (boldface below denotes vectors or matrices):
\begin{equation}RSS(\beta)=(\boldsymbol{y}-\boldsymbol{X}\beta)^T(\boldsymbol{y}-\boldsymbol{X}\beta) \tag{1.4}\end{equation}
where $\boldsymbol{X}$ is an $N\times p$ matrix, each row of which is an input vector, and $\boldsymbol{y}$ is the vector of the $N$ outputs in the training set. Differentiating $(1.4)$ with respect to $\beta$ (using the identities above, in denominator layout) gives $\frac{\partial{RSS}}{\partial{\beta}}=-2\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\beta)$; setting this derivative to zero yields:
\begin{equation}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\beta)=0 \tag{1.5}\end{equation}
If $\boldsymbol{X}^T\boldsymbol{X}$ is nonsingular, the above has the unique solution:
\begin{equation}\hat{\beta}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \tag{1.6}\end{equation}
The fitted value at the $i$-th data point $x_i$ is $\hat{y}_i=\hat{y}(x_i)=x_i^T\hat{\beta}$. The entire fitted surface is characterized by the $p$ parameters $\hat{\beta}$, so intuitively it seems that we do not need a very large data set to fit this model.
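Before the univariate worked example below, here is a minimal sketch of $(1.6)$ on synthetic multivariate data; the sizes, the noise level, and the use of `np.linalg.lstsq` (numerically preferable to forming the inverse explicitly) are my own choices:
```
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # constant column included
beta_true = np.array([0.5, 2.0, -1.0, 3.0])                 # intercept + 3 coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)                # outputs with a little noise

# Least-squares estimate; equivalent to (X^T X)^{-1} X^T y when X^T X is nonsingular
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to beta_true
```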
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 1. Download 'iris.data' from the link below:
# https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
# 2. Rename 'iris.data' as 'iris.csv'
# Import 'iris.csv' and create headers:
# 'sepal_length', 'sepal_width', 'petal_length', 'petal_width' and 'class'
dataset = pd.read_csv('iris.csv',
names=['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'class']
)
print(dataset.shape)
print(dataset.head())
# 3. Initialize input and output data:
X = dataset['sepal_length'].values
Y = dataset['sepal_width'].values
# mean of input and output:
X_mean = np.mean(X)
Y_mean = np.mean(Y)
# total number of data points:
n = len(X)
# Calculate beta1 and beta0
numerator = 0
denominator = 0
for i in range(n):
numerator += (X[i] - X_mean) * (Y[i] - Y_mean)
denominator += (X[i] - X_mean) ** 2
beta1 = numerator / denominator
beta0 = Y_mean - (beta1 * X_mean)
print('beta0: ', beta0, 'beta1: ', beta1)
# 4. Data Visualization
X_max = np.max(X)
X_min = np.min(X)
# Calculate line values of x and y
x = np.linspace(X_min, X_max, 1000)
y = beta0 + beta1 * x
plt.plot(x, y, color='#00ff00', label='Linear Regression')
# Plot the data point
plt.scatter(X, Y, color='#ff0000', label='Data Point')
# x-axis label
plt.xlabel('sepal length')
# y-axis label
plt.ylabel('sepal width')
plt.legend()
plt.show()
# 5. Model evaluation:
# Calculate Root Mean Squared Error (RMSE)
rmse = 0
for i in range(n):
Y_pred = beta0 + beta1 * X[i]
rmse += (Y[i] - Y_pred) ** 2
rmse = np.sqrt(rmse / n)
print('Root Mean Squared Error: ', rmse) # Root Mean Squared Error: 0.43
# Calculate R-square
# Total Sum of Squares (SST), Sum of Squares of Residuals (SSR)
# R-square = 1 - (SSR / SST)
sst, ssr = 0, 0
for i in range(n):
Y_pred = beta0 + beta1 * X[i]
sst += (Y[i] - Y_mean) ** 2
ssr += (Y[i] - Y_pred) ** 2
r_square = 1 - (ssr / sst)
print('R-square: ', r_square) # R-square: 0.012
```
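The same intercept and slope can be recovered directly from the closed form $(1.6)$. A minimal sketch, assuming the arrays `X`, `Y` and the count `n` from the code above are still in scope:
```
# Recover beta0 and beta1 from the closed form (1.6), reusing X, Y and n from above.
X_design = np.column_stack([np.ones(n), X])          # design matrix with a constant column
beta_hat = np.linalg.solve(X_design.T @ X_design,    # solve the normal equations (1.5)
                           X_design.T @ Y)
print('beta0:', beta_hat[0], 'beta1:', beta_hat[1])  # should match beta0, beta1 above
```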
## Interview Questions
### Linear Regression and Basic Statistics
1. What is the main null hypothesis of a multiple **linear** regression?
* There is no **linear** relationship between the X variables and the Y variable. Since the model is linear, the null hypothesis assumes that no linear relationship exists; we then hope that a significance test lets us reject it.
2. What is a **p-value**?
* The **p-value** is the marginal significance level within a statistical hypothesis test: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. It is used as an alternative to pre-specified rejection points, giving the smallest level of significance at which the null hypothesis would be rejected.
3. What does **p-value < 0.05** mean? (A numerical sketch follows this list.)
* We reject the null hypothesis if **p-value < 0.05**;
* We fail to reject the null hypothesis if **p-value ≥ 0.05**;
4. Significance Level ($\alpha$)
* The significance level $\alpha$ is the probability of rejecting the null hypothesis when it is in fact true.
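As a concrete illustration of these ideas, `scipy.stats.linregress` reports a two-sided p-value for the null hypothesis that the slope is zero (i.e. no linear relationship). A minimal sketch, assuming SciPy is available and reusing the iris arrays `X` and `Y` from the regression code above:
```
from scipy import stats

# Null hypothesis: the slope is zero, i.e. no linear relationship between X and Y.
# X and Y are the sepal_length / sepal_width arrays loaded earlier.
result = stats.linregress(X, Y)
print('slope:', result.slope, 'intercept:', result.intercept)
print('p-value:', result.pvalue)

alpha = 0.05                      # significance level
if result.pvalue < alpha:
    print('Reject the null hypothesis at the 5% level.')
else:
    print('Fail to reject the null hypothesis at the 5% level.')
```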