# Ridge regression
* a regularized form of linear regression
* a technique used in regression analysis to cope with **multicollinearity**, where independent variables are highly correlated
* By introducing a penalty parameter $\lambda$ (lambda), it adds a small amount of bias to the regression estimates in exchange for reduced standard errors
* The essence of Ridge Regression is to find a new line that doesn't fit the training data as perfectly as ordinary least squares (OLS), in exchange for improved generalization to new data
## objective of the regression
* $J(\theta) = \Vert A \theta - b \Vert^2_2 + \Vert \Gamma \theta \Vert^2_2$
* $A$: an $n \times d$ matrix; each of the $n$ samples is described by a $d$-dimensional feature vector
* $b$: an $n \times 1$ vector of the true values corresponding to each sample
* $\theta$: a $d \times 1$ vector, the parameters the regression learns
* $A\theta$ is the regression's prediction
* $\Vert \Gamma \theta \Vert^2_2$: the L2 regularization (penalty) term
* $\Gamma = \sqrt{\lambda}\, I$, so the penalty equals $\lambda \Vert \theta \Vert^2_2$
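A minimal NumPy sketch of this objective, assuming $\Gamma = \sqrt{\lambda}\, I$ as above (the function name `ridge_objective` and the array values are illustrative only, not from the notes):

```python
import numpy as np

def ridge_objective(A, b, theta, lam):
    """J(theta) = ||A @ theta - b||_2^2 + lam * ||theta||_2^2"""
    residual = A @ theta - b
    return residual @ residual + lam * (theta @ theta)

# n = 3 samples, d = 2 features (values are illustrative)
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.1])
print(ridge_objective(A, b, theta, lam=0.1))
```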
## solution of ridge regression($\theta$)
* $\hat{\theta} = (X^T X + \lambda I)^{-1}X^T Y$ (writing the design matrix as $X$ and the targets as $Y$, i.e. $X = A$, $Y = b$ from above)
* proof: set $\frac{\partial{J(\theta)}}{\partial{\theta}} = 0$ and solve for $\theta$, as spelled out below
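Spelling out the proof (a standard derivation, using $\Vert \Gamma \theta \Vert^2_2 = \lambda \Vert \theta \Vert^2_2$ as above):

$$
\begin{aligned}
J(\theta) &= (X\theta - Y)^\top (X\theta - Y) + \lambda\, \theta^\top \theta \\
\frac{\partial J(\theta)}{\partial \theta} &= 2X^\top(X\theta - Y) + 2\lambda\,\theta = 0 \\
(X^\top X + \lambda I)\,\theta &= X^\top Y \\
\hat{\theta} &= (X^\top X + \lambda I)^{-1} X^\top Y
\end{aligned}
$$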
## example
* observation history: $X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_t \end{bmatrix}$ ($t \times d$)
* $x_i$ ($1 \times d$) is the observation (input) at time step $i$
* $x_1$ = [1,2,3,4]
* result history: $Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_t \end{bmatrix}$ ($t \times 1$)
* $y_i$ ($1 \times 1$) is the result (output) at time step $i$
* Prediction
1. Use the regression to compute $\hat{\theta} = A^{-1} B$
1. $A = X^T X + \lambda I$ ($d \times d$)
2. $B = X^T Y$ ($d \times 1$)
2. Predict $\hat{y}_t = x_t \hat{\theta}$
* Update (as each new sample $(x_t, y_t)$ arrives one at a time)
* NOTE: Although the regression builds the model from all past data, we do not have to recompute the matrix product $X^T X$ over the entire history every time a new sample arrives. We only need the previous round's $A$ and $B$ and fold the new sample $(x_t, y_t)$ in as follows, which costs far less computation (a minimal sketch follows after this list)!
1. $A = A + {x_t}^T x_t$
2. $B = B + {x_t}^T y_t$
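A minimal NumPy sketch of this example, covering both the prediction and the incremental update (the class name `OnlineRidge` and its methods are illustrative, not from any library):

```python
import numpy as np

class OnlineRidge:
    """Keeps A = X^T X + lambda*I and B = X^T Y, updated one sample at a time."""

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)       # starts as lambda*I, accumulates x_t^T x_t
        self.B = np.zeros((d, 1))      # accumulates x_t^T y_t
        self.theta = np.zeros((d, 1))  # current estimate theta_hat = A^{-1} B

    def predict(self, x):
        x = x.reshape(1, -1)            # x is 1 x d
        return (x @ self.theta).item()  # y_hat = x theta_hat

    def update(self, x_t, y_t):
        x_t = x_t.reshape(1, -1)
        self.A += x_t.T @ x_t                         # A <- A + x_t^T x_t
        self.B += x_t.T * y_t                         # B <- B + x_t^T y_t
        self.theta = np.linalg.solve(self.A, self.B)  # theta_hat = A^{-1} B

# usage: predict with the previous round's theta, then fold in the new sample
model = OnlineRidge(d=4, lam=0.1)
for x_t, y_t in [(np.array([1.0, 2.0, 3.0, 4.0]), 2.0),
                 (np.array([2.0, 0.0, 1.0, 3.0]), 1.5)]:
    print(model.predict(x_t))
    model.update(x_t, y_t)
```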
## consideration
1. Choosing $\lambda$: The choice of $\lambda$ is crucial. If $\lambda$ is too large, it shrinks the coefficients too aggressively, leading to underfitting. If $\lambda$ is too small, the model approaches OLS, potentially leading to overfitting. The optimal value of $\lambda$ is usually selected via cross-validation (see the sketch after this list).
2. Scaling: It is important to scale or standardize features before applying Ridge Regression because the regularization term $\lambda \Vert \theta \Vert^2_2$ penalizes all coefficients equally; the scale of the features therefore influences how strongly each coefficient is penalized.
3. Bias and Variance: Ridge Regression reduces the variance of the estimators at the cost of introducing some bias, a trade-off that typically leads to better performance on new, unseen data.
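A short scikit-learn sketch of points 1 and 2, assuming scikit-learn is available (`RidgeCV` calls the penalty `alpha` and selects it by cross-validation; the toy data below is illustrative only):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data: 100 samples, 4 features, with one near-duplicate column (multicollinearity)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)

# standardize the features, then pick lambda (alpha) by 5-fold cross-validation
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5),
)
model.fit(X, y)
print("chosen alpha:", model[-1].alpha_)
print("coefficients:", model[-1].coef_)
```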