# Statistical Learning, Midterm 1
https://drive.google.com/file/d/1QF-MI33p1V0bxDZvKXWQX8lC1ikCD3MI/view?usp=sharing
Bogdan's book with chapter about multiple testing and model selection
## a)

IMPORTANT: $\hat\beta_{LS} = (X'X)^{-1}X'Y$; if $x\sim N(\mu,\Sigma)$ then $Ax\sim N(A\mu, A\Sigma A')$
- we know that $\hat\beta_{LS} \sim N(\beta, \sigma^2(X'X)^{-1})$; $\sigma^2$ is estimated by $\hat\sigma^2 = \frac{||Y - X\hat\beta_{LS}||^2}{n - p} = \frac{RSS}{n - p}$, so the std of the $i$-th coordinate of the beta vector is $\sqrt{\sigma^2(X'X)^{-1}_{i,i}}$

T test (we test whether $\beta_i=\mu=0$):
$T_i = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)} = \frac{\hat\beta_i}{S(\hat\beta_i)}$; it has a Student's t distribution with $n-k$ degrees of freedom, where
$S(\hat\beta_i) = \sqrt{\hat\sigma^2(X'X)^{-1}_{i,i}}$
Theoretical version (if we know $\sigma$):
$S(\hat\beta_i) = \sqrt{\sigma * \frac{n}{n-k-1}}$
95% confidence interval for $\beta_i$ (i.e. for $\mu$ in $T = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)}$), based on the T test:
$c = t_{crit} = q_{tstudent, n-k}(0.975)$
$$Pr(-c \leq T \leq c) = 0.95$$
$$Pr(-c -\frac{\hat\beta_i}{S(\hat\beta_i)} \leq -\frac{\mu}{S(\hat\beta_i)} \leq c -\frac{\hat\beta_i}{S(\hat\beta_i)}) = 0.95$$
$$Pr(\hat\beta_i - {c}\cdot{S(\hat\beta_i)} \leq \mu \leq \hat\beta_i + {c}\cdot{S(\hat\beta_i)}) = 0.95$$
In the case of a Z-test (whether $\beta_i=\mu=0$; the variance is known): $Z = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)} = \frac{\hat\beta_i}{S(\hat\beta_i)}$, it has distribution $N(0,1)$, where $S(\hat\beta_i) = \sqrt{\sigma^2(X'X)^{-1}_{i,i}}$
$c = z_{crit} = q_{standard\ normal}(0.975)$
$$Pr(-c \leq Z \leq c) = 0.95$$
$$Pr(\hat\beta_i - {c}\cdot{S(\hat\beta_i)} \leq \mu \leq \hat\beta_i + {c}\cdot{S(\hat\beta_i)}) = 0.95$$
- $confRange = [\hat\beta_i - {c}\cdot S(\hat\beta_i), \hat\beta_i + {c}\cdot S(\hat\beta_i)]$
- $confLen = 2 * c * S(\hat\beta_i)$
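A minimal numerical sketch of the t-based interval above (numpy/scipy; the simulated $X$, $\beta$, $\sigma$ are arbitrary and just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.0, -0.5])
sigma = 2.0
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
rss = np.sum((Y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)                      # hat sigma^2 = RSS / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # S(beta_hat_i)

c = stats.t.ppf(0.975, df=n - p)                # t critical value
conf_low, conf_high = beta_hat - c * se, beta_hat + c * se
conf_len = 2 * c * se
print(np.column_stack([conf_low, conf_high]), conf_len)
```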
Power of the significance test: the probability that we reject the null hypothesis when it is false (so we did the right thing). The critical value is
$c = z_{crit} = q_{standard\ normal}(0.975)$ (Z-test), or
$c = t_{crit} = q_{tstudent, n-k}(0.975)$ (T-test)
Power of test, when we know true $\mu = \theta \not=0$:
$Pr(T > c) + Pr(T < -c)$
$Pr(T > c) = Pr(\frac{\hat\beta_i -\mu}{S(\hat\beta_i)} > c \ |\ \mu=\theta\not=0)$
$=Pr(\frac{\hat\beta_i}{S(\hat\beta_i) } > c | \mu=\theta) = Pr(\frac{\hat\beta_i -\theta + \theta}{S(\hat\beta_i) } > c | \mu=\theta)$
$=1 - Pr(\frac{\hat\beta_i -\theta}{S(\hat\beta_i) } < c - \frac{\theta}{S(\hat\beta_i) }| \mu=\theta)$
$Pr(T < -c) = Pr(\frac{\hat\beta_i -\theta + \theta}{S(\hat\beta_i) } < -c | \mu=\theta)$
$=Pr(\frac{\hat\beta_i -\theta}{S(\hat\beta_i) } < -c - \frac{\theta}{S(\hat\beta_i) }| \mu=\theta)$
$\frac{\hat\beta_i -\theta}{S(\hat\beta_i) }$ follows (approximately) a Student's t distribution with $n-k$ df, or a standard normal in the Z-test case, so:
Power:
$B(\theta) \approx (1 - F(c - \frac{\theta}{S(\hat\beta_i)})) + F(-c -\frac{\theta}{S(\hat\beta_i)})$, where $F$ is the CDF of that null distribution (t or standard normal)
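A quick sketch of the power formula for the Z-test case ($\sigma$ known, so $F = \Phi$; `z_test_power` is an illustrative name):

```python
from scipy import stats

def z_test_power(theta, se, alpha=0.05):
    """Power B(theta) = P(T > c) + P(T < -c) when the true coefficient is theta."""
    c = stats.norm.ppf(1 - alpha / 2)
    shift = theta / se
    return (1 - stats.norm.cdf(c - shift)) + stats.norm.cdf(-c - shift)

# example: theta = 1, S(beta_hat_i) = 0.4
print(z_test_power(1.0, 0.4))
```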
## b)

Covariance matrix of $\hat\beta_{LS}$ is $\sigma^2(X'X)^{-1}$
If the rows of $X$ are i.i.d. $N(0, I)$, then $X'X$ follows a Wishart distribution, so $(X'X)^{-1}$ follows an inverse Wishart distribution (with scale matrix $\Sigma^{-1}=I$).
The mean of the inverse Wishart is $E[W^{-1}]=\frac{\Sigma^{-1}}{n-p-1}$, so in our case:
$E[(X'X)^{-1}] = \frac{I}{n-p-1}$
So $E[Var(\hat\beta_i)] = \sigma^2\cdot E[(X'X)^{-1}_{i,i}]=\frac{\sigma^2}{n-p-1}$
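A quick Monte Carlo check (a sketch, assuming the rows of $X$ are i.i.d. $N(0, I)$) that $E[(X'X)^{-1}]\approx \frac{I}{n-p-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 5, 20000
acc = np.zeros((p, p))
for _ in range(reps):
    X = rng.normal(size=(n, p))          # rows ~ N(0, I), so X'X is Wishart(n, I)
    acc += np.linalg.inv(X.T @ X)
print(acc[0, 0] / reps, 1 / (n - p - 1))  # both should be close
```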
## c)

**Bonferroni**: multiply the $p$-values by $n$ (the number of tests) and compare to the desired significance level (reject if $p\cdot n < \alpha$)
**Benjamini-Hochberg**:
1. Sort the $p$-values: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(n)}$
2. Find the largest $i$ s.t. $p_{(i)} \leq \frac{i}{n}q$, where $q\in[0,1]$
3. Reject all $H_j$ for $j\leq i$
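A minimal sketch of both procedures on a vector of p-values (plain numpy, no multiple-testing library; function names are illustrative):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when n * p_i < alpha."""
    pvals = np.asarray(pvals)
    return pvals * len(pvals) < alpha

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg: reject the hypotheses with the i* smallest p-values,
    where i* is the largest i with p_(i) <= (i/n) * q."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= (np.arange(1, n + 1) / n) * q
    reject = np.zeros(n, dtype=bool)
    if below.any():
        i_star = np.max(np.nonzero(below)[0])   # index of the largest i satisfying the bound
        reject[order[: i_star + 1]] = True
    return reject

pv = [0.001, 0.008, 0.039, 0.041, 0.2, 0.6]
print(bonferroni_reject(pv), bh_reject(pv))
```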
## d)

FD - false discovery
p - size of the model
<!-- $E[FDR] = E[\frac{FD}{D}] \approx \frac{E[FD]}{E[D]} = \frac{power\cdot}{\alpha\cdot V}$ -->
$E[FD] = \sum_i P(\beta_i \text{ is FD}) = \sum_i \mathbb{1}(\beta_i \text{ is not significant})\,P(H_i \text{ rejected} \mid \beta_i\text{ is not significant}) = (n-p)\cdot\alpha$
## e)


S - number of true alternative hypotheses called significant (true positives)
V - number of true null hypotheses called significant (false positives)
$FDP = \frac{V}{S+V}$ (defined as 0 when $S+V=0$)
## f)

Stein's unbiased risk estimate - SURE
$h(x) = x + g(x)$ is an estimator of $\mu$ based on $x$, where $x_i \sim N(\mu_i, \sigma^2)$, $i=1,\ldots,n$, are independent normal variables with the same variance
$SURE(h) = -n\sigma^2 + \lVert g(x)\rVert^2 + 2\sigma^2 \sum_{i}^{n}{\frac{d}{dx_i}h_i(x)}$
it is unbiased estimate of MSE:
$E[SURE(h)] = MSE(h) = E[\lVert h(x) - \mu\rVert^2]$
consider a new sample: $Y' = X\beta + \epsilon'$, $\mu=X\beta$
$PE = E\lVert Y' - \hat Y\rVert^2 = E\lVert X(\beta - \hat\beta) + \epsilon'\rVert^2 = E\lVert X(\beta - \hat\beta)\rVert^2 + 2E[(X(\beta - \hat\beta))'\epsilon'] + E\lVert\epsilon'\rVert^2$
$= E\lVert X(\beta - \hat\beta)\rVert^2 + 0 + \sum_i(Var[\epsilon'_i] + E[\epsilon'_i]^2)$
$= E\lVert\mu - \hat\mu\rVert^2 + n\cdot\sigma^2$
IMPORTANT: If $\hat Y = M_{n \times n} Y$ then $PE = E(RSS) + 2\sigma^2Tr(M)$
Proof, using SURE:
$\hat\mu = h(Y) = MY = \hat Y = Y + MY - Y = Y + g(Y)$
$g(Y) = \hat Y - Y = MY - Y$
$E[SURE(h)] = E\lVert\hat\mu - \mu\rVert^2 = -n\sigma^2 + E\lVert g(Y)\rVert^2 + 2\sigma^2 E[\sum_i^n\frac{d}{dY_i}h_i(Y)]$
$= -n\sigma^2 + E(RSS) + 2\sigma^2TrM$
So $PE = E\lVert\mu - \hat\mu\rVert^2 + n\cdot\sigma^2 = E(RSS) + 2\sigma^2TrM$
In Least Squares $M = X(X'X)^{-1}X'$, $Tr(M) = rank(X)$
If $rank(X) = p$, an unbiased estimate of PE is
$\widehat{PE} = RSS + 2\sigma^2p$
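A simulation sketch (arbitrary $n$, $p$, $\sigma$) checking that $RSS + 2\sigma^2 p$ tracks $PE = E\lVert Y' - \hat Y\rVert^2$ for least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 80, 6, 1.0, 5000
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
mu = X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection matrix, Tr(H) = p
pe_true, pe_est = 0.0, 0.0
for _ in range(reps):
    Y = mu + sigma * rng.normal(size=n)
    Y_hat = H @ Y
    rss = np.sum((Y - Y_hat) ** 2)
    pe_est += rss + 2 * sigma**2 * p          # unbiased estimate of PE
    Y_new = mu + sigma * rng.normal(size=n)   # fresh sample Y' = X beta + eps'
    pe_true += np.sum((Y_new - Y_hat) ** 2)
print(pe_est / reps, pe_true / reps)          # both approximate PE
```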
## g)

We have the $RSS$ of the different candidate models; each model has its own matrix $M_k$
$L(X, \theta) = \prod^n_{i=1}f(X_i, \theta)$
$AIC(M_k) = \log L(X, \theta_{MLE}) - k$
if errors are iid then:
$L(Y|X,\beta,\sigma) = (\frac{1}{\sigma\sqrt{2\pi}})^n e^{-\frac{\lVert Y - X\beta\rVert^2}{2\sigma^2}}$
$AIC(M_k) = C - n\log(\sigma) - \frac{\lVert Y - X\beta\rVert^2}{2\sigma^2} - k = C - n\log(\sigma) - \frac{RSS}{2\sigma^2} - k$
if $\sigma$ is known, then maximizing $AIC$ corresponds to minimizing $RSS + 2\sigma^2k$, which is the $PE$ estimate
if $\sigma$ is unknown, we plug in $\sigma^2_{MLE} = \frac{RSS}{n}$, which gives
$AIC(M_k) = C - \frac{n}{2}\log(RSS) - k$
so maximizing $AIC$ corresponds to minimizing $n\log(RSS) + 2k$
To find the best model given RSS and k we minimize:
- AIC: $RSS + 2\sigma^2k$ ($\sigma$ known) or $n\log(RSS) + 2k$ ($\sigma$ unknown)
- BIC: $RSS + \sigma^2\cdot k\cdot \log(n)$
- RIC: $RSS + \sigma^2 \cdot 2k \cdot \log(p)$
- mBIC: $RSS + k\sigma^2(\log n +2\log(\frac{p}{C}))$
- mBIC2: $RSS + \sigma^2 \cdot (k\log(n) + 2k\log(p/4) - 2\log(k!))$
where C is prior expected number of nonzero coefficients, p is the number of available predictors
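A sketch that scores candidate models by these criteria given their RSS and size $k$ ($\sigma^2$ assumed known; the RSS values and C below are made up for illustration):

```python
import numpy as np
from scipy.special import gammaln   # log(k!) = gammaln(k + 1)

def criteria(rss, k, n, p, sigma2, C=4):
    return {
        "AIC":   rss + 2 * sigma2 * k,
        "BIC":   rss + sigma2 * k * np.log(n),
        "RIC":   rss + sigma2 * 2 * k * np.log(p),
        "mBIC":  rss + sigma2 * k * (np.log(n) + 2 * np.log(p / C)),
        "mBIC2": rss + sigma2 * (k * np.log(n) + 2 * k * np.log(p / 4) - 2 * gammaln(k + 1)),
    }

# hypothetical RSS values for nested models of size k = 1..4, with n = 100, p = 50
for k, rss in enumerate([140.0, 110.0, 104.0, 103.0], start=1):
    print(k, criteria(rss, k, n=100, p=50, sigma2=1.0))
```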
## h)

Residuals: $r = Y - \hat Y$
Projection matrix H: $X(X^TX)^{-1}X^T$
$RSS = \lVert r\rVert^2$
$PE_{CV} = \sum_{i=1}^n(\frac{r_i}{1 - H_{ii}})^2$
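A minimal sketch of this leave-one-out formula via the hat matrix (no CV loop needed; `loo_pe` is an illustrative name):

```python
import numpy as np

def loo_pe(X, Y):
    """Leave-one-out prediction error via the hat-matrix shortcut."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    r = Y - H @ Y                      # residuals
    h = np.diag(H)                     # leverages H_ii
    return np.sum((r / (1 - h)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ np.array([1.0, 0.0, 0.5, 0.0]) + rng.normal(size=60)
print(loo_pe(X, Y))
```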
## i)

Assume $X'X = I$
$\hat\beta = X'Y$
$RSS = (Y - X\hat\beta)'(Y - X\hat\beta) = Y'Y - Y'X\hat\beta - \hat\beta'X'Y + \hat\beta'X'X\hat\beta$
$= Y'Y - 2Y'X\hat\beta + \hat\beta'X'X\hat\beta = Y'Y - 2\hat\beta'\hat\beta + \hat\beta'\hat\beta$
$RSS = Y'Y - \hat\beta'\hat\beta = Y'Y - \sum_i^k\hat\beta_i^2$
Adding a new variable increases the penalty by $2\sigma^2$ and decreases RSS by $\hat\beta_i^2$, so we include it iff $\hat\beta_i^2 \geq 2\sigma^2$, i.e. $|\hat\beta_i| \geq \sqrt{2}\sigma$.
AIC: minimize $RSS + 2\sigma^2k$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{2}\sigma$
when $\beta_i=0$, $\hat\beta_i \sim N(0, \sigma^2)$, so the probability of a type I error ($F$ is the standard normal CDF):
$P(X_i \ is\ selected | \beta_i=0) = P(|\hat\beta_i| \geq \sqrt{2}\sigma | \beta_i=0) = 2(1 - F(\sqrt{2})) \approx 0.16$
BIC: $RSS + \sigma^2\cdot k\cdot log(n)$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{log\ n}\sigma$
$P(X_i \ is\ selected | \beta_i=0) = P(|\hat\beta_i| \geq \sqrt{log\ n}\sigma | \beta_i=0) = 2(1 - F(\sqrt{log\ n}))$
RIC: $RSS + \sigma^2 \cdot 2k \cdot log(p)$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{2log\ p}\ \sigma$
$P(X_i \ is\ selected | \beta_i=0) = 2(1 - F(\sqrt{2log\ p}))$
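These probabilities can be evaluated directly; a sketch with $\Phi = F$ the standard normal CDF ($n$ and $p$ below are arbitrary):

```python
import numpy as np
from scipy import stats

def type1_prob(threshold_over_sigma):
    """P(|beta_hat_i| >= t*sigma | beta_i = 0) = 2(1 - Phi(t)) when beta_hat_i ~ N(0, sigma^2)."""
    return 2 * (1 - stats.norm.cdf(threshold_over_sigma))

n, p = 1000, 100
print("AIC:", type1_prob(np.sqrt(2)))            # ~0.157
print("BIC:", type1_prob(np.sqrt(np.log(n))))
print("RIC:", type1_prob(np.sqrt(2 * np.log(p))))
```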
## j)

We use:
- AIC: when the goal is good prediction (it estimates PE); it tends to select too many variables when p is large
- BIC: when we want consistent model selection (for fixed p and $n\to\infty$ it recovers the true model); still too liberal when p is comparable to n
- RIC: when the number of candidate predictors p is large; the $2\log(p)$ penalty acts like a Bonferroni correction over the p predictors
- mBIC: in sparse high-dimensional regression; the extra $2\log(p/C)$ term (C = prior expected number of nonzero coefficients) makes it control FWER
- mBIC2: like mBIC but less conservative thanks to the $-2\log(k!)$ term; it controls FDR
## k)

Ridge regression:
$\hat\beta = argmin_{b}L(b)$, where $L(b) = \lVert Y - Xb\rVert^2 + \gamma\lVert b \rVert^2$
IMPORTANT:
$\hat\beta = (X'X + \gamma I)^{-1}X'Y$, where $\gamma > 0$
Projection matrix:
$M = X(X'X + \gamma I)^{-1}X'$
$Tr[M] = Tr[((X'X + \gamma I)^{-1}X'X)]$
$Tr[M] = \sum_i^p\lambda_i(M)$, $\lambda_i(M)$ are eigenvalues of M

IMPORTANT (proof via SURE, exactly as in f)): if $\hat Y = M_{n \times n} Y$ then $PE = E(RSS) + 2\sigma^2Tr(M)$
In Ridge Regression:
$PE = RSS + 2\sigma^2\sum_i^p\frac{\lambda_i(X'X)}{\lambda_i(X'X) + \gamma}$
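A sketch computing the ridge trace term $\sum_i \frac{\lambda_i}{\lambda_i+\gamma}$ (the effective degrees of freedom) and the resulting PE estimate (`ridge_pe_estimate` and the simulated data are illustrative):

```python
import numpy as np

def ridge_pe_estimate(X, Y, gamma, sigma2):
    p = X.shape[1]
    beta_hat = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ Y)
    rss = np.sum((Y - X @ beta_hat) ** 2)
    lam = np.linalg.eigvalsh(X.T @ X)               # eigenvalues of X'X
    eff_df = np.sum(lam / (lam + gamma))            # = Tr(M) for the ridge hat matrix
    return rss + 2 * sigma2 * eff_df

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
Y = X @ np.r_[np.ones(3), np.zeros(7)] + rng.normal(size=80)
print(ridge_pe_estimate(X, Y, gamma=5.0, sigma2=1.0))
```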
## l)

## m)


## n)

(google: LASSO orthogonal design power?)
It is covered in the second HackMD, but the power part may be wrong: https://hackmd.io/WC_6RbBoQg6qslAbH_WU6g#LASSO
## o) (new)
o) Consider the adaptive LASSO with $\lambda_i = w_i\lambda$.
i) How can you calculate the adaptive LASSO estimator using a numerical solver for LASSO?
ii) In the orthogonal case ($X'X=I$), calculate the value of the adaptive LASSO estimator for a specific coordinate of the beta vector. Given the true value of this parameter, calculate the bias, variance and mean squared error of this estimator.
(i) Absorb the weights into the design matrix: rescale column $j$ of $X$ by $1/w_j$,
$\tilde X_j = X_j / w_j$,
and keep $y$ unchanged. Running an ordinary LASSO solver on $(\tilde X, y)$ with the single penalty parameter $\lambda$ solves
$\min_b \lVert y - \tilde X b\rVert^2 + \lambda\lVert b\rVert_1$,
and the substitution $b_j = w_j\beta_j$ shows this is exactly the adaptive LASSO problem with penalties $\lambda_j = w_j\lambda$. The adaptive LASSO estimator is then recovered by rescaling back:
$\hat\beta_j^{adaptive} = \hat b_j^{lasso} / w_j$.
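A sketch of this reparametrization using scikit-learn's `Lasso` (assumption: sklearn minimizes $\frac{1}{2n}\lVert y - Xb\rVert^2 + \alpha\lVert b\rVert_1$, so $\alpha = \lambda/(2n)$ matches the penalty $\lambda\sum_j w_j|\beta_j|$ added to the RSS; the weights below are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, weights, lam):
    """Adaptive LASSO via a plain LASSO solver: rescale column j by 1/w_j,
    solve the ordinary LASSO, then rescale the coefficients back."""
    X_tilde = X / weights                        # column j divided by w_j
    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha ||b||_1,
    # so alpha = lam / (2n) corresponds to ||y - Xb||^2 + lam ||b||_1
    model = Lasso(alpha=lam / (2 * len(y)), fit_intercept=False)
    model.fit(X_tilde, y)
    return model.coef_ / weights                 # beta_j = b_j / w_j

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)
w = np.array([0.5, 2.0, 2.0, 0.8, 2.0])          # illustrative weights, e.g. 1/|beta_OLS|
print(adaptive_lasso(X, y, w, lam=1.0))
```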
(ii) todo
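A Monte Carlo sketch for (ii), under the assumption that the objective is $\lVert Y - Xb\rVert^2 + \sum_i \lambda_i|b_i|$ with $X'X=I$, so that $\hat\beta_i^{OLS} = (X'Y)_i \sim N(\beta_i, \sigma^2)$ and the adaptive LASSO coordinate is the soft-thresholded value $sign(\hat\beta_i)(|\hat\beta_i| - \lambda_i/2)_+$; the script estimates its bias, variance and MSE numerically (not the closed-form answer the question asks for):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft thresholding: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# orthogonal design: beta_hat_OLS_i ~ N(beta_i, sigma^2); for the objective
# ||Y - Xb||^2 + sum_i lambda_i |b_i| the adaptive LASSO coordinate is
# soft_threshold(beta_hat_OLS_i, lambda_i / 2)
rng = np.random.default_rng(0)
beta_i, sigma, lam_i, reps = 1.0, 1.0, 2.0, 200_000
beta_ols = rng.normal(beta_i, sigma, size=reps)
beta_al = soft_threshold(beta_ols, lam_i / 2)

bias = beta_al.mean() - beta_i
var = beta_al.var()
mse = np.mean((beta_al - beta_i) ** 2)
print(bias, var, mse)                      # MSE = bias^2 + variance (up to MC error)
```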
## o) (old)

https://andrewcharlesjones.github.io/journal/irrepresentable.html
## p)

What is $X_A$?
## q)

We prefer the elastic net over LASSO when there are many variables, we have no prior assumption about how many of them should be zero, and some of the predictors are correlated. (~maciej)
Lasso regression tends to pick just one of a group of correlated variables and eliminates the others,
whereas ridge regression tends to shrink all of the parameters of the correlated variables together.
By combining the Lasso and Ridge penalties, the elastic net groups and shrinks the parameters associated with the correlated variables, and either leaves them all in the equation or removes them all at once.
The main difference between these methods:
the elastic net minimizes RSS with an L1 penalty (as in LASSO) plus an L2 penalty (as in Ridge), with two lambdas (how to choose them?).
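A small illustration of the correlated-predictors point with scikit-learn (the alpha/l1_ratio values are arbitrary): with two nearly identical columns, Lasso typically keeps one and drops the other, while the elastic net shrinks both together.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # almost perfectly correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = 2 * x1 + 2 * x2 + rng.normal(size=n)

lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.3, fit_intercept=False).fit(X, y)
print("lasso:      ", lasso.coef_)           # typically one of the first two is ~0
print("elastic net:", enet.coef_)            # the first two coefficients stay similar
```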
## r)

Lasso and elastic net perform variable selection (some coefficients are set exactly to zero), whereas ridge regression only shrinks the coefficients.