# Statistical Learning, Midterm 1
https://drive.google.com/file/d/1QF-MI33p1V0bxDZvKXWQX8lC1ikCD3MI/view?usp=sharing
Bogdan's book with chapter about multiple testing and model selection
## a)

IMPORTANT: $\hat\beta_{LS} = (X'X)^{-1}X'Y$; if $x\sim N(\mu,\Sigma)$ then $Ax\sim N(A\mu, A\Sigma A')$
- we know that $\hat\beta_{LS} \sim N(\beta, \sigma^2(X'X)^{-1})$; $\sigma^2$ is estimated by $\hat\sigma^2 = \frac{||Y - X\hat\beta_{LS}||^2}{n - p} = \frac{RSS}{n - p}$, so the std of the $i$-th coordinate of the beta vector is $\sqrt{\sigma^2(X'X)^{-1}_{i,i}}$

T test (we test whether $\beta_i=\mu=0$):
$T_i = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)} = \frac{\hat\beta_i}{S(\hat\beta_i)}$; it has a Student's t distribution with $n-k$ degrees of freedom, where
$S(\hat\beta_i) = \sqrt{\hat\sigma^2(X'X)^{-1}_{i,i}}$
Theoretical version (if we know $\sigma$):
$S(\hat\beta_i) = \sqrt{\sigma * \frac{n}{n-k-1}}$
95% confidence interval for $\beta_i$ (i.e. for $\mu$ in $T = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)}$), based on the T test:
$c = t_{crit} = q_{tstudent, n-k}(0.975)$
$$Pr(-c \leq T \leq c) = 0.95$$
$$Pr(-c -\frac{\hat\beta_i}{S(\hat\beta_i)} \leq -\frac{\mu}{S(\hat\beta_i)} \leq c -\frac{\hat\beta_i}{S(\hat\beta_i)}) = 0.95$$
$$Pr(\hat\beta_i - {c}\cdot{S(\hat\beta_i)} \leq \mu \leq \hat\beta_i + {c}\cdot{S(\hat\beta_i)}) = 0.95$$
In the case of a Z-test (whether $\beta_i=\mu=0$; the variance is known): $Z = \frac{\hat\beta_i -\mu}{S(\hat\beta_i)} = \frac{\hat\beta_i}{S(\hat\beta_i)}$, it has distribution $N(0,1)$, where $S(\hat\beta_i) = \sqrt{\sigma^2(X'X)^{-1}_{i,i}}$
$c = z_{crit} = q_{standard\ normal}(0.975)$
$$Pr(-c \leq Z \leq c) = 0.95$$
$$Pr(\hat\beta_i - {c}\cdot{S(\hat\beta_i)} \leq \mu \leq \hat\beta_i + {c}\cdot{S(\hat\beta_i)}) = 0.95$$
- $confRange = [\hat\beta_i - {c}\cdot S(\hat\beta_i), \hat\beta_i + {c}\cdot S(\hat\beta_i)]$
- $confLen = 2 * c * S(\hat\beta_i)$
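A minimal numerical sketch of the t-based interval above (numpy/scipy; the simulated $X$, $\beta$, $\sigma$ are arbitrary and just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.0, -0.5])
sigma = 2.0
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
rss = np.sum((Y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)                      # hat sigma^2 = RSS / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # S(beta_hat_i)

c = stats.t.ppf(0.975, df=n - p)                # t critical value
conf_low, conf_high = beta_hat - c * se, beta_hat + c * se
conf_len = 2 * c * se
print(np.column_stack([conf_low, conf_high]), conf_len)
```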
Power of the significance test: the probability that we reject the null hypothesis when it is false (so we did the right thing). The critical value is
$c = z_{crit} = q_{standard\ normal}(0.975)$ (Z-test), or
$c = t_{crit} = q_{tstudent, n-k}(0.975)$ (T-test)
Power of test, when we know true $\mu = \theta \not=0$:
$Pr(T > c) + Pr(T < -c)$
$Pr(T > c) = Pr(\frac{\hat\beta_i -\mu}{S(\hat\beta_i)} > c \ |\ \mu=\theta\not=0)$
$=Pr(\frac{\hat\beta_i}{S(\hat\beta_i) } > c | \mu=\theta) = Pr(\frac{\hat\beta_i -\theta + \theta}{S(\hat\beta_i) } > c | \mu=\theta)$
$=1 - Pr(\frac{\hat\beta_i -\theta}{S(\hat\beta_i) } < c - \frac{\theta}{S(\hat\beta_i) }| \mu=\theta)$
$Pr(T < -c) = Pr(\frac{\hat\beta_i -\theta + \theta}{S(\hat\beta_i) } < -c | \mu=\theta)$
$=Pr(\frac{\hat\beta_i -\theta}{S(\hat\beta_i) } < -c - \frac{\theta}{S(\hat\beta_i) }| \mu=\theta)$
$\frac{\hat\beta_i -\theta}{S(\hat\beta_i) }$ follows (approximately) a Student's t distribution with $n-k$ df, or a standard normal in the Z-test case, so:
Power:
$B(\theta) \approx (1 - F(c - \frac{\theta}{S(\hat\beta_i)})) + F(-c -\frac{\theta}{S(\hat\beta_i)})$, where $F$ is the CDF of that null distribution (t or standard normal)
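A quick sketch of the power formula for the Z-test case ($\sigma$ known, so $F = \Phi$; `z_test_power` is an illustrative name):

```python
from scipy import stats

def z_test_power(theta, se, alpha=0.05):
    """Power B(theta) = P(T > c) + P(T < -c) when the true coefficient is theta."""
    c = stats.norm.ppf(1 - alpha / 2)
    shift = theta / se
    return (1 - stats.norm.cdf(c - shift)) + stats.norm.cdf(-c - shift)

# example: theta = 1, S(beta_hat_i) = 0.4
print(z_test_power(1.0, 0.4))
```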
## b)

Covariance matrix of $\hat\beta_{LS}$ is $\sigma^2(X'X)^{-1}$
If the rows of $X$ are i.i.d. $N(0, I)$, then $X'X$ follows a Wishart distribution, so $(X'X)^{-1}$ follows an inverse Wishart distribution (with scale matrix $\Sigma^{-1}=I$).
The mean of the inverse Wishart is $E[W^{-1}]=\frac{\Sigma^{-1}}{n-p-1}$, so in our case:
$E[(X'X)^{-1}] = \frac{I}{n-p-1}$
So $E[Var(\hat\beta_i)] = \sigma^2\cdot E[(X'X)^{-1}_{i,i}]=\frac{\sigma^2}{n-p-1}$
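A quick Monte Carlo check (a sketch, assuming the rows of $X$ are i.i.d. $N(0, I)$) that $E[(X'X)^{-1}]\approx \frac{I}{n-p-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 5, 20000
acc = np.zeros((p, p))
for _ in range(reps):
    X = rng.normal(size=(n, p))          # rows ~ N(0, I), so X'X is Wishart(n, I)
    acc += np.linalg.inv(X.T @ X)
print(acc[0, 0] / reps, 1 / (n - p - 1))  # both should be close
```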
## c)

**Bonferroni**: multiply the $p$-values by $n$ (the number of tests) and compare to the desired significance level (reject if $p\cdot n < \alpha$)
**Benjamini-Hochberg**:
1. Sort the $p$-values: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(n)}$
2. Find the largest $i$ s.t. $p_{(i)} \leq \frac{i}{n}q$, where $q\in[0,1]$
3. Reject all $H_j$ for $j\leq i$
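A minimal sketch of both procedures on a vector of p-values (plain numpy, no multiple-testing library; function names are illustrative):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when n * p_i < alpha."""
    pvals = np.asarray(pvals)
    return pvals * len(pvals) < alpha

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg: reject the hypotheses with the i* smallest p-values,
    where i* is the largest i with p_(i) <= (i/n) * q."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= (np.arange(1, n + 1) / n) * q
    reject = np.zeros(n, dtype=bool)
    if below.any():
        i_star = np.max(np.nonzero(below)[0])   # index of the largest i satisfying the bound
        reject[order[: i_star + 1]] = True
    return reject

pv = [0.001, 0.008, 0.039, 0.041, 0.2, 0.6]
print(bonferroni_reject(pv), bh_reject(pv))
```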
## d)

FD - false discovery
p - size of the model
<!-- $E[FDR] = E[\frac{FD}{D}] \approx \frac{E[FD]}{E[D]} = \frac{power\cdot}{\alpha\cdot V}$ -->
$E[FD] = \sum_i P(\beta_i \text{ is FD}) = \sum_i \mathbb{1}(\beta_i \text{ is not significant})\,P(H_i \text{ rejected} \mid \beta_i\text{ is not significant}) = (n-p)\cdot\alpha$
## e)


S - number of true alternative hypotheses called significant (true positives)
V - number of true null hypotheses called significant (false positives)
$FDP = \frac{V}{S+V}$ (defined as 0 when $S+V=0$)
## f)

Stein's unbiased risk estimate - SURE
$h(x) = x + g(x)$ is an estimator of $\mu$ based on $x$, where $x_i \sim N(\mu_i, \sigma^2)$, $i=1,\ldots,n$, are independent normal variables with the same variance
$SURE(h) = -n\sigma^2 + \lVert g(x)\rVert^2 + 2\sigma^2 \sum_{i}^{n}{\frac{d}{dx_i}h_i(x)}$
it is unbiased estimate of MSE:
$E[SURE(h)] = MSE(h) = E[\lVert h(x) - \mu\rVert^2]$
consider a new sample: $Y' = X\beta + \epsilon'$, $\mu=X\beta$
$PE = E\lVert Y' - \hat Y\rVert^2 = E\lVert X(\beta - \hat\beta) + \epsilon'\rVert^2 = E\lVert X(\beta - \hat\beta)\rVert^2 + 2E[(X(\beta - \hat\beta))'\epsilon'] + E\lVert\epsilon'\rVert^2$
$= E\lVert X(\beta - \hat\beta)\rVert^2 + 0 + \sum_i(Var[\epsilon'_i] + E[\epsilon'_i]^2)$
$= E\lVert\mu - \hat\mu\rVert^2 + n\cdot\sigma^2$
IMPORTANT: If $\hat Y = M_{n \times n} Y$ then $PE = E(RSS) + 2\sigma^2Tr(M)$
Proof, using SURE:
$\hat\mu = h(Y) = MY = \hat Y = Y + MY - Y = Y + g(Y)$
$g(Y) = \hat Y - Y = MY - Y$
$E[SURE(h)] = E\lVert\hat\mu - \mu\rVert^2 = -n\sigma^2 + E\lVert g(Y)\rVert^2 + 2\sigma^2 E[\sum_i^n\frac{d}{dY_i}h_i(Y)]$
$= -n\sigma^2 + E(RSS) + 2\sigma^2TrM$
So $PE = E\lVert\mu - \hat\mu\rVert^2 + n\cdot\sigma^2 = E(RSS) + 2\sigma^2TrM$
In Least Squares $M = X(X'X)^{-1}X'$, $Tr(M) = rank(X)$
If $rank(X) = p$, an unbiased estimate of PE is
$\widehat{PE} = RSS + 2\sigma^2p$
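A simulation sketch (arbitrary $n$, $p$, $\sigma$) checking that $RSS + 2\sigma^2 p$ tracks $PE = E\lVert Y' - \hat Y\rVert^2$ for least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 80, 6, 1.0, 5000
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
mu = X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection matrix, Tr(H) = p
pe_true, pe_est = 0.0, 0.0
for _ in range(reps):
    Y = mu + sigma * rng.normal(size=n)
    Y_hat = H @ Y
    rss = np.sum((Y - Y_hat) ** 2)
    pe_est += rss + 2 * sigma**2 * p          # unbiased estimate of PE
    Y_new = mu + sigma * rng.normal(size=n)   # fresh sample Y' = X beta + eps'
    pe_true += np.sum((Y_new - Y_hat) ** 2)
print(pe_est / reps, pe_true / reps)          # both approximate PE
```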
## g)

We have the $RSS$ of the different candidate models; each model has its own matrix $M_k$
$L(X, \theta) = \prod^n_{i=1}f(X_i, \theta)$
$AIC(M_k) = \log L(X, \theta_{MLE}) - k$
if errors are iid then:
$L(Y|X,\beta,\sigma) = (\frac{1}{\sigma\sqrt{2\pi}})^n e^{-\frac{\lVert Y - X\beta\rVert^2}{2\sigma^2}}$
$AIC(M_k) = C - n\log(\sigma) - \frac{\lVert Y - X\beta\rVert^2}{2\sigma^2} - k = C - n\log(\sigma) - \frac{RSS}{2\sigma^2} - k$
if $\sigma$ is known, then maximizing $AIC$ corresponds to minimizing $RSS + 2\sigma^2k$, which is the $PE$ estimate
if $\sigma$ is unknown, we plug in $\sigma^2_{MLE} = \frac{RSS}{n}$, which gives
$AIC(M_k) = C - \frac{n}{2}\log(RSS) - k$
so maximizing $AIC$ corresponds to minimizing $n\log(RSS) + 2k$
To find the best model given RSS and k we minimize:
- AIC: $RSS + 2\sigma^2k$ ($\sigma$ known) or $n\log(RSS) + 2k$ ($\sigma$ unknown)
- BIC: $RSS + \sigma^2\cdot k\cdot \log(n)$
- RIC: $RSS + \sigma^2 \cdot 2k \cdot \log(p)$
- mBIC: $RSS + k\sigma^2(\log n +2\log(\frac{p}{C}))$
- mBIC2: $RSS + \sigma^2 \cdot (k\log(n) + 2k\log(p/4) - 2\log(k!))$
where C is prior expected number of nonzero coefficients, p is the number of available predictors
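A sketch that scores candidate models by these criteria given their RSS and size $k$ ($\sigma^2$ assumed known; the RSS values and C below are made up for illustration):

```python
import numpy as np
from scipy.special import gammaln   # log(k!) = gammaln(k + 1)

def criteria(rss, k, n, p, sigma2, C=4):
    return {
        "AIC":   rss + 2 * sigma2 * k,
        "BIC":   rss + sigma2 * k * np.log(n),
        "RIC":   rss + sigma2 * 2 * k * np.log(p),
        "mBIC":  rss + sigma2 * k * (np.log(n) + 2 * np.log(p / C)),
        "mBIC2": rss + sigma2 * (k * np.log(n) + 2 * k * np.log(p / 4) - 2 * gammaln(k + 1)),
    }

# hypothetical RSS values for nested models of size k = 1..4, with n = 100, p = 50
for k, rss in enumerate([140.0, 110.0, 104.0, 103.0], start=1):
    print(k, criteria(rss, k, n=100, p=50, sigma2=1.0))
```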
## h)

Residuals: $r = Y - \hat Y$
Projection matrix H: $X(X^TX)^{-1}X^T$
$RSS = \lVert r\rVert^2$
$PE_{CV} = \sum_{i=1}^n(\frac{r_i}{1 - H_{ii}})^2$
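A minimal sketch of this leave-one-out formula via the hat matrix (no CV loop needed; `loo_pe` is an illustrative name):

```python
import numpy as np

def loo_pe(X, Y):
    """Leave-one-out prediction error via the hat-matrix shortcut."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    r = Y - H @ Y                      # residuals
    h = np.diag(H)                     # leverages H_ii
    return np.sum((r / (1 - h)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ np.array([1.0, 0.0, 0.5, 0.0]) + rng.normal(size=60)
print(loo_pe(X, Y))
```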
## i)

Assume $X'X = I$
$\hat\beta = X'Y$
$RSS = (Y - X\hat\beta)'(Y - X\hat\beta) = Y'Y - Y'X\hat\beta - \hat\beta'X'Y + \hat\beta'X'X\hat\beta$
$= Y'Y - 2Y'X\hat\beta + \hat\beta'X'X\hat\beta = Y'Y - 2\hat\beta'\hat\beta + \hat\beta'\hat\beta$
$RSS = Y'Y - \hat\beta'\hat\beta = Y'Y - \sum_i^k\hat\beta_i^2$
Adding a new variable increases the penalty by $2\sigma^2$ and decreases RSS by $\hat\beta_i^2$, so we include it iff $\hat\beta_i^2 \geq 2\sigma^2$, i.e. $|\hat\beta_i| \geq \sqrt{2}\sigma$.
AIC: minimize $RSS + 2\sigma^2k$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{2}\sigma$
when $\beta_i=0$, $\hat\beta_i \sim N(0, \sigma^2)$, so the probability of a type I error ($F$ is the standard normal CDF):
$P(X_i \ is\ selected | \beta_i=0) = P(|\hat\beta_i| \geq \sqrt{2}\sigma | \beta_i=0) = 2(1 - F(\sqrt{2})) \approx 0.16$
BIC: $RSS + \sigma^2\cdot k\cdot log(n)$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{log\ n}\sigma$
$P(X_i \ is\ selected | \beta_i=0) = P(|\hat\beta_i| \geq \sqrt{log\ n}\sigma | \beta_i=0) = 2(1 - F(\sqrt{log\ n}))$
RIC: $RSS + \sigma^2 \cdot 2k \cdot log(p)$
We take $\hat\beta_i$ when: $|\hat\beta_i| \geq \sqrt{2log\ p}\ \sigma$
$P(X_i \ is\ selected | \beta_i=0) = 2(1 - F(\sqrt{2log\ p}))$
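These probabilities can be evaluated directly; a sketch with $\Phi = F$ the standard normal CDF ($n$ and $p$ below are arbitrary):

```python
import numpy as np
from scipy import stats

def type1_prob(threshold_over_sigma):
    """P(|beta_hat_i| >= t*sigma | beta_i = 0) = 2(1 - Phi(t)) when beta_hat_i ~ N(0, sigma^2)."""
    return 2 * (1 - stats.norm.cdf(threshold_over_sigma))

n, p = 1000, 100
print("AIC:", type1_prob(np.sqrt(2)))            # ~0.157
print("BIC:", type1_prob(np.sqrt(np.log(n))))
print("RIC:", type1_prob(np.sqrt(2 * np.log(p))))
```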
## j)

We use:
- AIC: when the goal is good prediction (it estimates PE); it tends to select too many variables when p is large
- BIC: when we want consistent model selection (for fixed p and $n\to\infty$ it recovers the true model); still too liberal when p is comparable to n
- RIC: when the number of candidate predictors p is large; the $2\log(p)$ penalty acts like a Bonferroni correction over the p predictors
- mBIC: in sparse high-dimensional regression; the extra $2\log(p/C)$ term (C = prior expected number of nonzero coefficients) makes it control FWER
- mBIC2: like mBIC but less conservative thanks to the $-2\log(k!)$ term; it controls FDR
## k)

Ridge regression:
$\hat\beta = argmin_{b}L(b)$, where $L(b) = \lVert Y - Xb\rVert^2 + \gamma\lVert b \rVert^2$
IMPORTANT:
$\hat\beta = (X'X + \gamma I)^{-1}X'Y$, where $\gamma > 0$
Projection matrix:
$M = X(X'X + \gamma I)^{-1}X'$
$Tr[M] = Tr[((X'X + \gamma I)^{-1}X'X)]$
$Tr[M] = \sum_i^p\lambda_i(M)$, $\lambda_i(M)$ are eigenvalues of M

IMPORTANT (proof via SURE, exactly as in f)): if $\hat Y = M_{n \times n} Y$ then $PE = E(RSS) + 2\sigma^2Tr(M)$
In Ridge Regression:
$PE = RSS + 2\sigma^2\sum_i^p\frac{\lambda_i(X'X)}{\lambda_i(X'X) + \gamma}$
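A sketch computing the ridge trace term $\sum_i \frac{\lambda_i}{\lambda_i+\gamma}$ (the effective degrees of freedom) and the resulting PE estimate (`ridge_pe_estimate` and the simulated data are illustrative):

```python
import numpy as np

def ridge_pe_estimate(X, Y, gamma, sigma2):
    p = X.shape[1]
    beta_hat = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ Y)
    rss = np.sum((Y - X @ beta_hat) ** 2)
    lam = np.linalg.eigvalsh(X.T @ X)               # eigenvalues of X'X
    eff_df = np.sum(lam / (lam + gamma))            # = Tr(M) for the ridge hat matrix
    return rss + 2 * sigma2 * eff_df

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
Y = X @ np.r_[np.ones(3), np.zeros(7)] + rng.normal(size=80)
print(ridge_pe_estimate(X, Y, gamma=5.0, sigma2=1.0))
```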
## l)

## m)


## n)

(google: LASSO orthogonal design power?)
It is covered in the second HackMD, but the power part may be wrong: https://hackmd.io/WC_6RbBoQg6qslAbH_WU6g#LASSO
## o) (new)
o) Consider the adaptive LASSO with $\lambda_i = w_i\lambda$.
i) How can you calculate the adaptive LASSO estimator using a numerical solver for LASSO?
ii) In the orthogonal case ($X'X=I$), calculate the value of the adaptive LASSO estimator for a specific coordinate of the beta vector. Given the true value of this parameter, calculate the bias, variance and mean squared error of this estimator.
(i) Absorb the weights into the design matrix: rescale column $j$ of $X$ by $1/w_j$,
$\tilde X_j = X_j / w_j$,
and keep $y$ unchanged. Running an ordinary LASSO solver on $(\tilde X, y)$ with the single penalty parameter $\lambda$ solves
$\min_b \lVert y - \tilde X b\rVert^2 + \lambda\lVert b\rVert_1$,
and the substitution $b_j = w_j\beta_j$ shows this is exactly the adaptive LASSO problem with penalties $\lambda_j = w_j\lambda$. The adaptive LASSO estimator is then recovered by rescaling back:
$\hat\beta_j^{adaptive} = \hat b_j^{lasso} / w_j$.
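A sketch of this reparametrization using scikit-learn's `Lasso` (assumption: sklearn minimizes $\frac{1}{2n}\lVert y - Xb\rVert^2 + \alpha\lVert b\rVert_1$, so $\alpha = \lambda/(2n)$ matches the penalty $\lambda\sum_j w_j|\beta_j|$ added to the RSS; the weights below are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, weights, lam):
    """Adaptive LASSO via a plain LASSO solver: rescale column j by 1/w_j,
    solve the ordinary LASSO, then rescale the coefficients back."""
    X_tilde = X / weights                        # column j divided by w_j
    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha ||b||_1,
    # so alpha = lam / (2n) corresponds to ||y - Xb||^2 + lam ||b||_1
    model = Lasso(alpha=lam / (2 * len(y)), fit_intercept=False)
    model.fit(X_tilde, y)
    return model.coef_ / weights                 # beta_j = b_j / w_j

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)
w = np.array([0.5, 2.0, 2.0, 0.8, 2.0])          # illustrative weights, e.g. 1/|beta_OLS|
print(adaptive_lasso(X, y, w, lam=1.0))
```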
(ii) todo
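A Monte Carlo sketch for (ii), under the assumption that the objective is $\lVert Y - Xb\rVert^2 + \sum_i \lambda_i|b_i|$ with $X'X=I$, so that $\hat\beta_i^{OLS} = (X'Y)_i \sim N(\beta_i, \sigma^2)$ and the adaptive LASSO coordinate is the soft-thresholded value $sign(\hat\beta_i)(|\hat\beta_i| - \lambda_i/2)_+$; the script estimates its bias, variance and MSE numerically (not the closed-form answer the question asks for):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft thresholding: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# orthogonal design: beta_hat_OLS_i ~ N(beta_i, sigma^2); for the objective
# ||Y - Xb||^2 + sum_i lambda_i |b_i| the adaptive LASSO coordinate is
# soft_threshold(beta_hat_OLS_i, lambda_i / 2)
rng = np.random.default_rng(0)
beta_i, sigma, lam_i, reps = 1.0, 1.0, 2.0, 200_000
beta_ols = rng.normal(beta_i, sigma, size=reps)
beta_al = soft_threshold(beta_ols, lam_i / 2)

bias = beta_al.mean() - beta_i
var = beta_al.var()
mse = np.mean((beta_al - beta_i) ** 2)
print(bias, var, mse)                      # MSE = bias^2 + variance (up to MC error)
```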
## o) (old)

https://andrewcharlesjones.github.io/journal/irrepresentable.html
## p)

What is $X_A$?
## q)

We prefer the elastic net over LASSO when there are many variables, we have no prior assumption about how many of them should be zero, and some of the predictors are correlated. (~maciej)
Lasso regression tends to pick just one of a group of correlated variables and eliminates the others,
whereas ridge regression tends to shrink all of the parameters of the correlated variables together.
By combining the Lasso and Ridge penalties, the elastic net groups and shrinks the parameters associated with the correlated variables, and either leaves them all in the equation or removes them all at once.
The main difference between these methods:
the elastic net minimizes RSS with an L1 penalty (as in LASSO) plus an L2 penalty (as in Ridge), with two lambdas (how to choose them?).
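A small illustration of the correlated-predictors point with scikit-learn (the alpha/l1_ratio values are arbitrary): with two nearly identical columns, Lasso typically keeps one and drops the other, while the elastic net shrinks both together.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # almost perfectly correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = 2 * x1 + 2 * x2 + rng.normal(size=n)

lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.3, fit_intercept=False).fit(X, y)
print("lasso:      ", lasso.coef_)           # typically one of the first two is ~0
print("elastic net:", enet.coef_)            # the first two coefficients stay similar
```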
## r)

Lasso and elastic net perform variable selection (some coefficients are set exactly to zero), whereas ridge regression only shrinks the coefficients.