# MLE
### MLE and glossary
1. Construct likelihood $\mathcal{L}(\theta)$
2. Log likelihood: $l(\theta) = \log(\mathcal{L}(\theta))$
3. [First derivative] Score: $U(\theta) = \frac{\partial l(\theta)}{\partial \theta}$
4. Solve for $\hat{\theta}_{\rm MLE}$ s.t. $U(\hat{\theta}) = 0$ (a worked Bernoulli example of steps 1-5 follows this list)
5. [Second derivative] Information:
+ Observed information: $-\frac{\partial^2 l(\theta)}{\partial \theta^2}$
+ Fisher information: $I(\theta) = E(-\frac{\partial^2 l(\theta)}{\partial \theta^2})$
6. Central limit theorem (CLT) and law of large numbers (LLN)
+ CLT:
$\sqrt{n}(\bar{X}-\mu) \xrightarrow{d} N(0, \sigma^2)$ (Converge in distribution)
+ strong LLN (SLLN):
$\bar{X} \xrightarrow{a.s.} \mu$ (Converge almost surely)
> $P(\lim_{n\rightarrow\infty} \bar{X} = E(X)) = 1$
> Intuition: if you repeat the whole experiment many times, then in (almost) every repetition the running average of the $X$'s converges to the expectation as $n \rightarrow \infty$.
+ weak LLN (WLLN):
$\bar{X} \xrightarrow{p} \mu$ (Converge in probability)
> $\lim_{n\rightarrow\infty} P(| \bar{X} - E(X) | < \epsilon) = 1 (\forall \epsilon > 0)$
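As a worked example of steps 1-5 (a Bernoulli model chosen here for concreteness; not part of the original list), suppose $x_1, \dots, x_n$ are i.i.d. $\mathrm{Bernoulli}(\theta)$:
\begin{equation}
\begin{aligned}
\mathcal{L}(\theta) &= \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}\\
l(\theta) &= \sum_i x_i \log(\theta) + \left(n - \sum_i x_i\right) \log(1-\theta)\\
U(\theta) &= \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1-\theta}, \qquad U(\hat{\theta}) = 0 \Rightarrow \hat{\theta}_{\rm MLE} = \bar{x}\\
I(\theta) &= E\left( \frac{\sum_i x_i}{\theta^2} + \frac{n - \sum_i x_i}{(1-\theta)^2} \right) = \frac{n}{\theta(1-\theta)}
\end{aligned}
\end{equation}
Per observation the Fisher information is $\frac{1}{\theta(1-\theta)}$, which is the $I(\theta)$ that appears in (*) below.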
### Consistent $\hat{\theta}_{\rm MLE}$
\begin{equation}
\begin{aligned}
\hat{\theta} \xrightarrow{p} \theta_0
\end{aligned}
\tag{1}
\end{equation}
Proof omitted here.
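A minimal numerical sketch of (1) (not part of the original notes), using a Bernoulli model where $\hat{\theta}_{\rm MLE} = \bar{X}$; the value of $\theta_0$, the sample sizes, and the seed are illustrative choices:
```python
import numpy as np

# Minimal sketch: Bernoulli(theta_0) model, where the MLE is the sample mean.
# theta_0, the sample sizes, and the seed below are illustrative choices.
rng = np.random.default_rng(0)
theta_0 = 0.3

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_0, size=n)  # n i.i.d. Bernoulli(theta_0) draws
    theta_hat = x.mean()                  # MLE of theta in the Bernoulli model
    print(f"n = {n:>9}: theta_hat = {theta_hat:.4f}")
```
As $n$ grows, the printed $\hat{\theta}$ settles around $\theta_0 = 0.3$, which is the behaviour (1) describes.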
### Asymptotic normal $\hat{\theta}_{\rm MLE}$
By the [MVT](#Mean-value-theorem) applied to $U$, we can find $\theta^*$ between $\theta_0$ and $\hat{\theta}$ s.t.
\begin{equation}
\begin{aligned}
U(\hat{\theta}) (= 0) &= U(\theta_0) + U'(\theta^*) (\hat{\theta} - \theta_0)\\
\Rightarrow \hat{\theta} - \theta_0 &= - [U'(\theta^*) ]^{-1}U(\theta_0)
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\sqrt{n}(\hat{\theta} - \theta_0) &= \frac{\sqrt{n} \left(1/n \sum_i U_i(\theta_0)\right)}{- 1/n \sum_i U_i'(\theta^*)} \quad (\text{where } U_i \text{ is the score contribution of the } i\text{-th observation})
\end{aligned}
\end{equation}
+ $1/n \sum_i U_i(\theta_0)$
\begin{equation}
\begin{aligned}
\sqrt{n} \left(1/n \sum_i U_i(\theta_0)\right) &= \sqrt{n} \left(1/n \sum_i U_i(\theta_0) - E(U_i(\theta_0))\right)\\
& \xrightarrow{d} N(0, Var[U_i(\theta_0)])
\end{aligned}
\tag{by CLT, (2)}
\end{equation}
> $E(U_i(\theta_0))$ is 0 (swapping $\partial/\partial\theta$ and $\int$ below assumes the usual regularity conditions):
\begin{equation}
\begin{aligned}
E(U_i(\theta_0)) &= \int \frac{\partial}{\partial \theta} \log(f_{X}(x_i; \theta)) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} = \int f'_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \frac{\partial}{\partial \theta} \int f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} = \frac{\partial}{\partial \theta} (1) |_{\theta=\theta_0} = 0
\end{aligned}
\tag{$f_{X}(x_i; \theta)$ is the pdf}
\end{equation}
+ $1/n \sum_i U_i'(\theta^*)$: since $\theta^*$ lies between $\theta_0$ and $\hat{\theta}$, and $\hat{\theta} \xrightarrow{p} \theta_0$ by (1), we also have $\theta^* \xrightarrow{p} \theta_0$, so
\begin{equation}
\begin{aligned}
1/n \sum_i U_i'(\theta^*) \xrightarrow{p} E[U_i' (\theta_0)]
\end{aligned}
\tag{by WLLN, (3)}
\end{equation}
Combining (2) and (3) and applying Slutsky's theorem (with $I(\theta_0) = E[U_i^2(\theta_0)]$, the Fisher information of a single observation):
\begin{equation}
\begin{aligned}
\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \frac{Var[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}) = N(0, I^{-1}(\theta_0))
\end{aligned}
\tag{*}
\end{equation}
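A minimal simulation sketch of (*) (not part of the original notes), again for a Bernoulli model, where $\hat{\theta} = \bar{X}$ and the per-observation information is $I(\theta_0) = \frac{1}{\theta_0(1-\theta_0)}$; $\theta_0$, $n$, the number of replicates, and the seed are illustrative:
```python
import numpy as np

# Simulate many data sets, compute the MLE for each, and compare the spread of
# sqrt(n) * (theta_hat - theta_0) with I^{-1}(theta_0) = theta_0 * (1 - theta_0).
# theta_0, n, reps, and the seed are illustrative choices.
rng = np.random.default_rng(1)
theta_0, n, reps = 0.3, 500, 20_000

x = rng.binomial(1, theta_0, size=(reps, n))
theta_hat = x.mean(axis=1)               # one MLE per simulated data set
z = np.sqrt(n) * (theta_hat - theta_0)   # sqrt(n) * (theta_hat - theta_0)

print("empirical variance:", z.var())
print("I^{-1}(theta_0)   :", theta_0 * (1 - theta_0))
```
The two printed numbers should be close, and a histogram of `z` looks approximately normal, as (*) predicts.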
#### Note: the derivation of (*) above makes no assumption about the specific form of the likelihood.
> \begin{equation}
\begin{aligned}
\frac{Var[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}
&= \frac{E[U_i^2(\theta_0)]-E^2[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}\\
&= \frac{-E[U_i'(\theta_0)]}{E^2[U_i' (\theta_0)]}
= -\frac{1}{E[U_i'(\theta_0)]} = \frac{1}{E[U_i^2(\theta_0)]} = I^{-1}(\theta_0)
\end{aligned}
\tag{using $E[U_i(\theta_0)]=0$ from above and $E[U_i^2(\theta_0)]=-E[U_i'(\theta_0)]$, shown below}
\end{equation}
> \begin{equation}
\begin{aligned}
E(U_i'(\theta_0)) &= \int \frac{\partial^2}{\partial \theta^2} \log(f_{X}(x_i; \theta)) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \frac{\partial}{\partial \theta} \left( \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \right) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \left( -\frac{f'^2_{X}(x_i; \theta)}{f^2_{X}(x_i; \theta)} + \frac{f''_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \right) f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int -\left( \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)}\right)^2 f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} + \int \frac{f''_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= -E[U_i^2(\theta_0)] + \int \frac{\partial^2}{\partial \theta^2} f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} \\
&= -E[U_i^2(\theta_0)] + \frac{\partial^2}{\partial \theta^2} (1)\\
&= -E[U_i^2(\theta_0)]
\end{aligned}
\end{equation}
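> As a quick check (an added illustration, using a single Bernoulli($\theta$) observation): $U_i(\theta) = \frac{x_i}{\theta} - \frac{1-x_i}{1-\theta} = \frac{x_i - \theta}{\theta(1-\theta)}$, so
\begin{equation}
\begin{aligned}
E[U_i'(\theta_0)] &= E\left[-\frac{x_i}{\theta_0^2} - \frac{1-x_i}{(1-\theta_0)^2}\right] = -\frac{1}{\theta_0(1-\theta_0)},\\
E[U_i^2(\theta_0)] &= \frac{Var(x_i)}{\theta_0^2(1-\theta_0)^2} = \frac{1}{\theta_0(1-\theta_0)} = -E[U_i'(\theta_0)]
\end{aligned}
\end{equation}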
### Wald test
Test statistic: $T = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{I^{-1}(\theta_0)}} \xrightarrow{d} N(0,1)$ under $H_0: \theta = \theta_0$ (by (*)).
In practice, we use $T^* = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{I^{-1}(\hat{\theta})}}$, replacing $\theta_0$ inside $I(\cdot)$ with the consistent estimate $\hat{\theta}$.
(proof is skipped ...)
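A minimal sketch of the test in practice (not part of the original notes), for the Bernoulli model with $I^{-1}(\hat{\theta}) = \hat{\theta}(1-\hat{\theta})$; the null value $\theta_0$, the data-generating parameter, the sample size, and the seed are illustrative:
```python
import math
import numpy as np

# Wald test of H0: theta = theta_0 for a Bernoulli sample.
# The data are generated with theta = 0.6, so H0 (theta = 0.5) is actually false.
rng = np.random.default_rng(2)
theta_0 = 0.5
x = rng.binomial(1, 0.6, size=200)

n = x.size
theta_hat = x.mean()
se = math.sqrt(theta_hat * (1 - theta_hat) / n)  # sqrt(I^{-1}(theta_hat)) / sqrt(n)
t_star = (theta_hat - theta_0) / se              # = sqrt(n)(theta_hat - theta_0) / sqrt(I^{-1}(theta_hat))
p_value = math.erfc(abs(t_star) / math.sqrt(2))  # two-sided p-value against N(0, 1)

print(f"theta_hat = {theta_hat:.3f}, T* = {t_star:.2f}, two-sided p = {p_value:.4g}")
```
A small p-value leads to rejecting $H_0: \theta = \theta_0$ at the chosen level.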
### Mean value theorem
For $g$ differentiable on the interval between $y_1$ and $y_2$, there is a $y^*$ between them s.t.
\begin{equation}
\begin{aligned}
\frac{g(y_2)-g(y_1)}{y_2-y_1} &= g'(y^*)\\
\Rightarrow g(y_2) &= g(y_1) + g'(y^*) (y_2-y_1)
\end{aligned}
\end{equation}
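For instance (an added illustration), taking $g(y) = y^2$ gives $\frac{y_2^2 - y_1^2}{y_2 - y_1} = y_1 + y_2 = 2 y^*$, i.e. $y^* = \frac{y_1 + y_2}{2}$, which indeed lies between $y_1$ and $y_2$.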
### Slutsky's theorem
If
\begin{equation}
\begin{aligned}
X_n &\xrightarrow{d} X\\
Y_n &\xrightarrow{p} c\\
\end{aligned}
\end{equation}
then
\begin{equation}
\begin{aligned}
X_n + Y_n &\xrightarrow{d} X +c\\
X_n Y_n &\xrightarrow{d} cX\\
X_n/Y_n &\xrightarrow{d} X/c \quad (\text{provided } c \neq 0)\\
\end{aligned}
\end{equation}
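In the derivation of (*) above, this is applied with $X_n = \sqrt{n}\left(1/n \sum_i U_i(\theta_0)\right) \xrightarrow{d} N(0, Var[U_i(\theta_0)])$ and $Y_n = -1/n \sum_i U_i'(\theta^*) \xrightarrow{p} -E[U_i'(\theta_0)]$, so that $X_n / Y_n = \sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \frac{Var[U_i(\theta_0)]}{E^2[U_i'(\theta_0)]})$.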