# MLE
### MLE and glossary
1. Construct likelihood $\mathcal{L}(\theta)$
2. Log likelihood: $l(\theta) = \log(\mathcal{L}(\theta))$
3. [First derivative] Score: $U(\theta) = \frac{\partial l(\theta)}{\partial \theta}$
4. Solve for $\hat{\theta}_{\rm MLE}$ s.t. $U(\hat{\theta}) = 0$ (a worked Bernoulli example of steps 1-5 follows this list)
5. [Second derivative] Information:
+ Observed information: $-\frac{\partial^2 l(\theta)}{\partial \theta^2}$
+ Fisher information: $I(\theta) = E(-\frac{\partial^2 l(\theta)}{\partial \theta^2})$
6. Central limit theorem (CLT) and law of large numbers (LLN)
+ CLT:
$\sqrt{n}(\bar{X}-\mu) \xrightarrow{d} N(0, \sigma^2)$ (Converge in distribution)
+ strong LLN (SLLN):
$\bar{X} \xrightarrow{a.s.} \mu$ (Converge almost surely)
> $P(\lim_{n\rightarrow\infty} \bar{X} = E(X)) = 1$
> Intuition: if you repeat the whole experiment many times, then in (almost) every repetition the running average of the $X$'s converges to the expectation as $n \rightarrow \infty$.
+ weak LLN (WLLN):
$\bar{X} \xrightarrow{p} \mu$ (Converge in probability)
> $\lim_{n\rightarrow\infty} P(| \bar{X} - E(X) | < \epsilon) = 1 (\forall \epsilon > 0)$
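As a worked example of steps 1-5 (a Bernoulli model chosen here for concreteness; not part of the original list), suppose $x_1, \dots, x_n$ are i.i.d. $\mathrm{Bernoulli}(\theta)$:
\begin{equation}
\begin{aligned}
\mathcal{L}(\theta) &= \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}\\
l(\theta) &= \sum_i x_i \log(\theta) + \left(n - \sum_i x_i\right) \log(1-\theta)\\
U(\theta) &= \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1-\theta}, \qquad U(\hat{\theta}) = 0 \Rightarrow \hat{\theta}_{\rm MLE} = \bar{x}\\
I(\theta) &= E\left( \frac{\sum_i x_i}{\theta^2} + \frac{n - \sum_i x_i}{(1-\theta)^2} \right) = \frac{n}{\theta(1-\theta)}
\end{aligned}
\end{equation}
Per observation the Fisher information is $\frac{1}{\theta(1-\theta)}$, which is the $I(\theta)$ that appears in (*) below.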
### Consistent $\hat{\theta}_{\rm MLE}$
\begin{equation}
\begin{aligned}
\hat{\theta} \xrightarrow{p} \theta_0
\end{aligned}
\tag{1}
\end{equation}
Proof omitted here.
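A minimal numerical sketch of (1) (not part of the original notes), using a Bernoulli model where $\hat{\theta}_{\rm MLE} = \bar{X}$; the value of $\theta_0$, the sample sizes, and the seed are illustrative choices:
```python
import numpy as np

# Minimal sketch: Bernoulli(theta_0) model, where the MLE is the sample mean.
# theta_0, the sample sizes, and the seed below are illustrative choices.
rng = np.random.default_rng(0)
theta_0 = 0.3

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_0, size=n)  # n i.i.d. Bernoulli(theta_0) draws
    theta_hat = x.mean()                  # MLE of theta in the Bernoulli model
    print(f"n = {n:>9}: theta_hat = {theta_hat:.4f}")
```
As $n$ grows, the printed $\hat{\theta}$ settles around $\theta_0 = 0.3$, which is the behaviour (1) describes.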
### Asymptotic normal $\hat{\theta}_{\rm MLE}$
By the [MVT](#Mean-value-theorem) applied to $U$, we can find $\theta^*$ between $\theta_0$ and $\hat{\theta}$ s.t.
\begin{equation}
\begin{aligned}
U(\hat{\theta}) (= 0) &= U(\theta_0) + U'(\theta^*) (\hat{\theta} - \theta_0)\\
\Rightarrow \hat{\theta} - \theta_0 &= - [U'(\theta^*) ]^{-1}U(\theta_0)
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\sqrt{n}(\hat{\theta} - \theta_0) &= \frac{\sqrt{n} \left(1/n \sum_i U_i(\theta_0)\right)}{- 1/n \sum_i U_i'(\theta^*)} \quad (\text{where } U_i \text{ is the score contribution of the } i\text{-th observation})
\end{aligned}
\end{equation}
+ $1/n \sum_i U_i(\theta_0)$
\begin{equation}
\begin{aligned}
\sqrt{n} \left(1/n \sum_i U_i(\theta_0)\right) &= \sqrt{n} \left(1/n \sum_i U_i(\theta_0) - E(U_i(\theta_0))\right)\\
& \xrightarrow{d} N(0, Var[U_i(\theta_0)])
\end{aligned}
\tag{by CLT, (2)}
\end{equation}
> $E(U_i(\theta_0))$ is 0 (swapping $\partial/\partial\theta$ and $\int$ below assumes the usual regularity conditions):
\begin{equation}
\begin{aligned}
E(U_i(\theta_0)) &= \int \frac{\partial}{\partial \theta} \log(f_{X}(x_i; \theta)) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} = \int f'_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \frac{\partial}{\partial \theta} \int f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} = \frac{\partial}{\partial \theta} (1) |_{\theta=\theta_0} = 0
\end{aligned}
\tag{$f_{X}(x_i; \theta)$ is the pdf}
\end{equation}
+ $1/n \sum_i U_i'(\theta^*)$: since $\theta^*$ lies between $\theta_0$ and $\hat{\theta}$, and $\hat{\theta} \xrightarrow{p} \theta_0$ by (1), we also have $\theta^* \xrightarrow{p} \theta_0$, so
\begin{equation}
\begin{aligned}
1/n \sum_i U_i'(\theta^*) \xrightarrow{p} E[U_i' (\theta_0)]
\end{aligned}
\tag{by WLLN, (3)}
\end{equation}
Combining (2) and (3) and applying Slutsky's theorem (with $I(\theta_0) = E[U_i^2(\theta_0)]$, the Fisher information of a single observation):
\begin{equation}
\begin{aligned}
\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \frac{Var[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}) = N(0, I^{-1}(\theta_0))
\end{aligned}
\tag{*}
\end{equation}
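A minimal simulation sketch of (*) (not part of the original notes), again for a Bernoulli model, where $\hat{\theta} = \bar{X}$ and the per-observation information is $I(\theta_0) = \frac{1}{\theta_0(1-\theta_0)}$; $\theta_0$, $n$, the number of replicates, and the seed are illustrative:
```python
import numpy as np

# Simulate many data sets, compute the MLE for each, and compare the spread of
# sqrt(n) * (theta_hat - theta_0) with I^{-1}(theta_0) = theta_0 * (1 - theta_0).
# theta_0, n, reps, and the seed are illustrative choices.
rng = np.random.default_rng(1)
theta_0, n, reps = 0.3, 500, 20_000

x = rng.binomial(1, theta_0, size=(reps, n))
theta_hat = x.mean(axis=1)               # one MLE per simulated data set
z = np.sqrt(n) * (theta_hat - theta_0)   # sqrt(n) * (theta_hat - theta_0)

print("empirical variance:", z.var())
print("I^{-1}(theta_0)   :", theta_0 * (1 - theta_0))
```
The two printed numbers should be close, and a histogram of `z` looks approximately normal, as (*) predicts.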
#### Note: the derivation of (*) above makes no assumption about the specific form of the likelihood.
> \begin{equation}
\begin{aligned}
\frac{Var[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}
&= \frac{E[U_i^2(\theta_0)]-E^2[U_i(\theta_0)]}{E^2[U_i' (\theta_0)]}\\
&= \frac{-E[U_i'(\theta_0)]}{E^2[U_i' (\theta_0)]}
= -\frac{1}{E[U_i'(\theta_0)]} = \frac{1}{E[U_i^2(\theta_0)]} = I^{-1}(\theta_0)
\end{aligned}
\tag{using $E[U_i(\theta_0)]=0$ from above and $E[U_i^2(\theta_0)]=-E[U_i'(\theta_0)]$, shown below}
\end{equation}
> \begin{equation}
\begin{aligned}
E(U_i'(\theta_0)) &= \int \frac{\partial^2}{\partial \theta^2} \log(f_{X}(x_i; \theta)) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \frac{\partial}{\partial \theta} \left( \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \right) \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int \left( -\frac{f'^2_{X}(x_i; \theta)}{f^2_{X}(x_i; \theta)} + \frac{f''_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \right) f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= \int -\left( \frac{f'_{X}(x_i; \theta)}{f_{X}(x_i; \theta)}\right)^2 f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} + \int \frac{f''_{X}(x_i; \theta)}{f_{X}(x_i; \theta)} \cdot f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0}\\
&= -E[U_i^2(\theta_0)] + \int \frac{\partial^2}{\partial \theta^2} f_{X}(x_i; \theta) dx_i |_{\theta=\theta_0} \\
&= -E[U_i^2(\theta_0)] + \frac{\partial^2}{\partial \theta^2} (1)\\
&= -E[U_i^2(\theta_0)]
\end{aligned}
\end{equation}
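> As a quick check (an added illustration, using a single Bernoulli($\theta$) observation): $U_i(\theta) = \frac{x_i}{\theta} - \frac{1-x_i}{1-\theta} = \frac{x_i - \theta}{\theta(1-\theta)}$, so
\begin{equation}
\begin{aligned}
E[U_i'(\theta_0)] &= E\left[-\frac{x_i}{\theta_0^2} - \frac{1-x_i}{(1-\theta_0)^2}\right] = -\frac{1}{\theta_0(1-\theta_0)},\\
E[U_i^2(\theta_0)] &= \frac{Var(x_i)}{\theta_0^2(1-\theta_0)^2} = \frac{1}{\theta_0(1-\theta_0)} = -E[U_i'(\theta_0)]
\end{aligned}
\end{equation}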
### Wald test
Test statistic: $T = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{I^{-1}(\theta_0)}} \xrightarrow{d} N(0,1)$ under $H_0: \theta = \theta_0$ (by (*)).
In practice, we use $T^* = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{I^{-1}(\hat{\theta})}}$, replacing $\theta_0$ inside $I(\cdot)$ with the consistent estimate $\hat{\theta}$.
(proof is skipped ...)
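A minimal sketch of the test in practice (not part of the original notes), for the Bernoulli model with $I^{-1}(\hat{\theta}) = \hat{\theta}(1-\hat{\theta})$; the null value $\theta_0$, the data-generating parameter, the sample size, and the seed are illustrative:
```python
import math
import numpy as np

# Wald test of H0: theta = theta_0 for a Bernoulli sample.
# The data are generated with theta = 0.6, so H0 (theta = 0.5) is actually false.
rng = np.random.default_rng(2)
theta_0 = 0.5
x = rng.binomial(1, 0.6, size=200)

n = x.size
theta_hat = x.mean()
se = math.sqrt(theta_hat * (1 - theta_hat) / n)  # sqrt(I^{-1}(theta_hat)) / sqrt(n)
t_star = (theta_hat - theta_0) / se              # = sqrt(n)(theta_hat - theta_0) / sqrt(I^{-1}(theta_hat))
p_value = math.erfc(abs(t_star) / math.sqrt(2))  # two-sided p-value against N(0, 1)

print(f"theta_hat = {theta_hat:.3f}, T* = {t_star:.2f}, two-sided p = {p_value:.4g}")
```
A small p-value leads to rejecting $H_0: \theta = \theta_0$ at the chosen level.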
### Mean value theorem
For $g$ differentiable on the interval between $y_1$ and $y_2$, there is a $y^*$ between them s.t.
\begin{equation}
\begin{aligned}
\frac{g(y_2)-g(y_1)}{y_2-y_1} &= g'(y^*)\\
\Rightarrow g(y_2) &= g(y_1) + g'(y^*) (y_2-y_1)
\end{aligned}
\end{equation}
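For instance (an added illustration), taking $g(y) = y^2$ gives $\frac{y_2^2 - y_1^2}{y_2 - y_1} = y_1 + y_2 = 2 y^*$, i.e. $y^* = \frac{y_1 + y_2}{2}$, which indeed lies between $y_1$ and $y_2$.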
### Slutsky's theorem
If
\begin{equation}
\begin{aligned}
X_n &\xrightarrow{d} X\\
Y_n &\xrightarrow{p} c\\
\end{aligned}
\end{equation}
then
\begin{equation}
\begin{aligned}
X_n + Y_n &\xrightarrow{d} X +c\\
X_n Y_n &\xrightarrow{d} cX\\
X_n/Y_n &\xrightarrow{d} X/c \quad (\text{provided } c \neq 0)\\
\end{aligned}
\end{equation}
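In the derivation of (*) above, this is applied with $X_n = \sqrt{n}\left(1/n \sum_i U_i(\theta_0)\right) \xrightarrow{d} N(0, Var[U_i(\theta_0)])$ and $Y_n = -1/n \sum_i U_i'(\theta^*) \xrightarrow{p} -E[U_i'(\theta_0)]$, so that $X_n / Y_n = \sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \frac{Var[U_i(\theta_0)]}{E^2[U_i'(\theta_0)]})$.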