# Estimators
{%hackmd @themes/orangeheart %}
###### tags: `Causal Inference and Prediction in Econometrics`
## Asymptotic Theory
Asymptotic theory, also called large-sample theory, provides a framework for describing how sequences of statistics converge or diverge as the sample size grows to infinity.
### The Concept of a Limit in Asymptotic Theory
Suppose there is a deterministic (non-random) sequence:
$$
\{a_1, a_2, \cdots, a_n\}
$$
The sequence converges to a constant $a$,
$$
\lim_{n \to \infty} a_n = a
$$

if for every $\delta > 0$ there exists $n_{\delta} < \infty$ such that for all $n > n_{\delta}$,
$$
|a_n - a| < \delta
$$
that is, the terms of the sequence eventually stay within any given distance $\delta$ of $a$.
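For a concrete illustration, take $a_n = 1 + \frac{1}{n}$, which converges to $a = 1$: for any $\delta > 0$, choosing $n_{\delta} = \lceil 1/\delta \rceil$ gives
$$
|a_n - a| = \frac{1}{n} < \delta \quad \text{for all } n > n_{\delta}.
$$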
### Convergence in Probability
Given a sequence of random variables,
$$
\{z_1, z_2 \cdots z_n\}
$$
if
$$
\lim_{n \to \infty} \mathbb{P}\big(|z_n - z| \leq \delta \big) = 1 \quad \text{ or } \quad \lim_{n \to \infty} \mathbb{P}\big(|z_n - z| > \delta \big) = 0, \quad \forall \delta > 0
$$
then we say the sequence converges in probability to a random variable $z$, called the **probability limit** of the sequence. This property can be written in any of the following ways:
$$
\{z_n\} \overset{p}{\to} z, \qquad z_n \overset{p}{\to} z, \qquad \operatorname{plim}_{n \to \infty} z_n = z
$$
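As a rough numerical illustration (a minimal simulation sketch, not part of the original notes), take $z_n = \min\{u_1, \dots, u_n\}$ with $u_i \sim \text{Uniform}(0,1)$, which converges in probability to $z = 0$; the Monte Carlo estimate of $\mathbb{P}(|z_n - z| > \delta)$ should shrink toward zero as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(0)
delta, reps = 0.05, 10_000

# z_n = min(u_1, ..., u_n) with u_i ~ Uniform(0, 1) converges in probability to z = 0,
# since P(|z_n - 0| > delta) = (1 - delta)^n -> 0 as n grows.
for n in [10, 100, 1000]:
    z_n = rng.uniform(size=(reps, n)).min(axis=1)   # one draw of z_n per replication
    print(n, np.mean(np.abs(z_n) > delta))          # Monte Carlo estimate of P(|z_n - z| > delta)
```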
### Law of Large Numbers (LLN)
The law of large numbers describes the result of performing the same experiment a large number of times: as the number of trials grows, the sample average gets close to the expected value,
$$
\overline{x} \overset{p}{\to} \mu
$$
For example, given a sequence of random variables,
$$
\{y_1, y_2 \cdots y_n\},
$$
and define the sample mean $\overline{y}_n = \frac{1}{n}\sum^n_{i=1}y_i$ with $\mathbb{E}(y_i) = \mu$. Suppose $y_i \overset{\text{i.i.d.}}{\sim} (\mu, \sigma^2)$ and the first moment exists, $\mathbb{E}|y_i| < \infty$. Then, as $n \to \infty$,
$$
\overline{y}_n \overset{p}{\to} \mu
$$
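A minimal simulation sketch of the LLN (illustrative only; the normal distribution and the parameter values below are arbitrary choices): the sample mean of i.i.d. draws settles near $\mu$ as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0   # arbitrary illustrative values

# The sample mean of i.i.d. N(mu, sigma^2) draws approaches mu as n grows.
for n in [10, 1_000, 100_000]:
    y = rng.normal(mu, sigma, size=n)
    print(n, y.mean())   # should get closer to 2.0
```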
### Convergence in Distribution
A sequence of random variables $Z_n$ defined as
$$
\{Z_1, Z_2, \cdots, Z_n\}
$$
with corresponding cumulative distribution functions (CDFs)
$$
F_1(z), F_2(z), \cdots, F_n(z)
$$
converges in distribution to a random variable $Z$ with CDF $F_Z$ if, at every point $z$ where $F_Z$ is continuous,
$$
\lim _{n \to \infty} F_n(z)=F_Z(z)
$$
Intuitively, $Z_n$ has approximately the same distribution as $Z$ when the sample size grows indefinitely, and we say the distribution of $Z$ is the limiting distribution of $Z_n$. We denote this by
$$
Z_n \overset{d}{\to} Z
$$
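As an illustration (a textbook example, not taken from these notes), if $u_1, \dots, u_n$ are i.i.d. Uniform$(0,1)$, then $Z_n = n \min\{u_1, \dots, u_n\}$ converges in distribution to an Exponential$(1)$ random variable. The sketch below compares the empirical CDF of simulated $Z_n$ with the limiting CDF $F_Z(z) = 1 - e^{-z}$ at a few points.
```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 20_000

# Z_n = n * min(u_1, ..., u_n) converges in distribution to Exponential(1),
# because P(Z_n > z) = (1 - z/n)^n -> exp(-z).
z_n = n * rng.uniform(size=(reps, n)).min(axis=1)

for z in [0.5, 1.0, 2.0]:
    print(z, np.mean(z_n <= z), 1 - np.exp(-z))   # empirical CDF vs limiting CDF
```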
### Central Limit Theorem (CLT)
Given a sequence of i.i.d. random variables,
$$
\{X_1, X_2, \cdots, X_n\},
$$
with expected value $\mathbb{E}(X_i) = \mu < \infty$ and variance $\operatorname{var}(X_i) = \sigma^2 < \infty$, let $\overline{X}_n = \frac{1}{n}\sum^n_{i=1}X_i$. Recentering and rescaling the sample average, we obtain
$$
Z_n=\frac{\bar{X}_n-\mathbb{E}\left(\bar{X}_n\right)}{\sqrt{\operatorname{Var}\left(\bar{X}_n\right)}}=\sqrt{n}\left(\frac{\bar{X}_n-\mu}{\sigma}\right) \stackrel{d}{\rightarrow} N(0,1)
$$
or
$$
\sqrt{n}\left(\bar{X}_n-\mu\right) \stackrel{d}{\rightarrow} N\left(0, \sigma^2\right)
$$
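A minimal simulation sketch of the CLT (with an arbitrarily chosen Exponential$(1)$ population, so $\mu = \sigma = 1$): even though the draws are skewed, the standardized sample mean behaves approximately like $N(0,1)$.
```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 50_000

# x_i i.i.d. Exponential(1), so mu = 1 and sigma = 1 (a deliberately skewed population).
x = rng.exponential(scale=1.0, size=(reps, n))
z_n = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0   # sqrt(n) * (x_bar - mu) / sigma

print(z_n.mean(), z_n.std())    # approximately 0 and 1
print(np.mean(z_n <= 1.96))     # approximately Phi(1.96) = 0.975
```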
## Unbiased and Consistent Estimators
Suppose $\beta_0$ is a vector of parameters of interest and $\hat{\beta}$ is an estimator of $\beta_0$. Then
- if $\mathbb{E}(\hat{\beta}) = \beta_0$, $\hat{\beta}$ is an **unbiased** estimator of $\beta_0$;
- if $\operatorname{plim}_{n \to \infty}\hat{\beta} = \beta_0$, i.e. $\hat{\beta} \overset{p}{\to}\beta_0$, $\hat{\beta}$ is a **consistent** estimator of $\beta_0$ (see the sketch below).
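As a standard textbook illustration (not from the original notes): the sample variance that divides by $n$ is biased but consistent, while dividing by $n-1$ makes it unbiased. A minimal simulation sketch:
```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, reps = 4.0, 20_000   # true variance and number of Monte Carlo replications

for n in [5, 50, 500]:
    y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    biased = y.var(axis=1, ddof=0)      # divides by n: biased, but consistent
    unbiased = y.var(axis=1, ddof=1)    # divides by n - 1: unbiased
    print(n, biased.mean(), unbiased.mean())   # approx (n-1)/n * 4 vs approx 4
```
Averaged over replications, the $1/n$ version sits below $\sigma^2 = 4$ for small $n$ but approaches it as $n$ grows, while the $1/(n-1)$ version is centred at $4$ for every $n$.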


### Ordinary Least Squares (OLS)
Ordinary least squares chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimising the sum of squared differences between the observed dependent variable in the given dataset and the values predicted by the linear function of the independent variables.[^1] Suppose we stack the observations $y_i$ and $x_i$ into matrices $\mathbf{Y}$ and $\mathbf{X}$, and define $m(x_i)$ to be the conditional expectation of $y_i$ given $x_i$,
$$
m(x_i) \equiv \mathbb{E}(y_i | x_i)
$$
and $m(x_i)$ is the best prediction in the sense that it minimises
$$
\mathbb{E}[y_i-g(x_i)]^2,
$$
where $g(\cdot)$ is some function of $x$. Writing $e_i \equiv y_i - m(x_i)$, so that $\mathbb{E}(e_i | x_i) = 0$, the proof is as follows:
$$
\begin{aligned}
\mathbb{E}[y_i-g(x_i)]^2 =& \mathbb{E}[y_i-m(x_i)+m(x_i)-g(x_i)]^2\\
=& \mathbb{E}\Big\{e_i^2 + 2e_i[m(x_i) - g(x_i)] + [m(x_i) - g(x_i)]^2\Big\}\\
=& \mathbb{E}(e_i^2) + \underbrace{\mathbb{E}[m(x_i) - g(x_i)]^2}_{\geq\, 0},
\end{aligned}
$$
where the cross term vanishes by the law of iterated expectations, since $\mathbb{E}\big\{e_i[m(x_i) - g(x_i)]\big\} = \mathbb{E}\big\{\mathbb{E}(e_i|x_i)\,[m(x_i) - g(x_i)]\big\} = 0$. This shows that the $\operatorname{MSE}$ is minimised when $g(x_i) = m(x_i)$.
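A rough numerical check of this claim, under an assumed data-generating process $y_i = x_i^2 + e_i$ with $\mathbb{E}(e_i|x_i)=0$, so $m(x) = x^2$ (both the DGP and the comparison predictor $g(x) = 2x$ are illustrative choices):
```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Assumed DGP for illustration: y = x^2 + e with E(e | x) = 0, so m(x) = x^2.
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(0, 1, size=n)

mse_m = np.mean((y - x**2) ** 2)    # predictor m(x) = E(y | x)
mse_g = np.mean((y - 2 * x) ** 2)   # an arbitrary alternative predictor g(x) = 2x
print(mse_m, mse_g)                 # mse_m should be the smaller of the two
```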
#### Simple Linear Regression
Now given a simple linear regression,
$$
y_i = \beta_0 + \beta_1 x_i + e_i,
$$
we choose the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise the sum of squared residuals:
$$
\arg \min _{\beta_{0}, \beta_{1}} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)^{2}.
$$
Then the first order conditions with respect to $\beta_0$ and $\beta_1$ give
$$
\begin{aligned}
&\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=\sum_{i=1}^{n} \hat{e}_{i}=0 \\&\sum_{i=1}^{n} x_{i}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=\sum_{i=1}^{n} x_{i} \hat{e}_{i}=0,
\end{aligned}
$$
which are known as the normal equations. We can rewrite them as
$$
n\beta_0 + \sum_{i=1}^{n}x_i \beta_1= \sum_{i=1}^{n}y_i
$$
and
$$
\sum_{i=1}^{n}x_i \beta_0 + \sum_{i=1}^{n}x_i^2 \beta_1= \sum_{i=1}^{n}x_iy_i.
$$
Then transform the equations into matrix form:
$$
\begin{bmatrix} n&\sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2\end{bmatrix} \begin{bmatrix} \beta_0\\\beta_1\end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n}y_i\\\sum_{i=1}^{n}x_iy_i \end{bmatrix}\quad
$$
By Cramer's rule,
$$
\begin{aligned}\hat{\beta}_1 =& \frac{\begin{vmatrix} n&\sum_{i=1}^{n}y_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_iy_i \end{vmatrix}}{\begin{vmatrix} n&\sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2\end{vmatrix}}\\=&\frac{n\sum_{i=1}^{n}x_iy_i- \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2} \\ =& \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}\end{aligned}
$$
and $\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}$.
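A minimal sketch computing $\hat{\beta}_1$ and $\hat{\beta}_0$ from these closed-form expressions on simulated data (the true values $\beta_0 = 1$ and $\beta_1 = 2$ below are hypothetical):
```python
import numpy as np

rng = np.random.default_rng(6)
n, beta0, beta1 = 1_000, 1.0, 2.0   # hypothetical true intercept and slope

x = rng.normal(size=n)
y = beta0 + beta1 * x + rng.normal(size=n)

# Closed-form OLS estimates from the derivation above.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()
print(b0_hat, b1_hat)   # should be close to 1.0 and 2.0
```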
#### Multiple Linear Regression
Now think of multiple linear regression,
$$
\mathbf{y} = \mathbf{X}\beta + \mathbf{e},
$$
where $\mathbf{y}$ and $\mathbf{e}$ are $n \times 1$ vectors and $\mathbf{X}$ is the $n \times (K+1)$ matrix of regressors (with a leading column of ones for the intercept, and $i$-th row $x_i^{\top}$). Consider the minimisation problem
$$
\min_{\beta} \;\sum^n_{i=1}\left(y_i-x_i^{\top}\beta\right)^2,
$$
whose objective in matrix form is
$$
(\mathbf{y}-\mathbf{X} \beta)^{\top}(\mathbf{y}-\mathbf{X} \beta)=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i 1}-\beta_{2} x_{i 2}-\cdots-\beta_{K} x_{i K}\right)^{2}.
$$
Expanding the quadratic form, we obtain
$$
\mathbf{y}^{\top} \mathbf{y}-\beta^{\top} \mathbf{X}^{\top} \mathbf{y}-\mathbf{y}^{\top} \mathbf{X} \beta+\beta^{\top} \mathbf{X}^{\top} \mathbf{X} \beta.
$$
By the first order condition, $-2\mathbf{X}^{\top}\mathbf{y}+2\mathbf{X}^{\top}\mathbf{X}\beta=0$ (note that $\beta^{\top}\mathbf{X}^{\top}\mathbf{y}$ and $\mathbf{y}^{\top}\mathbf{X}\beta$ are equal scalars), so solving for $\beta$ gives
$$
\hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.
$$
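A minimal sketch of the matrix formula on simulated data (the coefficient vector below is hypothetical); it solves the normal equations $\mathbf{X}^{\top}\mathbf{X}\hat{\beta} = \mathbf{X}^{\top}\mathbf{y}$ with a linear solver rather than forming the inverse explicitly, which is numerically more stable than computing $(\mathbf{X}^{\top}\mathbf{X})^{-1}$ and multiplying.
```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
beta = np.array([1.0, 2.0, -0.5])   # hypothetical [beta_0, beta_1, beta_2]

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two regressors
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to [1.0, 2.0, -0.5]
```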
[^1]:"Ordinary Least Squares." Wikipedia, Wikimedia Foundation, 5 Sept. 2022, https://en.wikipedia.org/wiki/Ordinary_least_squares.