# Estimators
{%hackmd @themes/orangeheart %}
###### tags: `Causal Inference and Prediction in Econometrics`
## Asymptotic Theory
Asymptotic theory, also called large-sample theory, provides a framework for describing how sequences of statistics converge or diverge as the sample size grows to infinity.
### The Concept of a Limit in Asymptotic Theory
Suppose there is a deterministic (non-random) sequence:
$$
\{a_1, a_2, \cdots, a_n\}
$$
The sequence converges to a constant $a$,
$$
\lim_{n \to \infty} a_n = a
$$

if for every $\delta > 0$ there exists $n_{\delta} < \infty$ such that for all $n > n_{\delta}$,
$$
|a_n - a| < \delta
$$
that is, the terms of the sequence eventually stay within any given distance $\delta$ of $a$.
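For a concrete illustration, take $a_n = 1 + \frac{1}{n}$, which converges to $a = 1$: for any $\delta > 0$, choosing $n_{\delta} = \lceil 1/\delta \rceil$ gives
$$
|a_n - a| = \frac{1}{n} < \delta \quad \text{for all } n > n_{\delta}.
$$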
### Convergence in Probability
Given a sequence of random variables,
$$
\{z_1, z_2 \cdots z_n\}
$$
if
$$
\lim_{n \to \infty} \mathbb{P}\big(|z_n - z| \leq \delta \big) = 1 \quad \text{ or } \quad \lim_{n \to \infty} \mathbb{P}\big(|z_n - z| > \delta \big) = 0, \quad \forall \delta > 0
$$
then we say the sequence converges in probability to a random variable $z$, called the **probability limit** of the sequence. This property can be written in any of the following ways:
$$
\{z_n\} \overset{p}{\to} z, \qquad z_n \overset{p}{\to} z, \qquad \operatorname{plim}_{n \to \infty} z_n = z
$$
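As a rough numerical illustration (a minimal simulation sketch, not part of the original notes), take $z_n = \min\{u_1, \dots, u_n\}$ with $u_i \sim \text{Uniform}(0,1)$, which converges in probability to $z = 0$; the Monte Carlo estimate of $\mathbb{P}(|z_n - z| > \delta)$ should shrink toward zero as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(0)
delta, reps = 0.05, 10_000

# z_n = min(u_1, ..., u_n) with u_i ~ Uniform(0, 1) converges in probability to z = 0,
# since P(|z_n - 0| > delta) = (1 - delta)^n -> 0 as n grows.
for n in [10, 100, 1000]:
    z_n = rng.uniform(size=(reps, n)).min(axis=1)   # one draw of z_n per replication
    print(n, np.mean(np.abs(z_n) > delta))          # Monte Carlo estimate of P(|z_n - z| > delta)
```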
### Law of Large Numbers (LLN)
The law of large numbers describes the result of performing the same experiment a large number of times: as the number of trials grows, the sample average gets close to the expected value,
$$
\overline{x} \overset{p}{\to} \mu
$$
For example, given a sequence of random variables,
$$
\{y_1, y_2 \cdots y_n\},
$$
and define the sample mean $\overline{y}_n = \frac{1}{n}\sum^n_{i=1}y_i$ with $\mathbb{E}(y_i) = \mu$. Suppose $y_i \overset{\text{i.i.d.}}{\sim} (\mu, \sigma^2)$ and the first moment exists, $\mathbb{E}|y_i| < \infty$. Then, as $n \to \infty$,
$$
\overline{y}_n \overset{p}{\to} \mu
$$
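A minimal simulation sketch of the LLN (illustrative only; the normal distribution and the parameter values below are arbitrary choices): the sample mean of i.i.d. draws settles near $\mu$ as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0   # arbitrary illustrative values

# The sample mean of i.i.d. N(mu, sigma^2) draws approaches mu as n grows.
for n in [10, 1_000, 100_000]:
    y = rng.normal(mu, sigma, size=n)
    print(n, y.mean())   # should get closer to 2.0
```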
### Convergence in Distribution
A sequence of random variables $Z_n$ defined as
$$
\{Z_1, Z_2, \cdots, Z_n\}
$$
with corresponding cumulative distribution functions (CDFs)
$$
F_1(z), F_2(z), \cdots, F_n(z)
$$
converges in distribution to a random variable $Z$ with CDF $F_Z$ if, at every point $z$ where $F_Z$ is continuous,
$$
\lim _{n \to \infty} F_n(z)=F_Z(z)
$$
Intuitively, $Z_n$ has approximately the same distribution as $Z$ when the sample size grows indefinitely, and we say the distribution of $Z$ is the limiting distribution of $Z_n$. We denote this by
$$
Z_n \overset{d}{\to} Z
$$
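As an illustration (a textbook example, not taken from these notes), if $u_1, \dots, u_n$ are i.i.d. Uniform$(0,1)$, then $Z_n = n \min\{u_1, \dots, u_n\}$ converges in distribution to an Exponential$(1)$ random variable. The sketch below compares the empirical CDF of simulated $Z_n$ with the limiting CDF $F_Z(z) = 1 - e^{-z}$ at a few points.
```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 20_000

# Z_n = n * min(u_1, ..., u_n) converges in distribution to Exponential(1),
# because P(Z_n > z) = (1 - z/n)^n -> exp(-z).
z_n = n * rng.uniform(size=(reps, n)).min(axis=1)

for z in [0.5, 1.0, 2.0]:
    print(z, np.mean(z_n <= z), 1 - np.exp(-z))   # empirical CDF vs limiting CDF
```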
### Central Limit Theorem (CLT)
Given a sequence of i.i.d. random variables,
$$
\{X_1, X_2, \cdots, X_n\},
$$
with expected value $\mathbb{E}(X_i) = \mu < \infty$ and variance $\operatorname{var}(X_i) = \sigma^2 < \infty$, let $\overline{X}_n = \frac{1}{n}\sum^n_{i=1}X_i$. Recentering and rescaling the sample average, we obtain
$$
Z_n=\frac{\bar{X}_n-\mathbb{E}\left(\bar{X}_n\right)}{\sqrt{\operatorname{Var}\left(\bar{X}_n\right)}}=\sqrt{n}\left(\frac{\bar{X}_n-\mu}{\sigma}\right) \stackrel{d}{\rightarrow} N(0,1)
$$
or
$$
\sqrt{n}\left(\bar{X}_n-\mu\right) \stackrel{d}{\rightarrow} N\left(0, \sigma^2\right)
$$
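A minimal simulation sketch of the CLT (with an arbitrarily chosen Exponential$(1)$ population, so $\mu = \sigma = 1$): even though the draws are skewed, the standardized sample mean behaves approximately like $N(0,1)$.
```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 50_000

# x_i i.i.d. Exponential(1), so mu = 1 and sigma = 1 (a deliberately skewed population).
x = rng.exponential(scale=1.0, size=(reps, n))
z_n = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0   # sqrt(n) * (x_bar - mu) / sigma

print(z_n.mean(), z_n.std())    # approximately 0 and 1
print(np.mean(z_n <= 1.96))     # approximately Phi(1.96) = 0.975
```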
## Unbiased and Consistent Estimators
Suppose $\beta_0$ is a vector of parameters of interest and $\hat{\beta}$ is an estimator of $\beta_0$. Then
- if $\mathbb{E}(\hat{\beta}) = \beta_0$, $\hat{\beta}$ is an **unbiased** estimator of $\beta_0$;
- if $\operatorname{plim}_{n \to \infty}\hat{\beta} = \beta_0$, i.e. $\hat{\beta} \overset{p}{\to}\beta_0$, $\hat{\beta}$ is a **consistent** estimator of $\beta_0$ (see the sketch below).
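As a standard textbook illustration (not from the original notes): the sample variance that divides by $n$ is biased but consistent, while dividing by $n-1$ makes it unbiased. A minimal simulation sketch:
```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, reps = 4.0, 20_000   # true variance and number of Monte Carlo replications

for n in [5, 50, 500]:
    y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    biased = y.var(axis=1, ddof=0)      # divides by n: biased, but consistent
    unbiased = y.var(axis=1, ddof=1)    # divides by n - 1: unbiased
    print(n, biased.mean(), unbiased.mean())   # approx (n-1)/n * 4 vs approx 4
```
Averaged over replications, the $1/n$ version sits below $\sigma^2 = 4$ for small $n$ but approaches it as $n$ grows, while the $1/(n-1)$ version is centred at $4$ for every $n$.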


### Ordinary Least Squares (OLS)
Ordinary least squares chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimising the sum of squared differences between the observed dependent variable in the given dataset and the values predicted by the linear function of the independent variables.[^1] Suppose we stack the observations $y_i$ and $x_i$ into matrices $\mathbf{Y}$ and $\mathbf{X}$, and define $m(x_i)$ to be the conditional expectation of $y_i$ given $x_i$,
$$
m(x_i) \equiv \mathbb{E}(y_i | x_i)
$$
and $m(x_i)$ is the best prediction in the sense that it minimises
$$
\mathbb{E}[y_i-g(x_i)]^2,
$$
where $g(\cdot)$ is some function of $x$. Writing $e_i \equiv y_i - m(x_i)$, so that $\mathbb{E}(e_i | x_i) = 0$, the proof is as follows:
$$
\begin{aligned}
\mathbb{E}[y_i-g(x_i)]^2 =& \mathbb{E}[y_i-m(x_i)+m(x_i)-g(x_i)]^2\\
=& \mathbb{E}\Big\{e_i^2 + 2e_i[m(x_i) - g(x_i)] + [m(x_i) - g(x_i)]^2\Big\}\\
=& \mathbb{E}(e_i^2) + \underbrace{\mathbb{E}[m(x_i) - g(x_i)]^2}_{\geq\, 0},
\end{aligned}
$$
where the cross term vanishes by the law of iterated expectations, since $\mathbb{E}\big\{e_i[m(x_i) - g(x_i)]\big\} = \mathbb{E}\big\{\mathbb{E}(e_i|x_i)\,[m(x_i) - g(x_i)]\big\} = 0$. This shows that the $\operatorname{MSE}$ is minimised when $g(x_i) = m(x_i)$.
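A rough numerical check of this claim, under an assumed data-generating process $y_i = x_i^2 + e_i$ with $\mathbb{E}(e_i|x_i)=0$, so $m(x) = x^2$ (both the DGP and the comparison predictor $g(x) = 2x$ are illustrative choices):
```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Assumed DGP for illustration: y = x^2 + e with E(e | x) = 0, so m(x) = x^2.
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(0, 1, size=n)

mse_m = np.mean((y - x**2) ** 2)    # predictor m(x) = E(y | x)
mse_g = np.mean((y - 2 * x) ** 2)   # an arbitrary alternative predictor g(x) = 2x
print(mse_m, mse_g)                 # mse_m should be the smaller of the two
```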
#### Simple Linear Regression
Now given a simple linear regression,
$$
y_i = \beta_0 + \beta_1 x_i + e_i,
$$
we choose the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise the sum of squared residuals:
$$
\arg \min _{\beta_{0}, \beta_{1}} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)^{2}.
$$
Then the first order conditions with respect to $\beta_0$ and $\beta_1$ give
$$
\begin{aligned}
&\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=\sum_{i=1}^{n} \hat{e}_{i}=0 \\&\sum_{i=1}^{n} x_{i}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=\sum_{i=1}^{n} x_{i} \hat{e}_{i}=0,
\end{aligned}
$$
which are known as the normal equations. We can rewrite them as
$$
n\beta_0 + \sum_{i=1}^{n}x_i \beta_1= \sum_{i=1}^{n}y_i
$$
and
$$
\sum_{i=1}^{n}x_i \beta_0 + \sum_{i=1}^{n}x_i^2 \beta_1= \sum_{i=1}^{n}x_iy_i.
$$
Then transform the equations into matrix form:
$$
\begin{bmatrix} n&\sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2\end{bmatrix} \begin{bmatrix} \beta_0\\\beta_1\end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n}y_i\\\sum_{i=1}^{n}x_iy_i \end{bmatrix}\quad
$$
By Cramer's rule,
$$
\begin{aligned}\hat{\beta}_1 =& \frac{\begin{vmatrix} n&\sum_{i=1}^{n}y_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_iy_i \end{vmatrix}}{\begin{vmatrix} n&\sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2\end{vmatrix}}\\=&\frac{n\sum_{i=1}^{n}x_iy_i- \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2} \\ =& \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}\end{aligned}
$$
and $\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}$.
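A minimal sketch computing $\hat{\beta}_1$ and $\hat{\beta}_0$ from these closed-form expressions on simulated data (the true values $\beta_0 = 1$ and $\beta_1 = 2$ below are hypothetical):
```python
import numpy as np

rng = np.random.default_rng(6)
n, beta0, beta1 = 1_000, 1.0, 2.0   # hypothetical true intercept and slope

x = rng.normal(size=n)
y = beta0 + beta1 * x + rng.normal(size=n)

# Closed-form OLS estimates from the derivation above.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()
print(b0_hat, b1_hat)   # should be close to 1.0 and 2.0
```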
#### Multiple Linear Regression
Now think of multiple linear regression,
$$
\mathbf{y} = \mathbf{X}\beta + \mathbf{e},
$$
where $\mathbf{y}$ and $\mathbf{e}$ are $n \times 1$ vectors and $\mathbf{X}$ is the $n \times (K+1)$ matrix of regressors (with a leading column of ones for the intercept, and $i$-th row $x_i^{\top}$). Consider the minimisation problem
$$
\min_{\beta} \;\sum^n_{i=1}\left(y_i-x_i^{\top}\beta\right)^2,
$$
whose objective in matrix form is
$$
(\mathbf{y}-\mathbf{X} \beta)^{\top}(\mathbf{y}-\mathbf{X} \beta)=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i 1}-\beta_{2} x_{i 2}-\cdots-\beta_{K} x_{i K}\right)^{2}.
$$
Expanding the quadratic form, we obtain
$$
\mathbf{y}^{\top} \mathbf{y}-\beta^{\top} \mathbf{X}^{\top} \mathbf{y}-\mathbf{y}^{\top} \mathbf{X} \beta+\beta^{\top} \mathbf{X}^{\top} \mathbf{X} \beta.
$$
By the first order condition, $-2\mathbf{X}^{\top}\mathbf{y}+2\mathbf{X}^{\top}\mathbf{X}\beta=0$ (note that $\beta^{\top}\mathbf{X}^{\top}\mathbf{y}$ and $\mathbf{y}^{\top}\mathbf{X}\beta$ are equal scalars), so solving for $\beta$ gives
$$
\hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.
$$
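A minimal sketch of the matrix formula on simulated data (the coefficient vector below is hypothetical); it solves the normal equations $\mathbf{X}^{\top}\mathbf{X}\hat{\beta} = \mathbf{X}^{\top}\mathbf{y}$ with a linear solver rather than forming the inverse explicitly, which is numerically more stable than computing $(\mathbf{X}^{\top}\mathbf{X})^{-1}$ and multiplying.
```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
beta = np.array([1.0, 2.0, -0.5])   # hypothetical [beta_0, beta_1, beta_2]

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two regressors
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to [1.0, 2.0, -0.5]
```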
[^1]:"Ordinary Least Squares." Wikipedia, Wikimedia Foundation, 5 Sept. 2022, https://en.wikipedia.org/wiki/Ordinary_least_squares.