{%hackmd 5xqeIJ7VRCGBfLtfMi0_IQ %}
# Variance estimation
## Question
Given a set of samples $\bx = (x_1, \ldots, x_N)$. What is its covariacne? It is agreeable that the mean is
$$
\overline{\bx} = \frac{1}{N}\sum_{i=1}^N x_i.
$$
For the variance, you may have seen two formulas:
$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \overline{\bx})^2 \text{ and }s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{\bx})^2.
$$
Which formula for the variance is correct?
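The two candidate formulas can be computed side by side. Below is a minimal sketch with an arbitrary sample chosen only for illustration:

```python
# Computing both candidate estimators for a small sample.

def sample_mean(xs):
    return sum(xs) / len(xs)

def var_n(xs):
    """Divide the sum of squared deviations by N (the sigma^2 formula)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_n_minus_1(xs):
    """Divide by N - 1 instead (the s^2 formula)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

xs = [2, 4, 4, 4, 5, 5, 7, 9]   # arbitrary illustrative data
print(sample_mean(xs))           # 5.0
print(var_n(xs))                 # 4.0
print(var_n_minus_1(xs))         # 4.571...
```

The two values always differ by the factor $\frac{N}{N-1}$; the question is which one is the right estimate.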
## Experiments
You need: [handout](https://www.math.nsysu.edu.tw/~chlin/math-runway/covariance-estimation.pdf), 5 dice per group
1. Review that the variance of a _fair_ die is $\frac{35}{12} = 2.916\cdots$.
2. Roll 5 dice at once. Record their numbers, then calculate the mean and the values $\sigma^2$ and $s^2$.
3. Calculate the mean of the $\sigma^2$ column and the mean of the $s^2$ column. Check which one is close to $2.916\cdots$.
4. If there are several groups, we may combine the data together.
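If you do not have dice at hand, the experiment can also be simulated. The sketch below assumes fair dice and simulates 10,000 "groups" of 5 rolls each:

```python
# Simulating the dice experiment: roll 5 fair dice, compute s^2,
# repeat many times, and average the results.
import random

random.seed(0)  # arbitrary seed, for reproducibility only

def s_squared(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

trials = [[random.randint(1, 6) for _ in range(5)] for _ in range(10_000)]
avg_s2 = sum(s_squared(t) for t in trials) / len(trials)
print(avg_s2)  # should hover near 35/12 = 2.916...
```

More trials make the average settle closer to $2.916\cdots$, just as combining data from several groups does in the classroom.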
## Intuition
In probability theory, we assume we know the full probability distribution of a random variable $X$, where the event $X = x$ happens with probability $p_x$. Here is an example of a fair die.
| value | 1 | 2 | 3 | 4 | 5 | 6 |
| ----------- | --- | --- | --- | --- | --- | --- |
| probability | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
There are formal definitions for the mean and the variance of $X$:
$$
\mathbb{E}[X] = \sum_x p_x\cdot x \text{ and } \operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2].
$$
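These definitions can be evaluated directly for the fair die above. A small sketch using exact fractions:

```python
# Evaluating E[X] and Var(X) for a fair die, where each face 1..6
# has probability 1/6.
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(px * x for x, px in p.items())                # E[X]
var = sum(px * (x - mean) ** 2 for x, px in p.items())   # Var(X)

print(mean)  # 7/2
print(var)   # 35/12  (= 2.9166...)
```

This confirms the value $\frac{35}{12} = 2.916\cdots$ used in the experiment.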
This variance is usually called the _population variance_, indicating that it is computed from the whole population, whose distribution you know in full.
However, we never know whether a die is fair. We can only collect some samples and use these data to _estimate_ the population variance. Here comes the problem: when we want to estimate $\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$ from samples $\bx = (x_1, \ldots, x_N)$, we do not know the mean $\mathbb{E}[X]$ either. The only thing we can do is replace it by the _sample mean_
$$
\overline{\bx} = \frac{1}{N}\sum_{i=1}^N x_i.
$$
and then calculate the _sample variance_
$$
s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{\bx})^2.
$$
The point is that $s^2$ is itself a random variable, depending on your samples. Ideally, when you run the experiment many times, the average of $s^2$ should be close to the true answer $\operatorname{Var}(X)$. As you have seen in the experiment, setting the denominator to $N - 1$ surprisingly does the job!
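One way to see this numerically is to average many values of $\sigma^2$ and $s^2$ for samples drawn from a fair die. The sketch below uses sample size $N = 5$; the $\sigma^2$ average drifts toward $\frac{N-1}{N}\operatorname{Var}(X)$, while the $s^2$ average lands on $\operatorname{Var}(X)$ itself:

```python
# Comparing the long-run averages of sigma^2 (divide by N) and
# s^2 (divide by N - 1) for samples of 5 fair dice.
import random

random.seed(1)  # arbitrary seed, for reproducibility only
N, TRIALS = 5, 20_000
sum_sigma2 = sum_s2 = 0.0

for _ in range(TRIALS):
    xs = [random.randint(1, 6) for _ in range(N)]
    m = sum(xs) / N
    ss = sum((x - m) ** 2 for x in xs)   # sum of squared deviations
    sum_sigma2 += ss / N
    sum_s2 += ss / (N - 1)

print(sum_sigma2 / TRIALS)  # near (N-1)/N * 35/12 = 2.333...
print(sum_s2 / TRIALS)      # near 35/12 = 2.916...
```

The systematic shortfall of $\sigma^2$ comes from measuring deviations against $\overline{\bx}$ instead of the unknown $\mathbb{E}[X]$; dividing by $N - 1$ exactly compensates for it.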
## More questions to think about
1. Calculate the variance of a _fair_ coin with two sides, $0$ and $1$.
2. Consider a random variable $\overline{\bx}$ as the mean of five dice. Describe its probability distribution.
3. Consider a random variable $s^2$ as the sample variance of five dice. Describe its probability distribution.
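For questions 2 and 3, simulation is a quick way to build intuition before working out the distributions by hand. A sketch that summarizes the simulated distribution of $\overline{\bx}$ for five fair dice:

```python
# Exploring the distribution of the sample mean of five fair dice:
# its mean should be near E[X] = 3.5 and its variance near Var(X)/5 = 7/12.
import random

random.seed(2)  # arbitrary seed, for reproducibility only
TRIALS = 20_000
means = []

for _ in range(TRIALS):
    xs = [random.randint(1, 6) for _ in range(5)]
    means.append(sum(xs) / 5)

avg = sum(means) / TRIALS
spread = sum((m - avg) ** 2 for m in means) / TRIALS

print(avg)     # near 3.5
print(spread)  # near 7/12 = 0.583...
```

Collecting the $s^2$ values the same way lets you inspect question 3; note that its distribution is noticeably skewed even though its average is $\frac{35}{12}$.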
## Resources
1. [YouTube: Why Sample Variance is Divided by n-1 by Krish Naik](https://youtu.be/vGsRwB3TsiE?si=MYVKGRhHdANgt85-)