{%hackmd 5xqeIJ7VRCGBfLtfMi0_IQ %}

# Variance estimation

## Question

Given a set of samples $\bx = (x_1, \ldots, x_N)$, what is its variance?  It is generally agreed that the mean is

$$
\overline{\bx} = \frac{1}{N}\sum_{i=1}^N x_i.
$$

For the variance, you may have seen both

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \overline{\bx})^2 \text{ and } s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{\bx})^2.
$$

Which formula for the variance is correct?

## Experiments

You need: [handout](https://www.math.nsysu.edu.tw/~chlin/math-runway/covariance-estimation.pdf), 5 dice per group

1. Review that the variance of a _fair_ die is $\frac{35}{12} = 2.916\cdots$.
2. Roll 5 dice at once.  Record their numbers, calculate the mean, and calculate the value $s^2$.
3. Calculate the mean of the column of $s^2$ values.  Check whether it is close to $2.916\cdots$.
4. If there are several groups, we may combine the data together.

## Intuition

In probability theory, we assume we know the details of the distribution of a random variable $X$, where the event $X = x$ happens with probability $p_x$.  Here is an example of a fair die.

| value | 1 | 2 | 3 | 4 | 5 | 6 |
| ----------- | --- | --- | --- | --- | --- | --- |
| probability | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |

There are formal definitions for the mean and the variance of $X$:

$$
\mathbb{E}[X] = \sum_x p_x\cdot x \text{ and } \operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2].
$$

This variance is usually called the _population variance_, indicating that you know the details of the whole population.

However, we never know whether a die is fair or not.  We can only collect some samples and use these data to _estimate_ the population variance.  Here comes the problem: you obtain some samples $\bx = (x_1, \ldots, x_N)$, but when you want to estimate the variance $\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$, you do not know the mean $\mathbb{E}[X]$ either.
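As a concrete instance of the definitions above, the mean and variance of a fair die can be computed directly from the table; the sketch below is my own illustration (the variable names are not from the handout), using exact rational arithmetic.

```python
from fractions import Fraction

# A fair die: each value 1..6 occurs with probability 1/6.
dist = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of p_x * x
mean = sum(p * x for x, p in dist.items())

# Var(X) = E[(X - E[X])^2]
var = sum(p * (x - mean) ** 2 for x, p in dist.items())

print(mean)  # 7/2
print(var)   # 35/12, i.e. 2.916...
```

Working with `Fraction` instead of floats keeps the answer exact, which is why the value $2.916\cdots$ shows up as the rational number $35/12$.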
The only thing we can do is to replace it by the _sample mean_

$$
\overline{\bx} = \frac{1}{N}\sum_{i=1}^N x_i,
$$

and then calculate the _sample variance_

$$
s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \overline{\bx})^2.
$$

The point is that $s^2$ is itself a random variable, depending on your samples.  Ideally, when you run the experiment several times, the average of $s^2$ should be close to the real answer $\operatorname{Var}(X)$.  As you have seen in the experiment, setting the denominator to $N - 1$ surprisingly did the job!

## More questions to think about

1. Calculate the variance of a _fair_ coin with two sides $0$ and $1$.
2. Consider the random variable $\overline{\bx}$, the mean of five dice.  Describe its probability distribution.
3. Consider the random variable $s^2$, the sample variance of five dice.  Describe its probability distribution.

## Resources

1. [YouTube: Why Sample Variance is Divided by n-1 by Krish Naik](https://youtu.be/vGsRwB3TsiE?si=MYVKGRhHdANgt85-)
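The dice experiment can also be repeated in software many more times than is practical by hand.  The sketch below is a minimal simulation of my own (the seed and trial count are arbitrary choices, not from the handout): it rolls five fair dice repeatedly and averages both candidate formulas, the $N$-denominator $\sigma^2$ and the $(N-1)$-denominator $s^2$.

```python
import random
import statistics

random.seed(0)

TRIALS = 50_000
DICE = 5

total_pop = 0.0   # running sum of sigma^2 values (denominator N)
total_samp = 0.0  # running sum of s^2 values     (denominator N - 1)
for _ in range(TRIALS):
    rolls = [random.randint(1, 6) for _ in range(DICE)]
    total_pop += statistics.pvariance(rolls)   # divides by N
    total_samp += statistics.variance(rolls)   # divides by N - 1

print(total_pop / TRIALS)   # noticeably below 35/12 = 2.916...
print(total_samp / TRIALS)  # close to 35/12
```

The average of the $\sigma^2$ column settles near $\frac{N-1}{N}\cdot\frac{35}{12} = \frac{7}{3}$ rather than $\frac{35}{12}$, which is exactly the bias that the $N-1$ denominator corrects.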