# Basics about probability distributions and Gaussians

written by [@marc_lelarge](https://twitter.com/marc_lelarge)

## Probability recap

We start with real random variables (r.v.).

1- **Why is the variance non-negative?** Recall that $\text{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, so $\text{Var}(X)\geq 0$ means that $\mathbb{E}[X^2] \geq \mathbb{E}[X]^2$.

**answer:** Start with
\begin{eqnarray}
\mathbb{E}\left[(X-\mathbb{E}[X])^2 \right] &=& \mathbb{E}\left[ X^2 - 2X\mathbb{E}[X]+\mathbb{E}[X]^2\right]\\
&=& \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2\\
&=& \mathbb{E}[X^2] - \mathbb{E}[X]^2\\
&=& \text{Var}(X).
\end{eqnarray}
Since $(X-\mathbb{E}[X])^2\geq 0$, its expectation is non-negative, hence $\text{Var}(X)\geq 0$.

Similarly, we have for the [**Covariance**](https://en.wikipedia.org/wiki/Covariance) of the random variables $X$ and $Y$:
\begin{eqnarray*}
\text{Cov}(X,Y) = \mathbb{E}\left[ (X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y].
\end{eqnarray*}
Note that $\text{Var}(X) = \text{Cov}(X,X)$. For $a,b\in \mathbb{R}$, we have $\text{Var}(aX+b) = a^2 \text{Var}(X)$; note that we use the standard notation where capital letters denote random variables and lowercase letters denote constants or parameters. We also have
\begin{eqnarray}
\text{Var}(X+Y) = \text{Var}(X) +\text{Var}(Y) +2\,\text{Cov}(X,Y).
\end{eqnarray}

2- **How to compute moments?** We start with a remark: if a random variable has a symmetric density, i.e. $p(x) = p(-x)$ for all $x\in \mathbb{R}$, then its odd moments (when they exist) are zero: $\mathbb{E}[X^{2k+1}] = 0$.

**answer:** thanks to the [**moment generating function**](https://en.wikipedia.org/wiki/Moment-generating_function) $M_X(t) = \mathbb{E}\left[ e^{tX}\right]$, we get:
\begin{eqnarray}
\mathbb{E}\left[ X^k\right] = \left[\frac{d^kM_X(t)}{dt^k}\right]_{t=0}.
\end{eqnarray}
To understand why this is true, we can write $e^{tX} = \sum_{k\geq 0} \frac{(tX)^k}{k!} = 1+tX+\frac{(tX)^2}{2}+\frac{(tX)^3}{6}+\dots$ so that we have:
\begin{eqnarray}
\frac{d}{dt}e^{tX} &=& X + tX^2+\frac{t^2X^3}{2}+\dots\\
\frac{d^2}{dt^2}e^{tX} &=& X^2 + tX^3 +\dots\\
&\dots&
\end{eqnarray}
Evaluating the $k$-th derivative at $t=0$ and taking expectations gives $\mathbb{E}[X^k]$.

Let's apply this method to the normalized Gaussian random variable $Z$. We have
\begin{eqnarray}
M_Z(t) &=& \mathbb{E}\left[ e^{tZ}\right]\\
&=& \int_{-\infty}^\infty e^{tz}\frac{e^{-z^2/2}}{\sqrt{2\pi}}dz\\
&=& \int_{-\infty}^\infty \frac{e^{-(z^2-2tz+t^2)/2}}{\sqrt{2\pi}} e^{t^2/2}dz\\
&=& e^{t^2/2}\int_{-\infty}^\infty \frac{e^{-(z-t)^2/2}}{\sqrt{2\pi}}dz\\
&=& e^{t^2/2},
\end{eqnarray}
since the last integrand is the density of $\mathcal{N}(t,1)$ and integrates to one. In particular, we have
\begin{eqnarray}
M'_Z(t) &=& te^{t^2/2}\\
M''_Z(t) &=& (1+t^2)e^{t^2/2}\\
M^{(3)}_Z(t) &=& (3t+t^3)e^{t^2/2}\\
M^{(4)}_Z(t) &=& (3+6t^2+t^4)e^{t^2/2}\dots
\end{eqnarray}
so that $\mathbb{E}[Z] = 0$, $\mathbb{E}[Z^2] = 1$, $\mathbb{E}[Z^3] = 0$, $\mathbb{E}[Z^4] = 3$... Note that we already knew that the odd moments are zero, but if you need the fourth moment, you still have to compute the fourth derivative.

Note that for simple distributions, the direct computation might be easier. For example, for the uniform distribution on the interval $[a,b]$ with $a<b$, we have for $U\sim \text{Unif}(a,b)$:
\begin{eqnarray}
M_U(t) = \frac{e^{tb}-e^{ta}}{t(b-a)}, \quad t\neq 0,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{E}[U] &=& \int_a^b x\frac{dx}{b-a} = \frac{b^2-a^2}{2(b-a)} = \frac{a+b}{2},\\
\mathbb{E}[U^2] &=&\int_a^b x^2\frac{dx}{b-a} = \frac{a^2+ab+b^2}{3},
\end{eqnarray}
so that $\text{Var}(U) = \frac{(b-a)^2}{12}$.
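As a quick sanity check (not part of the original derivation), here is a minimal NumPy Monte Carlo sketch of the moments computed above; the sample size `n`, the seed, and the endpoints `a, b` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # arbitrary sample size for the Monte Carlo estimates

# Standard Gaussian moments: E[Z]=0, E[Z^2]=1, E[Z^3]=0, E[Z^4]=3
z = rng.standard_normal(n)
for k in (1, 2, 3, 4):
    print(f"E[Z^{k}] ~ {np.mean(z ** k): .3f}")

# Uniform(a, b): E[U] = (a + b)/2 and Var(U) = (b - a)^2 / 12
a, b = -1.0, 3.0
u = rng.uniform(a, b, size=n)
print(f"E[U]   ~ {u.mean(): .3f}  (exact {(a + b) / 2})")
print(f"Var(U) ~ {u.var(): .3f}  (exact {(b - a) ** 2 / 12:.3f})")
```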
3- **Independence implies null covariance but null covariance does not imply independence!**

Here is a simple example: consider the random variables $(X,Y)$ equal to $(-1,1)$, $(0,-2)$ or $(1,1)$ with equal probability. We clearly have $\mathbb{E}[X] = \mathbb{E}[Y]=0$ and $\text{Cov}(X,Y) = \mathbb{E}[XY] = \frac{-1}{3} + \frac{0}{3} + \frac{1}{3} =0$, but $X$ and $Y$ are not independent since knowing $X$ determines $Y$. More formally, we have for example $\mathbb{E}[X^2] = 2/3$ and $\mathbb{E}[X^2Y] = 2/3 \neq \mathbb{E}[X^2]\mathbb{E}[Y]$.

## [Gaussian random variables](https://en.wikipedia.org/wiki/Normal_distribution)

4- **If $X$ is a Gaussian r.v. and $Y$ is another Gaussian r.v. such that $X \perp \!\!\! \perp Y$, then $(X,Y)$ is a Gaussian vector.**

5- **If $(X,Y)$ is a Gaussian vector, then $X \perp \!\!\! \perp Y$ is equivalent to $\text{Cov}(X,Y)=0$.**

But even if $X\sim \mathcal{N}(0,1)$, $Y\sim \mathcal{N}(0,1)$ and $\text{Cov}(X,Y) =0$, this does not imply that $(X,Y)$ is a Gaussian vector. Here is a simple counter-example: take $X\sim \mathcal{N}(0,1)$ and define for $a>0$:
\begin{eqnarray}
Y = \left\{\begin{array}{cc}
X& \text{if } |X|>a\\
-X& \text{if } |X|\leq a
\end{array}\right.
\end{eqnarray}
It is easy to see that $Y\sim \mathcal{N}(0,1)$ (by symmetry of the Gaussian density); moreover, we have
\begin{eqnarray}
\text{Cov}(X,Y)&=& \mathbb{E}[XY]\\
&=&\mathbb{E}[X^2\mathbf{1}(|X|>a)] - \mathbb{E}[X^2\mathbf{1}(|X|\leq a)].
\end{eqnarray}
We see that for $a\to 0$, we have $\text{Cov}(X,Y) \to 1$ and for $a\to \infty$, we have $\text{Cov}(X,Y) \to -1$, so by continuity there exists a value of $a>0$ for which $\text{Cov}(X,Y) =0$ (a numerical sketch locating this value is given after item 6 below). But $(X,Y)$ is not a Gaussian vector since $X+Y$ is never a Gaussian r.v.:
\begin{eqnarray}
X+Y = \left\{\begin{array}{cc}
2X& \text{if } |X|>a\\
0& \text{if } |X|\leq a
\end{array}\right.
\end{eqnarray}
so $X+Y$ has an atom at $0$.

6- **Moments of Gaussian r.v.** We have for $X\sim \mathcal{N}(0,1)$
\begin{eqnarray}
M_X(t) &=& \mathbb{E}[e^{tX}]\\
&=& \int e^{tx}\frac{e^{-x^2/2}}{\sqrt{2\pi}}dx\\
&=& \int \frac{e^{-(x^2-2tx+t^2)/2}}{\sqrt{2\pi}}\,e^{t^2/2}\,dx\\
&=& e^{t^2/2}\int \frac{e^{-(x-t)^2/2}}{\sqrt{2\pi}}dx\\
&=& e^{t^2/2}.
\end{eqnarray}
Expanding $e^{t^2/2} = \sum_{m\geq 0} \frac{t^{2m}}{2^m m!}$ and matching with $M_X(t) = \sum_{k\geq 0}\frac{t^k}{k!}\mathbb{E}[X^k]$, we see that the moments for $X\sim \mathcal{N}(0,1)$ are given by $\mathbb{E}[X^{2m+1}] =0$ and
$$
\mathbb{E}[X^{2m}] = \frac{(2m)!}{2^m m!}.
$$
In general, for $X\sim \mathcal{N}(0,1)$ we have
$$
\mathbb{E}[(\mu+\sigma X)^{k}] = \sum_{m=0}^k {k \choose m} \mu^m \sigma^{k-m}\mathbb{E}[X^{k-m}],
$$
which gives the moments of a general $\mathcal{N}(\mu,\sigma^2)$ random variable.
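To make the counter-example in 5- concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the helper `cov_xy`, the root-search bracket and the sample size are illustrative choices, not part of the original note). It locates the value of $a$ with zero covariance, using $\mathbb{E}[X^2\mathbf{1}(|X|\leq a)] = 2\Phi(a)-1-2a\varphi(a)$ (integration by parts), and then checks by simulation that $X+Y$ has an atom at $0$, so $(X,Y)$ cannot be jointly Gaussian.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Cov(X,Y) = E[X^2 1(|X|>a)] - E[X^2 1(|X|<=a)] = 1 - 2*E[X^2 1(|X|<=a)],
# with E[X^2 1(|X|<=a)] = 2*Phi(a) - 1 - 2*a*phi(a).
def cov_xy(a):
    inside = 2 * norm.cdf(a) - 1 - 2 * a * norm.pdf(a)
    return 1.0 - 2.0 * inside

a_star = brentq(cov_xy, 0.1, 5.0)  # the value of a with zero covariance
print(f"a* ~ {a_star:.4f}, Cov(X,Y) at a*: {cov_xy(a_star):.2e}")

# Monte Carlo check: Y is N(0,1) and Cov(X,Y) ~ 0, but X+Y has an atom at 0
# with mass P(|X| <= a*) = 2*Phi(a*) - 1, so (X,Y) is not a Gaussian vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = np.where(np.abs(x) > a_star, x, -x)
print("empirical Cov(X,Y):", np.cov(x, y)[0, 1])
print("P(X+Y = 0):", np.mean(x + y == 0), " vs ", 2 * norm.cdf(a_star) - 1)
```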
## [Gaussian vectors](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)

7- **Partitioned Gaussians**

We consider now a Gaussian vector $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$ that we decompose as $\mathbf{x} = {\mathbf{x}_a \choose \mathbf{x}_b}$. We consider the same decomposition for the parameters:
\begin{eqnarray}
\mu =\left( \begin{array}{c}
\mu_a\\
\mu_b
\end{array}\right), \quad
\Sigma = \left( \begin{array}{cc}
\Sigma_{aa}&\Sigma_{ab}\\
\Sigma_{ba}&\Sigma_{bb}
\end{array}\right).
\end{eqnarray}
Note that $\Sigma_{ba} = \Sigma_{ab}^T$. We also introduce the precision matrix $\Lambda = \Sigma^{-1}$ and decompose it as:
\begin{eqnarray}
\Lambda = \left( \begin{array}{cc}
\Lambda_{aa}&\Lambda_{ab}\\
\Lambda_{ba}&\Lambda_{bb}
\end{array}\right).
\end{eqnarray}
Note that $\Lambda_{aa}\neq \Sigma_{aa}^{-1}$; indeed we can use the following formula for the inverse of a partitioned matrix:
\begin{eqnarray}
\left( \begin{array}{cc}
A&B\\
C&D
\end{array}\right)^{-1} = \left( \begin{array}{cc}
M&-MBD^{-1}\\
-D^{-1}CM& D^{-1}+D^{-1}CMBD^{-1}
\end{array}\right),
\end{eqnarray}
where $M=\left(A-BD^{-1}C\right)^{-1}$. Hence we see that $\Lambda_{aa} = \left(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \right)^{-1}$.

**Conditional distributions:**
\begin{eqnarray}
p(x_a|x_b) = \mathcal{N}(x_a|\mu_{a|b}, \Lambda_{aa}^{-1}),
\end{eqnarray}
with $\mu_{a|b} = \mu_a-\Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$.

**Marginal distribution:**
\begin{eqnarray}
p(x_a) = \mathcal{N}(x_a|\mu_{a}, \Sigma_{aa}).
\end{eqnarray}

8- **Marginal and Conditional Gaussians**

Consider a Gaussian vector:
\begin{align*}
p(x) = \mathcal{N}(x|\mu, \Lambda^{-1})
\end{align*}
and a linear Gaussian model:
\begin{align*}
p(y|x) = \mathcal{N}(y|Ax+b, L^{-1}),
\end{align*}
where $A,b,\mu$ are parameters governing the means, and $\Lambda$ and $L$ are precision matrices. Then $z = {x \choose y}$ is a Gaussian vector and we have
\begin{eqnarray*}
p(y) &=& \mathcal{N}(y|A\mu+b, L^{-1}+A\Lambda^{-1}A^T), \\
p(x|y) &=& \mathcal{N}(x|\Sigma\left\{A^TL(y-b)+\Lambda \mu\right\}, \Sigma),
\end{eqnarray*}
with $\Sigma = \left(\Lambda +A^TLA \right)^{-1}$.

###### tags: `public` `tutorials`
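As a consistency check (not in the original note), the following sketch verifies numerically that the conditional $p(x|y)$ obtained from the partitioned-Gaussian formulas of 7- agrees with the closed form of 8-. The dimensions, the parameters $A, b, \mu, \Lambda, L$ and the observed $y$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters, only for illustration.
d_x, d_y = 2, 3
mu = rng.normal(size=d_x)
A = rng.normal(size=(d_y, d_x))
b = rng.normal(size=d_y)
Lam = 2.0 * np.eye(d_x)   # precision of x
L = 0.5 * np.eye(d_y)     # precision of y given x
y = rng.normal(size=d_y)  # an arbitrary observation

# Joint covariance of z = (x, y): Sigma_xx = Lam^{-1}, Sigma_xy = Lam^{-1} A^T,
# Sigma_yy = L^{-1} + A Lam^{-1} A^T; the joint mean is (mu, A mu + b).
S_xx = np.linalg.inv(Lam)
S_xy = S_xx @ A.T
S_yy = np.linalg.inv(L) + A @ S_xx @ A.T

# Conditional p(x|y) from the partitioned formulas of item 7
cond_cov_7 = S_xx - S_xy @ np.linalg.solve(S_yy, S_xy.T)
cond_mean_7 = mu + S_xy @ np.linalg.solve(S_yy, y - (A @ mu + b))

# Conditional p(x|y) from the linear Gaussian model formulas of item 8
Sigma = np.linalg.inv(Lam + A.T @ L @ A)
cond_mean_8 = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)

print(np.allclose(cond_cov_7, Sigma))         # True
print(np.allclose(cond_mean_7, cond_mean_8))  # True
```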