
Basics about probability distribution and Gaussians

written by @marc_lelarge

Probability recap

We start with real random variables (r.v.).

1- Why is variance positive?

Recall that $\mathrm{Var}(X)=E[X^2]-E[X]^2$, so $\mathrm{Var}(X)\geq 0$ means that $E[X^2]\geq E[X]^2$.

answer: Start with

$$E[(X-E[X])^2]=E[X^2-2XE[X]+E[X]^2]=E[X^2]-2E[X]E[X]+E[X]^2=E[X^2]-E[X]^2=\mathrm{Var}(X).$$

Similarly, we have for the covariance of the random variables $X$ and $Y$:

$$\mathrm{Cov}(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y].$$

Note that $\mathrm{Var}(X)=\mathrm{Cov}(X,X)$. We have, for $a,b\in\mathbb{R}$, $\mathrm{Var}(aX+b)=a^2\mathrm{Var}(X)$; note that we use the standard notation where capital letters denote random variables and lowercase letters denote constants or parameters.

We have

$$\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y).$$
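This identity is easy to check numerically on a small discrete joint distribution. The sketch below (assuming `numpy` is available; the three support points are chosen arbitrarily for illustration) compares both sides exactly:

```python
import numpy as np

# A small discrete joint distribution for (X, Y): three equally likely pairs.
# (Support points chosen arbitrarily for illustration.)
pairs = np.array([(1.0, 2.0), (3.0, -1.0), (0.0, 4.0)])
p = np.full(3, 1/3)
x, y = pairs[:, 0], pairs[:, 1]

def mean(v):   return float(np.sum(p * v))
def var(v):    return mean(v**2) - mean(v)**2
def cov(u, v): return mean(u * v) - mean(u) * mean(v)

lhs = var(x + y)
rhs = var(x) + var(y) + 2 * cov(x, y)
```

Any finite distribution works here: the identity holds term by term once the variances and covariance are expanded.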

2- How to compute moments?

We start with a remark: if a random variable has a symmetric density function, i.e. $p(X=x)=p(X=-x)$ for all $x\in\mathbb{R}$, then its odd moments are zero: $E[X^{2k+1}]=0$.

answer thanks to the moment generating function:

$$M_X(t)=E[e^{tX}],$$ so that we get:
$$E[X^k]=\left[\frac{d^k M_X(t)}{dt^k}\right]_{t=0}.$$

To understand why this is true, we can write:

$$e^{tX}=\sum_{k\geq 0}\frac{(tX)^k}{k!}=1+tX+\frac{(tX)^2}{2}+\frac{(tX)^3}{6}+\cdots$$ so that we have:
$$\frac{d}{dt}e^{tX}=X+tX^2+\frac{t^2X^3}{2}+\cdots,\qquad \frac{d^2}{dt^2}e^{tX}=X^2+tX^3+\cdots$$

Let's apply this method to the normalized Gaussian random variable $Z$. We have

$$M_Z(t)=E[e^{tZ}]=\int e^{tz}\frac{e^{-z^2/2}}{\sqrt{2\pi}}dz=\int \frac{e^{-(z^2-2tz+t^2)/2}}{\sqrt{2\pi}}e^{t^2/2}dz=e^{t^2/2}.$$

In particular, we have

$$M_Z'(t)=te^{t^2/2},\quad M_Z''(t)=(1+t^2)e^{t^2/2},\quad M_Z^{(3)}(t)=(3t+t^3)e^{t^2/2},\quad M_Z^{(4)}(t)=(3+6t^2+t^4)e^{t^2/2},\ldots$$
so that we have $E[Z]=0$, $E[Z^2]=1$, $E[Z^3]=0$, $E[Z^4]=3$.

Note that we already knew that the odd moments are zero, but if you need the fourth moment, you need to compute the fourth derivative anyway.
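As a sanity check, the moments of $Z$ can be verified by direct numerical integration against the $\mathcal{N}(0,1)$ density (a minimal sketch assuming `numpy`; the grid bounds and resolution are arbitrary choices):

```python
import numpy as np

# Numerically integrate z^k times the N(0,1) density on a fine symmetric grid.
z = np.linspace(-12.0, 12.0, 200001)
dz = z[1] - z[0]
pdf = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def moment(k):
    # Riemann sum; the tails beyond |z| = 12 are negligible.
    return float(np.sum(z**k * pdf) * dz)

m1, m2, m3, m4 = moment(1), moment(2), moment(3), moment(4)
```

The odd moments vanish by symmetry of the grid, and the even moments match $E[Z^2]=1$, $E[Z^4]=3$ to high accuracy.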

Note that for simple distributions, the direct computation might be easier. For example, for the uniform distribution on the interval $[a,b]$ with $a<b$, we have for $U\sim\mathrm{Unif}(a,b)$:

$$M_U(t)=\frac{e^{tb}-e^{ta}}{t(b-a)},$$

and
$$E[U]=\int_a^b \frac{x\,dx}{b-a}=\frac{b^2-a^2}{2(b-a)}=\frac{a+b}{2},\qquad E[U^2]=\int_a^b \frac{x^2\,dx}{b-a}=\frac{a^2+ab+b^2}{3},$$

so that $\mathrm{Var}(U)=\frac{(b-a)^2}{12}$.
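These closed forms can be checked the same way by integrating against the uniform density (a sketch assuming `numpy`; the endpoints $a=2$, $b=5$ are arbitrary):

```python
import numpy as np

# Numerical check of the Unif(a, b) moments, with a = 2, b = 5 (illustrative).
a, b = 2.0, 5.0
x = np.linspace(a, b, 1_000_001)
dx = x[1] - x[0]
density = 1.0 / (b - a)

m1 = float(np.sum(x * density) * dx)      # should be (a + b) / 2 = 3.5
m2 = float(np.sum(x**2 * density) * dx)   # should be (a^2 + ab + b^2) / 3 = 13
var_u = m2 - m1**2                        # should be (b - a)^2 / 12 = 0.75
```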

3- Independence implies null covariance but null covariance does not imply independence!

Here is a simple example: consider the random variables $(X,Y)$ equal to $(-1,1)$, $(0,-2)$ or $(1,1)$ with equal probability.

We clearly have $E[X]=E[Y]=0$ and

$$\mathrm{Cov}(X,Y)=E[XY]=\frac{-1}{3}+\frac{0}{3}+\frac{1}{3}=0,$$

but $X$ and $Y$ are not independent, as knowing $X$ determines $Y$. More formally, we have for example $E[X^2]=2/3$ and $E[X^2Y]=2/3\neq E[X^2]E[Y]$.
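Since the distribution is finite, all these quantities can be computed exactly by enumeration (a minimal sketch assuming `numpy`):

```python
import numpy as np

# The three equally likely values of (X, Y) from the example above.
points = np.array([(-1.0, 1.0), (0.0, -2.0), (1.0, 1.0)])
x, y = points[:, 0], points[:, 1]

ex, ey = x.mean(), y.mean()          # E[X] = E[Y] = 0
cov_xy = (x * y).mean() - ex * ey    # Cov(X, Y) = 0: uncorrelated
ex2y = (x**2 * y).mean()             # E[X^2 Y] = 2/3, yet E[X^2] E[Y] = 0
```

The nonzero value of `ex2y` is exactly the witness that $X$ and $Y$ are dependent despite being uncorrelated.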

Gaussian random variables

4- If $X$ is a Gaussian r.v. and $Y$ is another Gaussian r.v. such that $X\perp Y$ (i.e. $X$ and $Y$ are independent), then $(X,Y)$ is a Gaussian vector.

5- If $(X,Y)$ is a Gaussian vector, then $X\perp Y$ is equivalent to $\mathrm{Cov}(X,Y)=0$.

But even if $X\sim\mathcal{N}(0,1)$, $Y\sim\mathcal{N}(0,1)$ and $\mathrm{Cov}(X,Y)=0$, this does not imply that $(X,Y)$ is a Gaussian vector. Here is a simple counter-example: take $X\sim\mathcal{N}(0,1)$ and define for $a>0$:

$$Y=\begin{cases} X & \text{if } |X|>a,\\ -X & \text{if } |X|\leq a.\end{cases}$$

It is easy to see that $Y\sim\mathcal{N}(0,1)$; moreover, we have

$$\mathrm{Cov}(X,Y)=E[XY]=E[X^2\mathbf{1}(|X|>a)]-E[X^2\mathbf{1}(|X|\leq a)].$$

We see that for $a\to 0$, we have $\mathrm{Cov}(X,Y)\to 1$, and when $a\to\infty$, we have $\mathrm{Cov}(X,Y)\to -1$, so by continuity there exists a value of $a>0$ for which $\mathrm{Cov}(X,Y)=0$.
But $X+Y$ is never a Gaussian r.v., as

$$X+Y=\begin{cases} 2X & \text{if } |X|>a,\\ 0 & \text{if } |X|\leq a.\end{cases}$$
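The special value of $a$ can be located numerically: $\mathrm{Cov}(X,Y)=1-2\,E[X^2\mathbf{1}(|X|\leq a)]$, which is decreasing in $a$, so a simple bisection finds the root (a sketch assuming `numpy`; grid resolution and bracket are arbitrary choices):

```python
import numpy as np

# Cov(X, Y) = 1 - 2 * E[X^2 1(|X| <= a)], computed by a fine Riemann sum.
z = np.linspace(-10.0, 10.0, 400001)
dz = z[1] - z[0]
integrand = z**2 * np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def cov(a):
    inside = float(np.sum(integrand[np.abs(z) <= a]) * dz)
    return 1.0 - 2.0 * inside

# cov decreases from +1 (a = 0) towards -1 (a large): bisect for the root.
lo, hi = 0.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if cov(mid) > 0 else (lo, mid)
a_star = 0.5 * (lo + hi)
```

The root lands near $a\approx 1.54$; any $a$ solving $E[X^2\mathbf{1}(|X|\leq a)]=1/2$ works.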

6- Moments of Gaussian r.v.

We have for $X\sim\mathcal{N}(0,1)$:

$$M_X(t)=E[e^{tX}]=\int e^{tx}\frac{e^{-x^2/2}}{\sqrt{2\pi}}dx=\int \frac{e^{-(x^2-2tx+t^2)/2}}{\sqrt{2\pi}}dx\; e^{t^2/2}=e^{t^2/2}\int \frac{e^{-(x-t)^2/2}}{\sqrt{2\pi}}dx=e^{t^2/2}.$$

Hence the moments for $X\sim\mathcal{N}(0,1)$ are given by $E[X^{2m+1}]=0$ and

$$E[X^{2m}]=\frac{(2m)!}{2^m m!}.$$

In general, for a Gaussian $\mu+\sigma X\sim\mathcal{N}(\mu,\sigma^2)$, we have
$$E[(\mu+\sigma X)^k]=\sum_{m=0}^{k}\binom{k}{m}\mu^m\sigma^{k-m}E[X^{k-m}].$$
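The closed form for the even moments agrees with the recurrence $E[X^{2m}]=(2m-1)\,E[X^{2m-2}]$ (obtained by integration by parts), which is easy to verify with exact integer arithmetic:

```python
from math import factorial

# Closed form: E[X^{2m}] = (2m)! / (2^m m!)
def even_moment(m):
    return factorial(2 * m) // (2**m * factorial(m))

# Recurrence E[X^{2m}] = (2m - 1) * E[X^{2m-2}], starting from E[X^0] = 1.
rec = [1]
for m in range(1, 6):
    rec.append((2 * m - 1) * rec[-1])
```

For $m=2$ this gives $4!/(4\cdot 2)=3$, matching $E[X^4]=3$ above.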

Gaussian vectors

7- Partitioned Gaussians

We consider now a Gaussian vector $x\sim\mathcal{N}(\mu,\Sigma)$ that we decompose as

$$x=\begin{pmatrix} x_a\\ x_b \end{pmatrix}.$$

We consider the same decomposition for the parameters:

$$\mu=\begin{pmatrix} \mu_a\\ \mu_b \end{pmatrix},\qquad \Sigma=\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}.$$

Note that $\Sigma_{ba}=\Sigma_{ab}^T$. We also introduce the precision matrix $\Lambda=\Sigma^{-1}$ and decompose it as:

$$\Lambda=\begin{pmatrix} \Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}.$$

Note that $\Lambda_{aa}\neq \Sigma_{aa}^{-1}$; indeed, we can use the following formula for the inverse of a partitioned matrix:

$$\begin{pmatrix} A & B\\ C & D \end{pmatrix}^{-1}=\begin{pmatrix} M & -MBD^{-1}\\ -D^{-1}CM & D^{-1}+D^{-1}CMBD^{-1} \end{pmatrix},$$

where $M=(A-BD^{-1}C)^{-1}$.

Hence we see that

$$\Lambda_{aa}=(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}.$$

Conditional distribution:

$$p(x_a|x_b)=\mathcal{N}(x_a|\mu_{a|b},\Lambda_{aa}^{-1}),$$
with
$$\mu_{a|b}=\mu_a-\Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b).$$

Marginal distribution:

$$p(x_a)=\mathcal{N}(x_a|\mu_a,\Sigma_{aa}).$$
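The Schur-complement identity $\Lambda_{aa}=(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}$ can be checked numerically on any positive-definite covariance (a sketch assuming `numpy`; the $4\times 4$ matrix below is an arbitrary illustrative choice):

```python
import numpy as np

# A small symmetric positive-definite covariance, partitioned into 2+2 blocks.
Sigma = np.array([[4.0, 1.0, 0.5, 0.2],
                  [1.0, 3.0, 0.3, 0.1],
                  [0.5, 0.3, 2.0, 0.4],
                  [0.2, 0.1, 0.4, 1.5]])
Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]

Laa = np.linalg.inv(Sigma)[:2, :2]   # the aa-block of the precision matrix
schur_inv = np.linalg.inv(Saa - Sab @ np.linalg.inv(Sbb) @ Sba)
```

Comparing `Laa` to `Sigma[:2, :2]` inverted directly also makes the point that $\Lambda_{aa}\neq\Sigma_{aa}^{-1}$.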

8- Marginal and Conditional Gaussians

Consider a Gaussian vector:

$$p(x)=\mathcal{N}(x|\mu,\Lambda^{-1}),$$
and a linear Gaussian model:
$$p(y|x)=\mathcal{N}(y|Ax+b,L^{-1}),$$

where $A$, $b$, $\mu$ are parameters governing the means, and $\Lambda$ and $L$ are precision matrices. Then $z=\begin{pmatrix} x\\ y \end{pmatrix}$ is a Gaussian vector and we have

$$p(y)=\mathcal{N}(y|A\mu+b,\,L^{-1}+A\Lambda^{-1}A^T),$$
$$p(x|y)=\mathcal{N}(x|\Sigma\{A^TL(y-b)+\Lambda\mu\},\,\Sigma),\qquad \text{with } \Sigma=(\Lambda+A^TLA)^{-1}.$$
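The marginal covariance of $y$ can be cross-checked against the joint precision of $z=(x,y)$, which for this model is $\begin{pmatrix}\Lambda+A^TLA & -A^TL\\ -LA & L\end{pmatrix}$ (this is the standard result from Bishop, PRML, Section 2.3.3). A sketch assuming `numpy`, with illustrative parameter values:

```python
import numpy as np

# Illustrative 2-d parameters for the linear Gaussian model.
Lam = np.array([[2.0, 0.3], [0.3, 1.5]])   # precision of x
L   = np.array([[1.0, 0.2], [0.2, 2.0]])   # precision of y given x
A   = np.array([[1.0, -1.0], [0.5, 2.0]])

# Joint precision of z = (x, y).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,            L       ]])

cov_y = np.linalg.inv(R)[2:, 2:]           # y-block of the joint covariance
cov_y_formula = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
```

The agreement of the two expressions is exactly the Woodbury identity applied to the $y$-block of the joint precision.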

tags: public tutorials