Independent Component Analysis

We are familiar with the problem that our high-dimensional data sources (images, video, fMRI, etc.) are redundant. And actually, we don't care about the value of individual pixels/voxels; what we need is some compact representation. Do you know state-space models? Well, something like that would be great.

Of course, we already have dimensionality-reduction methods, like Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one then? Did someone just try to get a PhD by tweaking things a bit to get something published? Fortunately, Independent Component Analysis is much more than that.

Notation

$s$ - signal sources/latents
$x$ - signal mixtures/observations
$y$ - reconstructed signal sources
$g$ - prescribed CDF
$Y$ - $g(y)$
$A$ - mixing matrix
$W$ - unmixing matrix
$p_s$ - signal PDF model / prior PDF

Blind source separation

Regarding ICA, the main motivation used to be formulated as the problem of blind source separation - also known as the cocktail party problem. I didn't mention yet that we will have party tiiiiiiiiiime! (In a corona-conform way, of course.)

Let's imagine that you are attending a cocktail party (in your imagination, or in reality if you happen to be vaccinated), and you want to get the most recent gossip there, but your old friend also decides to unload his recent experience with correlation analysis. You don't care about that, but you don't want to hurt your friend's feelings. Still, the gossip needs to be heard. You never know.

This problem can be described mathematically as separating the mixture signal $x$ into its components, gossip $g$ and math stuff $m$, i.e.:

$$x = As,$$

where $A$ is called the mixing matrix - I will denote its inverse by $W := A^{-1}$ - and $s = (g, m) = (s_1, s_2)$. $x$ can contain multiple mixtures of the signals. If your signals are EEG measurements, for example, then you can have multiple sensors.

But we don't have access to $s$, only to $x$! Otherwise, the problem would be solved. So the problem we need to solve is:

$$s = Wx.$$

The goal in this setting is to somehow recover $s$ with a close enough estimate. You can formulate this problem as dimensionality reduction, namely, you want to extract the non-redundant components. Thus, PCA or FA could be used. We'll soon see why ICA is better.
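To make the setup concrete, here is a minimal sketch of the mixing model; the sources, the mixing matrix, and the sensor setup are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two made-up, non-Gaussian, independent sources: "gossip" and "math stuff".
gossip = np.sign(rng.standard_normal(n_samples)) * rng.exponential(1.0, n_samples)
math_stuff = rng.uniform(-1.0, 1.0, n_samples)
s = np.stack([gossip, math_stuff])          # shape (2, n_samples)

# A hypothetical mixing matrix A: what reaches your two ears.
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
x = A @ s                                   # the observed mixtures

# If we knew A, unmixing would be trivial: s = W x with W = A^{-1}.
W = np.linalg.inv(A)
print(np.allclose(W @ x, s))                # True - but in practice A is unknown
```

Of course, the whole point is that we do not know $A$; the rest of the post is about estimating $W$ from $x$ alone.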

A comparison with PCA and FA

I will only give a one-liner introduction to PCA and FA, as it is their assumptions that matter most for differentiating ICA from them.

PCA

PCA basically solves the $s = Wx$ equation while maximizing the variance in the (centered) data $x - E(x)$ by selecting the $m$ eigenvectors with the largest eigenvalues from the $n$-dimensional orthonormal eigenbasis of the covariance matrix. Since the principal axes are orthogonal to each other, decomposing any signal in that basis yields (linearly) uncorrelated components - the basis is linearly independent, you see.
Moreover, PCA assumes Gaussian latents, i.e., $p_s = \mathcal{N}(0, I)$. In this case, the ML estimate (cf. below) is invariant to any rotation $R$. Namely, if the reconstructed signal $Wx$ is Gaussian, then $RWx$ will be too. Moreover, as $\det R = 1$, the change of variables formula is also invariant to rotations.
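As a quick reminder, here is a minimal PCA sketch via the eigendecomposition of the covariance matrix (the function and variable names are my own, just for illustration):

```python
import numpy as np

def pca(x, m):
    """Project n-dimensional observations x (shape (n, N)) onto the top-m principal axes."""
    x_centered = x - x.mean(axis=1, keepdims=True)   # x - E(x)
    cov = np.cov(x_centered)                         # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # orthonormal eigenbasis
    order = np.argsort(eigvals)[::-1][:m]            # m largest eigenvalues
    W = eigvecs[:, order].T                          # (m, n) "unmixing" matrix
    return W @ x_centered, W
```

The resulting components are uncorrelated by construction, but - as discussed below - not necessarily independent.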

FA

Factor Analysis goes a bit further by also incorporating the noise characteristics of the measurements. Thus, the problem here is

$$s = Wx + \epsilon.$$

Different assumptions

The most important thing is to list the main assumptions of these methods, as this determines their applicability in a given case. ICA strives to extract statistically independent factors, whereas both PCA and FA give us uncorrelated factors. We will reason about the distributions of these factors, so you can think of them as random variables.

The definition of independence is:

$$p(x, y) = p(x)\,p(y).$$

Moreover, in the ICA case, the following will also hold (as a consequence of independence):

$$E(x^k y^p) = E(x^k)\,E(y^p) \quad \forall\, k, p \in \mathbb{Z}^+,$$

which imposes many more constraints than the uncorrelated case, where the correlation coefficient $\rho$ is zero (and that's it):

$$\rho = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y},$$

i.e., only the special case of the above with $k = p = 1$. That is, independence implies uncorrelatedness, but uncorrelatedness does not imply independence. And anyway, zero correlation only means zero linear correlation - for higher values of $k$ and $p$ the factorization need not hold.

In the literature, it is also common to state the factorization for arbitrary functions of independent variables - i.e., $p(f(x), g(y)) = p(f(x))\,p(g(y))$. But as you can take a Taylor series approximation, considering the powers $x^k$ and $y^p$ is fine too.

This means that ICA restricts the "search space" of the factors more significantly, as they need to satisfy more constraints. Although this is not the only difference, let's focus now on the following question: is there any case when the assumptions of PCA/FA are sufficient? It turns out that there is such a case, namely, when you have Gaussian random variables. As moments up to (and including) second order fully describe a Gaussian, you are all set with assuming uncorrelatedness - in this setting, it implies independence as well.
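A quick numerical illustration of the difference, with made-up variables: take $x$ uniform on $[-1, 1]$ and $y = x^2$ - they are uncorrelated, yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1_000_000)
y = x**2                                   # fully determined by x, hence dependent

# Correlation (the k = p = 1 case) vanishes ...
print(np.corrcoef(x, y)[0, 1])             # ~0 up to sampling noise

# ... but a higher-order moment does not factorize: E(x^2 y) != E(x^2) E(y)
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))   # ~0.2 vs ~0.11
```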

Linear ICA

Okay, we need to ensure that we have independent components, but how do we do that? The idea is simple, but turning that into a practical algorithm is a bit burdensome.
Note: I will provide two derivations. ML is easier, but InfoMax is more eye-widening (at least for me), so I will begin with InfoMax.

ICA assumptions

But first, I should say something. Hmmm, ICA is different, you know? You are probably used to assuming that everything is Gaussian. Well, ICA assumes that the components are independent aaand non-Gaussian. Moreover, ICA assumes that mixing the independent source signals has the following effects:

  1. The mixtures (observations) will not be independent, as they contain a combination of the independent sources.
  2. Due to the Central Limit Theorem, the mixture of the non-Gaussian sources will be closer to a Gaussian distribution than the latents themselves (see the sketch after this list).
  3. The mixture of latents/source signals has a more complicated structure than any of its component.
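Point 2 is easy to check numerically; a minimal sketch with made-up uniform sources, using the excess kurtosis (zero for a Gaussian) as a crude measure of non-Gaussianity:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 1_000_000))   # two independent non-Gaussian sources
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])                          # a made-up mixing matrix
x = A @ s

print(kurtosis(s[0]))   # ~ -1.2 for a uniform source (far from Gaussian)
print(kurtosis(x[0]))   # closer to 0, i.e., the mixture looks "more Gaussian"
```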

Regarding point 3, you can think of Fourier series: the components are simple periodic functions, yet their sum can be a very complicated signal.

One more remark: ICA extracts all components in a parallel manner.

Shortcomings

You might be wondering what the problem with Gaussians is. The answer is that the Gaussian is the only distribution that is both rotationally symmetric and has independent components. This introduces an extra degree of freedom when determining the source axes (i.e., when you calculate $W$) - an ambiguity we cannot resolve. You can relax this condition a bit though: if at most one source is Gaussian, then ICA still works.

Nonetheless, there are two shortcomings:

  1. The results are only determined up to a permutation $\pi$: what you get is not $(s_1, s_2)$ but $(s_{\pi_1}, s_{\pi_2})$. In this respect, PCA is better, as it has a clear ordering between the components.
  2. The variances of the signals cannot be determined in an absolute manner, since for any $\lambda \in \mathbb{R}$ (see the sketch after this list): $s = Wx = (\lambda W)(x/\lambda)$.
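A minimal numerical illustration of both ambiguities; the sources, the mixing matrix, and the permutation/scaling below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 5))
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
x = A @ s

W = np.linalg.inv(A)                        # the "true" unmixing matrix
P = np.array([[0.0, 1.0],                   # a permutation ...
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])                    # ... and an arbitrary rescaling

W_alt = D @ P @ W                           # an equally valid ICA solution
y = W_alt @ x                               # recovers the sources, but swapped and rescaled
print(np.allclose(y, D @ P @ s))            # True
```

Both $y$ and $s$ are sets of independent components; nothing in the independence criterion prefers one over the other.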

Infomax

In the Infomax approach, the goal is to maximize the mutual information between $x$ and a signal $Y = g(y)$, where $y$ denotes the recovered signals - i.e., $y = Wx$. Here $g$ is a CDF (cumulative distribution function), which is invertible - this acts as a model, as you need to specify it in advance based on prior knowledge.

This means that you want a mapping of the extracted signals $y$ such that, once $Y$ is known, the signal mixtures $x$ cannot tell you anything new. This is what we want, as the source signals $s$ (for which $y$ is hopefully a good estimate) have no information about the mixing process (the matrix $A$ holds all of that information).

But why do we use $g$ at all? To get independent signals, we want maximal entropy, i.e., a multivariate uniform distribution as the joint PDF (probability density function). The following turns out to be true for maximum entropy settings:

  1. The uniform distribution has maximal entropy (on a bounded support).
  2. If the joint is uniform, then the components (the random variables) are independent.
  3. If you transform a uniform distribution component-wise with any invertible mapping (e.g., a CDF), then the components remain mutually independent.

So now you see where $g$ comes in, don't you? When we extract $y$ from the data $x$, it will have some PDF - probably not a uniform one. To get a uniform one, we need to transform $y$. Its CDF comes in handy here, because if you transform a random variable with its own CDF, you get a uniform distribution.

At first, this didn't seem trivial to me, so I guess you might appreciate an example. Imagine you have $z \sim \mathcal{N}(0, 1)$. If you sample from this distribution, you will most likely get a value near 0. But if you transform these samples with the CDF, you get a uniform distribution. This is because the CDF is steepest where the PDF has its maximum (which is trivial, as $\mathrm{CDF}' = \mathrm{PDF}$). Thus, the samples near the mode get spread the furthest apart, while in regions where the PDF has a low value, the CDF "collects" the points in the transformed domain. In the end, this equalizes the distribution into a uniform one.
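A minimal sketch to verify this, using the standard normal CDF from scipy:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)     # z ~ N(0, 1)
u = norm.cdf(z)                        # transform each sample with its own CDF

# The histogram of u is flat on [0, 1]: every bin holds ~10% of the samples.
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts / len(u))                 # ~[0.1, 0.1, ..., 0.1]
```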

There is one more thing we need for Infomax ICA: the change of variables rule for probability distributions. The intuition here is that the probability contained in a differential area must be invariant under the transformation, i.e.:

$$p_Y(Y)\left|\frac{\partial Y}{\partial y}\right| = p_y(y).$$

Here

$$\left|\frac{\partial Y}{\partial y}\right| = g'(y) = p_s(y)$$

(the derivative of the model CDF is the model PDF), where $p_s(y)$ is the underlying true distribution of the signals $s$. Substituting this into the expression above gives:

$$p_Y(Y) = \frac{p_y(y)}{p_s(y)},$$

which means that the only way to get a uniform distribution for $p_Y(Y)$ is to have $p_y(y) = p_s(y)$. If this equality holds, then the signal model holds as well. Moreover, $p_Y(Y)$ then has maximal entropy, and by point 3 above the components of $y$ are mutually independent too.

As we want to maximize the entropy of $Y$, we can express $H(Y)$ using this ratio:

$$H(Y) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{p_y(y_i)}{p_s(y_i)}.$$

Now we apply the change of variables rule to $p_y(y)$ to get $W$ into the expression:

$$p_y(y)\left|\frac{\partial y}{\partial x}\right| = p_x(x).$$

With $y = Wx$, the Jacobian equals $|W|$, thus $p_y(y)\,|W| = p_x(x)$, which, in terms of the entropy, gives:

$$H(Y) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{p_x(x_i)}{p_s(y_i)\,|W|} = -\frac{1}{N}\sum_{i=1}^{N} \log p_x(x_i) + \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W| = H(X) + \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W|.$$

As $H(X)$ does not depend on $W$, it can be neglected during optimization, so the objective $H_{\mathrm{infomax}}(Y)$ is:

$$H_{\mathrm{infomax}}(Y) = \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W|.$$

To optimize this expression, we need to specify a signal model $p_s$ in advance.
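As a sketch of how this objective could be optimized in practice - assuming super-Gaussian sources with the common model $\log p_s(y) = -\log\cosh(y) + \mathrm{const}$ (so the score function is $-\tanh$), and with arbitrary step size and iteration count:

```python
import numpy as np

def infomax_ica(x, n_steps=2000, lr=0.01, seed=0):
    """Gradient ascent on H_infomax(Y) = mean(log p_s(y)) + log|det W|,
    assuming the super-Gaussian model log p_s(y) = -log cosh(y) + const."""
    rng = np.random.default_rng(seed)
    n, N = x.shape
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # initial unmixing matrix
    for _ in range(n_steps):
        y = W @ x
        # d/dW of the sample objective:
        #   d log|det W| / dW       = inv(W).T
        #   d mean(log p_s(y)) / dW = -(tanh(y) @ x.T) / N
        grad = np.linalg.inv(W).T - (np.tanh(y) @ x.T) / N
        W += lr * grad
    return W
```

Applied to the mixtures from the blind source separation sketch above, `W @ x` should recover the sources up to the permutation and scaling ambiguities discussed earlier. In practice one would rather use the natural gradient variant (which avoids the matrix inverse) or FastICA, but this sketch mirrors the objective derived in this section.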

Okay, we have the objective to optimize, but what is the relationship to mutual information? The claim is that by achieving $p_y(y) = p_s(y)$, the mutual information $I(X; Y)$ is maximized.
Let's recall its definition:

$$I(X; Y) = H(X) - H(X|Y) \geq 0.$$

As the signal mixtures $x$ are given, $H(X)$ is fixed; thus, maximizing the mutual information is the same as minimizing the conditional entropy $H(X|Y)$. Since $Y = g(Wx)$ is a deterministic function of $x$, the conditional $p(Y|x)$ is concentrated at $g(Wx)$.
If $H(X|Y)$ is minimized (i.e., it equals zero), then getting to know $x$ provides no additional information once $Y$ is known. Namely, $x$ carries no new information about either the latents or the mixing process; everything can be described with $Y$ and $W$. This is what we want! For if $Y$ is known, then $y = g^{-1}(Y)$ is known too. Thus, we have the signal components - knowing the mixtures adds nothing.
Imagine instead that $H(X|Y) > 0$ - this would mean that $x$ contains information about the mixing process and/or the latents that our model (with $Y$ and $W$) does not capture, i.e., the model is not sufficient.

Maximum Likelihood

If you fancy, you can do this the ML way as well. In this case, we will end up with the same objective (surpriiiiise), but the route is quite different.
Our a priori information is distilled into the signal model $p_s$, and the goal is to determine the unmixing matrix $W$ under which the data has the highest likelihood.

To write down the likelihood of the data as a function of $W$, the change of variables rule comes in handy again:

$$L(W) = p_x(x) = p_s(y)\left|\frac{\partial y}{\partial x}\right| = p_s(y)\,|W|,$$

where $L(W)$ stands for the likelihood of $W$ given the data $x$. Taking the $\log$ and using the independence of the source signals, $p_s(y) = \prod_{i=1}^{N} p_s(y_i)$, gives:

$$l(W) = \sum_{i=1}^{N} \log p_s(y_i) + N \log|W|,$$

from which it can be seen that $l(W)/N = H_{\mathrm{infomax}}(Y)$.
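As a quick numerical sanity check of this equivalence (again assuming the $-\log\cosh$ signal model from the Infomax sketch above; any other prior would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))          # made-up observations
W = rng.standard_normal((2, 2))             # an arbitrary candidate unmixing matrix
y = W @ x

log_ps = -np.logaddexp(y, -y)               # log p_s(y) = -log cosh(y) - log 2 (constant offset)

h_infomax = np.mean(np.sum(log_ps, axis=0)) + np.log(abs(np.linalg.det(W)))
log_lik   = np.sum(log_ps) + y.shape[1] * np.log(abs(np.linalg.det(W)))

print(np.isclose(log_lik / y.shape[1], h_infomax))   # True: l(W)/N == H_infomax(Y)
```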