Independent Component Analysis

We are familiar with the problem that our high-dimensional data sources (images, video, fMRI, etc.) are redundant. And actually, we don't care about the value of individual pixels/voxels; what we need is some compact representation. Do you know state-space models? Well, something like that would be great.

Of course, we already have dimensionality-reduction methods, like Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one then? Did someone just try to get a PhD by tweaking things a bit to get something published? Fortunately, Independent Component Analysis is much more than that.

Notation

$s$ - signal sources/latents
$x$ - signal mixtures/observations
$y$ - reconstructed signal sources
$g$ - prescribed CDF
$Y$ - $g(y)$
$A$ - mixing matrix
$W$ - unmixing matrix
$p_s$ - signal PDF model / prior PDF

Blind source separation

Regarding ICA, the main motivation used to be formulated as the problem of blind source separation - also known as the cocktail party problem. I didn't mention yet that we will have party tiiiiiiiiiime! (In a corona-conform way, of course.)

Let's imagine that you are attending a cocktail party (in your imagination, or in reality if you happen to be vaccinated), and you want to get the most recent gossip there, but your old friend also decides to unload his recent experience with correlation analysis. You don't care about that, but you don't want to hurt your friend's feelings. Still, the gossip needs to be heard. You never know.

This problem can be described mathematically as separating the mixture signal $x$ into its components, gossip $g$ and math stuff $m$, i.e.:

$$x = As,$$

where $A$ is called the mixing matrix - I will denote its inverse by $W := A^{-1}$ - and $s = (g, m) = (s_1, s_2)$. $x$ can contain multiple mixtures of the signals. If your signals are EEG measurements, for example, then you can have multiple sensors.

But we don't have access to $s$, only to $x$! Otherwise, the problem would be solved. So the problem we need to solve is:

$$s = Wx.$$

The goal in this setting is to somehow recover $s$ with a close enough estimate. You can formulate this problem as dimensionality reduction, namely, you want to extract the non-redundant components. Thus, PCA or FA could be used. We'll soon see why ICA is better.
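To make the setup concrete, here is a minimal sketch of the mixing model; the sources, the mixing matrix, and the sensor setup are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two made-up, non-Gaussian, independent sources: "gossip" and "math stuff".
gossip = np.sign(rng.standard_normal(n_samples)) * rng.exponential(1.0, n_samples)
math_stuff = rng.uniform(-1.0, 1.0, n_samples)
s = np.stack([gossip, math_stuff])          # shape (2, n_samples)

# A hypothetical mixing matrix A: what reaches your two ears.
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
x = A @ s                                   # the observed mixtures

# If we knew A, unmixing would be trivial: s = W x with W = A^{-1}.
W = np.linalg.inv(A)
print(np.allclose(W @ x, s))                # True - but in practice A is unknown
```

Of course, the whole point is that we do not know $A$; the rest of the post is about estimating $W$ from $x$ alone.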

A comparison with PCA and FA

I will only give a one-liner introduction to PCA and FA, as it is their assumptions that matter most for differentiating ICA from them.

PCA

PCA basically solves the $s = Wx$ equation while maximizing the variance in the (centered) data $x - E(x)$ by selecting the $m$ eigenvectors with the largest eigenvalues from the $n$-dimensional orthonormal eigenbasis of the covariance matrix. Since the principal axes are orthogonal to each other, decomposing any signal in that basis yields (linearly) uncorrelated components - the basis is linearly independent, you see.
Moreover, PCA assumes Gaussian latents, i.e., $p_s = \mathcal{N}(0, I)$. In this case, the ML estimate (cf. below) is invariant to any rotation $R$. Namely, if the reconstructed signal $Wx$ is Gaussian, then $RWx$ will be too. Moreover, as $\det R = 1$, the change of variables formula is also invariant to rotations.
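As a quick reminder, here is a minimal PCA sketch via the eigendecomposition of the covariance matrix (the function and variable names are my own, just for illustration):

```python
import numpy as np

def pca(x, m):
    """Project n-dimensional observations x (shape (n, N)) onto the top-m principal axes."""
    x_centered = x - x.mean(axis=1, keepdims=True)   # x - E(x)
    cov = np.cov(x_centered)                         # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # orthonormal eigenbasis
    order = np.argsort(eigvals)[::-1][:m]            # m largest eigenvalues
    W = eigvecs[:, order].T                          # (m, n) "unmixing" matrix
    return W @ x_centered, W
```

The resulting components are uncorrelated by construction, but - as discussed below - not necessarily independent.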

FA

Factor Analysis goes a bit further by also incorporating the noise characteristics of the measurements. Thus, the problem here is

$$s = Wx + \epsilon.$$

Different assumptions

The most important thing is to list the main assumptions of these methods, as this determines their applicability in a given case. ICA strives to extract statistically independent factors, whereas both PCA and FA give us uncorrelated factors. We will reason about the distributions of these factors, so you can think of them as random variables.

The definition of independence is:

$$p(x, y) = p(x)\,p(y).$$

Moreover, in the ICA case, the following will also hold (as a consequence of independence):

$$E(x^k y^p) = E(x^k)\,E(y^p) \quad \forall\, k, p \in \mathbb{Z}^+,$$

which imposes many more constraints than the uncorrelated case, where the correlation coefficient $\rho$ is zero (and that's it):

$$\rho = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y},$$

i.e., only the special case of the above with $k = p = 1$. That is, independence implies uncorrelatedness, but uncorrelatedness does not imply independence. And anyway, zero correlation only means zero linear correlation - for higher values of $k$ and $p$ the factorization need not hold.

In the literature, it is also common to state the factorization for arbitrary functions of independent variables - i.e., $p(f(x), g(y)) = p(f(x))\,p(g(y))$. But as you can take a Taylor series approximation, considering the powers $x^k$ and $y^p$ is fine too.

This means that ICA restricts the "search space" of the factors more significantly, as they need to satisfy more constraints. Although this is not the only difference, let's focus now on the following question: is there any case when the assumptions of PCA/FA are sufficient? It turns out that there is such a case, namely, when you have Gaussian random variables. As moments up to (and including) second order fully describe a Gaussian, you are all set with assuming uncorrelatedness - in this setting, it implies independence as well.
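A quick numerical illustration of the difference, with made-up variables: take $x$ uniform on $[-1, 1]$ and $y = x^2$ - they are uncorrelated, yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1_000_000)
y = x**2                                   # fully determined by x, hence dependent

# Correlation (the k = p = 1 case) vanishes ...
print(np.corrcoef(x, y)[0, 1])             # ~0 up to sampling noise

# ... but a higher-order moment does not factorize: E(x^2 y) != E(x^2) E(y)
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))   # ~0.2 vs ~0.11
```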

Linear ICA

Okay, we need to ensure that we have independent components, but how do we do that? The idea is simple, but turning that into a practical algorithm is a bit burdensome.
Note: I will provide two derivations. ML is easier, but InfoMax is more eye-widening (at least for me), so I will begin with InfoMax.

ICA assumptions

But first, I should say something. Hmmm, ICA is different, you know? You are probably used to assuming that everything is Gaussian. Well, ICA assumes that the components are independent aaand non-Gaussian. Moreover, ICA assumes that mixing the independent source signals has the following effects:

  1. The mixtures (observations) will not be independent, as they contain a combination of the independent sources.
  2. Due to the Central Limit Theorem, the mixture of the non-Gaussian sources will be closer to a Gaussian distribution than the latents themselves (see the sketch after this list).
  3. The mixture of latents/source signals has a more complicated structure than any of its component.
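Point 2 is easy to check numerically; a minimal sketch with made-up uniform sources, using the excess kurtosis (zero for a Gaussian) as a crude measure of non-Gaussianity:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 1_000_000))   # two independent non-Gaussian sources
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])                          # a made-up mixing matrix
x = A @ s

print(kurtosis(s[0]))   # ~ -1.2 for a uniform source (far from Gaussian)
print(kurtosis(x[0]))   # closer to 0, i.e., the mixture looks "more Gaussian"
```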

Regarding point 3, you can think of Fourier series: the components are simple periodic functions, yet their sum can be a very complicated signal.

One more remark: ICA extracts all components in a parallel manner.

Shortcomings

You might be wondering what the problem with Gaussians is. The answer is that the Gaussian is the only distribution that is both rotationally symmetric and has independent components. This introduces an extra degree of freedom when determining the source axes (i.e., when you calculate $W$) - an ambiguity we cannot resolve. You can relax this condition a bit though: if at most one source is Gaussian, then ICA still works.

Nonetheless, there are two shortcomings:

  1. The results are only determined up to a permutation $\pi$: what you get is not $(s_1, s_2)$ but $(s_{\pi_1}, s_{\pi_2})$. In this respect, PCA is better, as it has a clear ordering between the components.
  2. The variances of the signals cannot be determined in an absolute manner, since for any $\lambda \in \mathbb{R}$ (see the sketch after this list): $s = Wx = (\lambda W)(x/\lambda)$.
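A minimal numerical illustration of both ambiguities; the sources, the mixing matrix, and the permutation/scaling below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 5))
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
x = A @ s

W = np.linalg.inv(A)                        # the "true" unmixing matrix
P = np.array([[0.0, 1.0],                   # a permutation ...
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])                    # ... and an arbitrary rescaling

W_alt = D @ P @ W                           # an equally valid ICA solution
y = W_alt @ x                               # recovers the sources, but swapped and rescaled
print(np.allclose(y, D @ P @ s))            # True
```

Both $y$ and $s$ are sets of independent components; nothing in the independence criterion prefers one over the other.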

Infomax

In the Infomax approach, the goal is to maximize the mutual information between $x$ and a signal $Y = g(y)$, where $y$ denotes the recovered signals - i.e., $y = Wx$. Here $g$ is a CDF (cumulative distribution function), which is invertible - this acts as a model, as you need to specify it in advance based on prior knowledge.

This means that you want a mapping of the extracted signals $y$ such that, once $Y$ is known, the signal mixtures $x$ cannot tell you anything new. This is what we want, as the source signals $s$ (for which $y$ is hopefully a good estimate) have no information about the mixing process (the matrix $A$ holds all of that information).

But why do we use $g$ at all? To get independent signals, we want maximal entropy, i.e., a multivariate uniform distribution as the joint PDF (probability density function). The following turns out to be true for maximum entropy settings:

  1. The uniform distribution has maximal entropy (on a bounded support).
  2. If the joint is uniform, then the components (the random variables) are independent.
  3. If you transform a uniform distribution component-wise with any invertible mapping (e.g., a CDF), then the components remain mutually independent.

So now you see where $g$ comes in, don't you? When we extract $y$ from the data $x$, it will have some PDF - probably not a uniform one. To get a uniform one, we need to transform $y$. Its CDF comes in handy here, because if you transform a random variable with its own CDF, you get a uniform distribution.

At first, this didn't seem trivial to me, so I guess you might appreciate an example. Imagine you have $z \sim \mathcal{N}(0, 1)$. If you sample from this distribution, you will most likely get a value near 0. But if you transform these samples with the CDF, you get a uniform distribution. This is because the CDF is steepest where the PDF has its maximum (which is trivial, as $\mathrm{CDF}' = \mathrm{PDF}$). Thus, the samples near the mode get spread the furthest apart, while in regions where the PDF has a low value, the CDF "collects" the points in the transformed domain. In the end, this equalizes the distribution into a uniform one.
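A minimal sketch to verify this, using the standard normal CDF from scipy:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)     # z ~ N(0, 1)
u = norm.cdf(z)                        # transform each sample with its own CDF

# The histogram of u is flat on [0, 1]: every bin holds ~10% of the samples.
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts / len(u))                 # ~[0.1, 0.1, ..., 0.1]
```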

There is one more thing we need for Infomax ICA: the change of variables rule for probability distributions. The intuition here is that the probability contained in a differential area must be invariant under the transformation, i.e.:

$$p_Y(Y)\left|\frac{\partial Y}{\partial y}\right| = p_y(y).$$

Here

$$\left|\frac{\partial Y}{\partial y}\right| = g'(y) = p_s(y)$$

(the derivative of the model CDF is the model PDF), where $p_s(y)$ is the underlying true distribution of the signals $s$. Substituting this into the expression above gives:

$$p_Y(Y) = \frac{p_y(y)}{p_s(y)},$$

which means that the only way to get a uniform distribution for $p_Y(Y)$ is to have $p_y(y) = p_s(y)$. If this equality holds, then the signal model holds as well. Moreover, $p_Y(Y)$ then has maximal entropy, and by point 3 above the components of $y$ are mutually independent too.

As we want to maximize the entropy of $Y$, we can express $H(Y)$ using this ratio:

$$H(Y) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{p_y(y_i)}{p_s(y_i)}.$$

Now we apply the change of variables rule to $p_y(y)$ to get $W$ into the expression:

$$p_y(y)\left|\frac{\partial y}{\partial x}\right| = p_x(x).$$

With $y = Wx$, the Jacobian equals $|W|$, thus $p_y(y)\,|W| = p_x(x)$, which, in terms of the entropy, gives:

$$H(Y) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{p_x(x_i)}{p_s(y_i)\,|W|} = -\frac{1}{N}\sum_{i=1}^{N} \log p_x(x_i) + \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W| = H(X) + \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W|.$$

As $H(X)$ does not depend on $W$, it can be neglected during optimization, so the objective $H_{\mathrm{infomax}}(Y)$ is:

$$H_{\mathrm{infomax}}(Y) = \frac{1}{N}\sum_{i=1}^{N} \log p_s(y_i) + \log|W|.$$

To optimize this expression, we need to specify a signal model $p_s$ in advance.
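As a sketch of how this objective could be optimized in practice - assuming super-Gaussian sources with the common model $\log p_s(y) = -\log\cosh(y) + \mathrm{const}$ (so the score function is $-\tanh$), and with arbitrary step size and iteration count:

```python
import numpy as np

def infomax_ica(x, n_steps=2000, lr=0.01, seed=0):
    """Gradient ascent on H_infomax(Y) = mean(log p_s(y)) + log|det W|,
    assuming the super-Gaussian model log p_s(y) = -log cosh(y) + const."""
    rng = np.random.default_rng(seed)
    n, N = x.shape
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # initial unmixing matrix
    for _ in range(n_steps):
        y = W @ x
        # d/dW of the sample objective:
        #   d log|det W| / dW       = inv(W).T
        #   d mean(log p_s(y)) / dW = -(tanh(y) @ x.T) / N
        grad = np.linalg.inv(W).T - (np.tanh(y) @ x.T) / N
        W += lr * grad
    return W
```

Applied to the mixtures from the blind source separation sketch above, `W @ x` should recover the sources up to the permutation and scaling ambiguities discussed earlier. In practice one would rather use the natural gradient variant (which avoids the matrix inverse) or FastICA, but this sketch mirrors the objective derived in this section.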

Okay, we have the objective to optimize, but what is the relationship to mutual information? The claim is that by achieving $p_y(y) = p_s(y)$, the mutual information $I(X; Y)$ is maximized.
Let's recall its definition:

$$I(X; Y) = H(X) - H(X|Y) \geq 0.$$

As the signal mixtures $x$ are given, $H(X)$ is fixed; thus, maximizing the mutual information is the same as minimizing the conditional entropy $H(X|Y)$. Since $Y = g(Wx)$ is a deterministic function of $x$, the conditional $p(Y|x)$ is concentrated at $g(Wx)$.
If $H(X|Y)$ is minimized (i.e., it equals zero), then getting to know $x$ provides no additional information once $Y$ is known. Namely, $x$ carries no new information about either the latents or the mixing process; everything can be described with $Y$ and $W$. This is what we want! For if $Y$ is known, then $y = g^{-1}(Y)$ is known too. Thus, we have the signal components - knowing the mixtures adds nothing.
Imagine instead that $H(X|Y) > 0$ - this would mean that $x$ contains information about the mixing process and/or the latents that our model (with $Y$ and $W$) does not capture, i.e., the model is not sufficient.

Maximum Likelihood

If you fancy, you can do this the ML way as well. In this case, we will end up with the same objective (surpriiiiise), but the route is quite different.
Our a priori information is distilled into the signal model $p_s$, and the goal is to determine the unmixing matrix $W$ under which the data has the highest likelihood.

To write down the likelihood of the data as a function of $W$, the change of variables rule comes in handy again:

$$L(W) = p_x(x) = p_s(y)\left|\frac{\partial y}{\partial x}\right| = p_s(y)\,|W|,$$

where $L(W)$ stands for the likelihood of $W$ given the data $x$. Taking the $\log$ and using the independence of the source signals, $p_s(y) = \prod_{i=1}^{N} p_s(y_i)$, gives:

$$l(W) = \sum_{i=1}^{N} \log p_s(y_i) + N \log|W|,$$

from which it can be seen that $l(W)/N = H_{\mathrm{infomax}}(Y)$.
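As a quick numerical sanity check of this equivalence (again assuming the $-\log\cosh$ signal model from the Infomax sketch above; any other prior would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))          # made-up observations
W = rng.standard_normal((2, 2))             # an arbitrary candidate unmixing matrix
y = W @ x

log_ps = -np.logaddexp(y, -y)               # log p_s(y) = -log cosh(y) - log 2 (constant offset)

h_infomax = np.mean(np.sum(log_ps, axis=0)) + np.log(abs(np.linalg.det(W)))
log_lik   = np.sum(log_ps) + y.shape[1] * np.log(abs(np.linalg.det(W)))

print(np.isclose(log_lik / y.shape[1], h_infomax))   # True: l(W)/N == H_infomax(Y)
```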