# [Memo] A Maximum-Likelihood Interpretation for Slow Feature Analysis

## Links

* [paper](http://learning.eng.cam.ac.uk/pub/Public/Turner/TurnerAndSahani2007a/turner-and-sahani-2007a.pdf)
* [slides](https://drive.google.com/file/d/1mVXuopiIP58TEomEQGdXCJVf_T_O5J3s/view?usp=sharing)

## Background (Dimensionality reduction, PCA)

The main motivation for methods like SFA is to recover a more compact representation of high-dimensional data $X = \left(x_1, x_2, \dots, x_N \right), \quad (\forall x_i \in \mathbb{R}^n)$ in the so-called latent space, defined by $Y = \left(y_1, y_2, \dots, y_N \right), \quad (\forall y_i \in \mathbb{R}^k),$ with $k < n$. $N$ denotes the sample size (the sample is independent and identically distributed, or i.i.d. for short).

### What is the goal of dimensionality reduction?

The obvious answer is to decrease the dimensionality of the data. However, this answer is insufficient: what we actually want is to decrease dimensionality while preserving as much information as possible. In the jargon of information theory, we can state this equivalently as preserving as much information as possible while projecting into a latent space of smaller dimensionality. Information is measured e.g. by the amount of variance the latent-space model can account for. In the following, we will go through how this works with PCA.

### Principal Component Analysis (PCA)

As Watanabe's theorem states, given data in $\mathbb{R}^n$, PCA preserves the maximum amount of variance (information) when projecting into $\mathbb{R}^k$. As we reason in terms of variance (our aim is to maximize it), we calculate the covariance of the data. Note that PCA only maximizes variance if it is applied to centered data; more details e.g. [here](https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca). Thus, for PCA, we first center the data,
$$ X_0 = X- \dfrac{\sum_{i=1}^N x_i}{N}, $$
and then compute the covariance $cov(X_0) = \frac{1}{N} X_0^{T}X_0$. As the covariance is a symmetric, positive semi-definite matrix in $\mathbb{R}^{n\times n}$, we can calculate its spectral decomposition
$$ cov(X_0) = \sum_{i=1}^n \lambda_i u_i u_i^T,$$
i.e. a sum of dyadic products (composed of the eigenvectors) weighted by the corresponding eigenvalues (as every dyad is symmetric, $cov(X_0)$ is as well). Here we assume that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$ (this ordering does not impose any constraints). Note that the eigenvalues of the covariance matrix are the variances along the corresponding eigenvector directions. The retained variance will be maximal (in $\mathbb{R}^k$) if we select the $k$-dimensional subspace spanned by the eigenvectors corresponding to the $k$ biggest eigenvalues, i.e. $\forall i, \ y_i \in Span\left\{u_1, \dots, u_k\right\}$. This choice is optimal in the sense that it maximizes $\dfrac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^n \lambda_i}.$
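To make the recipe above concrete, here is a minimal NumPy sketch of PCA as just described (center, compute the covariance, keep the top-$k$ eigenvectors). The function name `pca_project` and the matrix layout (rows are samples) are my own choices, not anything prescribed by the paper.

```python
import numpy as np

def pca_project(X, k):
    """Project an N x n data matrix X onto its k leading principal components."""
    # Center the data (PCA only maximizes variance on centered data).
    X0 = X - X.mean(axis=0)
    # Sample covariance: an n x n symmetric, positive semi-definite matrix.
    cov = X0.T @ X0 / X.shape[0]
    # Spectral decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors of the k largest eigenvalues.
    U_k = eigvecs[:, ::-1][:, :k]
    # Coordinates of the data in the k-dimensional principal subspace.
    return X0 @ U_k

# Example: reduce 10-dimensional i.i.d. samples to 2 dimensions.
X = np.random.randn(500, 10)
Y = pca_project(X, k=2)   # Y has shape (500, 2)
```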
## Slow Feature Analysis (SFA)

### Useful links

* Lots of details and examples from e.g. human vision and engineering: [ScholarPedia](http://www.scholarpedia.org/article/Slow_feature_analysis)
* Blog post with a numerical example: [TowardsDataScience](https://towardsdatascience.com/a-brief-introduction-to-slow-feature-analysis-18c901bc2a58)

### What is SFA?

SFA is a dimensionality reduction method which aims to recover slowly-varying features. Thus, here, the goal is to minimize the variance of the time derivative of the features (note that PCA can also be used to minimize variance). SFA is similar to PCA, but it does not depend on the scale of the data (it finds a rotation after sphering the data; this rotation is the only difference compared to PCA).

### Mathematical formulation

We would like to transform our data $X$ into features $Y$, and we assume that there is a linear mapping, denoted by $W$ below (if not, we can expand the input space to get a linear map, but that brings the curse of dimensionality knocking at our door):
$$
\begin{align}
y_i &= g(x_i) = W x_i\\
\min_{y_i} &\ \mathbb{E}(\dot{y}_i^2)\\
\mathrm{subject\ to} &\ \mathbb{E}({y}_i) = 0\\
&\ \mathbb{E}({y}_i^2) = 1\\
&\ \mathbb{E}({y}_i y_j) = 0, \ \forall j<i.\\
\end{align}
$$
The constraints enforce zero-mean features with unit variance (thus excluding the constant solution, which varies infinitely slowly) and decorrelation. As the generalized eigenvalue problem is not easy to understand (at least for me), I will stick to the more intuitive algorithm from [ScholarPedia](http://www.scholarpedia.org/article/Slow_feature_analysis), namely (a minimal sketch in code follows the list):

0. Assume that we have a linear relationship (if not, apply a nonlinear expansion first to obtain one).
1. Normalize (sphere) the data to get decorrelation and zero mean (this will result in the data lying on a hypersphere).
2. Measure the temporal variation in this normalized space with the time derivative.
3. Find the direction of smallest variance (this can be done e.g. even with PCA, where the eigenvector corresponding to the smallest eigenvalue is chosen).
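Assuming a purely linear relationship, the steps above can be written down in a few lines of NumPy. This is only a sketch: the function name `linear_sfa`, the finite-difference approximation of the time derivative, and the toy signal are my own choices, not the paper's implementation.

```python
import numpy as np

def linear_sfa(X, k):
    """Extract the k slowest features from a T x n time series X (steps 1-3 above)."""
    # Step 1: center and sphere (whiten) the data to zero mean and identity covariance.
    X0 = X - X.mean(axis=0)
    cov = X0.T @ X0 / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    S = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # sphering matrix
    Z = X0 @ S
    # Step 2: approximate the time derivative with finite differences.
    Zdot = np.diff(Z, axis=0)
    # Step 3: PCA on the derivatives; keep the directions of *smallest* variance.
    dcov = Zdot.T @ Zdot / Zdot.shape[0]
    dvals, dvecs = np.linalg.eigh(dcov)   # ascending eigenvalues
    W = dvecs[:, :k]                      # slowest k directions
    # Slow features: a rotation of the sphered data.
    return Z @ W

# Example: a slow sinusoid mixed into faster-varying channels.
t = np.linspace(0, 2 * np.pi, 1000)
slow, fast = np.sin(t), np.sin(20 * t)
X = np.column_stack([slow + 0.5 * fast, fast - 0.3 * slow, slow * fast])
Y = linear_sfa(X, k=1)   # Y[:, 0] should roughly follow the slow sinusoid
```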
## Probabilistic SFA

### Why do we like probabilistic interpretations?

For many reasons; with a probabilistic interpretation we will be able to:

1. Reason about uncertainty.
2. Handle extrapolation (prediction) and interpolation (substituting missing data) in an intuitive way.
3. See how the constraints of the SFA algorithm might arise naturally from this approach.

### Tricks of the trade

To get a meaningful probabilistic algorithm, we need to fulfill several requirements. The first is to select a proper prior. As a decorrelated output is achieved if the prior is decorrelated as well, we opt for a fully factored one (as probabilistic PCA and ICA do), where $\lambda_i$ controls the strength of the temporal correlation:
$$
\begin{align}
p(Y_t | Y_{t-1}, \lambda_{1:k}, \sigma^2_{1:k}) &= \prod_{i=1}^k p(y_{i,t} | y_{i, t-1}, \lambda_{i}, \sigma^2_{i})\\
p(y_{i,t} | y_{i, t-1}, \lambda_{i}, \sigma^2_{i}) &= \mathcal{N}(\lambda_i y_{i, t-1}, \sigma_i^2) \\
p(y_{i,1} | \sigma^2_{i}) &= \mathcal{N}(0, \sigma_i^2).
\end{align}
$$
What we can see is that our model is Markovian ($y_t$ depends only on $y_{t-1}$). In this respect, what we have is quite similar to Kalman filters, where the model is also Markovian. To have a full model, we need to be able to generate samples (for reference: Kalman filters also have two models, one for the temporal evolution of the latent variables and another for generating the samples/observations). To obtain this generative model, we invert the mapping from data space to latent space (the matrix $W$). With this matrix, inference is instantaneous (as we only need to condition on the latent with the same time index). The probabilistic approach is reflected in the fact that we place a normal distribution with isotropic covariance (i.e., a scalar multiple of the identity) around the deterministic estimate:
$$ p(x_i |y_i, W, \sigma_x) = \mathcal{N}(W^{-1} y_i, \sigma_x^2I). $$
Taking the limit of the observation variance ($\sigma_x^2 \rightarrow 0$), the deterministic algorithm is recovered.

Additionally, if you would like to have a nonlinear relationship in the generative model, you can inject e.g. neural networks via the function $\Phi$, resulting in the generative model
$$ p(x_i |y_i, W, \sigma_x) = \mathcal{N}(\Phi(W^{-1} y_i), \sigma_x^2I). $$
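To make the generative model tangible, here is a minimal sampling sketch. It assumes $W$ is square and invertible and treats $\Phi$ simply as a Python callable; the function name `sample_probabilistic_sfa` and the parameter values are illustrative choices of mine, not taken from the paper.

```python
import numpy as np

def sample_probabilistic_sfa(W, lambdas, sigmas, sigma_x, T, phi=lambda z: z):
    """Draw a length-T sequence from the generative model above.

    W:       k x k mapping from data space to latent space (assumed invertible)
    lambdas: per-latent temporal correlation strengths, shape (k,)
    sigmas:  per-latent innovation standard deviations, shape (k,)
    sigma_x: observation noise standard deviation
    phi:     optional nonlinearity (the identity recovers the linear model)
    """
    k = len(lambdas)
    W_inv = np.linalg.inv(W)
    Y = np.zeros((T, k))
    # Markovian prior: y_{i,1} ~ N(0, sigma_i^2), y_{i,t} ~ N(lambda_i * y_{i,t-1}, sigma_i^2).
    Y[0] = sigmas * np.random.randn(k)
    for t in range(1, T):
        Y[t] = lambdas * Y[t - 1] + sigmas * np.random.randn(k)
    # Emission: x_t ~ N(phi(W^{-1} y_t), sigma_x^2 I).
    X = phi(Y @ W_inv.T) + sigma_x * np.random.randn(T, k)
    return X, Y

# Example: two latent chains, one slow (lambda close to 1) and one fast.
W = np.array([[1.0, 0.5], [-0.3, 1.0]])
X, Y = sample_probabilistic_sfa(W, lambdas=np.array([0.99, 0.5]),
                                sigmas=np.array([0.1, 1.0]), sigma_x=0.05, T=500)
```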