# [Paper Reading] Graph Embedding and Extensions: A General Framework for Dimensionality Reduction
Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin 2007
###### tags: `paper_reading` `AMMAI`
## Abstract
The authors show that most dimensionality reduction algorithms can be viewed as direct graph embedding or its linear/kernel/tensor extension.
The "direct graph embedding" formulation consists of an intrinsic graph and a penalty graph. The graph embedding framework can be used as a general platform for developing new dimensionality reduction algorithms.
As a demonstration, the authors propose a new method within this framework, Marginal Fisher Analysis (MFA).
## GRAPH EMBEDDING
Definition: An algorithm to find the desired low-dimensional vector representations of a graph $G=\{X,W\}$, where $X$ is the set of vertices (data points) and $W$ is the similarity matrix.
### Direct Graph Embedding
Intrinsic graph: the graph $G$ itself, whose weights describe the similarity to be preserved.
Penalty graph: $G^p=\{X,W^p\}$, whose weights describe similarity characteristics to be suppressed in the embedding.
$\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\T}{\mathsf{T}}
Y^*=\argmin\limits_{\operatorname{tr}(Y^\T BY)=d} \sum_{i\neq j}\|y_i-y_j\|_2^2 W_{i,j}= \argmin\limits_{\operatorname{tr}(Y^\T BY)=d} \operatorname{tr}(Y^\T LY)$
$B$ is the constraint matrix, typically $B=L^p$, the Laplacian of the penalty graph. For a graph with similarity matrix $W$, the Laplacian is $L=D-W$ with $D_{ii}=\sum_{j\neq i}W_{i,j}$.
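As a concrete illustration (my own, not the paper's code), a minimal NumPy/SciPy sketch of direct graph embedding: given $W$ and a constraint matrix $B$, the embedding comes from a generalized eigenvalue problem. The function name and the small ridge added to $B$ are my own choices.

```python
# Minimal sketch of direct graph embedding (illustrative, not the paper's code).
import numpy as np
from scipy.linalg import eigh

def direct_graph_embedding(W, B, dim):
    """W: N x N similarity matrix; B: N x N constraint matrix; returns an N x dim embedding Y."""
    L = np.diag(W.sum(axis=1)) - W                  # Laplacian of the intrinsic graph
    # Generalized eigenproblem L y = lambda B y; the smallest eigenvalues give the embedding.
    # A small ridge keeps B positive definite for SciPy's solver (B = L^p is only PSD).
    vals, vecs = eigh(L, B + 1e-9 * np.eye(len(B)))
    # In practice the trivial constant eigenvector (eigenvalue ~0) is discarded if it appears.
    return vecs[:, :dim]
```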
#### Linear Mapping
$Y=X^\mathsf{T}A$, i.e. the embedding is a linear projection of the data; substituting this into the objective gives $A^*=\operatorname{argmin}\limits_{\operatorname{tr}(A^\mathsf{T}XBX^\mathsf{T}A)=d}\operatorname{tr}(A^\mathsf{T}XLX^\mathsf{T}A)$.
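Under the same setup, the linearized problem again reduces to a generalized eigenvalue problem, now on the scatter-like matrices $XLX^\mathsf{T}$ and $XBX^\mathsf{T}$. A small sketch (function name and ridge are my assumptions):

```python
# Sketch of the linearization: with Y = X^T A, the problem becomes
# X L X^T a = lambda X B X^T a for each projection direction a.
import numpy as np
from scipy.linalg import eigh

def linear_graph_embedding(X, W, B, dim):
    """X: D x N data (one sample per column); returns a D x dim projection matrix A."""
    L = np.diag(W.sum(axis=1)) - W
    Sl, Sb = X @ L @ X.T, X @ B @ X.T
    _, vecs = eigh(Sl, Sb + 1e-9 * np.eye(len(Sb)))   # ridge keeps the right side PD
    return vecs[:, :dim]                              # embed new data as Y = A^T x
```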
#### Kernelization
The kernel trick is widely used in machine learning: by mapping the data into a higher-dimensional feature space $\phi(\cdot)$, linear operations there correspond to nonlinear operations in the original space.
$K(x_i,x_j)=\phi(x_i)\cdot \phi(x_j)$
The projection direction is expressed as $w=\sum_i\alpha_i\phi(x_i)$, so the embedding of $x_j$ is $y_j=\sum_i\alpha_i K(x_i,x_j)$.
$\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\T}{\mathsf{T}}
\alpha^*=\argmin\limits_{\alpha^\T KBK\alpha=d} \alpha^\T KLK\alpha$
where $K$ is the kernel Gram matrix, $K_{i,j}=K(x_i,x_j)$.
The solution is obtained by solving the generalized eigenvalue problem $KLK\alpha=\lambda KBK\alpha$.
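A hedged sketch of the kernelized form, using a Gaussian kernel purely for illustration (the kernel choice, `gamma`, the function name, and the ridge are my assumptions):

```python
# Kernelized graph embedding sketch: everything is expressed through the Gram matrix K,
# and the coefficients alpha solve K L K alpha = lambda K B K alpha.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def kernel_graph_embedding(X, W, B, dim, gamma=1.0):
    """X: N x D data (one sample per row); Gaussian kernel assumed for illustration."""
    K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))   # Gram matrix K_ij = phi(x_i).phi(x_j)
    L = np.diag(W.sum(axis=1)) - W
    A_mat, B_mat = K @ L @ K, K @ B @ K
    _, alphas = eigh(A_mat, B_mat + 1e-9 * np.eye(len(B_mat)))
    return alphas[:, :dim]                            # embed x as sum_i alpha_i * K(x_i, x)
```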
#### Tensorization
Data $X_i$ is a tensor.
A one-dimensional case can be represented as
$y_i=X_i\times_1 w^1\times_2 w^2\cdots\times_n w^n$
The optimization problem can be solved by optimizing the different projection vectors $w^1,\dots,w^n$ iteratively, fixing all but one at each step.
Compared with the linearization of graph embedding, the feature dimension considered in each iteration of tensorization is much smaller, which leads to a significant reduction in computational cost.
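A minimal sketch of the second-order (matrix-valued data) case, alternating between the two projection vectors. The initialization, the fixed iteration count, and the function name are illustrative choices, not the paper's exact procedure:

```python
# Second-order tensorization sketch: samples are matrices X_i and y_i = w1^T X_i w2.
# Alternate: fix w2 and solve a linear graph-embedding step for w1, then vice versa.
import numpy as np
from scipy.linalg import eigh

def tensor_graph_embedding_2d(Xs, W, B, n_iter=5):
    """Xs: array of shape (N, d1, d2); returns projection vectors w1 (d1,) and w2 (d2,)."""
    N, d1, d2 = Xs.shape
    L = np.diag(W.sum(axis=1)) - W
    w1, w2 = np.ones(d1) / np.sqrt(d1), np.ones(d2) / np.sqrt(d2)
    for _ in range(n_iter):
        # Fix w2: each sample reduces to the d1-vector X_i w2; the subproblem is linear.
        U = np.stack([X @ w2 for X in Xs], axis=1)                       # shape (d1, N)
        _, v = eigh(U @ L @ U.T, U @ B @ U.T + 1e-9 * np.eye(d1))
        w1 = v[:, 0]
        # Fix w1: each sample reduces to the d2-vector X_i^T w1.
        V = np.stack([X.T @ w1 for X in Xs], axis=1)                     # shape (d2, N)
        _, v = eigh(V @ L @ V.T, V @ B @ V.T + 1e-9 * np.eye(d2))
        w2 = v[:, 0]
    return w1, w2
```

The per-iteration eigenproblems have size $d_1$ or $d_2$ instead of $d_1 d_2$, which is the computational saving mentioned above.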
### General Framework for Dimensionality Reduction

#### PCA

PCA fits the framework with $W_{i,j}=1/n$ for $i\neq j$ and constraint $B=I$; the corresponding scatter $XLX^\mathsf{T}$ is proportional to the data covariance, so the principal directions are its leading eigenvectors (see the check below).
*(Figure: adjacency graphs of (a) PCA and (b) LDA.)*
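A small self-contained check (my own, not from the paper) that the PCA graph above reproduces the covariance structure: with $W_{i,j}=1/n$ for $i\neq j$, the Laplacian equals the centering matrix, so $XLX^\mathsf{T}$ is $n$ times the biased sample covariance.

```python
# Check that the PCA intrinsic graph yields X L X^T proportional to the covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                 # 5 features, 100 samples (one per column)
n = X.shape[1]
W = (np.ones((n, n)) - np.eye(n)) / n         # PCA intrinsic graph: W_ij = 1/n, i != j
L = np.diag(W.sum(axis=1)) - W                # equals the centering matrix I - (1/n) 11^T
S = X @ L @ X.T
C = np.cov(X, bias=True) * n                  # n * biased sample covariance
print(np.allclose(S, C))                      # expected: True
```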
#### LDA
Fisher's linear discriminant fits the framework with $W_{i,j}=\delta_{c_i,c_j}/n_{c_i}$ (same-class samples connected with weight $1/n_c$, so $XLX^\mathsf{T}$ is the within-class scatter) and the PCA graph as the constraint/penalty graph.
Kernelization: Kernel Discriminant Analysis
Tensorization: 2DLDA is the second-order tensorization; DATER is the general higher-order tensor extension.
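A hedged sketch of LDA written in graph-embedding form: the intrinsic graph gives the within-class scatter $XLX^\mathsf{T}$, and using the PCA graph as penalty gives the total scatter, so minimizing their ratio is equivalent to Fisher's criterion. The construction below is my reading of the framework; in practice PCA is applied first (Fisherface-style) to avoid singular scatter matrices.

```python
# LDA via the graph-embedding framework (illustrative sketch).
import numpy as np
from scipy.linalg import eigh

def lda_graph_embedding(X, labels, dim):
    """X: D x N data (one sample per column); labels: length-N array of class ids."""
    N = X.shape[1]
    W = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)    # same-class pairs, weight 1/n_c
    np.fill_diagonal(W, 0.0)
    Wp = (np.ones((N, N)) - np.eye(N)) / N      # penalty graph = PCA graph
    L = np.diag(W.sum(axis=1)) - W              # X L  X^T = within-class scatter
    Lp = np.diag(Wp.sum(axis=1)) - Wp           # X Lp X^T = total scatter
    _, vecs = eigh(X @ L @ X.T, X @ Lp @ X.T + 1e-9 * np.eye(X.shape[0]))
    return vecs[:, :dim]                        # smallest ratios = best discriminant directions
```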
#### ISOMAP
The algorithm first computes a geodesic distance matrix $D_G$ on a neighborhood graph. Then classical MDS is applied to obtain low-dimensional representations for all data points.
MDS works on the inner product matrix $\tau(D_G)= -\frac1{2} C_nD_G^{(2)}C_n$, where $C_n$ is the centering matrix and $D_G^{(2)}$ holds the element-wise squared geodesic distances.
ISOMAP fits the graph embedding framework with
$W_{i,j}=\tau(D_G)_{i,j} \,\text{ if }\, i\neq j \text{, otherwise } W_{i,i}=0$.
Thus, $\tau(D_G)$ can be viewed as the Laplacian matrix of $W$, and one can check that each of its rows sums to zero: $\sum_j \tau(D_G)_{i,j}=0$.
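An illustrative ISOMAP sketch (kNN graph, geodesic distances via Dijkstra, then classical MDS on $\tau(D_G)$). The neighbor count and function name are my own choices, and a connected kNN graph is assumed:

```python
# ISOMAP sketch: kNN graph -> geodesic distances -> classical MDS on tau(D_G).
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n_neighbors=10, dim=2):
    """X: N x D data (one sample per row); assumes the kNN graph is connected."""
    D = cdist(X, X)
    G = np.full_like(D, np.inf)                        # inf marks "no edge"
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]  # each point's k nearest neighbors
    for i, nbrs in enumerate(idx):
        G[i, nbrs] = D[i, nbrs]
    G = np.minimum(G, G.T)                             # symmetrize the kNN graph
    DG = shortest_path(G, method='D', directed=False)  # geodesic (shortest-path) distances
    N = len(X)
    C = np.eye(N) - np.ones((N, N)) / N                # centering matrix C_n
    tau = -0.5 * C @ (DG ** 2) @ C                     # classical MDS inner-product matrix
    vals, vecs = np.linalg.eigh(tau)
    top = np.argsort(vals)[::-1][:dim]                 # largest eigenvalues carry the embedding
    return vecs[:, top] * np.sqrt(vals[top])
```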
#### LLE
Locally Linear Embedding reconstructs each node as a linear combination of its neighbors, which gives a sparse local reconstruction matrix $M$.
The loss function $\|Y - MY\|_F^2$ is minimized.
LLE fits the graph embedding framework with $W_{i,j}=(M+M^T-M^TM)_{i,j} \,\text{ if }\, i\neq j \text{, otherwise } 0$.
For a single embedding dimension $y$, the loss can be written as $y^T(I-M)^T(I-M)y$, and $(I-M)^T(I-M)=I-(M+M^T-M^TM)$ plays the role of the Laplacian matrix of $W$.
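A hedged LLE sketch: solve for each node's local reconstruction weights, then take the bottom eigenvectors of $(I-M)^\mathsf{T}(I-M)$. The regularization constant and neighbor count are illustrative choices:

```python
# LLE sketch: local reconstruction weights + bottom eigenvectors of (I - M)^T (I - M).
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, n_neighbors=10, dim=2, reg=1e-3):
    """X: N x D data (one sample per row)."""
    N = len(X)
    nbrs = np.argsort(cdist(X, X), axis=1)[:, 1:n_neighbors + 1]
    M = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                            # neighbors centered on x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(n_neighbors)     # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        M[i, nbrs[i]] = w / w.sum()                      # reconstruction weights sum to 1
    E = (np.eye(N) - M).T @ (np.eye(N) - M)
    vals, vecs = np.linalg.eigh(E)
    return vecs[:, 1:dim + 1]                            # skip the trivial constant eigenvector
```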
#### Laplacian Eigenmaps
Compared to ISOMAP, LE focuses on preserving only the local geometry: the similarity matrix $W$ records the closeness between each node and its neighbors (e.g. heat-kernel weights), and $W_{i,j}=0$ if $i$ and $j$ are not neighbors.
The node embeddings are the eigenvectors of the generalized eigenvalue problem $Ly=\lambda Dy$ associated with the smallest nonzero eigenvalues, where $D_{ii}=\sum_j W_{i,j}$.
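A brief Laplacian Eigenmaps sketch with heat-kernel weights on a kNN graph (the kernel width $t$ and the neighbor count are arbitrary illustrative values):

```python
# Laplacian Eigenmaps sketch: heat-kernel kNN graph, then solve L y = lambda D y.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, n_neighbors=10, dim=2, t=1.0):
    """X: N x D data (one sample per row)."""
    D2 = cdist(X, X, 'sqeuclidean')
    W = np.zeros_like(D2)
    nbrs = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]
    for i, js in enumerate(nbrs):
        W[i, js] = np.exp(-D2[i, js] / t)     # heat-kernel similarity to each neighbor
    W = np.maximum(W, W.T)                    # symmetrize
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = eigh(L, D)                      # generalized eigenproblem L y = lambda D y
    return vecs[:, 1:dim + 1]                 # drop the trivial constant eigenvector
```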
#### Locality Preserving Projection
LPP is the linearization of LE: with $y=X^\mathsf{T}w$, the problem becomes $XLX^\mathsf{T}w=\lambda XDX^\mathsf{T}w$. Conversely, LE can be seen as the kernelization of LPP ($KLK\alpha=\lambda KDK\alpha$ with $y=K\alpha$).
#### t-SNE, t-distributed stochastic neighbor embedding
The t-distribution has a heavier tail than the Gaussian, which allows dissimilar points to be placed farther apart in the low-dimensional map and thus alleviates the crowding problem of mapping down from high dimensions.
*SNE was not discussed in this paper. I think it cannot be formulated in graph embedding?*
*reference: https://www.youtube.com/watch?v=GBUEjkpoxXc
https://mathoverflow.net/questions/187180/how-to-prove-that-a-kernel-is-positive-definite/187190*
## MARGINAL FISHER ANALYSIS

The intrinsic graph connects each sample to its $k_1$ nearest neighbors within the same class; the penalty graph connects, for each class, the $k_2$ nearest between-class sample pairs (marginal pairs). See the sketch at the end of this section.
Marginal Fisher criterion: $\newcommand{\T}{\mathsf{T}}
\min\limits_w \frac{w^\T XLX^\T w}{w^\T XL^pX^\T w}$
In comparison to LDA, MFA has the following advantages:
1) No assumption on the data distribution of each class.
2) It separates different classes better than LDA, since the interclass margin characterizes separability better than the interclass variance.
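A hedged MFA sketch following the graph construction described above. For simplicity, the penalty graph below links each sample to its $k_2$ nearest other-class neighbors (a per-sample variant of the paper's marginal pairs), and $k_1$, $k_2$ are arbitrary defaults; in practice PCA is applied first, as in the experiments.

```python
# Marginal Fisher Analysis sketch: within-class kNN intrinsic graph,
# between-class kNN penalty graph, then the marginal Fisher criterion.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def marginal_fisher_analysis(X, labels, dim, k1=5, k2=20):
    """X: D x N data (one sample per column); labels: length-N array of class ids."""
    N = X.shape[1]
    D2 = cdist(X.T, X.T, 'sqeuclidean')
    W = np.zeros((N, N))
    Wp = np.zeros((N, N))
    for i in range(N):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        diff = np.where(labels != labels[i])[0]
        for j in same[np.argsort(D2[i, same])][:k1]:
            W[i, j] = W[j, i] = 1.0               # intrinsic: within-class nearest neighbors
        for j in diff[np.argsort(D2[i, diff])][:k2]:
            Wp[i, j] = Wp[j, i] = 1.0             # penalty: nearby between-class pairs
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    _, vecs = eigh(X @ L @ X.T, X @ Lp @ X.T + 1e-9 * np.eye(X.shape[0]))
    return vecs[:, :dim]                          # smallest ratios minimize the criterion
```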
## Results, Conclusion and Future Work
The experimental results show:
* The kernel trick (Gaussian kernel) is very powerful, and tensorization usually improves the performance further.
* When the data set distribution is complex and the training data is not enough, LPP is worse than Fisherface (LDA).
* The performance can be substantially improved by exploring a certain range of PCA dimensions before conducting LDA or MFA.
* When the training sample size is large enough, all algorithms discussed in this work can achieve similar performance.
The graph embedding framework only considers the L2 distance as the similarity measure, which means it can only capture the second-order statistics of the data set.
Many of the dimensionality reduction algorithms discussed here use kNN graphs to encode local relationships, so an analysis of how to choose $k$ would be valuable.