# [Paper Reading] Graph Embedding and Extensions: A General Framework for Dimensionality Reduction
Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin 2007
###### tags: `paper_reading` `AMMAI`
## Abstract
The authors show that most dimensionality reduction algorithms can be viewed as direct graph embedding or its linear/kernel/tensor extension.
The "direct graph embedding" formulation consists of an intrinsic graph and a penalty graph. The graph embedding framework can be used as a general platform for developing new dimensionality reduction algorithms.
As a demonstration, the authors propose a new method within this framework, Marginal Fisher Analysis (MFA).
## GRAPH EMBEDDING
Definition: An algorithm to find the desired low-dimensional vector representations of a graph $G=\{X,W\}$, where $X$ is the set of vertices (data points) and $W$ is the similarity matrix.
### Direct Graph Embedding
Intrinsic graph: the graph $G$ itself, whose weights describe the similarity to be preserved.
Penalty graph: $G^p=\{X,W^p\}$, whose weights describe similarity characteristics to be suppressed in the embedding.
$\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\T}{\mathsf{T}}
Y^*=\argmin\limits_{\operatorname{tr}(Y^\T BY)=d} \sum_{i\neq j}\|y_i-y_j\|_2^2 W_{i,j}= \argmin\limits_{\operatorname{tr}(Y^\T BY)=d} \operatorname{tr}(Y^\T LY)$
$B$ is the constraint matrix, typically $B=L^p$, the Laplacian of the penalty graph. For a graph with similarity matrix $W$, the Laplacian is $L=D-W$ with $D_{ii}=\sum_{j\neq i}W_{i,j}$.
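As a concrete illustration (my own, not the paper's code), a minimal NumPy/SciPy sketch of direct graph embedding: given $W$ and a constraint matrix $B$, the embedding comes from a generalized eigenvalue problem. The function name and the small ridge added to $B$ are my own choices.

```python
# Minimal sketch of direct graph embedding (illustrative, not the paper's code).
import numpy as np
from scipy.linalg import eigh

def direct_graph_embedding(W, B, dim):
    """W: N x N similarity matrix; B: N x N constraint matrix; returns an N x dim embedding Y."""
    L = np.diag(W.sum(axis=1)) - W                  # Laplacian of the intrinsic graph
    # Generalized eigenproblem L y = lambda B y; the smallest eigenvalues give the embedding.
    # A small ridge keeps B positive definite for SciPy's solver (B = L^p is only PSD).
    vals, vecs = eigh(L, B + 1e-9 * np.eye(len(B)))
    # In practice the trivial constant eigenvector (eigenvalue ~0) is discarded if it appears.
    return vecs[:, :dim]
```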
#### Linear Mapping
$Y=X^\mathsf{T}A$, i.e. the embedding is a linear projection of the data; substituting this into the objective gives $A^*=\operatorname{argmin}\limits_{\operatorname{tr}(A^\mathsf{T}XBX^\mathsf{T}A)=d}\operatorname{tr}(A^\mathsf{T}XLX^\mathsf{T}A)$.
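Under the same setup, the linearized problem again reduces to a generalized eigenvalue problem, now on the scatter-like matrices $XLX^\mathsf{T}$ and $XBX^\mathsf{T}$. A small sketch (function name and ridge are my assumptions):

```python
# Sketch of the linearization: with Y = X^T A, the problem becomes
# X L X^T a = lambda X B X^T a for each projection direction a.
import numpy as np
from scipy.linalg import eigh

def linear_graph_embedding(X, W, B, dim):
    """X: D x N data (one sample per column); returns a D x dim projection matrix A."""
    L = np.diag(W.sum(axis=1)) - W
    Sl, Sb = X @ L @ X.T, X @ B @ X.T
    _, vecs = eigh(Sl, Sb + 1e-9 * np.eye(len(Sb)))   # ridge keeps the right side PD
    return vecs[:, :dim]                              # embed new data as Y = A^T x
```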
#### Kernelization
The kernel trick is widely used in machine learning: by mapping the data into a higher-dimensional feature space $\phi(\cdot)$, linear operations there correspond to nonlinear operations in the original space.
$K(x_i,x_j)=\phi(x_i)\cdot \phi(x_j)$
The projection direction is expressed as $w=\sum_i\alpha_i\phi(x_i)$, so the embedding of $x_j$ is $y_j=\sum_i\alpha_i K(x_i,x_j)$.
$\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\T}{\mathsf{T}}
\alpha^*=\argmin\limits_{\alpha^\T KBK\alpha=d} \alpha^\T KLK\alpha$
where $K$ is the kernel Gram matrix, $K_{i,j}=K(x_i,x_j)$.
The solution is obtained by solving the generalized eigenvalue problem $KLK\alpha=\lambda KBK\alpha$.
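A hedged sketch of the kernelized form, using a Gaussian kernel purely for illustration (the kernel choice, `gamma`, the function name, and the ridge are my assumptions):

```python
# Kernelized graph embedding sketch: everything is expressed through the Gram matrix K,
# and the coefficients alpha solve K L K alpha = lambda K B K alpha.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def kernel_graph_embedding(X, W, B, dim, gamma=1.0):
    """X: N x D data (one sample per row); Gaussian kernel assumed for illustration."""
    K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))   # Gram matrix K_ij = phi(x_i).phi(x_j)
    L = np.diag(W.sum(axis=1)) - W
    A_mat, B_mat = K @ L @ K, K @ B @ K
    _, alphas = eigh(A_mat, B_mat + 1e-9 * np.eye(len(B_mat)))
    return alphas[:, :dim]                            # embed x as sum_i alpha_i * K(x_i, x)
```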
#### Tensorization
Data $X_i$ is a tensor.
A one-dimensional case can be represented as
$y_i=X_i\times_1 w^1\times_2 w^2\cdots\times_n w^n$
The optimization problem can be solved by optimizing the different projection vectors $w^1,\dots,w^n$ iteratively, fixing all but one at each step.
Compared with the linearization of graph embedding, the feature dimension considered in each iteration of tensorization is much smaller, which leads to a significant reduction in computational cost.
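A minimal sketch of the second-order (matrix-valued data) case, alternating between the two projection vectors. The initialization, the fixed iteration count, and the function name are illustrative choices, not the paper's exact procedure:

```python
# Second-order tensorization sketch: samples are matrices X_i and y_i = w1^T X_i w2.
# Alternate: fix w2 and solve a linear graph-embedding step for w1, then vice versa.
import numpy as np
from scipy.linalg import eigh

def tensor_graph_embedding_2d(Xs, W, B, n_iter=5):
    """Xs: array of shape (N, d1, d2); returns projection vectors w1 (d1,) and w2 (d2,)."""
    N, d1, d2 = Xs.shape
    L = np.diag(W.sum(axis=1)) - W
    w1, w2 = np.ones(d1) / np.sqrt(d1), np.ones(d2) / np.sqrt(d2)
    for _ in range(n_iter):
        # Fix w2: each sample reduces to the d1-vector X_i w2; the subproblem is linear.
        U = np.stack([X @ w2 for X in Xs], axis=1)                       # shape (d1, N)
        _, v = eigh(U @ L @ U.T, U @ B @ U.T + 1e-9 * np.eye(d1))
        w1 = v[:, 0]
        # Fix w1: each sample reduces to the d2-vector X_i^T w1.
        V = np.stack([X.T @ w1 for X in Xs], axis=1)                     # shape (d2, N)
        _, v = eigh(V @ L @ V.T, V @ B @ V.T + 1e-9 * np.eye(d2))
        w2 = v[:, 0]
    return w1, w2
```

The per-iteration eigenproblems have size $d_1$ or $d_2$ instead of $d_1 d_2$, which is the computational saving mentioned above.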
### General Framework for Dimensionality Reduction

#### PCA

PCA fits the framework with $W_{i,j}=1/n$ for $i\neq j$ and constraint $B=I$; the corresponding scatter $XLX^\mathsf{T}$ is proportional to the data covariance, so the principal directions are its leading eigenvectors (see the check below).
*(Figure: adjacency graphs of (a) PCA and (b) LDA.)*
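A small self-contained check (my own, not from the paper) that the PCA graph above reproduces the covariance structure: with $W_{i,j}=1/n$ for $i\neq j$, the Laplacian equals the centering matrix, so $XLX^\mathsf{T}$ is $n$ times the biased sample covariance.

```python
# Check that the PCA intrinsic graph yields X L X^T proportional to the covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                 # 5 features, 100 samples (one per column)
n = X.shape[1]
W = (np.ones((n, n)) - np.eye(n)) / n         # PCA intrinsic graph: W_ij = 1/n, i != j
L = np.diag(W.sum(axis=1)) - W                # equals the centering matrix I - (1/n) 11^T
S = X @ L @ X.T
C = np.cov(X, bias=True) * n                  # n * biased sample covariance
print(np.allclose(S, C))                      # expected: True
```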
#### LDA
Fisher's linear discriminant fits the framework with $W_{i,j}=\delta_{c_i,c_j}/n_{c_i}$ (same-class samples connected with weight $1/n_c$, so $XLX^\mathsf{T}$ is the within-class scatter) and the PCA graph as the constraint/penalty graph.
Kernelization: Kernel Discriminant Analysis
Tensorization: 2DLDA is the second-order tensorization; DATER is the general higher-order tensor extension.
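A hedged sketch of LDA written in graph-embedding form: the intrinsic graph gives the within-class scatter $XLX^\mathsf{T}$, and using the PCA graph as penalty gives the total scatter, so minimizing their ratio is equivalent to Fisher's criterion. The construction below is my reading of the framework; in practice PCA is applied first (Fisherface-style) to avoid singular scatter matrices.

```python
# LDA via the graph-embedding framework (illustrative sketch).
import numpy as np
from scipy.linalg import eigh

def lda_graph_embedding(X, labels, dim):
    """X: D x N data (one sample per column); labels: length-N array of class ids."""
    N = X.shape[1]
    W = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)    # same-class pairs, weight 1/n_c
    np.fill_diagonal(W, 0.0)
    Wp = (np.ones((N, N)) - np.eye(N)) / N      # penalty graph = PCA graph
    L = np.diag(W.sum(axis=1)) - W              # X L  X^T = within-class scatter
    Lp = np.diag(Wp.sum(axis=1)) - Wp           # X Lp X^T = total scatter
    _, vecs = eigh(X @ L @ X.T, X @ Lp @ X.T + 1e-9 * np.eye(X.shape[0]))
    return vecs[:, :dim]                        # smallest ratios = best discriminant directions
```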
#### ISOMAP
The algorithm first computes a geodesic distance matrix $D_G$ on a neighborhood graph. Then classical MDS is applied to obtain low-dimensional representations for all data points.
MDS works on the inner product matrix $\tau(D_G)= -\frac1{2} C_nD_G^{(2)}C_n$, where $C_n$ is the centering matrix and $D_G^{(2)}$ holds the element-wise squared geodesic distances.
ISOMAP fits the graph embedding framework with
$W_{i,j}=\tau(D_G)_{i,j} \,\text{ if }\, i\neq j \text{, otherwise } W_{i,i}=0$.
Thus, $\tau(D_G)$ can be viewed as the Laplacian matrix of $W$, and one can check that each of its rows sums to zero: $\sum_j \tau(D_G)_{i,j}=0$.
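An illustrative ISOMAP sketch (kNN graph, geodesic distances via Dijkstra, then classical MDS on $\tau(D_G)$). The neighbor count and function name are my own choices, and a connected kNN graph is assumed:

```python
# ISOMAP sketch: kNN graph -> geodesic distances -> classical MDS on tau(D_G).
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n_neighbors=10, dim=2):
    """X: N x D data (one sample per row); assumes the kNN graph is connected."""
    D = cdist(X, X)
    G = np.full_like(D, np.inf)                        # inf marks "no edge"
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]  # each point's k nearest neighbors
    for i, nbrs in enumerate(idx):
        G[i, nbrs] = D[i, nbrs]
    G = np.minimum(G, G.T)                             # symmetrize the kNN graph
    DG = shortest_path(G, method='D', directed=False)  # geodesic (shortest-path) distances
    N = len(X)
    C = np.eye(N) - np.ones((N, N)) / N                # centering matrix C_n
    tau = -0.5 * C @ (DG ** 2) @ C                     # classical MDS inner-product matrix
    vals, vecs = np.linalg.eigh(tau)
    top = np.argsort(vals)[::-1][:dim]                 # largest eigenvalues carry the embedding
    return vecs[:, top] * np.sqrt(vals[top])
```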
#### LLE
Locally Linear Embedding reconstructs each node as a linear combination of its neighbors, which gives a sparse local reconstruction matrix $M$.
The loss function $\|Y - MY\|_F^2$ is minimized.
LLE fits the graph embedding framework with $W_{i,j}=(M+M^T-M^TM)_{i,j} \,\text{ if }\, i\neq j \text{, otherwise } 0$.
For a single embedding dimension $y$, the loss can be written as $y^T(I-M)^T(I-M)y$, and $(I-M)^T(I-M)=I-(M+M^T-M^TM)$ plays the role of the Laplacian matrix of $W$.
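A hedged LLE sketch: solve for each node's local reconstruction weights, then take the bottom eigenvectors of $(I-M)^\mathsf{T}(I-M)$. The regularization constant and neighbor count are illustrative choices:

```python
# LLE sketch: local reconstruction weights + bottom eigenvectors of (I - M)^T (I - M).
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, n_neighbors=10, dim=2, reg=1e-3):
    """X: N x D data (one sample per row)."""
    N = len(X)
    nbrs = np.argsort(cdist(X, X), axis=1)[:, 1:n_neighbors + 1]
    M = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                            # neighbors centered on x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(n_neighbors)     # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        M[i, nbrs[i]] = w / w.sum()                      # reconstruction weights sum to 1
    E = (np.eye(N) - M).T @ (np.eye(N) - M)
    vals, vecs = np.linalg.eigh(E)
    return vecs[:, 1:dim + 1]                            # skip the trivial constant eigenvector
```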
#### Laplacian Eigenmaps
Compared to ISOMAP, LE focuses on preserving only the local geometry: the similarity matrix $W$ records the closeness between each node and its neighbors (e.g. heat-kernel weights), and $W_{i,j}=0$ if $i$ and $j$ are not neighbors.
The node embeddings are the eigenvectors of the generalized eigenvalue problem $Ly=\lambda Dy$ associated with the smallest nonzero eigenvalues, where $D_{ii}=\sum_j W_{i,j}$.
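A brief Laplacian Eigenmaps sketch with heat-kernel weights on a kNN graph (the kernel width $t$ and the neighbor count are arbitrary illustrative values):

```python
# Laplacian Eigenmaps sketch: heat-kernel kNN graph, then solve L y = lambda D y.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, n_neighbors=10, dim=2, t=1.0):
    """X: N x D data (one sample per row)."""
    D2 = cdist(X, X, 'sqeuclidean')
    W = np.zeros_like(D2)
    nbrs = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]
    for i, js in enumerate(nbrs):
        W[i, js] = np.exp(-D2[i, js] / t)     # heat-kernel similarity to each neighbor
    W = np.maximum(W, W.T)                    # symmetrize
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = eigh(L, D)                      # generalized eigenproblem L y = lambda D y
    return vecs[:, 1:dim + 1]                 # drop the trivial constant eigenvector
```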
#### Locality Preserving Projection
LPP is the linearization of LE: with $y=X^\mathsf{T}w$, the problem becomes $XLX^\mathsf{T}w=\lambda XDX^\mathsf{T}w$. Conversely, LE can be seen as the kernelization of LPP ($KLK\alpha=\lambda KDK\alpha$ with $y=K\alpha$).
#### t-SNE, t-distributed stochastic neighbor embedding
The t-distribution has a heavier tail than the Gaussian, which allows dissimilar points to be placed farther apart in the low-dimensional map and thus alleviates the crowding problem of mapping down from high dimensions.
*SNE was not discussed in this paper. I think it cannot be formulated in graph embedding?*
*reference: https://www.youtube.com/watch?v=GBUEjkpoxXc
https://mathoverflow.net/questions/187180/how-to-prove-that-a-kernel-is-positive-definite/187190*
## MARGINAL FISHER ANALYSIS

The intrinsic graph connects each sample to its $k_1$ nearest neighbors within the same class; the penalty graph connects, for each class, the $k_2$ nearest between-class sample pairs (marginal pairs). See the sketch at the end of this section.
Marginal Fisher criterion: $\newcommand{\T}{\mathsf{T}}
\min\limits_w \frac{w^\T XLX^\T w}{w^\T XL^pX^\T w}$
In comparison to LDA, MFA has the following advantages:
1) No assumption on the data distribution of each class.
2) It separates different classes better than LDA, since the interclass margin characterizes separability better than the interclass variance.
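A hedged MFA sketch following the graph construction described above. For simplicity, the penalty graph below links each sample to its $k_2$ nearest other-class neighbors (a per-sample variant of the paper's marginal pairs), and $k_1$, $k_2$ are arbitrary defaults; in practice PCA is applied first, as in the experiments.

```python
# Marginal Fisher Analysis sketch: within-class kNN intrinsic graph,
# between-class kNN penalty graph, then the marginal Fisher criterion.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def marginal_fisher_analysis(X, labels, dim, k1=5, k2=20):
    """X: D x N data (one sample per column); labels: length-N array of class ids."""
    N = X.shape[1]
    D2 = cdist(X.T, X.T, 'sqeuclidean')
    W = np.zeros((N, N))
    Wp = np.zeros((N, N))
    for i in range(N):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        diff = np.where(labels != labels[i])[0]
        for j in same[np.argsort(D2[i, same])][:k1]:
            W[i, j] = W[j, i] = 1.0               # intrinsic: within-class nearest neighbors
        for j in diff[np.argsort(D2[i, diff])][:k2]:
            Wp[i, j] = Wp[j, i] = 1.0             # penalty: nearby between-class pairs
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    _, vecs = eigh(X @ L @ X.T, X @ Lp @ X.T + 1e-9 * np.eye(X.shape[0]))
    return vecs[:, :dim]                          # smallest ratios minimize the criterion
```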
## Results, Conclusion and Future Work
The experimental results show:
* The kernel trick (Gaussian kernel) is very powerful, and tensorization usually improves the performance further.
* When the data set distribution is complex and the training data is not enough, LPP is worse than Fisherface (LDA).
* The performance can be substantially improved by exploring a certain range of PCA dimensions before conducting LDA or MFA.
* When the training sample size is large enough, all algorithms discussed in this work can achieve similar performance.
The graph embedding framework only considers the L2 distance as the similarity measure, which means it can only capture the second-order statistics of the data set.
Many of the dimensionality reduction algorithms discussed here use kNN graphs to encode local relationships, so an analysis of how to choose $k$ would be valuable.