# KSVD-ICML2024
### Reply to Reviewer Lf5K (Submitted)
We thank the reviewer for the reply. Graph reconstruction is a typical task in node representation learning (e.g., Sec 4.2 in [1], Sec. 4.4 in [2]), and it is helpful to evaluate how well the learned representations preserve neighborhood information in embedding space. Graph reconstruction recovers *all existing* edges by reconstructing the full adjacency matrix from the embedding space, while link prediction aims at predicting the likelihood of *future* associations between two nodes or, alternatively, inferring *missing* links.
[1] Khosla, Megha, et al. "Node representation learning for directed graphs." Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019.
[2] Tsitsulin, Anton, et al. "Verse: Versatile graph embeddings from similarity measures." Proceedings of World Wide Web Conference 2018.
### Reply to Reviewer 4PjV (Submitted)
We thank the reviewer for the reply recognizing our work as a good contribution and acknowledging our rebuttal.
Learning with asymmetric kernels has not been widely explored, and our work has significant differences and novelties with respect to the related literature mentioned, as discussed in the current paper. In the final version, we will enrich the detailed comparisons as elucidated in our rebuttal.
## Summary
Dear Chairs and Reviewers,
Thanks for the constructive comments and suggestions during the review process. As the discussion period is approaching the end, we summarize below the strengths of the paper and our responses to the main concerns from the reviewers, which we hope will be helpful for the final decision.
**Strengths & reviewers’ support:**
1. _Contribution to learning with asymmetric kernels._
All reviewers agree on the interesting and promising applications of learning with asymmetric kernel methods, which have not been extensively explored in the community. The reviewers recognize our contribution in this aspect.
The CCE problem is new in the field. It learns the set of directions of interest in the feature space arising from the SVD of the asymmetric Gram matrix, leading to interpretations from the perspectives of both covariance operators and kernel operators. The presented asymmetric Nyström method is important for faster computation of the proposed asymmetric kernel machine on large-scale problems, and has not yet been formally derived, despite being intuitively outlined in some existing work.
2. _Thorough experimental evaluations._
The reviewers recognize the extensiveness of our experiments. As suggested by Reviewer Lf5K, we additionally compare with "Directed graph auto-encoders. AAAI 2022", which is specifically designed for graphs; the results further demonstrate our effectiveness in different applications.
Our empirical experiments extensively evaluate the effectiveness of learning with asymmetric kernels and compare against symmetric kernel methods and other relevant methods with discussions. Our evaluations and comparisons cover varied datasets (e.g., directed graphs, general data, million-scale data) and tasks (e.g., node representation learning, downstream classification/regression, biclustering), discussing the application potential of asymmetric kernel methods under different setups.
**Main concerns from the reviews:**
In the rebuttal, we have responded to each comment and addressed most of the concerns, as acknowledged in the reviewers' replies. Below, we summarize the main concerns and kind suggestions from the reviewers.
> Reviewer R8Lb: "In case of acceptance or a future submission to other venues, I kindly recommend to revise the manuscript by adding the difference of CCE against a rather more straightforward "2KPCA" view, and put a less emphasis on the novelty of the asymmetric Nyström formula."
We have explained the differences of CCE against the "2KPCA" view in the rebuttal section **Introduction of the CCE problem**, which will be added to the final version to elaborate on the further benefits of CCE. The additional insight from the reviewer about the asymmetric Nyström method will also be integrated for an enhanced discussion in the final version, along with our derivations based on the integral equations. We will also phrase this section more concisely.
> Reviewer R8Lb: the Nystrom method for asymmetric kernels is not something entirely new;
The derivation in [1] contains the intuition for the asymmetric Nyström method. However, it is neither formally derived as in this work, nor validated with extensive experiments as in this work. The reviewer offers an alternative insight, which we will discuss in the next version. The derivation in our work follows another perspective, i.e., the integral equations with asymmetric kernels. This is in a similar spirit to the widely adopted symmetric Nyström method [2] in machine learning, which views the Nyström approximation [3] from the integral equations with symmetric kernels, as discussed in detail in Sec. A.2.1 and Sec. A.2.2 in the Appendix.
> Reviewer 4PjV: "I think this work is a good contribution, but the connections to related work remain weak."
In our paper, we cover related works primarily in Sec. 2.3 and 2.1, and give more detailed discussions in Appendix A. In the rebuttal section **Relation to prior work**, we have further expanded the connections to the existing literature that considers infinite-dimensional feature maps and asymmetric similarities. We will update the paper with these comparisons in the main body.
[1] Michaeli, T. Wang, W. and Livescu, K. Nonparametric canonical correlation analysis. ICML 2016.
[2] Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. NeurIPS 2000.
[3] Baker, C. T. The numerical treatment of integral equations. SIAM Review, 23(2):266, 1981.
Best wishes,
The author(s)
## Reply to Reviewer R8Lb
Thanks for the reply and suggestions.
(1) We agree that the derivation in the NCCA paper contains the intuition for the asymmetric Nyström method. However, it is not formally derived as we do in this work, nor has it been empirically validated with extensive experiments as in this work.
We appreciate the reviewer's acknowledgement of our thorough experimental evaluations, which shed light on learning with asymmetric kernels and its speedup with the asymmetric Nyström method, and which we believe constitute a nontrivial contribution in this direction.
(2) We thank the reviewer for the detailed response with additional insight, which indeed offers an alternative way to derive the asymmetric Nyström method. We will update the paper to discuss this view. The derivation in our work follows another perspective, i.e., the integral equations with asymmetric kernels. One of our goals is to align and compare with the symmetric Nyström method in [2], widely adopted in machine learning, which views the Nyström approximation [5] from the integral equations with symmetric kernels, as presented in Sec. A.2.1 and Sec. A.2.2 in the Appendix. In this sense, our paper adopted this treatment.
Regarding the further questions, we answer below.
- The adjoint eigenfunctions and the singular functions are two naming conventions. In Lines 221-223 (right column), we noted this to differentiate from Schmidt's terminology. We will also point out the modern convention in the paper. Since we based our treatment of the integral equations for the continuous analogue of SVD on the framework in [4] (Section 5), and since the CCE gives rise to a generalized shifted eigenvalue problem [6], we adopted the same naming convention as [4] herein.
- Yes, the scaling factors $l_{\lambda_s},l_{u_s},l_{v_s}$ are introduced to assist the correspondence between the infinite-dimensional and finite-dimensional cases in deriving the finite-sample approximation. They are useful in our derivations, reflecting the gap between deriving the symmetric and asymmetric cases, but need not be computed explicitly in practice. We will present more succinct explanations.
We thank the reviewer for recognizing our contributions. We will add the discussions on CCE against 2KPCA and make more concise explanations on the asymmetric Nyström.
[2] Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. NeurIPS 2000.
[4] Stewart, G. W. On the early history of the singular value decomposition. SIAM Review, 1993.
[5] Baker, C. T. The numerical treatment of integral equations. SIAM Review, 23(2):266, 1981.
[6] Lanczos, C. Linear systems in self-adjoint form. The American Mathematical Monthly, 65(9):665–679, 1958.
## Verification of the reviewer's reply
If we consider an asymmetric kernel $\kappa(x,y)$, then the induced kernel operator and its adjoint are defined by
$$
\begin{aligned}
(Kg)(x)&=\mathbb{E}_{p(y)}[\kappa(x,Y)g(Y)],\\
(K^*f)(y)&=\mathbb{E}_{p(x)}[\kappa(X,y)f(X)],
\end{aligned}
$$
for $L^2$-integrable functions $f$ and $g$.
Then, the left and right $s$-th singular functions $\phi_s(\cdot)$ and $\psi_s(\cdot)$ of the kernel operator $K$ induced by $\kappa(x,y)$ satisfy
$$
\begin{aligned}
(K^*\phi_s)(y)&=\sigma_s \psi_s(y),\\
(K\psi_s)(x)&=\sigma_s \phi_s(x),
\end{aligned}
$$
where $\sigma_s\ge 0$ is the $s$-th singular value.
Given $N$ samples $x_1,\ldots,x_N$ drawn from $p(x)$ and $M$ samples $y_1,\ldots,y_M$ drawn from $p(y)$, these relations can be approximated as (==Is this derived from the property of the $L^2$-integrable functions in the first set of equations, and how does it correspond to the singular functions and singular values? Not familiar with this, but I think it should be correct.==)
$$
\begin{aligned}
\psi_s(y)&= \frac{1}{\sigma_s} (K^*\phi_s)(y) \approx \frac{1}{N\sigma_s} \sum_{i=1}^N \kappa(x_i,y)\phi_s(x_i),\\
\phi_s(x)&= \frac{1}{\sigma_s} (K\psi_s)(x) \approx \frac{1}{M\sigma_s} \sum_{j=1}^M \kappa(x,y_j)\psi_s(y_j).
\end{aligned}
$$
**If done correctly, you only need to scale the kernel matrix by $1/{\sqrt{MN}}$, the left and right singular vectors by $1/\sqrt{M}$ and $1/\sqrt{N}$.**(==I suppose the reviewer refers to the reciprocals for the scaling of the singular vectors. To align with our paper, we consider $N$ samples for $x_i$ and $M$ for $y_j$, so the left and right singular vectors should be scaled by $\sqrt{N}$ and $\sqrt{M}$.==)
We proceed as requested by the reviewer, i.e., **scale the kernel matrix by $1/{\sqrt{MN}}$ and the left and right singular vectors by $\sqrt{N}$ and $\sqrt{M}$.** With $G=U\Lambda V^\top$, we obtain
$$
\begin{aligned}
\psi_s(y)& \approx \frac{1}{N\sigma_s} \sum_{i=1}^N \kappa(x_i,y)\phi_s(x_i) \;\rightarrow\; \frac{\sqrt{MN}}{N\lambda_s} \sum_{i=1}^N \kappa(x_i,y_j)\sqrt{N}\, U_{is} \approx \sqrt{M}\, V_{js} \;\rightarrow\; \frac{1}{\lambda_s} \sum_{i=1}^N \kappa(x_i,y_j)\, U_{is} \approx V_{js},\\
\phi_s(x)& \approx \frac{1}{M\sigma_s} \sum_{j=1}^M \kappa(x,y_j)\psi_s(y_j) \;\rightarrow\; \frac{\sqrt{MN}}{M\lambda_s} \sum_{j=1}^M \kappa(x_i,y_j)\sqrt{M}\, V_{js} \approx \sqrt{N}\, U_{is} \;\rightarrow\; \frac{1}{\lambda_s} \sum_{j=1}^M \kappa(x_i,y_j)\, V_{js} \approx U_{is},
\end{aligned}
$$
which indeed correspond to matrix SVD.
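For completeness, the following minimal numpy sketch (our own illustration, not code from the paper) checks this correspondence numerically: with $\phi_s(x_i)\approx\sqrt{N}\,U_{is}$, $\psi_s(y_j)\approx\sqrt{M}\,V_{js}$ and $\sigma_s\approx\lambda_s/\sqrt{MN}$, the finite-sample relations above reduce exactly to the matrix SVD of $G$.
```python
# Minimal check (illustration only) that the scaled quantities satisfy the relations above.
import numpy as np

rng = np.random.default_rng(0)
N, M, r = 50, 30, 5                       # N samples x_i, M samples y_j, top-r components
G = rng.standard_normal((N, M))           # G_ij = kappa(x_i, y_j) for some asymmetric kernel
U, lam, Vt = np.linalg.svd(G, full_matrices=False)
U, lam, V = U[:, :r], lam[:r], Vt[:r].T

# Continuous-level estimates: phi_s(x_i) ~ sqrt(N) U_is, psi_s(y_j) ~ sqrt(M) V_js,
# sigma_s ~ lam_s / sqrt(M N).
phi, psi, sigma = np.sqrt(N) * U, np.sqrt(M) * V, lam / np.sqrt(M * N)

# psi_s(y_j) ≈ (1 / (N sigma_s)) sum_i kappa(x_i, y_j) phi_s(x_i)
psi_hat = (G.T @ phi) / (N * sigma)
# phi_s(x_i) ≈ (1 / (M sigma_s)) sum_j kappa(x_i, y_j) psi_s(y_j)
phi_hat = (G @ psi) / (M * sigma)

print(np.allclose(psi_hat, psi), np.allclose(phi_hat, phi))   # True True
```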
## General Responses
We thank all reviewers for the precious comments, and would like to address in this comment the general points of motivation and contributions.
- Learning with asymmetric similarities has recently gained new momentum, largely due to the effectiveness of the attention mechanism, which is asymmetric by design.
[CCE problem][Nystrom]
[comparisons to GNN]
[discussions to prior work]
## Reviewer R8Lb (Rating: 4, Confidence: 5)
We thank the reviewer for the careful review and we answer the main points below.
### Introduction of the CCE problem
The proposed CCE problem gives a new understanding of the set of directions of interest in the feature space, namely $W_\phi$ and $W_\psi$ from Proposition 2.5, arising from the SVD of the asymmetric Gram matrix and the feature maps. As far as we are aware, this formulation is novel.
We agree with the reviewer that one could also interpret these directions as the principal directions associated to the covariance operators of the symmetrized kernels. This amounts to two separate KPCA problems with kernels arising respectively from the feature maps $x \mapsto \Sigma_\psi^{1/2} \phi(x)$ and $z \mapsto \Sigma_\phi^{1/2} \psi(z)$. In the dual, this corresponds to taking the SVDs of $G G^\top$ and $G^\top G$, which is equivalent to taking the SVD of $G$. We refer to this interpretation as 2KPCA herein. From a computational standpoint, performing 2KPCA or CCE yields the same singular vectors. However, it is crucial to note that our novelty lies in the CCE modelling, not merely in the computational aspect.
- In 2KPCA, one takes the principal components associated to kernels built via a complicated intertwining of $\phi$ and $\psi$. In CCE, the empirical covariances associated to both feature maps appear free from any other factor. In short, we are trading two separate eigenproblems of hard-to-interpret operators (where by design the eigenvalues are the same) for a generalized eigenvalue problem of easier-to-interpret operators.
- The coupling between the two input variables within the feature maps of 2KPCA is realized through the square root of the other covariance, while in CCE the coupling of the input variables naturally arises by crossing the learned directions, cf. Eq. (1).
- For the purpose of principal component extraction, we need to compute the **projections on the singular vectors $W_\phi$ and $W_\psi$ in $\mathcal{H}$**, as explained at the end of Sec 2.2. This is easily accomplished thanks to the CCE setting with explicit directions, while it is not as clear in 2KPCA. These projections are essential in downstream tasks to extract the principal components of test points.
- In light of the subsequent Nyström approximation, the CCE makes sense as it mirrors the adjoint eigenfunctions problem from Eq. (5).
- We believe that future work can be inspired by the CCE, especially by developing new variational formulations building upon it.
Therefore, while from the computational point of view solving 2KPCA is equivalent to solving CCE (one can always solve an SVD by symmetrization), our main point is that we propose a new asymmetric learning paradigm based on CCE for feature learning with asymmetric kernels, i.e., the CCE formulation in Eq. (1). As shown in the experiments, the proposed simple framework performs well in a wide array of tasks such as graph learning, biclustering, and general representation learning. A framework that handles asymmetric similarities and infinite-dimensional feature maps to capture highly nonlinear relationships in real-world data, and that scales to large datasets, constitutes a valuable addition for researchers and practitioners working with a variety of data types in their own application fields.
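As a concrete side check of the computational equivalence mentioned above, the following small numpy sketch (our own illustration, not code from the paper) verifies that the eigenvectors of $GG^\top$ and $G^\top G$ coincide, up to signs, with the left and right singular vectors of $G$, and that the shared eigenvalues are the squared singular values.
```python
# Illustration only: in the dual, 2KPCA (eigendecompositions of G G^T and G^T G)
# recovers the same vectors as the SVD of G, up to sign.
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((40, 25))             # asymmetric Gram matrix G_ij = <phi(x_i), psi(z_j)>
U, s, Vt = np.linalg.svd(G, full_matrices=False)

eval_l, evec_l = np.linalg.eigh(G @ G.T)      # "KPCA" on the left symmetrized kernel
eval_r, evec_r = np.linalg.eigh(G.T @ G)      # "KPCA" on the right symmetrized kernel
eval_l, evec_l = eval_l[::-1], evec_l[:, ::-1]    # eigh sorts ascending; flip to descending
eval_r, evec_r = eval_r[::-1], evec_r[:, ::-1]

r = 5
print(np.allclose(np.abs(np.diag(evec_l[:, :r].T @ U[:, :r])), 1.0))   # left vectors match up to sign
print(np.allclose(np.abs(np.diag(evec_r[:, :r].T @ Vt[:r].T)), 1.0))   # right vectors match up to sign
print(np.allclose(np.sqrt(eval_l[:r]), s[:r]))                         # eigenvalues are squared singular values
```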
### Novelty of the Nyström estimator
We argue that the proposed Nyström estimator is novel, and that previous attempts at establishing it suffer from limitations that are not present in our work.
- In the equation after Eq. (16) of [1], **the first equality employs the existing symmetric Nyström method to compute the eigenvector of a symmetric positive definite kernel** w.r.t. Eq. (8), as explained in the text above Eq. (16). Hence, the derivation in [1] still exploits the existing symmetric technique.
- We note that [1] deals with the **special case of a square asymmetric matrix**, i.e., $N=M$, as the CCA setup tackles two paired data sources. Their Nyström method thus **fails in the non-square case**, because the scalings associated with the two eigenfunctions are not properly handled, contrary to the derivation provided in our paper (see e.g. our Eq. (6)). To be specific, let $S\in \mathbb R^{N\times M}$. Following [1], consider $f_i, g_i$ as the respective eigenfunctions associated to the symmetric kernel operators $SS^\top$ and $S^\top S$. As explained in the text above [Equation (8) from [1]], one can apply the symmetric Nyström method using [Equation (5) from [2]] to the estimation of $f_i$ and $g_i$ performed in [Algorithm 1 from [1]]. However, we note that the scalings given by the symmetric Nyström method would respectively be $\sqrt{N}$ and $\sqrt{M}$ if $N \neq M$. In this scenario, the singular value $\sigma_i$ should be rescaled in the last equation of [Algorithm 1, output section, from [1]] in order to recover the exact SVD, as explained in [Lines 264-274, right column] of our paper. Neither [Algorithm 1 from [1]] nor the symmetric Nyström method [2] reveals the scaling problem, because the scaling between eigenfunctions and eigenvalues cancels out on the two sides of the integral equation associated to the symmetric kernel, as explained in Lines 670-678 of our paper.
- **Our asymmetric Nyström method encompasses general cases** on two sets of samples $\{x_1, \ldots, x_N\}, \{y_1,\ldots,y_M\}$, recovering the pair of adjoint eigenfunctions (Eq. (5) in our paper) correctly, so that the square matrix with $N=M$ is only a special case of our method, as outlined in Lines 685-686 of our paper. We note that properly handling the scalings is nontrivial for a correct approximation of the asymmetric kernel and has not yet been discussed in the literature.
Therefore, our contribution lies in providing a **formal derivation of the asymmetric Nyström method** showing how to jointly approximate both left and right singular vectors by starting from the pair of adjoint eigenfunctions regarding the continuous analogue of SVD associated to an asymmetric kernel operator [3-4]. Instead, [1] relies on the p.s.d. kernel and the existing Nyström method and only deals with the square case, without consideration of proper scalings for the general case. We thank the reviewer for mentioning [1], and will cite this work with appropriate discussion to detail our contributions. We are available to discuss any remaining questions you may have.
[1] Michaeli, T. Wang, W. and Livescu, K. Nonparametric canonical correlation analysis. ICML 2016.
[2] Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. NeurIPS 2000.
[3] Schmidt, E. Zur theorie der linearen und nichtlinearen integralgleichungen. Mathematische Annalen, 1907.
[4] Stewart, G. W. On the early history of the singular value decomposition. SIAM Review, 1993.
#### Previous response
Therefore, our contribution lies in that we provide a **formal derivation of the asymmetric Nyström method** on how to jointly approximate both left and right singular vectors by starting from the pair of adjoint eigenfunctions regarding the continuous analogue of SVD associated to an asymmetric kernel operator [3-4]. Though the formulations of the symmetric and asymmetric Nyström methods look similar, see e.g. [Equation (11-12) from our paper] versus [Equation (8) and Line 270-274 from our paper], there are more differences between those formulations than meets the eye:
- They both start from integral equations w.r.t. kernels, i.e., Eq. (10) and Eq. (5), but one relates a symmetric kernel to one set of eigenfunctions via a continuous analogue of the eigendecomposition, while the other relates an asymmetric kernel to two sets of adjoint eigenfunctions (singular functions) via a continuous analogue of the SVD.
- The SVD of an asymmetric kernel operator can be characterized by 2KPCA on the two induced symmetric kernel operators. However, in the finite-sample approximation to the pair of singular functions and its singular value, proper scalings are required in the Nyström estimator and the out-of-sample prediction for a precise correspondence to the SVD. More details on comparing the symmetric and asymmetric Nyström methods can be found in Appendix A.2.1 and A.2.2.
## Reviewer 4PjV (Rating: 6, Confidence: 2)
We thank the reviewer for the thoughtful comments and answer separately the key points below.
### Motivation, finite vs infinite dimensional
Part of the motivation comes from existing limitations in the literature surrounding kernel SVD, e.g., in [1] the feature maps have to be finite-dimensional. The point of Section 2 is to express the CCE problem in a dimension-agnostic context: $\mathcal{H}$ is an abstract Hilbert space that can be either finite- or infinite-dimensional. We do not claim that there is a need for infinite- or finite-dimensional feature maps, but rather that our approach encompasses both cases in a unifying framework, as opposed to previous work.
Moreover, the integral operator counterpart to SVD as stated in [Section 5, [2]] can deal with infinite-dimensional feature space, thus having a dimension-agnostic framework makes sense in our case.
Overall, the feasibility of working with an infinite-dimensional space contributes to extending AKSVD to a more powerful asymmetric kernel machine framework and justifies the asymmetric Nyström method for speeding up the computation of AKSVD.
### Relation to prior work:
We agree with the reviewer that there are many studies that consider infinite-dimensional feature maps. Not all of them are relevant for the considered asymmetric setting. Below we clarify the relationship between ours and the relevant cited works, and we plan on including the additional discussions in the paper.
[3] presents a perspective for understanding the dot-product attention kernel through a pair of Banach spaces. On this basis, it utilizes the technique of a reproducing kernel associated with two Banach spaces and formalizes its representer theorem through empirical risk minimization in the supervised learning setting. The focus of [3] is thus narrowed to understanding the dot-product attention in Transformers as a kernel machine, and one of their findings is that the feature maps corresponding to the softmax attention kernel can be derived as infinite-dimensional, as stated in Proposition 1 and Remark 7 of [3]. While this approach is valuable in their case, we do not adopt the RKBS formalism in our paper as it is not needed for our setting.
Thus the main focuses of our work and [3] are different, and so are the roles of the infinite-dimensionality of the feature maps (RKHS vs RKBS). While infinite-dimensional feature maps are common to all kernel methods, [3] only applies the resulting kernel between queries and keys to model softmax attention in Transformers. Our paper derives a new asymmetric learning paradigm for unsupervised feature learning based on the coupled covariances eigenproblem (CCE), where we employ two covariance operators $\Sigma_\phi, \Sigma_\psi \in \mathcal{L}(\mathcal{H})$ on any two sets $\mathcal{X}, \mathcal{Z}$. Little of the literature addresses learning with generic asymmetric kernel machines with infinite-dimensional feature maps; e.g., the formulations in [4] are derived with finite-dimensional feature spaces $\mathcal{H} = \mathbb{R}^d$.
The relation to [4] is provided in Appendix A.1, i.e., Lines 585-588. We will enrich this discussion in the final version for further clarity as follows.
- **Task and derivations**. The problem considered in [4] is *supervised*, as they introduce asymmetric kernels to the least-squares support vector machine classifier. Our work is *unsupervised* and considers an asymmetric similarity in kernel SVD for feature learning across a wide range of downstream tasks. Importantly, we *derive a new asymmetric learning framework* based on the proposed coupled covariances eigenproblem (CCE) for feature learning with asymmetric kernels. We introduce the CCE formulation, derive the Nyström method for asymmetric kernels, and conduct experiments on various tasks, including node representation learning, biclustering, and general data analysis. Such a derivation is new and absent in [4].
- **Setups and asymmetry**. Our work introduces a new paradigm for asymmetric learning based on a coupled covariances eigenproblem (CCE) on two sets of samples $\{x_i\}_{i=1}^n, \{z_j\}_{j=1}^m$. This is in stark contrast with [4], which is confined to a single set. Accordingly, the asymmetry in [4] only comes from the choice of the asymmetric kernel function, while our asymmetry also comes from jointly handling two different sets.
- **Kernel functions**. Our contribution is not the introduction of asymmetric similarities, which long exist; e.g., the SNE kernel was already used in [5]. Our key contributions are an asymmetric learning paradigm through the coupled covariances eigenproblem (CCE) and the asymmetric Nyström method, connecting to the underlying integral equations with an asymmetric kernel and the finite sample approximation to the adjoint pair of eigenfunctions.
References
[1] Suykens, J. A. SVD revisited: A new variational principle, compatible feature maps and nonlinear extensions. Applied and Computational Harmonic Analysis, 2016.
[2] Stewart, G. W. On the early history of the singular value decomposition. SIAM Review, 1993.
[3] Wright, M. A. and Gonzalez, J. E. Transformers are deep infinite-dimensional non-mercer binary kernel machines. arXiv, 2021.
[4] He, M., He, F., Shi, L., Huang, X., and Suykens, J. A. K. Learning with asymmetric kernels: Least squares and feature interpretation. IEEE TPAMI, 2023.
[5] Hinton, G. E. and Roweis, S. Stochastic neighbor embedding. NeurIPS 2002.
### Typos
We appreciate the reviewer's careful reading. The typos have been corrected.
## Reviewer Lf5K (Rating: 6, Confidence: 4)
- [asymmetric kernel in the context of graphs]
- [comparisons with AAAI 2022 Directed Graph Auto-encoder, and SOTA approaches]
- [setups in graph reconstruction, link prediction]
- [rsvd]
### Level of complexity [NOT INCLUDED]:
- Section 2.2 establishes a correspondence between the solutions to the CCE and the SVD of the asymmetric kernel matrix. It translates this correspondence into formal mathematical language and gives insights about the involved objects, e.g., the relationship between the matrix $G$ and the operators $\Gamma_\phi$, $\Gamma_\psi$ from Proposition 2.2. We believe that this is not a repeat of previously established facts but rather explanations that help in understanding our method.
- Section 3 is dedicated to the thorough derivation of the asymmetric Nyström method, as there are more differences between the symmetric and asymmetric formulations than meets the eye. This is a key part of our work, and we believe that it should be placed in the main part of the paper.
### Spaces superscripts:
We have adopted the notation that a collection of $n$ elements $\{x_i\}_{i=1}^n$, where each element $x_i$ belongs to a space $\mathcal{X}$, belongs to the Cartesian product $\mathcal{X}^n$.
### Centering in feature space
Although we cannot find a non-trivial feature map that satisfies the centering property for all finite sets of points, given some points $\{x_i\}_{i=1}^n$ one can always center the feature map by considering the translated feature map
$\tilde{\phi}(x) = \phi(x) - \frac{1}{n} \sum_{i=1}^n \phi(x_i)$. The same holds true for the $\{z_j\}_{j=1}^m$,
and the corresponding similarity matrix of interest $[\tilde{G}]_{ij} = \langle \tilde{\phi}(x_i) , \tilde{\psi}(z_j)\rangle$ can be computed straightforwardly, e.g., $\tilde{G} = (I_n-\frac{1}{n}\boldsymbol 1_n\boldsymbol 1_n^\top)G(I_m-\frac{1}{m}\boldsymbol 1_m\boldsymbol 1_m^\top)$.
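For concreteness, here is a minimal numpy sketch of this double-centering (our own illustration; the helper name `center_gram` is ours and not from the paper).
```python
# Center the asymmetric Gram matrix, i.e., compute (I_n - 1/n 11^T) G (I_m - 1/m 11^T),
# without forming the projection matrices explicitly.
import numpy as np

def center_gram(G: np.ndarray) -> np.ndarray:
    Gc = G - G.mean(axis=0, keepdims=True)      # subtract column means: centers phi(x_i)
    return Gc - Gc.mean(axis=1, keepdims=True)  # subtract row means: centers psi(z_j)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
n, m = G.shape
P_n = np.eye(n) - np.ones((n, n)) / n
P_m = np.eye(m) - np.ones((m, m)) / m
assert np.allclose(center_gram(G), P_n @ G @ P_m)   # matches the explicit formula above
```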
### Graph reconstruction setups
Our method is designed for unsupervised data analysis. Accordingly, in Sec. 4.1, we aim to empirically evaluate our asymmetric learning in unsupervised node representation learning, i.e., by learning embeddings of nodes from graph topology alone. Central to our evaluation is the notion that adjacent nodes in the original graph should be proximate in the embedding space. With this in mind, the graph reconstruction task aims to reconstruct the graph's adjacency matrix from the node embeddings. As it involves predicting the existence of edges in the graph, graph reconstruction implies link prediction. We therefore evaluate the performance of our method on reconstructing the graph's adjacency matrix in the following way. For each node, we first rank all other nodes by their Euclidean distance to it among the embedding vectors. Afterwards, we reconstruct the connections of each node $v$ with outdegree $k$ by performing a $k$-nearest-neighbor search in embedding space, as sketched in the snippet below. Overall, the graph reconstruction task evaluates the ability of our method to preserve the structural properties of the graph in the embedding space.
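The sketch below is a hedged, simplified illustration of this reconstruction step (names such as `reconstruct_edges` are ours and not from the paper's code).
```python
# For each node, predict its outgoing edges as its k nearest neighbors in embedding
# space, where k equals the node's outdegree in the original graph.
import numpy as np

def reconstruct_edges(emb: np.ndarray, outdegree: np.ndarray) -> list[set[int]]:
    """emb: (num_nodes, d) node embeddings; outdegree[v]: number of edges to predict for node v."""
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)  # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)                                     # exclude the node itself
    predicted = []
    for v in range(emb.shape[0]):
        k = int(outdegree[v])
        nearest = np.argsort(dists[v])[:k]      # k closest nodes in embedding space
        predicted.append(set(nearest.tolist()))
    return predicted
```
The predicted edge sets can then be compared against the ground-truth adjacency to quantify how well the embeddings preserve the graph structure.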
### Comparative experiments
To address the reviewer's concern, we have conducted additional comparisons of our AKSVD applied to directed graphs against Directed Graph Auto-Encoders (DiGAE) [1] from AAAI 2022. The hyperparameters $\alpha, \beta$ in [1] are selected in the range suggested by the authors in their paper. The settings and datasets are the ones from Section 4.1 in the manuscript. In Tables A and B below, we report the performance results for the node embedding and the graph reconstruction tasks, respectively. For completeness, we also add the comparison with the node embedding algorithms DeepWalk [2] and HOPE [3].
Table A. Node embedding downstream task performance. Higher is better.
| Dataset | F1 Score | AKSVD | DiGAE | DeepWalk | HOPE |
|-------------|----------|--------|--------|----------|--------|
| Cora | Micro | 0.792 | 0.783 | 0.741 | 0.750 |
| | Macro | 0.784 | 0.776 | 0.736 | 0.473 |
| Citeseer | Micro | 0.678 | 0.663 | 0.624 | 0.642 |
| | Macro | 0.640 | 0.627 | 0.587 | 0.607 |
| Pubmed | Micro | 0.773 | 0.781 | 0.759 | 0.771 |
| | Macro | 0.743 | 0.749 | 0.737 | 0.741 |
| TwitchPT | Micro | 0.712 | 0.633 | 0.637 | 0.685 |
| | Macro | 0.596 | 0.593 | 0.589 | 0.568 |
| BlogCatalog | Micro | 0.710 | 0.697 | 0.688 | 0.704 |
| | Macro | 0.703 | 0.690 | 0.679 | 0.697 |
Table B. Graph reconstruction downstream task performance. Lower is better.
| Dataset | Norm | AKSVD | DiGAE | DeepWalk | HOPE |
|-------------|---------|-------|-------|----------|-------|
| Cora | $\ell_1$ | 14.0 | 26.0 | 19.0 | 15.0 |
| | $\ell_2$ | 17.4 | 20.87 | 17.6 | 18.1 |
| Citeseer | $\ell_1$ | 25.0 | 25.0 | 25.0 | 26.0 |
| | $\ell_2$ | 14.3 | 16.41 | 14.4 | 13.3 |
| Pubmed | $\ell_1$ | 170.0 | 171.0 | 171.0 | 171.0 |
| | $\ell_2$ | 23.8 | 27.89 | 19.4 | 23.8 |
| TwitchPT    | $\ell_1$ | 756.0 | 759.0 | 864.0    | 1108.0 |
| | $\ell_2$ | 140.3 | 79.71 | 146.5 | 158.2 |
| BlogCatalog | $\ell_1$ | 764.0 | 771.0 | 810.0    | 3709.0 |
| | $\ell_2$ | 94.2 | 202.58| 104.0 | 286.7 |
Our method demonstrates comparable or superior results to DiGAE, without relying on graph convolutional layers. Instead, the use of asymmetric kernels within the coupled covariances eigenproblem (CCE) framework allows for simple and efficient representation learning across tasks.
While DiGAE, HOPE, and DeepWalk are designed specifically for graphs, our proposed method is generally applicable to any two sets of samples $\{x_i\}_{i=1}^n, \{z_j\}_{j=1}^m$. Nevertheless, the simpler AKSVD shows competitive performance, demonstrating great potential in representation learning for directed graphs.
Graph data is just one data type our method can be applied to. In fact, we run comprehensive evaluations across various tasks, encompassing biclustering in Sec. 4.2 and downstream classification/regression on general data in Sec. 4.3. This broader applicability underscores the versatility of our approach compared to task-specific methods like DiGAE [1], rendering it suitable for a wide array of applications.
References
[1] Kollias, Georgios, et al. "Directed graph auto-encoders." Proceedings of the AAAI conference on artificial intelligence. Vol. 36. No. 7. 2022.
[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
[3] Ou, Mingdong, et al. "Asymmetric transitivity preserving graph embedding." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.
### Asymmetric Nyström stopping criterion
Our method is primarily aimed at feature learning through the proposed coupled covariances eigenproblem. In this context, it is much more important to well approximate the first components (those associated with the largest singular values) than the last components (those with the lowest singular values). This is because the former capture the most variance in the data, which is crucial for effective feature learning.
Truncated Singular Value Decomposition (TSVD) is typically used to approximate the singular vectors with uniform accuracy across all singular vectors, which makes it an excellent tool for SVD computation in numerical linear algebra. However, in our work, we use TSVD at machine precision only as a reference against which we compute the similarity of the singular vectors estimated by the other methods, rather than as a method for approximating the SVD itself.
Finally, we agree that if one’s goal is to find a good approximation to the SVD across all singular vectors, one would prefer using TSVD for the best accuracy or RSVD for a faster approximation. However, our focus is on using the singular vectors to compute embeddings for downstream learning tasks, where the first few components that capture the most variance are the most important. We will clarify these points in the main body.
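For illustration, the following sketch (our own, not the paper's evaluation script; it uses sklearn's randomized SVD as a stand-in for a faster approximate method) shows the kind of per-component similarity check described above, with a machine-precision SVD serving only as a reference.
```python
# Illustration only: score an approximate SVD against a machine-precision reference,
# component by component, via the absolute cosine similarity of the singular vectors.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
G = rng.standard_normal((2000, 500)) * np.logspace(0, -3, 500)   # matrix with decaying spectrum
r = 10

U_ref, s_ref, Vt_ref = np.linalg.svd(G, full_matrices=False)            # reference (machine precision)
U_fast, s_fast, Vt_fast = randomized_svd(G, n_components=r, random_state=0)

cos = np.abs(np.sum(U_fast * U_ref[:, :r], axis=0))   # per-component agreement with the reference
print(np.round(cos, 4))
```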
## Reviewer Vbys (Rating: 7, Confidence: 4)
We thank the reviewer for the positive feedback and helpful suggestions. We provide the responses to the key points separately as follows.
### Real-world datasets:
The datasets in our experiments, in both the main manuscript and the appendix, are all real-world datasets, as introduced in Appendix C, e.g., Tables 7-9. For instance, regarding the graph data in the main paper, _BlogCatalog_ is a social blog directory which manages bloggers and their blogs, where the nodes form a dictionary of all bloggers and the edges denote the friendship network among the bloggers; _PubMed_ comprises the citations for biomedical literature from MEDLINE, life science journals, and online books, where citations can include links to full-text content from PubMed Central and publisher web sites. In the large-scale evaluations with millions of samples and features in Appendix B.2, _Criteo_ contains feature values and click feedback for millions of display ads, e.g., each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked on or not.
### Limitations:
We agree with the reviewer. We will broaden the discussion on the limitations of the proposed method for possible future research. Below, we provide a takeaway summary.
- In Sec. 4.3, for general data, AKSVD maintains comparable or better performance than symmetric kernel methods, but the need for exploring asymmetry is less direct than in Sec. 4.1 for directed graph data, which physically carries asymmetry. It is reasonable that the improvements of AKSVD are problem-dependent, as not all datasets contain asymmetry that can be exploited to help downstream tasks. If KSVD yields little improvement over KPCA, this can also be regarded as a form of knowledge discovery, giving a more in-depth understanding of the data (i.e., whether the additional data source provides more information than the symmetric kernel case). In future work, it would be interesting to develop metrics that unveil how much the additional exploration through asymmetric feature learning can improve over the symmetric case.
- Compared to the rich literature on the existing (symmetric) Nyström method, more sophisticated sampling techniques for both rows and columns are yet to be developed for the asymmetric Nyström method, beyond the uniform random sampling that already performs satisfactorily in our work.
- Although CCE provides a solid learning scheme and interpretations in terms of both asymmetric kernel operators and covariance operators, the model capacity can be further improved for more complex learning tasks, for example by integrating (graph) neural networks in future work, e.g., utilizing NNs as explicit feature mappings within the CCE optimization.