## Summary
The GW problem is important and challenging: it is a non-convex quadratic objective over the set of n x n bistochastic matrices (the formulation is recalled after the list below). This paper proposes a provably convergent (to a local minimum) single-loop approach, which could be of great practical interest, because single-loop algorithms are easier to tune (no inner loop whose number of iterations or tolerance must be set). The non-asymptotic convergence theory that is provided is original, but there are some limitations:
- The authors mention at the end of their introduction, in Table 1, and elsewhere in the paper that their algorithm does not always return a feasible solution for GW (i.e., a bistochastic matrix).
- The authors mention in R.3.7 that the algorithm used in the experiments is not provably convergent (the theory supports a symmetric Bregman divergence, i.e., L2; their implementation uses KL). The authors show speed-ups as well as comparable or better accuracy relative to other approaches on graph alignment/partitioning tasks.
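For reference, in the discrete setting discussed above (notation ours, not the paper's), the problem in question is

$$
\min_{T \in \mathcal{B}_n} \; \sum_{i,j,k,l} \bigl(A_{ik} - B_{jl}\bigr)^2 \, T_{ij} \, T_{kl},
\qquad
\mathcal{B}_n = \bigl\{ T \in \mathbb{R}_{\geq 0}^{n \times n} : T\mathbf{1} = \mathbf{1},\; T^{\top}\mathbf{1} = \mathbf{1} \bigr\},
$$

where $A$ and $B$ are the two similarity (e.g., adjacency) matrices and $\mathcal{B}_n$ is the Birkhoff polytope of n x n bistochastic matrices; the objective is an indefinite quadratic in $T$, hence the non-convexity.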
## Process
The paper was supported by all reviewers during the rebuttal phase. After the rebuttal phase, the AC and reviewers discussed its shortcomings. This led two reviewers (h6kQ and Jipi) to lower their grades. Because the system was locked, this is only reflected in messages to the authors ("Unfortunately have to lower my score", "Lower my score due to my misunderstanding"). Reviewer h6kQ communicated a new grade of 5 to me. One reviewer participated in a discussion with the others but then stopped participating; another reviewer did not join the discussion after the rebuttal period. Some of the concerns were then discussed with the authors at a later date.
The AC and SAC now acknowledge that these discussions came late in the review process, and therefore did not leave the authors enough opportunity to effectively amend their paper and address the late concerns.
## Summary of shortcomings discussed (with and without authors)
- The "provable" algorithm featured in {the title, the introduction, Table 1} is Eq.5 uses a sym. Bregman div. (a.k.a. L2). The algorithm used in experiments uses KL projections (Eq. 2). The authors argued during a discussion that these 2 algorithms were similar enough to conflate them, and use a single name ("BAPG"). I disagree. Projections onto the simplex are crucial, here they differ significantly. While modifying the title could be easily fixed (e.g. "On the convergence of single-loop relaxations of GW for graph data"), significant parts of the paper need to be rewritten to lift that ambiguity.
- The feasibility issue raised several questions from the reviewers and the AC. For instance, the only numerical evidence of (in)feasibility was Figure 2b, which the authors reported to be wrong. In response, the authors argued that their contribution should be understood as a relaxation of GW tailored to graph data. Unfortunately, the paper is not written that way: the introduction mentions GW 13 times, lists shortcomings of existing GW approaches, and follows with "To bridge the above theoretical-computational gap, we propose the first provable single-loop algorithm for the GW distance computation, which we call ...". Yet, a few sentences later, the authors write "Nonetheless, the iterates generated by BAPG do not necessarily satisfy the Birkhoff polytope constraint". This contradiction runs through the paper: most comparisons are against feasible solvers.
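To make the L2/KL distinction concrete, here is a minimal sketch (our illustration, not the paper's code) of the two projections onto the probability simplex underlying the two updates: the L2 (symmetric Bregman) projection can set entries exactly to zero, while the KL projection is a multiplicative rescaling that never changes the support.

```python
import numpy as np

def l2_projection_simplex(v):
    """Euclidean (L2) projection of v onto the probability simplex,
    via the classical sort-based algorithm. Can zero out entries."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    ind = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - cssv / ind > 0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def kl_projection_simplex(v):
    """KL (Bregman) projection of a positive vector onto the simplex:
    a rescaling that preserves the support of v."""
    return v / v.sum()

v = np.array([1.2, 0.3, 0.05])
print(l2_projection_simplex(v))  # [0.95, 0.05, 0.]: exact zeros possible
print(kl_projection_simplex(v))  # strictly positive, same support as v
```

The two maps behave qualitatively differently (sparsity in particular), which is why presenting them as a single algorithm is problematic.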
Comments on experiments:
- A hyperparameter search is mentioned in the paper, but not in the code, where hyperparameters are hardcoded. This is unreliable, given the amount of copy-pasted code.
- There are no results for eBPG on the "heat kernel" in Table 4. The justification that is given (whether the kernel is sparse or dense) is unclear.
- BAPG (PyTorch) is compared to eBPG as implemented in POT (rather than reimplemented in PyTorch). The authors write: "For eBPG, we use the official implementation in the Python OT package, which supports running on GPU." The POT documentation reads: "This function is backend-compatible and will work on arrays from all compatible backends. But the algorithm uses the C++ CPU backend which can lead to copy overhead on GPU arrays." (See the sketch after this list.)
- In the "3-D space" of {accuracy / speed / feasibility} the paper uses a partial view. A unified perspective is lacking. For instance, this could be corrected by adding the infeasibility of all solutions that are reported, along with time/accuracy.
## Current situation
~~Given the late discussions, it was agreed between the AC, SAC and two PCs that the paper will proceed exceptionally as a “conditional accept”. This will give the authors a few weeks to thoroughly address the major concerns above, and to some extent the minor ones. Importantly, the authors should commit to updating the PDF following the last discussions. Then the SAC will personally reevaluate the paper to reach a final decision. We hope this exceptional process will be judged as fair by the authors.~~

Given the reviewer feedback and late discussions, it was agreed between the AC, SAC and two PCs that the paper will proceed as an "accept" at this point. The authors agreed to address the issues above but did not have the opportunity to incorporate them during the rebuttal phase. With these changes, the paper provides value to the research community on GW and graph learning.
The AC and SAC