# ICLR 23 rebuttal - RFF
## Reviewer KzPf
Thank you very much for the critical and insightful comments!
We first address the following major concerns, followed by responses to other comments.
1. Comparison to [CP17].
2. Highlights of our novelty compared with the previous works.
### Comparison to [CP17]
> This paper basically improves the work of (Chen and Phillps, 2017), by generalizing the results to general shift-invariant kernels. However, importantly, the proposed result is worse than that of (Chen and Phillps, 2017), which only focuses on the Gaussian kernel. This makes the contributions weak.
First of all, we do not agree that [CP17] achieves a better result for the Gaussian case; as far as we understand, their result is not directly comparable to ours. Their main error bound is Theorem 7, but it only applies when $\|x - y\| \geq \sigma$, where $\sigma$ is the bandwidth parameter of the Gaussian kernel. Note that this condition is not stated in the theorem itself, but it is used and mentioned in the line immediately above it.
For the complementary case $\|x - y\| < \sigma$, their Theorem 14 gives a related bound, but the guarantee is quite different: the target dimension depends linearly on the input dimension $d$. Hence, when $d$ is large (e.g., $d = \log^2 n$), their Theorem 14 is worse than our bound (for the case $\|x - y\| < \sigma$).
In addition, we have to mention a subtle technical issue in [CP17]. Their Theorem 7 (the main upper bound) crucially relies on a moment-generating-function bound in Lemma 5, but we find technical issues in the proof of that lemma (in their appendix). Specifically, the term $\mathbb{E}[e^{-s \frac{1}{2} \omega^2 \|\Delta\|^2 }]$ in the last line above "But" (on page 17) should actually be $\mathbb{E}[e^{s \frac{1}{2} \omega^2 \|\Delta\|^2}]$. Even after fixing this sign, one can only obtain the weaker bound $\ln M(s) \leq \frac{s^2}{4} \|\Delta\|^4 + s \|\Delta\|^2$ in the very last step, since the term $-s \|\Delta\|^2$ is negated accordingly. Hence, it is not clear that the bound claimed in Theorem 7 can still be obtained.
Finally, we note that our analysis encounters a similar technical difficulty to the one mentioned above. To overcome it, we carry out a more careful analysis that handles every moment individually, instead of using a collective upper bound on $\ln M(s)$ for the moment generating function $M(s)$. This is a significant deviation from the techniques of [CP17].
We will incorporate this discussion into our next version.
### Novelty Compared with Previous Works
> The notion of the kernel distance is not new and many technical tools are borrowed from previous works (Chen and Phillps, 2017, Makarychev et al., 2019), hence I feel that the novelty is limited.
As discussed in our response above, our techniques deviate significantly from [CP17]. Result-wise, our bound not only holds for general $x, y$ (without the condition $\|x - y\| \geq \sigma$ required by [CP17]), but also applies to any analytic shift-invariant kernel (not only the Gaussian kernel as in [CP17]). Technically, these improvements come from a much more involved moment-by-moment analysis, as opposed to the somewhat ad-hoc (and likely problematic) collective analysis of the moment generating function $M(s)$ in [CP17].
Our clustering result does build heavily on [MMR19], but it is an application of our dimension-reduction bound and is not among our main results.
Our other results, i.e., the lower bounds and the new dimension reduction for Laplacian kernels, are both conceptually and technically new. In particular, the Laplacian kernel result turns an inefficient/existential embedding from $\ell_1$ into squared $\ell_2$ into an efficient algorithm, by applying pseudorandom generators to a carefully built embedding procedure; this new algorithmic embedding from $\ell_1$ into squared $\ell_2$ may be of independent interest from the perspective of algorithm design and metric embedding.
### Other issues
> The results of the upper bound and lower bound have a different notion, i.e., the magnitude-resolution ratio only appears in the lower bound. This makes it hard to compare those bounds. Can the upper bound involve in this ratio?
The short answer is that the upper and lower bounds do match in terms of the resolution ratio; the ratio does not appear in some of our upper bounds only because those bounds are stronger (they do not need to depend on the ratio). We highlight the relation in the following.
1. Theorem 1.2 gives an upper bound for *analytic* (shift-invariant) kernels, and the bound is *oblivious* to the resolution ratio, which is *stronger* than a bound that depends on the ratio.
2. The lower bound (Theorem 1.1) concerns non-analytic kernels, which is the *complement* of the setting of Theorem 1.2, and it is also stronger than "needed": it says the target dimension is not only unbounded in general, but must grow linearly with the resolution ratio.
3. Finally, Theorem 1.4 shows that if we go beyond RFF (recall that both Theorem 1.1 and Theorem 1.2 concern only RFF), so that the lower bound of Theorem 1.1 no longer applies, then the target dimension need not depend on the resolution ratio at all, although our running time still depends on it logarithmically.
> Experiments only show the relative errors on the synthetic datasets. Since the RFF has been used in many practical applications, I think more experiments should be addressed. For example, showing the k-means clustering results and comparing the Nystrom method under real-world datasets would be nice.
For kernel $k$-means, comparisons between Nystrom/SVD and RFF have been done in previous works (e.g., [Wang, Gittens and Mahoney, JMLR 2019]), on datasets such as MNIST. The conclusion was that Nystrom/SVD performs slightly better than RFF, but not by much.
In the next version, we will try to add an experiment on kernel nearest-neighbor search. This is another important downstream application, different from kernel $k$-means in that it relies on pairwise distances (rather than a collective guarantee as in clustering). We expect RFF to significantly outperform other methods there.
> It seems that the inequality in (2) is wrong.
Indeed, the RHS of (2) is hard to parse, and it is not even obvious that the RHS is less than $1$. For this reason, we provided an explanation in the paragraph below the statement of Theorem 4.1, showing why the RHS is always at most $1$. We also discussed the typical parameter setting under which inequality (2) is nontrivial (i.e., the RHS is bounded away from $0$).
We hope this resolves your concern. If not, could you please elaborate on why you think the inequality is wrong?
> It would be good to have an independent section for the "Going beyond RFF" part.
We agree, and we will try to move it from the appendix to the main text in the next version.
> Theorem 3.1 and 1.2 (which informally states Theorem 3.1) are basically the same. There is no reason to state the same theorem twice, which can be a waste of space.
> It would be better to use the parenthesis for citing references.
Thanks for the suggestions; we will implement both in the next version.
## Reviewer vUht
We thank the reviewer for the very inspiring and insightful comments! We will respond to your comments following your flow of logic. Due to the word limit, we have to separate the responses into two threads.
### RFF gets relative error for analytic shift-invariant kernels
> I suspect there is a very nice corollary to this relative error result, where an epsilon-net argument can give a whole subspace embedding via RFF; proving something akin to the results in [Avron et al, 2017] but with standard RFF sampling and with the regularization parameter set to zero. I'd like to know if the authors ever considered such a guarantee?
Thanks for the suggestion! It is indeed very interesting to ask whether a subspace embedding can be obtained from our relative distance-error bound.
However, we suspect that the subspace embedding you mention cannot be obtained in general. We list some evidence below.
1. [Avron et al., 2017] actually give a lower bound on the target dimension of RFF (their Theorem 8). At a quick look, they seem to claim that $\lambda = 0$ requires an unbounded target dimension for RFF.
2. Also, the epsilon-net approach you mention seems to require linearity of the embedding (one represents a point as a linear combination of net points and then applies the embedding), but RFF is unfortunately not a linear map; see the small check after this list.
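To illustrate the non-linearity, here is a minimal sketch in our own notation (not taken from the paper): the standard RFF map $z(x) = \sqrt{2/D}\,\cos(\Omega x + b)$ does not satisfy $z(x + y) = z(x) + z(y)$, so a guarantee on net points cannot be transferred to arbitrary points by linearity.

```python
import numpy as np

# Standard RFF map for the Gaussian kernel exp(-||u - v||^2 / 2):
# Omega has rows omega ~ N(0, I), b ~ Uniform[0, 2*pi), z(x) = sqrt(2/D) * cos(Omega x + b).
rng = np.random.default_rng(0)
d, D = 5, 64
Omega = rng.normal(size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def rff(x):
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(np.linalg.norm(rff(x + y) - (rff(x) + rff(y))))  # far from 0: the map is not linear
```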
> I don't know why the authors chose to prove the concentration argument underlying this result on page 6 of the paper.
We agree that some of the calculations in that proof are not very interesting. In the next version, we will omit those details from the main text, turn this part into a proof sketch, and try to move the important parts of the third result back into the main text.
### RFF is incapable of relative error embedding for non-analytic kernels
> Somewhat frustratingly, in Remark 4.1 at the bottom of page 7, the authors intuitively argue why this ratio is bad for Laplacian kernels and good for analytic kernels, but do not include a formal proof. Their description makes it sound pretty easy to prove, but it really should be formalized in the appendix. A concern I have is about the hardness of other non-analytic kernels. Without that proof in the appendix, it's not immediately clear to me how many other non-analytic kernels will suffer the same hardness condition. Can we state an condition on (e.g.) the Laurent coefficients of a kernel that implies RFF is incapable of giving relative error?
We will add a more detailed analysis of the claims made in Remark 4.1 in the next version.
In fact, there are simple sufficient conditions to assert that RFF cannot preserve the relative error for certain types of kernel functions, including Laplacian kernels.
For simplicity, let us assume the input dimension is $1$, so $K : \mathbb{R} \to \mathbb{R}$. Further assume $\Delta = 1$ and $\rho < 1$; then the $(\Delta, \rho)$-bounded property simply requires $\rho \leq |x| \leq 1$. We claim that if $K$'s first derivative at the origin is non-zero, then RFF cannot preserve relative error for such $K$.
To see this, we take the Taylor expansion of $K$ at the origin and keep only the degree-one approximation, i.e., $K(x) \approx 1 + ax$ with $a = K'(0)$ (noting that $|x| \leq 1$, so this is a good approximation). Then
$$
s_K(x) = \frac{1 + 1 + 2ax - 2(1 + ax)^2}{2a^2x^2}=-1-\frac{1}{ax}
$$
So if $a = K'(0) \neq 0$, then for sufficiently small $\rho$ there is an $x$ with $|x| = \rho$ for which $s_K(x) = \Omega(1 / \rho)$. This in particular implies the claim in Theorem 1.1 for Laplacian kernels.
Indeed, this condition is in the spirit of what you suggested: one can simply look at the coefficients of the Taylor expansion of the kernel at the origin.
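To make the intuition concrete, here is a minimal Monte Carlo sketch (our own illustration, not from the paper). For this illustration we interpret the relevant ratio, up to constant factors, as the relative variance $\mathrm{Var}[Z]/\mathbb{E}[Z]^2$ of a single-feature RFF estimate $Z$ of the squared kernel distance (this interpretation is our assumption). The sketch shows this ratio growing like $1/|x|$ for the Laplacian kernel (whose spectral density is Cauchy) while staying bounded for the Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # Monte Carlo samples of (omega, b)

def rel_variance(sample_omega, delta):
    # Single-feature RFF estimate of the squared kernel distance in 1D:
    # Z = (sqrt(2)cos(w*x+b) - sqrt(2)cos(w*y+b))^2, with x - y = delta.
    w = sample_omega(N)
    b = rng.uniform(0.0, 2 * np.pi, N)
    Z = 2.0 * (np.cos(w * delta + b) - np.cos(b)) ** 2
    return Z.var() / Z.mean() ** 2

cauchy = lambda n: rng.standard_cauchy(n)   # spectral density of the Laplacian kernel e^{-|t|}
normal = lambda n: rng.standard_normal(n)   # spectral density of the Gaussian kernel e^{-t^2/2}

for delta in [1.0, 0.1, 0.01, 0.001]:
    print(delta, rel_variance(cauchy, delta), rel_variance(normal, delta))
# Laplacian column grows roughly like 1/delta (up to Monte Carlo noise);
# Gaussian column stays O(1).
```

By a standard Chebyshev argument, a bounded ratio means averaging $O(\epsilon^{-2})$ features suffices for relative error $\epsilon$, whereas a ratio growing like $1/\rho$ forces the number of features to grow with the resolution ratio, which is the qualitative behavior stated in Theorem 1.1.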
### Pseudorandom generators and fancier ways to make JL-style guarantees for the Laplacian kernel
> This result is not really described within the body of the paper.
Thanks for appreciating our result! In the next version, we will follow your suggestion to shorten other less important details, and try to include at least a proof sketch of the third contribution in the main text.
Here is a brief overview of the high-level idea. We try to "reduce" to the Gaussian kernel case, whose embedding can be obtained using RFF. While it is possible to isometrically embed $\ell_1$ into squared $\ell_2$, the target dimension can be arbitrarily high, and it is unclear whether it can be reduced. We bypass this issue by applying pseudorandom generators to a carefully designed embedding into squared $\ell_2$ (which still has high dimension but is much more "structured"), so that the final RFF applied to the (high-dimensional but structured) input for the Gaussian kernel can be evaluated/simulated efficiently. A toy sketch of the reduction (without the PRG step) is given below.
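For intuition only, here is a minimal and deliberately inefficient sketch of the reduction, with made-up parameters (our own illustration): each coordinate is discretized and mapped to a scaled 0/1 "prefix" vector, so that squared $\ell_2$ distances approximate $\ell_1$ distances, and Gaussian RFF is then applied to the embedded points to estimate the Laplacian kernel. The sketch materializes the high-dimensional embedding explicitly, which is exactly the inefficiency that the PRG step in our construction is designed to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)

def prefix_embed(x, lo=-5.0, hi=5.0, res=0.01):
    # Per-coordinate "staircase" embedding of l1 into squared l2:
    # ||phi(x) - phi(y)||_2^2 ~= ||x - y||_1, up to the grid resolution.
    grid = np.arange(lo, hi, res)
    return np.sqrt(res) * (x[:, None] > grid[None, :]).astype(float).ravel()

d, D = 3, 2000                      # toy input dimension and number of random features
x, y = rng.uniform(-2, 2, d), rng.uniform(-2, 2, d)
u, v = prefix_embed(x), prefix_embed(y)

# Gaussian RFF for the kernel exp(-||u - v||_2^2): omega ~ N(0, 2I), b ~ U[0, 2pi).
m = u.shape[0]
Omega = rng.normal(0.0, np.sqrt(2.0), size=(m, D))
b = rng.uniform(0.0, 2 * np.pi, size=D)
z = lambda w: np.sqrt(2.0 / D) * np.cos(w @ Omega + b)

print(np.exp(-np.abs(x - y).sum()))  # Laplacian kernel value
print(z(u) @ z(v))                   # RFF estimate via the embedding
# The two values agree up to Monte Carlo and discretization error.
```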
### Other comments
> Section 3.2 on page 7 is a short blurb about improved sample complexity for kernel clustering. The result is neat, I guess, but is not well contextualized within the world of prior work. They also call their metric an $\ell_p$-objective, where the norm really is $\|\cdot\|_2^p$, which would not usually be called an $\ell_p$-norm. Pick a new name for it, maybe call it an $\ell_2^p$-objective?
Indeed, our terminology could cause confusion. By an $\ell_p$ objective we mean that the aggregation of the clustering cost is $\ell_p$-like, i.e., the sum of the $p$-th powers of the distances from the data points to their centers (so $p = 1$ is $k$-median and $p = 2$ is $k$-means); see the display below. We will clarify this in the next version.
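Concretely, in the notation of this reply (with $X$ the data set, $C$ the set of centers, and $\mathrm{dist}$ the kernel distance in the feature space), the objective we have in mind is
$$
\mathrm{cost}_p(X, C) = \sum_{x \in X} \min_{c \in C} \mathrm{dist}(x, c)^p ,
$$
so $p = 1$ recovers $k$-median and $p = 2$ recovers $k$-means.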
> Similarly, I know that prior work on RFF like algorithms have cared about certain kinds of relative error guarantees, like in [Avron et al, 2017]. While these are cited in the paper, the results here are not well contrasted against the prior work. This is clear in Section 1.2, which is basically a block of citations without any discussion of similarity. Why does their notions of relative error subspace embedding not extend to the authors' setting? (Formally, make a subspace embedding via RFF or RFF-LS for a 2x2 kernel matrix, and see if that subspace embedding guarantee implies a relative error pairwise guarantee like what the authors are aiming for; I can clarify if this description isn't clear).
As far as we understand, [Avron et al., 2017] does not give an exact subspace embedding, only an approximate one. In particular, their spectral guarantee suffers an additive $\lambda I$ term, and $\lambda$ cannot be set to $0$ in their approach, since the target dimension depends on the statistical dimension $s_\lambda$ as well as on an $n / \lambda$ term. Hence, their work is related, but it does not seem to yield the relative-error guarantee we need.
We will add this discussion in Section 1 in the next version.
> In the experiments, the authors use SVD as an algorithm to attempt to preserve pairwise distances. Why is this a benchmark? Do people really use low-rank approximation algorithms to preserve pairwise distances? This feels intuitively doomed to fail.
SVD is an "idealized" form of the Nystrom method, and we compare against it since Nystrom is a popular method for kernel dimension reduction. In fact, SVD/Nystrom has been shown to preserve relative error for specific downstream applications, such as clustering [Musco and Musco, 2017]. But you are right that it cannot preserve the relative error of pairwise kernel distances.
> Also, if you run an algorithm for 20 iterations, could you include measures of confidence intervals in the plot?
This is a great suggestion; we will add confidence intervals to the plots in the next version.
> Theorem 1.4 isn't clearly phrased.
We agree. The target dimension does not depend on $\log \frac{\Delta}{\rho}$ at all, even though the running time does. We will change the order of quantifiers to make this clear in the next version.
### Typos
> Consider using a different typeface for vectors versus scalars. Not vital by any means, but I think it'd help legibility a bit.
> [Proof of theorem 3.1, page 6, the line that starts "Assume $\delta \leq$"] ...
> [Proof of theorem 3.1, page 6, the line that starts "For simplicity denote"] ...
> Proof of theorem 3.1, page 6, the line that starts "By Lemma 3.1 for every"] ...
> [Proof of theorem 3.1, page 6, the math immediately below "By Lemma 3.1 for every"] ...
> [Proof of theorem 3.1, page 6, the math immediately above "By out choice of t"] ...
> [Remark 4.1, bottom of page 7] At the start of the remark, $D = \Theta(\ldots)$ has a mis-typed parenthesis
Thanks for the very detailed comments and for pointing out the minor technical issues in our proofs! We will do a careful pass and address all of them in the next version.
## Reviewer pJCZ
We thank the reviewer for the very inspiring and insightful comments! We respond to your comments as follows.
> It is not entirely convincing that the method for the Laplacian kernel is a bigger framework for other kernels. The bound $\max\{ \epsilon^{-1} \log^3(1 / \delta), \epsilon^{-2}\log(1 / \delta) \}$ looks less satisfying.
Indeed, our oblivious dimension reduction for Laplacian kernels is primarily of conceptual value: it shows that oblivious dimension reduction is possible for Laplacian kernels, beyond what RFF can do. This rules out the possibility that RFF is the "only way" to do oblivious dimension reduction for shift-invariant kernels, and it suggests that new techniques must be developed for this problem, of which our construction is a showcase.
It is an interesting open question to find a better and more generally applicable framework for oblivious dimension reduction of general shift-invariant kernels.
> Comment on the problem background: It says that the feature map $\phi$ is from $\mathbb{R}^d$ to a general Hilbert space $H$, on the other hand the distance is defined to be the $\ell_2$ norm. Does this mean that the Hilbert space $H$ must be embeddable into $\ell_2$? It seems that the distance does not need to be $\ell_2$ norm, it just needs to be induced by the inner product on $H$.
Our motivation is indeed to consider the $\ell_2$ distance on the feature space, since it is widely used in downstream applications such as kernel clustering and kernel nearest-neighbor search. Hence, we define the kernel distance in a form similar to $\ell_2$.
However, we emphasize that this does *not* require $H$ to be embeddable into $\ell_2$, since the distance can be defined entirely via inner products, i.e., $\mathrm{dist}(x, y) := \sqrt{\langle \varphi(x), \varphi(x) \rangle + \langle \varphi(y), \varphi(y) \rangle - 2\langle \varphi(x), \varphi(y) \rangle}$.
Finally, we remark that for a finite dataset of $n$ points, one can always find an (at most $n$-dimensional) Euclidean representation of the images $\varphi(x)$ under the kernel feature map; see the sketch below.
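This is a standard fact; a minimal numpy sketch (our own illustration, with a toy Gaussian kernel) obtains such a representation by factoring the Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                               # toy dataset of n = 5 points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq / 2)                                       # Gaussian kernel Gram matrix (PSD)

w, V = np.linalg.eigh(G)
P = V * np.sqrt(np.clip(w, 0.0, None))                    # rows of P live in R^n

assert np.allclose(P @ P.T, G)  # inner products, hence kernel distances, are reproduced
```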
> Page 2, Line -5: it would be good to elaborate on "nearly matches"
Recall that our dimension reduction aims to preserve the pairwise $\ell_2$ distances of the data points in the feature space within a $(1 \pm \epsilon)$ factor. For this task, the Johnson-Lindenstrauss transform uses $O(\epsilon^{-2}\log n)$ target dimensions, which is known to be tight. Compared with this $O(\epsilon^{-2}\log n)$ bound, our bound of $\mathrm{poly}(\epsilon^{-1} \log n)$ is nearly tight up to the degrees of the polynomial in these parameters (and in particular we do not incur terms like $\exp(\epsilon^{-1})$).
We will add this explanation in the next version.
> Proof of Lemma 3.1: I feel there may be some small mistakes or gaps in the proof. For instance, $g_k(x) / i!$ seems to be $g_i(x) / i!$? What happens for the terms of $i < 2k$? Why do they vanish? In particular I don't see why Lemma A.2 means that they'll vanish because there is no limiting process here? Perhaps it should be an inequality here that $g_k(x)$ is dominated by the remainder term.
The term $g_k(x)/i!$ is exactly as written in the paper and is intended; the first equation is the Taylor expansion of $g_k(x)$.
As for the terms with $i < 2k$, they are indeed $0$ by Lemma A.2: since $g_k(x)/\|x\|_1^{2k}$ tends to a constant as $x \to 0$, with some effort one can show that the $i$-th derivative of $g_k(x)$ vanishes in the limit $x \to 0$ for every $i < 2k$, because this is the case for $\|x\|_1^{2k}$.
### Minor points
> Page 3, Line 8: “Cauchy’s evaluation formula” -> ...
> Fact 3.1, last bullet point: ...
> Fact 3.1: need citation or a short proof.
> Proof of Theorem 3.1, the series of inequalities in the middle of the page: ...
> Proof of Theorem 3.1: “By our choice of $t$ and $\delta$ ... $\leq \delta l$, denote $\sigma'^2$ ..., so ..." -> "By our choice of $t$ and $\delta$, ... $\leq \delta l$. Let $\sigma'^2$ ..., then ..."
> Proof of Theorem 3.1, last line: “combine it together” -> “combine them together”. It would be good to be explicit about what intermediate results are combined here.
> Theorem 4.1: please rephrase the theorem statement. “For every …, every …, denoting…, there exists…” do not read very well. It might be better to say “Suppose that …”
> Page 7, line 1 after the proof of Theorem 4.1: right hand -> right-hand
> Page 7, line -7: "$\geq \Omega(1)$" -> "$=\Omega(1)$"
> Page 7, Line -2: insert a comma after $O(1)$
> Page 8: please use different line styles - colours cannot be distinguished when printed black and white.
> Section E.1: Do NOT use $\ell_2^2$ as this typically means a two-dimensional $\ell_2$ space.
Thanks for the very detailed comments! We will address all these in the next version.
<font color="red"> OLD </font>
<font color="silver">
We followed your suggestion and evaluated the performance of our method on the MNIST dataset, compared against the Nystrom/SVD methods.
In particular, we subsample 100 points from each of the 10 classes of MNIST, interpret each $28 \times 28$ image as a vector of $784$ pixels, and use a normalization of these vectors as the input dataset. We pick $k = 50$ (it is standard to pick $k$ somewhat larger than $10$, since a single class may split into several sub-classes corresponding to different ways of writing the same digit) and a Gaussian kernel $k(x, y) := \exp(- \|x - y\|_2^2 / 2)$.
In the experiment, we use brute force to compute exact representations of the points in the feature space as the benchmark. We then vary the target dimension for RFF (our method), SVD (the "ideal" Nystrom), and Johnson-Lindenstrauss applied to the feature space, and compute the representations of the points in each target space. Finally, we run the $k$-means solver from sklearn on each of the resulting point sets and report the kernel $k$-means cost (evaluated in the original feature space).
The results are listed below (the cost in the original feature space is 269.92).
| Method\Dimension | 20 | 40 | 60 | 80 | 100 |
| -------- | ----- | -- | ---- | ---| --- |
| RFF | 340.59| 302.99 | 298.65 | 287.21 | 287.86 |
| SVD | 269.40| 266.27 | 266.47 | 267.02 | 267.41 |
| JL | 322.29| 294.43 | 283.05 | 283.78 | 280.08 |
We can see that RFF performs well, achieving a cost close to the "optimal" cost obtained by running $k$-means directly in the feature space.
</font>