# Reply to Reviewer DPQ8:

## Sensitivity of hyperparameters:

For the sensitivity of $\ell$, $\lambda$, and $k$, we used 2D synthetic data based on Reviewer j594's design (we used a 200 by 200 grid for x by y in our paper, whereas their code uses a 201 by 201 grid):

Fixed $k = 4$, $\lambda = 4$:

| varied $\ell$ | RMSE |
|---|---|
| 10 | 0.072 |
| 15 | 0.041 |
| 20 | 0.052 |
| 25 | 0.061 |

Fixed $k = 4$, $\ell = 15$:

| varied $\lambda$ | RMSE |
|---|---|
| 2 | 0.065 |
| 3 | 0.045 |
| 4 | 0.041 |
| 5 | 0.040 |

Fixed $\lambda = 4$, $\ell = 15$:

| varied $k$ | RMSE |
|---|---|
| 2 | 0.042 |
| 3 | 0.040 |
| 4 | 0.041 |
| 5 | 0.043 |

We can see that the RMSE is not very sensitive to $\lambda$, $k$, or $\ell$.

## On general position (1 & 3)

Thank you for refocusing our attention on this claim based on general position in Section 3.2. The claim about $M_q$ being unique under our assumption was incorrect; luckily, it was also not needed. If one selects the $k' \leq k$ uniquely defined nearest points together with any arbitrary subset of the $k - k'$ equally close points (where no other points are closer), then the claims about the algorithms still hold. All of those $k - k'$ points receive a weight of $0$, so it does not matter which are selected. We simply remove this uniqueness claim in Section 3.2; it now reads:

> Let $t \geq k \geq d+2$; as we will see in Section 4.1, this will allow us to provide results on the continuity of the local KRR model.

The arguments about continuity, which are made with more care in Section 4.1, are unaffected.

## On general position (2)

In computational geometry, *general position* is more a family of assumptions than a single specific definition. There are many ways it could be defined -- such as no 3 collinear points in 2D, no 4 coplanar points in 3D, no 3 lines in 2D meeting at a common point, or no 4 points lying on a common circle in 2D. Different geometric constructions require different, but similar, measure-zero properties to avoid degenerate behavior. The standard practice is to define the version required for the setting at hand. For Delaunay triangulations in $d$ dimensions, the required notion is that no $d+2$ points lie on a common sphere. Note that this can be viewed as generalizing the condition of no $d+2$ points on a common hyperplane in $d$ dimensions, since, fixing a tangent point, as the center (and hence also the radius) of a sphere goes to infinity, the sphere converges to a hyperplane. In our setting, this is also the version of general position we need. Ultimately, we feel the use and definition here are standard and correct for our setting.

## Some explanation in Section 4.1

We removed the informal terms and now begin by pointing out that we consider continuously changing the location of the query $q$. The revised text reads:

> Consider continuously moving the position of a query point $q$. As any model point moves into the set of the $k$ nearest neighbors of $q$, its weight is initially zero. So there is no boundary-effect discontinuity in the contribution of the nearest $k$ models when suddenly a different set of points becomes the $k$ nearest. Also, when two points swap their relative position in the sorted order, they have the same weight at that moment. So again there is no discontinuity in the weights as the relative order of model points changes within the nearest $k$.
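To make the continuity argument concrete, here is a minimal sketch of the query-dependent weighting, mirroring the `raw_weight` computation in the code shared with Reviewer j594 below; the helper name `knn_weights` is ours for illustration only. The farthest of the $k$ nearest always receives raw weight zero, so points enter and leave the nearest-$k$ set with zero weight:

```
import numpy as np

def knn_weights(q, model_points, k):
    """Illustrative sketch of the query-dependent weighting described above."""
    dists = np.linalg.norm(model_points - q, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest model points
    d = dists[nearest]
    r_min, r_max = d.min(), d.max()
    if r_max == r_min:                       # all k model points equally close
        raw = np.ones(k)
    else:
        raw = (r_max - d) / (r_max - r_min)  # farthest of the k -> weight 0
    return nearest, raw / raw.sum()          # normalized weights
```

The `linear = True` branch of `local_any_dimension` in the code below uses exactly this normalization.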
## Minor requests:

- [x] Citations changed from \citet to \citep.
- [x] "the dual coefficient" -> "the dual coefficients".
- [x] Made the changes to past tense.
- [x] We mean that (Liang & Rakhlin 2020) only considered the high-dimensional ($d > n$) setting where the ridge term goes to 0. We were unaware of comparable work in the low-dimensional ($n > d$) setting.
- [x] Changed $p_{th}$ to $p$th in the sixth line of "Fast Nearest Neighbor Search" in Section 2.
- [x] Changed $c_1$ to $m_1$ in the second line of "Determine M" in Section 3.1.
- [x] In the third paragraph of "Continuity of KNN Interpolation": "The Lipschitz factor will be larger" -> "will be smaller"? -> Yes, it should be "smaller"; sorry for the typo.
- [x] We added a citation to a well-known book devoted to the Delaunay complex for the curious reader. Since the remark is mostly an aside -- noting the connection for the interested reader -- we felt it would be distracting to elaborate further.
- [x] Right after the proof of Corollary 4.2: "Lets" -> "Let us" -> text updated.
- [x] In Theorem 4.3: please make it explicit that the theorem assumes $M$ is $(\gamma, k)$-distributed. -> We updated this in Theorem 4.3.
- [x] In the second paragraph of "Local Interpolation" in Section 4.2: change $b_i = \ell\mathrm{NN}(x_i)$ -> $b_i = \|x_i - \ell\mathrm{NN}(x_i)\|$ -> text updated.
- [x] In "Local Interpolation", the "this" refers to the use of $\lambda b(k)$ -> text updated.
- [x] "Global KRR" was not defined, which becomes confusing later in the paper given the related terms "global KRR", "local KRR", and "full local KRR". -> We added the explanation in Section 5.1, in the hyper-parameter tuning paragraph, on the sixth line.
- [x] In Section 4.3, after "Fixed number of local models", the notation for expectations should consistently be \ensuremath{\mathbb{E}}; in Lemma 4.6, the set of reals should be denoted \ensuremath{\mathbb{R}}. -> text updated.
- [x] "Singular NWKR" in Figure 1 -> subtitles updated.

# Reply to Reviewer j594:

## Weakness 1: RBF Interpolator

Thanks for pointing out the RBF interpolator! We will incorporate this into the revision; in the meantime, here are some comments.

When running it on the 1D data set, one can indeed see some small discontinuities. Although they are perhaps more subtle than those of other discontinuous methods, we think the lack of guarantees is important since it does not prevent more serious issues.

Empirically, we ran only one trial on this 2D dataset with our default hyper-parameter choices, and due to the randomness of the training points, different trials may give different results. Our data design is also slightly different from what you showed in your code: we used np.linspace(-5, 5, 200) rather than np.linspace(-5, 5, 201). We re-ran the code under the same data generation strategy you suggested, comparing the RBF interpolator with our method (slightly changing the number of evaluated models from 3 to 4), and we obtained a comparable result (around 0.07) with the code below. Under a different random draw, using only 3 evaluated models, we got 0.135, improving on the 0.323 from our last run.
```
### rmse = 0.07192164953072741 ###
import numpy as np
import faiss
from numba import jit
from tqdm import tqdm
from sklearn.kernel_ridge import KernelRidge


def local_any_dimension(model_point, X, test_x, y, k, C1, C2, alpha, k2, sigma, linear=True):
    # Fit one local KRR model around each model point, then predict each
    # query as a weighted combination of the k2 nearest local models.
    n, d = model_point.shape
    index = faiss.IndexFlatL2(d)
    index.add(X)
    model_map = {}
    for j in range(n):
        D, I = index.search((model_point[j, :]).reshape((1, d)), k)
        b = np.max([np.linalg.norm(model_point[j, :] - x) for x in X[I[0], :]])
        # all training points within squared radius C1 * b**2 of the model point
        lims, D_1, I_1 = index.range_search((model_point[j, :]).reshape((1, d)), C1 * b ** 2)
        new_x = X[I_1, :]
        knn_y_i = y[I_1]
        clf = KernelRidge(alpha=alpha, kernel='rbf', gamma=1 / (C2 * b ** 2)).fit(new_x, knn_y_i)
        model_map[j] = clf
    m, d1 = test_x.shape
    index_1 = faiss.IndexFlatL2(d)
    index_1.add(model_point)
    pred_y = np.zeros((m, d1))
    for q in tqdm(range(m)):
        D, I = index_1.search((test_x[q, :]).reshape((1, d)), k2)
        I_S = I[0]
        distances = np.array([np.linalg.norm(test_x[q, :] - x) for x in model_point[I_S, :]])
        radius_min = np.min(distances)
        radius_max = np.max(distances)
        if radius_min != radius_max:
            # triangle-style weights: the farthest of the k2 models gets weight 0
            raw_weight = np.array([(radius_max - x) / (radius_max - radius_min) for x in distances])
        else:
            raw_weight = np.ones(k2)
        if linear:
            normalized_weight = raw_weight / np.sum(raw_weight)
        else:
            normalized_weight = np.exp(raw_weight * sigma) / np.sum(np.exp(raw_weight * sigma))
        w = [model_map[s].predict(test_x[q, :].reshape(1, -1))[0] for s in I_S]
        pred_y[q] = np.sum([w[i] * normalized_weight[i] for i in range(len(I_S))], axis=0)
    return pred_y


@jit
def gon_clustering_aprox(X, k, k2, c):
    # Greedy farthest-point (Gonzalez-style) selection of k model points,
    # discounted by each point's local k2-NN radius.
    n, d = X.shape
    f = np.random.randint(0, n - 1)
    s = np.zeros((k, d))
    centers = [f]
    s[0, :] = X[f, :]
    phi = [0 for _ in range(n)]
    r = []
    index = faiss.IndexFlatL2(d)
    index.add(X)
    for t in range(n):
        D, I = index.search((X[t, :]).reshape((1, d)), k2)
        r.append(D[0][-1])  # squared distance to the k2-th nearest neighbor
    for i in range(1, k):
        M = 0
        s[i, :] = X[0, :]
        new_center = 0
        for j in range(n):
            if j not in centers and np.sqrt(max(np.linalg.norm(X[j, :] - s[phi[j], :]) ** 2 - r[centers[phi[j]]], 0)) > M:
                M = np.sqrt(np.linalg.norm(X[j, :] - s[phi[j], :]) ** 2 - r[centers[phi[j]]] ** 2)
                s[i, :] = X[j, :]
                new_center = j
        centers.append(new_center)
        for j in range(n):
            if np.sqrt(max(np.linalg.norm(X[j, :] - s[phi[j], :]) ** 2 - r[centers[phi[j]]], 0)) > np.sqrt(max(np.linalg.norm(X[j, :] - s[i, :]) ** 2 - r[centers[i]], 0)):
                phi[j] = i
    return phi, s


def prob(x):
    # sampling probability: ~0.05 near the line x0 + x1 = 0, increasing away from it
    p = np.log10(1 + np.log10(1 + 0.5 * (0.5 * x[:, 0] + 0.5 * x[:, 1]) ** 4))
    return np.maximum(0.05, p)


def response(x):
    z1 = x[:, 0] / np.sqrt(2) + x[:, 1] / np.sqrt(2)
    z2 = -x[:, 0] / np.sqrt(2) + x[:, 1] / np.sqrt(2)
    return np.sin(2 * z1 * np.abs(z1)) + np.sin(0.5 * z2 * np.abs(z2)) + 5 * np.cos(z1) * np.sin(z2)


# data generation (201 x 201 grid, as in the reviewer's code)
np.random.seed(0)
t = np.linspace(-5, 5, 201)
X1, X2 = np.meshgrid(t, t)
x_mesh = np.stack([X1.flatten(), X2.flatten()], axis=1)
y_mesh = response(x_mesh)
p = prob(x_mesh)
q = np.random.rand(p.size)
x, y = x_mesh[p >= q], y_mesh[p >= q]

membership, model_points_2 = gon_clustering_aprox(x.astype(np.float32), 1500, 10, 1)
preds = local_any_dimension(model_points_2.astype(np.float32), x.astype(np.float32), x_mesh.astype(np.float32), y,
                            10, 4, 1, 1e-3, 4, 0.05, True)
print(np.sqrt(np.mean((preds[:, 0] - y_mesh) ** 2)))
```
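For completeness, here is a minimal sketch of how the RBF interpolator baseline could be scored on the same grid, assuming the data-generation code above has already been run; the `neighbors`, `kernel`, and `smoothing` values are illustrative choices, not the exact settings behind the numbers quoted above:

```
import numpy as np
from scipy.interpolate import RBFInterpolator

# Fit scipy's RBF interpolator on the sampled training points (x, y) and
# evaluate on the full 201 x 201 grid. `neighbors` keeps the fit tractable
# on ~10k+ training points; these settings are illustrative only.
rbf = RBFInterpolator(x, y, neighbors=50, kernel='thin_plate_spline', smoothing=0.0)
rbf_preds = rbf(x_mesh)
print(np.sqrt(np.mean((rbf_preds - y_mesh) ** 2)))
```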
## Weakness 2: Connection to NWKR

Thank you for pointing out the interesting connection to Nadaraya-Watson kernel regression (NWKR). However, we would like to point out that, formulated this way, our method uses a variable-width triangle kernel: the width of the kernel changes depending on the query point. Most of the analysis of NWKR we are aware of (including the referenced EJS11 paper) considers a fixed kernel, so it does not directly apply to our setting. That said, we believe there may be a way to leverage this interpretation, and the associated analysis, to obtain at least some of our results. However, to formalize such results this way, one would still need to analyze carefully how the bandwidth parameter varies as the query point changes, and then how this affects the kernel weighting scheme. Indeed, this is the bulk of our analysis in Section 4.1 and Appendix A.

## Minor Request

There is an issue with the new proof of Theorem 4.3 sketched by the reviewer. In the second step, the expression is maximized by mapping one of the $\mu_i(q)$ terms to $-T$ (instead of $T$). After factoring out a $T$, this leaves a $p_i' + p_i$ term instead of the $p_i' - p_i$ term for which we have a bound. We suspect something like this can be made to work with an extra factor of $2$, but in most cases the bound we provide (using the Lipschitz factor $L$) will be better. So we have elected to retain our proof and bound.

# Reply to Reviewer Xqyg:

## kernelKNN

Thank you for pointing out the KernelKNN R package. We already compare to a very similar method, KNN-SVM, which identifies the $k$ nearest neighbors and then learns the best weights on that subset of the data via kernel ridge regression. The proposed approach (kernelKNN) instead applies the standard NWKR method on this selection of $k$ nearest neighbors. It is well known that KRR tends to outperform NWKR, and our method consistently outperforms KNN-SVM empirically. Moreover, neither of these methods guarantees continuity, which is part of our desiderata.

## high-d data set

As you suggested, we tried a higher-dimensional dataset, pol, which has 15000 observations and 26 explanatory variables, from https://github.com/treforevans/uci_datasets. 90 percent of the data is used for training and 10 percent for testing. The ridge parameter is fixed at $\eta = 10^{-2}$. The $d$ in singular-kernel NWKR is set to 20 (around the data dimension). The bandwidths for global KRR, local SVM, NWKR, KNN-SVM, and singular-kernel NWKR are selected from [0.1, 0.5, 1, 5, 10]. For local KRR, $\ell$ is selected from [10, 50, 100, 150], $\lambda$ from [1.5, 2, 2.5, 3], and $k = 3$. The number of nearest neighbors $K$ for KNN-SVM is selected from [10, 50, 100, 150]. We select hyperparameters based on the training RMSE and report the test RMSE (a sketch of this selection procedure is given after the table). The results are shown below:

| Method | RMSE |
|---|---|
| Global KRR | 12.035 |
| KNN-SVM | 10.275 |
| Local SVM | 10.376 |
| Singular Kernel | 9.788 |
| NWKR | 10.857 |
| Local KRR | 9.192 |

Even though our method mainly targets datasets with variable local density and moderate dimension, its performance on this real dataset is also competitive with the other candidate methods.
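As a reference for the selection protocol above, here is a minimal sketch for the global KRR baseline. The names `X_pol` and `Y_pol`, the train/test split code, and the mapping from the bandwidth $\sigma$ to sklearn's `gamma` (we assume gamma = 1 / (2 * sigma^2)) are illustrative assumptions, not our exact experiment script:

```
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split


def select_by_train_rmse(X, Y, bandwidths, eta=1e-2, seed=0):
    """Pick the bandwidth with the lowest training RMSE, then report the test RMSE."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.1, random_state=seed)
    best = None
    for sigma in bandwidths:
        # illustrative assumption: sigma enters KernelRidge as gamma = 1 / (2 * sigma**2)
        clf = KernelRidge(alpha=eta, kernel='rbf', gamma=1.0 / (2 * sigma ** 2)).fit(X_tr, Y_tr)
        train_rmse = np.sqrt(np.mean((clf.predict(X_tr) - Y_tr) ** 2))
        if best is None or train_rmse < best[0]:
            best = (train_rmse, sigma, clf)
    _, sigma, clf = best
    test_rmse = np.sqrt(np.mean((clf.predict(X_te) - Y_te) ** 2))
    return sigma, test_rmse


# e.g., with X_pol, Y_pol holding the pol features and targets (hypothetical names):
# best_sigma, test_rmse = select_by_train_rmse(X_pol, Y_pol, [0.1, 0.5, 1, 5, 10])
```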