# Reviewer 3cxH
## Weaknesses:
I have the following three concerns about the empirical side:
1. The paper mentions a note on dealing with high-dimensional data using z=f(x), but it seems the experiments are mainly on low-dimension data sets and it is not clear whether this idea of dealing with high-dimensional data is tested.
> **Our Response:** Thanks for your comment and suggestion. We agree that it is worth evaluating the proposed method on high-dimensional data. During the rebuttal phase, we evaluated our method on a synthetic dataset with feature dimension 500. The weight estimation MSEs are shown in the table below. It can be seen that our method still achieves the best performance. We will include more experimental results on high-dimensional data in the revised version.
> | Adaptation | $\sigma=0.1$ | $\sigma=0.4$ |
> |------------|--------|----------|
> | None | 1.8556 | 0.1510 |
> | KMM | 3.0332 | 0.3176 |
> | RECOLIA | 1.7094 | 0.1221 |
2. The variance of the estimated weights can affect the variance of the learning performance. Actually there are some previous results showing that not necessarily the better MSE in weight estimation error leads to better performance in learning using the weights. Table 1 and Table 2 sort of also show it. So it would be nice to add some theoretical or empirical analysis on the relationship between performances/variances between the two components.
> **Our Response:** <span style="color:red">TO BE ADDED</span>.
> (I think this question is way beyond the scope of this paper; if we can answer this question nicely, we should write another paper and get it published.)
> (I don't know if this is the best solution, but we may just say this is an important question and admit that we do not address it. --> Josh: I agree with QT on this.)
> (Xin: Can we give some simple justification related to plug-in estimation?)
3. KMM is also a very strong baseline here. Sometimes the proposed method does not win significantly in terms of predictive performance. Looking at a larger range of \delta (shifts) and looking at more datasets may help.
> **Our Response:** Thanks for the suggestion. We will definitely include more experiments over a larger range of shifts and explore more datasets. Beyond this, we want to highlight that besides the predictive performance, we also recorded the adaptation time in Table 1 to measure adaptation efficiency. In general, our method improves the adaptation speed by more than 10 times. Thus, our proposed method can achieve better predictive performance at a lower computational cost.
## Questions:
Please refer to my comments on the limitations.
Additionally:
1. The proposed algorithm involves the inversion of an n by n matrix when n is the number of data points. I am wondering why the efficiency is not a concern here, if we have large amounts of data samples?
> **Our Response:** We agree that the matrix inversion leads to intensive computation, and this is a limitation of all kernel-matrix-based methods, including ours and KMM. In practice, however, there are several numerical tricks to speed up the computation, such as low-rank approximation [1] and divide-and-conquer [2] methods. Furthermore, one can avoid the explicit matrix inversion altogether by exploiting the structure of the matrix (if any) and solving a system of linear equations instead (see the sketch after the references below).
[1] Chávez, Gustavo, et al. "Scalable and memory-efficient kernel ridge regression." 2020 IEEE International parallel and distributed processing symposium (IPDPS). IEEE, 2020.
[2] Zhang, Yuchen, John Duchi, and Martin Wainwright. "Divide and conquer kernel ridge regression." Conference on learning theory. PMLR, 2013.
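To make this concrete, here is a minimal sketch (illustrative only, not our implementation; the matrix $K$, regularizer $\alpha$, and right-hand side $b$ are placeholders) of solving the regularized kernel system $(K+\alpha I)w=b$ by Cholesky factorization or conjugate gradient instead of forming an explicit inverse:
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 2000
y = rng.normal(size=(n, 1))
K = np.exp(-0.5 * (y - y.T) ** 2)          # Gaussian kernel Gram matrix
alpha = 1e-3
b = rng.normal(size=n)
A = K + alpha * np.eye(n)                  # symmetric positive definite

# Cholesky solve: same O(n^3) order as inversion, but cheaper in constants
# and numerically more stable than forming A^{-1} explicitly.
w_chol = cho_solve(cho_factor(A), b)

# Conjugate gradient: needs only matrix-vector products, so it combines
# naturally with low-rank [1] or divide-and-conquer [2] approximations.
w_cg, info = cg(A, b)
print(info, np.linalg.norm(A @ w_cg - b))  # info == 0 means CG converged
```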
2. What is the practical implication of the assumption 1? I am a bit surprised that there is not an assumption about the density ratio w or the density ratio \rho (on the covariates).
> **Our Response:** Assumption 1 is necessary to guarantee the existence of a solution to the regularized problem. Assumption 2 imposes a condition on the solution \rho (and hence on the density ratio w), namely that it belongs to a function class which is a subset of the square-integrable functions. This function class is associated with the compact operator T, and its definition is given in Definition 1. Furthermore, in line 104, we assume that the density ratio \eta belongs to the class of square-integrable functions.
## Limitations:
There can be more discussions on the limitations.
> **Our Response:** Thanks for the suggestion. We will provide more discussion in our limitations section, including the computational limitations for large datasets.
# Reviewer K9Sa
## Weaknesses:
- The example presented in the introduction lacks persuasiveness, as it is uncommon for doctors to assign continuous values to patients.
> **Our Response:** Thanks for your comments. We agree that doctors typically do not present diagnostic results to patients as continuous values. However, during the diagnostic process, doctors are often interested in predictions of bio-physiological indicators (e.g., symptom severity level, cell survival time, and drug response time), which are usually represented as continuous values.
> Examples include the ***Acute Physiology and Chronic Health Evaluation III (APACHE III)*** and ***Sequential Organ Failure Assessment (SOFA) Score***.
> There are other situations where a continuous response is needed rather than a categorical one: for example, some banks are more selective than others when approving credit cards; thus, the credit scores of their clients will follow different distributions.
> We will improve our presentation for better understanding and motivation.
- The existing works in the literature are not sufficiently discussed. For instance, [1] considers the same learning settings, but the proposed method does not involve solving an integral equation, potentially leading to greater stability in certain cases. Therefore, additional theoretical discussion and empirical comparisons are needed.
> **Our Response:** <span style="color:red">TO BE ADDED</span>.
- The paper does not adequately address how to determine the regularization parameter. The empirical studies show that the choice of this parameter significantly impacts performance, limiting the practical performance of the proposed algorithm.
> **Our Response:** Thanks for the comments. As we discussed in the limitation section, the selection of the regularization parameter affects the practical performance of our method. Currently, we propose a selection method in lines 270-283, with which our method is already competitive with the state-of-the-art methods. Because identifying the optimal regularization parameter is highly non-trivial and the problem shares a lot in common with kernel regression, we believe it warrants a separate future investigation.
[1] Nguyen, Tuan Duong, Marthinus Christoffel, and Masashi Sugiyama. Continuous Target Shift Adaptation in Supervised Learning. In: ACML 2015.
## Questions:
As mentioned earlier, compared with the related work [1] that addresses the same learning problem, what are the advantages of the proposed method? Empirically, does the proposed method outperform the algorithm in [1] or achieve more stable results?
> **Our Response:** Thanks for your reference. <Talk about some advantage from estimation design? Also mention the relationship with KMM> For the numerical comparison, unfortunately, we did not find any released code for [1]. We tried to implement the algorithm proposed in [1], but within the limited rebuttal time our implementation did not produce converged results. We will include the comparison in the revised version.
It is suggested to change the title from "Regularized Continuous Label Shift Adaptation" to "Regularized Target Shift Adaptation in Continuous Space" to align with the literature [1,2] and differentiate it from the setting of online learning and sequential prediction. When we refer to labels, it commonly implies discrete space. The main paper should also be revised accordingly.
> **Our Response:** We really appreciate your suggestion. We will change the title and contents for better alignment with the existing literature.
I am willing to increase my score if these issues are addressed.
[1] Nguyen, Tuan Duong, Marthinus Christoffel, and Masashi Sugiyama. Continuous Target Shift Adaptation in Supervised Learning. In: ACML 2015.
[2] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain Adaptation under Target and Conditional Shift. In: ICML 2013.
# Reviewer 4r3y
## Weaknesses:
1. The claim that none of the existing methods can be easily adopted to continuous label space is doubtful. For instance, using a similar idea as Just Train Twice (JTT), one can oversample the data instances with large training loss. What prevents the application of JTT to continuous label space? The authors claim that those methods cannot be adopted without a genuine attempt, which weakens the argument.
> **Our Response:** Thanks for the reference. For extending the JTT method [1] to continuous label space, one of the blockers would be the definition of the error set in Eq. (6) of that paper.
> When the label space is continuous, with high probability the prediction will never exactly equal the observed label. Thus, the error set becomes the whole dataset, which would severely deteriorate JTT's performance (see the toy illustration after the reference below).
[1] Liu, Evan Z., et al. "Just train twice: Improving group robustness without training group information." International Conference on Machine Learning. PMLR, 2021.
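As a toy illustration of this point (hypothetical data and a generic regressor, not JTT's actual pipeline), the exact-match error set of JTT's Eq. (6) degenerates to the full dataset once labels are continuous, and any continuous surrogate needs an arbitrary residual threshold:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=500)   # continuous labels

pred = LinearRegression().fit(X, y).predict(X)

# Discrete-label reading of the error set: points the first-stage model gets wrong.
error_set_exact = np.flatnonzero(pred != y)
print(len(error_set_exact))                # essentially all 500 points

# A continuous surrogate would need a residual threshold, a design choice
# that JTT does not specify.
tau = 0.2                                  # hypothetical threshold
error_set_thresh = np.flatnonzero(np.abs(pred - y) > tau)
print(len(error_set_thresh))
```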
2. The scale of the experiments is too limited. All datasets are tabular and in most cases random forest is enough to provide a good performance. It is questionable whether the method scales to problems on a larger scale, especially that dimension reduction is an important component that has a large effect on the performance. There are many large-scale regression problems such as depth or surface normal estimation of 3D data.
> **Our Response:** Thanks for the suggestion. We are working on evaluating our method on large-scale regression datasets and will include the results in the revised version.
3. The paper is difficult to read with many in-line equations and definitions. Also, in the experiment section, for the tables presented, it is better to highlight the method with the optimal performance.
> **Our Response:** Thanks for your suggestion. We will work on improving the presentation for better readability.
## Questions:
It is unclear that in practice, how to control the dimension as discussed in remark 4? Since there is a tradeoff between the reduced dimension and performance, in terms of neural networks, the reduced dimension corresponds to the learned embedding. Are there any general guidelines about the design? Or it is simply a hyperparameter that requires tuning? I hope to see more experimental results about how dimensionality affects the performance.
> **Our Response:** Thanks for your suggestion. In our current experiments, we adopted a strategy similar to that of BBSE [17] and used the predictive model to project the data to a 1-dimensional prediction (a sketch is given below). We will include more experimental results on the effect of dimensionality in the revised version.
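> For reference, a minimal sketch of this dimension-reduction step, with a generic regressor standing in for the black-box predictive model (all names and sizes are illustrative):
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_src, y_src = rng.normal(size=(1000, 50)), rng.normal(size=1000)   # labeled source data
X_tgt = rng.normal(size=(500, 50))                                  # unlabeled target data

# Source-trained predictive model f; its output defines z = f(x).
f = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_src, y_src)

z_src = f.predict(X_src).reshape(-1, 1)    # 1-D projection of source covariates
z_tgt = f.predict(X_tgt).reshape(-1, 1)    # 1-D projection of target covariates
# z_src, z_tgt (together with y_src) are then passed to the weight estimator.
```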
# Reviewer hCbd
## Weaknesses:
- The writing for the Operator Perspective part could be improved. Eq. (4) should be introduced together with the change of variable. Otherwise, it is difficult to see the connection from Eq. (1).
> **Our Response:** Thanks for the suggestion. We will improve the presentation and add more sentences to make the transition from Eq. (1) to Eq. (4) clearer.
- The authors could have concluded the theory part by providing how to set $\alpha$ and $h$ as functions of $n$ and $m$ in order to achieve the minimum of the upper bound (11). It is also a little disappointing that the algorithm does not use such insights from the theory.
> **Our Response:** (Josh: Thanks for the suggestion; we will add more experiments on practical tuning of the parameters based on the asymptotic result in (11). Hyperparameter tuning itself is a nontrivial problem, and we believe more careful analysis and efficient methodology should be developed, so we consider it beyond the scope of this paper. What do you guys think? The major problem right now is the inconsistency between the theory part and the numerical section, and within a short time this would be difficult to resolve. So I tried my best to respond to this, but feel free to edit.)
- Section 4 is a bit hard to read. It introduces many symbols by just giving equations in a few lines.
> **Our Response:** Thanks for the suggestion. We will improve our presentation and reduce unnecessary notations.
- The theory and the algorithm have a possibly big gap because of the use of the approximations $a_i(\rho)\approx\rho(y_i)$ and $b_i(\tau)\approx\tau(x_i)$. The paper should have good discussions for this part including how much this approximation deviates from the original quantities. I also expected an explanation about why we need such approximations.
> **Our Response:**
> The approximation we use here is based on a midpoint-rule approximation for evaluating integrals. Evaluating $a_i(\rho)$ and $b_i(\tau)$ requires computing integrals. Even though the integrals can always be computed numerically (e.g., using Gauss-Hermite quadrature), the resulting estimator no longer has the simple form of inverting a matrix. Furthermore, the computational effort spent on these integrals for $a_i(\rho)$ and $b_i(\tau)$ can be prohibitively expensive, which would make the proposed method unattractive in practice.
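> For concreteness, the plug-in step can be read as a zeroth-order (midpoint-type) quadrature: treating $\rho$ as approximately constant over the effective support of the kernel gives
$$
a_i(\rho)=h^{-1}\int\rho(y)K_{y}\left(\frac{y-y_i}{h}\right)dy\approx\rho(y_i)\,h^{-1}\int K_{y}\left(\frac{y-y_i}{h}\right)dy=\rho(y_i),
$$
since the kernel integrates to one; the same reading applies to $b_i(\tau)\approx\tau(x_i)$.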
- There are many missing references about class prior change, which I think is the same as label shift for discrete labels.
> **Our Response:** Thanks for the comments. We will add the related references on class prior change.
## Questions:
- Please discuss the motivation and the theoretical implication of the use of the approximations $a_i(\rho)\approx\rho(y_i)$ and $b_i(\tau)\approx \tau(x_i)$.
> **Our Response:** (Josh: I think the previous response by QT(?) addresses this issue. I am just copying what was written above: The approximation we use here is based on a midpoint-rule approximation for evaluating integrals. Evaluating $a_i(\rho)$ and $b_i(\tau)$ requires computing integrals. Even though we can always calculate the integrals numerically (e.g., using Gauss-Hermite quadrature), the result does not have the simple form of inverting a matrix.)
- How does the proposed method manage to be so much faster than KMM? The proposed method still has a matrix inversion. I don't see what contributed to the computational efficiency.
> **Our Response:** The key computational efficiency comes from the fact that our method has an explicit solution to the equation, whereas KMM needs to run iterations for a constrained loss minimization (see the schematic contrast below).
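> Schematically (a toy problem solved with generic solvers, not either method's actual code), the two computational patterns are a single symmetric linear solve versus an iterative constrained quadratic program:
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 300
G = np.exp(-0.5 * (rng.normal(size=(n, 1)) - rng.normal(size=(1, n))) ** 2)
G = G @ G.T + 1e-6 * np.eye(n)     # symmetric PSD stand-in for the kernel term
b = rng.normal(size=n)
alpha = 1e-2

# Our pattern: one explicit solve of a regularized symmetric system.
w_closed = np.linalg.solve(G + alpha * np.eye(n), b)

# KMM-style pattern: minimize a quadratic objective under constraints on the
# weights, which generic solvers handle iteratively.
res = minimize(lambda w: 0.5 * w @ G @ w - b @ w,
               x0=np.ones(n),
               jac=lambda w: G @ w - b,
               bounds=[(0.0, 10.0)] * n,
               method="L-BFGS-B")
w_qp = res.x
print(res.nit)                     # number of iterations the solver needed
```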
- The definition of in the experiment is not clear. If is the variance of the truncated normal distribution, the larger is, the closer is to the uniform distribution as far as I understand. However, the results of the experiments do not seem consistent with this fact. It is also confusing that the authors call "degree of shift". Am I missing something?
> **Our Response:** Thanks for the comments. Yes, as the variance gets larger, the truncated normal becomes closer to the uniform distribution; the label shift thus disappears and all methods perform similarly. This matches our experimental results (see Figure 1). We agree that the current definition of "degree of shift" might be confusing, and we will revise it to avoid ambiguity.
## Limitations:
I think the method has a limitation in scalability. There is a gap between the theory and the algorithm because of the heuristic approximations and hyper-parameter selection.
> **Our Response:** (Josh: This is the major problem we face, but it would need more time to figure out, so let's just try to do our best, be nice, and hope they accept. If not, we can still try other journals.)
#### Comment:
I appreciate the authors' response to my comments. I have a few questions regarding the answers.
The authors argue that the computation cost for the integral is prohibitively expensive, but there is no discussion about the quality of the approximation to the integral. It is great that the approximation is fast, but it would not be useful if the error were too big. For example, it would be nice if there were empirical evaluation showing under what circumstances the approximation is good and how good it is.
> **Our Response**: Thanks for the comments. While we work on the empirical evaluation, here we provide a theoretical justification for the approximation. Take $a_i(\rho)$ as an example; it can be written as $$
a_i(\rho)=h^{-1}\int\rho(y)K_{y}\left(\frac{y-y_i}{h}\right)dy,
$$
Letting $t=(y-y_i)/h$, we can rewrite the integral as
$$
a_i(\rho)=\int\rho(y_i+ht)K_y(t)dt.
$$
Expanding $\rho(y_i+ht)$ around $y_i$ and integrating against $K_y(t)$, the first-order term vanishes because $K_y(\cdot)$ is assumed to be symmetric (and integrates to one), and we obtain
$$
a_i(\rho)=\int\rho(y_i+ht)K_y(t)dt=\rho(y_i)+\frac{h^2}{2}\rho^{\prime\prime}(y_i)\int K_y(v)v^2dv+o(h^2).
$$
Thus, under a mild bounded-second-derivative condition, the bias introduced by the approximation $a_i(\rho)\approx\rho(y_i)$ is of order $h^2$. Meanwhile, the bias of the KDE itself is also of order $h^2$, which means we can "group" the bias of this approximation with the bias brought by the KDE without changing the order.
> From the numerical side, thanks for the suggestion; we will include an additional study measuring the approximation error and its effect on the weight estimation.
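> As a preview of the kind of check we have in mind (a toy example with a hypothetical $\rho$ and a Gaussian kernel, not the promised study), the gap $|a_i(\rho)-\rho(y_i)|$ can be computed against numerical quadrature and shrinks at roughly the $h^2$ rate:
```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rho = norm(loc=0.0, scale=1.0).pdf     # smooth stand-in for rho
K_y = norm(loc=0.0, scale=1.0).pdf     # symmetric kernel integrating to one
y_i = 0.3

for h in (0.5, 0.25, 0.125):
    a_i, _ = quad(lambda y: rho(y) * K_y((y - y_i) / h) / h, -np.inf, np.inf)
    print(h, abs(a_i - rho(y_i)))      # error drops by ~4x when h is halved
```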
```Our method only requires one to solve a symmetric linear system, not necessarily to invert a matrix, which can be done in O(n) complexity through conjugate-gradient descent```
What is here? Could the authors provide a reference that clearly states this result?
#### Comments:
(Not very sure, but from what I found, the complexity should be $O(m\sqrt{\kappa})$, where $m$ is the number of non-zero elements in the kernel matrix and $\kappa$ is its condition number; see Chapter 10 in http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf. Can we claim that, with a bounded-support kernel, the number of non-zero elements scales with n?)
Assuming each row of the kernel matrix has at most M non-zero elements, m should be of order Mn. When we wrote O(n) for conjugate gradient, it simply meant that the algorithm terminates in $n$ iterations. Each iteration requires a matrix-vector multiplication, so to be more precise the cost becomes O(mn), where $m$ is the number of non-zero elements in the kernel matrix. Perhaps we can make some assumption here?
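(A toy check of this reading, with a banded SPD matrix standing in for a bounded-support kernel matrix; illustrative only:)
```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 1000
# Tridiagonal, diagonally dominant matrix: sparse, symmetric positive definite.
A = diags([1.0, 4.0, 1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

n_iter = 0
def count(_xk):                      # callback runs once per CG iteration
    global n_iter
    n_iter += 1

x, info = cg(A, b, callback=count)
# Exact arithmetic guarantees at most n iterations; each one costs a single
# sparse matrix-vector product, i.e. O(m) for m non-zero entries.
print(info, n_iter, n_iter <= n)
```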
Although the running time of the proposed method seems to be faster than [1], the quality of the direct approximation of the marginal density is still worrying. There is usually a large gap between the estimated distribution and the underlying one, and the results are not stable. I agree with Reviewer hCbd's new comment that it would be nice to have an empirical evaluation showing under what circumstances the approximation is good and how good it is.
> **Our Responses:** We agree that the estimated density will introduce estimation errors, and our theoretical results analyze the impact of the estimated density. For the approximation method we use, while we work on the empirical evaluation, here we provide a theoretical justification. Take $a_i(\rho)$ as an example; it can be written as $$
a_i(\rho)=h^{-1}\int\rho(y)K_{y}\left(\frac{y-y_i}{h}\right)dy,
$$
Letting $t=(y-y_i)/h$, we can rewrite the integral as
$$
a_i(\rho)=\int\rho(y_i+ht)K_y(t)dt.
$$
Expanding $\rho(y_i+ht)$ around $y_i$ and integrating against $K_y(t)$, the first-order term vanishes because $K_y(\cdot)$ is assumed to be symmetric (and integrates to one), and we obtain
$$
a_i(\rho)=\int\rho(y_i+ht)K_y(t)dt=\rho(y_i)+\frac{h^2}{2}\rho^{\prime\prime}(y_i)\int K_y(v)v^2dv+o(h^2).
$$
Thus, under a mild bounded-second-derivative condition, the bias introduced by the approximation $a_i(\rho)\approx\rho(y_i)$ is of order $h^2$. Meanwhile, the bias of the KDE itself is also of order $h^2$, which means we can "group" the bias of this approximation with the bias brought by the KDE without changing the order.
> Thanks for your suggestion; we will include a numerical study on the impact of the approximation.
Also, as mentioned by Reviewer hCbd, it is suggested to include a clear discussion with the literatures on class prior change in the updated version.
> **Our Responses:** Thanks for the suggestion. We will include a discussion of class prior change in the revised version.