# ELSA Rebuttal
## Reviewer YJEr
### Weakness
> **Your Comment:** *The authors propose that $P(Y|X)$ is hard to estimate, then existing methods that estimate $\omega$ using the maximum likelihood estimation are not consistent anymore. To accurately estimate $\omega$, the authors propose to avoid estimating $P(Y|X)$ and estimating $P(X|Y)$ instead. However, it seems that estimating $P(Y|X)$ and $P(X|Y)$ have similar difficulty without any assumption.*
**Our Response:**
Thanks for your comment. Although we rewrite the likelihood function using $p(\boldsymbol{x}|y)$ in (5), we do not intend to estimate $p(\boldsymbol{x}|y)$. Here $p(\boldsymbol{x}|y)$ serves as an intermediate component for deriving the perpendicular space. The reason for choosing $p(\boldsymbol{x}|y)$ is that it is the same under both the source and target distributions, so we do not need to differentiate $p_s(y|\boldsymbol{x})$ and $p_t(y|\boldsymbol{x})$. In our implementation, we never estimate $p(\boldsymbol{x}|y)$, and thus no additional assumptions are needed.
In our proposed estimator (see (11) and (12)), we still utilize $p_s(y|\boldsymbol{x})$ in the end rather than $p(\boldsymbol{x}|y)$. However, unlike the maximum likelihood estimator, which **requires** a correctly-specified model, our ELSA estimator adopts the proposed semiparametric framework, which is more robust to model misspecification.
### Questions
> **Your Comment:** *Why estimating $P(X|Y)$ instead of estimating $P(Y|X)$ make the estimation $\omega$ more accurate.*
**Our Response:**
Thanks for your comment. First of all, we'd like to highlight that we do not estimate $p(\boldsymbol{x}|y)$. Instead, we treat it as an infinite-dimensional (i.e., nonparametric) nuisance function. We use $p(\boldsymbol{x}|y)$ to derive the perpendicular space without the need for estimation. The semiparametric model is more robust to model misspecification and offers more flexibility, such that we can "integrate" any classification model without calibration (see Section 4.2 in our paper).
> **Your Comment:** *It would be great to discuss the benefit of asymptotic normality to estimate $\omega$.*
**Our Response:**
Asymptotic normality enables us to perform hypothesis testing and inference. More specifically, given the null hypothesis $H_0:\boldsymbol{\omega}_0=\mathbf{1}$ (i.e., no distribution shift), one can construct a Wald statistic using the asymptotic normality property. We can also construct confidence intervals for the estimated importance weights. We will add a remark under Theorem 3.2 to highlight this.
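For illustration, a sketch of the resulting test and interval, with $\hat{\Sigma}$ denoting a consistent estimate of the asymptotic covariance from Theorem 3.2 (a generic construction based on standard asymptotics, not specific to ELSA):
$$
W=n\,(\hat{\boldsymbol{\omega}}-\mathbf{1})^{T}\hat{\Sigma}^{-1}(\hat{\boldsymbol{\omega}}-\mathbf{1}),
$$
which is asymptotically $\chi^2$-distributed under $H_0$ with degrees of freedom equal to $\dim(\boldsymbol{\omega})$, and a $(1-\alpha)$ confidence interval for $\omega_j$ is $\hat{\omega}_j\pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}_{jj}/n}$.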
### Limitation
> **Your Comment:** *To make the method based on estimating $P(X|Y)$ conceptually more accurate than the method based on estimating $P(Y|X)$. Additional assumptions should be required to the best of my knowledge.*
**Our Response:**
We do not need additional assumptions because we do not need to estimate $p(\boldsymbol{x}|y)$. The more accurate estimation hinges on the proposed semiparametric moment-matching framework and the carefully designed $h_{\mathrm{ELSA}}(\boldsymbol{x};\boldsymbol{\omega})$ in (11).
## Reviewer BekR
### Weakness
> **Your Comment:** *The paper is hard to follow in certain places. In particular, the description and motivation of the method are not completely clear. Theoretically or empirically, it is unclear why ELSA gets better estimation than MLLS and BBSE. Importantly, what are the properties that the ELSA estimator depends on?*
**Our Response:**
Thanks for the suggestion. We will improve our presentation according to your comments and elaborate on the description and motivation of our method in detail in the rest of the responses.
As for the properties, our ELSA estimator is derived based on the label shift assumption and belongs to the class of Z-estimators (see van der Vaart 1998). We do not need assumptions beyond the standard regularity conditions for Z-estimators. Detailed discussions can be found in the following responses.
[van der Vaart 1998] Vaart, A. (1998). Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802256
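For reference, a Z-estimator solves an estimating equation and, under the regularity conditions in Chapter 5 of van der Vaart (1998), satisfies the standard sandwich asymptotics. A generic statement (with $\boldsymbol{\psi}$ a generic estimating function; in our case it is built from $h_{\mathrm{ELSA}}$):
$$
\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{\psi}(\boldsymbol{x}_i;\hat{\boldsymbol{\omega}})=\mathbf{0},\qquad
\sqrt{n}\,(\hat{\boldsymbol{\omega}}-\boldsymbol{\omega}_0)\xrightarrow{d}N\!\left(\mathbf{0},\,V^{-1}E\{\boldsymbol{\psi}\boldsymbol{\psi}^{T}\}(V^{-1})^{T}\right),\quad V=E\!\left\{\frac{\partial\boldsymbol{\psi}}{\partial\boldsymbol{\omega}^{T}}\right\}.
$$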
### Questions
> **Your Comment:** *It is hard to follow how the authors derived the equation (11)*
**Our Response:**
Thanks for your comments. We elaborate point-by-point below.
> * *Authors have not defined the phrase "nuisance tangent spaces". It might be good the elaborate on these things in Appendix A. It might also be good to include some background and formal statement of how the influence functions of estimators lay in the perpendicular space.*
**Our Response:**
Thanks for your suggestion. We will include the definitions and background details in the Appendix.
> * *Given authors have some space at the end of page 8, it would be good to structure equations in Theorem 2.1 in a better way. The significance of Theorem 2.1 is unclear.*
**Our Response:**
Thanks for your suggestion. We will reformulate Theorem 2.1 in the revision.
The significance of Theorem 2.1 is that it provides the perpendicular space $\Lambda^\perp$, which gives guidance on choosing the influence function. The perpendicular space also helps us design the $h_{\mathrm{ELSA}}(\boldsymbol{x})$ function; detailed discussions are given later.
> * *It might good to somewhere define RAL as regular and asymptotically linear.*
**Our Response:**
Thanks for the suggestion. We will add it in the appendix.
> * *The significance of the assumption on function $g_p$ in Theorem 3.2 is hard to follow.*
**Our Response:**
The proposed estimator belongs to the family of Z-estimators, and the conditions in Theorem 3.2 are standard regularity assumptions for Z-estimators. More details on Z-estimators and their regularity assumptions can be found in Chapter 5 of van der Vaart (1998).
> * *Overall, the connection between finding an importance weight estimator and finding a perpendicular space is not clear.*
**Our Response:**
Our main use of semiparametric models is to derive the complement of the nuisance tangent space (i.e., the perpendicular space) $\Lambda^{\perp}$. Based on semiparametric theory (Bickel et al. 1998, Tsiatis 2006), this space corresponds to the influence functions for estimating the parameter of interest $\boldsymbol{\omega}$. In other words, every element in $\Lambda^{\perp}$ corresponds to a RAL estimator of $\boldsymbol{\omega}$. Conversely, any function that is *not in this space* should ***not*** be used for estimating $\boldsymbol{\omega}$, in the interest of efficiency. For example, if $\boldsymbol{\phi}$ is a function such that $\boldsymbol{\phi}\not\in\Lambda^{\perp}$, then one should not use $\boldsymbol{\phi}$ but rather $\Pi(\boldsymbol{\phi}|\Lambda^{\perp})$. Here
$$
\boldsymbol{\phi}=\underbrace{\boldsymbol{\phi}-\Pi(\boldsymbol{\phi}|\Lambda^{\perp})}_{\in\Lambda}\oplus\underbrace{\Pi(\boldsymbol{\phi}|\Lambda^{\perp})}_{\in\Lambda^{\perp}},
$$
and $\Pi(\boldsymbol{\phi}|\Lambda^{\perp})$ is the projection of $\boldsymbol{\phi}$ onto the space $\Lambda^{\perp}$. Using the projection $\Pi(\boldsymbol{\phi}|\Lambda^{\perp})$ improves efficiency (i.e., empirically, decreases the MSE of the estimator). Also, if $\boldsymbol{\phi}\not\in\Lambda^{\perp}$, we cannot guarantee a RAL estimator, which makes the resulting estimator difficult to characterize.
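Heuristically, the efficiency gain follows from the Pythagorean identity implied by the orthogonal decomposition above:
$$
E\{\boldsymbol{\phi}\boldsymbol{\phi}^{T}\}
=E\big[\{\boldsymbol{\phi}-\Pi(\boldsymbol{\phi}|\Lambda^{\perp})\}\{\boldsymbol{\phi}-\Pi(\boldsymbol{\phi}|\Lambda^{\perp})\}^{T}\big]
+E\big[\Pi(\boldsymbol{\phi}|\Lambda^{\perp})\Pi(\boldsymbol{\phi}|\Lambda^{\perp})^{T}\big]
\succeq E\big[\Pi(\boldsymbol{\phi}|\Lambda^{\perp})\Pi(\boldsymbol{\phi}|\Lambda^{\perp})^{T}\big],
$$
so the projected function yields a (weakly) smaller asymptotic variance.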
> * *It would be good to elaborate on authors' obtain equation (11).*
**Our Response:**
The motivation of the $h_{\mathrm{ELSA}}(\boldsymbol{x})$ function starts from the score function with respect to $\boldsymbol{\omega}^{(-1)}=(\omega_1,\dots,\omega_{k-1})$, denoted by $\mathbf{S}_{\boldsymbol{\omega}}(\boldsymbol{x})$. The $i$-th element of the score function is given by
$$
[\mathbf{S}_{\boldsymbol{\omega}}(\boldsymbol{x})]_i\propto\frac{p_s(\boldsymbol{x})}{p_t(\boldsymbol{x})}\left\{p_s(y=i|\boldsymbol{x})-p_s(y=k|\boldsymbol{x})\right\},\quad i=1,\dots,k-1.
$$
We could use $\mathbf{S}_{\boldsymbol{\omega}}(\boldsymbol{x})$ directly to construct an influence function for a RAL estimator, but we can improve efficiency (i.e., reduce estimation error) by projecting it onto the perpendicular space $\Lambda^{\perp}$. Prioritizing computational efficiency and feasibility, we approximate the projection $\Pi(\mathbf{S}_{\boldsymbol{\omega}}(\boldsymbol{x})|\Lambda^\perp)$ with
$$
\Pi(S_i(\boldsymbol{x})\mid \Lambda^\perp)\propto \kappa(\boldsymbol{x})S_i(\boldsymbol{x}),
$$
where $\kappa(\boldsymbol{x})$ is a "bridging" function that needs to satisfy
$$
\frac{1-\kappa(\boldsymbol{x})}{\kappa(\boldsymbol{x})}=E_t\left\{\frac{1-\Pr(R=1|Y,\boldsymbol{X})}{\Pr(R=1|Y,\boldsymbol{X})}|\boldsymbol{x}\right\}.
$$
Under the label shift assumption, we further have
$$
\frac{1-\kappa(\boldsymbol{x})}{\kappa(\boldsymbol{x})}=\frac{1-\pi}{\pi}E_t\left\{\frac{p_t(Y)}{p_s(Y)}|\boldsymbol{x}\right\}.
$$
Next we show that the proposed function $h_{\mathrm{ELSA}}(\boldsymbol{x})$ is proportional to $\kappa(\boldsymbol{x})S_i(\boldsymbol{x})$. Because $\kappa(\boldsymbol{x})S_i(\boldsymbol{x})=\kappa(\boldsymbol{x})\dfrac{p_s(\boldsymbol{x})}{p_t(\boldsymbol{x})}\left\{p_s(y=i|\boldsymbol{x})-p_s(y=k|\boldsymbol{x})\right\}$, we only need to verify that the denominator of $h_{\mathrm{ELSA}}(\boldsymbol{x})$ is proportional to the reciprocal of $\kappa(\boldsymbol{x})\dfrac{p_s(\boldsymbol{x})}{p_t(\boldsymbol{x})}$: the denominator of $h_{\mathrm{ELSA}}(\boldsymbol{x})$ is
$$
\begin{aligned}
&\frac{E_s(\rho^2\mid \boldsymbol{x})}{\pi} + \frac{E_s(\rho\mid \boldsymbol{x})}{1-\pi}\\
\propto& \frac{p_t(\boldsymbol{x})}{p_s(\boldsymbol{x})}\frac{1-\kappa(\boldsymbol{x})}{\kappa(\boldsymbol{x})}\frac1{1-\pi} + \frac{p_t(\boldsymbol{x})}{p_s(\boldsymbol{x})}\frac1{1-\pi}\\
\propto& \frac{p_t(\boldsymbol{x})}{p_s(\boldsymbol{x})}\frac{1}{\kappa(\boldsymbol{x})}.
\end{aligned}
$$
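For concreteness, a hypothetical NumPy sketch of evaluating $h_{\mathrm{ELSA}}(\boldsymbol{x};\boldsymbol{\omega})$, assuming $\rho=\omega_Y$ so that $E_s(\rho|\boldsymbol{x})=\sum_y\omega_y p_s(y|\boldsymbol{x})$ and $E_s(\rho^2|\boldsymbol{x})=\sum_y\omega_y^2 p_s(y|\boldsymbol{x})$ (consistent with the display above); this is an illustration, not our package implementation:

```python
import numpy as np

def h_elsa(probs, w, pi):
    """Hypothetical sketch of the h_ELSA moment function in (11).
    probs: (n, k) source-model probabilities p_s(y|x) on each data point.
    w:     (k,)  candidate importance weights omega.
    pi:    fraction of source (labeled) samples in the pooled data.
    Assumes rho = omega_Y, so E_s(rho|x) and E_s(rho^2|x) are mixtures."""
    num = probs[:, :-1] - probs[:, [-1]]          # p_s(y=i|x) - p_s(y=k|x), i < k
    denom = probs @ (w ** 2) / pi + (probs @ w) / (1.0 - pi)
    return num / denom[:, None]                   # (n, k-1) moment function values
```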
> **Your Comment:** *How did the authors implement calibration and implement MLLS? Results about computational efficiency would depend a lot on these? Did authors use LBFGS to obtain convergence of the calibration parameters as it can be faster? Did the authors use the same number of CPU cores with all the methods?*
**Our Response:**
We ran MLLS and the calibrations with the Python package `abstention`, maintained by the authors of Alexandari et al. (2020). We believe this is the state-of-the-art implementation of MLLS. We checked the calibration code in `abstention`: for the optimization, it indeed uses the L-BFGS-B method for fast computation.
In our experiments, all methods were run in the same environment with the same number of CPU cores.
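For transparency, a minimal sketch of the MLLS objective we benchmark against (following Alexandari et al. 2020), optimized with L-BFGS-B; the helper below is a simplified, hypothetical stand-in, not the `abstention` implementation:

```python
import numpy as np
from scipy.optimize import minimize

def mlls_weights(probs_t, p_s_y):
    """Sketch of MLLS: maximize the target log-likelihood
    sum_j log sum_y w_y * p_s(y|x_j), with w_y = p_t(y)/p_s(y)
    parameterized through a softmax over the target label marginals.
    probs_t: (n, k) calibrated source-model probabilities on target data.
    p_s_y:   (k,)  source label marginals."""
    n, k = probs_t.shape

    def neg_log_lik(logits):
        q = np.exp(logits - logits.max())
        q /= q.sum()                       # candidate p_t(y) on the simplex
        mix = probs_t @ (q / p_s_y)        # sum_y w_y p_s(y|x_j) for each j
        return -np.log(np.clip(mix, 1e-12, None)).sum()

    res = minimize(neg_log_lik, np.zeros(k), method="L-BFGS-B")
    q = np.exp(res.x - res.x.max()); q /= q.sum()
    return q / p_s_y                       # estimated importance weights
```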
> **Your Comment:** *Will the authors be releasing an implementation of the ELSA estimator? It might be good to release an implementation of the approach either in the Appendix of the paper as a code or as a github repository.*
**Our Response:**
Yes, we have built a Python package for our proposed ELSA method. We will release it on GitHub after the paper is published.
## Reviewer jn6Z
### Weakness
> **Your Comment:** *Line 97 states that BBSE method replaces $x$ with $\hat{y}$. There maybe some error in this statement.*
**Our Response:**
Thanks for your comment. The replacement of $x$ with $\hat{y}$ was proposed in Lemma 1 and Proposition 5 of Lipton et al. (2018).
> **Your Comment:** *Line 98 and 99 state that $\hat{p}_s(y|x)$ is a trained model in the where clause. However, $\hat{p}_s(y|x)$ does not occur in the statement before the where clause statement.*
**Our Response:**
Thanks for your comment. $\hat{p}_s(y|x)$ could be any classification model trained with data from the source distribution. We will add more details in the revision for clarification.
> **Your Comment:** *Line 403: Furthermore, We -> we*
**Our Response:**
Thanks for pointing it out. We have corrected the typo in the revised manuscript.
### Limitation
> **Your Comment:** *The classification performance of the method is not known, even if the estimation error is low.*
**Our Response:**
Thanks for your comment. We agree that classification performance is an important evaluation metric. In the table below, we show comparison results for different adaptation methods and datasets. The metric reported here is the *classification accuracy* improvement of the domain-adapted model relative to the original model (Alexandari et al. (2020)); a sketch of the standard adaptation rule used for scoring follows the table. For example, the value $+5.86\%$ for ELSA under MNIST means that ELSA adaptation improves classification accuracy by $5.86\%$ relative to the model without adaptation. We fixed the sample size at $4500$ and used a Dirichlet shift with $\alpha=0.1$. It can be seen that ELSA outperforms the other existing methods across all datasets. We will include the classification performance comparison in our final manuscript.
| Adaptation | MNIST | CIFAR-10 | CIFAR-100 |
|------------|--------|----------|-----------|
| BBSE-hard | +5.74% | +6.47% | +16.10% |
| RLLS-hard | +5.74% | +6.47% | +17.12% |
| BBSE-soft | +5.76% | +6.51% | +16.45% |
| RLLS-soft | +5.77% | +6.52% | +17.66% |
| MLLS | +5.75% | +6.55% | +14.01% |
| ELSA | **+5.86%** | **+6.76%** | **+21.27%** |
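For reference, the adapted predictions are obtained via the standard label-shift correction, reweighting the source-model probabilities by the estimated importance weights and renormalizing; a hypothetical sketch:

```python
import numpy as np

def adapt_predictions(probs, w):
    """Standard label-shift correction: p_adapted(y|x) is proportional to
    w_y * p_s(y|x). probs: (n, k) source-model probabilities; w: (k,) weights."""
    adapted = probs * w
    return adapted / adapted.sum(axis=1, keepdims=True)

# e.g., accuracy = (adapt_predictions(probs, w_hat).argmax(1) == y_true).mean()
```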