# Rebuttal for NeurIPS 2021: GC
# To all reviewers:
We thank all the reviewers for their encouraging and positive feedback.
## Originality & Significance:
As pointed out by the reviewers, the main contribution of this paper is to provide novel moderate-confidence sample complexity bounds for the original Chernoff procedure in the active testing and active regression settings. The GC policy is a fairly simple algorithm with virtually no parameters and, unlike many current active learning algorithms, does not require tuning hyper-parameters to optimize performance. In the active testing setting, GC is optimal in the $D_0$ sense, that is, in the asymptotic sense: as $\delta \rightarrow 0$ in our bounds of Thm 2 and Thm 3, the second term with $D_0$ dominates. However, we study non-asymptotic guarantees as well. We show that in the moderate confidence regime the first term with $D_1$ may dominate (see Example 1). We further validate this in our lower bound of Thm 4, where we present an environment in which GC is minimax optimal in the moderate confidence regime. Note that [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full), [Naghshvar (2013)](https://arxiv.org/abs/1203.4626), and [Nitinawarat *et al* (2013)](https://vvv.ece.illinois.edu/papers/journal/niti-atia-veer-tac-2013.pdf) present no such lower bound. In the active regression setting we bound a more stringent notion of risk than the average loss under uniform sampling that is traditionally studied in the literature.
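Schematically, the Thm 2 upper bound has the form (constants and the lower-order third term suppressed; see the theorem statement for the precise expression)
$$
\mathbb{E}[\tau] \;\lesssim\; \frac{\beta(\delta)}{D_0} + \frac{\alpha}{D_1},
$$
where $\beta(\delta)$ grows as $\delta \rightarrow 0$ while $\alpha$ does not depend on $\delta$, so the $D_0$ term dominates asymptotically and the $D_1$ term can dominate at moderate confidence.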
## Experiments:
Our main contribution is to provide novel theoretical techniques that motivate different sampling patterns in the active testing and active regression setting. The main role of the experiment section is to
<ol>
<li>Shed light on this sampling pattern itself and show that GC without exploration can fail spectacularly (Example 1, Fig. 1a), whereas the excessive exploration conducted by Two-phase and GCE can lead to bad performance (3 group setting, Fig. 1b)</li>
<li>Show how these sampling patterns adapt to structure present in the data. In the non-linear regression setting (Fig. 1c), GC leverages the structure of the problem by sampling the most informative action</li>
<li>Show that GC correctly identifies the function boundaries by leveraging the structure of the problem and learns the weights of a neural network model (Fig. 1f)</li>
<li>Show that fully adaptive GC with no hyper-parameter tuning has competitive performance in active regression settings against the SOTA methods ActiveS and EMCM (Figs. 1g, 1h)</li>
</ol>
# Reviewer 2gUs:
We thank the reviewer for giving such encouraging comments.
1. (i) [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full) had already shown the sampling strategy to be asymptotically optimal as the error probability $\delta \rightarrow 0$. The problem-dependent constant $D_0$ is the same as obtained by Chernoff. In this paper we have obtained an upper bound that is written as a sum of three terms and is valid for any $\delta >0$. One term scales as $1/D_0$ as $\delta \rightarrow 0$ and recovers the asymptotic upper bound. Another term does not involve $\delta$ and is proportional to $1/D_1$. While the expression is only an upper bound, for the instance in Example 1 the term with $D_1$ is much larger than the term with $D_0$. Empirically, we observe that Chernoff sampling indeed requires many samples to terminate, far more than a uniform sampling baseline. This supports the validity of the term containing $D_1$ in the sample complexity expression. We also obtain a minimax lower bound with the help of the instance constructed in (6). This lower bound is valid for any $\delta > 0$ and matches the upper bound expression. This implies that for certain instances, Chernoff sampling can be the best one can do even at moderate values of $\delta$. However, note that this is weaker than instance-wise optimality; we do not claim that Chernoff sampling is always optimal.
(ii) We will clarify the relationship between the various problem-dependent constants in the revised version. $D_0 \geq D_0'$ since $D_0$ is the maximized objective value while $D_0'$ is the objective value evaluated at the uniform sampling proportion. $D_1$ and $D_1'$ are not directly comparable, but both $D_1 \leq D_0$ and $D_1' \leq D_0'$ hold.
2. (i) Thank you for the suggestion; our focus in Theorem 6 was to bound the expected error evaluated under the worst-case loss metric $P_t$. A key technical challenge was to analyze the error under the constantly changing sampling proportion of the GC strategy. To tackle it, we proved new lemmas that can be combined with the existing proof strategy of Theorem 31 in [Frostig *et al* (2015)](http://proceedings.mlr.press/v40/Frostig15.html). Note that both [Frostig *et al* (2015)](http://proceedings.mlr.press/v40/Frostig15.html) and [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html) analyze i.i.d. sampling strategies. Theorem 1 of [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html) bounds the expected error under the $L_U$ metric. Under the same assumptions made in that work, we can also obtain a bound for the $L_U$ metric under the GC sampling strategy (key changes from our existing proof are highlighted next). We will include this as a corollary in the revised version.
(ii) First recall that $P_{t}(\boldsymbol{\theta})=\frac{1}{t} \sum_{s=1}^{t} \mathbb{E}_{I_{s} \sim \mathbf{p}_{\widehat{\boldsymbol{\theta}}_{s-1}}}\left[L_{s}(\boldsymbol{\theta}) \mid \mathcal{F}^{s-1}\right]$. With the additional assumption that $\nabla^2 P_s\left(\boldsymbol{\theta}^{*}\right) \succeq c \nabla^2 L_{U}\left(\boldsymbol{\theta}^{*}\right)$ for some $c < 1$ at all $s\in[t]$, and that GC conducts some forced exploration like the $\bar{\Gamma}$ sampling of ActiveS, we can bound the expected uniform loss $L_U$ as follows:
\begin{align*}
\left(1-c^{\prime} \epsilon_{t}\right) \frac{\sigma_1^{2}}{t} - O\left(\frac{1}{t^{p/2}}\right) \leq \mathbb{E}\left[L_{U}\left(\widehat{\boldsymbol{\theta}}_{t}\right)-L_{U}\left(\boldsymbol{\theta}^{*}\right)\right] \leq \left(1+c^{\prime} \epsilon_{t}\right) \frac{\sigma_1^{2}}{t} + O\left(\frac{1}{t^p} \right)
\end{align*}
where $\sigma_1^{2}:=\mathbb{E}\left\|\nabla \widehat{P}_{t}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\left(\nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}}^{2}$, $\widehat{P}_{t}(\boldsymbol{\theta})=\frac{1}{t} \sum_{s=1}^{t} L_{s}(\boldsymbol{\theta})$, and $p \geq 2$.
This can be shown as follows.
In the proof of Theorem 6, steps 1, 2, 3, and 4 remain as they are. From step 5 onward we replace the function $P_t(\boldsymbol{\theta}^*)$ with $L_U(\boldsymbol{\theta}^*)$. It then follows that, for some $\widetilde{\mathbf{z}}_{t}$ between $\boldsymbol{\theta}^{*}$ and $\widehat{\boldsymbol{\theta}}_{t}$,
\begin{align*}
L_{U}\left(\widehat{\boldsymbol{\theta}}_{t}\right)-L_{U}\left(\boldsymbol{\theta}^{*}\right) &= \frac{1}{2}\left(\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}^{*}\right)^{\top} \nabla^{2} L_{U}\left(\widetilde{\mathbf{z}}_{t}\right)\left(\widehat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}^{*}\right), \\
\left\|\nabla^{2} L_{U}\left(\widetilde{\mathbf{z}}_{t}\right)-\nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right)\right\|_{*} &\leq L_{3}\left\|\widetilde{\mathbf{z}}_{t}-\boldsymbol{\theta}^{*}\right\|_{\nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right)} \leq \epsilon_{t}.
\end{align*}
Using the assumption that $\nabla^2 P_s\left(\boldsymbol{\theta}^*\right) \succeq c \nabla^2 L_{U}\left(\boldsymbol{\theta}^{*}\right)$ for all $s$, we can show that $\left\Vert \nabla^{2} \widehat{P}_{t}\left(\widetilde{\boldsymbol{\theta}}_{t}\right)-\nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right)\right\Vert_{*} \leq \epsilon_{t}$. It then follows that
\begin{align*}
&\left(1-\epsilon_{t}\right) \nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right) \preceq \nabla^{2} \widehat{P}_{t}\left(\widetilde{\boldsymbol{\theta}}_{t}\right) \preceq\left(1+\epsilon_{t}\right) \nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right) \\
&\left(1-\epsilon_{t}\right) \nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right) \preceq \nabla^{2} L_{U}\left(\widetilde{\mathbf{z}}_{t}\right) \preceq\left(1+\epsilon_{t}\right) \nabla^{2} L_{U}\left(\boldsymbol{\theta}^{*}\right)
\end{align*}
Next we follow the same steps 7, 8, 9, and 10 in the proof of Theorem 6 to get the claim stated above. These steps use the Taylor series approximation and the min-max theorem, so replacing $P_t$ with $L_U$ does not alter the result.
3. The active testing experiments conducted were sufficient to bring out the main differences between GC and T2 on the one hand and Two-phase and other policies on the other. We show that GC without exploration can fail spectacularly (Example 1, Fig. 1a), whereas the excessive exploration conducted by Two-phase and GCE can lead to bad performance (3 group setting, Fig. 1b). In the active regression setting we agree that the confidence intervals overlap, but the main goal in this setting is to show that the fully adaptive sampling procedure GC (with no hyper-parameter tuning) performs competitively w.r.t. the other SOTA methods ActiveS and EMCM. ActiveS is a two-phase algorithm that requires tuning to choose $\alpha$ and an appropriate value for $m_1$, which must be greater than a problem-dependent quantity. EMCM does not provide any theoretical guarantees and requires tuning several hyper-parameters. We show that the Chernoff proportion is actually sparse on a real-life dataset (see Figure 1i), indicating that it has identified the most informative samples. Additionally, it has sufficient diversity in its sampling proportion to perform well (see Figure 1g), whereas ActiveS and EMCM conduct forced exploration to include diversity in their sampling procedures.
# Reviewer 4ege:
We thank the reviewer for the encouraging comments.
1. Thank you; that is a typo in line 144. The gap must hold for every hypothesis $\theta \neq \widehat{\theta}(t)$, as per the stopping rule of [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full), which we also use.
2. The $\delta$-PAC guarantee for any sampling strategy is ensured by the stopping condition. This is described in Lemma 9. Since T2 uses the same stopping condition as GC, it is $\delta$-PAC. We will clarify this in Theorem 3.
3. Note that for hypothesis $\boldsymbol{\theta}'$ to be different from hypothesis $\boldsymbol{\theta}^\ast$, there must be at least one $i \in [n]$ for which $\mu_i(\boldsymbol{\theta}') \neq \mu_i(\boldsymbol{\theta}^\ast)$; otherwise, given the set of available actions, they are the same hypothesis. Since $D_{e}:=\min_{\boldsymbol{\theta}^{\prime} \neq \boldsymbol{\theta}^{*}} \sum_{i=1}^{n} \frac{1}{n}\left(\mu_{i}\left(\boldsymbol{\theta}^{\prime}\right)-\mu_{i}\left(\boldsymbol{\theta}^{*}\right)\right)^{2}$, we see that $D_e > 0$. We will clarify this in the revised version.
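As an illustration, here is a minimal sketch (with a hypothetical table of means, not taken from the paper) of how $D_e$ can be computed and checked to be strictly positive:
```python
import numpy as np

# Hypothetical means: mu[k, i] = mean of action i under hypothesis k.
# Row 0 plays the role of the true hypothesis theta*.
mu = np.array([[0.9, 0.1, 0.1],   # theta*
               [0.1, 0.9, 0.1],   # theta'
               [0.1, 0.1, 0.9]])  # theta''

# D_e = min over theta' != theta* of the average squared mean gap.
gaps = ((mu[1:] - mu[0]) ** 2).mean(axis=1)  # one value per alternative hypothesis
D_e = gaps.min()
# D_e > 0 as long as every alternative differs from theta* on at least one action.
print(D_e)
```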
4. In Theorem 6 we have proved a sandwich bound similar to the bound shown by [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html). We have shown that the expected error decays with time, and the leading factor determining the rate of decay is $\sigma^2$, the expected value of a matrix-weighted norm. In particular, the norm is weighted by the inverse Hessian of the worst-case loss evaluated at $\theta^*$. This interpretation is similar to the sample complexity bounds of [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html), where the Hessian matrix (in their setup the Hessian is the Fisher information matrix) plays an important role.
We can look at the non-linear regression setting to get a sense of the rate of decay. Our simulated example of non-linear logistic regression fits the requirements of Theorem 6. We can simplify $\sigma^2 :=\mathbb{E}\left[\frac{1}{2}\left\|\nabla \widehat{P}_{t}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\left(\nabla^{2} P_{t}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}}\right]$ and upper bound it by $\lambda_1 d C_4 \eta$, where $\lambda_1 = \max_{s \in [t]} \lambda_s$ and $\lambda_s$ is the objective value of the $\max\min$ eigenvalue optimization of GC given in Theorem 5 at time $s$. We present this discussion of $\sigma^2$ in Appendix A.9.1. In our non-linear regression example this boils down to $\sigma^2 = O(d)$. So the upper bound on the expected worst-case loss in Theorem 6 scales as
$$
\mathbb{E}\left[P_{t}\left(\widehat{\boldsymbol{\theta}}_{t}\right)-P_{t}\left(\boldsymbol{\theta}^{*}\right)\right] \leq O\left(\frac{d \sqrt{\log (d t)}}{t}+\frac{R}{t^{2}}\right)
$$
We will mention that $p\geq 2$ in the statement of Theorem 6 in the revised version; thank you for catching it. The $\max\min$ eigenvalue optimization can be solved efficiently as it can be formulated as a linear program; we have expanded on this in Appendix A.7.1.
5. Great question. We made a mistake in line 240: it should state $D_0 > D_1$ and $D_0 > D_e$ separately. We will change this in the revised version. $D_e$ and $D_1$ are not comparable because $D_{1}:=\min _{\left\{\mathbf{p}_{\boldsymbol{\theta}}: \boldsymbol{\theta} \in \boldsymbol{\Theta}\right\}} \min _{\boldsymbol{\theta}^{\prime} \neq \boldsymbol{\theta}^{*}} \sum_{i=1}^{n} p_{\boldsymbol{\theta}}(i)\left(\mu_{i}\left(\boldsymbol{\theta}^{\prime}\right)-\mu_{i}\left(\boldsymbol{\theta}^{*}\right)\right)^{2}$ is defined as a minimum over the verification proportions of all the hypotheses; no such verification proportion is present in the definition of $D_e$. We give an intuitive explanation for the performance of GC and GCE below. In the 3 group example with 50 actions, the first action is the most discriminative between every pair of hypotheses. The solution to the max-min optimization of eq (3) puts maximum sampling mass on the first action for any $\widehat{\theta}(t)$. So GC quickly focuses on the first action and finds the best hypothesis. However, GCE (even though it solves the same max-min optimization in eq (3)) performs a forced exploration over the large action set and wastes samples exploring non-informative actions before focusing on the first action. Now consider the first example, where there are just 3 actions but hypotheses $\theta'$ and $\theta''$ are not well separated under the mean value functions for the second action. In that case, solving the max-min optimization in eq (3) leads us to sample action 1 if $\widehat{\theta}(t) = \theta^*$ or action 2 if $\widehat{\theta}(t) \in \{\theta', \theta''\}$. So initially GC may get stuck verifying either $\theta'$ or $\theta''$. This can be remedied with a small amount of exploration, and consequently GCE drastically outperforms the non-exploring policy GC. The given theorem statements (Theorem 2 for GC, Theorem 3 for T2, Theorem 4 for GCE) are upper bounds and can be loose in certain instances. Empirically we observe that in the 3 group setting GC performs better than GCE.
6. The confidence intervals that we plotted are just mean ± std. dev. Therefore a wider CI means more variability in performance across trials; we will add this to the caption. The GC CI is overshadowed by the other, larger CIs, but across the several experiments we consistently observe that GC has smaller variability across trials, which means that GC is more robust. Our statement that GC performs well is based on the mean curve. We will include larger plots with legible colors in the revised version's supplementary.
### Soft Recommendations:
1. We will add $\eta$ and $\eta_0$ to the table of symbols.
2. To denote the rate of convergence in Thm 6 we will use a symbol different from $p$.
3. The regularity condition 3c implies strong convexity, and 3d implies Lipschitz continuity of the Hessian matrix. [Chaudhuri *et al* (2015)](https://arxiv.org/pdf/1506.02348.pdf) also makes the same assumptions in their proof of Theorem 1.
4. The piecewise nature of the plots in Figures 1f, 1g, and 1h arises because we run all the algorithms in batches, i.e., we update the models after sampling batches of 100 data points. Note that these are large real-life datasets, and training in batches speeds up the run-time considerably. For performance without batching, see Figure 1d.
5. We will correct the typos mentioned. We again thank the reviewer for taking the time to point out the typos in the paper.
### Limitations
We will include a discussion of limitations in the revised version. Indeed, GC may not perform well on all instances, but it is a relatively easy baseline to implement. Nevertheless, its sampling proportion is updated before collecting each sample, which increases the computational cost (one fix for this is Batched GC, which solves the optimization only after collecting a batch of samples; see the sketch below). Assumption 2 is a strong assumption, and other works have removed it in the context of active testing by modifying the GC strategy. Our theoretical guarantees require several assumptions which may not always be satisfied; in addition to the regularity assumptions, we also need smoothness of the mean function for Theorem 5.
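For concreteness, here is a minimal sketch of the Batched GC idea (illustrative pseudo-Python; `solve_chernoff_proportion`, `env.observe`, and `update_estimate` are hypothetical stand-ins for solving the max-min optimization of eq (3), querying an action, and re-fitting $\widehat{\theta}$):
```python
import numpy as np

def batched_gc(env, theta_hat, n_actions, batch_size=100, n_batches=50, seed=0):
    """Sketch of Batched GC: re-solve the sampling-proportion optimization
    once per batch instead of before every single sample."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(n_batches):
        # In plain GC this optimization is re-solved before every sample;
        # batching amortizes its cost over batch_size samples.
        p = solve_chernoff_proportion(theta_hat)    # hypothetical solver for eq (3)
        actions = rng.choice(n_actions, size=batch_size, p=p)
        for i in actions:
            history.append((i, env.observe(i)))     # hypothetical environment query
        theta_hat = update_estimate(history)        # hypothetical estimator update
    return theta_hat
```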
# Reviewer KpWc:
We thank the reviewer for the encouraging comments.
1. You are correct: in L26, $\epsilon$ is the additive noise. In L56 we were describing the setup of the original [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full) paper, where indeed the probability distribution of each response was assumed to be known. We will clarify these points in the revised version.
2. In L59, $p(i)$ is the probability of selecting action $i$. We will mention this explicitly.
3. In L107, $I_t$ indicates the action selected at time $t$; we only observe the reward of the selected action. We will change the wording in that line to mention explicitly that the action is being recorded.
4. We will correct the typo in L120.
5. For L150, we will define $\eta_0$ before its first use.
6. In L246 we introduce the ball of radius $\epsilon$. We will use $r$ to denote the radius instead of $\epsilon$.
7. Thank you for bringing these references to our attention. The work of [Castro *et al* (2005)](https://nowak.ece.wisc.edu/ECE-05-03.pdf) gives minimax guarantees for learning piecewise constant or piecewise smooth classes of functions, i.e., they build an estimate of the unknown mean function over its entire domain and measure its performance by the $L_2$ loss. Their error bound scales as $O(\frac{1}{ t^{1/(d-1+1/d)}})$ where $d$ is the dimension. On the other hand, we assume that the mean function for each action is parameterized by $\theta$ and we only wish to estimate the unknown $\theta^\ast$. Thus our problem is simpler, and indeed our error bound decreases at the faster rate of $O(d\sqrt{\log(dt)}/t)$.
The work of [Goetz *et al* (2018)](https://papers.nips.cc/paper/2018/file/dc4c44f624d600aa568390f1f1104aa0-Paper.pdf) is in a framework similar to that of [Castro *et al* (2005)](https://nowak.ece.wisc.edu/ECE-05-03.pdf). Notably, they obtain a similar rate of $O(1/t^{2/(2 + d)})$. We will include this discussion in our related work.
# Additional clarifications:
## Reviewer 2gUs, point 2:
(i) We cannot directly compare our bound with the bound in [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html) as they involve different terms. But if $p_{\hat{\theta}(t)} \approx p_{\theta^\ast}$ (which will eventually be the case, as $\hat{\theta}(t) \rightarrow \theta^\ast$ with an increasing number of samples) then the worst-case loss $P_t$ is an upper bound on the loss under the uniform distribution, i.e., $L_U$. Consequently, if we only sum the losses at times when $p_{\hat{\theta}(t)} \approx p_{\theta^\ast}$, then our bound also holds for the expected loss evaluated using $L_U$.
(ii) However, it cannot be that $p_{\hat{\theta}(t)} \approx p_{\theta^\ast}$ at all times $t$. For the GC sampling strategy it is not clear if we can obtain a bound similar to the one in [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html) because, unlike in their paper, we do not explicitly add uniform exploration to the GC strategy. Empirically, we have tried adding a little uniform exploration to GC (we call it GCE) and observed that it performs similarly to GC. For GCE we can follow the same reasoning as in [Chaudhuri *et al* (2015)](https://proceedings.neurips.cc/paper/2015/hash/ca9c267dad0305d1a6308d2a0cf1c39c-Abstract.html) and obtain an error bound on the expected loss under the $L_U$ metric. We will add this as a remark in the revised version.
## Reviewer KpWc, point 7:
Thank you for bringing these references to our attention. The work of [Castro *et al* (2005)](https://nowak.ece.wisc.edu/ECE-05-03.pdf) is in the non-parametric setup, where the error rates for learning are $O(t^{-\alpha_{\mathcal{F}}})$ and the exponent $\alpha_{\mathcal{F}}$ decreases as the complexity of the hypothesis class of functions increases (the error in learning a more complicated function decreases more slowly). For example, if the hypothesis class consists of Hölder smooth functions defined on the domain $[0,1]^d$, then $\alpha_{\mathcal{F}}=1/(d - 1 + 1/d)$. In contrast, our work is in the parametric setting, where we only want to estimate a single parameter $\theta^\ast$ and $\alpha_{\mathcal{F}}=1$. Our contribution is to optimize and show that the problem-dependent constant multiplying the $1/t$ term in the error bound is better for active learning by the GC strategy. It is an interesting avenue of future work to obtain rates for very flexible parametric models such as neural networks.
# Reviewer qYWK:
We thank the reviewer for their helpful comments.
2. <b>Execution:</b> We introduced the parameter $\eta > 0$ in Assumption 1 so that the reward from any action under any hypothesis has bounded range, i.e., $Y_{s} \in [-\sqrt{\eta} / 2, \sqrt{\eta} / 2]$ almost surely at every round $s$ for some fixed $\eta>0$. We will define the parameter $\eta$ in the notation table.
4. <b>Theorem 1:</b> The reviewer is correct in noting that the proposed strategy is only asymptotically optimal: the strategy attains the optimal expected stopping time as the probability of error under each hypothesis tends to 0, as shown in [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full). We use this theorem only to justify the sampling strategy, and it does not affect any of the other results in the paper. We will clarify this in the revised version.
5. <b>Theorem 2:</b> We will state before the theorem that the steps in the proof follow a similar line of reasoning as in [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full). Our theorem statement exhibits moderate confidence terms due to extra analysis steps carried out in a manner similar to [Naghshvar (2013)](https://arxiv.org/abs/1203.4626). Theorem 2 holds in the sub-Gaussian setting, whereas the original [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full) analysis required the distributions to be known.
10. <b>Theorem 5:</b> Thank you for pointing out the errors; we will include the version from the appendix in the revised paper.
11. <b>Typo:</b> L277: You are correct. We will rectify it to show that $$
\sqrt{\frac{p \log (d t)}{t}} \leq c^{\prime} \min \left\{\frac{1}{L_{1} L_{3}}, \frac{\operatorname{diameter}(B)}{L_{1}}\right\}
$$
so that $t$ is on one side of the inequality. We show this step in our proof in line 1061 of Appendix A.9.5.
# Reply to Reviewer qYWK:
We again thank the reviewer for being so positive about the paper and for acknowledging that the finite-time bounds are important and nice. Regarding your latest comments:
1. Our motivation for the name "generalized" was to point out that we extend the original multiple hypothesis testing framework proposed by [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full) to the smoothly parameterized setting. But since the reviewer has brought up some important new points, *we are ready to change the name of the algorithm to Modified Chernoff Sampling*. We will change the name of the algorithm in the paper, the references, and the relevant plots as well. We will email the organizers to *change the title* to "Modified Chernoff Sampling for Active Testing and Active Regression", as this is not possible through the author interface. We are unsure if the change would propagate to the title currently displayed on the OpenReview forum. Either way, if accepted, we will change the title when preparing the camera-ready version.
2. The bounds characterize how instance-specific quantities, such as $D_0$ and $D_1$, affect the sample complexity of an instance. This provides new insight into what makes problems easier or more challenging. Our algorithms automatically adapt to the instance-specific complexity, so practitioners can use them without prior knowledge of the problem instance. The new bounds clarify which instance-specific quantities may dominate finite-sample performance, providing additional insight beyond the purely asymptotic characterization of Chernoff's original work in [Chernoff (1959)](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-30/issue-3/Sequential-Design-of-Experiments/10.1214/aoms/1177706205.full).
3. Note that in the finite hypothesis case, when the means of each action under each hypothesis are known, we can simply run a Python script to evaluate $D_1$; in Example 1 we calculated $D_1$. Since the true hypothesis $\theta^*$ is not known in a practitioner's setting, we cannot calculate $D_0$ under the true hypothesis $\theta^*$. However, we can still calculate all possible values of $D_0$ associated with each of the hypotheses under consideration. In the worst case, the practitioner knows that the algorithm's sample complexity is bounded in terms of the largest value of $\beta(\delta)/D_0 + \alpha/D_1$. We will update our draft to include these new comments.
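To illustrate, here is a minimal sketch of such a computation (the means `mu` and the verification proportions `p` below are hypothetical placeholders; in the paper the $\mathbf{p}_{\theta}$ come from the max-min optimization in eq (3), whereas uniform proportions are used here purely for illustration):
```python
import numpy as np

# Hypothetical known means: mu[k, i] = mean of action i under hypothesis k.
mu = np.array([[0.9, 0.1, 0.1],
               [0.1, 0.9, 0.1],
               [0.1, 0.1, 0.9]])
J, n = mu.shape
# Placeholder verification proportions p[k], one per hypothesis.
p = np.full((J, n), 1.0 / n)

def D1(mu, p, star):
    """min over verification proportions p_theta and alternatives theta' != theta*
    of sum_i p_theta(i) * (mu_i(theta') - mu_i(theta*))^2."""
    best = np.inf
    for k in range(J):          # outer min over {p_theta : theta in Theta}
        for alt in range(J):    # inner min over theta' != theta*
            if alt != star:
                best = min(best, np.sum(p[k] * (mu[alt] - mu[star]) ** 2))
    return best

# theta* is unknown in practice, so evaluate D_1 for every candidate theta*
# and use the most pessimistic value in the sample-complexity bound.
print([D1(mu, p, star) for star in range(J)])
```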
Since we have now addressed and agreed to the modifications proposed by the reviewer, we hope that the reviewer will increase their score.