# LEAQI regret analysis (notes)
$\newcommand{\eps}{\epsilon}$
$\newcommand{\pistar}{\pi^*}$
$\newcommand{\piref}{\pi^h}$
$\newcommand{\piat}{\tilde\pi}$
$\newcommand{\Ep}{\mathbb{E}}$
$\newcommand{\regret}{\textit{reg}}$
$\renewcommand{\brack}[1]{\left\{\begin{array}{ll}#1\end{array}\right.}$
$\newcommand{\minpi}{\min_{\pi \in \Pi}}$
$\newcommand{\avgi}{\frac 1 N \sum_{i=1}^N}$
$\newcommand{\epspol}{\eps_{\text{pol-approx}}}$
$\newcommand{\epsdc}{\eps_{\text{dc-approx}}}$
$\newcommand{\regpol}{\regret_{\text{pol}}}$
$\newcommand{\regdc}{\regret_{\text{dc}}}$
$\newcommand{\err}{\textit{err}}$
$\newcommand{\cH}{\mathcal{H}}$
$\newcommand{\Epi}{\Ep_{s \sim d_{\pi_i}}}$
$\newcommand{\FN}{m^-}$
$\newcommand{\FP}{m^+}$
Assumptions:
- $\ell$ is symmetric, strongly convex, bounded, and satisfies the triangle inequality (these clearly hold for squared loss; not a very strong assumption)
- $\piat_i$ is "more realizable" than $\pistar$ for all $i$; namely,
$\minpi \Epi \ell(s, \pi, \piat_i) \leq \minpi \Epi \ell(s, \pi, \pistar)$ for all $i$.
Intuitively, $\pistar$ is e.g. "human annotation", while $\piat_i$ is a mixture of $\pistar$ and the heuristic. If the heuristic is computational, it seems plausible that it would be no harder to learn computationally than $\pistar$ (strong assumption; see the example after this list)
- assuming that policy learner is no regret for all sequences of state/action pairs (no big deal)
- assuming that difference classifier is no regret for all sequences of state/label pairs (no big deal)
- assuming that the active learner is no regret even when the labels are adversarial (or at least affected by the active learner's previous decisions) -- this is the biggest gap between theory and practice, since the Cesa-Bianchi algorithm doesn't satisfy this -- ideally ping Akshay and see if any algorithm does
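A degenerate but instructive instance of the "more realizable" assumption, as a sanity check (a sketch, assuming $\piref \in \Pi$, $\ell(s, \pi, \pi) = 0$, and that on episode $i$ the query rule never fires, so $\piat_i(s) = \piref(s)$ everywhere):
$\minpi \Epi \ell(s, \pi, \piat_i) \leq \Epi \ell(s, \piref, \piat_i) = \Epi \ell(s, \piref, \piref) = 0 \leq \minpi \Epi \ell(s, \pi, \pistar)$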
Theorem. For LEAQI, there exists a policy $\pi \in \pi_{1:N}$ such that
$\Ep_{s \sim d_\pi} \ell(s, \pi, \pistar) \leq \epspol + T\epsdc + \regpol(N) + T\regdc(N) + O(\sqrt{T(\FN+1)/N})$
and therefore, if both the policy learner and the difference-classifier learner are no regret, we have:
$\lim_{N \rightarrow \infty} \Ep_{s \sim d_\pi} \ell(s, \pi, \pistar) \leq \epspol + T\epsdc$
meaning that LEAQI achieves no regret.
where:
$\epspol = \min_{\pi \in \Pi} \avgi \Epi \ell(s, \pi, \pistar)$
$\epsdc = \min_{h \in \cH} \avgi \Epi \err(s, h, \pistar(s)\neq \piref(s))$
$\FN$ is a bound on the number of false negatives the difference classifier would make if trained on fully observed data,
and
$\piat_i(s) = \brack{ \pistar(s) & \text{if } \textit{STAP}(s, \piref(s), h_i(s)) = \text{disagree} \\ \piref(s) & \text{otherwise}}$
and we run no regret learners that ensure:
$\avgi \Epi \ell(s, \pi_i, \piat_i) - \min_{\pi \in \Pi} \avgi \Epi \ell(s, \pi, \piat_i) \leq \regpol(N)$
$\avgi \Epi \err(s, h_i, \pistar(s)\neq\piref(s)) - \min_{h \in \cH} \avgi \Epi \err(s, h, \pistar(s)\neq\piref(s)) \leq \regdc(N)$
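To make the setup above concrete, here is a minimal sketch of one episode of a LEAQI-style loop, showing how $\piat_i$ is assembled from the STAP decision and fed to both no-regret learners. All names (`env`, `policy_learner`, `dc_learner`, `stap_query`, `heuristic`, `expert`) and their interfaces are hypothetical placeholders, not the actual implementation; in particular, updating the difference classifier only when the expert is queried is an assumption about when its true label is observed.

```python
def run_episode(env, policy_learner, dc_learner, stap_query, heuristic, expert, T):
    """One episode of a LEAQI-style loop (hypothetical interfaces, for illustration only)."""
    s = env.reset()
    for _ in range(T):
        a_h = heuristic(s)                        # \piref(s), always available
        pred = dc_learner.predict(s)              # h_i(s): predicted disagreement with the expert
        if stap_query(s, a_h, pred):              # STAP(s, \piref(s), h_i(s)) = "disagree"
            a_star = expert(s)                    # query \pistar(s)
            target = a_star                       # \piat_i(s) = \pistar(s)
            dc_learner.update(s, a_star != a_h)   # true disagreement label is observed only here
        else:
            target = a_h                          # \piat_i(s) = \piref(s)
        policy_learner.update(s, target)          # online (no-regret) update toward \piat_i(s)
        s = env.step(policy_learner.act(s))       # roll in with the learner's own policy \pi_i
```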
Proof.
\begin{align}
&\min_{\pi \in \pi_{1:N}} \Ep_{s \sim d_\pi} \ell(s, \pi, \pistar)\\
&\leq \avgi \Epi \ell(s, \pi_i, \pistar) \\
&= \avgi\Epi\ell(s, \pi_i, \pistar) + \avgi\Epi\ell(s, \pi_i, \piat_i) - \avgi\Epi\ell(s, \pi_i, \piat_i) \\
&\leq \avgi\Epi\ell(s, \pi_i, \piat_i) + \avgi\Epi\ell(s, \piat_i, \pistar) \text{ // triangle inequality} \\
&= \avgi\Epi\ell(s, \pi_i, \piat_i) + \avgi\Epi\ell(s, \piat_i, \pistar) - \epspol + \epspol \\
&= \avgi\Epi\ell(s, \pi_i, \piat_i) + \avgi\Epi\ell(s, \piat_i, \pistar) - \minpi\avgi\Epi\ell(s, \pi, \pistar) + \epspol \\
&\leq \avgi\Epi\ell(s, \pi_i, \piat_i) + \avgi\Epi\ell(s, \piat_i, \pistar) - \minpi\avgi\Epi\ell(s, \pi, \piat_i) + \epspol \text{ // assumption that piat is easier to imitate than pistar} \\
&\leq \regpol(N) + \epspol + \avgi\Epi\ell(s, \piat_i, \pistar) \text{ // no regret on pi} \\
&\leq \regpol(N) + \epspol + T \avgi \Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{T(\FN+1)/N} \text{ // by lemma} \\
&= \regpol(N) + \epspol + \avgi T\Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{T(\FN+1)/N} - T\epsdc + T\epsdc \\
&= \regpol(N) + \epspol + T \avgi \Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{T(\FN+1)/N} \\
&\quad - T \min_{h \in \cH} \avgi \Epi \err(s, h, \pistar(s)\neq \piref(s)) + T\epsdc \\
&= \regpol(N) + \epspol + T \regdc(N) + T\epsdc + 4\sqrt{T(\FN+1)/N} \text{ // no regret on dc}\\
\end{align}
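To connect the last line of the chain with the limit claim in the theorem statement, here is the term-by-term behavior as $N \rightarrow \infty$ (a sketch, assuming $T$ is fixed and $\FN = o(N)$):
\begin{align}
&\regpol(N) \rightarrow 0 \text{ // policy learner is no regret} \\
&T\regdc(N) \rightarrow 0 \text{ // difference classifier is no regret} \\
&4\sqrt{T(\FN+1)/N} \rightarrow 0 \text{ // } \FN = o(N) \\
&\Rightarrow \min_{\pi \in \pi_{1:N}} \Ep_{s \sim d_\pi} \ell(s, \pi, \pistar) \leq \epspol + T\epsdc + o(1)
\end{align}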
Lemma. Let $J \leq NT$ be the number of calls to the difference classifier, let $i(j)$ be the episode in which call $j$ is made, and let $s_j$ be the corresponding state.
\begin{align}
&\sum_{j=1}^J \ell(s_j, \piat_{i(j)}, \pistar) \\
&= \sum_{j=1}^{\FN-1} \ell(s_j, \piat_{i(j)}, \pistar) + \sum_{j=\FN}^{J}\ell(s_j, \piat_{i(j)}, \pistar) \\
&\leq \FN + \sum_{j=\FN}^{J}\ell(s_j, \piat_{i(j)}, \pistar) \text{ // loss bounded by 1} \\
&\leq \FN + \FP - \FN + 4\sqrt{J(\FN+1)} \text{ // STAP theorem} \\
&= \FP + 4\sqrt{J(\FN+1)} \\
&\leq T \sum_{i=1}^N \Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{J(\FN+1)}\\
&\leq T \sum_{i=1}^N \Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{NT(\FN+1)} \text{ // } J \leq NT
\end{align}
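As used at the "// by lemma" step in the proof: dividing the lemma's conclusion by $N$ gives
$\frac{1}{N}\sum_{j=1}^{J} \ell(s_j, \piat_{i(j)}, \pistar) \leq T \avgi \Epi\err(s, h_i, \pistar(s)\neq\piref(s)) + 4\sqrt{T(\FN+1)/N}$
which is the form invoked in the proof once the left-hand side is identified (in expectation, with at most $T$ calls per episode) with $\avgi \Epi \ell(s, \piat_i, \pistar)$.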