# Ensemble
**Notations**
Let $|\mathcal{A}|$ be the cardinality of a set $\mathcal{A}$.
Let $B(x, \epsilon)$ be the $\ell_2$ ball centered at $x$ with radius $\epsilon$.
Let $\mathbb{O}_N = \{1, \ldots, 2N+1\}$ for $N \in \mathbb{N}_{+}$, an index set containing an odd number of integers.
**Definition 1 (Gloro Voting)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{Y}, n\in\mathbb{O}_N$, are gloronets, where $\mathcal{Y} = \mathcal{C} \cup \{\bot\}$ and $\mathcal{C}$ is the set of all possible labels of $x$. Let
$$\mathcal{S}_i(x) := \{m_n |(m_n(x) = i)\wedge (n\in\mathbb{O}_N)\}$$
A gloro voting machine $M(x)$ is a function of $m_1, m_2, \ldots, m_{2N+1}$ such that
\begin{align}
M(x) =
\begin{cases}
j & \text{if } |\mathcal{S}_j(x)| > |\mathcal{S}_{\bot}(x)| + \max_{i\in \mathcal{C}, i\neq j}{|\mathcal{S}_{i}(x)|} \\
\bot & \text{otherwise}
\end{cases}
\end{align}
where $j = \arg\max_{i \in \mathcal{C}}{|\mathcal{S}_i(x)|}$.
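For concreteness, here is a minimal Python sketch of Definition 1, assuming each base model is a callable that returns either a label or an abstain token `BOT` (a stand-in for $\bot$); the names `gloro_vote` and `BOT` are illustrative, not from any library.

```python
from collections import Counter

BOT = "bot"  # stand-in for the abstain symbol ⊥

def gloro_vote(models, x, labels):
    """Gloro voting machine M(x) from Definition 1: return the plurality
    label j only if its vote count strictly beats the number of
    abstentions plus the runner-up label's votes; otherwise abstain."""
    votes = Counter(m(x) for m in models)
    counts = {i: votes[i] for i in labels}   # |S_i(x)| for each i in C
    j = max(counts, key=counts.get)          # j = argmax_i |S_i(x)|
    runner_up = max((c for i, c in counts.items() if i != j), default=0)
    return j if counts[j] > votes[BOT] + runner_up else BOT
```

For example, with five models voting $(a, a, a, b, \bot)$ the machine returns $a$, since $3 > 1 + 1$; with $(a, a, b, \bot, \bot)$ it abstains, since $2 \not> 1 + 2$.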
**Theorem 1** $M(x)$ is $2\epsilon$-globally robust at $x$ if $\forall n \in \mathbb{O}_N$, $m_n(x)$ is $2\epsilon$-globally robust at $x$.
Before proving Theorem 1, we prove Lemma 1.
**Lemma 1** Let $\mathcal{S}_{i \rightarrow j}(x, x') := \{m_n | (m_n(x) = i)\wedge (m_n(x') = j)\wedge (n\in\mathbb{O}_N)\}$. If $||x'-x||\leq 2\epsilon$,
$$\forall i \neq \bot, |S_i(x)| = |S_{i \rightarrow i}(x, x')| + |S_{i \rightarrow \bot}(x, x')|$$
$$\forall i \neq \bot, |S_i(x')| = |S_{i \rightarrow i}(x, x')| + |S_{\bot \rightarrow i}(x, x')|$$
$$|S_{\bot}(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{\bot\rightarrow j}(x, x')| $$
*Proof*.
$\mathcal{S}_{i \rightarrow j}(x, x')$ denotes the set of models **that predict $i$ for $x$ and $j$ for $x'$**, which naturally leads to the following equation:
$$|\mathcal{S}_i(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{i \rightarrow j}(x, x')|$$
Moreover, because each $m_n$ is $2\epsilon$-globally robust, one of the following statements must hold:
1) Case I: if $m_n(x) \neq \bot$, then $m_n(x') = \bot \text{ or } m_n(x') = m_n(x)$.
2) Case II: if $m_n(x) = \bot$, then $m_n(x') \in \mathcal{Y}$.
Case I implies
$$\forall i \neq \bot, |S_i(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{i \rightarrow j}(x, x')| = |S_{i \rightarrow i}(x, x')| + |S_{i \rightarrow \bot}(x, x')|$$
Case II implies
$$|S_{\bot}(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{\bot\rightarrow j}(x, x')|$$
Also by Case I and Case II, if $m_n(x') = y_n \neq \bot$, we can infer that $m_n(x) = y_n$ or $m_n(x) = \bot$, which indicates:
$$\forall i \neq \bot, |S_i(x')| = |S_{i \rightarrow i}(x, x')| + |S_{\bot \rightarrow i}(x, x')|$$
**The proof of Theorem 1 is as follows**:
In the trivial case $M(x) = \bot$, $M(x)$ satisfies the conditions of global robustness.
Otherwise, without loss of generality suppose $M(x) = y \neq \bot$.
Given a point $x' \in B(x, 2\epsilon)$,
\begin{align}
\max_j |S_j(x')| &= \max_j [ |S_{j\rightarrow j}(x, x')| + |S_{\bot\rightarrow j}(x, x')|] \quad (\text{Lemma 1})
\end{align}
If $\arg\max_j |S_j(x')| = y$, then by Definition 1 either $M(x') = y$ or $M(x') = \bot$ (depending on whether the margin condition holds), and both outcomes are consistent with global robustness.
If for some $x' \in B(x, 2\epsilon)$, $\arg\max_j |S_j(x')| = y' \neq y$, then
\begin{align}
|S_{y'}(x')| &= |S_{y'\rightarrow y'}(x, x')| + |S_{\bot\rightarrow y'}(x, x')| \leq |S_{y'}(x)| + |S_\bot(x)|
\end{align}
The equality is achieved if and only if: 1) $|S_{y'}(x)| = |S_{y'\rightarrow y'}(x, x')|$, i.e. all models that predict $y'$ for $x$ also predict $y'$ for $x'$; and 2) $|S_\bot(x)| = |S_{\bot\rightarrow y'}(x, x')|$, i.e. all models that predict $\bot$ for $x$ predict $y'$ for $x'$.
Now we show that $M(x')$ always returns $\bot$ in this case. Every model counted in $\mathcal{S}_{j \rightarrow \bot}(x, x')$ for some $j \in \mathcal{C}$ belongs to $\mathcal{S}_{\bot}(x')$, so $|\mathcal{S}_{\bot}(x')| \geq \sum_{j\in\mathcal{C}} |\mathcal{S}_{j \rightarrow \bot}(x, x')|$. Therefore,
\begin{align}
|\mathcal{S}_{\bot}(x')| + \max_{i\in \mathcal{C}, i\neq y'}{|\mathcal{S}_{i}(x')|} &= \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{\bot}(x')| + |\mathcal{S}_{i}(x')|]\\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [\sum_{j\in\mathcal{C}, j\neq i} |\mathcal{S}_{j \rightarrow \bot}(x, x')| + |\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i}(x')| ] \\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i}(x')| ]\\
&= \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i\rightarrow i}(x, x')| + |\mathcal{S}_{\bot\rightarrow i}(x, x')|] \quad (\text{Lemma 1})\\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i\rightarrow i}(x, x')|]\\
&=\max_{i\in \mathcal{C}, i\neq y'} |\mathcal{S}_i(x)| \quad (\text{Lemma 1})\\
&= |\mathcal{S}_y(x)|
\end{align}
where the last step uses $y \neq y'$ and $y = \arg\max_{i\in\mathcal{C}} |\mathcal{S}_i(x)|$.
Because $M(x) = y$,
$$|\mathcal{S}_y(x)| > |\mathcal{S}_{\bot}(x)| + \max_{i\in \mathcal{C}, i\neq y}{|\mathcal{S}_{i}(x)|} \geq |\mathcal{S}_\bot(x)| + |\mathcal{S}_{y'}(x)| \geq |\mathcal{S}_{y'}(x')|$$
Combining the two chains of inequalities,
$$\max_j |\mathcal{S}_j(x')| = |\mathcal{S}_{y'}(x')| < |\mathcal{S}_{\bot}(x')| + \max_{i\in \mathcal{C}, i\neq y'}{|\mathcal{S}_{i}(x')|} $$
Therefore, $M(x')$ must return $\bot$ whenever $\arg\max_j |\mathcal{S}_j(x')| = y' \neq y$ for some $x' \in B(x, 2\epsilon)$.
Taken together, these cases show that for any $x' \in B(x, 2\epsilon)$, $M(x') = M(x)$ or $M(x') = \bot$, i.e. $M$ is $2\epsilon$-globally robust.
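The theorem can also be sanity-checked by brute force: enumerate every vote pattern at $x$ and every transition allowed by $2\epsilon$-global robustness of the base models (a label may only persist or turn into $\bot$, while $\bot$ may turn into anything). This is a standalone check, not part of the proof; it assumes two labels and five models.

```python
from itertools import product

BOT = "bot"
LABELS = ["a", "b"]
OUTS = LABELS + [BOT]
N = 5  # an odd number of models, as in Definition 1

def M(preds):
    """Definition 1 applied to a tuple of base-model outputs."""
    counts = {i: preds.count(i) for i in LABELS}
    j = max(counts, key=counts.get)
    runner_up = max((c for i, c in counts.items() if i != j), default=0)
    return j if counts[j] > preds.count(BOT) + runner_up else BOT

def reachable(p):
    """Outputs a 2eps-globally-robust model may produce at x' in B(x, 2eps)."""
    return OUTS if p == BOT else [p, BOT]

for preds_x in product(OUTS, repeat=N):
    mx = M(preds_x)
    if mx == BOT:
        continue  # M(x) = ⊥ satisfies global robustness trivially
    for preds_xp in product(*[reachable(p) for p in preds_x]):
        assert M(preds_xp) in (mx, BOT), (preds_x, preds_xp)
print("Theorem 1 verified for all 5-model, 2-label vote patterns")
```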
### Correct Cascading is equivalent to voting
**Definition (Pre-Permutation Cascading)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{C}, n\in\mathbb{N}$, are $\epsilon$-locally robust models with a certification procedure $\phi(m_n, x) \in \mathcal{Y} = \mathcal{C} \cup \{\bot\}$ that outputs $\bot$ if robustness cannot be certified. Let $\pi \in \Pi$ be a permutation of the models. A pre-permutation cascade is
\begin{align}
M(x, \pi) =
\begin{cases}
m_{\pi(i)}(x) & \text{if } \exists i \in [0, N), \phi(m_{\pi(i)}, x) \neq \bot \text{ and } \forall j \in [0, i), \phi(m_{\pi(j)}, x) = \bot\\
\bot & \text{otherwise}
\end{cases}
\end{align}
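A minimal sketch of this cascade, assuming `phi(m, x)` implements the certification procedure $\phi$ (returning the certified label or the abstain token `BOT`); the function name is illustrative.

```python
BOT = "bot"  # stand-in for ⊥

def pre_permutation_cascade(models, phi, x, pi):
    """M(x, pi): scan the models in the fixed order pi and return the
    prediction of the first model whose certification succeeds."""
    for idx in pi:
        if phi(models[idx], x) != BOT:
            return models[idx](x)
    return BOT  # every certification failed
```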
**Definition (Permutation-based Cascading)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{C}, n\in\mathbb{N}$, are $\epsilon$-locally robust models with a certification procedure $\phi(m_n, x) \in \mathcal{Y} = \mathcal{C} \cup \{\bot\}$ that outputs $\bot$ if robustness cannot be certified. Let $\pi \in \Pi$ be a permutation of the models. A permutation-based cascade is
\begin{align}
M(x) =
\begin{cases}
m_{\pi(0)}(x) & \text{if } \exists \pi \in \Pi \text{ such that } c(\pi) = 1, \text{ taking any such } \pi \\
\bot & \text{otherwise}
\end{cases}
\end{align} where
\begin{align}
c(\pi) =
\begin{cases}
1 & \text{if } \exists j \in [\frac{N}{2}, N), \forall 0\leq k \leq j, \phi(m_{\pi(k)}, x) \neq \bot \text{ and } m_{\pi(k)}(x) = m_{\pi(j)}(x) \\
0 & \text{otherwise}
\end{cases}
\end{align}
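A sketch of $c(\pi)$ and the resulting cascade follows. The helper names are illustrative, the exhaustive search over permutations is for clarity only, and $j \in [\frac{N}{2}, N)$ is read over integer $j$ (so the agreeing certified prefix has length at least $\lfloor N/2 \rfloor + 1$, a majority).

```python
from itertools import permutations

BOT = "bot"

def c(pi, models, phi, x):
    """c(pi) = 1 iff some prefix of length j+1, with j in [N/2, N), is
    fully certified and unanimous on a single label."""
    preds = [phi(models[idx], x) for idx in pi]  # certified label or BOT
    N = len(pi)
    for j in range(N // 2, N):
        prefix = preds[: j + 1]
        if BOT not in prefix and len(set(prefix)) == 1:
            return 1
    return 0

def permutation_cascade(models, phi, x):
    """Return m_{pi(0)}(x) for some permutation pi with c(pi) = 1."""
    for pi in permutations(range(len(models))):
        if c(pi, models, phi, x):
            return models[pi[0]](x)
    return BOT
```

Note that $c(\pi) = 1$ for some $\pi$ exactly when more than $N/2$ models are simultaneously certified and agree on one label, so the search over permutations can be replaced by a single certified-majority count; this is the sense in which this cascading coincides with voting.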
## Simplified voting for 3 models that are binary classifiers
Consider 3 models $m_1$, $m_2$ and $m_3$ that are binary classifiers, so the outputs are $c_1$, $c_2$ or $\bot$. The definition of the ensemble $M(x)$ is as follows.
$M(x)=c_1$ if \# votes for $c_1$ > \# votes for $c_2$ + \# votes for $\bot$ (i.e. a majority of models outputs $c_1$). Similarly $M(x)=c_2$ if \# votes for $c_2$ > \# votes for $c_1$ + \# votes for $\bot$ (i.e. a majority of models outputs $c_2$). Otherwise, $M(x)=\bot$.
**Theorem 2** $M(x)$ is $2\epsilon$-globally robust if $m_1$, $m_2$ and $m_3$ are $2\epsilon$-globally robust.
**Proof** By contradiction. Assume there are two inputs $x$, $x'$ s.t. $|x-x'|<2\epsilon$, $M(x)=c_1$ and $M(x')=c_2$ (so $M$ is not globally robust). Now, $M(x)=c_1$ only if at least two models, say $m_i$ and $m_j$, $i\neq j$, voted for $c_1$, i.e. $m_i(x)=c_1 \wedge m_j(x)=c_1$, while $M(x')=c_2$ only if $m_k(x')=c_2 \wedge m_l(x')=c_2$ for some $k\neq l$. Since there are only 3 models, two of the indices $i, j, k, l$ must coincide, say $i=k$. Then $m_i(x)= c_1$ implies $m_i(x')=c_1$ or $m_i(x')=\bot$, which contradicts $m_k(x')=c_2$.
## Simplified voting for N models that are binary classifiers
Consider $N=2n+1$ models $m_1$, $m_2$, ..., $m_N$ that are binary classifiers. The ensemble $M(x)$ returns $c_i$ if at least $n+1$ models return $c_i$, and $\bot$ otherwise.
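Both this rule and the 3-model rule above reduce to a simple majority count, as in the sketch below (with $N = 2n+1$ binary votes, a strict majority for one class automatically beats the other class's votes plus the abstentions):

```python
BOT = "bot"

def binary_vote(preds, c1, c2):
    """Return c_i iff at least n+1 of the N = 2n+1 votes are for c_i;
    otherwise abstain."""
    majority = len(preds) // 2 + 1  # n+1 when len(preds) = 2n+1
    for c in (c1, c2):
        if preds.count(c) >= majority:
            return c
    return BOT
```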
**Theorem 3** $M(x)$ is $2\epsilon$-globally robust if $m_1$, $m_2$ .. $m_N$ are $2\epsilon$-globally robust.
**Proof** By contradiction. Assume there are two inputs $x$, $x'$ s.t. $|x-x'|<2\epsilon$, $M(x)=c_1$ and $M(x')=c_2$ (so $M$ is not globally robust). Now, $M(x)=c_1$ only if at least $n+1$ models voted for $c_1$, while $M(x')=c_2$ only if at least $n+1$ models voted for $c_2$. Since there are only $2n+1$ models in total, at least one model $m_k$ must have voted for both, i.e. $m_k(x)=c_1$ and $m_k(x')=c_2$, which contradicts $2\epsilon$-global robustness of $m_k$.
## Considering only local robustness for binary classifiers
The definition of $m(x)$ is modified to return a pair $(l, cert)$, where $l$ is $c_1$ or $c_2$ and $cert \in \{0, 1\}$ ($1$ = certified, $0$ = not certified). We write $m(x)=(l,cert)$ and use $m(x).label$ to mean $l$ and $m(x).cert$ to mean $cert$.
**Property 1** (soundness of local robustness check) If $m(x).cert=1$, $m$ is $\epsilon$-locally robust at $x$, meaning $\forall x'$ s.t. $|x-x'|<\epsilon$, $m(x).label=m(x').label$.
**Definition (voting)** Given $N=2n+1$ binary classifiers $m_1, m_2, \ldots, m_N$, define $M(x).label=c_i$ if at least $n+1$ (i.e. a majority of) models vote for label $c_i$; $M(x)=(c_i,1)$ if at least $n+1$ (a majority of) models vote for the label-certificate pair $(c_i,1)$, and otherwise $M(x)=(c_i,0)$.
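A sketch of this rule, assuming each model output is a `(label, cert)` pair as defined above:

```python
def certified_binary_vote(outputs):
    """outputs: list of (label, cert) pairs from N = 2n+1 binary models.
    The ensemble label needs n+1 votes; the ensemble certifies only if
    n+1 models voted for that label with a certificate."""
    majority = len(outputs) // 2 + 1
    labels = [lab for lab, _ in outputs]
    for lab in set(labels):
        if labels.count(lab) >= majority:
            cert = int(outputs.count((lab, 1)) >= majority)
            return (lab, cert)
    return (None, 0)  # unreachable for an odd number of binary votes
```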
**Lemma??** The above is equivalent to: $M(x)=(c_i,k)$ if at least $n+1$ models vote for label $c_i$ and at least $n+1$ models (not necessarily the same ones) vote for $(c_i,k)$. (Not sure this is true, and it is not needed, but I think this is what was discussed in the telecon?)
**Theorem 4** If $m_1, m_2 .. m_N$ satisfy Property 1 then $M$ satisfies Property 1.
**Proof** By contradiction. Assume $M$ does not satisfy Property 1: there are two inputs $x$, $x'$ s.t. $|x-x'|<\epsilon$, $M(x).label=c_1$, $M(x').label=c_2$ and $M(x).cert=1$ ($M$ is not locally robust at $x$ but returns certified, violating Property 1). Now, $M(x)=(c_1,1)$ only if at least $n+1$ models voted for the pair $(c_1,1)$ (by the definition of voting), while $M(x').label=c_2$ only if at least $n+1$ models voted for $c_2$ (regardless of certificate). Since there are only $2n+1$ models in total, at least one model $m_k$ must have voted for both $(c_1,1)$ and $c_2$, i.e. $m_k(x)=(c_1,1)$ and $m_k(x').label=c_2$, which contradicts $\epsilon$-local robustness of $m_k$ and hence Property 1.
## Considering only local robustness for general M-class classifiers
The definition of $m(x)$ is modified to return a pair $(l, cert)$, where $l$ is one of $c_1$, $c_2$, ..., $c_M$ and $cert \in \{0, 1\}$ ($1$ = certified, $0$ = not certified). We write $m(x)=(l,cert)$ and use $m(x).label$ to mean $l$ and $m(x).cert$ to mean $cert$.
**Property 1** (soundness of local robustness check; same as above) If $m(x).cert=1$, $m$ is $\epsilon$-locally robust at $x$, meaning $\forall x'$ s.t. $|x-x'|<\epsilon$, $m(x).label=m(x').label$.
Let $v_x(c_i)$ denote the number of votes for label $c_i$ (for $1\leq i\leq M$). Similarly let $v_x(c_i,cert)$ denote the number of votes for label with certificate $(c_i,cert)$.
**Definition (voting)** Given $N=2n+1$ M-class classifiers $m_1, m_2, \ldots, m_N$, define $M(x).label=c_i$ if $v_x(c_i) > v_x(c_j)$, $\forall j\neq i$; $M(x)=(c_i,1)$ if $v_x(c_i,1) > v_x(c_j)$, $\forall j\neq i$, and otherwise $M(x)=(c_i,0)$.
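A sketch of this rule in the same style (returning `(None, 0)` when no strict plurality winner exists, a convention the definition leaves open):

```python
from collections import Counter

def mclass_vote(outputs):
    """outputs: list of (label, cert) pairs. Label by strict plurality;
    certified iff the certified votes v_x(c_i, 1) alone strictly beat
    every rival label's total votes v_x(c_j)."""
    label_votes = Counter(lab for lab, _ in outputs)      # v_x(c_i)
    cert_votes = Counter(lab for lab, c in outputs if c)  # v_x(c_i, 1)
    (top, v_top), *rest = label_votes.most_common()
    if rest and rest[0][1] == v_top:
        return (None, 0)  # no strict plurality winner
    rival = max((v for lab, v in label_votes.items() if lab != top), default=0)
    return (top, int(cert_votes[top] > rival))
```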
**Theorem 5** If $m_1, m_2, m_3$ satisfy Property 1 then $M$ satisfies Property 1.
**Proof sketch** By contradiction. Assume $M$ does not satisfy Property 1: there are two inputs $x$, $x'$ s.t. $|x-x'|<\epsilon$, $M(x).label=c_{i_1}$, $M(x').label=c_{i_2}$, $c_{i_1}\neq c_{i_2}$ and $M(x).cert=1$ ($M$ is not locally robust at $x$ but returns certified, violating Property 1). Now, $M(x)=(c_{i_1},1)$ requires $v_x(c_{i_1},1) > v_x(c_j)$, $\forall j\neq i_1$, while $M(x').label=c_{i_2}$ requires $v_{x'}(c_{i_2}) > v_{x'}(c_j)$, $\forall j\neq i_2$ (regardless of certificate).
In the special case $N=3$, the intended argument is: $M(x)=(c_{i_1},1)$ means at least 2 models vote for $(c_{i_1},1)$, while $M(x').label=c_{i_2}$ means at least 2 models vote for $c_{i_2}$; with only 3 models, some model $m_k$ must then satisfy $m_k(x)=(c_{i_1},1)$ and $m_k(x').label=c_{i_2}$, contradicting Property 1. (Caveat: the first step does not follow from the definition when rival labels receive no votes at $x$, e.g. $v_x(c_{i_1},1)=1$ and $v_x(c_j)=0$ for all $j \neq i_1$; the strengthened rule in the next section closes this gap.)
## Another attempt at the M-class case
Define $M$ as follows:
\begin{align}
M(x).label &= i & \text{if } \forall j \neq i,\ v_x(i) > v_x(j) \\
M(x).cert &= 1 & \text{if } \forall j \neq M(x).label,\ v_x(M(x).label, 1) > v_x(*, 0) + v_x(j, 1) \\
M(x).cert &= 0 & \text{otherwise}
\end{align}
where $v_x(*, 0) = \sum_j v_x(c_j, 0)$ denotes the total number of uncertified votes.
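A sketch of this strengthened rule; compared to the previous one, the certificate now requires the winner's certified votes to beat all uncertified votes plus any rival's certified votes. The `(None, 0)` tie convention is again an assumption.

```python
from collections import Counter

def mclass_vote_strong(outputs):
    """outputs: list of (label, cert) pairs."""
    label_votes = Counter(lab for lab, _ in outputs)           # v_x(c_i)
    cert_votes = Counter(lab for lab, c in outputs if c == 1)  # v_x(c_i, 1)
    uncert = sum(1 for _, c in outputs if c == 0)              # v_x(*, 0)
    (top, v_top), *rest = label_votes.most_common()
    if rest and rest[0][1] == v_top:
        return (None, 0)  # no strict plurality winner
    rival_cert = max((cert_votes[lab] for lab in cert_votes if lab != top),
                     default=0)
    return (top, int(cert_votes[top] > uncert + rival_cert))
```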
Assume for the sake of contradiction that there exist $x, x'$ s.t. $||x-x'|| < \epsilon$, $M(x) = (i_1, 1)$, and $M(x').label = i_2$ where $i_2 \neq i_1$.
Thus $v_x(i_1, 1) > v_x(*, 0) + v_x(i_2, 1)$.
Consider the votes on $x'$. Because the $v_x(i_1, 1)$ certified models are locally robust at $x$, $v_{x'}(i_1) \geq v_x(i_1, 1)$. Similarly, $v_{x'}(i_2) \leq v_x(*,0) + v_x(i_2, 1)$, since only models that are uncertified at $x$ can change labels; a model certified at $x$ with a label other than $i_2$ cannot contribute to $v_{x'}(i_2)$.
Putting things together we have
\begin{align}
v_{x'}(i_1) &\geq v_x(i_1, 1) \\
&> v_x(*,0) + v_x(i_2, 1) \\
&\geq v_{x'}(i_2)
\end{align}
Thus $M(x').label$ cannot be $i_2$, a contradiction. $\square$
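As with Theorem 1, this claim can be sanity-checked by brute force for small cases. The sketch below enumerates all 3-model, 3-class vote patterns, lets every uncertified model change arbitrarily inside the $\epsilon$-ball while certified models keep their labels (Property 1), and asserts that a certified ensemble label never changes; `vote` restates the strengthened rule above.

```python
from collections import Counter
from itertools import product

CLASSES = [0, 1, 2]
PAIRS = [(l, c) for l in CLASSES for c in (0, 1)]

def vote(outs):
    """The strengthened M-class rule defined above."""
    lv = Counter(l for l, _ in outs)
    cv = Counter(l for l, c in outs if c == 1)
    uncert = sum(1 for _, c in outs if c == 0)
    (top, vt), *rest = lv.most_common()
    if rest and rest[0][1] == vt:
        return (None, 0)
    rival = max((cv[l] for l in cv if l != top), default=0)
    return (top, int(cv[top] > uncert + rival))

def reachable(out):
    """Outputs a model may produce at x' with ||x - x'|| < eps."""
    lab, cert = out
    # Property 1: a certified model keeps its label (its cert bit may vary)
    return [(lab, c) for c in (0, 1)] if cert else PAIRS

for outs_x in product(PAIRS, repeat=3):
    top, cert = vote(list(outs_x))
    if cert != 1:
        continue
    for outs_xp in product(*[reachable(o) for o in outs_x]):
        assert vote(list(outs_xp))[0] == top, (outs_x, outs_xp)
print("Property 1 verified for all certified 3-model, 3-class patterns")
```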
## Optimal Weights
Let $y^k_i$ be the clean prediction for the $i$-th point made by the $k$-th model. Let $w$ be the weights; the clean ensemble prediction is
$$y_i(w) = \arg\max_j \sum_k w_k \, \mathbb{I}[y_i^{k}=j]$$
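A minimal sketch of this weighted prediction, assuming integer labels in $[0, \text{num\_classes})$ and a `(K, n)` array of per-model clean predictions; the function name and arguments are illustrative.

```python
import numpy as np

def weighted_prediction(preds, w, num_classes):
    """preds[k][i]: label of the k-th model on the i-th point.
    Returns y_i(w) = argmax_j sum_k w_k * 1[preds[k][i] == j]."""
    preds = np.asarray(preds)             # shape (K, n)
    K, n = preds.shape
    scores = np.zeros((n, num_classes))
    for k in range(K):
        scores[np.arange(n), preds[k]] += w[k]  # add w_k to the voted class
    return scores.argmax(axis=1)

# Example: 3 models, 2 points, 2 classes
# weighted_prediction([[0, 1], [0, 0], [1, 1]], [0.5, 0.3, 0.2], 2) -> [0, 1]
```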