# Ensemble
**Notations**
Let $|\mathcal{A}|$ be the cardinality of a set $\mathcal{A}$.
Let $B(x, \epsilon)$ be the $\ell_2$ ball centered at $x$ with radius $\epsilon$.
Let $\mathbb{O}_N = \{1, \ldots, 2N+1\}$ for $N \in \mathbb{N}_{+}$, an index set containing an odd number of integers.
**Definition 1 (Gloro Voting)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{Y}, n\in\mathbb{O}_N$, are gloronets, where $\mathcal{Y} = \mathcal{C} \cup \{\bot\}$ and $\mathcal{C}$ is the set of all possible labels of $x$. Let
$$\mathcal{S}_i(x) := \{m_n |(m_n(x) = i)\wedge (n\in\mathbb{O}_N)\}$$
A gloro voting machine $M(x)$ is a function of $m_1, m_2, \ldots, m_{2N+1}$ such that
\begin{align}
M(x) =
\begin{cases}
j & \text{if } |\mathcal{S}_j(x)| > |\mathcal{S}_{\bot}(x)| + \max_{i\in \mathcal{C}, i\neq j}{|\mathcal{S}_{i}(x)|} \\
\bot & \text{otherwise}
\end{cases}
\end{align}
where $j = \arg\max_{i \in \mathcal{C}}{|\mathcal{S}_i(x)|}$.
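For concreteness, here is a minimal Python sketch of Definition 1, assuming each base model is a callable that returns either a label or an abstain token `BOT` (a stand-in for $\bot$); the names `gloro_vote` and `BOT` are illustrative, not from any library.

```python
from collections import Counter

BOT = "bot"  # stand-in for the abstain symbol ⊥

def gloro_vote(models, x, labels):
    """Gloro voting machine M(x) from Definition 1: return the plurality
    label j only if its vote count strictly beats the number of
    abstentions plus the runner-up label's votes; otherwise abstain."""
    votes = Counter(m(x) for m in models)
    counts = {i: votes[i] for i in labels}   # |S_i(x)| for each i in C
    j = max(counts, key=counts.get)          # j = argmax_i |S_i(x)|
    runner_up = max((c for i, c in counts.items() if i != j), default=0)
    return j if counts[j] > votes[BOT] + runner_up else BOT
```

For example, with five models voting $(a, a, a, b, \bot)$ the machine returns $a$, since $3 > 1 + 1$; with $(a, a, b, \bot, \bot)$ it abstains, since $2 \not> 1 + 2$.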
**Theorem 1** $M(x)$ is $2\epsilon$-globally robust at $x$ if $\forall n \in \mathbb{O}_N$, $m_n(x)$ is $2\epsilon$-globally robust at $x$.
Before proving Theorem 1, we prove Lemma 1.
**Lemma 1** Let $\mathcal{S}_{i \rightarrow j}(x, x') := \{m_n | (m_n(x) = i)\wedge (m_n(x') = j)\wedge (n\in\mathbb{O}_N)\}$. If $||x'-x||\leq 2\epsilon$,
$$\forall i \neq \bot, |S_i(x)| = |S_{i \rightarrow i}(x, x')| + |S_{i \rightarrow \bot}(x, x')|$$
$$\forall i \neq \bot, |S_i(x')| = |S_{i \rightarrow i}(x, x')| + |S_{\bot \rightarrow i}(x, x')|$$
$$|S_{\bot}(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{\bot\rightarrow j}(x, x')| $$
*Proof*.
$\mathcal{S}_{i \rightarrow j}(x, x')$ denotes the set of models **that predict $i$ for $x$ and $j$ for $x'$**, which naturally leads to the following equation:
$$|\mathcal{S}_i(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{i \rightarrow j}(x, x')|$$
Moreover, because each $m_n$ is $2\epsilon$-globally robust, one of the following statements must hold:
1) Case I: if $m_n(x) \neq \bot$, then $m_n(x') = \bot \text{ or } m_n(x') = m_n(x)$.
2) Case II: if $m_n(x) = \bot$, then $m_n(x') \in \mathcal{Y}$.
Case I implies
$$\forall i \neq \bot, |S_i(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{i \rightarrow j}(x, x')| = |S_{i \rightarrow i}(x, x')| + |S_{i \rightarrow \bot}(x, x')|$$
Case II implies
$$|S_{\bot}(x)| = \sum_{j \in \mathcal{Y}} |\mathcal{S}_{\bot\rightarrow j}(x, x')|$$
Also by Case I and Case II, if $m_n(x') = y_n \neq \bot$, we can infer that $m_n(x) = y_n$ or $m_n(x) = \bot$, which indicates:
$$\forall i \neq \bot, |S_i(x')| = |S_{i \rightarrow i}(x, x')| + |S_{\bot \rightarrow i}(x, x')|$$
**The proof of Theorem 1 is as follows**:
In the trivial case $M(x) = \bot$, $M(x)$ satisfies the conditions of global robustness.
Otherwise, without loss of generality suppose $M(x) = y \neq \bot$.
Given a point $x' \in B(x, 2\epsilon)$,
\begin{align}
\max_j |S_j(x')| &= \max_j [ |S_{j\rightarrow j}(x, x')| + |S_{\bot\rightarrow j}(x, x')|] \quad (\text{Lemma 1})
\end{align}
If $\arg\max_j |S_j(x')| = y$, then by Definition 1 either $M(x') = y$ or $M(x') = \bot$ (depending on whether the margin condition holds), and both outcomes are consistent with global robustness.
If for some $x' \in B(x, 2\epsilon)$, $\arg\max_j |S_j(x')| = y' \neq y$, then
\begin{align}
|S_{y'}(x')| &= |S_{y'\rightarrow y'}(x, x')| + |S_{\bot\rightarrow y'}(x, x')| \leq |S_{y'}(x)| + |S_\bot(x)|
\end{align}
The equality is achieved if and only if: 1) $|S_{y'}(x)| = |S_{y'\rightarrow y'}(x, x')|$, i.e. all models that predict $y'$ for $x$ also predict $y'$ for $x'$; and 2) $|S_\bot(x)| = |S_{\bot\rightarrow y'}(x, x')|$, i.e. all models that predict $\bot$ for $x$ predict $y'$ for $x'$.
Now we show that $M(x')$ always returns $\bot$ in this case. Every model counted in $\mathcal{S}_{j \rightarrow \bot}(x, x')$ for some $j \in \mathcal{C}$ belongs to $\mathcal{S}_{\bot}(x')$, so $|\mathcal{S}_{\bot}(x')| \geq \sum_{j\in\mathcal{C}} |\mathcal{S}_{j \rightarrow \bot}(x, x')|$. Therefore,
\begin{align}
|\mathcal{S}_{\bot}(x')| + \max_{i\in \mathcal{C}, i\neq y'}{|\mathcal{S}_{i}(x')|} &= \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{\bot}(x')| + |\mathcal{S}_{i}(x')|]\\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [\sum_{j\in\mathcal{C}, j\neq i} |\mathcal{S}_{j \rightarrow \bot}(x, x')| + |\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i}(x')| ] \\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i}(x')| ]\\
&= \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i\rightarrow i}(x, x')| + |\mathcal{S}_{\bot\rightarrow i}(x, x')|] \quad (\text{Lemma 1})\\
&\geq \max_{i\in \mathcal{C}, i\neq y'} [|\mathcal{S}_{i \rightarrow \bot}(x, x')| + |\mathcal{S}_{i\rightarrow i}(x, x')|]\\
&=\max_{i\in \mathcal{C}, i\neq y'} |\mathcal{S}_i(x)| \quad (\text{Lemma 1})\\
&= |\mathcal{S}_y(x)|
\end{align}
where the last step uses $y \neq y'$ and $y = \arg\max_{i\in\mathcal{C}} |\mathcal{S}_i(x)|$.
Because $M(x) = y$,
$$|\mathcal{S}_y(x)| > |\mathcal{S}_{\bot}(x)| + \max_{i\in \mathcal{C}, i\neq y}{|\mathcal{S}_{i}(x)|} \geq |\mathcal{S}_\bot(x)| + |\mathcal{S}_{y'}(x)| \geq |\mathcal{S}_{y'}(x')|$$
Combining the two chains of inequalities,
$$\max_j |\mathcal{S}_j(x')| = |\mathcal{S}_{y'}(x')| < |\mathcal{S}_{\bot}(x')| + \max_{i\in \mathcal{C}, i\neq y'}{|\mathcal{S}_{i}(x')|} $$
Therefore, $M(x')$ must return $\bot$ whenever $\arg\max_j |\mathcal{S}_j(x')| = y' \neq y$ for some $x' \in B(x, 2\epsilon)$.
Taken together, these cases show that for any $x' \in B(x, 2\epsilon)$, $M(x') = M(x)$ or $M(x') = \bot$, i.e. $M$ is $2\epsilon$-globally robust.
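The theorem can also be sanity-checked by brute force: enumerate every vote pattern at $x$ and every transition allowed by $2\epsilon$-global robustness of the base models (a label may only persist or turn into $\bot$, while $\bot$ may turn into anything). This is a standalone check, not part of the proof; it assumes two labels and five models.

```python
from itertools import product

BOT = "bot"
LABELS = ["a", "b"]
OUTS = LABELS + [BOT]
N = 5  # an odd number of models, as in Definition 1

def M(preds):
    """Definition 1 applied to a tuple of base-model outputs."""
    counts = {i: preds.count(i) for i in LABELS}
    j = max(counts, key=counts.get)
    runner_up = max((c for i, c in counts.items() if i != j), default=0)
    return j if counts[j] > preds.count(BOT) + runner_up else BOT

def reachable(p):
    """Outputs a 2eps-globally-robust model may produce at x' in B(x, 2eps)."""
    return OUTS if p == BOT else [p, BOT]

for preds_x in product(OUTS, repeat=N):
    mx = M(preds_x)
    if mx == BOT:
        continue  # M(x) = ⊥ satisfies global robustness trivially
    for preds_xp in product(*[reachable(p) for p in preds_x]):
        assert M(preds_xp) in (mx, BOT), (preds_x, preds_xp)
print("Theorem 1 verified for all 5-model, 2-label vote patterns")
```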
### Correct Cascading is equivalent to voting
**Definition (Pre-Permutation Cascading)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{C}, n\in\mathbb{N}$, are $\epsilon$-locally robust models with a certification procedure $\phi(m_n, x) \in \mathcal{Y} = \mathcal{C} \cup \{\bot\}$ that outputs $\bot$ if robustness cannot be certified. Let $\pi \in \Pi$ be a permutation of the models. A pre-permutation cascade is
\begin{align}
M(x, \pi) =
\begin{cases}
m_{\pi(i)}(x) & \text{if } \exists i \in [0, N), \phi(m_{\pi(i)}, x) \neq \bot \text{ and } \forall j \in [0, i), \phi(m_{\pi(j)}, x) = \bot\\
\bot & \text{otherwise}
\end{cases}
\end{align}
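A minimal sketch of this cascade, assuming `phi(m, x)` implements the certification procedure $\phi$ (returning the certified label or the abstain token `BOT`); the function name is illustrative.

```python
BOT = "bot"  # stand-in for ⊥

def pre_permutation_cascade(models, phi, x, pi):
    """M(x, pi): scan the models in the fixed order pi and return the
    prediction of the first model whose certification succeeds."""
    for idx in pi:
        if phi(models[idx], x) != BOT:
            return models[idx](x)
    return BOT  # every certification failed
```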
**Definition (Permutation-based Cascading)** Suppose $m_n(x): \mathbb{R}^d \rightarrow \mathcal{C}, n\in\mathbb{N}$, are $\epsilon$-locally robust models with a certification procedure $\phi(m_n, x) \in \mathcal{Y} = \mathcal{C} \cup \{\bot\}$ that outputs $\bot$ if robustness cannot be certified. Let $\pi \in \Pi$ be a permutation of the models. A permutation-based cascade is
\begin{align}
M(x) =
\begin{cases}
m_{\pi(0)}(x) & \text{if } \exists \pi \in \Pi \text{ such that } c(\pi) = 1, \text{ taking any such } \pi \\
\bot & \text{otherwise}
\end{cases}
\end{align} where
\begin{align}
c(\pi) =
\begin{cases}
1 & \text{if } \exists j \in [\frac{N}{2}, N), \forall 0\leq k \leq j, \phi(m_{\pi(k)}, x) \neq \bot \text{ and } m_{\pi(k)}(x) = m_{\pi(j)}(x) \\
0 & \text{otherwise}
\end{cases}
\end{align}
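A sketch of $c(\pi)$ and the resulting cascade follows. The helper names are illustrative, the exhaustive search over permutations is for clarity only, and $j \in [\frac{N}{2}, N)$ is read over integer $j$ (so the agreeing certified prefix has length at least $\lfloor N/2 \rfloor + 1$, a majority).

```python
from itertools import permutations

BOT = "bot"

def c(pi, models, phi, x):
    """c(pi) = 1 iff some prefix of length j+1, with j in [N/2, N), is
    fully certified and unanimous on a single label."""
    preds = [phi(models[idx], x) for idx in pi]  # certified label or BOT
    N = len(pi)
    for j in range(N // 2, N):
        prefix = preds[: j + 1]
        if BOT not in prefix and len(set(prefix)) == 1:
            return 1
    return 0

def permutation_cascade(models, phi, x):
    """Return m_{pi(0)}(x) for some permutation pi with c(pi) = 1."""
    for pi in permutations(range(len(models))):
        if c(pi, models, phi, x):
            return models[pi[0]](x)
    return BOT
```

Note that $c(\pi) = 1$ for some $\pi$ exactly when more than $N/2$ models are simultaneously certified and agree on one label, so the search over permutations can be replaced by a single certified-majority count; this is the sense in which this cascading coincides with voting.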
## Simplified voting for 3 models that are binary classifiers
Consider 3 models $m_1$, $m_2$ and $m_3$ that are binary classifiers, so the outputs are $c_1$, $c_2$ or $\bot$. The definition of the ensemble $M(x)$ is as follows.
$M(x)=c_1$ if \# votes for $c_1$ > \# votes for $c_2$ + \# votes for $\bot$ (i.e. a majority of models outputs $c_1$). Similarly $M(x)=c_2$ if \# votes for $c_2$ > \# votes for $c_1$ + \# votes for $\bot$ (i.e. a majority of models outputs $c_2$). Otherwise, $M(x)=\bot$.
**Theorem 2** $M(x)$ is $2\epsilon$-globally robust if $m_1$, $m_2$ and $m_3$ are $2\epsilon$-globally robust.
**Proof** By contradiction. Assume there are two inputs $x$, $x'$ s.t. $|x-x'|<2\epsilon$, $M(x)=c_1$ and $M(x')=c_2$ (so $M$ is not globally robust). Now, $M(x)=c_1$ only if at least two models, say $m_i$ and $m_j$, $i\neq j$, voted for $c_1$, i.e. $m_i(x)=c_1 \wedge m_j(x)=c_1$, while $M(x')=c_2$ only if $m_k(x')=c_2 \wedge m_l(x')=c_2$ for some $k\neq l$. Since there are only 3 models, two of the indices $i, j, k, l$ must coincide, say $i=k$. Then $m_i(x)= c_1$ implies $m_i(x')=c_1$ or $m_i(x')=\bot$, which contradicts $m_k(x')=c_2$.
## Simplified voting for N models that are binary classifiers
Consider $N=2n+1$ models $m_1$, $m_2$, ..., $m_N$ that are binary classifiers. The ensemble $M(x)$ returns $c_i$ if at least $n+1$ models return $c_i$, and $\bot$ otherwise.
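Both this rule and the 3-model rule above reduce to a simple majority count, as in the sketch below (with $N = 2n+1$ binary votes, a strict majority for one class automatically beats the other class's votes plus the abstentions):

```python
BOT = "bot"

def binary_vote(preds, c1, c2):
    """Return c_i iff at least n+1 of the N = 2n+1 votes are for c_i;
    otherwise abstain."""
    majority = len(preds) // 2 + 1  # n+1 when len(preds) = 2n+1
    for c in (c1, c2):
        if preds.count(c) >= majority:
            return c
    return BOT
```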
**Theorem 3** $M(x)$ is $2\epsilon$-globally robust if $m_1$, $m_2$ .. $m_N$ are $2\epsilon$-globally robust.
**Proof** By contradiction. Assume there are two inputs $x$, $x'$ s.t. $|x-x'|<2\epsilon$, $M(x)=c_1$ and $M(x')=c_2$ (so $M$ is not globally robust). Now, $M(x)=c_1$ only if at least $n+1$ models voted for $c_1$, while $M(x')=c_2$ only if at least $n+1$ models voted for $c_2$. Since there are only $2n+1$ models in total, at least one model $m_k$ must have voted for both, i.e. $m_k(x)=c_1$ and $m_k(x')=c_2$, which contradicts $2\epsilon$-global robustness of $m_k$.
## Considering only local robustness for binary classifiers
The definition of $m(x)$ is modified to return a pair $(l, cert)$, where $l$ is $c_1$ or $c_2$ and $cert \in \{0, 1\}$ ($1$ = certified, $0$ = not certified). We write $m(x)=(l,cert)$ and use $m(x).label$ to mean $l$ and $m(x).cert$ to mean $cert$.
**Property 1** (soundness of local robustness check) If $m(x).cert=1$, $m$ is $\epsilon$-locally robust at $x$, meaning $\forall x'$ s.t. $|x-x'|<\epsilon$, $m(x).label=m(x').label$.
**Definition (voting)** Given $N=2n+1$ binary classifiers $m_1, m_2, \ldots, m_N$, define $M(x).label=c_i$ if at least $n+1$ (i.e. a majority of) models vote for label $c_i$; $M(x)=(c_i,1)$ if at least $n+1$ (a majority of) models vote for the label-certificate pair $(c_i,1)$, and otherwise $M(x)=(c_i,0)$.
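A sketch of this rule, assuming each model output is a `(label, cert)` pair as defined above:

```python
def certified_binary_vote(outputs):
    """outputs: list of (label, cert) pairs from N = 2n+1 binary models.
    The ensemble label needs n+1 votes; the ensemble certifies only if
    n+1 models voted for that label with a certificate."""
    majority = len(outputs) // 2 + 1
    labels = [lab for lab, _ in outputs]
    for lab in set(labels):
        if labels.count(lab) >= majority:
            cert = int(outputs.count((lab, 1)) >= majority)
            return (lab, cert)
    return (None, 0)  # unreachable for an odd number of binary votes
```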
**Lemma??** The above is equivalent to: $M(x)=(c_i,k)$ if at least $n+1$ models vote for label $c_i$ and at least $n+1$ models (not necessarily the same ones) vote for $(c_i,k)$. (Not sure this is true, and it is not needed, but I think this is what was discussed in the telecon?)
**Theorem 4** If $m_1, m_2 .. m_N$ satisfy Property 1 then $M$ satisfies Property 1.
**Proof** By contradiction. Assume $M$ does not satisfy Property 1: there are two inputs $x$, $x'$ s.t. $|x-x'|<\epsilon$, $M(x).label=c_1$, $M(x').label=c_2$ and $M(x).cert=1$ ($M$ is not locally robust at $x$ but returns certified, violating Property 1). Now, $M(x)=(c_1,1)$ only if at least $n+1$ models voted for the pair $(c_1,1)$ (by the definition of voting), while $M(x').label=c_2$ only if at least $n+1$ models voted for $c_2$ (regardless of certificate). Since there are only $2n+1$ models in total, at least one model $m_k$ must have voted for both $(c_1,1)$ and $c_2$, i.e. $m_k(x)=(c_1,1)$ and $m_k(x').label=c_2$, which contradicts $\epsilon$-local robustness of $m_k$ and hence Property 1.
## Considering only local robustness for general M-class classifiers
The definition of $m(x)$ is modified to return a pair $(l, cert)$, where $l$ is one of $c_1$, $c_2$, ..., $c_M$ and $cert \in \{0, 1\}$ ($1$ = certified, $0$ = not certified). We write $m(x)=(l,cert)$ and use $m(x).label$ to mean $l$ and $m(x).cert$ to mean $cert$.
**Property 1** (soundness of local robustness check; same as above) If $m(x).cert=1$, $m$ is $\epsilon$-locally robust at $x$, meaning $\forall x'$ s.t. $|x-x'|<\epsilon$, $m(x).label=m(x').label$.
Let $v_x(c_i)$ denote the number of votes for label $c_i$ (for $1\leq i\leq M$). Similarly let $v_x(c_i,cert)$ denote the number of votes for label with certificate $(c_i,cert)$.
**Definition (voting)** Given $N=2n+1$ M-class classifiers $m_1, m_2, \ldots, m_N$, define $M(x).label=c_i$ if $v_x(c_i) > v_x(c_j)$, $\forall j\neq i$; $M(x)=(c_i,1)$ if $v_x(c_i,1) > v_x(c_j)$, $\forall j\neq i$, and otherwise $M(x)=(c_i,0)$.
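A sketch of this rule in the same style (returning `(None, 0)` when no strict plurality winner exists, a convention the definition leaves open):

```python
from collections import Counter

def mclass_vote(outputs):
    """outputs: list of (label, cert) pairs. Label by strict plurality;
    certified iff the certified votes v_x(c_i, 1) alone strictly beat
    every rival label's total votes v_x(c_j)."""
    label_votes = Counter(lab for lab, _ in outputs)      # v_x(c_i)
    cert_votes = Counter(lab for lab, c in outputs if c)  # v_x(c_i, 1)
    (top, v_top), *rest = label_votes.most_common()
    if rest and rest[0][1] == v_top:
        return (None, 0)  # no strict plurality winner
    rival = max((v for lab, v in label_votes.items() if lab != top), default=0)
    return (top, int(cert_votes[top] > rival))
```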
**Theorem 5** If $m_1, m_2, m_3$ satisfy Property 1 then $M$ satisfies Property 1.
**Proof sketch** By contradiction. Assume $M$ does not satisfy Property 1: there are two inputs $x$, $x'$ s.t. $|x-x'|<\epsilon$, $M(x).label=c_{i_1}$, $M(x').label=c_{i_2}$, $c_{i_1}\neq c_{i_2}$ and $M(x).cert=1$ ($M$ is not locally robust at $x$ but returns certified, violating Property 1). Now, $M(x)=(c_{i_1},1)$ requires $v_x(c_{i_1},1) > v_x(c_j)$, $\forall j\neq i_1$, while $M(x').label=c_{i_2}$ requires $v_{x'}(c_{i_2}) > v_{x'}(c_j)$, $\forall j\neq i_2$ (regardless of certificate).
In the special case $N=3$, the intended argument is: $M(x)=(c_{i_1},1)$ means at least 2 models vote for $(c_{i_1},1)$, while $M(x').label=c_{i_2}$ means at least 2 models vote for $c_{i_2}$; with only 3 models, some model $m_k$ must then satisfy $m_k(x)=(c_{i_1},1)$ and $m_k(x').label=c_{i_2}$, contradicting Property 1. (Caveat: the first step does not follow from the definition when rival labels receive no votes at $x$, e.g. $v_x(c_{i_1},1)=1$ and $v_x(c_j)=0$ for all $j \neq i_1$; the strengthened rule in the next section closes this gap.)
## Another attempt at the M-class case
Define $M$ as follows:
\begin{align}
M(x).label &= i & \text{if } \forall j \neq i,\ v_x(i) > v_x(j) \\
M(x).cert &= 1 & \text{if } \forall j \neq M(x).label,\ v_x(M(x).label, 1) > v_x(*, 0) + v_x(j, 1) \\
M(x).cert &= 0 & \text{otherwise}
\end{align}
where $v_x(*, 0) = \sum_j v_x(c_j, 0)$ denotes the total number of uncertified votes.
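A sketch of this strengthened rule; compared to the previous one, the certificate now requires the winner's certified votes to beat all uncertified votes plus any rival's certified votes. The `(None, 0)` tie convention is again an assumption.

```python
from collections import Counter

def mclass_vote_strong(outputs):
    """outputs: list of (label, cert) pairs."""
    label_votes = Counter(lab for lab, _ in outputs)           # v_x(c_i)
    cert_votes = Counter(lab for lab, c in outputs if c == 1)  # v_x(c_i, 1)
    uncert = sum(1 for _, c in outputs if c == 0)              # v_x(*, 0)
    (top, v_top), *rest = label_votes.most_common()
    if rest and rest[0][1] == v_top:
        return (None, 0)  # no strict plurality winner
    rival_cert = max((cert_votes[lab] for lab in cert_votes if lab != top),
                     default=0)
    return (top, int(cert_votes[top] > uncert + rival_cert))
```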
Assume for the sake of contradiction that there exist $x, x'$ s.t. $||x-x'|| < \epsilon$, $M(x) = (i_1, 1)$, and $M(x').label = i_2$ where $i_2 \neq i_1$.
Thus $v_x(i_1, 1) > v_x(*, 0) + v_x(i_2, 1)$.
Consider the votes on $x'$. Because the $v_x(i_1, 1)$ certified models are locally robust at $x$, $v_{x'}(i_1) \geq v_x(i_1, 1)$. Similarly, $v_{x'}(i_2) \leq v_x(*,0) + v_x(i_2, 1)$, since only models that are uncertified at $x$ can change labels; a model certified at $x$ with a label other than $i_2$ cannot contribute to $v_{x'}(i_2)$.
Putting things together we have
\begin{align}
v_{x'}(i_1) &\geq v_x(i_1, 1) \\
&> v_x(*,0) + v_x(i_2, 1) \\
&\geq v_{x'}(i_2)
\end{align}
Thus $M(x').label$ cannot be $i_2$, a contradiction. $\square$
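As with Theorem 1, this claim can be sanity-checked by brute force for small cases. The sketch below enumerates all 3-model, 3-class vote patterns, lets every uncertified model change arbitrarily inside the $\epsilon$-ball while certified models keep their labels (Property 1), and asserts that a certified ensemble label never changes; `vote` restates the strengthened rule above.

```python
from collections import Counter
from itertools import product

CLASSES = [0, 1, 2]
PAIRS = [(l, c) for l in CLASSES for c in (0, 1)]

def vote(outs):
    """The strengthened M-class rule defined above."""
    lv = Counter(l for l, _ in outs)
    cv = Counter(l for l, c in outs if c == 1)
    uncert = sum(1 for _, c in outs if c == 0)
    (top, vt), *rest = lv.most_common()
    if rest and rest[0][1] == vt:
        return (None, 0)
    rival = max((cv[l] for l in cv if l != top), default=0)
    return (top, int(cv[top] > uncert + rival))

def reachable(out):
    """Outputs a model may produce at x' with ||x - x'|| < eps."""
    lab, cert = out
    # Property 1: a certified model keeps its label (its cert bit may vary)
    return [(lab, c) for c in (0, 1)] if cert else PAIRS

for outs_x in product(PAIRS, repeat=3):
    top, cert = vote(list(outs_x))
    if cert != 1:
        continue
    for outs_xp in product(*[reachable(o) for o in outs_x]):
        assert vote(list(outs_xp))[0] == top, (outs_x, outs_xp)
print("Property 1 verified for all certified 3-model, 3-class patterns")
```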
## Optimal Weights
Let $y^k_i$ be the clean prediction for the $i$-th point made by the $k$-th model. Let $w$ be the weights; the clean ensemble prediction is
$$y_i(w) = \arg\max_j \sum_k w_k \, \mathbb{I}[y_i^{k}=j]$$
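A minimal sketch of this weighted prediction, assuming integer labels in $[0, \text{num\_classes})$ and a `(K, n)` array of per-model clean predictions; the function name and arguments are illustrative.

```python
import numpy as np

def weighted_prediction(preds, w, num_classes):
    """preds[k][i]: label of the k-th model on the i-th point.
    Returns y_i(w) = argmax_j sum_k w_k * 1[preds[k][i] == j]."""
    preds = np.asarray(preds)             # shape (K, n)
    K, n = preds.shape
    scores = np.zeros((n, num_classes))
    for k in range(K):
        scores[np.arange(n), preds[k]] += w[k]  # add w_k to the voted class
    return scores.argmax(axis=1)

# Example: 3 models, 2 points, 2 classes
# weighted_prediction([[0, 1], [0, 0], [1, 1]], [0.5, 0.3, 0.2], 2) -> [0, 1]
```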