# Agnostic Learning with Bias

## References

1. https://arxiv.org/pdf/2005.14426.pdf
2. https://arxiv.org/pdf/2106.01101.pdf

## Notations

* $\widetilde{w} \in \mathbb{R}^d$, $w = [\widetilde{w}, b_w] \in \mathbb{R}^{d+1}$; similarly for other vectors.
* $x = [\widetilde{x}, 1] \in \mathbb{R}^{d+1}$
* $\theta = \arccos\left(\dfrac{\widetilde{w}^\top\widetilde{v}}{\|\widetilde{w}\|\|\widetilde{v}\|}\right)$
* $F(w) = \mathbb{E}[(\sigma(w^\top x) - y)^2]$
* $v = \arg \min_{\|\widetilde{w}\| = 1} F(w)$
* $OPT = F(v)$
* $G(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2]$
* $H(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)]$

## Assumptions

**Assumption 1** ($(\frac{1}{\beta}, \alpha)$-anti-anti-concentration) *For a r.v. $\widetilde{x} \in \mathbb{R}^d$, $\exists ~ \alpha, \beta > 0$ s.t. $\forall ~ \widetilde{w}, \widetilde{v} \in \mathbb{R}^d$, with $p$ the joint p.d.f. of $(\widetilde{w}^\top \widetilde{x}, \widetilde{v}^\top \widetilde{x})$,*

$$\inf_{z \in \mathbb{R}^2, \|z\| \leq \alpha} p(z) \geq \beta$$

**Assumption 2** (Weak $c'$-anti-concentration) *For a r.v. $\widetilde{x} \in \mathbb{R}^d$, $\exists ~ c' > 0$ s.t. $\forall ~ \widetilde{u} \in \mathbb{R}^d$ with $\|\widetilde{u}\| = 1$, and for every $a \in \mathbb{R}$, $b > 0$,*

$$\mathbb{P}\{\widetilde{u}^\top\widetilde{x} \in [a, a+b]\} \leq bc'$$

**Remark** $c'$-anti-concentration is satisfied whenever the marginal density satisfies $p(z) \leq c'$ for all $z$.

**Remark** Log-concave unit-variance distributions satisfy both assumptions [Robust-halfspace ZFG21].
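As a quick numerical sanity check (not part of the source analysis): for a standard Gaussian $\widetilde{x}$, any unit-norm projection $\widetilde{u}^\top\widetilde{x}$ is $\mathcal{N}(0,1)$, whose density is bounded by $\frac{1}{\sqrt{2\pi}}$, so Assumption 2 holds with $c' = \frac{1}{\sqrt{2\pi}}$ by the first remark. The sketch below (dimensions, intervals, and sample size all illustrative) verifies the interval-probability bound by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1_000_000

# Standard Gaussian surrogate for x_tilde; for any unit direction u,
# u^T x_tilde ~ N(0, 1), so Assumption 2 holds with c' = 1/sqrt(2*pi).
x = rng.standard_normal((n, d))
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
proj = x @ u

c_prime = 1 / np.sqrt(2 * np.pi)
for a, b in [(-0.5, 0.3), (0.0, 1.0), (1.0, 0.5)]:
    p_hat = np.mean((proj >= a) & (proj <= a + b))
    # Empirical P{u^T x in [a, a+b]} should respect the bound b * c'
    assert p_hat <= b * c_prime + 0.01  # small Monte Carlo slack
```

The same sampling approach can probe Assumption 1 by histogramming the joint projection $(\widetilde{w}^\top\widetilde{x}, \widetilde{v}^\top\widetilde{x})$ on the disk of radius $\alpha$.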
## Analysis: When Assumption 1 holds

**Lemma 3** *If Assumption 1 holds for $\widetilde{x}$ and $\theta \leq \pi - \delta$ for some $\delta \in [0, \pi)$, let $b' = \max\{-b_w/\|\widetilde{w}\|, -b_v/\|\widetilde{v}\|, 0\}/\sin(\delta/2)$ and assume $b' < \alpha$; then*

$$\mathbb{E}[(w^\top x - v^\top x)^2 1\{w^\top x \geq 0, v^\top x \geq 0\}] \geq \dfrac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4} \cdot \min\{1, \dfrac{1}{\alpha^2}\} \cdot \|w-v\|^2$$

**Lemma 4** *Let $\mathbb{E}[\|x\|^2] \leq B_X^2$ and $\xi = \frac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4 B_X^2} \cdot \min\{1, \frac{1}{\alpha^2}\}$. For any $\epsilon > 0$, if $H(w) \leq \epsilon\xi$ and $\|\widetilde{w}-\widetilde{v}\|^2 \leq 1$, then $G(w) \leq \epsilon$.*

*Proof:* Since $\sigma$ is ReLU, $\sigma(z) = \sigma'(z)z$ with $\sigma'(z) = 1\{z \geq 0\}$, so

$$H(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)] \geq \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)\sigma'(v^\top x)]$$

$$= \mathbb{E}[(\sigma'(w^\top x)(w^\top x) - \sigma'(v^\top x)(v^\top x))^2\sigma'(w^\top x)\sigma'(v^\top x)] = \mathbb{E}[(w^\top x - v^\top x)^2\sigma'(w^\top x)\sigma'(v^\top x)]$$

Hence

$$\xi\epsilon \geq \mathbb{E}[(w^\top x - v^\top x)^2\sigma'(w^\top x)\sigma'(v^\top x)] \geq \dfrac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4} \cdot \min\{1, \dfrac{1}{\alpha^2}\} \cdot \|w-v\|^2$$

by Lemma 3, which combined with the value of $\xi$ gives $\|w-v\|^2 \leq \frac{\epsilon}{B_X^2}$. Finally, since $\sigma$ is $1$-Lipschitz, by the definition of $G(w)$ and Cauchy–Schwarz,

$$G(w) \leq \mathbb{E}[\|x\|^2] \|w-v\|^2 \leq \epsilon \qquad \square$$

**Lemma 5** *Let $\eta \leq \frac{1}{4B_X^2}$. If $\|w_t-v\|\leq 1$ for all $t \in \mathbb{N}$ and $\sigma(\cdot)$ is ReLU (i.e. Lipschitz constant $L=1$), the following holds during gradient descent:*

$$\nabla F(w_t)^\top(w_t-v) \geq 2H(w_t) - B_X \sqrt{OPT}$$

$$\|\nabla F(w_t)\|^2 \leq 4B_X^2 H(w_t) + 4B_X^2OPT$$

*Thus, expanding $\|w_{t+1}-v\|^2 = \|w_t - \eta\nabla F(w_t) - v\|^2$,*

$$\|w_t-v\|^2 - \|w_{t+1}-v\|^2 = 2\eta\nabla F(w_t)^\top(w_t-v) - \eta^2\|\nabla F(w_t)\|^2 \geq 2\eta(2H(w_t) - B_X\sqrt{OPT}) - 4\eta^2(B_X^2 H(w_t) + B_X^2OPT)$$

$$= H(w_t)(4\eta - 4B_X^2\eta^2) - 4\eta^2B_X^2OPT - 2\eta B_X\sqrt{OPT}$$

$$\geq H(w_t)(4\eta) - 4\eta B_X \sqrt{OPT} = 4\eta(H(w_t)-B_X \sqrt{OPT})$$

**Claim 6** *With $\eta \leq \frac{1}{4B_X^2}$: if at step $t$ of gradient descent $H(w_t) \geq 2B_X\sqrt{OPT}$, then $\|w_t-v\|^2 - \|w_{t+1}-v\|^2 \geq 4\eta B_X \sqrt{OPT}$. Since $\|w_t-v\| \leq 1$, this decrease can occur at most $\frac{1}{4\eta B_X \sqrt{OPT}}$ times, so within $\frac{1}{4\eta B_X \sqrt{OPT}}$ iterations some iterate satisfies $H(w_t) \leq 2B_X\sqrt{OPT}$, implying (by Lemma 4)*

$$G(w_t) \leq \dfrac{2^{13} B_X^3 \sqrt{OPT}}{(\alpha - b')^4 \sin(\delta/4)^3\beta \cdot \min\{1, \frac{1}{\alpha^2}\}}$$

## Analysis: When Assumption 2 holds

**Lemma 7** *If Assumption 2 holds for $\widetilde{x}$, $\mathbb{E}[\|x\|^2] \leq B_X^2$, $\|w-v\| \leq K-1$ for some $K > 1$, and $\xi' \leq \frac{1}{24}$; then whenever $G(w) \leq G(0) - \delta$ for some $\delta > 0$,*

$$\mathbb{E}[(w^\top x - v^\top x)^2 1\{w^\top x \geq 0, v^\top x \geq 0\}] \geq \xi'^2 \cdot \left(\left(\dfrac{\delta}{B_X^2 K}\right)^2 - 8B_Xc'\xi'\right) \cdot \|w-v\|^2$$

**Lemma 8** *Let $\epsilon > 0$, $G(w) \leq G(0) - \delta$, $\|w - v\| \leq K-1$ and $\zeta = \frac{\xi'^2}{B_X^2} \cdot \left(\left(\frac{\delta}{B_X^2 K}\right)^2 - 8B_Xc'\xi'\right)$; then $H(w) \leq \zeta \epsilon$ implies $G(w) \leq \epsilon$.*

**Claim 9** *With $\eta \leq \frac{1}{4B_X^2}$: if at step $t$ of gradient descent $H(w_t) \geq 2B_X\sqrt{OPT}$, then $\|w_t-v\|^2 - \|w_{t+1}-v\|^2 \geq 4\eta B_X \sqrt{OPT}$. As in Claim 6, within $\frac{1}{4\eta B_X \sqrt{OPT}}$ iterations some iterate satisfies $H(w_t) \leq 2B_X\sqrt{OPT}$, implying (by Lemma 8)*

$$G(w_t) \leq \dfrac{2 B_X^3 \sqrt{OPT}}{\xi'^2 \cdot \left(\left(\frac{\delta}{B_X^2 K}\right)^2 - 8B_Xc'\xi'\right)}$$
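The gradient-descent argument above can be sketched numerically. The toy experiment below (all parameters illustrative, Gaussian data and the noise model are assumptions, not from the source) sets up a ReLU teacher $v = [\widetilde{v}, b_v]$ with $\|\widetilde{v}\| = 1$, adds label noise so that $OPT > 0$, and runs full-batch gradient descent on the empirical square loss with step size $\eta = \frac{1}{4B_X^2}$ as in Lemma 5:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20_000

def relu(z):
    return np.maximum(z, 0.0)

# Teacher v = [v_tilde, b_v] with ||v_tilde|| = 1; labels are agnostic:
# ReLU(v^T x) plus bounded noise, so OPT > 0.
v_tilde = rng.standard_normal(d)
v_tilde /= np.linalg.norm(v_tilde)
v = np.append(v_tilde, 0.1)
X = np.column_stack([rng.standard_normal((n, d)), np.ones(n)])  # x = [x_tilde, 1]
y = relu(X @ v) + 0.05 * rng.standard_normal(n)

def F(w):  # empirical square loss
    return np.mean((relu(X @ w) - y) ** 2)

def grad_F(w):  # gradient, using sigma'(z) = 1{z >= 0}
    z = X @ w
    return 2 * X.T @ ((relu(z) - y) * (z >= 0)) / n

B2 = np.mean(np.sum(X ** 2, axis=1))  # empirical E[||x||^2] (~ B_X^2)
eta = 1 / (4 * B2)                    # step size from Lemma 5
w = np.zeros(d + 1)
for t in range(500):
    w = w - eta * grad_F(w)

# The loss should have dropped well below its value at initialization,
# consistent with the descent inequality of Lemma 5.
assert F(w) < F(np.zeros(d + 1))
```

A fuller experiment would also track a Monte Carlo estimate of $H(w_t)$ per iteration and check when it crosses the $2B_X\sqrt{OPT}$ threshold of Claims 6 and 9.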