# Agnostic Learning with Bias
## References
1. https://arxiv.org/pdf/2005.14426.pdf
2. https://arxiv.org/pdf/2106.01101.pdf
## Notations
* $\widetilde{w} \in \mathbb{R}^d$, $w = [\widetilde{w}, b_w] \in \mathbb{R}^{d+1}$, similarly for other vectors.
* $x = [\widetilde{x}, 1] \in \mathbb{R}^{d+1}$
* $\theta = \arccos(\dfrac{\widetilde{w}^\top\widetilde{v}}{\|\widetilde{w}\|\|\widetilde{v}\|})$
* $F(w) = \mathbb{E}[(\sigma(w^\top x) - y)^2]$
* $v = \arg \min_{w :\, \|\widetilde{w}\| = 1} F(w)$
* $OPT = F(v)$
* $G(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2]$
* $H(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)]$
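The quantities above can be estimated by Monte Carlo. A minimal sketch, assuming $\sigma$ is the ReLU and taking standard Gaussian features and a specific target $v$ purely for illustration (neither is fixed by the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

# Features x = [x_tilde, 1]; x_tilde ~ N(0, I_d) is an illustrative choice.
x_tilde = rng.standard_normal((n, d))
x = np.hstack([x_tilde, np.ones((n, 1))])

relu = lambda z: np.maximum(z, 0.0)

# Target v = [v_tilde, b_v] with ||v_tilde|| = 1, plus small label noise.
v = np.zeros(d + 1)
v[0], v[-1] = 1.0, 0.3
y = relu(x @ v) + 0.1 * rng.standard_normal(n)

def F(w):  # risk E[(sigma(w.x) - y)^2], estimated on the sample
    return np.mean((relu(x @ w) - y) ** 2)

def G(w):  # squared distance to the target neuron
    return np.mean((relu(x @ w) - relu(x @ v)) ** 2)

def H(w):  # G weighted by sigma'(w.x) = 1{w.x >= 0}
    return np.mean((relu(x @ w) - relu(x @ v)) ** 2 * (x @ w >= 0))

w0 = np.zeros(d + 1)
print(F(v), G(v), H(v))  # G(v) = H(v) = 0 exactly; F(v) estimates OPT
```

Note that $G(v) = H(v) = 0$ by definition, while $F(v)$ estimates $OPT$ (here the label-noise variance).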
## Assumptions
**Assumption 1** ($(\frac{1}{\beta}, \alpha)$-anti-anti-concentration) *For a r.v. $\widetilde{x} \in \mathbb{R}^d$, $\exists ~ \alpha, \beta > 0$ s.t. for all linearly independent unit vectors $\widetilde{w}, \widetilde{v} \in \mathbb{R}^d$, with $p$ the joint p.d.f. of $(\widetilde{w}^\top \widetilde{x}, \widetilde{v}^\top \widetilde{x})$,*
$$\inf_{z: z \in \mathbb{R}^2, \|z\| \leq \alpha} p(z) \geq \beta$$
**Assumption 2** (Weak $c'$-anti-concentration) *For a r.v. $\widetilde{x} \in \mathbb{R}^d$, $\exists ~ c'$ s.t. $\forall ~ \widetilde{u} \in \mathbb{R}^d$ where $\|\widetilde{u}\| = 1$, for every $a \in \mathbb{R}, b > 0$,*
$$\mathbb{P}\{\widetilde{u}^\top\widetilde{x} \in [a, a+b]\} \leq bc'$$
**Remark** Weak $c'$-anti-concentration is satisfied whenever the density of the one-dimensional projection $\widetilde{u}^\top\widetilde{x}$ is bounded above by $c'$ for every unit vector $\widetilde{u}$.
**Remark** Log-concave unit-variance distributions satisfy both assumptions [Robust-halfspace ZFG21].
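As a sanity check of the remarks, for a standard Gaussian each projection $\widetilde{u}^\top\widetilde{x}$ is $N(0,1)$, whose density is bounded by $1/\sqrt{2\pi}$, so weak anti-concentration holds with $c' = 1/\sqrt{2\pi} \approx 0.4$. A quick numerical check (the Gaussian is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 1_000_000
x_tilde = rng.standard_normal((n, d))

u = rng.standard_normal(d)
u /= np.linalg.norm(u)               # unit direction
proj = x_tilde @ u                   # distributed N(0, 1)

c_prime = 1.0 / np.sqrt(2 * np.pi)   # sup of the N(0,1) density

# P{u.x_tilde in [a, a+b]} should be at most b * c' for every interval.
for a in (-1.0, 0.0, 0.5):
    for b in (0.1, 0.5, 1.0):
        p = np.mean((proj >= a) & (proj <= a + b))
        assert p <= b * c_prime + 3e-3   # small slack for sampling error
```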
## Analysis: When Assumption 1 holds
**Lemma 3** *Suppose Assumption 1 holds for $\widetilde{x}$ and $\theta \leq \pi - \delta$ for some $\delta \in (0, \pi)$. Let $b' = \max\{-b_w/\|\widetilde{w}\|, -b_v/\|\widetilde{v}\|, 0\}/\sin(\delta/2)$ and assume $b' < \alpha$. Then*
$$\mathbb{E}[(w^\top x - v^\top x)^2 1\{w^\top x \geq 0, v^\top x \geq 0\}] \geq \dfrac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4} \cdot \min\{1, \dfrac{1}{\alpha^2}\} \cdot \|w-v\|^2$$
**Lemma 4** *Suppose the hypotheses of Lemma 3 hold, $\mathbb{E}[\|x\|^2] \leq B_X^2$ and $\|w-v\| \leq 1$. Let $\xi = \frac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4 B_X^2} \cdot \min\{1, \frac{1}{\alpha^2}\}$. If $H(w) \leq \epsilon\xi$ for some $\epsilon > 0$, then $G(w) \leq \epsilon$.*
*Proof:*
$$H(w) = \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)] \geq \mathbb{E}[(\sigma(w^\top x) - \sigma(v^\top x))^2\sigma'(w^\top x)\sigma'(v^\top x)]$$
Using the ReLU identity $\sigma(z) = \sigma'(z)z$,
$$= \mathbb{E}[(\sigma'(w^\top x)(w^\top x) - \sigma'(v^\top x)(v^\top x))^2\sigma'(w^\top x)\sigma'(v^\top x)]$$
and, since $\sigma'(w^\top x)\sigma'(v^\top x) = 1\{w^\top x \geq 0, v^\top x \geq 0\}$,
$$= \mathbb{E}[(w^\top x - v^\top x)^2 \, 1\{w^\top x \geq 0, v^\top x \geq 0\}]$$
Hence
$$\xi\epsilon \geq \mathbb{E}[(w^\top x - v^\top x)^2\sigma'(w^\top x)\sigma'(v^\top x)] \geq \dfrac{(\alpha - b')^4 \sin(\delta/4)^3\beta}{8^4} \cdot \min\{1, \dfrac{1}{\alpha^2}\} \cdot \|w-v\|^2$$
$$\Rightarrow \|w-v\|^2 \leq \dfrac{\epsilon}{B_X^2}$$
by Lemma 3 and the value of $\xi$. Finally, by the definition of $G(w)$,
$$G(w) \leq \mathbb{E}[\|x\|^2] \|w-v\|^2 \leq \epsilon$$
$\square$
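The proof hinges on two pointwise ReLU facts: $\sigma(z) = \sigma'(z)z$, and multiplying the squared difference by $\sigma'(s)\sigma'(t)$ turns it into $(s-t)^2\,1\{s \geq 0, t \geq 0\}$. A quick numerical confirmation (using the subgradient convention $\sigma'(0) = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.standard_normal(10_000)
t = rng.standard_normal(10_000)

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z >= 0).astype(float)  # convention sigma'(0) = 1

# sigma(z) = sigma'(z) * z pointwise for ReLU.
assert np.allclose(relu(s), drelu(s) * s)

# (sigma(s) - sigma(t))^2 sigma'(s) sigma'(t) = (s - t)^2 1{s >= 0, t >= 0}.
lhs = (relu(s) - relu(t)) ** 2 * drelu(s) * drelu(t)
rhs = (s - t) ** 2 * ((s >= 0) & (t >= 0))
assert np.allclose(lhs, rhs)
```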
**Lemma 5** *Let $\eta \leq \frac{1}{4B_X^2}$. If $\forall ~ t \in \mathbb{N}$ $\|w_t-v\|\leq 1$ and $\sigma(\cdot)$ is ReLU (i.e. Lipschitz constant $L=1$), the following holds during gradient descent.*
$$\nabla F(w_t)^\top(w_t-v) \geq 2H(w_t) - B_X \sqrt{OPT}$$
$$\|\nabla F(w_t)\|^2 \leq 4B_X^2 H(w_t) + 4B_X^2OPT$$
*and thus*
$$\|w_t-v\|^2 - \|w_{t+1}-v\|^2 \geq 2\eta(2H(w_t) - B_X\sqrt{OPT}) - 4\eta^2(B_X^2 H(w_t) + B_X^2OPT)$$
$$=H(w_t)(4\eta - 4B_X^2\eta^2) - 4\eta^2B_X^2OPT - 2\eta B_X\sqrt{OPT}$$
$$\geq 3\eta H(w_t) - \eta \, OPT - 2\eta B_X\sqrt{OPT} \geq 3\eta(H(w_t) - B_X \sqrt{OPT})$$
*where the first inequality uses $4B_X^2\eta^2 \leq \eta$, and the second assumes $\sqrt{OPT} \leq B_X$ so that $\eta \, OPT \leq \eta B_X\sqrt{OPT}$.*
**Claim 6** *With $\eta \leq \frac{1}{4B_X^2}$ and $\sqrt{OPT} \leq B_X$, if at time $t$ of gradient descent $H(w_t) \geq 2B_X\sqrt{OPT}$, then $\|w_t-v\|^2 - \|w_{t+1}-v\|^2 \geq 3\eta B_X \sqrt{OPT}$. Hence, since $\|w_0-v\|^2 \leq 1$, within the first $\frac{1}{3\eta B_X \sqrt{OPT}}$ iterations there is some $t$ with $H(w_t) \leq 2B_X\sqrt{OPT}$, implying by Lemma 4*
$$G(w_t) \leq \dfrac{2^{13} B_X^3 \sqrt{OPT}}{(\alpha - b')^4 \sin(\delta/4)^3\beta \cdot \min\{1, \frac{1}{\alpha^2}\}}$$
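Claim 6's dynamics can be simulated: run gradient descent on $F$ with $\eta \leq \frac{1}{4B_X^2}$, using $\nabla F(w) = 2\mathbb{E}[(\sigma(w^\top x) - y)\sigma'(w^\top x)x]$ for ReLU, and watch $H(w_t)$ fall below the $O(B_X\sqrt{OPT})$ scale. A sketch; the Gaussian data and the specific target are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50_000, 4
x = np.hstack([rng.standard_normal((n, d)), np.ones((n, 1))])
relu = lambda z: np.maximum(z, 0.0)

v = np.zeros(d + 1)
v[0], v[-1] = 1.0, 0.2
y = relu(x @ v) + 0.05 * rng.standard_normal(n)  # sqrt(OPT) ~ 0.05

B2 = np.mean(np.sum(x ** 2, axis=1))   # estimate of B_X^2
eta = 1.0 / (4 * B2)                   # step size from Lemma 5

def H(w):
    return np.mean((relu(x @ w) - relu(x @ v)) ** 2 * (x @ w >= 0))

w = np.zeros(d + 1)
h0 = H(w)                              # H at the zero initialization
for t in range(500):
    r = (relu(x @ w) - y) * (x @ w >= 0)
    w -= eta * 2 * (x.T @ r) / n       # gradient descent on empirical F
print(H(w))  # small after training, on the order of B_X * sqrt(OPT) or below
```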
## Analysis: When Assumption 2 holds
**Lemma 7** *Suppose Assumption 2 holds for $\widetilde{x}$, $\mathbb{E}[\|x\|^2] \leq B_X^2$ and $\|w-v\| \leq K-1$ for some $K > 1$. Let $\xi' \leq \frac{1}{24}$. If $G(w) \leq G(0) - \delta$ for some $\delta > 0$, then*
$$\mathbb{E}[(w^\top x - v^\top x)^2 1\{w^\top x \geq 0, v^\top x \geq 0\}] \geq \xi'^2 \cdot ((\dfrac{\delta}{B_X^2 K})^2 - 8B_Xc'\xi') \cdot \|w-v\|^2$$
**Lemma 8** *Let $\epsilon > 0$, $G(w) \leq G(0) - \delta$, $\|w - v\| \leq K-1$ and $\zeta = \frac{\xi'^2}{B_X^2} \cdot ((\frac{\delta}{B_X^2 K})^2 - 8B_Xc'\xi')$. Then $H(w) \leq \zeta \epsilon$ implies $G(w) \leq \epsilon$. (The proof mirrors that of Lemma 4, with Lemma 7 in place of Lemma 3.)*
**Claim 9** *With $\eta \leq \frac{1}{4B_X^2}$ and $\sqrt{OPT} \leq B_X$, if at time $t$ of gradient descent $H(w_t) \geq 2B_X\sqrt{OPT}$, then $\|w_t-v\|^2 - \|w_{t+1}-v\|^2 \geq 3\eta B_X \sqrt{OPT}$. Hence, since $\|w_0-v\|^2 \leq (K-1)^2$, within the first $\frac{(K-1)^2}{3\eta B_X \sqrt{OPT}}$ iterations there is some $t$ with $H(w_t) \leq 2B_X\sqrt{OPT}$, implying by Lemma 8*
$$G(w_t) \leq \dfrac{2 B_X^3 \sqrt{OPT}}{\xi'^2 \cdot ((\frac{\delta}{B_X^2 K})^2 - 8B_Xc'\xi')}$$
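Both Claims 6 and 9 rest on the descent inequalities of Lemma 5, which in turn use the ReLU gradient formula $\nabla F(w) = 2\mathbb{E}[(\sigma(w^\top x) - y)\sigma'(w^\top x)x]$. On a fixed sample this formula can be checked against central finite differences of the empirical risk (the data model below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5_000, 4
x = np.hstack([rng.standard_normal((n, d)), np.ones((n, 1))])
relu = lambda z: np.maximum(z, 0.0)
v = rng.standard_normal(d + 1)
y = relu(x @ v) + 0.05 * rng.standard_normal(n)

def F_hat(w):  # empirical risk on the fixed sample
    return np.mean((relu(x @ w) - y) ** 2)

def grad_F_hat(w):  # 2 E_n[(sigma(w.x) - y) sigma'(w.x) x]
    r = (relu(x @ w) - y) * (x @ w >= 0)
    return 2 * (x.T @ r) / n

w = rng.standard_normal(d + 1)
g = grad_F_hat(w)
eps = 1e-6
g_fd = np.array([
    (F_hat(w + eps * e) - F_hat(w - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
assert np.allclose(g, g_fd, atol=1e-4)  # formula matches finite differences
```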