---
theme: gaia
_class: lead
paginate: true
---

# **No-Dimensional Sampling Coresets for Classification**

Jeff M. Phillips* & Meysam Alishahi
University of Utah
(* visiting ScaDS.AI, Uni Leipzig, MPI for Math in Sciences)

---

## Coresets

Large dataset $X$. Want to solve expensive $f(X)$.

Step 1: Reduce $Z \leftarrow X$ (a *coreset*)
Step 2: Solve $f(Z)$

Guarantee:
- $|f(Z) - f(X)| < \varepsilon$
- $|Z| < \mathsf{size}(1/\varepsilon)$

---

## Linear Classifiers

$(X,y)$ with $X \in \mathbb{R}^{n \times d}$ and $y \in \{-1,+1\}^n$; for each $(x_i, y_i)$, the goal is $w \in \mathbb{R}^{d+1}$ so that $v_i(w) = \langle y_i x_i , w \rangle > 0$.

Objective: minimize $\sum_{i=1}^n \phi(v_i(w))$ with
- sigmoid $\phi(v) = \frac{1}{1 + e^v}$
- logistic $\phi(v) = \ln(1+e^{-v})$
- svm $\phi(v) = \max(0, 1-v)$
- ReLU $\phi(v) = \max(0,-v)$

Regularizers: minimize $\sum_{i=1}^n \phi(v_i(w)) + \frac{1}{k}\|w\|_2^2$ with
$f^{\phi,k}_w(x_i) = \phi(v_i(w)) + \frac{1}{k}\|w\|_2^2 \in F^{\phi,k}$

---

## Our Results

- Consider *infinite* $X$ with prob. dist. $P$ on $\mathbb{R}^d$. Find small $Z \subset \mathbb{R}^d$ so that for all $w$
$$\left|\int_x f_w^{\phi,k}(x) {\rm d}P(x) - \sum_{z \in Z} \nu(z) f_w^{\phi,k}(z) \right| \leq \varepsilon \int_x f_w^{\phi,k}(x) {\rm d}P(x)$$
- $|Z| = O(k^3/\varepsilon^2)$, via iid samples from $P$: no dependence on the dimension $d$. Sample complexity results!
- Rademacher complexity bound for sensitivity sampling.
- Simplification of the [Langberg-Schulman; Feldman-Langberg] sensitivity sampling bounds. New proofs with Fubini's theorem.

---

## 3 Ingredients

- Sensitivity Sampling (Langberg-Schulman 2010; Feldman-Langberg 2011): importance sampling
- Linked Range Spaces for KDE coresets (JK**P**V 2011)
- Rademacher Complexity & McDiarmid's Inequality

---

## Sensitivity Sampling

Objective function $f : \mathbb{R}^d \to [0,\infty)$, for $f_w \in F$

*upper sensitivity function* $s$: $\quad \sup_{f \in F} \frac{f(x)}{\int_z f(z) {\rm d}P(z)} \leq s(x)$
*total sensitivity* $S = \int_x s(x) {\rm d}P(x)$

Importance sampling: sample $Z$ from $Q(x) = P(x) \cdot \frac{s(x)}{S}$
with new weighting $\nu(z) = \frac{S}{s(z)}\frac{1}{|Z|}$ *(see the toy sketch below)*

Now the "effect" from any $x$ is bounded by $S$, not $s(x)$.

$(1+\varepsilon)$ error with $|Z| = O(\mathsf{VC} \cdot S^2/\varepsilon^2)$, where $\mathsf{VC}$ is the VC-dimension of the $s(x)/S$-augmented $(Q,F)$.

(Feldman-Schmidt-Sohler 2020) improve this to $|Z| = O(\mathsf{VC} \cdot S \log S / \varepsilon^2)$

---

## Linked Range Spaces

Kernel $K(x,z) \in [0,1]$, e.g. $K(x,z)=\exp(-\|x-z\|^2)$
$\mathsf{KDE}_P(z) = \mathbf{E}_{x \sim P}\, K(x,z)$

An $\varepsilon$-*KDE coreset* is $Z \subset \mathbb{R}^d$ so that $|\mathsf{KDE}_P(x) - \mathsf{KDE}_Z(x)| \leq \varepsilon$ for all $x \in \mathbb{R}^d$.

(Joshi-Kommaraju-**P**-Venkatasubramanian 2011)
For $f : \mathbb{R}^d \to [0,1]$, define the superlevel set $\mathsf{super}(f,\tau) = \{x \in \mathbb{R}^d \mid f(x) \geq \tau \}$.
The *linked range space* $(\mathbb{R}^d, \mathsf{super}(F))$ is all superlevel sets over $f \in F$ and $\tau \in [0,1]$.

Sample $Z \sim P$ with $|Z| = O(\mathsf{VC}/\varepsilon^2)$; then $Z$ is an $\varepsilon$-KDE coreset, where $\mathsf{VC}$ is the VC dimension of $(\mathbb{R}^d,\mathsf{super}(K))=d$.

For sensitivity sampling, we need: $g: \mathbb{R}^d \to [0, \infty)$; define the sublevel set $\mathsf{sub}(g,\tau) = \{x \in \mathbb{R}^d \mid g(x) \leq \tau\}$. Call these range spaces *sub-linked*.
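---

## Sketch: Sensitivity Sampling

A minimal NumPy sketch of the importance-sampling scheme above, on a toy family $F$ of logistic losses; the data, the family, and the sensitivity scores are illustrative stand-ins, not the paper's construction.

```python
# Toy instance of sensitivity sampling; P is uniform over the sampled X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)                   # stand-in for P (1-d data)
ws = [0.5, 1.0, 2.0]                           # tiny family F = {f_w}
F = {w: np.log1p(np.exp(-w * X)) for w in ws}  # logistic f_w(x) = ln(1+e^{-wx})

# upper sensitivity: s(x) >= f(x)/P(f) for every f in F; total S = E_P[s]
s = np.max([fw / fw.mean() for fw in F.values()], axis=0)
S = s.mean()

m = 2_000                                      # coreset size |Z|
q = s / s.sum()                                # sample Z from Q = P * s/S
idx = rng.choice(X.size, size=m, p=q)
nu = S / (s[idx] * m)                          # weights nu(z) = S/(s(z)|Z|)

for w, fw in F.items():                        # P(f) vs weighted coreset sum
    print(f"w={w}: true={fw.mean():.5f}  coreset={(nu * fw[idx]).sum():.5f}")
```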
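---

## Sketch: KDE Coresets

A toy check of the JKPV-style guarantee above: uniform iid samples $Z \sim P$ make $|\mathsf{KDE}_P - \mathsf{KDE}_Z|$ small everywhere. The Gaussian data and probe points are arbitrary choices, and the sup is only estimated over the probes.

```python
# Uniform samples Z ~ P as an eps-KDE coreset, checked at probe points.
import numpy as np

rng = np.random.default_rng(1)
P_pts = rng.normal(size=(20_000, 2))               # stand-in for P in R^2
Z = P_pts[rng.choice(P_pts.shape[0], size=1_000)]  # |Z| ~ VC/eps^2 samples

def kde(data, queries):                            # K(x,z) = exp(-||x-z||^2)
    d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2).mean(axis=1)

probes = rng.normal(size=(200, 2))                 # estimate sup over probes
err = np.abs(kde(P_pts, probes) - kde(Z, probes)).max()
print("max |KDE_P - KDE_Z| over probes:", err)
```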
---

## Rademacher Complexity

The $m$-*Rademacher complexity* for $(P,F)$ uses $\sigma_1, \ldots, \sigma_m \sim \{-1,+1\}$
$$\mathcal{R}_m^P = \mathbf{E}_{x_{1:m} \sim P}\, \mathbf{E}_{\sigma_{1:m}} \sup_{f \in F} \left[ \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i) \right]$$

$f : (\mathbb{R}^d)^n \to \mathbb{R}$ satisfies the $\Delta$-*bounded difference* property if, for all $x_1, \ldots, x_n$ and $x'_i$,
$$|f(x_1,\ldots,x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1,\ldots,x_{i-1}, x'_i, x_{i+1}, \ldots, x_n)| \leq \Delta$$

McDiarmid (1989): for $f$ with $\Delta$-bounded differences and independent RVs $X_1, \ldots, X_n$,
$$\mathbf{Pr}[f(X_1, \ldots, X_n) - \mathbf{E}[f(X_1, \ldots, X_n)] \geq \varepsilon] \leq \exp(-2 \varepsilon^2 / (n \Delta^2))$$

---

## Prior Work

**$\mu(X)$ complexity** (Munteanu-Schwiegelshohn-Sohler-Woodruff 2018)
$$\mu(X) = \sup_w \frac{\int_x |\langle x,w \rangle| {\rm d}P_+(x)}{\int_x |\langle x,w \rangle| {\rm d}P_-(x)}$$
where $P_+$ is the distribution of the $x_i$ with $y_i=+1$, and $P_-$ of those with $y_i = -1$.

They study $\sum_{i=1}^n \phi(v_i)$ with logistic $\phi(v) = \ln(1+e^{-v})$ and no regularizer:
- coreset size $|Z|$ must depend on $\mu(X)$
- $|Z| = O^*(\frac{d^3 \mu(X)^3 }{ \varepsilon^4})$

**Lewis weights** (Mai-Musco-Rao 2021)
- $|Z| = O(\frac{\mu(X)^2 d \log(d/\varepsilon) }{ \varepsilon^2})$

---

## Prior Work

### Regularized Regression

Study $\sum_{i=1}^n \phi(v_i) + \frac{1}{k} \|w\|_2^2$ with logistic $\phi(v) = \ln(1+e^{-v})$

(Tolochinsky-Jubran-Feldman 2022)
- $|Z| = O(\frac{k d^2 \log n \log k}{\varepsilon^2})$
- logistic and svm require $\|w\| \leq 1$; not sigmoid
- bounded $\|x_i\|$

(Curtin-Im-Moseley-Pruhs-Samadian 2020)
- $|Z| = O(\frac{d k \log k}{\varepsilon^2})$

---

## Our Main Rademacher Result

**Theorem 1:** For $(P,F)$ with upper sensitivity function $s$ and total sensitivity $S$, take $|Z|$ sensitivity samples (from $Q$); then with probability $> 1- \exp(-2|Z|t^2/S^2)$, for all $f \in F$
$$\left| \int_x f(x) {\rm d}P(x) - \frac{S}{|Z|} \sum_{z \in Z} \frac{f(z)}{s(z)} \right| \leq (2 \mathcal{R}_{|Z|}^Q(F_Q) + t) \int_x f(x) {\rm d}P(x)$$

- $\|x\|_2 \leq 1$ (the constant goes into $C_\phi$)
- $\phi \in \{$logistic, sigmoid, svm, ReLU$\}$

1. $S = O(k)$ ($s$ is constant; it depends on $\max_x \|x\|_2$ and the Lipschitz factor of $\phi$)
2. sensitivity-sampled Rademacher complexity: $\mathcal{R}_{|Z|}^Q(F_Q) = C_{\phi,k}\sqrt{S/|Z|} = O(k\sqrt{k/|Z|})$
3. Setting $|Z| = O(k^3/\varepsilon^2)$ and $t = \varepsilon/2$ (weights are uniform since $s$ is constant) yields
$$\left| \int_x f(x) {\rm d}P(x) - \frac{1}{|Z|} \sum_{z \in Z} f(z) \right| \leq \varepsilon \int_x f(x) {\rm d}P(x)$$
*(see the numerical sketch after the proof)*

---

### Proof of Theorem 1

Define and divide by $P(f)=\int_x f(x) {\rm d}P(x)$, so we need to prove:
$$\left| 1 - \frac{S}{|Z|} \sum_{z \in Z} \frac{1}{P(f)}\frac{f(z)}{s(z)} \right| \leq 2 \mathcal{R}_{|Z|}^Q(F_Q) + t$$

Using that $s(x) \geq f(x)/P(f)$:
- $b_f(z_{1:|Z|}) = \frac{S}{|Z|} \sum_{z \in Z} \frac{1}{P(f)}\frac{f(z)}{s(z)} \leq S$
- $b_f(z_{1:|Z|})$ has bounded differences with $\Delta \leq S/|Z|$.

Thus by McDiarmid's, with probability $\geq 1 - \exp(-2|Z|t^2/S^2)$,
$$\begin{align}\sup_{f \in F} \left|1-b_f(z_{1:|Z|}) \right| &\leq \mathbf{E}_{z_{1:|Z|} \sim Q} \left[\sup_{f \in F}\left|1-b_f(z_{1:|Z|}) \right|\right] + t \\ &\leq 2 \mathcal{R}_{|Z|}^Q(F_Q) + t\end{align}$$
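---

### Sketch: Theorem 1 Numerically

A hypothetical experiment in the spirit of the result: with the $\frac{1}{k}\|w\|_2^2$ regularizer and $\|x\|_2 \leq 1$, sensitivities are constant, so a plain uniform sample approximates the regularized logistic objective at any probed $w$. All sizes and the synthetic data are illustrative.

```python
# Uniform sampling as a coreset for the k-regularized logistic objective.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200_000, 20, 10
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x||_2 <= 1
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))                 # labels +-1

def obj(idx, w):  # average f_w over a subset: logistic loss + ||w||^2 / k
    v = y[idx] * (X[idx] @ w)
    return np.log1p(np.exp(-v)).mean() + w @ w / k

Z = rng.choice(n, size=5_000)                   # uniform coreset, nu = 1/|Z|
for _ in range(3):                              # spot-check a few w's
    w = rng.normal(size=d)
    full, core = obj(np.arange(n), w), obj(Z, w)
    print(f"rel err: {abs(full - core) / full:.4f}")
```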
---

### Relative $(\alpha, \eta)$-Approximation

Range space $(P, \mathcal{H})$. For $h \in \mathcal{H}$, let $P(h) = \int_x 1_{x \in h} {\rm d}P(x)$ be the measure in $h$.

$\mu$ is a *relative $(\alpha, \eta)$-approximation* of $(P, \mathcal{H})$ if
$$\sup_{h \in \mathcal{H}} |P(h) - \mu(h)| \leq \alpha \max(\eta, P(h))$$

$m = O(\mathsf{VC} \cdot \frac{1}{\alpha^2}\frac{1}{\eta}\log \frac{1}{\eta})$ iid samples from $P$ suffice (Har-Peled-Sharir 2011).

---

### Sub-linked Function Approximation

Function family $G$ of functions $g : \mathbb{R}^d \to [0,\infty)$ and prob. dist. $Q$ give the sub-linked range space $(Q, \mathsf{sub}(G))$:
- $\mathsf{sub}(g,\tau) = \{x \in \mathbb{R}^d \mid g(x) \leq \tau\}$
- $Q(g,\tau) = \int_{x \in \mathbb{R}^d} 1_{g(x) > \tau} {\rm d}Q(x)$, the measure of the complement of $\mathsf{sub}(g,\tau)$ (complementation preserves the VC dimension)

Consider $g(x) = \frac{S}{s(x)P(f)}f(x)$ and $Q(x) = \frac{s(x)}{S}P(x)$
- $g(x) \leq S$, since $f(x)/P(f) \leq s(x)$ (set $s(x) = s'(x)+1$)
- $\int_x g(x) {\rm d}Q(x) = 1$

---

## Classic Sensitivity VC-Bound

**Theorem 2:** Let $\nu$ be a relative $(\alpha, \eta)$-approximation of $(Q,\mathsf{sub}(G))$. For each $f \in F$
$$\left| P(f) - \int_x \frac{S f(x)}{s(x)} {\rm d}\nu(x) \right| \leq \alpha(1 + S \eta) P(f)$$
Then set $\alpha = \varepsilon/2$ and $\eta = 1/S$, so $\ldots \leq \varepsilon P(f)$.

---

### Proof: Divide the goal by $P(f)$ and solve:

$$\begin{align} \frac{1}{P(f)} \left| P(f) - \int_x \frac{S f(x)}{s(x)} {\rm d}\nu(x) \right| &= \left| 1 - \int_x g(x) {\rm d}\nu(x) \right| \\ &= \left| \int_x g(x) {\rm d}Q(x) - \int_x g(x) {\rm d}\nu(x) \right| \\ \mathsf{(via~ \textbf{Lemma 2})}\quad &=\left| \int_{\tau=0}^S Q(g,\tau) - \nu(g,\tau) {\rm d}\tau\right| \\ &\leq \int_{\tau=0}^S \left| Q(g,\tau) - \nu(g,\tau)\right| {\rm d}\tau \\ &\leq \alpha \int_{\tau=0}^S \max(\eta, Q(g,\tau)) {\rm d}\tau \\ &= \alpha \int_{\tau=0}^{L_g} \max(\eta, Q(g,\tau)) {\rm d}\tau + \alpha \int_{\tau=L_g}^S \max(\eta, Q(g,\tau)) {\rm d}\tau \\ &\leq \alpha (1+ S \eta)\end{align}$$

---

### Split at $L_g$

Define $L_g = \sup\{ \tau \geq 0 \mid Q(g,\tau) \geq \eta\}$ (recall $Q(g,\tau)$ is non-increasing in $\tau$).
- Above the threshold $L_g$:
$$\int_{\tau = L_g}^S \max(\eta, Q(g,\tau)) {\rm d}\tau =\int_{\tau = L_g}^S \eta~ {\rm d}\tau \leq S \eta$$
- Below the threshold $L_g$:
$$\begin{align}\int_{\tau=0}^{L_g} \max(\eta, Q(g,\tau)) {\rm d}\tau &=\int_{\tau=0}^{L_g} Q(g,\tau) {\rm d}\tau \\ \mathsf{(via ~\textbf{Lemma 1})}\quad &=\int_x \min(L_g, g(x)) {\rm d}Q(x) \leq \int_x g(x) {\rm d}Q(x) = 1\end{align}$$

---

### Using Fubini's Theorem

**Lemma 1:** $\int_x \min(L, g(x)){\rm d}Q(x) = \int_{\tau = 0}^L Q(g,\tau) {\rm d}\tau$

> $$\begin{align}\int_x \min(L, g(x)){\rm d}Q(x) &=\int_x \int_{\tau = 0}^L 1_{\tau < g(x)} {\rm d}\tau~ {\rm d}Q(x) \\ &=\int_{\tau = 0}^L \int_x 1_{\tau < g(x)} {\rm d}Q(x)~ {\rm d}\tau \\ &=\int_{\tau = 0}^L Q(g,\tau) {\rm d}\tau \end{align}$$

**Lemma 2:** $\left| \int_x g(x){\rm d}Q(x) - \int_x g(x){\rm d}\nu(x) \right| = \left| \int_{\tau=0}^S Q(g,\tau) - \nu(g,\tau) {\rm d}\tau\right|$

> $$\begin{align}\left| \int_x g(x){\rm d}Q(x) - \int_x g(x){\rm d}\nu(x) \right| &= \left| \int_x \min(S,g(x)){\rm d}Q(x) - \int_x \min(S,g(x)){\rm d}\nu(x) \right| \\ \mathsf{(via \textbf{ Lemma 1}, \times 2)}\quad &=\left| \int_{\tau=0}^S Q(g,\tau) - \nu(g,\tau) {\rm d}\tau\right| \end{align}$$
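---

### Sketch: Lemma 1 Numerically

A Monte-Carlo check of the layer-cake identity in Lemma 1, reading $Q(g,\tau)$ as the measure of $\{x \mid g(x) > \tau\}$ as in the proof; the distribution $Q$, the function $g$, and the cap $L$ are arbitrary toy choices.

```python
# Check: E_Q[min(L, g(x))] equals integral_0^L Q(g, tau) dtau,
# with Q(g, tau) = Q({x : g(x) > tau}).
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(size=200_000)               # samples from a toy Q
g = np.sort(x ** 2)                             # values of g(x) = x^2, sorted
L = 2.0

lhs = np.minimum(L, g).mean()                   # E_Q[min(L, g(x))]

taus = np.linspace(0.0, L, 2_001)
Qg = 1.0 - np.searchsorted(g, taus, side="right") / g.size  # Q(g, tau) on grid
rhs = (((Qg[:-1] + Qg[1:]) / 2) * np.diff(taus)).sum()      # trapezoid rule
print(lhs, rhs)                                 # should agree closely
```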