---
title: Notes on Repulsive Deep Ensembles
---

# Repulsive Deep Ensembles

1. **What does the paper do?** *The paper adds a kernelized repulsive term to the deep ensemble training objective.*
2. **What is the motivation for the repulsive term?** *The ensemble members tend to collapse to the same mode of the posterior. Consequently, the ensemble fails to act as a model average.*
3. **What is the inspiration for this paper?** *Stein variational gradient descent (SVGD). The SVGD update consists of two terms: a log-density gradient term and a kernel-gradient term. The latter acts as a repulsive force that keeps each particle away from the others, so the particles can cover the target distribution instead of collapsing.*
4. **What is the analysis of the proposed method?** *Both SVGD and the repulsive update are discretizations of the Wasserstein gradient flow of the Kullback-Leibler divergence over the space of probability measures.*
5. **Experiments?** *Sampling from synthetic distributions, regression, 2D classification, and image classification.*

## Wasserstein Gradient Flow

### Geometry of $(\mathcal{P}_2(\mathcal{M}), W_2)$

#### Definition of Wasserstein Space
Let $\mathcal{P}_2(\mathcal{M})$ be the space of probability measures on $\mathcal{M}$ with finite second moment, i.e. $\mathcal{P}_2(\mathcal{M}) = \{ \mu \in \mathcal{P}(\mathcal{M}) : \int_{\mathcal{M}} \lVert x \rVert^2 \, d \mu(x) < +\infty \}$. We equip $\mathcal{P}_2(\mathcal{M})$ with the Wasserstein-2 distance. For all $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$, the Wasserstein distance between $\mu$ and $\nu$ is
\begin{align}
W_{2}^2 (\mu, \nu) = \underset{s \, \in \, \Pi(\mu, \nu)}{\inf} \int \lVert x - y \rVert^2 \, ds(x, y)
\end{align}
where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$. The metric space $(\mathcal{P}_2(\mathcal{M}), W_2)$ is called the Wasserstein space.

#### Riemannian Structure of $(\mathcal{P}_2(\mathcal{M}), W_2)$ and $L^2$ Spaces
We define
\begin{align}
\mathcal{T}_\mu \mathcal{P}_2(\mathcal{M}) \subset L^2(\mu) = \{f: \mathbb{R}^d \rightarrow \mathbb{R}^d, \, \int_{\mathcal{M}} \lVert f(x) \rVert^2 \, d\mu(x) < +\infty \}
\end{align}
where $L^2(\mu)$ is the space of vector-valued functions that are square-integrable w.r.t. $\mu$. It is a Hilbert space equipped with the inner product
\begin{align}
\langle f, g \rangle_\mu = \int_{\mathcal{M}} \langle f(x), g(x) \rangle_{\mathbb{R}^d} \, d\mu(x)
\end{align}

#### Pushforward Measure
Let $\mu \in \mathcal{P}_2(\mathcal{M})$ and let $T: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be a measurable map. The pushforward measure $T_\#\mu$ is characterized by $x \sim \mu \implies T(x) \sim T_\#\mu$. The identity map is denoted by $Id$, i.e. $Id_\#\mu = \mu$.

#### Dynamics on $\mathcal{P}_2(\mathcal{M})$ through $L^2$ Maps
If $T \in L^2(\mu)$ and $\mu \in \mathcal{P}_2(\mathcal{M})$, then $T_\#\mu \in \mathcal{P}_2(\mathcal{M})$ since, by the change of variables,
\begin{align}
\int \lVert y \rVert^2 \, d(T_\#\mu)(y) = \int \lVert T(x) \rVert^2 \, d\mu(x) < + \infty
\end{align}

*Brenier's theorem:* Let $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$ s.t. $\mu \ll \text{Leb}$. Then there is a unique $T_\mu^\nu : \mathcal{M} \rightarrow \mathcal{M}$ s.t.
1. $(T_{\mu}^{\nu})_\#\mu = \nu$
2. $W_2^2 (\mu, \nu) = \lVert Id - T_\mu^\nu\rVert_\mu^2 \; \triangleq \, \int \, \lVert x - T_\mu^\nu(x) \rVert^2 \, d\mu(x)$

$T_\mu^\nu$ is called the optimal transport map.
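*Example:* for one-dimensional Gaussians $\mu = \mathcal{N}(m_0, \sigma_0^2)$ and $\nu = \mathcal{N}(m_1, \sigma_1^2)$ with $\sigma_0 > 0$, the optimal transport map and the Wasserstein distance have closed forms, which is a convenient sanity check for the definitions above:
\begin{align}
T_\mu^\nu(x) = m_1 + \frac{\sigma_1}{\sigma_0}(x - m_0), \qquad W_2^2(\mu, \nu) = (m_0 - m_1)^2 + (\sigma_0 - \sigma_1)^2
\end{align}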
#### The Notion of Wasserstein Geodesic
The Wasserstein geodesic between $\rho_0 = \mu$ and $\rho_1 = \nu$ can be written as
\begin{align}
\rho_t = ((1 - t) Id \, + \, t \, T_{\mu}^\nu)_\#\mu
\end{align}

#### Convexity along Wasserstein Geodesics
Let $\mathcal{F}: \mathcal{P}_2(\mathcal{M}) \rightarrow (-\infty, +\infty]$. $\mathcal{F}$ is $\lambda$-geodesically convex with $\lambda \geq 0$ if, for all $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$ and $t \in [0, 1]$,
\begin{align}
\mathcal{F}(\rho_t) \leq (1 - t) \mathcal{F}(\mu) + t \mathcal{F}(\nu) - \frac{\lambda \, t (1 - t)}{2} W_2^2(\mu, \nu)
\end{align}
where $(\rho_t)_{t \in [0, 1]}$ is a Wasserstein-2 geodesic between $\mu$ and $\nu$.

#### KL-divergence
For a target $\pi \propto \exp(-V)$, we define the KL-divergence $KL[\mu \, \vert \, \pi]$ as
\begin{align}
KL[\mu \, \vert \, \pi] &= \int \log \frac{\mu}{\pi} (x) \, d\mu(x) \\
&= \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x) + C
\end{align}

### Definition of Wasserstein Gradient Flow

#### Gradient Flow on Probability Distributions
The goal is to approximate $\pi$ by solving
\begin{align}
\underset{\mu \, \in \, \mathcal{P}_2(\mathcal{M})}{\min} \, \mathcal{F}(\mu), \quad \mathcal{F}(\mu) = KL[\mu \, \vert \, \pi]
\end{align}
We want the gradient flow of $\mathcal{F} : \mathcal{P}_2(\mathcal{M}) \rightarrow (-\infty, +\infty]$ to be something of the form
\begin{align}
\dot{\mu_t} = -\nabla_{W_2} \, \mathcal{F}(\mu_t)
\end{align}

#### Velocity Field
Definition: Given $(\mu_t)_{t \geq 0} \in \mathcal{P}_2(\mathcal{M})$, if there exists $(v_t)_{t \geq 0} \in (L^2(\mu_t))_{t \geq 0}$ s.t.
\begin{align}
\frac{d}{dt} \int \varphi \, d\mu_t = \langle \nabla \varphi, v_t \rangle_{\mu_t}
\end{align}
for all smooth test functions $\varphi: \mathcal{M} \rightarrow \mathbb{R}$, then $(v_t)_{t \geq 0}$ is a velocity field of $(\mu_t)_{t \geq 0}$. The velocity field governs the dynamics of $(\mu_t)_{t \geq 0}$.

#### Continuity Equation
The velocity field satisfies the PDE
\begin{align}
\frac{\partial \mu_t}{\partial t} + \nabla \cdot (\mu_t \, v_t) = 0
\end{align}
*Proof:*
1. $\frac{d}{dt} \int \varphi(x) \, \mu_t(x) \, dx = \int \varphi(x) \frac{d \mu_t}{dt}(x) \, dx$
2. \begin{align}
\frac{d}{dt} \int \varphi(x) \, \mu_t(x) \, dx &= \langle \nabla \varphi, \, v_t \rangle_{\mu_t} \\
&= \int \langle \nabla \varphi(x), v_t(x) \rangle_{\mathbb{R}^d} \, \mu_t(x) \, dx \\
&= - \int \varphi(x) \, \nabla \cdot (v_t(x) \, \mu_t(x)) \, dx
\end{align}
The last line follows from integration by parts (assuming the boundary term vanishes).

#### Wasserstein Gradient
Definition: Let $\mu \in \mathcal{P}_2(\mathcal{M})$. Consider a perturbation on the Wasserstein space $(Id + \varepsilon \, h)_\#\mu$ with $h \in L^2(\mu)$. If the Taylor expansion of $\mathcal{F}$ yields
\begin{align}
\mathcal{F}((Id + \varepsilon \, h)_\#\mu) = \mathcal{F}(\mu) + \varepsilon \langle \nabla_{W_2} \mathcal{F}(\mu), h \rangle_{\mu} \, + \, o(\varepsilon)
\end{align}
then $\nabla_{W_2} \mathcal{F}(\mu) \in L^2(\mu)$ is the Wasserstein gradient of $\mathcal{F}$ at $\mu$.

#### First Variation
Definition: Let $\mu \in \mathcal{P}_2(\mathcal{M})$. Consider a linear perturbation $\mu + \varepsilon \xi \in \mathcal{P}_2(\mathcal{M})$. If the Taylor expansion of $\mathcal{F}$ yields
\begin{align}
\mathcal{F}(\mu + \varepsilon \xi) = \mathcal{F}(\mu) + \varepsilon \int \mathcal{F}^\prime(\mu)(x) \, d\xi(x) \, + \, o(\varepsilon)
\end{align}
then $\mathcal{F}^\prime(\mu) : \mathcal{M} \rightarrow \mathbb{R}$ is the first variation of $\mathcal{F}$ at $\mu$.
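*Example:* for the potential term $\mathcal{V}(\mu) = \int V \, d\mu$, the map $\mu \mapsto \mathcal{V}(\mu)$ is linear, so $\mathcal{V}^\prime(\mu) = V$. For the entropy term $\mathcal{E}(\mu) = \int \log \mu \, d\mu$, expanding $(\mu + \varepsilon \xi)\log(\mu + \varepsilon \xi)$ to first order in $\varepsilon$ gives $\mathcal{E}^\prime(\mu) = \log \mu + 1$. Hence
\begin{align}
KL^\prime[\mu \, \vert \, \pi] = V + \log \mu + 1 = \log \frac{\mu}{\pi} + \text{const}
\end{align}
and the constant disappears once we take the gradient in the next step.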
#### Wasserstein Gradient = Gradient of the First Variation
Given the definitions of the Wasserstein gradient and of the first variation, we have
\begin{align}
\nabla_{W_2} \mathcal{F}(\mu) = \nabla \mathcal{F}^\prime(\mu)
\end{align}
where $\nabla_{W_2} \mathcal{F}(\mu) : \mathcal{M} \rightarrow \mathbb{R}^d$ and $\mathcal{F}^\prime(\mu) : \mathcal{M} \rightarrow \mathbb{R}$.

#### Wasserstein Gradient of KL
Let $\mathcal{F}(\mu) = \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x)$. For $\pi \propto \exp(-V)$, the KL-divergence is $KL[\mu \, \vert \, \pi] = \mathcal{F}(\mu) - \mathcal{F}(\pi)$, where $\mathcal{F}(\pi)$ is a constant. By additivity, the Wasserstein gradient of the KL is
\begin{align}
\nabla_{W_2} \, \mathcal{F}(\mu) = \nabla \mathcal{F}^\prime(\mu) = \nabla V + \nabla \log \mu = \nabla \log \frac{\mu}{\pi}
\end{align}

#### Velocity Field = Negative Wasserstein Gradient
Recall that we want to define the equation
\begin{align}
\dot{\mu_t} = - \nabla_{W_2} \, \mathcal{F}(\mu_t)
\end{align}
The definition of the velocity field can be read as a chain rule: for a linear functional $\mathcal{F}(\mu) = \int \varphi \, d\mu$, whose Wasserstein gradient is $\nabla \varphi$,
\begin{align}
\frac{d}{dt} \int \varphi \, d\mu_t = \langle \nabla\varphi, \, v_t \rangle_{\mu_t}
\end{align}
and, more generally,
\begin{align}
\frac{d}{dt} \mathcal{F}(\mu_t) = \langle \nabla_{W_2} \, \mathcal{F}(\mu_t), v_t \rangle_{\mu_t}
\end{align}
We choose $v_t = - \nabla_{W_2} \mathcal{F}(\mu_t)$ to decrease $\mathcal{F}$. Substituting $v_t$ back into the equation above, we obtain the descent property
\begin{align}
\frac{d}{dt} \mathcal{F}(\mu_t) = - \lVert \nabla_{W_2} \mathcal{F}(\mu_t) \rVert^2_{L^2(\mu_t)} \leq 0
\end{align}

#### Wasserstein Gradient Flow
The Wasserstein gradient flow of $\mathcal{F}$ is ruled by
\begin{align}
v_t = - \nabla_{W_2} \mathcal{F}(\mu_t)
\end{align}
Substituting $v_t$ into the continuity equation gives the Wasserstein gradient flow
\begin{align}
\frac{\partial \mu_t}{\partial t} &= - \nabla \cdot (\mu_t v_t) \\
&= \nabla \cdot (\mu_t \, \nabla_{W_2} \mathcal{F}(\mu_t))
\end{align}

#### Wasserstein Gradient Flow of the KL Divergence
Recall the definition of the KL divergence
\begin{align}
KL[\mu \, \vert \, \pi] &= \int \log \, \frac{\mu}{\pi}(x) \, d\mu(x) \\
&= \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x) + C
\end{align}
The Wasserstein gradient flow of the KL divergence is the Fokker-Planck equation:
\begin{align}
\frac{\partial \mu_t}{\partial t} &= \nabla \cdot (\mu_t \nabla_{W_2} KL[\mu_t \, \vert \, \pi]) \\
&= \nabla \cdot \left(\mu_t \nabla \log \frac{\mu_t}{\pi}\right) \\
&= \nabla \cdot (\mu_t \, \nabla V) + \nabla \cdot (\mu_t \nabla \log \mu_t) \\
&= \nabla \cdot (\mu_t \, \nabla V) + \nabla \cdot \left(\mu_t \frac{\nabla \mu_t}{\mu_t}\right) \\
&= \nabla \cdot (\mu_t \, \nabla V) + \Delta \mu_t
\end{align}
This continuity equation is exactly the Fokker-Planck equation of the Langevin diffusion $x_t \sim \mu_t$,
\begin{align}
dx_t = - \nabla V(x_t) \, dt + \sqrt{2} \, dB_t
\end{align}
where $B_t$ is a Brownian motion in $\mathbb{R}^d$.
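A minimal sketch of the corresponding time discretization (the unadjusted Langevin algorithm), assuming a standard-Gaussian target so that $\nabla V(x) = x$; the step size and number of steps are arbitrary illustrative choices:

```python
import numpy as np

def grad_V(x):
    # Standard-Gaussian target pi(x) ∝ exp(-V(x)) with V(x) = ||x||^2 / 2, so grad V(x) = x.
    # Swap in any other potential gradient.
    return x

def ula(x0, step=1e-2, n_steps=5_000, seed=None):
    """Euler-Maruyama discretization of dx_t = -grad V(x_t) dt + sqrt(2) dB_t."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad_V(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Run a few independent chains; their final states are approximate samples from pi.
samples = np.stack([ula(np.zeros(2), seed=s) for s in range(100)])
print(samples.mean(axis=0), samples.var(axis=0))  # roughly 0 and 1
```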
#### Convergence of Wasserstein Gradient Flow
- If $\mathcal{F}$ is $\lambda$-geodesically convex with $\lambda > 0$, then
\begin{align}
W_2(\mu_t, \pi) \leq \exp(- \lambda \, t) \, W_2(\mu_0, \pi)
\end{align}
- If we instead assume the inequality
\begin{align}
\lVert \nabla_{W_2} \mathcal{F}(\mu_t) \rVert^2_{L^2(\mu_t)} \geq \lambda \, \mathcal{F}(\mu_t)
\end{align}
then, combining it with the descent property and Grönwall's lemma,
\begin{align}
\mathcal{F}(\mu_t) \leq \exp(- \lambda t) \, \mathcal{F}(\mu_0)
\end{align}

### Sampling Algorithm

#### Discretization of Wasserstein Gradient Flow
For a step size $\gamma > 0$:
1. Wasserstein gradient descent (forward Euler)
\begin{align}
\mu_{l + 1} = (Id - \gamma \nabla_{W_2} \mathcal{F}(\mu_l) )_\#\mu_l
\end{align}
Given $x_0 \sim \mu_0$, the gradient descent over $\mathcal{M}$ is written as
\begin{align}
x_{l + 1} = x_l \, - \, \gamma \nabla \log \frac{\mu_l}{\pi}(x_l) \sim \mu_{l + 1} \quad \text{if} \;\; x_l \sim \mu_l
\end{align}
Problem: $\mu_l$ is not known; we only have samples $x_l \sim \mu_l$.
Question:
- Since $\mathcal{M}$ is compact and the update depends on $- \nabla_{W_2} \mathcal{F}(\mu_l)$ (which lies in the tangent space), the updated $\mu_{l + 1}$ may leave the manifold. How do we mitigate this issue? Potential solution: project $\mu_{l + 1}$ onto the nearest probability measure on $\mathcal{M}$.
2. Jordan-Kinderlehrer-Otto (JKO) scheme
\begin{align}
\mu_{l + 1} \in JKO_{\gamma \, \mathcal{F}}(\mu_l) \, \triangleq \underset{\mu \, \in \, \mathcal{P}_2(\mathcal{M})}{\arg\min} \left \{ \gamma \, \mathcal{F}(\mu) + \frac{1}{2} W_2^2 (\mu, \mu_l) \right \}
\end{align}
3. Splitting scheme ($\mathcal{F} = \mathcal{F}_1 + \mathcal{F}_2$, with $\mathcal{F}_2$ geodesically convex)
\begin{align}
&\mu_{l + \frac{1}{2}} = (Id - \gamma \nabla_{W_2} \, \mathcal{F}_1(\mu_l))_\#\mu_l \\
&\mu_{l + 1} = JKO_{\gamma \mathcal{F}_2} (\mu_{l + \frac{1}{2}})
\end{align}

#### Kernel and RKHS
- $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a positive semi-definite kernel.
- Its corresponding RKHS is
\begin{align}
\mathcal{H}_k = \overline{\left \{ \sum_{m = 1}^M \alpha_m \, k(\cdot , x^m); \; \; M \in \mathbb{N}; \; \; \alpha_1, \cdots, \alpha_M \in \mathbb{R}^d; \; \; x^1, \cdots, x^M \in \mathbb{R}^d \right \} }
\end{align}
- $\mathcal{H}_k$ is a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ and norm $\lVert \cdot \rVert_{\mathcal{H}_k}$.
- $\forall \mu \in \mathcal{P}(\mathcal{M}), \; \; \int k(x, x) \, d\mu(x) < + \infty \implies \mathcal{H}_k \subset L^2(\mu)$
- Reproducing property: $\forall f \in \mathcal{H}_k, \; \; f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$

#### Stein Variational Gradient Descent
Consider the following metric depending on $k$:
\begin{align}
W_k^2(\mu_0, \mu_1) = \underset{(\mu_t, v_t)}{\inf} \left \{ \int_0^1 \lVert v_t \rVert^2_{\mathcal{H}_k} \, dt \; : \; \frac{\partial \mu_t}{\partial t} + \nabla \cdot (\mu_t \, v_t) = 0 \right \}
\end{align}
where the infimum is over curves connecting $\mu_0$ to $\mu_1$. The $W_k$ gradient flow of the KL-divergence can be written as the PDE
\begin{align}
\frac{\partial \mu_t}{\partial t} = \nabla \cdot \left(\mu_t \, P_{\mu_t} \nabla \log \frac{\mu_t}{\pi}\right)
\end{align}
where $P_\mu : L^2(\mu) \rightarrow \mathcal{H}_k$ is the kernel integral operator
\begin{align}
P_\mu \, f(\cdot) = \int k(x, \cdot) \, f(x) \, d\mu(x)
\end{align}
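A small numerical sketch of $P_\mu$ for an empirical measure $\mu^M = \frac{1}{M}\sum_m \delta_{x^m}$, where the operator reduces to a kernel-weighted average over the particles; the RBF kernel, its bandwidth, and the toy vector field are arbitrary illustrative choices:

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 h^2)) evaluated for all pairs of rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_integral_operator(f_vals, X, query, bandwidth=1.0):
    """(P_mu f)(query) = (1/M) sum_m k(x^m, query) f(x^m) for mu^M = (1/M) sum_m delta_{x^m}.
    f_vals[m] holds f(x^m), shape (M, d)."""
    K = rbf_kernel(X, query, bandwidth)   # (M, n_query)
    return K.T @ f_vals / X.shape[0]      # (n_query, d)

# Usage: smooth a vector field evaluated at M particles, queried at the particles themselves.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))          # particles x^1, ..., x^M
print(kernel_integral_operator(-X, X, X).shape)  # (50, 2), a smoothed version of f(x) = -x
```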
#### Stein Variational Gradient Descent Algorithm
We apply the kernel integral operator to the $W_2$ gradient of $KL[\cdot \, \vert \, \pi]$ to obtain
\begin{align}
P_\mu \nabla \log \frac{\mu}{\pi} (\cdot) &= \int \nabla \log \frac{\mu}{\pi}(x) \, k(x, \, \cdot) \, d\mu(x) \\
&= \int - \nabla \log \pi(x) \, k(x, \, \cdot) \, d\mu(x) + \int \nabla \mu(x) \, k(x, \, \cdot) \, dx
\end{align}
Assume the mild zero boundary condition $\underset{\lVert x \rVert \rightarrow \infty}{\lim} k(x, \, \cdot) \, \mu(x) = 0$. Integrating the second term by parts, we obtain
\begin{align}
P_\mu \nabla \log \frac{\mu}{\pi} (\cdot) &= \int - k(x, \, \cdot) \, \nabla \log \pi(x) \, d\mu(x) - \int \nabla_x \, k(x, \, \cdot) \, \mu(x) \, dx \\
&= - \int \left[ k(x, \, \cdot) \, \nabla \log \pi(x) \, + \, \nabla_x k(x, \, \cdot) \right] \, d\mu(x)
\end{align}
SVGD updates a set of $M$ particles $x^1, \cdots, x^M$ at each time $l \geq 0$:
\begin{align}
x_{l + 1}^{i} &= x_l^i - \frac{\gamma}{M} \, \sum_{m = 1}^M \left[ k(x_l^i, x_l^m) \, \nabla_{x_l^m} V(x_l^m) - \nabla_{x_l^m} k(x_l^i, x_l^m) \right] \\
&= x_l^i - \gamma \, P_{\mu_l^M} \nabla \log \frac{\mu_l^M}{\pi}(x_l^i), \quad \mu_l^M = \frac{1}{M} \sum_{m = 1}^{M} \delta_{x_l^m}
\end{align}

#### SVGD from the Wasserstein Space Perspective
\begin{align}
\mu_{l + 1} = (Id - \gamma \, h_{\mu_l})_\#\mu_l
\end{align}
where $h_{\mu_l} = P_{\mu_l} \nabla \log \frac{\mu_l}{\pi}$.

#### Gradient Descent Interpretation
The Taylor expansion around $\mu$ for $h \in \mathcal{H}_k$ yields
\begin{align}
KL[(Id + \varepsilon h)_\#\mu \, \vert \, \pi ] = KL[\mu \, \vert \, \pi] + \varepsilon \langle h_\mu , h \rangle_{\mathcal{H}_k} + o(\varepsilon)
\end{align}
Therefore, $h_\mu$ plays the role of the Wasserstein gradient in $\mathcal{H}_k$. As a consequence, $h_\mu$ satisfies the descent lemma
\begin{align}
\frac{KL[\mu_{l + 1} \, \vert \, \pi] - KL[\mu_l \, \vert \, \pi]}{\gamma} \leq - \frac{1}{2}\lVert h_{\mu_l} \rVert^2_{\mathcal{H}_k}
\end{align}

### Repulsive Deep Ensembles

#### Parameter Space
The Wasserstein gradient flow of the KL divergence is
\begin{align}
\frac{\partial \mu_t}{\partial t} (x) &= \nabla \cdot \left( \mu_t(x) \, \nabla_{W_2} KL[\mu_t \, \vert \, \pi] \right) \\
&= \nabla \cdot \left( \mu_t(x) \nabla(\log \mu_t(x) - \log \pi(x)) \right)
\end{align}
As a result, we can define the (deterministic) dynamics of $x$ as
\begin{align}
\frac{dx}{dt} = - \nabla (\log \mu_t(x) \, - \, \log \pi(x))
\end{align}
Applying the forward Euler discretization gives us
\begin{align}
x_{l + 1}^i = x_l^i + \varepsilon_l \left( \nabla_x (\log \pi(x_l^i) - \log \mu_l(x_l^i) ) \right )
\end{align}
Recall that we do not have access to $\mu_l$; we only have the samples $x^1, \cdots, x^M$. We therefore estimate the density with the help of a kernel $k$ (a kernel density estimate):
\begin{align}
\mu_l(x) \approx \tilde{\mu}_l(x) = \frac{1}{M} \sum_{m = 1}^M k(x, x_l^m)
\end{align}
Subsequently, we compute the log-gradient of $\tilde{\mu}_l$ using the chain rule:
\begin{align}
\nabla \log \mu_l(x_l^i) \approx \frac{\sum_{m=1}^M \nabla_{x_l^i} k(x_l^i, \, x_l^m)}{\sum_{m=1}^M k(x_l^i, \, x_l^m)}
\end{align}
Substituting this estimate back into the forward Euler update, we obtain
\begin{align}
x_{l + 1}^i = x_l^i + \varepsilon_l \left( \nabla_x \log \pi(x_l^i) - \frac{\sum_{m=1}^M \nabla_{x_l^i} k(x_l^i, \, x_l^m)}{\sum_{m=1}^M k(x_l^i, \, x_l^m)} \right )
\end{align}
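A minimal sketch of this update, assuming an RBF kernel and a standard-Gaussian toy target; `log_prob_grad`, the step size, and the bandwidth are illustrative placeholders (in practice the bandwidth is usually set by a heuristic such as the median of pairwise distances):

```python
import numpy as np

def rbf_and_grad(X, bandwidth=1.0):
    """Return K[i, m] = k(x^i, x^m) and G[i, m] = grad_{x^i} k(x^i, x^m) for an RBF kernel."""
    diffs = X[:, None, :] - X[None, :, :]                        # (M, M, d)
    K = np.exp(-(diffs ** 2).sum(-1) / (2.0 * bandwidth ** 2))   # (M, M)
    G = -diffs / bandwidth ** 2 * K[:, :, None]                  # gradient w.r.t. the first argument
    return K, G

def kde_repulsive_step(X, log_prob_grad, step=1e-2, bandwidth=1.0):
    """One step of x^i <- x^i + step * (grad log pi(x^i)
                                        - sum_m grad_{x^i} k(x^i, x^m) / sum_m k(x^i, x^m))."""
    K, G = rbf_and_grad(X, bandwidth)
    repulsion = G.sum(axis=1) / K.sum(axis=1, keepdims=True)     # (M, d)
    return X + step * (log_prob_grad(X) - repulsion)

# Toy usage: particles are driven toward the target while repelling each other.
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((20, 2))
for _ in range(1_000):
    X = kde_repulsive_step(X, log_prob_grad=lambda x: -x)        # grad log pi(x) = -x
```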
Comments:
- By default, we assume the Wasserstein gradient flow of the KL divergence with a unimodal target distribution. I suspect that both SVGD and repulsive deep ensembles will perform poorly if the true posterior is multimodal. Potential solution: a product of manifolds.
- Since a stationary kernel $k(x, y)$ can be written as $k(x - y)$, almost any stationary kernel will work as a repellent.

Question:
- Is there any constraint on the step size $\varepsilon_l$? For example, SGLD puts the following constraints on the step size: $\sum_{l = 1}^\infty \varepsilon_l = \infty$ and $\sum_{l = 1}^\infty \varepsilon_l^2 < \infty$.

#### Functional Space
The Wasserstein gradient flow of the KL divergence in function space is
\begin{align}
\frac{\partial \mu_t}{\partial t} (f) &= \nabla \cdot \left( \mu_t(f) \, \nabla_{W_2} KL[\mu_t \, \vert \, \pi] \right) \\
&= \nabla \cdot \left( \mu_t(f) \nabla(\log \mu_t(f) - \log \pi(f)) \right)
\end{align}
As a result, we can define the dynamics of $f$ as
\begin{align}
\frac{df}{dt} = - \nabla (\log \mu_t(f) \, - \, \log \pi(f))
\end{align}
Applying the forward Euler discretization gives us
\begin{align}
f_{l+1}^i = f_l^i + \varepsilon_l \, \nabla_f (\log \pi(f_l^i) \, - \, \log \mu_l(f_l^i))
\end{align}
Again, we do not have access to $\mu_l$; we only have the samples $f^1, \cdots, f^M$, so we estimate the density with the help of a kernel $k$:
\begin{align}
f_{l + 1}^{i} = f_l^i + \varepsilon_l \left ( \nabla_f \log \pi(f_l^i) - \frac{\sum_{m = 1}^M \nabla_{f_l^i} \, k(f_l^i, f_l^m)}{\sum_{m=1}^M k(f_l^i, f_l^m)} \right )
\end{align}
Since the function $f$ is induced by the parameters $x$, the end goal is to update $x$ via the chain rule:
\begin{align}
x_{l+1}^i = x_l^i + \varepsilon_l \left(\frac{\partial f_l^i}{\partial x_l^i}\right)^\intercal \left( \nabla_f \log \pi(f_l^i) - \nabla_f \log \mu_l(f_l^i) \right)
\end{align}
where $\nabla_f \log \mu_l(f_l^i)$ is replaced by the kernel estimate above.

### Stein Gradient Estimator (SGE)
Stein's identity:
\begin{align}
\mathbb{E}_q[h(x) \nabla_x \log q(x)^\intercal + \nabla_x h(x)] = 0
\end{align}
Settings: Consider $M$ i.i.d. samples drawn from $q(x)$.
\begin{align}
& \mathcal{X} \subset \mathbb{R}^d \\
& h : \mathcal{X} \rightarrow \mathbb{R}^{d^\prime} \\
& H = [h(x^1), \cdots, h(x^M)] \in \mathbb{R}^{d^\prime \times M} \\
& G = [\nabla_{x^1} \log q(x^1), \cdots, \nabla_{x^M} \log q(x^M)]^\intercal \in \mathbb{R}^{M \times d}
\end{align}
Applying Monte Carlo sampling to Stein's identity gives us
\begin{align}
& - \frac{1}{M} \sum_{m = 1}^M h(x^m) \, \nabla_{x^m} \log q(x^m)^\intercal = \frac{1}{M} \sum_{m = 1}^M \nabla_{x^m} h(x^m) \\
& - \frac{1}{M} HG = \overline{\nabla_x h}
\end{align}
where $\overline{\nabla_{x} h} \in \mathbb{R}^{d^\prime \times d}$ and $\nabla_{x^m} h(x^m) = [\nabla_{x^m} h_1(x^m), \cdots, \nabla_{x^m} h_{d^\prime}(x^m)]^\intercal$. Given the above equations, we define the ridge regression problem
\begin{align}
\underset{\hat{G} \in \mathbb{R}^{M \times d}}{\min} \left\lVert \overline{\nabla_x h} + \frac{1}{M} H \hat{G} \right\rVert^2_F + \frac{\eta}{M^2} \lVert \hat{G} \rVert_F^2
\end{align}
We set the gradient to zero and solve for $\hat{G}$:
\begin{align}
& \frac{2}{M} H^\intercal \left(\overline{\nabla_x h} + \frac{1}{M} H\hat{G}\right) + \frac{2\eta}{M^2} \hat{G} = 0 \\
& \frac{1}{M} H^\intercal \, \overline{\nabla_x h} = - \frac{H^\intercal H + \eta I}{M^2} \, \hat{G} \\
& \hat{G}^{\text{Stein}} = -M (K + \eta I)^{-1} \, H^\intercal \overline{\nabla_x h}
\end{align}
where $K = H^\intercal H$, so that $K_{ij} = h(x^i)^\intercal h(x^j)$. We then apply the kernel trick, $K_{ij} = k(x^i, x^j)$ with $k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$. Similarly, we obtain $(H^\intercal \overline{\nabla_{x} h})_{ij} = \frac{1}{M} \sum_{m = 1}^M \nabla_{x_j^m} k(x^i, x^m)$.
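A compact numerical sketch of the resulting estimator; after the kernel trick the $M$ factors cancel, leaving $\hat{G}^{\text{Stein}} = -(K + \eta I)^{-1} \langle \nabla, K \rangle$ with $\langle \nabla, K \rangle_{ij} = \sum_m \nabla_{x_j^m} k(x^i, x^m)$. The RBF kernel, bandwidth, and $\eta$ below are arbitrary illustrative choices:

```python
import numpy as np

def stein_gradient_estimator(X, bandwidth=1.0, eta=1e-2):
    """Estimate grad_x log q(x) at the samples X (shape (M, d)) via
    G_hat = -(K + eta I)^{-1} <grad, K>, <grad, K>_{ij} = sum_m d k(x^i, x^m) / d x^m_j."""
    M = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                          # x^i - x^m, shape (M, M, d)
    K = np.exp(-(diffs ** 2).sum(-1) / (2.0 * bandwidth ** 2))     # (M, M)
    # For the RBF kernel, d k(x^i, x^m) / d x^m = (x^i - x^m) / h^2 * k(x^i, x^m); sum over m.
    grad_K = (diffs / bandwidth ** 2 * K[:, :, None]).sum(axis=1)  # (M, d)
    return -np.linalg.solve(K + eta * np.eye(M), grad_K)

# Sanity check on a standard Gaussian, where grad log q(x) = -x.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(np.abs(stein_gradient_estimator(X) + X).mean())  # should be small relative to |x| ~ 1
```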
#### Repulsive Deep Ensembles with SGE
Estimating $\nabla \log \mu_l(x_l^i)$ with the SGE, i.e. taking the $i$-th row of $\hat{G}^{\text{Stein}} = -(K + \eta I)^{-1} \langle \nabla, K \rangle$, gives us
\begin{align}
x_{l + 1}^{i} &= x^i_l \, + \, \varepsilon_l\, (\nabla \log \pi(x_l^i) \, - \nabla \log \mu_l(x_l^i)) \\
&= x_l^i + \varepsilon_l \left( \nabla \log \pi(x_l^i) \, + \sum_{m = 1}^M (K + \eta I)_{im}^{-1} \sum_{n=1}^M \nabla_{x_l^n} k(x_l^n, x_l^m) \right)
\end{align}

### Spectral Stein Gradient Estimator (SSGE)

#### The Nyström Method
Consider the problem of finding the eigenfunctions $\{ \psi_j \}_{j \geq 1} \in L^2(\mathcal{X}, q)$ of the covariance kernel $k(x, y)$ w.r.t. the probability measure $q$:
\begin{align}
\int k(x, y) \, \psi(y) \, q(y) \, dy = \mu \, \psi(x)
\end{align}
Furthermore, we require the eigenfunctions $\{\psi_j\}_{j \geq 1}$ to be orthonormal under $q$, i.e.
\begin{align}
\int \psi_i(x) \, \psi_j(x) \, q(x) \, dx = \delta_{ij}
\end{align}
where $\delta_{ij} = \mathbb{1} [i = j]$. We approximate the problem by applying unbiased Monte Carlo sampling to the left-hand side of the eigenfunction problem using $M$ i.i.d. samples and evaluating at the samples:
\begin{align}
\frac{1}{M} K \psi \approx \mu \psi, \quad \psi = (\psi(x^1), \cdots, \psi(x^M))^\intercal
\end{align}
The goal is to compute the eigenvectors $u_1, \cdots, u_J$ of the Gram matrix $K$ associated with its $J$ largest eigenvalues $\lambda_1 \geq \cdots \geq \lambda_J$. Comparing the approximation with $K u_j = \lambda_j u_j$ (and using the orthonormality constraint) gives
\begin{align}
& \psi_j(x^m) \approx \sqrt{M} u_{jm}, \quad m = 1, \cdots, M \\
& \mu_j \approx \frac{\lambda_j}{M}
\end{align}
Finally, plugging this solution back into the eigenfunction problem yields the Nyström approximation of the $j$-th eigenfunction at a point $x$:
\begin{align}
\psi_j(x) \approx \hat{\psi}_j(x) = \frac{\sqrt{M}}{\lambda_j} \sum_{m = 1}^M u_{jm} \, k(x, x^m)
\end{align}

#### SSGE
Consider the target gradient $g(x) = \nabla_x \log q(x)$, $g: \mathcal{X} \rightarrow \mathbb{R}^d$, whose $i$-th component is $g_i(x) = \nabla_{x_i} \log \, q(x)$. Assume that $g_1, \cdots, g_d \in L^2(\mathcal{X}, q)$. Since $\{ \psi_j \}_{j \geq 1}$ is an orthonormal basis of $L^2(\mathcal{X}, q)$, we can expand $g_i$ as
\begin{align}
g_i(x) = \sum_{j = 1}^\infty \beta_{ij} \, \psi_j(x)
\end{align}
*Proposition:* If $k(\cdot, \cdot)$ has continuous second-order partial derivatives and $k(x, \cdot)$ and $k(\cdot, x)$ are in the Stein class of $q$, then
\begin{align}
\mathbb{E}_q[\psi_j(x) g(x) + \nabla_x \psi_j(x)] = 0, \quad \forall j \geq 1
\end{align}
Substituting the gradient expansion into the proposition, and using the orthonormality of $\{\psi_j\}_{j \geq 1}$, we obtain
\begin{align}
\beta_{ij} = - \mathbb{E}_q[\nabla_{x_i} \psi_j(x)]
\end{align}
To estimate $\beta_{ij}$, we need an approximation of $\nabla_{x_i} \psi_j(x)$. The idea is that $\nabla_{x_i} \psi_j(x)$ can be obtained by differentiating the covariance kernel equation:
\begin{align}
\mu_j \nabla_{x_i} \psi_j(x) = \int \nabla_{x_i} k(x, y) \, \psi_j(y) \, q(y) \, dy
\end{align}
We then perform Monte Carlo sampling to estimate $\nabla_{x_i} \psi_j(x)$:
\begin{align}
\hat{\nabla}_{x_i} \psi_j(x) = \frac{1}{\mu_j \, M} \sum_{m = 1}^M \nabla_{x_i} k(x, x^m) \, \psi_j(x^m)
\end{align}
Substituting the approximations of $\mu_j$ and $\psi_j(x^m)$ into the Monte Carlo estimate above, and comparing with the Nyström approximation, we conclude
\begin{align}
\hat{\nabla}_{x_i} \psi_j(x) \approx \nabla_{x_i} \hat{\psi}_j(x)
\end{align}
Finally, we truncate the expansion of $g_i$ to the first $J$ terms and plug in the Nyström approximations of $\{\psi_j \}_{j = 1}^J$ to get the estimator:
\begin{align}
\hat{g}_i(x) &= \sum_{j = 1}^J \hat{\beta}_{ij} \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \nabla_{x_i} \hat{\psi}_j(x^m) \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \, \frac{1}{\mu_j \, M} \sum_{k = 1}^M \nabla_{x_i^m} \, k(x^m, x^k) \, \psi_j(x^k) \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \, \frac{1}{\lambda_j} \sum_{k = 1}^M \nabla_{x_i^m} \, k(x^m, x^k) \, \sqrt{M} \, u_{jk} \, \frac{\sqrt{M}}{\lambda_j} \sum_{n = 1}^M u_{jn} \, k(x, x^n) \\
&= - \sum_{j = 1}^J \frac{1}{\lambda_j^2} \sum_{m = 1}^M \sum_{k = 1}^M u_{jk} \, \nabla_{x_i^m} \, k(x^m, x^k) \sum_{n = 1}^M u_{jn} \, k(x, x^n)
\end{align}

#### Repulsive Deep Ensembles with SSGE
Estimating $\nabla \log \mu_l(x_l^i)$ with the SSGE gives us
\begin{align}
x_{l + 1}^{i} &= x^i_l \, + \, \varepsilon_l\, (\nabla \log \pi(x_l^i) \, - \nabla \log \mu_l(x_l^i)) \\
&= x_l^i + \varepsilon_l \left( \nabla \log \pi(x_l^i) \, + \sum_{j = 1}^J \frac{1}{\lambda_j^2} \, \sum_{m = 1}^M \, \sum_{k = 1}^M u_{jk} \, \nabla_{x_l^m} k(x_l^m, x_l^k) \, \sum_{n = 1}^M u_{jn} \, k(x_l^i, x_l^n) \right)
\end{align}
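A compact sketch of this estimator evaluated at the sample points, assuming an RBF kernel; the bandwidth, the number of eigenpairs $J$, and the Gaussian sanity check are arbitrary illustrative choices:

```python
import numpy as np

def ssge(X, J=6, bandwidth=1.0):
    """Spectral Stein gradient estimate of grad_x log q at the samples X (shape (M, d))."""
    M = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                        # (M, M, d)
    K = np.exp(-(diffs ** 2).sum(-1) / (2.0 * bandwidth ** 2))   # Gram matrix (M, M)
    grad_K1 = -diffs / bandwidth ** 2 * K[:, :, None]            # d k(x^m, x^k) / d x^m, (M, M, d)
    # Top-J eigenpairs of the Gram matrix (eigh returns them in ascending order).
    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1][:J], U[:, ::-1][:, :J]                    # (J,), (M, J)
    # Nystrom eigenfunctions at the samples: psi[m, j] = sqrt(M)/lam_j * sum_n u_{jn} k(x^m, x^n).
    psi = np.sqrt(M) / lam * (K @ U)                             # (M, J)
    # beta[j] = -(1/M) sum_m grad_{x^m} psi_hat_j(x^m).
    grad_psi = np.sqrt(M) / lam[None, :, None] * np.einsum('mkd,kj->mjd', grad_K1, U)  # (M, J, d)
    beta = -grad_psi.mean(axis=0)                                # (J, d)
    return psi @ beta                                            # (M, d): estimate of g at the samples

# Sanity check on a standard Gaussian, where grad log q(x) = -x.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(np.abs(ssge(X) + X).mean())  # should be small relative to |x| ~ 1
```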