---
title: Notes on Repulsive Deep Ensembles
---
# Repulsive Deep Ensembles
1. **What does the paper do?**
*The paper adds a kernelized repulsive term to the deep ensemble training objective.*
2. **What is the motivation of the repulsive term?**
*Without it, the ensemble members tend to collapse to the mode of the posterior. Consequently, the ensemble fails to act as a Bayesian model average.*
3. **What is the inspiration of this paper?**
*Stein variational gradient descent (SVGD). The SVGD update consists of two terms: a driving term based on the gradient of the log target density and a kernel-gradient term. The latter acts as a repulsive force that keeps each particle away from the others, so the particles spread out to cover the target distribution.*
4. **What is the analysis of the proposed method?**
*Both SVGD and the repulsive update can be interpreted as discretizations of the Wasserstein gradient flow of the Kullback-Leibler divergence over the space of probability measures.*
5. **Experiments ?**
*Sampling from synthetic distributions, regression problems, and 2D and image classification.*
## Wasserstein Gradient Flow
### Geometry of $(\mathcal{P}_2(\mathcal{M}), W_2)$
#### Definition of Wasserstein Space
Let $\mathcal{P}_2(\mathcal{M})$ be the space of probability measures on $\mathcal{M}$ with finite second moment, i.e. $\mathcal{P}_2(\mathcal{M}) = \{ \mu \in \mathcal{P}(\mathcal{M}) : \int_{\mathcal{M}} \lVert x \rVert^2 \, d \mu(x) < +\infty \}$. We equip $\mathcal{P}_2(\mathcal{M})$ with the Wasserstein-2 distance. For all $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$, the Wasserstein distance between $\mu$ and $\nu$ is written as follows:
\begin{align}
W_{2}^2 (\mu, \nu) = \underset{s \, \in \, \Pi(\mu, \nu)}{\inf} \int \lVert x - y \rVert^2 \, ds(x, y)
\end{align}
where $\Pi(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$. The metric space $(\mathcal{P}_2(\mathcal{M}), W_2)$ is called the Wasserstein space.
#### Riemannian Structure of $(\mathcal{P}_2(\mathcal{M}), W_2)$ and $L^2$ Spaces
We define
\begin{align}
\mathcal{T}_\mu \mathcal{P}_2(\mathcal{M}) \subset L^2(\mu) = \{f: \mathbb{R}^d \rightarrow \mathbb{R}^d, \, \int_{\mathcal{M}} \lVert f(x) \rVert^2 \, d\mu(x) < +\infty \}
\end{align}
as the space of vector-valued functions that are square-integrable w.r.t. $\mu$. It is a Hilbert space equipped with the inner product
\begin{align}
\langle f, g \rangle_\mu = \int_{\mathcal{M}} \langle f(x), g(x) \rangle_{\mathbb{R}^d} \, d\mu(x)
\end{align}
#### Pushforward Measure
Let $\mu \in \mathcal{P}_2(\mathcal{M})$ and let $T: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be a measurable map. The pushforward measure $T_\#\mu$ is characterized by $x \sim \mu \implies T(x) \sim T_\#\mu$. The identity map is denoted by $Id$, i.e. $Id_\#\mu = \mu$.
#### Dynamics on $\mathcal{P}_2(\mathcal{M})$ through $L^2$ maps
If $T \in L^2(\mu)$ and $\mu \in \mathcal{P}_2(\mathcal{M})$ then $T_\#\mu \in \mathcal{P}_2(\mathcal{M})$ fulfills:
\begin{align}
\int \lVert y \rVert^2 \, d(T_\#\mu)(y) = \int \lVert T(x) \rVert^2 \, d\mu(x) < + \infty
\end{align}
*Brenier's theorem:*
Let $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$ s.t. $\mu << \text{Leb}$. Then, there is a unique $T_\mu^\nu : \mathcal{M} \rightarrow \mathcal{M}$ s.t.
1. $T_{\mu \#}^{\nu}\mu = \nu$
2. $W_2^2 (\mu, \nu) = \lVert Id - T_\mu^\nu\rVert_\mu^2 \; \; \triangleq \, \int \, \lVert x - T_\mu^\nu(x) \rVert^2 \, d\mu(x)$
$T_\mu^\nu$ is called the optimal transport map from $\mu$ to $\nu$.
#### The Notion of Wasserstein Geodesic
The Wasserstein geodesic between $\rho_0 = \mu$ and $\rho_1 = \nu$ can be written as follows:
\begin{align}
\rho_t = ((1 - t) Id \, + \, t \, T_{\mu}^\nu)_\#\mu
\end{align}
#### Convexity along Wasserstein Geodesics
Let $\mathcal{F}: \mathcal{P}_2(\mathcal{M}) \rightarrow (-\infty, +\infty]$. $\mathcal{F}$ is $\lambda$-geodesically convex with $\lambda \geq 0$ if, for all $\mu, \nu \in \mathcal{P}_2(\mathcal{M})$ and $t \in [0, 1]$:
\begin{align}
\mathcal{F}(\rho_t) \leq (1 - t) \mathcal{F}(\mu) + t \mathcal{F}(\nu) - \frac{\lambda \, t (1 - t)}{2} W_2^2(\mu, \nu)
\end{align}
where $(\rho_t)_{t \in [0, 1]}$ is a Wasserstein-2 geodesic between $\mu$ and $\nu$.
#### KL-divergence
We define the KL-divergence $KL[\mu \, \vert \, \pi]$, for a target $\pi \propto \exp(-V)$, as
\begin{align}
KL[\mu \, \vert \, \pi] &= \int \log \frac{\mu}{\pi} (x) \, d\mu(x) \\
&= \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x) + C
\end{align}
### Definition of Wasserstein Gradient Flow
#### Gradient Flow on Probability Distribution
The goal is to approximate $\pi$ by solving
\begin{align}
\underset{\mu \, \in \, \mathcal{P}_2(\mathcal{M})}{\min} \, \mathcal{F}(\mu), \quad \mathcal{F}(\mu) = KL[\mu \, \vert \, \pi]
\end{align}
We want the gradient flow of $\mathcal{F} : \mathcal{P}_2(\mathcal{M}) \rightarrow (-\infty, +\infty]$ to be something of the form
\begin{align}
\dot{\mu_t} = -\nabla_{W_2} \, \mathcal{F}(\mu_t)
\end{align}
#### Velocity Field
Definition: Given a curve $(\mu_t)_{t \geq 0}$ in $\mathcal{P}_2(\mathcal{M})$, if there exists $(v_t)_{t \geq 0}$ with $v_t \in L^2(\mu_t)$ s.t.
\begin{align}
\frac{d}{dt} \int \varphi d\mu_t = \langle \nabla \varphi, v_t \rangle_{\mu_t}
\end{align}
$\forall \varphi: \mathcal{M} \rightarrow \mathbb{R}$, then $(v_t)_{t \geq 0}$ is a velocity field of $(\mu_t)_{t \geq 0}$. The velocity field governs the dynamics of $(\mu_t)_{t \geq 0}$.
#### Continuity Equation
The velocity field satisfies the PDE
\begin{align}
\frac{\partial \mu_t}{\partial t} + \nabla . (\mu_t \, v_t) = 0
\end{align}
*Proof:*
1. $\frac{d}{dt} \int \varphi(x) \, \mu_t(x) \, dx = \int \varphi(x) \frac{\partial \mu_t}{\partial t}(x) \, dx$
2.
\begin{align}
\frac{d}{dt} \int \varphi(x) \, \mu_t(x) \, dx &= \langle \nabla \varphi, \, v_t \rangle_{\mu_t} \\
&= \int \langle \nabla \varphi(x), v_t(x) \rangle_{\mathbb{R}^d} \, \mu_t(x) \, dx \\
&= - \int \varphi(x) \, \nabla.(v_t(x) \, \mu_t(x)) dx
\end{align}
Note that the last line is obtained by integration by parts (the boundary term vanishes for test functions $\varphi$).
#### Wasserstein Gradient
Definition: Let $\mu \in \mathcal{P}_2(\mathcal{M})$. Consider a perturbation on the Wasserstein space $(Id + \varepsilon \, h)_\#\mu$ with $h \in L^2(\mu)$. If the Taylor expansion of $\mathcal{F}$ yields
\begin{align}
\mathcal{F}((Id + \varepsilon \, h)_\#\mu) = \mathcal{F}(\mu) + \varepsilon \langle \nabla_{W_2} \mathcal{F}(\mu), h \rangle_{\mu} \, + \, o(\varepsilon)
\end{align}
then $\nabla_{W_2} \mathcal{F}(\mu) \in L^2(\mu)$ is the Wasserstein gradient of $\mathcal{F}(\mu)$
#### First Variation
Definition: Let $\mu \in \mathcal{P}_2(\mathcal{M})$. Consider a linear perturbation $\mu + \varepsilon \xi \in \mathcal{P}_2(\mathcal{M})$. If the Taylor expansion of $\mathcal{F}$ yields
\begin{align}
\mathcal{F}(\mu + \varepsilon \xi) = \mathcal{F}(\mu) + \varepsilon \int \mathcal{F}^\prime(\mu)(x) \, d\xi(x) \, + \, o(\varepsilon)
\end{align}
then $\mathcal{F}^\prime(\mu) : \mathcal{M} \rightarrow \mathbb{R}$ is the first variation of $\mathcal{F}(\mu)$.
#### Wasserstein gradient = Gradient of First Variation
Given the definition of Wasserstein gradient and the gradient of the first variation, we have
\begin{align}
\nabla_{W_2} \mathcal{F}(\mu) = \nabla \mathcal{F}^\prime(\mu)
\end{align}
where $\nabla_{W_2} \mathcal{F}(\mu) : \mathcal{M} \rightarrow \mathbb{R}^d$ and $\mathcal{F}^\prime(\mu) : \mathcal{M} \rightarrow \mathbb{R}$.
#### Wasserstein Gradient of KL
Let $\mathcal{F}(\mu) = \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x)$. For $\pi \propto \exp(-V)$, the KL-divergence satisfies $KL[\mu \, \vert \, \pi] = \mathcal{F}(\mu) - \mathcal{F}(\pi)$, where $\mathcal{F}(\pi)$ is a constant independent of $\mu$. Since constants do not affect the gradient, the Wasserstein gradient of the KL is given by
\begin{align}
\nabla_{W_2} \, \mathcal{F}(\mu) = \nabla \mathcal{F}^\prime(\mu) = \nabla V + \nabla \log \mu = \nabla \log \frac{\mu}{\pi}
\end{align}
#### Velocity Field = negative Wasserstein Gradient
Recall that we want to define the equation
\begin{align}
\dot{\mu_t} = - \nabla_{W_2} \, \mathcal{F}(\mu_t)
\end{align}
The definition of the velocity field can be seen as a chain rule. For $\mathcal{F}(\mu) = \int \varphi \, d\mu$, whose Wasserstein gradient is $\nabla \varphi$,
\begin{align}
\frac{d}{dt} \int \varphi \, d\mu_t = \langle \nabla\varphi, \, v_t \rangle_{\mu_t}
\end{align}
Replacing $\nabla \varphi$ with $\nabla_{W_2} \, \mathcal{F}(\mu_t)$ for a general $\mathcal{F}$, we get
\begin{align}
\frac{d}{dt} \mathcal{F}(\mu_t) = \langle \nabla_{W_2} \, \mathcal{F}(\mu_t), v_t \rangle_{\mu_t}
\end{align}
We choose $v_t = - \nabla_{W_2} \mathcal{F}(\mu_t)$ to decrease $\mathcal{F}$. Substituting this $v_t$ back into the equation above, we obtain the descent property:
\begin{align}
\frac{d}{dt} \mathcal{F}(\mu_t) = - \lVert \nabla_{W_2} \mathcal{F}(\mu_t) \rVert^2_{L^2(\mu_t)} \leq 0
\end{align}
#### Wasserstein Gradient Flow
The Wasserstein gradient flow of $\mathcal{F}$ is governed by:
\begin{align}
v_t = - \nabla_{W_2} \mathcal{F}(\mu_t)
\end{align}
Substituting $v_t$ into the continuity equation, we obtain the Wasserstein gradient flow
\begin{align}
\frac{\partial \mu_t}{\partial t} &= - \nabla. (\mu_t v_t) \\
&= \nabla . (\mu_t \, \nabla_{W_2} \mathcal{F}(\mu_t))
\end{align}
#### Wasserstein Gradient Flow of KL Divergence
Recall the definition of KL Divergence
\begin{align}
KL[\mu \, \vert \, \pi] &= \int \log \, \frac{\mu}{\pi}(x) \, d\mu(x) \\
&= \int V(x) \, d\mu(x) + \int \log \mu(x) \, d\mu(x)
\end{align}
The Wasserstein gradient flow of the KL-divergence is the Fokker-Planck equation:
\begin{align}
\frac{\partial \mu_t}{\partial t} &= \nabla . (\mu_t \nabla_{W_2} KL[\mu_t \, \vert \, \pi]) \\
&= \nabla . (\mu_t \nabla \log \frac{\mu_t}{\pi}) \\
&= \nabla. \left(\mu_t \, \nabla_{W_2} \int V(x) \, d\mu_t(x)\right) + \nabla. \left(\mu_t \, \nabla_{W_2} \int \log \mu_t(x) \, d\mu_t(x)\right) \\
&= \nabla. (\mu_t \, \nabla V) + \nabla. (\mu_t \nabla \log \mu_t) = \nabla. (\mu_t \, \nabla V) + \nabla. \left(\mu_t \frac{\nabla \mu_t}{\mu_t}\right) \\
&= \nabla. (\mu_t \, \nabla V) + \Delta \mu_t
\end{align}
The continuity equation above specifies the Langevin diffusion $x_t \sim \mu_t$
\begin{align}
dx_t = - \nabla V(x_t) \, dt + \sqrt{2} \, dB_t
\end{align}
where $B_t$ is a Brownian motion in $\mathbb{R}^d$.
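As a sanity check, here is a minimal NumPy sketch of the Euler–Maruyama discretization of this Langevin diffusion (the unadjusted Langevin algorithm); the standard-Gaussian target, step size, and particle count are illustrative assumptions rather than anything taken from the paper.
```python
import numpy as np

def grad_V(x):
    # Illustrative target: standard Gaussian pi(x) ∝ exp(-||x||^2 / 2),
    # so V(x) = ||x||^2 / 2 and grad V(x) = x.
    return x

def langevin_step(x, step, rng):
    # Euler-Maruyama step: x_{t+dt} = x_t - grad V(x_t) dt + sqrt(2 dt) * N(0, I)
    noise = rng.standard_normal(x.shape)
    return x - step * grad_V(x) + np.sqrt(2.0 * step) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))        # 1000 particles in R^2
for _ in range(500):
    x = langevin_step(x, step=1e-2, rng=rng)
print(x.mean(axis=0), x.var(axis=0))      # roughly 0 and 1 for the Gaussian target
```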
#### Convergence of Wasserstein Gradient Flow
- If $\mathcal{F}$ is $\lambda$-geodesically convex, then
\begin{align}
W_2(\mu_t, \pi) \leq \exp(- \lambda \, t) W_2(\mu_0, \pi)
\end{align}
- If we assume the inequality
\begin{align}
\lVert \nabla_{W_2} \mathcal{F}(\mu_t) \rVert^2_{L^2(\mu_t)} \geq \frac{1}{\lambda} \mathcal{F}(\mu_t)
\end{align}
then, combining with the descent property and applying Grönwall's lemma, we have
\begin{align}
\mathcal{F}(\mu_t) \leq \exp(- t / \lambda) \, \mathcal{F}(\mu_0)
\end{align}
### Sampling Algorithm
#### Discretization of Wasserstein Gradient Flow
For a time step $\gamma > 0$:
1. Wasserstein gradient descent or Forward Euler
\begin{align}
\mu_{l + 1} = (Id - \gamma \nabla_{W_2} \mathcal{F}(\mu_l) )_\#\mu_l
\end{align}
Given $x_0 \sim \mu_0$, the gradient descent over $\mathcal{M}$ is written as
\begin{align}
x_{l + 1} = x_l \, - \, \gamma \nabla \log \frac{\mu_l}{\pi}(x_l) \sim \mu_{l + 1} \quad \text{if} \;\; x_l \sim \mu_l
\end{align}
Problem: $\mu_l$ is not known; we only have samples $x_l \sim \mu_l$.
Question:
- Since $\mathcal{M}$ is compact and the update depends on $- \nabla_{W_2} \mathcal{F}(\mu_l)$ (which lies in the tangent space), the updated $\mu_{l + 1}$ may leave the manifold. How do we mitigate this issue?
Potential solution: project $\mu_{l + 1}$ onto the nearest probability measure on $\mathcal{M}$.
2. Jordan-Kinderlehrer-Otto scheme
\begin{align}
\mu_{l + 1} \in JKO_{\gamma \, \mathcal{F}}(\mu_l) \, \triangleq \underset{\mu \, \in \, \mathcal{P}_2(\mathcal{M})}{\text{argmin}} \left \{ \gamma \, \mathcal{F}(\mu) + \frac{1}{2} W_2^2 (\mu, \mu_l) \right \}
\end{align}
3. Splitting scheme ($\mathcal{F} = \mathcal{F}_1 + \mathcal{F}_2$, $\mathcal{F}_2$ is geo. convex)
\begin{align}
&\mu_{l + \frac{1}{2}} = (Id - \gamma \nabla_{W_2} \, \mathcal{F}_1(\mu_l))_\#\mu_l \\
&\mu_{l + 1} = JKO_{\gamma \mathcal{F}_2} (\mu_{l + \frac{1}{2}})
\end{align}
#### Kernel and RKHS
- $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a positive semi-definite kernel.
- Its corresponding RKHS
\begin{align}
\mathcal{H}_k = \overline{\left \{ \sum_{m = 1}^M \alpha_m \, k(. , x^m); \; \; M \in \mathbb{N}; \; \; \alpha_1, \cdots, \alpha_M \in \mathbb{R}^d; \; \; x^1, \cdots, x^M \in \mathbb{R}^d \right \} }
\end{align}
- $\mathcal{H}_k$ is a Hilbert space with inner product $\langle ., . \rangle_{\mathcal{H}_k}$ and norm $\lVert . \rVert_{\mathcal{H}_k}$
- $\forall \mu \in \mathcal{P}(\mathcal{M}), \; \; \int_{\mathbb{R}^d} k(x, x) \, d\mu(x) < + \infty \implies \mathcal{H}_k \subset L^2(\mu)$
- Reproducing property: $\forall f \in \mathcal{H}_k, \; \; f(x) = \langle f, k(x, .) \rangle_{\mathcal{H}_k}$
#### Stein Variational Gradient Descent
Consider the following metric depending on $k$:
\begin{align}
W_k^2(\mu_0, \mu_1) = \underset{(\mu_t, v_t)_{t \in [0, 1]}}{\inf} \left \{ \int_0^1 \lVert v_t \rVert^2_{\mathcal{H}_k} \, dt \; : \; \frac{\partial \mu_t}{\partial t} + \nabla . (\mu_t \, v_t) = 0 \right \}
\end{align}
The $W_k^2$ gradient flow of KL-divergence can be written as the PDE
\begin{align}
\frac{\partial \mu_t}{\partial t} = \nabla. \left(\mu_t \, P_{\mu_t} \nabla \log \frac{\mu_t}{\pi}\right)
\end{align}
where $P_\mu : L^2(\mu) \rightarrow \mathcal{H}_k$ is the kernel integral operator
\begin{align}
P_\mu \, f(.) = \int k(x, .) \, f(x) \, d\mu(x)
\end{align}
#### Stein Variational Gradient Descent Algorithm
We apply the kernel integral operator to the $W_2$ gradient of $KL[\cdot \, \vert \, \pi]$ at $\mu$ to obtain
\begin{align}
P_\mu \nabla \log \frac{\mu}{\pi} (.) &= \int \nabla \log \frac{\mu}{\pi}(x) \, k(x, \, .) \, d\mu(x) \\
&= \int - \nabla \log \pi(x) \, k(x, \, .) \, d\mu(x) + \int \nabla \mu(x) \, k(x, \, .) \, dx
\end{align}
Assume the mild zero boundary condition $\underset{\lVert x \rVert \rightarrow \infty}{\lim} k(x, \, .) \, \mu(x) = 0$, so the boundary term of the integration by parts vanishes. Thus, we obtain
\begin{align}
P_\mu \nabla \log \frac{\mu}{\pi} &= \int - k(x, \, .) \, \nabla \log \pi(x) \, d\mu(x) + \int \nabla \mu(x) \, k(x, \, .) \, dx \\
&= \int - k(x, \, .) \, \nabla \log \pi(x) \, d\mu(x) - \int \, \nabla_x \, k(x, \, .) \, \mu(x) \, dx \quad \scriptsize \text{integration by parts} \\
&= - \int \left[ k(x, \, .) \, \nabla \log \pi(x) \, + \, \nabla_x k(x, \, .) \right] \, d\mu(x)
\end{align}
SVGD updates a set of $M$ particles $x^1, \cdots, x^M$ at each time $l \geq 0$.
\begin{align}
x_{l + 1}^{i} &= x_l^i - \frac{\gamma}{M} \, \sum_{m = 1}^M \left[ k(x_l^i, x_l^m) \, \nabla_{x_l^m} V(x_l^m) - \nabla_{x_l^m} k(x_l^i, x_l^m) \right] \\
&= x_l^i - \gamma \, P_{\mu_l^M} \nabla \log \frac{\mu_l^M}{\pi}(x_l^i), \quad \mu_l^M = \frac{1}{M} \sum_{m = 1}^{M} \delta_{x_l^m}
\end{align}
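A minimal NumPy sketch of this particle update with an RBF kernel; the standard-Gaussian target (`grad_log_pi`), the fixed bandwidth `h`, and the step size are illustrative assumptions rather than the paper's settings.
```python
import numpy as np

def rbf_kernel(X, h):
    # K[m, i] = k(x^m, x^i) = exp(-||x^m - x^i||^2 / (2h))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))

def svgd_step(X, grad_log_pi, step, h=1.0):
    M = X.shape[0]
    K = rbf_kernel(X, h)
    # Driving term: (1/M) sum_m k(x^m, x^i) grad log pi(x^m)
    drive = K.T @ grad_log_pi(X) / M
    # Repulsive term: (1/M) sum_m grad_{x^m} k(x^m, x^i) = (1/(M h)) sum_m k(x^m, x^i) (x^i - x^m)
    repulse = (np.sum(K, axis=0)[:, None] * X - K.T @ X) / (M * h)
    return X + step * (drive + repulse)

grad_log_pi = lambda X: -X                  # standard Gaussian target (assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2)) * 0.1     # particles start nearly collapsed
for _ in range(1000):
    X = svgd_step(X, grad_log_pi, step=1e-1)
print(X.mean(axis=0), X.var(axis=0))        # particles spread towards unit variance
```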
#### SVGD from the Wasserstein Space Perspective
\begin{align}
&\mu_{l + 1} = (Id - \gamma \, h_{\mu_l})_\#\mu_l
\end{align}
where $h_{\mu_l} = P_{\mu_l} \nabla \log \frac{\mu_l}{\pi}$.
#### Gradient Descent Interpretation
Taylor expansion around $\mu$ for $h \in \mathcal{H}_k$ yields
\begin{align}
KL[(Id + \varepsilon h)_\#\mu \, \vert \, \pi ] = KL[\mu \, \vert \, \pi] + \varepsilon \langle h_\mu , h \rangle_{\mathcal{H}_k} + o(\varepsilon)
\end{align}
Therefore, $h_\mu$ plays the role of the Wasserstein gradient in $\mathcal{H}_k$. As a consequence, $h_\mu$ satisfies the descent lemma
\begin{align}
\frac{KL[\mu_{l + 1} \, \vert \, \pi] - KL[\mu_l \, \vert \, \pi]}{\gamma} \leq - \frac{1}{2}\lVert h_{\mu_l} \rVert^2_{\mathcal{H}_k}
\end{align}
### Repulsive Deep Ensembles
#### Parameter Space
The Wasserstein gradient flow of the KL-divergence is
\begin{align}
\frac{\partial \mu_t}{\partial t} (x) &= \nabla \, . \left( \mu_t(x) \, \nabla_{W_2} KL[\mu_t \, \vert \, \pi] \right) \\
&= \nabla \, . \left( \mu_t(x) \nabla(\log \mu_t(x) - \log \pi(x)) \right)
\end{align}
As a result, we can define the dynamics of $x$ as follows
\begin{align}
&\frac{dx}{dt} = - \nabla (\log \mu_t(x) \, - \, \log \pi(x)) \\
\end{align}
Applying the forward Euler discretization gives us
\begin{align}
&x_{l + 1}^i = x_l^i + \varepsilon_l \left( \nabla_x (\log \pi(x_l^i) - \log \mu_l(x_l^i) ) \right )
\end{align}
Recall that we do not have access to $\mu_l$; we only have the samples $x^1, \cdots, x^M$. We therefore estimate the probability measure with the help of the kernel $k$:
\begin{align}
&\mu_l(x) \approx \tilde{\mu}_l(x) = \frac{1}{M} \sum_{m = 1}^M k(x, x_l^m)
\end{align}
Subsequently, we compute the log gradient of $\mu_l$ using the chain rule
\begin{align}
&\nabla \log \mu_l(x_l^i) \approx \frac{\sum_{m=1}^M \nabla_{x_l^i} k(x_l^i, \, x_l^m)}{\sum_{m=1}^M k(x_l^i, \, x_l^m)}
\end{align}
After substituting the estimation back to the forward Euler equation, we obtain
\begin{align}
x_{l + 1}^i = x_l^i + \varepsilon_l \left( \nabla_x \log \pi(x_l^i) - \frac{\sum_{m=1}^M \nabla_{x_l^i} k(x_l^i, \, x_l^m)}{\sum_{m=1}^M k(x_l^i, \, x_l^m)} \right )
\end{align}
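A minimal NumPy sketch of this kernel-density-estimate variant of the update; as above, the standard-Gaussian `grad_log_pi`, the RBF kernel, and the fixed bandwidth are illustrative assumptions.
```python
import numpy as np

def rbf_kernel(X, h):
    # K[i, m] = k(x^i, x^m)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))

def repulsive_step(X, grad_log_pi, step, h=1.0):
    K = rbf_kernel(X, h)
    # KDE estimate of grad log mu_l(x^i): sum_m grad_{x^i} k(x^i, x^m) / sum_m k(x^i, x^m)
    grad_k_sum = -(np.sum(K, axis=1)[:, None] * X - K @ X) / h
    repulsion = grad_k_sum / np.sum(K, axis=1)[:, None]
    return X + step * (grad_log_pi(X) - repulsion)

grad_log_pi = lambda X: -X                  # standard Gaussian target (assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) * 0.1      # ensemble members start nearly collapsed
for _ in range(2000):
    X = repulsive_step(X, grad_log_pi, step=1e-2)
print(X.var(axis=0))                        # the repulsion keeps the members spread out
```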
Comments:
- By default, we assume the Wasserstein gradient flow of the KL divergence with a unimodal target distribution. I suspect that both SVGD and repulsive deep ensembles will perform poorly if the true posterior is multimodal.
Potential solution: a product of manifolds.
- Since a stationary kernel $k(x, y)$ can be written as $k(x - y)$, almost any stationary kernel can act as a repellent.
Question:
- Is there any constraint on the step size $\varepsilon_l$? For example, SGLD puts the following constraints on the step size $\varepsilon_l$:
$\sum_{l = 1}^\infty \varepsilon_l = \infty$ and $\sum_{l = 1}^\infty \varepsilon_l^2 < \infty$
#### Functional Space
The Wasserstein gradient flow of the KL-divergence is
\begin{align}
\frac{\partial \mu_t}{\partial t} (f) &= \nabla \, . \left( \mu_t(f) \, \nabla_{W_2} KL[\mu_t \, \vert \, \pi] \right) \\
&= \nabla \, . \left( \mu_t(f) \nabla(\log \mu_t(f) - \log \pi(f)) \right)
\end{align}
As a result, we can define the dynamics of $f$ as follows
\begin{align}
&\frac{df}{dt} = - \nabla (\log \mu_t(f) \, - \, \log \pi(f)) \\
\end{align}
Applying the forward Euler discretization gives us
\begin{align}
&f_{l+1}^i = f_l^i + \varepsilon_l \, \nabla (\log \pi(f_l^i) \, - \, \log \mu_l(f_l^i))
\end{align}
Recall that we do not have access to $\mu_l$; we only have the samples $f^1, \cdots, f^M$. Estimating $\nabla \log \mu_l$ with the kernel $k$, as before, gives
\begin{align}
f_{l + 1}^{i} = f_l^i + \varepsilon_l \left ( \nabla_f \log \pi(f_l^i) - \frac{\sum_{m = 1}^M \nabla_{f_l^i} \, k(f_l^i, f_l^m)}{\sum_{m=1}^M k(f_l^i, f_l^m)} \right ) \\
\end{align}
Since the function $f$ is induced by the parameter $x$, the end goal is to update $x$ via the chain rule, i.e.
\begin{align}
x_{l+1}^i = x_l^i + \varepsilon_l \left(\frac{\partial f_l^i}{\partial x_l^i}\right)^\intercal \left ( \nabla_f \log \pi(f_l^i) - \frac{\sum_{m = 1}^M \nabla_{f_l^i} \, k(f_l^i, f_l^m)}{\sum_{m=1}^M k(f_l^i, f_l^m)} \right )
\end{align}
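A rough sketch of this function-space update under strong simplifying assumptions: the model is linear, $f^i = \Phi\,\theta^i$, so the Jacobian $\partial f^i / \partial \theta^i$ is just the feature matrix $\Phi$, and the function-space log-density gradient comes from a Gaussian likelihood around targets `y`; all of these choices (and the names `Phi`, `Theta`, `y`) are illustrative, not the paper's setup.
```python
import numpy as np

def rbf_kernel(F, h):
    # Kernel between the members' function values F[i] = f^i (evaluated on a batch).
    sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))

def functional_repulsive_step(Theta, Phi, y, step, noise=1.0, h=1.0):
    F = Theta @ Phi.T                             # f^i = Phi theta^i, shape (M, n)
    grad_log_pi_f = (y - F) / noise**2            # assumed Gaussian likelihood in function space
    K = rbf_kernel(F, h)
    grad_k_sum = -(np.sum(K, axis=1)[:, None] * F - K @ F) / h
    repulsion = grad_k_sum / np.sum(K, axis=1)[:, None]
    # Chain rule back to parameter space: the Jacobian of f^i w.r.t. theta^i is Phi.
    return Theta + step * (grad_log_pi_f - repulsion) @ Phi

rng = np.random.default_rng(0)
n, d, M = 20, 3, 10                               # batch size, parameter dim, ensemble size
Phi = rng.standard_normal((n, d))                 # fixed features of the linear model
y = Phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
Theta = rng.standard_normal((M, d)) * 0.01        # ensemble of parameter vectors
for _ in range(500):
    Theta = functional_repulsive_step(Theta, Phi, y, step=1e-2)
print(Theta.mean(axis=0), Theta.std(axis=0))      # members fit the data yet stay diverse
```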
### Stein Gradient Estimator (SGE)
Stein's identity:
\begin{align}
\mathbb{E}_q[h(x) \nabla_x \log q(x)^\intercal + \nabla_x h(x)] = 0
\end{align}
Settings:
Consider $M$ i.i.d. samples drawn from $q(x)$.
\begin{align}
& \mathcal{X} \subset \mathbb{R}^d \\
& h : \mathcal{X} \rightarrow \mathbb{R}^{d^\prime} \\
& H = [h(x^1), \cdots, h(x^M)] \in \mathbb{R}^{d^\prime \times M} \\
& G = [\nabla_{x^1} \log q(x^1), \cdots, \nabla_{x^M} \log q(x^M)]^\intercal \in \mathbb{R}^{M \times d}
\end{align}
Applying Monte Carlo estimation to Stein's identity gives us
\begin{align}
& - \frac{1}{M} \sum_{m = 1}^M h(x^m) \, \nabla_{x^m} \log q(x^m)^\intercal \approx \frac{1}{M} \sum_{m = 1}^M \nabla_{x^m} h(x^m) \\
& - \frac{1}{M} HG \approx \overline{\nabla_x h}
\end{align}
where $\overline{\nabla_{x} h} \in \mathbb{R}^{d^\prime \times d}$ and $\nabla_{x^m} h(x^m) = [\nabla_{x^m} h_1(x^m), \cdots, \nabla_{x^m} h_{d^\prime}(x^m)]^\intercal$. Given the above equations, we define the ridge regression problem
\begin{align}
\underset{\hat{G} \in \mathbb{R}^{M \times d}}{\min} \lVert \overline{\nabla_x h} + \frac{1}{M} H \hat{G} \rVert^2_F + \frac{\eta}{M^2} \lVert \hat{G} \rVert_F^2
\end{align}
We set the gradient to zero and solve for $\hat{G}$
\begin{align}
&\frac{2}{M} H^\intercal \left(\overline{\nabla_x h} + \frac{1}{M} H\hat{G}\right) + \frac{2\eta}{M^2} \hat{G} = 0 \\
& \frac{1}{M} H^\intercal \; \overline{\nabla_x h} = \left ( \frac{- (H^\intercal H + \eta I)}{M^2} \right ) \hat{G} \\
& \hat{G}^{stein} = -M (K + \eta \mathbb{I})^{-1} \, H^\intercal \overline{\nabla_x h}
\end{align}
where $K = H^\intercal H$. Consequently, we have $K_{ij} = h(x^i)^\intercal h(x^j)$. We then apply the kernel trick such that $K_{ij} = k(x^i, x^j), \quad k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$. Similarly, we obtain $(H^\intercal \overline{\nabla_{x} h})_{ij} = \frac{1}{M} \sum_{m = 1}^M \nabla_{x_j^m} k(x^i, x^m)$.
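A minimal NumPy sketch of this estimator with an RBF kernel (the factor $M$ cancels against the $1/M$ inside $H^\intercal \overline{\nabla_x h}$); the standard-Gaussian samples, the bandwidth, and the regularizer $\eta$ are illustrative assumptions.
```python
import numpy as np

def rbf_kernel(X, h):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))

def stein_gradient_estimator(X, h=1.0, eta=1e-3):
    # G_hat = -(K + eta I)^{-1} <grad, K>, where <grad, K>[i] = sum_m grad_{x^m} k(x^i, x^m).
    M = X.shape[0]
    K = rbf_kernel(X, h)
    nabla_K = (np.sum(K, axis=1)[:, None] * X - K @ X) / h
    return -np.linalg.solve(K + eta * np.eye(M), nabla_K)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                 # samples from a standard Gaussian (assumption)
G_hat = stein_gradient_estimator(X, h=2.0)
print(np.mean(np.abs(G_hat - (-X))))              # compare with the true score grad log q(x) = -x
```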
#### Repulsive Deep Ensembles with SGE
Estimating $\nabla \log \mu_l(x_l^i)$ with the SGE gives us
\begin{align}
x_{l + 1}^{i} &= x^i_l \, + \, \varepsilon_l\, (\nabla \log \pi(x_l^i) \, - \nabla \log \mu_l(x_l^i)) \\
&= x_l^i + \varepsilon_l \left( \nabla \log \pi(x_l^i) \, + \sum_{m = 1}^M (K + \eta \mathbb{I})_{im}^{-1} \sum_{n=1}^M \nabla_{x_l^n} k(x_l^m, x_l^n) \right)
\end{align}
### Spectral Stein Gradient Estimator (SSGE)
#### The Nyström method
Consider the problem of finding the eigenfunctions $\{ \psi_j \}_{j \geq 1} \subset L^2(\mathcal{X}, q)$ of the covariance kernel $k(x, y)$ w.r.t. the probability measure $q$:
\begin{align}
\int k(x, y) \, \psi(y) \, q(y) dy = \mu \, \psi(x)
\end{align}
Furthermore, we require the eigenfunctions $\{\psi_j\}_{j \geq 1}$ to be orthonormal under $q$, i.e.
\begin{align}
\int \psi_i(x) \, \psi_j(x) \, q(x) dx = \delta_{ij}
\end{align}
where $\delta_{ij} = \mathbb{1} [i = j]$. We approximate the problem by applying unbiased Monte Carlo estimation to the left-hand side of the eigenfunction problem using $M$ i.i.d. samples
\begin{align}
\frac{1}{M} K \psi \approx \mu \psi
\end{align}
The goal is to compute the eigenvectors $u_1, \cdots, u_J$ of $K$ associated with the $J$ largest eigenvalues $\lambda_1 \geq \cdots \geq \lambda_J$. Comparing the approximation above with $K u_j = \lambda_j u_j$ yields
\begin{align}
& \psi_j(x^m) \approx \sqrt{M} u_{jm} \quad m = 1, \cdots M \\
& \mu_j \approx \frac{\lambda_j}{M}
\end{align}
Finally, we plug the solution above back into the eigenfunction problem to obtain the Nyström approximation of the $j$-th eigenfunction at a point $x$.
\begin{align}
\psi_j(x) \approx \hat{\psi}_j(x) = \frac{\sqrt{M}}{\lambda_j} \sum_{m = 1}^M u_{jm} \, k(x, x^m)
\end{align}
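A minimal NumPy sketch of this Nyström approximation with an RBF kernel; the sample distribution, the bandwidth, and the number of retained eigenpairs `J` are illustrative assumptions.
```python
import numpy as np

def rbf_kernel(X, Y, h):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))

def nystrom_eigenfunctions(X, J, h=1.0):
    # Eigendecompose the kernel matrix on the samples and keep the J largest eigenpairs.
    M = X.shape[0]
    K = rbf_kernel(X, X, h)
    lam, U = np.linalg.eigh(K)                  # ascending order
    lam, U = lam[::-1][:J], U[:, ::-1][:, :J]
    def psi(x_new):
        # Nystrom approximation: psi_j(x) ≈ (sqrt(M) / lambda_j) sum_m u_{jm} k(x, x^m)
        Kx = rbf_kernel(np.atleast_2d(x_new), X, h)
        return np.sqrt(M) * (Kx @ U) / lam      # shape (n_new, J)
    return psi, lam, U

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))               # samples from q (assumption: standard Gaussian)
psi, lam, U = nystrom_eigenfunctions(X, J=5, h=1.0)
Psi = psi(X)
# Monte Carlo check of orthonormality under q: (1/M) Psi^T Psi ≈ I
print(np.round(Psi.T @ Psi / X.shape[0], 2))
```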
#### SSGE
Consider the target gradient $g(x) = \nabla_x \log q(x)$, $g: \mathcal{X} \rightarrow \mathbb{R}^d$, whose $i$-th component is $g_i(x) = \nabla_{x_i} \log \, q(x)$. Assume that $g_1, \cdots, g_d \in L^2(\mathcal{X}, q)$. Since $\{ \psi_j \}_{j \geq 1}$ is an orthonormal basis of $L^2(\mathcal{X}, q)$, we can expand $g_i$ as follows:
\begin{align}
g_i(x) = \sum_{j = 1}^\infty \beta_{ij} \, \psi_j(x)
\end{align}
*Proposition:*
If $k(., .)$ has continuous second-order partial derivatives, and $k(x, .)$ and $k(., x)$ are in the Stein class of $q$, then
\begin{align}
\mathbb{E}_q[\psi_j(x) g(x) + \nabla_x \psi_j(x)] = 0, \quad j=1, \cdots \infty
\end{align}
Substituting the gradient expansion into the proposition above and using the orthonormality of $\{\psi_j\}_{j \geq 1}$, we obtain
\begin{align}
\beta_{ij} = - \mathbb{E}_q[\nabla_{x_i} \psi_j(x)]
\end{align}
To estimate $\beta_{ij}$, we need an approximation of $\nabla_{x_i} \psi_j(x)$. The idea is that $\nabla_{x_i} \psi_j(x)$ can be obtained by differentiating the covariance kernel equation:
\begin{align}
\mu_j \nabla_{x_i} \psi_j(x) = \int \nabla_{x_i} k(x, y) \, \psi_j(y) \, q(y) \, dy
\end{align}
We then perform Monte Carlo sampling to estimate $\nabla_{x_i} \psi_j(x)$.
\begin{align}
\hat{\nabla}_{x_i} \psi_j(x) \approx \frac{1}{\mu_j \, M} \sum_{m = 1}^M \nabla_{x_i} k(x, x^m) \, \psi_j(x^m)
\end{align}
Substituting the approximations of $\mu_j$ and $\psi_j(x)$ into the Monte Carlo estimate above and comparing with the Nyström approximation, we conclude
\begin{align}
\hat{\nabla}_{x_i} \psi_j(x) \approx \nabla_{x_i} \hat{\psi}_j(x)
\end{align}
Finally, we truncate the expansion of $g_i$ to the first $J$ terms and plug in the Nyström approximation of $\{\psi_j \}_{j = 1}^J$ to get the estimator:
\begin{align}
\hat{g}_i(x) &= \sum_{j = 1}^J \hat{\beta}_{ij} \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \nabla_{x_i} \hat{\psi}_j(x^m) \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \, \frac{1}{\mu_j \, M} \sum_{k = 1}^M \nabla_{x_i} \, k(x^m, x^k) \, \psi_j(x^k) \, \hat{\psi}_j(x) \\
&= - \sum_{j = 1}^J \frac{1}{M} \sum_{m = 1}^M \, \frac{M}{\lambda_j \, M} \sum_{k = 1}^M \nabla_{x_i} \, k(x^m, x^k) \, \sqrt{M} u_{jk} \, \frac{\sqrt{M}}{\lambda_j} \sum_{n = 1}^M u_{jn} \, k(x, x^n) \\
&= - \sum_{j = 1}^J \frac{1}{\lambda_j^2} \sum_{m = 1}^M \sum_{k = 1}^M u_{jk} \, \nabla_{x_i^m} \, k(x^m, x^k) \sum_{n = 1}^M u_{jn} \, k(x, x^n)
\end{align}
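A minimal NumPy sketch of the resulting spectral Stein gradient estimator; the RBF kernel, its bandwidth, the number of eigenpairs `J`, and the standard-Gaussian test samples are illustrative assumptions.
```python
import numpy as np

def rbf_kernel(X, Y, h):
    diff = X[:, None, :] - Y[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h)), diff

def ssge(X, x_query, J=5, h=1.0):
    # Spectral Stein gradient estimate of grad log q at x_query, from samples X ~ q.
    M, d = X.shape
    K, diff = rbf_kernel(X, X, h)               # K[m, n] = k(x^m, x^n), diff[m, n] = x^m - x^n
    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1][:J], U[:, ::-1][:, :J]   # J largest eigenpairs
    # For an RBF kernel, grad_x k(x, x^n) = -(x - x^n)/h * k(x, x^n), evaluated at x = x^m.
    grad_K = -diff / h * K[..., None]           # shape (M, M, d)
    # beta[j, i] = -(sqrt(M) / (M lambda_j)) sum_{m,n} u_{jn} grad_{x_i} k(x^m, x^n)
    beta = -np.sqrt(M) / (M * lam[:, None]) * np.einsum('nj,mni->ji', U, grad_K)
    # psi_j at the query points: (sqrt(M) / lambda_j) sum_n u_{jn} k(x, x^n)
    Kq, _ = rbf_kernel(np.atleast_2d(x_query), X, h)
    psi = np.sqrt(M) * (Kq @ U) / lam           # shape (n_query, J)
    return psi @ beta                           # shape (n_query, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))               # samples from a standard Gaussian (assumption)
g_hat = ssge(X, X[:3], J=10, h=2.0)
print(np.round(g_hat, 2))
print(np.round(-X[:3], 2))                      # true score grad log q(x) = -x for comparison
```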
#### Repulsive Deep Ensembles with SSGE
Estimating $\nabla \log \mu_l(x_l^i)$ with the SSGE gives us
\begin{align}
x_{l + 1}^{i} &= x^i_l \, + \, \varepsilon_l\, (\nabla \log \pi(x_l^i) \, - \nabla \log \mu_l(x_l^i)) \\
&= x_l^i + \varepsilon_l \left( \nabla \log \pi(x_l^i) \, + \sum_{j = 1}^J \frac{1}{\lambda_j^2} \, \sum_{m = 1}^M \, \sum_{k = 1}^M u_{jk} \, \nabla_{x_l^m} k(x_l^m, x_l^k) \, \sum_{n = 1}^M u_{jn} \, k(x_l^i, x_l^n) \right)
\end{align}