###### tags: `sampling` `monte carlo` `variance reduction` `one-offs` `expository`
# Control Variates and Controlled Diffusions
**Overview**: In this note, I will outline a method for constructing diffusions which admit a desired invariant measure, and have certain connections to control variates of Stein-Langevin type.
## The 'Complete Recipe' for Sampling with Dynamical Systems
Consider the task of drawing samples from a probability measure with density given as $\pi(dz) = \pi(z) dz$. It has been shown that any time-homogeneous stochastic process with continuous sample paths which leaves $\pi$ invariant can be written as
\begin{align}
dz = \left( D(z) + Q(z) \right) \nabla \log \pi (z) dt + \left( \Gamma^D(z) + \Gamma^Q(z) \right) dt + \sqrt{2 D(z)} dW,
\end{align}
where
* $D$ is a positive-semidefinite-matrix-valued function
* $Q$ is a skew-symmetric-matrix-valued function
* Given a matrix-valued function $M$, $\Gamma^M$ is the vector-valued function with coordinates given by
\begin{align}
\Gamma^M_i (z) &= \text{div}_z \left( M_{i\cdot} (z) \right) \\
&= \sum_{j = 1}^d \frac{\partial M_{ij}}{\partial z_j}(z).
\end{align}
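As a concrete illustration of this correction term: in dimension $d = 2$, taking $Q(z) = \begin{bmatrix} 0 & q(z) \\ -q(z) & 0 \end{bmatrix}$ for a scalar function $q$ gives
\begin{align}
\Gamma^Q (z) &= \begin{bmatrix} \frac{\partial q}{\partial z_2} (z) \\ - \frac{\partial q}{\partial z_1} (z) \end{bmatrix},
\end{align}
which vanishes exactly when $q$ is constant; in general, a state-dependent $Q$ must be compensated by this extra drift in order for $\pi$ to remain invariant.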
In this setup, $D$ prescribes the 'reversible' / 'equilibrium' component of the dynamics, and $Q$ prescribes the 'non-reversible' / 'non-equilibrium' component.
I first came across this formulation through a [2015 work](https://arxiv.org/abs/1506.04696) of Ma, Chen, and Fox. Since its publication, it has been remarked a number of times that decompositions of this form have long been known in the physics literature. A more refined presentation and extension of these ideas was given in a [2021 work](https://arxiv.org/abs/2105.02845) of Barp, Takao, Betancourt, Arnaudon, and Girolami.
A takeaway from this formulation is that if one wants to draw samples from some probability measure $\mu$, one can proceed as follows:
* Specify an extended target measure $\pi$ which admits $\mu$ as a marginal distribution.
* Specify matrix-valued functions $D$ and $Q$, as above.
* Simulate the corresponding diffusion.
Note that the above formulation is essentially algebraic: the claim is that these processes leave $\pi$ invariant, and there is no claim as to the rate at which they converge to $\pi$ over time (if they indeed do).
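As a minimal sketch of the final step, here is an Euler-Maruyama discretisation in the simplest instance of the recipe, $D = I$ and $Q = 0$ (so that $\Gamma^D = \Gamma^Q = 0$), which recovers the overdamped Langevin dynamics; the target, step size, and run length below are illustrative choices, not tuned recommendations.

```python
import numpy as np

# Euler-Maruyama for the recipe with D = I, Q = 0 (overdamped Langevin):
#   dz = grad log pi(z) dt + sqrt(2) dW.
def euler_maruyama(grad_log_pi, z0, step=1e-2, n_steps=50_000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    path = np.empty((n_steps, z.size))
    for t in range(n_steps):
        z = z + step * grad_log_pi(z) + np.sqrt(2 * step) * rng.standard_normal(z.size)
        path[t] = z
    return path

# Standard 2d Gaussian target: grad log pi(z) = -z.
path = euler_maruyama(lambda z: -z, z0=[3.0, -3.0])
print(path[10_000:].mean(axis=0))  # should be near (0, 0)
print(path[10_000:].var(axis=0))   # should be near (1, 1)
```

Of course, the discretisation introduces a bias of its own, so in practice one would pair such a scheme with a Metropolis correction or a careful choice of step size.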
## Dynamical Systems with Control Variables
Consider sampling from $\pi(x) \propto \exp \left( - V(x) \right)$ by embedding it into an extended target measure $\Pi (x, \xi)$, with log-density specified by
\begin{align}
-\log \Pi (x, \xi) &=: \mathcal{V} (x, \xi) \\
&= V(x) + \sum_{j = 1}^M S_j \left( \xi_j \right),
\end{align}
that is, we introduce $M$ additional 'control variables' $\xi_{1:M}$, which are independent from one another and from $x$ at stationarity. It is common to take $S_j (\xi_j) = \frac{1}{2} | \xi_j |^2$, i.e. to give each of the control variables a standard Gaussian law at stationarity.
Now, specify also $M$ 'control potentials' $f_{1:M}$, each mapping $\mathbf{R}^d$ into $\mathbf{R}$, and aggregate them into a vector-valued function as
\begin{align}
F(x) &= \begin{bmatrix} f_1 (x) \\ f_2 (x) \\ \vdots \\ f_M (x) \end{bmatrix} \in \mathbf{R}^M.
\end{align}
Write also
\begin{align}
F' (x) &= \begin{bmatrix} \nabla_x f_1 (x) \\ \nabla_x f_2 (x) \\ \vdots \\ \nabla_x f_M (x) \end{bmatrix} \in \mathbf{R}^{M \times D}
\end{align}
for the Jacobian of this map.
With this specification made, we are now in a position to define $D$ and $Q$ matrices, which we choose to take as
\begin{align}
D(x, \xi) &= \begin{bmatrix} I_D & 0_{D \times M}\\ 0_{M \times D} & 0_{M \times M} \end{bmatrix} \\
Q(x, \xi) &= \begin{bmatrix} 0_{D \times D} & -F' (x)^T \\ F'(x) & 0_{M \times M} \end{bmatrix}.
\end{align}
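To see how the dynamics arise, note that $\nabla \log \Pi (x, \xi) = \left( - \nabla_x V(x), - \nabla_\xi S(\xi) \right)$, that $\Gamma^D = 0$ since $D$ is constant, and that the only nonzero contribution to $\Gamma^Q$ comes from the lower-left block (the upper-right block depends only on $x$, but is differentiated in $\xi$). This gives
\begin{align}
\left( D + Q \right) \nabla \log \Pi (x, \xi) &= \begin{bmatrix} - \nabla_x V(x) + F'(x)^T \nabla_\xi S(\xi) \\ - F'(x) \nabla_x V(x) \end{bmatrix}, \\
\Gamma^D (x, \xi) + \Gamma^Q (x, \xi) &= \begin{bmatrix} 0_D \\ \Delta_x F(x) \end{bmatrix},
\end{align}
where $\Delta_x F$ has coordinates $\Delta_x f_j$, and the noise $\sqrt{2 D(z)} \, dW$ acts on the $x$ coordinates only.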
Working through the equations, this gives us the joint dynamics for $(x, \xi)$ as
\begin{align}
\begin{cases}
dx &= - \nabla_x V(x) \, dt + F'(x)^T \nabla_\xi S (\xi) \, dt + \sqrt{2} \, dW_t \\
d\xi &= \left( \mathcal{L} F \right) (x) \, dt,
\end{cases}
\end{align}
where $\mathcal{L}$ acts on scalar-valued functions by
\begin{align}
\left( \mathcal{L} f \right) (x) = \langle - \nabla_x V(x), \nabla_x f (x) \rangle + \Delta_x f (x),
\end{align}
and is extended to vector-valued functions by acting coordinatewise. As such, these dynamics can equally be written out in coordinates as
\begin{align}
\begin{cases}
dx_t [i] &= \left( - \frac{\partial V}{\partial x[i]} (x_t) + \sum_{j = 1}^M \frac{\partial f_j}{\partial x[i]} (x_t) \cdot \frac{\partial S_j}{\partial \xi[j]} (\xi_t [j]) \right) dt + \sqrt{2} \, dW_t[i] \\
d\xi_t [j] &= \left( \mathcal{L} f_j \right) (x_t) \, dt.
\end{cases}
\end{align}
I will refer to this system as the $F$-controlled dynamics. Note that when $F = 0$, this is just the standard overdamped Langevin dynamics.
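As a concrete illustration, consider the scalar target $\pi = \mathcal{N}(0, 1)$ (so $V(x) = \frac{1}{2} x^2$), a single control potential $f(x) = x$, and $S(\xi) = \frac{1}{2} \xi^2$, so that $F'(x) = 1$ and $(\mathcal{L} f)(x) = -x$. The resulting joint system is linear and can be simulated with a simple Euler-Maruyama sketch; the step size and run length are illustrative choices.

```python
import numpy as np

# F-controlled dynamics for pi = N(0, 1): V(x) = x^2 / 2, S(xi) = xi^2 / 2,
# and a single control potential f(x) = x, so that F'(x) = 1 and
# (L f)(x) = -x. The joint system is then linear:
#   dx  = (-x + xi) dt + sqrt(2) dW,   dxi = -x dt.
def simulate_controlled(step=1e-2, n_steps=200_000, seed=1):
    rng = np.random.default_rng(seed)
    x, xi = 0.0, 0.0
    xs = np.empty(n_steps)
    for t in range(n_steps):
        x_new = x + step * (-x + xi) + np.sqrt(2 * step) * rng.standard_normal()
        xi = xi + step * (-x)  # d(xi) = (L f)(x) dt = -x dt
        x = x_new
        xs[t] = x
    return xs

xs = simulate_controlled()
print(xs[20_000:].mean(), xs[20_000:].var())  # should be near 0 and 1
```

In this linear case one can check directly that $\mathcal{N}(0, I_2)$ is the stationary law of the pair $(x, \xi)$, so the $x$-marginal of the long-run trajectory should be close to $\mathcal{N}(0, 1)$.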
## Interpretation of the $F$-controlled Dynamics
One somewhat-heuristic interpretation of these dynamics is as follows: the standard theory of infinitesimal generators of stochastic processes tells us that at stationarity, for each $j$, the quantity $\mathcal{L} f_j$ should have mean value $0$. The corresponding 'control variable' $\xi_j$ can be seen as a thermostat which monitors how close to $0$ the quantity $\mathcal{L} f_j$ is, and feeds this information back into the $x$ dynamics, in order to encourage this quantity to average out to $0$.
By the same token, there is a natural connection to control variates. Given the knowledge that a certain quantity $c$ should average out to $0$ at equilibrium, one can consider a moment-matching approach: construct a set of particles such that the empirical average of $c$ across the particles is exactly $0$. This bears some resemblance to strategies like (Kernel) Stein Discrepancy minimisation. The $F$-controlled dynamics might then be viewed as a hybrid of pure moment-matching with more conventional Monte Carlo sampling strategies.
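To make the control-variate connection concrete, here is a standard (static) control-variate calculation, again for $\pi = \mathcal{N}(0, 1)$ with $V(x) = \frac{1}{2} x^2$: the quantity $c(x) = (\mathcal{L} f)(x) = -x$ for $f(x) = x$ has mean zero under $\pi$, and can be used to reduce the variance of a plain Monte Carlo estimate. The test function $\exp(x)$ and the sample size are illustrative choices.

```python
import numpy as np

# Static control-variate estimate of E[exp(x)] under pi = N(0, 1), whose
# exact value is exp(1/2). With V(x) = x^2 / 2 and f(x) = x, the quantity
# c(x) = (L f)(x) = -x has mean zero under pi and is correlated with exp(x).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
g, c = np.exp(x), -x

beta = -np.cov(g, c)[0, 1] / np.var(c)  # variance-optimal coefficient
plain = g.mean()
controlled = (g + beta * c).mean()      # same mean as g, smaller variance

print(plain, controlled, np.exp(0.5))
print(np.var(g), np.var(g + beta * c))  # the corrected variance is smaller
```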
## $F$-Controlled Dynamics from Control Variates
Suppose that we have a quantity $c$ which we know has mean $0$ under $\pi$. Under appropriate regularity conditions (related to the Poisson / Stein equation), it holds that there exists a corresponding potential $f$ such that $\mathcal{L} f = c$. As such, one can (in principle) construct such dynamics directly from a set of control variates. However, it should be noted that the numerical construction of these potentials $f$ is usually quite challenging, limiting the practical applicability of such strategies.
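A toy case where the Poisson equation is solvable in closed form: for $\pi = \mathcal{N}(0, 1)$, i.e. $V(x) = \frac{1}{2} x^2$, the mean-zero quantity $c(x) = 1 - x^2$ is generated by $f(x) = \frac{1}{2} x^2$, since
\begin{align}
\left( \mathcal{L} f \right) (x) &= - x \cdot f'(x) + f''(x) = - x^2 + 1 = c(x).
\end{align}
Outside of such conjugate examples, solving $\mathcal{L} f = c$ typically requires numerical approximation, which is where the practical difficulty lies.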
## Conclusion
In this note, I have provided a brisk recap of the many ways in which continuous dynamics can be harnessed for the purposes of Monte Carlo sampling, and then applied this theory to describe a particular dynamical system. I have then drawn some links between the proposed system, the method of control variates, and the method of moments.
As foreshadowed earlier in the note, it is not so clear how useful these methods will be for accelerating the convergence of sampling methods, either provably or otherwise. The wealth of opportunities provided by the 'complete recipe' is unfortunately relatively silent on the question of which opportunities ought to actually be taken. It is fair to say that this remains a difficult question.

Published on **HackMD**