###### tags: `coupling` `monte carlo` `expository`
# Coupling Methods in Monte Carlo: Static Theory
**Overview**: In this note, I will describe the notion of a *coupling* between random variables, and some basic applications to variance reduction for Monte Carlo methods.
## A Taster: Estimating Differences of Expectations
Consider the task of estimating the difference between two expectations, possibly taken with respect to different measures *and* different functions, i.e.
\begin{align}
\Delta = \mathbf{E}_p [F] - \mathbf{E}_q [G].
\end{align}
The simplest approach is arguably to estimate each expectation separately, and then compute the difference between those estimators, i.e.
1. Sample $x \sim p$, compute $F(x)$.
2. Sample $y \sim q$, compute $G(y)$.
3. Return $F(x) - G(y)$.
This approach is simple, and can work well. However, it is often possible to do better.
Let us start by trying to express $\Delta$ as a single integral. We can certainly write
\begin{align}
\Delta &= \mathbf{E}_p [F] - \mathbf{E}_q [G] \\
&= \int p(dx) F(x) - \int q(dy) G(y) \\
&= \int p(dx) q(dy) \left( F(x) - G(y) \right).
\end{align}
However, this is just one option. Indeed, there may be many joint distributions $\Pi$ on $\mathcal{X} \times \mathcal{Y}$ such that
\begin{align}
\Delta &= \int \Pi(dx, dy) \left( F(x) - G(y) \right),
\end{align}
any of which would be valid choices for devising a Monte Carlo method for estimating $\Delta$. So, we should try to find a good one!
## Couplings
In some sense, for a given pair $(F, G)$, there exist far too many choices of $\Pi$ which enable unbiased estimation of $\Delta$. A useful way of narrowing our search is to find $\Pi$ such that the above relation holds for all sensible $(F, G)$, i.e.
\begin{align}
\forall F \in L^1(p), G \in L^1(q), \quad \int \Pi(dx, dy) \left( F(x) - G(y) \right) = \mathbf{E}_p [F] - \mathbf{E}_q [G].
\end{align}
Setting either $F$ or $G$ to $0$ gives the equivalent formulation
\begin{align}
\forall F \in L^1(p), \quad \mathbf{E}_\Pi [F] &= \mathbf{E}_p [F] \\
\forall G \in L^1(q), \quad \mathbf{E}_\Pi [G] &= \mathbf{E}_q [G] .
\end{align}
Such a $\Pi$ is called a *coupling* of $(p, q)$, and the set of all such couplings is often denoted as $\Gamma(p, q)$.
An alternative presentation (which is often more intuitive for algorithms) is that if one draws $(x, y) \sim \Pi$, then *marginally*, $x$ has the law of $p$, and $y$ has the law of $q$. Crucially, the joint behaviour of the pair $(x, y)$ is otherwise unconstrained - the pair may be independent, positively- or negatively- dependent, or even deterministically related! This flexibility can be quite useful.
## Back to Estimation
Returning to our estimation problem, we can consider the estimator given by
1. Sample $(x, y) \sim \Pi$.
2. Return $F(x) - G(y)$.
Now, to assess the quality of this estimator, we start by writing down its variance:
\begin{align}
\text{Var}_\Pi \left( F(x) - G(y) \right) &= \int \Pi (dx, dy)\left( \bar{F}(x) - \bar{G}(y) \right)^2,
\end{align}
where $\bar{F}(x) := F(x) - \mathbf{E}_p[F], \bar{G}(y) := G(x) - \mathbf{E}_q[G]$. As such, we see that a good coupling for this problem is one such that, under the joint law $\Pi$, $\bar{F}(x)$ and $\bar{G}(y)$ are close to one another.
In fact, writing $T(x, y) = (\bar{F}(x), \bar{G}(y))$, one can write this variance as
\begin{align}
\text{Var}_\Pi \left( F(x) - G(y) \right) &= \int \Pi (dx, dy)\left( \bar{F}(x) - \bar{G}(y) \right)^2 \\
&= \int \left( T_\#\Pi \right) (df, dg)\left( f - g \right)^2,
\end{align}
where $T_\#\Pi$ denotes the pushforward of the measure $\Pi$ through the mapping $T$. Note that $T_\# \Pi$ is a valid coupling of $\bar{F}_\# p$ and $\bar{G}_\# q$. As such, one can lower-bound this variance by
\begin{align}
\text{Var}_\Pi \left( F(x) - G(y) \right) \geqslant \inf_{\tau \in \Gamma(\bar{F}_\# p, \bar{G}_\# q)} \mathbf{E}_\tau \left[ |f - g|^2 \right].
\end{align}
The astute reader will recognise this expression as the squared 2-Wasserstein distance between the measures $\bar{F}_\# p$ and $\bar{G}_\# q$. Crucially, this distance is independent of the particular coupling which we choose to use: this is a fundamental lower bound on the variance reduction which one can achieve through a coupling method of this form.
## A Basic Example of Coupling: Common Random Numbers
Suppose that $p$ and $q$ are supported on a subset of $\mathbf{R}$, and have a density with respect to Lebesgue measure. One can then define cumulative distribution functions
\begin{align}
\Phi_p (x) &= p((-\infty, x]) \\
\Phi_q (x) &= q((-\infty, x])
\end{align}
with well-behaved inverses on $(0, 1)$.
It is a well-known fact that one can generate a sample from $p$ as follows:
1. Sample $u \sim \mathcal{U} [0, 1]$.
2. Solve $\Phi_p (x) = u$ for $x$.
3. Return $x$.
Now, consider the following program:
1. Sample $u \sim \mathcal{U} [0, 1]$.
2. Solve $\Phi_p (x) = u$ for $x$.
3. Solve $\Phi_q (y) = u$ for $y$.
4. Return $(x, y)$.
By the preceding discussion, we know that marginally, $x \sim p$ and $y \sim q$. Hence, this program implicitly specifies a coupling for $(p, q)$. This is often known as the method of *common random numbers*, *seed sharing*, or similar.
Moreover, by the monotonicity of $F_p, F_q$, and their inverses, routine arguments from the theory of optimal transport will guarantee that this is in many respects an *optimal* coupling, i.e. on average, this coupling generates pairs $(x, y)$ which are as close together as one can hope. This is particularly useful for estimating differences in expectations of the same function against $p$ and $q$, as
\begin{align}
\mathbf{E}_\Pi \left[ \left( h(x) - h(y) \right)^2 \right] &= \int \Pi (dx, dy)\left( h(x) - h(y) \right)^2 \\
&\leqslant \text{Lip}(h)^2 \cdot \int \Pi (dx, dy)\left( x - y \right)^2 ,
\end{align}
and so guaranteeing the closeness of $(x, y)$ under $\Pi$ ensures that quantities like $\mathbf{E}_p[h]-\mathbf{E}_q[h]$ can be reliably estimated.
This simple observation can be taken surprisingly far. For example, suppose that we are tasked with estimating $\mathbf{E}_p [F] - \mathbf{E}_q [G]$, where $x$ and $y$ are both one-dimensional, and $F$ and $G$ are co-monotone (i.e. both increasing or both decreasing). One can then construct an optimal coupling for $(F_\# p, G_\# q)$ by sampling
1. Sample $u \sim \mathcal{U} [0, 1]$.
2. Solve $\Phi_p (x) = u$ for $x$.
3. Solve $\Phi_q (y) = u$ for $y$.
4. Return $(F(x), G(y))$.
This works because the monotonicity of $F$ ensures that $\Phi_{F_\#p} \circ F = \Phi_p$, i.e. we are implementing the original method, but without explicitly forming the density of $F_\#p$, and so on.
This method is best understood in one dimension, but the same principles apply to higher-dimensional models, conditionally-specified models, and other variations. As a rule of thumb, sharing as much of the randomness in the algorithm as possible, in a structured way, is typically useful.
In cases where $F$ or $G$ are not monotone, the strategy can occasionally struggle, or make things work. To improve matters, a case-by-case analysis is typically necessary.
## A More Advanced Example: Repeated Sampling
Previously, the necessity of couplings arose from our having multiple distinct probability measures at hand. However, coupling approaches can be useful even in the standard Monte Carlo setting. Suppose we are interested in estimating $I = \mathbf{E}_p [F]$. Now, write
\begin{align}
\mathbf{E}_p [F] &= \sum_{i \in [N]} \mathbf{E}_p \left[ \frac{1}{N}F(X_i) \right].
\end{align}
By a similar reasoning to before, we might now consider computing the expectation on the right-hand side with respect to an $N$-way coupling of $(p, p, \ldots, p)$, i.e. we draw $N$ samples, each of which is marginally drawn from $p$, but which may otherwise be correlated. There are a number of approaches to this, a few of which I detail here:
* Suppose that $N = 2$, and $F$ is monotone.
* In line with our earlier discussion, one could define a sampler for $F_\#p$ by setting $f_+ = F \circ \Phi_p^{-1} (u)$. Similarly, one could define a sampler for $(-F)_\#p$ by setting $f_- = (-F) \circ \Phi_p^{-1} (1 - u)$.
* Both of these samplers are monotone in $u$, and hence this pair constitutes an optimal coupling.
* Essentially, the strategy here is to produce a pair of samples from $p$ which are strongly *anti*correlated, such that their fluctuations under $F$ largely cancel one another out under addition. Indeed, if $F$ and $p$ are both symmetric, then a single sample from this coupling will have zero variance!
* This strategy is known as *antithetic coupling*.
* Consider now general $N$, but perhaps not having a specific $F$ in mind - perhaps we are conceivably interested in a whole class of $F$.
* A safe strategy might be to couple samples from $p$ such that they are well-dispersed around the support of $p$. Some simple approaches in this direction include:
* Sample $u \sim \mathcal{U}[0, 1]$, and for $1 \leqslant k \leqslant N$, set $x_k = \Phi_p^{-1} \left( \left( \frac{k}{N} + U \right) \mod 1 \right)$. This is roughly akin to the idea of 'systematic sampling' in Sequential Monte Carlo, and has links to (Randomised) Quasi Monte Carlo.
* For $1 \leqslant k \leqslant N$, sample $u_k \sim \mathcal{U}[0, 1]$, and set $x_k = \Phi_p^{-1} \left( \frac{k - u_k}{N} \right)$. This is essentially a version of Stratified Sampling.
* A reason to prefer systematic sampling is that it requires less randomness to operate. A reason to prefer stratified sampling is that it generates samples which are independent, which simplifies many theoretical analyses. In practice, the difference between the two is not always so large.
## Infinitesimal Coupling and the Reparametrisation Trick
A fun application of coupling which has encouraged quite some enthusiasm in recent years is the reparametrisation trick. It contrasts somewhat with the preceding examples in that, through a clever limiting procedure, it manages to use only a single sample!
Posit a family of parametrised probability measures $p_\theta$, a function $F$, and consider the task of estimating the derivative
\begin{align}
\gamma = \nabla_\theta \phi(\theta) = \nabla_\theta \mathbf{E}_{p_\theta} [F].
\end{align}
Suppose also that samples from $p_\theta$ can be generated by sampling $\xi \sim \rho$, where $\rho$ is some measure which does not depend on $\theta$, and then setting $x = \mathcal{G}(\xi, \theta)$, where $\mathcal{G}$ is a well-behaved function.
We proceed as follows: for small perturbations $\delta \theta$, one can write
\begin{align}
\phi (\theta + \delta \theta) - \phi(\theta) &= \int_0^1 \langle \nabla_\theta \phi(\theta + t \delta \theta), \nabla \theta \rangle dt \\
&\approx \langle \nabla_\theta \phi(\theta), \delta \theta \rangle,
\end{align}
and so in some sense, it is sufficient for us to be able to estimate $\phi (\theta + \delta \theta) - \phi(\theta)$. Now, write
\begin{align}
\phi (\theta + \delta \theta) - \phi(\theta) &= \mathbf{E}_{p_{\theta + \delta\theta}} [F] - \mathbf{E}_{p_\theta} [F].
\end{align}
We could estimate the quantity on the right hand side by sampling $\xi \sim \rho$, and returning the value of
\begin{align}
\hat{g} = F(\mathcal{G}(\xi, \theta + \delta\theta)) - F(\mathcal{G}(\xi, \theta)).
\end{align}
Now, through an appropriate Taylor expansion, one can write that
\begin{align}
\hat{g} &= F(\mathcal{G}(\xi, \theta + \delta\theta)) - F(\mathcal{G}(\xi, \theta)) \\
&\approx F \left(\mathcal{G}(\xi, \theta )+\frac{\partial \mathcal{G}}{\partial \theta} \delta\theta \right) - F(\mathcal{G}(\xi, \theta)) \\
&\approx \left( F \left(\mathcal{G}(\xi, \theta ) \right) + \frac{\partial F}{\partial x} \frac{\partial \mathcal{G}}{\partial \theta} \delta\theta \right) - F(\mathcal{G}(\xi, \theta)) \\
&\approx \left\langle \left( \frac{\partial \mathcal{G}}{\partial \theta} \right)^T \nabla_x F(\mathcal{G}(\xi, \theta)), \delta \theta \right\rangle.
\end{align}
With this in mind, we can assert that
\begin{align}
\mathbf{E}_\rho \left[ \left( \frac{\partial \mathcal{G}}{\partial \theta} \right)^T \nabla_x F(\mathcal{G}(\xi, \theta)) \right] = \nabla_\theta \phi(\theta) = \gamma,
\end{align}
and indeed, in reasonable situations, this can be proven rigorously.
## Conclusion
In this note, I have outlined the noting of a coupling of random variables, and sketched some simple applications to variance reduction in Monte Carlo. The examples herein were mostly based on static problems, but the coupling approach can really come into its own when considering more elaborate dependence structures, including Markov chains. This will be expanded upon in a later post.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**