###### tags: `chasing constants` `mcmc` `monte carlo` `expository` `theory`
# Chasing Constants: Prelude (Spectral Gaps of Markov Chains by Contractivity)
**Overview**: In this note, I will describe the notion of the spectral gap of a reversible Markov chain, and outline a strategy for estimating this quantity using coupling arguments.
This serves as a prelude for upcoming posts in which I review some concrete applications of this technique.
## Markov Chains for Approximate Simulation
One of my main research interests is the design and analysis of Markov chain algorithms for Monte Carlo simulation. Given a probability measure of interest $\pi$, there are many creative ways to come up with a Markov transition kernel $\mathrm{P}$ which leaves $\pi$ invariant, i.e.
\begin{align}
x \sim \pi,\, y \sim \mathrm{P} \left( x, \cdot \right) \implies y \sim \pi,
\end{align}
and which even allow for the asymptotic generation of samples from $\pi$, i.e.
\begin{align}
x_0 \sim \pi_0, x_t \sim \mathrm{P} \left( x_{t-1}, \cdot \right) \implies \lim_{t \to \infty} \mathrm{Law} \left( x_t \right) = \pi.
\end{align}
This enterprise is known as _Markov chain Monte Carlo_.
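To make the invariance property concrete, here is a minimal numerical sketch (the target `pi`, the proposal `Q`, and all numbers are illustrative, not tied to anything above): a Metropolis kernel on a three-point space, for which the invariance condition $\pi \mathrm{P} = \pi$ can be checked exactly.

```python
import numpy as np

# Illustrative target distribution on the state space {0, 1, 2}.
pi = np.array([0.2, 0.5, 0.3])

def metropolis_kernel(pi, Q):
    """Metropolis transition matrix for target pi and symmetric proposal Q."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x != y:
                # propose y, accept with probability min(1, pi(y) / pi(x))
                P[x, y] = Q[x, y] * min(1.0, pi[y] / pi[x])
        P[x, x] = 1.0 - P[x].sum()  # rejected proposals stay at x
    return P

Q = np.full((3, 3), 1.0 / 3.0)  # uniform symmetric proposal
P = metropolis_kernel(pi, Q)

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution
assert np.allclose(pi @ P, pi)          # invariance: x ~ pi, y ~ P(x, .) => y ~ pi
```

The Metropolis construction is one of the "creative ways" alluded to above; any kernel satisfying detailed balance with respect to $\pi$ would pass the same check.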
For practical purposes, it is important to make these convergence statements rigorous and quantitative. As the problem is fundamentally probabilistic in nature, it is perhaps unsurprising that there are a few different ways of quantifying the rate at which this convergence takes place.
## $L^2$ Convergence of Markov Chains
One particularly appealing framework for studying the convergence of Markov chains is the so-called $L^2$- or Hilbert space approach. In this setup, one observes some useful mathematical facts:
1. The space of functions which are square-integrable under $\pi$ forms a Hilbert space, known as $L^2 \left( \pi \right)$.
2. The Markov kernel $\mathrm{P}$ can be given an interpretation as a linear operator on this space, by
\begin{align}
\left( Pf \right)\left( x \right) = \int \mathrm{P} \left( x, \mathrm{d} y \right) f\left( y \right).
\end{align}
3. The convergence properties of the Markov chain can be characterised in terms of the operator $P$, and in particular, of its spectrum.
The third point deserves particular attention. To begin with, it is a useful exercise to convince oneself that $P^n$ is the linear operator corresponding to taking $n$ steps of the Markov chain driven by $\mathrm{P}$. If the chain is to converge in law to $\pi$, then one should expect that
\begin{align}
\left( P^n f \right)\left( x \right) &= \int \mathrm{P}^n \left( x, \mathrm{d} y \right) f\left( y \right) \\
&\approx \int \pi \left( \mathrm{d} y \right) f\left( y \right).
\end{align}
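In the finite-state setting this is transparent: $P$ is a stochastic matrix, $(Pf)(x)$ is a matrix-vector product, and $P^n$ is literally the $n$-th matrix power. A small sketch (with a hypothetical Metropolis kernel and illustrative numbers) showing $P^n f$ flattening towards the constant function $\pi(f) \cdot \mathbf{1}$:

```python
import numpy as np

# Illustrative reversible kernel: Metropolis with uniform proposal on {0, 1, 2}.
pi = np.array([0.2, 0.5, 0.3])
Q = np.full((3, 3), 1.0 / 3.0)
P = np.where(np.eye(3, dtype=bool), 0.0,
             Q * np.minimum(1.0, pi[None, :] / pi[:, None]))
P += np.diag(1.0 - P.sum(axis=1))  # rejection mass on the diagonal

f = np.array([1.0, -2.0, 4.0])            # an arbitrary test function
Pnf = np.linalg.matrix_power(P, 50) @ f   # (P^50 f)(x): 50 steps of the chain

assert np.allclose(Pnf, pi @ f)           # ~ the constant function pi(f) * 1

var = pi @ (Pnf - pi @ f) ** 2            # Var_pi(P^n f), essentially zero here
assert var < 1e-10
```

The final quantity `var` is exactly the variance functional discussed next.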
Since $P^n f$ remains an element of $L^2 \left( \pi \right)$ for all $n$, one can consider measuring this convergence in the Hilbert space norm, i.e.
\begin{align}
\| P^n f - \pi \left( f \right) \cdot \mathbf{1} \|_2^2 &= \int \pi \left( \mathrm{d} x \right) \left( \left( P^n f \right) \left( x \right) - \pi \left( f \right) \right)^2 \\
&= \mathrm{Var}_\pi \left( P^n f \right).
\end{align}
The $L^2$ approach seeks to obtain estimates on the rate at which such quantities decay as a function of $n$. There is typically a preference for bounds which do not require highly specific details about $f$, and so one often sees bounds of the form
\begin{align}
\| P^n f - \pi \left( f \right) \cdot \mathbf{1} \|_2^2 \leqslant a \left( n \right) \cdot b \left( f \right),
\end{align}
where $a$ is an explicit rate of decay, and $b$ is some simple norm-like functional of $f$. One might describe this as a 'variance decay bound'.
A slight peculiarity of the $L^2$ approach is that, a priori, it is agnostic to the geometry of the state space. Indeed, if two Markov chains are related to one another through an appropriate bijection, then the induced linear operators are essentially equivalent, and so the chains have identical $L^2$ convergence rates, eigendecompositions, and so on. This is in contrast to e.g. measuring convergence in some transport distance, where the choice of geometry on the space is an essential feature.
In what follows, I will
i) work only with functions which satisfy $\pi \left( f \right) = 0$ (this gives rise to a Hilbert space which is typically denoted by $L_0^2 \left( \pi \right)$), and
ii) assume that the Markov chain in question is invariant under time-reversal with respect to the measure $\pi$.
The first assumption comes without a loss of generality, but the second is a more substantial restriction, which nevertheless often holds in practice.
## Spectral Gaps
As the iterates $P^n f$ arise from the repeated application of the same linear operator, it is perhaps unsurprising that except in specific edge cases, the rate at which the variance decays with $n$ cannot be faster than exponential. Moreover, exponential decay is indeed feasible, and the gold standard for $L^2$ convergence of Markov chains is to exhibit a bound of the form
\begin{align}
\text{for all } f \in L_0^2 \left( \pi \right), \quad \| P^n f \|_2^2 \leqslant \left(1 - \gamma \right)^n \cdot \| f \|_2^2,
\end{align}
for some (ideally large) constant $\gamma \in (0, 1)$.
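For a finite reversible chain, the sharpest such $\gamma$ can be read off the spectrum: the bound holds with $1 - \gamma = \lambda_*^2$, where $\lambda_*$ is the largest modulus of an eigenvalue of $P$ restricted to mean-zero functions. A numerical sketch (same hypothetical Metropolis chain as before, all numbers illustrative):

```python
import numpy as np

# Illustrative reversible kernel on {0, 1, 2}.
pi = np.array([0.2, 0.5, 0.3])
Q = np.full((3, 3), 1.0 / 3.0)
P = np.where(np.eye(3, dtype=bool), 0.0,
             Q * np.minimum(1.0, pi[None, :] / pi[:, None]))
P += np.diag(1.0 - P.sum(axis=1))

# Symmetrise: S = D^{1/2} P D^{-1/2} shares its (real) eigenvalues with P,
# by reversibility of P with respect to pi.
d = np.sqrt(pi)
S = (d[:, None] * P) / d[None, :]
lam = np.sort(np.abs(np.linalg.eigvalsh(S)))[-2]  # largest nontrivial modulus
gamma = 1.0 - lam ** 2

f = np.array([1.0, -2.0, 4.0])
f = f - pi @ f                       # project onto L^2_0(pi)
norm2 = lambda g: pi @ g ** 2        # squared L^2(pi) norm

# The variance decay bound holds at every n with this gamma:
for n in range(1, 8):
    Pnf = np.linalg.matrix_power(P, n) @ f
    assert norm2(Pnf) <= (1.0 - gamma) ** n * norm2(f) + 1e-12
```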
It is a nice exercise (essentially of linear-algebraic character) to show that decay estimates of this form are equivalent to the following inequality:
\begin{align}
\text{for all } f \in L_0^2 \left( \pi \right), \quad \mathcal{E} \left( P^2, f \right) \geqslant \gamma \cdot \| f \|_2^2,
\end{align}
where $\mathcal{E}$ is the so-called 'Dirichlet form' of the kernel $\mathrm{P}^2$, defined for general Markov kernels $\mathrm{T}$ by
\begin{align}
\mathcal{E} \left( T, f \right) = \frac{1}{2} \cdot \int\int \pi \left( \mathrm{d} x \right) \mathrm{T} \left( x, \mathrm{d} y \right) \left( f\left( y \right) - f\left( x \right) \right)^2.
\end{align}
Such an inequality is known as a 'Poincaré inequality' or 'spectral gap inequality' for $T$, and indeed can be viewed as a statement about the spectrum of the linear operator $T$. Proving such an inequality for a given chain is often something of an achievement, and allows for many useful properties to be deduced.
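The equivalence rests on the identity $\mathcal{E}(P^2, f) = \|f\|_2^2 - \|Pf\|_2^2$, valid for reversible $P$. Both the identity and the Poincaré inequality can be verified directly on a small example (same hypothetical chain, illustrative numbers):

```python
import numpy as np

# Illustrative reversible kernel on {0, 1, 2}.
pi = np.array([0.2, 0.5, 0.3])
Q = np.full((3, 3), 1.0 / 3.0)
P = np.where(np.eye(3, dtype=bool), 0.0,
             Q * np.minimum(1.0, pi[None, :] / pi[:, None]))
P += np.diag(1.0 - P.sum(axis=1))
T = P @ P  # the two-step kernel P^2

def dirichlet(T, f):
    """E(T, f) = (1/2) * sum_{x,y} pi(x) T(x,y) (f(y) - f(x))^2."""
    diff2 = (f[None, :] - f[:, None]) ** 2
    return 0.5 * np.sum(pi[:, None] * T * diff2)

f = np.array([1.0, -2.0, 4.0])
f = f - pi @ f                       # centre: f in L^2_0(pi)
norm2 = lambda g: pi @ g ** 2

# Identity linking the Dirichlet form of P^2 to one-step variance decay:
assert np.isclose(dirichlet(T, f), norm2(f) - norm2(P @ f))

# Poincare inequality with the spectral constant gamma = 1 - lambda_*^2:
d = np.sqrt(pi)
lam = np.sort(np.abs(np.linalg.eigvalsh((d[:, None] * P) / d[None, :])))[-2]
gamma = 1.0 - lam ** 2
assert dirichlet(T, f) >= gamma * norm2(f) - 1e-12
```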
## Couplings, Contractivity, and Spectral Gap Estimates
One useful general tool for establishing a spectral gap estimate from scratch is to establish that the chain has some sort of good mixing property in space, in that for $x \neq y$, as $n$ grows, the laws of $\mathrm{P}^n \left( x, \cdot \right)$ and $\mathrm{P}^n \left( y, \cdot \right)$ exhibit substantial overlap in a suitable sense. This is not easy per se, but at least relies mostly on 'local' properties of the chain, which improves its tractability somewhat. When working with concrete examples, it is often clear whether or not such a strategy will be feasible.
To make this more precise, fix a metric $d$ on the ambient space, and define the Lipschitz constant of a function $f$ on this space by
\begin{align}
| f |_\mathrm{Lip} := \sup_{x \neq y} \frac{| f\left( y \right) - f\left( x \right)|}{d \left( x, y \right)}.
\end{align}
Suppose that one can identify a $\kappa \in \left(0, 1 \right)$ such that for each pair $x \neq y$, there exists a coupling $\check{P}$ of $\mathrm{P} \left( x, \cdot \right)$ and $\mathrm{P} \left( y, \cdot \right)$ which is $\left(1 - \kappa \right)$-contractive with respect to $d$, i.e.
\begin{align}
\left( x', y' \right) \sim \check{P} \left( \left( x, y \right), \cdot\right) \implies \mathbf{E} \left[ d \left( x', y' \right) \right] \leqslant \left(1 - \kappa \right) \cdot d \left( x, y \right).
\end{align}
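A concrete (hypothetical) example: the AR(1) chain $x' = a x + \xi$ with $\xi \sim \mathcal{N}(0, 1)$ and $a \in (0, 1)$. Driving both copies with the *same* innovation $\xi$ (the 'synchronous' coupling) contracts the distance $d(x, y) = |x - y|$ by a factor of exactly $a$, giving $\kappa = 1 - a$; here the contraction is even deterministic, not merely in expectation.

```python
import numpy as np

a = 0.8  # autoregression coefficient, so kappa = 1 - a = 0.2
rng = np.random.default_rng(0)

def coupled_step(x, y, rng):
    """Synchronous coupling: the SAME innovation xi drives both chains."""
    xi = rng.normal(0.0, 1.0)
    return a * x + xi, a * y + xi

x, y = 5.0, -3.0          # two arbitrary starting points
d0 = abs(x - y)
for n in range(1, 11):
    x, y = coupled_step(x, y, rng)
    # distance contracts exactly: |x_n - y_n| = a^n * |x_0 - y_0|
    assert np.isclose(abs(x - y), a ** n * d0)
```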
Under this condition, it is possible to show that $P$ has a spectral gap of at least $\kappa$. Indeed, the coupling allows one to show that
\begin{align}
| \left( Pf \right)\left( x \right) - \left( Pf \right)\left( y \right)| &\leqslant \left(1 - \kappa \right) \cdot d \left( x, y \right) \cdot | f |_\mathrm{Lip} \\
\implies \quad | Pf |_\mathrm{Lip} &\leqslant \left(1 - \kappa \right) \cdot | f |_\mathrm{Lip} \\
\implies \quad | P^n f |_\mathrm{Lip} &\leqslant \left(1 - \kappa \right)^n \cdot | f |_\mathrm{Lip} \quad \text{for all } n \in \mathbf{N}.
\end{align}
One can then write that
\begin{align}
\| P^n f \|_2^2 &= \frac{1}{2} \cdot \int \pi \left( \mathrm{d} x \right) \cdot \pi \left( \mathrm{d} y \right) \cdot \left( \left( P^n f \right) \left( x \right) - \left( P^n f \right) \left( y \right) \right)^2 \\
&\leqslant \frac{1}{2} \cdot | P^n f |_\mathrm{Lip}^2 \cdot \int \pi \left( \mathrm{d} x \right) \cdot \pi \left( \mathrm{d} y \right) \cdot d \left( x, y \right)^2 \\
&\leqslant \left( \mathrm{const.} \right) \cdot \left(1 - \kappa \right)^{2n} \cdot | f |_\mathrm{Lip}^2.
\end{align}
A clever argument allows us to systematically improve this bound without further effort. Indeed, for reversible Markov chains, it is known that the sequence $n \mapsto \| P^n f \|_2^2$ is log-convex (there is a nice spectral argument for this, essentially an application of Jensen's inequality), and so for $0<n<m$,
\begin{align}
\| P^n f \|_2^2 &\leqslant \left( \| P^m f \|_2^2 \right)^{n/m} \cdot \left( \| f \|_2^2 \right)^{1-n/m} \\
&\leqslant \left( \left( \mathrm{const.} \right) \cdot \left(1 - \kappa \right)^{2m} \cdot | f |_\mathrm{Lip}^2 \right)^{n/m} \cdot \left( \| f \|_2^2 \right)^{1-n/m} ,
\end{align}
and sending $m$ to $\infty$ gives that
\begin{align}
\| P^n f \|_2^2 &\leqslant \left(1 - \kappa \right)^{2n} \cdot \| f \|_2^2.
\end{align}
If the metric is chosen such that Lipschitz functions are dense in $L_0^2 \left( \pi \right)$ (which is typically the case), then an approximation argument allows us to conclude that the above inequality holds on all of $L_0^2 \left( \pi \right)$, and hence that the claimed spectral gap inequality holds for $P$ with constant $\gamma \geqslant \kappa$.
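The log-convexity which powers the interpolation step, i.e. $\| P^n f \|_2^4 \leqslant \| P^{n-1} f \|_2^2 \cdot \| P^{n+1} f \|_2^2$, is easy to check numerically on a reversible example (same hypothetical finite chain, illustrative numbers):

```python
import numpy as np

# Illustrative reversible kernel on {0, 1, 2}.
pi = np.array([0.2, 0.5, 0.3])
Q = np.full((3, 3), 1.0 / 3.0)
P = np.where(np.eye(3, dtype=bool), 0.0,
             Q * np.minimum(1.0, pi[None, :] / pi[:, None]))
P += np.diag(1.0 - P.sum(axis=1))

f = np.array([1.0, -2.0, 4.0])
f = f - pi @ f                     # centre: f in L^2_0(pi)

v, g = [], f.copy()
for n in range(12):
    v.append(pi @ g ** 2)          # v_n = |P^n f|_2^2 in L^2(pi)
    g = P @ g

# log-convexity: v_n^2 <= v_{n-1} * v_{n+1} (small multiplicative slack
# to guard against floating-point rounding)
for n in range(1, 11):
    assert v[n] ** 2 <= v[n - 1] * v[n + 1] * (1.0 + 1e-9)
```

Spectrally, $v_n = \sum_i c_i^2 \lambda_i^{2n}$ is a mixture of geometric sequences with non-negative weights, which is log-convex in $n$; this is the "nice spectral argument" mentioned above.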
## Conclusion
This is just a taster, hoping to demonstrate that i) Poincaré inequalities might be relevant objects of study for Markov chain enthusiasts, and ii) that coupling / contractivity techniques may be useful for establishing them in practice.
In the coming posts, I hope to outline (at least) two concrete instances in which this approach can be fruitful for providing estimates of the spectral gap for models of practical interest.
One point which I hope will become clear is that the generality of the 'contractivity implies spectral gap' result offers a great deal of flexibility: if you can identify _any_ metric for which you can prove contractivity, then you can deduce a spectral gap. It doesn't matter if many metrics will not work; as soon as you have contractivity in _one_ nice metric, the spectral gap estimate follows.
### Acknowledgments
My interest in $L^2$ convergence rates of Markov chains has been particularly enabled by my research group in Bristol (Christophe Andrieu, Anthony Lee, Andi Wang). The connection between couplings and spectral gaps is something which I was reminded of thanks to correspondence with Aaron Smith, which prompted me to look into this area more broadly. The proofs which I have sketched above are lifted from the works of Yann Ollivier on 'Ricci curvature of Markov chains'.

Published on **HackMD**