###### tags: `one-offs` `sampling` `particles`

# Particle Systems for Sampling, and their Limits

**Overview**: In this note, I will discuss some connections between dynamical systems which are used for sampling purposes, the evolution PDEs which are associated to those systems, and certain limits of these systems which produce interesting behaviour.

## Dynamics for Sampling

In many areas of computational mathematics, it is of interest to draw samples from a prescribed distribution $\pi (dx) = \exp \left( - V(x) \right) dx / Z$, where $V$ is a 'potential' function mapping $\mathbf{R}^d \to \mathbf{R}$. When $d$ is sufficiently large, direct sampling methods like rejection sampling, inverse transform sampling, and similar are infeasibly expensive, and so one has to resort to approximate methods.

A popular class of approximate methods is based on *Markov Chain Monte Carlo*, or MCMC. The premise of MCMC is to simulate an ergodic dynamical system whose state converges in distribution to $\pi$. A prototypical example of such a system is the *overdamped Langevin dynamics*, given by the stochastic differential equation
\begin{align}
dX_t = - \nabla_x V (X_t) dt + \sqrt{2} dW_t.
\end{align}
A more advanced example is the *underdamped Langevin dynamics*, which actually sample from $\pi(dx) \otimes \mathcal{N} (dv | 0, I_d )$, and are given by
\begin{align}
dX_t &= V_t dt \\
dV_t &= - \nabla_x V (X_t) dt - \gamma V_t dt + \sqrt{2 \gamma} dW_t.
\end{align}
There exist a great many other systems which can also be used to sample in this way.

## Understanding Dynamical Systems via PDEs

While it is intuitive to consider these dynamical systems at the 'particle level', as presented above, it can often be useful to consider their action in a more abstract way. For example, consider an observable of the form $f(x, v)$. Instead of considering the path of an individual particle, which may be quite rough and irregular, we could instead study how $f$ evolves, on average, along a trajectory of the dynamics. To this end, we might define
\begin{align}
\left( \mathcal L f\right)(x, v) := \lim_{t \to 0^+} \frac{\mathbf{E}[f(X_t, V_t) | (X_0, V_0) = (x, v)] - f(x, v)}{t}.
\end{align}
$\mathcal L$ is known as the *infinitesimal generator* of the stochastic process. For diffusions, the operator $\mathcal L$ is a second-order linear differential operator which acts on sufficiently smooth functions of $(x, v)$. As an example, the generator of the overdamped Langevin dynamics is given by
\begin{align}
\left( \mathcal L f\right)(x) &= \langle - \nabla_x V(x), \nabla_x f(x) \rangle + \Delta_x f(x),
\end{align}
and the generator of the underdamped Langevin dynamics is given by
\begin{align}
\left( \mathcal L f \right)(x, v) &= \langle v, \nabla_x f (x, v) \rangle - \langle \nabla_x V(x), \nabla_v f (x, v) \rangle \\
&+ \gamma \left( \langle - v, \nabla_v f (x, v) \rangle + \Delta_v f (x, v) \right).
\end{align}
Now, if we define $f_t (x, v) = \mathbf{E}[f(X_t, V_t) | (X_0, V_0) = (x, v)]$, then it can be shown that $f_t$ solves the linear partial differential equation (PDE)
\begin{align}
\partial_t f_t = \mathcal{L} f_t.
\end{align}
A benefit of this framing is that the generator allows us to characterise the behaviour of a finite-dimensional stochastic system (which requires one set of tools) by studying an infinite-dimensional deterministic system (which requires a different set of tools). Depending on the situation, either of these toolsets might be more appropriate.
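Before turning to the dual picture of measures, here is a minimal numerical sketch of the particle-level dynamics above: an Euler–Maruyama discretisation of the overdamped Langevin SDE (sometimes called the unadjusted Langevin algorithm), targeting a toy two-dimensional Gaussian. The potential, step size, and run length are illustrative choices rather than recommendations, and the discretisation introduces a bias which a more careful treatment would need to account for.

```python
import numpy as np

# Toy anisotropic Gaussian target: V(x) = 0.5 * x^T A x, so grad V(x) = A x
# and the target covariance is A^{-1}. The matrix A is an illustrative choice.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])

def grad_V(x):
    return A @ x

def overdamped_langevin(x0, step_size=1e-2, n_steps=50_000, seed=0):
    """Euler-Maruyama discretisation of dX = -grad V(X) dt + sqrt(2) dW."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - step_size * grad_V(x) + np.sqrt(2.0 * step_size) * noise
        samples[t] = x
    return samples

if __name__ == "__main__":
    samples = overdamped_langevin(x0=[3.0, -3.0])
    print("empirical covariance:\n", np.cov(samples, rowvar=False))
    print("target covariance (A^{-1}):\n", np.linalg.inv(A))
```

For this Gaussian target the exact covariance is $A^{-1}$, so the final print statements give a crude check that the trajectory is (approximately) sampling from the right distribution.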
The dual picture is to consider the evolution of measures which is induced by the dynamics, i.e. if $\mu_t$ is the law of $(X_t, V_t)$, how does $\mu_t$ evolve in time? Fixing a test function $f$, it holds by definition that for small $t$,
\begin{align}
\mathbf{E}[f(X_t, V_t)] &= \int \mu_t (x, v) f(x, v) dx dv\\
\mathbf{E}[f(X_0, V_0)] + t \mathbf{E}[\left(\mathcal{L}f \right)(X_0, V_0)] + o(t) &= \int \left( \mu_0 (x, v) + t \left( \partial_t \mu_0 \right)(x, v) + o(t) \right) f(x, v) dx dv\\
\mathbf{E}[\left(\mathcal{L}f \right)(X_0, V_0)] &= \int \left( \partial_t \mu_0 \right)(x, v) f(x, v) dx dv \\
\int \mu_0 (x, v) \left(\mathcal{L}f \right) (x, v) dx dv &= \int \left( \partial_t \mu_0 \right)(x, v) f(x, v) dx dv.
\end{align}
As such, under appropriate technical conditions, it holds that $\partial_t \mu_0 = \mathcal{L}^* \mu_0$, where $\mathcal{L}^*$ is the adjoint operator of $\mathcal{L}$ on an appropriate subspace of $L^2(dx, dv)$. One can then deduce that $\mu_t$ evolves as
\begin{align}
\partial_t \mu_t = \mathcal{L}^* \mu_t.
\end{align}
An important observation is that, because each of the dynamical systems considered so far is a 'single-particle' system, the corresponding PDEs are linear (in $f_t$ and $\mu_t$ respectively). This will be contrasted with the examples which follow.

## Nonlinear Dynamical Systems for Sampling

One can generalise the underdamped Langevin dynamics by introducing a *preconditioning matrix* $L$, and modifying the dynamics to
\begin{align}
dX_t &= L V_t dt \\
dV_t &= - L\nabla_x V (X_t) dt - \gamma V_t dt + \sqrt{2 \gamma} dW_t.
\end{align}
For any symmetric $L$, this diffusion preserves the desired invariant measure. There is then a reasonable question of how $L$ ought to be set in order to achieve the optimal convergence rate. While this is largely an open question in nontrivial settings, a heuristic which has been found to be useful in practice is to take $L$ to be a square root of the covariance matrix of $X$ under $\pi$. However, this matrix is typically unknown. As such, we might attempt to approximate it using a particle system, i.e. with $N$ particles $\{ (X_t^a, V_t^a ) \}_{a = 1}^N$, run the dynamics according to
\begin{align}
dX_t^a &= \mathrm{L} \left( \rho_t \right) V_t^a dt \\
dV_t^a &= - \mathrm{L} \left( \rho_t \right) \nabla_x V (X_t^a) dt - \gamma V_t^a dt + \sqrt{2 \gamma} dW_t^a,
\end{align}
where $\rho_t$ is the empirical measure of $\{ X_t^a \}_{a = 1}^N$ and $\mathrm{L} ( \cdot )$ is a square root of the covariance operator, i.e.
\begin{align}
\mathrm{L} ( m ) = \left( \textbf{Cov}_m [ X ] \right)^{1/2}.
\end{align}
One can equally write down an evolution equation for the expectations of functions at a given time, but there is now an extra complication: because the dynamics depend on $\mu_t$, so too does the operator $\mathcal{L}$, i.e. one must write
\begin{align}
\partial_t f_t &= \mathcal{L}_{\mu_t} f_t \\
\partial_t \mu_t &= \mathcal{L}_{\mu_t}^* \mu_t.
\end{align}
More precisely, the measure $\mu_t$ evolves as
\begin{align}
\partial_t \mu (x, v, t) &= - \text{div}_x \left( \mu (x, v, t) \, \mathrm{L} ( \rho (\cdot, t)) v \right) \\
&+ \text{div}_v \left( \mu (x, v, t) \, \mathrm{L} ( \rho (\cdot, t)) \nabla_x V (x) \right) \\
&+ \gamma \left( \text{div}_v \left( \mu (x, v, t) v \right) + \Delta_v \mu (x, v, t) \right),
\end{align}
where $\rho (\cdot, t)$ now denotes the $x$-marginal of $\mu (\cdot, \cdot, t)$. In particular, the evolution PDE for $\mu_t$ is now *nonlinear* in $\mu_t$, through the influence of $\rho(\cdot, t)$.
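To make the preceding construction concrete, here is a minimal sketch of the finite-$N$ interacting particle system, in which the preconditioner is recomputed at each step from the empirical covariance of the particle positions, using a symmetric positive semi-definite matrix square root. The target, step size, particle number, and function names are toy choices, and the plain Euler–Maruyama discretisation is used purely for illustration.

```python
import numpy as np

# Toy ill-conditioned Gaussian target: V(x) = 0.5 * x^T A x with A diagonal,
# chosen purely for illustration.
A = np.diag([1.0, 50.0])

def grad_V(X):
    # Row-wise gradient A x for a batch of positions X of shape (N, d).
    return X @ A.T

def sqrtm_psd(C, eps=1e-10):
    # Symmetric positive semi-definite matrix square root via an eigendecomposition.
    evals, evecs = np.linalg.eigh(C)
    return (evecs * np.sqrt(np.clip(evals, eps, None))) @ evecs.T

def ensemble_preconditioned_langevin(N=200, d=2, gamma=1.0, dt=1e-3,
                                     n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, d))    # positions X_t^a
    V = rng.standard_normal((N, d))    # velocities V_t^a
    for _ in range(n_steps):
        # Preconditioner from the empirical measure of the current positions.
        L = sqrtm_psd(np.cov(X, rowvar=False))
        gX = grad_V(X)
        noise = rng.standard_normal((N, d))
        # Euler-Maruyama step: independent noise per particle, shared matrix L.
        X = X + dt * V @ L.T
        V = V - dt * gX @ L.T - dt * gamma * V + np.sqrt(2.0 * gamma * dt) * noise
    return X

if __name__ == "__main__":
    positions = ensemble_preconditioned_langevin()
    print("empirical covariance of final positions:\n",
          np.cov(positions, rowvar=False))
    print("target covariance (A^{-1}):\n", np.linalg.inv(A))
```

Note that the particles interact only through the shared matrix $\mathrm{L}(\rho_t)$; replacing it by the identity recovers $N$ independent copies of the underdamped Langevin dynamics.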
Generally speaking, when a time-dependent PDE concerning the evolution of a probability measure is nonlinear, this reflects the presence of interaction effects at the level of the particle dynamics.

## From Kinetic to Hydrodynamic Evolutions

The PDE constructed above would generally be referred to as being of 'mean-field' or 'kinetic' type, as the dynamics correspond to those of an ensemble of interacting particles which each have their own position and momentum. A further step up the hierarchy is to take the so-called 'hydrodynamic' limit, in which the velocity of a particle is a deterministic function of its position, i.e. any two particles at the same location at the same time will have the same velocity. This is known as the 'isokinetic' assumption. As such, instead of tracking the evolution of the joint law of $(x, v)$, one can simplify to studying the marginal law of $x$, and seek an evolution PDE for $\rho (x, t) = \int \mu (x, v, t) dv$, jointly with the velocity at location $x$ and time $t$, which we write as $u(x, t)$.

To begin with, write
\begin{align}
\partial_t \rho (x, t) &= \partial_t \left( \int \mu (x, v, t) dv\right) \\
&= \int \partial_t\mu (x, v, t) dv \\
&= \int \left( - \text{div}_x \left( \mu \mathrm{L} v \right) + \text{div}_v \left( \mu \mathrm{L} \nabla_x V (x) \right) + \gamma \left( \text{div}_v \left( \mu v \right) + \Delta_v \mu \right)\right) dv.
\end{align}
Integration by parts causes the last three terms to vanish, leading us to
\begin{align}
\partial_t \rho (x, t) &= - \int \text{div}_x \left( \mu \mathrm{L} v \right) dv \\
&= - \text{div}_x \left( \mathrm{L} ( \rho (\cdot, t)) \int \mu (x, v, t) v dv \right) \\
&= - \text{div}_x \left( \mathrm{L} ( \rho (\cdot, t)) \rho (x, t) u (x, t) \right).
\end{align}
Thus, given $u(x, t)$, we have a transport equation for the evolution of $\rho (x, t)$.

We now look one step up the hierarchy, and consider the evolution of $\rho (x, t) u (x, t)$ in time:
\begin{align}
\partial_t \left( \rho (x, t) u(x, t) \right) &= \partial_t \left( \int \mu (x, v, t) v dv \right) \\
&= \int \left( \partial_t\mu (x, v, t) \right) v dv \\
&= \int \left( - \text{div}_x \left( \mu \mathrm{L} v \right) + \text{div}_v \left( \mu \mathrm{L} \nabla_x V (x) \right) + \gamma \left( \text{div}_v \left( \mu v \right) + \Delta_v \mu \right) \right) v dv \\
&= -I + II + \gamma III.
\end{align}
Now, we again harness integration by parts to simplify these terms. Recalling that $u$ is a vector-valued function, we work in coordinates:
\begin{align}
I_i &= \int v_i \text{div}_x \left( \mu \mathrm{L} v \right) dv \\
&= \sum_{j, k} \frac{\partial}{\partial x_j} \left( \mathrm{L}_{j, k} \int \mu (x, v, t) v_i v_k dv \right) \\
&= \sum_{j, k} \frac{\partial}{\partial x_j} \left( \mathrm{L}_{j, k} \rho (x, t) u_i (x, t) u_k (x, t) \right) \\
&= \text{div}_x \left( \rho (x, t) u_i (x, t) \left( \mathrm{L} u\right) (x, t) \right),
\end{align}
where the isokinetic assumption has been used to identify $\int \mu (x, v, t) v_i v_k dv$ with $\rho (x, t) u_i (x, t) u_k (x, t)$,
\begin{align}
II_i &= \int v_i \text{div}_v \left( \mu \mathrm{L} \nabla_x V (x) \right) dv \\
&= \sum_{j, k} \int \mathrm{L}_{j, k} v_i \frac{\partial}{\partial v_j} \left( \mu (x, v, t) \frac{\partial V}{\partial x_k} (x) \right) dv \\
&= - \sum_{j, k} \mathrm{L}_{j, k} \delta_{i, j} \int \mu (x, v, t) \frac{\partial V}{\partial x_k} (x) dv \\
&= - \sum_{k} \mathrm{L}_{i, k} \rho (x, t) \frac{\partial V}{\partial x_k} (x) \\
&= - \rho (x, t) \left( \mathrm{L} \nabla_x V \right)_i (x),
\end{align}
\begin{align}
III_i &= \int v_i \left( \text{div}_v \left( \mu v \right) + \Delta_v \mu \right) dv \\
&= - \int \mu (x, v, t) v_i dv + 0 \\
&= - \rho (x, t) u_i (x, t).
\end{align}
Collecting these terms, one can write the full hydrodynamic limit as
\begin{align}
\partial_t \rho (x, t) + \text{div}_x \left( \mathrm{L} \rho (x, t) u (x, t) \right) &= 0 \\
\partial_t \left( \rho (x, t) u_i (x, t) \right) + \text{div}_x \left( \rho (x, t) u_i (x, t) \left( \mathrm{L} u\right) (x, t) \right) &= - \rho (x, t) \left( \mathrm{L} \nabla_x V \right)_i (x) - \gamma \rho (x, t) u_i (x, t) \\
\mathrm{L} &= \mathrm{L} ( \rho (\cdot, t)).
\end{align}
In principle, a parameter-counting argument suggests that there is sufficient information in these equations to close the system. However, from the rigorous PDE perspective, the deed is far from done, and the well-posedness properties typically require significant effort to establish. Indeed, the Euler and Navier-Stokes equations of fluid dynamics are of a similar form to these equations, and are well-known as a formidable mathematical challenge! (A small numerical sketch of the one-dimensional case of this system is given in the appendix below.)

## Conclusion

In this note, I have described some aspects of how dynamical systems can be used in the context of approximate sampling, and how such systems can be studied through the lens of partial differential equations. Using this perspective, one can also study dynamical systems of particles which interact with one another through their empirical measure, and the limits of these systems. I then gave an example of a simple instance of an interacting particle system which can be formally studied with these tools, including a sketch of how a hydrodynamic limit might look for this system.
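## Appendix: A Numerical Sketch of the Hydrodynamic System

As a closing illustration, below is a minimal finite-volume sketch of the one-dimensional version of the hydrodynamic system above, with $\mathrm{L}(\rho)$ taken to be the standard deviation of $\rho$ (the one-dimensional 'covariance square root'). It uses a simple local Lax–Friedrichs flux with a crude boundary treatment; the potential, grid, time step, and initial data are all illustrative assumptions, and no claim of quantitative accuracy is intended.

```python
import numpy as np

# Finite-volume sketch of the 1-D hydrodynamic system
#   d/dt rho     + d/dx ( L rho u   ) = 0,
#   d/dt (rho u) + d/dx ( L rho u^2 ) = - rho L V'(x) - gamma rho u,
# with L = L(rho) the standard deviation of rho.

def L_of_rho(rho, x, dx, eps=1e-12):
    # Standard deviation of the (discretised) density rho.
    mass = np.sum(rho) * dx
    mean = np.sum(x * rho) * dx / mass
    var = np.sum((x - mean) ** 2 * rho) * dx / mass
    return np.sqrt(max(var, eps))

def step(rho, m, x, dx, dt, gamma, grad_V, eps=1e-12):
    L = L_of_rho(rho, x, dx)
    u = m / np.maximum(rho, eps)
    # Physical fluxes of the conserved variables (rho, m = rho * u).
    F_rho, F_m = L * m, L * m * u
    # Local Lax-Friedrichs (Rusanov) flux at the interior cell interfaces.
    a = np.maximum(np.abs(L * u[:-1]), np.abs(L * u[1:]))
    Fh_rho = 0.5 * (F_rho[:-1] + F_rho[1:]) - 0.5 * a * (rho[1:] - rho[:-1])
    Fh_m = 0.5 * (F_m[:-1] + F_m[1:]) - 0.5 * a * (m[1:] - m[:-1])
    rho_new, m_new = rho.copy(), m.copy()
    # Conservative update on interior cells; the outermost cells are left fixed,
    # which is a crude boundary treatment (rho is negligible there anyway).
    rho_new[1:-1] -= dt / dx * (Fh_rho[1:] - Fh_rho[:-1])
    m_new[1:-1] -= dt / dx * (Fh_m[1:] - Fh_m[:-1])
    # Source terms: preconditioned force and friction.
    m_new[1:-1] += dt * (-rho[1:-1] * L * grad_V(x[1:-1]) - gamma * m[1:-1])
    return rho_new, m_new, L

if __name__ == "__main__":
    x = np.linspace(-4.0, 4.0, 401)
    dx = x[1] - x[0]
    grad_V = lambda y: y                      # V(x) = x^2 / 2, a toy potential
    rho = np.exp(-(x - 1.0) ** 2)             # initial bump away from the origin
    rho /= np.sum(rho) * dx                   # normalise to unit mass
    m = np.zeros_like(x)                      # start at rest
    dt, gamma = 2e-4, 1.0
    for _ in range(5000):
        rho, m, L = step(rho, m, x, dx, dt, gamma, grad_V)
    mean = np.sum(x * rho) * dx
    print(f"total mass = {np.sum(rho) * dx:.4f}, mean = {mean:.3f}, L(rho) = {L:.3f}")
```

The printed total mass gives a basic sanity check on the conservative part of the scheme, while the mean and $\mathrm{L}(\rho)$ track how the density is transported towards the minimum of the potential.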