# Implicit Transfer Operators and Boltzmann Priors for Accelerating Molecular Dynamics: AI810 Blog Post (20258104)

Molecular dynamics (MD) simulations are a cornerstone for studying molecular systems, but they are notoriously slow when rare events (like protein folding or ligand unbinding) occur on long timescales. MD must use very small time-steps (~fs) for stability, yet phenomena of interest can span microseconds to seconds. Generating sufficiently long trajectories with standard MD becomes impractical. Recently, generative models have been proposed to learn the system's dynamics, allowing much larger time-steps (ns to $\mu$s) by modeling the transition probability density of the system's state over longer intervals, as shown in Figure 1. This approach treats MD as a Markov process with state $x_t$ and transition density $p(x_{t+\Delta t} \mid x_t)$, and uses machine learning to approximate these transitions directly. In this post, we focus on two papers: **Implicit Transfer Operator (ITO) Learning** [1] and its extension **Boltzmann Priors for ITO (BoPITO)** [2]. We will cover their methodology, including the transfer-operator formalism, and how BoPITO improves ITO in data efficiency and equilibrium accuracy.

![image](https://hackmd.io/_uploads/BJAruBgzeg.png)
###### Figure 1. Initial state $x_t$ (left) and sampled state $x_{t+\Delta t}$ from the learned transition probability distribution $p_{\theta}(x_{t+\Delta t} \mid x_t)$ (right). This figure can be found in [3].

## Background: Markov Transfer Operators in Molecular Dynamics

**Molecular Dynamics as a Markov Process.** In MD, a molecular system's time-evolution is often modeled by a stochastic differential equation (e.g. Langevin dynamics) under a potential energy function $U(x)$. When discretized with a small time-step $\tau$, this yields a Markov chain $x_0 \to x_\tau \to x_{2\tau} \to \dots$ with a transition density $p_\tau(x_{t+\tau} \mid x_t)$ for each step. Under ergodicity, the process has a stationary distribution $\mu(x)$, called the *Boltzmann distribution*,

$$\mu(x) = \frac{\exp(-\beta U(x))}{Z}, \tag{1}$$

where $Z = \int_\Omega e^{-\beta U(x)} dx$ and $\beta$ is the inverse temperature. Important physical quantities can be written as expectations under $\mu(x)$ (stationary observables) or under the path of the Markov chain (dynamic observables).

**Transfer Operator Formalism.** The *transfer operator* is a linear operator that describes the evolution of probability distributions under the Markov chain. Specifically, let $\rho(x)$ be a probability density over states at time $t$. The transfer operator $T_\Omega(\tau)$ maps $\rho$ to a density at time $t+\tau$ by integrating over the transition kernel. The transfer operator acting over $N$ steps can be written as:

$$[T_\Omega(N\tau)\circ \rho](x_{t+N\tau})=\frac{1}{\mu(x_{t+N\tau})}\int_{\Omega}\rho(x_t)\,p_{N\tau}(x_{t+N\tau}\mid x_t)\,d\mu(x_t), \tag{2}$$

where $p_{N\tau}(\cdot\mid\cdot)$ denotes the $N$-step transition density. The spectral decomposition of the transfer operator $T_\Omega$ is given by:

$$[T_\Omega(N\tau)\circ \rho](x) \;=\; \sum_{i=1}^{\infty} \lambda_i(N\tau)\,\langle \varphi_i,\rho \rangle\,\psi_i(x), \tag{3}$$

where $\lambda_i(N\tau)$ are eigenvalues and $(\psi_i, \varphi_i)$ are the corresponding right and left eigenfunctions, with $\langle \varphi_i, \rho \rangle = \int_\Omega \varphi_i(x)\rho(x)\,dx$. By convention, eigenvalues are ordered $|\lambda_1| \ge |\lambda_2| \ge \cdots$.
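To make this spectral picture concrete, here is a minimal numerical sketch (our own illustration, not taken from [1] or [2]): for a discrete-state Markov chain, the transfer operator is just the transition matrix, and its eigendecomposition displays exactly the structure described above and in the next paragraph.

```python
import numpy as np

# Toy 3-state Markov chain standing in for the transfer operator T(tau):
# two metastable "basins" (states 0 and 2) connected via a barrier state (1).
P = np.array([
    [0.98, 0.02, 0.00],   # mostly stay in basin A, rarely hop to the barrier
    [0.10, 0.80, 0.10],   # barrier state relaxes into either basin
    [0.00, 0.02, 0.98],   # mostly stay in basin B
])

# Left eigenvectors of P (computed as right eigenvectors of P.T) play the role
# of the densities phi_i; the one with eigenvalue 1 is the stationary mu.
eigvals, eigvecs = np.linalg.eig(P.T)
order = np.argsort(-np.abs(eigvals))           # sort by |lambda|, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

mu = np.real(eigvecs[:, 0])
mu /= mu.sum()                                 # normalize to a probability vector

print("eigenvalues:", np.round(np.real(eigvals), 4))   # ~[1.0, 0.98, 0.78]
print("stationary distribution mu:", np.round(mu, 3))

# Evolving any initial density rho for many steps converges to mu,
# because every non-leading eigenvalue satisfies |lambda_i| < 1.
rho = np.array([1.0, 0.0, 0.0])
print("rho after 500 steps:", np.round(rho @ np.linalg.matrix_power(P, 500), 3))
```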
For an ergodic Markov operator, $\lambda_1 = 1$ is always an eigenvalue; its eigenfunctions are $\psi_1(x) = 1$ and $\varphi_1(x) = \mu(x)$. All other eigenvalues satisfy $|\lambda_i|<1$, and they govern the relaxation rates of the various slow processes in the system (with $\lambda_i(N\tau) = e^{-k_i N\tau}$ related to the relaxation rate $k_i$). The relationship $\varphi_i(x) = \mu(x)\psi_i(x)$ holds, reflecting detailed balance in reversible systems. In summary, the leading eigenfunction of $T_\Omega$ is the equilibrium distribution $\mu$, while the remaining dominant eigenfunctions encode slow dynamics such as protein folding.

## Implicit Transfer Operator Learning

**Learning Long-Timestep Dynamics.** The ITO framework [1] treats the transition density over a long time interval as a target distribution to learn. Formally, ITO aims to learn a surrogate for $p(x_{t+N\tau}\mid x_t)$, where $N$ can be large (e.g. thousands of MD steps). In the language of transfer operators, ITO's generative model aims to capture the action of the transfer operator $T_\Omega(N\tau)$ on delta-distributions (point masses), since $p(x_{t+N\tau}\mid x_t)$ is exactly the kernel of that operator:

$$[T_\Omega(N\tau)\circ \delta](x_{t+N\tau})=p_{N\tau}(x_{t+N\tau}\mid x_t). \tag{4}$$

By learning this conditional distribution, one can "jump" $N$ steps in a single leap, dramatically speeding up sampling of long trajectories.

**Score-Based Generative Models.** To implement the conditional distribution $p_{N\tau}(x_{t+N\tau}\mid x_t)$, ITO utilizes score-based generative models [3,4], a powerful class of generative models in machine learning. In diffusion models, one defines a forward noising process that gradually perturbs data into random noise. This forward process is typically a fixed SDE, for example:

$$dx^t = f(x^t, t)\,dt + g(t)\,dW_t, \qquad t\in[0,T], \tag{5}$$

where we use subscripts ($x_t$) for *physical time* and superscripts ($x^{t}$) for *diffusion time*, and $W_t$ denotes a Wiener process. Here $x^{0}(=x)$ is a real data sample, and $x^{T}$ is pure noise. Generation is done by simulating the reverse denoising process from noise back to data, which can be formulated as another SDE that requires the score function $\nabla_x \log p^t(x)$ at each intermediate diffusion time. In practice, one trains a neural network $s_\theta(x^t, t)$ to approximate this score at each diffusion time. By simulating the learned reverse process, the model can generate new samples from the data distribution.

**Conditional Score-Based Generative Models for Dynamics.** Importantly, diffusion models can be conditioned on context; for instance, one can train a conditional diffusion model given some condition $y$, so that it learns $p_\theta(x \mid y)$. ITO trains a conditional denoising diffusion model to represent the transition density $p_{N\tau}(x_{t+N\tau}\mid x_t)$ for some large $N$. In other words, the model learns to "denoise" a corrupted version of $x_{t+N\tau}$ back toward the true $x_{t+N\tau}$, with $x_t$ given as context. Formally, we denote the conditional score model as $s_\theta(x^{t_{\text{diff}}}_{t+N\tau},x_t,N,t_{\text{diff}})$ for a noised sample of the future state $x_{t+N\tau}$ at diffusion time $t_{\text{diff}}$, conditioned on the current state $x_t$ and the number of steps $N$, approximating the score $\nabla_{x} \log p(x^{t_{\text{diff}}}_{t+N\tau} \mid x_t)$. When fully denoised ($t_{\text{diff}}=0$), this score corresponds to the true conditional density of interest $p(x_{t+N\tau}\mid x_t)$.
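Below is a minimal PyTorch-style sketch of what such a conditional score model and one denoising-score-matching training step could look like. The architecture, noise schedule, and names (`ConditionalScoreNet`, `dsm_loss`) are illustrative assumptions for a toy system, not the molecular architecture used in [1].

```python
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    """Toy stand-in for s_theta(x_noisy, x_t, N, t_diff).
    The real model conditions on full molecular structure; this MLP is only for illustration."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # Inputs: noisy future state, current state, log-lag N, diffusion time.
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_noisy, x_cond, N, t_diff):
        feats = torch.cat(
            [x_noisy, x_cond, torch.log(N)[:, None], t_diff[:, None]], dim=-1
        )
        return self.net(feats)

def dsm_loss(model, x_t, x_future, N):
    """One denoising-score-matching step with a simple, illustrative noise scale."""
    t_diff = torch.rand(x_t.shape[0])            # diffusion time in (0, 1)
    sigma = 0.01 + t_diff * 1.0                  # assumed toy noise schedule
    noise = torch.randn_like(x_future)
    x_noisy = x_future + sigma[:, None] * noise  # corrupt the future state
    target = -noise / sigma[:, None]             # score of the Gaussian perturbation kernel
    pred = model(x_noisy, x_t, N, t_diff)
    return ((pred - target) ** 2 * sigma[:, None] ** 2).mean()

# Dummy data: a batch of (x_t, x_{t+N*tau}) pairs with random lags N.
dim, batch = 6, 32
model = ConditionalScoreNet(dim)
x_t, x_future = torch.randn(batch, dim), torch.randn(batch, dim)
N = torch.randint(1, 1000, (batch,)).float()
loss = dsm_loss(model, x_t, x_future, N)
loss.backward()
print("DSM loss:", loss.item())
```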
**Multi-Resolution Training.** A key idea in ITO is to train on multiple time resolutions simultaneously. Instead of fixing one lag $N$, the training data can include pairs $(x_t, x_{t+N\tau})$ for a range of $N$ (e.g. both short and long intervals). This forces the model to learn transition densities at different time scales, which encourages self-consistency across time scales. In practice, one randomly samples $N$ from a set of lags used in training and informs the model of the chosen $N$ by encoding $N$ as an input feature. The ITO paper demonstrated that such multi-lag training helps the model capture the slow dynamical modes of the system better than training at a single fixed lag. Intuitively, by learning short-, medium-, and long-term transitions of the process, the model can learn the underlying Markov process more robustly. Given MD trajectories, we can sample many such pairs with various time lags $N\tau$, as sketched below.
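Here is a minimal sketch (our own illustration, with assumed array shapes and an assumed dyadic lag set) of how such multi-lag training pairs might be drawn from stored trajectories:

```python
import numpy as np

def sample_lag_pairs(trajectories, lags, batch_size, rng=None):
    """Draw (x_t, x_{t+N}, N) training tuples with lags N sampled from a
    predefined set, mixing short and long time-resolutions.

    trajectories: list of arrays of shape (T_i, dim), saved every tau.
    lags: admissible N values (in units of saved frames); every trajectory
          is assumed to be longer than the largest lag.
    """
    rng = rng or np.random.default_rng()
    x_t, x_tpN, Ns = [], [], []
    for _ in range(batch_size):
        N = int(rng.choice(lags))                      # random time resolution
        traj = trajectories[rng.integers(len(trajectories))]
        t = rng.integers(len(traj) - N)                # random start frame
        x_t.append(traj[t]); x_tpN.append(traj[t + N]); Ns.append(N)
    return np.stack(x_t), np.stack(x_tpN), np.array(Ns)

# Toy usage: three fake "trajectories" and a dyadic set of lags.
trajs = [np.random.randn(5000, 6) for _ in range(3)]
x_t, x_tpN, Ns = sample_lag_pairs(trajs, lags=[1, 4, 16, 64, 256, 1024], batch_size=8)
print(x_t.shape, x_tpN.shape, Ns)
```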
## Boltzmann Priors for ITO

While ITO establishes the concept of learning MD transition models, its practicality can be limited by the amount of training data required. BoPITO [2] addresses this by leveraging **Boltzmann Generators** (BGs) [6] to (1) collect the training data and (2) incorporate the Boltzmann distribution $\mu(x)$ into the conditional generative model of $p_{N\tau}(x_{t+N\tau}\mid x_t)$. This results in more data-efficient training and a model that samples from the correct equilibrium distribution in the long-time limit. We break down BoPITO's contributions below.

**Boltzmann Generator Prior for Data Generation.** A BG is a generative model trained to produce equilibrium samples from the Boltzmann distribution. BoPITO leverages a pretrained BG as a prior to initialize MD simulations and generate training data more efficiently. Instead of starting all MD simulations from one relaxed structure, which requires long simulations to explore the other metastable states, one can sample diverse starting points from the BG that cover the various metastable basins according to $\mu$, and run many short simulations from each. This adaptive sampling strategy yields many short trajectories that broadly cover the configuration space, which is far more efficient for collecting rare transitions than a few long trajectories. In BoPITO, using a BG in this way led to an order-of-magnitude reduction in the amount of MD simulation data needed to train the ITO model.

![Image](https://hackmd.io/_uploads/Hk19rHgGxg.png)
###### Figure 2. BoPITO leverages pre-trained Boltzmann Generators to enable data-efficient training of ITO models of the transition density. This figure can be found in [2].

**Incorporating the Boltzmann Generator into the Model.** The crucial insight of BoPITO is that the Boltzmann distribution is the leading eigenfunction of the transfer operator: the eigenfunction corresponding to $\lambda_1=1$ is $\mu(x)$, while all other eigenmodes decay with $|\lambda_i|<1$. BoPITO explicitly separates these components. Instead of learning the entire transition density from scratch, BoPITO fixes the stationary part using a BG trained on $\mu(x)$ and only learns the remaining, decaying part of the dynamics, as shown in Figure 2.

Formally, BoPITO defines the conditional score as a sum of two parts:

$$ s_\theta(x^{t_{\text{diff}}}_{t+N\tau},x_t,N,t_{\text{diff}}) =s_{\text{eq}}(x^{t_{\text{diff}}}_{t+N\tau},t_{\text{diff}})+\hat\lambda^N s_{\text{dyn}}(x^{t_{\text{diff}}}_{t+N\tau},x_t,N,t_{\text{diff}}), \tag{6} $$

where $s_{\text{eq}}$ is the score function of the BG (the stationary component), and $s_{\text{dyn}}$ is the remaining score for the dynamic part of the transition (the decaying modes). $\hat\lambda\in(0,1)$ is a tunable hyperparameter that scales the contribution of the dynamic part depending on the time lag $N$. This form is inspired by the spectral expansion of the transition probability:

$$p(x_{t+N\tau}\mid x_t) = \mu(x_{t+N\tau}) \;+\; \sum_{i=2}^\infty \lambda_i^N\varphi_i(x_{t+N\tau})\psi_i(x_t). \tag{7}$$

The first term is the leading mode $\mu(x)$, and the rest are decaying modes with $\lambda_i^N$ factors. By setting $s_{\text{eq}} = \nabla_x \log \mu(x)$ and letting $s_{\text{dyn}}$ model the remaining terms, we effectively achieve this decomposition of the score function. In practice, $s_{\text{eq}}$ can be obtained from a Boltzmann Generator or even directly from the known force field, $s_{\text{eq}}(x) = -\beta \nabla U(x)$. By fixing $s_{\text{eq}}$, BoPITO removes the burden of learning the equilibrium part and lets the network focus on the decaying modes.

A major benefit of the decomposition is that as $N\to \infty$, the term $\hat\lambda^N s_{\text{dyn}} \to 0$ (because $\hat\lambda<1$). This means that in the long-time limit the model's transition density tends to $\mu(x)$ regardless of $x_t$. In other words, the model approaches the BG for large $N$, providing an inductive bias that the dynamics should relax toward equilibrium at long times.
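To make Eq. (6) concrete, here is a minimal sketch (our own toy illustration; the function names, the harmonic potential, and the default $\hat\lambda$ value are assumptions) of how a fixed equilibrium score and a damped, learned dynamic score could be combined:

```python
import torch

def combined_score(s_eq, s_dyn, x_noisy, x_cond, N, t_diff, lam_hat=0.99):
    """BoPITO-style score (Eq. 6): fixed equilibrium part plus a learned
    dynamic part damped by lam_hat**N; function names are illustrative."""
    damping = lam_hat ** N                       # -> 0 as the lag N grows
    return s_eq(x_noisy, t_diff) + damping[:, None] * s_dyn(x_noisy, x_cond, N, t_diff)

# Toy stand-ins: an analytic equilibrium score for a harmonic U(x) = 0.5*k*x^2,
# so grad log mu(x) = -beta*k*x at t_diff = 0 (in practice s_eq would be the
# diffusion-time-dependent score of a pretrained BG), and a toy dynamic score
# that pulls samples toward the conditioning state x_cond.
beta, k = 1.0, 2.0
s_eq = lambda x, t_diff: -beta * k * x
s_dyn = lambda x, x_cond, N, t_diff: -(x - x_cond)

x_noisy = torch.randn(4, 6)
x_cond = torch.randn(4, 6)
N = torch.tensor([1.0, 10.0, 100.0, 1000.0])
t_diff = torch.rand(4)

score = combined_score(s_eq, s_dyn, x_noisy, x_cond, N, t_diff)
print(score.shape)   # (4, 6); for large N the output approaches the equilibrium score
```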
"Boltzmann priors for Implicit Transfer Operators." arXiv (2024). [3] Klein, L., Foong, A., Fjelde, T., Mlodozeniec, B., Brockschmidt, M., Nowozin, S., ... & Tomioka, R. Timewarp: Transferable acceleration of molecular dynamics by learning time-coarsened dynamics. Advances in Neural Information Processing Systems (2023). [4] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems (2020). [5] Song, Yang, et al. "Score-Based Generative Modeling through Stochastic Differential Equations." International Conference on Learning Representations (2020). [6] Noé, Frank, et al. "Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning." Science (2019).