In this lecture, we present a continuous normalizing flow approach for steering the continuity equation toward a target probability distribution. This method is especially relevant when the goal state carries some uncertainty; when the goal state is a single point, it reduces to a version of the policy gradient method for continuous-time systems.
Let $\rho_0$ be a reference distribution. Let $\rho_T$ be the goal distribution.
# Quick intro to Maximum Likelihood Estimation
Let $\rho_T(x)$ be the true (data) probability density on $X$, and let $\rho_\theta(x)$ be a parameterized model density.
One way to fit the model is to minimize the discrepancy between $\rho_T$ and $\rho_\theta$ using the [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence):
$$
\mathrm{KL}(\rho_T \,\|\, \rho_\theta)
:= \int \rho_T(x)\,\log\frac{\rho_T(x)}{\rho_\theta(x)}\,dx.
$$
Expanding the logarithm gives
$$
\mathrm{KL}(\rho_T \,\|\, \rho_\theta)
= \int \rho_T(x)\log \rho_T(x)\,dx \;-\; \int \rho_T(x)\log \rho_\theta(x)\,dx.
$$
The first term $\int \rho_T(x)\log \rho_T(x)\,dx$ does not depend on $\theta$. Therefore,
$$
\arg\min_\theta \mathrm{KL}(\rho_T \,\|\, \rho_\theta)
\;=\;
\arg\max_\theta \int \rho_T(x)\log \rho_\theta(x)\,dx.
$$
The right-hand side is the expected log-likelihood under the data distribution. Hence,
$$
\text{Maximize likelihood}
\;\Longleftrightarrow\;
\text{Minimize }\mathrm{KL}(\rho_T \,\|\, \rho_\theta).
$$
**MLE with samples.**
Our goal is to develop a sample-based approximation of this objective. Given i.i.d. data $x_1,\dots,x_n \sim \rho_T$, we can approximate the expectation by the empirical average:
$$
\max_\theta \int \rho_T(x)\log \rho_\theta(x)\,dx
\;\approx\;
\max_\theta \frac{1}{n}\sum_{i=1}^n \log \rho_\theta(x_i).
$$
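As a minimal numerical sketch (assuming a one-dimensional Gaussian model family and NumPy, neither of which is fixed by the lecture), the empirical average above can be evaluated and maximized directly:
```python
import numpy as np

# Sketch: fit a 1-D Gaussian model rho_theta = N(mu, sigma^2) by maximizing
# the empirical log-likelihood of samples drawn from rho_T.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # x_1, ..., x_n ~ rho_T

def avg_log_likelihood(mu, sigma, x):
    # (1/n) * sum_i log rho_theta(x_i) for the Gaussian model
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# For the Gaussian family the maximizer is available in closed form:
mu_hat, sigma_hat = data.mean(), data.std()
print(mu_hat, sigma_hat, avg_log_likelihood(mu_hat, sigma_hat, data))
```
For richer model families no closed-form maximizer exists, and one instead runs gradient ascent on the same empirical objective; that is exactly what the CNF training below does.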
# Continuous Normalizing Flows
Now we introduce the idea of continuous normalizing flows for transporting control systems to target distributions. Let's start with the simplest case, where the system is
$$\dot{x} = u$$
We want to control this system from $\rho_0$ to $\rho_T$.
Let $\rho$ be the density of the variable $x$. The change in the log-density along the flow can be computed using the chain rule:
$$\frac{d}{dt}\log \rho(t,x(t)) = \frac{\partial_t \rho(t,x(t)) + u(t,x(t))\cdot \nabla_x \rho(t,x(t))}{\rho(t,x(t))} \tag{3}$$
assuming $\rho>0$; this necessarily holds (provided $\rho_0>0$) whenever $u(t,x)$ is smooth, by the following change-of-variables formula:
If $f:\mathbb{R}^d \rightarrow \mathbb{R}^d$ is an invertible $C^1$ map with a $C^1$ inverse, and $X=f(Z)$ for a random variable $Z$ with density $p_Z$, then the density $p_X$ of $X$ satisfies
$$p_X(x) = p_Z\!\big(f^{-1}(x)\big)\,\left|\det\!\left(\frac{\partial f^{-1}(x)}{\partial x}\right)\right|. \tag{CoV}$$
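As a quick sanity check of (CoV): take $f(z)=\sigma z+\mu$ with $\sigma>0$ and let $p_Z$ be the standard normal density. Then $f^{-1}(x)=(x-\mu)/\sigma$ and (CoV) gives
$$p_X(x) = p_Z\!\left(\frac{x-\mu}{\sigma}\right)\frac{1}{\sigma}
= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$
i.e. $X\sim\mathcal{N}(\mu,\sigma^2)$, as expected.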
Recall the [continuity equation](https://hackmd.io/6dbdw7gFR8OeOzkBjhL56g) satisfied by the density $\rho$.
$$\partial_t \rho = -\nabla \cdot (u(t,x) \rho )$$
Substituting the continuity equation into $(3)$ and expanding $\nabla\cdot(u\rho) = \rho\,\nabla\cdot u + u\cdot\nabla_x\rho$, the gradient terms cancel and we obtain
$$\frac{d}{dt}\log \rho(t,x(t)) = -\nabla \cdot u(t,x(t)).$$
Therefore, integrating along the trajectory,
$$\log \rho(T,x(T)) = \log \rho(0,x(0)) - \int_{0}^T \nabla \cdot u(t,x(t))\,dt.$$
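As a quick worked example, take $d=1$ and the linear field $u(t,x)=a\,x$, so that $\nabla\cdot u = a$. Then $x(T)=e^{aT}x(0)$ and the formula gives
$$\log\rho(T,x(T)) = \log\rho(0,x(0)) - aT,$$
which agrees with (CoV) applied to the map $x(0)\mapsto e^{aT}x(0)$, whose Jacobian determinant is $e^{aT}$.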
Replacing $u(t,x)$ with a parameterized function class $u_{\theta}(t,x)$, and given a data sample $x_i \sim \rho_T$, let $x_i(t)$ be the solution of the ODE solved backward in time
$$
\dot{x}_i(t) = u_\theta(t,x_i(t)), \qquad x_i(T)=x_i.
$$
Then the CNF change-of-variables formula gives
$$\log \rho_\theta(T,x_i)
=\log \rho_0\big(x_i(0)\big)-\int_{0}^{T} \nabla\cdot u_\theta\big(t,x_i(t)\big)\,dt.$$
Therefore the sample-based maximum-likelihood objective can be written as
$$
\max_{\theta}\;\frac{1}{N}\sum_{i=1}^N
\left[
\log \rho_0\big(x_i(0)\big)
-\int_{0}^{T} \nabla\cdot u_\theta\big(t,x_i(t)\big)\,dt
\right].
$$
To summarize, the algorithm is as follows:
$$
\boxed{
\begin{aligned}
&\textbf{Algorithm (CNF training)}\\[4pt]
&1.\ \text{Initialize parameters } \theta.\\
&2.\ \text{Repeat until convergence:}\\
&3.\ \text{For each data point } x_i \sim \rho_T,\ \text{solve backward}\\
&\qquad \dot{x}_i(t)=u_\theta(t,x_i(t)),\quad x_i(T)=x_i,\\
&\qquad s_i=\int_0^T \nabla\cdot u_\theta\big(t,x_i(t)\big)\,dt.\\
&4.\ \text{Compute the loss}\\
&\qquad \mathcal{L}(\theta)= -\frac{1}{N}\sum_{i=1}^N\Big(\log \rho_0(x_i(0)) - s_i\Big).\\
&5.\ \text{Update parameters}\\
&\qquad \theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta).
\end{aligned}
}
$$
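Below is a minimal sketch of this training loop in PyTorch (an assumption; the lecture does not fix a framework). It uses a small MLP for $u_\theta$, fixed-step Euler integration backward in time, the exact autograd divergence (feasible in low dimension), a standard-normal $\rho_0$, and synthetic stand-in samples for $\rho_T$:
```python
import math
import torch
import torch.nn as nn

# Sketch of the CNF training loop: 2-D state, MLP velocity field u_theta,
# explicit Euler integration backward over [0, T] with T = 1.
torch.manual_seed(0)
d, n_steps = 2, 20
dt = 1.0 / n_steps

u_theta = nn.Sequential(nn.Linear(d + 1, 64), nn.Tanh(), nn.Linear(64, d))
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

def velocity(t, x):
    # u_theta(t, x): time enters as an extra input feature
    tcol = torch.full((x.shape[0], 1), t)
    return u_theta(torch.cat([x, tcol], dim=1))

def divergence(u, x):
    # exact divergence (trace of du/dx) via autograd; fine in low dimension
    div = torch.zeros(x.shape[0])
    for k in range(d):
        div = div + torch.autograd.grad(u[:, k].sum(), x, create_graph=True)[0][:, k]
    return div

def log_likelihood(x_T):
    # solve dx/dt = u_theta backward from t = T to t = 0, accumulating
    # s_i = \int_0^T div u_theta(t, x_i(t)) dt along the trajectory
    x = x_T.clone().requires_grad_(True)
    s = torch.zeros(x_T.shape[0])
    for k in reversed(range(n_steps)):
        t = (k + 1) * dt
        u = velocity(t, x)
        s = s + divergence(u, x) * dt
        x = x - u * dt                     # Euler step backward in time
    log_rho0 = -0.5 * (x ** 2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return log_rho0 - s                    # log rho_theta(T, x_i)

for it in range(200):
    batch = 1.0 + 0.3 * torch.randn(128, d)   # stand-in samples from rho_T
    loss = -log_likelihood(batch).mean()      # negative sample-based log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
```
In practice one would replace the fixed-step Euler integrator with an adaptive ODE solver and the synthetic batch with real data, but the structure of the loop is the same.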
In high dimensions, $\nabla \cdot u_{\theta}$ can be intractable to compute exactly. Therefore, one often uses [Hutchinson's trace estimator](https://docs.backpack.pt/en/master/use_cases/example_trace_estimation.html).
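Here is a sketch of that estimator (again assuming PyTorch): the divergence is the trace of the Jacobian $\partial u/\partial x$, and $\mathrm{tr}(J)=\mathbb{E}_\varepsilon[\varepsilon^\top J\,\varepsilon]$ for any probe $\varepsilon$ with zero mean and identity covariance, so each probe costs a single vector-Jacobian product instead of $d$ of them:
```python
import torch

def divergence_hutchinson(u, x, n_probes=1):
    # Unbiased estimate of div u = tr(du/dx) per batch element, using
    # Hutchinson's identity tr(J) = E_eps[eps^T J eps].
    est = torch.zeros(x.shape[0])
    for _ in range(n_probes):
        eps = torch.randn_like(x)                    # Gaussian probe; Rademacher also works
        vjp = torch.autograd.grad(u, x, grad_outputs=eps,
                                  create_graph=True)[0]   # eps^T (du/dx)
        est = est + (vjp * eps).sum(dim=1) / n_probes
    return est

# Quick check on a case with known divergence: u(x) = x has div u = d.
x = torch.randn(5, 10, requires_grad=True)
print(divergence_hutchinson(x * 1.0, x, n_probes=64))    # values around 10
```
This function can be dropped into the training sketch above in place of the exact `divergence`.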
# Controlled Normalizing Flow
Our goal now is to generalize this construction to control systems
$$ \dot{x} = f(x,u^\theta):=f_0(x) + \sum_{i=1}^m u^{\theta}_i(t,x) f_i(x).$$
This generalization changes the algorithm only slightly: the velocity field $u_\theta$ is replaced by the controlled drift $f(x,u^\theta)$, and everything else carries over.
$$
\boxed{
\begin{aligned}
&\textbf{Algorithm (Control CNF training)}\\[4pt]
&1.\ \text{Initialize parameters } \theta.\\
&2.\ \text{Repeat until convergence:}\\
&3.\ \text{For each data point } x_i \sim \rho_T,\ \text{solve backward}\\
&\qquad \dot{x}_i(t)=f\big(x_i(t),u^\theta(t,x_i(t))\big),\quad x_i(T)=x_i,\\
&\qquad s_i=\int_0^T \nabla\cdot f\big(x_i(t),u^\theta(t,x_i(t))\big)\,dt.\\
&4.\ \text{Compute the loss}\\
&\qquad \mathcal{L}(\theta)= -\frac{1}{N}\sum_{i=1}^N\Big(\log \rho_0(x_i(0)) - s_i\Big).\\
&5.\ \text{Update parameters}\\
&\qquad \theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta).
\end{aligned}
}
$$
The above algorithm assumes that $\rho_0$ is available in closed form, so that $\rho_0(x_i(0))$ can be evaluated.
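Concretely, the only code-level change relative to the earlier CNF sketch is the drift (again assuming PyTorch; `f0` and the constant control directions below are hypothetical placeholders, and $u_\theta$ now has $m$ outputs):
```python
import torch
import torch.nn as nn

# Control-affine drift f(x, u_theta) = f0(x) + sum_i u_theta_i(t, x) f_i(x).
d, m = 2, 2
u_theta = nn.Sequential(nn.Linear(d + 1, 64), nn.Tanh(), nn.Linear(64, m))

def f0(x):
    # placeholder drift vector field of the uncontrolled system
    return -x

def drift(t, x):
    tcol = torch.full((x.shape[0], 1), t)
    u = u_theta(torch.cat([x, tcol], dim=1))   # control values, shape (batch, m)
    fi = torch.eye(d)[:m]                      # placeholder control fields f_i = e_i
    return f0(x) + u @ fi                      # f0(x) + sum_i u_i f_i(x)
```
With `drift` in place of `velocity`, the backward integration, the divergence computation (exact or via Hutchinson's estimator), and the loss are unchanged.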