Before starting, let's refresh the definitions of the probability path and the velocity field as introduced in the Flow Matching paper by Lipman et al., as they will be useful to understand the derivations. These are given by:

$$p_t(x) = \int p_t(x|x_1)\,q(x_1)\,dx_1, \qquad u_t(x) = \int u_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1)}{p_t(x)}\,dx_1$$

where $q(x_1)$ is the data distribution, $p_t(x|x_1)$ is the conditional probability path, and $u_t(x|x_1)$ is the conditional velocity field that generates it.
In addition to that, we know that for a Gaussian probability path $p_t(x|x_1) = \mathcal{N}(x;\,\mu_t(x_1),\,\sigma_t(x_1)^2 I)$, the conditional velocity field can be derived as

$$u_t(x|x_1) = \frac{\dot{\sigma}_t(x_1)}{\sigma_t(x_1)}\left(x - \mu_t(x_1)\right) + \dot{\mu}_t(x_1).$$

In what follows we consider the common choice $\mu_t(x_1) = \alpha_t x_1$ and $\sigma_t(x_1) = \sigma_t$ (independent of $x_1$), so that $p_t(x|x_1) = \mathcal{N}(x;\,\alpha_t x_1,\,\sigma_t^2 I)$ and

$$u_t(x|x_1) = \frac{\dot{\sigma}_t}{\sigma_t}(x - \alpha_t x_1) + \dot{\alpha}_t x_1.$$
We can rewrite this in terms of the conditional score as follows (while I like the trick, I still have to figure out how one can think about this in the first place tbh, but that's life I suppose):

$$u_t(x|x_1) = \frac{\dot{\sigma}_t}{\sigma_t}(x - \alpha_t x_1) + \dot{\alpha}_t x_1 = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x|x_1)$$

where in the last derivation we used the fact that $\nabla_x\log p_t(x|x_1) = -\frac{x - \alpha_t x_1}{\sigma_t^2}$, since $p_t(x|x_1)$ is Gaussian, and therefore $x_1 = \frac{x + \sigma_t^2\,\nabla_x\log p_t(x|x_1)}{\alpha_t}$.
By substituting the definition we have just found of $u_t(x|x_1)$ into the definition of $u_t(x)$, we get:

$$u_t(x) = \int \left[\frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x|x_1)\right]\frac{p_t(x|x_1)\,q(x_1)}{p_t(x)}\,dx_1 = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\int \nabla_x\log p_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1)}{p_t(x)}\,dx_1.$$
To understand the second term, we have to analyze the score $\nabla_x\log p_t(x)$. Indeed, we can rewrite it by using the logarithm trick $\nabla_x p = p\,\nabla_x\log p$ as follows:

$$\nabla_x\log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} = \frac{1}{p_t(x)}\int \nabla_x p_t(x|x_1)\,q(x_1)\,dx_1 = \int \nabla_x\log p_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1)}{p_t(x)}\,dx_1.$$
Therefore we can see that the remaining integral in the equation above is exactly $\nabla_x\log p_t(x)$. By substituting it, we get the relation between the velocity field and the score function, which is given by:

$$u_t(x) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x).$$
NOTE: All the derivations above rely on the fact that we are able to arrive at a definition of $u_t(x|x_1)$ in terms of $\nabla_x\log p_t(x|x_1)$. This is possible only because we are considering a conditional probability path that is Gaussian, i.e. $p_t(x|x_1) = \mathcal{N}(x;\,\alpha_t x_1,\,\sigma_t^2 I)$, which is possible if and only if the base distribution is Gaussian. However, flow matching/stochastic interpolants can be used to transport any distribution $q_0$ to any distribution $q_1$. In this case, the probability path is still Gaussian, but it depends on both endpoints, i.e. $p_t(x|x_0, x_1)$, and if we marginalize $x_0$ out, we don't get a Gaussian distribution.
NOTE: In a lot of papers we find a definition using the derivatives of certain logarithms. Indeed, if we look closely at $\frac{\dot{\alpha}_t}{\alpha_t}$ and $\dot{\sigma}_t\sigma_t$, we can see that they remind us of the derivative of a logarithmic function. Indeed, if $\alpha_t > 0$ we can write $\frac{\dot{\alpha}_t}{\alpha_t} = \frac{d\log\alpha_t}{dt}$. In the same way, if we consider $\sigma_t^2$, we have $\dot{\sigma}_t\sigma_t = \frac{1}{2}\frac{d\sigma_t^2}{dt}$. So I feel that at the end of appendix B of the Guided Flows paper they are missing a square, kind of. The same quantity can also be written as $\dot{\sigma}_t\sigma_t = \sigma_t^2\,\frac{d\log\sigma_t}{dt}$.
Therefore, we can define the vector field in terms of the score in the following three equivalent ways:

$$u_t(x) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x) = \frac{d\log\alpha_t}{dt}\,x + \left(\frac{d\log\alpha_t}{dt}\sigma_t^2 - \frac{1}{2}\frac{d\sigma_t^2}{dt}\right)\nabla_x\log p_t(x) = \frac{d\log\alpha_t}{dt}\,x + \sigma_t^2\left(\frac{d\log\alpha_t}{dt} - \frac{d\log\sigma_t}{dt}\right)\nabla_x\log p_t(x).$$
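As a sanity check, here is a small numerical experiment (a sketch, not from any of the cited papers) on a 1D Gaussian toy model where everything is available in closed form: the data is $x_1 \sim \mathcal{N}(\mu_1, s_1^2)$ and the path is the usual $\alpha_t = t$, $\sigma_t = 1-t$, so both the marginal score and the true marginal velocity (the posterior expectation of $u_t(x|x_1) = \frac{x_1 - x}{1-t}$) can be computed exactly; all variable names are illustrative.

```python
import numpy as np

# Toy setup: scalar data x1 ~ N(mu1, s1^2), path p_t(x|x1) = N(x; t*x1, (1-t)^2),
# i.e. alpha_t = t and sigma_t = 1 - t.
mu1, s1 = 2.0, 0.5
t, x = 0.7, 1.3

alpha, d_alpha = t, 1.0          # alpha_t = t,     d/dt alpha_t = 1
sigma, d_sigma = 1.0 - t, -1.0   # sigma_t = 1 - t, d/dt sigma_t = -1

# The marginal p_t is Gaussian with mean t*mu1 and variance t^2 s1^2 + (1-t)^2,
# so its score is available in closed form.
v_t = alpha**2 * s1**2 + sigma**2
score = -(x - alpha * mu1) / v_t

# Ground-truth marginal velocity: posterior expectation of u_t(x|x1) = (x1 - x)/(1-t).
post_prec = 1.0 / s1**2 + alpha**2 / sigma**2
post_mean = (mu1 / s1**2 + alpha * x / sigma**2) / post_prec
u_true = (post_mean - x) / sigma

# Velocity obtained from the score via the relation derived above.
u_from_score = (d_alpha / alpha) * x + ((d_alpha / alpha) * sigma**2 - d_sigma * sigma) * score

print(u_true, u_from_score)  # both ≈ 2.0588
assert np.isclose(u_true, u_from_score)
```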
Conditional velocity field $u_t(x|y)$ and conditional score $\nabla_x\log p_t(x|y)$
To show the connection between $u_t(x|y)$ and $\nabla_x\log p_t(x|y)$, we have to follow almost the same steps as we did above.
In case of a certain conditioning observation $y$ (e.g. a class label), the probability path and the velocity field are defined as follows:

$$p_t(x|y) = \int p_t(x|x_1)\,q(x_1|y)\,dx_1, \qquad u_t(x|y) = \int u_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1|y)}{p_t(x|y)}\,dx_1.$$
As before, the trick is to rewrite the conditional velocity field by using the fact that we are dealing with a Gaussian probability path $p_t(x|x_1) = \mathcal{N}(x;\,\alpha_t x_1,\,\sigma_t^2 I)$. Therefore, as before, we have that

$$u_t(x|x_1) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x|x_1).$$
We can also consider the conditional score $\nabla_x\log p_t(x|y)$, which can be rewritten following similar steps as before:

$$\nabla_x\log p_t(x|y) = \frac{\nabla_x p_t(x|y)}{p_t(x|y)} = \int \nabla_x\log p_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1|y)}{p_t(x|y)}\,dx_1.$$
By substituting the definition of $u_t(x|x_1)$ into $u_t(x|y)$, we get:

$$u_t(x|y) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\int \nabla_x\log p_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1|y)}{p_t(x|y)}\,dx_1 = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x|y)$$

where in the last step we used the definition of $\nabla_x\log p_t(x|y)$ we found above.
NOTE: This derivation is exactly the one of Zheng et al., who showed that we can write the conditional velocity field as follows:

$$u_t(x|y) = a_t\,x + b_t\,\nabla_x\log p_t(x|y)$$

which, in the case of a Gaussian probability path defined as $p_t(x|x_1) = \mathcal{N}(x;\,\alpha_t x_1,\,\sigma_t^2 I)$, becomes

$$u_t(x|y) = \frac{d\log\alpha_t}{dt}\,x + \sigma_t^2\left(\frac{d\log\alpha_t}{dt} - \frac{d\log\sigma_t}{dt}\right)\nabla_x\log p_t(x|y).$$
Guiding an unconditional velocity field
Let's assume we have a pre-trained velocity field $u_t(x)$. How can we guide it to sample, for example, from a specific class? In the sections above we have shown that

$$u_t(x) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x).$$
We can also invert this relation and write the unconditional score in terms of the velocity field $u_t(x)$:

$$\nabla_x\log p_t(x) = \frac{\alpha_t\,u_t(x) - \dot{\alpha}_t\,x}{\dot{\alpha}_t\sigma_t^2 - \alpha_t\dot{\sigma}_t\sigma_t}.$$
In addition to that, if we want to sample an example conditioned on a specific observation $y$, we are interested in simulating the conditional vector field $u_t(x|y)$, which we don't have available. However, above we have seen that we can write it as

$$u_t(x|y) = \frac{\dot{\alpha}_t}{\alpha_t}\,x + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(x|y)$$

and, since by Bayes' rule $\nabla_x\log p_t(x|y) = \nabla_x\log p_t(x) + \nabla_x\log p_t(y|x)$, we obtain

$$u_t(x|y) = u_t(x) + \left(\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t\right)\nabla_x\log p_t(y|x).$$
Example: In the most common setting, where we assume that $q_0 = \mathcal{N}(0, I)$ and the probability path can be expressed as $p_t(x|x_1) = \mathcal{N}(x;\,t\,x_1,\,(1-t)^2 I)$, i.e. we have that $\alpha_t = t$ and $\sigma_t = 1-t$, then we can compute the relation between the unconditional velocity field and the conditional one as follows:

$$\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2 - \dot{\sigma}_t\sigma_t = \frac{(1-t)^2}{t} + (1-t) = \frac{1-t}{t}, \qquad u_t(x|y) = u_t(x) + \frac{1-t}{t}\,\nabla_x\log p_t(y|x)$$

where $\nabla_x\log p_t(y|x)$ can come from an additional classifier trained on the interpolants $x_t = t\,x_1 + (1-t)\,x_0$.
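To make the recipe concrete, here is a minimal sketch of what classifier guidance looks like for flow matching under the path above. `velocity_model` and `classifier_logp` are hypothetical stand-ins for a pretrained velocity field and a time-dependent classifier (not from any paper's code), and the `scale` knob is the usual guidance-strength heuristic rather than part of the derivation:

```python
import torch

def guided_velocity(velocity_model, classifier_logp, x, t, y, scale=1.0):
    """Classifier-guided velocity field for alpha_t = t, sigma_t = 1 - t."""
    x = x.detach().requires_grad_(True)
    # log p_t(y | x) from a classifier trained on interpolants x_t.
    log_p = classifier_logp(x, t, y)
    (grad_log_p,) = torch.autograd.grad(log_p.sum(), x)
    # u_t(x|y) = u_t(x) + (1-t)/t * grad_x log p_t(y|x)
    return (velocity_model(x, t) + scale * (1.0 - t) / t * grad_log_p).detach()
```

With `scale=1.0` this simulates exactly the relation above; larger values over-emphasize the condition, as in classifier guidance for diffusion models.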
Inverse problems
Score-based models are nice because there is a neat way to tackle inverse problems. In this case, indeed, they allow us to sample from $q(x_1|y)$ (note: in diffusion notation this will usually be $q(x_0|y)$) without the need to train a separate classifier. Indeed, we can write $p_t(y|x_t)$ as follows:

$$p_t(y|x_t) = \int p(y|x_1)\,p(x_1|x_t)\,dx_1$$

where $p(y|x_1)$ is the likelihood model, which we can compute easily, and $p(x_1|x_t)$ is the distribution which, given a noised $x_t$, gives us a distribution over the possible noiseless $x_1$. This distribution, despite not being available in closed form, can be approximated by moment matching. The mean can be obtained by using Tweedie's formula (first introduced by Robbins):

$$\hat{x}_1 := \mathbb{E}[x_1|x_t] = \frac{x_t + \sigma_t^2\,\nabla_{x_t}\log p_t(x_t)}{\alpha_t}.$$
Therefore, if we want to apply the same technique with flow matching, we have to compute the corresponding Tweedie's/Robbins' formula with the velocity field. To do so, one possibility is to use the equation that relates the score to the velocity field that we have introduced above. If we do so, we have a way to compute $\mathbb{E}[x_1|x_t]$ using a velocity field. We derive the equation step-by-step:

$$\mathbb{E}[x_1|x_t] = \frac{x_t + \sigma_t^2\,\nabla_{x_t}\log p_t(x_t)}{\alpha_t} = \frac{1}{\alpha_t}\left(x_t + \sigma_t^2\,\frac{\alpha_t\,u_t(x_t) - \dot{\alpha}_t\,x_t}{\dot{\alpha}_t\sigma_t^2 - \alpha_t\dot{\sigma}_t\sigma_t}\right) = \frac{1}{\alpha_t}\,\frac{x_t\left(\dot{\alpha}_t\sigma_t^2 - \alpha_t\dot{\sigma}_t\sigma_t - \dot{\alpha}_t\sigma_t^2\right) + \sigma_t^2\alpha_t\,u_t(x_t)}{\dot{\alpha}_t\sigma_t^2 - \alpha_t\dot{\sigma}_t\sigma_t} = \frac{\sigma_t\,u_t(x_t) - \dot{\sigma}_t\,x_t}{\dot{\alpha}_t\sigma_t - \alpha_t\dot{\sigma}_t}.$$
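Since every quantity in the Gaussian toy example from before is available in closed form, we can verify this velocity-field version of Tweedie's formula numerically (a sketch with the same illustrative setup as above):

```python
import numpy as np

# Same Gaussian toy as before: x1 ~ N(mu1, s1^2), alpha_t = t, sigma_t = 1 - t.
mu1, s1 = 2.0, 0.5
t, x = 0.7, 1.3
alpha, d_alpha, sigma, d_sigma = t, 1.0, 1.0 - t, -1.0

# Closed-form marginal score, and the velocity field obtained from it.
v_t = alpha**2 * s1**2 + sigma**2
score = -(x - alpha * mu1) / v_t
u = (d_alpha / alpha) * x + ((d_alpha / alpha) * sigma**2 - d_sigma * sigma) * score

# Velocity-field version of Tweedie's formula derived above ...
x1_hat = (sigma * u - d_sigma * x) / (d_alpha * sigma - alpha * d_sigma)

# ... checked against the exact posterior mean E[x1 | x_t = x].
post_prec = 1.0 / s1**2 + alpha**2 / sigma**2
post_mean = (mu1 / s1**2 + alpha * x / sigma**2) / post_prec

print(x1_hat, post_mean)  # both ≈ 1.9176
assert np.isclose(x1_hat, post_mean)
```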
Now we have a way to approximate the mean of $p(x_1|x_t)$. The covariance can also be estimated via Tweedie's formula, although this requires the computation of the Hessian:

$$\mathrm{Cov}[x_1|x_t] = \frac{\sigma_t^2}{\alpha_t^2}\left(I + \sigma_t^2\,\nabla^2_{x_t}\log p_t(x_t)\right).$$
Therefore we can now approximate $p(x_1|x_t)$ as a Gaussian distribution with mean $\hat{x}_1$ and covariance $\hat{\Sigma}_t$. In addition to that, if we assume that the observation likelihood is Gaussian, i.e. $p(y|x_1) = \mathcal{N}(y;\,A x_1,\,\sigma_y^2 I)$, where $A$ is the observation model and $\sigma_y^2$ the likelihood variance, then we can get a closed-form approximation for $p_t(y|x_t)$, given by $p_t(y|x_t) \approx \mathcal{N}(y;\,A\hat{x}_1,\,A\hat{\Sigma}_t A^\top + \sigma_y^2 I)$. The computation of the Hessian can be problematic; in addition to that, in our case we would have to find a relationship between the velocity field and the Hessian. For this reason, different papers propose to approximate the covariance simply as $\hat{\Sigma}_t = c_t^2 I$, where $c_t$ is a monotonically increasing function of the noise level. Therefore the final approximation is given by

$$p_t(y|x_t) \approx \mathcal{N}\left(y;\,A\hat{x}_1,\,c_t^2\,A A^\top + \sigma_y^2 I\right).$$
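Putting the pieces together, a minimal sketch of the resulting guidance step for a linear inverse problem might look as follows (assuming $\alpha_t = t$, $\sigma_t = 1-t$; `velocity_model`, `c_t`, and `scale` are illustrative stand-ins, and we further approximate $A A^\top \approx I$ so the likelihood covariance stays scalar):

```python
import torch

def inverse_problem_velocity(velocity_model, x, t, y, A, sigma_y, c_t, scale=1.0):
    """Training-free guidance sketch for y = A x1 + noise (alpha_t = t, sigma_t = 1-t)."""
    x = x.detach().requires_grad_(True)
    u = velocity_model(x, t)
    # Tweedie-style posterior mean derived above: x1_hat = x + (1-t) u_t(x).
    x1_hat = x + (1.0 - t) * u
    # Gaussian pseudo-likelihood log N(y; A x1_hat, (c_t^2 + sigma_y^2) I),
    # using the simplification A A^T ≈ I.
    resid = y - x1_hat @ A.T
    log_lik = -0.5 * (resid**2).sum() / (c_t**2 + sigma_y**2)
    (grad_log_lik,) = torch.autograd.grad(log_lik, x)
    # Guided field: u_t(x|y) ≈ u_t(x) + (1-t)/t * grad_x log p_t(y|x).
    return (u + scale * (1.0 - t) / t * grad_log_lik).detach()
```

Note that the gradient flows through `x1_hat`, and hence through the velocity model itself, which is what makes this approach training-free but relatively expensive per step.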
Alternative derivation: In the Pokle et al. paper (the training-free linear image inversion paper), we find a different approach to derive $\hat{x}_1$ via Tweedie's formula. They, indeed, decided to use the relationship between the velocity field and the score defined through the derivatives of logarithms that we have presented above, i.e.

$$u_t(x) = \frac{d\log\alpha_t}{dt}\,x + \sigma_t^2\left(\frac{d\log\alpha_t}{dt} - \frac{d\log\sigma_t}{dt}\right)\nabla_x\log p_t(x).$$

If we now rewrite the relationship from the score perspective, we get

$$\nabla_x\log p_t(x) = \frac{u_t(x) - \frac{d\log\alpha_t}{dt}\,x}{\sigma_t^2\left(\frac{d\log\alpha_t}{dt} - \frac{d\log\sigma_t}{dt}\right)}.$$

Then we can insert this into Tweedie's formula, and from there it's just applying calculus, as we show step-by-step:

$$\hat{x}_1 = \frac{1}{\alpha_t}\left(x_t + \frac{u_t(x_t) - \frac{\dot{\alpha}_t}{\alpha_t}\,x_t}{\frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\sigma}_t}{\sigma_t}}\right) = \frac{1}{\alpha_t}\,\frac{u_t(x_t) - \frac{\dot{\sigma}_t}{\sigma_t}\,x_t}{\frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\sigma}_t}{\sigma_t}} = \frac{\sigma_t\,u_t(x_t) - \dot{\sigma}_t\,x_t}{\dot{\alpha}_t\sigma_t - \alpha_t\dot{\sigma}_t}.$$
Example: We now go back to the running example where $\alpha_t = t$ and $\sigma_t = 1-t$ (and $\dot{\alpha}_t = 1$ and $\dot{\sigma}_t = -1$, respectively). If we substitute them above, we get

$$\hat{x}_1 = \frac{(1-t)\,u_t(x_t) + x_t}{(1-t) + t} = x_t + (1-t)\,u_t(x_t).$$
Probability flow ODE and recovering the SDE using the velocity field
Before jumping into the relationship between the velocity field and the SDE, we have to introduce some background concepts. We will try to do it in the simplest possible way. Let's start by considering the general form of an SDE (here for simplicity we assume $x \in \mathbb{R}$, i.e. $x$ is a scalar):

$$dx = f(x,t)\,dt + g(t)\,dw$$

where $f(x,t)$ is the drift term and $g(t)$ (which sometimes can also depend on $x$, i.e. $g(x,t)$) is the diffusion term. Let us also assume that the initial condition is sampled from a certain distribution, $x_0 \sim p_0$. Then the Fokker-Planck equation tells us how the probability distribution $p_t(x)$ evolves over time:

$$\frac{\partial p_t(x)}{\partial t} = -\frac{\partial}{\partial x}\left[f(x,t)\,p_t(x)\right] + \frac{1}{2}g(t)^2\,\frac{\partial^2}{\partial x^2}p_t(x).$$
Let us now assume that we have an ODE, instead, with a specific form of velocity field, which is given by

$$\frac{dx}{dt} = \tilde{f}(x,t) := f(x,t) - \frac{1}{2}g(t)^2\,\frac{\partial}{\partial x}\log p_t(x)$$

where the initial observations are sampled from the same $p_0$. The evolution over time of $p_t(x)$, the probability distribution associated with these dynamics, can be described by the continuity equation, which is given by

$$\frac{\partial p_t(x)}{\partial t} = -\frac{\partial}{\partial x}\left[\tilde{f}(x,t)\,p_t(x)\right].$$

By comparing this with the Fokker-Planck equation above, we can notice that the SDE and the ODE have the same dynamics if $\tilde{f}(x,t) = f(x,t) - \frac{1}{2}g(t)^2\,\frac{\partial}{\partial x}\log p_t(x)$, which is exactly the form we chose. The ODE is known as the probability flow ODE, and it is usually written as:

$$dx = \left[f(x,t) - \frac{1}{2}g(t)^2\,\nabla_x\log p_t(x)\right]dt.$$
Everybody at this point cites Kingma et al. 2021 (the Variational Diffusion Models paper), but I am not able to find where they actually define the following: for a diffusion process with perturbation kernel $\mathcal{N}(x_t;\,\alpha_t x_0,\,\sigma_t^2 I)$, the drift and diffusion are

$$f(x,t) = \frac{d\log\alpha_t}{dt}\,x, \qquad g(t)^2 = \frac{d\sigma_t^2}{dt} - 2\,\frac{d\log\alpha_t}{dt}\,\sigma_t^2.$$

Therefore, the probability flow ODE can also be written as

$$dx = \left[\frac{d\log\alpha_t}{dt}\,x - \frac{1}{2}\left(\frac{d\sigma_t^2}{dt} - 2\,\frac{d\log\alpha_t}{dt}\,\sigma_t^2\right)\nabla_x\log p_t(x)\right]dt = u_t(x)\,dt$$

where in the last equation we used the relation between $u_t(x)$ and $\nabla_x\log p_t(x)$ we derived at the beginning.
Alternative and possibly simpler derivation: We can derive the same thing by following a different and maybe easier path. The derivation we have presented above was given in the following presentation/lecture. However, we can reach the same conclusion by just rewriting the Fokker-Planck equation, as we are going to see, and then using the definition of divergence and its connection to the continuity equation. We mentioned several concepts in just two lines, so let us briefly describe them before getting into the step-by-step derivation.

We have seen above that the Fokker-Planck equation tells us the evolution of the probability over time when it follows the dynamics described by the SDE. We can also see that it consists of two terms: the first one, roughly, describes the change of probability given by the drift, which is a deterministic function; the second term, instead, is related to the diffusion term of the SDE and describes the change of probability given by the diffusion. Roughly speaking, an ODE is just a deterministic SDE, i.e. an SDE without the diffusion term. Therefore, the evolution of the probability following an ODE is given by the first term only, i.e. $\frac{\partial p_t(x)}{\partial t} = -\frac{\partial}{\partial x}\left[\tilde{f}(x,t)\,p_t(x)\right]$, which is usually referred to as the continuity equation. Here we are reporting it for the scalar case, but in the high-dimensional case this is written as $\frac{\partial p_t(x)}{\partial t} = -\nabla\cdot\left[\tilde{f}(x,t)\,p_t(x)\right]$, where $\nabla\cdot$ is the divergence operator.

Now we have all the ingredients to show the alternative derivation. Let's start from the Fokker-Planck equation and apply the logarithm trick $\frac{\partial p_t(x)}{\partial x} = p_t(x)\,\frac{\partial\log p_t(x)}{\partial x}$:

$$\frac{\partial p_t(x)}{\partial t} = -\frac{\partial}{\partial x}\left[f(x,t)\,p_t(x)\right] + \frac{1}{2}g(t)^2\,\frac{\partial^2 p_t(x)}{\partial x^2} = -\frac{\partial}{\partial x}\left[\left(f(x,t) - \frac{1}{2}g(t)^2\,\frac{\partial\log p_t(x)}{\partial x}\right)p_t(x)\right]$$

which can be seen as the continuity equation of the following ODE:

$$\frac{dx}{dt} = f(x,t) - \frac{1}{2}g(t)^2\,\frac{\partial\log p_t(x)}{\partial x}.$$

Therefore, we have shown how we can express an SDE via an ODE, provided we know the score $\nabla_x\log p_t(x)$. Then, following the same reasoning we did above, we can link this to the vector field $u_t(x)$. If you are interested in learning more about the Fokker-Planck equation, we suggest this nice blogpost.
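To see the marginal-matching claim in action, here is a small simulation (purely illustrative; the Ornstein-Uhlenbeck process is chosen because its marginals and score are known in closed form) comparing Euler-Maruyama steps of the SDE with Euler steps of its probability flow ODE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ornstein-Uhlenbeck SDE: dx = -x dt + sqrt(2) dw, with x_0 ~ N(0, v0).
# Its marginals are Gaussian with variance v(t) = 1 + (v0 - 1) exp(-2t),
# so the score is d/dx log p_t(x) = -x / v(t).
v0, T, n_steps, n_particles = 4.0, 1.0, 1000, 100_000
dt = T / n_steps

def var(t):
    return 1.0 + (v0 - 1.0) * np.exp(-2.0 * t)

x_sde = rng.normal(0.0, np.sqrt(v0), n_particles)
x_ode = x_sde.copy()

for i in range(n_steps):
    t = i * dt
    # Euler-Maruyama step of the SDE: drift f = -x, diffusion g = sqrt(2).
    x_sde += -x_sde * dt + np.sqrt(2.0 * dt) * rng.normal(size=n_particles)
    # Euler step of the probability flow ODE: dx/dt = f - (1/2) g^2 * score = -x + x/v(t).
    x_ode += (-x_ode + x_ode / var(t)) * dt

# All three variances agree up to Monte Carlo / discretization error (≈ 1.406).
print(x_sde.var(), x_ode.var(), var(T))
```

The individual trajectories are completely different (the ODE ones are deterministic given their start), but the marginal distribution at every time is the same, which is all the probability flow ODE promises.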
A different path to draw a connection between flow matching and score-based modelling
I am not completely sure this is 100% correct, so do not draw any conclusion from this
A different path to get to the relationship between the vector field in flow matching and the score learned by score-based models can be derived by following the approach of Karras et al. In the paper, they propose a different view of the probability flow ODE associated with a particular noising SDE and the corresponding backward/denoising SDE. Before showing the relation, we take a detour and start from the usual SDE formulation of the noising process as introduced by Song et al.:

$$dx = f(x,t)\,dt + g(t)\,dw$$

where $f(x,t)$ is the drift, $g(t)$ the diffusion, and $w$ is a Wiener process. The drift and the diffusion usually depend on the type of noising SDE we are considering, e.g. whether it is variance exploding or variance preserving, but usually in diffusion models we choose a drift term that is affine, meaning that it can be written as $f(x,t) = f(t)\,x$. Therefore, the usual SDE we consider is given by

$$dx = f(t)\,x\,dt + g(t)\,dw.$$

Since the drift is affine, the perturbation kernel or noising kernel associated with this SDE has a closed form given by

$$p(x_t|x_0) = \mathcal{N}\left(x_t;\; s(t)\,x_0,\; s(t)^2\sigma(t)^2 I\right)$$

where $s(t)$ and $\sigma(t)$ depend on the choice of the drift and the diffusion and are given by

$$s(t) = \exp\left(\int_0^t f(\xi)\,d\xi\right), \qquad \sigma(t) = \sqrt{\int_0^t \frac{g(\xi)^2}{s(\xi)^2}\,d\xi}.$$
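As a sanity check of these two formulas, we can plug in the variance-preserving SDE with a constant schedule, $f(t) = -\frac{1}{2}\beta$ and $g(t) = \sqrt{\beta}$ (the constant $\beta$ is a simplifying assumption of this sketch), and recover the familiar VP perturbation kernel symbolically:

```python
import sympy as sp

t, xi = sp.symbols("t xi", positive=True)
beta = sp.Symbol("beta", positive=True)  # constant noise schedule for simplicity

# VP-SDE: f(t) = -beta/2, g(t) = sqrt(beta)
f = -beta / 2
g = sp.sqrt(beta)

s = sp.exp(sp.integrate(f.subs(t, xi), (xi, 0, t)))                     # s(t) = exp(∫ f)
sigma = sp.sqrt(sp.integrate(g**2 / s.subs(t, xi) ** 2, (xi, 0, t)))    # σ(t) = sqrt(∫ g²/s²)

print(sp.simplify(s))                # exp(-beta*t/2)
print(sp.simplify(sigma**2))         # exp(beta*t) - 1
print(sp.simplify(s**2 * sigma**2))  # 1 - exp(-beta*t): the familiar VP kernel variance
```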
NOTE: Now we are assuming that the perturbation kernel has the form $p(x_t|x_0) = \mathcal{N}(x_t;\,s(t)\,x_0,\,s(t)^2\sigma(t)^2 I)$, while before we were considering $p_t(x|x_1) = \mathcal{N}(x;\,\alpha_t x_1,\,\sigma_t^2 I)$. Therefore $\sigma(t)$ here is different from the $\sigma_t$ above; in addition to that, while the data distribution is the one of $x_0$ for diffusion models, in flow matching it is the one of $x_1$.
The marginal distribution $p_t(x)$, which is the one that the probability flow ODE wants to match for every time $t$, is given by

$$p_t(x) = \int p(x_t = x \,|\, x_0)\,p_{\text{data}}(x_0)\,dx_0$$

and the probability flow ODE that actually obeys this is given by

$$dx = \left[f(t)\,x - \frac{1}{2}g(t)^2\,\nabla_x\log p_t(x)\right]dt.$$
However, this probability flow ODE is defined in terms of the drift $f(t)$ and the diffusion $g(t)$. It would be more helpful if the properties of the marginal distribution that the ODE is trying to obey could be expressed in terms of the $s(t)$ and $\sigma(t)$ that appear in the noising kernel. In the paper, they show that it is possible to define it in terms of $s(t)$ and $\sigma(t)$, and this is given by:

$$dx = \left[\frac{\dot{s}(t)}{s(t)}\,x - s(t)^2\,\dot{\sigma}(t)\,\sigma(t)\,\nabla_x\log p\!\left(\frac{x}{s(t)};\,\sigma(t)\right)\right]dt.$$
Derivation of the probability flow ODE in terms of $s(t)$ and $\sigma(t)$: For completeness we will derive the new formulation of the probability flow ODE step by step, but the same derivations can be found in the appendix of the Karras et al. paper. Let us start by writing the marginal distribution in a different way:

$$p_t(x) = \int p_{\text{data}}(x_0)\,\mathcal{N}\left(x;\,s(t)\,x_0,\,s(t)^2\sigma(t)^2 I\right)dx_0 = s(t)^{-d}\int p_{\text{data}}(x_0)\,\mathcal{N}\left(\frac{x}{s(t)};\,x_0,\,\sigma(t)^2 I\right)dx_0.$$

By looking at the definition of $p_t(x)$ above, we can define the following compact term (the data distribution smoothed with Gaussian noise of standard deviation $\sigma$):

$$p(x;\sigma) = \int p_{\text{data}}(x_0)\,\mathcal{N}(x;\,x_0,\,\sigma^2 I)\,dx_0 \quad\Longrightarrow\quad p_t(x) = s(t)^{-d}\,p\!\left(\frac{x}{s(t)};\,\sigma(t)\right).$$

We can therefore substitute the new definition of the marginal distribution into the probability flow ODE and get the following (note that $\nabla_x\log p_t(x) = \nabla_x\log p\!\left(\frac{x}{s(t)};\sigma(t)\right)$, since the factor $s(t)^{-d}$ does not depend on $x$):

$$dx = \left[f(t)\,x - \frac{1}{2}g(t)^2\,\nabla_x\log p\!\left(\frac{x}{s(t)};\,\sigma(t)\right)\right]dt.$$

We can now compute the drift and the diffusion coefficients in terms of $s(t)$ and $\sigma(t)$. Let's start with $f(t)$:

$$s(t) = \exp\left(\int_0^t f(\xi)\,d\xi\right) \;\Longrightarrow\; \log s(t) = \int_0^t f(\xi)\,d\xi \;\Longrightarrow\; f(t) = \frac{\dot{s}(t)}{s(t)}$$

while for $g(t)$ we get:

$$\sigma(t)^2 = \int_0^t \frac{g(\xi)^2}{s(\xi)^2}\,d\xi \;\Longrightarrow\; 2\,\dot{\sigma}(t)\,\sigma(t) = \frac{g(t)^2}{s(t)^2} \;\Longrightarrow\; g(t) = s(t)\sqrt{2\,\dot{\sigma}(t)\,\sigma(t)}.$$

So by substituting those into the probability flow ODE, we get a different version of it in terms of $s(t)$ and $\sigma(t)$:

$$dx = \left[\frac{\dot{s}(t)}{s(t)}\,x - s(t)^2\,\dot{\sigma}(t)\,\sigma(t)\,\nabla_x\log p\!\left(\frac{x}{s(t)};\,\sigma(t)\right)\right]dt.$$
Therefore, given a perturbation kernel, we now have a way to describe the ODE that has the same marginals as the underlying SDE linked to that specific perturbation kernel. If we think about flow matching now, we are approximating an ODE, and to do so we are using a probability path. Thus, if our base distribution is a standard Gaussian, our velocity field is approximating exactly the probability flow ODE we have presented above, meaning that

$$u_t(x) = \frac{\dot{s}(t)}{s(t)}\,x - s(t)^2\,\dot{\sigma}(t)\,\sigma(t)\,\nabla_x\log p\!\left(\frac{x}{s(t)};\,\sigma(t)\right).$$
We now check what we get by considering the usual flow matching choice of probability path $p_t(x|x_1) = \mathcal{N}(x;\,t\,x_1,\,(1-t)^2 I)$. While above we considered $\alpha_t = t$ and $\sigma_t = 1-t$, now, to follow the Karras et al. framework, we have that $s(t) = t$, while $\sigma(t) = \frac{1-t}{t}$ (so that $s(t)\,\sigma(t) = 1-t$). From those we can simply compute also the derivatives $\dot{s}(t) = 1$ and $\dot{\sigma}(t) = -\frac{1}{t^2}$. By substituting these into the relation between the velocity field and the score, we get:

$$u_t(x) = \frac{x}{t} - t^2\left(-\frac{1}{t^2}\right)\frac{1-t}{t}\,\nabla_x\log p\!\left(\frac{x}{t};\,\sigma(t)\right) = \frac{x}{t} + \frac{1-t}{t}\,\nabla_x\log p\!\left(\frac{x}{t};\,\sigma(t)\right)$$

which is the same relation we found above, following a different approach.
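A quick numerical check (illustrative only) that the two parametrizations assign the same coefficient to the score term:

```python
import numpy as np

# Karras parametrization of the FM path: s(t) = t, sigma(t) = (1-t)/t.
t = np.linspace(0.1, 0.9, 9)
s, s_dot = t, np.ones_like(t)
sigma, sigma_dot = (1.0 - t) / t, -1.0 / t**2

# Karras probability flow ODE: dx = [s'/s * x - s^2 sigma' sigma * score] dt,
# so the coefficient in front of the score is -s^2 sigma' sigma.
coeff_karras = -(s**2) * sigma_dot * sigma

# Flow matching relation from earlier: u_t(x) = x/t + (1-t)/t * score.
coeff_fm = (1.0 - t) / t

assert np.allclose(coeff_karras, coeff_fm)
print(coeff_karras)  # equals (1-t)/t for every t
```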
References
Lipman, Yaron, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. "Flow matching for generative modeling." ICLR 2023
Zheng, Qinqing, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, and Ricky TQ Chen. "Guided flows for generative modeling and decision making." arXiv preprint arXiv:2311.13443 (2023).
Kingma, Diederik, Tim Salimans, Ben Poole, and Jonathan Ho. "Variational diffusion models." Advances in Neural Information Processing Systems 34 (2021).
Pokle, Ashwini, Matthew J. Muckley, Ricky TQ Chen, and Brian Karrer. "Training-free linear image inversion via flows." arXiv preprint arXiv:2310.04432 (2023).
Efron, Bradley. "Tweedie's formula and selection bias." Journal of the American Statistical Association 106 (2011).
Robbins, Herbert E. "An empirical Bayes approach to statistics." In Breakthroughs in Statistics: Foundations and Basic Theory, pp. 388-394. Springer, 1992.
Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.