Idea: When the probability density function is differentiable, we can compute the gradient of its log density and work with that instead of the density itself.
In contrast with likelihood models, such as variational autoencoders and energy-based models, where we directly work with the probability distribution $p(x)$, the score-based representation uses the score function $\nabla_x \log p(x)$: the gradient of the log likelihood with respect to the inputs, not with respect to the parameters of the model.
The score function is a vector-valued function; the direction of the vector field at each point tells us the direction to follow to increase the log likelihood.
Challenges of representation via the density function
We need a way of parameterizing density functions that is flexible.
Even when the shapes of these densities become very complicated (due to changes in parameters), we must ensure that the total area under the curve remains fixed (= 1). This can be tricky and might require very specific architectures.
Advantages of representation via score function
Doesn't need to satisfy any kind of normalization constraint.
Potentially much simpler to work with.
Score estimation by training score-based models
Given: i.i.d. samples $\{x_1, x_2, \ldots, x_N\} \sim p_{\text{data}}(x)$
Task: estimate the score $\nabla_x \log p_{\text{data}}(x)$
Score model: a learnable vector-valued function $s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)$
Goal / objective: minimize the average Euclidean distance between the true score and the model over the whole space, i.e. the Fisher divergence:
$$\frac{1}{2}\,\mathbb{E}_{p_{\text{data}}(x)}\left[\big\|\nabla_x \log p_{\text{data}}(x) - s_\theta(x)\big\|_2^2\right]$$
Score matching: the Fisher divergence involves the unknown data score, but integration by parts (Hyvärinen, 2005) turns it into the equivalent objective
$$\mathbb{E}_{p_{\text{data}}(x)}\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \frac{1}{2}\big\|s_\theta(x)\big\|_2^2\right] + \text{const.}$$
The simplest way of parameterizing the score is via a deep neural network that takes the i.i.d. samples as input and directly outputs the scores. Score models improve computational cost compared to likelihood-based models such as energy-based models, which require two backpropagation passes (one for the scores and one for the parameter gradients). However, the Jacobian trace $\operatorname{tr}(\nabla_x s_\theta(x))$ in the score matching objective requires on the order of $d$ backpropagations, where $d$ is the number of dimensions. This is better than for likelihood models, but is still dependent on the number of dimensions. Hence naive score matching is not scalable.
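To make this cost concrete, here is a minimal PyTorch sketch (not from the original text) of the exact score matching objective; `score_net` is a hypothetical stand-in for any vector-valued network, and the loop over dimensions is exactly the $O(d)$ Jacobian-trace computation discussed above.

```python
import torch

def exact_score_matching_loss(score_net, x):
    """Hyvarinen score matching: E[ tr(d s/dx) + 0.5 * ||s(x)||^2 ].

    The Jacobian trace below needs one autograd pass per input dimension,
    which is the O(d) cost discussed above.
    """
    x = x.requires_grad_(True)
    s = score_net(x)                          # (batch, d)
    loss = 0.5 * (s ** 2).sum(dim=1)
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):               # O(d) backward passes
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        trace = trace + grad_i
    return (loss + trace).mean()

if __name__ == "__main__":
    d = 8
    score_net = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.Softplus(),
                                    torch.nn.Linear(64, d))
    x = torch.randn(16, d)
    print(exact_score_matching_loss(score_net, x))
```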
In order to make the Jacobian computation efficient and make the model scale to high-dimensional settings, Vincent (2011) proposed a different way of estimating the score, called denoising score matching.
Here, instead of trying to estimate the score of the actual data, we try to estimate (match) the score of a noise perturbed data distribution.
Consider the perturbed distribution
$$q_\sigma(\tilde{x}) = \int p_{\text{data}}(x)\, \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\, dx.$$
So we sample a point $x$ from the data density $p_{\text{data}}(x)$, then sample Gaussian noise from $\mathcal{N}(0, \sigma^2 I)$ and add it to $x$ (equivalently, convolve the data density with a Gaussian kernel) to get the perturbed data distribution $q_\sigma(\tilde{x})$.
Score estimation for $q_\sigma(\tilde{x})$ is easier (shown below). If the noise level $\sigma$ is small, the perturbed distribution is a good approximation of the original one: $q_\sigma(\tilde{x}) \approx p_{\text{data}}(\tilde{x})$.
The Fisher divergence when estimating $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ is
$$\frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x})}\left[\big\|\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x})\big\|_2^2\right].$$
Expanding the square and integrating by parts (Vincent, 2011) removes the intractable term $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ and leaves, up to a constant independent of $\theta$, the denoising score matching objective
$$\frac{1}{2}\,\mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\left[\big\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\big\|_2^2\right].$$
Summary of denoising score matching
Core ideas:
Estimate the score of a noise-perturbed distribution $q_\sigma(\tilde{x})$ instead of the clean data distribution
The conditional score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ is easy to compute in closed form (Gaussian)
Advantages of this method: efficient to optimize even for very high dimensional data, and useful for optimal denoising.
Disadvantage: Cannot estimate the score of clean (noise-free) data.
Process for denoising score matching:
Sample a minibatch of datapoints $\{x_1, x_2, \ldots, x_n\} \sim p_{\text{data}}(x)$
Sample a minibatch of perturbed datapoints $\tilde{x}_i \sim q_\sigma(\tilde{x} \mid x_i)$, $i = 1, \ldots, n$
Estimate the denoising score matching loss with empirical means:
$$\frac{1}{2n} \sum_{i=1}^{n} \big\|s_\theta(\tilde{x}_i) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}_i \mid x_i)\big\|_2^2$$
If the perturbation is Gaussian, $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$, the loss becomes:
$$\frac{1}{2n} \sum_{i=1}^{n} \left\|s_\theta(\tilde{x}_i) + \frac{\tilde{x}_i - x_i}{\sigma^2}\right\|_2^2$$
Apply stochastic gradient descent (need to choose a very small $\sigma$).
Hence, writing $\tilde{x}_i = x_i + \sigma z_i$ with $z_i \sim \mathcal{N}(0, I)$, the objective can be read in two equivalent ways.
Score matching: $s_\theta(\tilde{x}_i)$ tries to match the conditional score $-\frac{\tilde{x}_i - x_i}{\sigma^2}$.
Denoising: $\sigma^2 s_\theta(\tilde{x}_i)$ tries to estimate (the negative of) the noise $\tilde{x}_i - x_i$ that was added to produce $\tilde{x}_i$, which can then be removed from the noisy input.
Hence we have reduced the problem of image generation to the problem of denoising, which is an easier problem to work with.
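A minimal PyTorch sketch of the Gaussian denoising score matching loss described above; the network `score_net` and the single noise level `sigma` are assumptions for illustration, not part of the original text.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma):
    """Gaussian denoising score matching for a single noise level sigma.

    Perturb x with N(0, sigma^2 I); the score of q_sigma(x_tilde | x) is
    -(x_tilde - x) / sigma^2, so the model is trained to predict -z / sigma.
    """
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    target = -(x_tilde - x) / sigma ** 2       # equals -z / sigma
    diff = score_net(x_tilde) - target
    return 0.5 * (diff ** 2).flatten(1).sum(dim=1).mean()
```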
Tweedie's formula
This states that the optimal denoising strategy is to follow the gradient of the log of the perturbed density (the score).
Given $\tilde{x} = x + \sigma z$ with $z \sim \mathcal{N}(0, I)$, i.e. $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$, the posterior can be defined with Bayes' rule as
$$q_\sigma(x \mid \tilde{x}) = \frac{q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)}{q_\sigma(\tilde{x})}.$$
We know that $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$.
Hence Tweedie's formula gives us
$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
Tweedie's formula thus gives us a way to recover (an estimate of) the clean data from the noise-perturbed data.
Langevin dynamics can produce samples from a probability density $p(x)$ using only the score function $\nabla_x \log p(x)$. Given a fixed step size $\epsilon > 0$ and an initial value $\tilde{x}_0 \sim \pi(x)$, with $\pi$ being a prior distribution, the Langevin method recursively computes the following:
$$\tilde{x}_t = \tilde{x}_{t-1} + \frac{\epsilon}{2} \nabla_x \log p(\tilde{x}_{t-1}) + \sqrt{\epsilon}\, z_t,$$
where $z_t \sim \mathcal{N}(0, I)$. Welling and Teh prove in their paper that as $\epsilon \to 0$ and $t \to \infty$, $\tilde{x}_t$ becomes an exact sample from $p(x)$ (under some regularity conditions).
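A minimal sketch of this Langevin update, assuming a callable `score_fn(x)` that returns (an estimate of) $\nabla_x \log p(x)$:

```python
import torch

@torch.no_grad()
def langevin_sampling(score_fn, x0, step_size=1e-4, n_steps=1000):
    """Unadjusted Langevin dynamics:
    x_t = x_{t-1} + (eps / 2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + step_size ** 0.5 * z
    return x
```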
Challenges of score-based sampling
The manifold hypothesis: The manifold hypothesis states that real-world data is usually concentrated on a lower-dimensional manifold embedded in a higher-dimensional (ambient) space. Since the score function is a gradient taken in the ambient space, it is undefined when $x$ is confined to a low-dimensional manifold. Further, the score estimator gives a consistent approximation of the score only when the support of the data distribution is the whole ambient space (Corollary 2, Hyvärinen).
Low data density regions: In low data density regions, score matching may not have enough evidence to estimate the score function accurately, due to the lack of data samples. In practice, the expectation over $p_{\text{data}}$ is always estimated using i.i.d. samples $\{x_i\}_{i=1}^N \sim p_{\text{data}}(x)$. Consider any region $\mathcal{R}$ such that $p_{\text{data}}(\mathcal{R}) \approx 0$. Then in most cases $\{x_i\}_{i=1}^N \cap \mathcal{R} = \emptyset$, and score matching will not have sufficient samples to estimate the true score for $x \in \mathcal{R}$. Further, if the data distribution is a weighted mixture of two distributions with (nearly) disjoint supports, score matching does not learn the mixture weights, and the sampling would therefore lead to a different distribution than the one desired.
Song et al. expanded the idea of denoising score matching to score-based sampling. The key idea is that perturbing the data with random Gaussian noise makes the support of the perturbed data distribution the whole ambient space, so the perturbed data is no longer confined to a lower-dimensional manifold. Further, large Gaussian noise fills the low-density regions of the original unperturbed distribution, so training with the sample approximation behaves better in low-density regions, resulting in improved score estimation. The authors build on two key observations:
High noise provides useful directional information for Langevin dynamics
However, perturbed density no longer approximates the true data density
The authors define a noise conditional score network (NCSN) $s_\theta(x, \sigma)$ to predict the scores for all noise levels $\{\sigma_i\}_{i=1}^{L}$ simultaneously. The noise levels form a positive geometric sequence that satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score-based sampling, and $\sigma_L$ is small enough that the perturbed data closely approximates the clean data. This makes sure that we utilize the directional information provided by the higher noise perturbations, while eventually closely approximating the true density. Each Langevin dynamics update at noise level $\sigma_i$ uses step size $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$.
The annealed Langevin algorithm runs a fixed number of Langevin steps at each noise level, starting from the largest $\sigma_1$ and using the samples at one level to initialize the chain at the next, down to the smallest $\sigma_L$.
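A minimal sketch of this annealed procedure, assuming a trained noise conditional score model `score_model(x, sigma)` and the step size schedule $\alpha_i = \epsilon\,\sigma_i^2/\sigma_L^2$ described above (initialization from uniform noise is one common choice, not the only one):

```python
import torch

@torch.no_grad()
def annealed_langevin_sampling(score_model, sigmas, shape, eps=2e-5, T=100):
    """Run T Langevin steps at each noise scale sigmas[0] > ... > sigmas[-1],
    reusing the samples from one scale to initialize the next."""
    x = torch.rand(shape)                                # rough initialization
    sigma_L = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_L) ** 2             # per-scale step size
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```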
The noise distribution is chosen to be $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$, hence $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$. The denoising score matching objective for a single noise level $\sigma$ is
$$\ell(\theta; \sigma) = \frac{1}{2}\,\mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)}\left[\left\|s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2}\right\|_2^2\right].$$
The objective function for the NCSN is then given by
$$\mathcal{L}\big(\theta; \{\sigma_i\}_{i=1}^{L}\big) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i), \qquad \lambda(\sigma_i) = \sigma_i^2.$$
Soon after the annealed Langevin algorithm was released, Martineau et al. released a paper observing that annealed Langevin sampling with denoising score matching performs worse than generative adversarial networks under the Fréchet Inception Distance (FID) metric. However, this gap vanishes when denoising the final Langevin samples (using Tweedie's formula). The authors propose Consistent Annealed Sampling (CAS) as a more stable alternative to Annealed Langevin Sampling (ALS).
Let the sampling step size parameter in Algorithm 1 (ALS) be $\epsilon$ and the corresponding parameter in Algorithm 2 (CAS) be $\eta$; the two are related through the common ratio $\gamma$ of the geometric noise sequence $\{\sigma_i\}$.
The ALS makes the assumption that the noise of the sample at step $t$ will have variance $\sigma_t^2$, an assumption upon which the quality of the score estimate depends. The authors show with a couple of propositions that ALS does not actually ensure this claimed schedule.
Before stating the propositions, let us revisit Tweedie's formula for the denoised sample under denoising score matching: $\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$.
Rewriting this, we get the optimal score function in the following form:
$$s^*(\tilde{x}, \sigma) = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2}.$$
Proposition 1: Let $s^*$ be the optimal score function from the above equation. Following the sampling described in Algorithm 1, the variance of the noise component in the sample will remain greater than $\sigma_t^2$ at every time step $t$.
This means that for a finite number of Langevin steps per noise level, the sampling hasn't fully converged and the remaining noise is carried over to the next iteration of the Langevin sampling. Further, for a learned score function $s_\theta \neq s^*$, the actual noise at every iteration is expected to be even higher than for the best possible score function $s^*$.
The Consistent Annealed Sampling (CAS) ensures that the noise levels will follow the prescribed schedule for any sampling step size and any number of steps.
Proposition 2: Let $s^*$ be the optimal score function. Following the sampling described in Algorithm 2, the variance of the noise component in the sample will be consistently equal to $\sigma_t^2$ at every time step $t$.
Importantly, Proposition 2 holds no matter how many steps we take to decrease the noise geometrically. Further, Proposition 2 only holds when the initial sample is corrupted data, $x_0 = x + \sigma_1 z$. However, by defining $\sigma_1$ as the maximum Euclidean distance between all pairs of training data points, the noise in practice dominates the true image, i.e. sampling from a pure noise initialization ($x_0 = \sigma_1 z$) becomes indistinguishable from sampling with a data initialization.
Proposition 3: Given a noise conditional score function $s_\theta(x, \sigma)$, the update rules from Algorithms 1 and 2 are respectively equivalent to the following update rules:
The improved techniques paper by Song and Ermon answers some significant questions that were not answered satisfactorily in the ALS paper, and uses them to improve the sampling algorithm. Firstly, following the observations made by Martineau et al., the authors apply Tweedie's formula to Algorithm 1: instead of returning the final sample $x_T$ directly, the algorithm now returns the denoised sample $x_T + \sigma_L^2\, s_\theta(x_T, \sigma_L)$, removing the residual noise of variance (approximately) $\sigma_L^2$.
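As a sketch, the final denoising step amounts to one application of Tweedie's formula to the last sample, assuming `x` is the output of the annealed sampler at the smallest noise level `sigma_L` and `score_model` is the trained NCSN:

```python
# Denoise the final sample instead of returning it directly (Tweedie's formula)
x_denoised = x + sigma_L ** 2 * score_model(x, sigma_L)
```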
There are many design choices that are critical to the successful training and inference of NCSNs, including
The set of noise scales $\{\sigma_i\}_{i=1}^{L}$,
The way that $s_\theta(x, \sigma)$ incorporates information of $\sigma$,
The step size parameter $\epsilon$, and
The number of sampling steps per noise scale $T$ in Algorithm 1.
Some important questions that remained unanswered from the NCSN paper are the following: a. How should we adjust the initial noise level $\sigma_1$ for different datasets? b. Is a geometric progression a good noise schedule? c. How do we choose the number of noise scales $L$ for different datasets? Is a particular choice of $L$ ideal, and if yes, why?
a. A large value of $\sigma_1$ would ensure a higher diversity of the final samples. However, an excessively large $\sigma_1$ will require more noise scales and make annealed Langevin dynamics more expensive. Suppose we have a dataset $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ which is i.i.d. sampled from $p_{\text{data}}(x)$. Assuming $N$ is sufficiently large, we have $p_{\text{data}}(x) \approx \hat{p}_{\text{data}}(x) := \frac{1}{N}\sum_{i=1}^{N} \delta(x - x^{(i)})$, where $\delta$ denotes a point mass distribution. When perturbed with $\mathcal{N}(0, \sigma_1^2 I)$, the empirical distribution becomes $\hat{p}_{\sigma_1}(x) := \frac{1}{N}\sum_{i=1}^{N} p^{(i)}(x)$, where $p^{(i)}(x) := \mathcal{N}(x; x^{(i)}, \sigma_1^2 I)$. For generating diverse samples regardless of initialization, we naturally expect that Langevin dynamics can explore any component $p^{(i)}(x)$ when initialized from any other component $p^{(j)}(x)$, where $i \neq j$. The performance of Langevin dynamics is governed by the score function $\nabla_x \log \hat{p}_{\sigma_1}(x)$.
Proposition 1: Let $\hat{p}_{\sigma_1}(x) := \frac{1}{N}\sum_{i=1}^{N} p^{(i)}(x)$, where $p^{(i)}(x) := \mathcal{N}(x; x^{(i)}, \sigma_1^2 I)$. With $r^{(i)}(x) := \frac{p^{(i)}(x)}{\sum_{k=1}^{N} p^{(k)}(x)}$, the score function is $\nabla_x \log \hat{p}_{\sigma_1}(x) = \sum_{i=1}^{N} r^{(i)}(x)\, \nabla_x \log p^{(i)}(x)$. Moreover,
$$\mathbb{E}_{p^{(i)}(x)}\big[r^{(j)}(x)\big] \leq \frac{1}{2}\exp\!\left(-\frac{\|x^{(i)} - x^{(j)}\|_2^2}{8\sigma_1^2}\right).$$
In order for Langevin dynamics to transition from $p^{(i)}(x)$ to $p^{(j)}(x)$ easily for $i \neq j$, $\mathbb{E}_{p^{(i)}(x)}[r^{(j)}(x)]$ has to be relatively large, because otherwise $\nabla_x \log \hat{p}_{\sigma_1}(x)$ will ignore the component $p^{(j)}(x)$ (on average) when initialized with $p^{(i)}(x)$, and in such a case Langevin dynamics will act as if $x^{(j)}$ did not exist. The bound above indicates that $\mathbb{E}_{p^{(i)}(x)}[r^{(j)}(x)]$ can decay exponentially fast if $\sigma_1$ is small compared to $\|x^{(i)} - x^{(j)}\|_2$. As a result, it is necessary for $\sigma_1$ to be numerically comparable to the maximum pairwise distance of the data to facilitate transitioning of Langevin dynamics and hence improve sample diversity. In particular, the authors suggest:
Technique 1 (Initial noise scale): Choose $\sigma_1$ to be as large as the maximum Euclidean distance between all pairs of training data points.
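A minimal sketch of Technique 1, computing $\sigma_1$ from a tensor of flattened training points (for large datasets one would subsample, since the pairwise distance matrix is $O(N^2)$):

```python
import torch

def initial_noise_scale(data: torch.Tensor) -> float:
    """Technique 1: sigma_1 = max pairwise Euclidean distance of the training data.

    `data` is an (N, d) tensor of flattened training points.
    """
    return torch.cdist(data, data).max().item()
```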
b. Other noise scales: For all $1 < i \leq L$, the intuition is that we need reliable gradient signals for $p_{\sigma_i}(x)$ when initializing Langevin dynamics with samples from $p_{\sigma_{i-1}}(x)$. However, an extensive grid search on $\{\sigma_i\}_{i=1}^{L}$ can be very expensive. To provide theoretical guidance on finding good noise scales, we consider a simple case where the dataset contains only one data point, or equivalently, $p_{\text{data}}(x) = \delta(x)$ (taking that point to be $0$ without loss of generality), so that $p_{\sigma_i}(x) = \mathcal{N}(x; 0, \sigma_i^2 I)$. Our first step is to understand the distributions $p_{\sigma_i}(x)$ better, especially when $x$ has high dimensionality $D$. We can decompose $x$ in hyperspherical coordinates to $(r, \phi)$, where $r = \|x\|_2$ and $\phi$ denote the radial and angular coordinates of $x$, respectively. Since $p_{\sigma_i}(x)$ is an isotropic Gaussian, the angular component $\phi$ is uniform and shared across all noise scales. As for the radial component $r$, we have the following:
Proposition 2: Let $x \in \mathbb{R}^D \sim \mathcal{N}(0, \sigma^2 I)$ and $r = \|x\|_2$. We have
$$p(r) = \frac{2^{1 - D/2}}{\Gamma(D/2)}\, \frac{r^{D-1}}{\sigma^{D}}\, \exp\!\left(-\frac{r^2}{2\sigma^2}\right),$$
and for large $D$ the radial distribution concentrates: $p(r) \approx \mathcal{N}\big(r \mid \sqrt{D}\,\sigma,\, \sigma^2/2\big)$.
In practice, dimensions of image data can range from several thousand to millions, and are typically large enough to warrant this Gaussian approximation with negligible error. Therefore, we take $p(r) = \mathcal{N}(r \mid m, s^2)$ to simplify our analysis, where $m := \sqrt{D}\,\sigma$ and $s^2 := \sigma^2/2$.
Recall that our goal is to ensure that samples from $p_{\sigma_{i-1}}(x)$ will cover high-density regions of $p_{\sigma_i}(x)$. Because $\phi$ is shared across all noise scales, $p_{\sigma_{i-1}}(x)$ already covers the angular component of $p_{\sigma_i}(x)$. Therefore, we need the radial components of $p_{\sigma_{i-1}}(x)$ and $p_{\sigma_i}(x)$ to have a large overlap.
Since $p_{\sigma_{i-1}}(r)$ has high density in $\mathcal{I}_{i-1} := [m_{i-1} - 3s_{i-1},\, m_{i-1} + 3s_{i-1}]$ (employing the "three-sigma rule of thumb"), a natural choice is to fix
$$p_{\sigma_i}\big(r \in \mathcal{I}_{i-1}\big) = \Phi\big(\sqrt{2D}\,(\gamma - 1) + 3\gamma\big) - \Phi\big(\sqrt{2D}\,(\gamma - 1) - 3\gamma\big) = C$$
with some moderately large constant $C$ for all $1 < i \leq L$, where $\gamma := \sigma_{i-1}/\sigma_i$ and $\Phi(\cdot)$ is the CDF of the standard Gaussian. This choice immediately implies that $\gamma$ is the same for all $i$, and thus $\{\sigma_i\}_{i=1}^{L}$ forms a geometric progression.
c. Number of noise scales: Ideally, we should choose as many noise scales as possible to make $C \approx 1$. However, having too many noise scales will make sampling very costly, as we need to run Langevin dynamics for each noise scale in sequence. On the other hand, $L = 10$ (for $32 \times 32$ images) as in the original setting of [1] is arguably too small, for which $C \approx 0$ up to numerical precision. To strike a balance, we recommend $C \approx 0.5$, which performs well in our experiments.
Technique 2 (Other noise scales): Choose $\{\sigma_i\}_{i=1}^{L}$ as a geometric progression with common ratio $\gamma$, such that $\Phi\big(\sqrt{2D}\,(\gamma - 1) + 3\gamma\big) - \Phi\big(\sqrt{2D}\,(\gamma - 1) - 3\gamma\big) \approx 0.5$.
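A sketch of Technique 2 under the expressions above: pick the common ratio $\gamma$ so that the overlap constant is about 0.5, then lay out a geometric sequence from $\sigma_1$ down to a smallest scale; the value `sigma_min = 0.01` below is an illustrative assumption, not a prescription from the text.

```python
import numpy as np
from scipy.stats import norm

def overlap_constant(gamma, D):
    """C = Phi(sqrt(2D)(gamma - 1) + 3*gamma) - Phi(sqrt(2D)(gamma - 1) - 3*gamma)."""
    a = np.sqrt(2 * D) * (gamma - 1)
    return norm.cdf(a + 3 * gamma) - norm.cdf(a - 3 * gamma)

def geometric_noise_scales(sigma_1, D, C_target=0.5, sigma_min=0.01):
    """Technique 2: find gamma with overlap_constant(gamma, D) ~= C_target by
    bisection (C decreases in gamma for image-sized D), then build the sequence."""
    lo, hi = 1.0 + 1e-9, 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if overlap_constant(mid, D) > C_target:
            lo = mid          # overlap still too large -> increase gamma
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    L = int(np.ceil(np.log(sigma_1 / sigma_min) / np.log(gamma))) + 1
    return sigma_1 / gamma ** np.arange(L)

# e.g. CIFAR-10-sized data: D = 3 * 32 * 32 = 3072
sigmas = geometric_noise_scales(sigma_1=50.0, D=3072)
print(len(sigmas), sigmas[:3], sigmas[-1])
```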
For high-resolution images, we need a large $\sigma_1$ and a huge number of noise scales as per Techniques 1 and 2. Recall that the NCSN is a single amortized network that takes a noise scale and gives the corresponding score. With the original conditioning approach of the NCSN (separate normalization parameters per noise scale), the memory consumption increases linearly with the number of noise scales $L$.
The authors propose an efficient alternative that is easier to implement and more widely applicable. For $p_\sigma(x) = \mathcal{N}(x; 0, \sigma^2 I)$, we observe that $\nabla_x \log p_\sigma(x) = -x/\sigma^2$, so $\mathbb{E}\big[\|\nabla_x \log p_\sigma(x)\|_2\big] \approx \sqrt{D}/\sigma$. Moreover, $\|s_\theta(x, \sigma)\|_2 \propto 1/\sigma$ holds empirically for a trained NCSN on real data (Song, Ermon, 2019). Because the norm of the score function scales inversely proportional to $\sigma$, we can incorporate the noise information by rescaling the output of an unconditional score network with $1/\sigma$. This motivates the following recommendation:
Technique 3 (Noise conditioning): Parameterize the NCSN as $s_\theta(x, \sigma) = s_\theta(x)/\sigma$, where $s_\theta(x)$ is an unconditional score network.
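A minimal sketch of Technique 3 as a wrapper module; `base` is any unconditional score network and is an assumption for illustration:

```python
import torch.nn as nn

class NoiseConditionedScore(nn.Module):
    """Technique 3: s_theta(x, sigma) = s_theta(x) / sigma."""

    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base                 # unconditional score network

    def forward(self, x, sigma):
        return self.base(x) / sigma
```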
In order to sample from an NCSN with annealed Langevin dynamics, we need to specify the number of sampling steps per noise scale $T$ and the step size parameter $\epsilon$ in Algorithm 1.
Annealed Langevin dynamics connect two adjacent noise scales ($\sigma_{i-1} > \sigma_i$) by initializing the Langevin dynamics for $p_{\sigma_i}(x)$ with samples obtained from $p_{\sigma_{i-1}}(x)$. When applying Langevin dynamics to $p_{\sigma_i}(x) = \mathcal{N}(x; 0, \sigma_i^2 I)$, the update is
$$x_{t+1} \leftarrow x_t + \alpha\, \nabla_x \log p_{\sigma_i}(x_t) + \sqrt{2\alpha}\, z_t,$$
where $\alpha = \epsilon\, \sigma_i^2/\sigma_L^2$, $z_t \sim \mathcal{N}(0, I)$, and $x_0 \sim \mathcal{N}(0, \sigma_{i-1}^2 I)$. The distribution of $x_T$ can be computed in closed form:
Proposition 3: Let $\gamma = \sigma_{i-1}/\sigma_i$. For $\alpha = \epsilon\, \sigma_i^2/\sigma_L^2$ (as in Algorithm 1), we have $x_T \sim \mathcal{N}(0, s_T^2 I)$, where
$$\frac{s_T^2}{\sigma_i^2} = \left(1 - \frac{\epsilon}{\sigma_L^2}\right)^{2T}\left(\gamma^2 - \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\big(1 - \frac{\epsilon}{\sigma_L^2}\big)^2}\right) + \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\big(1 - \frac{\epsilon}{\sigma_L^2}\big)^2}. \tag{7}$$
When $\{\sigma_i\}_{i=1}^{L}$ is a geometric progression as advocated by Technique 2, we immediately see that $s_T^2/\sigma_i^2$ is identical across all noise scales because of the shared $\gamma$. Furthermore, the value of $s_T^2/\sigma_i^2$ has no explicit dependency on the dimensionality $D$.
For better mixing of annealed Langevin dynamics, we hope $s_T^2/\sigma_i^2$ approaches 1 across all noise scales, which can be achieved by finding $\epsilon$ and $T$ that minimize the difference between Eq.(7) and 1. Unfortunately, this often results in an unnecessarily large $T$ that makes sampling very expensive for large $L$. As an alternative, we propose to first choose $T$ based on a reasonable computing budget (typically $T \times L$ is several thousand) and subsequently find $\epsilon$ by making Eq.(7) as close to 1 as possible. In summary:
Technique 4 (Selecting $T$ and $\epsilon$): Choose $T$ as large as allowed by a computing budget and then select an $\epsilon$ that makes Eq.(7) maximally close to 1.
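A sketch of Technique 4 using the Eq.(7) expression reproduced above: fix $T$, then grid-search the step size $\epsilon$ so that the variance ratio is as close to 1 as possible (the grid bounds below are illustrative assumptions):

```python
import numpy as np

def variance_ratio(eps, T, gamma, sigma_L):
    """Eq.(7): s_T^2 / sigma_i^2 after T Langevin steps at one noise scale."""
    c = 1.0 - eps / sigma_L ** 2
    tail = 2 * eps / (sigma_L ** 2 - sigma_L ** 2 * c ** 2)
    return c ** (2 * T) * (gamma ** 2 - tail) + tail

def find_step_size(T, gamma, sigma_L, grid=None):
    """Technique 4: choose eps that brings Eq.(7) closest to 1 for a fixed budget T."""
    if grid is None:
        grid = np.geomspace(1e-11, 1e-2, 10000)
    vals = variance_ratio(grid, T, gamma, sigma_L)   # numpy broadcasts over the grid
    return grid[np.argmin(np.abs(vals - 1.0))]
```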
Diffusion models (Sohl-Dickstein et al., Ho et al.) are another class of generative models that sequentially corrupt training data with slowly increasing noise, and then learn to reverse this corruption in order to form a generative model of the data. Denoising diffusion probabilistic models (DDPM) train a sequence of probabilistic models to reverse each step of the noise corruption, using knowledge of the functional form of the reverse distributions to make training tractable. For continuous state spaces, the DDPM training objective implicitly computes scores at each noise scale.
Consider a sequence of positive noise scales $0 < \beta_1, \beta_2, \ldots, \beta_N < 1$. For each training data point $x_0 \sim p_{\text{data}}(x)$, a discrete Markov chain $\{x_0, x_1, \ldots, x_N\}$ is constructed such that $p(x_i \mid x_{i-1}) = \mathcal{N}\big(x_i; \sqrt{1 - \beta_i}\, x_{i-1}, \beta_i I\big)$. Hence, $p_{\alpha_i}(x_i \mid x_0) = \mathcal{N}\big(x_i; \sqrt{\alpha_i}\, x_0, (1 - \alpha_i) I\big)$, where $\alpha_i := \prod_{j=1}^{i}(1 - \beta_j)$. Similar to score-based modelling with Langevin dynamics, we can denote the perturbed data distribution as $p_{\alpha_i}(\tilde{x}) := \int p_{\text{data}}(x)\, p_{\alpha_i}(\tilde{x} \mid x)\, dx$. The noise scales are prescribed such that $x_N$ is approximately distributed according to $\mathcal{N}(0, I)$. A variational Markov chain in the reverse direction is parameterized with $p_\theta(x_{i-1} \mid x_i) = \mathcal{N}\big(x_{i-1}; \frac{1}{\sqrt{1 - \beta_i}}\big(x_i + \beta_i\, s_\theta(x_i, i)\big), \beta_i I\big)$ and trained with a re-weighted variant of the evidence lower bound (ELBO):
$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} (1 - \alpha_i)\, \mathbb{E}_{p_{\text{data}}(x)}\, \mathbb{E}_{p_{\alpha_i}(\tilde{x} \mid x)}\Big[\big\|s_\theta(\tilde{x}, i) - \nabla_{\tilde{x}} \log p_{\alpha_i}(\tilde{x} \mid x)\big\|_2^2\Big]. \tag{1}$$
After solving equation (1) to get the optimal model $s_{\theta^*}(x, i)$, samples can be generated by starting from $x_N \sim \mathcal{N}(0, I)$ and following the estimated reverse Markov chain:
$$x_{i-1} = \frac{1}{\sqrt{1 - \beta_i}}\big(x_i + \beta_i\, s_{\theta^*}(x_i, i)\big) + \sqrt{\beta_i}\, z_i, \qquad z_i \sim \mathcal{N}(0, I).$$
This method is called ancestral sampling, since it amounts to performing ancestral sampling from the graphical model $\prod_{i=1}^{N} p_\theta(x_{i-1} \mid x_i)$. Similar to the NCSN objective, equation (1) is also a weighted sum of denoising score matching objectives, which implies that the optimal model $s_{\theta^*}(x, i)$ matches the score of the perturbed data distribution, $\nabla_x \log p_{\alpha_i}(x)$. The weights of the $i$-th terms of both objective functions, namely $\sigma_i^2$ and $(1 - \alpha_i)$, are related to the corresponding perturbation kernels in the same functional form: $\sigma_i^2 \propto 1\big/\mathbb{E}\big[\|\nabla_{\tilde{x}} \log p_{\sigma_i}(\tilde{x} \mid x)\|_2^2\big]$ and $(1 - \alpha_i) \propto 1\big/\mathbb{E}\big[\|\nabla_{\tilde{x}} \log p_{\alpha_i}(\tilde{x} \mid x)\|_2^2\big]$.
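A minimal sketch of the forward perturbation and ancestral sampling written above, assuming `score_model(x, i)` predicts the score $s_\theta(x_i, i)$ (not the noise) and `betas` is a 1-D tensor of the $\beta_i$:

```python
import torch

def ddpm_perturb(x0, i, betas):
    """Sample x_i ~ N(sqrt(alpha_i) * x0, (1 - alpha_i) I), alpha_i = prod_j (1 - beta_j)."""
    alpha_i = torch.cumprod(1.0 - betas, dim=0)[i]
    z = torch.randn_like(x0)
    return alpha_i.sqrt() * x0 + (1 - alpha_i).sqrt() * z

@torch.no_grad()
def ancestral_sampling(score_model, betas, shape):
    """Reverse chain: x_{i-1} = (x_i + beta_i * s(x_i, i)) / sqrt(1 - beta_i) + sqrt(beta_i) z."""
    x = torch.randn(shape)                       # x_N ~ N(0, I)
    for i in reversed(range(len(betas))):
        beta = betas[i]
        x = (x + beta * score_model(x, i)) / (1.0 - beta).sqrt()
        if i > 0:                                # common convention: no noise at the final step
            x = x + beta.sqrt() * torch.randn_like(x)
    return x
```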
The authors present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution into a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. The reverse-time SDE depends on the score of the perturbed data distribution. A predictor-corrector framework is introduced to correct errors in the evolution of the discretized reverse-time SDE. The authors also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation and improved sampling efficiency.
This work builds on DDPMs and annealed Langevin dynamics by generalizing them to infinitely many noise scales, such that the perturbed data distributions evolve according to an SDE as the noise intensifies.
The diffusion process $\{x(t)\}_{t \in [0, T]}$, indexed by a continuous time variable $t$, where $x(0) \sim p_0$ is the data distribution and $x(T) \sim p_T$ is the prior density, is modelled as the solution to the SDE
$$dx = f(x, t)\, dt + g(t)\, dw,$$
where $f(\cdot, t)$ corresponds to the deterministic drift and $g(t)\, dw$ is the infinitesimal noise; $w$ is a standard Wiener process and $g(t)$ is the diffusion coefficient of $x(t)$. The SDE has a unique strong solution as long as the coefficients are globally Lipschitz in both state and time. Let $p_{st}(x(t) \mid x(s))$ be the transition kernel from $x(s)$ to $x(t)$, where $0 \leq s < t \leq T$, and let $p_t(x)$ be the probability density of $x(t)$. To estimate $\nabla_x \log p_t(x)$, a time-dependent score-based model $s_\theta(x, t)$ is trained as a continuous generalization of the NCSN and DDPM objectives:
$$\theta^* = \arg\min_\theta\, \mathbb{E}_{t}\Big\{\lambda(t)\, \mathbb{E}_{x(0)}\, \mathbb{E}_{x(t) \mid x(0)}\Big[\big\|s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0))\big\|_2^2\Big]\Big\}.$$
Here, $\lambda : [0, T] \to \mathbb{R}_{>0}$ is a positive weighting function, $t$ is uniformly sampled over $[0, T]$, $x(0) \sim p_0(x)$, and $x(t) \sim p_{0t}(x(t) \mid x(0))$. With sufficient data and model capacity, score matching ensures that the optimum score, denoted by $s_{\theta^*}(x, t)$, equals $\nabla_x \log p_t(x)$ for almost all $x$ and $t$. As in SMLD and DDPM, we can typically choose $\lambda \propto 1\big/\mathbb{E}\big[\|\nabla_{x(t)} \log p_{0t}(x(t) \mid x(0))\|_2^2\big]$.
The forward SDE is given by:
$$dx = f(x, t)\, dt + g(t)\, dw.$$
The corresponding reverse-time SDE is:
$$dx = \big[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\big]\, dt + g(t)\, d\bar{w},$$
where $\bar{w}$ is a standard Wiener process when time flows backwards from $T$ to $0$, and $dt$ is an infinitesimal negative timestep.
The Euler–Maruyama method is used to solve the reverse-time SDE, resulting in the following discretized update (with the learned score $s_\theta(x, t)$ in place of $\nabla_x \log p_t(x)$):
$$x_{t - \Delta t} = x_t - \big[f(x_t, t) - g(t)^2\, s_\theta(x_t, t)\big]\, \Delta t + g(t)\, \sqrt{\Delta t}\; z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
where $\Delta t > 0$ is a small step backwards in time.
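A minimal sketch of this Euler–Maruyama integration, where `f`, `g`, and `score_model` are hypothetical callables for the drift, diffusion coefficient, and learned score, and the prior at $t = T$ is assumed to be a standard Gaussian (this depends on the chosen SDE):

```python
import torch

@torch.no_grad()
def euler_maruyama_sampler(score_model, f, g, shape, T=1.0, n_steps=1000):
    """Integrate dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dw_bar backwards from T to 0."""
    dt = T / n_steps
    x = torch.randn(shape)                       # sample from the (assumed) prior at t = T
    for k in range(n_steps, 0, -1):
        t = k * dt
        z = torch.randn_like(x)
        drift = f(x, t) - g(t) ** 2 * score_model(x, t)
        x = x - drift * dt + g(t) * dt ** 0.5 * z
    return x
```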
Predictor-Corrector (PC) Samplers
Predictor-Corrector samplers provide a better way of solving the reverse-time SDE. At each time step, the numerical SDE solver first gives an estimate of the sample at the next time step, playing the role of a “predictor”. Then, a score-based MCMC approach corrects the marginal distribution of the estimated sample, playing the role of a “corrector”. PC samplers generalize the original sampling methods of SMLD and DDPM: the former uses an identity function as the predictor and annealed Langevin dynamics as the corrector, while the latter uses ancestral sampling as the predictor and the identity function as the corrector.
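A sketch of a PC sampler that pairs the Euler–Maruyama predictor above with a Langevin corrector; the signal-to-noise heuristic for the corrector step size is an assumption (one of several reasonable choices), and batching is ignored for simplicity:

```python
import torch

@torch.no_grad()
def pc_sampler(score_model, f, g, shape, T=1.0, n_steps=1000, n_corrector=1, snr=0.16):
    """Predictor-Corrector sampling: reverse-SDE predictor step, then Langevin corrector steps."""
    dt = T / n_steps
    x = torch.randn(shape)
    for k in range(n_steps, 0, -1):
        t = k * dt
        # predictor: one Euler-Maruyama step of the reverse-time SDE
        z = torch.randn_like(x)
        drift = f(x, t) - g(t) ** 2 * score_model(x, t)
        x = x - drift * dt + g(t) * dt ** 0.5 * z
        # corrector: Langevin MCMC at the new time level
        for _ in range(n_corrector):
            grad = score_model(x, t - dt)
            noise = torch.randn_like(x)
            eps = 2 * (snr * noise.norm() / grad.norm()) ** 2   # step size heuristic
            x = x + eps * grad + (2 * eps) ** 0.5 * noise
    return x
```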
Noise schedules of score-based generative models
In this section we review the noise schedules used by score-based generative models.
We first review the schedules for the papers that have already been discussed. The NCSN uses a geometric noise schedule $\{\sigma_i\}_{i=1}^{L}$ with $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score-based sampling and $\sigma_L$ is small enough that the perturbed data closely approximates the clean data. DDPM uses a forward diffusion schedule $x_i = \sqrt{1 - \beta_i}\, x_{i-1} + \sqrt{\beta_i}\, z_i$ with increasing $\beta_i$, and hence the noise schedule $\alpha_i = \prod_{j=1}^{i}(1 - \beta_j)$, i.e. $x_i \sim \mathcal{N}\big(\sqrt{\alpha_i}\, x_0, (1 - \alpha_i) I\big)$. The SDE method takes infinitely many noise levels with the same schedule as ALS: $\sigma(t) = \sigma_{\min}\big(\frac{\sigma_{\max}}{\sigma_{\min}}\big)^{t}$ for $t \in (0, 1]$.
Nichol and Dhariwal (2021) introduce the cosine noise schedule $\alpha_t = \frac{f(t)}{f(0)}$ with $f(t) = \cos\big(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\big)^2$. Under the assumption that the variances are preserved, the noise variance is $\sigma_t^2 = 1 - \alpha_t$, hence $x_t \sim \mathcal{N}\big(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I\big)$. Kingma et al. (2021) observe that the quality of image generation depends on the signal-to-noise ratio $\text{SNR}(t) = \frac{\alpha_t}{1 - \alpha_t}$, or more importantly on $\log \text{SNR}(t)$.
Hoogeboom et al. (2023) observe that noise schedules that work for lower-resolution images need not work for higher-resolution images. They explain this with the example of a high-resolution image $z_t = \alpha_t x + \sigma_t \epsilon$, assuming that the noise values added to different pixels are independent of each other. Commonly, diffusion models use network architectures that rely on downsampling to operate on lower-resolution feature maps, in this case with average pooling. Suppose we average-pool $z_t^{(1)}, z_t^{(2)}, z_t^{(3)}, z_t^{(4)}$, where the indices denote the pixels in a $2 \times 2$ square that is being pooled. The new pixel is $\tilde{z}_t = \frac{1}{4}\sum_{k=1}^{4} z_t^{(k)}$. Given that $\frac{1}{4}\sum_{k} \epsilon^{(k)} \sim \mathcal{N}(0, \frac{1}{4})$ and that $x^{(1)} = x^{(2)} = x^{(3)} = x^{(4)} = c$ for a constant $c$ (neighbouring pixels are approximately equal), and letting $\tilde{z}_t^{(1)}$ denote the first pixel of the average-pooled input image, we find that $\tilde{z}_t^{(1)} = \alpha_t c + \frac{\sigma_t}{2}\, \tilde{\epsilon}$ with $\tilde{\epsilon} \sim \mathcal{N}(0, 1)$. The lower-resolution pixel only has half the amount of noise, so the noise schedule isn't consistent over different resolutions. Hence, the authors argue that the noise schedule should be defined with respect to some reference resolution $d_{\text{ref}} \times d_{\text{ref}}$ (64 × 64 in the paper). The schedule for some other resolution $d \times d$ is then obtained via the SNR as $\text{SNR}^{d}(t) = \text{SNR}^{d_{\text{ref}}}(t) \cdot \big(\frac{d_{\text{ref}}}{d}\big)^2$.
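A small sketch of this resolution-dependent shift on top of the cosine schedule described above; the reference resolution of 64 and the shift in log-SNR space follow the relation given in the text, and the specific offset constant `s = 0.008` is an illustrative assumption.

```python
import numpy as np

def cosine_log_snr(t, s=0.008):
    """log SNR(t) for the cosine schedule alpha_bar(t) = f(t)/f(0),
    f(t) = cos((t + s)/(1 + s) * pi/2)^2, with t normalized to [0, 1]."""
    alpha_bar = (np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
                 / np.cos(s / (1 + s) * np.pi / 2) ** 2)
    return np.log(alpha_bar) - np.log(1 - alpha_bar)

def shifted_log_snr(t, resolution, ref_resolution=64, s=0.008):
    """Resolution-dependent shift: SNR_d(t) = SNR_ref(t) * (ref / d)^2,
    i.e. add 2 * log(ref / d) in log-SNR space."""
    return cosine_log_snr(t, s) + 2 * np.log(ref_resolution / resolution)

# e.g. schedule for 256x256 images shifted relative to a 64x64 reference
ts = np.linspace(1e-4, 1 - 1e-4, 5)
print(shifted_log_snr(ts, resolution=256))
```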