
Score based generative models

Score based models

Representing probability distributions via scores

Idea: when the probability density function is differentiable, we can work with the gradient of its logarithm with respect to the input (the score) rather than with the density itself.

In contrast with likelihood-based models such as variational autoencoders and energy-based models, where we work directly with the probability distribution $p(x)$, the score-based representation uses the score function: the gradient of the log-likelihood with respect to the inputs, not the parameters of the model.

$$\text{Score function} = \nabla_x \log p(x)$$

The score function is a vector valued function, where the direction of the vector field tells us the direction to follow to increase the log likelihood.


Disadvantages of representing the PDF $p(x)$ directly

  • Need to make sure that the PDFs are normalized.
  • Need a way of parameterizing the densities that is flexible enough to capture complicated shapes.
  • Even when the shapes of these densities get very complicated (due to changes in parameters), we need to ensure that the total area under the curve is still fixed (=1). This can be tricky and might require very specific architectures.

Advantages of representation via the score function $\nabla_x \log p(x)$

  • Doesn't need to satisfy any kind of normalization constraint.
  • Potentially much simpler to work with.

Score estimation by training score-based models


Given: i.i.d. samples $\{x_1, x_2, \dots, x_n\} \sim p_{\text{data}}(x)$

Task: estimate the score $\nabla_x \log p_{\text{data}}(x)$

Score model: a learnable vector-valued function $s_\theta(x): \mathbb{R}^d \to \mathbb{R}^d$

Goal: $s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)$

Objective: average Euclidean distance over the whole space, i.e. minimize the Fisher divergence
$$\frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| \nabla_x \log p_{\text{data}}(x) - s_\theta(x) \right\|_2^2\right]$$

Score matching: using integration by parts, Hyvärinen (2005) showed that the Fisher divergence above can be rewritten so that it no longer involves the unknown data score,

$$\mathbb{E}_{x \sim p_{\text{data}}}\left[\operatorname{tr}\left(\nabla_x s_\theta(x)\right) + \frac{1}{2}\left\| s_\theta(x) \right\|_2^2\right] + \text{const.},$$

which can be estimated from samples alone, at the cost of computing the trace of the Jacobian $\nabla_x s_\theta(x)$.

Score models are not scalable!

The simplest way of parameterizing the score is a deep neural network that takes the samples as input and directly outputs the score vector. This is already cheaper than going through an energy-based model, where the score is itself a gradient of the energy, so score matching would additionally require second derivatives of the network. Even with a direct parameterization, however, the objective above contains the trace of the Jacobian $\nabla_x s_\theta(x)$, and computing it exactly requires roughly $d$ backward passes, where $d$ is the number of dimensions. The training cost therefore still grows with the dimensionality of the data, so vanilla score matching is not scalable to high-dimensional settings.

In order to avoid the expensive Jacobian computation and make the model scale to high-dimensional settings, Vincent (2011) proposed a different way of estimating the score: denoising score matching.

Denoising score matching (Vincent, 2011)

Here, instead of trying to estimate the score of the actual data, we try to estimate (match) the score of a noise perturbed data distribution.

Consider the perturbed distribution:

$$q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I), \qquad q_\sigma(\tilde{x}) = \int p(x)\, q_\sigma(\tilde{x} \mid x)\, dx$$

So we sample a point $x$ from the density $p_{\text{data}}(x)$ and then add Gaussian noise via $q_\sigma(\tilde{x} \mid x)$; equivalently, the data density is convolved with the Gaussian kernel, giving the perturbed data distribution $q_\sigma(\tilde{x})$.

Score estimation for $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ is easier (proved below). If the noise level $\sigma$ is small, the perturbed distribution is a good approximation of the original distribution: $q_\sigma(\tilde{x}) \approx p(\tilde{x})$.

The Fisher divergence when estimating $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$:

$$\frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2\right]$$

$$= \frac{1}{2}\int q_\sigma(\tilde{x}) \left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2 d\tilde{x}$$

$$= \frac{1}{2}\int q_\sigma(\tilde{x}) \left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \right\|_2^2 d\tilde{x} + \frac{1}{2}\int q_\sigma(\tilde{x}) \left\| s_\theta(\tilde{x}) \right\|_2^2 d\tilde{x} - \int q_\sigma(\tilde{x})\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x}$$

$$= \text{const.} + \frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| s_\theta(\tilde{x}) \right\|_2^2\right] - \int q_\sigma(\tilde{x})\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x}$$

We try to simplify the above integral:

$$\int q_\sigma(\tilde{x})\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x} = \int q_\sigma(\tilde{x}) \frac{1}{q_\sigma(\tilde{x})} \nabla_{\tilde{x}} q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x} = \int \nabla_{\tilde{x}} q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x}$$

$$= \int \nabla_{\tilde{x}}\left( \int p_{\text{data}}(x)\, q_\sigma(\tilde{x} \mid x)\, dx \right)^\top s_\theta(\tilde{x})\, d\tilde{x} = \int \left( \int p_{\text{data}}(x)\, \nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)\, dx \right)^\top s_\theta(\tilde{x})\, d\tilde{x}$$

$$= \int\!\!\int p_{\text{data}}(x)\, q_\sigma(\tilde{x} \mid x)\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)^\top s_\theta(\tilde{x})\, dx\, d\tilde{x}$$

$$= \mathbb{E}_{x \sim p_{\text{data}}(x),\, \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[ \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)^\top s_\theta(\tilde{x}) \right]$$

Hence denoising score matching:

$$\frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2\right] = \text{const.} + \frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| s_\theta(\tilde{x}) \right\|_2^2\right] - \int q_\sigma(\tilde{x})\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})^\top s_\theta(\tilde{x})\, d\tilde{x}$$

$$= \text{const.} + \frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| s_\theta(\tilde{x}) \right\|_2^2\right] - \mathbb{E}_{x \sim p_{\text{data}}(x),\, \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[ \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)^\top s_\theta(\tilde{x}) \right]$$

$$= \text{const.} + \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}(x),\, \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[ \left\| s_\theta(\tilde{x}) \right\|_2^2 - 2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)^\top s_\theta(\tilde{x}) \right]$$

$$= \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}(x),\, \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[ \left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|_2^2 \right] + \text{const.}$$

Summary of denoising score matching

Core ideas:

  • Estimate the score of a noise-perturbed distribution:

$$\frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2\right] = \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}(x),\, \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[ \left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|_2^2 \right] + \text{const.}$$

  • $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ is easy to compute (Gaussian):
  • $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$
  • $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\dfrac{\tilde{x} - x}{\sigma^2}$
  • Advantages of this method: efficient to optimize even for very high dimensional data, and useful for optimal denoising.
  • Disadvantage: Cannot estimate the score of clean (noise-free) data.

Process for denoising score matching:

  • Sample a minibatch of datapoints $\{x_1, x_2, \dots, x_n\} \sim p_{\text{data}}(x)$.
  • Sample a minibatch of perturbed datapoints $\{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\} \sim q_\sigma(\tilde{x})$, with $\tilde{x}_i \sim q_\sigma(\tilde{x} \mid x_i)$.
  • Estimate the denoising score matching loss with empirical means:
    $$\frac{1}{2n}\sum_{i=1}^{n}\left[ \left\| s_\theta(\tilde{x}_i) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}_i \mid x_i) \right\|_2^2 \right]$$
  • If the perturbation is Gaussian:
    $$\frac{1}{2n}\sum_{i=1}^{n}\left[ \left\| s_\theta(\tilde{x}_i) + \frac{\tilde{x}_i - x_i}{\sigma^2} \right\|_2^2 \right]$$
  • Apply stochastic gradient descent (a very small $\sigma$ must be chosen), as in the sketch below.
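A minimal sketch of this training loop in PyTorch, assuming a toy MLP score network and a single fixed noise level `sigma` (both of which are illustrative choices, not prescribed by the text above):

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Toy vector-valued score model for low-dimensional data (illustrative only)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def dsm_loss(score_model, x, sigma):
    """Denoising score matching with a Gaussian kernel:
    1/(2n) * sum_i || s_theta(x_tilde_i) + (x_tilde_i - x_i) / sigma^2 ||^2."""
    noise = sigma * torch.randn_like(x)          # x_tilde = x + sigma * z
    x_tilde = x + noise
    target = -noise / sigma**2                   # = grad_{x_tilde} log q_sigma(x_tilde | x)
    return 0.5 * ((score_model(x_tilde) - target) ** 2).sum(dim=1).mean()

# Usage sketch: fit the score of a toy 2-D mixture of Gaussians.
data = torch.cat([torch.randn(512, 2) - 3.0, torch.randn(512, 2) + 3.0])
model, sigma = ScoreNet(), 0.1                   # small sigma, as noted above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    batch = data[torch.randint(0, data.shape[0], (128,))]
    loss = dsm_loss(model, batch, sigma)
    opt.zero_grad(); loss.backward(); opt.step()
```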

Hence,

  • Score matching:

    $$\frac{1}{2}\mathbb{E}_{\tilde{x} \sim q_\sigma}\left[\left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2\right]$$

  • Denoising:

    $$= \frac{1}{2}\mathbb{E}_{p(x)}\mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\left[\left\| \frac{1}{\sigma^2}(x - \tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2\right] + \text{const.}$$

$s_\theta(\tilde{x})$ effectively estimates (up to the $1/\sigma^2$ scaling and a sign) the noise that was added to produce $\tilde{x}$, which can then be removed from the noisy sample (denoising). Hence we have reduced the problem of image generation to the problem of denoising, which is an easier problem to work with.

Tweedie's formula

Tweedie's formula says that the optimal denoising strategy is to follow the gradient (score):

$$\tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$$

Given $p(x)$ and $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$, the posterior $p(x \mid \tilde{x})$ is given by Bayes' rule:
$$p(x \mid \tilde{x}) \propto p(x)\, q_\sigma(\tilde{x} \mid x)$$

We also know that
$$q_\sigma(\tilde{x}) = \int p(x)\, q_\sigma(\tilde{x} \mid x)\, dx$$

Hence Tweedie's formula gives us:

$$\mathbb{E}_{x \sim p(x \mid \tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \approx \tilde{x} + \sigma^2 s_\theta(\tilde{x})$$

Tweedie's formula thus gives us a way to recover (in expectation) the clean data from a sample of the noise-perturbed distribution.
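As a one-line sketch (assuming a trained score model `score_fn` for noise level `sigma`), the resulting denoiser is just:

```python
def tweedie_denoise(score_fn, x_tilde, sigma):
    """Posterior-mean estimate via Tweedie's formula: E[x | x_tilde] ~= x_tilde + sigma^2 * s_theta(x_tilde)."""
    return x_tilde + sigma ** 2 * score_fn(x_tilde)
```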

Score based Sampling with Langevin Dynamics (Welling, Teh, 2011)

Langevin dynamics can produce samples from a probability density $p(x)$ using only the score function $\nabla_x \log p(x)$. Given a fixed step size $\epsilon > 0$ and an initial value $\tilde{x}_0 \sim \pi(x)$, where $\pi(x)$ is a prior distribution, the Langevin method recursively computes

$$\tilde{x}_t = \tilde{x}_{t-1} + \frac{\epsilon}{2}\nabla_x \log p(\tilde{x}_{t-1}) + \sqrt{\epsilon}\, z_t \tag{1}$$

where $z_t \sim \mathcal{N}(0, I)$. Welling and Teh prove in their paper that as $\epsilon \to 0$ and $T \to \infty$, $\tilde{x}_T$ becomes an exact sample from $p(x)$.
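A short sketch of this update rule, assuming a callable `score_fn(x)` that returns $\nabla_x \log p(x)$ (for instance, the model trained in the sketch above):

```python
import torch

def langevin_sample(score_fn, x0, eps=1e-3, n_steps=1000):
    """Unadjusted Langevin dynamics:
    x_t = x_{t-1} + eps/2 * score(x_{t-1}) + sqrt(eps) * z_t,  z_t ~ N(0, I)."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + eps ** 0.5 * z
    return x

# Usage: start from a broad prior pi(x) and run the chain.
# samples = langevin_sample(model, torch.randn(256, 2) * 5.0)
```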

Challenges of score-based sampling

  1. The manifold hypothesis: the manifold hypothesis states that real-world data tends to concentrate on a lower-dimensional manifold embedded in a higher-dimensional (ambient) space. Since the score function is a gradient taken in the ambient space, it is undefined when $x$ is confined to a low-dimensional manifold. Further, the score estimator $s_\theta(x)$ gives a consistent approximation of the score only when the support of the data distribution is the whole ambient space (Corollary 2, Hyvärinen).
  2. Low data density regions: in low-density regions, score matching may not have enough evidence to estimate the score function accurately, simply for lack of samples. In practice, the expectation
     $$\frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| \nabla_x \log p_{\text{data}}(x) - s_\theta(x) \right\|_2^2\right]$$
     is always estimated using i.i.d. samples $\{x_i\}_{i=1}^{N} \sim p_{\text{data}}(x)$. Consider any region $\mathcal{R} \subset \mathbb{R}^D$ such that $p_{\text{data}}(\mathcal{R}) \approx 0$. In most cases $\{x_i\}_{i=1}^{N} \cap \mathcal{R} = \varnothing$, and score matching will not have sufficient samples to estimate the true score for $x \in \mathcal{R}$. Further, if the data distribution is a weighted mixture of two distributions with well-separated supports, score matching cannot learn the mixture weights, so sampling would target a different distribution than the one desired.

Annealed Langevin dynamics and NCSN (Song & Ermon, 2019)

Song and Ermon extended the idea of denoising score matching to score-based sampling. The key idea is that perturbing the data with random Gaussian noise makes the support of the perturbed data distribution the whole ambient space, so the perturbed data is no longer confined to a lower-dimensional manifold. Further, large Gaussian noise fills the low-density regions of the original unperturbed distribution, so the sample-based training signal is better in those regions, resulting in improved score estimation. The authors make two key observations:

  1. High noise provides useful directional information for Langevin dynamics
  2. However, perturbed density no longer approximates the true data density

The authors define a noise-conditional score network (NCSN) to predict the scores for $L$ noise levels simultaneously. The noise levels form a positive geometric sequence satisfying $\frac{\sigma_1}{\sigma_2} = \dots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score-based sampling discussed above and $\sigma_L \approx 0$. This makes sure that we utilize the directional information provided by the larger noise perturbations while eventually closely approximating the true density. Each Langevin dynamics update at level $i$ uses step size $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$.

The annealed Langevin algorithm is given below:

[Algorithm 1: Annealed Langevin dynamics sampling (figure omitted).]
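A sketch of the procedure in Algorithm 1, assuming a noise-conditional score model `score_fn(x, sigma)` and using the step-size rule $\alpha_i = \epsilon \cdot \sigma_i^2/\sigma_L^2$ from the text (the default `eps` and `T` are illustrative):

```python
import torch

def annealed_langevin(score_fn, x, sigmas, eps=2e-5, T=100):
    """Annealed Langevin dynamics: run T Langevin steps at each noise level,
    from the largest sigma_1 down to the smallest sigma_L."""
    sigma_L = sigmas[-1]
    for sigma in sigmas:                          # sigma_1 > sigma_2 > ... > sigma_L
        alpha = eps * sigma ** 2 / sigma_L ** 2   # alpha_i = eps * sigma_i^2 / sigma_L^2
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_fn(x, sigma) + alpha ** 0.5 * z
    return x
```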

The noise distribution is chosen to be $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$, hence $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$. The denoising score matching objective at a single noise level is

$$\ell(\theta; \sigma) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}(x)}\mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\left[\left\| s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \right\|_2^2\right] \tag{2}$$

The objective function for the NCSN is then given by

$$\mathcal{L}\left(\theta; \{\sigma_i\}_{i=1}^{L}\right) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta, \sigma_i) \tag{3}$$

where $\lambda(\sigma_i) > 0$ is a coefficient function depending on $\sigma_i$.

Consistent annealed sampling (Martineau et al., 2020)

Soon after the annealed Langevin algorithm was released, Martineau et al. observed that annealed Langevin sampling with denoising score matching performs worse than generative adversarial networks under the Fréchet Inception Distance (FID) metric. However, this gap vanishes when the final Langevin samples are denoised (using Tweedie's formula). The authors propose Consistent Annealed Sampling (CAS) as a more stable alternative to Annealed Langevin Sampling (ALS).

[Algorithm 2: Consistent Annealed Sampling (figure omitted).]

Let the value of $L$ in Algorithm 1 be $L_1$ and that for Algorithm 2 be $L_2$. Then $L_1$ and $L_2$ are related as $L_2 = (L_1 - 1)T + 1$. Define $\gamma_t = \frac{\sigma_{t+1}}{\sigma_t}$; for a geometric sequence this ratio is constant, $\gamma = \left(\frac{\sigma_L}{\sigma_1}\right)^{\frac{1}{L-1}}$. Further,
$$\{\sigma_i\}_{i=1}^{L} = \left\{ \gamma^i \sigma_1 \;\middle|\; i \in \{0, 1, \dots, L-1\},\; \gamma = \left(\tfrac{\sigma_L}{\sigma_1}\right)^{\frac{1}{L-1}} < 1 \right\}$$

ALS assumes that the noise in the sample at step $i$ has variance $\sigma_i^2$, an assumption on which the quality of the score estimate depends. The authors show, through the following propositions, that ALS does not actually ensure this claimed schedule.

Before stating the propositions, let us revisit Tweedie's formula for the denoised sample under denoising score matching:

$$\mathbb{E}_{x \sim q_\sigma(x \mid \tilde{x})}[x] = \tilde{x} + \sigma^2 s_{\theta^*}(\tilde{x}) \tag{4}$$

Rewriting this, the optimal score function takes the form

$$s_{\theta^*}(\tilde{x}) = \frac{\mathbb{E}_{x \sim q_\sigma(x \mid \tilde{x})}[x] - \tilde{x}}{\sigma^2} \tag{5}$$

Proposition 1: Let $s_{\theta^*}$ be the optimal score function given by the equation above. Following the sampling described in Algorithm 1, the variance of the noise component in the sample $x$ will remain greater than $\sigma_t^2$ at every time step $t$.

This means that for $T < \infty$ the sampling has not fully converged, and the remaining noise is carried over to the next iteration of Langevin sampling. Further, for $s_\theta \neq s_{\theta^*}$, the actual noise at every iteration is expected to be even higher than for the best possible score function $s_{\theta^*}$.

Consistent Annealed Sampling (CAS) ensures that the noise levels follow the prescribed schedule for any sampling step size $\epsilon$ and number of steps $L$.

Proposition 2: Let $s_{\theta^*}$ be the optimal score function. Following the sampling described in Algorithm 2, the variance of the noise component in the sample $x$ will be consistently equal to $\sigma_t^2$ at every time step $t$.

Importantly, Proposition 2 holds no matter how many steps $L$ we take to decrease the noise geometrically. Further, Proposition 2 only holds when the initial sample is corrupted ($x_0 = x + \sigma_0 z_0$). However, by defining $\sigma_0$ as the maximum Euclidean distance between all pairs of training data points, the noise is in practice much larger than the true image, i.e. sampling from a pure-noise initialization ($x_0 = \sigma_0 z_0$) becomes indistinguishable from sampling with a data initialization.

Proposition 3: Given a noise-conditional score function, the update rules from Algorithms 1 and 2 are respectively equivalent to the following update rules:

  1. $x \leftarrow (1 - \eta)x + \eta H(x, \sigma_i) + \sqrt{2\eta}\, \sigma_i z$

  2. $x \leftarrow (1 - \eta)x + \eta H(x, \sigma_i) + \beta \sigma_{i+1} z$

where $z \sim \mathcal{N}(0, I)$, $\eta = \epsilon / \sigma_L^2$, and $H(x, \sigma_i) = \mathbb{E}_{x' \sim q_{\sigma_i}(x' \mid x)}[x'] = x + \sigma_i^2 s_\theta(x, \sigma_i)$ is the denoised estimate.
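A sketch of the two update rules, written directly from Proposition 3; the denoiser $H$ follows the definition above, while the constant $\beta$ prescribed by the CAS paper is left as an argument rather than filled in:

```python
import torch

def denoiser_H(score_fn, x, sigma):
    """H(x, sigma) = x + sigma^2 * s_theta(x, sigma), the Tweedie-style denoised estimate."""
    return x + sigma ** 2 * score_fn(x, sigma)

def als_step(score_fn, x, sigma_i, eta):
    """Rule 1 (ALS): x <- (1 - eta) x + eta H(x, sigma_i) + sqrt(2 eta) sigma_i z."""
    z = torch.randn_like(x)
    return (1 - eta) * x + eta * denoiser_H(score_fn, x, sigma_i) + (2 * eta) ** 0.5 * sigma_i * z

def cas_step(score_fn, x, sigma_i, sigma_next, eta, beta):
    """Rule 2 (CAS): x <- (1 - eta) x + eta H(x, sigma_i) + beta sigma_{i+1} z.
    beta is the constant specified in the CAS paper; it is taken as an input here."""
    z = torch.randn_like(x)
    return (1 - eta) * x + eta * denoiser_H(score_fn, x, sigma_i) + beta * sigma_next * z
```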

Stable training for NCSN (Song & Ermon, 2020)

This paper answers some significant questions that were left open by the ALS paper, and uses the answers to improve the sampling algorithm. Firstly, following the observations made by Martineau et al., the authors apply Tweedie's formula to Algorithm 1: instead of returning $\tilde{x}_T$, the algorithm now returns the denoised sample $\tilde{x}_T + \sigma_T^2 s_\theta(\tilde{x}_T, \sigma_T)$ to remove the residual noise $z \sim \mathcal{N}(0, \sigma_T^2 I)$.

There are many design choices that are critical to the successful training and inference of NCSNs, including

  1. The set of noise scales $\{\sigma_i\}_{i=1}^{L}$,
  2. The way that $s_\theta(x, \sigma)$ incorporates information about $\sigma$,
  3. The step size parameter $\epsilon$,
  4. The number of sampling steps per noise scale $T$ in Algorithm 1.

1. Choosing noise scales

Some important questions left unanswered by the NCSN paper are the following:
a. How should we adjust the initial noise level $\sigma_1$ for different datasets?
b. Is a geometric progression a good noise schedule?
c. How do we choose the number of noise scales $L$ for different datasets? Is a particular choice of $L$ ideal, and if so, why?

a. Initial noise scale: a large value of $\sigma_1$ ensures a higher diversity of the final samples. However, an excessively large $\sigma_1$ requires more noise scales and makes annealed Langevin dynamics more expensive. Suppose we have a dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ sampled i.i.d. from $p_{\text{data}}(x)$. Assuming $N$ is sufficiently large, we have $p_{\text{data}}(x) \approx \hat{p}_{\text{data}}(x) \triangleq \frac{1}{N}\sum_{i=1}^{N}\delta(x = x^{(i)})$, where $\delta(\cdot)$ denotes a point mass distribution. When perturbed with $\mathcal{N}(0, \sigma_1^2 I)$, the empirical distribution becomes $\hat{p}_{\sigma_1}(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} p^{(i)}(x)$, where $p^{(i)}(x) \triangleq \mathcal{N}(x \mid x^{(i)}, \sigma_1^2 I)$. For generating diverse samples regardless of initialization, we naturally expect that Langevin dynamics can explore any component $p^{(j)}(x)$ when initialized from any other component $p^{(i)}(x)$, with $i \neq j$. The performance of Langevin dynamics is governed by the score function $\nabla_x \log \hat{p}_{\sigma_1}(x)$.

Proposition 1: Let $\hat{p}_{\sigma_1}(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} p^{(i)}(x)$, where $p^{(i)}(x) \triangleq \mathcal{N}(x \mid x^{(i)}, \sigma_1^2 I)$. With $r^{(i)}(x) \triangleq \frac{p^{(i)}(x)}{\sum_{k=1}^{N} p^{(k)}(x)}$, the score function is $\nabla_x \log \hat{p}_{\sigma_1}(x) = \sum_{i=1}^{N} r^{(i)}(x)\, \nabla_x \log p^{(i)}(x)$. Moreover,

$$\mathbb{E}_{p^{(i)}(x)}\left[r^{(j)}(x)\right] \leq \frac{1}{2}\exp\left(-\frac{\left\|x^{(i)} - x^{(j)}\right\|_2^2}{8\sigma_1^2}\right). \tag{6}$$

In order for Langevin dynamics to transition from $p^{(i)}(x)$ to $p^{(j)}(x)$ easily for $i \neq j$, $\mathbb{E}_{p^{(i)}(x)}[r^{(j)}(x)]$ has to be relatively large, because otherwise $\nabla_x \log \hat{p}_{\sigma_1}(x) = \sum_{k=1}^{N} r^{(k)}(x)\, \nabla_x \log p^{(k)}(x)$ will ignore the component $p^{(j)}(x)$ (on average) when initialized with $x \sim p^{(i)}(x)$, and in that case Langevin dynamics will act as if $p^{(j)}(x)$ did not exist. The bound in Eq. (6) shows that $\mathbb{E}_{p^{(i)}(x)}[r^{(j)}(x)]$ can decay exponentially fast if $\sigma_1$ is small compared to $\left\|x^{(i)} - x^{(j)}\right\|_2$. As a result, it is necessary for $\sigma_1$ to be numerically comparable to the maximum pairwise distance of the data to facilitate transitions of Langevin dynamics and hence to improve sample diversity. In particular, the authors suggest:

Technique 1 (Initial noise scale): choose $\sigma_1$ to be as large as the maximum Euclidean distance between all pairs of training data points.
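Technique 1 can be read straight off the training set; a sketch (the pairwise-distance matrix is quadratic in memory, so this is only practical for moderately sized datasets or a subsample):

```python
import torch

def initial_noise_scale(data):
    """sigma_1 per Technique 1: the maximum Euclidean distance between any two training points."""
    flat = data.reshape(data.shape[0], -1).float()
    return torch.cdist(flat, flat).max().item()

# e.g. sigma_1 = initial_noise_scale(train_images)   # train_images: (N, C, H, W)
```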

b. Other noise scales: for all $1 < i \leq L$, the intuition is that we need reliable gradient signals for $p_{\sigma_i}(x)$ when initializing Langevin dynamics with samples from $p_{\sigma_{i-1}}(x)$. However, an extensive grid search over $\{\sigma_i\}_{i=1}^{L}$ can be very expensive. To provide theoretical guidance on finding good noise scales, the authors consider a simple case where the dataset contains only one data point, or equivalently, $\forall i: p_{\sigma_i}(x) = \mathcal{N}(x \mid 0, \sigma_i^2 I)$. The first step is to understand the distributions $p_{\sigma_i}(x)$ better, especially when $x$ has high dimensionality. We can decompose $p_{\sigma_i}(x)$ in hyperspherical coordinates as $p(\phi)\, p_{\sigma_i}(r)$, where $r$ and $\phi$ denote the radial and angular coordinates of $x$, respectively. Since $p_{\sigma_i}(x)$ is an isotropic Gaussian, the angular component $p(\phi)$ is uniform and shared across all noise scales. As for $p_{\sigma_i}(r)$, we have the following:

Proposition 2: Let $x \in \mathbb{R}^D \sim \mathcal{N}(0, \sigma^2 I)$ and $r = \|x\|_2$. Then
$$p(r) = \frac{1}{2^{D/2 - 1}\Gamma(D/2)} \frac{r^{D-1}}{\sigma^D} \exp\left(-\frac{r^2}{2\sigma^2}\right) \quad \text{and} \quad r - \sqrt{D}\sigma \xrightarrow{d} \mathcal{N}(0, \sigma^2/2) \text{ as } D \to \infty.$$

In practice, the dimensionality of image data ranges from several thousand to several million, typically large enough to warrant $p(r) \approx \mathcal{N}(r \mid \sqrt{D}\sigma, \sigma^2/2)$ with negligible error. Therefore, we take $p_{\sigma_i}(r) = \mathcal{N}(r \mid m_i, s_i^2)$ to simplify the analysis, where $m_i = \sqrt{D}\sigma_i$ and $s_i^2 = \sigma_i^2/2$.

Recall that the goal is to ensure that samples from $p_{\sigma_i}(x)$ cover high-density regions of $p_{\sigma_{i-1}}(x)$. Because $p(\phi)$ is shared across all noise scales, $p_{\sigma_i}(x)$ already covers the angular component of $p_{\sigma_{i-1}}(x)$. Therefore, we need the radial components of $p_{\sigma_i}(x)$ and $p_{\sigma_{i-1}}(x)$ to have a large overlap.

Since $p_{\sigma_{i-1}}(r)$ has high density in $I_{i-1} = [m_{i-1} - 3s_{i-1},\; m_{i-1} + 3s_{i-1}]$ (the "three-sigma rule of thumb"), a natural choice is to fix

$$p_{\sigma_i}(r \in I_{i-1}) = \Phi\left(\sqrt{2D}(\gamma_i - 1) + 3\gamma_i\right) - \Phi\left(\sqrt{2D}(\gamma_i - 1) - 3\gamma_i\right) = C$$

with some moderately large constant $C > 0$ for all $1 < i \leq L$, where $\gamma_i = \sigma_{i-1}/\sigma_i$ and $\Phi(\cdot)$ is the CDF of the standard Gaussian. This choice immediately implies that $\gamma_2 = \gamma_3 = \dots = \gamma_L$, and thus $\{\sigma_i\}_{i=1}^{L}$ forms a geometric progression.

c. Number of noise scales: ideally, we should choose as many noise scales as possible so that $C \approx 1$. However, having too many noise scales makes sampling very costly, as we need to run Langevin dynamics for each noise scale in sequence. On the other hand, $L = 10$ (for $32 \times 32$ images), as in the original NCSN setting (Song & Ermon, 2019), is arguably too small, for which $C \approx 0$ up to numerical precision. To strike a balance, the authors recommend $C \approx 0.5$, which performed well in their experiments.

Technique 2 (Other noise scales): choose $\{\sigma_i\}_{i=1}^{L}$ as a geometric progression with common ratio $\gamma$ such that
$$\Phi\left(\sqrt{2D}(\gamma - 1) + 3\gamma\right) - \Phi\left(\sqrt{2D}(\gamma - 1) - 3\gamma\right) \approx 0.5$$
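A sketch of how one might pick the common ratio $\gamma$ (and hence $L$) so that the overlap constant $C$ is roughly 0.5, using `scipy.stats.norm.cdf` for $\Phi$; the bisection search is just one convenient way to invert the condition:

```python
import numpy as np
from scipy.stats import norm

def overlap_C(gamma, D):
    """C(gamma) = Phi(sqrt(2D)(gamma - 1) + 3 gamma) - Phi(sqrt(2D)(gamma - 1) - 3 gamma)."""
    a = np.sqrt(2 * D) * (gamma - 1)
    return norm.cdf(a + 3 * gamma) - norm.cdf(a - 3 * gamma)

def geometric_sigmas(sigma_1, sigma_L, D, target_C=0.5):
    """Geometric schedule sigma_1 > ... > sigma_L whose common ratio gives C ~= target_C."""
    # For image-sized D, C(gamma) decreases as gamma grows above 1,
    # so a simple bisection on gamma recovers the target overlap.
    lo, hi = 1.0 + 1e-9, 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if overlap_C(mid, D) > target_C else (lo, mid)
    gamma = 0.5 * (lo + hi)
    L = int(np.ceil(np.log(sigma_1 / sigma_L) / np.log(gamma))) + 1
    return sigma_1 * gamma ** (-np.arange(L))

# e.g. sigmas = geometric_sigmas(sigma_1=50.0, sigma_L=0.01, D=3 * 32 * 32)
```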

2. Incorporating the noise information

For high-resolution images we need a large $\sigma_1$ and a huge number of noise scales, per Techniques 1 and 2. Recall that the NCSN is a single amortized network that takes a noise scale and produces the corresponding score. The memory consumption increases linearly with the number of noise scales $L$.

The authors propose an efficient alternative that is easier to implement and more widely applicable. For $p_\sigma(x) = \mathcal{N}(x \mid 0, \sigma^2 I)$, we observe that
$$\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|_2\right] \approx \sqrt{D}/\sigma.$$
Moreover, $\|s_\theta(x, \sigma)\|_2 \propto 1/\sigma$ for an NCSN trained on real data (Song & Ermon, 2019). Because the norm of the score scales inversely proportionally to $\sigma$, we can incorporate the noise information by rescaling the output of an unconditional score network $s_\theta(x)$ with $1/\sigma$. This motivates the following recommendation:

Technique 3 (Noise conditioning): parameterize the NCSN as $s_\theta(x, \sigma) = s_\theta(x)/\sigma$, where $s_\theta(x)$ is an unconditional score network.
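Technique 3 amounts to a one-line wrapper around any unconditional network (the class below is a hypothetical helper, not code from the paper):

```python
import torch.nn as nn

class RescaledScoreNet(nn.Module):
    """Noise-conditional score via Technique 3: s_theta(x, sigma) = s_theta(x) / sigma."""
    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base                 # unconditional score network s_theta(x)

    def forward(self, x, sigma):
        # sigma may be a Python float or a per-sample tensor broadcastable against x.
        return self.base(x) / sigma
```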

3&4. Step size and number of time-steps

In order to sample from an NCSN with annealed Langevin dynamics, we need to specify the number of sampling steps per noise scale $T$ and the step size parameter $\epsilon$ in Algorithm 1.

Annealed Langevin dynamics connect two adjacent noise scales ($\sigma_{i-1} > \sigma_i$) by initializing the Langevin dynamics for $p_{\sigma_i}(x)$ with samples obtained from $p_{\sigma_{i-1}}(x)$. When applying Langevin dynamics to $p_{\sigma_i}(x)$:
$$x_{t+1} \leftarrow x_t + \alpha \nabla_x \log p_{\sigma_i}(x_t) + \sqrt{2\alpha}\, z_t,$$
where $x_0 \sim p_{\sigma_{i-1}}(x)$ and $z_t \sim \mathcal{N}(0, I)$. The distribution of $x_T$ can be computed in closed form:

Proposition 3: Let $\gamma = \frac{\sigma_{i-1}}{\sigma_i}$. For $\alpha = \epsilon\, \frac{\sigma_i^2}{\sigma_L^2}$ (as in Algorithm 1), we have $x_T \sim \mathcal{N}(0, s_T^2 I)$, where
$$\frac{s_T^2}{\sigma_i^2} = \left(1 - \frac{\epsilon}{\sigma_L^2}\right)^{2T}\left(\gamma^2 - \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2}\right) + \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2} \tag{7}$$

When $\{\sigma_i\}_{i=1}^{L}$ is a geometric progression, as advocated by Technique 2, we immediately see that $\frac{s_T^2}{\sigma_i^2}$ is identical across all $1 < i \leq L$ because of the shared $\gamma$. Furthermore, the value of $\frac{s_T^2}{\sigma_i^2}$ has no explicit dependence on the dimensionality $D$.

For better mixing of annealed Langevin dynamics, we want $\frac{s_T^2}{\sigma_i^2}$ to approach 1 across all noise scales, which can be achieved by finding $\epsilon$ and $T$ that minimize the difference between Eq. (7) and 1. Unfortunately, this often results in an unnecessarily large $T$ that makes sampling very expensive for large $L$. As an alternative, the authors propose to first choose $T$ based on a reasonable computing budget (typically $T \times L$ is several thousand) and subsequently find $\epsilon$ by making Eq. (7) as close to 1 as possible. In summary:

Technique 4 (Selecting $T$ and $\epsilon$): choose $T$ as large as allowed by the computing budget, and then select an $\epsilon$ that makes Eq. (7) maximally close to 1.
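Technique 4 is easy to automate; a sketch that evaluates Eq. (7) and grid-searches $\epsilon$ for a fixed budget $T$ (the search range is an arbitrary choice, kept below $\sigma_L^2$ so the per-step contraction $1 - \epsilon/\sigma_L^2$ stays positive):

```python
import numpy as np

def sT2_over_sigma2(eps, T, gamma, sigma_L):
    """Eq. (7): the ratio s_T^2 / sigma_i^2 after T Langevin steps at one noise scale."""
    decay = (1.0 - eps / sigma_L**2) ** 2
    limit = 2.0 * eps / (sigma_L**2 - sigma_L**2 * decay)   # fixed point of the variance recursion
    return decay ** T * (gamma ** 2 - limit) + limit

def choose_eps(T, gamma, sigma_L, n_grid=2000):
    """Pick eps that drives Eq. (7) as close to 1 as possible for the given budget T."""
    grid = np.linspace(1e-4, 0.999, n_grid) * sigma_L ** 2
    vals = np.array([sT2_over_sigma2(e, T, gamma, sigma_L) for e in grid])
    return grid[np.argmin(np.abs(vals - 1.0))]

# e.g. eps = choose_eps(T=5, gamma=1.02, sigma_L=0.01)
```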

Score based diffusion models

Denoising diffusion probabilistic models (Ho et al., 2020)

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) are another class of generative models that sequentially corrupt training data with slowly increasing noise and then learn to reverse this corruption, thereby forming a generative model of the data. Denoising diffusion probabilistic models (DDPMs) train a sequence of probabilistic models to reverse each step of the noise corruption, using knowledge of the functional form of the reverse distributions to make training tractable. For continuous state spaces, the DDPM training objective implicitly computes scores at each noise scale.

Consider a sequence $0 < \beta_1, \beta_2, \dots, \beta_N < 1$. For each training data point $x_0 \sim p_{\text{data}}(x)$, a discrete Markov chain $\{x_0, x_1, \dots, x_N\}$ is constructed such that $p(x_i \mid x_{i-1}) = \mathcal{N}(x_i;\, \sqrt{1 - \beta_i}\, x_{i-1},\, \beta_i I)$. Hence $p_{\alpha_i}(x_i \mid x_0) = \mathcal{N}(x_i;\, \sqrt{\alpha_i}\, x_0,\, (1 - \alpha_i) I)$, where $\alpha_i := \prod_{j=1}^{i}(1 - \beta_j)$. Similarly to score-based modelling with Langevin dynamics, we can denote the perturbed data distribution as $p_{\alpha_i}(\tilde{x}) := \int p_{\text{data}}(x)\, p_{\alpha_i}(\tilde{x} \mid x)\, dx$. The noise scales are prescribed so that $x_N$ is approximately distributed according to $\mathcal{N}(0, I)$. A variational Markov chain in the reverse direction is parameterized with
$$p_\theta(x_{i-1} \mid x_i) = \mathcal{N}\!\left(x_{i-1};\; \frac{1}{\sqrt{1 - \beta_i}}\left(x_i + \beta_i s_\theta(x_i, i)\right),\; \beta_i I\right)$$
and trained with a re-weighted variant of the evidence lower bound (ELBO):
$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} (1 - \alpha_i)\, \mathbb{E}_{p_{\text{data}}(x)}\mathbb{E}_{p_{\alpha_i}(\tilde{x} \mid x)}\left[\left\| s_\theta(\tilde{x}, i) - \nabla_{\tilde{x}} \log p_{\alpha_i}(\tilde{x} \mid x) \right\|_2^2\right]. \tag{8}$$

After solving the above objective to obtain the optimal model $s_{\theta^*}(x, i)$, samples can be generated by starting from $x_N \sim \mathcal{N}(0, I)$ and following the estimated reverse Markov chain:
$$x_{i-1} = \frac{1}{\sqrt{1 - \beta_i}}\left(x_i + \beta_i s_{\theta^*}(x_i, i)\right) + \sqrt{\beta_i}\, z_i, \qquad i = N, N-1, \dots, 1.$$

This method is called ancestral sampling, since it amounts to performing ancestral sampling from the graphical model $\prod_{i=1}^{N} p_\theta(x_{i-1} \mid x_i)$. Similarly to the NCSN objective, Eq. (8) is a weighted sum of denoising score matching objectives, which implies that the optimal model $s_{\theta^*}(\tilde{x}, i)$ matches the score of the perturbed data distribution, $\nabla_{\tilde{x}} \log p_{\alpha_i}(\tilde{x})$. The weights of the $i$-th terms of the two objective functions, namely $\sigma_i^2$ and $1 - \alpha_i$, are related to the corresponding perturbation kernels in the same functional form: $\sigma_i^2 \propto 1 / \mathbb{E}\left[\left\| \nabla_x \log p_{\sigma_i}(\tilde{x} \mid x) \right\|_2^2\right]$ and $(1 - \alpha_i) \propto 1 / \mathbb{E}\left[\left\| \nabla_x \log p_{\alpha_i}(\tilde{x} \mid x) \right\|_2^2\right]$.
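A sketch of the ancestral sampling recursion above, assuming a trained model `score_fn(x, i)` approximating $\nabla_x \log p_{\alpha_i}$ and a list `betas` of length $N$ (indices here are 0-based):

```python
import torch

def ancestral_sample(score_fn, betas, shape):
    """DDPM ancestral sampling in the score parameterization used above:
    x_{i-1} = (x_i + beta_i * s_theta(x_i, i)) / sqrt(1 - beta_i) + sqrt(beta_i) * z_i."""
    x = torch.randn(shape)                        # x_N ~ N(0, I)
    for i in reversed(range(len(betas))):         # i = N-1, ..., 0 (0-based)
        # Implementations commonly drop the noise term on the final step.
        z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = (x + betas[i] * score_fn(x, i)) / (1.0 - betas[i]) ** 0.5 + betas[i] ** 0.5 * z
    return x
```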

Score based diffusion via SDE (Song et al.)

The authors present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. The reverse time SDE depends on the score of the perturbed data distribution. A predictor-corrector framework is introduced to correct errors in the evolution of the discretized reverse-time SDE. The authors also derived an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency.

This work builds on DDPMs and Annealed Langevin dynamics, by generalizing them to infinite noise scales, such that perturbed data distributions evolve according to an SDE as the noise intensifies.

The diffusion process, indexed by a continuous time variable $t \in [0, T]$, where $p_T$ is the prior density, is modelled as the solution to the SDE

$$dx_t = f(x_t, t)\, dt + g(t)\, dw_t$$

where $f(x_t, t)$ is the deterministic drift, $dw_t$ is the infinitesimal noise of a standard Wiener process $w$, and $g(t)$ is the diffusion coefficient of $x(t)$. The SDE has a unique strong solution as long as the coefficients are globally Lipschitz in both state and time. Let $p_{st}(x(t) \mid x(s))$ be the transition kernel from $x(s)$ to $x(t)$, where $0 \leq s < t \leq T$, and let $p_t(x)$ be the probability density of $x(t)$. To estimate $\nabla_x \log p_t(x)$, a time-dependent score-based model is trained with a continuous generalization of the NCSN and DDPM objectives:

$$\theta^* = \arg\min_\theta \mathbb{E}_t\left\{ \lambda(t)\, \mathbb{E}_{x(0) \sim p_0(x)}\, \mathbb{E}_{x(t) \sim p_{0t}(x(t) \mid x(0))}\left[\left\| s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) \right\|_2^2\right] \right\}.$$

Here, $\lambda: [0, T] \to \mathbb{R}_{>0}$ is a positive weighting function, $t$ is uniformly sampled over $[0, T]$, $x(0) \sim p_0(x)$, and $x(t) \sim p_{0t}(x(t) \mid x(0))$. With sufficient data and model capacity, score matching ensures that the optimum, denoted $s_{\theta^*}(x, t)$, equals $\nabla_x \log p_t(x)$ for almost all $x$ and $t$. As in SMLD and DDPM, we can typically choose $\lambda \propto 1 / \mathbb{E}\left[\left\| \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) \right\|_2^2\right]$.

The forward SDE considered here is

$$dx_t = \sigma(t)\, dw_t$$

The corresponding reverse-time SDE is

$$dx = -\sigma^2(t)\, s_\theta(x, t)\, dt + \sigma(t)\, d\bar{w}$$

where $\bar{w}$ is a standard Wiener process when time flows backwards from $T$ to $0$, and $dt$ is an infinitesimal negative timestep.

The Euler–Maruyama method is used to solve the reverse SDE, resulting in the following update (iterated from $t = T$ down to $0$):

  1. $x \leftarrow x - \sigma(t)^2\, s_\theta(x, t)\, \Delta t + \sigma(t)\, z$, with $z \sim \mathcal{N}(0, |\Delta t|\, I)$
  2. $t \leftarrow t + \Delta t$
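A sketch of this Euler–Maruyama discretization, assuming a callable `sigma(t)` for the diffusion coefficient and a trained time-dependent score model `score_fn(x, t)`:

```python
import torch

def reverse_sde_sample(score_fn, sigma, x_T, T=1.0, n_steps=1000):
    """Integrate the reverse-time SDE dx = -sigma(t)^2 s_theta(x, t) dt + sigma(t) dw_bar
    from t = T down to t = 0 with Euler-Maruyama steps (dt < 0)."""
    x, t = x_T.clone(), T
    dt = -T / n_steps                             # infinitesimal negative timestep
    for _ in range(n_steps):
        z = abs(dt) ** 0.5 * torch.randn_like(x)  # z ~ N(0, |dt| I)
        x = x - sigma(t) ** 2 * score_fn(x, t) * dt + sigma(t) * z
        t = t + dt
    return x
```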

Predictor-Corrector (PC) Samplers

Predictor-corrector samplers provide a more accurate solution to the reverse-time SDE. At each time step, the numerical SDE solver first gives an estimate of the sample at the next time step, playing the role of a "predictor". Then, the score-based MCMC approach corrects the marginal distribution of the estimated sample, playing the role of a "corrector". PC samplers generalize the original sampling methods of SMLD and DDPM: the former uses an identity function as the predictor and annealed Langevin dynamics as the corrector, while the latter uses ancestral sampling as the predictor and identity as the corrector.
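A sketch of one way to combine the two, using the Euler–Maruyama update above as the predictor and a few Langevin steps as the corrector (the corrector step size here is a fixed illustrative constant; the paper tunes it adaptively):

```python
import torch

def pc_sample(score_fn, sigma, x_T, T=1.0, n_steps=1000, corrector_steps=1, corrector_eps=1e-5):
    """Predictor-corrector sampling for the reverse-time SDE dx = -sigma(t)^2 s dt + sigma(t) dw_bar."""
    x, t = x_T.clone(), T
    dt = -T / n_steps
    for _ in range(n_steps):
        # Predictor: one reverse-time Euler-Maruyama step.
        z = abs(dt) ** 0.5 * torch.randn_like(x)
        x = x - sigma(t) ** 2 * score_fn(x, t) * dt + sigma(t) * z
        t = t + dt
        # Corrector: Langevin dynamics targeting p_t(x) at the current time t.
        for _ in range(corrector_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * corrector_eps * score_fn(x, t) + corrector_eps ** 0.5 * z
    return x
```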

Noise schedules of score based generative models

In this section we will do a review of noise schedules for score based generative models.

We first review the schedules of the papers already discussed. The NCSN uses a geometric noise schedule $\frac{\sigma_1}{\sigma_2} = \dots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score-based sampling and $\sigma_L \approx 0$. DDPM uses a forward diffusion schedule $\{\beta_t\}$ with $\beta_t = (T - t + 1)^{-1}$, and hence the noise schedule $1 - \alpha_t = 1 - \prod_{j=1}^{t}(1 - \beta_j)$. The SDE formulation takes infinitely many noise levels with the same schedule as ALS, i.e. $\{\sigma_i\}_{i=1}^{L}$ with $L \to \infty$.

Nichol and Dhariwal (2021) introduce the $\alpha$-cosine noise schedule $\alpha_t = \cos(\pi t / 2)$. Under the variance-preserving assumption, $\sigma_t^2 = 1 - \alpha_t^2$, hence $\sigma_t = \sin(\pi t / 2)$. Kingma et al. (2021) observe that the quality of image generation depends on the signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$, or more precisely on $\log \mathrm{SNR}(t)$.

Hoogeboom et al. (2023) observe that noise schedules that work for lower-resolution images need not work for higher-resolution images. They explain this with the example of a $128 \times 128$ image, assuming that the pixels $z_t^{(i)}$, given by $q(z_t^{(i)} \mid x) = \mathcal{N}(z_t^{(i)} \mid \alpha_t x^{(i)}, \sigma_t)$, are independent of each other. Commonly, diffusion models use network architectures that downsample to operate on lower-resolution feature maps, in this case with average pooling. Suppose we average-pool $z_t$, letting indices $1, 2, 3, 4$ denote the pixels in a $2 \times 2$ square being pooled. The new pixel is $z_t^{64 \times 64} = \frac{z_t^{(1)} + z_t^{(2)} + z_t^{(3)} + z_t^{(4)}}{4}$. Using $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2]$ for independent variables and $\mathrm{Var}[aX] = a^2 \mathrm{Var}[X]$ for a constant $a$, and letting $x^{64 \times 64}$ denote the first pixel of the average-pooled input image, we find that $z_t^{64 \times 64} \sim \mathcal{N}(\alpha_t x^{64 \times 64}, \sigma_t / 2)$. The lower-resolution pixel $z_t^{64 \times 64}$ thus carries only half the amount of noise, so the noise schedule is not consistent across resolutions. The authors therefore argue that the noise schedule should be defined with respect to some reference resolution $r \times r$, and they modify the schedule for another resolution $d \times d$ via the SNR as $\mathrm{SNR}^{d \times d}(t) = \mathrm{SNR}^{r \times r}(t) \cdot (r/d)^2$.
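A sketch of this shift, taking the variance-preserving cosine schedule above as the reference $r \times r$ schedule and using $\mathrm{SNR}(t) = \alpha_t^2/\sigma_t^2$ (the reference resolution and helper names are assumptions for illustration):

```python
import numpy as np

def cosine_snr(t):
    """Reference SNR for the alpha-cosine schedule: alpha_t = cos(pi t/2), sigma_t = sin(pi t/2)."""
    return (np.cos(np.pi * t / 2) / np.sin(np.pi * t / 2)) ** 2

def shifted_schedule(t, d, r=64):
    """Shift the reference r x r schedule to resolution d x d:
    SNR_dxd(t) = SNR_rxr(t) * (r/d)^2, then recover (alpha_t, sigma_t)
    under the variance-preserving constraint alpha_t^2 + sigma_t^2 = 1."""
    snr = cosine_snr(t) * (r / d) ** 2
    alpha2 = snr / (1.0 + snr)                    # alpha_t^2 = SNR / (1 + SNR)
    return np.sqrt(alpha2), np.sqrt(1.0 - alpha2)

# e.g. alphas, sigmas = shifted_schedule(np.linspace(0.01, 0.99, 100), d=256, r=64)
```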

Additional references

  1. On the Importance of Noise Scheduling for Diffusion Models: https://arxiv.org/pdf/2301.10972
  2. Improved Techniques for Training Score-Based Generative Models: https://arxiv.org/abs/2006.09011
  3. Common Diffusion Noise Schedules and Sample Steps Are Flawed: https://openaccess.thecvf.com/content/WACV2024/html/Lin_Common_Diffusion_Noise_Schedules_and_Sample_Steps_Are_Flawed_WACV_2024_paper.html