Report on score-based generative modeling v2

# Score based generative models [ToC] ## Score based models ### Representing probability distributions via scores **Idea:** When the probability density function is differentiable, we can compute the gradient of the probability density. In contrast with likelihood models, such as variational autoencoders and energy based models, where we directly work with the probability distribution $p(x)$, the score based representation uses the score function, giving the gradient of the log likelihood with respect to the inputs and not the parameters of the model. $$\text{Score function} = \nabla_x \text{ log }p(x)$$ The score function is a vector valued function, where the direction of the vector field tells us the direction to follow to increase the log likelihood. ![p(x)vsscore](https://hackmd.io/_uploads/HJBje2_jA.png) #### Disadvantages of PDF representation $p(x)$ - Need to make sure that the PDFs are normalized. - Need to find a way of parameterizing curves that are ideally flexible. - Even when the shapes of these densities get very complicated (due to changes in parameters), we need to ensure that the total area under the curve is still fixed (=1). This can be tricky and might require very specific architectures. #### Advantages of representation via score function $\nabla_x \text{ log }p(x)$ - Doesn't need to satisfy any kind of normalization constraint. - Potentially much simpler to work with. ### Score estimation by training score-based models ![Screenshot 2024-08-26 115935](https://hackmd.io/_uploads/B1HtLQciA.png) **Given:** i.i.d samples $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\} \sim p_{\text{data}}(\mathbf{x})$ **Task:** Estimating the score $\nabla_\mathbf{x} \text{ log }p(\mathbf{x})$ **Score Model:** A learnable vector-valued function $s_\theta(\mathbf{x}): \mathbb{R} \rightarrow \mathbb{R}$ **Goal:** $s_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \text{ log }p(\mathbf{x})$ **Objective:** Average Euclidean distance over the whole space. Minimize the Fisher divergence: $$\frac{1}{2} {E}_{\mathbf{x} \sim p_{\text{data}}} \left[\left\|\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\right\|_2^2\right]$$ **Score matching:** ![Screenshot 2024-08-26 121350](https://hackmd.io/_uploads/HJH0FQ9sA.png) #### Score models are not scalable! The most simple way of parameterizing the scores would be via deep neural networks, which would take the i.i.d samples as input, and directly outputs the scores of the model. Score models improve computational time as compared to likelihood based models such as energy based models, which require 2 backprops (one for the scores and one for the gradients) and whose backprop step has a time complexity of $O(d^2)$, where $d$ is the number of dimensions. However the Jacobian calculation during backpropagation in score models is $O(d)$, which is better than likelihood models, but is still dependant on the number of dimensions. Hence score models are not scalable. In order to make the Jacobian computation efficient and make our model scale to high dimensional settings, Vincent (2011) came up with a different way of estimating the score via denoising score matching. ### Denoising score matching ([Vincent, 2011](https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf)) Here, instead of trying to estimate the score of the actual data, we try to estimate (match) the score of a noise perturbed data distribution. Consider the perturbed distribution: $$q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\mathbf{x}; \sigma^2 \mathbf{I}) \quad q_{\sigma}(\tilde{\mathbf{x}}) = \int p(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) d\mathbf{x} $$ So we sample some point $\mathbf{x}$ from the density $p_{\text{data}}(\mathbf{x})$, then randomly sample from some Gaussian noise $q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})$ and convolve that with the data to get the perturbed data distribution $q_{\sigma}(\tilde{\mathbf{x}})$. Score estimation for $\nabla_{\tilde{\mathbf{x}}}\text{log }q_{\sigma}(\tilde{\mathbf{x}})$ is easier (will prove below). If the noise level ($\sigma$) is small, the perturbed distribution is a good approximation of the original distribution, $q_{\sigma}(\tilde{\mathbf{x}}) \approx p(\tilde{\mathbf{x}})$. The Fisher divergence when estimating $\nabla_{\tilde{\mathbf{x}}}\text{log }q_{\sigma}(\tilde{\mathbf{x}})$: $$ \frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[ \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}) - \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2 \right] $$ $$ = \frac{1}{2} \int q_{\sigma}(\tilde{\mathbf{x}}) \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})- \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2 d\tilde{\mathbf{x}} $$ $$ = \frac{1}{2} \int q_{\sigma}(\tilde{\mathbf{x}}) \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})\right\|_2^2 d\tilde{\mathbf{x}} + \frac{1}{2} \int q_{\sigma}(\tilde{\mathbf{x}}) \|\mathbf{s}_{\theta}(\tilde{\mathbf{x}})\|_2^2 d\tilde{\mathbf{x}} - \int q_{\sigma}(\tilde{\mathbf{x}}) \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$ $$ = \text{const.} + \frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}})\|_2^2\right] - \int q_{\sigma}(\tilde{\mathbf{x}}) \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$ We try to simplify the above integral: $$- \int q_{\sigma}(\tilde{\mathbf{x}}) \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$ $$ = - \int q_{\sigma}(\tilde{\mathbf{x}}) \frac{1}{q_{\sigma}(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} = - \int \nabla_{\tilde{\mathbf{x}}} q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$ $$ = - \int \nabla_{\tilde{\mathbf{x}}} \left( \int p_{\text{data}}(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) d\mathbf{x} \right)^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} = - \int \left( \int p_{\text{data}}(\mathbf{x}) \nabla_{\tilde{\mathbf{x}}} q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) d\mathbf{x} \right)^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$ $$ = - \int \left( \int p_{\text{data}}(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) d\mathbf{x} \right)^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} = - \int \int p_{\text{data}}(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\mathbf{x} d\tilde{\mathbf{x}} $$ $$ = - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x}), \tilde{\mathbf{x}} \sim q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[ \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) \right] $$ Hence denoising score matching: $$\frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[ \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}) - \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2 \right]$$ $$= \text{const.} + \frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}})\|_2^2\right] - \int q_{\sigma}(\tilde{\mathbf{x}}) \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}}$$ $$ = \text{const.} + \frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}})\|_2^2\right] - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x}), \tilde{\mathbf{x}} \sim q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[ \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) \right] $$ $$= \text{const.} + \frac{1}{2} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x}), \tilde{\mathbf{x}} \sim q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}})\|_2^2 - 2 \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})^{\top} \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) \right] $$ $$ = \frac{1}{2} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x}), \tilde{\mathbf{x}} \sim q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})\|_2^2 \right] + \text{const.} $$ #### Summary of denoising score matching Core ideas: - Estimate the score of a noise perturbed distribution $$\frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[ \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}) - \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2 \right]$$ $$ = \frac{1}{2} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x}), \tilde{\mathbf{x}} \sim q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})\|_2^2 \right] + \text{const.} $$ - $\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})$ is easy to compute (Gaussian) - $q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\mathbf{x}; \sigma^2 \mathbf{I})$ - $\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = - \frac{\tilde{x}-x}{\sigma^2}$ - Advantages of this method: efficient to optimize even for very high dimensional data, and useful for optimal denoising. - Disadvantage: Cannot estimate the score of clean (noise-free) data. Process for denoising score matching: - Sample a minibatch of datapoints $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\} \sim p_{\text{data}}(\mathbf{x})$ - Sample a minibatch of perturbed datapoints $$\{\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \cdots, \tilde{\mathbf{x}}_n\} \sim q_{\sigma}(\tilde{\mathbf{x}})$$ - Estimate the denoising score matching loss with empirical means $$\frac{1}{2n} \sum_{i=1}^n \left[ \|\mathbf{s}_{\theta}(\tilde{\mathbf{x}}_i) - \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}_i \mid \mathbf{x}_i)\|_2^2 \right]$$ - If Gaussian perturbation: $$\frac{1}{2n} \sum_{i=1}^n \left[ \left\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}}_i) + \frac{\tilde{\mathbf{x}}_i - \mathbf{x}_i}{\sigma^2}\right\|_2^2 \right]$$ - Apply stochastic gradient descent (need to choose a very small $\sigma$). Hence, - Score matching: $\frac{1}{2} \mathbb{E}_{\tilde{\mathbf{x}} \sim q_{\sigma}} \left[ \left\|\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}) - \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2 \right]$ - Denoising: $= \frac{1}{2} \mathbb{E}_{p(\mathbf{x})} \mathbb{E}_{q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})} \left[\left\|\frac{1}{\sigma^2} (\mathbf{x} - \tilde{\mathbf{x}}) - \mathbf{s}_{\theta}(\tilde{\mathbf{x}})\right\|_2^2\right] + \text{const.}$ $\mathbf{s}_{\theta}(\tilde{\mathbf{x}})$ tries to estimate the noise that was added to produce $\tilde{x}$, which can then be removed from the initial noisy data (denoising). Hence we have reduced the problem of image generation to the problem of denoising, which is an easier problem to work with. ### Tweedie's formula This mentions that the optimal denoising strategy is to follow the gradient (score): $$\tilde{x} + \sigma^2 \nabla_\tilde{x} \text{log }q_\sigma(\tilde{x})$$ Given $p(\mathbf{x})$, $q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\mathbf{x}; \sigma^2 \mathbf{I})$, the posterior $p(\mathbf{x \mid \tilde{x}})$ can be defined with Bayes rule as: $$p(\mathbf{x \mid \tilde{x}}) \propto p(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x})$$ We know that, $q_{\sigma}(\tilde{\mathbf{x}}) = \int p(\mathbf{x}) q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) d\mathbf{x}$ Hence Tweedie's formula gives us: $$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x} \mid \tilde{\mathbf{x}})}[\mathbf{x}] = \tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}})\approx \tilde{\mathbf{x}} + \sigma^2 \mathbf{s}_{\theta}(\tilde{\mathbf{x}})$$ Hence Tweedie's formula helps us get a form to recover the original data distribution from the noise perturbed distribution. ## Score based Sampling with Langevin Dynamics ([Welling, Teh, 2011](https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf)) Langevin dynamics can produce samples from a probability density $p(x)$ using only the score function $\nabla_x \text{log } p(x)$. Given a fixed step size $\epsilon > 0$ and an initial value $\tilde{x_0} \sim \pi(x)$, with $\pi(x)$ being a prior distribution, the Langevin method recursively computes the following: $$\tilde{x_t}= \tilde{x_{t-1}} + \frac{\epsilon}{2} \nabla_x \text{log } p(x_{t-1}) + \sqrt{\epsilon} z_t \tag{1}$$ where $z_t \sim \mathcal{N}(0, I)$. Welling and Teh prove in their paper that as $\epsilon\to 0$ and $T \to \infty$, $\tilde{x_T}$ becomes an exact sample from $p(x)$. ### Challenges of score-based sampling 1. **The manifold hypothesis:** The manifold hypothesis states that real world data is usually concentrated in a lower dimensional manifold, embedded in a higher dimensional space (ambient space). Since the score function is a gradient taken in the ambient space, it is undefined when $x$ is confined to a low dimensional manifold. Further the score estimator $s_{\theta}(x)$ gives a good approximation of the score only when the support of the data distribution is the ambient space ([Corollary 2, Hyvärinen](https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf)). 2. **Low data density regions:** In the low data density regions, the score matching may not have enough evidence to estimate score functions accurately, due to lack of data samples. In practice, the expectation $\frac{1}{2} {E}_{\mathbf{x} \sim p_{\text{data}}} \left[\left\|\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\right\|_2^2\right]$ is always estimated using i.i.d samples $\{x_i\}_{i=1}^N \sim p_{data}(x)$. Consider any region $\mathcal{R} \subset \mathbb{R}^D$ such that $p_{data}(\mathcal{R}) \approx 0$. Then in most cases $\{x_i\}_{i=1}^N \cap \mathcal{R} = \phi$ and the score matching will not have sufficient samples to estimate the true score for all $x \in \mathcal{R}$. Further, if the data distribution is a weighted mixture of 2 distributions, the score matching does not learn the weights, and hence the sampling would lead to a different distribution than the one desired. ## Annealed Langevin dynamics and NCSN ([Song & Ermon, 2019](https://arxiv.org/pdf/1907.05600)) Song *et al.* expanded the idea of denoising score matching to score-based sampling. The key idea proposed by them is that perturbing the data with random Gaussian noise makes the support of the data distribution equivalent to the ambient space. Hence the perturbed data is no longer confined to a lower dimensional manifold. Further, large Gaussian noise has the effect of filling low density regions in the original unperturbed distribution, hence the training by sample approximation is better at low density regions, resulting in improved score estimation. The authors use 2 key observations 1. High noise provides useful directional information for Langevin dynamics 2. However, perturbed density no longer approximates the true data density The authors define a noise conditional score network to predict the scores for $L$ noise levels simultaneously. The noise levels form a positive geometric sequence that satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score based sampling, and $\sigma_L \approx 0$. This method makes sure that we utilize the directional information provided by the higher noise perturbation, while closely approximating the true density eventually. Each Langevin dynamics update takes place with $\alpha_i = \epsilon \cdot \sigma_i^2/\sigma_L^2$. The annealed Langevin algorithm is given below: ![Screenshot 2024-12-19 at 1.34.01 AM](https://hackmd.io/_uploads/Sk_PC4WSkx.png =70%x) The noise distribution is chosen to be $q_\sigma (\tilde{x}\mid x, \sigma^2I)$, hence $\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = - \frac{\tilde{x}-x}{\sigma^2}$. Given the denoising score matching objective: $$l(\theta, \sigma) = \frac{1}{2} \mathbb{E}_{q_\sigma(\tilde{x} \mid x)p_{data}(x)}\left[ \left\|\mathbf{s}_{\theta}(\tilde{\mathbf{x}}_i) + \frac{\tilde{\mathbf{x}}_i - \mathbf{x}_i}{\sigma^2}\right\|_2^2 \right] \tag{2}$$ The objective function for the NCSN is then given by: $$\mathcal{L}(\theta; \{\sigma_i\}_{i=1}^L) = \frac{1}{L}\sum_{i=1}^L \lambda(\sigma_i) l(\theta, \sigma_i)\tag{3}$$ where $\lambda(\sigma_i) > 0$ is a coefficient function depending on $\sigma_i$. ### Consistent annealed sampling ([Martineau *et. al.*, 2020](https://arxiv.org/pdf/2009.05475)) Soon after the annealed Langevin algorithm was released, Martineau *et. al.* released their paper observing that the annealed Langevin with denoising score matching performs worse than generative adversarial networks under the Freschet Inception Distance (FID) metric. However, this gap vanishes when denoising the final Langevin samples (using the Tweedie's formula). The authors propose a Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling (ALS). ![Screenshot 2024-12-19 at 1.07.28 PM](https://hackmd.io/_uploads/HJbg-yMrJx.png =70%x) Let the value of $L$ in Algorithm 1 be $L_1$ and that for Algorithm 2 be $L_2$. Then $L_1$ and $L_2$ are related as $L_2 = (L_1-1)T +1$. $\gamma_t = \frac{\sigma_{t+1}}{\sigma_t}$. For the geometric sequence $\gamma = {\frac{\sigma_{L}}{\sigma_1}}^\frac{1}{L-1}$. Further $\{\sigma_i\}_{i=1}^L = \{\gamma^i \sigma_1 \mid i\in \{0, 1, \cdots, L-1\}, \gamma = {\frac{\sigma_{L}}{\sigma_1}}^\frac{1}{L-1} < 1\}$. The ALS makes the assumption that the noise of the sample at step $i$ will be of variance $\sigma_i^2$, an assumption upon which the quality of the estimation of the score depends. The authors show that ALS does not ensure their claimed schedule with a couple of propositions. Before writing the propositions, let us revisit the Tweedie's formula to get the denoised sample using denosing score matching. $$\mathbb{E}_{\mathbf{x} \sim q_\sigma(\mathbf{x} \mid \tilde{\mathbf{x}})}[\mathbf{x}] = \tilde{\mathbf{x}} + \sigma^2 \mathbf{s}_{\theta}(\tilde{\mathbf{x}}) \tag{4}$$ Rewriting this, we can get the optimum score function in the following form: $$\mathbf{s}_{\theta}^*(\tilde{\mathbf{x}}) = \frac{\mathbb{E}_{\mathbf{x} \sim q_\sigma(\mathbf{x} \mid \tilde{\mathbf{x}})}[\mathbf{x}] - \tilde{\mathbf{x}}}{\sigma^2} \tag{5}$$ **Proposition 1:** Let $s^*$ be the optimal score function from the above equation. Following the sampling described in Algorithm 1, the variance of the noise component in the sample $x$ will remain greater than $\sigma_t^2$ at every time step $t$. This means that for $T<\infty$, the sampling hasn't fully converged and the remaining noise is carried over to the next iteration of the Langevin Sampling. Further, for $s_\theta \neq s^*$, the actual noise at every iteration is expected to be even higher than for the best possible score function $s^*$. The Consistent Annealed Sampling (CAS) ensures that the noise levels will follow a prescribed schedule for any sampling step size $\epsilon$ and number of steps $L$. **Proposition 2:** Let $s^*$ be the optimal score function. Following the sampling described in Algorithm 2, the variance of the noise component in the sample $x$ will be consistently equal to $\sigma_t^2$ at every time step $t$. Importantly, proposition 2 holds no matter how many steps $L$ we take to decrease the noise geometrically. Further, proposition 2 only holds when the initial sample is corrupted ($x_0 = I + \sigma_0 z_0$). However, by defining $\sigma_0$ as the maximum Euclidean distance between all pairs of training data points, the noise becomes in practice much greater than the true image, i.e. in practice, sampling from pure noise initialization ($x_0 = \sigma_0 z_t$) becomes indistinguishable from sampling with a data initialization. **Proposition 3:** Given a noise conditional score function, the update rules from Algorithm 1 and 2 are respectively equivalent to the following update rules: 1. $x \leftarrow (1-\eta)x + \eta H(x, \sigma_i) + \sqrt{2\eta} \sigma_i z$ 2. $x \leftarrow (1-\eta)x + \eta H(x, \sigma_i) + \beta \sigma_{i+1} z$ where $z \sim \mathcal{N}(0, I), \eta= \epsilon/\sigma_L^2$ and $H(x, \sigma_i) = \mathbb{E}_{\mathbf{x} \sim q_{\sigma_i}(\mathbf{x} \mid \tilde{\mathbf{x}})}[\mathbf{x}] = \tilde{\mathbf{x}} + \sigma_i^2 \mathbf{s}_{\theta}(\tilde{\mathbf{x}})$ ### Stable training for NCSN ([Song & Ermon, 2020](https://arxiv.org/pdf/2006.09011)) This paper answers some significant questions that were not answered satisfactorily in the ALS paper, and uses them to improve the sampling algorithm. Firstly, following the observations made by Martineau *et. al.*, the authors apply the Tweedie's formula to Algorithm 1. Instead of returning $\tilde{x}_T$, the algorithm would now return the denoised sample $\tilde{x}_T + \sigma_T^2 \mathbf{s}_{\theta}({x_T, \sigma_T})$ to remove the unwanted noise $z'\sim \mathcal{N}(0,\sigma_T^2 I)$. There are many design choices that are critical to the successful training and inference of NCSNs, including 1. The set of noise scales $\{\sigma_i\}^L_{i=1}$, 2. The way that $s_\theta(x, \sigma)$ incorporates information of $\sigma$, 3. The step size parameter $\epsilon$ 4. The number of sampling steps per noise scale $T$ in Algorithm 1. $1. \textbf{ Choosing noise scales}$ Some important questions that remained unanswered from the NCSN paper are the following: a. How should we adjust the initial noise level $\sigma_1$ for different datasets? b. Is geometric progression a good noise schedule? c. How do we choose the number of noise scales $L$ for different datasets? Is a particular choice of $L$ ideal, and if yes, why? a. A large value of $\sigma_1$ would ensure a higher diversity of the final samples. However, an excessively large $\sigma_1$ will require more noise scales and make annealed Langevin dynamics more expensive. Suppose we have a dataset $\{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)} \}$ which is i.i.d. sampled from $p_{\text{data}}(\mathbf{x})$. Assuming $N$ is sufficiently large, we have $p_{\text{data}}(\mathbf{x}) \approx \hat{p}_{\text{data}}(\mathbf{x}) \triangleq \frac{1}{N} \sum_{i=1}^{N} \delta(\mathbf{x} = \mathbf{x}^{(i)})$, where $\delta(\cdot)$ denotes a point mass distribution. When perturbed with $\mathcal{N}(0, \sigma_1^2 \mathbf{I})$, the empirical distribution becomes $\hat{p}_{\sigma_1}(\mathbf{x}) \triangleq \frac{1}{N} \sum_{i=1}^{N} p^{(i)}(\mathbf{x})$ where $p^{(i)}(\mathbf{x}) \triangleq \mathcal{N}(\mathbf{x} \mid \mathbf{x}^{(i)}, \sigma_1^2 \mathbf{I})$. For generating diverse samples regardless of initialization, we naturally expect that Langevin dynamics can explore any component $p^{(j)}(\mathbf{x})$ when initialized from any other component $p^{(i)}(\mathbf{x})$, where $i \neq j$. The performance of Langevin dynamics is governed by the score function $\nabla_{\mathbf{x}} \log \hat{p}_{\sigma_1}(\mathbf{x})$. **Proposition 1:** Let $\hat{p}_{\sigma_1}(\mathbf{x}) \triangleq \frac{1}{N} \sum_{i=1}^{N} p^{(i)}(\mathbf{x})$, where $p^{(i)}(\mathbf{x}) \triangleq \mathcal{N}(\mathbf{x} \mid \mathbf{x}^{(i)}, \sigma_1^2 \mathbf{I})$. With $r^{(i)}(\mathbf{x}) \triangleq \frac{p^{(i)}(\mathbf{x})}{\frac{1}{N} \sum_{k=1}^{N} p^{(k)}(\mathbf{x})}$, the score function is $\nabla_{\mathbf{x}} \log \hat{p}_{\sigma_1}(\mathbf{x}) = \sum_{i=1}^{N} r^{(i)}(\mathbf{x}) \nabla_{\mathbf{x}} \log p^{(i)}(\mathbf{x})$. Moreover, $$ \mathbb{E}_{p^{(i)}(\mathbf{x})}[r^{(j)}(\mathbf{x})] \leq \frac{1}{2} \exp\left(-\frac{\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_2^2}{8\sigma_1^2}\right). \tag{6}$$ In order for Langevin dynamics to transition from $p^{(i)}(\mathbf{x})$ to $p^{(j)}(\mathbf{x})$ easily for $i \neq j$, $\mathbb{E}_{p^{(i)}(\mathbf{x})}[r^{(j)}(\mathbf{x})]$ has to be relatively large, because otherwise $\nabla_{\mathbf{x}} \log \hat{p}_{\sigma_1}(\mathbf{x}) = \sum_{k=1}^{N} r^{(k)}(\mathbf{x}) \nabla_{\mathbf{x}} \log p^{(k)}(\mathbf{x})$ will ignore the component $p^{(j)}(\mathbf{x})$ (on average) when initialized with $\mathbf{x} \sim p^{(i)}(\mathbf{x})$ and in such case Langevin dynamics will act as if $p^{(j)}(\mathbf{x})$ did not exist. The bound of the above equation indicates that $\mathbb{E}_{p^{(i)}(\mathbf{x})}[r^{(j)}(\mathbf{x})]$ can decay exponentially fast if $\sigma_1$ is small compared to $\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_2^2$. As a result, it is necessary for $\sigma_1$ to be numerically comparable to the maximum pairwise distances of data to facilitate transitioning of Langevin dynamics and hence improving sample diversity. In particular, the authors suggest: **Technique 1 (Initial noise scale):** Choose $\sigma_1$ to be as large as the maximum Euclidean distance between all pairs of training data points. b. Other noise scales: For all $1 < i \leq L$, the intuition is that we need reliable gradient signals for $p_{\sigma_i}(x)$ when initializing Langevin dynamics with samples from $p_{\sigma_{i-1}}(x)$. However, an extensive grid search on $\{\sigma_i\}_{i=1}^L$ can be very expensive. To provide theoretical guidance on finding good noise scales, we consider a simple case where the dataset contains only one data point, or equivalently, $\forall i : p_{\sigma_i}(x) = \mathcal{N}(x \mid 0, \sigma_i^2)$. Our first step is to understand the distributions of $p_{\sigma_i}(x)$ better, especially when $x$ has high dimensionality. We can decompose $p_{\sigma_i}(x)$ in hyperspherical coordinates to $p(\phi)p_{\sigma_i}(r)$, where $r$ and $\phi$ denote the radial and angular coordinates of $x$, respectively. Since $p_{\sigma_i}(x)$ is an isotropic Gaussian, the angular component $p(\phi)$ is uniform and shared across all noise scales. As for $p_{\sigma_i}(r)$, we have the following: **Proposition 2** Let $x \in \mathbb{R}^D \sim \mathcal{N}(0, \sigma^2 I)$, and $r = \|x\|_2$. We have $$ p(r) = \frac{1}{2^{D/2-1} \Gamma(D/2)} \frac{r^{D-1}}{\sigma^D} \exp\left(-\frac{r^2}{2\sigma^2}\right) \text{and } r - \sqrt{D}\sigma \overset{d}{\longrightarrow} \mathcal{N}(0, \sigma^2 / 2) \text{ when } D \to \infty. $$ In practice, dimensions of image data can range from several thousand to millions, and are typically large enough to warrant $p(r) \approx \mathcal{N}(r \mid \sqrt{D}\sigma, \sigma^2 / 2)$ with negligible error. Therefore, we take $p_{\sigma_i}(r) = \mathcal{N}(r \mid m_i, s_i^2)$ to simplify our analysis, where $m_i = \sqrt{D}\sigma$ and $s_i^2 = \sigma^2 / 2$. Recall that our goal is to ensure that samples from $p_{\sigma_i}(x)$ will cover high-density regions of $p_{\sigma_{i-1}}(x)$. Because $p(\phi)$ is shared across all noise scales, $p_{\sigma_i}(x)$ already covers the angular component of $p_{\sigma_{i-1}}(x)$. Therefore, we need the radial components of $p_{\sigma_i}(x)$ and $p_{\sigma_{i-1}}(x)$ to have a large overlap. Since $p_{\sigma_{i-1}}(r)$ has high density in $\mathcal{I}_{i-1} = [m_{i-1} - 3s_{i-1}, m_{i-1} + 3s_{i-1}]$ (employing the "three-sigma rule of thumb"), a natural choice is to fix $p_{\sigma_i}(r \in \mathcal{I}_{i-1}) = \Phi(\sqrt{2D}(\gamma_i - 1) + 3\gamma_i) - \Phi(\sqrt{2D}(\gamma_i - 1) - 3\gamma_i) = C$ with some moderately large constant $C > 0$ for all $1 < i \leq L$, where $\gamma_i = \sigma_{i-1} / \sigma_i$ and $\Phi(\cdot)$ is the CDF of the standard Gaussian. This choice immediately implies that $\gamma_1 = \gamma_2 = \ldots = \gamma_L$, and thus $\{\sigma_i\}_{i=1}^L$ forms a geometric progression. c. Number of noise scales: Ideally, we should choose as many noise scales as possible to make $C \approx 1$. However, having too many noise scales will make sampling very costly, as we need to run Langevin dynamics for each noise scale in sequence. On the other hand, $L = 10$ (for $32 \times 32$ images) as in the original setting of [1] is arguably too small, for which $C = 0$ up to numerical precision. To strike a balance, we recommend $C = 0.5$, which performs well in our experiments. **Technique 2 (Other noise scales)** Choose $\{\sigma_i\}_{i=1}^L$ as a geometric progression with common ratio $\gamma$, such that $\Phi(\sqrt{2D}(\gamma - 1) + 3\gamma) - \Phi(\sqrt{2D}(\gamma - 1) - 3\gamma) \approx 0.5$ $2. \textbf{ Incorporating the noise information}$ For high resolution images, we need a large $\sigma_1$ and a huge number of noise scales as per Technique 1 and 2. Recall that the NCSN is a single amortized network that takes a noise scale and gives the corresponding score. The memory consumption increases linearly with the number of time scales $L$. The authors propose an efficient alternative that is easier to implement and more widely applicable. For $p_{\sigma}(x) = \mathcal{N}(x \mid 0, \sigma^2 I)$ we observe that $$ \mathbb{E}[\|\nabla_x \log p_{\sigma}(x)\|_2] \approx \sqrt{D}/\sigma. $$ Moreover, $\|s_{\theta}(x, \sigma)\|_2 \propto 1/\sigma$ (Song, Ermon, 2019) for a trained NCSN on real data. Because the norm of score functions scales inversely proportional to $\sigma$, we can incorporate the noise information by rescaling the output of an unconditional score network $s_{\theta}(x)$ with $1/\sigma$. This motivates our following recommendation: **Technique 3 (Noise conditioning)** Parameterize the NCSN with $s_{\theta}(x, \sigma) = \frac{s_{\theta}(x)}{\sigma}$ where $s_{\theta}(x)$ is an unconditional score network. $3\&4. \textbf{ Step size and number of time-steps}$ In order to sample from an NCSN with annealed Langevin dynamics, we need to specify the number of sampling steps per noise scale $T$ and the step size parameter $\epsilon$ in Algorithm 1. Annealed Langevin dynamics connect two adjacent noise scales ($\sigma_{i-1} > \sigma_i$) by initializing the Langevin dynamics for $p_{\sigma_i}(x)$ with samples obtained from $p_{\sigma_{i-1}}(x)$. When applying Langevin dynamics to $p_{\sigma_i}(x)$: $$ x_{t+1} \leftarrow x_t + \alpha \nabla_x \log p_{\sigma_i}(x_t) + \sqrt{2\alpha} z_t, $$ where $x_0 \sim p_{\sigma_{i-1}}(x)$ and $z_t \sim \mathcal{N}(0, I)$. The distribution of $x_T$ can be computed in closed form: **Proposition 3:** Let $\gamma = \frac{\sigma_{i-1}}{\sigma_i}$. For $\alpha = \epsilon \cdot \frac{\sigma_i^2}{\sigma_L^2}$ (as in Algorithm 1), we have $x_T \sim \mathcal{N}(0, s_T^2 I)$, where $$ \frac{s_T^2}{\sigma_i^2} = \left(1 - \frac{\epsilon}{\sigma_L^2}\right)^{2T} \left(\gamma^2 - \frac{2\epsilon}{\sigma_L^2 - \sigma_i^2}\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2\right) + \frac{2\epsilon}{\sigma_L^2 - \sigma_i^2}\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2 \tag{7} $$ When $\{\sigma_i\}_{i=1}^L$ is a geometric progression as advocated by Technique 2, we immediately see that $\frac{s_T^2}{\sigma_i^2}$ is identical across all $1 < i \leq T$ because of the shared $\gamma$. Furthermore, the value of $\frac{s_T^2}{\sigma_i^2}$ has no explicit dependency on the dimensionality $D$. For better mixing of annealed Langevin dynamics, we hope $\frac{s_T^2}{\sigma_i^2}$ approaches 1 across all noise scales, which can be achieved by finding $\epsilon$ and $T$ that minimize the difference between Eq.(7) and 1. Unfortunately, this often results in an unnecessarily large $T$ that makes sampling very expensive for large $L$. As an alternative, we propose to first choose $T$ based on a reasonable computing budget (typically $T \times L$ is several thousand) and subsequently find $\epsilon$ by making Eq.(7) as close to 1 as possible. In summary: **Technique 4 (selecting $T$ and $\epsilon$):** Choose $T$ as large as allowed by a computing budget and then select an $\epsilon$ that makes Eq.(7) maximally close to 1. ## Score based diffusion models ### Denoising diffusion probabilistic models ([Ho *et. al.*, 2020](https://arxiv.org/pdf/2006.11239)) Diffusion models ([Sohl-Dickstein *et. al.*](https://arxiv.org/abs/1503.03585), [Ho *et. al.*](https://arxiv.org/pdf/2006.11239)) are another class of generative models involved in sequentially corrupting training data with slowly increasing noise, and then learning to reverse this corruption in order to form a generative model of the data. Denoising diffusion probabilistic models (DDPM) trains a sequence of probabilistic models to reverse each step of the noise corruption, using knowledge of the functional form of the reverse distributions to make training tractable. For continuous state spaces, the DDPM training objective implicitly computes scores at each noise scale. Considering a sequence $0 < \beta_1, \beta_2, \cdots, \beta_N< 1$. For each training data point $x_0 \sim p_{data}(x)$, a discrete Markov chain $\{x_0, x_1, \cdots x_N\}$ is constructed such that $p(x_i \mid x_{i-1}) = \mathcal{N}(x_i ; \sqrt{1 - \beta_i} x_{i-1}, \beta_i I)$. Hence, $p_{\alpha_i}(x_i \mid x_{0}) = \mathcal{N}(x_i ; \sqrt{\alpha_i} x_{i-1}, (1-\alpha_i) I)$, where $\alpha_i := \prod_{j=1}^i (1 - \beta_j)$. Similar to score based modelling with Langevin dynamics, we can denote the perturbed data distribution as $p_{\alpha_i}(\tilde{\mathbf{x}}) := \int p_{\text{data}}(\mathbf{x}) p_{\alpha_i}(\tilde{\mathbf{x}} | \mathbf{x}) \, \mathrm{d}\mathbf{x}$. The noise scales are prescribed such that $\mathbf{x}_N$ is approximately distributed according to $\mathcal{N}(\mathbf{0}, \mathbf{I})$. A variational Markov chain in the reverse direction is parameterized with $p_\theta(\mathbf{x}_{i-1} | \mathbf{x}_i) = \mathcal{N}\left(\mathbf{x}_{i-1}; \frac{1}{\sqrt{1 - \beta_i}} \left(\mathbf{x}_i + \beta_i \mathbf{s}_\theta(\mathbf{x}_i, i)\right), \beta_i \mathbf{I}\right)$ and trained with a re-weighted variant of the evidence lower bound (ELBO): $$ \theta^* = \arg \min_\theta \sum_{i=1}^N (1 - \alpha_i) \mathbb{E}_{p_{\text{data}}(\mathbf{x})} \mathbb{E}_{p_{\alpha_i}(\tilde{\mathbf{x}} | \mathbf{x})} \left[ \|\mathbf{s}_\theta(\tilde{\mathbf{x}}, i) - \nabla_{\tilde{\mathbf{x}}} \log p_{\alpha_i}(\tilde{\mathbf{x}} | \mathbf{x}) \|^2 \right] \tag{1}. $$ After solving the above equation to get the optimal model $\mathbf{s}_{\theta^*}(\mathbf{x}, i)$, samples can be generated by starting from $\mathbf{x}_N \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and following the estimated reverse Markov chain as below: $$ \mathbf{x}_{i-1} = \frac{1}{\sqrt{1 - \beta_i}} \left( \mathbf{x}_i + \beta_i \mathbf{s}_{\theta^*}(\mathbf{x}_i, i) \right) + \sqrt{\beta_i} \mathbf{z}_i, \quad i = N, N-1, \cdots, 1. $$ This method is called ancestral sampling, since it amounts to performing ancestral sampling from the graphical model $\prod_{i=1}^N p_\theta(\mathbf{x}_{i-1} | \mathbf{x}_i)$. Similar to NCSN, equation (1) is also a weighted sum of denoising score matching objectives which implies that the optimal model, $\mathbf{s}_{\theta^*}(\tilde{\mathbf{x}}, i)$, matches the score of the perturbed data distribution, $\nabla_{\tilde{\mathbf{x}}} \log p_{\alpha_i}(\tilde{\mathbf{x}})$. The weights of the $i$-th terms of both objective functions, namely $\sigma_i^2$ $1-\alpha_i$, are related to corresponding perturbation kernels in the same functional form $\sigma_i^2 \propto 1/\mathbb{E}[\|\nabla_\mathbf{x} \log p_{\sigma_i}(\tilde{\mathbf{x}} | \mathbf{x})\|^2]$ and $(1 - \alpha_i) \propto 1/\mathbb{E}[\|\nabla_\mathbf{x} \log p_{\alpha_i}(\tilde{\mathbf{x}} | \mathbf{x})\|^2]$. ### Score based diffusion via SDE ([Song *et. al.*](https://arxiv.org/pdf/2011.13456)) The authors present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. The reverse time SDE depends on the score of the perturbed data distribution. A predictor-corrector framework is introduced to correct errors in the evolution of the discretized reverse-time SDE. The authors also derived an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. This work builds on DDPMs and Annealed Langevin dynamics, by generalizing them to infinite noise scales, such that perturbed data distributions evolve according to an SDE as the noise intensifies. The diffusion process, indexed by a continuous time variable $t \in [0,T]$, where $p_T$ is the prior density is modelled as the solution to the SDE $$dx_t = f(x_t, t)dt + g(t)dw_t$$ where $f(x_t, t)$ corresponds to the deterministic drift and $dw_t$ is the infinitesimal noise. Hence $w$ is a standard Weiner process. $g(t)$ is the diffusion coefficient of $x(t)$. The SDE has a unique strong solution as long as the coefficients are globally Lipschitz in both state and time. Let $p_{st}(x(t) \mid x(s))$ be the transition kernel from $x(s)$ to $x(t)$, where $0 \leq s < t \leq T$ and let p_t(x) be the probability density of x(t). To estimate $\nabla_x \text{log }p_t(x)$, a time-dependent score-based model is trained as a continuous generalization to NCSN and DDPM objectives. $$ \theta^* = \arg \min_{\theta} \mathbb{E}_t \left\{ \lambda(t) \mathbb{E}_{\mathbf{x}(0) \sim p_0(\mathbf{x})} \mathbb{E}_{\mathbf{x}(t) \sim p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))} \left[ \left\| s_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) \right\|_2^2 \right] \right\}. $$ Here, $\lambda : [0, T] \to \mathbb{R}_{>0}$ is a positive weighting function, $t$ is uniformly sampled over $[0, T]$, $\mathbf{x}(0) \sim p_0(\mathbf{x})$, and $\mathbf{x}(t) \sim p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$. With sufficient data and model capacity, score matching ensures that the optimum score, denoted by $s_{\theta^*}(\mathbf{x}, t)$, equals $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ for almost all $\mathbf{x}$ and $t$. As in SMLD and DDPM, we can typically choose $\lambda \propto 1 / \mathbb{E} \left[ \| \nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) \|_2^2 \right]$. The forward SDE is given by: $$dx_t = \sigma(t)dw_t$$ The corresponding reverse time SDE is: $$d\mathbf{x} = -\sigma^2(t)s_{\theta}(\mathbf{x}, t)dt + \sigma(t)d\mathbf{\bar{w}}$$ where $\mathbf{\bar{w}}$ is a standard Wiener process when time flows backwards from $T$ to $0$ and $dt$ is an infinitesimal negative timestep. The Euler--Maruyama method is used to solve the SDE, resulting in the following equations: 1. $x \leftarrow x - \sigma(t)^2s_{\theta}(\mathbf{x}, t) \Delta t + \sigma(t) z$ ($z \sim \mathcal{N}(0, |\Delta t|I)$) 2. $t \leftarrow t+ \Delta t$ **Predictor-Corrector (PC) Samplers** This provides a better solution to the reverse time SDEs. At each time step, the numerical SDE solver first gives an estimate of the sample at the next time step, playing the role of a “predictor”. Then, the score-based MCMC approach corrects the marginal distribution of the estimated sample, playing the role of a “corrector”. PC samplers generalize the original sampling methods of SMLD and DDPM: the former uses an identity function as the predictor and annealed Langevin dynamics as the corrector, while the latter uses ancestral sampling as the predictor and identity as the corrector. ## Noise schedules of score based generative models In this section we will do a review of noise schedules for score based generative models. We first review the schedules for the papers that have already been discussed. The NCSN uses a geometric noise schedule $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$, where $\sigma_1$ is large enough to mitigate the difficulties with score based sampling, and $\sigma_L \approx 0$. DDPM uses a forward diffusion schedule $\beta_{2,..., T}$ where $\beta_t = (T-t+1)^{-1}$ and hence the noise schedule $1-\alpha_t = 1- \prod_{j=1}^t (1 - \beta_j)$. The SDE method takes infinite noise levels with the same schedule as ALS: $\{\sigma_i\}_{i=1}^L$ where $L\to \infty$. [Nichol, Dhariwal, 2021](https://arxiv.org/pdf/2102.09672) introduces the $\alpha$-cosine noise schedule $\alpha_t = cos(\pi t/2)$. Under the assumptions that the variances are preserved, $\sigma_t^2 = 1-\alpha_t^2$. Hence $\sigma_t = sin(\pi t/2)$. [Kingma *et. al*, 2021](https://arxiv.org/abs/2107.00630) observes that the quality of image generation depends on the signal to noise ratio (SNR) $\sigma_t/\alpha_t$ or more importantly log $\sigma_t/\alpha_t$. [Hoogeboom *et. al.*, 2023](https://arxiv.org/pdf/2301.11093) observes that the noise schedules that work for lower resolution images need not work for higher resolution images. The authors explain it by the example of a $128 \times 128$ image, and assuming that the pixels $z_t^{(i)}$ given by $q(z_t^{(i)} | \mathbf{x}) = \mathcal{N}(z_t^{(i)} | \alpha_t x_i, \sigma_t)$ are independent of each other . Commonly, diffusion models use network architectures that use downsampling to operate on lower resolution feature maps, in our case with average pooling. Suppose we average pool $z_t$, where we let indices $1, 2, 3, 4$ denote the pixels in a $2 \times 2$ square that is being pooled. This new pixel is $z_t^{64 \times 64} = \frac{z_t^{(1)} + z_t^{(2)} + z_t^{(3)} + z_t^{(4)}}{4}$. Given that we know that $\text{Var}[X_1 + X_2] = \text{Var}[X_1] + \text{Var}[X_2]$ and that $\text{Var}[aX] = a^2 \text{Var}[X]$ for a constant $a$. Letting $x^{64 \times 64}$ denote the first pixel of the average pooled input image, we find that $z_t^{64 \times 64} \sim \mathcal{N}(\alpha_t x^{64 \times 64}, \sigma_t / 2)$. The lower resolution pixel $z_t^{64 \times 64}$ only has half the amount of noise. Hence the noise schedule isn't consistent over different resolutions. Hence, the authors argue that the noise schedule could be defined with respect to some reference resolution $r\times r$. The authors then modify the schedule for some other resolution $d \times d$ via the SNR as $\text{SNR}^{d \times d}(t)= \text{SNR}^{r \times r}(t)\cdot (r/d)^2$. ## Additional references 1. On the Importance of Noise Scheduling for Diffusion Models: https://arxiv.org/pdf/2301.10972 2. Improved Techniques for Training Score-Based Generative Models: https://arxiv.org/abs/2006.09011 3. Common Diffusion Noise Schedules and Sample Steps Are Flawed: https://openaccess.thecvf.com/content/WACV2024/html/Lin_Common_Diffusion_Noise_Schedules_and_Sample_Steps_Are_Flawed_WACV_2024_paper.html