# A Survey on Generative Models
## Summary
Recent progress in generative models is mainly driven by three families of approaches:
**Likelihood-based models** include variational autoencoders (VAEs), normalizing flows, autoregressive models, and energy-based models (EBMs).
These models are trained by maximizing the data likelihood under the model.
VAEs
* Pros: Efficient sampling
* Cons: Posterior collapse; usually need carefully designed architectures for good results
EBMs
* Pros: Very flexible
* Cons: Usually requires MCMC, which is computationally expensive
Normalizing flows
* Cons: Expressivity limitations
**Generative adversarial networks (GANs)**
* Unstable training due to minimax loss
* Mode collapse
**Score-based models**
* Estimate and sample from the *score* (the gradient of the log data density); discussed in detail below
## Variational Autoencoders (VAEs)
Most work on improving VAEs is dedicated to the following statistical challenges:
1. Reducing the gap between approximate and true posterior distributions.
2. Formulating tighter bounds
3. Reducing the gradient noise
4. Extending VAEs to discrete variables
5. Tackling posterior collapse
### NVAE: A Deep Hierarchical Variational Autoencoder (NeurIPS 2020 Spotlight) | [Code](https://github.com/NVlabs/NVAE)
It carefully designs a network architecture that produces high-quality images. The main building block is the *depthwise convolution*, which rapidly increases the receptive field of the network without many parameters (a sketch follows the list below). Batch normalization is an important component, and spectral regularization is crucial for training stability.
1. Depthwise convolutions to increase the receptive field of the network with fewer parameters.
2. Residual parameterization of the approximate posteriors.
3. Spectral regularization for training stability, which minimizes the Lipschitz constant of each layer.
4. Practical solutions to reduce the memory burden, e.g. defining the model in mixed precision.
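
Below is a minimal, illustrative sketch (not the official NVAE cell) of a residual block built around a depthwise convolution, assuming PyTorch; the channel count, expansion factor, and activation choice are assumptions for the example.

```python
import torch
import torch.nn as nn

class DepthwiseResidualCell(nn.Module):
    """Illustrative residual cell: expand channels, apply a depthwise 5x5
    convolution for a large receptive field at low parameter cost, then project back."""
    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, hidden, kernel_size=1),   # expand channels
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=5, padding=2,
                      groups=hidden),                     # depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),   # project back
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(2, 32, 16, 16)
print(DepthwiseResidualCell(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```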

### Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images | [Code](https://github.com/openai/vdvae)
Variational autoencoders consist of a generator $p_{\theta}(x|z)$, a prior $p_{\theta}(z)$, and an approximate posterior $q_{\phi}(z|x)$. The networks parameterized by $\phi$ and $\theta$ are trained end-to-end with backpropagation and the reparameterization trick in order to maximize the ELBO:
\begin{gather*}
\log p_{\theta}(x) \ge E_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}[ q_{\phi}(z|x) \,\|\, p_{\theta}(z)]
\end{gather*}
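
For concreteness, here is a minimal single-sample ELBO estimate in PyTorch, assuming a Gaussian approximate posterior, a standard-normal prior, and a Bernoulli likelihood; `encoder` and `decoder` are hypothetical networks returning $(\mu, \log\sigma^2)$ and pixel logits respectively.

```python
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of the ELBO with a standard-normal prior.
    x is assumed to lie in [0, 1]; the latent is assumed to be of shape (batch, dim)."""
    mu, logvar = encoder(x)                                    # q_phi(z|x) = N(mu, diag(exp(logvar)))
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization trick
    x_logits = decoder(z)                                      # parameters of p_theta(x|z)
    log_px_z = -F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").flatten(1).sum(-1)      # reconstruction term
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1) # analytic KL[q || N(0, I)]
    return (log_px_z - kl).mean()                              # maximize this (minimize its negative)
```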

Hierarchical conditioning is structured so that both the approximate posterior and the prior generate the latent variables in the same order.
Notes:
1. Initialization: scale the last conv layer in each residual bottleneck block by $\frac{1}{\sqrt N}$, where $N$ is the depth.
2. Use **nearest-neighbor upsampling** for the *"unpool"* layer; with transposed convolutions the network may ignore the latent layers at low resolutions (e.g. the 1x1 or 4x4 layers).
3. Skip updates with very large gradient norms (e.g. $10^{15}$), as sketched below.
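
A small sketch of the gradient-norm skipping heuristic mentioned in note 3, assuming PyTorch; the threshold value is illustrative, not the one used in the paper.

```python
import torch

def training_step(model, loss, optimizer, skip_threshold=1e3):
    """Skip parameter updates whose global gradient norm exceeds a threshold
    (the threshold here is illustrative)."""
    optimizer.zero_grad()
    loss.backward()
    # max_norm=inf only measures the total norm without actually clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    if torch.isfinite(grad_norm) and grad_norm < skip_threshold:
        optimizer.step()          # normal update
    # otherwise the batch is skipped entirely
    return grad_norm
```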
## Autoregressive models
### Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
**Challenge 1**
The first obstacle for high-fidelity image generation is the multi-faceted relationship between the likelihood (MLE) score achieved by a model and the fidelity of its samples.
MLE forces the model to support the entire empirical distribution. This guarantees the model's ability to generalize, at the cost of allotting capacity to parts of the distribution that are **irrelevant** to fidelity.
**Challenge 2**
A $256\times 256\times 3$ image has $256\times 256\times 3 = 196608$ dimensions; learning the dependencies among them requires large amounts of memory and computation.
**Multidimensional Upscaling**
Map from one image to another by upscaling the *depth* (bit depth) or the *image size*.

**Subscale Pixel Network (SPN) architecture**
It divides the image into sub-images (slices); each slice is then generated conditioned on the previously generated slices (see the sketch below).
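
A small NumPy sketch of how an image can be divided into interleaved subscale slices; the slice ordering here is illustrative.

```python
import numpy as np

def subscale_slices(image: np.ndarray, s: int):
    """Split an (H, W, C) image into s*s interleaved sub-images of size (H//s, W//s, C).
    Slice (i, j) keeps every s-th pixel starting at row i, column j."""
    return [image[i::s, j::s] for i in range(s) for j in range(s)]

image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
slices = subscale_slices(image, s=2)
print(len(slices), slices[0].shape)  # 4 (4, 4, 3)
```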

## Flow-based Models
### Glow: Generative flow with invertible 1x1 convolutions | [Code](https://github.com/openai/glow)
This paper proposes a generative flow consisting of a series of flow steps combined in a multi-scale architecture.

**Actnorm**: an affine transformation of the activations per channel, similar to batch normalization. Its parameters are initialized such that the post-actnorm activations per channel have zero mean and unit variance.
**Invertible 1x1 conv**
A 1x1 convolution with equal numbers of input and output channels is a generalization of a channel permutation.
The log-determinant of a 1x1 convolution of an $h\times w\times c$ tensor $\mathbf{h}$ with a $c\times c$ weight matrix $W$ is straightforward to compute:
\begin{gather*}
\log\left|\det\left(\frac{d\,\mathrm{conv2D}(\mathbf{h},W)}{d\,\mathbf{h}}\right)\right| = h\cdot w\cdot \log|\det(W)|
\end{gather*}
The cost of computing $\det(W)$ can be reduced from $O(c^3)$ to $O(c)$ by parameterizing $W$ with an LU decomposition:
\begin{gather*}
W = PL(U + \mathrm{diag}(s))
\end{gather*}
where $P$ is a permutation matrix, $L$ is a lower triangular matrix with ones on the diagonal, $U$ is an upper triangular matrix with zeros on the diagonal, and $s$ is a vector. Then
\begin{gather*}
\log|\det(W)| = \mathrm{sum}(\log|s|)
\end{gather*}
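
A quick numerical check of the identity above with NumPy/SciPy; the weight matrix here is random rather than a learned 1x1-convolution kernel.

```python
import numpy as np
from scipy.linalg import lu

c = 8
W = np.random.randn(c, c)                 # stand-in for a 1x1-conv weight matrix
P, L, U = lu(W)                           # W = P @ L @ U, L has unit diagonal
s = np.diag(U)                            # diagonal of U plays the role of s in W = PL(U + diag(s))
logdet_lu = np.sum(np.log(np.abs(s)))     # O(c) once the decomposition is stored
_, logdet_direct = np.linalg.slogdet(W)   # O(c^3) reference computation
print(np.allclose(logdet_lu, logdet_direct))  # True
```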
**Affine coupling layers**
A reversible transformation whose forward pass, inverse, and log-determinant are all computationally efficient.
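
A minimal affine coupling layer sketch in PyTorch; the conditioning network is deliberately much simpler than the one used in Glow, the channel count is assumed even, and the `tanh` bound on the log-scale is an assumption for numerical stability.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Split channels in two halves; the first half parameterizes an affine
    transform of the second, so forward, inverse and log-det are all cheap."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),      # outputs log_scale and shift
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales well behaved
        yb = xb * log_s.exp() + t
        logdet = log_s.flatten(1).sum(-1)         # per-sample log-determinant
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        xb = (yb - t) * (-log_s).exp()
        return torch.cat([ya, xb], dim=1)
```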
### StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows | [Code](https://github.com/RameenAbdal/StyleFlow)
TODO
## VAE + Flows
### Decoupling Global and Local Representations From/for Image Generation
**Posterior Collapse in VAEs**
The ELBO objective in VAEs may not guide the model towards the intended use of the latent variables $Z$; the model may even learn an uninformative $Z$, observed when the $KL$ term vanishes to $0$. The lack of supervision on the latent space to characterize the latent variables $Z$ is the essential reason for *posterior collapse*.
**Local Dependency in Generative Flows**
Generative flows suffer from limited expressiveness due to local dependency: most generative flows tend to capture dependencies among features only locally, and are incapable of synthesizing large images as realistically as GANs.
Unlike VAEs, which can represent high-dimensional data as coordinates in a low-dimensional latent space, in flows the long-term dependencies that usually describe the global features of the data can only be propagated through a composition of transformations.
By embedding a generative flow into a VAE as the decoder, the proposed model is able to learn decoupled representations that capture global and local information of images. The key insight is to exploit the inductive biases of the model: a compression encoder extracts the global information into a low-dimensional representation, while a flow-based decoder, which favors local dependencies, stores the residual information in a local high-dimensional representation.
#### Method

First, the image $x$ is fed into the compression encoder, which compresses the high-dimensional image into a low-dimensional vector; local information is forced to be discarded, yielding a $z$ that captures the global representation. Then, $z$ is used as conditional input to the flow-based decoder, which transforms $x$ into $v$. We can reconstruct $x$ from $v$ and $z$.
The flow-based decoder adopts the backbone of **Glow** -- actnorm, invertible $1\times 1$ convolutions, and coupling layers.
#### Discussion
The supervision of learning latent $z$ comes from the compression encoder, which is encouraged to discard local information, and the preference of the flow-based decoder for capturing local dependencies reinforces global information modeling of the encoder.
From the perspective of the flow-based decoder, the latent code $z$ provides the decoder with the imperative global information, which is essential to resolve the limitation of expressiveness due to local dependency.

## Score-based Models
Likelihood-based models have to use specialized architectures to build a normalized probability model (e.g. autoregressive models, flow models), or use surrogate losses (e.g. the ELBO in VAEs, contrastive divergence in EBMs) for training. GANs are hard to train due to the adversarial training procedure.
Score-based models explore another principle for generative modeling, based on estimating and sampling from the *score*, the gradient of the log density at the input data point. This is a vector field that points in the direction where the log data density grows the most.
### Generative Modeling by Estimating Gradients of the Data Distribution (NeurIPS 2019 Oral)
To tackle the two challenges described below, this paper proposes to *perturb the data with random Gaussian noise of various magnitudes* to ensure that the resulting distribution does not collapse to a low-dimensional manifold, together with an *annealed version of Langevin dynamics*, where we initially use scores corresponding to the highest noise level and gradually anneal down the noise level until it is small enough to be indistinguishable from the original data distribution.
The objective is tractable without special architecture designs or constraints, and can be optimized without adversarial training, MCMC sampling, or other approximations during training.
**Background**
We define the *score* of a probability density $p(x)$ to be $\nabla_x\log p(x)$.
We use a score network $s_{\theta}(x)$ to estimate $\nabla_x\log p_{data}(x)$; the score matching objective to be minimized is
\begin{gather*}
E_{p_{data}(x)}\left[\mathrm{tr}(\nabla_x s_{\theta}(x)) + \frac{1}{2}\left\|s_{\theta}(x)\right\|^2_2\right]
\end{gather*}
however, $\mathrm{tr}(\nabla_x s_{\theta}(x))$ is expensive to compute for high-dimensional data.
**Large scale score matching**
1. Denoising score matching
It first perturbs the data point $x$ with a pre-specified noise distribution $q_{\sigma}(\tilde x|x)$ and then employs score matching to estimate the score of the perturbed distribution $q_{\sigma}(\tilde x)$; the objective equals
\begin{gather*}
\frac{1}{2} E_{q_{\sigma}(\tilde x|x)p_{data}(x)}\left[\left\|s_{\theta}(\tilde x) - \nabla_{\tilde x}\log q_{\sigma}(\tilde x|x)\right\|^2_2\right]
\end{gather*}
2. Sliced score matching
It uses random projections to approximate $\mathrm{tr}(\nabla_x s_{\theta}(x))$ in score matching. The objective is
\begin{gather*}
E_{p_v}E_{p_{data}}\left[v^{\top}\nabla_x s_{\theta}(x)\,v + \frac{1}{2}\left\|s_{\theta}(x)\right\|^2_2\right]
\end{gather*}
**Challenge 1**
If the data is supported on a low-dimensional manifold, the score will be undefined in the ambient space, and score matching will fail to provide a consistent score estimator.
**Challenge 2**
The scarcity of training data in low data density regions hinders the accuracy of score estimation and slows down the mixing of Langevin dynamics sampling.
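
A sketch of the annealed Langevin dynamics sampler described above, assuming a hypothetical score network `score_net(x, sigma)` conditioned on the noise level; the step size is scaled as $\epsilon\,\sigma_i^2/\sigma_L^2$, and all hyperparameter values are illustrative.

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, steps_per_sigma=100, eps=2e-5):
    """Annealed Langevin dynamics: sample with the score of the most heavily
    perturbed distribution first, then anneal the noise level down.
    `sigmas` is assumed to be sorted from largest to smallest."""
    x = torch.rand(shape)                        # uninformative initialization
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2   # step size scaled with the noise level
        for _ in range(steps_per_sigma):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, sigma) + (step ** 0.5) * noise
    return x
```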

### Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021 Oral)
*Score matching with Langevin dynamics* (SMLD) estimates the *score* (i.e. the gradient of the log probability density) at each noise scale and then uses Langevin dynamics to sample from a sequence of decreasing noise scales during generation.
*Denoising diffusion probabilistic modeling* (DDPM) trains a sequence of probabilistic models to reverse each step of the noise corruption, using knowledge of the functional form of the reverse distributions to make training tractable.
Score-based models have proven effective at generating images, but the relationship between SMLD and DDPM was largely unexplored. We use stochastic differential equations (SDEs) to unify and generalize both approaches.

#### Contributions
**Flexible sampling**: We can employ general-purpose SDE solvers to integrate the reverse-time SDE for sampling. In addition, Predictor-Corrector (PC) samplers combine numerical SDE solvers with score-based MCMC, and deterministic samplers are based on the probability flow ODE. The former unifies and improves over existing sampling methods for score-based models; the latter allows exact likelihood computation, efficient and adaptive sampling via black-box ODE solvers, flexible data manipulation via latent codes, and a unique encoding.
**Controllable generation**: We can modulate the generation process by conditioning on information
not available during training, because the conditional reverse-time SDE can be efficiently computed
from unconditional scores. This enables applications such as class-conditional generation, image
inpainting, and colorization using a single unconditional score-based model without re-training.

While our proposed sampling approaches improve results and enable more efficient sampling, they remain slower at sampling than GANs on the same datasets.
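
As an illustration, here is a minimal Euler-Maruyama sampler for the reverse-time variance-exploding (VE) SDE; the `score_net(x, sigma)` interface, the noise schedule, and the integration end point are assumptions for the sketch.

```python
import math
import torch

@torch.no_grad()
def ve_reverse_sde_sampler(score_net, shape, sigma_min=0.01, sigma_max=50.0, n_steps=1000):
    """Euler-Maruyama integration of the reverse-time VE SDE.
    score_net(x, sigma) is assumed to approximate the score of the data
    perturbed with Gaussian noise of standard deviation sigma."""
    x = sigma_max * torch.randn(shape)            # sample from the prior at t = 1
    ts = torch.linspace(1.0, 1e-3, n_steps)       # integrate backwards in time
    dt = ts[0] - ts[1]                            # positive step size
    for t in ts:
        sigma = sigma_min * (sigma_max / sigma_min) ** t          # sigma(t)
        g2 = 2 * sigma ** 2 * math.log(sigma_max / sigma_min)     # g(t)^2 = d sigma^2(t)/dt
        score = score_net(x, sigma)
        x = x + g2 * score * dt + torch.sqrt(g2 * dt) * torch.randn_like(x)
    return x
```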

### Denoising diffusion probabilistic models | [code](https://github.com/hojonathanho/diffusion)
A diffusion probabilistic model is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time.
Transitions of this chain are learned to reverse a diffusion process, a Markov chain that gradually adds noise to the data in the direction opposite to sampling until the signal is destroyed.

**Background**
Diffusion models are latent variable models of the form $p_{\theta}(x_0)=\int p_{\theta}(x_{0:T})dx_{1:T}$, where $x_1,x_2...x_T$ are latents of the same dimensionality as the data $x_0\sim q(x_0)$.
$p_{\theta}(x_{0:T})$ is called the *reverse process*, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(x_T)=N(x_T;0,I)$:
\begin{gather*}
p_{\theta}(x_{0:T}) = p(x_T)\prod^{T}_{t=1}p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) = N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))
\end{gather*}
The approximate posterior $q(x_{1:T}|x_0)$, called the *forward process*, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1,\beta_2,...,\beta_T$:
\begin{gather*}
q(x_{1:T}|x_0) = \prod^{T}_{t=1}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}) = N(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)
\end{gather*}
Training is performed by optimizing the usual variational bound on the negative log likelihood:
\begin{gather*}
E[-\log p_{\theta}(x_0)] \le E_q\left[-\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\right] = E_q\left[-\log p(x_T) - \sum_{t\ge 1}\log\frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right] =: L
\end{gather*}
Further improvements come from variance reduction by rewriting $L$ as
\begin{gather*}
L = E_q\left[D_{KL}(q(x_T|x_0)\,\|\,p(x_T)) + \sum_{t>1}D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)) - \log p_{\theta}(x_0|x_1)\right]
\end{gather*}
**Experiments**

Fig. 7 shows stochastic predictions $x_0\sim p_{\theta}(x_0|x_t)$ with $x_t$ frozen, for various $t$. When $t$ is small, all but fine details are preserved; when $t$ is large, only large-scale features are preserved.
**Conclusion**
While diffusion models might resemble flows and VAEs, diffusion models are designed so that $q$ has no parameters and the top-level latent $x_T$ has nearly zero mutual information with the data $x_0$.
The $\epsilon$-prediction reverse-process parameterization establishes a connection between diffusion models and denoising score matching over multiple noise levels with annealed Langevin dynamics for sampling.
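
A compact sketch of the simplified $\epsilon$-prediction training objective, assuming PyTorch; `eps_net(x_t, t)` is a hypothetical noise-prediction network and the linear $\beta$ schedule is illustrative.

```python
import torch

def ddpm_loss(eps_net, x0, T=1000):
    """Simplified epsilon-prediction objective: perturb x0 to a random timestep t
    with the closed-form forward process, then regress the injected noise."""
    betas = torch.linspace(1e-4, 0.02, T)                 # illustrative linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],))               # one random timestep per sample
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps            # sample from q(x_t | x_0)
    return ((eps_net(x_t, t) - eps) ** 2).mean()          # L_simple
```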

## Energy-Based Models
>Among likelihood-based models, EBMs model the unnormalized data density by assigning low energy to high-probability regions in the data space. Unlike normalizing flows, they require no restrictions on network architectures and are therefore very **expressive**. They also exhibit **robustness** and **out-of-distribution generalization** because, during training, areas with high probability under the model but low probability under the data distribution are penalized explicitly. However, training and sampling EBMs usually require **MCMC**, which can suffer from slow mode mixing and is **computationally expensive** when neural networks represent the energy function.
### Learning Energy-Based Models by Diffusion Recovery Likelihood (ICLR 2021 Poster)
Two challenges remain for training EBMs on high-dimensional datasets. First, learning EBMs by MLE requires MCMC, which is **computationally prohibitive**. Second, the energy potentials learned with **non-convergent MCMC** do not have a valid steady-state, so samples from long-run Markov chains can differ greatly from observed samples.
We propose to train EBMs with *diffusion recovery likelihood*. We perturb the dataset with a sequence of noise distributions and learn a sequence of EBMs to model the *marginal* distributions of the perturbation process. The EBMs are learned by maximizing the recovery likelihood, i.e. the densities of the conditional distributions that reverse each step of the perturbation process; sampling from these conditional distributions is easier than sampling from the marginals.
After learning all marginal EBMs, we can generate image samples by starting from Gaussian white noise and then sampling from each conditional distribution in descending order of noise scale.

**Background**
Let $x \sim p_{data}(x)$ denote a training sample, and $p_{\theta}(x)$ a model probability density that aims to approximate $p_{data}(x)$.
An EBM is defined as $p_{\theta}(x) = \frac{1}{Z}\exp(f_{\theta}(x))$, where $Z = \int \exp(f_{\theta}(x))dx$ is the partition function. MLE maximizes $E_{x\sim p_{data}}[\log p_{\theta}(x)]$, and its gradient follows
\begin{gather*}
\frac{\partial}{\partial \theta} E_{x\sim p_{data}}[\log p_{\theta}(x)] = E_{x\sim p_{data}}\left[\frac{\partial}{\partial \theta}f_{\theta}(x)\right] - E_{x\sim p_{\theta}}\left[\frac{\partial}{\partial \theta}f_{\theta}(x)\right]
\end{gather*}
The expectations can be approximated by MCMC, but it takes a long time to converge for sampling images.
**Recovery Likelihood**
In order to address the difficulty of sampling from $p_{\theta}(x)$, we consider the *recovery likelihood*, defined by the density of the sample **conditioned** on its noisy version perturbed by Gaussian noise.
Let $\hat x = \alpha x + \sigma \epsilon$ be a noisy version of $x$, where $\alpha$ is a positive coefficient and $\epsilon \sim N(0, I)$.
We consider *recovery likelihood*
\begin{gather*}
p_{\theta}(x|\hat x) = \frac{1}{Z}exp(f_{\theta}(x) - \frac{1}{2\sigma ^ 2}||\hat x - x||^2)
\end{gather*}
where $Z = \int exp(f_{\theta}(x) - \frac{1}{2\sigma ^ 2}||\hat x - x||^2)dx$ is a partition function of this conditional EBM.
The extra quadratic term constrains the density to be localized around $\hat x$, making it less multi-modal and easier to sample from.
When $\sigma$ is small, $p_{\theta}(x|\hat x)$ is approximately a unimodal Gaussian distribution, which greatly reduces the burden of MCMC.
We define the *recovery log likelihood function* as
\begin{gather*}
J(\theta) = \frac{1}{n} \sum^{n}_{i=1} \log p_{\theta}(x_i|\hat x_i)
\end{gather*}
The term *recovery* indicates that we can recover the clean $x$ from the noisy $\hat x$; we generate $x$ by $K$ steps of Langevin dynamics that iterate according to
\begin{gather*}
x^{\tau+1} = x^{\tau} + \frac{\delta ^2}{2}\left(\nabla_x f_{\theta}(x^{\tau}) + \frac{1}{\sigma ^2}(\hat x - x^{\tau})\right) + \delta \epsilon^{\tau}
\end{gather*}
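
A sketch of the conditional Langevin sampler above, assuming PyTorch and an energy network `f_theta` returning per-sample energies; the step size `delta` and the number of steps `K` are illustrative.

```python
import torch

def sample_recovery(f_theta, x_hat, sigma, delta=0.1, K=30):
    """K steps of Langevin dynamics targeting p_theta(x | x_hat); the extra
    quadratic term keeps the chain localized around the noisy observation x_hat."""
    x = x_hat.clone()
    for _ in range(K):
        x = x.detach().requires_grad_(True)
        grad_f = torch.autograd.grad(f_theta(x).sum(), x)[0]
        grad_log_p = grad_f + (x_hat - x) / sigma ** 2        # score of the conditional EBM
        x = x + 0.5 * delta ** 2 * grad_log_p + delta * torch.randn_like(x)
    return x.detach()
```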
**Normal Approximation to Recovery Likelihood**
The negative conditional energy is
\begin{gather*}
f_{\theta}(x) - \frac{1}{2\sigma^2}\|\hat x - x\|^2
\end{gather*}
Expanding $f_{\theta}(x)$ to first order around $\hat x$, we can approximate $p_{\theta}(x|\hat x)$ by a Gaussian
\begin{gather*}
p_{\theta}(x|\hat x) \approx N\!\left(x;\ \hat x + \sigma^2\nabla_x f_{\theta}(\hat x),\ \sigma^2 I\right)
\end{gather*}
This normal approximation has two implications:
1. it verifies the fact that the conditional density $p_{\theta}(x|\hat x)$ can be generally easier to sample from when $\sigma$ is small
2. it provides hints for choosing the step size of Langevin dynamics
**Connection to Variational Inference and Score Matching**
Instead of modeling $p_{\theta}(x)$ as an energy-based model, we employ variational inference and represent the conditional density as
\begin{gather*}
p_{\theta}(x|\hat x) = N(x;\hat x + \sigma ^ 2 s_{\theta}(\hat x),\sigma ^2)
\end{gather*}
On the other hand, the training objective of denoising score matching is to minimize
\begin{gather*}
E_{x,\hat x}\left[\frac{1}{2\sigma^2}\left\|x - \hat x - \sigma^2 s_{\theta}(\hat x)\right\|^2\right]
\end{gather*}
where $s_{\theta}(\hat x)$ is the estimated score of $\hat x$.
We can further show that the learning gradient of maximizing the log-likelihood of the **normal approximation** is approximately the same as the learning gradient of maximizing the **original recovery log-likelihood** with one step of Langevin dynamics.
**Diffusion Recovery Likelihood**
Sampling from $p_{\theta}(x|\hat x)$ with MCMC becomes simpler when $\sigma$ is small; on the other hand, the conditional is multimodal and hard to sample from when $\sigma$ is large.
To keep $\sigma$ small while remaining efficient at sampling, we propose to maximize a sequence of recovery likelihoods by chaining together a sequence of perturbation distributions of increasing intensity.
Assume a sequence of perturbed observations $x_0$, $x_1$, ..., $x_T$ such that
\begin{gather*}
x_{t+1} = \alpha_{t+1}x_t + \sigma_{t+1}\epsilon_{t+1}, \quad \epsilon_{t+1}\sim N(0, I)
\end{gather*}
**Experiment**

## EBM + VAE
> VAEs are computationally more efficient for sampling than EBMs, as they do not require running expensive MCMC steps. VAEs also do not suffer from expressivity limitations that normalizing flows face.
>
> VAEs naturally come with a latent embedding of data that allows fast traversal of the data manifold by moving in the **latent space** and mapping the movements to the **data space**.
>
> However, VAEs tend to assign high probability to regions with low density under the data distribution. This often results in blurry or corrupted samples generated by VAEs. This also explains why VAEs often fail at out-of-distribution detection.
### VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models
This paper combines the best of both. Intuitively, the VAE captures the majority of the mode structure of the data distribution, and the energy function focuses on refining the details and reducing the likelihood of the non-data-like regions that the VAE generates.
The energy function, defined in pixel space, also shares similarities with the discriminator in GANs, which can generate crisp and detailed images.
VAE is trained using the reparameterization trick, while the EBM component requires sampling from the joint energy-based model during training.
We show that we can sidestep the difficulties of sampling from VAEBM, by reparametrizing the MCMC updates using VAE’s latent variables. This allows MCMC chains to quickly traverse the model distribution and it speeds up mixing.

**The advantages of two-stage training**
In the first stage, training minimizes the distance between the VAE model and the data distribution; in the second stage, the EBM further reduces the remaining mismatch. As the pre-trained VAE already provides a good approximation, we expect that relatively cheap updates are enough.
**Compare to score-based model**
Some models are also based on *denoising score matching* (DSM) but do not parameterize an explicit energy function, instead directly modeling the vector-valued score function. We view score-based models as alternatives to EBMs trained with maximum likelihood. Although they do not require iterative MCMC during training, they need very long sampling chains to anneal the noise when sampling from the model (around 1000 steps). Therefore, sample generation is extremely slow.
Despite their impressive sample quality, denoising score matching models are slow at sampling, often requiring at least $1000$ MCMC steps. Since VAEBM uses short MCMC chains, it takes only 8.79 seconds to generate 50 CIFAR-10 samples, whereas NCSN takes 107.9 seconds, which is about 12× slower.
**Adversarial training vs. sampling**
The gradient update is similar to that of the WGAN discriminator [9]. The difference is that VAEBM draws samples by MCMC, whereas WGAN draws them from a generator trained in an adversarial game; here we only update the energy function.

## Generative adversarial networks
### A Style-Based Generator Architecture for Generative Adversarial Networks (CVPR 2019)

The main difference from a traditional GAN is that *StyleGAN* trains an 8-layer MLP that transforms the latent code $z$ to $w$, then computes the spatially invariant style $A$ from the vector $w$ instead of from an example image, and injects input *noise* $B$ for *stochastic detail*.
**Style Mixing**
To further encourage the styles to localize, we employ *mixing regularization*, where a given percentage of images are generated using two random latent codes instead of one during training.
Specifically, we run two latent codes $z_1$, $z_2$ through the mapping network and have the corresponding $w_1$, $w_2$ control the styles so that $w_1$ applies before a crossover point and $w_2$ after it. This regularization technique prevents the network from assuming that adjacent styles are correlated.

**Stochastic Variation**
StyleGAN adds per-pixel noise after each 3x3 convolution to create stochastic features without affecting our perception of the image.

**Disentanglement**
A disentangled latent space consists of linear subspaces, each of which controls one factor of variation.

The learned mapping can adapt to *unwarp* $W$ so that the factors of variation become more linear.
Two ways of quantifying disentanglement:
1. Perceptual path length (PPL)
2. Linear separability


### Analyzing and Improving the Image Quality of StyleGAN (CVPR 2020) | [Code](https://github.com/NVlabs/stylegan2)
**Removing normalization artifacts**
Most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets.

We pinpoint the problem to *AdaIN*, which normalizes the mean and variance of each feature map separately, potentially destroying any information found in the relative magnitudes of the features.
We break *AdaIN* into normalization and modulation steps; furthermore, the mean is not needed. This results in Fig. 2(c).

**Weight demodulation**
Weight demodulation removes the artifacts while retaining full controllability. The main idea is to base normalization on the *expected* statistics of the incoming feature maps, without explicit forcing.
The modulation scales each input feature map of the convolution based on the incoming style, which can alternatively be implemented by **scaling the convolution weights**.
The purpose of instance normalization is essentially to remove the effect of this **scaling**; so why not instead scale the weights by $\frac{1}{\sigma}$ and avoid instance normalization altogether? This yields Fig. 2(d).
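
A compact sketch of modulation followed by weight demodulation, assuming PyTorch; it omits the grouped-convolution trick StyleGAN2 uses to apply the per-sample weights efficiently.

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """weight: (out_ch, in_ch, k, k) conv weight; style: (batch, in_ch) per-sample scales.
    Modulation scales the weight by the style; demodulation rescales each output
    feature map to unit expected standard deviation instead of using instance norm."""
    w = weight.unsqueeze(0) * style[:, None, :, None, None]       # (batch, out, in, k, k)
    demod = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4), keepdim=True) + eps)
    return w * demod                                              # demodulated per-sample weights
```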
#### Image quality and generator smoothness
We observe that low PPL (perceptual path length) correlates with high-quality images.

Instead of simply encouraging minimal PPL, we introduce a new regularizer that aims for a smoother generator mapping.
**Lazy regularization**
Compute regularization terms only once every 16 minibatches.
**Path length regularization**
Goal: a fixed-size step in $W$ should result in a non-zero, fixed-magnitude change in the image. To preserve the expected lengths of vectors regardless of direction, we formulate the regularizer as
\begin{gather*}
E_{w,y\sim N(0,I)}\left(\|J^{\top}_{w}y\|_2 - \alpha\right)^2
\end{gather*}
where $y$ are random images with normally distributed pixel intensities, $w\sim f(z)$, and the constant $\alpha$ is set dynamically during optimization as the long-running exponential moving average of the lengths $\|J^{\top}_w y\|_2$, allowing the optimization to find a suitable global scale by itself.
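
A sketch of the path length regularizer, assuming PyTorch; `w` is assumed to be a `(batch, latent_dim)` tensor that requires gradients and through which `fake_images` were generated, and the EMA decay is illustrative.

```python
import torch

def path_length_penalty(fake_images, w, pl_mean, decay=0.01):
    """Path length regularizer: J_w^T y is obtained by backpropagating a random
    image-space direction y through the generator; pl_mean tracks the running
    average that plays the role of the constant alpha."""
    # normalize y so the penalty scale is independent of image resolution
    y = torch.randn_like(fake_images) / (fake_images.shape[2] * fake_images.shape[3]) ** 0.5
    grad, = torch.autograd.grad((fake_images * y).sum(), w, create_graph=True)
    lengths = grad.pow(2).sum(dim=1).sqrt()                 # ||J_w^T y|| per sample
    pl_mean = pl_mean + decay * (lengths.mean().detach() - pl_mean)
    penalty = (lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean
```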
**Progressive growing revisited**
The progressively grown generator appears to have a strong location preference for details

After experiments, we use a skip-connection generator and a residual discriminator.


**Experiment**

## Reference
[1] Arash Vahdat, et al. [*NVAE: A Deep Hierarchical Variational Autoencoder*](https://arxiv.org/abs/2007.03898)
[2] Rewon Child, et al. [*Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images*](https://arxiv.org/abs/2011.10650)
[3] Ruiqi Gao, et al. [*Learning Energy-Based Models by Diffusion Recovery Likelihood*](https://arxiv.org/abs/2012.08125)
[4] Xuezhe Ma, et al. [*Decoupling Global and Local Representations from/for Image Generation*](https://arxiv.org/abs/2004.11820)
[5] Yang Song, et al. [*Score-Based Generative Modeling through Stochastic Differential Equations*](https://arxiv.org/abs/2011.13456)
[6] Zhisheng Xiao, et al. [*VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models*](https://arxiv.org/abs/2010.00654)
[7] Diederik P. Kingma, et al. [*Glow: Generative Flow with Invertible 1x1 Convolutions*](https://arxiv.org/abs/1807.03039)
[8] Jacob Menick & Nal Kalchbrenner, et al. [*Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling*](https://arxiv.org/abs/1812.01608)
[9] Martin Arjovsky, et al. [*Wasserstein GAN*](https://arxiv.org/abs/1701.07875)
[10] Yang Song, et al. [*Generative Modeling by Estimating Gradients of the Data Distribution*](https://arxiv.org/abs/1907.05600)
[11] Tero Karras, et al. [*Analyzing and Improving the Image Quality of StyleGAN*](https://arxiv.org/abs/1912.04958)
[12] Tero Karras, et al. [*A Style-Based Generator Architecture for Generative Adversarial Networks*](https://arxiv.org/abs/1812.04948)