# Variational Coarse-to-Fine

## Variational Auto-Encoder Variations

### Hierarchical VAE (HVAE)

Properties:

- $p$ produces data using a sequence of latent choices, each depending on (potentially) all previous choices
- According to [Zhao 2017b](https://arxiv.org/abs/1702.08396), this has training issues (see LVAE below)

Papers:

- [Kingma 2016](https://arxiv.org/abs/1606.04934) Improving Variational Inference with Inverse Autoregressive Flow
- [Bachman 2016](https://arxiv.org/abs/1612.04739) An Architecture for Deep, Hierarchical Generative Models

### Markov HVAE

- This is an instance of HVAE
- $p$ produces data using a sequence of latent choices, each depending only on the previous one
- According to [Zhao 2017b](https://arxiv.org/abs/1702.08396), this has training issues (see LVAE below)

### Ladder VAE (LVAE)

Properties:

- This is an instance of HVAE
- The guide computation first proceeds bottom-up (deterministically), then top-down (sampling)
- The parameters of the guide and the generative distribution are tied together in the downward pass
- For the Gaussian guide distribution, the parameters $\mu$ and $\sigma$ are computed as precision-weighted averages of the parameters for $p$ and for a "pure" guide $q$ (see the merge rule after this list)
- In the deterministic guide computation, information only comes from the next-lower layer, not directly from the data $x$
- To get learning to work, a warm-up period (gradually turning on the prior KL term) and batch norm are needed
- This architecture supports "explaining away" (but I don't think the paper demonstrates it)
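To make the precision-weighted averaging concrete: writing $\hat{\mu}_{q,i}, \hat{\sigma}_{q,i}$ for the bottom-up ("pure" guide) parameters and $\mu_{p,i}, \sigma_{p,i}$ for the top-down generative parameters at layer $i$ (notation adapted here, not necessarily the paper's), [Sønderby 2016](https://arxiv.org/abs/1602.02282) combines them as a product of Gaussians:

$$
\sigma_{q,i}^2 = \frac{1}{\hat{\sigma}_{q,i}^{-2} + \sigma_{p,i}^{-2}},
\qquad
\mu_{q,i} = \frac{\hat{\mu}_{q,i}\,\hat{\sigma}_{q,i}^{-2} + \mu_{p,i}\,\sigma_{p,i}^{-2}}{\hat{\sigma}_{q,i}^{-2} + \sigma_{p,i}^{-2}}
$$

Whichever of the data-driven bottom-up estimate and the generative top-down estimate has higher precision dominates the guide at that layer.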
Problems according to [Zhao 2017b](https://arxiv.org/abs/1702.08396):

- If $p$ is sufficiently expressive and the ELBO is optimized perfectly, then only the lowest layer is needed; the rest is redundant
- Also, if $p$ is expressive, the latent $z$ will be ignored ([Chen 2016](https://arxiv.org/abs/1611.02731))
- If $p$ is simple, e.g. just a hierarchy of Gaussians (or otherwise a simple sequence of unimodal distributions), no meaningful hierarchies can be learned
  - "Intuitively, for example, the distribution over object subparts for an object category is unlikely to be unimodal and easy to capture with a Gaussian distribution."

Questions:

- Could we implement renormalization group Ising in this framework?
- Why exactly does this architecture (and HVAE more generally) require batch norm and warm-up? Intuitively, what is the source of the learning issues?

Papers:

- [Sønderby 2016](https://arxiv.org/abs/1602.02282) Ladder Variational Autoencoders

### Variational Ladder Auto-Encoder (VLAE)

Properties (see the code sketch after this list):

- The generative process samples the latent random variables $z$ independently, then applies a deterministic transform (a neural net) to get the parameters of the data distribution
- The inference process deterministically computes the values of auxiliary $h$ nodes bottom-up
- The guide distribution for each $z$ is determined using a neural net that takes as input the corresponding $h$ variable
- So, the guide distributions for the different $z$s are conditionally independent given $x$
- This is in effect a VAE with a single layer of randomness
- This architecture supposedly does not have the learning problems outlined in [Zhao 2017b](https://arxiv.org/abs/1702.08396)
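A minimal two-level sketch of this wiring, in PyTorch (the implementation linked under Implementations below is in WebPPL). MLP layers, the layer sizes, Gaussian guides, standard-normal priors, and a Bernoulli pixel likelihood are all assumptions for illustration; the point is only the dependency structure: independent guides $q(z_\ell \mid x)$ computed from the bottom-up $h_\ell$, and a generative path that injects $z_1$ only after deterministically transforming $z_2$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAE(nn.Module):
    """Two-level VLAE sketch: independent latents, deterministic top-down decoder."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        # Bottom-up deterministic path: x -> h1 -> h2
        self.f1 = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.f2 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU())
        # Guide heads: q(z_l | x) = N(mu_l(h_l), sigma_l(h_l)^2)
        self.q1 = nn.Linear(h_dim, 2 * z_dim)
        self.q2 = nn.Linear(h_dim, 2 * z_dim)
        # Top-down deterministic path: z2 -> z~2, [z~2, z1] -> z~1 -> pixel logits
        self.g2 = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.g1 = nn.Sequential(nn.Linear(h_dim + z_dim, h_dim), nn.ReLU())
        self.g0 = nn.Linear(h_dim, x_dim)

    def guide(self, x):
        h1 = self.f1(x)
        h2 = self.f2(h1)
        mu1, logvar1 = self.q1(h1).chunk(2, dim=-1)
        mu2, logvar2 = self.q2(h2).chunk(2, dim=-1)
        return (mu1, logvar1), (mu2, logvar2)

    def decode(self, z1, z2):
        # The lower-level randomness z1 enters only after the higher-level
        # choice z2 has been deterministically transformed.
        zt2 = self.g2(z2)
        zt1 = self.g1(torch.cat([zt2, z1], dim=-1))
        return self.g0(zt1)  # Bernoulli logits over pixels

    def elbo(self, x):
        (mu1, lv1), (mu2, lv2) = self.guide(x)
        z1 = mu1 + torch.randn_like(mu1) * torch.exp(0.5 * lv1)  # reparameterized
        z2 = mu2 + torch.randn_like(mu2) * torch.exp(0.5 * lv2)
        logits = self.decode(z1, z2)
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = sum(-0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1)
                 for mu, lv in [(mu1, lv1), (mu2, lv2)])
        return rec - kl  # per-example ELBO; training maximizes elbo(x).mean()
```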
Problems:

- It's difficult to specify what features you want your latent variables to represent

Questions:

- Could we implement renormalization group Ising in this framework?
  - Probably not. Without making the random choices in the lower levels dependent on the higher ones, it's difficult to see how to do this.
  - $\tilde{z}_1$ takes input from the previous higher layer's $\tilde{z}_2$, so that could put the randomness in $z_1$ through a deterministic transformation that makes it dependent on $z_2$. Does this make the point of keeping these independent redundant? Segue to the next point below.
- Why doesn't the top-most level do all the work? That is, I see how parameter sharing (reuse of lower nets as pieces of deeper nets for higher layers) results in those lower nets doing useful work, but I'm not sure what encourages the lower-level randomness to be used.
  - Wouldn't increasing the dimensionality of (the higher-level randomness) $z_2$ have the same effect? If there are too few degrees of freedom in the overall model, the lower-level randomness gets used. But if the model as a whole has more than enough degrees of freedom, then why would the lower-level ones get used? And I don't think the architecture should rely on there being too few degrees of freedom overall; a big point of VAEs, after all, is that regularization will prevent too much randomness from getting used.
  - To understand this, it might also be useful to understand the Sequential VAE and why the increased dimensionality of the latent representation improves performance (Proposition 5 in [Zhao 2017a](https://arxiv.org/abs/1702.08658)). How different are these architectures really?
- [Zhao 2017a](https://arxiv.org/abs/1702.08658) argues that for expressive $p$, using the ELBO won't work. But in VLAE, $p$ seems pretty expressive as well if we plug in deep nets with many parameters. So will the ELBO run into issues here as well?
- How does this compare to the architecture where the inference nets aren't shared between levels (but we still use a deeper net for "higher" levels)?
- How much of a constraint is the independence assumption?
  - This is reminiscent of the exogenous randomness assumption used in the reparameterization trick and in causal inference

Implementations:

- [WebPPL](https://github.com/stuhlmueller/neural-nets/blob/master/models/vlae.wppl)

Papers:

- [Zhao 2017b](https://arxiv.org/abs/1702.08396) Learning Hierarchical Features from Generative Models

### Sequential VAE (SVAE)

Properties:

- There is a sequence of latent variables $z_i$ and a corresponding sequence of generated data points $x_i$, with $x_n$ representing the final model output
- The generative process for $x_i$ depends on $z_i$ (as usual) but also on $x_{i-1}$
- At each step of the generative process, a loss function in pixel space is used
  - This could be limiting: it might lead to very particular kinds of sequences of latent features, which may not correspond to what we think of as coarse-to-fine. For example, knowing whether a thing is a house or a face might not actually help much with reducing the error in the first step, so the model might choose to encode the lighting conditions instead.
- In the guide, the data $x$ is fed into each latent variable $z_i$ individually
- This is a model with a simple $p$, so it should not suffer from the problems in [Zhao 2017b](https://arxiv.org/abs/1702.08396)

Questions:

- Could we amortize early steps in Sequential VAE more easily than later steps?
- Is there any parameter sharing between different steps? Should there be?

Papers:

- [Zhao 2017a](https://arxiv.org/abs/1702.08658) Towards Deeper Understanding of Variational Autoencoding Models

### Multi-Stage VAE (MSVAE)

Properties:

- The decoder has two stages: the first generates a blurry image, the second sharpens it
- The first stage, $f_{\theta_1}$:
  - consists of deconvolution layers
  - uses an $l_2$ reconstruction loss
- The second stage, $f_{\theta_2}$:
  - takes the output of the first stage as input
  - consists of residual layers with skip connections
  - uses an $l_1$ reconstruction loss
- The encoder consists of a sequence of convolutions
- During training, the sum of the first- and second-stage losses (and the prior loss) is jointly optimized (see the schematic objective after this list), similar to how SVAE jointly optimizes losses for intermediate and final generated $x_i$
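To pin down the joint optimization, here is the objective written out schematically (in PyTorch, matching the earlier sketch); the weighting `lam`, the per-pixel reductions, and the Gaussian-guide KL term are assumptions for illustration, not details taken from [Cai 2017](https://arxiv.org/abs/1705.07202):

```python
import torch

def msvae_loss(x: torch.Tensor, x_blurry: torch.Tensor, x_sharp: torch.Tensor,
               z_mu: torch.Tensor, z_logvar: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # l2 reconstruction loss on the blurry stage-1 output
    stage1 = (x_blurry - x).pow(2).flatten(1).sum(-1)
    # l1 reconstruction loss on the sharpened stage-2 output
    stage2 = (x_sharp - x).abs().flatten(1).sum(-1)
    # prior KL for a Gaussian guide q(z | x) against N(0, I)
    kl = -0.5 * (1 + z_logvar - z_mu.pow(2) - z_logvar.exp()).sum(-1)
    # all three terms are minimized jointly
    return (stage1 + lam * stage2 + kl).mean()
```

Here `x_blurry` stands for the output of $f_{\theta_1}$ and `x_sharp` for the output of $f_{\theta_2}$, which sees $z$ only through the blurry image (the issue raised in the first question below).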
Questions & comments:

- The second stage doesn't have direct access to $z$ and can only depend on it through the blurry image generated by $f_{\theta_1}$. If each blurry image corresponds to multiple sharp images, then how is the overall architecture intended to encode and decode these images without loss?
  - In other words, even if the architecture produced high-quality samples, it's unclear that it would be a good model of the data distribution
- The paper blames the $l_2$ loss for blurry images, but SVAE (see above) generates images that look similarly good to MSVAE's (as far as I can tell) while using an $l_2$ loss.
  - Why not train the entire VAE with the $l_1$ loss if that is what fixes the blurry images?
  - For what it's worth, in my (Andreas's) experiments with learning multimodal distributions, using the $l_1$ loss instead of the $l_2$ loss didn't fix the problem
  - Based on [Zhao 2017a](https://arxiv.org/abs/1702.08658), it seems more likely that the combination of a simple $p$ and a non-discriminative $q$ is to blame
- The generated images still seem kind of blurry

Papers:

- [Cai 2017](https://arxiv.org/abs/1705.07202) Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation

## General notes

- Distinguishing whole-picture and feature-based abstraction/coarse-graining:

  Ishita:

  > I think the HVAE/LVAE/"screening off" type architectures might give the renormalization group type coarse-graining that you spoke about in your previous work as well. But I think the flatter ones from the Ermon group will give representations in which each "layer" provides different information – like lighting/shape/details etc. at different scales of "coarseness" – which all together reconstruct the data.

  Andreas:

  > I don't clearly understand this distinction. For "renormalization group" coarse-graining you also need to add information at each layer, no?
  >
  > That said, on an intuitive level, I do see the difference between having a "complete" picture at each level (that can be compared to - maybe coarsened - data) vs having a partial picture of individual features (that can't be compared to the data on their own). I think these are more ends on a spectrum than clear categories, and the most interesting applications are probably somewhere in between (the coarser levels are neither totally independent features nor a simple down-scaling of a complete picture).

## Applications

### Coarse-to-fine Ising

- What is the goal here?
  - Goal 1: Effectively produce samples at a particular temperature
    - How exactly does this fit into the VAE framework?
  - Goal 2: Conditioning on a particular Ising lattice, reconstruct "nearby" lattices
    - By making small changes to the latent variables at different levels, explore variations at different levels of abstraction
- What distribution would we use for the higher-level latent random choices?
  - A first guess might be multivariate Bernoulli for both the higher-level and the data-generating distributions, but VAEs don't work great with discrete choices (we'd need to use the LR (likelihood-ratio) estimator).
  - In general, this will have to depend on the specific architecture we're using.
  - For VLAE, the guide distributions at different levels are independent given the data; I think this means that the deterministic transform has to do more work and that it's more likely that the latent choices are something like (multivariate) `uniform(0, 1)`.

## References

- [Hinton 2006](http://doi.org/10.1126/science.1127647) Reducing the Dimensionality of Data with Neural Networks
- [Stuhlmüller 2013](https://stuhlmueller.org/papers/inverses-nips2013.pdf) Learning Stochastic Inverses
- [Valpola 2014](https://arxiv.org/abs/1411.7783) From Neural PCA to Deep Unsupervised Learning
- [Steinhardt 2014](http://proceedings.mlr.press/v32/steinhardt14.pdf) Filtering with Abstract Particles
- [Stuhlmüller 2015](https://arxiv.org/abs/1509.02962) Coarse-to-Fine Sequential Monte Carlo for Probabilistic Programs
- [Sønderby 2016](https://arxiv.org/abs/1602.02282) Ladder Variational Autoencoders
- [Miller 2016](https://arxiv.org/abs/1611.06585) Variational Boosting: Iteratively Refining Posterior Approximations
- [Paige 2016](https://arxiv.org/abs/1602.06701) Inference Networks for Sequential Monte Carlo in Graphical Models
- [Chen 2016](https://arxiv.org/abs/1611.02731) Variational Lossy Autoencoder
- [Kingma 2016](https://arxiv.org/abs/1606.04934) Improving Variational Inference with Inverse Autoregressive Flow
- [Bachman 2016](https://arxiv.org/abs/1612.04739) An Architecture for Deep, Hierarchical Generative Models
- [Zhao 2017a](https://arxiv.org/abs/1702.08658) Towards Deeper Understanding of Variational Autoencoding Models
- [Zhao 2017b](https://arxiv.org/abs/1702.08396) Learning Hierarchical Features from Generative Models
- [Cai 2017](https://arxiv.org/abs/1705.07202) Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation