# SDN
## Introduction
1. SDN is a generative model for sets of images, and it works in an energy-based fashion.
2. SDN jointly learns a __set encoder__, __set discriminator__, __set generator__, and __set prior__, and is trained as an adversarial game.
3. SDN is able to reconstruct image sets that preserve salient attributes of the inputs, and is also able to generate novel objects/identities, which makes it different from a class-conditional GAN.
## Examples
1. Problem setting:
- SDN takes the top-row set from the dataset as input and encodes it into a latent vector $z$ that contains the 3D information of that set ($z$ describes the set as a whole, e.g. a truck here).
- Then, a generator can generate different views from this low-dimensional representation of the set.
2. ShapeNet dataset:

- Top row: a set of input images from the dataset.
- Bottom row: reconstructions of that set from different viewpoints.
3. VggFace2 dataset:

- During reconstruction the exact identity of a person is not preserved.
- What is supposed to be preserved is the rough identity of the person in the images, and to some extent the image composition (background).
- We could not use a class-conditional generative model here, because we want a model to which we can feed a new identity that was not seen during training, and we do not know how many sets there will be in the end.
## Architectures

### Encoder:

1. Structure:
- Input: a set of images of the same identity from different viewpoints.
- Output: set representation $z(x)$.
2. Pool & binarize:
(1) The set representation should be independent of the ordering of the inputs, and also independent of the size of the set (how many views of this car there are).
(2) Pooling: $\frac{1}{N}\sum_{i=1}^N{c(x_i)}$, an average of the hidden representations $c(x_i)$ (the per-image encoders in the figure are convolutional networks).
- We discard the information about the exact ordering and size here. Thus, $z(x)$ will be independent of a particular rendering position, and will only depend on the fact that this is a car.
(3) Binarize:

- We could not use a one-hot encoding here because we do not know how many sets there are.
- We also could not use a continuous representation, because then we could not encode distinct set identities.
- SDN encodes each class as a binary combination of $\{-1, 1\}$. This is a way to encode a large number of set identities in a low-dimensional vector.
- Binarize takes the output of the pooling and clamps it to either $-1$ or $1$.
- The discriminator and generator then take this binary vector as the set-identity information (a code sketch follows this list).
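
To make the pool-and-binarize step concrete, here is a minimal PyTorch sketch of a set encoder. The layer sizes, the straight-through gradient for the sign function, and all names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Sketch of the SDN set encoder: per-image CNN -> mean pool -> binarize."""
    def __init__(self, z_dim=64):
        super().__init__()
        # Per-image encoder c(x): a small convolutional network (placeholder sizes).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, z_dim),
        )

    def forward(self, x_set):
        # x_set: (N, 3, H, W) -- N views of the same identity, in any order.
        c = self.cnn(x_set)                # (N, z_dim) per-image codes c(x_i)
        pooled = c.mean(dim=0)             # order- and size-invariant pooling
        # Binarize to {-1, +1}; a straight-through estimator keeps it differentiable
        # (the paper may treat the gradient differently).
        z_hard = torch.where(pooled >= 0, torch.ones_like(pooled), -torch.ones_like(pooled))
        return pooled + (z_hard - pooled).detach()
```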
### Generator:

1. Structure:
- Input: set identity $z$ and random noise ${z}'_i$.
- Output: different instances of that set.
- ${z}'_i$ is sampled from a latent (Gaussian) distribution, ${z}'_i \sim P_\psi$ (a code sketch follows this list).
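
A minimal sketch of such a generator, assuming the set identity $z$ is simply concatenated with the per-sample noise; the layer sizes and output resolution are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SetGenerator(nn.Module):
    """Sketch of the SDN generator: set identity z plus per-sample noise z'_i -> images."""
    def __init__(self, z_dim=64, noise_dim=32):
        super().__init__()
        self.noise_dim = noise_dim
        self.fc = nn.Linear(z_dim + noise_dim, 128 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z, num_samples=4):
        # z: (z_dim,) binary set identity shared by all samples of the set.
        noise = torch.randn(num_samples, self.noise_dim)        # z'_i ~ N(0, I)
        h = torch.cat([z.expand(num_samples, -1), noise], dim=1)
        h = self.fc(h).view(num_samples, 128, 4, 4)
        return self.deconv(h)                                   # (num_samples, 3, 32, 32)
```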
### Discriminator:

1. Structure:
- Input: images $x$ and the set identity $z$.
- Output:
    - Regular energy-based GAN part: a real number representing the energy value, computed by the 3-layer MLP on the right side.
    - Additional reconstruction pipeline: compare $x$ with the output of a decoder using an MSE metric (this stabilizes the training process).
- We cannot simply compare images to each other as we would in a regular energy-based GAN, because in SDN the images also have to be checked for consistency with the set identity $z$ (a code sketch follows this list).
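
A minimal sketch of such a conditional discriminator with both heads, assuming 32×32 inputs and placeholder layer sizes; the paper's architecture and the exact way $z$ is injected will differ.

```python
import torch
import torch.nn as nn

class SetDiscriminator(nn.Module):
    """Sketch of the SDN discriminator: an energy head plus a reconstruction head,
    both conditioned on the set identity z."""
    def __init__(self, z_dim=64, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Energy head: a 3-layer MLP on the image feature concatenated with z.
        # Softplus keeps the energy value non-negative (an assumption to match the description).
        self.energy_mlp = nn.Sequential(
            nn.Linear(feat_dim + z_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),
        )
        # Reconstruction head: decode the (feature, z) pair back to an image.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + z_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x, z):
        # x: (N, 3, 32, 32) images, z: (z_dim,) set identity shared by the whole set.
        h = torch.cat([self.cnn(x), z.expand(x.shape[0], -1)], dim=1)
        energy = self.energy_mlp(h).squeeze(1)                           # per-image energy value
        recon_err = ((x - self.decoder(h)) ** 2).flatten(1).mean(dim=1)  # MSE reconstruction term
        return energy, recon_err
```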
### Overview:
- Take a set $X$ (different images of the same identity) from the dataset.
- Then feed $X$ to the encoder and get a latent representation $z$ for that set that encodes the identity of the objects. If the encoder is trained well, it does not need to have seen that object before.
- After that, we feed the binary representation $z$ into the generator together with some noise. We get a set of images showing different views of objects with a similar identity and a similar image style.
- Finally, the discriminator tells us whether the input images come from the dataset or from the generator (a wiring sketch of the full pipeline follows).
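
Putting the sketched modules above together for one set (dummy data; the variable names are hypothetical):

```python
import torch

encoder, generator, discriminator = SetEncoder(), SetGenerator(), SetDiscriminator()

X = torch.randn(5, 3, 32, 32)             # a set of 5 views of one identity (dummy data)
z = encoder(X)                            # binary set identity in {-1, +1}^64
X_fake = generator(z, num_samples=5)      # generated views sharing that identity
e_real, r_real = discriminator(X, z)      # real images matching z -> low energy expected
e_fake, r_fake = discriminator(X_fake, z) # generated images -> high energy expected
```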
## Methodology
### Set Distribution Networks

1. Discrete latent space $Z$:
- A number of mathematical problems arise from the fact that $z$ lives in a discrete space.
- $p_\theta(\cdot)$ is a prior distribution whose support, $\mathrm{supp}(z)$, is a subset of $Z$. This means that, given an encoder, not every binary vector will actually be produced, so the prior is only defined on that subset of $Z$.
2. Set prior $p_\theta(z(X;\theta))$:
- We need a discrete prior distribution on $z$ variables.

3. Conditional distribution $p_\theta(X|z)$:
(Energy-based Model)
- Given an identity representation $z$, the probability of $X$ being a set of images with that identity is given by the (exponentiated negative) energy assigned to that set, divided by the corresponding quantity summed over all sets.
- Numerator: assign the energy only to sets that are mapped to the same $z$.
- Denominator: normalizes the distribution.
4. Energy function:
- This function contains two parts: an energy value and a reconstruction loss.

- The energy value (always positive) is high if an image $x$ is not consistent with the identity $z$, and low if it is consistent with $z$.
- The discriminator takes the role of this energy function (a sketch of the resulting formulas follows).
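
Based on the description above, the conditional and the energy can be sketched as follows; the indicator restriction, the per-image sum, and the weighting $\lambda$ are assumptions, and the exact formulation is given in the paper:

$$
p_\theta(X \mid z) = \frac{\exp\big\{-\sum_{x \in X} E_\theta(x, z)\big\}\,\mathbb{1}[z(X;\theta) = z]}{\sum_{X'} \exp\big\{-\sum_{x' \in X'} E_\theta(x', z)\big\}\,\mathbb{1}[z(X';\theta) = z]},
\qquad
E_\theta(x, z) = d_\theta(x, z) + \lambda\,\lVert x - r_\theta(x, z)\rVert_2^2,
$$

where $d_\theta$ is the discriminator's MLP energy value and $r_\theta(x, z)$ is its conditional reconstruction.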
## Approximate Inference with Learned Prior and Generator
1. Likelihood estimation:

2. Prior loss:

- A parameterization trick is used to get rid of the intractable support and obtain an upper bound on the true negative log-prior.
3. Generator and discriminator loss combo:

- The log term in the second row of Eq. (6) is intractable, because we cannot enumerate all possible sets of images.

## Loss function and optimization
1. Section 2.4 in the paper:
- We want to maximize the likelihood $p_{\theta}(X)$ in Eq. (1). According to Eq. (2), this is equivalent to minimizing the negative log-likelihood $-\log{p_{\theta}(z(X;\theta))}-\log{p_{\theta}}(X|z(X;\theta))$.
- As shown in Eq. (5) and Eq. (6), $L_0(\theta)$ is an upper bound of $-\log{p_{\theta}(z(X;\theta))}$, and $L_1(\theta, \psi)$ is a lower bound of $-\log{p_{\theta}}(X|z(X;\theta))$.
- SDN contains two sets of parameters to be optimized, $\theta$ and $\psi$, where $\theta$ denotes the combined parameters of the prior, encoder, and discriminator, and $\psi$ denotes those of the generator.
2. Training process:
- During training, with $\theta$ fixed, we first optimize $\psi$ to tighten the lower bound by solving $\max_{\psi}L_1(\theta, \psi)$.
- Then, with $\psi$ fixed, we optimize $\theta$ with a loss that combines $L_0(\theta)$ and $L_1(\theta, \psi)$ (a sketch of this alternation is below).
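
A hypothetical PyTorch sketch of this alternation; `L0`, `L1`, and `dataloader` are placeholders standing in for Eq. (5), Eq. (6), and the data pipeline (the prior's parameters are folded into `L0` here), not names from the paper's code:

```python
import torch

# encoder, generator, discriminator are the module sketches from above.
opt_theta = torch.optim.Adam(
    list(encoder.parameters()) + list(discriminator.parameters()), lr=1e-4)
opt_psi = torch.optim.Adam(generator.parameters(), lr=1e-4)

for X in dataloader:                      # X: one set of same-identity images
    z = encoder(X)

    # Step 1: theta fixed -- tighten the lower bound by solving max_psi L1(theta, psi).
    opt_psi.zero_grad()
    (-L1(X, z.detach(), generator, discriminator)).backward()
    opt_psi.step()

    # Step 2: psi fixed -- optimize theta with a loss combining L0 and L1
    # (the exact combination follows the paper).
    opt_theta.zero_grad()
    (L0(z) + L1(X, z, generator, discriminator)).backward()
    opt_theta.step()
```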
