# L14-Autoencoders
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]

###### tags: `deep learning` `學習筆記`

==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==

* http://www.deeplearningbook.org/contents/autoencoders.html
* The network of an autoencoder may be viewed as consisting of an encoder and a decoder, each specifying a deterministic or stochastic mapping.
  * ![](https://i.imgur.com/f3UdTnK.png)
* Learning amounts to minimizing a loss function, often with a regularization term.
  * ![](https://i.imgur.com/NQfHCf1.png)
* Traditionally, autoencoders were used for dimensionality reduction.
* However, theoretical connections between autoencoders and some modern latent variable models have brought autoencoders to the forefront of ==generative modeling==.
* Variational Autoencoders (VAE)
  * A probabilistic generative model with latent variables, built on top of end-to-end trainable neural networks
  * ![](https://i.imgur.com/WQLSz7g.png)
  * $\begin{split}\log p(X; \theta) &= \log \frac{p(X, Z; \theta)}{p(Z|X; \theta)} \\ &= E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(X, Z; \theta)}{q(Z|X; \theta^{'})}\frac{q(Z|X; \theta^{'})}{p(Z|X; \theta)} \\ &= E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(X, Z; \theta)}{q(Z|X; \theta^{'})} + E_{Z\sim q(Z|X; \theta^{'})} \log \frac{q(Z|X; \theta^{'})}{p(Z|X; \theta)} \\ &= E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(X, Z; \theta)}{q(Z|X; \theta^{'})} + KL(q(Z|X; \theta^{'})||p(Z|X; \theta)) \\ &\geq E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(X, Z; \theta)}{q(Z|X; \theta^{'})} = L(X, q, \theta) \\ &= E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(X|Z; \theta)\, p(Z; \theta)}{q(Z|X; \theta^{'})} \\ &= E_{Z\sim q(Z|X; \theta^{'})} \log \frac{p(Z; \theta)}{q(Z|X; \theta^{'})} + E_{Z\sim q(Z|X; \theta^{'})} \log p(X|Z; \theta) \\ &= -KL(q(Z|X; \theta^{'})||p(Z; \theta)) + E_{Z\sim q(Z|X; \theta^{'})} \log p(X|Z; \theta) \\ &= \text{Regularization} + \text{Reconstruction} \end{split}$
  * The regularization term requires that the conditional distribution $q(Z|X; \theta^{'})$ of the latent code $Z$ given $X$ be compatible with the prior $p(Z)$
  * The reconstruction term requires that the latent code $Z$ generated by the encoder $q(Z|X; \theta^{'})$ for the input $X$ maximize the log-likelihood $\log p(X|Z; \theta)$ of $X$
  * Trick: sampling $Z$ inside the network is not differentiable, so gradients cannot flow from the decoder back to the encoder. The re-parameterization technique works around this difficulty by generating the samples fed to the decoder as (see the sketch below)
    * $B(X)\epsilon + \mu(X)$, where $B(X)B(X)^T = \Sigma(X)$ and $\epsilon \sim N(0, I)$
    * In fact, the encoder can learn $B(X)$ directly
    * Original (without the trick):
      * ![](https://i.imgur.com/NeXhjeI.png)
    * With the trick:
      * ![](https://i.imgur.com/RhyahCR.png)
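  * As a concrete illustration of the two ELBO terms and the re-parameterization trick, here is a minimal sketch (not from the source): it assumes PyTorch and a diagonal-Gaussian encoder, so $B(X)$ reduces to a diagonal matrix with entries $\sigma(X)$; the names `VAE` and `elbo_loss` are hypothetical.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, in_dim=784, code_dim=32):
            super().__init__()
            self.enc = nn.Linear(in_dim, 2 * code_dim)  # outputs mu(x) and log sigma^2(x)
            self.dec = nn.Linear(code_dim, in_dim)

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            eps = torch.randn_like(mu)                  # eps ~ N(0, I)
            z = mu + torch.exp(0.5 * logvar) * eps      # re-parameterized sample B(x) eps + mu(x)
            return self.dec(z), mu, logvar

    def elbo_loss(recon, x, mu, logvar):
        # Reconstruction term: -E_q[log p(x|z)]; a Gaussian decoder gives a squared error.
        rec = F.mse_loss(recon, x, reduction='sum') / x.size(0)
        # Regularization term: KL(q(z|x) || N(0, I)), in closed form for diagonal Gaussians.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return rec + kl  # minimizing this maximizes the ELBO L(X, q, theta)
    ```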
  * Given that the data $X = \{x_i\}$ are drawn from an empirical distribution $p_d(x)$, the objective function $L(X, q, \theta)$ can be expressed more precisely as
    * $\begin{split} &E_{x \sim p_d(x)}[-KL(q(z|x; \theta^{'})||p(z; \theta)) + E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] - E_{x \sim p_d(x)}[KL(q(z|x; \theta^{'})||p(z; \theta))] \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] - E_{x \sim p_d(x)}[H_{q(z|x)}(p(z)) - H(q(z|x))] \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] + E_{x \sim p_d(x)}H(q(z|x)) - \int q(z)(-\log p(z))dz \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] + E_{x \sim p_d(x)}H(q(z|x)) - E_{z \sim q(z)}[-\log p(z)] \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] + E_{x \sim p_d(x)}H\Big(\frac{q(x|z)q(z)}{q(x)}\Big) - E_{z \sim q(z)}[-\log p(z)] \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] + E_{z \sim q(z)}H(q(x|z)) + H(q(z)) \\ &\quad - H(x) - H_{q(z)}(p(z)) \\ &= E_{x \sim p_d(x)}[E_{z\sim q(z|x; \theta^{'})} \log p(x|z; \theta)] - H(x) + E_{z \sim q(z)}H(q(x|z)) - KL(q(z)||p(z)) \end{split}$
    * where $q(z) = E_{x \sim p_d(x)}[q(z|x; \theta^{'})]$ is the aggregated posterior
    * $- H(x) + E_{z \sim q(z)}H(q(x|z))$ is the negative of the mutual information $I(x; z) = H(x) - E_{z \sim q(z)}H(q(x|z))$ between $x$ and $z$
    * $KL(q(z)||p(z))$ is the KL divergence between the aggregated posterior and the prior distribution
    * When the encoder is viewed as a communication channel with $x$ as input and $z$ as output, the mutual information indicates how much information about $x$ is passed on to $z$; the larger the mutual information, the more information about $x$ the code $z$ carries.
* Conditional Variational Autoencoders (CVAE)
  * Train a VAE to learn a conditional distribution $p(X|c)$
  * $\begin{split}\log p(X|c; \theta) &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X, Z|c; \theta)}{p(Z|X, c; \theta)} \\ &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X, Z|c; \theta)}{q(Z|X, c; \theta^{'})}\frac{q(Z|X, c; \theta^{'})}{p(Z|X, c; \theta)} \\ &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X, Z|c; \theta)}{q(Z|X, c; \theta^{'})} + E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{q(Z|X, c; \theta^{'})}{p(Z|X, c; \theta)} \\ &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X, Z|c; \theta)}{q(Z|X, c; \theta^{'})} + KL(q(Z|X, c; \theta^{'})||p(Z|X, c; \theta)) \\ &\geq E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X, Z|c; \theta)}{q(Z|X, c; \theta^{'})} \\ &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(X|Z, c; \theta)\, p(Z|c; \theta)}{q(Z|X, c; \theta^{'})} \\ &= E_{Z\sim q(Z|X, c; \theta^{'})} \log \frac{p(Z|c; \theta)}{q(Z|X, c; \theta^{'})} + E_{Z\sim q(Z|X, c; \theta^{'})} \log p(X|Z, c; \theta) \\ &= -KL(q(Z|X, c; \theta^{'})||p(Z|c; \theta)) + E_{Z\sim q(Z|X, c; \theta^{'})} \log p(X|Z, c; \theta) \end{split}$
  * How to specify the conditional prior $p(Z|c)$?
    * Learn it from data using a neural network (regularization?)
    * Use a simple fixed prior without regard to $c$
    * Ignore the regularization term (no longer a VAE)
  * ![](https://i.imgur.com/FI7VQ3Y.png)
* Denoising Autoencoders (DAE)
  * A DAE receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as output; that is, it minimizes
    * $L(x, g(f(\tilde x)))$, where $\tilde x$ is a noise-corrupted version of $x$
  * Training a DAE proceeds as follows (see the sketch at the end of this section):
    * Sample an $x$ from the training data
    * Sample a corrupted version $\tilde x$ from $C(\tilde x | x)$
    * Minimize the negative log-likelihood by gradient descent w.r.t. the model parameters
      * $-E_{x \sim \hat p_{data}(x)}E_{\tilde x \sim C(\tilde x | x)}\log p_{decoder}(x | h = f(\tilde x))$
  * See also: [denoising variational autoencoders](https://davidstutz.de/denoising-variational-auto-encoders/)
  * ![](https://i.imgur.com/wVMyOle.png)
  * ![](https://i.imgur.com/i93ULT8.png)
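  * A minimal sketch of this training loop (not from the source): it assumes PyTorch, an additive-Gaussian corruption process $C(\tilde x|x)$, and a Gaussian decoder so that the negative log-likelihood reduces to a squared error; the names `DAE` and `dae_step` are hypothetical.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DAE(nn.Module):
        def __init__(self, in_dim=784, code_dim=128):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())  # encoder f
            self.g = nn.Linear(code_dim, in_dim)                            # decoder g

        def forward(self, x_tilde):
            return self.g(self.f(x_tilde))             # reconstruct from the corrupted input

    def dae_step(model, optimizer, x, noise_std=0.3):
        x_tilde = x + noise_std * torch.randn_like(x)  # sample x_tilde ~ C(x_tilde | x)
        loss = F.mse_loss(model(x_tilde), x)           # predict the uncorrupted x, not x_tilde
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```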
* Sparse Autoencoders
  * A sparse autoencoder is an autoencoder whose training criterion involves a sparsity penalty $\Omega (h)$
    * $L(x, g(f(x))) + \Omega (h)$
  * Sparse autoencoders are typically used to learn features for another task, such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function.
  * We can think of the entire sparse autoencoder framework as approximating maximum likelihood training of a generative model that has latent variables. Suppose we have a model with visible variables $x$ and latent variables $h$, with an explicit joint distribution $p_{model}(x, h) = p_{model}(h)p_{model}(x | h)$. We refer to $p_{model}(h)$ as the model’s prior distribution over the latent variables, representing the model’s beliefs prior to seeing $x$.
  * From this point of view, with this chosen $h$, we are maximizing
    * ![](https://i.imgur.com/WNlIlmQ.png)
  * If the nonlinearity of the hidden units is the sigmoid function, a unit is considered active when its output is close to 1 and inactive (sparse) when its output is close to 0; with the tanh function, a unit is active when its output is close to 1 and inactive when its output is close to -1.
  * Assume ![](https://i.imgur.com/sIQGkNR.png)
  * ![](https://i.imgur.com/Kuys0O7.png)
  * A sparse autoencoder encourages the average activation of the hidden layer to stay at a fairly small value.
* Contractive Autoencoders (CAE)
  * The CAE imposes a regularizer on the code $h$ that encourages the model to learn an encoder function which does not change much when the input $x$ changes slightly
    * $L(x, g(f(x))) + \Omega (h, x)$
    * $\Omega (h, x) = \lambda ||\frac{\partial f(x)}{\partial x}||_F^2$ (see the sketch at the end of this section)
  * The encoder $f(x)$ at a training point $x_0$ can be approximated as
    * $f(x) \approx f(x_0) + \frac{\partial f(x_0)}{\partial x}(x - x_0)$
  * As such, the CAE encourages the Jacobian matrix $\frac{\partial f(x_0)}{\partial x}$ at every training point $x_0$ to be contractive, i.e., to have singular values that are as small as possible
  * However, the optimization also has to respect the reconstruction error; this has the effect of keeping large singular values along the directions with large local variance
  * These directions are known as the tangent directions of the data manifold; that is, they correspond to real variations in the data.
  * The encoder thus learns a mapping $f(x)$ that is sensitive only to changes along the manifold directions
  * ![](https://i.imgur.com/9AQ6Zc8.png)
  * The first part minimizes the reconstruction error, i.e., the encoding should retain the most representative feature information. The second part involves only the samples for which the partial derivatives are nonzero, i.e., it throws away the useful information and keeps only the perturbation ("jitter") information, and we want the model to be invariant to such perturbations. The overall loss therefore retains only the representative, useful feature information.
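  * A minimal sketch of this contractive penalty (not from the source): it assumes PyTorch and a one-layer sigmoid encoder $h = \sigma(Wx + b)$, for which the squared Frobenius norm of the Jacobian has a closed form; the names `ContractiveAE` and `cae_loss` are hypothetical.

    ```python
    import torch
    import torch.nn as nn

    class ContractiveAE(nn.Module):
        def __init__(self, in_dim=784, code_dim=64):
            super().__init__()
            self.enc = nn.Linear(in_dim, code_dim)   # f(x) = sigmoid(W x + b)
            self.dec = nn.Linear(code_dim, in_dim)   # g(h)

        def forward(self, x):
            h = torch.sigmoid(self.enc(x))
            return self.dec(h), h

    def cae_loss(model, x, lam=1e-4):
        recon, h = model(x)
        recon_err = ((recon - x) ** 2).sum(dim=1).mean()
        # For h_j = sigmoid(w_j . x + b_j): dh_j/dx_i = h_j (1 - h_j) W_ji, so
        # ||dh/dx||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2.
        dh_sq = (h * (1 - h)) ** 2                   # shape (batch, code_dim)
        w_sq = (model.enc.weight ** 2).sum(dim=1)    # shape (code_dim,)
        contractive = (dh_sq * w_sq).sum(dim=1).mean()
        return recon_err + lam * contractive
    ```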