# Autoencoders

## What are autoencoders?

Autoencoders are artificial neural networks, trained in an unsupervised manner, that aim to first learn encoded representations of our data and then generate the input data (as closely as possible) from the learned encoded representations. Thus, the output of an autoencoder is its prediction for the input.

<center>
<img src="https://i.imgur.com/unVCuSy.png" width="40%"/><br>
<b>Fig. 1</b>: Architecture of a basic autoencoder<br>
</center>

Fig. 1 shows the architecture of a basic autoencoder. As before, we start from the bottom with the input $x$, which is passed through an encoder (an affine transformation defined by $W_h$, followed by squashing). This results in the intermediate hidden layer $h$, which is then passed through a decoder (another affine transformation defined by $W_x$, followed by another squashing). This produces the output $\hat{x}$, which is our model's prediction/reconstruction of the input. As per our convention, we say that this is a 3-layer neural network.

We can represent the above network mathematically with the following equations:

\begin{align*}
h &= f(W_hx + b_h) \\
\hat{x} &= g(W_xh + b_x)
\end{align*}

We also specify the following dimensionalities:

\begin{align*}
x,\hat{x} &\in \mathbb{R}^n\\
h &\in \mathbb{R}^d\\
W_h &\in \mathbb{R}^{d \times n}\\
W_x &\in \mathbb{R}^{n \times d}
\end{align*}

<b>Note:</b> In order to represent PCA, we can have tied weights, defined by $W_x\ \dot{=}\ W_h^\top$.

## Why are we using autoencoders?

At this point, we may wonder what the point of predicting the input is and what the applications of autoencoders are.

The primary applications of an autoencoder are anomaly detection and image denoising. We know that an autoencoder's task is to reconstruct data that lives on the manifold, i.e. given a data manifold, we would want our autoencoder to be able to reconstruct only the input that exists on that manifold. Thus we constrain the model to reconstruct things that have been observed during training, so any variation present in new inputs will be removed, because the model is insensitive to those kinds of perturbations.

Another application of an autoencoder is as an image compressor. If the intermediate dimensionality $d$ is lower than the input dimensionality $n$, then the encoder can be used as a compressor, and the hidden (coded) representations would retain all (or most) of the information in the specific input while taking up less space.

## Reconstruction loss

Let us now look at the reconstruction losses we generally use. The overall loss for the dataset is given as the average per-sample loss, i.e.

\begin{align*}
L = \frac{1}{m} \sum_{j=1}^m \ell(x^{(j)},\hat{x}^{(j)})
\end{align*}

When the input is categorical, we can use the cross-entropy loss to calculate the per-sample loss, which is given by

\begin{align*}
\ell(x,\hat{x}) = -\sum_{i=1}^n [x_i \log(\hat{x}_i) + (1-x_i)\log(1-\hat{x}_i)]
\end{align*}

And when the input is real-valued, we may want to use the mean squared error loss, given by

\begin{align*}
\ell(x,\hat{x}) = \frac{1}{2} \lVert x - \hat{x} \rVert^2
\end{align*}
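The equations and losses above translate almost line-for-line into code. Below is a minimal PyTorch sketch; the sizes $n = 784$, $d = 30$, the random batch, and the choice of sigmoid for both squashing functions $f$ and $g$ are illustrative assumptions, not part of the notes.

```python
import torch
import torch.nn as nn

n, d = 784, 30  # illustrative input and hidden dimensionalities (assumption)

class Autoencoder(nn.Module):
    def __init__(self, n, d):
        super().__init__()
        self.W_h = nn.Linear(n, d)   # encoder: affine map W_h x + b_h
        self.W_x = nn.Linear(d, n)   # decoder: affine map W_x h + b_x
        self.f = torch.sigmoid       # squashing non-linearities f and g (assumed sigmoid)
        self.g = torch.sigmoid

    def forward(self, x):
        h = self.f(self.W_h(x))      # h = f(W_h x + b_h)
        x_hat = self.g(self.W_x(h))  # x_hat = g(W_x h + b_x)
        return x_hat

model = Autoencoder(n, d)
x = torch.rand(16, n)                # dummy batch of 16 inputs in [0, 1]
x_hat = model(x)

# Per-sample losses, averaged over the batch (the 1/m sum in L):
mse_loss = 0.5 * ((x - x_hat) ** 2).sum(dim=1).mean()   # real-valued inputs
bce_loss = nn.functional.binary_cross_entropy(          # categorical/binary inputs
    x_hat, x, reduction="none").sum(dim=1).mean()
```

Summing over the $n$ components of each sample and then averaging over the batch mirrors the per-sample loss $\ell$ and the dataset loss $L$ written above.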
## Under-/over-complete hidden layer

When the dimensionality of the hidden layer $d$ is less than the dimensionality of the input $n$, we say the hidden layer is under-complete. Similarly, when $d>n$, we call it an over-complete hidden layer. Fig. 2 shows an under-complete hidden layer on the left and an over-complete hidden layer on the right.

<center>
<img src="https://i.imgur.com/7vwe3qS.png" width="60%"/><br>
<b>Fig. 2</b>: An under-complete vs an over-complete hidden layer<br>
</center>

As discussed above, an under-complete hidden layer can be used for compression, as we encode the information from the input in fewer dimensions. In an over-complete layer, on the other hand, we use an encoding with higher dimensionality than the input, which makes optimization easier. However, since we are trying to reconstruct the input, the model is prone to copying all the input features into the hidden layer and passing them through as the output, essentially behaving as an identity function. This needs to be avoided, as it would imply that our model fails to learn anything. Hence, we need to apply an additional constraint in the form of an information bottleneck. We do this by constraining the possible configurations that the hidden layer can take to only those configurations seen during training. This allows for a selective reconstruction (limited to a subset of the input space) and makes the model insensitive to everything not on the manifold.

It is to be noted that an under-complete layer cannot behave as an identity function simply because the hidden layer doesn't have enough dimensions to copy the input. Thus an under-complete hidden layer is less likely to overfit than an over-complete hidden layer, but it could still overfit. For example, given a powerful enough encoder and decoder, the model could simply associate one number to each data point and learn the mapping. There are several methods to avoid overfitting, such as regularization methods, architectural methods, etc.
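As a rough illustration of the two regimes, the sketch below instantiates an under-complete and an over-complete autoencoder; the sizes 784/30/1000, the sigmoid non-linearities, and weight decay as one concrete regularization method are assumptions chosen for illustration, not the specific constraints discussed in the notes.

```python
import torch
import torch.nn as nn

n = 784  # input dimensionality (illustrative)

def autoencoder(n, d):
    """Sigmoid autoencoder with an n -> d -> n shape."""
    return nn.Sequential(
        nn.Linear(n, d), nn.Sigmoid(),   # encoder
        nn.Linear(d, n), nn.Sigmoid(),   # decoder
    )

under_complete = autoencoder(n, d=30)    # d < n: the bottleneck itself constrains the model
over_complete  = autoencoder(n, d=1000)  # d > n: enough capacity to behave like the identity

# The over-complete model needs some additional constraint; a simple regularization
# method such as weight decay is one possible (assumed) choice:
optimizer = torch.optim.Adam(over_complete.parameters(), lr=1e-3, weight_decay=1e-5)
```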