[toc]

:::info
These are study notes written in the process of building an Indoor Positioning System model. They cover the use of AutoEncoders and Deep AutoEncoders as well as their fundamentals.

**Resources**
1. [Applied Deep Learning: AutoEncoders](https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798)
2. [A comprehensive survey on design and application of autoencoder in deep learning](https://www.sciencedirect.com/science/article/pii/S1568494623001941)
3. [Autoencoder and its various variants](https://ieeexplore.ieee.org/document/8616075)
:::

## Introduction to AutoEncoders

An AutoEncoder is a type of neural network architecture that is primarily used for unsupervised learning tasks such as data compression, dimensionality reduction, and feature extraction. It consists of an encoder network and a decoder network, which work together to reconstruct the input data.

:::success
**Auto** refers to "automatic" or "self" and signifies that the neural network architecture is *designed to learn and process the data in an automatic manner*, while **Encoder** refers to the part of the AutoEncoder that *performs the encoding (compression) of the input data*.
:::

### Purpose of AutoEncoders

The main goal of an AutoEncoder is to learn a representation for a set of data, especially for dimensionality reduction. An AutoEncoder is a feed-forward network with the unique property that its output is trained to equal its input: it compresses the input into a low-dimensional code and then reconstructs the input from that code to form the desired output. The compressed code is also called the latent space representation. In short, the main aim is to minimize the distortion (reconstruction error) between the input and its reconstruction.

### Components of AutoEncoder

![](https://hackmd.io/_uploads/BywkpcQtn.png)

There are three main components in an AutoEncoder:
1. Encoder
2. Code (most reduced form)
3. Decoder

:::info
The Encoder and Decoder are fully connected and form a feed-forward mesh, while the Code acts as a single layer with its own dimension.
:::

### AutoEncoder Characteristics

AutoEncoders are mostly used for dimensionality reduction and have several important properties:

:::spoiler
* **Lossy Compression**
  AutoEncoders aim to represent the input data in a lower-dimensional space, often referred to as the latent space or bottleneck layer. This reduction in dimensionality inherently results in the loss of some information: the compressed representation in the latent space has fewer dimensions than the original input data, so fine-grained details are lost.
* **Unsupervised**
  AutoEncoders do not require explicit labels to train on; they generate their own targets from the training data.
* **Data-Driven**
  AutoEncoders are only meaningful for compressing data similar to what they were trained on, since they learn features specific to the given training data.
:::

### Types of AutoEncoders

There are several types of AutoEncoders, each with its own variations and purposes. In this context, we won't dig too deep into the real implementation of each AutoEncoder.

1. **Denoising AutoEncoder (DAE)**
   A DAE is designed to handle noisy input data. During training, the DAE is trained to reconstruct the original, clean input data from a corrupted version. By learning to remove noise and reconstruct the clean data, the DAE can effectively denoise input data.
2. **Sparse AutoEncoder (SAE)**
   In a Sparse AutoEncoder, a sparsity constraint is introduced during training.
   This encourages the AutoEncoder to learn sparse representations, where only a small number of neurons in the hidden layers are activated at a time. Sparse AutoEncoders can be useful for feature extraction and for learning more compact representations of the data.
3. **Variational AutoEncoder (VAE)**
   VAEs are probabilistic AutoEncoder models that learn a latent space representation with a specific distribution, typically a Gaussian distribution. They introduce an additional constraint during training that encourages the latent space to follow the desired distribution. VAEs are used for generative modeling, allowing the generation of new data samples similar to the training data.
4. **Convolutional AutoEncoder (CAE)**
   Convolutional AutoEncoders use convolutional neural networks (CNNs) as the encoder and decoder components. They are specifically designed for image data, leveraging convolutional operations for spatial feature extraction and preserving the spatial structure of the input data during reconstruction.

## Architecture

![](https://hackmd.io/_uploads/H1Q0lo7Kh.png)

As explained above, both the encoder and the decoder are fully connected and implement a feed-forward neural network, while the code is a single layer whose dimensionality, different from that of the encoder and decoder, can be configured. As shown in the figure above, n, t, and n are the numbers of neurons in the input, hidden, and output layers, respectively. The input and output layers have the same number of neurons, while the hidden layer is limited: generally, the number of neurons in the hidden layer is less than that in the input layer (t < n), which reduces the dimension.

### Encoder Network

The input layer and the hidden layer together form the encoder. The original data X is fed into the model, and the converted code h is obtained after encoding. This encoding process is described by the formula shown below

<br>

<p style="text-align: center">$h = f(X) = f(W_1X + b_1)$

*Where*
* $W_1$ is the weight matrix between the input layer and the hidden layer
* $b_1$ is the bias vector
* $f$ is the activation function of the non-linear transformation

:::info
The encoding process is essentially the **re-extraction** of the data into a specific code through a deterministic mapping relationship.
:::

### Decoder Network

The decoder network's purpose is to map the code *h* back to the original space and reconstruct $X^d$. This process is done by the formula below

<br>

<p style="text-align: center">$X^d = g(h) = g(W_2h + b_2)$

Where
* $W_2$ is the weight matrix between the hidden layer and the output layer
* $b_2$ is the bias vector
* $g$ is a non-linear transformation (sigmoid or tanh) or an affine transformation function

:::info
The decoding process is the **conversion** of the specific encoding back into the input data.
:::
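
To make the encoder and decoder mappings concrete, here is a minimal NumPy sketch of a single forward pass. The layer sizes (n = 8, t = 3), the use of a sigmoid for both $f$ and $g$, and the random initialisation are illustrative assumptions rather than part of any particular implementation.

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation used here for both f (encoder) and g (decoder)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n input neurons, t code (hidden) neurons, with t < n.
n, t = 8, 3
rng = np.random.default_rng(0)

# Parameters theta = (W1, b1, W2, b2), randomly initialised for the sketch.
W1, b1 = rng.normal(size=(t, n)), np.zeros(t)   # input -> code
W2, b2 = rng.normal(size=(n, t)), np.zeros(n)   # code  -> reconstruction

def encode(X):
    """h = f(W1 X + b1): compress the input into the latent code."""
    return sigmoid(W1 @ X + b1)

def decode(h):
    """X^d = g(W2 h + b2): map the code back to the input space."""
    return sigmoid(W2 @ h + b2)

X = rng.random(n)          # one input sample
h = encode(X)              # latent code, shape (t,)
X_d = decode(h)            # reconstruction, shape (n,)
print(h.shape, X_d.shape)  # (3,) (8,)
```

Because t < n, the code h is a lossy, lower-dimensional summary of X, which is exactly the bottleneck property described earlier. Training (next section) adjusts $W_1, b_1, W_2, b_2$ so that $X^d$ stays as close to $X$ as possible.
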
## Training AutoEncoders

An AutoEncoder is trained to find the parameters (the weight matrices W and bias vectors b, collectively denoted $θ$) that minimize the reconstruction error between $X^d$ and X. In other words, training involves optimizing the network parameters to minimize the reconstruction error between the input data and the output generated by the decoder.

<br>

:::warning
The process typically involves the use of an optimization algorithm, a loss function, and backpropagation to update the weights and biases of the network.
:::

### Loss Functions

Depending on whether the problem is a regression or a classification, there are two common reconstruction errors (loss functions) for conventional AutoEncoders.

#### Mean Squared Error (Regression)

<br>

<p style="text-align: center">$J_A(\theta) = J(X, X^d) = \frac{1}{2} \sum_{i=1}^{n} \|X_i^d - X_i\|^2$

#### Cross-Entropy (Classification)

<br>

<p style="text-align: center">$J_A(\theta) = J(X, X^d) = - \sum_{i=1}^{n} \left( x_i \log(x_i^d) + (1 - x_i)\log(1 - x_i^d) \right)$

<br>

### Optimization Algorithms

AutoEncoders, like other neural networks, rely on optimization algorithms to update the network parameters during training and minimize the reconstruction error; the goal is to **reconstruct the input data as accurately as possible**. The optimization algorithm, guided by a loss function (such as mean squared error), adjusts the network parameters to minimize the discrepancy between the original input data and the reconstructed output. In other words, optimization algorithms determine the direction and magnitude of the parameter updates that reduce the reconstruction error.

:::info
Which optimization algorithm is the most suitable depends on the network architecture and the characteristics of the dataset.
:::

1. **Gradient Descent**
   Gradient descent is a widely used optimization algorithm in machine learning. It works by iteratively updating the weights and biases in the direction of the negative gradient of the loss function. Given a loss function $L(\theta)$ that measures the discrepancy between the predicted output and the true output, the goal of gradient descent is to minimize this loss function by updating the parameters $\theta$ iteratively. The update rule for each parameter $\theta_i$ in each iteration is:

   <br>

   <p style="text-align: center">$\theta_i = \theta_i - \alpha \cdot \dfrac{\partial L(\theta)}{\partial \theta_i}$

   Where:
   * $\theta_i$ represents the i-th parameter to be updated
   * $\alpha$ is the learning rate
   * $\partial L(\theta) / \partial \theta_i$ is the partial derivative of the loss function with respect to the i-th parameter

2. **Adaptive Moment Estimation (Adam)**
   Adam is an extension of gradient descent that incorporates adaptive learning rates. It combines the benefits of both the AdaGrad and RMSProp optimization algorithms. Adam adapts the learning rate for each parameter individually, allowing faster convergence and better performance on non-stationary objectives. Given a loss function $L(\theta)$ that measures the discrepancy between the predicted output and the true output, the goal of Adam is to iteratively update the parameters $\theta$ to minimize this loss function. The update rules in each iteration are shown below, and a minimal sketch applying them to an AutoEncoder follows the parameter list.

   <br>

   <p style="text-align: center">$m = \beta_1 m + (1 - \beta_1)\,\nabla L(\theta)$ ... *First Moment Estimate*

   <br>

   <p style="text-align: center">$v = \beta_2 v + (1 - \beta_2)\,(\nabla L(\theta))^2$ ... *Second Moment Estimate*

   <br>

   <p style="text-align: center">$\hat{m} = m / (1 - \beta_1^t)$ ... *Bias-corrected first moment estimate*

   <br>

   <p style="text-align: center">$\hat{v} = v / (1 - \beta_2^t)$ ... *Bias-corrected second moment estimate*

   <br>

   <p style="text-align: center">$\theta = \theta - \alpha \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon)$ ... *Parameter Update*

   Where:
   * $\theta$ represents the parameters to be updated.
   * $\alpha$ (alpha) is the learning rate, which controls the step size of the updates.
   * $\nabla L(\theta)$ is the gradient of the loss function with respect to the parameters.
   * $m$ and $v$ are the first and second moment estimates of the gradients, respectively.
   * $\beta_1$ and $\beta_2$ are the exponential decay rates for the first and second moments, typically close to 1 (common choices are 0.9 for $\beta_1$ and 0.999 for $\beta_2$).
   * $t$ is the current time step (iteration).
   * $\hat{m}$ and $\hat{v}$ are the bias-corrected first and second moment estimates, respectively.
   * $\epsilon$ (epsilon) is a small value for numerical stability to avoid division by zero.
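
Below is a minimal NumPy sketch tying the pieces together: a linear AutoEncoder (no activation functions and no bias terms, purely so the gradients of the squared reconstruction error can be written by hand) is trained with the MSE loss above and a hand-coded Adam update. The layer sizes, learning rate, and random data are illustrative assumptions; in practice one would use a framework's built-in Adam optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a linear AutoEncoder so the gradients can be derived by hand.
n, t, B = 8, 3, 64                        # input dim, code dim, batch size (illustrative)
X = rng.random((n, B))                    # one column per sample
W1 = rng.normal(scale=0.1, size=(t, n))   # encoder weights (input -> code)
W2 = rng.normal(scale=0.1, size=(n, t))   # decoder weights (code -> reconstruction)

# Adam hyperparameters (the common defaults mentioned above).
alpha, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
m1, v1 = np.zeros_like(W1), np.zeros_like(W1)
m2, v2 = np.zeros_like(W2), np.zeros_like(W2)

def adam_step(W, grad, m, v, step):
    """Apply one Adam update to W in place and return the new moment estimates."""
    m = beta1 * m + (1 - beta1) * grad               # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2          # second moment estimate
    m_hat = m / (1 - beta1 ** step)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)                  # bias-corrected second moment
    W -= alpha * m_hat / (np.sqrt(v_hat) + eps)      # parameter update
    return m, v

for step in range(1, 501):
    h = W1 @ X                                       # encode:  h   = W1 X
    X_d = W2 @ h                                     # decode:  X^d = W2 h
    E = X_d - X                                      # reconstruction error
    loss = 0.5 * np.mean(np.sum(E ** 2, axis=0))     # mean squared reconstruction error
    grad_W1 = W2.T @ E @ X.T / B                     # dJ/dW1 (chain rule, linear case)
    grad_W2 = E @ h.T / B                            # dJ/dW2
    m1, v1 = adam_step(W1, grad_W1, m1, v1, step)
    m2, v2 = adam_step(W2, grad_W2, m2, v2, step)

print(f"final reconstruction loss: {loss:.4f}")
```

With only 3 code units for 8 input features the reconstruction cannot be perfect, which is the lossy compression discussed earlier, but the loss drops steadily as Adam adapts the step size per parameter.
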
:::info
The Adam update rule adapts the learning rate for each parameter based on the estimated first and second moments of the gradients. It provides faster convergence compared to standard gradient descent and helps handle noisy gradients and varying learning rates for different parameters.
:::

:::spoiler
It's important to note that the formulation above represents the general update rule for Adam. Different implementations or variations of Adam may include additional features or tweaks, such as gradient clipping or adaptive learning rate schedules.
:::

<br>

3. **Root Mean Square Propagation (RMSProp)**
   RMSProp is an optimization algorithm that adjusts the learning rate for each parameter based on the root mean square of the past gradients. It helps mitigate the diminishing learning rate problem and provides stability during training; a minimal sketch of the update is given below.
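
As an illustration of the idea just described, here is a minimal NumPy sketch of the RMSProp update applied to a single parameter vector on a toy quadratic objective. The decay rate (0.9), learning rate, and the toy objective are illustrative assumptions, not values prescribed by the text above.

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, alpha=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update: scale the step by the root mean square of past gradients."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2    # running average of squared gradients
    theta -= alpha * grad / (np.sqrt(sq_avg) + eps)  # per-parameter adaptive step
    return sq_avg

# Toy usage: minimise 0.5 * ||theta - target||^2 with RMSProp.
rng = np.random.default_rng(2)
target = rng.random(5)
theta = np.zeros(5)
sq_avg = np.zeros(5)

for _ in range(2000):
    grad = theta - target                  # gradient of the toy objective
    sq_avg = rmsprop_step(theta, grad, sq_avg)

print(np.round(theta - target, 4))         # residuals are small after training
```

Because the step size is divided by the running root mean square of the gradients, parameters with consistently large gradients take smaller effective steps, which is what keeps the learning rate from collapsing the way it can in plain AdaGrad.
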