# Autoencoders and Variational Autoencoders
## Introduction
Autoencoders and Variational Autoencoders (VAEs) are powerful neural network architectures that learn compressed representations of data, known as latent representations. These models have a wide array of applications, including:
* **Dimensionality Reduction:** Reduce the complexity of high-dimensional data while preserving essential information. This is useful for visualization, data compression, and improving the performance of machine learning algorithms on large datasets.
* **Denoising:** Remove noise from corrupted data, enhancing its quality for subsequent analysis or tasks.
* **Anomaly Detection:** Identify unusual patterns or outliers in data by leveraging the reconstruction error of the autoencoder. Data points with high reconstruction errors are likely anomalies.
* **Generative Modeling:** VAEs, in particular, can generate new, realistic samples from the learned data distribution. This capability is valuable in creative applications such as image generation, music composition, and text synthesis.
## Latent Variable Models
Latent variable models establish a map between an observation space $\mathbf{x} \in \mathbb{R}^D$ and a latent space $\mathbf{z} \in \mathbb{R}^Q$. The objective is to learn a compressed representation ($Q<D$) that captures the essential structure within the data. These models can be classified as follows:
* **Linearity vs. Non-Linearity:** The mapping between the observed and latent spaces can be either linear (e.g., PCA) or non-linear (e.g., autoencoders, VAEs).
* **Deterministic vs. Probabilistic:** The mapping can be deterministic, where each input is mapped to a single point in the latent space, or probabilistic, where the latent representation is a probability distribution.
* **With vs. Without an Explicit Encoder:** Some models explicitly define an encoder network to learn the mapping to the latent space. Others, such as GANs, have no encoder at all and learn only the latent-to-data mapping through an adversarial training process.
| Feature | Deterministic | Probabilistic |
|---------|----------------------------|----------------------------------|
| Linear | Principal Component Analysis (PCA) | Probabilistic PCA |
| Non-Linear with encoder | Autoencoder | Variational Autoencoder (VAE) |
| Non-Linear without encoder | | Generative Adversarial Network (GAN) |
## Principal Component Analysis (PCA)
PCA is a foundational linear dimensionality reduction technique. It can be viewed as a linear autoencoder with a single bottleneck layer and no non-linear activations. The core idea of PCA is to find a lower-dimensional subspace that captures as much of the variance in the data as possible. This is achieved by minimizing the reconstruction error (the difference between the original data and its projection back from the lower-dimensional space), which is mathematically equivalent to maximizing the variance of the latent (lower-dimensional) representation itself.
### Theory Behind PCA: Unveiling the Equivalence
Let's delve into the mathematical underpinnings of PCA to demonstrate the equivalence between minimizing reconstruction error and maximizing variance.
#### Problem Setup
We begin by defining:
* **Observed data:** $\mathbf{x}_1,\dots,\mathbf{x}_N \in \mathbb{R}^D$, where each $\mathbf{x}_i$ is a data point in the original, high-dimensional space.
* **Latent representations:** $\mathbf{z}_1,\dots,\mathbf{z}_N \in \mathbb{R}^{Q}$, where each $\mathbf{z}_i$ is the corresponding low-dimensional representation of $\mathbf{x}_i$.
* **Reconstructed data:** $\hat{\mathbf{x}}_i$, obtained by linearly mapping $\mathbf{z}_i$ back to the original space. Mathematically: $$\mathbf{\hat{x}_i} = \mathbf{\bar{x}} + \sum_{j=1}^Q z_{ij} \mathbf{v}_j$$ where:
* $\mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^N\mathbf{x}_i$ is the mean of the observed data.
* $V = \{\mathbf{v}_1,\dots,\mathbf{v}_Q\}$ is an orthonormal set of vectors (principal components) in $\mathbb{R}^D$.
    * $z_{ij}$ are the coefficients representing how much each principal component contributes to reconstructing the $i$-th data point.
#### L2 Reconstruction Loss
We measure the quality of the reconstruction using the L2 (Euclidean) distance between the original and reconstructed data points, summed over all data points:
$$
\mathcal{L}(Z,V) = \sum_{i=1}^N||\mathbf{\hat{x}}_i-\mathbf{x}_i||^2
$$
#### Equivalence to Variance Maximization
Through mathematical derivation (detailed below), we can show that minimizing this reconstruction loss is the same as finding the orthonormal vectors $V$ that maximize the variance of the latent representations $Z$:
$$
\mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T \text{Cov}(X, X) \mathbf{v}
$$
where $\text{Cov}(X, X)$ is the covariance matrix of the observed data.
#### Solution & Interpretation
The solution to this maximization problem reveals that the optimal $\mathbf{v}$'s are the eigenvectors of the covariance matrix associated with the largest eigenvalues. These eigenvectors form the principal components, and the corresponding eigenvalues represent the amount of variance explained by each component.
#### Mathematical Derivation
We have
\begin{split}
\mathcal{L}(Z,V)&=\sum_{i=1}^N||\mathbf{\hat{x}}_i-\mathbf{x}_i||^2 \\
&=\sum_{i=1}^N||\mathbf{\bar{x}}+\sum_{j=1}^Qz_{ij}\mathbf{v}_j-\mathbf{x}_i||^2 \\
&=\sum_{i=1}^N \left(\sum_{j=1}^Q z_{ij}\mathbf{v}_j+\mathbf{\bar{x}}-\mathbf{x}_i\right)^T \left(\sum_{j=1}^Q z_{ij}\mathbf{v}_j+\mathbf{\bar{x}}-\mathbf{x}_i\right) \\
&=\sum_{i=1}^N \left[\sum_{j=1}^Q z_{ij}^2+2(\mathbf{\bar{x}}-\mathbf{x}_i)^T \sum_{j=1}^Q z_{ij}\mathbf{v}_j + ||\mathbf{\bar{x}}-\mathbf{x}_i||^2\right]
\end{split}
where the last step uses the orthonormality of the $\mathbf{v}_j$ (so that $\mathbf{v}_j^T\mathbf{v}_k=\delta_{jk}$). Taking the partial derivative with respect to $z_{ij}$, we get:
$$
\frac{\partial \mathcal{L}(Z, V)}{\partial z_{ij}} = 2z_{ij} + 2 (\mathbf{\bar{x}} - \mathbf{x}_i)^T \mathbf{v}_j
$$
Setting this derivative to zero and solving for the optimal $z_{ij}$, we obtain:
$$
z_{ij}^* = -(\mathbf{\bar{x}} - \mathbf{x}_i)^T \mathbf{v}_j
$$
Let $Z=Z^*$. Substituting this optimal value back into the loss function, we have:
\begin{split}
\mathcal{L}(Z^*,V)
&= \sum_{i=1}^N \left[\sum_{j=1}^Q (z_{ij}^*)^2+2(\mathbf{\bar{x}}-\mathbf{x}_i)^T \sum_{j=1}^Q z_{ij}^* \mathbf{v}_j + ||\mathbf{\bar{x}}-\mathbf{x}_i||^2 \right]\\
&= -\sum_{i=1}^N\sum_{j=1}^Q \mathbf{v}_j^T(\mathbf{\bar{x}}-\mathbf{x}_i)(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x}_i||^2\\
&=-N\sum_{j=1}^Q \mathbf{v}_j^T\frac{1}{N}\sum_{i=1}^N(\mathbf{\bar{x}}-\mathbf{x}_i)(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x_i}||^2\\
&=-N\sum_{j=1}^Q \mathbf{v}_j^T \text{Cov}(X,X) \mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x}_i||^2
\end{split}
To minimize this loss, we need to find the vectors $\mathbf{v}$ that maximize the following expression:
$$
\mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T \text{Cov}(X, X) \mathbf{v}
$$
Alternatively, if we substitute $z_{ij}^*=-(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j$ back into the loss function, we get:
\begin{align*}
\mathcal{L}(Z^*, V)
&= - \sum_{i=1}^N \sum_{j=1}^Q (z_{ij}^*)^2 + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2 \\
&= -N \sum_{j=1}^Q \frac{1}{N} \sum_{i=1}^N (z_{ij}^*)^2 + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2 \\
&= -N \sum_{j=1}^Q \text{Cov}(Z_j^*, Z_j^*) + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2
\end{align*}
This demonstrates the equivalence between minimizing the L2 reconstruction loss and maximizing the variance of the latent variables $Z$. (The last step uses the fact that the $z_{ij}^*$ have zero mean over $i$, since $\sum_{i=1}^N(\mathbf{\bar{x}}-\mathbf{x}_i)=\mathbf{0}$, so $\frac{1}{N}\sum_{i=1}^N (z_{ij}^*)^2$ is exactly their variance.)
To find the optimal $V$, we use the method of Lagrange multipliers to solve the constrained optimization problem:
$$
\mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T C \mathbf{v}
$$
where $C = \text{Cov}(X,X)$.
We define the Lagrangian:
$$
f(\mathbf{v}, \lambda) = \mathbf{v}^T C \mathbf{v} - \lambda (\mathbf{v}^T \mathbf{v} - 1)
$$
Taking the partial derivative with respect to $v_i$ and setting it to zero, we get:
\begin{split}
\frac{\partial f(\mathbf{v}, \lambda)}{\partial v_i}
&= C_{i, \cdot} \mathbf{v} + \mathbf{v}^T C_{\cdot, i} - 2 \lambda v_i \\
&= 2 C_{i, \cdot} \mathbf{v} - 2 \lambda v_i = 0
\end{split}
Stacking these equations over all $i$ gives $C\mathbf{v}=\lambda\mathbf{v}$, so $\mathbf{v}$ is an eigenvector of the covariance matrix $\text{Cov}(X,X)$; since the objective then equals $\mathbf{v}^T C \mathbf{v}=\lambda$, the maximum is attained by the eigenvector with the largest eigenvalue.
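This eigenvalue relationship is easy to check numerically. The sketch below (a quick standalone check, not part of the main example; it assumes NumPy and scikit-learn and uses an illustrative toy data matrix) verifies that the eigenvalues of the sample covariance matrix match the variances reported by sklearn's `PCA`, and that the leading eigenvectors agree with its components up to sign:
```python=
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # toy data with correlated features

# Eigendecomposition of the sample covariance matrix C = Cov(X, X)
C = np.cov(X, rowvar=False)                          # (D, D), same 1/(N-1) scaling as sklearn
eigvals, eigvecs = np.linalg.eigh(C)                 # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort by decreasing eigenvalue

pca = PCA(n_components=3).fit(X)

# The eigenvalues equal the variance explained by each principal component
print(np.allclose(eigvals[:3], pca.explained_variance_))               # expected: True

# The eigenvectors equal the principal components up to an arbitrary sign flip
print(np.allclose(np.abs(eigvecs[:, :3].T), np.abs(pca.components_)))  # expected: True
```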
#### The Structure of PCA
PCA can be succinctly described by its encoder and decoder:
* **Input:** $\mathbf{x}$ (data point in the original space)
* **Encoder:** $\mathbf{z} = V^T (\mathbf{x} - \bar{\mathbf{x}})$ (projects the centered data onto the principal components)
* **Decoder:** $\hat{\mathbf{x}} = V\mathbf{z} + \bar{\mathbf{x}}$ (reconstructs the data from the latent representation)
The matrix $V$ comprises the top $Q$ eigenvectors (principal components) of the data covariance matrix.
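This encoder/decoder view translates directly into a few lines of NumPy. The following sketch is illustrative, standalone code (its variable names are not from the example below); it builds $V$ from the top $Q$ eigenvectors of the covariance matrix and applies the encode/decode maps above to a random data matrix:
```python=
import numpy as np

def pca_fit(X, Q):
    """Return the data mean and the top-Q eigenvectors of the covariance matrix (columns of V)."""
    x_bar = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :Q]            # D x Q, largest eigenvalues first
    return x_bar, V

def pca_encode(X, x_bar, V):
    return (X - x_bar) @ V                 # z = V^T (x - x_bar), applied row-wise

def pca_decode(Z, x_bar, V):
    return Z @ V.T + x_bar                 # x_hat = V z + x_bar

# Usage: reduce 50-D data to 5 dimensions and reconstruct
X = np.random.randn(1000, 50)
x_bar, V = pca_fit(X, Q=5)
Z = pca_encode(X, x_bar, V)
X_hat = pca_decode(Z, x_bar, V)
print(Z.shape, np.mean((X - X_hat) ** 2))  # (1000, 5) and the reconstruction MSE
```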

#### Limitations of PCA
While PCA is a powerful and widely used technique, it has certain limitations:
* **Linearity Assumption:** PCA assumes a linear relationship between the original and latent variables. This can be restrictive when dealing with data that exhibits complex, non-linear patterns.
* **Sensitivity to Outliers:** PCA is susceptible to the influence of outliers, which can distort the principal components and lead to suboptimal representations. Robust PCA methods can be employed to address this issue.
## Autoencoders
Autoencoders represent a class of neural networks specifically designed to learn efficient representations of data by attempting to reconstruct their input. This process involves two key components: an encoder and a decoder.

### Encoder: Compressing the Essence
* **Mathematical Representation:** $g_{\phi}: \mathbf{x} \mapsto \mathbf{z}$
* **Functionality:** The encoder acts as a data compressor, transforming the input data $\mathbf{x}$ into a lower-dimensional latent representation $\mathbf{z}$. This latent space aims to encapsulate the most salient and informative aspects of the input, discarding noise or irrelevant details.
* **Common Layers:**
* **Convolutional layers:** Particularly useful for image data, these layers detect local patterns and features across the spatial dimensions of the input.
* **Pooling Layers:** Reduce dimensionality by downsampling, aiding in capturing more abstract representations and making the model less sensitive to small shifts in the input.
* **Fully Connected (Dense) Layers:** Enable general-purpose transformations and further refine the latent representation.
* **Non-linearity:** Activation functions (such as ReLU or sigmoid) introduce non-linearity, enabling the model to capture complex relationships within the data.
### Decoder: Reconstructing from the Compressed Form
* **Mathematical Representation:** $f_{\theta}: \mathbf{z} \mapsto \hat{\mathbf{x}}$
* **Functionality:** The decoder functions as a reconstruction engine, taking the compressed latent representation $\mathbf{z}$ and attempting to generate an output $\hat{\mathbf{x}}$ that closely resembles the original input $\mathbf{x}$.
* **Structure:** The decoder's architecture often mirrors that of the encoder, but in reverse. This symmetry helps to ensure that the information captured in the latent space can be effectively utilized for reconstruction.
* **Common Layers:**
* **Transposed Convolutional Layers (Deconvolutional Layers):** Upsample the latent representation to increase its spatial dimensions, effectively reversing the downsampling performed by convolutional and pooling layers in the encoder.

* **Unpooling Layers:** Counteract the pooling operation by restoring the original spatial dimensions. This often involves using information recorded during the encoding process (e.g., locations of maximum values in max pooling) to guide the unpooling.

#### Loss Function
The training objective of an autoencoder is to minimize the discrepancy between the original input $\mathbf{x}$ and its reconstructed counterpart $\hat{\mathbf{x}}$. This discrepancy, known as the reconstruction error, is often measured using the Mean Squared Error (MSE) loss function.
#### Advantage over PCA
Autoencoders have a distinct advantage over PCA in their capacity to learn non-linear relationships in the data. This allows them to capture more intricate patterns and potentially achieve more accurate reconstructions compared to the linear transformations employed by PCA.
#### Denoising Autoencoders: A Robust Variant
Denoising autoencoders are a specialized type of autoencoder trained on corrupted input data (with added noise) but tasked with reconstructing the clean, original data. This forces the model to learn to filter out noise and extract the essential underlying structure, making it useful for data preprocessing and denoising tasks.
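A denoising autoencoder differs from a standard one only in its training loop: noise is added to the input, while the clean input remains the reconstruction target. Below is a minimal PyTorch sketch of one such training step (illustrative only; `model`, `optimizer`, and `criterion` are assumed to be defined as in the example that follows, and `noise_factor` is a hypothetical hyperparameter):
```python=
import torch

def denoising_train_step(model, batch, optimizer, criterion, noise_factor=0.3):
    """One training step: corrupt the input with Gaussian noise, reconstruct the clean target."""
    model.train()
    noisy_batch = batch + noise_factor * torch.randn_like(batch)  # corrupt the input
    optimizer.zero_grad()
    reconstruction = model(noisy_batch)       # forward pass on the noisy data
    loss = criterion(reconstruction, batch)   # compare against the *clean* input
    loss.backward()
    optimizer.step()
    return loss.item()
```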

## Example: Dimensional Reduction using PCA and Autoencoder
This code example provides a practical demonstration of dimensionality reduction on a synthetic dataset using both Principal Component Analysis (PCA) and an Autoencoder. The dataset is generated using sklearn's `make_classification`, and we'll visually explore how PCA and the autoencoder transform the data into a lower-dimensional space.
### Key Points
* **Dataset:** Synthetic dataset with 10,000 samples, 50 features, and 3 classes.
* **Dimensionality Reduction:** Both PCA and the autoencoder will reduce the dimensionality to 5.
* **Visualization:** Scatter plots will showcase the first two components of the transformed data to reveal how the classes cluster in the lower-dimensional space.
* **Comparison:** This example highlights the contrast between PCA's linear transformation and the autoencoder's capability to capture non-linear relationships.
### Setup and Data Preparation
#### Import Modules
```python=
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import time
import numpy as np
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```
#### Hyperparameters
```python=
# Hyperparameters
RANDOM_SEED = 42
LEARNING_RATE = 1e-3
BATCH_SIZE = 128
NUM_EPOCHS = 500
HIDDEN_DIM = 5 # Target dimensionality for reduction
TEST_SIZE = 0.2
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```
#### Generate Data
```python=
# Set random seeds for reproducibility
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
# Generate Data
X, y = datasets.make_classification(n_samples=10000,
                                    n_features=50,
                                    n_redundant=10,
                                    n_informative=10,
                                    random_state=RANDOM_SEED,
                                    n_clusters_per_class=2,
                                    n_classes=3,
                                    class_sep=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
```
#### Normalization
```python=
# Normalize data
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```
### PCA
#### Fit PCA
```python=
# PCA
pca = PCA(n_components=HIDDEN_DIM)
pca_scores = pca.fit_transform(X_train_std)
```
#### Visualization of Eigenvalues
```python=
# Visualization of Eigenvalues
pca_full = PCA(n_components=None, svd_solver='auto')
pca_full.fit(X_train_std)
eigenvalues = pca_full.explained_variance_
total_variance = np.sum(eigenvalues)
explained_variance_ratio = eigenvalues / total_variance
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
plt.figure(figsize=(10, 6))
plt.bar(range(1, 51), explained_variance_ratio[:50], alpha=0.5, label='Eigenvalues')
plt.bar(range(1, 6), explained_variance_ratio[:5], color='red', label='First 5 Eigenvalues')
plt.xlabel('Principal Component')
plt.ylabel('Percentage of Explained Variance')
plt.title('Percentage of Variance Explained by Each Eigenvalue (All 50)')
plt.xticks(range(1, 51, 5))
plt.legend()
plt.show()
print(f"Cumulative explained variance for the first 5 components: {cumulative_variance_ratio[4]:.2%}")
```
```
Cumulative explained variance for the first 5 components: 31.61%
```

### Autoencoder
#### Autoencoder Class
```python=
# Autoencoder
class AutoEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 30),
            nn.LeakyReLU(0.1),
            nn.Linear(30, 10),
            nn.LeakyReLU(0.1),
            nn.Linear(10, hidden_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 10),
            nn.LeakyReLU(0.1),
            nn.Linear(10, 30),
            nn.LeakyReLU(0.1),
            nn.Linear(30, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def encode(self, x):
        return self.encoder(x)

model = AutoEncoder(X_train.shape[1], HIDDEN_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()
```
#### Build DataLoader
```python=
# Prepare data loaders
train_dataset = TensorDataset(torch.FloatTensor(X_train_std), torch.LongTensor(y_train))
test_dataset = TensorDataset(torch.FloatTensor(X_test_std), torch.LongTensor(y_test))
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
```
#### Train Autoencoder
```python=
# Training function
def train_autoencoder(model, train_loader, test_loader, num_epochs, optimizer, criterion):
    train_losses, test_losses = [], []
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for batch_features, _ in train_loader:
            batch_features = batch_features.to(DEVICE)
            optimizer.zero_grad()
            outputs = model(batch_features)
            loss = criterion(outputs, batch_features)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        model.eval()
        test_loss = 0
        with torch.no_grad():
            for batch_features, _ in test_loader:
                batch_features = batch_features.to(DEVICE)
                outputs = model(batch_features)
                loss = criterion(outputs, batch_features)
                test_loss += loss.item()
        test_loss /= len(test_loader)
        test_losses.append(test_loss)

        if (epoch + 1) % 50 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}')
    return train_losses, test_losses

# Train the model
start_time = time.time()
train_losses, test_losses = train_autoencoder(model, train_loader, test_loader, NUM_EPOCHS, optimizer, criterion)
print(f'Training time: {time.time() - start_time:.2f} seconds')
```
```
Epoch [50/500], Train Loss: 0.6420, Test Loss: 0.6512
Epoch [100/500], Train Loss: 0.6318, Test Loss: 0.6432
Epoch [150/500], Train Loss: 0.6278, Test Loss: 0.6398
Epoch [200/500], Train Loss: 0.6255, Test Loss: 0.6376
Epoch [250/500], Train Loss: 0.6237, Test Loss: 0.6366
Epoch [300/500], Train Loss: 0.6224, Test Loss: 0.6351
Epoch [350/500], Train Loss: 0.6211, Test Loss: 0.6347
Epoch [400/500], Train Loss: 0.6204, Test Loss: 0.6341
Epoch [450/500], Train Loss: 0.6196, Test Loss: 0.6342
Epoch [500/500], Train Loss: 0.6195, Test Loss: 0.6344
Training time: 149.64 seconds
```
#### Plot Loss Curve
```python=
# Plot loss curves
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Test Loss')
plt.legend()
plt.show()
# Encode the data
model.eval()
with torch.no_grad():
    encoded_data = model.encode(torch.FloatTensor(X_train_std).to(DEVICE)).cpu().numpy()
# Visualization
plt.figure(figsize=(20, 8))
```

### Visualization and Comparison
#### Plot Two Components of Compressed Data
```python=
# PCA plot
plt.subplot(121)
scatter = plt.scatter(pca_scores[:, 0], pca_scores[:, 1], c=y_train, cmap='viridis')
plt.colorbar(scatter)
plt.title('PCA')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
# Autoencoder plot
plt.subplot(122)
scatter = plt.scatter(encoded_data[:, 0], encoded_data[:, 1], c=y_train, cmap='viridis')
plt.colorbar(scatter)
plt.title('Autoencoder')
plt.xlabel('First Encoded Dimension')
plt.ylabel('Second Encoded Dimension')
plt.tight_layout()
plt.show()
```

#### Interpretation
* **PCA:** The scatter plot of PCA-transformed data might show some clustering of the three classes, but the separation may not be perfect due to PCA's linear nature.
* **Autoencoder:** The autoencoder's ability to learn non-linear transformations could lead to a clearer separation between classes in the latent space visualization.
## Variational Autoencoders (VAEs)
While traditional autoencoders are adept at learning compressed representations and reconstructing input data, they lack the inherent ability to generate new, plausible data samples. This limitation stems from their deterministic nature; they learn a direct mapping from input to latent space and back, without explicitly modeling the underlying probability distribution of the data. Variational Autoencoders (VAEs) address this shortcoming by introducing a probabilistic framework into the autoencoder architecture. This enables VAEs to not only learn meaningful representations but also to generate new samples that resemble the training data.
### The Structure of VAEs
VAEs maintain the fundamental encoder-decoder structure of autoencoders, but they diverge in their treatment of the latent space. Let's explore the key components of a VAE and how they facilitate the generation of new data.
#### Encoder & Decoder
Similar to autoencoders, VAEs employ an encoder to map input data to a latent space and a decoder to reconstruct the original data from this latent representation. The crucial difference lies in the encoder's output. Instead of producing a single point in the latent space, the VAE's encoder outputs two vectors:
1. Mean Vector $\mu$: Represents the center of a Gaussian distribution in the latent space.
2. Log-Variance Vector $\log \sigma^2$: Defines the spread (or uncertainty) of the Gaussian distribution in the latent space.
This probabilistic approach allows the VAE to capture the inherent variability in the data, enabling it to generate diverse samples during the decoding process.

#### Loss Function
The loss function used to train a VAE consists of two key parts:
1. **Reconstruction Loss:** This measures the fidelity of the reconstructed output compared to the original input. It quantifies how well the decoder, parameterized by $\theta$, can recreate the input data x from the latent representation z produced by the encoder. Mean Squared Error (MSE) is a common choice for this loss: $$L_{MSE}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N}||\mathbf{x}^{(i)} -f_{\theta}(g_{\phi}(\mathbf{x}^{(i)})) ||^2$$ where:
* $f_{\theta}$ represents the decoder with parameters $\theta$
* $g_{\phi}$ represents the encoder with parameters $\phi$
    * $\mathbf{x}^{(i)}$ is the $i$-th data point in the dataset
* $N$ is the total number of data points
2. **KL Divergence (Latent Loss):** This term acts as a regularizer, encouraging the distribution of the latent variables $z$ (as modeled by the encoder) to stay close to a predefined prior distribution, typically a standard normal distribution. The Kullback-Leibler (KL) divergence measures the dissimilarity between these two distributions: $$L_{KL}(\theta,\phi) = -\frac{1}{2}\sum_{i=1}^Q\left[1+\log(\sigma_i^2)-\mu_i^2-\sigma_i^2\right]$$ where:
    * $\mu$ and $\log(\sigma^2)$ are the mean and log-variance outputs of the encoder for a given input $\mathbf{x}$.
    * $Q$ is the dimensionality of the latent space.
    * $\sigma_i$ and $\mu_i$ are the $i$-th components of the standard deviation $\sigma$ and mean $\mu$ vectors, respectively
The total loss function for the VAE is then the sum of these two terms:
$$
L(\theta,\phi)=L_{MSE}(\theta,\phi)+L_{KL}(\theta,\phi)
$$
By minimizing this combined loss, the VAE learns to balance the trade-off between generating high-quality reconstructions and ensuring that its latent space is well-structured and amenable to sampling for new data generation.
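In PyTorch, both terms can be computed directly from the encoder outputs. A minimal sketch (assuming the encoder returns the mean `mu` and log-variance `log_var`; the image-generation example later in this section uses the same closed-form KL term):
```python=
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction (MSE) term plus the closed-form KL term for a Gaussian posterior."""
    reconstruction = F.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction + kl
```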
#### Reparameterization Trick
Having defined the VAE's loss function, the next step is to train the model using gradient descent. However, a challenge arises when we attempt to compute the gradient of the loss with respect to the encoder's parameters $\phi$. The issue stems from the fact that the latent variable $z$ is sampled randomly from a Gaussian distribution parameterized by the encoder's outputs ($\mu$ and $\sigma$).
Directly differentiating through this random sampling operation is not possible. To overcome this obstacle, we employ a technique called the **reparameterization trick**. Instead of directly sampling $z$, we re-express it as a deterministic function of the encoder's outputs and an auxiliary random variable:
$$
\mathbf{z} = \mu + \sigma \odot \epsilon
$$
where:
* $\mu$ and $\sigma$ are the mean and standard deviation output by the encoder for a given input $\mathbf{x}^{(i)}$.
* $\epsilon$ is an auxiliary random variable sampled from a standard normal distribution: $\epsilon \sim \mathcal{N}(0, I)$
* $\odot$ denotes element-wise multiplication.

This transformation effectively separates the stochasticity (represented by $\epsilon$) from the parameters we want to learn ($\mu$ and $\sigma$). Now, $z$ is a deterministic function of the encoder's outputs and an external random source. This allows us to backpropagate gradients through the network and update the encoder's parameters using standard optimization techniques.
The reparameterization trick is a cornerstone of VAEs, enabling their efficient training and empowering them to generate new, diverse samples by manipulating the latent space in a controlled manner.
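In code, the trick amounts to a single line of arithmetic. A minimal PyTorch sketch (the same logic appears as the `reparameterize` method in the VAE implementation later in this section):
```python=
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)   # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)      # the only source of randomness
    return mu + eps * std
```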
### The Theoretical Foundation of VAEs
Variational Autoencoders (VAEs) are grounded in the principle of maximizing the likelihood of the observed data, given a probabilistic model of the underlying latent variables that generate the data. Let's explore the mathematical framework that underpins VAEs and how they optimize this likelihood.
#### Evidence and the Underlying Latent Distribution
* **Observed Data and Latent Variables:** VAEs operate under the premise that the observed data $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)} \}$ arises from an unobserved latent distribution $Z$.
* **Maximizing the Evidence:** The central objective of a VAE is to learn a model that maximizes the likelihood of observing the data, referred to as the "evidence." Mathematically, the evidence is expressed as the log-likelihood of the data: $$\text{evidence} = \log p( \mathbf{x})$$
* **Connecting Observed Data and Latent Variables:** Since we assume that the observed data $X$ is generated from the latent variables $Z$, we can express the likelihood of the data as an expectation over the latent distribution: $$p(\mathbf{x}) = \int p(\mathbf{z}) p(\mathbf{x}|\mathbf{z}) d \mathbf{z} = \mathbb{E}_{\mathbf{z} \sim p( \mathbf{z})} p(\mathbf{x}|\mathbf{z})$$ This equation states that the probability of observing a particular data point $x$ is the average of the likelihood of generating $x$ given different values of the latent variable $z$, weighted by the prior probability of those $z$ values.
#### Intractability and Variational Inference
* **The Challenge of Direct Computation:** Unfortunately, directly computing the integral in the above equation is often computationally intractable, especially for complex models and high-dimensional data.
* **Variational Approximation:** To address this, VAEs employ a technique called variational inference. They introduce an approximate posterior distribution $q_\phi(\mathbf{z}|\mathbf{x})$ (parameterized by $\phi$), represented by the encoder network, which approximates the true but intractable posterior $p(\mathbf{z}|\mathbf{x})$.
* **Decoder as the Likelihood:** The decoder network, parameterized by $\theta$, models the likelihood $p_\theta(\mathbf{x}|\mathbf{z})$ of generating the observed data $x$ given a specific latent representation $z$.
#### Deriving the Evidence Lower Bound (ELBO)
* **KL Divergence and Bayes' Theorem:** By leveraging the non-negativity of the Kullback-Leibler (KL) divergence and Bayes' theorem, we can derive a lower bound on the log-likelihood of the data, known as the Evidence Lower Bound (ELBO): $$\log p(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z} \sim q_\phi (\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]-D_{KL}(q_\phi (\mathbf{z}|\mathbf{x})||p(\mathbf{z})) =: \text{ELBO}$$
* **Maximizing the ELBO:** The ELBO provides a tractable objective function that we can maximize during training. By maximizing the ELBO, we indirectly maximize the likelihood of the observed data, leading to a model that captures the underlying data distribution effectively.
#### Closed-form Expression and Loss Function
* **Assumptions:** To obtain a computationally convenient form of the ELBO, VAEs typically make two assumptions:
    1. The prior distribution over the latent variables is a standard normal distribution: $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$.
    2. The variational posterior $q_\phi(\mathbf{z}|\mathbf{x})$ is Gaussian with diagonal covariance: $q_\phi (\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$, where $\mu$ and $\sigma$ are the outputs of the encoder network.
* **KL Divergence Simplification:** Under these assumptions, the KL divergence term in the ELBO has a closed-form expression, making it easy to compute.
* **VAE Loss Function:** The final loss function for the VAE is defined as the negative ELBO.
By minimizing this loss function using gradient-based optimization and the reparameterization trick, the VAE learns to encode the data into a meaningful latent space and generate new samples that are consistent with the training data distribution.
#### KL Divergence between Two Multivariate Gaussian Distributions
Let's derive the Kullback-Leibler (KL) divergence between two multivariate Gaussian distributions. This derivation is relevant to the latent loss component of the VAE loss function.
**Theorem**
Consider two $Q$-dimensional Gaussian distributions with diagonal covariance matrices:
* $p(x)=\mathcal{N}(\mu^{(1)},\text{diag}({\sigma^{(1)}}^2))$
* $q(x)=\mathcal{N}(\mu^{(2)},\text{diag}({\sigma^{(2)}}^2))$
The KL divergence between $p$ and $q$ is given by:
$$
D_{KL}(p||q) = \sum_{i=1}^Q\left[\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2} + \frac{{\sigma_i^{(1)}}^2 + (\mu_i^{(1)}-\mu_i^{(2)})^2}{2{\sigma_i^{(2)}}^2}\right]
$$
**Proof.**
The KL divergence is defined as the expectation of the log-likelihood ratio under the distribution $p$:
$$
D_{KL}(p||q) = \mathbb{E}_{x\sim p(x)} \left[\log\frac{p(x)}{q(x)} \right]
$$
Substituting the probability density functions of the multivariate Gaussian distributions and simplifying, we get:
\begin{split}
D_{KL}(p||q)
&= \mathbb{E}_{x\sim p(x)}\left[\log\frac{p(x)}{q(x)}\right]\\
&= \mathbb{E}_{x\sim p(x)} \left[\log \frac{\frac{1}{\sqrt{(2\pi)^Q\det(\text{diag}({\sigma^{(1)}}^2))}}\exp\left(-\frac{1}{2}(x-\mu^{(1)})^T\text{diag}({\sigma^{(1)}}^2)^{-1}(x-\mu^{(1)})\right)}{\frac{1}{\sqrt{(2\pi)^Q\det(\text{diag}({\sigma^{(2)}}^2))}}\exp\left(-\frac{1}{2}(x-\mu^{(2)})^T\text{diag}({\sigma^{(2)}}^2)^{-1}(x-\mu^{(2)})\right)}\right]
\end{split}
Further simplification and expansion of terms leads to:
$$
D_{KL}(p||q) = \mathbb{E}_{x\sim p(x)} [\sum_{i=1}^Q \left(\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2}(x_i-\mu_i^{(1)})^2/{\sigma_i^{(1)}}^2+\frac{1}{2}(x_i-\mu_i^{(2)})^2/{\sigma_i^{(2)}}^2\right) ]
$$
By taking the expectation and rearranging terms, we arrive at the final expression:
$$
D_{KL}(p||q) = \sum_{i=1}^Q\left[\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2}+\frac{{\sigma_i^{(1)}}^2+(\mu_i^{(1)}-\mu_i^{(2)})^2}{2{\sigma_i^{(2)}}^2}\right] \quad \Box
$$
This result is utilized in the VAE's loss function to quantify the divergence between the learned latent distribution and the prior distribution, encouraging the model to learn a structured and informative latent space.
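The closed-form expression can be sanity-checked against `torch.distributions`, which implements the KL divergence between diagonal Gaussians. A small illustrative sketch (random parameters, not tied to the examples in this section):
```python=
import torch
from torch.distributions import Normal, kl_divergence

Q = 4
mu1, sigma1 = torch.randn(Q), torch.rand(Q) + 0.1   # parameters of p
mu2, sigma2 = torch.randn(Q), torch.rand(Q) + 0.1   # parameters of q

# Closed-form expression derived above
closed_form = torch.sum(torch.log(sigma2 / sigma1) - 0.5
                        + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2))

# Library value: the KL between diagonal Gaussians is the sum of per-dimension KLs
library = kl_divergence(Normal(mu1, sigma1), Normal(mu2, sigma2)).sum()

print(torch.allclose(closed_form, library))  # expected: True
```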
### $\beta$-VAE: Addressing KL Vanishing
In practice, VAEs can encounter a phenomenon called "KL vanishing" (also known as posterior collapse). Here the KL term is driven to (near) zero early in training: the approximate posterior collapses onto the prior, the latent code carries little information about the input, and the decoder learns to ignore it. The result is a poorly structured latent space and blurry, generic generated samples.
To mitigate KL vanishing, researchers have proposed the $\beta$-VAE, which introduces a scaling factor $\beta$ to the KL divergence term in the loss function:
$$
\text{Loss} = \text{Reconstruction Loss} + \beta \cdot \text{KL Loss}
$$
The value of $\beta$ is typically scheduled to increase gradually during training, allowing the model to focus on reconstruction initially and then progressively emphasize the regularization imposed by the KL divergence.
Several strategies for scheduling $\beta$ have been explored, including the following (a short sketch of two of them appears after this list):
* **Linear Increase:** $\beta$ increases linearly with the number of training epochs.
* **Monotonic Increase:** $\beta$ increases monotonically but not necessarily linearly, allowing for more flexible control over the trade-off between reconstruction and regularization.
* **Cyclical Annealing:** $\beta$ varies cyclically between high and low values, potentially helping the model escape local minima and explore a wider range of the latent space.
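The sketch below illustrates two of these schedules as plain Python functions of the epoch index (the constants are illustrative; the image-generation example later in this section uses its own exponential schedule):
```python=
def beta_linear(epoch, num_epochs, beta_max=1.0):
    """Linear increase from 0 to beta_max over the course of training."""
    return beta_max * min(1.0, epoch / num_epochs)

def beta_cyclical(epoch, cycle_length=1000, beta_max=1.0, ramp_fraction=0.5):
    """Cyclical annealing: ramp up during the first half of each cycle, then hold at beta_max."""
    position = (epoch % cycle_length) / cycle_length
    return beta_max * min(1.0, position / ramp_fraction)

# Example: beta values at a few epochs
print([round(beta_linear(e, 10000), 2) for e in (0, 2500, 5000, 10000)])  # [0.0, 0.25, 0.5, 1.0]
print([round(beta_cyclical(e), 2) for e in (0, 250, 500, 750)])           # [0.0, 0.5, 1.0, 1.0]
```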

## Example: Image Generation using a VAE
Let's dive into a hands-on implementation of a Variational Autoencoder (VAE) for image generation. We will leverage the [AT&T Database of Faces dataset](https://www.kaggle.com/datasets/kasikrit/att-database-of-faces), which comprises 400 grayscale face images from 40 distinct individuals. Each image is 112 pixels in height and 92 pixels in width. Our objective is to train a VAE to learn a compressed, meaningful representation of these faces, enabling us to generate new, realistic face images.
### Setup and Data Loading
#### Import Modules
```python=
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd
from glob import glob
from PIL import Image
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import urllib.request
import zipfile
import os
import math
```
#### Hyperparameters
```python=
# Hyperparameters
RANDOM_SEED = 42
LEARNING_RATE = 1e-4
BATCH_SIZE = 64
NUM_EPOCHS = 10000
HIDDEN_DIM = 100
LATENT_DIM = 10
BETA_MAX = 1/30
TEST_SIZE = 0.2
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Set random seeds for reproducibility
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
```
#### Download and Prepare Data
```python=
# Download and prepare data
def download_and_extract_data():
    url = 'http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data/att_faces.zip'
    filename = 'data/att_faces.zip'
    os.makedirs('data', exist_ok=True)
    urllib.request.urlretrieve(url, filename)
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall('data')

download_and_extract_data()

# Load and preprocess data
def load_data():
    image_paths = glob('data/s*/*.pgm')
    images = [np.array(Image.open(path)).flatten() for path in image_paths]
    return pd.DataFrame(images)

faces = load_data()

# Visualize some examples from the dataset
fig, axes = plt.subplots(10, 10, figsize=(10, 12),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.01, wspace=0.01))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.iloc[i].values.reshape(112, 92), cmap="gray")
fig.suptitle('AT&T Database of Faces')
plt.subplots_adjust(top=0.95)
plt.show()
```

#### Data Normalization and Loaders
```python=
# Calculate mean and standard deviation for normalization
mean, std = faces.stack().mean(), faces.stack().std()
print(f'Mean: {mean:.4f}, Std: {std:.4f}')

# Define transformations for the dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

# Create a custom dataset class
class FaceDataset(Dataset):
    def __init__(self, dataframe, transform):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        image = self.dataframe.iloc[index].to_numpy().astype(float).reshape(112, 92)
        return self.transform(image), self.transform(image)

train, test = train_test_split(faces, test_size=TEST_SIZE, random_state=RANDOM_SEED)
trainset = FaceDataset(train, transform=transform)
testset = FaceDataset(test, transform=transform)
train_loader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False)
```
```
Mean: 112.6313, Std: 49.9248
```
### VAE Model Definition
Before implementing the VAE, we define two small helper layers that the architecture needs in practice.
#### Helper Layers
* **Reshape layer:** Converts a flattened 1D vector into 2D feature maps, essential for transitioning between convolutional and linear layers.
* **Trim layer:** Adjusts the output dimensions of transposed convolutions to match the original image size, accounting for potential slight mismatches due to strides.
```python=
# Helper layers
class Reshape(nn.Module):
    def __init__(self, *args):
        super(Reshape, self).__init__()
        self.shape = args

    def forward(self, x):
        # Reshape a flat (batch, C*H*W) tensor back into (batch, C, H, W) feature maps
        return x.view((-1,) + self.shape)

class Trim(nn.Module):
    def __init__(self):
        super(Trim, self).__init__()

    def forward(self, x):
        # Crop the last row and column so the output matches the 112 x 92 input size
        return x[:, :, :-1, :-1]
```
#### VAE Architecture
```python=
class VAE(nn.Module):
    def __init__(self, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.Flatten(),
            nn.Linear(41216, hidden_dim)   # 64 channels x 28 x 23 feature maps = 41216
        )
        self.mean_layer = nn.Linear(hidden_dim, latent_dim)
        self.logvar_layer = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.01),
            nn.Linear(hidden_dim, 41216),
            Reshape(64, 28, 23),
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=0),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=1, padding=0),
            Trim()                          # crop the 113 x 93 output back to 112 x 92
        )

    def encode(self, x):
        x = self.encoder(x)
        return self.mean_layer(x), self.logvar_layer(x)

    def reparameterize(self, mean, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mean + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mean, log_var = self.encode(x)
        z = self.reparameterize(mean, log_var)
        return self.decode(z), mean, log_var

# Beta scheduling function
def beta(epoch):
    return BETA_MAX * (1 - math.exp(-epoch / 1000))

# Loss function
def loss_function(x, x_hat, mean, log_var, epoch):
    reconstruction_loss = F.mse_loss(x_hat, x, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return reconstruction_loss + beta(epoch) * kl_divergence, reconstruction_loss, kl_divergence

# Initialize the VAE model and optimizer
model = VAE(HIDDEN_DIM, LATENT_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```
### Training the VAE
#### Train the Model
```python=
# Training function
def train_vae(num_epochs, model, optimizer, train_loader, test_loader, logging_interval=1000):
    log_dict = {'train_loss': [], 'test_loss': [], 'reconstruction_loss': [], 'kl_loss': []}
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        reconstruction_loss = 0
        kl_loss = 0
        for batch_idx, (data, _) in enumerate(train_loader):
            data = data.to(DEVICE).to(torch.float32)
            optimizer.zero_grad()
            recon_batch, mean, log_var = model(data)
            loss, recon, kl = loss_function(data, recon_batch, mean, log_var, epoch)
            loss.backward()
            train_loss += loss.item()
            reconstruction_loss += recon.item()
            kl_loss += kl.item()
            optimizer.step()
        avg_loss = train_loss / len(train_loader.dataset)
        avg_recon = reconstruction_loss / len(train_loader.dataset)
        avg_kl = kl_loss / len(train_loader.dataset)
        log_dict['train_loss'].append(avg_loss)
        log_dict['reconstruction_loss'].append(avg_recon)
        log_dict['kl_loss'].append(avg_kl)

        model.eval()
        test_loss = 0
        with torch.no_grad():
            for data, _ in test_loader:
                data = data.to(DEVICE).to(torch.float32)
                recon_batch, mean, log_var = model(data)
                loss, _, _ = loss_function(data, recon_batch, mean, log_var, epoch)
                test_loss += loss.item()
        test_loss /= len(test_loader.dataset)
        log_dict['test_loss'].append(test_loss)

        if (epoch + 1) % logging_interval == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, '
                  f'Train loss: {avg_loss:.5f}, '
                  f'Test loss: {test_loss:.5f}, '
                  f'Recon loss: {avg_recon:.5f}, '
                  f'KL loss: {avg_kl:.5f}')
    return log_dict

# Train the VAE model
log_dict = train_vae(NUM_EPOCHS, model, optimizer, train_loader, test_loader)
```
```
Epoch: 1000/10000, Train loss: 493.85192, Test loss: 2810.84810, Recon loss: 492.70334, KL loss: 54.54246
Epoch: 2000/10000, Train loss: 307.99849, Test loss: 3057.58281, Recon loss: 306.27050, KL loss: 59.96282
Epoch: 3000/10000, Train loss: 269.58483, Test loss: 3291.74937, Recon loss: 267.70620, KL loss: 59.31493
Epoch: 4000/10000, Train loss: 220.69615, Test loss: 3364.86553, Recon loss: 218.69842, KL loss: 61.05127
Epoch: 5000/10000, Train loss: 198.57227, Test loss: 3339.15273, Recon loss: 196.58514, KL loss: 60.01872
Epoch: 6000/10000, Train loss: 176.84017, Test loss: 3367.27002, Recon loss: 174.78538, KL loss: 61.79695
Epoch: 7000/10000, Train loss: 163.33802, Test loss: 3508.20732, Recon loss: 161.23345, KL loss: 63.19488
Epoch: 8000/10000, Train loss: 146.88067, Test loss: 3478.91826, Recon loss: 144.83660, KL loss: 61.34283
Epoch: 9000/10000, Train loss: 135.00193, Test loss: 3535.44800, Recon loss: 132.88695, KL loss: 63.45724
Epoch: 10000/10000, Train loss: 127.29259, Test loss: 3531.07798, Recon loss: 125.24019, KL loss: 61.57497
```
#### Plot Loss Curve
```python=
# Plot loss curves
plt.figure(figsize=(10, 7))
plt.plot(log_dict['train_loss'], label='Train Loss')
plt.plot(log_dict['test_loss'], label='Test Loss')
plt.plot(log_dict['reconstruction_loss'], label='Reconstruction Loss')
plt.plot(log_dict['kl_loss'], label='KL Divergence')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('VAE Training Progress')
plt.show()
```

### Evaluation and Generation
#### Generate New Images
```python=
# Function to generate images
def generate_images(model, num_images=10):
    model.eval()
    with torch.no_grad():
        z = torch.randn(num_images, LATENT_DIM).to(DEVICE)   # sample from the prior N(0, I)
        sample = model.decode(z).cpu()
        sample = sample * std + mean                         # undo the normalization
    return sample

# Generate and display images
generated_images = generate_images(model)
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(generated_images[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

#### Generate Images by Interpolating in the Latent Space
```python=
# Function to interpolate between two latent vectors
def interpolate_latent(model, start_img, end_img, steps=10):
    model.eval()
    with torch.no_grad():
        start_z = model.encode(start_img.to(torch.float32).unsqueeze(0).to(DEVICE))[0]
        end_z = model.encode(end_img.to(torch.float32).unsqueeze(0).to(DEVICE))[0]
        interpolated = torch.zeros(steps, LATENT_DIM)
        for i in range(steps):
            interpolated[i] = start_z + (end_z - start_z) * (i / (steps - 1))
        interpolated_imgs = model.decode(interpolated.to(DEVICE)).cpu()
        interpolated_imgs = interpolated_imgs * std + mean
    return interpolated_imgs

# Interpolate between two random images
start_img, _ = trainset[np.random.randint(len(trainset))]
end_img, _ = trainset[np.random.randint(len(trainset))]
interpolated_imgs = interpolate_latent(model, start_img, end_img)
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(interpolated_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

## Reference
* Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 17. [link](https://udlbook.github.io/udlbook/)
* Geiger, A. (n.d.). Deep Learning. Lecture 11. [link](https://drive.google.com/file/d/1kS-QLlkDvOCEQqgrG02uezcWEHXuxf6g/view)
* Liu, C.-H. (n.d.). 深度學習Paper系列(04):Variational Autoencoder (VAE). [link](https://tomohiroliu22.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92paper%E7%B3%BB%E5%88%97-04-variational-autoencoder-vae-a7fbc67f0a2)
* Lee, H.-Y. (n.d.). ML Lecture 18: Unsupervised Learning - Deep Generative Model (Part II). [link](https://www.youtube.com/watch?v=8zomhgKrsmQ&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=27)
* Noh, H., Hong, S., & Han, B. (2015). Learning Deconvolution Network for Semantic Segmentation. arXiv preprint arXiv:1505.04366.
* The Reparameterization Trick in Variational Autoencoders. (n.d.). Baeldung. [link](https://www.baeldung.com/cs/vae-reparameterization)
* Odaibo, S. G. (2019). Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function. arXiv preprint arXiv:1907.08956.
* Hasan, S. (n.d.). AutoEncoders: Theory + PyTorch Implementation. [link](https://medium.com/@syed_hasan/autoencoders-theory-pytorch-implementation-a2e72f6f7cb7)
* Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., & Carin, L. (2019). Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). [link](https://aclanthology.org/N19-1021.pdf)
* Castiglioni, A. (n.d.). Dimensionality reduction with Autoencoders versus PCA. Towards Data Science. [link](https://towardsdatascience.com/dimensionality-reduction-with-autoencoders-versus-pca-f47666f80743)