# Autoencoders and Variational Autoencoders ## Introduction Autoencoders and Variational Autoencoders (VAEs) are powerful neural network architectures that learn compressed representations of data, known as latent representations. These models have a wide array of applications, including: * **Dimensionality Reduction:** Reduce the complexity of high-dimensional data while preserving essential information. This is useful for visualization, data compression, and improving the performance of machine learning algorithms on large datasets. * **Denoising:** Remove noise from corrupted data, enhancing its quality for subsequent analysis or tasks. * **Anomaly Detection:** Identify unusual patterns or outliers in data by leveraging the reconstruction error of the autoencoder. Data points with high reconstruction errors are likely anomalies. * **Generative Modeling:** VAEs, in particular, can generate new, realistic samples from the learned data distribution. This capability is valuable in creative applications such as image generation, music composition, and text synthesis. ## Latent Variable Models Latent variable models establish a map between an observation space $\mathbf{x} \in \mathbb{R}^D$ and a latent space $\mathbf{z} \in \mathbb{R}^Q$. The objective is to learn a compressed representation ($Q<D$) that captures the essential structure within the data. These models can be classified as follows: * **Linearity vs. Non-Linearity:** The mapping between the observed and latent spaces can be either linear (e.g., PCA) or non-linear (e.g., autoencoders, VAEs). * **Deterministic vs. Probabilistic:** The mapping can be deterministic, where each input is mapped to a single point in the latent space, or probabilistic, where the latent representation is a probability distribution. * **Explicit Encoder:** Some models explicitly define an encoder network to learn the mapping to the latent space. Others, like GANs, learn the mapping implicitly through an adversarial training process. | Feature | Deterministic | Probabilistic | |---------|----------------------------|----------------------------------| | Linear | Principal Component Analysis (PCA) | Probabilistic PCA | | Non-Linear with encoder | Autoencoder | Variational Autoencoder (VAE) | | Non-Linear without encoder | | Generative Adversarial Network (GAN) | ## Principal Component Analysis (PCA) PCA is a foundational linear dimensionality reduction technique. It can be interpreted as a special case of a linear autoencoder, but without any hidden layers. The core idea of PCA is to find a lower-dimensional subspace that captures the most significant variance present in the data. This is achieved by minimizing the reconstruction error (the difference between the original data and its projection back from the lower-dimensional space), which, interestingly, is mathematically equivalent to maximizing the variance within the latent (lower-dimensional) representation itself. ### Theory Behind PCA: Unveiling the Equivalence Let's delve into the mathematical underpinnings of PCA to demonstrate the equivalence between minimizing reconstruction error and maximizing variance. #### Problem Setup We begin by defining: * **Observed data:** $\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^D$, where each $\mathbf{x}_i$ is a data point in the original, high-dimensional space. * **Latent representations:** $\mathbf{z}_1,\dots,\mathbf{z}_n \in \mathbb{R}^{Q}$, where each $\mathbf{z}_i$ is the corresponding low-dimensional representation of $\mathbf{x}_i$. 
* **Reconstructed data:** $\hat{\mathbf{x}}_i$, obtained by linearly mapping $\mathbf{z}_i$ back to the original space. Mathematically: $$\mathbf{\hat{x}}_i = \mathbf{\bar{x}} + \sum_{j=1}^Q z_{ij} \mathbf{v}_j$$ where:
    * $\mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^N\mathbf{x}_i$ is the mean of the observed data.
    * $V = \{\mathbf{v}_1,\dots,\mathbf{v}_Q\}$ is an orthonormal set of vectors (principal components) in $\mathbb{R}^D$.
    * $z_{ij}$ are the coefficients representing how much each principal component contributes to reconstructing the $i$-th data point.

#### L2 Reconstruction Loss

We measure the quality of the reconstruction using the L2 (Euclidean) distance between the original and reconstructed data points, summed over all data points:

$$ \mathcal{L}(Z,V) = \sum_{i=1}^N||\mathbf{\hat{x}}_i-\mathbf{x}_i||^2 $$

#### Equivalence to Variance Maximization

Through mathematical derivation (detailed below), we can show that minimizing this reconstruction loss is the same as finding the orthonormal vectors $V$ that maximize the variance of the latent representations $Z$:

$$ \mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T \text{Cov}(X, X) \mathbf{v} $$

where $\text{Cov}(X, X)$ is the covariance matrix of the observed data.

#### Solution & Interpretation

The solution to this maximization problem reveals that the optimal $\mathbf{v}$'s are the eigenvectors of the covariance matrix associated with the largest eigenvalues. These eigenvectors form the principal components, and the corresponding eigenvalues represent the amount of variance explained by each component.

#### Mathematical Derivation

We have

\begin{split} \mathcal{L}(Z,V)&=\sum_{i=1}^N||\mathbf{\hat{x}}_i-\mathbf{x}_i||^2 \\ &=\sum_{i=1}^N||\mathbf{\bar{x}}+\sum_{j=1}^Qz_{ij}\mathbf{v}_j-\mathbf{x}_i||^2 \\ &=\sum_{i=1}^N \left(\sum_{j=1}^Qz_{ij}\mathbf{v}_j^T+(\mathbf{\bar{x}}-\mathbf{x}_i)^T\right) \left(\sum_{j=1}^Qz_{ij}\mathbf{v}_j+\mathbf{\bar{x}}-\mathbf{x}_i\right) \\ &=\sum_{i=1}^N \left[\sum_{j=1}^Qz_{ij}^2+2(\mathbf{\bar{x}}-\mathbf{x}_i)^T \sum_{j=1}^Qz_{ij}\mathbf{v}_j + ||\mathbf{\bar{x}}-\mathbf{x}_i||^2\right] \end{split}

where the last step uses the orthonormality of the vectors $\mathbf{v}_j$. Taking the partial derivative with respect to $z_{ij}$, we get:

$$ \frac{\partial \mathcal{L}(Z, V)}{\partial z_{ij}} = 2z_{ij} + 2 (\mathbf{\bar{x}} - \mathbf{x}_i)^T \mathbf{v}_j $$

Setting this derivative to zero and solving for the optimal $z_{ij}$, we obtain:

$$ z_{ij}^* = -(\mathbf{\bar{x}} - \mathbf{x}_i)^T \mathbf{v}_j $$

Let $Z^*$ denote the matrix of these optimal coefficients.
Substituting this optimal value back into the loss function, we have: \begin{split} \mathcal{L}(Z^*,V) &= \sum_{i=1}^N \left[\sum_{j=1}^Q (z_{ij}^*)^2+2(\mathbf{\bar{x}}-\mathbf{x}_i)^T \sum_{j=1}^Q z_{ij} \mathbf{v}_j + ||\mathbf{\bar{x}}-\mathbf{x}_i||^2 \right]\\ &= -\sum_{i=1}^N\sum_{j=1}^Q \mathbf{v}_j^T(\mathbf{\bar{x}}-\mathbf{x}_i)(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x}_i||^2\\ &=-N\sum_{j=1}^Q \mathbf{v}_j^T\frac{1}{N}\sum_{i=1}^N(\mathbf{\bar{x}}-\mathbf{x}_i)(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x_i}||^2\\ &=-N\sum_{j=1}^Q \mathbf{v}_j^T \text{Cov}(X,X) \mathbf{v}_j+\sum_{i=1}^N||\mathbf{\bar{x}}-\mathbf{x}_i||^2 \end{split} To minimize this loss, we need to find the vectors $\mathbf{v}$ that maximize the following expression: $$ \mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T \text{Cov}(X, X) \mathbf{v} $$ Alternatively, if we substitute $z_{ij}^*=-(\mathbf{\bar{x}}-\mathbf{x}_i)^T\mathbf{v}_j$ back into the loss function, we get: \begin{align*} \mathcal{L}(Z^*, V) &= - \sum_{i=1}^N \sum_{j=1}^Q (z_{ij}^*)^2 + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2 \\ &= -N \sum_{j=1}^Q \frac{1}{N} \sum_{i=1}^N (z_{ij}^*)^2 + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2 \\ &= -N \sum_{j=1}^Q \text{Cov}(Z_j^*, Z_j^*) + \sum_{i=1}^N ||\mathbf{\bar{x}} - \mathbf{x}_i||^2 \end{align*} This demonstrates the equivalence between minimizing the L2 reconstruction loss and maximizing the variance of the latent variables $Z$. To find the optimal $V$, we use the method of Lagrange multipliers to solve the constrained optimization problem: $$ \mathbf{v} = \mathop{\arg\max}\limits_{v \in \mathbb{R}^D, ||v|| = 1} \mathbf{v}^T C \mathbf{v} $$ where $C = \text{Cov}(X,X)$. We define the Lagrangian: $$ f(\mathbf{v}, \lambda) = \mathbf{v}^T C \mathbf{v} - \lambda (\mathbf{v}^T \mathbf{v} - 1) $$ Taking the partial derivative with respect to $v_i$ and setting it to zero, we get: \begin{split} \frac{\partial f(\mathbf{v}, \lambda)}{\partial v_i} &= C_{i, \cdot} \mathbf{v} + \mathbf{v}^T C_{\cdot, i} - 2 \lambda v_i \\ &= 2 C_{i, \cdot} \mathbf{v} - 2 \lambda v_i = 0 \end{split} This leads to the equation $C\mathbf{v}=\lambda\mathbf{v}$, which means that $\mathbf{v}$ is an eigenvector of the covariance matrix $\text{Cov}(X,X)$. #### The Structure of PCA PCA can be succinctly described by its encoder and decoder: * **Input:** $\mathbf{x}$ (data point in the original space) * **Encoder:** $\mathbf{z} = V^T (\mathbf{x}− \bar{\mathbf{x}})$ (projects the data onto the principal components) * **Decoder:** $\hat{\mathbf{x}} = V\mathbf{z} + \bar{\mathbf{x}}$ (reconstructs the data from the latent representation) The matrix $V$ comprises the top $Q$ eigenvectors (principal components) of the data covariance matrix. ![Screenshot 2024-08-21 at 10.16.36 AM](https://hackmd.io/_uploads/rkWbWsXiR.png) #### Limitations of PCA While PCA is a powerful and widely used technique, it has certain limitations: * **Linearity Assumption:** PCA assumes a linear relationship between the original and latent variables. This can be restrictive when dealing with data that exhibits complex, non-linear patterns. * **Sensitivity to Outliers:** PCA is susceptible to the influence of outliers, which can distort the principal components and lead to suboptimal representations. Robust PCA methods can be employed to address this issue. 
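#### PCA in a Few Lines of NumPy

To tie the derivation to the encoder/decoder view above, here is a minimal NumPy sketch (separate from the worked example below) that computes the top-$Q$ eigenvectors of the covariance matrix and applies the PCA encoder and decoder maps. The toy data and all variable names are illustrative assumptions, not part of any library API.

```python=
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))       # N=500 observations in D=10 dimensions
Q = 3                                # target latent dimensionality

x_bar = X.mean(axis=0)               # mean of the observed data
C = np.cov(X - x_bar, rowvar=False)  # D x D covariance matrix Cov(X, X)

# Eigendecomposition: columns of eigvecs are orthonormal eigenvectors of C.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]    # sort by decreasing eigenvalue
V = eigvecs[:, order[:Q]]            # top-Q principal components (D x Q)

Z = (X - x_bar) @ V                  # encoder: z = V^T (x - x_bar)
X_hat = Z @ V.T + x_bar              # decoder: x_hat = V z + x_bar

recon_error = np.sum((X_hat - X) ** 2)
explained = eigvals[order[:Q]].sum() / eigvals.sum()
print(f"L2 reconstruction error: {recon_error:.2f}, "
      f"variance explained by top {Q} PCs: {explained:.1%}")
```

The reconstruction error printed here is the loss $\mathcal{L}(Z^*, V)$ minimized in the derivation, and the explained-variance ratio reflects the eigenvalue interpretation given above.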
## Autoencoders

Autoencoders represent a class of neural networks specifically designed to learn efficient representations of data by attempting to reconstruct their input. This process involves two key components: an encoder and a decoder.

![autoencoder](https://hackmd.io/_uploads/Bk0RH525A.png)

### Encoder: Compressing the Essence

* **Mathematical Representation:** $g_{\phi}: \mathbf{x} \mapsto \mathbf{z}$
* **Functionality:** The encoder acts as a data compressor, transforming the input data $\mathbf{x}$ into a lower-dimensional latent representation $\mathbf{z}$. This latent space aims to encapsulate the most salient and informative aspects of the input, discarding noise or irrelevant details.
* **Common Layers:**
    * **Convolutional Layers:** Particularly useful for image data, these layers detect local patterns and features across the spatial dimensions of the input.
    * **Pooling Layers:** Reduce dimensionality by downsampling, aiding in capturing more abstract representations and making the model less sensitive to small shifts in the input.
    * **Fully Connected (Dense) Layers:** Enable general-purpose transformations and further refine the latent representation.
* **Non-linearity:** Activation functions (such as ReLU or sigmoid) introduce non-linearity, enabling the model to capture complex relationships within the data.

### Decoder: Reconstructing from the Compressed Form

* **Mathematical Representation:** $f_{\theta}: \mathbf{z} \mapsto \hat{\mathbf{x}}$
* **Functionality:** The decoder functions as a reconstruction engine, taking the compressed latent representation $\mathbf{z}$ and attempting to generate an output $\hat{\mathbf{x}}$ that closely resembles the original input $\mathbf{x}$.
* **Structure:** The decoder's architecture often mirrors that of the encoder, but in reverse. This symmetry helps to ensure that the information captured in the latent space can be effectively utilized for reconstruction.
* **Common Layers:**
    * **Transposed Convolutional Layers (Deconvolutional Layers):** Upsample the latent representation to increase its spatial dimensions, effectively reversing the downsampling performed by convolutional and pooling layers in the encoder.
    ![deconv](https://hackmd.io/_uploads/B13_1ExjR.png)
    * **Unpooling Layers:** Counteract the pooling operation by restoring the original spatial dimensions. This often involves using information recorded during the encoding process (e.g., locations of maximum values in max pooling) to guide the unpooling.
    ![unpooling](https://hackmd.io/_uploads/ByXFJ4ejA.png)

#### Loss Function

The training objective of an autoencoder is to minimize the discrepancy between the original input $\mathbf{x}$ and its reconstructed counterpart $\hat{\mathbf{x}}$. This discrepancy, known as the reconstruction error, is often measured using the Mean Squared Error (MSE) loss function.

#### Advantage over PCA

Autoencoders have a distinct advantage over PCA in their capacity to learn non-linear relationships in the data. This allows them to capture more intricate patterns and potentially achieve more accurate reconstructions compared to the linear transformations employed by PCA.

#### Denoising Autoencoders: A Robust Variant

Denoising autoencoders are a specialized type of autoencoder trained on corrupted input data (with added noise) but tasked with reconstructing the clean, original data. This forces the model to learn to filter out noise and extract the essential underlying structure, making it useful for data preprocessing and denoising tasks.
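To make the denoising setup concrete, here is a minimal, self-contained PyTorch sketch of the idea: the input is corrupted with Gaussian noise, but the reconstruction loss is computed against the clean signal. The architecture, dimensions, and noise level are illustrative assumptions, not the configuration used in the examples that follow.

```python=
import torch
import torch.nn as nn

# A tiny autoencoder; sizes are arbitrary and purely illustrative.
autoencoder = nn.Sequential(
    nn.Linear(50, 10), nn.ReLU(),   # encoder: compress to a 10-d latent code
    nn.Linear(10, 50)               # decoder: map the code back to input space
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.randn(128, 50)                      # a batch of (synthetic) clean inputs
for _ in range(100):
    x_noisy = x + 0.3 * torch.randn_like(x)   # corrupt the input with Gaussian noise
    x_hat = autoencoder(x_noisy)              # reconstruct from the corrupted version
    loss = criterion(x_hat, x)                # compare against the *clean* input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At inference time, the trained model can be applied directly to noisy inputs to produce denoised reconstructions.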
![ae_noise](https://hackmd.io/_uploads/rykE_reiA.png) ## Example: Dimensional Reduction using PCA and Autoencoder This code example provides a practical demonstration of dimensionality reduction on a synthetic dataset using both Principal Component Analysis (PCA) and an Autoencoder. The dataset is generated using sklearn's `make_classification`, and we'll visually explore how PCA and the autoencoder transform the data into a lower-dimensional space. ### Key Points * **Dataset:** Synthetic dataset with 10,000 samples, 50 features, and 3 classes. * **Dimensionality Reduction:** Both PCA and the autoencoder will reduce the dimensionality to 5. * **Visualization:** Scatter plots will showcase the first two components of the transformed data to reveal how the classes cluster in the lower-dimensional space. * **Comparison:** This example highlights the contrast between PCA's linear transformation and the autoencoder's capability to capture non-linear relationships. ### Setup and Data Preparation #### Import Modules ```python= import torch import torch.nn as nn import torch.nn.functional as F import matplotlib.pyplot as plt import time import numpy as np from torch.utils.data import TensorDataset, DataLoader from sklearn.model_selection import train_test_split from sklearn import datasets from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA ``` #### Hyperparameters ```python= # Hyperparameters RANDOM_SEED = 42 LEARNING_RATE = 1e-3 BATCH_SIZE = 128 NUM_EPOCHS = 500 HIDDEN_DIM = 5 # Target dimensionality for reduction TEST_SIZE = 0.2 DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') ``` #### Generate Data ```python= # Set random seeds for reproducibility np.random.seed(RANDOM_SEED) torch.manual_seed(RANDOM_SEED) # Generate Data X, y = datasets.make_classification(n_samples=10000, n_features=50, n_redundant=10, n_informative=10, random_state=RANDOM_SEED, n_clusters_per_class=2, n_classes=3, class_sep=2) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED) ``` #### Normalization ```python= # Normalize data sc = StandardScaler() X_train_std = sc.fit_transform(X_train) X_test_std = sc.transform(X_test) ``` ### PCA #### Fit PCA ```python= # PCA pca = PCA(n_components=HIDDEN_DIM) pca_scores = pca.fit_transform(X_train_std) ``` #### Visualization of Eigenvalues ```python= # Visualization of Eigenvalues pca_full = PCA(n_components=None, svd_solver='auto') pca_full.fit(X_train_std) eigenvalues = pca_full.explained_variance_ total_variance = np.sum(eigenvalues) explained_variance_ratio = eigenvalues / total_variance cumulative_variance_ratio = np.cumsum(explained_variance_ratio) plt.figure(figsize=(10, 6)) plt.bar(range(1, 51), explained_variance_ratio[:50], alpha=0.5, label='Eigenvalues') plt.bar(range(1, 6), explained_variance_ratio[:5], color='red', label='First 5 Eigenvalues') plt.xlabel('Principal Component') plt.ylabel('Percentage of Explained Variance') plt.title('Percentage of Variance Explained by Each Eigenvalue (All 50)') plt.xticks(range(1, 51, 5)) plt.legend() plt.show() print(f"Cumulative explained variance for the first 5 components: {cumulative_variance_ratio[4]:.2%}") ``` ``` Cumulative explained variance for the first 5 components: 31.61% ``` ![pca_ratio](https://hackmd.io/_uploads/rJeRJdb3A.png) ### Autoencoder #### Autoencoder Class ```python= # Autoencoder class AutoEncoder(nn.Module): def __init__(self, input_dim, hidden_dim): super().__init__() self.encoder = nn.Sequential( 
nn.Linear(input_dim, 30), nn.LeakyReLU(0.1), nn.Linear(30, 10), nn.LeakyReLU(0.1), nn.Linear(10, hidden_dim), ) self.decoder = nn.Sequential( nn.Linear(hidden_dim, 10), nn.LeakyReLU(0.1), nn.Linear(10, 30), nn.LeakyReLU(0.1), nn.Linear(30, input_dim) ) def forward(self, x): return self.decoder(self.encoder(x)) def encode(self, x): return self.encoder(x) model = AutoEncoder(X_train.shape[1], HIDDEN_DIM).to(DEVICE) optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE) criterion = nn.MSELoss() ``` #### Build DataLoader ```python= # Prepare data loaders train_dataset = TensorDataset(torch.FloatTensor(X_train_std), torch.LongTensor(y_train)) test_dataset = TensorDataset(torch.FloatTensor(X_test_std), torch.LongTensor(y_test)) train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True) test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False) ``` #### Train Autoencoder ```python= # Training function def train_autoencoder(model, train_loader, test_loader, num_epochs, optimizer, criterion): train_losses, test_losses = [], [] for epoch in range(num_epochs): model.train() train_loss = 0 for batch_features, _ in train_loader: batch_features = batch_features.to(DEVICE) optimizer.zero_grad() outputs = model(batch_features) loss = criterion(outputs, batch_features) loss.backward() optimizer.step() train_loss += loss.item() train_loss /= len(train_loader) train_losses.append(train_loss) model.eval() test_loss = 0 with torch.no_grad(): for batch_features, _ in test_loader: batch_features = batch_features.to(DEVICE) outputs = model(batch_features) loss = criterion(outputs, batch_features) test_loss += loss.item() test_loss /= len(test_loader) test_losses.append(test_loss) if (epoch + 1) % 50 == 0: print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}') return train_losses, test_losses # Train the model start_time = time.time() train_losses, test_losses = train_autoencoder(model, train_loader, test_loader, NUM_EPOCHS, optimizer, criterion) print(f'Training time: {time.time() - start_time:.2f} seconds') ``` ``` Epoch [50/500], Train Loss: 0.6420, Test Loss: 0.6512 Epoch [100/500], Train Loss: 0.6318, Test Loss: 0.6432 Epoch [150/500], Train Loss: 0.6278, Test Loss: 0.6398 Epoch [200/500], Train Loss: 0.6255, Test Loss: 0.6376 Epoch [250/500], Train Loss: 0.6237, Test Loss: 0.6366 Epoch [300/500], Train Loss: 0.6224, Test Loss: 0.6351 Epoch [350/500], Train Loss: 0.6211, Test Loss: 0.6347 Epoch [400/500], Train Loss: 0.6204, Test Loss: 0.6341 Epoch [450/500], Train Loss: 0.6196, Test Loss: 0.6342 Epoch [500/500], Train Loss: 0.6195, Test Loss: 0.6344 Training time: 149.64 seconds ``` #### Plot Loss Curve ```python= # Plot loss curves plt.figure(figsize=(10, 5)) plt.plot(train_losses, label='Train Loss') plt.plot(test_losses, label='Test Loss') plt.xlabel('Epoch') plt.ylabel('Loss') plt.title('Training and Test Loss') plt.legend() plt.show() # Encode the data model.eval() with torch.no_grad(): encoded_data = model.encode(torch.FloatTensor(X_train_std).to(DEVICE)).cpu().numpy() # Visualization plt.figure(figsize=(20, 8)) ``` ![ae_loss](https://hackmd.io/_uploads/rkFtbub2C.png) ### Visualization and Comparison #### Plot Two Components of Compressed Data ```python= # PCA plot plt.subplot(121) scatter = plt.scatter(pca_scores[:, 0], pca_scores[:, 1], c=y_train, cmap='viridis') plt.colorbar(scatter) plt.title('PCA') plt.xlabel('First Principal Component') plt.ylabel('Second Principal Component') # Autoencoder plot 
plt.subplot(122)
scatter = plt.scatter(encoded_data[:, 0], encoded_data[:, 1], c=y_train, cmap='viridis')
plt.colorbar(scatter)
plt.title('Autoencoder')
plt.xlabel('First Encoded Dimension')
plt.ylabel('Second Encoded Dimension')

plt.tight_layout()
plt.show()
```

![compressed2](https://hackmd.io/_uploads/H1Hnbu-3R.png)

#### Interpretation

* **PCA:** The scatter plot of PCA-transformed data might show some clustering of the three classes, but the separation may not be perfect due to PCA's linear nature.
* **Autoencoder:** The autoencoder's ability to learn non-linear transformations could lead to a clearer separation between classes in the latent space visualization.

## Variational Autoencoders (VAEs)

While traditional autoencoders are adept at learning compressed representations and reconstructing input data, they lack the inherent ability to generate new, plausible data samples. This limitation stems from their deterministic nature; they learn a direct mapping from input to latent space and back, without explicitly modeling the underlying probability distribution of the data.

Variational Autoencoders (VAEs) address this shortcoming by introducing a probabilistic framework into the autoencoder architecture. This enables VAEs to not only learn meaningful representations but also to generate new samples that resemble the training data.

### The Structure of VAEs

VAEs maintain the fundamental encoder-decoder structure of autoencoders, but they diverge in their treatment of the latent space. Let's explore the key components of a VAE and how they facilitate the generation of new data.

#### Encoder & Decoder

Similar to autoencoders, VAEs employ an encoder to map input data to a latent space and a decoder to reconstruct the original data from this latent representation. The crucial difference lies in the encoder's output. Instead of producing a single point in the latent space, the VAE's encoder outputs two vectors:

1. Mean Vector $\mu$: Represents the center of a Gaussian distribution in the latent space.
2. Log-Variance Vector $\log \sigma^2$: Defines the spread (or uncertainty) of the Gaussian distribution in the latent space.

This probabilistic approach allows the VAE to capture the inherent variability in the data, enabling it to generate diverse samples during the decoding process.

![vae structure](https://hackmd.io/_uploads/r1eW-X7jR.png)

#### Loss Function

The loss function used to train a VAE consists of two key parts:

1. **Reconstruction Loss:** This measures the fidelity of the reconstructed output compared to the original input. It quantifies how well the decoder, parameterized by $\theta$, can recreate the input data $\mathbf{x}$ from the latent representation $\mathbf{z}$ produced by the encoder. Mean Squared Error (MSE) is a common choice for this loss: $$L_{MSE}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N}||\mathbf{x}^{(i)} -f_{\theta}(g_{\phi}(\mathbf{x}^{(i)})) ||^2$$ where:
    * $f_{\theta}$ represents the decoder with parameters $\theta$
    * $g_{\phi}$ represents the encoder with parameters $\phi$
    * $\mathbf{x}^{(i)}$ is the $i$-th data point in the dataset
    * $N$ is the total number of data points
2. **KL Divergence (Latent Loss):** This term acts as a regularizer, encouraging the distribution of the latent variables $\mathbf{z}$ (as modeled by the encoder) to stay close to a predefined prior distribution, typically a standard normal distribution.
The Kullback-Leibler (KL) divergence measures the dissimilarity between these two distributions: $$L_{KL}(\theta,\phi) = -\frac{1}{2}\sum_{i=1}^Q\left[1+\log(\sigma_i^2)-\mu_i^2-\sigma_i^2\right]$$ where:
* $\mu$ and $\log(\sigma^2)$ are the mean and log-variance outputs of the encoder for a given input $\mathbf{x}$.
* $Q$ is the dimensionality of the latent space.
* $\sigma_i$ and $\mu_i$ are the $i$-th components of the standard deviation $\sigma$ and mean $\mu$ vectors, respectively.

The total loss function for the VAE is then the sum of these two terms:

$$ L(\theta,\phi)=L_{MSE}(\theta,\phi)+L_{KL}(\theta,\phi) $$

By minimizing this combined loss, the VAE learns to balance the trade-off between generating high-quality reconstructions and ensuring that its latent space is well-structured and amenable to sampling for new data generation.

#### Reparameterization Trick

Having defined the VAE's loss function, the next step is to train the model using gradient descent. However, a challenge arises when we attempt to compute the gradient of the loss with respect to the encoder's parameters $\phi$. The issue stems from the fact that the latent variable $\mathbf{z}$ is sampled randomly from a Gaussian distribution parameterized by the encoder's outputs ($\mu$ and $\sigma$). Directly differentiating through this random sampling operation is not possible.

To overcome this obstacle, we employ a technique called the **reparameterization trick**. Instead of directly sampling $\mathbf{z}$, we re-express it as a deterministic function of the encoder's outputs and an auxiliary random variable:

$$ \mathbf{z} = \mu + \sigma \odot \epsilon $$

where:

* $\mu$ and $\sigma$ are the mean and standard deviation output by the encoder for a given input $\mathbf{x}^{(i)}$.
* $\epsilon$ is an auxiliary random variable sampled from a standard normal distribution: $\epsilon \sim \mathcal{N}(0, I)$
* $\odot$ denotes element-wise multiplication.

![reparameterization](https://hackmd.io/_uploads/rJy-vKli0.png)

This transformation effectively separates the stochasticity (represented by $\epsilon$) from the parameters we want to learn ($\mu$ and $\sigma$). Now, $\mathbf{z}$ is a deterministic function of the encoder's outputs and an external random source. This allows us to backpropagate gradients through the network and update the encoder's parameters using standard optimization techniques.

The reparameterization trick is a cornerstone of VAEs, enabling their efficient training and empowering them to generate new, diverse samples by manipulating the latent space in a controlled manner.

### The Theoretical Foundation of VAEs

Variational Autoencoders (VAEs) are grounded in the principle of maximizing the likelihood of the observed data, given a probabilistic model of the underlying latent variables that generate the data. Let's explore the mathematical framework that underpins VAEs and how they optimize this likelihood.

#### Evidence and the Underlying Latent Distribution

* **Observed Data and Latent Variables:** VAEs operate under the premise that the observed data $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)} \}$ arises from an unobserved latent distribution $Z$.
* **Maximizing the Evidence:** The central objective of a VAE is to learn a model that maximizes the likelihood of observing the data, referred to as the "evidence."
Mathematically, the evidence is expressed as the log-likelihood of the data: $$\text{evidence} = \log p( \mathbf{x})$$
* **Connecting Observed Data and Latent Variables:** Since we assume that the observed data $X$ is generated from the latent variables $Z$, we can express the likelihood of the data as an expectation over the latent distribution: $$p(\mathbf{x}) = \int p(\mathbf{z}) p(\mathbf{x}|\mathbf{z}) d \mathbf{z} = \mathbb{E}_{\mathbf{z} \sim p( \mathbf{z})} \left[ p(\mathbf{x}|\mathbf{z}) \right]$$ This equation states that the probability of observing a particular data point $\mathbf{x}$ is the average of the likelihood of generating $\mathbf{x}$ given different values of the latent variable $\mathbf{z}$, weighted by the prior probability of those $\mathbf{z}$ values.

#### Intractability and Variational Inference

* **The Challenge of Direct Computation:** Unfortunately, directly computing the integral in the above equation is often computationally intractable, especially for complex models and high-dimensional data.
* **Variational Approximation:** To address this, VAEs employ a technique called variational inference. They introduce an approximate posterior distribution $q_\phi(\mathbf{z}|\mathbf{x})$ (parameterized by $\phi$), which is represented by the encoder network. This approximation aims to estimate the true, but intractable, posterior distribution $p(\mathbf{z}|\mathbf{x})$.
* **Decoder as the Likelihood:** The decoder network, parameterized by $\theta$, models the likelihood $p_\theta(\mathbf{x}|\mathbf{z})$ of generating the observed data $\mathbf{x}$ given a specific latent representation $\mathbf{z}$.

#### Deriving the Evidence Lower Bound (ELBO)

* **KL Divergence and Bayes' Theorem:** By leveraging the non-negativity of the Kullback-Leibler (KL) divergence and Bayes' theorem, we can derive a lower bound on the log-likelihood of the data, known as the Evidence Lower Bound (ELBO): $$\log p(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z} \sim q_\phi (\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]-D_{KL}(q_\phi (\mathbf{z}|\mathbf{x})||p(\mathbf{z})) =: \text{ELBO}$$
* **Maximizing the ELBO:** The ELBO provides a tractable objective function that we can maximize during training. By maximizing the ELBO, we indirectly maximize the likelihood of the observed data, leading to a model that captures the underlying data distribution effectively.

#### Closed-form Expression and Loss Function

* **Assumptions:** To obtain a computationally convenient form of the ELBO, VAEs typically make two assumptions:
    1. The prior distribution over the latent variables is a standard normal distribution: $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$.
    2. The variational posterior $q_\phi(\mathbf{z}|\mathbf{x})$ is also Gaussian: $q_\phi (\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mu, \sigma^2 I)$, where $\mu$ and $\sigma$ are the outputs of the encoder network.
* **KL Divergence Simplification:** Under these assumptions, the KL divergence term in the ELBO has a closed-form expression, making it easy to compute.
* **VAE Loss Function:** The final loss function for the VAE is defined as the negative ELBO. By minimizing this loss function using gradient-based optimization and the reparameterization trick, the VAE learns to encode the data into a meaningful latent space and generate new samples that are consistent with the training data distribution.

#### KL Divergence between Two Multivariate Gaussian Distributions

Let's derive the Kullback-Leibler (KL) divergence between two multivariate Gaussian distributions.
This derivation is relevant to the latent loss component of the VAE loss function.

**Theorem** Consider two multivariate Gaussian distributions:
* $p(x)=\mathcal{N}(\mu^{(1)},{\sigma^{(1)}}^2I)$
* $q(x)=\mathcal{N}(\mu^{(2)},{\sigma^{(2)}}^2I)$

where both distributions have dimension $Q$. The KL divergence between $p$ and $q$ is given by:

$$ D_{KL}(p||q) = \sum_{i=1}^Q\left[\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2} + \frac{{\sigma_i^{(1)}}^2 + (\mu_i^{(1)}-\mu_i^{(2)})^2}{2{\sigma_i^{(2)}}^2}\right] $$

**Proof.** The KL divergence is defined as the expectation of the log-likelihood ratio under the distribution $p$:

$$ D_{KL}(p||q) = \mathbb{E}_{x\sim p(x)} \left[\log\frac{p(x)}{q(x)} \right] $$

Substituting the probability density functions of the multivariate Gaussian distributions, we get:

$$ D_{KL}(p||q) = \mathbb{E}_{x\sim p(x)}\left[\log \left( \frac{\frac{1}{\sqrt{(2\pi)^Q\det({\sigma^{(1)}}^2I)}}\exp\left(-\frac{1}{2}(x-\mu^{(1)})^T{({\sigma^{(1)}}^2I)}^{-1}(x-\mu^{(1)})\right)}{\frac{1}{\sqrt{(2\pi)^Q\det({\sigma^{(2)}}^2I)}}\exp\left(-\frac{1}{2}(x-\mu^{(2)})^T{({\sigma^{(2)}}^2I)}^{-1}(x-\mu^{(2)})\right)}\right) \right] $$

Further simplification and expansion of terms leads to:

$$ D_{KL}(p||q) = \mathbb{E}_{x\sim p(x)} \left[\sum_{i=1}^Q \left(\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2}(x_i-\mu_i^{(1)})^2/{\sigma_i^{(1)}}^2+\frac{1}{2}(x_i-\mu_i^{(2)})^2/{\sigma_i^{(2)}}^2\right) \right] $$

By taking the expectation and rearranging terms, we arrive at the final expression:

$$ D_{KL}(p||q) = \sum_{i=1}^Q\left[\log\frac{\sigma_i^{(2)}}{\sigma_i^{(1)}}-\frac{1}{2}+\frac{{\sigma_i^{(1)}}^2+(\mu_i^{(1)}-\mu_i^{(2)})^2}{2{\sigma_i^{(2)}}^2}\right] \quad \Box $$

This result is utilized in the VAE's loss function to quantify the divergence between the learned latent distribution and the prior distribution, encouraging the model to learn a structured and informative latent space.

### $\beta$-VAE: Addressing KL Vanishing

In practice, VAEs can encounter a phenomenon called "KL vanishing." This occurs when the KL divergence term in the loss function becomes negligible during training, leading the model to prioritize reconstruction accuracy at the expense of learning a meaningful latent representation. The consequence is a model that generates blurry or low-quality samples, as the latent space collapses to a narrow region around the prior.

To mitigate KL vanishing, researchers have proposed the $\beta$-VAE, which introduces a scaling factor $\beta$ to the KL divergence term in the loss function:

$$ \text{Loss} = \text{Reconstruction Loss} + \beta \cdot \text{KL Loss} $$

The value of $\beta$ is typically scheduled to increase gradually during training, allowing the model to focus on reconstruction initially and then progressively emphasize the regularization imposed by the KL divergence. Several strategies for scheduling $\beta$ have been explored, including:

* **Linear Increase:** $\beta$ increases linearly with the number of training epochs.
* **Monotonic Increase:** $\beta$ increases monotonically but not necessarily linearly, allowing for more flexible control over the trade-off between reconstruction and regularization.
* **Cyclical Annealing:** $\beta$ varies cyclically between high and low values, potentially helping the model escape local minima and explore a wider range of the latent space.
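As a rough illustration of these scheduling strategies, here is a small sketch of how each one might be written as a function of the epoch index. The constants are arbitrary assumptions, and the image-generation example later in this post uses a saturating-exponential (monotonic) schedule of this kind.

```python=
import math

BETA_MAX = 1.0          # illustrative target weight for the KL term
WARMUP_EPOCHS = 1000    # illustrative warm-up length
CYCLE_LENGTH = 500      # illustrative cycle length for cyclical annealing

def beta_linear(epoch):
    """Linear increase: ramp beta from 0 to BETA_MAX over the warm-up period."""
    return BETA_MAX * min(1.0, epoch / WARMUP_EPOCHS)

def beta_monotonic(epoch):
    """Monotonic (saturating-exponential) increase, similar to the VAE example below."""
    return BETA_MAX * (1 - math.exp(-epoch / WARMUP_EPOCHS))

def beta_cyclical(epoch):
    """Cyclical annealing: ramp up during the first half of each cycle, then hold."""
    position = (epoch % CYCLE_LENGTH) / CYCLE_LENGTH
    return BETA_MAX * min(1.0, 2 * position)

# Whichever schedule is chosen, it enters the objective as:
# loss = reconstruction_loss + beta_schedule(epoch) * kl_loss
```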
![beta](https://hackmd.io/_uploads/SyYZs39iR.png)

## Example: Image Generation using a VAE

Let's dive into a hands-on implementation of a Variational Autoencoder (VAE) for image generation. We will leverage the [AT&T Database of Faces dataset](https://www.kaggle.com/datasets/kasikrit/att-database-of-faces), which comprises 400 grayscale face images from 40 distinct individuals. Each image is 112 pixels in height and 92 pixels in width. Our objective is to train a VAE to learn a compressed, meaningful representation of these faces, enabling us to generate new, realistic face images.

### Setup and Data Loading

#### Import Modules

```python=
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd
from glob import glob
from PIL import Image
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import urllib.request
import zipfile
import os
import math
```

#### Hyperparameters

```python=
# Hyperparameters
RANDOM_SEED = 42
LEARNING_RATE = 1e-4
BATCH_SIZE = 64
NUM_EPOCHS = 10000
HIDDEN_DIM = 100
LATENT_DIM = 10
BETA_MAX = 1/30
TEST_SIZE = 0.2
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Set random seeds for reproducibility
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
```

#### Download and Prepare Data

```python=
# Download and prepare data
def download_and_extract_data():
    url = 'http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data/att_faces.zip'
    filename = 'data/att_faces.zip'
    os.makedirs('data', exist_ok=True)
    urllib.request.urlretrieve(url, filename)
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall('data')

download_and_extract_data()

# Load and preprocess data
def load_data():
    image_paths = glob('data/s*/*.pgm')
    images = [np.array(Image.open(path)).flatten() for path in image_paths]
    return pd.DataFrame(images)

faces = load_data()

# Visualize some examples from the dataset
fig, axes = plt.subplots(10, 10, figsize=(10, 12),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.01, wspace=0.01))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.iloc[i].values.reshape(112, 92), cmap="gray")
fig.suptitle('AT&T Database of Faces')
plt.subplots_adjust(top=0.95)
plt.show()
```

![database](https://hackmd.io/_uploads/r1Bc8ib30.png)

#### Data Normalization and Loaders

```python=
# Calculate the mean and standard deviation for normalization
mean, std = faces.stack().mean(), faces.stack().std()
print(f'Mean: {mean:.4f}, Std: {std:.4f}')

# Define transformations for the dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

# Create a custom dataset class
class FaceDataset(Dataset):
    def __init__(self, dataframe, transform):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        image = self.dataframe.iloc[index].to_numpy().astype(float).reshape(112, 92)
        return self.transform(image), self.transform(image)

train, test = train_test_split(faces, test_size=TEST_SIZE, random_state=RANDOM_SEED)
trainset = FaceDataset(train, transform=transform)
testset = FaceDataset(test, transform=transform)
train_loader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False)
```

```
Mean: 112.6313, Std: 49.9248
```

### VAE Model Definition

Before implementing the VAE, we define two small helper layers that the model will need.

#### Helper Layers

* **Reshape layer:** Converts a flattened 1D vector into 2D feature maps, essential for transitioning between convolutional and linear layers.
* **Trim layer:** Adjusts the output dimensions of transposed convolutions to match the original image size, accounting for potential slight mismatches due to strides.

```python=
# Helper layers
class Reshape(nn.Module):
    def __init__(self, *args):
        super(Reshape, self).__init__()
        self.shape = args

    def forward(self, x):
        return x.view((-1,) + self.shape)

class Trim(nn.Module):
    def __init__(self):
        super(Trim, self).__init__()

    def forward(self, x):
        return x[:, :, :-1, :-1]
```

#### VAE Architecture

```python=
class VAE(nn.Module):
    def __init__(self, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.Flatten(),
            nn.Linear(41216, hidden_dim)
        )
        self.mean_layer = nn.Linear(hidden_dim, latent_dim)
        self.logvar_layer = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.01),
            nn.Linear(hidden_dim, 41216),
            Reshape(64, 28, 23),
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=0),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=1, padding=0),
            Trim()
        )

    def encode(self, x):
        x = self.encoder(x)
        return self.mean_layer(x), self.logvar_layer(x)

    def reparameterize(self, mean, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mean + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mean, log_var = self.encode(x)
        z = self.reparameterize(mean, log_var)
        return self.decode(z), mean, log_var

# Beta scheduling function
def beta(epoch):
    return BETA_MAX * (1 - math.exp(-epoch / 1000))

# Loss function
def loss_function(x, x_hat, mean, log_var, epoch):
    reconstruction_loss = F.mse_loss(x_hat, x, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return reconstruction_loss + beta(epoch) * kl_divergence, reconstruction_loss, kl_divergence

# Initialize the VAE model and optimizer
model = VAE(HIDDEN_DIM, LATENT_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```

### Training the VAE

#### Train the Model

```python=
# Training function
def train_vae(num_epochs, model, optimizer, train_loader, test_loader, logging_interval=1000):
    log_dict = {'train_loss': [], 'test_loss': [], 'reconstruction_loss': [], 'kl_loss': []}
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        reconstruction_loss = 0
        kl_loss = 0
        for batch_idx, (data, _) in enumerate(train_loader):
            data = data.to(DEVICE).to(torch.float32)
            optimizer.zero_grad()
            recon_batch, mean, log_var = model(data)
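            # model(data) returns the reconstruction along with the mean and
            # log-variance of q(z|x); loss_function combines the summed MSE with
            # the beta-weighted KL divergence for the current epoch.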
            loss, recon, kl = loss_function(data, recon_batch, mean, log_var, epoch)
            loss.backward()
            train_loss += loss.item()
            reconstruction_loss += recon.item()
            kl_loss += kl.item()
            optimizer.step()

        avg_loss = train_loss / len(train_loader.dataset)
        avg_recon = reconstruction_loss / len(train_loader.dataset)
        avg_kl = kl_loss / len(train_loader.dataset)
        log_dict['train_loss'].append(avg_loss)
        log_dict['reconstruction_loss'].append(avg_recon)
        log_dict['kl_loss'].append(avg_kl)

        model.eval()
        test_loss = 0
        with torch.no_grad():
            for data, _ in test_loader:
                data = data.to(DEVICE).to(torch.float32)
                recon_batch, mean, log_var = model(data)
                loss, _, _ = loss_function(data, recon_batch, mean, log_var, epoch)
                test_loss += loss.item()
        test_loss /= len(test_loader.dataset)
        log_dict['test_loss'].append(test_loss)

        if (epoch + 1) % logging_interval == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, '
                  f'Train loss: {avg_loss:.5f}, '
                  f'Test loss: {test_loss:.5f}, '
                  f'Recon loss: {avg_recon:.5f}, '
                  f'KL loss: {avg_kl:.5f}')
    return log_dict

# Train the VAE model
log_dict = train_vae(NUM_EPOCHS, model, optimizer, train_loader, test_loader)
```

```
Epoch: 1000/10000, Train loss: 493.85192, Test loss: 2810.84810, Recon loss: 492.70334, KL loss: 54.54246
Epoch: 2000/10000, Train loss: 307.99849, Test loss: 3057.58281, Recon loss: 306.27050, KL loss: 59.96282
Epoch: 3000/10000, Train loss: 269.58483, Test loss: 3291.74937, Recon loss: 267.70620, KL loss: 59.31493
Epoch: 4000/10000, Train loss: 220.69615, Test loss: 3364.86553, Recon loss: 218.69842, KL loss: 61.05127
Epoch: 5000/10000, Train loss: 198.57227, Test loss: 3339.15273, Recon loss: 196.58514, KL loss: 60.01872
Epoch: 6000/10000, Train loss: 176.84017, Test loss: 3367.27002, Recon loss: 174.78538, KL loss: 61.79695
Epoch: 7000/10000, Train loss: 163.33802, Test loss: 3508.20732, Recon loss: 161.23345, KL loss: 63.19488
Epoch: 8000/10000, Train loss: 146.88067, Test loss: 3478.91826, Recon loss: 144.83660, KL loss: 61.34283
Epoch: 9000/10000, Train loss: 135.00193, Test loss: 3535.44800, Recon loss: 132.88695, KL loss: 63.45724
Epoch: 10000/10000, Train loss: 127.29259, Test loss: 3531.07798, Recon loss: 125.24019, KL loss: 61.57497
```

#### Plot Loss Curve

```python=
# Plot loss curves
plt.figure(figsize=(10, 7))
plt.plot(log_dict['train_loss'], label='Train Loss')
plt.plot(log_dict['test_loss'], label='Test Loss')
plt.plot(log_dict['reconstruction_loss'], label='Reconstruction Loss')
plt.plot(log_dict['kl_loss'], label='KL Divergence')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('VAE Training Progress')
plt.show()
```

![loss2](https://hackmd.io/_uploads/S1BaBo-nA.png)

### Evaluation and Generation

#### Generate New Faces from Random Latent Vectors

```python=
# Function to generate images
def generate_images(model, num_images=10):
    model.eval()
    with torch.no_grad():
        z = torch.randn(num_images, LATENT_DIM).to(DEVICE)
        sample = model.decode(z).cpu()
        sample = sample * std + mean
    return sample

# Generate and display images
generated_images = generate_images(model)
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(generated_images[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

![generated_images](https://hackmd.io/_uploads/SJp24sbn0.png)

#### Generate Images by Interpolating in Latent Space

```python=
# Function to interpolate between two latent vectors
def interpolate_latent(model, start_img, end_img, steps=10):
    model.eval()
    with torch.no_grad():
        start_z = model.encode(start_img.to(torch.float32).unsqueeze(0).to(DEVICE))[0]
        end_z = model.encode(end_img.to(torch.float32).unsqueeze(0).to(DEVICE))[0]
        interpolated = torch.zeros(steps, LATENT_DIM)
        for i in range(steps):
            interpolated[i] = start_z + (end_z - start_z) * (i / (steps - 1))
        interpolated_imgs = model.decode(interpolated.to(DEVICE)).cpu()
        interpolated_imgs = interpolated_imgs * std + mean
    return interpolated_imgs

# Interpolate between two random images
start_img, _ = trainset[np.random.randint(len(trainset))]
end_img, _ = trainset[np.random.randint(len(trainset))]
interpolated_imgs = interpolate_latent(model, start_img, end_img)

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(interpolated_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

![interpolate](https://hackmd.io/_uploads/rkfs4jZn0.png)

## Reference

* Prince, S. J. D. (2023). Understanding Deep Learning. Chapter 17. [link](https://udlbook.github.io/udlbook/)
* Geiger, A. (n.d.). Deep Learning. Lecture 11. [link](https://drive.google.com/file/d/1kS-QLlkDvOCEQqgrG02uezcWEHXuxf6g/view)
* Liu, C.-H. (n.d.). Deep Learning Paper Series (04): Variational Autoencoder (VAE). [link](https://tomohiroliu22.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92paper%E7%B3%BB%E5%88%97-04-variational-autoencoder-vae-a7fbc67f0a2)
* Lee, H.-Y. (n.d.). ML Lecture 18: Unsupervised Learning - Deep Generative Model (Part II). [link](https://www.youtube.com/watch?v=8zomhgKrsmQ&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=27)
* Noh, H., Hong, S., & Han, B. (2015). Learning Deconvolution Network for Semantic Segmentation. arXiv preprint arXiv:1505.04366.
* The Reparameterization Trick in Variational Autoencoders. (n.d.). Baeldung. [link](https://www.baeldung.com/cs/vae-reparameterization)
* Odaibo, S. G. (2019). Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function. arXiv preprint arXiv:1907.08956.
* Hasan, S. (n.d.). AutoEncoders: Theory + PyTorch Implementation. [link](https://medium.com/@syed_hasan/autoencoders-theory-pytorch-implementation-a2e72f6f7cb7)
* Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., & Carin, L. (2019). Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). [link](https://aclanthology.org/N19-1021.pdf)
* Castiglioni, A. (n.d.). Dimensionality reduction with Autoencoders versus PCA. Towards Data Science. [link](https://towardsdatascience.com/dimensionality-reduction-with-autoencoders-versus-pca-f47666f80743)