---
tags: hw6, handout
---
# HW6 Programming: Variational Autoencoders
:::info
Conceptual questions due **Friday, April 19th, 2024 at 6:00 PM EST**
Programming assignment due **Friday, April 26th, 2024 at 6:00 PM EST**
:::
## Theme
![](https://a-z-animals.com/media/Seahorse-Hippocampus.jpg)
*The fish have developed an interest in generative AI, and are looking to make a model called Stable Diffishusion! But first, they must learn how to make CVAEs (Coditional Variational Autofincoders), so they can generate images like this beautiful seahorse!*
# Introduction
Generative models are cool! This time we are going to build our own VAEs to generate digit images, and we will eventually train a specific type of VAE that can generate specific digits by training it with additional supervision :)
## Conceptual Questions
Please submit your answers to all conceptual questions as one PDF on Gradescope under the HW6 VAE Conceptual Questions assignment. When submitting your PDF on Gradescope, please be sure to mark which pages of your submission contain your answer to each question. LaTeX is recommended.
**2470 students only:** All conceptual questions (including the non-2470 ones) should be written up as one PDF and submitted to the "[CS2470] HW6 VAE Conceptual Questions" assignment. Please do not submit to the CS1470 conceptual assignment.
## Getting the stencil
Please click <ins>[here to get the stencil code](https://classroom.github.com/a/HOq-un5_)</ins>. Reference this <ins>[guide](https://docs.google.com/document/d/1-vux8I-Hy7kpQixYwQE-acx-cljH3PpeK2CadOT-Stw/edit?usp=sharing)</ins> or <ins>[these](https://docs.google.com/presentation/d/1w_dzls2rfabUrhrQz9f5QU71t7HmLdHJFS8BnHz_q7E/edit?usp=sharing)</ins> slides for more information about GitHub and GitHub Classroom.
## Setup
Work on this assignment off of the <ins>[stencil code](https://classroom.github.com/a/HOq-un5_)</ins> provided, but do not change the stencil except where specified. Changing the stencil elsewhere will make your code incompatible with the autograder and result in a low grade. For this assignment, a significant amount of code is provided. You **shouldn't** change any method signatures.
## Assignment Overview
In this assignment, you will implement a **variational autoencoder (VAE)** and a **conditional variational autoencoder (CVAE)** with slightly different architectures and apply them to the MNIST handwritten digit dataset. Compared with plain autoencoders, VAEs embed each input as a distribution over the latent space instead of a single latent vector. At inference time, a VAE samples from this learned latent distribution to generate new images. Refer to the autoencoder lab for a refresher on how autoencoders work. <ins>[Here](https://medium.com/analytics-vidhya/variational-autoencoders-explained-bce87e31e43e)</ins> is an article that explains the conceptual differences between AEs and VAEs to guide your programming.
## Roadmap
### Step 1: Fully-Connected VAE (VAE)
Our first VAE consists of only fully-connected (linear) layers. In this section, you will build your encoder and decoder in `vae.py`. You will also need to implement the reparameterization trick and the loss function, which are located at the bottom of the file and are shared between the VAE and CVAE.
#### VAE Encoder
The encoder takes the images as inputs, flattens them to 1D vectors, and maps those vectors into hidden representations. The hidden representations are then used to predict the posterior **mu** and **log-variance**, which together parameterize the learned distribution over the latent space.
**We need to define an** `encoder`, `mu_layer`, **and** `logvar_layer` **in the initializer of the VAE class in** `vae.py`. The `mu_layer` and `logvar_layer` are two separate linear layers that predict the mean (mu) and log-variance of the learned latent distribution.
The hidden size of the middle layers (H) is up to you, and it will be the same across all encoder and decoder layers.
Here is an example architecture for the encoder (a code sketch follows the list):
- Flatten()
- Linear layer with flattened input size and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
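Here is a minimal sketch of that architecture in `tf.keras`. The helper name `build_encoder` and its arguments are placeholders rather than stencil names; in the assignment, you would assign the returned layers to `self.encoder`, `self.mu_layer`, and `self.logvar_layer` in the VAE initializer.
```python
import tensorflow as tf

def build_encoder(hidden_dim, latent_size):
    """Sketch: shared fully-connected trunk plus separate mu / log-variance heads."""
    encoder = tf.keras.Sequential([
        tf.keras.layers.Flatten(),                                # images -> 1D vectors
        tf.keras.layers.Dense(hidden_dim, activation="relu"),     # flattened input -> H
        tf.keras.layers.Dense(hidden_dim, activation="relu"),     # H -> H
        tf.keras.layers.Dense(hidden_dim, activation="relu"),     # H -> H
    ])
    mu_layer = tf.keras.layers.Dense(latent_size)       # predicts the mean of q(z|x)
    logvar_layer = tf.keras.layers.Dense(latent_size)   # predicts the log-variance of q(z|x)
    return encoder, mu_layer, logvar_layer
```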
#### VAE Decoder
Next, we build the decoder, which converts latent representations back into reconstructed images. **We need to define** the `decoder` **in the initializer of the VAE class in** `vae.py`. Here is an example architecture for the `decoder` (a code sketch follows the list):
- Linear layer with input size of the latent size Z and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size of the flattened image size
- Sigmoid
- Reshape (to image size)
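And a minimal sketch of the decoder, assuming 28×28×1 MNIST images; `build_decoder` and its arguments are likewise placeholders rather than stencil names.
```python
import numpy as np
import tensorflow as tf

def build_decoder(hidden_dim, image_shape=(28, 28, 1)):
    """Sketch: maps a latent vector z back to an image in [0, 1] of shape image_shape."""
    flat_size = int(np.prod(image_shape))
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="relu"),    # Z -> H
        tf.keras.layers.Dense(hidden_dim, activation="relu"),    # H -> H
        tf.keras.layers.Dense(hidden_dim, activation="relu"),    # H -> H
        tf.keras.layers.Dense(flat_size, activation="sigmoid"),  # H -> flattened image
        tf.keras.layers.Reshape(image_shape),                    # back to image dimensions
    ])
```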
#### Reparametrization
Because random sampling is not differentiable, we use the reparameterization trick to sample the latent code $z$ from the posterior. The encoder predicts `mu` ($\mu$) and `log-variance` ($\log(v)$), and we sample a random `epsilon` ($\epsilon$) from a fixed normal distribution to compute $z$ as a function of $\mu$, $\log(v)$, and $\epsilon$:
$z = \mu + \sigma\epsilon \quad\text{where}\quad \sigma^2 = v$
We can then compute partial derivatives with respect to $\mu$ and $v$ through $z$. If $\epsilon \sim N(0,1)$, then $z$ follows a distribution centered at $\mu$ with variance $v$.
**Based on the above, we can implement** `reparameterization` in `vae.py`.
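A minimal sketch of the trick is below; the stencil's actual function name and signature take precedence, so follow whatever the stencil specifies.
```python
import tensorflow as tf

def reparametrize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(logvar / 2).

    Because eps is drawn from a fixed distribution, z is a differentiable
    function of mu and logvar, so gradients can flow back into the encoder.
    """
    eps = tf.random.normal(shape=tf.shape(mu))
    sigma = tf.exp(0.5 * logvar)
    return mu + sigma * eps
```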
#### VAE Forward Pass
Next, we need to fill in the forward pass of the VAE class. The forward pass should pass images through the encoder, compute `mu` and `log-variance`, reparametrize to sample the latent code $z$, and finally pass $z$ into the decoder to generate an image.
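Inside the VAE class, the forward pass might look roughly like this method fragment (assuming the attribute names from the sketches above and a `reparametrize` helper; the stencil's actual method name, signature, and return values may differ):
```python
def call(self, x):
    """Sketch of the VAE forward pass: encode, predict (mu, logvar), sample z, decode."""
    hidden = self.encoder(x)            # flatten + hidden representation
    mu = self.mu_layer(hidden)          # mean of q(z|x)
    logvar = self.logvar_layer(hidden)  # log-variance of q(z|x)
    z = reparametrize(mu, logvar)       # differentiable sample from q(z|x)
    x_hat = self.decoder(z)             # reconstructed image
    return x_hat, mu, logvar
```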
#### Loss Function
To train our VAEs, we need to define our loss function. As shown below, the loss function for VAEs contains two terms: A reconstruction loss term (left) and a KL divergence term (right).
$-E_{z \sim q_\phi(z|x)}[\log p_{\theta}(x|z)] + D_{KL}(q_\phi(z|x), p(z))$
This is the negative of the variational lower bound. The reconstruction loss term can be computed with binary cross-entropy loss between the original input pixels and the output pixels from the decoder. **You should use the given** `bce_function()` **to compute this reconstruction loss**. As in the lab, cross-entropy is a valid choice here because the images are normalized to be between 0 and 1, so pixel values behave like per-pixel probabilities.
The KL divergence term drives the latent space distribution to be close to a prior distribution (we pick the standard normal distribution).
To help you out, here’s the unvectorized form of the KL divergence term. Suppose that:
> $q_\phi(z | x)$ is a Z-dimensional diagonal Gaussian with mean $\mu_{z|x}$ and std $\sigma_{z|x}$ of shape (Z,)
> $p(z)$ is a Z-dimensional Gaussian with zero mean and unit variance.
Then we can write the KL divergence term as:
$D_{KL}(q_\phi(z|x), p(z)) = -\frac{1}{2}\sum_{j=1}^{Z}\left(1 + \log\left((\sigma_{z|x}^2)_j\right) - (\mu_{z|x})_j^2 - (\sigma_{z|x})_j^2\right)$
From here, we can derive a vectorized version of this loss (shown below) that also operates on batches of inputs. **During implementation, remember to average the loss across samples in each batch**.
$L = ||x - \hat{x}||_2^2 + \lambda D_{KL}(N(\mu,\sigma),N(0,1))$
If you are curious how this was derived, here is a [paper](https://arxiv.org/pdf/1907.08956.pdf) that goes through all the steps.
**Implement your** `loss_function()` in `vae.py`.
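Here is a minimal sketch of this loss, assuming the forward pass returns `x_hat`, `mu`, and `logvar` as above (the stencil's exact argument order may differ). In your submission you should use the provided `bce_function()` for the reconstruction term; a hand-written summed per-pixel binary cross-entropy stands in for it here.
```python
import tensorflow as tf

def loss_function(x_hat, x, mu, logvar):
    """Sketch of the negative variational lower bound, averaged across the batch."""
    batch_size = tf.cast(tf.shape(x)[0], tf.float32)
    # Reconstruction term: summed per-pixel binary cross-entropy
    # (use the stencil's bce_function() instead in your submission).
    x_hat = tf.clip_by_value(x_hat, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    reconstruction_loss = -tf.reduce_sum(
        x * tf.math.log(x_hat) + (1.0 - x) * tf.math.log(1.0 - x_hat)
    )
    # KL(q(z|x) || N(0, I)): sum over the latent dimensions and the batch.
    kl_divergence = -0.5 * tf.reduce_sum(
        1.0 + logvar - tf.square(mu) - tf.exp(logvar)
    )
    # Average across the samples in the batch.
    return (reconstruction_loss + kl_divergence) / batch_size
```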
#### Train our VAE
**As the last step, we need to fill in** `train_vae()` in `assignment.py`. The MNIST dataset has been preprocessed into batches, where each batch has 1024 images of handwritten digits. You are expected to return the accumulated loss values of all batches in order to observe the average loss per image during training. After training for 5 epochs, the loss should be <120.
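As a rough sketch of what one training epoch might look like, assuming each batch is an `(images, labels)` pair and the `loss_function` sketch above; the stencil's actual signature, batching, and return value take precedence:
```python
import tensorflow as tf

def train_vae(model, dataset, optimizer):
    """Sketch of one training epoch over pre-batched MNIST data."""
    total_loss = 0.0
    for images, _labels in dataset:              # labels are unused by the plain VAE
        with tf.GradientTape() as tape:
            x_hat, mu, logvar = model(images)    # forward pass
            loss = loss_function(x_hat, images, mu, logvar)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        total_loss += float(loss)                # accumulate the per-batch loss
    return total_loss
```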
#### Optional: Visualize results
You can use the given `show_vae_images()` and `show_vae_interpolation()` to look at the generated images from our VAE. The images should look fine, though a bit blurry or badly formed. Note that these images will be saved in the outputs directory.
### Step 2: Conditional Fully-Connected VAE (CVAE)
Our second model extends the first VAE with additional control over which digits to generate. We’ll use the labels of the MNIST images and condition our latent space and image generation on the specific digit class. Instead of $q_\phi(z | x)$ and $p_\theta(x | z)$, we have $q_\phi(z | x, c)$ and $p_\theta(x | z, c)$.
This will allow us to do conditional generation at inference time. We can specifically choose to generate 1s, 3s, etc., instead of generating new digits randomly.
#### Define CVAE with class input
Our CVAE architecture is the same as our VAE, except that we’ll append a one-hot label vector to both the input (i.e., the flattened image vectors) and the latent code $z$. Say our one-hot vector is called $c$.
For the CVAE class in `vae.py`, use the same architecture as our VAE with the following modifications:
1. Modify the first linear layer of your `encoder` to take in not only the flattened image, but also the one-hot label vector $c$.
2. Modify the first linear layer of your `decoder` to project latent space + one-hot vector to hidden size $H$.
3. Implement the forward pass to combine the flattened image with the one-hot vectors before passing them into the `encoder`, and also combine the latent code with the one-hot vectors before passing them to the `decoder`. (**Hint: `tf.concat`** – you may need to cast the one-hot vectors to a float tensor first; see the sketch after this list.)
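Here is a minimal sketch of the conditioning step. The function name and arguments are placeholders (the stencil's CVAE forward pass will be structured differently), but the `tf.cast` and `tf.concat` usage is the key idea:
```python
import tensorflow as tf

def cvae_forward(encoder, mu_layer, logvar_layer, decoder, x, c):
    """Sketch: condition both the encoder input and the latent code on one-hot label c."""
    x_flat = tf.reshape(x, (tf.shape(x)[0], -1))         # flatten images to 1D vectors
    c = tf.cast(c, tf.float32)                           # match dtypes before concatenating
    hidden = encoder(tf.concat([x_flat, c], axis=-1))    # encode [image, class]
    mu, logvar = mu_layer(hidden), logvar_layer(hidden)
    eps = tf.random.normal(shape=tf.shape(mu))
    z = mu + tf.exp(0.5 * logvar) * eps                  # reparameterization trick
    x_hat = decoder(tf.concat([z, c], axis=-1))          # decode [z, class]
    return x_hat, mu, logvar
```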
#### Train our CVAE
**Lastly, we need to modify** `train_vae()` in `assignment.py` **so that it also works with the CVAE. Use the given `one_hot()` to convert the image labels, and pass the images and their one-hot vectors to the model to get the generated images**. After training for 5 epochs, your loss should be <120.
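The conditional training step might change roughly as follows. Here `one_hot()` stands for the stencil's helper (its exact signature may differ; `tf.one_hot(labels, depth=10)` behaves similarly), and the model call signature is an assumption:
```python
for images, labels in dataset:
    labels_one_hot = one_hot(labels, 10)               # 10 digit classes for MNIST
    with tf.GradientTape() as tape:
        x_hat, mu, logvar = model(images, labels_one_hot)   # condition on the class
        loss = loss_function(x_hat, images, mu, logvar)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```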
#### Optional: Visualize results
You can use the given `show_cvae_images()` to look at the generated images from our CVAE. You should see 10 generated images for each digit, from 0 to 9. Each digit should be reasonably distinguishable. Note that these images will be saved in the outputs directory.
### FAQ
- For both the VAE and CVAE, the hyperparameters are defined in `parseArguments()` in `assignment.py`. Feel free to play around with them :)
- Two functions, `save_model_weights()` and `load_weights()`, are provided in `assignment.py` so you can experiment with a trained model without retraining it.
## 2470 Students
There is no extra programming requirement for 2470 students on this assignment :) Please complete the CS2470-only conceptual questions in addition to the CS1470 conceptual questions.
## Autograder & Grading
Code: You will be primarily graded on functionality. Training your VAE or CVAE should take less than 30 minutes (you should be able to develop an architecture that trains in **much** less time; after 30 minutes, the autograder will time out). We will use your CVAE to conditionally generate images and use a trained MNIST classifier to measure the quality of the generations. The classifier's accuracy on your generated digits should be >90%.
Conceptual: You will be primarily graded on correctness (when applicable), thoughtfulness, and clarity.
## Handing In
Handing in the conceptual questions of this assignment is similar to previous homeworks. On Gradescope, make sure you submit to the 1470 version ONLY if you're enrolled in 1470, or the 2470 version ONLY if you're enrolled in 2470.
You should submit the assignment via Gradescope under the corresponding project assignment by zipping up your hw6 folder.
:::warning
:warning: IMPORTANT!
Please make sure your `assignment.py` and `vae.py` are in “`hw6/code`” -- this is very important for our autograder to work!
:::
Lastly, don’t forget to include a brief `README` with your model's accuracy and any known bugs!
IF YOU ARE IN 2470: PLEASE REMEMBER TO ADD A BLANK FILE CALLED “2470student” IN THE `hw6/code` DIRECTORY, WE ARE USING THIS AS A FLAG TO GRADE 2470 SPECIFIC REQUIREMENTS, FAILURE TO DO SO MEANS LOSING POINTS ON THIS ASSIGNMENT!