---
tags: hw6, handout
---

# HW6 Programming: Variational Autoencoders

:::info
Conceptual questions due **Friday, April 19th, 2024 at 6:00 PM EST**

Programming assignment due **Friday, April 26th, 2024 at 6:00 PM EST**
:::

## Theme

![](https://a-z-animals.com/media/Seahorse-Hippocampus.jpg)

*The fish have developed an interest in generative AI, and are looking to make a model called Stable Diffishusion! But first, they must learn how to make (C)VAEs (Coditional Variational Autofincoders), so they can generate images like this beautiful seahorse!*

# Introduction

Generative models are cool! This time we are going to build our own VAEs to generate digit images, and we will eventually train a specific type of VAE that can generate specific digits by training it with additional supervision :)

## Conceptual Questions

Please submit your answers to all conceptual questions as one PDF on Gradescope under the HW6 VAE Conceptual Questions assignment. When submitting your PDF on Gradescope, please be sure to mark which pages in your submission contain your answers for each question. LaTeX is recommended.

**2470 students only:** All conceptual questions (including non-2470 ones) should be written up as one PDF and submitted to the "[CS2470] HW6 VAE Conceptual Questions" assignment. Please do not submit to the CS1470 conceptual assignment.

## Getting the stencil

Please click <ins>[here to get the stencil code](https://classroom.github.com/a/HOq-un5_)</ins>. Reference this <ins>[guide](https://docs.google.com/document/d/1-vux8I-Hy7kpQixYwQE-acx-cljH3PpeK2CadOT-Stw/edit?usp=sharing)</ins> or <ins>[these slides](https://docs.google.com/presentation/d/1w_dzls2rfabUrhrQz9f5QU71t7HmLdHJFS8BnHz_q7E/edit?usp=sharing)</ins> for more information about GitHub and GitHub Classroom.

## Setup

Work on this assignment off of the <ins>[stencil code](https://classroom.github.com/a/HOq-un5_)</ins> provided, but do not change the stencil except where specified. Changing the stencil will break compatibility with the autograder and result in a low grade. For this assignment, a significant amount of code is provided. You **shouldn't** change any method signatures.

## Assignment Overview

In this assignment, you will implement a **variational autoencoder (VAE)** and a **conditional variational autoencoder (CVAE)** with slightly different architectures and apply them to the MNIST handwritten digit dataset. Compared with autoencoders, VAEs embed the inputs as a distribution instead of a single latent vector. At inference time, a VAE samples from the learned distribution over the latent space to generate new images. Refer to the autoencoder lab for a refresher on how autoencoders work. <ins>[Here](https://medium.com/analytics-vidhya/variational-autoencoders-explained-bce87e31e43e)</ins> is an article that explains the conceptual differences between AEs and VAEs to guide your programming.

## Roadmap

### Step 1: Fully-Connected VAE (VAE)

Our first VAE consists only of fully-connected (linear) layers. In this section, you need to build your encoder and decoder in `vae.py`. You will also need to implement the reparameterization trick and the loss function, which are located at the bottom of the file and are shared between the VAE and CVAE.

#### VAE Encoder

The encoder takes the images as inputs, flattens them into 1D vectors, and maps those vectors to hidden representations. The hidden representations are then used to predict the posterior **mu** and **log-variance**, which together represent the learned distribution of the latent space.
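To make this concrete, below is a minimal sketch of how an encoder with separate mu and log-variance heads might be wired up in Keras (the assignment's later hint to use `tf.concat` suggests a TensorFlow setup). The sizes `H` and `latent_size` and the variable names here are placeholder assumptions; the exact architecture, described next, is up to you.

```python
import tensorflow as tf

# Placeholder sizes -- the hidden size H and the latent size are yours to choose.
H = 256
latent_size = 15

# A stack of dense layers that maps a flattened image to a hidden representation.
encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(),                    # 28x28 image -> 784-dim vector
    tf.keras.layers.Dense(H, activation="relu"),
    tf.keras.layers.Dense(H, activation="relu"),
    tf.keras.layers.Dense(H, activation="relu"),
])

# Two separate linear "heads" that read the hidden representation and predict
# the posterior mean and log-variance of the latent distribution.
mu_layer = tf.keras.layers.Dense(latent_size)
logvar_layer = tf.keras.layers.Dense(latent_size)

# In the forward pass, a batch of images x would flow as:
#   hidden = encoder(x)
#   mu, logvar = mu_layer(hidden), logvar_layer(hidden)
```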
**We need to define an** `encoder`, `mu_layer`, **and** `logvar_layer` **in the initializer of the VAE class in** `vae.py`. The `mu_layer` and `logvar_layer` are two separate linear layers that represent the mu (mean) and log-variance of the learned distribution of the latent space. The hidden size of the middle layers (H) is up to you, and it will be the same across all encoder and decoder layers. Here is an example architecture for the encoder:

- Flatten()
- Linear layer with flattened input size and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU

#### VAE Decoder

Next, we build the decoder, which converts the latent representations back into reconstructed images. **We need to define** the `decoder` **in the initializer of the VAE class in** `vae.py`. Here is an example architecture for the `decoder`:

- Linear layer with input size of the latent size Z and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size H
- ReLU
- Linear layer with input size H and output size of the flattened image size
- Sigmoid
- Reshape (to image size)

#### Reparameterization

Because random sampling is not differentiable, we use the reparameterization trick to estimate the posterior sample $z$. The encoder predicts `mu` ($\mu$) and `log-variance` ($\log(v)$), which we can use to generate random realizations. Specifically, we sample a random `epsilon` ($\epsilon$) from a fixed normal distribution and compute $z$ as a function of $\mu$, $\log(v)$, and $\epsilon$:

$z = \mu + \sigma \epsilon \quad\text{where}\quad \sigma^2 = v$

We can then compute partial derivatives with respect to $\mu$ and $v$ through $z$. If $\epsilon \sim N(0,1)$, the $z$ values produced by our forward pass will follow a distribution centered at $\mu$ with standard deviation $\sigma$.

**Based on the above, we can implement** `reparameterization` in `vae.py`.

#### VAE Forward Pass

Next, we need to fill in the forward pass of the VAE class. The forward pass should pass images through the encoder, compute `mu` and `log-variance`, reparameterize to estimate the latent code $z$, and finally pass $z$ into the decoder to generate an image.

#### Loss Function

To train our VAEs, we need to define our loss function. As shown below, the loss function for VAEs contains two terms: a reconstruction loss term (left) and a KL divergence term (right).

$-E_{z \sim q_\phi(z|x)}[\log p_{\theta}(x|z)] + D_{KL}(q_\phi(z|x) \| p(z))$

This is the negative of the variational lower bound. The reconstruction loss term can be computed using binary cross-entropy loss between the original input pixels and the output pixels from the decoder. **You should use the given** `bce_function()` **to compute this reconstruction loss**. As in the lab, you are allowed to use binary cross-entropy loss because the images are normalized to be between 0 and 1 and can therefore behave like per-pixel probabilities.

The KL divergence term drives the latent space distribution to be close to a prior distribution (we pick the standard normal distribution). To help you out, here's the unvectorized form of the KL divergence term. Suppose that:

> $q_\phi(z | x)$ is a Z-dimensional diagonal Gaussian with mean $\mu_{z|x}$ and std $\sigma_{z|x}$ of shape (Z,)
> $p(z)$ is a Z-dimensional Gaussian with zero mean and unit variance.
Then we can write the KL divergence term as:

$D_{KL}(q_\phi(z|x) \| p(z)) = -\frac{1}{2}\sum_{j=1}^{Z}\left(1 + \log\left((\sigma_{z|x})_j^2\right) - (\mu_{z|x})_j^2 - (\sigma_{z|x})_j^2\right)$

From here, we can derive a vectorized version of this loss (shown below) that also operates on batches of inputs. **During implementation, remember to average the loss across samples in each batch**.

$L = \|x - \hat{x}\|_2^2 + \lambda D_{KL}(N(\mu,\sigma) \| N(0,1))$

If you are curious how this is derived, here is a [paper](https://arxiv.org/pdf/1907.08956.pdf) that goes through all the steps.

**Implement your** `loss_function()` in `vae.py`.

#### Train our VAE

**As the last step, we need to fill in** `train_vae()` in `assignment.py`. The MNIST dataset has been preprocessed into batches, where each batch has 1024 images of handwritten digits. You are expected to return the loss accumulated over all batches so that the average loss per image can be observed during training. After training for 5 epochs, the loss should be <120.

#### Optional: Visualize results

You can use the given `show_vae_images()` and `show_vae_interpolation()` to look at the images generated by our VAE. The images should look reasonable, though perhaps a bit blurry or badly formed. Note that these images will be saved in the outputs directory.

### Step 2: Conditional Fully-Connected VAE (CVAE)

Our second VAE extends the first model with additional control over which digits to generate. We'll use the labels of the MNIST images and condition our latent space and image generation on the specific digit class. Instead of $q_\phi(z | x)$ and $p_\theta(x | z)$, we have $q_\phi(z | x, c)$ and $p_\theta(x | z, c)$. This allows us to do conditional generation at inference time: we can specifically choose to generate 1s, 3s, etc., instead of generating new digits at random.

#### Define CVAE with class input

Our CVAE architecture is the same as our VAE, except that we'll append a one-hot label to both the input (i.e. the flattened image vectors) and the latent code $z$. Say our one-hot vector is called $c$. For the CVAE class in `vae.py`, use the same architecture as our VAE with the following modifications:

1. Modify the first linear layer of your `encoder` to take in not only the flattened image, but also the one-hot label vector $c$.
2. Modify the first linear layer of your `decoder` to project the latent code + one-hot vector to hidden size $H$.
3. Implement the forward pass to combine the flattened image with the one-hot vectors before passing them into the `encoder`, and also combine the latent code with the one-hot vectors before passing them to the `decoder`. (**Hint: `tf.concat`** – you may need to cast to a float tensor prior to this.)

#### Train our CVAE

**Last, we need to modify** `train_vae()` in `assignment.py` **so that it also works with the CVAE. Use the given `one_hot()` to convert the image labels, and pass the images and their one-hot vectors to the model to get the generated images**. After training for 5 epochs, your loss should be <120.

#### Optional: Visualize results

You can use the given `show_cvae_images()` to look at the images generated by our CVAE. You should see 10 generated images for each digit, from 0 to 9. Each digit should be reasonably distinguishable. Note that these images will be saved in the outputs directory.

### FAQ

- For both the VAE and CVAE, the set of training parameters is provided in `parseArguments()` in `assignment.py`.
  Feel free to play around with them :)
- Two functions, `save_model_weights()` and `load_weights()`, are provided in `assignment.py` so that you can experiment with a trained model without retraining it.

## 2470 Students

There is no extra programming requirement for this assignment :) Please complete the CS2470-only conceptual questions in addition to the CS1470 conceptual questions.

## Autograder & Grading

Code: You will primarily be graded on functionality. Training your VAE or CVAE models should take less than 30 minutes (you should be able to develop an architecture that takes **much** less than this, but after 30 minutes the autograder will time out). We will use your CVAE to conditionally generate images and use a trained MNIST classifier to measure the quality of the generations. The accuracy should be >90%.

Conceptual: You will primarily be graded on correctness (when applicable), thoughtfulness, and clarity.

## Handing In

Handing in the conceptual questions for this assignment is similar to previous homeworks. On Gradescope, make sure you submit to the 1470 version ONLY if you're enrolled in 1470, or the 2470 version ONLY if you're enrolled in 2470.

You should submit the assignment via Gradescope under the corresponding project assignment by zipping up your hw6 folder.

:::warning
:warning: IMPORTANT! Please make sure your `assignment.py` and `vae.py` are in "`hw6/code`" -- this is very important for our autograder to work!
:::

Lastly, don't forget to include a brief `README` with your model's accuracy and any known bugs!

IF YOU ARE IN 2470: PLEASE REMEMBER TO ADD A BLANK FILE CALLED "2470student" IN THE `hw6/code` DIRECTORY; WE ARE USING THIS AS A FLAG TO GRADE 2470-SPECIFIC REQUIREMENTS. FAILURE TO DO SO MEANS LOSING POINTS ON THIS ASSIGNMENT!