# ML Final Project Report
## Artemii Bykov, DS-01
###### tags: `Introduction to ML`, `Domain Adaptation`, `GAN`, `Unsupervised learning`
#### Link to [Colab](https://colab.research.google.com/drive/1_Eorppx6keut8eowaFbUtBmeYOtJktNt)
#### Note: This report is based mainly on the article [Generate To Adapt: Aligning Domains using Generative Adversarial Networks](https://arxiv.org/pdf/1704.01705.pdf), on two posts: [Habr](https://habr.com/ru/company/mailru/blog/426803/) and [Medium](https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0), and, of course, on my own knowledge and experience \^_\^
[ToC]
### Training phase architecture
#### Bird's-eye model overview

In the training phase, our pipeline consists of two parallel streams:
* **Stream 1**: the *classification* branch, where the **F-C networks** are updated using a **supervised** classification loss.
* **Stream 2**: the *adversarial* branch, an Auxiliary Classifier GAN (ACGAN) framework (the **G-D pair**). The **F-G-D networks** are updated so that both source and target embeddings produce **source-like images**. **Note**: the auxiliary classifier in the ACGAN uses only the source domain labels and is needed to ensure that class-consistent images are generated (e.g., the embedding of a digit 3 generates an image that looks like a 3).
#### Structure of each part
##### Network F

##### Network C

##### Network G

##### Network D

### Test phase architecture
#### Bird's-eye model overview

In the test phase, we simply remove Stream 1, and classification is performed using the **F-C pair**. You can find more details about **F** and **C** in the Training phase architecture section.
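In code, the test phase reduces to a single forward pass through the **F-C** pair (a minimal sketch; `NetF` and `NetC` are the network sketches from the previous section, and the input batch is a stand-in):

```python
import torch

# Test-phase inference: G and D are discarded, only the F-C pair is used.
netF, netC = NetF().eval(), NetC().eval()
with torch.no_grad():
    x_tgt = torch.randn(8, 3, 32, 32)        # stand-in for a batch of target images
    preds = netC(netF(x_tgt)).argmax(dim=1)  # predicted class per image
```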
### Difference from the Baseline
The key difference between the Baseline and the current solution is the **ACGAN**. In the Baseline we have only the **F** and **C** networks (i.e., Stream 1), while now we additionally have the **ACGAN** (Stream 2). The **ACGAN** is the key to Domain Adaptation.
### Proposed approach
* **Steps description:** Given a real image $x$ as input to **F**, the input to the generator network **G** is $x_g = [F(x), z, l]$, which is a concatenated version of the encoder embedding $F(x)$, a random noise vector $z \in \mathbb{R}^d$ sampled from $\mathcal{N}(0, 1)$, and a one-hot encoding of the class label, $l \in \{0, 1\}^{N_c + 1}$, with $N_c$ real classes and $(N_c + 1)$ being the fake class. For all target samples, since the class labels are unknown, $l$ is set to the one-hot encoding of the fake class $(N_c + 1)$.
* The classifier network **C** is trained as in usual supervised learning.
* The discriminator mapping **D** takes the real image $x$ or the generated image $G(x_g)$ as input and outputs two distributions: $D_{data}(x)$, the probability of the input being real, which is modeled as a binary classifier, and $D_{cls}(x)$, the class probability distribution of the input $x$, which is modeled as an $N_c$-way classifier. It should be noted that, for target data, since class labels are unknown, only $D_{data}$ is used to backpropagate the gradients.
* **Optimization procedure:**
* **D loss**: Given source images as input, **D** outputs two distributions, $D_{data}$ and $D_{cls}$. $D_{data}$ is optimized by minimizing a binary cross-entropy loss $L_{data,src}$, and $D_{cls}$ is optimized by minimizing the cross-entropy loss $L_{cls,src}$ between the source labels and the model predictive distribution $D_{cls}(x)$.
* **G loss**: **G** is updated using a combination of adversarial loss and classification loss.
* **F and C losses**: Updated based on the source images and source labels in a traditional supervised manner. **F** is also updated using the adversarial gradients from **D**, so that the feature learning and image generation processes co-occur smoothly.
* **Final step**: The real target images are presented as input to **F**. The target embeddings output by **F**, along with the random noise vector $z$ and the fake label encoding $l$, are input to **G**. The generated target images $G(x_g)$ are then given as input to **D**. As described above, **D** outputs two distributions, but the loss function is evaluated only for $D_{data}$ since, in the unsupervised case considered here, target labels are not provided during training. A sketch of one full training iteration follows this list.
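Putting the procedure together, here is a minimal sketch of one training iteration. It builds on the network sketches above (the constants `NZ`, `NC` and the `Net*` classes), shares the noise `z` between domains, and uses a simplified loss weighting; the exact formulation is in the paper and the Colab notebook.

```python
import torch
import torch.nn.functional as nnf

def one_hot(labels, num_classes):
    return torch.eye(num_classes)[labels]

def train_step(netF, netC, netG, netD, opt_F, opt_C, opt_G, opt_D,
               x_src, y_src, x_tgt, adv_weight=0.1):
    """One training iteration (sketch); index NC is the fake class."""
    B = x_src.size(0)
    real, fake = torch.ones(B, 1), torch.zeros(B, 1)

    # Build the generator inputs x_g = [F(x), z, l].
    z = torch.randn(B, NZ)
    l_src = one_hot(y_src, NC + 1)                                   # true class for source
    l_tgt = one_hot(torch.full((B,), NC, dtype=torch.long), NC + 1)  # fake class for target
    g_src = netG(torch.cat([netF(x_src), z, l_src], dim=1))
    g_tgt = netG(torch.cat([netF(x_tgt), z, l_tgt], dim=1))

    # D step: real/fake loss on everything, class loss on source only.
    opt_D.zero_grad()
    d_real, c_real = netD(x_src)
    d_fake_s, _ = netD(g_src.detach())
    d_fake_t, _ = netD(g_tgt.detach())
    loss_D = (nnf.binary_cross_entropy_with_logits(d_real, real)
              + nnf.binary_cross_entropy_with_logits(d_fake_s, fake)
              + nnf.binary_cross_entropy_with_logits(d_fake_t, fake)
              + nnf.cross_entropy(c_real, y_src))
    loss_D.backward()
    opt_D.step()

    # G step (F also receives the adversarial gradients): fool C1,
    # and make source generations classified correctly by C2.
    opt_G.zero_grad(); opt_F.zero_grad()
    d_s, c_s = netD(g_src)
    d_t, _ = netD(g_tgt)
    loss_G = (nnf.binary_cross_entropy_with_logits(d_s, real)
              + nnf.cross_entropy(c_s, y_src)
              + adv_weight * nnf.binary_cross_entropy_with_logits(d_t, real))
    loss_G.backward()
    opt_G.step(); opt_F.step()

    # F-C step: plain supervised classification on the source domain.
    opt_F.zero_grad(); opt_C.zero_grad()
    loss_C = nnf.cross_entropy(netC(netF(x_src)), y_src)
    loss_C.backward()
    opt_C.step(); opt_F.step()
```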
### Hyper-parameters
| Hyper-parameter | Value |
|-----------------------------|--------|
| batch_size | 100 |
| image_size | 32 |
| size of latent space | 512 |
| # of filters in G | 64 |
| # of filters in D | 64 |
| # of epochs | 100 |
| learning rate | 0.0005 |
| $\alpha_1$ for Adam | 0.3 |
| $\beta_1$ for Adam | 0.8 |
| weight for adversarial loss | 0.1 |
| learning rate decay | 0.0001 |
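For reference, a sketch of how these values could be wired into the optimizers. The mapping of $\alpha_1$ and $\beta_1$ to Adam's moment coefficients is an assumption on my part (I take $\beta_1 = 0.8$ as Adam's first moment), and the decay schedule is omitted:

```python
import torch

lr = 0.0005  # learning rate from the table
# Assumption: beta_1 = 0.8 is Adam's first-moment coefficient;
# the second moment is left at a conventional 0.999.
opt_F = torch.optim.Adam(netF.parameters(), lr=lr, betas=(0.8, 0.999))
opt_C = torch.optim.Adam(netC.parameters(), lr=lr, betas=(0.8, 0.999))
opt_G = torch.optim.Adam(netG.parameters(), lr=lr, betas=(0.8, 0.999))
opt_D = torch.optim.Adam(netD.parameters(), lr=lr, betas=(0.8, 0.999))
```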
### Results
#### SVHN

As we can see, the SVHN test accuracy is not that high, about 90%, but we should remember that we are solving a Domain Adaptation problem, and too high an accuracy on SVHN would lead to problems with the classification of MNIST samples.
#### MNIST

The MNIST results are very good: we get about 92±1% test accuracy.
#### Ablation study
| Settings | MNIST Test accuracy |
| ------------------------------| ------------------- |
| Stream 1 - Source only | 71% |
| Stream 1 + Stream 2 (C1 only) | 79% |
| Stream 1 + Stream 2 (C1 + C2) | 92% |
**Stream 1** - the embedding network **F** + the classification network **C**
**Stream 2** - the adversarial stream, consisting of the **G-D** pair
**C1** - the real/fake classifier
**C2** - the auxiliary classifier
We observe that using only the real/fake classifier **C1** in the network **D** does improve performance, but the auxiliary classifier **C2** is needed to get the full performance benefit.
#### t-SNE after the 1st epoch (red - SVHN)

#### t-SNE after the 50th epoch (red - SVHN)

#### Examples of source-like images generated from source embeddings

#### Examples of source-like images generated from target embeddings

#### Personal thoughts
I have learned a lot about unsupervised learning, and especially about domain adaptation, which is an amazing idea that is so complex and so simple at the same time. I succeeded in solving the domain gap problem. I think there are a lot of things left to do, especially in terms of hyper-parameter tuning.