# ML Final Project Report

## Artemii Bykov, DS-01

###### tags: `Introduction to ML`, `Domain Adaptation`, `GAN`, `Unsupervised learning`

#### Link to [Colab](https://colab.research.google.com/drive/1_Eorppx6keut8eowaFbUtBmeYOtJktNt)

#### Note: This report is based mainly on the article [Generate To Adapt: Aligning Domains using Generative Adversarial Networks](https://arxiv.org/pdf/1704.01705.pdf), with additional material from [Habr](https://habr.com/ru/company/mailru/blog/426803/) and [Medium](https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0), and, of course, my own knowledge and experience \^_\^

[ToC]

### Training phase architecture

#### Model bird's-eye overview

![](https://i.imgur.com/V4tIh3f.png)

In the training phase, our pipeline consists of two parallel streams:

* **Stream 1**: the *classification* branch, where the **F-C networks** are updated using a **supervised** classification loss.
* **Stream 2**: the *adversarial* branch, which is an Auxiliary Classifier GAN (ACGAN) framework (the **G-D pair**). The **F-G-D networks** are updated so that both source and target embeddings produce **source-like images**.

**Note**: The auxiliary classifier in ACGAN uses only the source domain labels and is needed to ensure that class-consistent images are generated (e.g., the embedding of a digit 3 generates an image that looks like a 3).

#### Structure of each part

##### Network F

![](https://i.imgur.com/cqrc2Xl.png)

##### Network C

![](https://i.imgur.com/hqG2kLC.png)

##### Network G

![](https://i.imgur.com/nnycoXV.png)

##### Network D

![](https://i.imgur.com/yYs1Tbu.png)

### Test phase architecture

#### Model bird's-eye overview

![](https://i.imgur.com/4XI2Fe2.png)

In the test phase, we simply remove Stream 2, and classification is performed using the **F-C pair**. You can find more details about **F** and **C** in the Training phase architecture section.

### Difference from the Baseline

The key difference between the Baseline and the current solution is the **ACGAN**.
While the Baseline has only the **F** and **C** networks (i.e., Stream 1), we now additionally have the **ACGAN** (Stream 2). The **ACGAN** is the key to Domain Adaptation.

### Proposed approach

* **Steps description:** Given a real image $x$ as input to **F**, the input to the generator network **G** is $x_g = [F(x), z, l]$, which is a concatenation of the encoder embedding $F(x)$, a random noise vector $z \in \mathbb{R}^d$ sampled from $\mathcal{N}(0, 1)$, and a one-hot encoding of the class label, $l \in \{0, 1\}^{N_c + 1}$, with $N_c$ real classes and $N_c + 1$ being the fake class. For all target samples, since the class labels are unknown, $l$ is set to the one-hot encoding of the fake class $N_c + 1$.
* We use the classifier network **C** as in usual supervised learning.
* The discriminator **D** takes the real image $x$ or the generated image $G(x_g)$ as input and outputs two distributions:
    * $D_{data}(x)$: the probability of the input being real, modeled as a binary classifier.
    * $D_{cls}(x)$: the class probability distribution of the input $x$, modeled as an $N_c$-way classifier.

    It should be noted that, for target data, since class labels are unknown, only $D_{data}$ is used to backpropagate the gradients.
* **Optimization procedure:**
    * **D loss**: Given source images as input, **D** outputs the two distributions $D_{data}$ and $D_{cls}$. $D_{data}$ is optimized by minimizing a binary cross-entropy loss $L_{data,src}$, and $D_{cls}$ is optimized by minimizing the cross-entropy loss $L_{cls,src}$ between the source labels and the model's predictive distribution $D_{cls}(x)$.
    * **G loss**: **G** is updated using a combination of an adversarial loss and a classification loss.
    * **F and C losses**: Updated based on the source images and source labels in the traditional supervised manner.
      **F** is also updated using the adversarial gradients from **D**, so that feature learning and image generation co-occur smoothly.
    * **Final step**: The real target images are presented as input to **F**. The target embeddings output by **F**, along with the random noise vector $z$ and the fake label encoding $l$, are input to **G**. The generated target images $G(x_g)$ are then given as input to **D**. As described above, **D** outputs two distributions, but the loss function is evaluated only for $D_{data}$, since in the unsupervised case considered here, target labels are not provided during training.

### Hyper-parameters

| Hyper-parameter             | Value  |
|-----------------------------|--------|
| batch_size                  | 100    |
| image_size                  | 32     |
| size of latent space        | 512    |
| # of filters in G           | 64     |
| # of filters in D           | 64     |
| # of epochs                 | 100    |
| learning rate               | 0.0005 |
| $\alpha_1$ for Adam         | 0.3    |
| $\beta_1$ for Adam          | 0.8    |
| weight for adversarial loss | 0.1    |
| learning rate decay         | 0.0001 |

### Results

#### SVHN

![](https://i.imgur.com/cbSny1t.png)

As we can see, the SVHN test accuracy is not that high, about 90%, but we should remember that we are solving a Domain Adaptation problem, and pushing the accuracy on SVHN higher would hurt the classification of MNIST samples.

#### MNIST

![](https://i.imgur.com/5FCXgX4.png)

The MNIST results are very good: we get about 92±1%.

#### Ablation study

| Settings                       | MNIST Test accuracy |
| ------------------------------ | ------------------- |
| Stream 1 - Source only         | 71%                 |
| Stream 1 + Stream 2 (C1 only)  | 79%                 |
| Stream 1 + Stream 2 (C1 + C2)  | 92%                 |

**Stream 1** - the embedding network **F** + the classification network **C**
**Stream 2** - the adversarial stream, consisting of the **G-D** pair
**C1** - the real/fake classifier
**C2** - the auxiliary classifier

We observe that using only the real/fake classifier **C1** in the network **D** does improve performance, but the auxiliary classifier **C2** is needed to get the full performance benefit.
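Putting the two streams together, one training iteration can be sketched roughly as follows. The tiny MLP stand-ins, dimensions, and helper names (`netF`, `gen_input`, `training_step`, etc.) are my own illustrative assumptions, not the actual Colab implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_  # "F" itself is taken by the embedding network

# Minimal runnable sketch of one training step of the two-stream pipeline.
# All network bodies, sizes, and names are illustrative placeholders.
N_CLASSES, EMBED_DIM, NOISE_DIM, IMG_DIM = 10, 128, 512, 32 * 32
ADV_WEIGHT = 0.1  # weight for the adversarial loss (see hyper-parameters)

netF = nn.Sequential(nn.Linear(IMG_DIM, EMBED_DIM), nn.ReLU())    # embedding F
netC = nn.Linear(EMBED_DIM, N_CLASSES)                            # classifier C
netG = nn.Linear(EMBED_DIM + NOISE_DIM + N_CLASSES + 1, IMG_DIM)  # generator G
netD_body = nn.Sequential(nn.Linear(IMG_DIM, 64), nn.ReLU())      # shared D trunk
netD_data = nn.Linear(64, 1)         # D_data: real/fake head
netD_cls = nn.Linear(64, N_CLASSES)  # D_cls: N_c-way auxiliary head

def gen_input(emb, labels=None):
    """Build x_g = [F(x), z, l]; unlabeled (target) samples get the fake class."""
    z = torch.randn(emb.size(0), NOISE_DIM)  # z ~ N(0, 1)
    if labels is None:                       # target: class label unknown
        labels = torch.full((emb.size(0),), N_CLASSES, dtype=torch.long)
    l = F_.one_hot(labels, N_CLASSES + 1).float()  # one-hot over N_c + 1 classes
    return torch.cat([emb, z, l], dim=1)

def training_step(x_src, y_src, x_tgt):
    ones = torch.ones(x_src.size(0), 1)
    zeros = torch.zeros(x_src.size(0), 1)

    # Stream 2: generate source-like images from source and target embeddings.
    fake_src = netG(gen_input(netF(x_src), y_src))
    fake_tgt = netG(gen_input(netF(x_tgt)))  # fake-class label for target

    # D loss: real source images are "real" and classified by D_cls; generated
    # images (detached, so only D gets these gradients) are "fake".
    h_real = netD_body(x_src)
    loss_D = (F_.binary_cross_entropy_with_logits(netD_data(h_real), ones)
              + F_.cross_entropy(netD_cls(h_real), y_src)
              + F_.binary_cross_entropy_with_logits(
                    netD_data(netD_body(fake_src.detach())), zeros)
              + F_.binary_cross_entropy_with_logits(
                    netD_data(netD_body(fake_tgt.detach())), zeros))

    # G (and, through G's input, F) loss: fool D_data; only source images get
    # the D_cls class-consistency term, target images use D_data alone.
    h_fs, h_ft = netD_body(fake_src), netD_body(fake_tgt)
    loss_G = (F_.binary_cross_entropy_with_logits(netD_data(h_fs), ones)
              + F_.cross_entropy(netD_cls(h_fs), y_src)
              + F_.binary_cross_entropy_with_logits(netD_data(h_ft), ones))

    # Stream 1: supervised classification loss for F and C on source data.
    loss_cls = F_.cross_entropy(netC(netF(x_src)), y_src)

    return loss_D, loss_cls + ADV_WEIGHT * loss_G
```

Returning the two objectives separately mirrors the alternating scheme described above: **D** is updated on its own loss, while **F**, **C**, and **G** share the supervised loss plus the weighted adversarial term.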
#### t-SNE at the 1st epoch (red - SVHN)

![](https://i.imgur.com/y8gXh6u.png)

#### t-SNE at the 50th epoch (red - SVHN)

![](https://i.imgur.com/2NjGKjd.png)

#### Examples of generated source-like images using source samples

![](https://i.imgur.com/b1xD6Us.png)

#### Examples of generated source-like images using target samples

![](https://i.imgur.com/qSVDmzG.png)

#### Personal thoughts

I have learned a lot about unsupervised learning, and especially about domain adaptation, which is an amazing idea that is so complex and so simple at the same time. I have succeeded in solving the domain gap problem. I think there is still a lot left to do, especially in terms of hyper-parameter tuning.