# DLCV HW2
###### tags: `Course`
:::success
湯濬澤 NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/SJw676wms
:::

## Problem 1 - GAN

### 1. Architecture of model A & B

#### **Model A - DCGAN**
```python
DCGAN_Generator(
  (net): Sequential(
    (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace=True)
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU(inplace=True)
    (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (13): Tanh()
  )
)
DCGAN_Discriminator(
  (net): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): LeakyReLU(negative_slope=0.2, inplace=True)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): LeakyReLU(negative_slope=0.2, inplace=True)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): LeakyReLU(negative_slope=0.2, inplace=True)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): LeakyReLU(negative_slope=0.2, inplace=True)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (12): Sigmoid()
  )
)
```

#### **Model B - WGAN-GP**
```python
WGAN_GP_Generator(
  (net): Sequential(
    (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace=True)
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU(inplace=True)
    (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (13): Tanh()
  )
)
WGAN_GP_Discriminator(
  (net): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): LeakyReLU(negative_slope=0.2, inplace=True)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (3): LeakyReLU(negative_slope=0.2, inplace=True)
    (4): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (5): LeakyReLU(negative_slope=0.2, inplace=True)
    (6): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): LeakyReLU(negative_slope=0.2, inplace=True)
    (8): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
  )
)
```
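Both models share the same generator and differ only in the discriminator/critic and the loss. For context, Model A follows the usual DCGAN adversarial training recipe; the sketch below is illustrative only (the no-argument constructors, `dataloader`, and the hyperparameters are assumptions, not excerpts from my training script):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
netG = DCGAN_Generator().to(device)        # assumed constructors for the modules
netD = DCGAN_Discriminator().to(device)    # printed above
criterion = nn.BCELoss()
opt_d = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real_imgs in dataloader:               # dataloader of real face images, defined elsewhere
    real_imgs = real_imgs.to(device)
    b = real_imgs.size(0)
    real_label = torch.ones(b, device=device)
    fake_label = torch.zeros(b, device=device)

    # Discriminator step: score real images high, generated images low
    z = torch.randn(b, 100, 1, 1, device=device)   # 100-dim latent, as in the dump above
    fake_imgs = netG(z)
    d_loss = criterion(netD(real_imgs).view(-1), real_label) + \
             criterion(netD(fake_imgs.detach()).view(-1), fake_label)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real
    g_loss = criterion(netD(fake_imgs).view(-1), real_label)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```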
### 2. Generated images of model A & B

#### **Model A (DCGAN)**
![](https://i.imgur.com/Q7J7jz6.png)

#### **Model B (WGAN-GP)**
![](https://i.imgur.com/Oh6fSae.png)

Overall, Model A performs slightly better: it does not occasionally produce the blurry images that WGAN-GP does, and facial features are a bit sharper. Measured with FID, DCGAN reaches about 27.7, while WGAN-GP scores over 100.

### 3. What I've observed and learned?
:::info
Judging from the results of Model A (DCGAN) and Model B (WGAN-GP), the two actually differ very little; subjectively, DCGAN even seems to perform slightly better. My guess is that this is because I applied some GAN training tricks to DCGAN, such as updating the discriminator several times per generator step, whereas those tricks sometimes made WGAN-GP diverge, so I did not use them there. The core idea of a GAN is interesting in itself: the discriminator and generator are trained simultaneously and compete against each other, and the generator learns to map random noise into realistic images, which is a rather unusual way for a model to learn.

As for the difference between the two models, it is actually small. Compared with DCGAN, WGAN mainly replaces the DCGAN loss with the Wasserstein distance, and WGAN-GP additionally adds a gradient penalty; the generator is exactly the same.
:::
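Since the gradient penalty is the only extra ingredient on top of the Wasserstein loss, here is a minimal sketch of how that term is typically computed (illustrative only; `critic`, `real`, `fake`, and `lambda_gp = 10` are assumed names and values, not taken from my training code):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    on random interpolations between real and fake samples."""
    b = real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=real.device)              # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)  # interpolated samples
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.view(b, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Critic loss sketch: Wasserstein estimate + gradient penalty
# d_loss = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```

Because the penalty is computed per sample, the critic is usually built without BatchNorm, which matches the WGAN_GP_Discriminator printed above.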
## Problem 2 - Diffusion models

### 1. Model
```python
Diffusion_UNet_Small(
  (init_conv): Residual_Block(
    (conv): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): GroupNorm(1, 64, eps=1e-05, affine=True)
      (2): GELU(approximate=none)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): GroupNorm(1, 64, eps=1e-05, affine=True)
    )
  )
  (down1): Down_Block(
    (conv): Sequential(
      (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (1): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 64, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 64, eps=1e-05, affine=True)
        )
      )
      (2): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 128, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 128, eps=1e-05, affine=True)
        )
      )
    )
    (time_emb_mlp): Sequential(
      (0): SiLU()
      (1): Linear(in_features=256, out_features=128, bias=True)
    )
  )
  (sa_d1): Self_Attention(
    (attention_module): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
    )
    (layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    (ff): Sequential(
      (0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (1): Linear(in_features=128, out_features=128, bias=True)
      (2): GELU(approximate=none)
      (3): Linear(in_features=128, out_features=128, bias=True)
    )
  )
  (down2): Down_Block(
    (conv): Sequential(
      (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (1): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 128, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 128, eps=1e-05, affine=True)
        )
      )
      (2): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 128, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 128, eps=1e-05, affine=True)
        )
      )
    )
    (time_emb_mlp): Sequential(
      (0): SiLU()
      (1): Linear(in_features=256, out_features=128, bias=True)
    )
  )
  (sa_d2): Self_Attention(
    (attention_module): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
    )
    (layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    (ff): Sequential(
      (0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (1): Linear(in_features=128, out_features=128, bias=True)
      (2): GELU(approximate=none)
      (3): Linear(in_features=128, out_features=128, bias=True)
    )
  )
  (mid1): Residual_Block(
    (conv): Sequential(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): GroupNorm(1, 256, eps=1e-05, affine=True)
      (2): GELU(approximate=none)
      (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): GroupNorm(1, 256, eps=1e-05, affine=True)
    )
  )
  (mid2): Residual_Block(
    (conv): Sequential(
      (0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): GroupNorm(1, 128, eps=1e-05, affine=True)
      (2): GELU(approximate=none)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): GroupNorm(1, 128, eps=1e-05, affine=True)
    )
  )
  (up1): Up_Block(
    (up_conv): Sequential(
      (0): Upsample(scale_factor=2.0, mode=bilinear)
    )
    (conv): Sequential(
      (0): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 256, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 256, eps=1e-05, affine=True)
        )
      )
      (1): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 128, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 64, eps=1e-05, affine=True)
        )
      )
    )
    (time_emb_mlp): Sequential(
      (0): SiLU()
      (1): Linear(in_features=256, out_features=64, bias=True)
    )
  )
  (sa_u1): Self_Attention(
    (attention_module): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
    )
    (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (ff): Sequential(
      (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): GELU(approximate=none)
      (3): Linear(in_features=64, out_features=64, bias=True)
    )
  )
  (up2): Up_Block(
    (up_conv): Sequential(
      (0): Upsample(scale_factor=2.0, mode=bilinear)
    )
    (conv): Sequential(
      (0): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 128, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 128, eps=1e-05, affine=True)
        )
      )
      (1): Residual_Block(
        (conv): Sequential(
          (0): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): GroupNorm(1, 64, eps=1e-05, affine=True)
          (2): GELU(approximate=none)
          (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): GroupNorm(1, 64, eps=1e-05, affine=True)
        )
      )
    )
    (time_emb_mlp): Sequential(
      (0): SiLU()
      (1): Linear(in_features=256, out_features=64, bias=True)
    )
  )
  (sa_u2): Self_Attention(
    (attention_module): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
    )
    (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (ff): Sequential(
      (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): GELU(approximate=none)
      (3): Linear(in_features=64, out_features=64, bias=True)
    )
  )
  (out_conv): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))
  (label_emb): Embedding(10, 256)
)
```

The model is mainly based on https://www.youtube.com/watch?v=TBCRlnwJtZU. However, that model was designed to output larger images, while this assignment only requires 28×28 images, so I simplified the UNet structure and kept only two down blocks and two up blocks to reduce computation time. The sampling procedure is similar to common implementations found online: generate `beta` with `torch.linspace`, then compute `alpha` and `torch.cumprod(alpha)`; finally, at each time step, the image `x` is updated (see the sampling sketch after the snippet below).

For the optimizer and loss, I chose AdamW and MSELoss.
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
```
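To make the sampling step above concrete, here is a rough, illustrative version of such a reverse-diffusion loop. The names and values (`model`, `sample`, the beta range, `T = 500`, and the conditional signature `model(x, t, y)` with class labels `y`) are assumptions for illustration, not the exact settings of my script:

```python
import torch

# Illustrative DDPM-style sampling, assuming the network predicts the added noise
# from the noisy image, the time step, and a class label (the printed model has a
# 10-class label embedding). Schedule values are assumptions.
T = 500                               # number of time steps
beta = torch.linspace(1e-4, 0.02, T)  # linear noise schedule
alpha = 1.0 - beta
alpha_hat = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def sample(model, n, y, img_size=28, device="cuda"):
    x = torch.randn(n, 3, img_size, img_size, device=device)  # start from pure noise
    for i in reversed(range(T)):
        t = torch.full((n,), i, device=device, dtype=torch.long)
        pred_noise = model(x, t, y)                            # predicted noise at step i
        a, a_hat, b = alpha[i].item(), alpha_hat[i].item(), beta[i].item()
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        # x_{t-1} = 1/sqrt(a) * (x_t - (1-a)/sqrt(1-a_hat) * eps) + sqrt(b) * z
        x = (1 / a ** 0.5) * (x - ((1 - a) / (1 - a_hat) ** 0.5) * pred_noise) \
            + (b ** 0.5) * noise
    return x.clamp(-1, 1)
```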
### 2. Results (time steps = 1000)
![](https://i.imgur.com/OEumCph.png)

### 3. Different time steps
| t=1 | t=50 | t=100 | t=200 | t=400 | t=600 | t=800 | t=1000 |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| ![](https://i.imgur.com/68RIcBn.png) | ![](https://i.imgur.com/E21txkH.png) | ![](https://i.imgur.com/XlLMKFF.png) | ![](https://i.imgur.com/RfVGUiv.png) | ![](https://i.imgur.com/hLxd7yq.png) | ![](https://i.imgur.com/2kfnzMf.png) | ![](https://i.imgur.com/BEJeG0q.png) | ![](https://i.imgur.com/wZiwkH2.png) |

### 4. What I've observed and learned?
:::info
Although the results are better with 1000 time steps, sampling also takes considerably longer. Because this assignment requires generating 1000 images within 15 minutes, the submitted .py file uses only 500 time steps.

The diffusion model was probably the hardest model to implement in this assignment: there is relatively little reference material, and the architecture is fairly complex, especially the sampling and embedding computations. I used to think UNet could only be used for segmentation, so it was interesting to see that it also works as the backbone of a diffusion model.
:::

## Problem 3 - DANN models

### 1. Results
|                    | MNIST-M → SVHN | MNIST-M → USPS |
| ------------------ | -------------- | -------------- |
| Trained on source  | 0.2769         | 0.5658         |
| DANN               | 0.4360         | 0.7641         |
| Trained on target  | 0.9127         | 0.9845         |

### 2. t-SNE results
|                | Domain | Class |
| -------------- | ------ | ----- |
| MNIST-M → SVHN | ![](https://i.imgur.com/SxVilEs.png) | ![](https://i.imgur.com/xRYcKn0.png) |
| MNIST-M → USPS | ![](https://i.imgur.com/laW5na3.png) | ![](https://i.imgur.com/PAaPWEy.png) |

<!--
| MNIST-M → SVHN | ![](https://i.imgur.com/CCKddEC.png) | ![](https://i.imgur.com/WsgInlG.png) |
| MNIST-M → USPS | ![](https://i.imgur.com/LUDpsB0.png) | ![](https://i.imgur.com/yD58bGA.png) |
-->

### 3. DANN implementation details
The DANN model consists of three parts: a `Feature extractor`, a `Class classifier`, and a `Domain classifier`. The `Feature extractor` extracts features from the image and passes them to the class classifier, which classifies the digits 0-9. The domain classifier first passes the features through a gradient reversal operation and then predicts which domain they come from.

During DANN training, the source dataset first goes through normal classification, and its samples are given label=1 when training the domain classifier. The target domain data is only used for feature extraction and domain classification. Finally, the three NLLLoss terms are summed before the backward pass.
```python
loss = src_class_loss + src_domain_loss + tgt_domain_loss
```
For the optimizer I chose Adam, with MultiStepLR to adjust the learning rate.

The biggest problem during training was the oscillating accuracy on the target dataset: the loss becomes fairly stable and the accuracy on the source dataset hovers around 0.9, which makes the oscillation rather puzzling. My guess is that the target dataset is comparatively small, which causes the accuracy to fluctuate so much.
![](https://i.imgur.com/uE8YvcB.png)
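For completeness, the gradient reversal mentioned in the implementation details above is typically realized as a small custom autograd function; the sketch below is illustrative (names such as `GradReverse` and `lamb` are hypothetical, not my exact code):

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lamb in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the feature extractor;
        # no gradient is needed for lamb itself.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage sketch inside the forward pass of a DANN-style model:
#   feat = feature_extractor(img)
#   class_out = class_classifier(feat)
#   domain_out = domain_classifier(grad_reverse(feat, lamb))
```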