DLCV HW2
Problem 1 - GAN
1. Architecture of model A & B
Model A - DCGAN
DCGAN_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
DCGAN_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(10): LeakyReLU(negative_slope=0.2, inplace=True)
(11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
(12): Sigmoid()
)
)
Model B - WGAN-GP
WGAN_GP_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
WGAN_GP_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): LeakyReLU(negative_slope=0.2, inplace=True)
(4): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(5): LeakyReLU(negative_slope=0.2, inplace=True)
(6): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
)
)
2. Generated images of model A & B
Model A (DCGAN)
Model B (WGAN-GP)

Overall, Model A performs slightly better: unlike WGAN-GP, which sometimes produces blurry images, its outputs also have somewhat sharper facial features. In actual FID tests, DCGAN reaches 27.7, while WGAN-GP scores over 100.
3. What I’ve observed and learned?
Comparing the results of Model A (DCGAN) and Model B (WGAN-GP), the difference is actually quite small; subjectively, DCGAN even seems to perform a bit better. My guess is that this is because, for DCGAN, I tried some GAN-strengthening tricks such as training the discriminator multiple times per generator update, but those tricks sometimes made the WGAN-GP model blow up, so I did not apply them there. The idea behind GANs is interesting in itself: the discriminator and generator are trained simultaneously and pitted against each other, with the generator learning to turn random noise into realistic images. It is a rather unusual way for a machine to learn.
As for the differences between the two, they are quite small. Compared with DCGAN, WGAN mainly introduces the Wasserstein distance in place of DCGAN's loss, and WGAN-GP further adds a gradient penalty; the generator is exactly the same.
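Below is a minimal sketch of that gradient penalty term, assuming a critic that maps a batch of images to scalar scores; the helper name, the interpolation scheme, and the λ = 10 weight in the usage comment follow the WGAN-GP paper rather than my exact training code.

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm on interpolates toward 1."""
    batch_size = real.size(0)
    # Per-sample random mixing coefficient between real and fake images
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # needed so the penalty itself can be backpropagated
    )[0].view(batch_size, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Usage (illustrative):
# d_loss = -(critic(real).mean() - critic(fake).mean()) \
#          + 10.0 * gradient_penalty(critic, real, fake)
```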
Problem 2 - Diffusion models
1. Model
Diffusion_UNet_Small(
(init_conv): Residual_Block(
(conv): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(down1): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(down2): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(mid1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(mid2): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(up1): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(up2): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(out_conv): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))
(label_emb): Embedding(10, 256)
)
The model is mainly based on https://www.youtube.com/watch?v=TBCRlnwJtZU . However, that model is designed to output large images, and this assignment only requires 28×28 images, so I simplified the UNet structure, keeping only two down blocks and two up blocks to reduce computation time. The sampling method is similar to common implementations found online: generate beta with torch.linspace, then compute alpha and torch.cumprod(alpha). Finally, at each time step, update the image x, as in the sketch below.
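A minimal sketch of this sampling loop, assuming the standard DDPM linear schedule and a model called as model(x, t); the schedule endpoints (1e-4 to 0.02) are assumptions, and the class-conditioning via label_emb is omitted for brevity.

```python
import math
import torch

T = 1000                                 # number of diffusion time steps
beta = torch.linspace(1e-4, 0.02, T)     # linear schedule (endpoints are assumptions)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product of alphas

@torch.no_grad()
def sample(model, shape, device="cuda"):
    """Full reverse process: start from pure noise and denoise step by step."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)          # predicted noise at step t
        a, ab = alpha[t].item(), alpha_bar[t].item()
        x = (x - (1 - a) / math.sqrt(1 - ab) * eps) / math.sqrt(a)
        if t > 0:                        # no noise is added at the final step
            x = x + math.sqrt(beta[t].item()) * torch.randn_like(x)
    return x
```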
For the optimizer and the loss, I chose AdamW and MSELoss.
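A sketch of the corresponding training step under the same assumptions; the learning rate and the model signature with a label argument are illustrative.

```python
import torch
import torch.nn.functional as F

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def train_step(model, optimizer, x0, labels):
    """One DDPM training step: noise a clean batch at a random t, regress the noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise   # closed-form forward process
    loss = F.mse_loss(model(x_t, t, labels), noise)  # MSE between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (illustrative): optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```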
2. Results (time steps = 1000)

3. Different time steps
| t=1 | t=50 | t=100 | t=200 | t=400 | t=600 | t=800 | t=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  |  |  |  |
4. What I’ve observed and learned?
Although a time step of 1000 gives better results, it also takes much longer to compute. Since this assignment requires generating 1000 images within 15 minutes, the final submitted .py file uses a time step of only 500.
The hardest model to implement in this assignment was definitely the diffusion model: reference material was scarce, and the architecture is relatively complex, especially the sampling computation and the embeddings. I had also assumed UNet could only be used for segmentation, so it was quite interesting to see it work in a diffusion model.
Problem 3 - DANN models
1. Results
|  | MNIST-M → SVHN | MNIST-M → USPS |
| --- | --- | --- |
| Trained on source | 0.2769 | 0.5658 |
| DANN | 0.4360 | 0.7641 |
| Trained on target | 0.9127 | 0.9845 |
2. t-SNE results
|  | Domain | Class |
| --- | --- | --- |
| MNIST-M → SVHN |  |  |
| MNIST-M → USPS |  |  |
3. DANN implementation details
The DANN model consists of three main parts: a Feature extractor, a Class classifier, and a Domain classifier. The Feature extractor extracts features from the input image and feeds them to the class classifier, which classifies the digits 0–9. The domain classifier first passes the features through a Gradient Reversal layer and then predicts which domain they came from, as sketched below.
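A common implementation of the gradient reversal layer (a sketch; the class and function names and the lamb scaling coefficient are illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flipping the gradient trains the feature extractor to *confuse*
        # the domain classifier, encouraging domain-invariant features.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)
```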
During DANN training, the source dataset first goes through normal classification, and label=1 is used when training the domain classifier. For the target domain, only the images (no class labels) are used, running just feature extraction and domain classification. The three NLLLoss terms are then summed for a single backward pass, roughly as sketched below.
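A sketch of how this combined loss might look, reusing grad_reverse from the sketch above and assuming both classifier heads end in LogSoftmax (so NLLLoss applies) and that the target domain is labelled 0 (the report only specifies source = 1):

```python
import torch
import torch.nn.functional as F

def dann_loss(feat_ext, cls_head, dom_head, src_x, src_y, tgt_x, lamb=1.0):
    """Class loss on source + domain losses on both domains (source = 1, target = 0)."""
    src_f, tgt_f = feat_ext(src_x), feat_ext(tgt_x)

    # 1) Digit classification on the labelled source batch only
    cls_loss = F.nll_loss(cls_head(src_f), src_y)

    # 2) Domain classification through the gradient reversal layer
    src_dom = dom_head(grad_reverse(src_f, lamb))
    tgt_dom = dom_head(grad_reverse(tgt_f, lamb))
    src_lbl = torch.ones(src_x.size(0), dtype=torch.long, device=src_x.device)
    tgt_lbl = torch.zeros(tgt_x.size(0), dtype=torch.long, device=tgt_x.device)
    dom_loss = F.nll_loss(src_dom, src_lbl) + F.nll_loss(tgt_dom, tgt_lbl)

    return cls_loss + dom_loss  # single backward over the combined objective
```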
For the optimizer, I chose Adam, with MultiStepLR to adjust the learning rate.
The biggest problem during training was the oscillation of the accuracy on the target dataset: the loss stabilized and the source dataset accuracy hovered around 0.9, which made the behavior look strange. My guess is that the target dataset is relatively small, which causes its accuracy to fluctuate so much.
