DLCV HW2
Problem 1 - GAN
1. Architecture of model A & B
Model A - DCGAN
DCGAN_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
DCGAN_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(10): LeakyReLU(negative_slope=0.2, inplace=True)
(11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
(12): Sigmoid()
)
)
Model B - WGAN-GP
WGAN_GP_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
WGAN_GP_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): LeakyReLU(negative_slope=0.2, inplace=True)
(4): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(5): LeakyReLU(negative_slope=0.2, inplace=True)
(6): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
)
)
2. Generated images of model A & B
Model A (DCGAN)
Model B (WGAN-GP)

Overall, Model A performs slightly better: unlike WGAN-GP, which sometimes produces blurry images, its outputs also have somewhat sharper facial features. In actual FID tests, DCGAN reaches 27.7, while WGAN-GP scores over 100.
3. What I’ve observed and learned?
Comparing the results of Model A (DCGAN) and Model B (WGAN-GP), the difference is actually quite small; subjectively, DCGAN even seems to perform a bit better. My guess is that this is because, for DCGAN, I tried some GAN-strengthening tricks such as training the discriminator multiple times per generator update, but those tricks sometimes made the WGAN-GP model blow up, so I did not apply them there. The idea behind GANs is interesting in itself: the discriminator and generator are trained simultaneously and pitted against each other, with the generator learning to turn random noise into realistic images. It is a rather unusual way for a machine to learn.
As for the differences between the two, they are quite small. Compared with DCGAN, WGAN mainly introduces the Wasserstein distance in place of DCGAN's loss, and WGAN-GP further adds a gradient penalty; the generator is exactly the same.
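Below is a minimal sketch of that gradient penalty term, assuming a critic that maps a batch of images to scalar scores; the helper name, the interpolation scheme, and the λ = 10 weight in the usage comment follow the WGAN-GP paper rather than my exact training code.

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm on interpolates toward 1."""
    batch_size = real.size(0)
    # Per-sample random mixing coefficient between real and fake images
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # needed so the penalty itself can be backpropagated
    )[0].view(batch_size, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Usage (illustrative):
# d_loss = -(critic(real).mean() - critic(fake).mean()) \
#          + 10.0 * gradient_penalty(critic, real, fake)
```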
Problem 2 - Diffusion models
1. Model
Diffusion_UNet_Small(
(init_conv): Residual_Block(
(conv): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(down1): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(down2): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(mid1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(mid2): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(up1): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(up2): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(out_conv): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))
(label_emb): Embedding(10, 256)
)
The model is mainly based on https://www.youtube.com/watch?v=TBCRlnwJtZU . However, that model is designed to output large images, and this assignment only requires 28×28 images, so I simplified the UNet structure, keeping only two down blocks and two up blocks to reduce computation time. The sampling method is similar to common implementations found online: generate beta with torch.linspace, then compute alpha and torch.cumprod(alpha). Finally, at each time step, update the image x, as in the sketch below.
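A minimal sketch of this sampling loop, assuming the standard DDPM linear schedule and a model called as model(x, t); the schedule endpoints (1e-4 to 0.02) are assumptions, and the class-conditioning via label_emb is omitted for brevity.

```python
import math
import torch

T = 1000                                 # number of diffusion time steps
beta = torch.linspace(1e-4, 0.02, T)     # linear schedule (endpoints are assumptions)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product of alphas

@torch.no_grad()
def sample(model, shape, device="cuda"):
    """Full reverse process: start from pure noise and denoise step by step."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)          # predicted noise at step t
        a, ab = alpha[t].item(), alpha_bar[t].item()
        x = (x - (1 - a) / math.sqrt(1 - ab) * eps) / math.sqrt(a)
        if t > 0:                        # no noise is added at the final step
            x = x + math.sqrt(beta[t].item()) * torch.randn_like(x)
    return x
```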
For the optimizer and the loss, I chose AdamW and MSELoss.
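A sketch of the corresponding training step under the same assumptions; the learning rate and the model signature with a label argument are illustrative.

```python
import torch
import torch.nn.functional as F

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def train_step(model, optimizer, x0, labels):
    """One DDPM training step: noise a clean batch at a random t, regress the noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise   # closed-form forward process
    loss = F.mse_loss(model(x_t, t, labels), noise)  # MSE between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (illustrative): optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```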
2. Results (time steps = 1000)

3. Different time steps
| t=1 | t=50 | t=100 | t=200 | t=400 | t=600 | t=800 | t=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  |  |  |  |
4. What I’ve observed and learned?
Although a time step of 1000 gives better results, it also takes much longer to compute. Since this assignment requires generating 1000 images within 15 minutes, the final submitted .py file uses a time step of only 500.
The hardest model to implement in this assignment was definitely the diffusion model: reference material was scarce, and the architecture is relatively complex, especially the sampling computation and the embeddings. I had also assumed UNet could only be used for segmentation, so it was quite interesting to see it work in a diffusion model.
Problem 3 - DANN models
1. Results
|  | MNIST-M → SVHN | MNIST-M → USPS |
| --- | --- | --- |
| Trained on source | 0.2769 | 0.5658 |
| DANN | 0.4360 | 0.7641 |
| Trained on target | 0.9127 | 0.9845 |
2. t-SNE results
|  | Domain | Class |
| --- | --- | --- |
| MNIST-M → SVHN |  |  |
| MNIST-M → USPS |  |  |
3. DANN implementation details
The DANN model consists of three main parts: a Feature extractor, a Class classifier, and a Domain classifier. The Feature extractor extracts features from the input image and feeds them to the class classifier, which classifies the digits 0–9. The domain classifier first passes the features through a Gradient Reversal layer and then predicts which domain they came from, as sketched below.
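A common implementation of the gradient reversal layer (a sketch; the class and function names and the lamb scaling coefficient are illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flipping the gradient trains the feature extractor to *confuse*
        # the domain classifier, encouraging domain-invariant features.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)
```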
During DANN training, the source dataset first goes through normal classification, and label=1 is used when training the domain classifier. For the target domain, only the images (no class labels) are used, running just feature extraction and domain classification. The three NLLLoss terms are then summed for a single backward pass, roughly as sketched below.
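A sketch of how this combined loss might look, reusing grad_reverse from the sketch above and assuming both classifier heads end in LogSoftmax (so NLLLoss applies) and that the target domain is labelled 0 (the report only specifies source = 1):

```python
import torch
import torch.nn.functional as F

def dann_loss(feat_ext, cls_head, dom_head, src_x, src_y, tgt_x, lamb=1.0):
    """Class loss on source + domain losses on both domains (source = 1, target = 0)."""
    src_f, tgt_f = feat_ext(src_x), feat_ext(tgt_x)

    # 1) Digit classification on the labelled source batch only
    cls_loss = F.nll_loss(cls_head(src_f), src_y)

    # 2) Domain classification through the gradient reversal layer
    src_dom = dom_head(grad_reverse(src_f, lamb))
    tgt_dom = dom_head(grad_reverse(tgt_f, lamb))
    src_lbl = torch.ones(src_x.size(0), dtype=torch.long, device=src_x.device)
    tgt_lbl = torch.zeros(tgt_x.size(0), dtype=torch.long, device=tgt_x.device)
    dom_loss = F.nll_loss(src_dom, src_lbl) + F.nll_loss(tgt_dom, tgt_lbl)

    return cls_loss + dom_loss  # single backward over the combined objective
```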
For the optimizer, I chose Adam, with MultiStepLR to adjust the learning rate.
The biggest problem during training was the oscillation of the accuracy on the target dataset: the loss stabilized and the source dataset accuracy hovered around 0.9, which made the behavior look strange. My guess is that the target dataset is relatively small, which causes its accuracy to fluctuate so much.
