Course
湯濬澤
NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/SJw676wms
```
DCGAN_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
DCGAN_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(10): LeakyReLU(negative_slope=0.2, inplace=True)
(11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
(12): Sigmoid()
)
)
WGAN_GP_Generator(
(net): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
WGAN_GP_Discriminator(
(net): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): LeakyReLU(negative_slope=0.2, inplace=True)
(4): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(5): LeakyReLU(negative_slope=0.2, inplace=True)
(6): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
)
)
```
Model A performs somewhat better overall: unlike WGAN-GP, it rarely produces blurry faces, and the facial features come out a bit sharper. Measured with FID, DCGAN reaches 27.7, while WGAN-GP comes out above 100.
Comparing the results of Model A (DCGAN) and Model B (WGAN-GP), the gap is honestly not that large; subjectively, DCGAN even seems to do slightly better. My guess is that this is because I applied some GAN-strengthening tricks to DCGAN, such as training the discriminator several times per generator update (sketched below), whereas those tricks sometimes made the WGAN-GP model blow up, so I did not use them there. The concept behind GANs is quite interesting in itself: the discriminator and generator are trained simultaneously and pitted against each other. Training a model to restore the original image from added noise is also quite a special way for a computer to learn.
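As a concrete example, below is a minimal sketch (not the actual assignment code) of the "train the discriminator several times" trick; `D`, `G`, `opt_D`, `opt_G`, `real_batch`, and `z_dim` are assumed to be defined elsewhere, and `D` is assumed to end in the Sigmoid shown above.

```python
import torch

n_critic = 2                                  # hypothetical D steps per G step
bce = torch.nn.BCELoss()

for _ in range(n_critic):
    z = torch.randn(real_batch.size(0), z_dim, 1, 1)
    fake = G(z).detach()                      # block gradients into G here
    d_real = D(real_batch).view(-1)
    d_fake = D(fake).view(-1)
    loss_D = (bce(d_real, torch.ones_like(d_real))
              + bce(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

# One generator update after the discriminator steps.
z = torch.randn(real_batch.size(0), z_dim, 1, 1)
d_out = D(G(z)).view(-1)
loss_G = bce(d_out, torch.ones_like(d_out))   # non-saturating G loss
opt_G.zero_grad()
loss_G.backward()
opt_G.step()
```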
As for the difference between the two, it is actually very small. Compared with DCGAN, WGAN mainly introduces the Wasserstein distance to replace the DCGAN loss, and WGAN-GP adds a gradient penalty on top of that (a sketch follows); the generator is exactly the same.
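For reference, here is a minimal sketch of the gradient penalty term, following the WGAN-GP paper (with lambda = 10 as in the paper; names such as `critic` are placeholders, not the assignment's code):

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """Penalize the critic's gradient norm for deviating from 1,
    evaluated on random interpolations between real and fake samples.
    `critic` outputs one score per image (the WGAN_GP_Discriminator
    above, which has no Sigmoid)."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, 1, device=device)      # per-sample mix ratio
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()

# Critic loss: Wasserstein estimate plus the penalty.
# loss_D = D(fake).mean() - D(real).mean() + 10 * gradient_penalty(D, real, fake)
```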
```
Diffusion_UNet_Small(
(init_conv): Residual_Block(
(conv): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(down1): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(down2): Down_Block(
(conv): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(2): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=128, bias=True)
)
)
(sa_d2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
)
(layer_norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=128, out_features=128, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=128, out_features=128, bias=True)
)
)
(mid1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(mid2): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(up1): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 256, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 256, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u1): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(up2): Up_Block(
(up_conv): Sequential(
(0): Upsample(scale_factor=2.0, mode=bilinear)
)
(conv): Sequential(
(0): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 128, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 128, eps=1e-05, affine=True)
)
)
(1): Residual_Block(
(conv): Sequential(
(0): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): GroupNorm(1, 64, eps=1e-05, affine=True)
(2): GELU(approximate=none)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): GroupNorm(1, 64, eps=1e-05, affine=True)
)
)
)
(time_emb_mlp): Sequential(
(0): SiLU()
(1): Linear(in_features=256, out_features=64, bias=True)
)
)
(sa_u2): Self_Attention(
(attention_module): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
)
(layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(ff): Sequential(
(0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): GELU(approximate=none)
(3): Linear(in_features=64, out_features=64, bias=True)
)
)
(out_conv): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))
(label_emb): Embedding(10, 256)
)
```
The model is mainly based on https://www.youtube.com/watch?v=TBCRlnwJtZU . His model is designed to output large images, but this assignment only needs 28×28 images, so I simplified the UNet structure to just two down blocks and two up blocks to reduce computation time. The sampling method is similar to what is commonly done online: use `torch.linspace` to produce `beta`, then compute `alpha` and `torch.cumprod(alpha)`, and finally update the image `x` at every time step.
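A minimal sketch of that sampling loop, under the assumption that `model` predicts the noise at step `t` and takes the class label as a third argument (as suggested by the `label_emb` layer above); the beta range is the common DDPM default, not necessarily the assignment's exact values:

```python
import torch

T = 500                                       # time steps used for submission
beta = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def sample(model, shape, labels):
    x = torch.randn(shape)                    # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, labels)       # predicted noise
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alpha[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + beta[t].sqrt() * noise     # DDPM reverse update
    return x
```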
For the optimizer and loss, I chose MSELoss and AdamW.
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
```
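The MSE loss enters the training step by regressing the model onto the noise that was mixed into the image. A minimal sketch, reusing `T` and `alpha_bar` from the sampling sketch above and assuming `images`/`labels` come from the data loader:

```python
import torch

t = torch.randint(0, T, (images.size(0),))             # random time steps
noise = torch.randn_like(images)
ab = alpha_bar[t].view(-1, 1, 1, 1)
noisy = ab.sqrt() * images + (1 - ab).sqrt() * noise   # forward process q(x_t | x_0)
loss = criterion(model(noisy, t, labels), noise)       # MSE on predicted noise
optimizer.zero_grad()
loss.backward()
optimizer.step()
```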
| t=1 | t=50 | t=100 | t=200 | t=400 | t=600 | t=800 | t=1000 |
|---|---|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Although the results with a time step of 1000 are better, they also take correspondingly longer to compute. Since 1000 images had to be generated within 15 minutes, the submitted .py file uses a time step of only 500.
The hardest model to implement in this assignment was probably the diffusion model: references are scarce and the architecture is relatively complex, especially the sampling and embedding computations. I had assumed UNet was only good for segmentation, so it was a pleasant surprise that it can be used in a diffusion model.
| | MNIST-M → SVHN | MNIST-M → USPS |
|---|---|---|
| Trained on source | 0.2769 | 0.5658 |
| DANN | 0.4360 | 0.7641 |
| Trained on target | 0.9127 | 0.9845 |
| | Domain | Class |
|---|---|---|
| MNIST-M → SVHN | ![]() | ![]() |
| MNIST-M → USPS | ![]() | ![]() |
The DANN model consists of three parts: a `Feature extractor`, a `Class classifier`, and a `Domain classifier`. The `Feature extractor` extracts features from the image and hands them to the class classifier, which classifies the digits 0-9. The domain classifier first passes the features through a gradient reversal layer and then predicts their domain; a sketch of that layer follows.
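A minimal sketch of such a gradient reversal layer (the standard DANN construction, not necessarily the assignment's exact code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda in the
    backward pass, so the feature extractor is pushed toward
    domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

# Usage inside the domain classifier branch (lambd is a hyperparameter):
# domain_logits = domain_classifier(GradReverse.apply(features, 1.0))
```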
During training, the DANN procedure first runs normal classification on the source dataset, giving label=1 when training the domain classifier. For the target domain, the data is only used for feature extraction and domain classification. Finally, the NLLLoss terms of all three are combined for a single backward pass:
```python
loss = src_class_loss + src_domain_loss + tgt_domain_loss
```
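Put together, one training step might look like the following sketch; `F_ext`, `C_cls`, and `D_cls` are hypothetical names for the three parts (both classifiers are assumed to end in `log_softmax`, matching the use of NLLLoss, and `D_cls` is assumed to apply the gradient reversal internally):

```python
import torch
import torch.nn.functional as F

src_feat = F_ext(src_x)                                  # source batch features
src_class_loss = F.nll_loss(C_cls(src_feat), src_y)      # digit classification

# Domain labels: 1 for source samples, 0 for target samples.
src_domain_loss = F.nll_loss(
    D_cls(src_feat), torch.ones(src_x.size(0), dtype=torch.long))
tgt_domain_loss = F.nll_loss(
    D_cls(F_ext(tgt_x)), torch.zeros(tgt_x.size(0), dtype=torch.long))

loss = src_class_loss + src_domain_loss + tgt_domain_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```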
For the optimizer, I chose Adam, with MultiStepLR to adjust the learning rate (a sketch below).
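A sketch of that setup; the learning rate and milestones below are placeholders, not the values actually used:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 40], gamma=0.1)  # drop lr at these epochs

# Called once per epoch after the training loop:
# scheduler.step()
```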
The biggest problem during training was the oscillation of the accuracy on the target dataset: the loss did stabilize, and the accuracy on the source dataset hovered around 0.9, which made it puzzling. My guess is that the target dataset is smaller, and that is what causes the accuracy to swing so much.