
# 〈 Diffusion Model Paper Study and Implementation Notes Part.2 〉 U-Net Architecture Overview and Implementation

---

Tu 2023/2/14

## 1. Preface

In the previous post, [〈 Diffusion Model Paper Study and Implementation Notes Part.1 〉 Introduction and Image Noise Preprocessing](https://hackmd.io/@Tu32/B1-m6Tuai), I finished the part that adds noise to images, so next up is taking the model apart.

## 2. A Brief Introduction to U-Net

![](https://i.imgur.com/H7zAGYE.png)

Image source: "Deep Learning for Image Segmentation: U-Net Architecture"

In the DDPM paper, the authors train with the U-Net architecture. U-Net is a variant of the auto-encoder: it also has a bottleneck at the bottom, and its input and output images are the same size. U-Net has made major contributions to image segmentation. Unlike a traditional auto-encoder, U-Net adds skip connections (residual-style connections that concatenate encoder features into the decoder) between the encoder and the decoder, which preserve the features of the original image better.

## 3. Implementing the U-Net Architecture

To implement U-Net, we can break it down into the following components.

* A block with two CNN layers
* time embedding
* Down (the encoder on the left half: two CNN layers plus max pooling)
* Up (the decoder on the right half: two CNN layers plus upsampling)
* self attention
* residual connection

#### 1. Double CNN

Let's start with the most frequently used piece: a block with two CNN layers, which is used everywhere later on.

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self):
        pass

    def forward(self):
        pass
```

Filling in the model:

```python
class DoubleConv(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
            nn.GroupNorm(1, out_c),  # one group: acts like LayerNorm over (C, H, W)
            nn.ReLU()
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_c, out_c, kernel_size=3, padding=1),
            nn.GroupNorm(1, out_c),  # one group: acts like LayerNorm over (C, H, W)
            nn.ReLU()
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return x
```

#### 2. Time Embedding

When training the U-Net, I initially thought the input was an image and the output only needed to be the repaired image. But there is a problem with that: the model cannot tell images at different timesteps apart, so it would have to repair images of varying noise strength without knowing which strength it is facing.

Simply put, embedding turns a single value into a tensor. For example, if we feed the model an image and an integer (the timestep), we can use an embedding to turn that integer into a tensor, a form the model can learn from more easily. The DDPM authors chose Sinusoidal Position Embedding to embed the individual timestep.

![](https://i.imgur.com/wtIY9yu.jpg)

The impressive-looking Sinusoidal Position Embedding (image source: [A Gentle Introduction to Positional Encoding in Transformer Models, Part 1](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/))

This is much like how a Transformer trained with attention needs a positional embedding added to its text tokens: we likewise need to add a time embedding to images of different noise strengths to tell the model which strength it is looking at.

```python
def pos_encoding(t, channels):
    # t is a single integer timestep here
    t = torch.tensor([t])
    inv_freq = 1.0 / (
        10000 ** (torch.arange(0, channels, 2).float() / channels)
    )
    pos_enc_a = torch.sin(t.repeat(1, channels // 2) * inv_freq)
    pos_enc_b = torch.cos(t.repeat(1, channels // 2) * inv_freq)
    pos_enc = torch.cat([pos_enc_a, pos_enc_b], dim=-1)
    return pos_enc
```

It takes two integers and returns the embedded tensor, for example:

```python
pos_encoding(10, 16)  # timestep = 10
```

```
tensor([[-0.5440, -0.0207, 0.8415, 0.3110, 0.0998, 0.0316, 0.0100, 0.0032,
         -0.8391, -0.9998, 0.5403, 0.9504, 0.9950, 0.9995, 0.9999, 1.0000]])
```

Of course, a tensor like this cannot be added directly to the image tensor; its size still needs adjusting, which is covered later.

#### 3. Down & Up

Next come Down and Up. The basic idea is to apply max pooling or upsampling and then a DoubleConv.

First, the Down part:

```python
class Down(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_c, out_c),
        )

    def forward(self, x):
        x = self.down(x)
        return x
```

That is roughly the basic structure, but don't forget that we still need to add the time embedding to the image.

```python
class Down(nn.Module):
    def __init__(self, in_c, out_c, emb_dim=128):
        super().__init__()
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_c, out_c),
        )
        # project the time embedding from emb_dim to out_c channels
        self.emb_layer = nn.Sequential(
            nn.ReLU(),
            nn.Linear(emb_dim, out_c),
        )

    def forward(self, x, t):
        x = self.down(x)
        # add two trailing dimensions, then repeat to match the feature map's
        # spatial size (like numpy.tile)
        t_emb = self.emb_layer(t)[:, :, None, None].repeat(1, 1, x.shape[-2], x.shape[-1])
        return x + t_emb
```

Up has basically the same structure, but if you look at the figure above, Up also needs to receive a residual-connection-like input, so `forward()` takes an extra `skip_x` that is concatenated with `x`.

```python
class Up(nn.Module):
    def __init__(self, in_c, out_c, emb_dim=128):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        # in_c counts the channels after concatenating skip_x and x
        self.conv = DoubleConv(in_c, out_c)
        self.emb_layer = nn.Sequential(
            nn.SiLU(),
            nn.Linear(emb_dim, out_c),
        )

    def forward(self, x, skip_x, t):
        x = self.up(x)
        x = torch.cat([skip_x, x], dim=1)  # concatenate along the channel axis
        x = self.conv(x)
        emb = self.emb_layer(t)[:, :, None, None].repeat(1, 1, x.shape[-2], x.shape[-1])
        return x + emb
```

#### 4. Self Attention Block

I don't plan to go into detail here (because I don't fully understand it myself); I may write a separate study note on Attention Is All You Need later. Simply put, you can think of Self Attention as a black box that takes a vector in and gives a vector out. (This part is copied directly from Outlier's code.)

```python
class SelfAttention(nn.Module):
    def __init__(self, channels, size):
        super(SelfAttention, self).__init__()
        self.channels = channels
        self.size = size
        self.mha = nn.MultiheadAttention(channels, 4, batch_first=True)
        self.ln = nn.LayerNorm([channels])
        self.ff_self = nn.Sequential(
            nn.LayerNorm([channels]),
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x):
        # (b, c, h, w) -> (b, h*w, c): every pixel becomes one token
        x = x.view(-1, self.channels, self.size * self.size).swapaxes(1, 2)
        x_ln = self.ln(x)
        attention_value, _ = self.mha(x_ln, x_ln, x_ln)
        attention_value = attention_value + x
        attention_value = self.ff_self(attention_value) + attention_value
        # (b, h*w, c) -> (b, c, h, w)
        return attention_value.swapaxes(2, 1).view(-1, self.channels, self.size, self.size)
```

#### 5. Assembling the U-Net

Finally, let's assemble everything we wrote above.

```python
class UNet(nn.Module):
    def __init__(self, c_in=3, c_out=3, time_dim=128, device="cuda"):
        super().__init__()
        self.device = device
        self.time_dim = time_dim

        self.inc = DoubleConv(c_in, 64)    # (b,3,64,64) -> (b,64,64,64)

        self.down1 = Down(64, 128)         # (b,64,64,64) -> (b,128,32,32)
        self.sa1 = SelfAttention(128, 32)  # (b,128,32,32) -> (b,128,32,32)
        self.down2 = Down(128, 256)        # (b,128,32,32) -> (b,256,16,16)
        self.sa2 = SelfAttention(256, 16)  # (b,256,16,16) -> (b,256,16,16)
        self.down3 = Down(256, 256)        # (b,256,16,16) -> (b,256,8,8)
        self.sa3 = SelfAttention(256, 8)   # (b,256,8,8) -> (b,256,8,8)

        self.bot1 = DoubleConv(256, 512)   # (b,256,8,8) -> (b,512,8,8)
        self.bot2 = DoubleConv(512, 512)   # (b,512,8,8) -> (b,512,8,8)
        self.bot3 = DoubleConv(512, 256)   # (b,512,8,8) -> (b,256,8,8)

        # Up's in_c is doubled because skip_x is concatenated after upsampling
        self.up1 = Up(512, 128)            # (b,256,8,8) + skip (b,256,16,16) -> (b,128,16,16)
        self.sa4 = SelfAttention(128, 16)  # (b,128,16,16) -> (b,128,16,16)
        self.up2 = Up(256, 64)             # (b,128,16,16) + skip (b,128,32,32) -> (b,64,32,32)
        self.sa5 = SelfAttention(64, 32)   # (b,64,32,32) -> (b,64,32,32)
        self.up3 = Up(128, 64)             # (b,64,32,32) + skip (b,64,64,64) -> (b,64,64,64)
        self.sa6 = SelfAttention(64, 64)   # (b,64,64,64) -> (b,64,64,64)

        self.outc = nn.Conv2d(64, c_out, kernel_size=1)  # (b,64,64,64) -> (b,3,64,64)

    def pos_encoding(self, t, channels):
        # t is already a (bs, 1) float tensor here, so it must not be wrapped
        # in torch.tensor([...]) as in the scalar version above
        inv_freq = 1.0 / (
            10000 ** (torch.arange(0, channels, 2, device=t.device).float() / channels)
        )
        pos_enc_a = torch.sin(t.repeat(1, channels // 2) * inv_freq)
        pos_enc_b = torch.cos(t.repeat(1, channels // 2) * inv_freq)
        pos_enc = torch.cat([pos_enc_a, pos_enc_b], dim=-1)
        return pos_enc

    def forward(self, x, t):
        # (bs,) -> (bs, time_dim)
        t = t.unsqueeze(-1).type(torch.float)
        t = self.pos_encoding(t, self.time_dim)

        # initial conv
        x1 = self.inc(x)

        # Down
        x2 = self.down1(x1, t)
        x2 = self.sa1(x2)
        x3 = self.down2(x2, t)
        x3 = self.sa2(x3)
        x4 = self.down3(x3, t)
        x4 = self.sa3(x4)

        # Bottleneck
        x4 = self.bot1(x4)
        x4 = self.bot2(x4)
        x4 = self.bot3(x4)

        # Up
        x = self.up1(x4, x3, t)
        x = self.sa4(x)
        x = self.up2(x, x2, t)
        x = self.sa5(x)
        x = self.up3(x, x1, t)
        x = self.sa6(x)

        # Output
        output = self.outc(x)
        return output
```

Let's check that it runs and that the output shape is correct.

```python
T = 1000  # assumption: total number of timesteps; use the value defined in Part 1
sample = torch.randn((32, 3, 64, 64))
t = torch.randint(0, T, (32,))
model = UNet()
model(sample, t).shape
```

Output:

```
torch.Size([32, 3, 64, 64])
```

Nice, the U-Net model is done.

## 4. Closing Remarks

I wanted to cover a bit more (the image restoration part), but this post is already close to 9,000 characters. Barring surprises, the next part should be the last one; we'll see whether I can finish both image restoration and model training in it. I might also write an Extra post on how to improve things, but that's a story for later.

### Related Materials

* https://www.youtube.com/watch?v=a4Yfz2FxXiY
* https://www.youtube.com/watch?v=HoKDTa5jHvg&t=1338s
* https://huggingface.co/blog/annotated-diffusion
* https://arxiv.org/pdf/2102.09672.pdf
* https://arxiv.org/pdf/1503.03585.pdf
* https://arxiv.org/pdf/2006.11239.pdf
* https://theaisummer.com/latent-variable-models/#reparameterization-trick
* https://theaisummer.com/diffusion-models/
* https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/

###### tags: `AI` `Deep Learning` `Diffusion Model`
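
Supplementary note on the time embedding: as a quick sanity check of what `pos_encoding(10, 16)` computes, here is a minimal dependency-free sketch of the same sinusoidal formula in plain Python. The function name `sinusoidal_embedding` is my own for illustration and does not appear in the post's code; the layout (all sines first, then all cosines) is assumed from the `torch.cat` order used in `pos_encoding`.

```python
import math


def sinusoidal_embedding(t, channels):
    """Plain-Python version of the sinusoidal timestep embedding.

    Hypothetical helper for illustration: the frequencies are
    w_i = 1 / 10000 ** (2 * i / channels) for i in [0, channels // 2),
    and the result is [sin(t * w_0), ..., cos(t * w_0), ...].
    """
    half = channels // 2
    inv_freq = [1.0 / (10000 ** (2 * i / channels)) for i in range(half)]
    sines = [math.sin(t * w) for w in inv_freq]
    cosines = [math.cos(t * w) for w in inv_freq]
    return sines + cosines


emb = sinusoidal_embedding(10, 16)
print([round(v, 4) for v in emb])
```

The first few sine entries and the first cosine entry should agree with the tensor printed for `pos_encoding(10, 16)` to about four decimal places, which is a cheap way to confirm that the PyTorch version implements the intended formula.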