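# Chapter 6: Builders’ Guide

:::success
[Back to the deep learning table of contents](https://hackmd.io/UiomJi7URTKHv3S3hJmlYg#Chapter-6-:Builders%E2%80%99-Guide)
Presented: reading group, 2024/04/11
Edited: 2024/04/08
:::

:::info
Corresponds to Chapter 5 of the Chinese edition
:::

[TOC]

## 6.1 Layers and Modules

This section introduces the concept of neural network modules. A module can describe a single layer, a component made up of several layers, or the entire model itself.

![image](https://hackmd.io/_uploads/By9Qf5geA.png)

A module can be thought of as a class. Each module must:

- Take input data as arguments to its forward propagation method.
- Generate an output by having the forward propagation method return a value.
- Compute the gradient of its output with respect to its input, accessible through its backpropagation method. This usually happens automatically.
- Store and provide access to the parameters needed to run the forward propagation computation.
- Initialize model parameters as needed.

To compute gradients, a module would in principle need a definition of its backpropagation function. Thanks to **automatic differentiation**, however, we only need to define the forward method and the parameters.

```python=
import torch
from torch import nn
from torch.nn import functional as F

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape
```

```
torch.Size([2, 10])
```

In this code, *net* contains one hidden layer that accepts an input of any dimension and outputs 256 dimensions. After the ReLU function, the result goes to the output layer, which produces 10 dimensions.

### 6.1.1 A Custom Module

```python=+
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

net = MLP()
net(X).shape
```

```
torch.Size([2, 10])
```

This code reproduces the previous network with a custom class. ``super().__init__()`` initializes the ``nn.Module`` parent class, and the **forward** function describes how the output is produced. Next, we define a **Sequential module**, which chains several modules together.

```python=+
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape
```

```
torch.Size([2, 10])
```

Sometimes we want more than passing data from layer to layer and need extra mathematical operations. For example, we may want to include a constant $c$ that is not updated during training and compute $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{w}$ is the weight vector that changes with training.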
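As a minimal sketch of the formula $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$ itself, the module below keeps a trainable weight vector and a constant that never receives gradients. The class name `ScaledLinear`, the default `c=2.0`, and the choice of `register_buffer` for storing the constant are assumptions made for illustration; they are not taken from the book.

```python
import torch
from torch import nn

class ScaledLinear(nn.Module):
    """Hypothetical module computing f(x, w) = c * w^T x."""
    def __init__(self, in_features, c=2.0):
        super().__init__()
        # w is a Parameter, so the optimizer would update it during training
        self.w = nn.Parameter(torch.randn(in_features))
        # c is a buffer: saved with the module, but never given a gradient
        self.register_buffer("c", torch.tensor(c))

    def forward(self, x):
        # x has shape (batch, in_features); the result has shape (batch,)
        return self.c * (x @ self.w)

layer = ScaledLinear(in_features=20)
x = torch.rand(2, 20)
y = layer(x)
y.sum().backward()

# Only w shows up as a trainable parameter; c does not
param_names = [name for name, _ in layer.named_parameters()]
print(param_names, layer.w.grad.shape, layer.c.requires_grad)
```

Storing the constant as a buffer is one possible design choice: it travels with `state_dict()` and device moves. A plain tensor attribute, as in the book's own example that follows, also stays constant during training, and that example goes further by combining a fixed random matrix, weight reuse, and control flow.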
```python=+
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        # (The Chinese edition adds requires_grad=False, but that is
        # already the default.)
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

net = FixedHiddenMLP()
net(X)
```

```
tensor(-0.3836, grad_fn=<SumBackward0>)
```

This operation only demonstrates how flexibly a module can be designed; the book does not claim it has any meaning for training. Finally, one more example ties together everything from this section.

```python=+
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)
```

```
tensor(0.0679, grad_fn=<SumBackward0>)
```

## 6.2 Parameter Management

Once a set of hyperparameters is chosen, we train the model to minimize the loss function, and training yields the parameters. We therefore need a way to store and retrieve them. This section covers:

- Accessing parameters for debugging, diagnostics, and visualization.
- Sharing parameters across different model components.

```python=+
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape
```

```
torch.Size([2, 1])
```

This network has one hidden layer with an 8-dimensional output, followed by ReLU and an output layer with a 1-dimensional output.

```python=+
net[2].state_dict()
```

```
OrderedDict([('weight',
              tensor([[-0.1649,  0.0605,  0.1694, -0.2524,  0.3526, -0.3414, -0.2322,  0.0822]])),
             ('bias', tensor([0.0709]))])
```

Inspecting the output layer shows that it has 8 weights, one for each of the 8 outputs of the previous layer, plus one bias term.

```python=+
type(net[2].bias), net[2].bias.data
```

```
(torch.nn.parameter.Parameter, tensor([0.0709]))
```

Each parameter can also be accessed individually.

```python=+
net[2].weight.grad is None
```

```
True
```

Because backpropagation has not been run yet, no gradient has been computed, so the gradient does not exist yet.

```python=+
[(name, param.shape) for name, param in net.named_parameters()]
```
```
[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
```

A single loop over `named_parameters()` lists the name and shape of every parameter at once.

```python=+
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])
```

```
tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])
```

To share parameters between different layers, place the same layer object at several positions in the network, as in the code above. The second comparison still returns all `True` after one entry is modified, which confirms that both positions refer to the same tensor.

## 6.3 Parameter Initialization

This section describes how to initialize parameters. Until now we have relied on the default initialization; here we look at several custom approaches.

```python=+
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape
```

```
torch.Size([2, 1])
```

This network has one hidden layer with an 8-dimensional output, followed by ReLU and an output layer with a 1-dimensional output.

```python=+
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]
```

```
(tensor([-0.0129, -0.0007, -0.0033,  0.0276]), tensor(0.))
```

This code defines a function that initializes the weights from a Gaussian with mean 0 and standard deviation 0.01, that is, $N(0, 0.01^2)$, and sets the bias to 0.

```python=+
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]
```

```
(tensor([1., 1., 1., 1.]), tensor(0.))
```

This code initializes the weights with the constant 1 instead.

```python=+
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)
```

```
tensor([-0.0974,  0.1707,  0.5840, -0.5032])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
```

This code demonstrates per-layer initialization. The first layer uses Xavier initialization, which samples from a uniform distribution whose range is scaled by the layer's input and output sizes. The second layer is initialized with the constant 42.

![image](https://hackmd.io/_uploads/rJo-rolxA.png)

```python=+
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]
```

```
Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])
tensor([[ 0.0000, -7.6364, -0.0000, -6.1206],
        [ 9.3516, -0.0000,  5.1208, -8.4003]], grad_fn=<SliceBackward0>)
```

The code above initializes each weight according to

$$
\begin{aligned}
    w \sim \begin{cases}
        U(5, 10) & \textrm{ with probability } \frac{1}{4} \\
            0    & \textrm{ with probability } \frac{1}{2} \\
        U(-10, -5) & \textrm{ with probability } \frac{1}{4}
    \end{cases}
\end{aligned}
$$

It does so by sampling from $U(-10, 10)$ and zeroing every entry whose absolute value is below 5.

```python=+
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]
```

```
tensor([42.0000, -6.6364,  1.0000, -5.1206])
```

This shows how to manipulate parameters directly.

## 6.4 Lazy Initialization

Sections **6.1** through **6.3** all declared layers without specifying the input dimension, and the parameters were not initialized when the layers were created. At the time a network is defined, the deep learning framework cannot always know its input dimension. It therefore defers initialization: the size of each layer is inferred on the fly the first time data is passed through the model. This trick becomes even more convenient later with convolutional neural networks, because the input dimension (for example, the resolution of an image) affects the dimension of every subsequent layer. Being able to set up parameters without knowing the dimensions while writing the code greatly simplifies specifying a model and modifying it later. Next, we look more closely at the initialization mechanism.

```python=+
import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net[0].weight
```

```
<UninitializedParameter>
```

With ``LazyLinear(256)`` only the output dimension has to be declared, not the input dimension, and the parameters are not initialized yet.

![image](https://hackmd.io/_uploads/HkKnOjggC.png)

![image](https://hackmd.io/_uploads/ryIROigl0.png)

```python=+
X = torch.rand(2, 20)
net(X)

net[0].weight.shape
```

```
torch.Size([256, 20])
```

Passing data through the network for the first time initializes the parameters, with the input dimension inferred from the data.
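
One practical consequence, already visible in Section 6.3 where `net(X)` is called before `net.apply(...)`, is that a custom initializer can only run once the shapes are known. The sketch below wraps that order of operations in two small helpers. The names `has_uninitialized_params` and `materialize` are hypothetical and introduced here for illustration only; the sketch assumes a dummy batch with the right feature size is available.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

def has_uninitialized_params(module):
    # True while any parameter is still waiting for its shape to be inferred
    return any(isinstance(p, UninitializedParameter)
               for p in module.parameters())

def materialize(module, example_input, init=None):
    # Hypothetical helper: one dummy forward pass lets every lazy layer
    # infer its input dimension; no gradients are needed for this pass
    with torch.no_grad():
        module(example_input)
    # Only now do the weights have shapes, so a custom initializer can run
    if init is not None:
        module.apply(init)
    return module

def init_normal(module):
    # After the first forward pass a LazyLinear layer has become nn.Linear
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

lazy_net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
before = has_uninitialized_params(lazy_net)

materialize(lazy_net, torch.rand(2, 20), init=init_normal)
after = has_uninitialized_params(lazy_net)

shapes = [tuple(p.shape) for p in lazy_net.parameters()]
print(before, after, shapes)
```

Calling `lazy_net.apply(init_normal)` before the dummy forward pass would leave the weights untouched, because the `type(module) == nn.Linear` check does not match a layer that is still lazy.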