# Week 8: Learning Efficient Convolutional Networks through Network Slimming

###### tags: `技術研討`

## 3. Network Slimming

### Pruning granularities and the advantages of channel-level sparsity

- weight-level
    - Decides pruning for each individual weight. This gives the highest flexibility and compression rate, but speeding up inference on the resulting sparse model usually requires special software or hardware.
- kernel-level
    - Same approach as in PRUNING FILTERS FOR EFFICIENT CONVNETS: whether a filter is pruned is decided from the filter's weights.
- channel-level
    - Decides whether to prune a channel by some criterion (the channel's incoming / outgoing weights, or the BN scaling factors). The result is a "thinned" version of the unpruned network, so inference can still be accelerated on a GPU.
- layer-level
    - Prunes entire layers, e.g. VGG16 becomes VGG15 XDD. When this kind of pruning works well, it is mainly because the original model is deep enough, which also means the CV task did not need such a complex architecture in the first place.

### Prior work on channel-level sparsity and its challenges

1. Directly pruning the layer weights of a pre-trained model
:::info
It is unlikely that all incoming / outgoing weights of a channel are close to 0 at the same time, so pruning directly from a pre-trained model is very hard in practice. On pre-trained ResNets, this approach can prune only about 10% of the weights without hurting accuracy.
:::
2. Adding group LASSO to the training objective, so that the filter weights of the same channel are pushed toward 0 together
:::info
This approach needs the gradient of the regularization term with respect to all filter weights, so training costs considerable extra computation.
:::

### channel pruning - scaling factors and sparsity-induced penalty

- The model compression technique used in this paper
:::info
The scaling factor $\gamma$ in batch normalization controls whether a channel is pruned:
1. Add the L1 norm of $\gamma$ to the original training objective so that the batch norm scaling factors become sparse (this also acts as feature selection)
2. If a scaling factor is smaller than a global threshold, the corresponding channel is pruned
:::

:pushpin: batch normalization
![](https://i.imgur.com/HbdsXMp.png)
![](https://i.imgur.com/asVoFZo.png)

- scaling and shifting distribution
![](https://i.imgur.com/mrRccv6.png)

:pushpin: pruning process
- network slimming procedure

Pruning flow
![](https://i.imgur.com/XhERlcK.png)

Pruning illustration
![](https://i.imgur.com/hUhyEZe.png)
:::info
As shown above, each BN channel scaling factor is compared against the preset global threshold; a channel whose factor falls below the threshold is pruned.
:::

Objective function
![](https://i.imgur.com/pXbruti.png)
:::info
1. The scaling factors $\gamma$ enter the loss function through an L1 norm, which pushes $\gamma$ toward a sparse state and makes pruning easier.
2. $\lambda$ controls how strongly the L1 norm affects the loss, i.e. how sparse $\gamma$ becomes.
![](https://i.imgur.com/PcXQM6Y.png)
:::

:pushpin: channel pruning in practice
- directly applied: AlexNet, VGGNet
- adaptations are required: ResNet, DenseNet
    - Because the output must eventually be added to the shortcut, the position of BN is adjusted

[15] Identity mappings in deep residual networks.
![](https://i.imgur.com/2lWUO6f.png)
:::info
Pruning decided by the BN scaling factor $\gamma$ removes channels of the layer in front of the BN:
- With the original architecture (a), after pruning by the BN result the channels no longer line up at the addition. Even if the shortcut channels are adjusted to match the pruned result, the more important shortcut feature maps would end up being pruned according to less important feature maps.
- With architecture (b), the shortcut is pruned according to the BN result, and at the addition the feature maps are adjusted to the same channels as the shortcut, which is the pruning result we want.
:::

[17] Densely connected convolutional networks
![](https://i.imgur.com/YJq1zTM.jpg)

## 4. Experimental Results

### Datasets

| Datasets | Training data | Testing data |
|:--------------------:|:-------------:|:------------:|
| CIFAR-10 / CIFAR-100 | 50,000 | 10,000 |
| SVHN | 604,388 | 26,032 |
| ImageNet | 1,200,000 | 50,000 |
| MNIST | 60,000 | 10,000 |

### Model architectures

| Model | Architecture |
|:--------:|:-------------------------------------:|
| VGGNet | 11-layer (8-conv + 3 FC) |
| ResNet | 164-layer pre-activation + bottleneck |
| DenseNet | 40-layer + growth rate 12 |

### Training procedure

* Normal Training
Standard training with no pruning, ==<ins>used as the baseline</ins>==
:::info
:pushpin: Settings

| Dataset | mini batch size | epochs | learning rate |
|:--------:|:---------------:|:------:|:---------------------------------:|
| CIFAR | 64 | 160 | 0.1 (50%→0.01, 75%→0.001) |
| SVHN | 64 | 20 | 0.1 (50%→0.01, 75%→0.001) |
| ImageNet | 256 | 60 | 0.1 (33%→0.01, 67%→0.001) |
| MNIST | 256 | 30 | 0.1 (33%→0.01, 67%→0.001) |

All channel scaling factors ($\gamma$) are initialized to 0.5; the paper explains that this gives higher accuracy.
:::
* Training with Sparsity
==<ins>A sparsity penalty weighted by the hyperparameter $\lambda$ is added to the loss function</ins>==; $\lambda$ is found by grid search over ($10^{-3}$, $10^{-4}$, $10^{-5}$)
:::info
:pushpin: Settings
* VGGNet: $10^{-4}$
* ResNet: $10^{-5}$
* DenseNet: $10^{-5}$
:::
* Pruning
The pruning ratio is fixed within one training run
:::info
:pushpin: Settings
**Q: How is the pruning ratio decided?**
A: The pruning threshold is determined by a percentile among all scaling factors (see Section 5)
:::
* Fine-tuning
Fine-tuning is performed after pruning
:::info
:pushpin: Settings are the same as for training. The only exception is VGGNet on ImageNet, which is fine-tuned for just 5 epochs because of the compute time limit.
:::

### Results

**The figure below shows the results on CIFAR-10 / CIFAR-100 / SVHN**
* Baseline: Normal Training
* Pruned: Training with Sparsity → Pruning → Fine-tuning

:star: Key takeaways
* Even with more than 60% of the channels pruned, results stay close to the baseline (last row of each model)
* DenseNet / ResNet models pruned by 40% perform best!

![](https://i.imgur.com/D4MQCEy.png)

:::warning
:question: Quick Q&A :question:
Why is the pruning ratio 40% or 70%, while the reductions in parameters & FLOPs do not match that ratio?
:::

<ins>The results also show that VGGNet has more redundant parameters that can be pruned (figure below)</ins>
* Orange: parameter count of the original model
* Blue / green: fraction of parameters remaining after pruning

![](https://i.imgur.com/MwlWuD3.png =480x320)

**The figure below shows the results on ImageNet / MNIST**

Table 2 shows that for VGGNet, pruning 50% of the channels removes 82.5% of the parameters but only 30.4% of the FLOPs
* Conv layers: ( 378 / 2752 ) channels pruned (==<ins>computation-intensive</ins>==)
* FC layers: ( 5094 / 8792 ) channels pruned (==<ins>parameter-intensive</ins>==)

Table 3 compares with another pruning method, Structured Sparsity Learning (SSL)
* SSL prunes in a similar way to this paper; the difference is that it uses group lasso instead of the L1 norm

![](https://i.imgur.com/fJ2VAM1.png =480x320)

**The figure below shows the Multi-pass Scheme results**

The multi-pass scheme is tested on VGG. Because VGG has no skip connections, pruning too much would damage the model structure, so an extra constraint is added: ==<ins>at most 50% of the channels in each layer may be pruned</ins>==

:::info
:pushpin: Multi-pass scheme means repeating the pruning procedure; one iteration is (train with sparsity → prune → fine-tune)
:::

:star: Key takeaways
* The original baselines are 6.34% & 26.74%
* On CIFAR-10, performance starts to degrade after iteration 5
* On CIFAR-100, it keeps getting worse after iteration 3, presumably because with 100 classes, pruning too much hurts the model too much!
* Still, nearly 90% of the parameters & 70% of the FLOPs can be pruned with good results!

![](https://i.imgur.com/SuSbFil.png =480x480)

:::warning
:question: Quick Q&A :question:
Q: Why do some Train with sparsity results beat the Fine-tuned ones?
:::

## 5. Analysis

- Effect of the 2 hyperparameters on the network:
    - the pruned percentage t
    - the coefficient of the sparsity regularization term $\lambda$

### The pruned percentage t:

>Pruning too few channels gives limited model compression
>Pruning too many channels makes the accuracy hard to recover by fine-tuning
>>How do we know how much pruning will hurt model accuracy?
Only experiments can tell.

#### Effect of the pruning rate on the error rate
![](https://i.imgur.com/69VDdgj.png)
1. **Trained with Sparsity** (before fine-tuning) is better than the Baseline
2. **Pruned**: error rate > 5.5% when pruning > 40%, and error rate > 6.5% when pruning > 50%
3. **Fine-tuned**: error rate > 5.5% when pruning > 60%, and error rate > 6.25% when pruning > 80%

---

### The sparsity regularization term $\lambda$:

>If $\lambda$ is too low, it has little sparsifying effect on $\gamma$
>If $\lambda$ is too high, most $\gamma$ values approach 0
>>How high is too high, and how low is too low? Only experiments can tell

#### Effect of $\lambda$ on $\gamma$ (the scaling factors):
![](https://i.imgur.com/gGoDOS7.png)
1. **$\lambda$ = 0**: $\gamma$ lies between 0 and 0.2, but few values are exactly 0; most are around 0.1
2. **$\lambda$ = 0.00001**: the number of $\gamma$ values at 0 rises to about 450, and $\gamma$ overall moves closer to 0, so part of $\gamma$ becomes sparse
3. **$\lambda$ = 0.0001**: the number of $\gamma$ values at 0 rises to about 2000; almost all of them are 0

Going from small to large $\lambda$, work out what fraction to prune

---

### Pruning works like feature selection: only the important features get selected
![](https://i.imgur.com/OvdPZ0V.png)
- The channels of one layer
- Every channel starts with the same initial weight and the same $\gamma$
- During training, the unimportant channels (small $\gamma$) are gradually eliminated and marked in a dark color

---

## PyTorch Code Walkthrough

### vgg19

#### vgg19 model architecture:
19 : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512]

The printout below comes from an already pruned model, which is why its channel counts are smaller than the cfg above.
```python=
print(model)
```
```python=
vgg(
  (feature): Sequential(
    (0): Conv2d(3, 28, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv2d(28, 56, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (4): BatchNorm2d(56, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(56, 90, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (8): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace=True)
    (10): Conv2d(90, 81, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (11): BatchNorm2d(81, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace=True)
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): Conv2d(81, 130, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (15): BatchNorm2d(130, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU(inplace=True)
    (17): Conv2d(130, 115, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (18): BatchNorm2d(115, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU(inplace=True)
    (20): Conv2d(115, 110, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (21): BatchNorm2d(110, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU(inplace=True)
    (23): Conv2d(110, 115, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (24): BatchNorm2d(115, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU(inplace=True)
    (26): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (27): Conv2d(115, 173, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (28): BatchNorm2d(173, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (29): ReLU(inplace=True)
    (30): Conv2d(173, 166, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (31): BatchNorm2d(166, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (32): ReLU(inplace=True)
    (33): Conv2d(166, 111, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (34): BatchNorm2d(111, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (35): ReLU(inplace=True)
    (36): Conv2d(111, 122, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (37): BatchNorm2d(122, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (38): ReLU(inplace=True)
    (39): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (40): Conv2d(122, 105, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (41): BatchNorm2d(105, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (42): ReLU(inplace=True)
    (43): Conv2d(105, 105, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (44): BatchNorm2d(105, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (45): ReLU(inplace=True)
    (46): Conv2d(105, 136, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (47): BatchNorm2d(136, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (48): ReLU(inplace=True)
    (49): Conv2d(136, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (50): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (51): ReLU(inplace=True)
  )
  (classifier): Linear(in_features=8, out_features=10, bias=True)
)
```

BatchNorm layer parameters:
![](https://i.imgur.com/P2VA5FG.png)

| The 4 BatchNorm quantities | PyTorch |
| -------- | -------- |
| mini-batch mean | running_mean |
| mini-batch variance | running_var |
| gamma (scale) | weight |
| beta (shift) | bias |

- Why the misleading name `weight` for gamma? Because in PyTorch only learnable tensors are called parameters~~ (`running_mean` and `running_var` are buffers, not parameters)
- So at inference time the relation between the BN output Y and the input X is:
Y = (X - running_mean) / sqrt(running_var + eps) * gamma + beta

---

![](https://i.imgur.com/xDNzyey.png)

#### Train with channel sparsity regularization:
```python=
# main.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable  # a no-op wrapper since PyTorch 0.4, kept as in the script

# model, args, train_loader and optimizer are module-level globals in main.py
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        pred = output.data.max(1, keepdim=True)[1]
        loss.backward()  # compute the gradients of the parameters
        # args.sr (True/False): sparsity regularization
        if args.sr:
            updateBN()  # channel sparsity regularization
        optimizer.step()
        # optimizer: holds references to the parameters, whose .grad stores the gradients
        # step(): updates the parameter values from those gradients
```
```python=
"additional subgradient descent on the sparsity-induced penalty term"
# args.s (0.0001): scale sparse rate, i.e. lambda
# torch.sign(): element-wise, returns 1 for elements > 0, -1 for elements < 0, and 0 for 0
# tensor([ 0.7734, 0.5677, -0.3896, 1.9878]) -> tensor([ 1., 1., -1., 1.])
def updateBN():
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            # Loss = loss + lambda * g(gamma)
            # Why is lambda multiplied by 1 / -1 / 0? -> homework for teacher Ya XD
            # (hint: the subgradient of |gamma| is sign(gamma))
            # m.weight is gamma
            m.weight.grad.data.add_(args.s*torch.sign(m.weight.data))  # L1
```
![](https://i.imgur.com/WXtWtbe.png)

---

![](https://i.imgur.com/DJ1WBRz.png)

#### Prune channels with small scaling factors:

Pruning steps:

1. Record who stays (and who gets pruned):
    - (1) Count the total number of channels: there are 5504 $\gamma$ values
    - (2) Take the absolute value of $\gamma$ from every BatchNorm layer, store them in the variable bn, and sort them ascending (a tensor of length 5504)
    - (3) Decide the threshold: 5504 * 0.5 (pruning rate) = 2752, so every $\gamma$ that is not larger than the value at sorted position 2752 gets pruned (that value is 0.4569)
    - (4) Loop over the BatchNorm layers and compare each $\gamma$ with 0.4569: larger is marked 1, otherwise 0. The result is stored in mask, which records the channels to prune in each layer
2. Copy what stays into newmodel
    - (1) If it is the first Conv layer: copy the weights to keep into newmodel's weight (pruned ones are not copied)
    - (2) If it is a BatchNorm layer: copy the weights to keep into newmodel's weight (pruned ones are not copied)
    - (3) And so on... Done!

```python=
# vggprune.py
import numpy as np
import torch
import torch.nn as nn

'''(1) Count the total number of channels'''
total = 0
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        total += m.weight.data.shape[0]  # m.weight is gamma
# baseline
# m.weight.data.shape[0]: 64 64 128 128 256 256 256 256 512 512 512 512 512 512 512 512 (channels)(baseline)
# total : 5504 (5504 channels in total)
```
```python=
'''(2) Store the absolute value of every gamma in bn'''
bn = torch.zeros(total)  # length 5504
index = 0
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        size = m.weight.data.shape[0]
        # m.weight.data.shape[0]: 64 64 128 128 256 256 256 256 512 512 512 512 512 512 512 512 (channels)
        bn[index:(index+size)] = m.weight.data.abs().clone()
        index += size
# slices of bn filled in turn, bn[index:(index+size)]:
# 0:64
# 64:128
# ...
# 4480:4992
# 4992:5504
# bn: tensor([1.2170, 0.7687, ..., 0.5076, 0.4496]) (length 5504), holding every |gamma|
```
```python=
'''(2)-(3) Sort ascending and pick the threshold'''
y, i = torch.sort(bn)  # ascending
thre_index = int(total * args.percent)  # args.percent: pruning ratio, 0.5
thre = y[thre_index]  # the value at sorted position 2752: 0.4569
# y: tensor([0.0962, 0.1189, 0.1631, ..., 1.2170, 1.2311, 1.2654]) -> sorted_tensor
# i: tensor([ 58, 48, 8, ..., 1, 114, 26]) -> sorted_indices (position of each sorted value in bn)
# thre_index = total * args.percent: 5504*0.5=2752
# thre: 0.4569
# Later every |gamma| is compared with thre (0.4569) to produce a 0/1 tensor;
# values larger than thre stay, the rest are not copied into newmodel
```
:::info
**torch.sort():**
A tuple of (sorted_tensor, sorted_indices) is returned, where the sorted_indices are the indices of the elements in the original input tensor.
- torch.sort(input, dim=-1, descending=False, out=None)
- dim=0 sorts the values within each column, dim=1 sorts the values within each row; the default is dim=-1 (the last dimension)
- Returns one tuple of two tensors: sorted_tensor and sorted_indices (where each sorted value sat in the input, not its rank)
:::
```python=
'''(4) Record who stays and who gets pruned: every gamma not larger than the threshold (0.4569) is pruned'''
cfg = []
cfg_mask = []
for k, m in enumerate(model.modules()):
    if isinstance(m, nn.BatchNorm2d):
        # Compare with thre: larger is marked 1, otherwise 0
        weight_copy = m.weight.data.abs().clone()  # m.weight is gamma
        mask = weight_copy.gt(thre).float()#.cuda()
        # every value is compared with thre, giving a 0/1 tensor;
        # values larger than thre stay, the rest are not copied into newmodel later

        # Element-wise multiply to zero out the channels to prune
        # mul_ works in place, which is why no `m.weight.data =` is needed; see the mul_ notes below
        m.weight.data.mul_(mask)  # multiply by the 0/1 mask to make gamma sparse
        m.bias.data.mul_(mask)
        '''
        cfg: remaining channels per layer
        cfg_mask: keep-mask over all channels of each layer
        Both are used to build the architecture of newmodel
        '''
        cfg.append(int(torch.sum(mask)))  # number of ones -> 31
        cfg_mask.append(mask.clone())  # mask length equals the channel count -> 64
        # cfg: 31
        # cfg_mask: [tensor([0., 1., 1., ... 0., 0., 0.])]
        print('layer index: {:d} \t total channel: {:d} \t remaining channel: {:d}'.
            format(k, mask.shape[0], int(torch.sum(mask))))
        # layer index: 3 total channel: 64 remaining channel: 31
        # layer index: 6 total channel: 64 remaining channel: 57
        # layer index: 10 total channel: 128 remaining channel: 94
        # layer index: 13 total channel: 128 remaining channel: 84
        # layer index: 17 total channel: 256 remaining channel: 142
    elif isinstance(m, nn.MaxPool2d):
        cfg.append('M')
```
:::info
**torch.gt(Tensor1, Tensor2):**
- Tensor1 and Tensor2 are tensors of the same (or broadcastable) shape; Tensor2 may also be a single number, as with `thre` above
- Compares them element by element: the result is 1 (True) where the element of Tensor1 > the element of Tensor2, otherwise 0 (False)

**mul() & mul_():** element-wise multiplication (not matrix multiplication)
- mul_ vs mul: every method with a trailing "_" works in place, so the original data changes too
- Without the "_", a new tensor is returned and the original data stays unchanged
- x=tensor([1,2,3]), y=tensor([2,2,2])
- x.mul(y) = tensor([2,4,6]), x=tensor([1,2,3])
- x.mul_(y) = tensor([2,4,6]), x=tensor([2,4,6]) (x changes as well)
:::

---

```python
'''new model'''
# The channel counts after pruning should look like this, cfg:
# [31, 57, 'M', 94, 84, 'M', 142, 133, 133, 137, 'M', 277, 282, 290, 290, 'M', 276, 254, 258, 13]
newmodel = vgg(dataset=args.dataset, cfg=cfg)  # build an empty newmodel with the channel counts we want
```
```python=
'''Copy the weights to keep into newmodel (same steps as last time)'''
layer_id_in_cfg = 0
start_mask = torch.ones(3)  # the input starts with 3 channels
end_mask = cfg_mask[layer_id_in_cfg]  # start from cfg_mask[0]
# model structure: conv->BN->conv->BN->...
for [m0, m1] in zip(model.modules(), newmodel.modules()):
    if isinstance(m0, nn.BatchNorm2d):
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        # end_mask: tensor([0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1.,
        #                   0., 1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1.,
        #                   1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1.,
        #                   1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
        # positions of the ones
        # idx1: [ 1 2 5 6 9 10 13 14 17 19 21 23 26 27 29 30 32 34 35 36 37 39 40 41 46 47 49 51 52 53 54]
        if idx1.size == 1:
            idx1 = np.resize(idx1,(1,))
        # copy gamma, beta and the running statistics of the kept channels
        m1.weight.data = m0.weight.data[idx1.tolist()].clone()
        m1.bias.data = m0.bias.data[idx1.tolist()].clone()
        m1.running_mean = m0.running_mean[idx1.tolist()].clone()
        m1.running_var = m0.running_var[idx1.tolist()].clone()
        layer_id_in_cfg += 1
        start_mask = end_mask.clone()
        if layer_id_in_cfg < len(cfg_mask):  # do not change in Final FC
            end_mask = cfg_mask[layer_id_in_cfg]
    elif isinstance(m0, nn.Conv2d):
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        print('In shape: {:d}, Out shape {:d}.'.format(idx0.size, idx1.size))
        # start_mask: tensor([1., 1., 1.])
        # idx0: [0 1 2]  # positions of the ones
        # end_mask: tensor([0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1.,
        #                   0., 1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1.,
        #                   1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1.,
        #                   1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) (64 -> 31 channels)
        # positions of the ones
        # idx1: [ 1 2 5 6 9 10 13 14 17 19 21 23 26 27 29 30 32 34 35 36 37 39 40 41 46 47 49 51 52 53 54]
        # In shape: 3, Out shape 31.
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))
        if idx1.size == 1:
            idx1 = np.resize(idx1, (1,))
        w1 = m0.weight.data[:, idx0.tolist(), :, :].clone()  # keep the surviving input channels
        w1 = w1[idx1.tolist(), :, :, :].clone()  # keep the surviving output channels
        m1.weight.data = w1.clone()  # store the weights in newmodel
    elif isinstance(m0, nn.Linear):
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))
        # only the input features are pruned; the output size (number of classes) stays the same
        m1.weight.data = m0.weight.data[:, idx0].clone()
        m1.bias.data = m0.bias.data.clone()
```

## Questions

|Name|Question|Answer|
|:-:|:-:|:-:|
||||
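
### Appendix: a toy sketch of the slimming recipe

The recipe from the code walkthrough (an L1 subgradient on $\gamma$, one global percentile threshold, per-layer keep masks) can be condensed into a dependency-free sketch. This is only an illustration under assumptions, not the authors' implementation: the function names `l1_subgradient_step`, `global_threshold`, `channel_masks` and `conv_param_count` are hypothetical, plain Python lists stand in for the BN `weight` tensors, the $\gamma$ values are made up, and the `min(...)` guard on the index is an addition that the walkthrough code does not have. It may also hint at an answer to the Quick Q&A about why parameter savings do not track the channel pruning ratio: a conv layer's weight count scales with in_channels × out_channels, so thinning both sides compounds.

```python
def sign(x):
    """Return 1, -1 or 0, mirroring torch.sign for a single number."""
    return (x > 0) - (x < 0)


def l1_subgradient_step(gammas, grads, lam):
    """Add lam * sign(gamma) to each gradient (a subgradient of lam * |gamma|)."""
    return [g + lam * sign(w) for w, g in zip(gammas, grads)]


def global_threshold(gammas_per_layer, prune_ratio):
    """Pick the |gamma| at position int(total * prune_ratio) of the ascending sort."""
    flat = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    # Guard added for this sketch so prune_ratio = 1.0 cannot index out of range.
    index = min(int(len(flat) * prune_ratio), len(flat) - 1)
    return flat[index]


def channel_masks(gammas_per_layer, threshold):
    """1 keeps a channel, 0 prunes it; strict '>' like Tensor.gt in the walkthrough."""
    return [[1 if abs(g) > threshold else 0 for g in layer]
            for layer in gammas_per_layer]


def conv_param_count(channels, in_channels=3, kernel=3):
    """Weights of a bias-free conv stack: sum of in * out * kernel * kernel."""
    total = 0
    for out_channels in channels:
        total += in_channels * out_channels * kernel * kernel
        in_channels = out_channels
    return total


# Toy scaling factors for two BN layers with 4 channels each (made-up values).
gammas = [[0.9, 0.05, 0.4, 0.01], [0.3, 0.02, 0.7, 0.6]]
threshold = global_threshold(gammas, prune_ratio=0.5)
masks = channel_masks(gammas, threshold)
kept = [sum(mask) for mask in masks]  # channels left per layer, i.e. the cfg

before = conv_param_count([4, 4])
after = conv_param_count(kept)
print("threshold:", threshold)
print("masks:", masks)
print("kept per layer:", kept)
print("conv weights before/after:", before, after)
```

In this toy case 5 of the 8 channels (62.5%) are pruned, because the strict comparison also drops the value sitting exactly at the threshold, while the conv weights fall from 252 to 45, a reduction of roughly 82%.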