
# Advanced Computer Vision Models

- Part1. LeNet
- Part2. AlexNet
- Part3. VGGNet
- Part4. ResNet
- Part5. MobileNet
- Part6. Inception Network
- Part7. SqueezeNet
- Part8. EfficientNet
- Part9. DenseNet
- Part10. Autoencoder
- Part11. Rank-N or Top-N Accuracy

## Part1. LeNet

- Convolution Layer 1: 6 @ 5x5 (Filters/Kernels)
- Convolution Layer 2: 16 @ 5x5 (Filters/Kernels)
- Max Pooling: 2x2 kernel + stride 2
- FC Layer 1: 5x5x16 nodes $\rightarrow$ 120 nodes
- FC Layer 2: 120 nodes $\rightarrow$ 84 nodes
- Output Layer: 84 nodes $\rightarrow$ 10 nodes

<center>
<img src="https://hackmd.io/_uploads/Skh1hW31yl.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

:::success
**Key note**: LeNet reaches 99.3% accuracy on the MNIST dataset.
:::

<br>

## Part2. AlexNet

- The first 5 layers are convolutional layers and the last 3 are FC layers

<center>
<img src="https://hackmd.io/_uploads/r1QMgz2yyg.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<br>

## Part3. VGGNet

- VGGNet mainly uses **small 3x3 filters** and is **considerably deeper** than earlier networks.
- VGG16 has 13 Conv Layers with 3 FC Layers
- VGG19 has 16 Conv Layers with 3 FC Layers
- The number of filters / feature maps increases gradually until the FC layers
- The activation function is usually **ReLU**

<center>
<img src="https://hackmd.io/_uploads/ryXdgMh1yg.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| "Deep" network design | High computational cost |
| Regular, uniform structure | Many redundant parameters |

</center>

:::success
**Key note**: Achieved 92.7% top-5 accuracy on ImageNet (1000 classes)
:::

<br>

## Part4. ResNet

- A plain CNN is a linear sequence of layers; when the model becomes very deep, its performance degrades
- **Skip connections** effectively mitigate the vanishing gradient problem
- The residual structure makes it possible to **increase network depth without increasing the error**
- **Exploding and Vanishing Gradients**
    1. In a deep network with N layers, N derivatives must be multiplied together to perform a gradient update
    2. If the derivatives are large, the gradient grows exponentially, or "explodes"
    3. Likewise, if the derivatives are small, the gradient shrinks exponentially, or "vanishes", which makes weight updates too slow
    4. Solution: add the input of a block directly to the output of that block (a skip connection)

<center>
<img src="https://hackmd.io/_uploads/H1XwZz21yx.png" style=" width: 80%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Why Do ResNets Work? (Mathematics)

<center>
<img src="https://hackmd.io/_uploads/B1SsZGhy1l.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<center>
<img src="https://hackmd.io/_uploads/ryPOIItgke.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- ResNet34 stacks consecutive 3x3 conv layers with different numbers of feature maps (64, 128, 256, 512), and a skip connection bypasses every 2 convolutions. ResNet50 uses the same stage layout but with 1x1-3x3-1x1 bottleneck blocks, so each skip bypasses 3 convolutions
- Within a block the output size stays the same (padding=1, stride=1)
- ResNet is built from multiple residual units and comes in several depths: 18, 34, 50, 101, 152 and 1202 layers
- It needs few or even no FC layers, so the model can be made deeper and learn more features

<center>
<img src="https://hackmd.io/_uploads/rks8fz3yke.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Deep networks can be trained effectively | High computational cost (due to deep structure) |
| Better performance | Complex structure (for mobile or embedded devices) |
| Scales well to larger models | |

</center>

<br>

## Part5. MobileNet

- MobileNet is a lightweight CNN designed to run on embedded devices or phones
- Inference (i.e. forward propagation) is relatively slow on such devices, so the model has to be cheap to run
- MobileNet Use Cases

<center>
<img src="https://hackmd.io/_uploads/ByAsGM2JJe.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Mobile or embedded systems usually have limited compute power because they are low-cost and energy-efficient
- Using CNNs on these systems requires training smaller models and compressing models
- MobileNet achieves a phone-friendly model through Depthwise Separable Convolutions and Two Hyper-Parameters

#### Depthwise Separable Convolutions

- Standard convolution: for a 12x12x3 input image and a 5x5 kernel with stride 1 (no padding), the output is 8x8, so one filter needs 5 * 5 * 3 * 64 = 4,800 multiplications. With 128 filters, the total is 75 * 64 * 128 = 614,400
    1. Step I: $8 \times 8 \times 5 \times 5 \times 3 = 4{,}800$
    2. Step II: $4{,}800 \times 128 = 614{,}400$

<center>
<img src="https://hackmd.io/_uploads/S15Y7G3kye.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Depthwise separable convolution: the depthwise step costs 5 * 5 * 3 * 64 = 75 * 64 = 4,800, the pointwise step costs 3 * 64 * 128 = 24,576, for a total of 4,800 + 24,576 = 29,376 operations
    1. Step I: $8 \times 8 \times 5 \times 5 \times 3 = 4{,}800$
    2. Step II: $8 \times 8 \times 3 \times 128 = 24{,}576$
    3. Step III: $4{,}800 + 24{,}576 = 29{,}376$
- Pointwise convolutions restore the same output shape: a 1x1x3 kernel is applied at each of the 64 positions of the depthwise output and produces an 8x8x1 map, which is a linear combination of the channels. Using 128 such kernels gives the 8x8x128 output

:::success
**Key note**: Feature Map: $D \times D \times M$, Kernel Size: $K \times K$, Output Channels: $N$
- Original: $D^2 \times K^2 \times M \times N$
- Depthwise Separable Convolution: $(D^2 \times K^2 \times M) + (D^2 \times M \times N)$
:::

<center>
<img src="https://hackmd.io/_uploads/B1N1Vz3Jyx.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Two Hyper Parameters

- MobileNet also uses two hyper-parameters to shrink the model effectively
- **Width Multiplier**: thins every layer by reducing its number of filters
- **Resolution Multiplier**: reduces the input image size, which in turn reduces the size of every subsequent layer

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Lightweight | Relatively lower accuracy |
| Efficient convolution operations | |
| Flexibility | |

</center>

<br>

## Part6. Inception Network

- A CNN requires many design choices to be tuned; Inception aims to solve the problem of choosing the filter size
- The Inception Network introduces parallel convolution operations, which handle features of different sizes effectively
- Inception V1 (a.k.a. GoogLeNet) was introduced by Google in 2014 and achieved the best performance in the ImageNet (ILSVRC14) competition

<center>
<img src="https://hackmd.io/_uploads/r1YiVf3y1l.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- The Inception Network lets us use Conv filters of several different sizes at the same time
- It uses **same** padding and stride=1 to keep the spatial dimensions consistent
- We can run filters of all sizes, and even Max Pool, and then stack the results together
- This lets the model learn combinations of high-level and low-level features

<center>
<img src="https://hackmd.io/_uploads/SJvhEfh1yg.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Heavy Computation

- Use 1x1 Convolutions to reduce the computation cost

<center>
<img src="https://hackmd.io/_uploads/B1l7BM311l.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Use a **Bottleneck Layer**
    1. Shrink the number of channels first, then expand it again
    2. The computation cost drops by about 10x: it is now 2.4M + 10M = 12.4M, compared with 120M before

<center>
<img src="https://hackmd.io/_uploads/r1_4Bf2ykg.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Inception Block

<center>
<img src="https://hackmd.io/_uploads/BkTBBGn11e.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Inception Design

<center>
<img src="https://hackmd.io/_uploads/BkFLBGnJ1e.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Multi-scale feature extraction | Relatively complex structure |
| Efficient use of compute resources | Hard to port to resource-limited devices |
| Deep networks remain trainable | Parameter and architecture choices are complex |

</center>

<br>

## Part7. SqueezeNet

- The goal is a smaller CNN architecture that keeps the same accuracy
    1. During distributed training, a smaller CNN requires less communication across servers
    2. A smaller CNN needs less bandwidth, so model updates can be delivered from the cloud faster
    3. A smaller CNN is better suited for deployment on embedded systems
- It has 50x fewer parameters than AlexNet and runs 3x faster

#### SqueezeNets Architectural Design Strategies

- Replace 3x3 filters with 1x1 filters
    - A 1x1 filter has 9x fewer parameters than a 3x3 filter
- Reduce the number of input channels fed to the 3x3 filters
    - The number of parameters in such a layer is (input channels * number of filters * 3 * 3)
- Downsample late in the network so that the conv layers have larger feature maps

#### Fire Module

- Squeeze and Expand Layers

<center>
<img src="https://hackmd.io/_uploads/HJ-9N4pJ1x.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- s1×1: the number of 1×1 filters in the squeeze layer
- e1×1 and e3×3: the number of 1×1 and 3×3 filters in the expand layer
- When using a Fire Module, s1×1 is set smaller than (e1×1 + e3×3), so the squeeze layer limits the number of input channels seen by the 3×3 filters

<center>
<img src="https://hackmd.io/_uploads/SJVO8NTyyg.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Extremely small model size | Slightly lower performance than larger networks (compared to VGGNet, ResNet) |
| Relatively high classification performance | Limited feature extraction ability (e.g. spatial features) |
| Easy to port to low-resource devices | Relatively hard to scale to larger, deeper networks (compared to ResNet, DenseNet) |

</center>

<br>

## Part8. EfficientNet

- Motivation behind EfficientNet
    1. CNNs are usually designed for a fixed resource budget and then scaled up (e.g. ResNet-18 to ResNet-200)
    2. Scaling is done by increasing depth (number of layers) or width (number of filters)
    3. The experiments are usually tedious, require manual tuning, and rarely reach the optimal result
    4. The depth and width of the network should be scaled in a balanced way
- We need a more principled way to scale CNNs
- Compound scaling and EfficientNet-B0
- The main target is the balance between inference efficiency and accuracy

#### Compound Scaling

- Scale every dimension (width, depth, resolution) uniformly with a fixed set of scaling coefficients (depth=$\alpha^\phi$, width=$\beta^\phi$, resolution=$\gamma^\phi$)
- The EfficientNet family reaches state-of-the-art accuracy while being up to 10x more efficient
- The researchers studied the effect of scaling up each dimension separately
- They found that balancing the scaling of all dimensions gives the best overall performance

#### Grid Search

- Use grid search to find the relationship between scaling the different dimensions of the Baseline Network under a fixed resource budget (e.g. 2x more FLOPS)
- Find the most suitable scaling coefficient for each dimension
- For that fixed budget, apply the coefficients to scale the Baseline Network up to the desired target model size

<center>
<img src="https://hackmd.io/_uploads/SkzbOVpy1l.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### EfficientNet-B0 Architecture

- The compound scaling method can be applied to any CNN (e.g. it improved MobileNet accuracy by 1.4% and ResNet by 0.7%)
- How effective model scaling is depends heavily on the Baseline Network
- EfficientNet-B0 was designed with Google's Neural Architecture Search (NAS)
    1. The goal is an efficient CNN that balances compute resources and performance
    2. It uses the MBConv (Mobile Inverted Bottleneck Convolution) block from MobileNetV2
    3. MBConv ≈ depthwise convolution + pointwise convolution + skip connection

<center>
<img src="https://hackmd.io/_uploads/S1tv_4T11l.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Good performance-to-compute ratio across many application scenarios (B0~B7) | Relatively complex design (e.g. relies on NAS) |
| Relatively high classification performance | Harder to train (especially B6, B7) |
| Easy to port to low-resource devices | Optimized specifically for image classification |

<br>

## Part9. DenseNet

- By reusing features from intermediate layers, it avoids relearning redundant parameters
- ResNet but better
- Densely connected convolutional neural networks
- Introduced in 2016 and won the Best Paper Award at the 2017 CVPR conference
- It was able to attain higher accuracy than ResNet with fewer parameters

<center>
<img src="https://hackmd.io/_uploads/Hk-Z6IKx1g.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Motivations

- Training deep CNNs is problematic due to vanishing gradients
- Because the path through a deep network becomes very long, the gradient shrinks to zero (vanishes) before it has travelled the whole path
- DenseNets address this with the idea of **Collective Knowledge**: every layer receives information from all preceding layers

#### DenseNet Architecture - Dense Block

<center>
<img src="https://hackmd.io/_uploads/SJ0QK46J1e.png" style=" width: 90%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Every layer receives information from all preceding layers
- Feature maps inside the same Dense Block keep the same spatial size (otherwise they could not be concatenated)
- The basic DenseNet Composition Layer consists of Batch Norm, ReLU and a 3x3 Conv Layer

<center>
<img src="https://hackmd.io/_uploads/SyHcp8Kl1g.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- Bottleneck layer: a BN-ReLU-1x1 Conv is applied before the BN-ReLU-3x3 Conv

<center>
<img src="https://hackmd.io/_uploads/HkVkR8Yxyg.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

#### Multiple Dense Blocks with Transition Layers

<center>
<img src="https://hackmd.io/_uploads/S1ebvFN6Jye.png" style=" width: 70%; height: auto;">
<div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div>
</center>

- A 1x1 Conv followed by 2x2 Average Pooling acts as the "transition layer" between two consecutive dense blocks (it reduces both depth and spatial size)

<center>

| Advantage | Disadvantage |
| :--------: | :--------: |
| Mitigates the vanishing gradient problem | High memory requirement (the outputs of all previous layers must be kept) |
| Encourages feature reuse | Higher computational cost (not the same thing as parameter count) |
| Fewer parameters | |

</center>

<br>

## Part10. Autoencoder

- A neural network that performs lossy compression and learns a representation specific to the data domain it was trained on.

<br>

## Part11. Rank-N or Top-N Accuracy

- Rank-N Accuracy is a more lenient way to evaluate a classifier's accuracy
- Sometimes the classifier is still doing well, but this is not reflected if we only look at its single top predicted class
- Rank-N Accuracy looks at the N classes with the highest probability: a prediction counts as correct if the true class is among them
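
The Rank-N idea can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the classifier returns one score per class for each sample. The function name `top_n_accuracy` and the toy scores are made up for this example and do not come from any particular library.

```python
def top_n_accuracy(scores, labels, n=5):
    """Fraction of samples whose true label is among the n highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep only the n best.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 3 samples over 4 classes (illustrative values only).
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: already correct at rank 1
    [0.40, 0.15, 0.35, 0.10],  # true class 2: missed at rank 1, found at rank 2
    [0.05, 0.15, 0.10, 0.70],  # true class 0: lowest score, only found at rank 4
]
labels = [1, 2, 0]

rank1 = top_n_accuracy(scores, labels, n=1)
rank2 = top_n_accuracy(scores, labels, n=2)
print(f"rank-1: {rank1:.3f}, rank-2: {rank2:.3f}")
```

The second sample shows why the metric is useful: the classifier ranks the true class a close second, which Rank-1 accuracy counts as a plain miss while Rank-2 accuracy gives credit for it.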