# Resources

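## Proof of Concept

Super resolution
* https://github.com/xinntao/Real-ESRGAN
* https://github.com/prs-eth/thera

Lighting enhancement
* https://huggingface.co/Cidaut/DarkIR

## Resource

DINOv2
* https://www.kaggle.com/code/urjjj0909/classification-with-dinov2?scriptVersionId=232018245
* https://github.com/antmedellin/dinov2/blob/main/notebooks/depth_estimation.ipynb
* https://junukcha.github.io/code/2023/12/31/dinov2-pca-visualization/
* https://github.com/purnasai/Dino_V2/blob/main/2.PCA_visualization.ipynb
* https://github.com/facebookresearch/dinov2/issues/23
* https://jiawei-yang.github.io/DenoisingViT/gallery_0.html
* https://www.kaggle.com/code/urjjj0909/dinov2-dense-matching?scriptVersionId=236433973

##

#### 1. Three CNN models

==Add basic model information: input dimensions, parameter count, compute cost, etc.==

FPN
* A network architecture for multi-scale detection that fuses deep and shallow features
* Let level N be the shallower level and level N+1 the deeper one. At each level, level N+1 is first upsampled by 2×, then added element-wise to level N after level N has passed through a 1×1 Conv

![image](https://hackmd.io/_uploads/SJn0mkKAkg.png)
![image](https://hackmd.io/_uploads/BJbe41K01e.png)

ResNet
* The function being learned is $H(x)=F(x)+x$
    * $H(x)$ (the underlying mapping) is not learned directly
    * $F(x)$ (the residual mapping) is learned instead
* Instead of learning $H(x)$ directly, the block splits into two paths and then merges them:
    1. The input $x$ goes through the Conv layers (the residual branch), which output $F(x)$
    2. The input $x$ is also passed straight through (the shortcut)
    3. The sum $F(x)+x$ is fed into the activation layer
* If the Conv layers learn nothing useful, the worst case is that $F(x)$ is driven to 0 and the output equals the input, $H(x)=x$, so the block simply behaves as an identity mapping

![image](https://hackmd.io/_uploads/B16_VyYRkg.png)
![image](https://hackmd.io/_uploads/rJ46VyYC1l.png)

https://github.com/albanie/convnet-burden/blob/master/reports/resnet-50.md

https://www.digitalocean.com/community/tutorials/popular-deep-learning-architectures-resnet-inceptionv3-squeezenet

DenseNet
* Expressing the block design as formulas: $H(\cdot)$ denotes a non-linear function that may contain Conv+BN+ReLU or Pooling. In a traditional CNN, the output of layer $l$ is $x_l=H_l(x_{l-1})$:
    * In ResNet, the output of layer $l$ also receives the identity input from the previous layer, combined by addition: $$x_l=H_l(x_{l-1})+x_{l-1}$$
    * In DenseNet, layer $l$ receives the feature maps of all preceding layers, combined by channel-wise concatenation: $$x_l=H_l([x_0,x_1,...,x_{l-1}])$$
* Within a dense block, every dense layer outputs feature maps of the same spatial size, so they can be concatenated along the channel dimension
* Every dense layer outputs a fixed number $k$ of feature maps, which can be read as $k$ channels or $k$ Conv filters. $k$ is a hyperparameter that the paper calls the growth rate; it usually does not need to be large to give good performance

![image](https://hackmd.io/_uploads/HypLP1K0Je.png)
![image](https://hackmd.io/_uploads/Bkr7PktCJe.png)
![image](https://hackmd.io/_uploads/SJduvyKRJg.png)

#### 2. Two Transformer models

==Add basic model information: input dimensions, parameter count, compute cost, etc.==

ViT
* The image is split into 16×16 patches, and each patch is treated like a word in NLP
* A position embedding is added to each patch so the model can learn where the patch sits in the whole image
* An extra vector, called the class token, is prepended at position 0
* At inference, not every output token is passed through the MLP head; only the class token is used for prediction, because after self-attention it has already aggregated the relationships across the image

![image](https://hackmd.io/_uploads/ByITwkYAyg.png)
![image](https://hackmd.io/_uploads/rkDkuytAyg.png)

Swin transformer
* Addresses ViT's fixed patch size, which essentially downsamples the image by 16× only; multi-scale feature maps detect objects of different sizes more accurately
* Uses patch merging to get the effect of Pooling in a CNN
* Restricts self-attention to local windows to bound the computational complexity (W-MSA)
* Shifts the windows so that information is exchanged between windows (SW-MSA), implemented with masked attention

![image](https://hackmd.io/_uploads/Hk3BuJY0ye.png)
![image](https://hackmd.io/_uploads/SJMD_ktAJg.png)

#### 3. Pros & Cons of the models above

FPN
* Pros
    * Multi-scale feature fusion yields semantically strong features
    * Can extract features of, and recognise, small objects
    * A basic building block for dense prediction (e.g. segmentation)
* Cons
    * Higher computational complexity
    * Risk of overfitting (The complexity of FPNs can increase the risk of overfitting, particularly if the training data is limited)

ResNet
* Pros
    * A shortcut connection only passes the input forward and adds it, so it introduces no extra parameters and only a negligible element-wise addition; model complexity does not increase
    * $\cfrac{\partial{H(x)}}{\partial{x}}=\cfrac{\partial{F(x)}}{\partial{x}}+1$. If the first term vanishes, the second term is still 1, so the gradient is passed straight through and the network can keep learning
    * Networks built from residual blocks can be very deep. Because the shortcuts themselves add no parameters, they do not raise model complexity further, so even a deeper network can still easily find a not-too-complex model that fits the data
* Cons
    * Stacking many layers increases the computational cost of training
    * Inference speed rarely reaches real-time level

DenseNet
* Pros
    * Dense connections and feature reuse inside a dense block preserve or strengthen the gradient, which mitigates vanishing gradients
    * Feature reuse can reduce overfitting
    * Every dense layer has a short feed-forward path to the target and takes part in the loss computation, which acts like implicit deep supervision for each dense layer
* Cons
    * The dense connections lead to a higher computational cost during training
    * Inference speed rarely reaches real-time level
    * The dense connections and the need to store intermediate feature maps can result in increased memory usage
    * Dense connections might limit the capacity of later layers to learn complex features, potentially affecting its ability to learn certain types of features

ViT
* Pros
    * Builds in little of the inductive bias of CNNs; attention captures local and global features at the same time
    * Opens up the possibility of one unified model for NLP and CV
    * Whereas CNNs tend to saturate in performance, ViT keeps improving as the amount of training data grows
* Cons
    * The amount of training data required may be orders of magnitude larger than for CNNs
    * Computational complexity of training and inference is high
    * Inference speed rarely reaches real-time level
    * With a 16×16 patch size and 224×224 images, it is hard to scale to higher-resolution images

Swin transformer
* Pros
    * Scalability (It can effectively handle large image sizes and resolutions, making it suitable for various applications)
    * Hierarchical architecture (The hierarchical design allows for efficient processing of high-resolution images and capturing both local and global features)
    * General-purpose backbone (Swin transformer can be used as a general-purpose backbone for a wide range of vision tasks)
* Cons
    * The amount of training data required may be orders of magnitude larger than for CNNs
    * Computational complexity of training and inference is high
    * Inference speed rarely reaches real-time level

#### 4. Why use this model (model size, inference latency, etc.)? Has it been benchmarked against other models?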
[DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)

Start with why object detection is not used: because we can constrain the data on the front end, the task reduces to plain classification.

Out-of-distribution capability

SSL, no label needed

![image](https://hackmd.io/_uploads/H1hj4pdRJe.png)
![image](https://hackmd.io/_uploads/rJLQB6dAyl.png)

##

CCL
RANSAC
Unsharp mask
1st order enhancement
2nd order enhancement
linear transform
piece-wise linear transform
power transform
log transform
gamma correction
histogram equalization
histogram matching
MAE
DINOv2
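
The FPN fusion rule from section 1 (2× upsampling of level N+1, 1×1 Conv on level N, element-wise addition) can be sketched in plain Python. This is only an illustration: the function names `conv1x1`, `upsample2x`, and `fpn_merge` are hypothetical, feature maps are nested lists shaped `[C][H][W]` rather than tensors, and nearest-neighbour upsampling is an assumption, since the notes do not say which interpolation is used.

```python
def conv1x1(x, weight):
    """1x1 convolution: mix channels independently at every pixel.

    x is [C_in][H][W]; weight is [C_out][C_in] (bias omitted for brevity).
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(row[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
         for i in range(h)]
        for row in weight
    ]


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map (assumed)."""
    out = []
    for channel in x:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def fpn_merge(shallow, deep, lateral_weight):
    """Fuse level N (shallow) with level N+1 (deep) by element-wise addition."""
    lateral = conv1x1(shallow, lateral_weight)  # match the channel count
    top_down = upsample2x(deep)                 # match the spatial size
    return [
        [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(lateral, top_down)
    ]


# Toy data: level N has 2 channels at 4x4, level N+1 has 1 channel at 2x2.
shallow = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
deep = [[[10.0, 20.0], [30.0, 40.0]]]
lateral_weight = [[0.5, 0.25]]  # 2 input channels -> 1 output channel

merged = fpn_merge(shallow, deep, lateral_weight)
for row in merged[0]:
    print(row)
```

The 1×1 Conv exists only to make the channel counts agree, and the upsampling only to make the spatial sizes agree; once both match, the fusion itself is a plain addition.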
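
The ResNet gradient argument from section 3 ($\partial H/\partial x=\partial F/\partial x+1$) can be checked with a small scalar calculation. This is a deliberately simplified sketch: each block is treated as a scalar function, every block's $\partial F/\partial x$ is assumed to be the same small value `0.01`, and the helper names are hypothetical.

```python
def plain_gradient(local_grads):
    """Chain rule through plain layers: the product of the dF/dx terms."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_gradient(local_grads):
    """Chain rule through residual blocks: each factor is dF/dx + 1."""
    g = 1.0
    for d in local_grads:
        g *= d + 1.0
    return g


depth = 50
local_grads = [0.01] * depth  # assumption: every block's dF/dx is tiny

print("plain   :", plain_gradient(local_grads))     # collapses towards zero
print("residual:", residual_gradient(local_grads))  # stays on the order of 1
```

With the shortcut, even when every $\partial F/\partial x$ is close to zero, each factor in the product stays close to 1, so the gradient that reaches the early layers does not vanish.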
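
For the DenseNet formula $x_l=H_l([x_0,x_1,...,x_{l-1}])$ in section 1, a short bookkeeping sketch shows how the growth rate $k$ drives the channel count. The function name `dense_block_channels` and the example numbers (64 input channels, $k=32$, 6 layers) are chosen for illustration only.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Track channel counts through one dense block.

    Returns (channels seen by each dense layer, channels leaving the block).
    """
    features = [in_channels]  # channel counts of x_0, x_1, ...
    layer_inputs = []
    for _ in range(num_layers):
        layer_inputs.append(sum(features))  # concat of all earlier maps
        features.append(growth_rate)        # H_l emits k new feature maps
    return layer_inputs, sum(features)


layer_inputs, out_channels = dense_block_channels(
    in_channels=64, growth_rate=32, num_layers=6
)
print(layer_inputs)  # input width grows by k at every layer
print(out_channels)
```

Each layer adds only $k$ channels, but its input grows linearly with depth, which is where the memory cost listed under the DenseNet cons comes from.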
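
The ViT resolution concern and the Swin windowing idea from sections 2 and 3 can be made concrete with a little arithmetic. The sketch below counts ViT tokens for a given image and patch size, and compares attention cost using the complexity expressions given in the Swin Transformer paper, $\Omega(\text{MSA})=4hwC^2+2(hw)^2C$ and $\Omega(\text{W-MSA})=4hwC^2+2M^2hwC$. The function names are hypothetical, and the example sizes (a 56×56 token grid, $C=96$, window $M=7$) are assumptions for illustration.

```python
def vit_num_tokens(image_size, patch_size, class_token=True):
    """Number of tokens ViT attends over for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)


def global_msa_cost(h, w, c):
    """Global self-attention over an h x w token grid with c channels."""
    return 4 * h * w * c * c + 2 * (h * w) ** 2 * c


def window_msa_cost(h, w, c, m):
    """Self-attention restricted to non-overlapping m x m windows."""
    return 4 * h * w * c * c + 2 * m * m * h * w * c


print(vit_num_tokens(224, 16))  # 14 * 14 patches plus the class token
print(vit_num_tokens(448, 16))  # doubling the side roughly quadruples tokens

ratio = global_msa_cost(56, 56, 96) / window_msa_cost(56, 56, 96, 7)
print(f"global / windowed cost on a 56x56 grid: {ratio:.1f}x")
```

The global term is quadratic in the number of tokens, while the windowed term is linear in it for a fixed window size, which is why windowing makes high-resolution inputs tractable.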