SwinIR: Image Restoration Using Swin Transformer

Abstract

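Image restoration aims to recover a high-quality image from a low-quality one (for example, a downscaled or compressed image)
State-of-the-art methods are based on convolutional neural networks; few attempts use Transformers, even though Transformers perform well
This paper proposes SwinIR, an image restoration model based on the Swin Transformer
SwinIR consists of three parts
- shallow feature extraction
- deep feature extraction
- high-quality image reconstruction
Experiments cover three tasks
- image super-resolution
- image denoising
- JPEG compression artifact reduction
The experiments show that SwinIR outperforms previous methods on different tasks by 0.14~0.45 dB, while reducing the number of parameters by up to 67%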

Introduction

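Image restoration tasks such as image super-resolution (SR), image denoising and JPEG compression artifact reduction all aim to reconstruct a high-quality image from a low-quality one
CNNs have been the mainstream approach in prior work. They did improve performance, but they have two problems
- The interaction between the image and the convolution kernels is content-independent
  - Using the same kernel to restore different image regions is not ideal
- Because convolution processes the image locally, it is inefficient at modeling long-range dependencies
As an alternative to CNNs, the Transformer provides a mechanism that captures interactions between contexts
However, Transformer-based image restoration models usually divide the input into fixed-size patches that are processed independently
- Border pixels cannot use neighboring pixels outside the patch for restoration
- The restored image may have border artifacts
  - Patch overlapping alleviates this, but at extra cost
Swin Transformer combines both advantages: like a CNN it can process large images thanks to local attention, and like a Transformer it models long-range dependencies through the shifted window scheme
SwinIR is built on the Swin Transformer and has three modules
- shallow feature extraction
- deep feature extraction
- high-quality image reconstruction
The deep feature extraction module consists of several residual Swin Transformer blocks (RSTB); each block has several Swin Transformer layers and a residual connection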
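Compared with CNN-based models, SwinIR has several advantages
- Content-based interaction between image content and attention weights
- The shifted window mechanism captures long-range dependencies
- Better results with fewer parameters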
- Higher PSNR than other models
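The dB figures in these notes are PSNR values. As a reminder of how PSNR is computed, here is a minimal sketch using the standard definition (10 * log10(peak^2 / MSE)). The function name `psnr`, the flattened pixel lists and the 8-bit peak value of 255 are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import math


def psnr(restored, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixel values."""
    if len(restored) != len(reference) or not restored:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(restored)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference_pixels = [10.0, 20.0, 30.0, 40.0]
restored_pixels = [11.0, 19.0, 31.0, 39.0]  # every pixel is off by exactly 1
score = psnr(restored_pixels, reference_pixels)
```

A higher PSNR means a smaller mean squared error against the ground truth, so a gain of a few tenths of a dB corresponds to a measurable drop in pixel error.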

Related Work

Image restoration
Vision Transformer

Method

Network Architecture

[Figure: SwinIR architecture, with (a) the residual Swin Transformer block (RSTB) and (b) the Swin Transformer layer (STL)]
Shallow feature extraction -> deep feature extraction -> HQ image reconstruction

Shallow and deep feature extraction

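Given a low-quality (LQ) input image
$I_{LQ} \in \mathbb{R}^{H \times W \times C_{in}}$
- H, W and C_in are the image height, width and number of input channels
A 3×3 convolutional layer extracts the shallow feature
- $F_{0} = H_{SF}(I_{LQ})$
The deep feature is then extracted from F_0
- $F_{DF} = H_{DF}(F_{0})$
- H_DF is the deep feature extraction module, containing K RSTBs and one 3×3 convolutional layer
- Using a convolutional layer at the end brings the inductive bias of convolution into the Transformer-based network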
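To make the shallow feature step concrete, here is a minimal pure-Python sketch of a 3×3 convolution with zero padding, which keeps H and W unchanged while mapping C_in channels to C output channels. The function name `conv3x3_same`, the channel-first nested-list layout and the toy kernels are assumptions for illustration only; an actual implementation would use a deep learning framework with learned weights.

```python
def conv3x3_same(image, kernels, bias=None):
    """3x3 convolution with zero padding.

    image:   nested list shaped [C_in][H][W]
    kernels: nested list shaped [C_out][C_in][3][3]
    returns: nested list shaped [C_out][H][W]
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    out = []
    for o, kernel in enumerate(kernels):
        b = bias[o] if bias is not None else 0.0
        plane = [[b] * w for _ in range(h)]
        for c in range(c_in):
            for y in range(h):
                for x in range(w):
                    acc = 0.0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            # Positions outside the image count as zero.
                            if 0 <= yy < h and 0 <= xx < w:
                                acc += image[c][yy][xx] * kernel[c][dy + 1][dx + 1]
                    plane[y][x] += acc
        out.append(plane)
    return out


# Toy input: one channel, H=4, W=5, pixel value = y * 5 + x.
image = [[[y * 5 + x for x in range(5)] for y in range(4)]]
identity = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]  # copies the input
box_sum = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]   # sums the 3x3 neighborhood
features = conv3x3_same(image, [identity, box_sum])  # shape [2][4][5]
```

The spatial size of `features` matches the input, which is what allows F_0 to be added to F_DF later through the long skip connection.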

Image reconstruction

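The high-quality image I_RHQ is reconstructed by aggregating the shallow and deep features
- $I_{RHQ} = H_{REC}(F_{0} + F_{DF})$
- H_REC is the reconstruction module
Shallow features mainly contain low frequencies, while deep features are used to recover high frequencies
A long skip connection passes the low-frequency information directly to the reconstruction module
The reconstruction module uses a sub-pixel convolution layer to upsample the feature
- Tasks that need no upsampling, such as denoising and JPEG compression artifact reduction, use a single convolutional layer for reconstruction
For these tasks, residual learning reconstructs the residual between the LQ and HQ images instead of the HQ image directly (the final + I_LQ in the formula below)
- $I_{RHQ} = H_{SwinIR}(I_{LQ}) + I_{LQ}$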
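A sub-pixel convolution layer is a convolution that produces C * r^2 channels, followed by a rearrangement that trades those channels for an r-times larger spatial grid. The sketch below shows only the rearrangement step. The function name `pixel_shuffle`, the channel-first nested-list layout and the channel ordering (the one used by PyTorch's `PixelShuffle`) are assumptions for illustration, not details taken from these notes.

```python
def pixel_shuffle(features, r):
    """Rearrange [C * r * r][H][W] into [C][H * r][W * r]."""
    c_in, h, w = len(features), len(features[0]), len(features[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = features[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 2x2 channels, channel k filled with the value k, upscaled by r = 2.
low_res = [[[float(k)] * 2 for _ in range(2)] for k in range(4)]
high_res = pixel_shuffle(low_res, 2)  # one channel of size 4x4
```

Each input channel ends up at a fixed offset inside every 2×2 output cell, so the preceding convolution effectively learns one filter per sub-pixel position.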

Loss function

$\mathcal{L} = \| I_{RHQ} - I_{HQ} \|_{1}$
- The closer the reconstruction is to the ground-truth image, the smaller the loss
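A minimal sketch of this pixel loss, assuming the norm is the L1 norm (sum of absolute differences). The helper names `flatten` and `l1_loss` and the `mean` switch are hypothetical; training code commonly averages over pixels rather than summing, which only rescales the loss.

```python
def flatten(x):
    """Yield the scalar values of an arbitrarily nested list."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from flatten(item)
    else:
        yield x


def l1_loss(restored, target, mean=True):
    """L1 distance between two images given as nested lists of equal shape."""
    a, b = list(flatten(restored)), list(flatten(target))
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and have the same size")
    total = sum(abs(p - q) for p, q in zip(a, b))
    return total / len(a) if mean else total


restored_img = [[0.5, 1.0], [0.0, 0.25]]
target_img = [[0.0, 1.0], [0.5, 0.25]]
loss_mean = l1_loss(restored_img, target_img)             # averaged over pixels
loss_sum = l1_loss(restored_img, target_img, mean=False)  # plain L1 norm
```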

Residual Swin Transformer Block

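See panel (a) of the architecture figure
Each RSTB stacks several Swin Transformer layers followed by a convolutional layer, and adds the block's input to its output (residual connection)
- The convolutional layer enhances translational equivariance (when an object in the image is shifted, the result shifts accordingly)
- The residual connection aggregates different levels of features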
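The structure of one RSTB, and of the deep feature extraction module built from K of them, can be sketched with plain function composition. The names `rstb` and `deep_feature_extraction` are hypothetical, features are simplified to flat lists of numbers, and the toy `double` and `plus_one` functions merely stand in for real Swin Transformer layers and convolutions.

```python
def rstb(x, stl_layers, conv):
    """Residual Swin Transformer block: STLs, then a conv, then add the input."""
    out = x
    for layer in stl_layers:
        out = layer(out)
    out = conv(out)
    return [a + b for a, b in zip(out, x)]  # residual connection


def deep_feature_extraction(f0, blocks, final_conv):
    """H_DF: K residual blocks followed by one convolutional layer."""
    out = f0
    for block in blocks:
        out = block(out)
    return final_conv(out)


def double(v):      # stand-in for a Swin Transformer layer
    return [2.0 * e for e in v]


def plus_one(v):    # stand-in for a convolutional layer
    return [e + 1.0 for e in v]


block_out = rstb([1.0, 2.0], [double, double], plus_one)


def toy_block(v):
    return rstb(v, [double], plus_one)


f_df = deep_feature_extraction([1.0], [toy_block, toy_block], plus_one)
```

Because every block adds its input back, the features of earlier blocks reach the reconstruction module through an identity path.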

Swin Transformer layer

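The Swin Transformer layer is based on the standard multi-head self-attention of the original Transformer layer, but adds local attention and the shifted window mechanism
See panel (b) of the architecture figure
Swin Transformer first partitions the input into non-overlapping M × M windows, reshaping it from H × W × C to HW/M^2 × M^2 × C
- HW/M^2 is the total number of windows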
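The window partition described above can be sketched in a few lines. The function name `window_partition` and the nested-list layout [H][W][C] are assumptions for illustration, and the sketch requires H and W to be multiples of M.

```python
def window_partition(feature, m):
    """Split an [H][W][C] feature map into [H*W/m^2][m*m][C] windows."""
    h, w = len(feature), len(feature[0])
    if h % m or w % m:
        raise ValueError("H and W must be multiples of the window size")
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            # Flatten the m x m window into m*m tokens, row by row.
            tokens = [feature[top + dy][left + dx]
                      for dy in range(m) for dx in range(m)]
            windows.append(tokens)
    return windows


# H=4, W=6, C=1; each pixel stores its own index y * 6 + x.
feature_map = [[[y * 6 + x] for x in range(6)] for y in range(4)]
windows = window_partition(feature_map, 2)  # 4*6 / 2^2 = 6 windows of 4 tokens
```

Attention is then computed inside each window only, so the cost grows linearly with the number of windows rather than quadratically with H × W.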
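Standard self-attention is then computed separately for each window. For a window feature
$X \in \mathbb{R}^{M^{2} \times C}$, the query, key and value matrices Q, K and V are
- $Q = X P_{Q}$
- $K = X P_{K}$
- $V = X P_{V}$
- P_Q, P_K and P_V are projection matrices shared across windows
- The rest is standard self-attention, so it probably needs little further explanation
Each layer applies multi-head self-attention (MSA) followed by a multi-layer perceptron (MLP), with a residual connection around each
- LayerNorm is applied before both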
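For reference, the per-window attention can be sketched as follows. This is a single-head, scaled dot-product version in pure Python; the function names are hypothetical, the projection matrices are toy identity matrices instead of learned weights, and any positional bias term as well as the multi-head split are left out for brevity.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(x, p_q, p_k, p_v):
    """Self-attention over the tokens of one window; x is shaped [M*M][C]."""
    q, k, v = matmul(x, p_q), matmul(x, p_k), matmul(x, p_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]        # transpose of K
    scores = matmul(q, k_t)                     # [M*M][M*M] similarities
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


eye = [[1.0, 0.0], [0.0, 1.0]]                  # toy projection matrices, C = 2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
attended, attn = window_attention(tokens, eye, eye, eye)

# With identical tokens the weights are uniform and the output equals V.
uniform_out, _ = window_attention([[1.0, 2.0]] * 3, eye, eye, eye)
```

Since P_Q, P_K and P_V are shared, the same `window_attention` call is applied to every window produced by the partition step.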

Experiments

Experimental Setup

Experiments
- Classical image SR
- Real-world image SR
- Image denoising
- JPEG compression artifact reduction
RSTB number: 6
STL number: 6
window size: 8
- JPEG compression artifact reduction uses 7, because performance drops noticeably with 8, presumably because JPEG also uses 8×8 partitions
channel number: 180
attention head number: 6
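The settings above can be collected in a small configuration object. `SwinIRConfig` and its field names are hypothetical and not taken from the official code. The derived values assume that channels are split evenly across attention heads, as is usual for multi-head attention.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SwinIRConfig:
    """Hyperparameters as listed in these notes (names are illustrative)."""
    rstb_number: int = 6
    stl_number: int = 6
    window_size: int = 8
    channel_number: int = 180
    head_number: int = 6

    @property
    def head_dim(self):
        # Channels per attention head, assuming an even split.
        if self.channel_number % self.head_number:
            raise ValueError("channel_number must be divisible by head_number")
        return self.channel_number // self.head_number

    @property
    def total_stl(self):
        # Swin Transformer layers across all residual blocks.
        return self.rstb_number * self.stl_number

    @property
    def tokens_per_window(self):
        return self.window_size ** 2


default_cfg = SwinIRConfig()
jpeg_cfg = replace(default_cfg, window_size=7)  # JPEG artifact reduction setting
```

With the default settings each head works on 30 channels, the network has 36 Swin Transformer layers in total, and each window holds 64 tokens (49 for the JPEG setting).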

Ablation Study and Discussion

Dataset
- Train: DIV2K
- Test: Manga109

Impact of channel number, RSTB number and STL number

- channel number
- RSTB number
- STL number
Channel number 180, RSTB number 6 and STL number 6 are chosen with model size in mind

Impact of patch size and training image number; model convergence comparison

Compared against the CNN-based RCAN
- Training patch size
- Percentage of used images
  - Percentages above 100% mean extra training data from Flickr2K
- Training iterations

Impact of residual connection and convolution layer in RSTB

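Importance of the convolutional layer at the end of the RSTB
Using three 3×3 convolutional layers reduces the number of parameters, but performance drops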

Results on Image SR

Classical image SR

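- SwinIR+ denotes the self-ensemble strategy: the results for the original input and its horizontally, vertically, and horizontally-plus-vertically flipped versions are averaged
- Better results with fewer parameters
- But the runtime is in the middle
  - RCAN: 0.2s
  - IPT: 4.5s
    - A very large network; its results are good but still behind SwinIR
  - SwinIR: 1.1s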
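The self-ensemble idea can be sketched as follows: run the model on each flipped copy of the input, undo the flip on the output, and average. The function names and the toy `toy_model` are hypothetical, images are simplified to single-channel [H][W] lists, and only the flips listed above are covered.

```python
def hflip(img):
    """Mirror an [H][W] image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror an [H][W] image top to bottom."""
    return img[::-1]


def self_ensemble(model, image):
    """Average model outputs over the identity and three flip variants."""
    transforms = [
        (lambda x: x, lambda x: x),
        (hflip, hflip),                      # each flip is its own inverse
        (vflip, vflip),
        (lambda x: vflip(hflip(x)), lambda x: hflip(vflip(x))),
    ]
    outputs = [undo(model(apply(image))) for apply, undo in transforms]
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[y][x] for o in outputs) / len(outputs) for x in range(w)]
            for y in range(h)]


def toy_model(img):
    """Stand-in 'model' that is deliberately not symmetric under flips."""
    return [[v + x for x, v in enumerate(row)] for row in img]


blank = [[0.0, 0.0], [0.0, 0.0]]
ensembled = self_ensemble(toy_model, blank)
```

The averaging cancels the direction-dependent part of the toy model's output, which illustrates why the ensemble tends to be a little more accurate at the price of several forward passes.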
- Visually, the restored images are sharp and natural

Lightweight image SR

Compared with small models (a reduced-size SwinIR is used as well)
- Results are still very good with a moderate number of parameters

Real-world image SR

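- The training data is somewhat limited, yet the results still look more natural than those of other models
- A better dataset could improve the results further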

Results on JPEG Compression Artifact Reduction

- Comparable results to DRUNet, with only about one third of the parameters

Results on Image Denoising

- As above: comparable results to DRUNet, with only about one third of the parameters
- Results are sharper, without blurriness

Conclusion

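The paper proposes SwinIR, an image restoration model based on the Swin Transformer
- shallow feature extraction
- deep feature extraction
- HR reconstruction
RSTBs perform the deep feature extraction; each RSTB consists of Swin Transformer layers, a convolutional layer and a residual connection
Extensive experiments show strong performance on image restoration tasks
- Future work: extend the model to other tasks such as deblurring and deraining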


tags: paper
