###### tags: `2022Q1技術研討`, `activation`

# Activate or Not: Learning Customized Activation

## Background

:question: What are the pros and cons of ReLU?

* Pros
    * Cheap to compute, and networks that use it tend to converge quickly.
    * Its derivative is 1 for positive inputs ($f'(x) = 1$ for $x > 0$), so ReLU does not saturate on the positive side and gradients there do not vanish.
* Cons
    * It is not a zero-centered function.
    * Saturation and vanishing gradients still occur for negative inputs, which ReLU maps to 0; a neuron that only receives negative inputs stops updating (the "dead neuron" problem). Leaky ReLU addresses this.

![](https://i.imgur.com/m95ajRU.png)

:question: Why use a nonlinear activation function?

![](https://i.imgur.com/AxJ3hGV.png)

:star: Swish activation function

* Swish

    ![](https://i.imgur.com/u7LOjvV.png)
* Function: $\text{Swish}(x) = x\cdot\sigma(\beta x)$, where $\sigma$ is the sigmoid function and $\beta = 1$
* In very deep networks, Swish achieves higher test accuracy than ReLU.
* Non-monotonic
* How was Swish found?

    Swish was not really a carefully hand-designed activation function; it is more accurate to say that it was found by brute-force search. The original paper defines a search space of activation functions, runs experiments over it, and then picks the best-performing activation function from the results.

    The search space originally used for Swish:

    ![](https://i.imgur.com/9oyuQ34.png)

## ActivateOrNot (ACON)

### Contributions of the paper

The paper was published at CVPR 2021. At first glance it proposes a remarkably strong activation function. However, the variant the paper mainly uses is channel-wise Meta-ACON, which fundamentally borrows the idea of SE (Squeeze-and-Excitation), so it is more than an activation function in the usual sense.

:pushpin: Improved model accuracy

ImageNet top-1 accuracy relative improvements compared with the ReLU baselines:

![](https://i.imgur.com/xCQurNY.png)

* For large models, the improvement from Meta-ACON remains relatively stable.
* The improvement is roughly twice that of SENet.

:pushpin: Explains why the Swish activation function works well

:pushpin: Applies to multiple tasks, with good results in classification, object detection, and semantic segmentation

### ReLU vs. Swish

* Common activation functions can be approximated with the smooth maximum

    ![](https://i.imgur.com/5b8X17F.png)
    * As $\beta \to \infty$, $S_\beta \to \max(x_1, x_2, \ldots, x_n)$
    * As $\beta \to 0$, $S_\beta \to \frac{x_1 + x_2 + \cdots + x_n}{n}$
* The smooth maximum gives a smooth approximation of an activation function

    ![](https://i.imgur.com/gC6aOfB.png)
    * Put simply, for two inputs: $S_\beta(x_a, x_b) = (x_a - x_b)\cdot\sigma[\beta(x_a - x_b)] + x_b$, where $\sigma$ is the sigmoid function
* Swish is simply the smooth approximation of ReLU!
    * With $x_a = x$ and $x_b = 0$, $S_\beta = x\cdot\sigma(\beta x)$

    ![](https://i.imgur.com/TcJjkhT.png)

### Maxout vs. ACON family

* Starting from the Maxout family, the paper derives smooth-approximation activation functions that perform better

    ![](https://i.imgur.com/UKxLEzk.png)

### Meta-ACON

The paper then goes one step further and relaxes the form of ACON (ACON-C) so that the value of $\beta$ is determined by the input. In other words, $p$ in ACON is a learnable parameter, while $\beta$ is computed directly from the input, which lets ACON respond very differently to different inputs. The paper calls this input-dependent ACON Meta-ACON.

:pushpin: How $p$ relates to the activation function

![](https://i.imgur.com/JVk227c.png)

* The two parameters $p_1$ and $p_2$ are learned in order to find the best-performing activation function.

:pushpin: How $\beta$ relates to the activation function

![](https://i.imgur.com/fRgKd6H.png)

* The parameter $\beta$ controls whether the activation function behaves linearly or nonlinearly.

If $\beta$ were a constant, the degree of linearity or nonlinearity of ACON-C would be fixed. The authors therefore propose a module $G(x)$ that learns $\beta$ dynamically from the input feature map $x_{c,h,w}$, so that the linearity or nonlinearity of the function is controlled dynamically.

![](https://i.imgur.com/w33fBiR.png)

* Design choices for $G(x)$

    ![](https://i.imgur.com/xnI48AK.png)

    The authors ultimately chose the channel-wise design: the values summed over the spatial dimensions $(H, W)$ are transformed by two sets of parameters to produce $\beta$. Here $W_2$ projects the $C$-dimensional information down to $C/r$ dimensions, and $W_1$ maps it back to $C$ dimensions.

:pushpin: A brief introduction to SENet

![](https://i.imgur.com/LRe5QIj.jpg)

![](https://i.imgur.com/pMJdntW.jpg)

## Experimental results

* ACON family vs. ReLU

    ![](https://i.imgur.com/ubAbQDL.png)

    ![](https://i.imgur.com/ydIesa8.png)
* Comparison of various activation functions

    ![](https://i.imgur.com/KtsFXFx.png)
* Generalization ability of ACON

    ![](https://i.imgur.com/CYOKNXp.png)

## Implementation

~~~python
import torch
import torch.nn as nn


# width = number of input channels
class AconC(nn.Module):
    r"""ACON activation (activate or not).

    AconC: (p1*x - p2*x) * sigmoid(beta * (p1*x - p2*x)) + p2*x, where beta is a learnable parameter,
    according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width):
        super().__init__()
        # One learnable p1, p2 and beta per channel, broadcast over batch and spatial dims
        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, width, 1, 1))

    def forward(self, x):
        diff = self.p1 * x - self.p2 * x
        return diff * torch.sigmoid(self.beta * diff) + self.p2 * x


class MetaAconC(nn.Module):
    r"""ACON activation (activate or not).

    MetaAconC: (p1*x - p2*x) * sigmoid(beta * (p1*x - p2*x)) + p2*x, where beta is generated by a small network,
    according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width, r=16):
        super().__init__()
        hidden = max(r, width // r)  # bottleneck width: C -> hidden -> C
        self.fc1 = nn.Conv2d(width, hidden, kernel_size=1, stride=1, bias=True)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.fc2 = nn.Conv2d(hidden, width, kernel_size=1, stride=1, bias=True)
        self.bn2 = nn.BatchNorm2d(width)

        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))

    def forward(self, x):
        # x: [batch, C, H, W]; average over H and W to get one value per channel.
        # Note: in training mode, BatchNorm on this [batch, C, 1, 1] tensor needs batch > 1.
        pooled = x.mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        beta = torch.sigmoid(self.bn2(self.fc2(self.bn1(self.fc1(pooled)))))
        diff = self.p1 * x - self.p2 * x
        return diff * torch.sigmoid(beta * diff) + self.p2 * x
~~~

## References

* https://medium.com/%E8%BB%9F%E9%AB%94%E4%B9%8B%E5%BF%83/acon%E8%88%87tfnet-%E5%88%86%E6%9E%90relu%E8%88%87%E8%BF%91%E6%9C%9Fswish-senet%E7%99%BC%E5%B1%95%E7%9A%84%E9%97%9C%E9%80%A3%E6%80%A7-c05c77ff68a7
* https://blog.csdn.net/qq_38253797/article/details/118964626
* https://www.itread01.com/content/1548649814.html
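
## Appendix: checking the smooth-maximum limits numerically

The limits quoted in the ReLU vs. Swish section can be checked with a few lines of plain Python. This is only a minimal sketch, not code from the paper: the helper names `sigmoid`, `smooth_max`, `swish`, and `acon_c` are made up for illustration, and `smooth_max` assumes the usual definition $S_\beta = \sum_i x_i e^{\beta x_i} / \sum_i e^{\beta x_i}$.

~~~python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid for a scalar."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_max(xs, beta):
    """Smooth maximum (assumed definition): sum(x_i * e^(beta*x_i)) / sum(e^(beta*x_i))."""
    shift = max(beta * x for x in xs)  # subtract the largest exponent to avoid overflow
    weights = [math.exp(beta * x - shift) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def acon_c(x, p1, p2, beta):
    """Scalar ACON-C: (p1*x - p2*x) * sigmoid(beta * (p1*x - p2*x)) + p2*x."""
    diff = p1 * x - p2 * x
    return diff * sigmoid(beta * diff) + p2 * x


# beta = 0 gives the mean, a large beta approaches the max
for beta in (0.0, 1.0, 50.0):
    print(beta, smooth_max([1.0, 2.0, 3.0], beta), acon_c(-2.0, 1.0, 0.0, beta))
~~~

With `p1 = 1` and `p2 = 0`, `acon_c` reduces to `swish`. Setting `beta = 0` leaves the linear function `(p1 + p2) / 2 * x`, while a large `beta` approaches `max(p1 * x, p2 * x)`, which is the linear-versus-nonlinear switch that Meta-ACON learns per channel.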