###### tags: `2022Q1技術研討`, `activation`

# Activate or Not: Learning Customized Activation

## Background

:question: What are the pros and cons of ReLU?

* Pros
    * Easy to compute, so the network converges quickly.
    * Its derivative is 1 for positive inputs ($f'(x)=1$ for $x \ge 0$), so ReLU does not saturate on the positive side and gradients pass through unchanged.
* Cons
    * Not a zero-centered function.
    * Saturation and vanishing gradients occur on the negative side: negative inputs are mapped to 0, so their gradient is also 0 and neurons can get stuck outputting 0 (the dying-ReLU problem). Leaky ReLU mitigates this.

![](https://i.imgur.com/m95ajRU.png)

:question: Why do we need a non-linear activation function?

![](https://i.imgur.com/AxJ3hGV.png)

:star: Swish activation function

* Swish ![](https://i.imgur.com/u7LOjvV.png)
* Function: $Swish = x\cdot\sigma(\beta x)$, where $\sigma$ is the sigmoid function and $\beta = 1$.
* In very deep networks, Swish achieves higher test accuracy than ReLU.
* Non-monotonic.
* How was Swish found?
    Swish is not really a carefully hand-designed activation function; more accurately, it is a <font color=#F6D55C>function found by brute-force search</font>. The original paper defines a search space of activation functions, runs experiments over it, and then picks the strongest activation function from the experimental results.
    Swish's original search space: ![](https://i.imgur.com/9oyuQ34.png)

## ActivateOrNot (ACON)

### Contributions

Published at CVPR 2021, the paper at first glance proposes an extremely effective activation function. What it mainly uses, however, is the channel-wise Meta-ACON, which fundamentally borrows the SE (Squeeze-and-Excitation) idea, so it is not just an activation function in the usual intuitive sense.

:pushpin: Accuracy improvements

ImageNet top-1 accuracy <font color=#F6D55C>relative improvements compared with the ReLU baselines</font>:

![](https://i.imgur.com/xCQurNY.png)

* For large models, the Meta-ACON improvement remains relatively stable.
* The gain is roughly 2x that of SENet.

:pushpin: Explains why the Swish activation function performs well.

:pushpin: Works across many tasks: classification, object detection, and semantic segmentation all show good results.

### ReLU vs. Swish

* Approximate common activation functions with the smooth maximum:
![](https://i.imgur.com/5b8X17F.png)
    * When $\beta = \infty$, $S_\beta = \max(x_1, x_2, \dots, x_n)$
    * When $\beta = 0$, $S_\beta = \frac{x_1 + x_2 + \dots + x_n}{n}$
* Use the smooth maximum to derive smooth approximations of activation functions:
![](https://i.imgur.com/gC6aOfB.png)
    * In short, for two inputs it is $(x_a - x_b)\cdot\sigma[\beta(x_a - x_b)] + x_b$, where $\sigma$ is the sigmoid function.
    * <font color=#F6D55C>Swish is simply the smooth approximation of ReLU!</font>
        * With $x_a = x, x_b = 0$: $S_\beta = x\cdot\sigma(\beta x)$
![](https://i.imgur.com/TcJjkhT.png)

### Maxout vs. ACON family

* Use the maxout family to find smooth approximations (the ACON family) that work even better:
![](https://i.imgur.com/UKxLEzk.png)

### Meta-ACON

The paper then relaxes the ACON (ACON-C) formulation further and lets the input decide the value of β. In other words, <font color=#F6D55C>p in ACON is a learnable parameter, and β is directly influenced by the input</font>, so ACON can differentiate much more strongly between different inputs. This input-dependent ACON is what the paper calls Meta-ACON.

:pushpin: Relationship between p and the activation function

![](https://i.imgur.com/JVk227c.png)

* The two parameters p1 and p2 are learned so as to find the best-performing activation function.

:pushpin: Relationship between $\beta$ and the activation function

![](https://i.imgur.com/fRgKd6H.png)

* The parameter <font color=#F6D55C>$\beta$ controls whether the activation function behaves linearly or non-linearly</font>.

If <font color=#F6D55C>$\beta$ were a constant</font>, the <font color=#F6D55C>linear/non-linear capacity of the ACON-C activation would be fixed</font>. The authors therefore propose a <font color=#F6D55C>module $G(x)$ that dynamically learns $\beta$ from the input feature map $x_{c,h,w}$</font>, so as to <font color=#F6D55C>dynamically control how linear or non-linear the function is</font>.

![](https://i.imgur.com/w33fBiR.png)

* Choice of $G(x)$ architecture
![](https://i.imgur.com/xnI48AK.png)

The authors finally chose the channel-wise design: the values summed over (H, W) are transformed by two sets of parameters to obtain β. W2 projects the channel dimension C down to C/r, and W1 then maps it back to C.

:pushpin: A brief look at SENet

![](https://i.imgur.com/LRe5QIj.jpg)
![](https://i.imgur.com/pMJdntW.jpg)
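Since the SENet overview above is given only as figures, here is a minimal sketch of an SE block for comparison. It assumes a reduction ratio `r`; the class name `SEBlock` and its exact details are illustrative, not taken from the paper or the slides.

~~~python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block sketch (illustrative, assumed names)."""

    def __init__(self, channels, r=16):
        super().__init__()
        # 1x1 convs act as per-channel fully connected layers: C -> C/r -> C
        self.fc1 = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // r, channels, kernel_size=1)

    def forward(self, x):
        # Squeeze: global average pool over (H, W) -> [N, C, 1, 1]
        s = x.mean(dim=(2, 3), keepdim=True)
        # Excitation: bottleneck MLP + sigmoid gives per-channel weights in (0, 1)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        # Re-weight each channel of the input feature map
        return x * w
~~~

Meta-ACON's $G(x)$ reuses the same squeeze-then-bottleneck pattern, but instead of producing a channel gate that rescales the features, it produces the per-channel $\beta$ that switches the activation between (near-)linear and non-linear behaviour.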
## Experimental results

* ACON family vs. ReLU
![](https://i.imgur.com/ubAbQDL.png)
![](https://i.imgur.com/ydIesa8.png)
* Comparison of various activation functions
![](https://i.imgur.com/KtsFXFx.png)
* Generalization ability of ACON
![](https://i.imgur.com/CYOKNXp.png)

## Implementation

~~~python
import torch
from torch import nn


class AconC(nn.Module):
    r"""ACON-C activation (activate or not).

    AconC: (p1*x - p2*x) * sigmoid(beta * (p1*x - p2*x)) + p2*x,
    where p1, p2 and beta are learnable per-channel parameters.
    From "Activate or Not: Learning Customized Activation"
    <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width):
        # width = number of input channels
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, width, 1, 1))

    def forward(self, x):
        dpx = self.p1 * x - self.p2 * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x


class MetaAconC(nn.Module):
    r"""Meta-ACON activation (activate or not).

    MetaAconC: (p1*x - p2*x) * sigmoid(beta * (p1*x - p2*x)) + p2*x,
    where beta is generated per channel by a small network G(x).
    From "Activate or Not: Learning Customized Activation"
    <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width, r=16):
        # width = number of input channels, r = bottleneck reduction ratio
        super().__init__()
        self.fc1 = nn.Conv2d(width, max(r, width // r), kernel_size=1, stride=1, bias=True)
        self.bn1 = nn.BatchNorm2d(max(r, width // r))
        self.fc2 = nn.Conv2d(max(r, width // r), width, kernel_size=1, stride=1, bias=True)
        self.bn2 = nn.BatchNorm2d(width)
        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))

    def forward(self, x):
        # x: [N, C, H, W] -> average over H and W, then a C -> C/r -> C
        # bottleneck followed by sigmoid gives a per-channel beta in (0, 1)
        y = x.mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        beta = torch.sigmoid(self.bn2(self.fc2(self.bn1(self.fc1(y)))))
        dpx = self.p1 * x - self.p2 * x
        return dpx * torch.sigmoid(beta * dpx) + self.p2 * x
~~~

## References

* https://medium.com/%E8%BB%9F%E9%AB%94%E4%B9%8B%E5%BF%83/acon%E8%88%87tfnet-%E5%88%86%E6%9E%90relu%E8%88%87%E8%BF%91%E6%9C%9Fswish-senet%E7%99%BC%E5%B1%95%E7%9A%84%E9%97%9C%E9%80%A3%E6%80%A7-c05c77ff68a7
* https://blog.csdn.net/qq_38253797/article/details/118964626
* https://www.itread01.com/content/1548649814.html
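## Quick sanity check

A small usage sketch of the classes from the implementation section (not part of the original note): with `p1 = 1`, `p2 = 0`, `beta = 1`, ACON-C collapses to Swish $x\cdot\sigma(x)$, matching the derivation in the ReLU vs. Swish section. The tensor shapes and values below are arbitrary.

~~~python
import torch

# Assumes AconC and MetaAconC from the implementation section above are in scope.
x = torch.randn(2, 8, 4, 4)  # arbitrary [N, C, H, W] input

# With p1 = 1, p2 = 0, beta = 1, ACON-C reduces to Swish = x * sigmoid(x)
# (the ACON-A / smoothed-ReLU case).
act = AconC(width=8)
with torch.no_grad():
    act.p1.fill_(1.0)
    act.p2.fill_(0.0)
    act.beta.fill_(1.0)
print(torch.allclose(act(x), x * torch.sigmoid(x)))  # expected: True

# Meta-ACON: beta is produced per channel by the small G(x) network.
meta = MetaAconC(width=8)
print(meta(x).shape)  # torch.Size([2, 8, 4, 4])
~~~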