Activation functions
===

###### tags: `new ML`

url: https://rb.gy/yk1fjk

# What is an activation function

![](https://i.imgur.com/RRBko50.png)

Given an input, the brain processes it and decides whether a neuron should fire.

![](https://i.imgur.com/GoypvLN.png)
![](https://i.imgur.com/QcVc5Qe.png)

In deep learning, an activation function does something similar:
(brain) fire or not --> (ML) activate or not.

![](https://i.imgur.com/8ayqawS.png)

In the figure above, the output layer also has an activation function. Note that the hidden layers and the output layer usually use different activation functions (it depends on the task).

# Why do we need an activation function

### The goal is to add non-linearity

![](https://i.imgur.com/17ul285.png)

You might ask: what does that even mean? OK, suppose there is **no activation function**. Then every hidden layer is just a linear operation, for example ax + b. **No matter how many hidden layers you stack, the result is still linear**, and such a model cannot solve harder problems.

### A simple example:

This is linear:
![](https://i.imgur.com/SxMH3Tv.png)

This is what we want (complex):
![](https://i.imgur.com/xPgRR36.png)

A linear model can never fit it:
![](https://i.imgur.com/b8kyuGF.png)

An activation function bends the line into the shape we want (it changes the shape):
![](https://i.imgur.com/lDmO4tz.png)
![](https://i.imgur.com/JkH3hRK.png)

Of course, real problems are rarely this simple (just x and y in 2D); the data may be 3D, 4D, 5D, ... (x1, x2, x3, x4, ..., y).

# Network activation functions

## Binary Step Function

![](https://i.imgur.com/Iyl7oMH.png)

**There is a threshold: if the input is above it, the neuron fires** and its output is passed to the next hidden layer; otherwise nothing is passed on.

![](https://i.imgur.com/oWE2VCT.png)

#### Limitations
1. **It produces only a binary output, so it cannot handle one-to-many cases like multi-class classification.**
2. **Its gradient is zero, which blocks backpropagation.**

##### Gradient
The gradient determines the direction of the update (if this is unfamiliar, read up on gradient descent).

##### Backpropagation
Backpropagation is the correction step: the error is propagated backwards so that each parameter can be adjusted towards a better solution.

##### Example
After one backpropagation pass, how much the model is corrected, and in which direction, depends on the gradient. A gradient of 0 means a local minimum has been found, which may not be the absolute (global) minimum, so backpropagation may fail to reach the best solution.

## Linear Activation Function

![](https://i.imgur.com/HsFzgmg.png)

It is simply a straight line, e.g. f(x) = x.

![](https://i.imgur.com/8bTqK9y.png)

#### Drawbacks:
1. Backpropagation cannot really work, because the derivative of this function is a constant and has no relationship with the input x.
2. All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer (the same argument as having no activation function, explained above).

## Non-Linear Activation Functions

### Sigmoid / Logistic Activation Function

![](https://i.imgur.com/bNZvl5p.png)

It maps any input number to an output in the range [0, 1].

![](https://i.imgur.com/InUe3ld.png)

#### Sigmoid is a commonly used activation function
1. It can be used for predicted outputs, e.g. the probability that the answer is A, B, or C.
2. The function is differentiable and its derivative is smooth, which prevents overly large steps, as shown below:

![](https://i.imgur.com/j9Jrwqh.png)

This is the plot of its derivative.

#### Drawbacks:
1. From the derivative plot above, the gradient is only significant roughly in the region between -3 and 3.
```
So for inputs larger than 3 or smaller than -3 the gradient is very small.
When the gradient approaches zero, that node in the hidden layer stops updating
(the model behaves as if it has found the optimum).
This is called the vanishing gradient problem; the ReLU function can mitigate it.
```
2. The sigmoid output is not symmetric around zero.
```
So the output of all the neurons will be of the same sign. This makes the
training of the neural network more difficult and unstable.
```

### Tanh Function (Hyperbolic Tangent)

Similar to the sigmoid function, but the output range becomes [-1, 1]. The larger the input, the closer the output gets to 1.0; the more negative the input, the closer the output gets to -1.0.

![](https://i.imgur.com/P81rGlt.png)

The formula (you need it to differentiate and compute the gradient):

![](https://i.imgur.com/K0qeEq8.png)
![](https://i.imgur.com/2ahfzrt.png)

The plot above shows its derivative.

#### Drawbacks
Like the sigmoid function, tanh suffers from the vanishing gradient problem, although its gradients are steeper than sigmoid's. Even so, because tanh is symmetric around zero, the gradient direction can change, which helps optimization. shorturl.at/LOV26
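To make the saturation point above concrete, here is a minimal NumPy sketch (mine, not part of the original note) that evaluates the derivatives of sigmoid and tanh. Outside roughly [-3, 3] the values are close to zero, which is exactly the vanishing-gradient behaviour described above.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # derivative: sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # derivative of tanh: 1 - tanh(x)^2; peaks at 1.0 when x = 0
    return 1.0 - np.tanh(x) ** 2

xs = np.array([-6.0, -3.0, 0.0, 3.0, 6.0])
print("sigmoid'(x):", np.round(d_sigmoid(xs), 4))  # tiny outside [-3, 3] -> vanishing gradient
print("tanh'(x):   ", np.round(d_tanh(xs), 4))     # larger near 0, but also vanishes for |x| > 3
```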
#### Why being zero-centered is so important:

![](https://i.imgur.com/pMjsL4N.png)
![](https://i.imgur.com/YQzjbkI.png)

So what does this mean? Suppose we have two weights, w1 and w2. If the gradients are always positive or always negative (i.e. the updates for w1, w2 are both > 0 or both < 0), the weights can only move towards the upper-right or the lower-left.

![](https://i.imgur.com/xUvdmNp.png)

To reach the best w1, w2 (the goal), the optimizer has to zig-zag as in the figure. That is why activations like ReLU or sigmoid can make gradient-based optimization harder.

##### Remedy
Normalize: normalize the data in advance so it is zero-centered, as in batch/layer normalization. Simply put, the node is changed into this:

![](https://i.imgur.com/lDN0Xzc.png)

so that our backpropagation

![](https://i.imgur.com/zXv5yfj.png)

no longer depends only on x, and the problem goes away.

URL to the question below: https://ai.stackexchange.com/questions/26958/why-is-it-a-problem-if-the-outputs-of-an-activation-function-are-not-zero-center

You can read it directly:

If the activation function of the network is not zero centered, y = f(x^T w) is always positive or always negative. Thus, the output of a layer is always being moved to either the positive values or the negative values. As a result, the weight vector needs more updates to be trained properly, and the number of epochs needed for the network to get trained also increases. This is why the zero centered property is important, though it is NOT necessary.

Zero-centered activation functions ensure that the mean activation value is around zero. This property is important in deep learning because it has been empirically shown that models operating on normalized data (whether inputs or latent activations) enjoy faster convergence. Unfortunately, zero-centered activation functions like tanh saturate at their asymptotes: the gradients in that region get vanishingly small over time, leading to a weak training signal. ReLU avoids this problem but it is not zero-centered. Therefore all-positive or all-negative activation functions, whether sigmoid or ReLU, can be difficult for gradient-based optimization. To solve this problem, deep learning practitioners have invented a myriad of normalization layers (batch norm, layer norm, weight norm, etc.); we can normalize the data in advance to be zero-centered, as in batch/layer normalization.

### ReLU Function

![](https://i.imgur.com/wJVPbm0.png)
![](https://i.imgur.com/AG7IsZh.png)

#### Advantages
1. Only some of the neurons are activated, which makes computation more efficient.
2. ReLU accelerates the convergence of gradient descent.

#### Drawbacks:
1. The dying ReLU problem:

![](https://i.imgur.com/ZFVuGIk.png)

**For negative inputs the gradient is 0**, so during backpropagation the weights and biases of some neurons are never updated. This **produces many dead neurons that never respond**.

### Leaky ReLU Function

An improved version of the ReLU function, designed to solve the dying ReLU problem: it has a small positive slope on the negative side.

![](https://i.imgur.com/usu4QuG.png)
![](https://i.imgur.com/UXFFJl6.png)

The figure below is the derivative of the Leaky ReLU function:

![](https://i.imgur.com/8kEdyzy.png)

#### Drawbacks
1. The predictions may not be consistent for negative input values.
2. The gradient for negative inputs is small, which makes training take longer.

### Parametric ReLU Function

Another ReLU variant that aims to solve the problem of the gradient being zero on the negative half of the axis; the slope of the negative part is a learnable parameter.

![](https://i.imgur.com/dNZUzLy.png)
![](https://i.imgur.com/I8EiDY0.png)

It is used when the Leaky ReLU function still leaves too many dead neurons and relevant information is not passed on to the next layer.
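A minimal NumPy sketch (mine, not from the original note) of the three ReLU variants described above; the `alpha` and `a` values are arbitrary choices for illustration. It shows why the plain ReLU gradient dies for negative inputs while the leaky and parametric versions keep a small slope there.

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs are zeroed out
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small fixed slope alpha on the negative side instead of 0
    return np.where(x > 0, x, alpha * x)

def parametric_relu(x, a):
    # same shape as Leaky ReLU, but the negative slope `a` is a learnable parameter
    return np.where(x > 0, x, a * x)

def relu_grad(x):
    # gradient is 1 for x > 0 and 0 for x < 0 -> dying ReLU on the negative side
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
print("ReLU:           ", relu(xs))
print("Leaky ReLU:     ", leaky_relu(xs))
print("Parametric ReLU:", parametric_relu(xs, a=0.2))
print("ReLU gradient:  ", relu_grad(xs))  # zeros for negative inputs
```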
### ELU (Exponential Linear Unit)

A ReLU variant that also modifies the negative half of the axis. Unlike parametric and Leaky ReLU, which use a straight line, ELU uses an exponential curve for the negative values.

![](https://i.imgur.com/JldzpGR.png)
![](https://i.imgur.com/swGqbvz.png)

Here "a" is the slope parameter for negative values. A limitation of this function is that it may perform differently on different problems depending on the value of the slope parameter a.

#### ELU is a strong alternative:
1. ELU becomes smooth slowly until its output equals -α, whereas ReLU changes slope sharply at zero.
2. It avoids the dead ReLU problem by using an exponential curve for negative input values, which helps the network nudge weights and biases in the right direction.

#### Limitations
1. It increases the computation time because of the exponential operation involved.
2. No learning of the 'a' value takes place.
3. Exploding gradient problem.

#### ELU (a = 1) and its derivative

![](https://i.imgur.com/84sYrzU.png)
![](https://i.imgur.com/akSz6SC.png)
![](https://i.imgur.com/QkQp3hP.png)

## Softmax Function

![](https://i.imgur.com/7XY6vmi.png)

The output of the sigmoid function lies between 0 and 1, so it can be thought of as a probability. Softmax generalizes this to several classes:

![](https://i.imgur.com/qO7H27x.png)

Assume that you have three classes, meaning that there are three neurons in the output layer, and suppose that the outputs of those neurons are [1.8, 0.9, 0.68]. Applying the softmax function over these values gives a probabilistic view: [0.58, 0.23, 0.19]. The largest probability is at index 0, so the predicted class is the one corresponding to the first neuron (index 0) out of the three. You can see how the softmax activation function makes multi-class classification problems easy to handle.

### Swish

![](https://i.imgur.com/bpGB4He.png)

This function is bounded below but unbounded above, i.e. y approaches a constant value as x approaches negative infinity, but y approaches infinity as x approaches infinity.

![](https://i.imgur.com/X6RfK8M.png)

#### Advantages
1. Swish is a smooth function, meaning that it does not abruptly change direction like ReLU does near x = 0. Rather, it smoothly bends from 0 towards values < 0 and then upwards again.
2. Small negative values are zeroed out by the ReLU activation function, even though those negative values may still be relevant for capturing patterns underlying the data. Swish keeps small negative values while still zeroing out large negative values for reasons of sparsity, making it a win-win.
3. Being non-monotonic, the swish function enhances the expressiveness of the input data and of the weights to be learned.

### Gaussian Error Linear Unit (GELU)

![](https://i.imgur.com/N9GfYjE.png)

The Gaussian Error Linear Unit (GELU) activation function is used in BERT, RoBERTa, ALBERT, and other top NLP models. It is motivated by combining properties from dropout, zoneout, and ReLU.

ReLU and dropout together yield a neuron's output: ReLU does it deterministically by multiplying the input by zero or one (depending on the input being positive or negative), while dropout stochastically multiplies the input by zero. The RNN regularizer called zoneout stochastically multiplies inputs by one. GELU merges this functionality by multiplying the input by either zero or one, where the choice is stochastic and depends on the input: the neuron input x is multiplied by m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) with X ∼ N(0, 1) is the cumulative distribution function of the standard normal distribution. This distribution is chosen because neuron inputs tend to follow a normal distribution, especially with batch normalization.

![](https://i.imgur.com/3VwMlZP.png)

The GELU nonlinearity performs better than ReLU and ELU activations and shows performance improvements across tasks in computer vision, natural language processing, and speech recognition.
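Based on the definition quoted above, GELU(x) = x · Φ(x). Here is a small sketch (mine, not from the note) that evaluates this exact form using SciPy's `erf`, next to the tanh-based approximation commonly used in implementations.

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    # widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.linspace(-3.0, 3.0, 7)
print(np.round(gelu_exact(xs), 4))
print(np.round(gelu_tanh_approx(xs), 4))  # very close to the exact values
```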
### Scaled Exponential Linear Unit (SELU)

![](https://i.imgur.com/DZtAla4.png)

SELU was defined for self-normalizing networks and takes care of internal normalization, which means each layer preserves the mean and variance from the previous layers. SELU enables this normalization by adjusting the mean and variance. SELU outputs both positive and negative values, which allows it to shift the mean; this is impossible for the ReLU activation function since it cannot output negative values. Gradients can be used to adjust the variance: the activation function needs a region with a gradient larger than one to increase it.

![](https://i.imgur.com/ycRzyhs.png)

#### Key points
SELU has predefined values for alpha (α) and lambda (λ). The main advantage of SELU over ReLU: internal normalization is faster than external normalization, which means the network converges faster. SELU is a relatively new activation function and needs more research on architectures such as CNNs and RNNs, where it has been comparatively less explored.

# Why are deep neural networks hard to train?

## Vanishing gradients
Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output, and the derivative becomes small. For shallow networks with only a few layers that use these activations, this is not a big problem. However, when more layers are used, the gradient can become too small for training to work effectively.

## Exploding gradients
Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the neural network's weights during training. This makes the network unstable, and learning cannot be completed. The values of the weights can also become so large that they overflow and produce NaN values.

# HOW TO CHOOSE

As a rule of thumb, you can begin with the ReLU activation function and then move to other activation functions if ReLU does not provide optimum results.

For hidden layers:
1. The ReLU activation function should only be used in the hidden layers.
2. Sigmoid/logistic and tanh functions should not be used in hidden layers, as they make the model more susceptible to problems during training (due to vanishing gradients).
3. The Swish function is used in neural networks with a depth greater than 40 layers.

For output layers:
1. Regression: linear activation function
2. Binary classification: sigmoid/logistic activation function
3. Multi-class classification: softmax
4. Multi-label classification: sigmoid
5. Convolutional Neural Network (CNN): ReLU activation function
6. Recurrent Neural Network: tanh and/or sigmoid activation function

![](https://i.imgur.com/uvGHoz5.png)
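As a quick check on those output-layer rules, here is a small NumPy sketch (mine, not from the note) applying the three common output activations to the same logits used in the Softmax section above.

```python
import numpy as np

def linear(z):
    # regression: pass the raw output through unchanged
    return z

def sigmoid(z):
    # binary / multi-label classification: independent probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multi-class classification: probabilities that sum to 1
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.8, 0.9, 0.68])   # the three-class example from the Softmax section
print(np.round(softmax(logits), 2))   # -> [0.58 0.23 0.19]
print(np.round(sigmoid(logits), 2))   # multi-label view: each class scored independently
print(linear(logits))                 # regression view: raw values unchanged
```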
# in a nutshell

Activation functions are used to introduce non-linearity into the network. A neural network almost always uses the same activation function in all hidden layers, and this activation function should be differentiable so that the parameters of the network can be learned through backpropagation.

ReLU is the most commonly used activation function for hidden layers. When selecting an activation function, keep in mind the problems it might face: vanishing and exploding gradients.

For the output layer, always consider the expected value range of the predictions. If it can be any numeric value (as in a regression problem), you can use the linear activation function or ReLU. Use the softmax or sigmoid function for classification problems.
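To tie the summary together, here is a toy NumPy sketch (mine, with made-up data and layer sizes) of a network with one ReLU hidden layer and a sigmoid output, showing exactly where the activation function and its derivative enter the forward and backward passes.

```python
import numpy as np

# Toy setup: 8 samples, 4 features, a non-linear binary target (all invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(size=(4, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)
lr = 0.1

for step in range(200):
    # forward pass
    z1 = X @ W1 + b1
    h = np.maximum(0.0, z1)                      # ReLU in the hidden layer (non-linearity)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # sigmoid in the output layer

    # backward pass: the activation derivatives appear here
    dz2 = (p - y) / len(X)                       # gradient of cross-entropy + sigmoid w.r.t. logits
    dh = dz2 @ W2.T
    dz1 = dh * (z1 > 0)                          # ReLU derivative: 1 where z1 > 0, else 0
    W2 -= lr * (h.T @ dz2); b2 -= lr * dz2.sum(axis=0)
    W1 -= lr * (X.T @ dz1); b1 -= lr * dz1.sum(axis=0)

print("training accuracy:", ((p > 0.5) == y).mean())
```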