# Activation Functions
###### tags: `Data Science`, `Notes`
## Commonly Used Activation Functions
### Identity
* Equation
\begin{align}f(\hat{x})=\hat{x}\end{align}
* Derivative
\begin{align}\frac{df(\hat{x})}{d\hat{x}}=1\end{align}
* Properties:
* Mathematically, it __does not__ affect the network at all.
* Practically, in order to unify the design of a network layer (a linear transformation followed by an activation function), most deep learning frameworks implement the identity function as one of the available activation functions.
* Range: $-\infty$ to $\infty$

### Sigmoid
* Equation
\begin{align}\sigma(\hat{x})=\frac{1}{1+\exp(-\hat{x})}\end{align}
* [Derivative](https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)
\begin{align}\frac{d\sigma(\hat{x})}{d\hat{x}}=\sigma(\hat{x})(1-\sigma(\hat{x}))\end{align}
* Properties:
* A network with no hidden layer and a single sigmoid output unit is equivalent to __logistic regression__.
* One very useful property of the sigmoid function is that its range is $0$ to $1$ (bounded); therefore, it is commonly used when we want to model a __probability__ (e.g. the output layer of a binary classification problem).
* In a binary classification problem, the sigmoid function is a special case of the __softmax function__ (softmax over two logits reduces to a sigmoid of their difference).
* In multi-class classification, sigmoid is preferable over softmax when we believe __a sample can belong to multiple classes__, since the sigmoid function treats the probability of each class independently.
* The range of the derivative of the sigmoid function is $0$ to $0.25$ (and only a narrow range of $\hat{x}$ yields a non-negligible $\sigma'(\hat{x})$); therefore, applying the sigmoid function in the hidden layers of a very deep network can lead to the __vanishing gradient__ problem (will be discussed under backpropagation). A quick numeric check follows this list.
* Range: $0$ to $1$
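A minimal NumPy sketch of the sigmoid and its derivative (the helper names are my own, not from any framework); it illustrates that the gradient never exceeds $0.25$ and becomes tiny for large $|\hat{x}|$:
```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x)), bounded above by 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10.0, 10.0, 5)
print(sigmoid(x))       # values in (0, 1)
print(sigmoid_grad(x))  # at most 0.25; nearly zero for large |x|
```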

### Hyperbolic Tangent
* Equation
\begin{align}\tanh(\hat{x})=\frac{\exp(\hat{x})-\exp(-\hat{x})}{\exp(\hat{x})+\exp(-\hat{x})}\end{align}
* Derivative
\begin{align}\frac{d(\tanh(\hat{x}))}{d\hat{x}}=\frac{1}{\cosh^2(\hat{x})}\end{align}
* Description:
* Because it is bounded (range: $-1$ to $1$) and less prone to vanishing gradients than the sigmoid function (its derivative peaks at $1$ rather than $0.25$), the hyperbolic tangent is commonly used as the activation of each state in a __recurrent neural network__.
* Also, because its output is zero-centered, training often converges faster than with the sigmoid function (see the sketch after this list).
* Range: $-1$ to $1$
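A quick NumPy check of the derivative identity $1/\cosh^2(\hat{x}) = 1 - \tanh^2(\hat{x})$ (a small sketch, not framework code):
```python
import numpy as np

x = np.linspace(-4.0, 4.0, 9)

# two equivalent forms of the tanh derivative
grad_cosh = 1.0 / np.cosh(x) ** 2
grad_tanh = 1.0 - np.tanh(x) ** 2

print(np.allclose(grad_cosh, grad_tanh))  # True
print(grad_cosh.max())                    # peak gradient is 1 at x = 0 (vs. 0.25 for sigmoid)
print(np.tanh(x))                         # zero-centered outputs in (-1, 1)
```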

### Rectified Linear Unit (ReLU)
* Equation
\begin{align}\text{ReLU}(\hat{x})=\max(0, \hat{x})\end{align}
* Derivative
\begin{align}\frac{d(\text{ReLU}(\hat{x}))}{d\hat{x}}=\begin{cases}1\;&\text{if } \hat{x} > 0\\0\;&\text{otherwise}\end{cases}\end{align}
* Properties:
* The most commonly used activation function in the hidden layers of a multilayer perceptron network.
* The output of the activation function is a sparse tensor (many exact zeros); therefore, it requires less computation and memory to reach convergence (see the sketch after this list).
* Computationally efficient: the derivative does not require computing an exponential (roughly 1.5x faster than sigmoid and 2x faster than hyperbolic tangent).
* The sparsity could potentially prevent __overfitting__.
* However, the sparsity might also lead to the __dying ReLU__ problem (some weights never get updated because their unit's gradient is always zero).
* Since it is not bounded above, it can cause the __exploding gradient__ problem (will be discussed under recurrent neural networks) when the weights are not initialized properly.
* Range: $0$ to $\infty$
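A minimal NumPy sketch of ReLU and its (sub)gradient, showing the sparsity of the output (the function names are my own):
```python
import numpy as np

def relu(x):
    # max(0, x), element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 where x > 0, 0 elsewhere (0 is used at x = 0 by convention)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]  -- sparse output
print(relu_grad(x))  # [0. 0. 0. 1. 1.]       -- zero gradient for all negative inputs
```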

### Leaky Rectified Linear Unit (Leaky ReLU)
* Equation
\begin{align}f(\hat{x})=\begin{cases}
\hat{x}\;\;\;\;\;\;\;\;\;&\text{if }\hat{x} > 0\\
a\hat{x} &\text{otherwise}\end{cases}\end{align} where $a$ is a hyperparameter (default in TensorFlow: $0.2$)
* Derivative
\begin{align}\frac{df(\hat{x})}{d\hat{x}}=\begin{cases}1\;&\text{if }\hat{x} > 0\\a\;&\text{otherwise}\end{cases}\end{align}
* Properties:
* Solves the dying ReLU problem of the conventional ReLU, since the gradient is never exactly zero (see the sketch after this list).
* Does not avoid the exploding gradient problem.
* Computationally more efficient than ELU (no exponential is needed).
* Range: $-\infty$ to $\infty$
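A minimal NumPy sketch, assuming the negative slope $a=0.2$ mentioned above; note that the gradient is never exactly zero, which is what avoids the dying ReLU problem:
```python
import numpy as np

def leaky_relu(x, a=0.2):
    # x for x > 0, a * x otherwise
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.2):
    # 1 for x > 0, a otherwise -- never exactly zero
    return np.where(x > 0, 1.0, a)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))       # [-0.6 -0.2  0.   1.   3. ]
print(leaky_relu_grad(x))  # [0.2 0.2 0.2 1.  1. ]
```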

### Exponential Linear Unit (ELU)
* Equation
\begin{align}f(\hat{x})=\begin{cases}
\hat{x}\;\;\;\;\;\;\;\;\;&\text{if }\hat{x} > 0\\
a(\exp(\hat{x})-1) &\text{otherwise}
\end{cases}\end{align} where $a$ is a hyperparameter (default in TensorFlow: $1$).
* Derivative
\begin{align}\frac{df(\hat{x})}{d\hat{x}}=\begin{cases}1\;&\text{if }\hat{x} > 0\\a\exp(\hat{x})&\text{otherwise}\end{cases}\end{align}
* Properties:
* Solves the dying ReLU problem of the conventional ReLU.
* Does not avoid the exploding gradient problem.
* $f(\hat{x})\to-a$ as $\hat{x}\to-\infty$, so the output saturates smoothly on the negative side (see the sketch after this list).
* Often converges faster than Leaky ReLU.
* Range: $-a$ to $\infty$
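A minimal NumPy sketch with $a=1$ (the default noted above), showing the smooth saturation to $-a$ on the negative side:
```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a * (exp(x) - 1) otherwise
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # 1 for x > 0, a * exp(x) otherwise (smooth and non-zero everywhere)
    return np.where(x > 0, 1.0, a * np.exp(x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(elu(x))       # the negative side approaches -a = -1
print(elu_grad(x))  # small but non-zero gradient for negative inputs
```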

### Softmax
* Equation
\begin{align}\text{softmax}(\hat{x}_k)=\frac{\exp(\hat{x}_k)}{\underset{i}\sum \exp(\hat{x}_i)}\end{align} where $\hat{x}_i$ is the $i$-th neuron of the input layer.
* [Derivative](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/)
\begin{align}\frac{\partial\,\text{softmax}(\hat{x}_k)}{\partial \hat{x}_l}=\frac{\exp(\hat{x}_k)}{\underset{i}\sum \exp(\hat{x}_i)}\left(\delta_{kl}-\frac{\exp(\hat{x}_l)}{\underset{i}\sum \exp(\hat{x}_i)}\right)\end{align} where $\delta_{kl}=1$ for $l=k$ and $\delta_{kl}=0$ for $l \ne k$ (the Kronecker delta)
* Properties:
* Similar to the sigmoid function, the range of the softmax function is $0$ to $1$ (bounded) and its outputs sum to $1$; therefore, it is also commonly used when we want to model a __probability__ distribution over classes.
* In multi-class classification, softmax is preferable over sigmoid when we believe __a sample can only belong to one class__, since softmax considers all input neurons jointly when applied (see the sketch after this list).
* Range: $0$ to $1$
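A minimal NumPy sketch of the softmax and its Jacobian following the formulas above (a numerically stable version subtracts the maximum before exponentiating, which is valid because softmax is shift-invariant):
```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(x):
    # J[k, l] = softmax(x)[k] * (delta_kl - softmax(x)[l])
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))           # non-negative, sums to 1
print(softmax_jacobian(x))  # each row sums to 0
```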

### Maxout
* Equation
\begin{align}\text{maxout}(\pmb{\hat{x}})=\max_i(\hat{x}_i)\end{align} where $\hat{x}_i$ is the neuron of the input layer.
* Derivative
\begin{align}\frac{\partial\text{maxout}(\pmb{\hat{x}})}{\partial \hat{x}_k}=\begin{cases}
1\;&\text{if } \hat{x}_k = \text{maxout}(\pmb{\hat{x}})\\
0&\text{otherwise}
\end{cases}\end{align}
* Properties
* Practically, maxout is often used as a layer rather than as an element-wise activation function (e.g. the `MaxoutDense` layer implemented in Keras/TensorFlow): a `MaxoutDense` layer with `k=2` computes the element-wise maximum over the outputs of two dense (affine) transformations of the input.
* ReLU is a special case of the maxout activation function. For instance, maxout over $(\hat{x}_1, \hat{x}_2, \hat{x}_3)$ is equivalent to applying ReLU to $\hat{x}_1$ when $\hat{x}_2 = \hat{x}_3 = 0$ (demonstrated in the sketch after this list).
* Maxout can approximate any convex function arbitrarily well (as a piecewise-linear approximation) when the number of pieces is large.
* Computationally intensive compared to ReLU (each unit needs several sets of weights).
* Less prone to the dying ReLU problem.
* Range: $-\infty$ to $\infty$
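A minimal NumPy sketch of a maxout layer (the shapes and names are my own, not a particular framework's API); with `k=2` pieces and the second affine piece fixed to zero, it reduces to ReLU, as claimed above:
```python
import numpy as np

def maxout(x, W, b):
    # x: (d,) input, W: (k, m, d) weights, b: (k, m) biases
    # each of the m output units takes the max over its k affine pieces
    z = np.einsum('kmd,d->km', W, x) + b  # shape (k, m)
    return z.max(axis=0)                  # shape (m,)

rng = np.random.default_rng(0)
d, m, k = 4, 3, 2                         # input dim, output units, pieces per unit
x = rng.normal(size=d)
W = rng.normal(size=(k, m, d))
b = np.zeros((k, m))

W[1] = 0.0                                # fix the second piece to zero
print(maxout(x, W, b))
print(np.maximum(0.0, W[0] @ x + b[0]))   # identical: maxout reduces to ReLU
```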

## Why Non-linear Activation Function
Consider the following multilayer perceptron network with a linear activation function in the hidden layer. Because the composition of linear (affine) transformations is itself linear, such a network collapses into a single linear model, no matter how many hidden layers are stacked.
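To make this concrete, here is a small NumPy check (the weights are random and the shapes are purely illustrative): two stacked linear layers compute exactly the same function as a single linear layer.
```python
import numpy as np

rng = np.random.default_rng(0)

# two "hidden layers" with the identity (linear) activation
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

x = rng.normal(size=3)

# forward pass through the two linear layers
h = W1 @ x + b1
y = W2 @ h + b2

# the equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
print(np.allclose(y, W @ x + b))  # True: stacked linear layers collapse into one
```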
And consider another multilayer perceptron network with the ReLU activation function in two hidden layers (the second layer is added just for visualization; the model can already predict the result well with one hidden layer).
When a linear function is not capable of finding the discriminative properties needed to establish the decision boundary, a non-linear feature transformation can help the model establish the decision boundary readily.

If we take a look at the output of the second hidden layer of the nonlinear perceptron network, we can see that the non-linear transformation of the features ($x_1'$ and $x_2'$) allows the logistic classifier to find a better __linear__ decision boundary.