
Activation Functions

tags: deep learning

"Please design an activation function for image classification and explain why." I like to ask the question whenever interviewing an experienced candidate. Unfortunately, I usually received answers like Sigmoid, ReLU or LeackyReLU. For me, understanding activation functions means how much deep learning knowledge an interviewee has. Therefore, in this article, I will go through common activation functions.

Step Function

$$f(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$

A step function is an activation function with a hard threshold. The neuron is activated only if the input value is above a certain threshold, such as 0. However, a step function, whose gradient is 0 everywhere it is defined, is unusable in gradient descent-based deep learning methods.
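As a quick illustration, here is a minimal NumPy sketch of the step function and its gradient; the function names are my own:

```python
import numpy as np

def step(x):
    # 1 when x > 0, otherwise 0
    return (x > 0).astype(np.float64)

def step_grad(x):
    # The derivative is 0 everywhere it is defined, so gradient
    # descent gets no signal through this activation.
    return np.zeros_like(x, dtype=np.float64)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x))       # [0. 0. 0. 1. 1.]
print(step_grad(x))  # [0. 0. 0. 0. 0.]
```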

Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}$$

Compared to the step function with its hard threshold, the sigmoid function smoothly maps the input to an output between 0 and 1. Although the gradients of the sigmoid function are not always 0, there is still a significant issue: vanishing gradients. As a saturating function, its gradients approach 0 as the input approaches +∞ (positive infinity) or −∞ (negative infinity).
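A short NumPy sketch (illustrative names) makes the saturation easy to see: the derivative peaks at 0.25 and collapses toward 0 for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(x))  # ~[4.5e-05, 0.25, 4.5e-05]: the gradient vanishes in the tails
```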

ReLU

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$

The Rectified Linear Unit (ReLU) is a good solution for avoiding vanishing gradients. When the input is greater than 0, the gradient is exactly 1. Even though vanishing gradients rarely appear with ReLU in practice, they are still possible in a specific situation: if all weights are initially less than 0, the inputs to ReLU can all be negative, and the 0 gradients of the negative region prevent any learning.
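A minimal NumPy sketch, again with illustrative names, showing that the gradient is 1 for positive inputs and 0 otherwise:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise; if every input to a neuron is
    # negative, the gradient stays 0 and that neuron stops learning.
    return (x > 0).astype(np.float64)

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```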

LeakyReLU

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{otherwise} \end{cases}$$

As we mentioned in the ReLU section, vanishing gradients can still occur in one specific situation. LeakyReLU changes the output for negative inputs to $\alpha x$, whose gradient is $\alpha$. As long as $\alpha$ is not equal to 0, the gradients of LeakyReLU are never exactly zero.
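A small NumPy sketch of LeakyReLU and its gradient; the default $\alpha = 0.01$ here is just a common choice for illustration:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for positive inputs, alpha otherwise -- never exactly 0 as long as alpha != 0
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(leaky_relu(x))       # [-0.03 -0.01  0.5   2.  ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```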

PReLU

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{otherwise} \end{cases}$$

Parametric ReLU (PReLU) is similar to LeakyReLU but turns $\alpha$ from a constant into a trainable weight. With this change, the neural network can adapt the activation function itself, and it even becomes a linear function when $\alpha = 1$.
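As a sketch of what "trainable" means here, PyTorch's built-in nn.PReLU stores $\alpha$ as a learnable parameter that backpropagation updates like any other weight:

```python
import torch
import torch.nn as nn

# nn.PReLU keeps alpha as a learnable parameter (initialized to 0.25 by default),
# so it is updated by backprop together with the other weights.
prelu = nn.PReLU()  # one shared alpha; num_parameters can also be set per channel
x = torch.tensor([-2.0, -0.5, 1.0, 3.0])
print(prelu(x))       # negative inputs are scaled by the current alpha
print(prelu.weight)   # the trainable alpha, with requires_grad=True
```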

Softplus

$$f(x) = \ln(1 + e^{x})$$

In contrast to ReLU, Softplus is smooth everywhere rather than having a hard threshold at 0. Consequently, the decision boundary of a ReLU classifier is sharp, whereas the output of a Softplus classifier is smoother. However, Softplus has $\ln()$ and $\exp()$ in its formulation, which makes it slower to compute than ReLU.
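A small NumPy sketch comparing Softplus with ReLU; np.logaddexp(0, x) is used here as a numerically stable way to evaluate $\ln(1 + e^{x})$:

```python
import numpy as np

def softplus(x):
    # log(exp(0) + exp(x)) = ln(1 + e^x), computed without overflow
    return np.logaddexp(0.0, x)

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, 0.0, 2.0])
print(softplus(x))  # ~[0.127 0.693 2.127]: smooth everywhere, close to ReLU for large |x|
print(relu(x))      # [0. 0. 2.]
```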

ELU

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha (e^{x} - 1), & \text{otherwise} \end{cases}$$

Compared to ReLU, ELU has negative values, which push the mean of the activations closer to 0. Mean activations that are closer to zero enable faster learning because they bring the gradient closer to the natural gradient (see the theorem in the ELU paper). In the formula, we usually set $\alpha = 1$ so that the gradient is continuous at 0.
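A minimal NumPy sketch of ELU with the usual $\alpha = 1$ (np.expm1 is used for numerical accuracy; the names are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # alpha = 1 makes the derivative continuous at 0 (both branches give 1 there)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))  # negative inputs saturate toward -alpha instead of being clipped to 0
```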

Swish

$$f(x) = x \cdot \sigma(x)$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function.

Swish, proposed by Google, is the best activation function discovered by a reinforcement-learning-based search. Unlike ReLU, Swish is non-monotonic. In fact, this non-monotonicity is what distinguishes Swish from most common activation functions.
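A short NumPy sketch shows the non-monotonicity: Swish dips below zero around $x \approx -1.28$ and then comes back up toward 0 (the sample points are chosen to show this):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

x = np.array([-5.0, -1.28, 0.0, 1.0])
print(swish(x))  # ~[-0.033, -0.279, 0.0, 0.731]: swish(-5) > swish(-1.28), so not monotonic
```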

β-Swish

$$f(x) = x \cdot \sigma(\beta x)$$

β-Swish is Swish with a trainable parameter $\beta$. Like PReLU, β-Swish lets the network adapt the activation function, and it even becomes a linear function when $\beta = 0$ (since $\sigma(0) = 1/2$, the output is simply $x/2$).
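A rough PyTorch sketch of β-Swish as a module with a trainable $\beta$; the class name and the initial value of 1.0 are my own choices for illustration:

```python
import torch
import torch.nn as nn

class BetaSwish(nn.Module):
    """Swish with a trainable beta; beta = 0 reduces it to the linear function x / 2."""
    def __init__(self, beta_init=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

act = BetaSwish()
x = torch.tensor([-2.0, 0.0, 2.0])
print(act(x))     # beta is updated by backprop like any other weight
print(act.beta)
```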

h-Swish

$$f(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$

Swish brings a significant improvement in accuracy, yet the $\exp()$ in its formula is much more expensive to compute on an edge device. h-Swish replaces the sigmoid with a similar piecewise-linear function that contains no $\exp()$.
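A minimal NumPy sketch of h-Swish built from ReLU6, with no $\exp()$ anywhere in the computation (names are illustrative):

```python
import numpy as np

def relu6(x):
    # min(max(x, 0), 6): cheap to compute and quantization-friendly
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # ReLU6(x + 3) / 6 is a piecewise-linear stand-in for the sigmoid
    return x * relu6(x + 3.0) / 6.0

x = np.array([-4.0, -1.0, 0.0, 3.0])
print(h_swish(x))  # [ 0.    -0.333  0.     3.   ]
```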

Conclusion

Although there are many different activation functions for deep learning, we cannot say which one is the best. In practice, we have to choose the most appropriate activation function for the situation at hand, and we can even design a customized one for a specific application.

References