Much of a neural network's expressive power comes from its activation functions. They decide whether a given input causes a neuron to "fire" or stay dormant. Although it may sound like a minor technicality, understanding activation functions is essential for anyone learning about neural networks.
What are activation functions?
Activation functions are a neural network's decision-makers. Attached to every neuron, they determine whether that neuron should activate, based on whether the neuron's input is relevant to the network's prediction.
As gatekeepers, activation functions let only specific data go through and contribute to the output of the network. They give neural networks a crucial layer of non-linearity that allows them to recognize and interpret intricate patterns in input.
Looking at the distinctive qualities of a few common activation functions helps build a deeper understanding of this idea. In addition, many activation functions normalize each neuron's output by limiting it to a fixed range, typically between 0 and 1 or between -1 and 1.
The neurons in a network's input layer receive the inputs. Each connection carries a weight, and a neuron's output is computed by multiplying its inputs by the corresponding weights and summing them. This output is then passed to the next layer.
The activation function is a mathematical "gate" between the input reaching the current neuron and the output sent to the next layer. It can be as simple as a step function, which turns the neuron's output on or off according to a predetermined threshold.
Crucially, neural networks use non-linear activation functions. Non-linearity is what lets the network model complex data patterns, approximate almost any function relevant to a given problem, and ultimately produce accurate predictions.
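For a concrete picture of this "gate", here is a minimal sketch (not from the original article) of a single neuron: a weighted sum of its inputs is passed through a simple step activation. The weights, bias, and threshold are arbitrary values chosen only for illustration.
import torch
# A single neuron: weighted sum of inputs, then a step-function "gate".
x = torch.tensor([0.5, -1.2, 3.0])   # example inputs
w = torch.tensor([0.8, 0.1, -0.4])   # arbitrary example weights
b = 0.2                              # arbitrary example bias
z = torch.dot(w, x) + b              # weighted sum
output = 1.0 if z > 0 else 0.0       # step activation: fire (1) or stay dormant (0)
print(z.item(), output)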
List of activation functions supported in Vitis AI for PyTorch
When compiling a PyTorch model for a Vitis AI target board (for example, the KV260, ZCU102, and many others), the activation functions listed below are supported in Vitis AI.
ReLU Function
The ReLU function formula is as follows:
ReLU(x) = max(0, x)
The Rectified Linear Unit, or ReLU for short, is one of the most widely used activation functions in deep learning. In contrast to several other activation functions, ReLU is remarkably simple: it outputs the maximum of zero and its input. ReLU is not differentiable at zero, but in practice its derivative there is handled with a sub-gradient.
ReLU has become increasingly popular, and for good reason: unlike more conventional activation functions such as sigmoid and tanh, it does not saturate for positive inputs, which helps mitigate vanishing gradients.
Example:
import torch
import torch.nn as nn
m = nn.ReLU()
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.2000, 7.5000, 0.4000])
Leaky ReLU Function
The Leaky ReLU function formula is as follows:
f(x) = max(0.01x, x)
Leaky ReLU addresses the "dying ReLU" problem while providing all the benefits of ReLU. By allowing a small, non-zero gradient for negative inputs, it keeps neurons from becoming permanently inactive. Nonetheless, whether Leaky ReLU consistently outperforms ReLU depends on the particular problem and architecture; there is no one-size-fits-all answer, and choosing between ReLU and its variants often requires empirical experimentation and tuning.
These ReLU variants reflect the ongoing effort to improve neural network performance and robustness across different deep learning problems and applications.
Example (a negative slope of 26/256 = 0.1015625 is used here, which corresponds to the LeakyReLU slope supported by the Vitis AI DPU):
m = nn.LeakyReLU(26/256)
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([-3.0469e-02, -1.4219e-01, -1.0156e-04, 0.0000e+00, 1.2000e+00, 7.5000e+00, 4.0000e-01])
ReLU6 Function
The ReLU6 function formula is as follows:
f(x) = min(max(0,x),6)
The ReLU6 function is a modification of the original ReLU designed to improve robustness under low-precision computation. ReLU6 caps the output of ReLU at a maximum value of 6: any input below 0 is clamped to 0, while any input above 6 is clamped to 6.[2]
Example:
m = nn.ReLU6()
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.2000, 6.0000, 0.4000])
Hard Tanh Function
HardTanh is defined as:
HardTanh(x) = max_val if x > max_val, min_val if x < min_val, x otherwise
The Hard Tanh function is a piecewise linear version of the standard hyperbolic tangent (tanh) function, which squashes input values between -1 and 1. Hard Tanh applies hard thresholds: any input below min_val (default -1) is set to min_val, any input above max_val (default 1) is set to max_val, and inputs in between pass through unchanged. This introduces non-linearity while preventing values from exploding during training.
Example:
m = nn.Hardtanh(-1, 1)
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([-0.3000, -1.0000, -0.0010, 0.0000, 1.0000, 1.0000, 0.4000])
Hard Sigmoid Function
The Hard Sigmoid function is a piecewise linear approximation of the sigmoid function. It is faster to compute and is often used in situations where a faster but less smooth approximation of the sigmoid is acceptable. It introduces linear segments at the extremes of the sigmoid function.
Mathematically, the Hard Sigmoid function can be defined as:
Hardsigmoid(x) = 0 if x≤ -3, 1 if x≥ 3, x/6+1/2 otherwise
This function introduces non-linearity while having a simpler and faster computation than the sigmoid function.
Example:
m = nn.Hardsigmoid()
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([0.4500, 0.2667, 0.4998, 0.5000, 0.7000, 1.0000, 0.5667])
Hard Swish Function
The Hard Swish function is a modified version of the Swish activation function. Swish is a smooth, non-monotonic function that is differentiable. However, it involves the computation of the exponential function, which can be computationally expensive.
The Hard Swish function introduces a piecewise-linear approximation to the Swish function, making it computationally more efficient. It retains some of the non-linear properties of Swish but with less computational cost.
Hardswish(x) = 0 if x ≤ -3, x if x ≥ 3, x·(x+3)/6 otherwise
Example:
m = nn.Hardswish()
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([-1.3500e-01, -3.7333e-01, -4.9983e-04, 0.0000e+00, 8.4000e-01, 7.5000e+00, 2.2667e-01])
Mish Function
The Mish function is designed to be a smooth, differentiable alternative to other popular activation functions such as ReLU (Rectified Linear Unit). It is defined as:
Mish(x) = x ∗ Tanh(Softplus(x)), where Softplus(x) = (1/β) ∗ log(1 + exp(β∗x)); the default value of β is 1.
Example:
m = nn.Mish()
n = torch.tensor([-0.3, -1.4, -0.001, 0, 1.2, 7.5, 0.4])
output = m(n)
print(output)
Output:
tensor([-1.5113e-01, -3.0368e-01, -5.9968e-04, 0.0000e+00, 1.0779e+00, 7.5000e+00, 2.8903e-01])
Vitis AI Convergence and DPU Inferencing
To check the effect of these activation functions (making every layer compatible with the DPU) on a Xilinx board, we use YOLOv4 as an example.
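One way to run this experiment, sketched below under the assumption that the YOLOv4 implementation exposes its activations as ordinary nn.Module children, is to recursively replace every nn.Mish module with a DPU-supported activation such as LeakyReLU(26/256). This is only an illustrative helper, not the original training code.
import torch.nn as nn

def replace_activations(module, old=nn.Mish, new_factory=lambda: nn.LeakyReLU(26/256)):
    # Recursively swap every `old` activation for a DPU-supported replacement.
    for name, child in module.named_children():
        if isinstance(child, old):
            setattr(module, name, new_factory())
        else:
            replace_activations(child, old, new_factory)
    return module

# Usage (assuming `model` is a YOLOv4 nn.Module built with nn.Mish):
# model = replace_activations(model)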
Dataset used:
For all the tests conducted below, the Road Sign Dataset (Link: www.kaggle.com/datasets/andrewmvd/road-sign-detection) was used.
The original YOLOv4, as described in the paper, uses the Mish activation function. When trained for 100 epochs, it gave 78% mAP, and the result is as follows:
- Using Leaky ReLU:
The following results were obtained for the test dataset:
and the output was as follows:
KV260 Board inference output:
- Using Hardswish:
The following results were obtained when trained with Hard Swish,
and the output was as follows:
KV260 Board inference result:
Since the number of DPU subgraphs was 8, the model did not compile in Vitis AI (targeting the KV260 board).
- Using Hard Tanh:
The following results were obtained when trained with Hardtanh(0, 6):
No detections were observed in the image at a confidence threshold of 0.2 or above.
Below 0.2:
KV260 board inference was done with a confidence threshold of 0.1, as no detections were observed at a threshold of 0.2 or above.
- Using ReLU6:
The following results were obtained when trained with ReLU6:
Again, no detections were obtained at a confidence threshold of 0.2 or above.
Below 0.2:
For board testing, ReLU6 was converted to ReLU during quantization and compilation in Vitis AI (see the quantization sketch after these results).
Board inference result (no bounding boxes were detected until the confidence threshold was lowered to 0.05):
- Using Hard Sigmoid: after training for 40 epochs, the mAP was still 0.
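For context, the quantization and compilation step referred to above generally follows the Vitis AI PyTorch quantizer flow. The sketch below is a generic outline using the pytorch_nndct API with a tiny stand-in model and an assumed 416x416 input size; it is not the exact script used for these experiments.
import torch
import torch.nn as nn
from pytorch_nndct.apis import torch_quantizer  # Vitis AI PyTorch quantizer

# Tiny stand-in model; in practice this would be the trained YOLOv4 variant.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.LeakyReLU(26/256)).eval()
dummy_input = torch.randn(1, 3, 416, 416)

# Calibration pass: collect activation statistics.
quantizer = torch_quantizer("calib", model, (dummy_input,))
quant_model = quantizer.quant_model
quant_model(dummy_input)          # run calibration data through the model
quantizer.export_quant_config()

# A second run with quant_mode="test" evaluates the quantized model and
# exports the xmodel, which is then compiled for the KV260 with vai_c_xir:
# quantizer = torch_quantizer("test", model, (dummy_input,))
# quantizer.quant_model(dummy_input)
# quantizer.export_xmodel(deploy_check=False)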
Note:
All training was done on an NVIDIA RTX 2070 SUPER for 100 epochs, with a batch size of 8 and a freeze-phase batch size of 16. The learning rate was set to 1e-5 for all 100 epochs.
Why does LeakyReLU perform best?
Out of all the activation functions supported by Vitis AI, the one closest to Mish is Leaky ReLU, as seen in the graph above. Therefore, for YOLOv4, Leaky ReLU is the preferred choice among the supported activations.
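As a quick numeric check (the comparison graph itself is not reproduced here), the sketch below evaluates Mish and LeakyReLU(26/256) side by side over a range of inputs so their similarity can be inspected directly.
import torch
import torch.nn as nn

# Evaluate Mish and LeakyReLU(26/256) over a range of inputs for comparison.
x = torch.linspace(-3, 3, steps=7)
mish = nn.Mish()(x)
leaky = nn.LeakyReLU(26/256)(x)
for xi, mi, li in zip(x, mish, leaky):
    print(f"x = {xi:5.2f}   Mish = {mi:8.4f}   LeakyReLU = {li:8.4f}")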
References:
[1]. activation-functions-in-neuralnetworks
[2]. Relu6
[3]. pytorch-nn
Kudos to our Machine Learning Engineer, Anupam Wagle, for creating this detailed and insightful tutorial on "activation functions and their impact on ML inferencing with Vitis AI and the DPU". And thanks to Dikesh Shakya Banda for the article plan!
LogicTronix is an AMD-Xilinx partner for FPGA Design and Machine Learning Acceleration!
For any inquiries around ML acceleration on FPGA, please ping us at info@logictronix.com