The world we live in is not a Disney movie where animals (excluding Homo sapiens) live in harmony without human intervention. Animal extinction is real, and as many as 99% of the species at the threshold of extinction are there due to our (human, and that includes you too) activities [1]. Elephants face the same problem: these gentle giants are being hunted for their ivory tusks and for sport.
With the help of technology, we can address this problem: it can help prevent elephant poaching and provide insights for elephant conservation.
2. A Solution
The solution we present here is a machine learning algorithm designed using Edge Impulse Studio and deployed on an elephant tracking collar by Smart Parks. The tracking collar, developed by IRNAS, has many sensors built into it; two of them are a microphone and an accelerometer.
We have developed two machine learning algorithms that use these sensors to provide useful information:
2.1 Human Presence Detection
- This algorithm aims to prevent potential poaching by detecting the presence of a human using sound. If an unauthorized person is near an elephant, it is not a good sign for either's safety.
- It uses a MEMS audio sensor to record sound and a machine-learning model to detect human presence.
2.2 Elephant Activity Monitoring
- This algorithm aims to monitor an elephant's activity, which can help in its conservation.
- It takes time-series input from an accelerometer and uses a machine-learning model to identify the elephant's activity.
- At the moment it can detect walking and resting; given more training data, the same algorithm could also detect other activities such as running and sleeping.
Both of these algorithms work independently. They take input from different sensors and generate two distinct outputs.
3. Human Presence Detection
This algorithm uses sound to detect the presence of a human. The MEMS microphone of the tracking collar (step 2) picks up sounds around an elephant and sends them to the preprocessing algorithm (step 3), which generates the MFCC (Mel Frequency Cepstral Coefficients) of the sound clip. The generated MFCC is then fed to a convolutional neural network (step 4), which, with the help of a fully connected network, classifies the audio into three classes (step 5):
- Human
- Elephant
- Other
Based on the predicted class, a Smart Parks manager can decide what to do next.
3.1 Data Gathering and Preprocessing
We gathered voice samples from multiple sources and prepared the dataset for training and testing. Our voice dataset contains three classes:
- Elephant
- Human
- Other
Thanks to Elephant Voices for providing the elephant voice dataset. The Human and Other class data were captured from YouTube.
You can download the voice dataset from here.
We uploaded the training and testing data to Edge Impulse via its Data Acquisition feature, then cropped and split the samples to remove unwanted sounds.
The cleaned data is then preprocessed in Edge Impulse to generate the MFCC, which can be visualized using the Feature Explorer.
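As a rough illustration of what the MFCC preprocessing block computes, here is a minimal NumPy sketch: frame the audio, take the power spectrum, apply a triangular mel filterbank, take the log, then a DCT. The frame length, hop, filter, and coefficient counts below are illustrative assumptions, not the exact Edge Impulse defaults.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_coeffs=13):
    """Toy MFCC: frames -> power spectrum -> mel filterbank -> log -> DCT."""
    # Split the signal into overlapping, Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1)) ** 2
    # Triangular mel filterbank spanning 0 Hz .. sr/2
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((frame_len + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    log_mel = np.log(spec @ fb.T + 1e-10)
    # DCT-II decorrelates the filterbank energies; keep the first n_coeffs
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

# One 2-second window at 16 kHz, as used for inference
sig = np.sin(2 * np.pi * 440 * np.arange(32000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # (198, 13): 198 frames x 13 coefficients
```

The resulting frames-by-coefficients matrix is the kind of 2-D feature map the Feature Explorer visualizes and the CNN consumes.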
3.2 Neural Network
The neural network we developed for sound classification is a combination of a CNN and a fully connected network. Building and configuring a neural network is very easy in Edge Impulse: you only need to add a Neural Network classifier block while creating the impulse. You can modify its architecture and configure hyperparameters from the NN Classifier tab under Impulse Design.
We tried multiple combinations of network architecture and data augmentation, which we discuss in the next section. Before that, let us point out our assumptions and pre-settings.
- As shown in Figure 4, we fixed a window size of 2000 ms and a window increase of 500 ms for sampling the training and test data, keeping in mind not just the memory constraints on input features in Edge Impulse but also the real-world scenario where, for example, the device records and makes a prediction every 2 seconds. (A further possibility is to treat these predictions as time-series points for higher-level analysis; we have not tried this.)
- Training cycles and the learning rate were chosen separately for every experimented architecture so that the network gets enough epochs to learn.
- For every architecture, we set the minimum confidence score for predictions to 0.50. It should be noted that the choice of this value significantly affects the test accuracy.
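To make the window and threshold settings concrete, here is a small sketch of how a 2000 ms window with a 500 ms increase and a 0.50 minimum confidence could be applied at inference time. The classifier itself is stubbed out, and the function names and label order are our own assumptions.

```python
import numpy as np

LABELS = ["Human", "Elephant", "Other"]

def windows(signal, sr=16000, win_ms=2000, step_ms=500):
    """Yield overlapping windows: win_ms long, advancing step_ms each time."""
    win, step = sr * win_ms // 1000, sr * step_ms // 1000
    for start in range(0, len(signal) - win + 1, step):
        yield signal[start:start + win]

def decide(probs, min_confidence=0.50):
    """Return the top label, or 'Uncertain' if its score is below the threshold."""
    best = int(np.argmax(probs))
    return LABELS[best] if probs[best] >= min_confidence else "Uncertain"

sig = np.zeros(16000 * 4)                      # a 4-second clip
print(sum(1 for _ in windows(sig)))            # 5 windows
print(decide(np.array([0.70, 0.20, 0.10])))    # Human
print(decide(np.array([0.40, 0.35, 0.25])))    # Uncertain
```

Raising the threshold trades more "Uncertain" outputs for fewer false alarms, which is why this value noticeably moves the measured test accuracy.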
3.3 Testing and Outcome
We experimented with the various convolutional neural network (CNN) architectures possible in Edge Impulse and analyzed their performance in terms of both accuracy and resource usage, as discussed in this section.
3.3.1 Performance of 1D convolutions
In the first set of experiments, we define a CPD block, shown in Figure 9, consisting of Convolution, Pooling, and Dropout layers. It has a single 1D convolution layer with a 3x3 kernel size and a number of neurons (filters) in multiples of 8 (starting from 8). Therefore, with 2 CPD blocks, the first has a convolution layer with 8 filters and the second has the same configuration with 16 filters.
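A CPD block can be sketched as a 1-D convolution (with ReLU), a max-pooling layer, and dropout (which is a no-op at inference time). The NumPy forward pass below uses random weights purely to show how the feature-map shape evolves through three blocks with 8, 16, and 24 filters; it is an illustration of the structure, not the trained model.

```python
import numpy as np

def conv1d(x, kernels):
    """'Valid' 1-D convolution with ReLU: x is (length, in_ch), kernels is (k, in_ch, out_ch)."""
    k = kernels.shape[0]
    out = np.stack([
        np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0] - k + 1)
    ])
    return np.maximum(out, 0)

def max_pool(x, size=2):
    """Halve the time axis by taking the max over each pair of steps."""
    trimmed = x[: (x.shape[0] // size) * size]
    return trimmed.reshape(-1, size, x.shape[1]).max(axis=1)

def cpd_block(x, n_filters, rng):
    """One Conv-Pool-Dropout block; dropout is omitted (identity at inference)."""
    kernels = rng.standard_normal((3, x.shape[1], n_filters)) * 0.1
    return max_pool(conv1d(x, kernels))

rng = np.random.default_rng(0)
x = rng.standard_normal((198, 13))   # e.g. an MFCC matrix: frames x coefficients
for n_filters in (8, 16, 24):        # filters in multiples of 8 per block
    x = cpd_block(x, n_filters, rng)
print(x.shape)  # (23, 24) after three blocks
```

Each block halves the time axis while widening the channel axis, which is why deeper stacks cost more compute per block even as the feature maps shrink.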
In Figure 10 we show how increasing the complexity and depth of the neural network affects resource usage (inference time, RAM, and ROM), which are important numbers when inferencing on edge devices that are low on both power and compute.
- With an increase in the number of CPD blocks, the test accuracy shows an upper bound of 85.78% on the described data and test settings. It decreases up to 4 CPD blocks and slowly increases from there.
- Making the network deeper increases inference time linearly, by approximately 3 ms per added CPD block.
- Correspondingly, RAM usage also follows a linear pattern, although less steep than inference time, with an average increase of 1.5 KB per added CPD block.
- ROM usage, however, does not follow an upward trend in our experiments, so we conclude that approximately 68.2 KB of ROM might be consumed by architectures similar to the above.
So, from the above observations, we conclude that the best architecture in this set of experiments is the one with 3 CPD blocks: 85.78% test accuracy, 16 ms inference time, 8.3 KB RAM usage, and 68.9 KB ROM usage.
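The trends above suggest a rough back-of-the-envelope cost model for the 1-D stacks: about 3 ms and 1.5 KB per additional CPD block, with ROM roughly flat. The intercepts below are back-solved from the reported 3-block figures (16 ms, 8.3 KB RAM), so treat the whole function as an extrapolation, not a measurement.

```python
def estimate_1d_cpd_cost(n_blocks):
    """Linear extrapolation of inference cost for stacks of 1-D CPD blocks.

    Slopes: ~3 ms and ~1.5 KB per block (observed trends); intercepts are
    back-solved so that 3 blocks reproduce the reported 16 ms / 8.3 KB RAM.
    """
    inference_ms = 7.0 + 3.0 * n_blocks   # 16 ms at 3 blocks
    ram_kb = 3.8 + 1.5 * n_blocks         # 8.3 KB at 3 blocks
    rom_kb = 68.2                         # roughly constant in our experiments
    return inference_ms, ram_kb, rom_kb

print(estimate_1d_cpd_cost(3))  # roughly (16.0, 8.3, 68.2)
```

Such a model is only useful for quick feasibility checks against a collar's power and memory budget before running the full Edge Impulse build.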
3.3.2 Performance of 2D convolutions
In this part, the CPD block from the previous section is modified to contain a single convolution layer with 2D convolutions, a 3x3 kernel size, and a number of neurons (filters) in multiples of 4 (starting from 4).
In Figure 12 we show how increasing the depth of the network by appending the CPD block of 2D convolutions affects the resource usage and performance of the model.
- Increasing the number of CPD blocks definitely increases performance, but after 3 blocks only by a tiny amount: with 4, 5, and 6 CPD blocks, test accuracy stays between 86% and 87%.
- Making the network deeper increases inference time, following a saturating exponential pattern.
- Impressively, RAM usage follows a flatter linear pattern, with an average increase of only 0.5 KB per added CPD block.
- ROM usage follows a strange pattern: it decreases as the number of CPD blocks increases. The simplest network consumes 4x more ROM than the most complex one.
So, from the above observations, we conclude that the best architecture in this set of experiments is the one with 4 CPD blocks: 86.19% test accuracy, 42 ms inference time, 10.3 KB RAM usage, and 43.7 KB ROM usage.
3.3.3 Varying the Number of Convolution Layers
In the second set of experiments, we used the network architecture depicted in Figure 13, where the number of convolution layers before the pooling layer inside a single CPD block is varied while keeping the number of neurons the same. There are two CPD blocks, the first with convolution layers of 8 neurons and the second with 16. We discuss the performance metrics for these settings below.
Edge Impulse provides settings to add up to 3 contiguous convolution layers and we report results on these settings in Figure 14.
- Increasing layers from one to three showed a jump in the test accuracy of almost 3%.
- Although inference time increases linearly, the average increase of 8.5 ms per added convolution layer is significant.
- The average RAM usage increase is also impressive in this case: only 1.7 KB per layer.
- The average ROM usage increase with the number of layers is around 2.2 KB.
So, from the above observations, the best model in this set of experiments is the one with 3 convolution layers (before the pooling layer): 85.02% test accuracy, 30 ms inference time, 10.2 KB RAM usage, and 79.4 KB ROM usage.
3.3.4 Data augmentation
In this set of experiments, we tried different data augmentation settings in Edge Impulse and report their effect on model accuracy. Here we used the same architecture depicted in Figure 8, with one convolution layer.
The increase in training time was expected, but the decrease in test accuracy with data augmentation came as a surprise, as shown in Table 3. We did not add noise, as the original audio samples already contain natural background noise in many time frames. We varied frequency masking from none to high, with time-axis warping in one of the settings. Evidently, data augmentation has no effect on inference time or hardware resource usage.
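Frequency masking of the kind described above zeroes out a random band of frequency rows in the feature matrix, SpecAugment-style. The sketch below is our own illustration of the idea, not Edge Impulse's implementation, and time-axis warping is not shown.

```python
import numpy as np

def mask_frequency(features, max_mask=4, rng=None):
    """Zero a random band of up to max_mask frequency rows in a (freq, time) matrix."""
    if rng is None:
        rng = np.random.default_rng()
    width = int(rng.integers(0, max_mask + 1))                   # band height: 0..max_mask
    start = int(rng.integers(0, features.shape[0] - width + 1))  # band position
    out = features.copy()
    out[start:start + width, :] = 0.0
    return out

rng = np.random.default_rng(42)
feats = np.ones((13, 98))                  # 13 coefficients x 98 frames
augmented = mask_frequency(feats, rng=rng)
print((augmented == 0).all(axis=1).sum())  # number of fully masked rows (0..4)
```

Because the masking happens only at training time, it cannot change inference time or on-device memory, which matches what we observed.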
4. Elephant Activity Monitoring
This algorithm uses an accelerometer to detect the elephant's activity. The sensor gathers acceleration along the X, Y, and Z axes (step 2) and sends it to the preprocessing algorithm, where spectral analysis is performed on the acceleration data (step 3). The preprocessed data is then fed into a fully connected neural network (step 4), which classifies it into activities such as:
- Standing
- Walking
- Other
Note: The attached neural network model is trained only on 3 types of activity data (a) standing, (b) walking, and (c) other.
4.1 Data Gathering and Preprocessing
Disclaimer: We could not find an accelerometer dataset for elephant activity, so we created our own dataset based on human activity. The same neural network architecture can be used for training on elephant activity data.
Edge Impulse provides some very handy features for data collection. It supports some popular development boards for data collection and inference, and you can even use your smartphone, which is what we did to prepare the activity dataset.
Data Collection Device Information:
- Company: OnePlus
- Model: OnePlus 6
- Accelerometer Sensor: BOSCH - 34744578
Data is collected for 3 activities:
- Stand
- Walk
- Other
Each sample in the dataset is 10 seconds long and the device was placed in the front trouser pocket while collecting data.
You can download the accelerometer dataset from here.
The data is then preprocessed using the Spectral Analysis block in Edge Impulse itself.
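As a rough stand-in for the kind of features a spectral-analysis step derives, the sketch below computes per-axis RMS plus the strongest FFT frequencies from a (time, 3) accelerometer sample. The 62.5 Hz sampling rate and the exact feature set are assumptions for illustration, not Edge Impulse's actual Spectral Analysis implementation.

```python
import numpy as np

def spectral_features(sample, sr=62.5, n_peaks=3):
    """Per-axis RMS plus the n_peaks strongest FFT frequencies of a (time, 3) sample."""
    features = []
    for axis in range(sample.shape[1]):
        x = sample[:, axis] - sample[:, axis].mean()   # remove the DC/gravity offset
        features.append(np.sqrt(np.mean(x ** 2)))      # RMS energy
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        features.extend(freqs[np.argsort(spectrum)[-n_peaks:]])  # dominant frequencies
    return np.array(features)

sr = 62.5
t = np.arange(int(10 * sr)) / sr                        # one 10-second sample
walk = np.stack([np.sin(2 * np.pi * 2 * t),             # ~2 Hz stride on X
                 0.5 * np.sin(2 * np.pi * 4 * t),       # harmonic on Y
                 0.1 * np.ones_like(t)], axis=1)        # near-static Z
print(spectral_features(walk).shape)  # (12,): 4 features per axis
```

Features like these turn each raw 10-second window into a short, fixed-length vector that a small fully connected network can classify.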
4.2 Neural Network
The neural network used for activity classification is extremely simple. It contains only four fully connected layers: an input layer with 30 neurons, two hidden layers with 20 and 10 neurons respectively, and an output layer with 3 neurons. We used a learning rate of 0.0005 and trained the network for only 30 epochs.
The ReLU activation function is used for the hidden layers, and the Adam optimizer is used to train the network.
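The forward pass of this 30-20-10-3 network, with ReLU hidden layers and a softmax output, can be sketched in NumPy as below. The weights here are random placeholders; the real parameters are learned in Edge Impulse with the Adam optimizer.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(features, p):
    """30 -> 20 -> 10 -> 3 fully connected classifier (Standing/Walking/Other)."""
    h = relu(p["W1"] @ features + p["b1"])   # input 30 -> hidden 20
    h = relu(p["W2"] @ h + p["b2"])          # hidden 20 -> hidden 10
    return softmax(p["W3"] @ h + p["b3"])    # hidden 10 -> 3 class probabilities

rng = np.random.default_rng(0)
sizes = [30, 20, 10, 3]
p = {}
for i in range(3):
    p[f"W{i + 1}"] = rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1
    p[f"b{i + 1}"] = np.zeros(sizes[i + 1])

probs = forward(rng.standard_normal(30), p)
print(probs.shape, round(float(probs.sum()), 6))  # (3,) 1.0
```

With only (30x20 + 20) + (20x10 + 10) + (10x3 + 3) = 863 parameters, the model is tiny, which is why it fits so comfortably on a collar-class device.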
4.3 Testing and Outcome
The neural network model we developed for activity monitoring performed extremely well: the training and test accuracies we achieved are almost 100%, too good to be true!
The developed neural network model can be found here.
5. Conclusion
Looking at the sets of experiments performed for both algorithms, we present our final conclusions here.
Looking at Table 4, we can say that for "Human Presence Detection" the best model is "Three 1-D CPD blocks and a Dense Layer": it outperforms the other models in RAM usage and inference time, and is only 0.41% less accurate than the most accurate model.
The neural network model with "Three 1-D CPD blocks and a Dense Layer" architecture is available here.
For our voice dataset and models, data augmentation did not have any positive effect.
For "Elephant Activity Monitoring", the fully connected network described in Section 4.2 performs best, with 99.47% test accuracy; however, it is trained and tested on human activity data, not elephant activity data. To properly train and test this model, we will need activity data from multiple elephants.
6. References
- [1] worldanimalfoundation.com/wild-earth/extinction-crisis/
- Elephant Voices - https://www.elephantvoices.org/
- Edge Impulse - https://www.edgeimpulse.com/
- Our Datasets and Models - https://drive.google.com/drive/folders/17vRxvCc-ZCzkVpVxsZgwp9buepUrnpgE?usp=sharing