One of the most disturbing problems in any type of disaster is the human death toll, caused by people being trapped in debris, losing consciousness (for example, in a toxic environment), or receiving help too late. Consequently, one of the most significant challenges faced by search and rescue teams during a disaster is the actual search for survivors and victims. This project therefore aims to design a video detection and classification system: a robotic drone that flies in manned or autonomous mode, continuously searching for humans. When a human is identified, the drone sends back a warning signal for further action and investigation. A swarm of such drones can cover a large area of land in a short time.
This objective is very challenging due to variable imaging conditions, image distortions caused by the drone’s own movement, targets of widely varying size, the computational demands of the algorithms involved, etc.
In conclusion, the goal of this project is to study the ability of the NVIDIA Jetson Nano to detect humans (based on different deep learning neural networks) in an unconstrained environment, such as a forest, from a video stream obtained from a quadcopter.
The problem
At the base of deep neural networks are convolutional neural networks. These pre-trained convolutional networks have different characteristics that matter when choosing a specific network for a given problem – like human detection, in my case. These networks are trained on more than a million images (e.g., from the ImageNet database) and can classify images into many categories (e.g., 1000 object categories). From my point of view, the essential characteristics are network accuracy and speed. Choosing a network is generally a tradeoff between these two characteristics (accuracy vs. speed) – e.g., see the figure from this link.
Since convolutional networks are widely adopted for various computer vision problems such as image classification (identifying what type of object an image contains), object detection (detecting different kinds of objects in a picture), and object localization (determining the locations of the detected objects), my problem – human detection – is a particular case of object detection, and such a network should be able to solve it.
In this project, I want to classify the input frames into only two classes: (a) images without humans and (b) images in which one, two, or more humans are present. The first question is how well these networks will solve this problem, and the second is how fast. In a previous project, I used a Haar cascade and a Linear SVM classification model to perform human detection in video streams. But these algorithms took a substantial amount of time to finish their execution – between 2.4 and 2.9 seconds on a Raspberry Pi system (Raspberry Pi 3 Model B+, 1.4 GHz). So, this approach was unfeasible.
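For reference, a per-frame detection with a classical cascade looks roughly like the sketch below. This is only a minimal illustration of that earlier style of detector, assuming OpenCV’s bundled haarcascade_fullbody.xml and a hypothetical test frame file; it is not the exact code of the previous project, only a way to show where the seconds per frame are spent.

import time
import cv2

# load one of OpenCV's bundled Haar cascades for people (full body) - an assumption,
# the earlier project may have used a different cascade / SVM model
body_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_fullbody.xml")

frame = cv2.imread("test_frame.jpg")          # hypothetical test frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

start = time.time()
bodies = body_cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=3)
elapsed = time.time() - start
print("humans found: {}, detection time: {:.2f} s".format(len(bodies), elapsed))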
The human detection system
The approach used for the human detection system is very well known in the deep neural network world: it is based on the transfer learning paradigm. I started from existing convolutional neural networks widely used for image recognition; in this project, I used the ResNet-18, ResNet-34, and AlexNet (eight layers) networks. These networks contain already trained layers able to identify different features (such as outlines, curves, edges, etc.) – layers that required a lot of training data and a lot of training time. To reuse all of this embedded knowledge, I removed only the last layer and replaced it with a new one having 512 inputs for the ResNet networks and 4096 inputs for AlexNet, and two outputs in both cases, corresponding to my two classes.
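In PyTorch/torchvision terms, this last-layer replacement can be sketched as below. The architectures and layer sizes (512 inputs for ResNet-18/34, 4096 for AlexNet, two outputs) come from the description above; the exact code used in the project may differ.

import torch
import torchvision

# start from ImageNet-pretrained backbones and replace only the final layer
resnet18 = torchvision.models.resnet18(pretrained=True)
resnet18.fc = torch.nn.Linear(512, 2)                 # ResNet-18/34: 512 inputs -> 2 classes

alexnet = torchvision.models.alexnet(pretrained=True)
alexnet.classifier[6] = torch.nn.Linear(4096, 2)      # AlexNet: 4096 inputs -> 2 classes

device = torch.device("cuda")                         # the Jetson Nano GPU
model = resnet18.to(device)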
In order to train the new last layer, a training database was created, having 240 images for the first class and the same number of images for the second class. To extract all of these images from the movies, a specific interface was designed – see the following figure.
The training video database was created from 17 different movies. Some of these films include people, but their most important feature is the diversity of landscapes – shady or bright forests, different types of trees, valleys, hills, rocky landscapes, mountain or country roads, etc. The interface lets us go through these videos frame by frame, or in jumps of five frames, movie after movie. At any moment, we can save the frame we consider relevant into one of three datasets. The first dataset (Dataset A) is the real one, used to train the classifier. The second (Dataset B) has only 20 images per class, for the same two classes, and was used to test the functions of the developed program. The third dataset (Dataset C) has three classes ((a) no humans detected, (b) one human detected, and (c) two or more humans detected) and will be used in future developments of the system.
The main elements of the code used to scroll through the movies, five frames at a time, are presented in the following snippet:
import cv2
import ipywidgets
…………………
# Read the video from the specified path - train
train_inputFile = "train1.mp4"
videoStream = cv2.VideoCapture(train_inputFile)
…………………
currentVideo = 1
…………………
next5_widget = ipywidgets.Button(description='Go forward 5 frames')
save_widget = ipywidgets.Button(description='Add')
…………………
# create the image preview widget
# (bgr8_to_jpeg() is a small helper, defined elsewhere, that encodes a BGR frame as JPEG bytes)
image_widget = ipywidgets.Image(width=600, height=400)
retval, frame = videoStream.read()
if retval:
    image_widget.value = bgr8_to_jpeg(frame)
…………………
def next5(c):
    global frame
    global videoStream
    global currentVideo
    # read five frames from the current video stream
    for i in range(5):
        retval, frame = videoStream.read()
        if retval:
            # frames are still left, continue with the next frame
            pass
        else:
            # no other frames: open a new video stream from the next movie
            videoStream.release()
            currentVideo += 1
            videoStream = cv2.VideoCapture("train{}.mp4".format(currentVideo))
            retval, frame = videoStream.read()
    image_widget.value = bgr8_to_jpeg(frame)

next5_widget.on_click(next5)

# save the current frame for the selected category and update the counts
# (dataset, category_widget and count_widget are created in the omitted parts of the notebook)
def save(c):
    dataset.save_entry(frame, category_widget.value)
    count_widget.value = dataset.get_count(category_widget.value)

save_widget.on_click(save)
The main idea of the above code is to detect the moment when the current movie has no more frames and then to create a new video stream for the next film in the database.
In the testing process, the two classes have almost the same number of frames: 654 for the “no humans” class and 656 for the “humans” class. The Adam optimization algorithm (a combination of RMSprop and Stochastic Gradient Descent with momentum) was used to train all of the deep neural networks. The outputs of the networks go through a softmax function in order to obtain the “posterior classification probabilities”. Each of the three neural network structures was trained 12 different times, and the best results obtained are presented in the following table.
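The training step itself can be sketched as follows, assuming the model and device names from the snippet above, a train_loader built over Dataset A, and a test_batch of preprocessed test frames; these names and the hyperparameters are placeholders, not the exact project code.

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters())      # Adam optimizer, as described above
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(20):                               # 20 epochs, as mentioned later in the text
    for images, labels in train_loader:               # train_loader: a DataLoader over Dataset A
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()

# at test time, a softmax turns the raw outputs into "posterior classification probabilities"
model.eval()
with torch.no_grad():
    probabilities = F.softmax(model(test_batch.to(device)), dim=1)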
Analyzing the obtained data, we observe that AlexNet is the fastest of the networks analyzed here. AlexNet is three times faster than ResNet-34, but its classification performance on the human detection problem is catastrophic – only slightly better than random guessing.
The ResNet-18 model is the best choice for our problem. With high classification accuracy (90.46%) and a human detection rate greater than 2 frames/second, ResNet-18 is the natural choice for the human detection system. The next movie presents the entire process of testing the ResNet-18 neural structure.
ResNet-34 models the problem best, obtaining a classification accuracy of 93.2%. But its classification time (= frame acquisition time + time required to preprocess the frame + frame classification interval) is higher – around 736 ms, or about 1.35 frames/second. The obtained results are presented in the next picture.
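This decomposition of the classification time can be measured roughly as in the sketch below, reusing the videoStream, model and device names from the earlier snippets; the ImageNet-style preprocessing shown here is an assumption, not necessarily the exact pipeline used in the project.

import time
import cv2
import PIL.Image
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms

# standard ImageNet-style preprocessing (assumed)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

t0 = time.time()
retval, frame = videoStream.read()                                    # frame acquisition time
t1 = time.time()
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
batch = preprocess(PIL.Image.fromarray(rgb)).unsqueeze(0).to(device)  # frame preprocessing time
t2 = time.time()
with torch.no_grad():
    probs = F.softmax(model(batch), dim=1)                            # frame classification interval
t3 = time.time()
print("acquire {:.0f} ms, preprocess {:.0f} ms, classify {:.0f} ms".format(
      (t1 - t0) * 1e3, (t2 - t1) * 1e3, (t3 - t2) * 1e3))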
For the “AI at the Edge Challenge” competition, I proved that the NVIDIA Jetson Nano is a viable solution for the search and rescue of injured persons after forest fires, earthquakes, tsunami waves, etc.
Mainly due to adverse weather conditions (it was winter in my country during this competition), it was almost impossible to test the system on the drone outside in an open environment – my drone has all its electronic components (FMU, ESC, etc.) exposed, see the following figure. From here comes the next logical step: testing the system in real conditions, mounted on the NXP HoverGames drone.
Increasing the size of the database used to train the last layer of the neural network is required in order to improve the classification performance. The movies used to train the neural network were downloaded from YouTube and are drone recordings of mountain landscapes, forests, fields, roads, etc., with and without people in them. For example, analyzing the performance of identifying human subjects in the above movie, one can notice that performance begins to decrease when the film shows the river and the bridge – these two elements (the river and the bridge) were not present in any of the movies from which we extracted the training images. As a direct conclusion, increasing the training database to a minimum of 500 images per class (I estimate that an optimum would be around 1000) is a mandatory requirement.
The training time for 20 epochs, with 240 images per class, is a little more than one hour and 10 minutes on the NVIDIA Jetson Nano board. Increasing the number of images to 1000 per class will increase the training time accordingly. Hence the need to train the neural models on other, more powerful systems (maybe even in the cloud) and, in a second stage, to deploy the trained model on the NVIDIA Jetson Nano.
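A minimal sketch of that two-stage workflow, assuming the ResNet-18 model defined earlier and a hypothetical weights file name:

import torch
import torchvision

# stage 1, on the powerful training machine (or in the cloud): save only the learned weights
torch.save(model.state_dict(), "human_detection_resnet18.pth")

# stage 2, on the NVIDIA Jetson Nano: rebuild the same architecture and load the weights
model = torchvision.models.resnet18()
model.fc = torch.nn.Linear(512, 2)
model.load_state_dict(torch.load("human_detection_resnet18.pth", map_location="cuda"))
model = model.to(torch.device("cuda")).eval()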