New LEO satellite fleets can provide low-cost, global data communication to trail cameras used for wildlife conservation. However, very low bandwidth means that it is impractical to send traditionally encoded video and photos from these cameras via satellite. My goal is to use AI, coupled with known pre-deployment data, to "semantically compress" images from the camera into data that can be harvested remotely, ultimately via low-bandwidth satellite links.
This project is a "proof of concept" step towards this goal. It uses an energy-efficient object detection model and a Xilinx Kria KV260 to read MP4 video from a trail camera. A simple application then steps, frame by frame, through the video, counting the number of animals found in each frame, and uses the statistics over the entire video to assess its overall quality. The application uses the "confidence" score generated for each detection to track the highest quality frames, which can be used to prioritize which frames to send via satellite to make best use of this limited resource.
Labeling a Custom Dataset
I was fortunate to have a collection of tens of thousands of trail camera images, with diverse backgrounds, gathered primarily as research for my wife Janet Pesaturo’s book “Camera Trapping Guide”. These images were not labeled. Rather than hand labeling them, I used the Microsoft MegaDetector model to label them for me. The results were not perfect, but saved me a ton of time (and likely introduced about as much error as hand labeling would have).
I wrote a quick and dirty set of functions which take the .json file generated by MegaDetector, which encodes the data set in a hierarchical set of files, and create a randomized subsample of images, sized to the model input, for training and validation.
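These utilities were quick hacks, but the idea is simple. Here is a minimal Python sketch of the approach, assuming MegaDetector's batch-output JSON format (a top-level "images" list, each entry with a "file" name and a list of "detections" carrying "conf" scores); the function names and the confidence threshold are mine, not the actual code:

```python
import json
import random

# Keep only reasonably confident MegaDetector labels (threshold is a choice,
# not something MegaDetector mandates).
MIN_CONF = 0.8

def load_labeled_images(md_json_path):
    """Return a list of (filename, detections) pairs from a MegaDetector
    batch-output .json file, keeping only images with at least one
    detection at or above MIN_CONF."""
    with open(md_json_path) as f:
        results = json.load(f)
    labeled = []
    for entry in results.get("images", []):
        dets = [d for d in entry.get("detections", [])
                if d.get("conf", 0.0) >= MIN_CONF]
        if dets:
            labeled.append((entry["file"], dets))
    return labeled

def split_subsample(labeled, n_samples, val_fraction=0.2, seed=42):
    """Randomly subsample the labeled images and split them into
    (train, val) lists."""
    rng = random.Random(seed)
    sample = rng.sample(labeled, min(n_samples, len(labeled)))
    n_val = int(len(sample) * val_fraction)
    return sample[n_val:], sample[:n_val]
```

A separate pass (not shown) would copy and resize the sampled images to the model's input dimensions.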
Training on Google Colab
Google Compute Engine supports heavy duty instances with large memories and GPUs. Unfortunately, when I went to upgrade my standard VM to one with a GPU, Google refused, telling me that I had not yet demonstrated a reliable history of payment. Which is true enough -- I am still working through the $300 credit Google was so kind to add to my account when I signed up for GCE.
Not to be deterred, I did my training in Google Colab. I got the “Pro” version to guarantee access to long runs with higher performing GPUs. Even the “Pro” version limits GPU jobs to about 20 hours. Fortunately, the training code I used generated periodic checkpoints, which allowed me to complete long training runs as a series of 20-hour chunks.
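In practice, darknet drops periodic .weights checkpoints into a backup directory, so each new Colab session just needs to find the newest one and resume from it. A small helper along these lines (file names and the command layout are illustrative, not my exact code) makes that automatic:

```python
import os
import glob

def latest_checkpoint(backup_dir):
    """Return the most recently written darknet .weights checkpoint,
    or None if no checkpoint exists yet (first training chunk)."""
    weights = glob.glob(os.path.join(backup_dir, "*.weights"))
    if not weights:
        return None
    return max(weights, key=os.path.getmtime)

def training_command(cfg, data, backup_dir, initial_weights):
    """Build the darknet command line for the next training chunk,
    resuming from the latest checkpoint when one exists."""
    start = latest_checkpoint(backup_dir) or initial_weights
    return ["./darknet", "detector", "train", data, cfg, start]
```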
Colab is also a great environment for writing small utility functions in Python.
I used Google Drive – accessible to both Colab and GCE – to store persistent data (training images, model results, etc.), to move data from one environment to the other, and ultimately to ship it to the target KV260 platform.
Retraining the Yolov4 Model
Rather late in the process, I came upon the Yolo-V4 tutorial in the Xilinx example directory https://github.com/Xilinx/Vitis-AI-Tutorials/tree/1.3 . Although (IMO) this “tutorial” is a little sketchy in places, it provides the necessary assurance up front that, if one works through the inevitable mistakes, one can train a model that will eventually work in an application on the KV260. I wish I had happened on this example earlier.
Results of Training on 2000 Images:
calculation mAP (mean average precision)...
Detection layer: 82 - type = 28
Detection layer: 94 - type = 28
Detection layer: 106 - type = 28
400
detections_count = 1311, unique_truth_count = 597
class_id = 0, name = Animal, ap = 35.48% (TP = 171, FP = 72)
class_id = 1, name = Person, ap = 18.43% (TP = 4, FP = 0)
class_id = 2, name = Vehicle, ap = 55.00% (TP = 2, FP = 1)
for conf_thresh = 0.25, precision = 0.71, recall = 0.30, F1-score = 0.42
for conf_thresh = 0.25, TP = 177, FP = 73, FN = 420, average IoU = 50.27 %
IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
mean average precision (mAP@0.50) = 0.363063, or 36.31 %
A mean average precision of 36.3% is nothing by the standards of current commercial models, but it's a plausible start, and adequate for this POC.
Running the Vitis-AI Toolset on GCE
Using the YOLOv3 model trained in the Google Colab environment, I now needed to get that model into the DPU on the KV260. Doing so requires the Vitis-AI toolset. These are heavy duty tools, requiring lots of disk space and a fair amount of scripted setup. Since this is the 21st century, I used cloud resources. I’ve had pretty good luck with Google environments, so I decided to go with Google Compute Engine VMs. I used about $100 of the $300 credit for my new account on this project.
Converting the Darknet-Trained Model for the KV260
After training the model in Colab, I have a .cfg file and a .weights file. I need to end up with an .xmodel file. This involves several steps, each of which invokes a script file.
- 0-copy-models.sh: Copy the darknet model over from where I left it on Colab
- 1-copy-images-files.sh: Copy the image files needed for quantization
- 2-darknet_convert.sh: Convert from darknet to Caffe floating point model
- 3-prototxt-to-quant.sh: The tutorial describes a number of small edits to the .prototxt file at this stage to get it ready for the quantization tool. I used a bunch of sed commands in the script to automate the edits.
- 4-run_vai_q.sh: Run the VAI Quantizer
- 5-test_caffe_fp.sh: Sanity check on the floating point model -- runs against validation images
- 8-run_vai_c_kv260.sh: Run the VAI Compiler. This produces the .xmodel file to ship off to the KV260.
At first, I did these steps by hand, but after the 10th time or so, it became clear that automation would be a good idea.
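The automation amounts to running the numbered scripts in order and stopping at the first failure. A Python sketch of such a wrapper (the hard-coded script list mirrors the files above; run_steps and the overall structure are illustrative, not my exact driver):

```python
import subprocess
import sys

# The conversion scripts described above, in the order they must run.
STEPS = [
    "0-copy-models.sh",
    "1-copy-images-files.sh",
    "2-darknet_convert.sh",
    "3-prototxt-to-quant.sh",
    "4-run_vai_q.sh",
    "5-test_caffe_fp.sh",
    "8-run_vai_c_kv260.sh",
]

def run_steps(steps):
    """Run each step in order, stopping at the first failure.
    Returns the list of steps that completed successfully."""
    done = []
    for step in steps:
        print(f"=== running {step} ===")
        result = subprocess.run(["bash", step])
        if result.returncode != 0:
            print(f"step {step} failed (rc={result.returncode})", file=sys.stderr)
            break
        done.append(step)
    return done
```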
Note that when cross-compiling across separate host and target platforms, it is key that the VAI versions are the same in both environments. If they are not, the compiled .xmodel file will error out on the target, with perhaps the least useful error message ever. In my case, I ended up down-rev'ing the host environment to VAI 1.3 to align it with the version on my KV260 running Ubuntu.
Running On KV260
Once one has a compatible version of the toolset across host and target, things go smoothly: my test application, based on the Xilinx yolov3_voc sample video application, ran seamlessly after I pointed it at my new model.
$ ./test_mp4_animal_finder yolov3_ct -t 2 ~/Videos/2022-Adaptive-Compute/
Statistics Result
I summarize the detections in each video as a set of statistics. These track the number of frames with animals; the highest confidence frames (likely to be the higher quality images); and the sizes of the detections (likely to indicate the largest animals). Averages of these quantities give an overall score for the video.
DetectStats::print_summary: video of 1215 frames; 9.13 FPS
>= 0: 1215
>= 1: 119
>= 2: 0
>= 3: 0
Highest Quality Frame ID: 954
Max Quality : 0.845
Average Quality : 0.043
Largest Area Frame ID: 328
Max Area : 0.902
Average Area : 0.073
For example, the summary above is of a video with 1215 frames total, 119 with a single potential animal identification, and zero with 2 or more animal identifications. The highest quality frame was 954, with a confidence of 84%, but the average quality of the video overall was low -- less than 5%. Similarly, frame 328 had the detection with the largest area, at almost 90% of the total frame, but the overall video had very small area matches.
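The real application is C++, but the summary logic is easy to sketch in Python (class and field names here are illustrative, not the actual DetectStats implementation):

```python
class DetectStats:
    """Per-video summary of detections: counts of detections per frame,
    the highest-confidence frame, and the largest-detection frame."""
    def __init__(self):
        self.frame_counts = []           # number of detections in each frame
        self.best_quality = (None, 0.0)  # (frame_id, max confidence)
        self.best_area = (None, 0.0)     # (frame_id, max bbox area fraction)
        self.quality_sum = 0.0
        self.area_sum = 0.0

    def add_frame(self, frame_id, detections):
        """detections: list of (confidence, area_fraction) for one frame."""
        self.frame_counts.append(len(detections))
        top_conf = max((c for c, _ in detections), default=0.0)
        top_area = max((a for _, a in detections), default=0.0)
        if top_conf > self.best_quality[1]:
            self.best_quality = (frame_id, top_conf)
        if top_area > self.best_area[1]:
            self.best_area = (frame_id, top_area)
        self.quality_sum += top_conf
        self.area_sum += top_area

    def summary(self):
        n = len(self.frame_counts)
        return {
            "frames": n,
            "frames_with_animals": sum(1 for c in self.frame_counts if c >= 1),
            "best_quality_frame": self.best_quality[0],
            "max_quality": self.best_quality[1],
            "avg_quality": self.quality_sum / n if n else 0.0,
            "best_area_frame": self.best_area[0],
            "max_area": self.best_area[1],
            "avg_area": self.area_sum / n if n else 0.0,
        }
```

The averages are what make a low-quality video obvious: one great frame raises max_quality, but only consistent detections raise avg_quality.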
The video below contains video clips (shown in real time) and the summaries generated by a yolov3 model trained on 2000 images.
This small sample clearly illustrates the additional work needed to expand and diversify the training set, and is consistent with the relatively low mAP score cited earlier. For example, note that almost none of the deer or raccoons are identified. This is likely because these animals are so common in trail camera photos that they were less interesting for Janet's book, and are therefore under-represented in my training set.
Power
The KV260 comes with a handy utility for extracting platform statistics, including power consumption. I ran:
sudo xlnx-config --xmutil platformstats -p
in two modes: one to establish a baseline of the platform running just Linux, and the other to measure the impact of the host application and the DPU-executed model. The results are shown below:
Baseline (just Linux):
Power Utilization
SOM total power : 5120 mW
SOM total current : 1017 mA
SOM total voltage : 5033 mV
AMS CTRL
System PLLs voltage measurement, VCC_PSLL : 1195 mV
PL internal voltage measurement, VCC_PSBATT : 716 mV
Voltage measurement for six DDR I/O PLLs, VCC_PSDDR_PLL : 1800 mV
VCC_PSINTFP_DDR voltage measurement : 839 mV
PS Sysmon
LPD temperature measurement : 29 C
FPD temperature measurement (REMOTE) : 29 C
VCC PS FPD voltage measurement (supply 2) : 840 mV
PS IO Bank 500 voltage measurement (supply 6) : 1795 mV
VCC PS GTR voltage : 856 mV
VTT PS GTR voltage : 1801 mV
PL Sysmon
PL temperature : 27 C
Running the Video Processing App and Yolov3 Model at 9 FPS:
Power Utilization
SOM total power : 6310 mW
SOM total current : 1255 mA
SOM total voltage : 5033 mV
AMS CTRL
System PLLs voltage measurement, VCC_PSLL : 1192 mV
PL internal voltage measurement, VCC_PSBATT : 717 mV
Voltage measurement for six DDR I/O PLLs, VCC_PSDDR_PLL : 1800 mV
VCC_PSINTFP_DDR voltage measurement : 838 mV
PS Sysmon
LPD temperature measurement : 29 C
FPD temperature measurement (REMOTE) : 29 C
VCC PS FPD voltage measurement (supply 2) : 840 mV
PS IO Bank 500 voltage measurement (supply 6) : 1794 mV
VCC PS GTR voltage : 857 mV
VTT PS GTR voltage : 1801 mV
PL Sysmon
PL temperature : 27 C
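Rather than eyeballing the two dumps, the relevant line can be pulled out with a small parser. A Python sketch, assuming the platformstats text format shown above (function names are mine):

```python
import re

def som_power_mw(stats_text):
    """Extract 'SOM total power' in mW from xmutil platformstats output."""
    m = re.search(r"SOM total power\s*:\s*(\d+)\s*mW", stats_text)
    if m is None:
        raise ValueError("SOM total power not found in stats output")
    return int(m.group(1))

def inference_power_mw(baseline_text, running_text):
    """Power attributable to the app + DPU model, by subtraction."""
    return som_power_mw(running_text) - som_power_mw(baseline_text)
```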
Discussion
I will assume that I will be able to substantially reduce the power consumed by the “host” processing (e.g. the Linux environment). This could involve migrating to PetaLinux or, more likely, an even lower overhead RTOS. That would leave the power consumed actually running the object detection model as the primary contributor to the system’s overall power.
Subtracting the power measured above, I found that the detection model consumed about 1.3 W. From this, I can calculate the energy to process one 20-second video to be about 167 Joules:
20 s × (60 FPS ÷ 9.3 FPS) × 1.3 J/s ≈ 167 Joules per video (processing)
I have measured the power consumed by the trail camera simply taking and storing video onto its SD card:
0.200 A × 12.4 V × 20 s = 2.48 W × 20 s = 49.6 Joules per 20-second video
My goal is to have the object detector consume no more than 10% of the overall camera + model power. As it stands, the POC consumes about 30x the energy per video vs. the goal.
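The arithmetic behind the 30x figure can be reproduced in a few lines (constants taken from the measurements above; note that a detector budget of 10% of total energy works out to one ninth of the camera's recording energy):

```python
# Energy to process one 20-second, 60 FPS video at the measured
# inference rate and model power.
VIDEO_S, CAMERA_FPS = 20, 60
MODEL_FPS, MODEL_W = 9.3, 1.3
processing_j = VIDEO_S * (CAMERA_FPS / MODEL_FPS) * MODEL_W  # ~167 J

# Energy for the trail camera itself to record the same video.
CAMERA_A, CAMERA_V = 0.200, 12.4
recording_j = CAMERA_A * CAMERA_V * VIDEO_S  # ~49.6 J

# Goal: detector <= 10% of (camera + detector) energy,
# i.e. detector <= recording energy / 9.
budget_j = recording_j / 9
overshoot = processing_j / budget_j  # ~30x over budget
```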
Thus, this POC demonstrates the substantial work ahead to improve energy efficiency. This will likely involve performance improvement over the entire pipeline, with a focus on reducing the size and improving the throughput of the model itself, and (eventually) throughput of the MPEG processing pipeline.
Conclusion and Next Steps
This proof of concept was successful in demonstrating the potential of edge-based image recognition and summary generation for remotely deployed trail cameras. A small program running on the ARM cores in a Linux environment on a Kria KV260 development board reads video directly from the trail camera as files over USB. The DPU then runs a Yolov3 detection model to find frames with class "animal", gathering statistics on every frame in every video to summarize and predict which frames (and which videos) are likely to be of highest quality.
The POC also showed several areas for future development. Some of these include:
- Training the Detection Model on a larger dataset: In this POC, I was limited in training time by the project deadline. Now that I have a complete flow that works, it’s time to increase the size and diversity of the training set. I have more private data. In addition, there are a number of pre-labeled camera trap training sets in the public domain.
- Experimenting w/ Detection Model Tradeoffs: The Yolov3 model in the Xilinx Model Zoo was the only object-detection model (of several I tried) that I could get to work end-to-end through the multiple steps between model and deployment on the KV260. I would like to experiment with energy efficiency vs. model accuracy tradeoffs using the EfficientDet family of object detection networks. In addition to promising best-in-class accuracy vs. operation intensity, these models can be scaled to optimize for power, speed, and accuracy.
- Performance Tuning Model on DPU: Xilinx has a number of tools to help understand and improve DPU performance while running models and applications. I didn’t have time to explore any of these for this project.
- Eliminating throughput bottlenecks: Several detection models have the potential, on the Xilinx KV260 DPU, of running at many hundreds of frames per second – much higher than the 60 FPS rate of the trail camera. Nonetheless, the POC is limited to approximately 9 FPS by the throughput of the YOLOv3 model. On higher throughput models (e.g. the densebox_640_360 used in the facedetect demos), I found the performance to be limited by the MPEG processing pipeline. It’s possible that behind this there is an IO bottleneck to the trail camera SD card over the USB cable. In any case, improving the model throughput and the MPEG decode rate will improve the overall throughput of the system. This will have the desirable result of reducing the energy required to run non-pipeline resources (e.g. the board OS).
- Power Management: For this POC, I used a Ubuntu OS running on all available ARM cores on the Xilinx SOC. The result, by my measurements, is that the baseline power of the platform is about 5 times more than the power consumed actually doing the animal detection inference (5 Watts vs. 1 Watt). Ideally, this ratio would be the reverse – with most of the power going to the computationally intense inference model.
- Hardware Integration: I used the KV260 development kit. This contains more hardware than I needed for this project, and is not yet ready for battery-powered deployment in a robust, weather-proof enclosure in the field. Developing a pared-down, battery-powered hardware platform with a robust waterproof enclosure is an obvious next step.
- Software Hardening: The POC software stack is suitable for a prototype, but lacks the features required in a real world deployment scenario. Next steps in the software will focus primarily on power management, ease of use, and robust error handling.
I spent most of the time in this project trying things that seemed promising, but which didn’t quite lead to a solution. Many of these likely reflect my own lack of expertise, if not lack of perseverance. But here is a short list, for what it’s worth.
EfficientDet-Lite
Since one of my design goals is to reduce power consumption, I initially targeted my development on EfficientDet-Lite models. These had the advantage of a straightforward example of retraining on a local data set. The generated TFLite models seemed like they would work with the DPU (per Xilinx documentation) via TVM.
Getting the TVM environment up (per examples on the TVM tutorial page) was not easy. I relied on help from @brunojje and finally got it all running, only to run into an unresolvable (in the time allowed) error from TVM when I tried to partition the model for the Xilinx DPU.
EfficientDet
On looking at the models in the Xilinx Model Zoo, I found several at least as large as the EfficientDet models, which promise higher accuracy per arithmetic operation and per unit of model size vs. other detection models. These are inherently TF2 constructs, and I had trouble getting them into a format that could be quantized by the TF2 vai quantizer.
Serengeti Trained Models
Sara Beery et al., in "Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection", document two detection models in the TensorFlow 1 model zoo which are pre-trained with camera-trap data: faster_rcnn_resnet101_snapshot_serengeti_2020_06-10 and the innovative, and more accurate, context_rcnn_resnet101_snapshot_serengeti_2020-06-10.
Unfortunately, I found that both contained constructs which were not supported by the TF1 vai quantizer.
References
Pesaturo, Janet. "Camera Trapping Guide: Tracks, Sign, and Behavior of Eastern Wildlife." Stackpole Books, 2018.
Beery, Sara, et al. "Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
Beery, Sara, Dan Morris, and Siyu Yang. "Efficient Pipeline for Camera Trap Image Review." arXiv preprint arXiv:1907.06772, 2019.
Redmon, Joseph, and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767, 2018.