Capturing the Interest of Robots
L-VITeX is a lightweight vision system, based on Edge Impulse's FOMO model, for resource-constrained robots engaged in terrain exploration.
Whether an autonomous robot is exploring the depths of the ocean, navigating the highways, or climbing the mountains of Mars, it needs a means to understand its surroundings. This information is essential for navigation, locating relevant objects, and other tasks required for carrying out its mission.
The real world is very complex, and it can be understood on many different levels. But it is impractical for a robotic system to attempt to understand everything about its environment. Instead, such systems often run a Region of Interest (RoI) detection algorithm that helps them locate only the relevant features in their surroundings.
These algorithms tend to be very computationally expensive, however. Where size, cost, and energy consumption are of little concern, deploying them is not especially challenging. But when it comes to small drones and other resource-constrained systems that have hard limits on their available onboard computational power, most traditional RoI detection algorithms are out of reach.
A pair of engineers at the Rajshahi University of Engineering & Technology and Brac University have recently developed what they call L-VITeX, which is a lightweight visual intuition system for terrain exploration designed for resource-constrained robots and swarms. By leveraging L-VITeX, robots can save time and conserve energy by focusing their efforts on important areas during their explorations.
The core component of L-VITeX is Edge Impulse’s FOMO (Faster Objects, More Objects) model, which utilizes a truncated version of the MobileNet-V2 architecture. Rather than predicting bounding boxes, FOMO divides each input image into a grid of cells (e.g., 8x8 pixels each) and identifies object centroids within each cell, which makes it computationally efficient. By quantizing the model, L-VITeX further reduces memory usage and power consumption, enabling real-time performance on low-power hardware like the ESP32-CAM.
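To make the centroid idea concrete, here is a minimal Python sketch of how a FOMO-style output can be interpreted. The model emits a coarse grid of per-cell class probabilities rather than bounding boxes; detections are simply the cells whose confidence clears a threshold, reported as centroid coordinates. The function name, grid dimensions, and threshold below are illustrative assumptions, not Edge Impulse's actual SDK interface.

```python
import numpy as np

def fomo_centroids(heatmap: np.ndarray, cell_px: int = 8, threshold: float = 0.5):
    """Convert a FOMO-style per-cell confidence grid into centroid detections.

    heatmap   -- (rows, cols) array of object-class probabilities, one per grid cell
    cell_px   -- size of each grid cell in input-image pixels (illustrative)
    threshold -- minimum confidence for a cell to count as a detection
    """
    detections = []
    rows, cols = heatmap.shape
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r, c]
            if score >= threshold:
                # Report the center of the cell in image coordinates --
                # FOMO localizes objects by centroid, not by bounding box.
                cx = c * cell_px + cell_px / 2
                cy = r * cell_px + cell_px / 2
                detections.append((cx, cy, float(score)))
    return detections

# Example: a 96x96 input with 8x8-pixel cells yields a 12x12 output grid.
grid = np.zeros((12, 12))
grid[4, 7] = 0.91  # one confident cell
print(fomo_centroids(grid))  # -> [(60.0, 36.0, 0.91)]
```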
L-VITeX employs an emphasis function (EF) that triggers specific actions when the FOMO model detects RoIs in the environment. For example, in a proof-of-concept experiment with a TinyTurtle robot, the EF was programmed to activate a “Look Close” behavior, directing the robot to slow down and approach the detected RoI for a closer inspection. This ensures that the robot gathers detailed visual data from areas of interest rather than wasting resources on less relevant surroundings.
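The authors do not publish the emphasis function's code, but the behavior they describe is straightforward to sketch. The following hypothetical control loop assumes simple `set_speed` and `steer_toward` motor helpers on the robot object; only the slow-down-and-approach logic mirrors the “Look Close” behavior described above.

```python
CRUISE_SPEED = 0.6   # normalized forward speed while exploring (assumed value)
INSPECT_SPEED = 0.2  # reduced speed for the "Look Close" behavior (assumed value)

def emphasis_function(detections, robot, frame_width=96):
    """Trigger the 'Look Close' behavior when the model reports an RoI.

    detections -- list of (cx, cy, score) centroids from the FOMO model
    robot      -- hypothetical driver exposing set_speed() and steer_toward()
    """
    if not detections:
        robot.set_speed(CRUISE_SPEED)  # nothing of interest: keep exploring
        return

    # Focus on the most confident region of interest.
    cx, cy, score = max(detections, key=lambda d: d[2])

    # "Look Close": slow down and steer toward the detected centroid
    # so the camera can gather detailed visual data from the RoI.
    robot.set_speed(INSPECT_SPEED)
    robot.steer_toward((cx - frame_width / 2) / (frame_width / 2))  # -1..1 offset
```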
The performance of the system was assessed in a number of experiments. Using a dataset of drone video footage, the floating-point model performed well, achieving an F1 score of 0.92 at higher resolutions (64x64 and 96x96), with accuracy reaching up to 98.51 percent. However, increasing the resolution also led to higher latency and Peak RAM Occupation (PRO). The quantized integer (int8) model offered a significant reduction in latency and PRO while maintaining similar accuracy and F1 scores, particularly at higher resolutions.
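Edge Impulse handles quantization as part of its deployment pipeline, but for readers curious what producing an int8 model of this kind looks like, here is a minimal sketch using TensorFlow Lite's standard post-training quantization. The `model` and `representative_images` names are placeholders, and this is a generic illustration rather than the paper's exact toolchain.

```python
import tensorflow as tf

def representative_dataset():
    # Yield a handful of typical input images so the converter can
    # calibrate the int8 scaling factors (placeholder generator).
    for image in representative_images:
        yield [image[tf.newaxis, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()  # smaller, faster model for the microcontroller
```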
On another dataset, this one targeted at rock detection by rovers, the floating-point model again performed well, with F1 scores improving from 0.63 at a resolution of 32x32 to 0.88 at 96x96. Once again, the int8 model delivered similar results but with better efficiency in terms of latency and memory usage.
This research demonstrates the potential of a lightweight, FOMO-based object detection system for vision-guided terrain exploration. While challenges remain in detecting low-contrast objects, the work establishes a foundation for improving vision-based exploration tasks, with future efforts focusing on enhancing detection accuracy in more complex scenarios.