In this post, we share tips for a multi-view camera setup to help you estimate the 3D positions of fast-moving objects across the sky. This past weekend's FleetWeek in San Francisco presented an opportunity to test the limits of localization with a 37-foot-wide baseline.
What is a Baseline? What is Stereo Vision?
Something special happens when you add a second camera to your vision projects. Using another camera makes it possible to match points from one camera frame to their respective positions in the second camera's frame.
What do you get? The ability to apply trigonometry to infer spatial relationships like depth!
Localizing objects within the image using techniques like object detection provides just one piece of the puzzle; by pairing it with depth estimation, you can resolve the 3D positions of objects.
Luxonis's powerful OAK-D cameras support real-time stereo vision using the DepthAI SDK. Still, with a baseline of 7.5 cm, this configuration lacks the sensitivity to localize objects at much greater distances. We wanted to spot the Blue Angels soaring hundreds of feet above our apartment, aiming to resolve their spatial positions.
During FleetWeek 2022, we tested this idea with a baseline of approximately 12 feet, with encouraging results. In that initial foray, we simply connected the two cameras to our laptop, stretched out the 25-foot USB-C cables powering them, and collected the data.
This year, we upped the ante with a second computer, placing the cameras at opposite ends of our apartment to increase the baseline and max out the sensitivity of our setup to jets coming from city blocks away.
However, accurate stereo matching of fast-moving objects also requires synchronizing image frames and capturing at the highest possible resolution. That is because stereo matching is done by identifying pixel-by-pixel correspondences to estimate the disparity between these distinct views.
It's also important to have a high degree of overlap between views so that there are many correspondences to perform this matching robustly. And so, setting up the cameras to share large parts of the field of view is key.
At the same time, we want to record at higher frame rates for easier approximate synchronization. Add to this the fact that 4K images take up a lot of space and that processing larger frames takes more memory, and we opted to record at 1280x720 resolution.
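To make this concrete, here is a minimal sketch (assuming the DepthAI Python API and illustrative values for the stream name and frame rate; in practice we drove the cameras through the DepthAI ROS nodes described below) of a single OAK-D color stream configured at 1280x720 and a higher frame rate:

```python
import depthai as dai

# Build a simple pipeline: one color camera streaming 720p video to the host.
pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
cam.setVideoSize(1280, 720)  # crop to 720p to keep file sizes manageable
cam.setFps(60)               # higher frame rates ease approximate synchronization

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("video")
cam.video.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue("video", maxSize=4, blocking=False)
    while True:
        frame = queue.get()               # dai.ImgFrame
        timestamp = frame.getTimestamp()  # kept for cross-camera alignment later
        image = frame.getCvFrame()        # BGR numpy array, ready for OpenCV
```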
ROS to the Rescue
Since our cameras are distributed between two separate host machines, getting synchronicity is not as easy as letting the cameras roll in 5-hour bursts over each of the three days of shows. In fact, you also have to consider storage for all that video!
And so we used the DepthAI ROS nodes to get timestamped frames we could approximately align, and we manually triggered recordings when the jets entered the field of view.
Afterward, we could extract RGB image frames from the rosbags recorded on each device and align them by timestamp. We found their Docker images easy to work with and made only minor modifications to the launch files.
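For reference, extracting and pairing frames looks roughly like the sketch below, using the ROS 1 rosbag and cv_bridge Python APIs (the bag filenames and image topics are placeholders, not our actual names):

```python
import rosbag
from cv_bridge import CvBridge

bridge = CvBridge()

def extract_frames(bag_path, topic):
    """Return a list of (timestamp_seconds, BGR image) tuples from a rosbag."""
    frames = []
    with rosbag.Bag(bag_path) as bag:
        for _, msg, _ in bag.read_messages(topics=[topic]):
            frames.append((msg.header.stamp.to_sec(),
                           bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")))
    return frames

left = extract_frames("left_camera.bag", "/oak_left/color/image_raw")
right = extract_frames("right_camera.bag", "/oak_right/color/image_raw")

# Pair each left frame with the closest-in-time right frame, keeping only
# pairs that land within roughly one frame period of each other.
pairs = []
for t_left, img_left in left:
    t_right, img_right = min(right, key=lambda f: abs(f[0] - t_left))
    if abs(t_right - t_left) < 1.0 / 30:
        pairs.append((img_left, img_right))
```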
We're Gonna Need a Bigger ArUco Board
Typically, stereo calibration is done using a reference ArUco board.
They don't make markers you can paint across the sky to calibrate your stereo pair, so we needed another way to estimate the transformation that warps the image from one camera's frame to the other's.
That is because finding corresponding pixels between two images is MUCH simpler if we can first reduce the problem to searching in one dimension instead of two.
This begins by identifying robust feature keypoints in each frame before matching across frames to determine the homography matrix used to transform or warp one image to the other.
Getting to the KeyPoints
OpenCV provides a variety of methods to identify these keypoints, such as SIFT and ORB. Since so much of the targeted field of view includes a featureless blue sky, it was important to frame our shots such that buildings and trees occupied an appreciable part of the scene.
To make feature detection even more robust, we employed hloc (hierarchical localization), which uses pre-trained neural networks such as SuperGlue, optimized to identify robust features for matching. This allowed us to estimate the camera pose, which was important for warping the image for stereo rectification.
Furthermore, we applied Lowe's ratio test, a standard technique for rejecting ambiguous matches, together with RANSAC to estimate correspondences more reliably. Essentially, we reject candidate feature point matches between images for which we are less confident.
We can even use OpenCV to take our estimates to the subpixel level, with refinement methods designed to improve our matches even further.
Now, with shots well-framed to find many good feature keypoints, powerful deep learning techniques, and a sampling strategy we can rely on, we can obtain estimates for the homography matrix used to warp images with OpenCV.
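Putting those pieces together, a classical OpenCV version of this step looks roughly like the following (we used hloc's learned features in practice; SIFT stands in here to keep the sketch self-contained, and the image paths are placeholders):

```python
import cv2
import numpy as np

# Load a pair of approximately synchronized frames as grayscale.
gray_left = cv2.imread("left_frame.png", cv2.IMREAD_GRAYSCALE)
gray_right = cv2.imread("right_frame.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and descriptors in each frame (SIFT shown; ORB also works).
sift = cv2.SIFT_create()
kp_left, des_left = sift.detectAndCompute(gray_left, None)
kp_right, des_right = sift.detectAndCompute(gray_right, None)

# k-NN match, then apply Lowe's ratio test to drop ambiguous matches.
matcher = cv2.BFMatcher()
knn_matches = matcher.knnMatch(des_left, des_right, k=2)
good = [m for m, n in knn_matches if m.distance < 0.75 * n.distance]

# Estimate the homography with RANSAC to reject the remaining outliers.
src_pts = np.float32([kp_left[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp_right[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 3.0)

# Warp the left frame into the right frame's view, ready for stereo matching.
height, width = gray_right.shape
warped_left = cv2.warpPerspective(gray_left, H, (width, height))
```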
Since we are working with video, we were able to aggregate estimates of the homography matrix over several successive frames to get a better transformation.
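One simple way to do that aggregation (an illustrative choice, not the only option) is to normalize each per-frame estimate and take an element-wise median, which is robust to the occasional bad frame:

```python
import numpy as np

def aggregate_homographies(homographies):
    """Combine per-frame homography estimates into a single, steadier matrix."""
    normalized = [H / H[2, 2] for H in homographies]   # fix the overall scale
    return np.median(np.stack(normalized), axis=0)     # element-wise median
```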
Having warped one frame to another, we can obtain a disparity map by performing stereo matching, again using OpenCV's rich library.
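For the matching step, OpenCV's semi-global block matcher is a reasonable starting point; the sketch below continues from the warped pair above, and the parameters are illustrative defaults rather than the exact values we tuned:

```python
import cv2

# Semi-global block matching on the warped/rectified pair.
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,      # must be a multiple of 16
    blockSize=5,
    P1=8 * 1 * 5 ** 2,       # smoothness penalties (single-channel input)
    P2=32 * 1 * 5 ** 2,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = stereo.compute(warped_left, gray_right).astype("float32") / 16.0
```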
Eking Out Better Disparity Maps
Now that we have disparity maps, we're practically there! Obtaining depth from disparity relies on a simple parametric equation relating terms intrinsic to your camera device and the baseline distance.
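Concretely, depth = focal length (in pixels) x baseline / disparity, so a wider baseline yields larger, easier-to-measure disparities at a given distance. A tiny helper, with placeholder numbers rather than our calibration values:

```python
import numpy as np

def depth_from_disparity(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (in pixels) to depth (same unit as the baseline)."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0                        # zero disparity means no match
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Placeholder example: ~1000 px focal length, ~11.3 m (37 ft) baseline.
# depth_map = depth_from_disparity(disparity, focal_length_px=1000.0, baseline_m=11.3)
```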
However, the disparity maps may still contain a lot of noise. And so, since our camera setup is stationary and we don't have to deal with the complexities of motion compensation, we can apply temporal smoothing techniques to help filter this noise from consideration.
We also applied Weighted Least Squares (WLS) filtering to post-process the disparity maps further. Still, we found this most effectively applied only to the bottom third of the frames, which is occupied by buildings and thus stationary.
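As a sketch of that post-processing (the smoothing factor and the bottom-third split are illustrative assumptions), temporal smoothing can be as simple as an exponential moving average, and the WLS filter comes from OpenCV's ximgproc module in opencv-contrib-python:

```python
import cv2

# Exponential moving average over successive disparity maps; our cameras are
# stationary, so no motion compensation is needed.
def smooth_disparity(previous, current, alpha=0.2):
    return current if previous is None else alpha * current + (1 - alpha) * previous

# WLS filtering needs a left and a right matcher; continue from the SGBM sketch.
left_matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)
wls = cv2.ximgproc.createDisparityWLSFilter(matcher_left=left_matcher)
wls.setLambda(8000.0)
wls.setSigmaColor(1.5)

disp_left = left_matcher.compute(warped_left, gray_right)
disp_right = right_matcher.compute(gray_right, warped_left)
filtered = wls.filter(disp_left, warped_left, None, disp_right) / 16.0

# Apply the WLS result only to the bottom third of the frame (buildings).
h = disparity.shape[0]
disparity[2 * h // 3:, :] = filtered[2 * h // 3:, :]
```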
Next time, we could use lenses with a wider FOV to capture more of the sky before dewarping the images. We could also try capturing with cameras that reach higher frame rates to get closer to synchronized images, if not using hardware-synchronized devices.
Also, the DepthAI cameras have Intel's Myriad X Vision Processing Unit (VPU) for fast, vectorized compute, enabling real-time object detection on-device. We could use this to trigger recordings when jets are detected, which could have helped us avoid missing a few excellent shots.
Ultimately, we obtained disparity maps that let us infer great distances with our setup, all without calibration boards.
Would you like to learn more? Join us for discussion on Discord!