Apple Researchers Release Depth Pro, a Zero-Shot Model for Sub-Second Depth Mapping on Any Single Image
Calculating an accurate depth map from a single 2D image in just 0.3 seconds, Depth Pro is impressive, and its code and weights are on GitHub now.
Apple researchers have developed a foundation model that, they say, can deliver a sharp depth map from any single two-dimensional image in less than a second: Depth Pro.
"We present a foundation model for zero-shot metric monocular depth estimation," the research team explains. "Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions."
Depth mapping is handy for everything from robotic vision to blurring the background of images post-capture. Typically, it relies either on capturing the scene from two slightly different angles, as with smartphones that use the disparity between the images from multiple rear-facing cameras to calculate depth and separate foreground from background, or on a distance-measuring technology such as lidar. Depth Pro requires neither, yet Apple claims it can turn a single two-dimensional image into an accurate depth map in well under a second.
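For context, the stereo approach those multi-camera systems use boils down to triangulation: depth is the focal length multiplied by the baseline between the two cameras, divided by the disparity of a point between the two images. A minimal sketch of that relationship, with illustrative numbers that are assumptions rather than figures from the paper:

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic stereo triangulation: depth = f * B / d.

    focal_px:     camera focal length, in pixels
    baseline_m:   distance between the two cameras, in meters
    disparity_px: horizontal shift of a point between the two images, in pixels
    """
    return focal_px * baseline_m / disparity_px

# Illustrative numbers (assumptions, not from the paper): a smartphone-like
# camera pair with a 1,400 px focal length and a 12 mm baseline sees a point
# shift 28 px between its two images when the subject is about 0.6 m away.
print(stereo_depth_m(1400.0, 0.012, 28.0))  # 0.6
```

Depth Pro has no second image to measure disparity against, which is what makes a metric, absolute-scale result from a single frame notable.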
"The key idea of our architecture," the researchers explain, "is to apply plain ViT [Vision Transformer] encoders on patches extracted at multiple scales and fuse the patch predictions into a single high-resolution dense prediction in an end-to-end trainable model. For predicting depth, we employ two ViT encoders, a patch encoder and an image encoder."
Depth Pro's results are impressive: working with an image encoder resolution of 384×384 and a network resolution of 1536×1536, the model delivers depth maps accurate enough to pick out the individual whiskers on a bunny's face and the contents of a cage distinct from the bars surrounding it. It's also fast: in testing, Depth Pro delivers its results in just 0.3 seconds per image — though this, admittedly, is based on running the model on one of NVIDIA's high-end Tesla V100 GPUs.
A preprint of the researchers' work is available on Cornell's arXiv server under open-access terms; Apple has also made sample code and model weights available on GitHub under a custom open-source license.
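For those who want to try it, the repository documents a short Python path from image to metric depth map. The snippet below follows the README's documented usage at the time of writing; the function names (depth_pro.create_model_and_transforms, depth_pro.load_rgb, model.infer) come from that README, and anything beyond them, such as the file name, is an assumption.

```python
import depth_pro  # from Apple's ml-depth-pro repository on GitHub

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an image; f_px is the focal length in pixels, if the file's
# metadata provides one (the model can estimate it otherwise).
image, _, f_px = depth_pro.load_rgb("example.jpg")  # assumed file name
image = transform(image)

# Run inference: the prediction includes metric depth with absolute scale.
prediction = model.infer(image, f_px=f_px)
depth_m = prediction["depth"]            # per-pixel depth, in meters
focal_px = prediction["focallength_px"]  # estimated focal length, in pixels
```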