Point and Search
WatchThis uses computer vision and GPT-4o to let users point at objects, snap a picture, and ask questions about the world around them.
When someone asks you a question that you do not know the answer to, how do you typically respond? Most people's first instinct is to tell them to "Google it." This, of course, means grabbing a digital device, launching a web browser, typing a search query into Google, and scanning the results for an answer. But this is 2024! Technology has advanced tremendously since "Google" first became a verb a couple of decades ago. Furthermore, a text-based query is not always the best way to seek out an answer, especially if you want more information about a nearby physical object that is not easy to describe.
A team at the MIT Media Lab has hacked together a solution that they believe could make it easier to get answers to your burning questions. Their wrist-worn prototype, called WatchThis, combines computer vision with a large language model to gather more information about one's surroundings. With WatchThis, you simply point and search.
The device consists of a Seeed Studio XIAO ESP32S3 Sense development board, which is powered by a dual-core ESP32-S3 microcontroller and supports both Wi-Fi and Bluetooth wireless communication. This is paired with an OV2640 camera module and a Seeed Studio Round Display for XIAO, a 1.28-inch touchscreen. The system runs on a LiPo battery and attaches to the wrist with a strap and a 3D-printed enclosure.
To use WatchThis, the wearer flips the display up so that it faces them. The camera is mounted on the rear side of the display, so it captures a live video stream of whatever the wearer points it at, and that stream is shown on the screen. The user then points a finger at an object of interest and taps the screen with their other hand, which causes the device to capture a still image of the scene.
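To make that capture step concrete, here is a minimal Arduino-style sketch of how a tap-to-capture loop could look on the XIAO ESP32S3 Sense. This is not the project's published firmware: `initCamera()` and `screenTapped()` are placeholder stubs standing in for the board-specific camera setup and the round display's touch library.

```cpp
// Hedged sketch: poll for a tap and grab one JPEG frame from the ESP32 camera driver.
#include <Arduino.h>
#include "esp_camera.h"

// Placeholder: real firmware would call esp_camera_init() with the
// XIAO ESP32S3 Sense pin assignments and select JPEG output.
bool initCamera() { return true; }

// Placeholder: real firmware would read the round display's touch controller.
bool screenTapped() { return false; }

void setup() {
  Serial.begin(115200);
  if (!initCamera()) {
    Serial.println("Camera init failed");
    while (true) delay(1000);
  }
}

void loop() {
  if (screenTapped()) {
    camera_fb_t *fb = esp_camera_fb_get();   // one frame from the live stream
    if (fb != nullptr) {
      Serial.printf("Captured %u-byte JPEG\n", (unsigned)fb->len);
      // fb->buf / fb->len would be handed to the GPT-4o request shown below.
      esp_camera_fb_return(fb);              // release the frame buffer
    }
  }
  delay(20);
}
```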
A companion smartphone app is used to type a question. That question, along with the captured image, is sent to OpenAI's GPT-4o model via the official API. This model can analyze both images and text and reason about them to answer questions. When the model's answer comes back, it is displayed on the screen, overlaid on the captured image, for a few seconds before the device returns to its normal operating mode. Typical response times are in the neighborhood of three seconds, making WatchThis reasonably snappy.
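Here is a hedged sketch of what that request could look like if it were made directly from the ESP32 over Wi-Fi (the write-up does not specify which device actually calls the API). It base64-encodes the captured JPEG and posts it, along with the typed question, to OpenAI's chat completions endpoint. The helper name `askGpt4o`, the placeholder API key, and the assumption that Wi-Fi is already connected are all illustrative.

```cpp
#include <Arduino.h>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>
#include "mbedtls/base64.h"

static const char *OPENAI_API_KEY = "sk-...";   // placeholder: supply a real key

// Send the captured JPEG and the typed question to GPT-4o, return the answer text.
String askGpt4o(const uint8_t *jpeg, size_t jpegLen, const String &question) {
  // Base64-encode the JPEG so it can be embedded as a data URL in the request.
  size_t b64Cap = 4 * ((jpegLen + 2) / 3) + 1;
  size_t b64Len = 0;
  unsigned char *b64 = (unsigned char *)malloc(b64Cap);
  if (b64 == nullptr) return "out of memory";
  mbedtls_base64_encode(b64, b64Cap, &b64Len, jpeg, jpegLen);
  b64[b64Len] = '\0';

  // Multimodal chat completions payload: one user message carrying the question
  // text plus the image. Assumes the question needs no JSON escaping.
  String payload =
      String("{\"model\":\"gpt-4o\",\"messages\":[{\"role\":\"user\",\"content\":[") +
      "{\"type\":\"text\",\"text\":\"" + question + "\"}," +
      "{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64," +
      (const char *)b64 + "\"}}]}]}";
  free(b64);

  WiFiClientSecure client;
  client.setInsecure();   // sketch only: a real build should validate OpenAI's TLS certificate

  HTTPClient http;
  http.begin(client, "https://api.openai.com/v1/chat/completions");
  http.addHeader("Content-Type", "application/json");
  http.addHeader("Authorization", String("Bearer ") + OPENAI_API_KEY);

  String answer = "request failed";
  if (http.POST(payload) == 200) {
    // Pull choices[0].message.content out of the JSON response (ArduinoJson v6).
    DynamicJsonDocument doc(8192);
    if (deserializeJson(doc, http.getString()) == DeserializationError::Ok) {
      answer = doc["choices"][0]["message"]["content"].as<String>();
    }
  }
  http.end();
  return answer;   // the firmware would overlay this text on the captured image
}
```

Hand-building the JSON and skipping certificate validation keep the sketch short; a production build would escape the question string properly and pin OpenAI's certificate.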
The developers chose a smartphone app for question entry to keep input accurate, but having to pull out another device to type a question is a bit clunky. One question immediately raised by this arrangement is why the entire system does not just run on the smartphone, which already has an integrated camera and can certainly make an API call. Voice recognition, even if it proved less accurate than typing, could make WatchThis much more natural and efficient to use. Perhaps after enhancements like these, we will be telling people to "WatchThis it." Hey, "Google it" did not always roll so easily off the tongue either, you know!