See What You Mean
MIT researchers developed a practical, self-supervised method to make videos searchable so that you can locate specific segments in a jiffy.
Some people prefer to learn by reading about a topic, while others are not interested unless there is a video that they can watch. While individual opinions on the matter differ, it is hard to deny that each format has its pros and cons. Text can easily be scanned or searched for a particular piece of information, whereas this is much more cumbersome to do with a video. However, videos offer the advantage of being much more dense in terms of the information that they convey, which is especially helpful when it comes to demonstrating a skill, for example.
What if both searchability and information density could be combined? Researchers have attempted to bring simple searches to videos, allowing viewers to quickly jump to a specific segment that they are interested in by running a quick text-based search. But to date, these approaches have proved to be of limited value in real-world scenarios. The problem is that the machine learning algorithms that power these tools rely on huge amounts of manually annotated video data for training. Producing these types of datasets is prohibitively time-consuming and expensive for anything other than a very narrowly focused use case.
Researchers at MIT and the MIT-IBM Watson AI Lab have developed an approach that may upend this paradigm. Their self-supervised spatio-temporal grounding method trains on raw video data, with no manual annotation required. After training is complete, a user types a brief description of what they are looking for, and the tool predicts the precise location in the video where that event occurs.
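To make the idea of text-based search over a video concrete, here is a minimal sketch of the lookup step only. It is not the researchers' implementation. It assumes that some already-trained model has mapped each video frame and the text query into the same feature space, and it uses a simple hypothetical rule (cosine similarity plus a cutoff around the best-scoring frame) to pick a time span. The names `locate_segment`, `keep_ratio`, and `fps` are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def locate_segment(frame_features, query_feature, fps=1.0, keep_ratio=0.8):
    """Return (start_seconds, end_seconds) of the span that best matches the query.

    Assumption: frame_features holds one feature vector per sampled frame and
    query_feature is the text query embedded into the same space.
    """
    if not frame_features:
        raise ValueError("frame_features must not be empty")

    # Score every frame against the query.
    scores = [cosine(f, query_feature) for f in frame_features]

    # Start from the best-matching frame.
    peak = max(range(len(scores)), key=scores.__getitem__)
    cutoff = scores[peak] * keep_ratio

    # Grow the span left and right while neighbouring frames stay above the cutoff.
    start = peak
    while start > 0 and scores[start - 1] >= cutoff:
        start -= 1
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= cutoff:
        end += 1

    # Convert frame indices to seconds; the end bound is exclusive.
    return start / fps, (end + 1) / fps


# Toy data: frames 2 through 4 point in roughly the same direction as the query.
frames = [[1, 0], [0.9, 0.1], [0.1, 0.9], [0, 1], [0.2, 0.8], [1, 0]]
query = [0, 1]
print(locate_segment(frames, query))
```

In a real system the feature vectors would come from learned video and text encoders. The hand-written two-dimensional vectors here exist only to show the lookup logic.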
The approach begins with an unlabeled dataset, as well as automatically generated annotations, such as those that are produced by YouTube’s closed captioning tool. This data is then fed into a training process that has two distinct stages. The first stage operates at a high level to understand what actions happen throughout the course of a video, and when they happen. The second stage drills down to a lower level of detail to identify specific features that are of interest. For example, in a cooking demonstration, this second stage may identify a spoon or a knife lying on the counter, rather than just the act of cooking itself.
Under ideal conditions, these steps may be sufficient. But in the real world, actions and spoken descriptions of actions may not be aligned. The demonstrator may, for example, discuss what they intend to do right before they actually do it. For this reason, the algorithm incorporates a feature that serves to disentangle these misalignments.
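One simple way to picture the misalignment problem is sketched below. This is an illustrative assumption, not the method from the research. Instead of trusting that the frames at the exact caption timestamp show the described action, the sketch searches a window around the caption and keeps the frames that best match the text. The function name `select_aligned_frames` and the parameters `window_s` and `k` are hypothetical, and the similarity scores are assumed to come from some existing video-text model.

```python
def select_aligned_frames(similarities, caption_time, fps=1.0, window_s=5.0, k=3):
    """Pick the k frame indices near a caption that best match its text.

    similarities: per-frame similarity between each frame and the caption text
                  (assumed to be produced by some video-text model).
    caption_time: time in seconds at which the caption was spoken.
    window_s:     how far, in seconds, to look before and after the caption.
    """
    center = int(round(caption_time * fps))
    radius = int(round(window_s * fps))

    # Clamp the search window to the bounds of the video.
    lo = max(0, center - radius)
    hi = min(len(similarities), center + radius + 1)

    # Rank candidate frames by similarity and keep the top k.
    best = sorted(range(lo, hi), key=lambda i: similarities[i], reverse=True)[:k]

    # Return the chosen frames in temporal order.
    return sorted(best)


# Toy example: the caption is spoken at t=2s, but the action peaks around t=4s.
sims = [0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.7, 0.1]
print(select_aligned_frames(sims, caption_time=2, window_s=3, k=3))
```

With the toy scores above, the selected frames fall after the caption timestamp, mirroring the case where a demonstrator describes a step right before performing it.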
The team could not find a good way to evaluate their work, since large, well-annotated video datasets with precise labeling of the start and end times of each action were hard to come by. To remedy this situation, they built their own dataset to help them benchmark their algorithm. After defining an appropriate annotation technique and building up a dataset, they used it to evaluate their system. During the evaluation, they found that their new approach was generally much more accurate in identifying specific actions in videos than existing methods. It also proved to be much better at identifying human-object interactions, which are crucial in identifying a great many actions of interest.
In the future, the researchers plan to extend their approach to also include audio data, since sounds are often strongly correlated with actions. They believe that with some refinement, their approach may prove to be useful in learning all sorts of skills. It might even assist health care professionals in reviewing diagnostic videos one day.