Hearing the Big Picture
DenseAV learns sight-sound associations from videos without annotations, making multimodal AI systems more practical to train.
Recent advances in machine learning have produced many useful tools powered by algorithms adept at, for example, understanding natural language, recognizing objects in images, or transcribing speech. Most researchers agree, however, that the next frontier in the field is multimodal models that can understand several types of data at once. These models can build a much richer understanding of the world, which will help engineers create more useful AI-powered applications in the future.
But training these models presents a number of unique challenges. In particular, sourcing sufficiently large training datasets with proper annotations for each data source can prove too expensive and time-consuming to be practical. This paradigm is increasingly being questioned. After all, a young child learns associations between sights and sounds, like the sound of a bark and the appearance of a dog, without guidance from a parent. Given the efficiency of natural systems, there must be better ways to build our artificial ones.
One such system was recently proposed by a team led by researchers at MIT’s CSAIL. Their algorithm, called DenseAV, watches videos and learns the associations between sounds and sights. Crucially, it does this without requiring a pre-trained model or an annotated dataset. Instead, it parses a large volume of video data and makes sense of it entirely on its own, much like a young child.
DenseAV is composed of two separate components: one that processes video, the other audio. This separation ensures that each component extracts meaningful features from its own data source; neither can look at the other’s notes. The two independent signals are then compared to see when they match up. Using this contrastive learning approach, the model picks important patterns out of the data itself, without any labels.
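To make the contrastive idea concrete, below is a minimal sketch of a dual-encoder setup in PyTorch. This is a toy illustration, not DenseAV’s actual architecture: the encoder designs, embedding sizes, and InfoNCE-style loss are assumptions chosen to show how the pairing of audio and video clips can itself act as the supervision signal.

```python
# Toy sketch of dual-encoder audio-visual contrastive learning.
# NOT DenseAV's real implementation; shapes, encoders, and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Toy visual branch: maps a frame to an embedding vector."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames):          # frames: (B, 3, H, W)
        return self.net(frames)

class AudioEncoder(nn.Module):
    """Toy audio branch: maps a spectrogram to an embedding vector."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, spectrograms):    # spectrograms: (B, 1, F, T)
        return self.net(spectrograms)

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss: audio and video from the same clip should match,
    clips from different videos should not. No labels are needed; the
    natural pairing is the supervision."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(v))              # diagonal = true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One training step on random stand-in data.
video_enc, audio_enc = VideoEncoder(), AudioEncoder()
frames = torch.randn(8, 3, 64, 64)              # 8 video frames
specs = torch.randn(8, 1, 64, 100)              # 8 matching audio spectrograms
loss = contrastive_loss(video_enc(frames), audio_enc(specs))
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```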
Unlike previous efforts that matched entire image frames with sounds, DenseAV works at the pixel level. This allows for a much finer level of detail: even background elements in a video stream can be identified, giving artificial systems a richer understanding of the world. A rough sketch of what such a dense comparison can look like follows below.
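The snippet below illustrates, with placeholder feature maps, how per-pixel visual features can be compared against per-time-step audio features and pooled into a heatmap that localizes a sound in the frame. The shapes and the max-pooling over time are assumptions for illustration, not the paper’s exact formulation.

```python
# Hypothetical illustration of dense (pixel-level) audio-visual matching:
# compare every spatial visual feature against every audio time step to
# get a heatmap of where in the frame a sound "lights up".
# Feature maps here are random stand-ins.
import torch
import torch.nn.functional as F

B, D, H, W, T = 1, 128, 14, 14, 50
visual_feats = torch.randn(B, D, H, W)    # per-pixel (patch) visual features
audio_feats = torch.randn(B, D, T)        # per-time-step audio features

v = F.normalize(visual_feats.flatten(2), dim=1)   # (B, D, H*W)
a = F.normalize(audio_feats, dim=1)               # (B, D, T)

# Similarity volume: how well each pixel matches each audio frame.
sim = torch.einsum('bdp,bdt->bpt', v, a)          # (B, H*W, T)

# Pool over time to get a per-pixel activation map for the whole clip.
heatmap = sim.max(dim=2).values.reshape(B, H, W)
print(heatmap.shape)   # torch.Size([1, 14, 14]) -- coarse localization map
```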
The algorithm was initially trained on a dataset of two million unlabeled YouTube videos, and the researchers created additional datasets to benchmark it. Compared with today’s best algorithms, DenseAV proved more capable at tasks like identifying objects based on their names or the sounds that they make.
Given DenseAV’s early successes, the team hopes it will help them understand how animals communicate; it might, for example, help unlock the secrets of dolphin or whale communication in the future. As a next step, the researchers plan to scale up the model and possibly incorporate language models into the architecture in the hope of further improving its performance.