Unlocking Multisensory Understanding
This multimodal ML technique learns the relationships between audio and video data to understand the world in a more human-like way.
The combination of audio and visual data plays a fundamental role in how humans perceive and understand the world around them. Our senses of sight and hearing work together in a harmonious manner, providing a rich and comprehensive understanding of our environment. This integration of audio and visual information enables us to form a more complete and nuanced perception of events, objects, and people.
When we watch a movie, we not only see the characters and the scenes but also hear the dialogue, background music, and sound effects. This enhances our emotional engagement, helps us follow the storyline, and provides depth to our overall experience. Similarly, in everyday life, we rely on both visual and auditory information to navigate our environment, recognize faces, interpret gestures, and understand social cues.
However, replicating this seamless integration of audio and visual data is a significant challenge for computer vision applications. To develop learning algorithms capable of making sense of such complex data, large volumes of manually annotated audio and video samples are required. But producing these datasets is extremely time-consuming and expensive, and the annotation process is also error-prone.
Given the important applications this technology could enable, and the intense interest surrounding the fusion of audio and video data, it is clear that these manual processes will not scale. Before web-scale datasets can be put to use, new methods for training the algorithms will need to be developed.
One such approach has just been proposed by a team led by researchers at MIT CSAIL. They have developed a type of neural network called the contrastive audio-visual masked autoencoder (CAV-MAE) that learns to model the relationships between audio and visual data in a way that existing approaches cannot. And it does so in a more human-like way, relying on self-supervised learning rather than manually labeled datasets.
The CAV-MAE approach has two distinct phases. In the first, a predictive model masks 75% of both the audio and the video data and tokenizes the remaining 25%. Separate audio and video encoders then process these visible tokens, after which the model attempts to predict the "missing" data that was masked. The difference between the prediction and the actual masked data is used to calculate a reconstruction loss, which in turn drives the model to learn and improve its predictions.
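To make this first phase more concrete, the sketch below shows a single-modality masked autoencoder step in PyTorch: 75% of the input tokens are hidden, the visible 25% are encoded, and the model is penalized on how poorly it reconstructs the masked positions. The module names, dimensions, and tiny transformer are illustrative assumptions, not the researchers' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMaskedAutoencoder(nn.Module):
    """Hide most tokens, encode the visible ones, and reconstruct the rest."""
    def __init__(self, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # placeholder for hidden positions
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(dim, dim)  # predicts the original token values

    def forward(self, tokens):
        batch, num_tokens, dim = tokens.shape
        num_keep = int(num_tokens * (1 - self.mask_ratio))

        # Randomly choose which 25% of tokens stay visible (different per example).
        keep_idx = torch.rand(batch, num_tokens, device=tokens.device).argsort(dim=1)[:, :num_keep]
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
        visible = torch.gather(tokens, 1, gather_idx)

        # Encode only the visible tokens.
        encoded = self.encoder(visible)

        # Rebuild a full-length sequence: learned mask tokens everywhere,
        # with the encoded visible tokens scattered back into place.
        full = self.mask_token.repeat(batch, num_tokens, 1).scatter(1, gather_idx, encoded)
        reconstruction = self.head(self.decoder(full))

        # The reconstruction loss is computed only on the positions that were masked out.
        masked = torch.ones(batch, num_tokens, dtype=torch.bool, device=tokens.device)
        masked[torch.arange(batch, device=tokens.device).unsqueeze(1), keep_idx] = False
        return F.mse_loss(reconstruction[masked], tokens[masked])

# Stand-in for tokenized audio spectrogram patches; video patches would be handled the same way.
audio_tokens = torch.randn(8, 64, 128)
loss = TinyMaskedAutoencoder()(audio_tokens)
loss.backward()
```

In CAV-MAE the audio and video streams each have their own encoder before a joint stage, but the core masking-and-reconstruction idea is the one sketched above.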
This process captures some important knowledge about audio-visual associations, but it was found to be insufficient on its own, so the output of the first phase is also fed into a contrastive learning objective. The contrastive learner seeks to place similar representations close to one another in feature space. It does this by first passing the audio and video data through their own encoders, then feeding the results into a joint encoder. The joint encoder keeps the audio and visual components separate, which helps the model determine which parts of each modality are most relevant to the other.
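The contrastive step can be sketched with a symmetric InfoNCE-style loss over pooled audio and video embeddings: representations from the same clip form a positive pair, while the other clips in the batch act as negatives. The temperature value and the mean-pooled inputs are assumptions for illustration, not the team's exact setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) pooled outputs of the two encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    # Similarity of every audio clip with every video clip in the batch.
    logits = audio_emb @ video_emb.t() / temperature

    # The matching pair for item i sits at position i (the diagonal).
    targets = torch.arange(audio_emb.shape[0], device=audio_emb.device)

    # Symmetric loss: audio-to-video and video-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stand-ins for mean-pooled encoder outputs of eight paired audio/video clips.
audio_emb = torch.randn(8, 128, requires_grad=True)
video_emb = torch.randn(8, 128, requires_grad=True)
loss = contrastive_loss(audio_emb, video_emb)
loss.backward()
```

Pulling matched pairs together while pushing mismatched ones apart is what teaches the model which sounds belong with which visuals.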
To validate their approach, the team trained two additional models, one using only the masked autoencoder and another using only contrastive learning. When their results were compared with those of CAV-MAE, the combined approach came out ahead, showing a clear synergy between the two techniques. In fact, CAV-MAE was even found to rival state-of-the-art supervised models on audio-visual event classification tasks. Moreover, the team's method was found to match or outperform existing methods that use significantly more computational resources.
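One plausible way to read that synergy in code is a single training objective that adds the two losses with tunable weights; the coefficients below are hypothetical placeholders, not values reported by the team.

```python
import torch

# Stand-ins for the two losses produced by the sketches above.
reconstruction_loss = torch.tensor(0.82, requires_grad=True)
contrastive_loss = torch.tensor(1.35, requires_grad=True)

# Hypothetical weighting of the two objectives; balancing them is what lets
# the masked-reconstruction and contrastive signals reinforce each other.
total_loss = 1.0 * reconstruction_loss + 0.1 * contrastive_loss
total_loss.backward()
```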
The CAV-MAE approach appears to be a meaningful step forward for multimodal applications. The researchers envision new use cases being enabled in a broad range of areas, including sports, education, entertainment, motor vehicles, and public safety. They even believe that one day similar methods will extend to modalities beyond audio and video.