Meta AI Releases Data2vec, an ML Algorithm That Works Across Text, Images, and Speech
Designed around a teacher-student model, data2vec claims to outperform rivals — despite working across three different modalities of data.
Meta AI, formerly known as Facebook AI, has detailed what it claims is the "first high-performance self-supervised [machine learning] algorithm" capable of operating with speech, vision, and text: data2vec.
"While people appear to learn in a similar way regardless of how they get information — whether they use sight or sound, for example — there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities," the company explains.
"This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly. Because a powerful algorithm designed for, say, understanding images can’t be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate."
Data2vec aims to solve that problem, offering a single self-supervised algorithm — meaning it doesn't rely on labeled data sets — capable of working across speech, vision, and text. Compared with previous approaches, Meta AI claims, data2vec simplifies training yet matches or outperforms modality-specific rivals.
"We tested the method on the popular ImageNet computer vision benchmark, where it performed better than existing methods for popular model sizes," the company says. "On speech, we found that it performed better than wav2vec 2.0 or HuBERT, two previous Meta AI self-supervised algorithm for speech. For text, we tested it on the popular GLUE benchmark suite, and it performed as well as RoBERTa, a re-implementation of BERT."
The system works through the use of a "teacher network," which first computes target representations from full images, text, or audio. Part of the input is then masked, and the process is repeated with a "student network," which is tasked with predicting the teacher's representations of the full input despite seeing only a portion of it. Because the student predicts internal representations of the input rather than modality-specific targets such as pixels or words, the approach isn't tied to any single modality.
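To make that loop concrete, here is a minimal PyTorch sketch of a teacher-student masked-prediction scheme of the kind described. It is an illustration under simplifying assumptions, not Meta AI's released code: the TinyEncoder, the mean-squared-error loss, the 50% masking ratio, and the EMA constant are all placeholder choices (data2vec itself uses modality-specific feature encoders, a larger Transformer, and a smooth L1 loss over averaged top-layer targets).

```python
# Illustrative sketch only: a simplified teacher-student masked-prediction
# loop in the spirit of the description above, not Meta AI's data2vec code.
# Encoder, masking scheme, loss, and hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in encoder: a small stack of Transformer layers."""
    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)
        return hidden  # representations from every layer

dim = 64
student = TinyEncoder(dim=dim)
teacher = copy.deepcopy(student)            # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                 # updated by moving average, not gradients

mask_token = nn.Parameter(torch.zeros(dim)) # learned stand-in for masked positions
optimizer = torch.optim.Adam(list(student.parameters()) + [mask_token], lr=1e-4)

def ema_update(tau=0.999):
    # Teacher weights track an exponential moving average of the student's.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(tau).add_(s, alpha=1 - tau)

def training_step(batch, top_k=2, mask_ratio=0.5):
    # batch: (B, T, dim) embedded input from any modality.
    # 1) The teacher sees the FULL input; its targets are the average of
    #    the top K layers' representations (a simplification of the paper).
    with torch.no_grad():
        targets = torch.stack(teacher(batch)[-top_k:]).mean(dim=0)

    # 2) Mask a fraction of positions before showing the input to the student.
    mask = torch.rand(batch.shape[:2]) < mask_ratio          # (B, T)
    masked = torch.where(mask.unsqueeze(-1), mask_token, batch)

    # 3) The student predicts the teacher's representations at masked spots.
    preds = student(masked)[-1]
    loss = F.mse_loss(preds[mask], targets[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update()
    return loss.item()
```

The moving-average teacher is what keeps the prediction targets stable as the student improves, and the same loop applies whether the batch holds embedded image patches, speech frames, or token embeddings — which is the point of the design.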
"Data2vec demonstrates that the same self-supervised algorithm can work well in different modalities — and often better than the best existing algorithms," Meta AI notes. "This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread. We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks."
The paper describing data2vec is available from Meta AI under open-access terms; the company has also released the data2vec source code and pre-trained models for speech and natural-language processing on GitHub under the permissive MIT license, with vision models to follow.