You’ve Got to Work with What You’ve Got
MultiModN is a modular AI model that works with data types like text, audio, and video without picking up biases from sparse training data.
Traditionally, deep learning applications have been categorized based on the type of data they operate on, such as text, audio, or video. Text-based deep learning models, for instance, have excelled in natural language processing tasks such as sentiment analysis, language translation, and text generation. Similarly, audio-based models have been employed for tasks like speech recognition and sound classification, while video-based models have found applications in gesture recognition, object detection, and video summarization.
However, this approach is not always ideal, especially in decision-making scenarios where information from multiple modalities may be crucial for making informed choices. Recognizing this limitation, multimodal models have gained popularity in recent years. These models are designed to accept inputs from various modalities simultaneously and produce outputs that integrate information from these modalities. For instance, a multimodal model might take in both textual descriptions and image data to generate captions or assess the sentiment of a scene in a video.
Despite the advantages of multimodal models, there are challenges associated with training them, particularly due to the disparate availability of training data for different modalities. Text data, for example, is abundant and easily accessible from sources such as websites, social media, and digital publications. In contrast, obtaining large-scale labeled datasets for modalities like video can be more resource-intensive and challenging. Consequently, multimodal models often have to be trained with incomplete or missing data from certain modalities. This can introduce biases into their predictions, as the model may rely more heavily on the modalities with richer training data, potentially overlooking important cues from other modalities.
A new modular model architecture developed by researchers at the Swiss Federal Institute of Technology Lausanne has the potential to eliminate the sources of bias that plague today's multimodal algorithms. Named MultiModN, the system can accept text, video, image, sound, and time-series data, and can also respond in any combination of these modalities. But instead of fusing the input modality representations in parallel, MultiModN is composed of separate modules, one for each modality, that work in sequence.
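The idea can be illustrated with a minimal PyTorch-style sketch. The class names, the use of a GRU cell for the state update, and the shared state vector below are illustrative assumptions rather than the actual MultiModN implementation; the point is only that each modality's module updates a running state in turn, and that the state can be decoded into a prediction at any step.

```python
import torch
import torch.nn as nn


class ModalityModule(nn.Module):
    """One illustrative module per modality: it takes that modality's
    features plus the shared state and returns an updated state."""

    def __init__(self, input_dim: int, state_dim: int):
        super().__init__()
        self.update = nn.GRUCell(input_dim, state_dim)

    def forward(self, features: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # The returned state carries information from every modality seen so far.
        return self.update(features, state)


class SequentialMultimodalModel(nn.Module):
    """Chains one module per modality in sequence rather than fusing
    all modality representations in parallel."""

    def __init__(self, modality_dims: dict[str, int], state_dim: int, num_classes: int):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityModule(dim, state_dim) for name, dim in modality_dims.items()}
        )
        self.state_dim = state_dim
        # The same head can decode the state after any module, which is what
        # makes the intermediate decision process inspectable.
        self.head = nn.Linear(state_dim, num_classes)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        batch_size = next(iter(inputs.values())).shape[0]
        state = torch.zeros(batch_size, self.state_dim)
        for name, encoder in self.encoders.items():
            if name in inputs:  # a modality that is absent is simply skipped
                state = encoder(inputs[name], state)
        return self.head(state)
```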
This architecture enables each module to be trained independently, which prevents the injection of bias when some types of training data are more sparse than others. As an added benefit, the separation of modalities also makes the model more interpretable, so the decision-making process can be better understood.
The researchers first evaluated their algorithm in the role of a clinical decision support system, an application it turns out to be especially well-suited for. Missing data is highly prevalent in medical records, often because patients skip tests that were ordered. In theory, MultiModN should be able to learn from the multiple data types in these records without picking up any bad habits from the missing entries. Experiments bore that out: MultiModN proved robust to differences in missingness between the training and testing datasets.
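Continuing the illustrative sketch above (the class names, modality names, and feature dimensions remain assumptions, not the published MultiModN code), an incomplete patient record simply skips the module for the missing modality, so no zero-filled or imputed placeholder ever reaches the shared state:

```python
import torch

# Hypothetical feature dimensions for a simplified clinical record.
model = SequentialMultimodalModel(
    modality_dims={"notes_text": 128, "vitals_series": 32, "xray_image": 256},
    state_dim=64,
    num_classes=2,
)

# This record has clinical notes and vital-sign features, but no X-ray was taken.
incomplete_record = {
    "notes_text": torch.randn(1, 128),
    "vitals_series": torch.randn(1, 32),
}

# The X-ray module is skipped rather than fed placeholder data, so training
# and inference behave consistently however much of a record is missing.
logits = model(incomplete_record)
print(logits.shape)  # torch.Size([1, 2])
```

Because each module only ever sees real examples of its own modality, a shortage of, say, imaging data does not distort what the text or time-series modules learn.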
While the initial results are very promising, the team notes that relevant, open-source multimodal datasets are hard to come by, so MultiModN could not be tested as extensively as they would have liked. As such, additional work may be needed in the future if this approach is adopted for a real-world problem. If you would like to take a look at the code for yourself, it has been made available in a GitHub repository.