The Little Language Model That Could
Using textual entailment and self-training, MIT CSAIL researchers created language models that outperform models 500 times their size.
Large language models are advanced artificial intelligence systems designed to understand and generate human-like text. These models, such as OpenAI’s GPT-4 and Google’s PaLM 2, have garnered immense attention and popularity due to their remarkable capabilities and potential applications. They can be employed for a wide range of tasks, including natural language understanding, text completion, translation, summarization, and even creative writing.
One of the main reasons why large language models have gained such popularity is their ability to process and generate text that is coherent, contextually relevant, and exhibits human-like fluency. These models are trained on massive amounts of data, often comprising billions of sentences from books, articles, websites, and other textual sources. This extensive training allows them to acquire a deep understanding of language patterns and semantic relationships, enabling them to provide intelligent responses and generate meaningful content.
However, there are notable drawbacks to large language models. The sheer computational resources and expenses required to train and operate them are immense. Training these models necessitates powerful hardware infrastructure, including high-performance GPUs and extensive data storage capabilities. The training process itself can take weeks or even months and consumes vast amounts of electricity, which contributes to environmental concerns. These resource requirements pose significant limitations on who can afford to develop and use such models, often restricting their accessibility to organizations with substantial resources.
Of course, smaller language models can be used, which drastically cuts down on the resources and expenses needed to train a model and run inference. But that will hinder the model’s performance; after all, you get what you pay for, right? Maybe not. Recent work by a team of researchers at MIT’s CSAIL may turn this conventional wisdom on its head. Their research has shown that it is not the size of the model that counts, but how you train it.
To make a mighty yet (relatively) tiny model, the team relied on textual entailment to give their algorithm a boost in understanding natural language. Textual entailment is a directional relationship between two fragments of text, in which the truth of one fragment follows from the other. For example, given a premise stating that “all men are mortal,” the hypothesis that “the man named Socrates is a mortal” would be entailed by the premise.
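To make the idea concrete, here is a minimal sketch of querying an off-the-shelf natural language inference model for an entailment judgment, using the Hugging Face Transformers library. The model choice (roberta-large-mnli) is an illustrative assumption; the article does not say which entailment model the researchers built on.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: any MNLI-trained model works for this illustration.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "All men are mortal."
hypothesis = "The man named Socrates is a mortal."

# NLI models take the premise and hypothesis as a sentence pair.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# For roberta-large-mnli: 0 = contradiction, 1 = neutral, 2 = entailment.
probs = logits.softmax(dim=-1).squeeze()
label = model.config.id2label[int(probs.argmax())]
print(label, float(probs.max()))  # expected: ENTAILMENT with high confidence
```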
The concept of entailment was used to train an entailment model. Given some input text, this model can be prompted to determine whether a particular piece of information is entailed by it. That added capability gives a language model the ability to adapt to novel tasks without being shown additional training data.
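As a rough illustration of this zero-shot adaptation, the sketch below recasts a sentiment analysis example as an entailment query: each candidate label becomes a hypothesis, and the label whose hypothesis is most strongly entailed by the input wins. The hypothesis wording and the MNLI model are assumptions for illustration, not the researchers’ exact prompts or model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumption: any MNLI-style entailment model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
ENTAILMENT_ID = 2  # index of the "entailment" label in roberta-large-mnli

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0, ENTAILMENT_ID].item()

# Recast sentiment classification as entailment: one hypothesis per candidate label.
review = "The battery died after two days and support never replied."
hypotheses = {
    "positive": "This review expresses a positive opinion.",
    "negative": "This review expresses a negative opinion.",
}
scores = {label: entailment_prob(review, h) for label, h in hypotheses.items()}
print(max(scores, key=scores.get), scores)  # expected: "negative" scores highest
```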
The performance of the language model was further improved through a technique known as self-training, in which the model learns from its own predictions. This is done without requiring heavy supervision from humans, or a manually annotated dataset. Self-training can, however, generate incorrect labels, so an algorithm called Simple Pseudo-Label Editing (SimPLE) was developed to review and revise the pseudo-labels produced during the first few rounds of training. By correcting these early mistakes, the accuracy of the final model was greatly improved.
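The sketch below shows the general shape of such a self-training loop on toy data. A plain confidence threshold stands in for SimPLE’s pseudo-label editing step, whose exact rules are not described here; the classifier, features, and threshold are all placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic self-training skeleton (not the researchers' actual pipeline).
rng = np.random.default_rng(0)

# Toy data: a small labeled set and a larger unlabeled pool,
# standing in for real text features and task labels.
X_labeled = rng.normal(size=(40, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(400, 5))

X_train, y_train = X_labeled.copy(), y_labeled.copy()
model = LogisticRegression()

for round_idx in range(3):  # a few rounds of self-training
    model.fit(X_train, y_train)

    # Predict pseudo-labels for the unlabeled pool.
    probs = model.predict_proba(X_unlabeled)
    pseudo_labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)

    # "Edit" the pseudo-labels: keep only high-confidence predictions,
    # a simple stand-in for SimPLE's more careful label review.
    keep = confidence > 0.9
    X_train = np.vstack([X_labeled, X_unlabeled[keep]])
    y_train = np.concatenate([y_labeled, pseudo_labels[keep]])
    print(f"round {round_idx}: kept {keep.sum()} pseudo-labeled examples")
```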
The researchers’ methods resulted in a model that could outperform language models 500 times its size on certain language understanding tasks, like sentiment analysis, question answering, and news classification. Not only do these pint-sized models reduce costs and resource utilization, but they can also help preserve privacy, since they can be run on-site. Relying on external organizations to provide access to a language model via an API makes it necessary to send potentially sensitive data over the Internet to off-site cloud resources.
The team still has a bit of work to do; it was noted, for example, that the self-training process did not perform as well on multi-class classification tasks as it did on binary classification. But with a bit of refinement, this work could contribute to the democratization of language models, allowing many diverse groups to reap their benefits.