Can We Be Reasonable?
Have chatbots really developed the ability to reason? MIT researchers suggest it is more likely that they just memorize their training data.
The large language models (LLMs) that power today’s latest and greatest chatbots have achieved a level of sophistication never seen before. In fact, they are so good that their responses are very often indistinguishable from those of a human. Aside from helping to usher in a resurgence of interest in tools powered by artificial intelligence (AI), LLMs have also sparked a lot of conversation about what is really happening under the hood of our most advanced AI algorithms.
Some people have even gone so far as to conclude that the largest and most complex LLMs have attained some level of consciousness. While most dismiss these claims as hyperbole, there are still many people who look at the conversations produced by LLMs and take them seriously. If you were hoping to be chatting with an intelligent, 2001-esque computer like the HAL 9000 in the near future — or if you think that is what you might be doing even now when you talk to a chatbot — then a team of researchers at MIT and Boston University would like to rain on your parade.
The researchers were interested in better understanding how much of an LLM’s knowledge can be attributed to emergent reasoning capabilities, and how much is just plain old memorization of facts and probable sequences of words found in the training data. Their strategy for investigating this involved first questioning LLMs — such as GPT-4, Claude, and PaLM-2 — about tasks that were likely to be well represented in their training data. They then tested what they called “counterfactual scenarios,” or hypothetical variations of those same tasks that would not be expected to appear in the training dataset.
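To make that strategy concrete, here is a minimal sketch of what such a paired evaluation could look like. Everything in it is an assumption made for illustration: the prompts, the expected answers, and the stand-in dummy_model are not the researchers’ actual prompts or test harness, and a real run would call an LLM API instead.

```python
# Sketch of a default-vs-counterfactual comparison (illustrative assumptions only).

def dummy_model(prompt: str) -> str:
    """Stand-in for an LLM call; imitates a model that memorized base-10 sums."""
    return "The answer is 72."

# Each task pairs a question likely seen in training data with a
# counterfactual twist on the same underlying problem.
tasks = [
    {
        "default": ("What is 27 + 45 in base-10?", "72"),
        "counterfactual": ("What is 27 + 45 in base-9?", "73"),
    },
]

def accuracy_by_condition(model, tasks):
    """Return per-condition accuracy; a large gap hints at memorization."""
    hits = {"default": 0, "counterfactual": 0}
    for task in tasks:
        for condition, (prompt, answer) in task.items():
            if answer in model(prompt):
                hits[condition] += 1
    return {condition: count / len(tasks) for condition, count in hits.items()}

print(accuracy_by_condition(dummy_model, tasks))
# {'default': 1.0, 'counterfactual': 0.0}
```

The interesting quantity is not either score on its own, but the gap between the two conditions.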
Starting with arithmetic, the team fired off some questions to the LLMs. Since the vast majority of the arithmetic in any training dataset is written in base-10, the models could do well on base-10 questions through memorization alone, so they should only excel at operations in other bases if they actually understand the underlying concepts. Conversely, if they perform much worse in other bases, it is an indication that they are likely just memorizing what they have previously seen. As it turned out, there were huge drops in accuracy across the board when bases other than 10 were used.
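Continuing the toy example from above, the short sketch below (with operands of my own choosing, not prompts from the study) shows why the correct answer to the very same digits changes with the base — something a model cannot get right by pattern-matching on memorized base-10 examples alone.

```python
# Addition of numbers written as digit strings in an arbitrary base (2-16).
# The operands are illustrative; they are not taken from the study.

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two non-negative numbers given as digit strings in `base`."""
    total = int(a, base) + int(b, base)   # convert to integers, then add
    if total == 0:
        return "0"
    digits = []
    while total:                          # convert the sum back into `base`
        total, remainder = divmod(total, base)
        digits.append("0123456789abcdef"[remainder])
    return "".join(reversed(digits))

print(add_in_base("27", "45", 10))  # '72' -- the familiar base-10 result
print(add_in_base("27", "45", 9))   # '73' -- same digits, different base, different answer
print(add_in_base("27", "45", 16))  # '6c' -- and different again in base-16
```

A model that answers “72” no matter which base the prompt specifies is behaving like a lookup of memorized base-10 sums, which is consistent with the accuracy drops the team observed.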
Many other experiments were conducted to assess the models’ abilities in areas like spatial reasoning, chess problems, and musical chord fingering. While the algorithms generally performed quite well on the typical questions, they once again struggled mightily with the counterfactuals. In fact, they often performed so poorly that their results were no better than a random guess.
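For what it is worth, the “no better than a random guess” comparison is simple to formalize: for a multiple-choice task, check whether the observed number of correct answers could plausibly have come from guessing alone. The numbers below are hypothetical and purely for illustration, not results reported by the researchers.

```python
# One-sided binomial check: could this score have come from random guessing?
from math import comb

def p_value_at_least(correct: int, total: int, chance: float) -> float:
    """Probability of getting at least `correct` right out of `total` by
    guessing, when each guess succeeds with probability `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical example: 28 correct out of 100 four-choice questions,
# where pure guessing would average 25 correct.
p = p_value_at_least(28, 100, 0.25)
print(f"p = {p:.3f}")  # well above 0.05, so this score is indistinguishable from guessing
```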
Typically, when a model performs like this, we say that it is overfit to the training data, and that is a bad thing — we want our models to be able to generalize to new situations. The findings of this study hint that LLMs may simply be overfit to their training datasets. But because those datasets can be so huge, like the entire content of the public internet, we have been quite happy with that: if a model has seen so much, there is less need for it to adapt to unseen scenarios. Even so, if you are hoping for the development of truly intelligent machines, today’s LLMs do not appear to be the way to go.
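To see the memorization-versus-generalization distinction in miniature, consider the toy sketch below. It has nothing to do with the study’s models or data; it simply contrasts a lookup table, which is perfect on inputs it has already seen but helpless on anything new, with a fitted line that extends to unseen inputs.

```python
# Toy contrast between memorization (a lookup table) and generalization
# (a least-squares line). Purely illustrative; unrelated to the study.
import random

random.seed(0)
train = [(float(x), 2 * x + 1 + random.gauss(0, 0.1)) for x in range(10)]  # y is roughly 2x + 1
test = [(x + 0.5, 2 * (x + 0.5) + 1) for x in range(10)]                   # unseen inputs

# "Memorizer": stores the training pairs exactly, so it has no answer at all
# for any input that was not in its training data.
lookup = {x: y for x, y in train}

# "Generalizer": an ordinary least-squares line fit to the same training data.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)
intercept = mean_y - slope * mean_x

for x, y_true in test:
    memorized = lookup.get(x)          # None: this exact input was never seen
    predicted = slope * x + intercept  # lands close to the true value
    print(f"x={x:4.1f}  memorizer={memorized}  generalizer={predicted:.2f}  true={y_true:.2f}")
```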