Will the Twin Be a Win?
Gemini, Google's latest multimodal AI model, can understand text, code, audio, image, and video data, and it now powers their Bard chatbot.
OpenAI's ChatGPT was released just over a year ago, and the tremendous interest generated by the tool quickly set off an AI arms race that is still raging. The demand for ever more capable generative AI models has prompted major tech companies and research institutions to intensify their efforts in artificial intelligence. As a result, we have witnessed a rapid evolution in conversational AI and generative tools more broadly, with each new release attempting to outperform its predecessors.
Although many of these models are extremely large and require massive amounts of compute resources to operate, the competitive landscape has not been limited to large organizations. The open-source community has played a pivotal role, contributing to the democratization of AI technology. Collaborative efforts have produced alternative models that individuals can run on their own personal computers. This has also fueled rapid innovation, with more people and organizations able to contribute to new technological advances.
The latest large-scale effort intended to move the field forward was recently announced by Google. Their Bard chatbot has not exactly taken a leading position in this crowded field yet, with many users finding its capabilities underwhelming. The jury is still out, but this may soon change. Google has just replaced PaLM 2, the model that had been powering Bard since it superseded LaMDA earlier in the year, with their latest generative AI model, named Gemini.
Google calls Gemini the most capable and general model that they have ever created, and on paper, at least, it looks pretty impressive. It was designed from the ground up to be highly multimodal. Many past efforts have relied on separate models that work together to process different types of data. Gemini, on the other hand, can understand text, code, audio, image, and video data within a single model. With all of these capabilities sitting side by side, there is a lot of potential for generalizing across different sources of information. And that is exactly the sort of ability that artificial systems need to gain a better understanding of the world around them, and to interact more naturally with humans.
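To make that concrete, here is a minimal sketch of what a single mixed-modality prompt can look like through Google's google-generativeai Python SDK. The API key and image path are placeholders, and the gemini-pro-vision model name is the identifier the SDK exposed at launch; treat this as an illustration, not official documentation.

```python
# pip install google-generativeai pillow
import google.generativeai as genai
import PIL.Image

# Placeholder credentials -- substitute a real API key.
genai.configure(api_key="YOUR_API_KEY")

# One model, one prompt, two modalities: an image followed by a
# text instruction that refers to that image.
model = genai.GenerativeModel("gemini-pro-vision")
image = PIL.Image.open("photo.jpg")  # illustrative local file

response = model.generate_content(
    [image, "Describe what is happening in this picture."]
)
print(response.text)
```

Because the prompt is just a list that mixes images and strings, there is no separate vision pipeline to wire up; the same call handles text-only prompts as well.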
In a break from current trends, Gemini is not delivered in a one-size-fits-all package. Three model sizes have been released to meet the needs of a variety of use cases. Gemini Ultra is the largest, intended for highly complex tasks where the sky is the limit on available resources. Gemini Pro, which now powers Bard, was designed to be capable across a wide range of tasks without being such a resource hog. Finally, Gemini Nano was created for on-device use. This model can power applications on smartphones without requiring an internet connection for cloud-based processing.
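Note that the tiers a developer can actually reach may lag the announcement; at launch, for instance, only the Pro-level models were reachable through the public API, with Ultra promised for later and Nano reserved for on-device integration. Assuming the same SDK as above, one way to check what your own account can see is to enumerate the available models:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Print every model variant this account can call for content
# generation, e.g. models/gemini-pro and models/gemini-pro-vision.
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)
```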
Of course, none of this means a thing if the model does not perform well, so how does it stack up against the competition? If you trust benchmarks to assess the performance of a model, then Gemini has advanced the state of the art. Across a panel of 32 academic benchmarks commonly used to evaluate large language models on tasks like reasoning, math, coding, and understanding of images, video, and audio, the largest model, Gemini Ultra, was reported to exceed the previous best results, including those of GPT-4 and GPT-4V, on 30 of them.
Google notes that the multimodal capabilities of Gemini should help it excel at uncovering knowledge hidden in vast amounts of data. The same skills could make it very good at other tasks, like advanced reasoning and coding. But as they say, the proof is in the pudding. Give it a try and see what you think. Does the real-world performance match the expectations?