Google’s Video Promised an AI Revolution. The Reality Is a Bit More Boring
It’s incredible. Maybe.
Until recently, OpenAI was the unbeatable artificial intelligence market leader, with its ChatGPT chatbot setting the bar for every other AI company. Then the whole company blew up and regrouped over internal drama, destroying much of its reputation and goodwill. This article in the Washington Post didn’t help matters.
Given the new opening in mindshare for serious AI businesses, competitors have leaped in to show what they can do. Meta, on the back of its rapidly improving Llama models, unveiled its public image-generation tool, Imagine. Visual Electric took this to the next level with its own tool, which is functionally the Photoshop of AI. And Elon Musk, taking a break from ruining Twitter and disappointing everyone with the Cybertruck, is rolling out wider availability of xAI’s Grok, which is like a worse ChatGPT with bad dad humor.
And then, on the 6th of December, Google released a YouTube video showing off its new ‘Gemini’ multimodal model, which made the rest seem irrelevant. That video already has over two million views and sent many technology commentators into a frenzy of Twitter excitement. It looked like a revolutionary leap forward for AI tools. But just because you see it in a video doesn’t mean it’s real.
From a technical perspective, Gemini promises two key innovations over existing AI systems like ChatGPT: improvements to AI reasoning, and true multimodality. The former means it can draw inferences from the information it is given and then make rational decisions about what to do next, much as we do naturally. This has always been a weak point for large language models, which simulate the output of speech without mirroring the way we think: their ability to recommend decisions isn’t the result of genuine thought, but a simulation of how people have responded to similar queries. If Google has genuinely cracked this, and Gemini can reason, then it is way ahead of its competitors.
The second innovation, multimodality, means that Gemini doesn’t need the help of other AI systems to generate the varied outputs it thinks you want. The GPT-4 version of ChatGPT can generate images for you, but only by handing your prompt to a separate model and returning its results. By contrast, Gemini treats the processing and generation of image, audio, video, and text content, along with web crawling, as parts of a single process. For Gemini, it’s all just information to be taken in and put out, and it can intuit which kind a user is after.
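If you prefer to see the difference in code, here is a rough, purely illustrative sketch. The function names are stand-ins I have invented for this example, not real OpenAI or Google APIs; the point is only the shape of the two architectures, a pipeline of separate models versus one model that handles mixed inputs itself:

```python
# Illustrative only: every function here is a stand-in, not a real API.

def text_model(prompt: str) -> str:
    """Stand-in for a text-only LLM such as GPT-4."""
    return f"[text response to: {prompt}]"

def image_model(prompt: str) -> bytes:
    """Stand-in for a separate image generator the text model delegates to."""
    return f"[image generated from: {prompt}]".encode()

def pipeline_style(prompt: str):
    """Chained approach: the text model decides an image is wanted,
    then hands the prompt off to a second, separate model."""
    if "draw" in prompt or "picture" in prompt:
        return image_model(prompt)  # delegated to another system
    return text_model(prompt)

def multimodal_style(prompt: str, images=None, audio=None):
    """Single-model approach, as Google describes Gemini: one model
    accepts mixed inputs and chooses the output type itself."""
    inputs = [prompt, *(images or []), *(audio or [])]
    return f"[one model reasoning over {len(inputs)} mixed inputs]"

if __name__ == "__main__":
    print(pipeline_style("draw me a duck"))
    print(multimodal_style("what bird is this?", images=[b"<photo of a duck>"]))
```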
It’s incredible. Maybe.
Many users were blown away by what they were seeing: an AI voice instantly reacting to a user’s behavior, straight from live video footage, offering useful information drawn from that context and the user’s vague prompts. I was suspicious, though, and my suspicion only grew at Google’s note in the video description that “latency has been reduced and Gemini outputs have been shortened for brevity.”
Given that the video was impressive precisely because of the relevance and speed of Gemini’s responses, along with its apparent ability to understand complex contexts, that caveat takes a lot of the shine off. And things got worse still when Parmy Olson of Bloomberg asked Google to clarify further.
Whereas the video showed Gemini reacting live to video footage and spoken prompts, Google confirmed to Olson that this was merely illustrative: the actual system had responded to written prompts and still images taken from the footage. In other words, not what the video demonstrated, and something ChatGPT could already do.
The multimodality and contextual reasoning still seem very impressive, and the new version of Google’s Bard chatbot, now powered by Gemini, seems to produce faster, better responses than the comparatively neutered ChatGPT. But I have little reason to trust the broader claims in the video until I see more.
It’s also worth noting that the demonstration video referred to ‘Gemini’ generally, but Google says Gemini will come in Ultra, Pro, and Nano forms. Nano will run on-device on phones, laptops et cetera, whereas the larger models will run in Google’s data centers, and it’s not clear which version we were seeing, or whether a server-based Gemini will be available to everyday people through their browsers. Whatever it was, it is likely far more powerful than anything a Google Pixel phone will be able to run for some years, and just how much capability the Nano model gives up is still to be seen.