Overview
The large-model landscape shifted again overnight as Google announced Gemini, a new multimodal foundation model positioned to compete with OpenAI's GPT-4. Google said Gemini will be integrated across its products, including Bard, Search, and Cloud.
Announcement and rollout
At Google I/O in May, Google revealed that it was developing a multimodal foundation model called Gemini and indicated that future Google products would be built on it. With the public announcement, Bard and the Pixel 8 Pro phone are the first products to use Gemini capabilities. The model is initially available in English, with support for other languages expected later. Google has said it plans to integrate Gemini into Search, its advertising products, and Chrome.
Architecture and scope
Google describes Gemini as a multimodal model built from the ground up to understand and combine different types of information, including text, code, audio, images, and video. According to Google, Gemini is designed to run efficiently everywhere from data centers to mobile devices, and the company pitches it to developers and enterprise customers as a foundation for building and scaling AI applications.
Model family
- Gemini Ultra: Positioned as Google's most capable model and a direct competitor to GPT-4; aimed at data center and enterprise applications. Google expects to release it next year.
- Gemini Pro: A mid-tier model intended to outperform the GPT-3.5 baseline. It will power various Google AI services and is already deployed in Bard.
- Gemini Nano: A lightweight model optimized for mobile devices. Pixel 8 Pro users can access Nano-powered features such as summaries in the Recorder app, smart replies in Gboard, video features, and improvements for photography and image editing.
Benchmarks and comparative testing
Google published a technical report titled "Gemini: A Family of Highly Capable Multimodal Models" that includes comparisons with GPT-4 and GPT-3.5. Google ran 32 benchmarks covering a broad range of tasks, from general language understanding to code generation.
According to the reported results, Gemini Ultra handles text input and output and can also process images, video, and audio. Google reports that Gemini Ultra outperformed GPT-4 on 30 of the 32 academic benchmarks commonly used in large language model research, covering areas such as natural image, audio, and video understanding, as well as mathematical reasoning.
Google reports that Gemini Ultra achieved a 90.0% score on MMLU (massive multitask language understanding) and was the first model in its evaluation to exceed human expert performance on that benchmark. MMLU assesses knowledge and problem-solving across 57 subjects, including mathematics, physics, history, law, medicine, and ethics. Google also reports a state-of-the-art score of 59.4% on MMMU, a multimodal reasoning benchmark composed of tasks that require cross-domain, deliberative reasoning.
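For readers unfamiliar with how such scores are produced, the sketch below shows one common way an MMLU-style accuracy is aggregated from multiple-choice predictions: per-subject accuracy plus a micro-average over all questions. It is purely illustrative; the subject names and records are placeholders, and Google's actual evaluation harness uses its own prompting and scoring setup.

```python
# Illustrative only: aggregating an MMLU-style score from multiple-choice answers.
# Subjects and records below are placeholders, not Gemini's actual results.
from collections import defaultdict

def mmlu_style_accuracy(records):
    """records: iterable of (subject, predicted_choice, correct_choice)."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, correct in records:
        per_subject[subject][0] += int(predicted == correct)
        per_subject[subject][1] += 1
    subject_acc = {s: c / t for s, (c, t) in per_subject.items()}
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    return subject_acc, total_correct / total_questions  # micro-average over all questions

# Tiny placeholder run with three of MMLU's 57 subjects.
records = [
    ("college_mathematics", "B", "B"),
    ("professional_law", "C", "A"),
    ("clinical_knowledge", "D", "D"),
]
subject_acc, overall = mmlu_style_accuracy(records)
print(subject_acc, f"overall={overall:.1%}")
```

Note that published leaderboards sometimes weight subjects differently (for example, macro-averaging per subject), so reported percentages can vary slightly depending on the aggregation used.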
Image-based evaluations reportedly show Gemini Ultra outperforming previous state-of-the-art models without assistance from OCR systems that extract text from images for further processing, which Google highlights as evidence of native multimodal capability and an early sign of more complex multimodal reasoning.
Design and multimodal capabilities
Google states that Gemini was pretrained with multimodal data and then fine-tuned on additional multimodal datasets to improve effectiveness. The reported multimodal capabilities include:
- Complex reasoning: Multimodal reasoning designed to help interpret complex written and visual information and to extract insights from large document collections.
- Understanding text, images, and audio: Training to recognize and interpret text, images, and audio jointly, enabling more informed responses on complex topics.
- Advanced coding: Capabilities for understanding, explaining, and generating high-quality code in mainstream programming languages such as Python, Java, C++, and Go (see the sketch after this list).
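To make the coding capability concrete, below is a minimal, hypothetical sketch of asking a Gemini model to generate code through Google's google-generativeai Python SDK. The model name ("gemini-pro"), the environment variable, and the assumption that developer API access is available are not taken from this article; treat the snippet as illustrative rather than an official integration guide.

```python
# Hypothetical usage sketch: asking Gemini Pro to generate code via the
# google-generativeai Python SDK (assumes the SDK is installed and an API key is set).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # key from Google AI Studio (assumption)

model = genai.GenerativeModel("gemini-pro")  # text/code tier described above

prompt = (
    "Write a Python function that parses an ISO 8601 date string and returns "
    "the day of the week, with a short explanation of the approach."
)
response = model.generate_content(prompt)
print(response.text)  # generated code plus explanation
```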
Coding systems and AlphaCode 2
Google used a specialized version of Gemini to create AlphaCode 2, an advanced code generation system aimed at competitive programming problems that involve complex mathematics and theoretical computer science. Compared with the original AlphaCode system released two years earlier, AlphaCode 2 reportedly solves nearly twice as many problems. Google estimates that AlphaCode 2 performs better than 85% of contest participants, compared with about 50% for the original AlphaCode. Performance reportedly improves further when programmers collaborate with the system by specifying properties that the generated code samples should satisfy.
Efficiency and infrastructure
Google reports Gemini was trained on its internally designed TPU v4 and v5e hardware, and claims it runs faster and at lower cost than previous models such as PaLM. Alongside Gemini, Google announced Cloud TPU v5p, a next-generation TPU system intended for training large generative AI models and accelerating model development for developers and enterprise customers.
Market impact and outlook
Industry observers note that Gemini appears to be a model capable of competing with GPT-4, which could drive further progress through competition. At the same time, Gemini Ultra is not yet embedded in consumer or enterprise products, so some view the announcement as an early signaling move while product integration progresses. The extent to which Gemini will affect the large-model market depends on real-world deployments, latency and cost characteristics, multilingual availability, and ecosystem adoption.
Discussion
How Gemini will influence the large-model landscape remains to be seen. Which aspects of Gemini's published capabilities do you find most consequential for developers and enterprise users?