DeepMind's Highly Capable Multimodal Model Gemini Reaches Human-Expert Level

Multimodal large language models (MLLMs) have recently emerged as a prominent research topic, leveraging the capabilities of powerful large language models (LLMs) to perform diverse multimodal tasks. MLLMs' notable abilities, such as writing stories based on images and performing mathematical reasoning without OCR, mark a departure from traditional methods and suggest a potential trajectory toward artificial general intelligence.

Building on this trajectory, the Google DeepMind research team introduces an innovative family of multimodal models in the new paper Gemini: A Family of Highly Capable Multimodal Models. The Gemini models demonstrate strong capabilities across image, audio, video, and text understanding, pushing the boundaries of large-scale language modeling, image interpretation, audio processing, and video understanding.

The foundation of the Gemini models is the Transformer decoder (Vaswani et al., 2017), enhanced with architectural and model-optimization improvements that enable stable training at scale and optimized inference on Google's Tensor Processing Units (TPUs). The first version, Gemini 1.0, comes in three sizes: Ultra, designed for highly complex tasks; Pro, offering enhanced performance and large-scale deployability; and Nano, tailored for on-device applications. Each size addresses different computational constraints and application requirements.
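To make the cited building block concrete, here is a minimal sketch of a decoder-only Transformer block in PyTorch. It illustrates the general architecture the paper references, not DeepMind's actual implementation; the layer names and hyperparameters are illustrative assumptions.

```python
# A minimal decoder-only Transformer block of the kind Gemini builds on
# (Vaswani et al., 2017). Illustrative sketch only, not DeepMind's code;
# all names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual connection
        x = x + self.mlp(self.norm2(x))  # pre-norm MLP with residual
        return x

# Usage: a batch of 2 sequences, 16 tokens each, embedded to 512 dims.
block = DecoderBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```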

Gemini models are trained to seamlessly integrate text input with a variety of audio and visual inputs, including natural images, charts, screenshots, PDFs, and videos, and can generate text and image outputs. Of particular note is Gemini's ability to handle variable input resolutions, allowing it to understand video and to allocate more compute to tasks that require fine-grained understanding. Because Gemini ingests audio signals directly rather than naively mapping audio to text, it also captures nuances that a plain transcription would miss.
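One simple way to realize such an interleaved, variable-resolution interface is to convert every modality into a shared token sequence, with higher-resolution inputs occupying more tokens. The sketch below illustrates that general idea with an assumed patch size, embedding dimension, and projection; it is not the paper's actual preprocessing pipeline.

```python
# Illustrative sketch of interleaving text and image inputs as one token
# sequence for a multimodal decoder, in the spirit of Gemini's interface.
# Patch size, dimensions, and projections here are assumptions.
import torch
import torch.nn as nn

d_model = 512
text_embed = nn.Embedding(32_000, d_model)     # toy text vocabulary
patch_proj = nn.Linear(3 * 16 * 16, d_model)   # 16x16 RGB patches -> tokens

def image_to_tokens(image: torch.Tensor) -> torch.Tensor:
    """Split a (3, H, W) image into 16x16 patches, project each to d_model.
    Larger H and W yield more patches, so higher-resolution inputs get
    more tokens (and hence more compute) in the sequence."""
    patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (3, H/16, W/16, 16, 16)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)
    return patch_proj(patches)                           # (num_patches, d_model)

# Interleave: [text prompt tokens][image patch tokens][more text tokens]
prompt = text_embed(torch.tensor([5, 17, 42]))            # (3, d_model)
image_tokens = image_to_tokens(torch.randn(3, 224, 224))  # (196, d_model)
suffix = text_embed(torch.tensor([7, 99]))                # (2, d_model)
sequence = torch.cat([prompt, image_tokens, suffix], dim=0)
print(sequence.shape)  # torch.Size([201, 512])
```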

Developing the Gemini family required innovations in training algorithms, datasets, and infrastructure. The Pro model benefits from the inherent scalability of Google's infrastructure and learning algorithms, completing pre-training in a matter of weeks with a fraction of the Ultra model's resources. The Nano series leverages advances in distillation and training algorithms to produce best-in-class small language models for tasks such as summarization and reading comprehension, powering next-generation on-device experiences.
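Distillation here refers to the general technique of training a small student model to match a larger teacher's output distribution. The sketch below shows the classic recipe (Hinton et al., 2015) with assumed hyperparameters; the paper does not disclose Nano's actual training procedure.

```python
# A minimal knowledge-distillation loss: generic recipe with assumed
# names and hyperparameters, not Gemini Nano's actual training setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend cross-entropy on hard labels with a KL term pulling the
    student's softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients match the hard-label term
    return alpha * hard + (1 - alpha) * soft

# Usage: batch of 4 examples over a 10-way toy vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```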

Extensive evaluation across a wide range of benchmarks reveals that the Gemini Ultra model advances the state of the art on 30 of 32 benchmarks and is the first model to achieve human-expert performance on the widely studied exam benchmark MMLU. The team is optimistic about the transformative potential of the Gemini models for cross-modal reasoning and language understanding, and envisions a wide range of use cases. The researchers emphasize their commitment to deploying these models responsibly, prioritizing ethical considerations in user-facing applications.

The paper Gemini: A Family of Highly Capable Multimodal Models is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss out on news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.