What is multimodal AI?
Multimodal AI is a type of artificial intelligence that can understand and process different types of information, such as text, images, audio, and video, all at the same time. Multimodal gen AI models produce outputs based on these various inputs.
When you wake up in the morning, you reorient yourself into the world in a variety of ways. Before you open your eyes, you might hear the ambient sounds in your room (unless a not-so-ambient sound was what woke you up in the first place).
You might feel cosy under the covers, or cold because you kicked them off while you were sleeping. And once you open your eyes, you get a visual sense of what is going on in your room. These sensory impressions, along with the moods they evoke, create a nuanced perception of the morning and set you up for the rest of your day.
How multimodal gen AI models work
Multimodal gen AI models work in a similar way: much as humans combine their senses to perceive reality, these models mirror the brain’s ability to integrate multiple sensory inputs into a nuanced, holistic understanding of the world.
These gen AI models’ ability to seamlessly perceive multiple inputs—and simultaneously generate output—allows them to interact with the world in innovative, transformative ways and represents a significant advancement in AI.
By combining the strengths of different types of content (including text, images, audio, and video) from different sources, multimodal gen AI models can understand data more comprehensively, which enables them to handle more complex queries and produce fewer hallucinations (inaccurate or misleading outputs).
Today, enterprises that have deployed gen AI primarily use text-based large language models (LLMs). But a shift toward multimodal AI is underway, with the potential for a larger range of applications and more complex use cases.
Multimodal gen AI models are well suited to the moment’s demands on business. As Internet of Things (IoT)–enabled devices collect more types and greater volumes of data than ever before, organisations can use multimodal AI models to process and integrate multisensory information and then deliver the increasingly personalised experiences that customers seek in retail, healthcare, and entertainment.
Multimodal gen AI models can also make technology more accessible to nontechnical users. Because the models can process multisensory inputs, users are able to interact with them by speaking, gesturing, or using an augmented reality or virtual reality controller. The ease of use also means that more people of varying abilities can reap the benefits that gen AI offers, such as increased productivity.
Finally, AI models in general are becoming less expensive and more powerful with each passing month. Not only is their performance improving, but the time it takes to generate results is decreasing, as is the number of unintended outputs or errors. What is more, the cost of building these models is falling sharply.
For example, researchers at Sony AI recently demonstrated that a model that cost $100,000 to train in 2022 can now be trained for less than $2,000.
The field of multimodal AI is evolving quickly, with new models and innovative use cases emerging every day, reshaping what is possible with AI. In this Explainer, we will explore how multimodal gen AI models work, what they are used for, and where the technology is headed next.
Four steps to process information
Multimodal AI models typically consist of multiple neural networks, each tailored to process—or “encode”—one specific format, such as text, images, audio, or video. The outputs are then combined through various fusion techniques, and in the final step, a classifier translates the fused outputs into a prediction or decision. Here is more about each step, followed by a simplified sketch in code:
Data input and preprocessing. Data from different formats is gathered and preprocessed. Types of preprocessing include tokenising text, resizing images, and converting audio to spectrograms.
Feature encoding. Encoder tools within individual neural networks transform the data (such as a picture or a sentence) into machine-readable feature vectors, or embeddings (typically represented by a series of numbers). Each modality is generally processed differently. For example, image pixels can be converted into feature vectors via CLIP (contrastive language–image pretraining), while text could be embedded using transformer architectures, such as those that power OpenAI’s GPT series.
Fusion mechanisms. Encoded data from the different modalities is mapped into a shared space using various fusion mechanisms, which merge the embeddings from the different modalities into a single layer. The fusion step allows the model to dynamically focus on the parts of the data that are most relevant to the task. Fusion also helps the model learn the relationships between the different modalities, enabling cross-modal understanding.
Generative modelling. The generative step converts the data fused in the previous step into actionable outputs. For example, in image captioning, the model might generate a sentence that describes the image. Different models use different techniques; some adopt autoregressive methods to predict the next element in a sequence, while others use generative adversarial networks (GANs) or variational autoencoders (VAEs) to create outputs.
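To make these four steps concrete, here is a minimal, illustrative sketch in PyTorch. It is a toy model, not any production architecture: the module names, dimensions, and the simple concatenation-based fusion are assumptions chosen for brevity (real systems typically use far larger encoders and cross-attention fusion).

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, num_classes=10):
        super().__init__()
        # Step 2 (feature encoding): one encoder per modality maps raw input
        # to a fixed-size feature vector.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, embed_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        self.image_encoder = nn.Sequential(  # expects 3 x 224 x 224 image tensors
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Step 3 (fusion): a simple concatenate-and-project layer; real systems
        # often use cross-attention instead.
        self.fusion = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.ReLU())
        # Step 4 (output): a head that turns the fused representation into a prediction.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, image):
        # Step 1 (preprocessing: tokenising text, resizing images) happens upstream.
        text_feat = self.text_encoder(token_ids).mean(dim=1)  # pool over tokens
        image_feat = self.image_encoder(image)
        fused = self.fusion(torch.cat([text_feat, image_feat], dim=-1))
        return self.head(fused)

model = ToyMultimodalModel()
logits = model(torch.randint(0, 10000, (1, 16)), torch.randn(1, 3, 224, 224))
```

In this toy example, the two encoders correspond to step two, the concatenate-and-project layer stands in for the fusion mechanisms of step three, and the final linear head is a stand-in for the generative step; a full multimodal gen AI model would replace that head with a decoder that generates text, images, or audio.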
How do multimodal models compare with text-only models?
LLMs are efficient and cost-effective for text-based applications. Multimodal models, which cost roughly twice as much per token as LLMs, can take on more complex tasks by integrating multiple data types, such as text and images. Despite that added complexity, multimodal AI models are typically not significantly slower than text-only models.
How to use multimodal gen AI models
Organisations looking to implement multimodal gen AI can consider the following use cases:
Accelerating creative processes in marketing and product design. Organisations can use multimodal AI models to design personalised marketing campaigns that seamlessly blend text, images, and video. On the product side, they can use multimodal AI to generate product prototypes.
Reducing fraud in insurance claims. Multimodal models can reduce fraud in the insurance industry by cross-checking a diverse set of data sources, including customer statements, transaction logs, and claim supplements such as photos or videos. More efficient fraud detection can streamline the processing of claims for legitimate cases.
Enhancing trend detection. By analysing unstructured data from diverse sources, including social media posts, images, and videos, organisations can spot emerging trends and tailor their marketing strategies and products to resonate with local audiences.
Transforming patient care. Multimodal AI can change patient care dramatically by enabling virtual assistants to communicate through text, speech, images, videos, and gestures, making interactions more intuitive, empathetic, and personalised.
Providing real-time support in call centres and healthcare. Multimodal models can use low-latency voice processing to enable real-time assistance for patients through call centres and medical-assistance platforms.
In call centres, these models can listen to customer interactions, transcribe concerns, and give agents instant recommendations to pass along to customers. In medical settings, they can transcribe and analyse patient symptoms and then suggest next steps, all while maintaining seamless, natural conversations with the patients themselves. This capability enhances decision-making and patient satisfaction.
Streamlining user interaction testing. Multimodal AI can revolutionise automated user interaction testing by simulating interactions across web browsers, applications, and games. By analysing both code and visual data, this capability can autonomously verify accessibility standards, such as screen reader compatibility and colour contrast, while also assessing the overall user experience.
By bringing together a diverse set of formats and data types, these models can produce information that empowers leaders and their companies to stay competitive and innovative. The companies that invest early in these use cases may need to address some new technical risks but may also gain an advantage by being first movers.
How to access and deploy multimodal AI
The majority of organisations using multimodal AI are likely to be categorised as takers. This means they will deploy user-friendly applications that are built on pretrained models from third-party providers.
Other organisations will want to customise out-of-the-box systems to improve performance in their specific use cases; these companies will be called shapers. Potential customisations include fine-tuning the model to reduce costs and improve performance on specific tasks, training the model on proprietary data, building scaffolding for continuous feedback and active learning, and adding guardrails to prevent unwanted responses and help the model behave more responsibly.
A final category of companies will be makers, which tend to be technologically advanced organisations that train their models in-house. This training can cost millions of dollars and requires specialised technical expertise and access to sophisticated hardware.
For organisations that strive to be makers, a robust and user-friendly multimodal application requires several critical factors: an intuitive user interface, a powerful backend infrastructure (including a multimodal search pipeline that’s capable of understanding relationships across different data types), efficient strategies to deploy the model, and stringent data cleaning, security, and privacy protocols to protect user information.
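As one illustration of what a multimodal search pipeline involves, the sketch below shows a single cross-modal retrieval step, assuming text and images have already been embedded into one shared vector space (for instance, by a CLIP-style encoder). The function and variable names are hypothetical, not any specific product’s API.

```python
import numpy as np

def search_images_by_text(query_embedding: np.ndarray,
                          image_embeddings: np.ndarray,
                          top_k: int = 5) -> np.ndarray:
    """Return the indices of the top_k images most similar to a text query,
    given embeddings that already live in a shared text-image space."""
    # Normalise so the dot product is cosine similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = imgs @ q                    # similarity of every indexed image to the query
    return np.argsort(-scores)[:top_k]   # highest-scoring images first

# Usage with dummy data: 1,000 indexed image embeddings of dimension 256.
hits = search_images_by_text(np.random.randn(256), np.random.randn(1000, 256))
```

A production pipeline would add an approximate-nearest-neighbour index, access controls, and the data cleaning and privacy safeguards mentioned above, but the core idea of comparing embeddings across modalities is the same.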
Developing multimodal model architectures presents significant challenges, particularly when it comes to alignment and colearning. Alignment ensures that the modalities are properly synchronised with each other: more specifically, that audio output aligns with the corresponding video or that speech output aligns with the corresponding text.
Colearning allows models to recognise and use correlations across modalities without succumbing to negative transfer (where a model’s learning from one modality actually hinders its comprehension of another).
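One common way to learn that kind of cross-modal alignment is a contrastive objective of the sort CLIP popularised: paired image and text embeddings are pulled together in the shared space while unpaired ones are pushed apart. The sketch below is a minimal, generic version of such a loss, not the exact formulation any particular model uses.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Normalise so the similarity matrix holds cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(image_embeds))  # the i-th image pairs with the i-th text
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Usage with a dummy batch of eight paired embeddings of dimension 256.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Minimising a loss of this kind during pretraining is what lets an encoder such as CLIP place an image and its caption near each other in the shared embedding space that the fusion step relies on.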
Examples of working with multimodal AI
Life sciences companies are using multimodal AI to transform both drug discovery and clinical care delivery. Leading foundation models—a type of AI model trained on massive, general-purpose data sets—can accept a protein’s amino acid sequence (that is, the sequence of letters that represents the different molecules that make up the protein) as an input.
The scientists behind AlphaFold, an AI system developed by Google DeepMind, were honoured by the Nobel committee in 2024 for constructing a model that can predict a protein’s 3D structure in just a couple of minutes.
In the past, this process would have taken several months and required expensive experimental methods, such as X-ray crystallography. Another example is ESM-3, which goes a step further than AlphaFold.
It not only predicts the protein’s structure but also captures its functional and evolutionary information in a single, unified model. ESM-3 uses multimodal AI to learn simultaneously from sequences, structures, and biological annotations (like metadata), which enables the model to determine what a protein looks like, what it does, and how it evolved—all at once.
In clinical healthcare, single-modality foundation models have already outperformed clinical experts in certain tasks, such as mammography. The multimodal foundation models that are currently in development could simultaneously consider an X-ray, mammogram, doctors’ notes, medical history, and genetic test results, generating a holistic picture of a patient’s risk of developing cancer rather than an isolated data point on their cancer risk matrix.
What risks are associated with multimodal AI?
Multimodal AI carries the same risks and limitations as other gen AI applications, including bias, data privacy concerns, and exposure to expanding AI regulations.