concept

Multimodal model

A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.

Multimodal capability went from research to production between 2023 and 2026. GPT-4o, Claude 4 Sonnet, Gemini 2 Pro, and Llama 4 Maverick all accept image input natively; Gemini and GPT-4o also accept audio and video. "Native" multimodality (a single transformer that processes all modalities in a shared representation space) outperforms older "adapter" approaches that bolt a vision encoder onto a frozen text model. In 2026 multimodal is the default — text-only inputs are an opt-in optimisation for cost-sensitive paths.

When to use multimodal model

Common mistakes

FAQ

What is multimodal model?

A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.

When should I use multimodal model?

Document QA over PDFs / scanned forms. Image moderation, alt-text, accessibility. Video summarisation (with audio). Voice agents that need image awareness.

What are the most common mistakes with multimodal model?

Sending 4K images when 1024px would suffice — token cost on image input is significant. Treating audio input like text — different providers handle voice activity differently.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/multimodal.md.