Multimodal model
A multimodal model accepts more than one input type (text plus images, audio, or video) and reasons over them jointly within a single model.
Multimodal capability moved from research to production between 2023 and 2026. GPT-4o, Claude Sonnet 4, Gemini 2 Pro, and Llama 4 Maverick all accept image input natively; Gemini and GPT-4o also accept audio, with video support varying by provider and API. "Native" multimodality (a single transformer that processes all modalities in a shared representation space) is generally reported to outperform older "adapter" approaches that bolt a vision encoder onto a frozen text model. In 2026 multimodal input is the default; text-only input is an opt-in optimisation for cost-sensitive paths.
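To make "shared representation space" concrete, here is a toy sketch of how text tokens and image patches can end up in one sequence of equal-width vectors that a single transformer could then attend over. It is an illustration under stated assumptions, not any vendor's architecture: the embedding width, patch size, vocabulary, and random weights are all made up, and real models use learned weights, positional information, and far larger dimensions.

```python
import random

D_MODEL = 8  # toy embedding width shared by every modality (assumption)
PATCH = 2    # toy patch side length in pixels (assumption)

rng = random.Random(0)  # fixed seed so the toy weights are reproducible


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of random floats."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Text side: one vector per vocabulary entry (hypothetical five-word vocabulary).
VOCAB = {"what": 0, "is": 1, "in": 2, "this": 3, "image": 4}
TEXT_TABLE = random_matrix(len(VOCAB), D_MODEL)

# Image side: a linear projection from flattened patch pixels to D_MODEL.
PATCH_PROJ = random_matrix(PATCH * PATCH, D_MODEL)


def embed_text(words):
    """Look up one D_MODEL-wide vector per word."""
    return [TEXT_TABLE[VOCAB[w]] for w in words]


def patchify(image):
    """Split a 2D grayscale image (list of rows) into flattened PATCH x PATCH tiles."""
    patches = []
    for top in range(0, len(image), PATCH):
        for left in range(0, len(image[0]), PATCH):
            patches.append(
                [image[top + r][left + c] for r in range(PATCH) for c in range(PATCH)]
            )
    return patches


def embed_image(image):
    """Project each flattened patch to a D_MODEL-wide vector."""
    return [
        [sum(p * PATCH_PROJ[i][j] for i, p in enumerate(patch)) for j in range(D_MODEL)]
        for patch in patchify(image)
    ]


def build_sequence(words, image):
    """Concatenate image and text vectors into one sequence for a single model."""
    return embed_image(image) + embed_text(words)


# A 4x4 toy image yields four 2x2 patches; five words yield five text vectors.
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
sequence = build_sequence(["what", "is", "in", "this", "image"], image)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is only that both modalities become vectors of the same width in one sequence. An adapter-style design would instead run a separate, pre-trained vision encoder and map its outputs into a frozen text model.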
When to use a multimodal model
- Document QA over PDFs / scanned forms.
- Image moderation, alt-text, accessibility.
- Video summarisation (with audio).
- Voice agents that need image awareness.
Common mistakes
- Sending 4K images when 1024px would suffice — token cost on image input is significant.
- Treating audio input like text — different providers handle voice activity differently.
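The image-size point can be checked with simple arithmetic before any request is sent. The sketch below computes dimensions that fit within a 1024px longest side and compares a rough token estimate before and after. The `PIXELS_PER_TOKEN` constant and both function names are assumptions for illustration; each provider publishes its own image-token formula, and many resize large images server-side, so treat the numbers as a ballpark only.

```python
PIXELS_PER_TOKEN = 750  # assumed rule-of-thumb constant; check your provider's docs


def fit_within(width, height, max_side=1024):
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    return (
        max(1, round(width * max_side / longest)),
        max(1, round(height * max_side / longest)),
    )


def estimate_image_tokens(width, height):
    """Rough token estimate from pixel count (hypothetical heuristic)."""
    return (width * height) // PIXELS_PER_TOKEN


original = (3840, 2160)  # a 4K frame
resized = fit_within(*original)

print(resized)
print(estimate_image_tokens(*original), estimate_image_tokens(*resized))
```

Under these assumptions the 4K frame shrinks to 1024x576 and the estimate drops from roughly 11,000 tokens to under 800. The actual resize would be done with an image library before upload; only the dimension and cost arithmetic is shown here.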
FAQ
What is a multimodal model?
A multimodal model accepts more than one input type (text plus images, audio, or video) and reasons over them jointly within a single model.
When should I use a multimodal model?
Use one for document QA over PDFs or scanned forms; image moderation, alt-text, and accessibility; video summarisation (with audio); and voice agents that need image awareness.
What are the most common mistakes with multimodal models?
Sending 4K images when 1024px would suffice, since token cost on image input is significant, and treating audio input like text, since different providers handle voice activity differently.
Related terms
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
- Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
- Function calling (tool use) — Function calling lets a language model emit a structured request to invoke a developer-defined tool, enabling reliable JSON output and agent workflows.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/multimodal.md.