# Multimodal model

**Source:** https://promtable.com/glossary/multimodal

> A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.

---
A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.

Multimodal capability went from research to production between 2023 and 2026. GPT-4o, Claude 4 Sonnet, Gemini 2 Pro, and Llama 4 Maverick all accept image input natively; Gemini and GPT-4o also accept audio and video. "Native" multimodality (a single transformer that processes all modalities in a shared representation space) outperforms older "adapter" approaches that bolt a vision encoder onto a frozen text model. In 2026 multimodal is the default — text-only inputs are an opt-in optimisation for cost-sensitive paths.

## When to use

- Document QA over PDFs / scanned forms.
- Image moderation, alt-text, accessibility.
- Video summarisation (with audio).
- Voice agents that need image awareness.

## Common mistakes

- Sending 4K images when 1024px would suffice — token cost on image input is significant.
- Treating audio input like text — different providers handle voice activity differently.

## Related terms

- [context-window](https://promtable.com/glossary/context-window)
- [reasoning-model](https://promtable.com/glossary/reasoning-model)
- [function-calling](https://promtable.com/glossary/function-calling)

*Last updated: 2026-06-01*
---

Original page: https://promtable.com/glossary/multimodal
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/multimodal".
Contact: info@vibecodingturkey.com.