
Realtime API

A Realtime API is a WebSocket- or WebRTC-based LLM endpoint that streams audio in and audio out over a single connection, enabling natural full-duplex conversation. OpenAI Realtime API, Gemini Live, ElevenLabs Conversational, and Cartesia Sonic are the 2026 leaders.

Before realtime APIs, voice agents chained speech-to-text (STT), an LLM, and text-to-speech (TTS) sequentially, which produced a 1-2 s round trip. Realtime APIs replace that pipeline with a single WebSocket or WebRTC connection that streams audio in and audio out while the model reasons on the same connection. Total round-trip latency drops to 200-500 ms.

Architecture: WebRTC for audio transport (low jitter, NAT traversal) or WebSocket for simpler integrations; model-native voice modes (GPT-4o voice, Gemini audio); tool calling mid-conversation; and interrupt handling.

Production wins: voice agents finally feel natural, and the UX matches human conversation pacing.

Trade-offs: realtime APIs are expensive (audio tokens cost more than text), session limits cap conversation length, and debugging a stream is harder than debugging request/response.
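To make interrupt handling and mid-conversation tool calling concrete, here is a minimal sketch of the client-side event loop, with no network I/O. The event names (`audio_delta`, `user_speech_started`, `tool_call`, `response_done`, `cancel_response`, `tool_result`) and the `DuplexSession` class are hypothetical, chosen for illustration only. They are not the event schema of any specific vendor, so map them to your provider's documented events before use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DuplexSession:
    """Toy model of one realtime voice session (hypothetical event names)."""

    tools: Dict[str, Callable[[dict], str]]
    playback_queue: List[bytes] = field(default_factory=list)  # audio waiting to be played
    outbox: List[dict] = field(default_factory=list)  # events we would send to the server
    assistant_speaking: bool = False

    def handle(self, event: dict) -> None:
        kind = event["type"]
        if kind == "audio_delta":
            # Model audio arrives in small chunks; queue each one for playback.
            self.assistant_speaking = True
            self.playback_queue.append(event["audio"])
        elif kind == "user_speech_started":
            # Barge-in: the user talked over the assistant, so drop any
            # unplayed audio and ask the server to stop generating.
            if self.assistant_speaking:
                self.playback_queue.clear()
                self.outbox.append({"type": "cancel_response"})
                self.assistant_speaking = False
        elif kind == "tool_call":
            # Run the requested tool locally and send the result back
            # on the same connection, without ending the conversation.
            result = self.tools[event["name"]](event["arguments"])
            self.outbox.append(
                {"type": "tool_result", "call_id": event["call_id"], "output": result}
            )
        elif kind == "response_done":
            self.assistant_speaking = False


def get_weather(args: dict) -> str:
    """Stand-in tool; a real one would call an external service."""
    return f"Sunny in {args['city']}"


session = DuplexSession(tools={"get_weather": get_weather})
for ev in [
    {"type": "audio_delta", "audio": b"\x00\x01"},
    {"type": "audio_delta", "audio": b"\x02\x03"},
    {"type": "user_speech_started"},  # user interrupts mid-reply
    {"type": "tool_call", "name": "get_weather", "call_id": "c1",
     "arguments": {"city": "Oslo"}},
]:
    session.handle(ev)

print(session.outbox)
```

In a real integration the `outbox` entries would be written to the WebSocket or WebRTC data channel, and the playback queue would feed an audio device. The point of the sketch is only the ordering: on barge-in, local playback is flushed first, then the server is told to cancel.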

When to use a Realtime API

Common mistakes

FAQ

What is a Realtime API?

A Realtime API is a WebSocket- or WebRTC-based LLM endpoint that streams audio in and audio out over a single connection, enabling natural full-duplex conversation. OpenAI Realtime API, Gemini Live, ElevenLabs Conversational, and Cartesia Sonic are the 2026 leaders.

When should I use a Realtime API?

Use one for production voice agents and real-time multimodal demos.

What are the most common mistakes with a Realtime API?

Wrapping a Realtime API in your own STT and TTS stages, which defeats the latency benefit. Not budgeting for audio token cost, which is much higher per minute than text.
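The budgeting mistake is easy to avoid with a back-of-the-envelope estimate before launch. The sketch below assumes pricing can be approximated as a per-minute rate for input audio and a separate rate for output audio. The `AudioPricing` type, the `estimate_session_cost` function, and the rates in `PLACEHOLDER_PRICING` are all hypothetical. The numbers are not real vendor prices, and some providers bill input audio for the full session rather than only while the user speaks, so substitute figures and rules from your provider's current price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioPricing:
    """Per-minute audio rates in USD (placeholder values, not real prices)."""

    input_per_min: float
    output_per_min: float


def estimate_session_cost(
    pricing: AudioPricing, minutes: float, user_talk_share: float = 0.5
) -> float:
    """Estimate the audio cost of one session.

    Simplifying assumption: the user speaks for `user_talk_share` of the
    session and the assistant speaks for the remainder.
    """
    if not 0.0 <= user_talk_share <= 1.0:
        raise ValueError("user_talk_share must be between 0 and 1")
    user_minutes = minutes * user_talk_share
    assistant_minutes = minutes - user_minutes
    return (
        user_minutes * pricing.input_per_min
        + assistant_minutes * pricing.output_per_min
    )


# Hypothetical rates for illustration only.
PLACEHOLDER_PRICING = AudioPricing(input_per_min=0.05, output_per_min=0.20)

per_call = estimate_session_cost(PLACEHOLDER_PRICING, minutes=10)
monthly = per_call * 1000  # e.g. 1,000 ten-minute calls per month

print(f"per call: ${per_call:.2f}, monthly: ${monthly:.2f}")
```

Running the same function with text-only rates for comparison gives a quick sense of how much of the budget the audio modality consumes.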

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/realtime-api.md.