Realtime API
A Realtime API is a WebSocket- or WebRTC-based LLM endpoint that streams audio in and audio out at the same time, enabling natural full-duplex conversation. OpenAI Realtime API, Gemini Live, ElevenLabs Conversational, and Cartesia Sonic are the 2026 leaders.
Before realtime APIs, voice agents chained STT, LLM, and TTS sequentially, so the latency of each stage added up to a 1-2 s round trip. Realtime APIs flip this: a single WebSocket or WebRTC connection streams audio in and audio out, with the model reasoning over that same connection. Total round-trip latency drops to 200-500 ms.
Architecture: WebRTC for audio transport (low jitter, NAT traversal) or WebSocket for simpler integrations; model-native voice modes (GPT-4o voice, Gemini audio); tool calling mid-conversation; and interrupt handling.
Production wins: voice agents finally feel natural, with UX that matches human conversation pacing.
Trade-offs: realtime APIs are expensive (audio tokens cost more than text tokens), session limits cap conversation length, and debugging a stream is harder than debugging request/response.
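The duplex flow and interrupt handling described above can be sketched without any vendor SDK. The example below is a minimal simulation, not a real Realtime API client: `speak`, `listen`, and `run_turn` are hypothetical names, the audio chunks are plain strings, and a `"speech"` frame stands in for whatever voice-activity signal a real service would send. It only illustrates the control flow: both directions run concurrently, and the assistant stops talking as soon as user speech is detected.

```python
import asyncio


async def speak(chunks, played, chunk_seconds=0.01):
    """Play assistant audio chunk by chunk (sleep stands in for playback time)."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)
        played.append(chunk)


async def listen(mic_frames, user_speech, frame_seconds=0.01):
    """Consume mic frames while the assistant talks; flag the first speech frame."""
    for frame in mic_frames:
        await asyncio.sleep(frame_seconds)
        if frame == "speech":  # hypothetical voice-activity marker
            user_speech.set()
            return


async def run_turn(response_chunks, mic_frames):
    """Run one assistant turn; return (chunks actually played, was it interrupted)."""
    played = []
    user_speech = asyncio.Event()
    speaking = asyncio.create_task(speak(response_chunks, played))
    listening = asyncio.create_task(listen(mic_frames, user_speech))
    barge_in = asyncio.create_task(user_speech.wait())

    # Audio in and audio out run at once; the turn ends when the response
    # finishes or the user starts talking, whichever comes first.
    done, _ = await asyncio.wait(
        {speaking, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and speaking not in done

    # Stop everything still running, including playback on an interruption.
    for task in (speaking, listening, barge_in):
        task.cancel()
    await asyncio.gather(speaking, listening, barge_in, return_exceptions=True)
    return played, interrupted


if __name__ == "__main__":
    long_reply = [f"chunk-{i}" for i in range(50)]
    played, interrupted = asyncio.run(
        run_turn(long_reply, ["silence", "silence", "speech"])
    )
    print(f"played {len(played)} of {len(long_reply)} chunks, interrupted={interrupted}")
```

In a real integration the two loops would read from a microphone and a network connection rather than lists, but the shape stays the same: playback is a cancellable task, and the interruption signal is what cancels it.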
When to use a Realtime API
- Production voice agents.
- Real-time multimodal demos.
Common mistakes
- Wrapping a Realtime API in your own STT + TTS — defeats the latency benefit.
- Not budgeting for audio token cost, which is much higher per minute than text.
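A rough budget for the audio cost mentioned above can be worked out before launch. The sketch below is plain arithmetic under stated assumptions: the function names are hypothetical, and the token rate and prices in the usage example are placeholders, not any provider's actual pricing. Substitute the figures from your provider's pricing page.

```python
def audio_cost_per_minute(tokens_per_second, usd_per_million_tokens):
    """Cost in USD of one minute of audio at a given token rate and price."""
    tokens_per_minute = tokens_per_second * 60
    return tokens_per_minute * usd_per_million_tokens / 1_000_000


def session_cost(minutes_in, minutes_out, tokens_per_second,
                 usd_per_million_in, usd_per_million_out):
    """Cost in USD of a call, billing user audio and assistant audio separately."""
    cost_in = minutes_in * audio_cost_per_minute(tokens_per_second, usd_per_million_in)
    cost_out = minutes_out * audio_cost_per_minute(tokens_per_second, usd_per_million_out)
    return cost_in + cost_out


if __name__ == "__main__":
    # Placeholder numbers only: 10 audio tokens per second,
    # $50 per million input tokens, $100 per million output tokens.
    per_call = session_cost(
        minutes_in=2, minutes_out=3, tokens_per_second=10,
        usd_per_million_in=50, usd_per_million_out=100,
    )
    print(f"estimated cost per call: ${per_call:.2f}")
    print(f"estimated cost per 10,000 calls: ${per_call * 10_000:,.2f}")
```

This treats each minute of audio as billed once. If your provider also bills accumulated conversation context on later turns, the real figure will be higher, so treat the result as a lower bound.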
Related terms
- Voice agent platform — A voice agent platform is a managed stack that combines STT + LLM + TTS + telephony into a single API for building production phone / voice agents — Vapi, Retell, Bland are the 2026 leaders.
- Barge-in — Barge-in is the voice-agent feature where the user can interrupt the assistant mid-response — the assistant detects the speech and stops talking — making conversations feel natural instead of robotic turn-taking.
- Voice (LLM apps) — Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/realtime-api.md.