technique

Streaming response

Streaming response is the LLM API pattern in which tokens are sent to the client incrementally, over Server-Sent Events or a WebSocket, while the model is still generating. It sharply reduces perceived latency, enables progressive UI updates, and is effectively required for interactive UX.

A non-streaming LLM call waits for the full response before returning anything: for a 1000-token response, the user stares at a loading spinner for 5-10 seconds. Streaming flips this. Tokens arrive as they are generated, the UI renders them in real time, and the user typically sees the first word within 100-300 ms.

Implementation: the API responds over Server-Sent Events (`text/event-stream`) or a WebSocket; the client SDK parses delta events into token strings; the UI appends each one to a buffer. Most AI SDKs handle this, and UI tooling such as the Vercel AI SDK with React or SvelteKit ships streaming-friendly hooks.

Trade-offs: error handling is harder (the stream can deliver a partial response and then fail), tool-call streaming is complex since tool arguments typically arrive as partial fragments, and full responses are harder to cache. By 2026 non-streaming chat UX is rare; only batch jobs skip it.
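The client-side parsing step (SSE lines in, token strings out, appended to a buffer) can be sketched roughly as follows. This is a minimal illustration, not any particular SDK's implementation: the `{"delta": ...}` payload shape, the `[DONE]` sentinel, and the function names `iter_sse_data` and `iter_tokens` are assumptions made for the example, and the wire data is simulated with a list rather than read from a real HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event.

    In the SSE format an event is a run of lines ended by a blank line,
    and multiple `data:` lines in one event are joined with newlines.
    """
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the event collected so far, if any.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Turn SSE payloads into text deltas (payload shape is assumed)."""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            return
        event = json.loads(payload)
        delta = event.get("delta", "")  # hypothetical field name
        if delta:
            yield delta


# Simulated wire data; a real client would read these lines from the HTTP body.
SAMPLE_STREAM = [
    ": keep-alive\n",
    'data: {"delta": "Hello"}\n',
    "\n",
    'data: {"delta": ", "}\n',
    "\n",
    'data: {"delta": "world"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]

buffer = ""
for token in iter_tokens(SAMPLE_STREAM):
    buffer += token  # a UI would re-render here instead of waiting for the end
print(buffer)
```

Because both functions are generators, each token is handed to the caller as soon as its event is complete, which is what lets the UI update before the response has finished.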

When to use streaming response

Common mistakes

FAQ

What is streaming response?

A pattern in which an LLM API sends tokens to the client incrementally, over Server-Sent Events or a WebSocket, as the model generates them. It cuts perceived latency and lets the UI update progressively, which makes it the default for interactive UX.

When should I use streaming response?

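Use it for any interactive chat or agent UX where a person is waiting on the output. Batch jobs, where nobody watches the response arrive, can skip it.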

What are the most common mistakes with streaming response?

Buffering the whole stream before returning it, which defeats the purpose. Forgetting to handle a response that arrives partially and then errors.
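Both mistakes can be avoided with a small consumer loop, sketched below under stated assumptions. The names `StreamResult`, `consume_stream`, and `flaky_stream` are hypothetical, and the broad `except Exception` stands in for whatever error types a real SDK raises. The sketch forwards each token the moment it arrives and, if the stream fails part-way, keeps the text received so far and marks the result as incomplete.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamResult:
    text: str                    # everything received before the stream ended
    complete: bool               # False if the stream broke part-way
    error: Optional[str] = None  # message from the failure, if any


def consume_stream(
    tokens: Iterable[str],
    on_token: Callable[[str], None],
) -> StreamResult:
    """Forward each token as it arrives and survive a mid-stream failure."""
    parts = []
    try:
        for token in tokens:
            parts.append(token)
            on_token(token)  # forward immediately; do not wait for the end
    except Exception as exc:  # a real client would catch its SDK's error types
        return StreamResult("".join(parts), complete=False, error=str(exc))
    return StreamResult("".join(parts), complete=True)


def flaky_stream() -> Iterator[str]:
    """Hypothetical stream that fails after two tokens."""
    yield "The answer "
    yield "is "
    raise ConnectionError("connection reset")


shown = []  # stands in for the UI: records tokens in arrival order
result = consume_stream(flaky_stream(), shown.append)
print(result)
```

Returning the partial text alongside a `complete` flag lets the caller decide whether to show a retry prompt, keep the fragment, or discard it, rather than silently presenting a truncated answer as finished.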

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/streaming-response.md.