concept

Response streaming

Response streaming pipes the model's output token-by-token to the client as it's generated, so users see text appearing in real time instead of waiting for the full answer.

Streaming is effectively required for any user-facing LLM feature in 2026: without it, the user stares at a spinner for 5-30 seconds while the full response is generated. Every major API supports streaming over server-sent events (SSE) or WebSockets. The engineering trade-offs: streaming complicates error handling when a failure occurs mid-response, makes structured output (JSON) harder to validate progressively, and prevents the server from inspecting the full output before any of it is sent. For tool-use loops, stream the user-facing answer but buffer tool calls until they are complete.
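The tool-use advice above can be sketched as follows. This is a minimal illustration, not any provider's actual wire format: the event types (`text_delta`, `tool_delta`, `tool_end`), the `partial_json` field, and the `[DONE]` sentinel are all hypothetical, and the SSE parsing is simplified to single-line `data:` fields.

```python
import json
from typing import Callable, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line.

    Simplification: real SSE allows multi-line data fields and other
    field names; this sketch only handles one JSON object per line.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # hypothetical end-of-stream marker
            return
        yield json.loads(payload)


def route_events(events: Iterable[dict], on_text: Callable[[str], None]) -> List[dict]:
    """Forward text deltas immediately; buffer tool calls until complete."""
    tool_buffer: List[str] = []
    completed_calls: List[dict] = []
    for event in events:
        kind = event["type"]  # hypothetical event schema
        if kind == "text_delta":
            on_text(event["text"])  # user-facing text: show right away
        elif kind == "tool_delta":
            tool_buffer.append(event["partial_json"])  # not yet parseable
        elif kind == "tool_end":
            # Only now is the argument JSON complete and safe to parse.
            completed_calls.append(json.loads("".join(tool_buffer)))
            tool_buffer.clear()
    return completed_calls
```

In a real client, `on_text` would append to the UI and the returned tool calls would be executed before the next model turn; both are left to the caller here.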

When to use response streaming

Common mistakes

FAQ

What is response streaming?

Response streaming pipes the model's output token-by-token to the client as it's generated, so users see text appearing in real time instead of waiting for the full answer.

When should I use response streaming?

Use it for any chat or assistant UX, and for long-form generation where users want immediate feedback.

What are the most common mistakes with response streaming?

Streaming JSON without progressive validation: a partial JSON document fails to parse until the final chunk arrives. Omitting abort/cancel on the client: runaway streams keep generating after the user navigates away.
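Both mistakes can be guarded against in a few lines. The sketch below is one possible approach under stated assumptions: `collect_json` is a hypothetical helper, the streamed document is assumed to be a single top-level JSON object, and cancellation is modeled with a `threading.Event` that the UI layer would set when the user leaves.

```python
import json
import threading
from typing import Iterable, Optional


def collect_json(chunks: Iterable[str], cancelled: threading.Event) -> Optional[dict]:
    """Accumulate streamed chunks and return the object once it parses.

    Returns None if the stream is cancelled or ends while the JSON is
    still incomplete. Assumes a top-level JSON object, so no strict
    prefix of the document is itself valid JSON.
    """
    buffer = ""
    for chunk in chunks:
        if cancelled.is_set():
            # Stop consuming; the caller should also close the underlying
            # connection so the server stops generating.
            return None
        buffer += chunk
        try:
            return json.loads(buffer)  # succeeds only when complete
        except json.JSONDecodeError:
            continue  # still partial: keep buffering, do not surface it
    return None  # stream ended before the JSON was complete
```

Re-parsing the whole buffer on every chunk is quadratic in the worst case; for large payloads an incremental parser would be a better fit, but the control flow stays the same.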

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/response-streaming.md.