LLM observability
LLM observability is the production discipline of capturing requests, responses, latencies, costs, and outcomes across LLM-driven systems — the prerequisite for debugging, evaluating, and optimizing AI features in 2026.
Traditional APM (Datadog, New Relic) does not, by default, capture what matters for LLM apps: token counts, prompts, completions, tool calls, and eval scores. LLM observability tools (Langfuse, Helicone, Braintrust, LangSmith, Arize Phoenix, OpenLLMetry) capture, per request:
- the full prompt and completion
- model and version
- latency, time to first token (TTFT), and tokens per second (TPS)
- token usage and cost
- tool calls and their return values
- multi-step trace correlation
- user feedback signals (thumbs up / down)
- eval scores
The data flows into dashboards (cost per user, error rate by model), datasets (curated from production for evals), and alerts (cost spikes, hallucination rate). Without observability, prompt regressions ship silently. With it, each change can be checked against recorded traces and eval scores.
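As a rough illustration of the per-request data described above, the sketch below wraps a model call and returns one record with the prompt, completion, latency, token usage, and estimated cost. It is not the API of any of the tools named here: `LLMCallRecord`, `observed_call`, `fake_model`, the model name, and the prices in `PRICES_PER_1K` are all hypothetical, and the model function is assumed to return its own token counts.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical prices in USD per 1,000 tokens; real prices depend on the provider.
PRICES_PER_1K = {"example-model-v1": {"input": 0.50, "output": 1.50}}


@dataclass
class LLMCallRecord:
    """One observed LLM request (illustrative schema, not a real tool's)."""
    trace_id: str                      # shared by all steps of a multi-step run
    model: str                         # model name plus version
    prompt: str
    completion: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    tool_calls: list = field(default_factory=list)
    feedback: Optional[int] = None     # +1 thumbs up, -1 thumbs down
    eval_scores: dict = field(default_factory=dict)


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost from token counts and the price table above."""
    price = PRICES_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def observed_call(call_model: Callable[[str, str], dict], model: str, prompt: str,
                  trace_id: Optional[str] = None) -> LLMCallRecord:
    """Call the model and return a record of what happened."""
    start = time.perf_counter()
    response = call_model(model, prompt)
    latency_s = time.perf_counter() - start
    return LLMCallRecord(
        trace_id=trace_id or uuid.uuid4().hex,
        model=model,
        prompt=prompt,
        completion=response["text"],
        latency_s=latency_s,
        input_tokens=response["input_tokens"],
        output_tokens=response["output_tokens"],
        cost_usd=estimate_cost_usd(model, response["input_tokens"],
                                   response["output_tokens"]),
        tool_calls=response.get("tool_calls", []),
    )


def fake_model(model: str, prompt: str) -> dict:
    """Stand-in for a real client so the sketch runs offline."""
    return {"text": "Paris", "input_tokens": 12, "output_tokens": 3, "tool_calls": []}


record = observed_call(fake_model, "example-model-v1", "What is the capital of France?")
print(record.model, record.input_tokens + record.output_tokens, round(record.cost_usd, 4))
```

Measuring TTFT and TPS would additionally require a streaming client, which this sketch leaves out. Passing the same `trace_id` to several calls is one simple way to correlate the steps of a multi-step run.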
When to use LLM observability
- Any production LLM feature — non-negotiable.
- Pre-prod: capture dev runs for offline eval.
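One way to capture dev runs for offline eval, sketched under simple assumptions: each run is appended to a JSON Lines file and later loaded as a list of eval cases. The file name, the `prompt` / `completion` keys, and the helper names `append_run` and `load_eval_cases` are hypothetical; a hosted observability tool would normally replace the file with its own dataset feature.

```python
import json
import tempfile
from pathlib import Path


def append_run(path: Path, run: dict) -> None:
    """Append one captured run as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(run) + "\n")


def load_eval_cases(path: Path) -> list:
    """Turn captured runs into eval cases, skipping runs with no completion."""
    cases = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            run = json.loads(line)
            if not run.get("completion"):
                continue  # nothing to evaluate without the model's output
            cases.append({"input": run["prompt"],
                          "observed_output": run["completion"]})
    return cases


# Example: two dev runs, one of which failed to produce a completion.
log_path = Path(tempfile.mkdtemp()) / "dev_runs.jsonl"
append_run(log_path, {"prompt": "Summarize: ...", "completion": "A short summary."})
append_run(log_path, {"prompt": "Translate: ...", "completion": ""})
cases = load_eval_cases(log_path)
print(len(cases))
```

An append-only file keeps capture cheap during development; the filtering step is where curation rules (deduplication, redaction, sampling) would go.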
Common mistakes
- Logging prompts but not completions — half the data is missing.
- Skipping user feedback signals — the ground truth for prod quality.
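Both mistakes above can be guarded against in a few lines. The sketch below is a hypothetical in-memory log, not any vendor's SDK: `log_call` refuses to store a prompt without its completion, `add_feedback` attaches a thumbs up or down to an existing trace, and `thumbs_down_rate_by_model` turns that feedback into a per-model quality signal.

```python
from collections import defaultdict


class FeedbackLog:
    """In-memory store keyed by trace ID (illustrative only)."""

    def __init__(self):
        self._calls = {}

    def log_call(self, trace_id, model, prompt, completion):
        # Reject half-logged requests: both sides are needed for debugging.
        if not prompt or not completion:
            raise ValueError("log both the prompt and the completion")
        self._calls[trace_id] = {"model": model, "prompt": prompt,
                                 "completion": completion, "feedback": None}

    def add_feedback(self, trace_id, score):
        # +1 is thumbs up, -1 is thumbs down.
        if score not in (1, -1):
            raise ValueError("score must be +1 or -1")
        self._calls[trace_id]["feedback"] = score

    def thumbs_down_rate_by_model(self):
        """Share of rated calls that got a thumbs down, per model."""
        rated = defaultdict(int)
        down = defaultdict(int)
        for call in self._calls.values():
            if call["feedback"] is None:
                continue  # unrated calls do not count either way
            rated[call["model"]] += 1
            if call["feedback"] == -1:
                down[call["model"]] += 1
        return {model: down[model] / rated[model] for model in rated}


log = FeedbackLog()
log.log_call("t1", "model-a", "Question 1", "Answer 1")
log.log_call("t2", "model-a", "Question 2", "Answer 2")
log.log_call("t3", "model-b", "Question 3", "Answer 3")
log.add_feedback("t1", 1)
log.add_feedback("t2", -1)
rates = log.thumbs_down_rate_by_model()
print(rates)
```

Models with no rated calls are left out of the result instead of being reported as 0, so a missing key reads as "no signal yet" and not as "no complaints".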
FAQ
What is LLM observability?
LLM observability is the production discipline of capturing requests, responses, latencies, costs, and outcomes across LLM-driven systems — the prerequisite for debugging, evaluating, and optimizing AI features in 2026.
When should I use LLM observability?
Any production LLM feature — non-negotiable. Pre-prod: capture dev runs for offline eval.
What are the most common mistakes with LLM observability?
Logging prompts but not completions — half the data is missing. Skipping user feedback signals — the ground truth for prod quality.
Related terms
- Agent tracing — Agent tracing captures the full execution graph of an agent run — every step, every tool call, every model output — so engineers can debug, audit, and improve the agent over time.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Evals-driven development — Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it — borrowing test-driven development for LLM work.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/llm-observability.md.