Shadow deployment (LLM)
Shadow deployment runs a new model or prompt alongside the production one — receiving the same traffic but never showing output to users — to measure quality, latency, and cost before flipping live.
Shadow deployment is the LLM analogue of a dark launch behind a feature flag in traditional code. The shadow path receives a copy of production traffic, runs the new prompt or model, and logs the results; users only ever see the live path's output. Once enough samples have accumulated, you compare quality (eval scores), latency, and cost against the live baseline, and you switch the shadow to live only when it meets thresholds you set in advance. In 2026 shadow deployment is standard practice for model upgrades (GPT-4o → GPT-5, Claude 4.5 → 4.6) and prompt rewrites; the alternative is shipping blind and rolling back when users complain.
When to use shadow deployment (LLM)
- Model upgrades.
- Major prompt rewrites.
- New routing or orchestration layers.
Common mistakes
- Shadowing with too little traffic: confidence intervals stay too wide to support a decision.
- Comparing only aggregate scores: an aggregate can hide regressions in individual cohorts, so break results down per cohort.
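Both mistakes can be checked with the same small analysis: compute the shadow-minus-live score difference per cohort and put a confidence interval around it. The sketch below uses a percentile bootstrap, which is one reasonable choice among several. The record fields (`cohort`, `live_score`, `shadow_score`) and the function names are assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def compare_by_cohort(records):
    """Group paired scores by cohort and classify each cohort's difference.

    Assumption: each record is a dict with hypothetical keys
    'cohort', 'live_score', and 'shadow_score' (higher score = better).
    """
    by_cohort = defaultdict(list)
    for r in records:
        by_cohort[r["cohort"]].append(r["shadow_score"] - r["live_score"])

    report = {}
    for cohort, diffs in by_cohort.items():
        lo, hi = bootstrap_ci(diffs)
        if hi < 0:
            verdict = "regression"      # whole interval below zero
        elif lo > 0:
            verdict = "improvement"     # whole interval above zero
        else:
            verdict = "inconclusive"    # interval spans zero: collect more traffic
        report[cohort] = {"n": len(diffs), "mean_diff": mean(diffs), "ci": (lo, hi), "verdict": verdict}
    return report
```

A cohort marked "inconclusive" usually means the shadow has not seen enough of that traffic yet, while a "regression" in one cohort can coexist with a healthy aggregate, which is why the breakdown matters.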
FAQ
What is shadow deployment (LLM)?
Shadow deployment runs a new model or prompt alongside the production one — receiving the same traffic but never showing output to users — to measure quality, latency, and cost before flipping live.
When should I use shadow deployment (LLM)?
Use it for model upgrades, major prompt rewrites, and new routing or orchestration layers.
What are the most common mistakes with shadow deployment (LLM)?
Shadowing with too little traffic, which leaves confidence intervals too wide to support a decision, and comparing only aggregate scores, which can hide regressions in individual cohorts.
Related terms
- A/B testing prompts — A/B testing prompts runs two prompt variants against the same input distribution and compares scored outputs, attributing quality differences to the prompt change.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Model router — A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/shadow-deployment.md.