Synthetic data

Synthetic data is training or evaluation data generated by a model rather than collected from humans. It is increasingly used to fine-tune smaller models and to fill gaps in real datasets.

By 2026 synthetic data is a load-bearing part of the LLM stack. Teacher models generate examples that train student models (distillation), domain experts use LLMs to bootstrap labelled datasets, and evals are constructed from synthetic edge cases that real users haven't hit yet. The technique works best when the teacher model is materially stronger than the student on the target task; synthetic-only loops between models of equal strength tend to collapse. Quality control still matters: detect-then-filter pipelines, in which an LLM judge rejects bad synthetic examples, are a standard part of credible synthetic-data workflows.
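A detect-then-filter step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Example` record, the `filter_examples` helper, the 0.5 threshold, and `toy_judge` are all hypothetical names and values chosen for this sketch. In particular, `toy_judge` is a simple heuristic standing in for a real LLM judge call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    """One synthetic training example (hypothetical schema)."""
    prompt: str
    response: str


def filter_examples(
    examples: Iterable[Example],
    judge: Callable[[Example], float],
    threshold: float = 0.5,
) -> List[Example]:
    """Keep only the examples whose judge score is at least `threshold`."""
    return [ex for ex in examples if judge(ex) >= threshold]


def toy_judge(example: Example) -> float:
    """Stand-in for an LLM judge (assumption: a real judge would call a model).

    Scores empty or refusal-style responses 0.0 and everything else 1.0.
    """
    text = example.response.strip().lower()
    if not text or text.startswith("i cannot"):
        return 0.0
    return 1.0


candidates = [
    Example("What is 2 + 2?", "4"),
    Example("Name a prime number.", ""),
    Example("Summarise the ticket.", "I cannot help with that."),
]

# Only the first candidate passes the judge.
kept = filter_examples(candidates, toy_judge)
```

Passing the judge in as a plain callable keeps the filter independent of any particular model provider, so the heuristic can later be swapped for a model-backed scorer without touching the pipeline.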

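When to use synthetic data

Use synthetic data to distill a small, fast model from a large teacher, to bootstrap a domain dataset when human labels are scarce, or to generate eval edge cases.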

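Common mistakes

Training a model on its own outputs, which leads to model collapse. Skipping the quality filter, which lets synthetic noise propagate.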

FAQ

What is synthetic data?

Synthetic data is training or evaluation data generated by a model rather than collected from humans. It is increasingly used to fine-tune smaller models and to fill gaps in real datasets.

When should I use synthetic data?

Use it in three situations: distilling a small, fast model from a large teacher, bootstrapping a domain dataset when human labels are scarce, and generating eval edge cases.

What are the most common mistakes with synthetic data?

The two most common mistakes are training a model on its own outputs, which leads to model collapse, and skipping the quality filter, which lets synthetic noise propagate into the trained model.
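One way to guard against the first mistake is to record which model produced each example and drop anything the student generated itself before training. The sketch below assumes a hypothetical record schema with a `generator` field (set to `None` for human-collected data); the field name, the model names, and the `drop_self_generated` helper are illustrative, not part of any particular library.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def drop_self_generated(records: List[Record], student_model: str) -> List[Record]:
    """Remove records whose `generator` field names the student model.

    Records with no `generator` key, or with `generator` set to None,
    are treated as human-collected and kept.
    """
    return [r for r in records if r.get("generator") != student_model]


records = [
    {"text": "human-written example", "generator": None},
    {"text": "student's own output", "generator": "student-v1"},
    {"text": "teacher-generated example", "generator": "teacher-xl"},
]

# The student's own output is removed; the other two records remain.
clean = drop_self_generated(records, student_model="student-v1")
```

This check only works if provenance is captured when the data is generated, so the `generator` field would need to be written at that point rather than reconstructed afterwards.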

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/synthetic-data.md.