Synthetic data
Synthetic data is training or evaluation data generated by a model rather than collected from humans — increasingly used to fine-tune smaller models and to fill gaps in real datasets.
By 2026 synthetic data is a load-bearing part of the LLM stack. Teacher models generate examples that train student models (distillation), domain experts use LLMs to bootstrap labelled datasets, and evals are constructed from synthetic edge cases that real users haven't hit yet. The technique works best when the teacher model is materially stronger than the student on the target task; synthetic-only loops between models of similar strength tend to collapse. Quality control still matters: detect-then-filter pipelines, in which an LLM judge scores each synthetic example and rejects the bad ones, are a standard part of credible synthetic-data workflows.
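A detect-then-filter step can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `Example`, `filter_synthetic`, `toy_judge`, and the `0.7` threshold are all hypothetical names and values chosen for this sketch, and `toy_judge` is a crude heuristic standing in for a real LLM judge call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    prompt: str
    response: str


def filter_synthetic(
    examples: Iterable[Example],
    judge: Callable[[Example], float],
    threshold: float = 0.7,  # assumed cut-off; tune it for your own judge
) -> List[Example]:
    """Keep only the examples whose judge score meets the threshold."""
    return [ex for ex in examples if judge(ex) >= threshold]


def toy_judge(ex: Example) -> float:
    """Stand-in for an LLM judge: a simple heuristic, NOT a real quality model."""
    if not ex.response.strip():
        return 0.0  # empty response
    if ex.response.strip().lower() == ex.prompt.strip().lower():
        return 0.1  # response merely echoes the prompt
    return 1.0


candidates = [
    Example("What is 2 + 2?", "4"),
    Example("Name a prime number.", ""),
    Example("Define distillation.", "Define distillation."),
]
kept = filter_synthetic(candidates, toy_judge)
```

In a real workflow the `judge` argument would wrap a call to a stronger model that returns a score. Keeping it as a plain callable makes the filter easy to test without network access.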
When to use synthetic data
- Distilling a small fast model from a large teacher.
- Bootstrapping a domain dataset when human labels are scarce.
- Generating eval edge cases.
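For the distillation case, one possible shape for the data-generation step is shown here. It is a sketch under stated assumptions: `build_distillation_set` and `fake_teacher` are hypothetical helpers, the `prompt`/`completion`/`source` field names are illustrative rather than any particular fine-tuning API's schema, and a real teacher would be a call to a large model rather than a string template.

```python
import json
from typing import Callable, Iterable, List


def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> List[str]:
    """Return JSONL lines pairing each prompt with the teacher's answer."""
    lines = []
    for prompt in prompts:
        record = {
            "prompt": prompt,
            "completion": teacher(prompt),
            "source": "synthetic",  # provenance tag so the data can be audited later
        }
        lines.append(json.dumps(record))
    return lines


def fake_teacher(prompt: str) -> str:
    """Placeholder for a call to a large teacher model (hypothetical)."""
    return f"Answer to: {prompt}"


jsonl_lines = build_distillation_set(["What is overfitting?"], fake_teacher)
```

Tagging each record with its source costs almost nothing at generation time and makes it possible to separate synthetic from human-collected data later.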
Common mistakes
- Training a model on its own outputs (model collapse).
- Skipping a quality filter — synthetic noise propagates.
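One simple guard against the first mistake is to record which model produced each example and drop anything generated by the model about to be trained. The sketch below assumes every record carries a `generator` field; that field, the model names, and `drop_self_generated` are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def drop_self_generated(
    records: List[Dict[str, str]],
    student_name: str,
) -> List[Dict[str, str]]:
    """Remove records produced by the model that is about to be trained."""
    return [r for r in records if r.get("generator") != student_name]


records = [
    {"text": "example A", "generator": "human"},
    {"text": "example B", "generator": "teacher-large"},
    {"text": "example C", "generator": "student-small"},
]
train_set = drop_self_generated(records, student_name="student-small")
```

Records with no `generator` field are kept by this sketch. A stricter variant could reject them instead, since unknown provenance cannot be ruled out as self-generated.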
FAQ
What is synthetic data?
Synthetic data is training or evaluation data generated by a model rather than collected from humans — increasingly used to fine-tune smaller models and to fill gaps in real datasets.
When should I use synthetic data?
Use synthetic data when distilling a small, fast model from a large teacher, when bootstrapping a domain dataset because human labels are scarce, and when generating eval edge cases.
What are the most common mistakes with synthetic data?
The two most common mistakes are training a model on its own outputs, which leads to model collapse, and skipping a quality filter, which lets synthetic noise propagate.
Related terms
- Distillation — Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.
- Fine-tuning — Fine-tuning updates a pretrained model's weights on task-specific data, baking the new behaviour into the model rather than relying on prompts.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/synthetic-data.md.