# Synthetic data

**Source:** https://promtable.com/glossary/synthetic-data

> Synthetic data is training or evaluation data generated by a model rather than collected from humans — increasingly used to fine-tune smaller models and to fill gaps in real datasets.

---

By 2026 synthetic data is a load-bearing part of the LLM stack. Teacher models generate examples that train student models (distillation), domain experts use LLMs to bootstrap labelled datasets, and evals are built from synthetic edge cases that real users haven't hit yet. The technique works best when the teacher is materially stronger than the student on the target task; synthetic-only loops between models of equal strength tend to collapse. Quality control still matters: credible synthetic-data workflows include a detect-then-filter step, typically an LLM judge that rejects bad synthetic examples.
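A detect-then-filter step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Example`, `filter_synthetic`, `toy_judge`, and the `0.7` threshold are hypothetical names and values chosen for the example, and `toy_judge` is a rule-based stand-in for what would in practice be a call to an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """One synthetic prompt/response pair produced by a teacher model."""
    prompt: str
    response: str


def filter_synthetic(
    candidates: Iterable[Example],
    judge: Callable[[Example], float],
    threshold: float = 0.7,  # assumed cut-off; tune per task
) -> List[Example]:
    """Keep only candidates whose judge score meets the threshold."""
    return [ex for ex in candidates if judge(ex) >= threshold]


def toy_judge(ex: Example) -> float:
    """Hypothetical stand-in for an LLM judge.

    Scores 0.0 for empty responses or responses that merely echo the
    prompt, and 1.0 otherwise. A real judge would score correctness,
    relevance, and format.
    """
    text = ex.response.strip()
    if not text or text == ex.prompt.strip():
        return 0.0
    return 1.0


candidates = [
    Example("What is 2 + 2?", "4"),
    Example("Define distillation.", ""),        # empty: rejected
    Example("Name a prime.", "Name a prime."),  # echo: rejected
]
kept = filter_synthetic(candidates, toy_judge)
```

Passing the judge in as a callable keeps the filter independent of any particular model or API, so the scoring function can be swapped without touching the pipeline.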

## When to use

- Distilling a small fast model from a large teacher.
- Bootstrapping a domain dataset when human labels are scarce.
- Generating eval edge cases.

## Common mistakes

- Training a model on its own outputs, which leads to model collapse.
- Skipping a quality filter, which lets synthetic noise propagate into the trained model.
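One way to guard against the first mistake is to track where each record came from and drop anything the student itself generated before training. The sketch below assumes every record carries a `source` field set at generation time; `Record`, `drop_self_generated`, and the model ids are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    text: str
    source: str  # "human", or the id of the model that generated the text


def drop_self_generated(records: Iterable[Record], student_id: str) -> List[Record]:
    """Remove records produced by the model that is about to be trained."""
    return [r for r in records if r.source != student_id]


records = [
    Record("a human-written answer", "human"),
    Record("an answer from the teacher", "teacher-large"),
    Record("an answer from the student", "student-small"),
]
training_set = drop_self_generated(records, student_id="student-small")
```

This only works if provenance is recorded when the data is created; it cannot recover the origin of unlabelled text after the fact.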

## Related terms

- [distillation](https://promtable.com/glossary/distillation)
- [fine-tuning](https://promtable.com/glossary/fine-tuning)
- [evals](https://promtable.com/glossary/evals)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/synthetic-data
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/synthetic-data".
Contact: info@vibecodingturkey.com.