Model collapse
Model collapse is what happens when a model is trained or fine-tuned on its own outputs across generations — quality degrades, diversity shrinks, and tail knowledge is forgotten.
Described by Shumailov et al. (2023, 2024) and confirmed by later research through 2026, model collapse occurs when synthetic-data loops feed back into training without quality filtering or grounding in real data. Each generation converges toward the model's own most probable outputs, so long-tail knowledge is lost and output becomes increasingly homogeneous.
Practical implications in 2026: synthetic data pipelines must mix in real, human-authored or otherwise grounded data, filter synthetic samples for quality, and monitor diversity metrics across generations. Pre-training corpora are now heavily contaminated with AI-generated content, so major labs invest in provenance detection and human-authored data sources to reduce the risk of collapse.
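The feedback loop can be illustrated with a toy simulation. This is a minimal sketch, not the experimental setup from the cited work: the "model" is assumed to be a one-dimensional Gaussian refit by maximum likelihood on a small sample drawn from the previous generation. The function name `simulate_generations` and its parameters (`sample_size`, `real_fraction`) are hypothetical and chosen for illustration only.

```python
import random
import statistics


def simulate_generations(generations=300, sample_size=20, real_fraction=0.0, seed=0):
    """Refit a Gaussian on its own samples, generation after generation.

    Returns the fitted standard deviation per generation, starting with
    the true value 1.0. `real_fraction` is the share of each generation's
    training sample drawn from the true distribution N(0, 1) instead of
    from the current model (an assumed stand-in for "grounded data").
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                      # generation 0 matches the true data
    n_real = round(sample_size * real_fraction)
    history = [sigma]
    for _ in range(generations):
        # Real anchor: samples from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Synthetic data: samples from the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        samples = real + synthetic
        # "Training": maximum-likelihood refit of mean and standard deviation.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history
```

Comparing `simulate_generations()[-1]` with `simulate_generations(real_fraction=0.5)[-1]` shows the contrast: with no real data the fitted spread drifts toward zero as tail samples stop being drawn, while a real-data share in every generation keeps it from collapsing. A real language model is far more complex, so this only conveys the direction of the effect.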
Common mistakes
- Distilling a student on the teacher's outputs without any real data anchor.
- Running synthetic-data flywheels without quality gates.
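Both mistakes above come down to a missing safeguard when the training set is assembled. The following is a rough sketch of what such safeguards might look like, under stated assumptions: `score` is a caller-supplied quality function (hypothetical, for example a classifier or a heuristic), the thresholds are arbitrary placeholders, and distinct-n over whitespace tokens is only one simple choice of diversity metric.

```python
import random


def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams (whitespace tokens)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def build_training_mix(real, synthetic, score, min_score=0.5,
                       max_synth_ratio=1.0, seed=0):
    """Combine real and synthetic texts with a quality gate and a real anchor.

    Returns (mix, diversity), where diversity is distinct-2 of the
    synthetic texts that passed the gate, for the caller to monitor.
    """
    if not real:
        # Real-data anchor: refuse to train on synthetic data alone.
        raise ValueError("no real data supplied; refusing to build a synthetic-only mix")
    # Quality gate: drop synthetic samples that score below the threshold.
    kept = [text for text in synthetic if score(text) >= min_score]
    # Cap synthetic volume relative to the amount of real data.
    rng = random.Random(seed)
    rng.shuffle(kept)
    kept = kept[:int(len(real) * max_synth_ratio)]
    diversity = distinct_n(kept)
    mix = list(real) + kept
    rng.shuffle(mix)
    return mix, diversity
```

Tracking the returned diversity value across successive rounds of a synthetic-data flywheel gives an early signal: a steady decline suggests the generator is narrowing toward its own modal outputs, even when every individual sample passes the quality gate.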
Related terms
- Synthetic data — Synthetic data is training or evaluation data generated by a model rather than collected from humans — increasingly used to fine-tune smaller models and to fill gaps in real datasets.
- Distillation — Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.
- Fine-tuning — Fine-tuning updates a pretrained model's weights on task-specific data, baking the new behaviour into the model rather than relying on prompts.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/model-collapse.md.