# Training dataset

**Source:** https://promtable.com/glossary/training-dataset

> A training dataset is the collection of (input, output) examples, typically stored as JSONL or Parquet, used to fine-tune a model: instruction-tuning pairs, RLHF / DPO preference data, vision examples, or domain-specific patterns. Quality and diversity matter more than raw size.

---

Fine-tuning outcomes are dominated by the dataset: garbage in, garbage out.

Common shapes:

- Instruction tuning: `User: ... Assistant: ...` turns.
- Preference pairs: `{prompt, chosen, rejected}` for DPO.
- Function-call examples for tool use.
- Vision: image plus caption or answer.
- Code: problem, solution, and tests.

Quality drivers:

- Consistency: the same style throughout.
- Diversity: cover the expected input distribution.
- Hardness: mix easy and hard cases.
- Low noise: typos and mistakes confuse the model.

Size guidelines for 2026:

- Instruction tuning: 1K-10K high-quality examples usually beat 100K mediocre ones.
- LoRA: 100-1K examples can be enough for narrow tasks.
- Full fine-tuning needs more.

Provenance and license matter in production: do not train on scraped data with unclear rights.
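As a rough illustration of the instruction and preference shapes described above, the sketch below validates records and round-trips them through JSONL. The field names (`prompt`, `response`, `chosen`, `rejected`) and the helper names are assumptions made for this example; each training framework defines its own schema, so check the one you use.

```python
import json

# Hypothetical schema for illustration; real frameworks define their own field names.
REQUIRED_FIELDS = {
    "instruction": {"prompt", "response"},
    "preference": {"prompt", "chosen", "rejected"},
}


def validate_record(record: dict, kind: str) -> list[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for field in sorted(REQUIRED_FIELDS[kind]):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A preference pair carries no signal if both answers are the same.
    if kind == "preference" and not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


instruction_example = {"prompt": "Translate 'hello' to French.", "response": "bonjour"}
preference_example = {
    "prompt": "Summarize: The cat sat on the mat.",
    "chosen": "A cat sat on a mat.",
    "rejected": "Dogs are great.",
}
```

Running a check like `validate_record` over every line before training is a cheap way to catch empty fields and degenerate preference pairs early.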

## When to use

- Fine-tuning, RLHF / DPO, instruction tuning.

## Common mistakes

- Quantity over quality: 100K noisy examples lose to 5K curated ones.
- Mismatched style: if the training data's style differs from the desired output style, the model learns the wrong style.
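One way to act on the first mistake is a small curation pass before training. The sketch below is only a minimal example, assuming instruction records with hypothetical `prompt` and `response` fields: it drops incomplete examples and duplicate prompts. Real curation usually goes further (near-duplicate detection, style checks, human review).

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts compare equal."""
    return " ".join(text.lower().split())


def curate(records: list[dict], min_response_chars: int = 1) -> list[dict]:
    """Keep the first record per normalized prompt; drop incomplete records."""
    seen = set()
    kept = []
    for record in records:
        prompt = record.get("prompt", "")
        response = record.get("response", "")
        if not prompt.strip() or len(response.strip()) < min_response_chars:
            continue  # drop incomplete examples
        key = normalize(prompt)
        if key in seen:
            continue  # drop duplicate prompts, keeping the first occurrence
        seen.add(key)
        kept.append(record)
    return kept


raw = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "what is  2+2?", "response": "Four"},   # duplicate after normalization
    {"prompt": "Name a color.", "response": ""},        # empty response
    {"prompt": "Name a color.", "response": "Blue"},
]
curated = curate(raw)
```

Here `curated` keeps the first and last records: the second is a duplicate prompt and the third has an empty response.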

## Related terms

- [instruction-tuning](https://promtable.com/glossary/instruction-tuning)
- [synthetic-data](https://promtable.com/glossary/synthetic-data)
- [dpo](https://promtable.com/glossary/dpo)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/training-dataset
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/training-dataset".
Contact: info@vibecodingturkey.com.