concept

Training dataset

A training dataset is the collection of examples, typically stored as JSONL or Parquet, used to fine-tune a model: (input, output) pairs for instruction tuning, preference data for RLHF / DPO, vision examples, or domain-specific patterns. Quality, diversity, and sufficient scale matter more than raw size.

Fine-tuning outcomes are dominated by the dataset: garbage in, garbage out. Common shapes include instruction tuning ('User: ... Assistant: ...'), preference pairs (`{prompt, chosen, rejected}` for DPO), function-call examples (tool use), vision (image + caption / answer), and code (problem + solution + tests).

Quality drivers: (1) consistency (the same style throughout), (2) diversity (coverage of the expected input distribution), (3) hardness (a mix of easy and hard cases), and (4) low noise (typos and mistakes confuse the model).

Size guidelines for 2026: for instruction tuning, 1K-10K high-quality examples usually beat 100K mediocre ones; LoRA can succeed on 100-1K examples for narrow tasks; a full fine-tune needs more. Provenance and licensing matter in production: avoid scraped training data with unclear rights.
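As a rough illustration of the record shapes described above, the sketch below checks JSONL lines against a required set of fields before they are used for fine-tuning. Only the `{prompt, chosen, rejected}` layout comes from the text; the `instruction` / `response` field names and the helper names `validate_record` and `load_jsonl` are assumptions made for this example, not a standard schema.

```python
import json

# Field names per dataset shape. Only the preference layout is taken from the
# text; "instruction" / "response" are hypothetical names for this sketch.
REQUIRED_FIELDS = {
    "instruction": ("instruction", "response"),
    "preference": ("prompt", "chosen", "rejected"),
}


def validate_record(record, shape):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS[shape]:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A preference pair carries no signal if both answers are the same.
    if shape == "preference" and not problems:
        if record["chosen"].strip() == record["rejected"].strip():
            problems.append("chosen and rejected are identical")
    return problems


def load_jsonl(lines, shape):
    """Parse JSONL lines and return (valid_records, errors).

    `errors` is a list of (line_number, message) tuples; blank lines are skipped.
    """
    valid, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_number, "record is not a JSON object"))
            continue
        problems = validate_record(record, shape)
        if problems:
            errors.append((line_number, "; ".join(problems)))
        else:
            valid.append(record)
    return valid, errors
```

Passing an open file object as `lines` works because files iterate line by line. Real pipelines would likely add checks specific to the task (length limits, tool-call schemas, image paths), which are left out here.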

When to use a training dataset

Common mistakes

FAQ

What is a training dataset?

A training dataset is the collection of examples, typically stored as JSONL or Parquet, used to fine-tune a model: (input, output) pairs for instruction tuning, preference data for RLHF / DPO, vision examples, or domain-specific patterns. Quality, diversity, and sufficient scale matter more than raw size.

When should I use a training dataset?

Whenever you fine-tune a model: instruction tuning, preference optimization with RLHF / DPO, or other fine-tuning on domain-specific examples.

What are the most common mistakes with a training dataset?

Quantity over quality: 100K noisy examples lose to 5K curated ones. Mismatched style: when the style of the training data differs from the desired output style, the model picks up the wrong style.
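One simple way to lean toward quality over quantity is a curation pass that drops duplicates and obviously thin examples before training. The sketch below is a minimal version of that idea, assuming records with hypothetical `instruction` and `response` fields; the `min_response_chars` threshold is an arbitrary illustrative value, not a recommendation.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def curate(records, min_response_chars=20):
    """Keep the first copy of each example and drop very short responses.

    Duplicates are detected on normalized text, so differences in case or
    spacing alone do not count as a new example.
    """
    seen = set()
    kept = []
    for record in records:
        response = record["response"].strip()
        if len(response) < min_response_chars:
            continue  # too short to be a useful target in this sketch
        key_text = normalize(record["instruction"]) + "\x00" + normalize(response)
        key = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same example already kept
        seen.add(key)
        kept.append(record)
    return kept
```

This only catches exact matches after normalization. Paraphrased duplicates, factual errors, and style mismatches still need separate checks such as manual review or a fuzzy-matching step.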

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/training-dataset.md.