Preference dataset
A preference dataset is a collection of (prompt, chosen response, rejected response) triples used to align a model with methods such as RLHF, DPO, or IPO. It teaches the model what good responses look like by comparison, not just by example.
Instruction tuning teaches the model to follow instructions; preference fine-tuning teaches it to prefer good outputs over bad ones.
Format: `{prompt, chosen, rejected}` triples, where chosen and rejected are two responses to the same prompt that differ in quality.
Sources: human raters (expensive but high-quality), LLM judges (synthetic; cheaper, but biased toward the judge model's own preferences), and user feedback (thumbs up / down on production responses).
Methods: RLHF (full RL with a reward model plus PPO; complex), [[DPO]] (direct preference optimization; simpler, with no separate reward model), and variants such as IPO and KTO.
Production pattern: collect preference data from production user feedback and fine-tune on a regular schedule (for example, monthly).
By 2026 DPO has largely replaced PPO-based RLHF in fine-tuning pipelines for cost and complexity reasons.
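To make the link between a triple and DPO concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probability of each response under the model being tuned and under a frozen reference model; the function name `dpo_loss`, the example record, and the log-probability values are illustrative, not taken from any particular library.

```python
import math

# One record in the {prompt, chosen, rejected} format (made-up example).
record = {
    "prompt": "Explain what a mutex is in one sentence.",
    "chosen": "A mutex is a lock that lets only one thread access a resource at a time.",
    "rejected": "A mutex is a thing in programming.",
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single triple.

    Each argument is log p(response | prompt), summed over response tokens,
    under either the policy being tuned or the frozen reference model.
    `beta` controls how far the policy may drift from the reference.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical log-probabilities, for illustration only.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy favors chosen more
```

When the policy matches the reference, the loss equals log 2 (about 0.693). It drops as the policy raises the chosen response and lowers the rejected one relative to the reference, which is the "learning by comparison" the triples provide.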
When to use a preference dataset
- Alignment + style fine-tuning.
- Production quality improvement from user feedback.
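Thumbs up / down feedback is per-response, so it has to be paired before it fits the `{prompt, chosen, rejected}` format. The sketch below shows one simple way to do that, assuming a hypothetical feedback log whose events carry `prompt`, `response`, and `rating` fields; those field names and the sample events are assumptions for illustration, not a standard schema.

```python
from collections import defaultdict
from itertools import product

def build_pairs(feedback_log):
    """Pair every thumbs-up response with every thumbs-down response
    that was given to the same prompt."""
    by_prompt = defaultdict(lambda: {"up": [], "down": []})
    for event in feedback_log:
        if event["rating"] in ("up", "down"):
            by_prompt[event["prompt"]][event["rating"]].append(event["response"])

    pairs = []
    for prompt, votes in by_prompt.items():
        for chosen, rejected in product(votes["up"], votes["down"]):
            if chosen != rejected:  # skip responses that received mixed votes
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Made-up feedback events, for illustration only.
log = [
    {"prompt": "Summarize the refund policy.",
     "response": "Refunds are available within 30 days with a receipt.",
     "rating": "up"},
    {"prompt": "Summarize the refund policy.",
     "response": "We have a policy.",
     "rating": "down"},
    {"prompt": "Reset my password.",
     "response": "Click 'Forgot password' on the login page.",
     "rating": "up"},
]
pairs = build_pairs(log)
```

Prompts that only ever received one kind of rating produce no pairs here, so a real pipeline would likely lose a large share of raw feedback at this step; matching on exact prompt text is also a simplification.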
Common mistakes
- Using LLM-judge preferences for safety-critical fine-tunes: the judge's biases propagate into the tuned model.
- Using a small preference dataset (fewer than about 1K pairs), which is typically not enough for DPO to shift behavior.
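Both mistakes above can be caught with a quick audit before training. The following is a rough sketch, not a standard tool: the function name `audit`, the optional `source` field, and its `"llm_judge"` value are hypothetical, and the default threshold simply reuses the roughly 1K figure mentioned above.

```python
MIN_PAIRS = 1000  # rule-of-thumb threshold from the text; tune for your setup

def _is_blank(value):
    return not isinstance(value, str) or not value.strip()

def audit(pairs, min_pairs=MIN_PAIRS):
    """Flag unusable rows, count judge-labeled rows, and check dataset size."""
    problems = []
    for i, row in enumerate(pairs):
        missing = [k for k in ("prompt", "chosen", "rejected") if _is_blank(row.get(k))]
        if missing:
            problems.append((i, "missing or empty fields: " + ", ".join(missing)))
        elif row["chosen"].strip() == row["rejected"].strip():
            problems.append((i, "chosen and rejected are identical"))

    usable = len(pairs) - len(problems)  # each row contributes at most one problem
    return {
        "total": len(pairs),
        "usable": usable,
        "problems": problems,
        # Counted over all rows; `source` is a hypothetical provenance field.
        "llm_judge_rows": sum(1 for r in pairs if r.get("source") == "llm_judge"),
        "too_small": usable < min_pairs,
    }

# Made-up rows, for illustration only.
rows = [
    {"prompt": "p", "chosen": "a", "rejected": "b", "source": "human"},
    {"prompt": "p", "chosen": "a", "rejected": "a", "source": "llm_judge"},
    {"prompt": "p", "chosen": "", "rejected": "b"},
]
report = audit(rows)
```

For a safety-critical fine-tune, a non-zero `llm_judge_rows` count is a prompt to review those rows by hand or replace them with human labels rather than an automatic failure.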
Related terms
- Direct preference optimisation (DPO) — Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.
- Training dataset — A training dataset is the JSONL / Parquet collection of (input, output) pairs used to fine-tune a model — for instruction tuning, RLHF / DPO preference data, vision examples, or domain-specific patterns. Quality + scale + diversity matter more than raw size.
- Constitutional AI — Constitutional AI is Anthropic's alignment method where a model is trained to follow a written constitution — a set of principles applied during self-critique and revision — without per-task human preference labels at every step.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/preference-dataset.md.