Direct preference optimisation (DPO)
Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.
DPO (Rafailov et al., 2023) simplifies RLHF by removing the separate reward model and the reinforcement-learning loop. The reward is expressed implicitly through the policy itself, as the log-probability ratio between the policy being trained and a frozen reference model, so you optimise a classification-style loss on preference pairs (chosen vs rejected) directly. Empirically, it matches or beats PPO-based RLHF on alignment quality at much lower engineering complexity. By 2026, DPO and related methods (KTO, ORPO, IPO) have largely replaced classical RLHF for instruction-tuning open-weight models. Closed labs still combine multiple techniques (RLHF, Constitutional AI, DPO, RL on verifiable outcomes), but DPO is the open-weight default.
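The loss on a single preference pair can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: it assumes you have already computed the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model, and the function names, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may move from the reference
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response.
    The implicit reward of a response is beta * (policy logp - reference logp).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward margin: small when chosen outscores rejected.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability towards the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In a real training run the same expression is averaged over a batch and back-propagated through the policy's log-probabilities only; the reference model stays frozen.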
When to use direct preference optimisation (DPO)
- Aligning open-weight models to human preferences.
- Cheaper alternative to PPO-based RLHF for fine-tuning teams.
Common mistakes
- Skipping the frozen reference model, which lets the policy over-fit the preference pairs and drift from its starting behaviour.
- Mixing in low-quality preference data: garbage in, garbage out.
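A basic hygiene pass over the preference data catches the most obvious problems before training. The sketch below is one possible approach, not a standard tool: the record layout with `prompt`, `chosen`, and `rejected` keys and the `clean_pairs` helper are assumptions made for illustration, and the checks do not replace human review of label quality.

```python
def clean_pairs(pairs):
    """Drop obviously unusable preference pairs.

    Assumes each pair is a dict with string fields
    "prompt", "chosen", and "rejected" (hypothetical layout).
    """
    seen = set()
    kept = []
    for pair in pairs:
        prompt = pair["prompt"].strip()
        chosen = pair["chosen"].strip()
        rejected = pair["rejected"].strip()

        # Empty fields carry no preference signal.
        if not prompt or not chosen or not rejected:
            continue
        # Identical responses cannot express a preference.
        if chosen == rejected:
            continue
        # Exact duplicates over-weight a single judgement.
        if (prompt, chosen, rejected) in seen:
            continue
        # A reversed copy of an earlier pair is a contradictory label;
        # this sketch keeps the first one seen and drops the later one.
        if (prompt, rejected, chosen) in seen:
            continue

        seen.add((prompt, chosen, rejected))
        kept.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return kept


raw = [
    {"prompt": "p", "chosen": "a", "rejected": "b"},
    {"prompt": "p", "chosen": "a", "rejected": "b"},        # duplicate
    {"prompt": "p", "chosen": "b", "rejected": "a"},        # contradicts the first pair
    {"prompt": "q", "chosen": "same", "rejected": "same"},  # no preference
    {"prompt": "r", "chosen": "", "rejected": "x"},         # empty response
]
cleaned = clean_pairs(raw)
```

Only the first record survives here. Contradictory pairs may deserve a manual look rather than silent removal, since they often point to ambiguous prompts or inconsistent annotators.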
FAQ
What is direct preference optimisation (DPO)?
Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.
When should I use direct preference optimisation (DPO)?
Use it when aligning open-weight models to human preferences, or as a cheaper alternative to PPO-based RLHF for fine-tuning teams.
What are the most common mistakes with direct preference optimisation (DPO)?
The two most common are skipping a reference model and over-fitting the policy, and mixing in low-quality preference data (garbage in, garbage out).
Related terms
- Fine-tuning — Fine-tuning updates a pretrained model's weights on task-specific data, baking the new behaviour into the model rather than relying on prompts.
- Instruction tuning — Instruction tuning is the post-training stage where a base language model is fine-tuned on examples of (instruction, ideal response) pairs to follow human instructions reliably.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/dpo.md.