# Direct preference optimisation (DPO)

**Source:** https://promtable.com/glossary/dpo

> Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.

---
DPO (Rafailov et al., 2023) simplifies RLHF by letting the policy model act as its own implicit reward function: the reward for a response is defined by how much more probable the policy makes it than a frozen reference model does, so you can optimise a classification-style loss directly on preference pairs (chosen vs rejected). In the original paper's experiments it matches or beats PPO-based RLHF on alignment quality at much lower engineering complexity, since there is no separate reward model to train and no on-policy sampling loop. By 2026, DPO and its variants (KTO, ORPO, IPO) have largely replaced classical RLHF for instruction-tuning open-weight models. Closed labs still combine multiple techniques (RLHF, Constitutional AI, DPO, RL on verifiable outcomes), but DPO is the open-weight default.
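To make the loss on preference pairs concrete, here is a minimal sketch of the per-pair DPO objective from Rafailov et al. (2023) in plain Python. It assumes you already have the summed log-probability of each response under the policy and under a frozen reference model. The function and argument names, and the default `beta=0.1`, are illustrative choices and do not come from any particular library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the policy or the frozen reference model.
    `beta` scales how strongly the policy is tied to the reference.
    """
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss is -log(sigmoid(margin)), which equals softplus(-margin).
    margin = chosen_reward - rejected_reward
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference, the loss is log 2 (about 0.693). It falls as the policy raises the chosen response's likelihood relative to the rejected one. A real training loop would compute this over batches with an autodiff framework, and this sketch only shows the arithmetic.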

## When to use

- Aligning open-weight models to human preferences.
- Cheaper alternative to PPO-based RLHF for fine-tuning teams.

## Common mistakes

- Skipping the frozen reference model, which is what keeps the policy from drifting and over-fitting to the preference pairs.
- Mixing in low-quality preference data (noisy, contradictory, or near-identical pairs): garbage in, garbage out.
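As a hedged illustration of the data-quality point, the sketch below drops preference pairs that carry no usable signal before training. The `prompt` / `chosen` / `rejected` field names and the `clean_preference_pairs` helper are assumptions made for this example. It only catches structural problems (empty fields, identical responses, exact duplicates) and is no substitute for reviewing label quality.

```python
def clean_preference_pairs(pairs):
    """Return only pairs with a prompt and two distinct, non-empty responses.

    `pairs` is an iterable of dicts with hypothetical keys
    "prompt", "chosen", and "rejected".
    """
    seen = set()
    kept = []
    for pair in pairs:
        prompt = pair.get("prompt", "").strip()
        chosen = pair.get("chosen", "").strip()
        rejected = pair.get("rejected", "").strip()

        # A pair with a missing field cannot express a preference.
        if not prompt or not chosen or not rejected:
            continue
        # Identical responses give the loss nothing to separate.
        if chosen == rejected:
            continue
        # Exact duplicates over-weight a single judgement.
        key = (prompt, chosen, rejected)
        if key in seen:
            continue

        seen.add(key)
        kept.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return kept


if __name__ == "__main__":
    raw = [
        {"prompt": "Summarise X.", "chosen": "Short summary.", "rejected": "Off-topic."},
        {"prompt": "Summarise X.", "chosen": "Short summary.", "rejected": "Off-topic."},
        {"prompt": "Summarise Y.", "chosen": "Same.", "rejected": "Same."},
        {"prompt": "Summarise Z.", "chosen": "Fine.", "rejected": ""},
    ]
    print(len(clean_preference_pairs(raw)))
```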

## Related terms

- [fine-tuning](https://promtable.com/glossary/fine-tuning)
- [instruction-tuning](https://promtable.com/glossary/instruction-tuning)
- [evals](https://promtable.com/glossary/evals)

## Sources

- [Rafailov et al. 2023 (arXiv)](https://arxiv.org/abs/2305.18290)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/dpo
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/dpo".
Contact: info@vibecodingturkey.com.