technique

Direct preference optimisation (DPO)

Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.

DPO (Rafailov et al., 2023) simplifies RLHF by treating the policy model itself as an implicit reward model: you optimise a classification-style loss directly on preference pairs (chosen vs rejected), measured against a frozen reference model. Empirically, it matches or beats PPO-based RLHF on alignment quality in many reported comparisons, at much lower engineering complexity. By 2026 DPO and its variants (KTO, ORPO, IPO) have largely replaced classical RLHF for instruction-tuning open-weight models. Closed labs still combine multiple techniques (RLHF, Constitutional AI, DPO, RL on verifiable outcomes), but DPO is the open-weight default.
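To make the "loss on preference pairs" concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. It assumes you have already computed the summed log-probability of each response under the policy and under a frozen reference model. The function and argument names, and the default `beta=0.1`, are illustrative assumptions, not any particular library's API.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    The loss is the negative log-sigmoid of the chosen-minus-rejected reward gap.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -log_sigmoid(chosen_reward - rejected_reward)


# When the policy equals the reference, the reward gap is zero and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In a real training loop the same expression would be averaged over a batch and computed on tensors so that gradients flow through the policy log-probabilities only; the reference log-probabilities stay fixed.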

When to use direct preference optimisation (DPO)

Common mistakes

FAQ

What is direct preference optimisation (DPO)?

Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.

When should I use direct preference optimisation (DPO)?

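Use DPO when you are aligning an open-weight model to human preferences and already have, or can collect, chosen/rejected preference pairs. For fine-tuning teams it is a cheaper alternative to PPO-based RLHF, since there is no separate reward model to train and no RL sampling loop to run.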

What are the most common mistakes with direct preference optimisation (DPO)?

Skipping the reference model, which lets the policy over-fit the preference data and drift away from the base model. Mixing in low-quality preference data: garbage in, garbage out.
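As a rough illustration of guarding against low-quality preference data, the sketch below drops obviously degenerate pairs before training. The record layout (`prompt`, `chosen`, `rejected` keys) and the three checks are assumptions chosen for the example; real pipelines typically add task-specific checks such as annotator agreement or length filters.

```python
def filter_preference_pairs(pairs: list[dict]) -> list[dict]:
    """Keep only pairs that carry a usable preference signal.

    Hypothetical hygiene pass: drops records with missing fields,
    identical chosen/rejected texts, and exact duplicates.
    """
    seen = set()
    kept = []
    for pair in pairs:
        prompt = pair.get("prompt", "").strip()
        chosen = pair.get("chosen", "").strip()
        rejected = pair.get("rejected", "").strip()

        # A pair with an empty field cannot be scored.
        if not prompt or not chosen or not rejected:
            continue
        # Identical responses carry no preference information.
        if chosen == rejected:
            continue
        # Exact duplicates over-weight a single judgement.
        key = (prompt, chosen, rejected)
        if key in seen:
            continue

        seen.add(key)
        kept.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return kept


raw_pairs = [
    {"prompt": "Summarise X", "chosen": "Short summary.", "rejected": "Off-topic reply."},
    {"prompt": "Summarise X", "chosen": "Short summary.", "rejected": "Off-topic reply."},
    {"prompt": "Summarise Y", "chosen": "Same text", "rejected": "Same text"},
    {"prompt": "Summarise Z", "chosen": "", "rejected": "Something"},
]
clean_pairs = filter_preference_pairs(raw_pairs)
```

Only the first record survives here: the second is a duplicate, the third has no preference signal, and the fourth is missing its chosen response.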

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/dpo.md.