Direct Preference Optimization (DPO)
- The RLHF - Reinforcement Learning with Human Feedback setup for aligning LLMs is very cumbersome:
- requires training multiple copies of the LLM for reward and value models
- requires sampling from the LM policy in the training loop (expensive!)
- RL is generally a "last resort" when the reward is completely black-box or non-differentiable, but pairwise preference optimization is not black-box at all: it can be framed as a differentiable binary decision, say by assuming the Bradley-Terry Model.
- DPO thus proposes a simple binary cross-entropy loss for fine-tuning directly on a preference dataset, avoiding the need to perform RL with a reward model.
- Results are pretty good: DPO matches or exceeds PPO - Proximal Policy Optimization based RLHF on alignment tasks.
- Theoretically bulletproof too! Maximum likelihood estimation under the Bradley-Terry model learns the same policy as RL against a reward function learnt from the same pairwise preferences.
The DPO Objective
DPO follows the Bradley-Terry Model's assumption that the sigmoid of the reward difference predicts the pairwise outcome:
$p\left(y_1 \succ y_2 \mid x\right)=\sigma\left(r\left(x, y_1\right)-r\left(x, y_2\right)\right)$
We can then straightforwardly use Maximum Likelihood Estimation to estimate the parameters of this reward model $r$.
Now, following the convention of RL objectives that explicitly prevent diverging too far from a "reference" policy by minimizing a KL Divergence term (introduced in TRPO - Trust-Region Policy Optimization), the BT model can incorporate this constraint directly:
$p\left(y_w \succ y_l \mid x\right)=\sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\text{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\text{ref}}\left(y_l \mid x\right)}\right)$
where larger deviations from the reference policy are discouraged using the idea of importance weights (see Importance Sampling). Correspondingly, the loss function is given as:
$\mathcal{L}_{\text{DPO}}=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\text{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\text{ref}}\left(y_l \mid x\right)}\right)\right]$
Note that $\beta$ is a hyperparameter controlling strength of the "KL penalty".
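For concreteness, a minimal PyTorch sketch of this loss, assuming per-sequence log-probs (summed over completion tokens) have already been computed; the function and argument names are illustrative, not from the DPO paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed per-sequence log-probs.

    Each argument is a (batch,) tensor: log pi(y|x) summed over the
    completion tokens, under the policy or the frozen reference model.
    """
    # Implicit rewards: r = beta * log(pi / pi_ref)
    chosen_rewards = beta * (policy_logp_w - ref_logp_w)
    rejected_rewards = beta * (policy_logp_l - ref_logp_l)
    # Binary cross-entropy on the Bradley-Terry preference probability
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```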
The Training Pipeline
The general DPO pipeline is given as:
- Create preference dataset: Sample completions $y_1, y_2 \sim \pi_{\text{ref}}(\cdot \mid x)$ for every prompt $x$, then label with human preferences to construct the offline preference dataset $\mathcal{D}=\left\{\left(x^{(i)}, y_w^{(i)}, y_l^{(i)}\right)\right\}_{i=1}^N$
- To help mitigate issues from Distribution Shift, first maximize the likelihood of preferred completions by next-token prediction (SFT).
- Then minimize the DPO loss. Default setting for $\beta$ is 0.1.
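A hypothetical skeleton of the final step, reusing the `dpo_loss` sketch above; `sequence_logp` (summed completion log-probs under a model), `policy`, `ref_model`, and `preference_loader` are assumed helpers, not a real library API.

```python
# Step 3: minimize the DPO loss over the offline preference dataset.
ref_model.eval()  # the reference policy stays frozen throughout
for batch in preference_loader:
    with torch.no_grad():  # no gradients through the reference model
        ref_w = sequence_logp(ref_model, batch["prompt"], batch["chosen"])
        ref_l = sequence_logp(ref_model, batch["prompt"], batch["rejected"])
    pol_w = sequence_logp(policy, batch["prompt"], batch["chosen"])
    pol_l = sequence_logp(policy, batch["prompt"], batch["rejected"])
    loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```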
DPO vs BT Reward Model
The IPO paper does a deeper analysis of the behavior of DPO. They show that:
- The more deterministic the preferences are, the weaker the effective KL-regularization becomes, to the point where the value of $\beta$ becomes irrelevant.
- Empirically, this leads to substantial overfitting, and raising $\beta$ does nothing to stop it.
A learned reward model (under BT), on the other hand, rarely models preferences as fully deterministic due to regularization (under-fitting), and thus does not veer too far from the reference policy.
However, under the BT model only the reward difference matters, which creates two properties:
- Rewards are unbounded: To push p(a > b) closer to 1, the model just needs to make the difference larger and there is no ceiling on the individual reward values.
- Rewards are shift invariant: The absolute scale is arbitrary, since adding a constant to both rewards doesn't affect the probabilities.
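A quick numerical check of both properties, with illustrative reward values:

```python
import torch

r_a, r_b = torch.tensor(2.0), torch.tensor(-1.0)
p = torch.sigmoid(r_a - r_b)                          # p(a > b) = sigmoid(3.0)
p_shifted = torch.sigmoid((r_a + 100) - (r_b + 100))  # same difference
assert torch.allclose(p, p_shifted)                   # shift invariance
# Unboundedness: p -> 1 only as the difference grows without limit
print(torch.sigmoid(torch.tensor(10.0)))              # ~0.99995, never exactly 1
```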
This unboundedness causes:
- Reward Hacking
- Model finds inputs that produce extreme reward values even if they don't reflect true quality.
- Can add a penalty term to the BT loss that penalizes the squared difference, i.e. $\mathcal{L}=-\log \sigma\left(r_A-r_B\right)+\lambda\left(r_A-r_B\right)^2$ (equivalent to a zero-centered Gaussian prior on the reward difference); see the sketch after this list.
- IPO bakes in regularization directly into the objective.
- Instability
- Unbounded rewards cause large gradients and unstable training.
- Clamp the difference to some range like $[-10, 10]$ before applying the sigmoid. Simple, but can create flat gradients at the boundaries.
- Compute advantages that don't care about the magnitude of the reward, just the ranking among samples.
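As referenced above, a minimal sketch combining the squared-difference penalty and clamping mitigations; the `lam` and `clamp` defaults are illustrative assumptions, not values from any paper.

```python
import torch
import torch.nn.functional as F

def bt_loss_regularized(r_a, r_b, lam=0.01, clamp=10.0):
    """Bradley-Terry loss with two stability tweaks from the list above."""
    diff = r_a - r_b
    penalty = lam * diff.pow(2)        # discourages unbounded reward margins
    diff = diff.clamp(-clamp, clamp)   # caveat: flat gradients at the edges
    return (-F.logsigmoid(diff) + penalty).mean()
```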
IPO
The IPO objective ensures that regularization towards the reference policy is always maintained, and thus avoids over-fitting to the preference dataset.
- Drop BT's sigmoid-of-difference assumption; just learn to separate winners from losers by a margin.
- $\mathcal{L}_{\mathrm{IPO}}=\left(\log \frac{\pi\left(y_w \mid x\right)}{\pi_{\text{ref}}\left(y_w \mid x\right)}-\log \frac{\pi\left(y_l \mid x\right)}{\pi_{\text{ref}}\left(y_l \mid x\right)}-\frac{1}{2 \beta}\right)^2$
- Make the difference equal to a target margin $\frac{1}{2 \beta}$, then stop. Since there's no sigmoid asymptote to chase, the objective naturally saturates once the margin is achieved.
- But loses the probabilistic interpretation.
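A minimal sketch of this objective, using the same summed per-sequence log-prob convention as the DPO sketch above (names are illustrative):

```python
import torch

def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """IPO: regress the log-ratio margin onto the fixed target 1/(2*beta)."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Squared error saturates once the margin hits the target; no asymptote.
    return (margin - 1.0 / (2.0 * beta)).pow(2).mean()
```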
IPO/DPO vs RLHF
IPO/DPO are offline: they learn from a fixed preference dataset. RLHF is online: it generates new samples, scores them, and updates. So exploration beyond the original dataset is a huge win for RLHF, provided there is something coherent to explore towards.
IPO and DPO might be better when:
- Simple pipeline is preferred
- Less compute; it's just straightforward supervised learning
- Preference data is high quality and has good coverage