Direct Preference Optimization (DPO)
Bradley-Terry Model
DPO starts from the Bradley-Terry model, which connects rewards to preferences. This model expresses the probability that a human prefers response $y_w$ (winner) over $y_l$ (loser) given prompt $x$:

$$p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)$$

where $r^*(x, y)$ is the implicit reward model ($r^*(x, y_w)$ – reward of the preferred response, $r^*(x, y_l)$ – reward of the rejected response) and $\sigma$ is the sigmoid function.
Feedback comes as preferences $y_w \succ y_l$ over pairs of model samples for the same prompt $x$.
Given this, we can define the loss function for the reward model:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$$

where $\phi$ are the parameters of the reward model.
The loss function is essentially binary classification — the logit is just the difference in rewards.
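As a minimal sketch (not part of the original derivation), this pairwise loss could look like the following in PyTorch, assuming a hypothetical reward_model callable that returns one scalar reward per example:

import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_inputs, rejected_inputs):
    # Assumed: reward_model returns a tensor of scalar rewards, shape (batch_size,).
    r_chosen = reward_model(chosen_inputs)
    r_rejected = reward_model(rejected_inputs)
    # Binary classification on the reward difference: -log sigmoid(r_w - r_l).
    return -F.logsigmoid(r_chosen - r_rejected).mean()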
Deriving the DPO Objective
Instead of training a separate reward model (as in RLHF), DPO uses a theoretical result connecting the optimal policy to the reward function.
The standard RLHF objective is:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\big]$$
In other words: maximize expected reward, minus a KL penalty to keep the policy close to the reference.
This objective has a closed-form solution for the optimal policy $\pi^*$:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x)$ is the partition function:

$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$
It looks almost the same as the numerator of the policy above, but here we sum over all possible responses $y$, which makes it intractable to compute in practice.
We can rearrange this to express the reward in terms of the optimal policy $\pi^*$, reference policy $\pi_{\text{ref}}$, and partition function $Z(x)$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
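For completeness, this follows by taking the log of the closed-form policy and solving for $r(x, y)$:

$$\log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{1}{\beta}\, r(x, y) - \log Z(x)$$

then multiplying through by $\beta$ and moving terms around.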
DPO Loss
Substituting this reward expression into the Bradley-Terry model (the $\beta \log Z(x)$ terms cancel out, what a relief!), we get the preference probability in terms of the policy:

$$p^*(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$
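To see why the cancellation happens: both responses are compared under the same prompt $x$, so the reward difference inside the sigmoid is

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} + \beta \log Z(x) - \beta \log Z(x)$$

and the two $\beta \log Z(x)$ terms are identical.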
Now we can train the policy $\pi_\theta$ directly by minimizing the negative log-likelihood of the preference data:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
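As a tiny worked example (numbers invented for illustration): with $\beta = 0.1$, a chosen log-ratio of $0.5$ and a rejected log-ratio of $-0.3$, the logit is $0.1 \cdot (0.5 - (-0.3)) = 0.08$ and the per-example loss is $-\log \sigma(0.08) \approx 0.65$. If the two log-ratios were swapped, the loss would rise to $-\log \sigma(-0.08) \approx 0.73$, so minimizing the loss pushes the chosen log-ratio up relative to the rejected one.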
This bypasses the need for a separate reward model entirely. Pretty cool, huh?
Implementation
import torch
import torch.nn.functional as F

Log Probabilities
We need to compute the log probability of a sequence, $\log \pi(y \mid x)$. Since the model is autoregressive, this is the sum of the log probabilities of each token given the history.
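In symbols, for a response $y = (y_1, \dots, y_T)$:

$$\log \pi(y \mid x) = \sum_{t=1}^{T} \log \pi(y_t \mid x, y_{<t})$$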
def get_log_probs(model, input_ids, attention_mask, labels):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

Align logits and labels. The model predicts the NEXT token, so logits[t] corresponds to labels[t+1]. We remove the last logit (no next token) and the first label (no prediction).

    logits = logits[:, :-1, :]
    labels = labels[:, 1:]

Create a mask for padding tokens (where labels are -100). We must mask BEFORE gathering because -100 is not a valid tensor index.

    mask = (labels != -100).float()

Replace -100 with 0 (or any valid index) to prevent index errors in gather. We clone labels to ensure we don't modify the input tensor in place.

    labels_safe = labels.clone()
    labels_safe[labels == -100] = 0

Compute log probabilities for all vocabulary tokens at each position.

    log_probs = F.log_softmax(logits, dim=-1)

Select the log probability of the ACTUAL token that appeared in the sequence. gather with dim=-1 selects the value at the index specified by labels_safe.

    per_token_log_probs = torch.gather(log_probs, dim=-1, index=labels_safe.unsqueeze(-1)).squeeze(-1)

Multiply by mask to zero out padding tokens, then sum over the sequence. Result shape: (batch_size,).

    return (per_token_log_probs * mask).sum(dim=-1)

DPO Loss Function
This implements the DPO objective derived above. It minimizes the negative log-likelihood of the preference data under the implicit reward model.
Also, we can use subtraction instead of the ratio, since $\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$.
def dpo_loss(
    policy_model,
    reference_model,
    chosen_ids, chosen_mask, chosen_labels,
    rejected_ids, rejected_mask, rejected_labels,
    beta: float = 0.1,
):

Compute policy log probabilities for chosen and rejected responses. This gives us $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$.

    pi_chosen = get_log_probs(policy_model, chosen_ids, chosen_mask, chosen_labels)
    pi_rejected = get_log_probs(policy_model, rejected_ids, rejected_mask, rejected_labels)

Compute reference log probabilities (frozen model). This gives us $\log \pi_{\text{ref}}(y_w \mid x)$ and $\log \pi_{\text{ref}}(y_l \mid x)$. We use no_grad() because we don't update the reference model.

    with torch.no_grad():
        ref_chosen = get_log_probs(reference_model, chosen_ids, chosen_mask, chosen_labels)
        ref_rejected = get_log_probs(reference_model, rejected_ids, rejected_mask, rejected_labels)

Calculate implicit rewards (log-ratios).

    chosen_logratios = pi_chosen - ref_chosen
    rejected_logratios = pi_rejected - ref_rejected

Compute the Bradley-Terry logits.

    logits = beta * (chosen_logratios - rejected_logratios)

Compute the negative log-likelihood loss. F.logsigmoid(x) computes log(1 / (1 + exp(-x))) stably.

    loss = -F.logsigmoid(logits).mean()
    return loss

Training Loop
Standard PyTorch training step, passing both chosen and rejected sequences.
def train_step(policy_model, reference_model, batch, optimizer, beta=0.1):
    optimizer.zero_grad()
    loss = dpo_loss(
        policy_model, reference_model,
        batch["chosen_input_ids"], batch["chosen_attention_mask"], batch["chosen_labels"],
        batch["rejected_input_ids"], batch["rejected_attention_mask"], batch["rejected_labels"],
        beta=beta,
    )
    loss.backward()
    optimizer.step()
    return loss.item()
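To tie it together, here is a rough usage sketch, not from the original post: it assumes Hugging Face-style causal LM checkpoints (the checkpoint name and preference_dataloader below are hypothetical), and it assumes each batch carries label tensors that copy the input ids with prompt and padding positions set to -100, so that get_log_probs only scores response tokens.

from copy import deepcopy

import torch
from transformers import AutoModelForCausalLM

# Hypothetical SFT checkpoint; the reference model is a frozen copy of the policy.
policy_model = AutoModelForCausalLM.from_pretrained("my-sft-model")
reference_model = deepcopy(policy_model)
reference_model.eval()
for p in reference_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-6)

# preference_dataloader is assumed to yield dicts with the keys used in train_step.
for batch in preference_dataloader:
    loss = train_step(policy_model, reference_model, batch, optimizer, beta=0.1)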