Soft Adaptive Policy Optimization (SAPO)
From GRPO to SAPO
Group Relative Policy Optimization (GRPO)
Standard PPO requires a critic (or value) model to estimate advantages. GRPO eliminates the critic by computing advantages relative to a group of sampled responses.
The Core Idea
For each prompt $q$, sample $G$ responses $\{o_1, \dots, o_G\}$ from the policy. Score each response with a reward function to get rewards $\{r_1, \dots, r_G\}$. Then compute the advantage:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$$

This is just z-score normalization within the group. Note that we use the same advantage for all tokens in a response ($\hat{A}_{i,t} = \hat{A}_i$ for all $t$).
The algorithm is actually pretty simple:
- Sample a group of responses for each prompt
- Score each response with a reward model
- Normalize rewards within the group -> advantages
- Update the policy using these relative advantages
But why groups? In PPO, we compute advantages as $\hat{A}_t = R_t - V(s_t)$, where $R_t$ is the return and $V(s_t)$ is a learned baseline (the expected return from state $s_t$) – basically calculating the gain compared to the baseline. Standard PPO trains a critic network to estimate $V(s_t)$.
For LLMs, this adds overhead – either a separate value model or an extra head on the base model. GRPO's solution: estimate the baseline from a group of samples. Generate $G$ responses for the same prompt, score them, and use the mean as the baseline. If response A scores higher than average – reinforce it. If response B scores lower – suppress it.
No learned value function, just peer pressure.
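Here's a minimal sketch of that step with made-up rewards, just to make the numbers concrete (a fuller implementation appears later in the post):

```python
import torch

# Made-up rewards for G = 4 sampled responses to the same prompt
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])

# Group mean as the baseline, z-score as the advantage
baseline = rewards.mean()  # 0.625
advantages = (rewards - baseline) / rewards.std().clamp(min=1e-8)

print(advantages)  # above-average responses get positive advantages
```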
GRPO Objective
Let's start from the beginning.
The probability of generating response $o$ given prompt $q$ is the product of per-token probabilities:

$$\pi_\theta(o \mid q) = \prod_{t=1}^{|o|} \pi_\theta(o_t \mid q, o_{<t})$$
In other words: how likely is this exact sequence? Multiply the probability of each token, given everything before it.
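Here's what that factorization looks like in code (a toy sketch with dummy logits and tokens, not tied to any particular model API):

```python
import torch
import torch.nn.functional as F

# Dummy per-position logits for a 5-token response over a toy vocabulary
vocab_size, seq_len = 10, 5
logits = torch.randn(seq_len, vocab_size)        # model output at each position
tokens = torch.randint(vocab_size, (seq_len,))   # the sampled response tokens

# log π(o_t | q, o_<t) at each position; summing logs = log of the product
log_probs = F.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
sequence_log_prob = token_log_probs.sum()        # log π(o | q)
```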
For each query $q$, GRPO samples a group of $G$ responses $\{o_1, \dots, o_G\}$ from $\pi_{\theta_{old}}$ and computes rewards $\{r_1, \dots, r_G\}$.

The objective:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_{i,t}\big) - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\Big)\right]$$
Why min? If $\hat{A}_{i,t} > 0$ (good response), we want to increase $r_{i,t}(\theta)$. The min ensures we don't increase it too much: once $r_{i,t} > 1+\epsilon$, the clipped term is smaller, so the gradient stops. If $\hat{A}_{i,t} < 0$ (bad response), we want to decrease $r_{i,t}(\theta)$. The min ensures we don't decrease it too much – once $r_{i,t} < 1-\epsilon$, the clipped term is smaller, so the gradient stops. This creates a trust region around the old policy.
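In code, that min/clip term is only a couple of lines. A minimal per-token sketch (just the surrogate, ignoring the averaging and the KL term):

```python
import torch

def clipped_surrogate(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    # min(r * A, clip(r, 1-eps, 1+eps) * A): keep the more pessimistic term
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.minimum(unclipped, clipped)  # maximized (negate to use as a loss)
```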
Notation:
- $i$ – response index in the group
- $t$ – token position within the response
- $o_{i,t}$ – the $t$-th token of the $i$-th response
- $o_{i,<t}$ – all tokens before position $t$ in response $i$
- $\hat{A}_{i,t}$ – advantage for response $i$ (shared across all tokens in that response)
The token-level importance ratio:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}$$
The KL divergence term (it's an approximation, see Schulman (2020)) penalizes deviation from a reference policy $\pi_{\text{ref}}$:

$$\mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$
Note: $\pi_{\theta_{old}}$ and $\pi_{\text{ref}}$ are different. $\pi_{\theta_{old}}$ is the policy that generated the samples (updated each iteration). $\pi_{\text{ref}}$ is a fixed reference (usually the SFT model) used for KL regularization to prevent the policy from drifting too far from the original model.
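To see what that approximation does, here's a quick sanity check with two toy categorical distributions (numbers made up): averaged over samples from $\pi_\theta$, the per-token term recovers the exact KL.

```python
import torch

# Toy categorical distributions: current policy π_θ and reference π_ref
p_theta = torch.tensor([0.7, 0.2, 0.1])
p_ref = torch.tensor([0.5, 0.3, 0.2])

# Exact KL(π_θ || π_ref)
exact_kl = (p_theta * (p_theta / p_ref).log()).sum()

# Per-sample estimator r - log r - 1 with r = π_ref / π_θ, for x ~ π_θ.
# Taking its exact expectation under π_θ shows it is unbiased.
r = p_ref / p_theta
estimator_mean = (p_theta * (r - r.log() - 1)).sum()

print(exact_kl.item(), estimator_mean.item())  # both ≈ 0.085
```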
The Problem: Hard Clipping
The clip operation creates a hard cutoff:

$$\text{clip}(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon) = \begin{cases} 1-\epsilon & \text{if } r_{i,t}(\theta) < 1-\epsilon \\ r_{i,t}(\theta) & \text{if } 1-\epsilon \le r_{i,t}(\theta) \le 1+\epsilon \\ 1+\epsilon & \text{if } r_{i,t}(\theta) > 1+\epsilon \end{cases}$$
When the ratio hits the boundary, the gradient contribution is zeroed out (the derivative of a constant = zero).
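A quick autograd check with toy numbers makes this concrete: once the clipped branch wins the min, the token's gradient is exactly zero.

```python
import torch

eps, advantage = 0.2, 1.0
ratio = torch.tensor(1.5, requires_grad=True)   # well outside [0.8, 1.2]

surrogate = torch.minimum(ratio * advantage,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
surrogate.backward()
print(ratio.grad)  # tensor(0.) - the constant clipped branch was selected
```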
Why This Is Bad
1. Token-level variance is high in LLMs
In long sequences, individual tokens can have extreme probability ratios even when the sequence overall is on-policy. A single rare token can push $r_{i,t}$ outside $[1-\epsilon,\, 1+\epsilon]$.
2. The min operator discards useful signal
When $\hat{A}_{i,t} > 0$ and $r_{i,t} > 1+\epsilon$, PPO takes the clipped value. But the gradient of $\text{clip}(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_{i,t}$ with respect to $\theta$ is zero – the token contributes nothing to learning. Since we propagate the same advantage to each token, this can zero out useful signal from an entire sequence. Task failed successfully.
3. Hyperparameter sensitivity
- Small $\epsilon$ (0.1): aggressive clipping -> many tokens discarded -> slow learning
- Large $\epsilon$ (0.3): off-policy tokens dominate -> training instability
The Qwen team observed this is especially problematic for MoE models (routing decisions add extra variance) and long-context training, where token-level variance is even higher.
SAPO: Soft Adaptive Policy Optimization
SAPO replaces the hard clip with a soft gate centered at $r_{i,t} = 1$. Instead of zeroing gradients, it down-weights off-policy tokens more smoothly:
- Near $r_{i,t} = 1$ (on-policy): weight stays high -> keep gradients
- As $r_{i,t}$ moves away: weight decays gradually -> soften, not zero
The Soft Gate
Instead of clipping $r_{i,t}(\theta)$, SAPO applies a sigmoid-shaped weighting function:

$$f_\tau(r_{i,t}) = 4\tau \cdot \sigma\!\left(\frac{r_{i,t} - 1}{\tau}\right)$$

where $\tau$ is a temperature and $\sigma$ is the sigmoid function.
Why $r_{i,t} - 1$? This centers the function at $r_{i,t} = 1$ (on-policy):
- $r_{i,t} > 1$: the probability increased under $\pi_\theta$ (policy favors this token more)
- $r_{i,t} < 1$: the probability decreased under $\pi_\theta$ (policy favors this token less)
Temperature $\tau$ controls sharpness:
- Small $\tau$: sharp transition (approaches hard clipping)
- Large $\tau$: smooth transition (tolerant of off-policy drift)
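To make "weight" precise: with the gate defined above, the gradient signal a token receives is scaled by the derivative of $f_\tau$ with respect to the ratio,

$$\frac{\partial f_\tau(r)}{\partial r} = 4\,\sigma\!\left(\frac{r-1}{\tau}\right)\left(1 - \sigma\!\left(\frac{r-1}{\tau}\right)\right)$$

which equals $4 \cdot 0.5 \cdot 0.5 = 1$ at $r = 1$ (the same slope as the unclipped ratio; that's what the $4\tau$ factor is for) and decays smoothly toward zero as $|r - 1|$ grows, at a rate set by $\tau$.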
Asymmetric Temperatures
SAPO uses different temperatures for positive vs negative advantages:

$$\tau_{i,t} = \begin{cases} \tau_{\text{pos}} & \text{if } \hat{A}_{i,t} > 0 \\ \tau_{\text{neg}} & \text{if } \hat{A}_{i,t} \le 0 \end{cases}$$

with $\tau_{\text{neg}} < \tau_{\text{pos}}$ (e.g., 0.05 vs 0.1).
Why? Negative updates are more destabilizing – when you push down the probability of one token, you push up the probabilities of many other (potentially wrong) tokens. This is fine (it's not). Tighter gating (smaller $\tau$) makes weights decay faster for off-policy tokens with negative advantage, limiting the damage.
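A tiny numeric illustration using the gate from above (the ratio and temperatures are just example values):

```python
import torch

# Gradient weight of the soft gate: d f_τ / d r = 4 σ((r-1)/τ) (1 - σ((r-1)/τ))
def gate_weight(ratio: torch.Tensor, tau: float) -> torch.Tensor:
    s = torch.sigmoid((ratio - 1) / tau)
    return 4 * s * (1 - s)

r = torch.tensor(1.2)               # a moderately off-policy token
print(gate_weight(r, 0.1).item())   # ≈ 0.42 with the looser τ_pos
print(gate_weight(r, 0.05).item())  # ≈ 0.07 with the tighter τ_neg
```

The same off-policy token keeps less than a fifth of its gradient weight when its advantage is negative.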
The SAPO Objective
$$\mathcal{J}_{\text{SAPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(f_{\tau_{i,t}}\big(r_{i,t}(\theta)\big)\,\hat{A}_{i,t} - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\Big)\right]$$

where

$$f_{\tau_{i,t}}(r) = 4\tau_{i,t} \cdot \sigma\!\left(\frac{r - 1}{\tau_{i,t}}\right)$$

with $\tau_{i,t}$ being the asymmetric temperature defined above.
Results: SAPO gives more stable training and better Pass@1 on math benchmarks. Gains are consistent across Qwen3-VL model sizes.
Implementation
Simplified from TRL's GRPOTrainer
```python
import torch
```

SAPO Token Loss
The soft gate function that replaces hard clipping:
$$f_\tau(r_{i,t}) = 4\tau \cdot \sigma\!\left(\frac{r_{i,t} - 1}{\tau}\right)$$

where $r_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})$ is the importance ratio.
```python
def get_sapo_token_loss(ratio: torch.Tensor, tau: float) -> torch.Tensor:
    # Soft gate: 4τ * sigmoid((r - 1) / τ); its slope is 1 at r = 1
    return torch.sigmoid((ratio - 1.0) / tau) * (4.0 * tau)
```

SAPO Loss
Per-token loss with asymmetric temperatures:
where $\tau = \tau_{\text{pos}}$ if $\hat{A}_{i,t} > 0$, else $\tau = \tau_{\text{neg}}$.
```python
def compute_sapo_loss(
    log_probs: torch.Tensor,       # (batch, seq_len) - log π_θ
    old_log_probs: torch.Tensor,   # (batch, seq_len) - log π_old
    advantages: torch.Tensor,      # (batch,) - per-sequence advantages
    mask: torch.Tensor,            # (batch, seq_len) - completion mask
    tau_pos: float = 0.1,          # looser gate for positive advantages
    tau_neg: float = 0.05,         # tighter gate for negative advantages
    beta: float = 0.01,
    ref_log_probs: torch.Tensor = None,  # (batch, seq_len) - log π_ref for KL
) -> torch.Tensor:
    # Importance ratio: r = π_θ / π_old = exp(log π_θ - log π_old)
    log_ratio = log_probs - old_log_probs
    ratio = torch.exp(log_ratio)

    # Expand advantages: (batch,) -> (batch, seq_len)
    advantages_expanded = advantages.unsqueeze(-1).expand_as(ratio)

    # Asymmetric temperatures based on advantage sign
    per_token_loss = torch.empty_like(ratio)
    positive_mask = advantages_expanded > 0
    per_token_loss[positive_mask] = get_sapo_token_loss(ratio[positive_mask], tau_pos)
    per_token_loss[~positive_mask] = get_sapo_token_loss(ratio[~positive_mask], tau_neg)

    # Multiply by advantage (negative because we minimize)
    per_token_loss = -per_token_loss * advantages_expanded

    # KL penalty estimate: D_KL[π_θ || π_ref] ≈ π_ref/π_θ - log(π_ref/π_θ) - 1
    if ref_log_probs is not None and beta > 0:
        ratio_ref = torch.exp(ref_log_probs - log_probs)
        kl = ratio_ref - (ref_log_probs - log_probs) - 1
        per_token_loss = per_token_loss + beta * kl

    # Average over tokens, then over batch
    loss = ((per_token_loss * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean()
    return loss
```

Advantage Computation
Z-score normalization within the group:
```python
def compute_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group baseline = mean reward; advantage = z-score within the group
    mean = rewards.mean()
    std = rewards.std().clamp(min=1e-8)
    return (rewards - mean) / std
```
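A toy end-to-end check, assuming the two functions above are in scope (all tensors are dummy data: 4 responses of 6 tokens for one prompt):

```python
# Dummy per-token log-probs for G = 4 responses of T = 6 tokens
G, T = 4, 6
old_log_probs = -torch.rand(G, T)  # pretend log π_old
log_probs = (old_log_probs + 0.05 * torch.randn(G, T)).requires_grad_(True)
mask = torch.ones(G, T)            # no padding in this toy case

rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])  # one scalar reward per response
advantages = compute_advantages(rewards)

loss = compute_sapo_loss(log_probs, old_log_probs, advantages, mask)
loss.backward()  # gradients flow through the soft gate, no tokens are hard-zeroed
print(loss.item(), log_probs.grad.shape)
```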
Notes

- For the PPO -> GRPO derivation, see this blogpost. Here I covered only the basics needed for SAPO.
- The relation to GSPO (and related derivations) was intentionally omitted for simplicity.
- I simplified the notation in some formulas (the full version is verbose and I'm lazy). Check the papers for the complete notation.
- The original SAPO paper doesn't include the KL term in the objective; I added it for consistency with the vanilla GRPO paper.
References
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Zhihong Shao et al.
- Soft Adaptive Policy Optimization, Chang Gao et al.
- TRL GRPOTrainer, HuggingFace