fix(ppo): stop corrupting the logged rollout/kl metric #2114
Open
EazyReal wants to merge 1 commit into
compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data reduces and logs as rollout/kl. The ppo estimator builds its reward signal as -kl_coef * kl (plus the scalar reward at the last token) in place (`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in rollout_data["kl"]. So after the ppo branch the logged KL is overwritten with the reward. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only.

Build the reward out-of-place so the stored KL stays intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
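The aliasing described in this commit message can be reproduced in a few lines. This is a minimal sketch, not the repository's code: NumPy arrays stand in for the torch tensors (both mutate in place under `*=`), `ppo_rewards_in_place` is a hypothetical name, and applying the sign as `-kl_coef` inside the loop is an assumption, since the real function's sign handling is not shown here.

```python
import numpy as np


def ppo_rewards_in_place(kl_list, rewards, kl_coef):
    """Hypothetical stand-in for the buggy ppo branch (assumed shape, not the real API)."""
    for k, r in zip(kl_list, rewards):
        k *= -kl_coef  # in place: writes through to the array the caller still holds
        k[-1] += r     # scalar reward added at the last token, also in place
    return kl_list


rollout_data = {"kl": [np.array([0.1, 0.2, 0.3])]}
token_rewards = ppo_rewards_in_place(rollout_data["kl"], rewards=[1.0], kl_coef=0.5)

# The "reward" and the stored KL are the same object, so whatever gets logged
# from rollout_data["kl"] afterward is the reward, not the KL.
print(token_rewards[0] is rollout_data["kl"][0])
print(rollout_data["kl"][0])
```

The first print shows that the returned reward and the stored KL are one array, which is why the metric read later from `rollout_data["kl"]` no longer holds the original values.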
What
`compute_advantages_and_returns` stores the approximate KL in `rollout_data["kl"]`, which `log_rollout_data` later reduces and logs as `rollout/kl`. The `ppo` estimator turns that KL into its reward signal (`-kl_coef * kl`, plus the scalar reward at the last token) in place: `k *= kl_coef; k[-1] += reward`. Because `k` aliases the tensors held in `rollout_data["kl"]`, the stored KL has been overwritten with the reward by the time the ppo branch finishes. `compute_advantages_and_returns` runs before `log_rollout_data`, and `kl` is not in its skip list, so the logged `rollout/kl` is wrong under PPO. Every other estimator (grpo/gspo/cispo/reinforce++) treats `kl` as read-only.

Fix
Build the reward out-of-place (`k = k * kl_coef`) so the subsequent `k[-1] += reward` no longer touches the stored KL. The reward math is unchanged.

Test
`tests/test_ppo_kl_metric.py` (registered in the `cpu-unittest` matrix) runs the ppo estimator with a known KL and asserts that `rollout_data["kl"]` is unchanged afterward. It fails on `main` (KL overwritten at the last token) and passes after the fix. Megatron is stubbed, mirroring `tests/test_value_temperature.py`; no GPU is needed.
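For readers who want the shape of the fix and its regression check without opening the diff, here is a hedged sketch. It is not the contents of `tests/test_ppo_kl_metric.py`: NumPy arrays stand in for torch tensors, and `ppo_rewards_out_of_place` and `check_kl_unchanged` are hypothetical names. The `-kl_coef` sign placement is likewise an assumption.

```python
import numpy as np


def ppo_rewards_out_of_place(kl_list, rewards, kl_coef):
    """Hypothetical stand-in for the fixed ppo branch (assumed shape, not the real API)."""
    out = []
    for k, r in zip(kl_list, rewards):
        k = k * -kl_coef  # out of place: allocates a new array and rebinds the local name
        k[-1] += r        # mutates only the new array, never the stored KL
        out.append(k)
    return out


def check_kl_unchanged():
    """Run the estimator on a known KL and verify the stored KL survives."""
    rollout_data = {"kl": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5])]}
    expected = [k.copy() for k in rollout_data["kl"]]  # snapshot before the call
    token_rewards = ppo_rewards_out_of_place(
        rollout_data["kl"], rewards=[1.0, 2.0], kl_coef=0.5
    )
    for stored, want in zip(rollout_data["kl"], expected):
        np.testing.assert_array_equal(stored, want)  # raises if the KL was touched
    return token_rewards


token_rewards = check_kl_unchanged()
```

Snapshotting with `.copy()` before the call matters: comparing the stored tensors against references to themselves would pass even when they had been mutated.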