fix(ppo): preserve raw KL so rollout/kl logging is correct#2114
Open
EazyReal wants to merge 2 commits into
Open
fix(ppo): preserve raw KL so rollout/kl logging is correct#2114EazyReal wants to merge 2 commits into
EazyReal wants to merge 2 commits into
Conversation
1526792 to
cbf085d
Compare
compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data reduces and logs as rollout/kl. The ppo estimator builds its reward signal as -kl_coef * kl (plus the scalar reward at the last token) in place (`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in rollout_data["kl"]. So after the ppo branch the logged KL is overwritten with the reward. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only. Build the reward out-of-place so the stored KL stays intact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cbf085d to
0a034ba
Compare
Contributor
Author
|
@zhuzilin could you review this now-cleaned version? PPO reward shaping was mutating the raw KL tensor before metrics, so rollout/kl logging could report shaped rewards instead of KL. The fix keeps raw KL separate and mirrors the local-k pattern used by the reinforce loss helpers. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed
In
compute_advantages_and_returns(slime/backends/megatron_utils/loss.py), the PPO estimator now builds its per-token reward tensor out-of-place:per_token_klis the raw approximate KL tensor stored inrollout_data["kl"];token_level_rewardsis the separate PPO reward signal (-kl_coef * kl, plus the scalar reward at the final token). The reward math is unchanged.A CPU unit test (
tests/test_ppo_kl_metric.py) is added and registered in thecpu-unittestmatrix.Why
compute_advantages_and_returnsstores approximate KL inrollout_data["kl"], whichlog_rollout_datalater reduces and logs asrollout/kl. The previous PPO branch usedk *= kl_coefand thenk[-1] += reward; becausekaliasedrollout_data["kl"][i], it overwrote the logged raw KL with the reward-shaped tensor.The final form is intentionally explicit rather than relying on local-name shadowing. It matches the convention used by the other KL-penalized estimators:
get_reinforce_plus_plus_returnsderivestoken_level_rewardsfrom KL, andget_reinforce_plus_plus_baseline_advantagescomputes fromkl_tensorwithout mutating the stored KL.Validation
tests/test_ppo_kl_metric.pyruns the PPO estimator with a known KL (k1,kl_coef=0.05) and assertsrollout_data["kl"]still equals the freshly computed KL afterward. It fails on the old in-place update and passes with this fix. Megatron is stubbed throughmegatron.core.mpuwithcp_size=1, mirroringtests/test_value_temperature.py; no GPU is required.Local run:
PYTHONPATH=$PWD uvx --python 3.10 --with torch --with packaging --with numpy --with httpx pytest tests/test_ppo_kl_metric.py