Skip to content

fix(ppo): stop corrupting the logged rollout/kl metric#2114

Open
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:fix/ppo-kl-inplace-metric
Open

fix(ppo): stop corrupting the logged rollout/kl metric#2114
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:fix/ppo-kl-inplace-metric

Conversation

@EazyReal

Copy link
Copy Markdown
Contributor

What

compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data later reduces and logs as rollout/kl. The ppo estimator turns that KL into its reward signal — -kl_coef * kl, plus the scalar reward at the last token — in place:

for reward, k in zip(old_rewards, kl, strict=False):
    k *= kl_coef          # k aliases rollout_data["kl"][i]
    if cp_rank == 0:
        k[-1] += reward
    rewards.append(k)

k aliases the tensors held in rollout_data["kl"], so after the ppo branch the stored KL has been overwritten with the reward. compute_advantages_and_returns runs before log_rollout_data, and kl is not in its skip list, so the logged rollout/kl is wrong under PPO. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only.

Fix

Build the reward out-of-place (k = k * kl_coef) so the subsequent k[-1] += reward no longer touches the stored KL. The reward math is unchanged.

Test

tests/test_ppo_kl_metric.py (registered in the cpu-unittest matrix) runs the ppo estimator with a known KL and asserts rollout_data["kl"] is unchanged afterward. Fails on main (KL overwritten at the last token), passes after the fix. Megatron is stubbed, mirroring tests/test_value_temperature.py; no GPU.

compute_advantages_and_returns stores the approximate KL in
rollout_data["kl"], which log_rollout_data reduces and logs as
rollout/kl. The ppo estimator builds its reward signal as
-kl_coef * kl (plus the scalar reward at the last token) in place
(`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in
rollout_data["kl"]. So after the ppo branch the logged KL is overwritten
with the reward. Every other estimator (grpo/gspo/cispo/reinforce++)
treats kl as read-only.

Build the reward out-of-place so the stored KL stays intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal EazyReal force-pushed the fix/ppo-kl-inplace-metric branch from 5cbfc4a to 1526792 Compare June 22, 2026 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant