fix(ppo): stop corrupting the logged rollout/kl metric by EazyReal · Pull Request #2114 · THUDM/slime

EazyReal · 2026-06-21T18:43:12Z

What

compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data later reduces and logs as rollout/kl. The ppo estimator turns that KL into its reward signal — -kl_coef * kl, plus the scalar reward at the last token — in place:

for reward, k in zip(old_rewards, kl, strict=False):
    k *= kl_coef          # k aliases rollout_data["kl"][i]
    if cp_rank == 0:
        k[-1] += reward
    rewards.append(k)

k aliases the tensors held in rollout_data["kl"], so after the ppo branch the stored KL has been overwritten with the reward. compute_advantages_and_returns runs before log_rollout_data, and kl is not in its skip list, so the logged rollout/kl is wrong under PPO. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only.

Fix

Build the reward out-of-place (k = k * kl_coef) so the subsequent k[-1] += reward no longer touches the stored KL. The reward math is unchanged.

Test

tests/test_ppo_kl_metric.py (registered in the cpu-unittest matrix) runs the ppo estimator with a known KL and asserts rollout_data["kl"] is unchanged afterward. Fails on main (KL overwritten at the last token), passes after the fix. Megatron is stubbed, mirroring tests/test_value_temperature.py; no GPU.

compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data reduces and logs as rollout/kl. The ppo estimator builds its reward signal as -kl_coef * kl (plus the scalar reward at the last token) in place (`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in rollout_data["kl"]. So after the ppo branch the logged KL is overwritten with the reward. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only. Build the reward out-of-place so the stored KL stays intact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal force-pushed the fix/ppo-kl-inplace-metric branch from 5cbfc4a to 1526792 Compare June 22, 2026 20:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ppo): stop corrupting the logged rollout/kl metric#2114

fix(ppo): stop corrupting the logged rollout/kl metric#2114
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:fix/ppo-kl-inplace-metric

EazyReal commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EazyReal commented Jun 21, 2026

What

Fix

Test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant