fix(ppo): preserve raw KL so rollout/kl logging is correct #2114

Open
EazyReal wants to merge 2 commits into THUDM:main from EazyReal:fix/ppo-kl-inplace-metric
Conversation

@EazyReal EazyReal commented Jun 21, 2026

Contributor

What changed

In compute_advantages_and_returns (slime/backends/megatron_utils/loss.py), the PPO estimator now builds its per-token reward tensor out-of-place:

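for reward, per_token_kl in zip(old_rewards, kl, strict=False):
    # Out-of-place multiply: allocates a new tensor and leaves per_token_kl untouched.
    token_level_rewards = per_token_kl * kl_coef
    if cp_rank == 0:
        # The scalar reward is added at the final token, on context-parallel rank 0 only.
        token_level_rewards[-1] += reward
    rewards.append(token_level_rewards)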

per_token_kl is the raw approximate KL tensor stored in rollout_data["kl"]; token_level_rewards is the separate PPO reward signal (-kl_coef * kl, plus the scalar reward at the final token). The reward math is unchanged.

A CPU unit test (tests/test_ppo_kl_metric.py) is added and registered in the cpu-unittest matrix.

Why

compute_advantages_and_returns stores approximate KL in rollout_data["kl"], which log_rollout_data later reduces and logs as rollout/kl. The previous PPO branch used k *= kl_coef and then k[-1] += reward; because k aliased rollout_data["kl"][i], it overwrote the logged raw KL with the reward-shaped tensor.
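The aliasing problem can be reproduced in isolation. The following is a minimal sketch, not the slime code: the helper names shape_in_place and shape_out_of_place, the sample values, and the negative kl_coef are all made up for illustration. It only shows that `k *= kl_coef` mutates the tensor held in the list, while `per_token_kl * kl_coef` allocates a new one.

```python
import torch


def shape_in_place(kl_list, rewards, kl_coef):
    """Old pattern (hypothetical stand-in): mutates every tensor in kl_list."""
    shaped = []
    for reward, k in zip(rewards, kl_list):
        k *= kl_coef       # in-place: writes into the tensor owned by kl_list
        k[-1] += reward    # in-place: scalar reward added at the final token
        shaped.append(k)
    return shaped


def shape_out_of_place(kl_list, rewards, kl_coef):
    """Fixed pattern (hypothetical stand-in): kl_list is only read."""
    shaped = []
    for reward, per_token_kl in zip(rewards, kl_list):
        token_level_rewards = per_token_kl * kl_coef  # allocates a new tensor
        token_level_rewards[-1] += reward             # mutates only the new tensor
        shaped.append(token_level_rewards)
    return shaped


# Illustrative values only; not taken from the PR.
raw_kl = [torch.tensor([0.1, 0.2, 0.3])]
rewards = [1.0]
kl_coef = -0.05

stored = [k.clone() for k in raw_kl]
shape_out_of_place(stored, rewards, kl_coef)
out_of_place_preserved = torch.equal(stored[0], raw_kl[0])

stored = [k.clone() for k in raw_kl]
shape_in_place(stored, rewards, kl_coef)
in_place_preserved = torch.equal(stored[0], raw_kl[0])

print(out_of_place_preserved, in_place_preserved)
```

With the in-place variant, anything that later reads the stored list (here standing in for rollout_data["kl"]) sees the shaped reward rather than the KL.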

The final form is intentionally explicit rather than relying on local-name shadowing. It matches the convention used by the other KL-penalized estimators: get_reinforce_plus_plus_returns derives token_level_rewards from KL, and get_reinforce_plus_plus_baseline_advantages computes from kl_tensor without mutating the stored KL.

Validation

tests/test_ppo_kl_metric.py runs the PPO estimator with a known KL (k1, kl_coef=0.05) and asserts rollout_data["kl"] still equals the freshly computed KL afterward. It fails on the old in-place update and passes with this fix. Megatron is stubbed through megatron.core.mpu with cp_size=1, mirroring tests/test_value_temperature.py; no GPU is required.
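For readers unfamiliar with stubbing Megatron in a CPU test, one way to do it is to register fake modules in sys.modules before the code under test imports them. This is a hedged sketch rather than the contents of tests/test_ppo_kl_metric.py: the helper name install_mpu_stub is hypothetical, and which mpu functions the loss module actually needs is an assumption (only the two context-parallel getters are stubbed here).

```python
import sys
import types


def install_mpu_stub(cp_rank: int = 0, cp_size: int = 1) -> types.ModuleType:
    """Register a minimal fake megatron.core.mpu in sys.modules (hypothetical helper)."""
    megatron = types.ModuleType("megatron")
    core = types.ModuleType("megatron.core")
    mpu = types.ModuleType("megatron.core.mpu")

    # Assumption: the code under test only queries context-parallel rank and size.
    mpu.get_context_parallel_rank = lambda: cp_rank
    mpu.get_context_parallel_world_size = lambda: cp_size

    # An empty __path__ marks the fake modules as packages, so dotted imports resolve.
    megatron.__path__ = []
    core.__path__ = []
    megatron.core = core
    core.mpu = mpu

    sys.modules["megatron"] = megatron
    sys.modules["megatron.core"] = core
    sys.modules["megatron.core.mpu"] = mpu
    return mpu


install_mpu_stub(cp_rank=0, cp_size=1)

from megatron.core import mpu  # resolves to the stub registered above

print(mpu.get_context_parallel_rank(), mpu.get_context_parallel_world_size())
```

The stub has to be installed before the module under test is imported; otherwise the import either fails or binds to a real Megatron installation.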

Local run:

PYTHONPATH=$PWD uvx --python 3.10 --with torch --with packaging --with numpy --with httpx pytest tests/test_ppo_kl_metric.py

@EazyReal force-pushed the fix/ppo-kl-inplace-metric branch 2 times, most recently from 1526792 to cbf085d, June 24, 2026 03:18

compute_advantages_and_returns stores the approximate KL in
rollout_data["kl"], which log_rollout_data reduces and logs as
rollout/kl. The ppo estimator builds its reward signal as
-kl_coef * kl (plus the scalar reward at the last token) in place
(`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in
rollout_data["kl"]. So after the ppo branch the logged KL is overwritten
with the reward. Every other estimator (grpo/gspo/cispo/reinforce++)
treats kl as read-only.

Build the reward out-of-place so the stored KL stays intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@EazyReal force-pushed the fix/ppo-kl-inplace-metric branch from cbf085d to 0a034ba, June 24, 2026 04:17
@EazyReal changed the title from "fix(ppo): stop corrupting the logged rollout/kl metric" to "fix(ppo): preserve raw KL for rollout/kl logging", Jun 25, 2026
@EazyReal changed the title from "fix(ppo): preserve raw KL for rollout/kl logging" to "fix(ppo): preserve raw KL so rollout/kl logging is correct", Jun 25, 2026

@EazyReal

Contributor Author

@zhuzilin could you review this now-cleaned version? PPO reward shaping was mutating the raw KL tensor before metrics, so rollout/kl logging could report shaped rewards instead of KL. The fix keeps raw KL separate and mirrors the local-k pattern used by the reinforce loss helpers.
