fix(ppo): preserve raw KL so rollout/kl logging is correct by EazyReal · Pull Request #2114 · THUDM/slime

EazyReal · 2026-06-21T18:43:12Z

What changed

In compute_advantages_and_returns (slime/backends/megatron_utils/loss.py), the PPO estimator now builds its per-token reward tensor out-of-place:

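for reward, per_token_kl in zip(old_rewards, kl, strict=False):
    # Out-of-place multiply: allocates a new tensor, leaving rollout_data["kl"] untouched.
    token_level_rewards = per_token_kl * kl_coef
    # Only context-parallel rank 0 adds the scalar reward at the final token.
    if cp_rank == 0:
        token_level_rewards[-1] += reward
    rewards.append(token_level_rewards)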

per_token_kl is the raw approximate KL tensor stored in rollout_data["kl"]; token_level_rewards is the separate PPO reward signal (-kl_coef * kl, plus the scalar reward at the final token). The reward math is unchanged.
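
For reference, the reward signal described above can be written as a formula. This is a sketch under two assumptions not confirmed by the text: that the local kl_coef in the loop already carries the negative sign (the snippet multiplies by kl_coef, while the description states -kl_coef * kl), and that context parallelism is off, so the cp_rank == 0 branch always runs. The symbols are notation introduced here: β is the configured positive KL coefficient, kl_t is the approximate KL at token t, T is the response length, and R is the scalar reward.

```latex
r_t = -\beta \,\mathrm{kl}_t + \mathbb{1}[t = T]\, R, \qquad t = 1, \dots, T
```

Only r_t is new storage; kl_t is read and never written.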

A CPU unit test (tests/test_ppo_kl_metric.py) is added and registered in the cpu-unittest matrix.

Why

compute_advantages_and_returns stores approximate KL in rollout_data["kl"], which log_rollout_data later reduces and logs as rollout/kl. The previous PPO branch used k *= kl_coef and then k[-1] += reward; because k aliased rollout_data["kl"][i], it overwrote the logged raw KL with the reward-shaped tensor.
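
The aliasing problem can be reproduced without Megatron or torch. The sketch below is illustrative only: plain Python lists stand in for the per-sample KL tensors, shape_rewards_in_place is a hypothetical helper that mimics the old branch, and kl_coef is assumed to be passed in already negated.

```python
def shape_rewards_in_place(kl, scalar_rewards, kl_coef):
    """Mimic the old PPO branch: scale each KL list in place, then add the reward."""
    shaped = []
    for reward, k in zip(scalar_rewards, kl):
        for i in range(len(k)):
            k[i] *= kl_coef  # in-place update, analogous to `k *= kl_coef` on a tensor
        k[-1] += reward      # scalar reward lands on the final token
        shaped.append(k)     # `k` is the very same object stored in `kl`
    return shaped


# Hypothetical rollout with one sample and three tokens of raw KL.
rollout_data = {"kl": [[0.5, 0.25, 0.25]]}
rewards = shape_rewards_in_place(rollout_data["kl"], [1.0], kl_coef=-0.5)

# The stored "KL" now holds the shaped reward, so a metric reduced from it is wrong.
corrupted = rollout_data["kl"][0]
print(corrupted)
```

Because rewards[0] and rollout_data["kl"][0] are the same object, anything that later reduces rollout_data["kl"] sees reward values rather than KL.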

The final form is intentionally explicit rather than relying on local-name shadowing. It matches the convention used by the other KL-penalized estimators: get_reinforce_plus_plus_returns derives token_level_rewards from KL, and get_reinforce_plus_plus_baseline_advantages computes from kl_tensor without mutating the stored KL.

Validation

tests/test_ppo_kl_metric.py runs the PPO estimator with a known KL (k1, kl_coef=0.05) and asserts rollout_data["kl"] still equals the freshly computed KL afterward. It fails on the old in-place update and passes with this fix. Megatron is stubbed through megatron.core.mpu with cp_size=1, mirroring tests/test_value_temperature.py; no GPU is required.
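
The shape of such a regression check can be sketched as follows. This is not the contents of tests/test_ppo_kl_metric.py: it is a dependency-free approximation in which plain lists replace tensors, shape_rewards_out_of_place is a hypothetical stand-in for the fixed PPO branch, and the KL values and coefficient are made up for the example.

```python
import copy


def shape_rewards_out_of_place(kl, scalar_rewards, kl_coef):
    """Build per-token rewards in new lists so the input KL is never mutated."""
    shaped = []
    for reward, per_token_kl in zip(scalar_rewards, kl):
        token_level_rewards = [x * kl_coef for x in per_token_kl]  # fresh list
        token_level_rewards[-1] += reward
        shaped.append(token_level_rewards)
    return shaped


def test_kl_is_preserved():
    rollout_data = {"kl": [[0.5, 0.25, 0.25], [0.125, 0.0]]}
    expected_kl = copy.deepcopy(rollout_data["kl"])  # snapshot before shaping

    rewards = shape_rewards_out_of_place(rollout_data["kl"], [1.0, -1.0], kl_coef=-0.5)

    # The stored KL must be unchanged, and the rewards must be distinct objects.
    assert rollout_data["kl"] == expected_kl
    assert all(r is not k for r, k in zip(rewards, rollout_data["kl"]))
    return rewards


shaped_rewards = test_kl_is_preserved()
```

Swapping in an in-place implementation makes the first assertion fail, which is the property the description attributes to the real test.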

Local run:

PYTHONPATH=$PWD uvx --python 3.10 --with torch --with packaging --with numpy --with httpx pytest tests/test_ppo_kl_metric.py

compute_advantages_and_returns stores the approximate KL in rollout_data["kl"], which log_rollout_data reduces and logs as rollout/kl.

The ppo estimator builds its reward signal as -kl_coef * kl (plus the scalar reward at the last token) in place (`k *= kl_coef; k[-1] += reward`), and `k` aliases the tensors in rollout_data["kl"]. So after the ppo branch the logged KL is overwritten with the reward. Every other estimator (grpo/gspo/cispo/reinforce++) treats kl as read-only.

Build the reward out-of-place so the stored KL stays intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal · 2026-06-25T08:46:27Z

@zhuzilin could you review this now-cleaned version? PPO reward shaping was mutating the raw KL tensor before metrics, so rollout/kl logging could report shaped rewards instead of KL. The fix keeps raw KL separate and mirrors the local-k pattern used by the reinforce loss helpers.

EazyReal force-pushed the fix/ppo-kl-inplace-metric branch 2 times, most recently from 1526792 to cbf085d, on June 24, 2026 03:18

EazyReal force-pushed the fix/ppo-kl-inplace-metric branch from cbf085d to 0a034ba on June 24, 2026 04:17

fix(ppo): make KL reward tensor explicit

b0d9dc8

EazyReal changed the title ~~fix(ppo): stop corrupting the logged rollout/kl metric~~ fix(ppo): preserve raw KL for rollout/kl logging Jun 25, 2026

EazyReal changed the title ~~fix(ppo): preserve raw KL for rollout/kl logging~~ fix(ppo): preserve raw KL so rollout/kl logging is correct Jun 25, 2026

EazyReal wants to merge 2 commits into THUDM:main from EazyReal:fix/ppo-kl-inplace-metric
