feat(rl): add off-policy IS correction hook (current policy vs rollout) by EazyReal · Pull Request #2084 · THUDM/slime

EazyReal · 2026-06-15T20:44:44Z

What changed

policy_loss_function (slime/backends/megatron_utils/loss.py) now passes the current grad-carrying log-probs as cur_log_probs into the TIS-correction kwargs, alongside the existing frozen train_log_probs (pi_theta_old) and rollout_log_probs (pi_rollout). The existing corrections (vanilla_tis_function, icepop_function) ignore the new kwarg via **kwargs.
New off_policy_is_function in slime/utils/ppo_utils.py: a truncated-IS correction whose detached weight is clip(pi_theta / pi_rollout), computed against the actual rollout log-probs, so a single weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS hooks only use pi_theta_old / pi_rollout, which matches this weight only in the single-update-per-rollout limit, where pi_theta = pi_theta_old.
Selected through the existing hook mechanism, no new arg or dispatch branch:
--use-tis --custom-tis-function-path slime.utils.ppo_utils.off_policy_is_function.
Clipping uses --eps-clip / --eps-clip-high (the CISPO/PPO clip convention); --eps-clip 1.0 gives canonical single-sided clipping. Note this reuses the policy-loss clip range rather than the --tis-clip* range used by the other TIS hooks.
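To make the clipping convention concrete, here is a minimal per-token sketch in plain Python. It is not the code in this PR: the helper names are hypothetical, and it assumes the clip range is [1 - eps_clip, 1 + eps_clip_high], which is the usual PPO convention and is consistent with --eps-clip 1.0 giving single-sided clipping.

```python
import math


def clipped_is_weight(cur_log_prob, rollout_log_prob, eps_clip, eps_clip_high):
    """Hypothetical helper: truncated IS weight for one token.

    Assumes the clip range is [1 - eps_clip, 1 + eps_clip_high].
    Returns (weight, was_clipped).
    """
    # Importance ratio pi_theta / pi_rollout, computed from log-probs.
    ratio = math.exp(cur_log_prob - rollout_log_prob)
    low, high = 1.0 - eps_clip, 1.0 + eps_clip_high
    weight = min(max(ratio, low), high)
    return weight, weight != ratio


def clip_fraction(cur_log_probs, rollout_log_probs, eps_clip, eps_clip_high):
    """Fraction of tokens whose ratio fell outside the clip range."""
    flags = [
        clipped_is_weight(c, r, eps_clip, eps_clip_high)[1]
        for c, r in zip(cur_log_probs, rollout_log_probs)
    ]
    return sum(flags) / len(flags)
```

With eps_clip = 1.0 the lower bound becomes 0, and since a probability ratio is never negative only the upper bound can bind. In the real hook the weight would additionally be detached from the autograd graph; this sketch has no autograd, so it only shows the values.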

Why

On a REINFORCE base (--advantage-estimator reinforce) this reproduces the CISPO surrogate (MiniMax-M1, arXiv 2506.13585) as a composable correction: L = -clip(pi_theta/pi_rollout).detach() * A * log pi_theta, with gradient flowing only through log pi_theta. It generalizes the train/inference-mismatch correction to also absorb off-policy staleness without adding a dedicated estimator.
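Written out per token, the surrogate and its gradient look as follows. This is a sketch under two assumptions: sg[.] denotes stop-gradient (the .detach() above), and the clip bounds are 1 - eps_low and 1 + eps_high, taken to correspond to --eps-clip and --eps-clip-high.

```latex
w_t = \operatorname{clip}\!\left(
        \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)},\;
        1 - \epsilon_{\mathrm{low}},\; 1 + \epsilon_{\mathrm{high}}
      \right),
\qquad
L_t(\theta) = -\,\operatorname{sg}[w_t]\; A_t \,\log \pi_\theta(a_t \mid s_t)

\nabla_\theta L_t(\theta) = -\,\operatorname{sg}[w_t]\; A_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Because the weight is detached, clipping bounds the size of each token's contribution but never zeroes its gradient, unlike the PPO clipped objective, where tokens outside the range contribute no gradient.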

Validation

CPU unit test tests/test_off_policy_is.py (registered in the cpu-unittest matrix, NUM_GPUS = 0), run via pytest tests/test_off_policy_is.py:

clamp behavior and clipfrac on known ratios, with loss_masks passed through unchanged;
on a manual REINFORCE base, loss and gradient match the closed-form CISPO surrogate exactly (gradient flows only through log pi);
--eps-clip 1.0 disables the lower bound (single-sided).
The cur_log_probs wiring in loss.py imports megatron, so it is not covered by the CPU test and is instead exercised by the GPU CI suites.
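The "gradient flows only through log pi" check can be illustrated without autograd. The following is a hypothetical single-token sketch, not the contents of tests/test_off_policy_is.py: it freezes the weight to emulate .detach(), estimates the gradient by central differences, and compares it with the closed form -w * A. All names and numbers are illustrative assumptions.

```python
import math


def clipped_ratio(log_prob, rollout_log_prob, low, high):
    """clip(pi_theta / pi_rollout, low, high) for one token."""
    return min(max(math.exp(log_prob - rollout_log_prob), low), high)


def surrogate_detached(log_prob, frozen_weight, advantage):
    # The weight is a constant here, emulating a detached tensor.
    return -frozen_weight * advantage * log_prob


def surrogate_attached(log_prob, rollout_log_prob, advantage, low, high):
    # For contrast: the weight is recomputed from log_prob, so it
    # also contributes to the derivative.
    weight = clipped_ratio(log_prob, rollout_log_prob, low, high)
    return -weight * advantage * log_prob


def central_diff(f, x, h=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Illustrative values; the bounds are wide enough that the ratio is unclipped.
log_prob, rollout_log_prob, advantage = -0.5, -0.6, 2.0
low, high = 0.0, 2.0

weight = clipped_ratio(log_prob, rollout_log_prob, low, high)
closed_form = -weight * advantage
grad_detached = central_diff(
    lambda lp: surrogate_detached(lp, weight, advantage), log_prob
)
grad_attached = central_diff(
    lambda lp: surrogate_attached(lp, rollout_log_prob, advantage, low, high),
    log_prob,
)
```

Here grad_detached agrees with closed_form up to finite-difference error, while grad_attached differs because the ratio itself depends on log_prob.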

EazyReal · 2026-06-25T08:47:52Z

@zhuzilin could you review this one? It adds the off-policy IS hook at the policy-loss boundary using current-policy logprobs against rollout logprobs, preserving gradient flow through pi_theta while keeping the correction weight detached.

EazyReal mentioned this pull request Jun 21, 2026

RFC: factor the policy loss into orthogonal axes (advantage × policy-loss × is-level × correction × regularizer) EazyReal/slime#1

Open

EazyReal changed the title ~~feat(rl): composable current-policy importance-sampling correction (TIS hook)~~ feat(rl): add off-policy IS correction hook (current policy vs rollout) Jun 24, 2026

EazyReal force-pushed the off-policy-is branch from 1b70bf3 to 45c297c Compare June 24, 2026 03:18

feat(rl): add off-policy IS correction hook (current policy vs rollout)

e40db8f

EazyReal force-pushed the off-policy-is branch from 45c297c to e40db8f Compare June 24, 2026 04:27

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:off-policy-is
