feat(rl): add off-policy IS correction hook (current policy vs rollout) #2084

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:off-policy-is

@EazyReal EazyReal commented Jun 15, 2026

Contributor

What changed

  • policy_loss_function (slime/backends/megatron_utils/loss.py) now passes the current grad-carrying log-probs as cur_log_probs into the TIS-correction kwargs, alongside the existing frozen train_log_probs (pi_theta_old) and rollout_log_probs (pi_rollout). The existing corrections (vanilla_tis_function, icepop_function) ignore the new kwarg via **kwargs.
  • New off_policy_is_function in slime/utils/ppo_utils.py: a truncated-IS correction whose detached weight is clip(pi_theta / pi_rollout), computed against the actual rollout log-prob, so a single weight corrects both the train/inference mismatch and async (multi-version) staleness. The existing TIS hooks only see pi_theta_old / pi_rollout, which matches this weight only when pi_theta = pi_theta_old, i.e. in the single-update-per-rollout limit.
  • Selected through the existing hook mechanism, no new arg or dispatch branch:
    --use-tis --custom-tis-function-path slime.utils.ppo_utils.off_policy_is_function.
  • Clipping uses --eps-clip / --eps-clip-high (the CISPO/PPO clip convention); setting --eps-clip 1.0 effectively disables the lower bound, which gives canonical single-sided clipping. Note that this reuses the policy-loss clip range rather than the --tis-clip* range used by the other TIS hooks.
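
For illustration, the clipped weight described in the list above can be sketched in framework-free Python. This is not the code from this PR: `clipped_is_weights` is a hypothetical helper, and the bounds `[1 - eps_clip, 1 + eps_clip_high]` are an assumption based on the usual PPO/CISPO convention.

```python
import math


def clipped_is_weights(cur_log_probs, rollout_log_probs, eps_clip, eps_clip_high):
    """Hypothetical helper: per-token clip(pi_theta / pi_rollout) and clip fraction.

    Assumes the clip range is [1 - eps_clip, 1 + eps_clip_high].
    """
    low, high = 1.0 - eps_clip, 1.0 + eps_clip_high
    weights, n_clipped = [], 0
    for cur, roll in zip(cur_log_probs, rollout_log_probs):
        ratio = math.exp(cur - roll)             # pi_theta / pi_rollout
        clipped = min(max(ratio, low), high)     # truncate to [low, high]
        n_clipped += clipped != ratio            # count tokens the clip changed
        weights.append(clipped)
    clipfrac = n_clipped / len(weights) if weights else 0.0
    return weights, clipfrac


# Known ratios 0.5, 1.0, 2.0 against a rollout log-prob of 0.
cur = [math.log(0.5), 0.0, math.log(2.0)]
roll = [0.0, 0.0, 0.0]
two_sided, two_sided_frac = clipped_is_weights(cur, roll, eps_clip=0.2, eps_clip_high=0.2)
# With eps_clip = 1.0 the lower bound becomes 0, so only the upper bound can bind.
one_sided, one_sided_frac = clipped_is_weights(cur, roll, eps_clip=1.0, eps_clip_high=0.2)
```

In a training loop the resulting weight would be treated as a constant (for example via `.detach()` in PyTorch), so that it rescales the loss without contributing its own gradient.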

Why

On a REINFORCE base (--advantage-estimator reinforce) this reproduces the CISPO surrogate (MiniMax-M1, arXiv:2506.13585) as a composable correction: L = -clip(pi_theta / pi_rollout).detach() * A * log pi_theta, with gradient flowing only through log pi_theta. It extends the train/inference-mismatch correction to also absorb off-policy staleness, without adding a dedicated estimator.
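
To make the "gradient only through log pi" point concrete, here is a small numeric sketch for a single token. It is framework-free and purely illustrative: the helper names and the numbers are made up, and the clip range `[1 - eps_clip, 1 + eps_clip_high]` is assumed. Holding the weight constant plays the role of `.detach()`; the second loss recomputes the weight from the log-prob to show the extra gradient term that detaching removes.

```python
import math

rollout_log_prob = math.log(0.25)   # log pi_rollout (illustrative value)
cur_log_prob = math.log(0.275)      # log pi_theta   (illustrative value)
advantage = 2.0
eps_clip, eps_clip_high = 0.2, 0.2  # assumed clip range [0.8, 1.2]


def clip_weight(log_prob):
    """clip(pi_theta / pi_rollout) under the assumed clip range."""
    ratio = math.exp(log_prob - rollout_log_prob)
    return min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)


def finite_diff(fn, x, h=1e-6):
    """Central finite difference of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


weight = clip_weight(cur_log_prob)  # computed once, then treated as a constant


def detached_loss(log_prob):
    # Weight is a fixed number here: only log pi_theta carries gradient.
    return -weight * advantage * log_prob


def coupled_loss(log_prob):
    # Weight is recomputed from log_prob: gradient also flows through the ratio.
    return -clip_weight(log_prob) * advantage * log_prob


grad_detached = finite_diff(detached_loss, cur_log_prob)  # approx. -weight * advantage
grad_coupled = finite_diff(coupled_loss, cur_log_prob)    # picks up an extra ratio term
```

With the weight held constant the derivative with respect to log pi_theta is just -w * A, which is the REINFORCE gradient rescaled by the clipped ratio. If the weight were left attached and unclipped, the derivative would instead be -w * A * (1 + log pi_theta).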

Validation

CPU unit test tests/test_off_policy_is.py (registered in the cpu-unittest matrix, NUM_GPUS = 0), run via pytest tests/test_off_policy_is.py:

  • clamp behavior and clipfrac on known ratios, with loss_masks passed through unchanged;
  • on a manual REINFORCE base, loss and gradient match the closed-form CISPO surrogate exactly (gradient flows only through log pi_theta);
  • --eps-clip 1.0 disables the lower bound (single-sided).

The cur_log_probs wiring lives in loss.py, which imports megatron, so it is not covered by this CPU test and is exercised by the GPU CI suites instead.

@EazyReal EazyReal changed the title from "feat(rl): composable current-policy importance-sampling correction (TIS hook)" to "feat(rl): add off-policy IS correction hook (current policy vs rollout)" Jun 24, 2026
@EazyReal

Contributor Author

@zhuzilin could you review this one? It adds the off-policy IS hook at the policy-loss boundary using current-policy logprobs against rollout logprobs, preserving gradient flow through pi_theta while keeping the correction weight detached.
