fix: support eval-only mode (--num-rollout 0)#2109

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:fix/eval-only-num-rollout-zero

@EazyReal

Contributor

What

train.py documents an eval-only mode (--num-rollout 0 with --eval-interval set), but it crashes on startup. With num_rollout == 0 the estimated train_iters is 0, so lr_decay_steps is 0 and Megatron's OptimizerParamScheduler aborts on assert lr_decay_steps > 0. The scheduler is constructed during model init, so the abort happens before the eval-only short-circuit in train.py is ever reached.

Fix

Clamp train_iters to >= 1. It only sizes the LR-decay schedule, and the scheduler is never stepped in eval-only mode, so the value is irrelevant there; for real training (num_rollout > 0) max(1, …) is a no-op.
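
A minimal sketch of the failure and the clamp, for illustration only. The actual estimate used in the repository is not shown in this description, so `estimate_train_iters`, `steps_per_rollout`, and `StubScheduler` are hypothetical stand-ins; only the `assert lr_decay_steps > 0` check and the `max(1, …)` clamp come from the text above.

```python
def estimate_train_iters(num_rollout: int, steps_per_rollout: int) -> int:
    """Hypothetical estimate: total optimizer steps across all rollouts."""
    return num_rollout * steps_per_rollout


def clamped_train_iters(num_rollout: int, steps_per_rollout: int) -> int:
    """Same estimate, clamped to >= 1 so the LR-decay schedule can be sized."""
    return max(1, estimate_train_iters(num_rollout, steps_per_rollout))


class StubScheduler:
    """Stand-in for the scheduler; reproduces only the assertion named above."""

    def __init__(self, lr_decay_steps: int) -> None:
        assert lr_decay_steps > 0
        self.lr_decay_steps = lr_decay_steps


def can_build_scheduler(train_iters: int) -> bool:
    """Return True if the stub scheduler accepts this schedule length."""
    try:
        StubScheduler(lr_decay_steps=train_iters)
    except AssertionError:
        return False
    return True


# Eval-only mode: the unclamped estimate is 0 and trips the assertion.
unclamped_ok = can_build_scheduler(estimate_train_iters(0, 4))
# With the clamp, construction succeeds; the value is never used for stepping.
clamped_ok = can_build_scheduler(clamped_train_iters(0, 4))
# Real training: the clamp leaves the estimate unchanged.
training_iters = clamped_train_iters(3, 4)
```

Under these assumptions, clamping at the point where the schedule is sized keeps the change local: nothing about the eval-only short-circuit itself needs to move.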

Adds a CPU regression test (Megatron stubbed, mirroring tests/test_megatron_argument_validation.py) registered in the cpu-unittest matrix.
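
The shape of such a CPU-only test can be sketched as follows. This is not the test added by this PR: the module name `fake_megatron_scheduler`, the helper `build_scheduler`, and the `steps_per_rollout` parameter are all hypothetical, and the stub class copies only the assertion quoted above. It shows one common way to stub a heavy dependency, by placing a fake module in `sys.modules` for the duration of each test.

```python
import sys
import types
import unittest
from unittest import mock


def make_stub_scheduler_module() -> types.ModuleType:
    """Build a fake module exposing a scheduler with the quoted assertion."""
    mod = types.ModuleType("fake_megatron_scheduler")  # hypothetical name

    class OptimizerParamScheduler:
        def __init__(self, lr_decay_steps: int) -> None:
            assert lr_decay_steps > 0
            self.lr_decay_steps = lr_decay_steps

    mod.OptimizerParamScheduler = OptimizerParamScheduler
    return mod


def build_scheduler(num_rollout: int, steps_per_rollout: int, clamp: bool = True):
    """Hypothetical model-init step: size the schedule, then build the scheduler."""
    import fake_megatron_scheduler as sched  # resolved from sys.modules (stubbed)

    train_iters = num_rollout * steps_per_rollout
    if clamp:
        train_iters = max(1, train_iters)
    return sched.OptimizerParamScheduler(lr_decay_steps=train_iters)


class EvalOnlyRegressionTest(unittest.TestCase):
    def setUp(self) -> None:
        # Install the stub module for this test only; removed again on cleanup.
        patcher = mock.patch.dict(
            sys.modules, {"fake_megatron_scheduler": make_stub_scheduler_module()}
        )
        patcher.start()
        self.addCleanup(patcher.stop)

    def test_unclamped_zero_rollouts_trips_assertion(self) -> None:
        with self.assertRaises(AssertionError):
            build_scheduler(num_rollout=0, steps_per_rollout=4, clamp=False)

    def test_clamped_zero_rollouts_builds_scheduler(self) -> None:
        scheduler = build_scheduler(num_rollout=0, steps_per_rollout=4)
        self.assertEqual(scheduler.lr_decay_steps, 1)

    def test_clamp_is_noop_for_real_training(self) -> None:
        scheduler = build_scheduler(num_rollout=3, steps_per_rollout=4)
        self.assertEqual(scheduler.lr_decay_steps, 12)
```

Saved as a test module, this runs with `python -m unittest` and needs neither a GPU nor Megatron installed.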

Fixes #1785

train.py advertises an eval-only path (num_rollout == 0 with eval_interval
set), but the optimizer scheduler is still constructed during model init.
With num_rollout == 0 the estimated train_iters is 0, so lr_decay_steps is 0
and Megatron's OptimizerParamScheduler aborts on `assert lr_decay_steps > 0`,
making the advertised eval-only path unreachable.

Clamp train_iters to >= 1: it only sizes the LR-decay schedule and the
scheduler is never stepped in eval-only mode, so the exact value is
irrelevant. Add a CPU unit test (Megatron stubbed) that reproduces the
assertion, and register it in the cpu-unittest CI matrix.

Fixes THUDM#1785

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from a658aac to 9e5f530 on June 20, 2026 18:21
Development

Successfully merging this pull request may close these issues.

[Bug] train.py num_rollout==0 error
