feat(loss): add pg_loss aggregation modes#2090
Open
EazyReal wants to merge 2 commits into
Open
Conversation
40d955f to
3774a73
Compare
--loss-aggregation for the four ScaleRL pg_loss aggregation modes3774a73 to
f23073b
Compare
f23073b to
0fe5e98
Compare
c3da906 to
fa69ef7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
On main, slime has two built-in
pg_lossnormalizations: the default sample/rollout mean and the legacy--calculate-per-token-losspath. Prompt-group normalization and fixed-divisor normalization both require a custompg_lossreducer today. That is fragile because the reducer has to compose correctly with CP slicing, micro-batch packing, and Megatron's final train-step divisor; a wrong constant factor silently changes the effective learning rate.This PR moves the common
pg_lossaggregation choices behind--loss-aggregation. For one train step, letNbe the number of rollouts,G = n_samples_per_prompt,P = N / G,M_ibe the valid-token count for rollouti, andL = --loss-aggregation-divisor.sample_mean(1 / N) * sum_i token_mean(i)token_meansum_i,t loss_it * mask_it / sum_i M_i--calculate-per-token-losspath under the new unified knob.prompt_mean(1 / P) * sum_g token_mean(prompt_group_g)Sample.group_indexand gives each prompt group one unit of weight.constantsum_i,t loss_it * mask_it / (L * N)The implementation reuses slime's existing reducer path instead of adding a parallel loss stack. Rollout builds
prompt_mask_sumsonly forprompt_mean;cp_utils.get_sum_of_sample_meanstill owns CP-aware summation; and the train step keeps its existing final scaling. Forprompt_mean, the reducer scales the prompt-group sum byn_samples_per_prompt, so users get the requested prompt mean directly instead of a constant-offset objective that would need learning-rate compensation.Startup validation rejects configurations that would silently use the wrong denominator, such as
--loss-aggregation-divisoroutsideconstantor--calculate-per-token-losstogether withprompt_mean/constant. The custompg_lossreducer hook remains available and keeps precedence for non-standard objectives.The focused tests cover the final
prompt_meanscalar, CP/sample-denominator invariance, argument validation, and rollout metadata construction.Tested with:
uv run --no-project --with pytest --with torch --with numpy --with httpx --with pyyaml python -m pytest -q tests/test_cp_utils.py tests/test_megatron_argument_validation.py tests/test_rollout_validation.pyuv run --no-project --with ruff ruff check --config pyproject.toml docs/en/get_started/customization.md slime/backends/megatron_utils/loss.py slime/ray/rollout.py slime/utils/arguments.py tests/test_cp_utils.py tests/test_megatron_argument_validation.py tests/test_rollout_validation.pyuv run --no-project --with black black --check --line-length 119 slime/backends/megatron_utils/loss.py slime/utils/arguments.py