fix(rm_hub): grade the final ###Response segment in deepscaler reward #2116
Open
SuperMarioYL wants to merge 1 commit into
The deepscaler rule-based reward splits the response on the answer marker and grades the segment after it. The </think> branch takes the last segment ([-1]), but the ###Response branch took the second segment ([1]). When a response contains more than one ###Response marker (e.g. the model echoes the marker in its reasoning before emitting the final answer), [1] grades a middle segment instead of the final answer, so a correct answer in the last segment is silently scored 0 — a wrong RL reward signal. Take [-1] for the ###Response branch too, so both branches grade the final segment. Add regression tests pinning last-segment grading and cross-branch consistency.
What
get_deepscaler_rule_based_reward splits the model response on the answer marker and grades the segment that follows it. The two marker branches were inconsistent:
The </think> branch takes the last segment ([-1]), but the ###Response branch takes the second segment ([1]). When a response contains more than one ###Response marker (e.g. the model echoes the marker inside its reasoning before emitting the final answer), [1] grades a middle segment instead of the final answer. A correct answer in the last segment is then silently scored 0, i.e. a wrong RL reward signal that no other CI check would surface.
Reproduction
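A minimal sketch of the failure, assuming the pre-fix ###Response branch reduces to a plain `str.split` followed by index `[1]`. The helper name `segment_before_fix` and the sample response are illustrative stand-ins, not code from the repository:

```python
# Stand-in for the pre-fix ###Response branch (assumption: a plain split + [1]).
MARKER = "###Response"


def segment_before_fix(response: str) -> str:
    # With repeated markers, [1] is the text between the first and second marker.
    return response.split(MARKER)[1]


# Hypothetical response: the model echoes the marker in its reasoning,
# then emits the final answer after the last marker.
response = (
    "prompt text ###Response "
    "I will put the final answer after the ###Response marker. "
    "###Response \\boxed{42}"
)

graded = segment_before_fix(response)
print(repr(graded))         # the middle segment that gets graded
print("\\boxed{42}" in graded)  # whether the final answer is in the graded text
```

Under these assumptions the graded text is a fragment of the reasoning and never contains the boxed answer, which is how a correct response ends up with a reward of 0.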
Fix
Take [-1] for the ###Response branch as well, so both branches grade the final segment. This matches the </think> branch and the contract already documented in tests/test_rm_deepscaler.py ("only what comes after is graded"). The single-marker case is unchanged (split(...)[1] == split(...)[-1] when there is exactly one marker), so existing behavior is preserved.
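The single-marker equivalence can be checked in isolation. This is a sketch under the assumption that the branch is a plain `str.split`; `final_segment` is a hypothetical helper, not the function touched by this PR:

```python
MARKER = "###Response"  # marker string as written in this PR


def final_segment(response: str, marker: str = MARKER) -> str:
    """Hypothetical helper: text after the last occurrence of the marker."""
    return response.split(marker)[-1]


# Exactly one marker yields two parts, so index 1 and index -1 coincide.
single = "question ###Response \\boxed{7}"
parts = single.split(MARKER)
single_marker_unchanged = len(parts) == 2 and parts[1] == parts[-1]

# With a repeated marker, only the text after the last one is returned.
repeated = "question ###Response draft \\boxed{3} ###Response \\boxed{7}"
last = final_segment(repeated)
```

Because the two indices only diverge when the marker appears more than once, switching to `[-1]` changes behavior solely for the multi-marker responses that were previously mis-graded.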
Tests
Added two regression tests to the existing CPU unit suite (tests/test_rm_deepscaler.py, already in the cpu-unittest CI matrix):
test_response_split_on_response_marker_grades_last_segment: multiple ###Response markers grade the final segment; the wrong intermediate answer is not graded. (Red before this change, green after.)
test_think_and_response_markers_agree_on_last_segment: the two marker branches pick the same final segment regardless of which separator the chat template emits.
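The two properties these tests pin can be illustrated against a stand-in. The sketch below does not reproduce the actual test bodies; `graded_segment` and the `_sketch` test names are hypothetical, and the marker is passed explicitly so that nothing is assumed about the order in which the real function checks its branches:

```python
THINK = "</think>"
RESPONSE = "###Response"


def graded_segment(response: str, marker: str) -> str:
    """Stand-in for one marker branch after the fix: keep the last segment."""
    return response.split(marker)[-1]


def test_last_segment_is_graded_sketch():
    # An intermediate (wrong) answer precedes the final marker.
    response = f"q {RESPONSE} wrong: \\boxed{{1}} {RESPONSE} \\boxed{{2}}"
    segment = graded_segment(response, RESPONSE)
    assert "\\boxed{2}" in segment
    assert "\\boxed{1}" not in segment


def test_markers_agree_sketch():
    # Same response shape, differing only in which separator is used.
    template = "q {m} wrong: \\boxed{{1}} {m} \\boxed{{2}}"
    think = graded_segment(template.format(m=THINK), THINK)
    resp = graded_segment(template.format(m=RESPONSE), RESPONSE)
    assert think == resp
```

Replacing `[-1]` with `[1]` in the stand-in makes the first sketch fail, which mirrors the red-before, green-after behavior described above.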
All 12 tests in the file pass; ruff check, ruff format, and isort are clean.
This change is orthogonal to the missing-response guard in #2115 (which adds a
check above these branches and does not touch the split index), so the two do
not conflict.