fix(rm_hub): grade the final ###Response segment in deepscaler reward #2116

Open
SuperMarioYL wants to merge 1 commit into THUDM:main from SuperMarioYL:fix/deepscaler-response-segment

What

get_deepscaler_rule_based_reward splits the model response on the answer
marker and grades the segment that follows it. The two marker branches were
inconsistent:

if "</think>" in response:
    model_solution = response.split("</think>")[-1]   # last segment
elif "###Response" in response:
    model_solution = response.split("###Response")[1]  # second segment

The </think> branch takes the last segment ([-1]), but the
###Response branch takes the second segment ([1]). When a response
contains more than one ###Response marker (e.g. the model echoes the
marker inside its reasoning before emitting the final answer), [1] grades a
middle segment instead of the final answer. A correct answer in the last
segment is then silently scored 0, which feeds a wrong reward signal into RL
training that no other CI check would surface.
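
To make the index difference concrete, here is a minimal sketch that uses plain str.split on a response with two ###Response markers. It does not call the reward function, and the variable names are illustrative only.

```python
# Illustration only: how the two indices differ when the marker appears twice.
response = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"

# Two markers produce three segments.
parts = response.split("###Response")

second_segment = parts[1]   # what the old ###Response branch graded
last_segment = parts[-1]    # what the </think> branch convention would grade

print(second_segment)  # the intermediate answer, \boxed{2}
print(last_segment)    # the final answer, \boxed{42}
```

With a single marker the list has exactly two elements, so both indices select the same segment.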

Reproduction

from slime.rollout.rm_hub.deepscaler import get_deepscaler_rule_based_reward

resp = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"
get_deepscaler_rule_based_reward(resp, "42")   # -> 0  (expected 1)

Fix

Take [-1] for the ###Response branch as well, so both branches grade the
final segment. This matches the </think> branch and the contract already
documented in tests/test_rm_deepscaler.py ("only what comes after is
graded"). The single-marker case is unchanged (split(...)[1] == split(...)[-1]
when there is exactly one marker), so existing behavior is preserved.
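
The selection logic after the fix can be sketched as a standalone helper. This is an assumption-laden stand-in, not the code in slime: select_graded_segment and the marker constants are hypothetical names, and returning None when no marker is present is an assumption made only so the sketch is complete.

```python
from typing import Optional

THINK_MARKER = "</think>"
RESPONSE_MARKER = "###Response"


def select_graded_segment(response: str) -> Optional[str]:
    """Hypothetical stand-in: return the text after the final answer marker."""
    if THINK_MARKER in response:
        return response.split(THINK_MARKER)[-1]
    if RESPONSE_MARKER in response:
        # [-1] instead of [1], so repeated markers still yield the final segment.
        return response.split(RESPONSE_MARKER)[-1]
    # Assumption: with no marker there is nothing to grade.
    return None


single = r"reasoning###Response\boxed{42}"
multi = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"
think = r"reasoning \boxed{2}</think>\boxed{42}"

# Single-marker case: [1] and [-1] pick the same segment, so behavior is unchanged.
assert single.split(RESPONSE_MARKER)[1] == single.split(RESPONSE_MARKER)[-1]

# Multi-marker and </think> cases both resolve to the final segment.
assert select_graded_segment(multi) == select_graded_segment(think)
```

Using the same index in both branches means the graded segment no longer depends on which separator the chat template happens to emit.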

Tests

Added two regression tests to the existing CPU unit suite
(tests/test_rm_deepscaler.py, already in the cpu-unittest CI matrix):

  • test_response_split_on_response_marker_grades_last_segment — multiple
    ###Response markers grade the final segment; the wrong intermediate answer
    is not graded. (Red before this change, green after.)
  • test_think_and_response_markers_agree_on_last_segment — the two marker
    branches pick the same final segment regardless of which separator the chat
    template emits.

All 12 tests in the file pass; ruff check, ruff format, and isort are
clean.

This change is orthogonal to the missing-response guard in #2115 (which adds a
check above these branches and does not touch the split index), so the two do
not conflict.
