fix(rm_hub): grade the final ###Response segment in deepscaler reward by SuperMarioYL · Pull Request #2116 · THUDM/slime

SuperMarioYL · 2026-06-22T02:18:17Z

What

get_deepscaler_rule_based_reward splits the model response on the answer
marker and grades the segment that follows it. The two marker branches were
inconsistent:

if "</think>" in response:
    model_solution = response.split("</think>")[-1]   # last segment
elif "###Response" in response:
    model_solution = response.split("###Response")[1]  # second segment

The </think> branch takes the last segment ([-1]), but the
###Response branch takes the second segment ([1]). When a response
contains more than one ###Response marker — e.g. the model echoes the
marker inside its reasoning before emitting the final answer — [1] grades a
middle segment instead of the final answer. A correct answer in the last
segment is then silently scored 0, i.e. a wrong RL reward signal that no
other CI check would surface.
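For illustration, the divergence between the two indices can be reproduced with plain `str.split`, independent of the reward code. The strings below are made up for this sketch; only the marker and the indexing mirror the snippet above.

```python
# Illustrative only: shows how [1] and [-1] diverge once the marker repeats.
MARKER = "###Response"

# One marker: the split yields two segments.
single = r"reasoning" + MARKER + r"\boxed{42}"
# Two markers: the split yields three segments.
repeated = r"reasoning \boxed{1}" + MARKER + r"\boxed{2}" + MARKER + r"\boxed{42}"

single_parts = single.split(MARKER)
repeated_parts = repeated.split(MARKER)

# With exactly one marker, [1] and [-1] are the same element.
assert single_parts[1] == single_parts[-1] == r"\boxed{42}"

# With two markers, [1] is the middle segment and [-1] is the final one.
assert repeated_parts[1] == r"\boxed{2}"
assert repeated_parts[-1] == r"\boxed{42}"
```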

Reproduction

from slime.rollout.rm_hub.deepscaler import get_deepscaler_rule_based_reward

resp = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"
get_deepscaler_rule_based_reward(resp, "42")   # -> 0  (expected 1)

Fix

Take [-1] for the ###Response branch as well, so both branches grade the
final segment. This matches the </think> branch and the contract already
documented in tests/test_rm_deepscaler.py ("only what comes after is
graded"). The single-marker case is unchanged (split(...)[1] == split(...)[-1]
when there is exactly one marker), so existing behavior is preserved.
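A minimal sketch of the segment selection after the fix. `extract_graded_segment` is a hypothetical helper name used only here, not a function in slime, and returning `None` when no marker is present is an assumption, since this description does not show that path.

```python
from typing import Optional

THINK_MARKER = "</think>"
RESPONSE_MARKER = "###Response"


def extract_graded_segment(response: str) -> Optional[str]:
    """Hypothetical helper: return the text after the last answer marker.

    Mirrors the branch order in the snippet above: "</think>" is checked
    first, then "###Response". Both branches take the final segment.
    """
    if THINK_MARKER in response:
        return response.split(THINK_MARKER)[-1]
    if RESPONSE_MARKER in response:
        return response.split(RESPONSE_MARKER)[-1]
    # Assumption: the no-marker path is not shown in this PR description.
    return None


# The reproduction string now yields the final segment.
resp = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"
assert extract_graded_segment(resp) == r"\boxed{42}"
```

With a single marker the result is the same as with `[1]`, which is why the change should not affect responses that contain the marker once.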

Tests

Added two regression tests to the existing CPU unit suite
(tests/test_rm_deepscaler.py, already in the cpu-unittest CI matrix):

test_response_split_on_response_marker_grades_last_segment: with multiple
###Response markers, the final segment is graded and the wrong intermediate
answer is not. (Red before this change, green after.)

test_think_and_response_markers_agree_on_last_segment: the two marker
branches pick the same final segment regardless of which separator the chat
template emits.
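The test bodies are not reproduced in this description. As a rough sketch of the kind of assertions they pin, the example below uses `toy_reward`, a stand-in written for this illustration. It is not slime's grader: it only compares the first `\boxed{...}` in the selected segment against the label, and scoring 0 when no marker is present is an assumption.

```python
import re


def toy_reward(response: str, label: str) -> int:
    """Stand-in grader for illustration only (NOT the slime implementation)."""
    if "</think>" in response:
        solution = response.split("</think>")[-1]
    elif "###Response" in response:
        solution = response.split("###Response")[-1]
    else:
        return 0  # assumption: no marker means nothing to grade
    # Toy check: first \boxed{...} in the selected segment must equal the label.
    match = re.search(r"\\boxed\{([^{}]*)\}", solution)
    return int(match is not None and match.group(1).strip() == label)


def test_response_marker_grades_last_segment():
    resp = r"reasoning \boxed{1}###Response\boxed{2}###Response\boxed{42}"
    assert toy_reward(resp, "42") == 1  # final segment is graded
    assert toy_reward(resp, "2") == 0   # intermediate answer is not


def test_markers_agree_on_last_segment():
    # Same content with either separator should select the same final segment.
    for sep in ("</think>", "###Response"):
        resp = r"draft \boxed{7}" + sep + r"more \boxed{8}" + sep + r"final \boxed{42}"
        assert toy_reward(resp, "42") == 1
        assert toy_reward(resp, "8") == 0
```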

All 12 tests in the file pass; ruff check, ruff format, and isort are
clean.

This change is orthogonal to the missing-response guard in #2115 (which adds a
check above these branches and does not touch the split index), so the two do
not conflict.

The deepscaler rule-based reward splits the response on the answer marker and grades the segment after it. The </think> branch takes the last segment ([-1]), but the ###Response branch took the second segment ([1]). When a response contains more than one ###Response marker (e.g. the model echoes the marker in its reasoning before emitting the final answer), [1] grades a middle segment instead of the final answer, so a correct answer in the last segment is silently scored 0 — a wrong RL reward signal. Take [-1] for the ###Response branch too, so both branches grade the final segment. Add regression tests pinning last-segment grading and cross-branch consistency.

SuperMarioYL wants to merge 1 commit into THUDM:main from SuperMarioYL:fix/deepscaler-response-segment
