[reward] refactor: extract make_eval_script / parse_eval_output as public helpers #62

Open
aoshen02 wants to merge 2 commits into verl-project:main from aoshen02:feat/swebench-eval-helpers

@aoshen02

Summary

Extract two standalone functions from SWEBenchRewardSpec so external callers (vime/slime coding_agent_rl sandbox) can build eval scripts and parse results without instantiating the full reward spec class:

  • make_eval_script(metadata, workdir) -> str: builds a self-contained bash eval script
  • parse_eval_output(metadata, output) -> (solved, report): parses test output using swebench grading

SWEBenchRewardSpec.compute_reward and _get_eval_report now delegate to these functions, removing ~20 lines of duplicated logic.
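
As a rough illustration of the delegation pattern described above (not the code in this PR), a class can keep its public methods while forwarding to module-level functions. Every name here (build_script, parse_output, RewardSpec, and the test_cmd / expect metadata keys) is a hypothetical stand-in chosen for the sketch:

```python
import shlex
from typing import Any


def build_script(metadata: dict[str, Any], workdir: str) -> str:
    """Hypothetical stand-in for a module-level script builder."""
    test_cmd = metadata["test_cmd"]  # hypothetical metadata key
    return f"#!/bin/bash\ncd {shlex.quote(workdir)}\n{test_cmd}\n"


def parse_output(metadata: dict[str, Any], output: str) -> tuple[bool, dict[str, Any]]:
    """Hypothetical stand-in parser: solved iff the expected token appears."""
    report = {"found": metadata["expect"] in output}  # hypothetical key
    return report["found"], report


class RewardSpec:
    """Thin wrapper: holds state, delegates the logic to the helpers above."""

    def __init__(self, metadata: dict[str, Any], workdir: str) -> None:
        self.metadata = metadata
        self.workdir = workdir

    def script(self) -> str:
        return build_script(self.metadata, self.workdir)

    def compute_reward(self, output: str) -> float:
        solved, _report = parse_output(self.metadata, output)
        return 1.0 if solved else 0.0
```

With this shape, an external caller can import the two functions directly and never construct the class, while existing users of the class see no change in behavior.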

Motivation

vime (#250) and slime (THUDM/slime#2079) both need SWE-bench eval in their sandbox.evaluate() path. With public helpers:

from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output

script = make_eval_script(metadata, workdir)
# ... write to sandbox and execute ...
solved, report = parse_eval_output(metadata, output)

Test plan

  • Existing tests/ pass (no behavioral change — same logic, just extracted)
  • from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output works

🤖 Generated with Claude Code

…blic helpers

Extract two standalone functions from SWEBenchRewardSpec so external
callers (vime/slime coding_agent_rl sandbox) can build eval scripts and
parse results without instantiating the full reward spec class:

- make_eval_script(metadata, workdir) -> str: builds a bash eval script
- parse_eval_output(metadata, output) -> (solved, report): parses test output

SWEBenchRewardSpec.compute_reward and _get_eval_report now delegate to
these functions, removing duplicated logic (_get_logs_eval is inlined
into parse_eval_output).

Motivation: vime and slime both need SWE-bench eval in their sandbox
evaluate() path. Previously they either copy-pasted the logic or used
__new__ hacks. With public helpers they just:

    from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output
    script = make_eval_script(metadata, workdir)
    # ... execute in sandbox ...
    solved, report = parse_eval_output(metadata, output)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors uni_agent/reward/swe_bench.py by extracting script generation and output parsing logic into standalone, module-level functions (make_eval_script and parse_eval_output), simplifying compute_reward and _get_eval_report. The review feedback suggests making parse_eval_output more robust by defensively handling empty or None values for eval_output and checking if the test lists (FAIL_TO_PASS and PASS_TO_PASS) are already parsed objects rather than raw JSON strings.

Comment on lines +51 to +62
if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
    return False, report

test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
report["found_eval_status"] = True

eval_ref = {
    "instance_id": metadata["instance_id"],
    "FAIL_TO_PASS": json.loads(metadata.get("FAIL_TO_PASS", "[]")),
    "PASS_TO_PASS": json.loads(metadata.get("PASS_TO_PASS", "[]")),
}

medium

To make this public helper more robust for external callers (e.g., when integrated with Hugging Face datasets or when the evaluation fails/times out), we should:

  1. Defensively handle cases where eval_output is None or empty to avoid TypeError during substring checks.
  2. Handle cases where FAIL_TO_PASS and PASS_TO_PASS are already parsed as lists/dicts rather than JSON-serialized strings.
Suggested change
-if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
-    return False, report
-test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
-status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
-report["found_eval_status"] = True
-eval_ref = {
-    "instance_id": metadata["instance_id"],
-    "FAIL_TO_PASS": json.loads(metadata.get("FAIL_TO_PASS", "[]")),
-    "PASS_TO_PASS": json.loads(metadata.get("PASS_TO_PASS", "[]")),
-}
+if not eval_output or START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
+    return False, report
+test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
+status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
+report["found_eval_status"] = True
+fail_to_pass = metadata.get("FAIL_TO_PASS", "[]")
+pass_to_pass = metadata.get("PASS_TO_PASS", "[]")
+eval_ref = {
+    "instance_id": metadata["instance_id"],
+    "FAIL_TO_PASS": json.loads(fail_to_pass) if isinstance(fail_to_pass, str) else fail_to_pass,
+    "PASS_TO_PASS": json.loads(pass_to_pass) if isinstance(pass_to_pass, str) else pass_to_pass,
+}
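
The two defensive checks in this suggestion can also be tried in isolation. The following is a minimal sketch rather than the PR's code: the helper names load_test_list and extract_test_content are hypothetical, and the START / END strings are placeholders standing in for the real START_TEST_OUTPUT / END_TEST_OUTPUT constants.

```python
import json
from typing import Any, Optional

# Placeholder markers (assumption): the real values come from the
# START_TEST_OUTPUT / END_TEST_OUTPUT constants used in the module.
START = "<<START>>"
END = "<<END>>"


def load_test_list(value: Any) -> Any:
    """Accept a JSON-encoded string, an already-parsed object, or None."""
    if value is None:
        return []
    if isinstance(value, str):
        return json.loads(value)
    return value  # already parsed (e.g. a list from a dataset loader)


def extract_test_content(output: Optional[str]) -> Optional[str]:
    """Return the text between the markers, or None if it cannot be found."""
    # `not output` covers both None and "" before any substring check runs.
    if not output or START not in output or END not in output:
        return None
    return output.split(START, 1)[1].split(END, 1)[0]
```

Checking `not output` first matters because `START in None` raises TypeError, which is the failure mode a timed-out evaluation could otherwise trigger.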

…y at import time

External callers (vime/slime sandbox) only need make_eval_script/parse_eval_output,
which don't use AgentEnv. The top-level `from uni_agent.interaction import AgentEnv`
pulled in swerex (via env.py), failing on environments without it installed.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
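
For readers unfamiliar with the technique, deferring an import into the function that needs it can be sketched as follows. This is an illustrative example under stated assumptions, not the actual change: parse_helper and HeavySpec are hypothetical names, and hypothetical_heavy_dependency_xyz is a deliberately nonexistent module standing in for an optional dependency.

```python
def parse_helper(text: str) -> str:
    """Lightweight helper with no optional dependencies."""
    return text.strip()


class HeavySpec:
    """Only this class needs the optional dependency."""

    def run(self):
        # Deferred import: resolved when run() is called, not when this
        # module is imported. The module name below is a made-up stand-in.
        import hypothetical_heavy_dependency_xyz as heavy

        return heavy
```

Importing the module and calling parse_helper succeeds even when the optional dependency is missing; the ImportError surfaces only if HeavySpec().run() is actually invoked.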