[reward] refactor: extract make_eval_script / parse_eval_output as public helpers #62
aoshen02 wants to merge 2 commits into
…blic helpers
Extract two standalone functions from SWEBenchRewardSpec so external
callers (vime/slime coding_agent_rl sandbox) can build eval scripts and
parse results without instantiating the full reward spec class:
- make_eval_script(metadata, workdir) -> str: builds a bash eval script
- parse_eval_output(metadata, output) -> (solved, report): parses test output
SWEBenchRewardSpec.compute_reward and _get_eval_report now delegate to
these functions, removing duplicated logic (_get_logs_eval is inlined
into parse_eval_output).
Motivation: vime and slime both need SWE-bench eval in their sandbox
evaluate() path. Previously they either copy-pasted the logic or used
__new__ hacks. With public helpers they just:
from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output
script = make_eval_script(metadata, workdir)
# ... execute in sandbox ...
solved, report = parse_eval_output(metadata, output)
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Code Review
This pull request refactors uni_agent/reward/swe_bench.py by extracting script generation and output parsing logic into standalone, module-level functions (make_eval_script and parse_eval_output), simplifying compute_reward and _get_eval_report. The review feedback suggests making parse_eval_output more robust by defensively handling empty or None values for eval_output and checking if the test lists (FAIL_TO_PASS and PASS_TO_PASS) are already parsed objects rather than raw JSON strings.
    if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
        return False, report

    test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
    status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
    report["found_eval_status"] = True

    eval_ref = {
        "instance_id": metadata["instance_id"],
        "FAIL_TO_PASS": json.loads(metadata.get("FAIL_TO_PASS", "[]")),
        "PASS_TO_PASS": json.loads(metadata.get("PASS_TO_PASS", "[]")),
    }
To make this public helper more robust for external callers (e.g., when integrated with Hugging Face datasets or when the evaluation fails or times out), we should:
- Defensively handle cases where `eval_output` is `None` or empty, to avoid a `TypeError` during substring checks.
- Handle cases where `FAIL_TO_PASS` and `PASS_TO_PASS` are already parsed as lists/dicts rather than JSON-serialized strings.
    if not eval_output or START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
        return False, report
    test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
    status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
    report["found_eval_status"] = True
    fail_to_pass = metadata.get("FAIL_TO_PASS", "[]")
    pass_to_pass = metadata.get("PASS_TO_PASS", "[]")
    eval_ref = {
        "instance_id": metadata["instance_id"],
        "FAIL_TO_PASS": json.loads(fail_to_pass) if isinstance(fail_to_pass, str) else fail_to_pass,
        "PASS_TO_PASS": json.loads(pass_to_pass) if isinstance(pass_to_pass, str) else pass_to_pass,
    }
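
The two defensive checks in this suggestion can be tried in isolation with the sketch below. It is only an illustration: the marker strings and the helper names `extract_test_content` and `coerce_test_list` are hypothetical stand-ins, not the constants or functions defined in `uni_agent/reward/swe_bench.py`.

```python
import json
from typing import Any, Optional

# Hypothetical marker values chosen for this sketch; the real
# START_TEST_OUTPUT / END_TEST_OUTPUT constants may differ.
START_TEST_OUTPUT = ">>>>> Start Test Output"
END_TEST_OUTPUT = ">>>>> End Test Output"


def extract_test_content(eval_output: Optional[str]) -> Optional[str]:
    """Return the text between the two markers, or None if it is unavailable."""
    # `not eval_output` covers both None and "" before any `in` check,
    # so a run that produced no output cannot raise TypeError here.
    if not eval_output:
        return None
    if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
        return None
    return eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]


def coerce_test_list(value: Any) -> Any:
    """Decode JSON-serialized strings; pass already-parsed values through."""
    return json.loads(value) if isinstance(value, str) else value
```

With these helpers, a missing or empty output yields `None` instead of an exception, and a metadata field works whether it arrives as a JSON string or as a list that has already been parsed.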
…y at import time
External callers (vime/slime sandbox) only need make_eval_script/parse_eval_output, which don't use AgentEnv. The top-level `from uni_agent.interaction import AgentEnv` pulled in swerex (via env.py), failing on environments without it installed.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
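
One common way to keep an optional dependency out of the import path is to defer the import to the call site. The sketch below shows that general pattern only; the source does not show how this commit implements it, and `load_optional` is a hypothetical helper name introduced for illustration.

```python
import importlib
from typing import Any


def load_optional(module_name: str, attr: str) -> Any:
    """Import `module_name` only when called and return its attribute `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Fail at the point of use, with a message naming the missing module,
        # rather than when the enclosing module is first imported.
        raise ImportError(
            f"{module_name} is required for this code path but is not installed"
        ) from exc
    return getattr(module, attr)
```

Under this pattern, code that needs `AgentEnv` could call `load_optional("uni_agent.interaction", "AgentEnv")` inside the method that uses it, so importing the module for its standalone helpers would not pull in swerex.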
Summary
Extract two standalone functions from `SWEBenchRewardSpec` so external callers (vime/slime `coding_agent_rl` sandbox) can build eval scripts and parse results without instantiating the full reward spec class:
- `make_eval_script(metadata, workdir)` → `str`: builds a self-contained bash eval script
- `parse_eval_output(metadata, output)` → `(solved, report)`: parses test output using swebench grading

`SWEBenchRewardSpec.compute_reward` and `_get_eval_report` now delegate to these functions, removing ~20 lines of duplicated logic.

Motivation
vime (#250) and slime (THUDM/slime#2079) both need SWE-bench eval in their `sandbox.evaluate()` path. With public helpers:

Test plan
- `tests/` pass (no behavioral change: same logic, just extracted)
- `from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output` works

🤖 Generated with Claude Code