[reward] refactor: extract make_eval_script / parse_eval_output as public helpers by aoshen02 · Pull Request #62 · verl-project/uni-agent

aoshen02 · 2026-06-15T06:09:22Z

Summary

Extract two standalone functions from SWEBenchRewardSpec so external callers (vime/slime coding_agent_rl sandbox) can build eval scripts and parse results without instantiating the full reward spec class:

make_eval_script(metadata, workdir) → str: builds a self-contained bash eval script
parse_eval_output(metadata, output) → (solved, report): parses test output using swebench grading

SWEBenchRewardSpec.compute_reward and _get_eval_report now delegate to these functions, removing ~20 lines of duplicated logic.
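The delegation described above could look roughly like the sketch below. This is not the code from the PR: the marker strings, the `test_cmd` metadata field, the "no FAILED line" grading rule, and the `RewardSpecSketch` class are all hypothetical stand-ins, used only to show a class method forwarding to module-level helpers.

```python
from typing import Any, Dict, Tuple

# Placeholder markers (assumption); the real helpers use the swebench markers.
START_MARK = "<<START_TEST_OUTPUT>>"
END_MARK = "<<END_TEST_OUTPUT>>"


def make_eval_script(metadata: Dict[str, Any], workdir: str) -> str:
    """Stand-in: build a bash script that runs tests between two markers."""
    test_cmd = metadata.get("test_cmd", "pytest -q")  # hypothetical field
    lines = [
        "#!/bin/bash",
        f"cd {workdir}",
        f"echo '{START_MARK}'",
        test_cmd,
        f"echo '{END_MARK}'",
    ]
    return "\n".join(lines)


def parse_eval_output(metadata: Dict[str, Any], output: str) -> Tuple[bool, Dict[str, Any]]:
    """Stand-in: return (solved, report) from the text between the markers."""
    report: Dict[str, Any] = {"found_eval_status": False}
    if not output or START_MARK not in output or END_MARK not in output:
        return False, report
    body = output.split(START_MARK)[1].split(END_MARK)[0]
    report["found_eval_status"] = True
    # Simplified grading rule (assumption): solved when nothing reports FAILED.
    return "FAILED" not in body, report


class RewardSpecSketch:
    """Hypothetical class whose methods only delegate to the helpers."""

    def eval_script(self, metadata: Dict[str, Any], workdir: str) -> str:
        return make_eval_script(metadata, workdir)

    def compute_reward(self, metadata: Dict[str, Any], output: str) -> float:
        solved, _report = parse_eval_output(metadata, output)
        return 1.0 if solved else 0.0
```

Because the class holds no logic of its own in this arrangement, external callers and the reward spec share one code path.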

Motivation

vime (#250) and slime (THUDM/slime#2079) both need SWE-bench eval in their sandbox.evaluate() path. With public helpers:

from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output

script = make_eval_script(metadata, workdir)
# ... write to sandbox and execute ...
solved, report = parse_eval_output(metadata, output)
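The elided "write to sandbox and execute" step might be wired up as in this sketch. The `FakeSandbox` class, its `write_file`/`run` methods, and the `/tmp/eval.sh` path are hypothetical and do not reflect the vime or slime sandbox APIs; the two helpers are passed in as parameters so the example runs without uni_agent installed.

```python
from typing import Any, Callable, Dict, List, Tuple


class FakeSandbox:
    """Hypothetical in-memory sandbox that records calls and returns canned output."""

    def __init__(self, canned_output: str) -> None:
        self.canned_output = canned_output
        self.files: Dict[str, str] = {}
        self.commands: List[str] = []

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    def run(self, command: str) -> str:
        self.commands.append(command)
        return self.canned_output


def evaluate(
    sandbox: FakeSandbox,
    metadata: Dict[str, Any],
    workdir: str,
    make_script: Callable[[Dict[str, Any], str], str],
    parse_output: Callable[[Dict[str, Any], str], Tuple[bool, Dict[str, Any]]],
) -> Tuple[bool, Dict[str, Any]]:
    """Build the script, run it in the sandbox, and parse what comes back."""
    script = make_script(metadata, workdir)
    script_path = "/tmp/eval.sh"  # hypothetical location inside the sandbox
    sandbox.write_file(script_path, script)
    output = sandbox.run(f"bash {script_path}")
    return parse_output(metadata, output)
```

In real use, `make_script` and `parse_output` would be the `make_eval_script` and `parse_eval_output` helpers imported above.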

Test plan

Existing tests/ pass (no behavioral change: same logic, just extracted)
Importing make_eval_script and parse_eval_output from uni_agent.reward.swe_bench works

🤖 Generated with Claude Code

…blic helpers

Extract two standalone functions from SWEBenchRewardSpec so external callers (vime/slime coding_agent_rl sandbox) can build eval scripts and parse results without instantiating the full reward spec class:

- make_eval_script(metadata, workdir) -> str: builds a bash eval script
- parse_eval_output(metadata, output) -> (solved, report): parses test output

SWEBenchRewardSpec.compute_reward and _get_eval_report now delegate to these functions, removing duplicated logic (_get_logs_eval is inlined into parse_eval_output).

Motivation: vime and slime both need SWE-bench eval in their sandbox evaluate() path. Previously they either copy-pasted the logic or used __new__ hacks. With public helpers they just:

    from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output
    script = make_eval_script(metadata, workdir)
    # ... execute in sandbox ...
    solved, report = parse_eval_output(metadata, output)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>

gemini-code-assist

Code Review

This pull request refactors uni_agent/reward/swe_bench.py by extracting script generation and output parsing logic into standalone, module-level functions (make_eval_script and parse_eval_output), simplifying compute_reward and _get_eval_report. The review feedback suggests making parse_eval_output more robust by defensively handling empty or None values for eval_output and checking if the test lists (FAIL_TO_PASS and PASS_TO_PASS) are already parsed objects rather than raw JSON strings.

gemini-code-assist · 2026-06-15T06:10:16Z

+    if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
+        return False, report
+
+    test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
+    status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
+    report["found_eval_status"] = True
+
+    eval_ref = {
+        "instance_id": metadata["instance_id"],
+        "FAIL_TO_PASS": json.loads(metadata.get("FAIL_TO_PASS", "[]")),
+        "PASS_TO_PASS": json.loads(metadata.get("PASS_TO_PASS", "[]")),
+    }


To make this public helper more robust for external callers (e.g., when integrated with Hugging Face datasets or when the evaluation fails/times out), we should:

Defensively handle cases where eval_output is None or empty to avoid TypeError during substring checks.

Handle cases where FAIL_TO_PASS and PASS_TO_PASS are already parsed as lists/dicts rather than JSON-serialized strings.

Suggested change

    if not eval_output or START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
        return False, report

    test_content = eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
    status_map = MAP_REPO_TO_PARSER[repo](test_content, None)
    report["found_eval_status"] = True

    fail_to_pass = metadata.get("FAIL_TO_PASS", "[]")
    pass_to_pass = metadata.get("PASS_TO_PASS", "[]")
    eval_ref = {
        "instance_id": metadata["instance_id"],
        "FAIL_TO_PASS": json.loads(fail_to_pass) if isinstance(fail_to_pass, str) else fail_to_pass,
        "PASS_TO_PASS": json.loads(pass_to_pass) if isinstance(pass_to_pass, str) else pass_to_pass,
    }
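
The two defensive checks in this review comment can be isolated into small helpers, as in the sketch below. The names `as_test_list` and `extract_test_content` are hypothetical and do not appear in the PR, and the marker values are placeholders. The sketch also goes slightly beyond the suggestion by treating a `None` or empty-string metadata value as an empty list, which is an assumption about what callers would want.

```python
import json
from typing import Any, List, Optional

# Placeholder marker values (assumption); the PR takes its markers from swebench.
START_TEST_OUTPUT = ">>>>> Start Test Output"
END_TEST_OUTPUT = ">>>>> End Test Output"


def as_test_list(value: Any) -> List[str]:
    """Accept a JSON-encoded string or an already parsed sequence of test names."""
    if value is None:
        return []
    if isinstance(value, str):
        # An empty string is treated as "no tests" instead of failing in json.loads.
        return json.loads(value) if value.strip() else []
    return list(value)


def extract_test_content(eval_output: Optional[str]) -> Optional[str]:
    """Return the text between the markers, or None if the output is unusable."""
    if not eval_output:
        # Covers both None and "", so the substring checks below never see None.
        return None
    if START_TEST_OUTPUT not in eval_output or END_TEST_OUTPUT not in eval_output:
        return None
    return eval_output.split(START_TEST_OUTPUT)[1].split(END_TEST_OUTPUT)[0]
```

With helpers like these, a timed-out evaluation that yields `None` is reported as unsolved without raising.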

…y at import time

External callers (vime/slime sandbox) only need make_eval_script/parse_eval_output, which don't use AgentEnv. The top-level `from uni_agent.interaction import AgentEnv` pulled in swerex (via env.py), failing on environments without it installed.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
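
One common way to avoid a dependency at import time is to move the import into the function that needs it. The sketch below illustrates that pattern only. The commit's actual change is not shown on this page, and `heavy_optional_dep`, `build_agent_env`, and `make_eval_script_stub` are hypothetical names.

```python
from typing import Any, Dict


def make_eval_script_stub(metadata: Dict[str, Any], workdir: str) -> str:
    """Lightweight helper that needs no optional dependency."""
    return f"cd {workdir}"


def build_agent_env(*args: Any, **kwargs: Any) -> Any:
    """Only this function needs the heavy package, so the import lives here."""
    # Deferred import: a missing package raises ImportError when this function
    # is called, not when the module is imported.
    from heavy_optional_dep import AgentEnv  # hypothetical module name

    return AgentEnv(*args, **kwargs)
```

Callers that only use the lightweight helper can then import the module even where the optional package is not installed.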

aoshen02 wants to merge 2 commits into verl-project:main from aoshen02:feat/swebench-eval-helpers
