Improve terminal-bench eval execution defaults by wgqqqqq · Pull Request #1031 · GCWing/BitFun

wgqqqqq · 2026-06-02T04:38:19Z

Summary

Cherry-pick the terminal-bench eval execution default improvements.
Add eval deadline guidance and metadata propagation, and tighten Bash budget handling.
Keep eval exec behavior focused on concrete artifacts, verification, and non-interactive reliability.

Verification

cargo check -p bitfun-cli

CLI chat and exec consumed EventQueue entries for UI only, so TokenUsageUpdated never reached TokenUsageSubscriber and nothing was written to ~/.config/bitfun/data/token_usage.
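The routing problem described above can be sketched as follows. This is a minimal illustration, not BitFun's actual code: the `Event` enum, the `Subscriber` trait, `UsageRecorder`, and `drain_events` are hypothetical stand-ins for `EventQueue`, `TokenUsageSubscriber`, and the CLI loop, whose real definitions are not shown in this PR.

```rust
// Hypothetical names throughout; BitFun's real event types are not shown in this PR.
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
enum Event {
    TextDelta(String),
    TokenUsageUpdated { input: u64, output: u64 },
}

/// Anything that wants to observe events internally (e.g. a usage recorder).
trait Subscriber {
    fn on_event(&mut self, event: &Event);
}

#[derive(Default)]
struct UsageRecorder {
    total_tokens: u64,
}

impl Subscriber for UsageRecorder {
    fn on_event(&mut self, event: &Event) {
        // Only usage events matter here; everything else is ignored.
        if let Event::TokenUsageUpdated { input, output } = event {
            self.total_tokens += input + output;
        }
    }
}

/// Drain the queue, routing each dequeued event to the internal subscriber
/// before handing it to the UI sink, so the UI no longer swallows it.
fn drain_events(
    queue: &mut VecDeque<Event>,
    subscriber: &mut dyn Subscriber,
    ui: &mut Vec<Event>,
) {
    while let Some(event) = queue.pop_front() {
        subscriber.on_event(&event);
        ui.push(event);
    }
}
```

The point of the sketch is only the ordering: the consumer that dequeues an event is also responsible for forwarding it internally, rather than treating the queue as a UI-only feed.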

Persist file logs in exec mode, emit stable BITFUN_EXIT stderr lines on failure, and create patch output parent directories so automated runners can classify errors and debug full traces.

Co-authored-by: Cursor <cursoragent@cursor.com>
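A rough sketch of two of the behaviors named above, using only the Rust standard library. The helper names `write_patch` and `exit_line` are hypothetical, and the `code=... reason=...` layout of the BITFUN_EXIT line is an assumption made for illustration; the PR does not show the format it actually emits.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write a patch file, creating any missing parent directories first.
fn write_patch(path: &Path, patch: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        // An empty parent means "current directory"; nothing to create.
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }
    fs::write(path, patch)
}

/// Build one machine-readable failure line. The `key=value` layout is an
/// assumption for illustration, not the format the PR actually emits.
fn exit_line(code: i32, reason: &str) -> String {
    // Keep the output on a single line so a runner can match it by prefix.
    let reason = reason.replace('\n', " ");
    format!("BITFUN_EXIT code={code} reason={reason}")
}
```

A caller would typically pass the result of `exit_line` to `eprintln!` just before exiting with a non-zero status, so that an automated runner can classify the failure from stderr alone.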

Disable runtime enforcement of configured tool, Bash, and subagent timeouts while keeping timeout parameters accepted for compatibility. Wait for CLI exec turns to settle before emitting patch output so session state is persisted before process exit.
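One way to read "accepted for compatibility but not enforced" is sketched below. The `BashOptions` struct, its `timeout_ms` field, and the `enforce` flag are all hypothetical names chosen for this example; the real parameter names in BitFun may differ.

```rust
use std::time::Duration;

/// Hypothetical tool-call options; field names are illustrative only.
#[derive(Debug, Clone, Default)]
struct BashOptions {
    /// Still accepted so existing callers and prompts keep working.
    timeout_ms: Option<u64>,
}

/// Returns the deadline to enforce at runtime. With enforcement disabled,
/// the configured timeout is parsed and kept but never becomes a deadline.
fn effective_timeout(opts: &BashOptions, enforce: bool) -> Option<Duration> {
    if enforce {
        opts.timeout_ms.map(Duration::from_millis)
    } else {
        None
    }
}
```

Keeping the field in the input type means callers that still send a timeout do not fail validation, while the executor simply sees no deadline.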

SWE-bench Verified analysis showed that 33.8% of failures come from incomplete fixes: agents patch one variant of a symbol while missing its other sites, such as function vs. class, sync vs. async, or version-specific variants (Sphinx autodoc_typehints_description: 10/16 failures). Extend the shared agentic_mode "Doing tasks" guidance: before editing for a bug or behavior change, enumerate the scope of impact via Grep (plus inline python ast when grep is ambiguous), and record candidate sites in TodoWrite as the completion checklist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(prompt): require scope enumeration before edits
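To make the idea of scope enumeration concrete, here is a small sketch that lists every line defining a given symbol as a function, async function, or class in Python-style source. It is a plain textual scan standing in for the Grep step, not the prompt guidance itself and not an AST analysis; the function name `definition_sites` is hypothetical.

```rust
/// Return (1-based line number, kind) for every line that defines `symbol`
/// as a function, async function, or class in Python-style source text.
fn definition_sites(source: &str, symbol: &str) -> Vec<(usize, &'static str)> {
    let variants: [(&'static str, &'static str); 3] = [
        ("async def ", "async function"),
        ("def ", "function"),
        ("class ", "class"),
    ];
    let mut sites = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let trimmed = line.trim_start();
        for &(prefix, kind) in variants.iter() {
            if let Some(rest) = trimmed.strip_prefix(prefix) {
                // The name ends at `(` or `:`, so `foo` does not match `foobar`.
                let name_end = rest
                    .find(|c: char| c == '(' || c == ':')
                    .unwrap_or(rest.len());
                if rest[..name_end].trim() == symbol {
                    sites.push((idx + 1, kind));
                }
                break; // at most one prefix can match a given line
            }
        }
    }
    sites
}
```

Each returned site would correspond to one entry in the completion checklist, so a fix that touches only the sync function still leaves the async and class variants visibly open.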

Continuing the SWE-bench Verified incomplete-fix work (P0a covered scope enumeration before edits; this covers verification after). Add to the shared agentic_mode "Doing tasks" guidance: after a behavior-changing edit, run static checks, repo-shipped tests scoped to the modified module, and any tests the task description quotes. Treat failures as the next signal, not as the end state. This avoids hidden-evaluator leakage by sourcing tests only from the repo and task input.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(prompt): require self-verification before declaring task done

jacksontwu and others added 11 commits May 29, 2026 12:40

feat(cli): improve eval token accounting

0673fe1

fix(cli): persist token usage by routing dequeued events internally

a0fd6b3

Fix exec hang after bash timeout

98c7972

Record LLM TPS metrics in token usage

66ca5cd

fix(cli): resolve main rebase integration

2595526

Merge pull request GCWing#974 from nonoqing/evals-on-release

6d9ee93

feat(prompt): require scope enumeration before edits

Merge pull request GCWing#1007 from nonoqing/evals-on-release

ec16d8e

feat(prompt): require self-verification before declaring task done

wgqqqqq marked this pull request as draft June 2, 2026 04:41

wgqqqqq and others added 4 commits June 2, 2026 14:47

Log CLI tool start inputs

d917c21

Merge pull request GCWing#1034 from wgqqqqq/codex/log-cli-tool-start-…

7d4ed07

…inputs Log CLI tool start inputs

[terminal-bench] improve eval execution defaults

13c6792

feat(eval): tighten execution budget handling

3fcec1c

wgqqqqq force-pushed the evals-on-release branch from eee4660 to 3fcec1c June 2, 2026 07:06

revert(eval): remove deadline passthrough

3721449

kev1n77 force-pushed the evals-on-release branch from b89834a to ddf5bdc June 15, 2026 11:06

wgqqqqq wants to merge 16 commits into GCWing:evals-on-release from wgqqqqq:evals-on-release

5 participants