Improve terminal-bench eval execution defaults#1031

Draft
wgqqqqq wants to merge 16 commits into GCWing:evals-on-release from wgqqqqq:evals-on-release
@wgqqqqq wgqqqqq commented Jun 2, 2026

Collaborator

Summary

  • cherry-pick terminal-bench eval execution default improvements
  • add eval deadline guidance/metadata propagation and tighter Bash budget handling
  • keep eval exec behavior focused on concrete artifacts, verification, and non-interactive reliability

Verification

  • cargo check -p bitfun-cli

jacksontwu and others added 11 commits May 29, 2026 12:40
CLI chat and exec consumed EventQueue entries for UI only, so
TokenUsageUpdated never reached TokenUsageSubscriber and nothing was
written to ~/.config/bitfun/data/token_usage.
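A minimal sketch of the fan-out this commit describes, assuming every queued event should reach all subscribers rather than only the UI. The `Event`, `Subscriber`, `UiSubscriber`, `TokenUsageRecorder`, and `dispatch_all` names are hypothetical and are not taken from the bitfun source.

```rust
// Hypothetical types for illustration; the real bitfun definitions may differ.
#[derive(Debug, Clone)]
enum Event {
    TextChunk(String),
    TokenUsageUpdated { input_tokens: u64, output_tokens: u64 },
}

trait Subscriber {
    fn on_event(&mut self, event: &Event);
}

/// Renders text for the user and ignores everything else.
#[derive(Default)]
struct UiSubscriber {
    rendered: Vec<String>,
}

impl Subscriber for UiSubscriber {
    fn on_event(&mut self, event: &Event) {
        if let Event::TextChunk(text) = event {
            self.rendered.push(text.clone());
        }
    }
}

/// Accumulates token usage in memory; a real subscriber would persist it.
#[derive(Default)]
struct TokenUsageRecorder {
    total_input: u64,
    total_output: u64,
}

impl Subscriber for TokenUsageRecorder {
    fn on_event(&mut self, event: &Event) {
        if let Event::TokenUsageUpdated { input_tokens, output_tokens } = event {
            self.total_input += input_tokens;
            self.total_output += output_tokens;
        }
    }
}

/// Drain the queue once and hand every event to every subscriber,
/// so non-UI consumers are no longer starved.
fn dispatch_all(queue: Vec<Event>, subscribers: &mut [&mut dyn Subscriber]) {
    for event in &queue {
        for subscriber in subscribers.iter_mut() {
            subscriber.on_event(event);
        }
    }
}
```

With this shape, the UI is one subscriber among several, so adding a persistence consumer does not require changing the drain loop.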
Persist file logs in exec mode, emit stable BITFUN_EXIT stderr lines on failure, and create patch output parent directories so automated runners can classify errors and debug full traces.

Co-authored-by: Cursor <cursoragent@cursor.com>
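The two runner-facing behaviors in this commit could look roughly like the sketch below. The `key=value` layout of the stderr line is an assumption (the commit only says the lines are stable and prefixed with `BITFUN_EXIT`), and both function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Format a single-line, grep-friendly failure marker for stderr.
/// The field layout here is assumed, not the documented BITFUN_EXIT format.
fn format_exit_line(category: &str, code: i32) -> String {
    format!("BITFUN_EXIT category={category} code={code}")
}

/// Write patch output, creating any missing parent directories first.
fn write_patch(path: &Path, patch: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        // A bare file name has an empty parent; there is nothing to create.
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }
    fs::write(path, patch)
}
```

A runner can then match on the `BITFUN_EXIT` prefix to classify failures without parsing the full log.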
Disable runtime enforcement of configured tool, Bash, and subagent timeouts while keeping timeout parameters accepted for compatibility.
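One way to keep timeout parameters accepted while no longer enforcing them is to resolve the effective deadline through a policy, as sketched below. `ToolOptions`, `TimeoutPolicy`, and `effective_timeout` are hypothetical names chosen for illustration.

```rust
use std::time::Duration;

/// Hypothetical tool-call options; the field name is illustrative.
#[derive(Debug, Clone, Default)]
struct ToolOptions {
    /// Still parsed so existing callers and configs keep working.
    timeout_ms: Option<u64>,
}

/// Hypothetical switch between enforcing and ignoring configured timeouts.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeoutPolicy {
    Enforce,
    AcceptButIgnore,
}

/// Resolve the deadline the runtime will actually apply.
fn effective_timeout(options: &ToolOptions, policy: TimeoutPolicy) -> Option<Duration> {
    match policy {
        TimeoutPolicy::Enforce => options.timeout_ms.map(Duration::from_millis),
        // The value is accepted for compatibility but never becomes a deadline.
        TimeoutPolicy::AcceptButIgnore => None,
    }
}
```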

Wait for CLI exec turns to settle before emitting patch output so session state is persisted before process exit.
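The ordering this commit describes (let the turn settle, then emit output) can be illustrated with a plain thread join. This is a sketch under the assumption that a turn runs on a separate worker; the actual CLI may use an async runtime instead, and `run_turn_then_snapshot` is a hypothetical name.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Run a turn on a worker thread, wait for it to finish, then read the state.
fn run_turn_then_snapshot(steps: Vec<String>) -> Vec<String> {
    let state = Arc::new(Mutex::new(Vec::new()));
    let worker_state = Arc::clone(&state);
    let worker = thread::spawn(move || {
        for step in steps {
            worker_state.lock().unwrap().push(step);
        }
    });
    // Block until the turn has settled; only then is the state complete.
    worker.join().expect("turn worker panicked");
    let snapshot = state.lock().unwrap().clone();
    snapshot
}
```

Reading the state before `join` returns could capture a partial session, which is the failure mode the commit guards against.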
SWE-bench Verified analysis showed 33.8% of failures come from
incomplete fixes — agents patching one variant of a symbol while
missing function vs class, sync vs async, or version-specific
sites (Sphinx autodoc_typehints_description: 10/16 failures).

Extend the shared agentic_mode "Doing tasks" guidance: before
editing a bug or behavior change, enumerate the scope of impact
via Grep (+ inline python ast when grep is ambiguous), and record
candidate sites in TodoWrite as the completion checklist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(prompt): require scope enumeration before edits
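The scope enumeration this commit asks for amounts to listing every variant of a symbol before editing any of them. The sketch below is a plain-text heuristic over Python-like source, closer to the Grep step than to the `ast` fallback the commit mentions; `Site` and `enumerate_sites` are hypothetical names, not part of the prompt or the tooling.

```rust
/// One candidate site to review before editing.
#[derive(Debug, PartialEq)]
struct Site {
    line: usize,
    kind: &'static str,
}

/// Scan Python-like source text for definitions of `symbol`.
fn enumerate_sites(source: &str, symbol: &str) -> Vec<Site> {
    let prefixes = [
        ("async def ", "async function"),
        ("def ", "function"),
        ("class ", "class"),
    ];
    let mut sites = Vec::new();
    for (index, raw) in source.lines().enumerate() {
        let line = raw.trim_start();
        for (prefix, kind) in prefixes {
            if let Some(rest) = line.strip_prefix(prefix) {
                // Require the name to end at `(` or `:` so `foo` does not match `foobar`.
                if let Some(after) = rest.strip_prefix(symbol) {
                    if after.starts_with('(') || after.starts_with(':') {
                        sites.push(Site { line: index + 1, kind });
                    }
                }
                break;
            }
        }
    }
    sites
}
```

Each returned site would become one checklist entry, so a fix that touches only the sync function leaves the async and class variants visibly unchecked.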
Continuing the SWE-bench Verified incomplete-fix work (P0a covered
scope enumeration before edits; this covers verification after).

Add to shared agentic_mode "Doing tasks": after a behavior-changing
edit, run static checks, repo-shipped tests scoped to the modified
module, and any tests the task description quotes. Treat failures
as the next signal, not as the end state. Avoids hidden-evaluator
leakage by sourcing tests only from the repo and task input.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(prompt): require self-verification before declaring task done
@wgqqqqq wgqqqqq marked this pull request as draft June 2, 2026 04:41
@wgqqqqq wgqqqqq force-pushed the evals-on-release branch from eee4660 to 3fcec1c Compare June 2, 2026 07:06
@kev1n77 kev1n77 force-pushed the evals-on-release branch from b89834a to ddf5bdc Compare June 15, 2026 11:06
5 participants