bench: 4-config SWE-bench harness (baseline/lsp/code_graph/code_graph_mcp)#693

Draft
DvirDukhan wants to merge 61 commits into staging from dvirdukhan/bench-combined

DvirDukhan commented May 28, 2026

Summary

End-to-end benchmark harness for evaluating code-graph against baseline and LSP on SWE-bench Verified. Four configurations:

  • baseline — bash only
  • lsp — bash + multilspy/jedi
  • code_graph — bash + cg HTTP CLI against the FastAPI service
  • code_graph_mcp — bash + cg-mcp JSON-RPC stdio CLI against cgraph-mcp

Includes resume support, per-instance timeouts, tree-sitter fast resolver (T15 + T18), MCP auto-init (T12-T14), a tool-usage rate metric to detect silent fallback to bash, and the official swebench.harness.run_evaluation Docker-backed verifier with retroactive regrade CLI.

Verified results (Sonnet 4.5, n=10, step-75, official SWE-bench Docker harness)

| config | resolved | resolve rate | median tokens | Δ vs baseline | tool-usage |
|---|---|---|---|---|---|
| baseline | 9/10 | 90% | 1,137,823 | | |
| lsp | 10/10 | 100% | 885,624 | −22.2% | 27% |
| code_graph | 9/10 | 90% | 881,397 | −22.5% | 12% |
| code_graph_mcp | 9/10 | 90% | 790,482 | −30.5% | 10% |

All resolves were checked via the official harness (per-instance Docker images, real FAIL_TO_PASS + PASS_TO_PASS selection). sympy-19040 is the only task that any config fails: baseline, code_graph, and code_graph_mcp all miss it, and only lsp solves it.

Token efficiency at a glance

  • code_graph_mcp saves 30.5% median tokens vs baseline while matching baseline accuracy
  • All three tool tracks beat baseline tokens by 22-30%
  • Resolve rates are within 1 task of each other across configs
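
The Δ-vs-baseline figures can be recomputed from the median token counts in the results table. A minimal sketch; `delta_vs_baseline` is a hypothetical helper written for this illustration, not a function from the harness:

```python
# Median tokens per config, copied from the results table above.
MEDIAN_TOKENS = {
    "baseline": 1_137_823,
    "lsp": 885_624,
    "code_graph": 881_397,
    "code_graph_mcp": 790_482,
}


def delta_vs_baseline(config: str) -> float:
    """Return the percent change in median tokens relative to baseline."""
    base = MEDIAN_TOKENS["baseline"]
    return (MEDIAN_TOKENS[config] - base) / base * 100.0


for name in ("lsp", "code_graph", "code_graph_mcp"):
    # Prints -22.2%, -22.5% and -30.5% respectively.
    print(f"{name}: {delta_vs_baseline(name):+.1f}%")
```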

Engineering hardening shipped in this branch

  • 38d2411 silence cgraph-mcp stderr (was bloating agent context 9×)
  • bbb5d95 bump default cgraph-mcp timeout 60s → 300s for sympy/django
  • aa850d6 tool-availability precheck + tool-usage rate metric (caught the silent-fallback regression that almost shipped)
  • 4a6956e defensive stdin redirect on cg/cg-mcp/lsp shims + anti-fallback preamble rules
  • 4daad7e rewrite verifier to use the official swebench Docker harness (the previous one ran modern pytest 8 against legacy worktrees and graded every trajectory as failed)
  • bfdf60d gitignore harness output

Out of scope for this PR

  • Headline n=40 with Opus 4.5 — verifier is unblocked, just needs compute budget
  • pyright LSP adapter (currently jedi via multilspy) for the production-realistic LSP track

Draft for review of the harness mechanics + early numbers. Not for merge until headline n=40 lands.

gkorland and others added 30 commits March 14, 2026 19:55
… entry point

Add the bare MCP server module (api/mcp/) using the official FastMCP SDK,
wire the cgraph-mcp console script in pyproject.toml, and include a protocol
smoke test that spawns the server over stdio and verifies list_tools returns
an empty tool set. Also copies the MCP design docs into docs/.

Closes #648

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix stale entry point references in design doc: api.mcp.server:app → :main
- Remove contradicting decisions about tree-sitter/incremental indexing scope
- Add language tags to fenced code blocks (MD040)
- Add anyio.fail_after timeout to stdio smoke test to prevent CI hangs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- server: pass transport="stdio" explicitly to guard against future
  FastMCP default changes
- test: drop STDIO_TIMEOUT to 10s (a stuck handshake should fail fast)
- test: pin anyio backend to asyncio via fixture so transitive trio
  installs cannot silently double-run the test
- pyproject: add anyio to test extras since the smoke test imports it
  directly (was previously available only via mcp's transitives)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, uv sync detects pyproject/lockfile drift on CI and
silently re-resolves the entire dep tree to newer versions
(uvicorn 0.41.0 → 0.46.0 was observed), which broke the e2e
playwright suite. Lock now matches pyproject so installs are
reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The falkordb/falkordb:latest base image is now Debian Trixie-based
and arrives with apt in a state where the t64 ABI deps that git and
build-essential require (libcurl3t64-gnutls, libtinfo6, libc6-dev,
etc.) are held back. apt itself recommends `apt --fix-broken install`.

Running `apt-get install -y -f` between update and the real install
clears the broken state so the install can proceed. Verified locally
against the exact base image digest CI uses
(sha256:aaf67c724bba36b9fb8d43a2671fd57e89c536b971d72b692a63a168c8053ff4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GraphRAG-SDK released v1.0 (April 16) and force-pushed history during
the release, dropping the pre-v1.0 API surface that the e2e tests were
built against. Cloning HEAD now produces a graph without the
merge_with/combine/import_data/add_node/add_edge/ask Function nodes
the tests interact with.

Switch to analyzing the installed graphrag-sdk package (pinned to 0.8.2
via uv.lock — immutable on PyPI). flask clone stays for autocomplete
variety on set/lo/as substrings. ensure_calls_edges keeps acting as a
safety net for the two required CALLS edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to address the remaining 7 of 31 e2e failures:

1. Copy installed graphrag-sdk to a tempdir before analyzing.
   When the source path lives under .venv/lib/.../site-packages/, LSP
   treats it as an installed library and stops resolving call sites
   between functions (analyzer produced 0 CALLS edges vs 392 on the
   April 12 baseline). Copying to /tmp lets LSP treat it as a project
   and restores organic call-graph extraction.

2. Synthesize missing Function nodes in ensure_calls_edges.
   import_data has no `def` in any graphrag-sdk version (was a phantom
   from LSP resolution into a transitive dep). MERGE both source and
   dest Function nodes with minimal properties so the e2e path tests
   can find them. Adds the Searchable label so autocomplete works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the last 3 of the original 31 e2e failures.

1. Pass url= to Project() so save_repo_info populates Redis. The
   /api/repo_info endpoint returns 400 if repo_info is None, which
   broke canvas:167 with TypeError on response.info.node_count.

2. Synthesize test_<module> Function nodes for the search-bar tests.
   testData.ts parametrizes over searchInput "test", but graphrag-sdk
   0.8.2 has zero functions whose names contain "test", so the
   auto-scroll dropdown isn't scrollable and the auto-complete count
   is 0. 12 synthesized names give the dropdown enough to scroll.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scaffold for the code-graph vs LSP vs baseline benchmark. No runners
yet — just the directory layout, locked-in tool bundles per config,
default run config, and the glossary in CONTEXT.md.

Both originally-planned pre-reqs (graphrag-sdk 0.8 -> 1.1.1 upgrade,
MCP-T15 tree-sitter base class refactor) are deferred as non-blockers
for this workstream; rationale in the session plan.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updates from the round-2 grill:
- Outcome accuracy only; drop intrinsic suite (Q1)
- code-graph tools = primitives only; no GraphRAG chat (Q2)
- Tools in-container; single-file re-index on edit via note_edit (Q3)
- Token cost and indexing cost reported separately, never combined (Q4)
- LSP responses shimmed (cap 50, trim hover); spec in shim.yaml (Q5)
- Pass@1 + retry failures 2x (Q6)
- Symmetric one-paragraph preambles per config (Q7)
- Drop RepoBench (Q8)
- Drop opencode qualitative track (Q9)
- Three-stage rollout: smoke / calibration / headline (Q10)
- 50-task random sample from SWE-bench Verified, seed committed (Q11)

graphrag-sdk upgrade kept in scope per explicit user override.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The v1 SDK is a ground-up rewrite around document ingestion: the v0
KnowledgeGraph class (which we wrapped around an already-populated
FalkorDB graph for /api/chat text-to-Cypher) is gone, and the new
GraphRAG facade expects to own the graph via its ingestion pipeline
with embeddings. There is no public primitive for 'wrap an existing
graph and chat over it'.

code-graph builds graphs through dedicated language analyzers, not
ingestion, so we now keep the text-to-Cypher pipeline in-house in
api/llm.py: generate Cypher from question + ontology, execute via
the existing FalkorDB async client, synthesize an answer. We still
use graphrag-sdk's LiteLLM provider as a thin LiteLLM wrapper to
keep retry logic.

Ontology is now a plain string in the prompt instead of the old
Ontology/Entity/Relation object tree (which is also gone in v1).

The /api/chat endpoint surface (ask(repo_name, question) -> str) is
unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/metrics/ parses SWE-agent trajectory JSON into per-task
TaskMetrics rows: input/output tokens, tool-call counts (with
per-tool breakdown), patch, outcome. Defensive about
trajectory-shape drift between SWE-agent versions (history vs
trajectory vs steps; openai-style tool_calls vs SWE-agent action.command).

bench/report/ aggregates those rows into a per-config table with
median + p90 tokens and Δ-vs-baseline. The summary picks the best
run per task (resolved > failed) so retries don't double-count.

10 unit tests cover token extraction, both tool-call shapes, the
retry-merge rule, and the markdown delta column.
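
The retry-merge rule described above (resolved beats failed, one row per task) could look roughly like the sketch below. The row shape and the tie-break (keep the first row when outcomes are equal) are assumptions made for illustration, not the actual bench/report code:

```python
from statistics import median


# Hypothetical row shape: (task_id, outcome, total_tokens).
def best_run_per_task(rows):
    """Keep one row per task, preferring a resolved run over a failed one."""
    best = {}
    for task_id, outcome, tokens in rows:
        current = best.get(task_id)
        # Replace only when there is no row yet, or a resolved run
        # supersedes a non-resolved one (assumed tie-break: first row wins).
        if current is None or (outcome == "resolved" and current[1] != "resolved"):
            best[task_id] = (task_id, outcome, tokens)
    return list(best.values())


rows = [
    ("task-a", "failed", 900),
    ("task-a", "resolved", 1200),  # retry succeeded, so this row wins
    ("task-b", "failed", 700),
    ("task-b", "failed", 650),     # both failed, so the first row is kept
]
merged = best_run_per_task(rows)
median_tokens = median(tokens for _, _, tokens in merged)
```

Merging before aggregating is what keeps a retried task from contributing two token counts to the median.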

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/agents/code_graph_adapter.py exposes the seven tools the
code-graph SWE-agent config gets:

- graph_entities, get_neighbors, find_paths, auto_complete: thin
  wrappers over the existing FastAPI surface.
- find_symbol: exact-name lookup, built client-side on top of
  auto_complete so we don't grow the server surface.
- note_edit: incremental re-index hook the agent must call after
  every write_file/edit. Currently routes through analyze_folder
  on the dirname; degrades gracefully if the call fails.

Crucially, GraphRAG chat is NOT exposed (Q2 grill decision:
nested-agent double-counting).

Both class-style (CodeGraphClient context manager) and
function-style (graph_entities(...) etc.) are provided — the
function form is what SWE-agent's tool registry needs.

9 unit tests using httpx.MockTransport cover all seven methods,
the bearer-token auth header, 4xx propagation, and note_edit's
non-fatal failure path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/runners/index_cache.py tracks which <repo>@<commit> pairs
code-graph has already analyzed, so re-running the benchmark
doesn't pay the indexing cost twice. Backed by a single JSON file
under bench/cache/. Atomic via tmp-file replace.

This module doesn't run analysis itself — that's done via
code-graph's existing /api/analyze_folder endpoint. This is just
the bookkeeping the runner consults before deciding to re-index.

6 unit tests cover record/lookup, cross-instance persistence,
forget, and overwrite semantics.
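
A minimal sketch of the bookkeeping pattern described above: one JSON file, keyed by `<repo>@<commit>`, written through a temp file and swapped in with `os.replace`. The `IndexCache` class and its method names are hypothetical and only mirror the record/lookup/forget vocabulary of this commit message:

```python
import json
import os
import tempfile
from pathlib import Path


class IndexCache:
    """Hypothetical stand-in for the index bookkeeping described above."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def _save(self, data):
        # Write to a temp file in the same directory, then swap it into
        # place so readers never observe a half-written file.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.replace(tmp, self.path)

    def record(self, repo, commit, graph_name):
        data = self._load()
        data[f"{repo}@{commit}"] = graph_name  # overwrites any earlier entry
        self._save(data)

    def lookup(self, repo, commit):
        return self._load().get(f"{repo}@{commit}")

    def forget(self, repo, commit):
        data = self._load()
        data.pop(f"{repo}@{commit}", None)
        self._save(data)
```

Keeping the temp file in the same directory as the target matters: `os.replace` is only atomic when both paths live on the same filesystem.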

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/agents/lsp_adapter.py wraps multilspy's SyncLanguageServer
behind the same response shim spec'd in bench/tools/lsp/shim.yaml:
cap results at 50, trim hover to 1 signature line + 1 docstring
sentence, locations as {path, line, col}. Tools exposed:

  goto_definition, find_references, hover, document_symbols
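
The shim rules above (cap at 50, hover trimmed to one signature line plus one docstring sentence, locations as {path, line, col}) can be sketched as follows. The function names, the already-flattened input dicts, and the naive sentence split are assumptions for illustration; the real spec lives in shim.yaml:

```python
MAX_RESULTS = 50  # cap taken from the shim description above


def shim_locations(locations):
    """Cap the result list and normalise each entry to {path, line, col}."""
    return [
        {"path": loc["path"], "line": loc["line"], "col": loc["col"]}
        for loc in locations[:MAX_RESULTS]
    ]


def shim_hover(text):
    """Keep the first line as the signature plus one docstring sentence."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return ""
    signature = lines[0]
    doc = " ".join(lines[1:])
    if not doc:
        return signature
    # Naive sentence split: cut after the first period that is followed by
    # a space; if there is none, keep the whole docstring text.
    end = doc.find(". ")
    first_sentence = doc if end == -1 else doc[: end + 1]
    return f"{signature}\n{first_sentence}"
```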

Notes on the LSP backend choice:

- The plan originally specified pyright; multilspy >= 0.0.15 is
  required for that, but the pinned multilspy fork
  (AviAvni/multilspy@python-init-params, used by api/analyzers)
  is older. Using jedi-language-server matches the rest of the
  repo and avoids a divergent dep tree. Shim normalizes responses
  so jedi-vs-pyright doesn't affect the validity of the comparison.
- workspace_symbols is dropped: the multilspy fork doesn't
  implement request_workspace_symbol. Agent falls back to
  bash+grep, which is the realistic LSP-world fallback too.
- MultilspyConfig must be built via from_dict for this fork
  (constructor doesn't set all fields JediServer expects).

Register pytest 'slow' marker in pyproject.toml; the 3 jedi
roundtrip tests are slow but currently complete in <4s on a warm
cache. Run them with -m slow or default; skip with -m 'not slow'.

CONTEXT.md and bench/tools/lsp/tools.yaml updated to match.

10 tests pass: 7 shim units + 3 real jedi roundtrips
(goto_definition, hover, document_symbols).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pivots the harness from SWE-agent to mini-swe-agent — upstream now
recommends mini-, and its bash-only tool surface is a simpler
integration: each config is a PATH prefix plus a system_preamble.md,
not a per-config tools.yaml.

What this adds:

- bench/runners/mini_runner.py — wraps DefaultAgent + LocalEnvironment,
  per-config env wiring (PATH for lsp/code_graph, baseline untouched),
  trajectory + diff capture, JSONL append via bench.metrics.
  Includes a stub LLM model that exercises the entire loop without
  any network calls so the harness is testable today.
- bench/cli/cg.py, bench/cli/lsp.py — bash-callable CLIs wrapping the
  existing CodeGraphClient and LSP adapter. These are what the agent
  invokes via bash.
- bench/tools/{baseline,lsp,code_graph}/system_preamble.md — symmetric
  one-page preambles per the locked-in grill decision.
- bench/metrics — extended to also parse mini-swe-agent trajectory
  shape (messages[*].extra.response.usage and extra.actions[*].command).
  Buckets bash commands by first token; the COMPLETE_TASK submit
  protocol is bucketed as 'submit'.
- tests/test_bench_runner.py — 10 tests, all run offline (no LLM):
  smoke, env wiring, persistence, CLI argparse smoke.
- CONTEXT.md + plan.md — reflect mini-swe-agent + jedi pivots.
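
Bucketing bash commands by first token, as described in the bench/metrics bullet above, could be sketched like this. The `SUBMIT_MARKER` substring and the helper names are assumptions for illustration, not the parser's actual code:

```python
import shlex
from collections import Counter

SUBMIT_MARKER = "COMPLETE_TASK"  # assumed substring of the submit command


def bucket(command):
    """Return the bucket name for one bash command string."""
    if SUBMIT_MARKER in command:
        return "submit"
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: fall back to whitespace splitting.
        tokens = command.split()
    return tokens[0] if tokens else "(empty)"


def count_tools(commands):
    """Count commands per bucket across one trajectory."""
    return Counter(bucket(c) for c in commands)
```

A first-token rule is deliberately coarse: a compound command such as `cd repo && cg ...` lands in the `cd` bucket under this sketch.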

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds --real-run as a mutually exclusive sibling of --dry-run. Real-run
prepares a fresh repo per config (no cross-contamination), runs the
agent against a synthetic buggy math_utils.py + pytest, then runs
pytest to set metrics.outcome to resolved/failed.

JSONL append in run_batch can now be deferred via defer_jsonl=True so
the smoke loop can write the row once outcome is known.

Validated end-to-end against GitHub Models (gpt-4o-mini) using
GITHUB_API_KEY=$(gh auth token).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Loads princeton-nlp/SWE-bench_Verified via 'datasets', samples
deterministically by seed (20260526) into smoke/calibration/headline
stages (3/10/37), and prepares per-instance worktrees by cloning the
upstream repo, checking out base_commit, and applying test_patch so
FAIL_TO_PASS tests are present.
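
Seeded sampling into the three stages might look like the sketch below. The seed and stage sizes come from this commit message; the function name, the sort-before-sample step, and the choice to make the stages disjoint slices of one sample are assumptions:

```python
import random

SEED = 20260526
STAGE_SIZES = {"smoke": 3, "calibration": 10, "headline": 37}


def sample_stages(instance_ids, seed=SEED, sizes=STAGE_SIZES):
    """Split one seeded random sample into disjoint, ordered stages."""
    pool = sorted(instance_ids)  # sort so the input order cannot matter
    total = min(sum(sizes.values()), len(pool))  # clamp to the pool size
    picked = random.Random(seed).sample(pool, total)
    stages, start = {}, 0
    for name, size in sizes.items():
        stages[name] = picked[start:start + size]
        start += size
    return stages
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps the sample independent of any other code that touches the global RNG.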

Adds 'datasets' to the bench optional dep group. Adds 'swe_bench'
mode to mini_runner alongside dry_run / real_run (mutually
exclusive). Verification uses pytest with the FAIL_TO_PASS +
PASS_TO_PASS test ids from the dataset row -- best effort because
the official harness needs per-repo conda envs, which we don't
build yet.

6 new unit tests cover the non-network parts of the loader (field
parsing, sampling determinism, n override, pool clamping, path
hygiene, task mapping). Worktree prep was validated end-to-end
against pytest-dev/pytest-6202.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/report/__main__.py: `uv run python -m bench.report` renders
results.jsonl as a per-config summary table with token-delta vs
baseline. Validated against the existing real-run smoke results.

bench/runners/swebench_verify.py: exports per-config predictions
JSONL files in the SWE-bench harness format, optionally invokes
`python -m swebench.harness.run_evaluation` (Docker-based), then
parses the resulting report.json and patches outcomes back into
results.jsonl. 4 new unit tests cover the non-Docker parts.

Adds `swebench>=4.0` to the bench optional dep group.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mini_runner.main() now calls dotenv.load_dotenv(.env) at the repo
root if present, so users don't have to export ANTHROPIC_API_KEY /
ANTHROPIC_API_BASE / GITHUB_API_KEY by hand each shell session.

.env.template gains a documented block for the four supported
provider configs we've actually tested or have credentials for:
direct Anthropic, Azure AI Foundry's Anthropic-passthrough endpoint
(/anthropic/v1/messages, x-api-key), GitHub Models, and Azure
OpenAI. Most relevant for our setup: Azure AI Foundry → litellm's
anthropic/ provider with a custom ANTHROPIC_API_BASE.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke run showed the agent invoked cg exactly once and lsp zero times
across all three SWE-bench instances — because the bash shims didn't
exist (the agent's `which cg` returned 'cg not found'). The differential
between configs was therefore noise.

Fixes:

- Add executable bash shims bench/cli/{cg,lsp} that exec
  "$BENCH_PYTHON" -m bench.cli.{cg,lsp}. Runner exports BENCH_PYTHON =
  sys.executable so the venv (with httpx/multilspy) is used.
- Export REPO_NAME for the code_graph config (worktree dirname). The
  preamble references it; nothing was setting it.
- _ensure_indexed(): POST /api/analyze_folder for each code_graph
  worktree before running the task, so cg find-symbol returns real
  results. Skips re-indexing via /api/list_repos precheck.
- Rewrite system preambles to instruct "use cg/lsp BEFORE grep" with
  an explicit typical-loop, not just a list of subcommands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos,
REPO_NAME set, and explicit "use cg/lsp first" framing in the system
preamble, Claude Opus 4.5 ignored the differentiating tools and fell
straight back to grep/sed/cat. The 3-way comparison was real but
uninformative: tool choice was identical across configs.

This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and
INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block
directly in the task description — the first thing the model sees
each turn. Selection via load_instance_template(config); baseline keeps
the original template.

Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track
invokes 'cg' 5x (including cg auto-complete returning the exact buggy
function with line numbers + docstring). The structured-navigation
tools are finally exercised, so token deltas measured against
baseline are now meaningful signal rather than noise.

n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens
than baseline on this instance. Bigger preambles + verbose JSON tool
replies + occasional retries (cg find-symbol exact-match bug) outweigh
any savings. Headline run should scale n or pivot to a function-calling
harness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke #3 revealed cg find-symbol --name <exact> returned [] for symbols
the graph clearly contained (cg auto-complete --prefix found the same
symbol with full file:line+docstring). Root cause: the filter compared
item['name'] to the requested name, but the /api/auto_complete payload
nests the symbol name under item['properties']['name'] (FalkorDB node
properties), so the top-level lookup always returned None and nothing
matched.

Fix: prefer item['properties']['name'], fall back to item['name'] for
flatter shapes the unit tests pass in. Added a regression test that
uses the real payload structure.
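
The lookup order described in the fix can be sketched as below. The helper names are hypothetical, and the payload dicts only imitate the shape quoted in this commit message:

```python
def symbol_name(item):
    """Prefer the nested FalkorDB property, fall back to a flat 'name'."""
    props = item.get("properties") or {}
    return props.get("name", item.get("name"))


def find_symbol(items, name):
    """Exact-name filter over auto-complete results."""
    return [item for item in items if symbol_name(item) == name]


nested = {"id": 2714, "labels": ["Function"], "properties": {"name": "getmodpath"}}
flat = {"name": "getmodpath"}  # flatter shape, as used by older unit tests
other = {"properties": {"name": "getmodpath_helper"}}
matches = find_symbol([nested, flat, other], "getmodpath")
```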

Verified end-to-end against the live FastAPI service:

  cg find-symbol --repo pytest-dev__pytest-6202__code_graph \
                 --name getmodpath
  # -> [{id:2714, labels:[Function], properties:{name,path,doc,...}}]

This was the bug that made the smoke #3 code_graph agent burn 3 of 5
cg calls retrying exact-name lookups before falling back to
auto-complete. With this fix, an agent doing the natural workflow
(find-symbol -> get-neighbors -> note-edit) should land far fewer
wasted calls.

Also: norecursedirs in [tool.pytest.ini_options] to keep pytest from
walking into per-instance bench worktrees that ship their own pytest
sources (was breaking host pytest's AST rewriter on import).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor FalkorDB graph naming so each (project, branch) pair gets
its own graph: 'code:{project}:{branch}'. This lets concurrent agents
working on different branches of the same repo index in parallel
without overwriting each other.

Changes:
- api/graph.py: add DEFAULT_BRANCH, compose_graph_name(),
  parse_graph_name(); Graph and AsyncGraphQuery constructors now
  accept (name, branch=None); Graph.from_raw_name() classmethod for
  internal callers that need to bypass composition (e.g. clone());
  get_repos()/async_get_repos() now return {project, branch, graph}
  dicts.
- api/info.py: branch-aware Redis hash keys
  ('{repo}:{branch}_info'); reads fall back to legacy '{repo}_info'
  for un-migrated graphs.
- api/git_utils: GitRepoName() and switch_commit() thread branch
  through; LegacyGitRepoName() retained for the migration helper.
- api/project.py: detect_branch() via 'git rev-parse --abbrev-ref
  HEAD'; Project.__init__ / from_git_repository /
  from_local_repository accept branch.
- api/index.py: all Pydantic request models gain
  'branch: Optional[str]'; endpoints thread it into
  AsyncGraphQuery + info functions; responses include 'branch'.
- api/cli.py: --branch flag on index / index-repo / search /
  neighbors / paths / info; new 'cgraph migrate' command.
- api/migrations/per_branch.py (NEW): idempotent migration that
  renames legacy '<project>' graphs to 'code:<project>:_default',
  '{<project>}_info' Redis keys to '{<project>}:_default_info',
  and '{<project>}_git' graphs to '{<project>}:_default_git'.
  Supports --dry-run.
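
To make the naming scheme concrete, here is a rough sketch of composing and parsing 'code:{project}:{branch}' names. The function names match the ones listed above, but the bodies are guesses written for illustration and are not the code in api/graph.py:

```python
DEFAULT_BRANCH = "_default"  # branch segment used by the migration above
PREFIX = "code"


def compose_graph_name(project, branch=None):
    """Build 'code:{project}:{branch}', defaulting the branch segment."""
    return f"{PREFIX}:{project}:{branch or DEFAULT_BRANCH}"


def parse_graph_name(graph_name):
    """Return (project, branch); treat an unprefixed name as a legacy graph."""
    prefix, sep, rest = graph_name.partition(":")
    if prefix != PREFIX or not sep:
        return graph_name, DEFAULT_BRANCH
    # Split on the last colon so a project name may itself contain colons.
    project, sep, branch = rest.rpartition(":")
    if not sep:
        return rest, DEFAULT_BRANCH
    return project, branch
```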

Tests:
- tests/test_per_branch_graphs.py (NEW): 24 unit tests covering
  compose/parse helpers, Graph constructor branch awareness,
  AsyncGraphQuery, info-key shape, GitRepoName shape, and migration
  idempotency (with mocked FalkorDB).
- tests/test_async_graph.py, tests/test_cli.py,
  tests/endpoints/test_list_repos.py: updated assertions for the
  new dict return shape from get_repos / async_get_repos.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New `.github/workflows/mcp-tests.yml` runs `pytest tests/mcp/` against
a real FalkorDB service container on port 6379. Triggers only on PRs
that touch MCP-relevant paths so the unrelated parts of the repo
don't pay the cost.

- FalkorDB service with redis-cli ping healthcheck.
- uv cache keyed on uv.lock for fast incremental runs.
- Sets `FALKORDB_HOST` / `FALKORDB_PORT` env so api/graph.py picks
  up the service host.
- Path filter covers api/mcp/, tests/mcp/, api/llm.py, api/graph.py,
  pyproject.toml, uv.lock, and the workflow file itself.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New `tests/mcp/fixtures/`:
- `sample_project/python/` — canonical call graph
  `entrypoint -> service -> {UserRepo,OrderRepo}.repo -> db`
  plus a small class hierarchy (BaseRepo <- UserRepo, OrderRepo)
  and inter-file imports so IMPORTS edges exist.
- `expected.yaml` — single source of truth for every per-tool
  ticket's integration assertions: minimum per-label counts, named
  callers / callees, known paths, prefix-search hits.

New `tests/mcp/conftest.py`:
- `expected_contract` (pure-Python, always available) loads the
  YAML once per session.
- `indexed_fixture` (session-scoped) indexes the fixture into a
  unique `code:sample_project:test-<uuid>` graph so parallel CI
  shards don't contend. Self-skips when FalkorDB is unreachable.
  Uses `SourceAnalyzer.analyze_local_folder` directly so the
  fixture doesn't need to be a git repository.

New `tests/mcp/test_fixture_contract.py` — regression-tests the
fixture itself: contract shape, on-disk files, and that the
integration fixture indexes cleanly and meets the minimum count
contract.

Multilingual coverage (Java + C#) was dropped from the spec: both
multilspy analyzers demand a Maven / .NET project layout at the
indexed root, which would force this fixture into an awkward shape.
Deferred to a follow-up ticket (likely T16 which adds languages).

All 4 contract tests pass against FalkorDB on 6390.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
First real MCP tool. Wraps the existing Project / SourceAnalyzer
pipeline so AI agents can call `index_repo(path_or_url, branch)` over
stdio to populate code-graph for a repo.

- `api/mcp/tools/structural.py` (NEW) — registers `index_repo` on
  the shared FastMCP app. Accepts local paths or git URLs;
  auto-detects branch from local git checkouts via T17's
  `detect_branch`; honors `ALLOWED_ANALYSIS_DIR` for sandboxing.
  Non-git folders are handled by driving SourceAnalyzer directly
  (Project requires a git repo).
- `api/mcp/tools/__init__.py` (NEW) — package marker; importing it
  registers every tool module's `@app.tool()` decorators.
- `api/mcp/server.py` — imports tools at module load so both direct
  `from api.mcp.server import app` and `cgraph-mcp` stdio entry
  point see the same tool list.
- `tests/mcp/test_index_repo.py` (NEW) — 5 tests: local-path happy
  path, missing-path error, ALLOWED_ANALYSIS_DIR sandboxing,
  in-process app registration, JSON serialisability.
- `tests/mcp/test_scaffold.py` — replaced the "zero tools"
  assertion with a presence check for `index_repo` so it stays
  stable as T5-T8 / T11 add more tools.

Return shape:
  {project_name, branch, graph_name, num_nodes, num_edges,
   languages_detected, mode}

`incremental` parameter is accepted now and forwarded once T18
lands; the current full-reindex path ignores it and always returns
`mode="full"`.

All 8 tests pass against FalkorDB on 6390.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dvirdukhan and others added 14 commits May 27, 2026 17:24
The Python analyzer hardcoded `environment_path={path}/venv` when starting
jedi-language-server via multilspy. When the repo had no venv (the common
case for cloned codebases like sphinx, sympy, anything from SWE-bench),
jedi raised `InvalidPythonEnvironment` on every `request_definition()`
call. analyzer.resolve() then swallowed the exception silently and the
indexer produced a graph with DEFINES edges only — zero CALLS, zero
EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy
(41K functions) had no resolved cross-references at all.

Fix:
- source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall
  back to the host interpreter's environment (sys.executable's prefix)
  so jedi always has a valid Python to introspect.
- analyzer.py: log resolve() failures at WARN with file/line context
  instead of swallowing them silently, so the next regression is loud.
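
The fallback order in the fix can be sketched as a small helper. `pick_environment_path` is a hypothetical name; the actual change lives in source_analyzer.py and may differ in detail:

```python
import sys
from pathlib import Path


def pick_environment_path(repo_root):
    """Return the first usable environment directory for jedi."""
    root = Path(repo_root)
    for candidate in (root / "venv", root / ".venv"):
        if candidate.is_dir():
            return str(candidate)
    # No per-repo environment: fall back to the prefix of the interpreter
    # running this code, which always exists.
    return sys.prefix
```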

Verified: re-indexed sphinx-doc/sphinx-9230 with the fix:
  DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only).

Fixes #685.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two production-quality fixes from the calibration run that crashed at
14/30 trajectories:

1. Resume support: skip (instance, cfg) pairs whose trajectory file
   already exists. Lets us recover from crashes/kills without re-running
   completed work (avoids ~$3 of wasted compute on this run).
2. Ignore pathological files at index time: sympy/integrals/rubi/rules
   contains auto-generated 3000-line files with hundreds of unresolvable
   symbols per line. jedi spends hours and never makes progress. Adding
   it to the default ignore list unblocks sympy-19040 (and other sympy
   instances) without affecting graph quality.

Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs.
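
Resume support as described in item 1 amounts to filtering the work list by what is already on disk. A minimal sketch, assuming a hypothetical `<out_dir>/<config>/<instance>.traj.json` layout that may not match the runner's real paths:

```python
from pathlib import Path


def pending_pairs(instances, configs, out_dir):
    """Yield (instance, config) pairs that have no trajectory file yet."""
    out = Path(out_dir)
    for instance in instances:
        for cfg in configs:
            # Hypothetical layout: <out_dir>/<config>/<instance>.traj.json
            if not (out / cfg / f"{instance}.traj.json").exists():
                yield instance, cfg
```

Checking for the file alone treats a truncated trajectory from a killed run as complete, so a stricter variant might also validate that the file parses.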

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In source_analyzer.second_pass, the list of files we iterate can include
paths that first_pass did not add to self.files (e.g. parse errors,
LSP-induced timeouts, or rare edge cases where a candidate file is
present in the input list but never makes it into the files map).
Previously this raised KeyError and aborted the entire index. Hit on
sympy/polys/distributedmodules.py during bench calibration of sympy-12481.

Skip with a WARN log instead so a single bad file no longer takes down
the whole index.

Also bump the mini_runner httpx timeout from 1800s to 7200s: the sympy-12481
index was observed taking >30 min in the field, so the runner gave up early
even though the API server went on to finish indexing successfully.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace jedi-based resolution with a pure tree-sitter static resolver
behind CODE_GRAPH_PY_RESOLVER=tree_sitter. Default remains jedi for
backwards compatibility.

Benchmark on pytest-dev/pytest-6202 (204 files):
  - jedi:        247.1s wall, CALLS=1976, EXTENDS=71
  - tree-sitter:   6.9s wall, CALLS=4833, EXTENDS=83
  ~36x speedup, broader call recall (jedi returns None ~80% of the time).

Mechanism:
  - TreeSitterPythonResolver builds a project-wide symbol table
    (top-level funcs/classes/assigns, class methods, import maps)
    keyed by id(files) for lazy construction.
  - Resolution: head lookup (local module -> import map ->
    cross-project bare-name fallback) + tail walk through attributes
    and class methods.
  - Handles relative imports, aliased imports, import-of-package,
    Optional[T]/generic_type subscript unwrapping.
  - AbstractAnalyzer.needs_lsp() hook + PythonAnalyzer override let
    source_analyzer skip LSP startup and venv setup entirely when
    the static resolver is active. This is where the wall-time win
    actually lives (jedi warm-up was ~240s of the 247s baseline).

Closes #689.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AbstractAnalyzer._captures was recompiling its query string on every
call. cProfile on pytest-dev/pytest-6202 (204 files) showed
tree_sitter.Language.query consuming 3.03s of the 6.36s first_pass —
~48% of analyzer time spent rebuilding queries that never change.

Cache them on the analyzer instance, keyed by pattern string. Also
switches from the deprecated language.query() to the Query(language,
pattern) constructor.
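
The caching idea (compile each pattern once per analyzer instance, keyed by the pattern string) is sketched below with `re.compile` standing in for the tree-sitter query compile, so the example runs without tree-sitter installed. `CachingAnalyzer` and `counting_compile` are hypothetical names:

```python
import re


class CachingAnalyzer:
    """Hypothetical analyzer that compiles each query pattern once."""

    def __init__(self, compile_query):
        self._compile_query = compile_query  # e.g. a tree-sitter query factory
        self._query_cache = {}

    def _query(self, pattern):
        try:
            return self._query_cache[pattern]
        except KeyError:
            compiled = self._query_cache[pattern] = self._compile_query(pattern)
            return compiled


compile_calls = []


def counting_compile(pattern):
    compile_calls.append(pattern)
    return re.compile(pattern)  # stand-in for an expensive query compile


analyzer = CachingAnalyzer(counting_compile)
first = analyzer._query(r"def\s+\w+")
second = analyzer._query(r"def\s+\w+")  # served from the cache
```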

Wall-time on pytest-6202 (CODE_GRAPH_PY_RESOLVER=tree_sitter):
  before: 6.9s
  after:  3.7s

Benefits every tree-sitter analyzer (Python, JavaScript, Kotlin), not
just the new static resolver.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After T18 (#691) + query-cache (#692), code_graph indexing on
pytest-6202 drops from 247s to 3.7s — but only if the API server is
launched with CODE_GRAPH_PY_RESOLVER=tree_sitter. This helper bakes
in that env plus the public/permissive flags the bench harness
expects, so calibration runs hit the fast path without manual setup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve conflicts:
- source_analyzer: keep needs_lsp() gate from query-cache, keep venv
  fallback + first_pass-skipped-file defense from bench-mcp-track
- analyzer.resolve: keep verbose error logging from bench-mcp-track
- llm.py / uv.lock: take bench-mcp-track (graphrag 1.x rewrite)
After merging the bench harness (graphrag-sdk 1.1.1) with the MCP suite
(written against 0.8 KnowledgeGraph), the server failed at import.
Move the SDK import inside get_or_create_kg so only the 'ask' tool
trips the incompatibility — structural tools used by the bench
harness (index_repo, search_code, get_callers, ...) work either way.
… context

Each cg-mcp bash invocation spawns a fresh cgraph-mcp server, whose
DEBUG logs (analyzer init + MCP server.py registration + per-request
dispatch) were being merged into the agent's tool-output buffer at
~1.8 kB per call. Across a 50-call trajectory that's ~90 kB of
useless log noise replayed each turn, blowing token counts up to
~9x what the HTTP code_graph track produces.

Route the spawned server's stderr to /dev/null via stdio_client's
errlog kwarg. Verified end-to-end: pytest-6202 code_graph_mcp
trajectory dropped from $6+ to $2.48.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4 of 10 calibration instances (sympy/django) hit TimeoutError during
indexing at the 60s default. The sympy graphs alone have 24k+ nodes
and 145k+ edges, which legitimately exceeds 60s. 300s matches the
HTTP code_graph adapter's behaviour for large repos and removes the
indexing-timeout failure mode without slowing happy-path calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two safeguards against the 'silent fallback to bash' failure mode that
made our Sonnet calibration headline numbers untrustworthy, plus a
backfill of the existing results:

1. verify_tool_available(): before launching the agent in any tool
   config (lsp / code_graph / code_graph_mcp), exec the tool's --help
   in the same env the agent will see. If it fails (missing PATH,
   Python startup crash, etc.) the run aborts with outcome=
   'tool_unavailable' instead of silently producing a bash-only
   trajectory that we'd later attribute to the tool.

2. compute_tool_usage(): for every trajectory, count how many bash
   commands actually invoked the configured tool (cg / cg-mcp / lsp).
   Surfaced as tool_usage_rate on TaskMetrics and as a new column in
   report.md. Sonnet calibration backfill revealed:
       code_graph     median rate 12% (8 of 10 ⚠️)
       code_graph_mcp median rate 10% (10 of 10 ⚠️)
       lsp            median rate 27% (7 of 10 ⚠️)
   So the agent abandoned the tool after a few attempts and ran the
   remaining 73-90% of bash commands without it (plain grep/sed/cat),
   which means the '-30.5% MCP vs baseline' headline is mostly a
   preamble effect, not a tool effect. This reframes the experiment
   substantially.

3. Backfilled tool_usage_rate on all 40 existing Sonnet trajectories
   in mcp-t17/bench/cache/results.jsonl so future report renders
   show the column.
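
A rough sketch of how a tool-usage rate like the one in item 2 could be computed. The function names and the command-splitting heuristic are assumptions for illustration, not the harness's `compute_tool_usage()` implementation; for instance, commands prefixed with `VAR=value` assignments are not handled here.

```python
import re
from typing import Iterable

# Split a bash command line into segments at &&, ||, ;, | and newlines.
_SEGMENT_SEP = re.compile(r"&&|\|\||[;|\n]")


def invokes_tool(command: str, tool: str) -> bool:
    """True if any segment of the command starts with the tool's name."""
    for segment in _SEGMENT_SEP.split(command):
        words = segment.split()
        if words and words[0] == tool:
            return True
    return False


def tool_usage_rate(commands: Iterable[str], tool: str) -> float:
    """Share of bash commands that invoked the configured tool."""
    cmds = list(commands)
    if not cmds:
        return 0.0
    hits = sum(1 for cmd in cmds if invokes_tool(cmd, tool))
    return hits / len(cmds)
```

Matching on the first word of each segment keeps `cg` from being counted when the command is actually `cg-mcp`, and vice versa.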

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes addressing the 10-27% tool-usage rate observed in the
Sonnet calibration:

1. cg / cg-mcp / lsp shims: redirect stdin from /dev/null on exec.
   mini-swe-agent's LocalEnvironment runs commands via
   subprocess.run(shell=True) without specifying stdin. When the
   runner is nohup-detached or run in a context with a closed FD 0,
   Python crashes at interpreter startup with init_sys_streams: Bad
   file descriptor before our argparse code runs. The Opus probe on
   pytest-6202 showed the first cg call crashing this way, after
   which the agent wrapped subsequent calls in '|| echo failed' and
   ran the rest of the trajectory on plain bash. Defense-in-depth
   only; harmless when FD 0 is already valid.

2. code_graph / code_graph_mcp / lsp preambles: add explicit rules
   forbidding silent fallback to grep/find. The agent must state
   tool failure before using a textual search alternative. This
   gives us a chance to (a) actually diagnose tool failures from
   trajectories instead of silently scoring bash trajectories as
   tool wins, and (b) raise tool-usage rates closer to a regime
   where the tool can plausibly affect outcomes.
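
The stdin fix in item 1 can be illustrated from the Python side. This is a sketch only: the commit describes the redirect as happening inside the shims, whereas this hypothetical `run_with_null_stdin` helper shows the equivalent at the `subprocess` call site, so the child always has a valid FD 0 that reads as empty.

```python
import subprocess
import sys
from typing import Sequence


def run_with_null_stdin(argv: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a child whose stdin is /dev/null, regardless of the parent's FD 0."""
    return subprocess.run(
        list(argv),
        stdin=subprocess.DEVNULL,  # child reads EOF immediately
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    probe = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_with_null_stdin(probe).stdout)
```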

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread bench/agents/code_graph_mcp_adapter.py Fixed
Comment thread api/mcp/auto_init.py Fixed
Comment thread bench/runners/mini_runner.py Fixed
Comment thread bench/report/__main__.py
import argparse
from pathlib import Path

from bench.report import aggregate_to_markdown, load_jsonl, render_markdown, summarize
Comment thread api/mcp/server.py
# Register tools on import so both direct ``import api.mcp.server`` and the
# stdio entry point see the same tool list. Imported below ``app`` because
# the tool modules need a reference to it.
from . import tools # noqa: F401, E402

Comment thread tests/mcp/test_init_agent.py Fixed
Comment thread tests/mcp/test_query_tools.py Fixed
…ken pytest verifier

The old verify_instance ran modern pytest 8 from the bench-combined venv
against legacy SWE-bench worktrees. Old codebases like pytest-6202 use
config keys (rsyncdirs) removed in modern pytest, producing
`INTERNALERROR: Unknown config option` at collection time — 0 tests
collected, returncode!=0, every trajectory graded 'failed' regardless of
patch correctness. The 1-task Opus probe proved this: 3 of 4 configs
produced the exact gold patch, all 4 graded 'failed' for the same
config error.

Replacement: verify_with_swebench_harness(inst, patch, ...) writes a
predictions.jsonl in the format the official harness expects and calls
swebench.harness.run_evaluation.main, which spins up per-instance
Docker images with the correct Python + dependency set and runs the
real FAIL_TO_PASS + PASS_TO_PASS selection. The agent's patch comes
from trajectory.info.submission (already populated by the runner).
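
A minimal sketch of writing such a predictions file. The record keys (`instance_id`, `model_name_or_path`, `model_patch`) follow the SWE-bench predictions format as far as I know; the `write_predictions` helper and the default model name are hypothetical, and the call into `swebench.harness.run_evaluation` is deliberately left out.

```python
import json
from pathlib import Path
from typing import Mapping


def write_predictions(
    path: Path,
    patches: Mapping[str, str],
    model_name: str = "bench-agent",  # illustrative label, not from the PR
) -> Path:
    """Write one JSON record per instance, one record per line."""
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return path
```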

When Docker is absent the result is reported as outcome=
'verifier_unavailable' rather than silently graded 'failed' — strictly
more honest, and lets the report distinguish 'agent failed' from
'we don't know'. The old verify_instance is kept as a deprecation
shim so any leftover caller fails loudly.

Also adds:
- mini_runner --skip-verify / --verify-timeout flags
- bench.cli.regrade for retroactively grading existing trajectories
  without re-running the agent (avoids re-spending tokens on the 40
  Sonnet calibration trajectories once Docker is wired up)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread bench/cli/regrade.py
Comment on lines +27 to +32
from bench.datasets.swe_bench import (
    SweBenchInstance,
    load_instances,
    verify_with_swebench_harness,
    _docker_available,
)
DvirDukhan and others added 2 commits May 28, 2026 14:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Track the share of bash commands on each track that are plain text
search (grep/rg/find/ack/ag) rather than the configured tool. Surfaced
alongside tool_usage_rate so we can distinguish 'tool answered the
question and the rest is normal bash' from 'agent silently abandoned
the tool and reverted to grep'.

Sonnet 4.5 n=10 headline now shows:
- baseline   25% fallback
- code_graph 10% (cut 60%)
- code_graph_mcp 8% (cut 68%)
- lsp         4% (cut 84%)

Backfilled all 40 Sonnet trajectories and the 15 Opus trajectories
currently on disk; harness writes the metric forward.
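
A sketch of the fallback metric and of the "cut" figures quoted above, assuming a simple first-word heuristic. `fallback_rate` and `relative_cut` are hypothetical names; the harness may classify commands differently.

```python
import re
from typing import Iterable

TEXT_SEARCH_TOOLS = frozenset({"grep", "rg", "find", "ack", "ag"})
_SEGMENT_SEP = re.compile(r"&&|\|\||[;|\n]")


def is_text_search(command: str) -> bool:
    """True if any pipeline segment starts with a plain text-search tool."""
    for segment in _SEGMENT_SEP.split(command):
        words = segment.split()
        if words and words[0] in TEXT_SEARCH_TOOLS:
            return True
    return False


def fallback_rate(commands: Iterable[str]) -> float:
    """Share of bash commands that are plain text search."""
    cmds = list(commands)
    if not cmds:
        return 0.0
    return sum(1 for cmd in cmds if is_text_search(cmd)) / len(cmds)


def relative_cut(rate: float, baseline_rate: float) -> float:
    """Relative reduction versus baseline, e.g. 0.10 vs 0.25 is a 60% cut."""
    if baseline_rate == 0:
        return 0.0
    return 1.0 - rate / baseline_rate
```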

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread api/mcp/auto_init.py
    )
    if result.returncode == 0 and result.stdout.strip():
        return result.stdout.strip()
except FileNotFoundError:
Comment thread bench/runners/mini_runner.py Fixed
DvirDukhan and others added 2 commits May 28, 2026 15:59
Surfaces median wall-clock seconds per task per config plus delta vs
baseline, alongside tokens. wall_clock_sec was already captured in
TaskMetrics — just plumbed into report aggregation/rendering.

Sonnet 4.5 n=10:
- baseline      336s  —
- code_graph    269s  -20.1%
- code_graph_mcp 273s -18.7%
- lsp           290s  -13.7%

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
_ensure_indexed and _ensure_indexed_mcp now return elapsed seconds
(0.0 on cache hit). Runner stashes the value on the metrics row as
index_sec; report renders median per config.

This separates 'how long does indexing the repo take' (one-time setup
cost) from 'how long does the agent take to solve the task' (the
existing median wall column).
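
The return contract can be sketched as follows. The injected `already_indexed` and `index` callables are assumptions standing in for the real cache check and indexing call; only the "elapsed seconds, 0.0 on cache hit" behaviour is taken from the commit message.

```python
import time
from typing import Callable


def ensure_indexed(
    repo: str,
    already_indexed: Callable[[str], bool],
    index: Callable[[str], None],
) -> float:
    """Index the repo if needed and return elapsed seconds (0.0 on cache hit)."""
    if already_indexed(repo):
        return 0.0
    start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    index(repo)
    return time.monotonic() - start
```

The runner could then store the returned value as `index_sec` on the metrics row, keeping setup cost separate from agent solve time.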

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread bench/runners/mini_runner.py Fixed
The HTTP /api/list_repos response shape changed from [name, ...] to
[{project, branch, graph}, ...], so the old 'repo_name in repositories'
membership check silently returned False — every cg-track run
re-issued analyze_folder even when the graph existed. With a 7200s
timeout this masked server hangs for over an hour at a time.

New precheck:
- queries FalkorDB GRAPH.LIST directly (matches the MCP track path)
- matches both bare-name (legacy) and code:<name>:<branch> forms
- bounded read timeout at 1800s (was 7200s); surfaces server hangs
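
A small sketch of the name-matching rule in the precheck, assuming the graph names have already been fetched (for example from `GRAPH.LIST`). `graph_exists` is a hypothetical helper, not the runner's actual function.

```python
from typing import Iterable


def graph_exists(repo_name: str, graph_names: Iterable[str]) -> bool:
    """Match both the legacy bare name and the code:<name>:<branch> form."""
    prefix = f"code:{repo_name}:"
    for graph in graph_names:
        if graph == repo_name:
            return True
        # Require a non-empty branch component after the prefix.
        if graph.startswith(prefix) and len(graph) > len(prefix):
            return True
    return False
```

Including the trailing colon in the prefix avoids a false match when one repo name is a prefix of another.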

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread bench/runners/mini_runner.py Fixed
…tter resolver

When the API server is launched without CODE_GRAPH_PY_RESOLVER=tree_sitter
the PythonAnalyzer silently falls back to the jedi/multilspy path. On
real-world repos (sphinx-doc/sphinx-8035, sympy, …) that path calls
`python3 -m venv venv && pip install poetry && poetry install` per repo
then runs jedi over the full transitive dep tree; we observed it wedge
the server at 100% CPU + 3.5 GB RSS for 3+ hours with no progress.

bench/scripts/start-api.sh already exports CODE_GRAPH_PY_RESOLVER, but a
human-launched `uvicorn api.index:app …` won't pick it up and the bench
silently degrades to the slow path.

This commit makes the failure mode loud:

1. `GET /api/_health` returns {status, py_resolver, falkordb_host,
   falkordb_port, public}. Cheap (no DB call), unauth'd.
2. `_ensure_indexed` in the mini_runner calls /api/_health before any
   indexing and raises a clear RuntimeError when py_resolver !=
   'tree_sitter', pointing the operator at bench/scripts/start-api.sh.

Verified: sphinx-doc__sphinx-8035 indexes in ~68s end-to-end with the
new server (vs hours unbounded before).
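
The guard in step 2 might look roughly like this, assuming the `/api/_health` response has already been fetched and decoded into a dict. `require_tree_sitter` is a hypothetical name and the error wording is illustrative.

```python
from typing import Any, Mapping


def require_tree_sitter(health: Mapping[str, Any]) -> None:
    """Raise a clear error unless the server reports the fast resolver."""
    resolver = health.get("py_resolver")
    if resolver != "tree_sitter":
        raise RuntimeError(
            f"API server reports py_resolver={resolver!r}; expected "
            "'tree_sitter'. Launch it via bench/scripts/start-api.sh."
        )
```

Failing before any indexing starts turns an hours-long silent slowdown into an immediate, actionable error.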

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    env_path = REPO_ROOT / ".env"
    if env_path.exists():
        load_dotenv(env_path)
except ImportError:
… raises timeout

The MCP adapter spawns a fresh cgraph-mcp stdio server per call. When the
caller shell did not export CODE_GRAPH_PY_RESOLVER, the spawned server
fell back to the legacy jedi/multilspy resolver, which runs
'python -m venv && pip install poetry && poetry install' per repo and
then analyzes the full transitive dep tree. On full SWE-bench worktrees
this wedges for >15 min — we observed it timing out indexing
sympy__sympy-20154 and sympy__sympy-19040 during a fresh Opus calibration
run.

Mirror the start-api.sh policy: default CODE_GRAPH_PY_RESOLVER to
tree_sitter in _env_for_mcp() so the MCP track is symmetric with HTTP
regardless of caller env. Also bump the per-call timeout default
300s -> 900s in both the adapter (CGRAPH_MCP_TIMEOUT_SEC) and the cg-mcp
CLI for headroom on cold MCP spawns over big repos.
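
A sketch of the defaulting policy, assuming the environment is built as a plain dict before spawning the server. `env_for_child` is a hypothetical stand-in for `_env_for_mcp()`, whose real body is not shown on this page.

```python
import os
from typing import Dict, Mapping, Optional


def env_for_child(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the caller's env and default the resolver to tree_sitter."""
    env = dict(os.environ if base is None else base)
    # setdefault keeps an explicit caller choice and only fills a missing key.
    env.setdefault("CODE_GRAPH_PY_RESOLVER", "tree_sitter")
    return env
```

Note that `setdefault` leaves an explicitly set value untouched, including an empty string, so an operator can still opt into the legacy resolver on purpose.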

Validated: sympy-20154 (591 .py files, ~49k nodes, ~344k edges) indexes
end-to-end via MCP in 220 s with the new default, vs >900 s timeout
before. HTTP path on the same repo: 95 s; ~2.3x slower over the stdio
spawn is expected and well within the new timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# bash shim is invoked by the agent, the server's stderr is merged
# into the agent's tool-output buffer, inflating context by ~1.8kB
# per call. The agent only needs the JSON-RPC result on stdout.
devnull = open(os.devnull, "w")

4 participants