bench: 4-config SWE-bench harness (baseline/lsp/code_graph/code_graph_mcp) by DvirDukhan · Pull Request #693 · FalkorDB/code-graph

DvirDukhan · 2026-05-28T09:33:12Z

Summary

End-to-end benchmark harness for evaluating code-graph against baseline and LSP on SWE-bench Verified. Four configurations:

baseline — bash only
lsp — bash + multilspy/jedi
code_graph — bash + cg HTTP CLI against the FastAPI service
code_graph_mcp — bash + cg-mcp JSON-RPC stdio CLI against cgraph-mcp

Includes resume support, per-instance timeouts, tree-sitter fast resolver (T15 + T18), MCP auto-init (T12-T14), a tool-usage rate metric to detect silent fallback to bash, and the official swebench.harness.run_evaluation Docker-backed verifier with retroactive regrade CLI.

Verified results (Sonnet 4.5, n=10, step-75, official SWE-bench Docker harness)

| config | resolved | resolve rate | median tokens | Δ vs baseline | tool-usage |
|---|---|---|---|---|---|
| baseline | 9/10 | 90% | 1,137,823 | n/a | n/a |
| lsp | 10/10 | 100% | 885,624 | −22.2% | 27% |
| code_graph | 9/10 | 90% | 881,397 | −22.5% | 12% |
| code_graph_mcp | 9/10 | 90% | 790,482 | −30.5% | 10% |
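
The Δ column can be reproduced from the median-token column. A minimal sketch, assuming Δ is the relative change in median tokens against baseline; the helper name `delta_vs_baseline` is illustrative and not taken from the harness:

```python
# Median tokens per config, copied from the results table.
MEDIAN_TOKENS = {
    "baseline": 1_137_823,
    "lsp": 885_624,
    "code_graph": 881_397,
    "code_graph_mcp": 790_482,
}


def delta_vs_baseline(tokens: int, baseline: int) -> float:
    """Relative change in percent; negative means fewer tokens than baseline."""
    return (tokens - baseline) / baseline * 100.0


for config, tokens in MEDIAN_TOKENS.items():
    if config == "baseline":
        continue
    delta = delta_vs_baseline(tokens, MEDIAN_TOKENS["baseline"])
    print(f"{config}: {delta:+.1f}%")
```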

All resolves were checked via the official harness (per-instance Docker images, real FAIL_TO_PASS + PASS_TO_PASS selection). Sympy-19040 is the only task any config fails: baseline, code_graph, and code_graph_mcp all miss it, and only lsp solves it.

Token efficiency at a glance

code_graph_mcp saves 30.5% median tokens vs baseline while matching baseline accuracy
All three tool tracks beat baseline tokens by 22-30%
Resolve rates are within 1 task of each other across configs

Engineering hardening shipped in this branch

38d2411 silence cgraph-mcp stderr (was bloating agent context 9×)
bbb5d95 bump default cgraph-mcp timeout 60s → 300s for sympy/django
aa850d6 tool-availability precheck + tool-usage rate metric (caught the silent-fallback regression that almost shipped)
4a6956e defensive stdin redirect on cg/cg-mcp/lsp shims + anti-fallback preamble rules
4daad7e rewrite verifier to use the official swebench Docker harness (the previous one ran modern pytest 8 against legacy worktrees and graded every trajectory failed)
bfdf60d gitignore harness output

Out of scope for this PR

Headline n=40 with Opus 4.5 — verifier is unblocked, just needs compute budget
pyright LSP adapter (currently jedi via multilspy) for the production-realistic LSP track

Draft for review of the harness mechanics + early numbers. Not for merge until headline n=40 lands.

staging-->main

… entry point Add the bare MCP server module (api/mcp/) using the official FastMCP SDK, wire the cgraph-mcp console script in pyproject.toml, and include a protocol smoke test that spawns the server over stdio and verifies list_tools returns an empty tool set. Also copies the MCP design docs into docs/. Closes #648 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix stale entry point references in design doc: api.mcp.server:app → :main - Remove contradicting decisions about tree-sitter/incremental indexing scope - Add language tags to fenced code blocks (MD040) - Add anyio.fail_after timeout to stdio smoke test to prevent CI hangs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- server: pass transport="stdio" explicitly to guard against future FastMCP default changes - test: drop STDIO_TIMEOUT to 10s (a stuck handshake should fail fast) - test: pin anyio backend to asyncio via fixture so transitive trio installs cannot silently double-run the test - pyproject: add anyio to test extras since the smoke test imports it directly (was previously available only via mcp's transitives) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without this, uv sync detects pyproject/lockfile drift on CI and silently re-resolves the entire dep tree to newer versions (uvicorn 0.41.0 → 0.46.0 was observed), which broke the e2e playwright suite. Lock now matches pyproject so installs are reproducible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This reverts commit 0c7e3db.

The falkordb/falkordb:latest base image is now Debian Trixie-based and arrives with apt in a state where the t64 ABI deps that git and build-essential require (libcurl3t64-gnutls, libtinfo6, libc6-dev, etc.) are held back. apt itself recommends `apt --fix-broken install`. Running `apt-get install -y -f` between update and the real install clears the broken state so the install can proceed. Verified locally against the exact base image digest CI uses (sha256:aaf67c724bba36b9fb8d43a2671fd57e89c536b971d72b692a63a168c8053ff4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GraphRAG-SDK released v1.0 (April 16) and force-pushed history during the release, dropping the pre-v1.0 API surface that the e2e tests were built against. Cloning HEAD now produces a graph without the merge_with/combine/import_data/add_node/add_edge/ask Function nodes the tests interact with. Switch to analyzing the installed graphrag-sdk package (pinned to 0.8.2 via uv.lock — immutable on PyPI). flask clone stays for autocomplete variety on set/lo/as substrings. ensure_calls_edges keeps acting as a safety net for the two required CALLS edges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two follow-ups to address the remaining 7 of 31 e2e failures: 1. Copy installed graphrag-sdk to a tempdir before analyzing. When the source path lives under .venv/lib/.../site-packages/, LSP treats it as an installed library and stops resolving call sites between functions (analyzer produced 0 CALLS edges vs 392 on the April 12 baseline). Copying to /tmp lets LSP treat it as a project and restores organic call-graph extraction. 2. Synthesize missing Function nodes in ensure_calls_edges. import_data has no `def` in any graphrag-sdk version (was a phantom from LSP resolution into a transitive dep). MERGE both source and dest Function nodes with minimal properties so the e2e path tests can find them. Adds the Searchable label so autocomplete works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the last 3 of the original 31 e2e failures. 1. Pass url= to Project() so save_repo_info populates Redis. The /api/repo_info endpoint returns 400 if repo_info is None, which broke canvas:167 with TypeError on response.info.node_count. 2. Synthesize test_<module> Function nodes for the search-bar tests. testData.ts parametrizes over searchInput "test", but graphrag-sdk 0.8.2 has zero functions whose names contain "test", so the auto-scroll dropdown isn't scrollable and the auto-complete count is 0. 12 synthesized names give the dropdown enough to scroll. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scaffold for the code-graph vs LSP vs baseline benchmark. No runners yet — just the directory layout, locked-in tool bundles per config, default run config, and the glossary in CONTEXT.md. Both originally-planned pre-reqs (graphrag-sdk 0.8 -> 1.1.1 upgrade, MCP-T15 tree-sitter base class refactor) are deferred as non-blockers for this workstream; rationale in the session plan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Updates from the round-2 grill: - Outcome accuracy only; drop intrinsic suite (Q1) - code-graph tools = primitives only; no GraphRAG chat (Q2) - Tools in-container; single-file re-index on edit via note_edit (Q3) - Token cost and indexing cost reported separately, never combined (Q4) - LSP responses shimmed (cap 50, trim hover); spec in shim.yaml (Q5) - Pass@1 + retry failures 2x (Q6) - Symmetric one-paragraph preambles per config (Q7) - Drop RepoBench (Q8) - Drop opencode qualitative track (Q9) - Three-stage rollout: smoke / calibration / headline (Q10) - 50-task random sample from SWE-bench Verified, seed committed (Q11) graphrag-sdk upgrade kept in scope per explicit user override. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The v1 SDK is a ground-up rewrite around document ingestion: the v0 KnowledgeGraph class (which we wrapped around an already-populated FalkorDB graph for /api/chat text-to-Cypher) is gone, and the new GraphRAG facade expects to own the graph via its ingestion pipeline with embeddings. There is no public primitive for 'wrap an existing graph and chat over it'. code-graph builds graphs through dedicated language analyzers, not ingestion, so we now keep the text-to-Cypher pipeline in-house in api/llm.py: generate Cypher from question + ontology, execute via the existing FalkorDB async client, synthesize an answer. We still use graphrag-sdk's LiteLLM provider as a thin LiteLLM wrapper to keep retry logic. Ontology is now a plain string in the prompt instead of the old Ontology/Entity/Relation object tree (which is also gone in v1). The /api/chat endpoint surface (ask(repo_name, question) -> str) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bench/metrics/ parses SWE-agent trajectory JSON into per-task TaskMetrics rows: input/output tokens, tool-call counts (with per-tool breakdown), patch, outcome. Defensive about trajectory-shape drift between SWE-agent versions (history vs trajectory vs steps; openai-style tool_calls vs SWE-agent action.command). bench/report/ aggregates those rows into a per-config table with median + p90 tokens and Δ-vs-baseline. The summary picks the best run per task (resolved > failed) so retries don't double-count. 10 unit tests cover token extraction, both tool-call shapes, the retry-merge rule, and the markdown delta column. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
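
The retry-merge rule and the delta column described above can be sketched as follows. This is an illustration under assumptions, not the code in bench/report/: the row shape, `best_runs`, and `summarize` are hypothetical names, and p90 is left out for brevity.

```python
import statistics
from collections import defaultdict

# Hypothetical row shape: one dict per (config, task, attempt).
ROWS = [
    {"config": "baseline", "task": "t1", "outcome": "failed", "tokens": 900},
    {"config": "baseline", "task": "t1", "outcome": "resolved", "tokens": 1000},
    {"config": "baseline", "task": "t2", "outcome": "failed", "tokens": 2000},
    {"config": "lsp", "task": "t1", "outcome": "resolved", "tokens": 700},
    {"config": "lsp", "task": "t2", "outcome": "resolved", "tokens": 1700},
]


def best_runs(rows):
    """Keep one row per (config, task); a resolved run beats a failed one."""
    best = {}
    for row in rows:
        key = (row["config"], row["task"])
        current = best.get(key)
        upgrade = (
            current is not None
            and row["outcome"] == "resolved"
            and current["outcome"] != "resolved"
        )
        if current is None or upgrade:
            best[key] = row
    return list(best.values())


def summarize(rows, baseline="baseline"):
    """Per-config resolved count, median tokens, and delta vs baseline in percent."""
    by_config = defaultdict(list)
    for row in best_runs(rows):
        by_config[row["config"]].append(row)
    base_median = statistics.median(r["tokens"] for r in by_config[baseline])
    summary = {}
    for config, runs in by_config.items():
        median = statistics.median(r["tokens"] for r in runs)
        summary[config] = {
            "resolved": sum(r["outcome"] == "resolved" for r in runs),
            "median_tokens": median,
            "delta_pct": (median - base_median) * 100 / base_median,
        }
    return summary


print(summarize(ROWS))
```

Merging before aggregating is what keeps a retried task from contributing two token counts to the median.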

bench/agents/code_graph_adapter.py exposes the seven tools the code-graph SWE-agent config gets: - graph_entities, get_neighbors, find_paths, auto_complete: thin wrappers over the existing FastAPI surface. - find_symbol: exact-name lookup, built client-side on top of auto_complete so we don't grow the server surface. - note_edit: incremental re-index hook the agent must call after every write_file/edit. Currently routes through analyze_folder on the dirname; degrades gracefully if the call fails. Crucially, GraphRAG is NOT exposed (Q2 grill decision: nested-agent double-counting). Both class-style (CodeGraphClient context manager) and function-style (graph_entities(...) etc.) are provided — the function form is what SWE-agent's tool registry needs. 9 unit tests using httpx.MockTransport cover all seven methods, the bearer-token auth header, 4xx propagation, and note_edit's non-fatal failure path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bench/runners/index_cache.py tracks which <repo>@<commit> pairs code-graph has already analyzed, so re-running the benchmark doesn't pay the indexing cost twice. Backed by a single JSON file under bench/cache/. Atomic via tmp-file replace. This module doesn't run analysis itself — that's done via code-graph's existing /api/analyze_folder endpoint. This is just the bookkeeping the runner consults before deciding to re-index. 6 unit tests cover record/lookup, cross-instance persistence, forget, and overwrite semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
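
The bookkeeping described above (a single JSON file, written through a tmp-file replace) could look roughly like this. The class name `IndexCache`, its method signatures, and the stored value are assumptions for illustration; only the `<repo>@<commit>` key and the record/lookup/forget vocabulary come from the commit message.

```python
import json
import os
import tempfile
from pathlib import Path


class IndexCache:
    """Remembers which <repo>@<commit> keys were already indexed (hypothetical API)."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def _save(self, data):
        # Write to a temp file in the same directory, then rename over the target.
        # os.replace is atomic on POSIX when both paths share a filesystem.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.replace(tmp, self.path)

    def record(self, repo, commit, graph_name):
        data = self._load()
        data[f"{repo}@{commit}"] = graph_name  # overwrites any earlier entry
        self._save(data)

    def lookup(self, repo, commit):
        return self._load().get(f"{repo}@{commit}")

    def forget(self, repo, commit):
        data = self._load()
        data.pop(f"{repo}@{commit}", None)
        self._save(data)
```

Because every read goes back to the file, a second `IndexCache` instance pointed at the same path sees entries recorded by the first.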

bench/agents/lsp_adapter.py wraps multilspy's SyncLanguageServer behind the same response shim spec'd in bench/tools/lsp/shim.yaml: cap results at 50, trim hover to 1 signature line + 1 docstring sentence, locations as {path, line, col}. Tools exposed: goto_definition, find_references, hover, document_symbols Notes on the LSP backend choice: - The plan originally specified pyright; multilspy >= 0.0.15 is required for that, but the pinned multilspy fork (AviAvni/multilspy@python-init-params, used by api/analyzers) is older. Using jedi-language-server matches the rest of the repo and avoids a divergent dep tree. Shim normalizes responses so jedi-vs-pyright doesn't affect the validity comparison. - workspace_symbols is dropped: the multilspy fork doesn't implement request_workspace_symbol. Agent falls back to bash+grep, which is the realistic LSP-world fallback too. - MultilspyConfig must be built via from_dict for this fork (constructor doesn't set all fields JediServer expects). Register pytest 'slow' marker in pyproject.toml; the 3 jedi roundtrip tests are slow but currently complete in <4s on a warm cache. Run them with -m slow or default; skip with -m 'not slow'. CONTEXT.md and bench/tools/lsp/tools.yaml updated to match. 10 tests pass: 7 shim units + 3 real jedi roundtrips (goto_definition, hover, document_symbols). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
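
A rough sketch of the response shim described above: cap at 50 results, hover trimmed to one signature line plus one docstring sentence, and locations flattened to {path, line, col}. The function names and the raw input shape (`relativePath` plus an LSP-style `range`) are assumptions, and line numbers are passed through as received rather than re-based.

```python
import re

MAX_RESULTS = 50  # cap stated in the commit message


def shim_locations(raw_locations, limit=MAX_RESULTS):
    """Normalize raw locations to {path, line, col} and cap the list length."""
    out = []
    for loc in raw_locations[:limit]:
        start = loc["range"]["start"]  # assumed LSP-style range
        out.append(
            {"path": loc["relativePath"], "line": start["line"], "col": start["character"]}
        )
    return out


def shim_hover(text):
    """Keep the first non-empty line as the signature plus one docstring sentence."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    signature, rest = lines[0], " ".join(lines[1:])
    # First sentence: shortest prefix ending in '.', '!' or '?' before a space or the end.
    match = re.match(r"(.+?[.!?])(\s|$)", rest)
    sentence = match.group(1) if match else rest
    return signature if not sentence else f"{signature}\n{sentence}"
```

Normalizing at this layer is what lets the comparison ignore backend differences such as jedi versus pyright.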

Pivots the harness from SWE-agent to mini-swe-agent — upstream now recommends mini-, and its bash-only tool surface is a simpler integration: each config is a PATH prefix plus a system_preamble.md, not a per-config tools.yaml. What this adds: - bench/runners/mini_runner.py — wraps DefaultAgent + LocalEnvironment, per-config env wiring (PATH for lsp/code_graph, baseline untouched), trajectory + diff capture, JSONL append via bench.metrics. Includes a stub LLM model that exercises the entire loop without any network calls so the harness is testable today. - bench/cli/cg.py, bench/cli/lsp.py — bash-callable CLIs wrapping the existing CodeGraphClient and LSP adapter. These are what the agent invokes via bash. - bench/tools/{baseline,lsp,code_graph}/system_preamble.md — symmetric one-page preambles per the locked-in grill decision. - bench/metrics — extended to also parse mini-swe-agent trajectory shape (messages[*].extra.response.usage and extra.actions[*].command). Buckets bash commands by first token; the COMPLETE_TASK submit protocol is bucketed as 'submit'. - tests/test_bench_runner.py — 10 tests, all run offline (no LLM): smoke, env wiring, persistence, CLI argparse smoke. - CONTEXT.md + plan.md — reflect mini-swe-agent + jedi pivots. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds --real-run as a mutually exclusive sibling of --dry-run. Real-run prepares a fresh repo per config (no cross-contamination), runs the agent against a synthetic buggy math_utils.py + pytest, then runs pytest to set metrics.outcome to resolved/failed. JSONL append in run_batch can now be deferred via defer_jsonl=True so the smoke loop can write the row once outcome is known. Validated end-to-end against GitHub Models (gpt-4o-mini) using GITHUB_API_KEY=$(gh auth token). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Loads princeton-nlp/SWE-bench_Verified via 'datasets', samples deterministically by seed (20260526) into smoke/calibration/headline stages (3/10/37), and prepares per-instance worktrees by cloning the upstream repo, checking out base_commit, and applying test_patch so FAIL_TO_PASS tests are present. Adds 'datasets' to the bench optional dep group. Adds 'swe_bench' mode to mini_runner alongside dry_run / real_run (mutually exclusive). Verification uses pytest with the FAIL_TO_PASS + PASS_TO_PASS test ids from the dataset row -- best effort because the official harness needs per-repo conda envs, which we don't build yet. 6 new unit tests cover the non-network parts of the loader (field parsing, sampling determinism, n override, pool clamping, path hygiene, task mapping). Worktree prep was validated end-to-end against pytest-dev/pytest-6202. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
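
Seeded sampling into the three stages can be sketched without the dataset loader. The seed and the 3/10/37 sizes come from the commit message; `sample_stages` is a hypothetical name, and treating the stages as disjoint slices of one shuffled pool is an assumption.

```python
import random

SEED = 20260526  # seed from the commit message
STAGES = {"smoke": 3, "calibration": 10, "headline": 37}


def sample_stages(instance_ids, seed=SEED, stages=STAGES):
    """Shuffle a sorted copy with a seeded RNG, then slice into disjoint stages."""
    pool = sorted(instance_ids)  # sort first so the input order cannot matter
    random.Random(seed).shuffle(pool)
    out, start = {}, 0
    for name, size in stages.items():
        out[name] = pool[start:start + size]  # slicing clamps when the pool is small
        start += size
    return out


ids = [f"task-{i}" for i in range(500)]
print({name: len(chosen) for name, chosen in sample_stages(ids).items()})
```

Using a private `random.Random(seed)` instance keeps the sample independent of any other code that touches the global RNG.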

bench/report/__main__.py: `uv run python -m bench.report` renders results.jsonl as a per-config summary table with token-delta vs baseline. Validated against the existing real-run smoke results. bench/runners/swebench_verify.py: exports per-config predictions JSONL files in the SWE-bench harness format, optionally invokes `python -m swebench.harness.run_evaluation` (Docker-based), then parses the resulting report.json and patches outcomes back into results.jsonl. 4 new unit tests cover the non-Docker parts. Adds `swebench>=4.0` to the bench optional dep group. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mini_runner.main() now calls dotenv.load_dotenv(.env) at the repo root if present, so users don't have to export ANTHROPIC_API_KEY / ANTHROPIC_API_BASE / GITHUB_API_KEY by hand each shell session. .env.template gains a documented block for the four supported provider configs we've actually tested or have credentials for: direct Anthropic, Azure AI Foundry's Anthropic-passthrough endpoint (/anthropic/v1/messages, x-api-key), GitHub Models, and Azure OpenAI. Most relevant for our setup: Azure AI Foundry → litellm's anthropic/ provider with a custom ANTHROPIC_API_BASE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Smoke run showed the agent invoked cg exactly once and lsp zero times across all three SWE-bench instances — because the bash shims didn't exist (the agent's `which cg` returned 'cg not found'). The differential between configs was therefore noise. Fixes: - Add executable bash shims bench/cli/{cg,lsp} that exec "$BENCH_PYTHON" -m bench.cli.{cg,lsp}. Runner exports BENCH_PYTHON = sys.executable so the venv (with httpx/multilspy) is used. - Export REPO_NAME for the code_graph config (worktree dirname). The preamble references it; nothing was setting it. - _ensure_indexed(): POST /api/analyze_folder for each code_graph worktree before running the task, so cg find-symbol returns real results. Skips re-indexing via /api/list_repos precheck. - Rewrite system preambles to instruct "use cg/lsp BEFORE grep" with an explicit typical-loop, not just a list of subcommands. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos, REPO_NAME set, and explicit "use cg/lsp first" framing in the system preamble, Claude Opus 4.5 ignored the differentiating tools and fell straight back to grep/sed/cat. The 3-way comparison was real but uninformative: tool choice was identical across configs. This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block directly in the task description — the first thing the model sees each turn. Selection via load_instance_template(config); baseline keeps the original template. Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track invokes 'cg' 5x (including cg auto-complete returning the exact buggy function with line numbers + docstring). The structured-navigation tools are finally exercised, so token deltas measured against baseline are now meaningful signal rather than noise. n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens than baseline on this instance. Bigger preambles + verbose JSON tool replies + occasional retries (cg find-symbol exact-match bug) outweigh any savings. Headline run should scale n or pivot to a function-calling harness. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Smoke #3 revealed cg find-symbol --name <exact> returned [] for symbols the graph clearly contained (cg auto-complete --prefix found the same symbol with full file:line+docstring).

Root cause: the filter compared item['name'] to the requested name, but the /api/auto_complete payload nests the symbol name under item['properties']['name'] (FalkorDB node properties), so the top-level lookup always returned None and nothing matched.

Fix: prefer item['properties']['name'], fall back to item['name'] for flatter shapes the unit tests pass in. Added a regression test that uses the real payload structure.

Verified end-to-end against the live FastAPI service:

    cg find-symbol --repo pytest-dev__pytest-6202__code_graph \
        --name getmodpath
    # -> [{id:2714, labels:[Function], properties:{name,path,doc,...}}]

This was the bug that made the smoke #3 code_graph agent burn 3 of 5 cg calls retrying exact-name lookups before falling back to auto-complete. With this fix, an agent doing the natural workflow (find-symbol -> get-neighbors -> note-edit) should land far fewer wasted calls.

Also: norecursedirs in [tool.pytest.ini_options] to keep pytest from walking into per-instance bench worktrees that ship their own pytest sources (was breaking host pytest's AST rewriter on import).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
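
The lookup fix amounts to a small name accessor. A minimal sketch, assuming the payload is a list of dicts; `symbol_name` and `find_symbol` here are illustrative and not the adapter's actual functions:

```python
def symbol_name(item):
    """Prefer the nested FalkorDB property, fall back to a flat 'name' key."""
    props = item.get("properties") or {}
    return props.get("name", item.get("name"))


def find_symbol(completions, name):
    """Exact-name filter over an auto-complete payload."""
    return [item for item in completions if symbol_name(item) == name]


payload = [
    {"id": 2714, "labels": ["Function"], "properties": {"name": "getmodpath"}},
    {"id": 9, "labels": ["Function"], "properties": {"name": "getmodpath_helper"}},
    {"name": "getmodpath"},  # flatter shape, as used in unit tests
]
print(find_symbol(payload, "getmodpath"))
```

With the old top-level comparison, the first entry would never match because it has no `name` key outside `properties`.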

Refactor FalkorDB graph naming so each (project, branch) pair gets its own graph: 'code:{project}:{branch}'. This lets concurrent agents working on different branches of the same repo index in parallel without overwriting each other. Changes: - api/graph.py: add DEFAULT_BRANCH, compose_graph_name(), parse_graph_name(); Graph and AsyncGraphQuery constructors now accept (name, branch=None); Graph.from_raw_name() classmethod for internal callers that need to bypass composition (e.g. clone()); get_repos()/async_get_repos() now return {project, branch, graph} dicts. - api/info.py: branch-aware Redis hash keys ('{repo}:{branch}_info'); reads fall back to legacy '{repo}_info' for un-migrated graphs. - api/git_utils: GitRepoName() and switch_commit() thread branch through; LegacyGitRepoName() retained for the migration helper. - api/project.py: detect_branch() via 'git rev-parse --abbrev-ref HEAD'; Project.__init__ / from_git_repository / from_local_repository accept branch. - api/index.py: all Pydantic request models gain 'branch: Optional[str]'; endpoints thread it into AsyncGraphQuery + info functions; responses include 'branch'. - api/cli.py: --branch flag on index / index-repo / search / neighbors / paths / info; new 'cgraph migrate' command. - api/migrations/per_branch.py (NEW): idempotent migration that renames legacy '<project>' graphs to 'code:<project>:_default', '{<project>}_info' Redis keys to '{<project>}:_default_info', and '{<project>}_git' graphs to '{<project>}:_default_git'. Supports --dry-run. Tests: - tests/test_per_branch_graphs.py (NEW): 24 unit tests covering compose/parse helpers, Graph constructor branch awareness, AsyncGraphQuery, info-key shape, GitRepoName shape, and migration idempotency (with mocked FalkorDB). - tests/test_async_graph.py, tests/test_cli.py, tests/endpoints/test_list_repos.py: updated assertions for the new dict return shape from get_repos / async_get_repos. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
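
The naming scheme can be illustrated with a pair of helpers. This is a sketch, not the code in api/graph.py: the signatures and return types are assumptions, `_default` is inferred from the migration targets named above, and project names are assumed to contain no ':'.

```python
DEFAULT_BRANCH = "_default"  # inferred from the 'code:<project>:_default' migration target
PREFIX = "code"


def compose_graph_name(project, branch=None):
    """Build 'code:{project}:{branch}', using the default branch when none is given."""
    return f"{PREFIX}:{project}:{branch or DEFAULT_BRANCH}"


def parse_graph_name(graph_name):
    """Return (project, branch), or None for legacy names without the prefix."""
    parts = graph_name.split(":", 2)  # maxsplit keeps any ':' inside the branch intact
    if len(parts) != 3 or parts[0] != PREFIX:
        return None
    return parts[1], parts[2]


print(compose_graph_name("flask"))
print(parse_graph_name("code:flask:feature/x"))
```

Returning None for unprefixed names gives callers a simple way to spot graphs that still need the migration.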

New `.github/workflows/mcp-tests.yml` runs `pytest tests/mcp/` against a real FalkorDB service container on port 6379. Triggers only on PRs that touch MCP-relevant paths so the unrelated parts of the repo don't pay the cost. - FalkorDB service with redis-cli ping healthcheck. - uv cache keyed on uv.lock for fast incremental runs. - Sets `FALKORDB_HOST` / `FALKORDB_PORT` env so api/graph.py picks up the service host. - Path filter covers api/mcp/, tests/mcp/, api/llm.py, api/graph.py, pyproject.toml, uv.lock, and the workflow file itself. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

New `tests/mcp/fixtures/`: - `sample_project/python/` — canonical call graph `entrypoint -> service -> {UserRepo,OrderRepo}.repo -> db` plus a small class hierarchy (BaseRepo <- UserRepo, OrderRepo) and inter-file imports so IMPORTS edges exist. - `expected.yaml` — single source of truth for every per-tool ticket's integration assertions: minimum per-label counts, named callers / callees, known paths, prefix-search hits. New `tests/mcp/conftest.py`: - `expected_contract` (pure-Python, always available) loads the YAML once per session. - `indexed_fixture` (session-scoped) indexes the fixture into a unique `code:sample_project:test-<uuid>` graph so parallel CI shards don't contend. Self-skips when FalkorDB is unreachable. Uses `SourceAnalyzer.analyze_local_folder` directly so the fixture doesn't need to be a git repository. New `tests/mcp/test_fixture_contract.py` — regression-tests the fixture itself: contract shape, on-disk files, and that the integration fixture indexes cleanly and meets the minimum count contract. Multilingual coverage (Java + C#) was dropped from the spec: both multilspy analyzers demand a Maven / .NET project layout at the indexed root, which would force this fixture into an awkward shape. Deferred to a follow-up ticket (likely T16 which adds languages). All 4 contract tests pass against FalkorDB on 6390. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

First real MCP tool. Wraps the existing Project / SourceAnalyzer pipeline so AI agents can call `index_repo(path_or_url, branch)` over stdio to populate code-graph for a repo. - `api/mcp/tools/structural.py` (NEW) — registers `index_repo` on the shared FastMCP app. Accepts local paths or git URLs; auto-detects branch from local git checkouts via T17's `detect_branch`; honors `ALLOWED_ANALYSIS_DIR` for sandboxing. Non-git folders are handled by driving SourceAnalyzer directly (Project requires a git repo). - `api/mcp/tools/__init__.py` (NEW) — package marker; importing it registers every tool module's `@app.tool()` decorators. - `api/mcp/server.py` — imports tools at module load so both direct `from api.mcp.server import app` and `cgraph-mcp` stdio entry point see the same tool list. - `tests/mcp/test_index_repo.py` (NEW) — 5 tests: local-path happy path, missing-path error, ALLOWED_ANALYSIS_DIR sandboxing, in-process app registration, JSON serialisability. - `tests/mcp/test_scaffold.py` — replaced the "zero tools" assertion with a presence check for `index_repo` so it stays stable as T5-T8 / T11 add more tools. Return shape: {project_name, branch, graph_name, num_nodes, num_edges, languages_detected, mode} `incremental` parameter is accepted now and forwarded once T18 lands; the current full-reindex path ignores it and always returns `mode="full"`. All 8 tests pass against FalkorDB on 6390. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Python analyzer hardcoded `environment_path={path}/venv` when starting jedi-language-server via multilspy. When the repo had no venv (the common case for cloned codebases like sphinx, sympy, anything from SWE-bench), jedi raised `InvalidPythonEnvironment` on every `request_definition()` call. analyzer.resolve() then swallowed the exception silently and the indexer produced a graph with DEFINES edges only — zero CALLS, zero EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy (41K functions) had no resolved cross-references at all. Fix: - source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall back to the host interpreter's environment (sys.executable's prefix) so jedi always has a valid Python to introspect. - analyzer.py: log resolve() failures at WARN with file/line context instead of swallowing them silently, so the next regression is loud. Verified: re-indexed sphinx-doc/sphinx-9230 with the fix: DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only). Fixes #685. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
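
The fallback order from the fix can be written as a short helper. A minimal sketch, assuming `sys.prefix` stands in for "the host interpreter's environment"; `pick_environment_path` is a hypothetical name, not the function in source_analyzer.py:

```python
import sys
from pathlib import Path


def pick_environment_path(repo_root):
    """Return the first existing candidate: <repo>/venv, <repo>/.venv, host prefix."""
    root = Path(repo_root)
    for candidate in (root / "venv", root / ".venv"):
        if candidate.is_dir():
            return str(candidate)
    # No project-local environment: fall back to the interpreter running this code.
    return sys.prefix


print(pick_environment_path("."))
```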

Two production-quality fixes from the calibration run that crashed at 14/30 trajectories: 1. Resume support: skip (instance, cfg) pairs whose trajectory file already exists. Lets us recover from crashes/kills without re-running completed work (avoids ~$3 of wasted compute on this run). 2. Ignore pathological files at index time: sympy/integrals/rubi/rules contains auto-generated 3000-line files with hundreds of unresolvable symbols per line. jedi spends hours and never makes progress. Adding it to the default ignore list unblocks sympy-19040 (and other sympy instances) without affecting graph quality. Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
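
Resume support reduces to an existence check per (instance, cfg) pair. The sketch below assumes a `<out_dir>/<config>/<instance_id>.traj.json` layout, which is hypothetical; the real runner may name its trajectory files differently.

```python
from pathlib import Path


def trajectory_path(out_dir, instance_id, config):
    # Hypothetical layout: <out_dir>/<config>/<instance_id>.traj.json
    return Path(out_dir) / config / f"{instance_id}.traj.json"


def pending_pairs(out_dir, instance_ids, configs):
    """Yield (instance, config) pairs that have no trajectory file yet."""
    for instance_id in instance_ids:
        for config in configs:
            if not trajectory_path(out_dir, instance_id, config).exists():
                yield instance_id, config
```

One caveat with this approach: a trajectory file left half-written by a crash would also count as done, so writing trajectories atomically is worth considering alongside it.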

In source_analyzer.second_pass, the list of files we iterate can include paths that first_pass did not add to self.files (e.g. parse errors, LSP-induced timeouts, or rare edge cases where a candidate file is present in the input list but never makes it into the files map). Previously this raised KeyError and aborted the entire index. Hit on sympy/polys/distributedmodules.py during bench calibration of sympy-12481. Skip with a WARN log instead so a single bad file no longer takes down the whole index. Also bump mini_runner httpx timeout 1800s -> 7200s: the sympy-12481 index was observed taking >30 min in the field, so the API server finished indexing successfully but the runner had already timed out and given up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Replace jedi-based resolution with a pure tree-sitter static resolver behind CODE_GRAPH_PY_RESOLVER=tree_sitter. Default remains jedi for backwards compatibility. Benchmark on pytest-dev/pytest-6202 (204 files): - jedi: 247.1s wall, CALLS=1976, EXTENDS=71 - tree-sitter: 6.9s wall, CALLS=4833, EXTENDS=83 ~36x speedup, broader call recall (jedi returns None ~80% of the time). Mechanism: - TreeSitterPythonResolver builds a project-wide symbol table (top-level funcs/classes/assigns, class methods, import maps) keyed by id(files) for lazy construction. - Resolution: head lookup (local module -> import map -> cross-project bare-name fallback) + tail walk through attributes and class methods. - Handles relative imports, aliased imports, import-of-package, Optional[T]/generic_type subscript unwrapping. - AbstractAnalyzer.needs_lsp() hook + PythonAnalyzer override let source_analyzer skip LSP startup and venv setup entirely when the static resolver is active. This is where the wall-time win actually lives (jedi warm-up was ~240s of the 247s baseline). Closes #689. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AbstractAnalyzer._captures was recompiling its query string on every call. cProfile on pytest-dev/pytest-6202 (204 files) showed tree_sitter.Language.query consuming 3.03s of the 6.36s first_pass — ~48% of analyzer time spent rebuilding queries that never change. Cache them on the analyzer instance, keyed by pattern string. Also switches from the deprecated language.query() to the Query(language, pattern) constructor. Wall-time on pytest-6202 (CODE_GRAPH_PY_RESOLVER=tree_sitter): before: 6.9s after: 3.7s Benefits every tree-sitter analyzer (Python, JavaScript, Kotlin), not just the new static resolver. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
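
The caching idea is independent of tree-sitter itself. A minimal sketch with the compile step injected as a callable, so it runs without the tree_sitter package; `QueryCache` is a hypothetical name, and in the analyzer the callable would wrap the `Query(language, pattern)` constructor mentioned above.

```python
class QueryCache:
    """Compile each pattern once per analyzer instance, keyed by pattern string."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # stand-in for the expensive query compilation
        self._cache = {}

    def get(self, pattern):
        query = self._cache.get(pattern)
        if query is None:
            query = self._compile(pattern)
            self._cache[pattern] = query
        return query


calls = []


def fake_compile(pattern):
    """Stand-in compiler that records how often it is invoked."""
    calls.append(pattern)
    return ("compiled", pattern)


cache = QueryCache(fake_compile)
for _ in range(3):
    cache.get("(function_definition) @fn")
print(len(calls))
```

Keying on the pattern string works because the patterns are constants per analyzer; a cache shared across languages would need the language in the key as well.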

After T18 (#691) + query-cache (#692), code_graph indexing on pytest-6202 drops from 247s to 3.7s — but only if the API server is launched with CODE_GRAPH_PY_RESOLVER=tree_sitter. This helper bakes in that env plus the public/permissive flags the bench harness expects, so calibration runs hit the fast path without manual setup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolve conflicts: - source_analyzer: keep needs_lsp() gate from query-cache, keep venv fallback + first_pass-skipped-file defense from bench-mcp-track - analyzer.resolve: keep verbose error logging from bench-mcp-track - llm.py / uv.lock: take bench-mcp-track (graphrag 1.x rewrite)

After merging the bench harness (graphrag-sdk 1.1.1) with the MCP suite (written against 0.8 KnowledgeGraph), the server failed at import. Move the SDK import inside get_or_create_kg so only the 'ask' tool trips the incompatibility — structural tools used by the bench harness (index_repo, search_code, get_callers, ...) work either way.

… context Each cg-mcp bash invocation spawns a fresh cgraph-mcp server, whose DEBUG logs (analyzer init + MCP server.py registration + per-request dispatch) were being merged into the agent's tool-output buffer at ~1.8 kB per call. Across a 50-call trajectory that's ~90 kB of useless log noise replayed each turn, blowing token counts up to ~9x what the HTTP code_graph track produces. Route the spawned server's stderr to /dev/null via stdio_client's errlog kwarg. Verified end-to-end: pytest-6202 code_graph_mcp trajectory dropped from $6+ to $2.48. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

4 of 10 calibration instances (sympy/django) hit TimeoutError during indexing at the 60s default. The sympy graphs alone have 24k+ nodes and 145k+ edges, which legitimately exceeds 60s. 300s matches the HTTP code_graph adapter's behaviour for large repos and removes the indexing-timeout failure mode without slowing happy-path calls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two safeguards against the 'silent fallback to bash' failure mode that made our Sonnet calibration headline numbers untrustworthy:

1. verify_tool_available(): before launching the agent in any tool config (lsp / code_graph / code_graph_mcp), exec the tool's --help in the same env the agent will see. If it fails (missing PATH, Python startup crash, etc.) the run aborts with outcome='tool_unavailable' instead of silently producing a bash-only trajectory that we'd later attribute to the tool.

2. compute_tool_usage(): for every trajectory, count how many bash commands actually invoked the configured tool (cg / cg-mcp / lsp). Surfaced as tool_usage_rate on TaskMetrics and as a new column in report.md.

   Sonnet calibration backfill revealed:
   - code_graph median rate 12% (8 of 10 ⚠️)
   - code_graph_mcp median rate 10% (10 of 10 ⚠️)
   - lsp median rate 27% (7 of 10 ⚠️)

   So the agent abandoned the tool after a few attempts and ran 80-90% of bash commands as plain grep/sed/cat, meaning the '-30.5% MCP vs baseline' headline is mostly preamble effect, not tool effect. Reframes the experiment substantially.

3. Backfilled tool_usage_rate on all 40 existing Sonnet trajectories in mcp-t17/bench/cache/results.jsonl so future report renders show the column.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
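A rough sketch of the two safeguards follows. The function names come from the commit message, but the signatures, the token-based matching, and the 30-second timeout are assumptions made for illustration; the harness may detect tool invocations differently.

```python
import shlex
import shutil
import subprocess


def verify_tool_available(binary: str, timeout: float = 30.0) -> bool:
    """Return True only if `<binary> --help` runs cleanly in this environment."""
    if shutil.which(binary) is None:
        return False
    try:
        result = subprocess.run(
            [binary, "--help"],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def invokes_tool(command: str, binary: str) -> bool:
    """Crude check (assumption): the binary appears as a standalone shell token."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes, fall back to whitespace split
        tokens = command.split()
    return binary in tokens


def compute_tool_usage(commands: list[str], binary: str) -> float:
    """Share of bash commands in a trajectory that invoked the configured tool."""
    if not commands:
        return 0.0
    hits = sum(invokes_tool(cmd, binary) for cmd in commands)
    return hits / len(commands)
```

For example, a trajectory of four commands in which two call `cg` yields a rate of 0.5; a precheck against a binary that is not on PATH returns False before any agent run starts.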

Two fixes addressing the 10-27% tool-usage rate observed in the Sonnet calibration:

1. cg / cg-mcp / lsp shims: redirect stdin from /dev/null on exec. mini-swe-agent's LocalEnvironment runs commands via subprocess.run(shell=True) without specifying stdin. When the runner is nohup-detached or run in a context with a closed FD 0, Python crashes at interpreter startup with init_sys_streams: Bad file descriptor before our argparse code runs. The Opus probe on pytest-6202 showed the first cg call crashing this way, after which the agent wrapped subsequent calls in '|| echo failed' and ran the rest of the trajectory on plain bash. Defense-in-depth only; harmless when FD 0 is already valid.

2. code_graph / code_graph_mcp / lsp preambles: add explicit rules forbidding silent fallback to grep/find. The agent must state tool failure before using a textual search alternative. This gives us a chance to (a) actually diagnose tool failures from trajectories instead of silently scoring bash trajectories as tool wins, and (b) raise tool-usage rates closer to a regime where the tool can plausibly affect outcomes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
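The stdin defense in item 1 amounts to giving the child process a valid, empty FD 0 regardless of what the parent inherited. The shims in the PR are not shown here, so the sketch below expresses the same idea with `subprocess`; `run_shim` is a hypothetical helper name.

```python
import subprocess
import sys


def run_shim(argv: list[str]) -> subprocess.CompletedProcess:
    """Run a tool with stdin bound to the null device.

    The child always sees a valid, empty stdin, even when the parent's
    own FD 0 is closed (for example under a detached runner).
    """
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )


# Usage: the child reads stdin and immediately sees end-of-file.
probe = run_shim([sys.executable, "-c", "import sys; print(repr(sys.stdin.read()))"])
```

The redirect is a no-op when stdin was already valid, which matches the commit's "defense-in-depth only" framing.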

coderabbitai · 2026-05-28T09:33:22Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0996182d-1387-4382-95bd-4a8a7c58bfe6


+import argparse
+from pathlib import Path
+
+from bench.report import aggregate_to_markdown, load_jsonl, render_markdown, summarize


+# Register tools on import so both direct ``import api.mcp.server`` and the
+# stdio entry point see the same tool list. Imported below ``app`` because
+# the tool modules need a reference to it.
+from . import tools  # noqa: F401, E402


+
+from __future__ import annotations
+
+from pathlib import Path


+
+from __future__ import annotations
+
+from pathlib import Path


+
+from pathlib import Path
+
+import pytest


…ken pytest verifier

The old verify_instance ran modern pytest 8 from the bench-combined venv against legacy SWE-bench worktrees. Old codebases like pytest-6202 use config keys (rsyncdirs) removed in modern pytest, producing `INTERNALERROR: Unknown config option` at collection time: 0 tests collected, returncode!=0, every trajectory graded 'failed' regardless of patch correctness. The 1-task Opus probe proved this: 3 of 4 configs produced the exact gold patch, all 4 graded 'failed' for the same config error.

Replacement: verify_with_swebench_harness(inst, patch, ...) writes a predictions.jsonl in the format the official harness expects and calls swebench.harness.run_evaluation.main, which spins up per-instance Docker images with the correct Python + dependency set and runs the real FAIL_TO_PASS + PASS_TO_PASS selection. The agent's patch comes from trajectory.info.submission (already populated by the runner).

When Docker is absent the result is reported as outcome='verifier_unavailable' rather than silently graded 'failed'. This is strictly more honest, and lets the report distinguish 'agent failed' from 'we don't know'. The old verify_instance is kept as a deprecation shim so any leftover caller fails loud.

Also adds:
- mini_runner --skip-verify / --verify-timeout flags
- bench.cli.regrade for retroactively grading existing trajectories without re-running the agent (saves the tokens spent on the 40 Sonnet calibration once Docker is wired up)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+from bench.datasets.swe_bench import (
+    SweBenchInstance,
+    load_instances,
+    verify_with_swebench_harness,
+    _docker_available,
+)


+from __future__ import annotations
+
+import json
+import os


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Track the share of bash commands on each track that are plain text search (grep/rg/find/ack/ag) rather than the configured tool. Surfaced alongside tool_usage_rate so we can distinguish 'tool answered the question and the rest is normal bash' from 'agent silently abandoned the tool and reverted to grep'.

Sonnet 4.5 n=10 headline now shows:
- baseline 25% fallback
- code_graph 10% (cut 60%)
- code_graph_mcp 8% (cut 68%)
- lsp 4% (cut 84%)

Backfilled all 40 Sonnet trajectories and the 15 Opus trajectories currently on disk; harness writes the metric forward.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
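One way to compute such a fallback share, and the relative cut against baseline, is sketched below. The function names and the segment-splitting heuristic are assumptions; the harness may classify commands differently, and this version does not handle shell operators inside quoted strings.

```python
import re

TEXT_SEARCH_TOOLS = frozenset({"grep", "rg", "find", "ack", "ag"})

# Split a command line on ||, &&, | and ; (naive: ignores quoting).
_SEGMENT_SPLIT = re.compile(r"\|\||&&|[|;]")


def is_text_search(command: str) -> bool:
    """True if any pipeline segment starts with a plain text-search tool."""
    for segment in _SEGMENT_SPLIT.split(command):
        words = segment.split()
        if words and words[0] in TEXT_SEARCH_TOOLS:
            return True
    return False


def text_search_rate(commands: list[str]) -> float:
    """Share of bash commands that are plain text search."""
    if not commands:
        return 0.0
    return sum(is_text_search(cmd) for cmd in commands) / len(commands)


def relative_cut(baseline_rate: float, rate: float) -> float:
    """Fractional reduction versus baseline, e.g. 0.25 -> 0.10 is a 0.6 cut."""
    return 1.0 - rate / baseline_rate
```

Applied to the headline figures above, a drop from 25% to 10% is a 60% cut, 25% to 8% is 68%, and 25% to 4% is 84%, which agrees with the reported numbers.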

+        )
+        if result.returncode == 0 and result.stdout.strip():
+            return result.stdout.strip()
+    except FileNotFoundError:


Surfaces median wall-clock seconds per task per config plus delta vs baseline, alongside tokens. wall_clock_sec was already captured in TaskMetrics; it is now plumbed into report aggregation/rendering.

Sonnet 4.5 n=10:
- baseline: 336s
- code_graph: 269s (-20.1%)
- code_graph_mcp: 273s (-18.7%)
- lsp: 290s (-13.7%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

_ensure_indexed and _ensure_indexed_mcp now return elapsed seconds (0.0 on cache hit). Runner stashes the value on the metrics row as index_sec; report renders median per config. This separates 'how long does indexing the repo take' (one-time setup cost) from 'how long does the agent take to solve the task' (the existing median wall column). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
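The return-value contract (elapsed seconds, 0.0 on a cache hit) can be sketched as below. The `indexed` set and `index_fn` callback are hypothetical stand-ins for whatever cache check and indexing call `_ensure_indexed` actually performs.

```python
import time
from typing import Callable


def ensure_indexed(repo: str, indexed: set[str],
                   index_fn: Callable[[str], None]) -> float:
    """Index `repo` if needed and return elapsed seconds (0.0 on cache hit)."""
    if repo in indexed:
        return 0.0
    start = time.perf_counter()
    index_fn(repo)
    indexed.add(repo)
    return time.perf_counter() - start


# Usage: the first call pays the indexing cost, the second is a cache hit.
cache: set[str] = set()
first = ensure_indexed("pytest-6202", cache, lambda repo: time.sleep(0.01))
second = ensure_indexed("pytest-6202", cache, lambda repo: time.sleep(0.01))
```

A caller can then store the returned value as `index_sec` on the metrics row, keeping setup cost out of the per-task wall-clock figure.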

The HTTP /api/list_repos response shape changed from [name, ...] to [{project, branch, graph}, ...], so the old 'repo_name in repositories' membership check silently returned False: every cg-track run re-issued analyze_folder even when the graph existed. With a 7200s timeout this masked server hangs for over an hour at a time.

New precheck:
- queries FalkorDB GRAPH.LIST directly (matches the MCP track path)
- matches both bare-name (legacy) and code:<name>:<branch> forms
- bounded read timeout at 1800s (was 7200s); surfaces server hangs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
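The membership bug and the two-form match can be sketched as follows. `graph_exists` is a hypothetical helper that takes the graph names as a plain list (as GRAPH.LIST would return them), so the snippet needs no database connection; the example repo and branch names are made up.

```python
def graph_exists(repo_name: str, graph_names: list[str]) -> bool:
    """Match the legacy bare name or the code:<name>:<branch> form."""
    prefix = f"code:{repo_name}:"
    return any(
        name == repo_name or name.startswith(prefix)
        for name in graph_names
    )


# The old check compared a string against a list of dicts, so it was
# always False once the response shape changed.
new_shape = [{"project": "pytest", "branch": "main", "graph": "code:pytest:main"}]
old_check = "pytest" in new_shape
```

Anchoring the prefix on the trailing colon keeps a repo named `pytest` from matching a graph such as `code:pytest-dev:main`.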

…tter resolver

When the API server is launched without CODE_GRAPH_PY_RESOLVER=tree_sitter the PythonAnalyzer silently falls back to the jedi/multilspy path. On real-world repos (sphinx-doc/sphinx-8035, sympy, …) that path calls `python3 -m venv venv && pip install poetry && poetry install` per repo then runs jedi over the full transitive dep tree; we observed it wedge the server at 100% CPU + 3.5 GB RSS for 3+ hours with no progress.

bench/scripts/start-api.sh already exports CODE_GRAPH_PY_RESOLVER, but a human-launched `uvicorn api.index:app …` won't pick it up and the bench silently degrades to the slow path. This commit makes the failure mode loud:

1. `GET /api/_health` returns {status, py_resolver, falkordb_host, falkordb_port, public}. Cheap (no DB call), unauth'd.
2. `_ensure_indexed` in the mini_runner calls /api/_health before any indexing and raises a clear RuntimeError when py_resolver != 'tree_sitter', pointing the operator at bench/scripts/start-api.sh.

Verified: sphinx-doc__sphinx-8035 indexes in ~68s end-to-end with the new server (vs hours unbounded before).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
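The client-side guard in step 2 might look something like this. The endpoint path and the `py_resolver` field come from the commit message; the helper names, the 10-second timeout, and the error wording are assumptions.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/api/_health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/_health", timeout=timeout) as resp:
        return json.load(resp)


def require_tree_sitter(health: dict) -> None:
    """Fail loudly when the server is not on the tree-sitter resolver."""
    resolver = health.get("py_resolver")
    if resolver != "tree_sitter":
        raise RuntimeError(
            f"API server reports py_resolver={resolver!r}; expected 'tree_sitter'. "
            "Launch the server via bench/scripts/start-api.sh."
        )
```

A runner would call `require_tree_sitter(fetch_health(base_url))` before indexing, so a misconfigured server stops the run immediately instead of degrading to the slow path.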

+        env_path = REPO_ROOT / ".env"
+        if env_path.exists():
+            load_dotenv(env_path)
+    except ImportError:


… raises timeout

The MCP adapter spawns a fresh cgraph-mcp stdio server per call. When the caller shell did not export CODE_GRAPH_PY_RESOLVER, the spawned server fell back to the legacy jedi/multilspy resolver, which runs 'python -m venv && pip install poetry && poetry install' per repo and then analyzes the full transitive dep tree. On full SWE-bench worktrees this wedges for >15 min; we observed it timing out indexing sympy__sympy-20154 and sympy__sympy-19040 during a fresh Opus calibration run.

Mirror the start-api.sh policy: default CODE_GRAPH_PY_RESOLVER to tree_sitter in _env_for_mcp() so the MCP track is symmetric with HTTP regardless of caller env. Also bump the per-call timeout default 300s -> 900s in both the adapter (CGRAPH_MCP_TIMEOUT_SEC) and the cg-mcp CLI for headroom on cold MCP spawns over big repos.

Validated: sympy-20154 (591 .py files, ~49k nodes, ~344k edges) indexes end-to-end via MCP in 220 s with the new default, vs >900 s timeout before. HTTP path on the same repo: 95 s; ~2.3x slower over the stdio spawn is expected and well within the new timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+    # bash shim is invoked by the agent, the server's stderr is merged
+    # into the agent's tool-output buffer, inflating context by ~1.8kB
+    # per call. The agent only needs the JSON-RPC result on stdout.
+    devnull = open(os.devnull, "w")


gkorland and others added 30 commits March 14, 2026 19:55

Merge pull request #618 from FalkorDB/staging

4e3c11c

staging-->main

Revert "chore(mcp): regenerate uv.lock for anyio test extra"

4a05363

This reverts commit 0c7e3db.

bench: add --limit flag for quick single-instance runs

03c7a73

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dvirdukhan and others added 14 commits May 27, 2026 17:24

refactor(analyzers): extract TreeSitterAnalyzer base class (T15 #663)

4841701

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge dvirdukhan/mcp-smoke-combined into bench-combined for calibration

f9e8156