qa: add Browser Use v2 agent backend (recommended) alongside Claude subagent by ShawnPana · Pull Request #6 · browser-use/plugins

ShawnPana · 2026-06-19T01:52:07Z

What

/qa can now run with either of two backends, with the choice surfaced in SKILL.md and v2 recommended for real QA:

- Browser Use v2 cloud agent (recommended; built for QA). Hands the whole test to an autonomous Browser Use agent: judge mode (pass/fail vs expected behavior) + structuredOutput (forces the 1–5 score), server-side and parallelizable, with step-by-step screenshot evidence. Spends BU credits (~$0.01/task + ~$0.006/step + $0.02/hr browser).
- Claude Code subagent (existing). Drive browser-harness on a cloud browser yourself via references/methodology.md. No Browser Use task credits.
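For a rough sense of what the credit figures above add up to, here is a back-of-the-envelope sketch. The rates are the approximate ones quoted in this description; the helper name `estimate_task_cost` is illustrative and not part of any Browser Use API.

```python
# Approximate rates quoted in the PR description; treat them as estimates.
PER_TASK_USD = 0.01
PER_STEP_USD = 0.006
PER_BROWSER_HOUR_USD = 0.02


def estimate_task_cost(steps: int, browser_hours: float) -> float:
    """Estimate the credit spend of one v2 task (hypothetical helper)."""
    return PER_TASK_USD + steps * PER_STEP_USD + browser_hours * PER_BROWSER_HOUR_USD


# Example: a 3-step flow that holds the cloud browser for about a minute.
print(round(estimate_task_cost(steps=3, browser_hours=1 / 60), 4))
```

Because the per-step term dominates, the step count is the number to watch when fanning out many flows.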

Key resolution — rides on browser-harness

The v2 REST calls (POST /api/v2/tasks → poll GET /tasks/{id}) run inside a browser-harness heredoc, so they use the BROWSER_USE_API_KEY browser-harness already resolves (its .env / env / self-signup) — no separate key plumbing. The task runs on a Browser Use cloud browser, so no local Chrome is needed for the test; browser-harness is just the key store + HTTP runtime. (Plain curl with $BROWSER_USE_API_KEY works too.)
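A minimal sketch of what the `call()` helper behind those REST calls could look like, using only the standard library. The base URL and the `X-Browser-Use-API-Key` header name are assumptions to confirm against the Browser Use docs, and `build_request` is a hypothetical helper split out so the request can be inspected without sending it.

```python
import json
import os
import urllib.request

# ASSUMPTION: base URL and header name; confirm both against the Browser Use docs.
BASE_URL = "https://api.browser-use.com/api/v2"
API_KEY_HEADER = "X-Browser-Use-API-Key"


def build_request(method, path, body=None, api_key=None):
    """Build (but do not send) an authenticated v2 request. Hypothetical helper."""
    key = api_key or os.environ.get("BROWSER_USE_API_KEY")
    if not key:
        raise RuntimeError("BROWSER_USE_API_KEY is not set")
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header(API_KEY_HEADER, key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def call(method, path, body=None):
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(build_request(method, path, body)) as r:
        return json.load(r)
```

Inside a browser-harness heredoc the key is already resolved, so the environment lookup here only stands in for that step.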

Files

- qa/skills/qa/SKILL.md: new "Choose a backend" section (recommend v2, note credits) + backend-branched procedure.
- qa/skills/qa/references/browser-use-v2.md: the create→poll→report flow (judge + 1–5 structuredOutput schema), result→verdict mapping (judgeVerdict, structured score, steps[] screenshots, cost), credits note, and gotchas (localhost needs a tunnel, structuredOutput is a stringified schema, 429 concurrency cap).
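The "stringified schema" gotcha is easy to trip over, so here is a small sketch of a create payload. The `task`, `judge`, `structuredOutput`, and `maxSteps` keys come from the reference; the schema's field names (`score`, `summary`) and the example prompt are assumptions for illustration.

```python
import json

# Illustrative JSON Schema; the field names "score" and "summary" are assumptions.
score_schema = {
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["score", "summary"],
}

# structuredOutput is sent as a JSON *string*, not as a nested object.
SCORE_SCHEMA = json.dumps(score_schema)

payload = {
    "task": "Open the pricing page and confirm the plans render",  # example prompt
    "judge": True,
    "structuredOutput": SCORE_SCHEMA,
    "maxSteps": 50,  # a ceiling; the agent stops when it is done
}
```

Passing `score_schema` directly instead of `SCORE_SCHEMA` is the mistake this gotcha warns about.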

Notes

- The v2 task API is documented as "legacy" (docs steer new projects to v3), but the judge + structured-output evaluation features QA needs live on v2; this is called out in the reference.
- Built from the documented v2 API; the embedded flow compiles. Recommend a smoke-test against a real task before relying on it in anger.

🤖 Generated with Claude Code

Summary by cubic

Adds a Browser Use v2 cloud-agent backend to /qa with judge pass/fail and 1–5 scoring, and shifts the workflow by scope: single flow uses browser-harness; many flows fan out to subagents (v2 recommended). V2 runs via a browser-harness heredoc with BROWSER_USE_API_KEY; verdicts now trust judgeVerdict over the agent’s self-score, and the Claude subagent remains available.

New Features
- V2 agent: judge + 1–5 structured score, server-side with screenshots; spends credits (~$0.01/task + ~$0.006/step + $0.02/hr); tunnel localhost before startUrl; maxSteps: 50 is a ceiling.
- Fan-out + verdicts: create tasks and poll in parallel; mind 429 cap. judgeVerdict is authoritative (judge=false ⇒ fail and cap score ≤2); a flow passes only if judgeVerdict is true; overall score = weakest judge-corrected path.
- Docs: qa/skills/qa/SKILL.md adds “Single flow vs. fan-out” with scope-driven procedure. qa/skills/qa/references/browser-use-v2.md covers create→poll→report, explicit verdict mapping, a self-contained output template, v2 public-URL teardown no-op, and key verification.

Written for commit 1c01b97. Summary will update on new commits.

…ubagent

/qa can now run two ways, with the choice surfaced in SKILL.md:
- Browser Use v2 cloud agent (recommended, built for QA): autonomous agent with judge mode (pass/fail) + structured 1-5 output, server-side and parallel. Spends BU credits (~$0.01/task + steps + browser).
- Claude Code subagent (current): drive browser-harness yourself, no task credits.

The v2 calls run inside a browser-harness heredoc so they ride on the key browser-harness already stores (BROWSER_USE_API_KEY); no separate plumbing.

- SKILL.md: "Choose a backend" section + backend-branched procedure.
- references/browser-use-v2.md: create -> poll -> report flow (POST /api/v2/tasks with judge + structuredOutput score schema; GET /tasks/{id}), result mapping to the 1-5 verdict, cost/credits note, and gotchas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files


cubic-dev-ai · 2026-06-19T01:56:17Z

+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)


P2: Poll the lightweight /tasks/{id}/status endpoint during the loop instead of the full /tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 86:

<comment>Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll, which is wasteful for a 5s tick.</comment>

<file context>
@@ -0,0 +1,123 @@
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
</file context>
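One way to act on this comment, sketched under assumptions: poll the status endpoint on each tick and fetch the full task only once at the end. The `/tasks/{id}/status` path is taken from the review comment; that its response carries a `status` field is an assumption, and `wait_for_task` with its injected `call` and `sleep` arguments is a hypothetical shape chosen so the loop can be exercised without network access.

```python
import time

# Terminal states as used by the loop under review.
TERMINAL = ("finished", "failed", "stopped")


def wait_for_task(call, tid, interval=5.0, sleep=time.sleep):
    """Poll the status endpoint until terminal, then fetch the full task once."""
    while True:
        # ASSUMPTION: the status response exposes a "status" field.
        status = call("GET", "/tasks/" + tid + "/status")
        if status["status"] in TERMINAL:
            break
        sleep(interval)
    # Only now pay for the heavy payload (steps, screenshots, outputFiles).
    return call("GET", "/tasks/" + tid)
```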

…put template, teardown no-op note

From a cold skill e2e test (scored browser-use.com 5/5 via v2, $0.03):
- maxSteps -> 50 with a 'ceiling not cost lever' note (agent stops when done).
- Inline the verdict output template so the v2 backend doesn't require opening methodology.md.
- Note v2-public-URL teardown is a no-op; verify key before the billable create.
- Soften nextGoal (may be empty).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ny flows = subagents (v2 recommended)

Per design clarification: v2 agents are for FAN-OUT, not a single-test backend.
- SKILL.md: 'Single flow vs fan-out' decision. One flow drives browser-harness directly (methodology.md); many flows fan out to subagents, user's choice of Claude Code subagents or Browser Use v2 agents (recommended).
- Procedure is scope-driven; verdict aggregates per-flow scores (weakest path).
- browser-use-v2.md: add a parallel fan-out pattern (create all, poll all) + the 429 concurrency-cap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
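The 429 caveat in the fan-out pattern could be handled along these lines. This is a sketch, not the reference's code: `RateLimited` is a hypothetical exception that a `create` wrapper would raise on HTTP 429, and the retry count and delays are arbitrary illustrative defaults.

```python
import time


class RateLimited(Exception):
    """Hypothetical marker raised by `create` when the API answers HTTP 429."""


def create_all(create, flows, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Create one task per flow, backing off when the concurrency cap is hit."""
    ids = []
    for flow in flows:
        for attempt in range(max_retries + 1):
            try:
                ids.append(create(flow)["id"])
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    return ids
```

Once every task id is collected, the "poll all" half of the pattern can iterate over `ids` until each task reaches a terminal state.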

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).


cubic-dev-ai · 2026-06-19T02:04:41Z

+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})


P1: Tuple-unpacking _, c = call(...) will fail because call() returns a dict (from json.load(r)), not a tuple. Unpacking a dict iterates over its keys, so c would be assigned a key string like "id", making c["id"] raise TypeError.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 145:

<comment>Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</comment>

<file context>
@@ -108,10 +109,57 @@ Report exactly as `methodology.md`'s output format, sourced from the agent's res
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    ids.append((f["task"][:40], c["id"]))
+
</file context>

Suggested change

-    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
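To see the failure mode concretely, here is a small reproduction with a stand-in dict. The two keys mirror the `id` and `sessionId` fields used elsewhere in the reference; the values are made up. With exactly two keys the unpacking itself succeeds and the `TypeError` appears on the lookup; a dict with any other number of keys would instead raise `ValueError` at the unpacking line.

```python
# Stand-in for the dict that call() returns; the values are made up.
created = {"id": "task-123", "sessionId": "sess-456"}

try:
    _, c = created   # unpacks the KEYS: _ == "id", c == "sessionId"
    c["id"]          # indexing a str with a str raises TypeError
except TypeError as exc:
    error = exc

c = created          # the suggested fix: keep the dict itself
task_id = c["id"]
```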

… + mapping)

Live fan-out over x.ai's 6 navbar tabs caught the case this guards against: an agent self-scored x.ai/pricing 5/5 'fully functional', but the judge saw the page rendered blank and returned judgeVerdict=false. Aggregating on the self-score masked the failure.
- Mapping: judgeVerdict overrides the structured self-score; judge=false => the flow FAILED regardless (cap <=2, lead with failure_reason).
- Fan-out: a flow PASSES only if judgeVerdict is True; flag self-score/judge mismatches; overall = weakest judge-corrected score.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
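The mapping rules in this commit can be sketched as a small aggregation step. The input shape is an assumption: each flow is reduced to a flat dict with `judgeVerdict` and the agent's self-reported `score`, whereas the real task response may nest these differently. The helper names are hypothetical.

```python
def judge_corrected(result):
    """Return (passed, score) for one flow, letting the judge override the self-score."""
    passed = result.get("judgeVerdict") is True
    score = result["score"]
    if not passed:
        score = min(score, 2)  # judge=false caps the score at 2
    return passed, score


def aggregate(results):
    """Overall score is the weakest judge-corrected flow; also list mismatches."""
    corrected = {name: judge_corrected(r) for name, r in results.items()}
    # A mismatch is any flow whose self-score was lowered by the judge.
    mismatches = [
        name for name, r in results.items() if r["score"] != corrected[name][1]
    ]
    overall = min(score for _, score in corrected.values())
    return overall, mismatches
```

Fed the x.ai case above (a self-score of 5 with judgeVerdict false on the pricing flow), this yields an overall score of 2 and flags that flow as a mismatch, instead of reporting 5/5.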

cubic-dev-ai Bot reviewed Jun 19, 2026


ShawnPana and others added 2 commits June 18, 2026 18:59

cubic-dev-ai Bot reviewed Jun 19, 2026

ShawnPana wants to merge 4 commits into main from qa-v2-backend
