qa: add Browser Use v2 agent backend (recommended) alongside Claude subagent #6
ShawnPana wants to merge 4 commits into
…ubagent
/qa can now run two ways, with the choice surfaced in SKILL.md:
- Browser Use v2 cloud agent (recommended, built for QA): autonomous agent
with judge mode (pass/fail) + structured 1-5 output, server-side and
parallel. Spends BU credits (~$0.01/task + steps + browser).
- Claude Code subagent (current): drive browser-harness yourself, no task
credits.
The v2 calls run inside a browser-harness heredoc so they ride on the key
browser-harness already stores (BROWSER_USE_API_KEY) — no separate plumbing.
- SKILL.md: "Choose a backend" section + backend-branched procedure.
- references/browser-use-v2.md: create -> poll -> report flow (POST /api/v2/
tasks with judge + structuredOutput score schema; GET /tasks/{id}), result
mapping to the 1-5 verdict, cost/credits note, and gotchas.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
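A minimal sketch of the create payload this commit describes, assuming the field names that appear elsewhere in this PR (`task`, `judge`, `structuredOutput`, `maxSteps`). The schema contents and the `build_task_payload` helper are hypothetical and only illustrate the shape; the real schema lives in `references/browser-use-v2.md`. The PR notes that `structuredOutput` is a stringified schema, which is why it goes through `json.dumps`.

```python
import json

# Hypothetical 1-5 score schema, serialized to a string because the PR
# lists "structuredOutput is a stringified schema" as a gotcha.
SCORE_SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["score", "summary"],
})


def build_task_payload(task: str, max_steps: int = 50) -> dict:
    """Build the JSON body for POST /api/v2/tasks (illustrative helper)."""
    return {
        "task": task,
        "judge": True,                     # ask for a pass/fail verdict
        "structuredOutput": SCORE_SCHEMA,  # a JSON string, not a dict
        "maxSteps": max_steps,             # a ceiling; the agent stops when done
    }
```

Keeping payload construction separate from the HTTP call makes it easy to check the body before spending credits on a real create.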
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="qa/skills/qa/references/browser-use-v2.md">
<violation number="1" location="qa/skills/qa/references/browser-use-v2.md:86">
P2: Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</violation>
</file>
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:  # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
P2: Poll the lightweight /tasks/{id}/status endpoint during the loop instead of the full /tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 86:
<comment>Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</comment>
<file context>
@@ -0,0 +1,123 @@
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:  # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
</file context>
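One way the suggested fix could look, as a sketch only: poll `/tasks/{id}/status` inside the loop and fetch the full `/tasks/{id}` once at the end. It assumes the status endpoint returns a JSON object with a `status` field, which the review comment implies but does not show. `wait_for_task` is a hypothetical name, and `call` is passed in so the helper matches the document's `call(method, path)` shape without depending on its implementation.

```python
import time

TERMINAL = ("finished", "failed", "stopped")


def wait_for_task(call, tid, interval=5.0, sleep=time.sleep):
    """Poll the lightweight status endpoint, then fetch the full task once."""
    while True:
        # Assumption: the status response carries a "status" field.
        s = call("GET", "/tasks/" + tid + "/status")
        if s["status"] in TERMINAL:
            break
        sleep(interval)
    # Only now load steps, screenshots and outputFiles.
    return call("GET", "/tasks/" + tid)
```

The full task is requested exactly once, so the 5s tick no longer pulls the heavy payload on every iteration.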
…put template, teardown no-op note
From a cold skill e2e test (scored browser-use.com 5/5 via v2, $0.03):
- maxSteps -> 50 with a 'ceiling not cost lever' note (agent stops when done).
- Inline the verdict output template so the v2 backend doesn't require opening methodology.md.
- Note v2-public-URL teardown is a no-op; verify key before the billable create.
- Soften nextGoal (may be empty).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ny flows = subagents (v2 recommended)
Per design clarification: v2 agents are for FAN-OUT, not a single-test backend.
- SKILL.md: 'Single flow vs fan-out' decision. One flow drives browser-harness directly (methodology.md); many flows fan out to subagents, with the user's choice of Claude Code subagents or Browser Use v2 agents (recommended).
- Procedure is scope-driven; verdict aggregates per-flow scores (weakest path).
- browser-use-v2.md: add a parallel fan-out pattern (create all, poll all) + the 429 concurrency-cap caveat.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
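The "create all" half of the fan-out, with the 429 caveat handled, might be sketched as follows. This assumes `call` is built on `urllib` and therefore raises `urllib.error.HTTPError` on a 429 response; that is a guess based on the `json.load(r)` mentioned in the review, not something the PR states. `create_all`, the retry count and the linear backoff are all hypothetical choices.

```python
import time
import urllib.error


def create_all(call, payloads, retries=3, backoff=2.0, sleep=time.sleep):
    """Create one task per payload, backing off when the API answers 429."""
    ids = []
    for payload in payloads:
        for attempt in range(retries + 1):
            try:
                created = call("POST", "/tasks", payload)
                ids.append(created["id"])
                break
            except urllib.error.HTTPError as e:
                # Re-raise anything that is not the concurrency cap,
                # and give up once the retry budget is spent.
                if e.code != 429 or attempt == retries:
                    raise
                sleep(backoff * (attempt + 1))
    return ids
```

The returned ids can then be polled one by one until every task reaches a terminal state.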
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="qa/skills/qa/references/browser-use-v2.md">
<violation number="1" location="qa/skills/qa/references/browser-use-v2.md:145">
P1: Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</violation>
</file>
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
P1: Tuple-unpacking _, c = call(...) will fail because call() returns a dict (from json.load(r)), not a tuple. Unpacking a dict iterates over its keys, so c would be assigned a key string like "id", making c["id"] raise TypeError.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 145:
<comment>Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</comment>
<file context>
@@ -108,10 +109,57 @@ Report exactly as `methodology.md`'s output format, sourced from the agent's res
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    ids.append((f["task"][:40], c["id"]))
+
</file context>
-    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
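A small reproduction of the failure mode, using a stand-in `call` that returns a two-key dict. The response shape (`id` plus `sessionId`) is inferred from the snippets in this PR and is an assumption. With exactly two keys the unpacking succeeds and binds `c` to the second key string, so the error only surfaces at `c["id"]`; with any other number of keys the unpacking itself would raise `ValueError`.

```python
def call(method, path, body=None):
    """Stand-in for the document's call(): returns a parsed JSON dict."""
    return {"id": "task-123", "sessionId": "sess-456"}


def buggy_create():
    _, c = call("POST", "/tasks", {})  # iterates the dict's keys
    return c["id"]                     # c is a str here, so this raises TypeError


def fixed_create():
    c = call("POST", "/tasks", {})     # keep the dict itself
    return c["id"]
```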
… + mapping)
Live fan-out over x.ai's 6 navbar tabs caught the case this guards against: an agent self-scored x.ai/pricing 5/5 'fully functional', but the judge saw the page rendered blank and returned judgeVerdict=false. Aggregating on the self-score masked the failure.
- Mapping: judgeVerdict overrides the structured self-score; judge=false => the flow FAILED regardless (cap <=2, lead with failure_reason).
- Fan-out: a flow PASSES only if judgeVerdict is True; flag self-score/judge mismatches; overall = weakest judge-corrected score.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
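The mapping rule in this commit can be sketched as below. The `judgeVerdict` name comes from the PR; the `score` and `name` keys, the helper names, and the "self-score of 4 or more counts as a mismatch" threshold are hypothetical choices made for illustration.

```python
def corrected_score(self_score, judge_verdict):
    """Apply the judge override: a failed judge caps the score at 2."""
    if judge_verdict is True:
        return self_score
    return min(self_score, 2)


def aggregate(flows):
    """flows: dicts with 'name', 'score' (self-score) and 'judgeVerdict'."""
    rows = []
    for f in flows:
        passed = f["judgeVerdict"] is True
        rows.append({
            "name": f["name"],
            "passed": passed,
            "score": corrected_score(f["score"], f["judgeVerdict"]),
            # Assumed threshold: a high self-score with a failed judge
            # is flagged as a self-score/judge mismatch.
            "mismatch": f["score"] >= 4 and not passed,
        })
    overall = min(r["score"] for r in rows)  # weakest path wins
    return overall, rows
```

Applied to the x.ai case above, a 5/5 self-score on the pricing page with `judgeVerdict` false is capped at 2 and drags the overall verdict down with it.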
What
`/qa` can now run with either of two backends, with the choice surfaced in `SKILL.md` and v2 recommended for real QA:
- Browser Use v2 cloud agent (recommended): autonomous agent with `judge` mode (pass/fail vs expected behavior) + `structuredOutput` (forces the 1–5 score), server-side and parallelizable, with step-by-step screenshot evidence. Spends BU credits (~$0.01/task + ~$0.006/step + $0.02/hr browser).
- Claude Code subagent (current): drive `browser-harness` on a cloud browser yourself via `references/methodology.md`. No Browser Use task credits.
Key resolution: rides on browser-harness
The v2 REST calls (`POST /api/v2/tasks` → poll `GET /tasks/{id}`) run inside a `browser-harness` heredoc, so they use the `BROWSER_USE_API_KEY` that browser-harness already resolves (its `.env` / env / self-signup), with no separate key plumbing. The task runs on a Browser Use cloud browser, so no local Chrome is needed for the test; browser-harness is just the key store + HTTP runtime. (Plain `curl` with `$BROWSER_USE_API_KEY` works too.)
Files
- `qa/skills/qa/SKILL.md`: new "Choose a backend" section (recommend v2, note credits) + backend-branched procedure.
- `qa/skills/qa/references/browser-use-v2.md`: the create→poll→report flow (judge + 1–5 `structuredOutput` schema), result→verdict mapping (`judgeVerdict`, structured score, `steps[]` screenshots, `cost`), credits note, and gotchas (localhost needs a tunnel, `structuredOutput` is a stringified schema, 429 concurrency cap).
Notes
🤖 Generated with Claude Code
Summary by cubic
Adds a Browser Use v2 cloud-agent backend to `/qa` with judge pass/fail and 1–5 scoring, and shifts the workflow by scope: single flow uses `browser-harness`; many flows fan out to subagents (v2 recommended). V2 runs via a `browser-harness` heredoc with `BROWSER_USE_API_KEY`; verdicts now trust `judgeVerdict` over the agent’s self-score, and the Claude subagent remains available.
- … `startUrl`; `maxSteps: 50` is a ceiling.
- … `429` cap.
- `judgeVerdict` is authoritative (judge=false ⇒ fail and cap score ≤2); a flow passes only if `judgeVerdict` is true; overall score = weakest judge-corrected path.
- `qa/skills/qa/SKILL.md` adds “Single flow vs. fan-out” with scope-driven procedure.
- `qa/skills/qa/references/browser-use-v2.md` covers create→poll→report, explicit verdict mapping, a self-contained output template, v2 public-URL teardown no-op, and key verification.
Written for commit 1c01b97. Summary will update on new commits.