qa: add Browser Use v2 agent backend (recommended) alongside Claude subagent #6

Open
ShawnPana wants to merge 4 commits into main from qa-v2-backend

@ShawnPana ShawnPana commented Jun 19, 2026

Collaborator

What

/qa can now run with either of two backends, with the choice surfaced in SKILL.md and v2 recommended for real QA:

  • Browser Use v2 cloud agent (recommended — built for QA). Hands the whole test to an autonomous Browser Use agent: judge mode (pass/fail vs expected behavior) + structuredOutput (forces the 1–5 score), server-side and parallelizable, with step-by-step screenshot evidence. Spends BU credits (~$0.01/task + ~$0.006/step + $0.02/hr browser).
  • Claude Code subagent (existing). Drive browser-harness on a cloud browser yourself via references/methodology.md. No Browser Use task credits.
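
The credit figures quoted for the v2 backend can be turned into a rough per-task estimate. The sketch below is only arithmetic on the approximate rates stated above; the function name and parameters are hypothetical, and actual billing may differ.

```python
# Rough cost estimate for one Browser Use v2 task, using the approximate
# rates quoted in this PR (assumed, not authoritative pricing).
def estimate_task_cost(steps: int, browser_hours: float,
                       per_task: float = 0.01,
                       per_step: float = 0.006,
                       per_browser_hour: float = 0.02) -> float:
    """Return an approximate cost in USD for a single task."""
    return per_task + per_step * steps + per_browser_hour * browser_hours


if __name__ == "__main__":
    # Example: a 10-step run that holds a browser for 3 minutes (0.05 h).
    print(round(estimate_task_cost(steps=10, browser_hours=0.05), 4))
```

Because the per-step term dominates for anything beyond a few steps, the step count is the main driver of the estimate.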

Key resolution — rides on browser-harness

The v2 REST calls (POST /api/v2/tasks → poll GET /tasks/{id}) run inside a browser-harness heredoc, so they use the BROWSER_USE_API_KEY browser-harness already resolves (its .env / env / self-signup) — no separate key plumbing. The task runs on a Browser Use cloud browser, so no local Chrome is needed for the test; browser-harness is just the key store + HTTP runtime. (Plain curl with $BROWSER_USE_API_KEY works too.)
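
For readers who want to see roughly what such a call helper could look like outside browser-harness, here is a minimal sketch using only the standard library. The base URL and the `X-Browser-Use-API-Key` header name are assumptions not stated in this PR, and `build_request` is a hypothetical helper; check both against the v2 API reference before use.

```python
import json
import os
import urllib.request

# Assumed base URL; this PR only names the paths (/api/v2/tasks, /tasks/{id}).
BASE = "https://api.browser-use.com/api/v2"


def build_request(method, path, body=None, api_key=None):
    """Build (but do not send) a request; the header name is an assumption."""
    key = api_key if api_key is not None else os.environ["BROWSER_USE_API_KEY"]
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        BASE + path,
        data=data,
        method=method,
        headers={"X-Browser-Use-API-Key": key,
                 "Content-Type": "application/json"},
    )


def call(method, path, body=None, api_key=None):
    """Send the request and return the decoded JSON body as a dict."""
    req = build_request(method, path, body, api_key)
    with urllib.request.urlopen(req, timeout=60) as r:
        return json.load(r)
```

Splitting request construction from sending keeps the billable network call in one place and lets the request shape be inspected offline.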

Files

  • qa/skills/qa/SKILL.md — new "Choose a backend" section (recommend v2, note credits) + backend-branched procedure.
  • qa/skills/qa/references/browser-use-v2.md — the create→poll→report flow (judge + 1–5 structuredOutput schema), result→verdict mapping (judgeVerdict, structured score, steps[] screenshots, cost), credits note, and gotchas (localhost needs a tunnel, structuredOutput is a stringified schema, 429 concurrency cap).
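
The "structuredOutput is a stringified schema" gotcha can be illustrated as follows. This is a sketch, not the schema shipped in the reference: the `score` and `summary` property names and the example task text are hypothetical, while `task`, `startUrl`, `judge`, `structuredOutput`, and `maxSteps` are the payload keys mentioned in this PR.

```python
import json

# Hypothetical 1-5 score schema; property names are illustrative only.
score_schema = {
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["score"],
}

# The gotcha: the API expects the schema as a JSON string, not a nested object.
SCORE_SCHEMA = json.dumps(score_schema)

payload = {
    "task": "Open the pricing page and confirm the plans render",  # example text
    "startUrl": "https://example.com",  # must be publicly reachable (tunnel localhost)
    "judge": True,
    "structuredOutput": SCORE_SCHEMA,
    "maxSteps": 50,  # a ceiling, not a cost lever
}
```

Passing the dict directly instead of `SCORE_SCHEMA` is the mistake the gotcha warns about.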

Notes

  • The v2 task API is documented as "legacy" (docs steer new projects to v3), but the judge + structured-output evaluation features QA needs live on v2 — called out in the reference.
  • Built from the documented v2 API; the embedded flow compiles. Recommend a smoke-test against a real task before relying on it in anger.

🤖 Generated with Claude Code


Summary by cubic

Adds a Browser Use v2 cloud-agent backend to /qa with judge pass/fail and 1–5 scoring, and shifts the workflow by scope: single flow uses browser-harness; many flows fan out to subagents (v2 recommended). V2 runs via a browser-harness heredoc with BROWSER_USE_API_KEY; verdicts now trust judgeVerdict over the agent’s self-score, and the Claude subagent remains available.

  • New Features
    • V2 agent: judge + 1–5 structured score, server-side with screenshots; spends credits (~$0.01/task + ~$0.006/step + $0.02/hr); tunnel localhost before startUrl; maxSteps: 50 is a ceiling.
    • Fan-out + verdicts: create tasks and poll in parallel; mind 429 cap. judgeVerdict is authoritative (judge=false ⇒ fail and cap score ≤2); a flow passes only if judgeVerdict is true; overall score = weakest judge-corrected path.
    • Docs: qa/skills/qa/SKILL.md adds “Single flow vs. fan-out” with scope-driven procedure. qa/skills/qa/references/browser-use-v2.md covers create→poll→report, explicit verdict mapping, a self-contained output template, v2 public-URL teardown no-op, and key verification.
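
The verdict rules summarized above (judge is authoritative, a failed judge caps the score at 2, overall is the weakest judge-corrected path) can be sketched as below. The dict keys `name`, `self_score`, and `judge_verdict` are hypothetical, and treating "self-score above 2 with a failed judge" as a mismatch is an assumption about how mismatches are flagged.

```python
def judged_score(self_score: int, judge_verdict: bool) -> int:
    """Judge-corrected score: a failed judge verdict caps the score at 2."""
    return self_score if judge_verdict is True else min(self_score, 2)


def aggregate(flows):
    """Return per-flow rows plus the overall (weakest corrected) score."""
    rows = []
    for f in flows:
        passed = f["judge_verdict"] is True  # a flow passes only on judge=True
        rows.append({
            "name": f["name"],
            "passed": passed,
            "score": judged_score(f["self_score"], f["judge_verdict"]),
            # Assumed mismatch rule: agent rated itself well, judge disagreed.
            "mismatch": (not passed) and f["self_score"] > 2,
        })
    overall = min(r["score"] for r in rows)
    return rows, overall


flows = [
    {"name": "home", "self_score": 5, "judge_verdict": True},
    {"name": "pricing", "self_score": 5, "judge_verdict": False},
]
rows, overall = aggregate(flows)
```

With these two flows, the second one fails despite its self-score of 5, and it alone determines the overall score.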

Written for commit 1c01b97. Summary will update on new commits.


…ubagent

/qa can now run two ways, with the choice surfaced in SKILL.md:
- Browser Use v2 cloud agent (recommended, built for QA): autonomous agent
  with judge mode (pass/fail) + structured 1-5 output, server-side and
  parallel. Spends BU credits (~$0.01/task + steps + browser).
- Claude Code subagent (current): drive browser-harness yourself, no task
  credits.

The v2 calls run inside a browser-harness heredoc so they ride on the key
browser-harness already stores (BROWSER_USE_API_KEY) — no separate plumbing.

- SKILL.md: "Choose a backend" section + backend-branched procedure.
- references/browser-use-v2.md: create -> poll -> report flow (POST /api/v2/
  tasks with judge + structuredOutput score schema; GET /tasks/{id}), result
  mapping to the 1-5 verdict, cost/credits note, and gotchas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="qa/skills/qa/references/browser-use-v2.md">

<violation number="1" location="qa/skills/qa/references/browser-use-v2.md:86">
P2: Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</violation>
</file>

print("created task", tid, "session", created["sessionId"], flush=True)

while True:  # poll to a terminal state
    t = call("GET", "/tasks/" + tid)

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

P2: Poll the lightweight /tasks/{id}/status endpoint during the loop instead of the full /tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 86:

<comment>Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</comment>

<file context>
@@ -0,0 +1,123 @@
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
</file context>
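
One way to act on this suggestion is to poll the status endpoint in the loop and fetch the full task once at the end. The sketch below takes the HTTP helper as a parameter so it can be exercised without network access; `wait_for_task` and `max_polls` are hypothetical, and the assumption that the status endpoint returns a dict with a `status` key is not confirmed by this PR.

```python
import time

TERMINAL = ("finished", "failed", "stopped")  # states used in the reviewed loop


def wait_for_task(call, tid, interval=5.0, sleep=time.sleep, max_polls=None):
    """Poll /tasks/{id}/status until terminal, then fetch the full task once.

    `call(method, path)` is assumed to return the decoded JSON body as a dict.
    """
    polls = 0
    while True:
        status = call("GET", "/tasks/" + tid + "/status")
        if status["status"] in TERMINAL:
            break
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("task %s not terminal after %d polls" % (tid, polls))
        sleep(interval)
    # A single heavy request for steps, screenshots, and output files.
    return call("GET", "/tasks/" + tid)
```

The `max_polls` guard is an addition for safety; the reviewed loop has no upper bound.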

ShawnPana and others added 2 commits June 18, 2026 18:59
…put template, teardown no-op note

From a cold skill e2e test (scored browser-use.com 5/5 via v2, $0.03):
- maxSteps -> 50 with a 'ceiling not cost lever' note (agent stops when done).
- Inline the verdict output template so the v2 backend doesn't require opening methodology.md.
- Note v2-public-URL teardown is a no-op; verify key before the billable create.
- Soften nextGoal (may be empty).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ny flows = subagents (v2 recommended)

Per design clarification: v2 agents are for FAN-OUT, not a single-test backend.
- SKILL.md: 'Single flow vs fan-out' decision — one flow drives browser-harness
  directly (methodology.md); many flows fan out to subagents, user's choice of
  Claude Code subagents or Browser Use v2 agents (recommended).
- Procedure is scope-driven; verdict aggregates per-flow scores (weakest path).
- browser-use-v2.md: add a parallel fan-out pattern (create all, poll all) +
  the 429 concurrency-cap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="qa/skills/qa/references/browser-use-v2.md">

<violation number="1" location="qa/skills/qa/references/browser-use-v2.md:145">
P1: Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</violation>
</file>

]
ids = []
for f in flows:
    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

P1: Tuple-unpacking _, c = call(...) will fail because call() returns a dict (from json.load(r)), not a tuple. Unpacking a dict iterates over its keys, so c would be assigned a key string like "id", making c["id"] raise TypeError.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 145:

<comment>Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</comment>

<file context>
@@ -108,10 +109,57 @@ Report exactly as `methodology.md`'s output format, sourced from the agent's res
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    ids.append((f["task"][:40], c["id"]))
+
</file context>
Suggested change
- _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+ c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
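
The failure mode can be reproduced without the API. The sketch below uses made-up response dicts (the `id` and `sessionId` keys appear in this PR; the values and the extra `status` key are hypothetical) and shows that the exact exception depends on how many keys the response has: the TypeError described above is the two-key case, while any other key count already fails at the unpacking step with a ValueError.

```python
def unpack_error(response):
    """Mimic `_, c = call(...)` followed by `c["id"]` on a dict response.

    Returns the name of the exception raised, or None if nothing failed.
    """
    try:
        _, c = response  # unpacking a dict iterates over its keys
        c["id"]          # c is a key string here, not the response dict
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


two_keys = {"id": "task_123", "sessionId": "sess_456"}
three_keys = {"id": "task_123", "sessionId": "sess_456", "status": "started"}

print(unpack_error(two_keys))    # indexing a str with "id" fails
print(unpack_error(three_keys))  # too many values to unpack
```

Either way the fan-out loop never reaches `ids.append(...)`, so assigning the dict directly, as in the suggested change, is the fix.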

… + mapping)

Live fan-out over x.ai's 6 navbar tabs caught the case this guards against:
an agent self-scored x.ai/pricing 5/5 'fully functional', but the judge saw
the page rendered blank and returned judgeVerdict=false. Aggregating on the
self-score masked the failure.

- Mapping: judgeVerdict overrides the structured self-score; judge=false => the
  flow FAILED regardless (cap <=2, lead with failure_reason).
- Fan-out: a flow PASSES only if judgeVerdict is True; flag self-score/judge
  mismatches; overall = weakest judge-corrected score.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>