browser-use · ShawnPana · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/qa/skills/qa/SKILL.md b/qa/skills/qa/SKILL.md
@@ -13,6 +13,17 @@ From the user's invocation (the text after `/qa`, or their message):
 - **Target** — a URL (`https://…`) or a local dev server (`localhost:5173`, `:3000`, "the app on 5173"). **Required** — if absent, ask for it before doing anything else.
 - **What to test** (optional) — a flow or focus ("the signup", "search + filters"). If omitted, test the most obvious happy path and say so in the report.
 
+## Single flow vs. fan-out (decide this first)
+
+**Scale the approach to the ask:**
+
+- **Testing one flow / one thing?** Don't bother with subagents — **drive `browser-harness` directly** yourself, following `references/methodology.md`. That's the right, lowest-overhead tool for a single test, and it's how the rest of this skill works.
+- **Testing many flows / a lot at once?** **Fan out to subagents — one per flow — so they run in parallel.** Here the user has a choice of subagent type (ask if unclear; **recommend v2**):
+  - **Browser Use v2 cloud agents — recommended.** Each flow becomes an autonomous v2 task with **`judge`** (pass/fail) + **`structuredOutput`** (1–5 score), running server-side and **in parallel**, returning step-by-step screenshot evidence. **Spends Browser Use credits** (~$0.01/task + ~$0.006/step + $0.02/hr browser). Per-task flow + how to fan out: `references/browser-use-v2.md`.
+  - **Your harness's built-in subagents** — spawn Claude Code subagents (the Agent tool), each driving `browser-harness` through `references/methodology.md`. No Browser Use *task* credits; uses your agent's own usage.
+
+Rule of thumb: **one flow → browser-harness directly; many flows → subagents (v2 recommended).** Either way `browser-harness` is required — as the direct driver, the subagent driver, the v2 key store, and the localhost tunnel.
+
 ## Dependency: browser-harness (required)
 
 This skill runs the test through **browser-harness** — a separate plugin + CLI. It is not optional; QA must run on a real Browser Use cloud browser, never the user's local Chrome.
@@ -34,8 +45,12 @@ Do not attempt to QA with anything other than browser-harness + a cloud browser.
 ## Procedure
 
 1. **Confirm the target is reachable** (`curl -s -o /dev/null -w "%{http_code}" <url>`), and identify what the app is (title, README) so you can frame a sensible test task.
-2. **Read `references/methodology.md`** in this skill directory and follow it exactly. It defines: how to get a Browser Use API key (or self-sign-up), how to tunnel a localhost app and point a cloud browser at it, the field-tested gotchas (host-header rewrite, proxy-off, the per-tab interstitial header, CORS-pinned APIs), the test loop, the 1-5 rubric, and the output format.
-3. **Run the test**, then **tear everything down** (stop the cloud browser so it stops billing; kill the tunnel).
-4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases tried, and screenshot evidence — exactly as `references/methodology.md` specifies.
+2. **Run it, scaled to the ask** (see "Single flow vs. fan-out" above):
+   - **One flow** → drive **browser-harness directly** per `references/methodology.md`: resolve the key, tunnel localhost, and run the test loop with the field-tested gotchas (host-header rewrite, proxy-off, per-tab interstitial header, CORS-pinned APIs).
+   - **Many flows → fan out, one subagent per flow:**
+     - **v2 agents (recommended)** → per `references/browser-use-v2.md`, create one task per flow (each with `judge` + a 1–5 `structuredOutput` schema), poll them all, and collect the verdicts. A `localhost` target still needs a tunnel (the cloud agent can't reach localhost) — tunnel it and pass the public `startUrl`.
+     - **Claude subagents** → spawn one Agent per flow, each following `references/methodology.md` on browser-harness.
+3. **Tear down** what you started — only the **one-flow / Claude** paths have anything to stop (cloud browser + tunnel). A v2 task on a public URL has nothing to tear down (its one-off session auto-closes).
+4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases, and evidence — per the rubric and output format in `references/methodology.md`. **Fanning out?** Give a per-flow `Score: N/5` line and an **overall score that reflects the weakest critical path** (don't average a broken flow up because others passed).
 
 Scale effort to the ask: a quick "does X work?" is a few interactions and one score; "thoroughly QA this" warrants more flows and edge cases. Keep the verdict honest, specific, and reproducible.
diff --git a/qa/skills/qa/references/browser-use-v2.md b/qa/skills/qa/references/browser-use-v2.md
@@ -0,0 +1,182 @@
+# Browser Use v2 agent backend (recommended for QA)
+
+Run the QA test as an autonomous **Browser Use cloud agent** instead of driving browser-harness
+step by step. It's purpose-built for QA: a **judge** evaluates pass/fail against expected
+behavior, and **structured output** forces the 1–5 score. It runs server-side, parallelizes, and
+returns step-by-step evidence (screenshots + actions).
+
+**Cost / credits:** the v2 agent spends Browser Use credits — about **$0.01 per task + ~$0.006 per
+step (LLM) + $0.02/hr browser**, drawn from the account's monthly allowance. (The Claude-subagent
+backend in `methodology.md` spends no Browser Use *task* credits.) Recommend v2 for real QA; fall
+back to the Claude subagent to avoid credits.
+
+> Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the **`judge` +
+> structured-output** evaluation features QA needs live on v2 (`POST /api/v2/tasks`), so that's
+> what this backend uses.
+
+## The endpoints
+
+- **Create:** `POST https://api.browser-use.com/api/v2/tasks` → `202 {id, sessionId}`
+- **Poll:** `GET https://api.browser-use.com/api/v2/tasks/{id}` → `status` ∈ `created → started →
+  finished | failed | stopped`, plus `output`, `judgeVerdict`, `judgement`, `steps[]`, `cost`.
+- Auth header on both: `X-Browser-Use-API-Key`.
+
+## Key resolution — via browser-harness (it stores the key)
+
+The v2 API authenticates with `BROWSER_USE_API_KEY` — the same key `methodology.md` step 0 resolves
+(browser-harness's `.env`, the process env, or self-signup). The cleanest way to use
+*browser-harness's stored key* is to run the calls **inside a `browser-harness` heredoc**, where
+the key is already loaded into `os.environ` — no separate plumbing, no re-exporting. (Plain `curl`
+with `$BROWSER_USE_API_KEY` also works if it's exported. The v2 task itself runs on a Browser Use
+cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key
+store + HTTP runtime.)
+
+## Flow: create → poll → report
+
+Fill in `task`, `startUrl` (the public URL — tunnel a localhost target first), and
+`judgeGroundTruth` (what success looks like), then run:
+
+```bash
+browser-harness <<'PY'
+import os, json, time, urllib.request, urllib.error
+KEY = os.environ.get("BROWSER_USE_API_KEY")
+assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0"
+BASE = "https://api.browser-use.com/api/v2"
+
+def call(method, path, body=None):
+    req = urllib.request.Request(
+        BASE + path,
+        data=json.dumps(body).encode() if body is not None else None,
+        method=method,
+        headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"},
+    )
+    try:
+        with urllib.request.urlopen(req) as r:
+            return json.load(r)
+    except urllib.error.HTTPError as e:
+        raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}")
+
+# 1-5 score schema (structuredOutput must be a *stringified* JSON schema)
+SCORE_SCHEMA = json.dumps({
+    "type": "object",
+    "properties": {
+        "score":   {"type": "integer", "minimum": 1, "maximum": 5},
+        "verdict": {"type": "string"},
+        "worked":  {"type": "array", "items": {"type": "string"}},
+        "issues":  {"type": "array", "items": {"type": "string"}},
+    },
+    "required": ["score", "verdict"],
+})
+
+created = call("POST", "/tasks", {
+    "task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether "
+            "it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'",
+    "startUrl": "https://PUBLIC-URL-UNDER-TEST",          # tunnel localhost first; pass the public URL
+    "judge": True,
+    "judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'",
+    "structuredOutput": SCORE_SCHEMA,
+    "maxSteps": 50,   # a CEILING, not a target — the agent stops when done, so this doesn't inflate cost;
+                      # 50 gives room for a multi-step flow. Raise for long flows; cost = steps actually taken.
+    # optional: "llm": "browser-use-2.0", "vision": True,
+    #           "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True}
+})
+tid = created["id"]
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
+    time.sleep(5)
+
+print(json.dumps({
+    "status":       t["status"],
+    "score_output": t.get("output"),       # the structuredOutput JSON → the 1-5 score object
+    "judgeVerdict": t.get("judgeVerdict"),  # True = passed the ground-truth check, False = failed
+    "judgement":    t.get("judgement"),     # judge's reasoning (stringified JSON report)
+    "cost_usd":     t.get("cost"),          # what this run spent
+    "num_steps":    len(t.get("steps") or []),
+}, indent=2))
+PY
+```
+
+## Mapping the result to the verdict
+
+Report exactly as `methodology.md`'s output format, sourced from the agent's result:
+
+- **`Score: N/5`** ← the `score` field of the structured `output` — **but `judgeVerdict` overrides it.**
+  The structured `score` is the agent's *self-report* and can be wrong (an agent will happily score a
+  blank page 5/5). **If `judgeVerdict` is `False`, the flow FAILED regardless of the self-score** — cap
+  it at ≤2 and lead with the judge's `failure_reason`. (Real example: an agent self-scored x.ai/pricing
+  5/5 "fully functional"; the judge saw the page rendered blank and returned `false`. Trust the judge.)
+- **Result / pass-fail** ← `judgeVerdict` (true = met the ground truth, false = didn't). The agent's
+  `score`/`isSuccess` are self-reports and are less reliable — **`judgeVerdict` is authoritative.**
+- **What worked / issues** ← the structured `worked` / `issues` arrays, cross-checked against
+  `judgement` (the judge's reasoning).
+- **Evidence** ← `steps[]`: each has `url`, `screenshotUrl`, `actions` (and sometimes `nextGoal` —
+  may be empty). Cite the `screenshotUrl`s of the key moments.
+- **Cost** ← surface `cost` so the user sees what the run spent.
+
+Report it in this format (the same one the Claude backend uses — self-contained here so you don't
+need to open `methodology.md`):
+
+```
+Score: N/5
+Task: <what you asked the agent to verify>
+Result: <pass/fail from judgeVerdict + one line>
+What worked:
+- <from the structured `worked` array / judgement>
+Issues:
+- [tag] <from `issues` / judgement; empty if none>
+Evidence: <key steps[].screenshotUrl links>
+Cost: $X.XX (Browser Use v2 agent, <n> steps)
+```
+
+## Fan out: many flows in parallel
+
+The whole point of v2 subagents is parallel coverage. To test several flows at once, **create all
+the tasks first** (each `POST /tasks` returns immediately with an `id`), then **poll them all** —
+they run concurrently in Browser Use cloud:
+
+```python
+flows = [
+    {"task": "Test signup: …", "startUrl": URL, "judgeGroundTruth": "Account created, lands on dashboard."},
+    {"task": "Test checkout: …", "startUrl": URL, "judgeGroundTruth": "Order confirmation shown."},
+    {"task": "Test search + filters: …", "startUrl": URL, "judgeGroundTruth": "Filtered results update."},
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
-    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
-    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    ids.append((f["task"][:40], c["id"]))
+
+results = {}
+while len(results) < len(ids):
+    for label, tid in ids:
+        if tid in results: continue
+        _, t = call("GET", "/tasks/" + tid)
+        if t["status"] in ("finished", "failed", "stopped"):
+            self_score = (json.loads(t["output"]).get("score") if t.get("output") else None)
+            passed = t.get("judgeVerdict") is True          # JUDGE is authoritative, not the self-score
+            results[tid] = {"label": label, "passed": passed, "self_score": self_score,
+                            "score": (self_score if passed else min(self_score or 2, 2)),  # judge=False caps it
+                            "cost": t.get("cost")}
+    time.sleep(5)
+# Per flow: PASS only if judgeVerdict is True. A flow where the agent self-scored high but
+# judgeVerdict is False is a *caught failure* — flag it and score it low.
+# Overall = the weakest flow (min of the judge-corrected scores) — never average a failed flow up.
+```
+
+Watch the **concurrent-session cap** (Free = 3): creating more than the cap at once yields `429` —
+batch the creates to stay under it.
+
+## Gotchas
+
+- **`structuredOutput` is a *string*** — pass `json.dumps(schema)`, not the schema object.
+- **localhost isn't reachable** by the cloud agent — tunnel it (ngrok, per `methodology.md`) and
+  pass the public `startUrl`; for a free-ngrok host, tell the agent in the `task` to click through
+  any "You are about to visit" interstitial.
+- **`429 TooManyConcurrentActiveSessionsError`** — the account hit its concurrent-session cap
+  (Free = 3); wait or stop other sessions.
+- **`maxSteps` is a safety ceiling, not a cost lever** — the agent stops when the task is complete, so the cap doesn't drive cost (this run capped at 15 but used 5 steps). Keep ~50 for headroom; raise for long flows.
+- **Teardown is a no-op for the v2 path on a public URL** — the one-off session auto-closes and there's no tunnel to kill. (Only the Claude/localhost path needs teardown.)
+- **Verify the key resolves before the billable create** — the snippet's `assert KEY` does this; if it's missing, resolve it per `methodology.md` step 0 *before* calling `POST /tasks`.