Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 18 additions & 3 deletions qa/skills/qa/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,17 @@ From the user's invocation (the text after `/qa`, or their message):
- **Target** — a URL (`https://…`) or a local dev server (`localhost:5173`, `:3000`, "the app on 5173"). **Required** — if absent, ask for it before doing anything else.
- **What to test** (optional) — a flow or focus ("the signup", "search + filters"). If omitted, test the most obvious happy path and say so in the report.

## Single flow vs. fan-out (decide this first)

**Scale the approach to the ask:**

- **Testing one flow / one thing?** Don't bother with subagents — **drive `browser-harness` directly** yourself, following `references/methodology.md`. That's the right, lowest-overhead tool for a single test, and it's how the rest of this skill works.
- **Testing many flows / a lot at once?** **Fan out to subagents — one per flow — so they run in parallel.** Here the user has a choice of subagent type (ask if unclear; **recommend v2**):
- **Browser Use v2 cloud agents — recommended.** Each flow becomes an autonomous v2 task with **`judge`** (pass/fail) + **`structuredOutput`** (1–5 score), running server-side and **in parallel**, returning step-by-step screenshot evidence. **Spends Browser Use credits** (~$0.01/task + ~$0.006/step + $0.02/hr browser). Per-task flow + how to fan out: `references/browser-use-v2.md`.
- **Your harness's built-in subagents** — spawn Claude Code subagents (the Agent tool), each driving `browser-harness` through `references/methodology.md`. No Browser Use *task* credits; uses your agent's own usage.

Rule of thumb: **one flow → browser-harness directly; many flows → subagents (v2 recommended).** Either way `browser-harness` is required — as the direct driver, the subagent driver, the v2 key store, and the localhost tunnel.

## Dependency: browser-harness (required)

This skill runs the test through **browser-harness** — a separate plugin + CLI. It is not optional; QA must run on a real Browser Use cloud browser, never the user's local Chrome.
Expand All @@ -34,8 +45,12 @@ Do not attempt to QA with anything other than browser-harness + a cloud browser.
## Procedure

1. **Confirm the target is reachable** (`curl -s -o /dev/null -w "%{http_code}" <url>`), and identify what the app is (title, README) so you can frame a sensible test task.
2. **Read `references/methodology.md`** in this skill directory and follow it exactly. It defines: how to get a Browser Use API key (or self-sign-up), how to tunnel a localhost app and point a cloud browser at it, the field-tested gotchas (host-header rewrite, proxy-off, the per-tab interstitial header, CORS-pinned APIs), the test loop, the 1-5 rubric, and the output format.
3. **Run the test**, then **tear everything down** (stop the cloud browser so it stops billing; kill the tunnel).
4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases tried, and screenshot evidence — exactly as `references/methodology.md` specifies.
2. **Run it, scaled to the ask** (see "Single flow vs. fan-out" above):
- **One flow** → drive **browser-harness directly** per `references/methodology.md`: resolve the key, tunnel localhost, and run the test loop with the field-tested gotchas (host-header rewrite, proxy-off, per-tab interstitial header, CORS-pinned APIs).
- **Many flows → fan out, one subagent per flow:**
- **v2 agents (recommended)** → per `references/browser-use-v2.md`, create one task per flow (each with `judge` + a 1–5 `structuredOutput` schema), poll them all, and collect the verdicts. A `localhost` target still needs a tunnel (the cloud agent can't reach localhost) — tunnel it and pass the public `startUrl`.
- **Claude subagents** → spawn one Agent per flow, each following `references/methodology.md` on browser-harness.
3. **Tear down** what you started — only the **one-flow / Claude** paths have anything to stop (cloud browser + tunnel). A v2 task on a public URL has nothing to tear down (its one-off session auto-closes).
4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases, and evidence — per the rubric and output format in `references/methodology.md`. **Fanning out?** Give a per-flow `Score: N/5` line and an **overall score that reflects the weakest critical path** (don't average a broken flow up because others passed).

Scale effort to the ask: a quick "does X work?" is a few interactions and one score; "thoroughly QA this" warrants more flows and edge cases. Keep the verdict honest, specific, and reproducible.
182 changes: 182 additions & 0 deletions qa/skills/qa/references/browser-use-v2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# Browser Use v2 agent backend (recommended for QA)

Run the QA test as an autonomous **Browser Use cloud agent** instead of driving browser-harness
step by step. It's purpose-built for QA: a **judge** evaluates pass/fail against expected
behavior, and **structured output** forces the 1–5 score. It runs server-side, parallelizes, and
returns step-by-step evidence (screenshots + actions).

**Cost / credits:** the v2 agent spends Browser Use credits — about **$0.01 per task + ~$0.006 per
step (LLM) + $0.02/hr browser**, drawn from the account's monthly allowance. (The Claude-subagent
backend in `methodology.md` spends no Browser Use *task* credits.) Recommend v2 for real QA; fall
back to the Claude subagent to avoid credits.

> Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the **`judge` +
> structured-output** evaluation features QA needs live on v2 (`POST /api/v2/tasks`), so that's
> what this backend uses.

## The endpoints

- **Create:** `POST https://api.browser-use.com/api/v2/tasks` → `202 {id, sessionId}`
- **Poll:** `GET https://api.browser-use.com/api/v2/tasks/{id}` → `status` ∈ `created → started →
finished | failed | stopped`, plus `output`, `judgeVerdict`, `judgement`, `steps[]`, `cost`.
- Auth header on both: `X-Browser-Use-API-Key`.

## Key resolution — via browser-harness (it stores the key)

The v2 API authenticates with `BROWSER_USE_API_KEY` — the same key `methodology.md` step 0 resolves
(browser-harness's `.env`, the process env, or self-signup). The cleanest way to use
*browser-harness's stored key* is to run the calls **inside a `browser-harness` heredoc**, where
the key is already loaded into `os.environ` — no separate plumbing, no re-exporting. (Plain `curl`
with `$BROWSER_USE_API_KEY` also works if it's exported. The v2 task itself runs on a Browser Use
cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key
store + HTTP runtime.)

## Flow: create → poll → report

Fill in `task`, `startUrl` (the public URL — tunnel a localhost target first), and
`judgeGroundTruth` (what success looks like), then run:

```bash
browser-harness <<'PY'
import os, json, time, urllib.request, urllib.error
KEY = os.environ.get("BROWSER_USE_API_KEY")
assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0"
BASE = "https://api.browser-use.com/api/v2"

def call(method, path, body=None):
req = urllib.request.Request(
BASE + path,
data=json.dumps(body).encode() if body is not None else None,
method=method,
headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"},
)
try:
with urllib.request.urlopen(req) as r:
return json.load(r)
except urllib.error.HTTPError as e:
raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}")

# 1-5 score schema (structuredOutput must be a *stringified* JSON schema)
SCORE_SCHEMA = json.dumps({
"type": "object",
"properties": {
"score": {"type": "integer", "minimum": 1, "maximum": 5},
"verdict": {"type": "string"},
"worked": {"type": "array", "items": {"type": "string"}},
"issues": {"type": "array", "items": {"type": "string"}},
},
"required": ["score", "verdict"],
})

created = call("POST", "/tasks", {
"task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether "
"it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'",
"startUrl": "https://PUBLIC-URL-UNDER-TEST", # tunnel localhost first; pass the public URL
"judge": True,
"judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'",
"structuredOutput": SCORE_SCHEMA,
"maxSteps": 50, # a CEILING, not a target — the agent stops when done, so this doesn't inflate cost;
# 50 gives room for a multi-step flow. Raise for long flows; cost = steps actually taken.
# optional: "llm": "browser-use-2.0", "vision": True,
# "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True}
})
tid = created["id"]
print("created task", tid, "session", created["sessionId"], flush=True)

while True: # poll to a terminal state
t = call("GET", "/tasks/" + tid)

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Poll the lightweight /tasks/{id}/status endpoint during the loop instead of the full /tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 86:

<comment>Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</comment>

<file context>
@@ -0,0 +1,123 @@
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
</file context>
Fix with cubic

if t["status"] in ("finished", "failed", "stopped"):
break
time.sleep(5)

print(json.dumps({
"status": t["status"],
"score_output": t.get("output"), # the structuredOutput JSON → the 1-5 score object
"judgeVerdict": t.get("judgeVerdict"), # True = passed the ground-truth check, False = failed
"judgement": t.get("judgement"), # judge's reasoning (stringified JSON report)
"cost_usd": t.get("cost"), # what this run spent
"num_steps": len(t.get("steps") or []),
}, indent=2))
PY
```

## Mapping the result to the verdict

Report exactly as `methodology.md`'s output format, sourced from the agent's result:

- **`Score: N/5`** ← the `score` field of the structured `output` — **but `judgeVerdict` overrides it.**
The structured `score` is the agent's *self-report* and can be wrong (an agent will happily score a
blank page 5/5). **If `judgeVerdict` is `False`, the flow FAILED regardless of the self-score** — cap
it at ≤2 and lead with the judge's `failure_reason`. (Real example: an agent self-scored x.ai/pricing
5/5 "fully functional"; the judge saw the page rendered blank and returned `false`. Trust the judge.)
- **Result / pass-fail** ← `judgeVerdict` (true = met the ground truth, false = didn't). The agent's
`score`/`isSuccess` are self-reports and are less reliable — **`judgeVerdict` is authoritative.**
- **What worked / issues** ← the structured `worked` / `issues` arrays, cross-checked against
`judgement` (the judge's reasoning).
- **Evidence** ← `steps[]`: each has `url`, `screenshotUrl`, `actions` (and sometimes `nextGoal` —
may be empty). Cite the `screenshotUrl`s of the key moments.
- **Cost** ← surface `cost` so the user sees what the run spent.

Report it in this format (the same one the Claude backend uses — self-contained here so you don't
need to open `methodology.md`):

```
Score: N/5
Task: <what you asked the agent to verify>
Result: <pass/fail from judgeVerdict + one line>
What worked:
- <from the structured `worked` array / judgement>
Issues:
- [tag] <from `issues` / judgement; empty if none>
Evidence: <key steps[].screenshotUrl links>
Cost: $X.XX (Browser Use v2 agent, <n> steps)
```

## Fan out: many flows in parallel

The whole point of v2 subagents is parallel coverage. To test several flows at once, **create all
the tasks first** (each `POST /tasks` returns immediately with an `id`), then **poll them all** —
they run concurrently in Browser Use cloud:

```python
flows = [
{"task": "Test signup: …", "startUrl": URL, "judgeGroundTruth": "Account created, lands on dashboard."},
{"task": "Test checkout: …", "startUrl": URL, "judgeGroundTruth": "Order confirmation shown."},
{"task": "Test search + filters: …", "startUrl": URL, "judgeGroundTruth": "Filtered results update."},
]
ids = []
for f in flows:
_, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Tuple-unpacking _, c = call(...) will fail because call() returns a dict (from json.load(r)), not a tuple. Unpacking a dict iterates over its keys, so c would be assigned a key string like "id", making c["id"] raise TypeError.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 145:

<comment>Tuple-unpacking `_, c = call(...)` will fail because `call()` returns a dict (from `json.load(r)`), not a tuple. Unpacking a dict iterates over its keys, so `c` would be assigned a key string like "id", making `c["id"]` raise `TypeError`.</comment>

<file context>
@@ -108,10 +109,57 @@ Report exactly as `methodology.md`'s output format, sourced from the agent's res
+]
+ids = []
+for f in flows:
+    _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
+    ids.append((f["task"][:40], c["id"]))
+
</file context>
Suggested change
_, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
Fix with cubic

ids.append((f["task"][:40], c["id"]))

results = {}
while len(results) < len(ids):
for label, tid in ids:
if tid in results: continue
_, t = call("GET", "/tasks/" + tid)
if t["status"] in ("finished", "failed", "stopped"):
self_score = (json.loads(t["output"]).get("score") if t.get("output") else None)
passed = t.get("judgeVerdict") is True # JUDGE is authoritative, not the self-score
results[tid] = {"label": label, "passed": passed, "self_score": self_score,
"score": (self_score if passed else min(self_score or 2, 2)), # judge=False caps it
"cost": t.get("cost")}
time.sleep(5)
# Per flow: PASS only if judgeVerdict is True. A flow where the agent self-scored high but
# judgeVerdict is False is a *caught failure* — flag it and score it low.
# Overall = the weakest flow (min of the judge-corrected scores) — never average a failed flow up.
```

Watch the **concurrent-session cap** (Free = 3): creating more than the cap at once yields `429` —
batch the creates to stay under it.

## Gotchas

- **`structuredOutput` is a *string*** — pass `json.dumps(schema)`, not the schema object.
- **localhost isn't reachable** by the cloud agent — tunnel it (ngrok, per `methodology.md`) and
pass the public `startUrl`; for a free-ngrok host, tell the agent in the `task` to click through
any "You are about to visit" interstitial.
- **`429 TooManyConcurrentActiveSessionsError`** — the account hit its concurrent-session cap
(Free = 3); wait or stop other sessions.
- **`maxSteps` is a safety ceiling, not a cost lever** — the agent stops when the task is complete, so the cap doesn't drive cost (this run capped at 15 but used 5 steps). Keep ~50 for headroom; raise for long flows.
- **Teardown is a no-op for the v2 path on a public URL** — the one-off session auto-closes and there's no tunnel to kill. (Only the Claude/localhost path needs teardown.)
- **Verify the key resolves before the billable create** — the snippet's `assert KEY` does this; if it's missing, resolve it per `methodology.md` step 0 *before* calling `POST /tasks`.