diff --git a/qa/skills/qa/SKILL.md b/qa/skills/qa/SKILL.md index 14a42a3..8b4e52b 100644 --- a/qa/skills/qa/SKILL.md +++ b/qa/skills/qa/SKILL.md @@ -13,6 +13,17 @@ From the user's invocation (the text after `/qa`, or their message): - **Target** — a URL (`https://…`) or a local dev server (`localhost:5173`, `:3000`, "the app on 5173"). **Required** — if absent, ask for it before doing anything else. - **What to test** (optional) — a flow or focus ("the signup", "search + filters"). If omitted, test the most obvious happy path and say so in the report. +## Single flow vs. fan-out (decide this first) + +**Scale the approach to the ask:** + +- **Testing one flow / one thing?** Don't bother with subagents — **drive `browser-harness` directly** yourself, following `references/methodology.md`. That's the right, lowest-overhead tool for a single test, and it's how the rest of this skill works. +- **Testing many flows / a lot at once?** **Fan out to subagents — one per flow — so they run in parallel.** Here the user has a choice of subagent type (ask if unclear; **recommend v2**): + - **Browser Use v2 cloud agents — recommended.** Each flow becomes an autonomous v2 task with **`judge`** (pass/fail) + **`structuredOutput`** (1–5 score), running server-side and **in parallel**, returning step-by-step screenshot evidence. **Spends Browser Use credits** (~$0.01/task + ~$0.006/step + $0.02/hr browser). Per-task flow + how to fan out: `references/browser-use-v2.md`. + - **Your harness's built-in subagents** — spawn Claude Code subagents (the Agent tool), each driving `browser-harness` through `references/methodology.md`. No Browser Use *task* credits; uses your agent's own usage. + +Rule of thumb: **one flow → browser-harness directly; many flows → subagents (v2 recommended).** Either way `browser-harness` is required — as the direct driver, the subagent driver, the v2 key store, and the localhost tunnel. + ## Dependency: browser-harness (required) This skill runs the test through **browser-harness** — a separate plugin + CLI. It is not optional; QA must run on a real Browser Use cloud browser, never the user's local Chrome. @@ -34,8 +45,12 @@ Do not attempt to QA with anything other than browser-harness + a cloud browser. ## Procedure 1. **Confirm the target is reachable** (`curl -s -o /dev/null -w "%{http_code}" `), and identify what the app is (title, README) so you can frame a sensible test task. -2. **Read `references/methodology.md`** in this skill directory and follow it exactly. It defines: how to get a Browser Use API key (or self-sign-up), how to tunnel a localhost app and point a cloud browser at it, the field-tested gotchas (host-header rewrite, proxy-off, the per-tab interstitial header, CORS-pinned APIs), the test loop, the 1-5 rubric, and the output format. -3. **Run the test**, then **tear everything down** (stop the cloud browser so it stops billing; kill the tunnel). -4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases tried, and screenshot evidence — exactly as `references/methodology.md` specifies. +2. **Run it, scaled to the ask** (see "Single flow vs. fan-out" above): + - **One flow** → drive **browser-harness directly** per `references/methodology.md`: resolve the key, tunnel localhost, and run the test loop with the field-tested gotchas (host-header rewrite, proxy-off, per-tab interstitial header, CORS-pinned APIs). + - **Many flows → fan out, one subagent per flow:** + - **v2 agents (recommended)** → per `references/browser-use-v2.md`, create one task per flow (each with `judge` + a 1–5 `structuredOutput` schema), poll them all, and collect the verdicts. A `localhost` target still needs a tunnel (the cloud agent can't reach localhost) — tunnel it and pass the public `startUrl`. + - **Claude subagents** → spawn one Agent per flow, each following `references/methodology.md` on browser-harness. +3. **Tear down** what you started — only the **one-flow / Claude** paths have anything to stop (cloud browser + tunnel). A v2 task on a public URL has nothing to tear down (its one-off session auto-closes). +4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases, and evidence — per the rubric and output format in `references/methodology.md`. **Fanning out?** Give a per-flow `Score: N/5` line and an **overall score that reflects the weakest critical path** (don't average a broken flow up because others passed). Scale effort to the ask: a quick "does X work?" is a few interactions and one score; "thoroughly QA this" warrants more flows and edge cases. Keep the verdict honest, specific, and reproducible. diff --git a/qa/skills/qa/references/browser-use-v2.md b/qa/skills/qa/references/browser-use-v2.md new file mode 100644 index 0000000..42f7222 --- /dev/null +++ b/qa/skills/qa/references/browser-use-v2.md @@ -0,0 +1,182 @@ +# Browser Use v2 agent backend (recommended for QA) + +Run the QA test as an autonomous **Browser Use cloud agent** instead of driving browser-harness +step by step. It's purpose-built for QA: a **judge** evaluates pass/fail against expected +behavior, and **structured output** forces the 1–5 score. It runs server-side, parallelizes, and +returns step-by-step evidence (screenshots + actions). + +**Cost / credits:** the v2 agent spends Browser Use credits — about **$0.01 per task + ~$0.006 per +step (LLM) + $0.02/hr browser**, drawn from the account's monthly allowance. (The Claude-subagent +backend in `methodology.md` spends no Browser Use *task* credits.) Recommend v2 for real QA; fall +back to the Claude subagent to avoid credits. + +> Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the **`judge` + +> structured-output** evaluation features QA needs live on v2 (`POST /api/v2/tasks`), so that's +> what this backend uses. + +## The endpoints + +- **Create:** `POST https://api.browser-use.com/api/v2/tasks` → `202 {id, sessionId}` +- **Poll:** `GET https://api.browser-use.com/api/v2/tasks/{id}` → `status` ∈ `created → started → + finished | failed | stopped`, plus `output`, `judgeVerdict`, `judgement`, `steps[]`, `cost`. +- Auth header on both: `X-Browser-Use-API-Key`. + +## Key resolution — via browser-harness (it stores the key) + +The v2 API authenticates with `BROWSER_USE_API_KEY` — the same key `methodology.md` step 0 resolves +(browser-harness's `.env`, the process env, or self-signup). The cleanest way to use +*browser-harness's stored key* is to run the calls **inside a `browser-harness` heredoc**, where +the key is already loaded into `os.environ` — no separate plumbing, no re-exporting. (Plain `curl` +with `$BROWSER_USE_API_KEY` also works if it's exported. The v2 task itself runs on a Browser Use +cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key +store + HTTP runtime.) + +## Flow: create → poll → report + +Fill in `task`, `startUrl` (the public URL — tunnel a localhost target first), and +`judgeGroundTruth` (what success looks like), then run: + +```bash +browser-harness <<'PY' +import os, json, time, urllib.request, urllib.error +KEY = os.environ.get("BROWSER_USE_API_KEY") +assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0" +BASE = "https://api.browser-use.com/api/v2" + +def call(method, path, body=None): + req = urllib.request.Request( + BASE + path, + data=json.dumps(body).encode() if body is not None else None, + method=method, + headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"}, + ) + try: + with urllib.request.urlopen(req) as r: + return json.load(r) + except urllib.error.HTTPError as e: + raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}") + +# 1-5 score schema (structuredOutput must be a *stringified* JSON schema) +SCORE_SCHEMA = json.dumps({ + "type": "object", + "properties": { + "score": {"type": "integer", "minimum": 1, "maximum": 5}, + "verdict": {"type": "string"}, + "worked": {"type": "array", "items": {"type": "string"}}, + "issues": {"type": "array", "items": {"type": "string"}}, + }, + "required": ["score", "verdict"], +}) + +created = call("POST", "/tasks", { + "task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether " + "it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'", + "startUrl": "https://PUBLIC-URL-UNDER-TEST", # tunnel localhost first; pass the public URL + "judge": True, + "judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'", + "structuredOutput": SCORE_SCHEMA, + "maxSteps": 50, # a CEILING, not a target — the agent stops when done, so this doesn't inflate cost; + # 50 gives room for a multi-step flow. Raise for long flows; cost = steps actually taken. + # optional: "llm": "browser-use-2.0", "vision": True, + # "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True} +}) +tid = created["id"] +print("created task", tid, "session", created["sessionId"], flush=True) + +while True: # poll to a terminal state + t = call("GET", "/tasks/" + tid) + if t["status"] in ("finished", "failed", "stopped"): + break + time.sleep(5) + +print(json.dumps({ + "status": t["status"], + "score_output": t.get("output"), # the structuredOutput JSON → the 1-5 score object + "judgeVerdict": t.get("judgeVerdict"), # True = passed the ground-truth check, False = failed + "judgement": t.get("judgement"), # judge's reasoning (stringified JSON report) + "cost_usd": t.get("cost"), # what this run spent + "num_steps": len(t.get("steps") or []), +}, indent=2)) +PY +``` + +## Mapping the result to the verdict + +Report exactly as `methodology.md`'s output format, sourced from the agent's result: + +- **`Score: N/5`** ← the `score` field of the structured `output` — **but `judgeVerdict` overrides it.** + The structured `score` is the agent's *self-report* and can be wrong (an agent will happily score a + blank page 5/5). **If `judgeVerdict` is `False`, the flow FAILED regardless of the self-score** — cap + it at ≤2 and lead with the judge's `failure_reason`. (Real example: an agent self-scored x.ai/pricing + 5/5 "fully functional"; the judge saw the page rendered blank and returned `false`. Trust the judge.) +- **Result / pass-fail** ← `judgeVerdict` (true = met the ground truth, false = didn't). The agent's + `score`/`isSuccess` are self-reports and are less reliable — **`judgeVerdict` is authoritative.** +- **What worked / issues** ← the structured `worked` / `issues` arrays, cross-checked against + `judgement` (the judge's reasoning). +- **Evidence** ← `steps[]`: each has `url`, `screenshotUrl`, `actions` (and sometimes `nextGoal` — + may be empty). Cite the `screenshotUrl`s of the key moments. +- **Cost** ← surface `cost` so the user sees what the run spent. + +Report it in this format (the same one the Claude backend uses — self-contained here so you don't +need to open `methodology.md`): + +``` +Score: N/5 +Task: +Result: +What worked: +- +Issues: +- [tag] +Evidence: +Cost: $X.XX (Browser Use v2 agent, steps) +``` + +## Fan out: many flows in parallel + +The whole point of v2 subagents is parallel coverage. To test several flows at once, **create all +the tasks first** (each `POST /tasks` returns immediately with an `id`), then **poll them all** — +they run concurrently in Browser Use cloud: + +```python +flows = [ + {"task": "Test signup: …", "startUrl": URL, "judgeGroundTruth": "Account created, lands on dashboard."}, + {"task": "Test checkout: …", "startUrl": URL, "judgeGroundTruth": "Order confirmation shown."}, + {"task": "Test search + filters: …", "startUrl": URL, "judgeGroundTruth": "Filtered results update."}, +] +ids = [] +for f in flows: + _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50}) + ids.append((f["task"][:40], c["id"])) + +results = {} +while len(results) < len(ids): + for label, tid in ids: + if tid in results: continue + _, t = call("GET", "/tasks/" + tid) + if t["status"] in ("finished", "failed", "stopped"): + self_score = (json.loads(t["output"]).get("score") if t.get("output") else None) + passed = t.get("judgeVerdict") is True # JUDGE is authoritative, not the self-score + results[tid] = {"label": label, "passed": passed, "self_score": self_score, + "score": (self_score if passed else min(self_score or 2, 2)), # judge=False caps it + "cost": t.get("cost")} + time.sleep(5) +# Per flow: PASS only if judgeVerdict is True. A flow where the agent self-scored high but +# judgeVerdict is False is a *caught failure* — flag it and score it low. +# Overall = the weakest flow (min of the judge-corrected scores) — never average a failed flow up. +``` + +Watch the **concurrent-session cap** (Free = 3): creating more than the cap at once yields `429` — +batch the creates to stay under it. + +## Gotchas + +- **`structuredOutput` is a *string*** — pass `json.dumps(schema)`, not the schema object. +- **localhost isn't reachable** by the cloud agent — tunnel it (ngrok, per `methodology.md`) and + pass the public `startUrl`; for a free-ngrok host, tell the agent in the `task` to click through + any "You are about to visit" interstitial. +- **`429 TooManyConcurrentActiveSessionsError`** — the account hit its concurrent-session cap + (Free = 3); wait or stop other sessions. +- **`maxSteps` is a safety ceiling, not a cost lever** — the agent stops when the task is complete, so the cap doesn't drive cost (this run capped at 15 but used 5 steps). Keep ~50 for headroom; raise for long flows. +- **Teardown is a no-op for the v2 path on a public URL** — the one-off session auto-closes and there's no tunnel to kill. (Only the Claude/localhost path needs teardown.) +- **Verify the key resolves before the billable create** — the snippet's `assert KEY` does this; if it's missing, resolve it per `methodology.md` step 0 *before* calling `POST /tasks`.