Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions qa/skills/qa/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,15 @@ From the user's invocation (the text after `/qa`, or their message):
- **Target** — a URL (`https://…`) or a local dev server (`localhost:5173`, `:3000`, "the app on 5173"). **Required** — if absent, ask for it before doing anything else.
- **What to test** (optional) — a flow or focus ("the signup", "search + filters"). If omitted, test the most obvious happy path and say so in the report.

## Choose a backend (recommended: Browser Use v2 agent)

The test can run two ways — pick one (ask the user if it's unclear; **recommend v2** for real QA):

- **Browser Use v2 cloud agent — recommended, *built for QA*.** Hand the whole test to an autonomous Browser Use agent: it has a **judge** (pass/fail evaluation against expected behavior) and **structured output** (forces the 1–5 score schema), runs server-side, parallelizes, and returns step-by-step evidence with screenshots. **It spends Browser Use credits** (~$0.01/task + ~$0.006/step + $0.02/hr browser, from the account's monthly allowance). Flow: **`references/browser-use-v2.md`**.
- **Claude Code subagent — no task credits, full control.** You (or a spawned subagent) drive **browser-harness** on a cloud browser yourself, following the field-tested loop in **`references/methodology.md`**. Spends no Browser Use *task* credits — just cloud-browser time and your agent's own usage.

Default to **v2** (the judge + structured score is the right tool for scoring); fall back to the Claude subagent when the user would rather not spend credits or wants you driving directly. **Either way, `browser-harness` is required** — it's the key store for v2, and the test driver + localhost tunnel for the Claude path.

## Dependency: browser-harness (required)

This skill runs the test through **browser-harness** — a separate plugin + CLI. It is not optional; QA must run on a real Browser Use cloud browser, never the user's local Chrome.
Expand All @@ -34,8 +43,10 @@ Do not attempt to QA with anything other than browser-harness + a cloud browser.
## Procedure

1. **Confirm the target is reachable** (`curl -s -o /dev/null -w "%{http_code}" <url>`), and identify what the app is (title, README) so you can frame a sensible test task.
2. **Read `references/methodology.md`** in this skill directory and follow it exactly. It defines: how to get a Browser Use API key (or self-sign-up), how to tunnel a localhost app and point a cloud browser at it, the field-tested gotchas (host-header rewrite, proxy-off, the per-tab interstitial header, CORS-pinned APIs), the test loop, the 1-5 rubric, and the output format.
3. **Run the test**, then **tear everything down** (stop the cloud browser so it stops billing; kill the tunnel).
4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases tried, and screenshot evidence — exactly as `references/methodology.md` specifies.
2. **Run the test with the chosen backend:**
- **v2 agent** → read **`references/browser-use-v2.md`** and follow it: create the task (with `judge` + a 1–5 `structuredOutput` schema), poll to completion, and read the verdict from `judgeVerdict` + the structured score. A `localhost` target still needs a tunnel (the cloud agent can't reach localhost) — tunnel it per `methodology.md` and pass the public URL as `startUrl`.
- **Claude subagent** → read **`references/methodology.md`** and follow it exactly: get/resolve the key, tunnel localhost, drive the cloud browser through the test loop with the field-tested gotchas (host-header rewrite, proxy-off, per-tab interstitial header, CORS-pinned APIs).
3. **Tear everything down** (stop the cloud browser so it stops billing; kill the tunnel). The v2 agent's one-off session auto-closes.
4. **Return the verdict**: lead with `Score: N/5`, then task, result, what worked, issues (tagged), edge cases tried, and evidence — using the rubric and output format in `references/methodology.md` (both backends report the same way; for v2, the score/verdict/evidence come from the agent's structured output + `steps`).

Scale effort to the ask: a quick "does X work?" is a few interactions and one score; "thoroughly QA this" warrants more flows and edge cases. Keep the verdict honest, specific, and reproducible.
123 changes: 123 additions & 0 deletions qa/skills/qa/references/browser-use-v2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Browser Use v2 agent backend (recommended for QA)

Run the QA test as an autonomous **Browser Use cloud agent** instead of driving browser-harness
step by step. It's purpose-built for QA: a **judge** evaluates pass/fail against expected
behavior, and **structured output** forces the 1–5 score. It runs server-side, parallelizes, and
returns step-by-step evidence (screenshots + actions).

**Cost / credits:** the v2 agent spends Browser Use credits — about **$0.01 per task + ~$0.006 per
step (LLM) + $0.02/hr browser**, drawn from the account's monthly allowance. (The Claude-subagent
backend in `methodology.md` spends no Browser Use *task* credits.) Recommend v2 for real QA; fall
back to the Claude subagent to avoid credits.

> Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the **`judge` +
> structured-output** evaluation features QA needs live on v2 (`POST /api/v2/tasks`), so that's
> what this backend uses.

## The endpoints

- **Create:** `POST https://api.browser-use.com/api/v2/tasks` → `202 {id, sessionId}`
- **Poll:** `GET https://api.browser-use.com/api/v2/tasks/{id}` → `status` ∈ `created → started →
finished | failed | stopped`, plus `output`, `judgeVerdict`, `judgement`, `steps[]`, `cost`.
- Auth header on both: `X-Browser-Use-API-Key`.

## Key resolution — via browser-harness (it stores the key)

The v2 API authenticates with `BROWSER_USE_API_KEY` — the same key `methodology.md` step 0 resolves
(browser-harness's `.env`, the process env, or self-signup). The cleanest way to use
*browser-harness's stored key* is to run the calls **inside a `browser-harness` heredoc**, where
the key is already loaded into `os.environ` — no separate plumbing, no re-exporting. (Plain `curl`
with `$BROWSER_USE_API_KEY` also works if it's exported. The v2 task itself runs on a Browser Use
cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key
store + HTTP runtime.)

## Flow: create → poll → report

Fill in `task`, `startUrl` (the public URL — tunnel a localhost target first), and
`judgeGroundTruth` (what success looks like), then run:

```bash
browser-harness <<'PY'
import os, json, time, urllib.request, urllib.error
KEY = os.environ.get("BROWSER_USE_API_KEY")
assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0"
BASE = "https://api.browser-use.com/api/v2"

def call(method, path, body=None):
req = urllib.request.Request(
BASE + path,
data=json.dumps(body).encode() if body is not None else None,
method=method,
headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"},
)
try:
with urllib.request.urlopen(req) as r:
return json.load(r)
except urllib.error.HTTPError as e:
raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}")

# 1-5 score schema (structuredOutput must be a *stringified* JSON schema)
SCORE_SCHEMA = json.dumps({
"type": "object",
"properties": {
"score": {"type": "integer", "minimum": 1, "maximum": 5},
"verdict": {"type": "string"},
"worked": {"type": "array", "items": {"type": "string"}},
"issues": {"type": "array", "items": {"type": "string"}},
},
"required": ["score", "verdict"],
})

created = call("POST", "/tasks", {
"task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether "
"it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'",
"startUrl": "https://PUBLIC-URL-UNDER-TEST", # tunnel localhost first; pass the public URL
"judge": True,
"judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'",
"structuredOutput": SCORE_SCHEMA,
"maxSteps": 60,
# optional: "llm": "browser-use-2.0", "vision": True,
# "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True}
})
tid = created["id"]
print("created task", tid, "session", created["sessionId"], flush=True)

while True: # poll to a terminal state
t = call("GET", "/tasks/" + tid)

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Poll the lightweight /tasks/{id}/status endpoint during the loop instead of the full /tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At qa/skills/qa/references/browser-use-v2.md, line 86:

<comment>Poll the lightweight `/tasks/{id}/status` endpoint during the loop instead of the full `/tasks/{id}`. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.</comment>

<file context>
@@ -0,0 +1,123 @@
+print("created task", tid, "session", created["sessionId"], flush=True)
+
+while True:                                               # poll to a terminal state
+    t = call("GET", "/tasks/" + tid)
+    if t["status"] in ("finished", "failed", "stopped"):
+        break
</file context>
Fix with cubic

if t["status"] in ("finished", "failed", "stopped"):
break
time.sleep(5)

print(json.dumps({
"status": t["status"],
"score_output": t.get("output"), # the structuredOutput JSON → the 1-5 score object
"judgeVerdict": t.get("judgeVerdict"), # True = passed the ground-truth check, False = failed
"judgement": t.get("judgement"), # judge's reasoning (stringified JSON report)
"cost_usd": t.get("cost"), # what this run spent
"num_steps": len(t.get("steps") or []),
}, indent=2))
PY
```

## Mapping the result to the verdict

Report exactly as `methodology.md`'s output format, sourced from the agent's result:

- **`Score: N/5`** ← the `score` field of the structured `output`.
- **Result / pass-fail** ← `judgeVerdict` (true = met the ground truth, false = didn't); the agent's
`isSuccess` is its self-report and is less reliable — prefer `judgeVerdict`.
- **What worked / issues** ← the structured `worked` / `issues` arrays, cross-checked against
`judgement` (the judge's reasoning).
- **Evidence** ← `steps[]`: each has `url`, `screenshotUrl`, `actions`, `nextGoal`. Cite the
screenshot URLs of the key moments.
- **Cost** ← surface `cost` so the user sees what the run spent.

## Gotchas

- **`structuredOutput` is a *string*** — pass `json.dumps(schema)`, not the schema object.
- **localhost isn't reachable** by the cloud agent — tunnel it (ngrok, per `methodology.md`) and
pass the public `startUrl`; for a free-ngrok host, tell the agent in the `task` to click through
any "You are about to visit" interstitial.
- **`429 TooManyConcurrentActiveSessionsError`** — the account hit its concurrent-session cap
(Free = 3); wait or stop other sessions.
- **`maxSteps`** caps the run — bump it for multi-step flows, but it also caps cost.