-
Notifications
You must be signed in to change notification settings - Fork 0
qa: add Browser Use v2 agent backend (recommended) alongside Claude subagent #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ShawnPana
wants to merge
4
commits into
main
Choose a base branch
from
qa-v2-backend
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+200
−3
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
e871841
qa: add Browser Use v2 agent backend (recommended) alongside Claude s…
ShawnPana 442086a
qa/v2: clarify maxSteps is a ceiling (default 50), self-contained out…
ShawnPana ad6b6e6
qa: reframe as scope-driven — one flow = browser-harness directly; ma…
ShawnPana 1c01b97
qa/v2: make judgeVerdict authoritative over agent self-score (fan-out…
ShawnPana File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,182 @@ | ||||||
| # Browser Use v2 agent backend (recommended for QA) | ||||||
|
|
||||||
| Run the QA test as an autonomous **Browser Use cloud agent** instead of driving browser-harness | ||||||
| step by step. It's purpose-built for QA: a **judge** evaluates pass/fail against expected | ||||||
| behavior, and **structured output** forces the 1–5 score. It runs server-side, parallelizes, and | ||||||
| returns step-by-step evidence (screenshots + actions). | ||||||
|
|
||||||
| **Cost / credits:** the v2 agent spends Browser Use credits — about **$0.01 per task + ~$0.006 per | ||||||
| step (LLM) + $0.02/hr browser**, drawn from the account's monthly allowance. (The Claude-subagent | ||||||
| backend in `methodology.md` spends no Browser Use *task* credits.) Recommend v2 for real QA; fall | ||||||
| back to the Claude subagent to avoid credits. | ||||||
|
|
||||||
| > Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the **`judge` + | ||||||
| > structured-output** evaluation features QA needs live on v2 (`POST /api/v2/tasks`), so that's | ||||||
| > what this backend uses. | ||||||
|
|
||||||
| ## The endpoints | ||||||
|
|
||||||
| - **Create:** `POST https://api.browser-use.com/api/v2/tasks` → `202 {id, sessionId}` | ||||||
| - **Poll:** `GET https://api.browser-use.com/api/v2/tasks/{id}` → `status` ∈ `created → started → | ||||||
| finished | failed | stopped`, plus `output`, `judgeVerdict`, `judgement`, `steps[]`, `cost`. | ||||||
| - Auth header on both: `X-Browser-Use-API-Key`. | ||||||
|
|
||||||
| ## Key resolution — via browser-harness (it stores the key) | ||||||
|
|
||||||
| The v2 API authenticates with `BROWSER_USE_API_KEY` — the same key `methodology.md` step 0 resolves | ||||||
| (browser-harness's `.env`, the process env, or self-signup). The cleanest way to use | ||||||
| *browser-harness's stored key* is to run the calls **inside a `browser-harness` heredoc**, where | ||||||
| the key is already loaded into `os.environ` — no separate plumbing, no re-exporting. (Plain `curl` | ||||||
| with `$BROWSER_USE_API_KEY` also works if it's exported. The v2 task itself runs on a Browser Use | ||||||
| cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key | ||||||
| store + HTTP runtime.) | ||||||
|
|
||||||
| ## Flow: create → poll → report | ||||||
|
|
||||||
| Fill in `task`, `startUrl` (the public URL — tunnel a localhost target first), and | ||||||
| `judgeGroundTruth` (what success looks like), then run: | ||||||
|
|
||||||
| ```bash | ||||||
| browser-harness <<'PY' | ||||||
| import os, json, time, urllib.request, urllib.error | ||||||
| KEY = os.environ.get("BROWSER_USE_API_KEY") | ||||||
| assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0" | ||||||
| BASE = "https://api.browser-use.com/api/v2" | ||||||
|
|
||||||
| def call(method, path, body=None): | ||||||
| req = urllib.request.Request( | ||||||
| BASE + path, | ||||||
| data=json.dumps(body).encode() if body is not None else None, | ||||||
| method=method, | ||||||
| headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"}, | ||||||
| ) | ||||||
| try: | ||||||
| with urllib.request.urlopen(req) as r: | ||||||
| return json.load(r) | ||||||
| except urllib.error.HTTPError as e: | ||||||
| raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}") | ||||||
|
|
||||||
| # 1-5 score schema (structuredOutput must be a *stringified* JSON schema) | ||||||
| SCORE_SCHEMA = json.dumps({ | ||||||
| "type": "object", | ||||||
| "properties": { | ||||||
| "score": {"type": "integer", "minimum": 1, "maximum": 5}, | ||||||
| "verdict": {"type": "string"}, | ||||||
| "worked": {"type": "array", "items": {"type": "string"}}, | ||||||
| "issues": {"type": "array", "items": {"type": "string"}}, | ||||||
| }, | ||||||
| "required": ["score", "verdict"], | ||||||
| }) | ||||||
|
|
||||||
| created = call("POST", "/tasks", { | ||||||
| "task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether " | ||||||
| "it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'", | ||||||
| "startUrl": "https://PUBLIC-URL-UNDER-TEST", # tunnel localhost first; pass the public URL | ||||||
| "judge": True, | ||||||
| "judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'", | ||||||
| "structuredOutput": SCORE_SCHEMA, | ||||||
| "maxSteps": 50, # a CEILING, not a target — the agent stops when done, so this doesn't inflate cost; | ||||||
| # 50 gives room for a multi-step flow. Raise for long flows; cost = steps actually taken. | ||||||
| # optional: "llm": "browser-use-2.0", "vision": True, | ||||||
| # "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True} | ||||||
| }) | ||||||
| tid = created["id"] | ||||||
| print("created task", tid, "session", created["sessionId"], flush=True) | ||||||
|
|
||||||
| while True: # poll to a terminal state | ||||||
| t = call("GET", "/tasks/" + tid) | ||||||
| if t["status"] in ("finished", "failed", "stopped"): | ||||||
| break | ||||||
| time.sleep(5) | ||||||
|
|
||||||
| print(json.dumps({ | ||||||
| "status": t["status"], | ||||||
| "score_output": t.get("output"), # the structuredOutput JSON → the 1-5 score object | ||||||
| "judgeVerdict": t.get("judgeVerdict"), # True = passed the ground-truth check, False = failed | ||||||
| "judgement": t.get("judgement"), # judge's reasoning (stringified JSON report) | ||||||
| "cost_usd": t.get("cost"), # what this run spent | ||||||
| "num_steps": len(t.get("steps") or []), | ||||||
| }, indent=2)) | ||||||
| PY | ||||||
| ``` | ||||||
|
|
||||||
| ## Mapping the result to the verdict | ||||||
|
|
||||||
| Report exactly as `methodology.md`'s output format, sourced from the agent's result: | ||||||
|
|
||||||
| - **`Score: N/5`** ← the `score` field of the structured `output` — **but `judgeVerdict` overrides it.** | ||||||
| The structured `score` is the agent's *self-report* and can be wrong (an agent will happily score a | ||||||
| blank page 5/5). **If `judgeVerdict` is `False`, the flow FAILED regardless of the self-score** — cap | ||||||
| it at ≤2 and lead with the judge's `failure_reason`. (Real example: an agent self-scored x.ai/pricing | ||||||
| 5/5 "fully functional"; the judge saw the page rendered blank and returned `false`. Trust the judge.) | ||||||
| - **Result / pass-fail** ← `judgeVerdict` (true = met the ground truth, false = didn't). The agent's | ||||||
| `score`/`isSuccess` are self-reports and are less reliable — **`judgeVerdict` is authoritative.** | ||||||
| - **What worked / issues** ← the structured `worked` / `issues` arrays, cross-checked against | ||||||
| `judgement` (the judge's reasoning). | ||||||
| - **Evidence** ← `steps[]`: each has `url`, `screenshotUrl`, `actions` (and sometimes `nextGoal` — | ||||||
| may be empty). Cite the `screenshotUrl`s of the key moments. | ||||||
| - **Cost** ← surface `cost` so the user sees what the run spent. | ||||||
|
|
||||||
| Report it in this format (the same one the Claude backend uses — self-contained here so you don't | ||||||
| need to open `methodology.md`): | ||||||
|
|
||||||
| ``` | ||||||
| Score: N/5 | ||||||
| Task: <what you asked the agent to verify> | ||||||
| Result: <pass/fail from judgeVerdict + one line> | ||||||
| What worked: | ||||||
| - <from the structured `worked` array / judgement> | ||||||
| Issues: | ||||||
| - [tag] <from `issues` / judgement; empty if none> | ||||||
| Evidence: <key steps[].screenshotUrl links> | ||||||
| Cost: $X.XX (Browser Use v2 agent, <n> steps) | ||||||
| ``` | ||||||
|
|
||||||
| ## Fan out: many flows in parallel | ||||||
|
|
||||||
| The whole point of v2 subagents is parallel coverage. To test several flows at once, **create all | ||||||
| the tasks first** (each `POST /tasks` returns immediately with an `id`), then **poll them all** — | ||||||
| they run concurrently in Browser Use cloud: | ||||||
|
|
||||||
| ```python | ||||||
| flows = [ | ||||||
| {"task": "Test signup: …", "startUrl": URL, "judgeGroundTruth": "Account created, lands on dashboard."}, | ||||||
| {"task": "Test checkout: …", "startUrl": URL, "judgeGroundTruth": "Order confirmation shown."}, | ||||||
| {"task": "Test search + filters: …", "startUrl": URL, "judgeGroundTruth": "Filtered results update."}, | ||||||
| ] | ||||||
| ids = [] | ||||||
| for f in flows: | ||||||
| _, c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50}) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. P1: Tuple-unpacking Prompt for AI agents
Suggested change
|
||||||
| ids.append((f["task"][:40], c["id"])) | ||||||
|
|
||||||
| results = {} | ||||||
| while len(results) < len(ids): | ||||||
| for label, tid in ids: | ||||||
| if tid in results: continue | ||||||
| _, t = call("GET", "/tasks/" + tid) | ||||||
| if t["status"] in ("finished", "failed", "stopped"): | ||||||
| self_score = (json.loads(t["output"]).get("score") if t.get("output") else None) | ||||||
| passed = t.get("judgeVerdict") is True # JUDGE is authoritative, not the self-score | ||||||
| results[tid] = {"label": label, "passed": passed, "self_score": self_score, | ||||||
| "score": (self_score if passed else min(self_score or 2, 2)), # judge=False caps it | ||||||
| "cost": t.get("cost")} | ||||||
| time.sleep(5) | ||||||
| # Per flow: PASS only if judgeVerdict is True. A flow where the agent self-scored high but | ||||||
| # judgeVerdict is False is a *caught failure* — flag it and score it low. | ||||||
| # Overall = the weakest flow (min of the judge-corrected scores) — never average a failed flow up. | ||||||
| ``` | ||||||
|
|
||||||
| Watch the **concurrent-session cap** (Free = 3): creating more than the cap at once yields `429` — | ||||||
| batch the creates to stay under it. | ||||||
|
|
||||||
| ## Gotchas | ||||||
|
|
||||||
| - **`structuredOutput` is a *string*** — pass `json.dumps(schema)`, not the schema object. | ||||||
| - **localhost isn't reachable** by the cloud agent — tunnel it (ngrok, per `methodology.md`) and | ||||||
| pass the public `startUrl`; for a free-ngrok host, tell the agent in the `task` to click through | ||||||
| any "You are about to visit" interstitial. | ||||||
| - **`429 TooManyConcurrentActiveSessionsError`** — the account hit its concurrent-session cap | ||||||
| (Free = 3); wait or stop other sessions. | ||||||
| - **`maxSteps` is a safety ceiling, not a cost lever** — the agent stops when the task is complete, so the cap doesn't drive cost (this run capped at 15 but used 5 steps). Keep ~50 for headroom; raise for long flows. | ||||||
| - **Teardown is a no-op for the v2 path on a public URL** — the one-off session auto-closes and there's no tunnel to kill. (Only the Claude/localhost path needs teardown.) | ||||||
| - **Verify the key resolves before the billable create** — the snippet's `assert KEY` does this; if it's missing, resolve it per `methodology.md` step 0 *before* calling `POST /tasks`. | ||||||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Poll the lightweight
/tasks/{id}/statusendpoint during the loop instead of the full/tasks/{id}. The API explicitly recommends this pattern and the full endpoint loads steps, screenshots, and outputFiles on every poll — wasteful for a 5s tick.Prompt for AI agents