browser-use · thomwolf · May 26, 2026 · May 26, 2026 · May 26, 2026 · Jun 19, 2026
diff --git a/SKILL.md b/SKILL.md
@@ -74,6 +74,7 @@ Only if you start struggling with a specific mechanic while navigating, look in
 - `downloads.md`
 - `drag-and-drop.md`
 - `dropdowns.md`
+- `headless-automation.md`
 - `iframes.md`
 - `network-requests.md`
 - `print-as-pdf.md`
@@ -161,6 +162,8 @@ Chrome / Browser Use cloud -> CDP WS -> daemon.py -> /tmp/bu-<NAME>.sock -> run.
 ## Gotchas (field-tested)
 
 - **Chrome 144+ `chrome://inspect/#remote-debugging` does NOT serve `/json/version`.** Read `DevToolsActivePort` instead.
+- **Chrome 148+ silently ignores `--remote-debugging-port` on the default user-data-dir.** Stderr logs `DevTools remote debugging requires a non-default data directory`. For unattended automation, use a dedicated `--user-data-dir` and headless mode — see `interaction-skills/headless-automation.md`.
+- **`--headless=new` does not write `DevToolsActivePort`.** The harness's discovery loop won't find it. Synthesise the file after Chrome is up, or set `BU_CDP_WS` directly.
 - **Try attaching before asking for setup.** If `uv run browser-harness` already works, skip the remote-debugging instructions entirely. Decide what to escalate from the harness's error message, not from whether Chrome is visibly running.
 - **The remote-debugging checkbox is per-profile sticky in Chrome.** Once ticked on a profile, every future Chrome launch auto-enables CDP — only navigate to `chrome://inspect/#remote-debugging` when `DevToolsActivePort` is genuinely missing on a fresh profile.
 - **The first connect may block on Chrome's Allow dialog.** If setup hangs, explicitly tell the user to click `Allow` in Chrome if it appears, then keep polling for up to 30 seconds instead of treating follow-on errors as a new failure.
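
A minimal sketch of the "synthesise the file" workaround for `--headless=new`, assuming the browser WebSocket URL is already known (for example, from the `DevTools listening on ws://...` line Chrome prints at startup). The helper name `write_devtools_active_port` is hypothetical and not part of the harness. The two-line layout (port, then WebSocket path) mirrors how `daemon.py` splits the file.

```python
from pathlib import Path
from urllib.parse import urlparse


def write_devtools_active_port(profile_dir, ws_url):
    """Write DevToolsActivePort: port on line 1, WebSocket path on line 2."""
    parsed = urlparse(ws_url)
    if parsed.port is None or not parsed.path:
        raise ValueError(f"not a usable CDP WebSocket URL: {ws_url!r}")
    target = Path(profile_dir) / "DevToolsActivePort"
    # Same shape daemon.py expects when it calls .split("\n", 1).
    target.write_text(f"{parsed.port}\n{parsed.path}")
    return target
```

If writing into the profile directory is undesirable, setting `BU_CDP_WS` directly (as the gotcha notes) avoids the file altogether.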

diff --git a/daemon.py b/daemon.py
@@ -26,6 +26,7 @@ def _load_env():
 PID = f"/tmp/bu-{NAME}.pid"
 BUF = 500
 PROFILES = [
+    Path.home() / "Library/Application Support/Google/Chrome-CDP",
     Path.home() / "Library/Application Support/Google/Chrome",
     Path.home() / "Library/Application Support/Microsoft Edge",
     Path.home() / "Library/Application Support/Microsoft Edge Beta",
@@ -62,6 +63,9 @@ def get_ws_url():
             port, path = (base / "DevToolsActivePort").read_text().strip().split("\n", 1)
         except (FileNotFoundError, NotADirectoryError):
             continue
+        except ValueError:
+            # malformed/empty DevToolsActivePort (stale profile) — skip, try next
+            continue
         deadline = time.time() + 30
         while True:
             probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
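
To illustrate why the new `except ValueError` branch matters: an empty or single-line `DevToolsActivePort` cannot be unpacked into `port, path`. The sketch below isolates that parsing step in a hypothetical helper, `parse_devtools_active_port`, which does not exist in `daemon.py`. It returns `None` where the daemon would `continue` to the next profile.

```python
def parse_devtools_active_port(text):
    """Return (port, path) from DevToolsActivePort contents, or None if malformed."""
    try:
        port, path = text.strip().split("\n", 1)
    except ValueError:
        # Empty or one-line file: split() yields a single item, so unpacking fails.
        return None
    return port, path
```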

diff --git a/domain-skills/angellist/investor-portfolio.md b/domain-skills/angellist/investor-portfolio.md
@@ -0,0 +1,90 @@
+# AngelList — Investor / scout portfolio extraction
+
+## URL shape
+
+Fund LP view (the one scouts use to file + review deployments):
+```
+https://venture.angellist.com/v/lead/<fund-slug>/vehicle/<vehicle-slug>/investing
+```
+
+Individual investment detail panel opens **in the same URL** with `/investing/<investment_id>` appended. The panel is a right-side slide-over, not a new page.
+
+## Virtualized row list
+
+The investment table is **virtualized** — only ~40 rows render initially even when the fund has 50+ investments. To load the tail:
+
+```python
+for _ in range(8):
+    js("window.scrollTo(0, document.documentElement.scrollHeight)")
+    wait(0.8)
+```
+
+After that, all rows are in the DOM and `document.querySelectorAll('.styles_title__XcP6F')` returns the full list.
+
+## Rows are not anchors
+
+Rows are not wrapped in `<a>` tags. The click target is a React div with an `onclick` handler. Coordinate clicks **do not** reliably hit it, so use a DOM click instead:
+
+```python
+js("""
+(() => {
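+  // Find the title cell whose text matches the company name exactly.
+  const title = Array.from(document.querySelectorAll('.styles_title__XcP6F'))
+    .find(el => el.textContent.trim() === 'Berta Systems, Inc.');
+  // Guard against a missing row (e.g. the list tail has not been scrolled in yet).
+  const row = title?.closest('.styles_row__M6RnG');
+  if (!row) return 'NOT_FOUND';
+  row.click();
+  return 'OK';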
+})()
+""")
+```
+
+Stable selectors:
+- `.styles_title__XcP6F` — company name cell
+- `.styles_row__M6RnG` — the row container with the click handler
+- `.styles_rowWrapper__XFda_.styles_clickable__1_n5z` — the outer row wrapper (also clickable, same result)
+
+URL updates to `.../investing/<id>` after the click — a reliable "it worked" signal.
+
+## Detail panel — extraction
+
+The side panel has no stable outer class. Grab it by walking up from the "Investment in <name>" heading:
+
+```python
+js("""
+(() => {
+  const heading = Array.from(document.querySelectorAll('div,span,h1,h2,h3,h4,h5'))
+    .find(el => el.textContent?.trim().startsWith('Investment in'));
+  // No heading means no detail panel is open; bail out instead of throwing.
+  if (!heading) return null;
+  // Walk up until an ancestor has a plausible slide-over width (300-900 px).
+  let panel = heading;
+  for (let i = 0; i < 15; i++) {
+    if (!panel.parentElement) break;
+    panel = panel.parentElement;
+    if (panel.offsetWidth > 300 && panel.offsetWidth < 900) return panel.innerText;
+  }
+  return panel.innerText;
+})()
+""")
+```
+
+The panel `innerText` alternates label and value lines (label, newline, value), so it parses easily with regexes such as:
+```python
+re.search(r"\nFounders\n(.+?)(?=\n[A-Z][a-zA-Z ?\-]+\n|$)", panel, re.DOTALL)
+```
+
+Fields always present: `Investment Amount`, `Fund Thesis Match`, `Investing in` (instrument), `Round`, `Round Size`, `Conversion Cap`, `Discount`, `Pro-rata rights included?`, `Equity warrants ...?`, `Country of Incorporation`, `Type of Incorporation`, `Founders`, `Description`, `Which category does this deal fall into?`, `Please provide the founder's LinkedIn profile URL.`, `Company Website`.
+
+Sometimes present: `Notable Co-Investors` (can be `—`), `Reason for Investing` (can contain pitch deck URL).
+
+## Iteration pattern
+
+To extract all investments, click each row, extract the panel, and move on. The panel switches content on each click, so there is no need to close it between rows. Sleep ~1.5–2s between clicks for the panel to update.
+
+## Traps
+
+- **Cookie consent dialog** (OneTrust) has a huge hidden `innerText` blob. If you grab `document.body.innerText` without scoping to the panel, you'll get cookie consent copy, not investment data. Always target the panel via the "Investment in" heading.
+- **Word-joiner char** `\u2060` sneaks into founder names (esp. when the filer pasted from Slack). Strip before writing: `name.replace('\u2060', '')`.
+- **Founder separator** is usually `,` but sometimes `;` — split on `[,;]`.
+- **Role annotations** like `"Oliver Gilan (CEO)"` — strip `\s*\(.*?\)\s*` if you want plain names.
+- **Currency on round_size varies** (USD default, but also `£`, `€`, `CHF`). Don't assume USD.
+
+## Fund scope
+
+LP view only shows investments made from that specific vehicle. Thomas has multiple scout funds (Fund III, Fund IV) — each has its own URL. The `/investing` endpoint shows "fully deployed" and "no longer accepting submissions" for closed funds but the history remains visible.
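
The founder-related traps in `investor-portfolio.md` (word-joiner characters, mixed `,` / `;` separators, role annotations) can be combined into one clean-up step. The sketch below is a hedged illustration, not code from the PR: `parse_founders` is a hypothetical helper, and the sample text in the usage note is made up. The regex is the one quoted in the notes.

```python
import re

# Regex quoted in the notes: the Founders value runs until the next label line.
FOUNDERS_RE = re.compile(r"\nFounders\n(.+?)(?=\n[A-Z][a-zA-Z ?\-]+\n|$)", re.DOTALL)
# Role annotations such as "(CEO)".
ROLE_RE = re.compile(r"\s*\(.*?\)\s*")


def parse_founders(panel_text):
    """Return plain founder names from the detail panel's innerText."""
    match = FOUNDERS_RE.search(panel_text)
    if match is None:
        return []
    raw = match.group(1).replace("\u2060", "")  # strip word-joiner characters
    names = []
    for part in re.split(r"[,;]", raw):  # separator is usually ',' but sometimes ';'
        name = ROLE_RE.sub(" ", part).strip()
        if name:
            names.append(name)
    return names
```

For example, a panel containing `Founders`, then `Oliver Gilan (CEO), Jane Doe; Sam Lee (CTO)`, then a `Description` label would yield three plain names. Names that themselves contain a comma would be split incorrectly, which this sketch does not handle.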
diff --git a/domain-skills/x.com/articles.md b/domain-skills/x.com/articles.md
@@ -0,0 +1,67 @@
+# x.com — Long-form Articles (`/i/article/<id>`)
+
+How to extract X long-form articles ("X Articles", formerly Twitter Articles).
+Auth-walled — direct fetch / defuddle / Wayback all return near-empty HTML —
+so drive the user's logged-in Chrome via the harness.
+
+## URL patterns
+
+- Bookmark form: `https://x.com/i/article/<numeric_id>`
+- Canonical form (after redirect): `https://x.com/<author_handle>/article/<canonical_id>`
+
+The redirect happens at navigation time; either form lands on the same page.
+The canonical id often differs from the original — record both if you care.
+
+## DOM landmarks
+
+```
+[data-testid=twitter-article-title]       — title (clean)
+[data-testid=twitterArticleRichTextView]  — body (Draft.js, ~10k chars typical)
+[data-testid=twitterArticleReadView]      — wrapper that includes
+                                            title + author + body + stats noise
+[data-testid=User-Name]                   — author name + @handle + post date
+```
+
+Don't use `twitterArticleReadView` directly — its `innerText` includes the
+trailing engagement counters ("49\n174\n1.1K\n295K"). Use the title and
+rich-text views separately.
+
+## Body structure: Draft.js blocks
+
+The body is rendered by Draft.js — block-level structure is preserved in
+class names on the immediate descendants of `twitterArticleRichTextView`:
+
+```
+.longform-unstyled              → paragraph
+.longform-header-one            → # h1
+.longform-header-two            → ## h2
+.longform-header-three          → ### h3
+.longform-blockquote            → > quote
+.longform-unordered-list-item   → - item
+.longform-ordered-list-item     → 1. item
+.longform-image                 → contains <img src="…">
+```
+
+Inline formatting (bold, italic, links) lives inside the block as nested
+`<span style="font-weight: bold">` / `<span style="font-style: italic">` /
+`<a href="…">`. For a quick clip, `innerText` of each block is fine — it
+flattens to plain text but preserves paragraph breaks. For higher fidelity
+walk the spans and emit `**bold**` / `*italic*` / `[text](url)`.
+
+Watch out: the block's text nodes contain Unicode no-break spaces (U+00A0; the
+narrow variant is U+202F) in many places. Strip / normalize when comparing or counting words.
+
+## Reference implementation
+
+See `bookmark-sync/twitter/article.py` in the user's Projects dir.
+
+## Speed
+
+A single article (3-second post-load wait) clips in ~5s on a warm browser.
+46 articles in a row clipped without a hiccup.
+
+## Auth
+
+Requires the user to be logged in to X in the attached Chrome. Logged-out
+behavior: the page renders the article behind a login wall and the
+`twitterArticleRichTextView` element is absent — the extractor returns None.
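
The Draft.js class-to-Markdown table in `articles.md` can be applied as a simple lookup. The sketch below is an illustration under stated assumptions, not the reference implementation in `bookmark-sync/twitter/article.py`: the function name `blocks_to_markdown` is hypothetical, and it assumes the caller has already pulled `(class_name, text)` pairs out of the DOM, passing the `<img>` `src` as the text for `longform-image` blocks.

```python
# Prefixes taken from the class table in the notes.
BLOCK_PREFIX = {
    "longform-unstyled": "",
    "longform-header-one": "# ",
    "longform-header-two": "## ",
    "longform-header-three": "### ",
    "longform-blockquote": "> ",
    "longform-unordered-list-item": "- ",
}


def blocks_to_markdown(blocks):
    """Convert (class_name, text) pairs into a Markdown string."""
    lines = []
    ordered_index = 0
    for class_name, text in blocks:
        # Normalize no-break spaces (regular and narrow) to plain spaces.
        text = text.replace("\u00a0", " ").replace("\u202f", " ").strip()
        if class_name == "longform-ordered-list-item":
            ordered_index += 1
            lines.append(f"{ordered_index}. {text}")
            continue
        ordered_index = 0  # numbering restarts after any non-ordered block
        if class_name == "longform-image":
            lines.append(f"![]({text})")  # assumption: text holds the image src
            continue
        # Unknown classes fall back to a plain paragraph.
        lines.append(BLOCK_PREFIX.get(class_name, "") + text)
    return "\n\n".join(lines)
```

Inline formatting is flattened here, matching the "quick clip" path described in the notes. Walking the nested spans would be needed for bold, italic, and links.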
diff --git a/domain-skills/x.com/bookmarks.md b/domain-skills/x.com/bookmarks.md
@@ -0,0 +1,88 @@
+# x.com — Bookmarks (`/i/bookmarks`)
+
+How to extract a user's bookmarked tweets reliably.
+
+## URL & API
+
+- Page: `https://x.com/i/bookmarks`
+- Private GraphQL endpoint hit on initial load and on every pagination:
+  `https://x.com/i/api/graphql/<hash>/Bookmarks?variables=...&features=...`
+  - Match with `/\/Bookmarks\?/` — note: there is also a separate
+    `BookmarkFoldersSlice` request which is unrelated (folder list).
+  - The hash in `<hash>/Bookmarks` rotates; do not pin it.
+
+## Transport: XHR, not fetch
+
+X uses **`XMLHttpRequest`** for the Bookmarks call. Patching only `window.fetch`
+will see zero hits. Hook both XHR and fetch to be safe.
+
+## Inject the hook BEFORE page scripts
+
+`document.body`-level `js(...)` patches arrive too late — X has already issued
+the initial Bookmarks XHR by then. Use:
+
+```python
+cdp("Page.addScriptToEvaluateOnNewDocument", source=PATCH_JS)
+cdp("Page.reload", ignoreCache=False)   # if already on /i/bookmarks
+# or new_tab("https://x.com/i/bookmarks")
+wait_for_load()
+```
+
+`Page.addScriptToEvaluateOnNewDocument` runs before the page's own scripts on
+every navigation/reload, so the very first Bookmarks XHR is captured.
+
+## Pagination: `End` key, not mouseWheel
+
+X's lazy-load only fires on a real "scroll near bottom" signal. The harness's
+`scroll(x, y, dy=-N)` (CDP `Input.dispatchMouseEvent` mouseWheel) does **not**
+trigger pagination — confirmed: 150 mouseWheel iterations yielded only the
+initial 20 tweets.
+
+What works:
+
+```python
+js("window.scrollTo(0, document.body.scrollHeight)")
+press_key("End")
+time.sleep(1.5)
+```
+
+This consistently fires the next `/Bookmarks` XHR. Empirically 1.2–1.6s between
+scrolls is enough; any faster and X appears to coalesce the scroll triggers, so fewer requests fire.
+
+## Response shape
+
+```
+data.bookmark_timeline_v2.timeline.instructions[].entries[]
+  .content.entryType == "TimelineTimelineItem"
+  .content.itemContent.tweet_results.result
+      .rest_id                           # tweet id
+      .legacy.id_str
+      .legacy.full_text                  # tweet body
+      .legacy.created_at                 # "Mon Apr 20 04:25:49 +0000 2026"
+      .legacy.entities.urls[].expanded_url  # outbound links
+      .core.user_results.result.core.screen_name        # NEW path
+      .core.user_results.result.legacy.screen_name      # legacy path (also present)
+```
+
+Tombstoned tweets sometimes wrap the result in a `{"tweet": ...}` outer object
+— always do `tweet = ir.get("tweet", ir)` defensively.
+
+## What's NOT exposed
+
+- **`bookmarked_at`** — X does not expose when the user bookmarked the
+  tweet, only the tweet's own `created_at`. If you need to time-bound a sync
+  (e.g. "last 4 months"), use tweet `created_at` as a coarse signal and
+  tolerate a few consecutive old tweets before stopping (bookmark-time order
+  ≠ post-time order, so you can briefly slip below the cutoff and recover).
+
+## Stop conditions
+
+Two independent counters:
+- **consecutive_old**: tweets older than cutoff seen in a row → end of useful
+  range. ~4 tolerated before stopping.
+- **consecutive_empty**: scroll iterations that produced no new tweets → end
+  of bookmarks. ~5 tolerated before stopping.
+
+## Reference implementation
+
+See `bookmark-sync/twitter/pull.py` in the user's Projects dir.
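
The response shape and the two stop counters in `bookmarks.md` can be sketched together as follows. This is a hedged illustration, not the code in `bookmark-sync/twitter/pull.py`: `extract_tweets` and `StopTracker` are hypothetical names, the default thresholds (4 and 5) follow the approximate figures in the notes, and `cutoff` is assumed to be a timezone-aware `datetime`.

```python
from datetime import datetime


def extract_tweets(payload):
    """Flatten one Bookmarks GraphQL response into a list of small dicts."""
    timeline = payload["data"]["bookmark_timeline_v2"]["timeline"]
    tweets = []
    for instruction in timeline.get("instructions", []):
        for entry in instruction.get("entries", []):
            content = entry.get("content", {})
            if content.get("entryType") != "TimelineTimelineItem":
                continue  # skip non-tweet entries
            ir = content.get("itemContent", {}).get("tweet_results", {}).get("result")
            if not ir:
                continue
            tweet = ir.get("tweet", ir)  # unwrap the optional {"tweet": ...} layer
            legacy = tweet.get("legacy", {})
            user = tweet.get("core", {}).get("user_results", {}).get("result", {})
            # Prefer the new screen_name path, fall back to the legacy one.
            screen_name = (user.get("core", {}).get("screen_name")
                           or user.get("legacy", {}).get("screen_name"))
            created = legacy.get("created_at")
            tweets.append({
                "id": tweet.get("rest_id") or legacy.get("id_str"),
                "text": legacy.get("full_text", ""),
                "created_at": (datetime.strptime(created, "%a %b %d %H:%M:%S %z %Y")
                               if created else None),
                "screen_name": screen_name,
                "urls": [u["expanded_url"]
                         for u in legacy.get("entities", {}).get("urls", [])
                         if "expanded_url" in u],
            })
    return tweets


class StopTracker:
    """Track the two independent counters described under 'Stop conditions'."""

    def __init__(self, cutoff, max_old=4, max_empty=5):
        self.cutoff = cutoff
        self.max_old = max_old
        self.max_empty = max_empty
        self.consecutive_old = 0
        self.consecutive_empty = 0

    def should_stop(self, new_tweets):
        """Feed the new tweets from one scroll iteration; True means stop."""
        if not new_tweets:
            self.consecutive_empty += 1
            return self.consecutive_empty >= self.max_empty
        self.consecutive_empty = 0
        for tweet in new_tweets:
            created = tweet["created_at"]
            if created is not None and created < self.cutoff:
                self.consecutive_old += 1
                if self.consecutive_old >= self.max_old:
                    return True
            else:
                self.consecutive_old = 0  # a recent tweet resets the run
        return False
```

Resetting `consecutive_old` on any recent tweet is what lets a sync briefly slip below the cutoff and recover, since bookmark-time order differs from post-time order.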