3 changes: 3 additions & 0 deletions SKILL.md
@@ -74,6 +74,7 @@ Only if you start struggling with a specific mechanic while navigating, look in
- `downloads.md`
- `drag-and-drop.md`
- `dropdowns.md`
- `headless-automation.md`
- `iframes.md`
- `network-requests.md`
- `print-as-pdf.md`
@@ -161,6 +162,8 @@ Chrome / Browser Use cloud -> CDP WS -> daemon.py -> /tmp/bu-<NAME>.sock -> run.
## Gotchas (field-tested)

- **Chrome 144+ `chrome://inspect/#remote-debugging` does NOT serve `/json/version`.** Read `DevToolsActivePort` instead.
- **Chrome 148+ silently ignores `--remote-debugging-port` on the default user-data-dir.** Stderr logs `DevTools remote debugging requires a non-default data directory`. For unattended automation, use a dedicated `--user-data-dir` and headless mode — see `interaction-skills/headless-automation.md`.
- **`--headless=new` does not write `DevToolsActivePort`.** The harness's discovery loop won't find it. Synthesise the file after Chrome is up, or set `BU_CDP_WS` directly.
- **Try attaching before asking for setup.** If `uv run browser-harness` already works, skip the remote-debugging instructions entirely. Decide what to escalate from the harness's error message, not from whether Chrome is visibly running.
- **The remote-debugging checkbox is per-profile sticky in Chrome.** Once ticked on a profile, every future Chrome launch auto-enables CDP — only navigate to `chrome://inspect/#remote-debugging` when `DevToolsActivePort` is genuinely missing on a fresh profile.
- **The first connect may block on Chrome's Allow dialog.** If setup hangs, explicitly tell the user to click `Allow` in Chrome if it appears, then keep polling for up to 30 seconds instead of treating follow-on errors as a new failure.
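The `--headless=new` gotcha above suggests synthesising `DevToolsActivePort` by hand. A minimal sketch of that step follows, assuming you already hold the browser WebSocket URL (for example from the `DevTools listening on ws://...` line Chrome prints to stderr). The helper name `synthesise_devtools_active_port` is hypothetical and not part of the harness; the two-line layout (port, then path) mirrors what `get_ws_url()` in `daemon.py` splits on.

```python
from pathlib import Path
from urllib.parse import urlparse


def synthesise_devtools_active_port(profile_dir: Path, ws_url: str) -> Path:
    """Write a DevToolsActivePort file (hypothetical helper, not a harness API).

    Layout assumed here: line 1 is the port, line 2 is the browser path,
    e.g. "9222" and "/devtools/browser/<id>".
    """
    parsed = urlparse(ws_url)
    if parsed.scheme not in ("ws", "wss") or parsed.port is None or not parsed.path:
        raise ValueError(f"not a usable CDP WebSocket URL: {ws_url!r}")
    target = profile_dir / "DevToolsActivePort"
    target.write_text(f"{parsed.port}\n{parsed.path}\n")
    return target
```

If writing into the profile directory is not desirable, setting `BU_CDP_WS` to the same URL avoids the file entirely, as the gotcha notes.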
4 changes: 4 additions & 0 deletions daemon.py
@@ -26,6 +26,7 @@ def _load_env():
PID = f"/tmp/bu-{NAME}.pid"
BUF = 500
PROFILES = [
Path.home() / "Library/Application Support/Google/Chrome-CDP",
Path.home() / "Library/Application Support/Google/Chrome",
Path.home() / "Library/Application Support/Microsoft Edge",
Path.home() / "Library/Application Support/Microsoft Edge Beta",
@@ -62,6 +63,9 @@ def get_ws_url():
port, path = (base / "DevToolsActivePort").read_text().strip().split("\n", 1)
except (FileNotFoundError, NotADirectoryError):
continue
except ValueError:

@cubic-dev-ai cubic-dev-ai Bot Jun 19, 2026

P2: The except ValueError guard is incomplete: a two-line DevToolsActivePort with a non-numeric port (e.g. abc\n/devtools/browser/...) passes the split() step, then later crashes at int(port.strip()) inside the inner try that only catches OSError. This still aborts profile fallback, defeating the PR's goal of skipping all malformed files.


# malformed/empty DevToolsActivePort (stale profile) — skip, try next
continue
deadline = time.time() + 30
while True:
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
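One way to address the review comment on `except ValueError` is to do the split and the `int()` conversion inside the same guarded block, so every malformed file is skipped the same way. The sketch below is an illustration only, not the PR's code; `read_devtools_active_port` is a hypothetical helper name, and it assumes the caller treats `None` as "try the next profile".

```python
from pathlib import Path
from typing import Optional, Tuple


def read_devtools_active_port(profile_dir: Path) -> Optional[Tuple[int, str]]:
    """Return (port, path) or None if the file is missing or malformed."""
    try:
        raw = (profile_dir / "DevToolsActivePort").read_text()
    except (FileNotFoundError, NotADirectoryError):
        return None
    try:
        # Both failure modes raise ValueError: too few lines to unpack,
        # or a first line that is not an integer.
        port_line, path = raw.strip().split("\n", 1)
        port = int(port_line.strip())
    except ValueError:
        return None
    if not 0 < port < 65536:
        return None
    return port, path.strip()
```

With this shape, the profile loop can `continue` whenever the helper returns `None`, which covers the empty-file case and the non-numeric-port case the comment describes.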
90 changes: 90 additions & 0 deletions domain-skills/angellist/investor-portfolio.md
@@ -0,0 +1,90 @@
# AngelList — Investor / scout portfolio extraction

## URL shape

Fund LP view (the one scouts use to file + review deployments):
```
https://venture.angellist.com/v/lead/<fund-slug>/vehicle/<vehicle-slug>/investing
```

Individual investment detail panel opens **in the same URL** with `/investing/<investment_id>` appended. The panel is a right-side slide-over, not a new page.

## Virtualized row list

The investment table is **virtualized** — only ~40 rows render initially even when the fund has 50+ investments. To load the tail:

```python
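for _ in range(8):
    # Scroll to the bottom so the virtualized list renders the next batch.
    js("window.scrollTo(0, document.documentElement.scrollHeight)")
    wait(0.8)  # give the new rows time to mount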
```

After that, all rows are in the DOM and `document.querySelectorAll('.styles_title__XcP6F')` returns the full list.
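A fixed eight rounds may be too few or too many depending on fund size. The sketch below is a hedged variant that stops once the row count stops growing. It takes the harness's `js` and `wait` helpers as arguments so it can be exercised without a browser; the function name and the `settle_rounds` parameter are assumptions for illustration, and it assumes `js(...)` returns the evaluated value.

```python
from typing import Any, Callable


def scroll_until_stable(
    js: Callable[[str], Any],
    wait: Callable[[float], Any],
    max_rounds: int = 8,
    settle_rounds: int = 2,
    pause: float = 0.8,
) -> int:
    """Scroll to the bottom until the title-cell count stops changing."""
    count_expr = "document.querySelectorAll('.styles_title__XcP6F').length"
    last = -1
    stable = 0
    for _ in range(max_rounds):
        js("window.scrollTo(0, document.documentElement.scrollHeight)")
        wait(pause)
        count = int(js(count_expr))
        if count == last:
            stable += 1
            if stable >= settle_rounds:
                break  # no growth for settle_rounds consecutive checks
        else:
            stable = 0
            last = count
    return last
```

Inside the harness this would be called as `scroll_until_stable(js, wait)`; the return value is the last observed row count.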

## Rows are not anchors

Rows contain no `<a>` tags. The click target is a React div with an `onclick` handler, and coordinate clicks **do not** reliably hit it. Use a DOM click instead:

```python
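js("""
(() => {
  // Find the title cell whose text matches the company name exactly.
  const title = Array.from(document.querySelectorAll('.styles_title__XcP6F'))
    .find(el => el.textContent.trim() === 'Berta Systems, Inc.');
  if (!title) return 'NOT_FOUND';  // row not rendered yet, or name mismatch
  // The click handler lives on the row container, not on the title cell.
  const row = title.closest('.styles_row__M6RnG');
  if (!row) return 'NO_ROW';
  row.click();
  return 'OK';
})()
""")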
```

Stable selectors:
- `.styles_title__XcP6F` — company name cell
- `.styles_row__M6RnG` — the row container with the click handler
- `.styles_rowWrapper__XFda_.styles_clickable__1_n5z` — the outer row wrapper (also clickable, same result)

URL updates to `.../investing/<id>` after the click — a reliable "it worked" signal.

## Detail panel — extraction

The side panel has no stable outer class. Grab it by walking up from the "Investment in <name>" heading:

```python
js("""
(() => {
  const heading = Array.from(document.querySelectorAll('div,span,h1,h2,h3,h4,h5'))
    .find(el => el.textContent?.trim().startsWith('Investment in'));
  if (!heading) return null;  // panel is not open yet
  // Walk up until an ancestor has a plausible slide-over width (300-900px).
  let panel = heading;
  for (let i = 0; i < 15; i++) {
    if (!panel.parentElement) break;
    panel = panel.parentElement;
    if (panel.offsetWidth > 300 && panel.offsetWidth < 900) return panel.innerText;
  }
  return panel.innerText;  // fallback: the highest ancestor reached
})()
""")
```

The panel `innerText` is label-newline-value, easy to parse with regexes like:
```python
re.search(r"\nFounders\n(.+?)(?=\n[A-Z][a-zA-Z ?\-]+\n|$)", panel, re.DOTALL)
```

Fields always present: `Investment Amount`, `Fund Thesis Match`, `Investing in` (instrument), `Round`, `Round Size`, `Conversion Cap`, `Discount`, `Pro-rata rights included?`, `Equity warrants ...?`, `Country of Incorporation`, `Type of Incorporation`, `Founders`, `Description`, `Which category does this deal fall into?`, `Please provide the founder's LinkedIn profile URL.`, `Company Website`.

Sometimes present: `Notable Co-Investors` (can be `—`), `Reason for Investing` (can contain pitch deck URL).
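As an alternative to one regex per field, the label-newline-value layout can be split in a single pass when the label set is known. The sketch below is one possible approach, not the author's parser: `parse_panel` is a hypothetical name, the caller supplies the labels, and the dollar amount and round in the usage example are made-up sample values.

```python
from typing import Dict, List, Optional


def parse_panel(text: str, labels: List[str]) -> Dict[str, str]:
    """Split label-newline-value text into a dict keyed by label.

    A line that exactly equals a known label starts a new field; all
    following non-empty lines belong to it until the next label.
    """
    known = set(labels)
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in known:
            current = stripped
            fields[current] = []
        elif current is not None and stripped:
            fields[current].append(stripped)
    return {label: "\n".join(value) for label, value in fields.items()}


# Usage with invented sample values:
sample = (
    "Investment in Berta Systems, Inc.\n"
    "Investment Amount\n$25,000\n"
    "Round\nSeed\n"
    "Founders\nOliver Gilan (CEO)\n"
)
parsed = parse_panel(sample, ["Investment Amount", "Round", "Founders"])
```

One caveat with this design: a value line that happens to equal a label verbatim would be misread as a new field, so the regex approach may still be preferable for free-text fields such as `Description`.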

## Iteration pattern

To extract all investments, click each row, extract panel, move on. The panel switches content on each click — no need to close it between rows. Sleep ~1.5–2s between clicks for the panel to update.

## Traps

- **Cookie consent dialog** (OneTrust) has a huge hidden `innerText` blob. If you grab `document.body.innerText` without scoping to the panel, you'll get cookie consent copy, not investment data. Always target the panel via the "Investment in" heading.
- **Word-joiner char** `\u2060` sneaks into founder names (esp. when the filer pasted from Slack). Strip before writing: `name.replace('\u2060', '')`.
- **Founder separator** is usually `,` but sometimes `;` — split on `[,;]`.
- **Role annotations** like `"Oliver Gilan (CEO)"` — strip `\s*\(.*?\)\s*` if you want plain names.
- **Currency on round_size varies** (USD default, but also `£`, `€`, `CHF`). Don't assume USD.
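The founder-name traps above can be handled in one small helper. This is a minimal sketch under the assumptions stated in the list (word joiner `\u2060`, `,` or `;` separators, parenthesised role annotations); `clean_founders` is a hypothetical name. Roles are stripped before splitting so that a comma inside the parentheses does not produce a bogus name.

```python
import re
from typing import List

WORD_JOINER = "\u2060"
ROLE_RE = re.compile(r"\s*\(.*?\)\s*")


def clean_founders(raw: str) -> List[str]:
    """Return plain founder names from a raw 'Founders' field value."""
    text = raw.replace(WORD_JOINER, "")
    text = ROLE_RE.sub(" ", text)  # drop "(CEO)"-style annotations first
    names = []
    for part in re.split(r"[,;]", text):
        name = " ".join(part.split())  # collapse leftover whitespace
        if name:
            names.append(name)
    return names
```

For example, `clean_founders("Oliver Gilan (CEO); Jane Doe")` yields two plain names, where `Jane Doe` is a placeholder name used only for illustration.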

## Fund scope

LP view only shows investments made from that specific vehicle. Thomas has multiple scout funds (Fund III, Fund IV) — each has its own URL. The `/investing` endpoint shows "fully deployed" and "no longer accepting submissions" for closed funds but the history remains visible.
67 changes: 67 additions & 0 deletions domain-skills/x.com/articles.md
@@ -0,0 +1,67 @@
# x.com — Long-form Articles (`/i/article/<id>`)

How to extract X long-form articles ("X Articles", formerly Twitter Articles).
Auth-walled — direct fetch / defuddle / Wayback all return near-empty HTML —
so drive the user's logged-in Chrome via the harness.

## URL patterns

- Bookmark form: `https://x.com/i/article/<numeric_id>`
- Canonical form (after redirect): `https://x.com/<author_handle>/article/<canonical_id>`

The redirect happens at navigation time; either form lands on the same page.
The canonical id often differs from the original — record both if you care.

## DOM landmarks

```
[data-testid=twitter-article-title] — title (clean)
[data-testid=twitterArticleRichTextView] — body (Draft.js, ~10k chars typical)
[data-testid=twitterArticleReadView] — wrapper that includes
title + author + body + stats noise
[data-testid=User-Name] — author name + @handle + post date
```

Don't use `twitterArticleReadView` directly — its `innerText` includes the
trailing engagement counters ("49\n174\n1.1K\n295K"). Use the title and
rich-text views separately.

## Body structure: Draft.js blocks

The body is rendered by Draft.js — block-level structure is preserved in
class names on the immediate descendants of `twitterArticleRichTextView`:

```
.longform-unstyled → paragraph
.longform-header-one → # h1
.longform-header-two → ## h2
.longform-header-three → ### h3
.longform-blockquote → > quote
.longform-unordered-list-item → - item
.longform-ordered-list-item → 1. item
.longform-image → contains <img src="…">
```

Inline formatting (bold, italic, links) lives inside the block as nested
`<span style="font-weight: bold">` / `<span style="font-style: italic">` /
`<a href="…">`. For a quick clip, `innerText` of each block is fine — it
flattens to plain text but preserves paragraph breaks. For higher fidelity
walk the spans and emit `**bold**` / `*italic*` / `[text](url)`.

Watch out: the block's text nodes contain the Unicode no-break space
(U+00A0) in many places. Strip / normalize when comparing or counting words.
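The class-to-Markdown mapping above can be applied mechanically once each block's class name and `innerText` have been pulled out of the page. The following is a sketch of the quick-clip path only (plain text, no inline bold/italic/link handling); `blocks_to_markdown` and the `(class_name, text)` tuple shape are assumptions for illustration, and `.longform-image` blocks are simply skipped here.

```python
from typing import List, Tuple

PREFIXES = {
    "longform-unstyled": "",
    "longform-header-one": "# ",
    "longform-header-two": "## ",
    "longform-header-three": "### ",
    "longform-blockquote": "> ",
    "longform-unordered-list-item": "- ",
    "longform-ordered-list-item": "1. ",
}
LIST_CLASSES = {"longform-unordered-list-item", "longform-ordered-list-item"}


def blocks_to_markdown(blocks: List[Tuple[str, str]]) -> str:
    """Render (class_name, inner_text) pairs as Markdown."""
    out: List[str] = []
    prev_list_class = None
    for class_name, text in blocks:
        text = text.replace("\u00a0", " ").strip()  # normalize no-break spaces
        if class_name not in PREFIXES or not text:
            continue  # unknown block types (e.g. images) are skipped
        is_list = class_name in LIST_CLASSES
        if out:
            # Consecutive items of the same list stay tight; everything
            # else is separated by a blank line.
            same_list = is_list and prev_list_class == class_name
            out.append("\n" if same_list else "\n\n")
        out.append(PREFIXES[class_name] + text)
        prev_list_class = class_name if is_list else None
    return "".join(out)
```

Every ordered item is emitted as `1.`, relying on Markdown renderers to renumber the list.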

## Reference implementation

See `bookmark-sync/twitter/article.py` in the user's Projects dir.

## Speed

A single article (3-second post-load wait) clips in ~5s on a warm browser.
46 articles in a row clipped without a hiccup.

## Auth

Requires the user to be logged in to X in the attached Chrome. Logged-out
behavior: the page renders the article behind a login wall and the
`twitterArticleRichTextView` element is absent — the extractor returns None.
88 changes: 88 additions & 0 deletions domain-skills/x.com/bookmarks.md
@@ -0,0 +1,88 @@
# x.com — Bookmarks (`/i/bookmarks`)

How to extract a user's bookmarked tweets reliably.

## URL & API

- Page: `https://x.com/i/bookmarks`
- Private GraphQL endpoint hit on initial load and on every pagination:
  `https://x.com/i/api/graphql/<hash>/Bookmarks?variables=...&features=...`
- Match with `/\/Bookmarks\?/`. Note that there is also a separate
  `BookmarkFoldersSlice` request, which is unrelated (folder list).
- The hash in `<hash>/Bookmarks` rotates; do not pin it.

## Transport: XHR, not fetch

X uses **`XMLHttpRequest`** for the Bookmarks call. Patching only `window.fetch`
will see zero hits. Hook both XHR and fetch to be safe.

## Inject the hook BEFORE page scripts

`document.body`-level `js(...)` patches arrive too late — X has already issued
the initial Bookmarks XHR by then. Use:

```python
cdp("Page.addScriptToEvaluateOnNewDocument", source=PATCH_JS)
cdp("Page.reload", ignoreCache=False) # if already on /i/bookmarks
# or new_tab("https://x.com/i/bookmarks")
wait_for_load()
```

`Page.addScriptToEvaluateOnNewDocument` runs before the page's own scripts on
every navigation/reload, so the very first Bookmarks XHR is captured.
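The contents of `PATCH_JS` are not shown above. A possible shape is sketched below, assuming captured response bodies are pushed onto a page-global array; the global name `window.__buBookmarks` and the private `__buUrl` property are invented for this example. It hooks both XHR and fetch and filters on the `/\/Bookmarks\?/` pattern.

```python
# Hypothetical PATCH_JS: records the raw response text of every Bookmarks call.
PATCH_JS = r"""
(() => {
  if (window.__buBookmarks) return;  // already installed in this document
  window.__buBookmarks = [];
  const MATCH = /\/Bookmarks\?/;

  // XHR path: remember the URL at open(), read the body on load.
  const origOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url, ...rest) {
    this.__buUrl = String(url);
    return origOpen.call(this, method, url, ...rest);
  };
  const origSend = XMLHttpRequest.prototype.send;
  XMLHttpRequest.prototype.send = function (...args) {
    this.addEventListener('load', () => {
      if (!MATCH.test(this.__buUrl || '')) return;
      // responseText throws for non-text responseType, hence the guard.
      try { window.__buBookmarks.push(this.responseText); } catch (e) {}
    });
    return origSend.apply(this, args);
  };

  // fetch path: kept as a fallback in case the transport changes.
  const origFetch = window.fetch;
  window.fetch = async function (...args) {
    const res = await origFetch.apply(this, args);
    try {
      const url = args[0] instanceof Request ? args[0].url : String(args[0]);
      if (MATCH.test(url)) {
        res.clone().text().then(t => window.__buBookmarks.push(t));
      }
    } catch (e) {}
    return res;
  };
})();
"""
```

After each scroll, the captured bodies could then be drained with something like `js("JSON.stringify(window.__buBookmarks.splice(0))")`, assuming `js(...)` returns the evaluated value.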

## Pagination: `End` key, not mouseWheel

X's lazy-load only fires on a real "scroll near bottom" signal. The harness's
`scroll(x, y, dy=-N)` (CDP `Input.dispatchMouseEvent` mouseWheel) does **not**
trigger pagination — confirmed: 150 mouseWheel iterations yielded only the
initial 20 tweets.

What works:

```python
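import time

js("window.scrollTo(0, document.body.scrollHeight)")  # jump to the bottom of the page
press_key("End")   # real key event; this is the signal X's lazy-load reacts to
time.sleep(1.5)    # let the next /Bookmarks XHR land before scrolling again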
```

This consistently fires the next `/Bookmarks` XHR. Empirically 1.2–1.6s between
scrolls is enough; faster than that and X coalesces.

## Response shape

```
data.bookmark_timeline_v2.timeline.instructions[].entries[]
  .content.entryType == "TimelineTimelineItem"
  .content.itemContent.tweet_results.result
    .rest_id                                      # tweet id
    .legacy.id_str
    .legacy.full_text                             # tweet body
    .legacy.created_at                            # "Mon Apr 20 04:25:49 +0000 2026"
    .legacy.entities.urls[].expanded_url          # outbound links
    .core.user_results.result.core.screen_name    # NEW path
    .core.user_results.result.legacy.screen_name  # legacy path (also present)
```

Tombstoned tweets sometimes wrap the result in a `{"tweet": ...}` outer object
— always do `tweet = ir.get("tweet", ir)` defensively.
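Walking the response shape above can be sketched as a generator over one decoded Bookmarks payload. This is an illustration under the stated shape, not the reference implementation; `iter_tweets` and the keys of the yielded dict are names chosen for this example. Every lookup uses `.get` so that non-tweet entries and partial results are skipped rather than raising.

```python
from typing import Any, Dict, Iterator


def iter_tweets(payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per tweet found in a Bookmarks response."""
    timeline = (
        payload.get("data", {}).get("bookmark_timeline_v2", {}).get("timeline", {})
    )
    for instruction in timeline.get("instructions", []):
        for entry in instruction.get("entries", []):
            content = entry.get("content", {})
            if content.get("entryType") != "TimelineTimelineItem":
                continue  # cursors and other non-tweet entries
            ir = content.get("itemContent", {}).get("tweet_results", {}).get("result")
            if not ir:
                continue
            tweet = ir.get("tweet", ir)  # unwrap the optional {"tweet": ...} layer
            legacy = tweet.get("legacy", {})
            user = tweet.get("core", {}).get("user_results", {}).get("result", {})
            screen_name = user.get("core", {}).get("screen_name") or user.get(
                "legacy", {}
            ).get("screen_name")
            yield {
                "id": tweet.get("rest_id") or legacy.get("id_str"),
                "text": legacy.get("full_text"),
                "created_at": legacy.get("created_at"),
                "urls": [
                    u.get("expanded_url")
                    for u in legacy.get("entities", {}).get("urls", [])
                ],
                "screen_name": screen_name,
            }
```

The new `core.screen_name` path is tried first and the legacy path is used as a fallback, matching the order in the shape listing.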

## What's NOT exposed

- **`bookmarked_at`** — Twitter does not expose when the user bookmarked the
tweet, only the tweet's own `created_at`. If you need to time-bound a sync
(e.g. "last 4 months"), use tweet `created_at` as a coarse signal and
tolerate a few consecutive old tweets before stopping (bookmark-time order
≠ post-time order, so you can briefly slip below the cutoff and recover).

## Stop conditions

Two independent counters:
- **consecutive_old**: tweets older than cutoff seen in a row → end of useful
range. ~4 tolerated before stopping.
- **consecutive_empty**: scroll iterations that produced no new tweets → end
of bookmarks. ~5 tolerated before stopping.
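The two counters can be kept in a small tracker object. The sketch below is one reading of the rules above, in which "tolerated" means the sync stops only once a counter exceeds its tolerance; `StopTracker`, its method names, and that interpretation are assumptions, not the reference implementation. The date format string matches the `created_at` example shown earlier.

```python
from dataclasses import dataclass
from datetime import datetime

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # "Mon Apr 20 04:25:49 +0000 2026"


@dataclass
class StopTracker:
    cutoff: datetime          # timezone-aware lower bound for useful tweets
    max_old: int = 4          # consecutive old tweets tolerated
    max_empty: int = 5        # consecutive empty scrolls tolerated
    consecutive_old: int = 0
    consecutive_empty: int = 0

    def observe_tweet(self, created_at: str) -> None:
        """Update the old-tweet streak from a tweet's created_at string."""
        when = datetime.strptime(created_at, CREATED_AT_FORMAT)
        self.consecutive_old = self.consecutive_old + 1 if when < self.cutoff else 0

    def observe_scroll(self, new_tweets: int) -> None:
        """Update the empty-scroll streak after one scroll iteration."""
        self.consecutive_empty = self.consecutive_empty + 1 if new_tweets == 0 else 0

    def should_stop(self) -> bool:
        return (
            self.consecutive_old > self.max_old
            or self.consecutive_empty > self.max_empty
        )
```

Resetting each streak on any fresh observation is what allows the sync to slip briefly below the cutoff and recover, as described under "What's NOT exposed".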

## Reference implementation

See `bookmark-sync/twitter/pull.py` in the user's Projects dir.