daemon: skip malformed/empty DevToolsActivePort instead of crashing #456
Open
thomwolf wants to merge 4 commits into browser-use:main from thomwolf:fix/devtoolsactiveport-malformed-skip
Commits:
a9ba3b9 daemon: also look for DevToolsActivePort in Chrome-CDP profile (thomwolf)
d080f7b domain-skills: add angellist + x.com WIP notes (thomwolf)
4441b68 docs: add headless-automation interaction skill (thomwolf)
16943d9 daemon: skip malformed/empty DevToolsActivePort instead of crashing (thomwolf)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| # AngelList — Investor / scout portfolio extraction | ||
|
|
||
| ## URL shape | ||
|
|
||
| Fund LP view (the one scouts use to file + review deployments): | ||
| ``` | ||
| https://venture.angellist.com/v/lead/<fund-slug>/vehicle/<vehicle-slug>/investing | ||
| ``` | ||
|
|
||
| Individual investment detail panel opens **in the same URL** with `/investing/<investment_id>` appended. The panel is a right-side slide-over, not a new page. | ||
|
|
||
| ## Virtualized row list | ||
|
|
||
| The investment table is **virtualized** — only ~40 rows render initially even when the fund has 50+ investments. To load the tail: | ||
|
|
||
| ```python | ||
| for _ in range(8): | ||
| js("window.scrollTo(0, document.documentElement.scrollHeight)") | ||
| wait(0.8) | ||
| ``` | ||
|
|
||
| After that, all rows are in the DOM and `document.querySelectorAll('.styles_title__XcP6F')` returns the full list. | ||
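The fixed eight iterations above are a guess at how many scrolls a fund needs. A minimal sketch of a variant that keeps scrolling until the rendered row count stops growing; `count_rows`, `scroll_to_bottom` and `pause` are hypothetical callables (not harness functions) that you would wire to `js(...)` and `wait(...)`:

```python
from typing import Callable


def scroll_until_stable(
    count_rows: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause: Callable[[], None],
    max_rounds: int = 20,
    stable_rounds: int = 2,
) -> int:
    """Scroll until the rendered row count stops growing; return the final count."""
    last = count_rows()
    unchanged = 0
    for _ in range(max_rounds):
        scroll_to_bottom()
        pause()
        current = count_rows()
        if current > last:
            # New rows appeared: remember the count and reset the stall counter.
            last, unchanged = current, 0
        else:
            unchanged += 1
            if unchanged >= stable_rounds:
                break  # nothing new for several rounds, assume the list is complete
    return last
```

Assuming `js(...)` returns the evaluated value, `count_rows` could be `lambda: js("document.querySelectorAll('.styles_title__XcP6F').length")`. The `max_rounds` cap keeps the loop bounded if the page never settles.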
|
|
||
| ## Rows are not anchors | ||
|
|
||
| Rows are not `<a>` elements. The click target is a React div with an `onclick` handler, and coordinate clicks **do not** reliably hit it. Use a DOM click instead: | ||
|
|
||
| ```python | ||
| js(""" | ||
| (() => { | ||
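| // Locate the title cell by exact company name, then click its row container. | ||
| const title = Array.from(document.querySelectorAll('.styles_title__XcP6F')) | ||
| .find(el => el.textContent.trim() === 'Berta Systems, Inc.'); | ||
| const row = title && title.closest('.styles_row__M6RnG'); | ||
| if (!row) return 'NOT_FOUND'; // company not rendered yet, or class names changed | ||
| row.click(); | ||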
| return 'OK'; | ||
| })() | ||
| """) | ||
| ``` | ||
|
|
||
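| Selectors that worked at the time of writing (the hashed suffixes look like CSS-module output, so re-check them if the site is rebuilt): | ||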
| - `.styles_title__XcP6F` — company name cell | ||
| - `.styles_row__M6RnG` — the row container with the click handler | ||
| - `.styles_rowWrapper__XFda_.styles_clickable__1_n5z` — the outer row wrapper (also clickable, same result) | ||
|
|
||
| URL updates to `.../investing/<id>` after the click — a reliable "it worked" signal. | ||
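To use the URL change as the success check, the investment id can be pulled out of the current URL. A small sketch, assuming the id is a single path segment after `/investing/` (the notes above do not say what the id looks like); `investment_id_from_url` is a hypothetical helper name:

```python
import re
from typing import Optional

# Matches the trailing /investing/<investment_id> segment described above.
_INVESTMENT_RE = re.compile(r"/investing/([^/?#]+)")


def investment_id_from_url(url: str) -> Optional[str]:
    """Return the investment id if the detail panel is open, else None."""
    match = _INVESTMENT_RE.search(url)
    return match.group(1) if match else None
```

A `None` result after a click means the URL still ends in `/investing`, i.e. the click did not open a panel.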
|
|
||
| ## Detail panel — extraction | ||
|
|
||
| The side panel has no stable outer class. Grab it by walking up from the "Investment in <name>" heading: | ||
|
|
||
| ```python | ||
| js(""" | ||
| (() => { | ||
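| // Find the heading, then walk up to 15 ancestors until one has a panel-like width. | ||
| const heading = Array.from(document.querySelectorAll('div,span,h1,h2,h3,h4,h5')) | ||
| .find(el => el.textContent?.trim().startsWith('Investment in')); | ||
| if (!heading) return null; // panel not open (or not rendered yet) | ||
| let panel = heading; | ||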
| for (let i = 0; i < 15; i++) { | ||
| if (!panel.parentElement) break; | ||
| panel = panel.parentElement; | ||
| if (panel.offsetWidth > 300 && panel.offsetWidth < 900) return panel.innerText; | ||
| } | ||
| return panel.innerText; | ||
| })() | ||
| """) | ||
| ``` | ||
|
|
||
| The panel `innerText` is label-newline-value, easy to parse with regexes like: | ||
| ```python | ||
| re.search(r"\nFounders\n(.+?)(?=\n[A-Z][a-zA-Z ?\-]+\n|$)", panel, re.DOTALL) | ||
| ``` | ||
|
|
||
| Fields always present: `Investment Amount`, `Fund Thesis Match`, `Investing in` (instrument), `Round`, `Round Size`, `Conversion Cap`, `Discount`, `Pro-rata rights included?`, `Equity warrants ...?`, `Country of Incorporation`, `Type of Incorporation`, `Founders`, `Description`, `Which category does this deal fall into?`, `Please provide the founder's LinkedIn profile URL.`, `Company Website`. | ||
|
|
||
| Sometimes present: `Notable Co-Investors` (can be `—`), `Reason for Investing` (can contain pitch deck URL). | ||
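As an alternative to one regex per field, the label-newline-value text can be split in a single pass when the labels are known in advance. A minimal sketch, assuming each label sits alone on its own line exactly as listed above; `parse_panel` is a hypothetical helper, not part of the harness:

```python
from typing import Dict, Iterable, Optional


def parse_panel(text: str, labels: Iterable[str]) -> Dict[str, str]:
    """Split label-newline-value text into {label: value} using known labels."""
    known = set(labels)
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in known:
            # A known label starts a new field.
            current = stripped
            fields[current] = ""
        elif current is not None and stripped:
            # Any other non-empty line belongs to the field opened last.
            fields[current] = (fields[current] + "\n" + stripped).strip()
    return fields
```

Lines before the first known label (such as the "Investment in ..." heading) are ignored. The sketch mis-splits if a value line happens to equal a label exactly.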
|
|
||
| ## Iteration pattern | ||
|
|
||
| To extract all investments, click each row, extract panel, move on. The panel switches content on each click — no need to close it between rows. Sleep ~1.5–2s between clicks for the panel to update. | ||
|
|
||
| ## Traps | ||
|
|
||
| - **Cookie consent dialog** (OneTrust) has a huge hidden `innerText` blob. If you grab `document.body.innerText` without scoping to the panel, you'll get cookie consent copy, not investment data. Always target the panel via the "Investment in" heading. | ||
| - **Word-joiner char** `\u2060` sneaks into founder names (esp. when the filer pasted from Slack). Strip before writing: `name.replace('\u2060', '')`. | ||
| - **Founder separator** is usually `,` but sometimes `;` — split on `[,;]`. | ||
| - **Role annotations** like `"Oliver Gilan (CEO)"` — strip `\s*\(.*?\)\s*` if you want plain names. | ||
| - **Currency on round_size varies** (USD default, but also `£`, `€`, `CHF`). Don't assume USD. | ||
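The three founder-name traps above can be handled together. A sketch of one way to combine them; `clean_founders` is a hypothetical name, and stripping role annotations before splitting is a design choice so that a comma inside the parentheses does not create a bogus name:

```python
import re
from typing import List

WORD_JOINER = "\u2060"


def clean_founders(raw: str, strip_roles: bool = True) -> List[str]:
    """Turn a raw Founders value into a list of plain names."""
    text = raw.replace(WORD_JOINER, "")
    if strip_roles:
        # Drop "(CEO)"-style annotations before splitting on separators.
        text = re.sub(r"\s*\(.*?\)\s*", " ", text)
    names = []
    for part in re.split(r"[,;]", text):
        part = part.strip()
        if part:
            names.append(part)
    return names
```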
|
|
||
| ## Fund scope | ||
|
|
||
| LP view only shows investments made from that specific vehicle. Thomas has multiple scout funds (Fund III, Fund IV) — each has its own URL. The `/investing` endpoint shows "fully deployed" and "no longer accepting submissions" for closed funds but the history remains visible. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| # x.com — Long-form Articles (`/i/article/<id>`) | ||
|
|
||
| How to extract X long-form articles ("X Articles", formerly Twitter Articles). | ||
| Auth-walled — direct fetch / defuddle / Wayback all return near-empty HTML — | ||
| so drive the user's logged-in Chrome via the harness. | ||
|
|
||
| ## URL patterns | ||
|
|
||
| - Bookmark form: `https://x.com/i/article/<numeric_id>` | ||
| - Canonical form (after redirect): `https://x.com/<author_handle>/article/<canonical_id>` | ||
|
|
||
| The redirect happens at navigation time; either form lands on the same page. | ||
| The canonical id often differs from the original — record both if you care. | ||
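To record both ids, the two URL forms can be told apart with a pair of regexes. A minimal sketch; `classify_article_url` is a hypothetical helper, and treating the canonical id as an arbitrary path segment is an assumption, since only the bookmark form is described as numeric:

```python
import re
from typing import Dict, Optional

_BOOKMARK_FORM = re.compile(r"^https://x\.com/i/article/(?P<id>\d+)")
_CANONICAL_FORM = re.compile(
    r"^https://x\.com/(?P<handle>[^/]+)/article/(?P<id>[^/?#]+)"
)


def classify_article_url(url: str) -> Optional[Dict[str, str]]:
    """Return the URL form and ids, or None if it is not an article URL."""
    # Check the bookmark form first: "i" would otherwise match as a handle.
    match = _BOOKMARK_FORM.match(url)
    if match:
        return {"form": "bookmark", "id": match.group("id")}
    match = _CANONICAL_FORM.match(url)
    if match:
        return {
            "form": "canonical",
            "handle": match.group("handle"),
            "id": match.group("id"),
        }
    return None
```

Calling it once on the bookmarked URL and once on the URL after navigation gives both ids.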
|
|
||
| ## DOM landmarks | ||
|
|
||
| ``` | ||
| [data-testid=twitter-article-title] — title (clean) | ||
| [data-testid=twitterArticleRichTextView] — body (Draft.js, ~10k chars typical) | ||
| [data-testid=twitterArticleReadView] — wrapper that includes | ||
| title + author + body + stats noise | ||
| [data-testid=User-Name] — author name + @handle + post date | ||
| ``` | ||
|
|
||
| Don't use `twitterArticleReadView` directly — its `innerText` includes the | ||
| trailing engagement counters ("49\n174\n1.1K\n295K"). Use the title and | ||
| rich-text views separately. | ||
|
|
||
| ## Body structure: Draft.js blocks | ||
|
|
||
| The body is rendered by Draft.js — block-level structure is preserved in | ||
| class names on the immediate descendants of `twitterArticleRichTextView`: | ||
|
|
||
| ``` | ||
| .longform-unstyled → paragraph | ||
| .longform-header-one → # h1 | ||
| .longform-header-two → ## h2 | ||
| .longform-header-three → ### h3 | ||
| .longform-blockquote → > quote | ||
| .longform-unordered-list-item → - item | ||
| .longform-ordered-list-item → 1. item | ||
| .longform-image → contains <img src="…"> | ||
| ``` | ||
|
|
||
| Inline formatting (bold, italic, links) lives inside the block as nested | ||
| `<span style="font-weight: bold">` / `<span style="font-style: italic">` / | ||
| `<a href="…">`. For a quick clip, `innerText` of each block is fine — it | ||
| flattens to plain text but preserves paragraph breaks. For higher fidelity | ||
| walk the spans and emit `**bold**` / `*italic*` / `[text](url)`. | ||
|
|
||
| Watch out: the block's text nodes contain the Unicode no-break space | ||
| (U+00A0; the narrow variant would be U+202F) in many places. Strip / normalize when comparing or counting words. | ||
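The quick-clip path (block class to Markdown prefix, plus space normalization) can be sketched as below. It assumes the blocks have already been pulled from the DOM as `(class_name, inner_text)` pairs, which is not shown here; `blocks_to_markdown` and `normalize_spaces` are hypothetical names, and normalizing U+202F alongside U+00A0 is a precaution rather than something the notes confirm:

```python
from typing import Iterable, List, Tuple

# Markdown prefix per Draft.js block class, following the mapping above.
BLOCK_PREFIX = {
    "longform-unstyled": "",
    "longform-header-one": "# ",
    "longform-header-two": "## ",
    "longform-header-three": "### ",
    "longform-blockquote": "> ",
    "longform-unordered-list-item": "- ",
    "longform-ordered-list-item": "1. ",
}


def normalize_spaces(text: str) -> str:
    """Replace no-break spaces (regular and narrow) with plain spaces."""
    return text.replace("\u00a0", " ").replace("\u202f", " ")


def blocks_to_markdown(blocks: Iterable[Tuple[str, str]]) -> str:
    """Join (class_name, inner_text) pairs into plain Markdown paragraphs."""
    lines: List[str] = []
    for class_name, text in blocks:
        prefix = BLOCK_PREFIX.get(class_name)
        if prefix is None:
            continue  # e.g. longform-image: images are skipped in this sketch
        lines.append(prefix + normalize_spaces(text).strip())
    return "\n\n".join(lines)
```

Inline bold, italic and links are flattened, matching the "quick clip" option described above.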
|
|
||
| ## Reference implementation | ||
|
|
||
| See `bookmark-sync/twitter/article.py` in the user's Projects dir. | ||
|
|
||
| ## Speed | ||
|
|
||
| A single article (3-second post-load wait) clips in ~5s on a warm browser. | ||
| 46 articles in a row clipped without a hiccup. | ||
|
|
||
| ## Auth | ||
|
|
||
| Requires the user to be logged in to X in the attached Chrome. Logged-out | ||
| behavior: the page renders the article behind a login wall and the | ||
| `twitterArticleRichTextView` element is absent — the extractor returns None. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # x.com — Bookmarks (`/i/bookmarks`) | ||
|
|
||
| How to extract a user's bookmarked tweets reliably. | ||
|
|
||
| ## URL & API | ||
|
|
||
| - Page: `https://x.com/i/bookmarks` | ||
| - Private GraphQL endpoint hit on initial load and on every pagination: | ||
| `https://x.com/i/api/graphql/<hash>/Bookmarks?variables=...&features=...` | ||
| - Match with `/\/Bookmarks\?/` — note: there is also a separate | ||
| `BookmarkFoldersSlice` request which is unrelated (folder list). | ||
| - The hash in `<hash>/Bookmarks` rotates; do not pin it. | ||
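If captured URLs are filtered on the Python side, the same match can be written as follows. This is only a sketch; `is_bookmarks_call` is a hypothetical helper name:

```python
import re

# Same pattern as the JS regex above; the rotating hash is deliberately not pinned.
_BOOKMARKS_RE = re.compile(r"/Bookmarks\?")


def is_bookmarks_call(url: str) -> bool:
    """True for the Bookmarks GraphQL call, False for BookmarkFoldersSlice."""
    return _BOOKMARKS_RE.search(url) is not None
```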
|
|
||
| ## Transport: XHR, not fetch | ||
|
|
||
| X uses **`XMLHttpRequest`** for the Bookmarks call. Patching only `window.fetch` | ||
| will see zero hits. Hook both XHR and fetch to be safe. | ||
|
|
||
| ## Inject the hook BEFORE page scripts | ||
|
|
||
| `document.body`-level `js(...)` patches arrive too late — X has already issued | ||
| the initial Bookmarks XHR by then. Use: | ||
|
|
||
| ```python | ||
| cdp("Page.addScriptToEvaluateOnNewDocument", source=PATCH_JS) | ||
| cdp("Page.reload", ignoreCache=False) # if already on /i/bookmarks | ||
| # or new_tab("https://x.com/i/bookmarks") | ||
| wait_for_load() | ||
| ``` | ||
|
|
||
| `Page.addScriptToEvaluateOnNewDocument` runs before the page's own scripts on | ||
| every navigation/reload, so the very first Bookmarks XHR is captured. | ||
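The contents of `PATCH_JS` are not shown in these notes. One possible shape, as a hedged sketch rather than the reference implementation: it hooks both XHR and fetch and pushes matching response bodies onto a page-global array. The global name `__capturedBookmarks` and the `READ_JS` drain expression are assumptions made for illustration:

```python
# Sketch of a capture hook; injected via Page.addScriptToEvaluateOnNewDocument.
PATCH_JS = r"""
(() => {
  if (window.__capturedBookmarks) return;  // already patched in this document
  const captured = (window.__capturedBookmarks = []);
  const isBookmarks = (url) => /\/Bookmarks\?/.test(String(url));

  // XHR path: note the URL in open(), read the body once the request loads.
  const origOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url) {
    if (isBookmarks(url)) {
      this.addEventListener('load', () => {
        try { captured.push(this.responseText); } catch (e) { /* non-text responseType */ }
      });
    }
    return origOpen.apply(this, arguments);
  };

  // fetch path: clone the response so the page can still consume the original.
  const origFetch = window.fetch;
  window.fetch = function (input) {
    const url = input instanceof Request ? input.url : String(input);
    const promise = origFetch.apply(this, arguments);
    if (isBookmarks(url)) {
      promise.then((r) => r.clone().text()).then((t) => captured.push(t)).catch(() => {});
    }
    return promise;
  };
})();
"""

# Drains the buffer and returns it as a JSON string of response bodies.
READ_JS = "JSON.stringify((window.__capturedBookmarks || []).splice(0))"
```

Assuming `js(...)` returns the evaluated value, `json.loads(js(READ_JS))` after each scroll would yield the response bodies captured since the previous read.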
|
|
||
| ## Pagination: `End` key, not mouseWheel | ||
|
|
||
| X's lazy-load only fires on a real "scroll near bottom" signal. The harness's | ||
| `scroll(x, y, dy=-N)` (CDP `Input.dispatchMouseEvent` mouseWheel) does **not** | ||
| trigger pagination — confirmed: 150 mouseWheel iterations yielded only the | ||
| initial 20 tweets. | ||
|
|
||
| What works: | ||
|
|
||
| ```python | ||
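| import time | ||
| js("window.scrollTo(0, document.body.scrollHeight)") # scroll the document to the bottom | ||
| press_key("End") # then send a real End key press | ||
| time.sleep(1.5) # leave time for the next /Bookmarks XHR | ||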
| ``` | ||
|
|
||
| This consistently fires the next `/Bookmarks` XHR. Empirically 1.2–1.6s between | ||
| scrolls is enough; faster than that and X coalesces. | ||
|
|
||
| ## Response shape | ||
|
|
||
| ``` | ||
| data.bookmark_timeline_v2.timeline.instructions[].entries[] | ||
| .content.entryType == "TimelineTimelineItem" | ||
| .content.itemContent.tweet_results.result | ||
| .rest_id # tweet id | ||
| .legacy.id_str | ||
| .legacy.full_text # tweet body | ||
| .legacy.created_at # "Mon Apr 20 04:25:49 +0000 2026" | ||
| .legacy.entities.urls[].expanded_url # outbound links | ||
| .core.user_results.result.core.screen_name # NEW path | ||
| .core.user_results.result.legacy.screen_name # legacy path (also present) | ||
| ``` | ||
|
|
||
| Tombstoned tweets sometimes wrap the result in a `{"tweet": ...}` outer object, | ||
| so always unwrap defensively: `tweet = ir.get("tweet", ir)`, where `ir` is the `tweet_results.result` dict. | ||
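Walking the response shape above can be sketched as a generator. It follows only the paths listed in these notes, tolerates missing keys, and applies the defensive `tweet` unwrap; `iter_bookmarked_tweets` and the output field names are hypothetical:

```python
from typing import Any, Dict, Iterator


def iter_bookmarked_tweets(payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per tweet found in a Bookmarks response payload."""
    timeline = (
        payload.get("data", {}).get("bookmark_timeline_v2", {}).get("timeline", {})
    )
    for instruction in timeline.get("instructions", []):
        for entry in instruction.get("entries", []):
            content = entry.get("content", {})
            if content.get("entryType") != "TimelineTimelineItem":
                continue  # skip non-tweet entries
            ir = content.get("itemContent", {}).get("tweet_results", {}).get("result")
            if not ir:
                continue
            tweet = ir.get("tweet", ir)  # unwrap the optional outer object
            legacy = tweet.get("legacy", {})
            user = tweet.get("core", {}).get("user_results", {}).get("result", {})
            # Prefer the new screen_name path, fall back to the legacy one.
            screen_name = user.get("core", {}).get("screen_name") or user.get(
                "legacy", {}
            ).get("screen_name")
            urls = legacy.get("entities", {}).get("urls", [])
            yield {
                "id": tweet.get("rest_id") or legacy.get("id_str"),
                "text": legacy.get("full_text"),
                "created_at": legacy.get("created_at"),
                "urls": [u.get("expanded_url") for u in urls],
                "screen_name": screen_name,
            }
```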
|
|
||
| ## What's NOT exposed | ||
|
|
||
| - **`bookmarked_at`** — Twitter does not expose when the user bookmarked the | ||
| tweet, only the tweet's own `created_at`. If you need to time-bound a sync | ||
| (e.g. "last 4 months"), use tweet `created_at` as a coarse signal and | ||
| tolerate a few consecutive old tweets before stopping (bookmark-time order | ||
| ≠ post-time order, so you can briefly slip below the cutoff and recover). | ||
|
|
||
| ## Stop conditions | ||
|
|
||
| Two independent counters: | ||
| - **consecutive_old**: tweets older than cutoff seen in a row → end of useful | ||
| range. ~4 tolerated before stopping. | ||
| - **consecutive_empty**: scroll iterations that produced no new tweets → end | ||
| of bookmarks. ~5 tolerated before stopping. | ||
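The two counters can be kept in one small object that is fed the new tweets after each scroll. A minimal sketch: `StopTracker` and `parse_created_at` are hypothetical names, and reading "~4 tolerated" as "stop on the 4th consecutive old tweet" (likewise the 5th empty scroll) is an assumption. The `created_at` format string matches the sample shown in the response shape and relies on English day and month names:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


def parse_created_at(value: str) -> datetime:
    """Parse e.g. 'Mon Apr 20 04:25:49 +0000 2026' into an aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


@dataclass
class StopTracker:
    cutoff: datetime  # must be timezone-aware
    max_old: int = 4
    max_empty: int = 5
    consecutive_old: int = 0
    consecutive_empty: int = 0

    def observe_batch(self, created_ats: Iterable[str]) -> bool:
        """Feed the new tweets from one scroll; return True when the sync should stop."""
        batch = list(created_ats)
        if not batch:
            self.consecutive_empty += 1
            return self.consecutive_empty >= self.max_empty
        self.consecutive_empty = 0
        for value in batch:
            if parse_created_at(value) < self.cutoff:
                self.consecutive_old += 1
                if self.consecutive_old >= self.max_old:
                    return True
            else:
                # A recent tweet resets the run: bookmark order != post order.
                self.consecutive_old = 0
        return False
```

Only tweets not seen in earlier scrolls should be passed in, so that a repeated page counts as an empty iteration.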
|
|
||
| ## Reference implementation | ||
|
|
||
| See `bookmark-sync/twitter/pull.py` in the user's Projects dir. |
P2: The `except ValueError` guard is incomplete: a two-line `DevToolsActivePort` with a non-numeric port (e.g. `abc\n/devtools/browser/...`) passes the `split()` step, then later crashes at `int(port.strip())` inside the inner `try` that only catches `OSError`. This still aborts profile fallback, defeating the PR's goal of skipping all malformed files.
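One way to close this gap is to do all validation, including the `int()` conversion, in a single function that returns `None` for any malformed file, so the caller can move on to the next profile. The sketch below is not the PR's code: the function names are hypothetical, and it assumes the usual two-line layout (port on the first line, browser WebSocket path on the second):

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_devtools_active_port(text: str) -> Optional[Tuple[int, str]]:
    """Return (port, ws_path), or None if the contents are empty or malformed."""
    lines = text.splitlines()
    if len(lines) < 2:
        return None  # empty file or missing second line
    try:
        port = int(lines[0].strip())
    except ValueError:
        return None  # non-numeric port, e.g. "abc"
    ws_path = lines[1].strip()
    if not (0 < port < 65536) or not ws_path:
        return None
    return port, ws_path


def read_devtools_active_port(path: Path) -> Optional[Tuple[int, str]]:
    """Read and parse the file; None means 'skip this profile'."""
    try:
        text = path.read_text()
    except (OSError, UnicodeDecodeError):
        return None
    return parse_devtools_active_port(text)
```

With this shape the fallback loop only needs to check for `None`; no exception from a bad file can escape and abort the remaining candidates.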