26 changes: 26 additions & 0 deletions agent-workspace/domain-skills/reddit/scraping.md
@@ -29,6 +29,32 @@ Fails on:
- Private / quarantined subreddits (401)
- NSFW posts without an authenticated session
- Anti-scraping 429s under load — back off or switch to the browser path
- **Blanket 403 walls (observed June 2026):** reddit.com can 403 *every* anonymous `.json` request from a network it dislikes (datacenter/VPN IPs), browser User-Agent or not. The response is an HTML challenge page, not JSON. When this happens, Path 1 is dead for the whole session — don't retry/back off, switch to Path 1.5.
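A minimal sketch of how the blanket-403 case could be told apart from an ordinary error before deciding to abandon Path 1. `looks_like_block_page` is a hypothetical helper, not part of the harness, and it assumes the caller already has the status code and body text from `http_get`. Treating a 403 whose body still parses as JSON as a per-resource error rather than the wall is also an assumption.

```python
import json


def looks_like_block_page(status: int, body: str) -> bool:
    """Return True when a response resembles the blanket 403 wall.

    Hypothetical helper: a 403 whose body is not valid JSON (for example an
    HTML challenge page) is treated as the wall. A 403 with a JSON body is
    assumed to be a per-resource error instead.
    """
    if status != 403:
        return False
    try:
        json.loads(body)
    except ValueError:
        # Body is not JSON, so this is most likely the HTML challenge page.
        return True
    return False
```

If this returns `True` for the first request of a session, skipping retries and moving straight to Path 1.5 matches the advice above.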

## Path 1.5: JSON endpoints through the browser (beats IP blocks)

The `.json` endpoints render as plain text in a real tab, and requests from the user's Chrome carry their cookies + real TLS fingerprint, so they pass where `http_get` 403s. No DOM scraping needed — navigate and parse `document.body.innerText`:

```python
import json
ensure_real_tab()
goto_url("https://www.reddit.com/r/Bogleheads/search.json?q=tax%20loss%20harvesting&restrict_sr=on&sort=top&t=month&limit=15&raw_json=1")
wait_for_load()
data = json.loads(js("document.body.innerText"))
posts = [c["data"] for c in data["data"]["children"]]
# fields: title, selftext, score, num_comments, permalink, created_utc
```
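The search URL in the block above is hand-encoded. As a sketch, it could instead be assembled with the standard library so that spaces and quotes in `q` are always percent-encoded. `search_url` is a hypothetical helper name; the parameter names and default values are the ones already used in this section.

```python
from urllib.parse import quote, urlencode


def search_url(sub, query, sort="top", t="month", limit=25):
    """Build a subreddit search.json URL with a percent-encoded query."""
    params = {
        "q": query,
        "restrict_sr": "on",
        "sort": sort,
        "t": t,
        "limit": limit,
        "raw_json": 1,
    }
    # quote_via=quote encodes spaces as %20 (urlencode's default would use +).
    return "https://www.reddit.com/r/{}/search.json?{}".format(
        quote(sub, safe=""), urlencode(params, quote_via=quote)
    )
```

For example, `search_url("Bogleheads", "tax loss harvesting", limit=15)` reproduces the URL passed to `goto_url` above.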

Useful JSON endpoints beyond single posts:

- **Subreddit search:** `/r/<sub>/search.json?q=<query>&restrict_sr=on&sort=top&t=month&limit=25&raw_json=1` — `q` supports quoted phrases and `OR` (`q=tax efficient OR "tax loss harvesting"`, URL-encoded). `t` ∈ hour/day/week/month/year/all.
- **Thread + top comments:** `/r/<sub>/comments/<id>.json?limit=10&sort=top&depth=1&raw_json=1` — `data[1]["data"]["children"]` are top-level comments (`body`, `score`, `author`); filter out `stickied`.

@cubic-dev-ai cubic-dev-ai Bot Jun 11, 2026

P2: Path 1.5 documentation for /comments/<id>.json omits kind: "more" entries in data[1]["data"]["children"], giving an incorrect data-shape guarantee that could cause KeyErrors in agent-generated code. The existing Path 1 section already correctly documents kind: "more" for the same endpoint.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-workspace/domain-skills/reddit/scraping.md, line 51:

<comment>Path 1.5 documentation for `/comments/<id>.json` omits `kind: "more"` entries in `data[1]["data"]["children"]`, giving an incorrect data-shape guarantee that could cause KeyErrors in agent-generated code. The existing Path 1 section already correctly documents `kind: "more"` for the same endpoint.</comment>


- `raw_json=1` stops Reddit HTML-escaping `&`, `<`, `>` in text fields.
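A sketch of reading the thread endpoint defensively, in line with the review note above: children whose `kind` is not `"t1"` (such as the trailing `kind: "more"` placeholder) carry no `body`, `score`, or `author`, so they are skipped along with stickied comments. `top_level_comments` is a hypothetical helper, and `data` is assumed to be the parsed response of `/r/<sub>/comments/<id>.json`.

```python
def top_level_comments(data):
    """Extract top-level comments from a parsed comments.json response.

    Skips non-comment children (e.g. kind "more") and stickied comments.
    """
    comments = []
    for child in data[1]["data"]["children"]:
        if child.get("kind") != "t1":
            # "more" placeholders have no body/score/author fields.
            continue
        fields = child["data"]
        if fields.get("stickied"):
            continue
        comments.append(
            {
                "author": fields.get("author"),
                "score": fields.get("score"),
                "body": fields.get("body"),
            }
        )
    return comments
```

Using `.get()` rather than direct indexing avoids a `KeyError` if a field is missing from a given comment.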

Gotchas for this path:

- **Loop over many URLs with retry.** Mid-loop `Runtime.evaluate timed out; expression: document.readyState` means the tab session went stale; calling `ensure_real_tab()` again and re-navigating recovers it. Wrap each fetch in a 2-3 attempt retry rather than failing the whole sweep.
- Single quotes inside f-strings break `browser-harness -c '...'` shell quoting — use `.format()` / double quotes, or pass the script via `"$(cat file.py)"`.
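A sketch of the per-URL retry described in the first gotcha. To stay self-contained it takes the harness calls as arguments: `fetch` and `recover` are hypothetical parameters, where `fetch(url)` is assumed to navigate and return the page text, and `recover()` is assumed to call `ensure_real_tab()`. The attempt count and delay are illustrative defaults.

```python
import json
import time


def fetch_json_with_retry(url, fetch, recover, attempts=3, delay=1.0):
    """Fetch and parse JSON from one URL, recovering the tab between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return json.loads(fetch(url))
        except Exception as exc:  # stale tab, timeout, or non-JSON body
            last_error = exc
            if attempt < attempts - 1:
                recover()  # e.g. ensure_real_tab() in the harness
                time.sleep(delay)
    raise RuntimeError("all {} attempts failed for {}".format(attempts, url)) from last_error


# Possible wiring inside the harness (not runnable outside it):
# def fetch(url):
#     goto_url(url)
#     wait_for_load()
#     return js("document.body.innerText")
# data = fetch_json_with_retry(url, fetch, ensure_real_tab)
```

Catching the failure per URL lets the surrounding loop record the error and continue with the next URL instead of aborting the sweep.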

## Path 2: Browser DOM extraction (logged-in)
