From 39bfbb054aa329cb68f2faeb60f1853f91c38008 Mon Sep 17 00:00:00 2001 From: Arda Sisbot Date: Thu, 11 Jun 2026 15:37:04 -0700 Subject: [PATCH] reddit: document blanket 403 walls + JSON-via-browser fallback (Path 1.5) Anonymous .json requests can be 403-blocked wholesale at the CDN level (observed June 2026, datacenter IP, browser UA made no difference). Document the recovery: navigate .json URLs in the user's Chrome and parse document.body.innerText. Also adds the subreddit search.json endpoint, thread-comments params, stale-session retry, and -c shell-quoting trap. Co-Authored-By: Claude Fable 5 --- .../domain-skills/reddit/scraping.md | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/agent-workspace/domain-skills/reddit/scraping.md b/agent-workspace/domain-skills/reddit/scraping.md index 82ea0fd0..7380d4d5 100644 --- a/agent-workspace/domain-skills/reddit/scraping.md +++ b/agent-workspace/domain-skills/reddit/scraping.md @@ -29,6 +29,32 @@ Fails on: - Private / quarantined subreddits (401) - NSFW posts without an authenticated session - Anti-scraping 429s under load — back off or switch to the browser path +- **Blanket 403 walls (observed June 2026):** reddit.com can 403 *every* anonymous `.json` request from a network it dislikes (datacenter/VPN IPs), browser User-Agent or not. The response is an HTML challenge page, not JSON. When this happens, Path 1 is dead for the whole session — don't retry/back off, switch to Path 1.5. + +## Path 1.5: JSON endpoints through the browser (beats IP blocks) + +The `.json` endpoints render as plain text in a real tab, and requests from the user's Chrome carry their cookies + real TLS fingerprint, so they pass where `http_get` 403s. No DOM scraping needed — navigate and parse `document.body.innerText`: + +```python +import json +ensure_real_tab() +goto_url("https://www.reddit.com/r/Bogleheads/search.json?q=tax%20loss%20harvesting&restrict_sr=on&sort=top&t=month&limit=15&raw_json=1") +wait_for_load() +data = json.loads(js("document.body.innerText")) +posts = [c["data"] for c in data["data"]["children"]] +# fields: title, selftext, score, num_comments, permalink, created_utc +``` + +Useful JSON endpoints beyond single posts: + +- **Subreddit search:** `/r//search.json?q=&restrict_sr=on&sort=top&t=month&limit=25&raw_json=1` — `q` supports quoted phrases and `OR` (`q=tax efficient OR "tax loss harvesting"`, URL-encoded). `t` ∈ hour/day/week/month/year/all. +- **Thread + top comments:** `/r//comments/.json?limit=10&sort=top&depth=1&raw_json=1` — `data[1]["data"]["children"]` are top-level comments (`body`, `score`, `author`); filter out `stickied`. +- `raw_json=1` stops Reddit HTML-escaping `&`, `<`, `>` in text fields. + +Gotchas for this path: + +- **Loop over many URLs with retry.** Mid-loop `Runtime.evaluate timed out; expression: document.readyState` means the tab session went stale; calling `ensure_real_tab()` again and re-navigating recovers it. Wrap each fetch in a 2-3 attempt retry rather than failing the whole sweep. +- Single quotes inside f-strings break `browser-harness -c '...'` shell quoting — use `.format()` / double quotes, or pass the script via `"$(cat file.py)"`. ## Path 2: Browser DOM extraction (logged-in)