feat(rewrite): add OpenGraph and Twitter Card preview rules#4295

Open
ChrisJr404 wants to merge 1 commit into
miniflux:mainfrom
ChrisJr404:feat/rewrite-twitter-opengraph-4291
@ChrisJr404

Closes #4291.

What

Adds two content rewrite rules that pull values from the scraped page's <head>:

  • add_open_graph("description", "image", ...): reads og:* meta tags
  • add_twitter_card("description", "image", ...): reads twitter:* meta tags

Both accept either bare suffixes (description, image, title, site_name, ...) or fully-qualified keys (og:description, twitter:image). Called without arguments they default to description + image.
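
The argument handling described above could look roughly like the sketch below. `resolveKeys` is a hypothetical helper name, not something taken from the PR, and silently dropping keys from the other family is an assumption about how the family-mismatch rejection behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveKeys is a hypothetical helper (not taken from the PR). It turns rule
// arguments into fully-qualified meta keys for one family: "og" or "twitter".
func resolveKeys(family string, args []string) []string {
	if len(args) == 0 {
		// Documented default: description + image.
		args = []string{"description", "image"}
	}
	prefix := family + ":"
	keys := make([]string, 0, len(args))
	for _, arg := range args {
		arg = strings.TrimSpace(arg)
		switch {
		case arg == "":
			// Nothing to resolve.
		case strings.HasPrefix(arg, prefix) && len(arg) > len(prefix):
			// Already fully qualified for this family.
			keys = append(keys, arg)
		case strings.Contains(arg, ":"):
			// Assumption: a key from another family is dropped silently.
		default:
			// Bare suffix: qualify it with the family prefix.
			keys = append(keys, prefix+arg)
		}
	}
	return keys
}

func main() {
	fmt.Println(resolveKeys("og", nil))
	fmt.Println(resolveKeys("twitter", []string{"title", "og:image", "twitter:image"}))
}
```

In this sketch `add_twitter_card("title", "og:image", "twitter:image")` would resolve to `twitter:title` and `twitter:image` only.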

When description and image are both available the rule renders a <figure> with the image and a caption; otherwise it falls back to a paragraph. Other suffixes are rendered as a labelled paragraph (<p><strong>site_name:</strong> Example</p>). All metadata values are HTML-escaped before being written into the entry content.
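
As a rough illustration of the rendering rules above, here is a minimal sketch using only the standard library. `renderPreview` is a hypothetical name, the exact markup (attribute order, `alt`, how a lone image is wrapped) is assumed, and the PR's real output may differ.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// renderPreview is a hypothetical sketch of the rendering rules. keys are
// fully-qualified ("og:description"); meta holds the scraped values.
func renderPreview(family string, keys []string, meta map[string]string) string {
	prefix := family + ":"
	var desc, image string
	var others strings.Builder
	for _, key := range keys {
		value := strings.TrimSpace(meta[key])
		if value == "" {
			continue // missing property: contributes nothing
		}
		// Every value is escaped before it reaches the entry content.
		escaped := html.EscapeString(value)
		switch suffix := strings.TrimPrefix(key, prefix); suffix {
		case "description":
			desc = escaped
		case "image":
			image = escaped
		default:
			// Any other suffix becomes a labelled paragraph.
			fmt.Fprintf(&others, "<p><strong>%s:</strong> %s</p>",
				html.EscapeString(suffix), escaped)
		}
	}
	var head string
	switch {
	case desc != "" && image != "":
		head = fmt.Sprintf(`<figure><img src="%s" alt=""><figcaption>%s</figcaption></figure>`, image, desc)
	case desc != "":
		head = "<p>" + desc + "</p>"
	case image != "":
		// Assumption: a lone image is wrapped in a paragraph.
		head = fmt.Sprintf(`<p><img src="%s" alt=""></p>`, image)
	}
	return head + others.String()
}

func main() {
	meta := map[string]string{
		"og:description": "A <b>post</b>",
		"og:image":       "https://example.org/a.png",
	}
	fmt.Println(renderPreview("og", []string{"og:description", "og:image"}, meta))
}
```

An empty result means the rule has nothing to add, which matches the no-op behaviour described for missing properties.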

Why

Some sites lean so heavily on JS that scraping returns very little, but their <head> exposes rich preview metadata. Bluesky is the example in the issue: an RSS item points at a post but only carries a short snippet, while the linked page has og:description, og:image, twitter:description, twitter:image, etc. The new rules let users opt into using those values for the entry body.

Example feed-side configuration (custom rewrite rules field):

add_open_graph("description", "image")

or for a Twitter-Card-only site:

add_twitter_card("description", "image")

How

The scraper already fetches the page once when the crawler is enabled. The change buffers the fetched HTML so it can be parsed twice — once for the existing readability/custom-rules extraction, once for <head> meta tags — without an extra HTTP request. The collected map is exposed on a new ScrapeResult struct (replacing the old multi-return signature on ScrapeWebsite) and threaded into the rewrite layer through a new RewriteContext.

When the crawler is disabled or no requested key is present the rules are no-ops, so existing feeds that do not opt in are unaffected.
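
The buffer-once, parse-twice idea can be sketched as follows. Only the `ScrapeResult` name and its `Metadata` map come from this PR. The `Content` field, the `scrape` signature, and the demo extractors are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// ScrapeResult mirrors the struct named in the PR; Content is an assumed field.
type ScrapeResult struct {
	Content  string
	Metadata map[string]string
}

type contentFunc func(io.Reader) (string, error)
type metadataFunc func(io.Reader) (map[string]string, error)

// scrape reads the body once and hands each parser its own reader over the
// same bytes, so no second HTTP request is needed.
func scrape(body io.Reader, content contentFunc, metadata metadataFunc) (*ScrapeResult, error) {
	buf, err := io.ReadAll(body)
	if err != nil {
		return nil, err
	}
	extracted, err := content(bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	meta, err := metadata(bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	return &ScrapeResult{Content: extracted, Metadata: meta}, nil
}

// demoContent and demoMetadata are stand-ins for the real extractors.
func demoContent(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	return strings.TrimSpace(string(b)), err
}

func demoMetadata(r io.Reader) (map[string]string, error) {
	b, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return map[string]string{"bytes": strconv.Itoa(len(b))}, nil
}

// scrapeString is a convenience wrapper used by the demo.
func scrapeString(page string) (*ScrapeResult, error) {
	return scrape(strings.NewReader(page), demoContent, demoMetadata)
}

func main() {
	res, err := scrapeString("  hello  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Content, res.Metadata["bytes"])
}
```

Both stand-in extractors see the full nine-byte input even though the source reader is consumed only once.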

Notes / open questions for reviewers

  • The new directive names (add_open_graph / add_twitter_card) follow the existing add_* naming. Happy to rename them if you prefer something more compact.
  • The default property list (description + image) was chosen to match the Bluesky-style use case in the issue. Easy to extend the defaults or expose a third helper that pulls everything available.
  • The ScrapeWebsite return type changed from three values to a ScrapeResult struct since metadata makes a fourth value awkward; the only callers are inside internal/reader/processor so no external API is affected.

Tests

  • internal/reader/scraper/metadata_test.go — covers OpenGraph extraction, Twitter Cards using both name and property attributes, ignoring unrelated meta, first-value-wins on duplicates, and rejection of empty/whitespace content.
  • internal/reader/rewrite/preview_meta_test.go — covers prepending image+description, default arg fallback, fully-qualified keys, family-mismatch rejection, no-metadata no-op, missing-property no-op, the labelled-paragraph fallback, and HTML escaping of attacker-controlled meta values.
  • Existing content_rewrite_test.go updated for the new ApplyContentRewriteRules signature.
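
The extraction behaviour covered by metadata_test.go could be approximated as in the sketch below. It assumes the `<meta>` tags have already been parsed into a hypothetical `metaTag` struct, and lower-casing keys is an assumption. The PR itself walks the HTML `<head>`.

```go
package main

import (
	"fmt"
	"strings"
)

// metaTag is a hypothetical, already-parsed <meta> element.
type metaTag struct {
	Name     string
	Property string
	Content  string
}

// previewKey returns the og:* or twitter:* key of a tag, checking both the
// property and name attributes, or "" for unrelated meta tags.
func previewKey(tag metaTag) string {
	for _, candidate := range []string{tag.Property, tag.Name} {
		key := strings.ToLower(strings.TrimSpace(candidate))
		if strings.HasPrefix(key, "og:") || strings.HasPrefix(key, "twitter:") {
			return key
		}
	}
	return ""
}

// collectMetadata keeps the first non-empty value seen for each key.
func collectMetadata(tags []metaTag) map[string]string {
	out := make(map[string]string)
	for _, tag := range tags {
		key := previewKey(tag)
		if key == "" {
			continue // unrelated meta tag
		}
		value := strings.TrimSpace(tag.Content)
		if value == "" {
			continue // empty or whitespace-only content is rejected
		}
		if _, seen := out[key]; seen {
			continue // first value wins on duplicates
		}
		out[key] = value
	}
	return out
}

func main() {
	tags := []metaTag{
		{Property: "og:description", Content: "First"},
		{Property: "og:description", Content: "Second"},
		{Name: "twitter:image", Content: "https://example.org/a.png"},
		{Name: "viewport", Content: "width=device-width"},
		{Property: "og:title", Content: "   "},
	}
	fmt.Println(collectMetadata(tags))
}
```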

go test ./... and go vet ./... pass locally.

Adds two new content rewrite rules — `add_open_graph` and `add_twitter_card`
— that prepend the entry content with values pulled from the scraped page's
`<head>` meta tags. This is useful for sites whose RSS body is sparse but
whose linked page exposes rich preview metadata (Bluesky, Mastodon link
posts, social previews of single-page apps, ...).

The scraper now buffers the fetched HTML once and exposes the collected
OG/Twitter values via a new `ScrapeResult.Metadata` map alongside the
existing extracted content. The processor passes the map down to the
rewrite layer through a new `RewriteContext` struct so individual rules can
consume it without re-fetching the page.

Both rules accept either bare property suffixes (`description`, `image`,
`title`, ...) or fully-qualified keys (`og:description`, `twitter:image`).
With no arguments they default to `description` + `image`. When the scraper
is disabled or the requested keys are missing the rules are no-ops.

Closes miniflux#4291.
Development

Successfully merging this pull request may close these issues.

[Feature]: Content rewrite rules that use Twitter Cards and OpenGraph attribute values

1 participant