Is image extraction deterministic? Can the same embedded image produce a stable hash across runs? #3137

nixmaldonado · 2026-03-16T20:59:30Z

Context

I'm using Docling with the CV pipeline (page-level rasterization) to extract images from PDFs as part of a batch processing pipeline. I'm trying to determine whether image extraction is deterministic — specifically, whether the same image embedded in a PDF will always produce the same bytes (and thus the same hash) on repeated runs or across environments.

Questions

1. Determinism within the CV pipeline: When Docling rasterizes a page and crops image regions, is that process deterministic? Or can floating-point rendering differences, antialiasing, or thread scheduling cause pixel-level variation between runs?
2. Raw stream extraction: For PDFs that contain embedded raster images (JPEG, PNG streams), is there a way to extract those image bytes directly from the PDF stream, bypassing the page rendering step, so the output is byte-identical to the original embedded image? This would make hash-based deduplication reliable.
3. Cross-document deduplication: If the same image appears in two different PDFs, is there any supported mechanism to detect that they are identical?

Use case

I want to deduplicate extracted images across a large corpus of PDFs using content hashing (e.g., MD5/SHA256). This requires the extraction to be deterministic and ideally to preserve the original image bytes when possible.
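To make the goal concrete, here is a minimal sketch of the deduplication step, assuming the image bytes have already been extracted by some means. The function names and the `(source, data)` input shape are illustrative assumptions, not part of Docling.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_digest_index(images: Iterable[Tuple[str, bytes]]) -> Dict[str, List[str]]:
    """Map each SHA-256 digest to the sources whose bytes produced it.

    `images` yields (source_label, image_bytes) pairs; the label format
    (e.g. "report.pdf#page3-img1") is a hypothetical convention.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for source, data in images:
        index[hashlib.sha256(data).hexdigest()].append(source)
    return dict(index)


def find_duplicates(index: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only digests that were seen more than once."""
    return {digest: sources for digest, sources in index.items() if len(sources) > 1}
```

This only works if extraction yields identical bytes for identical images, which is exactly what the questions above are asking about.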

Any guidance appreciated — including if this is already possible via a config option I may have missed.

Answered by dosubot[bot]

@nixmaldonado · 2026-03-16T21:02:26Z

dosubot[bot]
Bot Mar 16, 2026

Hi @nixmaldonado! I'm Dosu and I’m helping the docling team.

Based on my research into the Docling codebase:

1. Determinism of the CV pipeline: Image extraction is not guaranteed to be deterministic. Docling uses pypdfium2's render() method to rasterize page regions and then crop/resize them. The code renders at 1.5x scale and then resizes, which involves floating-point operations. There's no explicit documentation guaranteeing reproducibility across runs or environments.

2. Raw stream extraction: Unfortunately, Docling does not currently support direct extraction of embedded image bytes from PDF streams. The current implementation detects image objects using pdfium_c.FPDF_PAGEOBJ_IMAGE to find bounding boxes, but then re-renders those regions rather than extracting the native JPEG/PNG streams. This means you won't get byte-identical output to the original embedded images.

3. Cross-document deduplication: There's no built-in mechanism for detecting identical images across documents. You'd need to implement this externally.
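Because point 1 offers no guarantee either way, determinism can be measured rather than assumed. The following is a minimal sketch of such a check; `extract` is a hypothetical zero-argument callable that you would wrap around your own Docling extraction so that it returns the image bytes in a stable order.

```python
import hashlib
from typing import Callable, List, Sequence


def run_digests(extract: Callable[[], Sequence[bytes]]) -> List[str]:
    """Run the extraction once and return one SHA-256 digest per image."""
    return [hashlib.sha256(blob).hexdigest() for blob in extract()]


def is_deterministic(extract: Callable[[], Sequence[bytes]], runs: int = 5) -> bool:
    """Return True if every run yields the same digests in the same order."""
    baseline = run_digests(extract)
    return all(run_digests(extract) == baseline for _ in range(runs - 1))
```

A passing result only covers the machine and library versions it ran on. To compare environments, the digest lists from `run_digests` would need to be saved and diffed across them.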

Recommendations for your use case:

- Since rendering-based extraction may have slight variations, consider using perceptual hashing (e.g., pHash, dHash) instead of cryptographic hashing for deduplication; these are tolerant of minor pixel-level differences.
- Alternatively, you could use an external PDF library (like PyMuPDF/fitz or pikepdf) to extract raw embedded image streams directly, then use Docling for the document structure analysis.
- If you need to test determinism in your specific environment, run the same extraction multiple times and compare byte-level hashes to verify consistency.
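To illustrate why a perceptual hash tolerates small rendering differences, here is a simplified, dependency-free sketch of a difference hash (dHash). It assumes the image has already been decoded to rows of grayscale values, and the box-average downscale is a stand-in for a proper image resize. In practice a library such as Pillow would handle decoding and resizing.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale pixel values


def box_resize(pixels: Gray, out_w: int, out_h: int) -> List[List[float]]:
    """Resize by averaging the source pixels that fall into each output cell."""
    in_h, in_w = len(pixels), len(pixels[0])
    out = []
    for oy in range(out_h):
        y0 = oy * in_h // out_h
        y1 = max(y0 + 1, (oy + 1) * in_h // out_h)
        row = []
        for ox in range(out_w):
            x0 = ox * in_w // out_w
            x1 = max(x0 + 1, (ox + 1) * in_w // out_w)
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def dhash(pixels: Gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = box_resize(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two images would then be treated as duplicates when `hamming(dhash(a), dhash(b))` falls below a threshold chosen for the corpus. That threshold is a tuning decision, and near-duplicates that differ in meaningful ways can collide.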

This would be a useful feature request if you'd like to open one—native image stream extraction would enable reliable hash-based deduplication.
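Until such a feature exists, the external-library route could look roughly like the sketch below, which uses PyMuPDF's `Document.extract_image` to hash embedded images by xref. Treat it as a starting point under stated assumptions: as far as I know, PyMuPDF returns the stored bytes for formats such as JPEG but may re-encode others (for example to PNG), so byte identity with the original stream should be verified on your own files. The helper names are illustrative.

```python
import hashlib
from typing import Iterator, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def embedded_image_digests(pdf_path: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (xref, extension, sha256) for each distinct embedded image.

    Requires PyMuPDF (`pip install pymupdf`); imported lazily so that
    sha256_hex stays usable without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        seen_xrefs = set()
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                if xref in seen_xrefs:
                    continue  # same image object reused on several pages
                seen_xrefs.add(xref)
                info = doc.extract_image(xref)
                yield xref, info["ext"], sha256_hex(info["image"])
```

The digests from this function could feed a cross-document index, while Docling continues to handle layout and structure.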
