Is image extraction deterministic? Can the same embedded image produce a stable hash across runs? #3137
Context

I'm using Docling with the CV pipeline (page-level rasterization) to extract images from PDFs as part of a batch processing pipeline. I'm trying to determine whether image extraction is deterministic: specifically, whether the same image embedded in a PDF will always produce the same bytes (and thus the same hash) on repeated runs or across environments.

Questions

Use case

I want to deduplicate extracted images across a large corpus of PDFs using content hashing (e.g., MD5/SHA256). This requires the extraction to be deterministic and ideally to preserve the original image bytes when possible. Any guidance appreciated, including if this is already possible via a config option I may have missed.
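For readers following along, the deduplication described in the use case could be sketched roughly as below. This is only an illustration under stated assumptions: the `group_by_digest` helper, the `(name, bytes)` input shape, and the sample byte strings are all hypothetical and not part of Docling. It also only finds duplicates if the extracted bytes are identical from run to run, which is exactly what the question asks about.

```python
import hashlib
from collections import defaultdict


def group_by_digest(images):
    """Group image names by the SHA-256 digest of their bytes.

    `images` is an iterable of (name, data) pairs, where `data` is the
    extracted image as bytes. Hypothetical helper, not a Docling API.
    """
    groups = defaultdict(list)
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    return dict(groups)


# Made-up stand-ins for extracted image bytes.
sample = [
    ("a.pdf#img0", b"fake-image-bytes-1"),
    ("b.pdf#img3", b"fake-image-bytes-1"),
    ("c.pdf#img1", b"fake-image-bytes-2"),
]

# Keep only digests shared by more than one image.
duplicates = {
    digest: names
    for digest, names in group_by_digest(sample).items()
    if len(names) > 1
}
```

Exact hashing treats a single differing byte as a different image, so any re-encoding or rendering variation between runs would defeat it.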
Replies: 1 comment
Hi @nixmaldonado! I'm Dosu and I’m helping the docling team.

Based on my research into the Docling codebase:

1. Determinism of the CV pipeline: Image extraction is not guaranteed to be deterministic. Docling uses pypdfium2's `render()` method to rasterize page regions and then crop/resize them. The code renders at 1.5x scale and then resizes, which involves floating-point operations. There's no explicit documentation guaranteeing reproducibility across runs or environments.

2. Raw stream extraction: Unfortunately, Docling does not currently support direct extraction of embedded image bytes from PDF streams. The current implementation detects image objects using `pdfium_c.FPDF_PAGEOBJ_IMAGE`…

3. Cross-document deduplication: There's no built-in mechanism for detecting identical images across documents. You'd need to implement this externally.

Recommendations for your use case:

This would be a useful feature request if you'd like to open one: native image stream extraction would enable reliable hash-based deduplication.
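Since the reply says deduplication has to be implemented externally and that rasterized output may vary, one possible approach is to compare a coarse fingerprint of the decoded pixels instead of hashing the encoded bytes. The sketch below is an assumption of mine, not something from the reply or the Docling codebase: `average_hash` and `hamming` are hypothetical helpers implementing a generic average-hash, and the two tiny grayscale grids stand in for two renders of the same image that differ by one intensity level in some pixels.

```python
import hashlib


def average_hash(pixels, grid=8):
    """Return a grid*grid-bit fingerprint of a grayscale image.

    `pixels` is a list of rows of ints in 0..255; both dimensions must be
    at least `grid`. Hypothetical helper, not a Docling or pypdfium2 API.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: is this cell brighter than the overall mean?
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def raw_digest(pixels):
    """SHA-256 over the raw pixel values, for comparison with exact hashing."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


# Two made-up 16x16 "renders": dark left half, bright right half.
base = [[10 if x < 8 else 240 for x in range(16)] for _ in range(16)]
# Same picture with a one-level brightness change on alternating pixels.
jitter = [
    [v + (1 if (x + y) % 2 else 0) for x, v in enumerate(row)]
    for y, row in enumerate(base)
]

exact_match = raw_digest(base) == raw_digest(jitter)
distance = hamming(average_hash(base), average_hash(jitter))
```

In this toy case the exact digests differ while the fingerprints agree. On real images a small nonzero Hamming distance is more typical, so a threshold would have to be chosen and validated against the corpus, and near-duplicates that are not true duplicates could be merged by mistake.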