Hi @nixmaldonado! I'm Dosu and I'm helping the Docling team.

Based on my research into the Docling codebase:

1. Determinism of the CV pipeline: Image extraction is not guaranteed to be deterministic. Docling uses pypdfium2's render() method to rasterize page regions and then crop/resize them. The code renders at 1.5x scale and then resizes, which involves floating-point operations. There's no explicit documentation guaranteeing reproducibility across runs or environments.

2. Raw stream extraction: Unfortunately, Docling does not currently support direct extraction of embedded image bytes from PDF streams. The current implementation detects image objects using pdfium_c.FPDF_PAGEOBJ_IMAGE
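
On point 1: since reproducibility isn't documented, one way to check it empirically in your own environment is to run the extraction several times and compare hashes of the resulting image bytes. Here is a minimal sketch of that idea. Note that `extract_image_bytes` is a hypothetical callable you would write yourself to wrap whatever extraction path you use and return the encoded image bytes; it is not a Docling or pypdfium2 API.

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_reproducible(extract_image_bytes: Callable[[], bytes], runs: int = 3) -> bool:
    """Call the (hypothetical) extractor `runs` times and report whether
    every call produced byte-identical output."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    # A set collapses identical digests, so one element means all runs matched.
    digests = {sha256_hex(extract_image_bytes()) for _ in range(runs)}
    return len(digests) == 1


if __name__ == "__main__":
    # Stand-in extractor for illustration only; replace with your own wrapper.
    print(is_reproducible(lambda: b"same bytes every time"))
```

This only tells you whether repeated runs agree on one machine. To compare across environments, you would record the digests from each one and diff them.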

Answer selected by nixmaldonado