fix: pdf extraction without images still includes extracted images when output_format is markdown #796

@steffen-baumann-procureai

Description

The extraction result includes extracted images (base64) from a PDF even though the configuration explicitly sets:

"pdf_options": {
  "extract_images": false
}

This seems to occur only when output_format is set to either markdown or djot. Setting plain or structured works as expected: no image is extracted (no base64 included).

Expectation: Regardless of the output_format setting, the result should not include extracted images when image extraction is explicitly disabled. This matters all the more because, per the docs, output_format doesn't define the result format but only the text format within the content:

output_format - Controls the text format within the content:

This ultimately leads to longer response times and, more importantly, larger responses. A production sample was a 1.6 MB PDF containing 3 scanned pages. The result was over 100 MB, whereas the plain text extraction (without the images) was just 10 KB.
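The expectation above can be written down as a small check. This is only a sketch: `extract` is a hypothetical stand-in for however you call Kreuzberg (it takes a config dict and returns the parsed JSON response as a dict), and the top-level `images` key is assumed from the response shown in this report.

```python
from typing import Any, Callable, Dict, List

# Output formats named in this report.
OUTPUT_FORMATS = ["plain", "structured", "markdown", "djot"]


def formats_leaking_images(
    extract: Callable[[Dict[str, Any]], Dict[str, Any]],
    base_config: Dict[str, Any],
) -> List[str]:
    """Return the output formats whose result still contains images.

    `extract` is a hypothetical callable (not a Kreuzberg API): it receives
    a config dict and returns the parsed response as a dict.
    """
    leaking = []
    for fmt in OUTPUT_FORMATS:
        # Same config each time; only the root-level output_format changes.
        config = {**base_config, "output_format": fmt}
        result = extract(config)
        # A missing, null, or empty "images" entry counts as "no images".
        if result.get("images"):
            leaking.append(fmt)
    return leaking
```

With image extraction disabled in `base_config`, the expected return value is an empty list. Based on the behaviour described here, v4.8.5 would presumably yield `["markdown", "djot"]`.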

Steps to reproduce

  1. Use this baseline configuration:
{
  "include_document_structure": true,
  "pdf_options": {
    "extract_images": false,
    "extract_metadata": true,
    "hierarchy": {
      "enable": true,
      "k_clusters": 6,
      "include_bbox": true,
      "ocr_coverage_threshold": 0.2
    }
  },
  "postprocessor": {
    "enabled_processors": [
      "whitespace_normalization",
      "mojibake_fix",
      "quality_scoring"
    ]
  },
  "ocr": {
    "backend": "tesseract",
    "tesseract_config": {
      "psm": 11,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "language": "deu",
      "output_format": "markdown"
    }
  },
  "images": {
    "extract_images": false,
    "target_dpi": 200,
    "max_image_dimension": 20000
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": true
  },
  "token_reduction": {
    "mode": "off"
  },
  "enable_quality_processing": true,
  "force_ocr": false,
  "result_format": "unified"
}
  2. Extract from the following test file (see attached TestOCR.pdf). This uses "output_format": "plain", as that is the default.
  3. The output is as expected and doesn't include any images (see response_plain.json).
  4. Explicitly add "output_format": "markdown" as a root property.
  5. Extract the same test file again.
  6. The output contains "images": [{ "data": "..."}] --> BUG (see response_markdown.json)
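To compare the two saved responses without opening them by hand, a small helper like the following may be useful. It is a sketch that assumes the response is a JSON object with a top-level "images" list whose entries carry base64 text in "data", as in the last step above. The function names are illustrative and not part of Kreuzberg.

```python
import json
from typing import Any, Dict


def image_payload_summary(response: Dict[str, Any]) -> Dict[str, int]:
    """Count images and the size of their base64 payload in a parsed response."""
    images = response.get("images") or []  # tolerate a missing or null key
    b64_chars = sum(
        len(img.get("data") or "") for img in images if isinstance(img, dict)
    )
    return {
        "image_count": len(images),
        "base64_chars": b64_chars,
        # base64 encodes 3 bytes as 4 characters, so this is an upper estimate.
        "approx_decoded_bytes": b64_chars * 3 // 4,
    }


def summarize_file(path: str) -> Dict[str, int]:
    """Load a saved response (e.g. response_markdown.json) and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return image_payload_summary(json.load(fh))
```

Running `summarize_file` on response_plain.json and response_markdown.json should show an image count of zero for the former and a non-zero count for the latter if the bug is present.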

Relevant files and configuration

  • verified on Kreuzberg v4.8.5 only so far
    • searched for already-fixed defects in this area but couldn't find a related one

TestOCR.pdf
response_plain.json
response_markdown.json

Metadata

Labels: bug (Something isn't working)
Status: In Review
Milestone: No milestone