Description
Extraction result includes extracted images (base64) from a PDF even though the configuration explicitly sets:
"pdf_options": {
"extract_images": false
}
This seems to occur only when the output_format is set to either markdown or djot. Setting plain or structured works as expected: no image is extracted (no base64 included).
Expectation: Independent of the output_format setting, the result should not include extracted images when extraction is explicitly disabled, especially as the output_format doesn't define the result format but just what's included in the content, per the docs:
output_format - Controls the text format within the content:
This leads to longer response times and, more importantly, larger responses. In one production sample, a 1.6MB PDF containing 3 scanned pages produced a result of over 100MB, whereas the plain text extraction (so without the images) was just 10KB.
Steps to reproduce
- Use this baseline configuration:
{
"include_document_structure": true,
"pdf_options": {
"extract_images": false,
"extract_metadata": true,
"hierarchy": {
"enable": true,
"k_clusters": 6,
"include_bbox": true,
"ocr_coverage_threshold": 0.2
}
},
"postprocessor": {
"enabled_processors": [
"whitespace_normalization",
"mojibake_fix",
"quality_scoring"
]
},
"ocr": {
"backend": "tesseract",
"tesseract_config": {
"psm": 11,
"oem": 1,
"min_confidence": 0.8,
"enable_table_detection": true,
"language": "deu",
"output_format": "markdown"
}
},
"images": {
"extract_images": false,
"target_dpi": 200,
"max_image_dimension": 20000
},
"language_detection": {
"enabled": true,
"min_confidence": 0.8,
"detect_multiple": true
},
"token_reduction": {
"mode": "off"
},
"enable_quality_processing": true,
"force_ocr": false,
"result_format": "unified"
}
- Extract from the attached test file (TestOCR.pdf). This run uses "output_format": "plain", as that is the default.
- The output is as expected and doesn't include any images (see response_plain.json).
- Explicitly add "output_format": "markdown" as a root property.
- Extract the same test file again.
- The output contains "images": [{ "data": "..."}] --> BUG (see response_markdown.json).
Relevant files and configuration
- verified on Kreuzberg v4.8.5 only so far
- scanned for fixed defects in that area but couldn't find a related one
TestOCR.pdf
response_plain.json
response_markdown.json
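To compare the two attached responses quickly, a small check along these lines could be used. It is only a sketch: it assumes the saved response is a JSON object with a top-level "images" list whose entries carry the base64 payload in a "data" field, as in the snippet from the reproduction steps. The helper names embedded_image_count and check_response are made up for this example and are not part of Kreuzberg.

```python
import json
from typing import Any


def embedded_image_count(result: dict[str, Any]) -> int:
    """Count entries in the top-level "images" list that have a non-empty "data" field.

    Assumption: the response shape matches the snippet in the report,
    i.e. {"images": [{"data": "<base64>"}], ...}.
    """
    images = result.get("images") or []
    return sum(1 for image in images if isinstance(image, dict) and image.get("data"))


def check_response(path: str) -> int:
    """Load a saved JSON response from disk and count its embedded images."""
    with open(path, encoding="utf-8") as fh:
        return embedded_image_count(json.load(fh))
```

If the responses have the shape assumed above, check_response("response_plain.json") should return 0 and check_response("response_markdown.json") should return a non-zero count while the bug is present.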