fix: pdf extraction without images still includes extracted images when output_format is markdown

### Description

Extraction result includes extracted images (base64) from a PDF even though the configuration explicitly sets
```
"pdf_options": {
  "extract_images": false
}
```

This seems to occur only when `output_format` is set to `markdown` or `djot`. Setting `plain` or `structured` works as expected: no images are extracted (no base64 data is included).

Expectation: Regardless of the `output_format` setting, the result should not include extracted images when image extraction is explicitly disabled, especially since `output_format` doesn't define the result format but only what's included in the content, per the [docs](https://docs.kreuzberg.dev/reference/configuration/?h=output#result-format-vs-output-format):

> output_format - Controls the **text format** within the content:

This leads to longer response times and, more importantly, much larger responses. A production sample was a 1.6 MB PDF containing 3 scanned pages. The result was over 100 MB, whereas the plain text extraction (without the `images`) was just 10 KB.

### Steps to reproduce

1. Use this baseline configuration: 
```
{
  "include_document_structure": true,
  "pdf_options": {
    "extract_images": false,
    "extract_metadata": true,
    "hierarchy": {
      "enable": true,
      "k_clusters": 6,
      "include_bbox": true,
      "ocr_coverage_threshold": 0.2
    }
  },
  "postprocessor": {
    "enabled_processors": [
      "whitespace_normalization",
      "mojibake_fix",
      "quality_scoring"
    ]
  },
  "ocr": {
    "backend": "tesseract",
    "tesseract_config": {
      "psm": 11,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "language": "deu",
      "output_format": "markdown"
    }
  },
  "images": {
    "extract_images": false,
    "target_dpi": 200,
    "max_image_dimension": 20000
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": true
  },
  "token_reduction": {
    "mode": "off"
  },
  "enable_quality_processing": true,
  "force_ocr": false,
  "result_format": "unified"
}
```

2. Extract from the attached test file (**TestOCR.pdf**). This run uses `"output_format": "plain"`, as that is the default.
3. The output is as expected and doesn't include any `images` (see **response_plain.json**)
4. Explicitly add `"output_format": "markdown"` as a root property.
5. Extract the same test file again
6. The output contains `"images": [{ "data": "..."}]` --> BUG (see **response_markdown.json**)
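
To compare the two attached responses without scrolling through the base64 data, a small check like the following may help. It is only a sketch: it assumes nothing about the response layout beyond the `"images": [{ "data": "..." }]` shape shown in step 6, and simply walks the parsed JSON looking for that shape at any depth. The function names `count_image_payloads` and `count_in_file` are made up for this illustration and are not part of Kreuzberg.

```python
import json
from typing import Any


def count_image_payloads(node: Any) -> int:
    """Count entries under any "images" key that carry a non-empty "data" field.

    Hypothetical helper: walks arbitrary parsed JSON, so it does not depend
    on where exactly the response nests its "images" array.
    """
    if isinstance(node, dict):
        total = 0
        for key, value in node.items():
            if key == "images" and isinstance(value, list):
                # An image counts only if it actually carries payload data.
                total += sum(
                    1
                    for item in value
                    if isinstance(item, dict) and item.get("data")
                )
            # Keep descending in case results are nested further down.
            total += count_image_payloads(value)
        return total
    if isinstance(node, list):
        return sum(count_image_payloads(item) for item in node)
    return 0


def count_in_file(path: str) -> int:
    """Load a saved JSON response from disk and count its image payloads."""
    with open(path, encoding="utf-8") as fh:
        return count_image_payloads(json.load(fh))
```

With `extract_images` disabled, the expectation is that `count_in_file("response_plain.json")` and `count_in_file("response_markdown.json")` both return `0`; a non-zero count for the markdown response would match the behaviour reported here.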

### Relevant files and configuration

* Verified on Kreuzberg [v4.8.5](https://github.com/kreuzberg-dev/kreuzberg/releases/tag/v4.8.5) only so far
  * Searched for already-fixed defects in that area but couldn't find a related one

[TestOCR.pdf](https://github.com/user-attachments/files/27015459/TestOCR.pdf)
[response_plain.json](https://github.com/user-attachments/files/27015605/response_plain.json)
[response_markdown.json](https://github.com/user-attachments/files/27015610/response_markdown.json)
