Backend Options and Fixing Changed Letter #2697

egesnr · 2025-11-27T13:33:43Z

egesnr
Nov 27, 2025

Hello,
I've been trying to extract Turkish text from PDFs, but in some PDFs certain Turkish letters are converted incorrectly.
For example:
Original Text: Satışların Maliyeti
Converted Text: Sat õ ş lar õ n Maliyeti
The code I used:
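from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # disable OCR

# Convert the document (temp_path is the path to the input PDF, defined elsewhere)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            # Uncommenting the next line fixes the Turkish letters:
            #backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        ),
    },
)
result = converter.convert(temp_path)
docling_doc = result.document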

When I enable backend=PyPdfiumDocumentBackend the error goes away, but I don't understand why; I found the fix completely by chance. What causes the error, and what is the proper solution?
Note: I assume my code doesn't use any OCR, since I believe I disabled it. If that's not the case, please tell me how to avoid using OCR.

@egesnr · 2025-11-27T13:37:48Z

dosubot[bot]
Bot Nov 27, 2025

Hi @egesnr! I'm Dosu and I’m helping the docling team.

The issue with Turkish letters being incorrectly extracted (like "Satışların" becoming "Sat õ ş lar õ n") is due to how the default PDF backend in Docling handles character encoding and font mapping. The default backend sometimes misinterprets special or accented characters, leading to these errors—not because of OCR, but because of limitations in its text extraction logic [source].

When you switch to PyPdfiumDocumentBackend, Docling uses the pypdfium2 library, which is much more robust at extracting text directly from the PDF's embedded text layer and correctly handles Turkish and other non-ASCII characters [source]. This is why enabling this backend fixes your problem—it's not random, it's because this backend is better at decoding the PDF's internal font encoding.
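
The "õ" appearing in place of "ı" is consistent with one specific kind of font-encoding mismatch. The sketch below is only an illustration of that idea, and the mechanism is an assumption: nothing in this thread confirms that it is what happens in this PDF or inside Docling's default backend. In the Mac Roman encoding the byte 0xF5 is the dotless "ı", while in Windows-1252 the same byte is "õ". Using only Python's standard codecs:

```python
# Illustration only: assumes the PDF font stores glyph codes in a
# Mac Roman-style encoding and the extractor reads them as Windows-1252.
original = "ı"                        # Turkish dotless i, U+0131

raw = original.encode("mac_roman")    # the single byte stored for the glyph
misread = raw.decode("cp1252")        # same byte, interpreted with the wrong code page

print(raw, misread)
```

This only mirrors the "ı" to "õ" substitution. It does not account for the extra spaces around "ş" in the converted text, which would have to come from a separate step such as how text fragments are merged.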

Setting pipeline_options.do_ocr = False is enough to disable OCR. With PyPdfiumDocumentBackend, OCR is never performed, regardless of the do_ocr flag—text is always extracted directly from the PDF's text layer [source]. If your PDF has no embedded text layer and OCR is disabled, Docling will not extract any text (the output will be empty) [source].

To reliably extract Turkish letters from PDFs without OCR, always use PyPdfiumDocumentBackend as your backend in the DocumentConverter configuration. Seeing OCR-related log messages does not mean OCR is running—they just indicate that OCR engines are registered [source].

If you ever encounter PDFs with no selectable text (scanned images), you’ll need to enable OCR, but for digital PDFs with embedded text, your current approach is correct.

To reply, just mention @dosu.


4 replies

egesnr Dec 8, 2025
Author

@dosu PyPdfiumDocumentBackend performs badly at detecting columns in some PDFs. It merges different columns into one column, or otherwise gets the layout wrong. What should I use instead?

dosubot[bot] Bot Dec 8, 2025

PyPdfiumDocumentBackend mixes columns because it uses simple horizontal merging heuristics, which often fail for complex multi-column layouts—so text from different columns can get merged together see source.

For better column detection, try switching to DoclingParseV4DocumentBackend (the default) or DoclingParseV2DocumentBackend. These backends use more advanced segmentation and layout analysis, which can separate columns more accurately see source. You can select them like this:

from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = False  # Try True/False for best results

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend
        ),
    },
)
result = converter.convert("your_file.pdf")

You can also tune column and table detection using these options:

do_table_structure=True enables table extraction.
table_structure_options.do_cell_matching (True/False): toggling this can help with column separation.
layout_options.model_spec: try different layout models like DOCLING_LAYOUT_HERON or DOCLING_LAYOUT_EGRET_MEDIUM for improved layout analysis see source.

If none of the backends work well for your PDFs, enabling OCR (do_ocr=True) and table structure extraction may help, but this is less reliable for digital PDFs with embedded text see source.

Let me know if you want example configs for v2 or want to try layout model tuning!


egesnr Dec 8, 2025
Author

@dosu it handles those cases better, but it also converts the word "Hasılat" to "Has õ lat" in one PDF, which I had fixed by using PyPdfiumDocumentBackend. What should I do now? Each solution fixes only part of my problem, not all of it.

dosubot[bot] Bot Dec 8, 2025

This is a known limitation: PyPdfiumDocumentBackend is best for Turkish character extraction but weak at column detection, while DoclingParseV4DocumentBackend is better at columns but can mis-extract Turkish letters due to font encoding issues and merging logic. Docling does not support combining both backends in a single run—only one backend can be used per conversion, and there’s no built-in hybrid workflow or fusion switch for merging their outputs automatically (source).

Practical workarounds:

Run both backends separately on the same PDF, then merge the results externally (for example, use PyPdfium for text, DoclingParseV4 for layout/columns).
Post-process the extracted text to fix unwanted spaces or splits in Turkish words—this is often needed because the merging logic is spatial, not language-aware (source).
Tune DoclingParseV4DocumentBackend options: try different layout_options.model_spec values (like DOCLING_LAYOUT_HERON or DOCLING_LAYOUT_EGRET_MEDIUM), and experiment with do_table_structure and do_cell_matching to improve column detection.

If you want, I can help with a sample script to post-process text or outline a workflow for merging results from both backends. Let me know if you want to try that or if you can share sample outputs!
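
A post-processing pass along the lines of the second workaround could look like the sketch below. It is a heuristic built only from the two examples in this thread ("Sat õ ş lar õ n" and "Has õ lat"), not a Docling feature: the function name `repair_turkish` and the `STRAY_LETTERS` set are hypothetical, and it assumes that "õ" never legitimately occurs in the documents and that a Turkish-specific letter standing alone between spaces is always a split fragment.

```python
# Hypothetical helper: repairs the specific damage seen in this thread.
# Assumption: a lone Turkish-specific letter between spaces is a split fragment.
STRAY_LETTERS = set("ışğİŞĞ")


def repair_turkish(text: str) -> str:
    # Step 1: map the mis-decoded character back to dotless i.
    text = text.replace("õ", "ı")

    # Step 2: glue lone stray letters to the tokens on both sides.
    merged = []
    glue_next = False  # True when the previous token was a stray letter
    for token in text.split(" "):
        is_stray = len(token) == 1 and token in STRAY_LETTERS
        if merged and (is_stray or glue_next):
            merged[-1] += token
        else:
            merged.append(token)
        glue_next = is_stray
    return " ".join(merged)


print(repair_turkish("Sat õ ş lar õ n Maliyeti"))
print(repair_turkish("Has õ lat"))
```

Because the rule is purely positional, it would also join a stray letter to a neighbouring word that happens to be a separate word, and the blanket "õ" replacement would damage text in languages that use that letter, so it is worth checking the output against a few known pages before relying on it.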

