Hi @egesnr! I'm Dosu and I'm helping the docling team.

The issue with Turkish letters being incorrectly extracted (like "Satışların" becoming "Sat õ ş lar õ n") is due to how the default PDF backend in Docling handles character encoding and font mapping. The default backend sometimes misinterprets special or accented characters, which leads to these errors. The cause is not OCR but limitations in the backend's text extraction logic [source].

When you switch to […] Setting […] To reliably extract Turkish letters from PDFs without OCR, always use […]

If you ever encounter PDFs with no selectable text (scanned images), you'll need to enable OCR, but for digital PDFs with embedded text, your current approach is correct.
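For reference, the backend switch discussed in this thread could be wired up roughly as below. This is a minimal sketch, not a confirmed recipe from the docling team: the module paths are assumed from docling's documented package layout and may differ between versions, `build_converter` is a hypothetical helper name, and `report.pdf` is a placeholder path.

```python
def build_converter():
    """Return a DocumentConverter that reads PDF text via pypdfium2, with OCR off."""
    # Imported lazily so this module can be loaded without docling installed.
    # Module paths are an assumption based on docling's documented layout.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False  # rely on the embedded text layer only

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )


if __name__ == "__main__":
    # "report.pdf" is a placeholder path, not a file from this thread.
    result = build_converter().convert("report.pdf")
    print(result.document.export_to_markdown())
```

Passing the backend class through the PDF format option keeps the OCR setting and the backend choice independent, so either can be changed without touching the other.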
Hello,
I've been trying to extract Turkish letters from PDFs, but in some of the PDFs certain Turkish letters are converted incorrectly.
For example:
Original Text: Satışların Maliyeti
Converted Text: Sat õ ş lar õ n Maliyeti
The code I used:
from docling.datamodel.pipeline_options import PdfPipelineOptions
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # disable OCR
When I enable backend=PyPdfiumDocumentBackend, the error goes away, but I don't understand why; I found the fix by chance. What is the cause of the error, and what is the proper solution?
Note: I assume my code doesn't use any OCR, because I believe I disabled it. If that is not the case, please tell me how to avoid using OCR.
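One way to spot this kind of garbling automatically is to look for letters that do not belong to the Turkish alphabet, such as the "õ" in the example above. The sketch below is only a heuristic under stated assumptions: `unexpected_letters` is a hypothetical helper, and the allowed set (the Turkish alphabet plus q, w, x and the circumflex vowels) is an assumption that would need widening for documents containing other languages.

```python
# Letters treated as normal in Turkish text. Upper and lower case are listed
# explicitly because str.lower() turns "İ" into two code points.
ALLOWED_LETTERS = set(
    "abcçdefgğhıijklmnoöprsştuüvyzqwxâîû"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZQWXÂÎÛ"
)


def unexpected_letters(text):
    """Return the sorted list of alphabetic characters outside ALLOWED_LETTERS."""
    found = {ch for ch in text if ch.isalpha() and ch not in ALLOWED_LETTERS}
    return sorted(found)


if __name__ == "__main__":
    print(unexpected_letters("Sat õ ş lar õ n Maliyeti"))
    print(unexpected_letters("Satışların Maliyeti"))
```

A non-empty result on extracted text is a hint that the extraction mangled the font mapping, and that trying a different backend may be worthwhile.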