Replies: 1 comment
Hi @ddddrill! I'm Dosu and I'm helping the docling team.

To disable OCR in the simplest Docling PDF pipeline, set do_ocr=False in PdfPipelineOptions and use PyPdfiumDocumentBackend as your backend, just like in your code. With this backend, text is extracted directly from the PDF's text layer, and the do_ocr flag is ignored by PyPdfiumDocumentBackend, so no OCR will run regardless of its value. If your PDF has no embedded text layer, the output will be empty, with no fallback to OCR (reference).

If you still see OCR-related log messages, that's normal: they only indicate that OCR engines are registered, not that OCR is actually running (reference). If you want to avoid model downloads or timeouts from other pipeline steps (like layout or table models), you may need to disable those features or provide local model paths (reference).

So, your approach is correct: with PyPdfiumDocumentBackend and do_ocr=False, OCR is fully disabled.
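For illustration, here is a minimal sketch of the configuration described in this reply. It is not taken from the docling docs verbatim: the names NO_OCR_SETTINGS, build_no_ocr_converter and pdf_to_markdown are made up for this example, and switching off do_table_structure is an assumption added here to avoid loading the table model, not something the original code does.

```python
from pathlib import Path

# Flags passed to PdfPipelineOptions. do_ocr=False is the switch discussed
# above; do_table_structure=False is an assumption of this sketch, added
# only to avoid loading the table model.
NO_OCR_SETTINGS = {"do_ocr": False, "do_table_structure": False}


def build_no_ocr_converter():
    """Build a DocumentConverter whose PDF pipeline has OCR switched off."""
    # Imported lazily so this module loads even if docling is not installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(**NO_OCR_SETTINGS)
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown using only its embedded text layer."""
    path = Path(pdf_path)
    if not path.exists():
        # Checked before docling is imported, so a bad path fails fast.
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    result = build_no_ocr_converter().convert(path)
    return result.document.export_to_markdown()
```

Calling pdf_to_markdown("some.pdf") on a scanned PDF with no text layer would be expected to return little or no text, since nothing falls back to OCR.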
I can't disable OCR in the simplest pipeline
#!/usr/bin/env python3
import logging
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
logging.basicConfig(level=logging.INFO)
def extract_text_from_pdf_no_ocr(pdf_path: str):
    if not Path(pdf_path).exists():
        raise FileNotFoundError(f" {pdf_path}")