How to disable OCR in simple pipeline #2506

ddddrill · 2025-10-22T08:48:09Z

ddddrill
Oct 22, 2025

I can't disable OCR in the simplest pipeline

#!/usr/bin/env python3
import logging
from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

logging.basicConfig(level=logging.INFO)


def extract_text_from_pdf_no_ocr(pdf_path: str):
    if not Path(pdf_path).exists():
        raise FileNotFoundError(f" {pdf_path}")

    try:
        # Turn off OCR and the optional enrichment steps.
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = False
        pipeline_options.do_table_structure = False
        pipeline_options.do_code_enrichment = False

        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                    backend=PyPdfiumDocumentBackend,  # <-- Key change!
                )
            }
        )

        # Assumed completion: the posted snippet breaks off after the converter is built.
        result = converter.convert(pdf_path)
        return result.document.export_to_markdown()
    except Exception:
        logging.exception("Conversion failed for %s", pdf_path)
        raise

@ddddrill · 2025-10-22T08:51:37Z

dosubot[bot]
Bot Oct 22, 2025

Hi @ddddrill! I'm Dosu and I’m helping the docling team.

To disable OCR in the simplest Docling PDF pipeline, set do_ocr=False in PdfPipelineOptions, just like in your code. That flag is what switches off the OCR stage of the pipeline. PyPdfiumDocumentBackend only determines how the PDF's embedded text layer is read; it does not run OCR itself. With do_ocr=False, text is extracted directly from the text layer, and if your PDF has no embedded text layer, the output will be empty (there is no fallback to OCR).

If you still see OCR-related log messages, that's normal: they only indicate that OCR engines are registered, not that OCR is actually running. If you want to avoid model downloads or timeouts from other pipeline steps (like the layout or table models), you may need to disable those features or provide local model paths.

So, your approach is correct: with PyPdfiumDocumentBackend and do_ocr=False, OCR is fully disabled.
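
Since there is no OCR fallback, it can help to check whether the conversion actually produced any text. Below is a minimal sketch of that check, not an official Docling recipe. The helper names looks_empty and convert_without_ocr, and the min_chars threshold, are hypothetical and introduced only for illustration. The Docling calls are the same ones used in the question, plus the artifacts_path option for pointing at locally stored models.

```python
from pathlib import Path
from typing import Optional


def looks_empty(text: str, min_chars: int = 1) -> bool:
    """Return True when text has fewer than min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) < min_chars


def convert_without_ocr(pdf_path: str, artifacts_path: Optional[str] = None) -> str:
    """Convert a PDF to Markdown with the OCR stage switched off (hypothetical helper)."""
    if not Path(pdf_path).exists():
        raise FileNotFoundError(pdf_path)

    # Imported lazily so looks_empty() stays usable without docling installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = False              # no OCR stage
    options.do_table_structure = False  # skip the table structure model
    if artifacts_path is not None:
        # Assumption: models were downloaded to this directory beforehand.
        options.artifacts_path = artifacts_path

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()
```

A caller would run text = convert_without_ocr("input.pdf") and treat looks_empty(text) being true as a hint that the file is a scan with no text layer. The Markdown export may still contain placeholders for detected images, so raising min_chars above 1 is probably safer than testing for a strictly empty string.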

